Multithreaded Unsupervised Clustering
3DRobotics | 2015-01-07
http://synsem.com/3DRobotics


Jace A Mogill
jace@mogill.com
 
 

Outline

Automated, Unguided, Unsupervised Clustering
Performed Using Shared Memory Multithreaded Parallelism

▶ What Kind of Clustering?
▶ Who Cares?
▶ The Challenges
▶ Anchors Hierarchy
    ▶ Clustering by Sorting
    ▶ Radix Sort & Parallel Counting Sort
    ▶ Reduction via Parallel Prefix
▶ Voronoi Diagrams

What is Unsupervised Clustering?

Categorization of arbitrary things by likeness

Clustering techniques are generic --
searching for hidden structure without a priori knowledge

Other than the “distance” or “similarity” function, domain knowledge isn't involved. Clusters will vary with different distance functions.

Clustering methods differ by how groups are chosen and what an object is compared to.

Who Cares About Unsupervised Clustering?


Enables categorization without knowing what you're looking for


▶ Machine Learning
▶ Market Segmentation
▶ Sentiment Analysis
▶ Allegiance Switching
▶ Regression Analysis
▶ Association Rules
▶ Structured Prediction
▶ Feature Learning
▶ Dimensionality Reduction
▶ Anomaly Detection
▶ Recommendation Engines
▶ Computer Vision
▶ Astronomy
▶ Agriculture

Challenges of Algorithmic Clustering

Comparisons occur in such great quantity that even a simple comparison function can dominate execution time.

Optimization means reducing the number of comparisons performed


▶ Brute-force all-to-all comparisons are $$O(n^2)$$
▶ Instead, one comparison per cluster is typically performed, bringing the cost closer to $$O(n \log n)$$ (see the sketch below)
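For scale, a minimal C++ sketch of the contrast, assuming 2-D points and Euclidean distance (both illustrative choices, not from the slides): the all-pairs loop performs $$O(n^2)$$ distance evaluations, while comparing each point only against one representative per cluster is the kind of reduction the rest of the talk works toward.

```cpp
// Minimal sketch of the comparison-count problem; 2-D points and Euclidean
// distance are illustrative assumptions.
#include <cmath>
#include <cstdio>
#include <vector>

struct Point { double x, y; };

double dist(const Point &a, const Point &b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

int main() {
    std::vector<Point> pts = {{0, 0}, {1, 0}, {0, 1}, {5, 5}, {6, 5}};

    // Brute force: compare every pair -- O(n^2) distance evaluations.
    long bruteForce = 0;
    for (size_t i = 0; i < pts.size(); ++i)
        for (size_t j = i + 1; j < pts.size(); ++j) {
            dist(pts[i], pts[j]);
            ++bruteForce;
        }

    // With representatives: compare each point only to one pivot ("anchor")
    // per cluster -- O(n * clusters) evaluations.
    std::vector<Point> anchors = {{0, 0}, {5, 5}};
    long withAnchors = 0;
    for (const Point &p : pts)
        for (const Point &a : anchors) {
            dist(p, a);
            ++withAnchors;
        }

    std::printf("all-pairs: %ld comparisons, per-anchor: %ld comparisons\n",
                bruteForce, withAnchors);
    return 0;
}
```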

Anchors Hierarchy

Anchors Hierarchy by Sorting

Sort by distance, then by anchor
Ordered storage permits fast binary searches to find half-planes
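A minimal sketch of the idea, assuming each point already carries its owning anchor's id and its distance to that anchor (the struct and field names are illustrative, not from the talk): sorting by (anchor, distance) makes each anchor's members a contiguous, distance-ordered run, so a radius query becomes a binary search.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Member {
    int    anchor;  // id of the owning anchor
    double dist;    // distance from this point to that anchor
    int    point;   // id of the point itself
};

int main() {
    std::vector<Member> members = {
        {1, 0.9, 10}, {0, 0.2, 11}, {1, 0.1, 12}, {0, 0.7, 13}, {0, 0.4, 14}};

    // Order by anchor, then by distance: each anchor's members become a
    // contiguous, distance-sorted run.
    auto byAnchorThenDist = [](const Member &a, const Member &b) {
        return a.anchor != b.anchor ? a.anchor < b.anchor : a.dist < b.dist;
    };
    std::sort(members.begin(), members.end(), byAnchorThenDist);

    // Binary search inside anchor 0's run: which members lie within radius r?
    // Members beyond r are the candidates on the far side of the half-plane
    // toward another anchor.
    const double r = 0.5;
    auto lo = std::lower_bound(members.begin(), members.end(),
                               Member{0, 0.0, -1}, byAnchorThenDist);
    auto hi = std::lower_bound(lo, members.end(),
                               Member{0, r, -1}, byAnchorThenDist);
    std::printf("anchor 0 has %td members within r=%.1f\n", hi - lo, r);
    return 0;
}
```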

Radix Sort

Sort Over Multiple Phases

▶ From least significant to most significant digits
▶ Serial execution of phases, but each sort phase may execute in parallel
▶ Phase sort may use any stable sorting technique
▶ Phases allow comparison of keys of arbitrary length
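A minimal serial sketch of LSD radix sort on 32-bit keys, one byte per phase, with a stable counting pass inside each phase; in the multithreaded setting it is this per-phase work that would be distributed across threads.

```cpp
// LSD radix sort sketch for 32-bit unsigned keys, one byte per phase.
#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

void radixSort(std::vector<uint32_t> &keys) {
    std::vector<uint32_t> scratch(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {           // 4 phases, LSB first
        std::array<size_t, 257> count{};                     // histogram + offsets
        for (uint32_t k : keys) ++count[((k >> shift) & 0xFF) + 1];
        for (int d = 0; d < 256; ++d) count[d + 1] += count[d];  // prefix sum
        for (uint32_t k : keys)                              // stable scatter
            scratch[count[(k >> shift) & 0xFF]++] = k;
        keys.swap(scratch);
    }
}

int main() {
    std::vector<uint32_t> keys = {90210, 7, 123456789, 42, 7, 65536};
    radixSort(keys);
    for (uint32_t k : keys) std::printf("%u ", k);
    std::printf("\n");
    return 0;
}
```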

Counting Sort

Synthesize a sorted list from a histogram

$$O(n+k)$$: excellent performance when the number of distinct key values $$k$$ is small

Challenges with parallelization:
▶ Hot-spot on counts

▶ Fine-grained synchronization needed for parallel prefix sum (offsets)

▶ Load imbalance on output fan-out
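One common way to attack the count hot-spot is to give each thread a private histogram and combine the counts afterwards. The sketch below illustrates that pattern with std::thread; the thread count, key range, and serial offset computation are simplifying assumptions, not details from the talk.

```cpp
// Parallel counting sort sketch: private per-thread histograms avoid atomics
// on the counts; a serial pass turns the counts into per-thread offsets.
#include <cstdio>
#include <thread>
#include <vector>

void parallelCountingSort(std::vector<int> &keys, int numKeys, int numThreads) {
    const size_t n = keys.size();
    std::vector<std::vector<size_t>> hist(numThreads, std::vector<size_t>(numKeys, 0));

    auto countRange = [&](int t) {                      // count phase, no sharing
        size_t lo = n * t / numThreads, hi = n * (t + 1) / numThreads;
        for (size_t i = lo; i < hi; ++i) ++hist[t][keys[i]];
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t) pool.emplace_back(countRange, t);
    for (auto &th : pool) th.join();

    // Prefix sum over (key, thread) gives each thread private output offsets;
    // this is the step the talk computes with parallel prefix.
    std::vector<std::vector<size_t>> offset(numThreads, std::vector<size_t>(numKeys));
    size_t running = 0;
    for (int k = 0; k < numKeys; ++k)
        for (int t = 0; t < numThreads; ++t) {
            offset[t][k] = running;
            running += hist[t][k];
        }

    std::vector<int> out(n);                            // fan-out to disjoint slots
    auto scatterRange = [&](int t) {
        size_t lo = n * t / numThreads, hi = n * (t + 1) / numThreads;
        for (size_t i = lo; i < hi; ++i) out[offset[t][keys[i]]++] = keys[i];
    };
    pool.clear();
    for (int t = 0; t < numThreads; ++t) pool.emplace_back(scatterRange, t);
    for (auto &th : pool) th.join();
    keys.swap(out);
}

int main() {
    std::vector<int> keys = {3, 1, 4, 1, 5, 2, 6, 5, 3, 5, 0, 2};
    parallelCountingSort(keys, /*numKeys=*/8, /*numThreads=*/4);
    for (int k : keys) std::printf("%d ", k);
    std::printf("\n");
    return 0;
}
```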

Parallel Prefix

Cyclic reduction performed by parallel prefix: $$y_i = y_{i-1} + x_i$$
$$O(\log n)$$ Strong Scaling for reductions

Challenges with parallelization:
▶ Fine-grained synchronization during reductions

▶ Load imbalance

▶ Bandwidth to memory using strided access
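A blocked two-pass scan is one common way to realize parallel prefix in shared memory: each thread scans its own block, a short serial scan combines the block totals, and a second pass applies the offsets. The sketch below shows that pattern; the thread count is illustrative and far smaller than on the machines discussed later.

```cpp
// Blocked parallel prefix (inclusive scan) sketch for y_i = y_{i-1} + x_i.
#include <cstdio>
#include <thread>
#include <vector>

void parallelInclusiveScan(std::vector<long> &x, int numThreads) {
    const size_t n = x.size();
    std::vector<long> blockSum(numThreads, 0);

    auto localScan = [&](int t) {                       // pass 1: local scans
        size_t lo = n * t / numThreads, hi = n * (t + 1) / numThreads;
        long sum = 0;
        for (size_t i = lo; i < hi; ++i) { sum += x[i]; x[i] = sum; }
        blockSum[t] = sum;
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t) pool.emplace_back(localScan, t);
    for (auto &th : pool) th.join();

    std::vector<long> offset(numThreads, 0);            // serial scan of totals
    for (int t = 1; t < numThreads; ++t) offset[t] = offset[t - 1] + blockSum[t - 1];

    auto addOffset = [&](int t) {                       // pass 2: apply offsets
        size_t lo = n * t / numThreads, hi = n * (t + 1) / numThreads;
        for (size_t i = lo; i < hi; ++i) x[i] += offset[t];
    };
    pool.clear();
    for (int t = 0; t < numThreads; ++t) pool.emplace_back(addOffset, t);
    for (auto &th : pool) th.join();
}

int main() {
    std::vector<long> x = {1, 2, 3, 4, 5, 6, 7, 8};
    parallelInclusiveScan(x, 4);
    for (long v : x) std::printf("%ld ", v);            // 1 3 6 10 15 21 28 36
    std::printf("\n");
    return 0;
}
```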

Contrary Notions of Strong & Weak Scaling

 
Strong Scaling
▶ Goal: Solve the same problem, only faster
▶ Problem size: Stays constant while the number of processors increases
▶ Scaling limited by: Inter-process communication
▶ Resiliency: A single failure causes the entire job to fail; SLAs achieved through efficient checkpoint-restart
▶ Programming models: MPI, GASNet, Chapel, X10, Co-Array Fortran, UPC

Weak Scaling
▶ Goal: Solve a bigger problem in the same amount of time
▶ Problem size: Grows with the number of processors
▶ Scaling limited by: Problem size
▶ Resiliency: Failed sub-tasks are detected and retried; SLAs achieved through fault resiliency
▶ Programming models: Batch jobs, Map-Reduce

Anchors Hierarchy by Radix Sort

Use Radix Sort for compound key
▶ Sort by distance to the needed precision
▶ Sort again by anchor
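A minimal sketch of the compound-key ordering, with std::stable_sort standing in for the radix phases to keep it short (each pass could be the per-digit counting pass shown earlier); the quantization step and names are illustrative assumptions.

```cpp
// Two stable passes, least significant key first: quantized distance, then
// anchor id.  The final order groups points by anchor, sorted by distance.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Member { int anchor; double dist; int point; };

int main() {
    std::vector<Member> m = {
        {1, 0.91, 10}, {0, 0.23, 11}, {1, 0.12, 12}, {0, 0.74, 13}, {0, 0.41, 14}};

    const double precision = 0.1;   // distances equal at this scale may tie
    auto quantize = [&](double d) { return (int)std::floor(d / precision); };

    std::stable_sort(m.begin(), m.end(), [&](const Member &a, const Member &b) {
        return quantize(a.dist) < quantize(b.dist);     // pass 1: distance
    });
    std::stable_sort(m.begin(), m.end(), [](const Member &a, const Member &b) {
        return a.anchor < b.anchor;                     // pass 2: anchor
    });

    for (const Member &x : m)
        std::printf("anchor %d  dist %.2f  point %d\n", x.anchor, x.dist, x.point);
    return 0;
}
```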

Nested Parallelism

Collapse nested parallelism to 1-D parallelism, which is easier to manage (sketched below)
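A minimal sketch of the flattening itself, with illustrative sizes: one flat index range replaces the nested (anchor x member) loops, and in the parallel version that single range is what gets divided among threads.

```cpp
// Collapse a 2-D (anchor x member) iteration space into one flat index range.
#include <cstdio>

int main() {
    const int numAnchors = 3, membersPerAnchor = 4;
    const int total = numAnchors * membersPerAnchor;

    // One flat loop replaces the nested pair of loops; threads would be
    // load balanced over this single range of work items.
    for (int idx = 0; idx < total; ++idx) {
        int anchor = idx / membersPerAnchor;
        int member = idx % membersPerAnchor;
        std::printf("work item %2d -> anchor %d, member %d\n", idx, anchor, member);
    }
    return 0;
}
```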

Hierarchy Formation

Pennant Tree
▶ Binary tree where one child is also the parent
▶ Two anchors which mutually form the smallest new anchor are merged
▶ Repeat until all anchors are merged
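A simplified, serial sketch of the merge loop: repeatedly join the pair of live anchors that merges most cheaply and record the parent links. Approximating the size of the merged anchor by the distance between the two anchor centers is an assumption made to keep the sketch short, not the talk's exact criterion.

```cpp
// Greedy merge sketch of hierarchy formation with parent links.
#include <cmath>
#include <cstdio>
#include <vector>

struct Anchor { double x, y; int parent = -1; bool live = true; };

int main() {
    std::vector<Anchor> a = {{0, 0}, {1, 0}, {5, 5}, {6, 5}, {2, 1}};
    int liveCount = (int)a.size();

    while (liveCount > 1) {
        // Find the pair of live anchors that merges most cheaply.
        int bi = -1, bj = -1;
        double best = 1e300;
        for (int i = 0; i < (int)a.size(); ++i) {
            if (!a[i].live) continue;
            for (int j = i + 1; j < (int)a.size(); ++j) {
                if (!a[j].live) continue;
                double d = std::hypot(a[i].x - a[j].x, a[i].y - a[j].y);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        }
        // Merge: create the parent anchor at the midpoint, retire the children.
        Anchor parent{(a[bi].x + a[bj].x) / 2, (a[bi].y + a[bj].y) / 2};
        a[bi].live = a[bj].live = false;
        a[bi].parent = a[bj].parent = (int)a.size();
        a.push_back(parent);
        std::printf("merged %d and %d into %zu (gap %.2f)\n", bi, bj, a.size() - 1, best);
        --liveCount;    // two children retired, one parent added
    }
    return 0;
}
```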

Wikipedia Document Clustering Workflow

Voronoi Diagrams and Delaunay Meshes

Performance Results

Cray XMT
▶ 128 hardware threads per processor
▶ Over 13,000 concurrent threads executing
▶ 512GB global shared flat memory
▶ 200MHz CPU clock, 100MHz memory, 12MIPS serial instruction rate

x86 Cluster
▶ 8 cores per processor
▶ 116 Processors (696 Threads)
▶ 1,856GB of distributed memory
▶ 3,000MHz CPU clock, 1,666MHz memory, 1,800MIPS serial instruction rate

Conclusions


Multithreading, Clustering, Sorting, Parallelism, Geometry

▶ Unsupervised Clustering is practical in many application domains
▶ The Anchors Hierarchy reduces the number of comparisons required
▶ Atomic Read-Modify-Write operations are critical for parallelism
▶ Adding work can make execution faster when the extra work enables parallelism
▶ Duality of clustering and computational geometry

Thanks!

http://synsem.com/3DRobotics
jace@mogill.com
 
