Clustering large datasets

On the MIT-internal "csail-related" mailing list, someone recently asked for software to help him perform matrix multiplications on 10^6 x 10^6 matrices.  Ron Rivest quite correctly replied that even a single multiplication at this size would probably need about 4 years of compute time -- because matrices this size contain a trillion entries each.  I posted the following reply, which I am re-posting here partly for my own reference, because it contains a lot of the points I have learned in various work clustering huge datasets, and partly in the hope that somebody else will find it useful.


Ron's analysis is correct, unless your matrix is very sparse -- in which case sparse matrix methods may make this problem entirely tractable, and most of the standard linear algebra toolkits implement efficient sparse matrix methods that you can use.

The main problem you'll have is fitting it all in memory -- you'll need to look into matrix blocking techniques (dividing the big matrix into submatrices, and figuring out the correct way to multiply the subblocks to get the full result). There's some great discussion about keeping subblocks in CPU cache in the following paper that may help you figure out how to keep your much larger subblocks in main memory as long as possible: http://homepages.cae.wisc.edu/~ece539/software/matrix/mbpv11.ps The difference between swapping subblocks in and out at the right time vs. the wrong time can make several orders of magnitude of difference in how long it takes to compute your result.

There are also parallel solutions to multiplying large matrices that will run on large clusters, trading time spent swapping subblocks in and out of memory for data communication overhead between the nodes.
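To sketch the blocking idea in code -- this is a toy in-memory numpy version; for a real 10^6 x 10^6 problem the three arrays would live on disk (e.g. as np.memmap arrays) and the block size would be tuned so that a few subblocks fit in RAM at once:

```python
import numpy as np

def blocked_matmul(A, B, block=128):
    """Compute A @ B one subblock at a time.

    Only a few `block` x `block` tiles need to be resident at once,
    which is the property that lets the same loop structure scale to
    matrices that don't fit in memory (with A, B, C memory-mapped).
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, block):          # rows of the result
        for j in range(0, m, block):      # columns of the result
            for p in range(0, k, block):  # accumulate along the inner dim
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C

# toy sizes, deliberately not multiples of the block size
rng = np.random.default_rng(0)
A = rng.random((500, 300))
B = rng.random((300, 400))
C = blocked_matmul(A, B, block=128)
```

With this loop order the C subblock stays resident while the inner loop streams tiles of A and B past it; reordering the loops changes which operand gets reused, which is exactly the swap-at-the-right-time issue described above.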

A problem similar to the matrix multiplication problem you describe is encountered in data clustering, given the "N choose 2", i.e. O(N^2), scaling of the number of pairwise distances in a dataset. Using all-pairs distances is intractable with even the simplest clustering algorithms on large datasets -- for example, hierarchical agglomerative / bottom-up clustering applied to more than tens of thousands of points. Depending on the exact nature of the problem, you may be able to exploit spatial coherence within your problem space -- e.g. for agglomerative clustering, you use only the nearest neighbors when joining clusters, so you can often reduce the complexity of clustering problems using a smart data structure like a kd-tree, which gives approximately O(log(N)) time per nearest neighbor lookup. However, kd-tree lookups quickly degrade to ~O(N) performance per query in high-dimensional spaces because of the curse of dimensionality, so you may need to do dimensionality reduction first anyway to make the kd-tree useful. (You'd also have to reframe your matrix multiplication problem into a format where using a kd-tree to find nearest neighbors in your vector space helps you compute a fast, close approximation to your desired solution.)
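As a concrete sketch of the kd-tree nearest-neighbor lookup (using SciPy's cKDTree as a stand-in for whatever implementation you have; the point count and dimensionality here are made up):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 3))  # low-dimensional, where kd-trees shine

tree = cKDTree(points)            # one-time O(N log N) build

# nearest neighbor of every point; k=2 because the closest hit to a
# point is the point itself, so the real neighbor is the second column
dists, idx = tree.query(points, k=2)
nearest = idx[:, 1]
```

Each query here costs roughly O(log N) instead of the O(N) brute-force scan; in (say) 100-dimensional data the tree prunes almost nothing and you're back near O(N) per query, which is the degradation mentioned above.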

Another approach used to cluster datasets with millions of points (and therefore trillions of pairs of points) is to pick a few exemplar points and cluster these instead, generating a sampled approximation of the cluster assignments for the full dataset. For example, you could randomly choose one point out of every thousand and cluster these into your K target clusters (= O(N^2 / 1000^2) time to cluster 1/1000th of the points), then go back through your full dataset and find the closest exemplar point to each original data point in order to compute the cluster labeling (= O(N^2 / 1000), although you can often skip this step entirely if you just use the exemplars). The large constant factors you divide out of the time complexity can make this approach tractable for much larger datasets than you could otherwise work with.

The exemplar method I just described is, incidentally, half an iteration of the k-means/k-medians algorithm applied to a set of exemplar points. You can go further and complete the full first iteration of k-medians by going back and updating your selection of exemplar points using the medians of the newly-labeled clusters; then, depending on how much compute time you can afford, you could run multiple whole iterations of this exemplar-modified k-medians algorithm (or run until convergence) to get better exemplars -- though even the first half-iteration may be sufficient to get you a useful result. As far as how to phrase your matrix multiplication problem in this framework: depending on the problem, you may be able to find representative row/column vectors this way and then just multiply the representative vectors together to get a product that is, in some sense, a low-dimensional approximation of the full matrix product.
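A minimal numpy sketch of the exemplar scheme above (means instead of medians for brevity; N, K, and the 1-in-1000 sampling rate are just the numbers from the example):

```python
import numpy as np

rng = np.random.default_rng(42)
N, K, stride = 100_000, 5, 1_000
data = rng.normal(size=(N, 2))

# step 1: keep one exemplar out of every `stride` points
exemplars = data[::stride]                      # N/1000 = 100 points

# step 2: cluster only the exemplars -- the all-pairs work is now
# O((N/1000)^2) instead of O(N^2); plain k-means iterations here
centers = exemplars[rng.choice(len(exemplars), size=K, replace=False)]
for _ in range(10):
    # label each exemplar by its nearest center...
    d = ((exemplars[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = d.argmin(axis=1)
    # ...then move each center to the mean of its exemplars
    # (keeping the old center if a cluster went empty)
    centers = np.array([exemplars[labels == k].mean(axis=0)
                        if (labels == k).any() else centers[k]
                        for k in range(K)])

# step 3 (the optional full-dataset sweep): label every original point;
# comparing to the K centers rather than all N/1000 exemplars is
# cheaper still than the O(N^2/1000) bound quoted above
d_full = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
full_labels = d_full.argmin(axis=1)
```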

Ultimately, whether you use (PCA-based) dimensionality reduction or k-means / k-medians, you're doing approximately the same thing, and this is why preprocessing with PCA can often help k-means converge faster. Quoting from http://en.wikipedia.org/wiki/Principal_component_analysis#Relation_between_PCA_and_K-means_clustering :

It has been shown recently (2007) [12] [13] that the relaxed solution of K-means clustering, specified by the cluster indicators, is given by the PCA principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix. Thus PCA automatically projects to the subspace where the global solution of K-means clustering lie, and thus facilitate K-means clustering to find near-optimal solutions.
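The PCA-then-k-means preprocessing implied by that result can be sketched in plain numpy via the SVD (the data shape and K here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2_000, 50))   # 2000 points in 50 dimensions
K = 4                              # intended number of k-means clusters

# center the data, then take the principal directions from the SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = components

# project onto the top K-1 principal components -- per the quoted
# result, this is (a relaxation of) the subspace containing the
# k-means cluster centroids
X_reduced = Xc @ Vt[:K - 1].T
```

Running k-means on X_reduced instead of X then works in a 3-dimensional space rather than a 50-dimensional one: cheaper per iteration, and already close to the subspace where the near-optimal solutions live.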