Cluster Analysis of Gene Expression Data

23 September 2005 09:34

A couple of years ago I was involved in some data mining work for the government. One of our tasks was to investigate the many COTS (commercial off-the-shelf) tools and software used for data mining. The second of our tasks was to investigate the many clustering algorithms and perhaps develop more suitable ones. Now, I really knew nothing about data mining, nor about clustering algorithms. As luck would have it, when I looked at the various algorithms out there, especially the popular K-means algorithm, I was able to see that in some ways it was similar to the kind of gravitational N-body problems I did when doing galactic simulations (see, for example, this paper). The complexity of the algorithm--O(N²)--where N is the number of data points, is the same as all N-body physical simulations with a two-body force.

In many cases, it is possible to reduce an N-body simulation to a grid. This method involves assigning each of the bodies (e.g., particles, data points, etc.) to a grid point in an n-dimensional space (where n is the dimensionality of the point). In this way the information of all N data points is now encapsulated in the m grid points. The cost of this is O(N), and the cost of determining the clusters in the grid with m grid points is m log m. This scheme actually works, as shown in this paper given in the 2003 SIGKDD conference on data mining.

Things to do:

Learn more about gene expression data and the applications of the new clustering algorithm I've discussed to this problem.

Bibliography

E. Domany, Cluster Analysis of Gene Expression Data, arxiv.org/abs/physics/0206056 (2002).
W. Peter, J. Chiochetti, and C. Giardina, New unsupervised clustering algorithm for large datasets, KDD-2003, the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ISBN: 1-58113-737-0, p. 643 (2003).