Cluster Analysis of Gene Expression Data
23 September 2005 09:34
A couple of years ago I was involved in some data mining work for the government. One of our tasks was to investigate the many COTS (commercial off-the-shelf) tools and software used for data mining. The second of our tasks was to investigate the many clustering algorithms and perhaps develop more suitable ones. Now, I really knew nothing about data mining, nor about clustering algorithms. As luck would have it, when I looked at the various algorithms out there, especially the popular K-means algorithm, I was able to see that in some ways it was similar to the kind of gravitational N-body problems I did when doing galactic simulations (see, for example, this paper). The complexity of the algorithm--O(N2)--where N is the number of data points, is the same as all N-body physical simulations with a two-body force.
In many cases, it is possible to reduce an N-body simulation to a
grid. This method involves assigning each of the bodies (e.g., particles,
data points, etc.) to a grid point in an n-dimensional space (where n
is the dimensionality of the point). In this way the information of all N
data points is now encapsulated in the m grid points. The cost of this
is O(N), and the cost of determining the clusters in the grid with
m grid
points is m log m. This scheme actually works, as shown in this
paper
given in the 2003 SIGKDD conference on data mining.
Things to do:
Learn more about gene expression data and the applications of the new clustering algorithm I've discussed to this problem.
See also: Computation; Information Theory; Physics; Quantum Mechanics
Bibliography
E. Domany, Cluster Analysis of Gene Expression Data, arxiv.org/abs/physics/0206056 (2002).
W. Peter, J. Chiochetti, and C. Giardina, New unsupervised clustering algorithm for large datasets, KDD-2003, the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ISBN: 1-58113-737-0, p. 643 (2003).