The goal of data clustering, a form of unsupervised learning, is to organize a set of n objects into k clusters such that objects in the same cluster are more similar to each other than to objects in different clusters. Clustering is one of the most popular tools for data exploration and data organization, and it is widely used in almost every scientific discipline that collects data. Given the exponential growth in data generation (estimated to exceed 35 trillion gigabytes by the year 2020), clustering is receiving renewed interest in applications such as social networks, image retrieval, web search and gene expression analysis. This talk introduces the data clustering problem and discusses the challenges and opportunities in research on large-scale clustering, with a focus on two main issues: (i) how to define pairwise similarity between objects, and (ii) how to efficiently cluster hundreds of millions of objects. Recent developments in large-scale clustering research are also discussed.
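To make the problem statement concrete, here is a minimal sketch of k-means (Lloyd's algorithm), which partitions n points into k clusters. It is an illustration only, not necessarily the method presented in the talk. It assumes squared Euclidean distance as the pairwise dissimilarity and, as a simplification, uses the first k points as initial centers. The names `kmeans` and `squared_distance` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Cluster `points` into `k` groups with Lloyd's algorithm.

    Assumption: the first k points serve as initial centers. Practical
    implementations usually choose a more careful initialization.
    """
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments are stable, so the algorithm has converged.
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # Keep the old center if a cluster becomes empty.
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


# Two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(data, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each iteration compares every point with every center, so the cost per pass grows with n times k. That cost is part of what makes issue (ii), clustering hundreds of millions of objects, a challenge. The choice of distance function corresponds to issue (i).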
Related articles:
Data clustering: 50 years beyond K-means (http://www.sciencedirect.com/science/...)
Approximate kernel k-means: solution to large scale kernel clustering (http://dl.acm.org/citation.cfm?id=202...)
Nonlinear component analysis as a kernel eigenvalue problem (http://www.mitpressjournals.org/doi/a...)
Algorithms for Clustering Data (http://www.cse.msu.edu/~jain/Clusteri...)