Tutorials: Topology-preserving gene selection and clustering

Topology-Preserving Selection and Clustering (TPSC)

Vector Space Model

For generality, we define the topological structure of gene expression data as the inherent relationships within the data itself. To describe these relationships intuitively, we express multi-dimensional microarray data in terms of the vector space model. This model treats the expression values of a given gene across N related samples (typically log2-transformed ratios relative to a control) as the coordinates of that gene in an N-dimensional hyperspace. Accordingly, the set of G genes in the primary expression matrix corresponds to a data cloud of G points in the hyperspace. Data points near the origin are more likely to correspond to genes with no change in expression or with only random variation, while those far from the origin are likely to be genes with an observable expression pattern. In other words, the spatial properties of the data cloud can serve as a proxy for the topological structure of the gene expression data. Informative, feature-rich data tend to lie on the fringe of the hyperspace, whereas randomized and artificial data are centered on the origin. Moreover, the data cloud resulting from the observed data matrix tends to lie farther from the origin than the cloud resulting from the randomized matrix. This topological structure therefore provides a basis for recognizing and selecting biologically meaningful genes. One advantage of preserving this topology is that it supports exploratory analysis of large, complex multi-dimensional data, particularly when no a priori assumption about the data structure is available.
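
The geometric picture above can be illustrated with a minimal Python/NumPy sketch. It is not the TPSC implementation: the synthetic data, the full-matrix shuffle used as the randomization, the 99% quantile cutoff, and the helper names `distances_from_origin` and `randomize_matrix` are all assumptions made for illustration. Each gene is a row vector in the N-dimensional sample space, and its Euclidean distance from the origin is compared with distances obtained from a randomized copy of the matrix.

```python
import numpy as np


def distances_from_origin(expr):
    """Euclidean norm of each gene (row) in the N-dimensional sample space."""
    return np.linalg.norm(expr, axis=1)


def randomize_matrix(expr, rng):
    """Shuffle all entries of the matrix (an assumed randomization scheme).

    This keeps the overall distribution of values but destroys any
    gene-wise pattern across samples.
    """
    return rng.permutation(expr.ravel()).reshape(expr.shape)


rng = np.random.default_rng(0)
G, N = 1000, 8  # hypothetical numbers of genes and samples

# Synthetic log2 ratios: most genes carry only random variation.
expr = rng.normal(0.0, 0.3, size=(G, N))

# The first 50 genes get a consistent pattern (up in the last four samples).
pattern = np.array([0, 0, 0, 0, 2, 2, 2, 2], dtype=float)
expr[:50] += pattern

observed = distances_from_origin(expr)
randomized = distances_from_origin(randomize_matrix(expr, rng))

# Genes farther from the origin than 99% of the randomized cloud are
# treated as candidates; the cutoff is an illustrative choice.
threshold = np.quantile(randomized, 0.99)
selected = np.flatnonzero(observed > threshold)

print(f"threshold = {threshold:.2f}, selected genes = {selected.size}")
```

Note that shuffling every entry preserves the total sum of squares of the matrix, so under this particular randomization the difference between the observed and randomized clouds shows up in the tail of the distance distribution (the fringe) rather than in the mean squared distance.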
