Tutorials: Topology-preserving gene selection and clustering

Topology-Preserving Selection and Clustering (TPSC)


How To

This document describes how to use the MATLAB package to perform topology-preserving gene selection and clustering. It is organized into the following three sections:

  • 1: System requirements - operating systems and computing environment required
  • 2: Protocol outline - general recipe for the analysis flow
  • 3: Intuitive demo - illustration of the proposed methodology by processing genome-scale expression data of human cells during various stress responses


1: System requirements


2: Protocol outline

    ➢ Step 1: Non-linear transformation of primary gene expression data (gene expression matrix as input) into output codebook matrix using SOM with EP kernel function;

    ➢ Step 2: Linear decomposition of the resulting codebook matrix by SVD;

    ➢ Step 3: Calculation of distance statistic for each node, following dominant eigennode selection and SVD subspace projection;

    ➢ Step 4: Assessment of significant nodes using an FDR multiple comparison procedure, followed by gene selection;

    ➢ Step 5: Two-phase gene clustering of selected genes through SOM-based methods, including the selected gene expression matrix trained by SOM with Gaussian kernel function (1st phase) and distance matrix based clustering of SOM (2nd phase).


3: Intuitive demo

     Data: The data were generated by cDNA microarrays containing over 40,000 cDNA elements, representing 25,802 genes. The primary gene expression matrix (36,164 genes×76 samples) was obtained after the following preliminary clean-up procedures: (1) transformation relative to the zero time point for each time course, by subtracting the median log2 expression ratios of the time-zero replicates; (2) exclusion of genes with more than 40% missing values; and (3) exclusion of the samples at zero time points from follow-up analysis. See Table 1 for details of the stress conditions.
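
The clean-up procedure described above can be sketched in a few lines. This is a minimal illustration in Python rather than part of the MATLAB package. The function name `preprocess`, the `courses` layout and the use of `None` for missing values are all hypothetical. It also assumes that the 40% missing-value filter is applied per gene, across all samples, before the time-zero samples are dropped.

```python
from statistics import median

def preprocess(matrix, courses, max_missing=0.40):
    """Sketch of the clean-up steps (names and data layout are assumptions).

    matrix  : list of gene rows; each entry is a log2 ratio or None (missing)
    courses : list of dicts, one per time course, e.g.
              {"zero": [column indices of time-zero replicates],
               "later": [column indices of the remaining time points]}
    Returns (kept_gene_indices, cleaned_rows); time-zero columns are removed.
    """
    kept, cleaned = [], []
    for g, row in enumerate(matrix):
        rel = {}  # column index -> value relative to time zero
        for course in courses:
            zeros = [row[c] for c in course["zero"] if row[c] is not None]
            base = median(zeros) if zeros else None
            for c in course["zero"] + course["later"]:
                v = row[c]
                rel[c] = None if v is None or base is None else v - base
        # Assumption: missing fraction is computed over all samples of the gene
        missing = sum(v is None for v in rel.values()) / len(rel)
        if missing > max_missing:
            continue
        kept.append(g)
        # Finally drop the time-zero samples, keeping only later time points
        cleaned.append([rel[c] for course in courses for c in course["later"]])
    return kept, cleaned
```

For example, with one time course whose columns 0 and 1 are time-zero replicates, a gene row `[1.0, 3.0, 4.0, 5.0]` has a time-zero median of 2.0 and is reduced to `[2.0, 3.0]`.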

      Table 1. Details of stress conditions for human expression data during various stress responses

     Results: See Table 3 for a description of the MATLAB variables in the Workspace. In addition, four variables (Dunit, Mhits, Dunit_base and Mhits_base) are exported to files ("Dunit.xls", "Mhits.xls", "Dunit_base.xls" and "Mhits_base"), and a new subdirectory is created to store the hierarchical displays for each of the obtained bases/clusters.

      Table 3. Description of MATLAB Variables in the Workspace

     Intuitive steps: The steps below illustrate the proposed methodology, together with the resulting images.

    Step 1:

    Non-linear transformation of primary gene expression data into output codebook matrix using SOM with EP kernel function (som_make_modified.m)

    Component plane presentations (CPPs) of SOM outputs (cppsom_stress.m)
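
To make the difference between the two kernels used in this protocol concrete, the sketch below compares the neighborhood functions in Python. It assumes that "EP" denotes the Epanechnikov neighborhood and uses its textbook form. Whether som_make_modified.m uses exactly this parameterization is not stated here, so treat the formulas and function names as illustrative.

```python
import math

def ep_kernel(d, radius):
    """Epanechnikov neighborhood (assumed form): 1 - d^2/r^2 inside the
    radius, exactly 0 at and beyond it."""
    u = (d * d) / (radius * radius)
    return max(0.0, 1.0 - u)

def gaussian_kernel(d, radius):
    """Gaussian neighborhood: exp(-d^2 / (2 r^2)); decays but never reaches 0."""
    return math.exp(-(d * d) / (2.0 * radius * radius))

if __name__ == "__main__":
    # Compare how strongly a node at map distance d is updated by each kernel
    for d in (0, 1, 2, 3):
        print(d, ep_kernel(d, 2), round(gaussian_kernel(d, 2), 4))
```

The EP kernel has compact support, so nodes beyond the radius are not updated at all, whereas the Gaussian kernel (used in Step 5) gives every node a small but non-zero weight.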

    Step 2:

    Linear decomposition of the resulting codebook matrix by SVD (svd_heatmap.m)

    Step 3:

    Calculation of distance statistic for each node, following dominant eigennode selection and SVD subspace projection

    Barchart for S and estimation of the number of significant eigennodes (S_barchart.m)

    Pair-wise SVD projection scatter plots (svd_scatterplots.m)
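
As a rough illustration of what the bar chart of S conveys, the Python sketch below converts singular values into the fraction of total variation that each eigennode explains and counts those above a cutoff. The 5% cutoff and both function names are hypothetical. The criterion actually used by S_barchart.m to estimate the number of significant eigennodes is not described here.

```python
def eigennode_fractions(singular_values):
    """Fraction of total variation captured by each eigennode: s_k^2 / sum(s^2)."""
    total = sum(s * s for s in singular_values)
    return [s * s / total for s in singular_values]

def n_dominant(singular_values, min_fraction=0.05):
    """Count eigennodes whose fraction reaches a cutoff.

    min_fraction is a hypothetical threshold chosen for illustration only.
    """
    return sum(f >= min_fraction for f in eigennode_fractions(singular_values))
```

The fractions sum to 1, so plotting them as bars gives a quick view of how quickly the contribution of successive eigennodes falls off.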

    Step 4:

    Assessment of significant nodes using an FDR multiple comparison procedure, followed by gene selection (svd_fdr.m)

    The number of nodes and genes selected under indicated FDR threshold (fdr_threshold.m)

    Visualize the nodes selected under indicated FDR using SOM (fdr_som.m)
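
The node assessment and gene selection in Step 4 can be sketched as follows. This Python example assumes the Benjamini-Hochberg step-up procedure as the FDR method and assumes that a gene is selected when its best-matching node is significant. Neither detail is confirmed here for svd_fdr.m, and the function names and inputs (per-node p-values, per-gene node assignments) are hypothetical.

```python
def bh_select(p_values, fdr=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure (assumed method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # number of smallest p-values to reject
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank k with p_(k) <= fdr * k / m
        if p_values[i] <= fdr * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

def genes_in_nodes(bmus, selected_nodes):
    """Genes whose best-matching node is among the selected nodes.

    bmus[g] is the node index assigned to gene g (hypothetical input).
    """
    chosen = set(selected_nodes)
    return [g for g, node in enumerate(bmus) if node in chosen]
```

Raising the FDR threshold can only add nodes, which is why the number of selected nodes and genes grows with the indicated threshold.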

    Step 5:

    Two-phase gene clustering of selected genes through SOM-based methods, including the selected gene expression matrix trained by SOM with Gaussian kernel function (1st phase) and distance matrix based clustering of SOM (2nd phase)

    CPPs of SOM outputs (cppsom_stress.m)

    Quantitative clustering of SOM based on neighbor distance matrix (som_bases.m)
