Identifying yeast cell-cycle transcription factor binding motifs


As another application of WordSpy, we used it to discover transcription factor binding motifs (TFBMs) in the regulatory regions of about 800 cell-cycle-related genes of S. cerevisiae. The cell-cycle gene names are from http://genome-www.stanford.edu/cellcycle/data/rawdata/ and the promoter sequences were retrieved with RSA tools (http://rsat.scmbb.ulb.ac.be/rsat/). After removing homologs and dubious genes, the input for this experiment contains 645 promoter sequences. The FASTA file is available here (cleaned yeast cell cycle promoters).

To evaluate whether a motif is biologically meaningful, we measure the coherence of the expression profiles of the genes whose promoters contain that motif. Specifically, we take the average coherence over all pairs of genes associated with a motif and call this measure the G-score. The yeast gene expression data are from http://cmgm.stanford.edu/~kimlab/multispecies/Data/yeast.zip. The motifs discovered by WordSpy were reordered by their G-scores. Interestingly, most of the known motifs rank high in our dictionary, whereas many obvious repeats with very high Z-scores, such as GAAAAAA, have low G-scores and can therefore be identified as not biologically significant and removed from the dictionary. We also performed a whole-genome analysis of motif specificity, the Zg-score, using the promoters of all S. cerevisiae genes. Most of the known TFBMs also rank high by Zg-score.
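The G-score idea can be illustrated with a minimal sketch. The text above does not specify the coherence function, so this example assumes the Pearson correlation between two expression profiles; the function names (`pearson`, `g_score`), the dictionary layout, and the forward-strand exact-match test for motif occurrence are all hypothetical choices made for illustration, not WordSpy's actual implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumed coherence measure)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def g_score(motif, promoters, expression):
    """Average pairwise coherence of the genes whose promoter contains the motif.

    promoters:  dict gene name -> promoter sequence (hypothetical layout)
    expression: dict gene name -> list of expression values (hypothetical layout)
    """
    # Assumption: a motif "occurs" if it matches exactly on the given strand.
    genes = [g for g, seq in promoters.items() if motif in seq and g in expression]
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0  # fewer than two genes: no pair to average over
    total = sum(pearson(expression[a], expression[b]) for a, b in pairs)
    return total / len(pairs)
```

Under these assumptions, a motif whose target genes rise and fall together scores close to 1, while a repeat that occurs in promoters of unrelated genes scores close to 0.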

To facilitate motif selection for real applications, we clustered similar motifs. The motifs were first sorted by Zg-score or G-score. Going from the highest rank to the lowest, we took each motif that had not yet been clustered as a seed and grouped it with all motifs that shared a common substring of length 6 with the seed or its reverse complement. The detailed results are shown below.
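The seed-based clustering described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: motifs are taken to be plain strings over A, C, G, T, each motif is assigned to at most one cluster (the text does not say whether an already clustered motif may join a later seed), and the names `reverse_complement`, `kmers`, and `cluster_motifs` are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def kmers(seq, k):
    """Set of all substrings of length k in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_motifs(scores, k=6):
    """Greedy seed clustering of motifs ranked by score.

    scores: dict motif -> score (Zg-score or G-score), a hypothetical layout.
    Returns a list of clusters; each cluster lists its seed first.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    clustered = set()
    clusters = []
    for seed in ranked:
        if seed in clustered:
            continue  # already grouped with a higher-ranked seed
        # Substrings of the seed on both strands.
        seed_kmers = kmers(seed, k) | kmers(reverse_complement(seed), k)
        # Assumption: only motifs not yet clustered may join this seed.
        members = [
            m for m in ranked
            if m != seed and m not in clustered and kmers(m, k) & seed_kmers
        ]
        clusters.append([seed] + members)
        clustered.add(seed)
        clustered.update(members)
    return clusters
```

For example, with scores `{"ACGCGTA": 9.0, "TACGCGT": 8.0, "GAAAAAA": 7.0, "TTTTTTC": 1.0}`, the first two motifs share the 6-mer ACGCGT and form one cluster, and TTTTTTC joins GAAAAAA because it is that seed's reverse complement.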

Results:

Identified known motifs and their ranks.
All putative motifs for yeast cell-cycle genes ordered by G-score.
Putative motif clusters based on G-score ranking.

Putative motif clusters based on Zg-score ranking.

The dictionaries built by WordSpy: