Gene Expression Coherence Score (G-Score)
The gene expression coherence score (G-score) is a measure of how similar a set of gene expression profiles is. Co-regulated genes often have similar expression profiles across different conditions. We can thus evaluate the likelihood that a motif is biologically meaningful from the coherence of the expression profiles of all the genes whose promoters contain the motif. For a pair of genes, expression coherence can be measured in many ways, such as Euclidean distance or the correlation coefficient. However, for a set containing an arbitrary number of genes, expression coherence is more difficult to define. Considering a gene profile as a point in the n-dimensional condition space, a good measure for any set of gene profiles should reflect how tightly these points are clustered together in that space. We have tried three different methods.
Define G-score as the Average Coherence Of Pairwise gene expression profiles (ACOP). The simplest way to define the coherence of a set of genes is to use the average of the coherences of pairwise gene expression profiles, i.e., the average distance between any two points. We measure the distance by the correlation coefficient. In this case, the G-score measures the absolute tightness of a set of points, i.e., how much space the set of points spreads over, without considering the number of points in the set.
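As an illustration, ACOP can be sketched in a few lines. This is a minimal sketch, assuming the Pearson correlation coefficient as the pairwise coherence and a NumPy array with one row per gene and one column per condition; the function name `acop` and the toy data are hypothetical and not taken from our implementation.

```python
import numpy as np

def acop(profiles):
    """Average pairwise Pearson correlation among the rows of `profiles`.

    Assumption: rows are genes, columns are conditions, and no row is constant
    (a constant profile has an undefined correlation).
    """
    profiles = np.asarray(profiles, dtype=float)
    n_genes = profiles.shape[0]
    if n_genes < 2:
        raise ValueError("need at least two gene profiles")
    corr = np.corrcoef(profiles)            # n_genes x n_genes correlation matrix
    upper = np.triu_indices(n_genes, k=1)   # each unordered pair exactly once
    return float(corr[upper].mean())

# Toy data (hypothetical): three genes measured under four conditions.
toy = [[1, 2, 3, 4],
       [2, 4, 6, 8],
       [4, 3, 2, 1]]
toy_score = acop(toy)
```

In the toy data the first two genes are perfectly correlated and the third is perfectly anti-correlated with both, so the three pairwise values are 1, -1 and -1 and the score is -1/3.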
Define G-score as the Significance of ACOP against randomly sampled gene profiles (SACOP). To take the number of points in the set into account, we can randomly sample the same number of gene profiles thousands of times and calculate the mean and the standard deviation of the resulting ACOP values. Then, by computing how significant the ACOP score is against these random samples, SACOP measures the relative tightness of a set of points.
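The sampling procedure can be sketched as follows. The text does not fix a particular significance statistic, so this sketch assumes a z-score (observed ACOP minus the mean of the random samples, divided by their standard deviation). The names `sacop` and `mean_pairwise_corr`, the default of 1000 samples, and the synthetic data are all hypothetical.

```python
import numpy as np

def mean_pairwise_corr(profiles):
    """ACOP: mean Pearson correlation over all unordered pairs of rows."""
    corr = np.corrcoef(profiles)
    upper = np.triu_indices(profiles.shape[0], k=1)
    return float(corr[upper].mean())

def sacop(genome, member_idx, n_samples=1000, seed=0):
    """z-score of the ACOP of `member_idx` against same-size random gene sets.

    Assumption: `genome` holds one expression profile per row for every gene.
    """
    genome = np.asarray(genome, dtype=float)
    rng = np.random.default_rng(seed)
    size = len(member_idx)
    observed = mean_pairwise_corr(genome[member_idx])
    background = np.array([
        mean_pairwise_corr(
            genome[rng.choice(genome.shape[0], size=size, replace=False)])
        for _ in range(n_samples)
    ])
    return float((observed - background.mean()) / background.std(ddof=1))

# Synthetic example (hypothetical data): 200 noisy genes over 20 conditions,
# with the first 10 genes following a shared pattern.
demo_rng = np.random.default_rng(1)
demo_genome = demo_rng.normal(size=(200, 20))
shared = demo_rng.normal(size=20)
demo_genome[:10] = shared + 0.1 * demo_rng.normal(size=(10, 20))
demo_z = sacop(demo_genome, list(range(10)), n_samples=500)
```

Because the background sets have the same size as the tested set, a small set and a large set with the same ACOP can receive different scores, which is the point of this variant.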
Define G-score as the ratio of the number of Good Pairs against that of randomly sampled gene profiles (GP). When the gene cluster is split into two very tightly clustered subsets that are nevertheless remote from each other, the ACOP or SACOP becomes relatively low, although it may still be higher than random. Another approach is therefore to define a threshold for good pairs and compare the number of good pairs against that of randomly sampled gene profiles. In this case, we randomly sample 100 genes from the entire genome, calculate the pairwise coherence for every pair of expression profiles, and then define a threshold (T) as the lowest value in the fifth percentile of the distribution of these distances. For a set of N genes, we calculate the coherence of each pair of genes and count the number of good pairs (GPs) that are above T. By randomly sampling N genes thousands of times from the whole genome and counting the number of good pairs for each run, we can compute the mean and standard deviation of these counts. Then we can calculate the significance of the test data.
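A minimal sketch of the good-pair count is given below. Several details are assumptions made for illustration only: the pairwise coherence is the Pearson correlation, T is taken as the cutoff of the best 5% of pairwise correlations in the reference sample (one reading of the percentile definition above), and the significance is expressed as a z-score. The names `gp_score` and `pairwise_corr` and the synthetic data are hypothetical.

```python
import numpy as np

def pairwise_corr(profiles):
    """Pearson correlation of every unordered pair of rows."""
    corr = np.corrcoef(profiles)
    return corr[np.triu_indices(len(profiles), k=1)]

def gp_score(genome, member_idx, n_ref=100, top_fraction=0.05,
             n_samples=1000, seed=0):
    """z-score of the good-pair count of `member_idx` against random gene sets."""
    genome = np.asarray(genome, dtype=float)
    rng = np.random.default_rng(seed)
    n_genes = genome.shape[0]
    size = len(member_idx)

    # Threshold T from a random reference sample of genes (assumed: top 5% cutoff).
    ref = genome[rng.choice(n_genes, size=min(n_ref, n_genes), replace=False)]
    threshold = np.quantile(pairwise_corr(ref), 1.0 - top_fraction)

    def good_pairs(idx):
        # Number of gene pairs whose correlation exceeds T.
        return int((pairwise_corr(genome[idx]) > threshold).sum())

    observed = good_pairs(member_idx)
    background = np.array([
        good_pairs(rng.choice(n_genes, size=size, replace=False))
        for _ in range(n_samples)
    ])
    spread = background.std(ddof=1)
    if spread == 0:  # degenerate background: no variation in the counts
        return float("inf") if observed > background.mean() else 0.0
    return float((observed - background.mean()) / spread)

# Synthetic split cluster (hypothetical data): genes 0-4 follow a pattern and
# genes 5-9 follow its mirror image, so cross-subset pairs are anti-correlated.
demo_rng = np.random.default_rng(2)
demo_genome = demo_rng.normal(size=(200, 20))
shared = demo_rng.normal(size=20)
demo_genome[:5] = shared + 0.1 * demo_rng.normal(size=(5, 20))
demo_genome[5:10] = -shared + 0.1 * demo_rng.normal(size=(5, 20))
demo_gp = gp_score(demo_genome, list(range(10)), n_samples=500)
```

In this split-cluster example the average pairwise correlation of the ten genes is close to zero, yet the 20 within-subset pairs still count as good pairs, so the count stands out against the random background.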
We ran the experiments on cell-cycle genes using all three methods. The results are shown below.
| Motif | ACOP rank | SACOP rank | GP rank | Total # putative motifs |
| --- | --- | --- | --- | --- |
| TGCTGG | 22 | 60 | 51 | 147 |
| GCTGG | 10 | 16 | 6 | 30 |
| ACGCGT | 1 | 1 | 2 | 147 |
| CACGAAA | 47 | 61 | 33 | 419 |
| CGCGAAA | 8 | 4 | 6 | 419 |
| ATAAACAA | 44 | 19 | 43 | 1015 |
| GTAAACAA | 21 | 12 | 56 | 1015 |
| GTAAACA | 21 | 10 | 72 | 419 |
| TTTCCTAA | 61 | 77 | 52 | 1015 |
| TCACGTG | 93 | 167 | 142 | 419 |
| TGAAACA | 55 | 57 | 162 | 1015 |