Nearest-Neighbor Analysis
- About the data files
- Finding neighbors of a candidate gene or an input profile
- Filtering the data
- Normalization methods
- Distance metrics
Boolean Query
- About the data files
- Constructing a query: operators and factors
- Filtering the data
- Sorting the output
About the data files
There are two data files available for searching. Both contain the same data set, with the experiments arranged in a different order. In file ploidymat.xpr, the data for each gene are ordered by ploidy in order to optimize the visualization of ploidy-dependent gene expression patterns. In file matploidy.xpr, the data are grouped by mating type to facilitate the visualization of mating-type effects.
The experiments are labelled as follows: The ploidy is indicated by the number of characters in the label. The characters represent the mating-type genotype; "a" indicates MATa, "x" indicates MATalpha. For example, aaxx represents the MATa/MATa/MATalpha/MATalpha tetraploid strain.
ploidy  MATa  MATalpha  MATa/alpha
n       a     x
2n      aa    xx        ax
3n      aaa   xxx       aax
4n      aaaa  xxxx      aaxx
Finding neighbors of a candidate gene or an input profile
One can query the data to view the expression of a particular gene of interest and its nearest neighbors (those genes most like it); this is the "candidate gene" mode. Alternatively, one can specify a hypothetical expression pattern and find its nearest neighbors; this is the "input profile" mode.
To query a candidate gene, enter a search string. The search is case-insensitive and can be a partial match. All synonyms for each gene (the ORF designations and common names) are recognized. Multiple matches may result, in which case you will be prompted to select from them. For example, searching for "ste" will match all of the STE "sterile" genes.
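The candidate-gene lookup described above can be sketched roughly as follows. This is only an illustration, not the site's actual code: the synonym table is a toy example, the function name find_candidates is hypothetical, and "partial match" is assumed to mean a substring match.

```python
# Toy synonym table for illustration only: each ORF maps to the names
# (ORF designation and common name) under which it can be found.
GENE_SYNONYMS = {
    "YFL026W": ["YFL026W", "STE2"],
    "YKL178C": ["YKL178C", "STE3"],
    "YFL039C": ["YFL039C", "ACT1"],
}


def find_candidates(query, synonyms=GENE_SYNONYMS):
    """Return the ORFs for which any synonym contains the query string.

    The comparison ignores case. Substring matching is an assumption
    about what "partial match" means.
    """
    q = query.strip().lower()
    return sorted(
        orf
        for orf, names in synonyms.items()
        if any(q in name.lower() for name in names)
    )


# A partial, lower-case query can return several genes to choose from.
matches = find_candidates("ste")
```

When more than one gene is returned, the user would be asked to pick one before the neighbor search runs.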
To query an input profile, select this option and continue. You will be prompted to enter hypothetical expression levels for each experiment. Because this input profile (and the experimental data) will be normalized in the analysis, the magnitude of any individual input datum is irrelevant. What matters is the relative differences between the input data, that is, the "shape" of the data. For example, to find genes expressed specifically in MATa strains, enter a "1" for all the MATa experiments, and a "0" for all of the MATalpha and MATa/alpha experiments. As another example, to find genes induced in proportion to the ploidy, enter a "1" for all haploid experiments, a "2" for all diploid experiments, a "3" for all triploid experiments, and a "4" for all tetraploid experiments.
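As a sketch of how the two example profiles could be built from the experiment labels alone: the ordering of the labels in EXPERIMENTS and the helper names are assumptions made for illustration, and the site itself asks you to type the values by hand.

```python
# Experiment labels taken from the table of strains; the order used here
# (grouped by ploidy) is an assumption for illustration.
EXPERIMENTS = ["a", "x", "aa", "xx", "ax",
               "aaa", "xxx", "aax", "aaaa", "xxxx", "aaxx"]


def ploidy(label):
    """The ploidy is the number of characters in the label."""
    return len(label)


def mating_type(label):
    """Classify a label as MATa, MATalpha, or MATa/alpha."""
    kinds = set(label)
    if kinds == {"a"}:
        return "MATa"
    if kinds == {"x"}:
        return "MATalpha"
    return "MATa/alpha"


# Profile for genes expressed specifically in MATa strains: 1 for every
# MATa experiment, 0 for every MATalpha and MATa/alpha experiment.
mata_profile = [1 if mating_type(e) == "MATa" else 0 for e in EXPERIMENTS]

# Profile for genes induced in proportion to ploidy: 1, 2, 3, or 4.
ploidy_profile = [ploidy(e) for e in EXPERIMENTS]
```

Because profiles are normalized before comparison, entering 10 and 0 instead of 1 and 0 would describe the same shape.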
Filtering the data
The expression data are derived from the hybridization intensities of 20 perfect-match and mismatch oligonucleotide probe pairs. Each expression level is a trimmed average difference (perfect-match minus mismatch). Lower average differences (lower hybridization signals) show more variability than moderate or high average differences. Using filtering, one can eliminate genes showing consistently low average differences from an analysis. Also, many genes show very little or no change in expression levels in the data set. Using filtering, one can eliminate these genes from consideration.
Filtering works like this: Each gene is evaluated based on the filtering criteria and is either eliminated from the analysis or retained. For each gene, the filter finds the maximum and minimum expression values and calculates a difference and a ratio:
difference = maximum - minimum
ratio = maximum / minimum
If either of these values is less than the corresponding threshold entered in the filtering options, the gene is rejected. In the analyses published in the Science paper, we filtered with difference = 100 and ratio = 3.
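The filter can be sketched as a small function. The name passes_filter is hypothetical, and the treatment of a minimum that is zero or negative (which makes the ratio undefined) is an assumption, since the text above does not say how that case is handled.

```python
def passes_filter(values, min_difference=100.0, min_ratio=3.0):
    """Return True if a gene's expression values survive the filter.

    The default thresholds are the values quoted above for the
    published analyses.
    """
    highest, lowest = max(values), min(values)

    # Reject genes whose overall change in expression is too small.
    if highest - lowest < min_difference:
        return False

    # Assumption: when the minimum is zero or negative the ratio is
    # undefined, and the ratio test is treated as passed.
    if lowest <= 0:
        return True

    # Reject genes whose fold change is too small.
    return highest / lowest >= min_ratio
```

For example, a gene ranging from 200 to 400 has a difference of 200 but a ratio of only 2, so it is rejected under the default thresholds.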
Normalization methods
To compare the expression pattern of one gene to another, the expression data for each gene are normalized. There are three normalization methods available. We used standardization to explore ploidy-dependent gene expression. We devised the fractionalization method to explore mating-type-specific gene expression; this method optimizes the discovery of binary (on or off) expression patterns. We have set up this site with file-specific default normalization methods. For file ploidymat.xpr, the default is standardization; for matploidy.xpr, the default is fractionalization. Also, when viewing the results of a query, the default is to show the un-normalized data. There is a checkbox to see the normalized data in the output.
The standardization method transforms the data for each gene to have mean=0 and standard deviation=1. Standardized data represent standard deviations from the mean.
The fraction method transforms the data for each gene to have sum=1. The data are summed, and each datum is divided by that sum. Fractionalized data range from 0 to 1.
In the log2 method, the data are logarithmically transformed and then the mean for each gene is set to 0. log2 data represent a linear index of fold change with symmetry about 0.
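The three normalization methods can be sketched as follows. The function names are hypothetical. The use of the sample standard deviation (dividing by n - 1) is an assumption, and the log2 sketch assumes all values are positive.

```python
import math


def standardize(values):
    """Rescale one gene's data to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def fractionalize(values):
    """Divide each datum by the sum of the data, so the result sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def log2_center(values):
    """Take log2 of each datum, then subtract the mean of the logs."""
    logs = [math.log2(v) for v in values]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]
```

A gene with identical values in every experiment has a standard deviation of 0 and cannot be standardized; filtering out unchanging genes beforehand avoids that case.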
Distance metrics
To determine how near or far a given gene is from a candidate gene or input profile, one must use a distance metric. This site offers four options. We used the Pearson correlation coefficient to study ploidy-dependent expression; this is the default metric when using file ploidymat.xpr. To find mating-type-specific expression patterns, we used a modified Euclidean distance metric, "w_euclidean". This metric in combination with the "fraction" normalization method offers enhanced discrimination between binary (on or off) expression and less-than-binary expression.
The Pearson metric is the standard Pearson correlation coefficient.
The Fisher metric is a transform of the Pearson coefficient. It will give the same list of genes as the Pearson metric. The advantage of the Fisher transform is that the transformed coefficients are approximately normally distributed, which allows hypothesis testing and significance testing.
The Euclidean metric is simply the geometric distance (the square root of the sum of the squared differences) from one gene to another when one considers each gene as a point in n-dimensional space, where n is the number of experiments (data dimensions for each gene).
The w_Euclidean metric is a modification of the Euclidean. Each squared difference is divided by the square root of the absolute value of the product.
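The four metrics can be sketched as below; this is not the site's own code. The pearson, fisher, and euclidean functions follow the standard definitions. The w_euclidean function rests on assumptions: that "the product" means the product of the two values being compared, that the square root of the weighted sum is still taken, and that a small eps (an addition made here, not described in the text) guards against division by zero.

```python
import math


def pearson(u, v):
    """Standard Pearson correlation coefficient of two profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def fisher(u, v):
    """Fisher transform of the Pearson coefficient (undefined at r = +/-1)."""
    return math.atanh(pearson(u, v))


def euclidean(u, v):
    """Square root of the sum of the squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def w_euclidean(u, v, eps=1e-12):
    """Weighted Euclidean distance, under the assumptions stated above.

    Each squared difference is divided by sqrt(|a * b|); eps is a
    hypothetical guard for the case where the product is zero.
    """
    return math.sqrt(
        sum((a - b) ** 2 / max(math.sqrt(abs(a * b)), eps)
            for a, b in zip(u, v))
    )
```

Note that Pearson and Fisher values are similarities (larger means nearer), while the two Euclidean values are distances (smaller means nearer), so a neighbor list is sorted in opposite directions for the two families.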
About the data files
There are two data files available for searching. Both contain the same data set, with the experiments arranged in a different order. In file ploidymat.xpr, the data for each gene are ordered by ploidy in order to optimize the visualization of ploidy-dependent gene expression patterns. In file matploidy.xpr, the data are grouped by mating type to facilitate the visualization of mating-type effects.
The experiments are labelled as follows: The ploidy is indicated by the number of characters in the label. The characters represent the mating-type genotype; "a" indicates MATa, "x" indicates MATalpha. For example, aaxx represents the MATa/MATa/MATalpha/MATalpha tetraploid strain.
ploidy  MATa  MATalpha  MATa/alpha
n       a     x
2n      aa    xx        ax
3n      aaa   xxx       aax
4n      aaaa  xxxx      aaxx
Constructing a query: operators and factors
To construct a Boolean query, you are presented with a matrix containing an intersection of every experiment with every other experiment. In any intersection, you may specify an expression consisting of an operator (in the pull-down menu) and a factor (in the text box). The expression is read like this:
row experiment   operator   factor   column experiment
For example, selecting the ">" operator and the factor "10" in the upper left intersection specifies the expression x > 10 a. Entering this query will find all genes for which this expression is true; that is, all genes whose expression in the MATalpha haploid is greater than 10 times their expression in the MATa haploid. Multiple expressions may also be specified. Only genes for which all expressions are true will answer the query (the matrix is a Boolean AND matrix).
There are 7 operators available:
!= "not equal"
< "less than"
<= "less than or equal"
== "equal"
> "greater than"
>= "greater than or equal"
[ ] "within a factor of"
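A query of this kind can be sketched as a list of (row, operator, factor, column) clauses that must all hold for a gene. The function names are hypothetical. The reading of "within a factor of" as column / factor <= row <= column * factor is an assumption, and the factor is assumed to be positive.

```python
import operator

# The six comparison operators, applied as: row OP (factor * column).
OPERATORS = {
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
}


def holds(row_value, op, factor, column_value):
    """Evaluate one expression: row experiment, operator, factor, column."""
    if op == "[ ]":
        # Assumption: "within a factor of" means the row value lies
        # between column / factor and column * factor.
        low, high = sorted((column_value / factor, column_value * factor))
        return low <= row_value <= high
    return OPERATORS[op](row_value, factor * column_value)


def answers_query(gene, clauses):
    """Boolean AND: the gene must satisfy every clause in the query.

    gene maps experiment labels to expression values; each clause is a
    (row label, operator, factor, column label) tuple.
    """
    return all(
        holds(gene[row], op, factor, gene[col])
        for row, op, factor, col in clauses
    )


# The example from the text: x > 10 a.
example_gene = {"a": 20.0, "x": 300.0, "aa": 25.0}
example_result = answers_query(example_gene, [("x", ">", 10, "a")])
```

Adding a second clause, such as ("aa", "[ ]", 2, "a"), narrows the result further, because every clause must be true.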
Filtering the data
The expression data are derived from the hybridization intensities of 20 perfect-match and mismatch oligonucleotide probe pairs. Each expression level is a trimmed average difference (perfect-match minus mismatch). Lower average differences (lower hybridization signals) show more variability than moderate or high average differences. Using filtering, one can eliminate genes showing consistently low average differences from an analysis. Also, many genes show very little or no change in expression levels in the data set. Using filtering, one can eliminate these genes from consideration.
Filtering works like this: Each gene is evaluated based on the filtering criteria and is either eliminated from the analysis or retained. For each gene, the filter finds the maximum and minimum expression values and calculates a difference and a ratio:
difference = maximum - minimum
ratio = maximum / minimum
If either of these values is less than the corresponding threshold entered in the filtering options, the gene is rejected. In the analyses published in the Science paper, we filtered with difference = 100 and ratio = 3.
Sorting the output
There are three methods available to sort the genes that will appear in the output of a Boolean query. The "maximum overall fold difference" method sorts the genes by the ratio of the maximum expression value to the minimum expression value. The "fold difference between" method allows you to select specific experiments (from the pull-down menus) to determine the sorting ratio. Lastly, the "expression in experiment" method sorts the genes by their expression level in a user-specified experiment.
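The three sorting methods can be sketched with a single function. The method keywords, the function name sort_genes, and the descending sort order are all assumptions made for illustration; the sketch also assumes positive expression values so that the ratios are defined.

```python
def sort_genes(genes, method, first=None, second=None):
    """Return gene names sorted by one of the three methods.

    genes maps each gene name to a dict of experiment label -> value.
    Sorting is from largest to smallest (an assumption).
    """
    if method == "max_fold":
        # Maximum overall fold difference: maximum / minimum.
        def key(name):
            values = genes[name].values()
            return max(values) / min(values)
    elif method == "fold_between":
        # Fold difference between two chosen experiments.
        def key(name):
            return genes[name][first] / genes[name][second]
    elif method == "expression_in":
        # Expression level in one chosen experiment.
        def key(name):
            return genes[name][first]
    else:
        raise ValueError("unknown sorting method: " + method)
    return sorted(genes, key=key, reverse=True)


# Made-up values for three hypothetical genes.
toy_genes = {
    "G1": {"a": 100.0, "x": 10.0},
    "G2": {"a": 20.0, "x": 300.0},
    "G3": {"a": 50.0, "x": 50.0},
}
by_max_fold = sort_genes(toy_genes, "max_fold")
```

With these toy values, sorting by "fold_between" with first="a" and second="x" puts G1 first, whereas "max_fold" puts G2 first because its fold change is larger regardless of direction.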