Guide to Analyzing the MAP kinase data using nearest-neighbor analysis and Boolean queries

 

Nearest-Neighbor Analysis

 

Boolean Query

 



Nearest-Neighbor Analysis

 

About the data files

There are two data files available for searching. One corresponds to haploid strains grown under rich-medium conditions (synthetic complete medium, to an optical density at 600 nm of 1.0); the other corresponds to diploid strains grown under nitrogen-starvation conditions (SLAD medium) for 4 hours.

Finding neighbors of a candidate gene or an input profile

One can query the data to view the expression of a particular gene of interest, and its nearest neighbors (those genes most like it); this is the "candidate gene" mode. Alternatively, one can specify a hypothetical expression pattern and find its nearest neighbors; this is the "input profile" mode.

To query a candidate gene, enter a search string. The search is case-insensitive and accepts partial matches. All synonyms for each gene (the ORF designations and common names) are recognized. Multiple matches may result, in which case you will be prompted to select from them. For example, searching for "ste" will match all of the STE ("sterile") genes.
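The matching rule for a candidate-gene search can be sketched in Python as follows. This is only an illustration: the synonym table, its few example entries, and the function name find_genes are hypothetical, and the site's actual implementation is not described in this guide.

```python
# Hypothetical synonym table: each ORF maps to all of its names
# (ORF designation plus common names). Entries are illustrative only.
GENE_SYNONYMS = {
    "YHR084W": ["YHR084W", "STE12"],
    "YDL159W": ["YDL159W", "STE7"],
    "YBR083W": ["YBR083W", "TEC1"],
}


def find_genes(query, synonyms=GENE_SYNONYMS):
    """Return ORFs having any synonym that contains `query`, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return []
    return sorted(
        orf
        for orf, names in synonyms.items()
        if any(needle in name.lower() for name in names)
    )


# A partial, lower-case query matches every gene with "STE" in one of its names.
matches = find_genes("ste")
print(matches)
```

When more than one ORF is returned, as with "ste" here, the user would be asked to pick one of the matches.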

To query an input profile, select this option and continue. You will be prompted to enter hypothetical expression levels for each experiment. Because this input profile (and the experimental data) will be normalized in the analysis, the magnitude of any individual input datum is irrelevant. What matter are the relative differences among the input data, that is, the "shape" of the data. For example, to find genes whose expression is reduced in tec1, ste12, and ste7 mutants, enter "1" for wild-type and "0" for the three mutants.

 

Filtering the data

The expression data are derived from the hybridization intensities of 20 perfect-match and mismatch oligonucleotide probe pairs. Each expression level is a trimmed average difference (perfect-match minus mismatch). Lower average differences (lower hybridization signals) show more variability than moderate or high average differences. Using filtering, one can eliminate genes showing consistently low average differences from an analysis. In addition, many genes show little or no change in expression level across the data set; filtering can eliminate these genes from consideration as well.

Filtering works like this: Each gene is evaluated based on the filtering criteria and either it is eliminated from the analysis or it is retained. For each gene, the filter finds the maximum and minimum expression values and calculates a difference and a ratio:

difference = maximum - minimum

ratio = maximum / minimum

If either of these values is less than the values entered in the filtering options, the gene is rejected. In the analyses published in the PNAS paper, we filtered with maximum = 100 and ratio = 2.
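The filtering rule can be sketched in Python as follows. The function name, parameter names, and the treatment of non-positive minima are assumptions made for illustration, and the threshold values in the usage lines are examples only.

```python
def passes_filter(values, min_difference, min_ratio):
    """Return True if a gene's expression values survive the filter."""
    highest = max(values)
    lowest = min(values)
    difference = highest - lowest
    # Assumption: a non-positive minimum makes the ratio undefined, so it is
    # treated as infinitely large and only the difference criterion decides.
    ratio = highest / lowest if lowest > 0 else float("inf")
    # The gene is rejected if either value falls below its threshold.
    return difference >= min_difference and ratio >= min_ratio


# Example thresholds (illustrative): difference of 100, ratio of 2.
print(passes_filter([50, 400, 120], 100, 2))   # large change: retained
print(passes_filter([500, 560, 530], 100, 2))  # nearly flat: rejected
```

A gene must pass both criteria to be retained, so a gene with a large absolute change but a small fold change (for example 300 versus 450) is still rejected.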

 

Normalization methods

To compare the expression pattern of one gene to another, the expression data for each gene are normalized. There are three normalization methods available. We used standardization to explore ploidy-dependent gene expression. We devised the fractionalization method to explore mating-type-specific gene expression; this method optimizes the discovery of binary (on or off) expression patterns. We have set up this site with file-specific default normalization methods. The default is standardization. Also, when viewing the results of a query, the default is to show the un-normalized data. There is a checkbox to see the normalized data in the output.

The standardization method transforms the data for each gene to have mean=0 and standard deviation=1. Standardized data represent standard deviations from the mean.

The fractionalization method transforms the data for each gene to have sum=1. The data are summed, and each datum is divided by that sum. Fractionalized data range from 0 to 1.

In the log2 method, the data are logarithmically transformed and then the mean for each gene is set to 0. log2 data represent a linear index of fold change with symmetry about 0.
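The three normalization methods can be sketched in Python as follows. The function names are hypothetical, and several details are assumptions not stated in this guide: the sample standard deviation is used for standardization, the fractionalization sum is assumed to be nonzero, and log2 is applied only to positive values.

```python
import math
import statistics


def standardize(values):
    """Shift and scale one gene's data to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def fractionalize(values):
    """Divide each datum by the sum so that the data add up to 1."""
    total = sum(values)
    return [v / total for v in values]


def log2_center(values):
    """Take log2 of each datum, then subtract the mean of the logs."""
    logs = [math.log2(v) for v in values]
    mean = statistics.fmean(logs)
    return [x - mean for x in logs]


profile = [100.0, 200.0, 400.0, 100.0]  # made-up expression levels
print(standardize(profile))
print(fractionalize(profile))
print(log2_center(profile))
```

Multiplying every value in a profile by the same positive constant leaves the output of each function unchanged (up to rounding), which is why only the shape of an input profile matters.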

 

Distance metrics

To determine how near or far a given gene is from a candidate gene or input profile, one must use a distance metric. This site offers four options. The default is the Pearson metric.

The Pearson metric is the standard Pearson correlation coefficient.

The Fisher metric is a transform of the Pearson coefficient. Because the transform preserves the ordering of the coefficients, it gives the same list of genes as the Pearson metric. The advantage of the Fisher transform is that the transformed coefficients are approximately normally distributed, which allows hypothesis testing and significance testing.

The Euclidean metric is simply the geometric distance (the square root of the sum of the squared differences) from one gene to another when one considers each gene as a point in n-dimensional space, where n is the number of experiments (data dimensions for each gene).

The w_Euclidean metric is a modification of the Euclidean. Each squared difference is divided by the square root of the absolute value of the product.
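The Pearson, Fisher, and Euclidean metrics can be sketched in Python as follows. The function names, the gene names, and the expression values are hypothetical. The w_Euclidean metric is left out because the description above does not spell out which product is meant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fisher(x, y):
    """Fisher transform of the Pearson coefficient (inverse hyperbolic tangent).

    math.atanh raises ValueError when the coefficient is exactly 1 or -1.
    """
    return math.atanh(pearson(x, y))


def euclidean(x, y):
    """Square root of the sum of the squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_neighbors(query, profiles, k=3):
    """Return the k gene names most correlated with the query, best first."""
    ranked = sorted(
        profiles,
        key=lambda gene: pearson(query, profiles[gene]),
        reverse=True,
    )
    return ranked[:k]


# Input profile: "1" for wild-type, "0" for the tec1, ste12, and ste7 mutants.
query = [1, 0, 0, 0]
profiles = {  # made-up data
    "geneA": [900, 100, 120, 90],
    "geneB": [200, 210, 190, 205],
    "geneC": [100, 800, 850, 790],
}
print(nearest_neighbors(query, profiles, k=2))
```

Here geneA, whose expression drops in all three mutants, ranks first, and geneC, which shows the opposite pattern, ranks last. For the Euclidean metric, a smaller value means a nearer neighbor, so a ranking based on it would sort in ascending order instead.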


Boolean Query

 

About the data files

There are two data files available for searching. One corresponds to haploid strains grown under rich-medium conditions (synthetic complete medium, to an optical density at 600 nm of 1.0); the other corresponds to diploid strains grown under nitrogen-starvation conditions (SLAD medium) for 4 hours.

Constructing a query: operators and factors

To construct a Boolean query, you are presented with a matrix in which every experiment intersects every other experiment. At any intersection, you may specify an expression consisting of an operator (from the pull-down menu) and a factor (in the text box). The expression is read like this:

row experiment   operator   factor   column experiment

For example, selecting the ">" operator and the factor "10" in the upper-left intersection specifies the expression MATalpha > 10 × MATa. Entering this query finds all genes for which this expression is true; that is, all genes whose expression in the MATalpha haploid is more than 10 times that in the MATa haploid. Multiple expressions may also be specified. Only genes for which all expressions are true will satisfy the query (the matrix is a Boolean AND matrix).

There are 7 operators available:

!= "not equal"

< "less than"

<= "less than or equal"

== "equal"

> "greater than"

>= "greater than or equal"

[ ] "within a factor of"
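The way a Boolean query is evaluated can be sketched in Python as follows. The function names, the data layout, and the expression values are hypothetical. In particular, the reading of "within a factor of" as a two-sided range is an assumption, since this guide does not define that operator precisely.

```python
import operator

# The six comparison operators map directly onto Python's operator module.
OPERATORS = {
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
}


def expression_holds(row_value, op, factor, column_value):
    """Evaluate `row op (factor * column)` for one gene."""
    if op == "[ ]":
        # Assumption: "within a factor of" means the row value lies between
        # column / factor and column * factor (factor >= 1, positive data).
        return column_value / factor <= row_value <= column_value * factor
    return OPERATORS[op](row_value, factor * column_value)


def boolean_query(genes, expressions):
    """Return the genes for which every expression is true (Boolean AND)."""
    return [
        name
        for name, levels in genes.items()
        if all(
            expression_holds(levels[row], op, factor, levels[col])
            for row, op, factor, col in expressions
        )
    ]


genes = {  # made-up expression levels
    "geneA": {"MATalpha": 1200, "MATa": 50},
    "geneB": {"MATalpha": 300, "MATa": 280},
}
# MATalpha > 10 x MATa
print(boolean_query(genes, [("MATalpha", ">", 10, "MATa")]))
```

Adding a second tuple to the list of expressions narrows the result further, because a gene is returned only when every expression holds.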

 

Filtering the data

The expression data are derived from the hybridization intensities of 20 perfect-match and mismatch oligonucleotide probe pairs. Each expression level is a trimmed average difference (perfect-match minus mismatch). Lower average differences (lower hybridization signals) show more variability than moderate or high average differences. Using filtering, one can eliminate genes showing consistently low average differences from an analysis. In addition, many genes show little or no change in expression level across the data set; filtering can eliminate these genes from consideration as well.

Filtering works like this: Each gene is evaluated based on the filtering criteria and either it is eliminated from the analysis or it is retained. For each gene, the filter finds the maximum and minimum expression values and calculates a difference and a ratio:

difference = maximum - minimum

ratio = maximum / minimum

If either of these values is less than the values entered in the filtering options, the gene is rejected. In the analyses published in the PNAS paper, we filtered with maximum = 100 and ratio = 2.

 

Sorting the output

There are three methods available to sort the genes that will appear in the output of a Boolean query. The "maximum overall fold difference" method sorts the genes by the ratio of the maximum expression value to the minimum expression value. The "fold difference between" method allows you to select specific experiments (from the pull-down menus) to determine the sorting ratio. Lastly, the "expression in experiment" method sorts the genes by their expression level in a user-specified experiment.
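The three sorting methods can be sketched in Python as follows. The function names and the expression values are hypothetical, and two details are assumptions: genes are listed from the largest sort value to the smallest, and all expression values are positive so that the ratios are defined.

```python
def max_fold_difference(levels):
    """Ratio of the maximum to the minimum expression value of one gene."""
    values = list(levels.values())
    return max(values) / min(values)


def fold_difference_between(numerator, denominator):
    """Build a sort key: ratio between two user-selected experiments."""
    return lambda levels: levels[numerator] / levels[denominator]


def expression_in(experiment):
    """Build a sort key: expression level in one user-selected experiment."""
    return lambda levels: levels[experiment]


def sort_genes(genes, key):
    """Return gene names ordered by `key`, largest value first (assumed)."""
    return sorted(genes, key=lambda name: key(genes[name]), reverse=True)


genes = {  # made-up expression levels
    "geneA": {"wild type": 800, "tec1": 100, "ste12": 200},
    "geneB": {"wild type": 300, "tec1": 250, "ste12": 900},
    "geneC": {"wild type": 500, "tec1": 450, "ste12": 400},
}
print(sort_genes(genes, max_fold_difference))
print(sort_genes(genes, fold_difference_between("wild type", "ste12")))
print(sort_genes(genes, expression_in("ste12")))
```

The same set of genes comes back in a different order under each method, which is the point of offering a choice: the sort determines which genes appear at the top of the output, not which genes answer the query.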