

Title:
A METHOD OF DETECTING DEFECTS IN DNA MICROARRAY DATA
Document Type and Number:
WIPO Patent Application WO/2008/062855
Kind Code:
A1
Abstract:
The present invention seeks to preclude problems caused by uneven hybridization and dust contamination. A difference value between each cell value of DNA microarray data and each corresponding standard value of standard data is obtained. A pseudo image is obtained by replacing each cell value of the DNA microarray data with the difference value. A window corresponding to a predetermined number of cells in the pseudo image is provided. A median value for each window is calculated while sequentially moving the window over the pseudo image to obtain a representative value set of the windows. One or more windows whose index value exceeds a critical value are detected. Cells of detected windows are cancelled.

Inventors:
KONISHI TOMOKAZU (JP)
Application Number:
PCT/JP2007/072605
Publication Date:
May 29, 2008
Filing Date:
November 15, 2007
Assignee:
AKITA PREFECTURAL UNIVERSITY (JP)
KONISHI TOMOKAZU (JP)
International Classes:
C12N15/09; C12Q1/68; G16B25/00
Domestic Patent References:
WO2006030822A1 2006-03-23
Other References:
KONISHI T.: "Detection and restoration of hybridization problems in Affymetrix GeneChip data by parametric scanning", GENOME INFORMATICS, vol. 17, December 2006 (2006-12-01), pages 100 - 109
Attorney, Agent or Firm:
INABA, Shigeru (INABA & INAMI, Hanabishi Imas Hirakawacho Building, 4th Floor, 3-11, Hirakawacho 2-chome, Chiyoda-k, Tokyo 93, JP)
Claims:

1. A method of detecting a defect in DNA microarray data comprising: providing target DNA microarray data that are a set of cell values obtained from a DNA microarray; providing standard data that are a set of standard values, each standard value corresponding to each cell value of said DNA microarray data; obtaining a difference value between each cell value of said DNA microarray data and each standard value of said standard data; obtaining a pseudo image by replacing said each cell value of said DNA microarray data with said difference value; calculating a representative value for a small region corresponding to a predetermined number of cells in said pseudo image based on said difference values of said predetermined number of cells, repeating said calculating by moving said small region over said pseudo image cell by cell to obtain a set of representative values for the small regions; and detecting one or more small regions having an outlying representative value based on a distribution of said set of representative values in comparison with an expected normal distribution of said set of representative values wherein said detected one or more small regions include cells having defective cell values.

2. The method of claim 1, wherein said target DNA microarray data and said standard data are normalized.

3. The method of claim 1 or 2, wherein said cell values and said standard values are logarithmic values.

4. The method of claim 1, wherein said standard value is a representative value for each cell obtained from a plurality of normalized DNA microarray data.

5. The method of claim 4, wherein said plurality of normalized DNA microarray data are obtained by the same type of DNA microarray as the target DNA microarray data.

6. The method of claim 4 or 5, wherein said plurality of normalized DNA microarray data for the standard data are obtained based on the same tissue.

7. The method of claim 4 or 5, wherein said plurality of normalized DNA microarray data for the standard data are obtained based on a plurality of different tissues.

8. The method of any one of claims 4 to 7, wherein said representative value for each cell is a measure of central tendency.

9. The method of claim 8, wherein said measure of central tendency is trimmed mean, median, or weighted mean.

10. The method of any one of claims 1 to 9, wherein the size of said small region is between 3 x 3 cells and 10 x 10 cells.

11. The method of any one of claims 1 to 10, wherein said representative value for a small region is a measure of central tendency.

12. The method of claim 11, wherein said measure of central tendency is median, trimmed mean, or weighted mean.

13. The method of claim 1, wherein said detecting comprises: normalizing said set of representative values for the small regions to obtain a set of indices; and detecting one or more small regions whose index value exceeds a critical value that is predetermined according to an expected normal distribution of said set of indices.

14. The method of claim 13, wherein said indices and said critical value are z-scores.

15. The method of any one of claims 1 to 14 further comprising: cancelling cell values for cells belonging to said detected one or more small regions.

16. A computer program for causing a computer to execute the steps as claimed in any one of claims 1 to 15.

17. A method of detecting and removing a defect in DNA microarray data comprising: providing target DNA microarray data that are a set of cell values obtained from a DNA microarray; providing standard data that are a set of standard values, each standard value corresponding to each cell value of said DNA microarray data; obtaining a difference value between each cell value of said DNA microarray data and each standard value of said standard data; obtaining a pseudo image by replacing said each cell value of said DNA microarray data with said difference value; calculating a representative value for a small region corresponding to a predetermined number of cells in said pseudo image based on said difference values of said predetermined number of cells, repeating said calculating by moving said small region over said pseudo image cell by cell to obtain a set of representative values for the small regions; detecting one or more small regions having an outlying representative value based on a distribution for said set of representative values in comparison with an expected normal distribution of said set of representative values wherein said detected one or more small regions include cells having defective cell values; and cancelling cell values for all cells belonging to said detected one or more small regions.

Description:

DESCRIPTION

A METHOD OF DETECTING DEFECTS IN DNA MICROARRAY DATA

FIELD OF THE INVENTION

The present invention relates to a method of detecting hybridization problems in DNA microarray data.

BACKGROUND OF THE INVENTION

Hybridization is the basis of microarray analysis and, while widely used, is not free from technical problems. For example, some hybridizations form a doughnut-like geometric pattern around the center of chip images. Such patterns often result in reduced signals from certain areas of the chip, appearing similar to surface scratching that may be attributed to the entrainment of dust. Although analytical programs that identify such problems have been proposed, the methods are destructive, resulting in the total cancellation of the array chip data when large defects are present. The dChip package implements several automated algorithms for recognizing and removing outliers during model-based data normalization. The algorithms find patterns in the responses among perfect match (PM) and mismatch (MM) probes for each gene, and cells and probe sets that disagree with the resultant patterns are identified as outliers. However, this approach is based on a series of mathematical models that are derived from a very simplified view of both biological fundamentals and the composition of the data. Furthermore, the appropriateness of the models and the calculation methods are difficult to check rigorously as there is no objective indicator for how well the models, which inevitably contain parameters for handling noise, describe the experimental system.

One of the reasons why the recognition of hybridization flaws remains ad hoc is that such problems, even if occupying a large proportion of the chip area, are believed to be harmless to the signal or the scaled probe value, which reflect the transcript level. Furthermore, in a GeneChip®, a transcript is measured by approximately ten pairs of adjacent PM and MM cells, with pairs dispersed across the chip. Thus, a failure will simultaneously ruin both the PM and MM probe of the relevant pair, but will not ruin more than one probe pair for a gene. The signal is found by several calculation algorithms based on different philosophies, and most pay attention to outliers caused by such probe failures. For example, Affymetrix MAS5 finds the signal as a weighted mean among probe pairs, while RMA finds the signal by a median polish of PM values.

It is desirable, however, that problems should be recognized and removed from data prior to analysis in order to prevent loss of accuracy in signal data. Weighted means and medians are robust only if the outliers occur in both directions (i.e., positive and negative) at the same frequency. This is rarely the case in practice, as problems often produce outliers that reflect the cause. For example, bright spots will appear if the problem is caused by fluorescent material, while dark spots will appear if the chip surface has been damaged. These types of defects will affect the results by breaking the robustness of calculations. Such defects also have a direct effect on analyses when the target is not the gene signal but the cell data, as in the case of analyzing processing variants of mRNA. Microarray preparation problems thus present a barrier to progress in advanced analyses of GeneChip® data.

SUMMARY OF THE INVENTION

As described above, microarray data often include problems caused by uneven hybridization and dust contamination. Such problems should be removed prior to analysis to prevent degradation of analytical accuracy and false positive results.

The present invention seeks a method that finds such problems as local tendencies of cell data when each array is compared with an ideal standard of hybridization. Cells at the identified problem locations are cancelled before data normalization. The cancellations will not affect the original distribution of the array data, since the cancellations are independent of signal intensities. Consequently, the remaining data can be used for analyses.

According to one aspect of the present invention, a method of detecting defects in DNA microarray data comprises: providing target DNA microarray data that are a set of cell values obtained from a DNA microarray; providing standard data that are a set of standard values, each standard value corresponding to each cell value of the DNA microarray data; obtaining a difference value between each cell value of the DNA microarray data and each standard value of the standard data; obtaining a pseudo image by replacing each cell value of the DNA microarray data with the difference value; calculating a representative value for a small region corresponding to a predetermined number of cells in the pseudo image based on the difference values of the predetermined number of cells, repeating the calculating by moving the small region over the pseudo image cell by cell to obtain a set of representative values for the small regions; and detecting one or more small regions having an outlying representative value based on a distribution of the set of representative values in comparison with an expected normal distribution of the set of representative values wherein the detected one or more small regions include cells having defective cell values.

In one preferable aspect, the target DNA microarray data and the standard data are normalized. Specifically, the cell values and the standard values are logarithmic values.

In one preferable aspect, the standard value is a representative value for each cell obtained from a plurality of normalized DNA microarray data. In one aspect, the representative value for each cell is a measure of central tendency including mean, median and mode. In preferable examples, the measure of central tendency is trimmed mean, median, or weighted mean.

A plurality of normalized DNA microarray data are obtained by the same type of DNA microarray as the target DNA microarray data. The number of DNA microarray data may be six to ten sets of DNA microarray data, for example. In one aspect, the DNA microarray data sets for the standard data are obtained based on the same tissue. In one aspect, the DNA microarray data sets for the standard data are obtained based on a plurality of different tissues. In the latter, preferably, a number of DNA microarray data for a variety of different tissues are prepared. In one aspect, the size of the window, namely, the size of the small region, is between 3 x 3 cells and 10 x 10 cells. In one preferable example, the size of the window or the size of the small region is 5 x 5 cells. In one aspect, the representative value for a window is a measure of central tendency of difference values for each cell. Specifically, the measure of central tendency is median, trimmed mean, or weighted mean. In one preferable aspect, the set of representative values for the small regions is normalized to obtain a set of indices; and one or more small regions whose index value exceeds a critical value that is predetermined according to an expected normal distribution of the set of indices are detected.

In one aspect, both the indices and the critical value are z-scores. In one aspect, cell values for cells belonging to the detected one or more windows are cancelled.

The present invention relates to a computer program for causing a computer to execute the above described steps for detecting a defect in DNA microarray data.

The present invention relates to a computer program for causing a computer to execute the above described steps for detecting and removing a defect in DNA microarray data.

The present invention relates to a computer-readable medium for storing the above described programs.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows a histogram of standard deviations for medians of moving windows. The mode is 0.31, larger than the expected value of 0.25.

Figure 2 shows coincidences between two sets of ideal standards for leaves analyzed by two different laboratories.

Figures 3A, 3B, and 3C show distributions of differences between hybridizations and standards. The straight line at y = x denotes the normal distribution. Data are denser at the center of the plots. Only 2.3%, 0.1%, and 0.003% of data have z-scores greater than 2, 3 and 4, respectively.

Figures 4A, 4B, and 4C show distributions of index values.

Figure 5 shows reproducibility in repeated experiments. A combination of experiments is shown in each row. Left: original data. Center: remaining data. Right: cancelled data. PM data (n = 10000) randomly selected from the indicated pairs of arrays are shown. The expectation value for the cancellation was 2 windows.

Figure 6 shows numbers of cancelled cells at expectations of 2 and 20 windows (50 and 500 cells, respectively).

Figure 7 shows standard deviations of differences among cell data in reproducibility measurements.

Figure 8 shows reproducibility in data treated by the dChip package. Results using the PM-only models are shown. The corresponding original data are presented in Fig. 5 (left). Left: remaining data. Right: cancelled data. PM data (n = 10000) randomly selected from the indicated pairs of arrays are shown.

Figure 9 shows positions of cancelled windows in a chip. Four typical examples at the indicated expectations are shown. Upper left: hybridization with relatively small numbers of cancellations. Upper right: uneven hybridization. Lower left: regular shapes with straight boundaries. Lower right: clusters at symmetric positions.

Figure 10 shows generation of standard data.

Figure 11 shows obtaining of a difference value between DNA microarray data and standard data.

Figure 12 shows scanning of window over a pseudo image.

Figure 13 shows a window of the present invention provided in a pseudo image.

Figure 14 shows a flow chart explaining the present invention.

DETAILED DESCRIPTION

A. General Description of the Present Invention

A method of detecting and removing problems caused by uneven hybridization and dust contamination will be explained referring to Figs. 10 to 14.

The hardware configuration for executing the method of the present invention comprises a computer apparatus (not shown) including an input device, an output device, a display device, a storage device that may include a hard disc drive, a memory device, a computer-readable medium, or any other storage means, and a processor. Various data for the present invention including measured data and calculated data are stored in the memory device. Various calculations are conducted by the processor. Optionally, various data including measured data and calculated data may be displayed on the display device in various forms.

Target DNA microarray data are provided (Fig. 14 S1). The target DNA microarray data are a set of cell values. The target DNA microarray data are originally obtained as a set of signal intensities of probe cells of a DNA microarray. In one preferable aspect, each cell value is a normalized logarithmic value (z-score) that is obtained by taking the logarithm and z-normalizing the logarithmic value. The median-based normalization can be used. The target DNA microarray data are stored in the storage device.

Standard data are provided. The standard data are a set of standard values. Each standard value corresponds to each probe cell of the DNA microarray. The standard data are hypothetical data or reference data and a set of standard values are typically obtained by calculation results. Ideally, the standard data are a set of expected hypothetical values that are the most average or most probable values. Typically, the standard data set is prepared as a z-score so as to correspond to z-normalized cell values of target DNA microarray data. The standard data are stored in the storage device.

Referring to Fig. 10, in one aspect, the standard data are obtained from a plurality of normalized data sets (e.g. 6 to 10 sets) obtained by the same type of target DNA microarray. Each standard value is obtained by calculating a representative value of the values for each cell across the plurality of normalized data sets. Typically, the representative value is a measure of central tendency including mean, median and mode. In preferable aspects, the representative value may be trimmed mean, median or weighted mean. In one preferable aspect, the standard data are found from a plurality of normalized DNA microarray data sets for the same tissue as used for the target DNA microarray data. The tissue used for the standard data set is not limited to the same tissue. Alternatively, the standard data are found from a representative value (trimmed mean, median, weighted mean, for example) for a plurality of normalized DNA microarray data sets for a variety of different tissues. The median-based normalization can be used. If the target DNA microarray data are GeneChip® data, the standard data are preferably obtained from a plurality of GeneChip® data sets. If a perfect DNA microarray data set with no defects or errors exists, that single data set can be used as the standard data.
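A minimal sketch of how such a standard could be computed is given below, assuming the normalized data sets are held as equal-length Python lists in probe-cell order. The function names trimmed_mean and build_standard and the trimming fraction of 0.2 are illustrative assumptions and are not taken from this description.

```python
from statistics import fmean


def trimmed_mean(values, trim_fraction=0.2):
    """Mean after dropping the lowest and highest trim_fraction of the values.

    The default trim_fraction of 0.2 is an assumption chosen for illustration.
    """
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped at each end
    return fmean(ordered[k:len(ordered) - k])


def build_standard(arrays, trim_fraction=0.2):
    """Per-cell trimmed mean across several normalized data sets (one list per chip)."""
    n_cells = len(arrays[0])
    if any(len(a) != n_cells for a in arrays):
        raise ValueError("all data sets must have the same number of cells")
    return [trimmed_mean([a[i] for a in arrays], trim_fraction)
            for i in range(n_cells)]


# Six hypothetical normalized data sets of three cells each. The last one
# carries an artifact in cell 0, which the trimming discards.
arrays = [
    [0.1, -1.0, 2.0],
    [0.2, -1.1, 2.1],
    [0.0, -0.9, 1.9],
    [0.1, -1.0, 2.0],
    [0.2, -1.0, 2.0],
    [5.0, -1.0, 2.0],
]
standard = build_standard(arrays)
```

With six data sets and a trimming fraction of 0.2, one value is dropped at each end for every cell, so a single aberrant chip does not shift the standard value of that cell.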

The standard data are not limited to those obtained based on actual measurement. Each standard value of the standard data may be the same value. Each standard value of the standard data may be zero. In this case, a difference value and each cell value for the target DNA microarray data are equivalent. Further, the standard value may be a set of pseudo-random numbers with small variance.

Normalization techniques for the target DNA microarray data and the standard data are not restricted to the median-based method. Rather, it is possible to use other methods that make the normalized data (typically z-scores) mutually comparable. For example, the three-parameter method (Konishi, T., Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment. BMC Bioinformatics, 5:5, 2004, which is incorporated herein by reference) can be used for normalization. It may be possible to disregard a background value for the three-parameter method at this stage. Also, other normalization techniques known to the person skilled in the art can be used.

Referring to Fig. 11, a difference value between each cell value of the target DNA microarray data and each standard value of the standard data is obtained (Fig. 14 S2). At this stage, cell values with a large difference value cannot be cancelled, because a large difference value may be a biologically significant value. A difference value may include a ratio of each cell value to the standard value. The difference values are calculated by the processor and are stored in the storage device. Each difference value corresponds to each cell of the DNA microarray.

A pseudo image is obtained by replacing each cell value of the DNA microarray data with the difference value (Fig. 14 S3). Namely, if the DNA microarray is comprised of M x N probe cells, also known as spots, the pseudo image is also comprised of M x N cells, each of which has a respective difference value. As shown in Fig. 13, the pseudo image is an image where each cell has a difference value Δz_1, Δz_2, Δz_3, Δz_4, Δz_5, .... Displaying the pseudo image on the display device is optional. A window of a predetermined size corresponding to a predetermined number of cells of a small region in the pseudo image is provided. Referring to Fig. 12, the window moves over the pseudo image cell by cell while a representative value for each window is sequentially calculated based on the difference values of the predetermined number of cells, thereby obtaining a representative value set of the windows (Fig. 14 S4). Fig. 12 merely shows a schematic representation of the moving window, and the moving (scanning) direction of the window may be any direction, including a horizontal direction and a vertical direction on the pseudo image. Referring to Fig. 13, a window W sequentially moves horizontally on the pseudo image cell by cell (W_t, W_{t+1}, W_{t+2}, ...). The representative values for windows are calculated by the processor and are stored in the storage device.

The window algorithm or operation of the present invention itself is similar to a neighborhood or local operation of image processing. Namely, each cell C in the pseudo image is visited and a representative value is calculated for a neighborhood (a small region) in the pseudo image including the cell C. However, according to the window operation of the present invention, unlike the neighborhood operation of image processing, it is not necessary to renew a value of cell C with the representative value. According to the present invention, an aim of the window operation is to obtain a value for representing a small region in the pseudo image. Also, according to the present invention, the cell C is not necessarily centered on the window or small region. The position of the cell C in each small region can be any predetermined position in the small region.

A representative value for the window, namely a small region of a predetermined number of cells, is a measure of central tendency including mean, median and mode. In a preferable aspect, the representative values may be median, trimmed mean, or weighted mean. In one aspect, the size of the small region (window) is between 3 x 3 cells and 10 x 10 cells. In one preferable aspect, the size of the small region (window) is 5 x 5 cells. In Fig. 13, the window W corresponds to a small region of 5 x 5 cells. A representative value (median, for example) for 25 cell values of the pseudo image is obtained and the obtained value represents the window (small region) of interest. The representative value for a small region (5 x 5 cells, for example) is calculated while moving the window W over the pseudo image cell by cell. The obtained representative value for each window (small region) is stored in the memory device. The calculation of the representative value for the window does not require actual display of the pseudo image on the display device.
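The window scan described above can be sketched as follows, assuming the pseudo image is held as a list of rows of difference values. The function name window_medians, the choice of the top-left cell as the reference position of each window, and the restriction to windows lying fully inside the image are assumptions made for this illustration.

```python
from statistics import median


def window_medians(pseudo_image, size=5):
    """Median of every size x size window, moved over the image one cell at a time.

    Returns a dict mapping the top-left (row, col) of each window to its median.
    Only windows lying fully inside the image are evaluated (an assumption).
    """
    n_rows, n_cols = len(pseudo_image), len(pseudo_image[0])
    medians = {}
    for r in range(n_rows - size + 1):
        for c in range(n_cols - size + 1):
            cells = [pseudo_image[r + dr][c + dc]
                     for dr in range(size) for dc in range(size)]
            medians[(r, c)] = median(cells)
    return medians


# Hypothetical 8 x 8 pseudo image of zero differences with a dark 5 x 5 patch
# (difference -3.0) covering rows 2 to 6 and columns 1 to 5.
image = [[0.0] * 8 for _ in range(8)]
for r in range(2, 7):
    for c in range(1, 6):
        image[r][c] = -3.0

medians = window_medians(image, size=5)
```

In this toy image the window whose top-left cell is (2, 1) lies entirely on the patch and has a median of -3.0, while the window at (0, 0) overlaps only 12 affected cells out of 25 and keeps a median of 0.0. This illustrates why the median responds to defects that dominate a window but not to isolated large differences.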

Here, because probe cells of a DNA microarray are randomly arranged without biological significance, even between neighboring probe cells, the representative values for windows (small regions of a predetermined number of cells) must be normally distributed according to the central limit theorem.

The representative value sets are normalized to obtain z-scores of the representative values so as to compare them with a critical value (Fig. 14 S5). The z-scores of the representative values are indices with which a predetermined critical value or cut-off value (prepared as a z-score) is compared. When normalizing the representative value set, it is necessary to obtain a width of distribution for the representative value sets. The width can be obtained indirectly from a width of distribution for the difference value set. The width of distribution for the difference value set is obtained, and the obtained width is then corrected with a compensation factor. The compensation factor may vary depending on the representative value. For example, if the representative value is the mean, the compensation factor may be 1/√n, where n is the number of cells in the window. The compensation factor may be obtained using a simulation such as the Monte Carlo method. The width of distribution for the representative value sets can also be obtained directly from its distribution. For example, IQR (Interquartile Range) or MAD (Mean Absolute Deviation) can be used as the width for the distribution. The width can be obtained from a slope of linear regression to approximate Q-Q plots for the representative value set. The width can be predetermined based on various actual measurements. Specifically, a set of widths obtained from various measurements is prepared and a mode for the set is used as a predetermined width. The standard deviation may be used for normalizing the representative value set.
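As an illustration of obtaining the compensation factor by simulation, the following sketch draws windows of independent standard normal difference values and measures the spread of the chosen representative value. The function name compensation_factor, the number of trials, and the random seed are assumptions made for this example; real difference values are not perfectly normal, so the factors found in practice may differ.

```python
import random
from statistics import fmean, median, stdev


def compensation_factor(statistic, n_cells=25, n_trials=10000, seed=0):
    """Monte Carlo estimate of the spread of a window statistic.

    Each trial fills a window with n_cells independent standard normal values
    (an idealization) and records the statistic. The standard deviation of
    the recorded values is the factor relating the width of the cell-level
    distribution to the width of the window-level distribution.
    """
    rng = random.Random(seed)
    values = [statistic([rng.gauss(0.0, 1.0) for _ in range(n_cells)])
              for _ in range(n_trials)]
    return stdev(values)


factor_mean = compensation_factor(fmean)     # close to 1/sqrt(25) = 0.2
factor_median = compensation_factor(median)  # somewhat larger than for the mean
```

For the mean of 25 cells the estimate should land near 0.2. For the median it comes out larger, which agrees with the large-sample approximation sqrt(pi/2)/sqrt(n), about 0.25 for n = 25.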

One or more windows including possible defect cell values are detected by comparing each index value with a critical value that is predetermined according to an expected normal distribution of the indices. A window whose index value exceeds a predetermined critical value is regarded as a small region including defect cell values (Fig. 14 S6). The critical value is predetermined by an operator as a z-score according to the normal distribution. For example, if it is wished to cancel two windows from an ideal normal distribution, a z-score of 4.61 may be predetermined. In practice, however, a critical value of 4.61 generally detects more than two windows, because indices of windows affected by defects do not obey the normal distribution.
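The relation between an expected number of windows and a critical z-score can be sketched as below. The function name critical_z is hypothetical. The total of 500,000 windows (suggested by the "half a million results" mentioned later in this description) and the two-sided treatment of the tails are assumptions; under them the result is close to the value of 4.61 quoted above.

```python
from statistics import NormalDist


def critical_z(expected_windows, total_windows, two_sided=True):
    """z-score beyond which expected_windows indices are expected by chance.

    Assumes the indices follow a standard normal distribution. With
    two_sided=True the expected count is split between both tails.
    """
    tail = expected_windows / total_windows
    if two_sided:
        tail /= 2
    return NormalDist().inv_cdf(1.0 - tail)


# Assumed chip size of about half a million windows.
z_for_2_windows = critical_z(2, 500_000)
z_for_20_windows = critical_z(20, 500_000)
```

A larger expected number of windows gives a lower critical value, so z_for_20_windows is smaller than z_for_2_windows and the screening becomes less strict.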

All cell values of the detected one or more windows are cancelled or rejected (Fig. 14 S7). For example, if the window corresponds to 25 cells (5 x 5), twenty-five cell values are cancelled per window. If two detected windows are separated, fifty cell values are cancelled. If two detected windows overlap by 5 x 4 cells, the cancelled area is 5 x 6 cells. Resultant data after cancellation and/or cancelled data may be displayed on the display device in various forms including a graph and an image of arrayed probe cells, for example. Alternatively, the cell values of the detected one or more windows are corrected instead of being cancelled.
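The cell counts in the preceding paragraph follow from taking the union of the cells covered by the detected windows, as the short sketch below shows. The function names cancelled_cells and apply_cancellation, the identification of a window by its top-left cell, and the use of None to mark a cancelled value are assumptions made for illustration.

```python
def cancelled_cells(detected_windows, size=5):
    """Union of the cells covered by detected windows.

    Each window is given by its top-left (row, col); cells shared by
    overlapping windows are counted once.
    """
    cells = set()
    for r, c in detected_windows:
        for dr in range(size):
            for dc in range(size):
                cells.add((r + dr, c + dc))
    return cells


def apply_cancellation(data, cells):
    """Copy of the cell-value grid with cancelled cells replaced by None."""
    return [[None if (r, c) in cells else value
             for c, value in enumerate(row)]
            for r, row in enumerate(data)]


separated = cancelled_cells([(0, 0), (10, 10)])   # two windows far apart
overlapping = cancelled_cells([(0, 0), (0, 1)])   # windows sharing 5 x 4 cells

grid = [[1.0] * 6 for _ in range(5)]
cleaned = apply_cancellation(grid, overlapping)
```

Here the separated windows cover 50 cells and the overlapping pair covers 30 cells, that is, an area of 5 x 6 cells.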

According to the typical embodiment of the present invention, a window of predetermined size is used. However, in some embodiments, predetermined different sized windows may be used. For example, a window corresponding to 3 x 3 cells and another window corresponding to 7 x 7 cells are used and results are combined. Namely, windows are examined based on two sets of representative values. Detected windows are compared between the results for the different sized windows and overlapping cells may be cancelled. For example, if detected windows are completely overlapped, cell data for 3 x 3 cells are cancelled.

The following section explains an algorithm that finds and removes such problems. The problems are distinguished from biological effects by means of data distribution. The algorithm is based on several verifiable assumptions whose appropriateness is tested with GeneChip® data as a non-limiting example in the Results section. The validity of the algorithm and the effects of data cancellation are tested using GeneChip® data obtained from a series of experiments. The algorithm is demonstrated to greatly improve the reproducibility of measurements while removing only a small number of faultless data.

B. Specific Methods and Experiments

B-I Algorithm

The proposed method, hereinafter referred to as a parametric scanning algorithm, for identifying microarray problems is as follows.

The present invention provides a parameter-scanning algorithm to detect such defects on the basis of the character of data distributions. The cell data is thoroughly scanned using a window algorithm, and windows with an index value greater than a critical value, also known as a threshold, are recognized as defects and removed from the array data. The index is found from the differences between the target and an ideal standard of hybridization obtained as a trimmed mean among experiments, representing the statistical center of differences in each section. The threshold is derived as a screening level designated by the operator, but has only limited effect on the effectiveness of data cancellation.

A standard, ideal array is selected, and indices representing the size of distinct regions in each chip are determined. Regions with indices larger than a threshold value in reference to the standard are recognized as problem areas.

The standard is found as a set of trimmed means among hybridizations. The experiments are simply normalized by dividing by the respective median values (including both PM and MM cells) and taking logarithms. The trimmed means of data for each cell in the array are calculated, and the resulting set of means is adopted as the ideal standard of hybridization. If the means are calculated using a sufficiently large number of array data, the values can be considered stable and suitable for a standard. No particular distributions are expected in the ideal standard.
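The simple normalization described above can be sketched as follows, assuming strictly positive raw intensities held in a Python list. The function name simple_normalize and the choice of base-10 logarithms are assumptions, since the base is not stated here.

```python
import math
from statistics import median


def simple_normalize(intensities):
    """Divide each intensity by the array median, then take the logarithm.

    Assumes strictly positive intensities. Base 10 is an assumption made
    for this sketch.
    """
    center = median(intensities)
    return [math.log10(x / center) for x in intensities]


# Hypothetical raw intensities of five cells (PM and MM together).
raw = [50.0, 100.0, 200.0, 400.0, 800.0]
normalized = simple_normalize(raw)
```

After this step the cell at the array median maps to zero, and a cell four times brighter than the median maps to log10(4), so arrays hybridized at different overall intensities become comparable before the per-cell trimmed means are taken.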

Differences between simply normalized array data and the standard are then found for each cell. These differences may represent both biological responses and experimental noise. The distribution of the differences is expected to be approximately normal, since the logarithms of biological changes appropriately measured and normalized obey a normal distribution. The differences are therefore z-normalized using robust estimators of the distribution parameters, and the distributions are checked on quantile-quantile (Q-Q) plots. Normalization of the differences is for analysis of the characteristic of differences and is an optional step for the present invention.
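One possible form of the robust z-normalization is sketched below. The use of the median as the location estimator and of the interquartile range divided by its normal-theory constant as the scale estimator is an assumption, since this description only calls for robust estimators. The function name robust_z and the simulated data are likewise hypothetical.

```python
import random
from statistics import NormalDist, median, quantiles, stdev

# For a normal distribution the interquartile range is about 1.349 sigma.
IQR_PER_SIGMA = 2 * NormalDist().inv_cdf(0.75)


def robust_z(differences):
    """z-normalize using the median and an IQR-based scale estimate.

    Both estimators depend only on the central quantiles, so a modest
    number of extreme outliers has little influence on them.
    """
    center = median(differences)
    q1, _, q3 = quantiles(differences, n=4)
    sigma = (q3 - q1) / IQR_PER_SIGMA
    return [(d - center) / sigma for d in differences]


# Simulated differences: 10,000 normal values plus 100 large artifacts.
rng = random.Random(0)
clean = [rng.gauss(0.3, 2.0) for _ in range(10000)]
z = robust_z(clean + [50.0] * 100)
spread_of_clean_part = stdev(z[:10000])  # stays close to 1 despite the artifacts
```

Because the scale is taken from the central quantiles, the artifacts barely change it, whereas an ordinary standard deviation over all values would be inflated by them.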

The indices are found by using the medians of the z-normalized differences among neighboring cells, that is, the cells within a window on the array. The matrix of the differences is rearranged to reflect the physical order of the chip, and data are collected via a moving window that simulates scanning through a pseudo image of the chip to find the medians. The window median is robust to biological responses, since neighboring cells on a chip do not have biological relationships. In contrast, experimental problems that hide or add signals at the window will affect the window median. The window medians will obey a normal distribution in a strict sense, according to the effect described by the central limit theorem. Although this model does not expect particular distributions for problems, affected windows will produce outliers in the normal distribution of the matrix medians.
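The scanning step can be illustrated with the following sketch, which moves a square window cell by cell over a pseudo image and records the median of each window. The 5 x 5 size corresponds to the 25-cell window mentioned in this description; the function name and the list-of-rows layout are assumptions made for illustration.

```python
from statistics import median

def window_medians(image, size=5):
    """Return the median of every size x size window, moving one cell at a time.

    image: list of rows, each a list of (z-normalized) difference values.
    """
    n_rows, n_cols = len(image), len(image[0])
    out = []
    for r in range(n_rows - size + 1):
        row_out = []
        for c in range(n_cols - size + 1):
            cells = [image[r + i][c + j] for i in range(size) for j in range(size)]
            row_out.append(median(cells))
        out.append(row_out)
    return out
```

A single cell with a large difference, such as a genuine biological response, does not move the median of any window containing it, whereas a defect covering most of a window does.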

The indices are found by normalizing the matrix medians. There is a difficulty in this normalization: the width of the distribution of matrix medians is not robust to problems. Indeed, the width may increase with the number of problems. If the distribution is simply z-normalized, the number of recognized problems will be reduced. However, this effect can be readily avoided by finding the width from that of the distribution of the differences among cells. In principle, a width of 0.25 was predicted in the present study for the mean of a window of 25 cells. Here, the width of the distribution of cell differences is robust with respect to problems, since large problems will produce outliers that will not affect the distribution at the central quantiles. In practice, the distributions for cells are not perfectly normal, having long tails possibly due to systematic additive noise in the data. However, the proper width can be estimated robustly from the proper quantiles. Consequently, the effect of the problems can be excluded by estimating the width of the distribution of indices according to the distribution for cells. Systematic noise as well as hybridization problems may change the compensation factor of 0.25 to somewhat larger values. In this description, a constant of 0.31 was used, obtained as the mode in actual measurements and being smaller than many other values that may have been affected by many problems (Fig. 1). All indices were adjusted or normalized by dividing by this constant.

The threshold is derived from a test level decided by analysis prior to the operation, similar to screening levels in other statistical tests. The parametric nature of data handling makes it possible to estimate how many indices will be larger (and smaller) among half a million results. The program will ask the operator how many windows should be expected. If an array is problem-free, the expected number of windows will still be recognized, owing to the random neighboring of biological responses on the chip. In practice, the affected indices will not obey the normal distribution and will more likely take values that exceed the threshold.
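The normalization of the window medians and the derivation of the threshold from the operator's expectation can be sketched as follows. The function names are hypothetical, and treating the expectation as a two-sided count split equally between the tails is an assumption. The figure of 0.25 quoted above is close to the large-sample value sqrt(pi / (2n)) for the median of n = 25 standard normal values, which the first function computes.

```python
import math
from statistics import NormalDist

def predicted_median_width(n_cells, cell_sigma=1.0):
    """Large-sample standard deviation of the median of n normal values."""
    return cell_sigma * math.sqrt(math.pi / (2 * n_cells))

def threshold_for_expectation(expected_windows, total_windows):
    """|index| cut-off that `expected_windows` windows should exceed by chance."""
    tail = expected_windows / total_windows / 2  # assumed: split over both tails
    return NormalDist().inv_cdf(1 - tail)

def flag_windows(medians, threshold, width=0.31):
    """Return (row, col, index) for windows whose |median / width| exceeds the threshold."""
    flagged = []
    for r, row in enumerate(medians):
        for c, m in enumerate(row):
            index = m / width  # 0.31 is the constant used in this description
            if abs(index) > threshold:
                flagged.append((r, c, index))
    return flagged
```

For example, `threshold_for_expectation(2, 500_000)` gives the cut-off at which about two windows out of half a million would be flagged if every index obeyed the standard normal distribution.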

B-2 Program

A program for the parametric scanning method is provided in the form of a function for R. The function requires the library "affy", which is available from BioC (http://www.bioconductor.org/). An outsourcing service is available as a part of data normalization (http://www.super-norm.com).

B-3 Data source and data processing

Arabidopsis GeneChip ® data were obtained from TAIR (http://www.arabidopsis.org/index.jsp). Leaf data from two research groups were used in the comparison of the ideal standard of hybridizations: 15 arrays for the rosette leaf used in drawing expression maps (Schmid, M., Davison, T. S., Henz, S. R., Pape, U. J., Demar, M., Vingron, M., Schölkopf, B., Weigel, D., and Lohmann, J., A gene expression map of Arabidopsis development. Nature Genetics 37:501-506, 2005), and 18 arrays of 0.5-5 day-old control plants in infection experiments by Dr. F. Ausubel's group (http://www.arabidopsis.org/index.jsp). Human data were obtained from the public domain resource at RCAST, University of Tokyo (http://www.genome.rcast.u-tokyo.ac.jp/normal/). PM data for the arrays were normalized according to the three-parameter method (Konishi, T., Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment. BMC Bioinformatics, 5:5, 2004).

C. Results

C-1 Verification of Assumptions

C-1-1 Stability of hybridization standard

The method compares each datum with the ideal standard of hybridization, which should represent a stable pattern of the sample tissue. If the pattern is truly stable, it will coincide with that of other standards determined using different sets of data on identical tissue. To confirm this coincidence, the standards obtained using data from two research groups were compared. Both groups determined the transcriptome of leaves, one as part of an atlas of plants, and the other as a control for infection experiments. Standards were obtained as trimmed means of the median-normalized log data. The results were compared on a scatter plot with 1000 corresponding cell data (Fig. 2). The coincidence between laboratories was thus confirmed. Some other examples of inter- and intra-laboratory comparisons are presented, showing similar correspondences. Such coincidences cannot be obtained by chance; for example, standards found from different tissues have different tendencies, which will appear as wide scatter in the plot. Such tendencies may show a tissue-dependence of the standard, and attention should be paid to this in practical use of the program.

C-1-2 Normality of differences between array data and the standard

The proposed method assumes that the differences between each datum and the ideal standard of hybridization will be distributed normally in a rough sense. This assumption was confirmed by means of Q-Q plots for the data distribution. The distributions had long tails, which may reflect the systematic additive noise of measurement. However, all of the distributions were coincident with the theoretical values from -1.5 to 1.5 (Figs. 3A, 3B and 3C), indicating that more than 85% of the data obeyed the normal distribution. As problems and noise influence the distribution, hybridizations with large problems had a narrower range of coincidence, as observed in the case shown in Fig. 3C (ATGE 14C).
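The figure of more than 85% follows from the standard normal distribution: the fraction of values lying between -1.5 and 1.5 can be checked with a short calculation, shown here in Python as an illustration only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution
central_fraction = z.cdf(1.5) - z.cdf(-1.5)
print(f"{central_fraction:.3f}")  # 0.866
```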

C-1-3 Normality of distribution of indices

The method also assumes that the indices, which are derived from the medians of the moving windows, will be distributed normally when large problems are not present. This assumption was also confirmed by means of Q-Q plots (Figs. 4A, 4B and 4C). The distributions observed were roughly normal, as expected from the central limit theorem. The standard deviation of 0.31, determined from many hybridizations (Fig. 1), afforded good compensation for the width of the distribution and the slope of the plot (Figs. 4A and 4B). As expected, the width of the distribution increased with the severity of the problems (Fig. 4C, ATGE 14C).

C-2 Confirmation of Method

C-2-1 Improvement of reproducibility in repeated experiments

If parametric scanning effectively eliminates problems from data, it will reduce the fluctuations found in duplicate experiments. This effect was checked using sets of repeated measurements on Arabidopsis leaves (http://www.arabidopsis.org/index.jsp).

Before and after cancellation, PM data were normalized using the SuperNORM ® algorithm, which is based on a three-parameter method. The resultant z-scores were compared on scatter plots (Fig. 5), from which it is clear that the proposed method eliminated the diffusions found in the plots (Fig. 5, left) and achieved the expected reproducibility (Fig. 5, center).

As in other statistical tests, some clean and faultless data were also cancelled by parametric scanning. In a sense, this is a cost required to find something by means of statistical tests. However, in this algorithm, the number of cancelled clean data was not large. The nature of the cancelled data was checked from the reproducibility of experiments (Fig. 5, right). The number of data on the plots increased as the quality of hybridization decreased. The cancelled data did not display narrow concentrations to the y = x line, but were instead dispersed (Fig. 5, right). Coincidence was observed only when many cell data were cancelled (Fig. 5, lower right), and the data concentrated on the y = x line were only a limited part of the cancelled data.

Some of the fluctuations found in the examples shown in Fig. 5 were critically large. Such examples were not exceptions among the many examinations. Figure 6 compares the numbers of cancelled data under different expectations. Data sources of Figure 6 are as follows: rectangles (Schmid, M., Davison, T. S., Henz, S. R., Pape, U. J., Demar, M., Vingron, M., Schölkopf, B., Weigel, D., and Lohmann, J., A gene expression map of Arabidopsis development. Nature Genetics 37:501-506, 2005); circles (http://www.arabidopsis.org/index.jsp); and triangles (Ge, X., Yamamoto, S., Tsutsumi, S., Midorikawa, Y., Ihara S., Wang S., Aburatani H., Interpreting expression profiles of cancers by genome-wide survey of breadth of expression in normal tissues, Genomics. 86:127-141, 2005.). It is obvious that the extreme examples have not been taken from outliers.

The improvement of reproducibility was further checked from the reductions in the standard deviations for the differences in z-scores between the corresponding PM cells of paired hybridizations. To minimize the effect of additive noise and saturation of measurements, standard deviations were calculated using normalized values (0 to 1). The effect was checked on a scatter plot (Fig. 7), which clearly shows that parametric scanning reduces the standard deviation in the differences among obtained z-scores.
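The reproducibility measure described above may be computed along the following lines. This is a sketch under stated assumptions: cancelled cells are represented by `None`, and "normalized values (0 to 1)" is read as restricting the comparison to cells whose z-scores lie between 0 and 1 in both hybridizations. The function name is hypothetical.

```python
from statistics import stdev

def paired_sd(z_a, z_b, lo=0.0, hi=1.0):
    """Standard deviation of z-score differences between two paired hybridizations.

    Cells that are cancelled (None) in either array, or whose z-score falls
    outside [lo, hi] in either array, are left out (assumed reading of the text).
    """
    diffs = [
        a - b
        for a, b in zip(z_a, z_b)
        if a is not None and b is not None and lo <= a <= hi and lo <= b <= hi
    ]
    return stdev(diffs)
```

Calling the function on the same pair of arrays before and after cancellation gives the two values that are plotted against each other in a comparison of the kind shown in Fig. 7.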

C-2-2 Comparison with other algorithms

The method was evaluated against the same sets of arrays treated using other automated methods in the dChip package rather than new experimental data. All the spikes and outliers recognized by dChip were cancelled using the PM-only model, and the data were normalized in an identical manner. As shown in Fig. 8, dChip gave lower reproducibility (Fig. 8, left), showing weaker detection power. This does not necessarily mean that dChip preserves faultless data; it cancelled the complete set of cells for certain genes (0.004-6.4% of the total), while no gene was totally cancelled by the parametric scan. In such genes, no information will be retained for analysis. Table 1 shows the proportion of genes whose cells were completely cancelled (unit = %).

Table 1 Percentage of the cancelled genes

C-2-3 Sensitivity of threshold parameter

The number of data actually cancelled in each hybridization was not clearly dependent on the threshold parameter, which is a test level decided by the operator. The number of cancelled data was much larger than the expectation estimated from the threshold parameter (Fig. 6), reaching as high as a quarter of the total number of cells (tens of thousands), even when the expectation was 2 windows (50 cells). However, the number of cancelled cells did not increase by ten times when the expectation was increased from 2 to 20. The relationship between the expectation and the actual number of cancellations became poorer as the number of cancelled data increased. Processing of data obtained from three different laboratories suggested a stable relationship between cancelled windows at the two expectations (Fig. 7). It should be noted that the expected numbers, which appear at (1.7, 2.7) in the plot, roughly satisfy the extrapolated relationship (Fig. 7). The number of cancelled data may depend on the quality of hybridization, as the number of cancellations was observed to be higher when major problems were found (Fig. 5). The cancelled windows often formed clusters in the chip, suggesting a single cause within the cluster (Fig. 9). Such clusters were found regardless of the value of the expectation parameter. The frequencies and area of cancellation differed among data from the different laboratories (Fig. 6). The numbers for data measured in one particular laboratory (triangles in the figure) were clearly larger than those from the other laboratories. Many of the clusters may represent polishing of the chip surface or uneven hybridization. It is likely that the differences in the frequencies of problems are due to differences in protocols and skills in wet experiments, which will differ according to the laboratory and the time of preparation. These problems were highlighted by high index values, producing many cancelled windows in the case of severely defective cells even when the expectation was rather small.
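The plotted point (1.7, 2.7) is consistent with the expected numbers of cancelled cells on a common logarithmic scale. The check below assumes 25 cells per window, so that 2 and 20 expected windows correspond to 50 and 500 cells; the logarithmic scale of the plot is an inference from these numbers and is not stated in the text.

```python
import math

cells_per_window = 25  # window size used in this description
points = [round(math.log10(w * cells_per_window), 1) for w in (2, 20)]
print(points)  # [1.7, 2.7]
```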

The results above are considered evidence of the insensitivity of parametric scanning to the value of the expectation parameter; that is, the proposed method appears to have good fidelity with respect to problem detection. Such insensitivity implies objectivity in the algorithm, since the threshold is the only parameter subject to operator selection.

On the basis of the observations above, the proposed method is recommended for practical use on all GeneChip ® expression data prior to normalization. The assumptions in the approach were validated through analysis of data distributions, and the only arbitrary parameter was shown to have limited effect on the results. Furthermore, through tests in many additional experiments (not shown), the parametric scanning method has been found to be very effective in eliminating hybridization problems. The appropriateness of the method can be checked in every analysis, with the data required for the checking process supplied by the software. The numbers of cancelled data are always larger than the expectation, suggesting that most hybridizations have problems of some sort.

The problems detected had patterns indicative of surface polishing, uneven hybridization, and errors in the fabricated cell structure. Symmetric patterns of clusters surrounding the center of the chip (Fig. 9, lower right) can be identified as polishing artifacts. In such a case, the signals in the affected area are always distinctively lower and thus insensitive to the expectation value. Cases with an advanced degree of surface polishing will form the common doughnut-like cluster pattern. In contrast, clusters with indefinite shape are more likely indicative of uneven hybridization. Within the cluster, the data tend to increase or decrease, producing diffusion in the scatter plot of experimental reproducibility (Fig. 5). Such unevenness can be derived from several sources, and some of the distinctive regions are insensitive to the expectation value while some are not (Fig. 9, ATGE 14C). The differences in sensitivity correspond to the differences in the magnitude of the defect. Defects detected as smaller clusters or isolated windows may have been formed by dust. Again, some of these features are distinct while others are not. Errors in the chip structure can be identified as repeated clusters in the same parts of multiple chips, forming regular shapes often surrounded by straight lines. Many such defects are not problems but control cells designed and placed on the chip, although some may be caused by problems, appearing in all chips with similar batch numbers (i.e., same manufacturing lot). Such problems might be caused by product errors that have not been detected in quality controls and can result in serious problems. In the case shown in Fig. 5, the huge upward diffusion is attributed to this sort of failure (Fig. 9, lower left).

The proposed method will reduce false positives in microarray data analyses. Such errors are not unique to microarray analyses, but the multiplicity of tests conducted using microarrays increases the seriousness of errors. Multiplicity is realized through the comprehensiveness of the microarray and other post-genomic analyses, which generate distinctively different targets of analysis compared to conventional methods that measure only a limited number of gene products. In the hyper-multiple comparisons, a large number of false positives will hinder analysis, producing both intra- and inter-laboratory contradictions in the observations. For example, permitting type-I error at a probability of 1%, half a million double-sided tests will produce 10000 errors. Ignoring hybridization problems will greatly increase this expectation (Fig. 5). Additionally, such problems will affect data normalization and the summarized data for genes. Consequently, hybridization problems should be detected and eliminated before normalization. The proposed method will rescue clean data from a failure-free region of hybridization, and the data remaining after cancellation can be normalized and used for further analysis. The resultant data set showed fair coincidence with the corresponding pairs in reproducibility experiments (Fig. 5, center). The total cost of experiments will be reduced in comparison to an ad hoc approach to cancellation of genes in arrays and/or entire arrays.
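The figure of 10000 errors can be reproduced by a simple count, on the reading that the 1% probability applies to each tail of a double-sided test. That reading is an assumption; if 1% were the total over both tails, the same calculation would give 5000.

```python
tests = 500_000        # half a million tests, as in the text
alpha_per_tail = 0.01  # assumed: 1% in each tail of a double-sided test
expected_false_positives = tests * alpha_per_tail * 2
print(round(expected_false_positives))
```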

The R program will be affected by the tissue effect in the discovery of the ideal standard of hybridization. That is, the standards will differ according to the differentiation of cells in the sample. Such an effect will occur when treating a small number of arrays together with a large number of arrays on a different tissue. Additionally, treating data using fewer than four arrays is not encouraged, since the standard cannot be considered stable. The stability of the standard can be checked using the approach shown in Fig. 2, and the tissue effect can be noticed by a marked increase in cancellations without producing the clusters of cancelled windows found in Fig. 9. Such problems can be avoided by finding the standard separately from the recognition process. Practically, two alternative ways can be employed to discover the ideal standard: using randomly selected samples among various tissues of many arrays, or finding tissue-specific standards and using these for the corresponding arrays.

INDUSTRIAL APPLICABILITY

The present invention can be utilized for microarray analysis for detecting nucleotide hybridization including measuring mRNA levels and finding SNPs, for example.