YARMUS LONNY (US)
RULE ANA (US)
WO2019053414A1 | 2019-03-21
US20180275131A1 | 2018-09-27
US20200378973A1 | 2020-12-03
US20180180590A1 | 2018-06-28
US20200103394A1 | 2020-04-02
US20130236981A1 | 2013-09-12
CLAIMS
1. A method of detecting stage one lung cancer in a subject comprising: receiving a breath sample from the subject; analyzing the breath sample to detect at least one of Acetoin, Dodecane, and p-Cymene; initiating a follow-up plan for the subject, if the at least one of Acetoin, Dodecane, and p-Cymene are detected.
2. The method of claim 1 further comprising receiving multiple breath samples from the subject.
3. The method of claim 1 further comprising using a device for analysis of the VOCs in the breath.
4. The method of claim 3 further comprising using the device more than once, in order to confirm results.
5. The method of claim 1 further comprising collecting the breath sample in a bag or other receptacle.
6. The method of claim 5 wherein the bag or other receptacle comprises a Tedlar® bag or other film bag.
7. The method of claim 1 further comprising analyzing the breath samples within 24 hours of collection.
8. The method of claim 1 further comprising analyzing the breath samples within 2 hours of collection.
9. The method of claim 1 further comprising using a gas chromatograph for analysis of the breath sample.
10. The method of claim 1 wherein the follow-up plan further comprises additional testing, treatment, preventative and/or lifestyle changes.
11. A method of detecting stage one lung cancer in a subject comprising: analyzing a breath sample from the subject to detect at least one of Acetoin, Dodecane, and p-Cymene; initiating a follow-up plan for the subject, if the at least one of Acetoin, Dodecane, and p-Cymene are detected.
12. The method of claim 11 further comprising analyzing multiple breath samples from the subject.
13. The method of claim 11 further comprising using a device for analysis of the VOCs in the breath.
14. The method of claim 13 further comprising using the device more than once, in order to confirm results.
15. The method of claim 11 further comprising analyzing the breath samples within 24 hours of collection.
16. The method of claim 11 further comprising analyzing the breath samples within 2 hours of collection.
17. The method of claim 11 further comprising using a gas chromatograph for analysis of the breath sample.
[0044] After the exclusion of groups that did not have a biopsy-confirmed S1LC case, there were 231 study participants left (cases and controls). These data include a total of 92 cases with 51 housemate controls and 88 matched controls. The number and type of controls are displayed for these 92 cases in Table 6.
[0045] From these data, four patients with biopsy-confirmed S1LC were further excluded. Of these four patients, two had neither matched nor housemate control data. The other two had only housemate control data but no matched control data. Data for these groups were excluded from the analysis. These exclusions were applied to avoid groups that are not balanced on covariates.
[0046] Data analysis was conducted only for groups of study participants that contained a patient with biopsy-confirmed S1LC. For this analysis, only the 88 groups with a case who had at least one available matched control were used. These data are referred to as “included groups”. Among the included groups, 39 groups had only one matched control and no housemate control, and 49 had both a matched control and a housemate control; see Table 6 for more details.
[0047] FIG.1 illustrates a graphical view of an enrollment graph for the study as a function of time. Each line represents the cumulative enrollment by participant type (case = solid line, matched control = dotted line, and housemate control = dashed line). For example, by January 2019 there were 55 enrolled biopsy-confirmed S1LC cases in the study.
[0048] There were 330 study participants in total, including 157 potential cases (patients who were identified by the clinical team as potential cases before the biopsy). Out of the 157 potential cases, 65 (41.4%) were excluded from the analysis. Most potential cases were excluded from the study because biopsy results did not confirm the S1LC diagnosis; see Table 2.
Matched control and housemate control data associated with the excluded cases were also removed from the analysis. The data used in this analysis include 225 participants, among them 88 cases with at least one available matched control.
[0049] According to the pre-specified analysis protocol, data were split into training (for biomarker discovery and model exploration) and testing (for validation of biomarkers and models). The first 30 groups and their controls were used for training and the remaining 58 groups were used for testing.
Table 5: Demographics table for included data
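As an illustration of the pre-specified split described in paragraph [0049], the following minimal Python sketch assigns groups to the training or testing set by group order. The function name `split_groups` and the use of consecutive group numbers 1 through 88 are assumptions made for illustration only; the cutoff of 30 training groups is taken from the text.

```python
def split_groups(group_ids, n_train=30):
    """Split group identifiers into training and testing sets by order.

    A group is a case together with its controls, so splitting on the
    group identifier keeps a case and its controls in the same set.
    Assumption: sorting the identifiers reproduces the study ordering.
    """
    ordered = sorted(group_ids)
    return ordered[:n_train], ordered[n_train:]


# Hypothetical group numbering 1..88 for the 88 included groups.
train_groups, test_groups = split_groups(range(1, 89))
print(len(train_groups), len(test_groups))
```

Splitting at the group level, rather than at the participant level, avoids placing a case in one set and its matched control in the other.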
[0050] The demographic and behavioral summaries for the study participants in the 88 analyzed groups (case and at least one available matched control) are presented in Table 5. Details are further provided by the three study participant types (case, matched control, housemate control). Table 6 provides the demographic and behavioral information separated by training and testing data sets. For each subject, two bags of exhaled breath were collected consecutively during one forceful exhalation process. Bag 1 (diluted) had a volume of 0.5 liters and was used to collect the first air exhaled (tidal volume), which is thought to represent the normal exhalation process. Bag 2 (alveolar) had a volume of 1.0 liter and was used to collect the expiratory reserve volume (the gas mixture coming from the dead space of the bronchial tree and the alveolar gas exchange space of the lungs). The air from each bag was injected into a gas chromatography-mass spectrometry (GC-MS) instrument, which separated the different compounds in the exhaled air into a series of “peaks”. Each peak was associated with a distinct VOC.
[0051] To convert an original GC-MS peak area result (unitless) to a concentration value in the sample (mass of compound per volume of air), a calibration curve was constructed for each of the 13 quantifiable VOC compounds described in Section 4.7.3. A calibration curve was obtained by serially diluting a chemical standard to obtain at least five different and known concentrations, which are plotted along the x-axis. These known concentrations were injected into the GC-MS and the resulting peak areas are plotted along the y-axis. Each calibration curve was compound specific. This provided the mapping (calibration) of VOC peak areas to concentration measurements for Bags 1 and 2.
Table 6: Demographics table for S1LC patients in included data
[0052] The first step is to compare the consistency of VOC quantification in the two bags.
Note: bag comparison results are based on the analyzed data only, which included 225 study participants (88 cases, 88 matched controls, and 49 housemate controls). As both measures are highly right skewed, the log10 (peak area) and log10 (concentration) were used instead.
[0053] Results indicate that for most compounds, the VOC peak area measurements for the two bags are strongly correlated; see Table 7 and FIG.2. These results are based on larger sample sizes than the corresponding results for concentrations, which require both bag measurements to be above the limit of detection. The columns labeled “n” in Table 7 provide the number of study participants who had both bag measurements by VOC. For VOCs with concentrations above the limit of detection in more than 100 bag pairs, there is good agreement between the correlations for VOC peak areas and for concentrations.
[0054] FIG.3 provides the distributions of VOC peak areas (left panel) and concentrations (right panel) in Bags 1 (dark grey) and 2 (light grey) separated by compound (x-axis). For 2-Pentanone, Acetoin, and Dodecane the log peak areas and concentrations were smaller on average in Bag 1 than in Bag 2. However, for the other VOCs, measurements tended to be similar on average in the two bags or even larger in Bag 1. Table 8 provides the results of paired t-tests for the null hypothesis of no difference in the mean log10 peak areas and concentrations between Bags 1 and 2. For log10 peak areas there was a statistically significant difference for 3-Methyl-1-Butanol (p-value < 0.001, larger values in Bag 1), Acetoin (p-value < 0.001, smaller values in Bag 1), Dodecane (p-value = 0.005, smaller values in Bag 1), Ethylbenzene (p-value < 0.001, larger values in Bag 1), Hexanal (p-value < 0.001, larger values in Bag 1), and Toluene (p-value < 0.001, larger values in Bag 1).
For log10 concentrations there was a statistically significant difference for 2-Pentanone (p-value = 0.002, smaller values in Bag 1), Acetoin (p-value < 0.001, smaller values in Bag 1), Cyclohexanone (p-value = 0.018, larger values in Bag 1), and Dodecane (p-value = 0.009, smaller values in Bag 1). The difference in testing results between log peak areas and concentrations can be attributed to the large number of missing concentrations that are below the limit of detection.
[0055] FIG.2 illustrates graphical views of scatterplots of log10 (peak) for Bag 1 (x-axis) versus Bag 2 (y-axis). Dark grey: regression line; light grey: identity line; axis labels: displayed on the original scale. FIG.3 illustrates graphical views of boxplots of peak area (left panel) and concentrations (right panel) on the log10 scale for Bag 1 (dark grey) and Bag 2 (light grey). The y-axis labels are displayed on the original scale. The peak data are unitless and the concentrations are expressed in μg/L.
Table 7: Bag 1 vs. Bag 2. In each pairwise complete comparison, values with a missing or undetected peak or concentration in either Bag 1 or Bag 2 were excluded from the analysis. The number of samples used to compute the correlation is recorded in column n samples.
[0056] The association between the measurements in the two bags was also quantified using a linear regression model for Bag 2 (y, outcome) versus Bag 1 (x, regressor) based on log10 peak areas and concentrations, respectively. Table 9 provides summaries of these regressions, where: (1) the columns labeled “Estimate” provide the point estimate for the slope of the regression; (2) the column labeled “p-value” is the p-value for testing the null hypothesis of no association between measurements in Bags 1 and 2; (3) the columns labeled “lower CL” and “upper CL” are the lower and upper limits of the 95% confidence intervals, computed from the participants who had both bag measurements.
Results indicate strong evidence that the log peak area measurements in the two bags are statistically associated for all quantifiable compounds: 2-Butanone and Toluene have slope estimates greater than 0.9, and Ethylbenzene and p-Cymene have slope estimates greater than 0.8. Scatterplots of Bag 1 (x-axis) versus Bag 2 (y-axis) measurements are shown in FIG.2 for peak area and FIG.4 for concentrations. The regression line is shown in dark grey and the 45° (identity) line is shown in light grey. Data are plotted on the log scale, but labels are shown on the original scale. FIG.2 indicates strong associations between the peak area measurements in Bags 1 and 2 for most compounds; these associations were used for quality control purposes.
[0057] FIG.4 illustrates graphical views of scatterplots of log10 (concentration) for Bag 1 (x-axis) versus Bag 2 (y-axis). Dark grey: regression line; light grey: the identity line; axis labels: displayed on the original scale.
Table 8: Paired t-test of no difference between Bag 1 and Bag 2 using log10 peak areas and concentrations. P-values and the lower and upper confidence limits of the 95% confidence intervals are provided
[0058] There are fewer data points in FIG.4 than in FIG.2 because many concentrations were below the limit of detection. For this reason, the estimates of the slope parameters for the regression of log concentrations in Bag 2 versus Bag 1 tended to be smaller than for log peaks. For log concentrations, only 2-Butanone, 2-Pentanone, and p-Cymene had slope estimates larger than 0.8.
[0059] According to the study design, each study participant started exhaling into Bag 1 (diluted) and continued exhaling into Bag 2 (alveolar), which was assumed to collect deeper air from the lungs. Comparison of Bag 1 and Bag 2 peak area and concentration measurements indicates that there is strong correlation between the measurements in the two bags; see Table 7, FIG.2 and FIG.4.
For some compounds there are statistically significant differences between Bag 1 and Bag 2 measurements. For log10 peak areas, some VOC measurements are higher on average in Bag 1 and some are higher in Bag 2. For log10 concentrations, there were either no statistically significant differences between the two bags or measurements were lower on average in Bag 1. These differences can be attributed to the large number of missing concentrations (below the limit of detection) for many VOCs. For the purposes of data analysis, only Bag 2 data were used.
Table 9: Bag similarity: linear fit results and 95 percent confidence intervals. Bag 1 (x) vs. Bag 2 (y), each log10 transformed.
Table 10: Number of undetected compounds by bag information (included participants data with at least one matched control)
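The bag comparison described in paragraphs [0053] to [0059] (pairwise-complete correlation, regression slope of Bag 2 on Bag 1, and paired t-test on the log10 scale) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, missing or undetected values are assumed to be encoded as `None`, and the example peak areas are made up.

```python
import math


def log10_pairs(bag1, bag2):
    """Return (log10 x, log10 y) for pairs measured in both bags.

    Assumption: a missing or undetected value is encoded as None.
    """
    return [
        (math.log10(x), math.log10(y))
        for x, y in zip(bag1, bag2)
        if x is not None and y is not None and x > 0 and y > 0
    ]


def correlation_and_slope(pairs):
    """Pearson correlation and least-squares slope of Bag 2 (y) on Bag 1 (x)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy), sxy / sxx


def paired_t_statistic(pairs):
    """t statistic for the null hypothesis of zero mean difference (Bag 1 - Bag 2)."""
    diffs = [x - y for x, y in pairs]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)


# Made-up peak areas for one compound; None marks a missing measurement.
bag1 = [1000.0, 2500.0, None, 400.0, 9000.0, 150.0]
bag2 = [1200.0, 2400.0, 800.0, 500.0, None, 200.0]
pairs = log10_pairs(bag1, bag2)
r, slope = correlation_and_slope(pairs)
t = paired_t_statistic(pairs)
print(len(pairs), round(r, 3), round(slope, 3), round(t, 3))
```

A negative t statistic here corresponds to smaller values in Bag 1 than in Bag 2. Converting the statistic to a p-value requires the t distribution with n - 1 degrees of freedom, which is omitted from this sketch.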
Quantifiable compounds were not detected for some study participants. The missing (below limit of detection) concentrations by VOC and collection bag are presented in Table 10. Here, missing concentration values include both missing peak values, which did not produce a concentration value after calibration, and peak values that corresponded to a VOC concentration considered to be below the limit of detection. There were 4 study participants (housemate control: N=1, matched control: N=3) in the test data set with missing Bag 1 measurements. These study participants were removed from the Bag 1 versus Bag 2 analysis, but were kept in the predictive modeling analysis.
Table 11: Number of undetected concentrations in Bag 2
[0060] Table 11 further lists the number of case and control study participants in the training and testing data with missing quantifiable peaks and concentrations, respectively. Results indicate that the individual VOC limit of detection and percent missingness depend on the compound type, both for peaks and for concentrations. There is also a bag effect for peak areas, with fewer missing peak areas in Bag 2 (with the exception of Ethylbenzene). For concentrations with lower percent missingness (Dodecane, Acetoin, 2-Pentanone, Heptanal), the percentage of missing observations was lower in Bag 2. For concentrations with higher percent missingness, the difference between bags was less clear.
[0061] The quantifiable VOC peak areas obtained from Bag 2 (alveolar) in the training data are examined. FIG.5 displays the boxplot of log10 (peak) area for cases and controls combined. The y-axis labels are on the original scale even though the data were log10 transformed. FIG.6 displays the same data as FIG.5, but boxplots are separated by cases (dark grey), housemate controls (light grey) and matched controls (grey). A visual inspection of the data suggests that Acetoin, 2-Hexanal, Hexanal, Heptanal, p-Cymene and Dodecane exhibit differences in the distribution of log10 peak areas between cases and controls in the training data. For all of these VOCs, cases tend to have on average lower, not higher, log10 peak areas than controls. FIG.7 displays the same data as FIG.6, with cases shown in dark grey and controls (combined housemate and matched controls) shown in light grey.
[0062] FIG.5 illustrates graphical views of boxplots of log10 (peak) for quantifiable VOCs. The x-axis shows the compounds and the y-axis labels are displayed on the original scale while the data were log10 transformed.
VOC peaks in S1LC cases tend to be lower than in controls, which contradicts currently published literature. Acetoin, 2-Hexanal, Hexanal, Heptanal, p-Cymene and Dodecane exhibit visual differences in the distribution of log10 peak areas between cases and controls in the training data.
[0063] The overall goal of the project is to identify individual VOCs or VOC combinations that discriminate S1LC patients from controls. The first step was to conduct forward selection based on logistic regression on the training data, regressing on the case/control status. The idea is to select the combination of variables with the highest predictive performance as measured by the area under the receiver operating characteristic curve (AUC) in the training data set. The second step is to apply and evaluate these models on the test data set. A control is defined as a study participant in the “included data” subset who does not have cancer (either housemate control or matched control). For each compound, the missing observations were removed in all models that contained that compound.
[0064] FIG.6 illustrates graphical views of boxplots of log10 (peak) for quantifiable VOCs separated by cases (dark grey), housemate controls (light grey), and matched controls (grey). The x-axis shows the compounds and the y-axis labels are displayed on the original scale even though the data were log10 transformed.
[0065] FIG.7 illustrates graphical views of boxplots of log10 (peak) for quantifiable VOCs separated by cases (dark grey) and housemate and matched controls combined (light grey). The x-axis shows the compounds and the y-axis labels are displayed on the original scale even though the data were log10 transformed. FIG.8 illustrates an infographical view of the correlation among VOC log10 (peaks).
[0066] Pairs of VOCs with high correlations between log peak area measurements may not improve the predictive performance of models using only one of the VOCs in the pair.
This is due to the overlap in information between the two VOCs in the pair. In contrast, pairs of VOCs with low correlations are good candidates for jointly improving prediction. In this data set, many VOC pairs have highly correlated log peaks; see FIG.8. For example, Toluene has a correlation of 0.79 with Ethylbenzene and Hexanal has a correlation of 0.71 with Heptanal. In contrast, Acetoin has lower correlations with all quantifiable compounds, with a maximum correlation of 0.54 with Dodecane.
[0067] The performance of each VOC (log peak area) in a univariate model is examined, that is, using each VOC as a single predictor of lung cancer. Table 12 ranks the predictive performance of each compound. Based on the training AUC results, p-Cymene, Heptanal, and Acetoin are the top 3 VOCs in terms of S1LC case prediction performance. Table 12 also shows that the top individual predictors ranked by test AUC are Acetoin (test AUC 0.648), p-Cymene (test AUC 0.612) and 2-Butanone (test AUC 0.61).
[0068] Table 13 displays the results of the forward selection procedure, where each VOC is added to the predictive model based on the maximum AUC criterion in the training set. The model with maximum test AUC included p-Cymene and 2-Butanone (test AUC 0.669). The second best performing model included p-Cymene, 2-Butanone, Heptanal, and Acetoin (test AUC 0.620).
Table 12: VOCs ranked by the individual prediction performance of S1LC cases based on log peak area. Ranking criterion: AUC in the training data set
[0069] A major practical limitation of the VOC peak-based analysis is that multiple compounds are below the limit of detection; see Tables 10 and 11.
For example, the top-performing model based on log peak area used p-Cymene (64% missing concentrations in cases/training, 38% missing concentrations in controls/training, 70% missing concentrations in cases/test, and 47% missing concentrations in controls/test) and 2-Butanone (94% missing concentrations in training cases and controls and 98% missing concentrations in test cases and controls). This is a problem because, even if the compounds have discriminatory power, they are generally under the limit of detection of the GC-MS instrument used in the study. The implication is that concentration thresholds with discriminating properties cannot be provided for these compounds.
[0070] Therefore, in what follows, VOC concentrations with values above the limit of detection are used.
[0071] FIG.9 compares VOC concentrations for training data separated by each control type and the two bags (left panel corresponds to Bag 1 and right panel corresponds to Bag 2). Only compounds with less than 20% missing data (either in Bag 1 or 2) are used in the analysis. Boxplots are shown in dark grey for cases, light grey for housemate controls, and grey for matched controls. For each compound the boxplots are based on a different number of study participants, as missing concentrations were excluded.
Table 13: Forward selection models using log peak area for VOCs based on maximum improvement in AUC in the training data
Table 14: Correlations of log concentrations of quantifiable VOCs that have at least 20% concentration measurements above the limit of detection
[0072] FIG.10 provides the same information as FIG.9, but combines housemate and matched control data into a single control category. According to the protocol, only the data obtained from Bag 2 are used. For prediction modeling, the housemate and matched controls are combined into one category, as shown in FIG.10.
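The single-variable AUC values discussed in this section can be illustrated with a rank-based computation. For a model with one predictor, the fitted logistic probability is a monotone function of that predictor, so the AUC depends only on the ordering of the measurements once the direction (here, lower values indicate a case) is fixed. The sketch below is an illustration under that assumption and is not the study's code; the function name `auc_lower_is_case` and the example log10 concentrations are hypothetical.

```python
def auc_lower_is_case(case_values, control_values):
    """Estimate the AUC for a rule that flags low measurements as cases.

    The estimate is the proportion of (case, control) pairs in which the
    case has the lower value, with ties counted as one half. Missing
    values (encoded as None, an assumption) are dropped first, mirroring
    the removal of observations below the limit of detection.
    """
    cases = [v for v in case_values if v is not None]
    controls = [v for v in control_values if v is not None]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c < k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Made-up log10 concentrations; None marks a value below the limit of detection.
example_cases = [-1.6, -1.4, None, -1.1, -0.9]
example_controls = [-1.5, -1.0, -0.8, None, -0.7, -0.6]
example_auc = auc_lower_is_case(example_cases, example_controls)
print(example_auc)
```

An AUC of 0.5 corresponds to no discrimination, and values closer to 1 indicate that cases tend to have lower measurements than controls.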
[0073] Correlations between individual VOC log concentrations (using pairwise complete observations) in the training data are presented in Table 14. Results are consistent with the correlation results for VOC log peak areas; see FIG.8. Dodecane and 2-Pentanone had the highest correlation (0.656) among all VOC pairs. Acetoin has consistently low correlations with the other quantifiable VOCs shown in Table 14, with the largest correlations in absolute value being with Dodecane (0.374) and Heptanal (0.369).
[0074] FIG.9 illustrates graphical views of boxplots of log10 (concentrations) for quantifiable VOCs with concentrations above the limit of detection for at least 20% of measurements. Boxplots are separated by cases (dark grey), housemate controls (light grey) and matched controls (grey). The x-axis shows the compounds and the y-axis labels are displayed on the original scale even though the data were log10 transformed. FIG.10 illustrates graphical views of boxplots of log10 (concentrations) for quantifiable VOCs with concentrations above the limit of detection for at least 20% of measurements. Boxplots are separated by cases (dark grey) and housemate and matched controls combined (light grey). The x-axis shows the compounds and the y-axis labels are displayed on the original scale even though the data were log10 transformed.
Table 15: Prediction performance of log concentration of quantifiable VOCs that have at least 20 percent concentration measurements above the limit of detection. Performance is assessed as AUC in single-variable models and is reported in the training and test data. Ranking based on training data. The column labeled N indicates the number of samples used in the model
[0075] Table 15 provides the S1LC case prediction performance of individual VOCs using univariate logistic regression based on log concentrations above the limit of detection.
Acetoin and Heptanal have training AUCs greater than 0.6, while other compounds have AUCs close to 0.5. The AUC for Acetoin is 0.649 in the training data and 0.650 in the testing data. In contrast, the AUC for Heptanal is 0.610 in the training data, but falls to 0.511 in the test data. Dodecane has a consistent AUC across training and test data (0.574 in training and 0.541 in testing).
[0076] A forward selection approach was used to identify the combination of most predictive VOCs. Selection of VOCs and ranking of models were based on the maximum improvement in the AUC using training data. For each selected model the AUC on the test data was also computed. In each candidate model, observations in which an included VOC is below the detection limit are excluded. Table 16 displays the results of the procedure and provides both the training and test AUC as additional covariates are included in the model. The table is cumulative; for example, the row labeled 2-Pentanone indicates that 2-Pentanone was the third variable added to the model and the corresponding AUC refers to the model that includes Acetoin, Heptanal, and 2-Pentanone.
[0077] In the log concentration analysis, Acetoin is the strongest predictor with a training AUC of 0.649 and a test AUC of 0.65. Adding Heptanal increases the training AUC to 0.669 and decreases the test AUC to 0.669. Adding 2-Pentanone to the model slightly increases the training AUC (from 0.669 to 0.689), though the test AUC of 0.601 is still below the test AUC of 0.65 for Acetoin alone. This suggests that using a one-variable model based on Acetoin may be the best approach. One could also consider a two-variable model adding either Dodecane or 2-Pentanone. However, more complex models are not considered at this time given the results in Table 16 and the high correlations among the other log concentrations of quantifiable VOCs shown in Table 14.
Table 16: Forward selection results based on quantifiable VOC log concentrations.
Each row indicates a cumulative model; for example, the row labeled Dodecane corresponds to a model that includes Acetoin, Heptanal and Dodecane. Ranking is based on training AUC (both training and test AUCs are shown). The column labeled N indicates the number of samples used in the model
[0078] Table 17: Results for t-tests comparing the mean of the log concentration between cases and combined controls. Results are shown for the training, test, and combined data
[0079] Unpaired t-tests were conducted to compare the mean of the log concentration between cases and combined controls separately in the training and test data as well as in the combined test and training data. Table 17 provides the results, indicating that the difference in log concentrations of Acetoin is: (1) not significant at the α = 0.05 level in the training sample (p-value = 0.091); (2) significant in the test sample (p-value = 0.001); and (3) significant in the combined sample (p-value < 0.001). This is likely due to the differences in sample sizes between the training and testing data sets. For all other VOCs and data sets, the differences were not statistically significant at the α = 0.05 level.
[0080] Results based on VOC concentrations suggest that Acetoin: (1) has most concentrations above the limit of detection; (2) leads to the best predictive model in the test data; and (3) has a stable performance when transitioning from training to test data. Thus, the specific Acetoin concentration thresholds expressed in μg/L and their associated S1LC case prediction performance are explored. Because Acetoin concentrations were, on average, lower in S1LC patients compared to controls, the test follows the following rule: predict an S1LC case if log10(Acetoin concentration) < threshold_train, and predict a control otherwise.
[0081] The thresholds, threshold_train, can be chosen in many different ways to balance sensitivity and specificity.
Here, the following thresholds on the percentiles of Acetoin concentrations in the training data of controls are considered: (a) the 10th percentile (0.026 μg/L); (b) the 25th percentile (0.044 μg/L); and (c) the 50th percentile (0.098 μg/L). These thresholds are provided directly on the concentration scale. The corresponding thresholds, threshold_train, on the log10 concentration scale can be obtained by taking the log10 transformation of the thresholds on the concentration scale. These choices are made for illustration purposes only.
[0082] FIG.11 displays the Acetoin concentration for each biopsy-confirmed S1LC case (dark grey dot) and control (grey dot). The x-axis is the test group number, starting from 31 because the first 30 groups were used for training. On each vertical line there are either: (1) two dots (one dark grey and one grey), when the group contains a biopsy-confirmed S1LC case and a matched control; or (2) three dots (one dark grey and two grey), when the group contains a biopsy-confirmed S1LC case, a matched control, and a housemate control. For example, group 31 has two dots and group 32 has three dots (dots shown on vertical lines). The y-axis is labeled on the scale of the concentration (μg/L), even though the data were log10 transformed for visualization purposes. The dashed horizontal lines correspond to the classification thresholds based on the distribution of Acetoin concentration in controls in the training data set: the 10th percentile (0.026 μg/L) shown in black, the 25th percentile (0.044 μg/L) shown in light grey, and the 50th percentile (0.098 μg/L) shown in magenta. For each threshold, study participants below the corresponding line are classified as cases and those above the line as controls. In summary, the color of a dot is the true S1LC case status (dark grey: cancer; grey: control), while the position of the dot relative to one of the horizontal lines is the prediction of S1LC case status (below: cancer; above: control).
This figure shows the tradeoff between false positive and false negative predictions as a function of the threshold on Acetoin concentrations.
[0083] Table 18 further quantifies the results displayed in FIG.11. The part of the table labeled “Test Data” corresponds exactly to FIG.11 (test data), while the part labeled “All Data” corresponds to the combination of training and test data (corresponding figure not shown). For example, consider the scenario in which S1LC cases are predicted when the Acetoin concentration is below 0.026 μg/L. When the test data are used, 37 S1LC cases and 49 controls are correctly identified, 20 cases are incorrectly classified as controls and 35 controls are incorrectly classified as cases. When all data are used (cases and controls), 44 S1LC cases and 93 controls are correctly identified, 41 cases are incorrectly classified as controls and 40 controls are incorrectly classified as cases.
[0084] FIG.11 illustrates a graphical view of classification based on Acetoin concentration thresholds using the test data. The x-axis is the group number (starting at 31 because the first 30 groups are for training), each group with either two or three study participants. The y-axis is labeled on the concentration scale (μg/L), but data are log10 transformed. Each point is a study participant (dark grey: S1LC case; grey: control). Horizontal lines correspond to three thresholds based on percentiles of the Acetoin concentration distribution in all training data controls: 10th (0.026 μg/L, shown in black), 25th (0.044 μg/L, shown in light grey) and 50th (0.098 μg/L, shown in magenta). For each threshold, participants below the line are classified as cases and those above the line as controls.
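The threshold rule and the classification counts quoted in paragraph [0083] can be turned into a short worked example. The following Python sketch is illustrative only: the function names `classify` and `summarize` are hypothetical, while the threshold of 0.026 μg/L and the test data counts (37, 20, 49, 35) are taken from the text. Because the log10 transformation is monotone, comparing concentrations to the threshold on the original scale gives the same classification as comparing on the log10 scale.

```python
def classify(acetoin_ug_per_l, threshold_ug_per_l):
    """Predict 'case' when the Acetoin concentration is below the threshold."""
    return "case" if acetoin_ug_per_l < threshold_ug_per_l else "control"


def summarize(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity, accuracy) from classification counts."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    accuracy = (true_pos + true_neg) / (true_pos + false_neg + true_neg + false_pos)
    return sensitivity, specificity, accuracy


# Test data counts for the 0.026 ug/L threshold, as quoted in the text:
# 37 cases correctly identified, 20 cases missed,
# 49 controls correctly identified, 35 controls misclassified as cases.
sens, spec, acc = summarize(true_pos=37, false_neg=20, true_neg=49, false_pos=35)
print(round(sens, 3), round(spec, 3), round(acc, 3))

# Hypothetical single measurements classified against the same threshold.
print(classify(0.020, 0.026), classify(0.050, 0.026))
```

Raising the threshold classifies more participants as cases, which increases sensitivity at the cost of specificity.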
Table 18: Classification table for three Acetoin concentration thresholds using the Test data and All data (training + test)
Table 19: Estimated sensitivity (proportion of correctly identified S1LC cases), specificity (proportion of correctly identified controls), and accuracy (proportion of correctly classified cases and controls) for three Acetoin concentration thresholds using the Test data and All data (training + test)
[0085] Table 19 provides the estimated sensitivity (proportion of correctly identified S1LC cases), specificity (proportion of correctly identified controls), and accuracy (proportion of correctly classified cases and controls). The part of the table labeled “Test Data” corresponds exactly to FIG.11 (test data), while the part labeled “All Data” corresponds to the combination of training and test data (corresponding figure not shown). There is a direct correspondence between Tables 19 and 18. For example, consider the scenario in which S1LC cases are predicted when the Acetoin concentration is below 0.026 μg/L. When the test data are used, sensitivity is 0.649 = 37/(37 + 20), specificity is 0.583 = 49/(49 + 35) and accuracy is 0.61 = (37 + 49)/(37 + 20 + 49 + 35).
[0086] The focus so far has been on the prediction performance of concentrations when they are above the limit of detection, which was the main goal of the study. However, several VOCs have large proportions of observations that are below the limit of detection. Thus, there is a need to investigate whether being above or below the limit of detection predicts S1LC status. To conduct this analysis, missing VOC concentrations were recoded as 0 and those present were recoded as 1. These recoded variables are referred to as presence/absence of individual VOCs.
Table 20: Missing concentrations individual VOC discriminative ability.
Prediction performance measured as area under the curve (AUC) in the training and test data when predicting S1LC cases based on individual binary predictors defined as “above or below LOD” for each VOC. VOCs are ordered by name, not by any measure. [0087] Analyses were conducted using presence/absence data for individual quantifiable VOCs as predictors and S1LC case indicators as the outcome. Table 20 provides the training and test AUC for the presence/absence data of each VOC. All models are univariate (using one presence/absence predictor). The test AUCs for all compounds, except p-Cymene, are close to 0.5. The AUC for p-Cymene is 0.633 in the training data and 0.580 in the test data. The limit of detection for p-Cymene (see Table 21) was 0.00011 μg/L. The model uses a decision rule of having a p-Cymene breath concentration below 0.00011 μg/L to predict S1LC cases. [0088] Analysis of VOC concentration data indicated that Acetoin was the strongest predictor of S1LC cases in the test data set. Analysis of presence/absence data indicated that p-Cymene being below the limit of detection was predictive of S1LC. Here, the investigation is focused on whether the combination of Acetoin and presence/absence of p-Cymene, given its specific LOD in our study, performs better than Acetoin alone. [0089] Results indicate that the model with Acetoin alone has better prediction performance (training AUC = 0.649; testing AUC = 0.65) than the model with Acetoin and the indicator variable for presence/absence of p-Cymene (training AUC = 0.606; testing AUC = 0.504). Table 21: Compounds concentration range, limit of detection, and upper bound of calibration curve. This is for all data (train and test). All values are in μg/L [0090] Table 21 provides the range of the distribution of detected concentrations for Bag 2 in all analyzed data (testing and training combined) and the corresponding limit of detection for every compound. All values are expressed in μg/L. 
For example, for 2-Pentanone the minimum observed concentration was 0.00133 μg/L and the maximum observed concentration was 0.22125 μg/L, with a limit of detection of 0.00130 μg/L and an upper bound of the calibration curve of 0.10000 μg/L. It is worth noting that most limits of detection are in the nanograms (one thousandth of one microgram) per liter (ng/L) range. The highest limit of detection among the thirteen quantifiable compounds in this study is that of Toluene, 0.01854 μg/L, or approximately 18.5 ng/L. [0091] The maximum upper bound for concentration for each compound is related to the data available for calibrating the curves. A few observations were estimated to be above the upper bound and were based on extrapolation of the calibration curve. All analyses were based on data that included these few extrapolated values. Two sensitivity analyses were conducted by: (1) removing all observations that were above the upper bound of concentrations; and (2) removing all observations that were more than 20% above the upper bound. Results were robust to these changes in the data, most likely because very few data points were affected by this problem. [0092] Focus has been on thirteen quantifiable VOCs, which were identified from the literature as potential predictors of cancer and for which calibration (transformation from peak area to concentrations) was possible. These are referred to as quantifiable compounds, though the term is specific to this analysis and report, as the number and type of VOCs that are quantifiable can vary with the study. However, there is a large number of VOCs that were not calibrated in the data. More precisely, they have an associated peak area measurement, but do not have a corresponding concentration expressed in international units of measurement. These VOCs will be referred to as “unquantifiable” VOCs, though the list of VOCs that are not quantifiable can vary substantially from study to study. 
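The two sensitivity analyses described in paragraph [0091] amount to filtering observations against the upper bound of the calibration curve, with either no tolerance or a 20% tolerance. The sketch below is illustrative only: the function name filter_by_upper_bound and the intermediate sample values are hypothetical, and only the 2-Pentanone minimum (0.00133), maximum (0.22125), and upper bound (0.10000) come from the text.

```python
def filter_by_upper_bound(concentrations, upper_bound, tolerance=0.0):
    """Keep concentrations that do not exceed upper_bound * (1 + tolerance).

    tolerance=0.0 drops every extrapolated value (sensitivity analysis 1);
    tolerance=0.20 drops only values more than 20% above the bound (analysis 2).
    """
    limit = upper_bound * (1.0 + tolerance)
    return [c for c in concentrations if c <= limit]


# Hypothetical 2-Pentanone concentrations in ug/L; 0.05 and 0.11 are made up.
values = [0.00133, 0.05, 0.11, 0.22125]
upper_bound = 0.10000  # upper bound of the calibration curve quoted in the text

strict = filter_by_upper_bound(values, upper_bound)
relaxed = filter_by_upper_bound(values, upper_bound, tolerance=0.20)

print(strict)   # values at or below the upper bound
print(relaxed)  # values at most 20% above the upper bound
```

Re-running the downstream analysis on each filtered set and comparing the results with the unfiltered analysis is one way to check robustness of the kind the text reports.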
[0093] In the study, Tentatively Identified Compounds (TICs) information was used for the unquantifiable VOC analysis. This information was obtained directly from a Chromeleon CDS system (Version 7.2.8 with NIST MS search V.2.0, Thermo Fisher Scientific). As mentioned in the EPA TIC (2006) document : “The [TICs] identification is not considered “absolute” or “confirmed” until a known standard for the suspect compound can be analyzed on the same instrument which made the tentative identification.” Due to various constraints this was not done for the unquantifiable VOCs in the study, though it was done for the 13 quantifiable VOCs. [0094] Before the study started it was not known what and how many additional VOCs will be identified and in what proportion of the study participants each VOC will be present. Here an exploratory analysis of the unquantifiable VOCs identified in the study is provided. The statistical analysis mirrors the one conducted for the log peak area of quantifiable compounds and quantifies: (1) the association between presence/absence of each VOC and the S1LC case indicator; and (2) the association between log 10 peak corresponding to each VOC and the S1LC case indicator. The same training/test data split used for the analysis of quantifiable VOCs was used for the unquantifiable VOCs. For prediction purposes only data based on Bag 2 (alveolar) was used, though some summary statistics are presented for Bag 1 (tidal), as well. [0095] In the case of quantifiable compounds only one peak area was returned by the software and calibrated to concentrations. However, for some unquantifiable VOCs sometimes there are multiple peak areas that are associated with the same compound. In these cases the area of the maximum peak was used and the other peaks were discarded. Additional analysis could be conducted using the sum of the areas or a repeated measures analysis. 
Only VOC peaks that were identified as being “excellent” were retained based on the criterion that both the Similarity Index (SI) and the Reverse Search Index (RSI) are greater than or equal to 900. Table 22: Number of compounds that are present in each bag and the total number of distinct compounds across two bags, by compound quality. Training and test data combined. [0096] There were 167 total VOCs with identified peaks, out of which 60 were identified as “good” (both SI and RSI greater than or equal to 800), and 24 compounds with “excellent” data (both SI and RSI greater than or equal to 900). These numbers contain VOCs in either bag that were identified in at least one study participant in all included data (training and test combined). Results are summarized in Table 22. [0097] However, the number of VOCs identified in the breath of at least one individual was different depending on the bag. For example, there were 129 total VOCs in Bag 1 compared to 144 in Bag 2, 48 VOCs of “good” quality in Bag 1 compared to 55 in Bag 2, and 23 VOCs of “excellent” quality in Bag 1 compared to 22 in Bag 2. [0098] Recall that the training data consists of 30 groups with a total of 81 study participants, with 30 cases and 51 combined matched and housemate controls. The test data consists of 58 groups with a total of 144 study participants, with 58 cases and 86 combined matched and housemate controls. [0099] Table 23 provides the results for Fisher’s exact test of the null hypothesis of no association between the presence/absence indicator of individual VOCs and S1LC case status in the training data for Bag 2. The column labeled N present denotes the number of cases and controls that have the specific VOC present among the 81 study participants (ncases = 30; ncontrols = 51) Bag 2 training data. Results are shown for VOCs with a p-value for Fisher’s exact test less than 0.5 (not 0.05) for exploratory reasons. 
VOCs are ranked from the smallest (stronger evidence against the null hypothesis) to the largest p-value. The columns labeled “Sensitivity” and “Specificity” provide the sensitivity and specificity of the test that predicts a S1LC case if the VOC is present in the training data. Table 23: Fisher’s exact test of the null hypothesis of no association between the presence/absence indicator of individual VOCs and S1LC case status. N present: number of participants that have the VOC present among the 30 cases and 51 controls. Results are shown for VOCs with a p-value for Fisher’s exact test less than 0.5, with corresponding sensitivity and specificity. An individual is predicted to be a S1LC case if the VOC is present. All results are based on the training data for Bag 2. [00100] Table 24 provides the AUC for the training and test data for the presence/absence data for the top eight unquantifiable compounds in the study. Ranking was based on the p-values of the Fisher’s exact test for no association between presence/absence and S1LC case status in the training data. Current software implementations of AUC (as implemented in the function prediction in R package ROCR) are used, though this may be inappropriate for binary predictors. A better measure of AUC is estimating the AUC without adding in ties, which tends to provide lower values of AUC. However, this version of AUC is used to keep the AUC calculations consistent within this report. [00101] With the exception of Argon (Training AUC=0.607, Test AUC=0.509), there is good agreement between the training and test AUC. This may be due to the fact that for binary prediction there is no tuning parameter (decision threshold). Thus, the consistency of AUCs is a consequence of the stability of missing VOC proportions in the training and test data. The presence/absence of the VOCs listed in Table 24 could be potentially useful for building prediction models for S1LC cases. 
However, the definition of presence/absence depends substantially on the technology used and its VOC detection sensitivity. In the absence of information about limits of detection and calibration curves this information cannot be directly generalizable. In this section the prediction performance of the log 10 peak area of unquantifiable VOCs for S1LC case status is explored. Only VOC peaks that were identified as being of “excellent” quality (SI and RSI greater than or equal to 900) are used. FIG.12 displays the boxplots of the log 10 peak areas in the training data for VOCs that had at least 5 cases and 5 controls with data of “excellent” quality. The grey boxplots correspond to combined matched and housemate controls and dark grey boxplots correspond to biopsy-confirmed S1LC cases. In the training data set some of the unquantifiable VOCs have higher log 10 peak areas in S1LC cases than controls; see, for example, Acetone, Argon, Carbamic Acid, Carbon Dioxide, and Isopropyl Alcohol. However, other unquantifiable VOCs have lower log 10 peak areas in S1LC cases than controls; see, for example, 1,4-Pentadiene, Ethanol, and N,N-Dimethylacetamide. Table 24: P-values and AUCs for presence/absence predictors of S1LC cases in the training and testing data.
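The remark in paragraph [00100] about ties can be made concrete for a binary presence/absence predictor of the kind summarized in Table 24. The sketch below computes the AUC as the proportion of case-control pairs in which the case scores higher, with a configurable weight for tied pairs. It is not the ROCR implementation; the function name pairwise_auc and the counts used (18 of 30 cases and 20 of 51 controls with the VOC present) are hypothetical, and reading "without adding in ties" as a tie weight of zero is an assumption.

```python
def pairwise_auc(case_scores, control_scores, tie_weight=0.5):
    """AUC as the fraction of (case, control) pairs where the case scores higher.

    Tied pairs contribute tie_weight each: 0.5 is the usual convention,
    0.0 ignores ties (assumed here to match "without adding in ties").
    """
    wins = 0.0
    for a in case_scores:
        for b in control_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += tie_weight
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical presence (1) / absence (0) data: 18 of 30 cases and
# 20 of 51 controls have the VOC present.
cases = [1] * 18 + [0] * 12
controls = [1] * 20 + [0] * 31

auc_with_ties = pairwise_auc(cases, controls, tie_weight=0.5)
auc_without_ties = pairwise_auc(cases, controls, tie_weight=0.0)

# For a binary predictor, the tie-weighted AUC equals the mean of
# sensitivity and specificity of the "present -> case" rule.
sensitivity = 18 / 30
specificity = 31 / 51

print(round(auc_with_ties, 3), round(auc_without_ties, 3))
```

With a binary predictor most pairs are tied, so the choice of tie weight dominates the result, which is consistent with the text's caution about applying standard AUC software to presence/absence data.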
[00102] Table 25 displays the S1LC case prediction performance of log10 peak area of unquantifiable VOCs based on t-tests and AUCs. VOCs are ranked from the smallest to the largest p-value for the t-test and only VOCs with an AUC larger than 0.55 are shown. Also shown are the number of samples available for each compound broken down by case status. Table 26 displays results similar to Table 25, but includes VOCs that had an AUC greater than 0.55 in either the training or test data sets. VOCs are ranked from the largest to the smallest AUC in the test data. [00103] Phosphonic acid has a large training AUC (0.838), but this is based on a small number of study participants who had this particular VOC detected (9 cases and 11 controls). In the test data the AUC for Phosphonic acid is much smaller (0.538), based on a larger number of study participants who had this particular VOC detected (31 cases and 34 controls). Carbamic acid (training AUC = 0.637, test AUC = 0.595), Acetone (training AUC = 0.572, test AUC = 0.698), Carbon dioxide (training AUC = 0.658, test AUC = 0.512), and Cyclopropane (training AUC = 0.571, test AUC = 0.532) have been identified as possible targets for further investigation. All compounds in Table 26 could be of interest in future analyses. [00104] Overall, a list of promising unquantifiable VOCs, based both on presence/absence and on peak area, was identified. To evaluate the translational potential of these findings additional studies would need to be conducted, including developing calibration curves to transform peak area values into concentrations and independent validation studies. Given the experience with quantifiable VOCs, results may or may not be reproducible depending on the limits of detection and the patterns of missingness induced by technological limitations. 
[00105] FIG.12 illustrates graphical views of boxplots of log 10 (peak) for unquantifiable VOCs separated by cases (dark grey) and housemate and matched controls combined (grey). The x-axis shows the compounds and the y-axis labels are displayed on the original scale even though the data were log 10 transformed.
Table 25: Training data: area under the curve (AUC) and p-values for unpaired t-tests for prediction of S1LC case status from individual unquantifiable VOC log 10 peak areas. Mean: mean log 10 peak areas in cases and controls, respectively. Number: number of study participants with a particular VOC among cases, controls, and combined. VOCs are ranked from the largest to smallest AUC in the training data and only VOCs with AUC larger than 0.55 are shown.
Table 26: Test data: area under the curve (AUC) and p-values for unpaired t-tests for prediction of S1LC case status from individual unquantifiable VOC log 10 peak areas. Mean: mean log 10 peak areas in cases and controls, respectively. Number: number of study participants with a particular VOC among cases, controls, and combined. Only VOCs with AUC larger than 0.55 in either the training or test data are shown. VOCs are ranked from the largest to smallest AUC in the test data.
[00106] Many VOCs in exhaled breath had low concentrations in the range of 0.0001 to 17.4973 μg/L for Acetoin and 0.00011 to 0.22125 μg/L for all other VOCs. Each VOC had a different LOD and the percent of VOC measurements below the LOD for most VOCs was high for combined, training, and test data. Among the thirteen quantifiable VOCs considered in this analysis, only four VOCs were below the LOD in fewer than 10% of all samples: 2-Pentanone (7.6%), Acetoin (3.1%), Heptanal (8.4%), and Dodecane (1.8%). The proportion of VOCs below LOD among cases and controls in testing and training data was similar for all VOCs except p-Cymene. For p-Cymene the percentage of measurements below LOD was higher in S1LC cases. In the training data, 64% of the measurements were below LOD among cases and 38% among controls. In the test data 70% of the measurements were below LOD among cases and 54% among controls. [00107] FIGS.13A and 13B illustrate graphical views of the distributions of VOC concentrations for training (FIG.13A) and test (FIG.13B) data separated by cases and control types. Acetoin concentrations tended to be lower both in training and test cases, while Heptanal and Dodecane concentrations tended to be lower in training and roughly similar in test samples. 2-Pentanone concentrations tended to be higher in both training and test cases than controls, though the difference was not significant (combined data t-test p-value = 0.699; see Table 17). [00108] As several VOCs had large proportions of observations that are below the LOD, the predictive performance was investigated for every VOC being above/below the limit of detection (LOD). Univariate analyses of the prediction performance of S1LC cases using the predictors “above or below the LOD” indicated that p-Cymene had the highest predictive accuracy (training AUC = 0.630; testing AUC = 0.580; see Table 20). 
The limit of detection for p-Cymene was 0.00011 μg/L; thus, the model uses a decision rule of having a p-Cymene breath concentration below 0.00011 μg/L to predict S1LC cases. The test AUCs for the remaining 12 VOCs were close to 0.5, indicating that being above or below the LOD was not predictive of S1LC. [00109] Table 17 presents the results of comparing the mean of the log10 concentration among cases and combined controls in the training, test, and combined test and training data using unpaired t-tests. With the exception of Acetoin, the difference between cases and controls was not statistically significant for any of the group comparisons. For Acetoin the difference in the means was: (1) not significant in the training sample (p-value = 0.091); (2) significant in the test sample (p-value = 0.001); and (3) significant in the combined sample (p-value < 0.001). These differences are likely due to the difference in sample size; for example, for Acetoin there are 28 cases and 49 controls in the training data, but there are 85 cases and 133 controls in the combined data. [00110] Table 16 provides the S1LC case prediction performance of individual VOCs using univariate and multivariate forward selection logistic regression based on log 10 concentrations above the LOD. In univariate models (one predictor at a time) Acetoin and Heptanal have training AUC greater than 0.6, while other compounds have AUCs close to 0.5. The AUC for Acetoin is 0.649 in the training data (N=77) and 0.650 in the test data (N=141), indicating that the predictive performance of Acetoin was preserved in the test data. In contrast, the AUC for Heptanal is 0.610 in the training data (N=68) and only 0.511 in the test data (N=138), indicating that Heptanal may not be a reliable predictor of S1LC cases. Dodecane has a consistent, low AUC for training (0.574) and test (0.541) data. 
[00111] Cumulative AUCs for the multivariate forward selection logistic regression as additional VOCs are included into the model are provided in Table 16 for both the training and test data. Acetoin is the strongest predictor with a training AUC of 0.649 and a test AUC of 0.650. Adding Heptanal increases the training AUC to 0.669 and decreases the test AUC to 0.559. Adding 2-Pentanone to the model increases the training AUC (from 0.669 to 0.689) though the test AUC of 0.601 is lower than the test AUC of 0.65 for Acetoin alone. A two-variable model adding either Dodecane or 2-Pentanone could also be considered. However, more complex models are not considered at this time given the low individual AUC values for these VOCs and the high correlations among the other log concentrations of VOCs (Table S3 in the supplementary materials). [00112] Results based on VOC concentrations suggest that Acetoin: (1) has most concentrations above the limit of detection; (2) leads to the best predictive model in the test data; and (3) has a stable performance when transitioning from training to test data. Thus, the specific Acetoin concentration thresholds expressed in μg/L and their associated S1LC case prediction performance are examined. Because Acetoin concentrations were, on average, lower in S1LC patients compared to controls, the test follows the following rule: if Acetoin(test) < 10^threshold(from training data), the participant is classified as a S1LC case; if Acetoin(test) ≥ 10^threshold(from training data), the participant is classified as a control. [00113] The threshold (from training data) can be chosen in many different ways to balance sensitivity and specificity. Here the following thresholds were considered based on the percentiles of Acetoin concentrations in the training data of controls: (a) the 10th percentile (0.026 μg/L); (b) the 25th percentile (0.044 μg/L); and (c) the 50th percentile (0.098 μg/L). 
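The threshold rule in paragraphs [00112] and [00113] can be sketched as two steps: take a percentile of the Acetoin concentrations among training controls, then classify a test participant as a case when the concentration falls below that value. The code below is an illustration under stated assumptions, not the study's implementation: the helper names percentile and classify are hypothetical, the training-control concentrations are made up, and the linear-interpolation percentile is only one of several conventions (the text does not say which one was used).

```python
def percentile(values, p):
    """Percentile by linear interpolation between order statistics (p in 0..100).

    This is one common convention; the source does not state which was used.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("values must not be empty")
    rank = (len(xs) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (rank - lo) * (xs[hi] - xs[lo])


def classify(concentration, threshold):
    """Below the threshold -> predicted S1LC case; at or above -> control."""
    return "case" if concentration < threshold else "control"


# Hypothetical Acetoin concentrations among training controls.
training_controls = [0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.50]

median_threshold = percentile(training_controls, 50)
quartile_threshold = percentile(training_controls, 25)

# Applying the rule with the 10th-percentile threshold quoted in the text.
below = classify(0.020, 0.026)
at_threshold = classify(0.026, 0.026)

print(median_threshold, quartile_threshold, below, at_threshold)
```

Fixing the threshold from training controls before looking at the test data keeps the reported test sensitivity and specificity free of tuning on the test set.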
[00114] This was the largest case-control VOC study to date with the inclusion of a healthy control and a housemate control to aid in the elimination of potential environmental confounders for VOCs that may indicate the presence of lung cancer. The control group (S1LC) cases was diverse in terms of covariates and analytic approach of combining type 1 and 2 cases ensures that study results are generalizable to the population. The novelty of the study consists of its focus on: (1) early lung cancer detection, specifically S1LC; (2) a practical, translatable and reproducible signature of breath VOCs for S1LC; (3) a design of experiment targeted at eliminating potential confounders due to environment, technology, and breath analysis procedure; and (4) definition of training and testing data sets before data were collected. The data present results that are contrary to the published literature, indicating that: (a) most VOCs published in the literature have a weak or nonexistent association with S1LC; (b) Acetoin, the only VOC that was associated with S1LC, has a much lower predictive performance than the performance of previously published VOC signatures, though none of these results specifically focused on S1LC; and (c) Acetoin concentrations were on average lower (not higher) in the breath of S1LC cases than in controls. Acetoin has an AUC of 0.65 with a sensitivity of 87.1% (specificity of 36.8%) when predicting that a person has S1LC if the Acetoin concentration is below 0.098 μg/L. This is a promising result that will need further investigation as this single VOC approaches the sensitivity of LDCT 4 . [00115] Acetoin has not been a VOC closely studied in its relationship to lung cancer, and in a recent review article on VOCs it was not described as a candidate VOC for the detection of lung cancer; it is typically used in flavorings for foods, as well as in e-cigarettes. As additional VOCs were added to the model, the test AUC dropped. 
This is in contrast to multiple other studies. Indeed, in a small study of seventy patients, a signature was identified without providing the specific VOCs, with a sensitivity of 81% and specificity of 91%. A prior study of 229 participants reported an AUC of 0.81, though the VOCs used were not disclosed. Another group studied 2-butanone, 3-hydroxy-2-butanone, 2-hydroxyacetaldehyde and 4-hydroxyhexanal in a large study with 405 participants, and reported a sensitivity and specificity of 93.6% and 85.6%, respectively. There are concerns about these studies, especially because: (1) the data are not available; (2) methods used are only superficially described; (3) analytic methods used can be over-fit; (4) VOC measurements are not expressed in concentration units, which implies that the measurement values may be indistinguishable from the experimental noise; and (5) there are many levels of data processing and cleaning that cannot be understood when data and code are not reproducible. [00116] There are several limitations to this trial. First, the presence of dead space in the lung can dilute VOCs in the same breath. To combat this, a separate Tedlar® bag was used for the first 150-200 cc of exhalation, followed by the rest of the breath into a 1 L Tedlar® bag. Second, the effect of condensation on VOCs is unknown and, unfortunately, this effect was not controllable in the Tedlar® bags. Third, it is not possible to control for all environmental exposures, so there may be confounders present that were not considered; this includes the potential that participants did not abstain from smoking, vaping or drinking prior to breath collection. Fourth, although the protocol planned to analyze all breaths within a 24-hour period, this was not always the case. It is possible that these delays could have led to changes in the VOC concentrations in the Tedlar® bags. Fifth, S1LC was the focus, which may not be associated with substantial changes in breath VOCs. 
This leaves the possibility that changes may occur in more advanced stages of lung cancer. Sixth, the requirement to abstain from smoking, vaping, or drinking for at least 30 minutes prior to collecting exhaled breath was chosen as a reasonable compromise between participant burden and study feasibility; however, different interval lengths could affect the concentration of individual VOCs. Last, many of the demographic confounders were based on recall, such as a family history of cancer; selective memory may have played a part in answers when participants are being biopsied to assess whether they have cancer or not. [00117] Lung cancer is the number one cause of cancer-related deaths in the United States 1 . The 5-year survival of patients identified to have lung cancer drastically decreases with each advancing stage. In the most recent American Cancer Society statistics, the 5-year survival for localized, regional and distant disease was 61%, 35%, and 6%, respectively. Given the drastic decrease in survival for every increasing stage, a minimally invasive, accurate diagnostic test is needed. [00118] Although the present invention has been described in connection with preferred embodiments thereof, it will be appreciated by those skilled in the art that additions, deletions, modifications, and substitutions not specifically described may be made without departing from the spirit and scope of the invention as defined in the appended claims.