Blue = original retraction request to Mosley and Dettmer
Brown = Dettmer reply
Green = reply to Dettmer’s reply
Black = text from the publication
We carefully read the retraction request made by the concerned readers, who are in fact the company IROA Technologies. It goes without saying that we take the readers’ concerns very seriously. We are very sorry that the paper is perceived negatively by the company. We do not share this sentiment, as the paper also states advantages of using the IROA standard. We employed the IROA standard as a stable isotope labelled internal standard for untargeted metabolomics of a highly complex matrix that is hugely different from yeast. A major concern seems to be the low number of IROA peak pairs described. Of course, this can be boosted, but the major focus of the work is the correlation of the 12C/13C ratios with quantitative data and not the number of detected peaks. In the paper we clearly state:
“In general, using a yeast extract for urine analysis is not optimal. We, however, did not address the other aspects of using the kit here, such as the range of covered metabolites, which would be readily affected by the matrix type. We, instead, focus on a few identified metabolites and investigate the use of a complex IS and its ability to improve the data correlation to a reference dataset.”
What is meant by a “reference dataset”? Is this a reference to the IS?
In the following letter we will address the concerns raised. Paragraphs taken from the retraction request are marked in blue.
“the process described in this publication bears no relationship to the TruQuant protocol as published”
The reader is correct: we did not follow the protocol slavishly but used a strategy adapted to the special requirements of urine samples. We basically used the kit as a source for a complex stable isotope labelled internal standard. In retrospect we realize that we should have stated more clearly that we, for good reasons, did not intend to employ the full workflow. Nevertheless, all steps are detailed in the experimental section, and every deviation was a compromise we deemed acceptable to achieve a practical usage of the kit in our workflow and test its ability to be incorporated in the analysis of patients’ urine specimens.
The changes to the protocol the authors made were extreme. By analogy, consider a publication that was presented to Metabolites outlining a CRISPR study in which 1) the CRISPR-cas9 was replaced with urease and the temperature was dropped to 5 degrees C, and 2) for the results a millimeter ruler was used and a big statistical analysis of the resulting data was run. It is unlikely Metabolites would accept the paper unless it showed a positive result. The only real difference between the current situation and the above is that CRISPR is not an analytical method (yet) that would require precision to measure the results. The senior author's admission that they "did not follow the protocol slavishly" and the language beyond it is yet another admission that this paper needs to be retracted. The modifications to the TruQuant protocols, and inappropriate software used all make every aspect of the data upon which this paper is based unacceptable as published.
The casualness emoted by their lack of “slavish devotion” is merely the tip of the iceberg for all of the errors contained in the paper. What is a protocol for if you don’t follow it? If you devise a new protocol that does not work, is it worthy of publication?
Note to editor: The Journal also needs to protect its reputation and retract any paper in which the data, factual errors, and protocol deviations are so severe that they cannot be rectified by a letter of concern, only by a full retraction. We believe this paper needs to be retracted.
“All of the dilutions are different both from one another and the published protocol.”
“The LTRS (Long Term Reference Standard) is diluted to half (0.5X) of the concentration required by the protocol.”
This is a deviation from the protocol, but we only used the LTRS to identify IROA peak pairs. The number of pairs detected will of course go down with a lower LTRS concentration, but in the protocol published on the Sigma website, an injection volume of 3 or even 2 μL is recommended for the protocol in positive mode. Injecting 5 μL of 0.5X LTRS (i.e., 2.5 μL 1.0X LTRS) falls within this range.
Basic chromatography: injecting 5 μL of a half-strength solution will produce broader, less distinguishable peaks. It is in no way comparable to injecting a much smaller volume of a more concentrated solution. In addition, since the column that was used is expensive, a Waters™ Atlantis™ PREMIER BEH C18 AX (1.7 μm, 150 × 2.1 mm i.d.), it was most likely protected by a guard column, especially for urines. A guard column, which is not mentioned in the experimental details, will broaden the already broad peaks further and increase the importance of injecting the smallest volume of sample at the highest concentration rather than a large injection of a dilute sample. This argument is rejected. This extreme dilution, a large injection of a dilute solution, is exactly why their peaks are so broad and weak.
Please note: Multiple times, the reader points out that the injection volume is not given. This is not correct. In the experimental section 4.3, we clearly state “Sample volumes of 5 μL were injected.” (A sample is every vial in the autosampler of the LC system.)
We apologize for missing this point.
The number of detected peak pairs is also influenced by the LC conditions (stationary phase, gradient, etc.) and of course the mass spectrometer in use.
Had the authors considered these points ahead of the study and performed the Calibration Study (that the authors performed after the paper was published), all these effects would have been actually tested, and the authors would have known that their samples were too dilute, pushing many signals closer to the LLOQ.
It is correct that 10 μL of double strength IS were added to 20 μL of sample resulting in 66% of the protocol recommendation. This was chosen to keep the dilution of low concentrated urine samples to a minimum. All samples with a creatinine concentration below 3 mM were directly used (20 μL pure urine plus 10 μL IS). It is a common consent in the metabolomics community that sample preparation in untargeted metabolomics should be non-selective and as simple as possible to keep the sample composition constant. Often, a dilute-and-shoot strategy is used, where all samples irrespective of their individual creatinine concentration are diluted by a constant factor such as 1:4 or 1:2. Unnecessarily drying the samples and reconstituting them in IS solution is not only time-consuming and impractical with large samples sets, but it also bears a large risk of changing the sample composition and thereby, introducing additional unwanted variance in the data. Therefore, we did not follow the recommendation in the protocol to dry the samples but used the described compromise.
We proposed to the authors that we work with them to find a urine method. For instance, they did in fact dry the samples for the calibration study; did it really alter the samples (other than removing volatiles)? In our experience it does not. Did they really need to do the pre-normalization step? We are very happy to see a pre-injection normalization step followed by our post-injection normalization, but they did not use the post-injection normalization, nor did they use the ion-suppression correction the protocol offers. The authors demonstrate in the Calibration study (performed after the publication) that the protocol could handle the normalization quite well even without the pre-injection normalization (which we would still encourage). They are correct that urine is a complex and difficult medium. Studies have demonstrated that using creatinine as a normalization factor is not reliable, as it fluctuates wildly under normal physiologic activities (Metabolites. 2019;9:198. doi: 10.3390/metabo9100198; and Metabolites. 2020;10(9):376. doi: 10.3390/metabo10090376. PMID: 32961779; PMCID: PMC7570207), but as a pre-injection normalization tool followed by a post-injection normalization tool the results should be better, and better still because the suppression that urine is known to induce will be corrected for. The authors made no use of these key benefits of the protocol; therefore, their data is both extremely ion suppressed and not normalized (see Figure 1 for both). In addition, the samples were overly dilute to start with and were then further diluted, and the volumes injected broadened the peaks; all of this acted to push this dataset closer to the LLOQ.
“Adding 10 μL of double strength IS to 20 μL of sample resulted in a concentration of 66% of the protocol requirements, i.e. not “the correct concentration” as stated in the text of the publication in question.”
The quoted expression “the correct concentration” is not to be found in our publication. We wrote:
“To each vial, 600 μL of pure water was added, instead of the recommended 1200 μL, to keep a proper concentration after dilution with urine.”
We apologize that the word “correct” replaced the word “proper” in our comments.
We did not want to overdilute the IS, aiming for an acceptable concentration.
(Note that by writing “instead of the recommended 1200 μL” we did point out a deviation from protocol.)
The only deviation in concentration is between samples/ IS blanks (both 66%) and the LTRS (50%). The “IS concentration” was 66% both in the samples and in the IS blanks. Adding 80 μL water to 40 μL IS solution (of course the double strength solution, prepared as described) is the same ratio as adding 10 μL IS solution to 20 μL of urine. Hence, all the concerns raised multiple times throughout the letter with respect to the IS blank concentration are void as it is identical with the IS sample concentration.
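The dilution arithmetic in the paragraph above can be checked with a minimal sketch. It assumes the IS stock is exactly double strength (2.0X, from dissolving in 600 μL instead of 1200 μL) and that volumes are simply additive; the function name is hypothetical and used only for illustration.

```python
def relative_is_concentration(stock_strength, is_volume_ul, diluent_volume_ul):
    """Final IS strength, as a multiple of the protocol's 1.0X, after mixing."""
    total_volume_ul = is_volume_ul + diluent_volume_ul
    return stock_strength * is_volume_ul / total_volume_ul

# Assumption: the IS stock is 2.0X (600 uL of water instead of 1200 uL).
sample_strength = relative_is_concentration(2.0, 10, 20)  # 10 uL IS + 20 uL urine
blank_strength = relative_is_concentration(2.0, 40, 80)   # 40 uL IS + 80 uL water
ltrs_strength = 0.5                                       # LTRS as stated in the text

print(f"sample:   {sample_strength:.3f}X")
print(f"IS blank: {blank_strength:.3f}X")
print(f"LTRS:     {ltrs_strength:.3f}X")
```

Under these assumptions the sample and the IS blank both come out at about 0.667X, while the LTRS is at 0.5X.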
The critical element here is that all three be the same concentration to minimize any chromatographic shifts that are concentration dependent. The fact that they are all very dilute and were injected as larger injections simply served to broaden what would already be broad peaks, may minimize the concentration effects, but as we have seen it minimized the ability to see the peaks. Had the authors done the calibration experiment to test their conditions they would have seen and corrected these errors up front. The IS blank needs to be at the same concentration as the LTRS is to create RT linkages between the LTRS, and the IS containing samples, of which the IS blank is the most critical.
“The authors did not follow a calibration process to balance the sample concentration to the internal standard concentration, which is a required prerequisite to assure an adequate LC- MS signal strength.”
Urine is a complex and very variable matrix. Even after normalization to creatinine, metabolite concentrations can vary by factors of 5 to 10, or even higher (for ranges see for example: Bouatra et al. The human urine metabolome. PLoS One. 2013 Sep 4;8(9):e73076.). As mentioned above, a dilute-and-shoot approach is widely used in urinary metabolomics. We refined the dilute-and-shoot approach by diluting the samples to similar creatinine concentrations because a uniform dilution would also dilute samples with low creatinine concentration. Therefore, samples with creatinine concentrations below 3 mM were not diluted, but only mixed with IS solution. We chose the urine dilution factors based on previous experience, the creatinine concentration in our samples and the overall signal intensities in the base peak chromatograms of urine samples. Diluting the urine to 1 – 2 mM creatinine still shows highly abundant, even saturated peaks in the base peak chromatograms (BPCs). In our original sample set of 244 urine specimens, creatinine ranged from 0.7 to 26.3 mM creatinine with a median value of 6.19 mM creatinine.
With this wide creatinine range, the selection of a sample for the calibration experiment is almost impossible. One can choose a pool sample, but individual samples will deviate from the pool. The recommended calibration experiment requires drying down different sample (extract) volumes. Since we did not intend to dry the urine samples, we did not perform a full calibration experiment, but chose the urine dilution factor as described above (Note: In the product information from Sigma Aldrich we find the following sentence “it is recommended that a “calibration” step is done ahead of time, i.e. test different amounts of standard prep with the same amount of Internal Standard to figure out how much to balance with the Internal Standard.” Recommended not mandatory).
Rather than being “almost impossible”, the range quoted above would have been a perfect basis for the design of the calibration study, and it could/should have been done on a pooled sample, although any high creatinine sample would have worked. The good thing about urines is that they are rarely limiting in quantity.
Had the authors followed the protocol and performed the recommended Calibration study, it would have demonstrated that their arbitrarily selected median concentration (< 3 mM creatinine) was inappropriately low and would have guided them in the design of the study.
IROA cannot “mandate” a method, but only explain the value and make recommendations. The authors failed to follow these recommendations and did not follow the protocol; therefore, they cannot draw such conclusions as stated in the abstract: “On the positive side, the ratio approach helps to reduce batch effects, but it does not perform better than computational methods such as the “removebatcheffect” function in the R package Limma.” Although the authors saw this benefit, had they actually done the experiment correctly, calculated the ratios correctly, and completed a suppression-correction and normalization, both of which are error-correcting steps, they would have seen a much more significant improvement. Furthermore, the TruQuant normalization is a Dual MSTUS normalization that normalizes the sample to the internal standard. In doing this the sample is matched to all samples that perform a dual-MSTUS Normalization using the same concentration of the same IS. We believe this will effectively remove the concept of batch entirely.
How is “better” defined or even demonstrated here? At no point prior to running the calibration did the authors apply the IROA normalization. As seen in their Figure 1 (below), the TruQuant normalizations have made all samples perfectly comparable despite significant sample-to-sample size variation. There is no indication that the authors even understood what the TruQuant normalization procedure entailed.
In Figure 5 the authors show PCAs on 1) their full feature list (“T-list”, Fig 5a, which had the most error, was ion suppressed, not normalized and included, as the majority population, artifacts and noise peaks), 2) their measure of “IROA-Abs” (Fig 5b, which had significant error but included only biological compounds, was strongly ion suppressed, and not normalized), and 3) their version of the “IROA-ratio” (Fig 5d, which had the least error of these three, was based on partial measurements of the C12 and IS, i.e. monoisotopic peaks only and not the full carbon envelope, and was not normalized). All three of these datasets were obtained from peaks that had very poor signal to noise (S/N) because they were closer to LLOQ than they should have been. Despite the poor quality of the collected data (1, 2, and 3), it is no surprise that the “best” was the IROA ratio. Also, it is no surprise that in both of the raw IROA datasets all of the QCs were 1) clustered in PC1 at the zero point, 2) were found in a much tighter range in both the PC1 and the PC2 axes, and 3) that the “total variance accounted for” in both IROA datasets was twice that of the “T-List”. Had the authors understood and followed the TruQuant protocol and used the data that was first suppression-corrected, and then suppression-corrected and normalized, each would have yielded, respectively, less and less error. (See the authors’ Figure 1 below to see the suppression correction and normalization in these extremely suppressed and different samples.) PCs of these “corrected” datasets should have been even tighter. The IROA standard report provides all four of these datasets as a matter of due diligence.
As discussed above, three of the graphs in Figure 5 are based on raw data that was improperly determined, collected close to the LLOQ, and whose basis was calculated based on only partial readings of the peak’s true qualities, the final three graphs represent post-hoc manipulations to correct the error in these datasets. These manipulations were applied inconsistently, the “T-list” was manipulated using a Z-score normalization (for each datapoint) while the RBE normalization is applied to each sample. Why were the Z-score normalizations not done on either IROA sample or the “RBE” applied to the “IROA ratio”? Clearly, the “IROA-ratio” raw data was the best of the raw datasets and similar comparisons would have been informative, but the suppression-corrected and normalized datasets would have had even lower total error, and is all based on real time physical (experimental) data, i.e. it is not based on a post hoc artificial manipulation that is not reproducible across time and experiments. Without the similar analysis of the IROA suppression-corrected data, and IROA-normalized data Figure 5 cannot be interpreted at all in relation to the IROA TruQuant protocol, and Figure 5 cannot be allowed to remain. It is too severely flawed.
We agree that urine is a difficult sample: it is chemically complex with unusually modified chemical components, and it is highly variable in concentration. All of this strongly suggests the recommended small Calibration study would have benefited them, as it also would have benefited them to follow the protocol. The authors would have known the suppression-corrected and normalized datasets were produced and understood their use. The authors’ lack of “slavish” attention to the protocol and lack of understanding of its fundamental properties cost them the results.
However, to address the concern with regards to the calibration, we now performed a calibration experiment according to the protocol using the IS and LTRS dilutions from the protocol and 11 creatinine concentrations (0.25, 0.5, 1.25, 1.5, 2, 2.5, 3.75, 5, 7.5, 10, and 20 mM) in triplicates. All samples (40 μL) were dried and reconstituted in 40 μL IS solution. As a result, a graph (see Figure 1) is obtained that indicates a creatinine concentration that “yields an overall mass spectral signal that is equal to the overall mass spectral signal of the IS. This is the amount of sample that will most accurately be measured using the IS in the future, i.e. well balanced by the standard 40 μL of IS.” (Ref: product information sheet from Sigma Aldrich). In Figure 1, this is the intersection between the lines of the normalized IS MSTUS marked by blue squares and the normalized C12 MSTUS values marked by green crosses and the line of the suppression corrected C12 values marked by red circles. In our case, this would be 6.5 mM creatinine! We did send the graph to IROA technologies to make sure that we had indeed interpreted the graph correctly. We then repeated the original experiment using a subset of 26 urine specimens with original creatinine concentrations equal to or greater than 6.5 mM creatinine. Aliquots of 40 μL of urine with a creatinine concentration of 6.5 mM (either pure or prediluted to 6.5 mM) were dried and reconstituted in IS (dissolved according to protocol in 1.2 mL) so that each sample contained a final urine concentration of 6.5 mM. The LTRS and IS blanks were also prepared according to protocol. Additionally, aliquots from the same 26 urine specimens were diluted to 2 mM creatinine to obtain concentrations comparable to our original data set, and similarly 40 μL were also dried and reconstituted in IS.
The data was analyzed with ClusterFinder4 (CF4). We also analyzed the same samples from our original samples with CF4 (see Table 1). Using a creatinine concentration of 6.5 mM as recommended by the calibration resulted in 342 IROA features (pairs) with 299 features containing no missing values. Interestingly, when we used a creatinine concentration of 2 mM, we obtained 323 IROA features, with 301 features containing no missing values. Clearly, the use of highly concentrated samples is not beneficial for the number of features detected. Granted, the number in our original data set was much lower with 190 IROA features, with 180 features containing no missing values. This may be attributed to the lower IS concentration.
“Clearly” the difference between 299 and 301 is trivial and could be attributable to a minor change in any one of the many parameters the user could adjust in the program. The big difference is that their solutions were balanced to one another, and the peaks were correctly found. While the author does not comment on it the ratios seen in the 6.5 mM samples would have been in a range where, on average, they were quantitatively more accurate. Not to mention the fact that ion-suppression, injection, and other forms of in-source errors were now corrected.
“The C12 and C13 peaks for most compounds are optimally measured when they are closer to a 1:1 ratio.”
We also evaluated the 12C/13C ratios of the features. Only 9.6% of the features in the 6.5 mM creatinine samples had ratios between 0.8 and 1.2. Hence, the intended ratios are seen only for a small subset of features. The percentages in the old data set and in the 2 mM creatinine samples were lower, but not drastically. It is also worth mentioning that the 6.5 mM creatinine samples had higher proportions of features with ratios >2 (27.8%) or even >10 (7.5%). This would in turn indicate that the samples are too concentrated.
This is yet another fundamental misunderstanding of the protocol on the part of the authors. While the measurement is most accurate at 1:1, the accuracy is reasonably linear over the range 0.1:1 to 10:1 (and, if manually curated, ratios from 0.01:1 to 100:1 should be used). The software will calculate what it measures and report beyond this range, but then the user needs to take the amplitude into account and disregard samples too close to the noise level. At no point has anyone suggested all measurements need to be made between 0.8 and 1.2 and only in this range. Where does this misconception come from? Indeed, this would be an impossible-to-meet requirement in most sample collections.
As we have noted many times (and the author has often confirmed), the ratios used in this paper were calculated using only the monoisotopic peaks. Again, this represents a fundamental misunderstanding of the reality of these peaks. The natural abundance peaks, containing ~1.1% 13C, will have a monoisotopic peak that represents 98+% of the majority of metabolomically-relevant molecules. On the other hand, the monoisotopic peaks of the compounds in the internal standard will represent only 60% of the total IS molecules for a molecule containing 10 carbons (77% for 5 carbons, 46% for 15 carbons, only 36% for 20 carbons, and lower still beyond this). Therefore, without a doubt, the authors’ reliance on a 1:1 ratio will in all cases require a concentration ratio much larger than a true 1:1 to get the ratio they see as 1:1, and this error will be a variable function of the number of carbons in the molecule. IROA sums all C13-containing isotopomeric peaks (for both the natural abundance and internal standards) to assure the concentration equality of the ratio despite these non-equal distributions. The ratios in this paper are meaningless numbers, and as such all of the Figures (and Tables) that use them are simply wrong.
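The internal-standard percentages quoted above are consistent with a simple binomial model in which every carbon position is independently 13C with the same probability. The sketch below is only an illustration: the 95% 13C enrichment is an assumption inferred from the quoted figures (it is not stated in this letter), the function name is hypothetical, and only carbon isotopes are considered.

```python
def monoisotopic_fraction(isotope_abundance, n_carbons):
    """Fraction of molecules in which all n carbons are the dominant isotope.

    Simple binomial model: each carbon position is treated independently,
    and isotopes of other elements are ignored.
    """
    return isotope_abundance ** n_carbons

# Assumption (not stated in the letter): the IS is labelled at 95% 13C.
IS_13C_ABUNDANCE = 0.95

for n in (5, 10, 15, 20):
    fraction = monoisotopic_fraction(IS_13C_ABUNDANCE, n)
    print(f"{n:2d} carbons: {fraction:.0%} of IS molecules in the monoisotopic peak")
```

Under this assumption the model gives roughly 77%, 60%, 46% and 36% for 5, 10, 15 and 20 carbons, which matches the figures in the text and shows how the fraction falls with carbon count.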
The authors are worried about compounds that have a ratio of >10 but fail to comment on all of the compounds that have ratios of <0.1, probably because in their original dataset this was likely the majority. We accept that the linearity drops at the extremes but the purpose of the “Calibration”, indeed the reason it is so named, is to find for a given sample type the sample concentration that will produce the largest percentage of ratios within the linear range. The study upon which this paper is based is centered on the low end, with more compounds needing to be measured at LLOQ or in noise.
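To make the ranges under discussion concrete, the following minimal sketch bins C12/C13 ratios into the windows mentioned in this exchange (0.8 to 1.2, and the 0.1 to 10 range described as reasonably linear). The ratio values are hypothetical placeholders, not data from the paper or the calibration experiment, and the function names are illustrative only.

```python
from collections import Counter

def bin_ratio(ratio):
    """Assign a C12/C13 ratio to one of the ranges discussed in the text."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if 0.8 <= ratio <= 1.2:
        return "near 1:1"
    if 0.1 <= ratio <= 10:
        return "linear range"
    return "outside linear range"

def summarize(ratios):
    """Percentage of features falling in each bin."""
    counts = Counter(bin_ratio(r) for r in ratios)
    return {label: 100.0 * count / len(ratios) for label, count in counts.items()}

# Hypothetical ratios for illustration only.
example_ratios = [0.05, 0.3, 0.9, 1.1, 2.5, 12.0, 0.09, 4.0]
summary = summarize(example_ratios)
for label, percent in summary.items():
    print(f"{label}: {percent:.1f}%")
```

A summary of this kind reports the low end (<0.1) alongside the high end (>10), so both tails of the distribution are visible at once.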
Most importantly, we also evaluated the quantitative performance by correlating the peak areas and 12C/13C ratios with quantitative data as performed in the publication. Data was analyzed with ClusterFinder4 and MZmine (see Table 2). The files from the original experiment corresponding to the 26 specimens were also evaluated as seen in Table 2.
While data derived only from those peaks that are close to a concentration ratio of 1:1 C12 AUC : C13 AUC, may be a point of maximum quantitative accuracy, it is possible to get very good data as long as the peaks are both of sufficient size to be well characterized.
An examination of Figure 6 in the paper clearly shows that the authors’ peaks are unfortunately broad (Fig 6 D = 1.8 minutes wide) and degraded. We agree that at a resolution of 21,000 these signals will only be partially resolved, but it points to why the authors needed to decrease the volume injected as an attempt to increase the injected concentration. To blame IROA for tuning and instrumentation issues is a poor excuse. The instrument could have been tuned better (its technical spec is 40,000, not 21,000) and the chromatography should have been better. Nonetheless, the chromatographic peaks shown in Figure 6 are indeed broad and show little to no resolution, but this is not a problem caused by IROA. In other places, the authors erroneously state that the inclusion of the IS increases suppression. While this is true to some extent, the ability to correct for suppression over a very broad range of concentrations is critical to the production of better-quality data. While it varies across concentration and chromatography, we find that ion suppression averages about 20% (range: 0.1% to 90+%), but with the ability to correct it accurately, the suppression itself is no longer an issue.
A 6.5 mM creatinine concentration together with the recommended concentration of IS would help maximize the number of features found.
One should also keep in mind that maintaining a creatinine concentration of 6.5 mM over all samples would require drying down about 371 μL of urine with a creatinine concentration of 0.7 mM! Moreover, these overly concentrated samples will most likely cause a rapid contamination of the system and fast deterioration in instrument performance.
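The 371 μL figure follows from conserving the amount of creatinine between the dried urine and the reconstituted sample. A minimal sketch, assuming reconstitution in 40 μL as described for the repeated experiment (the function name is hypothetical):

```python
def dry_down_volume_ul(target_mm, sample_mm, reconstitution_ul=40.0):
    """Urine volume to dry so that reconstitution yields the target creatinine level.

    Conservation of amount: sample_mm * V_dry = target_mm * reconstitution_ul.
    """
    if sample_mm <= 0:
        raise ValueError("sample creatinine concentration must be positive")
    return reconstitution_ul * target_mm / sample_mm

# Lowest creatinine concentration in the original sample set, per the text.
lowest_sample_volume = dry_down_volume_ul(target_mm=6.5, sample_mm=0.7)
print(f"{lowest_sample_volume:.0f} uL of 0.7 mM urine needed for 6.5 mM in 40 uL")
```

For the 0.7 mM sample this gives about 371 μL, in agreement with the value stated above.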
With regard to the “need” to dry down 371 μL of urine, a) this volume is easily dried down under nitrogen, and b) while we think it is good to do both a pre-normalization (as in this dry-down) and a post-normalization (done in-silico), what Figure 1 demonstrates is that the in-silico normalization of the samples (green crosses) does do a very good job even without the sample concentration. Needless to say, we would always suggest randomization of injections and the use of blanks between very concentrated samples.
Figure 1 (From Dettmer reply). Calibration experiment results. Note that the X axis shows the creatinine concentrations multiplied by a factor of 100 merely for the software to recognise it, as it had troubles recognising decimals.
With regard to the author’s comment on Figure 1, unfortunately many programs, including ClusterFinder, have trouble discerning between the European decimal and US decimal nomenclature (comma vs. period). We look to address this concern in future versions of the software. This problem is part of the Microsoft OS.
Table 1 (From Dettmer reply): Compounds detected in the three data sets and feature counts using CF4
Most importantly, we also evaluated the quantitative performance by correlating the peak areas and 12C/13C ratios with quantitative data as performed in the publication. Data was analyzed with ClusterFinder4 and MZmine (see Table 2). The files from the original experiment corresponding to the 26 specimens were also evaluated as seen in Table 2.
While we extracted more features with CF4, which is a clear benefit, the overall quantitative performance did not significantly improve, when employing the protocol with dried urine samples and the suggested IS concentration. Only for arginine, a clearly improved correlation was seen for the ratios when using the samples adjusted to 6.5 mM creatinine and evaluated with CF4 i.e., where we exactly followed the suggested protocol (columns 6 and 7 of Table 2). Therefore, as we already stated in our original publication, the obtained improvement by employing the IROA kit in terms of quantitative performance is only moderate.
Did the authors again choose to throw out pairs in the LTRS samples that deviated more than 30%? Which again could explain why they found so few peak pairs. As noted above, not only was this not necessary, but the ratio that was used had no bearing on the concentration ratios they were supposed to support.
Table 2: Coefficients of determination. Missing entries correspond to cases where no bin(s) were present in the data.
We need to point out that msms data as a post-source data stream is not immune to in-source errors, such as ion-suppression, and in-source fragmentation, therefore unless an Internal Standard is used as a point of comparison msms data should not be considered to be an accurate quantitative method. Without an Internal Standard peak msms should only be used where multiple same mass peaks may be simultaneously eluting and identity needs to be verified. It is not quantitative.
It is also worth noting that the Amino Acid Quantitative analysis that was performed is exactly such a method: it used isotopic internal standards, derivatized according to a specific SOP, and runs a very exacting chromatographic protocol (maybe even “slavishly” performed) in order to generate accurate results. In principle, this is very similar to the IROA protocol, which can take the data several steps further (suppression-correction and sample-to-sample normalization) because of the additional isotopic features of the IROA isotopic patterns.
Responding specifically to Table 2:
1) To begin with, we would never expect any significant correlation using the area because the peaks are suppressed, so we will focus only on the use of ratios, even though, as has been discussed and demonstrated above, the ratios that the authors calculated bore no relationship to the concentration ratios that they believe they represent.
2) With regard to the ratio correlations: if the IS concentrations were consistent in all of the samples, the ratios should be more highly correlated under good chromatographic conditions, since this is exactly the same methodology as the AA Quantitative analysis. Clearly there are chromatographic issues here they have not resolved. As seen in Table 2, many of these AAs may be present at extremely low concentrations in the experimental dataset, as seen by the fact that they are only seen in the 6.5 mM samples, or could not be seen in the NMR (Arginine, Asparagine, Aspartic acid, Glutamine, Histidine, Lysine, Methionine, and Serine), and it is unlikely that the authors achieved separation of Homoserine and Threonine; these are both likely seen, unseparated, in the void volume. Therefore, only Tryptophan, Phenylalanine and Tyrosine remain as likely in their chromatographic system to have a good correlation, and in fact these all demonstrate that with a good chromatographic system the ratios are quite consistent with the AA results; despite not representing concentration, they would correlate because each displays a similar rate of error.
3) We should also point out that Figure 6 demonstrates some of the instrumental issues the authors faced. In this figure they suggest that at a resolution of 21,000 they were unable to clearly see a 12 ppm mass difference despite the fact that at half height these should separate. Clearly, they had instrumentation and chromatographic issues.
4) We believe that the AA msms protocol should produce very high-quality data and recognize that the
NMR signals many have been contaminated because this table seems to suggest that the AA msms and NMR data are at variance with one another. Under any circumstances this lack of agreement between the msms data and NMR data again calls into question the integrity of the work done in this paper.
5) In our original Letter to the Editor we looked only at the protocol issues. The data quality issues we are now raising should be applied to all of the graphical Figures and Tables. The publication dataset was based on low concentration large injection samples and subjected to poorly resolved chromatographic separations, on an instrument that had only been tuned (21,000) to barely half of its specified resolution (40,000), and the ratios do not represent the relationship to the underlying concentrations (at all).
With all of these irrefutable errors, retraction of the entire paper is required to protect the journal’s reputation.
“They did not attempt to acquire or use the suppression-corrected or post-injection normalization data. These two steps are specifically designed to reduce the impact of starting sample differences. Because of the failure to follow the TruQuant protocol they could not do either the suppression-correction or post-injection normalization.”
The sentence above is a bit confusing as it is not possible to acquire suppression corrected data. Only after data acquisition, data will be corrected for ion suppression. By computing the ratio between C12 and C13-IS values data will be suppression corrected as we did. The next step to get from the obtained ratios the true areas that accurately reflect biological concentrations is a bit problematic as detailed below.
With English not the writer's first language, we apologize that the word “acquired” was misunderstood. We acquire (obtain) suppression-corrected data as part of following the TruQuant protocol. In this case the reader meant that the authors did not attempt to use CF to generate suppression-corrected data (following the acquisition of the flawed raw data). As suppression correction is the major benefit of the protocol, this omission, together with the inability to normalize the samples, was unfortunate.
These areas are influenced by both ion suppression for which we corrected and general ionization efficiencies, which may greatly vary across compounds, and which are unknown.
The author states that “By computing the ratio between C12 and C13-IS values data will be suppression corrected as we did”. This is technically incorrect in a very important way. The ratio is immune to suppression because suppression is a function of a molecule’s chemical structure and therefore all of the isotopologues of any compound will suffer almost identical suppression. The authors made no attempt to correct anything, they simply used the ratio as a stand-in value. No suppression-corrected value was calculated, no quantitation was done, and no normalization is possible based on these ratios alone.
In addition to ion suppression the author is correct that ionization inefficiencies, source variances and even injection variances will contribute to errors in the raw ms data. From the step in which the Internal Standard is introduced into the sample all of these issues will be able to be accounted for due to exactly the same reasoning given for suppression above, i.e. given the minor mass differences the chemical properties will cause all carbon isotopologues of a particular compound to generally operate similarly (msms fragmentation will show minor mass-based differences). Thus, the IROA suppression-correction and normalization corrects for all of these.
Again, had the author used the correct protocol and software, they would have generated data of higher quality (with a higher S/N) that had all in-source and injection errors corrected. The data that they show as IROA data is not, and can never be considered to be, equivalent. Among other things, the authors' ratio includes only monoisotopic peaks. As discussed previously, this is flawed because the monoisotopic peak in the IS, at 5% 12C, loses an increasingly large percentage of its signal as the number of carbons in the molecule increases: for instance, at 5 carbons the 13C isotopologues other than the monoisotopic account for 23% of the IS-contributed isotopologues, while at 10 carbons they represent almost 40%. This becomes quite critical when comparing a natural abundance peak (which will have only a 1.1% loss in height) with an IROA monoisotopic peak height (which will have lost a significant but variable, formula-defined amount of its height). This ratio of monoisotopic peaks, as used by the author, is not quantitatively accurate. For a more accurate quantitative comparison, the natural abundance isotopologues must be summed and compared to the sum of the internal standard's isotopologues, i.e. each contribution must be considered as a whole and not as some variable fragment.
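The percentages quoted for the IS can be checked with a short calculation. What follows is only a minimal sketch, assuming that every carbon position is labeled independently (a simple binomial model), that the IS carries 95% 13C and that natural abundance is about 1.1% 13C. The function name is hypothetical and is not part of ClusterFinder or MZmine.

```python
def monoisotopic_fraction(n_carbons: int, p_major: float) -> float:
    """Fraction of molecules in which every carbon is the major isotope.

    Assumes independent labeling at each carbon position (binomial model).
    """
    return p_major ** n_carbons


for n in (5, 10):
    # IS side: 95% 13C, so the monoisotopic peak is the all-13C species.
    is_loss = 1.0 - monoisotopic_fraction(n, 0.95)
    # Natural abundance side: about 98.9% 12C, so it is the all-12C species.
    na_loss = 1.0 - monoisotopic_fraction(n, 0.989)
    print(f"{n} carbons: IS loss {is_loss:.1%}, natural abundance loss {na_loss:.1%}")
```

Under these assumptions the IS monoisotopic peak is missing roughly 23% of the total IS signal at 5 carbons and roughly 40% at 10 carbons, while the natural abundance monoisotopic peak loses roughly 1.1% per carbon. A ratio of monoisotopic peaks therefore differs from the ratio of summed envelopes by a factor that depends on the carbon count of each compound.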
As the reader says, the aim of post-injection normalization is to reduce, for example, starting sample differences, i.e., to make data more comparable across samples. However, here we first compared, for each individual sample, the values obtained by an absolute quantitative method to those obtained using the IROA kit, employing a correlation analysis. Therefore, a normalization to make data more comparable across samples is not required for this kind of analysis. Such a normalization is helpful for the subsequent analysis of differences between groups, as we did with the PCA analyses shown in Figures 5 and 7 of our original contribution.
Here, we used probabilistic quotient normalization (PQN) introduced by Dieterle et al. (Anal. Chem. 2006, 78, 13, 4281–4290) for all data sets, as it is a widely used method for human urine and can be applied to pure area data and the ratios obtained with the IROA-IS. In this context, please also see below the paragraph on data normalization.
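For readers unfamiliar with PQN, the core idea can be sketched in a few lines. This is a simplified illustration and not the code used for the publication: it assumes the reference spectrum is the feature-wise median over all samples, it omits the optional integral normalization that usually precedes PQN, and the data are hypothetical.

```python
from statistics import median


def pqn_normalize(samples: list[list[float]]) -> list[list[float]]:
    """Probabilistic quotient normalization, simplified sketch.

    samples: one list of feature intensities per sample, same feature order.
    The reference spectrum is the feature-wise median over all samples.
    """
    n_features = len(samples[0])
    reference = [median(s[j] for s in samples) for j in range(n_features)]
    normalized = []
    for s in samples:
        # Quotients against the reference, skipping features with a zero reference.
        quotients = [s[j] / reference[j] for j in range(n_features) if reference[j] > 0]
        dilution = median(quotients)  # most probable dilution factor of this sample
        normalized.append([v / dilution for v in s])
    return normalized


# Hypothetical data: the second sample is the first one diluted twofold.
data = [[10.0, 20.0, 30.0], [5.0, 10.0, 15.0], [10.0, 22.0, 28.0]]
result = pqn_normalize(data)
print(result[1])  # the diluted sample is rescaled to match the first sample
```

Because the dilution factor is a median of quotients, a few strongly regulated features do not shift it, which is why the same function can be applied either to raw areas or to 12C/13C ratios.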
As detailed previously there were significant problems with the data due to the use of dilute samples, oversized injections, poor chromatography, and poor instrument preparation.
The IROA protocol does not need any post hoc manipulation of the IROA data, as all of the calculations are derived from experimental data generated directly within the protocol.
With respect to the computation of peak areas corrected for ion suppression (corrected area = X*C12/IROAIS, with X being the least suppressed value of this analyte), one should keep in mind that the obtained areas are corrected for differences in ion suppression and the amount of internal standard present for this compound, but differences in general ionization efficiency of a compound, which may vary greatly across compounds, are not corrected for. Therefore, differences in corrected areas between different analytes may not truly reflect biological difference and should be treated with care.
This is not correct. The Internal Standard analytes will suffer the same ion source losses as the C12 analytes suffer in each individual sample. By normalizing the natural abundance (C12) data to the Internal Standard, we are setting the C12 data against a constant and reproducible value, one that will respond to instrumentation variances at the same time and rate as the samples. These ion source losses are corrected for each individual compound, and thereby the true biological differences can be seen.
The selection of the highest IS area (picked from the raw 13C data) in the software is a bit problematic. It is picked from one sample (not the average over all IS blank samples) that shows the highest 13C area for a particular feature. The highest area for the next feature can stem from a different IS blank sample.
According to our patented protocol for doing this there are many ways to determine it. In the current software the default is “selected as the average across all of the IS-Blank samples”, but there are many other options, each of which has specific situational benefits.
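The two ways of choosing the reference value discussed here can be compared side by side. The sketch below assumes the relation quoted above (corrected area = X*C12/IS) and treats X either as the largest IS area among the samples or as the mean IS area over the IS-Blank injections. It is an illustration only, not the ClusterFinder implementation; the function, its parameters, and the numbers are hypothetical.

```python
from statistics import mean


def suppression_corrected(c12_areas, is_areas, reference="max", is_blank_areas=None):
    """Corrected area = X * C12 / IS for one analyte across samples.

    reference="max": X is the largest (least suppressed) IS area among the samples.
    reference="blank_mean": X is the mean IS area over the IS-Blank injections.
    """
    if reference == "max":
        x = max(is_areas)
    elif reference == "blank_mean":
        x = mean(is_blank_areas)
    else:
        raise ValueError(f"unknown reference: {reference}")
    return [x * c12 / is_area for c12, is_area in zip(c12_areas, is_areas)]


# Hypothetical analyte: same true amount in both samples, sample 2 suppressed by 50%.
c12 = [1000.0, 500.0]
is_ = [2000.0, 1000.0]
corrected = suppression_corrected(c12, is_)
corrected_blank = suppression_corrected(c12, is_, "blank_mean", [1900.0, 2100.0])
print(corrected, corrected_blank)
```

Both choices of X rescale every sample of a given analyte by the same constant, so they change the absolute corrected areas but not the sample-to-sample comparison for that analyte.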
Normalization:
The MSTUS (mass spectrometry total useful signal) approach is a variation of the common normalization to a constant sum (CS), as described in Craig et al., Anal. Chem. 2006, 78, 2262-2267. It attempts to limit the contributions of xenobiotics and artifacts to the normalization factor by including only a subset of signals for computation of the normalization factor. As IROA technologies states in its white paper “all the peaks we used in computing a normalization factor had minimum criteria to qualify: 1) both the C12 and C13 isotopic clusters have to be present in all samples, 2) they both have to be above a minimum peak area, and 3) and the ratio between the C13-IS and the C12 monoisotopic peaks has to be greater than 0.001.” (Ref: Integration of Standards for Ion Suppression Correction and QC in an Untargeted Metabolomics Workflow, DOI: 10.13140/RG.2.2.28112.74245). Next, the areas of all remaining signals will be corrected for ion suppression.
Back in 2004 Metabolon was doing an MSTUS (using Craig's definition for NMR) correction. Warrack (2009) was the first to apply it to MS while limiting the selection of peaks to those they expected to be of biological origin. IROA improves on this by assuring that all molecules are actually of biological origin, in addition to other quality exclusions.
BTW – Warrack demonstrated quite clearly that in urine, creatinine normalization was comparable to no normalization and that MSTUS performed better. They also demonstrated that MSTUS was able to generate a normalization factor (NF) that could be applied sample-wide and not just to the compounds used to generate it. It is true that we have improved on Warrack by assuring the data used in generating the NF are of purely biological origin, something Warrack could not do, and by using the MSTUS of the IS for the normalization value. We have called this a Dual-MSTUS algorithm for this reason.
The computation of the MSTUS correction factor is described in an IROA poster as follows: “The calculation for the MSTUS normalization correction factors is: sumSCC12/sumSCIS where sumSCC12 is the total suppression corrected area of all considered C12 compounds, and sumSCIS is the total suppression corrected area of all considered IS compounds. The suppression corrected values are multiplied by these factors to normalize them all to the same base.” (Ref: Poster, Lorenzi et al. “Correction of Ion Suppression and Normalization for Improved Quantitative Rigor and Reproducibility Using IROA”, Metabolomics Society 2018).
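The quoted selection criteria and the sumSCC12/sumSCIS factor can be put together as a short sketch. It is not the vendor's implementation. The minimum area threshold is a hypothetical value (the quoted text gives none), the ratio is assumed to be IS divided by C12, and the same hypothetical areas are used for both the selection step and the summation.

```python
def qualifying_features(c12, is_, min_area=1000.0, min_ratio=0.001):
    """Indices of features that pass the three quoted criteria in every sample.

    c12, is_: per-sample lists of areas in the same feature order; 0.0 = not detected.
    min_area is a hypothetical threshold; the ratio direction (IS / C12) is assumed.
    """
    keep = []
    for j in range(len(c12[0])):
        ok = True
        for sample_c12, sample_is in zip(c12, is_):
            a, b = sample_c12[j], sample_is[j]
            # Criteria 1 and 2: both clusters present and above the minimum area.
            if a < min_area or b < min_area:
                ok = False
            # Criterion 3: ratio between the IS and C12 peaks above the cutoff.
            elif b / a <= min_ratio:
                ok = False
        if ok:
            keep.append(j)
    return keep


def mstus_factors(sc_c12, sc_is, keep):
    """Per-sample factor sumSCC12 / sumSCIS over the qualifying features."""
    return [
        sum(s12[j] for j in keep) / sum(sis[j] for j in keep)
        for s12, sis in zip(sc_c12, sc_is)
    ]


# Hypothetical areas for two samples and three features.
c12 = [[5000.0, 8000.0, 0.0], [10000.0, 16000.0, 3000.0]]
is_ = [[4000.0, 4000.0, 4000.0], [4000.0, 4000.0, 4000.0]]
keep = qualifying_features(c12, is_)
factors = mstus_factors(c12, is_, keep)
print(keep, factors)
```

The third feature is excluded because it is missing from the first sample, so only signals seen on both the C12 and IS side in every sample contribute to the factor. How the factor is then applied to the individual values is left to the quoted description.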
Therefore, this is clearly an improved version of the common CS normalization. However, CS normalization in general has its limitations. Its inherent assumption is that across groups only a relatively small number of features is regulated in approximately equal shares up and down. As Craig et al. correctly state for urine: “For a series of spectra with highly similar internal peak ratios but differing in total intensity because of such dilution or concentration effects, CS normalization of each spectrum can be considered to approximate the relative concentration of species (i.e., as in solute). Importantly, this approximation will break down when large perturbations occur to intensities in some spectra (e.g., those from certain toxin-treated animals or the use of diuretic drugs, for example). This is easily seen because if some peak areas are increased and the total is normalized to a constant, others will appear to have decreased, and this effect in “closed” data sets has been noted previously.” (Ref. Craig et al., Anal. Chem. 2006, 78, 2262-2267). Due to the inherent limitations of CS normalization especially for urine, the normalization strategy suggested by the reader will be appropriate in many cases such as for many cell culture experiments but is of only limited use for urine. For a discussion on different normalization approaches see for example Wulff and Mitchell, Advances in Bioscience and Biotechnology, 2018, 9, 339-351.
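The closure effect described by Craig et al. is easy to reproduce with numbers. The following sketch uses hypothetical three-feature spectra in which only the first feature truly changes; the function name and the target sum of 100 are illustrative choices.

```python
def constant_sum_normalize(spectrum, total=100.0):
    """Scale one spectrum so that its intensities sum to `total`."""
    s = sum(spectrum)
    return [total * v / s for v in spectrum]


# Hypothetical spectra: only the first feature truly changes (10 -> 60).
control = [10.0, 30.0, 60.0]
treated = [60.0, 30.0, 60.0]

control_cs = constant_sum_normalize(control)
treated_cs = constant_sum_normalize(treated)
print(control_cs)  # [10.0, 30.0, 60.0]
print(treated_cs)  # [40.0, 20.0, 40.0]
```

After normalization the second and third features appear to drop from 30 to 20 and from 60 to 40, although their true intensities are unchanged. This is the apparent decrease in "closed" data sets that the quotation refers to.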
Please see Warrack et al. (Journal of Chromatography B, 877 (2009) 547–552). The basic idea to keep in mind is that garbage in produces garbage out, no matter how hard you may statistically massage it. PQN and RBE make appropriateness assumptions concerning the quality and nature of the underlying data that most users barely understand. How appropriate are techniques which have been developed for genetic data and are now used for metabolomic data? The nature of the datasets is extremely different (number of datapoints per sample, scope and range of data, etc.). The IROA TruQuant protocol does not do any post-hoc statistical manipulations, but rather all computations are based on specific experimental datapoints that are generated within the experiment. It is fundamentally sounder, and is based solely on first principles, but it does require that the user generate good data.
“The Authors used an outdated version of ClusterFinder. ClusterFinder Version 3 did not have any of the optimizations needed to find and use the TruQuant data.”
One would expect that version 3 of a software is mature, in particular when the option to analyze TruQuant data is given. Moreover, despite CF4 being a better version, we still encountered so many errors, leading to several attempts until we managed to get results. We downloaded our software from the Sigma Aldrich homepage in November 2019 and were not alerted to the newer version that shows a good performance improvement.
Suppression-correction and normalization were new features in CF4. Although we keep Sigma Aldrich updated with our latest software, because of their size and so many offerings they are slow to upload and recommend new materials.
The first author of the paper was in contact with IROA at the end of 2019, but the updated CF4 software version never came up in the discussion. We did inform IROA that we did not detect as many metabolites in LTRS (diluted according to protocol) as they claimed. They did reply at first and we sent them a data file. They also ended up getting low find rates (despite seeing a lot of IROA peaks when inspecting the spectra). Suspecting issues with the conversion, they asked for the original (not converted) data file, which we also provided. After not hearing from IROA for three weeks, we contacted them again in January 2020 asking whether they had found out what seemed to be the problem, but our inquiry was left unanswered.
We are committed to each and every one of our customers and would be happy to work with you in any way we can. Upon reviewing the author’s data, we realized that there were issues with the chromatography and the balance of IS-to-sample was suboptimal and we are sorry that this was not properly communicated.
Nevertheless, we used CF4 to analyze our data and agree that CF4 performs better and yields more IROA pairs than our MZmine workflow.
The TruQuant protocol was introduced in 2019 along with the CF4 because CF4 supported it. CF3 only supported finding IROA peaks in Basic and Phenotypic protocols.
The authors comment in the abstract that “the ratio approach helps to reduce batch effects, but it does not perform better that computational methods such as the “removebatcheffect” function in the R package Limma”. In this case they clearly cannot make this judgement because not only did they fail to follow the protocol, but they never completed the protocol, and therefore never generated suppression corrected or normalized datasets. This comment in the abstract is fundamentally wrong and must be withdrawn.
“ClusterFinder finds and interprets the entire isotopic envelope for all isotopic balances, including natural abundance, U-13C 5%, U-13C 5%. Because of the nature of the labeling in these situations the full isotopic envelope needs to be summed to determine the actual quantity of material on either side.”....”The data used for the analysis was not generated by ClusterFinder but rather by MzMine2 and only data from the monoisotopic peaks was collected which would not have been sufficient for accurate measurements. It is hard to fathom the effect of this discrepancy, but it fundamentally means that every peak ratio was incorrect.”
There is no need to sum the isotope signals.
See previous discussion.
Yes, it is important to sum the full isotopic envelope for all isotopic balances to determine the actual quantity of material on both the 12C and C13 side! The isotopic dilution distributes molecules over a number of masses, and their associated isotopic peaks are used to calculate their respective areas accurately (and also corrected for ion suppression). Normalization is achieved utilizing the total intensity of components common to all samples after baseline correction.
The isotopic pattern of the internal standards will not change from sample to sample, only the signal intensity of the isotopic peaks will vary relative to each other due to ion suppression, etc.
Yes, this is exactly why all of the peaks need to be collected to get an accurate measurement.
The isotopic pattern of the endogenous metabolite from the urine sample is also fixed. The monoisotopic peaks reflect the abundance of the endogenous metabolite and IS in the sample. Nevertheless, we compared the ratios of CF4 with our MZmine ratio and, as expected, they show excellent correlation (Table 2). The claim that the MZmine ratios are wrong is not true.
It is the ratio of monoisotopic peaks and is simply wrong! It does not represent the concentration ratio of the compounds. See previous discussion.
In addition, our very strict data curation (“For instance, the 12C peaks should be minimal in the IS blank, so if a considerable 12C peak (12C/13C > 20%) was detected there, the pair was excluded.”) is interpreted as a carryover by the reader. Under the incorrect assumption of the reader that the IS concentration in the IS blank is not the same as in the sample, the carryover is even inflated to 60%. This is not the case. The higher ratios seen for a few pairs are most likely caused by mismatched peaks. Blank water samples (of course the water was always purified by a PURELAB Plus system ELGA, LabWater, Celle, Germany) were inspected for carryover.
There is no question about this - There is no C12 material in the IS. It has never seen a natural abundance world. If the authors are seeing C12 material in the IS-Only sample, then it is carry-over. It is possible that the authors are misinterpreting some random MS signal but we have worked with the IS extensively and have never seen any contaminating material on a clean column. Could their column have been too dirty?
Dear editor, we are happy to include the new data generated according to the suggested protocol in an addendum that will show that a higher number of detected IROA pairs is achieved but it will not demonstrate a significant improvement in quantitative performance over our original protocol.
The fundamental failure of all data collected, tables of incorrect data, and graphs based upon it make it impossible to allow this publication to remain publicly available, and therefore we continue to request that the publication be retracted.