Reproducibility: Quantifying the stochastic component of epigenetic aging

Author: Richard J. Acton

Published: December 16, 2024

Modified: January 8, 2025

manuscript doi: 10.1038/s43587-024-00600-8

Note

This post has been revised following feedback from Prof. Andrew Teschendorff, an author of the paper covered.

(See the end of the post for excerpts from our communications.)

The complete revision history of this post, and all others, is available in the git repository.

Key points:

  • The bulk of the post focuses on data availability / provenance of 2 of the 31 datasets used in the paper. In its original form the post inadequately contextualised this, more context is now provided. I would emphasise that this critique is inconsequential to the overall conclusions of the manuscript.
  • The post discusses how data availability / provenance might ideally be handled when authors have full control, as well as how authors might, depending on their authority to publicly share (meta)data, mitigate availability / provenance issues when they do not. I would emphasise that in this case the authors do not have the authority to deposit these 2 datasets in a public repository, this rests with the clinicians and private entities which generated this data, and that it is communicated how to request access to these datasets.
  • I noted that these 2 datasets have not previously been ‘published’, merely that previous publications had included conclusions based on these datasets. In doing so I used ‘results’ in an ambiguous fashion which could be read to imply that the results presented by Tong et al. (2024) had previously been published, as opposed to other results based on this data. I am specifically taking issue with referring to a dataset as ‘published’ if it is not public, wholly included in the publication referenced, or, at worst arguably, if it is separately citable but access restricted. I did not mean to imply that any of the results presented by Tong et al. (2024) had previously been published, and the post has been re-worded to be unambiguous on this point.

Overview

It has previously been established that it is possible to accurately predict chronological age using DNA methylation data from a relatively small number of methylation sites in the genome using a method called elastic net regression (Horvath 2013). It has also been demonstrated that this same method of training an age predictor also works when using simulated data where stochastic changes are made to a baseline state over time (Meyer and Schumacher 2024). Tong et al. (2024) address the question of what proportion of the accuracy of DNA methylation based clocks can be explained by purely stochastic changes. They conclude that stochastic changes account for approximately 66-75% of the accuracy of these clocks. They also observe that several factors previously associated with differences in biological age acceleration, the difference between predicted and chronological age, are not driven by differences in the rate of stochastic change.

Given that this paper is largely based on simulation models that are informed by and compared with results from existing datasets, the most important factors to consider in gauging its overall reproducibility are the availability of the data and the reproducibility of the computational work. This paper does an excellent job of checking that its results are replicable by applying their method to a large number of datasets, which are themselves sizeable and from a diversity of sources. It also demonstrates that the finding that stochastic changes explain much, but not all, of the accuracy of the biological clocks is robust to differences in the clock model, and that this generalises across different clocks and different datasets.

This post focuses mostly on some narrow issues of reproducibility, the ability to get the same results using the same analysis on the same data (see the definitions section on our format page). Most of the data is appropriately available, but 2 of the 31 datasets used are not as FAIR as would be desirable. Computational reproducibility is rather lacking: whilst their simulation method is well described in the methods, the code provided allows users to apply the stochastic age models generated in this paper to their own data but not to regenerate these models for themselves.

Data Availability

Most of the datasets used in the paper are publicly available and referenced by links to their accessions in appropriate databases.

The authors state in the ethics section of the methods:

All DNAm datasets analyzed here have already been published elsewhere. We refer to the respective publications. For the TruD cohort, already published previously by us (Luo et al. 2023)

And in the methods sub-section: “DNAm datasets of solid tissues representing normal and precancer states”:

Lung preinvasive dataset. This is an Illumina 450k DNAm dataset of lung tissue samples that we have previously published (Teschendorff et al. 2015)

This is however somewhat contradicted in the data availability statement:

The lung preinvasive dataset is available upon request to the corresponding author. The TruD DNA methylation dataset is available upon request to TruDiagnostic (TD) Inc. (varun@trudiagnostic.com). To protect data privacy of the individuals represented in this cohort, individual applications will be reviewed by TD and in case TD is willing to share data, a data sharing agreement will be set up.

Whilst it is quite commonplace to refer to previous publications which have made use of datasets to reach their conclusions as though they were publications of the dataset, this is in my view inaccurate unless the complete dataset accompanies the publication. For a dataset to be said to be ‘published’ it should ideally be ‘public’; where access limitations are imposed, a dataset would ideally still be citable with a unique, persistent, resolvable identifier. Datasets in public repositories with access restrictions are also arguably not ‘published’ per se, merely citable, as they are available subject to approval and access can be denied for potentially arbitrary and spurious reasons.

Where a direct data reference is unavailable, citing the works which generated the dataset, as was done in this paper, is the general practice. It is also preferable to extend this to directly cite any papers which provide metadata relevant to the dataset, particularly as used in the work. This might include the papers in which the samples were collected, if this differs from the paper in which data was generated from these samples.

It would be more accurate to say of the TruDiagnostic dataset and of the lung pre-invasive dataset, in the ethics sub-section, that ‘results from previous papers that were derived from these datasets’ had previously been published, not that ‘the datasets’ had previously been published. As the datasets themselves are not available in a public repository and referenceable with a suitable identifier, they cannot themselves meaningfully be said to have been published.

Luo et al. (2023) provides some additional details about the TruD dataset; however, this dataset from TruDiagnostic Inc. (TD / TruD) poses some data access issues. It is reasonable to base published conclusions on data which may require access restrictions for privacy reasons, but such data should be in public archival repositories, where its metadata, including access application requirements and processes, is available, and where it can be referred to with a unique, persistent, resolvable identifier. An email address from which to request access from a private entity, with no defined access application process, is far from optimal. This is not to say that the current practices of publicly run data repositories, for deciding what should be subject to access restriction, how access requests should be processed, by whom, and with what accountability mechanisms, both for those administering access and those given access, are necessarily adequate to this task.

Teschendorff et al. (2015) provides some additional details for the preinvasive lung lesion data.

The second paragraph of the Methods sub-section “Data Sets and Ethics Approval” cites two further papers related to the origins of the preinvasive lung lesion samples:

Samples from preinvasive lung lesions were taken from a cohort described recently (Banerjee, Rabbitts, and George 2004; McCaughan et al. 2011). A subset of 24 laser-microdissected samples, consisting of lesions that did (n = 19) and did not (n = 5) progress to invasive lung cancer (all assessed by means of bronchoscopy) and that were matched for smoking pack-years (SPY), was used. In addition, 21 normal lung samples (bronchial brushings) from individuals at high risk of developing lung cancer were taken from anatomical sites with no documented history of preinvasive lesions. See eMethods in the Supplement for details regarding the data sets used.

The DNA methylation arrays were performed on these samples by Teschendorff et al. (2015), this is clear from the last sentence of the “DNAme Analysis” sub-section of the methods:

DNA from preinvasive lung lesions and normal adjacent tissue was extracted from fresh frozen laser capture microdissected sections (or bronchial brushings from controls), and genome-wide DNAme profiles were obtained using the Methylation450 BeadChip.

Also the supplementary ‘eMethods’ to Teschendorff et al. (2015) include these details:

Pre-invasive lung lesion set: Illumina 450k data was normalized with ChAMP [21] and BMIQ [6]. Inter-sample variation was further assessed using Singular Value Decomposition. From an initial total of 95 samples, including multiple lung biopsies from the same patient, we first performed hierarchical clustering to check whether samples cluster according to individual. Since, the multiple biopsies from the same patients were generally always more similar than the samples from different patients, we averaged multiple biopsies of the same patient, whenever these had the same outcome (regression or progression). This resulted in 21 normal samples, 13 samples which did not progress and 22 samples which did. In order to not confound the analysis by potential differences in the SPY between regressors and progressors, we selected a subset which were matched for SPY (focusing on those with SPY>40). This resulted in 5 regressive and 19 progressive samples. The normal samples were used as a common reference to estimate the smoking index in all 24 samples.

This aligns with the number of samples described in the methods sub-section: “DNAm datasets of solid tissues representing normal and precancer states”:

encompassing 21 normal lung and 35 age-matched lung-carcinoma in situ (LCIS) samples, and 462,912 probes after quality control. Of these 35 LCIS samples, 22 progressed to an invasive lung cancer.

To ascertain more complete details of the biological samples assayed by Teschendorff et al. (2015) we can take a look at the two papers cited in the methods. Banerjee, Rabbitts, and George (2004) is not open access so was a dead-end for further details. McCaughan et al. (2011) had this to say in the ‘Patients and samples’ sub-section of their methods:

The patients were enrolled in the University College London Hospital Early Lung Cancer Project (ELCP). This is a longitudinal bronchoscopic surveillance study that has been described previously (Banerjee, Rabbitts, and George 2004). At the time of enrolment none of the patients had an active diagnosis of lung cancer, although they may have had a prior history of lung cancer. Local Regional Ethical Committee approval was obtained (01/0148). Further details of the methodology and protocol for histological diagnosis have been published Banerjee, Rabbitts, and George (2004). Clinical details of the three patients who are the subject of this publication have been published previously McCaughan et al. (2010).

This gives a further three references to potentially explore for more details. Jeremy George et al. (2007) and Foster et al. (2005) are not open access, but McCaughan et al. (2010) does have some additional details:

All samples were from patients enrolled in the University College London Hospital Early Lung Cancer Project (32). This is a bronchoscopic surveillance study in which patients undergo repeated assessment under a protocol that includes autofluorescence bronchoscopy, computed tomography, and fluorodeoxyglucose–positron emission tomography scanning. Patients are enrolled on the basis of having a biopsy-proven dysplastic lesion of the bronchial tree. At the time of enrollment none of the patients have an active diagnosis of lung cancer, although they may have a prior history of lung cancer. Local Regional Ethical Committee approval was obtained (01/0148). The patients included in this report had undergone an average of 7.4 bronchoscopies (range 1–19) in the surveillance study up to May 2007. The analyzed biopsies were obtained over a period between 1998 and 2007. Research biopsies were taken during surveillance bronchoscopies and fixed immediately for 4 hours in a solution of 4% formaldehyde in phosphate-buffered saline.

Biopsies were chosen from the research archive on the basis of the grade of lesion recorded on the paired clinical biopsy. Seven biopsies with low-grade dysplasia (LGD; mild or moderate dysplasia) and 10 with high-grade dysplasia (HGD; severe dysplasia or carcinoma in situ) were selected. Sections were then taken from the corresponding research biopsy. A team of three consultant pathologists, including the reference thoracic pathologist, read the clinical biopsies. The corresponding research biopsies were read “blind” by the reference thoracic pathologist (M.R.F.). In all except three lesions the paired clinical and research biopsies were read as the same grade, and in the three discordant readings the opinion of the reference thoracic pathologist was accepted. Further demographic and biopsy-related details are in Table 1.

Here, in Table 1, we have the ages of the sample donors, information necessary for the use to which this data was put in the original publication we were discussing.

This lung preinvasive dataset appears to be the basis solely for Figure 6c, and the TruDiagnostic dataset is similarly limited in scope, so the data provenance limitations of these datasets do not undermine the overall conclusions of the paper.

Code Availability

Code was included with the article as a supplement and placed on figshare. Whilst figshare does give the code a separately citable identifier, which is good, this is not necessarily the optimal way to distribute citable code. For a small script like this, this approach is mostly fine, but for more code there are better options. The code included is sufficient to apply the stochastic predictors to new data but not to precisely re-create the process by which the predictors were generated; the code to generate the simulated datasets is not included. For simulation code it would be important to include the random seed(s) used, so that it is possible to check the simple reproducibility of the code on a different system. A preferable way to distribute, archive, and reference code is a git repository on a suitable host, with a CITATION.cff file, and with the repository archived to Zenodo or Software Heritage to get a unique, persistent, resolvable identifier. Had all the code needed to reproduce the computations underlying the analysis in this paper been included, it would likely be sufficiently long that this approach would be preferable to the upload of a script to figshare.
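For illustration, a minimal CITATION.cff might look like the sketch below; all values are placeholders, not details of the actual code from this paper:

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example stochastic clock analysis code"   # placeholder title
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: "1.0.0"
date-released: "2024-12-16"
identifiers:
  - type: doi
    value: "10.5281/zenodo.0000000"               # placeholder archive DOI
```

GitHub and GitLab surface this file as a ‘cite this repository’ option, and a Zenodo archive of the repository supplies the persistent identifier to put in it.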

Figshare is quite suitable for the R data objects though, especially as it exposes an API with which the objects can be directly downloaded. Indeed, we made use of the figshare API to download copies of data shared there in a previous post.
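As a hedged sketch of how that works: the public figshare API returns article metadata as JSON, including a direct download URL for each file. The article ID below is a placeholder, and {jsonlite} is an assumed (non-base) dependency, so the network calls are left commented out:

```r
article_id <- 123456  # placeholder, not the actual figshare ID from this paper
url <- sprintf("https://api.figshare.com/v2/articles/%d", article_id)

# library(jsonlite)               # assumed installed; parses the JSON response
# meta <- jsonlite::fromJSON(url) # meta$files has 'name' and 'download_url'
# download.file(meta$files$download_url[1], destfile = meta$files$name[1])
```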

Details of the computational environment in which the analysis was performed, such as the versions of the software packages used and those of all of their dependencies, should also be provided for more complete computational reproducibility. Getting these details in standard formats can be automated with a suitable environment management tool.
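In R, for example, base sessionInfo() captures the R version and loaded package versions, and a tool such as {renv} (one option among several, assumed installed for the commented lines) can pin the full dependency tree in a lockfile:

```r
# Write a plain-text record of the R version and loaded package versions.
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

# With {renv}, a lockfile pins exact package versions and sources:
# renv::init()      # create a project-local package library
# renv::snapshot()  # write renv.lock recording all dependencies
```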

Licensing

Comment from RunStochClocks.R:

Copyright permission: RunStochClocks is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version-3 as published by the Free Software Foundation. epiTOC2 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details (http://www.gnu.org/licenses/).

Ideally one should include a copy of the license with the software, otherwise the license grant may not be valid. The text of the GPLv3 license itself is clear that it should be included with software distributed under it.

Code Portability

The code provided in RunStochClocks.R makes use of the load() function to load the Rdata file glmStocALL.Rd. This file contains the glmnet objects needed to run the stochastic clocks on new datasets.

The trouble with using load() and its counterpart save() is that the person calling load() does not know the names of the objects saved to the Rdata file that they are loading. If they have objects in their environment which happen to have the same names, running load() will silently overwrite them with the object(s) from the file. This can lead to some confusing bugs.

Using saveRDS() and readRDS() to save individual objects is a safer approach, as you have to assign the result (obj <- readRDS(file)); this way the user of the data controls the object’s name in their code.
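A minimal illustration of the hazard, using throwaway objects and temporary files rather than the paper’s actual glmStocALL.Rd:

```r
# The writer of the .Rd file chose the object name 'x':
x <- 1:5
tmp <- tempfile(fileext = ".Rd")
save(x, file = tmp)

# Later, a user has their own, unrelated 'x'...
x <- "my important result"
load(tmp)                         # ...which load() silently overwrites
stopifnot(identical(x, 1:5))      # the user's string is gone

# With saveRDS()/readRDS() the reader picks the name, so nothing is clobbered:
tmp2 <- tempfile(fileext = ".rds")
saveRDS(1:5, file = tmp2)
clock_models <- readRDS(tmp2)     # explicit assignment under a chosen name
```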

Tip

If you have large R objects it is also worth taking a look at the {qs} and newer {qs2} packages for R object serialisation with performant compression. Using these libraries/formats to save R objects you get smaller files and faster read/write times, especially if you tweak the arguments of the install function to enable compilation with hardware-accelerated multithreaded compression (see the install instructions in their documentation).
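A minimal sketch of the {qs} interface, assuming the package is installed ({qs2} offers analogous qs_save()/qs_read() functions):

```r
library(qs)  # assumed installed; not part of base R

obj <- data.frame(id = 1:1000, value = rnorm(1000))
path <- tempfile(fileext = ".qs")
qsave(obj, path)       # compressed, fast write; see ?qsave for thread options
obj2 <- qread(path)    # read back; round-trips the object exactly
stopifnot(identical(obj, obj2))
```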

Generalised Reproducibility Best Practice Suggestions Emerging From Discussion

The long walk that we took to get more details of the lung preinvasive dataset samples illustrates the importance of data provenance and of treating datasets and their metadata as independently citable objects. This dataset does poorly when you consider its FAIRness. It is not straightforward to find the details of these samples and the assays performed on them, as they are scattered across several publications. Much of the sample metadata is accessible, if not the easiest to locate and process, but the data itself is not. The raw array data, were it generally available, would do poorly on interoperability, as it would be in the proprietary IDAT format, which, as we have previously discussed, had to be reverse engineered to be analysed with open tools. IDAT files are now a de facto standard for this data type, but the resources wasted in reverse engineering proprietary formats from instrument makers should not be underestimated. For data to be re-usable, provenance information is necessary; otherwise how can we trust the conclusions based on this data, and indeed, in the case of human data, be sure that we are complying with any use restrictions imposed during the ethics and consent process?

All datasets used in a published work should be deposited in a suitable repository, ideally publicly, but with access controls if needed for privacy reasons, before, or at the latest by, the time of publication. These depositions should follow FAIR principles; simply depositing them and getting a suitable identifier with which to refer to them is inadequate to meet this standard. The database in which datasets are deposited must have high curatorial standards to ensure that the data and metadata are available, to facilitate searching for datasets which have the properties necessary to address particular research questions.

For example, such a record should provide metadata such as age data for samples, or whether or not age data for the samples is available on request if it is withheld for privacy reasons. Another piece of metadata it would be preferable to include is the ethics approvals for collecting the samples, including such details as what donors consented to be done with them. The identifier “(01/0148)” and the information that the patients were from University College London Hospital tell us very little about the ethics approvals for these samples, as this is not a unique, persistent, resolvable identifier. This is an unremarkable example of the commonplace lack of consistency, FAIRness, and transparency in how research ethics approvals are referenced, including what form the referenced objects should take. This is something that should clearly be addressed by the research community, as transparency about these processes is important for public trust.

Unfortunately datasets have not always been treated like this historically, so how can researchers deal with this now?

  • Don’t make the problem worse by failing to do this properly when you generate your own data
  • If it is consistent with the data privacy considerations and ethics approvals create such a citable entity for data you want to use if none yet exists
  • As a last resort avoid using datasets which lack the information necessary to create such a record

If for some reason you cannot find an existing dataset, or generate a new dataset, with a robust public record of its provenance and must use a dataset without this, you can attempt to somewhat mitigate the problem. Beyond providing any contact information by which to request access, you can curate what information is publicly available about the dataset, with citations, possibly in your supplementary materials. This can clarify where each piece of information originates and make it clear how you have interpreted descriptions of the data from other sources. It also gives someone recursing through the citations to retrieve this information something to check against. This can also take information that would otherwise be behind a paywall and put it into an open access publication.

Short takeaway suggestions

  • Suggestions for publishers
    • Don’t have two different types of supplementary figure; knock it off with the ‘extended data figures’, Nature Aging, it’s confusing and unnecessary
    • As often repeated: require and check that all data be suitably deposited before publication. Do not permit published works to include sections which depend on data that is not suitably archived
    • Cross references should be to specific sections, references to “Methods” are overly vague. Allow for and enforce more granular cross referencing.
    • Do not count data-related citations towards any citation caps you impose (they are fairly absurd in the era of digital publishing anyway); researchers need to be able to cite all works necessary in support of good data provenance.
  • Suggestions for research ethics committees
    • Adopt unique, persistent, resolvable identifiers for details of research ethics approvals so that details of approvals can be cross-checked with published work and referenced in metadata for datasets

Slides

Slides from the discussion session on this manuscript.

Note

Excerpts, with permission, from communication with Prof. Teschendorff

In our Nat Aging paper we analyzed 25 independent whole blood cohorts, as well as 6 independent precancer datasets (a fact not emphasized in your blog!). This is a very significant number, and certainly much more than what you would find in a typical Nat Aging paper. For instance, the CALERIE trial and resource paper, both published in Nat Aging, only presents results in one DNAm dataset, so we have no idea if it is scientifically reproducible. I for one have repeatedly argued (see Teschendorff et al Nat Mater 2019) how important it is that results are validated in as many independent omic datasets as possible, because in practice it is almost impossible to account for all potential confounding factors in any given dataset. Hence, for your blog to focus, obsessively and unreasonably so, on only 2 of the >31 DNAm datasets analyzed not being ‘publicly available’, is extremely unfair, specially so, because (1) all the results in our manuscript are highly reproducible across the 25+6 independent datasets, (2) because the key results of our manuscript do not hinge at all on the TD and lung preinvasive datasets, and (3) because the two DNAm datasets in question are available on request. Anyone reading your blog however, and not reading our paper or your slides, would falsely conclude that our paper has a ‘reproducibility problem’, simply because you are overemphasizing these 2 datasets so much

Moreover, our statements in the data availability and ethics subsections of the paper are in my opinion logical and correct. The TD and lung preinvasive datasets have been published in Luo Q et al Genome Med 2023 and Teschendorff et al JAMA Oncology 2015 papers, because these are the first papers to present the data. Whether a specific dataset has been deposited in the public domain or whether it is available upon request, is not relevant to a logical definition of ‘publication’. Yes, I understand that current definitions, which require deposition of the data ‘somewhere’ (ie with a doi or weblink) to deserve the term ‘published’, would suggest that you are right, but in my opinion, these new definitions completely miss the point.

References

Banerjee, Anindo K., Pamela H. Rabbitts, and P. Jeremy George. 2004. “Preinvasive Bronchial Lesions.” Chest 125 (5): 95S–96S. https://doi.org/10.1378/chest.125.5_suppl.95S.
Foster, Nicola A., Anindo K. Banerjee, Jian Xian, Ian Roberts, Francesco Pezzella, Nicholas Coleman, Andrew G. Nicholson, Peter Goldstraw, Jeremy P. George, and Pamela H. Rabbitts. 2005. “Somatic Genetic Changes Accompanying Lung Tumor Development.” Genes, Chromosomes and Cancer 44 (1): 65–75. https://doi.org/10.1002/gcc.20223.
Horvath, Steve. 2013. “DNA Methylation Age of Human Tissues and Cell Types.” Genome Biology 14 (10): R115. https://doi.org/10.1186/gb-2013-14-10-r115.
Jeremy George, P., A. K Banerjee, C. A Read, C. O’Sullivan, M. Falzon, F. Pezzella, A. G Nicholson, P. Shaw, G. Laurent, and P. H Rabbitts. 2007. “Surveillance for the Detection of Early Lung Cancer in Patients with Bronchial Dysplasia.” Thorax 62 (1): 43–50. https://doi.org/10.1136/thx.2005.052191.
Luo, Qi, Varun B. Dwaraka, Qingwen Chen, Huige Tong, Tianyu Zhu, Kirsten Seale, Joseph M. Raffaele, et al. 2023. “A Meta-Analysis of Immune-Cell Fractions at High Resolution Reveals Novel Associations with Common Phenotypes and Health Outcomes.” Genome Medicine 15 (1): 59. https://doi.org/10.1186/s13073-023-01211-5.
McCaughan, Frank, Christodoulos P Pipinikas, Sam M Janes, P Jeremy George, Pamela H Rabbitts, and Paul H Dear. 2011. “Genomic Evidence of Pre‐invasive Clonal Expansion, Dispersal and Progression in Bronchial Dysplasia.” The Journal of Pathology 224 (2): 153–59. https://doi.org/10.1002/path.2887.
McCaughan, Frank, Jessica C. M. Pole, Alan T. Bankier, Bernard A. Konfortov, Bernadette Carroll, Mary Falzon, Terence H. Rabbitts, P. Jeremy George, Paul H. Dear, and Pamela H. Rabbitts. 2010. “Progressive 3q Amplification Consistently Targets SOX2 in Preinvasive Squamous Lung Cancer.” American Journal of Respiratory and Critical Care Medicine 182 (1): 83–91. https://doi.org/10.1164/rccm.201001-0005OC.
Meyer, David H., and Björn Schumacher. 2024. “Aging Clocks Based on Accumulating Stochastic Variation.” Nature Aging, May. https://doi.org/10.1038/s43587-024-00619-x.
Teschendorff, Andrew E., Zhen Yang, Andrew Wong, Christodoulos P. Pipinikas, Yinming Jiao, Allison Jones, Shahzia Anjum, et al. 2015. “Correlation of Smoking-Associated DNA Methylation Changes in Buccal Cells With DNA Methylation Changes in Epithelial Cancer.” JAMA Oncology 1 (4): 476. https://doi.org/10.1001/jamaoncol.2015.1053.
Tong, Huige, Varun B. Dwaraka, Qingwen Chen, Qi Luo, Jessica A. Lasky-Su, Ryan Smith, and Andrew E. Teschendorff. 2024. “Quantifying the Stochastic Component of Epigenetic Aging.” Nature Aging 4 (6): 886–901. https://doi.org/10.1038/s43587-024-00600-8.
