Reproducibility: Common DNA sequence variation influences epigenetic aging in African Populations
manuscript doi: 10.1101/2024.08.26.608843
Overview
Meeks et al. (2024) note that DNA methylation (DNAm) based age prediction models have largely been trained on populations of predominantly European ancestry and have not taken into account genetic variants with known effects on DNA methylation, methylation Quantitative Trait Loci (meQTLs). Failing to consider the effects of meQTLs on DNAm age estimators may reduce their accuracy when they are applied to populations whose variant frequencies differ from those of the populations on which they were trained.
They examine this lack of portability by applying existing age models to members of groups not represented in the training data.
They develop an age predictor minimally affected by meQTLs across populations, and identify meQTL variants which correlate with the rate of epigenetic ageing.
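The portability concern summarised above can be illustrated with a toy simulation. This is a minimal sketch and not the authors' method: it assumes a one-CpG linear "clock", and the allele frequencies, effect sizes, and function names are all made up for illustration. The clock is trained where the meQTL effect allele is rare and then applied where it is common.

```python
import random
import statistics

def simulate(n, allele_freq, rng, slope=0.004, meqtl_effect=0.05, noise_sd=0.01):
    """Return (ages, methylation) for a toy CpG that depends on age and genotype."""
    ages, meth = [], []
    for _ in range(n):
        age = rng.uniform(20, 80)
        # Genotype: count of effect alleles (0, 1 or 2), drawn independently.
        genotype = sum(rng.random() < allele_freq for _ in range(2))
        beta = 0.3 + slope * age + meqtl_effect * genotype + rng.gauss(0, noise_sd)
        ages.append(age)
        meth.append(beta)
    return ages, meth

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + gradient * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    gradient = sxy / sxx
    return my - gradient * mx, gradient

def mean_error(model, ages, meth):
    """Mean of (predicted age - true age); positive means over-estimation."""
    intercept, gradient = model
    errors = [intercept + gradient * m - age for age, m in zip(ages, meth)]
    return statistics.fmean(errors)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Train the toy clock where the effect allele is rare (frequency 0.1).
train_ages, train_meth = simulate(2000, 0.1, rng)
model = fit_line(train_meth, train_ages)  # predict age from methylation

# Evaluate on a new sample with the same allele frequency, and on one
# where the effect allele is common (frequency 0.8).
bias_same = mean_error(model, *simulate(2000, 0.1, rng))
bias_other = mean_error(model, *simulate(2000, 0.8, rng))

print(f"mean error, same allele frequency:      {bias_same:.1f} years")
print(f"mean error, different allele frequency: {bias_other:.1f} years")
```

In this toy setup the clock is close to unbiased in a population resembling its training data, but systematically over-estimates age by several years where the effect allele is common, even though the relationship between age and methylation is identical in both groups.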
Some critiques that we have raised at greater length about other works also apply here: those pertaining to deposition of analysis code, protocol-level methodological details, and use of contributor role ontologies. These are commonplace failings in current publishing norms and we won’t revisit them here. Overall this manuscript is in line with current practice, including some of the deficiencies of that practice. It is a pre-print, which is in itself a movement in the right direction. The authors do a good job of indicating the exact formulation of their statistical models.
The bulk of our commentary on this work pertains to failures of openness in upstream publications, which impeded the authors’ work and which they had to work around.
Impact on this work of failures of openness in upstream projects
The authors of this work note that they did not have access to the raw intensities for two of their datasets:
We did not have access to the raw intensities for the Baka and ‡Khomani San methylation datasets
This meant that they had to adopt an alternative and potentially less robust approach to batch correction. The raw intensities include the signal from the control probes on these arrays, which provide extensive technical control information for a variety of sources of error. Information from them is often used to correct for this error even within individual studies (Fortin et al. 2014).
Raw intensities are generally represented in the proprietary IDAT file format produced by Illumina scanners for their bead arrays. This format originally required reverse engineering efforts from researchers to allow the open and reproducible processing of these files (Smith et al. 2013); processing these files with open tools is now the norm.
The online calculator at dnamage.clockfoundation.org does an inadequate job of methods transparency. The goal of providing a convenient online tool to perform these calculations is laudable, but for any such tool to be suitable for use in scientific research it must be highly transparent in how it functions. Source code for how the computations are actually performed should be available, and version information should be included so that anyone citing calculations performed with such a tool can provide a proper citation. In the absence of version information, the best that can be done is to include the date on which an analysis was performed with such a web-based tool. Without sufficient detail about how the back-end functions and how it may change over time to reflect updates, bug fixes, new data, etc., a date of use says little about the analysis that was actually performed.
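When a tool reports no version, the fallback described above (recording the date of use) can at least be made systematic. The following is a minimal sketch: the function name, field names, example input, and date are all hypothetical, and it assumes nothing about the calculator's actual interface.

```python
import hashlib
import json
from datetime import date

def provenance_record(tool_address, input_bytes, accessed, version=None):
    """Collect the minimum details needed to cite one run of a web-based tool."""
    return {
        "tool_address": tool_address,
        # Many web tools expose no version; record that gap explicitly.
        "tool_version": version if version is not None else "not reported",
        "date_accessed": accessed.isoformat(),
        # A checksum ties the record to the exact input that was uploaded.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }

# Hypothetical input file contents and date of use, for illustration only.
example_input = b"sample_id,cpg_1\nS1,0.52\n"
record = provenance_record(
    "dnamage.clockfoundation.org", example_input, date(2024, 9, 1)
)
print(json.dumps(record, indent=2))
```

A record like this does not make an opaque tool reproducible, but it does make explicit which details could and could not be captured at the time of use.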
Best practices for code sharing have been covered in previous posts on this blog.
These methods availability issues hampered the authors’ ability to assess meQTL-impacted CpGs in the GrimAge and GrimAge2 models.
We excluded GrimAge and GrimAge2 as the details of these models are not publically available.
In the software section of their methods for GrimAge2, Lu et al. (2022) simply point to the aforementioned online calculator. Lu et al. (2019)’s GrimAge paper lacks sufficient detail to reproduce their calculations exactly: supplementary table 2 details the coefficients of the covariates in their model, but not the specific CpGs used as surrogates to predict them.
Data Availability
The authors note that:
[N]ewly generated Himba methylation data will be available via GEO deposition.
(emphasis mine)
It is currently common practice to release the data underpinning a manuscript only at the time of its publication in an academic journal, and often not at the pre-print stage. Instead, I advocate depositing data once it is generated and then using it as though you were a third party. This makes it easier to test the FAIRness of your data outputs: you can be much more confident that your deposition is re-usable if you only allow yourself to use the information in your deposition when analysing your own data. In other words, dogfooding the public record of your data is the best way to be confident that others could re-use it.
Reproducibility best practice suggestions
- Suggestions for future authors
  - Deposit data before publication and dogfood your deposition in your analysis.
  - Avoid the use of online analysis tools which do not provide adequate methodological detail for how they perform their analyses. If you have sufficient detail, consider implementing the method locally and comparing the results. If there is insufficient detail, or the results diverge, point this out.
- Suggestions for publishers
  - Adopt more stringent policies on the methodological transparency and detail required in the papers you publish, and do not publish papers which fail to meet these standards. More stringent standards here could have alleviated some of the methods availability issues faced by the authors of this manuscript.
- Suggestions for operators of online tools for scientific researchers
  - Make your source code available and citable, ideally with sufficient detail for a technical user to deploy a development/testing version of your online tool locally.
  - Provide clear version information for the deployed tool within the tool itself.
  - Include any random seeds used which affect outputs, and permit them to be set by users seeking to reproduce previous outputs.
  - Include references to any datasets used by your tool. If the data are access controlled for privacy reasons, reference an identifier for the dataset which provides high-level metadata and information about the access restrictions.
  - Provide citation guidance which includes: version, date, source code repository, and, if relevant, random seed information.
- Suggestions for policy makers / funders / procurement
  - Do not permit the purchase of instruments which produce outputs in proprietary formats. Require, at minimum, that the full technical specifications of the formats be openly available, so that third parties can implement open tools to use them without painstaking and uncertain reverse engineering efforts. Include such criteria in tender documents to make it clear to instrument manufacturers that this is a priority. It is important that an understanding of these requirements penetrates the business and sales side of instrument manufacturers, not just the technical and engineering side.
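The suggestions above for operators of online tools can be sketched in code. This is an illustration under stated assumptions, not a description of any existing tool: the tool name, version, repository address, and dataset identifier are hypothetical placeholders, and a bootstrap mean stands in for whatever stochastic step a real tool might contain.

```python
import random

# All identifiers below are hypothetical placeholders, not a real tool.
TOOL_NAME = "example-age-calculator"
TOOL_VERSION = "1.2.0"
SOURCE_REPOSITORY = "https://example.org/example-age-calculator"
REFERENCE_DATASETS = ["EXAMPLE-DATASET-0001"]

def run_analysis(values, seed=None):
    """Run a toy stochastic analysis and return the result with its provenance."""
    if seed is None:
        # No seed supplied: draw one, and report it so the run can be repeated.
        seed = random.SystemRandom().randrange(2**32)
    rng = random.Random(seed)
    # Stand-in for any stochastic step: the mean of one bootstrap resample.
    resample = [rng.choice(values) for _ in values]
    return {
        "estimate": sum(resample) / len(resample),
        "seed": seed,
        "tool": TOOL_NAME,
        "version": TOOL_VERSION,
        "source_repository": SOURCE_REPOSITORY,
        "reference_datasets": list(REFERENCE_DATASETS),
    }

def citation_text(result, date_of_use):
    """Build citation guidance from the provenance returned with a result."""
    return (
        f"{result['tool']} version {result['version']}, "
        f"{result['source_repository']}, used {date_of_use}, "
        f"random seed {result['seed']}"
    )

values = [0.41, 0.52, 0.48, 0.60]
first = run_analysis(values)                       # the tool chooses a seed
repeat = run_analysis(values, seed=first["seed"])  # the user reproduces the run
print(citation_text(first, "2024-09-01"))
print(first["estimate"] == repeat["estimate"])
```

Because every result carries its own version, seed, and dataset references, a user can cite a run precisely and later repeat it by passing the reported seed back in.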
Slides
Slides from the discussion session on this manuscript.
See also the companion piece to this post on data visualisation and figure design in this manuscript.