Reproducibility: Loss of DNA glycosylases improves health and cognitive function in a C. elegans model of human tauopathy

Author: Richard J. Acton

Published: September 23, 2024

Modified: October 7, 2024

Paper DOI: 10.1093/nar/gkae705

Overview

Tiwari et al. (2024) present the interesting and initially counterintuitive result that knocking out the two C. elegans DNA glycosylases NTH-1 and UNG-1, which perform the first steps of the base excision repair (BER) DNA damage response pathway, improved outcomes in a tauopathy model of Alzheimer’s disease. They argue that unresolved BER intermediates such as abasic sites and single-strand breaks, which result from DNA glycosylase activity, contribute to neurodegeneration, in part through the signalling function of DNA-bound NTH-1.

Open & FAIR materials processes & outputs

Pros

  • Sequencing data appears to have been deposited in a suitable repository
  • Authors include source code for some of their functional enrichment analyses
  • Version information is provided for software tools used in RNA-seq analysis
  • Validation of ‘hits’ in RNA-seq analysis with qPCR, and primer sequences are provided
  • Robust controls: inclusion of the non-aggregation-prone tau-expressing strain and a complete set of controls at both temperatures
  • What seems to me like appropriate use of post-hoc tests for the ANOVAs

Cons

  • Deposited sequencing data has not yet been made public
  • Provided code is in an inappropriate format (in the body of the supplementary PDF)
  • Raw data for everything other than sequencing is not provided
  • Version information is not included for software other than that used in the RNA-seq analysis

Data Availability

Data availability practices are applied somewhat inconsistently in this paper, which gives the impression that they were approached as something of a box-ticking exercise rather than with informed intention. For example, the RNA-seq data and downstream analysis are reported in considerable detail, albeit in a sub-optimal format (more on this later). Despite this attempt at detailed reporting of the raw sequencing data, no raw data is reported for the bulk of the experiments in the paper, only summary statistics.

As of 2024-09-17, GSE235015 was private and scheduled for release on 2026-04-15. Picking the longest embargo available by default, when you don’t know what the publication date will be, is commonplace; it is then easy to forget to set the record to public once the paper is published (see the recommendations for publishers below for what they can do to stop people forgetting). It is also quite possibly not the lead author(s) but a core facility staff member with the ability/know-how to do this, who needs to be informed by the other authors that the paper is out and the data can be made public. One can also ask: ‘does it really matter if the data is available a little before the paper?’ and simply add the reference once the paper is published.

The RNA-seq results in this and many similar papers are not the most interesting or robust findings. Indeed, RNA-seq experiments often serve to provide: somewhat circumstantial sanity checks for knock-outs/downs, a picture of the broader biological effects of the intervention(s) of interest, and a means of identifying new biological targets for follow-up in future work. Whilst useful and nice to have reported, there is no reason why these results should be reported in greater detail than ‘bench’ experiments that involve things like counting the number of worms. Arguably, bioinformatics and work on sequencing data in general have stronger cultural norms around public data deposition and sharing of analysis methods than exist for ‘bench’ experiments. However, the raw worm counts are, in this case, just as if not more interesting than the raw reads.

This leaves us in the somewhat dissonant and inconsistent state of having relatively complete and detailed reporting of the arguably less important and interesting results, which are much more expensive and complicated to archive, whilst missing some small and basic tables recording the raw data for the more important and interesting results.

The relevant methods section, ‘RNA isolation and RNAseq data analysis’, implies that the RNA-seq and its analysis were outsourced to Novogene, which may explain some of this inconsistency of approach. It is somewhat ambiguous about which aspects of the process were performed by the authors and which by Novogene.

Contrasting the social sciences and psychology literature with the biological literature highlights the odd behaviour of biologists in this regard. Much of the raw data in these fields is a spreadsheet or set of spreadsheets, which are pretty small relative to sequencing or imaging datasets. These fields are increasingly stringent about sharing these tables, which are often all that their raw data comprises. There is no reason not to deposit your raw spreadsheets along with your paper; data does not need to be big, in a special format, or deposited in a specialist repository to be worth sharing. Zenodo, OSF, and figshare are just a few of the services where you can deposit your relatively small raw datasets and generate persistent identifiers for them that you can reference in your paper. These are especially useful if your publication venue lacks a mechanism to deposit materials which are not PDFs alongside your publication, as they allow you to work around this limitation. Depositing simple tabular data in these repositories is far easier and less involved than submitting a dataset to GEO or BIA (neither of which is overly difficult).

In sum, biologists need to develop the habit of, and a culture around, depositing the raw data associated with ‘bench’ experiments, not just sequencing experiments, and this is very easy to do.

Reproducibility best practice suggestions

Strain referencing

The C. elegans strains BR5270 & BR5271 are referred to simply by these identifiers. These need to be contextualised with the index, reference, database, namespace, or similar in which they are meaningful. Fortunately these identifiers are quite ‘googleable’; purely numeric identifiers often are not. In this case the IDs reference the strains’ Caenorhabditis Genetics Center (CGC) entries. These strains also have WormBase identifiers, which could also be included to improve discoverability for someone searching by an alternative identifier.

For example: “Caenorhabditis Genetics Center (CGC) strains: BR5270 (WBStrain00003901) & BR5271 (WBStrain00003902)”. Making the accessions hyperlinks creates backlinks from the paper to the database, making entries easier to find both for readers and for page-rank-like search.

A good identifier is:

  • Unique - unambiguously identifies the object and only the object
  • Persistent - stable over time; it is not updated, rather a new revision is issued
  • Resolvable - the identifier is sufficient to look up the entry for the thing it is referencing

Unlike a DOI (digital object identifier), neither of these identifiers is generally resolvable. You need the context that it is a CGC strain to infer the URL, or, in the case of WormBase, both that it is a WormBase strain and of which species.

Citing web based tools

The authors performed some functional enrichment analysis using the TEA tool on the WormBase site, but do not report a date of access when citing this tool. When citing such an online tool it is important to specify the date of access, version information if possible, and, if relevant and available, any random seeds it may have used.

Online tools can change to fix bugs, be updated to include new data, etc., and without knowing the date of access the reader cannot know whether the results pre- or post-date a given bugfix/update. Good tools should provide this information and keep a ‘change log’ of fixes and updates to their resources so that the access date can be more easily and usefully interpreted.

Code Sharing and organisation Best Practices

Distributing Analysis code

Including code directly in a PDF is a highly suboptimal way of distributing code. Whitespace is rarely properly preserved, and other errors are common when copying and pasting from PDFs into plain text editors.

This is what a block of code copied and pasted directly from the PDF, with no editing, looks like:

df <- read_csv("/path/all_compare.csv") #___________________________________________________________________________ #Ung1_N2 set z_df71 <- df %>% filter(EN2_count>=100 & EUng1_count>=100 & E71_count>=100 & EU71_count>=100 ) %>

filter(abs(EUng1vsEN2_log2FoldChange) >= 1 | abs(E71vsEN2_log2FoldChange) | abs(EU71vsEN2_log2FoldChange) >= 1 ) %>% filter(EUng1vsEN2_padj <= 0.05 |E71vsEN2_padj <= 0.05 | EU71vsEN2_padj <= 0.05 ) count( z_df71)

Note that the copy-paste also dropped a % for no readily apparent reason, introducing a syntax error. The code should look something like this:

df <- read_csv("/path/all_compare.csv")

#___________________________________________________________________________

#Ung1_N2 set 

z_df71 <- df %>% 
    filter(
        EN2_count>=100 & EUng1_count>=100 & E71_count>=100 & EU71_count>=100
    ) %>%
    filter(
        abs(EUng1vsEN2_log2FoldChange) >= 1 | abs(E71vsEN2_log2FoldChange) |
        abs(EU71vsEN2_log2FoldChange) >= 1 
    ) %>% 
    filter(
        EUng1vsEN2_padj <= 0.05 | E71vsEN2_padj <= 0.05 | 
        EU71vsEN2_padj <= 0.05 
    ) 

count( z_df71)

The preferable way to distribute code is through a git hosting service such as Codeberg, GitLab, or GitHub. This is true even if you do not use git for version control, though doing so is also advisable.
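As a minimal sketch (the folder name, committer identity, and remote URL below are all hypothetical placeholders), putting an existing analysis folder under version control and publishing it might look like:

```shell
# Minimal sketch: put an analysis folder under git version control.
# 'my-analysis', the identity, and the remote URL are placeholders.
mkdir -p my-analysis
echo "# analysis code" > my-analysis/README.md    # stand-in for your scripts
git -C my-analysis init
git -C my-analysis add .
git -C my-analysis -c user.email=you@example.org -c user.name="You" \
    commit -m "Add analysis code as used in the paper"
# Then create an empty repository on your chosen host and push:
# git -C my-analysis remote add origin https://codeberg.org/you/my-analysis.git
# git -C my-analysis push -u origin main
```

Even a single retroactive commit like this gives readers an exact, copy-pasteable snapshot of the code, rather than text mangled by a PDF.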

Your code repository should also contain a file called LICENSE containing the license for your code. In most copyright regimes you retain all rights to your code by default, and for someone else to use it without potentially being liable for copyright infringement, you must license your code in a way that gives them permission to do so.

For small chunks of one-off analysis code I would suggest a permissive license such as the MIT license. This allows anyone to re-use your code for any purpose, and to re-license and re-distribute their versions. For more, I recommend The Turing Way chapter on licensing.

Ideally the repository should also contain a CITATION.cff file; this contains some simple metadata that lets citation managers import your code repository so people can reference it.
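A minimal CITATION.cff might look something like the following (all values here are hypothetical placeholders):

```yaml
# Minimal hypothetical CITATION.cff; all values are placeholders.
cff-version: 1.2.0
message: "If you use this code, please cite it as below."
title: "Analysis code for: <your paper title>"
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: "1.0.0"
date-released: "2024-09-23"
```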

To give your code repository a good (unique, persistent, resolvable) identifier, you can use a tool like Zenodo to archive a snapshot of your code repo and generate a DOI for it.

For more, check out the series of checklists I’m developing for best practices when sharing different kinds of research code.

Project Structure

Starting an R script with setwd() (set working directory) and including absolute file paths are common, but often relatively easily fixed, bad practices (Trisovic et al. 2022).

It is preferable for all the data to live in a sub-folder of the project called something like data, and to be referred to with relative paths, i.e. data/experiment.csv not /home/username/project/data/experiment.csv, as the former should work wherever your project folder is and the latter breaks if you move the project. In situations where the data is too large to place in this sub-folder for distribution with the code, you can add data to the .gitignore file so that git does not track it.
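For example (folder and file names here are illustrative), setting up such a layout and telling git to ignore a large data folder might look like:

```shell
# Illustrative project layout: code tracked by git, large data ignored.
mkdir -p project/data
echo "data/" >> project/.gitignore    # git will not track the data folder
# scripts refer to the data with a relative path:
printf 'df <- read.csv("data/experiment.csv")\n' > project/analysis.R
```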

Whilst working on the project on a compute cluster, for example, you can run ln -s /central-data-dir/my-data /home/$USER/project/data to create a symbolic link (shortcut) from a central repository of data to your working directory.

If, as will often be the case, you are distributing your data separately from your code, you may want to include code that retrieves the data it expects to run on from where it is deposited, if a local copy does not already exist. This is a good test that someone else could retrieve your data, run your code on it, and get the same result.
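A sketch of this fetch-if-missing pattern in shell (the function name is my own, and the URL passed in would be your dataset’s actual deposit location, e.g. on Zenodo):

```shell
# Download a data file only if no local copy exists.
# fetch_if_missing is a hypothetical helper; the URL is your deposit location.
fetch_if_missing() {
    local dest="$1" url="$2"
    if [ ! -f "$dest" ]; then
        mkdir -p "$(dirname "$dest")"   # ensure the data/ sub-folder exists
        curl -L -o "$dest" "$url"       # follow redirects from the repository
    fi
}
```

Calling it at the top of an analysis script means a fresh clone of the repository can fetch the data it needs on first run, while repeat runs use the local copy.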

CODECHECK and ReproHack are places where you can find someone else to check that they can run your code and get the same results using the documentation that you have provided.

Recommendations for publishers

Check data availability at time of publication

Publishers should check public data availability just before the point of publication, and not release the paper until the data is available; otherwise they lose their leverage and cannot properly enforce their own data availability standards.

Require Plot Source Data

This is a repeat of the recommendation from a previous post.

Provide CODECHECK-like reviews

Make sure that a third party can actually rerun the code provided with papers and get the same results with the documentation provided, as a part of the review process.

Misc. nit-picks

  • ‘DE’ (differentially expressed) is used but not disambiguated
  • ‘log fold change’ is used without a specified base in one instance. \(\log\) without a base is ambiguous: it is usually assumed to be the natural log, but is sometimes \(\log_{10}\); in this context it is probably \(\log_2\), as changes on this scale helpfully correspond to a doubling or halving of expression level. Consequently \(\log\) without a base should never be used, as it is unclear. \(\ln\) is acceptable for natural log instead of \(\log_e\), as it is not ambiguous.
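
For concreteness, on the \(\log_2\) scale a doubling or halving of expression corresponds to a fold change of exactly \(\pm 1\):

\[
\log_2\!\left(\frac{2x}{x}\right) = \log_2 2 = 1, \qquad \log_2\!\left(\frac{x/2}{x}\right) = -1
\]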

Slides

Download slides from discussion session.

References

Tiwari, Vinod, Elisabeth Buvarp, Fivos Borbolis, Chandrakala Puligilla, Deborah L Croteau, Konstantinos Palikaras, and Vilhelm A Bohr. 2024. “Loss of DNA Glycosylases Improves Health and Cognitive Function in a C. Elegans Model of Human Tauopathy.” Nucleic Acids Research, August. https://doi.org/10.1093/nar/gkae705.
Trisovic, Ana, Matthew K. Lau, Thomas Pasquier, and Mercè Crosas. 2022. “A Large-Scale Study on Research Code Quality and Execution.” Scientific Data 9 (1). https://doi.org/10.1038/s41597-022-01143-6.
