Visualisation: Loss of DNA glycosylases improves health and cognitive function in a C. elegans model of human tauopathy

Visualisation
Authors
Affiliation

Richard J. Acton

Alex Beliavskaia

Published

September 23, 2024

Modified

October 7, 2024

suppressPackageStartupMessages({
    library(dplyr)
    library(readr)
    library(ggplot2)
    library(ggbeeswarm)
    library(colorblindr)
    library(latex2exp)
})

paper doi: 10.1093/nar/gkae705

Overview

Overall the data visualisation in this paper is clear and comprehensible.

I like the choice to split the Kaplan-Meier survival curve plots out into their pairwise comparisons and also include the plot with all curves overlaid; this makes visual comparisons much easier. Figure 2 is a good example.

As we will return to below, the use of colour was quite consistent. Point shape was not used consistently, and it is not clear what, if anything, it is intended to indicate. Why, for example, are UNG-1 and BR5270 represented with squares and diamonds in figure 1 A whilst all other conditions are circles?

The order of information is sometimes inconsistent. For example, within the keys of figures 2 A & B, B flips the order of NTH-1; BR5271 and BR5271 relative to the order previously established in figures 1 and 2 A, an inversion which persists across the other Kaplan-Meier plots.

BR5270 & BR5271 are not very memorable, interpretable, or semantically meaningful names. They differ by a single character out of six, which makes telling them apart at a glance a little unreliable. Replacing them in the manuscript with simpler, semantically meaningful strain names makes it harder to confuse the two, easier to differentiate them at a glance, and saves the reader from looking up which is which in the methods. This improves the ease of comprehension of the manuscript. The strain reference numbers need only be mentioned when introducing the strains and/or in the methods. ‘τAg’ & ‘τ’, or ‘tauAg’ & ‘tau’, for BR5270 & BR5271 respectively are, for example, succinct and capture the fact that BR5270 expresses the aggregation-prone tau variant while BR5271 is a control which still expresses tau, but a variant not prone to aggregation.
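One way to apply such a renaming consistently is to keep a single lookup table and use it wherever labels are built. The following is a minimal sketch in base R; `strain_names` and `relabel_strain()` are hypothetical names introduced for illustration, and the example labels only mimic the style of the manuscript's figure keys.

```r
# Lookup from strain reference numbers to the semantic names suggested above
strain_names <- c("BR5270" = "tauAg", "BR5271" = "tau")

# Replace every strain ID found in a label with its semantic name
# (hypothetical helper, not part of any package)
relabel_strain <- function(labels, lookup = strain_names) {
    for (id in names(lookup)) {
        labels <- gsub(id, lookup[[id]], labels, fixed = TRUE)
    }
    labels
}

# Labels in the style used in the figure keys
relabel_strain(c("N2", "BR5271", "UNG-1; BR5270", "NTH-1; BR5271"))
```

Keeping the mapping in one place means the strain reference numbers can still be reported once in the methods while every figure key uses the same readable names.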

Colour Palette Semantics

The authors did a good job of using their chosen colour palette consistently. It is common to see inconsistent use of colour between figures, and the authors generally did not make this mistake.

Figures 4 A, C, 5 E, F, G are the only places where the N2, UNG-1, & NTH-1 colours differ from their use elsewhere. They could probably have been kept the same here for clarity, especially as pink here means UNG-1 and elsewhere means UNG-1; BR5270.

Figure 10 B reuses the burgundy colour, which elsewhere has meant ‘UNG-1; BR5270’, to represent \(H_2O_2\), while figure 1 C uses a different colour, a shade of orange not previously used, to encode \(H_2O_2\).

Let’s take a look at the original palette:

orig_colour_palette <- c(
    "N2" = "#000000",
    "UNG-1" = "#FF7F80",
    "NTH-1" = "#000080",
    "tauAg" = "#8E7D28",
    "UNG-1; tauAg" = "#9D054E",
    "NTH-1; tauAg" = "#FD5302",
    "tau" = "#0D97B1",
    "UNG-1; tau" = "#AF01E7",
    "NTH-1; tau" = "#808080"
)
conditions <- names(orig_colour_palette)

palette_plot <- tibble(
        condition = factor( # invert order so same in key as plot
            conditions, levels = rev(conditions), ordered = TRUE
        ),
        value = 1
    ) %>%
    ggplot(aes(condition, value)) + 
        geom_col(aes(fill = condition)) + 
        ggplot2::scale_fill_manual(values = orig_colour_palette) + 
        coord_flip() + 
        theme_void()

palette_plot

With the {colorblindr} package we can simulate colour vision issues to try and ensure that our palette will be accessible.

palette_plot %>% cvd_grid()
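The same kind of simulation can be run on the hex codes themselves, which is handy for checking a palette before any plot exists. This is a minimal sketch, assuming the {colorspace} package (listed alongside {colorblindr} in the session info at the end of this post) is installed. `deutan()`, `protan()` and `tritan()` are its colour vision deficiency simulators. The four colours are a subset of the original palette above, and `cvd_table` is a name introduced here for illustration.

```r
library(colorspace)

# A few colours from the original palette defined above
pal <- c(
    "UNG-1" = "#FF7F80",
    "UNG-1; tauAg" = "#9D054E",
    "NTH-1; tauAg" = "#FD5302",
    "NTH-1; tau" = "#808080"
)

# Simulate how each colour appears under three types of
# colour vision deficiency, returned as hex codes
cvd_table <- data.frame(
    condition = names(pal),
    original = unname(pal),
    deutan = deutan(unname(pal)),
    protan = protan(unname(pal)),
    tritan = tritan(unname(pal))
)

cvd_table
```

Colours that end up with near-identical hex codes in one of the simulated columns are candidates for being confused by readers with that type of colour vision.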

The colour palette here has quite a tricky combination of factors to capture.

Three main ‘axes’:

  • Control / Experimental
  • Mutated Glycosylase Genotype
  • Background Strain

Conditions: 5 primary:

  • N2
  • tau
  • tauAg
  • NTH-1
  • UNG-1

4 hybrid:

  • NTH-1; tau
  • UNG-1; tau
  • NTH-1; tauAg
  • UNG-1; tauAg

At 9 different colours this is approaching the upper limit on the number of colours you can include in a single palette and have them comfortably resolvable. Though because the NTH-1 and UNG-1 strains are grown at different temperatures they are rarely the subject of direct comparison in the same plot, figures 4 & 5 being an exception, so there is scope to have these two be less differentiable if necessary.
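How resolvable a set of colours is can be roughed out numerically. The sketch below converts the original palette to CIELAB using base R and takes the Euclidean distance between every pair of colours (the 1976 ΔE measure). It is an illustration only: `hex_to_lab()` is a hypothetical helper, and no particular distance threshold for "comfortably resolvable" is assumed.

```r
# Original palette from the manuscript, as defined earlier in this post
orig_colour_palette <- c(
    "N2" = "#000000", "UNG-1" = "#FF7F80", "NTH-1" = "#000080",
    "tauAg" = "#8E7D28", "UNG-1; tauAg" = "#9D054E",
    "NTH-1; tauAg" = "#FD5302", "tau" = "#0D97B1",
    "UNG-1; tau" = "#AF01E7", "NTH-1; tau" = "#808080"
)

# Convert named hex colours to a matrix of CIELAB coordinates
# (hypothetical helper using only base R)
hex_to_lab <- function(hex) {
    rgb <- t(grDevices::col2rgb(hex)) / 255
    lab <- grDevices::convertColor(rgb, from = "sRGB", to = "Lab")
    rownames(lab) <- names(hex)
    lab
}

lab <- hex_to_lab(orig_colour_palette)

# Euclidean distance in Lab space between every pair of colours
delta_e <- as.matrix(stats::dist(lab))
diag(delta_e) <- NA

# The least distinguishable pair of conditions in the palette
closest <- which(delta_e == min(delta_e, na.rm = TRUE), arr.ind = TRUE)[1, ]
rownames(delta_e)[closest]
```

If the closest pair happens to be two conditions that never share a plot, a small distance between them matters much less than it would for conditions that are compared directly.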

Controls, especially negative ones, are often well represented by greyscale or by lighter, desaturated colours. So if we give our ‘control only’ conditions, N2 and tau, dark grey / black colours, then we are free to use lighter desaturated colours for our other conditions which often also act as controls: NTH-1; tau, UNG-1; tau, and NTH-1 / UNG-1. We can then give the darker and more saturated colours to our intervention conditions: NTH-1; tauAg, UNG-1; tauAg, and tauAg. Relatively bright orange colours are generally associated with something alarming or a warning, so using one for the disease model that we expect to be negatively impacted by tau aggregates (tauAg) makes sense. This condition can also be thought of as a positive control, so the more saturated colour differentiates it from the negative controls.

In short: tauAg is darker and redder / more orange, the UNG-1 conditions are more purple, and the NTH-1 conditions are more green.

colour_palette <- c(
    "N2" = "#333333ff",
    "UNG-1" = "#ae8fd9ff",
    "NTH-1" = "#c1bf8eff",
    "tau" = "#666666ff",
    "UNG-1; tau" = "#b1b6d8ff",
    "NTH-1; tau" = "#b2e2c6ff",
    "tauAg" = "#f89540ff",
    "UNG-1; tauAg" = "#54278fff",
    "NTH-1; tauAg" = "#006d2cff"
)
conditions <- names(colour_palette)

palette_plot <- tibble(
        condition = factor( # invert order so same in key as plot
            conditions, levels = rev(conditions), ordered = TRUE
        ),
        value = 1
    ) %>%
    ggplot(aes(condition, value)) + 
        geom_col(aes(fill = condition)) + 
        ggplot2::scale_fill_manual(values = colour_palette) + 
        coord_flip() + 
        theme_void()

palette_plot

palette_plot %>% cvd_grid()
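As a quick check that the reworked palette follows the design rationale above, the saturation of each colour can be computed in base R. This is a sketch only: HSV saturation is one of several possible measures of how vivid a colour is, and the grouping into `intervention` and `controls` is written out by hand from the description above.

```r
# Reworked palette, as defined earlier in this post
colour_palette <- c(
    "N2" = "#333333ff", "UNG-1" = "#ae8fd9ff", "NTH-1" = "#c1bf8eff",
    "tau" = "#666666ff", "UNG-1; tau" = "#b1b6d8ff",
    "NTH-1; tau" = "#b2e2c6ff", "tauAg" = "#f89540ff",
    "UNG-1; tauAg" = "#54278fff", "NTH-1; tauAg" = "#006d2cff"
)

# HSV saturation (0 = grey, 1 = fully saturated);
# col2rgb() ignores the alpha channel by default
saturation <- grDevices::rgb2hsv(grDevices::col2rgb(colour_palette))["s", ]
names(saturation) <- names(colour_palette)

# Conditions in the aggregating tau background vs everything else
intervention <- c("tauAg", "UNG-1; tauAg", "NTH-1; tauAg")
controls <- setdiff(names(colour_palette), intervention)

# Every tauAg-background colour should be more saturated than every control
round(sort(saturation), 2)
min(saturation[intervention]) > max(saturation[controls])
```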

Re-visualising some example figures

As raw data for the plots was not provided, the numbers used are eyeballed estimates from looking at the original plots.

For simplicity I have forgone representing pair-wise significance brackets in these re-plots. Instead of a barplot I have opted to use a beeswarm plot with a box and whiskers plot to indicate the median and spread of the data. ‘Beeswarm plots’ are just dotplots with a ‘dodging’ algorithm applied to avoid overplotting points, which produces a density or violin plot-like shape of the points. I think this works a little better than SEM in cases like % viability where an error bar extending over 100% does not make sense.
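The problem with symmetric error bars on bounded data can be shown with a few lines of base R. The sketch below uses a t-based 95% confidence interval as the symmetric bar, which is an assumption made for illustration and not the bar type used in the original figure. The values are the eyeballed % hatching estimates for the tau condition used later in this post.

```r
# Eyeballed % hatching values for one condition (tau)
viability <- c(0, 97, 100, 100, 100, 100)

n <- length(viability)
sem <- sd(viability) / sqrt(n)

# Symmetric t-based 95% interval around the mean
# (an illustrative choice of error bar)
ci <- mean(viability) + c(-1, 1) * qt(0.975, df = n - 1) * sem

# Five-number summary, the quantities a box and whiskers plot is built from
five_num <- fivenum(viability)

ci       # the upper limit lies above 100%
five_num # every value stays within the observed 0-100% range
```

Because the box and whiskers are built from order statistics of the data, they cannot extend beyond values that were actually observed, whereas a symmetric interval around the mean can.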

Using our new colour palette and names, we can rework some of the figures in the manuscript and see how they look with these changes.

I’ve also changed the order in which the conditions appear in the plot so that all of the conditions which function in some way as controls are grouped together. UNG-1, tau, & UNG-1; tau are all expected to behave similarly to N2 if the non-aggregating tau control is good, the UNG-1 KO has no substantial detrimental effects, and neither does the combination of the two. These are serving as negative controls. tauAg is a positive control in that it is expected to be negatively impacted by the aggregating tau. The last column contains the experimental condition of interest, the UNG-1 mutant in the tau aggregating background, which is to be contrasted with all the other conditions. The same rationale can be applied to NTH-1.

# factor order determines the order in the plot
conditions_fct <- factor(conditions, levels = conditions, ordered = TRUE)

base_plot <- function(data) {
    data %>%
        ggplot(aes(Condition, value, colour = Condition)) +
        # index the palette by name; indexing with a factor would use its integer codes
        scale_colour_manual(values = colour_palette[as.character(conditions_fct)]) + 
        geom_boxplot(outlier.shape = NA, size = 0.8, show.legend = FALSE) +
        # geom_point(size = 2, show.legend = FALSE) +
        geom_beeswarm(size = 2, show.legend = FALSE, groupOnX = TRUE) +
        theme_minimal() +
        theme(
            axis.title.x = element_blank(),
            axis.text.x = element_text(angle = 35, vjust = 1, hjust = 1)
        )
}
UNG1_conditions_fct <- conditions_fct[!grepl("NTH-1", conditions_fct)]
genotype_fct <- factor(
    c("UNG-1", "NTH-1"), levels = c("UNG-1", "NTH-1"), ordered = TRUE
)
UNG1_brood_size <- tibble(
    Condition = rep(UNG1_conditions_fct, each = 6),
    value = c( # Brood Size
        40,50,56,60,70,78, # N2
        30,38,48,50,58,70, # UNG-1
        8, 36,38,40,41,69, # tau
        18,22,50,52,61,61, # UNG-1; tau
        8, 10,16,18,32,34, # tauAg
        48,49,54,60,66,68  # UNG-1; tauAg
    ),
    measure = "Brood Size", genotype = genotype_fct[genotype_fct == "UNG-1"]
)

UNG1_brood_size_plot <- UNG1_brood_size %>%
    base_plot() +
    lims(y = c(0, 100)) +
    labs(title = "(A) UNG-1 Brood Size", y = "No. of eggs / worm")

UNG1_brood_size_plot
knitr::include_graphics("tiwari2024_fig1-a.png")

[Figure: rework, shown with the original figure 1 A for comparison]
NTH1_conditions_fct <- conditions_fct[!grepl("UNG-1", conditions_fct)]
NTH1_brood_size <- tibble(
    Condition = rep(NTH1_conditions_fct, each = 3),
    value = c( # Brood Size
        75,76,88, # N2
        72,68,78, # NTH-1
        72,79,89, # tau
        74,80,87, # NTH-1; tau
        18,19,20, # tauAg
        52,57,58  # NTH-1; tauAg
    ),
    measure = "Brood Size", genotype = genotype_fct[genotype_fct == "NTH-1"]
)
NTH1_brood_size_plot <- NTH1_brood_size %>%
    base_plot() +
    lims(y = c(0, 100)) + 
    labs(title = "(B) NTH-1 Brood Size", y = "No. of eggs / worm")

NTH1_brood_size_plot
knitr::include_graphics("tiwari2024_fig1-b.png")

[Figure: rework, shown with the original figure 1 B for comparison]

With this semantically designed palette you can see more easily at a glance whether UNG-1 or NTH-1 is the group within which comparisons are being made, as the respective purple and green hues are present in all the conditions associated with the knockout. In the original NTH-1 palette navy blue, grey, and orange have no obvious relationship; the peach, pink and burgundy used for UNG-1 are a more cohesive combination.

UNG1_viability <- tibble(
    Condition = rep(UNG1_conditions_fct, each = 6),
    value = c( # egg viability %
        96, 98, 97,  100, 100, 100, # N2
        95, 96, 97,  98,  99,  100, # UNG-1
        0,  97, 100, 100, 100, 100, # tau
        95, 96, 97,  98,  100, 100, # UNG-1; tau
        50, 70, 80,  84,  96,  100, # tauAg
        96, 97, 97,  99,  100, 100  # UNG-1; tauAg
    ),
    measure = "Egg Viability", genotype = genotype_fct[genotype_fct == "UNG-1"]
)

NTH1_viability <- tibble(
    Condition = rep(NTH1_conditions_fct, each = 6),
    value = rep(100,6*6), # egg viability %,
    measure = "Egg Viability", genotype = genotype_fct[genotype_fct == "NTH-1"]
)

UNG1_viability_plot <- UNG1_viability %>%
    base_plot() +
    lims(y = c(0, 100)) + 
    labs(title = "(C) UNG-1 Egg Viability", y = "% hatching")

NTH1_viability_plot <- NTH1_viability %>%
    base_plot() +
    lims(y = c(0, 100)) + 
    labs(title = "(D) NTH-1 Egg Viability", y = "% hatching")

UNG1_brood_size_plot
NTH1_brood_size_plot
UNG1_viability_plot
NTH1_viability_plot

It is unfortunately extremely common for individual plots in multi-panel figures (aka matrix plots) to lack titles. The panel letter is often prominent instead of the panel title. The panel letter is devoid of semantic content and functionally useless for interpreting the plot; it only serves to let the reader look up additional context and is not an adequate title. Frequently one of the axis titles contains what should be the panel title, which mixes the title information in with other information that belongs on the axes, such as units. Which axis the title information is nestled within varies with the content of the plot. This makes the subject of many plots difficult to discern immediately without carefully parsing the axis labels and sometimes referring to the key, legend or even the body of the text.

The title should contain the basic context necessary to interpret the plot and, by being positioned prominently, provide an ‘entry point’ to the graphic. You want to give this context to your reader first so they don’t have to dig for it in the axis titles etc. and know immediately if they are looking at the right panel in the figure for what they want to know. The top left is a conventional starting position, at least in English; figures in the context of other writing systems would want to follow their conventions.
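In {ggplot2} this separation of roles can be expressed directly. The sketch below is one possible arrangement and not the only reasonable one: the title carries the subject of the panel, the tag carries the panel letter, and the axis title carries only the units. The values are the eyeballed N2 and tauAg estimates from the NTH-1 brood size panel in this post, and `titled_panel` is a name introduced for illustration.

```r
library(ggplot2)

# Eyeballed brood size estimates for two conditions
brood <- data.frame(
    Condition = rep(c("N2", "tauAg"), each = 3),
    eggs = c(75, 76, 88, 18, 19, 20)
)

titled_panel <- ggplot(brood, aes(Condition, eggs)) +
    geom_point() +
    labs(
        title = "Brood Size",     # what the panel is about
        tag = "A",                # panel letter, for cross-referencing only
        y = "No. of eggs / worm", # the axis title carries the units
        x = NULL
    ) +
    theme_minimal() +
    theme(
        # align the title with the left edge of the whole plot
        plot.title.position = "plot",
        # move the letter out of the reader's entry point
        plot.tag.position = "topright"
    )

titled_panel
```

Placing the tag at the top right is a design choice made here so that the title, and not the letter, occupies the top left starting position.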

‘Small multiples’ or ‘faceted’ visualisations, where you plot groups within a dataset in a grid of the same type of plot to facilitate comparison between them, present opportunities to increase clarity. Visual elements can indicate groupings, such as the grey strips in the example below. Pulling the common element being compared, the mutation of interest (UNG-1 / NTH-1), out of the x axis labels and into the top strip makes this grouping more immediately apparent. Shared axis labels dispel the need to check whether things are plotted on the same scale and reduce unhelpful redundancy.

When these two separate small multiples plots are combined within a multi-panel figure, the redundant titles and vertical strips emphasize that the two panels represent measurements of two different quantities, despite their similar data ranges. They are not rows within the same small multiples plot but separate panels. You could call these A and B and have two panels here instead of four.

bind_rows(UNG1_brood_size, NTH1_brood_size) %>%
    base_plot() + 
    facet_grid(measure~genotype, drop = TRUE, scales = "free_x") + 
    theme(
        strip.background = element_rect(
            fill = "lightgrey", linetype = "blank"
        )
    ) +
    lims(y = c(0, 100)) + 
    labs(title = "Brood Size", y = "No. of eggs / worm")
bind_rows(UNG1_viability, NTH1_viability) %>%
    base_plot() + 
    facet_grid(measure~genotype, drop = TRUE, scales = "free_x") + 
    theme(
        strip.background = element_rect(
            fill = "lightgrey", linetype = "blank"
        )
    ) +
    lims(y = c(0, 100)) + 
    labs(title = "Egg Viability", y = "% hatching")

Original for comparison:

Figure 5 D

Create a bar chart using the new values, with each bar shaded in varying intensities of blue based on the \(-log_{10}Q\) value (i.e., higher values should appear in darker blue, and lower values in lighter blue).
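The transformation itself is simple arithmetic: smaller Q values give larger \(-log_{10}Q\) values, so the most significant categories get the darkest bars. A minimal sketch with hypothetical Q values, used for illustration only and not taken from the paper:

```r
# Hypothetical Q values, for illustration only
q_values <- c(0.05, 0.001, 1e-6)

# Smaller Q gives a larger -log10(Q)
minus_log10_q <- -log10(q_values)
minus_log10_q

# Rank 1 = largest -log10(Q), i.e. the darkest bar
rank(-minus_log10_q)
```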

Tip

The {latex2exp} package provides the TeX() function which can convert the widely used LaTeX maths syntax to R’s arcane expressions language making it much easier to create plot axis labels which contain mathematical expressions.

"tiwari2024_q-values_from_table_S8.csv" %>%
    read_csv(show_col_types = FALSE) %>%
    arrange(-Value) %>%
    mutate(
        "minus_log10Q" = -log10(Value),
        Category = factor(Category, levels = Category)
    ) %>%
    ggplot(aes(x = Category, y = minus_log10Q, fill = minus_log10Q)) +
        geom_col() + # equivalent to geom_bar(stat = "identity")
        coord_flip() + 
        theme_minimal() +
        scale_fill_gradient(low = "lightblue", high = "darkblue") +
        theme(legend.position = "none") +
        labs(
            title = "glycosylaseKO:BR5270 consensus vs N2",
            y = TeX("$-log_{10}Q$"), x = NULL
        )

Original For Comparison


Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] latex2exp_0.9.4   colorblindr_0.1.0 colorspace_2.0-3  ggbeeswarm_0.6.0 
[5] ggplot2_3.3.6     readr_2.1.2       dplyr_1.1.4      

loaded via a namespace (and not attached):
 [1] bit_4.0.4         gtable_0.3.0      jsonlite_1.8.0    crayon_1.5.1     
 [5] compiler_4.3.1    renv_0.15.5       tidyselect_1.2.0  stringr_1.4.0    
 [9] parallel_4.3.1    scales_1.3.0      yaml_2.3.5        fastmap_1.1.0    
[13] R6_2.5.1          labeling_0.4.2    generics_0.1.2    knitr_1.39       
[17] htmlwidgets_1.6.4 tibble_3.2.1      munsell_0.5.0     pillar_1.9.0     
[21] tzdb_0.3.0        rlang_1.1.0       utf8_1.2.2        stringi_1.7.6    
[25] xfun_0.38         bit64_4.0.5       cli_3.6.2         withr_2.5.0      
[29] magrittr_2.0.3    digest_0.6.29     grid_4.3.1        vroom_1.5.7      
[33] hms_1.1.1         cowplot_1.1.1     beeswarm_0.4.0    lifecycle_1.0.3  
[37] vipor_0.4.5       vctrs_0.6.5       evaluate_0.15     glue_1.6.2       
[41] farver_2.1.0      fansi_1.0.3       rmarkdown_2.14    tools_4.3.1      
[45] pkgconfig_2.0.3   ellipsis_0.3.2    htmltools_0.5.7