Visualisation: Single-cell transcriptome profiling of the human developing spinal cord reveals a conserved genetic programme with human-specific features

Visualisation
Authors

Richard J. Acton

Jamie McLeod

Published

December 4, 2023

Modified

December 7, 2023

paper doi: 10.1242/dev.199711

As in the previous post covering this paper we are focused on figure 1 and will bring in some related points from the rest of the paper.

Some general comments before we move on to the main graphic we chose to focus on figure 1 B. Overall we liked the data visualisation in this paper, it was clean, clear and well ordered. We have some nitpicks about some minor points that we thought were opportunities for improvement. We thought that it was a nice touch using the diagram in figure 2 B as the colour key, however some aspects of this were not entirely clear. The shades of blue and red/orange used in the diagram and the plots were an imperfect match and there was some ambiguity to the meaning of the different shades of grey, as well as between RP & FP. A supplementary colour key with the 7 colours used in figure 2 A & C alongside the diagram in figure 2 B would have provided a bit more clarity.

Some other small suggestions:

Code
suppressPackageStartupMessages({
    library(dplyr)
    library(tidyr)
    library(ggplot2)
    library(cowplot)
    library(colorspace)
    library(colorblindr)
    library(gt)
})

Examing representations of cell type proportions over time

Consider the following figures: Figure 1 (B), Figure 5 (C-F), Figure 6 (A & B) Figure 1 (B) shows the proportion of cells of a given category in the dataset at the four stages during embryonic development at which embryos were sampled. This figure choose to represent these datasets discretely using a stacked bar chart. Whereas the other two figures with plots of this kind choose to use an area plot. The Carnegie stages are ordinal data, as they occur sequential in a fixed temporal order. Both representation are potentially valid choices for this underlying data but an area plot should generally be used when the value on the x axis is continuous not discrete. Connecting the stages with a line or area implies a coupling between values at the different points, in this case a temporal one. However if you have time on the x axis a value on the y and you draw a line between them then this creates an expectation that the gradient of the line should be interpretable as an rate of change in the value from interpolating between the two time points. Thus the x axis should be on a continuous scale not a discrete scale if you are using line or area plots. Consequently it could be argued that figure 1 (B) therefore is the more correct choice to represent this data.

However it might be useful to be able to convey an impression of the rate of change over time and the spacing in time between these developmental stages so how can we improve on the area plot so that the gradients interpolated between the sampled stages are not misleading?

We have eyeballed the percentages in our re-creation of figure 1 (B) as the exact data underlying this plot was immediately available.

Based on this Timeline human development I have approximated the range of time in days post fertilisation covered by the Carnegie stages sampled here.

nstages <- c(12, 14, 17, 19)
fstages <- 1:4
names(fstages) <- paste0("CS", nstages)
#stage_day_ranges <-  list(12 = c(25,27), 14=c(32), 17=c(41,43), 19=c(48))
csbegin <- c(24.5, 31.5, 40.5, 47.5)
csend <- c(27.5, 32.5, 43.5, 48.5)

#dpf_equivalent <- c(25,32,41,48) # begin
dpf_equivalent <- c(26,32,42,48) # mid

stages <- factor(paste0("CS", nstages), ordered = TRUE)

types <- c(
    "CNS Progenitor", "CNS Neuron", "PNS Progenitor", "PNS Neuron",
    "Mesoderm", "Skin", "Blood", "Other"
)
types <- factor(types, levels = types)

approximate_proportions <- tibble(
    stage = rep(stages, each = length(types)),
    nstage = rep(nstages, each = length(types)),
    fstage = rep(fstages, each = length(types)),
    dpf = rep(dpf_equivalent, each = length(types)),
    csbegin = rep(csbegin, each = length(types)),
    csend = rep(csend, each = length(types)),
    type = rep(types, length(stages)),
    prop = c(
        # Eyeballed from the original figure!
        37.5, 5, 7, 0, 45, 1, 1, 3.5,
        22, 15, 7, 6, 47, 1, 1, 1,
        15, 16, 9, 7, 45, 2, 6, 0,
        24, 46, 8, 10, 9, 0, 3, 0
    )
) %>%
    group_by(stage) %>%
    arrange(stage, prop) %>%
    mutate(cprop = cumsum(prop))

approximate_proportions %>% 
    gt() %>%
    tab_style(style = cell_text(color = "#000000"), locations = cells_body())
nstage fstage dpf csbegin csend type prop cprop
CS12
12 1 26 24.5 27.5 PNS Neuron 0.0 0.0
12 1 26 24.5 27.5 Skin 1.0 1.0
12 1 26 24.5 27.5 Blood 1.0 2.0
12 1 26 24.5 27.5 Other 3.5 5.5
12 1 26 24.5 27.5 CNS Neuron 5.0 10.5
12 1 26 24.5 27.5 PNS Progenitor 7.0 17.5
12 1 26 24.5 27.5 CNS Progenitor 37.5 55.0
12 1 26 24.5 27.5 Mesoderm 45.0 100.0
CS14
14 2 32 31.5 32.5 Skin 1.0 1.0
14 2 32 31.5 32.5 Blood 1.0 2.0
14 2 32 31.5 32.5 Other 1.0 3.0
14 2 32 31.5 32.5 PNS Neuron 6.0 9.0
14 2 32 31.5 32.5 PNS Progenitor 7.0 16.0
14 2 32 31.5 32.5 CNS Neuron 15.0 31.0
14 2 32 31.5 32.5 CNS Progenitor 22.0 53.0
14 2 32 31.5 32.5 Mesoderm 47.0 100.0
CS17
17 3 42 40.5 43.5 Other 0.0 0.0
17 3 42 40.5 43.5 Skin 2.0 2.0
17 3 42 40.5 43.5 Blood 6.0 8.0
17 3 42 40.5 43.5 PNS Neuron 7.0 15.0
17 3 42 40.5 43.5 PNS Progenitor 9.0 24.0
17 3 42 40.5 43.5 CNS Progenitor 15.0 39.0
17 3 42 40.5 43.5 CNS Neuron 16.0 55.0
17 3 42 40.5 43.5 Mesoderm 45.0 100.0
CS19
19 4 48 47.5 48.5 Skin 0.0 0.0
19 4 48 47.5 48.5 Other 0.0 0.0
19 4 48 47.5 48.5 Blood 3.0 3.0
19 4 48 47.5 48.5 PNS Progenitor 8.0 11.0
19 4 48 47.5 48.5 Mesoderm 9.0 20.0
19 4 48 47.5 48.5 PNS Neuron 10.0 30.0
19 4 48 47.5 48.5 CNS Progenitor 24.0 54.0
19 4 48 47.5 48.5 CNS Neuron 46.0 100.0

Recreating figure 1 (B)

colour_key_orig <- c(
    "#1B9D77","#DA5F02","#7570B3","#E7298B",
    "#DFC27E","#BE802C","#8B5113","#D8D8D8"
)
names(colour_key_orig) <- types
cell_type_props_by_cs_bar <- 
approximate_proportions %>% 
    ggplot(aes(x = stage)) + 
    geom_col(aes(y = prop, fill = type)) +
    theme_minimal() +
    labs(
        x = "Carnegie stage", y = "% Cell Type", fill = "Cell Type"
    )

cell_type_props_by_cs_bar_orig_clr <- cell_type_props_by_cs_bar +
    scale_fill_manual(values = colour_key_orig)

cell_type_props_by_cs_bar_orig_clr

Rethinking colours

Starting with the original colour key, this perfectly fine and CNS/PNS neurons/progenitors have highly distinguishable colours in this palette. It might be interesting to use the same hue for CNS vs PNS and vary the lightness/saturation to denote progenitor vs neuron aspect. Using a lighter colour to denote the less differentiated progenitor state and a darker represent a more mature state. This palette is also reasonably distinct from those used in the other figures.

let’s take a look at how the original performs in reduced colour space to attempt to assess its accessibility to those with limited colour vision, using the very handy {colorblindr} package.

cell_type_props_by_cs_bar_orig_clr %>% cvd_grid()

Not bad they are fairly distinguishable, though not great when full desaturated.

colour_key <- c(
    # oranges are too close to the browns in skin/medoderm/blood
    # "#E589B9", "#E7298B", "#DA5F02", "#E59050",
    # Blues & greens are quite similar to the palette used in the other figures
    # for more granular cell identities so best to have something distinct
    # "#a6cee3", "#1f78b4", "#b2df8a", "#33a02c",
    # light and dark versions of the same hues of pink & purple 
    # don't overlap too much with the other palettes and differentiate quite 
    # well from each other in full spectrum colour but don't quite do as well
    # in reduced spectrum so starting with green and purple and tweaking the
    # lightness & saturation gave better results in a reduced spectrum
    #"#E589B9", "#E7298B",
    "#1B9D77", "#14664D",
    "#9B99BF", "#7570B3",
    # Keeping originals for mesoderm/skin/blood/other
    "#DFC27E","#BE802C","#8B5113","#D8D8D8"
)
names(colour_key) <- types

Checkout the code comments above for some notes on the experimentation we did. Here’s a new take

cell_type_props_by_cs_bar_new_clr <- cell_type_props_by_cs_bar +
    scale_fill_manual(values = colour_key)

cell_type_props_by_cs_bar_new_clr

cell_type_props_by_cs_bar_new_clr %>% cvd_grid()

I think that this version also looks better in the reduced colour ranges. The apparent improved performance in the desaturated might be a bit of an artifact of the order of the bars. The CNS, PNS groupings by hue are also retained quite well across the different versions.

Figure 1 (B) in area plot style

ggplot directs you away from creating figures of the form of 5 (C-F), & 6 (A & B) as the area geom is the continuous form of the stacked bar chart and will not accept a discrete set of values for x so a continuous version with a fixed interval made and relabeled appropriately must be used to get this effect.

approximate_proportions %>% 
    ggplot(aes(x = fstage)) + 
    geom_area(aes(y = prop, fill = type)) +
    geom_vline(aes(xintercept = fstage), linetype = "dashed") + 
    geom_line(aes(y = prop, group = type), position = "stack") +
    # geom_text(aes(x = fstage + 0.2, y = 105, label = stage)) +
    scale_x_continuous(labels = names(fstages)) + 
    scale_fill_manual(values = colour_key) +
    labs(x = "Stage", y = "% cell type", fill = "Cell Type") +
    #geom_point(aes(y = prop, group = type), position = "stack")
    theme_minimal()

With the Carnegie stages as a fixed time interval

This is not optimal as the Carnegie stages are not of constant duration so this representation still has some of the same problem we had when we where just joining the points that we sampled. The gradient will still not be reflective of an interpolated rate of change over time.

approximate_proportions %>% 
    ggplot(aes(x = nstage)) + 
    geom_area(aes(y = prop, fill = type)) +
    geom_vline(aes(xintercept = nstage), linetype = "dashed") + 
    geom_line(aes(y = prop, group = type), position = "stack") +
    geom_text(aes(x = nstage + 0.5, y = 105, label = stage)) +
    scale_fill_manual(values = colour_key) +
    labs(x = "Carnegie stage", y = "% Cell Type", fill = "Cell Type") +
    #geom_point(aes(y = prop, group = type), position = "stack")
    theme_minimal()

With days post fertilisation at which the Carnegie stages occur

This approach mostly solves our gradient issue.

approximate_proportions %>% 
    ggplot(aes(x = dpf)) + 
    geom_area(aes(y = prop, fill = type)) +
    geom_vline(aes(xintercept = dpf), linetype = "dashed") + 
    geom_line(aes(y = prop, group = type), position = "stack") +
    geom_text(aes(x = dpf + 1.5, y = 105, label = stage)) +
    scale_fill_manual(values = colour_key) +
    labs(x = "d.p.f.", y = "% Cell Type", fill = "Cell Type") +
    #geom_point(aes(y = prop, group = type), position = "stack")
    theme_minimal()

With days post fertilisation and Carnegie stages intervals

The Carnegie stages are not instantaneous points in time. They are defined by developmental milestones which are typically reached at a given point in time. Thus if you do want to represent time continuously then it might be a good idea to indicate the uncertainty in time associated with a Carnegie stage, this version of the plot illustrates the windows in time associated with the stage.

This is especially applicable if you staged the embryos that you are sampling using developmental milestones to estimate their age, a opposed to other methods of estimating their age.

cell_type_props_by_dpf_area_cs_ranges <- approximate_proportions %>% 
    ggplot(aes(x = dpf)) + 
    #geom_col(aes(y = prop, fill = type))
    geom_area(aes(y = prop, fill = type)) +
    #geom_vline(aes(xintercept = csbegin), linetype = "dashed") + 
    #geom_vline(aes(xintercept = csend), linetype = "dashed") + 
    geom_vline(aes(xintercept = dpf), linetype = "dashed") + 
    geom_line(aes(y = prop, group = type), position = "stack", size = 0.1) +
    #geom_text(aes(x = (csbegin + csend)/2, y = 107, label = stage, angle = 90)) +
    #geom_point(aes(y = prop, group = type), position = "stack", shape = 4) +
    geom_rect(
        aes(xmin = csbegin , xmax = csend, ymin = -20, ymax = 105),
        fill = "grey", alpha = 0.05
    ) +
    scale_fill_manual(values = colour_key) +
    
    labs(x = "d.p.f.", y = "% Cell Type", fill = "Cell Type") +
    theme_minimal()

cell_type_props_by_dpf_area_cs_ranges + 
    geom_text(aes(x = dpf + 0.5, y = -10, label = stage), angle = 90)

It also remains quite effective in a small form factor that might be required in a multi-panel figure (These are often unavoidable even if we should stop using them Richard’s View.)

cell_type_props_by_dpf_area_cs_ranges + 
    geom_text(aes(x = dpf + 0.6, y = -10, label = stage), angle = 90, size = 2.5)

This version also seems to work well in reduced colour

{cell_type_props_by_dpf_area_cs_ranges + 
    geom_text(aes(x = dpf + 0.6, y = -10, label = stage), angle = 90, size = 2.5)} %>% 
    cvd_grid()

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] gt_0.10.1         colorblindr_0.1.0 colorspace_2.0-3  cowplot_1.1.1    
[5] ggplot2_3.3.6     tidyr_1.2.0       dplyr_1.1.4      

loaded via a namespace (and not attached):
 [1] gtable_0.3.0      jsonlite_1.8.0    compiler_4.3.1    renv_0.15.5      
 [5] tidyselect_1.2.0  xml2_1.3.6        stringr_1.4.0     scales_1.3.0     
 [9] yaml_2.3.5        fastmap_1.1.0     R6_2.5.1          labeling_0.4.2   
[13] generics_0.1.2    knitr_1.39        htmlwidgets_1.6.4 tibble_3.2.1     
[17] munsell_0.5.0     pillar_1.9.0      rlang_1.1.0       utf8_1.2.2       
[21] stringi_1.7.6     xfun_0.38         sass_0.4.8        cli_3.6.2        
[25] withr_2.5.0       magrittr_2.0.3    digest_0.6.29     grid_4.3.1       
[29] lifecycle_1.0.3   vctrs_0.6.5       evaluate_0.15     glue_1.6.2       
[33] farver_2.1.0      fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4      
[37] tools_4.3.1       pkgconfig_2.0.3   htmltools_0.5.7  

References

Schmied, Christopher, Michael S. Nelson, Sergiy Avilov, Gert-Jan Bakker, Cristina Bertocchi, Johanna Bischof, Ulrike Boehm, et al. 2023. “Community-Developed Checklists for Publishing Images and Image Analyses.” Nature Methods, September. https://doi.org/10.1038/s41592-023-01987-9.