Code
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(ggplot2)
library(cowplot)
library(colorspace)
library(colorblindr)
library(gt)
})
paper doi: 10.1242/dev.199711
As in the previous post covering this paper we are focused on figure 1 and will bring in some related points from the rest of the paper.
Some general comments before we move on to the main graphic we chose to focus on figure 1 B. Overall we liked the data visualisation in this paper, it was clean, clear and well ordered. We have some nitpicks about some minor points that we thought were opportunities for improvement. We thought that it was a nice touch using the diagram in figure 2 B as the colour key, however some aspects of this were not entirely clear. The shades of blue and red/orange used in the diagram and the plots were an imperfect match and there was some ambiguity to the meaning of the different shades of grey, as well as between RP & FP. A supplementary colour key with the 7 colours used in figure 2 A & C alongside the diagram in figure 2 B would have provided a bit more clarity.
Some other small suggestions:
Consider the following figures: Figure 1 (B), Figure 5 (C-F), Figure 6 (A & B) Figure 1 (B) shows the proportion of cells of a given category in the dataset at the four stages during embryonic development at which embryos were sampled. This figure choose to represent these datasets discretely using a stacked bar chart. Whereas the other two figures with plots of this kind choose to use an area plot. The Carnegie stages are ordinal data, as they occur sequential in a fixed temporal order. Both representation are potentially valid choices for this underlying data but an area plot should generally be used when the value on the x axis is continuous not discrete. Connecting the stages with a line or area implies a coupling between values at the different points, in this case a temporal one. However if you have time on the x axis a value on the y and you draw a line between them then this creates an expectation that the gradient of the line should be interpretable as an rate of change in the value from interpolating between the two time points. Thus the x axis should be on a continuous scale not a discrete scale if you are using line or area plots. Consequently it could be argued that figure 1 (B) therefore is the more correct choice to represent this data.
However it might be useful to be able to convey an impression of the rate of change over time and the spacing in time between these developmental stages so how can we improve on the area plot so that the gradients interpolated between the sampled stages are not misleading?
We have eyeballed the percentages in our re-creation of figure 1 (B) as the exact data underlying this plot was immediately available.
Based on this Timeline human development I have approximated the range of time in days post fertilisation covered by the Carnegie stages sampled here.
nstages <- c(12, 14, 17, 19)
fstages <- 1:4
names(fstages) <- paste0("CS", nstages)
#stage_day_ranges <- list(12 = c(25,27), 14=c(32), 17=c(41,43), 19=c(48))
csbegin <- c(24.5, 31.5, 40.5, 47.5)
csend <- c(27.5, 32.5, 43.5, 48.5)
#dpf_equivalent <- c(25,32,41,48) # begin
dpf_equivalent <- c(26,32,42,48) # mid
stages <- factor(paste0("CS", nstages), ordered = TRUE)
types <- c(
"CNS Progenitor", "CNS Neuron", "PNS Progenitor", "PNS Neuron",
"Mesoderm", "Skin", "Blood", "Other"
)
types <- factor(types, levels = types)
approximate_proportions <- tibble(
stage = rep(stages, each = length(types)),
nstage = rep(nstages, each = length(types)),
fstage = rep(fstages, each = length(types)),
dpf = rep(dpf_equivalent, each = length(types)),
csbegin = rep(csbegin, each = length(types)),
csend = rep(csend, each = length(types)),
type = rep(types, length(stages)),
prop = c(
# Eyeballed from the original figure!
37.5, 5, 7, 0, 45, 1, 1, 3.5,
22, 15, 7, 6, 47, 1, 1, 1,
15, 16, 9, 7, 45, 2, 6, 0,
24, 46, 8, 10, 9, 0, 3, 0
)
) %>%
group_by(stage) %>%
arrange(stage, prop) %>%
mutate(cprop = cumsum(prop))
approximate_proportions %>%
gt() %>%
tab_style(style = cell_text(color = "#000000"), locations = cells_body())
nstage | fstage | dpf | csbegin | csend | type | prop | cprop |
---|---|---|---|---|---|---|---|
CS12 | |||||||
12 | 1 | 26 | 24.5 | 27.5 | PNS Neuron | 0.0 | 0.0 |
12 | 1 | 26 | 24.5 | 27.5 | Skin | 1.0 | 1.0 |
12 | 1 | 26 | 24.5 | 27.5 | Blood | 1.0 | 2.0 |
12 | 1 | 26 | 24.5 | 27.5 | Other | 3.5 | 5.5 |
12 | 1 | 26 | 24.5 | 27.5 | CNS Neuron | 5.0 | 10.5 |
12 | 1 | 26 | 24.5 | 27.5 | PNS Progenitor | 7.0 | 17.5 |
12 | 1 | 26 | 24.5 | 27.5 | CNS Progenitor | 37.5 | 55.0 |
12 | 1 | 26 | 24.5 | 27.5 | Mesoderm | 45.0 | 100.0 |
CS14 | |||||||
14 | 2 | 32 | 31.5 | 32.5 | Skin | 1.0 | 1.0 |
14 | 2 | 32 | 31.5 | 32.5 | Blood | 1.0 | 2.0 |
14 | 2 | 32 | 31.5 | 32.5 | Other | 1.0 | 3.0 |
14 | 2 | 32 | 31.5 | 32.5 | PNS Neuron | 6.0 | 9.0 |
14 | 2 | 32 | 31.5 | 32.5 | PNS Progenitor | 7.0 | 16.0 |
14 | 2 | 32 | 31.5 | 32.5 | CNS Neuron | 15.0 | 31.0 |
14 | 2 | 32 | 31.5 | 32.5 | CNS Progenitor | 22.0 | 53.0 |
14 | 2 | 32 | 31.5 | 32.5 | Mesoderm | 47.0 | 100.0 |
CS17 | |||||||
17 | 3 | 42 | 40.5 | 43.5 | Other | 0.0 | 0.0 |
17 | 3 | 42 | 40.5 | 43.5 | Skin | 2.0 | 2.0 |
17 | 3 | 42 | 40.5 | 43.5 | Blood | 6.0 | 8.0 |
17 | 3 | 42 | 40.5 | 43.5 | PNS Neuron | 7.0 | 15.0 |
17 | 3 | 42 | 40.5 | 43.5 | PNS Progenitor | 9.0 | 24.0 |
17 | 3 | 42 | 40.5 | 43.5 | CNS Progenitor | 15.0 | 39.0 |
17 | 3 | 42 | 40.5 | 43.5 | CNS Neuron | 16.0 | 55.0 |
17 | 3 | 42 | 40.5 | 43.5 | Mesoderm | 45.0 | 100.0 |
CS19 | |||||||
19 | 4 | 48 | 47.5 | 48.5 | Skin | 0.0 | 0.0 |
19 | 4 | 48 | 47.5 | 48.5 | Other | 0.0 | 0.0 |
19 | 4 | 48 | 47.5 | 48.5 | Blood | 3.0 | 3.0 |
19 | 4 | 48 | 47.5 | 48.5 | PNS Progenitor | 8.0 | 11.0 |
19 | 4 | 48 | 47.5 | 48.5 | Mesoderm | 9.0 | 20.0 |
19 | 4 | 48 | 47.5 | 48.5 | PNS Neuron | 10.0 | 30.0 |
19 | 4 | 48 | 47.5 | 48.5 | CNS Progenitor | 24.0 | 54.0 |
19 | 4 | 48 | 47.5 | 48.5 | CNS Neuron | 46.0 | 100.0 |
cell_type_props_by_cs_bar <-
approximate_proportions %>%
ggplot(aes(x = stage)) +
geom_col(aes(y = prop, fill = type)) +
theme_minimal() +
labs(
x = "Carnegie stage", y = "% Cell Type", fill = "Cell Type"
)
cell_type_props_by_cs_bar_orig_clr <- cell_type_props_by_cs_bar +
scale_fill_manual(values = colour_key_orig)
cell_type_props_by_cs_bar_orig_clr
Starting with the original colour key, this perfectly fine and CNS/PNS neurons/progenitors have highly distinguishable colours in this palette. It might be interesting to use the same hue for CNS vs PNS and vary the lightness/saturation to denote progenitor vs neuron aspect. Using a lighter colour to denote the less differentiated progenitor state and a darker represent a more mature state. This palette is also reasonably distinct from those used in the other figures.
let’s take a look at how the original performs in reduced colour space to attempt to assess its accessibility to those with limited colour vision, using the very handy {colorblindr
} package.
Not bad they are fairly distinguishable, though not great when full desaturated.
colour_key <- c(
# oranges are too close to the browns in skin/medoderm/blood
# "#E589B9", "#E7298B", "#DA5F02", "#E59050",
# Blues & greens are quite similar to the palette used in the other figures
# for more granular cell identities so best to have something distinct
# "#a6cee3", "#1f78b4", "#b2df8a", "#33a02c",
# light and dark versions of the same hues of pink & purple
# don't overlap too much with the other palettes and differentiate quite
# well from each other in full spectrum colour but don't quite do as well
# in reduced spectrum so starting with green and purple and tweaking the
# lightness & saturation gave better results in a reduced spectrum
#"#E589B9", "#E7298B",
"#1B9D77", "#14664D",
"#9B99BF", "#7570B3",
# Keeping originals for mesoderm/skin/blood/other
"#DFC27E","#BE802C","#8B5113","#D8D8D8"
)
names(colour_key) <- types
Checkout the code comments above for some notes on the experimentation we did. Here’s a new take
I think that this version also looks better in the reduced colour ranges. The apparent improved performance in the desaturated might be a bit of an artifact of the order of the bars. The CNS, PNS groupings by hue are also retained quite well across the different versions.
ggplot directs you away from creating figures of the form of 5 (C-F), & 6 (A & B) as the area geom is the continuous form of the stacked bar chart and will not accept a discrete set of values for x so a continuous version with a fixed interval made and relabeled appropriately must be used to get this effect.
approximate_proportions %>%
ggplot(aes(x = fstage)) +
geom_area(aes(y = prop, fill = type)) +
geom_vline(aes(xintercept = fstage), linetype = "dashed") +
geom_line(aes(y = prop, group = type), position = "stack") +
# geom_text(aes(x = fstage + 0.2, y = 105, label = stage)) +
scale_x_continuous(labels = names(fstages)) +
scale_fill_manual(values = colour_key) +
labs(x = "Stage", y = "% cell type", fill = "Cell Type") +
#geom_point(aes(y = prop, group = type), position = "stack")
theme_minimal()
This is not optimal as the Carnegie stages are not of constant duration so this representation still has some of the same problem we had when we where just joining the points that we sampled. The gradient will still not be reflective of an interpolated rate of change over time.
approximate_proportions %>%
ggplot(aes(x = nstage)) +
geom_area(aes(y = prop, fill = type)) +
geom_vline(aes(xintercept = nstage), linetype = "dashed") +
geom_line(aes(y = prop, group = type), position = "stack") +
geom_text(aes(x = nstage + 0.5, y = 105, label = stage)) +
scale_fill_manual(values = colour_key) +
labs(x = "Carnegie stage", y = "% Cell Type", fill = "Cell Type") +
#geom_point(aes(y = prop, group = type), position = "stack")
theme_minimal()
This approach mostly solves our gradient issue.
approximate_proportions %>%
ggplot(aes(x = dpf)) +
geom_area(aes(y = prop, fill = type)) +
geom_vline(aes(xintercept = dpf), linetype = "dashed") +
geom_line(aes(y = prop, group = type), position = "stack") +
geom_text(aes(x = dpf + 1.5, y = 105, label = stage)) +
scale_fill_manual(values = colour_key) +
labs(x = "d.p.f.", y = "% Cell Type", fill = "Cell Type") +
#geom_point(aes(y = prop, group = type), position = "stack")
theme_minimal()
The Carnegie stages are not instantaneous points in time. They are defined by developmental milestones which are typically reached at a given point in time. Thus if you do want to represent time continuously then it might be a good idea to indicate the uncertainty in time associated with a Carnegie stage, this version of the plot illustrates the windows in time associated with the stage.
This is especially applicable if you staged the embryos that you are sampling using developmental milestones to estimate their age, a opposed to other methods of estimating their age.
cell_type_props_by_dpf_area_cs_ranges <- approximate_proportions %>%
ggplot(aes(x = dpf)) +
#geom_col(aes(y = prop, fill = type))
geom_area(aes(y = prop, fill = type)) +
#geom_vline(aes(xintercept = csbegin), linetype = "dashed") +
#geom_vline(aes(xintercept = csend), linetype = "dashed") +
geom_vline(aes(xintercept = dpf), linetype = "dashed") +
geom_line(aes(y = prop, group = type), position = "stack", size = 0.1) +
#geom_text(aes(x = (csbegin + csend)/2, y = 107, label = stage, angle = 90)) +
#geom_point(aes(y = prop, group = type), position = "stack", shape = 4) +
geom_rect(
aes(xmin = csbegin , xmax = csend, ymin = -20, ymax = 105),
fill = "grey", alpha = 0.05
) +
scale_fill_manual(values = colour_key) +
labs(x = "d.p.f.", y = "% Cell Type", fill = "Cell Type") +
theme_minimal()
cell_type_props_by_dpf_area_cs_ranges +
geom_text(aes(x = dpf + 0.5, y = -10, label = stage), angle = 90)
It also remains quite effective in a small form factor that might be required in a multi-panel figure (These are often unavoidable even if we should stop using them Richard’s View.)
This version also seems to work well in reduced colour
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] gt_0.10.1 colorblindr_0.1.0 colorspace_2.0-3 cowplot_1.1.1
[5] ggplot2_3.3.6 tidyr_1.2.0 dplyr_1.1.4
loaded via a namespace (and not attached):
[1] gtable_0.3.0 jsonlite_1.8.0 compiler_4.3.1 renv_0.15.5
[5] tidyselect_1.2.0 xml2_1.3.6 stringr_1.4.0 scales_1.3.0
[9] yaml_2.3.5 fastmap_1.1.0 R6_2.5.1 labeling_0.4.2
[13] generics_0.1.2 knitr_1.39 htmlwidgets_1.6.4 tibble_3.2.1
[17] munsell_0.5.0 pillar_1.9.0 rlang_1.1.0 utf8_1.2.2
[21] stringi_1.7.6 xfun_0.38 sass_0.4.8 cli_3.6.2
[25] withr_2.5.0 magrittr_2.0.3 digest_0.6.29 grid_4.3.1
[29] lifecycle_1.0.3 vctrs_0.6.5 evaluate_0.15 glue_1.6.2
[33] farver_2.1.0 fansi_1.0.3 rmarkdown_2.14 purrr_0.3.4
[37] tools_4.3.1 pkgconfig_2.0.3 htmltools_0.5.7