We will be focusing our attention on Figure 7 E & F from this paper (Xu et al. 2024). This serves as something of a pretext for discussing a challenging problem: how to produce a useful visual summary of the results of functional enrichment analyses.
Figures 7 E & F do not do a great job of highlighting the insights that the authors drew from the underlying data analysis. In their original context, the easyGSEA dashboard on eVITTA referenced by the authors, these graphics make a little more sense: there they are interactive graphics with tooltips, and clicking on the bar for a term exposes further information. They are one of a number of visuals with which you can explore this data, so they do not have to do as much heavy lifting on their own. Designing a graph to accommodate arbitrary, dynamically generated contents is a harder problem than designing a one-off visual, especially where data ranges can vary quite dramatically. Generally you have to make certain compromises to achieve generality in a graph. That said, there are some critiques which still apply even with the benefit of this context.
The y-axis tick labels are truncated, limiting the ability to discern which terms were enriched.
“BP” and “KEGG” lead all of these labels. This information is redundant and could be moved into the axis label or title so that it only needs to be mentioned once.
There is no discernible variation in the colour scale for \(-\log_{10}(p\text{-value})\). It may be that the scale covers the full range of variation in the p-value and, as a result, there is minimal variation among the top results. For the colour to be useful in discerning differences in p-value among the top results, it needs to be re-scaled to a range suitable for the visible points.
Overlapping transparent objects in conjunction with a continuous colour scale is not a good idea, as the blending produces colours which are not meaningful on the scale.
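The colour re-scaling point can be illustrated with a minimal base R sketch. The values, the helper names `rescale01` and `to_hex`, and the white-to-dark-red ramp are all hypothetical choices for illustration; they are not taken from the paper or from easyGSEA.

```r
# Hypothetical -log10(p) values: the full result set spans a wide range,
# but the plotted top terms occupy only a narrow slice of it.
neg_log10_p_all <- c(0.1, 0.5, 1.2, 2.0, 9.0, 9.4, 9.7, 10.0)
neg_log10_p_top <- tail(sort(neg_log10_p_all), 4)

# Linearly map values from a reference range onto [0, 1].
rescale01 <- function(x, from = range(x)) {
  (x - from[1]) / (from[2] - from[1])
}

# Turn positions in [0, 1] into hex colours on a white-to-dark-red ramp.
ramp <- grDevices::colorRamp(c("white", "darkred"))
to_hex <- function(u) grDevices::rgb(ramp(u), maxColorValue = 255)

# Scaled against the full range, the top terms use about a tenth of the ramp.
pos_full_range <- rescale01(neg_log10_p_top, from = range(neg_log10_p_all))
cols_full_range <- to_hex(pos_full_range)

# Scaled against the visible values only, they use the whole ramp.
pos_visible_range <- rescale01(neg_log10_p_top)
cols_visible_range <- to_hex(pos_visible_range)
```

Comparing `diff(range(pos_full_range))` with `diff(range(pos_visible_range))` shows how little of the colour ramp the visible terms occupy in the first case.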
There is a problem of emphasis in these plots. The terms are arguably the most important thing for people to be able to read and interpret. Yet the terms are buried in strings together with the ‘aspect’ or source of the term and the term’s identifier. The parts are separated by underscores, the terms do not start at a predictable, aligned position, and they are sometimes truncated. The direction is for the most part as important as the terms, as knowing the direction of change of a term can also be important for its interpretation. The specific significance, score, and rank order of terms are less crucial; interpretation of functional enrichment results is a fairly qualitative process.
The ID of the term is very helpful for anyone doing follow up to weigh other factors in how to interpret these functional enrichment results but is only needed for reference and it need not feature prominently. (Having it sometimes truncated is not optimal for the ability to look up the term.)
Other factors that are potentially relevant for interpretation are:
The size of the term
The fraction of genes in the term
The overlap of genes between terms and/or the relationships between terms
A smaller term in which all, or a very high fraction, of the genes show a difference might be more biologically interesting to follow up than a larger term in which only a substantial fraction of genes show a change. This can be true even when the smaller term is less significant, because its small size imposes an effective ‘limit of resolution’ on the significance it can reach.
The latter point is perhaps the hardest to capture in a summary.
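As a rough illustration of the factors listed above, here is a minimal base R sketch that computes the fraction of a term's genes showing a change and the overlap between two terms. The gene identifiers are hypothetical, and the Jaccard index is just one common overlap measure chosen for this example, not something prescribed by the analysis discussed here.

```r
# Hypothetical gene sets for two terms, plus the genes that showed a change.
term_a <- c("g1", "g2", "g3", "g4")
term_b <- c("g3", "g4", "g5")
changed_genes <- c("g1", "g2", "g3")

# Fraction of a term's genes that are among the changed genes.
fraction_changed <- function(term, changed) {
  length(intersect(term, changed)) / length(term)
}

# Jaccard index: shared genes divided by all genes in either term.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

frac_a <- fraction_changed(term_a, changed_genes)
frac_b <- fraction_changed(term_b, changed_genes)
overlap_ab <- jaccard(term_a, term_b)
```

Columns like these could sit alongside term size in a summary table, giving readers some of the context needed to weigh a small, heavily affected term against a large, partially affected one.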
When assessing a single functional enrichment analysis, and not attempting to compare the enrichment of multiple different gene sets, the case can be made that a table is a preferable form of visual representation for this information than a graph. We are asking: ‘Which subsets of a set of genes that differ from some single point of comparison, by varying degrees, stand out in some way?’. Which subsets we look at, how we quantify the degree of difference from some baseline, and what we mean by ‘stand out’ can all vary, but we are starting from a single vector of functional elements, be they genes, transcripts, or proteins, with a single metric, be it their rank order or some score such as a p-value or \(\log_2\) fold change. Thus a single table of terms, with columns to represent various parameters of these sets, can potentially provide us with a more efficient and effective representation.
For anyone thinking about the design of tables for a scientific publication, a good resource is chapter 2 of Maarten Boers’ excellent book: “Data Visualization for Biomedical Scientists” (Boers 2022).
A good case can be made that gene set enrichment analysis results lend themselves more to table and table-like representations than graphs.
There are a number of reasons that a table or a hybrid table/graph may be better suited to this data than a purely graphical representation.
The key information is the term name, which is text, often quite long and quite varied in length.
Rank order, precision, and multiple variables for each term can all provide important context for interpretation.
How big the terms are.
Which genes are in the leading edge set (in GSEA), or which differentially expressed genes are in the over-represented terms (in other types of functional enrichment analysis).
The overlap between the genes in one term and another. Broad terms can be parents of narrower terms and be enriched as a result of changes in the same or different sets of genes.
Data underpinning these figure panels is available directly from figshare, great!
```r
# Download the supplementary tables unless they are already present locally
data_file_urls <- c(
  "table_s4.csv" = "https://ndownloader.figshare.com/files/44137139",
  "table_s5.csv" = "https://ndownloader.figshare.com/files/44137145"
)
if (!all(fs::file_exists(names(data_file_urls)))) {
  purrr::iwalk(data_file_urls, ~ download.file(.x, .y))
}
```
Tables in R
R has lots of options for packages that format tables; some examples include gt, kable, rhandsontable, DT, reactable, and flextable.
These have varying degrees of support for use in printed vs web-based outputs. gt and kable tables have good print support and also work in web outputs; gt is a little better at being consistent between different output formats. rhandsontable is somewhat spreadsheet-like and can be good in shiny apps with complex interactive tabular inputs. DT and reactable have good interactive sorting, searching, and filtering capabilities which can run client-side in JavaScript, providing quite a lot of interactivity from a static web page as long as the data is not too big. However, reactable is perhaps a little better documented for R users, and performing some of the more advanced customisations with DT requires more knowledge of JS/CSS.
flextable has quite nice syntax, somewhat reminiscent of ggplot2. It is better at static than dynamic tables, in contrast to DT, reactable, and rhandsontable. R-native visual customisations are also easier than in DT or reactable, where JS/CSS knowledge is helpful. It makes it very easy to embed arbitrary R plots into cells of your table, so making custom ‘sparkline’ style plots with ggplot2 and adding them to your tables is easy. It also plays very well with embedding tables into other larger R graphics using grid.
```r
library(magrittr) # provides the %>% pipe used below

#' process_gsea_table
#'
#' Function for gene set enrichment analysis data processing
#'
#' @param path the path to a csv file with GSEA data
#'
process_gsea_table <- function(path) {
  path %>%
    # Send the path to the read comma separated values function,
    # hiding the column type guesses
    readr::read_csv(show_col_types = FALSE) %>%
    # Split the pathway column into three new columns: aspect, term, and id.
    # The regex has three groups (one per column):
    #   group 1: one or more word characters, non-greedily, then an underscore
    #   group 2: zero or more of any character, non-greedily, then a % sign
    #   group 3: one or more word characters
    tidyr::extract(
      pathway,
      into = c("aspect", "term", "id"),
      regex = "(\\w+?)_(.*?)%(\\w+)",
      remove = FALSE
    ) %>%
    dplyr::mutate(
      # Convert the type of size to an integer
      size = as.integer(size),
      # Replace all underscores in the term with spaces
      term = gsub("_", " ", term),
      # Split the genes string into a list of genes on the ; character
      Genes = strsplit(Genes, ";"),
      # Count the number of genes in each list
      n_genes_in_leading_edge_subset = purrr::map_int(Genes, length),
      # If NES is positive set direction to Up, otherwise to Down
      direction = dplyr::if_else(sign(NES) == 1, "Up", "Down")
    )
}
```
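To see what the pattern passed to `tidyr::extract` does in isolation, here is a small base R sketch that applies the same regular expression to a single string. The example pathway string is hypothetical, made up to fit the pattern rather than copied from the supplementary tables, and base R's `regexec` is used instead of tidyr purely so the snippet has no package dependencies.

```r
# The same pattern as in process_gsea_table
pattern <- "(\\w+?)_(.*?)%(\\w+)"

# Hypothetical example of a combined pathway label
pathway <- "KEGG_Protein_processing_in_endoplasmic_reticulum%cel04141"

# regexec returns the full match followed by one entry per capture group
parts <- regmatches(pathway, regexec(pattern, pathway, perl = TRUE))[[1]]
names(parts) <- c("match", "aspect", "term", "id")

# Tidy the term for display, as in the mutate step
parts[["term"]] <- gsub("_", " ", parts[["term"]])

parts[c("aspect", "term", "id")]
```

Because the first group is non-greedy, it stops at the first underscore, which is what separates the aspect from the rest of the term even though the term itself contains further underscores.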
As a rule of thumb it is good to use proportional fonts in your tables; they are generally easier to read than monospace fonts, where all the characters are the same width. However, some proportional fonts also have digits which vary in width. This can cause problems with the alignment of numbers in tables, which should line up on the decimal point.
Numerical formatting rules in some table packages can be somewhat limited, so to produce correct alignments it may be necessary to convert numerical values to strings, which you can space properly with more powerful numerical formatting functions like R’s sprintf. This, however, has the disadvantage that the table library no longer sees these columns as numerical values, so it cannot do things like sorting them interactively on a web page. Even table libraries with extensive numerical formatting options frequently fail to align their formatted results correctly around the decimal point. Annoyingly, it may be necessary in these cases to compromise on correct alignment to get these interactive features, until such time as these alignment deficiencies are fixed in the upstream table packages. gt does have decimal alignment in versions >= v0.8.0, but this is not the default behaviour, which I would contend it should be. I have not found a better solution for DT.
Unfortunately, some web-based tables also have a habit of ignoring leading white-space characters, such as spaces, which you may be relying upon to align your numbers correctly. Setting the CSS properties font-family to monospace and white-space to pre for the relevant columns of your table gets around these issues, as an alignment based on the character widths in a string should then be reliably spaced when rendered on the web page.
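The string-padding approach described above can be sketched with base R's `sprintf`. The values and the chosen width are hypothetical; the idea is only that a fixed total width and a fixed number of decimal places put the decimal point at the same character position in every string, which then lines up when the column is rendered in a monospace font with white-space preserved.

```r
# Hypothetical numeric column with values of differing magnitude
values <- c(3.14159, 12.5, 120.04, 0.007)

# Total width of 8 characters, 3 digits after the decimal point,
# left-padded with spaces
aligned <- sprintf("%8.3f", values)

# Position of the decimal point within each formatted string
decimal_positions <- regexpr(".", aligned, fixed = TRUE)

aligned
```

The trade-off noted above still applies: `aligned` is a character vector, so a table library given this column would sort it as text rather than as numbers.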
Boers, Maarten. 2022. Data Visualization for Biomedical Scientists: Creating Tables and Graphs That Work. Amsterdam: VU University Press.
Xu, Jiaming, Brendil Sabatino, Junran Yan, Glafira Ermakova, Kelsie R S Doering, and Stefan Taubert. 2024. “The Unfolded Protein Response of the Endoplasmic Reticulum Protects Caenorhabditis Elegans Against DNA Damage Caused by Stalled Replication Forks.” Edited by S Lee. G3: Genes, Genomes, Genetics, January. https://doi.org/10.1093/g3journal/jkae017.