Core Enrichment Functions

Basic enrichment analysis involves performing a batch “dictionary lookup” for a set of genes to determine their associations. The most commonly used method is over-representation analysis (ORA). If you also have information on weights, then Gene Set Enrichment Analysis (GSEA) is another classic choice.

Enrichment of genes

This is a C++ accelerated ORA tool. The only things you need is your favourite gene symbols. If is all prepared, then load a database, run it!

Compared to the most popular clusterProfiler, the functions of EnrichGT differ slightly. This is mainly to accommodate wet lab researchers. First, most beginners are confused by the default input of clusterProfiler, which is “ENTREZ ID.” Most people familiar with biology are used to Gene Symbols, and even Ensembl IDs are not widely known, let alone a series of seemingly random numbers. Therefore, EnrichGT uses Gene Symbol as the default input, seamlessly integrating with most downstream results from various companies, making it more suitable for non-experts in the lab.

Second, clusterProfiler outputs an S4 object, which may be too complex for beginners (this is no joke); whereas EnrichGT outputs a simple table. The time of non-experts is precious, so I made these two adjustments. The only downside is that the GSEA peak plot is difficult to generate, but in reality, we focus more on NES and p-values, and in this case, bar plots are more convincing.

Furthermore, The pre-processing step of the hypergeometric test in EnrichGT’s ORA function (which determines overlap) is accelerated using hash tables in C++, making it over five times faster than clusterProfiler::enricher(), which is a pure R implementation.

res <- egt_enrichment_analysis(genes = DEGtable$Genes,
database = database_GO_BP(org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
                                         "You can add more genes"),
database = database_Reactome(org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
                                         "You can add more genes"),
database = database_from_gmt("MsigDB_Hallmark.gmt"))

Now, we load the necessary packages and example dataset, to provide you with an example.

library(dplyr)
library(tibble)
library(ggplot2)
library(org.Hs.eg.db)
library(gt)
library(EnrichGT)
library(readr)
DEGexample <- read_csv("./DEG.csv")
DEGexample_UpReg <- DEGexample |> dplyr::filter(pvalue<0.05,log2FoldChange>0.7)
Remember to load AnnotationDbi object

Most of errors are caused by this. Please load Like AnnotationDbi like org.Hs.eg.db before doing enrichment.

Then we start ORA:

ora_result <- egt_enrichment_analysis(genes = DEGexample_UpReg$...1,database = database_GO_BP(org.Hs.eg.db))
✔ success loaded database, time used : 14.9625239372253 sec.
✔ Done ORA in 0.138070106506348 sec.
ora_result |> as_tibble() # You don't need to call as_tibble, this is just for better printing
# A tibble: 2,979 × 9
   ID       Description GeneRatio BgRatio   pvalue p.adjust  qvalue geneID Count
   <chr>    <chr>       <chr>     <chr>      <dbl>    <dbl>   <dbl> <chr>  <int>
 1 GO:0035… synaptic t… 19/457    111/18… 2.06e-11  8.05e-8 8.41e-5 ATP1A…    19
 2 GO:0051… regulation… 16/457    79/188… 5.74e-11  8.05e-8 8.41e-5 ATP1A…    16
 3 GO:0050… modulation… 39/457    489/18… 7.26e-11  8.05e-8 8.41e-5 ACHE/…    39
 4 GO:0099… regulation… 39/457    490/18… 7.71e-11  8.05e-8 8.41e-5 ACHE/…    39
 5 GO:0050… synapse or… 36/457    483/18… 2.51e- 9  2.10e-6 1.11e-3 ACHE/…    36
 6 GO:0048… regulation… 12/457    56/188… 7.50e- 9  4.18e-6 1.11e-3 APOE/…    12
 7 GO:0007… mitotic sp… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
 8 GO:0071… mitotic sp… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
 9 GO:0071… spindle as… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
10 GO:0031… spindle ch… 11/457    47/188… 1.18e- 8  4.18e-6 1.11e-3 BIRC5…    11
# ℹ 2,969 more rows
Have many sources of genes?
  • Case 1: many groups of genes:

This function also support many groups of genes, you can input a list.

# For many groups of genes
res <- egt_enrichment_analysis(list(Macrophages=c("CD169","CD68","CD163"),
Fibroblast=c("COL1A2","COL1A3"),"You can add more groups"),
 database = database_from_gmt("panglaoDB.gmt"))
  • Case 2: Differential expressed genes with different direction:

In above example, if we don’t filter out the up-regulated genes for ORA, but want to combine all DEGs without selecting them according to log2FC. How can we achieve that?

From version 0.8, egt_enrichment_analysis() supports input genes using genes_with_weights() function. For example:

DEGexample <- read_csv("~/Documents/4Fun/EGTFun/DEG.csv")
DEGexample2 <- DEGexample |> dplyr::filter(pvalue<0.05)
ora_result <- egt_enrichment_analysis(genes = genes_with_weights(DEGexample2$...1,DEGexample2$log2FoldChange), database = database_GO_BP(org.Hs.eg.db))

Then, the output will automatically consider the proportion of up-regulated genes and down-regulated genes, and finally showing them in any results EnrichGT generated, like data.frames, gt table or dot plots.

For visualize, see ploting ORA results for details. You can use egt_plot_results() to view the enriched result. And you can see re-enrichment and FUSING guides for more further analysis.

Enrichment of weighted genes (GSEA)

Genes with specific weights (e.g. the log2FC) can use GSEA method. It should input a pre-ranked geneset. This use C++ accelerated fgsea::fgsea() as backend, so it is also very fast.

This provides a quick display of NES, p-values, and leading-edge genes. This function uses the same backend as the industry-standard clusterProfiler (it is also implemented using the fgsea package). However, EnrichGT does not delve as deeply as clusterProfiler and lacks advanced visualization capabilities. While it may not be sufficient for bioinformatics experts, the current implementation is adequate for wet-lab researchers. If comprehensive analysis is required, consider using clusterProfiler. However, if you only need an overview and reclustering, EnrichGT may be enough.

How to build pre-ranked gene set?

genes_with_weights(genes,weights) function is used to build the pre-ranked gene set for GSEA analysis.

Usually, GSEA use the log2FoldChange from DEG analysis as the weights of genes. This is an example:

resGSEA <- egt_gsea_analysis(genes = 
                           genes_with_weights(genes = DEGexample$...1, 
                                              weights = DEGexample$log2FoldChange),
                         database = database_GO_BP(org.Hs.eg.db)
                         )
✔ Use cached database: GO_BP_org.Hs.eg.db
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.02% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
✔ Sucessful GSEA, time last 8.56986784934998 secs.
resGSEA |> as_tibble() # You don't need to call as_tibble, this is just for better printing
# A tibble: 3,847 × 7
   ID         Description              ES   NES  pvalue p.adjust core_enrichment
   <chr>      <chr>                 <dbl> <dbl>   <dbl>    <dbl> <chr>          
 1 GO:0050906 detection of stimulu… 0.594  1.95 8.15e-5 0.00203  OR51E1/CST1/OR…
 2 GO:0050907 detection of chemica… 0.767  1.95 2.22e-4 0.00457  OR51E1/CST1/OR…
 3 GO:0060078 regulation of postsy… 0.564  1.94 9.08e-6 0.000343 GRIK3/HCN1/GRM…
 4 GO:0035249 synaptic transmissio… 0.560  1.92 1.57e-5 0.000523 GRIK3/UNC13C/G…
 5 GO:0051966 regulation of synapt… 0.581  1.91 9.37e-5 0.00228  GRIK3/GRM1/CAC…
 6 GO:0031577 spindle checkpoint s… 0.627  1.90 5.41e-4 0.00873  SPC24/BIRC5/PL…
 7 GO:0007216 G protein-coupled gl… 0.840  1.90 1.76e-4 0.00378  GRIK3/GRM1/GRM…
 8 GO:0007628 adult walking behavi… 0.717  1.88 1.16e-3 0.0150   ZIC1/CNTN2/CAC…
 9 GO:0045132 meiotic chromosome s… 0.587  1.88 1.50e-4 0.00332  SPATA22/KASH5/…
10 GO:0007094 mitotic spindle asse… 0.628  1.88 5.47e-4 0.00873  SPC24/BIRC5/PL…
# ℹ 3,837 more rows

Other kind of weights, like the loading from PCA or NMF, or the importance of random forest, can be also used.

resExample <- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,
                                                    weights =PCA_res$PC1_loading),
                         database = database_from_gmt("MsigDB_Hallmark.gmt")
                         )

GSEA results also support re-enrichment. Brifely, GSEA now supports two kinds of figures. For visualize, see plotting page for further guides. The most frequently used is the bar plot (by function egt_plot_results), showing p-values and NES. And we provides two ranking figures (single figure / viewing by table) by warping plotting functions from fgsea - the function is called egt_plot_gsea().

Back to top