Core Enrichment Functions

Basic enrichment analysis involves performing a batch “dictionary lookup” for a set of genes to determine their associations. The most commonly used method is over-representation analysis (ORA). If you also have information on weights, then Gene Set Enrichment Analysis (GSEA) is another classic choice.

Enrichment of genes

This is a C++ accelerated ORA tool. The only things you need is your favourite gene symbols. If is all prepared, then load a database, run it!

Compared to the most popular clusterProfiler, the functions of EnrichGT differ slightly. This is mainly to accommodate wet lab researchers. First, most beginners are confused by the default input of clusterProfiler, which is “ENTREZ ID.” Most people familiar with biology are used to Gene Symbols, and even Ensembl IDs are not widely known, let alone a series of seemingly random numbers. Therefore, EnrichGT uses Gene Symbol as the default input, seamlessly integrating with most downstream results from various companies, making it more suitable for non-experts in the lab.

Second, clusterProfiler outputs an S4 object, which may be too complex for beginners (this is no joke); whereas EnrichGT outputs a simple table. The time of non-experts is precious, so I made these two adjustments. The only downside is that the GSEA peak plot is difficult to generate, but in reality, we focus more on NES and p-values, and in this case, bar plots are more convincing.

Furthermore, The pre-processing step of the hypergeometric test in EnrichGT’s ORA function (which determines overlap) is accelerated using hash tables in C++, making it over five times faster than clusterProfiler::enricher(), which is a pure R implementation.

res <- egt_enrichment_analysis(genes = DEGtable$Genes,
database = database_GO_BP(org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
                                         "You can add more genes"),
database = database_Reactome(org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
                                         "You can add more genes"),
database = database_from_gmt("MsigDB_Hallmark.gmt"))

Now, we load the necessary packages and example dataset, to provide you with an example.

library(dplyr)
library(tibble)
library(ggplot2)
library(org.Hs.eg.db)
library(gt)
library(EnrichGT)
library(readr)
DEGexample <- read_csv("./DEG.csv")
DEGexample_UpReg <- DEGexample |> dplyr::filter(pvalue<0.05,log2FoldChange>0.7)
Remember to load AnnotationDbi object

Most of errors are caused by this. Please load Like AnnotationDbi like org.Hs.eg.db before doing enrichment.

Then we start ORA:

ora_result <- egt_enrichment_analysis(genes = DEGexample_UpReg$...1,database = database_GO_BP(org.Hs.eg.db))
✔ success loaded database, time used : 16.4522099494934 sec.
✔ Done ORA in 0.130129098892212 sec.
ora_result |> as_tibble() # You don't need to call as_tibble, this is just for better printing
# A tibble: 2,979 × 9
   ID       Description GeneRatio BgRatio   pvalue p.adjust  qvalue geneID Count
   <chr>    <chr>       <chr>     <chr>      <dbl>    <dbl>   <dbl> <chr>  <int>
 1 GO:0035… synaptic t… 19/457    111/18… 2.06e-11  8.05e-8 8.41e-5 ATP1A…    19
 2 GO:0051… regulation… 16/457    79/188… 5.74e-11  8.05e-8 8.41e-5 ATP1A…    16
 3 GO:0050… modulation… 39/457    489/18… 7.26e-11  8.05e-8 8.41e-5 ACHE/…    39
 4 GO:0099… regulation… 39/457    490/18… 7.71e-11  8.05e-8 8.41e-5 ACHE/…    39
 5 GO:0050… synapse or… 36/457    483/18… 2.51e- 9  2.10e-6 1.11e-3 ACHE/…    36
 6 GO:0048… regulation… 12/457    56/188… 7.50e- 9  4.18e-6 1.11e-3 APOE/…    12
 7 GO:0007… mitotic sp… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
 8 GO:0071… mitotic sp… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
 9 GO:0071… spindle as… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
10 GO:0031… spindle ch… 11/457    47/188… 1.18e- 8  4.18e-6 1.11e-3 BIRC5…    11
# ℹ 2,969 more rows
Have many sources of genes?
  • Case 1: many groups of genes:

This function also support many groups of genes, you can input a list.

# For many groups of genes
res <- egt_enrichment_analysis(list(Macrophages=c("CD169","CD68","CD163"),
Fibroblast=c("COL1A2","COL1A3"),"You can add more groups"),
 database = database_from_gmt("panglaoDB.gmt"))
  • Case 2: Differential expressed genes with different direction:

In above example, if we don’t filter out the up-regulated genes for ORA, but want to combine all DEGs without selecting them according to log2FC. How can we achieve that?

From version 0.8, egt_enrichment_analysis() supports input genes using genes_with_weights() function. For example:

DEGexample <- read_csv("~/Documents/4Fun/EGTFun/DEG.csv")
DEGexample2 <- DEGexample |> dplyr::filter(pvalue<0.05)
ora_result <- egt_enrichment_analysis(genes = genes_with_weights(DEGexample2$...1,DEGexample2$log2FoldChange), database = database_GO_BP(org.Hs.eg.db))

Then, the output will automatically consider the proportion of up-regulated genes and down-regulated genes, and finally showing them in any results EnrichGT generated, like data.frames, gt table or dot plots.

For visualize, see [Ploting ORA result] for details.

Enrichment of weighted genes (GSEA)

Genes with specific weights (e.g. the log2FC) can use GSEA method. It should input a pre-ranked geneset. This use C++ accelerated fgsea::fgsea() as backend, so it is also very fast.

This provides a quick display of NES, p-values, and leading-edge genes. This function uses the same backend as the industry-standard clusterProfiler (it is also implemented using the fgsea package). However, EnrichGT does not delve as deeply as clusterProfiler and lacks advanced visualization capabilities. While it may not be sufficient for bioinformatics experts, the current implementation is adequate for wet-lab researchers. If comprehensive analysis is required, consider using clusterProfiler. However, if you only need an overview and reclustering, EnrichGT may be enough.

How to build pre-ranked gene set?

genes_with_weights(genes,weights) function is used to build the pre-ranked gene set for GSEA analysis.

Usually, GSEA use the log2FoldChange from DEG analysis as the weights of genes. This is an example:

resGSEA <- egt_gsea_analysis(genes = 
                           genes_with_weights(genes = DEGexample$...1, 
                                              weights = DEGexample$log2FoldChange),
                         database = database_GO_BP(org.Hs.eg.db)
                         )
✔ Use cached database: GO_BP_org.Hs.eg.db
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.02% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
✔ Sucessful GSEA, time last 8.068598985672 secs.
resGSEA |> as_tibble() # You don't need to call as_tibble, this is just for better printing
# A tibble: 3,842 × 7
   ID         Description              ES   NES  pvalue p.adjust core_enrichment
   <chr>      <chr>                 <dbl> <dbl>   <dbl>    <dbl> <chr>          
 1 GO:0050907 detection of chemica… 0.767  1.96 2.38e-4 0.00487  OR51E1/CST1/OR…
 2 GO:0060078 regulation of postsy… 0.564  1.93 3.00e-6 0.000126 GRIK3/HCN1/GRM…
 3 GO:0035249 synaptic transmissio… 0.560  1.91 1.14e-5 0.000411 GRIK3/UNC13C/G…
 4 GO:0050906 detection of stimulu… 0.594  1.90 7.17e-5 0.00190  OR51E1/CST1/OR…
 5 GO:0007216 G protein-coupled gl… 0.840  1.89 1.28e-4 0.00300  GRIK3/GRM1/GRM…
 6 GO:0061982 meiosis I cell cycle… 0.553  1.86 7.08e-5 0.00188  SPATA22/KASH5/…
 7 GO:0051966 regulation of synapt… 0.581  1.86 1.17e-4 0.00283  GRIK3/GRM1/CAC…
 8 GO:0031577 spindle checkpoint s… 0.627  1.85 7.85e-4 0.0111   SPC24/BIRC5/PL…
 9 GO:0045132 meiotic chromosome s… 0.587  1.85 2.46e-4 0.00494  SPATA22/KASH5/…
10 GO:0007094 mitotic spindle asse… 0.628  1.85 5.33e-4 0.00853  SPC24/BIRC5/PL…
# ℹ 3,832 more rows

Other kind of weights, like the loading from PCA or NMF, or the importance of random forest, can be also used.

resExample <- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,
                                                    weights =PCA_res$PC1_loading),
                         database = database_from_gmt("MsigDB_Hallmark.gmt")
                         )

For visualize, see ploting page for details.

Back to top