Core Enrichment Functions

Basic enrichment analysis involves performing a batch “dictionary lookup” for a set of genes to determine their associations. The most commonly used method is over-representation analysis (ORA). If you also have information on weights, then Gene Set Enrichment Analysis (GSEA) is another classic choice.

Enrichment of genes

This is a C++ accelerated ORA tool. The only things you need is your favourite gene symbols. If is all prepared, then load a database, run it!

Difference from other tools

Compared to the most popular clusterProfiler, the functions of EnrichGT differ slightly. This is mainly to accommodate wet lab researchers. First, most beginners are confused by the default input of clusterProfiler, which is “ENTREZ ID.” Most people familiar with biology are used to Gene Symbols, and even Ensembl IDs are not widely known, let alone a series of seemingly random numbers. Therefore, EnrichGT uses Gene Symbol as the default input, seamlessly integrating with most downstream results from various companies, making it more suitable for non-experts in the lab.

Second, clusterProfiler outputs an S4 object, which may be too complex for beginners (this is no joke); whereas EnrichGT outputs a simple table. The time of non-experts is precious, so I made these two adjustments. The only downside is that the GSEA peak plot is difficult to generate, but in reality, we focus more on NES and p-values, and in this case, bar plots are more convincing.

Furthermore, The pre-processing step of the hypergeometric test in EnrichGT’s ORA function (which determines overlap) is accelerated using hash tables in C++, making it over five times faster than clusterProfiler::enricher(), which is a pure R implementation.

Basic usage

res <- egt_enrichment_analysis(genes = DEGtable$Genes,
database = database_GO_BP(org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
                                         "You can add more genes"),
database = database_Reactome(org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
                                         "You can add more genes"),
database = database_from_gmt("MsigDB_Hallmark.gmt"))

Now, we load the necessary packages and example dataset, to provide you with an example.

library(dplyr)
library(tibble)
library(ggplot2)
library(org.Hs.eg.db)
library(gt)
library(EnrichGT)
library(readr)
DEGexample <- read_csv("./DEG.csv")
DEGexample_UpReg <- DEGexample |> dplyr::filter(pvalue<0.05,log2FoldChange>0.7)

Remember to load AnnotationDbi object

Most of errors are caused by this. Please load Like AnnotationDbi like org.Hs.eg.db before doing enrichment.

Then we start ORA:

ora_result <- egt_enrichment_analysis(genes = DEGexample_UpReg$...1,database = database_GO_BP(org.Hs.eg.db))

✔ success loaded database, time used : 16.954647064209 sec.

✔ Done ORA in 0.134052991867065 sec.

ora_result

# A tibble: 2,979 × 9
   ID       Description GeneRatio BgRatio   pvalue p.adjust  qvalue geneID Count
   <chr>    <chr>       <chr>     <chr>      <dbl>    <dbl>   <dbl> <chr>  <int>
 1 GO:0035… synaptic t… 19/457    111/18… 2.06e-11  8.05e-8 8.41e-5 ATP1A…    19
 2 GO:0051… regulation… 16/457    79/188… 5.74e-11  8.05e-8 8.41e-5 ATP1A…    16
 3 GO:0050… modulation… 39/457    489/18… 7.26e-11  8.05e-8 8.41e-5 ACHE/…    39
 4 GO:0099… regulation… 39/457    490/18… 7.71e-11  8.05e-8 8.41e-5 ACHE/…    39
 5 GO:0050… synapse or… 36/457    483/18… 2.51e- 9  2.10e-6 1.11e-3 ACHE/…    36
 6 GO:0048… regulation… 12/457    56/188… 7.50e- 9  4.18e-6 1.11e-3 APOE/…    12
 7 GO:0007… mitotic sp… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
 8 GO:0071… mitotic sp… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
 9 GO:0071… spindle as… 11/457    46/188… 9.27e- 9  4.18e-6 1.11e-3 BIRC5…    11
10 GO:0031… spindle ch… 11/457    47/188… 1.18e- 8  4.18e-6 1.11e-3 BIRC5…    11
# ℹ 2,969 more rows

Have many sources of genes?

Case 1: many groups of genes:

This function also support many groups of genes, you can input a list.

# For many groups of genes
res <- egt_enrichment_analysis(list(Macrophages=c("CD169","CD68","CD163"),
Fibroblast=c("COL1A2","COL1A3"),"You can add more groups"),
 database = database_from_gmt("panglaoDB.gmt"))

Case 2: Differential expressed genes with different direction:

In above example, if we don’t filter out the up-regulated genes for ORA, but want to combine all DEGs without selecting them according to log2FC. How can we achieve that?

From version 0.8, egt_enrichment_analysis() supports input genes using genes_with_weights() function. For example:

DEGexample <- read_csv("~/Documents/4Fun/EGTFun/DEG.csv")
DEGexample2 <- DEGexample |> dplyr::filter(pvalue<0.05)
ora_result <- egt_enrichment_analysis(genes = genes_with_weights(DEGexample2$...1,DEGexample2$log2FoldChange), database = database_GO_BP(org.Hs.eg.db))

Then, the output will automatically consider the proportion of up-regulated genes and down-regulated genes, and finally showing them in any results EnrichGT generated, like data.frames, gt table or dot plots.

For visualize, see ploting ORA results for details. You can use egt_plot_results() to view the enriched result. And you can see re-enrichment and FUSING guides for more further analysis.

Enrichment of weighted genes (GSEA)

Genes with specific weights (e.g. the log2FC) can use GSEA method. It should input a pre-ranked geneset. This use C++ accelerated fgsea::fgsea() as backend, so it is also very fast.

This provides a quick display of NES, p-values, and leading-edge genes. This function uses the same backend as the industry-standard clusterProfiler (it is also implemented using the fgsea package). However, EnrichGT does not delve as deeply as clusterProfiler and lacks advanced visualization capabilities. While it may not be sufficient for bioinformatics experts, the current implementation is adequate for wet-lab researchers. If comprehensive analysis is required, consider using clusterProfiler. However, if you only need an overview and reclustering, EnrichGT may be enough.

How to build pre-ranked gene set?

genes_with_weights(genes,weights) function is used to build the pre-ranked gene set for GSEA analysis.

Usually, GSEA use the log2FoldChange from DEG analysis as the weights of genes. This is an example:

resGSEA <- egt_gsea_analysis(genes = 
                           genes_with_weights(genes = DEGexample$...1, 
                                              weights = DEGexample$log2FoldChange),
                         database = database_GO_BP(org.Hs.eg.db)
                         )

✔ Use cached database: GO_BP_org.Hs.eg.db

Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.02% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

✔ Sucessful GSEA, time last 8.36612606048584 secs.

resGSEA

# A tibble: 3,850 × 7
   ID         Description              ES   NES  pvalue p.adjust core_enrichment
   <chr>      <chr>                 <dbl> <dbl>   <dbl>    <dbl> <chr>          
 1 GO:0060078 regulation of postsy… 0.564  1.94 4.60e-6 0.000188 GRIK3/HCN1/GRM…
 2 GO:0007216 G protein-coupled gl… 0.840  1.92 1.31e-4 0.00295  GRIK3/GRM1/GRM…
 3 GO:0050907 detection of chemica… 0.767  1.92 1.55e-4 0.00342  OR51E1/CST1/OR…
 4 GO:0035249 synaptic transmissio… 0.560  1.92 1.87e-5 0.000621 GRIK3/UNC13C/G…
 5 GO:0050906 detection of stimulu… 0.594  1.90 4.98e-5 0.00136  OR51E1/CST1/OR…
 6 GO:0061982 meiosis I cell cycle… 0.553  1.87 1.09e-4 0.00259  SPATA22/KASH5/…
 7 GO:0051966 regulation of synapt… 0.581  1.87 7.66e-5 0.00191  GRIK3/GRM1/CAC…
 8 GO:0031577 spindle checkpoint s… 0.627  1.86 5.57e-4 0.00905  SPC24/BIRC5/PL…
 9 GO:0007628 adult walking behavi… 0.717  1.86 4.16e-4 0.00715  ZIC1/CNTN2/CAC…
10 GO:0007094 mitotic spindle asse… 0.628  1.85 6.81e-4 0.0103   SPC24/BIRC5/PL…
# ℹ 3,840 more rows

Other kind of weights, like the loading from PCA or NMF, or the importance of random forest, can be also used.

resExample <- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,
                                                    weights =PCA_res$PC1_loading),
                         database = database_from_gmt("MsigDB_Hallmark.gmt")
                         )

GSEA results also support re-enrichment. Brifely, GSEA now supports two kinds of figures. For visualize, see plotting page for further guides. The most frequently used is the bar plot (by function egt_plot_results), showing p-values and NES. And we provides two ranking figures (single figure / viewing by table) by warping plotting functions from fgsea - the function is called egt_plot_gsea().