library(dplyr)
library(tibble)
library(ggplot2)
library(org.Hs.eg.db)
library(gt)
library(EnrichGT)
library(readr)
DEGexample <- read_csv("./DEG.csv")
DEGexample_UpReg <- DEGexample |> dplyr::filter(pvalue<0.05,log2FoldChange>0.7)Core Enrichment Functions
Basic enrichment analysis involves performing a batch “dictionary lookup” for a set of genes to determine their associations. The most commonly used method is over-representation analysis (ORA). If you also have information on weights, then Gene Set Enrichment Analysis (GSEA) is another classic choice.
Enrichment of genes
This is an accelerated ORA tool. The only things you need is your favourite gene symbols. If is all prepared, then load a database, run it!
Functions in EnrichGT is mainly to accommodate wet lab researchers. First, most beginners are confused by “ENTREZ ID” used by most of tools. And people often much familiar with biology are used to Gene Symbols, and even Ensembl IDs are not widely known, let alone a series of seemingly random numbers. Therefore, EnrichGT uses Gene Symbol as the default input, seamlessly integrating with most downstream results from various companies, making it more suitable for non-experts in the lab.
While other tools often output S4 objects, which may be too complex for beginners; whereas EnrichGT outputs a simple table. The time of non-experts is precious, so I made these two adjustments. The only downside is that the GSEA peak plot is difficult to generate, but in reality, we focus more on NES and p-values, and in this case, bar plots are more convincing.
Furthermore, The pre-processing step of the hypergeometric test in EnrichGT’s ORA function (which determines overlap) is accelerated using hash tables in C++, making it over five times faster than other pure R implementation.
res <- egt_enrichment_analysis(genes = DEGtable$Genes,
database = database_GO_BP(org.Hs.eg.db))
res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
"You can add more genes"),
database = database_Reactome(org.Hs.eg.db))
res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
"You can add more genes"),
database = database_from_gmt("MsigDB_Hallmark.gmt"))Now, we load the necessary packages and example dataset, to provide you with an example.
AnnotationDbi object
Most of errors are caused by this. Please load Like AnnotationDbi like org.Hs.eg.db before doing enrichment.
Then we start ORA:
ora_result <- egt_enrichment_analysis(genes = DEGexample_UpReg$...1,database = database_GO_BP(org.Hs.eg.db))✔ success loaded database, time used : 10.3418889045715 sec.
✔ Done ORA in 0.111586093902588 sec.
ora_result# A tibble: 3,003 × 9
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int>
1 GO:0035… synaptic t… 18/465 111/18… 2.47e-10 1.05e-6 0.00245 ATP1A… 18
2 GO:0051… regulation… 15/465 77/188… 5.47e-10 1.16e-6 0.00245 ATP1A… 15
3 GO:0007… glutamate … 12/465 55/188… 7.59e- 9 1.07e-5 0.00389 GRIA4… 12
4 GO:0048… regulation… 12/465 60/188… 2.16e- 8 1.19e-5 0.00389 APOE/… 12
5 GO:0007… mitotic sp… 11/465 49/188… 2.35e- 8 1.19e-5 0.00389 BIRC5… 11
6 GO:0071… mitotic sp… 11/465 49/188… 2.35e- 8 1.19e-5 0.00389 BIRC5… 11
7 GO:0071… spindle as… 11/465 49/188… 2.35e- 8 1.19e-5 0.00389 BIRC5… 11
8 GO:0031… spindle ch… 11/465 50/188… 2.94e- 8 1.19e-5 0.00389 BIRC5… 11
9 GO:0010… regulation… 12/465 62/188… 3.20e- 8 1.19e-5 0.00389 BIRC5… 12
10 GO:0045… negative r… 11/465 51/188… 3.67e- 8 1.19e-5 0.00389 BIRC5… 11
# ℹ 2,993 more rows
- Case 1: many groups of genes:
This function also support many groups of genes, you can input a list.
# For many groups of genes
res <- egt_enrichment_analysis(list(Macrophages=c("CD169","CD68","CD163"),
Fibroblast=c("COL1A2","COL1A3"),"You can add more groups"),
database = database_from_gmt("panglaoDB.gmt"))- Case 2: Differential expressed genes with different direction:
In above example, if we don’t filter out the up-regulated genes for ORA, but want to combine all DEGs without selecting them according to log2FC. How can we achieve that?
From version 0.8, egt_enrichment_analysis() supports input genes using genes_with_weights() function. For example:
DEGexample <- read_csv("~/Documents/4Fun/EGTFun/DEG.csv")
DEGexample2 <- DEGexample |> dplyr::filter(pvalue<0.05)
ora_result <- egt_enrichment_analysis(genes = genes_with_weights(DEGexample2$...1,DEGexample2$log2FoldChange), database = database_GO_BP(org.Hs.eg.db))Then, the output will automatically consider the proportion of up-regulated genes and down-regulated genes, and finally showing them in any results EnrichGT generated, like data.frames, gt table or dot plots.
For visualize, see ploting ORA results for details. You can use egt_plot_results() to view the enriched result. And you can see re-enrichment and FUSING guides for more further analysis.
Enrichment of weighted genes (GSEA)
Genes with specific weights (e.g. the log2FC) can use GSEA method. It should input a pre-ranked geneset. This use fgsea::fgsea() as backend, it is also very fast.
This provides a quick display of NES, p-values, and leading-edge genes. This function uses the same backend as the industry-standard clusterProfiler (it is also implemented using the fgsea package). However, EnrichGT does not delve as deeply as clusterProfiler and lacks advanced visualization capabilities. While it may not be sufficient for bioinformatics experts, the current implementation is adequate for wet-lab researchers. If comprehensive analysis is required, consider using clusterProfiler. However, if you only need an overview and reclustering, EnrichGT may be enough.
genes_with_weights(genes,weights) function is used to build the pre-ranked gene set for GSEA analysis.
Usually, GSEA use the log2FoldChange from DEG analysis as the weights of genes. This is an example:
resGSEA <- egt_gsea_analysis(genes =
genes_with_weights(genes = DEGexample$...1,
weights = DEGexample$log2FoldChange),
database = database_GO_BP(org.Hs.eg.db)
)✔ Use cached database: GO_BP_org.Hs.eg.db
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.02% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
✔ Sucessful GSEA, time last 7.00248885154724 secs.
ℹ Performing leading edge analysis...
resGSEA# A tibble: 3,785 × 12
ID Description ES NES pvalue p.adjust core_enrichment rank tags
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 GO:0045… meiotic ch… 0.612 1.96 3.73e-5 0.00117 SPATA22/KASH5/… 2679 43%
2 GO:0009… purine rib… 0.575 1.92 4.15e-5 0.00126 ADCY10/NME9/ND… 4558 55%
3 GO:0050… detection … 0.767 1.92 3.41e-4 0.00611 OR51E1/CST1/OR… 2023 63%
4 GO:0009… purine nuc… 0.570 1.91 3.21e-5 0.00104 ADCY10/NME9/ND… 4558 54%
5 GO:0006… ATP biosyn… 0.582 1.90 6.38e-5 0.00179 ADCY10/NDUFS6/… 4403 56%
6 GO:0035… synaptic t… 0.557 1.88 3.23e-5 0.00104 GRIK3/UNC13C/G… 1330 31%
7 GO:0007… adult walk… 0.714 1.88 6.84e-4 0.0103 ZIC1/CNTN2/CAC… 866 25%
8 GO:0031… spindle ch… 0.637 1.88 3.12e-4 0.00574 SPC24/BIRC5/PL… 1435 39%
9 GO:0009… ribonucleo… 0.558 1.88 8.49e-5 0.00222 ADCY10/NME9/ND… 6138 76%
10 GO:0050… detection … 0.574 1.88 1.76e-4 0.00360 OR51E1/CST1/OR… 1826 33%
# ℹ 3,775 more rows
# ℹ 3 more variables: list <chr>, signal <chr>, leading_edge <chr>
Other kind of weights, like the loading from PCA or NMF, or the importance of random forest, can be also used.
resExample <- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,
weights =PCA_res$PC1_loading),
database = database_from_gmt("MsigDB_Hallmark.gmt")
)GSEA results also support re-enrichment. Brifely, GSEA now supports two kinds of figures. For visualize, see plotting page for further guides. The most frequently used is the bar plot (by function egt_plot_results), showing p-values and NES. And we provides two ranking figures (single figure / viewing by table) by warping plotting functions from fgsea - the function is called egt_plot_gsea().