library(dplyr)
library(tibble)
library(ggplot2)
library(org.Hs.eg.db)
library(gt)
library(EnrichGT)
library(readr)
<- read_csv("./DEG.csv")
DEGexample <- DEGexample |> dplyr::filter(pvalue<0.05,log2FoldChange>0.7) DEGexample_UpReg
Core Enrichment Functions
Basic enrichment analysis involves performing a batch “dictionary lookup” for a set of genes to determine their associations. The most commonly used method is over-representation analysis (ORA). If you also have information on weights, then Gene Set Enrichment Analysis (GSEA) is another classic choice.
Enrichment of genes
This is a C++
accelerated ORA tool. The only things you need is your favourite gene symbols. If is all prepared, then load a database, run it!
Compared to the most popular clusterProfiler, the functions of EnrichGT differ slightly. This is mainly to accommodate wet lab researchers. First, most beginners are confused by the default input of clusterProfiler, which is “ENTREZ ID.” Most people familiar with biology are used to Gene Symbols, and even Ensembl IDs are not widely known, let alone a series of seemingly random numbers. Therefore, EnrichGT uses Gene Symbol as the default input, seamlessly integrating with most downstream results from various companies, making it more suitable for non-experts in the lab.
Second, clusterProfiler outputs an S4 object, which may be too complex for beginners (this is no joke); whereas EnrichGT outputs a simple table. The time of non-experts is precious, so I made these two adjustments. The only downside is that the GSEA peak plot is difficult to generate, but in reality, we focus more on NES and p-values, and in this case, bar plots are more convincing.
Furthermore, The pre-processing step of the hypergeometric test in EnrichGT’s ORA function (which determines overlap) is accelerated using hash tables in C++, making it over five times faster than clusterProfiler::enricher()
, which is a pure R implementation.
<- egt_enrichment_analysis(genes = DEGtable$Genes,
res database = database_GO_BP(org.Hs.eg.db))
<- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
res "You can add more genes"),
database = database_Reactome(org.Hs.eg.db))
<- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
res "You can add more genes"),
database = database_from_gmt("MsigDB_Hallmark.gmt"))
Now, we load the necessary packages and example dataset, to provide you with an example.
AnnotationDbi
object
Most of errors are caused by this. Please load Like AnnotationDbi
like org.Hs.eg.db
before doing enrichment.
Then we start ORA:
<- egt_enrichment_analysis(genes = DEGexample_UpReg$...1,database = database_GO_BP(org.Hs.eg.db)) ora_result
✔ success loaded database, time used : 16.4522099494934 sec.
✔ Done ORA in 0.130129098892212 sec.
|> as_tibble() # You don't need to call as_tibble, this is just for better printing ora_result
# A tibble: 2,979 × 9
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int>
1 GO:0035… synaptic t… 19/457 111/18… 2.06e-11 8.05e-8 8.41e-5 ATP1A… 19
2 GO:0051… regulation… 16/457 79/188… 5.74e-11 8.05e-8 8.41e-5 ATP1A… 16
3 GO:0050… modulation… 39/457 489/18… 7.26e-11 8.05e-8 8.41e-5 ACHE/… 39
4 GO:0099… regulation… 39/457 490/18… 7.71e-11 8.05e-8 8.41e-5 ACHE/… 39
5 GO:0050… synapse or… 36/457 483/18… 2.51e- 9 2.10e-6 1.11e-3 ACHE/… 36
6 GO:0048… regulation… 12/457 56/188… 7.50e- 9 4.18e-6 1.11e-3 APOE/… 12
7 GO:0007… mitotic sp… 11/457 46/188… 9.27e- 9 4.18e-6 1.11e-3 BIRC5… 11
8 GO:0071… mitotic sp… 11/457 46/188… 9.27e- 9 4.18e-6 1.11e-3 BIRC5… 11
9 GO:0071… spindle as… 11/457 46/188… 9.27e- 9 4.18e-6 1.11e-3 BIRC5… 11
10 GO:0031… spindle ch… 11/457 47/188… 1.18e- 8 4.18e-6 1.11e-3 BIRC5… 11
# ℹ 2,969 more rows
- Case 1: many groups of genes:
This function also support many groups of genes, you can input a list
.
# For many groups of genes
<- egt_enrichment_analysis(list(Macrophages=c("CD169","CD68","CD163"),
res Fibroblast=c("COL1A2","COL1A3"),"You can add more groups"),
database = database_from_gmt("panglaoDB.gmt"))
- Case 2: Differential expressed genes with different direction:
In above example, if we don’t filter out the up-regulated genes for ORA, but want to combine all DEGs without selecting them according to log2FC
. How can we achieve that?
From version 0.8, egt_enrichment_analysis()
supports input genes using genes_with_weights()
function. For example:
<- read_csv("~/Documents/4Fun/EGTFun/DEG.csv")
DEGexample <- DEGexample |> dplyr::filter(pvalue<0.05)
DEGexample2 <- egt_enrichment_analysis(genes = genes_with_weights(DEGexample2$...1,DEGexample2$log2FoldChange), database = database_GO_BP(org.Hs.eg.db)) ora_result
Then, the output will automatically consider the proportion of up-regulated genes and down-regulated genes, and finally showing them in any results EnrichGT generated, like data.frame
s, gt
table or dot plots.
For visualize, see [Ploting ORA result] for details.
Enrichment of weighted genes (GSEA)
Genes with specific weights (e.g. the log2FC) can use GSEA method. It should input a pre-ranked geneset. This use C++
accelerated fgsea::fgsea()
as backend, so it is also very fast.
This provides a quick display of NES, p-values, and leading-edge genes. This function uses the same backend as the industry-standard clusterProfiler
(it is also implemented using the fgsea package). However, EnrichGT
does not delve as deeply as clusterProfiler
and lacks advanced visualization capabilities. While it may not be sufficient for bioinformatics experts, the current implementation is adequate for wet-lab researchers. If comprehensive analysis is required, consider using clusterProfiler.
However, if you only need an overview and reclustering, EnrichGT
may be enough.
genes_with_weights(genes,weights)
function is used to build the pre-ranked gene set for GSEA analysis.
Usually, GSEA use the log2FoldChange
from DEG analysis as the weights of genes. This is an example:
<- egt_gsea_analysis(genes =
resGSEA genes_with_weights(genes = DEGexample$...1,
weights = DEGexample$log2FoldChange),
database = database_GO_BP(org.Hs.eg.db)
)
✔ Use cached database: GO_BP_org.Hs.eg.db
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.02% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
✔ Sucessful GSEA, time last 8.068598985672 secs.
|> as_tibble() # You don't need to call as_tibble, this is just for better printing resGSEA
# A tibble: 3,842 × 7
ID Description ES NES pvalue p.adjust core_enrichment
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 GO:0050907 detection of chemica… 0.767 1.96 2.38e-4 0.00487 OR51E1/CST1/OR…
2 GO:0060078 regulation of postsy… 0.564 1.93 3.00e-6 0.000126 GRIK3/HCN1/GRM…
3 GO:0035249 synaptic transmissio… 0.560 1.91 1.14e-5 0.000411 GRIK3/UNC13C/G…
4 GO:0050906 detection of stimulu… 0.594 1.90 7.17e-5 0.00190 OR51E1/CST1/OR…
5 GO:0007216 G protein-coupled gl… 0.840 1.89 1.28e-4 0.00300 GRIK3/GRM1/GRM…
6 GO:0061982 meiosis I cell cycle… 0.553 1.86 7.08e-5 0.00188 SPATA22/KASH5/…
7 GO:0051966 regulation of synapt… 0.581 1.86 1.17e-4 0.00283 GRIK3/GRM1/CAC…
8 GO:0031577 spindle checkpoint s… 0.627 1.85 7.85e-4 0.0111 SPC24/BIRC5/PL…
9 GO:0045132 meiotic chromosome s… 0.587 1.85 2.46e-4 0.00494 SPATA22/KASH5/…
10 GO:0007094 mitotic spindle asse… 0.628 1.85 5.33e-4 0.00853 SPC24/BIRC5/PL…
# ℹ 3,832 more rows
Other kind of weights, like the loading from PCA or NMF, or the importance of random forest, can be also used.
<- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,
resExample weights =PCA_res$PC1_loading),
database = database_from_gmt("MsigDB_Hallmark.gmt")
)
For visualize, see ploting page for details.