DESCRIPTION
Type: Package
Package: EnrichGT
Title: EnrichGT - all in one enrichment analysis soluction
Version: 0.8.4
Author: Zhiming Ye
Maintainer: Zhiming Ye <garnetcrow@hotmail.com>
Description: Do biological enrichment analysis and parsing and clustering
enrichment result to insightful results in just ONE package
License: GPL-V3
URL: https://zhimingye.github.io/EnrichGT/
Depends:
R (>= 2.10)
Imports:
AnnotationDbi,
BiocManager,
cli,
dplyr,
fgsea,
fontawesome,
forcats,
ggplot2,
ggrepel,
glue,
GO.db,
grDevices,
gt,
Matrix,
methods,
parallel,
proxy,
qvalue,
RColorBrewer,
Rcpp,
reactome.db,
rlang,
scales,
stats,
stringr,
text2vec,
tibble,
umap,
utils,
xfun
LinkingTo:
Rcpp
Config/testthat/edition: 3
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
convert_annotations_genes
: Convert gene annotations from any keys to any keys
Description
Convert gene annotations from any keys to any keys
Usage
convert_annotations_genes(genes, from_what, to_what, OrgDB)
Arguments
genes
: gene vectorfrom_what
: input type (like “SYMBOL”,“ENTREZID”,“ENSEMBL”,“GENENAME”,…), keys should be supported by AnnotationDbi. Search for the help page of AnnotationDbi for further help.to_what
: output type (like “SYMBOL”,“ENTREZID”,“ENSEMBL”,“GENENAME”,…), keys should be supported by AnnotationDbi. Search for the help page of AnnotationDbi for further help. Can be multiple items E.g.c("ENTREZID","ENSEMBL","GENENAME")
OrgDB
: human = org.Hs.eg.db, mouse = org.Mm.eg.db, search BioConductor website for further help
Value
a data.frame
database_from_gmt
: parse GMT files file to a data.frame
Description
Read .gmt
files. You can download them from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp WikiPathway database also provides pre-built GMT files (https://data.wikipathways.org/current/gmt/). In default they are recorded as ENTREZ IDs, so you need to provide proper species database (e.g. org.Hs.eg.db for human), to database_from_gmt function and EnrichGT will automatically convert ENTREZ IDs to gene symbols for enrichment analysis.
Usage
database_from_gmt(gmtfile, OrgDB = NULL, convert_2_symbols = T)
Arguments
gmtfile
: gmt file pathOrgDB
: Only need when converting genes, human = org.Hs.eg.db, mouse = org.Mm.eg.db, search BioConductor website for further help. Default is NULL.convert_2_symbols
: Force to convert numeric gene ids (as ENTREZIDs) to gene symbols
Value
data.frame
egt_compare_groups
: 2-Group Comparison of enrichment results and further clustering and visualizing
Description
See ?EnrichGT()
Usage
egt_compare_groups(
obj.test,
obj.ctrl,name.test = NULL,
name.ctrl = NULL,
ClusterNum = 15,
P.adj = 0.05,
force = F,
nTop = 10,
method = "ward.D2",
... )
Arguments
obj.test
: the enriched object from tested group. WARNING:obj.test
andobj.ctrl
should come from same database (e.g. GO Biological Process(GOBP)).obj.ctrl
: the enriched object from control group. WARNING:obj.test
andobj.ctrl
should come from same database (e.g. GO Biological Process(GOBP)).name.test
: optional, the name of the testing group. If isNULL
, the object name ofobj.test
will be used.name.ctrl
: optional, the name of the control group. If isNULL
, the object name ofobj.ctrl
will be used.ClusterNum
: how many cluster will be clusteredP.adj
: p.adjust cut-off. To avoid slow visualization, you can make stricter p-cut off.force
: ignore all auto-self-checks, which is usefulnTop
: keep n top items according to p-adj in each cluster.method
: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC)....
: Others options.
Details
Execute obj.test
VS obj.ctrl
tests, showing pathway overlaps (or differences) and meta-gene modules of test group and control group. Supports ORA and GSEA results (enriched object or data.frame). !WARNING!: obj.test
and obj.ctrl
should come from same database (e.g. GO Biological Process(GOBP)).
Value
List
containing multiple EnrichGT_obj
objects. The List
contains objects with overlapped enriched terms, unique enrich terms.
egt_enrichment_analysis
: A C++ accelerated universal enrichment analyzer (Over-Representation Analysis (ORA))
Description
ORA is a statistical method used to identify biological pathways or gene sets that are significantly enriched in a given list of genes (e.g., differentially expressed genes). The method compares the proportion of genes in the target list that belong to a specific category (e.g., pathways, GO terms) to the expected proportion in the background gene set. To accelerate the computation in ORA analysis, EnrichGT
have implemented a function that leverages C++ for high-performance computation. The core algorithm utilizes hash tables for efficient lookup and counting of genes across categories. Also It provides multi-Core parallel calculation by package parallel
.
Usage
<- egt_enrichment_analysis(genes = DEGtable$Genes,
res database = database_GO_BP(org.Hs.eg.db))
<- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",...),
res database = database_GO_ALL(org.Hs.eg.db))
<- egt_enrichment_analysis(genes = genes_with_weights(geneSymbols, log2FC),
res database = database_kegg(kegg_organism="hsa",OrgDB = org.Hs.eg.db))
<- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",...),
res database = database_from_gmt("MsigDB_Hallmark.gmt"))
<- egt_enrichment_analysis(list(Macrophages=c("CD169","CD68","CD163"),
res Fibroblast=c("COL1A2","COL1A3"),...),
database = database_from_gmt("panglaoDB.gmt"))
Arguments
genes
: a vector of gene ids likec("TP53","CD169","CD68","CD163"...)
.If you have genes from multiple source or experiment group, you can also pass a list with gene ids in it. For Example ,list(Macrophages=c("CD169","CD68","CD163"),Fibroblast=c("COL1A2","COL1A3))
.The genes should be match in the second paramdatabase
’sgene
column. For example, if database provides Ensembl IDs, you should input Ensembl IDs. But in default databases provided byEnrichGT
is gene symbols.Of note, since ver 0.8, genes argument supports inputs from genes_with_weights(), EnrichGT will use the whole DEG for ORA, and final split gene candidated into high-expressing and lowly-expressing according to weights.database
: a database data frame, can contain 3 columns (ID, Pathway_Name, Genes) or just 2 columns (Pathway_Name, Genes). You can read a data frame and pass it through this or rundatabase_GO_CC()
to get them, see example.You can rundatabase_GO_CC()
to see an example.The ID column is not necessary. EnrichGT contains several databases, functions about databases are named starts withdatabase_...
, likedatabase_GO_BP()
ordatabase_Reactome()
.The default gene in each database EnrichGT provided to input isGENE SYMBOL
(like TP53, not 1256 or ENSG…), notENTREZ ID
orEnsembl ID
.It will be more convince for new users. Avaliable databases includesdatabase_GO_BP()
,database_GO_CC()
,database_GO_MF()
anddatabase_Reactome()
.You can add more database by downloading MsigDB (https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp)’s GMT files. It can be load by usingdatabase_from_gmt(FILE_PATH)
.If you only have simple a table, you can also pass a data frame through this arguement. Of note, it should contains at least 2 coloumn (colnames(df) = c(“Terms”,“Genes)), the first is term names and the second are the corresponding genes. If you have term ids, you can add aID
column at the first column, andTerms
becomes the second column andGenes
the third.p_adj_methods
: one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”p_val_cut_off
: adjusted pvalue cutoff on enrichment tests to reportbackground_genes
: background genes. If missing, the all genes listed in the databasemin_geneset_size
: minimal size of genes annotated for testingmax_geneset_size
: maximal size of genes annotated for testingmulti_cores
: multi_cores (Experimental), only used when analysis a list of genes (multi-source or groups). Set to 0 or 1 to disable. May use lots of RAM.
Value
a data frame with ORA results.
egt_gsea_analysis
: Gene Set Enrichment Analysis (GSEA) by EnrichGT
Description
A warpper of fgsea::fgsea()
. GSEA is a computational method used to determine whether predefined gene sets (e.g., pathways, GO terms) are statistically enriched in a ranked list of genes. Unlike ORA, GSEA considers the entire gene list and focuses on the cumulative distribution of gene ranks to identify coordinated changes. The fgsea (https://github.com/ctlab/fgsea) package is an R tool that implements an accelerated version of GSEA. It uses precomputed statistical methods and efficient algorithms to dramatically speed up enrichment analysis, especially for large datasets.
Usage
<- egt_gsea_analysis(genes = genes_with_weights(genes = DEG$genes, weights = DEG$log2FoldChange),
res database = database_GO_BP())
<- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,weights =PCA_res$PC1_loading),
res database = database_from_gmt("MsigDB_Hallmark.gmt"))
Arguments
genes
: a named numeric vector, for example c(TP53
=1.2,KRT15
=1.1,IL1B
=1.0,PMP22
= 0.5,FABP1
= -0.9,GLUT1
= -2).The number is the weight of each gene, can use the logFC form DEG analysis results instead. Also NMF or PCA’s loading can also be used.EnrichGT
provides agenes_with_weights(genes,weights)
function to build this numeric vector. Importantly, this vector should be !SORTED! for larger to smaller.database
: a database data frame, can contain 3 columns (ID, Pathway_Name, Genes) or just 2 columns (Pathway_Name, Genes). You can read a data frame and pass it through this or rundatabase_GO_CC()
to get them, see example.The ID column is not necessary. EnrichGT contains several databases, functions about databases are named starts withdatabase_...
, likedatabase_GO_BP()
ordatabase_Reactome()
.The default gene in each database EnrichGT provided to input isGENE SYMBOL
(like TP53, not 1256 or ENSG…), notENTREZ ID
orEnsembl ID
. It will be more convince for new users.Avaliable databases includesdatabase_GO_BP()
,database_GO_CC()
,database_GO_MF()
anddatabase_Reactome()
.You can add more database by downloading MsigDB (https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp)’s GMT files. It can be load by usingdatabase_from_gmt(FILE_PATH)
.If you only have simple a table, you can also pass a data frame through this arguement. Of note, it should contains at least 2 coloumn (colnames(df) = c(“Terms”,“Genes)), the first is term names and the second are the corresponding genes. If you have term ids, you can add aID
column at the first column, andTerms
becomes the second column andGenes
the third.p_val_cut_off
: adjusted pvalue cutoff on enrichment tests to reportmin_geneset_size
: minimal size of genes annotated for testingmax_geneset_size
: maximal size of genes annotated for testinggseaParam
: other param passing to fgseap_adj_methods
: one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”
Value
a data frame
egt_infer_act
: Infering Pathway or Transcript Factors activity from EnrichGT meta-gene modules
Description
Only supports gene symbols. so you must use enrichedObj |> setReadable(OrgDb = xxx,keyType = “ENTREZID”) |> EnrichGT() . Do Not supports ENTREZIDs! PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction. CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to our previous. If when doing re-enrichment, you select a high number of clusters, that may cause low gene number in each meta-gene module, and then can’t be infered sucessfully. So if result is empty, please increase the number of re-clustering when doing it.
Usage
egt_infer_act(x, DB = "collectri", species = "human")
Arguments
x
: an EnrichGT_obj object.DB
: can be “progeny” (the Pathway activity database), or “collectri” (TF activity database)species
: can be “human” or “mouse”
Value
a compareCluster
result from clusterProfiler
egt_plot_results
: Visualize enrichment results using simple plot
Description
This plot is the most widely like enrichplot::dotplot()
used method to visualize enriched terms. It shows the enrichment scores (e.g. p values) and gene ratio or NES as dot size and color / or bar height. Users can specify the number of terms using ntop
or selected terms to color via the low.col
and hi.col
.
Usage
egt_plot_results(
x,ntop = NULL,
showIDs = F,
max_len_descript = 40,
...,P.adj = NULL
)
Arguments
x
: a data frame form enriched result likeegt_enrichment_analysis()
oregt_gsea_analysis()
, or an re-clusteredEnrichGT
objectntop
: Show top N in each cluster. In default, for origin enriched result, showing top 15; for re-clustered object, showing top 5 in each cluster.showIDs
: bool, show pathway IDs or not. Default is FALSEmax_len_descript
: the label format length, default as 40....
: Other paramP.adj
: If pass an origin data.frame from original enriched result, you can specify the P-adjust value cut off. If is null, default is 0.05. When passingEnrichGT_obj
, this filter is previously done byegt_recluster_analysis
.low.col
: the color for the lowesthi.col
: the color for the highest
Value
a ggplot2 object
egt_plot_umap
: Visualize results generated form EnrichGT()
using UMAP
Description
A word frequency matrix represents the frequency of words or tokens across different documents or text samples. Each row corresponds to a document, and each column represents a word or token, with the cell values indicating the frequency of the respective word in that document.However, high-dimensional data like word frequency matrices can be challenging to interpret directly. To make such data easier to analyze, we can reduce its dimensionality and visualize the patterns or clusters in a 2D or 3D space. UMAP (Uniform Manifold Approximation and Projection) is a powerful, non-linear dimensionality reduction technique widely used for this purpose.
Usage
egt_plot_umap(x, ...)
Arguments
x
: an EnrichGT object...
: Other param
Value
a ggplot2 object
egt_recluster_analysis
: Parse enrichment results and further clustering and visualizing
Description
Cluster enrichment results based on hit genes for ORA (e.g, typical GO enrichment) or core enrichment from GSEA using term frequency analysis. This provides a clearer view of biological relevance by focusing on the genes that matter most. Gene enrichment analysis can often be misleading due to the redundancy within gene set databases and the limitations of most enrichment tools. Many tools, by default, only display a few top results and fail to filter out redundancy. This can result in both biological misinterpretation and valuable information being overlooked. For instance, high expression of certain immune genes can cause many immune-related gene sets to appear overrepresented. However, a closer look often reveals that these gene sets are derived from the same group of genes, which might represent only a small fraction. Less than 1/10 of the differentially expressed genes (DEGs). What about the other 9/10? Do they hold no biological significance? The main purpose of developing this package is to provide a lightweight and practical solution to the problems mentioned above.
Usage
egt_recluster_analysis(
x,ClusterNum = 17,
P.adj = 0.05,
force = F,
nTop = 10,
method = "ward.D2",
... )
Arguments
x
: an enrichment result fromclusterProfiler
, or adata.frame
containing result fromclusterProfier
. To perform fusing multi-database enrichment results, please give alist
object.ClusterNum
: how many cluster will be clusteredP.adj
: p.adjust cut-off. To avoid slow visualization, you can make stricter p-cut off.force
: ignore all auto-self-checks, which is usefulnTop
: keep n top items according to p-adj in each cluster.method
: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC)....
: Others options.
Details
For an ORA result, c(“ID”,“Description”,“GeneRatio”,“pvalue”,“p.adjust”,“geneID”,“Count”) should be contained; For GSEA, c(“ID”,“Description”,“NES”,“pvalue”,“p.adjust”,“core_enrichment”) should be contain. For compareClusterResult
, a compareClusterResult
object or a data-frame with additional Cluster
column should be contained, others similar to ORA result. To perform fusing multi-database enrichment results, please give a list
object.
Value
an EnrichGT_obj
object. slot enriched_result
contains a data.frame with enriched results. gt_object
contains gt
object. you can use obj@gt_object
to get it and use functions from gt
like gtsave
. gene_modules
is a list containing meta-gene modules of each cluster. pathway_clusters
contains pathways names in each cluster. clustering_tree
contains the clustering tree object from hclust()
, you can use other packages like ggtree
for further visualization and analysis. raw_enriched_result
contains raw table without selecting nTop
.
genes_with_weights
: Return ranked gene list which is use for “GSEA” or other places
Description
Return ranked gene list which is use for “GSEA” or other places
Usage
genes_with_weights(genes, weights)
Arguments
genes
: A vector containing genesweights
: A vector contain weight of genes, typically like log2FC from DEG analysis
Value
A ranked named numeric vector. Names of the numbers is the ENTREZID.
database_GO_BP
: Get database for enrichment analysis
Description
Get Gene Ontology (GO), Reactome, and other term-to-gene database, for enrichment analysis
Usage
database_GO_BP(OrgDB = org.Hs.eg.db)
database_GO_CC(OrgDB = org.Hs.eg.db)
database_GO_MF(OrgDB = org.Hs.eg.db)
database_GO_ALL(OrgDB = org.Hs.eg.db)
database_Reactome(OrgDB = org.Hs.eg.db)
database_progeny_human()
database_progeny_mouse()
database_CollecTRI_human()
database_CollecTRI_mouse()
Arguments
OrgDB
: The AnnotationDbi database to fetch pathway data and convert gene IDs to gene symbols. For human it would beorg.Hs.eg.db
, for mouse it would beorg.Mm.eg.db
. In AnnotationDbi there are many species, please searchAnnotationDbi
for other species annotation database. GO and Reactome should add this, progeny and collectri do not.
Value
a data.frame with ID, terms and genes
database_kegg
: Get KEGG database from KEGG website
Description
KEGG is a commercialized database. So EnrichGT can’t pre-cache them locally. You can use this function to fetch KEGG database pathways and modules.
Usage
database_kegg(kegg_organism="hsa",OrgDB = org.Hs.eg.db,kegg_modules=F,local_cache=F)
database_kegg_show_organism()
Arguments
kegg_organism
: Determine which species data from KEGG will be fetch. For human, it would behsa
(in default); For mouse, it would bemmu
. If you wants other species, seedatabase_kegg_show_organism()
for details.OrgDB
: The AnnotationDbi database to convert KEGG gene ID to gene symbols. For human it would beorg.Hs.eg.db
, for mouse it would beorg.Mm.eg.db
. In AnnotationDbi there are many species, please searchAnnotationDbi
for other species annotation database.kegg_modules
: If TRUE, returns KEGG module; If FALSE returns KEGG pathways. In default, this is setted to FALSE to get mouse commonly used KEGG pathways.local_cache
: cache a copy in local working folder. It will be saved as a.enrichgt_cache
file in working dictionary. The.enrichgt_cache
is just a.rds
file, feel free to read it usingreadRDS()
.