Package Functions Help
DESCRIPTION
Type: Package
Package: EnrichGT
Title: EnrichGT - all in one enrichment analysis solution
Version: 1.0.3
Author: Zhiming Ye
Maintainer: Zhiming Ye <garnetcrow@hotmail.com>
Description: Do biological enrichment analysis and parsing and clustering
enrichment result to insightful results in just ONE package
License: GPL-V3
URL: https://zhimingye.github.io/EnrichGT/
Depends:
R (>= 2.10)
Imports:
AnnotationDbi,
BiocManager,
cli,
dplyr,
ellmer,
fgsea,
fontawesome,
forcats,
ggplot2,
glue,
GO.db,
grDevices,
gt,
Matrix,
methods,
parallel,
proxy,
qvalue,
RColorBrewer,
Rcpp,
reactome.db,
rlang,
scales,
stats,
stringr,
text2vec,
tibble,
utils,
xfun
Suggests:
devtools,
readr,
testthat (>= 3.0.0),
tidyverse
LinkingTo:
Rcpp
Config/testthat/edition: 3
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
convert_annotations_genes
: Convert gene annotations from any keys to any keys
Description
Convert gene annotations from any keys to any keys
Usage
convert_annotations_genes(genes, from_what, to_what, OrgDB)
Arguments
genes
: gene vectorfrom_what
: input type (like “SYMBOL”,“ENTREZID”,“ENSEMBL”,“GENENAME”,…), keys should be supported by AnnotationDbi. Search for the help page of AnnotationDbi for further help.to_what
: output type (like “SYMBOL”,“ENTREZID”,“ENSEMBL”,“GENENAME”,…), keys should be supported by AnnotationDbi. Search for the help page of AnnotationDbi for further help. Can be multiple items E.g.c("ENTREZID","ENSEMBL","GENENAME")
OrgDB
: human = org.Hs.eg.db, mouse = org.Mm.eg.db, search BioConductor website for further help
Value
a data.frame
database_from_gmt
: Parse GMT format gene set files
Description
Reads gene set files in GMT format (e.g., from MSigDB or WikiPathways) and converts them to a data frame suitable for enrichment analysis. Can optionally convert ENTREZ IDs to gene symbols.
Usage
database_from_gmt(gmtfile, OrgDB = NULL, convert_2_symbols = T)
Arguments
gmtfile
: Path to GMT format fileOrgDB
: Annotation database for ID conversion (e.g., org.Hs.eg.db for human). Required if convert_2_symbols=TRUE.convert_2_symbols
: Logical indicating whether to convert ENTREZ IDs to gene symbols. Default is TRUE.
Value
A data frame with columns:
- term: Gene set name
- gene: Gene identifiers (symbols or ENTREZ IDs)
If input has 3 columns, includes an additional ID column.
Examples
# Read MSigDB hallmark gene sets
<- system.file("extdata", "h.all.v7.4.symbols.gmt", package = "EnrichGT")
gmt_file <- database_from_gmt(gmt_file)
gene_sets
# Read WikiPathways with ENTREZ to symbol conversion
<- "wikipathways-20220310-gmt-Homo_sapiens.gmt"
gmt_file <- database_from_gmt(gmt_file, OrgDB = org.Hs.eg.db) gene_sets
egt_compare_groups
: 2-Group Comparison of enrichment results and further clustering and visualizing
Description
See ?egt_enrichment_analysis()
Usage
egt_compare_groups(
obj.test,
obj.ctrl,name.test = NULL,
name.ctrl = NULL,
ClusterNum = 15,
P.adj = 0.05,
force = F,
nTop = 10,
method = "ward.D2",
... )
Arguments
obj.test
: the enriched object from tested group. WARNING:obj.test
andobj.ctrl
should come from same database (e.g. GO Biological Process(GOBP)).obj.ctrl
: the enriched object from control group. WARNING:obj.test
andobj.ctrl
should come from same database (e.g. GO Biological Process(GOBP)).name.test
: optional, the name of the testing group. If isNULL
, the object name ofobj.test
will be used.name.ctrl
: optional, the name of the control group. If isNULL
, the object name ofobj.ctrl
will be used.ClusterNum
: how many cluster will be clusteredP.adj
: p.adjust cut-off. To avoid slow visualization, you can make stricter p-cut off.force
: ignore all auto-self-checks, which is usefulnTop
: keep n top items according to p-adj in each cluster.method
: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC)....
: Others options.
Details
Execute obj.test
VS obj.ctrl
tests, showing pathway overlaps (or differences) and meta-gene modules of test group and control group. Supports ORA and GSEA results (enriched object or data.frame). !WARNING!: obj.test
and obj.ctrl
should come from same database (e.g. GO Biological Process(GOBP)).
Value
List
containing multiple EnrichGT_obj
objects. The List
contains objects with overlapped enriched terms, unique enrich terms.
egt_enrichment_analysis
: Perform Over-Representation Analysis (ORA)
Description
ORA compares the proportion of genes in your target list that belong to specific categories (pathways, GO terms etc.) against the expected proportion in a background set. This implementation uses hash tables for efficient gene counting and supports parallel processing for analyzing multiple gene lists.
Usage
egt_enrichment_analysis(
genes,
database,p_adj_methods = "BH",
p_val_cut_off = 0.5,
background_genes = NULL,
min_geneset_size = 10,
max_geneset_size = 500,
multi_cores = 0
)
Arguments
genes
: Input genes, either:- Character vector of gene IDs (e.g.,
c("TP53","BRCA1")
) - Named numeric vector from
genes_with_weights()
(will split by expression direction) - List of gene vectors for multiple comparisons (e.g., by cell type)
database
: Gene set database, either:- Built-in database from
database_GO_BP()
,database_KEGG()
etc. - Custom data frame with columns: (ID, Pathway_Name, Genes) or (Pathway_Name, Genes)
- GMT file loaded via
database_from_gmt()
p_val_cut_off
: Adjusted p-value cutoff (default 0.5)background_genes
: Custom background genes (default: all genes in database)min_geneset_size
: Minimum genes per set (default 10)max_geneset_size
: Maximum genes per set (default 500)multi_cores
: (Please don’t use this since it has several known bugs) Number of cores for parallel processing (default 0 = serial)p_adj_method
: Multiple testing correction method (default “BH” for Benjamini-Hochberg)
Details
Identifies enriched biological pathways or gene sets in a gene list using high-performance C++ implementation with parallel processing support.
Value
A data frame with columns:
- ID: Gene set identifier
- Description: Gene set name
- GeneRatio: Enriched genes / input genes
- BgRatio: Set genes / background genes
- pvalue: Raw p-value
- p.adjust: Adjusted p-value
- geneID: Enriched genes
- Count: Number of enriched genes
For weighted input, additional columns show up/down regulated genes.
Examples
# Basic ORA with GO Biological Processes
<- c("TP53", "BRCA1", "EGFR", "CDK2")
genes <- egt_enrichment_analysis(genes, database_GO_BP())
res
# ORA with DEG results (split by direction)
<- genes_with_weights(DEG$gene, DEG$log2FC)
deg_genes <- egt_enrichment_analysis(deg_genes, database_KEGG())
res
# Multi-group ORA with parallel processing
<- list(
gene_lists Macrophages = c("CD68", "CD163", "CD169"),
Fibroblasts = c("COL1A1", "COL1A2", "ACTA2")
)<- egt_enrichment_analysis(gene_lists, database_Reactome(), multi_cores=0) res
egt_gsea_analysis
: Perform Gene Set Enrichment Analysis (GSEA)
Description
GSEA analyzes whether predefined gene sets show statistically significant enrichment at the top or bottom of a ranked gene list. This implementation uses the fast fgsea algorithm from the fgsea package.
Usage
egt_gsea_analysis(
genes,
database,p_val_cut_off = 0.5,
min_geneset_size = 10,
max_geneset_size = 500,
gseaParam = 1
)
Arguments
genes
: A named numeric vector of ranked genes where:- Names are gene identifiers
- Values are ranking metric (e.g., log2 fold change, PCA loading)Must be sorted in descending order (use
genes_with_weights()
to prepare) database
: Gene set database, either:- Built-in database from
database_GO_BP()
,database_KEGG()
etc. - Custom data frame with columns: (ID, Pathway_Name, Genes) or (Pathway_Name, Genes)
- GMT file loaded via
database_from_gmt()
p_val_cut_off
: Adjusted p-value cutoff (default 0.5)min_geneset_size
: Minimum genes per set (default 10)max_geneset_size
: Maximum genes per set (default 500)gseaParam
: GSEA parameter controlling weight of ranking (default 1)p_adj_method
: Multiple testing correction method (default “BH” for Benjamini-Hochberg)
Details
Identifies enriched biological pathways in a ranked gene list using the fgsea algorithm.
Value
A data frame with columns:
- ID: Gene set identifier
- Description: Gene set name
- ES: Enrichment score
- NES: Normalized enrichment score
- pvalue: Raw p-value
- p.adjust: Adjusted p-value
- core_enrichment: Leading edge genes
Examples
# Using differential expression results
<- genes_with_weights(DEG$gene, DEG$log2FC)
ranked_genes <- egt_gsea_analysis(ranked_genes, database_GO_BP())
res
# Using PCA loadings
<- genes_with_weights(rownames(pca$rotation), pca$rotation[,1])
ranked_genes <- egt_gsea_analysis(ranked_genes, database_KEGG())
res
# Custom gene sets from GMT file
<- genes_with_weights(genes, weights)
ranked_genes <- egt_gsea_analysis(ranked_genes, database_from_gmt("pathways.gmt")) res
egt_infer_act
: Infering Pathway or Transcript Factors activity from EnrichGT meta-gene modules
Description
Only supports gene symbols. PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction. CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to our previous. If when doing re-enrichment, you select a high number of clusters, that may cause low gene number in each meta-gene module, and then can’t be infered sucessfully. So if result is empty, please increase the number of re-clustering when doing it.
Usage
egt_infer_act(x, DB = "collectri", species = "human")
Arguments
x
: an EnrichGT_obj object.DB
: can be “progeny” (the Pathway activity database), or “collectri” (TF activity database)species
: can be “human” or “mouse”
Value
an ORA result list
egt_llm_summary
: Summarize EnrichGT results using LLM
Description
This function uses a Large Language Model (LLM) to generate summaries for pathway clusters and gene modules in an EnrichGT_obj object.
Usage
egt_llm_summary(x, chat, lang = "English")
Arguments
x
: An EnrichGT_obj object created by[egt_recluster_analysis](egt_recluster_analysis)
.chat
: An LLM chat object created by theellmer
package.lang
: Language pass to LLM. Can beEnglish
orChinese
.
Note
It is recommended not to add system prompts when creating the chat object. The function provides its own carefully crafted prompts for biological analysis.
References
For more information about creating chat objects, see the ellmer package documentation.
Seealso
[egt_recluster_analysis](egt_recluster_analysis)
to create the input object.
Value
Returns the input EnrichGT_obj object with added LLM annotations in the LLM_Annotation
slot. The annotations include:
pathways
: Summaries of pathway clustersgenes_and_title
: Summaries of gene modules and their titles
Examples
# Create LLM chat object
<- chat_deepseek(api_key = YOUR_API_KEY, model = "deepseek-chat", system_prompt = "")
chat
# Run enrichment analysis and get EnrichGT_obj
<- egt_recluster_analysis(...)
re_enrichment_results
# Get LLM summaries
<- egt_llm_summary(re_enrichment_results, chat) re_enrichment_results
egt_plot_gsea
: Generate GSEA Enrichment Plots
Description
This function creates graphical representations of Gene Set Enrichment Analysis (GSEA) results, including either a multi-panel GSEA table plot for multiple pathways or a single pathway enrichment plot. The visualization leverages the fgsea
package’s plotting functions.
Usage
egt_plot_gsea(resGSEA$Description[1],genes = genes_with_weights(genes = DEGexample$...1, weights = DEGexample$log2FoldChange),database = database_GO_BP(org.Hs.eg.db))
egt_plot_gsea(resGSEA[1:8,],genes = genes_with_weights(genes = DEGexample$...1, weights = DEGexample$log2FoldChange),database = database_GO_BP(org.Hs.eg.db))
Arguments
x
: A GSEA result object. Can be either:- A data frame containing GSEA results (requires columns: pvalue, p.adjust, Description)
- A character string specifying a single pathway name
genes
: A named numeric vector fromgenes_with_weights()
. These should match the gene identifiers used in the GSEA analysis.database
: A databasedata.frame
, You can obtain it fromdatabase_xxx()
functions. This should correspond to the database used in the original GSEA analysis.
Value
A ggplot object:
- When
x
is a data frame: Returns a multi-panel plot showing normalized enrichment scores (NES), p-values, and leading edge plots for top pathways - When
x
is a pathway name: Returns an enrichment plot showing the running enrichment score for the specified pathway
egt_plot_results
: Visualize enrichment results using simple plot
Description
This plot is the most widely like enrichplot::dotplot()
used method to visualize enriched terms. It shows the enrichment scores (e.g. p values) and gene ratio or NES as dot size and color / or bar height. Users can specify the number of terms using ntop
or selected terms to color via the low.col
and hi.col
.
Usage
egt_plot_results(
x,ntop = NULL,
showIDs = F,
max_len_descript = 40,
keepAll = F,
...,P.adj = NULL
)
Arguments
x
: a data frame form enriched result likeegt_enrichment_analysis()
oregt_gsea_analysis()
, or an re-clusteredEnrichGT
objectntop
: Show top N in each cluster. In default, for origin enriched result, showing top 15; for re-clustered object, showing top 5 in each cluster.showIDs
: bool, show pathway IDs or not. Default is FALSEmax_len_descript
: the label format length, default as 40....
: Other paramP.adj
: If pass an origin data.frame from original enriched result, you can specify the P-adjust value cut off. If is null, default is 0.05. When passingEnrichGT_obj
, this filter is previously done byegt_recluster_analysis
.low.col
: the color for the lowesthi.col
: the color for the highest
Value
a ggplot2 object
egt_recluster_analysis
: Cluster and re-enrichment enrichment results
Description
Performs hierarchical clustering on enrichment results (ORA or GSEA) based on gene-term associations to reduce redundancy and improve biological interpretation. The function helps identify coherent groups of related terms while preserving important but less significant findings.
Usage
egt_recluster_analysis(
x,ClusterNum = 10,
P.adj = 0.05,
force = F,
nTop = 10,
method = "ward.D2",
... )
Arguments
x
: Enrichment result fromEnrichGT
orclusterProfiler
. For multi-database results, provide alist
.ClusterNum
: Number of clusters to create (default: 10).P.adj
: Adjusted p-value cutoff (default: 0.05). Stricter values improve performance.force
: Logical to bypass validation checks (default: FALSE).nTop
: Number of top terms to keep per cluster by p-value (default: 10).method
: Hierarchical clustering method (default: “ward.D2”). One of: “ward.D”, “ward.D2”, “single”, “complete”, “average” (UPGMA), “mcquitty” (WPGMA), “median” (WPGMC), or “centroid” (UPGMC)....
: Additional arguments passed to clustering functions.
Details
Input requirements by analysis type: ORA results: Required columns: “ID”, “Description”, “GeneRatio”, “pvalue”, “p.adjust”, “geneID”, “Count” GSEA results:
Required columns: “ID”, “Description”, “NES”, “pvalue”, “p.adjust”, “core_enrichment” compareClusterResult: Either the compareClusterResult object or a data frame with: - All ORA columns listed above - Additional “Cluster” column Multi-database: Provide as a named list of the above result types
Value
An EnrichGT_obj
containing:
- enriched_result: Filtered results data frame
- gt_object: Formatted
gt
table object - gene_modules: List of gene modules per cluster
- pathway_clusters: Pathway names by cluster
- clustering_tree:
hclust
object for visualization - raw_enriched_result: Unfiltered results table
Examples
# ORA example
<- egt_recluster_analysis(ora_result, ClusterNum=8)
res plot(res@clustering_tree)
# GSEA example
<- egt_recluster_analysis(gsea_result, method="average")
gsea_res gtsave(gsea_res@gt_object, "results.html")
genes_with_weights
: Create a ranked gene list for GSEA analysis
Description
Takes gene identifiers and corresponding weights (like log2 fold changes) and returns a ranked vector suitable for Gene Set Enrichment Analysis (GSEA).
Usage
genes_with_weights(genes, weights)
Arguments
genes
: Character vector of gene identifiers (e.g., gene symbols or ENTREZ IDs)weights
: Numeric vector of weights for each gene (typically log2 fold changes)
Value
A named numeric vector sorted in descending order by weight, where: - Names are gene identifiers
- Values are the corresponding weights
Examples
# Example using differential expression results
<- c("TP53", "BRCA1", "EGFR")
genes <- c(1.5, -2.1, 0.8)
log2fc <- genes_with_weights(genes, log2fc) ranked_genes
database_...
: Get database for enrichment analysis
Description
Get Gene Ontology (GO), Reactome, and other term-to-gene database, for enrichment analysis
Usage
database_GO_BP(OrgDB = org.Hs.eg.db)
database_GO_CC(OrgDB = org.Hs.eg.db)
database_GO_MF(OrgDB = org.Hs.eg.db)
database_GO_ALL(OrgDB = org.Hs.eg.db)
database_Reactome(OrgDB = org.Hs.eg.db)
database_progeny_human()
database_progeny_mouse()
database_CollecTRI_human()
database_CollecTRI_mouse()
Arguments
OrgDB
: The AnnotationDbi database to fetch pathway data and convert gene IDs to gene symbols. For human it would beorg.Hs.eg.db
, for mouse it would beorg.Mm.eg.db
. In AnnotationDbi there are many species, please searchAnnotationDbi
for other species annotation database. GO and Reactome should add this, progeny and collectri do not.
Value
a data.frame with ID, terms and genes
database_KEGG
: Get KEGG database from KEGG website
Description
KEGG is a commercialized database. So EnrichGT can’t pre-cache them locally. You can use this function to fetch KEGG database pathways and modules.
Usage
database_KEGG(kegg_organism="hsa",OrgDB = org.Hs.eg.db,kegg_modules=F,local_cache=F)
database_KEGG_show_organism()
Arguments
kegg_organism
: Determine which species data from KEGG will be fetch. For human, it would behsa
(in default); For mouse, it would bemmu
. If you wants other species, seedatabase_kegg_show_organism()
for details.OrgDB
: The AnnotationDbi database to convert KEGG gene ID to gene symbols. For human it would beorg.Hs.eg.db
, for mouse it would beorg.Mm.eg.db
. In AnnotationDbi there are many species, please searchAnnotationDbi
for other species annotation database.kegg_modules
: If TRUE, returns KEGG module; If FALSE returns KEGG pathways. In default, this is setted to FALSE to get mouse commonly used KEGG pathways.local_cache
: cache a copy in local working folder. It will be saved as a.enrichgt_cache
file in working dictionary. The.enrichgt_cache
is just a.rds
file, feel free to read it usingreadRDS()
.
%-delete->%
: Filter Enrichment Results by Description Pattern
Description
Infix operator to filter enrichment results by matching against Description field. For EnrichGT_obj objects, re-runs clustering analysis after filtering.
Usage
%-delete->% y x
Arguments
x
: Either an EnrichGT_obj object or data.frame containing enrichment resultsy
: Regular expression pattern to match against Description field
Details
This operator helps refine enrichment results by removing terms matching the given pattern from the Description field. When applied to EnrichGT_obj, it preserves all original parameters and re-runs the clustering analysis on the filtered results.
Value
For EnrichGT_obj input: A new EnrichGT_obj with filtered and re-clustered results. For data.frame input: A filtered data.frame.
Examples
# Filter out "ribosome" related terms
<- reenrichment_obj %-delete->% "ribosome"
filtered_results
# Filter data.frame directly
<- df %-delete->% "metabolism" filtered_df