Package Functions Help

DESCRIPTION

Type: Package
Package: EnrichGT
Title: EnrichGT - all in one enrichment analysis solution
Version: 1.0.3
Author: Zhiming Ye
Maintainer: Zhiming Ye <garnetcrow@hotmail.com>
Description: Do biological enrichment analysis and parsing and clustering
    enrichment result to insightful results in just ONE package
License: GPL-V3
URL: https://zhimingye.github.io/EnrichGT/
Depends: 
    R (>= 2.10)
Imports: 
    AnnotationDbi,
    BiocManager,
    cli,
    dplyr,
    ellmer,
    fgsea,
    fontawesome,
    forcats,
    ggplot2,
    glue,
    GO.db,
    grDevices,
    gt,
    Matrix,
    methods,
    parallel,
    proxy,
    qvalue,
    RColorBrewer,
    Rcpp,
    reactome.db,
    rlang,
    scales,
    stats,
    stringr,
    text2vec,
    tibble,
    utils,
    xfun
Suggests: 
    devtools,
    readr,
    testthat (>= 3.0.0),
    tidyverse
LinkingTo: 
    Rcpp
Config/testthat/edition: 3
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2

`convert_annotations_genes`: Convert gene annotations from any keys to any keys

Description

Convert gene annotations from any keys to any keys

Usage

convert_annotations_genes(genes, from_what, to_what, OrgDB)

Arguments

genes: gene vector
from_what: input type (like “SYMBOL”,“ENTREZID”,“ENSEMBL”,“GENENAME”,…), keys should be supported by AnnotationDbi. Search for the help page of AnnotationDbi for further help.
to_what: output type (like “SYMBOL”,“ENTREZID”,“ENSEMBL”,“GENENAME”,…), keys should be supported by AnnotationDbi. Search for the help page of AnnotationDbi for further help. Can be multiple items E.g. c("ENTREZID","ENSEMBL","GENENAME")
OrgDB: human = org.Hs.eg.db, mouse = org.Mm.eg.db, search BioConductor website for further help

Value

a data.frame

`database_from_gmt`: Parse GMT format gene set files

Description

Reads gene set files in GMT format (e.g., from MSigDB or WikiPathways) and converts them to a data frame suitable for enrichment analysis. Can optionally convert ENTREZ IDs to gene symbols.

Usage

database_from_gmt(gmtfile, OrgDB = NULL, convert_2_symbols = T)

Arguments

gmtfile: Path to GMT format file
OrgDB: Annotation database for ID conversion (e.g., org.Hs.eg.db for human). Required if convert_2_symbols=TRUE.
convert_2_symbols: Logical indicating whether to convert ENTREZ IDs to gene symbols. Default is TRUE.

Author

Original GMT parser by Guangchuang Yu (https://github.com/YuLab-SMU/gson). Cache system and enhancements by Zhiming Ye.

Value

A data frame with columns:

term: Gene set name
gene: Gene identifiers (symbols or ENTREZ IDs)

If input has 3 columns, includes an additional ID column.

Examples

# Read MSigDB hallmark gene sets
gmt_file <- system.file("extdata", "h.all.v7.4.symbols.gmt", package = "EnrichGT")
gene_sets <- database_from_gmt(gmt_file)

# Read WikiPathways with ENTREZ to symbol conversion  
gmt_file <- "wikipathways-20220310-gmt-Homo_sapiens.gmt"
gene_sets <- database_from_gmt(gmt_file, OrgDB = org.Hs.eg.db)

`egt_compare_groups`: 2-Group Comparison of enrichment results and further clustering and visualizing

Description

See ?egt_enrichment_analysis()

Usage

egt_compare_groups(
  obj.test,
  obj.ctrl,
  name.test = NULL,
  name.ctrl = NULL,
  ClusterNum = 15,
  P.adj = 0.05,
  force = F,
  nTop = 10,
  method = "ward.D2",
  ...
)

Arguments

obj.test: the enriched object from tested group. WARNING: obj.test and obj.ctrl should come from same database (e.g. GO Biological Process(GOBP)).
obj.ctrl: the enriched object from control group. WARNING: obj.test and obj.ctrl should come from same database (e.g. GO Biological Process(GOBP)).
name.test: optional, the name of the testing group. If is NULL, the object name of obj.test will be used.
name.ctrl: optional, the name of the control group. If is NULL, the object name of obj.ctrl will be used.
ClusterNum: how many cluster will be clustered
P.adj: p.adjust cut-off. To avoid slow visualization, you can make stricter p-cut off.
force: ignore all auto-self-checks, which is useful
nTop: keep n top items according to p-adj in each cluster.
method: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).
...: Others options.

Details

Execute obj.test VS obj.ctrl tests, showing pathway overlaps (or differences) and meta-gene modules of test group and control group. Supports ORA and GSEA results (enriched object or data.frame). !WARNING!: obj.test and obj.ctrl should come from same database (e.g. GO Biological Process(GOBP)).

Author

Zhiming Ye

Value

List containing multiple EnrichGT_obj objects. The List contains objects with overlapped enriched terms, unique enrich terms.

`egt_enrichment_analysis`: Perform Over-Representation Analysis (ORA)

Description

ORA compares the proportion of genes in your target list that belong to specific categories (pathways, GO terms etc.) against the expected proportion in a background set. This implementation uses hash tables for efficient gene counting and supports parallel processing for analyzing multiple gene lists.

Usage

egt_enrichment_analysis(
  genes,
  database,
  p_adj_methods = "BH",
  p_val_cut_off = 0.5,
  background_genes = NULL,
  min_geneset_size = 10,
  max_geneset_size = 500,
  multi_cores = 0
)

Arguments

genes: Input genes, either:
Character vector of gene IDs (e.g., c("TP53","BRCA1"))
Named numeric vector from genes_with_weights() (will split by expression direction)
List of gene vectors for multiple comparisons (e.g., by cell type)
database: Gene set database, either:
Built-in database from database_GO_BP(), database_KEGG() etc.
Custom data frame with columns: (ID, Pathway_Name, Genes) or (Pathway_Name, Genes)
GMT file loaded via database_from_gmt()
p_val_cut_off: Adjusted p-value cutoff (default 0.5)
background_genes: Custom background genes (default: all genes in database)
min_geneset_size: Minimum genes per set (default 10)
max_geneset_size: Maximum genes per set (default 500)
multi_cores: (Please don’t use this since it has several known bugs) Number of cores for parallel processing (default 0 = serial)
p_adj_method: Multiple testing correction method (default “BH” for Benjamini-Hochberg)

Details

Identifies enriched biological pathways or gene sets in a gene list using high-performance C++ implementation with parallel processing support.

Author

Zhiming Ye

Value

A data frame with columns:

ID: Gene set identifier
Description: Gene set name
GeneRatio: Enriched genes / input genes
BgRatio: Set genes / background genes
pvalue: Raw p-value
p.adjust: Adjusted p-value
geneID: Enriched genes
Count: Number of enriched genes

For weighted input, additional columns show up/down regulated genes.

Examples

# Basic ORA with GO Biological Processes
genes <- c("TP53", "BRCA1", "EGFR", "CDK2")
res <- egt_enrichment_analysis(genes, database_GO_BP())

# ORA with DEG results (split by direction)
deg_genes <- genes_with_weights(DEG$gene, DEG$log2FC)
res <- egt_enrichment_analysis(deg_genes, database_KEGG())

# Multi-group ORA with parallel processing
gene_lists <- list(
  Macrophages = c("CD68", "CD163", "CD169"),
  Fibroblasts = c("COL1A1", "COL1A2", "ACTA2")
)
res <- egt_enrichment_analysis(gene_lists, database_Reactome(), multi_cores=0)

`egt_gsea_analysis`: Perform Gene Set Enrichment Analysis (GSEA)

Description

GSEA analyzes whether predefined gene sets show statistically significant enrichment at the top or bottom of a ranked gene list. This implementation uses the fast fgsea algorithm from the fgsea package.

Usage

egt_gsea_analysis(
  genes,
  database,
  p_val_cut_off = 0.5,
  min_geneset_size = 10,
  max_geneset_size = 500,
  gseaParam = 1
)

Arguments

genes: A named numeric vector of ranked genes where:
Names are gene identifiers
Values are ranking metric (e.g., log2 fold change, PCA loading)Must be sorted in descending order (use genes_with_weights() to prepare)
database: Gene set database, either:
Built-in database from database_GO_BP(), database_KEGG() etc.
Custom data frame with columns: (ID, Pathway_Name, Genes) or (Pathway_Name, Genes)
GMT file loaded via database_from_gmt()
p_val_cut_off: Adjusted p-value cutoff (default 0.5)
min_geneset_size: Minimum genes per set (default 10)
max_geneset_size: Maximum genes per set (default 500)
gseaParam: GSEA parameter controlling weight of ranking (default 1)
p_adj_method: Multiple testing correction method (default “BH” for Benjamini-Hochberg)

Details

Identifies enriched biological pathways in a ranked gene list using the fgsea algorithm.

Author

Based on fgsea package by Alexey Sergushichev et al.

Value

A data frame with columns:

ID: Gene set identifier
Description: Gene set name
ES: Enrichment score
NES: Normalized enrichment score
pvalue: Raw p-value
p.adjust: Adjusted p-value
core_enrichment: Leading edge genes

Examples

# Using differential expression results
ranked_genes <- genes_with_weights(DEG$gene, DEG$log2FC)
res <- egt_gsea_analysis(ranked_genes, database_GO_BP())

# Using PCA loadings
ranked_genes <- genes_with_weights(rownames(pca$rotation), pca$rotation[,1])
res <- egt_gsea_analysis(ranked_genes, database_KEGG())

# Custom gene sets from GMT file
ranked_genes <- genes_with_weights(genes, weights)
res <- egt_gsea_analysis(ranked_genes, database_from_gmt("pathways.gmt"))

`egt_infer_act`: Infering Pathway or Transcript Factors activity from EnrichGT meta-gene modules

Description

Only supports gene symbols. PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction. CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to our previous. If when doing re-enrichment, you select a high number of clusters, that may cause low gene number in each meta-gene module, and then can’t be infered sucessfully. So if result is empty, please increase the number of re-clustering when doing it.

Usage

egt_infer_act(x, DB = "collectri", species = "human")

Arguments

x: an EnrichGT_obj object.
DB: can be “progeny” (the Pathway activity database), or “collectri” (TF activity database)
species: can be “human” or “mouse”

Author

Zhiming Ye, Saez-Rodriguez Lab (The decoupleR package, https://saezlab.github.io/decoupleR/)

Value

an ORA result list

`egt_llm_summary`: Summarize EnrichGT results using LLM

Description

This function uses a Large Language Model (LLM) to generate summaries for pathway clusters and gene modules in an EnrichGT_obj object.

Usage

egt_llm_summary(x, chat, lang = "English")

Arguments

x: An EnrichGT_obj object created by [egt_recluster_analysis](egt_recluster_analysis).
chat: An LLM chat object created by the ellmer package.
lang: Language pass to LLM. Can be English or Chinese.

Note

It is recommended not to add system prompts when creating the chat object. The function provides its own carefully crafted prompts for biological analysis.

References

For more information about creating chat objects, see the ellmer package documentation.

Value

Returns the input EnrichGT_obj object with added LLM annotations in the LLM_Annotation slot. The annotations include:

pathways: Summaries of pathway clusters
genes_and_title: Summaries of gene modules and their titles

Examples

# Create LLM chat object
chat <- chat_deepseek(api_key = YOUR_API_KEY, model = "deepseek-chat", system_prompt = "")

# Run enrichment analysis and get EnrichGT_obj
re_enrichment_results <- egt_recluster_analysis(...)

# Get LLM summaries
re_enrichment_results <- egt_llm_summary(re_enrichment_results, chat)

`egt_plot_gsea`: Generate GSEA Enrichment Plots

Description

This function creates graphical representations of Gene Set Enrichment Analysis (GSEA) results, including either a multi-panel GSEA table plot for multiple pathways or a single pathway enrichment plot. The visualization leverages the fgsea package’s plotting functions.

Usage

egt_plot_gsea(resGSEA$Description[1],genes = genes_with_weights(genes = DEGexample$...1, weights = DEGexample$log2FoldChange),database = database_GO_BP(org.Hs.eg.db))

egt_plot_gsea(resGSEA[1:8,],genes = genes_with_weights(genes = DEGexample$...1, weights = DEGexample$log2FoldChange),database = database_GO_BP(org.Hs.eg.db))

Arguments

x: A GSEA result object. Can be either:
A data frame containing GSEA results (requires columns: pvalue, p.adjust, Description)
A character string specifying a single pathway name
genes: A named numeric vector from genes_with_weights(). These should match the gene identifiers used in the GSEA analysis.
database: A database data.frame, You can obtain it from database_xxx() functions. This should correspond to the database used in the original GSEA analysis.

Author

Zhiming Ye, warpped from fgsea

Value

A ggplot object:

When x is a data frame: Returns a multi-panel plot showing normalized enrichment scores (NES), p-values, and leading edge plots for top pathways
When x is a pathway name: Returns an enrichment plot showing the running enrichment score for the specified pathway

`egt_plot_results`: Visualize enrichment results using simple plot

Description

This plot is the most widely like enrichplot::dotplot()used method to visualize enriched terms. It shows the enrichment scores (e.g. p values) and gene ratio or NES as dot size and color / or bar height. Users can specify the number of terms using ntop or selected terms to color via the low.col and hi.col.

Usage

egt_plot_results(
  x,
  ntop = NULL,
  showIDs = F,
  max_len_descript = 40,
  keepAll = F,
  ...,
  P.adj = NULL
)

Arguments

x: a data frame form enriched result like egt_enrichment_analysis() or egt_gsea_analysis(), or an re-clustered EnrichGT object
ntop: Show top N in each cluster. In default, for origin enriched result, showing top 15; for re-clustered object, showing top 5 in each cluster.
showIDs: bool, show pathway IDs or not. Default is FALSE
max_len_descript: the label format length, default as 40.
...: Other param
P.adj: If pass an origin data.frame from original enriched result, you can specify the P-adjust value cut off. If is null, default is 0.05. When passing EnrichGT_obj, this filter is previously done by egt_recluster_analysis.
low.col: the color for the lowest
hi.col: the color for the highest

Author

Zhiming Ye

Value

a ggplot2 object

`egt_recluster_analysis`: Cluster and re-enrichment enrichment results

Description

Performs hierarchical clustering on enrichment results (ORA or GSEA) based on gene-term associations to reduce redundancy and improve biological interpretation. The function helps identify coherent groups of related terms while preserving important but less significant findings.

Usage

egt_recluster_analysis(
  x,
  ClusterNum = 10,
  P.adj = 0.05,
  force = F,
  nTop = 10,
  method = "ward.D2",
  ...
)

Arguments

x: Enrichment result from EnrichGT or clusterProfiler. For multi-database results, provide a list.
ClusterNum: Number of clusters to create (default: 10).
P.adj: Adjusted p-value cutoff (default: 0.05). Stricter values improve performance.
force: Logical to bypass validation checks (default: FALSE).
nTop: Number of top terms to keep per cluster by p-value (default: 10).
method: Hierarchical clustering method (default: “ward.D2”). One of: “ward.D”, “ward.D2”, “single”, “complete”, “average” (UPGMA), “mcquitty” (WPGMA), “median” (WPGMC), or “centroid” (UPGMC).
...: Additional arguments passed to clustering functions.

Details

Input requirements by analysis type: ORA results: Required columns: “ID”, “Description”, “GeneRatio”, “pvalue”, “p.adjust”, “geneID”, “Count” GSEA results:
Required columns: “ID”, “Description”, “NES”, “pvalue”, “p.adjust”, “core_enrichment” compareClusterResult: Either the compareClusterResult object or a data frame with: - All ORA columns listed above - Additional “Cluster” column Multi-database: Provide as a named list of the above result types

Author

Zhiming Ye

Value

An EnrichGT_obj containing:

enriched_result: Filtered results data frame
gt_object: Formatted gt table object
gene_modules: List of gene modules per cluster
pathway_clusters: Pathway names by cluster
clustering_tree: hclust object for visualization
raw_enriched_result: Unfiltered results table

Examples

# ORA example
res <- egt_recluster_analysis(ora_result, ClusterNum=8)
plot(res@clustering_tree)

# GSEA example 
gsea_res <- egt_recluster_analysis(gsea_result, method="average")
gtsave(gsea_res@gt_object, "results.html")

`genes_with_weights`: Create a ranked gene list for GSEA analysis

Description

Takes gene identifiers and corresponding weights (like log2 fold changes) and returns a ranked vector suitable for Gene Set Enrichment Analysis (GSEA).

Usage

genes_with_weights(genes, weights)

Arguments

genes: Character vector of gene identifiers (e.g., gene symbols or ENTREZ IDs)
weights: Numeric vector of weights for each gene (typically log2 fold changes)

Author

Zhiming Ye

Value

A named numeric vector sorted in descending order by weight, where: - Names are gene identifiers
- Values are the corresponding weights

Examples

# Example using differential expression results  
genes <- c("TP53", "BRCA1", "EGFR")
log2fc <- c(1.5, -2.1, 0.8) 
ranked_genes <- genes_with_weights(genes, log2fc)

`database_...`: Get database for enrichment analysis

Description

Get Gene Ontology (GO), Reactome, and other term-to-gene database, for enrichment analysis

Usage

database_GO_BP(OrgDB = org.Hs.eg.db)

database_GO_CC(OrgDB = org.Hs.eg.db)

database_GO_MF(OrgDB = org.Hs.eg.db)

database_GO_ALL(OrgDB = org.Hs.eg.db)

database_Reactome(OrgDB = org.Hs.eg.db)

database_progeny_human()

database_progeny_mouse()

database_CollecTRI_human()

database_CollecTRI_mouse()

Arguments

OrgDB: The AnnotationDbi database to fetch pathway data and convert gene IDs to gene symbols. For human it would be org.Hs.eg.db, for mouse it would be org.Mm.eg.db. In AnnotationDbi there are many species, please search AnnotationDbi for other species annotation database. GO and Reactome should add this, progeny and collectri do not.

Author

Zhiming Ye. Part of functions were inspired by clusterProfiler but with brand new implement.

Value

a data.frame with ID, terms and genes

`database_KEGG`: Get KEGG database from KEGG website

Description

KEGG is a commercialized database. So EnrichGT can’t pre-cache them locally. You can use this function to fetch KEGG database pathways and modules.

Usage

database_KEGG(kegg_organism="hsa",OrgDB = org.Hs.eg.db,kegg_modules=F,local_cache=F)

database_KEGG_show_organism()

Arguments

kegg_organism: Determine which species data from KEGG will be fetch. For human, it would be hsa(in default); For mouse, it would be mmu. If you wants other species, see database_kegg_show_organism() for details.
OrgDB: The AnnotationDbi database to convert KEGG gene ID to gene symbols. For human it would be org.Hs.eg.db, for mouse it would be org.Mm.eg.db. In AnnotationDbi there are many species, please search AnnotationDbi for other species annotation database.
kegg_modules: If TRUE, returns KEGG module; If FALSE returns KEGG pathways. In default, this is setted to FALSE to get mouse commonly used KEGG pathways.
local_cache: cache a copy in local working folder. It will be saved as a .enrichgt_cache file in working dictionary. The .enrichgt_cache is just a .rds file, feel free to read it using readRDS().

`%-delete->%`: Filter Enrichment Results by Description Pattern

Description

Infix operator to filter enrichment results by matching against Description field. For EnrichGT_obj objects, re-runs clustering analysis after filtering.

Usage

x %-delete->% y

Arguments

x: Either an EnrichGT_obj object or data.frame containing enrichment results
y: Regular expression pattern to match against Description field

Details

This operator helps refine enrichment results by removing terms matching the given pattern from the Description field. When applied to EnrichGT_obj, it preserves all original parameters and re-runs the clustering analysis on the filtered results.

Value

For EnrichGT_obj input: A new EnrichGT_obj with filtered and re-clustered results. For data.frame input: A filtered data.frame.

Examples

# Filter out "ribosome" related terms
filtered_results <- reenrichment_obj %-delete->% "ribosome"

# Filter data.frame directly
filtered_df <- df %-delete->% "metabolism"

DESCRIPTION

convert_annotations_genes: Convert gene annotations from any keys to any keys

Description

Usage

Arguments

Value

database_from_gmt: Parse GMT format gene set files

Description

Usage

Arguments

Author

Value

Examples

egt_compare_groups: 2-Group Comparison of enrichment results and further clustering and visualizing

Description

Usage

Arguments

Details

Author

Value

egt_enrichment_analysis: Perform Over-Representation Analysis (ORA)

Description

Usage

Arguments

Details

Author

Value

Examples

egt_gsea_analysis: Perform Gene Set Enrichment Analysis (GSEA)

Description

Usage

Arguments

Details

Author

Value

Examples

egt_infer_act: Infering Pathway or Transcript Factors activity from EnrichGT meta-gene modules

Description

Usage

Arguments

Author

Value

egt_llm_summary: Summarize EnrichGT results using LLM

Description

Usage

Arguments

Note

References

Seealso

Value

Examples

egt_plot_gsea: Generate GSEA Enrichment Plots

Description

Usage

Arguments

Author

Value

egt_plot_results: Visualize enrichment results using simple plot

Description

Usage

Arguments

Author

Value

egt_recluster_analysis: Cluster and re-enrichment enrichment results

Description

Usage

Arguments

Details

Author

Value

Examples

genes_with_weights: Create a ranked gene list for GSEA analysis

Description

Usage

Arguments

Author

Value

Examples

database_...: Get database for enrichment analysis

Description

`convert_annotations_genes`: Convert gene annotations from any keys to any keys

`database_from_gmt`: Parse GMT format gene set files

`egt_compare_groups`: 2-Group Comparison of enrichment results and further clustering and visualizing

`egt_enrichment_analysis`: Perform Over-Representation Analysis (ORA)

`egt_gsea_analysis`: Perform Gene Set Enrichment Analysis (GSEA)

`egt_infer_act`: Infering Pathway or Transcript Factors activity from EnrichGT meta-gene modules

`egt_llm_summary`: Summarize EnrichGT results using LLM

`egt_plot_gsea`: Generate GSEA Enrichment Plots

`egt_plot_results`: Visualize enrichment results using simple plot

`egt_recluster_analysis`: Cluster and re-enrichment enrichment results

`genes_with_weights`: Create a ranked gene list for GSEA analysis

`database_...`: Get database for enrichment analysis

`database_KEGG`: Get KEGG database from KEGG website

`%-delete->%`: Filter Enrichment Results by Description Pattern