DESCRIPTION

Type: Package
Package: EnrichGT
Title: EnrichGT - all in one enrichment analysis soluction
Version: 0.8.4
Author: Zhiming Ye
Maintainer: Zhiming Ye <garnetcrow@hotmail.com>
Description: Do biological enrichment analysis and parsing and clustering
    enrichment result to insightful results in just ONE package
License: GPL-V3
URL: https://zhimingye.github.io/EnrichGT/
Depends: 
    R (>= 2.10)
Imports: 
    AnnotationDbi,
    BiocManager,
    cli,
    dplyr,
    fgsea,
    fontawesome,
    forcats,
    ggplot2,
    ggrepel,
    glue,
    GO.db,
    grDevices,
    gt,
    Matrix,
    methods,
    parallel,
    proxy,
    qvalue,
    RColorBrewer,
    Rcpp,
    reactome.db,
    rlang,
    scales,
    stats,
    stringr,
    text2vec,
    tibble,
    umap,
    utils,
    xfun
LinkingTo: 
    Rcpp
Config/testthat/edition: 3
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2

convert_annotations_genes: Convert gene annotations from any keys to any keys

Description

Convert gene annotations from any keys to any keys

Usage

convert_annotations_genes(genes, from_what, to_what, OrgDB)

Arguments

  • genes: gene vector
  • from_what: input type (like “SYMBOL”,“ENTREZID”,“ENSEMBL”,“GENENAME”,…), keys should be supported by AnnotationDbi. Search for the help page of AnnotationDbi for further help.
  • to_what: output type (like “SYMBOL”,“ENTREZID”,“ENSEMBL”,“GENENAME”,…), keys should be supported by AnnotationDbi. Search for the help page of AnnotationDbi for further help. Can be multiple items E.g. c("ENTREZID","ENSEMBL","GENENAME")
  • OrgDB: human = org.Hs.eg.db, mouse = org.Mm.eg.db, search BioConductor website for further help

Value

a data.frame

database_from_gmt: parse GMT files file to a data.frame

Description

Read .gmt files. You can download them from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp WikiPathway database also provides pre-built GMT files (https://data.wikipathways.org/current/gmt/). In default they are recorded as ENTREZ IDs, so you need to provide proper species database (e.g. org.Hs.eg.db for human), to database_from_gmt function and EnrichGT will automatically convert ENTREZ IDs to gene symbols for enrichment analysis.

Usage

database_from_gmt(gmtfile, OrgDB = NULL, convert_2_symbols = T)

Arguments

  • gmtfile: gmt file path
  • OrgDB: Only need when converting genes, human = org.Hs.eg.db, mouse = org.Mm.eg.db, search BioConductor website for further help. Default is NULL.
  • convert_2_symbols: Force to convert numeric gene ids (as ENTREZIDs) to gene symbols

Author

cited from https://github.com/YuLab-SMU/gson/blob/main/R/GMT.R . The further Cache system is written by Zhiming Ye.

Value

data.frame

egt_compare_groups: 2-Group Comparison of enrichment results and further clustering and visualizing

Description

See ?EnrichGT()

Usage

egt_compare_groups(
  obj.test,
  obj.ctrl,
  name.test = NULL,
  name.ctrl = NULL,
  ClusterNum = 15,
  P.adj = 0.05,
  force = F,
  nTop = 10,
  method = "ward.D2",
  ...
)

Arguments

  • obj.test: the enriched object from tested group. WARNING: obj.test and obj.ctrl should come from same database (e.g. GO Biological Process(GOBP)).
  • obj.ctrl: the enriched object from control group. WARNING: obj.test and obj.ctrl should come from same database (e.g. GO Biological Process(GOBP)).
  • name.test: optional, the name of the testing group. If is NULL, the object name of obj.test will be used.
  • name.ctrl: optional, the name of the control group. If is NULL, the object name of obj.ctrl will be used.
  • ClusterNum: how many cluster will be clustered
  • P.adj: p.adjust cut-off. To avoid slow visualization, you can make stricter p-cut off.
  • force: ignore all auto-self-checks, which is useful
  • nTop: keep n top items according to p-adj in each cluster.
  • method: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).
  • ...: Others options.

Details

Execute obj.test VS obj.ctrl tests, showing pathway overlaps (or differences) and meta-gene modules of test group and control group. Supports ORA and GSEA results (enriched object or data.frame). !WARNING!: obj.test and obj.ctrl should come from same database (e.g. GO Biological Process(GOBP)).

Author

Zhiming Ye

Value

List containing multiple EnrichGT_obj objects. The List contains objects with overlapped enriched terms, unique enrich terms.

egt_enrichment_analysis: A C++ accelerated universal enrichment analyzer (Over-Representation Analysis (ORA))

Description

ORA is a statistical method used to identify biological pathways or gene sets that are significantly enriched in a given list of genes (e.g., differentially expressed genes). The method compares the proportion of genes in the target list that belong to a specific category (e.g., pathways, GO terms) to the expected proportion in the background gene set. To accelerate the computation in ORA analysis, EnrichGT have implemented a function that leverages C++ for high-performance computation. The core algorithm utilizes hash tables for efficient lookup and counting of genes across categories. Also It provides multi-Core parallel calculation by package parallel.

Usage

res <- egt_enrichment_analysis(genes = DEGtable$Genes,
database = database_GO_BP(org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",...),
database = database_GO_ALL(org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = genes_with_weights(geneSymbols, log2FC),
database = database_kegg(kegg_organism="hsa",OrgDB = org.Hs.eg.db))

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",...),
database = database_from_gmt("MsigDB_Hallmark.gmt"))

res <- egt_enrichment_analysis(list(Macrophages=c("CD169","CD68","CD163"),
Fibroblast=c("COL1A2","COL1A3"),...),
 database = database_from_gmt("panglaoDB.gmt"))

Arguments

  • genes: a vector of gene ids like c("TP53","CD169","CD68","CD163"...).If you have genes from multiple source or experiment group, you can also pass a list with gene ids in it. For Example , list(Macrophages=c("CD169","CD68","CD163"),Fibroblast=c("COL1A2","COL1A3)).The genes should be match in the second param database’s gene column. For example, if database provides Ensembl IDs, you should input Ensembl IDs. But in default databases provided by EnrichGT is gene symbols.Of note, since ver 0.8, genes argument supports inputs from genes_with_weights(), EnrichGT will use the whole DEG for ORA, and final split gene candidated into high-expressing and lowly-expressing according to weights.
  • database: a database data frame, can contain 3 columns (ID, Pathway_Name, Genes) or just 2 columns (Pathway_Name, Genes). You can read a data frame and pass it through this or run database_GO_CC() to get them, see example.You can run database_GO_CC() to see an example.The ID column is not necessary. EnrichGT contains several databases, functions about databases are named starts with database_..., like database_GO_BP() or database_Reactome().The default gene in each database EnrichGT provided to input is GENE SYMBOL (like TP53, not 1256 or ENSG…), not ENTREZ ID or Ensembl ID.It will be more convince for new users. Avaliable databases includes database_GO_BP(), database_GO_CC(), database_GO_MF() and database_Reactome().You can add more database by downloading MsigDB (https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp)’s GMT files. It can be load by using database_from_gmt(FILE_PATH).If you only have simple a table, you can also pass a data frame through this arguement. Of note, it should contains at least 2 coloumn (colnames(df) = c(“Terms”,“Genes)), the first is term names and the second are the corresponding genes. If you have term ids, you can add a ID column at the first column, and Terms becomes the second column and Genes the third.
  • p_adj_methods: one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”
  • p_val_cut_off: adjusted pvalue cutoff on enrichment tests to report
  • background_genes: background genes. If missing, the all genes listed in the database
  • min_geneset_size: minimal size of genes annotated for testing
  • max_geneset_size: maximal size of genes annotated for testing
  • multi_cores: multi_cores (Experimental), only used when analysis a list of genes (multi-source or groups). Set to 0 or 1 to disable. May use lots of RAM.

Author

Zhiming Ye

Value

a data frame with ORA results.

egt_gsea_analysis: Gene Set Enrichment Analysis (GSEA) by EnrichGT

Description

A warpper of fgsea::fgsea(). GSEA is a computational method used to determine whether predefined gene sets (e.g., pathways, GO terms) are statistically enriched in a ranked list of genes. Unlike ORA, GSEA considers the entire gene list and focuses on the cumulative distribution of gene ranks to identify coordinated changes. The fgsea (https://github.com/ctlab/fgsea) package is an R tool that implements an accelerated version of GSEA. It uses precomputed statistical methods and efficient algorithms to dramatically speed up enrichment analysis, especially for large datasets.

Usage

res <- egt_gsea_analysis(genes = genes_with_weights(genes = DEG$genes, weights = DEG$log2FoldChange),
database = database_GO_BP())

res <- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,weights =PCA_res$PC1_loading),
database = database_from_gmt("MsigDB_Hallmark.gmt"))

Arguments

  • genes: a named numeric vector, for example c(TP53=1.2,KRT15=1.1,IL1B=1.0,PMP22 = 0.5,FABP1 = -0.9, GLUT1 = -2).The number is the weight of each gene, can use the logFC form DEG analysis results instead. Also NMF or PCA’s loading can also be used.EnrichGT provides a genes_with_weights(genes,weights) function to build this numeric vector. Importantly, this vector should be !SORTED! for larger to smaller.
  • database: a database data frame, can contain 3 columns (ID, Pathway_Name, Genes) or just 2 columns (Pathway_Name, Genes). You can read a data frame and pass it through this or run database_GO_CC() to get them, see example.The ID column is not necessary. EnrichGT contains several databases, functions about databases are named starts with database_..., like database_GO_BP() or database_Reactome().The default gene in each database EnrichGT provided to input is GENE SYMBOL (like TP53, not 1256 or ENSG…), not ENTREZ ID or Ensembl ID. It will be more convince for new users.Avaliable databases includes database_GO_BP(), database_GO_CC(), database_GO_MF() and database_Reactome().You can add more database by downloading MsigDB (https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp)’s GMT files. It can be load by using database_from_gmt(FILE_PATH).If you only have simple a table, you can also pass a data frame through this arguement. Of note, it should contains at least 2 coloumn (colnames(df) = c(“Terms”,“Genes)), the first is term names and the second are the corresponding genes. If you have term ids, you can add a ID column at the first column, and Terms becomes the second column and Genes the third.
  • p_val_cut_off: adjusted pvalue cutoff on enrichment tests to report
  • min_geneset_size: minimal size of genes annotated for testing
  • max_geneset_size: maximal size of genes annotated for testing
  • gseaParam: other param passing to fgsea
  • p_adj_methods: one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”

Author

warpped from fgsea package.

Value

a data frame

egt_infer_act: Infering Pathway or Transcript Factors activity from EnrichGT meta-gene modules

Description

Only supports gene symbols. so you must use enrichedObj |> setReadable(OrgDb = xxx,keyType = “ENTREZID”) |> EnrichGT() . Do Not supports ENTREZIDs! PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction. CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to our previous. If when doing re-enrichment, you select a high number of clusters, that may cause low gene number in each meta-gene module, and then can’t be infered sucessfully. So if result is empty, please increase the number of re-clustering when doing it.

Usage

egt_infer_act(x, DB = "collectri", species = "human")

Arguments

  • x: an EnrichGT_obj object.
  • DB: can be “progeny” (the Pathway activity database), or “collectri” (TF activity database)
  • species: can be “human” or “mouse”

Author

Zhiming Ye, Saez-Rodriguez Lab (The decoupleR package, https://saezlab.github.io/decoupleR/)

Value

a compareCluster result from clusterProfiler

egt_plot_results: Visualize enrichment results using simple plot

Description

This plot is the most widely like enrichplot::dotplot()used method to visualize enriched terms. It shows the enrichment scores (e.g. p values) and gene ratio or NES as dot size and color / or bar height. Users can specify the number of terms using ntop or selected terms to color via the low.col and hi.col.

Usage

egt_plot_results(
  x,
  ntop = NULL,
  showIDs = F,
  max_len_descript = 40,
  ...,
  P.adj = NULL
)

Arguments

  • x: a data frame form enriched result like egt_enrichment_analysis() or egt_gsea_analysis(), or an re-clustered EnrichGT object
  • ntop: Show top N in each cluster. In default, for origin enriched result, showing top 15; for re-clustered object, showing top 5 in each cluster.
  • showIDs: bool, show pathway IDs or not. Default is FALSE
  • max_len_descript: the label format length, default as 40.
  • ...: Other param
  • P.adj: If pass an origin data.frame from original enriched result, you can specify the P-adjust value cut off. If is null, default is 0.05. When passing EnrichGT_obj, this filter is previously done by egt_recluster_analysis.
  • low.col: the color for the lowest
  • hi.col: the color for the highest

Author

Zhiming Ye

Value

a ggplot2 object

egt_plot_umap: Visualize results generated form EnrichGT() using UMAP

Description

A word frequency matrix represents the frequency of words or tokens across different documents or text samples. Each row corresponds to a document, and each column represents a word or token, with the cell values indicating the frequency of the respective word in that document.However, high-dimensional data like word frequency matrices can be challenging to interpret directly. To make such data easier to analyze, we can reduce its dimensionality and visualize the patterns or clusters in a 2D or 3D space. UMAP (Uniform Manifold Approximation and Projection) is a powerful, non-linear dimensionality reduction technique widely used for this purpose.

Usage

egt_plot_umap(x, ...)

Arguments

  • x: an EnrichGT object
  • ...: Other param

Author

Zhiming Ye

Value

a ggplot2 object

egt_recluster_analysis: Parse enrichment results and further clustering and visualizing

Description

Cluster enrichment results based on hit genes for ORA (e.g, typical GO enrichment) or core enrichment from GSEA using term frequency analysis. This provides a clearer view of biological relevance by focusing on the genes that matter most. Gene enrichment analysis can often be misleading due to the redundancy within gene set databases and the limitations of most enrichment tools. Many tools, by default, only display a few top results and fail to filter out redundancy. This can result in both biological misinterpretation and valuable information being overlooked. For instance, high expression of certain immune genes can cause many immune-related gene sets to appear overrepresented. However, a closer look often reveals that these gene sets are derived from the same group of genes, which might represent only a small fraction. Less than 1/10 of the differentially expressed genes (DEGs). What about the other 9/10? Do they hold no biological significance? The main purpose of developing this package is to provide a lightweight and practical solution to the problems mentioned above.

Usage

egt_recluster_analysis(
  x,
  ClusterNum = 17,
  P.adj = 0.05,
  force = F,
  nTop = 10,
  method = "ward.D2",
  ...
)

Arguments

  • x: an enrichment result from clusterProfiler, or a data.frame containing result from clusterProfier. To perform fusing multi-database enrichment results, please give a list object.
  • ClusterNum: how many cluster will be clustered
  • P.adj: p.adjust cut-off. To avoid slow visualization, you can make stricter p-cut off.
  • force: ignore all auto-self-checks, which is useful
  • nTop: keep n top items according to p-adj in each cluster.
  • method: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).
  • ...: Others options.

Details

For an ORA result, c(“ID”,“Description”,“GeneRatio”,“pvalue”,“p.adjust”,“geneID”,“Count”) should be contained; For GSEA, c(“ID”,“Description”,“NES”,“pvalue”,“p.adjust”,“core_enrichment”) should be contain. For compareClusterResult, a compareClusterResult object or a data-frame with additional Cluster column should be contained, others similar to ORA result. To perform fusing multi-database enrichment results, please give a list object.

Author

Zhiming Ye

Value

an EnrichGT_obj object. slot enriched_result contains a data.frame with enriched results. gt_object contains gt object. you can use obj@gt_object to get it and use functions from gt like gtsave. gene_modules is a list containing meta-gene modules of each cluster. pathway_clusters contains pathways names in each cluster. clustering_tree contains the clustering tree object from hclust(), you can use other packages like ggtree for further visualization and analysis. raw_enriched_result contains raw table without selecting nTop.

genes_with_weights: Return ranked gene list which is use for “GSEA” or other places

Description

Return ranked gene list which is use for “GSEA” or other places

Usage

genes_with_weights(genes, weights)

Arguments

  • genes: A vector containing genes
  • weights: A vector contain weight of genes, typically like log2FC from DEG analysis

Author

Zhiming Ye

Value

A ranked named numeric vector. Names of the numbers is the ENTREZID.

database_GO_BP: Get database for enrichment analysis

Description

Get Gene Ontology (GO), Reactome, and other term-to-gene database, for enrichment analysis

Usage

database_GO_BP(OrgDB = org.Hs.eg.db)

database_GO_CC(OrgDB = org.Hs.eg.db)

database_GO_MF(OrgDB = org.Hs.eg.db)

database_GO_ALL(OrgDB = org.Hs.eg.db)

database_Reactome(OrgDB = org.Hs.eg.db)

database_progeny_human()

database_progeny_mouse()

database_CollecTRI_human()

database_CollecTRI_mouse()

Arguments

  • OrgDB: The AnnotationDbi database to fetch pathway data and convert gene IDs to gene symbols. For human it would be org.Hs.eg.db, for mouse it would be org.Mm.eg.db. In AnnotationDbi there are many species, please search AnnotationDbi for other species annotation database. GO and Reactome should add this, progeny and collectri do not.

Author

Zhiming Ye. Part of functions were inspired by clusterProfiler but with brand new implement.

Value

a data.frame with ID, terms and genes

database_kegg: Get KEGG database from KEGG website

Description

KEGG is a commercialized database. So EnrichGT can’t pre-cache them locally. You can use this function to fetch KEGG database pathways and modules.

Usage

database_kegg(kegg_organism="hsa",OrgDB = org.Hs.eg.db,kegg_modules=F,local_cache=F)

database_kegg_show_organism()

Arguments

  • kegg_organism: Determine which species data from KEGG will be fetch. For human, it would be hsa(in default); For mouse, it would be mmu. If you wants other species, see database_kegg_show_organism() for details.
  • OrgDB: The AnnotationDbi database to convert KEGG gene ID to gene symbols. For human it would be org.Hs.eg.db, for mouse it would be org.Mm.eg.db. In AnnotationDbi there are many species, please search AnnotationDbi for other species annotation database.
  • kegg_modules: If TRUE, returns KEGG module; If FALSE returns KEGG pathways. In default, this is setted to FALSE to get mouse commonly used KEGG pathways.
  • local_cache: cache a copy in local working folder. It will be saved as a .enrichgt_cache file in working dictionary. The .enrichgt_cache is just a .rds file, feel free to read it using readRDS().
Back to top