Second enrichment of enriched Results

Enrichment of Enriched Results

Is the enriched result too messy? Clean it up!

EnrichGT generates insightful results by simply constructing a term frequency matrix of genes enriched in pathways and performing clustering. While the results may not be statistically optimal, they offer significant interpretive insights.
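The idea of clustering terms by the genes they hit can be sketched in base R. This is only a minimal illustration on toy data, not EnrichGT's internal implementation: the term names, the genes, the "/" separator, and the choice of a binary distance are all assumptions made for the example.

```r
# Toy enrichment table (hypothetical terms and gene hits, for illustration only)
toy <- data.frame(
  Description = c("T cell activation", "lymphocyte activation",
                  "ATP synthesis", "aerobic respiration"),
  geneID = c("CD3E/CD4/LCK", "CD3E/CD4/ZAP70",
             "MT-ATP6/NDUFB2/UQCR10", "MT-CO2/NDUFB2/UQCR10"),
  stringsAsFactors = FALSE
)

# Split each gene string into a character vector of hit genes
hits <- strsplit(toy$geneID, "/", fixed = TRUE)
names(hits) <- toy$Description

# Build a binary term-by-gene matrix (1 = the gene is a hit for that term)
genes <- sort(unique(unlist(hits)))
tf <- t(vapply(hits, function(g) as.numeric(genes %in% g),
               numeric(length(genes))))
colnames(tf) <- genes

# Hierarchical clustering of terms on their gene profiles
tree <- hclust(dist(tf, method = "binary"), method = "ward.D2")
clusters <- cutree(tree, k = 2)
print(clusters)
```

Terms that share most of their hit genes end up in the same cluster, which is the redundancy that re-clustering is meant to collapse.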

Run ?egt_recluster_analysis for further help. Note that you can adjust ClusterNum (the number of clusters the enrichment results are grouped into) and nTop (how many top items to show in the gt table) to get a better result; the defaults are not necessarily the best for your data.
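To get a feeling for what changing the number of clusters does, here is a small base R sketch. It cuts one hierarchical tree at several values of k, which is loosely analogous to trying different ClusterNum values. The random binary matrix is hypothetical data, and the distance and linkage are assumptions for the example.

```r
set.seed(1)

# Hypothetical binary term-by-gene matrix: 12 terms x 30 genes
tf <- matrix(rbinom(12 * 30, 1, 0.3), nrow = 12,
             dimnames = list(paste0("term_", 1:12), paste0("G", 1:30)))

# Build the tree once, then cut it at different numbers of clusters
tree <- hclust(dist(tf, method = "binary"), method = "ward.D2")

for (k in c(2, 4, 6)) {
  sizes <- table(cutree(tree, k = k))
  cat("k =", k, "-> cluster sizes:", paste(sizes, collapse = ", "), "\n")
}
```

A larger k gives smaller, more specific clusters; a smaller k gives broader ones. Neither is right in general, so it is worth trying a few values.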

Challenges in Biological Gene Enrichment Analysis

Gene enrichment analysis can often be misleading due to the redundancy within gene set databases and the limitations of most enrichment tools. Many tools, by default, only display a few top results and fail to filter out redundancy. This can result in both biological misinterpretation and valuable information being overlooked.

For instance, high expression of certain immune genes can cause many immune-related gene sets to appear overrepresented. However, a closer look often reveals that these gene sets are derived from the same group of genes, which might represent only a small fraction (less than 10%) of the differentially expressed genes (DEGs). What about the other 90%? Do they hold no biological significance?
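You can check this kind of redundancy yourself with a few lines of base R. The sketch below uses a hypothetical DEG list and hypothetical top terms; it measures how much of the DEG list the top terms actually cover and how strongly the terms overlap with each other (Jaccard index).

```r
# Hypothetical DEG list and top enriched terms (toy data)
degs <- paste0("G", 1:50)
top_terms <- list(
  "immune response"      = c("G1", "G2", "G3", "G4"),
  "cytokine signaling"   = c("G1", "G2", "G3"),
  "leukocyte activation" = c("G2", "G3", "G4")
)

# Fraction of all DEGs that the top terms cover together
covered  <- unique(unlist(top_terms))
coverage <- length(intersect(covered, degs)) / length(degs)

# Pairwise Jaccard overlap between the gene sets of the top terms
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
pairs   <- combn(names(top_terms), 2)
overlap <- apply(pairs, 2, function(p) jaccard(top_terms[[p[1]]], top_terms[[p[2]]]))
names(overlap) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")

print(coverage)
print(overlap)
```

In this toy case three "different" terms are driven by the same four genes, which make up only a small part of the DEG list.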

The main purpose of developing this package is to provide a lightweight and practical solution to the problems mentioned above. Specifically, this package can cluster enrichment results based on hit genes or core enrichment from GSEA using term frequency analysis (from the output of the powerful clusterProfiler). This provides a clearer view of biological relevance by focusing on the genes that matter most.

# From results generated before
library(org.Hs.eg.db) # provides the org.Hs.eg.db annotation object
res <- egt_enrichment_analysis(genes = DEGtable$Genes,
                               database = database_GO_BP(org.Hs.eg.db))

re_enrich <- egt_recluster_analysis(
  res,
  ClusterNum = 10,
  P.adj = 0.05,
  force = FALSE,
  nTop = 10,
  method = "ward.D2"
)

The structure of the re_enrich object is shown below. It is an S4 object of class EnrichGT_obj. The first slot (enriched_result) holds the result table (a data.frame), and the second slot (gt_object) holds the gt table.

str(re_enrich,max.level = 2)
Formal class 'EnrichGT_obj' [package "EnrichGT"] with 12 slots
  ..@ enriched_result     : tibble [65 × 7] (S3: tbl_df/tbl/data.frame)
  ..@ gt_object           :List of 17
  .. ..- attr(*, "class")= chr [1:2] "gt_tbl" "list"
  ..@ gt_object_noHTML    :List of 17
  .. ..- attr(*, "class")= chr [1:2] "gt_tbl" "list"
  ..@ gene_modules        :List of 9
  ..@ pathway_clusters    :List of 9
  ..@ document_term_matrix:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ clustering_tree     :List of 7
  .. ..- attr(*, "class")= chr "hclust"
  ..@ raw_enriched_result : tibble [3,003 × 9] (S3: tbl_df/tbl/data.frame)
  .. ..- attr(*, "Package")= chr "EnrichGT"
  .. ..- attr(*, "Input")= chr [1:465] "MT-ND4" "VAMP8" "PLAAT4" "LGALS1" ...
  .. ..- attr(*, "Database")='data.frame':  1284869 obs. of  2 variables:
  .. ..- attr(*, "Other_Params")=List of 5
  .. ..- attr(*, "Time")= POSIXct[1:1], format: "2025-11-23 02:56:33"
  ..@ fused               : logi FALSE
  ..@ param               :List of 6
  ..@ LLM_Annotation      :Formal class 'egt_llm' [package "EnrichGT"] with 3 slots
  ..@ LLM_Comparison      :Formal class 'egt_llm_comparison' [package "EnrichGT"] with 3 slots

Access re-enriched data table

You can simply run View(re_enrich@enriched_result) to inspect the first slot.

re_enrich@enriched_result # Get the re-enrichment result table
# A tibble: 65 × 7
   Description                          ID    Count Cluster   PCT    Padj geneID
   <chr>                                <chr> <int> <chr>   <dbl>   <dbl> <chr> 
 1 synaptic transmission, glutamatergic GO:0…    18 Cluste…   3.9 1   e-6 ATP1A…
 2 regulation of synaptic transmission… GO:0…    15 Cluste…   3.2 1.20e-6 ATP1A…
 3 glutamate receptor signaling pathway GO:0…    12 Cluste…   2.6 1.10e-5 GRIA4…
 4 regulation of neuronal synaptic pla… GO:0…    12 Cluste…   2.6 1.20e-5 APOE,…
 5 ionotropic glutamate receptor signa… GO:0…     8 Cluste…   1.7 4.1 e-5 GRIA4…
 6 ligand-gated ion channel signaling … GO:1…     9 Cluste…   1.9 9.8 e-5 GRIA4…
 7 regulation of postsynaptic membrane… GO:0…    16 Cluste…   3.4 1.6 e-4 GABRD…
 8 regulation of membrane potential     GO:0…    30 Cluste…   6.5 2.5 e-4 AGT, …
 9 locomotory behavior                  GO:0…    19 Cluste…   4.1 2.5 e-4 ALK, …
10 transmission of nerve impulse        GO:0…    11 Cluste…   2.4 4.3 e-4 ATP1A…
# ℹ 55 more rows
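Because the result table is a regular data frame, ordinary R tools work on it. The sketch below uses a small toy table that mimics a few of the columns shown above (the values are hypothetical) to count terms per cluster and pick the most significant term of each cluster.

```r
# Toy table mimicking some columns of the result table (hypothetical values)
tbl <- data.frame(
  Description = c("term A", "term B", "term C", "term D"),
  Cluster     = c("Cluster_1", "Cluster_1", "Cluster_2", "Cluster_2"),
  Padj        = c(1e-6, 1e-4, 2e-3, 5e-5),
  stringsAsFactors = FALSE
)

# Number of terms in each cluster
print(table(tbl$Cluster))

# Most significant term per cluster:
# sort by Padj, then keep the first row seen for each cluster
ordered   <- tbl[order(tbl$Padj), ]
top_terms <- ordered[!duplicated(ordered$Cluster), ]
print(top_terms)
```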

Access re-enriched HTML report

EnrichGT offers more than data frames. Please see HTML reports (gt table) for further visualization.

Glance at the details of each module

Through Word Cloud

The egt_fetch_biological_theme function can do this. Just run:

egt_fetch_biological_theme(re_enrich, 1)

Through Modern IDEs

EnrichGT works well with IDEs such as RStudio and Positron. You can use the egt_summary() function to view the LLM summaries.

For example, to view cluster 1, you can use any of the following:

egt_summary(deepseekAnno, "1")
egt_summary(deepseekAnno, 1)
egt_summary(deepseekAnno, "Cluster_1")

Through R Console

The re-enriched object from EnrichGT supports the $ subset operator, which you can use to glance at the details inside each cluster. In a modern IDE like Positron, type $ and then press the Tab key for auto-completion, as shown in the figure.

If you are still using RStudio (whose auto-completion is poorer than that of the ARK LSP), you can use names() to get the cluster names.

names(re_enrich)
[1] "Cluster_1" "Cluster_2" "Cluster_3" "Cluster_4" "Cluster_5" "Cluster_6"
[7] "Cluster_7" "Cluster_8" "Cluster_9"

In this example, none of the results have any extra annotation. However, EnrichGT supports large language model (LLM) based annotation of enrichment results; refer to the large language models integration of EnrichGT page for more details. After performing LLM annotation, this step will display more information and insights about the enrichment results. If typing the full names feels tedious, you can use c1 or C1, or even "1", to access a cluster.
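To make the shorthand rule concrete, here is a small sketch of how such labels could be mapped to a full cluster name. The helper normalize_cluster_name() is hypothetical and written only for this illustration; it is not a function exported by EnrichGT and may differ from how the package resolves names internally.

```r
# Hypothetical helper: map "1", 1, "c1", "C1" or "Cluster_1" to "Cluster_1"
normalize_cluster_name <- function(x) {
  x <- as.character(x)
  # Strip an optional "Cluster_" or "c" prefix (case-insensitive), keep the number
  id <- sub("^(cluster_|c)?([0-9]+)$", "\\2", x, ignore.case = TRUE)
  if (!grepl("^[0-9]+$", id)) {
    stop("Unrecognized cluster label: ", x)
  }
  paste0("Cluster_", id)
}

labels <- list("1", 1, "c1", "C1", "Cluster_1")
resolved <- vapply(labels, normalize_cluster_name, character(1))
print(resolved)
```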

For example:

re_enrich$Cluster_1
── Enrichment Result of Cluster_1 (Local Summary) ──────────────────────────────
• This cluster contains synaptic transmission, glutamatergic, regulation of
synaptic transmission, glutamatergic, glutamate receptor signaling pathway,
regulation of neuronal synaptic plasticity, ionotropic glutamate receptor
signaling pathway ...
• Candidate genes includes ACHE, ADGRF1, AGT, ALK, ANKFN1, ANKH, APLP1, APOE,
ASPM, ATP1A2, ATP8B3, AVPR1A, BCAN, C4BPB, CA2, CACNG5, CAMK1D, CAMK2B, CDC20,
CELSR3, CIART, CNTN2, CNTNAP2, CPNE4, CPNE7, CTNNA2, DAAM2, DLX5, DNAJB1, DNER,
DPP4, DPYSL5, DSCAM, E2F1, EMX2, EPHA10, FOLR2, FOXD1, FSTL4, GABRD, GAD1,
GAP43, GFAP, GIPR, GPR37, GRIA4, GRID2, GRIK2, GRIK3, GRIN1, GRIN2A, GRIN2B,
GRIN2D, GRM1, GRM5, GRM8, HCN1, HCN2, HMOX1, HRAS, HTR3A, IGSF9, IL1RAPL2,
IMPA2, INA, KCNC3, KCNE4, KCNK2, KIAA1755, KIF1A, KIF5C, LHFPL4, LOXL2, LRFN2,
MAG, MAP1B, MAPK8IP2, MEGF10, MSX1, MT-CYB, MYO3B, NKX6-1, NLGN1, NMU, NQO1,
NRCAM, NRXN1, NTRK2, NUDT1, OPCML, PPP1R1B, PTPRH, PYCR1, RAC3, RASD2, RIMS3,
RNF207, RYR1, SCN8A, SDC1, SDC2, SEMA5B, SEZ6L2, SFRP2, SHISA9, SIX1, SLC30A10,
SLC30A3, SLC6A1, SLC6A3, SPOCK1, SRD5A1, STRA6, SYNDIG1, TBX18, TF, TFAP2A,
TREM2, TRPM2, TRPV4, TWIST1, UCHL1, UNC13A, UNC13C, VGF, WDR62, WNT4, WNT5A,
ZAN, ZIC1 (We will print all genes. Please stroll to top to read)
[1] "cli-33595-15"
re_enrich$c3
── Enrichment Result of Cluster_3 (Local Summary) ──────────────────────────────
• This cluster contains regulation of neuron differentiation, positive
regulation of nervous system development, regulation of nervous system
development, regulation of developmental growth, negative regulation of neuron
differentiation ...
• Candidate genes includes AGR2, ALK, APOE, ASPM, CAMK2B, CNTN2, DAAM2, DPYSL5,
DSCAM, FOXC2, FOXS1, FSTL4, GFAP, GPAM, GRID2, GRM5, HEY2, IL34, JAG1, KCNK2,
MAG, MAP1B, NKX6-1, NLGN1, NRCAM, NRXN1, NTRK2, RAC3, SFRP2, SIX1, SLC6A3,
SOX8, SYNDIG1, TP73, TREM2, UNC13A, WDR62, WNT5A, ZNF536 (We will print all
genes. Please stroll to top to read)
[1] "cli-33595-20"
re_enrich$"5"
── Enrichment Result of Cluster_5 (Local Summary) ──────────────────────────────
• This cluster contains ATP synthesis coupled electron transport, mitochondrial
ATP synthesis coupled electron transport, aerobic electron transport chain,
aerobic respiration, respiratory electron transport chain ...
• Candidate genes includes ADGRF1, AK5, ALDH1L1, ANGPTL4, ATP1A2, ATP6V1C2,
AVPR1A, CA9, CCNB1, CD24, COL1A1, CRYAB, DPP4, ENO1, ENPP1, GIPR, HSPA1A, IDH1,
KCNK2, LOXL2, MB, MT-ATP6, MT-CO2, MT-CO3, MT-CYB, MT-ND1, MT-ND2, MT-ND3,
MT-ND4, NDUFB2, NDUFS6, OGDHL, PGF, RYR1, SLC15A1, TREM2, TRPV4, TWIST1, UCHL1,
UQCR10, VGF (We will print all genes. Please stroll to top to read)
[1] "cli-33595-25"

Furthermore, you can use @ to access the slots of the S4 object. For example, re_enrich@gene_modules returns the genes in each cluster.

How to get objects inside the S4 object?

You can use @. For example, x <- re_enrich@enriched_result returns the result table and x <- re_enrich@gt_object returns the gt object.
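If you are new to S4, the following self-contained sketch shows the general pattern with a toy class. ToyEnrichObj and its contents are hypothetical and only borrow two slot names from the structure shown earlier; the same @, slot() and slotNames() calls work on any S4 object.

```r
library(methods)

# Hypothetical toy class with two slots (not the real EnrichGT_obj definition)
setClass("ToyEnrichObj",
         representation(enriched_result = "data.frame",
                        gene_modules    = "list"))

obj <- new("ToyEnrichObj",
           enriched_result = data.frame(Description = "term A", Padj = 0.01),
           gene_modules    = list(Cluster_1 = c("GRIA4", "GRIN1")))

slotNames(obj)                 # list all slot names
obj@gene_modules$Cluster_1     # access a slot with @
slot(obj, "enriched_result")   # equivalent programmatic access
```

slotNames() is handy when you forget which slots exist, and slot() is useful when the slot name is stored in a variable.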

Mask unnecessary results

In transcriptomic sequencing, we often encounter a particular scenario. For example, T cell-related pathways can still be enriched in certain tumors from NOD-SCID mice, which are thymus-deficient. However, such immune infiltration patterns should not be predominant in thymus-deficient mice. Why does this happen? It’s likely due to shared immune response processes—such as interleukin and cytokine-related biological events—that are common across various immune cell types. What you’re seeing might be the surface reflection of a complex and chaotic underlying process.

Clearly, neither ORA nor GSEA is capable of correcting for such biases, let alone accounting for rare cell types or unique tissue microenvironments. In such cases, filtering out certain pathways isn’t falsification; it’s a way to minimize potential misunderstandings for readers or collaborators.

That said, I must emphasize: although this is a useful feature, please don’t use it to blindly dismiss pathways—existence implies relevance. It’s an art of trade-off. Nevertheless, every pathway with an FDR less than 0.05 deserves careful consideration.

In EnrichGT, you can simply use the %-delete->% operator to achieve this:

# Filter out "ribosome" related terms in re-enriched object
filtered_results <- reenrichment_obj %-delete->% "ribosome"

# Filter data.frame directly from ORA/GSEA result is also OK
filtered_df <- df %-delete->% "metabolism"

It uses regular expressions to match the terms to remove. Regular expressions support many advanced patterns; search online (or see ?regex in R) for more details.
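If regular expressions are new to you, the base R sketch below shows the kind of patterns that are useful here. The drop_terms() helper and the toy table are hypothetical, and case-insensitive matching is an assumption of this sketch; it illustrates regex filtering in general, not how %-delete->% is implemented.

```r
# Toy result table (hypothetical terms and values)
df <- data.frame(
  Description = c("cytoplasmic translation", "ribosome biogenesis",
                  "Ribosomal large subunit assembly", "lipid metabolism"),
  Padj = c(0.001, 0.002, 0.01, 0.03),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows whose Description matches a regular expression
drop_terms <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$Description, ignore.case = TRUE)
  tbl[!hit, , drop = FALSE]
}

drop_terms(df, "ribosom")             # stem matches "ribosome" and "Ribosomal"
drop_terms(df, "ribosom|metabolism")  # "|" removes several themes at once
drop_terms(df, "^lipid")              # "^" matches only at the start of the term
```

Using a word stem such as "ribosom" instead of the full word is a simple way to catch spelling variants of the same theme.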

Manually visualize the relationships between specific terms within a database

egt_fetch_termwise_relationship is designed for this. It directly calls the internal clustering function and shows you the result. The purpose of this function is to encourage you to explore more freely.

library(org.Hs.eg.db)
egt_fetch_termwise_relationship(ora_result$Description[1:100],database = database_GO_BP(org.Hs.eg.db),ClusterNum = 5)
✔ success loaded database, time used : 10.1783771514893 sec.
Joining with `by = join_by(label)`

Inferring TF or pathway activity and more based on meta-gene modules

The S4 object returned from re-enrichment contains a gene_modules slot and a pathway_clusters slot. In the gene_modules slot you can find the group of meta-genes that take part in each specific pathway cluster (listed in the pathway_clusters slot).
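Conceptually, a meta-gene module is the set of genes contributed by all terms in one cluster. The base R sketch below rebuilds such modules from a toy table. The table values and the ", " separator in geneID are assumptions for the example, and the real gene_modules slot may be built differently.

```r
# Toy clustered table (hypothetical terms, clusters and gene hits)
tbl <- data.frame(
  Description = c("term A", "term B", "term C"),
  Cluster     = c("Cluster_1", "Cluster_1", "Cluster_2"),
  geneID      = c("GRIA4, GRIN1", "GRIN1, APOE", "MT-CO2, NDUFB2"),
  stringsAsFactors = FALSE
)

# Split each gene string into a character vector
gene_lists <- strsplit(tbl$geneID, ",\\s*")

# For each cluster, pool the genes of all its terms and remove duplicates
rows_by_cluster <- split(seq_len(nrow(tbl)), tbl$Cluster)
modules <- lapply(rows_by_cluster,
                  function(i) sort(unique(unlist(gene_lists[i]))))
print(modules)
```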

EnrichGT supports inferring pathway or transcription factor (TF) activity from re-enriched meta-gene modules. This is accomplished with two excellent databases:

  • PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction.

  • CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources. This collection provides increased coverage of transcription factors and superior performance in identifying perturbed TFs compared to previous resources.

Now let’s see this example:

TF_Act <- egt_infer_act(re_enrich,DB = "collectri", species = "human")
! If when doing re-enrichment, you select a high number of clusters, that may cause low gene number in each meta-gene module, and then can't be infered sucessfully. So if result is empty, please increase the cut off of pvalue.  
✔ success loaded self-contained database
✔ Done ORA in 0.0166840553283691 sec.
✔ Done ORA in 0.0125110149383545 sec.
✔ Done ORA in 0.0122601985931396 sec.
✔ Done ORA in 0.016308069229126 sec.
✔ Done ORA in 0.0119631290435791 sec.
✔ Done ORA in 0.013624906539917 sec.
✔ Done ORA in 0.0245451927185059 sec.
✔ Done ORA in 0.0154321193695068 sec.
✔ Done ORA in 0.0112709999084473 sec.
egt_plot_results(TF_Act$Cluster_1,P.adj = 0.1)
! You are drawing origin results, for better result you can re-cluster it by egt_recluster_analysis()

Want to interpret the regulators of the whole input gene list?

PROGENy and CollecTRI can be used just like any other database in ORA or GSEA enrichment, for example database_GO_BP(). See the [Progeny Database] and [CollecTRI Database] pages for details.

Example:

TFActivity <- egt_enrichment_analysis(genes = DEGtable$Genes,
database = database_CollecTRI_human())

Why are many inferred results empty?

If you perform re-enrichment with a high number of clusters, each meta-gene module may contain too few genes (splitting into too many clusters leaves too few genes per cluster to enrich), so the inference can’t succeed. If the result is empty, please reduce the number of clusters (ClusterNum) or relax the p-value cutoff when re-clustering.
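The effect of module size can be seen with a plain hypergeometric over-representation test. This is a sketch with hypothetical numbers (a universe of 20000 genes and a TF target set of 200 genes); ora_p() is a helper written for this example and is not part of EnrichGT. Both modules have the same 10% overlap with the target set, only their sizes differ.

```r
# One-sided over-representation p-value: P(X >= overlap) under the
# hypergeometric distribution. All numbers below are hypothetical.
ora_p <- function(overlap, module_size, set_size, universe) {
  phyper(overlap - 1, set_size, universe - set_size, module_size,
         lower.tail = FALSE)
}

universe <- 20000   # background genes
set_size <- 200     # targets of one TF

# Large module: 100 genes, 10 of them are targets
p_large <- ora_p(overlap = 10, module_size = 100,
                 set_size = set_size, universe = universe)

# Small module: 10 genes, 1 of them is a target
p_small <- ora_p(overlap = 1, module_size = 10,
                 set_size = set_size, universe = universe)

print(c(large_module = p_large, small_module = p_small))
```

With the same proportion of target genes, the large module is clearly significant while the small one does not pass a 0.05 threshold, which is why very small modules often return empty results.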
