Select your databases for enrichment or annotation

Database Helpers

How to specify species?

EnrichGT uses AnnotationDbi to fetch most databases and gene annotations. Use org.Hs.eg.db for human and org.Mm.eg.db for mouse. For other species, search online or refer to Bioconductor.

Databases that do not come from AnnotationDbi do not need this argument. For example, database_CollecTRI_human() returns a human-only database.
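As a minimal sketch of the species choice described above, the helper below maps a species label to the OrgDb package named in this section and checks whether it is installed. The function name orgdb_package_for is hypothetical and not part of EnrichGT; only the two package names come from the text.

```r
# Hypothetical helper (not part of EnrichGT): map a species label to the
# OrgDb package name mentioned in this section.
orgdb_package_for <- function(species) {
  known <- c(human = "org.Hs.eg.db", mouse = "org.Mm.eg.db")
  key <- tolower(species)
  if (!key %in% names(known)) {
    stop("No OrgDb package recorded for species: ", species)
  }
  unname(known[key])
}

pkg <- orgdb_package_for("Human")

# requireNamespace() returns FALSE instead of raising an error when the
# package is missing, so it is safe to use as a check.
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is installed and can be passed as OrgDB")
} else {
  message(pkg, " is not installed; it is distributed through Bioconductor")
}
```

For species outside this small table, the lookup stops with an error rather than guessing a package name.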

Built-in or AnnotationDbi databases

Pass the OrgDB argument when fetching them.

Example:

database_GO_BP(OrgDB = org.Hs.eg.db)

GO Database

BP stands for biological process, CC for cellular component, and MF for molecular function. ALL combines the three sub-databases.

database_GO_BP(), database_GO_CC(), database_GO_MF(), database_GO_ALL()

Reactome Database

Reactome is an open source pathway database.

database_Reactome()

KEGG Database

KEGG is a commercial database, so EnrichGT cannot bundle a pre-cached copy. Use database_KEGG to fetch KEGG pathways and modules.

This function takes two species-related arguments. OrgDB works as before: it is used to convert ENTREZ IDs to gene symbols. The other argument, kegg_organism, determines which species' data is fetched from KEGG. For human it is hsa (the default); for mouse it is mmu. For other species, run database_KEGG_show_organism() for details.

Use the kegg_modules argument to switch between KEGG pathways and KEGG modules. If TRUE, KEGG modules are returned; if FALSE, KEGG pathways are returned. The default is FALSE, which returns the more commonly used KEGG pathways.

If you set local_cache = TRUE, EnrichGT caches a copy in the working directory as a .enrichgt_cache file. The .enrichgt_cache file is just an .rds file, so you can read it with readRDS().

keggdf <- database_KEGG(kegg_organism = "hsa", OrgDB = org.Hs.eg.db, kegg_modules = FALSE, local_cache = FALSE)
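Because the cache is described as an ordinary .rds file, the round trip can be illustrated with base R alone. This is a sketch under assumptions: toy_db is made-up data rather than real KEGG content, and a temporary path stands in for the cache file, whose exact name is chosen by EnrichGT.

```r
# Toy stand-in for a fetched database (made-up data, not real KEGG content).
toy_db <- data.frame(
  Term  = c("Pathway A", "Pathway A", "Pathway B"),
  Genes = c("Gene1", "Gene2", "Gene3"),
  stringsAsFactors = FALSE
)

# A temporary path is used here; the real cache file name is chosen by EnrichGT.
cache_path <- tempfile(fileext = ".enrichgt_cache")
saveRDS(toy_db, cache_path)

# Reading the file back restores the same object, whatever the file extension.
restored <- readRDS(cache_path)
str(restored)
```

The file extension does not matter to readRDS(), which is why a .enrichgt_cache file can be opened directly.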

WikiPathways Database

WikiPathways provides pre-built GMT files (https://data.wikipathways.org/current/gmt/). By default the genes are recorded as ENTREZ IDs, so pass the proper species database (e.g. org.Hs.eg.db for human) to the database_from_gmt function, and EnrichGT will automatically convert the ENTREZ IDs to gene symbols for enrichment analysis.

download.file("https://data.wikipathways.org/current/gmt/wikipathways-20241210-gmt-Homo_sapiens.gmt", destfile = "WikiPWS_human.gmt")
WikiPWsDB <- database_from_gmt("WikiPWS_human.gmt", OrgDB = org.Hs.eg.db)
res <- egt_enrichment_analysis(genes = DEGtable$Genes,
                               database = WikiPWsDB)

Progeny Database

For pathway activity inference, use database_progeny_human() and database_progeny_mouse().

CollecTRI Database

For transcription factor inference, use database_CollecTRI_human() and database_CollecTRI_mouse().

Read Additional Gene Sets from local GMT files

EnrichGT supports reading GMT files. You can obtain GMT files from MSigDB.

database_from_gmt("Path_to_your_Gmt_file.gmt")

By default, database_from_gmt tries to convert numeric IDs to gene symbols, since they are usually ENTREZ IDs. You can disable this by passing convert_2_symbols = FALSE.
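For readers unfamiliar with the GMT layout, the sketch below parses a tiny GMT file with base R into a long table of Term and Genes. It is an illustration of the file format only, not EnrichGT's implementation of database_from_gmt; the helper read_gmt_long, the pathway names, and the example.org descriptions are all made up.

```r
# Write a tiny two-set GMT file to a temporary location (toy data).
# Each line is: set name <TAB> description <TAB> gene <TAB> gene ...
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
  "PathwayA\thttp://example.org/A\tGene1\tGene2\tGene3",
  "PathwayB\thttp://example.org/B\tGene3\tGene4"
), gmt_path)

# Hypothetical helper: turn each GMT line into one row per (set, gene) pair.
read_gmt_long <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  per_set <- lapply(fields, function(x) {
    genes <- x[-(1:2)]  # drop the set name and the description
    data.frame(
      Term  = rep(x[1], length(genes)),
      Genes = genes,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, per_set)
}

gmt_df <- read_gmt_long(gmt_path)
print(gmt_df)
```

The sketch keeps the gene identifiers exactly as written in the file and performs no ID conversion.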

Read Additional Gene Sets from local data tables

The result of any database_*** function is a data.frame, so you can simply read any data table and use it with any enrichment function.

The typical input has either three columns (IDs, Term, Genes) or two columns (Term, Genes):

IDs	Term	Genes
ID1	Biological Pathway1	Gene1
ID1	Biological Pathway1	Gene2
ID1	Biological Pathway1	Gene3
ID2	Biological Pathway2	Gene3
…	…	…

Term	Genes
Biological Pathway1	Gene1
Biological Pathway1	Gene4
Biological Pathway2	Gene7
…	…

Example:

library(readr)
db <- read_csv("your_gene_set.csv")
res <- egt_enrichment_analysis(genes = DEGtable$Genes,
                               database = db)
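If the gene sets start out as a named list rather than a file, a table in the two-column layout shown above can be assembled with base R. This is a sketch with made-up pathway and gene names; the CSV is written to a temporary path and read back with read.csv() purely to show the round trip.

```r
# Made-up gene sets as a named list: one character vector per pathway.
gene_sets <- list(
  "Biological Pathway1" = c("Gene1", "Gene4"),
  "Biological Pathway2" = c("Gene7", "Gene8")
)

# One row per (Term, Gene) pair, matching the two-column layout.
db <- data.frame(
  Term  = rep(names(gene_sets), lengths(gene_sets)),
  Genes = unlist(gene_sets, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Round trip through a CSV file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
write.csv(db, csv_path, row.names = FALSE)
db_back <- read.csv(csv_path, stringsAsFactors = FALSE)
print(db_back)
```

Writing with row.names = FALSE matters here, because otherwise the file gains an extra unnamed first column that does not belong to either layout.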

Gene Annotation Converter

You can use convert_annotations_genes() to convert gene annotations from one key type to one or more other key types.

Example:

suppressMessages(library(EnrichGT))

Warning: replacing previous import 'AnnotationDbi::select' by 'dplyr::select'
when loading 'EnrichGT'

suppressMessages(library(readr))
suppressMessages(library(org.Hs.eg.db))
suppressMessages(DEGexample <- read_csv("./DEG.csv"))
convert_annotations_genes(DEGexample$...1[1:10], from_what="SYMBOL", to_what=c("ENTREZID","ENSEMBL","GENENAME"), OrgDB=org.Hs.eg.db)

'select()' returned 1:1 mapping between keys and columns

     SYMBOL ENTREZID         ENSEMBL
1  TBX5-AS1   255480 ENSG00000255399
2     ADH1B      125 ENSG00000196616
3     CCL11     6356 ENSG00000172156
4      TBX5     6910 ENSG00000089225
5     GATA5   140628 ENSG00000130700
6     TCF21     6943 ENSG00000118526
7     SSTR1     6751 ENSG00000139874
8      CSF3     1440 ENSG00000108342
9     GSTA2     2939 ENSG00000244067
10     WIF1    11197 ENSG00000156076
                                               GENENAME
1                                  TBX5 antisense RNA 1
2  alcohol dehydrogenase 1B (class I), beta polypeptide
3                         C-C motif chemokine ligand 11
4                          T-box transcription factor 5
5                                GATA binding protein 5
6                               transcription factor 21
7                               somatostatin receptor 1
8                           colony stimulating factor 3
9                     glutathione S-transferase alpha 2
10                              WNT inhibitory factor 1
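Conceptually, a key conversion like the one above is a lookup against a mapping table. The base R sketch below shows that idea on a toy table whose three rows are copied from the example output; convert_with_lookup is a hypothetical helper, not how convert_annotations_genes() is implemented, and the real function works against the OrgDB you pass in.

```r
# Toy lookup table; the three rows are copied from the example output above.
lookup <- data.frame(
  SYMBOL   = c("ADH1B", "CCL11", "TBX5"),
  ENTREZID = c("125", "6356", "6910"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate keys by matching them against the table.
# Keys without a match get NA in the target column.
convert_with_lookup <- function(keys, table, from, to) {
  idx <- match(keys, table[[from]])
  out <- data.frame(keys, table[[to]][idx], stringsAsFactors = FALSE)
  names(out) <- c(from, to)
  out
}

converted <- convert_with_lookup(
  c("TBX5", "NOT_A_GENE", "ADH1B"),
  lookup, from = "SYMBOL", to = "ENTREZID"
)
print(converted)
```

Unmatched keys are kept with a missing value instead of being dropped, so the output stays aligned with the input order.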