Large language model integration

Background

Large language models (LLMs) hold promise for interpreting biological data, yet their effectiveness is constrained when directly handling raw gene input or complex enrichment analysis results.

Challenges in Direct LLM Interpretation

  1. Raw Gene Input Limitations:
    • LLMs exhibit suboptimal performance when provided solely with gene symbols or identifiers (e.g., TP53, BRCA1, EGFR).
    • Without additional biological metadata, LLMs struggle to establish meaningful biological contexts and infer functional relationships among genes.
  2. Overwhelming Enrichment Results:
    • Directly feeding complete outputs from Gene Ontology (GO) or Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses into LLMs typically introduces excessive noise.
    • Context window limitations result in significant information loss, hindering the accurate interpretation of complex enrichment data.
    • Dense, unstructured enrichment tables challenge LLMs’ ability to effectively prioritize and summarize critical biological insights.

Proposed Solution: Cluster-First Approach

To address these challenges, a cluster-first methodology, exemplified by EnrichGT, is recommended. EnrichGT organizes enrichment results into meaningful clusters, which reduces complexity and improves interpretability. EnrichGT has recently added support for LLM-driven interpretation, so that an LLM can extract, summarize, and contextualize the key biological insights from each cluster rather than from the full enrichment table.
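The cluster-first idea can be illustrated with a small base R sketch. This is not EnrichGT's actual algorithm: it assumes a toy list of enriched terms with invented gene sets, measures term similarity by Jaccard distance on shared genes, and groups the terms by hierarchical clustering. The names `term_genes` and `jaccard_dist` are hypothetical.

```r
# Toy enrichment result: each term maps to the genes behind its enrichment.
# Term names and gene sets are invented for illustration only.
term_genes <- list(
  "glutamate receptor signaling" = c("GRIN1", "GRIN2A", "GRIN2B", "GRM1"),
  "NMDA receptor activation"     = c("GRIN1", "GRIN2A", "GRIN2B", "CAMK2B"),
  "synaptic adhesion"            = c("NRXN1", "NLGN1", "GRIN1"),
  "cell cycle checkpoint"        = c("CDC20", "CCNB1", "BUB1"),
  "mitotic spindle assembly"     = c("CDC20", "BUB1", "KIF11")
)

# Jaccard distance: 1 - (size of intersection / size of union)
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Pairwise distance matrix between all terms
n <- length(term_genes)
d <- matrix(0, n, n, dimnames = list(names(term_genes), names(term_genes)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- jaccard_dist(term_genes[[i]], term_genes[[j]])
  }
}

# Average-linkage hierarchical clustering, cut into two clusters
hc <- hclust(as.dist(d), method = "average")
clusters <- cutree(hc, k = 2)

# Term names grouped by cluster
term_clusters <- split(names(clusters), clusters)
print(term_clusters)
```

Each resulting group is a short, thematically coherent list of terms, which is a much smaller and cleaner input for an LLM than the full enrichment table.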

How to use

Bring your LLM to R

The LLM functionality is built on the ellmer package (https://ellmer.tidyverse.org/index.html), which provides a uniform R interface to most LLM providers.

ellmer supports a wide variety of model providers:

  • Anthropic’s Claude: chat_anthropic().
  • AWS Bedrock: chat_aws_bedrock().
  • Azure OpenAI: chat_azure_openai().
  • Databricks: chat_databricks().
  • DeepSeek: chat_deepseek().
  • GitHub model marketplace: chat_github().
  • Google Gemini: chat_google_gemini().
  • Groq: chat_groq().
  • Ollama: chat_ollama().
  • OpenAI: chat_openai().
  • OpenRouter: chat_openrouter().
  • perplexity.ai: chat_perplexity().
  • Snowflake Cortex: chat_snowflake() and chat_cortex_analyst().
  • vLLM: chat_vllm().

You can create a chat object in your R session like this (see the ellmer website for details):

library(ellmer)
dsAPI <- "sk-**********" # your API key
chat <- chat_deepseek(api_key = dsAPI, model = "deepseek-chat", system_prompt = "")
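To avoid hard-coding the key in a script, one option is to read it from an environment variable. The sketch below is an assumption-laden example: `get_api_key()` is a hypothetical helper (not part of ellmer or EnrichGT), and the variable name `DEEPSEEK_API_KEY` is assumed here, so check the ellmer documentation for the name your provider function expects.

```r
# Hypothetical helper (not part of ellmer or EnrichGT): read an API key from
# an environment variable and fail early with a clear message if it is unset.
get_api_key <- function(var = "DEEPSEEK_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Usage, mirroring the call above (requires ellmer and a valid key):
# library(ellmer)
# chat <- chat_deepseek(api_key = get_api_key(), model = "deepseek-chat",
#                       system_prompt = "")
```

The variable can be set once per machine, for example in your `.Renviron` file, so that the key never appears in shared code.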

Some suggestions:

  1. Choose a cost-effective model, because this type of annotation requires multiple calls. Also make sure that both the LLM service and your network connection are as stable as possible so that all results are returned (EnrichGT already retries failed calls automatically several times).
  2. Non-reasoning ("fast-thinking") models are generally a better fit. Reasoning ("slow-thinking") models, such as DeepSeek-R1, may lead to long waiting times.
  3. Prefer a model that is reasonably capable, has a broad knowledge base, and has a low hallucination rate. In our (albeit limited) experience, although GPT-4o performs worse than DeepSeek-V3-0324 in most benchmark tests, it may produce more reliable results in some cases because of the latter’s higher hallucination rate. You are free to choose whichever model you prefer.
  4. Do not set a system prompt. Also adjust your LLM’s temperature carefully, following your provider’s guidance.
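The retry behavior mentioned in the first suggestion can be pictured with a generic wrapper. This is only a sketch of the general pattern, not EnrichGT's internal code; `with_retry()` and `flaky()` are hypothetical names, and the failing call is simulated so that the example runs without network access.

```r
# Hypothetical helper: call `fn` up to `max_tries` times, pausing between
# failed attempts. An illustration only, not EnrichGT's implementation.
with_retry <- function(fn, max_tries = 3, wait_seconds = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(last_error),
       call. = FALSE)
}

# Simulated unstable call: fails on the first two attempts, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary network error")
  "ok"
}

answer <- with_retry(flaky, max_tries = 5, wait_seconds = 0)
```

Retries help with transient failures, but every extra attempt adds time and cost, which is another reason to prefer a stable provider and connection.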

Summarize your results using an LLM

Just execute:

re_enrichment_results <- egt_llm_summary(re_enrichment_results, chat)

A typical run with DeepSeek-V3-0324 takes about 6 minutes.
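Because a run makes many LLM calls and can take several minutes, it may be worth caching the annotated object on disk so that re-rendering a report does not repeat the calls. The following is a minimal sketch under that assumption; `cached()` is a hypothetical helper and the file name is arbitrary.

```r
# Hypothetical helper: run `compute()` once and cache its result in an .rds
# file; later calls with the same path read the file instead of recomputing.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Usage with the call above (requires EnrichGT, a `chat` object, and results):
# re_enrichment_results <- cached(
#   "llm_summary.rds",
#   function() egt_llm_summary(re_enrichment_results, chat)
# )
```

Delete the cache file whenever the enrichment results or the chosen model change, otherwise stale annotations will be reused.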

After the run completes, use the $ operator to access the annotated results. For example, to view the annotation of Cluster_1 (here the annotated object is named llm_annotated_obj):

llm_annotated_obj$Cluster_1
── Enrichment Result of Cluster_1 (LLM Summary) ────────────────────────────────
── "Glutamatergic Synaptic Transmission and Plasticity Network" ──
• The pathways listed predominantly converge on glutamatergic synaptic
transmission and plasticity, highlighting a central role in excitatory
neurotransmission and synaptic regulation. Key processes include glutamate
receptor signaling (ionotropic and metabotropic), synaptic vesicle dynamics,
and trans-synaptic protein interactions (e.g., neurexins and neuroligins),
which collectively modulate synaptic strength and plasticity. The inclusion of
NMDA receptor unblocking and activation underscores the importance of
calcium-dependent plasticity mechanisms, such as long-term potentiation (LTP)
or depression (LTD), critical for learning and memory.

Biological implications suggest this module regulates cognitive functions and
neuronal adaptability, with dysregulation linked to neuropsychiatric disorders
(e.g., autism, schizophrenia). Evidence includes the association of neuroligin
mutations with autism and NMDA receptor dysfunction with cognitive deficits.
The prominence of protein-protein interactions (e.g., neurexin-neuroligin)
further implicates synaptic adhesion in maintaining circuit stability.

A representative name for this module could be "Glutamatergic Synaptic
Transmission and Plasticity Network," reflecting its focus on excitatory
signaling and adaptive synaptic remodeling. This nomenclature aligns with the
pathways' roles in neuronal communication and their broader implications for
brain function and disease.
• The gene set predominantly encodes proteins critical for glutamatergic
synaptic function, plasticity, and neuronal signaling, with particularly
noteworthy members including GRIN1/2A/2B (NMDA receptor subunits), GRIA4 (AMPA
receptor), and GRM1/5/8 (metabotropic glutamate receptors) that form the core
excitatory neurotransmission machinery. Several genes stand out for their
established neurobiological roles: APOE influences synaptic repair and
Alzheimer's risk, NRXN1 and NLGN1 mediate trans-synaptic adhesion crucial for
circuit formation, while NTRK2 (BDNF receptor) and CAMK2B regulate
activity-dependent plasticity. The presence of calcium-related genes (CACNG5,
GRID2) and presynaptic regulators (UNC13A/C, RIMS3) suggests additional
modulation of vesicular release and calcium-dependent processes. Notably,
multiple genes (GRIN2B, SHISA9, IL1RAPL2) are associated with
neurodevelopmental disorders, reinforcing the set's relevance to synaptic
pathology. The inclusion of atypical members like CDC20 (cell cycle) and HRAS
(growth signaling) may reflect non-canonical neuronal functions or technical
noise, though WNT5A's role in synaptic patterning is established. The
collective profile implies these genes operate in coordinated networks
governing synaptic strength, with particular importance for cognitive functions
and neurological disease mechanisms, where glutamatergic dysfunction is
implicated across schizophrenia, autism, and neurodegenerative conditions. The
most biologically compelling targets appear to be the direct synaptic signaling
components (ionotropic receptors, adhesion molecules) and plasticity
regulators, while others may represent secondary modulators or
context-dependent players in neural circuits.
[1] "cli-92167-12"

All results are stored in the LLM_Annotation slot of the result object (result@LLM_Annotation).
