Large Language Models for DNA to Advance Health

seqSight empowers biological discovery and health innovation through novel genomic models, transforming data into actionable insights

seqSight.com
About this idea

The foundational, patented technology of seqSight is seqLens, a purpose-built genomic language model engineered to understand biological sequences with far greater nuance than traditional machine-learning approaches. It is trained on two large, evolutionarily diverse genomic datasets: one containing 19,551 reference genomes, including more than 18,000 prokaryotes, totaling 115 billion nucleotides, and a second, balanced dataset of 1,354 genomes spanning both prokaryotes and eukaryotes, totaling 180 billion nucleotides. These corpora let seqLens learn patterns across deep evolutionary timescales, which supports stronger generalization across both microbial and eukaryotic biology.
To build a model specifically suited to DNA, we developed five custom byte-pair encoding (BPE) tokenizers and trained 52 genomic language models to systematically evaluate how tokenization choices, architectures, hyperparameters, pooling strategies, and classification heads affect biological prediction performance. Two findings stood out: larger vocabularies harm generalization, while carefully optimized tokenization and pooling substantially improve downstream accuracy. This systematic architecture search formed the evidence base for the final seqLens design.
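As a rough illustration of what byte-pair encoding does on DNA, the sketch below learns merges from a toy corpus and uses them to tokenize a sequence. It is a minimal teaching example, not the seqLens tokenizers: the function names (`train_bpe`, `apply_merge`, `tokenize`), the toy corpus, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties broken lexicographically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Apply the learned merges, in order, to a new sequence."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


merges = train_bpe(["ATATAT", "ATGC"], num_merges=1)
print(tokenize("ATATGC", merges))  # ['AT', 'AT', 'G', 'C']
```

Each additional merge adds one token to the vocabulary, so `num_merges` is the knob that controls vocabulary size, the quantity whose growth is reported above to harm generalization.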
At the core of seqLens is its signature technical innovation: disentangled attention combined with relative positional encoding. This architecture lets the model separate content from positional information and reason about the relative arrangement of DNA motifs, which is essential for identifying regulatory structure, functional sites, and long-range dependencies in genomes. With this design, seqLens outperforms state-of-the-art models in 13 of 19 phenotypic prediction tasks, demonstrating gains in biological accuracy and interpretability.
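To make the idea of disentangled attention concrete, here is a minimal NumPy sketch of single-head attention scores in the style published for DeBERTa, where each score is the sum of content-to-content, content-to-position, and position-to-content terms. Whether seqLens uses exactly this formulation is an assumption; the function name, argument names, and shapes are hypothetical and chosen only for this example.

```python
import numpy as np


def disentangled_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k):
    """Attention scores with content and relative position kept separate.

    H: (n, d_model) content embeddings, one row per token.
    P: (2k, d_model) relative-position embeddings, shared across positions.
    W*: (d_model, d) projection matrices; k: maximum relative distance.
    """
    n, d = H.shape[0], Wq_c.shape[1]
    Qc, Kc = H @ Wq_c, H @ Wk_c  # content queries and keys
    Qr, Kr = P @ Wq_r, P @ Wk_r  # relative-position queries and keys

    # delta[i, j] maps the offset i - j into the bucket range [0, 2k).
    idx = np.arange(n)
    delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

    c2c = Qc @ Kc.T                                       # content -> content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)    # content -> position
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T  # position -> content
    return (c2c + c2p + p2c) / np.sqrt(3 * d)


rng = np.random.default_rng(0)
n, d_model, d, k = 6, 8, 4, 3
H = rng.normal(size=(n, d_model))
P = rng.normal(size=(2 * k, d_model))
Wq_c, Wk_c, Wq_r, Wk_r = (rng.normal(size=(d_model, d)) for _ in range(4))
scores = disentangled_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k)
print(scores.shape)  # (6, 6)
```

Because position enters only through the offset between two tokens, a motif pair at a fixed spacing contributes the same positional terms wherever it occurs in the sequence.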
SeqLens also incorporates strategies for real-world applicability, including continual pretraining, domain-specific adaptation, and parameter-efficient fine-tuning, which let the model specialize rapidly to new organisms, environmental contexts, or genomic challenges. We further showed that seqLens can capture evolutionary relationships, enhancing genome annotation and variant interpretation. Together, these technical innovations form the backbone of seqSight’s platform, providing scalable, biologically aligned intelligence that turns raw DNA sequences into actionable scientific insights.
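For readers unfamiliar with parameter-efficient fine-tuning, the sketch below shows one widely used technique, low-rank adaptation (LoRA), applied to a single linear layer. The text above does not say which method seqLens uses, so LoRA is an assumption here, and the class name, layer size, and rank are hypothetical values chosen for illustration.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                              # pretrained, kept frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                  # trainable, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Effective weight is W + scale * B @ A, never formed explicitly.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


W = np.random.default_rng(1).normal(size=(768, 768))
layer = LoRALinear(W, rank=8)
print(layer.trainable_params(), W.size)  # 12288 589824
```

In this toy configuration only 12,288 of 589,824 weights (about 2%) would be updated during adaptation, which is what makes specializing to a new organism or task comparatively cheap.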