Plant Genomics and the Future of Crop Improvement

Sequencing a wheat genome in 2005 cost approximately $50 million and took an international consortium three years. Sequencing the same genome today costs under $200 and takes two days. This cost collapse has not, on its own, accelerated crop improvement as much as the numbers suggest it should have. The bottleneck was never sequencing. It was — and largely still is — understanding which of the millions of DNA variants in a crop genome actually matter for the traits that determine yield, stress tolerance, and food quality under production conditions. Genomics data without phenotype connections is a catalog of variants, not a guide to breeding targets.

From Sequence to Function: The Annotation Challenge

Wheat's reference genome, published by the International Wheat Genome Sequencing Consortium in 2018, contains 107,891 high-confidence gene models across 21 chromosomes. Rice has around 35,000 genes. Maize, with a highly complex, repetitive genome, has roughly 40,000. For none of these crops has the function of more than a small fraction of those genes been experimentally determined. Gene annotation relies predominantly on sequence homology to characterized genes in model organisms (Arabidopsis, rice, and increasingly tomato), supplemented by expression data showing when and where each gene is transcribed during plant development.

Sequence homology is a useful starting point but a poor finishing point. A wheat gene with 70 percent protein sequence similarity to an Arabidopsis gene involved in drought response may perform the same function in wheat, a partially overlapping function, or a completely divergent function that evolution co-opted from the ancestral sequence. Assuming functional equivalence from sequence homology alone, without validation in the target species, is one of the most common sources of wasted development effort in plant biotechnology.

The approach that produces reliable target identification combines several data types: expression data (is the gene expressed in the right tissue at the right time?), co-expression network analysis (does it behave similarly to genes of known function under stress conditions?), genetic association data (does natural variation in or near this gene associate with phenotypic differences across the germplasm?), and functional validation in protoplasts or mutant lines (does disrupting the gene produce the expected phenotypic change?). Running all four lines of evidence in parallel, rather than sequentially, is what a modern genomics-enabled target discovery pipeline looks like.
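As a rough illustration of combining the four lines of evidence, the sketch below counts how many of them support each candidate gene and ranks candidates by that count. The field names, cutoffs, and gene IDs are hypothetical placeholders chosen for the example, not values from any real pipeline.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Four lines of evidence for one candidate gene (all fields hypothetical)."""
    expressed_in_target_tissue: bool  # expression data
    coexpression_score: float         # 0..1 similarity to known stress genes
    gwas_p_value: float               # best association p-value in or near the gene
    validated_phenotype: bool         # protoplast or mutant-line result


def evidence_count(ev, coexpr_cutoff=0.7, p_cutoff=1e-6):
    """Count how many of the four lines of evidence support the gene."""
    checks = [
        ev.expressed_in_target_tissue,
        ev.coexpression_score >= coexpr_cutoff,
        ev.gwas_p_value <= p_cutoff,
        ev.validated_phenotype,
    ]
    return sum(checks)


def rank_candidates(candidates):
    """Return gene IDs sorted by supporting-evidence count, highest first."""
    return sorted(candidates, key=lambda g: evidence_count(candidates[g]), reverse=True)


# Illustrative, made-up candidates.
candidates = {
    "geneA": Evidence(True, 0.85, 3e-8, True),
    "geneB": Evidence(True, 0.40, 2e-7, False),
    "geneC": Evidence(False, 0.90, 0.01, False),
}
ranking = rank_candidates(candidates)
```

A simple count treats the four evidence types as equally informative. A real pipeline would more likely weight them, since functional validation usually carries more information than co-expression.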

Pan-Genomics: One Reference Genome Is Not Enough

The conventional approach to plant genomics used a single reference genome as the foundation for all subsequent analysis. Any accession or variety was characterized by comparing its sequence to the reference, identifying variants (SNPs, indels, structural variants) relative to that reference. This approach has a structural flaw: genomic regions absent from the reference cannot be characterized, even if they are present in a large fraction of the diversity being studied. For crops with high structural variation — which includes wheat, maize, and soybean — the fraction of functional genomic content missing from any single reference can be substantial.

Pan-genomics addresses this by constructing a graph-based genome that captures the sequence diversity present across a large collection of accessions, rather than representing it all as variants from a single reference. The pan-genome of bread wheat, published in 2023 by a consortium including Bayer Crop Science and multiple universities, incorporated complete genome assemblies from ten diverse wheat accessions. The resulting pan-genome contains approximately 95,000 gene families, of which roughly 13,000 are present in only a subset of accessions, representing gene content that no single reference genome could fully capture.

For stress tolerance applications, these "accessory" genes — present in some accessions but absent from others — are particularly interesting. A gene found in a drought-tolerant landrace but absent from elite commercial varieties represents potential untapped adaptation. The pan-genome framework makes it possible to identify these accession-specific sequences, characterize their function, and use gene editing to introduce them into elite adapted backgrounds without the extensive backcrossing required by conventional introgression of genomic segments from unadapted germplasm.
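To make the core versus accessory distinction concrete, here is a minimal sketch that works from a gene-family presence/absence table. The accession names, family IDs, and the donor/elite grouping are invented for illustration. Building the table in the first place, from assemblies and orthology clustering, is the hard part and is not shown.

```python
def classify_gene_families(presence, n_accessions):
    """Split families into core (in every accession) and accessory (in a subset)."""
    core, accessory = [], []
    for family, accessions in presence.items():
        if len(accessions) == n_accessions:
            core.append(family)
        else:
            accessory.append(family)
    return sorted(core), sorted(accessory)


def candidates_for_introduction(presence, donors, elites):
    """Families present in at least one donor accession and in no elite line."""
    return sorted(
        family
        for family, accessions in presence.items()
        if accessions & donors and not accessions & elites
    )


# Hypothetical presence/absence data: family -> set of accessions carrying it.
presence = {
    "fam1": {"elite1", "elite2", "landrace1"},
    "fam2": {"landrace1"},
    "fam3": {"elite1", "landrace1"},
    "fam4": {"elite2"},
}

core, accessory = classify_gene_families(presence, n_accessions=3)
targets = candidates_for_introduction(
    presence, donors={"landrace1"}, elites={"elite1", "elite2"}
)
```

Here `fam2` is the only family found in the landrace and in neither elite line, which is the pattern described above for a potential untapped adaptation.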

Genome-Wide Association Studies at Scale

GWAS connects genetic variants (typically single nucleotide polymorphisms, SNPs) to phenotypic measurements across large diversity panels. The resolution of a GWAS, meaning how precisely it can localize the causal variant within a genomic region, depends on the number of accessions genotyped, the density of SNP markers, the extent of linkage disequilibrium in the panel, and the statistical power to detect small-effect associations. For complex traits like drought tolerance, where hundreds of loci each contribute small effects, large panels (thousands of accessions) and dense marker coverage (hundreds of thousands to millions of SNPs) are required to reliably detect all but the largest-effect loci.
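The core calculation behind a single-marker association test can be sketched in a few lines. The example below regresses a phenotype on allele dosage (0, 1, or 2 copies) and compares the resulting p-value to a Bonferroni threshold for one million markers. It is a deliberately simplified illustration on synthetic data: it uses a normal approximation to the test statistic and omits the population-structure and kinship corrections that production GWAS software applies through mixed models.

```python
import math
from statistics import mean


def marker_test(dosages, phenotypes):
    """Regress phenotype on allele dosage; return (slope, two-sided p-value).

    The p-value uses a normal approximation to the t statistic, which is
    only reasonable for large panels.
    """
    n = len(dosages)
    mx, my = mean(dosages), mean(phenotypes)
    sxx = sum((x - mx) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("marker is monomorphic in this panel")
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, phenotypes))
    slope = sxy / sxx
    rss = sum((y - my - slope * (x - mx)) ** 2 for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return slope, p_value


def bonferroni_threshold(n_markers, alpha=0.05):
    """Per-marker significance threshold when testing n_markers SNPs."""
    return alpha / n_markers


# Synthetic panel: each allele copy adds 0.5 trait units, plus small noise.
dosages = [i % 3 for i in range(60)]
noise = [((7 * i) % 5 - 2) * 0.1 for i in range(60)]
phenotypes = [10 + 0.5 * d + e for d, e in zip(dosages, noise)]

slope, p_value = marker_test(dosages, phenotypes)
significant = p_value < bonferroni_threshold(1_000_000)
```

The threshold line shows why marker density and panel size trade off: with a million SNPs tested, a per-marker p-value has to fall below 5e-8 to count, and small-effect loci only reach that level in large panels.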

The 1001 Genomes Project for Arabidopsis, the 3000 Rice Genomes Project, and comparable initiatives for wheat and maize have created the foundation for large-scale GWAS in major crops. The 3000 Rice Genomes Project — which characterized 3,024 rice accessions from 89 countries using whole-genome resequencing — has enabled identification of more than 30 genomic regions associated with drought tolerance, heat tolerance, or yield under stress conditions. Several of these regions contain candidate genes whose function had not previously been characterized, providing new targets that would not have been identified by model-organism homology approaches alone.

Bridging GWAS and Editing: The Fine-Mapping Requirement

GWAS identifies genomic regions containing causal variants; it rarely identifies the exact variant or the gene it acts through. The statistical association signal at a GWAS locus can span tens to hundreds of kilobases containing dozens of genes, any one of which might be the functional target. Fine-mapping, which uses additional genetic markers, expression data, and population structure analysis to narrow the associated region, is the necessary bridge between a GWAS hit and an editing target.

ClimateCrop's computational pipeline for fine-mapping uses a combination of Bayesian fine-mapping (statistical prioritization of likely causal variants within associated regions), eQTL (expression quantitative trait locus) analysis to identify variants that influence gene expression levels, and conservation analysis comparing the associated region across related species to identify evolutionarily constrained sequences most likely to be functional. This process typically narrows a 200-kilobase GWAS signal to two to five candidate variants, which are then validated by direct functional testing in protoplasts or stable transformation before any gene editing for commercial development is initiated.
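One common way to do the statistical prioritization step is with approximate Bayes factors and a credible set. The sketch below is a generic textbook-style illustration, not a description of ClimateCrop's pipeline. It assumes exactly one causal variant per region, uses a Wakefield-style approximate Bayes factor, and runs on made-up effect sizes and standard errors. The variant IDs and the prior standard deviation of 0.2 are arbitrary choices for the example.

```python
import math


def log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor favoring association over no association."""
    v, w = se ** 2, prior_sd ** 2
    z = beta / se
    return 0.5 * math.log(v / (v + w)) + 0.5 * z * z * w / (v + w)


def posterior_probs(stats):
    """Posterior probability that each variant is causal (single-causal assumption)."""
    logs = {vid: log_abf(beta, se) for vid, (beta, se) in stats.items()}
    top = max(logs.values())  # subtract the max before exponentiating, for stability
    weights = {vid: math.exp(value - top) for vid, value in logs.items()}
    total = sum(weights.values())
    return {vid: wt / total for vid, wt in weights.items()}


def credible_set(probs, coverage=0.95):
    """Smallest set of top-ranked variants whose probabilities sum to the coverage."""
    chosen, cumulative = [], 0.0
    for vid, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(vid)
        cumulative += p
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical summary statistics: variant -> (effect estimate, standard error).
stats = {
    "snp1": (0.30, 0.05),
    "snp2": (0.28, 0.05),
    "snp3": (0.10, 0.05),
    "snp4": (0.02, 0.05),
}
probs = posterior_probs(stats)
candidates = credible_set(probs)
```

With these inputs the credible set contains the two strongest variants, which is the kind of short list that then moves on to eQTL and conservation analysis and finally to functional testing.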

Transcriptomics: Expression Context for Target Validation

Understanding when and where a candidate gene is expressed provides critical context for predicting what its disruption or modification will do. RNA sequencing (RNA-seq) produces a quantitative snapshot of all transcribed sequences in a sample under a specific condition. Applied across time courses of drought stress — from well-watered to mild, moderate, and severe deficit, at vegetative and reproductive stages — RNA-seq generates expression profiles that reveal which genes are stress-responsive, in which tissues, and with what kinetics.

Genes that are upregulated early during drought stress in leaf tissue are candidates for drought sensing and stomatal regulation. Genes upregulated specifically during reproductive stress in pollen or ovule tissue are candidates for fertility protection. Genes that are downregulated during stress but rapidly re-induced during recovery are candidates for regulating resilience and rebound mechanisms. This temporal and tissue-specific expression context shapes how targets are prioritized and how editing strategies are designed.
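The expression categories described above can be expressed as simple rules on log2 fold changes across a stress time course. The following sketch assumes one expression value per gene per stage (for example, a normalized count from a single tissue). The stage names, pseudocount, two-fold threshold, and example profiles are illustrative assumptions. A real analysis would use replicates and a proper differential expression test instead of fixed cutoffs.

```python
import math


def log2fc(value, baseline, pseudocount=1.0):
    """Log2 fold change with a pseudocount to avoid division by zero."""
    return math.log2((value + pseudocount) / (baseline + pseudocount))


def classify(profile, threshold=1.0):
    """Assign a gene to a coarse response category from its stage profile."""
    base = profile["well_watered"]
    if log2fc(profile["mild"], base) >= threshold:
        return "early_induced"
    if (log2fc(profile["severe"], base) <= -threshold
            and log2fc(profile["recovery"], profile["severe"]) >= threshold):
        return "repressed_then_reinduced"
    return "unclassified"


# Hypothetical leaf expression profiles across a drought time course.
profiles = {
    "geneX": {"well_watered": 10, "mild": 40, "moderate": 80, "severe": 90, "recovery": 20},
    "geneY": {"well_watered": 50, "mild": 45, "moderate": 20, "severe": 5, "recovery": 48},
    "geneZ": {"well_watered": 20, "mild": 22, "moderate": 19, "severe": 21, "recovery": 20},
}
categories = {gene: classify(profile) for gene, profile in profiles.items()}
```

Running the same classification separately per tissue (leaf, pollen, ovule) would add the tissue dimension that the prioritization depends on.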

Single-cell transcriptomics — which profiles gene expression at the resolution of individual cells rather than bulk tissue — has added another dimension. Guard cells, the two cells surrounding each stomatal pore, are individually too small to isolate by traditional dissection. Single-cell RNA-seq makes it possible to extract the guard cell transcriptome from mixed epidermal cell populations, enabling identification of guard-cell-specific genes and regulatory factors that would be invisible in whole-leaf expression data. This resolution matters for developing stomatal modification strategies that affect guard cell function without affecting the many other cell types in the leaf where the same genes may play entirely different roles.
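Once cells have been clustered and annotated by type, one simple way to flag guard-cell-specific genes is a specificity score such as the tau index, which is 0 for a gene expressed evenly across cell types and 1 for a gene expressed in only one. The sketch below assumes per-cell-type mean expression has already been computed. The cell-type labels, gene IDs, values, and the 0.8 cutoff are hypothetical, and tau is often computed on log-transformed values rather than raw means.

```python
def tau_index(expression_by_cell_type):
    """Tau specificity index: 0 = uniform expression, 1 = single cell type only."""
    values = list(expression_by_cell_type.values())
    peak = max(values)
    if peak == 0 or len(values) < 2:
        return 0.0
    return sum(1 - v / peak for v in values) / (len(values) - 1)


def specific_to(genes, cell_type="guard_cell", min_tau=0.8):
    """Genes with high tau whose peak expression is in the given cell type."""
    return sorted(
        gene
        for gene, expr in genes.items()
        if tau_index(expr) >= min_tau and max(expr, key=expr.get) == cell_type
    )


# Hypothetical mean expression per annotated epidermal/leaf cell type.
genes = {
    "geneG": {"guard_cell": 90, "pavement_cell": 2, "mesophyll": 1},
    "geneH": {"guard_cell": 30, "pavement_cell": 30, "mesophyll": 30},
    "geneM": {"guard_cell": 5, "pavement_cell": 4, "mesophyll": 80},
}
guard_cell_genes = specific_to(genes)
```

In whole-leaf bulk data, `geneG` would be diluted by the far more numerous mesophyll and pavement cells, which is why this kind of gene is hard to see without cell-level resolution.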

Proteomics and Metabolomics: Beyond the Transcript

Transcript abundance is not a reliable proxy for protein abundance, and protein abundance is not a reliable proxy for metabolic activity. Post-transcriptional regulation, protein turnover, and enzyme kinetics all intervene between gene expression and biochemical outcome. For stress physiology applications — understanding how drought or heat stress alters plant metabolism — proteomics (systematic quantification of proteins) and metabolomics (systematic quantification of metabolites) provide essential ground truth that gene expression data alone cannot.

Comparative metabolomics of drought-tolerant versus drought-sensitive varieties during stress has identified consistent signatures of osmotic adjustment capacity: the accumulation of proline, glycine betaine, and other compatible solutes that allow cells to maintain turgor under water deficit. Varieties that accumulate these compounds faster and to higher concentrations show better performance under field drought conditions. The metabolomic signature validates the trait and points to the biosynthetic pathway genes most likely to be limiting. Combined with GWAS analysis of the accessions in the metabolomics panel, this approach directly links genomic variants to metabolic differences and to field performance, the three-point connection that makes a high-confidence editing target.
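The three-point connection can be pictured as three pairwise relationships that all have to hold in the same panel: variant to metabolite, metabolite to field performance, and variant to field performance. The sketch below checks them with plain Pearson correlations on a tiny invented dataset. The numbers and the 0.7 cutoff are placeholders, and correlation on its own does not establish causation, so this is a screening heuristic and not a substitute for the association testing and validation described above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-accession measurements (same accession order in each list).
allele_dosage = [0, 0, 1, 1, 2, 2]                # copies of the candidate allele
proline_level = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]    # accumulation under stress
drought_yield = [2.0, 2.1, 2.6, 2.5, 3.1, 3.3]    # field yield under drought

links = {
    "variant_to_metabolite": pearson(allele_dosage, proline_level),
    "metabolite_to_field": pearson(proline_level, drought_yield),
    "variant_to_field": pearson(allele_dosage, drought_yield),
}
three_point_support = all(abs(r) >= 0.7 for r in links.values())
```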

What This Means for the Next Decade of Crop Editing

The genomics data now available for major crops represents a fundamentally different foundation for plant improvement than what existed ten years ago. The combination of pan-genomic reference catalogs, large-scale GWAS datasets, spatially and temporally resolved transcriptomics, and multi-omic stress phenotyping has transformed target identification from a slow, largely hypothesis-driven process into a data-intensive, systematically screenable pipeline. The bottleneck has shifted from finding targets to validating and developing them, which is a more tractable problem.

For ClimateCrop, this means a target pipeline that is substantially less dependent on original discovery research as a first step and more focused on leveraging and synthesizing existing published and proprietary data to identify high-confidence editing targets faster. The crop genomics research community, distributed across public universities, CGIAR centers, and private companies, continues to generate functional annotation data that narrows the candidate space for every editing target we pursue. The convergence of freely available genomic resources with rapid, affordable editing tools is what makes this moment in plant science distinct from any that preceded it.