Development of a flax breeding database: a gateway to novel breeding strategies

Objectives

1) Collect, integrate, and organize different types of flax breeding research data, including field trial data, genotyping and phenotyping data, molecular markers, genetic maps, physical maps, association mapping data, gene sequences of mutants, and pedigrees of flax germplasm;

2) Develop a flax breeding database to store and manage raw and integrated data;

3) Build an intelligent selection strategy for flax breeding parents based on genome-wide information, phenotypic evaluation data, and the relationships between them;

4) Develop a web-based toolbox of user-friendly tools that lets users (plant breeders, producers, and researchers) retrieve the information they need by simple search or browsing.

Project Description

The following data have been compiled into the flax breeding database:

  1. The draft whole genome shotgun (WGS) sequence assembly of cultivar CDC Bethune (Wang et al. 2013), the first version of the chromosome-level pseudomolecules (You et al. 2018), and their annotations (genes and repeat sequences), downloaded from the Phytozome database and NCBI;
  2. ~280,000 annotated flax ESTs with ~31,000 unigenes identified and downloaded from the NCBI database;
  3. 80,337 BAC end sequences (BES) (Ragupathy et al. 2011);
  4. A total of ~1.7 million SNPs identified from 407 core flax accessions, together with QTL and SSR/SNP markers associated with important traits, identified by the Cloutier and You labs;
  5. Phenotypic data of 390 genotypes of the flax core collection evaluated during 2009-2011 at two locations (Morden, Manitoba and Kernen Crop Research Farm, Saskatchewan); the traits include seed yield, yield components, agronomic traits, seed oil and protein content, seed oil component traits, disease resistance (such as wilt), and fibre yield and quality traits (You et al. 2017);
  6. The 91 fatty acid-associated genes and 204 NBS-encoding disease resistance genes identified in the flax genome using a bioinformatics approach (You et al. 2014, 2018);
  7. Pedigree data of Canadian cultivars (You et al. 2016).
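As an illustration of how the phenotypic data in item 5 might be stored, the following is a minimal sketch using Python's built-in sqlite3 module. The table and column names (accession, trial, phenotype) and the helper functions are assumptions made for this example; they do not describe the actual schema of the flax breeding database.

```python
import sqlite3

# Hypothetical schema: one row per accession, per trial (year x location),
# and per trait measurement. Names are illustrative only.
SCHEMA = """
CREATE TABLE accession (
    accession_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE
);
CREATE TABLE trial (
    trial_id INTEGER PRIMARY KEY,
    year     INTEGER NOT NULL,
    location TEXT NOT NULL,
    UNIQUE (year, location)
);
CREATE TABLE phenotype (
    accession_id INTEGER NOT NULL REFERENCES accession(accession_id),
    trial_id     INTEGER NOT NULL REFERENCES trial(trial_id),
    trait        TEXT NOT NULL,
    value        REAL,              -- NULL encodes a missing observation
    PRIMARY KEY (accession_id, trial_id, trait)
);
"""


def create_database(path=":memory:"):
    """Open a SQLite database and create the sketch tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def trait_means(conn, trait):
    """Return (accession name, mean value) pairs for one trait.

    AVG skips NULL values, so missing observations do not bias the mean.
    """
    cursor = conn.execute(
        """
        SELECT a.name, AVG(p.value)
        FROM phenotype AS p
        JOIN accession AS a USING (accession_id)
        WHERE p.trait = ?
        GROUP BY a.name
        ORDER BY a.name
        """,
        (trait,),
    )
    return cursor.fetchall()
```

Storing measurements in this long format (one row per accession, trial, and trait) means that new traits or trial years can be added without altering the table definitions.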

Each dataset file was checked and verified across the different trial locations and projects. Related files were validated and converted to a consistent format. Redundant records, obvious errors, and illogical values were removed from the datasets. Missing data were properly coded. Outliers were verified and handled appropriately. Measurement units and scales were standardized across datasets from different platforms, trial locations, and projects. After validation, the datasets were merged or split into series of files to facilitate database table design.
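The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (accession, year, location, value, unit), the list of missing-value codes, and the median-based outlier rule are all hypothetical choices for this example, not the procedure actually applied to the flax datasets.

```python
from statistics import median

MISSING = None  # single code for every missing value (assumed convention)
MISSING_CODES = {"", "NA", "N/A", "."}  # assumed spellings of "missing"


def to_float(raw):
    """Return a float, or MISSING for blanks and known missing-value codes."""
    if raw is None:
        return MISSING
    text = str(raw).strip()
    if text in MISSING_CODES:
        return MISSING
    return float(text)


def clean_records(records, unit_factor):
    """Drop redundant records, code missing values, and unify units.

    records: list of dicts with keys accession, year, location, value, unit
    unit_factor: dict mapping each unit to its multiplier to the target unit
    """
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec["accession"], rec["year"], rec["location"])
        if key in seen:  # redundant record: keep only the first occurrence
            continue
        seen.add(key)
        value = to_float(rec["value"])
        if value is not MISSING:
            value *= unit_factor[rec["unit"]]
        cleaned.append({
            "accession": rec["accession"],
            "year": rec["year"],
            "location": rec["location"],
            "value": value,
        })
    return cleaned


def flag_outliers(values, k=3.0):
    """Return indices of values more than k scaled MADs from the median.

    Flagged values are candidates for manual verification, not deletion.
    """
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return []
    return [
        i for i, v in enumerate(values)
        if v is not None and abs(v - med) / (1.4826 * mad) > k
    ]
```

The outlier function only flags values, because a flagged measurement still needs to be checked against the field records before it is corrected or removed.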

Pedigree information for flax cultivars registered in Canada from 1910 to 2015 was collected and analyzed. Pedigree charts for all cultivars were drawn and integrated into a single pedigree chart (Fig. 1). We developed a computing pipeline in R and Perl to analyze the collected pedigree data. The pipeline can analyze parental relationships, the genetic contributions of cultivars to their offspring or descendants, and how these relate to genotyping-by-sequencing SNP markers and phenotypic data (You et al. 2016). All pedigree information and analysis results have been integrated into the flax pedigree database.
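To make the idea of genetic contribution concrete, the sketch below computes the expected share of a cultivar's genome that traces back to a given ancestor. It is written in Python for illustration and is not the R and Perl pipeline itself. It assumes that each listed parent contributes equally and that selection and drift are ignored; the pedigree dictionary layout and the function names are hypothetical.

```python
def genetic_contribution(pedigree, cultivar, ancestor, _cache=None):
    """Expected fraction of `cultivar`'s genome derived from `ancestor`.

    pedigree: dict mapping a cultivar name to a tuple of its parents;
              founders are absent from the dict or map to an empty tuple.
    Assumes equal contribution from each listed parent (0.5 for two parents).
    """
    if _cache is None:
        _cache = {}
    if cultivar == ancestor:
        return 1.0
    if cultivar in _cache:
        return _cache[cultivar]
    parents = pedigree.get(cultivar, ())
    if not parents:
        return 0.0  # a founder other than the ancestor contributes nothing
    share = 1.0 / len(parents)
    total = sum(
        share * genetic_contribution(pedigree, parent, ancestor, _cache)
        for parent in parents
    )
    _cache[cultivar] = total
    return total


def founder_contributions(pedigree, cultivar):
    """Map each founder with a non-zero contribution to its expected share."""
    names = set(pedigree)
    for parents in pedigree.values():
        names.update(parents)
    founders = sorted(name for name in names if not pedigree.get(name))
    result = {}
    for founder in founders:
        share = genetic_contribution(pedigree, cultivar, founder)
        if share > 0:
            result[founder] = share
    return result
```

For a pedigree in which C = A x B and D = C x A, the contribution of A to D under these assumptions is 0.5 x 0.5 + 0.5 = 0.75, and the founder shares of any cultivar sum to 1.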