Source: MISSISSIPPI STATE UNIV submitted to
BIOCOMPUTATIONAL TOOLS FOR ANALYSIS OF COMPLEX AGRICULTURAL GENOMES
Sponsoring Institution
Agricultural Research Service/USDA
Project Status
NEW
Funding Source
Reporting Frequency
Annual
Accession No.
0422100
Grant No.
(N/A)
Project No.
6066-21310-004-13S
Proposal No.
(N/A)
Multistate No.
(N/A)
Program Code
(N/A)
Project Start Date
Sep 15, 2011
Project End Date
Sep 15, 2016
Grant Year
(N/A)
Project Director
SCHEFFLER B E
Recipient Organization
MISSISSIPPI STATE UNIV
(N/A)
MISSISSIPPI STATE, MS 39762
Performing Department
(N/A)
Non Technical Summary
(N/A)
Animal Health Component
(N/A)
Research Effort Categories
Basic
50%
Applied
50%
Developmental
0%
Classification

Knowledge Area (KA): 201
Subject of Investigation (SOI): 1710
Field of Science (FOS): 1060
Percent: 100%
Goals / Objectives
Advances in biotechnology have led to tremendous increases in biomolecular data. For example, over the last thirty years the number of nucleotides in GenBank, an online DNA/protein sequence repository, has doubled roughly every 18 months. Analysis and utilization of exponentially increasing quantities of biomolecular data has required a more intimate association of biology with high performance computing. The single-processor bioinformatics tools written in the last few years are already proving inadequate for deriving biological information from large data sets in a timely fashion. Moreover, such huge volumes of data have created a need for more powerful visualization tools that can translate digital data into intuitive graphical formats. We will generate new data analysis/visualization tools specifically designed for use on cluster supercomputers. Parallelized programs provide the built-in scalability required for the rapidly growing computational biology community.
Project Methods
We will develop high-throughput analysis pipelines for rapidly and accurately integrating genomic, transcriptomic, proteomic, metabolomic, and phenotypic data for species of importance to U.S. agriculture. Research will focus on expediting the association of genotype with phenotype while defining the biomolecular interactions that link the two. Unlike most existing bioinformatics tools, our algorithms and pipelines will employ parallel processing and other high-performance computing (HPC) principles from their inception, thus permitting scaling of computer resources to adequately meet the storage and memory needs of a wide array of projects. In addition to de novo tool development, we will work to upgrade existing tools using HPC concepts. An important component of our work will be development of effective ways to visualize complex relationships among diverse data sets. To make our analyzed data as accessible and understandable as possible, we will utilize gene ontology (GO) techniques to annotate and "cross-link" molecular data.
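The GO cross-linking idea described above can be illustrated with a small sketch: genes carry sets of GO term annotations, and inverting that mapping links genes to one another through shared terms. This is a toy example only, not the project's actual AgBase pipeline; the gene names and GO assignments below are invented for illustration.

```python
from collections import defaultdict

# Toy GO annotation table: gene -> set of GO term IDs.
# Gene names and term assignments are invented for this example.
ANNOTATIONS = {
    "geneA": {"GO:0006355", "GO:0003700"},
    "geneB": {"GO:0006355"},
    "geneC": {"GO:0008152"},
}

def cross_link(annotations):
    """Invert the gene->GO map so each GO term lists the genes it links."""
    by_term = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            by_term[term].add(gene)
    return dict(by_term)

def shared_terms(annotations, gene1, gene2):
    """GO terms shared by two genes -- the basis for cross-linking them."""
    return annotations[gene1] & annotations[gene2]
```

In a real annotation resource the same inversion lets a user start from a biological process term and retrieve every annotated gene product, across species, that participates in it.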

Progress 10/01/12 to 09/30/13

Outputs
Progress Report Objectives (from AD-416): Advances in biotechnology have led to tremendous increases in biomolecular data. For example, over the last thirty years the number of nucleotides in GenBank, an online DNA/protein sequence repository, has doubled roughly every 18 months. Analysis and utilization of exponentially increasing quantities of biomolecular data has required a more intimate association of biology with high performance computing. The single-processor bioinformatics tools written in the last few years are already proving inadequate for deriving biological information from large data sets in a timely fashion. Moreover, such huge volumes of data have created a need for more powerful visualization tools that can translate digital data into intuitive graphical formats. We will generate new data analysis/visualization tools specifically designed for use on cluster supercomputers. Parallelized programs provide the built-in scalability required for the rapidly growing computational biology community.

Approach (from AD-416): We will develop high-throughput analysis pipelines for rapidly and accurately integrating genomic, transcriptomic, proteomic, metabolomic, and phenotypic data for species of importance to U.S. agriculture. Research will focus on expediting the association of genotype with phenotype while defining the biomolecular interactions that link the two. Unlike most existing bioinformatics tools, our algorithms and pipelines will employ parallel processing and other high-performance computing (HPC) principles from their inception, thus permitting scaling of computer resources to adequately meet the storage and memory needs of a wide array of projects. In addition to de novo tool development, we will work to upgrade existing tools using HPC concepts. An important component of our work will be development of effective ways to visualize complex relationships among diverse data sets.
To make our analyzed data as accessible and understandable as possible, we will utilize gene ontology (GO) techniques to annotate and "cross-link" molecular data.

To date, ARS scientists have conducted a number of computational and bioinformatics activities aimed at improving automated biomolecule data analysis pipelines. Special emphasis has been placed on increasing the efficiency of biocomputing algorithms by generating variants of these scripts that better utilize the advantages of high performance computing (HPC) architectures. Advances include the following:

(1) Development of an HPC reference-guided genome assembly pipeline. Gene order is shared among members of a species and is highly similar among related species. Thus, once a high quality reference genome sequence is generated for a species/genotype of interest, draft versions of related genomes (including different species and/or different genotypes of the same species) can be quickly assembled from shotgun reads using the reference sequence as a guide; this is known as reference-guided assembly. Previously, a German research group built a computational pipeline to assemble four divergent Arabidopsis (A.) thaliana genomes using the A. thaliana accession Columbia as a reference. This pipeline utilizes sound assembly principles, but it was built to assemble small eukaryotic genomes and operate on small computer clusters. Our initial attempts to use it on even medium-sized plant genomes (e.g., cotton) showed it to be inadequate. Consequently, we integrated the basic ideas of the German group with HPC principles to create a reference-guided assembly pipeline optimized for an HPC environment. This has proved essential in our reference-guided assembly of various Gossypium (G.) genomes using the G. raimondii genome as a reference. A manuscript describing our reference-guided assembly pipeline is in preparation.
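The core consensus step of reference-guided assembly can be sketched in a few lines: reads are aligned to the reference, and at each position the majority read base replaces the reference base, with the reference retained where coverage is absent. The toy version below assumes pre-computed alignment offsets and ignores indels, base qualities, and the parallel decomposition that the actual pipeline performs; it is an illustration of the principle, not the project's code.

```python
from collections import Counter

def reference_guided_consensus(reference, aligned_reads):
    """Toy reference-guided consensus call.

    reference     -- reference sequence string
    aligned_reads -- list of (start_offset, read_string) pairs, assumed
                     already mapped to the reference with no indels
    Positions with no read coverage fall back to the reference base.
    """
    # Tally read bases observed at each reference position ("pileup").
    pileup = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for i, base in enumerate(read):
            pos = start + i
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1
    # Majority vote per position; keep the reference base where uncovered.
    consensus = []
    for ref_base, counts in zip(reference, pileup):
        consensus.append(counts.most_common(1)[0][0] if counts else ref_base)
    return "".join(consensus)
```

For example, with reference `"ACGTACGT"` and reads that consistently show `A` at position 3, the draft consensus becomes `"ACGAACGT"`, while the uncovered tail positions stay as the reference bases.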
(2) Adaptation of quality control tools for high throughput sequence analysis in HPC environments. Poor quality sequence reads and sequence contamination can greatly complicate downstream data analyses such as Single Nucleotide Polymorphism (SNP) calling, repeat identification, and genome assembly. A number of pre-analysis quality control tools have been developed; these tools generally focus on trimming low quality reads and filtering out adapter sequences. Unfortunately, very few of these tools work effectively in HPC environments where multiple central processing units (each of which may have multiple processing cores) are clustered together. In fact, few of these tools even take advantage of multiple processing cores within the same central processing unit (CPU). Consequently, filtering and trimming tools often become a bottleneck limiting downstream analysis. We are working to adapt one of these tools, Cutadapt, to function more efficiently in an HPC environment. To do this, the script has had to be rewritten entirely. The new HPC-friendly tools will greatly accelerate preprocessing of high throughput sequencing data, which will expedite and improve downstream analyses.

(3) Annotation of agricultural genomes. We have continued our role in the gene ontology (GO) annotation of transcripts, genes, and proteins from numerous agricultural species. This work is available to the scientific community through the web resource AgBase (www.agbase.msstate.edu). Several bioinformatics tools and genus-specific datasets are available through AgBase. A review manuscript on cotton GO annotation and molecular resources is near completion.
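The parallelization strategy behind the adapter-trimming adaptation described in item (2) can be sketched with Python's standard multiprocessing module: reads are partitioned into chunks, each chunk is trimmed on its own core, and the results are recombined in order. This is an illustrative sketch of the general approach, not the project's rewritten Cutadapt code; the adapter matching here is a naive exact-substring search rather than the error-tolerant alignment a real trimmer uses.

```python
from multiprocessing import Pool

# Common Illumina adapter prefix, used here purely as example data.
ADAPTER = "AGATCGGAAGAGC"

def trim_read(read):
    """Naive trimming: cut the read at the first exact adapter match."""
    idx = read.find(ADAPTER)
    return read[:idx] if idx != -1 else read

def trim_chunk(reads):
    """Trim one chunk of reads; each chunk runs in its own worker process."""
    return [trim_read(r) for r in reads]

def parallel_trim(reads, workers=4, chunk_size=1000):
    """Partition reads into chunks, trim chunks in parallel, recombine."""
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]
    with Pool(workers) as pool:
        trimmed_chunks = pool.map(trim_chunk, chunks)
    # pool.map preserves chunk order, so reads come back in input order.
    return [read for chunk in trimmed_chunks for read in chunk]
```

Because each chunk is independent, the same decomposition extends from cores on one CPU to nodes on a cluster, which is the property that removes the preprocessing bottleneck described above.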

Impacts
(N/A)

Publications


Progress 10/01/11 to 09/30/12

Outputs
Progress Report Objectives (from AD-416): Advances in biotechnology have led to tremendous increases in biomolecular data. For example, over the last thirty years the number of nucleotides in GenBank, an online DNA/protein sequence repository, has doubled roughly every 18 months. Analysis and utilization of exponentially increasing quantities of biomolecular data has required a more intimate association of biology with high performance computing. The single-processor bioinformatics tools written in the last few years are already proving inadequate for deriving biological information from large data sets in a timely fashion. Moreover, such huge volumes of data have created a need for more powerful visualization tools that can translate digital data into intuitive graphical formats. We will generate new data analysis/visualization tools specifically designed for use on cluster supercomputers. Parallelized programs provide the built-in scalability required for the rapidly growing computational biology community.

Approach (from AD-416): We will develop high-throughput analysis pipelines for rapidly and accurately integrating genomic, transcriptomic, proteomic, metabolomic, and phenotypic data for species of importance to U.S. agriculture. Research will focus on expediting the association of genotype with phenotype while defining the biomolecular interactions that link the two. Unlike most existing bioinformatics tools, our algorithms and pipelines will employ parallel processing and other high-performance computing (HPC) principles from their inception, thus permitting scaling of computer resources to adequately meet the storage and memory needs of a wide array of projects. In addition to de novo tool development, we will work to upgrade existing tools using HPC concepts. An important component of our work will be development of effective ways to visualize complex relationships among diverse data sets.
To make our analyzed data as accessible and understandable as possible, we will utilize gene ontology (GO) techniques to annotate and "cross-link" molecular data.

Current genome assembly programs require more random access memory (RAM) than is simultaneously accessible in supercomputers such as those at Mississippi State University's High Performance Computing Collaboratory (HPC2), where this project is being conducted. To expedite construction of computational biology pipelines, three high-RAM computer clusters were purchased. Two of these clusters have 0.5 terabytes (TB) of shared RAM, while the third has 0.25 TB of shared memory. The high-RAM computers were integrated into the HPC2 system, and popular genome assembly and analysis algorithms have been installed on these machines. These algorithms are being tested and compared. While setting up working pipelines using existing tools is a first priority, the overarching goal is to adapt variants of these scripts so that they can take advantage of more typical high performance computing architectures.

Impacts
(N/A)

Publications