Source: IOWA STATE UNIVERSITY submitted to
DEVELOPMENT OF SYSTEMS INFORMATICS TOOLS TO ACCELERATE LIVESTOCK GENOMICS
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
TERMINATED
Funding Source
Reporting Frequency
Annual
Accession No.
1000727
Grant No.
2013-67015-21210
Project No.
IOW05361
Proposal No.
2013-01001
Multistate No.
(N/A)
Program Code
A1201
Project Start Date
Sep 1, 2013
Project End Date
Aug 31, 2016
Grant Year
2013
Project Director
Reecy, J. M.
Recipient Organization
IOWA STATE UNIVERSITY
2229 Lincoln Way
AMES,IA 50011
Performing Department
Animal Science
Non Technical Summary
Genetics research has been revolutionized by the mapping and sequencing of whole genomes and now the re-sequencing of large numbers of individuals--human, chicken, cattle, swine, sheep, and multiple other species (vertebrate and non-vertebrate). Our long-range goal is to develop the integrated resources available at iPlant and Animal QTLdb, thereby leveraging the prior national funding in cyber infrastructure to ultimately address issues of importance to the livestock industry. The objectives of this particular application, which is an immediate step toward attainment of our long-range goal, are to 1) develop computational pipelines to efficiently process next-generation sequence data to facilitate GWAS/genomic selection, 2) provide computational resources to process next-generation data, and 3) facilitate comparison of livestock QTL/GWAS findings with previously published results. Two complementary and integrated objectives will accomplish our goal: Objective 1: Development of scalable, accessible informatic pipelines to leverage next-generation sequencing data in livestock genomics. We expect to generate easy-to-use computational pipelines to process next-generation sequence data to facilitate genomic selection/GWAS analyses. Objective 2: Expand and enhance the Animal QTL Database. We expect to expand ontologies that facilitate direct comparison of QTL/GWAS data within and across species and to curate all published QTL and marker associations for cattle, pigs, chickens, sheep, rainbow trout, and other livestock species. These tools are expected to have significant positive effects on researchers' ability to analyze genotype/phenotype data associated with traits of economic and health importance in livestock species.
Animal Health Component
0%
Research Effort Categories
Basic
50%
Applied
40%
Developmental
10%
Classification

Knowledge Area (KA)   Subject of Investigation (SOI)   Field of Science (FOS)   Percent
303                   3399                             1080                     15%
303                   3299                             1080                     15%
303                   3499                             1080                     15%
303                   3799                             1080                     10%
303                   3599                             1080                     15%
303                   3699                             1080                     10%
303                   3810                             1080                     10%
303                   3820                             1080                     10%
Goals / Objectives
A goal of this project is to develop the integrated resources available at iPlant and Animal QTLdb, thereby leveraging the prior national funding in cyber infrastructure to ultimately address issues of importance to the livestock industry. The objectives of this particular application, which is an immediate step toward attainment of our long-range goal, are to 1) develop computational pipelines to efficiently process next-generation sequence data to facilitate Genome Wide Association Study (GWAS)/genomic selection, 2) provide computational resources to process next-generation data, and 3) facilitate comparison of livestock Quantitative Trait Loci/Genome Wide Association Study (QTL/GWAS) findings with previously published results. We have formulated these objectives based on 1) the need for easy-to-use computational pipelines, 2) the need for easy access to high-performance computing resources, and 3) the need to comprehend large amounts of genotype/phenotype association data published at an accelerating rate. Our rationale is that leveraging existing resources to facilitate computational analysis of next-generation sequencing data, coupled with comparison of association results with previously published research, will dramatically expedite the transformation of raw data into genetic approaches to address issues facing animal agriculture. Two complementary and integrated objectives will accomplish our goal: Objective 1: Development of scalable, accessible informatic pipelines to leverage next-generation sequencing data in livestock genomics. The aim for this objective is to allow researchers to easily and efficiently process and analyze next-generation sequence information. Objective 2: Expand and enhance the Animal QTL Database. The aim for this objective is to continue to curate published QTL/GWAS information, and to expand the Vertebrate Trait Ontology, Product Trait Ontology, and Clinical Measurement Ontology to facilitate phenotypic/genomic data comparison and evaluation.
Project Methods
OBJECTIVE 1 We will leverage extensively the investments made by the National Science Foundation in developing iPlant, an extensible national-scale life sciences Cyber Infrastructure. iPlant currently hosts hundreds of computational biology tools, including dozens for plant genetics and genomics, has several thousand users, manages over 350 TB of user data, and brokers computation of over 4 million CPU hours of analysis per year. It offers comprehensive tool integration and workflow construction facilities. By adopting iPlant's development environments, application programming interfaces, and graphical analysis workbench, we avoid entirely the need to provision hardware for computing and storage and the complexity of building an interoperability and interface layer for myriad bioinformatics tools. For each major class of deliverable (listed below), we will follow a three-phase implementation plan. Phase 1 Using real-world test data and use cases, we will determine an optimization and parallelization strategy that will apply sufficient resources to specific applications to ensure completion in a reasonable time. Generally, this will proceed as follows. First, we will build and deploy the target application according to the data owners' instructions on TACC systems; then, we will run an initial benchmark analysis using test data. We will use in-house profiling tools such as TACC-stats (https://github.com/TACCProjects/tacc_stats) and PerfExpert to compile statistics on RAM usage, CPU core utilization, IO bandwidth, code execution efficiency, threading scalability, and so on for the application. These will, in turn, provide guidance for our enhancement strategy. We will also tune the physical resources allocated to specific applications to ensure specific metrics. All configuration changes, support scripts, and detailed performance logs will be submitted to the public domain on GitHub and, where appropriate, back to the application authors.
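As a rough illustration of the initial benchmark step in Phase 1, the sketch below times a single command-line run and records the CPU time and peak memory of its child process. It is a generic stand-in built only on the Python standard library (Unix-only `resource` module), not TACC-stats or PerfExpert; the function name `benchmark` and the returned field names are hypothetical.

```python
import resource
import subprocess
import sys
import time


def benchmark(cmd):
    """Run cmd once and report wall time, CPU time, and peak memory.

    Hypothetical helper for illustration; not part of TACC-stats or PerfExpert.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # CPU time consumed by child processes during this run (user + system).
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        # Largest resident set size among waited-for children so far
        # (kilobytes on Linux, bytes on macOS).
        "peak_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Trivial stand-in command; a real run would pass the application's
    # own command line and test data.
    print(benchmark([sys.executable, "-c", "pass"]))
```

A ratio of `cpu_seconds` to `wall_seconds` well below the number of requested cores is one simple hint that an application is not using its allocation efficiently.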
Datasets that are available for pipeline development The success of our proposed efforts is built on the tenet that we will have access to datasets that we can process via the three phases we are proposing (see Objective 1). Toward this end, the following datasets are available today or will become available early during this project (personal communications with Drs. Hans Cheng, Noelle Cockett, Martien Groenen, Warren Snelling, Jerry Taylor, Curt Van Tassell, Sue Lamont, Chris Tuggle, Jack Dekkers, Jason Ross, Lee Alexander, and John Williams). Phase 2 Once a specific application is working satisfactorily, we will make it accessible to the larger science community. The first step in that process is to wrap the command-line version of the optimized application in the iPlant Agave RESTful web service API. As each new web service becomes available, we will publicize it to the animal genomics community via the AnGenMap Listserv (http://animalgenome.org/community/discuss) and provide documented examples of usage. Phase 3 Once an application is deployed as an Agave web service, we will develop a graphical user interface for it using iPlant's WYSIWYG authoring tool "Tito." OBJECTIVE 2 Curation protocol to meet the standard of Minimum Information requirements for QTL and Association Studies (MIQAS) Currently, the following minimum information is required for inclusion of new QTL/association data in the Animal QTLdb: 1) target trait; 2) chromosome; 3) linkage or genome map location, denoted by cM/Mbp position, interval, or peak/flanking markers; and 4) at least one statistical measure (e.g., p-value, F-statistic, LOD score). Each publication or pre-publication report from which data are to be entered is assigned a unique ID (PubMed ID if available).
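The four minimum-information items listed above lend themselves to an automated pre-check before a record reaches a curator. The following is a minimal sketch under assumed field names (`trait`, `chromosome`, `position_cm`, `p_value`, and so on are hypothetical and do not reflect the actual Animal QTLdb schema):

```python
# Hypothetical field names; the real Animal QTLdb schema is not given here.
LOCATION_FIELDS = ("position_cm", "position_mbp", "interval", "flanking_markers")
STAT_FIELDS = ("p_value", "f_statistic", "lod_score")


def missing_minimum_information(record):
    """Return the minimum-information items that a QTL record lacks."""

    def present(key):
        # A value of 0 or 0.0 counts as present; only empty values do not.
        return record.get(key) not in (None, "", [], ())

    missing = []
    if not present("trait"):
        missing.append("target trait")
    if not present("chromosome"):
        missing.append("chromosome")
    # Any one way of expressing the map location is enough.
    if not any(present(k) for k in LOCATION_FIELDS):
        missing.append("map location")
    # At least one statistical measure is required.
    if not any(present(k) for k in STAT_FIELDS):
        missing.append("statistical measure")
    return missing
```

A record for which the function returns an empty list meets the minimum; anything else can be sent back to the submitter with the list of gaps.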
Curation is tasked in three areas: 1) reference information, including article author(s), title, year, journal information, and abstract; 2) experiment information, including animal breed(s), experimental design, and analysis method; and 3) QTL details, including trait ID, QTL position, statistics, QTL effects, and candidate gene information. Reference and experiment information are entered once for each publication, while QTL information is entered for each QTL identified. Following curation of QTL/association data, the original paper is uploaded to the database for data review and future editorial reference. Curation tools to directly import associations from GenSel and EpiSNP A common complaint from users curating their own data is the need to reformat results. Therefore, we will develop tools that directly load results from GenSel and EpiSNP into a pre-curation form in a user's account to facilitate error-free data transfer and comparison with previously published results. QTL/association data mapping to respective genome maps The QTL/association locations on each linkage map are currently translated into their respective genome coordinates by an in-house script that uses the closest anchor markers as reference points for interpolation. The results from this method are arguably only rough estimates and should be improved. Once more accurate genome map locations of the QTL/association estimates become available, data download tools can be developed to enable QTL/association data export for visualization in other software such as GBrowse, GBrowse_syn, VCmap, etc. Data streamlining for easy QTL/association data integration via API/Web services A programmable API or Web services can serve as agents between the QTLdb and end users to fulfill customized data requests. We plan to develop and implement Perl-driven Web services and/or an API to facilitate seamless data transport with user applications and other data portals upon user-customized data requests.
This will help end users in multiple ways, including dynamic data download, interactive data mining, and customized data iterations. This will greatly boost the utility of the QTLdb as a powerful tool for data analysis at the user's end. Tools for semantic analysis of QTL/association literature data to aid QTLdb data curation We already have an automated PubMed search tool in place to aid the initial selection of papers, from which curators decide which ones report new QTL/association data. Since our search algorithm is keyword-based, the real QTL/association papers ultimately selected for curation have been approximately 1/3 of what the program gathered. This translates into a lot of curator time used. With improved accuracy of literature screening, the QTLdb curator tools will supply curators with only the relevant papers, and possibly get some data extracted in advance of curation. Trait data mappings between QTLdb and CorrDB using VT/PT/CMO Continued improvement of the Vertebrate Trait (VT), Product Trait (PT), and Clinical Measurement (CMO) ontologies is vital for the success of standardized systematic genomics analysis and translational genomics studies. We already have a trait mapping tool to match traits between the terms within the QTLdb and the VT, PT, and CMO data sets. To map traits between Animal CorrDB and the VT, PT, and CMO data sets, we need to (1) standardize trait names and annotate them with standardized trait names within the CorrDB; (2) build trait mapping annotation tools for curators to assign proper VT, PT, and CMO unique identifiers to each CorrDB trait; and (3) examine the mappings using hierarchy tools and trait correlation relationships to rectify the mappings. Once this is done, links should be established between the CorrDB traits and QTLdb traits, thus providing an immediate possibility for QTL trait relationship analysis based on historical trait correlations.
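The anchor-marker interpolation described under "QTL/association data mapping to respective genome maps" can be sketched as a simple linear interpolation between the two nearest markers that have both a linkage position (cM) and a genome coordinate (bp). This is an illustrative reconstruction, not the in-house script itself; the function name `cm_to_bp` and the anchor format are assumptions, and the sketch assumes marker order agrees between the two maps.

```python
from bisect import bisect_left


def cm_to_bp(cm, anchors):
    """Estimate a genome coordinate (bp) for a linkage-map position (cM).

    anchors: list of (cM, bp) pairs for markers placed on both maps,
    assumed to be in the same order on the linkage and genome maps.
    """
    if len(anchors) < 2:
        raise ValueError("need at least two anchor markers")
    anchors = sorted(anchors)
    cms = [a[0] for a in anchors]
    i = bisect_left(cms, cm)
    # Clamp so positions beyond the anchored range are extrapolated
    # from the nearest pair of anchors.
    i = min(max(i, 1), len(anchors) - 1)
    (cm0, bp0), (cm1, bp1) = anchors[i - 1], anchors[i]
    if cm1 == cm0:
        # Two anchors at the same cM position: fall back to the midpoint.
        return round((bp0 + bp1) / 2)
    frac = (cm - cm0) / (cm1 - cm0)
    return round(bp0 + frac * (bp1 - bp0))
```

Because recombination rate varies along a chromosome, the estimate is only as good as the local anchor density, which is consistent with the caution above that such conversions are rough.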

Progress 09/01/13 to 08/31/16

Outputs
Target Audience: The results of this project will be of interest to geneticists (human and livestock), animal breeders, animal scientists, and progressive livestock producers. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Every year at the Plant and Animal Genome meeting in San Diego, we provided training on the utilization of these resources. How have the results been disseminated to communities of interest? The computational workflows are available on a request basis in the CyVerse Discovery Environment and Agave API platforms (http://www.cyverse.org/). We have presented our work at the Plant and Animal Genome meeting in 2014, 2015, and 2016. Furthermore, the team has presented additional findings at ISAG 2016 and EBI Livestock genomics 2016, and will present at the Plant and Animal Genome meeting in Jan. 2017. We have also published our work in eight peer-reviewed journal articles. The Animal QTLdb is freely available at http://www.animalgenome.org/cgi-bin/QTLdb/index. Additionally, the curated associations housed in the Animal QTLdb can be found at NCBI (https://www.ncbi.nlm.nih.gov/), Ensembl (http://www.ensembl.org/index.html), the UCSC Genome Browser (https://genome.ucsc.edu/), and the Bovine Genome Database (http://bovinegenome.org/). What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
(N/A)

Publications

  • Type: Conference Papers and Presentations Status: Published Year Published: 2016 Citation: Flemming, D. S.J. Lamont, J.E. Fulton, J.M. Reecy, A. Lund. 2016. Segmental Duplications in the Chicken Genome Identified By Single Nucleotide Variant (SNV) Analysis in Two Inbred Lines. Plant and Animal Genome meeting, San Diego, CA
  • Type: Conference Papers and Presentations Status: Published Year Published: 2016 Citation: Koltes, J.E., E.R. Fritz-Waters, J.M. Reecy. 2016. EpiDB: an Omics Data Resource for Cattle. Plant and Animal Genome meeting, San Diego, CA
  • Type: Conference Papers and Presentations Status: Published Year Published: 2016 Citation: Hu, Z.-L., C. Park, J.M. Reecy. 2016. Animal QTLdb: Towards a Comprehensive Database and Tool for Livestock Genome Research. Plant and Animal Genome meeting, San Diego, CA
  • Type: Journal Articles Status: Published Year Published: 2016 Citation: Hu ZL, Park CA, Reecy JM: Developmental progress and current status of the Animal QTLdb. Nucleic Acids Res 2016, 44(D1):D827-833.
  • Type: Journal Articles Status: Published Year Published: 2016 Citation: Tuggle CK, Giuffra E, White SN, Clarke L, Zhou H, Ross PJ, Acloque H, Reecy JM, Archibald A, Bellone RR et al: GO-FAANG meeting: a Gathering On Functional Annotation of Animal Genomes. Anim Genet 2016.
  • Type: Journal Articles Status: Published Year Published: 2016 Citation: Weeks, Nathan T.; Luecke, Glenn R.; Groth, Brandon M.; Kraeva, Marina; Ma, Li; Kramer, Luke M.; Koltes, James E.; Reecy, James M. High Performance Epistasis Detection in Quantitative Trait GWAS. International Journal of High Performance Computing. Prepublished July 12, 2016. doi: 10.1177/1094342016658110


Progress 09/01/14 to 08/31/15

Outputs
Target Audience: The results of this project will be of interest to geneticists (human and livestock), animal breeders, animal scientists, and progressive livestock producers. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? The workflows are available on a request basis in the iPlant Discovery Environment and Science APIs platforms. They are now publicly available. We have presented our work at the Plant and Animal Genome meeting in 2015 and will do so again in 2016. Furthermore, we plan to present additional findings at ISAG in July of 2016. What do you plan to do during the next reporting period to accomplish the goals? For the next reporting period, our work will continue on expansion of the variant calling pipeline. In the coming project period, the variant calling and effect prediction workflows in iPlant will be expanded to include structural variant calling. The project will continue to broaden its engagement with members of the livestock genomics community to onboard them into the iPlant platform, allowing them to bring their large-scale sequence data in and perform the same analyses as their colleagues.

Impacts
What was accomplished under these goals? In the second year of the project, we have worked with several livestock genomic researchers to successfully call variants from their extremely large genomic re-sequencing data sets. This signals that we have successfully developed a data analysis pipeline that is capable of quickly and efficiently conducting variant calling from genome re-sequencing data. This has involved conducting training for early adopters, troubleshooting network problems, and developing standards for how genomic information should be organized to facilitate computation. We are currently collaborating with researchers to continue to analyze their re-sequencing variants. Anyone in the world can now utilize the pipeline that we have developed to analyze their data. We have continued to curate QTL and association data into the QTL database.

Publications

  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Smedley D, Haider S, Durinck S, Pandini L, Provero P, Allen J, Arnaiz O, Awedh MH, Baldock R, Barbiera G et al: The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res 2015. 43(W1):W589-98.
  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Andersson L, Archibald AL, Bottema CD, Brauning R, Burgess SC, Burt DW, Casas E, Cheng HH, Clarke L, Couldrey C et al: Coordinated international action to accelerate genome-to-phenome with FAANG, the Functional Annotation of Animal Genomes project. Genome Biol 2015, 16(1):57.
  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Bickhart DM, Hutchinson JL, Schnabel RD, Taylor JF, Reecy JM, Schroeder S, Van Tassell CP, Sonstegard TS, Liu GE: RAPTR-SV: a hybrid method for the detection of structural variants. Bioinformatics 2015, 31(13):2084-90.
  • Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: Hu, Z.L., J.E. Koltes, E. Fritz-Waters, C. Park, J.M. Reecy. 2015. An application programming interface (API) for programmable access to the Animal QTLdb. Plant and Animal Genome meeting, San Diego, CA
  • Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: Carson, J., E. Dawson, J.E. Koltes, E. Fritz-Waters, J.M. Reecy, M. Vaughn. 2015. Leveraging iPlant cyberinfrastructure for a new data-driven research community. Plant and Animal Genome meeting, San Diego, CA
  • Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: Reecy, J.M. Development and utilization of bioinformatic tools in livestock genomics. 2015. Plant and Animal Genome meeting, San Diego, CA
  • Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: Nicolazzi, E.L., C.P. Van Tassell, D. Lamartino, J.M. Reecy, E. Fritz-Waters, T.S. Sonstegard, J.E. Koltes, S.G. Schroeder, A. Ahmad, J.F. Garcia, L. Ramunno, G. Cosenza, J. Williams. 2015. Using the 90K Buffalo SNP Array. Plant and Animal Genome meeting, San Diego, CA
  • Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: Williams, J., A. Valentini, P.A. Marsan, A. Zimin, K.D. Pruitt, T.S. Sonstegard, C.P. Van Tassell, D. Lamartino, F. Strozzi, J.M. Reecy, F. Ferre, C. Lawley, E. Amaral, J. Womack. 2015. Status of the Buffalo Genome Project. Plant and Animal Genome meeting, San Diego, CA


Progress 09/01/13 to 08/31/14

Outputs
Target Audience: The results of this project will be of interest to geneticists (human and livestock), animal breeders, animal scientists, and progressive livestock producers. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? An undergraduate student has been recruited to work on objective 1 and is learning an extensive amount about iPlant web service application programming interfaces (APIs) and developing scalable parallel workflows for genomics. In addition, the workflows under development by this project will be used in joint iPlant/USDA-ARS workshops starting in March 2015. Every year at the Plant and Animal Genome meeting in San Diego, we provide training on the utilization of these resources. How have the results been disseminated to communities of interest? The workflows are available on a request basis in the iPlant Discovery Environment and Science APIs platforms. They will be made publicly available in the next reporting period. We have met with the livestock research community at the 2014 Plant and Animal Genome meeting and the 2014 World Congress of Genetics Applied to Livestock Production. In both cases, we presented our efforts in the form of research presentations in several workshops. What do you plan to do during the next reporting period to accomplish the goals? For the next reporting period, our work will continue on what’s reported here: In the coming project period, the prototype variant calling and effect prediction workflows in iPlant will be hardened and parallelized by early 2015. We will also add new variant detection workflows to iPlant for finding rearrangements and copy number variations (CNV) and add support for SAMtools-based variant calling. 
The project will begin broader engagement with members of the livestock genomics community to onboard them into the iPlant platform, allowing them to bring their large-scale sequence data in and perform the same analyses as their colleagues. Review current Animal QTLdb data standard against MIQAS, outline work needed to patch procedures and data flow protocols to meet the standard if in sub-optimal status. Draft a protocol for internal review by the end of 2014, and use it to serve as a guide for upcoming work to strengthen data quality control practices. Make the Animal QTLdb data curation requirements a standard protocol available on the web portal for users to follow (individual curators; batch data uploads; complementary data hosting; etc.). We expect the new JBrowse to be in service by the end of 2014. An improved version of JBrowse for Animal QTLdb will be in service in the first half of 2015. Develop a roadmap for the types of most-needed API tools that complement what already exists at NCBI, Ensembl, and UCSC. Start to provide preliminary API services directly from the Animal QTLdb in the early part of 2015.

Impacts
What was accomplished under these goals? In the first year of the project, we have worked with several livestock genomic researchers to successfully upload their extremely large genomic re-sequencing data sets into the iPlant Data Store, which is a key first step in being able to conduct large-scale analysis on these data. This has involved conducting training for early adopters, troubleshooting network problems, and developing standards for how genomic information should be organized to facilitate computation. In addition, we have developed and tested a parallel (able to utilize multiple computer nodes at once) workflow, based on the "BWA MEM" program, in the iPlant cyberinfrastructure to enable alignment of sequencing reads to a reference genome. This is a critical first step in the workflow and we expect to use it at scale (several thousand animals) before the end of 2014. We have created functional prototype workflows for variant calling using the PLATYPUS and GATK software tools and for predicting the effects of variants based on the VarScan and VEP programs. First, we have reviewed and updated the existing Animal QTLdb Curator Manual for the new tools, data types, and procedures implemented over the past couple of years, to make it ready for further revision emphasizing the Animal QTLdb compatibility with the MIQAS data standard. We are in the process of going through the protocol and software implementations to ensure that data flow procedures meet the standard. The first of the MIQAS-compatible data standard guidelines will be ready for internal review before the end of 2014. Second, we made two moves to gear up our efforts for more robust QTL/association data mapping to the respective genome builds.
(1) We have strengthened our requirements for publicly recognized SNP IDs ('rs' numbers) by a procedure to work with authors and journal editors (standard letter templates developed); for the linkage map based data, we are in the process of fine-tuning the procedures for how the converted genome locations (by previously developed methods) will be re-examined and updated where applicable. (2) We are in the process of setting up the state-of-the-art JBrowse, a next-generation genome browser and alignment tool for improved QTL/association data placement on the latest genome builds. The advantage of JBrowse is that it allows new types of quantitative genome variation tracks, such as BAM, BED, Wiggle, BigWig, and VCF data, to be easily compared. Third, we took two approaches to make an API/Web service tool for the end users of Animal QTLdb. (1) We have worked vigorously with our data alliance partners, NCBI, Ensembl, and UCSC, to make our QTL/association data available as part of their genome data portals; therefore, users can directly use NCBI, Ensembl, and UCSC API/Web service tools for QTL/association data mining against other genome features. (2) We have begun our trials on a prototype of an NCBI E-utilities-like API scheme. The first such API script is being made to work for remote users. We are currently developing a roadmap for the types of most-needed API tools that complement what already exists at NCBI, Ensembl, and UCSC, to provide services directly from the Animal QTLdb in the early part of 2015.
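For readers unfamiliar with the request style that the prototype is modeled on, the sketch below builds an NCBI E-utilities `esearch` request URL for PubMed. It shows only the NCBI convention (a base URL, a script name, and `db`, `term`, `retmax`, and `retmode` query parameters); the endpoints and parameters of the Animal QTLdb API itself are not described in this report and are not represented here. The helper name `esearch_url` is hypothetical, and the code only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=20):
    """Build an E-utilities esearch URL for a text query (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("QTL AND cattle", retmax=5))
```

An API that follows this pattern lets remote users script their data requests with nothing more than an HTTP client, which is the kind of programmable access the roadmap above is aiming for.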

Publications

  • Type: Conference Papers and Presentations Status: Published Year Published: 2014 Citation: Reecy, J.M. , J.P. Carson, F. McCarthy, J.E. Koltes, E. Fritz-Waters, J. Williams, E. Lyons, C.F. Baes, and M.W. Vaughn. 2014. Cyberinfrastructure for Life Sciences - iAnimal Resources for Genomics and Other Data Driven Biology. Proceedings, 10th World Congress of Genetics Applied to Livestock Production.
  • Type: Conference Papers and Presentations Status: Published Year Published: 2014 Citation: Baes, C.F., M.A. Dolezal, E. Fritz-Waters, J.E. Koltes , B. Bapst, C. Flury, H. Signer-Hasler, C. Stricker, R. Fernando, F. Schmitz-Hsu, D.J. Garrick, J.M. Reecy, and B. Gredler. 2014. Comparison of variant calling methods for whole genome sequencing data in dairy cattle. Proceedings, 10th World Congress of Genetics Applied to Livestock Production.
  • Type: Journal Articles Status: Published Year Published: 2014 Citation: Baes CF, Dolezal MA, Koltes JE, Bapst B, Fritz-Waters E, Jansen S, Flury C, Signer-Hasler H, Stricker C, Fernando R, Fries R, Moll J, Garrick DJ, Reecy JM, Gredler B. 2014. Evaluation of variant identification methods for whole genome sequencing data in dairy cattle. BMC Genomics. 2014 Nov 1. 15(1):948.
  • Type: Journal Articles Status: Published Year Published: 2014 Citation: Koesterke L, JE Koltes, NT Weeks, K Milfeld, MW Vaughn, JM Reecy and D Stanzione. 2014. Discovery of biological networks using an optimized partial correlation coefficient with information theory algorithm on Stampede's Xeon and Xeon Phi processors. Concurrency and Computation: Practice and Experience 26(13):2178-2190. 10 September 2014.