Web-based Bioinformatics Portal for RNA-seq Based Transcriptomics and Genome-wide Analysis of Gene Regulation in Agricultural Animal Species

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

TERMINATED

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1005357

Grant No.

2013-67015-22957

Project No.

MD.W-2014-10186

Proposal No.

2014-10186

Multistate No.

(N/A)

Program Code

A1201

Project Start Date

May 1, 2014

Project End Date

Aug 31, 2017

Grant Year

2015

Project Director
Li, W.

Recipient Organization
J. CRAIG VENTER INSTITUTE
9704 MEDICAL CENTER DRIVE
ROCKVILLE,MD 20850

Performing Department
(N/A)

Non Technical Summary
Remarkable advances in Next Generation Sequencing (NGS) technologies, computational theory and bioinformatics algorithms have accelerated and broadened genomic research. However, most existing genomics resources and tools were developed with human or health applications in mind. While the genomes of many agriculturally important animal species have been or are being sequenced, urgent attention is needed to improve their assembly and annotation and to develop computational tools and resources specifically designed for downstream applications in these species. The goal of this project is to develop a web portal with integrated tools for RNA-seq based gene expression analysis for agriculturally important animal species. We will implement the web portal using software frameworks similar to those we deployed in projects such as the Human Microbiome Project. This portal will include a web front-end interface, database servers, a compute grid and a set of computational, visualization and statistical tools. The major objectives are to: 1) improve genome annotation of agriculturally important animal species, 2) develop and integrate needed bioinformatics tools, 3) build a web portal that enables RNA-seq based transcriptomics analysis in the aforementioned animal species.

Animal Health Component

100%

Research Effort Categories

Basic

20%

Applied

30%

Developmental

50%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
304	3999	1080	50%
304	3999	2080	40%
304	3999	2090	10%

Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3999 - Animal research, general;

Field Of Science
1080 - Genetics; 2080 - Mathematics and computer sciences; 2090 - Statistics, econometrics, and biometrics;

Keywords

Goals / Objectives
The goal of this project is to develop a web portal with integrated tools for RNA-seq based gene expression analysis for agriculturally important animal species. The three major objectives are to: 1) improve genome annotation of agriculturally important animal species, including (but not limited to) cattle, pig, chicken, turkey, horse, sheep, and goat as well as catfish; 2) develop and integrate needed bioinformatics tools and pipelines, visualization interfaces, and statistical methods; 3) build a web portal that enables RNA-seq based transcriptomics analysis in the aforementioned animal species.

Project Methods
Methods: We will improve the genome assembly and annotation of agricultural animals by using genomic data, especially RNA-seq data, from public studies and our own projects. We will collect public RNA-seq datasets and other types of sequences (e.g. exome) for farm animals, run them through the mapping and assembly pipelines and identify novel genes, transcripts, SNPs, alternative splicing events and other types of variation. The novel findings will be added to the existing knowledge base in an appropriate data format (e.g. GFF). We will develop the computational tools and web portal using software frameworks similar to those we developed in projects such as the Human Microbiome Project (HMP). The analysis pipelines will be developed using workflow engines such as Kepler, Galaxy or similar tools.
Efforts: The results of this project, including improved genome annotation, software tools and the web portal, will be introduced to researchers in related fields and to other targeted audiences such as students and teachers in several ways: (1) major findings and developments will be published in scientific publications; (2) workshops are planned to train users on our portal, software and genome resources; (3) we will reach out to researchers in the field for collaborative research; (4) we will publish software documentation, user guides and other materials on our web portal.
Evaluation: Our original proposal listed measurable milestones that can be used to evaluate the progress and success of this project. These milestones are: read mapping pipeline, de novo assembly pipeline, reference-based assembly pipeline, post-analysis pipeline, in-house RNA-seq sequencing and deposition, improved genome assembly and annotation, web interface for configurable pipelines, web interface for individual tools, and programmable web services.
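As an illustration of the GFF step mentioned in the methods, the following is a minimal sketch of how a newly identified transcript could be written as a GFF3 record. It is not the project's code: the function name `gff3_line`, the source label `rnaseq_assembly` and the identifiers in the usage note are hypothetical, and only the nine-column GFF3 layout itself is standard.

```python
from urllib.parse import quote


def gff3_line(seqid, source, feature_type, start, end, strand, attributes,
              score=".", phase="."):
    """Format one GFF3 record: nine tab-separated columns, 1-based inclusive coordinates."""
    if start < 1 or end < start:
        raise ValueError("GFF3 coordinates are 1-based with start <= end")
    if strand not in {"+", "-", ".", "?"}:
        raise ValueError("strand must be one of '+', '-', '.', '?'")
    # Percent-encode characters that are reserved in column 9 (';', '=', '&', ',').
    attribute_text = ";".join(
        f"{key}={quote(str(value), safe=' ')}" for key, value in attributes.items()
    )
    columns = [seqid, source, feature_type, str(start), str(end),
               str(score), strand, str(phase), attribute_text]
    return "\t".join(columns)


if __name__ == "__main__":
    # Hypothetical novel transcript found by an RNA-seq assembly pipeline.
    print(gff3_line("chr1", "rnaseq_assembly", "mRNA", 1000, 2500, "+",
                    {"ID": "novel_tx_0001", "Parent": "novel_gene_0001"}))
```

Records of this form can be appended to an existing annotation file, provided the sequence names match those of the reference assembly.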

Progress 05/01/14 to 08/31/17

Outputs
Target Audience: The target audiences include the following groups: 1) researchers in agricultural fields who use various genomics approaches in their research and development; 2) students and teachers who use genome data, especially genome data from agriculture-related species, as teaching materials. The target audiences have been reached through meetings and conferences, journal papers and direct software support. In July 2014, at the JAM conference and joint NIFA PD meeting in Kansas City, we gave a poster presentation to other PDs and researchers in the field. In October 2015, at the Gathering On Functional Annotation of ANimal Genomes workshop (GO-FAANG) in DC, we presented our development to the meeting participants. In January 2016, at the Plant and Animal Genomics (PAG) conference in San Diego, we gave a poster presentation and a live computer demonstration to a large audience. In January 2016, at the NIFA PD meeting in San Diego, we gave a poster presentation to other PDs and researchers in the field. In January 2017, at the PAG conference in San Diego, we gave a poster presentation. In January 2017, at the NIFA PD meeting in San Diego, we gave a talk and a poster presentation to other PDs and researchers. In 2016, the main paper describing the web portal was published in BMC Genomics and the web portal was fully released to the community. Other papers were also published. Since the portal release, we have supported many users in their data analysis by email, resolving software issues, adding reference data, adding new tools, etc.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? One undergraduate student from the University of California, San Diego worked as an intern on this project. His major is computer science and bioinformatics. Through this project he was trained in and gained professional skills in genomics, computational biology and software development. He is a co-author of one of the publications.
How have the results been disseminated to communities of interest? As described in the Target Audience section, we demonstrated the web portal and software tools to communities of interest at the Plant and Animal Genomics (PAG) conferences (2016, 2017), the NIFA PD meetings 2014-2017 and a related workshop (2015) through oral presentations, poster presentations and computer demonstrations. We also tried to reach the communities through our own website and third-party websites (e.g. the Galaxy project website, YouTube) by providing general information about our software and its documentation. We published the paper describing our project in BMC Genomics in 2016. We have communicated with our users by email and provided user support for researchers who used our web portal.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? The goal of this project is to develop a web portal with integrated tools for RNA-seq based gene expression analysis for agriculturally important animal species. We originally proposed 18 tasks under these 3 major objectives: 1) improve genome annotation of agriculturally important animal species, including (but not limited to) cattle, pig, chicken, turkey, horse, sheep, and goat as well as catfish; 2) develop and integrate needed bioinformatics tools and pipelines, visualization interfaces, and statistical methods; 3) build a web portal that enables RNA-seq based transcriptomics analysis in the aforementioned animal species. These tasks are Task 1.1: Integrate existing genomic data for agricultural animals, Task 1.2: Set up methods for regularly updating public genomic data, Task 1.3: Search and obtain third-party genomic data, Task 1.4: In-house RNA-seq sequencing, Task 1.5: Improve genome assembly and annotation, Task 1.6: Distribute integrated and improved genome data, Task 2.1: Download, install, configure and test individual computational tools, Task 2.2: Parallelize the algorithms so that they can run on a computer cluster, Task 2.3: Implement a configurable RNA-seq read mapping pipeline, Task 2.4: Implement a configurable pipeline for de novo RNA-seq assembly, Task 2.5: Implement a configurable pipeline for genome-dependent RNA-seq assembly, Task 2.6: Implement a post-analysis pipeline, Task 2.7: Distribute integrated software tools, Task 3.1: Design the complete web portal and backend systems, Task 3.2: Implement web interface to run the configurable pipelines, Task 3.3: Implement web interface for standalone tools, Task 3.4: Implement programmable web services, Task 3.5: Distribute integrated software tools and user support. Within the first part of the project, we finished tasks 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4 and 3.1 as planned. 
For Task 1.2, we used ENSEMBL as the primary source for genome data and set up a regular update schedule. For Task 1.3, we downloaded the third-party goat genome and annotation from the Kunming Institute of Zoology in China (http://goat.kiz.ac.cn/GGD/) and the International Goat Genome Consortium (http://www.goatgenome.org/home.html). For Task 1.4, we generated additional RNA-seq data. Over 120 RNA-seq samples were sequenced with >2 GB per sample. The sequenced species include caprine, ovine, bovine, porcine and human, from various tissue types. We implemented the read mapping pipeline and assembly pipelines (tasks 2.3 & 2.4). To improve the compute infrastructure and reduce cost, we adopted Amazon cloud resources as our development and production environment. We further developed the workflow engine to run in the cloud. Together with the cloud cluster management software StarCluster, this significantly reduced our hardware and software maintenance effort. For task 3.1, after extensive testing and validation, we selected Galaxy to implement our web portal and used our in-house workflow tools for pipeline management. In the middle of the project, we continued tasks 1.5, 2.5, 2.6, 3.2 and 3.3. For task 1.5, all the genome data we downloaded and further processed, including genome data formatted and indexed for bwa, bowtie2, STAR, RSEM, BLASTN, BLASTP and IGV, were made available for download through both our web and FTP servers. We spent most of our effort on pipeline and portal development, including task 2.5, genome-dependent RNA-seq assembly; task 2.6, post-analysis pipeline; and tasks 3.2 & 3.3, web interface for workflows and standalone tools. We further integrated the pipelines and tools and reorganized them into three end-to-end workflows. The first workflow uses the Tuxedo suite of tools (TopHat, Cufflinks, Cuffmerge and Cuffdiff). 
The second workflow uses Trinity for de novo assembly, RSEM for transcript quantification and edgeR for differential analysis. The third combines STAR, RSEM, and edgeR for data analysis. All these workflows support multiple samples and multiple groups of samples and perform differential analysis between groups in a single workflow job submission. In the final stage of the project, we continued all the recurring tasks, including the regular updates of genomic data. Since the publication of our BMC Genomics paper, many users have started to use our web portal and download the data sets. We put major effort into user support and web portal maintenance. We improved the user data upload interface. Throughout the project, we continued the recurring tasks such as reference database updates (task 1.2) and user support (task 3.5). We also made efforts to share the software with the public (task 3.5) through the release of the web portal. For task 1.5 (improve genome annotation), this project contributed by providing the workflow tools to the communities and supporting their users. Beyond the originally proposed goals, we made significant new developments. In addition to the three major workflows, Tuxedo, Trinity and STAR, we added another workflow based on HISAT, which likewise supports multiple samples and groups and performs differential analysis between groups in a single workflow job submission.
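The multi-sample, multi-group behaviour described above can be sketched as follows. This is only an illustration of the idea, not the portal's implementation: the sample sheet layout, the function names and the `min_replicates` threshold are all assumptions introduced here. Samples are collected by group, and every pair of groups with enough replicates becomes one contrast for a differential analysis tool such as edgeR or Cuffdiff.

```python
from collections import defaultdict
from itertools import combinations


def group_samples(sample_sheet):
    """Collect sample names by group from (sample, group) pairs."""
    groups = defaultdict(list)
    for sample, group in sample_sheet:
        groups[group].append(sample)
    return dict(groups)


def pairwise_contrasts(groups, min_replicates=2):
    """List every pair of groups that has enough replicates to be compared.

    min_replicates is a hypothetical threshold chosen for this sketch.
    """
    eligible = sorted(name for name, samples in groups.items()
                      if len(samples) >= min_replicates)
    return list(combinations(eligible, 2))


if __name__ == "__main__":
    # Hypothetical sample sheet for one job submission.
    sheet = [
        ("goat_ctrl_1", "control"), ("goat_ctrl_2", "control"),
        ("goat_inf_1", "infected"), ("goat_inf_2", "infected"),
        ("goat_rec_1", "recovered"), ("goat_rec_2", "recovered"),
    ]
    for group_a, group_b in pairwise_contrasts(group_samples(sheet)):
        print(f"compare {group_a} vs {group_b}")
```

Deriving the contrasts from the sample sheet is what lets a single submission cover all group comparisons, rather than one job per pair.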

Publications

Type: Journal Articles Status: Published Year Published: 2016 Citation: Weizhong Li, R. Alexander Richter, Yunsup Jung, Qiyun Zhu and Robert W. Li. Web-based bioinformatics workflows for end-to-end RNA-seq data computation and analysis in agricultural animal species. BMC Genomics (2016) 17:761. DOI 10.1186/s12864-016-3118-z. PMID: 27678198, PMCID:PMC5039875.
Type: Journal Articles Status: Published Year Published: 2016 Citation: Robert W. Li, Weizhong Li, Jiajie Sun, Peng Yu, Ransom L. Baldwin, Joseph F. Urban. The effect of helminth infection on the microbial composition and structure of the caprine abomasal microbiome. Scientific Reports (2016) 6:20606

Progress 05/01/16 to 04/30/17

Outputs
Target Audience: The target audiences include the following groups: 1) researchers in agricultural fields who use various genomics approaches in their research and development; 2) students and teachers who use genome data, especially genome data from agriculture-related species, as teaching materials. In this period, we demonstrated the web portal we developed to researchers and students at the Plant and Animal Genomics (PAG) conference (San Diego, Jan 2017), where we gave a poster presentation to a large audience. The PD gave an oral presentation and a poster presentation to colleagues at NIFA's PD meeting (San Diego, Jan 2017). The paper describing the web portal, published last year, also reached our audiences, and many researchers have started to use the web portal.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Nothing Reported
How have the results been disseminated to communities of interest? We demonstrated the web portal and software tools to communities of interest at the Plant and Animal Genomics (PAG) conference (San Diego, Jan 2017), where we gave a poster presentation to a large audience. We also presented our software to researchers at NIFA's PD meeting. We tried to reach the communities through our own website and third-party websites (e.g. the Galaxy project website, YouTube) by providing general information about our software and its documentation. We published the paper describing our project in BMC Genomics in 2016. We have communicated with our users by email and provided user support for researchers who used our web portal.
What do you plan to do during the next reporting period to accomplish the goals? After this reporting period, we will have 4 months before the end of the project. For the remaining 4 months, we will focus on supporting the portal users and making the software more accessible to the community. We are currently working on a few manuscripts and plan to finish them to further increase the impact of the project.

Impacts
What was accomplished under these goals? In this project period, we continued all the recurring tasks, including the regular updates of genomic data. Since the publication of our BMC Genomics paper, many users have started to use our web portal and download the data sets. We put major effort into user support and web portal maintenance. We improved the user data upload interface. In addition to the three major workflows, Tuxedo, Trinity and STAR, we added another workflow based on HISAT. All these workflows support multiple samples and multiple groups of samples and perform differential analysis between groups in a single workflow job submission.

Publications

Type: Journal Articles Status: Published Year Published: 2016 Citation: Weizhong Li, R. Alexander Richter, Yunsup Jung, Qiyun Zhu and Robert W. Li. Web-based bioinformatics workflows for end-to-end RNA-seq data computation and analysis in agricultural animal species. BMC Genomics (2016) 17:761. DOI 10.1186/s12864-016-3118-z. PMID: 27678198, PMCID:PMC5039875.

Progress 05/01/15 to 04/30/16

Outputs
Target Audience: The target audiences will include the following groups: 1) researchers in agricultural fields who use various genomics approaches in their research and development; 2) students and teachers who use genome data, especially genome data from agriculture-related species, as teaching materials. In this period, we demonstrated the web portal we developed to researchers and students at the Plant and Animal Genomics (PAG) conference (San Diego, Jan 2016), where we gave a poster presentation and a computer demonstration to a large audience. During the last year, we showed our development to colleagues at the PAG meeting, at NIFA's PD meeting (San Diego, Jan 2016), and at the Gathering On Functional Annotation of ANimal Genomes workshop (GO-FAANG, DC, October 2015).
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? One undergraduate student from the University of California, San Diego worked as an intern on this project last year. His major is computer science and bioinformatics. Through this project he was trained in and gained professional skills in genomics, computational biology and software development.
How have the results been disseminated to communities of interest? We demonstrated the web portal and software tools to communities of interest at the Plant and Animal Genomics (PAG) conference (San Diego, Jan 2016), where we gave a poster presentation and a computer demonstration to a large audience. We also presented our software to researchers at NIFA's PD meeting and at the animal genomics annotation workshop. We tried to reach the communities through our own website and third-party websites (e.g. the Galaxy project website, YouTube) by providing general information about our software and its documentation.
What do you plan to do during the next reporting period to accomplish the goals? We will continue maintaining and developing the web portal and the underlying software tools for RNA-seq data analysis. We will provide the communities with software support. We will continue updating the genome reference databases that serve the computational pipelines behind the web portal. We will closely watch for new animal genomes being sequenced and add them to our portal when they become available. We will add new data analysis tools to the web portal according to the needs and feedback of the user communities.

Impacts
What was accomplished under these goals?
Impact of the project: The software tools and pipelines for RNA-seq data analysis for animal species have been developed and released through our web portal for public use. The portal provides researchers worldwide in the animal field with effective tools for applying genomics approaches in animal study, research and development.
Accomplishments of the project: We originally proposed 18 tasks under these 3 major objectives. In the previous report periods, we reported the completion of several tasks, including task 1.1, integrate existing agricultural animal genomic data; task 1.2, quarterly update of genomic data; task 1.3, obtain third-party genomic data; task 1.4, in-house RNA-seq sequencing; task 2.1, master individual computational tools; task 2.2, parallelization of tools; tasks 2.3 & 2.4, read mapping pipeline and assembly pipeline; and task 3.1, portal design. In this project period, we continued the recurring task 1.2, quarterly update of genomic data. All the animal genome data are up to date, including chicken, cow, duck, goat, pig, horse, rabbit, sheep and turkey, as well as several other model organisms. All the genome data we downloaded and further processed, including genome data formatted and indexed for bwa, bowtie2, STAR, RSEM, BLASTN, BLASTP and IGV, are available for download through both our web and FTP servers (task 1.6). We spent most of our effort on pipeline and portal development, including task 2.5, genome-dependent RNA-seq assembly; task 2.6, post-analysis pipeline; and tasks 3.2 & 3.3, web interface for workflows and standalone tools. We further integrated the pipelines and tools and reorganized them into three end-to-end workflows. The first workflow uses the Tuxedo suite of tools (TopHat, Cufflinks, Cuffmerge and Cuffdiff). The second workflow uses Trinity for de novo assembly, RSEM for transcript quantification and edgeR for differential analysis. The third combines STAR, RSEM, and edgeR for data analysis. All these workflows support multiple samples and multiple groups of samples and perform differential analysis between groups in a single workflow job submission. This greatly reduces the time and effort users need to spend on our web portal. To further improve the performance of the web portal and reduce the compute cost of large-scale RNA-seq data processing, we significantly improved our cyberinfrastructure in the Amazon cloud environment. We used the Galaxy and StarCluster software tools and further developed our in-house workflow engine.
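To make the third workflow more concrete, here is a rough sketch of the first two steps (STAR alignment, then RSEM quantification) expressed as command lines for one paired-end sample. The report does not state which options the portal actually uses, so the option choices, file paths and the function name `star_rsem_commands` are assumptions made for illustration. The sketch only builds the argument lists and does not execute anything.

```python
from pathlib import Path


def star_rsem_commands(sample, fastq_r1, fastq_r2, star_index, rsem_reference,
                       out_dir, threads=8):
    """Build STAR and RSEM argument lists for one paired-end sample (illustrative only)."""
    prefix = str(Path(out_dir) / f"{sample}.")
    star_cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", str(star_index),
        "--readFilesIn", str(fastq_r1), str(fastq_r2),
        "--readFilesCommand", "zcat",        # assumes gzip-compressed FASTQ input
        "--outSAMtype", "BAM", "Unsorted",
        "--quantMode", "TranscriptomeSAM",   # also write transcriptome-space alignments for RSEM
        "--outFileNamePrefix", prefix,
    ]
    # STAR names the transcriptome alignment file by appending this suffix to the prefix.
    transcriptome_bam = prefix + "Aligned.toTranscriptome.out.bam"
    rsem_cmd = [
        "rsem-calculate-expression",
        "--alignments", "--paired-end",
        "-p", str(threads),
        transcriptome_bam,
        str(rsem_reference),
        str(Path(out_dir) / sample),
    ]
    return [star_cmd, rsem_cmd]


if __name__ == "__main__":
    # All paths below are hypothetical placeholders.
    for cmd in star_rsem_commands("goat_liver_1", "r1.fq.gz", "r2.fq.gz",
                                  "index/star", "index/rsem/goat", "results"):
        print(" ".join(cmd))
```

The per-sample RSEM results would then be merged into a count matrix and passed to edgeR for the between-group comparison, which is outside the scope of this sketch.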

Publications

Type: Journal Articles Status: Published Year Published: 2016 Citation: Robert W. Li, Weizhong Li, Jiajie Sun, Peng Yu, Ransom L. Baldwin, Joseph F. Urban. The effect of helminth infection on the microbial composition and structure of the caprine abomasal microbiome. Scientific Reports (2016) 6:20606
Type: Journal Articles Status: Under Review Year Published: 2016 Citation: Weizhong Li, R. Alexander Richter, Yunsup Jung, Robert W Li. Web-based bioinformatics workflows for end-to-end RNA-seq data computation and analysis in agricultural animal species. BMC Genomics. under review

Progress 05/01/14 to 04/30/15

Outputs
Target Audience: During the NIFA Joint Animal Nutrition, Growth and Lactation; Feed Efficiency; and Animal Genomics PD meeting, held in conjunction with the 2014 JAM in July 2014, we gave a poster presentation on the project, pipeline and computational tools to other PDs and researchers in the field.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Nothing Reported
How have the results been disseminated to communities of interest? Nothing Reported
What do you plan to do during the next reporting period to accomplish the goals? We will continue the project according to our original development plans and timelines to implement the software and portal, conduct the research and produce the proposed deliverables.

Impacts
What was accomplished under these goals? We originally proposed 18 tasks under these 3 major objectives, together with a development time frame. In the last report period, we reported the completion of several tasks, including task 1.1, integrate existing agricultural animal genomic data; task 1.2, quarterly update of genomic data; task 2.1, master individual computational tools; task 2.2, parallelization of tools; tasks 2.3 & 2.4, read mapping pipeline and assembly pipeline; and task 3.1, portal design. In this project period, we continued the recurring tasks and also performed new tasks. Task 1.2, quarterly update of genomic data. We have been using ENSEMBL as the primary source for genome data and have updated it quarterly. Task 1.3, obtain third-party genomic data. The goat genome and annotation were downloaded from the Kunming Institute of Zoology in China (http://goat.kiz.ac.cn/GGD/) and the International Goat Genome Consortium (http://www.goatgenome.org/home.html). We searched the literature and repositories for available animal RNA-seq data. We found about 2000 animal RNA-seq runs in NCBI SRA, have downloaded several datasets and will use them for other tasks. Task 1.4, in-house RNA-seq sequencing. Besides the RNA-seq data from public sources and the data generated in our previous studies, we generated additional RNA-seq data. Over 120 RNA-seq samples were sequenced with >2 GB per sample. The sequenced species include caprine, ovine, bovine, porcine and human, from various tissue types. Tasks 2.3 & 2.4, read mapping pipeline and assembly pipeline. To make these pipelines more robust and more scalable, after extensive testing and development, we adopted Amazon cloud resources as our development and production environment. We further developed the workflow engine to run in the cloud. Together with the cloud cluster management software StarCluster, this significantly reduced our hardware and software maintenance effort. Task 3.1, portal design. We continued to optimize web portal development by using new public software tools, which have been evolving rapidly. We found that Galaxy provides more extensive features and functions for web portal implementation, data sharing and user management than some of the older frameworks we used earlier (e.g. the CAMERA cyberinfrastructure). We have tested Galaxy in our project and adopted it as the main portal.
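For the recurring Ensembl update (task 1.2), a small helper like the one below could derive the download locations of a species' genome sequence and gene annotation for a given release. This is a sketch under stated assumptions: it follows the public Ensembl FTP directory layout, but the function name and the example species, assembly and release number are illustrative and should be checked against the release actually being mirrored. The code only constructs URLs and performs no network access.

```python
ENSEMBL_FTP = "https://ftp.ensembl.org/pub"


def ensembl_urls(species, assembly, release):
    """Return genome FASTA and GTF URLs for one species in one Ensembl release.

    species is the lowercase Ensembl directory name, e.g. "bos_taurus".
    assembly must match the assembly name used in that release's file names.
    """
    file_stem = f"{species.capitalize()}.{assembly}"
    base = f"{ENSEMBL_FTP}/release-{release}"
    return {
        "genome": f"{base}/fasta/{species}/dna/{file_stem}.dna.toplevel.fa.gz",
        "annotation": f"{base}/gtf/{species}/{file_stem}.{release}.gtf.gz",
    }


if __name__ == "__main__":
    # Illustrative values only.
    for kind, url in ensembl_urls("bos_taurus", "UMD3.1", 87).items():
        print(kind, url)
```

A scheduled job could call such a helper once per quarter for each supported species, download the files, and rebuild the aligner indexes that depend on them.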

Publications

Type: Journal Articles Status: Under Review Year Published: 2015 Citation: R Li, S Wu, C Li, W Li, and S Schroeder. Splice variants and regulatory networks associated with host resistance to the intestinal worm Cooperia oncophora in cattle. Veterinary Parasitology, under review