Separating homoeologs by phasing in the tetraploid wheat transcriptome
Ksenia V. Krasileva, Vince Buffalo, Paul Bailey, Stephen Pearce, Sarah Ayling, Facundo Tabbita, Marcelo Soria, Shichen Wang, International Wheat Genome Sequencing Consortium, Eduard Akhunov, Cristobal Uauy and Jorge Dubcovsky
Link to manuscript in press in Genome Biology: 10.1186/gb-2013-14-6-r66
We obtained high-quality paired-end reads from multiple cDNA libraries of diploid wheat (T. urartu, 249 million reads, BioProject PRJNA191053) and tetraploid wheat (T. turgidum, 489 million reads, BioProject PRJNA191054) and assembled their respective transcriptomes using a multiple k-mer assembly strategy. To separate homoeologs in tetraploid wheat, we established a post-assembly pipeline that included polymorphism identification, phasing, read sorting, and re-assembly of the phased reads. We also used a comparative genomics approach to predict open reading frames (ORFs) and pseudogenes, and annotated 37,806 ORFs in diploid wheat and 66,633 in tetraploid wheat (excluding pseudogenes). This data set was complemented with a set of 27,544 non-redundant wheat ORFs from other projects.
These databases can be downloaded from the links below or searched by BLAST at:
This section mirrors the Supplementary Materials section of the paper
Supplemental File 1. T. turgidum benchmark genes: 52 genes (26 A-B homoeolog pairs) previously sequenced in our lab and annotated for gene structure by comparison of transcripts and genomic data from the A and B genomes. FASTA file, zipped.
Supplemental File 2. T. urartu contigs: 86,247 contigs. FASTA file, zipped.
Supplemental File 3. T. turgidum contigs: 140,118 contigs. FASTA file, zipped.
Supplemental File 4. T. urartu annotation: ORF coordinates, predicted premature stop codons, and frame shift mutations. GFF file, zipped.
Supplemental File 5. T. turgidum annotation: ORF coordinates, predicted premature stop codons and frame shift mutations. GFF file, zipped.
Supplemental File 6. T. urartu ORFs: 37,806 open reading frames excluding putative pseudogenes. Predicted by "findorf" using a comparative approach. FASTA file, zipped.
Supplemental File 7. T. turgidum ORFs: 66,633 open reading frames excluding pseudogenes. Predicted by "findorf" using a comparative approach. FASTA file, zipped.
Supplemental File 8. T. urartu proteins: 37,806 proteins translated from the ORFs in Supplemental File 6. FASTA file, zipped.
Supplemental File 9. T. turgidum proteins: 66,633 proteins translated from the ORFs in Supplemental File 7. FASTA file, zipped.
Supplemental File 10. T. turgidum un-phased polymorphisms: List of polymorphisms between A and B homoeologs. VCF file, zipped.
Supplemental File 11. T. turgidum phased polymorphisms: "HapCUT" generated table describing the phase of the SNPs for individual blocks. Text table, zipped.
Supplemental File 12. T. turgidum homoeolog-specific sub-assemblies: Reads from each block were sorted based on phased SNPs with the program "readphaser" and re-assembled independently with MIRA. FASTA file, zipped.
Supplemental File 13. T. urartu gene models: Exon structure predicted by "Exonerate" based on alignments between T. urartu ORFs from Supplemental File 6 and Chinese Spring chromosome arm genomic sequences. Excel file (xlsx).
Supplemental File 14. T. turgidum gene models: Exon structure predicted by "Exonerate" based on alignments between T. turgidum ORFs from Supplemental File 7 and Chinese Spring chromosome arm genomic sequences. Excel file (xlsx).
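The protein files (Supplemental Files 8 and 9) are translations of the ORF files (Supplemental Files 6 and 7). As a rough illustration of that relationship, the sketch below translates a single in-frame ORF with the standard genetic code. It is not the code used to produce the files; the function name `translate` and the example sequence are made up for illustration, and unknown codons (for example those containing N) are written as X by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate an in-frame ORF, stopping at the first stop codon.

    Hypothetical helper: codons not in the table become 'X'.
    """
    orf = orf.upper()
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Made-up example ORF: ATG GCT TGG TAA
example_protein = translate("ATGGCTTGGTAA")
```

A stop codon that appears well before the end of an ORF would truncate the translation here; the annotation files (Supplemental Files 4 and 5) record predicted premature stop codons separately.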
These files are not part of the Genome Biology paper, but are useful to complement the set of tetraploid wheat predicted genes and proteins in Supplemental Files 7 and 9.
Supplemental File 15. Published wheat transcripts (non-redundant): A non-redundant set of 146,300 contigs from reassembled wheat ESTs (1,177,020 T. aestivum and T. turgidum ESTs from NCBI) and T. aestivum transcriptomes from Schreiber et al. (2012), Brenchley et al. (2012) and Cantu et al. (2011). Note (08/20/2013): This is version 2 of the file, with chloroplast and mitochondrial sequences removed. FASTA file, zipped.
Supplemental File 16. Published wheat transcripts (non-redundant) annotation: Annotation of Supplemental File 15. GFF file, zipped.
Supplemental File 17. Complementary wheat ORFs from published wheat transcripts: 27,544 ORFs predicted by "findorf" from the non-redundant published transcripts, excluding pseudogenes, sequences shorter than 90 bp, and ORFs similar to those present in the T. turgidum transcriptome. No additional filtering was applied (this file is known to include some chloroplast and mitochondrial sequences). This file can be used in combination with Supplemental File 7 to obtain a more comprehensive list of wheat genes. FASTA file, zipped.
Supplemental File 18. Complementary wheat proteins: 27,544 proteins predicted from the ORFs in Supplemental File 17. This file can be used in combination with Supplemental File 9 to obtain a more comprehensive list of wheat proteins. FASTA file, zipped.
Supplemental File 19. Complementary wheat gene models: Exon structure predicted by "Exonerate" based on alignments between wheat ORFs from Supplemental File 17 and Chinese Spring chromosome arm genomic sequences. Excel file (xlsx).
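Combining the complementary sets with the T. turgidum sets (Supplemental File 17 with 7, or 18 with 9) amounts to merging two FASTA files. The sketch below shows one way to do this in plain Python. The function names, the toy sequences, and the rule of keeping the first record when two identifiers collide are assumptions made for illustration; in practice the in-memory strings would be replaced by the unzipped files opened with `open()`.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Use the first word of the header as the identifier.
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def combine(primary, complementary):
    """Merge two FASTA sets; on an identifier clash keep the primary record."""
    combined = dict(read_fasta(primary))
    for name, seq in read_fasta(complementary):
        combined.setdefault(name, seq)
    return combined


# Toy stand-ins for the two ORF files (identifiers and sequences are made up).
turgidum_orfs = io.StringIO(">orf1\nATGAAA\nTAA\n>orf2\nATGCCCTAG\n")
published_orfs = io.StringIO(">orf2\nATGTTTTGA\n>orf3\nATGGGGTAA\n")
merged = combine(turgidum_orfs, published_orfs)
```

Because Supplemental File 17 already excludes ORFs similar to those in the T. turgidum transcriptome, no sequence-level deduplication is attempted here.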
The “zip” file link is the original version used in the paper, and the “current version” is the GitHub link to the most up-to-date version and documentation.
readphaser. Takes phased variants from the program HapCUT and a corresponding BAM file of alignments to produce a FASTA file of phased reads (with phasing block and haplotype in the header). Python program (zip). Current version: https://github.com/vsbuffalo/readphaser
Parallel MIRA assembly. Phased reads sorted by readphaser were reassembled using MIRA and a Perl script described in the attached Word file (Parallel Mira Assembly).
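To make the read-sorting step concrete, the sketch below assigns toy reads to one of two haplotypes by counting matches to phased SNP alleles within a single block. It illustrates the general idea only and is not the readphaser implementation: it works on in-memory tuples rather than HapCUT output and BAM files, assumes ungapped alignments with 0-based positions, and uses a made-up FASTA header layout and a simple majority vote.

```python
def assign_haplotype(read_start, read_seq, phased_snps):
    """Return 0 or 1 for the better-supported haplotype, or None if undecided.

    phased_snps maps a 0-based contig position to (allele_hap0, allele_hap1).
    The read is assumed to align without gaps starting at read_start.
    """
    votes = [0, 0]
    for pos, alleles in phased_snps.items():
        offset = pos - read_start
        if 0 <= offset < len(read_seq):
            base = read_seq[offset]
            for hap in (0, 1):
                if base == alleles[hap]:
                    votes[hap] += 1
    if votes[0] == votes[1]:
        return None  # no informative SNP covered, or conflicting evidence
    return 0 if votes[0] > votes[1] else 1


def sort_reads(reads, phased_snps, block_id):
    """Build FASTA records for phased reads (header layout is hypothetical)."""
    records = []
    for name, start, seq in reads:
        hap = assign_haplotype(start, seq, phased_snps)
        if hap is not None:
            records.append(f">{name} block={block_id} haplotype={hap}\n{seq}")
    return records


# Toy block with two phased SNPs and three reads (all values are made up).
snps = {3: ("A", "G"), 8: ("C", "T")}
reads = [
    ("r1", 0, "TTTATTTTCT"),  # carries A at 3 and C at 8
    ("r2", 2, "TGTTTTTTT"),   # carries G at 3 and T at 8
    ("r3", 4, "TTT"),         # covers neither SNP
]
phased = sort_reads(reads, snps, "block1")
```

Reads that cover no phased SNP, such as r3 above, are left unassigned in this sketch; the records for each haplotype of a block could then be written to separate files and assembled independently.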
Schreiber AW, Hayden MJ, Forrest KL, Kong SL, Langridge P, Baumann U: Transcriptome-scale homoeolog-specific transcript assemblies of bread wheat. BMC Genomics 2012, 13:492. DOI:10.1186/1471-2164-13-492
Brenchley R, Spannagl M, Pfeifer M, Barker GL, D'Amore R, Allen AM, McKenzie N, Kramer M, Kerhornou A, Bolser D et al: Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 2012, 491(7426):705-710. DOI:10.1038/nature11650
Mochida K, Yoshida T, Sakurai T, Ogihara Y, Shinozaki K: TriFLDB: a database of clustered full-length coding sequences from Triticeae with applications to comparative grass genomics. Plant Physiol 2009, 150:1135-1146. DOI:10.1104/pp.109.138214
Cantu D, Pearce SP, Distelfeld A, Christiansen MW, Uauy C, Akhunov E, Fahima T, Dubcovsky J: Effect of the down-regulation of the high Grain Protein Content (GPC) genes on the wheat transcriptome during monocarpic senescence. BMC Genomics 2011, 12:492. DOI:10.1186/1471-2164-12-492
This work has been funded by support provided to J. Dubcovsky by the Howard Hughes Medical Institute and the Gordon and Betty Moore Foundation (grant number GBMF3031) and in part by the National Research Initiative Competitive Grants no. 2011-68002-30029 (Triticeae-CAP) and 2011-67013-30077 from the USDA National Institute of Food and Agriculture. C. Uauy acknowledges support from the Biotechnology and Biological Sciences Research Council (BBSRC) (grant BB/J003557/1). K. Krasileva has been supported by a USDA NIFA post-doctoral fellowship (grant number 2012-67012-19811).