Prunus persica v2 - IGA


	Peach Genome v2.0


Overview

The peach genome sequencing project was initiated in 2008 by the International Peach Genome Initiative (IPGI), an international consortium led by Italian and US scientists (Ignazio Verde, Albert Abbott, Jeremy Schmutz, Michele Morgante and Daniel Rokhsar). The first version (Peach v1.0) was released under the Fort Lauderdale agreement in April 2010, and the results were published in Nature Genetics in 2013.

Peach v2.0 was generated from DNA of the doubled haploid cultivar 'Lovell' (PLOV2-2N), which means that the genes and intervening DNA are "fixed", that is, identical for all alleles and both chromosomal copies of the genome. This doubled haploid nature has facilitated a highly accurate and consistent assembly of the peach genome. Peach v2.0 currently consists of 8 pseudomolecules representing the 8 chromosomes of peach, which are numbered according to their corresponding linkage groups. The genome was sequenced to approximately 8.47-fold coverage by whole-genome shotgun sequencing using the accurate Sanger methodology and was assembled using Arachne.

This new release (Peach v2.0) aims to improve the chromosome-scale assembly and the annotation of the repeated and gene sequences. The Peach v1.0 assembly was improved using a large set of community molecular mapping data obtained from three linkage maps. 7.3 Mb of previously unmapped sequences (11 scaffolds) were integrated into the eight peach pseudomolecules, and nine randomly oriented scaffolds (20 Mb) were correctly oriented. The large mapping dataset also made it possible to fix seven regions (12.2 Mb) that were incorrectly positioned along the pseudomolecules because of misassembly. As a result of these mapping efforts, 99.2% of the Peach v2.0 sequence is now mapped and 97.9% is oriented.

The base accuracy and contiguity were improved using contigs generated by an ABySS assembly of WGS Illumina reads (42x of 2x250 bp, 600 bp insert).
Improvements include the correction of 859 homozygous SNPs and 1,347 indels, as well as the closure of 212 minor assembly gaps (a gain of 25,199 bp). As a result, the contiguity of Peach v2.0 increased to a contig L50 of 255.4 kb (214.2 kb in Peach v1.0) and a contig N50 of 250 (294 in Peach v1.0). The annotation of the repeated fraction was also enhanced, including low-copy repeats and the complete sequence and location of 1,157 non-autonomous helitrons.

Gene prediction and annotation were upgraded using transcript assemblies obtained from 2.2 billion RNA-seq reads from different peach tissues and organs. In total, after masking with the improved repeat annotation, 26,873 protein-coding genes were predicted in the Peach v2.1 annotation, 991 fewer than were predicted in Peach v1.0. Gene annotation was further enhanced by the prediction of almost 20,000 new isoforms.

For use in publications, please cite the original paper in Nature Genetics: The International Peach Genome Initiative (2013). The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nature Genetics 45, 487–494 (2013) doi:10.1038/ng.2586. Please also cite the version (Peach v2.0) and the download location (Phytozome, GDR or IGA).

Statistics

This release of Phytozome includes the JGI v2.1 gene annotation of assembly v2.0. The assembly comprises 225.7 Mb arranged in 8 pseudomolecules, with a small additional amount of mostly repetitive sequence in unmapped scaffolds.
Genome Size

Approximately 227.4 Mb arranged in 191 scaffolds
Approximately 224.6 Mb arranged in 2,525 contigs (~1.2% gap)
Scaffold N50 (L50) = 4 (27.4 Mbp)
Contig N50 (L50) = 250 (255.4 Kbp)
11 scaffolds larger than 50 Kbp, with 99.4% of the genome in scaffolds larger than 50 Kbp

Loci

26,873 loci containing protein-coding genes

Transcripts

47,089 protein-coding transcripts

Sequencing, Assembly, and Annotation

Gene Prediction and Locus Naming

Short reads (~1 billion single-end and ~1.2 billion paired-end Illumina RNA-seq reads, ranging in length from 75 bp to 100 bp, plus 3 million 454 reads) from various labs around the globe were used to construct transcript assemblies (TAs) (Shu et al., manuscript in preparation). 106,848 transcript assemblies were constructed using PASA (Haas, 2003) from 383,498 sequences in total, consisting of the TAs above as well as Sanger ESTs, and 23,448 transcript assemblies were constructed from ESTs of related species (424,656 sequences).

Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, rice, grape and soybean, together with Swiss-Prot eukaryote proteins, to the Prunus persica genome soft-masked with RepeatMasker (Smit, 1996-2012), with up to 2 kb of extension on both ends unless this extended into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but with ESTs as splice-site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001).

The highest-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were then improved by PASA; the improvements include adding UTRs, correcting splicing, and adding alternative transcripts. The proteins of the PASA-improved gene models were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage.
Cscore is the ratio of a protein's BLASTP score to its mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best homolog. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the extent to which the CDS overlaps repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped repeats. For gene models whose CDS overlapped repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% covered by Pfam TE domains were removed.

References

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654
Smit, A.F.A., Hubley, R. and Green, P. RepeatMasker Open-3.0. 1996-2011.
Yeh, R.-F., Lim, L.P. and Burge, C.B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803-816.
Salamov, A.A. and Solovyev, V.V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516-522.

Locus name and transcript name mapping from previous annotation version

The locus model name of a v1.0 gene is mapped to a corresponding v2.1 gene as an alias if 1) the v1.0 and v2.1 loci overlap uniquely and appear on the same chromosome, and 2) at least one pair of translated transcripts from the old and new loci are mutual best hits (MBHs) with at least 70% normalized identity in a BLASTP alignment (normalized identity is defined as the number of identical residues divided by the length of the longer sequence). 77.38% of v1.0 loci are mapped.
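
As a rough illustration of the two threshold rules described in this section (transcript selection, and the identity test used when mapping v1.0 locus names to v2.1 aliases), the following Python sketch encodes them as small functions. The Transcript record, its field names and the function names are hypothetical and are not taken from the JGI pipeline. Treating a CDS with exactly 20% repeat overlap under the stricter rule is also an assumption, since the text does not say which rule applies at the boundary. Only the identity half of the alias criterion is shown; the unique-overlap, same-chromosome and mutual-best-hit checks are omitted.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record; field names are illustrative only."""
    cscore: float            # BLASTP score divided by the mutual-best-hit BLASTP score
    protein_coverage: float  # fraction of the protein aligned to its best homolog (0-1)
    has_est_support: bool    # True if the model has EST coverage
    repeat_overlap: float    # fraction of the CDS overlapping repeats (0-1)


def is_selected(t: Transcript) -> bool:
    """Apply the selection thresholds as read from the text."""
    if t.repeat_overlap < 0.20:
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_support
    # Assumption: exactly 20% overlap falls under the stricter rule.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


def normalized_identity(identical_residues: int, len_a: int, len_b: int) -> float:
    """Identical residues divided by the length of the longer sequence."""
    longer = max(len_a, len_b)
    if longer <= 0:
        raise ValueError("sequence lengths must be positive")
    return identical_residues / longer


def passes_alias_identity(identical_residues: int, len_a: int, len_b: int) -> bool:
    """Identity part of the alias criterion: at least 70% normalized identity."""
    return normalized_identity(identical_residues, len_a, len_b) >= 0.70
```

For example, under this reading a transcript with EST support but weak homology is kept when little of its CDS overlaps repeats, whereas the same transcript with half of its CDS in repeats is kept only if it meets the 0.9 Cscore and 70% coverage thresholds.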
Genome Browsers

Peach genome browsers are available at JGI (v1.0) and the Genome Database for Rosaceae (v1.0), while the Italian version is hosted at the Istituto di Genomica Applicata (IGA) (v2.0). Access to the raw sequence data is provided via the GBrowse link at the top of this page.

Once again, welcome to Peach v2.0! On behalf of IPGI and its collaborators.

Contacts

Principal Collaborators:
Ignazio Verde, Consiglio per la Ricerca e la Sperimentazione in Agricoltura (email: ignazio DOT verde AT entecra DOT it)

JGI Contacts:
Daniel Rokhsar (email: dsrokhsar AT gmail DOT com)
Jeremy Schmutz (email: jschmutz AT hudsonalpha DOT org)

IGA Contacts:
Michele Morgante (email: michele DOT morgante AT uniud DOT it)
Simone Scalabrin (email: sscalabrin AT igatechnology DOT com)

GDR Contacts:
Dorrie Main (WSU) (email: dorrie AT wsu DOT edu)

Associated Publications

International Peach Genome Initiative, Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F, Zhebentyayeva T, Dettori MT, Grimwood J, Cattonaro F, Zuccolo A, Rossini L, Jenkins J, Vendramin E, Meisel LA, Decroocq V, Sosinski B, Prochnik S, Mitros T, Policriti A, Cipriani G, Dondini L, Ficklin S, Goodstein DM, Xuan P, Del Fabbro C, Aramini V, Copetti D, Gonzalez S, Horner DS, Falchi R, Lucas S, Mica E, Maldonado J, Lazzari B, Bielenberg D, Pirona R, Miculan M, Barakat A, Testolin R, Stella A, Tartarini S, Tonutti P, Arús P, Orellana A, Wells C, Main D, Vizzotto G, Silva H, Salamini F, Schmutz J, Morgante M, Rokhsar DS. The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nature Genetics. 2013 May; 45(5): 487-494.