Overview
The peach genome sequencing project was initiated in 2008 by the
International Peach Genome Initiative (IPGI),
an international consortium led by Italian and
US scientists (Ignazio Verde, Albert Abbott, Jeremy Schmutz,
Michele Morgante and Daniel Rokhsar). The first version (Peach v1.0) was
released under the
Fort Lauderdale agreement in April 2010, and the results were
published in 2013 in Nature Genetics.
Peach v2.0 was generated from DNA of the doubled haploid cultivar
'Lovell' (PLOV2-2N), which means that the genes and intervening DNA are "fixed",
that is, identical for all alleles and both chromosomal copies of the genome.
This doubled haploid nature has facilitated a highly accurate and consistent assembly of the peach genome.
Peach v2.0 currently consists of 8 pseudomolecules representing the 8
chromosomes of peach, numbered according to their corresponding
linkage groups. The genome was sequenced to approximately 8.47-fold
coverage by whole genome shotgun sequencing with the accurate Sanger
methodology and was assembled using Arachne.
This new release (Peach v2.0) aims to improve the chromosome-scale assembly
and the annotation of repeats and genes.
The Peach v1.0 assembly was improved using large community molecular mapping
data obtained on three linkage maps. 7.3 Mb of previously unmapped
sequences (11 scaffolds) were integrated within the eight peach
pseudomolecules, and nine randomly oriented scaffolds (20 Mb) were correctly
oriented. The large mapping dataset also made it possible to fix seven
regions (12.2 Mb) that were incorrectly positioned along the pseudomolecules due to
misassembly. As a result of these mapping efforts, 99.2% of the Peach v2.0
sequence is now mapped and 97.9% is oriented.
The base accuracy and contiguity were improved using contigs generated by an
ABySS assembly of WGS Illumina reads (42x of 2x250 bp, 600 bp insert).
Advancements include the correction of homozygous SNPs (859) and indels (1347)
as well as the closure of minor assembly gaps (212 gaps closed with a gain of 25,199 bp).
As a result, the contiguity of Peach v2.0 was increased to a contig L50 of
255.4 kb (214.2 kb in Peach v1.0) and a contig N50 of 250 (294 in Peach v1.0).
On this page, L50 denotes a length and N50 a number of sequences.
The annotation of the repeated fraction was also enhanced, including low copy
repeats and the complete sequence and location of 1,157 non-autonomous helitrons.
Gene prediction and annotation were upgraded using transcript assemblies
obtained from 2.2 billion RNA-seq reads from different peach tissues and
organs. In total, after masking with the advanced repeats annotation, 26,873
protein-coding genes were predicted in the Peach v2.1 annotation, 991 fewer
than those predicted in Peach v1.0. Gene annotation was greatly enhanced with
the prediction of almost 20,000 new isoforms.
For use in publications, please cite the original paper in Nature Genetics:
The International Peach Genome Initiative (2013).
The high-quality draft genome of peach (Prunus persica) identifies unique
patterns of genetic diversity, domestication and genome evolution.
Nature Genetics 45, 487–494. doi:10.1038/ng.2586.
Please also cite the version (Peach v2.0) and the download location
(Phytozome, GDR or IGA).
Statistics
This release of Phytozome includes the JGI v2.1 gene annotation of assembly
v2.0. The assembly comprises 225.7 Mb arranged in 8 pseudomolecules, with a small
additional amount of mostly repetitive sequences in unmapped scaffolds.
Genome Size
Approximately 227.4 Mb arranged in 191 scaffolds
Approximately 224.6 Mb arranged in 2,525 contigs (~ 1.2% gap)
Scaffold N50 (L50) = 4 (27.4 Mbp)
Contig N50 (L50) = 250 (255.4 Kbp)
11 scaffolds larger than 50 Kbp, with 99.4% of the genome in scaffolds larger than 50 Kbp
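The N50/L50 figures and the gap percentage above can be reproduced from a list of sequence lengths. The following is a minimal sketch, not the pipeline used for this release; the function names `n50_l50` and `gap_percent` are hypothetical. It follows the convention used on this page, where N50 is a number of sequences and L50 is a length (many other tools swap the two names).

```python
from typing import Sequence, Tuple


def n50_l50(lengths: Sequence[int]) -> Tuple[int, int]:
    """Return (N50, L50) in this page's convention.

    N50 = smallest number of longest sequences covering at least half of
          the total length.
    L50 = length of the shortest sequence in that set.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)  # longest first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise AssertionError("unreachable: the full set always covers half")


def gap_percent(scaffold_total: float, contig_total: float) -> float:
    """Percentage of scaffold sequence that is gap (not covered by contigs)."""
    return 100.0 * (scaffold_total - contig_total) / scaffold_total


# Toy lengths, not peach data: total 30, half 15; 10 + 8 = 18 reaches it.
print(n50_l50([10, 8, 6, 4, 2]))
# Totals from the list above, in Mb; consistent with the quoted ~1.2% gap.
print(round(gap_percent(227.4, 224.6), 1))
```

Only the totals passed to `gap_percent` come from this page; the length list is a toy example.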
Loci
26,873 loci containing protein-coding genes
Transcripts
47,089 protein-coding transcripts
Sequencing, Assembly, and Annotation
Gene Prediction and Locus Naming
Short reads (~1B single-end and ~1.2B paired-end Illumina RNA-seq reads of
various lengths ranging from 75 bp to 100 bp, and 3M 454 reads) from various
labs around the globe were used to construct transcript assemblies (TAs)
(Shu et al., manuscript in preparation). 106,848 transcript assemblies
were constructed using PASA (Haas, 2003) from 383,498 sequences in total,
consisting of the TAs above as well as Sanger ESTs, and 23,448 transcript
assemblies from ESTs of related species (424,656 sequences).
Loci were determined by transcript assembly alignments and/or EXONERATE
alignments of proteins from arabidopsis (Arabidopsis thaliana), rice,
grape, soybean and Swiss-Prot eukaryote proteins to the Prunus persica genome,
soft-masked for repeats using RepeatMasker (Smit, 1996-2012), with up to 2 kb
of extension on both ends unless extending into another locus on the same strand.
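The extension rule above can be illustrated with a short sketch. This is an assumed interpretation, not the actual pipeline code: the function `extend_loci`, the 1-based inclusive coordinates, and the choice to stop an extension at the original boundary of the neighbouring locus are all assumptions made for illustration. The input is taken to be the non-overlapping loci of one strand on one chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, start <= end


def extend_loci(loci: List[Interval], chrom_len: int,
                max_ext: int = 2000) -> List[Interval]:
    """Extend each locus by up to max_ext bp on both ends.

    An extension is clipped so that it never enters the original span of
    a neighbouring locus on the same strand, nor runs off the chromosome.
    """
    ordered = sorted(loci)
    extended = []
    for i, (start, end) in enumerate(ordered):
        # First base after the previous locus, or the chromosome start.
        left_limit = ordered[i - 1][1] + 1 if i > 0 else 1
        # Last base before the next locus, or the chromosome end.
        right_limit = ordered[i + 1][0] - 1 if i + 1 < len(ordered) else chrom_len
        # min/max guards ensure a locus is never shrunk.
        new_start = min(start, max(start - max_ext, left_limit))
        new_end = max(end, min(end + max_ext, right_limit))
        extended.append((new_start, new_end))
    return extended


# Hypothetical loci 1 kb apart on a 9 kb sequence.
print(extend_loci([(5000, 6000), (7000, 8000)], chrom_len=9000))
```

Under this interpretation the flanks of two neighbouring loci may overlap each other in the space between them; only the neighbour's original span is protected.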
Gene models were predicted by homology-based predictors: FGENESH+
(Salamov, 2000), FGENESH_EST (similar to FGENESH+, but with ESTs as splice site
and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001).
The highest scoring predictions for each locus were selected using
multiple positive factors, including EST and protein support, and one
negative factor: overlap with repeats. The selected gene predictions were
improved by PASA; improvements include adding UTRs, correcting splicing,
and adding alternative transcripts. PASA-improved gene model proteins were
subjected to protein homology analysis against the above-mentioned proteomes to
obtain Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP
score to its MBH (mutual best hit) BLASTP score, and protein coverage is the
highest percentage of the protein aligned to the best of its homologs.
PASA-improved transcripts were selected based on Cscore, protein coverage,
EST coverage, and the overlap of their CDS with repeats. A transcript was
selected if its Cscore was at least 0.5 and its protein coverage at least 0.5,
or if it had EST coverage, provided that less than 20% of its CDS overlapped
with repeats. For gene models with more than 20% of the CDS overlapping repeats,
the Cscore had to be at least 0.9 and the homology coverage at least 70% for the
model to be selected. The selected gene models were subjected to Pfam analysis,
and gene models whose protein was more than 30% in Pfam TE domains were removed.
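The selection thresholds described above can be written as a small decision function. This is a sketch of the stated rules only, not the JGI pipeline itself; the class `Transcript`, its field names, and the helper names are hypothetical. The text does not say how a CDS with exactly 20% repeat overlap is handled, so treating it under the stricter rule is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    cscore: float               # BLASTP score / MBH BLASTP score
    protein_coverage: float     # fraction of protein aligned to best homolog
    has_est_coverage: bool      # any EST support
    cds_repeat_fraction: float  # fraction of CDS overlapping repeats
    te_domain_fraction: float   # fraction of protein in Pfam TE domains


def cscore(blastp_score: float, mbh_blastp_score: float) -> float:
    """Ratio of a hit's BLASTP score to the mutual-best-hit BLASTP score."""
    if mbh_blastp_score <= 0:
        raise ValueError("MBH BLASTP score must be positive")
    return blastp_score / mbh_blastp_score


def passes_homology_rules(t: Transcript) -> bool:
    if t.cds_repeat_fraction < 0.20:
        # Low repeat overlap: homology support OR EST support is enough.
        return (t.cscore >= 0.5 and t.protein_coverage >= 0.5) or t.has_est_coverage
    # Repeat-heavy CDS (assumption: exactly 20% also lands here):
    # strong homology is required and EST support alone does not rescue it.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


def keep_transcript(t: Transcript) -> bool:
    """Apply the homology/repeat rules, then the Pfam TE-domain filter."""
    return passes_homology_rules(t) and t.te_domain_fraction <= 0.30


# Hypothetical transcripts, not real peach gene models.
supported = Transcript(0.6, 0.55, False, 0.05, 0.0)
repeat_heavy = Transcript(0.8, 0.9, True, 0.40, 0.0)
print(keep_transcript(supported), keep_transcript(repeat_heavy))
```

The two branches are kept separate so that the asymmetry in the text stays visible: EST coverage can rescue a weakly conserved transcript only when its CDS is largely repeat-free.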
References
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654
Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0. 1996-2011.
Yeh, R.-F., Lim, L. P., and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.
Salamov, A. A. and Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res 10, 516-22.
Locus name and transcript name mapping from previous annotation version
The locus model name of a v1.0 gene is mapped to a corresponding v2.1 gene as an alias if 1)
the v1.0 and v2.1 loci overlap uniquely and appear on the same chromosome, and 2)
at least one pair of translated transcripts from the old and new loci are MBHs
(mutual best hits) with at least 70% normalized identity in a BLASTP alignment
(normalized identity defined as the number of identical residues divided by the
length of the longer sequence). 77.38% of v1.0 loci are mapped.
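The normalized identity criterion can be made concrete with a short sketch. The function names and the boolean inputs (`overlap_is_unique`, `same_chromosome`, `is_mutual_best_hit`) are hypothetical placeholders for results that would come from upstream overlap and BLASTP analyses; this is an illustration of the stated rule, not the code used to build the alias table.

```python
def normalized_identity(identical_residues: int, len_old: int, len_new: int) -> float:
    """Identical residues divided by the length of the longer protein."""
    longer = max(len_old, len_new)
    if longer <= 0:
        raise ValueError("protein lengths must be positive")
    return identical_residues / longer


def can_alias(overlap_is_unique: bool, same_chromosome: bool,
              is_mutual_best_hit: bool, identical_residues: int,
              len_old: int, len_new: int, min_identity: float = 0.70) -> bool:
    """Check both alias conditions for one pair of translated transcripts."""
    locus_ok = overlap_is_unique and same_chromosome
    pair_ok = (is_mutual_best_hit and
               normalized_identity(identical_residues, len_old, len_new) >= min_identity)
    return locus_ok and pair_ok


# Hypothetical pair: 300 identical residues, proteins of 350 and 400 aa.
print(normalized_identity(300, 350, 400))
print(can_alias(True, True, True, 300, 350, 400))
```

Dividing by the longer sequence penalizes pairs where one model is a fragment of the other: a 200-residue protein that matches perfectly inside a 400-residue protein scores only 0.5 and would not be aliased.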
Genome Browsers
Contacts
Principal Collaborators:
- Ignazio Verde, Consiglio per la Ricerca e la Sperimentazione in Agricoltura (email: ignazio DOT verde AT entecra DOT it)
JGI Contacts:
- Daniel Rokhsar (email: dsrokhsar AT gmail DOT com)
- Jeremy Schmutz (email: jschmutz AT hudsonalpha DOT org)
IGA Contacts:
- Michele Morgante (email: michele DOT morgante AT uniud DOT it)
- Simone Scalabrin (email: sscalabrin AT igatechnology DOT com)
GDR Contacts:
- Dorrie Main (WSU) (email: dorrie AT wsu DOT edu)
Associated Publications
International Peach Genome Initiative, Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F, Zhebentyayeva T, Dettori MT, Grimwood J, Cattonaro F, Zuccolo A, Rossini L, Jenkins J, Vendramin E, Meisel LA, Decroocq V, Sosinski B, Prochnik S, Mitros T, Policriti A, Cipriani G, Dondini L, Ficklin S, Goodstein DM, Xuan P, Del Fabbro C, Aramini V, Copetti D, Gonzalez S, Horner DS, Falchi R, Lucas S, Mica E, Maldonado J, Lazzari B, Bielenberg D, Pirona R, Miculan M, Barakat A, Testolin R, Stella A, Tartarini S, Tonutti P, Arús P, Orellana A, Wells C, Main D, Vizzotto G, Silva H, Salamini F, Schmutz J, Morgante M, Rokhsar DS. The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nature Genetics. 2013 May; 45(5): 487-494