We created a new version of the nine-species benchmark originally described by Tran et al. [Tran2017]. To do so, we downloaded the RAW files from the same nine PRIDE projects (PXD005025, PXD004948, PXD004325, PXD004565, PXD004536, PXD004947, PXD003868, PXD004467, PXD004424) and converted them to MGF format using ThermoRawFileParser v1.3.4. We also downloaded the corresponding nine UniProt reference proteomes and constructed a Tide index for each one, using Crux version 4.1. For one species (Vigna mungo), no reference proteome is available, so we used the proteome of the closely related species Vigna radiata.
We allowed for the following variable modifications: Met oxidation, Asn deamidation, Gln deamidation, N-term acetylation, N-term carbamylation, N-term NH3 loss, and the combination of N-term carbamylation and NH3 loss, by using the tide-index options `--mods-spec 1M+15.994915,1N+0.984016,1Q+0.984016 --nterm-peptide-mods-spec 1X+42.010565,1X+43.005814,1X-17.026549,1X+25.980265 --max-mods 3`. Note that one of the nine experiments (Mus musculus) was performed using SILAC labeling, but we searched without SILAC modifications and hence include in the benchmark only PSMs from unlabeled peptides. Each index also contains a shuffled decoy peptide corresponding to each target peptide.
Each MGF file was searched against the corresponding index using the precursor window size and fragment bin tolerance specified in the original study. We used XCorr scoring with Tailor calibration, and we allowed for 1 isotope error in the selection of candidate peptides. All search results were then analyzed jointly per species using the Crux implementation of Percolator, with default parameters. For the benchmark, we retained all PSMs with Percolator q-value < 0.01. We identified 13 MGF files with fewer than 100 accepted PSMs each, and we eliminated all PSMs from these files from the benchmark.
We then post-processed the PSMs to eliminate peptides that are shared between species. Among the 229,984 unique peptides, we identified 3,797 (1.7%) that occur in more than one species. For each such peptide, we selected one of the associated species at random and then eliminated all PSMs containing that peptide in the other species. The final benchmark dataset consists of 2.8 million PSMs drawn from 343 RAW files, exported as annotated MGF files.
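The filtering and de-duplication steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline that produced the dataset: the `Psm` record, its field names, the `build_benchmark` function, and the fixed random seed are all hypothetical, and the sketch assumes that a peptide is identified by its sequence string and that the per-file count is taken before shared peptides are removed.

```python
import random
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Psm:
    """Hypothetical PSM record; the field names are illustrative only."""
    species: str
    mgf_file: str
    peptide: str
    q_value: float


def build_benchmark(psms, q_threshold=0.01, min_psms_per_file=100, seed=0):
    """Apply the three filters described in the text, in order."""
    # 1. Keep PSMs whose q-value is strictly below the threshold.
    accepted = [p for p in psms if p.q_value < q_threshold]

    # 2. Drop every PSM from files with too few accepted PSMs.
    per_file = Counter((p.species, p.mgf_file) for p in accepted)
    accepted = [
        p for p in accepted
        if per_file[(p.species, p.mgf_file)] >= min_psms_per_file
    ]

    # 3. For each peptide, pick one owning species at random and discard
    #    PSMs of that peptide from every other species.
    species_by_peptide = defaultdict(set)
    for p in accepted:
        species_by_peptide[p.peptide].add(p.species)
    rng = random.Random(seed)  # assumption: seeded only for reproducibility
    keeper = {
        peptide: rng.choice(sorted(species_by_peptide[peptide]))
        for peptide in sorted(species_by_peptide)
    }
    return [p for p in accepted if keeper[p.peptide] == p.species]


# Toy input with a lowered per-file threshold so the example stays small.
toy = [
    Psm("human", "a.mgf", "PEPTIDE", 0.001),
    Psm("human", "a.mgf", "SHARED", 0.002),
    Psm("mouse", "b.mgf", "SHARED", 0.003),
    Psm("mouse", "b.mgf", "MOUSEK", 0.004),
    Psm("mouse", "b.mgf", "BAD", 0.5),
    Psm("yeast", "c.mgf", "LONELY", 0.001),
]
kept = build_benchmark(toy, min_psms_per_file=2)
```

In the toy input, `BAD` fails the q-value cut, `LONELY` comes from a file with too few accepted PSMs, and `SHARED` survives in exactly one of the two species. For a peptide found in only one species, the random choice is trivial, so such PSMs always pass the third step.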
Note: the initial data submission contained annotated MGF files without considering the N-terminal modifications mentioned above. The update available in the `/MSV000090982/updates/2024-05-14_woutb_71950b89/peak/9speciesbenchmark` FTP directory contains the corrected MGF files that are directly compatible with Casanovo.
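As a quick sanity check on a downloaded file, the peptide annotations can be listed with a few lines of Python. The sketch below assumes that each spectrum is delimited by `BEGIN IONS` and `END IONS` and that the annotation sits in a `SEQ=` header line, which is the convention we understand annotated MGF files for de novo sequencing tools to follow; the function name `read_annotations` and the example spectra are made up for illustration.

```python
import io


def read_annotations(handle):
    """Yield (title, sequence) for each spectrum in an MGF stream.

    Assumes the peptide annotation is stored in a `SEQ=` header line.
    Spectra without that line yield None as the sequence.
    """
    in_spectrum = False
    title = seq = None
    for raw in handle:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_spectrum, title, seq = True, None, None
        elif line == "END IONS":
            if in_spectrum:
                yield title, seq
            in_spectrum = False
        elif in_spectrum and line.startswith("TITLE="):
            title = line[len("TITLE="):]
        elif in_spectrum and line.startswith("SEQ="):
            seq = line[len("SEQ="):]


# Made-up spectra: the first is annotated, the second is not.
example = io.StringIO(
    "BEGIN IONS\n"
    "TITLE=scan1\n"
    "PEPMASS=500.25\n"
    "CHARGE=2+\n"
    "SEQ=PEPTIDEK\n"
    "100.0 1.0\n"
    "END IONS\n"
    "BEGIN IONS\n"
    "TITLE=scan2\n"
    "PEPMASS=400.2\n"
    "150.0 2.0\n"
    "END IONS\n"
)
annotations = list(read_annotations(example))
```

For a real file, replace the `io.StringIO` object with `open(path)`. The sketch only reports the annotation strings as written and does not interpret any modification notation inside them.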
[Tran2017] Tran, N. H., Zhang, X., Xin, L., Shan, B. & Li, M. De novo peptide sequencing by deep learning. Proceedings of the National Academy of Sciences of the United States of America 114(31), 8247-8252 (2017).
[doi:10.25345/C52V2CK8J]
[dataset license: CC0 1.0 Universal (CC0 1.0)]
Keywords: de novo ; benchmark
Principal Investigators (in alphabetical order): William Stafford Noble, University of Washington, USA
Submitting User: woutb