- Title:
- A distributed algorithm for protein identification from tandem mass spectrometry data
- Authors:
-
Orzechowska, Katarzyna
Rubel, Tymon
Kurjata, Robert
Zaremba, Krzysztof
- Links:
- https://bibliotekanauki.pl/articles/2097435.pdf
- Publication date:
- 2022
- Publisher:
- Polskie Towarzystwo Promocji Wiedzy
- Subjects:
-
proteomics
mass spectrometry
distributed computing
Apache Spark
- Description:
- Tandem mass spectrometry is an analytical technique widely used in proteomics for the high-throughput characterization of proteins in biological samples. Modern in-depth proteomic studies can require collecting millions of mass spectra representing short protein fragments (peptides). To identify the peptides, the measured spectra are most often scored against a database of amino acid sequences of known proteins. Due to the volume of input data and the sizes of proteomic databases, this is a resource-intensive task that requires an efficient and scalable computational strategy. Here, we present SparkMS, an algorithm for peptide and protein identification from mass spectrometry data explicitly designed to work in a distributed computational environment. To achieve the required performance and scalability, we use Apache Spark, a modern framework that is becoming increasingly popular not only in the field of “big data” analysis but also in bioinformatics. This paper describes the algorithm in detail and demonstrates its performance on a large proteomic dataset. Experimental results indicate that SparkMS scales with the number of worker nodes and the increasing complexity of the search task. Furthermore, it exhibits a protein identification efficiency comparable to X!Tandem, a widely used proteomic search engine.
- Source:
-
Applied Computer Science; 2022, 18, 2; 16--27
1895-3735
- Appears in:
- Applied Computer Science
- Content provider:
- Biblioteka Nauki