Protein Information System (PIS)
Protein Information System (PIS) is an integrated biological information system focused on extracting, processing, and managing protein-related data. PIS consolidates data from UniProt, PDB, and GOA, enabling the efficient retrieval and organization of protein sequences, structures, and functional annotations.
The primary goal of PIS is to provide a robust framework for large-scale protein data extraction, facilitating downstream functional analysis and annotation transfer. The system is designed for high-performance computing (HPC) environments, ensuring scalability and efficiency.
Task Types
Base Tasks: Handle Object Relational Model synchronization with the database and provide a session.
Queue Tasks: Handle queuing processes with RabbitMQ for task distribution and fault tolerance.
GPU Tasks: Handle computationally intensive tasks using GPU.
Pipelines
- FANTASIA: Functional ANnoTAtion based on embedding space SImilArity.
Using the information system, the FANTASIA pipeline has been set up and is now available in its corresponding repository. For full documentation you can visit FANTASIA Documentation
Massive search of metamorphisms and multifunctionality (WIP)
Extraction
Accession Management: Manages the loading and processing of biological accession codes, ensuring data is correctly organized for subsequent analysis.
UniProt Extraction: Handles the downloading and processing of detailed protein information from UniProt, enriching the database with accurate and up-to-date data.
Embedding
Sequence Embedding: Facilitates encoding of protein and nucleotide sequences into dense vectors using pre-trained language models.
Structure Embedding: Computes 3Di embeddings from atomic protein structures by parsing and standardizing their 3D conformations. These embeddings capture local geometric patterns and enable structure-based comparison and annotation.