UniProt Module
- class protein_information_system.operation.extraction.uniprot.UniProtExtractor(conf)
Bases:
QueueTaskInitializer
- class UniProtExtractor(conf, session_required=True)
Bases:
QueueTaskInitializer
The UniProtExtractor class is responsible for extracting, processing, and storing protein data from the UniProt database. It extends QueueTaskInitializer to integrate with the task queue system.
Purpose
This component handles the end-to-end retrieval of protein entries from UniProt, including sequence data, structural references (PDB), and GO annotations.
Dependency
Accessions must be preloaded into the database before running this extractor. This can be done using the AccessionManager class via either:
load_accessions_from_csv(): loads accession codes from a curated CSV file.
fetch_accessions_from_api(): retrieves accessions from the UniProt API based on semantic criteria.
Once accession entries exist, UniProtExtractor.enqueue() will publish them to the queue for processing.
Key Features
Enqueues tasks based on accession codes already present in the database.
Downloads and parses SwissProt records using BioPython.
Extracts cross-references to PDB and GOA, storing them with full relational integrity.
Designed to be scalable and robust for use in HPC environments.
Example Usage
    from protein_information_system.tasks.accessions import AccessionManager
    from protein_information_system.tasks.uniprot import UniProtExtractor

    # Step 1: Load accession codes
    config = {
        'load_accesion_csv': '../data/cafa5.csv',
        'load_accesion_column': 'id',
        'tag': 'CAFA5',
        'allowed_evidences': ['EXP', 'IDA'],
        'limit_execution': 100
    }
    accession_manager = AccessionManager(config)
    accession_manager.load_accessions_from_csv()

    # Step 2: Enqueue and process entries
    extractor = UniProtExtractor(config)
    extractor.enqueue()
    extractor.start()
- enqueue()
Enqueues tasks for all accession codes found in the database.
This method reads all previously stored accession codes and sends each as a task to the queue for processing.
This function assumes that accessions have already been loaded into the database (e.g., using AccessionManager).
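The behaviour described for enqueue() can be pictured with a minimal sketch. It assumes an in-memory list of accession codes and Python's standard queue.Queue standing in for the real database and message broker; the helper name enqueue_accessions and its limit argument are hypothetical and are not part of UniProtExtractor.

```python
import queue
from typing import Iterable, Optional


def enqueue_accessions(
    accession_codes: Iterable[str],
    task_queue: queue.Queue,
    limit: Optional[int] = None,
) -> int:
    """Publish each accession code as one task and return how many were published."""
    published = 0
    for code in accession_codes:
        # Stop early when an optional cap on the number of tasks is reached.
        if limit is not None and published >= limit:
            break
        task_queue.put(code)
        published += 1
    return published


if __name__ == "__main__":
    tasks = queue.Queue()
    count = enqueue_accessions(["P12345", "Q9Y6K9", "P69905"], tasks, limit=2)
    print(count, tasks.qsize())
```

Each accession becomes exactly one task, so a worker can process entries independently of the others.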
- get_or_create_association(entry_name, go_id, evidence_code, is_transferred=False, source_cluster_id=None, target_cluster_id=None, distance=None, embedding_type_id=None)
Create or retrieve a GOAnnotation entry.
- get_or_create_go_term(reference)
Retrieves or creates a Gene Ontology (GO) term in the database based on provided reference data.
- Parameters:
reference (list) – Contains the GO term details extracted from UniProt data.
- Returns:
The retrieved or newly created GOTerm object.
- Return type:
GOTerm
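For orientation, Biopython exposes each DR line of a SwissProt record as a tuple in record.cross_references, so a GO reference typically looks like ('GO', 'GO:0005737', 'C:cytoplasm', 'IDA:UniProtKB'). The sketch below shows one way such a reference could be unpacked and passed through a get-or-create step. The dictionary standing in for the database, the GOTermSketch class, and the function names are illustrative assumptions, not the module's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

# One-letter aspect prefixes used in UniProt GO cross-references.
ASPECTS = {
    "C": "cellular_component",
    "F": "molecular_function",
    "P": "biological_process",
}


@dataclass
class GOTermSketch:
    go_id: str
    category: str
    description: str


def parse_go_reference(reference: Sequence[str]) -> Tuple[str, str, str, str]:
    """Return (go_id, category, description, evidence_code) from a GO reference."""
    go_id = reference[1]
    aspect, _, description = reference[2].partition(":")
    # The last field looks like 'IDA:UniProtKB'; keep only the evidence code.
    evidence_code = reference[3].split(":", 1)[0]
    return go_id, ASPECTS.get(aspect, aspect), description, evidence_code


def get_or_create_go_term_sketch(
    store: Dict[str, GOTermSketch], reference: Sequence[str]
) -> GOTermSketch:
    """Return the stored term for this GO ID, creating it on first sight."""
    go_id, category, description, _ = parse_go_reference(reference)
    term = store.get(go_id)
    if term is None:
        term = GOTermSketch(go_id, category, description)
        store[go_id] = term
    return term
```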
- get_or_create_protein(data)
Retrieves or creates a protein record in the database.
- Parameters:
data (SwissProt.Record) – The UniProt record containing protein details.
- Returns:
The retrieved or newly created protein object.
- Return type:
Protein
- Raises:
Exception – If an error occurs during the operation.
- get_or_create_sequence(sequence)
Retrieves or creates a sequence entity in the database.
- Parameters:
sequence (str) – Amino acid sequence of a protein.
- Returns:
The retrieved or newly created sequence object.
- Return type:
Sequence
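The get-or-create pattern shared by these methods can be illustrated with a small, self-contained sketch. It uses the standard sqlite3 module and a hypothetical single-column sequence table; the actual schema, ORM, and session handling used by UniProtExtractor are not shown in this reference and may differ.

```python
import sqlite3


def make_demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'sequence' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequence ("
        "id INTEGER PRIMARY KEY, "
        "sequence TEXT UNIQUE NOT NULL)"
    )
    return conn


def get_or_create_sequence_id(conn: sqlite3.Connection, sequence: str) -> int:
    """Return the id of an existing row for this sequence, or insert a new one."""
    row = conn.execute(
        "SELECT id FROM sequence WHERE sequence = ?", (sequence,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO sequence (sequence) VALUES (?)", (sequence,)
    )
    conn.commit()
    return cursor.lastrowid
```

Looking up before inserting means that identical sequences shared by several proteins map to a single row.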
- get_or_create_structure(reference, protein_id)
Retrieves or creates a Structure entry in the database based on the provided PDB ID. If the structure does not exist, it creates a new one using the associated reference data.
- Parameters:
reference (list) – PDB reference data including PDB ID, method, and resolution.
protein_id (str) – The ID of the associated protein.
- Returns:
The retrieved or newly created Structure object.
- Return type:
Structure
- handle_cross_references(protein, cross_references)
Manages the cross-references associated with the protein, such as database links to PDB and GO terms.
- Parameters:
protein (Protein) – The protein object to manage.
cross_references (list) – List of cross-reference data.
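One way to picture the routing of cross-references is a dispatch on the database name, which is the first element of each reference tuple in Biopython's record.cross_references. The sketch below is an assumption about the general shape of such a dispatcher; the function name dispatch_cross_references and the handlers mapping are hypothetical.

```python
from typing import Callable, Dict, Iterable, Sequence

Reference = Sequence[str]


def dispatch_cross_references(
    cross_references: Iterable[Reference],
    handlers: Dict[str, Callable[[Reference], None]],
) -> Dict[str, int]:
    """Call the handler registered for each reference's database name.

    Returns how many references were handled per database; references
    for databases without a handler are skipped.
    """
    handled = {name: 0 for name in handlers}
    for reference in cross_references:
        if not reference:
            continue
        handler = handlers.get(reference[0])
        if handler is None:
            continue
        handler(reference)
        handled[reference[0]] += 1
    return handled


if __name__ == "__main__":
    references = [
        ("PDB", "1A4Y", "X-ray", "2.00 A", "A/D=1-460"),
        ("GO", "GO:0005737", "C:cytoplasm", "IDA:UniProtKB"),
        ("Pfam", "PF00001", "7tm_1", "1"),
    ]
    collected = []
    print(dispatch_cross_references(
        references, {"PDB": collected.append, "GO": collected.append}
    ))
```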
- handle_pdb_reference(protein, reference)
Specific handler for PDB references. Extracts the relevant segment, if one is specified, and stores it.
- Parameters:
protein (Protein) – The protein object associated with the PDB reference.
reference (list) – PDB reference data, including PDB ID, method, and resolution.
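As an illustration of what a PDB reference carries, a UniProt DR line such as "PDB; 1A4Y; X-ray; 2.00 A; A/D=1-460." is exposed by Biopython as the tuple ('PDB', '1A4Y', 'X-ray', '2.00 A', 'A/D=1-460'). The sketch below parses such a tuple into an ID, method, resolution, and per-chain segments. The names PDBReferenceSketch and parse_pdb_reference are hypothetical, and the handling of missing or malformed fields is an assumption rather than the module's documented behaviour.

```python
import re
from typing import List, NamedTuple, Optional, Sequence, Tuple


class PDBReferenceSketch(NamedTuple):
    pdb_id: str
    method: str
    resolution: Optional[float]             # in angstroms; None when not reported
    segments: List[Tuple[str, int, int]]    # (chain, start, end)


# Matches one chain specification such as 'A/D=1-460'.
_SEGMENT = re.compile(r"^([^=]+)=(\d+)-(\d+)$")


def parse_pdb_reference(reference: Sequence[str]) -> PDBReferenceSketch:
    """Parse a ('PDB', id, method, resolution, chains) cross-reference tuple."""
    pdb_id, method = reference[1], reference[2]

    raw_resolution = reference[3] if len(reference) > 3 else "-"
    try:
        # '2.00 A' -> 2.0; NMR entries use '-' and yield None.
        resolution = float(raw_resolution.split()[0])
    except (ValueError, IndexError):
        resolution = None

    segments: List[Tuple[str, int, int]] = []
    raw_chains = reference[4] if len(reference) > 4 else ""
    for part in raw_chains.split(","):
        match = _SEGMENT.match(part.strip().rstrip("."))
        if match is None:
            continue
        chains, start, end = match.groups()
        # 'A/D=1-460' means chains A and D both cover residues 1-460.
        for chain in chains.split("/"):
            segments.append((chain, int(start), int(end)))

    return PDBReferenceSketch(pdb_id, method, resolution, segments)
```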
- process(accession_code)
Downloads detailed protein information from UniProt using Biopython's ExPASy and SwissProt modules.
- Parameters:
accession_code (str) – The accession code of the protein record to download.
- Returns:
The protein record fetched from UniProt.
- Return type:
SwissProt.Record
- Raises:
ValueError – If no SwissProt record is found for the accession code.
Exception – If any other error occurs during data retrieval.
- store_entry(data)
Stores the retrieved UniProt data into the database.
Links accession codes to the associated protein and ensures all references and annotations are updated.
- Parameters:
data (SwissProt.Record) – The UniProt data to store.
- Raises:
Exception – If the storage process fails.
- update_protein_details(protein, data)
Updates the protein details in the database and ensures the protein is linked to the correct sequence.
- Parameters:
protein (Protein) – The protein object to update.
data (SwissProt.Record) – The data containing the new protein information.