UniProt Module

The UniProt module in the Protein Data Handler project handles interactions with the UniProt database: extracting, processing, and storing protein data. It uses an ORM layer (SQLAlchemy) to persist the extracted data in a SQL database.

Class Overview

The UniProtExtractor class, an extension of BioinfoExtractorBase, is the core of this module. It manages the complexities of fetching and processing protein data from UniProt, including downloading records, parsing annotations, and integrating with a SQL database.

class protein_metamorphisms_is.information_system.uniprot.UniProtExtractor(conf)

Bases: ExtractorBase

A class for extracting and processing data from UniProt, a comprehensive resource for protein sequence and annotation data. UniProt provides a rich collection of protein sequence and functional information, which includes protein names, descriptions, taxonomic data, and sequence annotations.

This class extends BioinfoExtractorBase and provides specific implementations for extracting and processing data from UniProt.

download_record(accession_code)

Download detailed protein information from UniProt using ExPASy and SwissProt.

ExPASy is a bioinformatics resource portal that provides access to scientific databases and software tools, while SwissProt is the manually annotated and reviewed protein sequence database that is part of UniProt.

Parameters: accession_code (str) – The accession code of the protein record to download.

extract_entries()

Download and process UniProt entries concurrently using multi-threading.

Uses ThreadPoolExecutor for concurrent downloads, which speeds up extraction considerably, especially for large datasets.
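The concurrent download pattern can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's actual code: fake_download, extract_all, and the sample accession codes are hypothetical stand-ins, and a real run would perform the UniProt download instead of calling the stub.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(accession_code):
    """Hypothetical stand-in for the network download of one record."""
    return {"accession": accession_code, "length": len(accession_code)}


def extract_all(accession_codes, download=fake_download, max_workers=4):
    """Download every record concurrently; collect results and failures."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each submitted future back to the accession code it handles.
        futures = {executor.submit(download, code): code for code in accession_codes}
        for future in as_completed(futures):
            code = futures[future]
            try:
                results[code] = future.result()
            except Exception as exc:
                # One failed download should not stop the remaining ones.
                errors[code] = exc
    return results, errors


results, errors = extract_all(["P12345", "Q8N158", "O00141"])
```

Collecting failures per accession code, rather than letting the first exception propagate, keeps one bad record from aborting a long extraction run.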

load_access_codes(search_criteria, limit)

Load access codes from UniProt based on the given search criteria and limit.

Fetches accession codes from UniProt using RESTful API calls. Accession codes are unique identifiers for protein records in UniProt. The function uses these codes to selectively download detailed protein information in later stages.

Parameters:

search_criteria (str) – The search criteria for querying UniProt.
limit (int) – The maximum number of results to fetch from UniProt. (A parameter requested by UniProt with no significant impact.)
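As a rough sketch of how such a query could be assembled, the example below builds a request URL and parses a plain-text list of accession codes. It assumes the public UniProtKB search endpoint with its query, format, and size parameters; whether the module uses this exact endpoint is an assumption, and build_search_url and parse_accession_list are hypothetical helper names. No network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public UniProtKB search service.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(search_criteria, limit):
    """Return a search URL asking for a plain list of accession codes."""
    params = {"query": search_criteria, "format": "list", "size": limit}
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


def parse_accession_list(response_text):
    """Split a 'list' format response (one accession per line) into codes."""
    return [line.strip() for line in response_text.splitlines() if line.strip()]


url = build_search_url("(structure_3d:true) AND (reviewed:true)", 100)

# Sample response text, made up for illustration.
codes = parse_accession_list("P12345\nQ8N158\n\n")
```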

start()

Start the data extraction process for UniProt.

Initiates the process of fetching protein data from UniProt based on predefined search criteria and limits. Implements logic for handling the extraction and processing of data in a structured and efficient manner.

store_entry(data)

Stores the downloaded UniProt data in the database.

Processes and stores detailed protein information including annotations, cross-references, and sequence data. The function is designed to handle both new entries and updates to existing records, ensuring data consistency.

Parameters: data (SwissProt.Record) – The UniProt data record to store.
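The "new entry or update" behavior can be illustrated with a small upsert. This sketch deliberately swaps SQLAlchemy for the standard-library sqlite3 module so that it runs without extra dependencies; the protein table, its columns, and the upsert_protein helper are hypothetical and do not reflect the module's real schema.

```python
import sqlite3


def upsert_protein(conn, entry_name, sequence):
    """Insert a protein row, or update its sequence if the entry exists."""
    # ON CONFLICT ... DO UPDATE requires SQLite 3.24 or newer.
    conn.execute(
        """
        INSERT INTO protein (entry_name, sequence)
        VALUES (?, ?)
        ON CONFLICT(entry_name) DO UPDATE SET sequence = excluded.sequence
        """,
        (entry_name, sequence),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (entry_name TEXT PRIMARY KEY, sequence TEXT)")

upsert_protein(conn, "EXAMPLE_HUMAN", "MKT")     # new entry
upsert_protein(conn, "EXAMPLE_HUMAN", "MKTAYI")  # update of the same entry
rows = conn.execute("SELECT entry_name, sequence FROM protein").fetchall()
```

Keying the row on a unique column means a repeated extraction run updates existing records instead of creating duplicates.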

Key Features

Data Extraction and Processing: Fetches and processes protein data from UniProt.
Concurrent Downloads: Uses multi-threading for efficient data retrieval.
SQL/ORM Integration: Interacts with the database using SQLAlchemy for data storage and management.
Error Handling: Implements error handling for reliable data extraction and processing.

Biopython Integration

The UniProtExtractor utilizes Biopython for parsing and handling protein data:

ExPASy Access: Uses Biopython's ExPASy module to download protein records through the ExPASy portal.
SwissProt Parsing: Employs SwissProt for parsing detailed protein information retrieved from UniProt.
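A minimal sketch of how these two Biopython modules are typically combined is shown below. It is not taken from the module's source: the function body is an assumption about how download_record might look, and running it requires Biopython and network access.

```python
def download_record(accession_code):
    """Fetch one UniProt record and parse it into a SwissProt.Record."""
    # Imported here so the sketch can be loaded without Biopython installed.
    from Bio import ExPASy, SwissProt

    handle = ExPASy.get_sprot_raw(accession_code)  # performs a network request
    try:
        return SwissProt.read(handle)  # expects exactly one record
    finally:
        handle.close()
```

The returned record exposes fields such as entry_name, accessions, sequence, and cross_references, which is the kind of information the storage step would consume.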

SQL/ORM Entities

The module interacts with several SQL/ORM entities:

Protein: Stores detailed protein information, together with the unique identifiers of protein records in UniProt.
PDBReference: Manages references to Protein Data Bank entries.
UniprotChains: Handles protein chain information linked to PDB entries.
GOTerm: Stores Gene Ontology terms associated with proteins.

These entities organize the data fetched from UniProt in a relational database.

Configuration

The UniProtExtractor class requires specific configuration settings for optimal operation. Below is a template for the configuration structure:

# System Configuration
max_workers: [Number of concurrent workers]

# Database Configuration
DB_USERNAME: [Database username]
DB_PASSWORD: [Database password]
DB_HOST: [Database host address]
DB_PORT: [Database port]
DB_NAME: [Database name]

# UniProt Extraction Settings
search_criteria: [UniProt search criteria], e.g. '(structure_3d:true) AND (reviewed:true)'
limit: [Maximum number of records to fetch per page; has no significant effect on the results]

Adjust these settings based on your project requirements and available resources.
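To make the template concrete, the sketch below checks that a configuration contains the keys listed above and assembles a SQLAlchemy-style database URL from them. Several things here are assumptions: that conf is a plain dict, that the database dialect is postgresql, and the helper names validate_conf and database_url. The values are placeholders.

```python
from urllib.parse import quote_plus

REQUIRED_KEYS = (
    "max_workers", "DB_USERNAME", "DB_PASSWORD", "DB_HOST",
    "DB_PORT", "DB_NAME", "search_criteria", "limit",
)


def validate_conf(conf):
    """Raise KeyError if any key from the template is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in conf]
    if missing:
        raise KeyError(f"Missing configuration keys: {', '.join(missing)}")


def database_url(conf, dialect="postgresql"):
    """Build a SQLAlchemy-style URL; the dialect is an assumption."""
    password = quote_plus(str(conf["DB_PASSWORD"]))  # escape special characters
    return (
        f"{dialect}://{conf['DB_USERNAME']}:{password}"
        f"@{conf['DB_HOST']}:{conf['DB_PORT']}/{conf['DB_NAME']}"
    )


# Placeholder values for illustration only.
conf = {
    "max_workers": 4,
    "DB_USERNAME": "user",
    "DB_PASSWORD": "p@ss",
    "DB_HOST": "localhost",
    "DB_PORT": 5432,
    "DB_NAME": "proteins",
    "search_criteria": "(structure_3d:true) AND (reviewed:true)",
    "limit": 100,
}
validate_conf(conf)
url = database_url(conf)
```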

Usage

Initialize UniProtExtractor with the configuration and start the extraction process:

from protein_metamorphisms_is.information_system.uniprot import UniProtExtractor

# Initialize with configuration (conf holds the settings described in the Configuration section)
extractor = UniProtExtractor(conf)

# Start extraction
extractor.start()

This initiates the extraction process, managing the download, processing, and storage of UniProt data.