Protein annotation files

GO Annotations Queue Processor

This module defines GOAnnotationsQueueProcessor, a queue-integrated component of the Protein Information System (PIS) that:

Parses CAFA-formatted GO annotation files (TSV).
Loads and indexes protein sequences from a FASTA file.
Publishes per-protein tasks to the internal queue.
Persists proteins, accessions, sequences, GO terms (with category), and protein–GO associations into the relational database.

The implementation relies on ORM entities (Protein, Accession, Sequence, GOTerm, ProteinGOTermAnnotation) and a task-queue base class (QueueTaskInitializer).

Notes

CAFA TSV format expected per line: UniProtKB:ID<TAB>GO:XXXXXXX<TAB>Category. The category is optional in the file, but the DB model requires a category to create a new GO term, so a GO term whose category is missing is skipped at storage time.
FASTA headers are expected to follow the UniProt convention, such as sp|P12345|... or tr|Q8XYZ1|...; the accession is taken from the second field after splitting on '|'. If that field is not present, the full record ID is used.
Evidence codes for associations default to "UNKNOWN" unless provided by upstream sources.
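
As an illustration of the line format described in these notes, the following is a minimal sketch of a per-line parser. The function name parse_cafa_line is hypothetical, and stripping the UniProtKB: prefix from the first column is an assumption; the actual module may handle these details differently.

```python
from typing import Optional, Tuple


def parse_cafa_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Parse one CAFA TSV line into (accession, go_id, category).

    Returns None for comments, blank lines, and lines with fewer than
    two columns. Hypothetical helper, not part of the documented API.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None

    fields = line.split("\t")
    if len(fields) < 2:
        return None

    # Assumption: the database prefix (e.g. "UniProtKB:") is dropped.
    accession = fields[0].split(":", 1)[-1].strip()
    go_id = fields[1].strip()

    # The category column is optional; fall back to "UNKNOWN".
    category = fields[2].strip() if len(fields) > 2 else ""
    return accession, go_id, category or "UNKNOWN"
```

A line without a third column yields the "UNKNOWN" category, which mirrors the fallback documented for enqueue().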

class protein_information_system.operation.extraction.protein_annotations_file.GOAnnotationsQueueProcessor(conf: dict)

Bases: QueueTaskInitializer

Queue processor for Gene Ontology annotations.

Parameters:

conf (dict) – Configuration mapping. Recognized keys:
goa_annotations_file: Path to the CAFA-formatted TSV file (required).
goa_sequences_fasta: Path to the FASTA file with protein sequences (required).
limit_execution (optional): Integer limit on the number of TSV lines to process.

Side effects:

Loads all sequences from the FASTA file into an in-memory dictionary (self.sequences) at initialization to enable fast lookups during processing.
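
A configuration mapping for this class might look like the sketch below. Only the key names come from the parameter description; the file paths, the limit value, and the missing_conf_keys helper are hypothetical and shown for illustration.

```python
# Hypothetical values; only the key names are taken from the documentation.
conf = {
    "goa_annotations_file": "data/annotations.tsv",
    "goa_sequences_fasta": "data/sequences.fasta",
    "limit_execution": 1000,  # optional: cap on TSV lines to process
}

REQUIRED_KEYS = ("goa_annotations_file", "goa_sequences_fasta")


def missing_conf_keys(conf: dict) -> list:
    """Return the required keys that are absent from conf (hypothetical check)."""
    return [key for key in REQUIRED_KEYS if key not in conf]
```

Checking for missing keys before constructing the processor gives a clearer failure than an error raised midway through FASTA loading.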

enqueue() → None

Enqueue per-protein tasks parsed from a CAFA-formatted TSV file.

Expected TSV columns per line:

UniProtKB:ID<TAB>GO:XXXXXXX<TAB>Category

Lines beginning with # or blank lines are ignored.
The third column (Category) is optional in the file; if absent or blank, "UNKNOWN" is assigned. Unknown categories are allowed at enqueue time but may later cause GO term creation to be skipped during storage.
Entries are grouped by protein accession before publishing to the queue.
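
The grouping step can be sketched as follows. This assumes the rows have already been parsed into (accession, go_id, category) tuples; the helper names are hypothetical, and the payload keys follow those documented for process(). Publishing to the queue is left out because the base-class API is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str]  # (accession, go_id, category)


def group_by_protein(rows: Iterable[Row]) -> Dict[str, List[Tuple[str, str]]]:
    """Collect (go_id, category) pairs under each protein accession."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for accession, go_id, category in rows:
        grouped[accession].append((go_id, category))
    return dict(grouped)


def build_tasks(rows: Iterable[Row]) -> List[dict]:
    """Build one task payload per protein, in first-seen order."""
    return [
        {"protein_entry_id": accession, "go_terms": terms}
        for accession, terms in group_by_protein(rows).items()
    ]
```

Grouping first means each protein is published as a single task, so its annotations can later be stored together.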

get_or_create_accession(code: str, protein_id: str, primary: bool = True, tag: str | None = None) → Accession

Create or retrieve a UniProt Accession linked to a protein.

Parameters:

code (str) – Accession code (e.g., P12345) used as the primary key in Accession.
protein_id (str) – Identifier of the linked Protein (should match UniProt accession).
primary (bool, optional) – Whether this accession is the primary one for the protein (default True).
tag (Optional[str], optional) – Optional qualifier/tag for the accession.

Returns:

ORM instance corresponding to the accession.

Return type:

Accession

get_or_create_association(protein_id: str, go_id: str, evidence_code: str = 'UNKNOWN') → ProteinGOTermAnnotation | None

Create or retrieve a protein–GO association.

Parameters:

protein_id (str) – Protein identifier (UniProt accession).
go_id (str) – GO term identifier (e.g., GO:0008150).
evidence_code (str, optional) – Evidence code for the association (default "UNKNOWN").

Returns:

The existing ORM association if one is found; otherwise None. In the latter case a new association is added to the session but has not yet been flushed.

Return type:

Optional[ProteinGOTermAnnotation]

get_or_create_go_term(go_id: str, category: str) → GOTerm

Create or update a GOTerm with its category.

Parameters:

go_id (str) – GO term identifier (e.g., GO:0008150).
category (str) – GO category label (BP, MF, or CC). The DB schema enforces non-null constraints; empty/None categories are rejected.

Returns:

ORM instance corresponding to the GO term.

Return type:

GOTerm

Raises:

ValueError – If category is empty or None.
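
The validation and create-or-update behaviour can be illustrated with a plain dictionary standing in for the database session. This is a sketch under that assumption only: the store argument and the dict-based term records are hypothetical, and the real method works with GOTerm ORM instances.

```python
from typing import Dict, Optional


def get_or_create_go_term_sketch(
    store: Dict[str, dict], go_id: str, category: Optional[str]
) -> dict:
    """Return the term for go_id, creating it or updating its category.

    `store` is a hypothetical in-memory stand-in for the GOTerm table.
    """
    if category is None or not category.strip():
        raise ValueError(f"Category is required for GO term {go_id}")

    term = store.get(go_id)
    if term is None:
        # No record yet: create one.
        term = {"go_id": go_id, "category": category}
        store[go_id] = term
    elif term["category"] != category:
        # Record exists with a different category: update it in place.
        term["category"] = category
    return term
```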

get_or_create_protein(protein_entry_id: str) → Protein

Create or retrieve a Protein by UniProt accession.

Parameters:

protein_entry_id (str) – UniProt accession used as the primary key in the Protein table.

Returns:

ORM instance for the protein.

Return type:

Protein

get_or_create_sequence(sequence: str) → Sequence

Create or retrieve a Sequence entity by raw sequence value.

Parameters:

sequence (str) – Amino acid sequence string.

Returns:

ORM instance corresponding to the stored sequence.

Return type:

Sequence

Raises:

ValueError – If sequence is empty or None.

get_sequence_from_external_source(protein_entry_id: str) → str | None

Retrieve the sequence for a protein from the in-memory FASTA index.

Parameters:

protein_entry_id (str) – UniProt accession for which to retrieve the sequence.

Returns:

The amino acid sequence if found; otherwise None.

Return type:

Optional[str]

load_sequences() → Dict[str, str]

Load sequences from the configured FASTA file into memory.

FASTA records are indexed by UniProt accession (preferred) or by the raw record ID when an accession cannot be parsed.

Returns:

Mapping {uniprot_accession: amino_acid_sequence}.

Return type:

dict
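
The indexing rule can be sketched with a small pure-Python FASTA reader. The function names are hypothetical, and the real method may rely on a FASTA library rather than hand-written parsing; the sketch only shows the accession-or-record-ID fallback.

```python
import io
from typing import Dict, List, Optional, TextIO


def accession_from_record_id(record_id: str) -> str:
    """Return the second '|'-separated field, or the full ID as a fallback."""
    parts = record_id.split("|")
    if len(parts) >= 2 and parts[1]:
        return parts[1]
    return record_id


def load_fasta(handle: TextIO) -> Dict[str, str]:
    """Read FASTA records into {accession_or_record_id: sequence}."""
    sequences: Dict[str, str] = {}
    current: Optional[str] = None
    chunks: List[str] = []

    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                sequences[current] = "".join(chunks)
            # The record ID is the header text up to the first whitespace.
            tokens = line[1:].split()
            current = accession_from_record_id(tokens[0] if tokens else "")
            chunks = []
        elif current is not None:
            chunks.append(line)

    if current is not None:
        sequences[current] = "".join(chunks)
    return sequences


# Usage with an in-memory file; the records are made up for illustration.
example = io.StringIO(">sp|P12345|TEST_HUMAN Example\nMKT\nAAA\n>custom_id\nGGG\n")
index = load_fasta(example)
```

Here index maps "P12345" to the joined sequence lines of the first record, while the second record is keyed by its raw ID because its header has no '|' separator.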

process(data: dict) → dict

Resolve sequence and return a normalized task result.

Parameters:

data (dict) – Task payload with keys:
protein_entry_id (str): Protein accession.
go_terms (list[tuple[str, str]]): GO ID and category pairs.

Returns:

Result payload with keys: protein, go_terms, sequence.

Return type:

dict
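
The payload transformation can be sketched as below, assuming the FASTA index is a plain dictionary passed in explicitly. The function name and the explicit sequences argument are hypothetical; the documented method reads the index from the instance.

```python
from typing import Dict


def process_sketch(data: dict, sequences: Dict[str, str]) -> dict:
    """Map a task payload to the result payload described above."""
    protein_id = data["protein_entry_id"]
    return {
        "protein": protein_id,
        "go_terms": data.get("go_terms", []),
        # None when the accession is absent from the FASTA index.
        "sequence": sequences.get(protein_id),
    }
```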

store_entry(data: dict) → None

Persist a processed protein entry into the database.

This method performs the following steps in a transactional manner:

Ensure the existence of the Protein and its corresponding Accession record.
Link the Protein to a Sequence if available, creating the sequence record on demand.
Ensure the existence of all referenced GO terms (one by one).
Collect the set of GO associations (protein_id, go_id) to be created.
Query in bulk which associations already exist for this protein.
Insert only the missing associations in a single bulk statement (multi-values INSERT).
Commit all changes once at the end.

Compared to the previous row-by-row approach, this implementation eliminates the N+1 query pattern and reduces overhead by performing association inserts in bulk. This significantly improves throughput when handling proteins with a large number of GO annotations.

Parameters:

data (dict) –

Parsed entry with at least:

"protein": str, protein identifier
"sequence": Optional[str], raw protein sequence
"go_terms": List[Tuple[str, str]], list of (go_id, category)
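
The bulk association steps (query existing rows once, insert only the missing ones in a single multi-values INSERT, commit once) can be sketched with the standard-library sqlite3 module. The table and column names are hypothetical, and the real implementation goes through the ORM session rather than raw SQL.

```python
import sqlite3
from typing import Iterable


def insert_missing_associations(
    conn: sqlite3.Connection,
    protein_id: str,
    go_ids: Iterable[str],
    evidence_code: str = "UNKNOWN",
) -> int:
    """Insert associations that do not exist yet; return how many were added."""
    wanted = set(go_ids)
    if not wanted:
        return 0

    # One query for all existing associations of this protein (no N+1).
    existing = {
        row[0]
        for row in conn.execute(
            "SELECT go_id FROM protein_go_term_annotation WHERE protein_id = ?",
            (protein_id,),
        )
    }
    missing = sorted(wanted - existing)
    if not missing:
        return 0

    # One multi-values INSERT: VALUES (?, ?, ?), (?, ?, ?), ...
    placeholders = ", ".join(["(?, ?, ?)"] * len(missing))
    params = [
        value
        for go_id in missing
        for value in (protein_id, go_id, evidence_code)
    ]
    conn.execute(
        "INSERT INTO protein_go_term_annotation "
        f"(protein_id, go_id, evidence_code) VALUES {placeholders}",
        params,
    )
    conn.commit()  # single commit at the end
    return len(missing)
```

Databases cap the number of bound parameters per statement, so a very large annotation set may need to be split into chunks before building the INSERT.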