Protein annotation files

GO Annotations Queue Processor

This module defines GOAnnotationsQueueProcessor, a queue-integrated component of the Protein Information System (PIS) that:

  • Parses CAFA-formatted GO annotation files (TSV).

  • Loads and indexes protein sequences from a FASTA file.

  • Publishes per-protein tasks to the internal queue.

  • Persists proteins, accessions, sequences, GO terms (with category), and protein–GO associations into the relational database.

The implementation relies on ORM entities (Protein, Accession, Sequence, GOTerm, ProteinGOTermAnnotation) and a task-queue base class (QueueTaskInitializer).

Notes

  • CAFA TSV format expected per line: UniProtKB:ID<TAB>GO:XXXXXXX<TAB>Category. The category is optional in the file, but the DB model requires a category to create a new GO term, so creation of a GO term whose category is missing may be skipped during storage.

  • FASTA headers are expected to follow the UniProt convention, such as sp|P12345|... or tr|Q8XYZ1|...; the accession is taken from the second '|'-delimited field. If that field is not present, the full record ID is used.

  • Evidence codes for associations default to "UNKNOWN" unless provided by upstream sources.

class protein_information_system.operation.extraction.protein_annotations_file.GOAnnotationsQueueProcessor(conf: dict)

Bases: QueueTaskInitializer

Queue processor for Gene Ontology annotations.

Parameters:
  • conf (dict) – Configuration mapping with the following keys:

      - goa_annotations_file: Path to the CAFA-formatted TSV file (required).

      - goa_sequences_fasta: Path to the FASTA file with protein sequences (required).

      - limit_execution (optional): Integer limit for the number of TSV lines to process.

Side effects:
  • Loads all sequences from the FASTA file into an in-memory dictionary (self.sequences) at initialization to enable fast lookups during processing.
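
As an illustration of the configuration described above, a minimal sketch of a conf mapping follows. Only the key names come from this documentation; the file paths, the limit value, and the missing_keys helper are hypothetical.

```python
# Hypothetical paths and values; only the key names are documented.
conf = {
    "goa_annotations_file": "data/cafa_annotations.tsv",
    "goa_sequences_fasta": "data/cafa_sequences.fasta",
    "limit_execution": 1000,  # optional: cap on the number of TSV lines
}

REQUIRED_KEYS = ("goa_annotations_file", "goa_sequences_fasta")


def missing_keys(conf: dict) -> list:
    """Return the required keys that are absent from conf (illustrative helper)."""
    return [key for key in REQUIRED_KEYS if key not in conf]
```

A check like missing_keys(conf) before constructing the processor is one way to fail early on an incomplete configuration; whether the class itself validates its keys is not stated here.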

enqueue() → None

Enqueue per-protein tasks parsed from a CAFA-formatted TSV file.

Expected TSV columns per line:

UniProtKB:ID<TAB>GO:XXXXXXX<TAB>Category
  • Lines beginning with # or blank lines are ignored.

  • The third column (Category) is optional in the file; if absent or blank, "UNKNOWN" is assigned. Unknown categories are allowed at enqueue time but may later cause GO term creation to be skipped during storage.

  • Entries are grouped by protein accession before publishing to the queue.
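
The parsing and grouping rules above can be sketched roughly as follows. This is not the module's implementation: the function name group_annotations is hypothetical, stripping the "UniProtKB:" prefix is an assumption, and so is the choice to count every raw line (including comments and blanks) against the limit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def group_annotations(
    lines: Iterable[str], limit: Optional[int] = None
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (go_id, category) pairs by protein accession."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line_number, raw in enumerate(lines):
        # Assumption: the limit counts raw lines, comments included.
        if limit is not None and line_number >= limit:
            break
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # assumption: malformed lines are skipped
        # Assumption: drop a database prefix such as "UniProtKB:".
        accession = fields[0].split(":", 1)[-1].strip()
        go_id = fields[1].strip()
        category = fields[2].strip() if len(fields) > 2 else ""
        grouped[accession].append((go_id, category or "UNKNOWN"))
    return dict(grouped)
```

Each key of the returned mapping would then correspond to one task published to the queue, with the list of pairs as its go_terms payload.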

get_or_create_accession(code: str, protein_id: str, primary: bool = True, tag: str | None = None) → Accession

Create or retrieve a UniProt Accession linked to a protein.

Parameters:
  • code (str) – Accession code (e.g., P12345) used as the primary key in Accession.

  • protein_id (str) – Identifier of the linked Protein (should match UniProt accession).

  • primary (bool, optional) – Whether this accession is the primary one for the protein (default True).

  • tag (Optional[str], optional) – Optional qualifier/tag for the accession.

Returns:

ORM instance corresponding to the accession.

Return type:

Accession

get_or_create_association(protein_id: str, go_id: str, evidence_code: str = 'UNKNOWN') → ProteinGOTermAnnotation | None

Create or retrieve a protein–GO association.

Parameters:
  • protein_id (str) – Protein identifier (UniProt accession).

  • go_id (str) – GO term identifier (e.g., GO:0008150).

  • evidence_code (str, optional) – Evidence code for the association (default "UNKNOWN").

Returns:

The existing ORM association if one is found; otherwise None. When no association exists, a new one is added to the session (not yet flushed) and None is returned.

Return type:

Optional[ProteinGOTermAnnotation]

get_or_create_go_term(go_id: str, category: str) → GOTerm

Create or update a GOTerm with its category.

Parameters:
  • go_id (str) – GO term identifier (e.g., GO:0008150).

  • category (str) – GO category label (BP, MF, or CC). The DB schema enforces non-null constraints; empty/None categories are rejected.

Returns:

ORM instance corresponding to the GO term.

Return type:

GOTerm

Raises:

ValueError – If category is empty or None.
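
The create-or-update behaviour and the category check can be illustrated with a small in-memory sketch. The dictionary registry stands in for the ORM session and the function name upsert_go_term is hypothetical; the actual method works on GOTerm entities, and treating a whitespace-only category as empty is an assumption.

```python
from typing import Dict, Optional


def upsert_go_term(
    registry: Dict[str, dict], go_id: str, category: Optional[str]
) -> dict:
    """Create a GO term record, or update its category if it already exists."""
    if not category or not category.strip():
        # Mirrors the documented ValueError for empty or None categories.
        raise ValueError(f"category is required for GO term {go_id}")
    term = registry.get(go_id)
    if term is None:
        term = {"go_id": go_id, "category": category}
        registry[go_id] = term
    elif term["category"] != category:
        term["category"] = category  # update path of "create or update"
    return term
```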

get_or_create_protein(protein_entry_id: str) → Protein

Create or retrieve a Protein by UniProt accession.

Parameters:

protein_entry_id (str) – UniProt accession used as the primary key in the Protein table.

Returns:

ORM instance for the protein.

Return type:

Protein

get_or_create_sequence(sequence: str) → Sequence

Create or retrieve a Sequence entity by raw sequence value.

Parameters:

sequence (str) – Amino acid sequence string.

Returns:

ORM instance corresponding to the stored sequence.

Return type:

Sequence

Raises:

ValueError – If sequence is empty or None.

get_sequence_from_external_source(protein_entry_id: str) → str | None

Retrieve the sequence for a protein from the in-memory FASTA index.

Parameters:

protein_entry_id (str) – UniProt accession for which to retrieve the sequence.

Returns:

The amino acid sequence if found; otherwise None.

Return type:

Optional[str]

load_sequences() → Dict[str, str]

Load sequences from the configured FASTA file into memory.

FASTA records are indexed by UniProt accession (preferred) or by the raw record ID when an accession cannot be parsed.

Returns:

Mapping {uniprot_accession: amino_acid_sequence}.

Return type:

dict
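
A rough, dependency-free sketch of this indexing is shown below. The module may well use a FASTA library instead; the function names accession_from_header and index_fasta are hypothetical, and the sketch reads from any iterable of lines rather than from the configured file path.

```python
from typing import Dict, Iterable, List, Optional


def accession_from_header(header: str) -> str:
    """Return the second '|'-delimited field, or the full record ID as fallback."""
    tokens = header[1:].split()  # drop the leading '>' and the description
    record_id = tokens[0] if tokens else ""
    parts = record_id.split("|")
    if len(parts) >= 2 and parts[1]:
        return parts[1]
    return record_id


def index_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Build {accession: sequence} from FASTA-formatted lines."""
    sequences: Dict[str, str] = {}
    current: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                sequences[current] = "".join(chunks)  # close previous record
            current = accession_from_header(line)
            chunks = []
        elif current is not None:
            chunks.append(line)  # sequences may span several lines
    if current is not None:
        sequences[current] = "".join(chunks)
    return sequences
```

With a mapping like this in memory, a lookup for a single accession reduces to a dictionary get that returns None when the protein is absent from the FASTA file.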

process(data: dict) → dict

Resolve sequence and return a normalized task result.

Parameters:

data (dict) – Task payload with keys:
  • protein_entry_id (str): Protein accession.

  • go_terms (list[tuple[str, str]]): GO ID and category pairs.

Returns:

Result payload with keys: protein, go_terms, sequence.

Return type:

dict
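
The input and output payloads can be pictured with the following sketch. The function name process_task and the explicit sequences argument are hypothetical; in the class, the lookup goes through the in-memory FASTA index held by the instance.

```python
from typing import Dict


def process_task(data: dict, sequences: Dict[str, str]) -> dict:
    """Resolve the sequence for a task and return the normalized result."""
    protein_id = data["protein_entry_id"]
    return {
        "protein": protein_id,
        "go_terms": list(data.get("go_terms", [])),
        # None when the accession is missing from the FASTA index.
        "sequence": sequences.get(protein_id),
    }
```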

store_entry(data: dict) → None

Persist a processed protein entry into the database.

This method performs the following steps in a transactional manner:

  1. Ensure the existence of the Protein and its corresponding Accession record.

  2. Link the Protein to a Sequence if available, creating the sequence record on demand.

  3. Ensure the existence of all referenced GO terms (one by one).

  4. Collect the set of GO associations (protein_id, go_id) to be created.

  5. Query in bulk which associations already exist for this protein.

  6. Insert only the missing associations in a single bulk statement (multi-values INSERT).

  7. Commit all changes once at the end.

Compared to the previous row-by-row approach, this implementation eliminates the N+1 query pattern and reduces overhead by performing association inserts in bulk. This significantly improves throughput when handling proteins with a large number of GO annotations.

Parameters:

data (dict) –

Parsed entry with at least:
  • "protein": str, protein identifier

  • "sequence": Optional[str], raw protein sequence

  • "go_terms": List[Tuple[str, str]], list of (go_id, category)
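
Steps 4 to 7 of store_entry (collect, bulk lookup, insert only the missing rows, commit once) can be sketched with the standard-library sqlite3 module. This is an illustration under stated assumptions, not the module's code: the table and column names are hypothetical, the ORM layer is replaced by raw SQL, executemany stands in for the multi-values INSERT, and the creation of proteins, sequences, and GO terms is omitted.

```python
import sqlite3
from typing import Iterable, Tuple


def insert_missing_associations(
    conn: sqlite3.Connection,
    protein_id: str,
    go_terms: Iterable[Tuple[str, str]],
    evidence_code: str = "UNKNOWN",
) -> int:
    """Insert only the associations that do not exist yet; return their count."""
    # Step 4: collect the GO IDs wanted for this protein (set removes repeats).
    wanted = {go_id for go_id, _category in go_terms}
    # Step 5: one query for all existing associations of this protein.
    rows = conn.execute(
        "SELECT go_id FROM protein_go_term_annotation WHERE protein_id = ?",
        (protein_id,),
    ).fetchall()
    existing = {row[0] for row in rows}
    missing = sorted(wanted - existing)
    # Steps 6 and 7: batched insert, committed once when the block exits.
    with conn:
        conn.executemany(
            "INSERT INTO protein_go_term_annotation "
            "(protein_id, go_id, evidence_code) VALUES (?, ?, ?)",
            [(protein_id, go_id, evidence_code) for go_id in missing],
        )
    return len(missing)
```

The set difference is what avoids the per-association existence check: the database is queried once per protein, regardless of how many GO terms the entry carries.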