Sequence Embeddings Module
- class protein_information_system.operation.embedding.sequence_embedding.SequenceEmbeddingManager(conf)
Bases: GPUTaskInitializer
Manages the sequence embedding process, including model loading, task enqueuing, and result storing.
This class initializes GPU tasks, retrieves model configuration, and processes batches of sequences for embedding generation.
- reference_attribute
Name of the attribute used as the reference for embedding (default: ‘sequence’).
- Type:
str
- model_instances
Dictionary of loaded models keyed by embedding type ID.
- Type:
dict
- tokenizer_instances
Dictionary of loaded tokenizers keyed by embedding type ID.
- Type:
dict
- base_module_path
Base module path for dynamic imports of embedding tasks.
- Type:
str
- batch_size
Number of sequences processed per batch. Defaults to 40.
- Type:
int
- types
Configuration dictionary for embedding types.
- Type:
dict
- enqueue()
Enqueue sequence-embedding tasks for all models, requesting only the missing layers.
Behavior
- For each (sequence, embedding model type):
Read the desired layer indices from configuration (e.g., [0, 1, 2]).
Query the database for already-present layers for (sequence_id, embedding_type_id).
Compute the set difference → ‘missing_layers’.
If any layers are missing, publish a single task payload for that sequence/model that includes only those missing layer indices.
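The missing-layer computation described above can be sketched as follows. This is a minimal illustration and not the module's actual code: the helper names missing_layers and build_task and the payload keys are assumptions made for the example.

```python
def missing_layers(desired, present):
    """Return the desired layer indices that are not stored yet, in ascending order."""
    return sorted(set(desired) - set(present))


def build_task(sequence_id, embedding_type_id, desired, present):
    """Build one task payload for a sequence/model pair, or None if nothing is missing.

    The payload keys used here are hypothetical; the real payload format is
    defined by the module itself.
    """
    missing = missing_layers(desired, present)
    if not missing:
        # Every requested layer already exists, so no task is published.
        return None
    return {
        "sequence_id": sequence_id,
        "embedding_type_id": embedding_type_id,
        "layer_index": missing,
    }
```

Under these assumptions, a sequence that already has layer 1 stored and is configured for layers [0, 1, 2] produces one payload that requests only layers 0 and 2.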
Batching
Sequences are chunked into batches of size self.queue_batch_size to control memory and message size. For each batch, messages are grouped per model (backend) to minimize queue traffic.
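The chunking and per-model grouping can be pictured roughly as below. This is a sketch under stated assumptions: chunked and group_by_model are hypothetical helpers, and grouping on an embedding_type_id key is an assumption about the task payload rather than documented behavior.

```python
from collections import defaultdict


def chunked(items, size):
    """Yield consecutive slices of at most `size` items from a list."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def group_by_model(tasks):
    """Group task payloads by model so each model gets one message per batch.

    Assumes each task is a dict carrying an 'embedding_type_id' key.
    """
    grouped = defaultdict(list)
    for task in tasks:
        grouped[task["embedding_type_id"]].append(task)
    return dict(grouped)
```

With a batch size of 2, five sequences would be split into batches of 2, 2, and 1, and the tasks within each batch would then be published as one message per model.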
Notes
This function assumes the DB schema has a layer_index column on the sequence_embeddings table and that downstream storage (store_entry) includes this value when inserting.
- It is recommended to add a UNIQUE constraint on (sequence_id, embedding_type_id, layer_index) to prevent duplicates in concurrent/parallel workers.
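To illustrate what such a constraint buys, the sketch below creates a simplified table in an in-memory SQLite database and attempts to insert the same (sequence_id, embedding_type_id, layer_index) triple twice. SQLite is used only to keep the example self-contained; the project's actual database engine, full schema, and migration tooling are not specified here, and the column set shown is an assumption.

```python
import sqlite3


def demo_unique_constraint():
    """Insert a duplicate layer row and report whether the database rejected it."""
    conn = sqlite3.connect(":memory:")
    # Simplified, assumed schema: only the columns needed for the constraint.
    conn.execute(
        """
        CREATE TABLE sequence_embeddings (
            id INTEGER PRIMARY KEY,
            sequence_id INTEGER NOT NULL,
            embedding_type_id INTEGER NOT NULL,
            layer_index INTEGER NOT NULL,
            UNIQUE (sequence_id, embedding_type_id, layer_index)
        )
        """
    )
    insert = (
        "INSERT INTO sequence_embeddings "
        "(sequence_id, embedding_type_id, layer_index) VALUES (?, ?, ?)"
    )
    row = (1, 2, 0)
    conn.execute(insert, row)
    try:
        # A second worker storing the same layer hits the constraint.
        conn.execute(insert, row)
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    count = conn.execute("SELECT COUNT(*) FROM sequence_embeddings").fetchone()[0]
    conn.close()
    return duplicate_rejected, count
```

The second insert raises an integrity error and the table keeps a single row, which is the behavior that protects against two workers storing the same layer.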
- Raises:
Exception – Re-raises any unexpected error after logging.
- process(batch_data)
Processes a batch of sequences to generate embeddings.
- Parameters:
batch_data (list[dict]) – List of dictionaries, each containing sequence data.
- Returns:
List of dictionaries with embedding results.
- Return type:
list[dict]
- Raises:
Exception – If there’s an error during embedding generation.
Example
>>> batch_data = [{"sequence": "ATCG", "sequence_id": 1, "embedding_type_id": 2}]
>>> results = manager.process(batch_data)
- store_entry(records)
Abstract method to store processed entries. Must be overridden by subclasses.
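As a rough sketch of what an override might look like, the example below uses a stand-in abstract base class instead of the real SequenceEmbeddingManager, so that it runs without the package or a database. The class names, the record keys (sequence_id, embedding_type_id, layer_index, embedding), and the in-memory storage are all assumptions for illustration.

```python
from abc import ABC, abstractmethod


class EmbeddingManagerBase(ABC):
    """Hypothetical stand-in for the real base class; only store_entry is modeled."""

    @abstractmethod
    def store_entry(self, records):
        """Persist processed records."""


class InMemoryEmbeddingManager(EmbeddingManagerBase):
    """Keeps embeddings in a dict keyed by (sequence_id, embedding_type_id, layer_index)."""

    def __init__(self):
        self.rows = {}

    def store_entry(self, records):
        """Store new records, skip keys that already exist, and return the number stored."""
        stored = 0
        for record in records:
            key = (
                record["sequence_id"],
                record["embedding_type_id"],
                record["layer_index"],
            )
            if key not in self.rows:
                self.rows[key] = record["embedding"]
                stored += 1
        return stored
```

Keying on the layer index as well as the sequence and model mirrors the note in enqueue() that downstream storage should include layer_index when inserting.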