Lookup Table Generation for FANTASIA
====================================

This document describes the available pipelines to populate the functional lookup table used in FANTASIA. Two complementary pathways are supported:

1. **UniProt-based standard import**, which retrieves accessions and metadata through the UniProt API or from a curated CSV file.
2. **Custom annotation ingestion**, useful for local or third-party datasets with manually curated annotations and FASTA sequences.

Standard Accession Import
-------------------------

The `AccessionManager` class provides two standard methods to initialize protein accessions:

.. code-block:: python

    AccessionManager(conf).fetch_accessions_from_api()

This call fetches accessions directly from UniProtKB using the `search_criteria` defined in the YAML configuration. The value must be a valid UniProt query string, for example:

.. code-block:: yaml

    search_criteria: '(structure_3d:true)'

To restrict results to proteins whose GO annotations are backed by experimental evidence or by author and curator statements, use:

.. code-block:: yaml

    search_criteria: '(go_exp:* OR go_ida:* OR go_ipi:* OR go_imp:* OR go_igi:* OR go_iep:* OR go_tas:* OR go_ic:*)'
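
Long evidence queries like the one above are easy to mistype. A minimal sketch of assembling the string from a list of evidence codes follows. The helper `build_go_evidence_query` is hypothetical (it is not part of FANTASIA) and assumes that every code maps to a `go_<code>` query field, as in the example above.

.. code-block:: python

    def build_go_evidence_query(codes):
        """Join GO evidence codes into an OR query (hypothetical helper)."""
        if not codes:
            raise ValueError("at least one evidence code is required")
        # Each code becomes a wildcard clause such as "go_ida:*".
        clauses = [f"go_{code.strip().lower()}:*" for code in codes]
        return "(" + " OR ".join(clauses) + ")"


    query = build_go_evidence_query(
        ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"]
    )
    # Emit a line that can be pasted into the YAML configuration.
    print(f"search_criteria: '{query}'")

Generating the string keeps the query and the `allowed_evidences` list in step when both are derived from the same list of codes.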

Alternatively, accessions can be loaded from a user-provided CSV:

.. code-block:: python

    AccessionManager(conf).load_accessions_from_csv()

This requires the configuration to define the file path and the relevant column:

.. code-block:: yaml

    load_accesion_csv: ../data/sample.csv
    load_accesion_column: uniprot_id

This mode is recommended for predefined accession lists or curated datasets.
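
To check a CSV before running the import, the configured column can be read with the standard library. This is only a sketch of the idea, not the code used by `AccessionManager`; the function name `read_accessions` and the choice to drop blanks and duplicates are assumptions.

.. code-block:: python

    import csv


    def read_accessions(csv_path, column):
        """Return unique, non-empty values of one CSV column, in file order."""
        seen = set()
        accessions = []
        with open(csv_path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if reader.fieldnames is None or column not in reader.fieldnames:
                raise KeyError(f"column {column!r} not found in {csv_path}")
            for row in reader:
                # Short rows yield None for missing cells, hence the fallback.
                value = (row[column] or "").strip()
                if value and value not in seen:
                    seen.add(value)
                    accessions.append(value)
        return accessions

With the configuration shown above, the call would be `read_accessions("../data/sample.csv", "uniprot_id")`.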

Post-processing Steps
---------------------

Once accessions are available, metadata and protein representations are generated using:

.. code-block:: python

    UniProtExtractor(conf).start()
    SequenceEmbeddingManager(conf).start()

These modules perform the following tasks, in order:

- `UniProtExtractor` downloads protein sequences and metadata from UniProt.
- `SequenceEmbeddingManager` generates embeddings using the selected protein language models.

Embedding Model Selection
-------------------------

Available embedding models are defined under `embedding.types` in the YAML configuration:

.. code-block:: yaml

    embedding:
      types:
        - 1  # ESM: Evolutionary Scale Modeling (Meta AI)
        - 2  # ProSTT5: structure-aware T5-based model (Rostlab)
        - 3  # ProtT5: T5-based protein language model (Rostlab, ProtTrans)
        - 4  # Ankh3: Contextual residue embedding model (Ankh v3)

Multiple models may be activated simultaneously. Batch sizes for queueing and inference are controlled via:

.. code-block:: yaml

    batch_size: 1
    batch_size_embedding: 1
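
The effect of these settings can be pictured with a small sketch. The names `MODEL_NAMES`, `resolve_models`, and `batched` are hypothetical, and the ID-to-name mapping is copied from the configuration comments above; how FANTASIA batches work internally is not described here.

.. code-block:: python

    # IDs and names as listed in the `embedding.types` comments.
    MODEL_NAMES = {1: "ESM", 2: "ProSTT5", 3: "ProtT5", 4: "Ankh3"}


    def resolve_models(type_ids):
        """Translate configured type IDs into model names, rejecting unknown IDs."""
        unknown = [t for t in type_ids if t not in MODEL_NAMES]
        if unknown:
            raise ValueError(f"unknown embedding type ids: {unknown}")
        return [MODEL_NAMES[t] for t in type_ids]


    def batched(items, batch_size):
        """Yield consecutive slices of `items` with at most `batch_size` elements."""
        if batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]


    print(resolve_models([3, 4]))
    print(list(batched(["seq_a", "seq_b", "seq_c"], 2)))

A batch size of 1 processes one item at a time; larger values trade memory for throughput.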

Annotation Filtering by Evidence
--------------------------------

FANTASIA supports filtering GO annotations by their GO evidence codes. To retain only annotations backed by experimental evidence (EXP, IDA, IPI, IMP, IGI, IEP) or by author and curator statements (TAS, IC):

.. code-block:: yaml

    allowed_evidences:
      - EXP  # Inferred from Experiment
      - IDA  # Inferred from Direct Assay
      - IPI  # Inferred from Physical Interaction
      - IMP  # Inferred from Mutant Phenotype
      - IGI  # Inferred from Genetic Interaction
      - IEP  # Inferred from Expression Pattern
      - TAS  # Traceable Author Statement
      - IC   # Inferred by Curator

If the list is left empty (`[]`), all annotations are imported regardless of their evidence code.
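
The filtering rule, including the empty-list case, can be sketched as follows. The function `filter_by_evidence` and the dictionary layout of an annotation are assumptions made for illustration; they do not mirror FANTASIA's internal data model.

.. code-block:: python

    def filter_by_evidence(annotations, allowed_evidences):
        """Keep annotations whose evidence code is allowed.

        An empty `allowed_evidences` list disables filtering.
        """
        if not allowed_evidences:
            return list(annotations)
        allowed = {code.upper() for code in allowed_evidences}
        return [a for a in annotations if a["evidence"].upper() in allowed]


    # Illustrative records only.
    annotations = [
        {"go_id": "GO:0003674", "evidence": "IDA"},
        {"go_id": "GO:0008150", "evidence": "IEA"},
    ]
    print(filter_by_evidence(annotations, ["EXP", "IDA"]))  # keeps the IDA record
    print(len(filter_by_evidence(annotations, [])))         # keeps both records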

Custom Annotation via GOAnnotationsQueueProcessor
--------------------------------------------------

FANTASIA also supports local datasets or third-party annotations via the `GOAnnotationsQueueProcessor` class.

Requirements:

- A tab-separated annotation file (`goa_annotations_file`) with one protein per line: the protein identifier, a tab, and a comma-separated list of GO terms:

  .. code-block::

      PROT_ID_001    GO:0008150,GO:0003674,GO:0005575
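
A line in this format can be validated before ingestion with a short parser. This is a sketch under the format stated above; `parse_annotation_line` is a hypothetical helper, not the parser used by `GOAnnotationsQueueProcessor`.

.. code-block:: python

    def parse_annotation_line(line):
        """Split one 'ID<TAB>GO:...,GO:...' line into (protein_id, go_terms)."""
        protein_id, _, terms = line.rstrip("\n").partition("\t")
        protein_id = protein_id.strip()
        # Ignore empty fragments left by stray or trailing commas.
        go_terms = [t.strip() for t in terms.split(",") if t.strip()]
        if not protein_id or not go_terms:
            raise ValueError(f"malformed annotation line: {line!r}")
        return protein_id, go_terms


    print(parse_annotation_line("PROT_ID_001\tGO:0008150,GO:0003674,GO:0005575"))

Lines that use spaces instead of a tab are rejected, which catches a common copy-and-paste error.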

Execution:

.. code-block:: python

    GOAnnotationsQueueProcessor(conf).start()

This module performs the following steps internally:

1. Parses each protein entry and its GO terms.
2. Retrieves the protein sequence from UniProt.
3. Stores or updates the protein, its sequence, and its GO terms, and assigns a default evidence code (`"UNKNOWN"`).
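
The three steps can be pictured with an in-memory stand-in. Everything in this sketch is an assumption made for illustration: `ingest` is a hypothetical function, a plain dictionary replaces the database, and the caller-supplied `fetch_sequence` callable replaces the UniProt lookup.

.. code-block:: python

    DEFAULT_EVIDENCE = "UNKNOWN"  # default evidence code named in step 3


    def ingest(entries, fetch_sequence, store=None):
        """Upsert (protein_id, go_terms) pairs into a dict-based store."""
        store = {} if store is None else store
        for protein_id, go_terms in entries:
            # Create the record on first sight, reuse it afterwards.
            record = store.setdefault(
                protein_id, {"sequence": None, "annotations": {}}
            )
            record["sequence"] = fetch_sequence(protein_id)
            for go_id in go_terms:
                # Keying by GO id makes a repeated term an update, not a duplicate.
                record["annotations"][go_id] = DEFAULT_EVIDENCE
        return store


    fake_sequences = {"PROT_ID_001": "MKTAYIAKQR"}  # illustrative data only
    store = ingest(
        [("PROT_ID_001", ["GO:0008150", "GO:0003674"])], fake_sequences.get
    )
    print(store["PROT_ID_001"]["annotations"])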

Configuration Summary
----------------------

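Depending on the selected mode, the YAML configuration must include the appropriate keys. The three blocks below are alternatives: they reuse keys such as `tag` and `embedding`, so keep only the block for the selected mode in a given configuration file. Only one mode should be active per execution.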

.. code-block:: yaml

    # --- Mode 1: Standard UniProt Search (API query) ---
    # Triggered by: AccessionManager(conf).fetch_accessions_from_api()
    search_criteria: '(go_exp:* OR go_ida:* OR go_ipi:* OR go_imp:*)'
    tag: HUMAN_SEARCH
    allowed_evidences:
      - EXP
      - IDA
      - IPI
      - IMP
    embedding:
      types: [3, 4]     # e.g. ProtT5, Ankh3
      batch_size: 1

    # --- Mode 2: CSV-based Custom Dataset ---
    # Triggered by: AccessionManager(conf).load_accessions_from_csv()
    load_accesion_csv: ../data/sample.csv
    load_accesion_column: uniprot_id
    fasta_path: ../data/sequences.fasta
    tag: CUSTOM_DATASET
    allowed_evidences: [EXP, IDA, IPI, IMP]
    embedding:
      types: [3, 4]
      batch_size: 1

    # --- Mode 3: GOA File with Local Annotations ---
    # Triggered by: GOAnnotationsQueueProcessor(conf).start()
    goa_annotations_file: ../data/custom_go_annotations.tsv
    limit_execution: 1000  # Optional

Execution Flow
^^^^^^^^^^^^^^

The following shows the calls made in each mode. Run only the block for the selected mode, followed by the common embedding step:

.. code-block:: python

    # --- Mode 1 ---
    AccessionManager(conf).fetch_accessions_from_api()
    UniProtExtractor(conf).start()

    # --- Mode 2 ---
    AccessionManager(conf).load_accessions_from_csv()
    UniProtExtractor(conf).start()

    # --- Mode 3 ---
    GOAnnotationsQueueProcessor(conf).start()

    # Common to all modes
    SequenceEmbeddingManager(conf).start()

The YAML file must define every key required by the selected mode. Do not mix keys or calls from different modes in a single run.
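
A pre-flight check can catch a configuration that mixes modes before any work starts. The sketch below assumes that the presence of one characteristic key identifies each mode. That convention, the mode names, and the function `detect_mode` are illustrative; in FANTASIA the mode is determined by which class and method are called.

.. code-block:: python

    # One characteristic key per mode, taken from the configuration summary.
    MODE_KEYS = {
        "api_search": "search_criteria",
        "csv_import": "load_accesion_csv",
        "goa_file": "goa_annotations_file",
    }


    def detect_mode(conf):
        """Return the single active mode, or raise if none or several are set."""
        active = [mode for mode, key in MODE_KEYS.items() if conf.get(key)]
        if len(active) != 1:
            raise ValueError(
                f"expected exactly one mode, found: {active or 'none'}"
            )
        return active[0]


    print(detect_mode({"goa_annotations_file": "../data/custom_go_annotations.tsv"}))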