ORM Model

This section documents the Object-Relational Mapping (ORM) models used in the Protein Data Handler. Each model maps to one table in the database and describes the columns and relationships used to store protein data.

ER Diagram

Protein

class protein_metamorphisms_is.sql.model.Protein(**kwargs)

Bases: Base

Represents a protein, encapsulating its properties and relationships within a database.

This class models a protein entity, encompassing various attributes that describe its characteristics and relationships to other entities. It serves as a comprehensive record for proteins, covering aspects from basic sequence data to more complex annotations and references.

entry_name

Unique entry name for the protein, serving as the primary key.

Type:: str

data_class

Categorization of the protein’s data (e.g., experimental, predicted).

Type:: str

molecule_type

Type of the protein molecule (e.g., enzyme, antibody).

Type:: str

sequence_length

The length of the amino acid sequence of the protein.

Type:: int

sequence

Full amino acid sequence of the protein.

Type:: str

accessions

A link to the ‘Accession’ class, detailing access codes associated with this protein.

Type:: relationship

created_date

The date when the protein record was first created.

Type:: Date

sequence_update_date

The date when the protein’s sequence was last updated.

Type:: Date

annotation_update_date

The date when the protein’s annotation was last updated.

Type:: Date

description

A general description or overview of the protein.

Type:: str

gene_name

The name of the gene that encodes this protein.

Type:: str

organism

The organism from which the protein is derived.

Type:: str

organelle

The specific organelle where the protein is localized, if applicable.

Type:: str

organism_classification

Taxonomic classification of the organism (e.g., species, genus).

Type:: str

taxonomy_id

A unique identifier for the organism in taxonomic databases.

Type:: str

host_organism

The host organism for the protein, relevant in cases of viral or symbiotic proteins.

Type:: str

host_taxonomy_id

Taxonomy identifier for the host organism, if applicable.

Type:: str

comments

Additional remarks or notes about the protein.

Type:: str

pdb_references

A link to the ‘PDBReference’ class, providing references to structural data in the PDB.

Type:: relationship

go_terms

A connection to the ‘GOTerm’ class, indicating Gene Ontology terms associated with the protein.

Type:: relationship

keywords

Descriptive keywords related to the protein, aiding in categorization and search.

Type:: str

protein_existence

A numerical code indicating the evidence level for the protein’s existence.

Type:: int

seqinfo

Supplementary information about the protein’s sequence.

Type:: str

disappeared

Flag indicating whether the protein is obsolete or no longer relevant.

Type:: Boolean

created_at

Timestamp of when the record was initially created.

Type:: DateTime

updated_at

Timestamp of the most recent update to the record.

Type:: DateTime

This class is integral to managing and querying detailed protein data, supporting a wide range of bioinformatics and data analysis tasks.
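
Because `sequence_length` is stored alongside `sequence`, the two can drift apart if one is updated without the other. The following is a minimal sketch of a consistency check, written with a plain dataclass rather than the project's SQLAlchemy model. `ProteinRecord`, `length_matches` and the sample values are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """Hypothetical stand-in for three of the Protein attributes."""
    entry_name: str
    sequence: str
    sequence_length: int

    def length_matches(self) -> bool:
        # The stored length should equal the number of residues in the sequence.
        return len(self.sequence) == self.sequence_length


# Placeholder data, not a real UniProt entry.
record = ProteinRecord(entry_name="EXAMPLE_HUMAN", sequence="MKTAYIAK", sequence_length=8)
stale = ProteinRecord(entry_name="EXAMPLE_MOUSE", sequence="MKTAYIAK", sequence_length=7)
```

A check like this could run before a record is written, so that a mismatch is caught at load time instead of during later analysis.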

Accession

class protein_metamorphisms_is.sql.model.Accession(**kwargs)

Bases: Base

Represents a unique access code for a protein in a database, often used in bioinformatics repositories.

This class models an accession record, which is essential for tracking and referencing protein data. Each accession record provides a unique identifier for a protein and is linked to detailed protein information.

The Accession class plays a crucial role in the organization and retrieval of protein data, acting as a key reference point for protein identification and database querying.

id

A unique identifier for the accession record within the database.

Type:: int

accession_code

The unique access code associated with a specific protein. This code is typically used as a reference in various bioinformatics databases and literature.

Type:: str

primary

A flag indicating whether this accession code is the primary identifier for the associated protein. Primary accession codes are generally the most stable and widely used references.

Type:: Boolean

protein_entry_name

The entry name of the protein associated with this accession code. This serves as a link to the protein’s detailed record.

Type:: str

protein

A SQLAlchemy relationship with the ‘Protein’ class. This relationship provides a direct connection to the protein entity that this accession code represents, allowing for the retrieval of comprehensive protein information.

Type:: relationship

disappeared

A flag indicating whether the accession code is obsolete or no longer in use. This is important for maintaining the integrity and relevance of the database.

Type:: Boolean

created_at

The date and time when this accession record was first created in the database.

Type:: DateTime

updated_at

The date and time when this accession record was last updated, reflecting any changes or updates to the accession information.

Type:: DateTime
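
To illustrate how `protein_entry_name` ties an accession back to its protein, the sketch below uses Python's standard `sqlite3` module instead of the project's SQLAlchemy models. The table names, the helper `primary_accession` and the sample entry name and accession codes are assumptions made for the example; only the column names come from the attribute lists above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protein (
        entry_name TEXT PRIMARY KEY,
        sequence   TEXT
    );
    CREATE TABLE accession (
        id                 INTEGER PRIMARY KEY,
        accession_code     TEXT NOT NULL,
        "primary"          INTEGER NOT NULL,
        protein_entry_name TEXT REFERENCES protein(entry_name),
        disappeared        INTEGER NOT NULL DEFAULT 0
    );
    """
)

# Placeholder rows: one protein with a primary and a secondary accession.
conn.execute("INSERT INTO protein VALUES (?, ?)", ("EXAMPLE_HUMAN", "MKTAYIAK"))
conn.executemany(
    'INSERT INTO accession (accession_code, "primary", protein_entry_name) VALUES (?, ?, ?)',
    [("X00001", 1, "EXAMPLE_HUMAN"), ("X00002", 0, "EXAMPLE_HUMAN")],
)


def primary_accession(conn, entry_name):
    """Return the non-obsolete primary accession code for a protein, or None."""
    row = conn.execute(
        'SELECT accession_code FROM accession '
        'WHERE protein_entry_name = ? AND "primary" = 1 AND disappeared = 0',
        (entry_name,),
    ).fetchone()
    return row[0] if row else None
```

Filtering on `disappeared` in the same query keeps obsolete codes out of the result without deleting them from the table.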

PDBReference

class protein_metamorphisms_is.sql.model.PDBReference(**kwargs)

Bases: Base

Represents a reference to a structure in the Protein Data Bank (PDB).

This class is pivotal for storing and managing details about protein structures as cataloged in the PDB. It forms a bridge between PDB structures and UniProt entries, enabling comprehensive tracking and analysis of protein structures and their corresponding sequences.

The PDBReference class serves as a critical component for integrating structural data with protein sequence and functional information, thereby enriching the understanding of protein structures.

id

A unique identifier for the PDB reference within the database. This serves as the primary key.

Type:: int

pdb_id

The unique identifier of the protein structure in PDB, typically a 4-character alphanumeric code.

Type:: str

protein_entry_name

The entry name of the associated protein in UniProt. This helps link the structure to its corresponding protein sequence and other relevant data in UniProt.

Type:: str

protein

A SQLAlchemy relationship to the ‘Protein’ class, establishing a connection to the UniProt entry corresponding to this PDB structure.

Type:: relationship

method

The method used for determining the protein structure, such as X-ray crystallography or NMR spectroscopy.

Type:: str

resolution

The resolution of the protein structure, measured in Ångströms (Å). A lower number indicates higher resolution.

Type:: Float

uniprot_chains

A relationship to the ‘UniprotChains’ class, detailing the individual protein chains in the structure as defined in UniProt.

Type:: relationship

pdb_chains

A relationship to the ‘PDBChains’ class, describing the chains in the protein structure as recorded in PDB.

Type:: relationship

created_at

The timestamp indicating when the PDB reference record was initially created in the database.

Type:: DateTime

updated_at

The timestamp of the most recent update to the PDB reference record. This field is automatically updated on each record modification.

Type:: DateTime
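
Since a lower `resolution` value means a better-resolved structure, picking the best structure for a protein amounts to taking the minimum. The sketch below shows this with plain Python objects. `PDBRef`, `best_resolved` and the PDB identifiers are placeholders, and the example assumes that `resolution` may be empty (for instance for NMR structures, which generally do not report one).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PDBRef:
    """Hypothetical stand-in for a few PDBReference attributes."""
    pdb_id: str
    method: str
    resolution: Optional[float]  # in Ångströms; None when not reported


def best_resolved(refs: List[PDBRef]) -> Optional[PDBRef]:
    """Return the reference with the smallest resolution value, if any."""
    with_resolution = [r for r in refs if r.resolution is not None]
    if not with_resolution:
        return None
    return min(with_resolution, key=lambda r: r.resolution)


# Placeholder identifiers, not meant to match real PDB entries.
refs = [
    PDBRef("0AAA", "X-ray", 2.1),
    PDBRef("0BBB", "X-ray", 1.5),
    PDBRef("0CCC", "NMR", None),
]
```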

PDBChains

class protein_metamorphisms_is.sql.model.PDBChains(**kwargs)

Bases: Base

Represents an individual polypeptide chain within a protein structure as cataloged in the Protein Data Bank (PDB).

The PDBChains class is instrumental in representing each distinct polypeptide chain encountered in protein structures from the PDB. This class enables detailed tracking and management of these chains, facilitating analyses and queries at the chain level. By associating each chain with its parent protein structure, the class enhances the database’s ability to model complex protein structures.

id

A unique identifier for each polypeptide chain within the database, serving as the primary key.

Type:: int

chains

The specific identifier of the chain as referenced in the protein structure within PDB. This attribute, combined with ‘pdb_reference_id’, constitutes part of the composite primary key.

Type:: String

sequence

The complete amino acid sequence of the chain. Storing this mandatory attribute allows for in-depth analyses of the chain’s molecular structure.

Type:: String

pdb_reference_id

A foreign key linking to the unique identifier of the parent protein structure in the PDB. This attribute forms the other part of the composite primary key and establishes a direct relationship with the PDBReference entity.

Type:: Integer

model

An identifier for the model of the chain, particularly important for structures determined by methods such as NMR, which may contain multiple models.

Type:: Integer

pdb_reference

A SQLAlchemy relationship that connects to the PDBReference entity. This relationship provides access to comprehensive details about the entire protein structure of which this chain is a part.

Type:: relationship

The composite primary key, comprising chains and pdb_reference_id, ensures that each instance of PDBChains is uniquely tied to a specific structure in the PDB. This key structure is critical for precise data retrieval and efficient management of the database’s structural data.
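
The effect of a composite key on `chains` and `pdb_reference_id` can be sketched with the standard `sqlite3` module. This is an illustration under assumptions, not the project's schema: the table name `pdb_chains` and the helper `add_chain` are hypothetical, and the `id` column is left out to keep the example focused on the composite key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pdb_chains (
        chains           TEXT    NOT NULL,
        pdb_reference_id INTEGER NOT NULL,
        sequence         TEXT    NOT NULL,
        model            INTEGER,
        PRIMARY KEY (chains, pdb_reference_id)
    )
    """
)


def add_chain(conn, chains, pdb_reference_id, sequence, model=0):
    """Insert a chain; return False if the (chains, pdb_reference_id) pair exists."""
    try:
        conn.execute(
            "INSERT INTO pdb_chains VALUES (?, ?, ?, ?)",
            (chains, pdb_reference_id, sequence, model),
        )
        return True
    except sqlite3.IntegrityError:
        # The database rejects a second row with the same composite key.
        return False


first = add_chain(conn, "A", 1, "MKTAYIAK")        # new pair
duplicate = add_chain(conn, "A", 1, "MKTAYIAK")    # same pair again
other_structure = add_chain(conn, "A", 2, "MKTAYIAK")  # same chain label, other structure
```

The same chain label can appear in many structures; only the pair of label and structure has to be unique.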

Cluster

class protein_metamorphisms_is.sql.model.Cluster(**kwargs)

Bases: Base

Represents a cluster of protein chains, where each cluster is formed by chains with significant similarity, determined using the cd-hit tool.

This class is instrumental in grouping protein chains that are highly similar to each other, aiding in the identification of common structures and functions.

id

Unique identifier for each cluster.

Type:: int

pdb_chain_id

Foreign key referencing the ‘PDBChains’ entity. It identifies the specific protein chain in the PDB database associated with this cluster.

Type:: int

cluster_id

Identifier of the cluster, shared by all entries that belong to the same group of protein chains.

Type:: int

is_representative

Indicates whether the cluster is representative of a larger set of similar chains. ‘True’ for yes, ‘False’ for no.

Type:: Boolean

sequence_length

Average length of the sequences of the chains in the cluster.

Type:: int

identity

Value representing the average sequence identity within the cluster, usually a percentage indicating how similar the chains are within the group.

Type:: Float

The relationship with ‘PDBChains’ allows each cluster to be connected to its specific chain in the PDB database, providing a direct link to detailed structural information.
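
As a rough sketch of how these rows might be consumed, the code below groups entries by `cluster_id` and looks up the chain flagged with `is_representative`. It assumes one row per chain, which the text above suggests but does not state outright. `ClusterRow`, `members` and `representatives` are hypothetical names, and the values are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClusterRow:
    """Hypothetical stand-in for a few Cluster attributes."""
    pdb_chain_id: int
    cluster_id: int
    is_representative: bool
    identity: float


def members(rows: List[ClusterRow]) -> Dict[int, List[int]]:
    """Map each cluster_id to the pdb_chain_id values it contains."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.cluster_id].append(row.pdb_chain_id)
    return dict(grouped)


def representatives(rows: List[ClusterRow]) -> Dict[int, int]:
    """Map each cluster_id to the pdb_chain_id flagged as representative."""
    return {row.cluster_id: row.pdb_chain_id for row in rows if row.is_representative}


rows = [
    ClusterRow(pdb_chain_id=10, cluster_id=0, is_representative=True, identity=100.0),
    ClusterRow(pdb_chain_id=11, cluster_id=0, is_representative=False, identity=97.5),
    ClusterRow(pdb_chain_id=12, cluster_id=1, is_representative=True, identity=100.0),
]
```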

StructuralComplexityLevel

class protein_metamorphisms_is.sql.model.StructuralComplexityLevel(**kwargs)

Bases: Base

Captures the hierarchy of structural forms within proteins, ranging from individual proteins to the partitioning of chains according to their secondary structure.

This class provides a foundational abstraction for handling proteins at various levels of structural complexity within the development environment. It allows for the execution of operations across different complexity levels, enabling a more flexible and nuanced approach to protein data manipulation and analysis. By defining distinct levels of structural complexity, it supports targeted queries and operations, enhancing the efficiency and precision of bioinformatics workflows.

id

Unique identifier for each complexity level.

Type:: Integer

name

Descriptive name of the complexity level.

Type:: String

description

More detailed information about the complexity level.

Type:: String, optional

StructuralAlignmentType

class protein_metamorphisms_is.sql.model.StructuralAlignmentType(**kwargs)

Bases: Base

Provides a framework for aligning protein structures, crucial for understanding the functional and evolutionary relationships between proteins. This class enables the use of various alignment strategies, supporting a comprehensive approach to protein comparison.

Structural alignment methods integrated within this framework include:

CE-align: Identifies optimal alignments based on the Combinatorial Extension method, focusing on similar backbone arrangements.
US-align: Utilizes an advanced algorithm for measuring structural similarity, offering insights into sequence identity and alignment scores.
FATCAT: Capable of accommodating protein flexibility during alignment, allowing for the detection of functionally important variations.

By incorporating these methodologies, the class facilitates diverse approaches to protein comparison. This enables researchers to gain deeper insights into protein functionality and evolution, highlighting the significance of structural alignment in the field of bioinformatics.

id

Unique identifier for each alignment type.

Type:: Integer

name

Name of the alignment type.

Type:: String

description

Detailed description of the alignment method.

Type:: String

task_name

Name of the specific task or process associated with this alignment type.

Type:: String
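
One way a `task_name` column can be used is as a lookup key into a registry of callables, so that adding an alignment method means adding a row and a function. The sketch below shows that pattern in plain Python. It is an assumption about usage, not the project's code: `TASKS`, `register`, `run_alignment` and the task name `"ce_align"` are hypothetical, and the task body is a placeholder that performs no real alignment.

```python
from typing import Callable, Dict

# Registry from task_name to the function that performs the alignment.
TASKS: Dict[str, Callable[[str, str], dict]] = {}


def register(task_name: str):
    """Decorator that stores a function under the given task name."""
    def decorator(func):
        TASKS[task_name] = func
        return func
    return decorator


@register("ce_align")
def ce_align(chain_a: str, chain_b: str) -> dict:
    # Placeholder: a real task would call the alignment tool here.
    return {"method": "CE-align", "pair": (chain_a, chain_b)}


def run_alignment(task_name: str, chain_a: str, chain_b: str) -> dict:
    """Look up the task by name and run it on a pair of chains."""
    try:
        task = TASKS[task_name]
    except KeyError:
        raise ValueError(f"no task registered for {task_name!r}") from None
    return task(chain_a, chain_b)
```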

StructuralAlignmentQueue

class protein_metamorphisms_is.sql.model.StructuralAlignmentQueue(**kwargs)

Bases: Base

Manages a queue of pending structural alignment tasks, overseeing their execution and monitoring.

id

Unique identifier for each queue entry.

Type:: Integer

cluster_entry_id

Reference to the protein chain cluster being aligned.

Type:: Integer, ForeignKey

alignment_type_id

ID of the structural alignment type to be applied.

Type:: Integer, ForeignKey

state

Current state of the task (e.g., pending, processing, completed, error).

Type:: Integer

retry_count

Number of retries attempted for the task.

Type:: Integer

error_message

Error message if the task fails.

Type:: String, optional

created_at

Timestamp when the queue entry was created.

Type:: DateTime

updated_at

Timestamp when the queue entry was last updated.

Type:: DateTime
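
The interplay of `state`, `retry_count` and `error_message` can be sketched as a small state machine. The integer codes, the `MAX_RETRIES` limit and the rule that a failed task returns to pending until the limit is reached are all assumptions made for this example; the source only states that `state` is an integer and lists example state names.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class State(IntEnum):
    # Hypothetical integer codes for the states named above.
    PENDING = 0
    PROCESSING = 1
    COMPLETED = 2
    ERROR = 3


MAX_RETRIES = 3  # assumed limit, not taken from the source


@dataclass
class QueueEntry:
    cluster_entry_id: int
    alignment_type_id: int
    state: int = State.PENDING
    retry_count: int = 0
    error_message: Optional[str] = None


def process(entry: QueueEntry, task):
    """Run task for one queue entry and update its bookkeeping fields."""
    entry.state = State.PROCESSING
    try:
        result = task(entry.cluster_entry_id)
    except Exception as exc:
        entry.retry_count += 1
        entry.error_message = str(exc)
        # Give up after MAX_RETRIES failures, otherwise queue it again.
        entry.state = State.ERROR if entry.retry_count >= MAX_RETRIES else State.PENDING
        return None
    entry.state = State.COMPLETED
    entry.error_message = None
    return result
```

Keeping the last error message on the row makes it possible to inspect failed entries without searching logs.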

StructuralAlignmentResults

class protein_metamorphisms_is.sql.model.StructuralAlignmentResults(**kwargs)

Bases: Base

Stores results from structural alignment tasks, providing detailed metrics and scores.

id

Unique identifier for each set of results.

Type:: Integer

cluster_entry_id

Reference to the cluster of protein chains analyzed.

Type:: Integer, ForeignKey

ce_rms

Root mean square deviation calculated by CE method.

Type:: Float

tm_rms

Root mean square deviation calculated by US-align.

Type:: Float

tm_seq_id

Sequence identity calculated by US-align.

Type:: Float

tm_score_chain_1

TM-score normalized by the first chain in the US-align alignment.

Type:: Float

tm_score_chain_2

TM-score normalized by the second chain in the US-align alignment.

Type:: Float

fc_rms

Root mean square deviation calculated by FATCAT.

Type:: Float

fc_identity

Sequence identity calculated by FATCAT.

Type:: Float

fc_similarity

Similarity score calculated by FATCAT.

Type:: Float

fc_score

Overall score calculated by FATCAT.

Type:: Float

fc_align_len

Length of the alignment calculated by FATCAT.

Type:: Float
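
For readers unfamiliar with the RMSD values stored in `ce_rms`, `tm_rms` and `fc_rms`, the sketch below computes a root mean square deviation over paired 3D coordinates. It is a simplified illustration: alignment tools report RMSD over the residue pairs they aligned, after superimposing the structures, whereas this function assumes the coordinates are already paired and superimposed. The function name `rmsd` and the sample coordinates are introduced only for the example.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root mean square deviation between two equally long lists of 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Every point is displaced by a 3-4-5 triangle, so each distance is 5 Å.
shifted = rmsd([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], [(3.0, 4.0, 0.0), (4.0, 5.0, 1.0)])
```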

GOTerm

class protein_metamorphisms_is.sql.model.GOTerm(**kwargs)

Bases: Base

Represents a Gene Ontology (GO) term associated with a protein.

This class is used to store and manage information about the functional annotation of proteins as defined by the Gene Ontology Consortium. Each GO term provides a standardized description of a protein’s molecular function, biological process, or cellular component.

id

Unique identifier for the GO term within the database.

Type:: int

go_id

Unique identifier of the GO term in the Gene Ontology system.

Type:: str

protein_entry_name

Entry name of the associated protein in UniProt.

Type:: str

protein

Relationship with the ‘Protein’ class, linking the GO term to its corresponding protein.

Type:: relationship

category

Category of the GO term, indicating whether it describes a molecular function, biological process, or cellular component.

Type:: str

description

Detailed description of the GO term, explaining the function, process, or component it represents.

Type:: str

The relationship with the ‘Protein’ class allows for the association of functional, process, or component annotations with specific proteins.
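
A common use of the `category` attribute is to split a protein's annotations into the three GO aspects. The sketch below does this with plain dictionaries. The helper `terms_by_category` is hypothetical, and the spelling of the category values is an assumption, since the source does not say how categories are encoded.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def terms_by_category(go_terms: Iterable[dict]) -> Dict[str, List[str]]:
    """Group GO identifiers by their category."""
    grouped = defaultdict(list)
    for term in go_terms:
        grouped[term["category"]].append(term["go_id"])
    return dict(grouped)


# Sample annotations; the category strings are an assumed encoding.
go_terms = [
    {"go_id": "GO:0005524", "category": "molecular function", "description": "ATP binding"},
    {"go_id": "GO:0006468", "category": "biological process", "description": "protein phosphorylation"},
    {"go_id": "GO:0005737", "category": "cellular component", "description": "cytoplasm"},
]

by_category = terms_by_category(go_terms)
```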