Aligner

class pyfamsa.Aligner

A single FAMSA aligner.

scoring_matrix

The scoring matrix used for scoring alignments.

Type:

ScoringMatrix

New in version 0.4.0: The scoring_matrix attribute.

__init__(*, threads=0, guide_tree='sl', tree_heuristic=None, n_refinements=100, keep_duplicates=False, refine=None, scoring_matrix=None, medoid_threshold=0, medoid_seeds=100, medoid_sample=2000, medoid_evaluations=1, cluster_fraction=0.1, cluster_iters=2)

Create a new aligner with the given configuration.

Keyword Arguments:
  • threads (int) – The number of threads to use for parallel computations. If 0 is given (the default), use os.cpu_count to spawn one thread per CPU on the host machine.

  • guide_tree (str) – The method for building the guide tree. Supported values are: sl for MST+Prim single linkage, slink for SLINK single linkage, upgma for UPGMA, nj for neighbour joining.

  • tree_heuristic (str or None) – The heuristic to use for constructing the tree. Supported values are: medoid for medoid trees, part for part trees, or None to disable heuristics.

  • n_refinements (int) – The number of refinement iterations to run.

  • keep_duplicates (bool) – Set to True to avoid discarding duplicate sequences before building trees or alignments.

  • refine (bool or None) – Set to True to force refinement, False to disable refinement, or leave as None to disable refinement automatically for sets of more than 1000 sequences.

  • scoring_matrix (ScoringMatrix or str) – The scoring matrix to use for scoring alignments. By default, the PFASUM43 matrix is used, like in the C++ FAMSA implementation since v2.3.0.

  • medoid_threshold (int) – The minimum number of sequences a set must contain for medoid trees to be used, if enabled with tree_heuristic.

  • medoid_seeds (int) – The number of trees to select for seeding the medoid trees with PartTree.

  • medoid_sample (int) – The number of sequences to use to perform clustering.

  • medoid_evaluations (int) – The number of evaluations to perform while building the medoid trees.

  • cluster_fraction (float) – The fraction of data points to select to estimate a guide tree with the PartTree algorithm.

  • cluster_iters (int) – The number of iterations to identify starting nodes while estimating a guide tree with the PartTree algorithm.

New in version 0.4.0: The scoring_matrix argument.

Changed in version 0.6.0: Default scoring_matrix changed from MIQS to PFASUM43.

Changed in version 0.6.1: scoring_matrix supports alphabets that are subsets of FAMSA_ALPHABET.

New in version 0.7.0: The medoid_seeds, medoid_sample, medoid_evaluations, cluster_fraction and cluster_iters arguments.
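The documented defaults for threads and refine can be restated as a small sketch. The helpers effective_threads and effective_refine are hypothetical names introduced here for illustration only; they mirror the behaviour described above and are neither part of the pyfamsa API nor a description of its internals. The fallback to a single thread when os.cpu_count returns None is an assumption of this sketch.

```python
import os

# Hypothetical helpers restating the documented defaults;
# they are not part of the pyfamsa API.

def effective_threads(threads=0):
    """Return the thread count implied by the `threads` argument."""
    if threads == 0:
        # os.cpu_count() may return None on some platforms;
        # falling back to 1 is an assumption of this sketch.
        return os.cpu_count() or 1
    return threads

def effective_refine(refine, n_sequences):
    """Return whether refinement would run for `n_sequences` sequences."""
    if refine is None:
        # Documented automatic rule: no refinement above 1000 sequences.
        return n_sequences <= 1000
    return refine

print(effective_threads(4))
print(effective_refine(None, 250))
print(effective_refine(None, 5000))
print(effective_refine(True, 5000))
```

Passing refine=True or refine=False overrides the automatic rule regardless of the number of sequences.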

align(sequences)

Align sequences together.

Example

>>> from pyfamsa import Aligner, Sequence
>>> aligner = Aligner()
>>> seqs = [Sequence(b't1', b'MMYK'), Sequence(b't2', b'MYKLP')]
>>> ali = aligner.align(seqs)
>>> list(ali)
[GappedSequence(b't1', b'MMYK--'), GappedSequence(b't2', b'-MYKLP')]

Parameters:

sequences (iterable of Sequence) – An iterable yielding the digitized sequences to align.

Returns:

Alignment – The input sequences in aligned (gapped) form.

Raises:

Changed in version 0.6.1: Sequences are now checked against the scoring_matrix alphabet.
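As a quick illustration of what the aligned output looks like, the sketch below works on the two gapped rows from the example above as plain bytes, without calling pyfamsa. It checks that every row has the same length and that stripping the gap character recovers the original inputs. The helper names are hypothetical, and `-` is assumed to be the gap character, as in the example output.

```python
GAP = b"-"  # gap character, as seen in the example output

def is_rectangular(rows):
    """Return True when all rows have the same number of columns."""
    return len({len(row) for row in rows}) <= 1

def ungap(row):
    """Remove gap characters to recover the unaligned sequence."""
    return row.replace(GAP, b"")

# Gapped rows copied from the align() example.
rows = [b"MMYK--", b"-MYKLP"]

print(is_rectangular(rows))
print([ungap(row) for row in rows])
```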

align_profiles(profile1, profile2)

Align two profiles together.

Profile-profile alignment computes a new alignment using sequences from the two input alignments while preserving the columns of each profile.

Parameters:
  • profile1 (Alignment) – The first profile to align.

  • profile2 (Alignment) – The second profile to align.

Returns:

Alignment – The resulting profile-profile alignment.

New in version 0.5.0.
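To make "preserving the columns of each profile" concrete, here is a toy sketch in plain Python. It does not call pyfamsa and is not FAMSA's algorithm: the gap positions are chosen by hand, the profiles are invented, and insert_gap_columns is a hypothetical helper. The only point is that gaps enter a profile as whole columns, so residues that shared a column before merging still share one afterwards.

```python
def insert_gap_columns(rows, positions, gap="-"):
    """Insert one gap column before each original column index in `positions`.

    An index equal to the row length appends the column at the end; a
    repeated index inserts several columns at that point.
    """
    padded = []
    for row in rows:
        pieces = []
        for index, residue in enumerate(row):
            pieces.append(gap * positions.count(index))
            pieces.append(residue)
        pieces.append(gap * positions.count(len(row)))
        padded.append("".join(pieces))
    return padded

# Two made-up profiles; rows within a profile are already aligned.
profile1 = ["MMYK", "-MYK"]
profile2 = ["MYKLP", "MYRLP"]

# Hand-picked gap columns: two appended to profile1, one prepended to profile2.
merged = insert_gap_columns(profile1, [4, 4]) + insert_gap_columns(profile2, [0])
for row in merged:
    print(row)
```

Because each gap column is applied to every row of a profile at once, the relative arrangement of residues inside each input profile is unchanged in the merged result.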

build_tree(sequences)

Build a tree from the given sequences.

Parameters:

sequences (iterable of Sequence) – An iterable yielding the digitized sequences to build a tree from.

Returns:

GuideTree – The guide tree obtained from the sequences.

Raises:

Changed in version 0.6.1: Sequences are now checked against the scoring_matrix alphabet.
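A minimal usage sketch for build_tree, using only calls documented on this page. The sequences are made up for illustration, and the import is guarded so that the snippet degrades to a no-op where pyfamsa is not installed, in which case tree stays None.

```python
try:
    from pyfamsa import Aligner, Sequence
except ImportError:
    # pyfamsa is not installed; the sketch below is skipped.
    Aligner = Sequence = None

tree = None
if Aligner is not None:
    # Made-up protein sequences, identifiers and residues given as bytes.
    sequences = [
        Sequence(b"t1", b"MMYK"),
        Sequence(b"t2", b"MYKLP"),
        Sequence(b"t3", b"MYKLPW"),
    ]
    # Single-threaded aligner using the UPGMA guide tree method.
    aligner = Aligner(threads=1, guide_tree="upgma")
    tree = aligner.build_tree(sequences)
    print(type(tree).__name__)
```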

cluster_fraction

The fraction of data points for the PartTree algorithm.

New in version 0.7.0.

Type:

float

cluster_iters

The number of iterations to identify starting nodes.

New in version 0.7.0.

Type:

int

guide_tree

The name of the method used to build the guide tree.

New in version 0.7.0.

Type:

str

keep_duplicates

Whether to keep duplicate sequences.

New in version 0.7.0.

Type:

bool

medoid_evaluations

The number of evaluations to perform for medoid trees.

New in version 0.7.0.

Type:

int

medoid_sample

The number of sequences to use to perform clustering.

New in version 0.7.0.

Type:

int

medoid_seeds

The number of trees to select for seeding the medoid trees with PartTree.

New in version 0.7.0.

Type:

int

medoid_threshold

The minimum number of sequences for medoid trees to be used.

New in version 0.7.0.

Type:

int

n_refinements

The number of refinement iterations to run.

New in version 0.7.0.

Type:

int

refine

Whether the refinement is manually enabled.

New in version 0.7.0.

Type:

bool or None

threads

The number of threads used for parallel processing.

New in version 0.7.0.

Type:

int

tree_heuristic

The heuristic for building the tree, if any.

New in version 0.7.0.

Type:

str or None