Documentation

Command line interface

haptools

haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information

haptools [OPTIONS] COMMAND [ARGS]...

Options

--version: Show the version and exit.

index

Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file

haptools index [OPTIONS] HAPLOTYPES

Options

--sort, --no-sort

Whether to sort the file before indexing it. If --no-sort is given, sorting of the file will not be performed

Default: True

-o, --output <output>

A .hap file containing sorted and indexed haplotypes and variants

Default: input file

-v, --verbosity <verbosity>

The level of verbosity desired

Default: INFO
Options: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

HAPLOTYPES: Required argument

karyogram

Visualize a karyogram of local ancestry tracks

haptools karyogram [OPTIONS]

Options

--bp <bp>: Required Path to .bp file with breakpoints

--sample <sample>: Required Sample ID to plot

--out <out>: Required Name of output file

--title <title>: Optional plot title

--centromeres <centromeres>: Optional file with telomere/centromere cM positions

--colors <colors>: Optional color dictionary. Input can be from the matplotlib list of colors or in hexcode. Format is e.g. ‘YRI:blue,CEU:green’

-v, --verbosity <verbosity>

The level of verbosity desired

Default: INFO
Options: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
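The --colors format described above (‘YRI:blue,CEU:green’) can be illustrated with a small parser. This is only a sketch: parse_colors is a hypothetical helper written for this example, not a function from haptools, and the way haptools itself parses the option may differ.

```python
def parse_colors(spec):
    """Parse a string like 'YRI:blue,CEU:green' into a dict of label -> color."""
    colors = {}
    for pair in spec.split(","):
        # Split on the first colon only, so a hex code such as '#2ca02c' stays intact
        label, _, color = pair.partition(":")
        if not label or not color:
            raise ValueError(f"Expected LABEL:COLOR but got {pair!r}")
        colors[label.strip()] = color.strip()
    return colors


color_map = parse_colors("YRI:blue,CEU:#2ca02c")
print(color_map)
```

Each population label maps to either a matplotlib color name or a hex code, matching the two input styles that the option accepts.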

ld

Compute the pair-wise LD (Pearson’s correlation) between haplotypes (or variants) and a single TARGET haplotype (or variant)

GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec

TARGET refers to the ID of a variant or haplotype. LD is computed pair-wise between TARGET and all of the other haplotypes in the .hap (or genotype) file

If TARGET is a variant ID, the ID must appear in GENOTYPES. Otherwise, it must be present in the .hap file

haptools ld [OPTIONS] TARGET GENOTYPES HAPLOTYPES

Options

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files

Default: all haplotypes

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default: all samples

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default: all samples

-i, --id <ids>

A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’). Or, if --from-gts is set, a list of the variant IDs to use from the genotypes file. For this to work, the .hap file must be indexed

Default: all haplotypes

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype (or variant) IDs (one per line) to subset from the .hap (or genotype) file

Default: all haplotypes

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default: all variants

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default: False

--from-gts

By default, LD is computed with the haplotypes in the .hap file. Use this switch to compute LD with the genotypes in the genotypes file, instead.

Default: False

-o, --output <output>

A .hap file containing haplotypes and their LD with TARGET

Default: stdout

-v, --verbosity <verbosity>

The level of verbosity desired

Default: INFO
Options: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

TARGET: Required argument

GENOTYPES: Required argument

HAPLOTYPES: Required argument

simgenotype

Simulate admixed genomes under a pre-defined model.

haptools simgenotype [OPTIONS]

Options

--model <model>: Required Admixture model in .dat format. See File Formats under simgenotype in the docs for complete info.

--mapdir <mapdir>: Required Directory containing files with chr{1-22,X} and ending in .map in the file name with genetic map coords.

--out <out>: Required Path to desired output file. E.g. /path/to/output.vcf.gz Possible outputs are vcf|bcf|vcf.gz|pgen and there will be an additional breakpoints output with extension bp e.g. /path/to/output.bp.

--chroms <chroms>: Sorted and comma delimited list of chromosomes to simulate

--seed <seed>: Random seed. Set to make simulations reproducible

--ref_vcf <ref_vcf>: Required VCF or PGEN file used as the reference for creating the genotypes of the simulated samples.

--sample_info <sample_info>: Required File that maps samples from the reference VCF (--ref_vcf) to population codes describing the populations in the header of the model file.

--region <region>: Subset the simulation to a specific region in a chromosome using the form chrom:start-end. Example 2:1000-2000

--pop_field: Flag for outputting the population field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.

--sample_field: Flag for outputting the sample field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.

--only_breakpoint: Flag used to determine whether to only output breakpoints or continue to simulate a vcf file.

-v, --verbosity <verbosity>

The level of verbosity desired

Default: INFO
Options: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

simphenotype

Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.

GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec

Note: GENOTYPES must be the output from the transform subcommand.

haptools simphenotype [OPTIONS] GENOTYPES HAPLOTYPES

Options

-r, --replications <replications>

Number of rounds of simulation to perform

Default: 1

-h, --heritability <heritability>: Trait heritability

-p, --prevalence <prevalence>

Disease prevalence if simulating a case-control trait

Default: quantitative trait

--normalize, --no-normalize

Whether to normalize the genotypes before using them for simulation

Default: True

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files

Default: all haplotypes

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default: all samples

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default: all samples

-i, --id <ids>

A list of the haplotype IDs from the .hap file to use as causal variables (ex: ‘-i H1 -i H2’).

Default: all haplotypes

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file

Default: all haplotypes

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default: all variants

--seed <seed>

Use this option across executions to make the output reproducible

Default: chosen randomly

-o, --output <output>

A TSV file containing simulated phenotypes

Default: stdout

-v, --verbosity <verbosity>

The level of verbosity desired

Default: INFO
Options: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES: Required argument

HAPLOTYPES: Required argument

transform

Creates a VCF composed of haplotypes

GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec

haptools transform [OPTIONS] GENOTYPES HAPLOTYPES

Options

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files

Default: all haplotypes

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default: all samples

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default: all samples

-i, --id <ids>

A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’).

Default: all haplotypes

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file

Default: all haplotypes

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default: all variants

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default: False

--ancestry

Also transform using VCF ‘POP’ FORMAT field and ‘ancestry’ .hap extra field

Default: False

-o, --output <output>

A VCF file containing haplotype ‘genotypes’

Default: stdout

-v, --verbosity <verbosity>

The level of verbosity desired

Default: INFO
Options: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES: Required argument

HAPLOTYPES: Required argument

Module contents

haptools.data.data module

class haptools.data.data.Data(fname, log=None)

Bases: ABC

Abstract class for accessing read-only data files

Attributes

fname: Path | str: The path to the read-only file containing the data
data: np.array: The contents of the data file, once loaded
log: Logger: A logging instance for recording debug statements.

static hook_compressed(filename, mode)

A utility to help open files regardless of their compression

Based off of python’s fileinput.hook_compressed and copied from https://stackoverflow.com/a/64106815/16815703

Return type

IO

Parameters

filename: str: The path to the file
mode: str: Either ‘r’ for read or ‘w’ for write

Returns

IO: The resolved file object
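The idea behind a utility like hook_compressed can be sketched with the standard library alone. The helper below, open_maybe_gzip, is a hypothetical stand-in written for this example. It decides based on the file extension and handles only gzip, which may not match what haptools actually does.

```python
import gzip
import tempfile
from pathlib import Path


def open_maybe_gzip(filename, mode="r"):
    """Open a file in text mode, transparently handling a .gz extension."""
    if Path(filename).suffix == ".gz":
        # gzip.open defaults to binary mode, so request text mode explicitly
        return gzip.open(filename, mode + "t")
    return open(filename, mode)


# Round-trip a single line through a gzip-compressed file
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.hap.gz"
    with open_maybe_gzip(path, "w") as f:
        f.write("H\tchr1\t0\t10\tH1\n")
    with open_maybe_gzip(path) as f:
        first_line = f.readline()
```

The caller reads and writes plain strings in both cases and never needs to know whether the file on disk is compressed.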

abstract classmethod load(fname)

Read the file contents and perform any recommended pre-processing

Parameters

fname: Path: See documentation for fname

abstract read(): Read the raw file contents into the class properties

unset()

Whether the data has been loaded into the object yet

Return type

bool

Returns

bool: True if data is None else False

haptools.data.genotypes module

class haptools.data.genotypes.Genotypes(fname, log=None)

Bases: Data

A class for processing genotypes from a file

Examples

>>> genotypes = Genotypes.load('tests/data/simple.vcf')
>>> # directly access the loaded variants, samples, and genotypes (in data)
>>> genotypes.variants
>>> genotypes.samples
>>> genotypes.data

Attributes

data: npt.NDArray

The genotypes in an n (samples) x p (variants) x 2 (strands) array

fname: Path | str

The path to the read-only file containing the data

samples: tuple[str]

The names of each of the n samples

variants: np.array

Variant-level meta information:

ID
CHROM
POS

log: Logger

A logging instance for recording debug statements

_prephased: bool

If True, assume that the genotypes are phased. Otherwise, extract their phase when reading from the VCF.

_samp_idx: dict[str, int]

Sample index; maps samples to indices in self.samples

_var_idx: dict[str, int]

Variant index; maps variant IDs to indices in self.variants

check_biallelic(discard_also=False)

Check that each genotype is composed of only two alleles

This function modifies the dtype of data from uint8 to bool

Parameters

discard_also: bool, optional: If True, discard any multiallelic variants without raising a ValueError

Raises

ValueError: If any of the genotypes have more than two alleles

check_maf(threshold=None, discard_also=False, warn_only=False)

Check the minor allele frequency of each variant

Raise a ValueError if any variant’s MAF doesn’t satisfy the threshold, if one is provided

Return type

ndarray[Any, dtype[float64]]

Note

You should call check_missing() and check_biallelic() before executing this method, for best results. Otherwise, the frequencies may be computed incorrectly.

Parameters

threshold: float, optional

If a variant has a minor allele frequency (MAF) rarer than this threshold, raise a ValueError

discard_also: bool, optional

If True, discard any variants that would otherwise cause a ValueError

This parameter will be ignored if a threshold is not specified

warn_only: bool, optional

Just raise a warning instead of a ValueError

Returns

The minor allele frequency of each variant

Raises

ValueError: If any variant does not meet the provided threshold minor allele frequency
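The minor allele frequency check can be sketched in plain Python. This example assumes biallelic genotypes stored as nested lists of 0 (REF) and 1 (ALT) in the n (samples) x p (variants) x 2 (strands) layout. The function names are hypothetical, and the real method operates on numpy arrays rather than lists.

```python
def minor_allele_freqs(genotypes):
    """Return the minor allele frequency of each of the p variants."""
    num_samples = len(genotypes)
    num_variants = len(genotypes[0])
    mafs = []
    for v in range(num_variants):
        # Count ALT alleles across both strands of every sample
        alt_count = sum(sample[v][0] + sample[v][1] for sample in genotypes)
        alt_freq = alt_count / (2 * num_samples)
        # The minor allele is whichever of REF or ALT is rarer
        mafs.append(min(alt_freq, 1 - alt_freq))
    return mafs


def check_maf_sketch(genotypes, threshold=None):
    """Raise a ValueError if any variant is rarer than the threshold."""
    mafs = minor_allele_freqs(genotypes)
    if threshold is not None:
        rare = [i for i, maf in enumerate(mafs) if maf < threshold]
        if rare:
            raise ValueError(f"Variants at indices {rare} have MAF below {threshold}")
    return mafs


gts = [
    [[0, 1], [1, 1]],  # sample 0: two variants, two strands each
    [[0, 0], [1, 0]],  # sample 1
]
mafs = check_maf_sketch(gts)
```

Missing or multiallelic genotypes would distort the counts in this sketch, which is consistent with the advice to run check_missing() and check_biallelic() first.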

check_missing(discard_also=False)

Check that each sample is properly genotyped

Parameters

discard_also: bool, optional: If True, discard any samples that are missing genotypes without raising a ValueError

Raises

ValueError: If any of the samples have missing genotypes ‘GT: .|.’

check_phase()

Check that the genotypes are phased then remove the phasing info from the data

This function modifies data in-place

Raises

ValueError: If any heterozygous genotypes are unphased

check_sorted()

Check that the variant coordinates are sorted

Raise a ValueError if any of the variants in any of the chromosomes are unsorted

Raises

ValueError: If any variant position is less than a position that comes before it within the same chromosome
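The sortedness rule stated above can be sketched as a single pass over the variants that remembers the last position seen on each chromosome. The function name and the (ID, CHROM, POS) tuple layout are assumptions made for this example.

```python
def check_sorted_sketch(variants):
    """Raise a ValueError if positions decrease within any chromosome.

    variants: iterable of (id, chrom, pos) tuples, in file order
    """
    last_pos = {}
    for var_id, chrom, pos in variants:
        if chrom in last_pos and pos < last_pos[chrom]:
            raise ValueError(f"Variant {var_id} at {chrom}:{pos} is out of order")
        last_pos[chrom] = pos


sorted_variants = [("rs1", "chr1", 100), ("rs2", "chr1", 250), ("rs3", "chr2", 50)]
check_sorted_sketch(sorted_variants)  # passes silently

unsorted_variants = [("rs1", "chr1", 250), ("rs2", "chr1", 100)]
try:
    check_sorted_sketch(unsorted_variants)
    found_unsorted = False
except ValueError:
    found_unsorted = True
```

Positions are compared only within the same chromosome, so rs3 on chr2 may have a smaller position than rs2 on chr1 without triggering an error.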

index(samples=True, variants=True)

Call this function once to improve the amortized time-complexity of look-ups of samples and variants by their ID. This is useful if you intend to later subset by a set of samples or variant IDs. The time complexity of this function should be roughly O(n+p) if both parameters are True. Otherwise, it will be either O(n) or O(p).

Parameters

samples: bool, optional: Whether to index the samples for fast look-up. Adds complexity O(n).
variants: bool, optional: Whether to index the variants for fast look-up. Adds complexity O(p).

Raises

ValueError: If any samples or variants appear more than once
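The kind of index described here can be sketched as a dictionary built in one pass. The helper build_index is hypothetical, and the internal structures that haptools uses may differ.

```python
def build_index(ids):
    """Map each ID to its position, rejecting duplicates. Runs in O(len(ids))."""
    index = {}
    for i, item in enumerate(ids):
        if item in index:
            raise ValueError(f"{item!r} appears more than once")
        index[item] = i
    return index


samples = ("HG00096", "HG00097", "HG00099")
samp_idx = build_index(samples)

# Each later look-up is O(1) on average, and the requested order is preserved
subset_order = [samp_idx[s] for s in ("HG00099", "HG00096")]
```

Without an index, every look-up would require scanning the whole tuple of samples. Building the dictionary once pays off when many look-ups follow, such as when subsetting.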

classmethod load(fname, region=None, samples=None, variants=None)

Load genotypes from a VCF file

Read the file contents, check the genotype phase, and create the MAC matrix

Return type

Genotypes

Parameters

fname: See documentation for fname
region: str, optional: See documentation for read()
samples: list[str], optional: See documentation for read()
variants: set[str], optional: See documentation for read()

Returns

Genotypes: A Genotypes object with the data loaded into its properties

read(region=None, samples=None, variants=None, max_variants=None)

Read genotypes from a VCF into a numpy matrix stored in data

Parameters

region: str, optional

The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the VCF must be indexed and the seqname must match!

Defaults to loading all genotypes

samples: list[str], optional

A subset of the samples from which to extract genotypes

Defaults to loading genotypes from all samples

variants: set[str], optional

A set of variant IDs for which to extract genotypes

All other variants will be ignored. This may be useful if you’re running out of memory.

max_variants: int, optional

The maximum number of variants to load from the file. Setting this value helps preallocate the arrays, making the process faster and less memory intensive. You should use this option if your processes are frequently “Killed” from memory overuse.

If you don’t know how many variants there are, set this to a large number greater than what you would expect. The np array will be resized appropriately. You can also use the bcftools “counts” plugin to obtain the number of expected sites within a region.

Note that this value is ignored if the variants argument is provided.

Raises

ValueError: If the genotypes array is empty
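The two region formats accepted by the region parameter (‘chr1:1234-34566’ and ‘chr7’) can be illustrated with a small parser. The function parse_region is a hypothetical helper for this example only. It is not known how haptools parses regions internally, and it may simply pass the string on to the underlying file reader.

```python
def parse_region(region):
    """Split a region string into (seqname, start, end).

    A bare seqname such as 'chr7' yields (seqname, None, None).
    """
    chrom, sep, coords = region.partition(":")
    if not sep:
        return chrom, None, None
    start, _, end = coords.partition("-")
    return chrom, int(start), int(end)


full = parse_region("chr1:1234-34566")
whole_chrom = parse_region("chr7")
```

The seqname must match the one used in the file exactly, so ‘chr1’ and ‘1’ are not interchangeable.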

subset(samples=None, variants=None, inplace=False)

Subset these genotypes to a smaller set of samples or a smaller set of variants

The order of the samples and variants in the subsetted instance will match the order in the provided tuple parameters.

Parameters

samples: tuple[str]: A subset of samples to keep
variants: tuple[str]: A subset of variant IDs to keep
inplace: bool, optional: If False, return a new Genotypes object; otherwise, alter the current one

Returns

A new Genotypes object if inplace is set to False, else returns None

class haptools.data.genotypes.GenotypesPLINK(fname, log=None, chunk_size=None)

Bases: GenotypesRefAlt

A class for processing genotypes from a PLINK .pgen file

Examples

>>> genotypes = GenotypesPLINK.load('tests/data/simple.pgen')

Attributes

data: np.array

See documentation for data

samples: tuple

See documentation for samples

variants: np.array

See documentation for variants

log: Logger

See documentation for log

chunk_size: int, optional

The max number of variants to fetch from and write to the PGEN file at any given time

If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory

_prephased: bool

See documentation for _prephased

read(region=None, samples=None, variants=None, max_variants=None)

Read genotypes from a PGEN file into a numpy matrix stored in data

Parameters

region: str, optional: See documentation for read
samples: list[str], optional: See documentation for read
variants: set[str], optional: See documentation for read
max_variants: int, optional: See documentation for read

read_samples(samples=None)

Read sample IDs from a PSAM file into a list stored in samples

This method is called automatically by read()

Parameters

samples: list[str], optional: See documentation for read

Returns

npt.NDArray[np.uint32]: The indices of each of the samples within the PSAM file

read_variants(region=None, variants=None, max_variants=None)

Read variants from a PVAR file into a numpy array stored in variants

One of either variants or max_variants MUST be specified!

This method is called automatically by read()

Parameters

region: str, optional: See documentation for read
variants: set[str], optional: See documentation for read
max_variants: int, optional: See documentation for read

Returns

npt.NDArray[np.uint32]: The indices of each of the variants within the PVAR file

write(): Write the variants in this class to PLINK2 files at fname

write_samples()

Write sample IDs to a PSAM file from a list stored in samples

This method is called automatically by write()

write_variants()

Write variant IDs to a PVAR file from the numpy array stored in variants

This method is called automatically by write()

class haptools.data.genotypes.GenotypesRefAlt(fname, log=None)

Bases: Genotypes

A class for processing genotypes from a file

Unlike the base Genotypes class, this class also includes REF and ALT alleles in the variants array

Attributes

data: np.array

See documentation for data

fname: Path | str

See documentation for fname

samples: tuple[str]

See documentation for samples

variants: np.array

Variant-level meta information:

ID
CHROM
POS
REF
ALT

log: Logger

See documentation for log

write(): Write the variants in this class to a VCF at fname

haptools.data.phenotypes module

class haptools.data.phenotypes.Phenotypes(fname, log=None)

Bases: Data

A class for processing phenotypes from a file

Examples

>>> phenotypes = Phenotypes.load('tests/data/simple.pheno')

Attributes

data: np.array: The phenotypes in an n (samples) x m (phenotypes) array
fname: Path | str: The path to the file containing the data
samples: tuple: The names of each of the n samples
names: tuple[str]: The names of the phenotypes
log: Logger: A logging instance for recording debug statements.

append(name, data)

Append a new set of phenotypes to the current set

Parameters

name: str: The name of the new phenotype
data: npt.NDArray: A 1D np array of the same length as samples, containing the phenotype values for each sample. Must have the same dtype as

classmethod load(fname, samples=None)

Load phenotypes from a pheno file

Read the file contents and standardize the phenotypes

Return type

Phenotypes

Parameters

fname: See documentation for fname
samples: set[str], optional: See documentation for read()

Returns

Phenotypes: A Phenotypes object with the data loaded into its properties

read(samples=None)

Read phenotypes from a pheno file into a numpy matrix stored in data

Parameters

samples: set[str], optional

A subset of the samples from which to extract phenotypes

Defaults to loading phenotypes from all samples

Raises

AssertionError: If the provided file doesn’t follow the expected format

standardize()

Standardize phenotypes so they have a mean of 0 and a stdev of 1

This function modifies data in-place
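Standardization to a mean of 0 and a standard deviation of 1 can be sketched with the statistics module. This example assumes the population standard deviation (the numpy default) and returns new lists instead of modifying anything in place. Both choices are assumptions made for illustration, and the function name is hypothetical.

```python
from statistics import fmean, pstdev


def standardize_columns(columns):
    """Standardize each phenotype column (a list with one value per sample)."""
    result = []
    for col in columns:
        mean = fmean(col)
        sd = pstdev(col)  # population standard deviation
        result.append([(x - mean) / sd for x in col])
    return result


height = [1.0, 2.0, 3.0]
(z_height,) = standardize_columns([height])
```

After the transformation, each phenotype is expressed in units of standard deviations from its own mean. Note that a constant column has a standard deviation of 0 and would cause a division error in this sketch.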

write()

Write the phenotypes in this class to a file at fname

Examples

To write to a file, you must first initialize a Phenotypes object and then fill out the names, data, and samples properties:

>>> phenotypes = Phenotypes('tests/data/simple.pheno')
>>> phenotypes.names = ('height',)
>>> phenotypes.data = np.array([1, 1, 2], dtype='float64')
>>> phenotypes.samples = ('HG00096', 'HG00097', 'HG00099')
>>> phenotypes.write()

haptools.data.covariates module

class haptools.data.covariates.Covariates(fname, log=None)

Bases: Phenotypes

A class for processing covariates from a file

Examples

>>> covariates = Covariates.load('tests/data/simple.covar')

Attributes

data: np.array: The covariates in an n (samples) x m (covariates) array
fname: Path | str: The path to the read-only file containing the data
samples: tuple[str]: The names of each of the n samples
names: tuple[str]: The names of the covariates
log: Logger: A logging instance for recording debug statements.

haptools.data.haplotypes module

class haptools.data.haplotypes.Extra(name, fmt='s', description='')

Bases: object

An extra field on a line in the .hap file

Attributes

name: str: The name of the extra field
fmt: str = “s”: The python fmt string of the field value; indicates how to format the value
description: str = “”: A description of the extra field

description: str = ''

fmt: str = 's'

property fmt_str: str

Convert an Extra into a fmt string

Returns

str: A python format string (ex: “{beta:.3f}”)
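How a name and a fmt combine into a format string like “{beta:.3f}” can be sketched with a small dataclass. ExtraSketch is a hypothetical stand-in for the Extra class, and the way fmt_str is built here is an assumption inferred from the documented example.

```python
from dataclasses import dataclass


@dataclass
class ExtraSketch:
    name: str
    fmt: str = "s"
    description: str = ""

    @property
    def fmt_str(self) -> str:
        # Wrap the name and format spec in braces, e.g. "{beta:.3f}"
        return "{" + self.name + ":" + self.fmt + "}"


beta = ExtraSketch("beta", ".3f", "Effect size")
formatted = beta.fmt_str.format(beta=0.12345)

ancestry = ExtraSketch("ancestry")
label = ancestry.fmt_str.format(ancestry="YRI")
```

The resulting string can be passed to str.format() with the field value as a keyword argument, which makes it straightforward to render every extra field of a line in a consistent way.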

classmethod from_hap_spec(line)

Convert an “extra” line in the header of a .hap file into an Extra object

Return type

Extra

Parameters

line: str: An “extra” field, as it appears declared in the header

Returns

Extra: An Extra object

name: str

to_hap_spec(line_type_symbol)

Convert an Extra object into a header line in the .hap format spec

Return type

str

Parameters

line_type_symbol: str: The symbol denoting the type of line (ex: ‘H’ or ‘V’) for which this extra field is declared

Returns

str: A valid header line declaring this extra field in the .hap format spec

class haptools.data.haplotypes.Haplotype(chrom, start, end, id)

Bases: object

A haplotype within the .hap format spec

In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of the extra fields and 2) override the _extras property to describe the header declaration

Examples

Let’s extend this class and add an extra field called “ancestry”

>>> from dataclasses import dataclass, field
>>> @dataclass
... class CustomHaplotype(Haplotype):
...     ancestry: str
...     _extras: tuple = field(
...         repr=False,
...         init=False,
...         default = (
...             Extra("ancestry", "s", "Local ancestry"),
...         ),
...     )

Attributes

chrom: str: The contig to which this haplotype belongs
start: int: The chromosomal start position of the haplotype
end: int: The chromosomal end position of the haplotype
id: str: The haplotype’s unique ID
variants: tuple[Variant]: The variants in this haplotype
_extras: tuple[Extra]: Extra fields for the haplotype

property ID: Create an alias for the id property

chrom: str

end: int

classmethod extras_head()

Return the header lines of the extra fields that are supported

Return type

set

Returns

tuple: The header lines of the extra fields

classmethod extras_order()

The names of the extra fields in order

Returns

tuple[str]: The names of the extra fields in the order in which they are stored

classmethod from_hap_spec(line, variants=(), types=None)

Convert a haplotype line in the .hap format spec into a Haplotype object

Note that this implementation does NOT support having more extra fields than appear in the header

Parameters

line: str: A haplotype (H) line from the .hap file
variants: tuple[Variant], optional: The Variants in this haplotype
types: dict[str, type], optional: The types of each property in the object

Returns

Haplotype: The Haplotype object for the line

id: str

sort(): Sorts the variants within this Haplotype instance

start: int

to_hap_spec()

Convert a Haplotype object into a haplotype line in the .hap format spec

Return type

str

Returns

str: A valid haplotype line (H) in the .hap format spec

transform(genotypes)

Transform a genotypes matrix via the current haplotype

Each entry in the returned matrix denotes the presence of the current haplotype in each chromosome of each sample in the Genotypes object

Return type

ndarray[Any, dtype[bool]]

Parameters

genotypes: GenotypesRefAlt

The genotypes to transform using the current haplotype

If the genotypes have not been loaded into the Genotypes object yet, this method will call Genotypes.read(), while loading only the needed variants

Returns

npt.NDArray[bool]: A 2D matrix of shape (num_samples, 2) where each entry in the matrix denotes the presence of the haplotype in one chromosome of a sample
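The transformation can be sketched in plain Python under the assumption that a haplotype is present on a chromosome when that chromosome carries every one of the haplotype’s alleles. The function name and the use of nested lists of allele strings are choices made for this example. The real method works with a GenotypesRefAlt object and numpy arrays.

```python
def haplotype_presence(genotypes, variant_ids, hap_alleles):
    """Return an n x 2 nested list of booleans.

    genotypes: n (samples) x p (variants) x 2 (strands) nested list of alleles
    variant_ids: the p variant IDs, in column order
    hap_alleles: dict mapping variant ID -> the allele the haplotype requires
    """
    # Resolve each variant of the haplotype to its column in the genotypes
    cols = [(variant_ids.index(vid), allele) for vid, allele in hap_alleles.items()]
    return [
        [all(sample[col][strand] == allele for col, allele in cols) for strand in (0, 1)]
        for sample in genotypes
    ]


variant_ids = ["rs1", "rs2"]
gts = [
    [["A", "G"], ["T", "T"]],  # sample 0: rs1 is A|G, rs2 is T|T
    [["A", "A"], ["T", "C"]],  # sample 1: rs1 is A|A, rs2 is T|C
]
presence = haplotype_presence(gts, variant_ids, {"rs1": "A", "rs2": "T"})
```

In this toy data, the first chromosome of each sample carries both A at rs1 and T at rs2, while the second chromosome of each sample is missing one of the two alleles.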

types = {'chrom': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}

property varIDs

variants: tuple

class haptools.data.haplotypes.Haplotypes(fname, haplotype=<class 'haptools.data.haplotypes.Haplotype'>, variant=<class 'haptools.data.haplotypes.Variant'>, log=None)

Bases: Data

A class for processing haplotypes from a file

Examples

Parsing a basic .hap file without any extra fields is simple:

>>> haplotypes = Haplotypes.load('tests/data/basic.hap')
>>> haps = haplotypes.data # a dictionary of Haplotype objects

If the .hap file contains extra fields, you’ll need to call the read() method manually. You’ll also need to create Haplotype and Variant subclasses that support the extra fields and then specify the names of the classes when you initialize the Haplotypes object:

>>> haplotypes = Haplotypes('tests/data/simphenotype.hap', HaptoolsHaplotype)
>>> haplotypes.read()
>>> haps = haplotypes.data # a dictionary of Haplotype objects

Attributes

fname: Path | str

The path to the file containing the data

data: dict[str, Haplotype]

A dict of Haplotype objects keyed by their IDs

types: dict

A dict of class names keyed by the symbol denoting their line type

Ex: {‘H’: Haplotype, ‘V’: Variant}

version: str

A string denoting the current file format version

log: Logger

A logging instance for recording debug statements.

check_header(lines, check_version=True, softly=False)

1) Check and parse any metadata and 2) check that any extra fields declared in the .hap file can be handled by the Variant and Haplotype classes provided in __init__()

This function is called automatically by other methods that read .hap files

Parameters

lines: list[str]: Header lines from the .hap file. Any lines beginning with # may appear in this list, especially if the file is sorted. So this may include regular comments, too.
check_version: bool, optional: Whether to also check the version of the file
softly: bool, optional: If True, then this function will not raise any ValueErrors. Instead, it will only issue errors via the logging module, which may be ignored.

Returns

tuple[dict, dict[str, tuple[Extra]]]

The metadata for the file, contained within the header lines and encoded as a dictionary where the names are keys and any subsequent fields are values

The second dictionary encodes the set of declared extra field names for each line type

Raises

ValueError: If any of the header lines are not supported

check_version(version, err_msgr)

Check the observed version string against the current version string of this instance

Parameters

version: str: The observed version string
err_msgr: Callable: A function which takes a single parameter (the error message) and errors appropriately

Returns

The parsed, observed version string

classmethod load(fname, region=None, haplotypes=None)

Load haplotypes from a .hap file

Read the file contents

Return type

Haplotypes

Parameters

fname: Path: See documentation for fname
region: str, optional: See documentation for read()
haplotypes: set[str], optional: See documentation for read()

Returns

Haplotypes: A Haplotypes object with the data loaded into its properties

read(region=None, haplotypes=None)

Read haplotypes from a .hap file into a dict stored in data

Parameters

region: str, optional

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the .hap file must be indexed and the seqname must match!

Defaults to loading all haplotypes

haplotypes: set[str], optional

A set of haplotype IDs corresponding to a subset of the haplotypes to extract

Defaults to loading all haplotypes

sort()

Sorts .hap files first by chrom, followed by start, end, and lastly ID

Also sorts the variants within each haplotype
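The sort order described above (chrom, then start, then end, then ID) can be sketched with a tuple sort key. HapSketch is a hypothetical stand-in for the Haplotype class, and this example sorts only the haplotypes, not the variants within them.

```python
from dataclasses import dataclass


@dataclass
class HapSketch:
    chrom: str
    start: int
    end: int
    id: str


haps = {
    "H3": HapSketch("chr2", 5, 20, "H3"),
    "H2": HapSketch("chr1", 10, 30, "H2"),
    "H1": HapSketch("chr1", 10, 30, "H1"),
    "H4": HapSketch("chr1", 0, 50, "H4"),
}

# Tuples compare element by element, so ties on chrom fall through to start, and so on
sorted_haps = dict(
    sorted(haps.items(), key=lambda kv: (kv[1].chrom, kv[1].start, kv[1].end, kv[1].id))
)
sorted_ids = list(sorted_haps)
```

Chromosome names are compared as strings in this sketch, so ‘chr10’ would sort before ‘chr2’. Whether haptools orders chromosomes the same way is not stated in this reference.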

subset(haplotypes, inplace=False)

Subset these haplotypes to a smaller set of haplotypes

The order of the haplotypes in the subsetted instance will match the order in the provided tuple parameters.

Parameters

haplotypes: tuple[str]: A subset of haplotype IDs to keep
inplace: bool, optional: If False, return a new Haplotypes object; otherwise, alter the current one

Returns

A new Haplotypes object if inplace is set to False, else returns None

to_str(sort=True)

Create a string representation of this Haplotypes object

Return type

Generator[str, None, None]

Parameters

sort: bool, optional: Whether to attempt to output lines in sorted order

Yields

Generator[str, None, None]: A list of lines (strings) to include in the output

transform(gts, hap_gts=None)

Transform a genotypes matrix via the current haplotypes

Each entry in the returned matrix denotes the presence of each haplotype in each chromosome of each sample in the Genotypes object

Return type

GenotypesRefAlt

Parameters

gts: GenotypesRefAlt: The genotypes to transform using the current haplotypes
hap_gts: GenotypesRefAlt: An empty GenotypesRefAlt object into which the haplotype genotypes should be stored

Returns

GenotypesRefAlt: A Genotypes object composed of haplotypes instead of regular variants.

write()

Write the contents of this Haplotypes object to the file at fname

If the items in data are sorted, then the output should be automatically sorted such that “sort -k1,4” would leave the output unchanged

Examples

To write to a .hap file, you must first initialize a Haplotypes object and then fill out the data property:

>>> haplotypes = Haplotypes('tests/data/basic.hap')
>>> haplotypes.data = {'H1': Haplotype('chr1', 0, 10, 'H1')}
>>> haplotypes.write()

class haptools.data.haplotypes.Variant(start, end, id, allele)

Bases: object

A variant within the .hap format spec

In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of the extra fields and 2) override the _extras property to describe the header declaration

Examples

Let’s extend this class and add an extra field called “score”

>>> from dataclasses import dataclass, field
>>> @dataclass
... class CustomVariant(Variant):
...     score: float
...     _extras: tuple = field(
...         repr=False,
...         init=False,
...         default = (
...             Extra("score", ".3f", "Importance of inclusion"),
...         ),
...     )

Attributes

start: int

The chromosomal start position of the variant

end: int

The chromosomal end position of the variant

In most cases this will be the same as the start position

id: str

The variant’s unique ID

allele: str

The allele of this variant within the Haplotype

_extras: tuple[Extra]

Extra fields for the variant

property ID: Create an alias for the id property

allele: str

end: int

classmethod extras_head()

Return the header lines of the extra fields that are supported

Return type

set

Returns

tuple: The header lines of the extra fields

classmethod extras_order()

The names of the extra fields in order

Returns

tuple[str]: The names of the extra fields in the order in which they are stored

classmethod from_hap_spec(line, types=None)

Convert a variant line into a Variant object in the .hap format spec

Note that this implementation does NOT support having more extra fields than appear in the header

Parameters

line: str: A variant (V) line from the .hap file
types: dict[str, type], optional: The order of the extra fields if different from the order in _extras

Returns

tuple[str, Variant]: The haplotype ID and Variant object for the variant

id: str

start: int

to_hap_spec(hap_id)

Convert a Variant object into a variant line in the .hap format spec

Return type

str

Parameters

hap_id: str: The ID of the haplotype associated with this variant

Returns

str: A valid variant line (V) in the .hap format spec

types = {'allele': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}

class haptools.data.haplotypes.classproperty(fget)

Bases: object

A dead-simple read-only decorator that combines the functionality of @classmethod and @property

Stolen from https://stackoverflow.com/a/13624858/16815703
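As a rough sketch of how such a decorator can be built on Python's descriptor protocol (the haptools code may differ in detail; the `Example` class and its `_fields` attribute are hypothetical):

```python
class classproperty:
    """Descriptor that evaluates fget against the owning class on access."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, instance, owner):
        # 'instance' is None when the attribute is accessed on the class itself
        return self.fget(owner)


class Example:  # hypothetical class, for illustration only
    _fields = ("start", "end")

    @classproperty
    def field_count(cls):
        return len(cls._fields)


count_on_class = Example.field_count
count_on_instance = Example().field_count
```

Because the descriptor defines no setter, the value is computed from the class each time it is read, whether through the class or through an instance.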

haptools.data.breakpoints module

class haptools.data.breakpoints.Breakpoints(fname, log=None)

Bases: Data

A class for processing breakpoints from a file

Examples

>>> breakpoints = Breakpoints.load('tests/data/test.bp')

Attributes

data: dict[str, SampleBlocks]: The haplotype blocks for each chromosome in each sample. This dict maps samples (as strings) to their haplotype blocks (as SampleBlocks)
fname: Path | str: The path to the file containing the data
labels: dict | None: A dictionary containing population labels. It maps each label to the unique integers in the “pop” field of data
log: Logger: A logging instance for recording debug statements.

encode()

Replace each ancestral label in data with an equivalent integer. Store a dictionary mapping these integers back to their respective labels.

This method modifies data in place.

Returns

dict[int, str]: A dictionary mapping each integer back to its ancestral label
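The round trip between encode() and recode() can be pictured with plain Python lists. This is only a sketch: the helper names are hypothetical, and numbering labels in order of first appearance is an assumption, so the real encode() may assign integers differently.

```python
def encode_labels(pops):
    """Hypothetical helper: replace each label with an integer code."""
    label_to_int = {}
    encoded = []
    for pop in pops:
        if pop not in label_to_int:
            # Assumption: codes are assigned in order of first appearance
            label_to_int[pop] = len(label_to_int)
        encoded.append(label_to_int[pop])
    int_to_label = {code: label for label, code in label_to_int.items()}
    return encoded, int_to_label


def recode_labels(encoded, int_to_label):
    """Hypothetical helper: map integer codes back to their labels."""
    return [int_to_label[code] for code in encoded]


pops = ["YRI", "CEU", "YRI", "CEU"]
encoded, mapping = encode_labels(pops)
decoded = recode_labels(encoded, mapping)
```

The returned mapping plays the role of the dict[int, str] described above.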

classmethod load(fname, samples=None)

Load breakpoints from a TSV file

Read the file contents and standardize the Breakpoints

Return type

Breakpoints

Parameters

fname: See documentation for fname
samples: set[str], optional: See documentation for read()

Returns

Breakpoints: A Breakpoints object with the data loaded into its properties

population_array(variants, samples=None)

Output an array denoting the population labels of each variant for each sample

Parameters

variants: np.array: Variant-level meta information in a mixed np array of dtypes: CHROM (str) and POS (int)
samples: tuple[str], optional: A subset of samples to include in the output, ordered by their given order

Returns

npt.NDArray

An array of shape: samples x variants x 2

The array will have the same dtype as the population labels in the “pop” field of data. Use encode() or recode() to change this.

read(samples=None)

Read breakpoints from a TSV file into a data structure stored in data

Parameters

samples: set[str], optional

A subset of the samples for which to extract breakpoints

Defaults to loading breakpoints for all samples

Raises

AssertionError: If the provided file doesn’t follow the expected format

recode()

Replace each integer in data with an equivalent ancestral label. Use the dictionary mapping these integers back to their respective ancestral labels stored in labels.

This method modifies data in place.

write()

Write the breakpoints in this class to a file at fname

Examples

To write to a file, you must first initialize a Breakpoints object and then fill out the data property:

>>> import numpy as np
>>> from haptools.data import Breakpoints, HapBlock
>>> breakpoints = Breakpoints('simple.bp')
>>> breakpoints.data = {
...     'HG00096': [
...         np.array([('YRI','chr1',10114,4.3),('CEU','chr1',10116,5.2)], dtype=HapBlock),
...         np.array([('CEU','chr1',10114,4.3),('YRI','chr1',10116,5.2)], dtype=HapBlock),
...     ], 'HG00097': [
...         np.array([('YRI','chr1',10114,4.3),('CEU','chr2',10116,5.2)], dtype=HapBlock),
...         np.array([('CEU','chr1',10114,4.3),('YRI','chr2',10116,5.2)], dtype=HapBlock),
...     ]
... }
>>> breakpoints.write()

haptools.sim_genotype module

haptools.sim_genotype.get_segment(pop, haplotype, chrom, start_coord, end_coord, end_pos, prev_gen_samples)

Create a segment or segments for an individual of the current generation using either a population label (>0) or the previous generation’s samples if the admix pop type (0) is used.

Parameters

pop: int: index of population. Can recover population name from pop_dict
haplotype: int: index of range [0, len(prev_gen_samples)] to identify the parent haplotype to copy segments from
chrom: int: chromosome the haplotype segment lies on
start_coord: int: starting coordinate from where to begin taking segments from previous generation samples
end_coord: int: ending coordinate of haplotype segment
end_pos: float: ending coordinate in centimorgans
prev_gen_samples: list[list[HaplotypeSegment]]: the previous generation simulated used as the parents for the current generation

Returns

segments: list[HaplotypeSegment]: A list of HaplotypeSegments storing the population type and end coordinate

haptools.sim_genotype.output_vcf(breakpoints, chroms, model_file, variant_file, sampleinfo_file, region, pop_field, sample_field, out, log)

Takes in simulated breakpoints and uses reference files, vcf and sampleinfo, to create simulated variants output in file: out + .vcf

Parameters

breakpoints: list[list[HaplotypeSegment]]

the simulated breakpoints

chroms: list[str]

List of chromosomes that were used to simulate

model_file: str

file with the following structure. (Must be tab delimited)

Header: number of samples, Admixed, {all pop labels}
Below: generation number, frac, frac, frac

For example,

  Admixed    CEU   YRI
     0        0.05  0.95
     0.20     0.05  0.75

variant_file: str

file path that contains samples and respective variants. Can be in VCF, BCF, VCF.GZ, or PGEN format.

sampleinfo_file: str

file path that contains mapping from sample name in vcf to population

region: dict(str->str/int/int)

Dictionary with the keys “chr”, “start”, and “end” holding the chromosome (str), start position (int), and end position (int). It restricts the simulation to variants within that region.

pop_field: boolean

Flag to determine whether to have the population field in the VCF file output

sample_field: boolean

Flag to determine whether to have the sample field in the VCF file output

out: str

output prefix

log: log object

Outputs messages to the appropriate channel.

haptools.sim_genotype.simulate_gt(model_file, coords_dir, chroms, region, popsize, log, seed=None)

Simulate admixed genotypes based on the parameters of model_file.

Parameters

model_file: str

File with the following structure. (Must be tab delimited)

Header: number of samples, Admixed, {all pop labels}
Below: generation number, frac, frac, frac

For example,

  Admixed    CEU   YRI
     0        0.05  0.95
     0.20     0.05  0.75

coords_dir: str

Directory containing files ending in .map with genetic map coords in cM used for recombination points

chroms: list[str]

List of chromosomes to simulate admixture for.

region: dict()

Dictionary with the keys “chr”, “start”, and “end” holding the chromosome, start position, and end position. It restricts the simulation to that region.

popsize: int

Size of population created for each generation.

log: log object

Outputs messages to the appropriate channel.

seed: int

Seed used for randomization.

Returns

num_samples: int: Total number of samples to output
pop_dict: dict(int->str): Dictionary that maps populations from their encoded version as integers to their population name as a string. ex: {1:CEU, 2:YRI}
next_gen_samples: list[list[HaplotypeSegment]]: Each list is a person containing a variable number of Haplotype Segments based on how many recombination events occurred throughout the generations of ancestors for this person.
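The model_file layout described under Parameters (a header with the number of samples, "Admixed", and the population labels, followed by one row per generation) could be read along these lines. This is a sketch, not the parser haptools uses: `parse_model` is a hypothetical name, the file contents are made up, and the check that fractions sum to 1 is an assumption.

```python
import csv
import io


def parse_model(handle):
    """Hypothetical parser for the tab-delimited model layout."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    num_samples = int(header[0])
    pops = header[1:]  # "Admixed" followed by the population labels
    generations = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        gen = int(row[0])
        fracs = [float(value) for value in row[1:]]
        if len(fracs) != len(pops):
            raise ValueError(f"generation {gen}: expected {len(pops)} fractions")
        # Assumption: the fractions of each generation must sum to 1
        if abs(sum(fracs) - 1.0) > 1e-6:
            raise ValueError(f"generation {gen}: fractions must sum to 1")
        generations.append((gen, dict(zip(pops, fracs))))
    return num_samples, pops, generations


# Made-up file contents, for illustration only
example = "40\tAdmixed\tCEU\tYRI\n1\t0\t0.05\t0.95\n2\t0.20\t0.05\t0.75\n"
num_samples, pops, generations = parse_model(io.StringIO(example))
```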

haptools.sim_genotype.start_segment(start, chrom, segments)

Find the first segment on chrom whose end coordinate is greater than start, using binary search.

Parameters

start: int: Coordinate in bp for the start of the segment to output
chrom: int: Chromosome that the segments lie on.
segments: list[HaplotypeSegment]: List of the haplotype segments to search for a starting point.

Returns

mid: int: Index of the first genetic segment to collect for output.
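The search can be sketched with the standard library's bisect module. The `Segment` tuple is a hypothetical stand-in for HaplotypeSegment, and the sketch assumes the segments are sorted by chromosome and then by end coordinate; the real function may handle edge cases differently.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical stand-in for HaplotypeSegment
Segment = namedtuple("Segment", ["chrom", "end_coord"])


def first_segment_after(start, chrom, segments):
    """Index of the first segment on chrom whose end coordinate is > start.

    Assumes segments are sorted by (chrom, end_coord).
    Returns None when no segment qualifies.
    """
    # Building the key list is O(n); it is done here only for clarity
    keys = [(seg.chrom, seg.end_coord) for seg in segments]
    idx = bisect_right(keys, (chrom, start))
    if idx < len(segments) and segments[idx].chrom == chrom:
        return idx
    return None


segments = [
    Segment(1, 100), Segment(1, 250), Segment(1, 400),
    Segment(2, 150), Segment(2, 300),
]
idx_same_chrom = first_segment_after(250, 1, segments)
idx_next_chrom = first_segment_after(0, 2, segments)
idx_missing = first_segment_after(500, 1, segments)
```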

haptools.sim_genotype.validate_params(model, mapdir, chroms, popsize, invcf, sample_info, region=None, only_bp=False)

haptools.sim_genotype.write_breakpoints(samples, pop_dict, breakpoints, out, log)

Write out a subsample of breakpoints to out determined by samples.

Parameters

samples: int: Number of samples to output
pop_dict: dict(int->str): Maps population codes in integers to their names. ex: {1:CEU, 2:YRI}
breakpoints: list[list[HaplotypeSegment]]: Each list is a person containing a variable number of Haplotype Segments based on how many recombination events occurred throughout the generations of ancestors for this person.
out: str: output prefix used to output the breakpoint file
log: log object: Outputs messages to the appropriate channel.

Returns

breakpoints: list[list[HaplotypeSegment]]: Subsampled breakpoints containing only the requested number of samples

haptools.sim_phenotype module

class haptools.sim_phenotype.Haplotype(chrom, start, end, id, beta)

Bases: Haplotype

A haplotype with sufficient fields for simphenotype

Properties and functions are shared with the base Haplotype object, “HaplotypeBase”

beta: float

class haptools.sim_phenotype.PhenoSimulator(genotypes, output=PosixPath('/dev/stdout'), seed=None, log=None)

Bases: object

Simulate phenotypes from genotypes

Examples

>>> gens = Genotypes.load("tests/data/example.vcf.gz")
>>> haps = Haplotypes.load("tests/data/basic.hap")
>>> haps_gts = haps.transform(gens)
>>> phenosim = PhenoSimulator(haps_gts)
>>> phenosim.run(haps.data.values())
>>> phenotypes = phenosim.phens

Attributes

gens: Genotypes: Genotypes to simulate
phens: Phenotypes: Simulated phenotypes; filled by run()
rng: np.random.Generator, optional: A numpy random number generator
log: logging.Logger: A logging instance for recording debug statements

run(effects, heritability=None, prevalence=None, normalize=True)

Simulate phenotypes for an entry in the Genotypes object

The generated phenotypes will also be added to output

Parameters

effects: list[Haplotype]

A list of Haplotypes to use in an additive fashion within the simulations

heritability: float, optional

The simulated heritability of the trait

If not provided, this will default to the sum of the squared effect sizes

prevalence: float, optional

How common should the disease be within the population?

If this value is specified, case/control phenotypes will be generated instead of quantitative traits.

normalize: bool, optional

If True, normalize the genotypes before using them to simulate the phenotypes. Otherwise, use the raw values.

Returns

npt.NDArray: The simulated phenotypes, as a np array of shape num_samples x 1
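How the run() parameters interact can be sketched with a generic additive model in NumPy. This is an illustration under stated assumptions, not the PhenoSimulator code: the function name, the noise-variance formula, and the quantile-based case/control threshold are choices made for this example.

```python
import numpy as np


def simulate_phenotypes(gts, betas, heritability=None, prevalence=None,
                        normalize=True, seed=None):
    """Hypothetical sketch of an additive phenotype model.

    gts: (samples, haplotypes) array of allele counts
    betas: one effect size per haplotype
    """
    rng = np.random.default_rng(seed)
    gts = np.asarray(gts, dtype=float)
    betas = np.asarray(betas, dtype=float)
    if normalize:
        std = gts.std(axis=0)
        std[std == 0] = 1.0  # avoid dividing by zero for constant columns
        gts = (gts - gts.mean(axis=0)) / std
    genetic = gts @ betas
    if heritability is None:
        # Documented default: the sum of the squared effect sizes
        heritability = float(np.sum(betas ** 2))
    if not 0 < heritability <= 1:
        raise ValueError("heritability must be in (0, 1]")
    # Assumption: scale the noise so var(genetic) / var(total) ~= heritability
    noise_var = genetic.var() * (1 / heritability - 1)
    phen = genetic + rng.normal(0.0, np.sqrt(noise_var), size=genetic.shape)
    if prevalence is not None:
        # Assumption: samples above the (1 - prevalence) quantile become cases
        threshold = np.quantile(phen, 1 - prevalence)
        phen = (phen >= threshold).astype(float)
    return phen[:, np.newaxis]  # shape: num_samples x 1


example_rng = np.random.default_rng(0)
gts = example_rng.integers(0, 3, size=(200, 2))
quantitative = simulate_phenotypes(gts, [0.3, -0.2], seed=1)
case_control = simulate_phenotypes(gts, [0.3, -0.2], prevalence=0.25, seed=1)
```

With a prevalence given, the output contains only 0s and 1s; otherwise it is a quantitative trait.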

write(): Write the generated phenotypes to the file specified in __init__()

haptools.sim_phenotype.simulate_pt(genotypes, haplotypes, num_replications=1, heritability=None, prevalence=None, normalize=True, region=None, samples=None, haplotype_ids=None, chunk_size=None, seed=None, output=PosixPath('-'), log=None)

Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.

GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec

Note: GENOTYPES must be the output from the transform subcommand.

Parameters

genotypes: Path

The path to the transformed genotypes in VCF or PGEN format

haplotypes: Path

The path to the haplotypes in a .hap file

num_replications: int, optional

The number of rounds of simulation to perform

heritability: float, optional

The heritability of the simulated trait; must be a float between 0 and 1

If not provided, it will be computed from the sum of the squared effect sizes

prevalence: float, optional

The prevalence of the disease if the trait should be simulated as case/control; must be a float between 0 and 1

If not provided, a quantitative trait will be simulated, instead

normalize: bool, optional

If True, normalize the genotypes before using them to simulate the phenotypes. Otherwise, use the raw values.

region: str, optional

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the VCF and .hap file must be indexed and the seqname must match!

Defaults to loading all haplotypes

samples: tuple[str], optional

A subset of the samples from which to extract genotypes

Defaults to loading genotypes from all samples

samples_file: Path, optional

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

haplotype_ids: set[str], optional

A set of haplotype IDs to obtain from the .hap file. All others are ignored.

If not provided, all haplotypes will be used.

chunk_size: int, optional

The max number of variants to fetch from the PGEN file at any given time

If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.

output: Path, optional

The location to which to write the simulated phenotypes

log: logging.Logger, optional

The logging module for this task
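The region strings accepted above ('chr1:1234-34566' or 'chr7') can be split into their parts with a short helper. This sketch is not taken from haptools: `parse_region` is a hypothetical name, and the "chr", "start", and "end" keys simply follow the region dictionary described for the sim_genotype module.

```python
import re


def parse_region(region):
    """Hypothetical helper: split 'chr1:1234-34566' or 'chr7' into parts."""
    match = re.fullmatch(r"([^:]+)(?::(\d+)-(\d+))?", region)
    if match is None:
        raise ValueError(f"unrecognized region: {region!r}")
    chrom, start, end = match.groups()
    if start is None:
        # Only a sequence name was given, so the whole sequence is meant
        return {"chr": chrom, "start": None, "end": None}
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"start is after end in region: {region!r}")
    return {"chr": chrom, "start": start, "end": end}


interval = parse_region("chr1:1234-34566")
whole_chrom = parse_region("chr7")
```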

Examples

>>> haptools simphenotype tests/data/example.vcf.gz tests/data/example.hap.gz > simu_phens.tsv

haptools.karyogram module

This script is inspired by Alicia Martin’s karyogram code originally published here: https://github.com/armartin/ancestry_pipeline/blob/master/plot_karyogram.py

haptools.karyogram.GetCentromereClipMask(centromeres_file, chrom_order)

Get clipping mask for the centromeres and telomeres

Parameters

centromeres_file: str, optional: Path to file with centromere coordinates. Format: chrom, chromstart_cm, centromere_cm, chromend_cm. If None, no centromere and telomere locations are shown
chrom_order: list[int]: chromosomes in sorted order

Returns

clipmask_perchrom: dict[str, matplotlib.Path]: Clip region for telomeres/centromeres for each chromosome

haptools.karyogram.GetChrom(chrom)

Extract a numerical chromosome

Parameters

chrom: str: Chromosome string

Returns

chrom: int: Integer value for the chromosome. X gets set to 23
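A minimal sketch of this kind of conversion follows. The function name `get_chrom` and the handling of an optional "chr" prefix are assumptions for illustration; only the mapping of X to 23 comes from the description above.

```python
def get_chrom(chrom):
    """Hypothetical re-implementation: 'chr7' or '7' -> 7, 'X' -> 23."""
    # Assumption: an optional 'chr' prefix may be present and is dropped
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.upper() == "X":
        return 23
    return int(name)


autosome = get_chrom("chr7")
x_chrom = get_chrom("X")
```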

haptools.karyogram.GetChromOrder(sample_blocks)

Get a list of chroms in sorted order

Parameters

sample_blocks: list[list[hap_blocks]]: each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’

Returns

chroms: list[int]: list of chromosomes in sorted order

haptools.karyogram.GetCmRange(sample_blocks)

Get the min and max cM coordinates from the sample_blocks

Parameters

sample_blocks: list[list[hap_blocks]]: each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’

Returns

min_val, max_val: float, float: min_val is the minimum coordinate and max_val is the maximum coordinate

haptools.karyogram.GetHaplotypeBlocks(bp_file, sample_name, centromeres_file=None)

Extract haplotype blocks for the desired sample from the bp file

Parameters

bp_file: str: Path to .bp file with breakpoints
sample_name: str: Sample ID to extract
centromeres_file: str, optional: If not None then use the chromosome ends listed to extend chromosomes to proper end coordinates

Returns

sample_blocks: list[list[hap_blocks]]: each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’

haptools.karyogram.GetPopList(sample_blocks)

Get a list of populations in the sample_blocks

Parameters

sample_blocks: list[list[hap_blocks]]: each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’

Returns

poplist: list[str]: list of populations represented in the blocks
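The sample_blocks structure shared by GetChromOrder, GetCmRange, and GetPopList is a list of haplotype copies, each holding a list of block dictionaries. The sketch below shows how such summaries can be derived from it. The helper names and example values are hypothetical, and the real functions may order or filter their results differently.

```python
def chrom_order(sample_blocks):
    """Sorted list of the chromosomes that appear in any block."""
    return sorted({block["chrom"] for hap in sample_blocks for block in hap})


def cm_range(sample_blocks):
    """Smallest start and largest end coordinate across all blocks.

    Assumes start <= end within each block.
    """
    blocks = [block for hap in sample_blocks for block in hap]
    return min(b["start"] for b in blocks), max(b["end"] for b in blocks)


def pop_list(sample_blocks):
    """Populations in order of first appearance."""
    pops = []
    for hap in sample_blocks:
        for block in hap:
            if block["pop"] not in pops:
                pops.append(block["pop"])
    return pops


# Made-up blocks for illustration only
sample_blocks = [
    [  # first haplotype copy
        {"pop": "YRI", "chrom": 1, "start": 0.0, "end": 4.3},
        {"pop": "CEU", "chrom": 1, "start": 4.3, "end": 9.1},
    ],
    [  # second haplotype copy
        {"pop": "CEU", "chrom": 2, "start": 0.5, "end": 12.0},
    ],
]
chroms = chrom_order(sample_blocks)
min_val, max_val = cm_range(sample_blocks)
pops = pop_list(sample_blocks)
```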

haptools.karyogram.PlotHaplotypeBlock(block, hapnum, chrom_order, colors, ax, clipmask_perchrom=None)

Plot a haplotype block on the axis

Parameters

block: dict: dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
hapnum: int: 0 or 1 for the two haplotypes
chrom_order: list[int]: chromosomes in sorted order
colors: dict[str, str], optional: Dictionary of colors to use for each population. If not set, reasonable defaults are used. In addition to strings, you can specify RGB or RGBA tuples.
ax: matplotlib axis to use for plotting
clipmask_perchrom: dict[str, matplotlib.Path], optional: Clip region for telomeres/centromeres for each chromosome

haptools.karyogram.PlotKaryogram(bp_file, sample_name, out_file, log, centromeres_file=None, title=None, colors=None)

Plot a karyogram based on breakpoints output by haptools simgenotypes

Parameters

bp_file: str: Path to .bp file with breakpoints
sample_name: str: Sample ID to plot
out_file: str: Name of output file
log: log object: Outputs messages to the appropriate channel.
centromeres_file: str, optional: Path to file with centromere coordinates. Format: chrom, chromstart_cm, centromere_cm, chromend_cm. If None, no centromere and telomere locations are shown
title: str, optional: Plot title. If None, no title is annotated
colors: dict(str->str), optional: Dictionary of colors to use for each population. If not set, reasonable defaults are used. In addition to strings, you can specify RGB or RGBA tuples.

haptools.transform module

class haptools.transform.GenotypesAncestry(fname, log=None)

Bases: GenotypesRefAlt

Extends the GenotypesRefAlt class for ancestry data

The ancestry information is stored within the FORMAT field of the VCF

Attributes

data: np.array: See documentation for data
fname: Path | str: See documentation for fname
samples: tuple[str]: See documentation for samples
variants: np.array: See documentation for variants
valid_labels: np.array: Reference VCF sample and respective variant grabbed for each sample.
ancestry: np.array: The ancestral population of each allele in each sample of data
log: Logger: See documentation for log

check_biallelic(discard_also=False): See documentation for check_biallelic()

check_missing(discard_also=False): See documentation for check_missing()

read(region=None, samples=None, variants=None, max_variants=None): See documentation for read()

subset(samples=None, variants=None, inplace=False): See documentation for subset()

write(chroms=None): Write the variants in this class to a VCF at fname

class haptools.transform.HaplotypeAncestry(chrom, start, end, id, ancestry)

Bases: Haplotype

A haplotype with an ancestry field for the transform subcommand

Properties and functions are shared with the base “Haplotype” object

ancestry: str

transform(genotypes)

Transform a genotypes matrix via the current haplotype and its ancestral population

See documentation for transform() for more details

Return type: ndarray[Any, dtype[bool]]

class haptools.transform.HaplotypesAncestry(fname, haplotype=<class 'haptools.transform.HaplotypeAncestry'>, variant=<class 'haptools.data.haplotypes.Variant'>, log=None)

Bases: Haplotypes

A set of haplotypes with an ancestry field for the transform subcommand

Properties and functions are shared with the base “Haplotypes” object

transform(gts, hap_gts=None)

Transform a genotypes matrix via the current haplotype

Each entry in the returned matrix denotes the presence of each haplotype in each chromosome of each sample in the Genotypes object

Parameters

gts: GenotypesRefAlt: The genotypes to transform using the current haplotype
hap_gts: GenotypesRefAlt: An empty GenotypesRefAlt object into which the haplotype genotypes should be stored

Returns

GenotypesRefAlt: A Genotypes object composed of haplotypes instead of regular variants.

haptools.transform.transform_haps(genotypes, haplotypes, region=None, samples=None, haplotype_ids=None, chunk_size=None, discard_missing=False, ancestry=False, output=PosixPath('-'), log=None)

Creates a VCF composed of haplotypes

Parameters

genotypes: Path

The path to the genotypes in VCF or PGEN format

haplotypes: Path

The path to the haplotypes in a .hap file

region: str, optional

See documentation for Genotypes.read() and Haplotypes.read()

samples: list[str], optional

See documentation for read()

haplotype_ids: set[str], optional

A set of haplotype IDs to obtain from the .hap file. All others are ignored.

If not provided, all haplotypes will be used.

chunk_size: int, optional

The max number of variants to fetch from the PGEN file at any given time

If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.

discard_missing: bool, optional

Discard any samples that are missing any of the required genotypes

The default is simply to complain about it

ancestry: bool, optional

Whether to also match ancestral population labels from the VCF against those in the .hap file

output: Path, optional

The location to which to write output

log: Logger, optional

A logging module to which to write messages about progress and any errors

haptools.ld module

class haptools.ld.Haplotype(chrom, start, end, id, ld)

Bases: Haplotype

A haplotype with sufficient fields for the ld command

Properties and functions are shared with the base Haplotype object, HaplotypeBase

ld: float

haptools.ld.calc_ld(target, genotypes, haplotypes, region=None, samples=None, ids=None, chunk_size=None, discard_missing=False, from_gts=False, output=PosixPath('/dev/stdout'), log=None)

Computes the pair-wise LD (Pearson’s correlation) between haplotypes (or variants) and a single target haplotype (or variant)

Parameters

target: str

The ID of the haplotype or variant with which we will calculate LD

genotypes: Path

The path to the genotypes

haplotypes: Path

The path to the haplotypes in a .hap file

region: str, optional

See documentation for Genotypes.read() and Haplotypes.read()

samples: list[str], optional

See documentation for read()

ids: set[str], optional

A subset of haplotype IDs to obtain from the .hap file. All others are ignored.

Alternatively, if the --from-gts switch is specified, this will be interpreted as a subset of variant IDs to obtain from the genotypes file.

Defaults to loading all haplotypes or variants if not specified

chunk_size: int, optional

The max number of variants to fetch from the PGEN file at any given time

If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.

discard_missing: bool, optional

Discard any samples that are missing any of the required genotypes

The default is simply to complain about it

output: Path, optional

The location to which to write output

log: Logger, optional

A logging module to which to write messages about progress and any errors

haptools.ld.pearson_corr_ld(arrA, arrB)

Compute the Pearson correlation coefficient between two vectors (1D numpy arrays)

Return type

float

Parameters

arrA: npt.NDArray: The first 1D numpy array
arrB: npt.NDArray: The second 1D numpy array

Returns

The LD between the genotypes in arrA and the genotypes in arrB
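For reference, the Pearson correlation coefficient between two 1D arrays can be computed as below. This is a sketch of the standard formula, not a copy of pearson_corr_ld, and the name `pearson_r` is hypothetical. The result is undefined when either array is constant, since the denominator is then zero.

```python
import numpy as np


def pearson_r(arr_a, arr_b):
    """Pearson correlation between two 1D arrays of equal length."""
    a = np.asarray(arr_a, dtype=float)
    b = np.asarray(arr_b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt(np.sum(a_centered ** 2) * np.sum(b_centered ** 2))
    return float(np.sum(a_centered * b_centered) / denom)


# Genotype dosages for two variants across five samples (made-up values)
a = np.array([0, 1, 2, 1, 0])
b = np.array([0, 2, 4, 2, 0])
c = np.array([1, 0, 3, 2, 5])
r_ab = pearson_r(a, b)
r_ac = pearson_r(a, c)
```

Since `b` is an exact multiple of `a`, the two are perfectly correlated; `r_ac` can be cross-checked against `np.corrcoef(a, c)[0, 1]`.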

haptools.index module

haptools.index.append_suffix(path, suffix)

Used as a helper method for index_haps. Appends a given suffix to a Path instance.

Parameters

path: Path: The path to a file
suffix: str: A string to append to the end of the given Path. For example, “.gz” or “.gz.tbi”
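One way to append a suffix to a Path without losing its existing extension is shown below. This is a sketch: the function name `append_suffix_sketch` is hypothetical and the haptools helper may be written differently.

```python
from pathlib import Path


def append_suffix_sketch(path, suffix):
    """Add suffix after whatever suffix the path already has."""
    # with_suffix() replaces the final suffix, so the old one is re-attached first
    return path.with_suffix(path.suffix + suffix)


gz = append_suffix_sketch(Path("tests/data/basic.hap"), ".gz")
tbi = append_suffix_sketch(Path("tests/data/basic.hap"), ".gz.tbi")
```

Calling `Path.with_suffix(".gz")` directly would instead replace ".hap", which is why the existing suffix is carried over.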

haptools.index.index_haps(haplotypes, sort=False, output=None, log=None)

Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file

Parameters

haplotypes: Path: The path to the haplotypes in a .hap file
output: Path, optional: The location to which to write output. If an output location is not specified, the output will have the same name as the input file.
log: Logger, optional: A logging module to which to write messages about progress and any errors