Documentation

Command line interface

haptools

haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information

haptools [OPTIONS] COMMAND [ARGS]...

Options

--version: Show the version and exit.

clump

Performs clumping on datasets with SNPs, SNPs and STRs, and STRs. Clumping is the process of identifying SNPs or STRs that are highly correlated with one another and concatenating them all together into a single “clump” in order to not repeat the same effect size due to LD.

haptools clump [OPTIONS]

Options

--summstats-snps <summstats_snps>: File to load snps summary statistics

--summstats-strs <summstats_strs>: File to load strs summary statistics

--gts-snps <gts_snps>: SNP genotypes (VCF or PGEN)

--gts-strs <gts_strs>: STR genotypes (VCF)

--clump-p1 <clump_p1>: Max pval to start a new clump

--clump-p2 <clump_p2>: Filter for pvalue less than

--clump-id-field <clump_id_field>: Column header of the variant ID

--clump-field <clump_field>: Column header of the p-values

--clump-chrom-field <clump_chrom_field>: Column header of the chromosome

--clump-pos-field <clump_pos_field>: Column header of the position

--clump-kb <clump_kb>: clump kb radius

--clump-r2 <clump_r2>: r^2 threshold

--ld <ld>

Calculation type to infer LD, Exact Solution or Pearson R. (Exact|Pearson). Note the Exact Solution works best when all three genotypes are present (0,1,2) in the variants being compared.

Default:: 'Pearson'
Options:: Exact | Pearson

--out <out>: Required Output filename

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

index

Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file

haptools index [OPTIONS] HAPLOTYPES

Options

--sort, --no-sort

Sorting of the file will not be performed

Default:: True

-o, --output <output>

A .hap file containing sorted and indexed haplotypes and variants

Default:: 'input file'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

HAPLOTYPES: Required argument

karyogram

Visualize a karyogram of local ancestry tracks

haptools karyogram [OPTIONS]

Options

--bp <bp>: Required Path to .bp file with breakpoints

--sample <sample>: Required Sample ID to plot

--out <out>: Required Name of output file

--title <title>: Optional plot title

--centromeres <centromeres>: Optional file with telomere/centromere cM positions

--colors <colors>: Optional color dictionary. Input can be from the matplotlib list of colors or in hexcode. Format is e.g. ‘YRI:blue,CEU:green’

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

ld

Compute the pair-wise LD (Pearson’s correlation) between haplotypes (or variants) and a single TARGET haplotype (or variant)

GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec

TARGET refers to the ID of a variant or haplotype. LD is computed pair-wise between TARGET and all of the other haplotypes in the .hap (or genotype) file

If TARGET is a variant ID, the ID must appear in GENOTYPES. Otherwise, it must be present in the .hap file

haptools ld [OPTIONS] TARGET GENOTYPES HAPLOTYPES

Options

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files

Default:: 'all haplotypes'

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default:: 'all samples'

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default:: 'all samples'

-i, --id <ids>

A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’). Or, if –from-gts, a list of the variant IDs to use from the genotypes file. For this to work, the .hap file must be indexed

Default:: 'all haplotypes'

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype (or variant) IDs (one per line) to subset from the .hap (or genotype) file

Default:: 'all haplotypes'

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default:: 'all variants'

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default:: False

--from-gts

By default, LD is computed with the haplotypes in the .hap file. Use this switch to compute LD with the genotypes in the genotypes file, instead.

Default:: False

-o, --output <output>

A .hap file containing haplotypes and their LD with TARGET

Default:: 'stdout'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

TARGET: Required argument

GENOTYPES: Required argument

HAPLOTYPES: Required argument

simgenotype

Simulate admixed genomes under a pre-defined model.

haptools simgenotype [OPTIONS]

Options

--model <model>: Required Admixture model in .dat format. See File Formats under simgenotype in the docs for complete info.

--mapdir <mapdir>: Required Directory containing files with chr{1-22,X} and ending in .map in the file name with genetic map coords.

--out <out>: Required Path to desired output file. E.g. /path/to/output.vcf.gz Possible outputs are vcf|bcf|vcf.gz|pgen and there will be an additional breakpoints output with extension bp e.g. /path/to/output.bp.

--chroms <chroms>: Sorted and comma delimited list of chromosomes to simulate

--seed <seed>: Random seed. Set to make simulations reproducible

--ref_vcf <ref_vcf>: Required VCF or PGEN file used as reference for creation of simulated samples respective genotypes.

--sample_info <sample_info>: Required File that maps samples from the reference VCF (–invcf) to population codes describing the populations in the header of the model file.

--region <region>: Subset the simulation to a specific region in a chromosome using the form chrom:start-end. Example 2:1000-2000

--pop_field: Flag for outputting the population field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.

--sample_field: Flag for outputting the sample field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.

--no_replacement: Flag used to determine whether to sample reference haplotypes with or without replacement. (Default = Replacement)

--only_breakpoint: Flag used to determine whether to only output breakpoints or continue to simulate a vcf file.

-c, --chunk-size <chunk_size>

If requesting a PGEN output file, write genotypes in chunks of X variants; reduces memory

Default:: 'all variants'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

simphenotype

Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.

GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec

Note: GENOTYPES must be the output from the transform subcommand.

haptools simphenotype [OPTIONS] GENOTYPES HAPLOTYPES

Options

-r, --replications <replications>

Number of rounds of simulation to perform

Default:: 1

--environment <environment>: Variance of environmental term; inferred if not specified

-h, --heritability <heritability>

Trait heritability

Default:: '0.5'

-p, --prevalence <prevalence>

Disease prevalence if simulating a case-control trait

Default:: 'quantitative trait'

--normalize, --no-normalize

Whether to normalize the genotypes before using them for simulation

Default:: True

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files

Default:: 'all haplotypes'

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default:: 'all samples'

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default:: 'all samples'

-i, --id <ids>

A list of the haplotype IDs from the .hap file to use as causal variables (ex: ‘-i H1 -i H2’).

Default:: 'all haplotypes'

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file

Default:: 'all haplotypes'

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default:: 'all variants'

--repeats <repeats>: Path to a genotypes file containing tandem repeats. This is only necessary when simulating both haplotypes and repeats as causal effects

--seed <seed>

Use this option across executions to make the output reproducible

Default:: 'chosen randomly'

-o, --output <output>

A TSV file containing simulated phenotypes

Default:: 'stdout'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES: Required argument

HAPLOTYPES: Required argument

transform

Creates a VCF composed of haplotypes

GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec

haptools transform [OPTIONS] GENOTYPES HAPLOTYPES

Options

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files

Default:: 'all haplotypes'

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default:: 'all samples'

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default:: 'all samples'

-i, --id <ids>

A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’).

Default:: 'all haplotypes'

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file

Default:: 'all haplotypes'

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default:: 'all variants'

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default:: False

--ancestry

Also transform using VCF ‘POP’ FORMAT field and ‘ancestry’ .hap extra field

Default:: False

--maf <maf>

Do not output haplotypes with an MAF below this value

Default:: 'all haplotypes'

-o, --output <output>

A VCF file containing haplotype ‘genotypes’

Default:: 'stdout'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES: Required argument

HAPLOTYPES: Required argument

Module contents

haptools.data.data module

class haptools.data.data.Data(fname, log=None)

Bases: ABC

Abstract class for accessing read-only data files

Attributes:

fnamePath | str: The path to the read-only file containing the data
datanp.array: The contents of the data file, once loaded
log: Logger: A logging instance for recording debug statements.

static hook_compressed(filename, mode)

A utility to help open files regardless of their compression

Based off of python’s fileinput.hook_compressed and copied from https://stackoverflow.com/a/64106815/16815703

Return type:

IO

Parameters:

filenamestr: The path to the file
modestr: Either ‘r’ for read or ‘w’ for write

Returns:

IO: The resolved file object

abstract classmethod load(fname)

Read the file contents and perform any recommended pre-processing

Parameters:

fnamePath: See documentation for fname

abstract read(): Read the raw file contents into the class properties

unset()

Whether the data has been loaded into the object yet

Return type:

bool

Returns:

bool: True if data is None else False

haptools.data.genotypes module

class haptools.data.genotypes.Genotypes(fname, log=None)

Bases: Data

A class for processing genotypes from a file

Attributes:

datanpt.NDArray

The genotypes in an n (samples) x p (variants) x 2 (strands) array

fnamePath | str

The path to the read-only file containing the data

samplestuple[str]

The names of each of the n samples

variantsnp.array

Variant-level meta information:

ID
CHROM
POS

log: Logger

A logging instance for recording debug statements

_prephasedbool

If True, assume that the genotypes are phased. Otherwise, extract their phase when reading from the VCF.

_samp_idxdict[str, int]

Sample index; maps samples to indices in self.samples

_var_idxdict[str, int]

Variant index; maps variant IDs to indices in self.variants

Examples

>>> genotypes = Genotypes.load('tests/data/simple.vcf')
>>> # directly access the loaded variants, samples, and genotypes (in data)
>>> genotypes.variants
>>> genotypes.samples
>>> genotypes.data

__iter__(region=None, samples=None, variants=None)

Read genotypes from a VCF line by line without storing anything

Return type:

Iterator[namedtuple]

Parameters:

regionstr, optional: See documentation for read()
samplesset[str], optional: See documentation for read()
variantsset[str], optional: See documentation for read()

Returns:

Iterator[namedtuple]: See documentation for _iterate()

check_biallelic(discard_also=False)

Check that each genotype is composed of only two alleles

This function modifies the dtype of data from uint8 to bool

Parameters:

discard_alsobool, optional: If True, discard any multiallelic variants without raising a ValueError

Raises:

ValueError: If any of the genotypes have more than two alleles

check_maf(threshold=None, discard_also=False, warn_only=False)

Check the minor allele frequency of each variant

Raise a ValueError if any variant’s MAF doesn’t satisfy the threshold, if one is provided :rtype: ndarray[Any, dtype[float64]]

Note

You should call check_missing() and check_biallelic() before executing this method, for best results. Otherwise, the frequencies may be computed incorrectly.

Parameters:

threshold: float, optional

If a variant has a minor allele frequency (MAF) rarer than this threshold, raise a ValueError

discard_alsobool, optional

If True, discard any variants that would otherwise cause a ValueError

This parameter will be ignored if a threshold is not specified

warn_only: bool, optional

Just raise a warning instead of a ValueError

Returns:

The minor allele frequency of each variant

Raises:

ValueError: If any variant does not meet the provided threshold minor allele frequency

check_missing(discard_also=False)

Check that each sample is properly genotyped

Parameters:

discard_alsobool, optional: If True, discard any samples that are missing genotypes without raising a ValueError

Raises:

ValueError: If any of the samples have missing genotypes ‘GT: .|.’

check_phase()

Check that the genotypes are phased then remove the phasing info from the data

This function modifies data in-place

Raises:

ValueError: If any heterozgyous genotpyes are unphased

check_sorted()

Check that the variant coordinates are sorted

Raise a ValueError if any of variants in any of the chromosomes are unsorted

Raises:

ValueError: If any variant position is less than a position that comes before it within the same chromosome

index(samples=True, variants=True)

Call this function once to improve the amortized time-complexity of look-ups of samples and variants by their ID. This is useful if you intend to later subset by a set of samples or variant IDs. The time complexity of this function should be roughly O(n+p) if both parameters are True. Otherwise, it will be either O(n) or O(p).

Parameters:

samples: bool, optional: Whether to index the samples for fast loop-up. Adds complexity O(n).
variants: bool, optional: Whether to index the variants for fast look-up. Adds complexity O(p).

Raises:

ValueError: If any samples or variants appear more than once

classmethod load(fname, region=None, samples=None, variants=None)

Load genotypes from a VCF file

Read the file contents, check the genotype phase, and create the MAC matrix

Return type:

Genotypes

Parameters:

fname: See documentation for fname
regionstr, optional: See documentation for read()
samplesset[str], optional: See documentation for read()
variantsset[str], optional: See documentation for read()

Returns:

Genotypes: A Genotypes object with the data loaded into its properties

classmethod merge_variants(objs, check_samples=True, **kwargs)

Merge genotypes objects with different sets of variants together :rtype: Genotypes

Note

The input genotypes objects are not expected to have any overlapping sets of variants. Also, all samples in the input genotypes must be the same.

Parameters:

objs: tuple[Genotypes]: The objects that should be merged together
check_samples: bool, optional: Whether to check that the set of provided samples is exactly the same for all genotypes. This can take a while so you may want to avoid it
**kwargs: Any parameters to pass to _init__()

Returns:

Genotypes: A new object containing merged versions of the properties in each object

Raises:

ValueError: If the set of samples in each input object is not the same

read(region=None, samples=None, variants=None, max_variants=None)

Read genotypes from a VCF into a numpy matrix stored in data

Parameters:

regionstr, optional

The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the VCF must be indexed and the seqname must match!

Defaults to loading all genotypes

samplesset[str], optional

A subset of the samples from which to extract genotypes

Note that they are loaded in the same order as in the file

Defaults to loading genotypes from all samples

variantsset[str], optional

A set of variant IDs for which to extract genotypes

All other variants will be ignored. This may be useful if you’re running out of memory.

max_variantsint, optional

The maximum mumber of variants to load from the file. Setting this value helps preallocate the arrays, making the process faster and less memory intensive. You should use this option if your processes are frequently “Killed” from memory overuse.

If you don’t know how many variants there are, set this to a large number greater than what you would except. The np array will be resized appropriately. You can also use the bcftools “counts” plugin to obtain the number of expected sites within a region.

Note that this value is ignored if the variants argument is provided.

Raises:

ValueError: If the genotypes array is empty

subset(samples=None, variants=None, inplace=False)

Subset these genotypes to a smaller set of samples or a smaller set of variants

The order of the samples and variants in the subsetted instance will match the order in the provided tuple parameters.

Parameters:

samples: tuple[str]: A subset of samples to keep
variants: tuple[str]: A subset of variant IDs to keep
inplace: bool, optional: If False, return a new Genotypes object; otherwise, alter the current one

Returns:

A new Genotypes object if inplace is set to False, else returns None

class haptools.data.genotypes.GenotypesPLINK(fname, log=None, chunk_size=None)

Bases: GenotypesVCF

A class for processing genotypes from a PLINK .pgen file

Attributes:

datanpt.NDArray

See documentation for data

samplestuple[str]

See documentation for samples

variantsnp.array

See documentation for variants

log: Logger

See documentation for log

chunk_size: int, optional

The max number of variants to fetch from and write to the PGEN file at any given time

If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory

Examples

>>> genotypes = GenotypesPLINK.load('tests/data/simple.pgen')

__iter__(region=None, samples=None, variants=None)

Read genotypes from a PGEN line by line without storing anything

Return type:

Iterator[namedtuple]

Parameters:

regionstr, optional: See documentation for read()
samplesset[str], optional: See documentation for read()
variantsset[str], optional: See documentation for read()

Returns:

Iterator[namedtuple]: See documentation for _iterate()

read(region=None, samples=None, variants=None, max_variants=None)

Read genotypes from a PGEN file into a numpy matrix stored in data

Parameters:

regionstr, optional: See documentation for read
samplesset[str], optional: See documentation for read
variantsset[str], optional: See documentation for read
max_variantsint, optional: See documentation for read

read_samples(samples=None)

Read sample IDs from a PSAM file into a list stored in samples

This method is called automatically by read()

Parameters:

samplesset[str], optional: See documentation for read

Returns:

npt.NDArray[np.uint32]: The indices of each of the samples within the PSAM file

read_variants(region=None, variants=None, max_variants=None)

Read variants from a PVAR file into a numpy array stored in variants

One of either variants or max_variants MUST be specified!

This method is called automatically by read()

Parameters:

regionstr, optional: See documentation for read
variantsset[str], optional: See documentation for read
max_variantsint, optional: See documentation for read

Returns:

npt.NDArray[np.uint32]: The indices of each of the variants within the PVAR file

write(): Write the variants in this class to PLINK2 files at fname

write_samples()

Write sample IDs to a PSAM file from a list stored in samples

This method is called automatically by write()

write_variants()

Write variant IDs to a PVAR file from the numpy array stored in variants

This method is called automatically by write()

class haptools.data.genotypes.GenotypesPLINKTR(fname, log=None, chunk_size=None, vcftype='auto')

Bases: GenotypesPLINK

A class for processing repeat genotypes from a PLINK .pgen file

Attributes:

datanpt.NDArray: See documentation for data
samplestuple[str]: See documentation for samples
variantsnp.array: See documentation for variants
log: Logger: See documentation for log
chunk_size: int, optional: See documentation for chunk_size
vcftype: str, optional: See documentation for vcftype
Examples
——–
>>> genotypes = GenotypesPLINK.load(‘tests/data/simple.pgen’)

check_biallelic()

Check that each genotype is composed of only two alleles

This function modifies the dtype of data from uint8 to bool

Parameters:

discard_alsobool, optional: If True, discard any multiallelic variants without raising a ValueError

Raises:

ValueError: If any of the genotypes have more than two alleles

check_maf()

Check the minor allele frequency of each variant

Raise a ValueError if any variant’s MAF doesn’t satisfy the threshold, if one is provided

Note

You should call check_missing() and check_biallelic() before executing this method, for best results. Otherwise, the frequencies may be computed incorrectly.

Parameters:

threshold: float, optional

If a variant has a minor allele frequency (MAF) rarer than this threshold, raise a ValueError

discard_alsobool, optional

If True, discard any variants that would otherwise cause a ValueError

This parameter will be ignored if a threshold is not specified

warn_only: bool, optional

Just raise a warning instead of a ValueError

Returns:

The minor allele frequency of each variant

Raises:

ValueError: If any variant does not meet the provided threshold minor allele frequency

classmethod load(fname, region=None, samples=None, variants=None, vcftype='auto')

Load STR genotypes from a VCF file

Read the file contents, check the genotype phase, and create the MAC matrix

Return type:

Genotypes

Parameters:

fname: See documentation for fname
regionstr, optional: See documentation for read()
samplesset[str], optional: See documentation for read()
variantsset[str], optional: See documentation for read()
vcftypestr, optional: See documentation for vcftype()

Returns:

Genotypes: A Genotypes object with the data loaded into its properties

read(region=None, samples=None, variants=None, max_variants=None)

Read genotypes from a PGEN file into a numpy matrix stored in data

Parameters:

regionstr, optional: See documentation for read
samplesset[str], optional: See documentation for read
variantsset[str], optional: See documentation for read
max_variantsint, optional: See documentation for read

write(): Write the variants in this class to PLINK2 files at fname

write_variants()

Write variant IDs to a PVAR file from the numpy array stored in variants

This method is called automatically by write()

class haptools.data.genotypes.GenotypesTR(fname, log=None, vcftype='auto')

Bases: Genotypes

A class for processing TR genotypes from a file Unlike the base Genotypes class, this class genotypes will be repeat number in the variants array

Attributes:

datanpt.NDArray

See documentation for data

fnamePath | str

See documentation for fname

samplestuple[str]

See documentation for samples

variantsnp.array

Variant-level meta information:

ID
CHROM
POS

log: Logger

See documentation for log

vcftype: str

TR vcf type currently being read. Options are {‘auto’, ‘gangstr’, ‘advntr’, ‘hipstr’, ‘eh’, ‘popstr’}

check_biallelic(): See documentation for check_biallelic()

check_maf(): See documentation for check_maf()

classmethod load(fname, region=None, samples=None, variants=None, vcftype='auto')

Load STR genotypes from a VCF file

Read the file contents, check the genotype phase, and create the MAC matrix

Return type:

Genotypes

Parameters:

fname: See documentation for fname
regionstr, optional: See documentation for read()
samplesset[str], optional: See documentation for read()
variantsset[str], optional: See documentation for read()

Returns:

Genotypes: A Genotypes object with the data loaded into its properties

class haptools.data.genotypes.GenotypesVCF(fname, log=None)

Bases: Genotypes

A class for processing genotypes from a file Unlike the base Genotypes class, this class also includes REF and ALT alleles as a list of alleles in the variants array

Attributes:

datanpt.NDArray

See documentation for data

fnamePath | str

See documentation for fname

samplestuple[str]

See documentation for samples

variantsnp.array

Variant-level meta information:

ID
CHROM
POS
[REF, ALT1, ALT2, …]

log: Logger

See documentation for log

write(): Write the variants in this class to a VCF at fname

class haptools.data.genotypes.TRRecordHarmonizerRegion(vcffile, vcfiter, vcftype='auto')

Bases: TRRecordHarmonizer

Parameters:

vcffileVCF
vcftype{‘auto’, ‘gangstr’, ‘advntr’, ‘hipstr’, ‘eh’, ‘popstr’}, optional: Type of the VCF file. Default=’auto’. If vcftype==’auto’, attempts to infer the type.
Attributes
———-
vcffileVCF
vcfiterVCF: Region to grab strs from within the VCF file.
vcftypeenum: Type of the VCF file. Must be included in VcfTypes

haptools.data.phenotypes module

class haptools.data.phenotypes.Phenotypes(fname, log=None)

Bases: Data

A class for processing phenotypes from a file

Attributes:

datanp.array: The phenotypes in an n (samples) x m (phenotypes) array
fnamePath | str: The path to the file containing the data
samplestuple: The names of each of the n samples
namestuple[str]: The names of the phenotypes
log: Logger: A logging instance for recording debug statements.

Examples

>>> phenotypes = Phenotypes.load('tests/data/simple.pheno')

__iter__(samples=None)

Read phenotypes from a pheno line by line without storing anything

Return type:

Iterable[namedtuple]

Parameters:

samplesset[str], optional

A subset of the samples from which to extract phenotypes

Defaults to loading phenotypes from all samples

Returns:

Iterable[namedtuple]: See documentation for _iterate()

append(name, data)

Append a new set of phenotypes to the current set

Parameters:

name: str: The name of the new phenotype
data: npt.NDArray: A 1D np array of the same length as samples, containing the phenotype values for each sample. Must have the same dtype as

check_missing(discard_also=False)

Check that each sample has a phenotype value

Parameters:

discard_alsobool, optional: If True, discard any samples that are missing phenotypes without raising a ValueError

Raises:

ValueError: If any of the samples have missing phenotypes, represented by -9

index(samples=True, names=True)

Call this function once to improve the amortized time-complexity of look-ups of samples and names by their ID. This is useful if you intend to later subset by a set of samples or name IDs. The time complexity of this function should be roughly O(n+p) if both parameters are True. Otherwise, it will be either O(n) or O(p).

Parameters:

samples: bool, optional: Whether to index the samples for fast loop-up. Adds complexity O(n).
names: bool, optional: Whether to index the names for fast look-up. Adds complexity O(p).

Raises:

ValueError: If any samples or names appear more than once

classmethod load(fname, samples=None)

Load phenotypes from a pheno file

Read the file contents and standardize the phenotypes

Return type:

Phenotypes

Parameters:

fname: See documentation for fname
samplesset[str], optional: See documentation for read()

Returns:

phenotypes: A Phenotypes object with the data loaded into its properties

read(samples=None)

Read phenotypes from a pheno file into a numpy matrix stored in data

Parameters:

samplesset[str], optional

A subset of the samples from which to extract phenotypes

Defaults to loading phenotypes from all samples

Raises:

AssertionError: If the provided file doesn’t follow the expected format

standardize()

Standardize phenotypes so they have a mean of 0 and a stdev of 1

This function modifies data in-place

subset(samples=None, names=None, inplace=False)

Subset these phenotypes to a smaller set of samples or a smaller set of names

The order of the samples and names in the subsetted instance will match the order in the provided tuple parameters.

Parameters:

samples: tuple[str]: A subset of samples to keep
names: tuple[str]: A subset of phenotype IDs to keep
inplace: bool, optional: If False, return a new Phenotypes object; otherwise, alter the current one

Returns:

A new Phenotypes object if inplace is set to False, else returns None

write()

Write the phenotypes in this class to a file at fname

Examples

To write to a file, you must first initialize a Phenotypes object and then fill out the names, data, and samples properties: >>> phenotypes = Phenotypes(‘tests/data/simple.pheno’) >>> phenotypes.names = (‘height’,) >>> phenotypes.data = np.array([[1, 1, 2]], dtype=’float64’) >>> phenotypes.samples = (‘HG00096’, ‘HG00097’, ‘HG00099’) >>> phenotypes.write()

haptools.data.covariates module

class haptools.data.covariates.Covariates(fname, log=None)

Bases: Phenotypes

A class for processing covariates from a file

Attributes:

datanp.array: The covariates in an n (samples) x m (covariates) array
fnamePath | str: The path to the read-only file containing the data
samplestuple[str]: The names of each of the n samples
namestuple[str]: The names of the covariates
log: Logger: A logging instance for recording debug statements.

Examples

>>> covariates = Covariates.load('tests/data/simple.covar')

haptools.data.haplotypes module

class haptools.data.haplotypes.Extra(name, fmt='s', description='')

Bases: object

An extra field on a line in the .hap file

Attributes:

name: str: The name of the extra field
fmt: str = “s”: The python fmt string of the field value; indicates how to format the value
description: str = “”: A description of the extra field

description: str = ''

fmt: str = 's'

property fmt_str: str

Convert an Extra into a fmt string

Returns:

str: A python format string (ex: “{beta:.3f}”)

classmethod from_hap_spec(line)

Convert an “extra” line in the header of a .hap file into an Extra object

Return type:

Extra

Parameters:

line: str: An “extra” field, as it appears declared in the header

Returns:

Extra: An Extra object

name: str

to_hap_spec(line_type_symbol)

Convert an Extra object into a header line in the .hap format spec

Return type:

str

Parameters:

hap_id: str: The ID of the haplotype associated with this variant

Returns:

str: A valid variant line (V) in the .hap format spec

class haptools.data.haplotypes.Haplotype(chrom, start, end, id)

Bases: object

A haplotype within the .hap format spec

In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of extra fields 2) override the _extras property to describe the header declaration

Attributes:

chrom: str: The contig to which this haplotype belongs
start: int: The chromosomal start position of the haplotype
end: int: The chromosomal end position of the haplotype
id: str: The haplotype’s unique ID
variants: tuple[Variant]: The variants in this haplotype
_extras: tuple[Extra]: Extra fields for the haplotype

Examples

Let’s extend this class and add an extra field called “ancestry”

>>> from dataclasses import dataclass, field
>>> @dataclass
>>> class CustomHaplotype(Haplotype):
...     ancestry: str
...     _extras: tuple = field(
...         repr=False,
...         init=False,
...         default = (
...             Extra("ancestry", "s", "Local ancestry"),
...         ),
...     )

property ID: Create an alias for the id property

chrom: str

end: int

classmethod extras_head()

Return the header lines of the extra fields that are supported

Return type:

set

Returns:

tuple: The header lines of the extra fields

classmethod extras_order()

The names of the extra fields in order

Return type:

tuple[str]

Returns:

tuple[str]: The names of the extra fields in the order in which they are stored

classmethod from_hap_spec(line, variants=(), types=None)

Convert a variant line into a Haplotype object in the .hap format spec

Note that this implementation does NOT support having more extra fields than appear in the header

Return type:

Haplotype

Parameters:

line: str: A variant (H) line from the .hap file
variants: tuple[Variant], optional: The Variants in this haplotype
types: dict[str, type], optional: The types of each property in the object

Returns:

Haplotype: The Haplotype object for the variant

id: str

sort(): Sorts the variants within this Haplotype instance

start: int

to_hap_spec()

Convert a Haplotype object into a haplotype line in the .hap format spec

Return type:

str

Returns:

str: A valid haplotype line (H) in the .hap format spec

transform(genotypes)

Transform a genotypes matrix via the current haplotype

Each entry in the returned matrix denotes the presence of the current haplotype in each chromosome of each sample in the Genotypes object

Return type:

ndarray[Any, dtype[TypeVar(ScalarType, bound= generic, covariant=True)]]

Parameters:

genotypesGenotypesVCF

The genotypes which to transform using the current haplotype

If the genotypes have not been loaded into the Genotypes object yet, this method will call Genotypes.read(), while loading only the needed variants

Returns:

npt.NDArray: A 2D matrix of shape (num_samples, 2) where each entry in the matrix is a bool denoting the presence of the haplotype in one chromosome of a sample

types = {'chrom': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}

property varIDs

variants: tuple

class haptools.data.haplotypes.Haplotypes(fname, haplotype=<class 'haptools.data.haplotypes.Haplotype'>, variant=<class 'haptools.data.haplotypes.Variant'>, repeat=<class 'haptools.data.haplotypes.Repeat'>, log=None)

Bases: Data

A class for processing haplotypes from a file

Attributes:

fname: Path | str

The path to the file containing the data

data: dict[str, Haplotype|Repeat]

A dict of Haplotype/Repeat objects keyed by their IDs

types: dict

A dict of class names keyed by the symbol denoting their line type

Ex: {‘H’: Haplotype, ‘V’: Variant, ‘R’: Repeat}

type_ids: dict[str, list]

A dict of class names keyed by the symbol denoting line type. Stores all haplotype and repeat IDs.

Ex: {‘H’: [“H1”, “H2”, “H3”], ‘R’: [“STR_1”, “STR_2”]}

version: str

A string denoting the current file format version

log: Logger

A logging instance for recording debug statements.

Examples

Parsing a basic .hap file without any extra fields is simple:

>>> haplotypes = Haplotypes.load('tests/data/basic.hap')
>>> haps = haplotypes.data # a dictionary of Haplotype objects

If the .hap file contains extra fields, you’ll need to call the read() method manually. You’ll also need to create Haplotype and Variant subclasses that support the extra fields and then specify the names of the classes when you initialize the Haplotypes object:

>>> haplotypes = Haplotypes('tests/data/simphenotype.hap', HaptoolsHaplotype)
>>> haplotypes.read()
>>> haps = haplotypes.data # a dictionary of Haplotype objects

__iter__(region=None, haplotypes=None)

Read haplotypes from a .hap file line by line without storing anything :rtype: Iterator[Variant | Haplotype]

Note

The elements output by __iter__() are not guaranteed to be in any particular order except that variants of a haplotype will never appear before the haplotype, itself.

Parameters:

region: str, optional

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the .hap file must be indexed and the seqname must match!

Defaults to loading all haplotypes

haplotypes: set[str], optional

A list of haplotype IDs corresponding to a subset of the haplotypes to extract

Defaults to loading haplotypes from all samples

Yields:

Iterator[Variant|Repeat|Haplotype]: An iterator over each line in the file, where each line is encoded as a Variant, Repeat, or Haplotype containing each of the class properties

Examples

If you’re worried that the contents of the .hap file will be large, you may opt to parse the file line-by-line instead of loading it all into memory at once. In cases like these, you can use the __iter__() method in a for-loop:

>>> haplotypes = Haplotypes('tests/data/basic.hap')
>>> for line in haplotypes:
...     print(line)

Call the function manually to pass it the region or haplotypes params:

>>> haplotypes = Haplotypes('tests/data/basic.hap.gz')
>>> for line in haplotypes.__iter__(
...    region='21:26928472-26941960', haplotypes={"chr21.q.3365*1"}
... ):
...     print(line)

check_header(lines, check_version=True, softly=True)

1) Check and parse any metadata and 2) check that any extra fields declared in the .haps file can be handled by the Variant and Haplotype classes provided in __init__()

This function is called automatically by other methods that read .hap files

Return type:

tuple[dict, dict[str, tuple[str]]]

Parameters:

lines: list[str]: Header lines from the .hap file. Any lines beginning with # may appear in this list, especially if the file is sorted. So this may include regular comments, too.
check_version: bool, optional: Whether to also check the version of the file
softly: bool, optional: If True, then this function will not raise any ValueErrors. Instead, it will only issue warnings via the logging module, which may be ignored.

Returns:

tuple[dict, dict[str, tuple[Extra]]]

The metadata for the file, contained within the header lines and encoded as a dictionary where the names are keys and any subsequent fields are values

The second dictionary encodes the set of declared extra field names for each line type

Raises:

ValueError: If any of the header lines are not supported

check_version(version, err_msgr)

Check the observed version string against the current version string of this instance

Return type:

tuple[int, int, int]

Parameters:

version: str: The observed version string
err_msgr: Callable: A function which takes a single parameter (the error message) and errors appropriately

Returns:

The parsed, observed version string

index(force=False)

Reset the type_ids parameter

You should call this method after any changes to the data property

classmethod load(fname, region=None, haplotypes=None)

Load haplotypes from a .hap file

Read the file contents

Return type:

Haplotypes

Parameters:

fname: Path: See documentation for fname
region: str, optional: See documentation for read()
haplotypes: set[str], optional: See documentation for read()

Returns:

Haplotypes: A Haplotypes object with the data loaded into its properties

classmethod merge(objs, **kwargs)

Merge Haplotypes objects with different sets of haplotypes together :rtype: Haplotypes

Note

The input Haplotype instances are expected to have distinct indices and must all be of the same type

Parameters:

objs: tuple[Haplotypes]: The objects that should be merged together
**kwargs: Any parameters to pass to _init__()

Returns:

Haplotypes: A new object containing all of the Haplotypes from each input object

Raises:

ValueError: If the set of indices among the haplotypes is not distinct

read(region=None, haplotypes=None)

Read haplotypes from a .hap file into a list stored in data

Parameters:

region: str, optional

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the .hap file must be indexed and the seqname must match!

Defaults to loading all haplotypes

haplotypes: set[str], optional

A list of haplotype IDs corresponding to a subset of the haplotypes to extract

Defaults to loading haplotypes from all samples

sort()

Sorts .hap files first by chrom, followed by start, end, and lastly ID

Also sorts the variants within each haplotype

subset(haplotypes, inplace=False)

Subset these haplotypes to a smaller set of haplotypes

The order of the haplotypes in the subsetted instance will match the order in the provided tuple parameters.

Parameters:

haplotypes: tuple[str]: A subset of haplotype IDs to keep
inplace: bool, optional: If False, return a new Genotypes object; otherwise, alter the current one

Returns:

A new Haplotypes object if inplace is set to False, else returns None

to_str(sort=True)

Create a string representation of this Haplotype

Return type:

Generator[str, None, None]

Parameters:

sort: bool, optional: Whether to attempt to sort variant lines by their haplotype IDs

Yields:

Generator[str, None, None]: A list of lines (strings) to include in the output

transform(gts, hap_gts=None)

Transform a genotypes matrix via the current haplotype

Each entry in the returned matrix denotes the presence of each haplotype in each chromosome of each sample in the Genotypes object

Return type:

GenotypesVCF

Parameters:

gtsGenotypesVCF: The genotypes which to transform using the current haplotype
hap_gts: GenotypesVCF: An empty GenotypesVCF object into which the haplotype genotypes should be stored

Returns:

GenotypesVCF: A Genotypes object composed of haplotypes instead of regular variants.

version = '0.2.0'

write()

Write the contents of this Haplotypes object to the file at fname

If the items in data are sorted, then the output should be automatically sorted such that “sort -k1,4” would leave the output unchanged

Examples

To write to a .hap file, you must first initialize a Haplotypes object and then fill out the data property:

>>> haplotypes = Haplotypes('tests/data/basic.hap')
>>> haplotypes.data = {'H1': Haplotype('chr1', 0, 10, 'H1')}
>>> haplotypes.write()

class haptools.data.haplotypes.Repeat(chrom, start, end, id)

Bases: object

A tandem repeat within the .hap format spec

In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of extra fields 2) override the _extras property to describe the header declaration

Attributes:

chrom: str: The contig to which this tandem repeat belongs
start: int: The chromosomal start position of the tandem repeat
end: int: The chromosomal end position of the tandem repeat
id: str: The tandem repeat’s unique ID
_extras: tuple[Extra]: Extra fields for the tandem repeat

Examples

Let’s extend this class and add an extra field called “ancestry”

>>> from dataclasses import dataclass, field
>>> @dataclass
>>> class CustomRepeat(Repeat):
...     ancestry: str
...     _extras: tuple = field(
...         repr=False,
...         init=False,
...         default = (
...             Extra("ancestry", "s", "Local ancestry"),
...         ),
...     )

property ID: Create an alias for the id property

chrom: str

end: int

classmethod extras_head()

Return the header lines of the extra fields that are supported

Return type:

set

Returns:

tuple: The header lines of the extra fields

classmethod extras_order()

The names of the extra fields in order

Return type:

tuple[str]

Returns:

tuple[str]: The names of the extra fields in the order in which they are stored

classmethod from_hap_spec(line, types=None)

Convert a tandem repeat line into a tandem repeat object in the .hap format spec

Note that this implementation does NOT support having more extra fields than appear in the header

Return type:

Repeat

Parameters:

line: str: A tandem repeat (R) line from the .hap file
types: dict[str, type], optional: The types of each property in the object

Returns:

Repeat: The repeat object line.

id: str

start: int

to_hap_spec()

Convert a Repeat object into a repeat line in the .hap format spec

Return type:

str

Returns:

str: A valid Repeat line (R) in the .hap format spec

types = {'chrom': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}

class haptools.data.haplotypes.Variant(start, end, id, allele)

Bases: object

A variant within the .hap format spec

In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of extra fields 2) override the _extras property to describe the header declaration

Attributes:

start: int

The chromosomal start position of the variant

end: int

The chromosomal end position of the variant

In most cases this will be the same as the start position

id: str

The variant’s unique ID

allele: str

The allele of this variant within the Haplotype

_extras: tuple[Extra]

Extra fields for the haplotype

Examples

Let’s extend this class and add an extra field called “score”

>>> from dataclasses import dataclass, field
>>> @dataclass
>>> class CustomVariant(Variant):
...     score: float
...     _extras: tuple = field(
...         repr=False,
...         init=False,
...         default = (
...             Extra("score", ".3f", "Importance of inclusion"),
...         ),
...     )

property ID: Create an alias for the id property

allele: str

end: int

classmethod extras_head()

Return the header lines of the extra fields that are supported

Return type:

set

Returns:

tuple: The header lines of the extra fields

classmethod extras_order()

The names of the extra fields in order

Return type:

tuple[str]

Returns:

tuple[str]: The names of the extra fields in the order in which they are stored

classmethod from_hap_spec(line, types=None)

Convert a variant line into a Variant object in the .hap format spec

Note that this implementation does NOT support having more extra fields than appear in the header

Return type:

tuple[str, Variant]

Parameters:

line: str: A variant (V) line from the .hap file
types: dict[str, type], optional: The order of the extra fields if different from the order in _extras

Returns:

tuple[str, Variant]: The haplotype ID and Variant object for the variant

id: str

start: int

to_hap_spec(hap_id)

Convert a Variant object into a variant line in the .hap format spec

Return type:

str

Parameters:

hap_id: str: The ID of the haplotype associated with this variant

Returns:

str: A valid variant line (V) in the .hap format spec

types = {'allele': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}

class haptools.data.haplotypes.classproperty(fget)

Bases: object

A dead-simple read-only decorator that combines the functionality of @classmethod and @property

Stolen from https://stackoverflow.com/a/13624858/16815703

haptools.data.breakpoints module

class haptools.data.breakpoints.Breakpoints(fname, log=None)

Bases: Data

A class for processing breakpoints from a file

Attributes:

datadict[str, SampleBlocks]: The haplotype blocks for each chromosome in each sample This dict maps samples (as strings) to their haplotype blocks (as SampleBlocks)
fnamePath | str: The path to the file containing the data
labelsdict | None: A dictionary containing population labels. It maps each label to the unique integers in the “pop” field of data
log: Logger: A logging instance for recording debug statements.

Examples

>>> breakpoints = Breakpoints.load('tests/data/test.bp')

encode(labels=None)

Replace each ancestral label in data with an equivalent integer. Store a dictionary mapping these integers back to their respective labels in labels.

This method modifies data in place.

Parameters:

labels: tuple[str], optional: A list of population labels. The order of the labels in this list will be kept in the respective labels.

classmethod load(fname, samples=None)

Load breakpoints from a TSV file

Read the file contents and standardize the Breakpoints

Return type:

Breakpoints

Parameters:

fname: See documentation for fname
samplesset[str], optional: See documentation for read()

Returns:

Breakpoints: A Breakpoints object with the data loaded into its properties

population_array(variants, samples=None)

Output an array denoting the population labels of each variant for each sample

Return type:

ndarray[Any, dtype[TypeVar(ScalarType, bound= generic, covariant=True)]]

Parameters:

variantsnp.array: Variant-level meta information in a mixed np array of dtypes: CHROM (str) and POS (int)
samplestuple[str], optional: A subset of samples to include in the output, ordered by their given order

Returns:

npt.NDArray

An array of shape: samples x variants x 2

The array will have the same dtype as the population labels in the “pop” field of data. Use encode() or recode() to change this.

read(samples=None)

Read breakpoints from a TSV file into a data structure stored in data

Parameters:

samplesset[str], optional

A subset of the samples for which to extract breakpoints

Defaults to loading breakpoints for all samples

Raises:

AssertionError: If the provided file doesn’t follow the expected format

recode()

Replace each integer in data with an equivalent ancestral label. Use the dictionary mapping these integers back to their respective ancestral labels stored in labels.

This method modifies data in place.

write()

Write the breakpoints in this class to a file at fname

Examples

To write to a file, you must first initialize a Breakpoints object and then fill out the names, data, and samples properties:

>>> from haptools.data import Breakpoints, HapBlock
>>> breakpoints = Breakpoints('simple.bp')
>>> breakpoints.data = {
>>>     'HG00096': [
>>>         np.array([('YRI','chr1',10114,4.3),('CEU','chr1',10116,5.2)], dtype=HapBlock)
>>>         np.array([('CEU','chr1',10114,4.3),('YRI','chr1',10116,5.2)], dtype=HapBlock)
>>>     ], 'HG00097': [
>>>         np.array([('YRI','chr1',10114,4.3),('CEU','chr2',10116,5.2)], dtype=HapBlock)
>>>         np.array([('CEU','chr1',10114,4.3),('YRI','chr2',10116,5.2)], dtype=HapBlock)
>>>     ]
>>> }
>>> breakpoints.write()

haptools.sim_genotype module

haptools.sim_genotype.get_segment(pop, haplotype, chrom, start_coord, end_coord, end_pos, prev_gen_samples)

Create a segment or segments for an individual of the current generation using either a population label (>0) or the previous generation’s samples if the admix pop type (0) is used.

Parameters:

pop: int: index of population. Can recover population name from pop_dict
haplotype: int: index of range [0, len(prev_gen_samples)] to identify the parent haplotype to copy segments from
chrom: int: chromosome the haplotype segment lies on
start_coord: int: starting coordinate from where to begin taking segments from previous generation samples
end_coord: int: ending coordinate of haplotype segment
end_pos: float: ending coordinate in centimorgans
prev_gen_samples: list[list[HaplotypeSegment]]: the previous generation simulated used as the parents for the current generation

Returns:

segments: list[HaplotypeSegment]: A list of HaplotypeSegments storing the population type and end coordinate

haptools.sim_genotype.output_vcf(breakpoints, chroms, model_file, variant_file, sampleinfo_file, region, pop_field, sample_field, no_replacement, out, log, chunk_size=None)

Takes in simulated breakpoints and uses reference files, vcf and sampleinfo, to create simulated variants output in file: out + .vcf

Parameters:

breakpoints: list[list[HaplotypeSegment]]

the simulated breakpoints

chroms: list[str]

List of chromosomes that were used to simulate

model_file: str

file with the following structure. (Must be tab delimited)

Header: number of samples, Admixed, {all pop labels}
Below: generation number, frac, frac, frac

For example,

  Admixed    CEU   YRI
     0        0.05  0.95
     0.20     0.05  0.75

variant_file: str

file path that contains samples and respective variants. Can be in VCF, BCF, VCF.GZ, or PGEN format.

sampleinfo_file: str

file path that contains mapping from sample name in vcf to population

region: dict(str->str/int/int)

Dictionary with the keys “chr”, “start”, and “end” holding chromosome (str), start position (int) and end position (int) allowing the simulation process to only allow variants within that region.

pop_field: boolean

Flag to determine whether to have the population field in the VCF file output

sample_field: boolean

Flag to determine whether to have the sample field in the VCF file output

no_replacement: boolean

Flag to determine whether we sample from the reference VCF with or without replacement. When True there will be no replacement.

out: str

output prefix

log: log object

Outputs messages to the appropriate channel.

chunk_size: int, optional

The max number of variants to write to a PGEN file together

haptools.sim_genotype.simulate_gt(model_file, coords_dir, chroms, region, popsize, log, seed=None)

Simulate admixed genotypes based on the parameters of model_file.

Parameters:

model_file: str

File with the following structure. (Must be tab delimited)

Header: number of samples, Admixed, {all pop labels}
Below: generation number, frac, frac, frac

For example,

  Admixed    CEU   YRI
     0        0.05  0.95
     0.20     0.05  0.75

coords_dir: str

Directory containing files ending in .map with genetic map coords in cM used for recombination points

chroms: list[str]

List of chromosomes to simulate admixture for.

region: dict()

Dictionary with the keys “chr”, “start”, and “end” holding chromosome, start position adn end position allowing the simulation process to only within that region.

popsize: int

Size of population created for each generation.

log: log object

Outputs messages to the appropriate channel.

seed: int

Seed used for randomization.

Returns:

num_samples: int: Total number of samples to output
pop_dict: dict(int->str): Dictionary that maps populations from their encoded version as integers to their population name as a string. ex: {1:CEU, 2:YRI}
next_gen_samples: list[list[HaplotypeSegment]]: Each list is a person containing a variable number of Haplotype Segments based on how many recombination events occurred throughout the generations of ancestors for this person.

haptools.sim_genotype.start_segment(start, chrom, segments)

Find first segment that is on chrom and its end coordinate is > start via binary search.

Parameters:

start: int: Coordinate in bp for the start of the segment to output
chrom: int: Chromosome that the segments lie on.
segments: list[HaplotypeSegment]: List of the hapltoype segments to search from for a starting point.

Returns:

mid: int: Index of the first genetic segment to collect for output.

haptools.sim_genotype.validate_params(model, mapdir, chroms, popsize, invcf, sample_info, no_replacement, region=None, only_bp=False)

haptools.sim_genotype.write_breakpoints(samples, pop_dict, breakpoints, out, log)

Write out a subsample of breakpoints to out determined by samples.

Parameters:

samples: int: Number of samples to output
pop_dict: dict(int->str): Maps population codes in integers to their names. ex: {1:CEU, 2:YRI}
breakpoints: list[list[HaplotypeSegment]]: Each list is a person containing a variable number of Haplotype Segments based on how many recombination events occurred throughout the generations of ancestors for this person.
out: str: output prefix used to output the breakpoint file
log: log object: Outputs messages to the appropriate channel.

Returns:

breakpoints: list[list[HaplotypeSegment]]: subsampled breakpoints only containing number of samples

haptools.sim_phenotype module

class haptools.sim_phenotype.Effect(id, beta)

Bases: object

A variable in the simphenotype linear model

Attributes:

idstr: The ID of the variable; corresponds to a variant in a Genotypes object
betafloat: The effect size of the variable

beta: float

classmethod from_hap_spec(line)

Convert a .snplist line into an Effect object

Return type:

Effect

Parameters:

line: str: A line from a .snplist file

Returns:

Effect: The converted Effect instance

id: str

class haptools.sim_phenotype.Haplotype(chrom, start, end, id, beta)

Bases: Haplotype

A haplotype with sufficient fields for simphenotype

Properties and functions are shared with the base Haplotype object, “HaplotypeBase”

beta: float

class haptools.sim_phenotype.PhenoSimulator(genotypes, output=PosixPath('/dev/stdout'), seed=None, log=None)

Bases: object

Simulate phenotypes from genotypes

Attributes:

gens: dict[Genotypes]: Genotypes to simulate
phens: Phenotypes: Simulated phenotypes; filled by run()
rng: np.random.Generator, optional: A numpy random number generator
log: logging.Logger: A logging instance for recording debug statements

Examples

>>> gens = Genotypes.load("tests/data/example.vcf.gz")
>>> tr_gens = GenotypesTR.load("tests/data/simple_tr.vcf")
>>> haps = Haplotypes.load("tests/data/basic.hap")
>>> haps_gts = haps.transform(gens)
>>> phenosim = PhenoSimulator(haps_gts)
>>> phenosim.run(haps.data.values())
>>> phenotypes = phenosim.phens

normalize_gts(gts, ids)

Normalize variant or repeats genotypes

Parameters:

gts: np.array: Genotypes variant array stored in data.
ids: list[str]: IDs for variants

Returns:

normalized_gts: np.array: Normalized Genotypes variant array.

run(effects, heritability=None, prevalence=None, normalize=True, environment=None)

Simulate phenotypes for an entry in the Genotypes object

The generated phenotypes will also be added to output

Return type:

ndarray[Any, dtype[TypeVar(ScalarType, bound= generic, covariant=True)]]

Parameters:

effects: list[Effect|Haplotype|Repeat]

A list of Haplotypes to use in an additive fashion within the simulations

heritability: float, optional

The simulated heritability of the trait

If not provided, this will default to the sum of the squared effect sizes

prevalence: float, optional

How common should the disease be within the population?

If this value is specified, case/control phenotypes will be generated instead of quantitative traits.

normalize: bool, optional

If True, normalize the genotypes before using them to simulate the phenotypes. Otherwise, use the raw values.

environment: float, optional

The variance (aka strength) of the environmental contribution to the trait. This is inferred from the betas if it isn’t specified.

Returns:

npt.NDArray: The simulated phenotypes, as a np array of shape num_samples x 1

write(): Write the generated phenotypes to the file specified in __init__()

class haptools.sim_phenotype.Repeat(chrom, start, end, id, beta)

Bases: Repeat

A repeat with sufficient fields for simphenotype

Properties and functions are shared with the base Repeat object, “RepeatBeta”

beta: float

haptools.sim_phenotype.simulate_pt(genotypes, haplotypes, num_replications=1, environment=None, heritability=None, prevalence=None, normalize=True, region=None, samples=None, haplotype_ids=None, chunk_size=None, repeats=None, seed=None, output=PosixPath('-'), log=None)

Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.

GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec

Note: GENOTYPES must be the output from the the transform subcommand.

Parameters:

genotypesPath

The path to the transformed genotypes in VCF or PGEN format

haplotypesPath

The path to the haplotypes in a .hap file

replicationsint, optional

The number of rounds of simulation to perform

heritabilityint, optional

The heritability of the simulated trait; must be a float between 0 and 1

If not provided, it will be computed from the sum of the squared effect sizes

prevalenceint, optional

The prevalence of the disease if the trait should be simulated as case/control; must be a float between 0 and 1

If not provided, a quantitative trait will be simulated, instead

normalize: bool, optional

If True, normalize the genotypes before using them to simulate the phenotypes. Otherwise, use the raw values.

regionstr, optional

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the VCF and .hap file must be indexed and the seqname must match!

Defaults to loading all haplotypes

samplesset[str], optional

A subset of the samples from which to extract genotypes

Defaults to loading genotypes from all samples

haplotype_ids: set[str], optional

A list of haplotype IDs to obtain from the .hap file. All others are ignored.

If not provided, all haplotypes will be used.

chunk_size: int, optional

The max number of variants to fetch from the PGEN file at any given time

If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.

repeats: Path, optional

The path to a genotypes file containing tandem repeats. This is only necessary when simulating both haplotypes and repeats as causal effects

environment: float, optional

The variance (aka strength) of the environmental term. This will be inferred if it isn’t specified.

seed: int, optional

Seed for random processes

outputPath, optional

The location to which to write the simulated phenotypes

loglogging.Logger, optional

The logging module for this task

Examples

>>> haptools simphenotype tests/data/example.vcf.gz tests/data/example.hap.gz > simu_phens.tsv

haptools.karyogram module

This script is inspired by Alicia Martin’s karyogram code originally published here: https://github.com/armartin/ancestry_pipeline/blob/master/plot_karyogram.py

haptools.karyogram.GetCentromereClipMask(centromeres_file, chrom_order)

Get clipping mask for the centromeres and telomeres

Parameters:

centromeres_filestr, optional: Path to file with centromere coordinates. Format: chrom, chromstart_cm, centromere_cm, chromend_cm If None, no centromere and telomere locations are shown
chrom_orderlist[int]: chromosomes in sorted order

Returns:

clipmask_perchromdict[str, matplotlib.Path]: Clip region for telomeres/centromeres for each chromosome

haptools.karyogram.GetChrom(chrom)

Extract a numerical chromosome

Parameters:

chromstr: Chromosome string

Returns:

chromint: Integer-value for the chromosome X gets set to 23

haptools.karyogram.GetChromOrder(sample_blocks)

Get a list of chroms in sorted order

Parameters:

sample_blockslist[list[hap_blocks]]: each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’

Returns:

chromslist[int]: list of chromsomes in sorted order

haptools.karyogram.GetCmRange(sample_blocks)

Get the min and max cM coordinates from the sample_blocks

Parameters:

sample_blockslist[list[hap_blocks]]: each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’

Returns:

min_val, max_valfloat, float: min_val is the minimum coordinate max_val is the maximum coordinate

haptools.karyogram.GetHaplotypeBlocks(bp_file, sample_name, centromeres_file=None)

Extract haplotype blocks for the desired sample from the bp file

Parameters:

bp_filestr: Path to .bp file with breakpoints
sample_namestr: Sample ID to extract
centromeres_filestr, optional: If not None then use the chromosome ends listed to extend chromosomes to proper end coordinates

Returns:

sample_blockslist[list[hap_blocks]]: each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’

haptools.karyogram.GetPopList(sample_blocks)

Get a list of populations in the sample_blocks

Parameters:

sample_blockslist[list[hap_blocks]]: each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’

Returns:

poplistlist[str]: list of populations represented in the blocks

haptools.karyogram.PlotHaplotypeBlock(block, hapnum, chrom_order, colors, ax, clipmask_perchrom=None)

Plot a haplotype block on the axis

Parameters:

blockdict: dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
hapnumint: 0 or 1 for the two haplotypes
chrom_orderlist[int]: chromosomes in sorted order
colorsdict[str, str], optional: Dictionary of colors to use for each population If not set, reasonable defaults are used. In addition to strings, you can specify RGB or RGBA tuples.
axmatplotlib axis to use for plotting
clipmask_perchromdict[str, matplotlib.Path], optional: Clip region for telomeres/centromeres for each chromosome

haptools.karyogram.PlotKaryogram(bp_file, sample_name, out_file, log, centromeres_file=None, title=None, colors=None)

Plot a karyogram based on breakpoints output by haptools simgenotypes

Parameters:

bp_filestr: Path to .bp file with breakpoints
sample_namestr: Sample ID to plot
out_filestr: Name of output file
log: log object: Outputs messages to the appropriate channel.
centromeres_filestr, optional: Path to file with centromere coordinates. Format: chrom, chromstart_cm, centromere_cm, chromend_cm If None, no centromere and telomere locations are shown
titlestr, optional: Plot title. If None, no title is annotated
colorsdict(str->str), optional: Dictionary of colors to use for each population If not set, reasonable defaults are used. In addition to strings, you can specify RGB or RGBA tuples.

haptools.transform module

class haptools.transform.GenotypesAncestry(fname, log=None)

Bases: GenotypesVCF

Extends the GenotypesVCF class for ancestry data

The ancestry information is stored within the FORMAT field of the VCF

Attributes:

datanp.array: See documentation for data
fnamePath | str: See documentation for fname
samplestuple[str]: See documentation for samples
variantsnp.array: See documentation for variants
valid_labels: np.array: Reference VCF sample and respective variant grabbed for each sample.
ancestrynp.array: The ancestral population of each allele in each sample of data
log: logging.Logger: See documentation for log

check_biallelic(discard_also=False): See documentation for check_biallelic()

check_missing(discard_also=False): See documentation for check_missing()

merge_variants(objs, check_samples=True, **kwargs)

See documentation for merge_variants()

Return type:: Genotypes

read(region=None, samples=None, variants=None, max_variants=None): See documentation for read()

subset(samples=None, variants=None, inplace=False): See documentation for subset()

write(chroms=None): Write the variants in this class to a VCF at fname

class haptools.transform.HaplotypeAncestry(chrom, start, end, id, ancestry)

Bases: Haplotype

A haplotype with an ancestry field for the transform subcommand

Properties and functions are shared with the base “Haplotype” object

ancestry: str

transform(genotypes)

Transform a genotypes matrix via the current haplotype and its ancestral population

See documentation for transform() for more details

Return type:: ndarray[Any, dtype[TypeVar(ScalarType, bound= generic, covariant=True)]]

class haptools.transform.HaplotypesAncestry(fname, haplotype=<class 'haptools.transform.HaplotypeAncestry'>, variant=<class 'haptools.data.haplotypes.Variant'>, log=None)

Bases: Haplotypes

A set of haplotypes with an ancestry field for the transform subcommand

Properties and functions are shared with the base “Haplotypes” object

transform(gts, hap_gts=None)

Transform a genotypes matrix via the current haplotype

Each entry in the returned matrix denotes the presence of each haplotype in each chromosome of each sample in the Genotypes object

Parameters:

gtsGenotypesVCF: The genotypes which to transform using the current haplotype
hap_gts: GenotypesVCF: An empty GenotypesVCF object into which the haplotype genotypes should be stored

Returns:

GenotypesVCF: A Genotypes object composed of haplotypes instead of regular variants.

haptools.transform.transform_haps(genotypes, haplotypes, region=None, samples=None, haplotype_ids=None, chunk_size=None, discard_missing=False, ancestry=False, maf=None, output=PosixPath('-'), log=None)

Creates a VCF composed of haplotypes

Parameters:

genotypesPath

The path to the genotypes in VCF or PGEN format

haplotypesPath

The path to the haplotypes in a .hap file

regionstr, optional

See documentation for read() and read()

samplesset[str], optional

See documentation for read()

haplotype_ids: set[str], optional

A set of haplotype IDs to obtain from the .hap file. All others are ignored.

If not provided, all haplotypes will be used.

chunk_size: int, optional

The max number of variants to fetch from the PGEN file at any given time

If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.

discard_missingbool, optional

Discard any samples that are missing any of the required genotypes

The default is simply to complain about it

ancestrybool, optional

Whether to also match ancestral population labels from the VCF against those in the .hap file

maffloat, optional

If specified, only haplotypes with an MAF above this value will be output

outputPath, optional

The location to which to write output

logLogger, optional

A logging module to which to write messages about progress and any errors

haptools.ld module

class haptools.ld.Haplotype(chrom, start, end, id, ld)

Bases: Haplotype

A haplotype with sufficient fields for the ld command

Properties and functions are shared with the base Haplotype object, HaplotypeBase

ld: float

haptools.ld.calc_ld(target, genotypes, haplotypes, region=None, samples=None, ids=None, chunk_size=None, discard_missing=False, from_gts=False, output=PosixPath('/dev/stdout'), log=None)

Creates a VCF composed of haplotypes

Parameters:

targetstr

The ID of the haplotype or variant with which we will calculate LD

genotypesPath

The path to the genotypes

haplotypesPath

The path to the haplotypes in a .hap file

regionstr, optional

See documentation for read() and read()

samplesset[str], optional

See documentation for read()

ids: set[str], optional

A subset of haplotype IDs to obtain from the .hap file. All others are ignored.

Alternatively, if the –from-gts switch is specified, this will be interpreted as a subset of variant IDs to obtain from the genotypes file.

Defaults to loading all haplotypes or variants if not specified

chunk_size: int, optional

The max number of variants to fetch from the PGEN file at any given time

If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.

discard_missingbool, optional

Discard any samples that are missing any of the required genotypes

The default is simply to complain about it

outputPath, optional

The location to which to write output

logLogger, optional

A logging module to which to write messages about progress and any errors

haptools.ld.pearson_corr_ld(arrA, arrB)

Compute the Pearson correlation coefficient (LD) between two vectors (1D arrays)

If 2D array(s) are given instead, the rows will be treated as samples and the columns as variants

Return type:

float | ndarray[Any, dtype[TypeVar(ScalarType, bound= generic, covariant=True)]]

Parameters:

arrA: npt.NDArray: The first 1D (or 2D) numpy array
arrB: npt.NDArray: The second 1D (or 2D) numpy array

Returns:

The signed LD between the genotypes in arrA and the genotypes in arrB
If either arrA or arrB is 2D, then the return value will be an array instead. If
both are 2D, the shape of the array will correspond with the second dimensions of
arrA and arrB, respectively. Otherwise, it will be 1D.

haptools.index module

haptools.index.append_suffix(path, suffix)

Used as a helper method for index_haps. Appends a given suffix to a Path instance.

Parameters:

pathPath: The path to a file
suffixstr: A string to append to the end of the given Path. For example, “.gz” or “.gz.tbi”

haptools.index.index_haps(haplotypes, sort=False, output=None, log=None)

Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file

Parameters:

haplotypesPath: The path to the haplotypes in a .hap file
outputPath, optional: The location to which to write output. If an output location is not specified, the output will have the same name as the input file.
logLogger, optional: A logging module to which to write messages about progress and any errors

haptools.clump module

haptools.clump.ComputeExactLD(candidate_gt, index_gt, log)

Compute exact solution of haplotype frequencies to calculate r squared value. NOTE currently this approach only works for biallelic variants since having more variants causes the equation we’re solving for to be a cubic but instead to the degree of n where n is the total number of alelles which also invalidates STRs.

Parameters:

candidate_gt: np.array: array of size (genotypes,) where genotypes is the number of samples
index_gt: np.array: array of size (genotypes,) where genotypes is the number of samples
log: Logger: A logging instance for recording debug statements.

Returns:

r_squared: float: R squared value inferred from ML solution.

haptools.clump.ComputeLD(candidate_gt, index_gt, LD_type, log): Compute the LD between two variants

haptools.clump.GetOverlappingSamples(snpgts, strgts)

Get indices of overlapping samples for snps and strs

Parameters:

snpgts: GenotypesVCF: SNP Genotypes object
strgts: GenotypesTR: STR Genotypes object

Returns:

snp_samples: list(int): Indices of overlapping samples for snps
str_samples: list(int): Indices of overlapping samples for strs

haptools.clump.LoadVariant(var, gts, log): Extract vector of genotypes for this variant

class haptools.clump.SummaryStats(log=None)

Bases: object

Load and process summary statistics

Attributes:

summstats: list[Variant]: list of Variant objects that represent all summary statistics in the file.
log: Logger: A logging instance for recording debug statements.

Examples

Loading a summary stats file, grabbing an index variant, and its candidate variants to calculate LD.

>>> summstats = SummaryStats(log)
>>> summstats.Load(summstats_strs, vartype="SNP", pthresh=clump_p2,
        id_field=clump_id_field, p_field=clump_field,
        chrom_field=clump_chrom_field, pos_field=clump_pos_field)
>>> indexvar = summstats.GetNextIndexVariant(clump_p1)
>>> candidates = summstats.QueryWindow(indexvar, clump_kb)

GetNextIndexVariant(index_pval_thresh): Get the next index variant, which is the variant with the best p-value If no more variants below the clump-p1 threshold, return None Not yet implemented

Load(statsfile, vartype='SNP', pthresh=1.0, id_field='SNP', p_field='P', chrom_field='CHR', pos_field='POS'): Load summary statistics Ignore variants with pval < pthresh Not yet implemented

QueryWindow(indexvar, window_kb): Find all candidate variants in the specified window around the index variant Not yet implemented

RemoveClump(clumpvars): Remove the variants from a clump from further consideration

class haptools.clump.Variant(varid, chrom, pos, pval, vartype): Bases: object

haptools.clump.WriteClump(indexvar, clumped_vars, outf): Write a clump to the output file Not yet implemented

haptools.clump.clumpstr(summstats_snps, summstats_strs, gts_snps, gts_strs, clump_p1, clump_p2, clump_id_field, clump_field, clump_chrom_field, clump_pos_field, clump_kb, clump_r2, LD_type, out, log)