Documentation
Command line interface
haptools
haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information
haptools [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
index
Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file
haptools index [OPTIONS] HAPLOTYPES
Options
- --sort, --no-sort
Sorting of the file will not be performed
- Default
True
- -o, --output <output>
A .hap file containing sorted and indexed haplotypes and variants
- Default
input file
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default
INFO- Options
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- HAPLOTYPES
Required argument
karyogram
Visualize a karyogram of local ancestry tracks
haptools karyogram [OPTIONS]
Options
- --bp <bp>
Required Path to .bp file with breakpoints
- --sample <sample>
Required Sample ID to plot
- --out <out>
Required Name of output file
- --title <title>
Optional plot title
- --centromeres <centromeres>
Optional file with telomere/centromere cM positions
- --colors <colors>
Optional color dictionary. Input can be from the matplotlib list of colors or in hexcode. Format is e.g. ‘YRI:blue,CEU:green’
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default
INFO- Options
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
ld
Compute the pair-wise LD (Pearson’s correlation) between haplotypes (or variants) and a single TARGET haplotype (or variant)
GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec
TARGET refers to the ID of a variant or haplotype. LD is computed pair-wise between TARGET and all of the other haplotypes in the .hap (or genotype) file
If TARGET is a variant ID, the ID must appear in GENOTYPES. Otherwise, it must be present in the .hap file
haptools ld [OPTIONS] TARGET GENOTYPES HAPLOTYPES
Options
- --region <region>
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files
- Default
all haplotypes
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default
all samples
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default
all samples
- -i, --id <ids>
A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’). Or, if –from-gts, a list of the variant IDs to use from the genotypes file. For this to work, the .hap file must be indexed
- Default
all haplotypes
- -I, --ids-file <ids_file>
A single column txt file containing a list of the haplotype (or variant) IDs (one per line) to subset from the .hap (or genotype) file
- Default
all haplotypes
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default
all variants
- --discard-missing
Ignore any samples that are missing genotypes for the required variants
- Default
False
- --from-gts
By default, LD is computed with the haplotypes in the .hap file. Use this switch to compute LD with the genotypes in the genotypes file, instead.
- Default
False
- -o, --output <output>
A .hap file containing haplotypes and their LD with TARGET
- Default
stdout
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default
INFO- Options
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- TARGET
Required argument
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument
simgenotype
Simulate admixed genomes under a pre-defined model.
haptools simgenotype [OPTIONS]
Options
- --model <model>
Required Admixture model in .dat format. See File Formats under simgenotype in the docs for complete info.
- --mapdir <mapdir>
Required Directory containing files with chr{1-22,X} and ending in .map in the file name with genetic map coords.
- --out <out>
Required Path to desired output file. E.g. /path/to/output.vcf.gz Possible outputs are vcf|bcf|vcf.gz|pgen and there will be an additional breakpoints output with extension bp e.g. /path/to/output.bp.
- --chroms <chroms>
Sorted and comma delimited list of chromosomes to simulate
- --seed <seed>
Random seed. Set to make simulations reproducible
- --ref_vcf <ref_vcf>
Required VCF or PGEN file used as reference for creation of simulated samples respective genotypes.
- --sample_info <sample_info>
Required File that maps samples from the reference VCF (–invcf) to population codes describing the populations in the header of the model file.
- --region <region>
Subset the simulation to a specific region in a chromosome using the form chrom:start-end. Example 2:1000-2000
- --pop_field
Flag for outputting the population field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.
- --sample_field
Flag for outputting the sample field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.
- --only_breakpoint
Flag used to determine whether to only output breakpoints or continue to simulate a vcf file.
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default
INFO- Options
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
simphenotype
Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.
GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec
Note: GENOTYPES must be the output from the transform subcommand.
haptools simphenotype [OPTIONS] GENOTYPES HAPLOTYPES
Options
- -r, --replications <replications>
Number of rounds of simulation to perform
- Default
1
- -h, --heritability <heritability>
Trait heritability
- -p, --prevalence <prevalence>
Disease prevalence if simulating a case-control trait
- Default
quantitative trait
- --normalize, --no-normalize
Whether to normalize the genotypes before using them for simulation
- Default
True
- --region <region>
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files
- Default
all haplotypes
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default
all samples
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default
all samples
- -i, --id <ids>
A list of the haplotype IDs from the .hap file to use as causal variables (ex: ‘-i H1 -i H2’).
- Default
all haplotypes
- -I, --ids-file <ids_file>
A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file
- Default
all haplotypes
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default
all variants
- --seed <seed>
Use this option across executions to make the output reproducible
- Default
chosen randomly
- -o, --output <output>
A TSV file containing simulated phenotypes
- Default
stdout
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default
INFO- Options
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument
transform
Creates a VCF composed of haplotypes
GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec
haptools transform [OPTIONS] GENOTYPES HAPLOTYPES
Options
- --region <region>
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files
- Default
all haplotypes
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default
all samples
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default
all samples
- -i, --id <ids>
A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’).
- Default
all haplotypes
- -I, --ids-file <ids_file>
A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file
- Default
all haplotypes
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default
all variants
- --discard-missing
Ignore any samples that are missing genotypes for the required variants
- Default
False
- --ancestry
Also transform using VCF ‘POP’ FORMAT field and ‘ancestry’ .hap extra field
- Default
False
- -o, --output <output>
A VCF file containing haplotype ‘genotypes’
- Default
stdout
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default
INFO- Options
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument
Module contents
haptools.data.data module
- class haptools.data.data.Data(fname, log=None)
Bases:
ABCAbstract class for accessing read-only data files
- Attributes
- fnamePath | str
The path to the read-only file containing the data
- datanp.array
The contents of the data file, once loaded
- log: Logger
A logging instance for recording debug statements.
- static hook_compressed(filename, mode)
A utility to help open files regardless of their compression
Based off of python’s fileinput.hook_compressed and copied from https://stackoverflow.com/a/64106815/16815703
- Return type
IO- Parameters
- filenamestr
The path to the file
- modestr
Either ‘r’ for read or ‘w’ for write
- Returns
- IO
The resolved file object
- abstract classmethod load(fname)
Read the file contents and perform any recommended pre-processing
- Parameters
- fnamePath
See documentation for
fname
- abstract read()
Read the raw file contents into the class properties
- unset()
Whether the data has been loaded into the object yet
- Return type
bool- Returns
- bool
True if
datais None else False
haptools.data.genotypes module
- class haptools.data.genotypes.Genotypes(fname, log=None)
Bases:
DataA class for processing genotypes from a file
Examples
>>> genotypes = Genotypes.load('tests/data/simple.vcf') >>> # directly access the loaded variants, samples, and genotypes (in data) >>> genotypes.variants >>> genotypes.samples >>> genotypes.data
- Attributes
- datanpt.NDArray
The genotypes in an n (samples) x p (variants) x 2 (strands) array
- fnamePath | str
The path to the read-only file containing the data
- samplestuple[str]
The names of each of the n samples
- variantsnp.array
- Variant-level meta information:
ID
CHROM
POS
- log: Logger
A logging instance for recording debug statements
- _prephasedbool
If True, assume that the genotypes are phased. Otherwise, extract their phase when reading from the VCF.
- _samp_idxdict[str, int]
Sample index; maps samples to indices in self.samples
- _var_idxdict[str, int]
Variant index; maps variant IDs to indices in self.variants
- check_biallelic(discard_also=False)
Check that each genotype is composed of only two alleles
This function modifies the dtype of
datafrom uint8 to bool- Parameters
- discard_alsobool, optional
If True, discard any multiallelic variants without raising a ValueError
- Raises
- ValueError
If any of the genotypes have more than two alleles
- check_maf(threshold=None, discard_also=False, warn_only=False)
Check the minor allele frequency of each variant
Raise a ValueError if any variant’s MAF doesn’t satisfy the threshold, if one is provided :rtype:
ndarray[Any,dtype[float64]]Note
You should call
check_missing()andcheck_biallelic()before executing this method, for best results. Otherwise, the frequencies may be computed incorrectly.- Parameters
- threshold: float, optional
If a variant has a minor allele frequency (MAF) rarer than this threshold, raise a ValueError
- discard_alsobool, optional
If True, discard any variants that would otherwise cause a ValueError
This parameter will be ignored if a threshold is not specified
- warn_only: bool, optional
Just raise a warning instead of a ValueError
- Returns
- The minor allele frequency of each variant
- Raises
- ValueError
If any variant does not meet the provided threshold minor allele frequency
- check_missing(discard_also=False)
Check that each sample is properly genotyped
- Parameters
- discard_alsobool, optional
If True, discard any samples that are missing genotypes without raising a ValueError
- Raises
- ValueError
If any of the samples have missing genotypes ‘GT: .|.’
- check_phase()
Check that the genotypes are phased then remove the phasing info from the data
This function modifies
datain-place- Raises
- ValueError
If any heterozgyous genotpyes are unphased
- check_sorted()
Check that the variant coordinates are sorted
Raise a ValueError if any of variants in any of the chromosomes are unsorted
- Raises
- ValueError
If any variant position is less than a position that comes before it within the same chromosome
- index(samples=True, variants=True)
Call this function once to improve the amortized time-complexity of look-ups of samples and variants by their ID. This is useful if you intend to later subset by a set of samples or variant IDs. The time complexity of this function should be roughly O(n+p) if both parameters are True. Otherwise, it will be either O(n) or O(p).
- Parameters
- samples: bool, optional
Whether to index the samples for fast loop-up. Adds complexity O(n).
- variants: bool, optional
Whether to index the variants for fast look-up. Adds complexity O(p).
- Raises
- ValueError
If any samples or variants appear more than once
- classmethod load(fname, region=None, samples=None, variants=None)
Load genotypes from a VCF file
Read the file contents, check the genotype phase, and create the MAC matrix
- Return type
- Parameters
- Returns
- Genotypes
A Genotypes object with the data loaded into its properties
- read(region=None, samples=None, variants=None, max_variants=None)
Read genotypes from a VCF into a numpy matrix stored in
data- Parameters
- regionstr, optional
The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the VCF must be indexed and the seqname must match!
Defaults to loading all genotypes
- sampleslist[str], optional
A subset of the samples from which to extract genotypes
Defaults to loading genotypes from all samples
- variantsset[str], optional
A set of variant IDs for which to extract genotypes
All other variants will be ignored. This may be useful if you’re running out of memory.
- max_variantsint, optional
The maximum mumber of variants to load from the file. Setting this value helps preallocate the arrays, making the process faster and less memory intensive. You should use this option if your processes are frequently “Killed” from memory overuse.
If you don’t know how many variants there are, set this to a large number greater than what you would except. The np array will be resized appropriately. You can also use the bcftools “counts” plugin to obtain the number of expected sites within a region.
Note that this value is ignored if the variants argument is provided.
- Raises
- ValueError
If the genotypes array is empty
- subset(samples=None, variants=None, inplace=False)
Subset these genotypes to a smaller set of samples or a smaller set of variants
The order of the samples and variants in the subsetted instance will match the order in the provided tuple parameters.
- Parameters
- samples: tuple[str]
A subset of samples to keep
- variants: tuple[str]
A subset of variant IDs to keep
- inplace: bool, optional
If False, return a new Genotypes object; otherwise, alter the current one
- Returns
- A new Genotypes object if inplace is set to False, else returns None
- class haptools.data.genotypes.GenotypesPLINK(fname, log=None, chunk_size=None)
Bases:
GenotypesRefAltA class for processing genotypes from a PLINK
.pgenfileExamples
>>> genotypes = GenotypesPLINK.load('tests/data/simple.pgen')
- Attributes
- datanp.array
See documentation for
data- samplestuple
See documentation for
data- variantsnp.array
See documentation for
data- log: Logger
See documentation for
data- chunk_size: int, optional
The max number of variants to fetch from and write to the PGEN file at any given time
If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory
- _prephased: bool
See documentation for
data
- read(region=None, samples=None, variants=None, max_variants=None)
Read genotypes from a PGEN file into a numpy matrix stored in
data- Parameters
- regionstr, optional
See documentation for
read- sampleslist[str], optional
See documentation for
read- variantsset[str], optional
See documentation for
read- max_variantsint, optional
See documentation for
read
- read_samples(samples=None)
Read sample IDs from a PSAM file into a list stored in
samplesThis method is called automatically by
read()- Parameters
- sampleslist[str], optional
See documentation for
read
- Returns
- npt.NDArray[np.uint32]
The indices of each of the samples within the PSAM file
- read_variants(region=None, variants=None, max_variants=None)
Read variants from a PVAR file into a numpy array stored in
variantsOne of either variants or max_variants MUST be specified!
This method is called automatically by
read()- Parameters
- regionstr, optional
See documentation for
read- variantsset[str], optional
See documentation for
read- max_variantsint, optional
See documentation for
read
- Returns
- npt.NDArray[np.uint32]
The indices of each of the variants within the PVAR file
- write()
Write the variants in this class to PLINK2 files at
fname
- class haptools.data.genotypes.GenotypesRefAlt(fname, log=None)
Bases:
GenotypesA class for processing genotypes from a file Unlike the base Genotypes class, this class also includes REF and ALT alleles in the variants array
- Attributes
- datanp.array
See documentation for
data- fnamePath | str
See documentation for
fname- samplestuple[str]
See documentation for
samples- variantsnp.array
- Variant-level meta information:
ID
CHROM
POS
REF
ALT
- log: Logger
See documentation for
log
- write()
Write the variants in this class to a VCF at
fname
haptools.data.phenotypes module
- class haptools.data.phenotypes.Phenotypes(fname, log=None)
Bases:
DataA class for processing phenotypes from a file
Examples
>>> phenotypes = Phenotypes.load('tests/data/simple.pheno')
- Attributes
- datanp.array
The phenotypes in an n (samples) x m (phenotypes) array
- fnamePath | str
The path to the file containing the data
- samplestuple
The names of each of the n samples
- namestuple[str]
The names of the phenotypes
- log: Logger
A logging instance for recording debug statements.
- append(name, data)
Append a new set of phenotypes to the current set
- Parameters
- name: str
The name of the new phenotype
- data: npt.NDArray
A 1D np array of the same length as
samples, containing the phenotype values for each sample. Must have the same dtype as
- classmethod load(fname, samples=None)
Load phenotypes from a pheno file
Read the file contents and standardize the phenotypes
- Return type
- Parameters
- fname
See documentation for
fname- samplesset[str], optional
See documentation for
read()
- Returns
- phenotypes
A Phenotypes object with the data loaded into its properties
- read(samples=None)
Read phenotypes from a pheno file into a numpy matrix stored in
data- Parameters
- samplesset[str], optional
A subset of the samples from which to extract phenotypes
Defaults to loading phenotypes from all samples
- Raises
- AssertionError
If the provided file doesn’t follow the expected format
- standardize()
Standardize phenotypes so they have a mean of 0 and a stdev of 1
This function modifies
datain-place
- write()
Write the phenotypes in this class to a file at
fnameExamples
To write to a file, you must first initialize a Phenotypes object and then fill out the names, data, and samples properties: >>> phenotypes = Phenotypes(‘tests/data/simple.pheno’) >>> phenotypes.names = (‘height’,) >>> phenotypes.data = np.array([1, 1, 2], dtype=’float64’) >>> phenotypes.samples = (‘HG00096’, ‘HG00097’, ‘HG00099’) >>> phenotypes.write()
haptools.data.covariates module
- class haptools.data.covariates.Covariates(fname, log=None)
Bases:
PhenotypesA class for processing covariates from a file
Examples
>>> covariates = Covariates.load('tests/data/simple.covar')
- Attributes
- datanp.array
The covariates in an n (samples) x m (covariates) array
- fnamePath | str
The path to the read-only file containing the data
- samplestuple[str]
The names of each of the n samples
- namestuple[str]
The names of the covariates
- log: Logger
A logging instance for recording debug statements.
haptools.data.haplotypes module
- class haptools.data.haplotypes.Extra(name, fmt='s', description='')
Bases:
objectAn extra field on a line in the .hap file
- Attributes
- name: str
The name of the extra field
- fmt: str = “s”
The python fmt string of the field value; indicates how to format the value
- description: str = “”
A description of the extra field
-
description:
str= ''
-
fmt:
str= 's'
- property fmt_str: str
Convert an Extra into a fmt string
- Returns
- str
A python format string (ex: “{beta:.3f}”)
- classmethod from_hap_spec(line)
Convert an “extra” line in the header of a .hap file into an Extra object
- Return type
- Parameters
- line: str
An “extra” field, as it appears declared in the header
- Returns
- Extra
An Extra object
-
name:
str
- to_hap_spec(line_type_symbol)
Convert an Extra object into a header line in the .hap format spec
- Return type
str- Parameters
- hap_id: str
The ID of the haplotype associated with this variant
- Returns
- str
A valid variant line (V) in the .hap format spec
- class haptools.data.haplotypes.Haplotype(chrom, start, end, id)
Bases:
objectA haplotype within the .hap format spec
In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of extra fields 2) override the _extras property to describe the header declaration
Examples
Let’s extend this class and add an extra field called “ancestry”
>>> from dataclasses import dataclass, field >>> @dataclass >>> class CustomHaplotype(Haplotype): ... ancestry: str ... _extras: tuple = field( ... repr=False, ... init=False, ... default = ( ... Extra("ancestry", "s", "Local ancestry"), ... ), ... )
- Attributes
- chrom: str
The contig to which this haplotype belongs
- start: int
The chromosomal start position of the haplotype
- end: int
The chromosomal end position of the haplotype
- id: str
The haplotype’s unique ID
- variants: tuple[Variant]
The variants in this haplotype
- _extras: tuple[Extra]
Extra fields for the haplotype
- property ID
Create an alias for the id property
-
chrom:
str
-
end:
int
- classmethod extras_head()
Return the header lines of the extra fields that are supported
- Return type
set- Returns
- tuple
The header lines of the extra fields
- classmethod extras_order()
The names of the extra fields in order
- Returns
- tuple[str]
The names of the extra fields in the order in which they are stored
- classmethod from_hap_spec(line, variants=(), types=None)
Convert a variant line into a Haplotype object in the .hap format spec
Note that this implementation does NOT support having more extra fields than appear in the header
- Parameters
- line: str
A variant (H) line from the .hap file
- variants: tuple[Variant], optional
The Variants in this haplotype
- types: dict[str, type], optional
The types of each property in the object
- Returns
- Haplotype
The Haplotype object for the variant
-
id:
str
- sort()
Sorts the variants within this Haplotype instance
-
start:
int
- to_hap_spec()
Convert a Haplotype object into a haplotype line in the .hap format spec
- Return type
str- Returns
- str
A valid haplotype line (H) in the .hap format spec
- transform(genotypes)
Transform a genotypes matrix via the current haplotype
Each entry in the returned matrix denotes the presence of the current haplotype in each chromosome of each sample in the Genotypes object
- Return type
ndarray[Any,dtype[bool]]- Parameters
- genotypesGenotypesRefAlt
The genotypes which to transform using the current haplotype
If the genotypes have not been loaded into the Genotypes object yet, this method will call Genotypes.read(), while loading only the needed variants
- Returns
- npt.NDArray[bool]
A 2D matrix of shape (num_samples, 2) where each entry in the matrix denotes the presence of the haplotype in one chromosome of a sample
- types = {'chrom': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}
- property varIDs
-
variants:
tuple
- class haptools.data.haplotypes.Haplotypes(fname, haplotype=<class 'haptools.data.haplotypes.Haplotype'>, variant=<class 'haptools.data.haplotypes.Variant'>, log=None)
Bases:
DataA class for processing haplotypes from a file
Examples
Parsing a basic .hap file without any extra fields is simple: >>> haplotypes = Haplotypes.load(‘tests/data/basic.hap’) >>> haps = haplotypes.data # a dictionary of Haplotype objects
If the .hap file contains extra fields, you’ll need to call the read() method manually. You’ll also need to create Haplotype and Variant subclasses that support the extra fields and then specify the names of the classes when you initialize the Haplotypes object: >>> haplotypes = Haplotypes(‘tests/data/simphenotype.hap’, HaptoolsHaplotype) >>> haplotypes.read() >>> haps = haplotypes.data # a dictionary of Haplotype objects
- Attributes
- fname: Path | str
The path to the file containing the data
- data: dict[str, Haplotype]
A dict of Haplotype objects keyed by their IDs
- types: dict
A dict of class names keyed by the symbol denoting their line type
Ex: {‘H’: Haplotype, ‘V’: Variant}
- version: str
A string denoting the current file format version
- log: Logger
A logging instance for recording debug statements.
- check_header(lines, check_version=True, softly=False)
1) Check and parse any metadata and 2) check that any extra fields declared in the .haps file can be handled by the Variant and Haplotype classes provided in __init__()
This function is called automatically by other methods that read .hap files
- Parameters
- lines: list[str]
Header lines from the .hap file. Any lines beginning with # may appear in this list, especially if the file is sorted. So this may include regular comments, too.
- check_version: bool, optional
Whether to also check the version of the file
- softly: bool, optional
If True, then this function will not raise any ValueErrors. Instead, it will only issue errors via the logging module, which may be ignored.
- Returns
- tuple[dict, dict[str, tuple[Extra]]]
The metadata for the file, contained within the header lines and encoded as a dictionary where the names are keys and any subsequent fields are values
The second dictionary encodes the set of declared extra field names for each line type
- Raises
- ValueError
If any of the header lines are not supported
- check_version(version, err_msgr)
Check the observed version string against the current version string of this instance
- Parameters
- version: str
The observed version string
- err_msgr: Callable
A function which takes a single parameter (the error message) and errors appropriately
- Returns
- The parsed, observed version string
- classmethod load(fname, region=None, haplotypes=None)
Load haplotypes from a .hap file
Read the file contents
- Return type
- Parameters
- Returns
- Haplotypes
A Haplotypes object with the data loaded into its properties
- read(region=None, haplotypes=None)
Read haplotypes from a .hap file into a list stored in
data- Parameters
- region: str, optional
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the .hap file must be indexed and the seqname must match!
Defaults to loading all haplotypes
- haplotypes: set[str], optional
A list of haplotype IDs corresponding to a subset of the haplotypes to extract
Defaults to loading haplotypes from all samples
- sort()
Sorts .hap files first by chrom, followed by start, end, and lastly ID
Also sorts the variants within each haplotype
- subset(haplotypes, inplace=False)
Subset these haplotypes to a smaller set of haplotypes
The order of the haplotypes in the subsetted instance will match the order in the provided tuple parameters.
- Parameters
- haplotypes: tuple[str]
A subset of haplotype IDs to keep
- inplace: bool, optional
If False, return a new Genotypes object; otherwise, alter the current one
- Returns
- A new Haplotypes object if inplace is set to False, else returns None
- to_str(sort=True)
Create a string representation of this Haplotype
- Return type
Generator[str,None,None]- Parameters
- sort: bool, optional
Whether to attempt to output lines in sorted order
- Yields
- Generator[str, None, None]
A list of lines (strings) to include in the output
- transform(gts, hap_gts=None)
Transform a genotypes matrix via the current haplotype
Each entry in the returned matrix denotes the presence of each haplotype in each chromosome of each sample in the Genotypes object
- Return type
- Parameters
- gtsGenotypesRefAlt
The genotypes which to transform using the current haplotype
- hap_gts: GenotypesRefAlt
An empty GenotypesRefAlt object into which the haplotype genotypes should be stored
- Returns
- GenotypesRefAlt
A Genotypes object composed of haplotypes instead of regular variants.
- write()
Write the contents of this Haplotypes object to the file at
fnameIf the items in
dataare sorted, then the output should be automatically sorted such that “sort -k1,4” would leave the output unchangedExamples
To write to a .hap file, you must first initialize a Haplotypes object and then fill out the data property: >>> haplotypes = Haplotypes(‘tests/data/basic.hap’) >>> haplotypes.data = {‘H1’: Haplotype(‘chr1’, 0, 10, ‘H1’)} >>> haplotypes.write()
- class haptools.data.haplotypes.Variant(start, end, id, allele)
Bases:
objectA variant within the .hap format spec
In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of extra fields 2) override the _extras property to describe the header declaration
Examples
Let’s extend this class and add an extra field called “score”
>>> from dataclasses import dataclass, field >>> @dataclass >>> class CustomVariant(Variant): ... score: float ... _extras: tuple = field( ... repr=False, ... init=False, ... default = ( ... Extra("score", ".3f", "Importance of inclusion"), ... ), ... )
- Attributes
- start: int
The chromosomal start position of the variant
- end: int
The chromosomal end position of the variant
In most cases this will be the same as the start position
- id: str
The variant’s unique ID
- allele: str
The allele of this variant within the Haplotype
- _extras: tuple[Extra]
Extra fields for the haplotype
- property ID
Create an alias for the id property
-
allele:
str
-
end:
int
- classmethod extras_head()
Return the header lines of the extra fields that are supported
- Return type
set- Returns
- tuple
The header lines of the extra fields
- classmethod extras_order()
The names of the extra fields in order
- Returns
- tuple[str]
The names of the extra fields in the order in which they are stored
- classmethod from_hap_spec(line, types=None)
Convert a variant line into a Variant object in the .hap format spec
Note that this implementation does NOT support having more extra fields than appear in the header
- Parameters
- line: str
A variant (V) line from the .hap file
- types: dict[str, type], optional
The order of the extra fields if different from the order in _extras
- Returns
- tuple[str, Variant]
The haplotype ID and Variant object for the variant
-
id:
str
-
start:
int
- to_hap_spec(hap_id)
Convert a Variant object into a variant line in the .hap format spec
- Return type
str- Parameters
- hap_id: str
The ID of the haplotype associated with this variant
- Returns
- str
A valid variant line (V) in the .hap format spec
- types = {'allele': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}
- class haptools.data.haplotypes.classproperty(fget)
Bases:
objectA daad-simple read-only decorator that combines the functionality of @classmethod and @property
Stolen from https://stackoverflow.com/a/13624858/16815703
haptools.data.breakpoints module
- class haptools.data.breakpoints.Breakpoints(fname, log=None)
Bases:
DataA class for processing breakpoints from a file
Examples
>>> breakpoints = Breakpoints.load('tests/data/test.bp')
- Attributes
- datadict[str, SampleBlocks]
The haplotype blocks for each chromosome in each sample This dict maps samples (as strings) to their haplotype blocks (as SampleBlocks)
- fnamePath | str
The path to the file containing the data
- labelsdict | None
A dictionary containing population labels. It maps each label to the unique integers in the “pop” field of
data- log: Logger
A logging instance for recording debug statements.
- encode()
Replace each ancestral label in
datawith an equivalent integer. Store a dictionary mapping these integers back to their respective labels.This method modifies
datain place.- Returns
- dict[int, str]
A dictionary mapping each integer back to its ancestral label
- classmethod load(fname, samples=None)
Load breakpoints from a TSV file
Read the file contents and standardize the Breakpoints
- Return type
- Parameters
- fname
See documentation for
fname- samplesset[str], optional
See documentation for
read()
- Returns
- Breakpoints
A Breakpoints object with the data loaded into its properties
- population_array(variants, samples=None)
Output an array denoting the population labels of each variant for each sample
- Parameters
- variantsnp.array
Variant-level meta information in a mixed np array of dtypes: CHROM (str) and POS (int)
- samplestuple[str], optional
A subset of samples to include in the output, ordered by their given order
- Returns
- read(samples=None)
Read breakpoints from a TSV file into a data structure stored in
data- Parameters
- samplesset[str], optional
A subset of the samples for which to extract breakpoints
Defaults to loading breakpoints for all samples
- Raises
- AssertionError
If the provided file doesn’t follow the expected format
- recode()
Replace each integer in
datawith an equivalent ancestral label. Use the dictionary mapping these integers back to their respective ancestral labels stored inlabels.This method modifies
datain place.
- write()
Write the breakpoints in this class to a file at
fnameExamples
To write to a file, you must first initialize a Breakpoints object and then fill out the names, data, and samples properties: >>> from haptools.data import Breakpoints, HapBlock >>> breakpoints = Breakpoints(‘simple.bp’) >>> breakpoints.data = { >>> ‘HG00096’: [ >>> np.array([(‘YRI’,’chr1’,10114,4.3),(‘CEU’,’chr1’,10116,5.2)], dtype=HapBlock) >>> np.array([(‘CEU’,’chr1’,10114,4.3),(‘YRI’,’chr1’,10116,5.2)], dtype=HapBlock) >>> ], ‘HG00097’: [ >>> np.array([(‘YRI’,’chr1’,10114,4.3),(‘CEU’,’chr2’,10116,5.2)], dtype=HapBlock) >>> np.array([(‘CEU’,’chr1’,10114,4.3),(‘YRI’,’chr2’,10116,5.2)], dtype=HapBlock) >>> ] >>> } >>> breakpoints.write()
haptools.sim_genotype module
- haptools.sim_genotype.get_segment(pop, haplotype, chrom, start_coord, end_coord, end_pos, prev_gen_samples)
Create a segment or segments for an individual of the current generation using either a population label (>0) or the previous generation’s samples if the admix pop type (0) is used.
- Parameters
- pop: int
index of population. Can recover population name from pop_dict
- haplotype: int
index of range [0, len(prev_gen_samples)] to identify the parent haplotype to copy segments from
- chrom: int
chromosome the haplotype segment lies on
- start_coord: int
starting coordinate from where to begin taking segments from previous generation samples
- end_coord: int
ending coordinate of haplotype segment
- end_pos: float
ending coordinate in centimorgans
- prev_gen_samples: list[list[HaplotypeSegment]]
the previous generation simulated used as the parents for the current generation
- Returns
- segments: list[HaplotypeSegment]
A list of HaplotypeSegments storing the population type and end coordinate
- haptools.sim_genotype.output_vcf(breakpoints, chroms, model_file, variant_file, sampleinfo_file, region, pop_field, sample_field, out, log)
Takes in simulated breakpoints and uses reference files, vcf and sampleinfo, to create simulated variants output in file: out + .vcf
- Parameters
- breakpoints: list[list[HaplotypeSegment]]
the simulated breakpoints
- chroms: list[str]
List of chromosomes that were used to simulate
- model_file: str
file with the following structure. (Must be tab delimited)
Header: number of samples, Admixed, {all pop labels}
Below: generation number, frac, frac, frac
For example,
40 Admixed CEU YRI 1 0 0.05 0.95 2 0.20 0.05 0.75
- variant_file: str
file path that contains samples and respective variants. Can be in VCF, BCF, VCF.GZ, or PGEN format.
- sampleinfo_file: str
file path that contains mapping from sample name in vcf to population
- region: dict(str->str/int/int)
Dictionary with the keys “chr”, “start”, and “end” holding chromosome (str), start position (int) and end position (int) allowing the simulation process to only allow variants within that region.
- pop_field: boolean
Flag to determine whether to have the population field in the VCF file output
- sample_field: boolean
Flag to determine whether to have the sample field in the VCF file output
- out: str
output prefix
- log: log object
Outputs messages to the appropriate channel.
- haptools.sim_genotype.simulate_gt(model_file, coords_dir, chroms, region, popsize, log, seed=None)
Simulate admixed genotypes based on the parameters of model_file.
- Parameters
- model_file: str
File with the following structure. (Must be tab delimited)
Header: number of samples, Admixed, {all pop labels}
Below: generation number, frac, frac, frac
For example,
40 Admixed CEU YRI 1 0 0.05 0.95 2 0.20 0.05 0.75
- coords_dir: str
Directory containing files ending in .map with genetic map coords in cM used for recombination points
- chroms: list[str]
List of chromosomes to simulate admixture for.
- region: dict()
Dictionary with the keys “chr”, “start”, and “end” holding chromosome, start position adn end position allowing the simulation process to only within that region.
- popsize: int
Size of population created for each generation.
- log: log object
Outputs messages to the appropriate channel.
- seed: int
Seed used for randomization.
- Returns
- num_samples: int
Total number of samples to output
- pop_dict: dict(int->str)
Dictionary that maps populations from their encoded version as integers to their population name as a string. ex: {1:CEU, 2:YRI}
- next_gen_samples: list[list[HaplotypeSegment]]
Each list is a person containing a variable number of Haplotype Segments based on how many recombination events occurred throughout the generations of ancestors for this person.
- haptools.sim_genotype.start_segment(start, chrom, segments)
Find first segment that is on chrom and its end coordinate is > start via binary search.
- Parameters
- start: int
Coordinate in bp for the start of the segment to output
- chrom: int
Chromosome that the segments lie on.
- segments: list[HaplotypeSegment]
List of the hapltoype segments to search from for a starting point.
- Returns
- mid: int
Index of the first genetic segment to collect for output.
- haptools.sim_genotype.validate_params(model, mapdir, chroms, popsize, invcf, sample_info, region=None, only_bp=False)
- haptools.sim_genotype.write_breakpoints(samples, pop_dict, breakpoints, out, log)
Write out a subsample of breakpoints to out determined by samples.
- Parameters
- samples: int
Number of samples to output
- pop_dict: dict(int->str)
Maps population codes in integers to their names. ex: {1:CEU, 2:YRI}
- breakpoints: list[list[HaplotypeSegment]]
Each list is a person containing a variable number of Haplotype Segments based on how many recombination events occurred throughout the generations of ancestors for this person.
- out: str
output prefix used to output the breakpoint file
- log: log object
Outputs messages to the appropriate channel.
- Returns
- breakpoints: list[list[HaplotypeSegment]]
subsampled breakpoints only containing number of samples
haptools.sim_phenotype module
- class haptools.sim_phenotype.Haplotype(chrom, start, end, id, beta)
Bases:
HaplotypeA haplotype with sufficient fields for simphenotype
Properties and functions are shared with the base Haplotype object, “HaplotypeBase”
-
beta:
float
-
beta:
- class haptools.sim_phenotype.PhenoSimulator(genotypes, output=PosixPath('/dev/stdout'), seed=None, log=None)
Bases:
objectSimulate phenotypes from genotypes
Examples
>>> gens = Genotypes.load("tests/data/example.vcf.gz") >>> haps = Haplotypes.load("tests/data/basic.hap") >>> haps_gts = haps.transform(gens) >>> phenosim = PhenoSimulator(haps_gts) >>> phenosim.run(haps.data.values()) >>> phenotypes = phenosim.phens
- Attributes
- gens: Genotypes
Genotypes to simulate
- phens: Phenotypes
Simulated phenotypes; filled by
run()- rng: np.random.Generator, optional
A numpy random number generator
- log: logging.Logger
A logging instance for recording debug statements
- run(effects, heritability=None, prevalence=None, normalize=True)
Simulate phenotypes for an entry in the Genotypes object
The generated phenotypes will also be added to
output- Parameters
- effects: list[Haplotype]
A list of Haplotypes to use in an additive fashion within the simulations
- heritability: float, optional
The simulated heritability of the trait
If not provided, this will default to the sum of the squared effect sizes
- prevalence: float, optional
How common should the disease be within the population?
If this value is specified, case/control phenotypes will be generated instead of quantitative traits.
- normalize: bool, optional
If True, normalize the genotypes before using them to simulate the phenotypes. Otherwise, use the raw values.
- Returns
- npt.NDArray
The simulated phenotypes, as a np array of shape num_samples x 1
- write()
Write the generated phenotypes to the file specified in
__init__()
- haptools.sim_phenotype.simulate_pt(genotypes, haplotypes, num_replications=1, heritability=None, prevalence=None, normalize=True, region=None, samples=None, haplotype_ids=None, chunk_size=None, seed=None, output=PosixPath('-'), log=None)
Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.
GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec
Note: GENOTYPES must be the output from the the transform subcommand.
- Parameters
- genotypesPath
The path to the transformed genotypes in VCF or PGEN format
- haplotypesPath
The path to the haplotypes in a .hap file
- replicationsint, optional
The number of rounds of simulation to perform
- heritabilityint, optional
The heritability of the simulated trait; must be a float between 0 and 1
If not provided, it will be computed from the sum of the squared effect sizes
- prevalenceint, optional
The prevalence of the disease if the trait should be simulated as case/control; must be a float between 0 and 1
If not provided, a quantitative trait will be simulated, instead
- normalize: bool, optional
If True, normalize the genotypes before using them to simulate the phenotypes. Otherwise, use the raw values.
- regionstr, optional
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the VCF and .hap file must be indexed and the seqname must match!
Defaults to loading all haplotypes
- sampletuple[str], optional
A subset of the samples from which to extract genotypes
Defaults to loading genotypes from all samples
- samples_filePath, optional
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- haplotype_ids: set[str], optional
A list of haplotype IDs to obtain from the .hap file. All others are ignored.
If not provided, all haplotypes will be used.
- chunk_size: int, optional
The max number of variants to fetch from the PGEN file at any given time
If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.
- outputPath, optional
The location to which to write the simulated phenotypes
- loglogging.Logger, optional
The logging module for this task
Examples
>>> haptools simphenotype tests/data/example.vcf.gz tests/data/example.hap.gz > simu_phens.tsv
haptools.karyogram module
This script is inspired by Alicia Martin’s karyogram code originally published here: https://github.com/armartin/ancestry_pipeline/blob/master/plot_karyogram.py
- haptools.karyogram.GetCentromereClipMask(centromeres_file, chrom_order)
Get clipping mask for the centromeres and telomeres
- Parameters
- centromeres_filestr, optional
Path to file with centromere coordinates. Format: chrom, chromstart_cm, centromere_cm, chromend_cm If None, no centromere and telomere locations are shown
- chrom_orderlist[int]
chromosomes in sorted order
- Returns
- clipmask_perchromdict[str, matplotlib.Path]
Clip region for telomeres/centromeres for each chromosome
- haptools.karyogram.GetChrom(chrom)
Extract a numerical chromosome
- Parameters
- chromstr
Chromosome string
- Returns
- chromint
Integer-value for the chromosome X gets set to 23
- haptools.karyogram.GetChromOrder(sample_blocks)
Get a list of chroms in sorted order
- Parameters
- sample_blockslist[list[hap_blocks]]
each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- Returns
- chromslist[int]
list of chromsomes in sorted order
- haptools.karyogram.GetCmRange(sample_blocks)
Get the min and max cM coordinates from the sample_blocks
- Parameters
- sample_blockslist[list[hap_blocks]]
each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- Returns
- min_val, max_valfloat, float
min_val is the minimum coordinate max_val is the maximum coordinate
- haptools.karyogram.GetHaplotypeBlocks(bp_file, sample_name, centromeres_file=None)
Extract haplotype blocks for the desired sample from the bp file
- Parameters
- bp_filestr
Path to .bp file with breakpoints
- sample_namestr
Sample ID to extract
- centromeres_filestr, optional
If not None then use the chromosome ends listed to extend chromosomes to proper end coordinates
- Returns
- sample_blockslist[list[hap_blocks]]
each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- haptools.karyogram.GetPopList(sample_blocks)
Get a list of populations in the sample_blocks
- Parameters
- sample_blockslist[list[hap_blocks]]
each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- Returns
- poplistlist[str]
list of populations represented in the blocks
- haptools.karyogram.PlotHaplotypeBlock(block, hapnum, chrom_order, colors, ax, clipmask_perchrom=None)
Plot a haplotype block on the axis
- Parameters
- blockdict
dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- hapnumint
0 or 1 for the two haplotypes
- chrom_orderlist[int]
chromosomes in sorted order
- colorsdict[str, str], optional
Dictionary of colors to use for each population If not set, reasonable defaults are used. In addition to strings, you can specify RGB or RGBA tuples.
- axmatplotlib axis to use for plotting
- clipmask_perchromdict[str, matplotlib.Path], optional
Clip region for telomeres/centromeres for each chromosome
- haptools.karyogram.PlotKaryogram(bp_file, sample_name, out_file, log, centromeres_file=None, title=None, colors=None)
Plot a karyogram based on breakpoints output by haptools simgenotypes
- Parameters
- bp_filestr
Path to .bp file with breakpoints
- sample_namestr
Sample ID to plot
- out_filestr
Name of output file
- log: log object
Outputs messages to the appropriate channel.
- centromeres_filestr, optional
Path to file with centromere coordinates. Format: chrom, chromstart_cm, centromere_cm, chromend_cm If None, no centromere and telomere locations are shown
- titlestr, optional
Plot title. If None, no title is annotated
- colorsdict(str->str), optional
Dictionary of colors to use for each population If not set, reasonable defaults are used. In addition to strings, you can specify RGB or RGBA tuples.
haptools.transform module
- class haptools.transform.GenotypesAncestry(fname, log=None)
Bases:
GenotypesRefAltExtends the GenotypesRefAlt class for ancestry data
The ancestry information is stored within the FORMAT field of the VCF
- Attributes
- datanp.array
See documentation for
data- fnamePath | str
See documentation for
fname- samplestuple[str]
See documentation for
samples- variantsnp.array
See documentation for
variants- valid_labels: np.array
Reference VCF sample and respective variant grabbed for each sample.
- ancestrynp.array
The ancestral population of each allele in each sample of
data- log: Logger
See documentation for
log
- check_biallelic(discard_also=False)
See documentation for
check_biallelic()
- check_missing(discard_also=False)
See documentation for
check_missing()
- write(chroms=None)
Write the variants in this class to a VCF at
fname
- class haptools.transform.HaplotypeAncestry(chrom, start, end, id, ancestry)
Bases:
HaplotypeA haplotype with an ancestry field for the transform subcommand
Properties and functions are shared with the base “Haplotype” object
-
ancestry:
str
- transform(genotypes)
Transform a genotypes matrix via the current haplotype and its ancestral population
See documentation for
transform()for more details- Return type
ndarray[Any,dtype[bool]]
-
ancestry:
- class haptools.transform.HaplotypesAncestry(fname, haplotype=<class 'haptools.transform.HaplotypeAncestry'>, variant=<class 'haptools.data.haplotypes.Variant'>, log=None)
Bases:
HaplotypesA set of haplotypes with an ancestry field for the transform subcommand
Properties and functions are shared with the base “Haplotypes” object
- transform(gts, hap_gts=None)
Transform a genotypes matrix via the current haplotype
Each entry in the returned matrix denotes the presence of each haplotype in each chromosome of each sample in the Genotypes object
- Parameters
- gtsGenotypesRefAlt
The genotypes which to transform using the current haplotype
- hap_gts: GenotypesRefAlt
An empty GenotypesRefAlt object into which the haplotype genotypes should be stored
- Returns
- GenotypesRefAlt
A Genotypes object composed of haplotypes instead of regular variants.
- haptools.transform.transform_haps(genotypes, haplotypes, region=None, samples=None, haplotype_ids=None, chunk_size=None, discard_missing=False, ancestry=False, output=PosixPath('-'), log=None)
Creates a VCF composed of haplotypes
- Parameters
- genotypesPath
The path to the genotypes in VCF or PGEN format
- haplotypesPath
The path to the haplotypes in a .hap file
- regionstr, optional
See documentation for
read()andread()- sampleslist[str], optional
See documentation for
read()- haplotype_ids: set[str], optional
A set of haplotype IDs to obtain from the .hap file. All others are ignored.
If not provided, all haplotypes will be used.
- chunk_size: int, optional
The max number of variants to fetch from the PGEN file at any given time
If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.
- discard_missingbool, optional
Discard any samples that are missing any of the required genotypes
The default is simply to complain about it
- ancestrybool, optional
Whether to also match ancestral population labels from the VCF against those in the .hap file
- outputPath, optional
The location to which to write output
- logLogger, optional
A logging module to which to write messages about progress and any errors
haptools.ld module
- class haptools.ld.Haplotype(chrom, start, end, id, ld)
Bases:
HaplotypeA haplotype with sufficient fields for the ld command
Properties and functions are shared with the base Haplotype object,
HaplotypeBase-
ld:
float
-
ld:
- haptools.ld.calc_ld(target, genotypes, haplotypes, region=None, samples=None, ids=None, chunk_size=None, discard_missing=False, from_gts=False, output=PosixPath('/dev/stdout'), log=None)
Creates a VCF composed of haplotypes
- Parameters
- targetstr
The ID of the haplotype or variant with which we will calculate LD
- genotypesPath
The path to the genotypes
- haplotypesPath
The path to the haplotypes in a .hap file
- regionstr, optional
See documentation for
read()andread()- sampleslist[str], optional
See documentation for
read()- ids: set[str], optional
A subset of haplotype IDs to obtain from the .hap file. All others are ignored.
Alternatively, if the –from-gts switch is specified, this will be interpreted as a subset of variant IDs to obtain from the genotypes file.
Defaults to loading all haplotypes or variants if not specified
- chunk_size: int, optional
The max number of variants to fetch from the PGEN file at any given time
If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.
- discard_missingbool, optional
Discard any samples that are missing any of the required genotypes
The default is simply to complain about it
- outputPath, optional
The location to which to write output
- logLogger, optional
A logging module to which to write messages about progress and any errors
- haptools.ld.pearson_corr_ld(arrA, arrB)
Compute the Pearson correlation coefficient between two vectors (1D numpy arrays)
- Return type
float- Parameters
- arrA: npt.NDArray
The first 1D numpy array
- arrB: npt.NDArray
The second 1D numpy array
- Returns
- The LD between the genotypes in arrA and the genotypes in arrB
haptools.index module
- haptools.index.append_suffix(path, suffix)
Used as a helper method for index_haps. Appends a given suffix to a Path instance.
- Parameters
- pathPath
The path to a file
- suffixstr
A string to append to the end of the given Path. For example, “.gz” or “.gz.tbi”
- haptools.index.index_haps(haplotypes, sort=False, output=None, log=None)
Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file
- Parameters
- haplotypesPath
The path to the haplotypes in a .hap file
- outputPath, optional
The location to which to write output. If an output location is not specified, the output will have the same name as the input file.
- logLogger, optional
A logging module to which to write messages about progress and any errors