Documentation
Command line interface
haptools
haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information
haptools [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
clump
Performs clumping on datasets with SNPs, SNPs and STRs, and STRs. Clumping is the process of identifying SNPs or STRs that are highly correlated with one another and concatenating them all together into a single “clump” in order to not repeat the same effect size due to LD.
haptools clump [OPTIONS]
Options
- --summstats-snps <summstats_snps>
File to load snps summary statistics
- --summstats-strs <summstats_strs>
File to load strs summary statistics
- --gts-snps <gts_snps>
SNP genotypes (VCF or PGEN)
- --gts-strs <gts_strs>
STR genotypes (VCF)
- --clump-p1 <clump_p1>
Max pval to start a new clump
- --clump-p2 <clump_p2>
Filter for pvalue less than
- --clump-id-field <clump_id_field>
Column header of the variant ID
- --clump-field <clump_field>
Column header of the p-values
- --clump-chrom-field <clump_chrom_field>
Column header of the chromosome
- --clump-pos-field <clump_pos_field>
Column header of the position
- --clump-kb <clump_kb>
clump kb radius
- --clump-r2 <clump_r2>
r^2 threshold
- --ld <ld>
Calculation type to infer LD, Exact Solution or Pearson R. (Exact|Pearson). Note the Exact Solution works best when all three genotypes are present (0,1,2) in the variants being compared.
- Default:
'Pearson'- Options:
Exact | Pearson
- --out <out>
Required Output filename
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
index
Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file
haptools index [OPTIONS] HAPLOTYPES
Options
- --sort, --no-sort
Sorting of the file will not be performed
- Default:
True
- -o, --output <output>
A .hap file containing sorted and indexed haplotypes and variants
- Default:
'input file'
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- HAPLOTYPES
Required argument
karyogram
Visualize a karyogram of local ancestry tracks
haptools karyogram [OPTIONS]
Options
- --bp <bp>
Required Path to .bp file with breakpoints
- --sample <sample>
Required Sample ID to plot
- --out <out>
Required Name of output file
- --title <title>
Optional plot title
- --centromeres <centromeres>
Optional file with telomere/centromere cM positions
- --colors <colors>
Optional color dictionary. Input can be from the matplotlib list of colors or in hexcode. Format is e.g. ‘YRI:blue,CEU:green’
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
ld
Compute the pair-wise LD (Pearson’s correlation) between haplotypes (or variants) and a single TARGET haplotype (or variant)
GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec
TARGET refers to the ID of a variant or haplotype. LD is computed pair-wise between TARGET and all of the other haplotypes in the .hap (or genotype) file
If TARGET is a variant ID, the ID must appear in GENOTYPES. Otherwise, it must be present in the .hap file
haptools ld [OPTIONS] TARGET GENOTYPES HAPLOTYPES
Options
- --region <region>
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files
- Default:
'all haplotypes'
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default:
'all samples'
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default:
'all samples'
- -i, --id <ids>
A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’). Or, if –from-gts, a list of the variant IDs to use from the genotypes file. For this to work, the .hap file must be indexed
- Default:
'all haplotypes'
- -I, --ids-file <ids_file>
A single column txt file containing a list of the haplotype (or variant) IDs (one per line) to subset from the .hap (or genotype) file
- Default:
'all haplotypes'
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default:
'all variants'
- --discard-missing
Ignore any samples that are missing genotypes for the required variants
- Default:
False
- --from-gts
By default, LD is computed with the haplotypes in the .hap file. Use this switch to compute LD with the genotypes in the genotypes file, instead.
- Default:
False
- -o, --output <output>
A .hap file containing haplotypes and their LD with TARGET
- Default:
'stdout'
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- TARGET
Required argument
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument
simgenotype
Simulate admixed genomes under a pre-defined model.
haptools simgenotype [OPTIONS]
Options
- --model <model>
Required Admixture model in .dat format. See File Formats under simgenotype in the docs for complete info.
- --mapdir <mapdir>
Required Directory containing files with chr{1-22,X} and ending in .map in the file name with genetic map coords.
- --out <out>
Required Path to desired output file. E.g. /path/to/output.vcf.gz Possible outputs are vcf|bcf|vcf.gz|pgen and there will be an additional breakpoints output with extension bp e.g. /path/to/output.bp.
- --chroms <chroms>
Sorted and comma delimited list of chromosomes to simulate
- --seed <seed>
Random seed. Set to make simulations reproducible
- --ref_vcf <ref_vcf>
Required VCF or PGEN file used as reference for creation of simulated samples respective genotypes.
- --sample_info <sample_info>
Required File that maps samples from the reference VCF (–invcf) to population codes describing the populations in the header of the model file.
- --region <region>
Subset the simulation to a specific region in a chromosome using the form chrom:start-end. Example 2:1000-2000
- --pop_field
Flag for outputting the population field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.
- --sample_field
Flag for outputting the sample field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.
- --no_replacement
Flag used to determine whether to sample reference haplotypes with or without replacement. (Default = Replacement)
- --only_breakpoint
Flag used to determine whether to only output breakpoints or continue to simulate a vcf file.
- -c, --chunk-size <chunk_size>
If requesting a PGEN output file, write genotypes in chunks of X variants; reduces memory
- Default:
'all variants'
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
simphenotype
Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.
GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec
Note: GENOTYPES must be the output from the transform subcommand.
haptools simphenotype [OPTIONS] GENOTYPES HAPLOTYPES
Options
- -r, --replications <replications>
Number of rounds of simulation to perform
- Default:
1
- --environment <environment>
Variance of environmental term; inferred if not specified
- -h, --heritability <heritability>
Trait heritability
- Default:
'0.5'
- -p, --prevalence <prevalence>
Disease prevalence if simulating a case-control trait
- Default:
'quantitative trait'
- --normalize, --no-normalize
Whether to normalize the genotypes before using them for simulation
- Default:
True
- --region <region>
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files
- Default:
'all haplotypes'
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default:
'all samples'
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default:
'all samples'
- -i, --id <ids>
A list of the haplotype IDs from the .hap file to use as causal variables (ex: ‘-i H1 -i H2’).
- Default:
'all haplotypes'
- -I, --ids-file <ids_file>
A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file
- Default:
'all haplotypes'
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default:
'all variants'
- --repeats <repeats>
Path to a genotypes file containing tandem repeats. This is only necessary when simulating both haplotypes and repeats as causal effects
- --seed <seed>
Use this option across executions to make the output reproducible
- Default:
'chosen randomly'
- -o, --output <output>
A TSV file containing simulated phenotypes
- Default:
'stdout'
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument
transform
Creates a VCF composed of haplotypes
GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec
haptools transform [OPTIONS] GENOTYPES HAPLOTYPES
Options
- --region <region>
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files
- Default:
'all haplotypes'
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default:
'all samples'
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default:
'all samples'
- -i, --id <ids>
A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’).
- Default:
'all haplotypes'
- -I, --ids-file <ids_file>
A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file
- Default:
'all haplotypes'
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default:
'all variants'
- --discard-missing
Ignore any samples that are missing genotypes for the required variants
- Default:
False
- --ancestry
Also transform using VCF ‘POP’ FORMAT field and ‘ancestry’ .hap extra field
- Default:
False
- --maf <maf>
Do not output haplotypes with an MAF below this value
- Default:
'all haplotypes'
- -o, --output <output>
A VCF file containing haplotype ‘genotypes’
- Default:
'stdout'
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument
Module contents
haptools.data.data module
- class haptools.data.data.Data(fname, log=None)
Bases:
ABCAbstract class for accessing read-only data files
- Attributes:
- fnamePath | str
The path to the read-only file containing the data
- datanp.array
The contents of the data file, once loaded
- log: Logger
A logging instance for recording debug statements.
- static hook_compressed(filename, mode)
A utility to help open files regardless of their compression
Based off of python’s fileinput.hook_compressed and copied from https://stackoverflow.com/a/64106815/16815703
- Return type:
IO- Parameters:
- filenamestr
The path to the file
- modestr
Either ‘r’ for read or ‘w’ for write
- Returns:
- IO
The resolved file object
- abstract classmethod load(fname)
Read the file contents and perform any recommended pre-processing
- Parameters:
- fnamePath
See documentation for
fname
- abstract read()
Read the raw file contents into the class properties
- unset()
Whether the data has been loaded into the object yet
- Return type:
bool- Returns:
- bool
True if
datais None else False
haptools.data.genotypes module
- class haptools.data.genotypes.Genotypes(fname, log=None)
Bases:
DataA class for processing genotypes from a file
- Attributes:
- datanpt.NDArray
The genotypes in an n (samples) x p (variants) x 2 (strands) array
- fnamePath | str
The path to the read-only file containing the data
- samplestuple[str]
The names of each of the n samples
- variantsnp.array
- Variant-level meta information:
ID
CHROM
POS
- log: Logger
A logging instance for recording debug statements
- _prephasedbool
If True, assume that the genotypes are phased. Otherwise, extract their phase when reading from the VCF.
- _samp_idxdict[str, int]
Sample index; maps samples to indices in self.samples
- _var_idxdict[str, int]
Variant index; maps variant IDs to indices in self.variants
Examples
>>> genotypes = Genotypes.load('tests/data/simple.vcf') >>> # directly access the loaded variants, samples, and genotypes (in data) >>> genotypes.variants >>> genotypes.samples >>> genotypes.data
- __iter__(region=None, samples=None, variants=None)
Read genotypes from a VCF line by line without storing anything
- check_biallelic(discard_also=False)
Check that each genotype is composed of only two alleles
This function modifies the dtype of
datafrom uint8 to bool- Parameters:
- discard_alsobool, optional
If True, discard any multiallelic variants without raising a ValueError
- Raises:
- ValueError
If any of the genotypes have more than two alleles
- check_maf(threshold=None, discard_also=False, warn_only=False)
Check the minor allele frequency of each variant
Raise a ValueError if any variant’s MAF doesn’t satisfy the threshold, if one is provided :rtype:
ndarray[Any,dtype[float64]]Note
You should call
check_missing()andcheck_biallelic()before executing this method, for best results. Otherwise, the frequencies may be computed incorrectly.- Parameters:
- threshold: float, optional
If a variant has a minor allele frequency (MAF) rarer than this threshold, raise a ValueError
- discard_alsobool, optional
If True, discard any variants that would otherwise cause a ValueError
This parameter will be ignored if a threshold is not specified
- warn_only: bool, optional
Just raise a warning instead of a ValueError
- Returns:
- The minor allele frequency of each variant
- Raises:
- ValueError
If any variant does not meet the provided threshold minor allele frequency
- check_missing(discard_also=False)
Check that each sample is properly genotyped
- Parameters:
- discard_alsobool, optional
If True, discard any samples that are missing genotypes without raising a ValueError
- Raises:
- ValueError
If any of the samples have missing genotypes ‘GT: .|.’
- check_phase()
Check that the genotypes are phased then remove the phasing info from the data
This function modifies
datain-place- Raises:
- ValueError
If any heterozgyous genotpyes are unphased
- check_sorted()
Check that the variant coordinates are sorted
Raise a ValueError if any of variants in any of the chromosomes are unsorted
- Raises:
- ValueError
If any variant position is less than a position that comes before it within the same chromosome
- index(samples=True, variants=True)
Call this function once to improve the amortized time-complexity of look-ups of samples and variants by their ID. This is useful if you intend to later subset by a set of samples or variant IDs. The time complexity of this function should be roughly O(n+p) if both parameters are True. Otherwise, it will be either O(n) or O(p).
- Parameters:
- samples: bool, optional
Whether to index the samples for fast loop-up. Adds complexity O(n).
- variants: bool, optional
Whether to index the variants for fast look-up. Adds complexity O(p).
- Raises:
- ValueError
If any samples or variants appear more than once
- classmethod load(fname, region=None, samples=None, variants=None)
Load genotypes from a VCF file
Read the file contents, check the genotype phase, and create the MAC matrix
- Return type:
- Parameters:
- Returns:
- Genotypes
A Genotypes object with the data loaded into its properties
- classmethod merge_variants(objs, check_samples=True, **kwargs)
Merge genotypes objects with different sets of variants together :rtype:
GenotypesNote
The input genotypes objects are not expected to have any overlapping sets of variants. Also, all samples in the input genotypes must be the same.
- Parameters:
- objs: tuple[Genotypes]
The objects that should be merged together
- check_samples: bool, optional
Whether to check that the set of provided samples is exactly the same for all genotypes. This can take a while so you may want to avoid it
- **kwargs
Any parameters to pass to
_init__()
- Returns:
- Genotypes
A new object containing merged versions of the properties in each object
- Raises:
- ValueError
If the set of samples in each input object is not the same
- read(region=None, samples=None, variants=None, max_variants=None)
Read genotypes from a VCF into a numpy matrix stored in
data- Parameters:
- regionstr, optional
The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the VCF must be indexed and the seqname must match!
Defaults to loading all genotypes
- samplesset[str], optional
A subset of the samples from which to extract genotypes
Note that they are loaded in the same order as in the file
Defaults to loading genotypes from all samples
- variantsset[str], optional
A set of variant IDs for which to extract genotypes
All other variants will be ignored. This may be useful if you’re running out of memory.
- max_variantsint, optional
The maximum mumber of variants to load from the file. Setting this value helps preallocate the arrays, making the process faster and less memory intensive. You should use this option if your processes are frequently “Killed” from memory overuse.
If you don’t know how many variants there are, set this to a large number greater than what you would except. The np array will be resized appropriately. You can also use the bcftools “counts” plugin to obtain the number of expected sites within a region.
Note that this value is ignored if the variants argument is provided.
- Raises:
- ValueError
If the genotypes array is empty
- subset(samples=None, variants=None, inplace=False)
Subset these genotypes to a smaller set of samples or a smaller set of variants
The order of the samples and variants in the subsetted instance will match the order in the provided tuple parameters.
- Parameters:
- samples: tuple[str]
A subset of samples to keep
- variants: tuple[str]
A subset of variant IDs to keep
- inplace: bool, optional
If False, return a new Genotypes object; otherwise, alter the current one
- Returns:
- A new Genotypes object if inplace is set to False, else returns None
- class haptools.data.genotypes.GenotypesPLINK(fname, log=None, chunk_size=None)
Bases:
GenotypesVCFA class for processing genotypes from a PLINK
.pgenfile- Attributes:
- datanpt.NDArray
See documentation for
data- samplestuple[str]
See documentation for
samples- variantsnp.array
See documentation for
variants- log: Logger
See documentation for
log- chunk_size: int, optional
The max number of variants to fetch from and write to the PGEN file at any given time
If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory
Examples
>>> genotypes = GenotypesPLINK.load('tests/data/simple.pgen')
- __iter__(region=None, samples=None, variants=None)
Read genotypes from a PGEN line by line without storing anything
- read(region=None, samples=None, variants=None, max_variants=None)
Read genotypes from a PGEN file into a numpy matrix stored in
data- Parameters:
- regionstr, optional
See documentation for
read- samplesset[str], optional
See documentation for
read- variantsset[str], optional
See documentation for
read- max_variantsint, optional
See documentation for
read
- read_samples(samples=None)
Read sample IDs from a PSAM file into a list stored in
samplesThis method is called automatically by
read()- Parameters:
- samplesset[str], optional
See documentation for
read
- Returns:
- npt.NDArray[np.uint32]
The indices of each of the samples within the PSAM file
- read_variants(region=None, variants=None, max_variants=None)
Read variants from a PVAR file into a numpy array stored in
variantsOne of either variants or max_variants MUST be specified!
This method is called automatically by
read()- Parameters:
- regionstr, optional
See documentation for
read- variantsset[str], optional
See documentation for
read- max_variantsint, optional
See documentation for
read
- Returns:
- npt.NDArray[np.uint32]
The indices of each of the variants within the PVAR file
- write()
Write the variants in this class to PLINK2 files at
fname
- class haptools.data.genotypes.GenotypesPLINKTR(fname, log=None, chunk_size=None, vcftype='auto')
Bases:
GenotypesPLINKA class for processing repeat genotypes from a PLINK
.pgenfile- Attributes:
- datanpt.NDArray
See documentation for
data- samplestuple[str]
See documentation for
samples- variantsnp.array
See documentation for
variants- log: Logger
See documentation for
log- chunk_size: int, optional
See documentation for
chunk_size- vcftype: str, optional
See documentation for
vcftype- Examples
- ——–
- >>> genotypes = GenotypesPLINK.load(‘tests/data/simple.pgen’)
- check_biallelic()
Check that each genotype is composed of only two alleles
This function modifies the dtype of
datafrom uint8 to bool- Parameters:
- discard_alsobool, optional
If True, discard any multiallelic variants without raising a ValueError
- Raises:
- ValueError
If any of the genotypes have more than two alleles
- check_maf()
Check the minor allele frequency of each variant
Raise a ValueError if any variant’s MAF doesn’t satisfy the threshold, if one is provided
Note
You should call
check_missing()andcheck_biallelic()before executing this method, for best results. Otherwise, the frequencies may be computed incorrectly.- Parameters:
- threshold: float, optional
If a variant has a minor allele frequency (MAF) rarer than this threshold, raise a ValueError
- discard_alsobool, optional
If True, discard any variants that would otherwise cause a ValueError
This parameter will be ignored if a threshold is not specified
- warn_only: bool, optional
Just raise a warning instead of a ValueError
- Returns:
- The minor allele frequency of each variant
- Raises:
- ValueError
If any variant does not meet the provided threshold minor allele frequency
- classmethod load(fname, region=None, samples=None, variants=None, vcftype='auto')
Load STR genotypes from a VCF file
Read the file contents, check the genotype phase, and create the MAC matrix
- Return type:
- Parameters:
- Returns:
- Genotypes
A Genotypes object with the data loaded into its properties
- read(region=None, samples=None, variants=None, max_variants=None)
Read genotypes from a PGEN file into a numpy matrix stored in
data- Parameters:
- regionstr, optional
See documentation for
read- samplesset[str], optional
See documentation for
read- variantsset[str], optional
See documentation for
read- max_variantsint, optional
See documentation for
read
- write()
Write the variants in this class to PLINK2 files at
fname
- class haptools.data.genotypes.GenotypesTR(fname, log=None, vcftype='auto')
Bases:
GenotypesA class for processing TR genotypes from a file Unlike the base Genotypes class, this class genotypes will be repeat number in the variants array
- Attributes:
- datanpt.NDArray
See documentation for
data- fnamePath | str
See documentation for
fname- samplestuple[str]
See documentation for
samples- variantsnp.array
- Variant-level meta information:
ID
CHROM
POS
- log: Logger
See documentation for
log- vcftype: str
TR vcf type currently being read. Options are {‘auto’, ‘gangstr’, ‘advntr’, ‘hipstr’, ‘eh’, ‘popstr’}
- check_biallelic()
See documentation for
check_biallelic()
- check_maf()
See documentation for
check_maf()
- classmethod load(fname, region=None, samples=None, variants=None, vcftype='auto')
Load STR genotypes from a VCF file
Read the file contents, check the genotype phase, and create the MAC matrix
- Return type:
- Parameters:
- Returns:
- Genotypes
A Genotypes object with the data loaded into its properties
- class haptools.data.genotypes.GenotypesVCF(fname, log=None)
Bases:
GenotypesA class for processing genotypes from a file Unlike the base Genotypes class, this class also includes REF and ALT alleles as a list of alleles in the variants array
- Attributes:
- datanpt.NDArray
See documentation for
data- fnamePath | str
See documentation for
fname- samplestuple[str]
See documentation for
samples- variantsnp.array
- Variant-level meta information:
ID
CHROM
POS
[REF, ALT1, ALT2, …]
- log: Logger
See documentation for
log
- write()
Write the variants in this class to a VCF at
fname
- class haptools.data.genotypes.TRRecordHarmonizerRegion(vcffile, vcfiter, vcftype='auto')
Bases:
TRRecordHarmonizer- Parameters:
- vcffileVCF
- vcftype{‘auto’, ‘gangstr’, ‘advntr’, ‘hipstr’, ‘eh’, ‘popstr’}, optional
Type of the VCF file. Default=’auto’. If vcftype==’auto’, attempts to infer the type.
- Attributes
- ———-
- vcffileVCF
- vcfiterVCF
Region to grab strs from within the VCF file.
- vcftypeenum
Type of the VCF file. Must be included in VcfTypes
haptools.data.phenotypes module
- class haptools.data.phenotypes.Phenotypes(fname, log=None)
Bases:
DataA class for processing phenotypes from a file
- Attributes:
- datanp.array
The phenotypes in an n (samples) x m (phenotypes) array
- fnamePath | str
The path to the file containing the data
- samplestuple
The names of each of the n samples
- namestuple[str]
The names of the phenotypes
- log: Logger
A logging instance for recording debug statements.
Examples
>>> phenotypes = Phenotypes.load('tests/data/simple.pheno')
- __iter__(samples=None)
Read phenotypes from a pheno line by line without storing anything
- Return type:
Iterable[namedtuple]- Parameters:
- samplesset[str], optional
A subset of the samples from which to extract phenotypes
Defaults to loading phenotypes from all samples
- Returns:
- Iterable[namedtuple]
See documentation for
_iterate()
- append(name, data)
Append a new set of phenotypes to the current set
- Parameters:
- name: str
The name of the new phenotype
- data: npt.NDArray
A 1D np array of the same length as
samples, containing the phenotype values for each sample. Must have the same dtype as
- check_missing(discard_also=False)
Check that each sample has a phenotype value
- Parameters:
- discard_alsobool, optional
If True, discard any samples that are missing phenotypes without raising a ValueError
- Raises:
- ValueError
If any of the samples have missing phenotypes, represented by -9
- index(samples=True, names=True)
Call this function once to improve the amortized time-complexity of look-ups of samples and names by their ID. This is useful if you intend to later subset by a set of samples or name IDs. The time complexity of this function should be roughly O(n+p) if both parameters are True. Otherwise, it will be either O(n) or O(p).
- Parameters:
- samples: bool, optional
Whether to index the samples for fast loop-up. Adds complexity O(n).
- names: bool, optional
Whether to index the names for fast look-up. Adds complexity O(p).
- Raises:
- ValueError
If any samples or names appear more than once
- classmethod load(fname, samples=None)
Load phenotypes from a pheno file
Read the file contents and standardize the phenotypes
- Return type:
- Parameters:
- fname
See documentation for
fname- samplesset[str], optional
See documentation for
read()
- Returns:
- phenotypes
A Phenotypes object with the data loaded into its properties
- read(samples=None)
Read phenotypes from a pheno file into a numpy matrix stored in
data- Parameters:
- samplesset[str], optional
A subset of the samples from which to extract phenotypes
Defaults to loading phenotypes from all samples
- Raises:
- AssertionError
If the provided file doesn’t follow the expected format
- standardize()
Standardize phenotypes so they have a mean of 0 and a stdev of 1
This function modifies
datain-place
- subset(samples=None, names=None, inplace=False)
Subset these phenotypes to a smaller set of samples or a smaller set of names
The order of the samples and names in the subsetted instance will match the order in the provided tuple parameters.
- Parameters:
- samples: tuple[str]
A subset of samples to keep
- names: tuple[str]
A subset of phenotype IDs to keep
- inplace: bool, optional
If False, return a new Phenotypes object; otherwise, alter the current one
- Returns:
- A new Phenotypes object if inplace is set to False, else returns None
- write()
Write the phenotypes in this class to a file at
fnameExamples
To write to a file, you must first initialize a Phenotypes object and then fill out the names, data, and samples properties: >>> phenotypes = Phenotypes(‘tests/data/simple.pheno’) >>> phenotypes.names = (‘height’,) >>> phenotypes.data = np.array([[1, 1, 2]], dtype=’float64’) >>> phenotypes.samples = (‘HG00096’, ‘HG00097’, ‘HG00099’) >>> phenotypes.write()
haptools.data.covariates module
- class haptools.data.covariates.Covariates(fname, log=None)
Bases:
PhenotypesA class for processing covariates from a file
- Attributes:
- datanp.array
The covariates in an n (samples) x m (covariates) array
- fnamePath | str
The path to the read-only file containing the data
- samplestuple[str]
The names of each of the n samples
- namestuple[str]
The names of the covariates
- log: Logger
A logging instance for recording debug statements.
Examples
>>> covariates = Covariates.load('tests/data/simple.covar')
haptools.data.haplotypes module
- class haptools.data.haplotypes.Extra(name, fmt='s', description='')
Bases:
objectAn extra field on a line in the .hap file
- Attributes:
- name: str
The name of the extra field
- fmt: str = “s”
The python fmt string of the field value; indicates how to format the value
- description: str = “”
A description of the extra field
-
description:
str= ''
-
fmt:
str= 's'
- property fmt_str: str
Convert an Extra into a fmt string
- Returns:
- str
A python format string (ex: “{beta:.3f}”)
- classmethod from_hap_spec(line)
Convert an “extra” line in the header of a .hap file into an Extra object
- Return type:
- Parameters:
- line: str
An “extra” field, as it appears declared in the header
- Returns:
- Extra
An Extra object
-
name:
str
- to_hap_spec(line_type_symbol)
Convert an Extra object into a header line in the .hap format spec
- Return type:
str- Parameters:
- hap_id: str
The ID of the haplotype associated with this variant
- Returns:
- str
A valid variant line (V) in the .hap format spec
- class haptools.data.haplotypes.Haplotype(chrom, start, end, id)
Bases:
objectA haplotype within the .hap format spec
In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of extra fields 2) override the _extras property to describe the header declaration
- Attributes:
- chrom: str
The contig to which this haplotype belongs
- start: int
The chromosomal start position of the haplotype
- end: int
The chromosomal end position of the haplotype
- id: str
The haplotype’s unique ID
- variants: tuple[Variant]
The variants in this haplotype
- _extras: tuple[Extra]
Extra fields for the haplotype
Examples
Let’s extend this class and add an extra field called “ancestry”
>>> from dataclasses import dataclass, field >>> @dataclass >>> class CustomHaplotype(Haplotype): ... ancestry: str ... _extras: tuple = field( ... repr=False, ... init=False, ... default = ( ... Extra("ancestry", "s", "Local ancestry"), ... ), ... )
- property ID
Create an alias for the id property
-
chrom:
str
-
end:
int
- classmethod extras_head()
Return the header lines of the extra fields that are supported
- Return type:
set- Returns:
- tuple
The header lines of the extra fields
- classmethod extras_order()
The names of the extra fields in order
- Return type:
tuple[str]- Returns:
- tuple[str]
The names of the extra fields in the order in which they are stored
- classmethod from_hap_spec(line, variants=(), types=None)
Convert a variant line into a Haplotype object in the .hap format spec
Note that this implementation does NOT support having more extra fields than appear in the header
- Return type:
- Parameters:
- line: str
A variant (H) line from the .hap file
- variants: tuple[Variant], optional
The Variants in this haplotype
- types: dict[str, type], optional
The types of each property in the object
- Returns:
- Haplotype
The Haplotype object for the variant
-
id:
str
- sort()
Sorts the variants within this Haplotype instance
-
start:
int
- to_hap_spec()
Convert a Haplotype object into a haplotype line in the .hap format spec
- Return type:
str- Returns:
- str
A valid haplotype line (H) in the .hap format spec
- transform(genotypes)
Transform a genotypes matrix via the current haplotype
Each entry in the returned matrix denotes the presence of the current haplotype in each chromosome of each sample in the Genotypes object
- Return type:
ndarray[Any,dtype[TypeVar(ScalarType, bound=generic, covariant=True)]]- Parameters:
- genotypesGenotypesVCF
The genotypes which to transform using the current haplotype
If the genotypes have not been loaded into the Genotypes object yet, this method will call Genotypes.read(), while loading only the needed variants
- Returns:
- npt.NDArray
A 2D matrix of shape (num_samples, 2) where each entry in the matrix is a bool denoting the presence of the haplotype in one chromosome of a sample
- types = {'chrom': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}
- property varIDs
-
variants:
tuple
- class haptools.data.haplotypes.Haplotypes(fname, haplotype=<class 'haptools.data.haplotypes.Haplotype'>, variant=<class 'haptools.data.haplotypes.Variant'>, repeat=<class 'haptools.data.haplotypes.Repeat'>, log=None)
Bases:
DataA class for processing haplotypes from a file
- Attributes:
- fname: Path | str
The path to the file containing the data
- data: dict[str, Haplotype|Repeat]
A dict of Haplotype/Repeat objects keyed by their IDs
- types: dict
A dict of class names keyed by the symbol denoting their line type
Ex: {‘H’: Haplotype, ‘V’: Variant, ‘R’: Repeat}
- type_ids: dict[str, list]
A dict of class names keyed by the symbol denoting line type. Stores all haplotype and repeat IDs.
Ex: {‘H’: [“H1”, “H2”, “H3”], ‘R’: [“STR_1”, “STR_2”]}
- version: str
A string denoting the current file format version
- log: Logger
A logging instance for recording debug statements.
Examples
Parsing a basic .hap file without any extra fields is simple:
>>> haplotypes = Haplotypes.load('tests/data/basic.hap') >>> haps = haplotypes.data # a dictionary of Haplotype objects
If the .hap file contains extra fields, you’ll need to call the read() method manually. You’ll also need to create Haplotype and Variant subclasses that support the extra fields and then specify the names of the classes when you initialize the Haplotypes object:
>>> haplotypes = Haplotypes('tests/data/simphenotype.hap', HaptoolsHaplotype) >>> haplotypes.read() >>> haps = haplotypes.data # a dictionary of Haplotype objects
- __iter__(region=None, haplotypes=None)
Read haplotypes from a .hap file line by line without storing anything :rtype:
Iterator[Variant|Haplotype]Note
The elements output by
__iter__()are not guaranteed to be in any particular order except that variants of a haplotype will never appear before the haplotype, itself.- Parameters:
- region: str, optional
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the .hap file must be indexed and the seqname must match!
Defaults to loading all haplotypes
- haplotypes: set[str], optional
A list of haplotype IDs corresponding to a subset of the haplotypes to extract
Defaults to loading haplotypes from all samples
- Yields:
- Iterator[Variant|Repeat|Haplotype]
An iterator over each line in the file, where each line is encoded as a Variant, Repeat, or Haplotype containing each of the class properties
Examples
If you’re worried that the contents of the .hap file will be large, you may opt to parse the file line-by-line instead of loading it all into memory at once. In cases like these, you can use the __iter__() method in a for-loop:
>>> haplotypes = Haplotypes('tests/data/basic.hap') >>> for line in haplotypes: ... print(line)
Call the function manually to pass it the region or haplotypes params:
>>> haplotypes = Haplotypes('tests/data/basic.hap.gz') >>> for line in haplotypes.__iter__( ... region='21:26928472-26941960', haplotypes={"chr21.q.3365*1"} ... ): ... print(line)
- check_header(lines, check_version=True, softly=True)
1) Check and parse any metadata and 2) check that any extra fields declared in the .haps file can be handled by the Variant and Haplotype classes provided in __init__()
This function is called automatically by other methods that read .hap files
- Return type:
tuple[dict,dict[str,tuple[str]]]- Parameters:
- lines: list[str]
Header lines from the .hap file. Any lines beginning with # may appear in this list, especially if the file is sorted. So this may include regular comments, too.
- check_version: bool, optional
Whether to also check the version of the file
- softly: bool, optional
If True, then this function will not raise any ValueErrors. Instead, it will only issue warnings via the logging module, which may be ignored.
- Returns:
- tuple[dict, dict[str, tuple[Extra]]]
The metadata for the file, contained within the header lines and encoded as a dictionary where the names are keys and any subsequent fields are values
The second dictionary encodes the set of declared extra field names for each line type
- Raises:
- ValueError
If any of the header lines are not supported
- check_version(version, err_msgr)
Check the observed version string against the current version string of this instance
- Return type:
tuple[int,int,int]- Parameters:
- version: str
The observed version string
- err_msgr: Callable
A function which takes a single parameter (the error message) and errors appropriately
- Returns:
- The parsed, observed version string
- index(force=False)
Reset the type_ids parameter
You should call this method after any changes to the data property
- classmethod load(fname, region=None, haplotypes=None)
Load haplotypes from a .hap file
Read the file contents
- Return type:
- Parameters:
- Returns:
- Haplotypes
A Haplotypes object with the data loaded into its properties
- classmethod merge(objs, **kwargs)
Merge Haplotypes objects with different sets of haplotypes together :rtype:
HaplotypesNote
The input Haplotype instances are expected to have distinct indices and must all be of the same type
- Parameters:
- objs: tuple[Haplotypes]
The objects that should be merged together
- **kwargs
Any parameters to pass to
_init__()
- Returns:
- Haplotypes
A new object containing all of the Haplotypes from each input object
- Raises:
- ValueError
If the set of indices among the haplotypes is not distinct
- read(region=None, haplotypes=None)
Read haplotypes from a .hap file into a list stored in
data- Parameters:
- region: str, optional
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the .hap file must be indexed and the seqname must match!
Defaults to loading all haplotypes
- haplotypes: set[str], optional
A list of haplotype IDs corresponding to a subset of the haplotypes to extract
Defaults to loading haplotypes from all samples
- sort()
Sorts .hap files first by chrom, followed by start, end, and lastly ID
Also sorts the variants within each haplotype
- subset(haplotypes, inplace=False)
Subset these haplotypes to a smaller set of haplotypes
The order of the haplotypes in the subsetted instance will match the order in the provided tuple parameters.
- Parameters:
- haplotypes: tuple[str]
A subset of haplotype IDs to keep
- inplace: bool, optional
If False, return a new Genotypes object; otherwise, alter the current one
- Returns:
- A new Haplotypes object if inplace is set to False, else returns None
- to_str(sort=True)
Create a string representation of this Haplotype
- Return type:
Generator[str,None,None]- Parameters:
- sort: bool, optional
Whether to attempt to sort variant lines by their haplotype IDs
- Yields:
- Generator[str, None, None]
A list of lines (strings) to include in the output
- transform(gts, hap_gts=None)
Transform a genotypes matrix via the current haplotype
Each entry in the returned matrix denotes the presence of each haplotype in each chromosome of each sample in the Genotypes object
- Return type:
- Parameters:
- gtsGenotypesVCF
The genotypes which to transform using the current haplotype
- hap_gts: GenotypesVCF
An empty GenotypesVCF object into which the haplotype genotypes should be stored
- Returns:
- GenotypesVCF
A Genotypes object composed of haplotypes instead of regular variants.
- version = '0.2.0'
- write()
Write the contents of this Haplotypes object to the file at
fnameIf the items in
dataare sorted, then the output should be automatically sorted such that “sort -k1,4” would leave the output unchangedExamples
To write to a .hap file, you must first initialize a Haplotypes object and then fill out the data property:
>>> haplotypes = Haplotypes('tests/data/basic.hap') >>> haplotypes.data = {'H1': Haplotype('chr1', 0, 10, 'H1')} >>> haplotypes.write()
- class haptools.data.haplotypes.Repeat(chrom, start, end, id)
Bases:
objectA tandem repeat within the .hap format spec
In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of extra fields 2) override the _extras property to describe the header declaration
- Attributes:
- chrom: str
The contig to which this tandem repeat belongs
- start: int
The chromosomal start position of the tandem repeat
- end: int
The chromosomal end position of the tandem repeat
- id: str
The tandem repeat’s unique ID
- _extras: tuple[Extra]
Extra fields for the tandem repeat
Examples
Let’s extend this class and add an extra field called “ancestry”
>>> from dataclasses import dataclass, field >>> @dataclass >>> class CustomRepeat(Repeat): ... ancestry: str ... _extras: tuple = field( ... repr=False, ... init=False, ... default = ( ... Extra("ancestry", "s", "Local ancestry"), ... ), ... )
- property ID
Create an alias for the id property
-
chrom:
str
-
end:
int
- classmethod extras_head()
Return the header lines of the extra fields that are supported
- Return type:
set- Returns:
- tuple
The header lines of the extra fields
- classmethod extras_order()
The names of the extra fields in order
- Return type:
tuple[str]- Returns:
- tuple[str]
The names of the extra fields in the order in which they are stored
- classmethod from_hap_spec(line, types=None)
Convert a tandem repeat line into a tandem repeat object in the .hap format spec
Note that this implementation does NOT support having more extra fields than appear in the header
- Return type:
- Parameters:
- line: str
A tandem repeat (R) line from the .hap file
- types: dict[str, type], optional
The types of each property in the object
- Returns:
- Repeat
The repeat object line.
-
id:
str
-
start:
int
- to_hap_spec()
Convert a Repeat object into a repeat line in the .hap format spec
- Return type:
str- Returns:
- str
A valid Repeat line (R) in the .hap format spec
- types = {'chrom': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}
- class haptools.data.haplotypes.Variant(start, end, id, allele)
Bases:
objectA variant within the .hap format spec
In order to use this class with the Haplotypes class, you should 1) add properties to the class for each of extra fields 2) override the _extras property to describe the header declaration
- Attributes:
- start: int
The chromosomal start position of the variant
- end: int
The chromosomal end position of the variant
In most cases this will be the same as the start position
- id: str
The variant’s unique ID
- allele: str
The allele of this variant within the Haplotype
- _extras: tuple[Extra]
Extra fields for the haplotype
Examples
Let’s extend this class and add an extra field called “score”
>>> from dataclasses import dataclass, field >>> @dataclass >>> class CustomVariant(Variant): ... score: float ... _extras: tuple = field( ... repr=False, ... init=False, ... default = ( ... Extra("score", ".3f", "Importance of inclusion"), ... ), ... )
- property ID
Create an alias for the id property
-
allele:
str
-
end:
int
- classmethod extras_head()
Return the header lines of the extra fields that are supported
- Return type:
set- Returns:
- tuple
The header lines of the extra fields
- classmethod extras_order()
The names of the extra fields in order
- Return type:
tuple[str]- Returns:
- tuple[str]
The names of the extra fields in the order in which they are stored
- classmethod from_hap_spec(line, types=None)
Convert a variant line into a Variant object in the .hap format spec
Note that this implementation does NOT support having more extra fields than appear in the header
- Return type:
tuple[str,Variant]- Parameters:
- line: str
A variant (V) line from the .hap file
- types: dict[str, type], optional
The order of the extra fields if different from the order in _extras
- Returns:
- tuple[str, Variant]
The haplotype ID and Variant object for the variant
-
id:
str
-
start:
int
- to_hap_spec(hap_id)
Convert a Variant object into a variant line in the .hap format spec
- Return type:
str- Parameters:
- hap_id: str
The ID of the haplotype associated with this variant
- Returns:
- str
A valid variant line (V) in the .hap format spec
- types = {'allele': <class 'str'>, 'end': <class 'int'>, 'id': <class 'str'>, 'start': <class 'int'>}
- class haptools.data.haplotypes.classproperty(fget)
Bases:
objectA dead-simple read-only decorator that combines the functionality of @classmethod and @property
Stolen from https://stackoverflow.com/a/13624858/16815703
haptools.data.breakpoints module
- class haptools.data.breakpoints.Breakpoints(fname, log=None)
Bases:
DataA class for processing breakpoints from a file
- Attributes:
- datadict[str, SampleBlocks]
The haplotype blocks for each chromosome in each sample This dict maps samples (as strings) to their haplotype blocks (as SampleBlocks)
- fnamePath | str
The path to the file containing the data
- labelsdict | None
A dictionary containing population labels. It maps each label to the unique integers in the “pop” field of
data- log: Logger
A logging instance for recording debug statements.
Examples
>>> breakpoints = Breakpoints.load('tests/data/test.bp')
- encode(labels=None)
Replace each ancestral label in
datawith an equivalent integer. Store a dictionary mapping these integers back to their respective labels inlabels.This method modifies
datain place.- Parameters:
- labels: tuple[str], optional
A list of population labels. The order of the labels in this list will be kept in the respective labels.
- classmethod load(fname, samples=None)
Load breakpoints from a TSV file
Read the file contents and standardize the Breakpoints
- Return type:
- Parameters:
- fname
See documentation for
fname- samplesset[str], optional
See documentation for
read()
- Returns:
- Breakpoints
A Breakpoints object with the data loaded into its properties
- population_array(variants, samples=None)
Output an array denoting the population labels of each variant for each sample
- Return type:
ndarray[Any,dtype[TypeVar(ScalarType, bound=generic, covariant=True)]]- Parameters:
- variantsnp.array
Variant-level meta information in a mixed np array of dtypes: CHROM (str) and POS (int)
- samplestuple[str], optional
A subset of samples to include in the output, ordered by their given order
- Returns:
- read(samples=None)
Read breakpoints from a TSV file into a data structure stored in
data- Parameters:
- samplesset[str], optional
A subset of the samples for which to extract breakpoints
Defaults to loading breakpoints for all samples
- Raises:
- AssertionError
If the provided file doesn’t follow the expected format
- recode()
Replace each integer in
datawith an equivalent ancestral label. Use the dictionary mapping these integers back to their respective ancestral labels stored inlabels.This method modifies
datain place.
- write()
Write the breakpoints in this class to a file at
fnameExamples
To write to a file, you must first initialize a Breakpoints object and then fill out the names, data, and samples properties:
>>> from haptools.data import Breakpoints, HapBlock >>> breakpoints = Breakpoints('simple.bp') >>> breakpoints.data = { >>> 'HG00096': [ >>> np.array([('YRI','chr1',10114,4.3),('CEU','chr1',10116,5.2)], dtype=HapBlock) >>> np.array([('CEU','chr1',10114,4.3),('YRI','chr1',10116,5.2)], dtype=HapBlock) >>> ], 'HG00097': [ >>> np.array([('YRI','chr1',10114,4.3),('CEU','chr2',10116,5.2)], dtype=HapBlock) >>> np.array([('CEU','chr1',10114,4.3),('YRI','chr2',10116,5.2)], dtype=HapBlock) >>> ] >>> } >>> breakpoints.write()
haptools.sim_genotype module
- haptools.sim_genotype.get_segment(pop, haplotype, chrom, start_coord, end_coord, end_pos, prev_gen_samples)
Create a segment or segments for an individual of the current generation using either a population label (>0) or the previous generation’s samples if the admix pop type (0) is used.
- Parameters:
- pop: int
index of population. Can recover population name from pop_dict
- haplotype: int
index of range [0, len(prev_gen_samples)] to identify the parent haplotype to copy segments from
- chrom: int
chromosome the haplotype segment lies on
- start_coord: int
starting coordinate from where to begin taking segments from previous generation samples
- end_coord: int
ending coordinate of haplotype segment
- end_pos: float
ending coordinate in centimorgans
- prev_gen_samples: list[list[HaplotypeSegment]]
the previous generation simulated used as the parents for the current generation
- Returns:
- segments: list[HaplotypeSegment]
A list of HaplotypeSegments storing the population type and end coordinate
- haptools.sim_genotype.output_vcf(breakpoints, chroms, model_file, variant_file, sampleinfo_file, region, pop_field, sample_field, no_replacement, out, log, chunk_size=None)
Takes in simulated breakpoints and uses reference files, vcf and sampleinfo, to create simulated variants output in file: out + .vcf
- Parameters:
- breakpoints: list[list[HaplotypeSegment]]
the simulated breakpoints
- chroms: list[str]
List of chromosomes that were used to simulate
- model_file: str
file with the following structure. (Must be tab delimited)
Header: number of samples, Admixed, {all pop labels}
Below: generation number, frac, frac, frac
For example,
40 Admixed CEU YRI 1 0 0.05 0.95 2 0.20 0.05 0.75
- variant_file: str
file path that contains samples and respective variants. Can be in VCF, BCF, VCF.GZ, or PGEN format.
- sampleinfo_file: str
file path that contains mapping from sample name in vcf to population
- region: dict(str->str/int/int)
Dictionary with the keys “chr”, “start”, and “end” holding chromosome (str), start position (int) and end position (int) allowing the simulation process to only allow variants within that region.
- pop_field: boolean
Flag to determine whether to have the population field in the VCF file output
- sample_field: boolean
Flag to determine whether to have the sample field in the VCF file output
- no_replacement: boolean
Flag to determine whether we sample from the reference VCF with or without replacement. When True there will be no replacement.
- out: str
output prefix
- log: log object
Outputs messages to the appropriate channel.
- chunk_size: int, optional
The max number of variants to write to a PGEN file together
- haptools.sim_genotype.simulate_gt(model_file, coords_dir, chroms, region, popsize, log, seed=None)
Simulate admixed genotypes based on the parameters of model_file.
- Parameters:
- model_file: str
File with the following structure. (Must be tab delimited)
Header: number of samples, Admixed, {all pop labels}
Below: generation number, frac, frac, frac
For example,
40 Admixed CEU YRI 1 0 0.05 0.95 2 0.20 0.05 0.75
- coords_dir: str
Directory containing files ending in .map with genetic map coords in cM used for recombination points
- chroms: list[str]
List of chromosomes to simulate admixture for.
- region: dict()
Dictionary with the keys “chr”, “start”, and “end” holding chromosome, start position adn end position allowing the simulation process to only within that region.
- popsize: int
Size of population created for each generation.
- log: log object
Outputs messages to the appropriate channel.
- seed: int
Seed used for randomization.
- Returns:
- num_samples: int
Total number of samples to output
- pop_dict: dict(int->str)
Dictionary that maps populations from their encoded version as integers to their population name as a string. ex: {1:CEU, 2:YRI}
- next_gen_samples: list[list[HaplotypeSegment]]
Each list is a person containing a variable number of Haplotype Segments based on how many recombination events occurred throughout the generations of ancestors for this person.
- haptools.sim_genotype.start_segment(start, chrom, segments)
Find first segment that is on chrom and its end coordinate is > start via binary search.
- Parameters:
- start: int
Coordinate in bp for the start of the segment to output
- chrom: int
Chromosome that the segments lie on.
- segments: list[HaplotypeSegment]
List of the hapltoype segments to search from for a starting point.
- Returns:
- mid: int
Index of the first genetic segment to collect for output.
- haptools.sim_genotype.validate_params(model, mapdir, chroms, popsize, invcf, sample_info, no_replacement, region=None, only_bp=False)
- haptools.sim_genotype.write_breakpoints(samples, pop_dict, breakpoints, out, log)
Write out a subsample of breakpoints to out determined by samples.
- Parameters:
- samples: int
Number of samples to output
- pop_dict: dict(int->str)
Maps population codes in integers to their names. ex: {1:CEU, 2:YRI}
- breakpoints: list[list[HaplotypeSegment]]
Each list is a person containing a variable number of Haplotype Segments based on how many recombination events occurred throughout the generations of ancestors for this person.
- out: str
output prefix used to output the breakpoint file
- log: log object
Outputs messages to the appropriate channel.
- Returns:
- breakpoints: list[list[HaplotypeSegment]]
subsampled breakpoints only containing number of samples
haptools.sim_phenotype module
- class haptools.sim_phenotype.Effect(id, beta)
Bases:
objectA variable in the simphenotype linear model
- Attributes:
- idstr
The ID of the variable; corresponds to a variant in a Genotypes object
- betafloat
The effect size of the variable
-
beta:
float
- classmethod from_hap_spec(line)
Convert a .snplist line into an Effect object
- Return type:
- Parameters:
- line: str
A line from a .snplist file
- Returns:
- Effect
The converted Effect instance
-
id:
str
- class haptools.sim_phenotype.Haplotype(chrom, start, end, id, beta)
Bases:
HaplotypeA haplotype with sufficient fields for simphenotype
Properties and functions are shared with the base Haplotype object, “HaplotypeBase”
-
beta:
float
-
beta:
- class haptools.sim_phenotype.PhenoSimulator(genotypes, output=PosixPath('/dev/stdout'), seed=None, log=None)
Bases:
objectSimulate phenotypes from genotypes
- Attributes:
- gens: dict[Genotypes]
Genotypes to simulate
- phens: Phenotypes
Simulated phenotypes; filled by
run()- rng: np.random.Generator, optional
A numpy random number generator
- log: logging.Logger
A logging instance for recording debug statements
Examples
>>> gens = Genotypes.load("tests/data/example.vcf.gz") >>> tr_gens = GenotypesTR.load("tests/data/simple_tr.vcf") >>> haps = Haplotypes.load("tests/data/basic.hap") >>> haps_gts = haps.transform(gens) >>> phenosim = PhenoSimulator(haps_gts) >>> phenosim.run(haps.data.values()) >>> phenotypes = phenosim.phens
- normalize_gts(gts, ids)
Normalize variant or repeats genotypes
- Parameters:
- gts: np.array
Genotypes variant array stored in data.
- ids: list[str]
IDs for variants
- Returns:
- normalized_gts: np.array
Normalized Genotypes variant array.
- run(effects, heritability=None, prevalence=None, normalize=True, environment=None)
Simulate phenotypes for an entry in the Genotypes object
The generated phenotypes will also be added to
output- Return type:
ndarray[Any,dtype[TypeVar(ScalarType, bound=generic, covariant=True)]]- Parameters:
- effects: list[Effect|Haplotype|Repeat]
A list of Haplotypes to use in an additive fashion within the simulations
- heritability: float, optional
The simulated heritability of the trait
If not provided, this will default to the sum of the squared effect sizes
- prevalence: float, optional
How common should the disease be within the population?
If this value is specified, case/control phenotypes will be generated instead of quantitative traits.
- normalize: bool, optional
If True, normalize the genotypes before using them to simulate the phenotypes. Otherwise, use the raw values.
- environment: float, optional
The variance (aka strength) of the environmental contribution to the trait. This is inferred from the betas if it isn’t specified.
- Returns:
- npt.NDArray
The simulated phenotypes, as a np array of shape num_samples x 1
- write()
Write the generated phenotypes to the file specified in
__init__()
- class haptools.sim_phenotype.Repeat(chrom, start, end, id, beta)
Bases:
RepeatA repeat with sufficient fields for simphenotype
Properties and functions are shared with the base Repeat object, “RepeatBeta”
-
beta:
float
-
beta:
- haptools.sim_phenotype.simulate_pt(genotypes, haplotypes, num_replications=1, environment=None, heritability=None, prevalence=None, normalize=True, region=None, samples=None, haplotype_ids=None, chunk_size=None, repeats=None, seed=None, output=PosixPath('-'), log=None)
Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.
GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec
Note: GENOTYPES must be the output from the the transform subcommand.
- Parameters:
- genotypesPath
The path to the transformed genotypes in VCF or PGEN format
- haplotypesPath
The path to the haplotypes in a .hap file
- replicationsint, optional
The number of rounds of simulation to perform
- heritabilityint, optional
The heritability of the simulated trait; must be a float between 0 and 1
If not provided, it will be computed from the sum of the squared effect sizes
- prevalenceint, optional
The prevalence of the disease if the trait should be simulated as case/control; must be a float between 0 and 1
If not provided, a quantitative trait will be simulated, instead
- normalize: bool, optional
If True, normalize the genotypes before using them to simulate the phenotypes. Otherwise, use the raw values.
- regionstr, optional
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the VCF and .hap file must be indexed and the seqname must match!
Defaults to loading all haplotypes
- samplesset[str], optional
A subset of the samples from which to extract genotypes
Defaults to loading genotypes from all samples
- haplotype_ids: set[str], optional
A list of haplotype IDs to obtain from the .hap file. All others are ignored.
If not provided, all haplotypes will be used.
- chunk_size: int, optional
The max number of variants to fetch from the PGEN file at any given time
If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.
- repeats: Path, optional
The path to a genotypes file containing tandem repeats. This is only necessary when simulating both haplotypes and repeats as causal effects
- environment: float, optional
The variance (aka strength) of the environmental term. This will be inferred if it isn’t specified.
- seed: int, optional
Seed for random processes
- outputPath, optional
The location to which to write the simulated phenotypes
- loglogging.Logger, optional
The logging module for this task
Examples
>>> haptools simphenotype tests/data/example.vcf.gz tests/data/example.hap.gz > simu_phens.tsv
haptools.karyogram module
This script is inspired by Alicia Martin’s karyogram code originally published here: https://github.com/armartin/ancestry_pipeline/blob/master/plot_karyogram.py
- haptools.karyogram.GetCentromereClipMask(centromeres_file, chrom_order)
Get clipping mask for the centromeres and telomeres
- Parameters:
- centromeres_filestr, optional
Path to file with centromere coordinates. Format: chrom, chromstart_cm, centromere_cm, chromend_cm If None, no centromere and telomere locations are shown
- chrom_orderlist[int]
chromosomes in sorted order
- Returns:
- clipmask_perchromdict[str, matplotlib.Path]
Clip region for telomeres/centromeres for each chromosome
- haptools.karyogram.GetChrom(chrom)
Extract a numerical chromosome
- Parameters:
- chromstr
Chromosome string
- Returns:
- chromint
Integer-value for the chromosome X gets set to 23
- haptools.karyogram.GetChromOrder(sample_blocks)
Get a list of chroms in sorted order
- Parameters:
- sample_blockslist[list[hap_blocks]]
each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- Returns:
- chromslist[int]
list of chromsomes in sorted order
- haptools.karyogram.GetCmRange(sample_blocks)
Get the min and max cM coordinates from the sample_blocks
- Parameters:
- sample_blockslist[list[hap_blocks]]
each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- Returns:
- min_val, max_valfloat, float
min_val is the minimum coordinate max_val is the maximum coordinate
- haptools.karyogram.GetHaplotypeBlocks(bp_file, sample_name, centromeres_file=None)
Extract haplotype blocks for the desired sample from the bp file
- Parameters:
- bp_filestr
Path to .bp file with breakpoints
- sample_namestr
Sample ID to extract
- centromeres_filestr, optional
If not None then use the chromosome ends listed to extend chromosomes to proper end coordinates
- Returns:
- sample_blockslist[list[hap_blocks]]
each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- haptools.karyogram.GetPopList(sample_blocks)
Get a list of populations in the sample_blocks
- Parameters:
- sample_blockslist[list[hap_blocks]]
each hap_block is a dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- Returns:
- poplistlist[str]
list of populations represented in the blocks
- haptools.karyogram.PlotHaplotypeBlock(block, hapnum, chrom_order, colors, ax, clipmask_perchrom=None)
Plot a haplotype block on the axis
- Parameters:
- blockdict
dictionary with keys ‘pop’, ‘chrom’, ‘start’, ‘end’
- hapnumint
0 or 1 for the two haplotypes
- chrom_orderlist[int]
chromosomes in sorted order
- colorsdict[str, str], optional
Dictionary of colors to use for each population If not set, reasonable defaults are used. In addition to strings, you can specify RGB or RGBA tuples.
- axmatplotlib axis to use for plotting
- clipmask_perchromdict[str, matplotlib.Path], optional
Clip region for telomeres/centromeres for each chromosome
- haptools.karyogram.PlotKaryogram(bp_file, sample_name, out_file, log, centromeres_file=None, title=None, colors=None)
Plot a karyogram based on breakpoints output by haptools simgenotypes
- Parameters:
- bp_filestr
Path to .bp file with breakpoints
- sample_namestr
Sample ID to plot
- out_filestr
Name of output file
- log: log object
Outputs messages to the appropriate channel.
- centromeres_filestr, optional
Path to file with centromere coordinates. Format: chrom, chromstart_cm, centromere_cm, chromend_cm If None, no centromere and telomere locations are shown
- titlestr, optional
Plot title. If None, no title is annotated
- colorsdict(str->str), optional
Dictionary of colors to use for each population If not set, reasonable defaults are used. In addition to strings, you can specify RGB or RGBA tuples.
haptools.transform module
- class haptools.transform.GenotypesAncestry(fname, log=None)
Bases:
GenotypesVCFExtends the GenotypesVCF class for ancestry data
The ancestry information is stored within the FORMAT field of the VCF
- Attributes:
- datanp.array
See documentation for
data- fnamePath | str
See documentation for
fname- samplestuple[str]
See documentation for
samples- variantsnp.array
See documentation for
variants- valid_labels: np.array
Reference VCF sample and respective variant grabbed for each sample.
- ancestrynp.array
The ancestral population of each allele in each sample of
data- log: logging.Logger
See documentation for
log
- check_biallelic(discard_also=False)
See documentation for
check_biallelic()
- check_missing(discard_also=False)
See documentation for
check_missing()
- merge_variants(objs, check_samples=True, **kwargs)
See documentation for
merge_variants()- Return type:
- write(chroms=None)
Write the variants in this class to a VCF at
fname
- class haptools.transform.HaplotypeAncestry(chrom, start, end, id, ancestry)
Bases:
HaplotypeA haplotype with an ancestry field for the transform subcommand
Properties and functions are shared with the base “Haplotype” object
-
ancestry:
str
- transform(genotypes)
Transform a genotypes matrix via the current haplotype and its ancestral population
See documentation for
transform()for more details- Return type:
ndarray[Any,dtype[TypeVar(ScalarType, bound=generic, covariant=True)]]
-
ancestry:
- class haptools.transform.HaplotypesAncestry(fname, haplotype=<class 'haptools.transform.HaplotypeAncestry'>, variant=<class 'haptools.data.haplotypes.Variant'>, log=None)
Bases:
HaplotypesA set of haplotypes with an ancestry field for the transform subcommand
Properties and functions are shared with the base “Haplotypes” object
- transform(gts, hap_gts=None)
Transform a genotypes matrix via the current haplotype
Each entry in the returned matrix denotes the presence of each haplotype in each chromosome of each sample in the Genotypes object
- Parameters:
- gtsGenotypesVCF
The genotypes which to transform using the current haplotype
- hap_gts: GenotypesVCF
An empty GenotypesVCF object into which the haplotype genotypes should be stored
- Returns:
- GenotypesVCF
A Genotypes object composed of haplotypes instead of regular variants.
- haptools.transform.transform_haps(genotypes, haplotypes, region=None, samples=None, haplotype_ids=None, chunk_size=None, discard_missing=False, ancestry=False, maf=None, output=PosixPath('-'), log=None)
Creates a VCF composed of haplotypes
- Parameters:
- genotypesPath
The path to the genotypes in VCF or PGEN format
- haplotypesPath
The path to the haplotypes in a .hap file
- regionstr, optional
See documentation for
read()andread()- samplesset[str], optional
See documentation for
read()- haplotype_ids: set[str], optional
A set of haplotype IDs to obtain from the .hap file. All others are ignored.
If not provided, all haplotypes will be used.
- chunk_size: int, optional
The max number of variants to fetch from the PGEN file at any given time
If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.
- discard_missingbool, optional
Discard any samples that are missing any of the required genotypes
The default is simply to complain about it
- ancestrybool, optional
Whether to also match ancestral population labels from the VCF against those in the .hap file
- maffloat, optional
If specified, only haplotypes with an MAF above this value will be output
- outputPath, optional
The location to which to write output
- logLogger, optional
A logging module to which to write messages about progress and any errors
haptools.ld module
- class haptools.ld.Haplotype(chrom, start, end, id, ld)
Bases:
HaplotypeA haplotype with sufficient fields for the ld command
Properties and functions are shared with the base Haplotype object,
HaplotypeBase-
ld:
float
-
ld:
- haptools.ld.calc_ld(target, genotypes, haplotypes, region=None, samples=None, ids=None, chunk_size=None, discard_missing=False, from_gts=False, output=PosixPath('/dev/stdout'), log=None)
Creates a VCF composed of haplotypes
- Parameters:
- targetstr
The ID of the haplotype or variant with which we will calculate LD
- genotypesPath
The path to the genotypes
- haplotypesPath
The path to the haplotypes in a .hap file
- regionstr, optional
See documentation for
read()andread()- samplesset[str], optional
See documentation for
read()- ids: set[str], optional
A subset of haplotype IDs to obtain from the .hap file. All others are ignored.
Alternatively, if the –from-gts switch is specified, this will be interpreted as a subset of variant IDs to obtain from the genotypes file.
Defaults to loading all haplotypes or variants if not specified
- chunk_size: int, optional
The max number of variants to fetch from the PGEN file at any given time
If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory. This argument is ignored if the genotypes are not in PGEN format.
- discard_missingbool, optional
Discard any samples that are missing any of the required genotypes
The default is simply to complain about it
- outputPath, optional
The location to which to write output
- logLogger, optional
A logging module to which to write messages about progress and any errors
- haptools.ld.pearson_corr_ld(arrA, arrB)
Compute the Pearson correlation coefficient (LD) between two vectors (1D arrays)
If 2D array(s) are given instead, the rows will be treated as samples and the columns as variants
- Return type:
float|ndarray[Any,dtype[TypeVar(ScalarType, bound=generic, covariant=True)]]- Parameters:
- arrA: npt.NDArray
The first 1D (or 2D) numpy array
- arrB: npt.NDArray
The second 1D (or 2D) numpy array
- Returns:
- The signed LD between the genotypes in arrA and the genotypes in arrB
- If either arrA or arrB is 2D, then the return value will be an array instead. If
- both are 2D, the shape of the array will correspond with the second dimensions of
- arrA and arrB, respectively. Otherwise, it will be 1D.
haptools.index module
- haptools.index.append_suffix(path, suffix)
Used as a helper method for index_haps. Appends a given suffix to a Path instance.
- Parameters:
- pathPath
The path to a file
- suffixstr
A string to append to the end of the given Path. For example, “.gz” or “.gz.tbi”
- haptools.index.index_haps(haplotypes, sort=False, output=None, log=None)
Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file
- Parameters:
- haplotypesPath
The path to the haplotypes in a .hap file
- outputPath, optional
The location to which to write output. If an output location is not specified, the output will have the same name as the input file.
- logLogger, optional
A logging module to which to write messages about progress and any errors
haptools.clump module
- haptools.clump.ComputeExactLD(candidate_gt, index_gt, log)
Compute exact solution of haplotype frequencies to calculate r squared value. NOTE currently this approach only works for biallelic variants since having more variants causes the equation we’re solving for to be a cubic but instead to the degree of n where n is the total number of alelles which also invalidates STRs.
- Parameters:
- candidate_gt: np.array
array of size (genotypes,) where genotypes is the number of samples
- index_gt: np.array
array of size (genotypes,) where genotypes is the number of samples
- log: Logger
A logging instance for recording debug statements.
- Returns:
- r_squared: float
R squared value inferred from ML solution.
- haptools.clump.ComputeLD(candidate_gt, index_gt, LD_type, log)
Compute the LD between two variants
- haptools.clump.GetOverlappingSamples(snpgts, strgts)
Get indices of overlapping samples for snps and strs
- Parameters:
- snpgts: GenotypesVCF
SNP Genotypes object
- strgts: GenotypesTR
STR Genotypes object
- Returns:
- snp_samples: list(int)
Indices of overlapping samples for snps
- str_samples: list(int)
Indices of overlapping samples for strs
- haptools.clump.LoadVariant(var, gts, log)
Extract vector of genotypes for this variant
- class haptools.clump.SummaryStats(log=None)
Bases:
objectLoad and process summary statistics
- Attributes:
- summstats: list[Variant]
list of Variant objects that represent all summary statistics in the file.
- log: Logger
A logging instance for recording debug statements.
Examples
Loading a summary stats file, grabbing an index variant, and its candidate variants to calculate LD.
>>> summstats = SummaryStats(log) >>> summstats.Load(summstats_strs, vartype="SNP", pthresh=clump_p2, id_field=clump_id_field, p_field=clump_field, chrom_field=clump_chrom_field, pos_field=clump_pos_field) >>> indexvar = summstats.GetNextIndexVariant(clump_p1) >>> candidates = summstats.QueryWindow(indexvar, clump_kb)
- GetNextIndexVariant(index_pval_thresh)
Get the next index variant, which is the variant with the best p-value If no more variants below the clump-p1 threshold, return None Not yet implemented
- Load(statsfile, vartype='SNP', pthresh=1.0, id_field='SNP', p_field='P', chrom_field='CHR', pos_field='POS')
Load summary statistics Ignore variants with pval < pthresh Not yet implemented
- QueryWindow(indexvar, window_kb)
Find all candidate variants in the specified window around the index variant Not yet implemented
- RemoveClump(clumpvars)
Remove the variants from a clump from further consideration
- class haptools.clump.Variant(varid, chrom, pos, pval, vartype)
Bases:
object
- haptools.clump.WriteClump(indexvar, clumped_vars, outf)
Write a clump to the output file Not yet implemented
- haptools.clump.clumpstr(summstats_snps, summstats_strs, gts_snps, gts_strs, clump_p1, clump_p2, clump_id_field, clump_field, clump_chrom_field, clump_pos_field, clump_kb, clump_r2, LD_type, out, log)