transform
Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.
The transform command takes as input a set of phased genotypes and a list of haplotypes and outputs a set of haplotype pseudo-genotypes, where each haplotype is encoded as a bi-allelic variant record in the output. In other words, each sample will have a genotype of 0|0, 1|0, 0|1, or 1|1 indicating whether each of their two chromosome copies contains the alleles of a haplotype.
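The encoding described above can be illustrated with a minimal pure-Python sketch. It is not the haptools implementation: the function name, the dictionary layout, and the toy variant and sample names are all hypothetical, chosen only to show how each chromosome copy is checked against a haplotype's alleles.

```python
def transform(genotypes, haplotypes):
    """Return {haplotype ID: {sample: "a|b"}} pseudo-genotypes.

    genotypes:  {sample: (copy0, copy1)}, each copy a {variant ID: allele} dict
    haplotypes: {haplotype ID: {variant ID: allele}}
    (hypothetical layout, for illustration only)
    """
    result = {}
    for hap_id, hap_alleles in haplotypes.items():
        result[hap_id] = {}
        for sample, copies in genotypes.items():
            # A chromosome copy carries the haplotype only if it matches
            # every allele that the haplotype specifies.
            calls = [
                int(all(copy[var] == allele for var, allele in hap_alleles.items()))
                for copy in copies
            ]
            result[hap_id][sample] = f"{calls[0]}|{calls[1]}"
    return result


# Hypothetical toy data: one haplotype over two variants, three samples
haplotypes = {"H1": {"rs1": "A", "rs2": "T"}}
genotypes = {
    "sample1": ({"rs1": "A", "rs2": "T"}, {"rs1": "G", "rs2": "T"}),
    "sample2": ({"rs1": "G", "rs2": "C"}, {"rs1": "A", "rs2": "T"}),
    "sample3": ({"rs1": "A", "rs2": "T"}, {"rs1": "A", "rs2": "T"}),
}
pseudo = transform(genotypes, haplotypes)
print(pseudo["H1"])
# {'sample1': '1|0', 'sample2': '0|1', 'sample3': '1|1'}
```

Each haplotype becomes one output record, and each sample's pseudo-genotype records which of its two copies carries the full set of haplotype alleles.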
Users may also specify an ancestral population label for each haplotype. See the ancestry section for more details.
Usage
haptools transform \
--region TEXT \
--sample SAMPLE --sample SAMPLE \
--samples-file FILENAME \
--id ID --id ID \
--ids-file FILENAME \
--chunk-size INT \
--discard-missing \
--ancestry \
--output PATH \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GENOTYPES HAPLOTYPES
Input
Genotypes must be specified in VCF and haplotypes must be specified in the .hap file format.
Alternatively, you may specify genotypes in PLINK2 PGEN format. Just use the appropriate “.pgen” file extension in the input. See the documentation for genotypes in the format docs for more information.
Ancestry
If your .hap file contains an “ancestry” extra field and your VCF contains a “POP” format field (as output by simgenotype), you should specify the --ancestry flag.
With this flag, transform matches the population label of each haplotype against the labels in the genotypes output by simgenotype.
In other words, a sample is said to contain a haplotype only if all of the alleles of the haplotype are present and each is labeled with the haplotype’s ancestry.
Alternatively, you may specify a breakpoints file accompanying the genotypes file. It must have the same name as the genotypes file but with a .bp file ending. If such a file exists, transform will ignore any “POP” format fields in the genotypes file and instead obtain the ancestry labels from the breakpoints file. This is primarily a speed enhancement, since it’s faster to load ancestral labels from the breakpoints file.
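The ancestry-matching rule in this section can be sketched as follows. This is an illustrative assumption about the logic, not the haptools code: the function name `carries`, the per-copy dictionary layout, and the population labels "YRI" and "CEU" are hypothetical.

```python
def carries(copy, hap_alleles, hap_ancestry=None):
    """Check whether one chromosome copy carries a haplotype.

    copy:         {variant ID: (allele, population label)}
    hap_alleles:  {variant ID: allele}
    hap_ancestry: population label required by the haplotype, or None to
                  ignore ancestry (comparable to omitting --ancestry)
    (hypothetical layout, for illustration only)
    """
    for var, allele in hap_alleles.items():
        observed_allele, observed_pop = copy[var]
        if observed_allele != allele:
            return False
        # With ancestry matching, every allele must also carry the
        # haplotype's population label.
        if hap_ancestry is not None and observed_pop != hap_ancestry:
            return False
    return True


hap_alleles = {"rs1": "A", "rs2": "T"}
copy_yri = {"rs1": ("A", "YRI"), "rs2": ("T", "YRI")}
copy_mixed = {"rs1": ("A", "YRI"), "rs2": ("T", "CEU")}

print(carries(copy_yri, hap_alleles, "YRI"))    # True
print(carries(copy_mixed, hap_alleles, "YRI"))  # False
print(carries(copy_mixed, hap_alleles))         # True
```

The second copy has the right alleles but one of them lies on a differently labeled segment, so it only counts as carrying the haplotype when ancestry is ignored.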
Output
Transform outputs pseudo-genotypes in VCF, but you may request PLINK2 PGEN format instead. Just use the appropriate “.pgen” file extension in the output path. See the documentation for genotypes in the format docs for more information.
Examples
haptools transform tests/data/simple.vcf.gz tests/data/simple.hap
To transform just two samples and write the output in PGEN format:
haptools transform -o output.pgen -s HG00097 -s NA12878 tests/data/apoe.vcf.gz tests/data/apoe4.hap
To get progress information, increase the verbosity to “INFO”:
haptools transform --verbosity INFO -o output.vcf.gz tests/data/example.vcf.gz tests/data/basic.hap.gz
To match haplotypes as well as their ancestral population labels, use the --ancestry flag:
haptools transform --ancestry tests/data/simple-ancestry.vcf tests/data/simple.hap
All files used in these examples are described here.
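For longer sample lists, the --samples-file option takes a single-column text file instead of repeated --sample flags. The sketch below writes such a file and assembles the matching command from a script. The file name samples.txt and the temporary directory are hypothetical, and the command is only printed, not executed.

```python
import shlex
import tempfile
from pathlib import Path

samples = ["HG00097", "NA12878"]

# Write one sample ID per line (samples.txt is a hypothetical file name)
workdir = Path(tempfile.mkdtemp())
samples_file = workdir / "samples.txt"
samples_file.write_text("\n".join(samples) + "\n")

# Assemble the command using only options listed in the usage summary
cmd = [
    "haptools", "transform",
    "--samples-file", str(samples_file),
    "--output", str(workdir / "output.pgen"),
    "tests/data/apoe.vcf.gz",
    "tests/data/apoe4.hap",
]
print(shlex.join(cmd))

# To actually run it (requires haptools and the test data):
# import subprocess; subprocess.run(cmd, check=True)
```

The same pattern applies to --ids-file, with one haplotype ID per line.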
Detailed Usage
haptools
haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information
haptools [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
transform
Creates a VCF composed of haplotypes
GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec
haptools transform [OPTIONS] GENOTYPES HAPLOTYPES
Options
- --region <region>
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files
- Default:
all haplotypes
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default:
all samples
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default:
all samples
- -i, --id <ids>
A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’).
- Default:
all haplotypes
- -I, --ids-file <ids_file>
A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file
- Default:
all haplotypes
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default:
all variants
- --discard-missing
Ignore any samples that are missing genotypes for the required variants
- Default:
False
- --ancestry
Also transform using VCF ‘POP’ FORMAT field and ‘ancestry’ .hap extra field
- Default:
False
- -o, --output <output>
A VCF file containing haplotype ‘genotypes’
- Default:
stdout
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
INFO
- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument