transform

Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

The transform command takes as input a set of phased genotypes and a list of haplotypes and outputs a set of haplotype pseudo-genotypes, where each haplotype is encoded as a bi-allelic variant record in the output. In other words, for each haplotype, every sample receives a genotype of 0|0, 1|0, 0|1, or 1|1, indicating whether each of its two chromosome copies carries all of the alleles of that haplotype.
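The encoding described above can be illustrated with a minimal Python sketch. It is not the haptools implementation: the function name transform_sample, the dictionary-based data layout, and the variant IDs rs1 and rs2 are all hypothetical, chosen only to show how one chromosome copy is checked against every allele of a haplotype.

```python
# Illustrative sketch only; this is NOT how haptools is implemented.
# Assumption: a haplotype is modeled as a mapping from a (hypothetical)
# variant ID to the allele it requires (0 = REF, 1 = ALT).
def transform_sample(phased_gts, haplotype):
    """Return the "a|b" pseudo-genotype of one sample for one haplotype.

    phased_gts: dict of variant ID -> (allele on copy 0, allele on copy 1)
    haplotype:  dict of variant ID -> required allele
    """
    calls = []
    for copy in (0, 1):
        # A chromosome copy carries the haplotype only if it matches
        # every allele that the haplotype requires.
        has_hap = all(
            phased_gts[var][copy] == allele for var, allele in haplotype.items()
        )
        calls.append(int(has_hap))
    return f"{calls[0]}|{calls[1]}"


haplotype = {"rs1": 1, "rs2": 0}
sample = {"rs1": (1, 0), "rs2": (0, 0)}
print(transform_sample(sample, haplotype))  # 1|0
```

Only the first chromosome copy matches both required alleles, so the pseudo-genotype is 1|0.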

Users may also specify an ancestral population label for each haplotype. See the ancestry section for more details.

Usage

haptools transform \
--region TEXT \
--sample SAMPLE --sample SAMPLE \
--samples-file FILENAME \
--id ID --id ID \
--ids-file FILENAME \
--chunk-size INT \
--discard-missing \
--ancestry \
--output PATH \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GENOTYPES HAPLOTYPES

Input

Genotypes must be specified in VCF and haplotypes must be specified in the .hap file format.

Alternatively, you may specify genotypes in PLINK2 PGEN format. Just use the appropriate “.pgen” file extension in the input. See the documentation for genotypes in the format docs for more information.

Ancestry

If your .hap file contains an “ancestry” extra field and your VCF contains a “POP” format field (as output by simgenotype), you should specify the --ancestry flag. This allows transform to match the population label of each haplotype against the labels in the genotypes output by simgenotype. In other words, a sample is said to contain a haplotype only if it carries all of the alleles of the haplotype and all of those alleles are labeled with the haplotype’s ancestry.
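As a rough sketch of this rule, the check for a single chromosome copy might look like the following. The function name carries_haplotype, the data layout, and the population labels "YRI" and "CEU" are assumptions made for illustration; the sketch does not reflect how haptools stores or reads “POP” values internally.

```python
# Illustrative sketch only; not the haptools implementation.
# Assumption: alleles and population labels for ONE chromosome copy are
# given as dicts keyed by (hypothetical) variant IDs.
def carries_haplotype(alleles, pops, hap_alleles, hap_ancestry):
    """True if this copy has every haplotype allele AND each of those
    alleles is labeled with the haplotype's ancestry."""
    return all(
        alleles[var] == allele and pops[var] == hap_ancestry
        for var, allele in hap_alleles.items()
    )


hap_alleles = {"rs1": 1, "rs2": 1}
copy_alleles = {"rs1": 1, "rs2": 1}

# Same alleles, but only the first labeling is entirely "YRI".
print(carries_haplotype(copy_alleles, {"rs1": "YRI", "rs2": "YRI"}, hap_alleles, "YRI"))  # True
print(carries_haplotype(copy_alleles, {"rs1": "YRI", "rs2": "CEU"}, hap_alleles, "YRI"))  # False
```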

Alternatively, you may specify a breakpoints file accompanying the genotypes file. It must have the same name as the genotypes file but with a .bp file ending. If such a file exists, transform will ignore any “POP” format fields in the genotypes file and instead obtain the ancestry labels from the breakpoints file. This is primarily a speed enhancement, since it’s faster to load ancestral labels from the breakpoints file.

Output

Transform outputs pseudo-genotypes in VCF, but you may request genotypes in PLINK2 PGEN format instead. Just use the appropriate “.pgen” file extension in the output path. See the documentation for genotypes in the format docs for more information.

Examples

haptools transform tests/data/simple.vcf.gz tests/data/simple.hap

To transform just two samples and write the output in PGEN format:

haptools transform -o output.pgen -s HG00097 -s NA12878 tests/data/apoe.vcf.gz tests/data/apoe4.hap

To get progress information, increase the verbosity to “INFO”:

haptools transform --verbosity INFO -o output.vcf.gz tests/data/example.vcf.gz tests/data/basic.hap.gz

To match haplotypes as well as their ancestral population labels, use the --ancestry flag:

haptools transform --ancestry tests/data/simple-ancestry.vcf tests/data/simple.hap

All files used in these examples are described here.

Detailed Usage

haptools

haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information

haptools [OPTIONS] COMMAND [ARGS]...

Options

--version: Show the version and exit.

transform

Creates a VCF composed of haplotypes

GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec

haptools transform [OPTIONS] GENOTYPES HAPLOTYPES

Options

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must match one in the files.

Default: all haplotypes

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default: all samples

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default: all samples

-i, --id <ids>

A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’).

Default: all haplotypes

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file

Default: all haplotypes

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; this reduces memory usage

Default: all variants
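The general idea behind chunked reading can be sketched as follows. This is a generic illustration rather than the PGEN reader used by haptools; chunk_ranges is a hypothetical helper that only computes which variant index ranges would be read in each pass, so that at most chunk_size variants need to be held in memory at a time.

```python
# Illustrative sketch only; not the haptools PGEN reader.
def chunk_ranges(num_variants, chunk_size):
    """Yield half-open (start, end) index ranges covering all variants,
    each spanning at most chunk_size variants."""
    for start in range(0, num_variants, chunk_size):
        yield start, min(start + chunk_size, num_variants)


# Ten variants read four at a time; the last chunk is shorter.
print(list(chunk_ranges(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```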

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default: False
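Conceptually, this kind of filtering could be sketched as below. The helper keep_complete_samples, the use of None to mark a missing allele, and the sample and variant names are all assumptions for illustration; the sketch does not describe how haptools represents missing genotypes.

```python
# Illustrative sketch only; not the haptools implementation.
# Assumption: a missing allele is represented by None.
def keep_complete_samples(genotypes, required_variants):
    """Keep only samples with no missing alleles at the required variants."""
    return {
        sample: calls
        for sample, calls in genotypes.items()
        if all(None not in calls[var] for var in required_variants)
    }


genotypes = {
    "sampleA": {"rs1": (0, 1), "rs2": (1, 1)},
    "sampleB": {"rs1": (None, 1), "rs2": (0, 0)},  # missing an allele at rs1
}
kept = keep_complete_samples(genotypes, ["rs1", "rs2"])
print(sorted(kept))  # ['sampleA']
```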

--ancestry

Also transform using VCF ‘POP’ FORMAT field and ‘ancestry’ .hap extra field

Default: False

-o, --output <output>

A VCF file containing haplotype ‘genotypes’

Default: stdout

-v, --verbosity <verbosity>

The level of verbosity desired

Default: INFO
Options: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES: Required argument

HAPLOTYPES: Required argument