transform

Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

The transform command takes as input a set of phased genotypes and a list of haplotypes and outputs a set of haplotype pseudo-genotypes, where each haplotype is encoded as a bi-allelic variant record in the output. In other words, each sample will have a genotype of 0|0, 1|0, 0|1, or 1|1 indicating whether each of their two chromosome copies contains the alleles of a haplotype.
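The encoding described above can be illustrated with a small, self-contained Python sketch. This is not the haptools implementation: the function name `pseudo_genotype`, the variant IDs, and the dictionary-based data layout are all assumptions made for illustration.

```python
def pseudo_genotype(sample_gts, hap_alleles):
    """Return the phased pseudo-genotype string for one sample.

    sample_gts:  {variant_id: (allele_on_copy_0, allele_on_copy_1)}
    hap_alleles: {variant_id: allele_required_by_the_haplotype}
    """
    calls = []
    for copy in (0, 1):
        # A chromosome copy carries the haplotype only if it matches
        # the required allele at every variant of the haplotype.
        carries = all(
            sample_gts[vid][copy] == allele
            for vid, allele in hap_alleles.items()
        )
        calls.append("1" if carries else "0")
    return "|".join(calls)


# Hypothetical haplotype made of two variants (IDs are made up).
haplotype = {"rs1": "A", "rs2": "T"}
# Hypothetical phased genotypes for one sample.
sample = {"rs1": ("A", "G"), "rs2": ("T", "T")}

print(pseudo_genotype(sample, haplotype))  # prints 1|0
```

Only the first chromosome copy matches both alleles, so the sample is encoded as 1|0 for this haplotype.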

Figure: Transforming genotypes via haplotypes

Users may also specify an ancestral population label for each haplotype. See the ancestry section for more details.

Usage

haptools transform \
--region TEXT \
--sample SAMPLE --sample SAMPLE \
--samples-file FILENAME \
--id ID --id ID \
--ids-file FILENAME \
--chunk-size INT \
--discard-missing \
--ancestry \
--maf FLOAT \
--output PATH \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GENOTYPES HAPLOTYPES

Input

Genotypes must be specified in VCF and haplotypes must be specified in the .hap file format.

Alternatively, you may specify genotypes in PLINK2 PGEN format by giving the input file the appropriate “.pgen” extension. See the genotypes section of the format docs for more information.

Ancestry

If your .hap file contains an “ancestry” extra field and your VCF contains a “POP” format field (as output by simgenotype), you should specify the --ancestry flag. This tells transform to match the population label of each haplotype against the labels in the genotypes output by simgenotype. In other words, a sample is said to contain a haplotype only if all of the alleles of the haplotype are labeled with the haplotype’s ancestry.
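As a rough illustration of the ancestry rule, the following Python sketch checks a single chromosome copy against a single haplotype. It is a simplified model under stated assumptions, not the haptools code: the function `carries_haplotype`, the variant IDs, and the population labels are hypothetical.

```python
def carries_haplotype(alleles, pops, hap_alleles, hap_ancestry=None):
    """Check one chromosome copy against one haplotype.

    alleles:      {variant_id: allele on this copy}
    pops:         {variant_id: population label on this copy}
    hap_alleles:  {variant_id: allele required by the haplotype}
    hap_ancestry: population label of the haplotype, or None to
                  ignore ancestry entirely
    """
    for vid, allele in hap_alleles.items():
        if alleles[vid] != allele:
            return False
        # With ancestry matching, every allele of the haplotype must
        # also carry the haplotype's population label.
        if hap_ancestry is not None and pops[vid] != hap_ancestry:
            return False
    return True


hap = {"rs1": "A", "rs2": "T"}             # hypothetical haplotype
copy_alleles = {"rs1": "A", "rs2": "T"}    # alleles on one copy
copy_pops = {"rs1": "YRI", "rs2": "CEU"}   # labels on the same copy

print(carries_haplotype(copy_alleles, copy_pops, hap))         # True
print(carries_haplotype(copy_alleles, copy_pops, hap, "YRI"))  # False
```

The copy matches on alleles alone, but not once ancestry is required, because the second variant is labeled with a different population.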

Figure: Transforming via ancestry labels

Alternatively, you may specify a breakpoints file accompanying the genotypes file. It must have the same name as the genotypes file but with a .bp file ending. If such a file exists, transform will ignore any “POP” format fields in the genotypes file and instead obtain the ancestry labels from the breakpoints file. This is primarily a speed enhancement, since it’s faster to load ancestral labels from the breakpoints file.

Output

Transform outputs pseudo-genotypes in VCF, but you may request genotypes in PLINK2 PGEN format instead. Just use the appropriate “.pgen” file extension in the output path. See the documentation for genotypes in the format docs for more information.

Examples

haptools transform tests/data/simple.vcf.gz tests/data/simple.hap

Let’s try transforming just two samples and let’s output to PGEN format:

haptools transform -o output.pgen -s HG00097 -s NA12878 tests/data/apoe.vcf.gz tests/data/apoe4.hap
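The same two samples could instead be listed in a single-column file and passed with --samples-file. The Python sketch below writes such a file and assembles the matching command line without running it. The helper `write_single_column` and the file name `samples.txt` are hypothetical; the sample IDs and input paths are taken from the example above.

```python
import tempfile
from pathlib import Path


def write_single_column(path, values):
    """Write one value per line, with no header."""
    Path(path).write_text("".join(f"{value}\n" for value in values))


tmpdir = Path(tempfile.mkdtemp())
samples_file = tmpdir / "samples.txt"  # hypothetical file name
write_single_column(samples_file, ["HG00097", "NA12878"])

# Command line that would consume the file (assembled, not executed).
cmd = [
    "haptools", "transform",
    "-o", "output.pgen",
    "--samples-file", str(samples_file),
    "tests/data/apoe.vcf.gz", "tests/data/apoe4.hap",
]
print(" ".join(cmd))
```

A file of haplotype IDs for --ids-file could be produced in the same way.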

To get progress information, increase the verbosity to “INFO”:

haptools transform --verbosity INFO -o output.vcf.gz tests/data/example.vcf.gz tests/data/basic.hap.gz

To match haplotypes as well as their ancestral population labels, use the --ancestry flag:

haptools transform --ancestry tests/data/simple-ancestry.vcf tests/data/simple.hap

All files used in these examples are described here.

Detailed Usage

haptools

haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information

haptools [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

transform

Creates a VCF composed of haplotypes

GENOTYPES must be formatted as a VCF or PGEN and HAPLOTYPES must be formatted according to the .hap format spec

haptools transform [OPTIONS] GENOTYPES HAPLOTYPES

Options

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files

Default:

'all haplotypes'

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default:

'all samples'

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default:

'all samples'

-i, --id <ids>

A list of the haplotype IDs to use from the .hap file (ex: ‘-i H1 -i H2’).

Default:

'all haplotypes'

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file

Default:

'all haplotypes'

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default:

'all variants'
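To show why chunking reduces memory, here is a generic Python sketch of splitting a range of variants into fixed-size chunks, so that only one chunk needs to be held at a time. It does not reproduce how haptools reads PGEN files; the function `iter_chunks` is hypothetical.

```python
def iter_chunks(n_variants, chunk_size=None):
    """Yield (start, end) index ranges that together cover n_variants."""
    if chunk_size is None:
        # Mirror the documented default: read all variants at once.
        chunk_size = max(n_variants, 1)
    for start in range(0, n_variants, chunk_size):
        yield start, min(start + chunk_size, n_variants)


print(list(iter_chunks(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
print(list(iter_chunks(10)))     # [(0, 10)]
```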

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default:

False
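A minimal Python sketch of this filter follows, assuming that a missing allele is represented as `None` and that "required variants" means the variants belonging to the requested haplotypes. The function `discard_missing` and the data layout are hypothetical, not the haptools internals.

```python
def discard_missing(genotypes, required_variants):
    """Keep only samples with complete calls at the required variants.

    genotypes: {sample: {variant_id: (allele_0, allele_1)}}
               where None marks a missing allele
    """
    return {
        sample: gts
        for sample, gts in genotypes.items()
        if all(
            # A variant absent from the sample counts as missing too.
            None not in gts.get(vid, (None, None))
            for vid in required_variants
        )
    }


genotypes = {
    "s1": {"rs1": ("A", "G"), "rs2": ("T", "T")},
    "s2": {"rs1": ("A", None), "rs2": ("T", "C")},  # one allele missing
}
print(sorted(discard_missing(genotypes, ["rs1", "rs2"])))  # ['s1']
```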

--ancestry

Also transform using VCF ‘POP’ FORMAT field and ‘ancestry’ .hap extra field

Default:

False

--maf <maf>

Do not output haplotypes with an MAF below this value

Default:

'all haplotypes'
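The sketch below illustrates a filter of this kind in Python. It assumes that MAF is the frequency of the less common of the two pseudo-alleles across all chromosome copies; the exact computation used by transform is not stated here, and the names `minor_allele_frequency` and `filter_by_maf` are hypothetical.

```python
def minor_allele_frequency(pseudo_gts):
    """pseudo_gts: list of phased calls such as '0|1', one per sample."""
    copies = [int(allele) for gt in pseudo_gts for allele in gt.split("|")]
    freq = sum(copies) / len(copies)
    # The minor allele is whichever of the two is less common.
    return min(freq, 1 - freq)


def filter_by_maf(haplotypes, threshold):
    """Drop haplotypes whose MAF is below the threshold."""
    return {
        hap_id: gts
        for hap_id, gts in haplotypes.items()
        if minor_allele_frequency(gts) >= threshold
    }


# Hypothetical pseudo-genotypes for two haplotypes in four samples.
haps = {
    "H1": ["0|1", "0|0", "1|1", "0|0"],  # 3 of 8 copies -> 0.375
    "H2": ["0|0", "0|0", "0|0", "0|1"],  # 1 of 8 copies -> 0.125
}
print(sorted(filter_by_maf(haps, 0.2)))  # ['H1']
```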

-o, --output <output>

A VCF file containing haplotype ‘genotypes’

Default:

'stdout'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:

'INFO'

Options:

CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES

Required argument

HAPLOTYPES

Required argument