Genotypes

https://drive.google.com/uc?export=view&id=1_JARKJQ0LX-DzL0XsHW1aiQgLCOJ1ZvC

The time required to load various genotype file formats.

VCF/BCF

Genotype files must be specified as VCF or BCF files. They can be bgzip-compressed.

PLINK2 PGEN

There is also experimental support for PLINK2 PGEN files in some commands. These files can be loaded and created much more quickly than VCFs, so we highly recommend using them if you’re working with large datasets. See the documentation for the GenotypesPLINK class in the API docs for more information.

If you run out of memory when using PGEN files, consider reading/writing variants from the file in chunks via the --chunk-size parameter.
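To illustrate the idea behind chunking, here is a minimal Python sketch of how a set of variants can be split into fixed-size index ranges so that only one chunk needs to be held in memory at a time. The function name chunk_ranges is hypothetical and is not part of haptools; the way haptools actually implements --chunk-size may differ.

```python
from typing import Iterator, Tuple


def chunk_ranges(num_variants: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) index pairs that cover num_variants.

    Hypothetical helper for illustration only; not a haptools function.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, num_variants, chunk_size):
        # the last chunk may be shorter than chunk_size
        yield start, min(start + chunk_size, num_variants)


# Example: 10 variants processed 4 at a time
for start, end in chunk_ranges(10, 4):
    print(f"process variants [{start}, {end})")
```

A smaller chunk size lowers peak memory use at the cost of more read/write passes, so it is usually worth picking the largest value that still fits in memory.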

Note

PLINK2 support depends on the Pgenlib Python library. This can be installed automatically with haptools if you specify the “files” extra requirements during installation.

pip install haptools[files]

Warning

At the moment, only biallelic SNPs can be encoded in PGEN files because of limitations in the Pgenlib Python library. It doesn’t properly support multiallelic variants yet (source). To ensure your PGEN files only contain SNPs, we recommend using the following command to convert from VCF to PGEN.

plink2 --snps-only 'just-acgt' --vcf tests/data/simple.vcf --make-pgen --out simple
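If you want to inspect which records in a VCF are not biallelic SNPs, a small script along these lines may help. This is a minimal sketch using only the Python standard library; the names is_biallelic_snp and non_snp_records are hypothetical and are not part of haptools or plink2. It assumes plain-text, tab-delimited VCF lines.

```python
from typing import Iterable, Iterator, Tuple

VALID_BASES = set("ACGT")


def is_biallelic_snp(ref: str, alt: str) -> bool:
    """Return True if REF and ALT are each exactly one A/C/G/T base.

    Hypothetical helper for illustration only. A comma-separated ALT
    (multiallelic), an indel, or a missing ALT (".") returns False.
    """
    return ref.upper() in VALID_BASES and alt.upper() in VALID_BASES


def non_snp_records(vcf_lines: Iterable[str]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (CHROM, POS, REF, ALT) for records that are not biallelic SNPs."""
    for line in vcf_lines:
        # skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if not is_biallelic_snp(ref, alt):
            yield chrom, pos, ref, alt


# Example with a few in-memory records (made-up data)
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t.\t.\t.\n",    # biallelic SNP
    "1\t200\t.\tA\tG,T\t.\t.\t.\n",  # multiallelic
    "1\t300\t.\tAT\tA\t.\t.\t.\n",   # deletion
]
for record in non_snp_records(example):
    print(record)
```

For a compressed VCF, the lines could be read with gzip.open(path, "rt"), since bgzip output is gzip-compatible. BCF files are binary and would need a dedicated reader instead.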