Genotypes
The time required to load various genotype file formats.
VCF/BCF
Genotype files must be specified as VCF or BCF files. They can be bgzip-compressed.
PLINK2 PGEN
There is also experimental support for PLINK2 PGEN files in some commands. These files can be loaded and created much more quickly than VCFs, so we highly recommend using them if you’re working with large datasets. See the documentation for the GenotypesPLINK class in the API docs for more information.
If you run out of memory when using PGEN files, consider reading or writing variants in chunks via the --chunk-size parameter.
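To illustrate why chunking helps, here is a minimal sketch of the general idea: variants are processed in fixed-size batches, so only one batch needs to be held in memory at a time. This is not haptools' actual implementation; the function name chunk_ranges and its parameters are hypothetical and exist only for this example.

```python
from typing import Iterator, Tuple


def chunk_ranges(num_variants: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) index ranges covering all variants.

    Hypothetical helper for illustration only; it is not part of haptools.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, num_variants, chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, num_variants)


# With 10 variants and a chunk size of 4, the last chunk holds only 2 variants.
print(list(chunk_ranges(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

A smaller chunk size lowers peak memory use at the cost of more read or write operations.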
Note
PLINK2 support depends on the Pgenlib Python library. It is installed automatically with haptools if you specify the “files” extra during installation.
pip install haptools[files]
Warning
At the moment, only biallelic SNPs can be encoded in PGEN files because of limitations in the Pgenlib Python library, which doesn’t properly support multiallelic variants yet (source). To ensure your PGEN files contain only SNPs, we recommend using the following command to convert from VCF to PGEN.
plink2 --snps-only 'just-acgt' --vcf tests/data/simple.vcf --make-pgen --out simple
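As a rough illustration of what a SNP-only filter keeps, the sketch below approximates the allele-code check in plain Python. It assumes that 'just-acgt' retains variants whose alleles are all single characters in A, C, G, T (or the missing code); check the PLINK2 documentation for the exact rules. The helper names is_acgt_snp and is_biallelic are hypothetical and are not part of haptools or PLINK2.

```python
from typing import Sequence

ACGT = set("ACGTacgt")


def is_acgt_snp(ref: str, alts: Sequence[str], missing: str = ".") -> bool:
    """Return True if every allele is a single A/C/G/T character or the missing code.

    Hypothetical approximation of a SNP-only filter, for illustration.
    """
    alleles = [ref, *alts]
    return all(len(a) == 1 and (a in ACGT or a == missing) for a in alleles)


def is_biallelic(alts: Sequence[str]) -> bool:
    """Return True if the variant has exactly one alternate allele."""
    return len(alts) == 1


print(is_acgt_snp("A", ["G"]))   # True: a simple SNP
print(is_acgt_snp("A", ["AT"]))  # False: an insertion has a multi-character allele
print(is_acgt_snp("N", ["G"]))   # False: N is not an ACGT code
```

Under this reading, a multiallelic SNP such as A to C,G would still pass the allele-code check, which is why the biallelic check is shown as a separate function.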