Genotypes
Figure: The time required to load various genotype file formats.
VCF/BCF
Genotype files must be specified as VCF or BCF files. They can be bgzip-compressed.
To be loaded properly, VCFs must follow the VCF specification. VCFs with duplicate variant IDs do not follow the specification; the IDs must be unique. Please validate your VCF using a tool like gatk ValidateVariants before using haptools.
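As a quick pre-check for the duplicate-ID problem described above, the following is a minimal sketch that scans a text VCF and reports any variant ID that appears more than once. It is not part of haptools and does not replace a full validator such as gatk ValidateVariants. The function name `find_duplicate_ids` is hypothetical, the sketch assumes that compressed files end in `.gz`, it skips missing IDs (`.`), and it treats the whole ID column as a single identifier. Binary BCF files are not handled.

```python
import gzip
from collections import Counter


def find_duplicate_ids(vcf_path):
    """Return {variant ID: count} for IDs that occur more than once.

    Hypothetical helper for illustration; it only inspects the ID column
    of a plain-text or gzip/bgzip-compressed VCF.
    """
    # bgzip output is gzip-compatible, so gzip.open can read it
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t", 3)
            if len(fields) < 3:
                continue  # not a complete variant record
            variant_id = fields[2]
            if variant_id != ".":  # assumption: ignore missing IDs
                counts[variant_id] += 1
    return {vid: n for vid, n in counts.items() if n > 1}


if __name__ == "__main__":
    import sys

    for vid, n in sorted(find_duplicate_ids(sys.argv[1]).items()):
        print(f"{vid}\tappears {n} times")
```

An empty result only means that no repeated IDs were found; the file may still violate the specification in other ways.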
PLINK2 PGEN
There is also experimental support for PLINK2 PGEN files in some commands. These files can be loaded and created much more quickly than VCFs, so we highly recommend using them if you’re working with large datasets. See the documentation for the GenotypesPLINK class in the API docs for more information.
If you run out of memory when using PGEN files, consider reading/writing variants from the file in chunks via the --chunk-size parameter.
Converting from VCF to PGEN
To convert a VCF containing only SNPs to PGEN, use the following command.
plink2 --snps-only 'just-acgt' --vcf input.vcf --make-pgen --out output
To convert a VCF containing tandem repeats to PGEN, use this command instead.
plink2 --vcf-half-call m --make-pgen 'pvar-cols=vcfheader,qual,filter,info' --vcf input.vcf --out output
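If you run these conversions from a script, a small helper can assemble the plink2 arguments for either case. The sketch below is illustrative only: the function name `plink2_convert_cmd` and its parameters are hypothetical, it uses only the plink2 flags shown above (with a single --make-pgen per command), and it assumes that plink2 accepts its flags in any order.

```python
import shlex


def plink2_convert_cmd(vcf_path, out_prefix, tandem_repeats=False):
    """Return the plink2 argument list for converting a VCF to PGEN.

    Hypothetical helper; the flags are the ones used in the commands above.
    """
    if tandem_repeats:
        # keep the VCF header, QUAL, FILTER and INFO columns in the .pvar
        # file and treat half-calls as missing
        args = [
            "plink2",
            "--vcf-half-call", "m",
            "--make-pgen", "pvar-cols=vcfheader,qual,filter,info",
        ]
    else:
        # SNP-only VCF: keep only variants with A/C/G/T alleles
        args = ["plink2", "--snps-only", "just-acgt", "--make-pgen"]
    return args + ["--vcf", str(vcf_path), "--out", str(out_prefix)]


if __name__ == "__main__":
    # Print the commands; pass the list to subprocess.run(..., check=True)
    # to execute one (requires plink2 on your PATH).
    print(shlex.join(plink2_convert_cmd("input.vcf", "output")))
    print(shlex.join(plink2_convert_cmd("input.vcf", "output", tandem_repeats=True)))
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.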