simgenotype

Takes as input a reference set of haplotypes in VCF format and a user-specified admixture model.

Outputs a VCF file with simulated genotype information for admixed genotypes, as well as a breakpoints file that can be used for visualization. For example, you could simulate a 50/50 mixture of CEU and YRI for 10 generations. More complex models, such as those involving pulse events of new populations, can also be simulated.

Basic Usage

haptools simgenotype \
--model MODELFILE \
--mapdir GENETICMAPDIR \
--chroms LIST,OF,CHROMS \
--region CHR:START-END \
--ref_vcf REFVCF \
--sample_info SAMPLEINFOFILE \
--pop_field \
--out /PATH/TO/OUTPUT.VCF.GZ

Detailed information about each option, and example commands using publicly available files, are shown below.

Parameter Descriptions

  • --model - Parameters for simulating admixture across generations including sample size, population fractions, and number of generations.

  • --mapdir - Directory containing all .map genetic map files, in which the third position is given in centiMorgans

  • --out - Full path to the output file, of the form /path/to/output.(vcf|bcf|vcf.gz|pgen). For example, choosing vcf.gz writes /path/to/output.vcf.gz and the breakpoints file /path/to/output.bp

  • --chroms - List of chromosomes to be simulated. The map file directory must contain a file with “chr<CHR>” in its name for each one, where <CHR> is the chromosome identifier, e.g. 1,2,…,X

  • --seed - Seed for randomized calculations during simulation of breakpoints. [Optional]

  • --popsize - Size of the population in each generation, from which the simulated samples are drawn. Default = max(10000, 10*samples) [Optional]

  • --ref_vcf - Input VCF or PGEN file used to simulate specific haplotypes for the resulting samples

  • --sample_info - File used to map samples in REFVCF to populations found in MODELFILE

  • --region - Limit the simulation to a region within a single chromosome. Overrides --chroms with the chromosome listed in this region, e.g. 1:1-10000 [Optional]

  • --pop_field - Flag for outputting the population field in VCF output. Note this flag does not work when your output is in PGEN format. [Optional]

  • --sample_field - Flag for outputting the sample field in VCF output. Note this flag does not work when your output is in PGEN format. Should only be used for debugging. [Optional]

  • --no_replacement - Flag that makes the VCF generation process draw samples’ haplotypes from the reference VCF file without replacement. Default = False (with replacement) [Optional]

  • --verbosity - What level of output the logger should print to stdout. One of CRITICAL, ERROR, WARNING, INFO, DEBUG, or NOTSET. Default = INFO [Optional]

  • --only_breakpoint - Flag that, when provided, outputs only the breakpoint file. In this case you do not need to provide a --ref_vcf or --sample_info file and can pass NA instead, e.g. --ref_vcf NA and --sample_info NA [Optional]
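A few of the conventions in the list above can be written out as small helpers. This is a minimal Python sketch, not haptools' own code: the function names are hypothetical, and applying the .bp rule to every listed output extension is an assumption drawn from the vcf.gz example.

```python
from __future__ import annotations

# Output extensions listed for --out in the parameter descriptions.
OUTPUT_SUFFIXES = (".vcf.gz", ".vcf", ".bcf", ".pgen")


def default_popsize(n_samples: int) -> int:
    """Default documented for --popsize: max(10000, 10 * samples)."""
    return max(10000, 10 * n_samples)


def breakpoint_path(out: str) -> str:
    """Swap the output extension for .bp (assumed to hold for all formats)."""
    for suffix in OUTPUT_SUFFIXES:
        if out.endswith(suffix):
            return out[: -len(suffix)] + ".bp"
    raise ValueError(f"unsupported output extension: {out}")


def parse_region(region: str) -> tuple[str, int, int]:
    """Split CHR:START-END (e.g. 1:1-10000) into chromosome, start, end."""
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)
```

For instance, under these assumptions a run with 5000 samples would default to a population size of 50000, and /path/to/output.vcf.gz would be paired with /path/to/output.bp.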

File Formats

Examples

haptools simgenotype \
--model tests/data/outvcf_gen.dat \
--mapdir tests/data/map/ \
--region 1:1-83000 \
--ref_vcf tests/data/outvcf_test.vcf.gz \
--sample_info tests/data/outvcf_info.tab \
--pop_field \
--out tests/data/example_simgenotype.vcf

If speed is important, it’s generally faster to use PGEN files than VCFs.

haptools simgenotype \
--model tests/data/outvcf_gen.dat \
--mapdir tests/data/map/ \
--region 1:1-83000 \
--ref_vcf tests/data/outvcf_test.pgen \
--sample_info tests/data/outvcf_info.tab \
--pop_field \
--out tests/data/example_simgenotype.pgen

All files used in these examples are described here.
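When simgenotype is driven from a script, it can help to assemble the command as an argument list. The sketch below rebuilds the first example above in Python. The helper build_simgenotype_cmd is hypothetical (it is not part of haptools), and the seed value is arbitrary; only flags documented on this page are used.

```python
import shlex


def build_simgenotype_cmd(model, mapdir, ref_vcf, sample_info, out,
                          region=None, chroms=None, pop_field=False, seed=None):
    """Assemble an argument list for `haptools simgenotype` (hypothetical helper)."""
    cmd = [
        "haptools", "simgenotype",
        "--model", model,
        "--mapdir", mapdir,
        "--ref_vcf", ref_vcf,
        "--sample_info", sample_info,
        "--out", out,
    ]
    if region is not None:
        cmd += ["--region", region]
    if chroms is not None:
        # --chroms takes a comma-delimited list, e.g. 1,2,X
        cmd += ["--chroms", ",".join(chroms)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if pop_field:
        cmd.append("--pop_field")
    return cmd


cmd = build_simgenotype_cmd(
    model="tests/data/outvcf_gen.dat",
    mapdir="tests/data/map/",
    ref_vcf="tests/data/outvcf_test.vcf.gz",
    sample_info="tests/data/outvcf_info.tab",
    out="tests/data/example_simgenotype.vcf",
    region="1:1-83000",
    pop_field=True,
    seed=42,  # arbitrary value, chosen only to make the run reproducible
)
print(shlex.join(cmd))  # shell-quoted form of the command
# To execute it (requires haptools on PATH and the test files present):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.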

Detailed Usage

haptools

haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information

haptools [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

simgenotype

Simulate admixed genomes under a pre-defined model.

haptools simgenotype [OPTIONS]

Options

--model <model>

Required. Admixture model in .dat format. See File Formats under simgenotype in the docs for complete info.

--mapdir <mapdir>

Required. Directory containing genetic map files whose names contain chr{1-22,X} and end in .map.

--out <out>

Required. Path to the desired output file, e.g. /path/to/output.vcf.gz. Possible output formats are vcf|bcf|vcf.gz|pgen. An additional breakpoints file with the extension bp is also written, e.g. /path/to/output.bp.

--chroms <chroms>

Sorted and comma delimited list of chromosomes to simulate

--seed <seed>

Random seed. Set to make simulations reproducible

--ref_vcf <ref_vcf>

Required. VCF or PGEN file used as the reference for creating the simulated samples' genotypes.

--sample_info <sample_info>

Required. File that maps samples from the reference VCF (--ref_vcf) to the population codes listed in the header of the model file.

--region <region>

Subset the simulation to a specific region in a chromosome using the form chrom:start-end. Example: 2:1000-2000

--pop_field

Flag for outputting the population field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.

--sample_field

Flag for outputting the sample field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.

--no_replacement

Flag used to determine whether to sample reference haplotypes with or without replacement. (Default = Replacement)

--only_breakpoint

Flag used to determine whether to only output breakpoints or continue to simulate a vcf file.

-v, --verbosity <verbosity>

The level of verbosity desired

Default:

INFO

Options:

CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
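The verbosity names listed above match the standard level names in Python's logging module. As an illustration only (this is not haptools' own code, and the logger name and make_logger helper are hypothetical), they could be mapped onto a logger like this:

```python
import logging

# Choices accepted by -v/--verbosity, as listed above.
VERBOSITY_CHOICES = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]

# Numeric value of each level name in the standard logging module.
levels = {name: getattr(logging, name) for name in VERBOSITY_CHOICES}


def make_logger(verbosity: str = "INFO") -> logging.Logger:
    """Return a logger set to the requested verbosity (hypothetical helper)."""
    if verbosity not in VERBOSITY_CHOICES:
        raise ValueError(f"unknown verbosity: {verbosity}")
    logger = logging.getLogger("simgenotype_example")
    logger.setLevel(levels[verbosity])
    return logger


logger = make_logger("DEBUG")
logger.debug("messages at DEBUG and above pass this logger's level check")
```

Lower numeric values are more verbose, so DEBUG lets through everything that INFO does plus debug messages.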