simgenotype
Takes as input a reference set of haplotypes in VCF format and a user-specified admixture model.
Outputs a VCF file with simulated genotype information for admixed genotypes, as well as a breakpoints file that can be used for visualization. For example, you could simulate a 50/50 mixture of CEU and YRI for 10 generations. More complex models, such as those involving pulse events of new populations, can also be simulated.
Basic Usage
haptools simgenotype \
--model MODELFILE \
--mapdir GENETICMAPDIR \
--chroms LIST,OF,CHROMS \
--region CHR:START-END \
--ref_vcf REFVCF \
--sample_info SAMPLEINFOFILE \
--pop_field \
--out /PATH/TO/OUTPUT.VCF.GZ
Detailed information about each option, and example commands using publicly available files, are shown below.
Parameter Descriptions
- --model - Parameters for simulating admixture across generations, including sample size, population fractions, and number of generations.
- --mapdir - Directory containing all .map files with this structure, where the third position is in centiMorgans.
- --out - Full output path to a file of the form /path/to/output.(vcf|bcf|vcf.gz|pgen). For example, if vcf.gz is chosen, the outputs are /path/to/output.vcf.gz and the breakpoints file /path/to/output.bp.
- --chroms - List of chromosomes to be simulated. File names in the map file directory must contain “chr<CHR>”, where <CHR> is the chromosome identifier, e.g. 1,2,…,X.
- --seed - Seed for randomized calculations during simulation of breakpoints. [Optional]
- --popsize - Population size for each generation that is sampled from to create the simulated samples. Default = max(10000, 10*samples) [Optional]
- --ref_vcf - Input VCF or PGEN file used to simulate specific haplotypes for the resulting samples.
- --sample_info - File used to map samples in REFVCF to populations found in MODELFILE.
- --region - Limit the simulation to a region within a single chromosome. Overrides --chroms with the chromosome listed in this region, e.g. 1:1-10000. [Optional]
- --pop_field - Flag for outputting the population field in VCF output. Note that this flag does not work when your output is in PGEN format. [Optional]
- --sample_field - Flag for outputting the sample field in VCF output. Note that this flag does not work when your output is in PGEN format. Should only be used for debugging. [Optional]
- --no_replacement - Flag that determines whether, during VCF generation, samples’ haplotypes are drawn from the reference VCF file with or without replacement. Default = False (with replacement) [Optional]
- --verbosity - What level of output the logger should print to stdout. Please see logging levels for output levels. Default = INFO [Optional]
- --only_breakpoint - Flag which, when provided, only outputs the breakpoint file. Note that you do not need to provide a --ref_vcf or --sample_info file and can instead put NA, e.g. --ref_vcf NA and --sample_info NA. [Optional]
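The --popsize default above, max(10000, 10*samples), can be checked by hand before a run. The following is a minimal sketch of that arithmetic only; the function name default_popsize is hypothetical and is not part of haptools.

```shell
# Hypothetical helper (not part of haptools): reproduce the documented
# --popsize default, max(10000, 10*samples), for a given sample count.
default_popsize() {
  scaled=$((10 * $1))          # 10 * number of simulated samples
  if [ "$scaled" -gt 10000 ]; then
    echo "$scaled"
  else
    echo 10000
  fi
}
```

For instance, `default_popsize 40` yields 10000, while `default_popsize 5000` yields 50000, so the default only grows past 10000 once more than 1000 samples are requested.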
File Formats
Examples
haptools simgenotype \
--model tests/data/outvcf_gen.dat \
--mapdir tests/data/map/ \
--region 1:1-83000 \
--ref_vcf tests/data/outvcf_test.vcf.gz \
--sample_info tests/data/outvcf_info.tab \
--pop_field \
--out tests/data/example_simgenotype.vcf
If speed is important, it’s generally faster to use PGEN files than VCFs.
haptools simgenotype \
--model tests/data/outvcf_gen.dat \
--mapdir tests/data/map/ \
--region 1:1-83000 \
--ref_vcf tests/data/outvcf_test.pgen \
--sample_info tests/data/outvcf_info.tab \
--pop_field \
--out tests/data/example_simgenotype.pgen
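If only the breakpoints are needed, the --only_breakpoint flag described under Parameter Descriptions lets you pass NA for --ref_vcf and --sample_info. The sketch below wraps such a run in a shell function so that the seed can be varied. The function name simulate_breakpoints_only and its single argument are hypothetical, and the sketch assumes the same test files as the examples above.

```shell
# Hypothetical wrapper (not part of haptools): breakpoints-only simulation.
# --ref_vcf and --sample_info are set to the placeholder NA because no
# reference genotypes are needed when --only_breakpoint is given.
simulate_breakpoints_only() {
  seed="$1"   # passed to --seed so the run is reproducible
  haptools simgenotype \
    --model tests/data/outvcf_gen.dat \
    --mapdir tests/data/map/ \
    --region 1:1-83000 \
    --ref_vcf NA \
    --sample_info NA \
    --only_breakpoint \
    --seed "$seed" \
    --out tests/data/example_simgenotype.vcf
}
```

Calling `simulate_breakpoints_only 42` would then run the simulation with seed 42. Based on the --out description, the breakpoints would be expected at tests/data/example_simgenotype.bp.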
All files used in these examples are described here.
Detailed Usage
haptools
haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information
haptools [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
simgenotype
Simulate admixed genomes under a pre-defined model.
haptools simgenotype [OPTIONS]
Options
- --model <model>
Required Admixture model in .dat format. See File Formats under simgenotype in the docs for complete info.
- --mapdir <mapdir>
Required Directory containing genetic map files whose names contain chr{1-22,X} and end in .map.
- --out <out>
Required Path to desired output file, e.g. /path/to/output.vcf.gz. Possible outputs are vcf|bcf|vcf.gz|pgen, and there will be an additional breakpoints output with extension bp, e.g. /path/to/output.bp.
- --chroms <chroms>
Sorted and comma delimited list of chromosomes to simulate
- --seed <seed>
Random seed. Set to make simulations reproducible
- --ref_vcf <ref_vcf>
Required VCF or PGEN file used as the reference for creating the simulated samples' genotypes.
- --sample_info <sample_info>
Required File that maps samples from the reference VCF (--ref_vcf) to population codes describing the populations in the header of the model file.
- --region <region>
Subset the simulation to a specific region in a chromosome using the form chrom:start-end. Example 2:1000-2000
- --pop_field
Flag for outputting the population field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.
- --sample_field
Flag for outputting the sample field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.
- --no_replacement
Flag used to determine whether to sample reference haplotypes with or without replacement. (Default = Replacement)
- --only_breakpoint
Flag used to determine whether to only output breakpoints or continue to simulate a VCF file.
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
INFO
- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
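The --out option above states that a breakpoints file with extension bp is written alongside the main output, e.g. /path/to/output.bp for /path/to/output.vcf.gz. A downstream script can derive that path instead of hard-coding it. The following is a minimal sketch: the function name breakpoints_path is hypothetical, and it assumes the other supported extensions (vcf, bcf, pgen) are replaced by .bp in the same way as the documented vcf.gz case.

```shell
# Hypothetical helper (not part of haptools): derive the breakpoints path
# from an --out path. Assumption: each supported extension is replaced by
# .bp, as the docs show for .vcf.gz.
breakpoints_path() {
  case "$1" in
    *.vcf.gz) printf '%s.bp\n' "${1%.vcf.gz}" ;;
    *.vcf)    printf '%s.bp\n' "${1%.vcf}" ;;
    *.bcf)    printf '%s.bp\n' "${1%.bcf}" ;;
    *.pgen)   printf '%s.bp\n' "${1%.pgen}" ;;
    *)        echo "unsupported output extension: $1" >&2; return 1 ;;
  esac
}
```

The .vcf.gz pattern is listed first so the compound extension is matched as a whole. Unrecognized extensions return a non-zero status rather than guessing.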