simgenotype
Takes as input a reference set of haplotypes in VCF format and a user-specified admixture model.
Outputs a VCF file with simulated genotype information for admixed genotypes, as well as a breakpoints file that can be used for visualization. For example, you could simulate a 50/50 mixture of CEU and YRI for 10 generations. More complex models, such as those involving pulse events of new populations, can also be simulated.
Basic Usage
haptools simgenotype \
--model MODELFILE \
--mapdir GENETICMAPDIR \
--chroms LIST,OF,CHROMS \
--region CHR:START-END \
--ref_vcf REFVCF \
--sample_info SAMPLEINFOFILE \
--pop_field \
--out /PATH/TO/OUTPUT.VCF.GZ
Detailed information about each option, and example commands using publicly available files, are shown below.
Parameter Descriptions
- --model - Parameters for simulating admixture across generations, including sample size, population fractions, and number of generations.
- --mapdir - Directory containing all .map files. The third position in each .map file is in centiMorgans.
- --out - Full path to the output file, of the form /path/to/output.(vcf|bcf|vcf.gz|pgen). For example, if vcf.gz is chosen, the outputs are /path/to/output.vcf.gz and the breakpoints file /path/to/output.bp
- --chroms - List of chromosomes to be simulated. The map file directory must contain a file with "chr<CHR>" in its name for each listed chromosome, where <CHR> is the chromosome identifier, e.g. 1,2,…,X
- --seed - Seed for randomized calculations during simulation of breakpoints. [Optional]
- --popsize - Size of the population in each generation, from which the simulated samples are drawn. Default = max(10000, 10*samples) [Optional]
- --ref_vcf - Input VCF or PGEN file used to simulate specific haplotypes for the resulting samples
- --sample_info - File used to map samples in REFVCF to populations found in MODELFILE
- --region - Limit the simulation to a region within a single chromosome. Overrides --chroms with the chromosome listed in this region, e.g. 1:1-10000 [Optional]
- --pop_field - Flag for outputting the population field in the VCF output. Note that this flag does not work when your output is in PGEN format. [Optional]
- --sample_field - Flag for outputting the sample field in the VCF output. Note that this flag does not work when your output is in PGEN format. Should only be used for debugging. [Optional]
- --no_replacement - Flag for determining, during VCF generation, whether samples' haplotypes are drawn from the reference VCF file with or without replacement. Default = False (with replacement) [Optional]
- --verbosity - The level of output the logger should print to stdout. Please see logging levels for the available output levels. Default = INFO [Optional]
- --only_breakpoint - Flag which, when provided, outputs only the breakpoint file. Note that you do not need to provide a --ref_vcf or --sample_info file and can instead put NA, e.g. --ref_vcf NA and --sample_info NA [Optional]
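The default given for --popsize above is a simple maximum, which can be worked through by hand. The sketch below assumes that "samples" means the number of simulated samples requested in the model file. The helper name default_popsize is hypothetical and is not part of haptools.

```shell
# Hypothetical helper (not part of haptools): evaluates the documented
# default for --popsize, max(10000, 10 * samples).
default_popsize() {
  samples=$1
  scaled=$((10 * samples))
  if [ "$scaled" -gt 10000 ]; then
    echo "$scaled"
  else
    echo 10000
  fi
}

default_popsize 500    # 10 * 500 = 5000, below the floor, so prints 10000
default_popsize 2500   # 10 * 2500 = 25000, above the floor, so prints 25000
```

Under this reading, the floor of 10000 applies until more than 1000 samples are simulated.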
File Formats
Examples
haptools simgenotype \
--model tests/data/outvcf_gen.dat \
--mapdir tests/data/map/ \
--region 1:1-83000 \
--ref_vcf tests/data/outvcf_test.vcf.gz \
--sample_info tests/data/outvcf_info.tab \
--pop_field \
--out tests/data/example_simgenotype.vcf
If speed is important, it’s generally faster to use PGEN files than VCFs.
haptools simgenotype \
--model tests/data/outvcf_gen.dat \
--mapdir tests/data/map/ \
--region 1:1-83000 \
--ref_vcf tests/data/outvcf_test.pgen \
--sample_info tests/data/outvcf_info.tab \
--pop_field \
--out tests/data/example_simgenotype.pgen
Warning
Writing PGEN files will require more memory than writing VCFs. The memory will depend on the number of simulated samples and variants.
You can reduce the memory required for this step by writing the variants in chunks. Just specify a --chunk-size value.
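As a rough sketch of chunked PGEN writing, the function below repeats the PGEN example with --chunk-size added. The wrapper name simulate_pgen_chunked and the chunk size used in the usage note are illustrative assumptions. Only the haptools options themselves come from this page.

```shell
# Hypothetical wrapper around the PGEN example; only the flags are from the docs.
# $1: number of variants to write per chunk (passed to --chunk-size)
# --pop_field is left out here because it is documented as not working
# with PGEN output.
simulate_pgen_chunked() {
  haptools simgenotype \
    --model tests/data/outvcf_gen.dat \
    --mapdir tests/data/map/ \
    --region 1:1-83000 \
    --ref_vcf tests/data/outvcf_test.pgen \
    --sample_info tests/data/outvcf_info.tab \
    --chunk-size "$1" \
    --out tests/data/example_simgenotype.pgen
}
```

Calling simulate_pgen_chunked 500 would write genotypes 500 variants at a time. A suitable value depends on the number of simulated samples and the memory available.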
All files used in these examples are described here.
Detailed Usage
haptools
haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information
haptools [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
simgenotype
Simulate admixed genomes under a pre-defined model.
haptools simgenotype [OPTIONS]
Options
- --model <model>
Required Admixture model in .dat format. See File Formats under simgenotype in the docs for complete info.
- --mapdir <mapdir>
Required Directory containing genetic map files whose names contain chr{1-22,X} and end in .map.
- --out <out>
Required Path to desired output file, e.g. /path/to/output.vcf.gz. Possible output formats are vcf|bcf|vcf.gz|pgen, and there will be an additional breakpoints output with the extension bp, e.g. /path/to/output.bp.
- --chroms <chroms>
Sorted and comma delimited list of chromosomes to simulate
- --seed <seed>
Random seed. Set to make simulations reproducible
- --ref_vcf <ref_vcf>
Required VCF or PGEN file used as the reference for creating the simulated samples' genotypes.
- --sample_info <sample_info>
Required File that maps samples from the reference VCF (--ref_vcf) to population codes describing the populations in the header of the model file.
- --region <region>
Subset the simulation to a specific region in a chromosome using the form chrom:start-end. Example 2:1000-2000
- --pop_field
Flag for outputting the population field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.
- --sample_field
Flag for outputting the sample field in your VCF output. NOTE this flag does not work when your output file is in PGEN format.
- --no_replacement
Flag used to determine whether to sample reference haplotypes with or without replacement. (Default = Replacement)
- --only_breakpoint
Flag used to determine whether to only output breakpoints or continue to simulate a vcf file.
- -c, --chunk-size <chunk_size>
If requesting a PGEN output file, write genotypes in chunks of X variants; reduces memory
- Default: 'all variants'
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default: 'INFO'
- Options: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
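The options above can be combined into a breakpoints-only run. This is a minimal sketch based on the note in the parameter descriptions that --ref_vcf and --sample_info may be set to NA when --only_breakpoint is given. The wrapper name simulate_breakpoints_only is hypothetical, and the model, map directory, and region are borrowed from the examples on this page.

```shell
# Hypothetical wrapper; only the haptools flags come from the docs above.
# $1: output path (the breakpoints file uses the same path with a bp extension)
# $2: random seed, set so that the simulation is reproducible
simulate_breakpoints_only() {
  haptools simgenotype \
    --model tests/data/outvcf_gen.dat \
    --mapdir tests/data/map/ \
    --region 1:1-83000 \
    --ref_vcf NA \
    --sample_info NA \
    --only_breakpoint \
    --seed "$2" \
    --out "$1"
}
```

For instance, simulate_breakpoints_only tests/data/example_simgenotype.vcf 42 would be expected to write only tests/data/example_simgenotype.bp. The seed value 42 is arbitrary.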