simphenotype
Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal haplotypes or variants by specifying them in a .hap file. Phenotypes are simulated from genotypes output by the transform command.
The implementation is based on the GCTA GWAS Simulation utility.
Usage
haptools simphenotype \
--replications INT \
--heritability FLOAT \
--prevalence FLOAT \
--region TEXT \
--sample SAMPLE --sample SAMPLE \
--samples-file FILENAME \
--id ID --id ID \
--ids-file FILENAME \
--chunk-size INT \
--output PATH \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GENOTYPES HAPLOTYPES
Model
Each normalized haplotype \(\vec{Z_j}\) is encoded as an independent causal variable in a linear model:
where
The heritability \(h^2\) is user-specified, but if it is not provided, then \(\sigma^2\) will be computed purely from the effect sizes, instead:
If a prevalence for the disease is specified, the final \(\vec{y}\) value will be thresholded to produce a binary case/control trait with the desired fraction of diseased individuals.
Input
Genotypes must be specified in VCF and haplotypes must be specified in the .hap file format. If you’d like to encode simple SNPs as causal variants within a .hap file, use the haptools API like in this example.
Note
Your .hap files must contain a “beta” extra field. See this section of the .hap format spec for more details.
Alternatively, you may also specify genotypes in PLINK2 PGEN format. Just use the appropriate “.pgen” file extension in the input. See the documentation for genotypes in the format docs for more information.
Output
Phenotypes are output in the PLINK2-style .pheno file format. If --replications was set to greater than 1, additional columns are output for each simulated trait.
Note
Case/control phenotypes are encoded as 0 (control) + 1 (case) not 1 (control) + 2 (case). The latter is assumed by PLINK2 unless the --1 flag is used (see the PLINK2 docs). Therefore, you must use --1 when providing our .pheno files to PLINK.
Examples
haptools transform tests/data/simple.vcf tests/data/simple.hap | \
haptools simphenotype -o simulated.pheno /dev/stdin tests/data/simple.hap
By default, all of the haplotypes in the .hap file will be encoded as causal variables. Alternatively, you can select the causal variables manually via the --id or --ids-file parameters.
haptools transform tests/data/simple.vcf tests/data/simple.hap | \
haptools simphenotype --id 'H1' /dev/stdin tests/data/simple.hap
To simulate ancestry-specific effects from a genotypes file with population labels, use the --ancestry switch when running transform.
haptools transform --ancestry tests/data/simple-ancestry.vcf tests/data/simple.hap | \
haptools simphenotype --id 'H1' /dev/stdin tests/data/simple.hap
If speed is important, it’s generally faster to use PGEN files than VCFs.
haptools transform -o simple-haps.pgen tests/data/simple.pgen tests/data/simple.hap
haptools simphenotype --id 'H1' simple-haps.pgen tests/data/simple.hap
Let’s simulate two replicates of a case/control trait that occurs in 60% of samples with a heritability of 0.8. We’ll encode only two of the haplotypes in tests/data/simphenotype.hap as independent causal variables.
haptools transform tests/data/example.vcf.gz tests/data/simphenotype.hap | \
haptools simphenotype \
--replications 2 \
--heritability 0.8 \
--prevalence 0.6 \
--id 'chr21.q.3365*10' \
--id 'chr21.q.3365*11' \
--output simulated.pheno \
/dev/stdin tests/data/simphenotype.hap
All files used in these examples are described here.
Detailed Usage
haptools
haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information
haptools [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
simphenotype
Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.
GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec
Note: GENOTYPES must be the output from the transform subcommand.
haptools simphenotype [OPTIONS] GENOTYPES HAPLOTYPES
Options
- -r, --replications <replications>
Number of rounds of simulation to perform
- Default
1
- -h, --heritability <heritability>
Trait heritability
- -p, --prevalence <prevalence>
Disease prevalence if simulating a case-control trait
- Default
quantitative trait
- --normalize, --no-normalize
Whether to normalize the genotypes before using them for simulation
- Default
True
- --region <region>
The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files
- Default
all haplotypes
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default
all samples
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default
all samples
- -i, --id <ids>
A list of the haplotype IDs from the .hap file to use as causal variables (ex: ‘-i H1 -i H2’).
- Default
all haplotypes
- -I, --ids-file <ids_file>
A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file
- Default
all haplotypes
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default
all variants
- --seed <seed>
Use this option across executions to make the output reproducible
- Default
chosen randomly
- -o, --output <output>
A TSV file containing simulated phenotypes
- Default
stdout
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default
INFO- Options
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument