simphenotype

Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal haplotypes or variants by specifying them in a .hap file. Phenotypes are simulated from genotypes output by the transform command.

The implementation is based on the GCTA GWAS Simulation utility.

Usage

haptools simphenotype \
--replications INT \
--heritability FLOAT \
--prevalence FLOAT \
--region TEXT \
--sample SAMPLE --sample SAMPLE \
--samples-file FILENAME \
--id ID --id ID \
--ids-file FILENAME \
--chunk-size INT \
--output PATH \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GENOTYPES HAPLOTYPES

Model

Each normalized haplotype \(\vec{Z_j}\) is encoded as an independent causal variable in a linear model:

\[\vec{y} = \sum_j \beta_j \vec{Z_j} + \vec \epsilon\]

where

\[\epsilon_i \sim N(0, \sigma^2)\]
\[\sigma^2 = Var[\sum_j \beta_j \vec{Z_j}] * (\frac 1 {h^2} - 1)\]

The heritability \(h^2\) is user-specified, but if it is not provided, then \(\sigma^2\) will be computed purely from the effect sizes, instead:

\[\sigma^2 = \Biggl \lbrace {1 - \sum \beta_j^2 \quad \quad {\sum \beta_j^2 \le 1} \atop 0 \quad \quad \quad \quad \quad \text{ otherwise }}\]

If a prevalence for the disease is specified, the final \(\vec{y}\) value will be thresholded to produce a binary case/control trait with the desired fraction of diseased individuals.

Input

Genotypes must be specified in VCF and haplotypes must be specified in the .hap file format. If you’d like to encode simple SNPs as causal variants within a .hap file, use the haptools API like in this example.

Note

Your .hap files must contain a “beta” extra field. See this section of the .hap format spec for more details.

Alternatively, you may also specify genotypes in PLINK2 PGEN format. Just use the appropriate “.pgen” file extension in the input. See the documentation for genotypes in the format docs for more information.

Output

Phenotypes are output in the PLINK2-style .pheno file format. If --replications was set to greater than 1, additional columns are output for each simulated trait.

Note

Case/control phenotypes are encoded as 0 (control) + 1 (case) not 1 (control) + 2 (case). The latter is assumed by PLINK2 unless the --1 flag is used (see the PLINK2 docs). Therefore, you must use --1 when providing our .pheno files to PLINK.

Examples

haptools transform tests/data/simple.vcf tests/data/simple.hap | \
haptools simphenotype -o simulated.pheno /dev/stdin tests/data/simple.hap

By default, all of the haplotypes in the .hap file will be encoded as causal variables. Alternatively, you can select the causal variables manually via the --id or --ids-file parameters.

haptools transform tests/data/simple.vcf tests/data/simple.hap | \
haptools simphenotype --id 'H1' /dev/stdin tests/data/simple.hap

To simulate ancestry-specific effects from a genotypes file with population labels, use the --ancestry switch when running transform.

haptools transform --ancestry tests/data/simple-ancestry.vcf tests/data/simple.hap | \
haptools simphenotype --id 'H1' /dev/stdin tests/data/simple.hap

If speed is important, it’s generally faster to use PGEN files than VCFs.

haptools transform -o simple-haps.pgen tests/data/simple.pgen tests/data/simple.hap
haptools simphenotype --id 'H1' simple-haps.pgen tests/data/simple.hap

Let’s simulate two replicates of a case/control trait that occurs in 60% of samples with a heritability of 0.8. We’ll encode only two of the haplotypes in tests/data/simphenotype.hap as independent causal variables.

haptools transform tests/data/example.vcf.gz tests/data/simphenotype.hap | \
haptools simphenotype \
--replications 2 \
--heritability 0.8 \
--prevalence 0.6 \
--id 'chr21.q.3365*10' \
--id 'chr21.q.3365*11' \
--output simulated.pheno \
/dev/stdin tests/data/simphenotype.hap

All files used in these examples are described here.

Detailed Usage

haptools

haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information

haptools [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

simphenotype

Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a set of haplotypes.

GENOTYPES must be formatted as a VCF or PGEN file and HAPLOTYPES must be formatted according to the .hap format spec

Note: GENOTYPES must be the output from the transform subcommand.

haptools simphenotype [OPTIONS] GENOTYPES HAPLOTYPES

Options

-r, --replications <replications>

Number of rounds of simulation to perform

Default

1

-h, --heritability <heritability>

Trait heritability

-p, --prevalence <prevalence>

Disease prevalence if simulating a case-control trait

Default

quantitative trait

--normalize, --no-normalize

Whether to normalize the genotypes before using them for simulation

Default

True

--region <region>

The region from which to extract haplotypes; ex: ‘chr1:1234-34566’ or ‘chr7’. For this to work, the VCF and .hap file must be indexed and the seqname provided must correspond with one in the files

Default

all haplotypes

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default

all samples

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default

all samples

-i, --id <ids>

A list of the haplotype IDs from the .hap file to use as causal variables (ex: ‘-i H1 -i H2’).

Default

all haplotypes

-I, --ids-file <ids_file>

A single column txt file containing a list of the haplotype IDs (one per line) to subset from the .hap file

Default

all haplotypes

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default

all variants

--seed <seed>

Use this option across executions to make the output reproducible

Default

chosen randomly

-o, --output <output>

A TSV file containing simulated phenotypes

Default

stdout

-v, --verbosity <verbosity>

The level of verbosity desired

Default

INFO

Options

CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES

Required argument

HAPLOTYPES

Required argument