clump

Clump a set of variants specified in a .linear file.

The clump command creates a clump file that groups SNPs or STRs that are in LD with one another.

Usage

haptools clump \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
--summstats-snps PATH \
--gts-snps PATH \
--summstats-strs PATH \
--gts-strs PATH \
--clump-field TEXT \
--clump-id-field TEXT \
--clump-chrom-field TEXT \
--clump-pos-field TEXT \
--clump-p1 FLOAT \
--clump-p2 FLOAT \
--clump-r2 FLOAT \
--clump-kb FLOAT \
--ld [Exact|Pearson] \
--out PATH

Examples

haptools clump \
  --summstats-snps tests/data/test_snpstats.linear \
  --gts-snps tests/data/simple.vcf \
  --clump-id-field ID \
  --clump-chrom-field CHROM \
  --clump-pos-field POS \
  --out test_snps.clump

You can use --ld [Exact|Pearson] to choose which type of LD calculation to perform. Exact uses an exact cubic solution adapted from CubeX, whereas Pearson uses a Pearson R calculation. Note that Exact only works on SNPs and not on any other variant type, e.g. STRs. The default value is Pearson.

haptools clump \
  --summstats-snps tests/data/test_snpstats.linear \
  --gts-snps tests/data/simple.vcf \
  --clump-id-field ID \
  --clump-chrom-field CHROM \
  --clump-pos-field POS \
  --ld Exact \
  --out test_snps.clump

You can modify the thresholds and values used in the clumping process. --clump-p1 is the largest p-value a variant can have and still be considered as the index variant of a clump. --clump-p2 is the maximum p-value any variant can have to be considered during clumping. --clump-r2 is the R squared threshold: a candidate variant whose R squared with the index variant is greater than this value is considered to be in LD with the index variant. --clump-kb is the maximum distance, in kilobases, upstream or downstream of the index variant within which candidate variants are collected for LD comparison. For example, --clump-kb 100 means that all variants up to 100 kb upstream and 100 kb downstream of the index variant will be considered.

haptools clump \
  --summstats-snps tests/data/test_snpstats.linear \
  --gts-snps tests/data/simple.vcf \
  --clump-id-field ID \
  --clump-chrom-field CHROM \
  --clump-pos-field POS \
  --clump-p1 0.001 \
  --clump-p2 0.05 \
  --clump-r2 0.7 \
  --clump-kb 200.5 \
  --out test_snps.clump
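
How the four thresholds interact can be sketched as a greedy procedure over variants sorted by p-value. This is a simplified illustration under stated assumptions, not the haptools source: the function greedy_clump, the record layout, the variant IDs and the LD values are all hypothetical, and treating the p-value cutoffs as inclusive is an assumption.

```python
def greedy_clump(variants, r2_fn, *, p1, p2, r2_min, kb):
    """Greedy clumping sketch (hypothetical; not the haptools implementation).

    variants: list of dicts with keys "id", "chrom", "pos", "p"
    r2_fn:    callable (id_a, id_b) -> R squared between two variants
    """
    window = kb * 1000  # convert kb to base pairs
    # p2: variants above this p-value take no part in clumping
    eligible = sorted((v for v in variants if v["p"] <= p2), key=lambda v: v["p"])
    assigned = set()
    clumps = []
    for index in eligible:
        if index["id"] in assigned:
            continue  # already absorbed into an earlier clump
        if index["p"] > p1:
            break  # p1: sorted by p-value, so no later variant can be an index
        assigned.add(index["id"])
        members = []
        for cand in eligible:
            if cand["id"] in assigned or cand["chrom"] != index["chrom"]:
                continue
            if abs(cand["pos"] - index["pos"]) > window:
                continue  # outside the clump-kb radius
            if r2_fn(index["id"], cand["id"]) > r2_min:
                members.append(cand["id"])
                assigned.add(cand["id"])
        clumps.append((index["id"], members))
    return clumps


# Made-up summary statistics and pairwise LD values
variants = [
    {"id": "rs1", "chrom": "1", "pos": 10000, "p": 1e-6},
    {"id": "rs2", "chrom": "1", "pos": 60000, "p": 1e-3},
    {"id": "rs3", "chrom": "1", "pos": 500000, "p": 1e-5},
    {"id": "rs4", "chrom": "1", "pos": 20000, "p": 0.2},
]
ld = {frozenset(("rs1", "rs2")): 0.9, frozenset(("rs1", "rs3")): 0.8}


def r2_lookup(a, b):
    return ld.get(frozenset((a, b)), 0.0)


clumps = greedy_clump(variants, r2_lookup, p1=1e-4, p2=0.05, r2_min=0.7, kb=100)
```

In this toy input, rs4 is dropped by the p2 filter, rs2 joins the clump led by rs1, and rs3 starts its own clump because it lies outside the 100 kb radius of rs1 even though the two are in high LD.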

You can also input STRs when calculating clumps. They can be used together with SNPs or alone.

haptools clump \
  --summstats-strs tests/data/test_strstats.linear \
  --gts-strs tests/data/simple_tr.vcf \
  --summstats-snps tests/data/test_snpstats.linear \
  --gts-snps tests/data/simple.vcf \
  --clump-id-field ID \
  --clump-chrom-field CHROM \
  --clump-pos-field POS \
  --ld Exact \
  --out test_snps.clump

haptools clump \
  --summstats-strs tests/data/test_strstats.linear \
  --gts-strs tests/data/simple_tr.vcf \
  --clump-id-field ID \
  --clump-chrom-field CHROM \
  --clump-pos-field POS \
  --ld Exact \
  --out test_snps.clump

All files used in these examples are described here.

Detailed Usage

haptools

haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information

haptools [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

clump

Performs clumping on datasets containing SNPs, STRs, or both. Clumping is the process of identifying SNPs or STRs that are highly correlated with one another and grouping them into a single “clump” so that the same effect is not counted repeatedly due to LD.

haptools clump [OPTIONS]

Options

--summstats-snps <summstats_snps>

File from which to load SNP summary statistics

--summstats-strs <summstats_strs>

File from which to load STR summary statistics

--gts-snps <gts_snps>

SNP genotypes (VCF or PGEN)

--gts-strs <gts_strs>

STR genotypes (VCF)

--clump-p1 <clump_p1>

Maximum p-value for a variant to start a new clump

--clump-p2 <clump_p2>

Only consider variants with a p-value less than this value

--clump-id-field <clump_id_field>

Column header of the variant ID

--clump-field <clump_field>

Column header of the p-values

--clump-chrom-field <clump_chrom_field>

Column header of the chromosome

--clump-pos-field <clump_pos_field>

Column header of the position

--clump-kb <clump_kb>

Clump radius in kb

--clump-r2 <clump_r2>

r^2 threshold

--ld <ld>

Calculation type to infer LD, Exact Solution or Pearson R. (Exact|Pearson). Note the Exact Solution works best when all three genotypes are present (0,1,2) in the variants being compared.

Default:

Pearson

Options:

Exact | Pearson

--out <out>

Required. Output filename

-v, --verbosity <verbosity>

The level of verbosity desired

Default:

INFO

Options:

CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET