clump
Clump a set of variants specified in a .linear file.
The clump command creates a clump file joining SNPs or STRs in LD with one another.
Usage
haptools clump \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
--summstats-snps PATH \
--gts-snps PATH \
--summstats-strs PATH \
--gts-strs PATH \
--clump-field TEXT \
--clump-id-field TEXT \
--clump-chrom-field TEXT \
--clump-pos-field TEXT \
--clump-p1 FLOAT \
--clump-p2 FLOAT \
--clump-r2 FLOAT \
--clump-kb FLOAT \
--ld [Exact|Pearson] \
--out PATH
Examples
haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--out test_snps.clump
You can use --ld [Exact|Pearson] to choose which type of LD calculation to perform. Exact uses an exact cubic solution adopted from CubeX, whereas Pearson uses a Pearson R calculation. Note that Exact only works on SNPs and not on any other variant type, e.g. STRs. The default value is Pearson.
haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Exact \
--out test_snps.clump
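To make the Pearson option more concrete, the following is a minimal sketch of a squared Pearson correlation between two genotype vectors. It is an illustration only and not the haptools implementation: the function name pearson_r2 and the dosage values are hypothetical, and genotypes are assumed to be coded as alternate-allele counts (0, 1 or 2).

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A variant with no variation has no defined correlation
        return float("nan")
    return cov * cov / (var_x * var_y)


# Hypothetical alternate-allele counts for six samples
index_variant = [0, 1, 2, 1, 0, 2]
candidate_variant = [0, 1, 2, 2, 0, 1]
r2 = pearson_r2(index_variant, candidate_variant)
print(r2)  # 0.5625
```

A value like this would then be compared against the --clump-r2 threshold to decide whether the candidate variant joins the index variant's clump.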
You can modify the thresholds and values used in the clumping process:
- --clump-p1 is the largest p-value a variant can have to be considered as the index variant of a clump.
- --clump-p2 is the maximum p-value any variant can have to be considered when clumping.
- --clump-r2 is the R squared threshold. A candidate variant whose R squared with the index variant is greater than this value is considered to be in LD with the index variant.
- --clump-kb is the maximum distance, in kb, upstream or downstream from the index variant within which candidate variants are collected for LD comparison. For example, --clump-kb 100 means that all variants within 100 kb upstream and 100 kb downstream of the index variant will be considered.
haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--clump-p1 0.001 \
--clump-p2 0.05 \
--clump-r2 0.7 \
--clump-kb 200.5 \
--out test_snps.clump
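The way these four thresholds interact can be sketched as a simple greedy loop. This is a simplified illustration under stated assumptions, not the haptools implementation: the function greedy_clump, the variant records, and the R squared lookup table are all hypothetical, and the handling of values exactly equal to a p-value threshold is an assumption.

```python
def greedy_clump(variants, r2_lookup, clump_p1, clump_p2, clump_r2, clump_kb):
    """Group variants into clumps as (index_id, [member_ids]) tuples."""
    # Only variants passing the p2 filter take part, most significant first
    candidates = sorted(
        (v for v in variants if v["p"] <= clump_p2), key=lambda v: v["p"]
    )
    assigned = set()
    clumps = []
    for index in candidates:
        # An index variant must be unassigned and pass the stricter p1 filter
        if index["id"] in assigned or index["p"] > clump_p1:
            continue
        assigned.add(index["id"])
        members = []
        for other in candidates:
            if other["id"] in assigned or other["chrom"] != index["chrom"]:
                continue
            # Keep only candidates within clump_kb kb on either side
            if abs(other["pos"] - index["pos"]) > clump_kb * 1000:
                continue
            # In LD with the index variant if R squared exceeds the threshold
            if r2_lookup(index["id"], other["id"]) > clump_r2:
                members.append(other["id"])
                assigned.add(other["id"])
        clumps.append((index["id"], members))
    return clumps


# Hypothetical summary statistics and pairwise R squared values
variants = [
    {"id": "snp1", "chrom": "1", "pos": 100000, "p": 1e-5},
    {"id": "snp2", "chrom": "1", "pos": 150000, "p": 0.01},
    {"id": "snp3", "chrom": "1", "pos": 250000, "p": 0.02},
    {"id": "snp4", "chrom": "1", "pos": 400000, "p": 0.0005},
    {"id": "snp5", "chrom": "1", "pos": 120000, "p": 0.2},
]
r2_table = {
    frozenset(("snp1", "snp2")): 0.9,
    frozenset(("snp1", "snp3")): 0.2,
    frozenset(("snp1", "snp4")): 0.95,
}


def lookup(a, b):
    return r2_table.get(frozenset((a, b)), 0.0)


clumps = greedy_clump(variants, lookup, 0.001, 0.05, 0.7, 200.5)
print(clumps)  # [('snp1', ['snp2']), ('snp4', [])]
```

With the threshold values from the command above, snp5 is dropped by the p2 filter, snp3 is too weakly correlated to join a clump and not significant enough to start one, and snp4 starts its own clump because it lies more than 200.5 kb from snp1 despite its high R squared.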
You can also input STRs when calculating clumps. They can be used together with SNPs or alone.
haptools clump \
--summstats-strs tests/data/test_strstats.linear \
--gts-strs tests/data/simple_tr.vcf \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Pearson \
--out test_snps.clump
haptools clump \
--summstats-strs tests/data/test_strstats.linear \
--gts-strs tests/data/simple_tr.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Pearson \
--out test_snps.clump
All files used in these examples are described here.
Detailed Usage
haptools
haptools: A toolkit for simulating and analyzing genotypes and phenotypes while taking into account haplotype information
haptools [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
clump
Performs clumping on datasets containing SNPs, STRs, or both. Clumping is the process of identifying SNPs or STRs that are highly correlated with one another and grouping them into a single “clump” so that the same effect size is not repeated due to LD.
haptools clump [OPTIONS]
Options
- --summstats-snps <summstats_snps>
File from which to load SNP summary statistics
- --summstats-strs <summstats_strs>
File from which to load STR summary statistics
- --gts-snps <gts_snps>
SNP genotypes (VCF or PGEN)
- --gts-strs <gts_strs>
STR genotypes (VCF)
- --clump-p1 <clump_p1>
Maximum p-value for a variant to start a new clump
- --clump-p2 <clump_p2>
Only consider variants with a p-value less than this value
- --clump-id-field <clump_id_field>
Column header of the variant ID
- --clump-field <clump_field>
Column header of the p-values
- --clump-chrom-field <clump_chrom_field>
Column header of the chromosome
- --clump-pos-field <clump_pos_field>
Column header of the position
- --clump-kb <clump_kb>
Clump radius in kb
- --clump-r2 <clump_r2>
r^2 threshold
- --ld <ld>
Calculation type used to infer LD: Exact Solution or Pearson R (Exact|Pearson). Note that the Exact Solution works best when all three genotypes (0, 1, 2) are present in the variants being compared.
- Default:
Pearson
- Options:
Exact | Pearson
- --out <out>
Required. Output filename
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
INFO
- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
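The --clump-id-field, --clump-chrom-field, --clump-pos-field and --clump-field options each name a column header in the summary statistics file. The sketch below shows one way such header names could be used to pull the relevant columns out of a file. It is not how haptools parses its input: the function load_summstats is hypothetical, the file is assumed to be tab-delimited with a single header row, and the p-value column name P is an assumption.

```python
import csv
import io


def load_summstats(handle, id_field, chrom_field, pos_field, p_field):
    """Read the ID, chromosome, position and p-value columns by header name."""
    reader = csv.DictReader(handle, delimiter="\t")
    records = []
    for row in reader:
        records.append(
            {
                "id": row[id_field],
                "chrom": row[chrom_field],
                "pos": int(row[pos_field]),
                "p": float(row[p_field]),
            }
        )
    return records


# Hypothetical in-memory summary statistics; the column names are assumptions
example = io.StringIO(
    "CHROM\tPOS\tID\tP\n"
    "1\t10114\tsnp1\t0.0004\n"
    "1\t10116\tsnp2\t0.03\n"
)
records = load_summstats(example, "ID", "CHROM", "POS", "P")
print(records[0])  # {'id': 'snp1', 'chrom': '1', 'pos': 10114, 'p': 0.0004}
```

If a file used different headers, only the four header names passed to the function would need to change, which mirrors the role of the field options on the command line.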