ATAV (Association Tests for Annotated Variants)

Whole genome/exome association analysis toolset for annotated variants in next-generation sequencing studies

  1. Introduction
  2. Definitions
  3. Availability
  4. Analysis on Single Variants
  5. Analysis on Group of Variants
  6. Population Stratification
  7. Trios Analysis
  8. Linkage Analysis
  9. Citations
  1. Introduction

    ATAV (Association Tests for Annotated Variants) is a statistical toolset designed to detect rare genetic variants associated with complex diseases. It performs association analysis on annotated variants derived from whole-genome or whole-exome sequencing data using SVA (SequenceVariantAnalyzer) or other annotation programs.

    ATAV is developed by Dr. Max M. He et al. This toolset is free to the academic community. Please feel free to contact Max with any questions, feedback, or bug reports at maxy dot he at gmail dot com.

    We are interested to hear your comments and feedback as features and improvements will be added in future releases. If you have ideas for improving ATAV or features you wish to add, we will be glad to hear about them.

  2. Definitions

    • Variant: a "variant" is defined as a difference from the human reference genome sequence. A "heterozygote" is a carrier of one copy of the variant allele and one copy of the reference allele, and a "homozygote" is a carrier of two copies of the variant allele. The "variant allele" is not necessarily the "minor" (as opposed to "major") allele, or the "derived" (as opposed to "ancestral") allele.
    • Genetic models: all genetic models discussed in this document refer to the "variant allele". For example, a dominant model tests whether the "variant allele" (A) is dominant over the "reference allele" (a): AA+Aa vs. aa. A recessive model tests AA vs. Aa+aa. These models do not refer to minor/major alleles or to derived/ancestral alleles.
    • Sex chromosome variants: by default, heterozygous genotypes on the X chromosome in males (male X hets) are treated as missing, which drops only the affected genotypes for the variant. If "--exclude-male-het" is specified, the variant is instead excluded from the analysis, which drops all genotypes for the variant. Variants in the pseudoautosomal regions of the sex chromosomes are treated like autosomal variants.
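
    As an illustration of the genetic models defined above, the following Python sketch builds the 2x2 case/control table for each model from genotype counts. The function name model_table and the (AA, Aa, aa) count layout are hypothetical and chosen for this example; they are not part of ATAV, and the counts are made up.

```python
def model_table(case_counts, ctrl_counts, model):
    """Return a 2x2 table [[case_x, case_y], [ctrl_x, ctrl_y]].

    Each counts tuple is (AA, Aa, aa), where A is the variant allele
    and a is the reference allele, as defined in this section.
    """
    def split(counts):
        hom_var, het, hom_ref = counts
        if model == "allelic":      # variant alleles vs. reference alleles
            return [2 * hom_var + het, 2 * hom_ref + het]
        if model == "dominant":     # AA+Aa vs. aa
            return [hom_var + het, hom_ref]
        if model == "recessive":    # AA vs. Aa+aa
            return [hom_var, het + hom_ref]
        raise ValueError(f"unknown model: {model}")

    return [split(case_counts), split(ctrl_counts)]


cases = (2, 10, 88)      # made-up genotype counts (AA, Aa, aa)
controls = (0, 4, 96)
print(model_table(cases, controls, "dominant"))   # [[12, 88], [4, 96]]
```

    The same genotype counts give a different table under each model, which is why the choice of model changes the test result for a variant.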

  3. Availability

    ATAV is a command-line toolset distributed in executable format. It has been set up on the CHGV server.

    Users at the CHGV can use it directly by typing a command in a terminal. For example,
    $ atav.sh --memory 50000 --project $PROJECT.gsap --fisher --min-coverage 3 --min-variant-present 2 --ctrlMAF 0.05 --snvFunctionList STOP_GAINED,STOP_LOST,ESSENTIAL_SPLICE_SITE,NON_SYNONYMOUS_CODING --indelFunctionList CODING_DISRUPTED_FRAMESHIFT,CODING_DISRUPTED_OTHER --out $OUTPUT

  4. Analysis on Single Variants

    A Fisher's exact test is implemented to analyze single variants. It lets users screen for variants whose frequencies differ between cases and controls. Users can specify whether the test compares the variant allele frequency in cases and controls, or the frequency of selected genotypes (for example, in a dominant model, the combined count of homozygotes (AA) and heterozygotes (Aa) for a given variant is compared between cases and controls).
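
    To make the test concrete, here is a minimal Python sketch of a two-sided Fisher's exact test on a 2x2 table, using only the standard library. It follows the common convention of summing the probabilities of all tables that are no more likely than the observed one; whether ATAV uses exactly this convention is an assumption, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows are cases and controls; columns are the two categories being
    compared (e.g. variant vs. reference alleles in the allelic model).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over all tables with the same margins that are no more
    # probable than the observed table (small tolerance for rounding)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))   # 0.4857
```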

    • A Fisher's exact test:

      This analysis performs a Fisher's exact test with allelic, dominant, recessive, and genotypic models. For X chromosome variants, only the allelic model is tested.

      Command line (note that the leading "$" symbol indicates the system prompt; do not type it in your ATAV command line):
      $ atav.sh --memory 50000 --project $PROJECT.gsap --fisher --min-coverage 3 --min-variant-present 2 --ctrlMAF 0.05 --snvFunctionList STOP_GAINED,STOP_LOST,ESSENTIAL_SPLICE_SITE,NON_SYNONYMOUS_CODING --indelFunctionList CODING_DISRUPTED_FRAMESHIFT,CODING_DISRUPTED_OTHER --out $OUTPUT

      [More Input Parameters]

    • Logistic/Linear regression:

      This feature runs association tests that incorporate eigenvectors and/or other relevant components as covariates.

      Type a command in a terminal as follows:
      $ atav.sh --memory 50000 --project $PROJECT.gsap --single-variant --min-coverage 3 --min-variant-present 2 --ctrlMAF 0.05 --snvFunctionList STOP_GAINED,STOP_LOST,ESSENTIAL_SPLICE_SITE,NON_SYNONYMOUS_CODING --indelFunctionList CODING_DISRUPTED_FRAMESHIFT,CODING_DISRUPTED_OTHER --out $OUTPUT

      [More Input Parameters]

  5. Analysis on Group of Variants

    This method collapses the genotypes across all variants within a testing unit (e.g. a gene), such that an individual is coded as 1 if a rare allele is present at any of the variant sites and as 0 otherwise. We use this type of analysis to test the hypothesis that cases carry an excess of some class of variants (not necessarily one single variant) in the same gene compared with controls, as is often observed in Mendelian diseases. For the following collapsing methods, we simply "collapse" the qualified variants in the same gene into a single category (indicator variable), and then test this collapsed set of variants for association. Users may define which variants are collapsed and tested, both in terms of functional groupings and in terms of the allele frequency threshold in controls.

    The allele frequency threshold in controls depends on the project design. For most projects where population (as opposed to "extreme") controls are included, we recommend a maximum variant allele frequency in controls of 0.05. That is, only variants with a variant allele frequency no higher than 0.05 in controls are collapsed. For projects where the cases and controls are both extremes, the analysis needs a special design, for example an allele frequency threshold derived from a "real" control population.
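
    The collapsing step described above can be sketched in Python as follows. This is an illustration only: the function name collapse_gene, the data layout, and the handling of missing genotypes are assumptions, and the sketch covers only the dominant-style coding (1 if any qualified variant allele is present) with the control allele frequency threshold. Functional filtering is not shown.

```python
def collapse_gene(genotypes, is_case, max_ctrl_freq=0.05):
    """Collapse qualified variants in one gene into a 0/1 indicator.

    genotypes: one list per variant, holding the variant-allele count
               (0, 1 or 2) for each individual, or None if missing.
    is_case:   one bool per individual (True = case, False = control).
    """
    indicator = [0] * len(is_case)
    for variant in genotypes:
        # Variant allele frequency among controls with a called genotype
        ctrl = [g for g, case in zip(variant, is_case)
                if not case and g is not None]
        ctrl_freq = sum(ctrl) / (2 * len(ctrl)) if ctrl else 0.0
        if ctrl_freq > max_ctrl_freq:
            continue  # too common in controls, so not collapsed
        for i, g in enumerate(variant):
            if g:  # at least one variant allele (None and 0 are skipped)
                indicator[i] = 1
    return indicator


is_case = [True] * 2 + [False] * 10
gene = [
    [1, 0] + [0] * 10,              # absent in controls: collapsed
    [0, 1] + [1] + [0] * 9,         # control frequency 0.05: collapsed
    [0, 0] + [0, 1, 1] + [0] * 7,   # control frequency 0.10: skipped
]
print(collapse_gene(gene, is_case))
# [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

    The resulting indicator can then be tabulated against case/control status and tested for association as a single collapsed variant.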

    • A collapsing method with dominant model - command line as follows:
      $ atav.sh --memory 50000 --project $PROJECT.gsap --collapsing --min-coverage 3 --min-variant-present 2 --ctrlMAF 0.05 --snvFunctionList STOP_GAINED,STOP_LOST,ESSENTIAL_SPLICE_SITE,NON_SYNONYMOUS_CODING --indelFunctionList CODING_DISRUPTED_FRAMESHIFT,CODING_DISRUPTED_OTHER --permute --mperm 1000 --out $OUTPUT

      [More Input Parameters]

    • A collapsing method with recessive model - command line as follows:
      $ atav.sh --memory 50000 --project $PROJECT.gsap --collapsing --recessive --min-coverage 3 --min-variant-present 2 --ctrlMAF 0.05 --snvFunctionList STOP_GAINED,STOP_LOST,ESSENTIAL_SPLICE_SITE,NON_SYNONYMOUS_CODING --indelFunctionList CODING_DISRUPTED_FRAMESHIFT,CODING_DISRUPTED_OTHER --permute --mperm 1000 --out $OUTPUT

      [More Input Parameters]

    • A collapsing method with compound hets - command line as follows:
      $ atav.sh --memory 50000 --project $PROJECT.gsap --collapsing-comp-het --min-coverage 3 --min-variant-present 2 --ctrlMAF 0.05 --snvFunctionList STOP_GAINED,STOP_LOST,ESSENTIAL_SPLICE_SITE,NON_SYNONYMOUS_CODING --indelFunctionList CODING_DISRUPTED_FRAMESHIFT,CODING_DISRUPTED_OTHER --permute --mperm 1000 --out $OUTPUT

      [More Input Parameters]

  6. Population Stratification

    Population stratification is a well-known problem in genome-wide association studies (GWAS): an observed association may reflect underlying subpopulations rather than disease-associated loci. True disease-causing loci may also be missed if they are less prevalent in the studied population. If disease burden differs between subpopulations, population stratification can produce false-positive associations between the disease and genetic variants. It is expected to be a more severe issue with rare variants than with common variants in next-generation sequencing studies.

    ATAV generates PED/MAP files from a whole-exome or whole-genome dataset. The generated data are then pruned by excluding variants in high-LD regions, removing low-frequency variants, excluding variants that violate Hardy-Weinberg equilibrium (HWE), LD pruning, and checking for cryptic relatedness. ATAV then calculates eigenvalues and eigenvectors by principal component analysis.
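
    As a rough illustration of the final step, the following Python sketch extracts the leading principal component of a small genotype matrix by power iteration. It is a toy example under stated assumptions: the function name and data are made up, the pruning steps are omitted, and ATAV's actual PCA implementation is not described here and may differ (for instance in how genotypes are normalized).

```python
def top_principal_component(genotypes, iterations=200):
    """Return (eigenvalue, eigenvector) of the leading component.

    genotypes: one row per sample, one column per variant, with
               variant-allele counts 0, 1 or 2 and no missing values.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center each variant (column) on its mean genotype
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Sample-by-sample covariance matrix
    k = [[sum(a * b for a, b in zip(x[i], x[j])) / m for j in range(n)]
         for i in range(n)]
    # Power iteration from a fixed, non-uniform start vector
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        w = [sum(k[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:
            break  # no variation among samples
        v = [c / norm for c in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v
    eigenvalue = sum(v[i] * sum(k[i][j] * v[j] for j in range(n))
                     for i in range(n))
    return eigenvalue, v


samples = [            # made-up data: two groups of three samples
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 0, 2, 1],
    [0, 1, 2, 2],
]
eigenvalue, pc1 = top_principal_component(samples)
print([round(c, 2) for c in pc1])
```

    In this toy dataset the first three samples and the last three samples receive loadings of opposite sign on the first component, which is the kind of separation that eigenvectors capture when they are used as covariates.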

    Command line as follows:
    $ ./atav.sh --memory 50000 --project $PROJECT.gsap --ped-map --pop-strata --out $OUTPUT


  7. Trios Analysis

    This analysis was designed for the Epi4K project (http://www.epgp.org/epi4k/), which studies the genetic basis of human epilepsy in order to improve our understanding of the biology of epilepsy and to develop new directions for its treatment. It detects (1) de novo mutations, (2) newly homozygous/recessive mutations, and (3) compound heterozygosity in the 700 trios of the Epi4K project.
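
    The three patterns can be sketched in Python using textbook definitions on autosomal genotypes. This is an assumption-laden illustration, not ATAV's screening logic: the function names are hypothetical, and the coverage and frequency filters used in the real screens are not shown.

```python
def classify_trio_variant(child, mother, father):
    """Classify one autosomal variant by variant-allele counts (0, 1, 2)."""
    if child >= 1 and mother == 0 and father == 0:
        return "de novo"            # child carries it, neither parent does
    if child == 2 and mother == 1 and father == 1:
        return "newly homozygous"   # both parents het, child homozygous
    return None


def has_compound_het(variants):
    """True if one gene holds child hets inherited from different parents.

    variants: list of (child, mother, father) tuples for a single gene.
    """
    from_mother = any(c == 1 and m >= 1 and f == 0 for c, m, f in variants)
    from_father = any(c == 1 and f >= 1 and m == 0 for c, m, f in variants)
    return from_mother and from_father


print(classify_trio_variant(1, 0, 0))              # de novo
print(classify_trio_variant(2, 1, 1))              # newly homozygous
print(has_compound_het([(1, 1, 0), (1, 0, 1)]))    # True
```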

    • De novo screens for trios - command line as follows:
      $ atav.sh --memory 50000 --project $PROJECT.gsap --list-trio-denovo --min-coverage-screen 10 --min-case-cov-screen 0 --min-ctrl-cov-screen 10 --ctrlMAF 0.05 --snvFunctionList $SNVLIST --indelFunctionList $INDELLIST --out $OUTPUT

      [More Input Parameters]

    • Newly homozygous screens for trios - command line as follows:
      $ atav.sh --memory 50000 --project $PROJECT.gsap --list-trio-denovo --min-coverage-screen 10 --min-case-cov-screen 0 --min-ctrl-cov-screen 10 --ctrlMAF 0.05 --snvFunctionList $SNVLIST --indelFunctionList $INDELLIST --out $OUTPUT

      [More Input Parameters]

    • Compound het screens for trios - command line as follows:
      $ atav.sh --memory 50000 --project $PROJECT.gsap --list-trio-comp-het --min-coverage-screen 10 --min-case-cov-screen 0 --min-ctrl-cov-screen 10 --ctrlMAF 0.03 --snvFunctionList STOP_GAINED,STOP_LOST,ESSENTIAL_SPLICE_SITE,NON_SYNONYMOUS_CODING --indelFunctionList CODING_DISRUPTED_FRAMESHIFT,CODING_DISRUPTED_OTHER --out $OUTPUT

      [More Input Parameters]

  8. Linkage Analysis

    For a variant that shows a clear cosegregation pattern, where all affected individuals carry the variant and no unaffected individuals carry it, one kind of analysis can assess the likelihood that the variant cosegregates perfectly with affection status simply by chance. This test is a linkage analysis that uses the variant as a single marker and assesses linkage between this marker and affection status within the pedigree. It can be run with the linkage program MERLIN, which calculates both a LOD score and an associated p-value. The model applied within MERLIN compares a model with zero recombination and full penetrance of the variant to a model with free recombination (the null hypothesis).
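
    To give a feel for the LOD score, the following Python sketch computes it for the simplest textbook case: phase-known, fully informative meioses, comparing a chosen recombination fraction theta against free recombination (theta = 0.5). This is a simplification under stated assumptions, not MERLIN's algorithm, which evaluates full pedigree likelihoods; the function name is hypothetical.

```python
from math import log10


def lod_score(n_meioses, n_recombinants, theta):
    """LOD score at recombination fraction theta vs. free recombination."""
    n_non = n_meioses - n_recombinants
    # 0.0 ** 0 is 1.0 in Python, so theta = 0 with no recombinants works
    likelihood = (theta ** n_recombinants) * ((1 - theta) ** n_non)
    if likelihood == 0:
        return float("-inf")  # data impossible under this theta
    null = 0.5 ** n_meioses   # likelihood under free recombination
    return log10(likelihood / null)


# Perfect cosegregation across 10 informative meioses, tested at theta = 0
print(round(lod_score(10, 0, 0.0), 2))   # 3.01
```

    Under these assumptions each additional informative meiosis without recombination adds log10(2), about 0.3, to the LOD score, so small pedigrees cannot reach high scores even with perfect cosegregation.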

    Command line as follows:
    $ ./atav.sh --memory 10000 --linkage-analysis --linkage-data-file YourDATA.dat --linkage-ped-file YourPED.ped --linkage-map-file YourMAP.map --linkage-model-file YourParametric.model --out [your output]

    [MERLIN Input Files]   [MERLIN Parametric Linkage Analysis]

  9. Citations

    ATAV: Association Tests for Annotated Variants in Next-Generation DNA Sequencing Studies. In preparation.
