A subscription to JoVE is required to view this content. Sign in or start your free trial.
Amino acid-level signal-to-noise analysis determines the prevalence of genetic variation at a given amino acid position normalized to background genetic variation of a given population. This allows for identification of variant "hotspots" within a protein sequence (signal) that rises above the frequency of rare variants found in a population (noise).
Advancements in the cost and speed of next generation genetic sequencing have generated an explosion of clinical whole exome and whole genome testing. While this has led to increased identification of likely pathogenic mutations associated with genetic syndromes, it has also dramatically increased the number of incidentally found genetic variants of unknown significance (VUS). Determining the clinical significance of these variants is a major challenge for both scientists and clinicians. An approach to assist in determining the likelihood of pathogenicity is signal-to-noise analysis at the protein sequence level. This protocol describes a method for amino acid-level signal-to-noise analysis that leverages variant frequency at each amino acid position of the protein with known protein topology to identify areas of the primary sequence with elevated likelihood of pathologic variation (relative to population "background" variation). This method can identify amino acid residue location "hotspots" of high pathologic signal, which can be used to refine the diagnostic weight of VUSs such as those identified by next generation genetic testing.
The rapid improvement in genetic sequencing platforms has revolutionized the accessibility and role of genetics in medicine. Once confined to a single gene, or a handful of genes, the reduction in cost and increase in speed of next generation genetic sequencing has led routine sequencing of the entirety of the genome's coding sequence (whole exome sequencing, WES) and the entire genome (whole genome sequencing, WGS) in the clinical setting. WES and WGS have been used frequently in the setting of critically ill neonates and children with concern for genetic syndrome where it is a proven diagnostic tool that can change clinical management1,2. While this has led to increased identification of likely pathogenic mutations associated with genetic syndromes, it has also dramatically increased the number of incidentally found genetic variants, or unexpected positive results, of unknown diagnostic significance (VUS). While some of these variants are disregarded and not reported, variants localizing to genes associated with potentially fatal or highly morbid diseases are often reported. Current guidelines recommend reporting of incidental variants found in specific genes which may be of medical benefit to the patient, including genes associated with the development of sudden cardiac death-predisposing diseases such as cardiomyopathies and channelopathies3. Although this recommendation was designed to capture individuals at risk of a SCD-predisposing disease, the sensitivity of variant detection far exceeds specificity. This is reflected in a growing number of VUSs and incidentally identified variants with unclear diagnostic utility that far exceed the frequency of the respective diseases in a given population4. One such disease, long QT syndrome (LQTS), is a canonical cardiac channelopathy caused by mutations localizing to genes which encode cardiac ion channels, or channel interacting proteins, resulting in delayed cardiac repolarization5. This delayed repolarization, seen by a prolonged QT interval on resting electrocardiogram, results in an electrical predisposition to potentially fatal ventricular arrhythmias such as torsades de pointes. While a number of genes have been linked to the development of this disease, mutations in KCNQ1-encoded IKs potassium channel (KCNQ1, Kv7.1) is the cause of LQTS type 1 and is utilized as an example below6. Illustrating the complexity in variant interpretation, the presence of rare variants in LQTS-associated genes, so called "background genetic variation" has been previously described7,8.
In addition to large compendium-style databases of known pathogenic variants, several strategies exist for predicting the effect different variants will produce. Some are based on algorithms, such as SIFT and Polyphen 2, which can filter large numbers of novel non-synonymous variants to predict deleteriousness9,10. Despite broad use of these tools, low specificity limits their applicability when it comes to "calling" clinical VUSs11. "Signal-to-noise" analysis is a tool which identifies the likelihood of a variant being associated with disease based on the frequency of known pathologic variation at the loci in question normalized against rare genetic variation from a population. Variants localizing to genetic loci where there is a high prevalence of disease-associated mutations compared to population-based variation, a high signal-to-noise, are more likely to be disease-associated themselves. Further, rare variants found incidentally localizing to a gene with a high frequency of rare population variants compared to disease-associated frequency, a low signal-to-noise, may be less likely to be disease-associated. The diagnostic utility of signal-to-noise analysis has been illustrated in the latest guidelines for genetic testing for cardiomyopathies and channelopathies; however, it has only been employed at the whole gene level or domain-specific level12. Recently, given increased availability of both pathologic variants (disease databases, cohort studies in the literature) and population-based control variants (Exome Aggregation Consortium, ExAC and the Genome Aggregation Database, GnomAD13), this has been applied to the individual amino acid positions within the primary sequence of a protein. Amino acid-level signal-to-noise analysis has proven useful in categorizing incidentally identified variants in genes associated with LQTS as likely "background" genetic variation rather than disease-associated. Among the three major genes associated with LQTS, including KCNQ1, these incidentally identified variants lacked a significant signal-to-noise ratios, suggesting that the frequency of these variants at individual amino acid positions reflect rare population variation rather than disease-associated mutations. Furthermore, when protein-specific domain topology was overlaid against areas of high signal-to-noise, pathologic mutation "hotspots" localized to key functional domains of the proteins14. This methodology holds promise in determining 1) the likelihood a variant is disease- or population-associated and 2) identifying novel critical functional domains of a protein associated with human disease.
1. Identify the Gene and Specific Splice Isoform of Interest
NOTE: Here, we demonstrate the use of Ensembl15 to identify the consensus sequence for the gene of interest which is associated with the pathogenesis of the disease of interest (i.e. KCNQ1 mutations are associated with LQTS). Alternatives to Ensembl include RefSeq via the National Center for Biotechnology Information (NCBI)16 and the University of California, Santa Cruz (UCSC) Human Genome Browser17 (see Table of Materials).
2. Create the Experimental Genetic Variant Database (the "Signal")
NOTE: Here, we demonstrate how to create a database of disease-associated variants in the gene of interest with the frequency of the disease-associated variants among individuals with the disease of interest. This database can take many forms and represents the "signal" (phenotype-positive genetic variation) which will be normalized against the control variant database. This can include 1) disease-associated variants for comparison against VUSs to identify novel functional domains of the protein and/or 2) VUSs, including incidentally identified VUSs, to compare against disease-associated variants to determine likelihood of pathogenicity. Disease-associated variants in KCNQ1 will be presented for illustration; however, the method is the same for analysis of incidentally-identified VUSs or any other set of experimental variants.
3. Create the Control Genetic Variant Database (the "Noise")
NOTE: Here, we demonstrate how to create a database of control variants in the gene of interest with an associated frequency in a control population. This database represents the "noise" (phenotype-negative, population-based genetic variation) which is the background against which the experimental variant database will be normalized. This is referred to as "control" variation.
4. Amino Acid Level Signal-to-noise Calculation and Mapping
5. Protein Domain Topology Overlay
6. Variant Position Overlay
A representative result for amino acid-level signal to noise analysis for KCNQ1 is depicted in Figure 6. In this example, rare variants identified in the GnomAD cohort (control cohort), incidentally-identified WES variants (experimental cohort #1), and LQTS case-associated variants deemed likely disease-associated (experimental cohort #2) are depicted. Further, the signal-to-noise analysis comparing the WES and LQTS cohort variant frequency normalized against...
High-throughput genetic testing has advanced dramatically in its application and availability over the past decade. However, in many diseases with well-established genetic underpinnings, such as cardiomyopathies, expanded testing has failed to improve diagnostic yield21. Further, there is significant uncertainty regarding the diagnostic utility of many identified variants. This is partially due to a growing number of incidentally identified rare variants discovered on WES and WGS, which can lead t...
The authors have nothing to disclose.
APL is supported by the National Institutes of Health K08-HL136839.
Name | Company | Catalog Number | Comments |
1000 Genome Project | N/A | www.internationalgenome.org | |
ClinVar | N/A | www.ncbi.nlm.nih.gov/clinvar | |
Ensembl Genome Browser | N/A | uswest.ensembl.org/index.html | |
Excel | Microsoft | office.microsoft.com/excel/ | Used for all example formulas and functions |
Exome Aggregation Consortium | N/A | www.exac.broadinstitute.org | |
Genome Aggregation Database | N/A | www.gnomad.broadinstitute.org | |
National Center for Biotechnology Information Domain and Structure Database | N/A | www.ncbi.nlm.nih.gov/guide/domains-structures/ | |
National Center for Biotechnology Information Gene Database | N/A | www.ncbi.nlm.nih.gov/gene/ | |
National Center for Biotechnology Information Protein Database | N/A | www.ncbi.nlm.nih.gov/protein/ | |
National Heart, Lung, and Blood Institute GO Exome Sequencing Project | N/A | www.evs.gs.washington.edu/EVS/ | |
SnapGene | GSL Biotech LCC | www.snapgene.com | |
University of California, Santa Cruz Human Genome Browser | N/A | www.genome.ucsc.edu |
Request permission to reuse the text or figures of this JoVE article
Request PermissionThis article has been published
Video Coming Soon
ISSN 2578-6326
Copyright © 2025 MyJoVE Corporation. All rights reserved
We use cookies to enhance your experience on our website.
By continuing to use our website or clicking “Continue”, you are agreeing to accept our cookies.