Short method summary
Fixed or nearly fixed recent evolutionary changes were identified as differences between 1000 Genomes and the Ensembl Compara inferred human-chimpanzee ancestral genome (derived allele frequency (DAF) of at least 95%, 14.9 million SNVs and 1.7 million indels). To simulate an equivalent number of mutations, we used an empirical model of sequence evolution with CpG dinucleotide-specific rates and mutation rates locally scaled in megabase windows. For annotation, we used the Ensembl Variant Effect Predictor (VEP), data from the ENCODE project and information from UCSC genome browser tracks. These annotations span a wide range of data types including conservation metrics like GERP, phastCons, and phyloP; functional genomic data like DNase hypersensitivity and transcription factor binding; transcript information like distance to exon-intron boundaries or expression levels in commonly studied cell lines; and protein-level scores like Grantham, SIFT, and PolyPhen.
In CADD v1.0 (major release), the resulting variant-by-annotation matrix contained 29.4 million variants (half observed, half simulated) and 63 distinct annotations. We trained a support vector machine (SVM) with a linear kernel on features derived from these annotations, supplemented by a limited number of interaction terms. The same 63 annotations were obtained for all 8.6 billion possible substitutions in the human reference genome (GRCh37), and, after training on observed and simulated variants, the model was applied to score all possible substitutions. As the scale of the combined SVM score ("C-scores") is effectively arbitrary due to the annotations used, we defined phred-like scores ("scaled C-scores") ranging from 1 to 99, based on the rank of each variant relative to all possible 8.6 billion substitutions in the human reference genome.
In CADD v1.1 (developmental/minor release), we used a slightly extended and updated annotation set and we trained a logistic regression model. Please find further information in the release notes.
In CADD v1.2 (developmental/minor release), we corrected some minor issues identified in v1.1. Please find further information in the release notes.
In CADD v1.3 (developmental/minor release), we corrected issues in v1.1 and v1.2 relating to overlapping gene annotation. We also updated our training data set using updated whole genome alignments. Please find further information in the release notes.
Notes on using scaled vs. unscaled C-scores
We believe that CADD scores are useful in two distinct forms, namely "raw" and "scaled", and we provide both in our output files. "Raw" CADD scores come straight from the model, and are interpretable as the extent to which the annotation profile for a given variant suggests that that variant is likely to be "observed" (negative values) vs "simulated" (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or "not observed") and therefore more likely to have deleterious effects.
Since the raw scores do have relative meaning, one can take a specific group of variants, define the rank for each variant within that group, and then use that value as a "normalized" and now externally comparable unit of analysis. In our case, we scored and ranked all ~8.6 billion SNVs of the GRCh37/hg19 reference and then "PHRED-scaled" those values by expressing the rank in order of magnitude terms rather than the precise rank itself. For example, reference genome single nucleotide variants in the top 10% of CADD scores are assigned to CADD-10, the top 1% to CADD-20, the top 0.1% to CADD-30, etc. The results of this transformation are the "scaled" CADD scores.
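As a rough illustration of this transformation, the sketch below converts a rank into a PHRED-like value using the -10*log10(rank/total) definition given for the score files. The function names and the rounded total of 8.6 billion are assumptions made for the example; this is not the code used to produce the released scores.

```python
import math

# Approximate number of possible GRCh37/hg19 SNVs quoted in the text.
TOTAL_SNVS = 8_600_000_000


def scaled_score(rank, total=TOTAL_SNVS):
    """PHRED-like score for the variant with the rank-th highest raw score (1 = top)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must lie between 1 and total")
    return -10.0 * math.log10(rank / total)


def top_fraction(scaled):
    """Fraction of all ranked SNVs that score at or above a given scaled score."""
    return 10.0 ** (-scaled / 10.0)


# A variant ranked 860 millionth of 8.6 billion sits at the top-10% boundary.
print(round(scaled_score(860_000_000), 2))
```

The inverse helper makes the "frame of reference" explicit: a scaled score of 30 corresponds to the top 0.1% of all ranked substitutions.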
The advantages and disadvantages of the score sets are summarized as follows:
1. Resolution: Raw scores offer superior resolution across the entire spectrum, and preserve relative differences between scores that may otherwise be rounded away in the scaled scores. For example, the bottom 90% (~7.74 billion) of all GRCh37/hg19 reference SNVs (~8.6 billion) are compressed into scaled CADD units of 0 to 10, while the next 9% (top 10% to top 1%, spanning ~774 million SNVs) occupy CADD-10 to CADD-20, etc., with the scaled units only getting close to resolving individual SNVs from one another at the extreme top end. As a result, many variants that have substantive raw score differences between them will be necessarily forced to the same or very similar rank unit.
2. Frame of reference: Since there must always be a top-ranked variant, second-ranked variant, etc., scaled scores are easier to interpret at first glance and will be comparable across CADD versions as we, for example, update the model to include new annotations (or even use an entirely distinct model-building method). A scaled score of 10, for example, refers to the top 10% of all reference genome SNVs, regardless of the details of the annotation set, model parameters, etc. Furthermore, with scaled values one can always infer, at a glance, the probability of picking a variant at that score or greater when selecting randomly from all possible reference SNVs.
We envision the "typical use" cases for CADD, and appropriate choice of score set, as follows:
1. Discovering causal variants within an individual, or small groups, of exomes or genomes. Scaled CADD scores are most useful in this context, as one will generally only be interested in, or capable of, reviewing a small set of the "most interesting" variants. In this setting, the distinction between a variant at the 25th percentile and one at the 75th percentile is effectively irrelevant (scaled scores of roughly 1 to 6), while the difference between a variant in the top 10% (scaled score of 10) vs 1% (scaled score of 20) may be quite meaningful. Further, the absolute frame of reference is valuable here, allowing an analyst to quickly place a variant in context and facilitating easier translation of results across publications, studies, etc.
2. Fine-mapping to discover causal variants within associated loci. As above, scaled scores are likely to be more useful here by allowing focus on a small set of manually reviewable best candidates and providing the absolute frame of the reference genome.
3. Comparing distributions of scores between groups of variants, e.g., cases vs controls. In this case, raw scores should be used, as they preserve distinctions that may be relevant across the entire scoring spectrum. Scaled scores may obscure systematic and potentially highly significant distinctions between two groups of variants (e.g., the first and third quartiles of all reference SNV scores). Further, since such analyses are generally conducted computationally and without manual intervention, the absolute frame of reference advantage to scaled scores is not as valuable here.
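For the third use case, a rank-based test on raw scores is one option. The sketch below is a minimal, standard-library Mann-Whitney U test with a tie-corrected normal approximation and no continuity correction; the choice of test and the function name are assumptions for illustration, and an established statistics package is preferable for real analyses.

```python
import math


def mann_whitney_u(xs, ys):
    """Return (U for xs, approximate two-sided p-value) for two samples of raw scores."""
    n1, n2 = len(xs), len(ys)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one score")
    n = n1 + n2
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        t = j - i                     # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a shift
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical raw scores for two groups of variants.
case_raw = [2.1, 3.4, 1.9, 4.0, 2.8]
control_raw = [0.3, 1.1, -0.2, 0.9, 1.5]
print(mann_whitney_u(case_raw, control_raw))
```

The normal approximation is only reasonable for moderately large groups; for very small samples an exact test would be more appropriate.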
What are the scores in the files and which score cutoff should I use?
The last column of the provided files is the PHRED-like (-10*log10(rank/total)) scaled C-score, ranking a variant relative to all possible substitutions of the human genome (8.6x10^9). As explained above, a scaled C-score greater than or equal to 10 indicates that a variant is predicted to be among the 10% most deleterious substitutions possible in the human genome, a score greater than or equal to 20 indicates the 1% most deleterious, and so on.
The second to last column is the raw score of the model. Due to the high rate of mislabeling in our training data, its absolute value has no direct interpretation (not even the sign). The higher the raw C-score, the more likely the variant is predicted to be deleterious. If you want to do a non-parametric test between sets of variants, we recommend using this raw C-score (see above). If you want to put a cutoff on deleteriousness, we recommend the last column (scaled C-score), as it has some interpretation by relating the raw C-score to the raw C-scores of all possible substitutions in the human genome.
If you would like to apply a cutoff on deleteriousness, e.g. to identify potentially pathogenic variants, we suggest placing it somewhere between 10 and 20, perhaps at 15, as this also happens to be the median value for all possible canonical splice site changes and non-synonymous variants. However, there is no natural choice here -- it is always arbitrary. We therefore recommend integrating C-scores with other evidence and ranking your candidates for follow-up rather than hard filtering.
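The recommendation to rank rather than hard filter can be sketched as follows. The code relies only on the layout described above (raw score in the second to last column, scaled score in the last); the function name, the flag threshold argument, and the example lines are illustrative assumptions.

```python
def rank_candidates(lines, flag_at=15.0):
    """Sort scored variants from most to least deleterious without dropping any."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\r\n").split("\t")
        raw, scaled = float(fields[-2]), float(fields[-1])
        records.append({
            "variant": tuple(fields[:-2]),
            "raw": raw,
            "scaled": scaled,
            # The flag is a hint for manual review, not a filter.
            "flagged": scaled >= flag_at,
        })
    # Sort on the raw score, which preserves finer distinctions than the scaled one.
    records.sort(key=lambda r: r["raw"], reverse=True)
    return records


# Made-up example lines; real files contain the actual scores.
example = [
    "#Chrom\tPos\tRef\tAlt\tRawScore\tPHRED",
    "1\t100\tA\tG\t0.2\t5.1",
    "1\t200\tC\tT\t3.9\t22.0",
    "2\t300\tG\tA\t1.8\t15.3",
]
for rec in rank_candidates(example):
    print(rec["variant"], rec["scaled"], rec["flagged"])
```

Keeping low-scoring variants in the ranked list makes it easier to combine C-scores with other evidence later.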
I checked the "include underlying annotation in output", what information do I get? Why are there multiple lines per variant?
CADD uses many different annotations for its combined score. These include functional effect predictions based on gene models, conservation measures, ENCODE data summaries -- a total of 63 annotations in v1.0. Given a set of variants, we run the Ensembl Variant Effect Predictor (VEP) to obtain the gene model based predictions. We run VEP with the --per_gene option, which will return a "representative" transcript with the most severe effect for a certain gene. If a position overlaps multiple genes, it will return multiple annotations for this variant. We then extend the VEP output with values from different UCSC-style annotation tracks (Encode OpenChromatin, expression, histones, conservation scores, ...). Once we have the fully annotated table, the raw C-score is calculated as described in our publication (see below). In case of multiple annotations, we report the highest score for all annotations of a variant. The corresponding scaled C-score is looked up and added to the output. If you check "include underlying annotation in output", this is the file that you will get. We provide these input annotations to help users interpret their scoring results. All the columns present in files of CADD v1.0 and below are listed and briefly described in Supplementary Table 1 of our paper (page 22). Annotation information for CADD v1.1-v1.3 is available here.
Please do not use the annotation provided with the score as a replacement for annotating your variants using up-to-date gene annotation. CADD v1.0 uses what is now an outdated gene build (Ensembl v68). Further, for all CADD versions, the output does not contain information on all transcripts, and the "representative" transcript picked by VEP need not be the canonical one or even represent the most severe effect based on up-to-date annotations.
I fail to retrieve scores for my variants using the webserver. What is going wrong?
(1) If your upload fails it can be for two different reasons:
(a) You are attempting to upload a file larger than 2MB, which is automatically rejected by the webserver with a connection reset (white page, server error). In this case, please submit your variant set in smaller pieces or try removing additional columns in the VCF (CADD only requires the first 5 columns) to meet the upload limit. Also consider gzip-compression of your VCF file. We generally recommend submitting variants in small batches, as different submissions can be processed in parallel.
(b) If the file is smaller than 2MB, but it is not correctly formatted as a VCF or the file extension is neither vcf, tsv, txt nor gz, you get the "Your upload failed." error message on the regular CADD website with some description on how the uploaded file needs to be formatted and named. If you get this type of error, please adjust the formatting of the information (i.e. 5 columns: CHROM, POS, ID, REF, ALT; ID column can be empty but cannot be missing) and make sure your file has one of the filename extensions mentioned above. The upload will also fail if the file is formatted with only MAC new line characters ('\r'). UNIX ('\n') and Windows ('\r\n') formatted files work.
(2) If your upload is successful and you are retrieving scores for only a subset of variants the reasons might be as follows:
(a) Multiple identical variants were removed from your request.
(b) When uploading variants, the webinterface removes all additional information from the provided VCF and checks that lines match expectations for a VCF. Our expectations are defined as: (1) the chromosome column matches one of the reference chromosome identifiers, (2) the position column is a number and (3) the reference and alternative columns are nucleotide sequences without ambiguity codes (i.e. only A, C, G, T). Sometimes users describe InDels as 1 1234 . ACGT - (CHROM POS ID REF ALT) for a deletion of ACGT. This is not valid VCF; insertions and deletions need to be described with one base of context left of the event (1 1233 . TACGT T). Variants that violate one of these expectations will be missing from the output.
(c) Because of the way our webserver is set up, you will receive results for all your submitted InDels as these are calculated from scratch if they are not in the set of pre-scored variants. There is no checking performed on whether the deleted sequence is in the reference or whether the one context base required for insertions in VCF format matches the reference. If you are missing InDels, and they are not missing due to non-uniqueness or due to a contig ID that is not part of our reference (see 2d), please contact us with your submission ID (part of the download link/name of the scored file).
(d) When you are retrieving scores for only a subset of your SNV variants, this indicates that some of them cannot be found in the pre-scored whole genome file. The webserver checks that the reported reference base matches the base of the reference at the reported position. If you are submitting variants based on a different genome build, about 3/4 of your variants will be missing, as the reference allele at a given position will match by chance only about 1/4 of the time. Our scores are currently based on GRCh37/hg19. We note that, for example, dbSNP versions based on GRCh38/hg38 have already been released. Please make sure that you are using the right version.
Frequently, the problem is that users use the right assembly version, but do not adhere to the VCF specification. The plus strand reference allele needs to be in the reference column and the non-reference allele in the alternative column of the VCF. If you submit variants on the minus strand, they will be missing from your output files. It is our understanding that neither dbSNP nor HGMD necessarily report variants based on the human reference assembly, but report the major allele as the reference allele in their VCF export.
(e) You might also fail to retrieve sites from a contig that is missing in the reference we are using or is masked with Ns in our version of the human reference. The latter applies, for example, to the X paralogous region on the Y chromosome, which needs to be requested with its X coordinates and not its Y coordinates. Find more information on the pseudoautosomal sequences in GRCh37/hg19 on the Ensembl website.
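Several of the causes listed under (1b), (2a) and (2b) can be caught locally before uploading. The following is a minimal sketch of such a pre-check, not the webserver's actual validation code; the function names are hypothetical, and the set of chromosome identifiers (1-22, X, Y, MT without a "chr" prefix) is an assumption that should be adapted to the reference in use.

```python
import re

# Assumption: chromosome identifiers are 1-22, X, Y and MT without a "chr" prefix.
CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}
POSITION = re.compile(r"[0-9]+")
ALLELE = re.compile(r"[ACGT]+")


def check_line(line):
    """Return None if a VCF data line passes the checks, else a short reason."""
    body = line.rstrip("\r\n")
    if "\r" in body:
        return "bare carriage returns (old Mac line endings)"
    fields = body.split("\t")
    if len(fields) < 5:
        return "fewer than 5 columns (CHROM, POS, ID, REF, ALT)"
    chrom, pos, _id, ref, alt = fields[:5]
    if chrom not in CHROMOSOMES:
        return "unknown chromosome identifier"
    if not POSITION.fullmatch(pos):
        return "position is not a number"
    if not (ALLELE.fullmatch(ref) and ALLELE.fullmatch(alt)):
        return "REF and ALT may only contain A, C, G, T"
    return None


def precheck(lines):
    """Return (unique valid variants, list of (line number, reason) problems)."""
    seen, valid, problems = set(), [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        reason = check_line(line)
        if reason is None:
            fields = line.rstrip("\r\n").split("\t")
            key = (fields[0], int(fields[1]), fields[3], fields[4])
            if key in seen:
                reason = "duplicate variant"
            else:
                seen.add(key)
                valid.append(key)
        if reason is not None:
            problems.append((number, reason))
    return valid, problems
```

This sketch cannot detect a wrong genome build or minus-strand alleles, since both require comparing the REF column against the reference sequence itself.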
I would like to use CADD to annotate more than the roughly 100,000 variants accepted by the webinterface, or would like to use CADD to annotate variants on a regular basis. What should I do?
The webinterface has an arbitrarily introduced 2MB limit for computational reasons and to make it unattractive to use the webserver for scoring large sets of variants. We provide pre-scored files for SNVs and InDels. For the minimal local set-up, you can download the pre-scored 1000 Genomes and ESP variant files (500M). This will cover a large proportion of the variants (SNVs and InDels) that you will observe in any exome or genome data set. You can then run all the additional/new variants on our webserver. You can also download all possible SNVs (~80G) and a set of 12.3M InDels (~160M), and then only score new InDels through the webinterface. In this case, we recommend that you start building a local database of InDels that you previously scored and check against it before running variants through the website. You might initialize that local database with the InDels provided for the 1000 Genomes and ESP projects or the larger set of 12.3M InDels that we provide.
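One possible way to keep such a local database of previously scored InDels is a small SQLite table, sketched below with Python's standard sqlite3 module. The table layout, function names, and example scores are assumptions made for illustration; any key-value store keyed on chromosome, position, reference and alternative allele would serve the same purpose.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the local InDel score cache."""
    con = sqlite3.connect(path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS indels (
               chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
               raw REAL, scaled REAL,
               PRIMARY KEY (chrom, pos, ref, alt))"""
    )
    return con


def store(con, rows):
    """Add scored InDels given as (chrom, pos, ref, alt, raw, scaled) tuples."""
    con.executemany("INSERT OR REPLACE INTO indels VALUES (?, ?, ?, ?, ?, ?)", rows)
    con.commit()


def split_known(con, variants):
    """Separate (chrom, pos, ref, alt) tuples into cached scores and unscored variants."""
    known, unknown = {}, []
    query = "SELECT raw, scaled FROM indels WHERE chrom=? AND pos=? AND ref=? AND alt=?"
    for variant in variants:
        row = con.execute(query, variant).fetchone()
        if row is None:
            unknown.append(variant)  # still needs to be scored
        else:
            known[variant] = row
    return known, unknown


con = open_cache()
store(con, [("1", 1233, "TACGT", "T", 0.5, 7.2)])  # made-up scores
known, unknown = split_known(con, [("1", 1233, "TACGT", "T"), ("2", 10, "A", "AT")])
print(known, unknown)
```

Only the variants returned in the second list would then need to be submitted through the webinterface, and their results can be added to the cache with store afterwards.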
The files that we provide are block-gzip compressed and tabix indexed and allow for fast retrieval of specific genomic locations using the SAMtools library. There are bindings for several programming languages, which allow easy scripting for retrieving specific variants. We provide an example script that uses the pysam library to work on CADD v1.0 files here.
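Independently of the example script mentioned above, a single-variant lookup with pysam might look roughly like the sketch below. It assumes the first four columns are chromosome, position, reference and alternative allele (the text only fixes the last two columns as raw and scaled score), and the file path passed to lookup is hypothetical. pysam is a third-party package and is imported lazily so that the parsing helper works without it.

```python
def parse_row(row):
    """Split one tab-separated score line into a dictionary."""
    fields = row.rstrip("\r\n").split("\t")
    return {
        "chrom": fields[0],          # assumed column order: Chrom, Pos, Ref, Alt
        "pos": int(fields[1]),
        "ref": fields[2],
        "alt": fields[3],
        "raw": float(fields[-2]),    # second to last column: raw score
        "scaled": float(fields[-1]), # last column: scaled score
    }


def lookup(path, chrom, pos, ref, alt):
    """Return the score record for one variant, or None if it is not in the file."""
    import pysam  # third-party dependency, required only for the lookup itself

    tbx = pysam.TabixFile(path)
    try:
        # fetch() takes 0-based, half-open coordinates; pos is 1-based as in VCF.
        for row in tbx.fetch(chrom, pos - 1, pos):
            record = parse_row(row)
            if record["pos"] == pos and record["ref"] == ref and record["alt"] == alt:
                return record
    finally:
        tbx.close()
    return None


# Parsing a made-up line; the score values are purely illustrative.
print(parse_row("1\t10001\tT\tA\t0.337\t6.046\n"))
```

A call such as lookup("path/to/scores.tsv.gz", "1", 10001, "T", "A") requires the matching .tbi index file next to the compressed score file.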
How to cite CADD?
Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J.
A general framework for estimating the relative pathogenicity of human genetic variants.
Nat Genet. 2014 Feb 2. doi: 10.1038/ng.2892.
PubMed PMID: 24487276.