Training data for Combined Annotation Dependent Depletion (CADD) models
Thank you for your interest in CADD and how it is created. We share the models (i.e. coefficients) as well as the code for annotation and scoring with our scoring script releases, available in the download section. Here, we share pre-annotated training data sets, different test/validation sets that we use to assess model performance and optimize model hyperparameters, and VCF files of the simulated and human-derived variants. These files allow you to (1) reproduce our model training, test and validation results, (2) explore different learners on the same training data, and (3) develop other approaches based on the sets of simulated and human-derived variants. The variant simulator used to create a set of "de novo" variants based on characteristics of the human-derived variants (published with the Nature Genetics manuscript in 2014) is available here. Please note that we do not share our code for preparing training data sets or adding new features. There are several reasons for this, including limited time to guide people through building a new CADD model and the potential pitfalls that come with it.
We provide training data for the following releases:
- v1.0, the version referenced in our Nature Genetics manuscript
- v1.3, a more recent CADD version that also provides a number of validation sets
- v1.4, the first CADD version with matching models for GRCh38 and GRCh37 (please note that training variants are not identical for the two builds). This version is described in our NAR manuscript
- v1.6, the most recent CADD version that includes additional scores for splice sites
CADD v1.0 (original release)
In the original CADD paper, we sampled 10 variant sets (from the larger set of simulated variants), trained 10 models and averaged the coefficients (download US | DE). For details about the model training, see the supplement of the original manuscript. In later versions, we trained on only one matched set. With CADD v1.3, we fixed some issues with human-derived InDel sequences. We therefore recommend using the training data set of v1.3 rather than that of v1.0.
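As a rough illustration of the coefficient averaging mentioned above, the following sketch takes the coefficient vectors of several linear models and returns their element-wise mean. The function name `average_coefficients` and the toy numbers are assumptions made for this example; they are not taken from the CADD code, and the sketch says nothing about how the individual models were trained.

```python
from typing import List, Sequence


def average_coefficients(models: Sequence[Sequence[float]]) -> List[float]:
    """Return the element-wise mean of several coefficient vectors.

    Each entry of `models` is one model's coefficients, in the same
    feature order (hypothetical layout, for illustration only).
    """
    if not models:
        raise ValueError("need at least one model")
    n_features = len(models[0])
    if any(len(m) != n_features for m in models):
        raise ValueError("all models must have the same number of coefficients")
    # zip(*models) groups the i-th coefficient of every model together.
    return [sum(column) / len(models) for column in zip(*models)]


# Three toy models with two coefficients each (made-up numbers).
toy_models = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
averaged = average_coefficients(toy_models)
```

Averaging in this way only makes sense when all models use the same features in the same order, which is why the sketch checks the vector lengths first.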
In CADD v1.3, we newly extracted human-derived changes from EPO 6 primate alignments (v75), fixing an issue with the previously extracted InDel sequences and allowing non-Chimpanzee sequences as the closest human outgroup. We also created a newly matched set of approximately 15 million simulated variants.
Training files (download US | DE)
- Training variants in block-gzip-compressed VCF format (incl. tabix index files):
- Annotated, imputed and reoriented training data:
- Fully expanded variant set for direct training in a learning package:
In addition, four validation data sets are available (download US | DE).
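The block-gzip-compressed VCF files listed above can be streamed with standard tools, because bgzip output consists of concatenated gzip members that Python's `gzip` module reads transparently. The sketch below is a minimal reader, assuming the files follow the usual VCF column order (CHROM, POS, ID, REF, ALT, ...); the function names are hypothetical, and the tabix index is not used here.

```python
import gzip
from typing import Dict, Iterable, Iterator


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield chrom/pos/ref/alt for each data line, skipping headers."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, ## meta lines and the #CHROM header
        # Assumes the standard VCF column order; only the first five
        # tab-separated fields are used.
        chrom, pos, _variant_id, ref, alt = line.split("\t")[:5]
        yield {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


def read_vcf(path: str) -> Iterator[Dict[str, object]]:
    """Stream records from a gzip- or bgzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        yield from parse_vcf_lines(handle)
```

Because `read_vcf` is a generator, it processes one line at a time and does not load the whole file into memory, which matters for variant sets with millions of entries.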
When we released CADD v1.4, we created an additional training data set for GRCh38. Here, we tried to match annotations and training data sets as closely as possible between GRCh38 and GRCh37; however, we note that there are substantial differences in the annotations and variants used. Note also that when we released CADD v1.5 for GRCh38 (which fixed some annotation issues), we did not release an updated training set. The VCF files describing the human-derived and simulated variant sets have not changed from v1.4 through v1.6.
For the CADD v1.4 data sets, files are organized in a similar way to v1.3. Please find INFO files in the respective folders:
Training data for GRCh38 (download US | DE)
Updated ClinVar-ExAC test set (download US | DE)
CADD v1.6 (CADD-Splice) includes SpliceAI and MMSplice DNN models as additional variant annotations. The training labels/sites (human-derived vs simulated variants) are unchanged since CADD v1.4.
The annotated training data is available. Please see INFO files in the respective folders for further details:
Please also find a template script for model training (download US | DE).
A comprehensive set of test and validation sets is also available for this release (download US | DE).