Molecular TF

Genetic variant interpretation and validation of ERKReg data

Name of Principal Investigator and research fellow mentor: K. Joeri van der Velde and Albertien M. van Eerde

Affiliation: Genomics Coordination Center, Department of Genetics, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands; Department of Genetics, University Medical Center Utrecht, Utrecht, the Netherlands.

Background / Literature Review

This project will be supervised by Dr. K. Joeri van der Velde (k.j.van.der.velde@umcg.nl) in collaboration with Dr. Albertien M. van Eerde (a.vaneerde@umcutrecht.nl) and the ERKNet Taskforce Molecular Diagnostics.

The Variant Interpretation Pipeline (VIP, publication forthcoming) is a flexible human variant interpretation pipeline for rare disease. It combines state-of-the-art pathogenicity predictors (SIFT, phyloP, PolyPhen, Grantham, ReMM, CAPICE), complementary predictors (SpliceAI, ncER, AnnotSV, AlphScore, GADO) and variant databases (ClinVar, VKGL, gnomAD) with template-based interactive reporting to facilitate decision support. A configurable decision tree and filters based on Human Phenotype Ontology (HPO) terms and gene inheritance can be used to find unknown disease-causing variants or to fine-tune a query for specific variants. VIP is non-commercial and available as open-source software at github.com/molgenis/vip.

Within the Taskforce Molecular Diagnostics in ERKNet, VIP was presented in May 2023 as a solution for batch curation of genetic variants entered in ERKReg as a molecular diagnosis. A project to curate the genetic variants registered in ERKReg through VIP was initiated by Prof. Dr. V.V.A.M. (Nine) Knoers, Dr. A.M. (Albertien) van Eerde, Dr. K.J. (Joeri) van der Velde, Prof. Dr. med. F. (Franz) Schaefer, and the Taskforce.

Pilot data were generated in May 2023 from 2,726 cases extracted from ERKReg. For the sake of simplicity, only the primary variant of each case was considered; additional variants that were part of the genetic diagnosis were excluded. A total of 2,065 primary variants were fully processed and received a suggested classification: 1,795 Likely Pathogenic (LP; 66%), 223 Likely Benign (LB; 8%), and 47 Variants of Unknown Significance (VUS; 2%). The remaining 661 variants (24%) either had missing or incorrect HGVS notations or were too large for this type of pipeline (i.e. whole-exon or whole-gene deletions). We repeated this analysis in December 2023 after substantial manual curation of variant nomenclature and inclusion of additional cases, bringing the total to 4,604. This time, 3,621 variants received a suggested classification: 3,173 LP (69%), 376 LB (8%), and 72 VUS (2%), with 983 variants (21%) that could not (yet) be processed. All percentages are relative to the total number of cases in each round. Both analyses were performed with VIP version 5.4.1 so that the results could be compared.
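The reported percentages are shares of all extracted cases in each round, not only of the processed variants. The short Python check below reproduces that arithmetic from the counts given in the text; the variable and function names are illustrative only.

```python
# Counts as reported for the two analysis rounds (both run with VIP v5.4.1).
rounds = {
    "May 2023": {"LP": 1795, "LB": 223, "VUS": 47, "not processed": 661},
    "December 2023": {"LP": 3173, "LB": 376, "VUS": 72, "not processed": 983},
}


def shares(counts):
    """Return each category as a rounded percentage of all cases in the round."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


for name, counts in rounds.items():
    # Prints the round, its total number of cases, and the rounded shares.
    print(name, sum(counts.values()), shares(counts))
```

The totals per round come out at 2,726 and 4,604, matching the case numbers stated above.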

Before drawing conclusions on, for instance, the number of variants that might not be causal (those with a suggested LB classification), it is necessary to reduce the number of variants that cannot be processed automatically because of nomenclature issues. In addition, because large CNVs are not suitable for processing in the current pipeline, it is necessary to assess the percentage of CNVs in the current “not processed” list and to find an alternative way to separate the suggested LB CNVs from the LP and P ones.

Objective(s) / Working hypothesis

ERKReg is a rich and multinational registry for all patients with rare kidney diseases, containing valuable genetic diagnoses established over a considerable range of years. For ERKReg maintainers, data quality management is an important and ongoing activity. Diagnoses made in the past can be re-evaluated using the latest standards and most up-to-date knowledge from the field. We expect that the date of a diagnosis will correlate with the chance that a diagnosis needs updating. However, to maximize benefit, we do not only want to curate the past, but also learn relevant lessons from it, so that we may avoid the same pitfalls in the future. This project focuses on evaluating and curating genetic diagnoses in ERKReg.

The most important goals of this project will be:

* Better quality of ERKReg molecular diagnoses by resolving notational issues of variants
* Better clinical reliability of ERKReg by
  - re-assessing suggested LB classifications and
  - assessing whether LP/P variants can actually explain the registered clinical diagnosis or might suggest altering the registered clinical diagnosis
* Better VIP pipeline by learning from suggested LB classifications that were mistaken
* Better insight into difficult-to-resolve issues and how they might be prevented

To improve the quality and reliability of ERKReg, we first want to fix as many variant notation issues as possible, followed by a complete re-analysis of all ERKReg genetic diagnoses using the latest version of the VIP pipeline. The current version is 7.5.0, which has received many upgrades, updates, fine-tuning and fixes compared to the previously used version 5.4.1. After this analysis, we will compile an overview of any remaining notation issues and unexpected classifications. This overview will present key learnings and serve as a milestone for further ERKReg improvements.

We hypothesize that the quality and interpretation of many ERKReg variants, as well as future collection of genetic diagnoses, can be significantly improved by a combination of curation and re-analysis of the current registry. Key questions that will be answered are:

Why does the extraction of HGVS notation of certain variants fail?
Why does the conversion of HGVS notation into genomic coordinates of certain variants fail?
Why does VIP suggest a Likely Benign classification for certain variants?
For LP/P variants, do the molecular diagnoses explain the registered phenotypes? Are reclassifications in order?

Patient Population

This project will include all ERKReg patients with a genetic diagnosis, together with their phenotypes and registered disease.

Data Sources & Data Elements

Data Sources: ERKReg

Data Elements:
For ERKReg variant reinterpretation, we will use information on genetic diagnoses, phenotypes and disease(s), the method and year of testing, and a list of the genetic tests that were performed. Optionally, we can use metadata such as sex and disease category to create pivot tables. These tables can help to identify potential overrepresentation of variant issues or classifications in a particular subgroup. In addition, anonymized identifiers of submitting centers and their countries would enable us to assess practice differences with respect to variant annotation. Although we will clearly focus on cases with a genetic diagnosis, we will also request totals of patients per phenotype group, and totals of patients within these groups who had genetic testing but no diagnosis.
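As a minimal sketch of such a pivot table, the Python snippet below counts suggested classifications per subgroup using only the standard library. The field names (`disease_category`, `sex`, `classification`) and the example records are hypothetical and do not reflect the actual ERKReg export.

```python
from collections import Counter, defaultdict


def pivot_counts(rows, row_key, col_key):
    """Count records per (row_key, col_key) combination as a nested dict."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[row_key]][row[col_key]] += 1
    return {group: dict(counts) for group, counts in table.items()}


# Made-up records for illustration only; these are not ERKReg data.
records = [
    {"disease_category": "category_A", "sex": "F", "classification": "LP"},
    {"disease_category": "category_A", "sex": "M", "classification": "LB"},
    {"disease_category": "category_B", "sex": "F", "classification": "LP"},
    {"disease_category": "category_B", "sex": "M", "classification": "LP"},
]

table = pivot_counts(records, "disease_category", "classification")
for group, counts in sorted(table.items()):
    print(group, counts)
```

The same helper can be called with `"sex"` or an anonymized center identifier as `row_key` to compare other subgroups.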

Methodology

The project can be split into three logical parts comprising the following 14 steps:

Part one of this project will curate the annotation of existing genetic diagnoses based on existing analysis results. This means:

1. Investigate variants that failed to receive an HGVS notation and fix where possible.

2. Investigate variants that failed conversion into Variant Call Format (VCF) and fix where possible.

3. Identify Copy Number Variation (CNV), other Structural Variation (SV), and indels (>50 bp) and set them aside for a separate analysis.
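The size-based part of step 3 could be sketched in Python as below, assuming the variants are already available as VCF-style REF/ALT alleles. The function name and the handling of symbolic and breakend alleles are assumptions made for illustration, not a description of the actual curation procedure.

```python
SIZE_THRESHOLD_BP = 50  # threshold named in step 3


def needs_separate_analysis(ref, alt, threshold=SIZE_THRESHOLD_BP):
    """Return True if a variant should be set aside for CNV/SV analysis.

    Flags symbolic alleles (e.g. <DEL>, <DUP>), breakend notation
    (ALT containing '[' or ']'), and sequence-resolved indels whose
    length change exceeds the threshold.
    """
    if alt.startswith("<") and alt.endswith(">"):
        return True  # symbolic structural allele; size is not encoded in REF/ALT
    if "[" in alt or "]" in alt:
        return True  # VCF breakend notation
    return abs(len(ref) - len(alt)) > threshold


print(needs_separate_analysis("A", "G"))             # SNV
print(needs_separate_analysis("A" + "C" * 60, "A"))  # 60 bp deletion
print(needs_separate_analysis("A", "<DEL>"))         # symbolic deletion
```

Variants that are only described in words or in HGVS notation without genomic coordinates would need to be recognized by other means.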

The second part will re-process the curated data set using VIP v7.5.0. This analysis consists of:

4. Converting the ERKReg XLSX export into a well-formatted Comma-Separated Values (CSV) file.

5. Processing and splitting the CSV file into two new files using a Java program:
one that connects individuals to variants, and another with variants as input file for Ensembl Variant Effect Predictor (VEP).

6. Running the Ensembl VEP web tool in batches to obtain VCF-formatted variants (web service at www.ensembl.org/Tools/VEP using genome assembly GRCh38.p14).

7. Merging all VEP results into one new file and preparing it for the VIP pipeline:

a. Adding genotype columns to complete VCF rows.

b. Cleanup (incl. b38 chr notation, remaining indels >50 bp moved to CNV analyses).

c. Adding an appropriate VCF header with build 38 contigs.

d. Applying positional sorting (using BCFtools).

e. Preparing a VIP sample sheet.

f. Preparing a VIP config file to report all classifications for all quality levels.

8. Running the VIP v7.5.0 pipeline in VCF mode with this special config and otherwise default settings.

9. Merging VIP results with the original ERKReg XLSX, including full annotations (Java program).
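Steps 7a to 7d can be illustrated with the Python sketch below. It is a simplified stand-in under stated assumptions: it handles a single hypothetical sample, expects genotypes to be supplied as `0/1` or `1/1` strings, omits contig lengths (a real header would take these from the GRCh38 reference index), and sorts in pure Python where the project itself uses BCFtools. All coordinates in the example are made up.

```python
# Primary GRCh38 contigs in 'chr' notation; alternate and patch contigs are
# not covered, so a variant on such a contig would raise a KeyError below.
CONTIGS = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY", "chrM"]
CONTIG_RANK = {name: i for i, name in enumerate(CONTIGS)}


def to_chr_notation(chrom):
    """Convert Ensembl-style names ('1', 'X', 'MT') to 'chr'-prefixed names."""
    if chrom.startswith("chr"):
        return chrom
    return "chrM" if chrom == "MT" else "chr" + chrom


def build_header(sample):
    """Build a minimal VCF header with contig lines and a GT format field."""
    lines = ["##fileformat=VCFv4.2"]
    lines += [f"##contig=<ID={c},assembly=GRCh38>" for c in CONTIGS]
    lines.append('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">')
    lines.append("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t" + sample)
    return lines


def build_vcf(variants, sample="SAMPLE1"):
    """Assemble VCF text from (chrom, pos, ref, alt, genotype) tuples."""
    rows = [(to_chr_notation(c), int(p), r, a, gt) for c, p, r, a, gt in variants]
    # Positional sort: contig order first, then position within the contig.
    rows.sort(key=lambda v: (CONTIG_RANK[v[0]], v[1]))
    body = [f"{c}\t{p}\t.\t{r}\t{a}\t.\t.\t.\tGT\t{gt}" for c, p, r, a, gt in rows]
    return "\n".join(build_header(sample) + body) + "\n"


# Made-up variants for illustration only.
example = [
    ("X", 1000, "A", "G", "0/1"),
    ("2", 500, "C", "T", "1/1"),
    ("chr2", 100, "G", "A", "0/1"),
]
print(build_vcf(example))
```

The sample sheet and config file of steps 7e and 7f are specific to VIP and are not sketched here.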

The third and last part will scrutinize these results and create an overview of remaining issues:

10. Investigate variants that still failed to receive an HGVS notation.

11. Investigate variants that still failed conversion into VCF.

12. a. Investigate variants with a suggested classification of ‘Likely Benign’ and provide appropriate reasons for either acknowledging or rejecting this suggested classification.

b. Assess whether LP/P variants can actually cause the registered phenotype in terms of inheritance pattern, pathophysiology, and expected severity.

c. Present the list of variants resulting from the analyses in a and b to the Taskforce and the WG leads, in order to obtain an expert opinion on, and approval of, the suggested new classifications.

13. Compile all results into a comprehensive variant overview:

a. Successfully confirmed variants (LP confirmed as LP).

b. Variants confirmed as LP that were originally reported as VUS.

c. LB variants that were acknowledged as being true LBs.

d. LB variants that were rejected and classified as LP after all.

e. Remaining technical issues regarding variant notation.

14. Draft a paper including the following elements:

* Descriptive report on the subset of patients in ERKReg with a (correct) molecular diagnosis:
  - diagnosis groups
  - genes
  - types of variants, as well as the number of patients with the same variants
  - analyses of the percentage of patients who had genetic testing performed within phenotype groups
* Report on the process of curation and re-evaluation of variants
* Recommendations on how data entry can be improved in the future
* Report on the causes of wrongful assignment of genetic diagnoses
* Report on clinical diagnoses that changed based on the curation
