Introduction to Bioinformatics online course: IBT

Practical: Consolidation Week

Module name: Genomics

Session name: Consolidation week

Trainer: Fatma Guerfali

Participant: <write your name here>

Date: <write today’s date here>

Annotate and interrogate a VCF file

Introduction

We will use a VCF file containing small mutations reported for Venter’s genome. In the introductory video for this exercise, we will see together how to select only the data for chromosome 1 and how to annotate this selected set of mutations using VEP (Variant Effect Predictor), an online resource provided by Ensembl. You will then query the results for different information provided by the annotation.

Important note:

For this practical you will be working with a reduced VCF file containing small mutations on chromosomes 1-3 identified by sequencing Venter’s genome. This file, named “Venter_chr123.vcf”, has been made available for you to download.

Please note that the file is in VCF format and has been saved as a .txt file to ensure that everyone can access it irrespective of operating system, text editor, etc. The file can be viewed in a Linux terminal using a command such as less or more.
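To get a first feel for how such a file is laid out, the sketch below builds a tiny made-up VCF-like file and inspects it from the command line. The file name toy_example.vcf and its contents are hypothetical and are not the course data; for the real file you would run the same kind of commands on “Venter_chr123.vcf”.

```shell
# Build a tiny, made-up VCF-like file so the commands can be tried safely.
# (toy_example.vcf and its contents are hypothetical, not the course data.)
printf '##fileformat=VCFv4.1\n' > toy_example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy_example.vcf
printf '1\t1000\t.\tA\tG\t.\tPASS\t.\n' >> toy_example.vcf
printf '2\t2000\t.\tC\tT\t.\tPASS\t.\n' >> toy_example.vcf
printf '3\t3000\t.\tG\tA\t.\tPASS\t.\n' >> toy_example.vcf

# Print the first lines directly (less or more would open an interactive pager).
head -n 3 toy_example.vcf

# Count header lines (they start with '#') and data lines (everything else).
header_lines=$(grep -c '^#' toy_example.vcf)
data_lines=$(grep -vc '^#' toy_example.vcf)
echo "header lines: $header_lines, data lines: $data_lines"
```

Header lines carry the file description and column names, while each remaining line describes one mutation, so counting the two groups separately is a quick sanity check before any further analysis.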

If you are using Cygwin on a Windows machine, you will need to download the file “Venter_chr123.vcf” from the Vula site (IBT2017 -> Resources -> Genomics -> Consolidation Session) and move it to your Cygwin home directory. You should be able to find your Cygwin home directory using Windows File Explorer. In most instances the path should be c:\cygwin\home\your_username.

Once you have placed the file in your Cygwin home directory, it will be accessible via the command line, and you should see the filename if you type ‘ls’ while you are in your home directory. Once you have created the VCF file for chromosome 1, named “Venter_chr1.vcf”, by following the instructions in the video, you can save it to the same directory.
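As a rough illustration of what selecting a single chromosome involves, the sketch below builds a small made-up file and keeps the header lines plus the records whose first column is 1. The file names toy_chr123.vcf and toy_chr1.vcf and their contents are hypothetical, and the sketch assumes chromosomes are written as plain numbers (1, not chr1). The command shown in the video may differ.

```shell
# Build a small, made-up file with records on chromosomes 1, 2 and 3.
# (File names and contents are hypothetical, not the course data.)
printf '##fileformat=VCFv4.1\n' > toy_chr123.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy_chr123.vcf
printf '1\t1000\t.\tA\tG\t.\tPASS\t.\n' >> toy_chr123.vcf
printf '1\t1500\t.\tT\tC\t.\tPASS\t.\n' >> toy_chr123.vcf
printf '2\t2000\t.\tC\tT\t.\tPASS\t.\n' >> toy_chr123.vcf
printf '3\t3000\t.\tG\tA\t.\tPASS\t.\n' >> toy_chr123.vcf

# Keep every header line (starting with '#') and every data line whose
# first tab-separated column is exactly "1" (assumed chromosome naming).
awk -F'\t' '/^#/ || $1 == "1"' toy_chr123.vcf > toy_chr1.vcf

# Check the result: list the files and count the chromosome 1 records kept.
ls toy_chr123.vcf toy_chr1.vcf
chr1_records=$(grep -vc '^#' toy_chr1.vcf)
echo "chromosome 1 records: $chr1_records"
```

Keeping the header lines matters because annotation tools generally expect them. If the chromosomes in a file are named differently (for example chr1), the value compared against the first column would need to change accordingly.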

Following the instructions in the demo video, you should all have a working directory called “IBT_consolidation_week_2017” containing “Venter_chr123.vcf” and “Venter_chr1.vcf”. You should also have downloaded to this directory the VCF file produced by the VEP annotation, which we called “Venter_chr1_VEP.vcf”. We will be using this file for Task 2.

Note: For those formally enrolled in the IBT course: if you had problems generating “Venter_chr1.vcf” or downloading “Venter_chr1_VEP.vcf”, please note that these files have been provided to your teaching assistants for download.

Tools used in this session

Task 1: Using VEP Filters

Task 1: instructions

For this task, you should have the VEP results page in front of you. Use the Filters to retrieve information about the gene ENSG00000230021 (Note: leave all other options set to their default values).

1. What biotype is this gene classified as?

2. How many mutations are reported for this gene? (Have a look at the “Allele” column on the left-hand side to orient your answer, and at the “Location” column where the mutations are detailed.)

3. How many different transcripts are affected by these mutations?

4. Click on the different mutations you see in order to view the region in detail. Explore the information on the page associated with each mutation. We have explored this kind of information in one of the previous sessions. What class of SV (structural variation) is associated with these mutations?

Note: please leave this page open in your browser as we will come back to it later (Task 3).

Task 1: participant’s answer

start typing your answer here

Task 2: Navigate through the same information using the command line

Task 2: instructions

We will now try to retrieve the same information as above from the command line, using the annotated VCF file called “Venter_chr1_VEP.vcf”, which you should all have downloaded at this stage. This annotated VCF file should be located in your working directory “IBT_consolidation_week_2017”, as instructed.

1. Go to your working directory (use the cd command). What command would allow you to retrieve all lines of information related to the ENSG00000230021 gene?

2. How many lines of information are related to this gene?

Task 2: participant’s answer

start typing your answer here

Task 3: Back to the VEP results page to modify other parameters

Task 3: instructions

Go back to the VEP results page you left open in your browser. Next to the “Filters” options, on the left-hand side, you can see the “Navigation” options. Notice that the default option is set to “5”.

1. Set the “Navigation” options to 10. How has changing this option changed the results obtained in terms of the number of mutations?

2. Set the “Navigation” options to 50. How has changing this option changed the results obtained in terms of the number of mutations?

3. Click on the last mutation you can see in the table (1:765156-765156) and have a look at the information related to this mutation, as we did in Task 1. What class of SV (structural variation) is associated with this mutation?

Task 3: participant’s answer

start typing your answer here

Concluding remark:

There is a huge difference between the results you retrieved through the website and the results you can retrieve by interrogating your file from the command line. In the example above, this difference came from the filtering options selected, whether they were left at their defaults or chosen by you. When you do not know what to expect, the smallest error could be detrimental to the rest of your analysis, as you could miss an important variation such as a CNV, as you saw here.

This exercise is meant to show you the power of the command-line interface when you interrogate this kind of large file, both in terms of speed and accuracy.