Table des matières
Here we will use the BWA aligner to map short reads to a reference genome, and then call variants (differences between the reads and the reference).
Download data
Goal: get the sequence data!
Run:
mkdir ~/data cd ~/data curl -O http://dib-training.ucdavis.edu.s3.amazonaws.com/2017-ucdavis-igg201b/SRR2584857.fq.gz
Map data
Goal: execute a basic mapping
Run the following commands to install bwa:
cd curl -L https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2/download > bwa-0.7.15.tar.bz2
tar xjvf bwa-0.7.15.tar.bz2 cd bwa-0.7.15 make
sudo cp bwa /usr/local/bin echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc source ~/.bashrc
2 Make & change into a working directory:
3 Copy and gunzip the reference:
wget https://github.com/ctb/2017-ucdavis-igg201b/raw/master/lab1/ecoli-rel606.fa.gz gunzip ecoli-rel606.fa.gz
4 Prepare it for mapping:
bwa index ecoli-rel606.fa
5 Map! (This will take about 2 minutes.)
bwa mem -t 6 ecoli-rel606.fa ~/data/SRR2584857.fq.gz > SRR2584857.sam
6 Observe!
Visualize mapping
Goal: make it possible to go look at a specific bit of the genome.
Install samtools:
sudo apt-get -y install samtools
Convert the SAM file into a BAM file that can be sorted and indexed:
samtools view -hSbo SRR2584857.bam SRR2584857.sam
Sort the BAM file by position in genome:
samtools sort SRR2584857.bam SRR2584857.sorted
Index the BAM file so that we can randomly access it quickly:
samtools index SRR2584857.sorted.bam
Visualize with tview:
samtools tview SRR2584857.sorted.bam ecoli-rel606.fa
tview commands of relevance:
left and right arrows scroll q to quit CTRL-h and CTRL-l do “big” scrolls g ecoli:3931002 will take you to a specific location.
Call variants!
Goal: find places where the reads are systematically different from the genome.
Now we can call variants using samtools mpileup:
samtools mpileup -uD -f ecoli-rel606.fa SRR2584857.sorted.bam | \
bcftools view -bvcg - > variants.raw.bcf
This will take a few minutes… and output a file that is not human readable! But we can quickly convert it into the ‘variant call format’ that is human readable:
bcftools view variants.raw.bcf > variants.vcf
Look at the VCF file
The output VCF file contains a list of all the variants that samtools thinks are there. What’s in it?
The official VCF specification is a great read…if you’re suffering from insomnia. Let’s skip this and just take a quick look at the file.
Look at the non-commented lines along with the header:
The first five columns: CHROM POS ID REF ALT. It’s a little easier to see if you run
grep -v ^## variants.vcf | less -S
Use your left and right arrows to scroll, and ‘q’ to quit.
Examine one of the variants with tview:
samtools tview SRR2584857.sorted.bam ecoli-rel606.fa -p ecoli:920514
‘q’ to quit, left arrow to scroll a bit left.
Well, at least that variant looks real… Look at the VCF file with bedtools.¶
bedtools docs
Download and build bedtools:
cd ~/ curl -O -L https://github.com/arq5x/bedtools2/releases/download/v2.26.0/bedtools-2.26.0.tar.gz tar -xzf bedtools-2.26.0.tar.gz
cd bedtools2 make sudo make install
Go back to work:
Download a GFF3 file with annotations for E. coli:
wget https://github.com/ctb/2017-ucdavis-igg201b/raw/master/lab3/ecoli-rel606.gff.gz .
Run bedtools intersect:
bedtools intersect -a ecoli-rel606.gff.gz -b variants.vcf -wa -u
Documentation for bedtools intersect
Extract reads with samtools.
Execute:
samtools view SRR2584857.sorted.bam 'ecoli:920514-920514' > out.sam wc -l out.sam
and this will give you the coverage of the relevant position. Discussion points / extra things to cover¶
What are the drawbacks to mapping-based variant calling? What are the positives? Where do reference genomes come from?
Source : https://angus.readthedocs.io/en/2017/variant-calling.html