Outils pour utilisateurs

Outils du site


bioinfo:bwa

Here we will use the BWA aligner to map short reads to a reference genome, and then call variants (differences between the reads and the reference).

Download data

Goal: get the sequence data!

Run:

   mkdir ~/data
   cd ~/data
   curl -O http://dib-training.ucdavis.edu.s3.amazonaws.com/2017-ucdavis-igg201b/SRR2584857.fq.gz

Map data

Goal: execute a basic mapping

Run the following commands to install bwa:

   cd
   curl -L https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2/download > bwa-0.7.15.tar.bz2
   tar xjvf bwa-0.7.15.tar.bz2
   cd bwa-0.7.15
   make
   sudo cp bwa /usr/local/bin
   
   echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc
   source ~/.bashrc

2 Make & change into a working directory:

3 Copy and gunzip the reference:

   wget https://github.com/ctb/2017-ucdavis-igg201b/raw/master/lab1/ecoli-rel606.fa.gz
   gunzip ecoli-rel606.fa.gz

4 Prepare it for mapping:

   bwa index ecoli-rel606.fa

5 Map! (This will take about 2 minutes.)

   bwa mem -t 6 ecoli-rel606.fa ~/data/SRR2584857.fq.gz > SRR2584857.sam

6 Observe!

Visualize mapping

Goal: make it possible to go look at a specific bit of the genome.

Install samtools:

   sudo apt-get -y install samtools

Convert the SAM file into a BAM file that can be sorted and indexed:

   samtools view -hSbo SRR2584857.bam SRR2584857.sam

Sort the BAM file by position in genome:

   samtools sort SRR2584857.bam SRR2584857.sorted

Index the BAM file so that we can randomly access it quickly:

   samtools index SRR2584857.sorted.bam

Visualize with tview:

   samtools tview SRR2584857.sorted.bam ecoli-rel606.fa

tview commands of relevance:

      left and right arrows scroll
      q to quit
      CTRL-h and CTRL-l do “big” scrolls
      g ecoli:3931002 will take you to a specific location.

Call variants!

Goal: find places where the reads are systematically different from the genome.

Now we can call variants using samtools mpileup:

samtools mpileup -uD -f ecoli-rel606.fa SRR2584857.sorted.bam | \

  bcftools view -bvcg - > variants.raw.bcf

This will take a few minutes… and output a file that is not human readable! But we can quickly convert it into the ‘variant call format’ that is human readable:

bcftools view variants.raw.bcf > variants.vcf

Look at the VCF file

The output VCF file contains a list of all the variants that samtools thinks are there. What’s in it?

The official VCF specification is a great read…if you’re suffering from insomnia. Let’s skip this and just take a quick look at the file.

Look at the non-commented lines along with the header:

The first five columns: CHROM POS ID REF ALT. It’s a little easier to see if you run

   grep -v ^## variants.vcf | less -S

Use your left and right arrows to scroll, and ‘q’ to quit.

Examine one of the variants with tview:

   samtools tview SRR2584857.sorted.bam ecoli-rel606.fa -p ecoli:920514

‘q’ to quit, left arrow to scroll a bit left.

Well, at least that variant looks real… Look at the VCF file with bedtools.¶

bedtools docs

Download and build bedtools:

   cd ~/
   curl -O -L https://github.com/arq5x/bedtools2/releases/download/v2.26.0/bedtools-2.26.0.tar.gz
   tar -xzf bedtools-2.26.0.tar.gz
   cd bedtools2
   make
   sudo make install

Go back to work:

Download a GFF3 file with annotations for E. coli:

   wget https://github.com/ctb/2017-ucdavis-igg201b/raw/master/lab3/ecoli-rel606.gff.gz .

Run bedtools intersect:

   bedtools intersect -a ecoli-rel606.gff.gz -b variants.vcf -wa -u

Documentation for bedtools intersect

Extract reads with samtools.

  Execute:
   samtools view SRR2584857.sorted.bam 'ecoli:920514-920514' > out.sam
   wc -l out.sam

and this will give you the coverage of the relevant position. Discussion points / extra things to cover¶

  What are the drawbacks to mapping-based variant calling? What are the positives?
  Where do reference genomes come from?

Source : https://angus.readthedocs.io/en/2017/variant-calling.html

bioinfo/bwa.txt · Dernière modification : 2021/04/08 11:52 de klorobionten