




















































In this article by Tiago Antao, author of Bioinformatics with Python Cookbook, you will process next-generation sequencing datasets using Python.
If you work in life sciences, you are probably aware of the increasing importance of computational methods to analyze increasingly larger datasets. There is a massive need for bioinformaticians to process this data, and one the main tools is, of course, Python. Python is probably the fastest growing language in the field of data sciences. It includes a rich ecology of software libraries to perform complex data analysis. Another major point in Python is its great community, which is always ready to help and produce great documentation and high-quality reliable software.
In this article, we will use Python to process next-generation sequencing datasets. This is one of the many examples of Python usability in bioinformatics; chances are that if you have a biological dataset to analyze, Python can help you. This is surely the case with population genetics, genomics, phylogenetics, proteomics, and many other fields.
Next-generation Sequencing (NGS) is one of the fundamental technological developments of the decade in the field of life sciences. Whole Genome Sequencing (WGS), RAD-Seq, RNA-Seq, Chip-Seq, and several other technologies are routinely used to investigate important biological problems. These are also called high-throughput sequencing technologies with good reason; they generate vast amounts of data that need to be processed. NGS is the main reason why computational biology is becoming a "big data" discipline. More than anything else, this is a field that requires strong bioinformatics techniques. There is very strong demand for professionals with these skillsets.
Here, we will not discuss each individual NGS technique per se (this will be a massive undertaking). We will use two existing WGS datasets: the Human 1000 genomes project (http://www.1000genomes.org/) and the Anopheles 1000 genomes dataset (http://www.malariagen.net/projects/vector/ag1000g). The code presented will be easily applicable for other genomic sequencing approaches; some of them can also be used for transcriptomic analysis (for example, RNA-Seq). Most of the code is also species-independent, that is, you will be able to apply them to any species in which you have sequenced data.
As this is not an introductory text, you are expected to at least know what FASTA, FASTQ, BAM, and VCF files are. We will also make use of basic genomic terminology without introducing it (things such as exomes, nonsynonym mutations, and so on). You are required to be familiar with basic Python, and we will leverage that knowledge to introduce the fundamental libraries in Python to perform NGS analysis.
Here, we will concentrate on analyzing VCF files.
You will need Python 2.7 or 3.4. You can use many of the available distributions, including the standard one at http://www.python.org, but we recommend Anaconda Python from http://continuum.io/downloads.
We also recommend the IPython Notebook (Project Jupyter) from http://ipython.org/. If you use Anaconda, this and many other packages are available with a simple conda install.
There are some amazing libraries to perform data analysis in Python; here, we will use NumPy (http://www.numpy.org/) and matplotlib (http://matplotlib.org/), which you may already be using in your projects. We will also make use of the less widely used seaborn library (http://stanford.edu/~mwaskom/software/seaborn/).
For bioinformatics, we will use Biopython (http://biopython.org) and PyVCF (https://pyvcf.readthedocs.org).
The code used here is available on GitHub at https://github.com/tiagoantao/bioinf-python.
In your realistic pipeline, you will probably be using other tools, such as bwa, samtools, or GATK to perform your alignment and SNP calling. In our case, tabix and bgzip (http://www.htslib.org/) is needed.
After running a genotype caller (for example, GATK or samtools mpileup), you will have a Variant Call Format (VCF) file reporting on genomic variations, such as SNPs (Single-Nucleotide Polymorphisms), InDels (Insertions/Deletions), CNVs (Copy Number Variation) among others. In this recipe, we will discuss VCF processing with the PyVCF module over the human 1000 genomes project to analyze SNP data.
I am to believe that 2 to 20 GB of data for a tutorial is asking too much. Although, the 1000 genomes' VCF files with realistic annotations are in that order of magnitude, we will want to work with much less data here. Fortunately, the bioinformatics community has developed tools that allow partial download of data. As part of the samtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. For example:
tabix -fh ftp://ftp- trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_ with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_ integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 |bgzip -c > genotypes.vcf.gz tabix -p vcf genotypes.vcf.gz
The first line will perform a partial download of the VCF file for chromosome 22 (up to 17 Mbp) of the 1000 genomes project. Then, bgzip will compress it.
The second line will create an index, which we will need for direct access to a section of the genome.
The preceding code is available at https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Working_with_VCF.ipynb.
Take a look at the following steps:
import vcf v = vcf.Reader(filename='genotypes.vcf.gz') print('Variant Level information') infos = v.infos for info in infos: print(info) print('Sample Level information') fmts = v.formats for fmt in fmts: print(fmt)
v = vcf.Reader(filename='genotypes.vcf.gz') rec = next(v) print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, rec.QUAL, rec.FILTER) print(rec.INFO) print(rec.FORMAT) samples = rec.samples print(len(samples)) sample = samples[0] print(sample.called, sample.gt_alleles, sample.is_het, sample.phased) print(int(sample['DP']))
from collections import defaultdict f = vcf.Reader(filename='genotypes.vcf.gz') my_type = defaultdict(int) num_alts = defaultdict(int) for rec in f: my_type[rec.var_type, rec.var_subtype] += 1 if rec.is_snp: num_alts[len(rec.ALT)] += 1 print(num_alts) print(my_type)
The purpose of this recipe is to get you up to speed on the PyVCF module. At this stage, you should be comfortable with the API. We do not delve much here on usage details because that will be the main purpose of the next recipe: using the VCF module to study the quality of your variant calls.
It will probably not be a shocking revelation that PyVCF is not the fastest module on earth. This file format (highly text-based) makes processing a time-consuming task. There are two main strategies of dealing with this problem: parallel processing or converting to a more efficient format. Note that VCF developers will perform a binary (BCF) version to deal with part of these problems at http://www.1000genomes.org/wiki/analysis/variant-call-format/bcf-binary-vcf-version-2.
If you are using NGS data, the quality of your VCF calls may need to be assessed and filtered. Here, we will put in place a framework to filter SNP data. More than giving filtering rules (an impossible task to be performed in a general way), we give you procedures to assess the quality of your data. With this, you can then devise your own filters.
In the best-case scenario, you have a VCF file with proper filters applied; if this is the case, you can just go ahead and use your file. Note that all VCF files will have a FILTER column, but this does not mean that all the proper filters were applied. You have to be sure that your data is properly filtered.
In the second case, which is one of the most common, your file will have unfiltered data, but you have enough annotations. Also, you can apply hard filters (that is, no need for programmatic filtering). If you have a GATK annotated file, refer, for instance, to http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set.
In the third case, you have a VCF file that has all the annotations that you need, but you may want to apply more flexible filters (for example, "if read depth > 20, then accept; if mapping quality > 30, accept if mapping quality > 40").
In the fourth case, your VCF file does not have all the necessary annotations, and you have to revisit your BAM files (or even other sources of information). In this case, the best solution is to find whatever extra information you have and create a new VCF file with the needed annotations. Some genotype callers like GATK allow you to specify with annotations you want; you may also want to use extra programs to provide more annotations, for example, SnpEff (http://snpeff.sourceforge.net/) will annotate your SNPs with predictions of their effect (for example, if they are in exons, are they coding on noncoding?).
It is impossible to provide a clear-cut recipe; it will vary with the type of your sequencing data, your species of study, and your tolerance to errors, among other variables. What we can do is provide a set of typical analysis that is done for high-quality filtering.
In this recipe, we will not use data from the Human 1000 genomes project; we want "dirty" unfiltered data that has a lot of common annotations that can be used to filter it. We will use data from the Anopheles 1000 genomes project (Anopheles is the mosquito vector involved in the transmission of the parasite causing malaria), which makes available filtered and unfiltered data. You can find more information about this project at http://www.malariagen.net/projects/vector/ag1000g.
We will get a part of the centromere of chromosome 3L for around 100 mosquitoes, followed by a part somewhere in the middle of that chromosome (and index both), as shown in the following code:
tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC. phase1.AR1.vcf.gz 3L:1-200000 |bgzip -c > centro.vcf.gz tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC. phase1.AR1.vcf.gz 3L:21000001-21200000 |bgzip -c > standard.vcf.gz tabix -p vcf centro.vcf.gz tabix -p vcf standard.vcf.gz
As usual, the code to download this data is available at the https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Filtering_SNPs.ipynb notebook.
Finally, a word of warning about this recipe: the level of Python here will be slightly more complicated than before. The more general code that we will write may be easier to reuse in your specific case. We will perform extensive use of functional programming techniques (lambda functions) and the partial function application.
Take a look at the following steps:
%matplotlib inline from collections import defaultdict import seaborn as sns import matplotlib.pyplot as plt import vcf def do_window(recs, size, fun): start = None win_res = [] for rec in recs: if not rec.is_snp or len(rec.ALT) > 1: continue if start is None: start = rec.POS my_win = 1 + (rec.POS - start) // size while len(win_res) < my_win: win_res.append([]) win_res[my_win - 1].extend(fun(rec)) return win_res wins = {} size = 2000 vcf_names = ['centro.vcf.gz', 'standard.vcf.gz'] for vcf_name in vcf_names: recs = vcf.Reader(filename=vcf_name) wins[name] = do_window(recs, size, lambda x: [1])
def apply_win_funs(wins, funs): fun_results = [] for win in wins: my_funs = {} for name, fun in funs.items(): try: my_funs[name] = fun(win) except: my_funs[name] = None fun_results.append(my_funs) return fun_results stats = {} fig, ax = plt.subplots(figsize=(16, 9)) for name, nwins in wins.items(): stats[name] = apply_win_funs(nwins, {'sum': sum}) x_lim = [i * size for i in range(len(stats[name]))] ax.plot(x_lim, [x['sum'] for x in stats[name]], label=name) ax.legend() ax.set_xlabel('Genomic location in the downloaded segment') ax.set_ylabel('Number of variant sites (bi-allelic SNPs)') fig.suptitle('Distribution of MQ0 along the genome', fontsize='xx-large')
Figure 1: The number of bi-allelic SNPs distributed of windows of 2, 000 bp of size for an area of 200 Kbp near the centromere (blue) and in the middle of chromosome (green). Both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project
import functools import numpy as np mq0_wins = {} vcf_names = ['centro.vcf.gz', 'standard.vcf.gz'] size = 5000 def get_sample(rec, annot, my_type): res = [] samples = rec.samples for sample in samples: if sample[annot] is None: # ignoring nones continue res.append(my_type(sample[annot])) return res for vcf_name in vcf_names: recs = vcf.Reader(filename=vcf_name) mq0_wins[vcf_name] = do_window(recs, size, functools.partial(get_sample, annot='MQ0', my_type=int))
stats = {} colors = ['b', 'g'] i = 0 fig, ax = plt.subplots(figsize=(16, 9)) for name, nwins in mq0_wins.items(): stats[name] = apply_win_funs(nwins, {'median': np.median, '75': functools.partial(np.percentile, q=75)}) x_lim = [j * size for j in range(len(stats[name]))] ax.plot(x_lim, [x['median'] for x in stats[name]], label=name, color=colors[i]) ax.plot(x_lim, [x['75'] for x in stats[name]], '--', color=colors[i]) i += 1 ax.legend() ax.set_xlabel('Genomic location in the downloaded segment') ax.set_ylabel('MQ0') fig.suptitle('Distribution of MQ0 along the genome', fontsize='xx-large')
Figure 2: Median (continuous line) and 75th percentile (dashed) of MQ0 of sample SNPs distributed on windows of 5,000 bp of size for an area of 200 Kbp near the centromere (blue) and in the middle of chromosome (green); both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project
Figure 3: The continuous line represents the fraction of heterozygosite calls computed at a certain depth; in blue is the centromeric area, in green is the "standard" area; the dashed lines represent the number of sample calls per depth; both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project
def get_sample_relation(recs, f1, f2): rel = defaultdict(int) for rec in recs: if not rec.is_snp: continue for sample in rec.samples: try: v1 = f1(sample) v2 = f2(sample) if v1 is None or v2 is None: continue # We ignore Nones rel[(v1, v2)] += 1 except: pass # This is outside the domain (typically None) return rel rels = {} for vcf_name in vcf_names: recs = vcf.Reader(filename=vcf_name) rels[vcf_name] = get_sample_relation(recs, lambda s: 1 if s.is_het else 0, lambda s: int(s['DP']))
def plot_hz_rel(dps, ax, ax2, name, rel): frac_hz = [] cnt_dp = [] for dp in dps: hz = 0.0 cnt = 0 for khz, kdp in rel.keys(): if kdp != dp: continue cnt += rel[(khz, dp)] if khz == 1: hz += rel[(khz, dp)] frac_hz.append(hz / cnt) cnt_dp.append(cnt) ax.plot(dps, frac_hz, label=name) ax2.plot(dps, cnt_dp, '--', label=name)
fig, ax = plt.subplots(figsize=(16, 9)) ax2 = ax.twinx() for name, rel in rels.items(): dps = list(set([x[1] for x in rel.keys()])) dps.sort() plot_hz_rel(dps, ax, ax2, name, rel) ax.set_xlim(0, 75) ax.set_ylim(0, 0.2) ax2.set_ylabel('Quantity of calls') ax.set_ylabel('Fraction of Heterozygote calls') ax.set_xlabel('Sample Read Depth (DP)') ax.legend() fig.suptitle('Number of calls per depth and fraction of calls which are Hz',, fontsize='xx-large')
def get_variant_relation(recs, f1, f2): rel = defaultdict(int) for rec in recs: if not rec.is_snp: continue try: v1 = f1(rec) v2 = f2(rec) if v1 is None or v2 is None: continue # We ignore Nones rel[(v1, v2)] += 1 except: pass return rel
accepted_eff = ['INTERGENIC', 'INTRON', 'NON_SYNONYMOUS_CODING', 'SYNONYMOUS_CODING'] def eff_to_int(rec): try: for annot in rec.INFO['EFF']: #We use the first annotation master_type = annot.split('(')[0] return accepted_eff.index(master_type) except ValueError: return len(accepted_eff)
eff_mq0s = {} for vcf_name in vcf_names: recs = vcf.Reader(filename=vcf_name) eff_mq0s[vcf_name] = get_variant_relation(recs, lambda r: eff_to_int(r), lambda r: int(r.INFO['DP']))
fig, ax = plt.subplots(figsize=(16,9)) vcf_name = 'standard.vcf.gz' bp_vals = [[] for x in range(len(accepted_eff) + 1)] for k, cnt in eff_mq0s[vcf_name].items(): my_eff, mq0 = k bp_vals[my_eff].extend([mq0] * cnt) sns.boxplot(bp_vals, sym='', ax=ax) ax.set_xticklabels(accepted_eff + ['OTHER']) ax.set_ylabel('DP (variant)') fig.suptitle('Distribution of variant DP per SNP type', fontsize='xx-large')
Figure 4: Boxplot for the distribution of variant read depth across different SNP effects
The approach would depend on the type of sequencing data that you have, the number of samples, and potential extra information (for example, pedigree among samples).
This recipe is very complex as it is, but parts of it are profoundly naive (there is a limit of complexity that I could force on you on a simple recipe). For example, the window code does not support overlapping windows; also, data structures are simplistic. However, I hope that they give you an idea of the general strategy to process genomic high-throughput sequencing data.
In this article, we prepared the environment, analyzed variant calls and learned about genome accessibility and filtering SNP data.