Community developed variant calling pipelines

Brad Chapman

Bioinformatics Core, Harvard School of Public Health

@chapmanb

20 July 2013

Challenges

Complex, rapidly changing pipelines

gatk_changes.png

Large number of specialized dependencies

https://github.com/StanfordBioinformatics/HugeSeq

Quality differences between calling methods

http://www.bioplanet.com/gcat

Scaling on full ecosystem of clusters

schedulers.png

Solution

http://www.amazon.com/Community-Structure-Belonging-Peter-Block/dp/1605092770

bcbio_nextgen_highlevel.png

Development goals

  • Quantifiable
  • Analyzable
  • Scalable
  • Reproducible
  • Community developed
  • Accessible

Quantify quality
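
One way to put numbers on call quality is to compare a caller's output against a reference call set, such as the NIST NA12878 file named in the `validate` setting of the configuration. The sketch below is an illustration under assumptions, not the pipeline's own comparison method: each variant is reduced to a `(chrom, pos, ref, alt)` tuple, and the names `Comparison`, `compare_calls`, `sensitivity` and `precision` are hypothetical.

```python
from typing import NamedTuple, Set, Tuple

# Assumed simplification: a variant is identified by (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]


class Comparison(NamedTuple):
    concordant: int  # present in both call sets
    missing: int     # in the reference only (missed calls)
    extra: int       # in the calls only (possible false positives)


def compare_calls(calls: Set[Variant], reference: Set[Variant]) -> Comparison:
    """Count shared and discordant variants between two call sets."""
    return Comparison(
        concordant=len(calls & reference),
        missing=len(reference - calls),
        extra=len(calls - reference),
    )


def sensitivity(cmp: Comparison) -> float:
    """Fraction of reference variants that were recovered."""
    total = cmp.concordant + cmp.missing
    return cmp.concordant / total if total else 0.0


def precision(cmp: Comparison) -> float:
    """Fraction of called variants that are in the reference."""
    total = cmp.concordant + cmp.extra
    return cmp.concordant / total if total else 0.0
```

Exact tuple matching is a deliberate simplification: it ignores differences in how callers represent the same indel or multi-allelic site, which a real comparison would need to normalize first.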

Known unknowns

  • Coverage: summarize what you can't assess
  • Structural: large, complex rearrangements
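
Summarizing what cannot be assessed can be as simple as reporting the regions where read depth is too low to call variants. The following is a minimal sketch, assuming per-base depths are already available as sorted `(chrom, pos, depth)` records with 1-based positions; the function name `low_coverage_regions` and the default threshold of 10 are illustrative assumptions, not values taken from the pipeline.

```python
from typing import Iterable, List, Tuple


def low_coverage_regions(
    depths: Iterable[Tuple[str, int, int]], min_depth: int = 10
) -> List[Tuple[str, int, int]]:
    """Merge consecutive low-depth positions into (chrom, start, end) regions.

    Positions are 1-based and the returned intervals are inclusive.
    Input is assumed to be sorted by chromosome and position.
    """
    regions = []
    current = None  # [chrom, start, end] of the open low-coverage region
    for chrom, pos, depth in depths:
        if depth < min_depth:
            if current and current[0] == chrom and current[2] == pos - 1:
                current[2] = pos  # extend the open region by one base
            else:
                if current:
                    regions.append(tuple(current))
                current = [chrom, pos, pos]
        elif current:
            # Depth recovered: close the open region.
            regions.append(tuple(current))
            current = None
    if current:
        regions.append(tuple(current))
    return regions
```

Reporting these intervals alongside the calls makes it explicit that the absence of a variant in such a region is not evidence of a reference genotype.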

Analysis

Query

gemini.png

https://github.com/arq5x/gemini

Visualize

https://github.com/chapmanb/o8

Parallel scaling

parallel-clustertypes.png

Better parallel blocks

parallel-genome.png
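
The figure is not reproduced here, so the exact blocking strategy is not shown. As a rough sketch of the general idea of splitting a genome into independently processable blocks, the code below cuts each chromosome into fixed-size regions. Fixed-size splitting is an assumption made for illustration, the function name `genome_blocks` is hypothetical, and the pipeline may choose block boundaries differently.

```python
from typing import Dict, List, Tuple


def genome_blocks(
    chrom_sizes: Dict[str, int], block_size: int
) -> List[Tuple[str, int, int]]:
    """Split chromosomes into (chrom, start, end) blocks.

    Coordinates are 0-based and half-open, so consecutive blocks
    share a boundary value but never overlap.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, block_size):
            blocks.append((chrom, start, min(start + block_size, size)))
    return blocks
```

Each block can then be handed to a separate worker, with the per-block results concatenated in genome order afterwards.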

Reproducibility

  • Express intentions at a high level
  • Revision controlled configuration
  • Handle complex distributed logging
  • Provenance tracking
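
As an illustration of the provenance point above, a pipeline step can record what was run and on which inputs. This is a minimal sketch using only the Python standard library; the record layout and the names `file_sha256` and `provenance_record` are assumptions for illustration, not the pipeline's actual provenance format.

```python
import hashlib
import time
from typing import Dict, List, Sequence


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large inputs are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(step: str, command: Sequence[str], inputs: List[str]) -> Dict:
    """Build a JSON-serializable record of one pipeline step."""
    return {
        "step": step,
        "command": list(command),
        "inputs": {path: file_sha256(path) for path in inputs},
        # UTC timestamp of when the record was created.
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
```

Writing one such record per step, next to the revision-controlled configuration, gives enough information to check later whether a result came from the same inputs and commands.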

Configuration

- files: [NA12878-NGv3-LAB1360-A_1.fastq.gz, NA12878-NGv3-LAB1360-A_2.fastq.gz]
  description: NA12878
  analysis: variant2
  genome_build: GRCh37
  algorithm:
    aligner: bwa
    recalibrate: gatk
    realign: gatk
    variantcaller: [gatk, freebayes, gatk-haplotype]
    coverage_interval: exome
    coverage_depth: high
    platform: illumina
    quality_format: Standard
    validate: NA12878-nist-v2_13-NGv3-pass.vcf
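
The configuration above states intent (which aligner, which callers, what to validate against) and leaves the execution details to the pipeline. To show how such a description might be turned into work items, the sketch below expands a sample into one job per variant caller. The sample is written as a Python dict mirroring part of the YAML so the example needs no third-party YAML parser; the function name `expand_callers` and the job layout are hypothetical and do not reflect the pipeline's internals.

```python
from typing import Dict, List

# A subset of the sample configuration above, as a Python dict.
sample = {
    "files": [
        "NA12878-NGv3-LAB1360-A_1.fastq.gz",
        "NA12878-NGv3-LAB1360-A_2.fastq.gz",
    ],
    "description": "NA12878",
    "analysis": "variant2",
    "genome_build": "GRCh37",
    "algorithm": {
        "aligner": "bwa",
        "variantcaller": ["gatk", "freebayes", "gatk-haplotype"],
        "validate": "NA12878-nist-v2_13-NGv3-pass.vcf",
    },
}


def expand_callers(sample: Dict) -> List[Dict]:
    """Return one job description per configured variant caller."""
    algorithm = sample["algorithm"]
    callers = algorithm.get("variantcaller", [])
    if isinstance(callers, str):
        callers = [callers]  # accept a single caller given as a plain string
    return [
        {
            "sample": sample["description"],
            "genome_build": sample["genome_build"],
            "aligner": algorithm["aligner"],
            "variantcaller": caller,
            "validate": algorithm.get("validate"),
        }
        for caller in callers
    ]


jobs = expand_callers(sample)
```

Here `jobs` holds three entries, one each for `gatk`, `freebayes` and `gatk-haplotype`, all sharing the same alignment settings and validation file.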

Provenance

Community developed

  • Fully automated installation: CloudBioLinux
  • Deployable on multiple clusters (LSF, SGE, Torque…)
  • API for new aligners and variant callers
  • Open source, hackable and documented
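
An API for new aligners and variant callers could take the form of a small registry that maps a configured name to a function. The sketch below shows that general pattern only; it is not bcbio-nextgen's actual API, and `CALLERS`, `register_caller`, `run_caller` and the `example-caller` entry are all hypothetical names.

```python
from typing import Callable, Dict

# A caller takes a BAM path and a reference path and returns a result.
CallerFn = Callable[[str, str], str]

CALLERS: Dict[str, CallerFn] = {}


def register_caller(name: str) -> Callable[[CallerFn], CallerFn]:
    """Decorator that makes a caller available under a configuration name."""
    def decorator(fn: CallerFn) -> CallerFn:
        if name in CALLERS:
            raise ValueError(f"Caller already registered: {name}")
        CALLERS[name] = fn
        return fn
    return decorator


@register_caller("example-caller")
def example_caller(bam_file: str, ref_file: str) -> str:
    # Placeholder body: a real plugin would run the tool and return a VCF path.
    return f"example-caller would call variants in {bam_file} against {ref_file}"


def run_caller(name: str, bam_file: str, ref_file: str) -> str:
    """Look up a caller by its configured name and run it."""
    try:
        caller = CALLERS[name]
    except KeyError:
        raise ValueError(f"Unknown variant caller: {name}") from None
    return caller(bam_file, ref_file)
```

With this shape, adding a new tool means contributing one decorated function, and a configuration value such as `variantcaller` only needs to match a registered name.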

https://github.com/chapmanb/bcbio-nextgen

Automated installation

  • Single biggest software problem: running for the first time
  • Bootstrap from bare machine to ready-to-go pipeline
  • Builds on existing installation work: CloudBioLinux
  • Provide example pipelines with real data

http://cloudbiolinux.org

https://bcbio-nextgen.readthedocs.org

Accessible

http://exploringpersonalgenomics.org/

Galaxy

galaxy_pipeline.png

https://bitbucket.org/hbc/galaxy-central-hbc

STORMSeq

4.1_stormseq.png

http://www.stormseq.org/

Summary

  • Community developed pipelines overcome these challenges
  • Focus
    • Assessing quality: good science
    • Analysis: enable exploration
    • Scalability: finish in time
    • Reproducibility: show your work
  • Widely accessible

https://github.com/chapmanb/bcbio-nextgen