Geraldine Van der Auwera, Author at Terra

Thank you for your feedback!

Geraldine Van der Auwera — Thu, 24 Nov 2022 15:00:40 +0000

All of us on the Terra development team have a lot to be grateful for — it’s an amazing time to be doing meaningful, challenging work at the interface of biomedical science and computing technology.

Today we express our collective gratitude to the thousands of researchers who are actively using the platform, making all of our hard work worth it.

We are especially grateful to the countless among you who have taken the time to let us know how it’s working for you, and what we can do to make things better. The feedback we collect through various channels including the helpdesk, the community forum, and in-depth user experience interviews is a critical part of our development process. It’s not always rainbows and butterflies, because there’s rarely a straight line to success when you’re solving hard problems, but honest feedback is what allows us to keep improving and arrive at solutions that truly meet your needs.

So a big thank you to every one of you who helped us understand what we could do better — and who let us know when we reached the goal!

On that note, we leave you with a sampling of quotes from researchers in population genetics, oncology, public health and more, whose generous feedback gave our team warm fuzzy feelings to carry into the holiday weekend.

“Thank you for making Terra. I’ve used other technologies, and this is by far the easiest to use.”

“I find Terra easy to use – and I like not having to use many command line inputs. Thank you for helping me conduct my work.”

“Thank you for making bioinformatics accessible!”

“Terra allows me to share workflows and data with laboratory scientists who would otherwise need to learn how to write their own tools or hire a bioinformatician.”

“What I like the most is that I can train someone with less computational experience to conduct analysis, without having to give them a raw machine or VM.”

“Terra has a very easy GUI for directing complex cloud computing jobs once WDL is mastered (which is a very easy language to learn).”

“I have been using Terra for a long time and I love it! It’s very easy and I like how all my workspaces are in one place.”

“The customer service is outstanding – such knowledgeable staff.”

“Terra has great support and continuously improves. Thank you to the team!”

The post Thank you for your feedback! appeared first on Terra.

Five online courses to master Terra

Geraldine Van der Auwera — Thu, 17 Nov 2022 19:40:25 +0000

Learning how to use a new set of tools can be challenging, especially when it involves computing technology on which you’ve never been formally trained. Compounding the problem, navigating documentation resources can be a bit hit-and-miss when there is so much to learn, and what you need to learn depends on what you’re trying to do.

To help you get to grips with the tooling offered by Terra without the guesswork, our User Education team has developed a series of free online courses offered through the Leanpub learning platform.

Each course is organized around clear learning objectives, and integrates documentation resources and tutorials from the Terra knowledge center with course-specific elements such as summaries and quizzes. This combination of elements is designed to guide the learning process and solidify takeaways. The courses also include links to further reading and complementary resources to deepen your understanding and help you apply what you have learned to your own research.

To take advantage of this invaluable resource, head over to the Terra University homepage, select a course and follow the instructions to add it to your Leanpub library. Once a course is in your library, you have unlimited time to read through the lessons and work through the exercises.

Note that the Leanpub system uses a shopping cart/purchasing framework, but rest assured the Terra courses are all completely free.

We hope you will find these courses helpful. If you run into any difficulties, or if you have any questions or feedback for the User Education team, please reach out through the public Support Forum or the private Terra Helpdesk.

Ten simple rules — #6 Version both software and data

Geraldine Van der Auwera — Thu, 10 Nov 2022 19:39:33 +0000

This blog post is part of a series based on the paper “Ten simple rules for large-scale data processing” by Arkarachai Fungtammasan et al. (PLOS Computational Biology, 2022). Each installment reviews one of the rules proposed by the authors and illustrates how it can be applied when running workflows in Terra. In this installment, we take a look at version control across a range of components including tools, dependencies, workflow scripts and data resources.

Version control is one of those technical concepts that’s obviously a good idea yet can be really tricky to do correctly. And as much as it has become an established practice for most computational scientists, many tend to underestimate the scope of what should be version-controlled.

(If you’ve never heard of version control or would like a refresher tailored for scientists, check out the Software Carpentry lesson on version control with Git.)

In this sixth rule, Arkarachai Fungtammasan and colleagues rightly emphasize that it’s not enough to control the version of the main software code and tools involved in analysis:

“Applying version control to all code is always recommended for reproducible research. In the context of a large-scale data analysis, we need to go beyond this initial step […].”

Indeed, the tools researchers use directly — those that feature most obviously in command lines and in scripts — typically rely on other, less visible components, or dependencies. Changes to those dependencies can affect analysis results, so it’s important to ensure specific versions are used rather than “the latest available”.

The authors also call out workflow scripts and data resources such as genome builds as components that should be carefully version-controlled.

Version-controlled workflows in Terra

Terra’s workflow execution system is designed specifically to enable version control at multiple levels, with minimal effort on the part of pipeline developers and end users.

Workflow scripts

“When multiple processing steps are combined into workflow (Rule 4), the workflow itself should be versioned.”

The WDL workflow scripts used in Terra are held in version-controlled repositories: either the built-in Broad Methods Repository, which supports version control through the concept of “snapshots”, or the external Dockstore repository, which offers workflow-specific versioning features backed by the industry-standard code versioning of GitHub.

Tools, dependencies and computing environment

“To guard against interruptions, all dependencies should be pinned to a specific version (ideally through a version control hash or equivalent) that has been thoroughly tested (Rule 5). […] Utilizing container technology is highly encouraged, if allowed in the system, to guarantee the computing environment for processing.”

Within a WDL workflow, each individual analysis task specifies a Docker container that encapsulates all software tools and dependencies involved. Workflow developers and users can specify the exact version of the container using a unique identifier that ensures absolute reproducibility of the computing environment.
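
As a rough illustration of what an "exact version" looks like in practice, the Python sketch below distinguishes container image references pinned by an immutable content digest (name@sha256: followed by 64 hexadecimal characters) from references that rely on a tag. The helper name is_pinned_by_digest and the image names are hypothetical, and the check is a simplified pattern match rather than a full parser for image references.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 hex characters.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True if the image reference ends with a sha256 content digest."""
    return bool(_DIGEST_SUFFIX.search(image_ref))


# Hypothetical image names and a placeholder digest, for illustration only.
placeholder_digest = "a" * 64
examples = [
    "example.org/tools/aligner:latest",  # mutable tag
    "example.org/tools/aligner:1.2.3",   # specific tag, but tags can be re-pushed
    f"example.org/tools/aligner@sha256:{placeholder_digest}",  # immutable digest
]

for ref in examples:
    status = "pinned" if is_pinned_by_digest(ref) else "not pinned"
    print(f"{ref} -> {status}")
```

A check along these lines could be run over the container references in a workflow before launching a large batch, so that an unpinned image is caught before it affects results.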

Data resources

“In biomedical data analysis, there are often components beyond the software and related dependencies that need to be included in the reproducibility plan. For instance, if the data processing relies on a genome build, using the most recent build and release in the pipeline will be insufficient. Instead, the processing needs to be tied to a specific build and release, much like the dependencies in the pipeline.”

Terra workspaces provide data management features that include data manifests (see “data tables”), the ability to load versioned genome reference builds, and a system of key-value pairs that can be used to specify and label custom data resources for use in workflow configurations.

It’s worth noting that the workflow logging system also provides features that support version control by making it possible to go back and look at configuration details for past analyses, as we touched on in Rule 2 (Document Everything).

To learn more about making effective use of version control for large-scale data processing in Terra, please read “How does pipeline versioning work?” in the Terra knowledge base.

Paper Spotlight: Germline predisposition to pediatric Ewing sarcoma

Geraldine Van der Auwera — Thu, 03 Nov 2022 18:13:27 +0000

This blog is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied.

Germline predisposition to pediatric Ewing sarcoma is characterized by inherited pathogenic variants in DNA damage repair genes

By Riaz Gillani, Sabrina Y. Camp, Seunghun Han, Jill K. Jones, Hoyin Chu, Schuyler O’Brien, Erin L. Young, Lucy Hayes, Gareth Mitchell, Trent Fowler, Alexander Gusev, Junne Kamihara, Katherine A. Janeway, Joshua D. Schiffman, Brian D. Crompton, Saud H. AlDubayan and Eliezer M. Van Allen

The American Journal of Human Genetics (2022) https://doi.org/10.1016/j.ajhg.2022.04.007

Abstract: More knowledge is needed regarding germline predisposition to Ewing sarcoma to inform biological investigation and clinical practice. Here, we evaluated the enrichment of pathogenic germline variants in Ewing sarcoma relative to other pediatric sarcoma subtypes, as well as patterns of inheritance of these variants. We carried out European-focused and pan-ancestry case-control analyses to screen for enrichment of pathogenic germline variants in 141 established cancer predisposition genes in 1,147 individuals with pediatric sarcoma diagnoses (226 Ewing sarcoma, 438 osteosarcoma, 180 rhabdomyosarcoma, and 303 other sarcoma) relative to identically processed cancer-free control individuals. Findings in Ewing sarcoma were validated with an additional cohort of 430 individuals, and a subset of 301 Ewing sarcoma parent-proband trios was analyzed for inheritance patterns of identified pathogenic variants. A distinct pattern of pathogenic germline variants was seen in Ewing sarcoma relative to other sarcoma subtypes. FANCC was the only gene with an enrichment signal for heterozygous pathogenic variants in the European Ewing sarcoma discovery cohort (three individuals, OR 12.6, 95% CI 3.0–43.2, p = 0.003, FDR = 0.40). This enrichment in FANCC heterozygous pathogenic variants was again observed in the European Ewing sarcoma validation cohort (three individuals, OR 7.0, 95% CI 1.7–23.6, p = 0.014), representing a broader importance of genes involved in DNA damage repair, which were also nominally enriched in individuals with Ewing sarcoma. Pathogenic variants in DNA damage repair genes were acquired through autosomal inheritance. Our study provides new insight into germline risk factors contributing to Ewing sarcoma pathogenesis.

What part of the work was done in Terra?

Excerpts from the paper’s Methods section:

Raw sequencing data was downloaded to Terra (https://firecloud.terra.bio/), a collaborative cloud-computing platform utilized for genomic analyses, developed as part of the NCI Cloud Pilot program and supported by the Broad Institute.

Personal communication from first author Riaz Gillani:

We used Terra for many parts of the analysis, including harmonizing raw sequencing data, calling variants, inferring ancestry, running quality control, and counting variants. Some of the key public workflows that enabled this were DeepVariant, BAMRealigner, and various adaptations of the GATK workflows.

Links to the relevant public workflows:

https://portal.firecloud.org/?return=firecloud#methods/vanallenlab/deepvariant_wes/

https://portal.firecloud.org/?return=firecloud#methods/GPTAG/BamRealigner/

How did they do it?

Automated workflows

The authors used previously described bioinformatics analysis pipelines implemented as WDL workflows and shared in the Broad Methods Repository.

Terra also supports importing workflows from Dockstore, a free and open source platform for sharing reusable and scalable analytical tools and workflows.

They ran the workflows at scale using Terra’s workflow execution service.

To try your hand at running a workflow in Terra, check out the Workflows Quickstart Guide.

Interactive analysis

The authors used Jupyter Notebooks in Terra’s interactive Cloud Environments system for the ancestry inference analysis.

To get started with Jupyter Notebooks in Terra, check out the Notebooks Quickstart Guide.

Get trained on Terra with our interactive online workshops

Geraldine Van der Auwera — Thu, 13 Oct 2022 18:58:25 +0000

In response to popular demand, our User Education team has been running weekly interactive workshops to help newcomers get started with Terra. They’ve also developed some satellite workshops that use Terra to teach specific applications like whole genome analysis with DRAGEN-GATK.

These online events are open to all and are entirely free: registered participants even receive access to a cloud billing project, so everyone can follow along and do the exercises without having to jump through any administrative hoops!

As you work through the hands-on exercises, you’ll be able to ask questions in real time and interact directly with our experts, developers and trainers.

Whether you’re completely new to the platform, or you’re already using it but looking to get more formal training on how to utilize its features effectively, this is a great opportunity to level up your skills in a friendly, inclusive online environment.

Introduction to Terra: A scalable platform for biomedical research

Online workshop hosted by the Terra team. Register today to receive the Zoom invite!

This interactive workshop consists of two 2-hour sessions held on consecutive days covering the following topics through demos and hands-on exercises:

Terra architecture as it relates to cloud-based data sets, tools, and computing resources
How data is organized in Terra
How to run a WDL workflow to automate analysis
How to launch a Cloud Environment from your Terra workspace to run interactive analysis tools like Jupyter Notebooks, RStudio, and Galaxy

Available dates:

October 19 and 20, 10:00 am – 12:00 pm ET
October 31 and November 1, 10:00 am – 12:00 pm ET

For more information, visit the Introduction to Terra event page.

DRAGEN-GATK Webinar

Online workshop hosted by the GATK team. Register today to receive the Zoom invite!

This interactive workshop consists of a 2-hour session on Whole-Genome Analysis with DRAGEN-GATK. You will learn about DRAGEN-GATK — what it is, why it exists, and how to use it. Then, you’ll get an opportunity to use DRAGEN-GATK in a controlled demo environment within Terra.

Date: October 21, 10:30 am – 12:30 pm ET

For more information, visit the DRAGEN-GATK Webinar event page.

We hope you will join us for one or more of these online training events! If you’d like to get notified when new events are scheduled, go to the Upcoming Events page of the Terra support website and hit the “Follow” button.

Make sure to also check out our library of supporting materials and recordings from past events.

What’s next for genomics? Plug into GA4GH to find out

Geraldine Van der Auwera — Fri, 30 Sep 2022 14:30:26 +0000

Sequencing technology keeps getting better, faster, more productive — but that’s not the only thing that shapes the big picture of where we’re headed as a field. If you want to understand where genomics is going, you should pay attention to the Global Alliance for Genomics and Health, or GA4GH.

GA4GH is an international organization composed largely of contributors from member institutions in healthcare, research, patient advocacy, and information technology, seeking to enable responsible genomic data sharing within a human rights framework.

The boring way to describe what GA4GH does is to say it develops standards and policies — ranging from technology standards like file formats and application programming interfaces (APIs), which aim to enable interoperability and broad access to data and tools among the global research community, to policy frameworks like patient consent clauses and data privacy policies.

I prefer to think of it this way: GA4GH contributors are effectively building the scaffolding for what genomics will deliver in practice, at scale, over the next decade.

The GA4GH standard development process involves collaboration with implementers, i.e. groups that apply the GA4GH standards in practice, typically in the context of driver projects — which have the advantage of providing real-life use cases. (As it turns out, you develop better solutions when you’re actually working on real problems.)

For technology standards, implementation means building software tools and operating services that follow the rules laid out by the relevant standards.

For example, the CRAM and VCF file formats are widely used bioinformatics standards stewarded by GA4GH that specify how to encode sequencing reads and variant calls, respectively. There is a software library called htsjdk that implements both of these standards in the Java programming language, meaning that it includes code that is capable of reading and writing files that are encoded according to those standards. Researchers can then use genome analysis tools like GATK and Picard, which include the htsjdk library, to read and write CRAM and VCF files as part of their analysis work. Tadaa. (For fans of C, substitute htslib and samtools in the library/tool mentions.)
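
To make the idea of "implementing a standard" a little more concrete, here is a toy Python sketch that reads the eight fixed, tab-separated columns of a single VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is not how htsjdk or htslib work internally, and it ignores headers, genotype columns and most edge cases. The record and the names VariantRecord and parse_vcf_line are made up for illustration.

```python
from typing import NamedTuple


class VariantRecord(NamedTuple):
    chrom: str
    pos: int
    id: str
    ref: str
    alts: list
    qual: str
    filter: str
    info: dict


def parse_vcf_line(line: str) -> VariantRecord:
    """Parse the eight fixed columns of one VCF data line (toy version)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, _, value = entry.partition("=")
            info_dict[key] = value if value else True

    # ALT may list several alternate alleles separated by commas.
    return VariantRecord(chrom, int(pos), vid, ref, alt.split(","), qual, filt, info_dict)


# Made-up record, for illustration only.
record = parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=100;DB")
print(record.chrom, record.pos, record.ref, record.alts, record.info)
```

Real-world files bring compression, indexing, header validation and many special cases, which is why analysis tools rely on shared libraries rather than ad hoc parsers like this one.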

In addition to these now-classic (if imperfect) workhorse formats, GA4GH has been driving the development of other, newer knowledge representation standards that you may not yet be aware of, but will likely transform the way many of us work. Take the Variation Representation Specification (VRS, pronounced “verse”), which among other things makes it possible to capture the complex information that underlies variant interpretation in computable form, a key feature for solving variant interpretation bottlenecks. Right now, VRS is a fairly niche product, but within a few years I expect we’ll see it being used across a variety of research and clinical diagnostic platforms.

Computable standards for alleviating variant interpretation bottlenecks. From “Genomic Knowledge Standards Advancements” by Larry Babb, presented at the 10th Plenary Meeting of the GA4GH (see Day 1 recording).

Pro-tip: you can check out the Python implementation of VRS on GitHub, and there’s even a Terra workspace hosting Jupyter notebooks that demonstrate how to use it in practice.

Speaking of platforms, another major axis of GA4GH standard development is platform-level interoperability, i.e. infrastructure standards that enable platforms like Terra to talk to each other.

I’ve written before about the big picture of interoperability for open ecosystems, the example of the AnVIL project, and how Terra uses the DRS standard for data interoperability specifically. The ultimate goal here is to make it possible for researchers to do things like combining data from separate repositories into powerful federated analyses without having to move any of it around.
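
For a sense of what a data interoperability standard looks like at the level of a single request, the Python sketch below converts a hostname-based DRS URI into the HTTPS endpoint that, under version 1 of the GA4GH DRS API, returns metadata for that object. The URI is a made-up example and the function name drs_uri_to_url is hypothetical. The code only builds the URL: it does not contact any server, does not handle compact identifier-based URIs, and is not a description of how Terra resolves DRS URIs internally.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map drs://<hostname>/<object_id> to the DRS v1 object endpoint URL."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")

    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError(f"missing object ID in DRS URI: {drs_uri!r}")

    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


# Made-up URI, for illustration only.
print(drs_uri_to_url("drs://drs.example.org/314159"))
```

Because every compliant repository answers at the same kind of endpoint, an analysis platform can locate data by its identifier without needing custom code for each repository.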

Excitingly, that dream is starting to materialize! We are now at the stage where data federation is effectively possible — for a limited set of datasets and platforms, with some clunkiness involved. As the work continues, you can expect the scope of what’s possible to include more datasets, with a smoother experience as the handoff between platforms gets ironed out.

There is a lot more to say about the scope and impact of GA4GH work, but I’ll have to leave that for another time.

The bottom line is, if any of this is new to you, now is a great time to start getting caught up.

The organization held its annual plenary meeting over two days last week, and the full recordings for both Day 1 and Day 2 are available on YouTube, annotated with timestamps for specific sessions and presentations (see the expanded video descriptions). You can also find links to the slide decks in the agenda.

As a new member of the organization (I joined the Large-Scale Genomics Work Stream in June), I found the lineup of talks struck a great balance between showcasing progress made so far, outlining upcoming challenges and discussing concrete solutions. I hope you will find these resources useful too — and consider joining the effort!

Additional resources

For a more comprehensive tour of the Global Alliance’s scope, vision, and outputs, read the “Perspective” paper published late last year in Cell Genomics by Heidi Rehm and colleagues:

GA4GH: International policies and standards for data sharing across genomic research and healthcare (2021) Cell Genomics, Volume 1, Issue 2, https://doi.org/10.1016/j.xgen.2021.100029

NVIDIA’s Clara Parabricks workflows in Terra bring GPU acceleration to genomic analysis

Geraldine Van der Auwera — Tue, 20 Sep 2022 17:40:09 +0000

The past few years have seen a massive surge in the development of advanced analytical methods for biomedical research, fueled in part by technological innovations that allow computational scientists to crunch data at ever-increasing speed and scale. A growing number of technology companies have joined the effort to help researchers tackle emerging challenges, ranging from large-scale genomics to multi-modal analysis of the myriad data types associated with medical records — including doctors’ notes, which are famously easy to read and interpret.

Today NVIDIA, a pioneer in AI and accelerated computing, announced a new partnership with the Broad Institute that will pool the two organizations’ respective expertise in deep learning, accelerated compute, and biomedical research. This partnership builds on an existing collaboration between NVIDIA and the Broad’s GATK team, who have already been working together to improve some of the deep learning algorithms in GATK. (Keep an eye on the GATK blog for an upcoming release announcement.)

The NVIDIA team released a Clara Parabricks workspace in Terra that makes their GPU-accelerated genomic analysis toolkit available on the cloud at the click of a button. As shown by the benchmarking results below, the Clara Parabricks workflows in Terra deliver up to 24x faster execution compared to equivalent CPU-based workflows, and can cut the total cost of execution by up to 50%.

What’s in the box? Drop-in replacements for popular GATK workflows

NVIDIA Clara Parabricks is a suite of GPU-accelerated industry-standard tools for the most common genomics analyses, including read alignment and both germline and somatic variant calling for whole genomes, exomes, and gene panels.

To make these tools easy to run in Terra, the NVIDIA team produced six modular workflows written in the Workflow Description Language (WDL) that are designed as drop-in replacements for the corresponding GATK workflows, summarized in the figure below.

The six Clara Parabricks workflows available as WDLs in Terra (leftmost boxes), with component modules listed to their right. In the case of the germline calling workflow, the two modules (HaplotypeCaller and DeepVariant) are alternative options that can be toggled with a configuration flag.

Each workflow comes with a reference configuration that includes the most appropriate GPU instances to run it on, and the ability to select GATK best practice flags and options.

The NVIDIA team collaborated with the GATK team at the Broad Institute to evaluate the accuracy of the germline workflows. Through this rigorous process, they verified that the Clara Parabricks workflows produce results that are functionally equivalent to the CPU-native GATK versions, as originally defined here.

As a specific example, benchmarking on publicly available Genome in a Bottle (GIAB) samples with the fq2bam and germline caller workflows from the Clara Parabricks suite produced variant calling results that were >0.9999 equivalent in both precision and recall to those produced by the BWA, MarkDuplicates, BQSR, and HaplotypeCaller commands in the GATK’s Whole Genome Germline Single Sample variant calling workflow (available here in Terra).
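
For readers less familiar with these metrics, precision and recall are computed from counts of true positive (TP), false positive (FP) and false negative (FN) calls relative to a reference set of calls. The Python sketch below shows the arithmetic. The counts are invented to illustrate the scale of the numbers and are not taken from the benchmark.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)  # fraction of reported calls that are in the reference set
    recall = tp / (tp + fn)     # fraction of reference calls that were reported
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=3_999_800, fp=200, fn=200)
print(f"precision={p:.5f} recall={r:.5f}")
```

With roughly four million calls in this made-up example, staying above 0.9999 on both metrics leaves room for only a few hundred discordant calls in either direction.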

Up to 24 times the speed and half the cost

The team benchmarked the runtime of the Clara Parabricks workflows on Terra, and found that the GPU-accelerated workflows delivered speedups of up to 24x for germline genome analyses.

When using Clara Parabricks in Terra, the total runtime for a 30x whole genome, including BWA-MEM, MarkDuplicates, BQSR, and HaplotypeCaller, is significantly reduced on NVIDIA GPUs. Total runtime from FASTQ to VCF, including variant calling with HaplotypeCaller, is just over 2 hours on NVIDIA T4 GPUs compared to 24 hours in a CPU-based environment. Additionally, the cost of analysis was reduced by 50% in the GPU environment compared to the CPU environment.

Time and cost comparisons of alignment and variant calling (FASTQ to VCF) on CPU vs. GPU for a 30x whole genome in Terra.

Runtime from FASTQ to BAM (with BWA-MEM) was reduced from 7 hours with CPU instances (N2) to a little over an hour with 4 NVIDIA T4 instances, and dropped even further to ~45 mins with 8 NVIDIA V100 GPU instances.

You can also see in the figure above (right side panel) that the overall cost of running the workflows on the T4 GPUs is less than half the cost of running the CPU-based equivalents, which is always a happy surprise.

I’ve written before about the speed benefits of GPUs on this blog, though that was in the context of interactive analysis. One of my big takeaways was that you have to find the sweet spot between speed and cost that works for you, because oftentimes the fancy hardware that makes things go really fast is also the most expensive to rent. The good news is that if the speedup is big enough, you only have to use the special hardware for a very short amount of time, and that makes up for the higher rate. Or in the case of the T4 instances, more than makes up for it, since that configuration manages to be substantially more economical even as it delivers a heck of a speed boost.

According to NVIDIA, the T4 GPU is designed to optimize cost and performance when running inference-heavy workloads like Clara Parabricks, which matches what we’re seeing in these benchmarking results. NVIDIA T4 GPUs are available for as little as $0.11 per hour on Terra (backed by Google Cloud), so by running the Clara Parabricks workflows there, you can run the entire alignment and variant calling pipeline for less than $2.50 a sample. That represents a major reduction from the $5 cost-per-sample of the GATK’s CPU-based workflows (running on N2 instances) while still reducing the overall data processing runtime from a whole day to less than three hours. Instances on Terra can also be configured with up to 8 V100 GPUs, which are more optimized for absolute performance than the T4. The same FASTQ-to-VCF pipeline on 8 V100 GPUs is up to 24x faster than the CPU pipeline, at roughly twice the cost.
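
The trade-off between speed and cost comes down to simple arithmetic: total cost is the hourly rate multiplied by the runtime, so a faster configuration is cheaper overall whenever its speedup exceeds the ratio of hourly rates. The Python sketch below back-calculates effective hourly rates from the per-sample figures quoted in this post. The 2.5-hour T4 runtime is an assumed round number standing in for "less than three hours", and the resulting rates describe the whole pipeline configuration, not list prices for any instance type.

```python
def effective_hourly_rate(total_cost: float, hours: float) -> float:
    """Average cost per hour of a run, given its total cost and runtime."""
    return total_cost / hours


def cost_ratio(rate_ratio: float, speedup: float) -> float:
    """Total cost of the fast option relative to the slow one."""
    return rate_ratio / speedup


cpu_cost, cpu_hours = 5.00, 24.0  # per-sample figures quoted in the post
t4_cost, t4_hours = 2.50, 2.5     # cost bound from the post; runtime is an assumption

cpu_rate = effective_hourly_rate(cpu_cost, cpu_hours)
t4_rate = effective_hourly_rate(t4_cost, t4_hours)

speedup = cpu_hours / t4_hours   # how many times faster the T4 run is
rate_ratio = t4_rate / cpu_rate  # how many times more it costs per hour

print(f"speedup: {speedup:.1f}x, hourly rate ratio: {rate_ratio:.1f}x")
print(f"relative total cost: {cost_ratio(rate_ratio, speedup):.2f}")
```

Under these assumptions the T4 configuration costs several times more per hour than the CPU configuration, yet it finishes so much sooner that the total comes out at about half.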

I’m excited to see this technology being made available to a wide audience, in a form that doesn’t require taking specialized training or purchasing expensive hardware. It’s a big step forward toward ensuring researchers of any background are able to run sophisticated genomic analysis at scale.

Try it out for yourself today

The Clara Parabricks Terra workspace created by the NVIDIA team is preloaded with example data, workflow configurations, and straightforward instructions, so you can try out the workflows without having to install or tweak anything. Simply clone the workspace and launch the preconfigured examples, or load your own data and get to work.

If you’re new to running workflows on Terra, see the Workflows Quickstart Tutorial.

Don’t hesitate to reach out if you have any questions or if you run into any trouble running the workflows. For help with Terra-specific features (e.g. how to launch, monitor and troubleshoot WDL workflows in Terra), you can either post in our public discussion forum or contact the helpdesk team privately. For technical questions about NVIDIA Clara Parabricks, please visit the developer forum page here.

Acknowledgments

We are grateful to the NVIDIA team, specifically Eric Dawson and Vanessa Braunstein, for running the workflow benchmarks and helping us characterize the benefits of using Clara Parabricks on GPU instances in Terra.

Resources

– Terra Workflows Quickstart Tutorial

– Terra blog about using GPUs for interactive analysis, machine learning

– Cromwell documentation about runtime parameters for using GPUs in workflows

– Clara Parabricks Genomic Analysis webpage

– Clara Parabricks Documentation Page

– Clara Parabricks 4.0 Blog

– Clara Parabricks in Terra Workspace

– Clara Parabricks GTC DLI free hands-on workshop on Sept 21 as part of GTC

Learn about running bioinformatics workflows in the cloud

Geraldine Van der Auwera — Thu, 08 Sep 2022 17:47:45 +0000

Are you looking to expand your skillset this September? Whether you’re heading back to school for a new academic year or just itching to tackle some new professional challenges, there’s one topic that I can guarantee you will benefit from learning more about: bioinformatics workflows.

Last week, I presented an overview of this topic in a webinar hosted by a Brazilian organization that brings together several institutions’ PhD programs in bioinformatics. I knew it was going to be a very heterogeneous audience, ranging from complete beginners to experienced professional bioinformaticians, so I tried to make the presentation accessible to newcomers while still including some nuggets of more advanced information for the veterans.

The resulting presentation, which I’ve just posted on Zenodo, starts with a brief introduction to the topic of bioinformatics workflows, with a particular emphasis on execution in the cloud. It then introduces the Workflow Description Language (WDL), outlining the basic syntax and key features of the language, as well as the Cromwell execution engine, which is available as a managed service in Terra but can also be deployed as a standalone workflow manager on other systems. Finally, the presentation references three published scientific use cases that exemplify how WDL, Cromwell and Terra can be used to power scalable and reproducible research in the biomedical community.

You can find the full webinar recording on YouTube, with my presentation starting around the 13-minute mark, and the Q&A discussion starting around the 56-minute mark.

If you’re new to this topic, I hope you will find this presentation useful for getting oriented. Once you’re ready to take the next step and actually try your hand at running a workflow, I recommend checking out the Terra Workflows Quickstart. It’s a hands-on tutorial designed to walk you through the process of running a simple workflow on the cloud using Terra, with detailed step-by-step instructions and video recordings. Once you have the basics down, you’ll be able to move on to running more complex workflows imported through Dockstore.org, or start learning how to write your own workflows.

Along the way, you’ll pick up new knowledge and skills that will certainly come in handy as you develop your professional career. Good luck!

The post Learn about running bioinformatics workflows in the cloud appeared first on Terra.

Paper Spotlight: Molecular map of chronic lymphocytic leukemia and its impact on outcome

Geraldine Van der Auwera — Thu, 25 Aug 2022 17:40:22 +0000

This post is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied.

Molecular map of chronic lymphocytic leukemia and its impact on outcome

By Binyamin A. Knisbacher, Ziao Lin, Cynthia K. Hahn, Ferran Nadeu, Martí Duran-Ferrer, Gad Getz, Chip Stewart, Catherine J. Wu et al.

Nature Genetics (2022) https://doi.org/10.1038/s41588-022-01140-w

Abstract: Recent advances in cancer characterization have consistently revealed marked heterogeneity, impeding the completion of integrated molecular and clinical maps for each malignancy. Here, we focus on chronic lymphocytic leukemia (CLL), a B cell neoplasm with variable natural history that is conventionally categorized into two subtypes distinguished by extent of somatic mutations in the heavy-chain variable region of immunoglobulin genes (IGHV). To build the ‘CLL map,’ we integrated genomic, transcriptomic and epigenomic data from 1,148 patients. We identified 202 candidate genetic drivers of CLL (109 new) and refined the characterization of IGHV subtypes, which revealed distinct genomic landscapes and leukemogenic trajectories. Discovery of new gene expression subtypes further subcategorized this neoplasm and proved to be independent prognostic factors. Clinical outcomes were associated with a combination of genetic, epigenetic and gene expression features, further advancing our prognostic paradigm. Overall, this work reveals fresh insights into CLL oncogenesis and prognostication.

What part of the work was done in Terra?

Excerpts from the paper’s Methods section:

Sequence data processing and analysis

All sequencing data (WES, WGS, RNA-seq, RRBS, and targeted NOTCH1 sequencing) were processed and analyzed using methods implemented in the Terra platform (https://app.terra.bio). The main Terra methods are available at https://app.terra.bio/#workspaces/broad-firecloud-wupo1/CLLmap_Methods_Apr2021 […]

RNA-seq analysis

RNA-seq data were processed in Terra using the GTEx V7 pipeline (https://github.com/broadinstitute/gtex-pipeline). Briefly, reads were aligned with STAR (v2.6.1d) to hg19 (b37) using the GENCODE v19 annotation, and quality control metrics and gene expression were computed with RNA-SeQC v2.3.6 (https://github.com/getzlab/rnaseqc). A collapsed version of the GENCODE annotation was used to quantify gene-level expression (available at gs://gtex-resources/GENCODE/gencode.v19.genes.v7.collapsed_only.patched_contigs.gtf). TPMs were used for sample clustering, whereas gene counts were used for differential gene expression, as required[…]
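For readers less familiar with the distinction drawn in that last sentence, here is a minimal Python sketch of how TPM (transcripts per million) values relate to raw gene counts. This is not the authors' code: in the study, these values were computed by RNA-SeQC within the GTEx pipeline. The function name `counts_to_tpm` and the toy numbers are made up purely for illustration, and the sketch assumes a single effective length per gene.

```python
from typing import List, Sequence


def counts_to_tpm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Convert raw gene-level read counts to TPM for one sample.

    Hypothetical helper for illustration only; not part of RNA-SeQC
    or the GTEx pipeline.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    if any(length <= 0 for length in lengths_bp):
        raise ValueError("gene lengths must be positive")

    # Step 1: normalize each count by gene length (reads per base),
    # so that long genes are not favored simply for being long.
    rates = [count / length for count, length in zip(counts, lengths_bp)]

    total = sum(rates)
    if total == 0:
        # No reads at all: every gene gets a TPM of zero.
        return [0.0 for _ in rates]

    # Step 2: rescale so that the values sum to one million within the sample.
    return [rate / total * 1_000_000 for rate in rates]


# Toy example with made-up numbers: three genes in one sample.
toy_counts = [100, 200, 300]
toy_lengths_bp = [1000, 2000, 1000]
toy_tpm = counts_to_tpm(toy_counts, toy_lengths_bp)
print([round(value) for value in toy_tpm])
```

In the toy example, the first two genes end up with the same TPM even though their raw counts differ by a factor of two, because the second gene is twice as long. Because TPM values are rescaled within each sample, they suit comparisons such as clustering, whereas differential expression methods generally expect the unnormalized counts, which matches the split described in the excerpt.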

DNA methylation data processing

DNA methylome data was analyzed for a total of 1,037 samples, including 490 samples profiled with Illumina 450K array previously analyzed [52] (European Genome-phenome Archive (EGA) accession EGAD00010001975), and 547 samples profiled using RRBS with either single-end or paired-end approaches. A pipeline was developed in Terra to obtain the CpG methylation estimates from RRBS data (Supplementary Note). The epitype classifier and the epiCMIT mitotic clock were previously developed for Illumina 450K and EPIC array data […]

How did they do it?

The authors processed and analyzed all sequencing data (WES, WGS, RNA-seq, RRBS, and targeted NOTCH1 sequencing) using WDL workflows. The methods used in this study can be found in this Terra workspace.

If you are a new Terra user, try your hand at running a workflow in Terra with this Quickstart Tutorial Workspace.

Appendix: Data and code availability

The molecular data used in this study are publicly available and are included in the following patient cohorts: DFCI, Dana-Farber Cancer Institute; GCLLSG, German CLL Study Group; ICGC, International Cancer Genome Consortium; MDACC, MD Anderson Cancer Center; NHLBI, National Heart Lung and Blood Institute; UCSD, University of California San Diego. Sequencing, expression, and genotyping is available at EGA (http://www.ebi.ac.uk/ega/), which is hosted at the European Bioinformatics Institute, under accession number EGAS00000000092 (ICGC cohort) and in dbGaP under accession numbers phs001473.v2.p1 (MDACC, NHLBI), phs000922.v2.p1 (GCLLSG), phs001431.v2.p1 (DFCI, UCSD), phs001091.v1.p1 (MDACC), phs000435.v3.p1 (DFCI), phs002297.v2.p1 (NHLBI) and phs000879.v1.p1 (DFCI) and GEO accession number GSE143673 (GCLLSG). 450K array data are available at EGA under accession number EGAD00010001975 (ICGC). The project data portal is available at https://cllmap.org.
Terra methods used in the study can be found at https://app.terra.bio/#workspaces/broad-firecloud-wupo1/CLLmap_Methods_Apr2021. Source code used in the study can be found at https://github.com/getzlab/CLLmap. The RFcaller pipeline is available at https://github.com/xa-lab/RFcaller. The new epiCMIT suitable for Illumina arrays and NGS approaches as well as the CLL epitype classifier can be found at https://github.com/Duran-FerrerM/CLLmap-epigenetics.


AnVIL in the Classroom: Cloud-scale educational resources for modern genomics

Geraldine Van der Auwera — Fri, 29 Jul 2022 20:37:19 +0000

Genomics has become enough of a mainstream discipline to be introduced in undergraduate classes, and even in high school. There are lots of courses and online resources available to help educators teach genomics, and heaps of literature about teaching methodologies.

In my recent webinar hosted by the American Society of Human Genetics (ASHG), I discussed the exciting opportunities that the move to cloud-based research infrastructure offers for educators who are interested in delivering practical instruction in genomics through hands-on exercises.

Big thanks to my colleagues Liz Kiernan and Anton Kovalsky, both Lead Science Educators in the Data Sciences Platform at the Broad Institute, for their invaluable contributions to the development and execution of this webinar.

Bridging the gap between teaching and research

Computational infrastructure has always been a challenge when it comes to hands-on teaching in scientific computing. Even at teaching institutions that are well-equipped with sophisticated computer labs, there often remains a deep divide between teaching environments and research environments, which end up siloed apart from each other. So as an educator, every time you want to bring data and tools from the research environment to a teaching setting, you have to cross that gap, which takes effort. You can easily end up with teaching examples that are out of date, or oversimplified, so learners only encounter toy versions of what researchers really do.

It’s critical that we work toward bridging that gap, to reduce the distance between teaching examples and real research as much as possible. Developing skills through realistic examples, closer to “real” scientific investigation, is both more exciting for learners and more productive in terms of achieving educational goals. I would also expect that when they are ready to take the next step in their educational journey, learners are more likely to transition smoothly to more complex projects if they can build on prior experience rather than having to be re-trained to use different tooling.

The solution that is standing right in front of us is to take advantage of what’s happening on the side of research infrastructure, where for the past few years we’ve seen a big shift toward using cloud infrastructure.

Traditionally, everyone would use their own siloed computing infrastructure, and if multiple groups were working with the same dataset, they would each store a local copy to work on. So we’d end up with gaps between researchers similar to the ones we were just talking about between researchers and educators. With the shift to the cloud, the idea is that everyone can access a single copy of the data that’s stored centrally, and run whatever computation they need on hardware that’s right there, colocated with the data.

This is not a new idea as such, but the reason it’s been gaining so much traction recently is the scale of the datasets that are being generated in genomics and related disciplines: it’s just not feasible, or fair, to expect everyone to download a copy of the data and work on it locally. Hence the big commitments we’re seeing from various federal agencies to support infrastructure projects like the NHGRI AnVIL, which aims to enable genomics researchers to make effective use of the cloud.

Putting the cloud to work for educators

The added benefit of the cloud model is that it also means everyone has access to the same type of machines, and you can package analysis tools in a way that anyone in the world can go and use in the same way without having to put a ton of effort into figuring out how to configure their local computing environment.

And that’s where we get to the key point I made in the webinar: this new cloud model is great for research, but it’s also a big opportunity for educators to be empowered to move in and use the same environments, datasets and tools that researchers use.

In support of this point, I gave a live demonstration of how an instructor could use AnVIL resources, specifically Jupyter Notebooks in Terra workspaces, to develop and administer practical instruction in genomics to a class of students. To see for yourself how this works, you can view the full recording of the webinar, which is available on demand from the ASHG learning portal.

The portal requires an account login, but the account registration is free and does not require a paid membership with ASHG.

We would love to hear from any educators who might be interested in trying out this model in their own teaching practice, so feel free to reach out to me specifically (geraldine@broadinstitute.org), or to the Terra support team through the community forum or helpdesk. We can help you identify resources and mechanisms to fit your needs and audience level.

Resources

Presentation materials

Webinar recording (on-demand from the ASHG learning portal)
Slide deck on Zenodo

Referenced in the live demo:

Configuration to launch custom notebook environment supporting embedded IGV

Custom environment container

gcr.io/broad-dsde-outreach/terra-base:ipyigv1

Startup script

gs://genomics-in-the-cloud/v1/scripts/install_GATK_4130_with_igv.sh

Workspaces and notebooks

Referenced in the slide deck:

Further reading

Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Michael Schatz et al., Cell Genom. 2022 Jan 12; 2(1): 100085
Terra takes the pain out of ‘omics’ computing in the cloud. Jeffrey Perkel, Nature (Technology feature article)
