Why we moved to Terra and what we learned along the way

In this guest blog post, Timothy Majarian, a Computational Associate in the Manning lab, gives us a glimpse into the lab's journey of transitioning from its local high-performance compute cluster to cloud computing with Terra.

Back in 2017, our lab had just begun the transition to cloud computing from our local high-performance compute cluster. We were all new to everything: a platform called FireCloud (soon to become Terra), the Workflow Description Language (WDL), and so much more. While the transition certainly took some time — with many missteps, help requests, and even some frantic forum searches — we eventually became (semi-)pro users. Nowadays, Terra is essential to nearly all of our projects within the Manning lab.

By sharing what we've learned along the way, we hope our story can dispel, or at least quell, some of the initial concerns new-to-the-cloud researchers feel, and help make the journey smoother.

As a lab, we're interested in complex disease genetics, particularly among diverse populations. We predominantly study type 2 diabetes in large epidemiological cohorts, made possible by the Trans-Omics for Precision Medicine (TOPMed) program, sponsored by the National Institutes of Health's National Heart, Lung, and Blood Institute. TOPMed is part of a precision medicine initiative that focuses on whole genome sequencing (WGS) of many, many individuals – some 100,000+ study participants from 30+ cohorts. Given the massive scale of these data, the cross-institutional, collaborative nature of our project, and the computational demands of our planned genome-wide association analyses with WGS data, we had no choice but to move our analyses and data storage to the cloud.

Before we could even touch the WGS data, we had to develop the tools and workflows needed for our analysis plan. We spent a summer learning WDL and Docker: how to set up the correct compute environment, describe inputs, scatter jobs to run in parallel on multiple virtual machines, localize the right scripts, and ensure the outputs were gathered correctly. The sketch below illustrates these moving parts.
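
To make that concrete, here is a minimal sketch of the kind of WDL we were learning to write; the task, file names, and workflow are hypothetical stand-ins, not our actual pipeline. It shows a task pinned to a Docker image, with declared inputs and outputs, scattered across virtual machines by a workflow.

```wdl
version 1.0

# Hypothetical per-file task: the runner copies ("localizes") the input VCF onto
# the VM, runs the command inside the pinned Docker image, then copies the
# declared output back out when the task finishes.
task count_records {
  input {
    File vcf
    String docker    # supply an image containing bcftools
  }
  command <<<
    bcftools stats ~{vcf} > stats.txt
  >>>
  runtime {
    docker: docker
    memory: "4 GB"
    cpu: 1
  }
  output {
    File stats = "stats.txt"
  }
}

workflow count_all {
  input {
    Array[File] vcfs
    String docker
  }
  # Each iteration of the scatter runs on its own virtual machine, in parallel.
  scatter (vcf in vcfs) {
    call count_records { input: vcf = vcf, docker = docker }
  }
  output {
    Array[File] all_stats = count_records.stats   # implicit gather across shards
  }
}
```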

Lesson #1: First, get comfortable with the basics

Lesson #1 came out of these initial efforts, and we cannot stress it enough: take the time to really learn how WDL works, and develop and test your workflows with small datasets in a local computing environment. Too often, we were stuck on an esoteric error message that could easily have been avoided with a better understanding of the particulars of WDL. Understanding how inputs and outputs are managed, in particular, will significantly shorten the duration of any headaches.
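
As a contrived illustration of the kind of input/output detail that bit us (the task below is not one of ours): a WDL output must name a file the command actually creates in its working directory, or the task fails with a cryptic error only after the command has already run. A task like this can be exercised on a laptop with a local runner such as Cromwell (`java -jar cromwell.jar run summarize.wdl --inputs inputs.json`) before it ever touches the cloud.

```wdl
version 1.0

# Contrived example of a classic WDL gotcha around output management.
task summarize {
  input {
    File table
    String out_name = "summary.tsv"
  }
  command <<<
    # Write into the task's working directory. Writing to an absolute path such
    # as /tmp/summary.tsv instead would surface as a confusing "output file not
    # found" failure after the command itself had succeeded.
    head -n 100 ~{table} > ~{out_name}
  >>>
  runtime {
    docker: "ubuntu:18.04"
  }
  output {
    # This declared path must match what the command actually wrote.
    File summary = out_name
  }
}
```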

Lesson #2: Avoid wasting resources (i.e., money and time) by developing and testing locally

Next came development and testing, where concerns about cost usually begin to arise. While it's true that a large-scale analysis in the cloud can quickly become expensive, there are some simple ways to ensure that compute costs don't run away from you. Again, we found that the initial investment in learning WDL is the best way to prevent unnecessary costs: it helps you avoid running a large job only to realize that your results either don't make sense or are simply missing because of a mistake in your workflow.

Lesson #3: After your workflow succeeds locally, always test it in Terra with a small dataset

Testing locally is a great way to start the development cycle, but it won't catch every issue involved in moving your analysis to the cloud; mirroring the conditions of a cloud-based analysis on your local machine is possible, but, depending on your setup, there may be slight differences. We found, for example, that local file paths don't always behave the same as Google Cloud Storage URLs when used as workflow inputs or outputs.
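
One way to see why, sketched below with a hypothetical task: in WDL, an input declared as a File is localized to the VM before the command runs (so a gs:// URL becomes an ordinary local path), while an input declared as a String is passed through verbatim. Code that builds paths out of strings can therefore work on a laptop and break on the cloud.

```wdl
version 1.0

task io_demo {
  input {
    # File: on Terra, a gs:// URL supplied here is copied onto the VM first,
    # so the command always sees a plain local path.
    File assoc_results

    # String: passed through untouched. A gs:// URL here would reach the
    # command verbatim; fine if the tool can read from Cloud Storage
    # directly, broken otherwise.
    String prefix
  }
  command <<<
    wc -l < ~{assoc_results} > ~{prefix}.line_count.txt
  >>>
  runtime {
    docker: "ubuntu:18.04"
  }
  output {
    File line_count = "~{prefix}.line_count.txt"
  }
}
```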

We often use a single chromosome for GWAS testing, which minimizes cost when we first move our workflows to Terra while still using a representative dataset. This strategy also pays off when running the full analysis, since Terra's call-caching feature will detect and port over previous results rather than rerunning them: the test chromosome is skipped and the previously generated results are linked over.
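
The pattern looks roughly like this (hypothetical task and input names; our real association test is more involved): one shard per chromosome, so a single-chromosome test run exercises exactly one shard of the full analysis, and call caching can later reuse it.

```wdl
version 1.0

task assoc_test {
  input {
    File gds           # one genotype file per chromosome
    File phenotypes
    String docker
  }
  command <<<
    # Placeholder for the real association test (e.g. an R script calling GENESIS)
    echo "association results for ~{gds}" > assoc.tsv
  >>>
  runtime {
    docker: docker
  }
  output {
    File results = "assoc.tsv"
  }
}

workflow gwas {
  input {
    Array[File] gds_by_chromosome
    File phenotype_file
    String docker
  }
  scatter (gds in gds_by_chromosome) {
    call assoc_test { input: gds = gds, phenotypes = phenotype_file, docker = docker }
  }
  output {
    # With call caching on, rerunning over all chromosomes reuses the cached
    # shard for any chromosome whose inputs haven't changed, including the
    # one already analyzed in the small test run.
    Array[File] all_results = assoc_test.results
  }
}
```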

Having worked our way through these missteps, with the workflow on version I-can't-even-count-that-high, we created the final version of our GWAS workflow for WGS data (with scripts developed with our TOPMed collaborators) and were able to dive into our genetic association results. If you're interested, you can find our publicly available GWAS workflow here.

The payoff – a scalable, shareable, and secure analysis

Our association analysis used WGS data from over 50,000 individuals and tested ~44.5 million genetic variants. Terra allowed us to share these results securely with colleagues across the US and around the globe, and to perform follow-up analyses, like LD score regression and GCTA heritability analysis, using the interactive Jupyter notebooks built into the platform. All the while, we monitored costs to make sure we remained within our budget.

Resources and workflows that don’t require programming skills

While we went the route of developing our own tools, there are many workflows developed by the broader community that you can use in Terra. Two great resources are Dockstore and the Broad Methods Repository. For users new to cloud computing, searching either repository for applicable methods lets you learn the specifics of a workflow's implementation, adapt it if necessary, and hit the ground running with your own development and analysis cycle. When using pre-built workflows, we would stress the importance of understanding the exact statistical methods they implement. While some tools, like file format converters, are broadly applicable, the vast majority should be used with careful consideration of your particular use case.

Final takeaway: initial investments in understanding cloud computing pay off in the end

What I hope you've taken away from our story is this: though you'll need to make some initial investments to do your research on cloud-based infrastructure, those investments will pay off many times over – especially as data grow and science trends towards even more collaboration across institutions. The practices outlined here – learning the details of WDL and Docker, testing locally, testing in the cloud, and, finally, running a full analysis – have helped our lab maximize the benefits of cloud computing. As in any discipline, a solid conceptual understanding of the tools and techniques you'll use will greatly reduce both the time and cost of adopting cloud-based research; that includes workflow development, statistical and computational methods, and the particulars of the datasets you'll be using. The lessons learned in developing our first pipelines, particularly testing locally and in the cloud, have continued to pay off as we design and implement new workflows for the cloud.

For more on some of the tools that we have developed, and an interactive example of running GWAS in the cloud, head over to our public workspace available through Terra’s showcase & tutorials page. The workspace was developed for the 2019 American Society of Human Genetics annual conference and leads a user through the common steps of a genome-wide association analysis.

Workflow worries to working WDLs in under a week

How insightful feedback about the GATK showcase led to a flurry of updates and the creation of a suite of format conversion tools

Last June, just as we were putting the final touches on the teaching materials for a 4-day GATK workshop that we were planning to run entirely on Terra, we received a very insightful email from the workshop host, Matt Bashton of Newcastle University in the UK. A long-time user of the GATK, Matt had been trying out the Best Practices workspaces that we maintain on behalf of the GATK team in the Terra Showcase. In his email, he detailed a list of nine issues that he saw as blocking researchers like him from moving their work onto Terra.

The workshop was scheduled for the following week, so my initial response was a bit of panic. Even though the workshop mainly uses Terra as a convenient environment for running the hands-on tutorials, it’s a unique opportunity for us to introduce the Best Practices workspaces as the recommended way to try out and evaluate the official GATK workflows. That wasn’t going to be terribly effective if there were major flaws in the resources involved.

Reading down the detailed list Matt had sent us, I understood his concerns: for example, he was absolutely right that it wasn't clear how to convert sequence data to the uBAM format that many GATK workflows on the platform use as input, or how to deal with interleaved FASTQ files. Some issues we were already aware of, such as the lack of support for WDL 1.0, which had been blocking the publication of new and updated GATK workflows on Terra. In the case of WDL 1.0, resolution was imminent (in fact, it's supported as of July 18th!), but other issues were a month or more away from being addressed. So while we wanted to make some concrete improvements in time for the workshop, my immediate question was: what could we possibly do in less than a week?
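
For readers hitting the same wall: the conversion itself is a short WDL task around Picard's FastqToSam. The sketch below is illustrative rather than the exact workflow we published (sample naming, read groups, and resource sizing are simplified), and it assumes an image that provides the Picard jar at /usr/picard/picard.jar, as the broadinstitute/picard image does.

```wdl
version 1.0

# Illustrative sketch: paired FASTQs in, unmapped BAM (uBAM) out.
task paired_fastq_to_ubam {
  input {
    File fastq_1
    File fastq_2
    String sample_name
    String docker    # a Picard image; the jar path below assumes broadinstitute/picard
  }
  command <<<
    java -Xmx4g -jar /usr/picard/picard.jar FastqToSam \
      FASTQ=~{fastq_1} \
      FASTQ2=~{fastq_2} \
      OUTPUT=~{sample_name}.unmapped.bam \
      SAMPLE_NAME=~{sample_name} \
      PLATFORM=illumina
  >>>
  runtime {
    docker: docker
    memory: "6 GB"
  }
  output {
    File ubam = "~{sample_name}.unmapped.bam"
  }
}
```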

Turns out, quite a bit more than I was giving our team credit for.

That Friday afternoon, four members of our Support team sprang into action. It started with a conference call between two team members in the office, one working remotely, and me on my drive up to Maine for the weekend (don't worry, I pulled off the road before taking the call). We went over the list of issues one by one, and within an hour we had a list of action items for the four of us to tackle before the workshop.

By that evening we were working away. We had a very basic file format conversion workspace already, so one team member went through those workflows, updating them to make sure they were configured to be runnable back-to-back (in the right order) without additional tweaks. Another developed an entirely new WDL that converts an interleaved FASTQ input into paired FASTQ files. I worked on the documentation to clarify the purpose and requirements of each workflow, define their input and output formats, and so on. As we finished each part of the action plan, our fourth teammate updated the tools, data, and workspaces in the public showcase.
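
For a flavor of that new workflow (this is a sketch of the idea, not the exact WDL we shipped): seqtk can pull the odd- and even-numbered reads out of an interleaved FASTQ, which splits it back into read-1 and read-2 files.

```wdl
version 1.0

# Sketch: split an interleaved FASTQ into paired FASTQ files with seqtk,
# whose -1/-2 flags emit the odd- and even-numbered reads respectively.
task deinterleave_fastq {
  input {
    File interleaved_fastq
    String base_name
    String docker    # any image that provides seqtk
  }
  command <<<
    seqtk seq -1 ~{interleaved_fastq} | gzip > ~{base_name}_1.fastq.gz
    seqtk seq -2 ~{interleaved_fastq} | gzip > ~{base_name}_2.fastq.gz
  >>>
  runtime {
    docker: docker
  }
  output {
    File fastq_1 = "~{base_name}_1.fastq.gz"
    File fastq_2 = "~{base_name}_2.fastq.gz"
  }
}
```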

Within five days we had completely revamped the barebones file format conversion workspace that we had started from. It now contained one completely new and four significantly updated workflows, as well as a clearer title and description card:

[Screenshot: the revamped file format conversion workspace card]

We shared the new workspace with Matt, and his feedback was deeply rewarding: “Thanks so much for creating this workspace, it’s exactly what I was looking for. I suspect it will really help with on-boarding of new people.” Nailed it! We don’t often pull a weekender like that — this was honestly an extreme case (seriously, we know about work-life balance) — but sometimes it’s just worth it, you know?

Of course, we know that we’re not done yet. In our line of work, you’re never really done. There’s always more to do to clear the path, lay down tracks and make the ride smoother. For my part, the next project I plan to tackle is making all the GATK workspaces follow the same template – more modular, with clearer descriptions and titles.

I wanted to share this story to encourage all of you to give us your feedback too. Be like Matt. Tell us what you need, what’s blocking you. I can’t promise we’ll always have the solutions to all your troubles, but I can promise we’ll do our best every time.
