Ecosystem Archives - Terra
https://terra.bio/category/ecosystem/
Science at Scale


Celebrating a year of progress — and a sneak peek at what’s coming next
https://terra.bio/celebrating-a-year-of-progress-and-a-sneak-peek-at-whats-coming-next/
Thu, 15 Dec 2022

Highlights from Terra's development and growth in 2022, heading into the multi-cloud future of 2023.

Kyle Vernest is Head of Product in the Data Sciences Platform at the Broad Institute. In this guest blog post, Kyle takes a look back at how Terra has grown over the past year, and gives us a preview of what to expect in the first quarter of 2023. 


 

It’s been an incredible year for Terra, with a lot of new users coming to the platform as more labs, groups, and organizations move their computational work to the cloud. We’re also thrilled to see user growth being fueled by scientific consortia such as the Human Cell Atlas, and NIH-driven programs such as AnVIL, rallying their communities around Terra as a platform for secure data sharing and collaboration. 

The Terra development teams spanning the Broad Institute, Microsoft, and Verily have worked tirelessly to continue to expand the platform’s capabilities in service of these growing communities. Highlights of the year’s releases include an improved user interface for managing cloud environments for interactive analysis, increased scalability of the workflow management system, and better tooling for uploading and organizing data in workspaces. We also rolled out numerous usability improvements, like email notifications for workflow status and better organization of the list of workspaces. Most recently, we launched the public preview of the Terra Data Repository, a new component of the Terra platform designed to provide data storage and access management capabilities tailored for the life sciences.

Yet all these upgrades are in many ways only the tip of the iceberg. Behind the scenes, an enormous amount of work has gone into laying the groundwork for a major development that will come to fruition in the first quarter of 2023: support for storing data and running analyses on Microsoft Azure. 

 

Coming soon to a cloud near you

We have been working closely with our partners at Microsoft to expand Terra to a multi-cloud offering, and we are nearing the launch of Terra on Azure coming early in the new year. Leading up to the launch, you may notice a new “Sign in with Microsoft” option on the Terra welcome screen (which will take you to a “Coming Soon” page until the preview phase starts). 

But don’t worry if you’re planning to stick with Terra on Google; we have plenty of upgrades in store for you as well! In particular, you can look forward to taking advantage of WDL 1.1’s workflow language updates, and switching from Jupyter Notebook to JupyterLab for a more full-featured code development experience.

Whether you’re using Terra on Google or on Azure, you’ll be presented with a new version of the Terra Terms of Service, which we’ve updated to reflect the expanded functionality and new multi-cloud nature of the platform.

☁

Finally, as we close out this brief tour of the year’s achievements, we’re especially proud to celebrate the many scientific successes that Terra has already enabled. These have covered an impressive range of domains, from the Telomere-to-Telomere reference genome project to the CDC’s efforts to empower public health labs across the country to adopt genomics for biosurveillance. We look forward to many more in the coming year, featuring even greater variety — including more ‘omics data technologies beyond genomics.

 

 

What’s next for genomics? Plug into GA4GH to find out
https://terra.bio/whats-next-for-genomics-plug-into-ga4gh-to-find-out/
Fri, 30 Sep 2022

A short primer on the Global Alliance for Genomics and Health: what it is, why it matters, and links to recordings from the 2022 Plenary Meeting.

Sequencing technology keeps getting better, faster, more productive — but that’s not the only thing that shapes the big picture of where we’re headed as a field. If you want to understand where genomics is going, you should pay attention to the Global Alliance for Genomics and Health, or GA4GH.

GA4GH is an international organization composed largely of contributors from member institutions in healthcare, research, patient advocacy, and information technology, seeking to enable responsible genomic data sharing within a human rights framework.

The boring way to describe what GA4GH does is to say it develops standards and policies — ranging from technology standards like file formats and application programming interfaces (APIs), which aim to enable interoperability and broad access to data and tools among the global research community, to policy frameworks like patient consent clauses and data privacy policies. 

I prefer to think of it this way: GA4GH contributors are effectively building the scaffolding for what genomics will deliver in practice, at scale, over the next decade. 

The GA4GH standard development process involves collaboration with implementers, i.e. groups that apply the GA4GH standards in practice, typically in the context of driver projects — which have the advantage of providing real-life use cases. (As it turns out, you develop better solutions when you’re actually working on real problems.)

For technology standards, implementation means building software tools and operating services that follow the rules laid out by the relevant standards. 

For example, the CRAM and VCF file formats are widely-used bioinformatics standards stewarded by GA4GH that specify how to encode sequencing reads and variant calls, respectively. There is a software library called htsjdk that implements both of these standards in the Java programming language, meaning that it includes code that is capable of reading and writing files that are encoded according to those standards. Researchers can then use genome analysis tools like GATK and Picard, which include the htsjdk library, to read and write CRAM and VCF files as part of their analysis work. Tadaa. (For fans of C, substitute htslib and samtools in the library/tool mentions.)
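
For readers who work in Python rather than Java or C, the pysam library wraps htslib and exposes the same two standards. Below is a minimal sketch of reading a CRAM and a VCF with pysam; the file names, reference path, and genomic coordinates are placeholders, and the files are assumed to be indexed.

    import pysam

    # CRAM decoding needs the reference the file was compressed against.
    # "sample.cram" and "ref.fa" are placeholder paths.
    with pysam.AlignmentFile("sample.cram", "rc", reference_filename="ref.fa") as cram:
        for read in cram.fetch("chr20", 1_000_000, 1_000_100):
            print(read.query_name, read.reference_start)

    # Iterate over variant records in a block-compressed, indexed VCF.
    with pysam.VariantFile("sample.vcf.gz") as vcf:
        for record in vcf.fetch("chr20", 1_000_000, 1_010_000):
            print(record.chrom, record.pos, record.ref, record.alts)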

In addition to these now-classic (if imperfect) workhorse formats, GA4GH has been driving the development of other, newer knowledge representation standards that you may not yet be aware of, but will likely transform the way many of us work. Take the Variation Representation Specification (VRS, pronounced “verse”), which among other things makes it possible to capture the complex information that underlies variant interpretation in computable form, a key feature for solving variant interpretation bottlenecks. Right now, VRS is a fairly niche product, but within a few years I expect we’ll see it being used across a variety of research and clinical diagnostic platforms. 

 

Computable standards for alleviating variant interpretation bottlenecks. From “Genomic Knowledge Standards Advancements” by Larry Babb, presented at the 10th Plenary Meeting of the GA4GH (see Day 1 recording). 

 

Pro-tip: you can check out the Python implementation of VRS on GitHub, and there’s even a Terra workspace hosting Jupyter notebooks that demonstrate how to use it in practice.
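
To give a flavor of what that looks like in code, here is a minimal sketch assuming the vrs-python 1.x models API (the API has continued to evolve, so check the repository for current usage); the sequence digest and coordinates shown are purely illustrative.

    from ga4gh.core import ga4gh_identify
    from ga4gh.vrs import models

    # A VRS Allele is a sequence location plus the replacement state.
    # The sequence_id digest and coordinates below are illustrative only.
    allele = models.Allele(
        location=models.SequenceLocation(
            sequence_id="ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
            interval=models.SimpleInterval(start=44908821, end=44908822),
        ),
        state=models.SequenceState(sequence="T"),
    )

    # Compute the globally consistent, computable identifier for this variant.
    print(ga4gh_identify(allele))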

Speaking of platforms, another major axis of GA4GH standard development is platform-level interoperability, i.e. infrastructure standards that enable platforms like Terra to talk to each other. 

I’ve written before about the big picture of interoperability for open ecosystems, the example of the AnVIL project, and how Terra uses the DRS standard for data interoperability specifically. The ultimate goal here is to make it possible for researchers to do things like combining data from separate repositories into powerful federated analyses without having to move any of it around. 
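
To make the DRS piece a little more concrete, here is a rough sketch of what resolving a DRS object looks like at the HTTP level under the GA4GH DRS v1 API; the server hostname, object ID, and token are placeholders, and when you work in Terra this resolution is normally handled for you behind the scenes.

    import requests

    drs_server = "https://drs.example.org"          # placeholder DRS server
    object_id = "example-object-id"                 # placeholder object ID
    headers = {"Authorization": "Bearer <token>"}   # placeholder credential

    # 1. Fetch the object's metadata: size, checksums, and available access methods.
    obj = requests.get(f"{drs_server}/ga4gh/drs/v1/objects/{object_id}", headers=headers).json()
    print(obj["size"], [m["type"] for m in obj["access_methods"]])

    # 2. Exchange an access_id for a (typically signed) URL, then fetch the bytes.
    #    Note that some access methods return an access_url directly instead.
    access_id = obj["access_methods"][0]["access_id"]
    url_info = requests.get(
        f"{drs_server}/ga4gh/drs/v1/objects/{object_id}/access/{access_id}",
        headers=headers,
    ).json()
    data = requests.get(url_info["url"]).content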

Excitingly, that dream is starting to materialize! We are now at the stage where data federation is effectively possible — for a limited set of datasets and platforms, with some clunkiness involved. As the work continues, you can expect the scope of what’s possible to include more datasets, with a smoother experience as the handoff between platforms gets ironed out. 

There is a lot more to say about the scope and impact of GA4GH work, but I’ll have to leave that for another time. 

The bottom line is, if any of this is new to you, now is a great time to start getting caught up.

The organization held its annual plenary meeting over two days last week, and the full recordings for both Day 1 and Day 2 are available on YouTube, annotated with timestamps for specific sessions and presentations (see the expanded video descriptions). You can also find links to the slide decks in the agenda. 

As a new member of the organization (I joined the Large-Scale Genomics Work Stream in June), I found the lineup of talks struck a great balance between showcasing progress made so far, outlining upcoming challenges and discussing concrete solutions. I hope you will find these resources useful too — and consider joining the effort!

 


Additional resources 

For a more comprehensive tour of the Global Alliance’s scope, vision, and outputs, read the “Perspective” paper published late last year in Cell Genomics by Heidi Rehm and colleagues:

GA4GH: International policies and standards for data sharing across genomic research and healthcare (2021) Cell Genomics, Volume 1, Issue 2, https://doi.org/10.1016/j.xgen.2021.100029 

NVIDIA’s Clara Parabricks workflows in Terra bring GPU acceleration to genomic analysis
https://terra.bio/nvidias-clara-parabricks-workflows-in-terra-bring-gpu-acceleration-to-genomic-analysis/
Tue, 20 Sep 2022

NVIDIA's Clara Parabricks workflows are now available in Terra. These GPU-accelerated genomic analysis workflows deliver up to 24x faster execution and can cut the total cost of execution by up to 50% compared to equivalent CPU-based workflows.

The past few years have seen a massive surge in the development of advanced analytical methods for biomedical research, fueled in part by technological innovations that allow computational scientists to crunch data at ever-increasing speed and scale. A growing number of technology companies have joined the effort to help researchers tackle emerging challenges, ranging from large-scale genomics to multi-modal analysis of the myriad data types associated with medical records — including doctors’ notes, which are famously easy to read and interpret. 

Today NVIDIA, a pioneer in AI and accelerated computing, announced a new partnership with the Broad Institute that will pool the two organizations’ respective expertise in deep learning, accelerated compute, and biomedical research. This partnership builds on an existing collaboration between NVIDIA and the Broad’s GATK team, who have already been working together to improve some of the deep learning algorithms in GATK. (Keep an eye on the GATK blog for an upcoming release announcement.) 

The NVIDIA team released a Clara Parabricks workspace in Terra that makes their GPU-accelerated genomic analysis toolkit available on the cloud at the click of a button. As shown by the benchmarking results below, the Clara Parabricks workflows in Terra deliver up to 24x faster execution than equivalent CPU-based workflows, and can cut the total cost of execution by up to 50%.

 

What’s in the box? Drop-in replacements for popular GATK workflows

NVIDIA Clara Parabricks is a suite of GPU-accelerated industry-standard tools for the most common genomics analyses, including read alignment and both germline and somatic variant calling for whole genomes, exomes, and gene panels. 

To make these tools easy to run in Terra, the NVIDIA team produced six modular workflows written in the Workflow Description Language (WDL) that are designed as drop-in replacements for the corresponding GATK workflows, summarized in the figure below.   

 

The six Clara Parabricks workflows available as WDLs in Terra (leftmost boxes), with component modules listed to their right. In the case of the germline calling workflow, the two modules (HaplotypeCaller and DeepVariant) are alternative options that can be toggled with a configuration flag. 

 

Each workflow comes with a reference configuration that specifies the most appropriate GPU instances to run it on, along with the ability to select GATK Best Practices flags and options.

The NVIDIA team collaborated with the GATK team at the Broad Institute to evaluate the accuracy of the germline workflows. Through this rigorous process, they verified that the Clara Parabricks workflows produce results that are functionally equivalent to the CPU-native GATK versions, as originally defined here.

As a specific example, benchmarking on publicly available Genome in a Bottle (GIAB) samples with the fq2bam and germline caller workflows from the Clara Parabricks suite produced variant calling results that were >0.9999 equivalent in both precision and recall to those produced by the BWA, MarkDuplicates, BQSR, and HaplotypeCaller commands in the GATK’s Whole Genome Germline Single Sample variant calling workflow (available here in Terra).
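
For anyone less familiar with these metrics, precision and recall are computed from the counts of true positive, false positive, and false negative calls produced by a comparison tool run against a truth set such as GIAB. The numbers below are made up purely to illustrate the calculation.

    def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
        """Precision = TP / (TP + FP); recall (sensitivity) = TP / (TP + FN)."""
        return tp / (tp + fp), tp / (tp + fn)

    # Made-up counts from a hypothetical comparison against a truth set.
    p, r = precision_recall(tp=3_950_000, fp=300, fn=350)
    print(f"precision={p:.5f} recall={r:.5f}")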

 

Up to 24 times the speed and half the cost 

The team benchmarked the runtime of the Clara Parabricks workflows on Terra, and found that the GPU-accelerated workflows delivered speedups of up to 24x for germline genome analyses. 

When using Clara Parabricks in Terra, the total runtime on NVIDIA GPUs is reduced significantly for a 30x whole genome pipeline spanning BWA-MEM, MarkDuplicates, BQSR, and HaplotypeCaller. Total runtime from FASTQ to VCF, including variant calling with HaplotypeCaller, is just over 2 hours on NVIDIA T4 GPUs, compared to 24 hours in a CPU-based environment. Additionally, the cost of the analysis was reduced by 50% in the GPU environment compared to the CPU environment.

 

Time and cost comparisons of alignment and variant calling (FastQ to VCF) on CPU vs. GPU for a 30X whole genome in Terra. 

 

Runtime from FASTQ to BAM (with BWA-MEM) was reduced from 7 hours with CPU instances (N2) to a little over an hour with 4 NVIDIA T4 instances, and dropped even further to ~45 mins with 8 NVIDIA V100 GPU instances.

You can also see in the figure above (right side panel) that the overall cost of running the workflows on the T4 GPUs is less than half the cost of running the CPU-based equivalents, which is always a happy surprise. 

I’ve written before about the speed benefits of GPUs on this blog, though that was in the context of interactive analysis. One of my big takeaways was that you have to find the sweet spot between speed and cost that works for you, because oftentimes the fancy hardware that makes things go really fast is also the most expensive to rent. The good news is that if the speedup is big enough, you only have to use the special hardware for a very short amount of time, and that makes up for the higher rate. Or in the case of the T4 instances, more than makes up for it, since that configuration manages to be substantially more economical even as it delivers a heck of a speed boost. 

According to NVIDIA, the T4 GPU is designed to optimize cost and performance when running inference-heavy workloads like Clara Parabricks — which matches what we’re seeing in these benchmarking results. NVIDIA T4 GPUs are available for as little as $0.11 per hour on Terra (backed by Google Cloud), so by running the Clara Parabricks workflows there, you can run the entire alignment and variant calling pipeline for less than $2.50 a sample. That represents a major reduction from the $5 cost-per-sample of the GATK’s CPU-based workflows (running on N2 instances) while still reducing the overall data processing runtime from a whole day to less than three hours. Instances on Terra can also be configured with up to 8 V100 GPUs, which are more optimized for absolute performance than the T4. The same FASTQ-to-VCF pipeline on 8 V100 GPUs is up to 24x faster than the CPU pipeline, at roughly twice the cost.
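
As a rough back-of-the-envelope sketch of where those per-sample figures come from, the cost is essentially the hourly rate of the machine configuration multiplied by the wall-clock runtime. The GPU rate below is the $0.11/hour T4 figure quoted above; the host machine rates and runtimes are placeholders chosen only to show the shape of the calculation, not exact billing numbers.

    def per_sample_cost(gpu_rate: float, n_gpus: int, host_rate: float, hours: float) -> float:
        """Rough cloud cost for one sample: (GPU rate x GPU count + host machine rate) x runtime."""
        return (gpu_rate * n_gpus + host_rate) * hours

    # Illustrative only: host machine rates and runtimes are placeholder assumptions.
    gpu_run = per_sample_cost(gpu_rate=0.11, n_gpus=4, host_rate=0.60, hours=2.2)
    cpu_run = per_sample_cost(gpu_rate=0.0, n_gpus=0, host_rate=0.20, hours=24.0)
    print(f"GPU run ~${gpu_run:.2f} per sample, CPU run ~${cpu_run:.2f} per sample")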

I’m excited to see this technology being made available to a wide audience, in a form that doesn’t require taking specialized training or purchasing expensive hardware. It’s a big step forward toward ensuring researchers of any background are able to run sophisticated genomic analysis at scale. 

 

Try it out for yourself today

The Clara Parabricks Terra workspace created by the NVIDIA team is preloaded with example data, workflow configurations, and straightforward instructions, so you can try out the workflows without having to install or tweak anything. Simply clone the workspace and launch the preconfigured examples, or load your own data and get to work. 

If you’re new to running workflows on Terra, see the Workflows Quickstart Tutorial.

Don’t hesitate to reach out if you have any questions or if you run into any trouble running the workflows. For help with Terra-specific features (e.g. how to launch, monitor and troubleshoot WDL workflows in Terra), you can either post in our public discussion forum or contact the helpdesk team privately. For technical questions about NVIDIA Clara Parabricks, please visit the developer forum page here.

 


Acknowledgments

We are grateful to the NVIDIA team, specifically Eric Dawson and Vanessa Braunstein, for running the workflow benchmarks and helping us characterize the benefits of using Clara Parabricks on GPU instances in Terra.

 


Resources

Terra Workflows Quickstart Tutorial

Terra blog about using GPUs for interactive analysis, machine learning

Cromwell documentation about runtime parameters for using GPUs in workflows

Clara Parabricks Genomic Analysis webpage

Clara Parabricks Documentation Page

Clara Parabricks 4.0 Blog 

Clara Parabricks in Terra Workspace 

Clara Parabricks hands-on DLI workshop (free), September 21, as part of GTC

 

 

 

From liquid biopsies in Ghana to African cancer genomics in the cloud
https://terra.bio/from-liquid-biopsies-in-ghana-to-african-cancer-genomics-in-the-cloud/
Fri, 02 Sep 2022

Guest author Sam Ahuno describes his published work on African cancer genomics and shares his vision of a cloud-powered future for computational research in Africa.

Samuel Terkper Ahuno is a student at the Tri-Institutional PhD Program in Computational Biology and Medicine, NYC. In this guest blog post, he describes his published work on African cancer genomics, which evaluated the feasibility of using liquid biopsies to detect breast cancer in a Ghanaian clinic, and shares his vision of a cloud-powered future for computational research in Africa. 


 

Cancer in sub-Saharan Africa is becoming more common, with increasing mortality. Current efforts to mitigate this are focused on increasing public awareness, enabling earlier diagnosis, increasing access to treatment and care, and researching the lifestyle, environmental, and genetic risk factors that might be more prevalent for African women.

One major obstacle we are facing is that the current standard of testing for most cancers, including breast cancer, is the traditional biopsy: extracting a small piece of the tumor surgically using a needle. As a result, many cancers are diagnosed only after considerable growth has occurred. Therefore, technologies for earlier detection could make a big difference to patient outcomes. Additionally, less invasive procedures would be better accepted by the population, and could enable repeated sampling and improved treatment monitoring. 

 

Using liquid biopsies to detect breast cancer in Ghanaian patients

Liquid biopsy techniques address these challenges by using readily available biological fluids such as urine, blood, or saliva for diagnosis. These fluids contain circulating or “cell-free” DNA (cfDNA), some of which may come from tumor cells; that fraction is called circulating tumor DNA (ctDNA). A liquid biopsy consists of sampling the relevant fluid and testing for the presence of ctDNA or other such markers of cancer.

Our research group recently tested whether liquid biopsies could be used to detect breast cancer in Ghana, as part of the Ghana Breast Health Study. From each patient who was recruited into the study at one of three hospitals, a small amount of blood was collected; DNA was then extracted from the blood and sequenced. This enabled us to estimate how much of the cfDNA had been shed into the blood by the patient’s tumor, and what sort of DNA damage was associated with that tumor.

We found encouraging results, suggesting that liquid biopsies could be a viable way to detect cancer markers, such as copy number alteration (CNA) status for many selected breast cancer genes, in Ghanaian patients. Copy number alteration is a type of cancer-associated mutation in which one or more segments of DNA are either lost or duplicated. Yet adopting this diagnostic approach would require developing genomic and bioinformatics capacity within the country, while also strengthening basic health care services so that women can gain access to the treatment they need. Both are prerequisites for pursuing this research further, and ultimately for empowering clinics to offer these tests in a sustainable and cost-effective way.

 

The computational requirements of cancer genomics 

Going from raw cfDNA sequence data to biological insights about each patient’s tumor involves complex bioinformatics procedures that we can divide into two main stages of analysis, with very different computational requirements.

The first phase consists of pre-processing the data to ensure we have high quality information in a suitable format for identifying tumor DNA. In practice, this involves mapping each individual sequence read to a standard genome reference, and applying stringent quality control measures (see GATK Best Practices for more details). This is the most computationally intensive step of the analysis pipeline; for whole genomes with billions of reads, you can imagine how complicated it can get.
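
To make that first phase a bit more concrete, here is a stripped-down sketch of the kind of commands it wraps: alignment with BWA-MEM, coordinate sorting, and duplicate marking. In the study these steps ran as cloud-optimized workflows on Terra rather than as local shell calls, and all file paths below are placeholders.

    import subprocess

    sample = "patient001"                    # placeholder sample name
    ref = "Homo_sapiens_assembly38.fasta"    # placeholder reference genome path

    # Align paired-end reads to the reference and sort the output by coordinate.
    subprocess.run(
        f"bwa mem -t 8 {ref} {sample}_R1.fastq.gz {sample}_R2.fastq.gz "
        f"| samtools sort -o {sample}.sorted.bam -",
        shell=True, check=True,
    )

    # Mark duplicate reads, a standard quality-control step before downstream analysis.
    subprocess.run(
        ["gatk", "MarkDuplicates",
         "-I", f"{sample}.sorted.bam",
         "-O", f"{sample}.dedup.bam",
         "-M", f"{sample}.duplicate_metrics.txt"],
        check=True,
    )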

The second phase consists of estimating what fraction of the circulating DNA is likely to have originated from a tumor, and identifying CNAs (see ichorCNA documentation for more details). 

 

A hybrid approach to achieve scalability without changing everything  

Given the computationally intensive nature of the pre-processing phase, we performed that part of the work using cloud-optimized workflows that we ran on the Terra platform. This allowed us to scale execution very easily and not have to worry about managing high-performance computing resources directly. 

For the second phase of the work, which did not pose any scaling challenges, we chose to use our existing tools on the Mount Sinai Hospital servers. It was easy enough to download the pre-processed outputs from Terra onto our local filesystem. 
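
For anyone wondering what that download step looks like in practice: every Terra workspace has an associated Google Cloud Storage bucket, so outputs can be copied down with gsutil. The bucket path below is a placeholder for the bucket shown on the workspace dashboard.

    import subprocess

    # Placeholder: the workspace bucket path is listed on the Terra workspace dashboard.
    outputs = "gs://fc-00000000-0000-0000-0000-000000000000/preprocessing-outputs/*"

    # Copy the pre-processed outputs to the local filesystem (-m parallelizes the transfer).
    subprocess.run(["gsutil", "-m", "cp", "-r", outputs, "./preprocessed/"], check=True)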

This hybrid approach allowed us to take advantage of Terra’s scalable batch processing capabilities without having to change our familiar environment for the more exploratory part of the work. If we were to do this again with a larger dataset, downloading the pre-processed outputs would probably be less feasible, and it might be worth it for us to look into Terra’s interactive cloud environments for doing the rest of the work on the platform as well.

 

The bigger picture of cloud computing in Africa

The study I presented here was the result of an international collaboration between multiple research and clinical institutions in Ghana, as well as in Canada, the UK, and the United States. Strengthening global partnerships plays an important role and is part of the United Nations Sustainable Development Goals for economic development. Yet for approaches like liquid biopsies to become the standard of care in Ghana and many other African countries, we must ultimately develop the bioinformatics capacity to perform the relevant research and testing autonomously in-country.

One of the major challenges for bioinformatics and computational biology in many African countries is limited infrastructure, such as computing resources, even as computing keeps getting cheaper and more efficient.

Cloud-powered platforms like Terra could play a huge role in increasing access to computing resources to enable scalable genomics research in Africa, by Africans. 

In addition to providing access to powerful hardware resources, such platforms also make it possible to leverage publicly available workflows and pre-installed software tools and environments. This helps newcomers overcome initial learning curves and empowers seasoned researchers to leverage best in class tooling without having to spend time installing anything. Once familiar with the infrastructure, they can also develop their own workflows and tools to innovate in the pursuit of their preferred research question. 

Organizations such as H3Africa have over the years been building bioinformatics capacity in affiliated institutions in the region. Building on that work, the DSI-Africa consortium recently launched the eLwazi platform, an African-led open data science project powered by Terra. 

However, moving forward it will be great to have data centers within Africa that enable regional processing, storage, and control of genomic data, for privacy and ethical reasons.

There are still many practical, ethical, and technological challenges to implementing genomic technologies in Africa, yet it is encouraging to see such progress toward a future where African countries such as Ghana can access the resources they need to chart their own course.

 


 

Acknowledgements

I would like to thank Paz Polak, PhD; Jonine Figueroa, PhD; Geraldine Van der Auwera, PhD, and Kofi Johnson, PhD, for helpful comments.

 

AnVIL in the Classroom: Cloud-scale educational resources for modern genomics
https://terra.bio/anvil-in-the-classroom-cloud-scale-educational-resources-for-modern-genomics/
Fri, 29 Jul 2022

Discover how the move toward cloud-based research infrastructure empowers educators to deliver practical instruction in genomics.

Genomics has become enough of a mainstream discipline to be introduced in undergraduate classes, even high school. There are lots of courses and online resources offered to help educators teach genomics, and heaps of literature about teaching methodologies. 

In my recent webinar hosted by the American Society for Human Genetics, I discussed the exciting opportunities that the move to cloud-based research infrastructure offers for educators who are interested in delivering practical instruction in genomics through hands-on exercises. 

Big thanks to my colleagues Liz Kiernan and Anton Kovalsky, both Lead Science Educators in the Data Sciences Platform at the Broad Institute, for their invaluable contributions to the development and execution of this webinar. 

 

Bridging the gap between teaching and research

Computational infrastructure has always been a challenge when it comes to hands-on teaching in scientific computing. Even at teaching institutions that are well-equipped with sophisticated computer labs, there often remains a deep divide between teaching environments and research environments, which end up siloed apart from each other. So as an educator, every time you want to bring data and tools from the research environment to a teaching setting, you have to cross that gap, which takes effort. You can easily end up with teaching examples that are out of date, or oversimplified, so learners only encounter toy versions of what researchers really do. 

 

A canyon separating a classroom setting (left) from a lab setting (right), illustrating the great divide between teaching and research environments.

 

It’s critical that we work toward bridging that gap, to reduce the distance between teaching examples and real research as much as possible. It is both more exciting for learners to be developing their skills through more realistic examples (closer to “real” scientific investigation), and more productive in terms of achieving educational goals. I would also expect that when they are ready to take the next step in their educational journey, learners are more likely to transition smoothly to more complex projects if they can build on prior experience rather than having to be re-trained to use different tooling.

The solution that is standing right in front of us is to take advantage of what’s happening on the side of research infrastructure, where for the past few years we’ve seen a big shift toward using cloud infrastructure.

 


 

Traditionally, everyone would use their own siloed computing infrastructure, and if multiple groups were working with the same dataset, they would each store a local copy to work on. So we’d end up with similar gaps between researchers as we were just talking about between researchers and educators. With the shift to the cloud, the idea is that everyone can access a single, centrally stored copy of the data, and run whatever computation they need on hardware that’s right there, colocated with the data.

 

Cloud computing infrastructure acting as a bridge between researchers.

 

This is not a new idea as such, but the reason it’s been gaining so much traction recently is the scale of the datasets being generated in genomics and related disciplines: it’s just not feasible — or fair — to expect everyone to download a copy of the data and work on it locally. Hence the big commitments we’re seeing from various federal agencies to support infrastructure projects like the NHGRI AnVIL, which aims to enable genomics researchers to make effective use of the cloud.

 

Putting the cloud to work for educators

The added benefit of the cloud model is that it also means everyone has access to the same type of machines, and you can package analysis tools in a way that anyone in the world can go and use in the same way without having to put a ton of effort into figuring out how to configure their local computing environment.

And that’s where we get to the key point I made in the webinar: this new cloud model is great for research, but it’s also a big opportunity for educators to be empowered to move in and use the same environments, datasets and tools that researchers use.

In support of this point, I gave a live demonstration of how an instructor could use AnVIL resources, specifically Jupyter Notebooks in Terra workspaces, to develop and administer practical instruction in genomics to a class of students. To see for yourself how this works, you can view the full recording of the webinar, which is available on demand from the ASHG learning portal.

The portal requires an account login, but the account registration is free and does not require a paid membership with ASHG. 

 

We would love to hear from any educators who might be interested in trying out this model in their own teaching practice, so feel free to reach out to me specifically (geraldine@broadinstitute.org), or to the Terra support team through the community forum or helpdesk. We can help you identify resources and mechanisms to fit your needs and audience level. 

 


 

Resources

 

Presentation materials

 

Referenced in the live demo:

Configuration to launch custom notebook environment supporting embedded IGV

  • Custom environment container

    gcr.io/broad-dsde-outreach/terra-base:ipyigv1
  • Startup script

    gs://genomics-in-the-cloud/v1/scripts/install_GATK_4130_with_igv.sh

Workspaces and notebooks

 

Referenced in the slide deck: 

 

Further reading

Using the cloud to support alignment between exploratory research and the rights of clinical study participants
https://terra.bio/exploratory-research-and-the-rights-of-clinical-study-participants/
Thu, 07 Jul 2022

Learn how Terra is supporting a novel approach to managing study participants' data access and control in industry-sponsored clinical studies.

Laury Mignon is Executive Director of Translational Medicine at Ionis Pharmaceuticals. She is responsible for improving the probability of (technical) success of Ionis’ preclinical assets from “bench to bedside”. In this guest blog post, Laury presents a novel approach to study participants’ data access and control in industry-sponsored clinical studies.


 

Whole-genome sequencing (WGS) is increasingly used in human research, including clinical trials, and the resulting data hold a lot of potential value for research beyond the immediate use for which they are collected. Allowing pharmaceutical companies to use WGS data from clinical trials for exploratory research could unlock substantial benefits for patients as well as for study participants, who could receive actionable incidental findings. However, this raises important questions around participant consent, disclosure of incidental findings, and the rights of participants to withdraw their data from further study.

To address these questions, we developed the Exploratory Genetic Research Project (EGRP), a novel framework that aims to provide an umbrella protocol to collect genetic material in all of a sponsor’s clinical studies, giving consenting individuals the right to access and control access to their genetic data while enabling unspecified or exploratory future research. We recently published a manuscript describing the full EGRP protocol, as well as the detailed reasoning behind key design decisions we made, and we are currently working with Color Genomics and the Broad Institute to test our very first implementation of this protocol. 

 

A novel ‘social contract’ – An attempt to harmonize a sponsor’s exploratory research with a clinical study participant’s data rights

By Laurence Mignon, Kim Doan, Michael Murphy, Lauren Elder, Chris Yun, Jeff Milton, Shruti Sasaki, Christopher E. Hart, Dante Montenegro, Nickolas Allen, Dany Matar, Danielle Ciofani, Frank Rigo, and Leonardo Sahelijo (2022)
In Contemporary Clinical Trials, Volume 119, https://doi.org/10.1016/j.cct.2022.106819 

 

I hope you will read the paper to get the full picture of this innovative framework. Here, I wanted to highlight our decision to use a secure cloud platform — specifically Terra — to store the WGS data and make it available for analysis in a way that protects the privacy and autonomy of study participants. 

 

Terra as the cloud-based Sandbox

Our intention was to prioritize participants’ control over the use of their genetic data, while enabling proprietary future research on the clinical and genetic data. For that reason, the EGRP process stipulates that the pharmaceutical company who sponsors the study is not permitted to download participants’ individual WGS data onto their own servers. Instead, the WGS data must remain in the custody of an independent partner, the data host, who performs the genome sequencing and makes the data available for analysis through a secure cloud platform. 

Additionally, samples are de-identified prior to sequencing through the use of barcodes managed by a third partner, the “honest broker”, who is responsible for interfacing with participants and managing consents, barcodes and return of results.

 

Flow of information between the EGRP partners, with Ionis as the sponsor (pharmaceutical company running clinical trials), Color Genomics as the honest broker (interfacing with participants, managing consents, barcodes and return of results) and the Broad Institute as the data host (performing WGS and providing secure access to data in Terra). Modified from Mignon et al., 2022.

 

This setup makes it possible to ensure that data can be promptly removed from the system if a participant decides to withdraw their data. All the participant needs to do is notify the honest broker (Color Genomics), who will then issue a withdrawal order for the corresponding barcode to the data host (Broad Institute). By excluding the sponsor from the removal process, we effectively remove potential conflicts of interest and increase the amount of control that study participants can wield over their genetic data.

In addition, using Terra offers the opportunity to use a large set of cloud computing tools that are readily available in the platform and in many cases, have been optimized for genomic analysis at scale. This includes algorithms and pipelines created by other Terra users, data engineers supporting large projects such as the All of Us Research Program and the Human Cell Atlas, and the wider bioinformatics community. We view this as a promising path toward a faster and more standardized way of performing genetic analyses, as well as a fairer method of developing and sharing computational tools across private and public industries.

 

As we move forward with our first real-world test of the EGRP protocol, we are excited to have defined a process that serves the interests of individual study participants and pharmaceutical sponsors, and we are hopeful that it will provide a blueprint for future work by other clinical trial sponsors as well.

 

 

 

The Path of Genomes: Expediting the way to actionable public health data
https://terra.bio/the-path-of-genomes-expediting-the-way-to-actionable-public-health-data/
Fri, 03 Jun 2022

Bioinformatics scientist Frank Ambrosio recounts how Terra has come to serve as a shared platform for public health labs, and to foster cross-cutting collaboration among members of the public health community.

Frank Ambrosio is a bioinformatics scientist at Theiagen Genomics, a company whose mission is to transform public health and infectious disease surveillance through the innovative implementation of NGS and bioinformatics technologies. In this guest blog post, Frank recounts how Terra has come to serve as a shared platform for public health labs, and to foster cross-cutting collaboration among members of the public health community.


 

Every public health scientist remembers the sequence of events and conversations that occurred leading up to their realization that we were facing a legitimate pandemic threat to our species in the form of a novel viral pathogen. We all remember the first “two weeks to flatten the curve”, and then the realization that this would not be nearly enough to stop the disease from reaching pandemic proportions. Filled with trepidation, we all watched the spread of the disease even as we helped produce the data and interpretations that provided situational awareness to our public health and government leaders. 

Eventually we began to settle into the “new normal” of social distancing and quarantines, but with the fabric of society starting to fray from the effects of prolonged fear of exposure and social isolation, the world looked to the scientific community for answers. Our public health community was thus thrust into the limelight while facing its greatest challenge since 1918: COVID-19 would quickly become the worst pandemic to plague humanity since it became possible to sequence the genomic material of pathogens.  

The way forward was clear: we had to accelerate sequencing efforts to monitor viral variants, but the magnitude of the undertaking was profound. Simply generating all that sequencing data would be a monumental task involving purchasing new equipment, training personnel, and validating new assays, all while adapting our extant pathogen surveillance systems to the new disease.

None of this was easy, and it could not happen overnight, but our community rose to the occasion. Sequencing efforts expanded at unprecedented rates, contact tracing teams worked around the clock, and –crucially– we developed a new collaborative model for developing and sharing public health bioinformatics resources.

 

Opportunity and challenges of the genomic era

In the context of a modern globalized society, infectious disease pandemics, by their very nature, require a collaborative effort from the global public health community to mitigate the breadth and severity of their impact.

For the first time in history, we had the tools to sequence and analyze pathogen genomes from the onset of the pandemic, giving us the opportunity to track the evolution of a virus as it propagates throughout the world. Yet we lacked standardized approaches for processing the data, extracting actionable insights and sharing them across public health jurisdictions, both nationally and internationally. Individual laboratories were initially using their own custom pipelines to assemble SARS-CoV-2 genomes. They experienced challenges even discussing their outputs with other labs, let alone performing the aggregation required for longitudinal analyses. Additionally, labs that did not have a bioinformatics expert to develop and run these pipelines were unable to analyze the sequence data they worked so hard to generate.

It quickly became clear that the public health community would benefit from a focused effort to enhance our ability to collaborate on the development and distribution of analytical pipelines. Ideally, we would be able to distribute these pipelines through a medium that made them accessible to scientists of diverse technical backgrounds, and provide a forum for discussing the nuances of these pipelines and providing feedback to the developers.

For much of the public health community in the USA, that medium was the Terra platform. 

 

Terra as a public health conduit

When our team at Theiagen Genomics was given the opportunity to help public health laboratories across the country tackle their newfound wealth of viral genomics data, we started using Terra as the common platform that all our partner labs could use to host data, perform genomic analyses and share results. 

We worked with our public health partners to develop standardized workflows that automate all the data processing and analysis steps involved, such as assembling genomes from the raw sequencing data, building phylogenetic trees, and submitting the sequences to the NCBI’s public data repository.

 


Screenshot from the Genomic Analysis of SC2 introductory video by Theiagen Genomics

 

We leveraged the use of a shared analysis platform to help bring together different public health partners —local, state, and federal— and create a community of practice for distributable workflows and harmonized results. On a regular basis, this community of practice compares outputs, discusses interpretation of results, and contributes to the codebase. These discussions are facilitated by the Terra Training and Office Hours sessions hosted by Theiagen, which provide a unique opportunity for public health scientists to discuss the analytical workflows they use on a daily basis with the developers and fellow end-users. 

Terra.bio is the nexus of all SARS-CoV-2 sequencing data for this community, and these office hours sessions serve as the nexus of ideas. In these sessions, updates to the platform and workflows are announced, there are demonstrations on how to overcome common issues, and public health scientists from around the country discuss the latest situational developments. The sessions are recorded, but off-record time is saved at the end of the call to allow laboratories to freely discuss laboratory-specific issues like the detection of a new variant, a particular outbreak investigation, or developing a new approach. 

As we’ve grown from working with just two public health laboratories to over 40 (in part through a partnership with the Broad Institute and the CDC), this community engagement model has scaled very well, with office hours meetings now approaching 100 participants. And as participation in these events continues to rise, so too does the value of the discussions. 

 

Success through practitioner-led research and development

From a technical standpoint, this practice of involving the public health community closely in the development of our analytical workflows has allowed us to avoid the pitfalls that are typically associated with siloed development, and to identify and resolve interoperability issues before they have the opportunity to cause fissures in the community. Through consistent communication with participating labs, we were able to develop analytical procedures, sets of parameter values and reporting standards that are both scientifically sound and well understood among the community members. 

As a result, we’ve seen wide adoption of the workflows by public health labs. That in turn has led to a considerable rise in the number of researchers who are able to analyze their own sequencing data, and perhaps most importantly, the amount of data made available in a usable, understandable form to public health policy makers has skyrocketed.

 

A model for public health bioinformatics beyond the current crisis

Our story is ultimately one of successfully combining a cutting-edge technological solution with an old-school community building approach. Being able to bring molecular epidemiology experts together into a shared data platform and providing a venue for collaborative engagement with our bioinformatics scientists has changed the game and led to substantive benefits for public health at large.

The value of the community of practice that has emerged from this initiative stretches far beyond simply developing better workflows and educating end-users. By engaging with this passionate community of scientists who deeply understand the complexities of pandemic genomics, we can ensure that the information reported to policymakers is clear and, most importantly, actionable. 

Moreover, the impacts that this community has on the tools and procedures developed to combat SARS-CoV-2 will reverberate through public health for years to come. Even as we are dealing with renewed levels of COVID-19, new challenges are rising on the horizon. We are hopeful that our experience will contribute to establishing a modern model of outbreak genomics, and drive the field of global public health in the direction of openness, collaboration, and community.

 


 

Resources

 

 

 

Paper Spotlight: A complete reference genome improves analysis of human genetic variation
https://terra.bio/paper-spotlight-a-complete-reference-genome-improves-analysis-of-human-genetic-variation/
Thu, 07 Apr 2022

This spotlight on the paper describing the Telomere-to-Telomere genome reference highlights how the variant calling part of the work was implemented in Terra.

This blog is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied. 


 

A complete reference genome improves analysis of human genetic variation

By Sergey Aganezov, Stephanie M. Yan, Daniela C. Soto, Melanie Kirsche, Samantha Zarate, Pavel Avdeyev, Dylan J. Taylor, Kishwar Shafin, Alaina Shumate, Michael C. Schatz et al., 2022

Science, Vol 376, Issue 6588 https://doi.org/10.1126/science.abl3533 

Abstract: Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.

 


 

What part of the work was done in Terra?

Excerpts from the paper’s Methods section:

 

Short-read variant calling 

To evaluate short-read small-variant calling between GRCh38 and T2T-CHM13, we used the NHGRI AnVIL (44) to align all 3,202 1KGP samples to CHM13 with BWA-MEM (45) and performed variant calling with GATK HaplotypeCaller (77) using a workflow modeled on the one developed by the New York Genome Center (NYGC) for 1KGP analysis performed on GRCh38 (28). As in the NYGC analysis, we recalibrated the variant calls with GATK VariantRecalibrator. We analyzed coverage statistics using samtools and AF using bedtools. To identify Mendelian-discordant variants, we used GATK VariantEval.

 

Note: NHGRI AnVIL is a project of the US National Human Genome Research Institute that brings together Terra and several complementary platforms into a powerful genomics analysis ecosystem. The AnVIL portal powered by Terra provides full access to Terra’s data and analysis capabilities.

 

How did they do it?

The authors developed WDL workflows for calling variants in the short-read sequencing data based on a previous analysis by the New York Genome Center. They ran the workflows at scale on all 3,202 whole genomes in the 1000 Genomes Project cohort using Terra’s workflow execution service.
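
As a simplified illustration of the core step involved (not the authors' full NYGC-derived workflow), calling small variants on one sample aligned to the T2T-CHM13 reference boils down to something like the following; the file paths are placeholders.

    import subprocess

    ref = "chm13v2.0.fa"        # placeholder path to the T2T-CHM13 reference
    bam = "HG00096.chm13.bam"   # placeholder path to a sample aligned with BWA-MEM

    # Call small variants with GATK HaplotypeCaller against the CHM13 reference,
    # emitting a per-sample GVCF as is typical for joint-calling pipelines.
    subprocess.run(
        ["gatk", "HaplotypeCaller",
         "-R", ref,
         "-I", bam,
         "-O", "HG00096.chm13.g.vcf.gz",
         "-ERC", "GVCF"],
        check=True,
    )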

You can learn more about the scaling challenges they faced and how they overcame them by using Terra in this blog post, written by Samantha Zarate of the Schatz Lab.  

To try your hand at running a workflow in Terra, check out this Quickstart Tutorial Workspace.

 


Appendix: Data and code availability

 

 

 

 

Ten simple rules — #2 Document everything
https://terra.bio/ten-simple-rules-2-document-everything/
Thu, 10 Mar 2022

A review of Terra features supporting documentation for transfer of knowledge and reproducibility, inspired by “Ten simple rules for large-scale data processing” (Fungtammasan et al., 2022).

This blog post is part of a series based on the paper “Ten simple rules for large-scale data processing” by Arkarachai Fungtammasan et al. (PLOS Computational Biology, 2022). Each installment reviews one of the rules proposed by the authors and illustrates how it can be applied when working in Terra. In this installment, we cover features that Terra users can take advantage of to communicate key project information to collaborators, keep records of workflow executions, and document analyses done in interactive environments for reproducibility. 


 

Hot on the heels of “Don’t reinvent the wheel“, we tackle another deceptively simple rule: “Document everything”, an exhortation that may seem self-evident but can be quite challenging to apply consistently in practice.

“If it’s not written down, it didn’t happen.”

In their paper, Arkarachai Fungtammasan and colleagues motivate this rule primarily by calling out the necessity of ensuring effective transfer of knowledge within teams, particularly large collaborative teams that experience staff turnover. They also wisely point out the utility of progressively recording information that will later need to be collated for documentation, which I interpret as documentation intended for an external audience, e.g. for a research publication. 

“[…] As members join and leave the team working on a large-scale data processing project, remembering why each decision was made can become difficult. A simple log with what the decision is, what the rationale for it is, who contributed to making it, and who agreed or approved with it can be enough. […] This information can also be helpful to have consolidated when creating documentation explaining the pipelines used [11].”

This is a compelling argument, deeply relatable. Even folks who are working mostly solo rather than as part of a large team should recognize the value of setting up their future self for success when the time comes to write the Materials and Methods section of their manuscript. To (badly) paraphrase the great RuPaul, if you don’t document the work for yourself today, how the heck are you going to document it for somebody else in six months or more?

When it comes to solutions, the authors’ recommendation centers on the use of project management tools.

“There are multiple approaches to logging decisions, but a powerful and common approach is to repurpose project management systems such as GitHub Issues. Doing so effectively ties decisions to code or documentation changes, allows for historical record including discussion, can handle decision dependencies, and can incorporate formal review process automation.”

We can certainly agree it’s a great idea to use a formal project management system to track work in general, and I for one wish that had been covered in my graduate school education. The specific suggestion of GitHub Issues will work particularly well for people whose work has a strong code development component, since they’re likely to be using GitHub already. I’ll note that there is also an add-on for GitHub Issues called ZenHub that provides additional project management functionality and is free to use with public repositories. And of course, there are plenty of other options with different feature sets for teams that have different needs and preferences.

Yet while this recommendation does a great job of addressing the need to capture information about decision-making and code development processes for posterity, it doesn’t really touch the question of how to document analysis work at a granular level (e.g., which pipeline was run on which data, what the exact command line was, what the outputs were, and so on) without requiring an inhuman amount of manual input.

This is admittedly a difficult question to address in a generic way, because the answer depends so much on the specific platform or environment you’ll be using to do the work.

Fortunately in this post I have the luxury of focusing on how you can apply the “Document everything” rule specifically within the Terra ecosystem. So let’s review a few key features of Terra that can help you apply this rule in three main areas: keeping records of workflow executions, documenting the twists and turns of your interactive analyses, and communicating the purpose and contents of a project workspace. 

FYI, the closely related topics of version control and execution monitoring are the subject of separate rules, which we’ll get to in a few weeks.

 

Keep detailed records of workflow executions

Running workflows at scale can be very challenging due to a range of factors including the amount of data involved, the complexity of the workflows, and the importance of processing all samples in a dataset in the same way. So it’s absolutely essential that whatever system you use enables you to find out exactly what was done in any given workflow run.

I’m happy to say that is something the Terra workflow management system does particularly well. Whenever you launch a workflow (or set of workflows), the system records all relevant metadata automatically, including all command-line parameter values as well as direct links to the workflow code, input files and output files. The system also retains (and links to) copies of all execution logs, which contain information such as the exact command line that was run at each step, the logging information the tool itself produced (stdout and stderr), and additional metadata that we’ll discuss more when we get to later rules such as #10: Monitor Execution.

This logging system (and its user-friendly web interface) makes it possible to quickly find all the information and inputs you would need to reproduce a workflow-based analysis with perfect fidelity.
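If you ever need to pull those records out programmatically, for example to archive the provenance of a set of runs alongside a manuscript, the same submission metadata is also accessible through the Terra/FireCloud API. Here is a minimal sketch using the FISS Python package (firecloud); the billing project and workspace names are placeholders, and the exact field names in the response payload may vary, so treat this as a starting point rather than a definitive recipe.

```python
# Minimal sketch: list workflow submissions in a workspace and inspect one of them.
# Requires the FISS package (pip install firecloud) and access to the workspace.
from firecloud import api as fapi

BILLING_PROJECT = "my-billing-project"   # placeholder -- use your own billing project
WORKSPACE = "my-analysis-workspace"      # placeholder -- use your own workspace name

# List every submission launched in the workspace
resp = fapi.list_submissions(BILLING_PROJECT, WORKSPACE)
resp.raise_for_status()
submissions = resp.json()

for sub in submissions:
    # Field names follow the FireCloud submissions schema; verify against your own payload
    print(sub.get("submissionId"), sub.get("submissionDate"), sub.get("status"))

# Drill into a single submission to see its workflows and their statuses
if submissions:
    sub_id = submissions[0]["submissionId"]
    detail = fapi.get_submission(BILLING_PROJECT, WORKSPACE, sub_id).json()
    for wf in detail.get("workflows", []):
        print(wf.get("workflowId"), wf.get("status"))
```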

 

Screenshots of a workspace’s job history view: list of submissions (top) and detailed view of one submission (bottom) (browse this workspace’s job history here)

 

In addition to the automated logging, you also have the option of adding comments to your workflow submission, either at the time you launch it, or after the fact (including during execution). For example, in the list of submissions shown above, the second row in the table (with “Aborted” status) includes a comment that was added after the workflow run was aborted. This can be very useful for keeping track of decisions or troubleshooting notes, especially in projects with a heavy development component, where multiple attempts may need to be made on the path to success. 

It’s worth noting that this commenting feature was added in response to community requests and has proved hugely popular with researchers who need to manage a lot of workflow submissions. Community feedback works!

 

Use Jupyter Notebooks to document interactive analyses

In my experience, the part of people’s projects that tends to be the least adequately documented is the phase of iterative data exploration, analysis and visualization that is generally typical of tertiary data analysis, which we lump under the term “interactive analysis” in contrast to automated workflows. This phase typically involves applying a variety of commands, sometimes scripted, sometimes not, within an interactive environment such as a terminal shell or an application like RStudio. 

By its very nature this can be a messy, non-linear process, and unfortunately it often ends up summarized in Methods sections as “We applied methods X and Y using base R and this list of packages which are available in CRAN.” It goes without saying that this is not sufficient to enable a reader, a collaborator, or even your future self to reproduce the work.

One increasingly popular way to address this problem, which is fully supported in Terra, is to perform most if not all of the work within a Jupyter Notebook. The Notebook environment allows you to progressively document every step and every attempted command, alternating documentation cells and code execution cells, with full logging of command outputs. This provides a much richer documentation record than “just” including code comments in a script, for example. And keep in mind that you can run almost any command-line analysis tool from within a notebook; you’re not limited to running Python and R code. 
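To make this concrete, here is what a documented step might look like inside a single notebook code cell, using Jupyter’s shell escape to call a command-line tool so its output is captured in the notebook itself. The file path and tool choice are purely illustrative, and the example assumes the tools are present in your cloud environment image.

```python
# Illustrative notebook cell: a markdown cell above this one would explain *why*
# this step is being run; the code cell below records exactly *what* was run.
input_bam = "gs://my-bucket/sample1.bam"   # hypothetical path to an input file

# Copy the file into the cloud environment (gsutil is typically available
# in Terra's standard Google-backed environments; adjust for your own image)
!gsutil cp {input_bam} .

# Run a command-line tool; its stdout/stderr are preserved in the notebook output
!samtools flagstat sample1.bam
```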

 

Screenshot of a Jupyter Notebook in Terra (preview mode) showing invocation and partial log output of a GATK command (see the full tutorial notebook here)

 

That being said, the resulting “complete record” can be a little overwhelming, so I personally like to maintain a parallel notebook in which I only include “the bits that worked”. This allows me to progressively build (and regularly re-check) the minimal end-to-end path necessary to reproduce the work. The result is a more easily readable documentation record that is pretty much ready to publish.

This approach may also be a good fit for people who find the experience of developing an analysis in Jupyter Notebook to be too constraining, and who prefer to work in an environment like RStudio (which is also available in Terra). As you progress through your analysis, record chunks of the work in a notebook in parallel, alternating descriptions of your decision-making process and the actual code executions applied to the data. If you’re used to saving analysis code in R scripts, you can simply invoke the scripts from your notebook, and combine the advantages of both sides — the flexibility of RStudio as a development environment and the documentation power of Jupyter Notebooks. 
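For instance, if your analysis lives in R scripts that you develop and refine in RStudio, a short cell like the sketch below can re-run each script from your documentation notebook and capture its console output as part of the record. The script name and arguments are hypothetical.

```python
# Re-run an R script from the documentation notebook so that the invocation
# and its console output are preserved alongside your written notes.
import subprocess

result = subprocess.run(
    ["Rscript", "scripts/differential_expression.R", "--input", "counts.csv"],
    capture_output=True,
    text=True,
    check=True,   # raise an error if the script fails, so problems are visible in the record
)
print(result.stdout)
```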

Does that mean you’ll be running a lot of computations twice or more? Why yes, it does indeed, and that’s a good thing: it’s a built-in way to verify the reproducibility of your work as you go. 

 

Communicate the purpose and contents of your workspace

Giving teammates access to your work in Terra is straightforward; simply share your project workspace with them through the workspace sharing menu. However, there can be a lot of assets in your workspace (data, code, tools) and it’s not necessarily trivial for someone coming in to understand how it all ties together, especially if they are new to Terra themselves. 

We encourage you to take advantage of the editable “dashboard” of your workspace to provide collaborators with an overview of the project that the workspace is meant to tackle, and to summarize key information: the main assets used in the workspace (e.g. data, tools, code), instructions for running the analyses, and meta-level information like authorship and any applicable licensing conditions. The Terra User Education team provides best-practices recommendations for how to structure dashboard documentation, based on their extensive experience developing public workspaces for educational purposes.

 

Screenshot showing part of a workspace dashboard (see full dashboard here)

 

One limitation here is that the workspace dashboard documentation does require manual input (the information is not collated automatically) and it is not version-controlled. Its main advantage is that it attaches summary information as an integral part of the workspace itself, rather than relying on separate, external documents.

Some enhancements to this functionality have been discussed, like the possibility of adding a comment log so that multiple people collaborating within a workspace could post timestamped notes, flag issues and ask questions within the context of the workspace itself (rather than having to switch to an outside application). 

As I hinted at earlier, we welcome feature requests, so feel free to upvote this idea or suggest your own in the Feature Requests section of the Terra community forum!

 

 

The post Ten simple rules — #2 Document everything appeared first on Terra.

Ten simple rules — #1 Don’t reinvent the wheel
https://terra.bio/ten-simple-rules-1-dont-reinvent-the-wheel/
Thu, 03 Mar 2022 19:48:27 +0000

A review of data, code and tools available for reuse in Terra, inspired by “Ten simple rules for large-scale data processing” (Fungtammasan 2022)

This blog post is part of a series based on the paper “Ten simple rules for large-scale data processing” by Arkarachai Fungtammasan et al. (PLOS Computational Biology, 2022). Each installment reviews one of the rules proposed by the authors and illustrates how it can be applied when working in Terra. In this first installment, we cover data and tooling resources that Terra users can take advantage of to avoid doing unnecessary work. 


 

We kick off this “Ten simple rules” series with “Don’t reinvent the wheel”, a classic maxim that is ubiquitous in programming advice forums yet tragically underappreciated in the world of research computing. Certainly a fitting start to any list of guiding principles for tackling computational science at scale. 

In their paper, Arkarachai Fungtammasan and colleagues address this rule mainly from the point of view of data resources, emphasizing that, before you set out to process a large body of data, you should check whether the work might have been done for you already:

[…] In short, undertaking large-scale data processing will require substantial planning time, implementation time, and resources.

There are many data resources providing preprocessed data that may meet all or nearly all of one’s needs. For example, Recount3 [4,5], ARCHS4 [6], and refine.bio [7] provide processed transcriptomic data in various forms and processed with various tool kits. CBioPortal [1,8] provides mutation calls for many cancer studies. Cistrome provides both data and tool kit for transcription factor binding and chromatin profiling [9,10]. A research project can be substantially accelerated by starting with an existing data resource.

This focus on data surprised me a little, because in my experience, the “Don’t reinvent the wheel” rule is more commonly invoked to advocate for using existing bioinformatics tools and workflows rather than writing new ones. However, the authors are not wrong to call out the usefulness of looking for already-processed data, particularly in an age when large data generation initiatives are being developed specifically for the purpose of making data available for mining by the wider research community.

In the Terra ecosystem, there are multiple research consortia making data resources available in a form that has already been processed through standardized workflows and can be readily imported into Terra, so that researchers can focus their resources on downstream analysis. For example, the Human Cell Atlas provides a multitude of analysis-ready ‘omics data resources that can be imported into a Terra workspace via the HCA Data Portal, as does the BRAIN Initiative Cell Census Network (BICCN), which offers human, non-human primate and mouse ‘omics data through its Terra-connected Neuroscience Multi-Omics (NeMO) portal.

You can check out the Terra Dataset Library to browse the various public and access-controlled datasets (spanning multiple data types and research focus areas) that are available in repositories connected to Terra.

And now, to extend the scope of discussion a little compared to the paper…

 

Try to reuse existing code, tools, containers, and other assets

Unless what you’re doing is unusually cutting-edge, chances are someone has already tackled a similar problem, and you may be able to reuse some of their tooling. Not to get into the debate of when it’s appropriate to write a new genome aligner from scratch — but I think we can all agree that there are some well-established data processing operations like running a variant calling pipeline on human WGS data, or generating count matrices from single-cell RNAseq data, where you can often benefit from reusing existing tools and workflows rather than rolling your own. In some cases you may need to make some modifications to adapt them to your specific use case, but that’s still a lot less work than starting from nothing.

So where do you find existing tooling?

In the context of Terra, here’s a shortlist of the best places you can look for ready-to-use tools:

1. The Terra showcase features a growing collection of public workspaces that offer fully-configured workflows, Jupyter notebooks, example data and more for a wide range of use cases. Some of these workspaces are created by tool developers to serve as a demonstration of how to run their tools. Others are created by researchers, often as companions to published papers, to recapitulate an end-to-end analysis in a fully reproducible way. The great thing they all have in common is that they combine data, tools and configuration settings that have been shown to work, so you can see in practice how the different pieces are supposed to connect. You may not find a workspace that’s an exact match for your needs, but you may find one that is close enough to use as a starting point, which can dramatically shorten the amount of setup time you need to get your analysis going.

2. For interactive analysis, Terra’s Cloud Environments system provides a menu of pre-built environment images for running applications like Jupyter Notebook and RStudio that come with sets of popular packages pre-installed to get you up and running as quickly as possible. For example, the Bioconductor environment developed as part of the AnVIL project includes the Bioconductor Core packages.

3. Terra also offers a Galaxy environment that includes the full Galaxy Tool Shed.

4. The Terra Notebooks Playground is a great resource for finding code examples of how to perform a variety of operations in Terra notebooks. In addition, many researchers now share Jupyter Notebook files demonstrating how to run computational analyses that they have published; many of these can be run in Terra’s Cloud Environments with only minimal adaptations.

5. For running automated pipelines at scale, the Dockstore workflow repository offers a large collection of workflows contributed by research groups around the world, with a particular emphasis on large-scale analyses and optimizations for cloud platforms. Dockstore connects directly to Terra, so once you’ve found a workflow you’re interested in, you can import the workflow script and an example configuration file with a few clicks. Most WDL workflows that you find in Dockstore can be run in Terra without any modifications. If you do need to modify the workflow code to suit your use case, either fork the original code on GitHub and register your version in Dockstore, or bring it into the Broad Methods Repository if you want basic version control and editing capabilities without having to deal with git. There are also other sources of WDLs out there that are not registered in Dockstore, like the BioWDL project; the OpenWDL community is a good starting point to track those down.

6. Tool container repositories like Docker Hub and Quay.io can be really handy if you’re writing your own workflows. Running workflows in the cloud requires the use of “containers”, which are a way to package command-line tools into a self-contained environment that can be run on a virtual machine. One of the things we hear researchers worry about when they start moving to the cloud is that they’re not comfortable with creating their own Docker containers. The good news is that creating your own containers is actually not as difficult as it’s sometimes made out to be (if you have the right tutorial), BUT we can all agree it’s even easier if you don’t have to do it at all. Fortunately, many tool developers now provide pre-built containers through container repositories such as those listed above, and for the rest, there are community-driven projects like BioContainers that make containers available for a wide range of popular bioinformatics tools. So once again, chances are you can find what you need off the shelf and not have to build it yourself.
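To give a sense of how little work reusing a published container involves, here is a minimal sketch of pulling a community-built image and running a tool inside it on any machine with Docker installed (for example, while testing a workflow locally before scaling it up in the cloud). The image tag and file paths are hypothetical; check the registry for current versions.

```python
# Minimal sketch: reuse a pre-built bioinformatics container instead of building one.
# Assumes Docker is installed locally; the image tag and paths are illustrative only.
import subprocess

image = "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"  # hypothetical tag

# Pull the published image from the registry
subprocess.run(["docker", "pull", image], check=True)

# Run a tool inside the container, mounting a local data directory as /data
subprocess.run(
    ["docker", "run", "--rm",
     "-v", "/path/to/data:/data",        # replace with your own data directory
     image,
     "samtools", "view", "-H", "/data/sample1.bam"],
    check=True,
)
```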

 

Finally, keep in mind that reusing existing tools will not only save you a whole lot of time and effort; you will also be more likely to generate outputs that are more directly compatible with other researchers’ work. This increases the comparability of results across different studies and opens up opportunities to aggregate results into federated analyses that will deliver greater power and broader insights.

And don’t forget to share your tools and data, so the next researcher in line can also avoid having to reinvent the wheel!

 

 

The post Ten simple rules — #1 Don’t reinvent the wheel appeared first on Terra.
