Publications Archives - Terra
Science at Scale
https://terra.bio/category/publications/

From liquid biopsies in Ghana to African cancer genomics in the cloud
https://terra.bio/from-liquid-biopsies-in-ghana-to-african-cancer-genomics-in-the-cloud/
Published: Fri, 02 Sep 2022
Guest author Sam Ahuno describes his published work on African cancer genomics and shares his vision of a cloud-powered future for computational research in Africa.

Samuel Terkper Ahuno is a PhD student in the Tri-Institutional PhD Program in Computational Biology and Medicine in New York City. In this guest blog post, he describes his published work on African cancer genomics, which evaluated the feasibility of using liquid biopsies to detect breast cancer in a Ghanaian clinic, and shares his vision of a cloud-powered future for computational research in Africa.


 

Cancer is becoming more common in sub-Saharan Africa, and mortality is rising. Current efforts to mitigate this focus on raising public awareness, diagnosing cancers earlier, improving access to treatment and care, and researching the lifestyle, environmental and genetic risk factors that may be more prevalent among African women.

One major obstacle we are facing is that the current standard of testing for most cancers, including breast cancer, is the traditional biopsy: extracting a small piece of the tumor surgically using a needle. As a result, many cancers are diagnosed only after considerable growth has occurred. Therefore, technologies for earlier detection could make a big difference to patient outcomes. Additionally, less invasive procedures would be better accepted by the population, and could enable repeated sampling and improved treatment monitoring. 

 

Using liquid biopsies to detect breast cancer in Ghanaian patients

Liquid biopsy techniques address these challenges by using readily available biological fluids such as urine, blood, or saliva for diagnosis. These fluids contain circulating or “cell-free” DNA (cfDNA), some of which may come from tumor cells; that fraction is called circulating tumor DNA (ctDNA). A liquid biopsy consists of sampling the relevant fluid and testing for the presence of ctDNA or other such markers of cancer.

Our research group recently tested whether liquid biopsies could be used to detect breast cancer in Ghana, as part of the Ghana Breast Health Study. A small amount of blood was collected from each patient who was recruited into the study at one of three hospitals; DNA was then extracted from the blood and sequenced. This enabled us to estimate how much of the cfDNA was shed by the patient’s tumor into the blood and what sort of DNA damage was associated with that tumor.

We found encouraging results, suggesting that liquid biopsies could be a viable way to detect cancer markers, such as copy number alteration (CNA) status for many selected breast cancer genes, in Ghanaian patients. Copy number alteration is a type of cancer-associated mutation in which one or more segments of DNA are either lost or duplicated. However, adopting this diagnostic approach would require developing genomic and bioinformatics capacity within the country to pursue this research further and ultimately empower clinics to offer these tests in a sustainable and cost-effective way, while also strengthening basic health care services to make sure women can access the treatment they need.

 

The computational requirements of cancer genomics 

Going from raw cfDNA sequence data to biological insights about each patient’s tumor involves complex bioinformatics procedures, which we can divide into two main stages of analysis with very different computational requirements.

The first phase consists of pre-processing the data to ensure we have high-quality information in a suitable format for identifying tumor DNA. In practice, this involves mapping each individual sequence read to a standard genome reference and applying stringent quality control measures (see the GATK Best Practices for more details). This is the most computationally intensive part of the analysis pipeline; for whole genomes with billions of reads, you can imagine how demanding it can get.

The second phase consists of estimating what fraction of the circulating DNA is likely to have originated from a tumor, and identifying CNAs (see ichorCNA documentation for more details). 
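
To give a flavor of what this second phase involves in practice, here is a rough sketch of how it could be wrapped as a workflow task. The commands (readCounter from HMMcopy to bin read depths, followed by the runIchorCNA.R driver script), the bin size, the script path, the reference files and the output names are assumptions drawn from the public ichorCNA documentation, not the exact configuration used in our study.

```wdl
version 1.0

# Rough sketch only: estimating tumor fraction and copy number alterations from a
# cfDNA BAM. Tool names and flags follow the public ichorCNA documentation; the
# bin size, reference wig files, script path, container and output names are
# placeholder assumptions, not the study's actual configuration.
task EstimateTumorFraction {
  input {
    File cfdna_bam
    File cfdna_bam_index
    File gc_wig        # GC-content reference track for the chosen bin size
    File map_wig       # mappability reference track for the chosen bin size
    String sample_id
  }
  command <<<
    set -euo pipefail
    # 1. Count reads in fixed-size genomic bins (HMMcopy utility).
    readCounter --window 1000000 --quality 20 ~{cfdna_bam} > ~{sample_id}.wig
    # 2. Estimate tumor fraction and copy number alterations with ichorCNA.
    Rscript /opt/ichorCNA/scripts/runIchorCNA.R \
      --id ~{sample_id} \
      --WIG ~{sample_id}.wig \
      --gcWig ~{gc_wig} \
      --mapWig ~{map_wig} \
      --outDir .
  >>>
  output {
    File params = "~{sample_id}.params.txt"    # includes the tumor fraction estimate
    File cna_segments = "~{sample_id}.cna.seg"
  }
  runtime {
    docker: "<an image providing the HMMcopy utilities, R, and ichorCNA>"
    cpu: 2
    memory: "8 GB"
  }
}
```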

 

A hybrid approach to achieve scalability without changing everything  

Given the computationally intensive nature of the pre-processing phase, we performed that part of the work using cloud-optimized workflows that we ran on the Terra platform. This allowed us to scale execution very easily and not have to worry about managing high-performance computing resources directly. 

For the second phase of the work, which did not pose any scaling challenges, we chose to use our existing tools on the Mount Sinai Hospital servers. It was easy enough to download the pre-processed outputs from Terra onto our local filesystem. 

This hybrid approach allowed us to take advantage of Terra’s scalable batch processing capabilities without having to change our familiar environment for the more exploratory part of the work. If we were to do this again with a larger dataset, downloading the pre-processed outputs would probably be less feasible, and it might be worth it for us to look into Terra’s interactive cloud environments for doing the rest of the work on the platform as well.

 

The bigger picture of cloud computing in Africa

The study I presented here was the result of an international collaboration between multiple research and clinical institutions in Ghana, as well as in Canada, the UK and the United States. Strengthening global partnerships plays an important role in economic development and is part of the United Nations Sustainable Development Goals. Yet for approaches like liquid biopsies to become the standard of care in Ghana and many other African countries, we must ultimately develop the bioinformatics capacity to perform the relevant research and testing autonomously, in-country.

One of the major challenges for bioinformatics and computational biology in many African countries is limited infrastructure, such as computing resources, even as computing becomes cheaper and more efficient.

Cloud-powered platforms like Terra could play a huge role in increasing access to computing resources to enable scalable genomics research in Africa, by Africans. 

In addition to providing access to powerful hardware resources, such platforms also make it possible to leverage publicly available workflows and pre-installed software tools and environments. This helps newcomers overcome initial learning curves and empowers seasoned researchers to leverage best-in-class tooling without having to spend time installing anything. Once familiar with the infrastructure, they can also develop their own workflows and tools to innovate in the pursuit of their preferred research question.

Organizations such as H3Africa have over the years been building bioinformatics capacity in affiliated institutions in the region. Building on that work, the DSI-Africa consortium recently launched the eLwazi platform, an African-led open data science project powered by Terra. 

However, moving forward, it will be great to have data centers within Africa that enable regional processing, storage and control of genomic data, for privacy and ethical reasons.

There are still many practical, ethical and technological challenges to implementing genomic technologies in Africa, yet it is encouraging to see such progress toward a future where African countries such as Ghana can access the resources they need to chart their own course.

 


 

Acknowledgements

I would like to thank Paz Polak, PhD; Jonine Figueroa, PhD; Geraldine Van der Auwera, PhD, and Kofi Johnson, PhD, for helpful comments.

 

Paper Spotlight: Molecular map of chronic lymphocytic leukemia and its impact on outcome
https://terra.bio/paper-spotlight-molecular-map-of-chronic-lymphocytic-leukemia-and-its-impact-on-outcome/
Published: Thu, 25 Aug 2022
This chronic lymphocytic leukemia (CLL) study used Terra to process and analyze WES, WGS, RNA-seq, RRBS, and targeted NOTCH1 sequencing data.

This blog is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied.



Molecular map of chronic lymphocytic leukemia and its impact on outcome

By Binyamin A. Knisbacher, Ziao Lin, Cynthia K. Hahn, Ferran Nadeu, Martí Duran-Ferrer, Gad Getz, Chip Stewart, Catherine J. Wu et al.

Nature Genetics (2022) https://doi.org/10.1038/s41588-022-01140-w

Abstract: Recent advances in cancer characterization have consistently revealed marked heterogeneity, impeding the completion of integrated molecular and clinical maps for each malignancy. Here, we focus on chronic lymphocytic leukemia (CLL), a B cell neoplasm with variable natural history that is conventionally categorized into two subtypes distinguished by extent of somatic mutations in the heavy-chain variable region of immunoglobulin genes (IGHV). To build the ‘CLL map,’ we integrated genomic, transcriptomic and epigenomic data from 1,148 patients. We identified 202 candidate genetic drivers of CLL (109 new) and refined the characterization of IGHV subtypes, which revealed distinct genomic landscapes and leukemogenic trajectories. Discovery of new gene expression subtypes further subcategorized this neoplasm and proved to be independent prognostic factors. Clinical outcomes were associated with a combination of genetic, epigenetic and gene expression features, further advancing our prognostic paradigm. Overall, this work reveals fresh insights into CLL oncogenesis and prognostication.

 


What part of the work was done in Terra?

Excerpts from the paper’s Methods section:

Sequence data processing and analysis

All sequencing data (WES, WGS, RNA-seq, RRBS, and targeted NOTCH1 sequencing) were processed and analyzed using methods implemented in the Terra platform (https://app.terra.bio). The main Terra methods are available at https://app.terra.bio/#workspaces/broad-firecloud-wupo1/CLLmap_Methods_Apr2021  […]

RNA-seq analysis

RNA-seq data were processed in Terra using the GTEx V7 pipeline (https://github.com/broadinstitute/gtex-pipeline). Briefly, reads were aligned with STAR (v2.6.1d) to hg19 (b37) using the GENCODE v19 annotation, and quality control metrics and gene expression were computed with RNA-SeQC v2.3.6 (https://github.com/getzlab/rnaseqc). A collapsed version of the GENCODE annotation was used to quantify gene-level expression (available at gs://gtex-resources/GENCODE/gencode.v19.genes.v7.collapsed_only.patched_contigs.gtf). TPMs were used for sample clustering, whereas gene counts were used for differential gene expression, as required […]

DNA methylation data processing

DNA methylome data was analyzed for a total of 1,037 samples, including 490 samples profiled with Illumina 450K array previously analyzed [52] (European Genome-phenome Archive (EGA) accession EGAD00010001975), and 547 samples profiled using RRBS with either single-end or paired-end approaches. A pipeline was developed in Terra to obtain the CpG methylation estimates from RRBS data (Supplementary Note). The epitype classifier and the epiCMIT mitotic clock were previously developed for Illumina 450K and EPIC array data […]

 

How did they do it?

The authors processed and analyzed all sequencing data (WES, WGS, RNA-seq, RRBS, and targeted NOTCH1 sequencing) using WDL workflows. The methods used in this study can be found in this Terra workspace.
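
To give a flavor of what one of these WDL steps looks like, here is a minimal per-sample RNA-seq sketch in the spirit of the GTEx pipeline quoted above (STAR alignment followed by RNA-SeQC). The flags, index packaging and container shown are simplified assumptions rather than the exact configuration used in the paper; see the workspace linked above and the GTEx pipeline repository for the authoritative implementation.

```wdl
version 1.0

# Minimal per-sample RNA-seq sketch: STAR alignment followed by RNA-SeQC.
# Flags, index packaging and the container are simplified assumptions; see the
# GTEx pipeline repository for the implementation actually used in the paper.
task RnaSeqSample {
  input {
    File fastq_r1
    File fastq_r2
    File star_index_tar   # pre-built STAR index, packaged as a tarball
    File collapsed_gtf    # collapsed GENCODE annotation for gene-level counts
    String sample_id
    Int threads = 8
  }
  command <<<
    set -euo pipefail
    mkdir star_index
    tar -xf ~{star_index_tar} -C star_index   # assumes index files at the tarball's top level

    STAR --runMode alignReads \
         --runThreadN ~{threads} \
         --genomeDir star_index \
         --readFilesIn ~{fastq_r1} ~{fastq_r2} \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix ~{sample_id}.

    samtools index ~{sample_id}.Aligned.sortedByCoord.out.bam

    # Gene-level expression and QC metrics.
    rnaseqc ~{collapsed_gtf} ~{sample_id}.Aligned.sortedByCoord.out.bam rnaseqc_out \
            --sample ~{sample_id}
  >>>
  output {
    File bam = "~{sample_id}.Aligned.sortedByCoord.out.bam"
    Array[File] rnaseqc_outputs = glob("rnaseqc_out/*")
  }
  runtime {
    docker: "<an image providing STAR, samtools and rnaseqc>"
    cpu: threads
    memory: "48 GB"
    disks: "local-disk 200 SSD"
  }
}
```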

If you are a new Terra user, try your hand at running a workflow in Terra with this Quickstart Tutorial Workspace.

 



Paper Spotlight: A complete reference genome improves analysis of human genetic variation
https://terra.bio/paper-spotlight-a-complete-reference-genome-improves-analysis-of-human-genetic-variation/
Published: Thu, 07 Apr 2022
This spotlight on the paper describing the Telomere-to-Telomere genome reference highlights how the variant calling part of the work was implemented in Terra.

This blog is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied. 


 

A complete reference genome improves analysis of human genetic variation

By Sergey Aganezov, Stephanie M. Yan, Daniela C. Soto, Melanie Kirsche, Samantha Zarate, Pavel Avdeyev, Dylan J. Taylor, Kishwar Shafin, Alaina Shumate, Michael C. Schatz et al., 2022

Science, Vol 376, Issue 6588 https://doi.org/10.1126/science.abl3533 

Abstract: Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.

 


 

What part of the work was done in Terra?

Excerpts from the paper’s Methods section:

 

Short-read variant calling 

To evaluate short-read small-variant calling between GRCh38 and T2T-CHM13, we used the NHGRI AnVIL (44) to align all 3,202 1KGP samples to CHM13 with BWA-MEM (45) and performed variant calling with GATK HaplotypeCaller (77) using a workflow modeled on the one developed by the New York Genome Center (NYGC) for 1KGP analysis performed on GRCh38 (28). As in the NYGC analysis, we recalibrated the variant calls with GATK VariantRecalibrator. We analyzed coverage statistics using samtools and AF using bedtools. To identify Mendelian-discordant variants, we used GATK VariantEval.

 

Note: NHGRI AnVIL is a project of the US National Human Genome Research Institute that brings together Terra and several complementary platforms into a powerful genomics analysis ecosystem. The AnVIL portal powered by Terra provides full access to Terra’s data and analysis capabilities.

 

How did they do it?

The authors developed WDL workflows for calling variants in the short-read sequencing data, based on a previous analysis by the New York Genome Center. They ran the workflows at scale on all 3,202 whole genomes in the 1000 Genomes Project cohort using Terra’s workflow execution service.

You can learn more about the scaling challenges they faced and how they overcame them by using Terra in this blog post, written by Samantha Zarate of the Schatz Lab.  

To try your hand at running a workflow in Terra, check out this Quickstart Tutorial Workspace.

 



Ten simple rules — #1 Don’t reinvent the wheel
https://terra.bio/ten-simple-rules-1-dont-reinvent-the-wheel/
Published: Thu, 03 Mar 2022
A review of data, code and tools available for reuse in Terra, inspired by “Ten simple rules for large-scale data processing” (Fungtammasan 2022).

This blog post is part of a series based on the paper “Ten simple rules for large-scale data processing” by Arkarachai Fungtammasan et al. (PLOS Computational Biology, 2022). Each installment reviews one of the rules proposed by the authors and illustrates how it can be applied when working in Terra. In this first installment, we cover data and tooling resources that Terra users can take advantage of to avoid doing unnecessary work. 


 

We kick off this “Ten simple rules” series with “Don’t reinvent the wheel”, a classic maxim that is ubiquitous in programming advice forums yet tragically underappreciated in the world of research computing. Certainly a fitting start to any list of guiding principles for tackling computational science at scale. 

In their paper, Arkarachai Fungtammasan and colleagues address this rule mainly from the point of view of data resources, emphasizing that, before you set out to process a large body of data, you should check whether the work might have been done for you already:

[…] In short, undertaking large-scale data processing will require substantial planning time, implementation time, and resources.

There are many data resources providing preprocessed data that may meet all or nearly all of one’s needs. For example, Recount3 [4,5], ARCHS4 [6], and refine.bio [7] provide processed transcriptomic data in various forms and processed with various tool kits. CBioPortal [1,8] provides mutation calls for many cancer studies. Cistrome provides both data and tool kit for transcription factor binding and chromatin profiling [9,10]. A research project can be substantially accelerated by starting with an existing data resource.

This focus on data surprised me a little, because in my experience, the “Don’t reinvent the wheel” rule is more commonly invoked to advocate for using existing bioinformatics tools and workflows rather than writing new ones. However, the authors are not wrong to call out the usefulness of looking for already processed data, particularly in an age when large data generation initiatives are being developed specifically for the purpose of making data available for mining by the wider research community.

In the Terra ecosystem, multiple research consortia are making data resources available that have already been processed through standardized workflows and can be readily imported into Terra, so that researchers can focus their resources on downstream analysis. For example, the Human Cell Atlas provides a multitude of analysis-ready ‘omics data resources that can be imported into a Terra workspace via the HCA Data Portal, as does the BRAIN Initiative Cell Census Network (BICCN), which offers human, non-human primate and mouse ‘omics data through its Terra-connected Neuroscience Multi-Omics (NeMO) portal.

You can check out the Terra Dataset Library to browse the various public and access-controlled datasets (spanning multiple data types and research focus areas) that are available in repositories connected to Terra.

And now, to extend the scope of discussion a little compared to the paper…

 

Try to reuse existing code, tools, containers, and other assets

Unless what you’re doing is unusually cutting-edge, chances are someone has already tackled a similar problem, and you may be able to reuse some of their tooling. Not to get into the debate of when it’s appropriate to write a new genome aligner from scratch — but I think we can all agree that there are some well-established data processing operations like running a variant calling pipeline on human WGS data, or generating count matrices from single-cell RNAseq data, where you can often benefit from reusing existing tools and workflows rather than rolling your own. In some cases you may need to make some modifications to adapt them to your specific use case, but that’s still a lot less work than starting from nothing.

So where do you find existing tooling?

In the context of Terra, here’s a shortlist of the best places you can look for ready-to-use tools:

1. The Terra showcase features a growing collection of public workspaces that offer fully-configured workflows, Jupyter notebooks, example data and more for a wide range of use cases. Some of these workspaces are created by tool developers to serve as a demonstration of how to run their tools. Others are created by researchers, often as companions to published papers, to recapitulate an end-to-end analysis in a fully reproducible way. The great thing they all have in common is that they combine data, tools and configuration settings that have been shown to work, so you can see in practice how the different pieces are supposed to connect. You may not find a workspace that’s an exact match for your needs, but you may find one that is close enough to use as a starting point, which can dramatically shorten the amount of setup time you need to get your analysis going.

2. For interactive analysis, Terra’s Cloud Environments system provides a menu of pre-built environment images for running applications like Jupyter Notebook and RStudio that come with sets of popular packages pre-installed to get you up and running as quickly as possible. For example, the Bioconductor environment developed as part of the AnVIL project includes the Bioconductor Core packages.

3. Terra also offers a Galaxy environment that includes the full Galaxy Tool Shed.

4. The Terra Notebooks Playground is a great resource for finding code examples of how to perform a variety of operations in Terra notebooks. In addition, many researchers now share Jupyter Notebook files demonstrating how to run computational analyses that they have published; many of these can be run in Terra’s Cloud Environments with only minimal adaptations.

5. For running automated pipelines at scale, the Dockstore workflow repository offers a large collection of workflows contributed by research groups around the world, with a particular emphasis on large-scale analyses and optimizations for cloud platforms. Dockstore connects directly to Terra, so once you’ve found a workflow you’re interested in, you can import the workflow script and an example configuration file with a few clicks. Most WDL workflows that you find in Dockstore can be run in Terra without any modifications (if you have never seen one, a minimal WDL example appears just after this list). If you do need to modify the workflow code to suit your use case, either fork the original code on GitHub and register your version in Dockstore, or bring it into the Broad Methods Repository if you want basic version control and editing capabilities without having to deal with git. There are also other sources of WDLs out there that are not registered in Dockstore, like the BioWDL project; the OpenWDL community is a good starting point to track those down.

6. Tool container repositories like Docker Hub and Quay.io can be really handy if you’re writing your own workflows. Running workflows in the cloud requires the use of “containers”, which are a way to package command-line tools into a self-contained environment that can be run on a virtual machine. One of the things we hear researchers worry about when they start moving to the cloud is that they’re not comfortable with creating their own Docker containers. The good news is that creating your own containers is actually not as difficult as it’s sometimes made out to be (if you have the right tutorial) BUT we can all agree it’s even easier if you don’t have to do it at all. Fortunately, many tool developers now provide pre-built containers through container repositories such as those listed above, and for the rest, there are community-driven projects like BioTools that make containers available for a wide range of popular bioinformatics tools. So once again, chances are you can find what you need off the shelf and not have to build it yourself (the example after this list shows how such a pre-built container is referenced from a workflow’s runtime section).
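
To make items 5 and 6 concrete, here is a deliberately tiny example of a WDL workflow that runs a single tool inside a pre-built container. The tool choice (samtools flagstat), the container tag and the resource values are arbitrary illustrations; the point is simply the overall shape of a workflow and how an off-the-shelf image is referenced in the runtime section.

```wdl
version 1.0

# A deliberately tiny WDL workflow: run `samtools flagstat` on one BAM file.
# The container tag and resource values are arbitrary examples.
workflow FlagstatExample {
  input {
    File input_bam
  }
  call Flagstat { input: bam = input_bam }
  output {
    File stats = Flagstat.stats
  }
}

task Flagstat {
  input {
    File bam
  }
  command <<<
    samtools flagstat ~{bam} > flagstat.txt
  >>>
  output {
    File stats = "flagstat.txt"
  }
  runtime {
    # Reusing a pre-built public container image means there is nothing to build yourself.
    docker: "biocontainers/samtools:v1.9-4-deb_cv1"
    cpu: 1
    memory: "4 GB"
  }
}
```

In Terra, importing a workflow like this from Dockstore and pointing it at your data is typically all the setup needed to run it at scale.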

 

Finally, keep in mind that reusing existing tools will not only save you a whole lot of time and effort; you will also be more likely to generate outputs that are more directly compatible with other researchers’ work. This increases the comparability of results across different studies and opens up opportunities to aggregate results into federated analyses that will deliver greater power and broader insights.

And don’t forget to share your tools and data, so the next researcher in line can also avoid having to reinvent the wheel!

 

 

Paper Spotlight: Transmission from vaccinated individuals in a large SARS-CoV-2 Delta variant outbreak
https://terra.bio/transmission-from-vaccinated-individuals-in-covid-outbreak/
Published: Thu, 17 Feb 2022
This spotlight on a peer-reviewed paper about the 2021 Delta variant outbreak in Provincetown, MA highlights how the viral genomics part of the work was implemented in Terra.

This blog is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied. 


 

Transmission from vaccinated individuals in a large SARS-CoV-2 Delta variant outbreak

By Katherine J. Siddle, Lydia A. Krasilnikova, Gage K. Moreno, Stephen F. Schaffner et al., 2022
Cell, Volume 185, Issue 3, 485 – 492.e10 https://doi.org/10.1016/j.cell.2021.12.027

Abstract: An outbreak of over 1,000 COVID-19 cases in Provincetown, Massachusetts (MA), in July 2021—the first large outbreak mostly in vaccinated individuals in the US—prompted a comprehensive public health response, motivating changes to national masking recommendations and raising questions about infection and transmission among vaccinated individuals. To address these questions, we combined viral genomic and epidemiological data from 467 individuals, including 40% of outbreak-associated cases. The Delta variant accounted for 99% of cases in this dataset; it was introduced from at least 40 sources, but 83% of cases derived from a single source, likely through transmission across multiple settings over a short time rather than a single event. Genomic and epidemiological data supported multiple transmissions of Delta from and between fully vaccinated individuals. However, despite its magnitude, the outbreak had limited onward impact in MA and the US overall, likely due to high vaccination rates and a robust public health response.

 


 

What part of the work was done in Terra?

Excerpts from the paper’s Methods section:

 

SARS-CoV-2 genome assembly and analysis 

For genomes generated at the Broad Institute, we conducted all analyses using viral-ngs 2.1.28 on the Terra platform (app.terra.bio). All of the workflows named below are publicly available via the Dockstore Tool Registry Service (dockstore.org/organizations/BroadInstitute/collections/pgs). Briefly, samples were demultiplexed, reads were filtered for known sequencing contaminants, and SARS-CoV-2 reads were assembled using a reference-based assembly approach with the SARS-CoV-2 isolate Wuhan-Hu-1 reference genome GenBank: NC_045512.2 (sarscov2_illumina_full.wdl).

 

Phylogenetic Tree construction

We constructed a maximum-likelihood (ML) phylogenetic tree (Sagulenko et al., 2018) with associated visualizations using a SARS-CoV-2-tailored Augur pipeline (Huddleston et al., 2021) (sarscov2_nextstrain_aligned_input), part of the Nextstrain project (Hadfield et al., 2018), adapted from github.com/nextstrain/ncov, with the entirety of ARTICv3 amplicons 64, 72, and 73 (Delta dropout regions) masked from tree construction.

 

How did they do it?

The authors implemented these analyses as WDL workflows, which they deposited in Dockstore then imported into Terra. They ran the workflows at scale using Terra’s workflow execution service. You can see these workflows in action in the public COVID-19 workspace maintained by the authors as a related resource. 

To try your hand at running a workflow in Terra, check out this Quickstart Tutorial Workspace.

 


Appendix: Data and code availability

  • All SARS-CoV-2 genomes, patient metadata, and raw sequencing reads have been deposited to NCBI under BioProject: PRJNA715749 or BioProject: PRJNA686883 in GenBank, BioSample, and SRA databases, respectively. All genomes produced in the present study are also available on GISAID. All data is publicly available as of the date of publication. Accession numbers of additional publicly available data analyzed in this paper are available in Table S1.
  • All code used for sequence data processing, genome assembly, and phylogenetic analysis is publicly available either via the Dockstore Tool Registry Service (dockstore.org/organizations/BroadInstitute/collections/pgs) or on GitHub (github.com/AndrewLangvt/genomic_analyses/blob/main/workflows/wf_viral_refbased_assembly.wdl).
  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request (kjsiddle@broadinstitute.org).

 

 

What’s new in the Terraverse in January 2022
https://terra.bio/whats-new-in-the-terraverse-in-january-2022/
Published: Thu, 27 Jan 2022
A quick recap of recent Terra-related headlines: journal appearances for Terra and the AnVIL, planned integration with Singular Genomics’ new G4 sequencer, and a new release of Microsoft’s Cromwell on Azure.

While many of us have had a rough start to the year 2022, there has nonetheless been some exciting news rolling out over the past few weeks in the Terra ecosystem. Here is a quick recap of the headlines to get you all caught up.

 

Journal appearances for Terra and the AnVIL 

To kick off the year, Terra was featured in 2022’s first issue of Nature! Technology editor Jeffrey Perkel interviewed researchers who actively use Terra and summarized the key takeaways of their experiences in an insightful article that declares “Terra takes the pain out of ‘omics’ computing in the cloud”.

Then the AnVIL paper came out in Cell Genomics the very next week. Led by Michael Schatz from Johns Hopkins University and featuring a proverbial football team of contributors, this landmark open access paper describes the purpose, architecture and usage of the AnVIL project, including Terra. (See also our previous blog post, How Terra fits within the AnVIL ecosystem for a quick primer.)

 

Singular Genomics to integrate new G4 sequencer with Terra 

The sequencing technology market has been bubbling with new developments, and it’s not just about the chemistry and instrumentation — we’re also seeing some exciting efforts from manufacturers to streamline the process of getting data off the sequencer and into the researcher’s analysis environment. 

Case in point, Singular Genomics sweetened the recent launch of their new G4 sequencer with a plan to make the instrument integrate with Terra, which will empower G4 users to import their freshly sequenced data straight into Terra for analysis. We look forward to collaborating with Singular on this, and can readily envision Terra workspaces equipped with pipelines and tools preconfigured to run on the data produced by the G4 instrument. 

 

Microsoft’s Cromwell on Azure 3.0 supports setting VM size per task

Cromwell is the workflow manager that makes it so easy to run pipelines at scale in Terra. It can also be used outside of Terra as a standalone command-line program, and is in fact used by many groups around the world to run pipelines on a variety of platforms (including on-premises servers and HPCs).

“Cromwell on Azure” is a solution developed by Microsoft to empower command-line users to run workflows through Cromwell on the Azure cloud, leveraging Azure Batch as the execution service, with minimal configuration and deployment burden. (See the Github readme to learn more about how to use Cromwell on Azure in practice.)

Version 3.0 of Cromwell on Azure, released earlier this week, provides a number of enhancements, the most important of which is the ability for workflow authors to select a specific Azure VM size for each workflow task. This is a big deal because “right-sizing” VMs — i.e. choosing VM sizes that fit each task depending on the tool and amount of data involved — is critical for optimizing workflows for speed and cost.
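
In WDL terms, this means a task’s runtime section can now name the Azure VM size it should run on. The sketch below is purely illustrative: the vm_size attribute name and the VM SKU shown are assumptions, so check the Cromwell on Azure documentation and release notes for the exact syntax, and the command is just a stand-in for any memory-hungry step.

```wdl
version 1.0

# Illustrative only: per-task VM selection with Cromwell on Azure. The vm_size
# attribute name and the VM SKU are assumptions; check the Cromwell on Azure
# documentation for the exact runtime syntax. The command is a stand-in for any
# memory-hungry step.
task MemoryHungryStep {
  input {
    File input_file
  }
  command <<<
    sort ~{input_file} > result.txt
  >>>
  output {
    File result = "result.txt"
  }
  runtime {
    docker: "ubuntu:20.04"
    vm_size: "Standard_E16s_v3"   # request a larger VM only for this task
  }
}
```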

What does this mean about Azure support in Terra itself, you ask? Well, you can take it as a sign that we’ve been actively working on it since Microsoft joined the Terra partnership last year, and individual pieces are starting to materialize. We expect to have more concrete information and timelines to share on this topic in the not-too-distant future, so watch this space — and subscribe to the Terra newsletter so you don’t miss any of the many exciting updates on the horizon.

The Molecular Oncology Almanac: From algorithm to analysis portal
https://terra.bio/the-molecular-oncology-almanac-from-algorithm-to-analysis-portal/
Published: Tue, 11 Jan 2022
Brendan Reardon, computational biologist at Dana-Farber Cancer Institute, describes how the MOAlmanac algorithm and analysis portal enable a wide range of users to leverage large-scale genomics to guide individualized patient care in oncology.

Cancer researchers are increasingly turning to cloud resources, primarily to access data and solve large-scale computational problems. Yet as the infrastructure matures, the cloud can also be used to widen access to important tools for clinicians and point-of-care personnel who do not have the level of computational training that has traditionally been required to use them.

In this guest blog post, Brendan Reardon, computational biologist in the Van Allen Lab at Dana-Farber Cancer Institute, tells us about the MOAlmanac algorithm and the analysis portal that he and his collaborators built on top of Terra to enable a wider range of users to leverage large-scale genomics to guide individualized patient care in oncology.


 

In recent years, it has become routine to use individual tumor molecular profiling to identify “first-order” genomic alterations such as single-gene variants that can inform therapeutic options. However, clinical interpretation algorithms and knowledge bases do not generally account for the interactions between these first-order events and global molecular features such as mutational signatures, despite the increasing recognition that these “second-order” events can be associated with clinical outcomes.

To address this gap, our team developed the Molecular Oncology Almanac (MOAlmanac), an open-source clinical interpretation algorithm paired with a novel knowledge base to enable integrative interpretation of genomic and transcriptional cancer data. 

 

Figure: Overview of the MOAlmanac method (from Reardon et al., 2021, Figure 1a).

 

In our recent manuscript published in Nature Cancer, we showed how the MOAlmanac method nominated a median of two therapies per patient and identified therapeutic strategies administered in 46% of patient profiles when applied to a prospective precision oncology trial cohort. These results suggest MOAlmanac can be readily applied for translational hypothesis generation in cancer research, yet the ultimate goal of the project is to be able to also assist clinicians in their point-of-care treatment decision-making.

To that end, we are now working with other groups to test the method in real-world projects. In a collaboration with Dr. Katherine Janeway and Dr. Nikhil Wagle of Dana-Farber Cancer Institute, we are applying the MOAlmanac approach to participant data from the Count Me In Osteosarcoma Project as part of the National Cancer Institute’s Participant Engagement and Cancer Genome Sequencing (PE-CGS) Network. The collaboration aims to evaluate the use of MOAlmanac as an aid to assist molecular pathologists in the clinical interpretation and sample contextualization for somatic return of results to study participants.

As a rare type of cancer that is severely understudied, osteosarcomas pose a particular set of challenges and are in dire need of innovative approaches to drive novel treatment strategies, clinical trials, and standards of care. While MOAlmanac is currently released for research use only, we believe it constitutes an exciting step forward toward the next level of precision cancer medicine. 

 

Using the cloud to widen access to computational oncology

In order to ensure that this promising method is widely accessible and usable by researchers and clinicians who do not have extensive computing experience, we created a website that enables others to easily browse the MOAlmanac knowledge base and run the MOAlmanac algorithm on their own samples.

The MOAlmanac portal interface walks users with minimal computational experience through the basic steps of setting up the analysis; then it uses Terra’s built-in data modeling and workflow management capabilities under the hood to run the analysis efficiently and securely. This enables users such as point-of-care clinicians to take advantage of Terra’s powerful and highly scalable workflow execution system without having to learn how to use Terra itself.

 

Screenshot of the analysis configuration page on the MOAlmanac portal 

 

For researchers who are comfortable with bioinformatics and are interested in running the MOAlmanac method on their own infrastructure, customizing it or extending it with their own ideas, we provide complementary resources, including source code, containerized tools and scripted workflows. We also made a public Terra workspace that demonstrates the use of these resources; we invite you to try it out and let us know what you think by emailing us at moalmanac[at]ds.dfci.harvard.edu or tweeting at us, @moalmanac.

Calling variants from telomere to telomere with the new T2T-CHM13 genome reference
https://terra.bio/calling-variants-from-telomere-to-telomere-with-the-new-t2t-chm13-genome-reference/
Published: Thu, 11 Nov 2021
Samantha Zarate of the Schatz Lab takes us behind the scenes of the large-scale analysis that demonstrated the benefits of the new T2T-CHM13 reference genome for variant calling.

Samantha Zarate is a third-year PhD student in the computer science department at Johns Hopkins University in Baltimore, MD, working in the lab of Dr. Michael Schatz. As a member of the Telomere-to-Telomere consortium, she has been working for the last year to evaluate how the T2T-CHM13 reference genome affects variant calling with short-read data. In this guest blog post, Samantha explains what this entails, then walks us through the computational challenges she faced in implementing this analysis and how she solved them using Terra and the AnVIL.


 

Earlier this year, the Telomere-to-Telomere (T2T) consortium released the complete sequence of a human genome, unlocking the roughly 8% of the genome that remained unfinished in the current human reference and introducing nearly 200 million bp of novel sequence (Nurk, Koren, Rhie, Rautiainen, et al., 2021). For context, this is about as much novel sequence as in all of chromosome 3!

Within the T2T consortium, I led an analysis that demonstrated that the new T2T-CHM13 reference genome improves read mapping and variant calling for 3,202 globally diverse samples sequenced with short reads. We found that compared to the current standard, GRCh38, using this new reference mitigates or eliminates major sources of error that derived from incorrect assembly and certain idiosyncrasies of the samples previously used as the basis for reference construction. For example, the T2T-CHM13 reference includes corrections to collapsed segmental duplications, which are regions that previously appeared highly enriched for heterozygous paralog-specific variants in nearly all individuals due to the false pileup of reads from duplicated regions to a single location. This and other such corrections lead to a decrease in the number of variants erroneously called per sample when using the T2T-CHM13 reference genome. In addition, because we also added nearly 200Mbp of additional sequence, we discovered over 1 million additional high quality variants across the entire collection.

Our collaborators in the T2T consortium evaluated the scientific utility of our improved variant calling results. Excitingly, they found that the use of the new reference genome led to the discovery of novel signatures of selection in the newly assembled regions of the genome, as well as improved variant analysis throughout. This included reporting up to 12 times fewer false-positive variants in clinically relevant genes that have traditionally proved difficult to sequence and analyze.

Collectively, these results represent a significant improvement in variant calling using the T2T-CHM13 reference genome, which has broad implications for clinical genetics analyses, including inherited, de novo, and somatic mutations. 

If you’d like to learn more about our findings and the supporting evidence, you can read the preprint on bioRxiv. In the rest of this blog post, I want to give some behind-the-scenes insight into what it took to accomplish this analysis — the computational challenges we faced, the decision to use Terra and the AnVIL, and how it went in practice — in case it might be useful for others tackling such a large-scale project for the first time.

 

Designing the pipeline and mapping out scaling challenges

To evaluate the T2T-CHM13 reference genome across multiple populations and a large number of openly accessible samples, we turned to the 1000 Genomes Project (1KGP), which recently expanded its scope to encompass 3,202 samples representing 602 trios and 26 populations around the world.

A team at the New York Genome Center (NYGC) had previously generated variant calls from that same dataset using GRCh38 as the reference genome (Byrska-Bishop et al., 2021), with a pipeline based on the functional equivalence pipeline standard established by the Centers for Common Disease Genomics (CCDG) and collaborators. The functional equivalence standard provides guidelines for implementing genome analysis pipelines in such a way that you can compare results obtained with different pipelines with confidence that any differences in the results are due to differences in specific inputs — such as a different reference genome — rather than to technical differences in the tools and methods employed.

By that same logic, if we followed their pipeline closely, we could perform an apples-to-apples comparison between their results and those we were going to generate with the T2T-CHM13 reference, and thus avoid having to redo the work that they had already done with GRCh38. As we implemented our pipeline, we only updated those elements that substantially improved efficiency without introducing divergences, and whenever possible, we used the same flags and options as the original study.

 

Pipeline overview

Conceptually, the pipeline consists of two main operations: (1) read alignment, in which we take the raw sequencing data for each sample and align, sort, and organize each read’s alignment to the reference genome, and (2) variant calling, in which we examine the aligned data to identify potential variants, first within each sample, then across the entire 1KGP collection. In practice, each of these operations consists of multiple steps of data processing and analysis, each with different computational requirements and constraints. Let’s take a closer look at how this plays out when you have to apply these steps to a large number of samples.

 

Read alignment

Read alignment involves aligning each read to the reference genome, plus a few additional steps to address data formatting and sorting requirements. When samples are sequenced using multiple flowcells, the data for each sample is subdivided into subsets called “read groups”, so we perform these initial steps for each read group. We then merge the aligned read group data per sample and apply a few additional steps: marking duplicate reads and compressing the alignment data into per-sample CRAM files. Finally, we generate quality control statistics that allow us to gauge the quality of the data we’re starting from.
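
As a simplified illustration of what one of these steps looks like in WDL (not the production workflow used in the study, whose exact flags follow the functional equivalence specification), a per-read-group alignment task might look something like this:

```wdl
version 1.0

# Simplified illustration of one per-read-group alignment step. Flags, resources
# and the container are placeholders; the study's actual workflow follows the
# functional equivalence specification.
task AlignReadGroup {
  input {
    File fastq_r1
    File fastq_r2
    File ref_fasta
    Array[File] ref_bwa_index   # .amb/.ann/.bwt/.pac/.sa files accompanying the FASTA
    String sample_id
    String read_group_id
    Int threads = 16
  }
  command <<<
    set -euo pipefail
    bwa mem -t ~{threads} \
        -R "@RG\tID:~{read_group_id}\tSM:~{sample_id}\tPL:ILLUMINA" \
        ~{ref_fasta} ~{fastq_r1} ~{fastq_r2} \
      | samtools sort -@ ~{threads} -o ~{read_group_id}.sorted.bam -
    samtools index ~{read_group_id}.sorted.bam
  >>>
  output {
    File sorted_bam = "~{read_group_id}.sorted.bam"
    File sorted_bam_index = "~{read_group_id}.sorted.bam.bai"
  }
  runtime {
    docker: "<an image providing bwa and samtools>"
    cpu: threads
    memory: "32 GB"
    disks: "local-disk 300 SSD"
  }
}
```

The per-sample merge, duplicate marking and CRAM compression described above would then be separate downstream tasks that take the per-read-group BAMs as inputs.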

 

 

Our chosen cohort comprised genome sequencing data from 3,202 samples in the form of paired-end FASTQ files, meaning that our starting dataset consisted of 6,404 files. Applying the initial read alignment per read group would involve generating approximately 32,000 files at the “widest” point, to eventually produce 3,202 per-sample CRAM files, plus multiple quality control files for each sample. That’s a lot of files!

 

Variant calling

Read alignment is just the beginning. To generate variant calls across a large cohort, we have to decompose the process into two main operations, as described in the GATK Best Practices: first, we identify variants individually for each sample, then we combine the calls from all the samples in the cohort to generate overall variant calls in VCF format. 

We performed the per-sample variant calling using a special mode of the GATK HaplotypeCaller tool that generates “genomic VCFs”, or GVCFs, which contain detailed information representing the entire genome (not just potentially variant sites, as you would find in normal VCF files). For efficiency reasons, we performed this step on a per-chromosome basis, generating over 75,000 files in total. 
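
In GATK terms, this per-sample, per-chromosome step corresponds to a HaplotypeCaller invocation along the lines of the sketch below; the container tag, resources and flags are illustrative rather than the study’s exact settings.

```wdl
version 1.0

# Simplified sketch of per-sample, per-chromosome GVCF generation. Resources,
# the container tag and the flags are illustrative, not the study's exact settings.
task HaplotypeCallerGvcf {
  input {
    File cram
    File cram_index
    File ref_fasta
    File ref_fasta_index   # .fai
    File ref_dict          # sequence dictionary (.dict)
    String sample_id
    String chromosome      # e.g. "chr1"
  }
  command <<<
    set -euo pipefail
    gatk HaplotypeCaller \
      -R ~{ref_fasta} \
      -I ~{cram} \
      -L ~{chromosome} \
      -ERC GVCF \
      -O ~{sample_id}.~{chromosome}.g.vcf.gz
  >>>
  output {
    File gvcf = "~{sample_id}.~{chromosome}.g.vcf.gz"
    File gvcf_index = "~{sample_id}.~{chromosome}.g.vcf.gz.tbi"
  }
  runtime {
    docker: "broadinstitute/gatk:4.2.0.0"   # one example of a public GATK image
    cpu: 2
    memory: "16 GB"
    disks: "local-disk 100 SSD"
  }
}
```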

We then combined the variant calls from all the samples in the cohort for the “joint genotyping” analysis, which involves looking at the evidence at each possible variant site across all the samples in the cohort. This analysis produced the joint callset for the whole cohort, containing all the variant sites with detailed genotyping information and variant call statistics for each sample in the cohort.

The twist is that this step has to be run on intervals smaller than a whole chromosome, so now we’re combining per-chromosome GVCF files across all the samples, but producing per-interval joint-called VCFs — about 30,000 of them in total across the 24 chromosomes (more on that in a minute). Fortunately, we can then concatenate these files into just 24 per-chromosome VCF files.
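
One common way to implement this per-interval joint genotyping step in GATK is to consolidate the per-sample GVCFs with GenomicsDBImport and then run GenotypeGVCFs on the consolidated store; whether the study used exactly this combination or another consolidation strategy is an implementation detail, so treat the sketch below as a generic illustration. The resulting interval-level VCFs can then be concatenated per chromosome, for example with bcftools concat.

```wdl
version 1.0

# Generic illustration of per-interval joint genotyping across many samples,
# using the standard GATK GenomicsDBImport + GenotypeGVCFs combination.
# Resources and the container are placeholders.
task JointGenotypeInterval {
  input {
    Array[File] gvcfs           # per-sample GVCFs covering this chromosome
    Array[File] gvcf_indexes
    File ref_fasta
    File ref_fasta_index
    File ref_dict
    String interval             # e.g. "chr1:1-101000" (100 kb window plus 1 kb padding)
  }
  command <<<
    set -euo pipefail
    # For thousands of samples, a --sample-name-map file is typically used
    # instead of one -V flag per GVCF.
    gatk GenomicsDBImport \
      ~{sep=" " prefix("-V ", gvcfs)} \
      --genomicsdb-workspace-path genomicsdb \
      -L ~{interval}

    gatk GenotypeGVCFs \
      -R ~{ref_fasta} \
      -V gendb://genomicsdb \
      -L ~{interval} \
      -O interval.vcf.gz
  >>>
  output {
    File interval_vcf = "interval.vcf.gz"
    File interval_vcf_index = "interval.vcf.gz.tbi"
  }
  runtime {
    docker: "broadinstitute/gatk:4.2.0.0"
    cpu: 4
    memory: "32 GB"
    disks: "local-disk 200 SSD"
  }
}
```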

And, finally, there’s a filtering step called variant recalibration (VQSR) that is done per-chromosome. This involves scoring variants to identify those with sufficiently high quality to use in downstream analysis, as opposed to those below this threshold, which we assume to be artifacts. The VCF files annotated with final variant scores and filtering information (PASS or otherwise) are the final outputs of the pipeline.
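
The recalibration itself boils down to two GATK commands per chromosome, VariantRecalibrator followed by ApplyVQSR. In the sketch below the training and truth resource files, the annotations and the sensitivity threshold are placeholders that have to be chosen for the reference and dataset at hand; refer to the GATK documentation and the NYGC pipeline for real settings.

```wdl
version 1.0

# Illustrative VQSR sketch (SNP pass only). Resource files, annotations and the
# sensitivity threshold are placeholders and must be chosen for the reference
# and dataset at hand.
task RecalibrateSnps {
  input {
    File vcf
    File vcf_index
    File ref_fasta
    File ref_fasta_index
    File ref_dict
    File hapmap_vcf          # placeholder training/truth resource
    File hapmap_vcf_index
    String chromosome
  }
  command <<<
    set -euo pipefail
    gatk VariantRecalibrator \
      -R ~{ref_fasta} \
      -V ~{vcf} \
      --resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~{hapmap_vcf} \
      -an QD -an MQ -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
      -mode SNP \
      -O ~{chromosome}.snp.recal \
      --tranches-file ~{chromosome}.snp.tranches

    gatk ApplyVQSR \
      -R ~{ref_fasta} \
      -V ~{vcf} \
      --recal-file ~{chromosome}.snp.recal \
      --tranches-file ~{chromosome}.snp.tranches \
      --truth-sensitivity-filter-level 99.7 \
      -mode SNP \
      -O ~{chromosome}.snp.recalibrated.vcf.gz
  >>>
  output {
    File recalibrated_vcf = "~{chromosome}.snp.recalibrated.vcf.gz"
  }
  runtime {
    docker: "broadinstitute/gatk:4.2.0.0"
    cpu: 4
    memory: "32 GB"
    disks: "local-disk 200 SSD"
  }
}
```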

 

 

As you can imagine from this brief overview, joint genotyping alone presents various scaling challenges, and our T2T-CHM13 reference added an interesting complexity. Normally, for whole genomes, the GATK development team recommends defining the joint genotyping intervals by using regions with Ns in the reference as “naturally occurring” points where you don’t have to worry about variants spanning the interval boundaries. (For exomes, you can just use the capture intervals.)

However, the T2T-CHM13 reference doesn’t have any regions with Ns — that’s the whole point of a “telomere-to-telomere” reference! As a result, we developed a strategy that involves generating intervals of arbitrary length (100kb) and using 1kb padding intervals that we later trimmed off from the interval-level VCF files. The padding ensured that our variant calls didn’t suffer from edge effects, and we were able to verify that it didn’t make any difference to the final results.
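
Generating those padded intervals is simple enough to sketch as a small task that emits one padded 100 kb interval per line for a given chromosome; names, defaults and the container below are illustrative, and the 1 kb padding is what later gets trimmed from each interval-level VCF.

```wdl
version 1.0

# Sketch: emit 100 kb scatter intervals with 1 kb padding on each side for one
# chromosome. Values and naming are illustrative; the padding is trimmed from
# the interval-level VCFs downstream.
task MakePaddedIntervals {
  input {
    String chromosome      # e.g. "chr20"
    Int chrom_length       # length of the chromosome in bp
    Int window = 100000
    Int padding = 1000
  }
  command <<<
    set -euo pipefail
    for start in $(seq 1 ~{window} ~{chrom_length}); do
      padded_start=$(( start - ~{padding} ))
      if (( padded_start < 1 )); then padded_start=1; fi
      padded_end=$(( start + ~{window} - 1 + ~{padding} ))
      if (( padded_end > ~{chrom_length} )); then padded_end=~{chrom_length}; fi
      echo "~{chromosome}:${padded_start}-${padded_end}"
    done > intervals.txt
  >>>
  output {
    Array[String] intervals = read_lines("intervals.txt")
  }
  runtime {
    docker: "ubuntu:20.04"
    cpu: 1
    memory: "2 GB"
  }
}
```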

 

Scaling up with Terra and the AnVIL

Implementing an end-to-end variant discovery analysis at the scale of thousands of whole genomes is not trivial, both in terms of basic logistics and computational requirements. We originally started out using the computing cluster available to us at Johns Hopkins University, but we realized very quickly that it would take too long and require too much storage for us to be successful on the timeline we needed to keep pace with the project. Based on our early testing, we estimated it would take many months, possibly up to a year, of computation to do all the data processing on our institutional servers.

So instead, we turned to AnVIL and the Terra platform, which promised massively scalable analysis, collaborative workspaces, and a host of other features to meet our needs.

 

Importing data

At the outset, we ran into some difficulties getting the starting dataset ready. We had originally planned to start from the NYGC’s version of the 1000 Genomes Project data (CRAM files aligned to GRCh38), which is already available through Terra as part of the AnVIL project. The idea was to revert those files to an unmapped state and then start our pipeline from there. However, we found that those files included some processing, such as replacing ambiguous nucleotides (N’s in the reads) with other bases, which we didn’t want — we wanted it to be as close to the raw data as possible to eliminate any possible biases introduced from GRCh38. 

So we decided to shift gears and start from the original FASTQ files, which are available through the European Nucleotide Archive (ENA), and this is where we hit the biggest technical obstacle in the whole process. As I mentioned earlier, that dataset comprises 6,404 paired-end FASTQ files, which amount to about 100TB of compressed sequence data — for reference, that’s about 100,000 hours of streaming movies on Netflix in standard definition. Unfortunately, though not unexpectedly, the ENA didn’t plan on having people download that much data at once, so we had to come up with a way to transfer the data that wouldn’t crash their servers. We ended up using a WDL workflow to copy batches of about 100 files at a time to a Google bucket, managing the batches manually over several days. That was not fun.
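
For anyone facing a similar transfer problem, the core of such a workflow can be as simple as a task that fetches one file by URL and hands it back to Cromwell, which then writes the output into the workspace bucket; batching then just determines how many of these tasks you scatter over at once. The sketch below is a generic reconstruction rather than the actual workflow used for this project.

```wdl
version 1.0

# Generic reconstruction of a "fetch one file by URL" workflow; not the exact
# transfer workflow used in the study. Cromwell copies each task's outputs into
# the workspace's Google bucket, so emitting the downloaded file as a task
# output is all that is needed.
workflow FetchFiles {
  input {
    Array[String] urls     # e.g. one batch of ~100 ENA FASTQ URLs
  }
  scatter (url in urls) {
    call FetchOne { input: url = url }
  }
  output {
    Array[File] fetched = FetchOne.downloaded
  }
}

task FetchOne {
  input {
    String url
  }
  command <<<
    set -euo pipefail
    python3 - "~{url}" <<'PYEOF'
import os, sys, urllib.request
url = sys.argv[1]
urllib.request.urlretrieve(url, os.path.basename(url))
PYEOF
  >>>
  output {
    File downloaded = basename(url)
  }
  runtime {
    docker: "python:3.9-slim"
    cpu: 1
    memory: "2 GB"
    disks: "local-disk 100 SSD"
  }
}
```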

 

Executing workflows at scale and collaborating

After that somewhat rocky start, however, running the analysis itself was surprisingly smooth. We implemented the pipeline described above as a set of WDL workflows, and we used Terra’s built-in workflow execution system, called Cromwell, to run them on the cloud.

Scaling up went well, especially considering the alternative to Terra was our university’s HPC, which is subject to limited quotas, traffic restrictions from other users and equipment failures in a way that Google Cloud servers aren’t. The push-button capabilities of Terra let us scale up easily and rapidly: after verifying the success of our WDLs on a few samples, we could move on to processing hundreds or thousands of workflows at a time. It took us about a week to process everything, and that was with Google’s default compute quotas in place (e.g., a maximum of 25,000 cores at a time), which can be raised on request.

We also really appreciated how easy it was to collaborate with others. Working in a cloud environment, it was very easy to keep our collaborators informed on progress and share results with members of the consortium. If we had been using our institutional HPC, which does not allow access to external users, we would have had to copy files to multiple institutions to provide the same level of access.

More generally, we found that the reproducibility and reusability of our analyses have increased significantly. Having implemented our workflows in WDL to run in Terra, we can now publish them on GitHub, knowing that anyone can download them and replicate our analysis on their infrastructure, as Cromwell supports all major HPC schedulers and public clouds. We have also published the accompanying Docker image, so the environment in which we are running our code is also reproducible. With all of the materials we used for this analysis available publicly in Terra and relevant repositories, any interested party has all of the tools they need to fully reproduce our analysis and extend it for their purposes.

 

Room for improvement

One of the things we didn’t like so much was having to select inputs and launch workflows manually in the graphical web interface. That was convenient for initial testing, but we would have preferred to use a CLI environment to launch and manage workflows once we moved to full-scale execution. We only learned after completing the work that Terra has an open API and that there is a Python-based client called FISS that makes it possible to perform all the same actions through scripted commands. The FISS client is covered in the Terra documentation (there’s even a public workspace with a couple of tutorial notebooks) but we never saw any references to it until it was pointed out to us by someone from the Terra team, so this feature needs more visibility.

We’d also love to see more functionality added around data provenance. The job history dashboard gives you a lot of details if you’re looking at workflow execution records. However, you can’t select a piece of data and see how it was generated, by what version of the pipeline, and so on. Being able to identify the source of a given file, notably the workflow that created it as well as the parameters used, would be greatly helpful for tracing back files, especially when a workflow has been updated and files need to be re-analyzed. Or as another example, when two or more files are named XXX.bam, it’s very helpful to be able to tell which one is the final version in a way other than writing down the time each respective workflow was launched.

Finally, as mentioned above, transferring the data from ENA was painful, so we’d love to see built-in file transfer utilities added. There is currently no built-in way to obtain data from a URL or FTP link; a universal file fetcher would make moving data into Terra much more efficient.
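Until something like that exists, the workaround is to stage files yourself: pull each file from the external archive and stream it into the workspace's Google Cloud Storage bucket. Below is a minimal sketch of that pattern, assuming curl and gsutil are available on the machine doing the transfer; the source URL and bucket path are placeholders, not real locations.

```python
# Minimal sketch: fetch a file from an external archive (e.g. an ENA FTP/HTTP URL)
# and stream it into the Terra workspace bucket with gsutil. The URL and bucket
# path below are placeholders.
import subprocess

source_url = "https://ftp.sra.ebi.ac.uk/vol1/run/ERR000/ERR0000001/sample.cram"  # placeholder
workspace_bucket = "gs://fc-00000000-0000-0000-0000-000000000000"                # placeholder
destination = f"{workspace_bucket}/inputs/sample.cram"

# curl streams the download to stdout, and "gsutil cp - <dest>" reads from stdin,
# so the file never has to fit on local disk.
curl = subprocess.Popen(["curl", "-sSL", source_url], stdout=subprocess.PIPE)
subprocess.run(["gsutil", "cp", "-", destination], stdin=curl.stdout, check=True)
curl.stdout.close()
curl.wait()
```

Streaming through stdin matters when the files being transferred are larger than the local disk of the machine doing the staging.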

 

Conclusions

Scientifically, our analysis strongly supports the use of T2T-CHM13 for variant calling. We find impressive improvements in both alignment and variant calling, as CHM13 both resolves errors in GRCh38 and adds novel sequences. Our collaborators within the T2T consortium performed further analysis using the joint-genotyped chromosome-wide VCF files and also found improvements in medically relevant genes and in the overall accuracy of variant calling genome-wide, demonstrating the utility of T2T-CHM13 in clinical analysis.

From a technical standpoint, this was our first time using Terra for large-scale analysis, and we found that the benefits of Terra outweighed the few pain points. Compared to a high-performance cluster, Terra is much more user-friendly for scaling up, reproducing workflows, and collaborating with others across institutions. We have noted some quality-of-life changes that would improve Terra’s usability, and we are confident that if these are implemented, Terra will become an even stronger option in the cloud genomics space.

Moving forward, we have been buoyed by the quality of the T2T-CHM13 reference, and we plan to use it for future large-scale analyses in Terra. As we demonstrate here, CHM13 is easy to use as a reference for large-scale genomic analysis, and we hope that both the clinical and research genomics communities will use it to improve their own workflows.

 


 

References

Byrska-Bishop, M. et al. (2021) ‘High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios’, bioRxiv. doi: 10.1101/2021.02.06.430068.

Nurk, S. et al. (2021) ‘The complete sequence of a human genome’, bioRxiv. doi: 10.1101/2021.05.26.445798.

 

 

The post Calling variants from telomere to telomere with the new T2T-CHM13 genome reference appeared first on Terra.

Spotlight on PANOPLY, a scalable framework for cancer proteogenomics
https://terra.bio/panoply-framework-for-cancer-proteogenomics/ (Tue, 01 Jun 2021)

PANOPLY is an innovative computational framework for applying state-of-the-art statistical and machine learning algorithms to transform multi-omic data from cancer samples into biologically meaningful and interpretable results. In this guest blog post, D. R. Mani, principal computational scientist in the Broad Institute’s Proteomics Platform and lead author of the recently published PANOPLY paper, explains how his team is leveraging Terra to make PANOPLY accessible to a wide range of researchers.   

 


 

Proteogenomics involves the integrative analysis of genomic, transcriptomic, proteomic and post-translational modification (PTM) data produced by next-generation sequencing and mass spectrometry-based proteomics. Effectively analyzing proteogenomics data involves deploying complex computational algorithms that integrate multiple omics data types, and unfortunately, such algorithms remain largely inaccessible to non-computational cancer researchers. We decided to address this problem by building a framework called PANOPLY that would streamline analysis of proteogenomics data and would be easy to use, robust, flexible and reproducible. We wanted researchers to be able to use it on any standard computational platform, so we designed PANOPLY to be modular and portable, but we chose to also make it available through Terra to increase access, scalability and ease of use.

 

A “greatest hits” compilation of methods from flagship CPTAC studies

PANOPLY was not born in a vacuum. Proteogenomic analysis has been extensively applied to cancer samples in many studies published under the auspices of the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and the International Cancer Proteogenome Consortium (ICPC), a global effort to accelerate the understanding of the molecular basis of cancer through the application of proteogenomics. These flagship studies have advanced the field by developing cutting-edge computational methods. PANOPLY combines representative methods from these studies, which originated as disparate algorithms implemented by different research groups, into a unified pipeline within a computational framework built to be modular, scalable and reproducible. 

PANOPLY takes as input a set of pre-formatted datasets derived from DNA, RNA and protein profiling, along with phenotype and clinical annotations (see Figure 1). Any normalization or filtering of proteomics data is accomplished using Data Preparation Modules. Analysis-ready data is then channelled to a series of Data Analysis Modules, many of which perform integrated multi-omic analysis. Almost all analysis modules output an interactive HTML report summarizing results, in addition to detailed tables and plots.

 


Figure 1. PANOPLY architecture overview, showing various data types used along with modules for data pre-processing and analysis.

 

We implemented each module as a standalone workflow, and we created one unified workflow that imports all the module workflows into a full end-to-end pipeline. This enables researchers to easily apply the full complement of proteogenomics analyses to their data out of the box, which we expect to be the majority use case. 

We also anticipated that some researchers might want to apply individual methods to a variety of use cases, so to provide that flexibility we designed the individual modules to be runnable on their own. In addition, we made it possible to compose custom pipelines that combine these modules with other modules that researchers might write or publish themselves. This way, researchers can take advantage of the groundbreaking work done by various CPTAC groups and build on it to further advance the field, with less effort spent on figuring out tooling.

All workflows are written in WDL and use containerized tools, so they can be run on any standard computational platform. All code — including algorithm implementations and WDLs — is open source and available in GitHub.

 

Leveraging Terra workspaces to increase access and usability 

We recognized that enabling a wide range of people to use PANOPLY, especially those with less computational experience, would require more than just releasing code. We wanted a way to make PANOPLY usable out of the box, with tutorials that bundle example data, and we wanted to be able to update all of it easily whenever we make improvements to the software.

To that end, we put together 3 main resource workspaces that are all publicly available in Terra (see Figure 2): (i) a “Modules” workspace containing separate workflows for each analysis module; (ii) a “Pipelines” workspace with preconfigured unified pipelines for fast and easy execution and (iii) a “Tutorial” workspace showing inputs and outputs for the tutorial dataset. In order to further simplify the process of setting up a new analysis workspace, we used the “Notebooks” feature of Terra to provide a startup notebook that includes a step-by-step guide for users. 

 


Figure 2. Organization and contents of the PANOPLY workspaces on Terra. 

 

The tutorial dataset is centered on TCGA samples that were also subjected to proteomics profiling, adapted from our first proteogenomics publication on breast cancer (BRCA) (Mertins et al., 2016), and comes with everything needed to run the unified PANOPLY pipeline. The tutorial itself is organized as an easy-to-follow, step-by-step procedure, from cloning the workspace to uploading the data and running the pipeline. There is also documentation on expected results and how to interpret the many interactive reports generated by PANOPLY.
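As a side note, the workspace-cloning step can also be scripted with Terra's Python client (FISS) rather than done through the web interface. The source namespace and billing project in this sketch are hypothetical placeholders, not the actual location of the tutorial workspace, so adjust them to match what you see in Terra.

```python
# Hypothetical sketch: cloning the public tutorial workspace into your own
# billing project with Terra's Python client (FISS) instead of the web UI.
# Namespace and billing project names below are placeholders.
from firecloud import api as fapi

resp = fapi.clone_workspace(
    "source-namespace",        # placeholder: namespace hosting PANOPLY_Tutorial
    "PANOPLY_Tutorial",
    "my-billing-project",      # placeholder: your own Terra billing project
    "PANOPLY_Tutorial_copy",
)
resp.raise_for_status()
print("Cloned workspace:", resp.json()["name"])
```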

In addition to these core resources, we also provide additional “case study” workspaces showcasing the analysis of BRCA samples (Krug et al., 2020) that were freshly collected exclusively for proteogenomic analysis, along with analysis of a lung adenocarcinoma (LUAD) cohort (Gillette et al., 2020). 

We encourage you to check out the workspaces and try out the tutorial for yourself. If you’d like to share some feedback, please email us at proteogenomics@broadinstitute.org; we look forward to hearing from you.

 


 

Resources

 

PANOPLY paper

Mani, D.R., Maynard, M., Kothadia, R. et al. PANOPLY: a cloud-based platform for automated and reproducible proteogenomic data analysis. Nat Methods (2021). https://doi.org/10.1038/s41592-021-01176-6 

See also the preprint on bioRxiv.

 

Blog references

Mertins, P. et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534, 55–62 (2016).

Krug, K. et al. Proteogenomic Landscape of Breast Cancer Tumorigenesis and Targeted Therapy. Cell 183, 1–21 (2020).

Gillette, M. A. et al. Proteogenomic Characterization Reveals Therapeutic Vulnerabilities in Lung Adenocarcinoma. Cell 182, 200–225.e35 (2020).

 

Full list of PANOPLY workspaces (Release v1.0):

- PANOPLY pipelines: PANOPLY_Production_Pipelines_v1_0, a Terra workspace with pre-configured pipelines, including a startup notebook for easy data input and workspace configuration.
- PANOPLY modules: PANOPLY_Production_Modules_v1_0, a Terra workspace with all individual modules. This enables users to pick and choose modules and customize execution, but requires more knowledge of setting up inputs.
- Tutorial description: PANOPLY-Tutorial (GitHub wiki), step-by-step instructions for running the PANOPLY tutorial.
- Tutorial workspace: PANOPLY_Tutorial, a Terra workspace with tutorial instructions, data, analysis and results.
- Case study (BRCA): PANOPLY_CPTAC_BRCA, a Terra workspace with data, analysis and results from the breast cancer study (Krug et al., 2020, Cell).
- Case study (LUAD): PANOPLY_CPTAC_LUAD, a Terra workspace with data, analysis and results from the lung adenocarcinoma study (Gillette et al., 2020, Cell).

 

The post Spotlight on PANOPLY, a scalable framework for cancer proteogenomics appeared first on Terra.

A must-read review paper for getting started with bioinformatics workflows
https://terra.bio/review-paper-getting-started-with-workflows/ (Fri, 15 Jan 2021)

New year, new partnership… and a new blog series focusing on highlighting papers that we think will be of interest to many of you. For this first iteration, we review a review paper (review-ception!) about workflow systems, coming out of C. Titus Brown’s lab at UC Davis and fresh off the virtual press over at GigaScience. 

Taylor Reiter, Phillip T Brooks, Luiz Irber, Shannon E K Joslin, Charles M Reid, Camille Scott, C Titus Brown, N Tessa Pierce-Ward, Streamlining data-intensive biology with workflow systems, GigaScience, Volume 10, Issue 1, January 2021, giaa140, https://doi.org/10.1093/gigascience/giaa140

From the paper’s abstract: “As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.”

Read on to learn why this paper is a must-read if you’re getting started with workflows.

 


This paper in a nutshell:

Everything you need to know to get started with bioinformatics workflows

Seriously, this review covers an impressive amount of ground. 

It starts with an accessible explanation of what workflows are and why they are such an important and rapidly growing part of biological data analysis, which I expect will be very helpful to anyone who is new to the challenges posed by Really Large Datasets™.

Then, the authors provide a clear and concise review of the main types of workflows, languages and systems that you might encounter — including WDL, Terra’s current workflow language of choice, which they identify alongside CWL as “workflow specification formats that are more geared towards scalability, making them ideal for production-level pipelines with hundreds of thousands of samples” (yep, that checks out). They also touch on software management systems, including container systems (like Docker) and package managers (like Conda), and how these systems integrate with workflow systems. 

That content alone is already solidly informative, and we’re not even at the halfway point yet.

There’s a lot more in there, starting with a set of best-practice recommendations for managing a workflow-based project. This includes what to document (everything), how to document it (consistently) and what tools exist for visualization, version control and collaboration. I was nodding so hard reading that section, I pulled a neck muscle. 

From there, the authors move to a series of practical recommendations for actually getting started with workflows, including finding and accessing compute resources. As stated in the abstract, these are “mainly focused on high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.” I found myself agreeing vehemently once more — the “Strategies for troubleshooting” should be required reading for every researcher who ever comes within three feet (~1m) of a computer, regardless of their field of study. 

I could go on, but frankly at this point you’d be better off just reading the review itself. It’s solidly researched and well supported, insightful, clearly written and just beautifully scoped overall — well worth your time if you’re somewhat or completely new to workflows. Or even if you’re not so new and you’re willing to consider that your habitual practices might still have some room for improvement!

For an introduction to running workflows on Terra, see the Workflows documentation.

The post A must-read review paper for getting started with bioinformatics workflows appeared first on Terra.
