Guest Author Archives - Terra
https://terra.bio/category/guest-author/

Celebrating a year of progress — and a sneak peek at what’s coming next
December 15, 2022
https://terra.bio/celebrating-a-year-of-progress-and-a-sneak-peek-at-whats-coming-next/

Highlights from Terra’s development and growth in 2022, heading into the multi-cloud future of 2023.

Kyle Vernest is Head of Product in the Data Sciences Platform at the Broad Institute. In this guest blog post, Kyle takes a look back at how Terra has grown over the past year, and gives us a preview of what to expect in the first quarter of 2023. 


 

It’s been an incredible year for Terra, with a lot of new users coming to the platform as more labs, groups, and organizations move their computational work to the cloud. We’re also thrilled to see user growth being fueled by scientific consortia such as the Human Cell Atlas, and NIH-driven programs such as AnVIL, rallying their communities around Terra as a platform for secure data sharing and collaboration. 

The Terra development teams spanning the Broad Institute, Microsoft, and Verily have worked tirelessly to continue to expand the platform’s capabilities in service of these growing communities. Highlights of the year’s releases include an improved user interface for managing cloud environments for interactive analysis, increased scalability of the workflow management system, and better tooling for uploading and organizing data in workspaces. We also rolled out numerous usability improvements, like email notifications for workflow status and better organization of the list of workspaces. Most recently, we launched the public preview of the Terra Data Repository, a new component of the Terra platform designed to provide data storage and access management capabilities tailored for the life sciences.

Yet all these upgrades are in many ways only the tip of the iceberg. Behind the scenes, an enormous amount of work has gone into laying the groundwork for a major development that will come to fruition in the first quarter of 2023: support for storing data and running analyses on Microsoft Azure. 

 

Coming soon to a cloud near you

We have been working closely with our partners at Microsoft to expand Terra to a multi-cloud offering, and we are nearing the launch of Terra on Azure coming early in the new year. Leading up to the launch, you may notice a new “Sign in with Microsoft” option on the Terra welcome screen (which will take you to a “Coming Soon” page until the preview phase starts). 

But don’t worry if you’re planning to stick with Terra on Google; we have plenty of upgrades in store for you as well! In particular, you can look forward to taking advantage of WDL 1.1’s workflow language updates, and switching from Jupyter Notebook to JupyterLab for a more full-featured code development experience.

Whether you’re using Terra on Google or on Azure, you’ll be presented with a new version of the Terra Terms of Service, which we’ve updated to reflect the expanded functionality and new multi-cloud nature of the platform.

☁

Finally, as we close out this brief tour of the year’s achievements, we’re especially proud to celebrate the many scientific successes that Terra has already enabled. These have covered an impressive range of domains, from the Telomere-to-Telomere reference genome project to the CDC’s efforts to empower public health labs across the country to adopt genomics for biosurveillance. We look forward to many more in the coming year, featuring even greater variety — including more ‘omics data technologies beyond genomics.

 

 

Introducing Terra Data Repository public preview
December 8, 2022
https://terra.bio/introducing-terra-data-repository-public-preview/

Discover the newest component of the Terra platform, designed to provide data storage and access management capabilities tailored for the life sciences.

Jonathan Lawson is a Senior Software Product Manager in the Broad Institute Data Sciences Platform, overseeing data management products including the Terra Data Repository and the Data Use Oversight System. In this guest blog post, Jonathan announces the public preview phase of the Terra Data Repository, a new component of the Terra platform designed to provide data storage and access management capabilities tailored for the life sciences.


 

Life sciences research has entered an age of extraordinary opportunity thanks to the rapid technological developments of the past decade. We are now able to generate vast amounts of molecular information, such as genomic sequencing, and we can put that molecular data in the context of phenotypes and clinical history to probe the biology of both health and disease in unprecedented detail. These capabilities are already starting to revolutionize how we approach everything from fundamental research into population genetics to diagnostics and drug development.

Yet this technological prowess also brings new technical challenges. The resulting datasets are complex, combining enormous files of molecular data with structured information —such as phenotypic data— that is best stored in database form. In addition, data assets collected from human participants are subject to various constraints with regard to how they can be shared, and with whom.

Solving this challenge calls for data storage and sharing solutions that empower data owners and custodians to make their datasets available for analysis to the research community securely, responsibly and effectively.

Today, we are excited to introduce the Terra Data Repository (TDR), a new component of the Terra platform designed to provide data storage and access management capabilities tailored for the life sciences. It is already actively being used for large collaborative projects including the Human Cell Atlas and the NHGRI’s AnVIL. 

The system supports using formal schemas to represent relationships between different data entities, and generating versioned snapshots that can be used to grant collaborators access to specific subsets of data depending on research purpose and authorizations. Data snapshots are immutable, making it possible to release continuous updates to datasets while ensuring reproducibility of analyses over time. 

For a complete overview of features, usage instructions and detailed technical information, please visit the TDR documentation in the Terra knowledge base. 

The Terra Data Repository is available as a public preview to all registered users of Terra. Please note that the graphical user interface is still under active development, and many operations can currently only be performed through API calls. During this time, we recommend reaching out to the Terra support team to discuss whether the Terra Data Repository might be a good fit for your project’s needs.
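To give a concrete sense of what an API-driven interaction can look like, here is a minimal Python sketch that lists snapshots your account can see. The endpoint path and response fields are assumptions based on the TDR REST API conventions at the time of writing; please treat the TDR documentation and its Swagger page as the authoritative reference.

    import subprocess
    import requests

    # Authenticate as the same Google identity you use to sign in to Terra.
    # This assumes the gcloud CLI is installed and logged in.
    token = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    # List snapshots visible to this account (endpoint path is an assumption;
    # check the TDR Swagger documentation for the exact API).
    resp = requests.get(
        "https://data.terra.bio/api/repository/v1/snapshots",
        headers={"Authorization": f"Bearer {token}"},
        params={"limit": 10},
    )
    resp.raise_for_status()
    for snapshot in resp.json().get("items", []):
        print(snapshot.get("id"), snapshot.get("name"))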

 

Scaling variant discovery to a million genomes with the Genomic Variant Store
October 27, 2022
https://terra.bio/scaling-variant-discovery-to-a-million-genomes-with-the-genomic-variant-store/

Join the early access program to try out the Broad Institute’s new Genomic Variant Store, a highly scalable variant storage and processing solution powered by Google BigQuery on Terra.

Kylee Degatano is a Senior Product Manager in the Data Sciences Platform at the Broad Institute. In this guest blog post, she introduces the Genomic Variant Store, a highly scalable solution for genomic analysis based on Google BigQuery, designed to scale joint variant discovery to a million whole genome samples. Researchers interested in trying out this approach are invited to join the GVS Early Access program.


 

Researchers around the world are generating petabytes of genomics data to understand human biology, identify the causes of diseases, and develop new treatments. The analyses involved, such as association studies, linkage analysis, and exploration of Mendelian inheritance, all require high confidence in the quality of the genomic variant calls they rely on.

One of the most successful approaches of the past decade for generating such high quality variant calls involves jointly analyzing the genomes of many different people across a population. This “joint calling” approach increases statistical power for differentiating true variants from artifacts, which makes it possible to identify extremely rare variants with confidence. In the GATK Best Practices for germline short variant discovery, it is implemented as a two-step process: first we identify potential variants individually per sample, then we evaluate the evidence found for each genomic site across all samples to produce “joint calls”, i.e. a multi-sample variant callset. We can then apply additional filtering to further refine the callset for downstream analysis.   
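To make those two steps concrete, here is a minimal sketch of the corresponding GATK commands, wrapped in Python calls purely for illustration. File names are placeholders, only two samples are shown, and the filtering and other practical details handled by the Best Practices workflows are omitted.

    import subprocess

    # Step 1: per-sample variant calling in GVCF mode (run once per sample).
    for sample in ["sample1", "sample2"]:
        subprocess.run([
            "gatk", "HaplotypeCaller",
            "-R", "reference.fasta",
            "-I", f"{sample}.bam",
            "-O", f"{sample}.g.vcf.gz",
            "-ERC", "GVCF",
        ], check=True)

    # Step 2: joint genotyping across the per-sample GVCFs. Small cohorts can
    # simply combine GVCFs; larger cohorts use GenomicsDBImport or, as
    # described below, the Genomic Variant Store.
    subprocess.run([
        "gatk", "CombineGVCFs", "-R", "reference.fasta",
        "-V", "sample1.g.vcf.gz", "-V", "sample2.g.vcf.gz",
        "-O", "combined.g.vcf.gz",
    ], check=True)
    subprocess.run([
        "gatk", "GenotypeGVCFs", "-R", "reference.fasta",
        "-V", "combined.g.vcf.gz",
        "-O", "joint_calls.vcf.gz",
    ], check=True)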

 

Diagram illustrating the overall data generation and analysis process involved in joint variant discovery.

 

As an example of how this approach empowers discovery, my colleague Laura Gauthier recently wrote a blog post commenting on a study that used a joint callset of 75,000 exome samples to implicate ten new genes in the development of schizophrenia. Crucially, she explained, the study focused on ultra-rare variants, never seen before and observed in only a single individual in that entire callset, that affected the same small set of genes. The joint calling methodology was essential to the success of that analysis because it enabled the authors to detect ultra-rare variants with confidence, and to do so for enough individuals to accumulate the statistical power needed to implicate the corresponding genes. Laura concluded her post with this prediction:

“Future breakthroughs in common disease research will continue to come about through the hard work of recruiting large numbers of affected participants; accurately detecting the few, tiny, rare mutations that make them different; and combining as many ultra-rare variants as we can until we can point the finger at genes harboring too many mutations across people who suffer from disease.”

 

It is with that same vision in mind that our engineering team has been working hard to scale up our joint calling capabilities.

Every jump in the scale of the studies we have supported so far has required that we re-engineer various parts of our analysis pipelines to address new barriers related to cost efficiency, computing power, runtime… or the maximum amount of data you can store in a single file on a standard hard drive. (That was a fun one to hit.)

Today, I’m excited to share the solution we developed to get to the next order of magnitude, and to make this kind of scaling accessible to a wider range of researchers and institutions.

 

Aiming for a million

A few years ago, the NIH asked our team to figure out a way to run joint calling on one million human genomes for the All of Us Research Program. For context, at the time, performing joint calling for “just” 15,000 whole genomes was a costly months-long endeavor for a full team of engineers equipped with cutting-edge tools. We knew that to scale to a million genomes, we would need to revisit the engineering design behind the GATK Joint Calling pipeline (again). We also foresaw that many downstream analysis tools would not be able to handle the resulting callset, which would contain trillions of variants. Researchers would need to be able to subset the variant calls to their samples and genomic regions of interest. 

Oh, and we needed to ensure our solution to these problems would be cost- and time-efficient, and scalable to both ends of the spectrum: it should allow us to scale down to 10 samples as easily as scaling up to a million.

So we took joint calling back to the drawing board. We started by designing a core schema for the variant data based on data access patterns for key use cases like training a filtering model, searching to identify samples with a specific variant or variation in a specific location, and extracting data —for all samples or for subsets— into Variant Call Format (VCF). We contemplated which information in the variant files was necessary for joint calling and downstream analysis, and stripped out anything extraneous. We tested many different foundational technologies for processing and storing data, including Spark, Dataflow, and custom-developed infrastructure. We ultimately chose Google BigQuery because it is easy to operate, can handle huge data sizes, makes it cheap to both store and query data, and has excellent security features.

The result of this three-year development effort: a highly scalable and cost-effective variant storage and processing solution we call the Genomic Variant Store. 

 

Introducing Joint Calling with the Genomic Variant Store

The Genomic Variant Store (GVS) uses Google BigQuery to store variant data according to the core data model. It is designed to function as a persistent data warehouse to which we can add new data over time, simply by importing gVCF files produced by the single-sample calling step of the variant discovery process outlined above. This automatically combines the per-sample variant data across samples and genomic coordinates. We can then query the GVS by sample and/or by genomic coordinates to generate subset callsets of interest in either VCF or Hail VDS file format. 
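To illustrate the kind of access pattern this enables, here is a minimal sketch of a subsetting query using the google-cloud-bigquery Python client. The project, dataset, table, and column names are purely illustrative placeholders, not the actual GVS schema, which is managed by the GVS workflows.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-gcp-project")  # placeholder project

    # Pull variant rows for a handful of samples in one genomic region
    # (table and column names are illustrative, not the real GVS schema).
    query = """
        SELECT sample_id, chrom, pos, ref, alt, genotype
        FROM `my-gcp-project.variant_store.variants`
        WHERE sample_id IN UNNEST(@samples)
          AND chrom = @chrom
          AND pos BETWEEN @start AND @end
    """
    job = client.query(
        query,
        job_config=bigquery.QueryJobConfig(query_parameters=[
            bigquery.ArrayQueryParameter("samples", "STRING", ["sample1", "sample2"]),
            bigquery.ScalarQueryParameter("chrom", "STRING", "chr20"),
            bigquery.ScalarQueryParameter("start", "INT64", 10_000_000),
            bigquery.ScalarQueryParameter("end", "INT64", 10_100_000),
        ]),
    )
    for row in job.result():
        print(row.sample_id, row.chrom, row.pos, row.ref, row.alt)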

If you are familiar with the details of the current GATK Best Practices workflow implementation, the GVS data ingestion step is analogous to combining gVCFs into a GenomicsDB data store. However GVS is much more scalable than GenomicsDB due to its use of Google BigQuery. 

The GVS also includes a built-in variant filtering model (GATK VQSR) that determines which variant calls will be considered true variants, as opposed to artifacts. The filtering model is applied when a subset of variant data is extracted to file, and it can be re-trained and improved as new data is added.

 

Overview of the operations supported by the Genomic Variant Store 

 

For convenience, we have developed utility workflows written in the Workflow Description Language (WDL) to perform the import, model training, subsetting and variant search operations. These can be used individually to grow, curate and analyze a persistent variant data store.

We have tested this on a wide range of callset sizes; there is no minimum number of samples, and we anticipate that it will scale to the planned one million genomes for the All of Us Research Program. In fact, we have already used the GVS in production to produce a joint callset from 250,000 human whole genomes for the AoU program, which we believe is the largest joint-called human whole genome callset in the world so far.  

 

Get early access to a self-contained GVS Joint Calling Pipeline

Due to the engineering challenges involved in operating at this scale, there are only a handful of genome centers around the world that are currently able to create joint callsets from tens of thousands of human whole genomes, let alone hundreds of thousands. Yet it is very likely that progress in human genetics could be substantially accelerated if this capability were made more widely accessible.

As we continue to improve the GVS — increasing scalability, adding support for new data types, improving the internal algorithms — we are looking for feedback from external groups to ensure the GVS will work well for a wide audience. To that end, we are starting an early access program for researchers who are interested in trying out the GVS approach for making joint callsets of up to 10,000 human whole genomes. 

We are currently targeting that project size because most population genetics projects today involve less extreme cohorts than the All of Us Research Program; we’re seeing many projects with whole genome sample numbers in the low thousands. 

To enable those projects to use GVS for scalable joint calling without having to manage complex infrastructure and run multiple separate operations, we developed a “one and done” workflow called the GVS Joint Calling pipeline that wraps all the necessary steps. This single self-contained workflow takes in a set of per-sample gVCFs, trains and applies the GATK VQSR filtering model, and extracts the variants into a VCF file containing the complete joint callset. 

We plan to make this pipeline publicly available in a Terra workspace, pre-configured in such a way that anyone can run it out of the box with minimal effort. In our current tests, the GVS Joint Calling pipeline set up in Terra can produce a joint callset from up to 10,000 human whole genomes in less than half a day, at a cost of $0.06 USD per genome.

 

Screenshot of the Terra workflow configuration panel for the GVS Joint Calling Pipeline

 

We invite you to apply to join the early access program by filling out a short form that will help us assess if your callset would be a good fit for the initial release. If you are selected, we will help you get started with the GVS Joint Calling pipeline in Terra, and we will be available to assist you if you run into any problems.   

Even if the Genomic Variant Store doesn’t meet your needs right now, please feel free to use the form to tell us more about your work and what features you would be interested in seeing in a future version. We look forward to hearing from you!

 

 

Run your notebooks programmatically in the cloud
October 20, 2022
https://terra.bio/run-your-notebooks-programmatically-in-the-cloud/

Verily software engineers share tips and resources for running Jupyter notebooks programmatically without writing any code.

John Bates is a software engineer at Verily. In this guest blog post, he and his fellow software engineer Nicole Deflaux share two solutions they developed for running Jupyter notebooks programmatically in support of their own analysis work. 


 

Our team makes extensive use of Jupyter Notebooks for developing new analyses, because they enable us to iterate very quickly and collaboratively in an interactive environment. 

However, we have found there are certain situations where we want to run a notebook programmatically — meaning, just launch the entire analysis with a single command, without having to manually open the notebook and run cells.

  • To run a notebook with a known, clean virtual machine configuration to confirm it has no unresolved dependencies on locally installed Python packages, R packages, or on local files.
  • To run a notebook with many different sets of parameters, all in parallel.
  • To execute a long-running notebook (e.g., taking hours or even days) on a machine separate from where you are working interactively.
  • To automate an analysis that was developed in a notebook without porting it to a workflow.

 

Fortunately, there is a command-line tool called Papermill that makes it possible to parameterize and execute Jupyter Notebooks programmatically. So all you really need to achieve these goals from within Terra is to devise a way to launch the Papermill command on a clean virtual machine.
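For example, here is a minimal sketch of what a programmatic run looks like with Papermill’s Python API; the notebook path and parameter names are placeholders, and the same thing can be done from the command line with the papermill CLI.

    import papermill as pm

    # Execute the notebook top to bottom, injecting parameters into its
    # "parameters"-tagged cell, and save the fully executed copy.
    pm.execute_notebook(
        "analysis.ipynb",            # placeholder input notebook
        "analysis_executed.ipynb",   # executed copy written here
        parameters={"cohort": "demo", "num_samples": 100},
    )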

We recently developed a pair of approaches to do exactly that, using either the Workflow execution system or the Terminal in a Cloud Environment. This has been very useful for our team, so we created a public workspace that demonstrates how you too can do this with minimal effort.

☁

The Workflow approach uses a single-task WDL script, notebook_workflow.wdl, that we wrote to serve as a lightweight wrapper for the Papermill command. You can submit this WDL through Terra’s Workflows execution interface as usual, specifying as inputs the path to the notebook file you want to run programmatically as well as the environment container to use for testing, and any number of other relevant parameters. 

 

 

The output of this workflow is a copy of the original notebook, fully executed and rendered in html, along with any files generated by the notebook execution itself.

☁

In contrast, the Terminal option uses dsub, a Google Cloud tool that was developed for submitting and running batch scripts in the cloud. The basic idea behind dsub is to emulate the experience of using high-performance computing job schedulers like Grid Engine and Slurm, which allow you to write a script and then submit it to a job scheduler from a shell prompt on your local machine. You can then disconnect from the shell, go about your business, then later come back and query the status of your job using a predefined command generated at submission time. 

You can use this tool in Terra by launching a Jupyter Cloud Environment (Python kernel), which includes a built-in Terminal app that you can fire up by clicking on its icon in the right-hand toolbar. Once you’ve installed dsub and its dependencies into your environment, you can run dsub commands to submit jobs to Google Cloud as if it were your local compute server. 

For the purpose of running notebooks programmatically, you need to run a dsub command that will in turn launch the desired Papermill command, with the appropriate inputs and environment configuration. This may sound complicated, but the actual command that you will run in the Terminal is short and straightforward: to keep things simple, we wrote a Python script called dsub_notebook.py that wraps all the functionality you need to configure, launch and monitor the Papermill job through dsub. All you need to do is adapt the command with your input notebook and any appropriate parameters, and run it in the Terminal of your Python Cloud Environment. 
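To give a rough sense of what ends up being submitted under the hood, here is a sketch of a dsub invocation that runs Papermill inside a container, written as a Python subprocess call. The project, bucket paths, container image, and parameter values are placeholders, and the exact command that dsub_notebook.py assembles may differ; see the script and the dsub documentation for details.

    import subprocess

    subprocess.run([
        "dsub",
        "--provider", "google-cls-v2",
        "--project", "my-gcp-project",             # placeholder project
        "--regions", "us-central1",
        "--logging", "gs://my-bucket/dsub-logs/",  # placeholder bucket
        "--image", "gcr.io/my-project/papermill:latest",  # image with papermill installed
        "--input", "NOTEBOOK=gs://my-bucket/notebooks/analysis.ipynb",
        "--output", "RESULT=gs://my-bucket/results/analysis_executed.ipynb",
        # dsub exposes the localized input/output paths as environment variables.
        "--command", 'papermill "${NOTEBOOK}" "${RESULT}" -p cohort demo',
    ], check=True)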

 

 

This produces the same outputs as the Workflows option: a copy of the original notebook, fully executed and rendered in html, along with any files generated by the notebook execution itself.

☁

You can find a detailed tutorial with step-by-step instructions in the public workspace that we created to demonstrate how this works in practice. The tutorial includes an example notebook parameterized with Papermill, with three choices of input datasets, as well as a setup notebook to install dsub into your environment quickly and painlessly.

We hope you will find this resource useful and would love to hear your feedback on how we could make it even better, either in the public Terra forum or privately through the helpdesk. You can also open an issue in the terra-examples repository to report a problem or discuss a technical aspect of the scripts.

 



From liquid biopsies in Ghana to African cancer genomics in the cloud
September 2, 2022
https://terra.bio/from-liquid-biopsies-in-ghana-to-african-cancer-genomics-in-the-cloud/

Guest author Sam Ahuno describes his published work on African cancer genomics and shares his vision of a cloud-powered future for computational research in Africa.

Samuel Terkper Ahuno is a student at the Tri-Institutional PhD Program in Computational Biology and Medicine, NYC. In this guest blog post, he describes his published work on African cancer genomics, which evaluated the feasibility of using liquid biopsies to detect breast cancer in a Ghanaian clinic, and shares his vision of a cloud-powered future for computational research in Africa. 


 

Cancer is becoming increasingly common in sub-Saharan Africa, with rising mortality. Current efforts to mitigate this are focused on increasing public awareness, enabling earlier diagnosis, increasing access to treatment and care, and researching the lifestyle, environmental, and genetic risk factors that might be more prevalent for African women.

One major obstacle we are facing is that the current standard of testing for most cancers, including breast cancer, is the traditional biopsy: extracting a small piece of the tumor surgically using a needle. As a result, many cancers are diagnosed only after considerable growth has occurred. Therefore, technologies for earlier detection could make a big difference to patient outcomes. Additionally, less invasive procedures would be better accepted by the population, and could enable repeated sampling and improved treatment monitoring. 

 

Using liquid biopsies to detect breast cancer in Ghanaian patients

Liquid biopsy techniques address these challenges by using readily available biological fluids such as urine, blood, or saliva for diagnosis. These fluids contain circulating or “cell-free” DNA (cfDNA), some of which may be coming from tumor cells and are then called circulating tumor DNA. A liquid biopsy consists of sampling the relevant fluids and testing for the presence of circulating tumor DNA (ctDNA) or other such markers of cancer. 

Our research group recently tested whether liquid biopsies could be used to detect breast cancer in Ghana, as part of the Ghana Breast Health Study. A small amount of blood was collected from each patient who was recruited into the study at one of three hospitals, and DNA extracted from the blood was then sequenced. This enabled us to estimate how much of the cfDNA was shed by the patient’s tumor into the blood, and what sort of DNA damage came from the tumor.

We found encouraging results, suggesting that liquid biopsies could be a viable way to detect cancer markers, such as the copy number alteration (CNA) status of many selected breast cancer genes, in Ghanaian patients. Copy number alteration is a type of cancer-associated mutation in which one or more segments of the DNA are either lost or duplicated. Yet adopting this diagnostic approach would require developing genomic and bioinformatics capacity within the country, while also strengthening basic health care services to make sure women can gain access to the treatment they need. It would also require pursuing this research further, to ultimately empower clinics to offer these tests in a sustainable and cost-effective way.

 

The computational requirements of cancer genomics 

Going from raw cfDNA sequence data to biological insights about each patient’s tumor involves complex bioinformatics procedures that we can divide into two main stages of analysis, with very different computational requirements.

The first phase consists of pre-processing the data to ensure we have high quality information in a suitable format for identifying tumor DNA. In practice, this involves mapping each individual sequence read to a standard genome reference, and applying stringent quality control measures (see GATK Best Practices for more details). This is the most computationally intensive step of the analysis pipeline; for whole genomes with billions of reads, you can imagine how complicated it can get.

The second phase consists of estimating what fraction of the circulating DNA is likely to have originated from a tumor, and identifying CNAs (see ichorCNA documentation for more details). 

 

A hybrid approach to achieve scalability without changing everything  

Given the computationally intensive nature of the pre-processing phase, we performed that part of the work using cloud-optimized workflows that we ran on the Terra platform. This allowed us to scale execution very easily and not have to worry about managing high-performance computing resources directly. 

For the second phase of the work, which did not pose any scaling challenges, we chose to use our existing tools on the Mount Sinai Hospital servers. It was easy enough to download the pre-processed outputs from Terra onto our local filesystem. 

This hybrid approach allowed us to take advantage of Terra’s scalable batch processing capabilities without having to change our familiar environment for the more exploratory part of the work. If we were to do this again with a larger dataset, downloading the pre-processed outputs would probably be less feasible, and it might be worth it for us to look into Terra’s interactive cloud environments for doing the rest of the work on the platform as well.

 

The bigger picture of cloud computing in Africa

The study I presented here was the result of an international collaboration between multiple research and clinical institutions in Ghana, as well as in Canada, the UK, and the United States. Strengthening global partnerships plays an important role in, and is part of, the United Nations Sustainable Development Goals for economic development. Yet for approaches like liquid biopsies to become the standard of care in Ghana and many other African countries, we must ultimately develop the bioinformatics capacity to perform the relevant research and testing autonomously in-country.

One of the major challenges for bioinformatics and computational biology in many African countries is limited infrastructure, particularly computing resources, even as computing continues to become cheaper and more efficient.

Cloud-powered platforms like Terra could play a huge role in increasing access to computing resources to enable scalable genomics research in Africa, by Africans. 

In addition to providing access to powerful hardware resources, such platforms also make it possible to leverage publicly available workflows and pre-installed software tools and environments. This helps newcomers overcome initial learning curves and empowers seasoned researchers to leverage best in class tooling without having to spend time installing anything. Once familiar with the infrastructure, they can also develop their own workflows and tools to innovate in the pursuit of their preferred research question. 

Organizations such as H3Africa have over the years been building bioinformatics capacity in affiliated institutions in the region. Building on that work, the DSI-Africa consortium recently launched the eLwazi platform, an African-led open data science project powered by Terra. 

However, moving forward it will be great to have data centers within Africa to enable regional processing, storage, and control of genomic data, for privacy and ethical reasons.

There are still many practical, ethical, and technological challenges to implementing genomic technologies in Africa, yet it is encouraging to see such progress toward a future where African countries such as Ghana can access the resources they need to chart their own course.

 


 

Acknowledgements

I would like to thank Paz Polak, PhD; Jonine Figueroa, PhD; Geraldine Van der Auwera, PhD, and Kofi Johnson, PhD, for helpful comments.

 

Lab mysteries: Identifying bacterial contamination in WGS data
August 30, 2022
https://terra.bio/lab-mysteries-identifying-bacterial-contamination-in-wgs-data/

Guest author Yossi Farjoun recounts the story of how he solved a mysterious case of low sequencing alignment rates in a sequencing project.

Yossi Farjoun is a Senior Biostatistician at the Richards Lab of the Lady Davis Institute in Montreal. He previously worked in the Data Sciences Platform at the Broad Institute, where he led a “special projects” team that was occasionally tapped to help the Broad’s Genomics Platform lab diagnose strange issues. In this guest blog post, Yossi shares a particularly intriguing story about how he was able to quickly identify the cause of mysteriously low sequencing alignment rates in a sequencing project, with step-by-step descriptions for recapitulating the work in Terra.


 

It all started when a project manager in the sequencing lab told me that they were seeing low alignment rates on multiple samples from the same project, and asked if I could help. We would normally see alignment rates (as reported from Picard’s CollectAlignmentSummaryMetrics) above 99%, but this cohort of samples was producing rates between 60% and 95%, requiring the lab to sequence more material in order to reach the contractual target coverage for the project (which doesn’t include unaligned reads, of course).

I suspected bacterial contamination since, by manual inspection, the unaligned reads did not seem to be artifactual. For example, they all had pretty random-seeming sequences, as opposed to being all the same. To investigate the problem, I decided to use GATK PathSeq, a computational pathogen discovery pipeline that is included in the Genome Analysis Toolkit (GATK) for detecting microbial organisms from short-read deep sequencing of a host organism. 

In this blog post, I’ll walk you through how I used PathSeq to quickly identify that the unaligned reads all belonged to a single bacterial genus, Burkholderia. The two main steps were running the PathSeq pipeline then analyzing the results in Python in a Jupyter Notebook. At the time I did the work, I used FireCloud — an early version of the Terra platform that was primarily intended for cancer research — but I’ve updated the walkthrough to refer to Terra for simplicity since it’s the same platform under the hood, and everything I describe is entirely runnable in Terra today. 

 

Why PathSeq?

The PathSeq method was originally developed to discover and characterize pathogens infecting a host organism (such as a human) in an unbiased manner, taking advantage of the advances made in whole genome sequencing (Kostic et al., 2011). Instead of looking for specific pathogens based on known sequence, as was traditionally done before, investigators could now sequence a biological sample from the host, subtract all reads that align to the host genome, and identify pathogens by running searches using the remaining non-host reads against a database of microbial references. The search results can then be used to produce a table of the microbial organisms detected in the host sample. 

 

It’s interesting to note that in that use case for PathSeq, the human genome sequence was a by-product of the process, and was promptly thrown out in order to focus on the pathogens. In my use case, what the lab ultimately wanted was to produce high-quality human genome data, not study pathogens! Yet, to help them do that, I needed to figure out two things: (1) was the issue indeed due to microbial contamination, and if so (2) what kind of microbes were they — which would help identify where the contamination was coming from. PathSeq promised to give me answers to both questions.

As it happens, one of my colleagues at the Broad Institute, Mark Walker, had implemented a new version of the PathSeq method in GATK that improves on the previous one by incorporating faster computational approaches, broadening the use cases that the pipeline can support, and leveraging the Spark framework available in GATK4 to enable parallelized data processing (Walker et al., 2018). 

And so I set out to apply Mark’s GATK PathSeq pipeline to the problem dataset.

 

Running PathSeq in Terra

Conveniently, the GATK team make a lot of their pipelines available in public workspaces in Terra, where the workflows are already configured to run according to the relevant best practices, with example data to show how the inputs need to be set up, how the files should be formatted and so on. 

So I started by cloning the PathSeq workspace, familiarizing myself with how the pipeline is set up, and referring to the paper and the accompanying how-to tutorial to understand specific details. 

I found that to run the pipeline, I was going to need to go through three main steps: (1) set up the data to use Terra’s system of data tables for data management, (2) tweak the preloaded workflow configuration to run on my data, and (3) launch the pipeline on the data.

 

1. Setting up the data

Terra uses a system of data tables that helps you manage data and outputs in a scalable way. You can learn more about that in this introduction to Terra data tables; I’m just going to describe how to use the system to replicate what I did.

The workspace has some example data tables already set up, but you don’t need them (they are for a somewhat different use case as we’ll see shortly) so you can start by deleting them to reduce clutter.

You then create a load file (basically a data manifest) listing the sample IDs, with relevant sample metadata and file paths in a “sample” table. The load file is in “tab-separated values” format, aka TSV.

Here is a dummy example to show what the load file should look like (with simplified paths):

Sample TSV

    entity:sample_id percent_aligned WGS_bam_path   
    sample1 68 gs://bucket/directory/file1.unmapped.bam   
    sample2 80 gs://bucket/directory/file2.unmapped.bam   
    sample3 76 gs://bucket/directory/file3.unmapped.bam    
    sample4 99 gs://bucket/directory/file4.unmapped.bam   

The column labeled percent_aligned contains the percent of aligned reads as reported by the preceding data processing pipeline. This column isn’t needed for running the PathSeq pipeline but will be used in the Python Notebook.

At the time I did this work, you also had to define a sample set (the list of samples you want to run a pipeline on for batch processing) using another load file, but nowadays you can do that on the fly in Terra.

Once your load file is ready, you upload it in the DATA tab of your workspace, and the system turns it into a table as shown below. 

 

 

Note that my data files were already on the cloud, so I just had to list their locations. If you have data elsewhere you need to put them in cloud storage; either in a bucket you manage yourself (outside of Terra) or in the storage bucket attached to your Terra workspace. 

 

2. Tweaking the workflow configuration

The PathSeq workspace actually contains multiple accessory workflows in addition to the main PathSeq workflow, all of which you can find in the aptly-named “WORKFLOWS” tab. 

To replicate my analysis you just need the one called “03-pathseq-pipeline-WGS-meats”. 

That name? Funny thing, the workspace was made to showcase a particular use case for PathSeq that’s a little different than mine: it’s an analysis that shows how to detect contaminating DNA from various animals that humans consume as food. Because when people give a saliva sample or buccal (cheek) swab for sequencing, sometimes the sequencing picks up what they had for lunch! If there’s enough animal DNA in there that is close enough to human DNA to get mapped to the human genome reference, it can introduce a lot of noise into downstream analyses. Which makes you wonder: Do vegetarians get cleaner sequencing results? Perhaps the takeaway could be that we should all brush our teeth before taking a saliva/buccal swab for sequencing….

Anyway, when you click on the workflow you want, Terra takes you to the corresponding configuration page, which displays summary information and some selectors for determining what data the workflow will run on.

 

Since you have your samples neatly organized in a table, you can simply point the system to the sample table and select all the samples you want to process, using the selectors shown above. 

Once that’s done, scroll down to the INPUTS form lower down on the same page (shown in part below). This form allows you to specify inputs and parameters for each of the workflow variables.

 

Here, you need to check the name of the data table column that the main input variable (“input_bam”) points to. In the example I gave above, I named the column with the bam file paths differently compared to the example data. Theirs was called “contaminated_human_bam”, whereas mine is called “WGS_bam_path”. So I would change “this.contaminated_human_bam” to “this.WGS_bam_path”.

 

If you’re looking to identify microbial contaminants, like I was, you also have to change the reference resources that are set up in the INPUTS form. The default configuration points to animal reference materials, which are set up as key-value pairs under “Workspace Data” in the DATA tab. You can either set up your own key-value pairs to point to the relevant resources, or just replace the “workspace.<key>” entries with the relevant cloud storage paths. You can find paths to microbial resource files in the PathSeq section of the GATK Best Practices bucket, or you can generate your own based on more recent data. If you get confused as to which file to plug into which input variable, you can ask for help in the GATK forum.

Make sure to hit the “SAVE” button when you’re done editing the INPUTS form (and frankly, consider doing it several times while you’re working, if you have a lot to tweak).

 

3. Launching the workflow 

Step 2 was a lot but step 3 is the easy part. Once you have the input form filled out, all you need to do is hit the “Run Analysis” button, which brings up a confirmation dialog that restates how many workflows will be launched  — one per sample in your sample list, if everything is set up correctly. This is a good time to check that you’re not about to launch ten thousand workflows by accident. 

For what it’s worth, when I’m trying a new pipeline, I like to run a single sample first and wait until it completes before I run on the rest of the samples. That allows me to assess whether everything is behaving as expected, how long it takes to run and how much money it costs. That way I can estimate how much it will cost me to run the full cohort before I kick off anything big. 

If the number looks reasonable, hit “Launch” and go take a break while the system starts executing the work behind the scenes. The system will automatically open the monitoring page where you can find the list of your workflow executions organized by submission batch. 

 

Results of the PathSeq pipeline: Yep, there’s bacteria in there

So, where are the results? This is rather cool, actually. When you use the data tables system to manage data inputs, the results output by the workflow will get listed in a column (specified in the OUTPUTS form) in the relevant data table. That means you don’t have to go digging in output directories to find your output files (which are written to the workspace’s dedicated cloud storage bucket), and you’re able to launch other workflows on those new files without having to add them manually to a table. 

The main results that are produced by the GATK PathSeq workflow are text files containing microbial identifiers, abundance scores and other related metrics. They clearly indicated the presence of microbial contaminants, but were not very user-friendly to interpret in their raw state, so for my investigation, I needed to analyze them further. 

 

Characterizing the contamination in a Jupyter Notebook

My main objective was to see what the top contaminating genus was, and how it related to the “Aligned” value that we had for each sample. I decided to do this in Python in a Jupyter Notebook, which at the time was a beta feature in FireCloud. Now it’s a fully supported feature in Terra; you can learn more about it by following the Notebooks Quickstart tutorial.

Have a look at the HTML-rendered version of the completed Jupyter Notebook if you’d like to see the exact code I used and examine my findings for yourself.

First, I used Python functions to retrieve the metrics produced by the PathSeq workflow, extracting the vital information from the metrics files and creating a sorted table with the extracted data to show a clear picture of what PathSeq found. The first major finding was that the top microbial genus present overall was a genus of proteobacteria called Burkholderia.
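As a rough sketch of that first step, the snippet below loads a PathSeq scores file with pandas and ranks genus-level hits by abundance score. The column names are assumptions about the scores file layout; the completed notebook linked above shows the code I actually ran.

    import pandas as pd

    # Load one sample's PathSeq scores file (tab-delimited).
    scores = pd.read_csv("sample1.pathseq_scores.txt", sep="\t")

    # Keep genus-level rows and rank them by abundance score
    # ("type", "name", and "score" column names are assumptions).
    genera = (
        scores[scores["type"] == "genus"]
        .sort_values("score", ascending=False)
        .reset_index(drop=True)
    )
    print(genera[["name", "score"]].head(10))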

I then generated a scatter plot using the percent of alignment from the original reads on the X axis, and percent of these samples having reads aligned to the top contaminating genus on the Y axis, as shown below.

 

 

The downward slope you see in this plot shows that the samples with fewer reads aligned to the human reference had more reads aligned to the top microbial genus identified by PathSeq. 
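Here is a rough sketch of how a plot like this can be put together; the values below reuse the dummy sample table from earlier in this post, with made-up contamination percentages, not real results.

    import matplotlib.pyplot as plt
    import pandas as pd

    # Illustrative per-sample summary: percent_aligned comes from the sample
    # table, pct_top_genus from the PathSeq scores (values here are made up).
    per_sample = pd.DataFrame({
        "sample_id": ["sample1", "sample2", "sample3", "sample4"],
        "percent_aligned": [68, 80, 76, 99],
        "pct_top_genus": [25.0, 12.0, 16.0, 0.1],
    })

    plt.scatter(per_sample["percent_aligned"], per_sample["pct_top_genus"])
    plt.xlabel("% reads aligned to human reference")
    plt.ylabel("% reads assigned to top contaminating genus")
    plt.show()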

Separately, I also looked up how many of the samples were contaminated by the top-scoring genus, and found that all 78 problem samples were affected by Burkholderia contamination.

 

Epilogue

With these results in hand, I went back to the collaborator to discuss how PathSeq identified Burkholderia as the likely contaminant. Given other key details, such as the location where the samples were collected (which had not been conveyed to me in advance), it made sense to the project manager that this proteobacterium would be detected as the contaminant. As a result, the collaborator has since identified some problems in their sample-collection protocols and has taken action to improve them.

I’ve encountered all sorts of weird quality issues in sequencing data over the years, and it can be quite fun to tackle them as a mystery to solve through careful analysis. I hope you found this story interesting, and possibly even helpful for your own work. 

Feel free to try out any of the resources I mentioned for yourself, especially the public PathSeq workspace. The “meat contamination” use case is a rather interesting story in itself, and that project led to the creation of workflows to generate synthetic hybrid meat products for testing purposes (strictly virtual!) which could be reused for all sorts of interesting purposes. 

You might also be interested in another Terra blog post involving PathSeq being used to identify viral insertions in the human genome. That one has IGV screenshots, which is always a sign of a good time. 

So have fun and if you end up doing something with any of this, I would love to hear about it. 

 

This story previously appeared in an earlier form on the GATK blog.

 


Resources

GATK PathSeq workspace

GATK PathSeq tutorial

GATK blog about Terra workspaces

GATK Best Practices resources

 

PathSeq paper citation

Mark A Walker, Chandra Sekhar Pedamallu, Akinyemi I Ojesina, Susan Bullman, Ted Sharpe, Christopher W Whelan, Matthew Meyerson (2018) GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts, Bioinformatics, bty501, https://doi.org/10.1093/bioinformatics/bty501 

 

Harness machine learning to clean up your scRNA-seq data
August 9, 2022
https://terra.bio/harness-machine-learning-to-clean-up-your-scrna-seq-data/

Guest author Stephen Fleming introduces CellBender, a software package for eliminating technical artifacts from high-throughput scRNA-seq and other multi-omics data.

Stephen Fleming is a Machine Learning Scientist in the Methods group of the Data Sciences Platform at the Broad Institute. As part of Mehrtash Babadi’s team, he works to develop analysis tools for single-cell RNA sequencing (scRNA-seq) data, and he analyzes scRNA-seq datasets in Patrick Ellinor’s group as part of the Precision Cardiology Lab. In this guest blog post, Stephen introduces CellBender, a software package for eliminating technical artifacts from high-throughput scRNA-seq and other multi-omics data. The CellBender manuscript preprint is available on bioRxiv.


 

Microfluidics-based technologies for probing the contents of individual cells have been revolutionizing how we approach cell biology. In less than a decade, single-cell RNA sequencing has gone from being a cool new concept with a lot of potential, to a mainstream technique that has already enabled important discoveries (e.g. Montoro et al., 2018; Delorey et al., 2021), and is now inspiring spin-offs like single-cell proteomics and spatial transcriptomics.

Yet like any data generation technology that has come before, the process of turning biological material into digital signal retains some inherent messiness. On the wet-lab side of things, even the most stringent protocol cannot completely eliminate enzymatic side-processes that produce spurious library fragments, contamination by exogenous or endogenous ambient transcripts, potential impurity of barcode beads, and barcode swapping during amplification and/or sequencing.

 

(a) Cell dissociation and nuclei extraction lead to the presence of cell-free RNA in solution. (b) Schematic diagram of the proposed source of ambient RNA background counts. Cell-free “ambient” RNAs (black lines) and other cellular debris are present in the cell-containing solution, and these RNAs are packaged up into the same droplet as a cell (red), or into an otherwise empty droplet that contains only a barcoded capture oligo bead (green hexagon).

 

In the generated data, these issues show up as systematic background noise. For example, a highly-expressed marker gene might look as though it is expressed in all cell types at a low level, rather than being specific to a certain cell type. This sort of background noise  masks the true cell-type-specificity of gene expression —and manifests incorrectly as spurious differential expression signals— in downstream analyses. Systematic background noise is dataset-specific, and can cause a batch effect that hinders dataset integration and atlasing efforts (see Supplementary Fig. S1 in Eraslan et al., 2022).

 

Identifying and removing background noise with machine learning

To address this pervasive problem, we first investigated the phenomenology of background RNA counts in a variety of scRNA-seq and snRNA-seq experiments, then developed a modeling approach that uses machine learning to identify and subtract those background counts, resulting in a cleaner and more reliable dataset.

 

Removal of background RNA from a published human heart snRNA-seq atlas from Chaffin et al., 2022. a) Dotplot showing several highly-expressed genes in the raw dataset; b) Same dotplot after noise removal with CellBender.

 

We implemented this approach in Python, using the popular PyTorch framework and the probabilistic programming language Pyro. Our software package, called CellBender, is fully open source and available on GitHub.

When we initially reported our results in a 2019 preprint, CellBender was the only truly unsupervised option for background noise removal in single-cell datasets. As a bonus, in addition to producing a clean dataset, CellBender computes several other quantities of interest, including the probability that each droplet in the experiment contains a cell (since lots of droplets in these experiments are empty, and it can be hard to tell). We found that our approach outperformed previously existing tools for cell-calling, including CellRanger and EmptyDrops. 

Since the 2019 preprint, CellBender has been applied to a range of real-world studies by our collaborators as well as independent groups. Some notable works include studies of the human heart in health and disease (Tucker et al., 2020; Chaffin et al., 2022), the human intestine (Holloway et al., 2021), human and mouse adipocytes (Sun et al., 2020; Dong et al., 2022), human tissues infected by SARS-CoV-2 (Xu et al., 2020; Delorey et al., 2021; Ziegler et al., 2021; Melms et al., 2021), and a large human cross-tissue atlas (Eraslan et al., 2022).

All this road-testing has led to further improvements and a new updated preprint that we are submitting for publication. In the new preprint, we additionally demonstrate that our approach outperforms a 2020 method for noise removal called DecontX, and can be applied to other omics data types, such as the antibody capture features that are measured together with RNA in a CITE-seq experiment. 

 

Removal of background counts from a multimodal CITE-seq dataset, showing gene expression in blue and antibody capture in red. (a) Raw data from a public 10x Genomics PBMC dataset shows that most antibodies are measured in all cell types. (b) After CellBender, antibody counts become much more cell-type specific.

 

Get started with CellBender today

The CellBender background removal tool is designed as a drop-in pre-processing step that can be run as a standalone operation or integrated into an existing workflow.
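As a rough sketch of a standalone run, the snippet below shells out to the CellBender command-line tool from Python. File names and cell-count estimates are placeholders, and the exact set of options you need depends on your experiment; the CellBender documentation is the authoritative reference for the available flags.

    import subprocess

    subprocess.run([
        "cellbender", "remove-background",
        "--input", "raw_feature_bc_matrix.h5",   # raw (unfiltered) counts matrix
        "--output", "sample1_cellbender.h5",     # denoised output file
        "--expected-cells", "5000",              # rough estimate for this sample
        "--total-droplets-included", "20000",
        "--cuda",                                # use GPU acceleration if available
    ], check=True)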

For convenience, we provide a WDL workflow that is preconfigured to run the background removal tool with GPU acceleration (Tesla K80) on a virtual machine in Google Cloud. We also created a public Terra workspace showcasing the workflow along with some test data, so you can view configuration details, browse the tool’s outputs, and even try it out yourself without having to install anything on your own computer. 

If you’re not familiar with running workflows in Terra, check out the Workflows Quickstart and its accompanying video.

Among its outputs, the CellBender background removal tool produces an HTML report that includes quality metrics, as well as automatically generated commentary that is intended to guide the interpretation of those metrics. You can see an example of this in the public workspace as described below.

 

 

To view the report for a CellBender background removal run, navigate to the workspace’s DATA tab, click on “sample_set” in the left-hand TABLES menu, and then scroll the table view horizontally until you see the “html_report_array”. Click on one of the filenames in that column and hit the download button; this should open the html file in your browser. Scroll through the report to view metrics, plots and accompanying commentary.

 

One of several sections in the CellBender background removal quality report, consisting of a plot, summary statistics, and a statement to guide interpretation.

 

We encourage you to try this out for yourself and let us know how it goes. If you experience any issues getting started with Terra, please reach out to the Terra helpdesk. If you run into any problems with CellBender itself, feel free to open an issue in the CellBender repository on GitHub.

 


Resources

CellBender repository on GitHub

WDL workflow

Public Terra workspace

 

 

Using the cloud to support alignment between exploratory research and the rights of clinical study participants
https://terra.bio/exploratory-research-and-the-rights-of-clinical-study-participants/
Thu, 07 Jul 2022 16:10:27 +0000
Learn how Terra is supporting a novel approach to managing study participants' data access and control in industry-sponsored clinical studies.

Laury Mignon is Executive Director of Translational Medicine at Ionis Pharmaceuticals, where she is responsible for improving the probability of technical success of Ionis’ preclinical assets as they move from “bench to bedside”. In this guest blog post, Laury presents a novel approach to study participants’ data access and control in industry-sponsored clinical studies.


 

Whole-genome sequencing (WGS) is increasingly used in human research, including clinical trials, and the resulting data hold a lot of potential value for research beyond the immediate purpose for which they are collected. Allowing pharmaceutical companies to use WGS data from clinical trials for exploratory research could unlock substantial benefits for patients, as well as for study participants who could receive actionable incidental findings. However, this raises important questions around participant consent, disclosure of incidental findings, and the rights of participants to withdraw their data from further study.

To address these questions, we developed the Exploratory Genetic Research Project (EGRP), a novel framework that aims to provide an umbrella protocol to collect genetic material in all of a sponsor’s clinical studies, giving consenting individuals the right to access and control access to their genetic data while enabling unspecified or exploratory future research. We recently published a manuscript describing the full EGRP protocol, as well as the detailed reasoning behind key design decisions we made, and we are currently working with Color Genomics and the Broad Institute to test our very first implementation of this protocol. 

 

A novel ‘social contract’ – An attempt to harmonize a sponsor’s exploratory research with a clinical study participant’s data rights

By Laurence Mignon, Kim Doan, Michael Murphy, Lauren Elder, Chris Yun, Jeff Milton, Shruti Sasaki, Christopher E. Hart, Dante Montenegro, Nickolas Allen, Dany Matar, Danielle Ciofani, Frank Rigo, and Leonardo Sahelijo (2022)
In Contemporary Clinical Trials, Volume 119, https://doi.org/10.1016/j.cct.2022.106819 

 

I hope you will read the paper to get the full picture of this innovative framework. Here, I wanted to highlight our decision to use a secure cloud platform — specifically Terra — to store the WGS data and make it available for analysis in a way that protects the privacy and autonomy of study participants. 

 

Terra as the cloud-based Sandbox

Our intention was to prioritize participants’ control over the use of their genetic data, while enabling proprietary future research on the clinical and genetic data. For that reason, the EGRP process stipulates that the pharmaceutical company that sponsors the study is not permitted to download participants’ individual WGS data onto its own servers. Instead, the WGS data must remain in the custody of an independent partner, the data host, who performs the genome sequencing and makes the data available for analysis through a secure cloud platform.

Additionally, samples are de-identified prior to sequencing through the use of barcodes managed by a third partner, the “honest broker”, who is responsible for interfacing with participants and managing consents, barcodes and return of results.

 

Flow of information between the EGRP partners, with Ionis as the sponsor (pharmaceutical company running clinical trials), Color Genomics as the honest broker (interfacing with participants, managing consents, barcodes and return of results) and the Broad Institute as the data host (performing WGS and providing secure access to data in Terra). Modified from Mignon et al., 2022.

 

This setup makes it possible to ensure that data can be promptly removed from the system if a participant decides to withdraw their data. All the participant needs to do is notify the honest broker (Color Genomics), who will then issue a withdrawal order for the corresponding barcode to the data host (Broad Institute). By excluding the sponsor from the removal process, we eliminate potential conflicts of interest and increase the amount of control that study participants can wield over their genetic data.
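To make the division of responsibilities concrete, here is a purely conceptual sketch, not the actual EGRP implementation, of why routing withdrawals through the honest broker works: only the broker holds the mapping from participant identity to barcode, the data host stores data keyed by barcode alone, and the sponsor simply has no role in the deletion path. All class names and identifiers below are invented for illustration.

# Conceptual illustration only; not part of the EGRP software or Terra.
class HonestBroker:
    """Holds the only mapping from participant identity to study barcode."""
    def __init__(self):
        self._barcode_by_participant = {}

    def enroll(self, participant_id, barcode):
        self._barcode_by_participant[participant_id] = barcode

    def request_withdrawal(self, participant_id, data_host):
        # The broker translates an identity into a barcode; the data host
        # never sees the identity, and the sponsor is not part of this exchange.
        barcode = self._barcode_by_participant[participant_id]
        data_host.delete_by_barcode(barcode)


class DataHost:
    """Stores de-identified WGS data keyed only by barcode (e.g., in Terra)."""
    def __init__(self):
        self._data_by_barcode = {}

    def store(self, barcode, data_uri):
        self._data_by_barcode[barcode] = data_uri

    def delete_by_barcode(self, barcode):
        self._data_by_barcode.pop(barcode, None)


broker, host = HonestBroker(), DataHost()
broker.enroll("participant-001", "BC-12345")
host.store("BC-12345", "gs://example-bucket/BC-12345.cram")
broker.request_withdrawal("participant-001", host)  # the sponsor never touches this path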

In addition, using Terra offers the opportunity to use a large set of cloud computing tools that are readily available in the platform and, in many cases, have been optimized for genomic analysis at scale. This includes algorithms and pipelines created by other Terra users, by data engineers supporting large projects such as the All of Us Research Program and the Human Cell Atlas, and by the wider bioinformatics community. We view this as a promising path toward a faster and more standardized way of performing genetic analyses, as well as a fairer method of developing and sharing computational tools across private and public industries.

 

As we move forward with our first real-world test of the EGRP protocol, we are excited to have defined a process that serves the interests of individual study participants and pharmaceutical sponsors, and we are hopeful that it will provide a blueprint for future work by other clinical trial sponsors as well.

 

 

 

Schizophrenia advances demonstrate value of joint calling methods
https://terra.bio/schizophrenia-advances-demonstrate-value-of-joint-calling-methods/
Wed, 29 Jun 2022 18:10:53 +0000
Computational biologist and longtime GATK team member Laura Gauthier provides an inside perspective on the methodologies underpinning recent discoveries in common disease research.

Laura Gauthier is the Director of Germline Computational Methods in the Data Sciences Platform at the Broad Institute. As a computational biologist and longtime GATK team member, she has both personally contributed to and overseen the development of key algorithms and tools for genomic analysis. In this guest blog post, she provides an inside perspective on the methodologies underpinning recent discoveries in common disease research.


 

We recently witnessed what’s been called a “watershed moment” in schizophrenia research, marked by the back-to-back publication of two major studies that reveal genes and genome regions that influence schizophrenia risk, as summarized in a Broad Institute blog post. One of the studies, a large meta-analysis done by computational biologist TJ Singh and colleagues, used a combination of several exome cohorts totaling about 75,000 samples to implicate ten new genes in the development of schizophrenia.

This study was a departure from many classic genome-wide association analyses of the past, which typically compared the frequencies of individual variants between case and control groups. The variants in question were “common” (usually present in more than 1% of the population), but each had only a very small effect on its own. In contrast, the new study focused on ultra-rare variants, each predicted to have a large biological effect on its own, but most of which were present in only one person in the study cohort. By aggregating these ultra-rare variants per gene to improve statistical power and comparing the number of deleterious mutations across whole genes, the authors were able to highlight several new genes with big implications for schizophrenia that had not previously been found using the common variant approach.
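To make the aggregation idea concrete, here is a toy sketch of a gene-level burden comparison. It is not the analysis used in the study, which relies on far more sophisticated statistical models; the gene names, carrier counts, and cohort sizes are invented for illustration.

from scipy.stats import fisher_exact

n_cases, n_controls = 24_000, 50_000  # hypothetical cohort sizes

# Hypothetical counts of case/control individuals carrying an ultra-rare,
# predicted-deleterious variant anywhere in a given gene.
carriers_by_gene = {
    "GENE_A": (32, 11),
    "GENE_B": (5, 9),
}

for gene, (case_carriers, control_carriers) in carriers_by_gene.items():
    # Compare the burden of qualifying variants in cases versus controls.
    table = [
        [case_carriers, n_cases - case_carriers],
        [control_carriers, n_controls - control_carriers],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(f"{gene}: odds ratio = {odds_ratio:.1f}, p = {p_value:.2e}")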

This discovery would not have been possible without the 75,000-sample callset, which, at the time these data were generated, was pushing the boundaries of what was computationally possible — though since then, gnomAD v3 has doubled that number with a jaw-dropping 150,000 samples. It was also exactly what we had in mind as we worked to improve the accuracy and scale of the GATK joint calling pipeline.

 

Different paths to finding variants that cause disease 

The most intuitive model of genetic disease is the one we remember from high school biology: an unfortunate mutation in a single, important gene leads to a dysfunctional protein that impacts the affected individual’s biology. One example that might come to mind is sickle cell anemia, in which a mutation in HBB, which encodes a component of hemoglobin in red blood cells, causes the cells to deform, break apart, and die, preventing them from effectively carrying oxygen and leading to painful blood vessel blockages.

For many rare diseases where the defective gene is not yet known, we can compare the genomes of family members who are affected and unaffected by the disease, and identify one or two variants (sometimes called mutations) that have a very large “effect size”, i.e. that cause most if not all of the symptoms that characterize the disease.

In contrast, schizophrenia and most psychiatric disorders fall under the category of “common diseases”, which are typically caused by a constellation of many variants, each with a much smaller effect size, acting in combination. That requires a different approach to identifying contributing variants: instead of studying a small set of family members, we compare large cohorts of cases (affected individuals) and controls (unaffected individuals) in studies that typically include thousands of participants.

The challenge of common disease genomic research is that population geneticists have observed an inverse relationship between the prevalence and severity of variants involved in common diseases: the more common the variant, the smaller its effect size; and conversely, variants with large effect sizes are rare. So while we can “easily” find common variants that explain a tiny portion of the overall condition, we need to recruit a huge number of patients to be able to find those rarer variants that play a bigger role.

 

Diagram illustrating the inverse relationship between the prevalence and effect size of variants involved in common diseases. Derived from Manolio et al., 2009.
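As a back-of-envelope illustration of why rare variants demand very large cohorts, consider how many carriers of an ultra-rare variant we would even expect to observe at different cohort sizes. The allele frequency and cohort sizes below are made up for illustration.

def expected_carriers(allele_frequency, n_individuals):
    # For a rare variant, essentially all carriers are heterozygous, so the
    # expected number of carriers is roughly 2 * N * allele frequency.
    return 2 * n_individuals * allele_frequency

for n in (1_000, 10_000, 75_000):
    print(f"{n:>6} samples: ~{expected_carriers(1e-5, n):.2f} expected carriers")

# Even at 75,000 samples, a variant this rare is expected in only one or two
# people, which is why per-variant tests are underpowered and why evidence is
# instead aggregated across many such variants within each gene.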

 

That is what makes this paper so groundbreaking: the study that revealed these new genes hinges on ultra-rare variants. The investigators had to accurately detect variants that have never been seen before and are observed in only one individual. And they had to do it across enough individuals to accumulate the statistical power needed to implicate the corresponding genes.

 

Powering discovery through accuracy at scale

Finding ultra-rare variants is hard, mainly because when something is seen only once, it’s difficult to be sure it’s real and not just an artifact of the data generation process. In this study, the investigators used GATK tools to consider evidence from the entire cohort and evaluate individual variant calls with higher accuracy. Specifically, they used our GATK/Picard single-sample exome analysis pipeline to process data from each sample in a consistent way, and our GATK joint calling pipeline that can scale to tens of thousands of samples.

 

Diagram illustrating the overall data generation and analysis process involving the single sample and joint calling pipelines. See Resources below for links to relevant code and documentation.
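For readers who want a feel for the mechanics, the following is a heavily simplified sketch of the GVCF-based joint calling steps, invoking standard GATK commands from Python. It is not the production WARP pipeline linked in the Resources section; the reference, interval list, and file paths are placeholders, and downstream filtering is omitted.

import subprocess

samples = ["sampleA", "sampleB"]  # in practice: tens of thousands of exomes

# 1. Per-sample calling in GVCF mode, so every site retains evidence that can
#    be re-evaluated later in the context of the whole cohort.
for s in samples:
    subprocess.run(
        ["gatk", "HaplotypeCaller",
         "-R", "ref.fasta",
         "-I", f"{s}.bam",
         "-O", f"{s}.g.vcf.gz",
         "-ERC", "GVCF"],
        check=True,
    )

# 2. Consolidate the per-sample GVCFs into a GenomicsDB workspace.
import_args = ["gatk", "GenomicsDBImport",
               "--genomicsdb-workspace-path", "cohort_db",
               "-L", "exome_targets.interval_list"]
for s in samples:
    import_args += ["-V", f"{s}.g.vcf.gz"]
subprocess.run(import_args, check=True)

# 3. Joint genotyping across the cohort, which is what lets a singleton call in
#    one sample be evaluated against evidence from everyone else.
subprocess.run(
    ["gatk", "GenotypeGVCFs",
     "-R", "ref.fasta",
     "-V", "gendb://cohort_db",
     "-O", "cohort.vcf.gz"],
    check=True,
)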

 

This has been the team’s mandate since the GATK’s inception circa 2010. Originally embedded within the Broad’s Medical and Population Genetics (MPG) Program, the GATK team was created specifically with the mission of empowering researchers to perform accurate variant discovery from genome sequencing data. Later we left MPG to help form the Data Sciences Platform, but we continue to work closely with collaborators throughout the Broad and the larger genomics community in the pursuit of that exact same goal — and at increasingly larger scale.

Since 2010, we have been involved in many high-profile projects that have applied this approach successfully, and we have seen countless others do so elsewhere on their own, enabled by our tools. It is intensely gratifying for us to see this work bear fruit once again, this time in the service of tackling a condition as complex as schizophrenia. But of course, the story doesn’t end with the list of new disease genes. Other scientists will build on the new genomic results to guide in vitro experiments. In a related Broad Institute blog post, Stanley Center co-director Morgan Sheng explains how these results could potentially be actionable and lead to real changes in psychiatric disease treatment.

 

At the end of the day, achieving better accuracy isn’t about publishing papers or getting to the top of the leaderboard on the PrecisionFDA Challenge. Generating larger cohorts isn’t just about flexing our compute muscle. Future breakthroughs in common disease research will continue to come about through the hard work of recruiting large numbers of affected participants; accurately detecting the few, tiny, rare mutations that make them different; and combining as many ultra-rare variants as we can until we can point the finger at genes harboring too many mutations across people who suffer from disease.

 


 

Resources

The Picard/GATK pipelines referenced in this work are available as open-source workflows (written in WDL) from the Broad Institute’s WARP repository

 

These workflows are also available in a public Terra workspace, preconfigured to run on example data. For more information on running workflows in Terra, see the Workflows Quickstart video tutorial on YouTube.

 

 

Try out the updated GATK pipeline for Ultima Genomics data
https://terra.bio/updated-gatk-pipeline-for-ultima-genomics-data/
Wed, 08 Jun 2022 16:55:50 +0000
Computational biologist and GATK team member Megan Shand describes the new GATK pipeline for processing short-read WGS data produced by the Ultima Genomics technology.

Megan Shand is a Computational Biologist in the Data Sciences Platform at the Broad Institute. As a member of the GATK development team, she manages methods and data pipelines for whole genome sequencing (WGS) data. In this guest blog post, Megan describes the new GATK pipeline for processing short-read WGS data produced by the new Ultima Genomics technology.


 

The biotech startup Ultima Genomics recently shared its new short-read sequencing technology, which generates high-throughput genomic data using sequencing-by-synthesis (SBS). In keeping with our goal of enabling the research community to use relevant data types, we’ve been collaborating with Ultima to adapt our whole genome analysis pipeline — aka the GATK Best Practices for short variant discovery — to handle the new data appropriately. We also made a Terra workspace preloaded with sample data and the fully configured analysis workflow so you can check it out for yourself without having to download or install anything. 

Let’s start with a brief summary of how the technology works, then I’ll explain what we changed in the pipeline and how you can try it out for yourself. 

 

Summary of the new sequencing approach

The company released a preprint describing their new technology in detail, so I won’t rehash it all here — you should really read it for yourself — but here’s a quick summary of how it works.

The most fundamental part of the approach should sound very familiar. There’s a “flowcell-like” substrate patterned with landing pads for sequencing beads, and the chemistry involves “mostly natural” sequencing-by-synthesis, which refers to the use of sparsely labeled, non-terminating nucleotides. At each sequencing cycle, the beads are exposed to the mostly-natural-nucleotide (MNN) mix, and polymerase extension incorporates 0, 1, or several bases of a single nucleotide type (dA, dC, dG, or dT) into each growing strand, depending on the length of the homopolymer in the template. The labeled bases are detected by optical scanning, with the signal of each bead being proportional to the length of the homopolymer sequenced. Base-calling is done by on-board GPUs that employ a deep convolutional neural network to convert the raw signals into base sequences.
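As a toy illustration of the flow-based idea (each flow exposes the template to one base type, and the signal reports how many of that base were incorporated), here is how an idealized, noise-free series of homopolymer counts maps back to a sequence. The flow order and counts below are made up; the real instrument works with noisy, continuous signals and a neural network base-caller.

from itertools import cycle

flow_order = cycle("TACG")                      # hypothetical repeating flow order
homopolymer_counts = [1, 0, 2, 1, 0, 1, 3, 0]   # idealized, noise-free signals per flow

# Each flow contributes its base repeated by the measured homopolymer length.
sequence = "".join(
    base * count for base, count in zip(flow_order, homopolymer_counts)
)
print(sequence)  # -> TCCGACCC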

However, instead of using a traditional linear flowcell, Ultima’s sequencer uses an “open fluidics” design, which consists of a circular 200mm silicon wafer that spins while reagents are released at its center. The spinning causes the reagents to spread out by inertial distribution, which reduces the volume of reagents needed per cycle. The imaging is also done by spinning the wafer — like reading a compact disc, as the preprint puts it. 

Finally, the data produced by the Ultima technology is very similar to “classic” short-read genomic data. It can be stored in the same file formats (FASTQ, SAM/BAM/CRAM, etc.), which means the tools we have all been using so far — BWA, GATK, and so on — can read and write the Ultima data out of the box.
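As a quick illustration of that out-of-the-box compatibility, here is a small sketch of opening an aligned Ultima CRAM with pysam, just as you would for any other short-read dataset. The file name, reference, and region are placeholders, and an index for the CRAM is assumed to exist.

import pysam

with pysam.AlignmentFile(
    "ultima_sample.cram", "rc",
    reference_filename="Homo_sapiens_assembly38.fasta",
) as cram:
    # Iterate over a small region and print a few alignment records.
    for i, read in enumerate(cram.fetch("chr1", 1_000_000, 1_001_000)):
        print(read.query_name, read.reference_start, read.cigarstring)
        if i >= 4:  # just peek at the first few records
            break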

For a more in-depth third-party review of the new technology and its implications, check out Keith Robison’s “Omics! Omics!” blog.

 

Adapting the GATK pipeline to handle Ultima data

Despite the overall similarities, we did find some differences in the error modes that affect the data generated by the new technology. Some error types are less common (e.g., base mismatches), while others are more common (e.g., indels in homopolymer runs). As a result, we had to make some modifications to a subset of the algorithms used in our pipeline to handle those differences.

 

Key tool/algorithm changes

– The Picard tool MarkDuplicates has been adapted to handle ambiguity in read alignment start positions;

– The contamination estimation step has been adapted to use only the highest quality SNV sites based on flow cycles and local realignment;

– In the HaplotypeCaller variant caller: 

– The classic Hidden Markov Model (HMM) that we previously used to calculate genotype likelihoods has been replaced by a new likelihood model (“flow-based”) that more accurately accounts for sequencing errors present in the data;

– A new haplotype filtering step has been added to remove spurious alleles that don’t contribute significantly to the genotyping likelihood;

– At the variant filtering stage, VQSR and its Gaussian mixture model have been replaced with an external package developed by Ultima that applies a Random Forest model.

Overall, these are mostly “under the hood” changes to existing tools that we activate using configuration flags and parameters. The only step where we’re swapping out components is variant filtering, as described above. 

You can read more about these changes in the GATK technical documentation.

 

New pipeline implementation

The resulting pipeline is different enough from our “generic” whole genome analysis pipeline to warrant its own implementation, which you can find here in the WARP repository. Like all our other pipelines, it is written in the Workflow Description Language, or WDL, which you can learn more about here.

Fortunately we were able to reuse a lot of existing code because of the modular design we’ve been using for the past few years: the bulk of the work done by the pipeline is divided into sub-workflow scripts that we can call from a top-level workflow script, sometimes in different combinations or with different parameters as needed. We originally adopted that design to maximize code reuse between the exome and whole genome versions of our variant calling pipeline, and now it’s also coming in handy to minimize the amount of homologous code that we need to maintain in parallel for running on different data types. Hopefully this will also make things easier for any of you who are also using our workflows to analyze your short read data.

If you develop your own GATK pipelines, make sure to update to the latest version of GATK so you can take advantage of the new functionality.

 

Take the data and the new pipeline out for a spin

Of course there’s no better way to really get to know a new data type than to poke at it yourself. To that end, our teams collaborated to create a public workspace in Terra that contains sample Ultima data as well as the updated whole genome analysis workflow preconfigured to run on the sample data. The workspace also includes all the relevant logs, intermediate outputs, and so on produced in the process of running the analysis workflow. 

Having all of that bundled together in a workspace allows you to examine the input files and see exactly how the analysis workflow is set up, what all the parameter values are, how long it takes to run and what the quality control metrics and outputs look like, in full technicolor detail. If you’d like, you can even clone the workspace under your own account and try running the workflow yourself.

So check out the Ultima workspace today to form your own opinion of the new data and test the updated whole genome pipeline. We’d love to hear your feedback in either the Terra community forum or the GATK support forum.

 

If you have any technical questions about the GATK side of things, please post your question in the GATK support forum. For resources to get started using the Terra workflow system, see the “Getting Started” links in the workspace dashboard. Finally, for questions about the new technology, reach out to Ultima Genomics via their contact form

 

 
