Update on Terra’s Strategy for a More Scalable Future
https://terra.bio/update-on-terras-strategy-for-a-more-scalable-future/
Fri, 19 Sep 2025 21:29:54 +0000

In January 2025, we shared that we’ve been reimagining how research will be done in the next five to ten years and outlined some of our plans for future capabilities. Since then, we’ve been working closely with Manifold on a major upgrade to Terra—“Terra Powered by Manifold”—and we’re excited to share an update. 

As a reminder, we won’t make any sudden changes to how your day-to-day work happens! We’ll share our timelines and give you a heads up on any user interface and functionality changes well in advance as we roll out the next-gen upgrades.

Platform capabilities coming soon

  • AWS support: Since much of the life sciences industry runs on AWS, we have partnered closely with AWS to enable seamless data access and collaboration within an organization’s AWS cloud environments, and to make it easy to take advantage of the latest AWS services.
  • Nextflow support: Robust Nextflow pipelines featuring low-code workflow configuration, integration of community workflows, and comprehensive logging and debugging capabilities.
  • AI agents: AI agents that empower scientists to move faster with their data and tools, starting with dataset chat and cohort-building agents, and looking ahead to an extensible ecosystem where the community can develop and deploy their own specialized agents to address unique research challenges.

Tools & datasets coming soon

  • Popular data science tools developed at Broad—including CellDega, Imputation Services, and PRS—that are not easily accessible on Terra today will soon be available as scientific apps on the platform.
  • Controlled datasets hosted by Broad on behalf of the NIH and others, previously not accessible on AWS, will be available to a broader range of researchers who are authorized to access them.

As a recap, Manifold is focused on maintaining and upgrading the core technology platform. Broad’s Data Sciences Platform (DSP) is building the next generation of scientific tools that will be made available to everyone via the platform. As part of the partnership with Manifold, members of the DSP team operating the original Terra platform have moved to Manifold while retaining their Broad access and collaborations. This team will focus on maintaining Terra while creating the next-generation platform. This strategic step enables us to accelerate the pace at which the expanded platform capabilities, tools, and datasets become available to the Terra user community and to portals powered by Terra.

Thank you for making Terra a critical tool in so many groundbreaking projects. We’re excited to continue working with you to accelerate your science!

Introducing the All of Us + AnVIL Imputation Service
https://terra.bio/introducing-the-all-of-us-anvil-imputation-service/
Mon, 25 Aug 2025 16:00:00 +0000

The Broad Institute’s Data Sciences Platform has launched the All of Us + AnVIL Imputation Service, now available at https://allofus-anvil-imputation.terra.bio/. This is the first in a new suite of cloud-based scientific services, designed to make large-scale genomic research faster, more accurate, and more inclusive.

Genotype imputation plays a key role in genome-wide association studies and polygenic risk score analyses by expanding the portion of the genome that can be analyzed for phenotypic associations, at a fraction of the cost of sequencing. Imputation services let researchers leverage large reference panels (without direct access, protecting participant privacy) while also eliminating the need to build and manage the complex computational infrastructure that imputation pipelines require.
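At its core, imputation fills in untyped variants by matching a sample's genotyped sites against haplotypes in the reference panel. The toy sketch below (hypothetical data, with a simple best-match rule standing in for the probabilistic models that real services use) conveys the idea:

```python
def impute_site(observed, reference_haplotypes, missing_index):
    """observed: alleles (0/1) with None at the untyped site.
    Copy the allele from the reference haplotype that agrees with
    the most typed sites (a crude stand-in for an HMM)."""
    def score(hap):
        return sum(1 for i, allele in enumerate(observed)
                   if allele is not None and hap[i] == allele)
    best = max(reference_haplotypes, key=score)
    return best[missing_index]

# A tiny reference panel: 4 haplotypes over 5 biallelic sites.
panel = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0],
]
# The genotyping array covered sites 0, 1, 3, and 4; site 2 is untyped.
observed = [0, 0, None, 1, 1]
imputed = impute_site(observed, panel, missing_index=2)
```

Larger and more diverse panels make a close haplotype match more likely for any given sample, which is why panel size and ancestry composition drive imputation accuracy.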

The largest and most diverse panel in the world

Our imputation service features a diverse reference panel of genomes from over 515,000 participants from the All of Us Research Program and the AnVIL Centers for Common Disease Genomics, including more than 250,000 genomes from non-European inferred genetic ancestries.

Figure 1. The All of Us + AnVIL reference panel contains more than 515,000 total genomes from the following computed genetic ancestries: 254,416 European (49%), 101,982 African (20%), 90,553 Americas (18%), 13,226 East Asian (3%), 9,710 South Asian (2%), 1,065 Middle Eastern (0.2%), and 44,627 remaining individuals (9%).
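As a quick sanity check, the per-ancestry counts in Figure 1 sum to just over 515,000 and reproduce the stated percentages:

```python
# Genome counts per computed genetic ancestry, as listed in Figure 1.
counts = {
    "European": 254_416,
    "African": 101_982,
    "Americas": 90_553,
    "East Asian": 13_226,
    "South Asian": 9_710,
    "Middle Eastern": 1_065,
    "Remaining": 44_627,
}
total = sum(counts.values())                       # 515,579 genomes
share = {k: 100 * v / total for k, v in counts.items()}
```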

The All of Us + AnVIL reference panel is the largest and most ancestrally diverse reference panel currently available, enabling researchers to enhance their datasets with greater accuracy. When imputing 42 arrays representing an ancestrally diverse set of samples against their whole genome sequences, we found high accuracy (R² of 0.7) for both imputed SNPs and indels, even at very low allele frequencies in most ancestries. You can read more about our reference panel in our documentation.

Figure 2. Mean R² values for imputed SNPs (top) and indels (bottom) across all chromosomes compared to their respective 30X whole genomes for 42 samples from diverse ancestries (African, African-American, Admixed American, East Asian, Non-Finnish European, South Asian). The X axis is allele frequency, and the Y axis is the mean R².
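The R² reported above is the squared Pearson correlation between imputed allele dosages and the genotypes observed in high-coverage sequencing of the same samples. A minimal sketch with hypothetical values:

```python
from math import sqrt

def imputation_r2(dosages, truth):
    """Squared Pearson correlation between imputed dosages and
    true genotypes (0/1/2) from high-coverage sequencing."""
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(truth) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(dosages, truth))
    var_x = sum((x - mx) ** 2 for x in dosages)
    var_y = sum((y - my) ** 2 for y in truth)
    return (cov / sqrt(var_x * var_y)) ** 2

# Hypothetical imputed dosages vs. genotypes seen in 30X genomes.
dosages = [0.1, 0.9, 1.8, 0.0, 1.1, 2.0]
truth = [0, 1, 2, 0, 1, 2]
r2 = imputation_r2(dosages, truth)   # close to 1: accurate imputation
```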

Secure and scalable infrastructure

Behind the scenes, the imputation service leverages the Terra platform. This means you can expect the same level of security and privacy, including the ability to impute controlled access data that requires NIST-800-53 Rev5 FedRAMP Moderate compliance. You can learn more about Terra’s security at https://terra.bio/terra-security/.

Because our service runs on the cloud, you can expect to get your results fast without having to wait in a queue of users. Early testing revealed that submitting 2,500 samples to the service returns results the next day.

Get Started

The beta release of the service offers imputation of array data via a command-line tool called terralab. When a user signs up during the beta release, they may be eligible for credits, allowing them to submit up to 2,500 samples at no cost, thanks to funding from the National Institutes of Health. 

To learn how to use the service, visit our documentation page at https://broadscientificservices.zendesk.com/hc/en-us.

What to expect over the next year

Soon, we plan to release a point-and-click web user interface, as well as support for imputing cloud-hosted data. 

With the rapid growth of genomic projects, we are preparing to meet increasing demand. After the beta period, we will transition to a paid model designed to support large-scale studies, expanded pre- and post-analysis options, and the release of imputation pipelines for low-pass genomes. 

Acknowledgments 

This work was made possible by National Institutes of Health (NIH) awards: (1) OT2OD035404, “All of Us Data and Research Center (DRC);” (2) OT2OD03821, “Broad-Color: The Genome Center for the Future of All of Us;” (3) OT2OD002750, “The Broad-LMM-Color Genome Center for All of Us,” funded by the NIH Office of the Director; and (4) U24HG010262, “AnVIL: A National Resource for Genomic Data Analysis and Visualization,” funded by the National Human Genome Research Institute.

We gratefully acknowledge All of Us and Centers for Common Disease Genomics participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health’s All of Us Research Program and NHGRI AnVIL for making available the participant data for the reference panel.

Tools to Manage Terra Costs
https://terra.bio/tools-to-manage-terra-costs/
Wed, 12 Mar 2025 13:00:00 +0000

Having transparent, controllable cloud costs is a top priority for Terra. In response to feedback from our community, we have been working on three key initiatives to improve cost management on Terra:

  1. Cost Reporting: Tools that provide clearer, more intuitive reports of cloud expenditures
  2. Cost Controls and Estimates: Tools to help prevent runaway workflow costs
  3. Cost Optimizations: Tools to reduce cloud spend

Overview of Cloud Costs in Terra

When you work with data in Google Cloud, you pay to store and analyze that data. Running a workflow or a Jupyter notebook incurs charges through your Google Cloud billing account. More details can be found in Managing Cloud Costs on Terra’s support site.

Initiative 1. Cost Reporting

Reminder: Set up spend reporting

We encourage all billing project owners to set up spend reporting for any new billing account. By completing the setup, you ensure that your billing data is captured and accessible in Terra, giving you greater control over your cloud expenses. Once set up, you can access reports for financial oversight and optimization of your cloud costs and resources. 

Improved spend reports

Workspace owners now have a consolidated spend report showing costs for all of their workspaces across billing projects, making it easy to identify which workspaces are costing the most money. Billing project owners can also dive into the details of the workspaces in a billing project and see costs over time, aggregated daily, to observe how costs are trending. For more details, see our roadmap article on reported costs.
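Conceptually, a consolidated spend report is a roll-up of per-day, per-workspace cost rows. A hypothetical sketch (illustrative workspace names and dollar amounts, not Terra's actual report format):

```python
from collections import defaultdict

# Hypothetical per-day, per-workspace cost rows.
rows = [
    {"workspace": "gwas-prod", "date": "2025-03-01", "usd": 41.20},
    {"workspace": "gwas-prod", "date": "2025-03-02", "usd": 38.75},
    {"workspace": "scratch", "date": "2025-03-01", "usd": 2.10},
    {"workspace": "cellbender-qc", "date": "2025-03-02", "usd": 17.60},
]

# Total spend per workspace across all billing days.
totals = defaultdict(float)
for row in rows:
    totals[row["workspace"]] += row["usd"]

most_expensive = max(totals, key=totals.get)
```

The same rows, grouped by date instead of workspace, give the daily trend view described above.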

Monitor and analyze workflow costs as they run

To reduce a workflow’s cost, it’s essential to know which tasks require the most resources, and which resources cost the most. You can now gather this information by selecting the “resource monitoring” option when configuring a workflow submission. Read Monitoring GCP cloud resources used in a workflow for more information and step-by-step instructions.

Once you’ve run a workflow with resource monitoring enabled, you can break down the resources and costs it consumed. Several tutorial notebooks demonstrate how to generate plots visualizing the time spent on each workflow task; its CPU, memory, and disk usage; and the cost of each resource for each task. This information will help you optimize your workflow for cost and estimate the costs of similar workflows in the future.

To generate these plots for your own data, clone the Workflow Resource and Cost-Monitoring Guide workspace and follow the instructions on the dashboard.

Initiative 2. Cost Controls and Estimates

Workflow cost thresholds

This feature lets you set a cost threshold for a workflow to stop runaway costs. We recommend using it as a safety net against runaway costs rather than as a strict budgeting tool: although our findings show high accuracy and reliability in our cost estimations, workflows may not terminate immediately upon hitting the defined threshold.

This feature is currently in preview. For more details, see our roadmap article.
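In rough terms, the threshold behaves like a periodic check of the running cost estimate against a user-set cap. This is a hypothetical sketch of that control loop, not Terra's implementation; because estimates arrive on a delay, the final spend can overshoot the cap slightly, as noted above:

```python
def run_with_cost_cap(cost_estimates, threshold_usd):
    """cost_estimates: cumulative cost estimates sampled as the run
    progresses. Abort as soon as an estimate exceeds the cap."""
    for estimate in cost_estimates:
        if estimate > threshold_usd:
            return "aborted", estimate
    return "completed", cost_estimates[-1]

# Estimates are sampled periodically, so the run overshoots the
# $10 cap a bit before it is stopped.
status, spent = run_with_cost_cap([1.0, 4.5, 9.8, 12.3, 20.1], 10.0)
```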

Estimate costs of previous and in-progress workflows

For new workflows submitted after March 3, 2025, Terra shows estimated costs of workflows using the same estimates used to calculate workflow cost thresholds. These estimated workflow costs are displayed on the submission details page. If you have spend reporting set up, the estimates will convert to the actual costs from Google within 24 hours. 

Estimate costs for commonly-used workflows

To estimate the cost of one of Terra’s featured workflows, refer to the Costs of selected featured workflows article. Note that your cost may differ from the estimates listed there due to differences in your data’s size and runtime settings like disk size.

Example workspaces for these workflows provide benchmarked cost and time estimates in their dashboards. For example, CellBender (one of our most commonly used single-cell workflows), GWAS (using REGENIE), and the GATK-SV single-sample analysis workflow are among the many featured and public workspaces that contain cost estimates.

Initiative 3: Cost Optimizations

Autoclass storage

By default, all Terra workspaces and Terra Data Repository datasets now leverage Autoclass storage to automatically move infrequently accessed data to a less expensive, “colder” storage tier (see Google’s storage pricing tables for details). Terra’s default moves files all the way to the Archive storage class for maximum savings. We estimate this will save users an average of 30% to 50% on annual storage costs.
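To see where savings in that range come from, compare an all-Standard bucket with one where rarely read data has been moved to Archive. The per-GB prices and the hot/cold split below are illustrative placeholders, not Google's actual rates:

```python
# Illustrative placeholder prices (USD per GB per month).
STANDARD = 0.020
ARCHIVE = 0.0012

def annual_storage_cost(gb, hot_fraction):
    """hot_fraction of the data stays in Standard; the rest is archived."""
    hot = gb * hot_fraction * STANDARD * 12
    cold = gb * (1 - hot_fraction) * ARCHIVE * 12
    return hot + cold

baseline = annual_storage_cost(10_000, hot_fraction=1.0)    # all Standard
autoclass = annual_storage_cost(10_000, hot_fraction=0.55)  # 45% goes cold
savings = 1 - autoclass / baseline
```

With these placeholder numbers the saving lands around 42%; the actual figure depends on how much of a workspace's data is rarely accessed.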

Cleaning up intermediate workflow files

Workflows can generate many intermediate files that users do not wish to keep long-term. Workflows now include an option to delete intermediate files. This reduces the storage costs incurred by workflows configured to run with this setting. 

For situations where you need to clean up intermediate files from workflows that were not run with this setting, or to clean them up after a period of time, Terra has introduced a new way to delete intermediate files you no longer need using Google Cloud lifecycle rules. Lifecycle rules both save money on storage costs and let you better manage the files produced as part of your analysis. For more details, see our roadmap article.

Requester pays for data egress

While users with read-only access to your workspaces cannot run compute, they can incur costs to your account by downloading or copying the data. To avoid paying when other users download or copy data from your workspace, you can enable requester pays on your bucket. Read using Requester Pays workspaces for more information and step-by-step instructions. This option is particularly helpful when hosting large datasets on Terra.

We want to hear from you!

Working in the cloud opens up many opportunities to analyze data at scale. We appreciate the feedback we have received from our community about the importance of transparent costs and the ability to control them, and are dedicated to making improvements to better meet your needs. With these new tools in hand, it will be easier to track and control costs.

We’d like to continue to hear your feedback on these features and suggestions for the future! Check out our public Terra roadmap to see features that are planned for the near term or submit a new feature request on our public community forums to suggest improvements.

Accelerating Discovery: Terra’s Strategy for a More Scalable Future
https://terra.bio/accelerating-discovery-terras-strategy-for-a-more-scalable-future/
Mon, 13 Jan 2025 10:00:00 +0000

It’s incredible that Terra and its predecessor, FireCloud, have been in development for over nine years. In that time, Terra has grown far beyond our original vision, playing a critical role in areas like public health surveillance, ‘omics data delivery, biobank data management, and rare disease diagnosis.

We’ve spent the last several months mapping out a strategy to evolve Terra for the future. Together with leading scientists at the Broad Institute and our collaborators, we’ve been imagining how research will be done in the next five to ten years, and how Broad’s Data Sciences Platform (DSP) can deliver on what you’ll need to drive discovery in the next decade. In this blog post, we’re excited to share our plans and give a teaser on some of our future capabilities. Let’s start by looking at some of these capabilities for the biological research use case.

Future Platform Capabilities:

  • Supporting all major clouds: Organizations store scientific data in a variety of cloud platforms, but users shouldn’t have to worry about where the data resides. We need to support multiple clouds and make it easy for users to work across them cost-effectively.
  • Support for more workflow languages: Not all users write WDL. Nextflow’s popularity has soared, for example, so we want to support it while also making it easier to support any new workflow languages that come along. 
  • Scaling data tables and workflow execution: Over the years, Cromwell, Terra’s workflow execution engine, has done an impressive job handling massive scales of data. As data continue to grow and new types of data emerge, however, we need to scale even further. This is why we’re focused on building a platform that’s not only better at handling more data, but also evolves with the changing landscape of scientific research.
  • Advancing the researcher’s lifecycle with AI agents: AI agents could assist with cohort building, data analysis, scientific literature searches, and much more. To improve the speed of research, we need to include more of these capabilities natively.
  • Enhanced data management capabilities: From ingestion and harmonization to discovery and access — organizing and managing data needs to be easier than it is today. 

As we look to build these capabilities in the years to come, we want to continue to work with innovative partners who can help accelerate engineering. To help bring these capabilities to life, we’re excited to announce a strategic collaboration with Manifold. Their technical expertise and shared vision will be crucial in helping us deliver on these goals and positioning us for a scalable and innovative future.

Manifold is a technology company focused on building advanced, modern cloud infrastructure for biomedical science. Manifold started in the cancer research space, where they demonstrated the effectiveness of their research cloud infrastructure in partnerships with leading organizations such as the American Cancer Society (ACS) and Indiana University. 

We’re working with Manifold on a new platform, one that’ll incorporate Terra’s features while adding new functionality to help address the needs discussed above. Like Terra, this new platform will act as a steward—not an owner—of scientific data, ensuring users retain full ownership and control over access, while providing high levels of data security. Manifold will focus on building the core platform infrastructure while we in Broad’s DSP will develop advanced open-source analysis tools and capabilities at the cutting edge of biomedical science and then make them available to everyone via the platform. You can read more about this collaboration in a press release we shared earlier today.

And, we remain hard at work improving the Terra platform that you use today to support your science. For example, soon you’ll be able to cap the cost of a workflow—allowing you to have more control over your spend in Terra when running workflows. To read more about other features that are coming, check out our new public roadmap. The goal of the roadmap is to invite your feedback and ideas on what we’re building, while also clearly outlining our upcoming plans. You can even sign up to be an early tester of new features.

We’re excited to collaborate more with all of you — and Manifold — in 2025. Thank you for making Terra a critical tool for so many groundbreaking projects. Check out our FAQs about this collaboration and if you have any other questions, please feel free to reach us at support@terra.bio.

Happy New Year!

Introducing the Public Terra Roadmap: Shaping the Future Together
https://terra.bio/introducing-the-public-terra-roadmap-shaping-the-future-together/
Tue, 26 Nov 2024 19:00:00 +0000

Our mission is to empower the scientific community with a powerful, user-friendly, and scalable cloud platform that removes technical barriers, democratizes access to data, streamlines collaborative workflows, accelerates scientific discoveries, and drives collaborative breakthroughs. At the heart of this mission lies a commitment to transparency and collaboration with the Terra community. This includes our plans for the future of Terra, addressing key questions such as: What features are on the horizon? When will they be available? How can users provide feedback on early development efforts?

We’re answering those questions by introducing the Public Terra Roadmap—a tool to provide visibility into what we’re planning, invite feedback on new features, and foster a collaborative development process.

Why a Public Roadmap?

We believe that the development of Terra is best guided by the needs of its users. With the launch of the public roadmap, we aim to:

  1. Provide clarity and visibility into development plans
    In this roadmap, we offer a view of what’s coming to Terra. You’ll be able to follow features and functionality across four stages of development:
    • Near Term: What the team has planned or is actively working on
    • Preview: New features that you can opt into and test early to provide feedback 
    • Launching: Production-ready, newly released features for all users
    • Released: A historical view of what has been delivered

  2. Create opportunities for feedback
    Feature previews allow you to opt in and test new functionality with your use cases, your data, your tools—all before it’s fully released. We hope you’ll provide valuable feedback to help influence how new features are shaped.

  3. Enhance communication and collaboration
    The roadmap is hosted on a platform that enables users to leave comments, making it easy to share your thoughts, ask questions, and engage directly with Terra’s product team. In turn, our team will actively respond to and engage with you. Your input matters, and this is one of the ways we’re ensuring your voice helps shape Terra’s future.

What This Means for You

By making our development plans transparent, we aim to ensure that the features we deliver truly address your needs. The roadmap is now live and will be routinely updated to ensure it reflects the latest priorities. This roadmap is not a static snapshot—it’s a dynamic tool designed to grow with the platform and its community. Whether you’re a researcher managing large datasets, a public health expert involved in pathogen genomic surveillance, or a bioinformatician optimizing workflows, the roadmap is designed to keep you informed and involved.

The launch of the roadmap is a reflection of a core belief: Terra is better when built in partnership with its community. We invite you to explore the roadmap, share your feedback, and be a part of shaping the future of Terra.

View the Public Terra Roadmap

-The Terra Team

Are you compliant with NIH’s updated Genomic Data Sharing policy? Terra can help!
https://terra.bio/nihs-updated-genomic-data-sharing-policy/
Thu, 19 Sep 2024 13:47:59 +0000

On July 25, 2024, the National Institutes of Health (NIH) issued updated guidance to the Genomic Data Sharing Policy stating that researchers approved for controlled-access NIH data, via mechanisms like dbGaP or DUOS, must store and analyze this data on systems that are NIST SP 800-171 compliant (or the equivalent ISO/IEC 27001/27002) starting January 25, 2025.

If you are using Terra or NIH Platforms like AnVIL that are powered by Terra, you are already compliant.

The update requires researchers and institutions using NIH controlled-access data to attest that systems interacting with this data are compliant with NIST SP 800-171. Many university and institute IT departments are urgently determining remediation pathways for their internal clusters and systems, including what new policies and training will need to be rolled out to their staff to be compliant by January. While many institutions are rapidly working to identify, invest in, and implement critical security upgrades to meet NIST SP 800-171 compliance, it is vital that IT and compliance departments can inform researchers about which systems meet these security criteria.

Luckily, there is a solution: Terra and NIH platforms like AnVIL that are powered by Terra already meet NIST SP 800-171 as well as the more rigorous NIST SP 800-53 (FedRAMP Moderate). IT and compliance departments can direct users to Terra and AnVIL, and simply acknowledge use of these systems in the attestation process. It’s as easy as that!

If you have any questions or would like to discuss your specific situation, please contact support@terra.bio. We would be glad to assist you. 

Rare disease genomics with seqr in Terra
https://terra.bio/rare-disease-genomics-with-seqr-in-terra/
Mon, 15 Jul 2024 07:00:00 +0000

For additional background, see Launching seqr in Terra

By Lynn Pais

Lynn Pais is a senior clinical genomic variant analyst and product owner for seqr in the Medical and Population Genetics group at the Broad Institute. In this guest blog post, Lynn introduces seqr, an open-source genomic analysis platform powered by Terra.


Identifying a genetic diagnosis for individuals with rare monogenic diseases often requires sifting through mountains of genomic data. To help researchers tackle this challenge, seqr offers an open-source, web-based platform housed within the NHGRI’s Genomic Data Science Analysis, Visualization, and Informatics Lab-Space (AnVIL). Our flagship paper describing the platform is available here: seqr: A web-based analysis and collaboration tool for rare disease genomics (Human Mutation). Some of the key features of seqr include:

  • Advanced Annotations & Filtering: seqr provides rich gene and variant-level annotations and powerful filtration tools to perform variant searches within a family or across projects.
  • User-Friendly Interface: seqr is designed to be accessible to a variety of users, including researchers, clinicians, and project managers.
  • Collaboration: seqr enables researchers around the world to work together on analyzing genetic data from families with rare diseases. Through the Matchmaker Exchange interface, seqr also supports the submission of candidate variants/genes and phenotypic information to an international network for gene discovery.
  • Data Management: seqr offers a central location for storing large amounts of genetic data alongside de-identified patient information. This simplifies project management.
  • Improved Diagnosis: With the continual addition of genomic research tools and support for analyzing new data types, seqr aims to accelerate diagnoses for rare diseases.

seqr is usable out of the box and holds over 46,500 WES/WGS samples, of which 12,000 have been submitted by external users through AnVIL. At the Broad, seqr has supported the diagnosis of more than 4,000 individuals with rare disease and the discovery of over 300 novel disease genes.

The tool is made available to the research community as a public instance operated by the Broad Institute and available through Terra. You can move seamlessly from secondary analysis (calling variants with GATK workflows in Terra) to tertiary analysis of your callset with seqr.

Analyzing data with seqr in Terra

Broad’s seqr instance is connected to Terra (GCP) and set up to run on data in your workspace bucket. It expects a joint-called VCF file of variants generated with a joint calling pipeline, such as the GATK Whole-Genome Joint-Calling workflow. After clicking the Files icon at the left of the Data page, use the seqr link at the top right to create a project and load data into seqr. Once a project has been created, this link will take you directly to the seqr interface.

[Note: Data import involves QC/validation and annotation steps that optimize the data for querying. This can take up to a week to run depending on factors including the size of the joint call VCF. Once complete, you can run searches on large datasets efficiently, so it’s worth it.]

Once the data is loaded (you’ll get an email notification), you can get started with your analysis in seqr. There are standardized inheritance-based searches you can use to filter variants, with options to customize a search or look up a specific variant across your cohort and other projects in seqr, plus many more features to support your analysis. See our video playlist for a demonstration of some of these features.
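As an illustration of what an inheritance-based search does, a "de novo" search keeps variants where the affected child carries an allele seen in neither parent. A minimal sketch over hypothetical trio genotypes (this is not seqr's implementation):

```python
def is_de_novo(child, mother, father):
    """Genotypes coded 0 (hom-ref), 1 (het), 2 (hom-alt)."""
    return child >= 1 and mother == 0 and father == 0

# Hypothetical trio genotypes for a few variants.
variants = [
    {"id": "chr1:12345", "child": 1, "mother": 0, "father": 0},
    {"id": "chr2:67890", "child": 1, "mother": 1, "father": 0},
    {"id": "chr3:11111", "child": 2, "mother": 0, "father": 0},
]
candidates = [v["id"] for v in variants
              if is_de_novo(v["child"], v["mother"], v["father"])]
```

seqr's searches layer gene- and variant-level annotations, quality filters, and other inheritance modes (recessive, X-linked, compound het) on top of this basic genotype logic.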

[Figure legend: The seqr platform features a user-friendly interface that facilitates exome and genome sequence analysis through filtering and display of extensive gene- and variant-level annotations. Users can add shared notes and tags to provide further contextual understanding and mark variants for follow-up. The platform also facilitates visualization of read data through its IGV integration and data sharing through the Matchmaker Exchange. See the video playlist/tutorial for more details on these features.]

Even without loading case data in seqr, AnVIL-registered users can look up a variant of interest to review it with seqr’s informative annotations. While this feature is only available for variants that are present in seqr, the display also includes de-identified information about the cases in which the variant was found such as genotype, affected status, and high-level phenotype categories.

Try seqr on Terra today.

We invite you to try it out for yourself. To get started, check out our video tutorials, including a video describing how to load your data in seqr (last video). When you create a new account, there is a demo project you can use to try seqr out right away, even if you don’t have your own data yet.

If you run into any issues using seqr in Terra, don’t hesitate to reach out to the Terra helpdesk (go to the main menu at the top left of any page in Terra, open the Support section, and choose “Contact Us”). For other questions about seqr, see the FAQ page. If your question is not answered here, contact the seqr team at seqr@broadinstitute.org.

The post Rare disease genomics with seqr in Terra appeared first on Terra.

Accessing Cancer Research Data Commons resources in Terra
https://terra.bio/accessing-cancer-research-data-commons-resources-in-terra/
Wed, 10 Jul 2024 08:00:00 +0000

The post Accessing Cancer Research Data Commons resources in Terra appeared first on Terra.

For researchers trying to understand and treat cancer, searching for useful datasets consumes valuable time. Even after accessing a dataset, researchers must figure out how—and where—to analyze it. After all, it is no mean task to wrangle petabytes of data. 

Luckily, the Cancer Research Data Commons (CRDC) addresses many of these challenges. A recent paper in Cancer Research outlines how the CRDC accelerates researchers’ work by hosting large, cancer-related datasets on the cloud. Terra is proud to support this effort: an NCI-branded version of Terra (FireCloud) is one of three platforms that both host the CRDC’s datasets and provide cloud-based analysis tools to uncover the insights in that data.

How to access CRDC data in Terra

Through Terra, researchers can access several of the CRDC’s open- and controlled-access cancer datasets. These include reference genomes and files (such as those from the 1000 Genomes Project), as well as data from The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Cell Line Encyclopedia (CCLE). These datasets are multimodal: they include genomics, proteomics, and imaging data.

Rather than analyzing each dataset in isolation, researchers can combine and subset the data to build a cohort that is appropriate for a specific project. In addition, researchers can combine the CRDC’s cancer datasets with other data that are accessible through Terra—for example, from the Analysis, Visualization, and Informatics Lab-space (AnVIL)—or with data from researchers’ own labs. 
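As a toy illustration of that kind of cohort building (all sample IDs, column names, and values below are made up, and real Terra data tables are far richer):

```python
# Hypothetical CRDC-style sample table.
crdc = [
    {"sample_id": "S1", "project": "TCGA", "tumor_type": "LUAD"},
    {"sample_id": "S2", "project": "TARGET", "tumor_type": "AML"},
    {"sample_id": "S3", "project": "TCGA", "tumor_type": "BRCA"},
]

# Hypothetical phenotype table from a researcher's own lab, keyed by sample ID.
lab = {"S1": {"age": 61}, "S3": {"age": 54}}

# Build the cohort: keep TCGA samples that also appear in the lab's table,
# merging the columns from both sources.
cohort = [
    {**row, **lab[row["sample_id"]]}
    for row in crdc
    if row["project"] == "TCGA" and row["sample_id"] in lab
]
```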

How to analyze cancer data in Terra

Once researchers have collected the right data for their scientific questions, Terra provides easy access to several analysis tools to begin answering those questions. These include a library of existing workflows (automated pipelines for high-throughput analyses) and interactive analysis notebooks, with tools for long- and short-read variant calling, Genome-Wide Association Studies, epigenomic and RNAseq data processing, and fusion transcript detection. 

Beyond this access to pre-existing tools from the research community, analyzing CRDC data in the cloud is much more lightweight than doing so on a personal computer or a high-performance computing (HPC) cluster. Large datasets require specialized computational infrastructure, but there’s no need to set up or maintain this infrastructure when working in the cloud. There’s also no need to set up a data security system, because data analyzed in Terra remains safely in a FedRAMP- and FISMA-certified environment.

CRDC data in action: the Proteomic Data Commons

Researchers have already leveraged Terra’s CRDC resources to push the cancer field forward. For example, the Broad Institute’s Proteomics Platform integrated data from the Proteomic Data Commons (PDC) into Terra to make it easier for researchers to uncover cancer mechanisms and biomarkers.

The PDC is an important resource for cancer researchers because it hosts data generated by several large cancer programs—these include TCGA, TARGET, the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and the Applied Proteogenomics Organizational Learning and Outcomes (APOLLO) Network. D.R. Mani’s team at the Proteomics Platform worked with the Terra team to build a way for researchers to export data from the PDC into a Terra workspace with the click of a button.

Once the data is in Terra, it’s ready to be analyzed with a common proteomics analysis toolkit: FragPipe. To make this process even more seamless, the group set up a template workspace with step-by-step instructions for selecting PDC data and importing it into Terra. The workspace also includes a FragPipe workflow and an interactive notebook that walks researchers through how to analyze PDC data with an example FragPipe tool.

With cloud-based integrations like this uniting CRDC data and analysis tools, researchers can focus on understanding and treating cancer, rather than accessing and wrangling data. 

Use Terra’s CRDC resources in your own work

To explore Terra’s CRDC resources further, register for an account and explore Terra’s NCI featured workspaces. These workspaces provide further instructions to access connected datasets and suggest tools to analyze the data. Note that you will need an eRA Commons account to access CRDC data on Terra. 

Thank you to Alex Baumann, Emily LaPlante, and Katherine Thayer for their help preparing this post.

Managing Access to UK Biobank Data on Terra
https://terra.bio/neale-lab-uk-biobank-on-terra/
Mon, 13 Nov 2023 13:00:12 +0000

The post Managing Access to UK Biobank Data on Terra appeared first on Terra.

]]>

Having access to good data is crucial for genomics research. Without large-scale datasets, there’s little chance of uncovering the genetic underpinnings of disease. This is particularly important when studying diseases whose risk is determined by a large number of genes, each of which has a relatively small effect. 

But analyzing and managing these large datasets can be challenging. This is particularly true when the data are protected, so that only approved researchers can access them. Because of these challenges, many good scientific stories happen alongside equally interesting stories about data management.

In one such story, Terra’s data management tools allowed researchers to access data from the UK Biobank. In turn, this access has fueled many innovative projects that were only possible with large-scale data.  

Getting the data onto the Cloud

The UK Biobank amasses genotypic and phenotypic data from roughly half a million participants in the United Kingdom. Research groups interested in working with the data can apply for access to this data set. In 2017, a group of Broadies across different research interests — including Krishna Aragam, Andrea Ganna, and Mary Hross — realized that making a single copy of the dataset available on the cloud (and on the Broad cluster) would allow more researchers to study the genetic underpinnings of disease while simultaneously reducing storage costs. As a result, Ben Neale’s lab sponsored a project to undertake this work. The project’s storage is funded by Broad ITS and managed by Sam Bryant, who was a Senior Data Management Specialist at the time. Bryant has since become the Associate Director of Data Management at the Broad’s Stanley Center for Psychiatric Research.

Working with such large data was no easy task. To start, it took a long time to obtain the files — at the time, the UK Biobank limited downloads to 10 concurrent files. So, Sam Bryant got approval to access genotype and phenotype data from roughly half a million participants, then spent about two weeks downloading it onto the Broad’s on-premises cluster. From there, he uploaded the data to a Terra workspace in the Cloud.

Managing access to the data

With these data in hand, Bryant faced a second challenge: managing who could use the data. The UK Biobank uses a careful approval process to protect their data. So now Bryant needed a way to ensure that only approved users could access the Terra dataset. 

The solution was to create a Terra Group of approved researchers. Researchers who wanted to access the data — including the Neale lab’s collaborators — applied for approval from the UK Biobank, which then let Bryant know who he could add to the Terra group. Once added to the group, collaborators could access the data — from inside or outside of Terra — in order to analyze it. 
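The underlying pattern is plain group-based access control. A toy model of the workflow (the class and names are hypothetical — in practice, group membership lives in Terra, not in your own code):

```python
class AccessGroup:
    """Toy model of a Terra-style access group; Terra does the real gatekeeping."""

    def __init__(self):
        self.members = set()

    def approve(self, researcher):
        # Called only after the data owner (here, the UK Biobank) has
        # confirmed the researcher's application.
        self.members.add(researcher)

    def can_access(self, researcher):
        return researcher in self.members

ukb_group = AccessGroup()
ukb_group.approve("alice@example.org")  # approved by the UK Biobank
```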

An alternative method: using DUOS to manage controlled data on Terra

In many cases, Bryant’s method is still the best way to share controlled-access data with collaborators. However, it works best when data managers are sharing their data with known collaborators. DUOS offers an alternative for sharing controlled-access data from the Broad and the National Human Genome Research Institute (NHGRI). DUOS speeds up the data-access application process by automatically updating access permissions whenever someone is approved. This system makes it easy to share data with unknown researchers as well as collaborators. You can learn more about how DUOS streamlines controlled data access in Streamlining Data Access and in DUOS’ documentation.

The data’s scientific impact

Since 2017, research groups have accessed the UK Biobank dataset on Terra to answer questions about several aspects of human health. These include coronary artery disease (Fahed et al., 2022; Patel et al., 2022; Patel et al., 2023; Dron et al., 2023; Khera et al., 2022); Alzheimer’s (Paranjpe et al., 2022); obesity (Agrawal et al., 2022); liver disease (Haas et al., 2021); heteroplasmy (Gupta et al., 2023); and clonal hematopoiesis (Brown et al., 2023). These data have also supported a growing understanding of how the human genome and phenome are structured – for example, uncovering correlations within the human phenome (Carey et al., 2022). In addition, the UK Biobank dataset has helped researchers better understand the effects of a dataset’s size and diversity on machine learning models (Cui et al., 2023; Majara et al., 2023).

How can you access this data?

The Broad’s UK Biobank dataset is still available on Terra. This dataset is a subset of the full UK Biobank data — the remainder is accessible via DNAnexus.

If you’d like to leverage this dataset for your own research, and you’ve already applied for access to the UK Biobank, follow the instructions in this document. If you have not yet applied for access to the UK Biobank, contact Sam Bryant directly at sbryant@broadinstitute.org. And to learn more about sharing controlled data on Terra, see Best practices for sharing and protecting data resources and Managing access to shared data and tools with groups.

Many thanks to Sam Bryant, Caroline Cusick, and Jonathan Lawson for their help preparing this post.

Boosting Variant Calling in Terra with the New Telomere-to-Telomere Human Reference
https://terra.bio/boosting-variant-calling-in-terra-with-the-new-telomere-to-telomere-human-reference/
Thu, 02 Nov 2023 16:17:51 +0000

The post Boosting Variant Calling in Terra with the New Telomere-to-Telomere Human Reference appeared first on Terra.

]]>
There is a new, more complete human reference on the Terra reference disk!

When the Human Genome Project was declared complete in 2003, there were a number of gaps in the reference, due to the repetitive nature of much of the human genome and the limitations of the technologies available then. This left approximately 8% of the human genome out of the canonical reference sequence. To fill this gap, the first complete sequence of a human genome – also known as the Telomere-to-Telomere (T2T) reference – was announced in early 2022. The T2T reference was built using a combination of the latest sequencing technologies to recover the missing 8%. It lets scientists access even the trickiest regions, like centromeres, segmental duplications, and other complex regions, including around 100 new protein-coding genes. In addition to these gains, some errors found in the previous GRCh38 reference have been corrected, resulting in an overall higher-quality reference. Simply switching to this reference improves variant-calling performance (see this writeup).

Scientific Background

Building a reference sequence is a special type of assembly project, involving multiple modes of orthogonal technologies to generate data and careful analysis. In the case of the T2T reference, the CHM13 (complete hydatidiform mole) haploid human cell line was used to assemble the autosomes and the X chromosome. The full Y chromosome sequence was included in 2023 using DNA from a sample commonly known as HG002 (also frequently referred to as NA24385). A few versions of the T2T reference were included in the data release to facilitate different types of analyses. These differ in a handful of ways described below.

The maskedY Reference

The pseudo-autosomal regions (PAR) are a pair of regions on chrX and chrY. Because the two copies are homologous, alignments to these regions are ambiguous and end up with much lower mapping quality on both chromosomes. Traditionally, one would mask the corresponding regions on chrY for all samples and produce diploid variant calls in the PAR on chrX. The maskedY version of the T2T reference does just this: it replaces the PAR on chrY with hard-masked N bases so users can readily align reads coming from these regions.
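Hard-masking here simply means overwriting the chrY PAR coordinates with N so aligners cannot place reads there. A sketch of the operation (the sequence and coordinates are placeholders, not the real T2T PAR boundaries):

```python
def hard_mask(seq, intervals):
    """Replace each 0-based, half-open interval of seq with 'N' bases."""
    bases = list(seq)
    for start, end in intervals:
        bases[start:end] = "N" * (end - start)
    return "".join(bases)

# Placeholder chrY with PAR-like regions at both ends.
chry = "ACGTACGTACGTACGT"
masked = hard_mask(chry, [(0, 4), (12, 16)])  # "NNNNACGTACGTNNNN"
```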

The rCRS Sequence

The T2T release includes a new mitochondrial sequence derived from the CHM13 cell line. The version we provide on Terra instead contains the old revised Cambridge Reference Sequence (rCRS) chrM, identical to the one in the human reference hg38. This is convenient for backwards compatibility with mitochondrial studies done with hg38.

The EBV Sequence

The Epstein-Barr virus (EBV) sequence is included as chrEBV in our version of the reference provided on Terra. This is consistent with our curated hg19 and hg38 reference files, as described in our reference documentation, and follows conventions from the All of Us project. This contig is useful during alignment for siphoning off viral sequence common in human samples, and helps improve data quality for samples from lymphoblastoid cell lines like the Coriell 1000 Genomes samples.

How to use the T2T Reference on Terra

You’ll find our recommended version of the T2T reference for most use cases on Terra under the name “T2T-v2.” The v2 corresponds to the T2T release v2.0, which includes the chrY sequence. 

Our version corresponds to the following choices:

  • Uses the maskedY version for generic alignment to allosomes.
  • Uses the rCRS chrM sequence for backwards compatibility with human mitochondrial studies.
  • Includes a copy of chrEBV for cleaner alignments in human samples containing the viral sequence.

So you can drop the right files into your workflows and begin testing out the improved reference immediately, we also include precomputed index files for use with BWA. For other versions of the T2T reference for more specialized applications, see the full data release.
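When staging a reference yourself, it is easy to verify that a BWA index bundle is complete: `bwa index` writes five companion files next to the FASTA. A small check in Python (the reference path is hypothetical):

```python
from pathlib import Path

# File extensions that `bwa index` appends to the reference FASTA name.
BWA_INDEX_EXTS = (".amb", ".ann", ".bwt", ".pac", ".sa")

def missing_bwa_index_files(fasta):
    """Return the names of any BWA index companion files not found on disk."""
    fasta = Path(fasta)
    return [fasta.name + ext
            for ext in BWA_INDEX_EXTS
            if not fasta.with_name(fasta.name + ext).exists()]
```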

For instructions on how to add reference files to your Terra workspace and reference them in your workflow inputs, see this guide.

Learn more about T2T with NHGRI AnVIL

The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to de novo assemble the first complete reference human genome. They share their data (T2T, chrY) and methods with the community using the NHGRI AnVIL ecosystem. In fact, Terra was used in a detailed analysis of how the new reference genome improves our understanding of human genetic variation across thousands of human samples from globally diverse ancestries. Terra was also used to evaluate variation on the Y chromosome using data from the 1000 Genomes Project and the Simons Genome Diversity Project.

We hope that by adding the T2T reference to our list of provided references in Terra, users can leverage the same cutting-edge science in their own work. See if it can help yours by checking it out on Terra/AnVIL!

Acknowledgements

Thanks to Kate Balaconis (DSP), Eric Banks (DSP), Allie Cliffe (DSP), Fabio Cunial (DSP), Kylee Degatano (DSP), Laura Gauthier (DSP), Steve Huang (DSP), Vijeta Limbekar (DSP), Karen Miga (UCSC), Sam Novod (DSP), Adam Phillippy (NHGRI), Michael Schatz (JHU), Beth Sheets (DSP), Hang Su (DSP), Nick Watts (DSP), and Jessica Way (DSP) for helpful scientific and technical input in curating the data and preparing this blog post.
