Leyla Tarhan, Author at Terra

Accessing Cancer Research Data Commons resources in Terra

Leyla Tarhan — Wed, 10 Jul 2024 08:00:00 +0000

For researchers trying to understand and treat cancer, searching for useful datasets consumes valuable time. Even after accessing a dataset, researchers must figure out how—and where—to analyze it. After all, it is no mean task to wrangle petabytes of data.

Luckily, the Cancer Research Data Commons (CRDC) addresses many of these challenges. A recent paper in Cancer Research outlines how the CRDC accelerates researchers’ work by hosting large, cancer-related datasets on the cloud. Terra is proud to support this effort: an NCI-branded version of Terra (FireCloud) is one of three platforms that both host the CRDC’s datasets and provide cloud-based analysis tools to uncover the insights in that data.

How to access CRDC data in Terra

Through Terra, researchers can access several of the CRDC’s open- and controlled-access cancer datasets. These include reference genomes and files (such as those from the 1000 Genomes Project), as well as data from The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Cell Line Encyclopedia (CCLE). These datasets are multimodal: they include genomics, proteomics, and imaging data.

Rather than analyzing each dataset in isolation, researchers can combine and subset the data to build a cohort that is appropriate for a specific project. In addition, researchers can combine the CRDC’s cancer datasets with other data that are accessible through Terra—for example, from the Analysis, Visualization, and Informatics Lab-space (AnVIL)—or with data from researchers’ own labs.
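
To make the idea of combining and subsetting concrete, here is a minimal Python sketch. It is not Terra's interface: the Sample record, the build_cohort function, and every sample ID below are hypothetical and exist only for illustration. It simply pools a few sample tables and keeps the rows that match a project's criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample record; the fields are illustrative, not a Terra schema."""
    sample_id: str
    source: str      # e.g. "TCGA", "CCLE", or "own-lab"
    disease: str
    data_type: str   # e.g. "genomics", "proteomics", "imaging"


def build_cohort(tables, disease, data_types):
    """Pool several sample tables, then keep only rows matching the criteria."""
    combined = [sample for table in tables for sample in table]
    return [
        s for s in combined
        if s.disease == disease and s.data_type in data_types
    ]


# Made-up records standing in for CRDC data and a lab's own data.
tcga = [
    Sample("T-001", "TCGA", "breast", "genomics"),
    Sample("T-002", "TCGA", "lung", "genomics"),
]
ccle = [Sample("C-001", "CCLE", "breast", "proteomics")]
own_lab = [Sample("L-001", "own-lab", "breast", "imaging")]

cohort = build_cohort(
    [tcga, ccle, own_lab],
    disease="breast",
    data_types={"genomics", "proteomics"},
)
print([s.sample_id for s in cohort])  # ['T-001', 'C-001']
```

In practice the equivalent filtering would happen on data tables in a workspace, but the logic is the same: merge the sources, then select on the attributes the project cares about.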

How to analyze cancer data in Terra

Once researchers have collected the right data for their scientific questions, Terra provides easy access to several analysis tools to begin answering those questions. These include a library of existing workflows (automated pipelines for high-throughput analyses) and interactive analysis notebooks, with tools for long- and short-read variant calling, Genome-Wide Association Studies, epigenomic and RNAseq data processing, and fusion transcript detection.

Working in the cloud offers more than access to pre-existing tools from the research community: analyzing CRDC data there is also much more lightweight than analyzing it on a personal computer or a high-performance computing (HPC) cluster. Large datasets require a specialized computational infrastructure, but there’s no need to set up or maintain this infrastructure when working in the cloud. There’s also no need to set up a data security system, because data analyzed in Terra remains safely in a FedRAMP- and FISMA-certified environment.

CRDC data in action: the Proteomic Data Commons

Researchers have already leveraged Terra’s CRDC resources to push the cancer field forward. For example, the Broad Institute’s Proteomics Platform integrated data from the Proteomic Data Commons (PDC) into Terra to make it easier for researchers to uncover cancer mechanisms and biomarkers.

The PDC is an important resource for cancer researchers because it hosts data generated by several large cancer programs—these include TCGA, TARGET, the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and the Applied Proteogenomics Organizational Learning and Outcomes (APOLLO) Network. D.R. Mani’s team at the Proteomics Platform worked with the Terra team to build a way for researchers to export data from the PDC into a Terra workspace with the click of a button.

Once the data is in Terra, it’s ready to be analyzed with a common proteomics analysis toolkit: FragPipe. To make this process even more seamless, the group set up a template workspace with step-by-step instructions for selecting PDC data and importing it into Terra. The workspace also includes a FragPipe workflow and an interactive notebook that walks researchers through how to analyze PDC data with an example FragPipe tool.

With cloud-based integrations like this uniting CRDC data and analysis tools, researchers can focus on understanding and treating cancer, rather than accessing and wrangling data.

Use Terra’s CRDC resources in your own work

To explore Terra’s CRDC resources further, register for an account and explore Terra’s NCI featured workspaces. These workspaces provide further instructions to access connected datasets and suggest tools to analyze the data. Note that you will need an eRA Commons account to access CRDC data on Terra.

Thank you to Alex Baumann, Emily LaPlante, and Katherine Thayer for their help preparing this post.

The post Accessing Cancer Research Data Commons resources in Terra appeared first on Terra.

Managing Access to UK Biobank Data on Terra

Leyla Tarhan — Mon, 13 Nov 2023 13:00:12 +0000

Having access to good data is crucial for genomics research. Without large-scale datasets, there’s little chance of uncovering the genetic underpinnings of disease. This is particularly important when studying diseases whose risk is determined by a large number of genes, each of which has a relatively small effect.

But analyzing and managing these large datasets can be challenging. This is particularly true when the data are protected, so that only approved researchers can access them. Because of these challenges, many good scientific stories happen alongside equally interesting stories about data management.

In one such story, Terra’s data management tools allowed researchers to access data from the UK Biobank. In turn, this access has fueled many innovative projects that were only possible with large-scale data.

Getting the data onto the Cloud

The UK Biobank amasses genotypic and phenotypic data from roughly half a million participants in the United Kingdom. Research groups interested in working with the data can apply for access to the dataset. In 2017, a group of Broadies across different research interests (including Krishna Aragam, Andrea Ganna, and Mary Hross) understood that having a single dataset available on the Cloud (and on the Broad Cluster) would allow more researchers to study the keys to disease, and simultaneously reduce storage costs. As a result, Ben Neale’s lab sponsored a project to undertake this work. The project’s storage is funded by Broad ITS and managed by Sam Bryant, who was a Senior Data Management Specialist at the time. Bryant has since become the Associate Director of Data Management at the Broad’s Stanley Center for Psychiatric Research.

Working with such large data was no easy task. To start, it took a long time to obtain the files — at the time, the UK Biobank limited downloads to 10 concurrent files. So, Sam Bryant got approval to access genotype and phenotype data from roughly half a million participants, then spent about two weeks downloading it onto the Broad’s on-premises cluster. From there, he uploaded the data to a Terra workspace in the Cloud.
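
For a sense of what a cap of 10 concurrent files means in practice, here is a minimal Python sketch of a bounded-concurrency download loop. It is not the tooling Bryant used, and it does not call any UK Biobank service: the fetch function only sleeps in place of a real transfer, and the file names are invented. The ConcurrencyMeter helper is likewise hypothetical and only records how many transfers overlap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 10  # the per-user limit described in the post


class ConcurrencyMeter:
    """Hypothetical helper that records the peak number of overlapping transfers."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


meter = ConcurrencyMeter()


def fetch(name):
    """Stand-in for a real download: sleeps briefly instead of doing network I/O."""
    with meter:
        time.sleep(0.01)
    return name


def download_all(names, limit=MAX_CONCURRENT):
    """Run fetch over all names with at most `limit` transfers in flight."""
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(fetch, names))  # results keep the input order


files = [f"file_{i:03d}.dat" for i in range(50)]  # invented file names
done = download_all(files)
print(len(done), "files fetched; peak concurrency:", meter.peak)
```

With a fixed pool of 10 workers, total time scales with the number of files divided by 10, which is why a dataset of this size took weeks rather than hours to retrieve.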

Managing access to the data

With these data in hand, Bryant faced a second challenge: managing who could use the data. The UK Biobank uses a careful approval process to protect its data. So now Bryant needed a way to ensure that only approved users could access the Terra dataset.

The solution was to create a Terra Group of approved researchers. Researchers who wanted to access the data — including the Neale lab’s collaborators — applied for approval from the UK Biobank, which then let Bryant know who he could add to the Terra group. Once added to the group, collaborators could access the data — from inside or outside of Terra — in order to analyze it.
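
The group-based pattern described above can be sketched in a few lines of Python. This is a toy model, not Terra's API: the AccessGroup and ControlledDataset classes, the group name, and the email addresses are all hypothetical. The point it illustrates is that access is granted to the group rather than to individuals, so approving or removing a researcher is a single membership change.

```python
class AccessGroup:
    """Hypothetical stand-in for a managed group of approved researchers."""

    def __init__(self, name):
        self.name = name
        self._members = set()

    def add(self, email):
        self._members.add(email.lower())

    def remove(self, email):
        self._members.discard(email.lower())

    def __contains__(self, email):
        return email.lower() in self._members


class ControlledDataset:
    """Hypothetical dataset whose read access is tied to one group."""

    def __init__(self, name, reader_group):
        self.name = name
        self.reader_group = reader_group

    def can_read(self, email):
        # Access follows group membership, not a per-user permission list.
        return email in self.reader_group


approved = AccessGroup("ukb-approved")  # invented group name
dataset = ControlledDataset("ukb-subset", approved)

# The data owner confirms an approval, and the manager adds the researcher.
approved.add("Researcher@Example.org")

print(dataset.can_read("researcher@example.org"))  # True
print(dataset.can_read("stranger@example.org"))    # False
```

Because the dataset only ever checks the group, the manager never has to touch the dataset's own permissions when the list of approved researchers changes.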

An alternative method: using DUOS to manage controlled data on Terra

In many cases, Bryant’s method is still the best way to share controlled-access data with collaborators. However, it works best when data managers are sharing their data with known collaborators. The Data Use Oversight System (DUOS) offers an alternative way to share controlled-access data from the Broad and the National Human Genome Research Institute (NHGRI). DUOS speeds the data-access application process by automatically updating access permissions whenever someone is approved. This system makes it easy to share data with unknown researchers, as well as collaborators. You can learn more about how DUOS makes it easier to access controlled-access data in Streamlining Data Access and in the DUOS documentation.

The data’s scientific impact

Since 2017, research groups have accessed the UK Biobank dataset on Terra to answer questions about several aspects of human health. These include coronary artery disease (Fahed et al., 2022; Patel et al., 2022; Patel et al., 2023; Dron et al., 2023; Khera et al., 2022); Alzheimer’s (Paranjpe et al., 2022); obesity (Agrawal et al., 2022); liver disease (Haas et al., 2021); heteroplasmy (Gupta et al., 2023); and clonal hematopoiesis (Brown et al., 2023). These data have also supported a growing understanding of how the human genome and phenome are structured – for example, uncovering correlations within the human phenome (Carey et al., 2022). In addition, the UK Biobank dataset has helped researchers better understand the effects of a dataset’s size and diversity on machine learning models (Cui et al., 2023; Majara et al., 2023).

How can you access this data?

The Broad’s UK Biobank dataset is still available on Terra. This dataset is a subset of the full UK Biobank data; the remainder is accessible via DNAnexus.

If you’d like to leverage this dataset for your own research, and you’ve already applied for access to the UK Biobank, follow the instructions in this document. If you have not yet applied for access to the UK Biobank, contact Sam Bryant directly at sbryant@broadinstitute.org. And to learn more about sharing controlled data on Terra, see Best practices for sharing and protecting data resources and Managing access to shared data and tools with groups.

Many thanks to Sam Bryant, Caroline Cusick, and Jonathan Lawson for their help preparing this post.


Single Cell Portal Accelerates Genomics Research with Terra

Leyla Tarhan — Mon, 28 Aug 2023 12:00:40 +0000

Single Cell Portal (SCP) is an online portal for sharing single-cell genomics data that is built on top of Terra. Through its searchable database of single-cell studies, interactive visualizations, and sharing tools, SCP supports single-cell researchers throughout the research process. SCP recently published a whitepaper that lays out how the Portal provides this support – this post summarizes the highlights.

Navigating a growing amount of data

Single-cell research is a growing field: the number of publications has grown exponentially since 2009.

These data are a powerful tool, but they also pose a challenge for researchers seeking to integrate these findings into their own work: it’s hard to sift through hundreds of papers a year to find the data that are relevant. Researchers trying to share their data may also struggle to reach the right audience.

SCP helps researchers navigate these challenges to find and share data more easily.

Sparking an idea

SCP’s database provides access to over 500 single-cell studies. Advanced search tools make it easy to zero in on the studies that are relevant to a particular research area. For example, researchers can search by disease, organ, cell type, gene, and library preparation protocol, depending on their questions.

SCP’s interactive visualizations make it easy to explore these datasets, even if researchers don’t have a computational background. This allows users to answer key questions that a static paper figure couldn’t answer, and often more quickly than reading a paper would allow.

Fine-tuning your data

SCP is also a valuable tool for collaborations between computational and non-computational researchers. For example, a computational biologist described how SCP bridges the “wetlab-drylab divide” between collaborators in Single Cell Portal for collaborations: once computationalists upload analysis results to a private SCP study, wet-lab collaborators can explore the data and provide feedback on how to adjust the analysis – without ever having to touch R or a command line.

Sharing results

Once the data are finalized, making them public on SCP increases the work’s impact. An SCP study invites other researchers to engage deeply with the data, exploring it interactively and using it as a reference for patterns observed across studies. Researchers can even download the data – which are hosted within a FISMA-compliant boundary on Terra – and integrate those data into their own analyses.

Try it out!

If you’re intrigued, you can learn more about how Single Cell Portal accelerates single-cell research by reading our whitepaper. You can also visit a demo study to try out the Portal’s interactive visualizations and check out our documentation for details on how to upload your own data to the Portal.
