Why the new and improved WDL docs deserve a second look

Kaylee Mathews — Mon, 10 Apr 2023 13:14:42 +0000

Have you ever found yourself (desperately?) searching for WDL (Workflow Description Language) documentation that suits your needs and coming up short? I have. I was in that boat a year ago when I joined the Broad. I had never heard of WDL before, but I was starting to document WDL scripts for the Pipeline Development team, so I needed to learn quickly. In my search, I found several resources that I would later realize are quite helpful. But at that time I was looking for something really basic to get me started and what I was finding felt a bit too advanced. My team, User Education, created some great basic resources a number of years ago, but unfortunately, they had become out of date.

I remember being really frustrated; the only resources that seemed geared toward beginners weren’t usable because they hadn’t been updated in years. And I’m not the only person who felt this way. In July 2022, a survey of workflow users in Terra revealed that more than 75% were very or somewhat interested in learning more about coding in WDL, and that 50% had used or wanted to use those introductory materials to learn WDL but weren’t able to because the documentation was outdated. Users also told us they wanted to see more tutorials and example workflows. Without these resources, they were left with limited options for learning WDL.

It’s because of that user feedback and my own experience that I’m especially excited to announce that my team recently finished a total overhaul of the introductory WDL materials. The updated resources include:

A new wdl-docs GitHub repository with a section dedicated to resources created by the WDL community.
A new wdl-docs website to host the documentation from the new wdl-docs GitHub repository.
Updates to all existing WDL syntax documentation to match the WDL 1.0 spec. You can now find the docs on the new wdl-docs GitHub repository.
17 new articles: 11 cookbook-style docs that teach users about specific use cases and provide example workflows, and 6 best practices docs that help users understand some of the grayer areas of coding in WDL, like what to consider when trying to optimize costs. Thanks to the Pipeline Development team for help developing these resources.

We hope this documentation will be useful to both new and more experienced WDL users, and that the WDL community will help us continue to improve it. The wdl-docs repository and website work best with community contributions! If you have a fix you’d like to make, a new doc you’d like to contribute or suggest, or a resource you’d like to highlight, please do so! You can check out the Home page of the wdl-docs website or the contributing guide to read about the contribution process.

Additional Resources

If you want to get started using WDL on Terra, check out our Intro to workflows on Terra Leanpub course or our T101 – Workflows Quickstart Guide and the accompanying workspace.

You can also check out the recording, materials, and the WDL-puzzles workspace from a previous WDL workshop where participants learned how to write a basic WDL workflow, bring the workflow into a Terra workspace, and run the workflow in Terra.

The post Why the new and improved WDL docs deserve a second look appeared first on Terra.

Paper Spotlight: Phenotype and genetic analysis of data collected within the first year of NeuroDev

Kaylee Mathews — Thu, 22 Dec 2022 18:57:53 +0000

This blog is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied.

Phenotype and genetic analysis of data collected within the first year of NeuroDev: A Pilot Study

By Patricia Kipkemoi, Heesu Ally Kim, Bjorn Christ, Emily O’Heir, Jake Allen, Christina Austin-Tse, Samantha Baxter, Harrison Brand, Sam Bryant, Nick Buser, Victoria de Menil, Emma Eastman, Alice Galvin, Martha Kombe, Collins Kipkoech, Alysia Lovgren, Daniel G. MacArthur, Brigitte Melly, Katini Mwangasha, Alba Sanchis-Juan, Moriel Singer-Berk, Michael E. Talkowski, Grace VanNoy, Celia van der Merwe, The NeuroDev Project, Charles Newton, Anne O’Donnell-Luria, Amina Abubakar, Kirsten A Donald, and Elise Robinson

Preprint in medRxiv (2022) https://doi.org/10.1101/2022.08.22.22278891

Abstract: Genetic association studies have made significant contributions to our understanding of the aetiology of neurodevelopmental disorders (NDDs). However, the vast majority of these studies have focused on populations of European ancestry, and few include individuals from the African continent. The NeuroDev project aims to address this diversity gap through detailed phenotypic and genetic characterization of children with NDDs from Kenya and South Africa. Here we present results from NeuroDev’s first year of data collection, including phenotype data from 206 cases and clinical genetic analysis of 99 parent-child trios. The majority of the cases met criteria for global developmental delay/intellectual disability (GDD/ID, 80.3%). Approximately half of the children with GDD/ID also met criteria for autism spectrum disorders (ASD), and 14.6% met criteria for ASD alone. Analysis of exome sequencing data identified a pathogenic or likely pathogenic variant in 13 (17%) of the 75 cases from South Africa and 9 (38%) of the 24 cases from Kenya. Candidate novel disease gene variants in 7 total cases were matched through MatchMaker Exchange. Data from the trio pilot cases has already been made publicly available, and the NeuroDev project will continue to develop resources for the global genetics community.

What part of the work was done in Terra?

Excerpts from the paper’s Methods section:

Genetic analyses of the trio pilot data

In brief, exome sequencing was performed on each of the trios, and the data was uploaded to the seqr platform for analysis. […]

Exome Sequencing & Data Processing Methods

Exome data was processed through a pipeline based on Picard, using base quality score recalibration and local realignment at known indels, aligned to the human genome build 38 using BWA, and jointly analyzed for single nucleotide variants (SNVs) and insertions/deletions (indels) using Genome Analysis Toolkit (GATK) Haplotype Caller package version 4.0.10.1. […] Basic functional annotation will be performed using Variant Effect Predictor (VEP), and then the joint variant call file will be uploaded to the seqr platform for further annotation and analysis.

Copy-number variants (CNVs) were discovered from the exome sequencing data following GATK-gCNV best practices. Read coverage was calculated for each exome using GATK CollectReadCounts. After coverage collection, all samples were subdivided into batches for gCNV model training and execution; these batches were determined based on a principal components analysis (PCA) of sequencing read counts. After batching, one gCNV model was trained per batch using GATK GermlineCNVCaller on a subset of training samples, and the trained model was then applied to call CNVs for each sample per batch. Finally, all raw CNVs were aggregated and post-processed using quality- and frequency-based filtering to produce a final CNV callset.

Exome Sequencing Data Analysis Process

Upon completion of data generation, both the SNV/indel and CNV callsets were uploaded to seqr, the centralized genomic analysis platform used by the Broad Institute’s Center for Mendelian Genomics (CMG). […]

How did they do it?

Automated workflows

The authors processed exome sequencing data and discovered CNVs using the GATK Best Practices GermlineCNVCaller workflow in Terra. The workflow is written in the Workflow Description Language (WDL) and is available in the Broad Methods Repository and in a public Terra workspace, where you can test it out on example data at very low cost.
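To make the last step of the CNV discovery described in the Methods excerpt more concrete (aggregating raw CNVs and applying quality- and frequency-based filtering), here is a minimal Python sketch. It is not the authors' code and not part of the GATK workflow: the `filter_cnvs` function, the record fields, the exact-coordinate matching, and both thresholds are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical thresholds chosen only for illustration; not values from the paper.
MIN_QUALITY = 50.0      # minimum call quality score to keep a CNV
MAX_FREQUENCY = 0.25    # keep CNVs carried by at most 25% of samples


def filter_cnvs(calls, n_samples, min_quality=MIN_QUALITY, max_frequency=MAX_FREQUENCY):
    """Apply a quality filter, then a cohort-frequency filter, to CNV calls.

    Each call is a dict with keys: sample, chrom, start, end, kind, quality.
    Two calls count as the same CNV only if chrom, start, end, and kind match
    exactly, a simplification of the overlap-based matching a real pipeline
    would likely use.
    """
    # Quality filter first, so low-quality calls do not inflate frequencies.
    passing = [c for c in calls if c["quality"] >= min_quality]

    # Count the distinct samples carrying each CNV.
    carriers = Counter()
    seen = set()
    for c in passing:
        key = (c["chrom"], c["start"], c["end"], c["kind"])
        if (key, c["sample"]) not in seen:
            seen.add((key, c["sample"]))
            carriers[key] += 1

    # Frequency filter: drop CNVs that are too common in the cohort.
    return [
        c for c in passing
        if carriers[(c["chrom"], c["start"], c["end"], c["kind"])] / n_samples <= max_frequency
    ]


# Made-up example calls from a cohort of four samples.
example_calls = [
    {"sample": "s1", "chrom": "chr1", "start": 100, "end": 200, "kind": "DEL", "quality": 80.0},
    {"sample": "s2", "chrom": "chr1", "start": 100, "end": 200, "kind": "DEL", "quality": 90.0},
    {"sample": "s3", "chrom": "chr2", "start": 500, "end": 900, "kind": "DUP", "quality": 70.0},
    {"sample": "s4", "chrom": "chr3", "start": 10, "end": 20, "kind": "DEL", "quality": 20.0},
]

final_callset = filter_cnvs(example_calls, n_samples=4)
```

In this toy cohort, the chr1 deletion is dropped because two of the four samples carry it, the chr3 deletion is dropped for low quality, and only the chr2 duplication remains in `final_callset`.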

Terra also supports importing workflows from Dockstore, a free and open-source platform for sharing reusable and scalable analytical tools and workflows.

To try your hand at running a workflow in Terra, check out this Quickstart Tutorial Workspace.

Variant analysis

The authors further analyzed the resulting variant calls using seqr, an intuitive browser-based system for analyzing rare disease exome and genome data on a family basis, which is available on AnVIL powered by Terra.

To learn more about seqr, watch the video playlist showcasing seqr functionality, then check out how to use seqr through Terra.

Appendix: Data and code availability

– Controlled-access genetic and phenotypic data collected during the course of this study are available via the National Human Genome Research Institute (NHGRI) Analysis Visualization and Informatics Lab-space (AnVIL) platform.

The post Paper Spotlight: Phenotype and genetic analysis of data collected within the first year of NeuroDev appeared first on Terra.
