
WDL Deep Technical Documentation Template

Advanced · WDL Templates · 2026-03-19


Use this template to produce a thorough technical document that describes a WDL workflow in full detail. This document is intended for developers, bioinformaticians, platform engineers, and technical reviewers who need to understand, maintain, extend, or troubleshoot the workflow.


1. Document Control

Field Details
Document Title [Workflow name — Technical Documentation]
Document Version [e.g. 1.0.0]
WDL Version [1.0 / 1.1 / development]
Workflow Version [e.g. 2.3.1 — use semantic versioning]
Author(s) [Names and roles]
Last Updated [YYYY-MM-DD]
Status [Draft / In Review / Approved / Deprecated]
Repository [URL to source repository]
Licence [e.g. MIT / BSD-3 / Proprietary]

2. Workflow Overview

2.1 Purpose

Provide a concise technical summary of what this workflow does and the problem it solves.

[Enter purpose here]

2.2 Workflow Identifier

Field Value
Workflow Name [e.g. germline-variant-calling]
Namespace [e.g. org.broadinstitute.pipelines]
Entry Point [e.g. GermlineVariantCalling]
Source File [e.g. germline_variant_calling.wdl]

2.3 Changelog

Version Date Author Summary of Changes
[2.3.1] [YYYY-MM-DD] [Name] [Upgraded GATK to 4.5.0]
[2.3.0] [YYYY-MM-DD] [Name] [Added BQSR step]
[...] [...] [...] [...]

3. Architecture

3.1 Workflow Diagram

Include a visual representation of the workflow. Use Mermaid, draw.io, or a similar tool.

[Input Files] --> [Task 1: QC] --> [Task 2: Trim] --> [Task 3: Align] --> [Task 4: Sort/Mark Duplicates]
                                                                                      |
                                                                                      v
                                                      [Task 5: Call Variants (scattered)]
                                                              |
                                                              v
                                                      [Task 6: Merge VCFs] --> [Task 7: Filter] --> [Final Outputs]

Replace the above with an accurate diagram for your workflow.

3.2 Workflow Structure

Component File Description
Main Workflow [main.wdl] [Orchestrates all tasks and sub-workflows]
Sub-workflow: QC [qc.wdl] [Quality control sub-workflow]
Task Library [tasks/alignment.wdl] [Reusable alignment tasks]
Struct Definitions [structs.wdl] [Custom struct type definitions]
[...] [...] [...]

3.3 Import Dependencies

import "tasks/alignment.wdl" as Alignment
import "tasks/variant_calling.wdl" as VariantCalling
import "structs.wdl" as Structs

List all imported WDL files and their purposes.

Import Alias Source File Purpose
[Alignment] [tasks/alignment.wdl] [BWA-MEM and post-alignment processing tasks]
[...] [...] [...]

4. Input Specification

4.1 Workflow-Level Inputs

Input Name WDL Type Required Default Description
[sample_name] String Yes n/a [Unique identifier for the sample]
[input_bam] File Yes n/a [Input BAM file for processing]
[reference_fasta] File Yes n/a [Reference genome FASTA file]
[reference_fasta_index] File Yes n/a [.fai index for reference genome]
[call_regions] File? No null [Optional BED file to restrict calling regions]
[scatter_count] Int No 24 [Number of shards for parallel variant calling]
[...] [...] [...] [...] [...]
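For reference, inputs like those in the table above would be declared in the workflow's input block roughly as follows (a sketch using the example names from the table):

```wdl
workflow GermlineVariantCalling {
    input {
        String sample_name
        File input_bam
        File reference_fasta
        File reference_fasta_index
        File? call_regions         # optional: no default required
        Int scatter_count = 24     # optional, with a declared default
    }
    # ... calls ...
}
```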

4.2 Custom Struct Definitions

struct SampleInfo {
    String sample_id
    String patient_id
    File input_bam
    File input_bam_index
    String? library_name
}

Document each custom struct used in the workflow.

Struct Name Field Type Description
[SampleInfo] [sample_id] String [Unique sample identifier]
[SampleInfo] [patient_id] String [Associated patient identifier]
[SampleInfo] [input_bam] File [Path to input BAM]
[...] [...] [...] [...]
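A struct such as SampleInfo can then be accepted as a typed input and its members accessed with dot notation; the sketch below is illustrative (the workflow and task names are hypothetical):

```wdl
workflow ProcessCohort {
    input {
        Array[SampleInfo] samples
    }

    scatter (sample in samples) {
        call AlignReads {
            input:
                sample_id = sample.sample_id,
                input_bam = sample.input_bam
        }
    }
}
```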

4.3 Example Input JSON

{
  "GermlineVariantCalling.sample_name": "NA12878",
  "GermlineVariantCalling.input_bam": "gs://bucket/samples/NA12878.bam",
  "GermlineVariantCalling.reference_fasta": "gs://bucket/references/hg38.fa",
  "GermlineVariantCalling.reference_fasta_index": "gs://bucket/references/hg38.fa.fai",
  "GermlineVariantCalling.scatter_count": 24
}

5. Task Specifications

Document every task in the workflow. Repeat this section for each task.

5.1 Task: [TaskName]

Purpose: [What this task does in one sentence]

Command Block

command <<<
    set -euo pipefail

    ~{tool_path} \
        --input ~{input_file} \
        --output ~{output_prefix}.bam \
        --reference ~{reference_fasta} \
        --threads ~{cpu}
>>>

Inputs

Input WDL Type Required Source Description
[input_file] File Yes [Workflow input / Previous task output] [Description]
[...] [...] [...] [...] [...]

Outputs

Output WDL Type Filename Pattern Description
[aligned_bam] File ~{output_prefix}.bam [Aligned BAM file]
[alignment_log] File ~{output_prefix}.log [Tool log output]
[...] [...] [...] [...]

Runtime Attributes

Attribute Value Notes
docker [broadinstitute/gatk:4.5.0.0] [Source and version justification]
cpu [~{cpu}] [Default: 4]
memory [~{memory_gb} + " GB"] [Default: 16 GB]
disks ["local-disk " + disk_size + " SSD"] [Calculated from input size]
preemptible [~{preemptible_tries}] [Default: 2]
maxRetries [~{max_retries}] [Default: 1]
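Collected into a task, these attributes form a runtime block along the following lines (a sketch; the variable names mirror the table above):

```wdl
runtime {
    docker: "broadinstitute/gatk:4.5.0.0"
    cpu: cpu
    memory: memory_gb + " GB"
    disks: "local-disk " + disk_size + " SSD"
    preemptible: preemptible_tries
    maxRetries: max_retries
}
```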

Disk Size Calculation

Int disk_size = ceil(size(input_file, "GB") * 2.5) + 20

Explain the rationale for the disk calculation (e.g., input size × 2.5 to accommodate intermediate files, plus 20 GB of headroom).
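When documenting the rationale, it can help to show the arithmetic for a concrete input size. The sketch below mirrors the WDL expression in Python so the numbers can be checked outside the workflow:

```python
import math

def disk_size_gb(input_size_gb: float, multiplier: float = 2.5, headroom_gb: int = 20) -> int:
    """Mirror of the WDL expression: ceil(size(input_file, "GB") * multiplier) + headroom."""
    return math.ceil(input_size_gb * multiplier) + headroom_gb

# A 50 GB BAM needs ceil(50 * 2.5) + 20 = 145 GB of local disk.
print(disk_size_gb(50))
```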


[Repeat Section 5.1 for each task in the workflow]


6. Workflow Logic and Control Flow

6.1 Task Execution Order

Describe the directed acyclic graph (DAG) of task dependencies.

1. FastQC (independent — can run in parallel with step 2)
2. TrimReads
3. AlignReads (depends on: TrimReads)
4. MarkDuplicates (depends on: AlignReads)
5. ScatteredHaplotypeCaller (depends on: MarkDuplicates) [SCATTER]
6. MergeVCFs (depends on: ScatteredHaplotypeCaller) [GATHER]
7. FilterVariants (depends on: MergeVCFs)

6.2 Scatter Operations

Scatter Variable Type Source Tasks Scattered Gather Method
[interval] Array[File] [SplitIntervals output] [HaplotypeCaller] [MergeVCFs — concatenation]
[...] [...] [...] [...] [...]
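A scatter over intervals with an implicit gather typically looks like this (task and output names are illustrative, mirroring the example row above):

```wdl
scatter (interval in SplitIntervals.interval_files) {
    call HaplotypeCaller {
        input:
            interval = interval,
            input_bam = MarkDuplicates.output_bam
    }
}

# Referencing a scattered call's output outside the scatter block
# implicitly gathers it into an Array[File].
call MergeVCFs {
    input: input_vcfs = HaplotypeCaller.output_vcf
}
```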

6.3 Conditional Execution

if (defined(call_regions)) {
    call RestrictToRegions { input: bed_file = select_first([call_regions]) }
}

Condition Evaluates Tasks Affected Behaviour When False
[defined(call_regions)] [Whether BED file is provided] [RestrictToRegions] [Uses whole genome]
[...] [...] [...] [...]
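Downstream code sees a conditionally produced output as an optional type, and select_first supplies the fallback. A sketch (the output and variable names below are hypothetical):

```wdl
# Outside the if-block, the call output has type File?.
# select_first() returns the first defined value in the array.
File effective_regions = select_first([RestrictToRegions.restricted_bed, whole_genome_bed])
```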

7. Output Specification

7.1 Final Outputs

Output Name WDL Type Source Task File Pattern Description
[final_vcf] File [FilterVariants] ~{sample_name}.filtered.vcf.gz [Filtered variant calls]
[final_vcf_index] File [FilterVariants] ~{sample_name}.filtered.vcf.gz.tbi [VCF index file]
[qc_report] File [FastQC] ~{sample_name}_fastqc.html [Quality control report]
[...] [...] [...] [...] [...]

7.2 Output Validation

Describe how outputs can be validated for correctness.

Output Validation Method Expected Result
[final_vcf] [bcftools stats] [Non-zero variant count; valid VCF format]
[final_bam] [samtools flagstat] [>95% mapping rate]
[...] [...] [...]
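bcftools stats remains the authoritative validation method. As a lightweight pre-check that a .vcf.gz output is BGZF-compressed (a prerequisite for tabix indexing), a sketch like the following can be used; the function name is illustrative:

```python
BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic bytes plus the FEXTRA flag, as written by bgzip

def looks_like_bgzf(path: str) -> bool:
    """Quick sanity check that a .vcf.gz file is BGZF-compressed,
    before running the full bcftools stats validation."""
    with open(path, "rb") as fh:
        return fh.read(4) == BGZF_MAGIC
```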

8. Docker Containers and Dependencies

8.1 Container Inventory

Container Image Tag Size Tools Included Used By Tasks
[broadinstitute/gatk] [4.5.0.0] [~1.8 GB] [GATK, Samtools, Picard] [MarkDuplicates, HaplotypeCaller, FilterVariants]
[biocontainers/bwa] [0.7.17] [~200 MB] [BWA] [AlignReads]
[...] [...] [...] [...] [...]

8.2 Container Build and Maintenance

Field Details
Dockerfile Location [e.g. docker/ directory in repo]
Build Process [e.g. CI/CD automatic build on tag push]
Vulnerability Scanning [e.g. Trivy / Snyk / Manual]
Update Cadence [e.g. Quarterly or on tool version bump]

9. Performance Characteristics

9.1 Benchmarks

Provide benchmark data from representative runs.

Dataset Samples Total Runtime Total Cost (est.) Platform
[30x WGS NA12878] [1] [~6 hours] [~$12 USD] [GCP — Cromwell on Terra]
[30x WGS cohort] [100] [~18 hours] [~$950 USD] [GCP — Cromwell on Terra]
[...] [...] [...] [...] [...]

9.2 Per-Task Performance Breakdown

Task Avg. Runtime Avg. CPU Utilisation Peak Memory Avg. Disk Used
[AlignReads] [45 min] [85%] [14 GB] [80 GB]
[HaplotypeCaller] [30 min/shard] [70%] [6 GB] [10 GB]
[...] [...] [...] [...] [...]

9.3 Scaling Considerations

Describe how the workflow scales with increasing data volume, sample count, or complexity.

  • [e.g. Runtime scales linearly with sample count due to scatter parallelism]
  • [e.g. Memory for joint genotyping scales with cohort size — recommend increasing memory for >500 samples]
  • [...]

10. Error Handling and Troubleshooting

10.1 Common Failure Modes

Error Symptom Root Cause Resolution
[Task fails with OOM (exit code 137)] [Insufficient memory allocation] [Increase memory_gb input parameter]
[Non-zero exit from tool X] [Corrupt or truncated input file] [Verify input file integrity; re-upload if needed]
[Disk space exhausted] [Disk multiplier too low for large inputs] [Increase disk_multiplier parameter]
[...] [...] [...]
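Where the backend is Cromwell, OOM failures can optionally be retried with increased memory via the memory_retry_multiplier workflow option; this is a sketch and assumes the Cromwell server has memory-retry error keys configured:

```json
{
  "memory_retry_multiplier": 1.5
}
```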

10.2 Retry Logic

Task Max Retries Preemptible Tries Retry Behaviour
[AlignReads] [1] [2] [Retries on preemption; fails on tool error]
[...] [...] [...] [...]

10.3 Log File Locations

Log Type Location / Pattern Description
[stdout] [execution/stdout] [Standard output from command block]
[stderr] [execution/stderr] [Standard error — primary debugging log]
[tool-specific] [~{sample_name}.tool.log] [Detailed tool-level logging]

11. Platform-Specific Configuration

11.1 Cromwell

{
  "backend": "PAPIv2",
  "options": {
    "jes_gcs_root": "gs://bucket/cromwell-executions",
    "default_runtime_attributes": {
      "zones": "us-central1-a us-central1-b",
      "preemptible": 2
    }
  }
}

11.2 miniWDL

[scheduler]
container_backend=docker

[docker]
image_cache=/tmp/miniwdl_cache

11.3 Terra / DNAnexus / AWS HealthOmics

Include any platform-specific notes, workspace setup instructions, or configuration overrides.

Platform Configuration Notes
[Terra] [Upload inputs JSON via workspace Data tab; configure method with this WDL]
[DNAnexus] [Compile with dxCompiler v2.x; set instance types in extras.json]
[AWS HealthOmics] [Package as a private workflow; configure ECR container references]

12. Testing

12.1 Test Strategy

Test Type Description Data Expected Outcome
Unit Test [Individual task validation] [Minimal synthetic inputs] [Correct output format and content]
Integration Test [Full workflow end-to-end] [Downsampled real data (~1 GB)] [Workflow completes; outputs match truth set]
Regression Test [Compare outputs across versions] [Frozen test dataset] [Outputs are identical or within tolerance]
Scale Test [Run at production volume] [Full-size production data] [Completes within time/cost budget]

12.2 Validation Commands

# Validate WDL syntax
womtool validate workflow.wdl

# Generate inputs template
womtool inputs workflow.wdl > inputs.json

# Run locally with Cromwell (note: this executes the workflow; syntax-only checks use womtool validate)
java -jar cromwell.jar run workflow.wdl -i inputs.json --options options.json

13. Security and Compliance

Field Details
Data Classification [Public / Internal / Confidential / Restricted]
Encryption at Rest [e.g. GCS default encryption / Customer-managed keys]
Encryption in Transit [e.g. TLS 1.2+]
Access Controls [e.g. IAM roles, service accounts, VPC-SC]
Audit Logging [e.g. Cloud Audit Logs enabled]
Compliance Frameworks [e.g. HIPAA BAA in place / GDPR DPA signed / GxP validated]
Data Residency [e.g. All processing in us-central1]

14. Maintenance and Support

Field Details
Owning Team [Team name and contact]
Support Channel [e.g. Slack #wdl-support / JIRA project XYZ]
On-Call Rotation [e.g. PagerDuty schedule link]
Review Cadence [e.g. Quarterly review of tool versions and performance]
Deprecation Policy [e.g. Prior versions supported for 6 months after new release]

Appendix A: Complete Input Reference

Auto-generated or manually maintained complete list of every input parameter with full descriptions, types, defaults, and valid ranges.

Appendix B: Glossary

Term Definition
WDL Workflow Description Language — a specification for describing data processing workflows
Scatter A WDL construct that parallelises a task across an array of inputs
Gather The implicit collection of scattered task outputs back into an array
Preemptible A cloud VM instance that can be reclaimed by the provider at any time, offered at reduced cost
[...] [...]

Appendix C: References