# WDL Deep Technical Documentation Template

Use this template to produce a thorough technical document that describes a WDL workflow in full detail. This document is intended for developers, bioinformaticians, platform engineers, and technical reviewers who need to understand, maintain, extend, or troubleshoot the workflow.
## 1. Document Control

| Field | Details |
| --- | --- |
| Document Title | [Workflow name — Technical Documentation] |
| Document Version | [e.g. 1.0.0] |
| WDL Version | [1.0 / 1.1 / development] |
| Workflow Version | [e.g. 2.3.1 — use semantic versioning] |
| Author(s) | [Names and roles] |
| Last Updated | [YYYY-MM-DD] |
| Status | [Draft / In Review / Approved / Deprecated] |
| Repository | [URL to source repository] |
| Licence | [e.g. MIT / BSD-3 / Proprietary] |
## 2. Workflow Overview

### 2.1 Purpose

Provide a concise technical summary of what this workflow does and the problem it solves.

[Enter purpose here]
### 2.2 Workflow Identifier

| Field | Value |
| --- | --- |
| Workflow Name | [e.g. germline-variant-calling] |
| Namespace | [e.g. org.broadinstitute.pipelines] |
| Entry Point | [e.g. GermlineVariantCalling] |
| Source File | [e.g. germline_variant_calling.wdl] |
### 2.3 Changelog

| Version | Date | Author | Summary of Changes |
| --- | --- | --- | --- |
| [2.3.1] | [YYYY-MM-DD] | [Name] | [Upgraded GATK to 4.5.0] |
| [2.3.0] | [YYYY-MM-DD] | [Name] | [Added BQSR step] |
| [...] | [...] | [...] | [...] |
## 3. Architecture

### 3.1 Workflow Diagram

Include a visual representation of the workflow. Use Mermaid, draw.io, or a similar tool.

```
[Input Files] --> [Task 1: QC] --> [Task 2: Trim] --> [Task 3: Align] --> [Task 4: Sort/Mark Duplicates]
                                                                                     |
                                                                                     v
                        [Task 6: Merge VCFs] <-- [Task 5: Call Variants (scattered)]
                                 |
                                 v
                        [Task 7: Filter] --> [Final Outputs]
```

Replace the above with an accurate diagram for your workflow.
### 3.2 Workflow Structure

| Component | File | Description |
| --- | --- | --- |
| Main Workflow | [main.wdl] | [Orchestrates all tasks and sub-workflows] |
| Sub-workflow: QC | [qc.wdl] | [Quality control sub-workflow] |
| Task Library | [tasks/alignment.wdl] | [Reusable alignment tasks] |
| Struct Definitions | [structs.wdl] | [Custom struct type definitions] |
| [...] | [...] | [...] |
### 3.3 Import Dependencies

```wdl
import "tasks/alignment.wdl" as Alignment
import "tasks/variant_calling.wdl" as VariantCalling
import "structs.wdl" as Structs
```

List all imported WDL files and their purposes.

| Import Alias | Source File | Purpose |
| --- | --- | --- |
| [Alignment] | [tasks/alignment.wdl] | [BWA-MEM and post-alignment processing tasks] |
| [...] | [...] | [...] |
## 4. Input Specification

### 4.1 Workflow-Level Inputs

| Input Name | WDL Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| [sample_name] | String | Yes | — | [Unique identifier for the sample] |
| [input_bam] | File | Yes | — | [Input BAM file for processing] |
| [reference_fasta] | File | Yes | — | [Reference genome FASTA file] |
| [reference_fasta_index] | File | Yes | — | [.fai index for reference genome] |
| [call_regions] | File? | No | null | [Optional BED file to restrict calling regions] |
| [scatter_count] | Int | No | 24 | [Number of shards for parallel variant calling] |
| [...] | [...] | [...] | [...] | [...] |
### 4.2 Custom Struct Definitions

```wdl
struct SampleInfo {
  String sample_id
  String patient_id
  File input_bam
  File input_bam_index
  String? library_name
}
```

Document each custom struct used in the workflow.

| Struct Name | Field | Type | Description |
| --- | --- | --- | --- |
| [SampleInfo] | [sample_id] | String | [Unique sample identifier] |
| [SampleInfo] | [patient_id] | String | [Associated patient identifier] |
| [SampleInfo] | [input_bam] | File | [Path to input BAM] |
| [...] | [...] | [...] | [...] |
### 4.3 Example Input JSON

```json
{
  "GermlineVariantCalling.sample_name": "NA12878",
  "GermlineVariantCalling.input_bam": "gs://bucket/samples/NA12878.bam",
  "GermlineVariantCalling.reference_fasta": "gs://bucket/references/hg38.fa",
  "GermlineVariantCalling.reference_fasta_index": "gs://bucket/references/hg38.fa.fai",
  "GermlineVariantCalling.scatter_count": 24
}
```
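Inputs JSON files for Cromwell-style engines must namespace every key under the workflow entry point, as in the example above. A small pre-flight check can catch missing or mis-namespaced keys before submission; this is an illustrative sketch (the `REQUIRED_INPUTS` set and `check_inputs` helper are hypothetical, not part of any engine's API):

```python
import json

# Hypothetical pre-flight check: every key must be prefixed with the
# entry-point name ("GermlineVariantCalling", from the example above),
# and all required inputs must be present.
REQUIRED_INPUTS = {"sample_name", "input_bam", "reference_fasta", "reference_fasta_index"}

def check_inputs(inputs_json: str, workflow_name: str) -> list:
    """Return a list of problems found in the inputs JSON (empty if OK)."""
    inputs = json.loads(inputs_json)
    prefix = workflow_name + "."
    problems = []
    for key in inputs:
        if not key.startswith(prefix):
            problems.append(f"key not namespaced under {workflow_name}: {key}")
    provided = {k[len(prefix):] for k in inputs if k.startswith(prefix)}
    for missing in sorted(REQUIRED_INPUTS - provided):
        problems.append(f"missing required input: {missing}")
    return problems

example = """{
  "GermlineVariantCalling.sample_name": "NA12878",
  "GermlineVariantCalling.input_bam": "gs://bucket/samples/NA12878.bam",
  "GermlineVariantCalling.reference_fasta": "gs://bucket/references/hg38.fa",
  "GermlineVariantCalling.reference_fasta_index": "gs://bucket/references/hg38.fa.fai",
  "GermlineVariantCalling.scatter_count": 24
}"""
print(check_inputs(example, "GermlineVariantCalling"))  # → []
```

A check like this pairs well with `womtool inputs` (Section 12.2), which generates the template the keys are validated against.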
## 5. Task Specifications

Document every task in the workflow. Repeat this section for each task.

### 5.1 Task: [TaskName]

**Purpose:** [What this task does in one sentence]

#### Command Block

```wdl
command <<<
  set -euo pipefail
  ~{tool_path} \
    --input ~{input_file} \
    --output ~{output_prefix}.bam \
    --reference ~{reference_fasta} \
    --threads ~{cpu}
>>>
```
#### Inputs

| Input | WDL Type | Required | Source | Description |
| --- | --- | --- | --- | --- |
| [input_file] | File | Yes | [Workflow input / Previous task output] | [Description] |
| [...] | [...] | [...] | [...] | [...] |
#### Outputs

| Output | WDL Type | Filename Pattern | Description |
| --- | --- | --- | --- |
| [aligned_bam] | File | `~{output_prefix}.bam` | [Aligned BAM file] |
| [alignment_log] | File | `~{output_prefix}.log` | [Tool log output] |
| [...] | [...] | [...] | [...] |
#### Runtime Attributes

| Attribute | Value | Notes |
| --- | --- | --- |
| docker | [broadinstitute/gatk:4.5.0.0] | [Source and version justification] |
| cpu | [`cpu`] | [Default: 4] |
| memory | [`memory_gb + " GB"`] | [Default: 16 GB] |
| disks | [`"local-disk " + disk_size + " SSD"`] | [Calculated from input size] |
| preemptible | [`preemptible_tries`] | [Default: 2] |
| maxRetries | [`max_retries`] | [Default: 1] |
#### Disk Size Calculation

```wdl
Int disk_size = ceil(size(input_file, "GB") * 2.5) + 20
```

Explain the rationale for the disk calculation (e.g., input size × 2.5 for intermediate files + 20 GB headroom).
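The same arithmetic can be sanity-checked outside WDL before committing to a multiplier. A minimal Python mirror of the expression above (the default multiplier and headroom are taken from the template, not from any engine):

```python
import math

def disk_size_gb(input_size_gb: float, multiplier: float = 2.5, headroom_gb: int = 20) -> int:
    """Mirror of the WDL expression: ceil(size(input_file, "GB") * 2.5) + 20.

    multiplier covers intermediate files written alongside the input;
    headroom_gb absorbs reference data, logs, and container scratch space.
    """
    return math.ceil(input_size_gb * multiplier) + headroom_gb

print(disk_size_gb(40.0))  # 40 GB BAM → ceil(100.0) + 20 = 120
print(disk_size_gb(33.3))  # ceil(83.25) + 20 = 104
```

Tabulating this function over your largest expected inputs is a quick way to verify the multiplier before a production run hits a disk-exhausted failure (Section 10.1).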
[Repeat Section 5.1 for each task in the workflow]
## 6. Workflow Logic and Control Flow

### 6.1 Task Execution Order

Describe the directed acyclic graph (DAG) of task dependencies.

1. FastQC (independent — can run in parallel with step 2)
2. TrimReads
3. AlignReads (depends on: TrimReads)
4. MarkDuplicates (depends on: AlignReads)
5. ScatteredHaplotypeCaller (depends on: MarkDuplicates) [SCATTER]
6. MergeVCFs (depends on: ScatteredHaplotypeCaller) [GATHER]
7. FilterVariants (depends on: MergeVCFs)
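WDL engines derive this DAG automatically from input/output references, but it can be useful to model it explicitly when documenting which tasks may run concurrently. A sketch using Python's standard-library `graphlib`, with the dependency map transcribed from the list above:

```python
from graphlib import TopologicalSorter

# Dependency map for the example DAG above (task -> set of prerequisites).
deps = {
    "FastQC": set(),
    "TrimReads": set(),
    "AlignReads": {"TrimReads"},
    "MarkDuplicates": {"AlignReads"},
    "ScatteredHaplotypeCaller": {"MarkDuplicates"},
    "MergeVCFs": {"ScatteredHaplotypeCaller"},
    "FilterVariants": {"MergeVCFs"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    ready = sorted(ts.get_ready())
    print(ready)  # tasks in this batch have no unmet dependencies and may run in parallel
    ts.done(*ready)
```

The first batch printed is `['FastQC', 'TrimReads']`, matching the note in step 1 that FastQC runs in parallel with trimming; every later batch contains a single task because the rest of the example DAG is a chain.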
### 6.2 Scatter Operations

| Scatter Variable | Type | Source | Tasks Scattered | Gather Method |
| --- | --- | --- | --- | --- |
| [interval] | Array[File] | [SplitIntervals output] | [HaplotypeCaller] | [MergeVCFs — concatenation] |
| [...] | [...] | [...] | [...] | [...] |
### 6.3 Conditional Execution

```wdl
if (defined(call_regions)) {
  call RestrictToRegions { input: bed_file = select_first([call_regions]) }
}
```

| Condition | Evaluates | Tasks Affected | Behaviour When False |
| --- | --- | --- | --- |
| [defined(call_regions)] | [Whether BED file is provided] | [RestrictToRegions] | [Uses whole genome] |
| [...] | [...] | [...] | [...] |
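The `defined()`/`select_first()` pair in the snippet above is the standard WDL idiom for unwrapping an optional once a branch guarantees it is set. Its semantics can be paraphrased in Python, treating WDL `null` as `None` (the `plan_calling` helper is purely illustrative):

```python
from typing import Optional

def select_first(values: list):
    """Paraphrase of WDL select_first(): first non-None element, error if all are None."""
    for v in values:
        if v is not None:
            return v
    raise ValueError("select_first: all values were null")

def plan_calling(call_regions: Optional[str]) -> str:
    # Mirrors: if (defined(call_regions)) { call RestrictToRegions ... }
    if call_regions is not None:            # defined(call_regions)
        bed = select_first([call_regions])  # safe: the branch guarantees non-null
        return f"restrict calling to {bed}"
    return "call over the whole genome"

print(plan_calling("regions.bed"))  # → restrict calling to regions.bed
print(plan_calling(None))           # → call over the whole genome
```

The point of `select_first([call_regions])` in WDL is type coercion: it converts `File?` to `File` so the task input type-checks, and it is only safe because `defined()` already guarded the branch.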
## 7. Output Specification

### 7.1 Final Outputs

| Output Name | WDL Type | Source Task | File Pattern | Description |
| --- | --- | --- | --- | --- |
| [final_vcf] | File | [FilterVariants] | `~{sample_name}.filtered.vcf.gz` | [Filtered variant calls] |
| [final_vcf_index] | File | [FilterVariants] | `~{sample_name}.filtered.vcf.gz.tbi` | [VCF index file] |
| [qc_report] | File | [FastQC] | `~{sample_name}_fastqc.html` | [Quality control report] |
| [...] | [...] | [...] | [...] | [...] |
### 7.2 Output Validation

Describe how outputs can be validated for correctness.

| Output | Validation Method | Expected Result |
| --- | --- | --- |
| [final_vcf] | [bcftools stats] | [Non-zero variant count; valid VCF format] |
| [final_bam] | [samtools flagstat] | [>95% mapping rate] |
| [...] | [...] | [...] |
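Threshold checks like the >95% mapping rate above are easy to automate by parsing the tool's text report. A sketch that extracts the mapping rate from `samtools flagstat` output; the helper function and the embedded sample report are illustrative, not part of samtools:

```python
# Hypothetical validation helper: compute the mapping rate from `samtools
# flagstat` text output and compare it to the >95% threshold in the table above.
FLAGSTAT_SAMPLE = """\
100000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
97500 + 0 mapped (97.50% : N/A)
"""

def mapping_rate(flagstat_text: str) -> float:
    """Return the percentage of mapped reads parsed from flagstat text."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        if not line.strip():
            continue
        count = int(line.split()[0])  # flagstat lines start with "N + M ..."
        if "in total" in line:
            total = count
        elif " mapped (" in line and mapped is None:
            mapped = count
    if not total or mapped is None:
        raise ValueError("could not parse flagstat output")
    return 100.0 * mapped / total

rate = mapping_rate(FLAGSTAT_SAMPLE)
print(f"{rate:.2f}% mapped; passes >95% threshold: {rate > 95.0}")  # 97.50% mapped; passes >95% threshold: True
```

In a real validation step the text would come from running `samtools flagstat` on the output BAM rather than from an embedded string.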
## 8. Docker Containers and Dependencies

### 8.1 Container Inventory

| Container Image | Tag | Size | Tools Included | Used By Tasks |
| --- | --- | --- | --- | --- |
| [broadinstitute/gatk] | [4.5.0.0] | [~1.8 GB] | [GATK, Samtools, Picard] | [MarkDuplicates, HaplotypeCaller, FilterVariants] |
| [biocontainers/bwa] | [0.7.17] | [~200 MB] | [BWA] | [AlignReads] |
| [...] | [...] | [...] | [...] | [...] |
### 8.2 Container Build and Maintenance

| Field | Details |
| --- | --- |
| Dockerfile Location | [e.g. docker/ directory in repo] |
| Build Process | [e.g. CI/CD automatic build on tag push] |
| Vulnerability Scanning | [e.g. Trivy / Snyk / Manual] |
| Update Cadence | [e.g. Quarterly or on tool version bump] |
## 9. Performance Characteristics

### 9.1 Benchmarks

Provide benchmark data from representative runs.

| Dataset | Samples | Total Runtime | Total Cost (est.) | Platform |
| --- | --- | --- | --- | --- |
| [30x WGS NA12878] | [1] | [~6 hours] | [~$12 USD] | [GCP — Cromwell on Terra] |
| [30x WGS cohort] | [100] | [~18 hours] | [~$950 USD] | [GCP — Cromwell on Terra] |
| [...] | [...] | [...] | [...] | [...] |
### 9.2 Per-Task Performance Breakdown

| Task | Avg. Runtime | Avg. CPU Utilisation | Peak Memory | Avg. Disk Used |
| --- | --- | --- | --- | --- |
| [AlignReads] | [45 min] | [85%] | [14 GB] | [80 GB] |
| [HaplotypeCaller] | [30 min/shard] | [70%] | [6 GB] | [10 GB] |
| [...] | [...] | [...] | [...] | [...] |
### 9.3 Scaling Considerations

Describe how the workflow scales with increasing data volume, sample count, or complexity.

- [e.g. Cost scales linearly with sample count, while wall-clock time stays roughly flat thanks to per-sample scatter parallelism]
- [e.g. Memory for joint genotyping scales with cohort size — recommend increasing memory for >500 samples]
- [...]
## 10. Error Handling and Troubleshooting

### 10.1 Common Failure Modes

| Error Symptom | Root Cause | Resolution |
| --- | --- | --- |
| [Task fails with OOM (exit code 137)] | [Insufficient memory allocation] | [Increase memory_gb input parameter] |
| [Non-zero exit from tool X] | [Corrupt or truncated input file] | [Verify input file integrity; re-upload if needed] |
| [Disk space exhausted] | [Disk multiplier too low for large inputs] | [Increase disk_multiplier parameter] |
| [...] | [...] | [...] |
### 10.2 Retry Logic

| Task | Max Retries | Preemptible Tries | Retry Behaviour |
| --- | --- | --- | --- |
| [AlignReads] | [1] | [2] | [Retries on preemption; fails on tool error] |
| [...] | [...] | [...] | [...] |
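When choosing a `preemptible` value, it can help to estimate the expected cost of the preemptible-then-on-demand strategy. The sketch below is an illustrative model only: the 30% discount, the per-attempt preemption probability, and the simplification that a preempted attempt is billed in full are all assumptions, not provider billing rules.

```python
def expected_cost(base_cost: float, preempt_prob: float, preemptible_tries: int,
                  preemptible_discount: float = 0.3) -> float:
    """Illustrative model: each preemptible attempt costs base_cost * discount
    and is preempted with probability preempt_prob; after exhausting the
    preemptible tries, the task runs once on-demand at full price."""
    cost = 0.0
    p_reach = 1.0  # probability execution reaches this attempt
    for _ in range(preemptible_tries):
        cost += p_reach * base_cost * preemptible_discount
        p_reach *= preempt_prob  # continue only if this attempt was preempted
    cost += p_reach * base_cost  # on-demand fallback
    return cost

# 2 preemptible tries at 30% of on-demand price, 20% preemption rate:
print(round(expected_cost(1.0, 0.2, 2), 4))  # → 0.4 (0.3 + 0.2*0.3 + 0.04*1.0)
```

Under these assumptions, two preemptible tries cut the expected spend to roughly 40% of a straight on-demand run, which is why `preemptible: 2` is a common default for short, restartable tasks.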
### 10.3 Log File Locations

| Log Type | Location / Pattern | Description |
| --- | --- | --- |
| [stdout] | [execution/stdout] | [Standard output from command block] |
| [stderr] | [execution/stderr] | [Standard error — primary debugging log] |
| [tool-specific] | [`~{sample_name}.tool.log`] | [Detailed tool-level logging] |
## 11. Platform-Specific Configuration

### 11.1 Cromwell

```json
{
  "backend": "PAPIv2",
  "options": {
    "jes_gcs_root": "gs://bucket/cromwell-executions",
    "default_runtime_attributes": {
      "zones": "us-central1-a us-central1-b",
      "preemptible": 2
    }
  }
}
```
### 11.2 miniWDL

```ini
[scheduler]
container_backend=docker

[docker]
image_cache=/tmp/miniwdl_cache
```
### 11.3 Terra / DNAnexus / AWS HealthOmics

Include any platform-specific notes, workspace setup instructions, or configuration overrides.

| Platform | Configuration Notes |
| --- | --- |
| [Terra] | [Upload inputs JSON via workspace Data tab; configure method with this WDL] |
| [DNAnexus] | [Compile with dxCompiler v2.x; set instance types in extras.json] |
| [AWS HealthOmics] | [Package as a private workflow; configure ECR container references] |
## 12. Testing

### 12.1 Test Strategy

| Test Type | Description | Data | Expected Outcome |
| --- | --- | --- | --- |
| Unit Test | [Individual task validation] | [Minimal synthetic inputs] | [Correct output format and content] |
| Integration Test | [Full workflow end-to-end] | [Downsampled real data (~1 GB)] | [Workflow completes; outputs match truth set] |
| Regression Test | [Compare outputs across versions] | [Frozen test dataset] | [Outputs are identical or within tolerance] |
| Scale Test | [Run at production volume] | [Full-size production data] | [Completes within time/cost budget] |
### 12.2 Validation Commands

```shell
# Validate WDL syntax
womtool validate workflow.wdl

# Generate an inputs template
womtool inputs workflow.wdl > inputs.json

# Run the workflow locally with Cromwell
java -jar cromwell.jar run workflow.wdl -i inputs.json --options options.json
```
## 13. Security and Compliance

| Field | Details |
| --- | --- |
| Data Classification | [Public / Internal / Confidential / Restricted] |
| Encryption at Rest | [e.g. GCS default encryption / Customer-managed keys] |
| Encryption in Transit | [e.g. TLS 1.2+] |
| Access Controls | [e.g. IAM roles, service accounts, VPC-SC] |
| Audit Logging | [e.g. Cloud Audit Logs enabled] |
| Compliance Frameworks | [e.g. HIPAA BAA in place / GDPR DPA signed / GxP validated] |
| Data Residency | [e.g. All processing in us-central1] |
## 14. Maintenance and Support

| Field | Details |
| --- | --- |
| Owning Team | [Team name and contact] |
| Support Channel | [e.g. Slack #wdl-support / JIRA project XYZ] |
| On-Call Rotation | [e.g. PagerDuty schedule link] |
| Review Cadence | [e.g. Quarterly review of tool versions and performance] |
| Deprecation Policy | [e.g. Prior versions supported for 6 months after new release] |
## Appendix A: Complete Input Reference

An auto-generated or manually maintained complete list of every input parameter with full descriptions, types, defaults, and valid ranges.
## Appendix B: Glossary

| Term | Definition |
| --- | --- |
| WDL | Workflow Description Language — a specification for describing data processing workflows |
| Scatter | A WDL construct that parallelises a task across an array of inputs |
| Gather | The implicit collection of scattered task outputs back into an array |
| Preemptible | A cloud VM instance that can be reclaimed by the provider at any time, offered at reduced cost |
| [...] | [...] |
## Appendix C: References