Workflow orchestration build vs buy should be decided on three numbers: your 3‑year engineering budget for runtime code, the monthly state-transition volume, and how many months of durable timers you need. Buy until those three values cross an inflection point where a dedicated runtime team is cheaper than vendor fees plus integration.
A single senior backend engineer in the U.S. costs roughly $180,000/year fully loaded. A 3‑engineer team running an in-house orchestrator for three years is therefore about $1.62M in salary alone. Compare that to managed service bills: AWS Step Functions at 10M state transitions/month costs roughly $250/month; paid orchestration platforms for production clusters commonly land between $40k and $300k/year depending on throughput and retention.
Direct answer: for most teams, the answer to workflow orchestration build vs buy is buy. If you run under ~100M state transitions/month, don't require sub-50ms task dispatch latency, and your workflows need less than 3 months of durable retention, a managed product (AWS Step Functions, Temporal Cloud, Dagster Cloud, or MWAA) is cheaper and safer. Build only when you need fine-grained control or portability, or when you expect >$1M/yr in ops spend on orchestration.
Workflow orchestration build vs buy: TCO, latency, and lock-in
TCO arithmetic is simple and decisive. A 3‑engineer team over 3 years costs $1.62M in salaries. Add infra (3–6 EC2 instances, storage, HA, backups) and you cross $1.8M. Temporal Cloud or Dagster Cloud at mature scale runs between $40k and $300k/year depending on throughput and retention. AWS Step Functions at 10M transitions/month costs ~$250/month; at 300M transitions/month it costs ~$7,500/month or $90k/year.
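The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the article's round numbers; the $25-per-million rate is simply what "$250 for 10M transitions" implies, and the function names are illustrative, not part of any vendor SDK.

```python
# Round numbers taken from the text; treat them as assumptions, not quotes.
ENGINEER_COST_PER_YEAR = 180_000       # fully loaded senior backend engineer
DOLLARS_PER_MILLION_TRANSITIONS = 25   # implied by ~$250 for 10M transitions


def build_salary_cost(engineers: int, years: int) -> int:
    """Salary-only cost of an in-house orchestrator team (no infra)."""
    return engineers * years * ENGINEER_COST_PER_YEAR


def managed_monthly_cost(transitions_per_month: int) -> float:
    """Linear per-transition estimate; ignores free tiers and other charges."""
    return transitions_per_month / 1_000_000 * DOLLARS_PER_MILLION_TRANSITIONS


print(build_salary_cost(engineers=3, years=3))     # 1620000
print(managed_monthly_cost(10_000_000))            # 250.0
print(managed_monthly_cost(300_000_000) * 12)      # 90000.0
```

Swapping in your own loaded salary and your vendor's actual rate card is the point of keeping the constants at the top.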
Latency and SLOs are a second axis. AWS Step Functions Standard workflows introduce tens to hundreds of milliseconds per state transition; Step Functions Express targets sub-100ms for simple flows. Temporal and self-hosted Cadence give sub-20ms local scheduling in optimized deployments, but they need deliberate provisioning to hold that latency at the 99.99th percentile across regions. If you need sub-50ms task dispatch at the 99.9th percentile, you're in the 'build' lane or you must pay for high-end managed clusters.
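To check a tail-latency target like the one above against measured data, a nearest-rank percentile is usually enough. The sketch below assumes you already have dispatch latencies in milliseconds from your own tracing; `percentile` and `meets_slo` are hypothetical helper names.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Round before ceil so float noise (e.g. 999.0000000000001) does not bump the rank.
    rank = math.ceil(round(pct * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]


def meets_slo(samples: Sequence[float], pct: float, budget_ms: float) -> bool:
    """True when the chosen percentile of dispatch latency fits the budget."""
    return percentile(samples, pct) <= budget_ms


# Synthetic example: 995 fast dispatches and 5 slow ones.
latencies_ms = [10.0] * 995 + [200.0] * 5
print(meets_slo(latencies_ms, pct=99, budget_ms=50))
print(meets_slo(latencies_ms, pct=100, budget_ms=50))
```

The percentile you test should match the one in your SLO; a service can pass comfortably at p99 and still miss at p99.9.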
Operational complexity and migration cost are real dollars. Rewriting 200 workflow definitions from Step Functions JSON + Lambda to Temporal prototypes takes 6–12 engineer-weeks. Two engineers for six weeks at $180k/year costs about $41k. Add testing, chaos runs, and staging and the engineering bill approaches $70k before you touch production cutover.
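The migration estimate is plain engineer-week arithmetic. A minimal sketch, assuming a 52-week year and the same $180k loaded cost used earlier (`migration_cost` is an illustrative name):

```python
WEEKS_PER_YEAR = 52


def migration_cost(engineers: int, weeks: float, annual_cost: float = 180_000) -> float:
    """Cost of a migration effort, priced in engineer-weeks."""
    engineer_weeks = engineers * weeks
    return engineer_weeks * annual_cost / WEEKS_PER_YEAR


# Two engineers for six weeks is 12 engineer-weeks, the upper end of the estimate.
print(round(migration_cost(engineers=2, weeks=6)))
```

Testing, chaos runs, and staging are not modeled here; they are what moves the total from the low $40k range toward $70k.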
Feature delta matters more than license cost. Temporal and Step Functions both provide retries, timers, versioning, and a way to deliver external events to a running workflow (signals in Temporal, task-token callbacks in Step Functions); Temporal additionally exposes richer local debugging, code-native workflows, and long-running workflows measured in months. Airflow and Dagster target batch-oriented DAGs with different execution semantics. If you have user-facing orchestrations with sub-second expectations, Temporal or Step Functions is the right primitive; for ETL pipelines that run hourly, managed Airflow or Dagster Cloud is cheaper.
Buy orchestration until the math forces you to own the runtime: the moment your annual ops spend on orchestration plus migration risk exceeds the salaries of 2–3 senior engineers, build.
What this means for a CTO or technical founder
You should prioritize vendor evaluation against five vectors: cost per state transition, durable-timer limits, latency percentiles, observability integrations, and migration path. For example, AWS Step Functions integrates with CloudWatch and X‑Ray; Temporal Cloud offers SDK-first observability and visibility APIs; Dagster Cloud connects to metadata stores and asset-aware lineage. Pick the vendor whose telemetry matches your SLOs.
Define your breakpoints in dollars and latency. If your workflows generate <100M state transitions/month and average end-to-end latency budgets are seconds rather than tens of milliseconds, buy. If you expect managed fees to exceed $250k/year because of scale, and you have a platform team of 3+ engineers willing to own a runtime, building becomes worth evaluating. The trade-off flips outright at roughly $1.5M in cumulative three-year vendor spend (about $500k/year), against $1.62M in salaries for a 3‑engineer build over the same period.
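The break-even comparison can be sketched as a small function. It assumes flat annual vendor fees and a salary-plus-infra build cost; the parameter names and the optional `annual_infra` term are assumptions added for illustration.

```python
ENGINEER_COST_PER_YEAR = 180_000  # loaded cost used throughout the article


def build_cost(engineers: int = 3, years: int = 3, annual_infra: float = 0.0) -> float:
    """Cumulative cost of owning the runtime over the planning horizon."""
    return years * (engineers * ENGINEER_COST_PER_YEAR + annual_infra)


def break_even_annual_fee(engineers: int = 3, annual_infra: float = 0.0) -> float:
    """Annual vendor fee at which buying costs the same as building."""
    return build_cost(engineers, years=1, annual_infra=annual_infra)


def cheaper_option(annual_vendor_fee: float, engineers: int = 3, years: int = 3,
                   annual_infra: float = 0.0) -> str:
    """Return 'buy' or 'build' on cost alone, ignoring latency and portability."""
    vendor_total = annual_vendor_fee * years
    return "buy" if vendor_total < build_cost(engineers, years, annual_infra) else "build"


print(break_even_annual_fee())    # 540000.0 with salaries only
print(cheaper_option(250_000))    # buy
print(cheaper_option(600_000))    # build
```

On salaries alone the crossover sits at $540k/year, so a $250k/year bill is a signal to start measuring rather than a reason to switch.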
Treat portability as insurance, not a design principle. Lock-in cost is migration engineering: converting workflow definitions, reimplementing signals, and reconciling observability. Budget that migration at 1–3% of your three‑year SaaS spend when you buy, and at 20–30% of one-year team cost when you build.
Decision checklist — when to build, when to buy
1) Buy if your monthly state transitions are under 100M, your latency SLO is >100ms, and your durable timers are under 30 days. Managed services are typically <$100k/year in this band.
2) Buy if observability integrations (CloudWatch, X‑Ray, Datadog) and managed scale are your priority; the integration cost of rolling your own is 4–8 engineer-weeks per endpoint.
3) Build if you need sub-50ms dispatch at scale, expect >$1M/year in orchestration expense, or require runtime behavior that no market product exposes. Building is defensible when you can amortize a 3‑engineer team across multiple platform capabilities.
4) Build if you require strict portability across clouds for regulatory reasons; otherwise, accept the migration tax and move faster on product.
5) Always prototype with production data. Implement 3 representative workflows on the candidate managed service and measure tail latency, developer ergonomics, and the runbook-edit cycle before deciding.
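The checklist can be encoded as a first-pass filter. This is a sketch of the thresholds listed above, not a substitute for the prototype in item 5; the `Requirements` fields and the "prototype" fallback for cases the checklist does not settle are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    transitions_per_month: int
    dispatch_latency_slo_ms: float
    durable_timer_days: int
    annual_orchestration_spend: float
    needs_regulatory_portability: bool = False


def recommend(req: Requirements) -> str:
    """Apply the checklist thresholds; returns 'build', 'buy', or 'prototype'."""
    # Items 3 and 4: hard reasons to own the runtime.
    if req.needs_regulatory_portability:
        return "build"
    if req.dispatch_latency_slo_ms < 50 or req.annual_orchestration_spend > 1_000_000:
        return "build"
    # Item 1: the band where managed services are typically cheap.
    if (req.transitions_per_month < 100_000_000
            and req.dispatch_latency_slo_ms > 100
            and req.durable_timer_days < 30):
        return "buy"
    # Anything in between is not decided by thresholds alone (item 5).
    return "prototype"


print(recommend(Requirements(10_000_000, 500, 7, 50_000)))      # buy
print(recommend(Requirements(10_000_000, 20, 7, 50_000)))       # build
print(recommend(Requirements(200_000_000, 500, 7, 200_000)))    # prototype
```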
Key takeaways:
1) Buy for most startups: a managed orchestrator costs <$100k/year until you reach very high throughput or extreme latency needs.
2) Build when annual vendor bills approach the salary of 2–3 engineers and you need control over latency, scheduling semantics, or portability.
3) Budget migration at $40k–$120k per major migration project; that’s the true lock-in cost, not an opaque vendor term sheet.
4) Use prototype + metrics as the final arbiter: measure transitions/month, 99.9th-percentile dispatch latency, and on-call MTTR before you choose.
If you adopt a managed service, instrument aggressively for the metrics that change the decision: transitions/month, active workflows, median and 99.9th-percentile task dispatch latency, durable-timer retention, and monthly vendor spend. These five numbers let you replay the TCO every quarter and catch the inflection before it costs you months of rework.
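A quarterly replay can be as simple as a record of those metrics plus a few threshold checks. The sketch below is one possible shape: the field names, the default thresholds (taken from figures earlier in the article), and the flag labels are all assumptions. Latency is stored as two fields, median and 99.9th percentile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterlySnapshot:
    transitions_per_month: int
    active_workflows: int
    median_dispatch_ms: float
    p999_dispatch_ms: float
    durable_timer_retention_days: int
    monthly_vendor_spend: float


def annualized_vendor_spend(snapshot: QuarterlySnapshot) -> float:
    """Project the current monthly bill over a year."""
    return snapshot.monthly_vendor_spend * 12


def inflection_flags(snapshot: QuarterlySnapshot,
                     build_cost_per_year: float = 540_000,
                     transitions_limit: int = 100_000_000,
                     dispatch_slo_ms: float = 50.0) -> list[str]:
    """List which build-vs-buy thresholds this snapshot has crossed."""
    flags = []
    if annualized_vendor_spend(snapshot) >= build_cost_per_year:
        flags.append("vendor spend")
    if snapshot.transitions_per_month >= transitions_limit:
        flags.append("transition volume")
    if snapshot.p999_dispatch_ms > dispatch_slo_ms:
        flags.append("dispatch latency")
    return flags


q = QuarterlySnapshot(120_000_000, 40_000, 12.0, 35.0, 14, 3_000.0)
print(inflection_flags(q))   # ['transition volume']
```

An empty list means the buy decision still holds for the quarter; more than one flag is a reasonable trigger to rerun the full TCO comparison.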



