Zero-Downtime Deploy on DigitalOcean App Platform vs Vercel, Render, Fly

What “zero-downtime” actually has to mean

Most platforms call any deploy that does not return a 502 “zero-downtime”. That is a low bar. The five behaviours we actually test for, on every platform, before we sign off on a deploy pipeline:

No 5xx during traffic shift — measured by hitting /health at 50 RPS for the full deploy window.
In-flight requests drain — long-running POSTs started against the old version finish on the old version.
WebSocket / SSE connections survive — or close cleanly with a reconnect signal, not RST.
DB migrations stay backward-compatible — old and new versions run side-by-side for at least the drain window.
Rollback is the same command as deploy, not a separate operations procedure.
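The first behaviour is the easiest to automate. What follows is a minimal sketch of such a probe, assuming Node 18+ (for the global fetch) and a target URL you supply. The names summarize, statusOf and probe are illustrative, not part of any platform tooling, and pacing with setTimeout only approximates the target request rate.

```javascript
// Sketch of a deploy-window probe. Assumes Node 18+ for the global fetch.
// All function names here are illustrative.

// Pure helper: bucket status codes so the result is easy to assert on.
function summarize(statuses) {
  const summary = { total: statuses.length, ok: 0, serverErrors: 0, networkErrors: 0, other: 0 };
  for (const status of statuses) {
    if (status === null) summary.networkErrors += 1;         // request never completed
    else if (status >= 500) summary.serverErrors += 1;       // the failures we care about
    else if (status >= 200 && status < 300) summary.ok += 1;
    else summary.other += 1;                                 // 3xx / 4xx, reported separately
  }
  return summary;
}

// Resolve to the HTTP status, or null if the request failed outright.
async function statusOf(fetchImpl, url) {
  try {
    const res = await fetchImpl(url);
    await res.arrayBuffer(); // drain the body so the connection can be reused
    return res.status;
  } catch {
    return null;
  }
}

// Fire roughly `rps` requests per second for `durationMs`, then summarize.
async function probe(url, { rps = 50, durationMs = 60_000, fetchImpl = fetch } = {}) {
  const pending = [];
  const intervalMs = 1000 / rps;
  const deadline = Date.now() + durationMs;
  while (Date.now() < deadline) {
    pending.push(statusOf(fetchImpl, url));
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return summarize(await Promise.all(pending));
}

// Usage (hypothetical URL), started just before the deploy is triggered:
// probe("https://api.example.com/health", { durationMs: 7 * 60_000 }).then(console.log);
```

A deploy passes this check when serverErrors and networkErrors are both zero for the whole window.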

Below, all four platforms are scored against those five behaviours, then we walk through one deploy on each. The rest of the post covers the gotchas we hit moving an app between them.

The matrix

Behaviour	DO App Platform	Vercel	Render	Fly.io
No 5xx during shift	Yes if health-check path configured	Yes (atomic alias swap)	Yes if zero-downtime enabled	Yes with bluegreen strategy
Drain in-flight requests	~120s SIGTERM grace	Function-bound, ~15s	~30s by default, configurable	Configurable via kill_timeout
WS / SSE survive deploy	Closes, client reconnects	Edge-only, fragile for long sockets	Closes cleanly	Cleanest of the four (per-machine)
Side-by-side runtime	~2 min overlap during shift	Atomic; no overlap	~30-90s overlap	Long, depends on rollout
One-click rollback	Yes (per deployment)	Yes (per deployment, instant)	Yes	Manual: `fly deploy --image`
Median deploy time, 250MB image	4–6 min	35–90 s	3–5 min	90 s–3 min

Sources: per-platform documentation; deploy timings averaged across our 11 production migrations (Node, Python, Laravel, Next.js).

The matrix shows a clear pattern: Vercel is the fastest at the swap itself (it is just an alias change) but the weakest at long-lived connections. App Platform and Render are the most predictable for typical CRUD APIs. Fly is the most controllable but expects you to design the strategy yourself.

Deploy 1 — DigitalOcean App Platform, second by second

The platform builds your container in a managed Cloud Native Buildpacks runner, pushes the resulting image to DOCR (DigitalOcean Container Registry), then performs a rolling deploy across the configured number of instances. The health-check path is the single most important setting: get it wrong and you ship downtime.

.do/app.yaml — production health-check config

yaml

The two settings that matter are http_path and initial_delay_seconds. The default initial_delay_seconds is 0, so the first probe can fire before the app has booted and fail the health check; set it to your measured cold-start time plus 5 seconds.

name: api-prod
services:
  - name: web
    github:
      repo: appycodes/api
      branch: main
      deploy_on_push: true
    instance_size_slug: professional-xs
    instance_count: 2
    http_port: 8080
    health_check:
      http_path: /healthz
      initial_delay_seconds: 25
      period_seconds: 10
      timeout_seconds: 5
      success_threshold: 1
      failure_threshold: 3
    routes:
      - path: /
    envs:
      - key: DATABASE_URL
        scope: RUN_TIME
        type: SECRET
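    # Assumption to verify: the app spec reference says instance_count must not
    # be set when autoscaling is used, so keep only one of the two.
    autoscaling:
      min_instance_count: 2
      max_instance_count: 8
      metrics:
        cpu:
          percent: 70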
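
The http_path probe is only as good as the handler behind it. A minimal sketch of a readiness-aware /healthz handler for a plain Node http server follows; createHealthHandler, isReady and dbWarm are illustrative names, and "ready" here is assumed to mean that the database pool has warmed.

```javascript
// Sketch of a readiness-aware health handler. Names are illustrative and
// not part of the App Platform API; the handler only needs Node's
// standard (req, res) pair.
function createHealthHandler(isReady) {
  return function healthHandler(req, res) {
    const ready = isReady();
    // 503 keeps a new instance out of rotation until its dependencies are warm.
    res.writeHead(ready ? 200 : 503, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ ok: ready }));
  };
}

// Example wiring: flip the flag once the DB pool (or any dependency) is warm.
let dbWarm = false;
const healthz = createHealthHandler(() => dbWarm);
// After the pool connects:   dbWarm = true;
// In the request router:     if (req.url === "/healthz") return healthz(req, res);
```

With a handler like this, initial_delay_seconds only needs to cover process boot, because the 503 responses hold traffic back until the app reports itself ready.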

The deploy lifecycle, as we measure it from the DO control panel:

DO App Platform — observed deploy timeline (250MB Node image, 2 instances)

text

t = 0:00    push to main, webhook fires
t = 0:08    build container starts in DO managed runner
t = 2:55    build complete, image pushed to DOCR
t = 3:00    new instance 1 starts in parallel to old 1
t = 3:25    new instance 1 passes health check
t = 3:25    load balancer adds new instance 1, removes old instance 1
t = 3:30    old instance 1 receives SIGTERM, has 120s to drain
t = 3:55    new instance 2 starts in parallel to old 2
t = 4:20    new instance 2 passes health check
t = 4:20    load balancer swap; old instance 2 starts draining
t = 5:30    old instance 1 finishes drain, exits cleanly
t = 6:25    old instance 2 finishes drain, exits cleanly
t = 6:25    deploy marked complete

The 120-second SIGTERM grace window is generous compared to most PaaS: Heroku gives 30s, and Vercel functions are effectively bound by their function timeout. For our Laravel and Node APIs, 120s is enough to drain even the longest legitimate request (a CSV export that waits on a slow third-party API).

Two things to know about how App Platform handles long-lived connections. First, WebSocket connections are terminated at SIGTERM — the client sees a clean close, not an RST. Second, the load balancer does not currently forward sticky-session cookies for WS connections by default, so any reconnection-based recovery needs to tolerate landing on a different instance. We design every WS handler we ship to be re-entrant for this reason.
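On the client side, "tolerate landing on a different instance" usually comes down to reconnecting with backoff and re-sending any subscription state on every open. A minimal sketch under those assumptions: connect is a caller-supplied function (hypothetical here) that returns a socket-like object with onopen and onclose hooks, such as a browser WebSocket, and the delay values are placeholders.

```javascript
// Sketch of client-side reconnect logic. Function names and default
// delays are illustrative.

// Exponential backoff with a cap; `random` is injectable so tests are deterministic.
function backoffDelayMs(attempt, { baseMs = 500, maxMs = 15_000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Full jitter: spread reconnects so clients do not stampede the new instance.
  return Math.floor(random() * ceiling);
}

// Keep one socket open, reopening it whenever the server closes it.
function keepConnected(connect, { schedule = setTimeout } = {}) {
  let attempt = 0;
  function open() {
    const socket = connect();
    socket.onopen = () => {
      attempt = 0; // healthy again, reset the backoff
      // Re-send subscriptions here: the new instance holds no session state.
    };
    socket.onclose = () => {
      schedule(open, backoffDelayMs(attempt));
      attempt += 1;
    };
  }
  open();
}

// Usage in a browser (hypothetical URL):
// keepConnected(() => new WebSocket("wss://api.example.com/ws"));
```

The jitter matters during a deploy: every client sees the close at roughly the same moment, and without it they would all reconnect in the same instant.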

Deploy 2 — Vercel

Vercel's deploy mechanism is fundamentally different. Each push builds a new immutable deployment with its own URL (my-app-xyz.vercel.app). The production alias is then atomically swapped to point at that deployment. There is no rolling shift; the old deployment keeps running its in-flight functions until they finish, and new traffic goes straight to the new deployment.

Vercel — observed deploy timeline (same app)

text

t = 0:00    push to main
t = 0:05    build starts on Vercel build container
t = 0:48    build complete
t = 0:55    new deployment marked Ready
t = 1:00    production alias swap — atomic, < 1 s
t = 1:00    new traffic 100% on new deployment
t = 1:00 -> ~15s  in-flight functions on old deployment continue
            until they finish or the 15s default timeout

Strengths: the swap itself is instant, builds are fast, and every preview deployment is a real, queryable URL. Rollback is one click and equally instant.

Weaknesses: anything stateful across the swap is your problem. Long-running HTTP responses (file downloads, streaming AI responses) can be cut at the function timeout. WebSocket support on Vercel relies on the edge runtime and is not a primary use case — long-lived sockets on serverless are inherently fragile. If you need persistent connections, Vercel is the wrong shape.

Deploy 3 — Render

Render sits in between. It runs your service as a long-lived process (like App Platform), not as a function pool (like Vercel), but the deploy mechanic is simpler than DO's rolling shift. With Zero-downtime deploys enabled on a paid plan, Render starts the new instance, waits for it to pass its health check, then routes 100% of traffic over and SIGTERMs the old instance.

Render — observed deploy timeline

text

t = 0:00    push to main
t = 0:06    build starts
t = 2:30    build complete, container starts
t = 2:55    health check passes (HTTP 200 on the configured Health Check Path)
t = 2:55    100% traffic switched to new instance
t = 2:55    old instance starts ~30s SIGTERM drain
t = 3:25    old instance terminated

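The 30-second drain is the part you usually need to tune. For most CRUD APIs it is fine; for anything that holds a connection open it is short. As far as we can tell from Render's documentation, the control for this is the service's maximum shutdown delay (maxShutdownDelaySeconds in render.yaml, or the matching dashboard setting) rather than an environment variable; raise it there if your workload needs longer.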

The health-check on Render defaults to TCP, not HTTP. If you want behaviour comparable to App Platform's /healthz probe, set the Health Check Path in the dashboard explicitly. We have audited several Render deployments where the team thought they had a real health check and actually had a TCP probe that passed long before the app was ready.

Deploy 4 — Fly.io

Fly does not assume a strategy. You pick one in fly.toml and the platform executes it. The two we use most are rolling (the default, similar to DO) and bluegreen (a full parallel fleet, then a swap). Bluegreen is the one to use for serious traffic shifts.

fly.toml — bluegreen deploy with health checks

toml

app = "api-prod"
primary_region = "lhr"
# Default kill_timeout is 5 seconds; raise it so in-flight requests can drain.
kill_timeout = 120

[deploy]
  strategy = "bluegreen"
  release_command = "node scripts/migrate.js"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = false
  min_machines_running = 2
  processes = ["app"]

[[http_service.checks]]
  grace_period = "20s"
  interval = "10s"
  method = "GET"
  timeout = "5s"
  path = "/healthz"

[[vm]]
  cpu_kind = "shared"
  cpus = 2
  memory_mb = 1024
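
Because release_command runs the migration while the old machines are still serving traffic, the migration has to be safe for the old code as well as the new. The sketch below is a lint-style heuristic for catching the obvious breakers before they reach scripts/migrate.js. The pattern list and the findBreakingStatements name are assumptions for illustration, deliberately incomplete, and a hit means "needs review", not proof of a problem.

```javascript
// Heuristic sketch: flag SQL statements that can break the old version
// while both versions share one database. Not a SQL parser.
const BREAKING_PATTERNS = [
  { name: "drop column", regex: /\bALTER\s+TABLE\b[\s\S]*?\bDROP\s+COLUMN\b/i },
  { name: "rename column or table", regex: /\bALTER\s+TABLE\b[\s\S]*?\bRENAME\b/i },
  { name: "drop table", regex: /\bDROP\s+TABLE\b/i },
];

function findBreakingStatements(sql) {
  const hits = [];
  // Naive split on ";" is acceptable for a lint check, but it will
  // mis-split statements that contain semicolons inside string literals.
  for (const statement of sql.split(";")) {
    const trimmed = statement.trim();
    if (!trimmed) continue;
    for (const { name, regex } of BREAKING_PATTERNS) {
      if (regex.test(trimmed)) hits.push({ rule: name, statement: trimmed });
    }
  }
  return hits;
}

// Usage: fail CI when a migration contains a flagged statement.
// const hits = findBreakingStatements(migrationSql);
// if (hits.length > 0) process.exitCode = 1;
```

Additive changes (new tables, new nullable columns) pass this check; destructive ones are better deferred to a later deploy, once no old machines remain.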

Fly bluegreen — observed deploy timeline

text

t = 0:00    fly deploy (or push if CI)
t = 0:20    image build complete (locally or in fly builder)
t = 0:35    release_command runs migrations, exits 0
t = 0:40    full set of "green" machines started in parallel
t = 1:05    green machines pass health checks
t = 1:05    proxy switches traffic from blue -> green
t = 1:05    blue machines start draining (kill_timeout)
t = 1:35    blue machines terminated

Fly gives you the most knobs but expects you to use them. The default kill_timeout is 5 seconds, which is too short for most production workloads. Bump it to 60 or 120 explicitly. WebSocket handling is the cleanest of the four platforms because each app is a real persistent process on a real VM — the proxy will respect existing connections during the swap window.
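Even with a proxy that respects open connections, the old machines still exit at the end of kill_timeout, so it helps to tell clients to reconnect rather than let the sockets drop. A minimal sketch, assuming the server keeps its own collection of open sockets and that each one exposes close(code, reason), as sockets from the ws package do. closeSocketsForDeploy and the reason string are illustrative.

```javascript
// Sketch: ask every open WebSocket to reconnect before the process exits.
// Tracking the collection of open sockets is left to the caller.
const GOING_AWAY = 1001; // RFC 6455 close code: the endpoint is going away

function closeSocketsForDeploy(sockets, reason = "deploy: please reconnect") {
  let closed = 0;
  for (const socket of sockets) {
    try {
      socket.close(GOING_AWAY, reason);
      closed += 1;
    } catch (err) {
      // A socket that is already closing should not block the shutdown path.
      console.warn("socket close failed", err);
    }
  }
  return closed;
}

// Usage, alongside the HTTP drain in the shutdown handler:
// process.on("SIGTERM", () => closeSocketsForDeploy(openSockets));
```

Close code 1001 gives the client a way to tell a deploy from an error, which lets it reconnect immediately instead of surfacing a failure to the user.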

One real migration — Vercel to App Platform

The cleanest before/after we have is a Next.js + Postgres app we moved off Vercel onto App Platform last quarter. The reason was not performance — Vercel is faster — it was bill predictability. The app had a Stripe webhook handler that occasionally triggered a 90-second batch reconciliation, and the Vercel function-runtime cost was creeping up.

The work, week by week:

Week 1: runtime audit. List every code path that assumes serverless: per-request DB connections (fine, but pool them now), file uploads to /tmp (rewrite to stream to object storage), ISR caches (replace with a real Redis or KV cache). Six edits across the codebase.
Week 2: Dockerfile + health endpoint. Wrote a minimal Next.js standalone Dockerfile and exposed /api/healthz, which returns { ok: true } once the DB pool has warmed.
Week 3: staging on App Platform. Connected the repo, deployed to a staging app spec, and ran the same load test we ran on Vercel. Cold start was 8s on App Platform vs 700ms on Vercel functions; expected, and mitigated by keeping min_instance_count: 2.
Week 4: cutover. Pointed the Cloudflare-managed apex record at App Platform via a CNAME at the apex, which Cloudflare flattens (its equivalent of an ALIAS record), and kept Vercel running for 48 hours behind a feature flag for instant fallback. No 5xx during the shift.
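
The Week 1 item about per-request DB connections is the one that most often survives a serverless-to-container move unnoticed. One way to pool them is a process-wide lazy singleton, sketched here. lazySingleton is an illustrative helper, and createPool stands in for whatever factory your database driver provides.

```javascript
// Sketch: create an expensive resource once per process, on first use,
// instead of once per request.
function lazySingleton(factory) {
  let instance;
  let created = false;
  return function get() {
    if (!created) {
      instance = factory();
      created = true;
    }
    return instance;
  };
}

// Usage (createPool is hypothetical, supplied by your driver):
// const getPool = lazySingleton(() => createPool(process.env.DATABASE_URL));
// Inside a request handler: const pool = getPool();
```

On a long-lived instance the pool is then shared across requests, which is also what makes the "after the DB pool warms" health check meaningful.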

Bill comparison after the first full month: Vercel was $312/mo (Pro plus function execution), App Platform was $73/mo (2 × Professional XS instances + DB) for the same traffic shape. The latency at p95 went from 145ms to 195ms — real but acceptable for an internal-facing API.
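For readers who want the ratios rather than the raw figures, the arithmetic is short. The inputs below are the four numbers quoted in the paragraph above; nothing else is assumed.

```javascript
// Worked arithmetic for the bill and latency figures quoted above.
const vercelMonthly = 312;      // USD per month
const appPlatformMonthly = 73;  // USD per month
const p95Before = 145;          // ms, on Vercel
const p95After = 195;           // ms, on App Platform

const costRatio = appPlatformMonthly / vercelMonthly;
const monthlySaving = vercelMonthly - appPlatformMonthly;
const latencyIncreasePct = ((p95After - p95Before) / p95Before) * 100;

console.log(costRatio.toFixed(2));          // 0.23
console.log(monthlySaving);                 // 239
console.log(latencyIncreasePct.toFixed(1)); // 34.5
```

In other words, the App Platform bill was a little under a quarter of the Vercel bill, at the cost of a p95 roughly a third higher.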

The pattern that sets up this migration cleanly is the same engineering hygiene that helps a SaaS get through a Series A audit — see our companion Series A codebase audit study for the broader picture, and the multi-tenant cost study for what to budget per-tenant once you are on the new platform. We run end-to-end PaaS migrations like this through our tech-stack migration engagement — usually 3-5 weeks for a Next.js or Node app, including the staging soak and the 48-hour fallback window.

Choosing between the four, in 60 seconds

You are mostly serving HTTP responses under 10s, with edge or static caching desirable. Vercel. The function model lines up with the workload, the build is fast, the rollback is instant.
You run a long-process API or Laravel / Rails monolith and want predictable infrastructure cost. DigitalOcean App Platform. Boring is the feature; the rolling shift is well-behaved; the bill, $73/mo for two instances plus the database in our migration, was about a quarter of the equivalent Vercel bill.
You want the App Platform shape but at a lower price floor. Render. The free / starter tiers are more forgiving than DO's; the trade-off is fewer instance types and a less-detailed deploy log.
You need regional placement, per-machine control, or stable WebSocket / SSE workloads. Fly. The deeper the workload, the more Fly's knobs become an advantage instead of a tax.

Reference: production-ready configs

The four configs we ship by default, in roughly the same shape:

Dockerfile — Node app, multi-stage, used on DO / Render / Fly

dockerfile

FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
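RUN npm run build
# Drop devDependencies so the runtime stage copies only production modules.
RUN npm prune --omit=dev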

FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./package.json
EXPOSE 8080
USER node
# Exec-form CMD runs node as PID 1 (this image has no tini), so node receives
# SIGTERM directly; handle the signal in app code.
CMD ["node", "dist/server.js"]

src/server.js — graceful shutdown that works on all four

javascript

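Captures SIGTERM, stops accepting new connections, drains in-flight requests, then exits. The 60-second hard cutoff fits inside the 120s App Platform grace; on Render and Fly, either raise the platform's drain window or lower the cutoff so it stays below the SIGKILL deadline.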

import http from "http";
import app from "./app.js";

const PORT = process.env.PORT || 8080;
const server = http.createServer(app);
server.listen(PORT, () => console.log("listening on", PORT));

const SHUTDOWN_TIMEOUT_MS = 60_000;
let shuttingDown = false;

function shutdown(signal) {
  // A second signal would call server.close() again and land in the error path.
  if (shuttingDown) return;
  shuttingDown = true;

  console.log("received", signal, "- draining");
  // Stop accepting new connections; the callback runs once open connections end.
  server.close((err) => {
    if (err) {
      console.error("server close error", err);
      process.exit(1);
    }
    console.log("drain complete");
    process.exit(0);
  });

  // Hard cutoff in case clients hold connections open past
  // platform timeout. Pick a value < your platform's SIGKILL window.
  setTimeout(() => {
    console.warn("forcing shutdown after", SHUTDOWN_TIMEOUT_MS, "ms");
    // Non-zero exit so a forced shutdown is visible in logs and alerts.
    process.exit(1);
  }, SHUTDOWN_TIMEOUT_MS).unref();
}

process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));

The same graceful-shutdown shape runs unmodified on App Platform, Render and Fly. Vercel handles this for you at the function boundary — you do not write server.close, the platform does. The shape above is the one we paste into every new service shipped through our SaaS web-app development engagement, and the one our maintenance retainer owns post-launch alongside the deploy pipeline.
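
One detail worth making explicit: a fixed 60-second cutoff only sits below the SIGKILL deadline on platforms with a longer grace window. A small sketch of deriving the cutoff from the grace window instead; the grace values are the ones quoted in this post (verify them against your own settings), and shutdownTimeoutMs and the GRACE_SECONDS variable are illustrative names.

```javascript
// Sketch: derive the in-process hard cutoff from the platform grace window.
const GRACE_SECONDS = {
  appPlatform: 120, // SIGTERM grace described in the App Platform section
  render: 30,       // default drain
  fly: 5,           // default kill_timeout
};

function shutdownTimeoutMs(graceSeconds, marginSeconds = 5) {
  // Leave a margin so the process exits on its own terms before SIGKILL,
  // but never go below one second.
  const budgetSeconds = Math.max(graceSeconds - marginSeconds, 1);
  return budgetSeconds * 1000;
}

console.log(shutdownTimeoutMs(GRACE_SECONDS.appPlatform)); // 115000
console.log(shutdownTimeoutMs(GRACE_SECONDS.render));      // 25000
console.log(shutdownTimeoutMs(GRACE_SECONDS.fly));         // 1000

// Usage (GRACE_SECONDS as an env var is hypothetical):
// const SHUTDOWN_TIMEOUT_MS = shutdownTimeoutMs(Number(process.env.GRACE_SECONDS ?? 30));
```

The Fly result shows why the default kill_timeout needs raising: a five-second grace leaves about one second of usable drain.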

Frequently asked questions

What does zero-downtime deploy actually require?
Five behaviours: no 5xx during traffic shift, in-flight requests drain on the old version, WebSocket connections close cleanly with a reconnect signal, database migrations stay backward-compatible during the overlap, and rollback is the same command as deploy. Most platforms call any non-502 deploy 'zero-downtime'; that is a low bar.

Which PaaS is cheapest for a typical long-process API?
DigitalOcean App Platform. A 2-instance Professional XS deployment runs around $73/mo against a comparable Vercel function bill in the $300+/mo range. Render sits in between with a more forgiving starter tier. Fly.io is cheapest at the very low end but requires the most config knobs.

Why is Vercel weak for WebSocket workloads?
Vercel's deploy mechanism is an atomic alias swap on top of the function pool; the swap is instant but anything stateful across it is the developer's problem. Long-lived sockets on serverless are inherently fragile. For persistent connections, App Platform, Render or Fly are all better-shaped.

About the author

Ritesh — Founding Partner, Appycodes

Ritesh leads engineering at Appycodes. The 11 migrations behind this post include three Next.js apps off Vercel to App Platform, two Node APIs onto Fly bluegreen, and a Laravel monolith that has lived on Render since 2022. The graceful-shutdown shape in the reference section is the one we paste into every new service we ship.

Last reviewed: May 12, 2026
