What “zero-downtime” actually has to mean
Most platforms call any deploy that does not return a 502 “zero-downtime”. That is a low bar. The five behaviours we actually test for, on every platform, before we sign off on a deploy pipeline:
- No 5xx during traffic shift — measured by hitting /health at 50 RPS for the full deploy window.
- In-flight requests drain — long-running POSTs started against the old version finish on the old version.
- WebSocket / SSE connections survive — or close cleanly with a reconnect signal, not RST.
- DB migrations stay backward-compatible — old and new versions run side-by-side for at least the drain window.
- Rollback is the same command as deploy, not a separate operations procedure.
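The first behaviour is cheap to verify mechanically: hit the health endpoint at a fixed rate for the whole deploy window and classify every response, counting connection failures as downtime too. A minimal sketch, assuming Node 18+ for the global fetch; the TARGET URL and RPS here are placeholders, not defaults from any platform:

```javascript
const TARGET = process.env.TARGET ?? "https://api.example.com/health";
const RPS = 50;

// Bucket raw status codes into the classes we score deploys on;
// null means the request never completed (refused / reset).
function summarize(statuses) {
  const out = { ok: 0, clientError: 0, serverError: 0, networkError: 0 };
  for (const s of statuses) {
    if (s === null) out.networkError++;
    else if (s >= 500) out.serverError++;
    else if (s >= 400) out.clientError++;
    else out.ok++;
  }
  return out;
}

async function probeOnce() {
  try {
    const res = await fetch(TARGET);
    return res.status;
  } catch {
    return null; // connection refused / RST counts as downtime too
  }
}

async function run(durationMs) {
  const statuses = [];
  const end = Date.now() + durationMs;
  while (Date.now() < end) {
    statuses.push(await probeOnce());
    await new Promise((r) => setTimeout(r, 1000 / RPS));
  }
  return summarize(statuses);
}

// run(6 * 60_000).then((r) => console.log(r)); // six-minute deploy window
```

The loop awaits each response before firing the next, so the real rate falls below 50 RPS when latency climbs; a strict-rate probe would fire requests without awaiting each one. For a pass/fail signal on deploys, the sequential version is enough.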
Below, all four platforms are scored against those five behaviours, then we walk through one deploy on each. The rest of the post covers the gotchas we hit moving an app between them.
The matrix
| Behaviour | DO App Platform | Vercel | Render | Fly.io |
|---|---|---|---|---|
| No 5xx during shift | Yes if health-check path configured | Yes (atomic alias swap) | Yes if zero-downtime enabled | Yes with bluegreen strategy |
| Drain in-flight requests | ~120s SIGTERM grace | Function-bound, ~15s | ~30s by default, configurable | Configurable via kill_timeout |
| WS / SSE survive deploy | Closes, client reconnects | Edge-only, fragile for long sockets | Closes cleanly | Cleanest of the four (per-machine) |
| Side-by-side runtime | ~2 min overlap during shift | Atomic; no overlap | ~30-90s overlap | Long, depends on rollout |
| One-click rollback | Yes (per deployment) | Yes (per deployment, instant) | Yes | Manual: fly deploy --image |
| Median deploy time, 250MB image | 4–6 min | 35–90 s | 3–5 min | 90 s–3 min |
Sources: per-platform documentation; deploy timings averaged across our 11 production migrations (Node, Python, Laravel, Next.js).
The numbers tell a consistent story: Vercel is fastest at the swap mechanism itself (it is just an alias change) but weakest for long-lived connections. App Platform and Render are the most predictable for typical CRUD APIs. Fly is the most controllable but expects you to design the strategy yourself.
Deploy 1 — DigitalOcean App Platform, second by second
The platform builds your container in a managed Cloud-Native Buildpacks runner, pushes the resulting image to DOCR, then performs a rolling deploy across the configured number of instances. The health-check path is the single most important config — get it wrong and you ship downtime.
```yaml
name: api-prod
services:
  - name: web
    github:
      repo: appycodes/api
      branch: main
      deploy_on_push: true
    instance_size_slug: professional-xs
    instance_count: 2
    http_port: 8080
    health_check:
      http_path: /healthz
      initial_delay_seconds: 25
      period_seconds: 10
      timeout_seconds: 5
      success_threshold: 1
      failure_threshold: 3
    routes:
      - path: /
    envs:
      - key: DATABASE_URL
        scope: RUN_TIME
        type: SECRET
    autoscaling:
      min_instance_count: 2
      max_instance_count: 8
      metrics:
        cpu:
          percent: 70
```

The deploy lifecycle, as we measure it from the DO control panel:
```
t = 0:00  push to main, webhook fires
t = 0:08  build container starts in DO managed runner
t = 2:55  build complete, image pushed to DOCR
t = 3:00  new instance 1 starts in parallel to old 1
t = 3:25  new instance 1 passes health check
t = 3:25  load balancer adds new instance 1, removes old instance 1
t = 3:30  old instance 1 receives SIGTERM, has 120s to drain
t = 3:55  new instance 2 starts in parallel to old 2
t = 4:20  new instance 2 passes health check
t = 4:20  load balancer swap; old instance 2 starts draining
t = 5:30  old instance 1 finishes drain, exits cleanly
t = 6:25  old instance 2 finishes drain, exits cleanly
t = 6:25  deploy marked complete
```

The 120-second SIGTERM grace window is generous compared to most PaaS — Heroku gives 30s, and Vercel functions are effectively bound by their function timeout. For our Laravel and Node APIs, 120s is enough to drain even the longest legitimate request (a CSV export hitting a slow third-party).
Two things to know about how App Platform handles long-lived connections. First, WebSocket connections are terminated at SIGTERM — the client sees a clean close, not an RST. Second, the load balancer does not currently forward sticky-session cookies for WS connections by default, so any reconnection-based recovery needs to tolerate landing on a different instance. We design every WS handler we ship to be re-entrant for this reason.
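A hedged sketch of what "tolerate landing on a different instance" means on the client side: reconnect on any close with jittered exponential backoff, and present a session identifier the server can resume from on whichever instance answers. The sessionId query parameter and the global WebSocket (Node 21+ or browsers) are assumptions for illustration, not App Platform specifics:

```javascript
// Exponential backoff with full jitter, capped. Pure, so it is easy
// to reason about: the result lies in [0, min(capMs, baseMs * 2^attempt)).
function backoffMs(attempt, baseMs = 500, capMs = 10_000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

// Reconnect-on-any-close client. The server is assumed re-entrant:
// any instance can resume a session from the sessionId on the URL.
function connect(url, sessionId, { WebSocketImpl = WebSocket } = {}) {
  let attempt = 0;
  const open = () => {
    const ws = new WebSocketImpl(`${url}?session=${encodeURIComponent(sessionId)}`);
    ws.onopen = () => { attempt = 0; }; // healthy again, reset backoff
    ws.onclose = () => {
      // A clean close at SIGTERM or a dropped network path: either way we
      // reconnect, accepting that we may land on a different instance.
      attempt += 1;
      setTimeout(open, backoffMs(attempt));
    };
  };
  open();
}
```

Full jitter matters here: when a deploy closes every socket at once, unjittered backoff would stampede the new instances with simultaneous reconnects.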
Deploy 2 — Vercel
Vercel's deploy mechanism is fundamentally different. Each push builds a new immutable deployment with its own URL (my-app-xyz.vercel.app). The production alias is then atomically swapped to point at that deployment. There is no rolling shift; the old deployment keeps running its in-flight functions until they finish, and new traffic goes straight to the new deployment.
```
t = 0:00        push to main
t = 0:05        build starts on Vercel build container
t = 0:48        build complete
t = 0:55        new deployment marked Ready
t = 1:00        production alias swap — atomic, < 1 s
t = 1:00        new traffic 100% on new deployment
t = 1:00 -> ~15s  in-flight functions on old deployment continue
                  until they finish or the 15s default timeout
```

Strengths: the swap itself is instant, builds are fast, and every preview deployment is a real, queryable URL. Rollback is one click and equally instant.
Weaknesses: anything stateful across the swap is your problem. Long-running HTTP responses (file downloads, streaming AI responses) can be cut at the function timeout. WebSocket support on Vercel relies on the edge runtime and is not a primary use case — long-lived sockets on serverless are inherently fragile. If you need persistent connections, Vercel is the wrong shape.
Deploy 3 — Render
Render sits in between. It runs your service as a long-lived process (like App Platform), not as a function pool (like Vercel), but the deploy mechanic is simpler than DO's rolling shift. With Zero-downtime deploys enabled on a paid plan, Render starts the new instance, waits for it to pass its health check, then routes 100% of traffic over and SIGTERMs the old instance.
```
t = 0:00  push to main
t = 0:06  build starts
t = 2:30  build complete, container starts
t = 2:55  health check passes (default: TCP port probe; see below)
t = 2:55  100% traffic switched to new instance
t = 2:55  old instance starts ~30s SIGTERM drain
t = 3:25  old instance terminated
```

The 30-second drain is the part you usually need to tune. For most CRUD APIs it is fine; for anything that holds a connection open it is short. Increase it via the RENDER_DRAIN_SECONDS environment variable if your workload needs it.
The health check on Render defaults to a TCP probe, not HTTP. If you want behaviour comparable to App Platform's /healthz probe, set the Health Check Path in the dashboard explicitly. We have audited several Render deployments where the team thought they had a real health check and actually had a TCP probe that passed long before the app was ready.
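If you manage the service through a Blueprint rather than the dashboard, the same fix is one line in render.yaml. A minimal sketch under stated assumptions: the build and start commands and the plan are illustrative, and healthCheckPath is the field that turns the TCP probe into a real HTTP check (older Blueprint specs spell the runtime field as `env`):

```yaml
services:
  - type: web
    name: api-prod
    runtime: node               # older Blueprint specs use `env: node`
    plan: starter
    buildCommand: npm ci && npm run build
    startCommand: node dist/server.js
    healthCheckPath: /healthz   # without this, the probe is TCP-only
```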
Deploy 4 — Fly.io
Fly does not assume a strategy: you pick one in fly.toml and the platform executes it. The two we use most are rolling (the default, similar to DO) and bluegreen (full parallel fleet, then swap). Bluegreen is the one to use for serious traffic shifts.
```toml
app = "api-prod"
primary_region = "lhr"

[deploy]
  strategy = "bluegreen"
  release_command = "node scripts/migrate.js"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = false
  min_machines_running = 2
  processes = ["app"]

  [[http_service.checks]]
    grace_period = "20s"
    interval = "10s"
    method = "GET"
    timeout = "5s"
    path = "/healthz"

[[vm]]
  cpu_kind = "shared"
  cpus = 2
  memory_mb = 1024
```

```
t = 0:00  fly deploy (or push if CI)
t = 0:20  image build complete (locally or in fly builder)
t = 0:35  release_command runs migrations, exits 0
t = 0:40  full set of "green" machines started in parallel
t = 1:05  green machines pass health checks
t = 1:05  proxy switches traffic from blue -> green
t = 1:05  blue machines start draining (kill_timeout)
t = 1:35  blue machines terminated
```

Fly gives you the most knobs but expects you to use them. The default kill_timeout is 5 seconds, which is too short for most production workloads. Bump it to 60 or 120 explicitly. WebSocket handling is the cleanest of the four platforms because each app is a real persistent process on a real VM — the proxy will respect existing connections during the swap window.
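The bump is one top-level line in fly.toml. As a sketch: recent flyctl versions accept a duration string, while older configs use integer seconds, so check the form against your flyctl version:

```toml
app = "api-prod"

# Fly's default is 5s between SIGTERM and SIGKILL; widen it to match
# your drain logic. Older configs: kill_timeout = 120
kill_timeout = "120s"
```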
One real migration — Vercel to App Platform
The cleanest before/after we have is a Next.js + Postgres app we moved off Vercel onto App Platform last quarter. The reason was not performance — Vercel is faster — it was bill predictability. The app had a Stripe webhook handler that occasionally triggered a 90-second batch reconciliation, and the Vercel function-runtime cost was creeping up.
The work, week by week:
- Week 1 — runtime audit. List every code path that assumes serverless: per-request DB connections (fine, but pool them now), file uploads to /tmp (rewrite to stream to object storage), ISR caches (replace with real Redis or KV cache). Six edits across the codebase.
- Week 2 — Dockerfile + health endpoint. Wrote a minimal Next.js standalone Dockerfile, exposed /api/healthz that returns { ok: true } after the DB pool warms.
- Week 3 — staging on App Platform. Connected the repo, deployed to a staging app spec, ran the same load test we ran on Vercel. Cold start: 8s on App Platform vs 700ms on Vercel functions — expected, mitigated by keeping min_instance_count: 2.
- Week 4 — cutover. Pointed the Cloudflare-managed apex record at App Platform via an ALIAS, kept Vercel running for 48 hours behind a feature flag for instant fallback. No 5xx during the shift.
Bill comparison after the first full month: Vercel was $312/mo (Pro plus function execution), App Platform was $73/mo (2 × Professional XS instances + DB) for the same traffic shape. The latency at p95 went from 145ms to 195ms — real but acceptable for an internal-facing API.
The pattern that sets up this migration cleanly is the same engineering hygiene that helps a SaaS get through a Series A audit — see our companion Series A codebase audit study for the broader picture, and the multi-tenant cost study for what to budget per-tenant once you are on the new platform. We run end-to-end PaaS migrations like this through our tech-stack migration engagement — usually 3-5 weeks for a Next.js or Node app, including the staging soak and the 48-hour fallback window.
Choosing between the four, in 60 seconds
- You are mostly serving HTTP responses under 10s, with edge or static caching desirable. Vercel. The function model lines up with the workload, the build is fast, the rollback is instant.
- You run a long-process API or Laravel / Rails monolith and want predictable infrastructure cost. DigitalOcean App Platform. Boring is the feature; the rolling shift is well-behaved; the bill at $73/mo for two instances is a fifth of equivalent Vercel.
- You want the App Platform shape but at a lower price floor. Render. The free / starter tiers are more forgiving than DO's; the trade-off is fewer instance types and a less-detailed deploy log.
- You need regional placement, per-machine control, or stable WebSocket / SSE workloads. Fly. The deeper the workload, the more Fly's knobs become an advantage instead of a tax.
Reference: production-ready configs
The four configs we ship by default, in roughly the same shape:
```dockerfile
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./package.json
EXPOSE 8080
USER node
# Catch SIGTERM in app code. node:alpine ships no init process (add tini
# if you spawn child processes); the server below handles signals itself.
CMD ["node", "dist/server.js"]
```

```javascript
import http from "http";
import app from "./app.js";

const PORT = process.env.PORT || 8080;
const server = http.createServer(app);
server.listen(PORT, () => console.log("listening on", PORT));

const SHUTDOWN_TIMEOUT_MS = 60_000;

function shutdown(signal) {
  console.log("received", signal, "— draining");
  server.close((err) => {
    if (err) {
      console.error("server close error", err);
      process.exit(1);
    }
    console.log("drain complete");
    process.exit(0);
  });
  // Hard cutoff in case clients hold connections open past the
  // platform timeout. Pick a value < your platform's SIGKILL window.
  setTimeout(() => {
    console.warn("forcing shutdown after", SHUTDOWN_TIMEOUT_MS, "ms");
    process.exit(0);
  }, SHUTDOWN_TIMEOUT_MS).unref();
}

process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));
```

The same graceful-shutdown shape runs unmodified on App Platform, Render and Fly. Vercel handles this for you at the function boundary — you do not write server.close, the platform does. The shape above is the one we paste into every new service shipped through our SaaS web-app development engagement, and the one our maintenance retainer owns post-launch alongside the deploy pipeline.
Frequently asked questions
- What does zero-downtime deploy actually require?
- Five behaviours: no 5xx during traffic shift, in-flight requests drain on the old version, WebSocket connections close cleanly with a reconnect signal, database migrations stay backward-compatible during the overlap, and rollback is the same command as deploy. Most platforms call any non-502 deploy 'zero-downtime' — that is a low bar.
- Which PaaS is cheapest for a typical long-process API?
- DigitalOcean App Platform. A 2-instance Professional XS deployment runs around $73/mo against a comparable Vercel function bill in the $300+/mo range. Render sits in between with a more forgiving starter tier. Fly.io is cheapest at the very low end but requires the most config knobs.
- Why is Vercel weak for WebSocket workloads?
- Vercel's deploy mechanism is an atomic alias swap on top of the function pool — the swap is instant but anything stateful across it is the developer's problem. Long-lived sockets on serverless are inherently fragile. For persistent connections, App Platform, Render or Fly are all better-shaped.

About the author
Ritesh — Founding Partner, Appycodes
Ritesh leads engineering at Appycodes. The 11 migrations behind this post include three Next.js apps moved off Vercel onto App Platform, two Node APIs onto Fly bluegreen, and a Laravel monolith that has lived on Render since 2022. The graceful-shutdown shape in the reference section is the one we paste into every new service we ship.
