appycodes.

Engineering playbook

Zero-downtime push-to-deploy on DigitalOcean App Platform, and how it actually compares to Vercel, Render and Fly

We have moved 11 production apps onto DigitalOcean App Platform in the last 18 months, most from Vercel or Render. This is the side-by-side deploy story, with timings, traffic-shift mechanics, and rollback paths on all four platforms.

May 12, 2026 | 21 min read | By Ritesh

What "zero-downtime" actually has to mean

Most platforms call any deploy that does not return a 502 "zero-downtime". That is a low bar. Before we sign off on a deploy pipeline, we test for five behaviours on every platform:

  1. No 5xx during traffic shift: measured by hitting /health at 50 RPS for the full deploy window.
  2. In-flight requests drain: long-running POSTs started against the old version finish on the old version.
  3. WebSocket / SSE connections survive, or close cleanly with a reconnect signal rather than an RST.
  4. DB migrations stay backward-compatible: old and new versions run side-by-side for at least the drain window.
  5. Rollback is the same command as deploy, not a separate operations procedure.
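
The first check in the list above can be approximated with a small probe. This is a minimal sketch, not the load-test tooling behind the numbers in this post: the function names probe and summarize and the option names are hypothetical, it assumes Node 18+ for the global fetch, and it counts transport errors alongside 5xx responses.

```javascript
// Tally probe results: a 5xx status or a transport error counts as a failure.
function summarize(results) {
  const total = results.length;
  const failures = results.filter((r) => r.error || r.status >= 500).length;
  return { total, failures, ok: failures === 0 };
}

// Fire roughly `rps` requests per second at `url` for `durationMs`.
// `fetchImpl` is injectable so the loop can be exercised without a network.
async function probe(url, { rps = 50, durationMs = 60_000, fetchImpl = fetch } = {}) {
  const results = [];
  const pending = [];
  const intervalMs = 1000 / rps;
  const ticks = Math.floor(durationMs / intervalMs);
  for (let i = 0; i < ticks; i++) {
    pending.push(
      fetchImpl(url)
        .then((res) => results.push({ status: res.status }))
        .catch((err) => results.push({ error: String(err) }))
    );
    // Timer drift means the real rate is slightly below `rps`.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  await Promise.all(pending);
  return summarize(results);
}

// Hypothetical usage, started just before the deploy is triggered:
//   probe("https://api.example.com/health", { durationMs: 7 * 60_000 })
//     .then((summary) => console.log(summary));
```

Set the duration to cover the whole deploy window, including the drain of the old instances, since failures during the drain are part of the same check.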

Below, all four platforms are scored against those five behaviours, then we walk through one deploy on each. The rest of the post covers the gotchas we hit moving an app between them.

The matrix

| Behaviour | DO App Platform | Vercel | Render | Fly.io |
|---|---|---|---|---|
| No 5xx during shift | Yes if health-check path configured | Yes (atomic alias swap) | Yes if zero-downtime enabled | Yes with bluegreen strategy |
| Drain in-flight requests | ~120s SIGTERM grace | Function-bound, ~15s | ~30s by default, configurable | Configurable via kill_timeout |
| WS / SSE survive deploy | Closes, client reconnects | Edge-only, fragile for long sockets | Closes cleanly | Cleanest of the four (per-machine) |
| Side-by-side runtime | ~2 min overlap during shift | Atomic; no overlap | ~30 to 90s overlap | Long, depends on rollout |
| One-click rollback | Yes (per deployment) | Yes (per deployment, instant) | Yes | Manual: fly deploy --image |
| Median deploy time, 250MB image | 4 to 6 min | 35 to 90 s | 3 to 5 min | 90 s to 3 min |

Sources: per-platform documentation; deploy timings averaged across our 11 production migrations (Node, Python, Laravel, Next.js).

The numbers show a clear pattern: Vercel is the fastest at the swap mechanism itself (it is just an alias) but the worst at long-lived connections. App Platform and Render are the most predictable for typical CRUD APIs. Fly is the most controllable but expects you to design the strategy yourself.

Deploy 1, DigitalOcean App Platform, second by second

The platform builds your container in a managed Cloud-Native Buildpacks runner, pushes the resulting image to DOCR, then performs a rolling deploy across the configured number of instances. The health-check path is the single most important setting: get it wrong and you ship downtime.

The two settings that matter are http_path and initial_delay_seconds. The default initial_delay_seconds is 0, so the first probe can fire before the app has booted and fail; set it to your real cold-start time plus 5 seconds.
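
For the probe to mean anything, the endpoint behind http_path should report ready only once the app can actually serve traffic. A minimal sketch, assuming a hypothetical createHealthCheck helper that the route handler calls; the helper name, the warmDbPool placeholder and the choice of 503 for the not-ready state are illustrative, not something App Platform prescribes.

```javascript
// Tracks whether the process is ready to take traffic.
function createHealthCheck() {
  let ready = false;
  return {
    // Call once dependencies (DB pool, caches) are warm.
    markReady() {
      ready = true;
    },
    // Call at the start of shutdown so the probe stops passing.
    markDraining() {
      ready = false;
    },
    // Status code and body for the health-check route to send.
    respond() {
      return ready
        ? { status: 200, body: { ok: true } }
        : { status: 503, body: { ok: false } };
    },
  };
}

// Hypothetical wiring into an Express-style app:
//   const health = createHealthCheck();
//   app.get("/healthz", (req, res) => {
//     const { status, body } = health.respond();
//     res.status(status).json(body);
//   });
//   warmDbPool().then(() => health.markReady()); // warmDbPool is a placeholder
```

With a handler like this, initial_delay_seconds only needs to cover the time before the first probe is worth sending; the 503 covers any remaining warm-up.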

.do/app.yaml, production health-check config [yaml]
name: api-prod
services:
  - name: web
    github:
      repo: appycodes/api
      branch: main
      deploy_on_push: true
    instance_size_slug: professional-xs
    instance_count: 2
    http_port: 8080
    health_check:
      http_path: /healthz
      initial_delay_seconds: 25
      period_seconds: 10
      timeout_seconds: 5
      success_threshold: 1
      failure_threshold: 3
    routes:
      - path: /
    envs:
      - key: DATABASE_URL
        scope: RUN_TIME
        type: SECRET
    autoscaling:
      min_instance_count: 2
      max_instance_count: 8
      metrics:
        cpu:
          percent: 70

The deploy lifecycle, as we measure it from the DO control panel:

DO App Platform, observed deploy timeline (250MB Node image, 2 instances) [text]
t = 0:00    push to main, webhook fires
t = 0:08    build container starts in DO managed runner
t = 2:55    build complete, image pushed to DOCR
t = 3:00    new instance 1 starts in parallel to old 1
t = 3:25    new instance 1 passes health check
t = 3:25    load balancer adds new instance 1, removes old instance 1
t = 3:30    old instance 1 receives SIGTERM, has 120s to drain
t = 3:55    new instance 2 starts in parallel to old 2
t = 4:20    new instance 2 passes health check
t = 4:20    load balancer swap; old instance 2 starts draining
t = 5:30    old instance 1 finishes drain, exits cleanly
t = 6:25    old instance 2 finishes drain, exits cleanly
t = 6:25    deploy marked complete

The 120-second SIGTERM grace window is generous compared to most PaaS: Heroku gives 30s, and Vercel functions are effectively bound by their function timeout. For our Laravel and Node APIs, 120s is enough to drain even the longest legitimate request (a CSV export hitting a slow third-party).
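
To check that your longest requests really fit inside the grace window, it helps to know what is in flight when SIGTERM arrives. A minimal sketch, assuming a hypothetical createInflightTracker helper called from request middleware; the injectable clock is there only to make the behaviour easy to verify.

```javascript
// Tracks open requests so a shutdown handler can log what it is waiting for.
function createInflightTracker(now = Date.now) {
  const startedAt = new Map(); // request id -> start timestamp in ms
  let nextId = 0;
  return {
    // Call when a request starts; returns an id to pass to end().
    begin() {
      const id = nextId++;
      startedAt.set(id, now());
      return id;
    },
    // Call when the response has finished or the socket has closed.
    end(id) {
      startedAt.delete(id);
    },
    // Number of requests currently open.
    count() {
      return startedAt.size;
    },
    // Age of the longest-running open request, or 0 when idle.
    oldestAgeMs() {
      if (startedAt.size === 0) return 0;
      return now() - Math.min(...startedAt.values());
    },
  };
}
```

Logging count() and oldestAgeMs() at SIGTERM and again at exit shows whether the drain finished on its own or was cut off by the platform.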

Two things to know about how App Platform handles long-lived connections. First, WebSocket connections are terminated at SIGTERM; the client sees a clean close, not an RST. Second, the load balancer does not currently forward sticky-session cookies for WS connections by default, so any reconnection-based recovery needs to tolerate landing on a different instance. We design every WS handler we ship to be re-entrant for this reason.
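
On the client side, tolerating a different instance usually means reconnecting with backoff and re-sending enough state for any instance to resume the session. A minimal sketch under stated assumptions: the resume message shape (type, sessionId, lastEventId) is a hypothetical protocol, the function names are illustrative, and connect assumes a runtime with a global WebSocket such as a browser.

```javascript
// Exponential backoff with full jitter, capped at maxMs.
function reconnectDelay(attempt, { baseMs = 500, maxMs = 30_000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// First message after every (re)connect. It carries everything the server
// needs, so it does not matter which instance accepts the socket.
function resumeMessage(session) {
  return JSON.stringify({
    type: "resume",
    sessionId: session.sessionId,
    lastEventId: session.lastEventId ?? null,
  });
}

// Reconnect loop: resume on open, remember the last event id, retry on close.
function connect(url, session, attempt = 0) {
  const ws = new WebSocket(url);
  ws.onopen = () => {
    attempt = 0; // a successful connection resets the backoff
    ws.send(resumeMessage(session));
  };
  ws.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    if (msg.id !== undefined) session.lastEventId = msg.id;
  };
  ws.onclose = () => {
    setTimeout(() => connect(url, session, attempt + 1), reconnectDelay(attempt));
  };
  return ws;
}
```

The jitter matters during a deploy: every client on a draining instance is disconnected at roughly the same moment, and without it they would all reconnect in the same instant.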

Deploy 2, Vercel

Vercel's deploy mechanism is fundamentally different. Each push builds a new immutable deployment with its own URL (my-app-xyz.vercel.app). The production alias is then atomically swapped to point at that deployment. There is no rolling shift; the old deployment keeps running its in-flight functions until they finish, and new traffic goes straight to the new deployment.

Vercel, observed deploy timeline (same app) [text]
t = 0:00    push to main
t = 0:05    build starts on Vercel build container
t = 0:48    build complete
t = 0:55    new deployment marked Ready
t = 1:00    production alias swap, atomic, < 1 s
t = 1:00    new traffic 100% on new deployment
t = 1:00 -> ~15s  in-flight functions on old deployment continue
            until they finish or the 15s default timeout

Strengths: the swap itself is instant, builds are fast, and every preview deployment is a real, queryable URL. Rollback is one click and equally instant.

Weaknesses: anything stateful across the swap is your problem. Long-running HTTP responses (file downloads, streaming AI responses) can be cut at the function timeout. WebSocket support on Vercel relies on the edge runtime and is not a primary use case; long-lived sockets on serverless are inherently fragile. If you need persistent connections, Vercel is the wrong shape.

Deploy 3, Render

Render sits in between. It runs your service as a long-lived process (like App Platform), not as a function pool (like Vercel), but the deploy mechanic is simpler than DO's rolling shift. With Zero-downtime deploys enabled on a paid plan, Render starts the new instance, waits for it to pass its health check, then routes 100% of traffic over and SIGTERMs the old instance.

Render, observed deploy timeline [text]
t = 0:00    push to main
t = 0:06    build starts
t = 2:30    build complete, container starts
t = 2:55    health check passes (HTTP 200 on the configured Health Check Path)
t = 2:55    100% traffic switched to new instance
t = 2:55    old instance starts ~30s SIGTERM drain
t = 3:25    old instance terminated

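The 30-second drain is the part you usually need to tune. For most CRUD APIs it is fine; for anything that holds a connection open it is short. If your workload needs longer, raise the shutdown delay: Render exposes this as the maxShutdownDelaySeconds field in render.yaml (its Blueprint spec) rather than as an environment variable, so check the current docs for the allowed range.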

The health check on Render defaults to a TCP probe, not HTTP. If you want behaviour comparable to App Platform's /healthz probe, set the Health Check Path in the dashboard explicitly. We have audited several Render deployments where the team thought they had a real health check and actually had a TCP probe that passed long before the app was ready.

Deploy 4, Fly.io

Fly does not assume a strategy. You pick one in fly.toml and the platform executes it. The two we use most are rolling (the default, similar to DO) and bluegreen (full parallel fleet, then swap). Bluegreen is the one to use for serious traffic shifts.

fly.toml, bluegreen deploy with health checks [toml]
app = "api-prod"
primary_region = "lhr"

[deploy]
  strategy = "bluegreen"
  release_command = "node scripts/migrate.js"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = false
  min_machines_running = 2
  processes = ["app"]

[[http_service.checks]]
  grace_period = "20s"
  interval = "10s"
  method = "GET"
  timeout = "5s"
  path = "/healthz"

[[vm]]
  cpu_kind = "shared"
  cpus = 2
  memory_mb = 1024

Fly bluegreen, observed deploy timeline [text]
t = 0:00    fly deploy (or push if CI)
t = 0:20    image build complete (locally or in fly builder)
t = 0:35    release_command runs migrations, exits 0
t = 0:40    full set of "green" machines started in parallel
t = 1:05    green machines pass health checks
t = 1:05    proxy switches traffic from blue -> green
t = 1:05    blue machines start draining (kill_timeout)
t = 1:35    blue machines terminated

Fly gives you the most knobs but expects you to use them. The default kill_timeout is 5 seconds, which is too short for most production workloads; bump it to 60 or 120 explicitly. WebSocket handling is the cleanest of the four platforms because each app is a real persistent process on a real VM, and the proxy respects existing connections during the swap window.

One real migration, Vercel to App Platform

The cleanest before/after we have is a Next.js + Postgres app we moved off Vercel onto App Platform last quarter. The reason was not performance (Vercel is faster); it was bill predictability. The app had a Stripe webhook handler that occasionally triggered a 90-second batch reconciliation, and the Vercel function-runtime cost was creeping up.

The work, week by week:

  • Week 1: runtime audit. List every code path that assumes serverless: per-request DB connections (fine, but pool them now), file uploads to /tmp (rewrite to stream to object storage), ISR caches (replace with a real Redis or KV cache). Six edits across the codebase.
  • Week 2: Dockerfile + health endpoint. Wrote a minimal Next.js standalone Dockerfile and exposed /api/healthz, which returns { ok: true } once the DB pool has warmed.
  • Week 3: staging on App Platform. Connected the repo, deployed to a staging app spec, and ran the same load test we ran on Vercel. Cold start was 8s on App Platform vs 700ms on Vercel functions; this was expected and is mitigated by keeping min_instance_count: 2.
  • Week 4: cutover. Pointed the Cloudflare-managed apex record at App Platform via an ALIAS-style record (CNAME flattening on Cloudflare), and kept Vercel running for 48 hours behind a feature flag for instant fallback. No 5xx during the shift.

Bill comparison after the first full month: Vercel was $312/mo (Pro plus function execution); App Platform was $73/mo (2 x Professional XS instances + DB) for the same traffic shape. Latency at p95 went from 145ms to 195ms, which is real but acceptable for an internal-facing API.
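
Worked through as arithmetic, using only the four figures quoted in the paragraph above (percentChange is a hypothetical helper, not part of any tooling mentioned here):

```javascript
// Percentage change from `before` to `after`, rounded to one decimal place.
function percentChange(before, after) {
  return Math.round(((after - before) / before) * 1000) / 10;
}

const bill = { vercel: 312, appPlatform: 73 }; // USD per month, from the post
const p95 = { vercel: 145, appPlatform: 195 }; // milliseconds, from the post

const monthlySaving = bill.vercel - bill.appPlatform;
const billChange = percentChange(bill.vercel, bill.appPlatform);
const latencyChange = percentChange(p95.vercel, p95.appPlatform);

console.log({ monthlySaving, billChange, latencyChange });
```

That is a saving of $239/mo, a bill about 76.6% lower (App Platform costs a little under a quarter of the Vercel bill), and p95 latency about 34.5% higher. Whether that trade is acceptable depends on who is waiting on the response.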

The pattern that sets up this migration cleanly is the same engineering hygiene that helps a SaaS get through a Series A audit; see our companion Series A codebase audit study for the broader picture, and the multi-tenant cost study for what to budget per tenant once you are on the new platform. We run end-to-end PaaS migrations like this through our tech-stack migration engagement, usually 3-5 weeks for a Next.js or Node app, including the staging soak and the 48-hour fallback window.

Choosing between the four, in 60 seconds

  • You are mostly serving HTTP responses under 10s, with edge or static caching desirable. Vercel. The function model lines up with the workload, the build is fast, the rollback is instant.
  • You run a long-process API or Laravel / Rails monolith and want predictable infrastructure cost. DigitalOcean App Platform. Boring is the feature; the rolling shift is well-behaved; and the bill, at $73/mo for two instances plus a database, was roughly a quarter of the equivalent Vercel bill.
  • You want the App Platform shape but at a lower price floor. Render. The free / starter tiers are more forgiving than DO's; the trade-off is fewer instance types and a less-detailed deploy log.
  • You need regional placement, per-machine control, or stable WebSocket / SSE workloads. Fly. The deeper the workload, the more Fly's knobs become an advantage instead of a tax.

Reference: production-ready configs

The four configs we ship by default, in roughly the same shape:

Dockerfile, Node app, multi-stage, used on DO / Render / Fly [dockerfile]
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
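# Build, then drop devDependencies so the runtime stage copies a lean node_modules.
RUN npm run build && npm prune --omit=dev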

FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./package.json
EXPOSE 8080
USER node
# Exec-form CMD runs node as PID 1, so it receives SIGTERM directly; handle it in app code.
CMD ["node", "dist/server.js"]

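The server below captures SIGTERM, stops accepting new connections, drains in-flight requests, then exits. Its 60-second hard cutoff sits inside App Platform's 120s grace; on Render and Fly, keep the cutoff below the platform's drain window (~30s by default on Render, kill_timeout on Fly).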

src/server.js, graceful shutdown for App Platform, Render and Fly [javascript]
import http from "http";
import app from "./app.js";

const PORT = process.env.PORT || 8080;
const server = http.createServer(app);
server.listen(PORT, () => console.log("listening on", PORT));

const SHUTDOWN_TIMEOUT_MS = 60_000;

function shutdown(signal) {
  console.log("received", signal, ", draining");
  server.close((err) => {
    if (err) {
      console.error("server close error", err);
      process.exit(1);
    }
    console.log("drain complete");
    process.exit(0);
  });

  // Hard cutoff in case clients hold connections open past
  // platform timeout. Pick a value < your platform's SIGKILL window.
  setTimeout(() => {
    console.warn("forcing shutdown after", SHUTDOWN_TIMEOUT_MS, "ms");
    process.exit(0);
  }, SHUTDOWN_TIMEOUT_MS).unref();
}

process.on("SIGTERM", () => shutdown("SIGTERM"));
process.on("SIGINT", () => shutdown("SIGINT"));

The same graceful-shutdown shape runs unmodified on App Platform, Render and Fly. Vercel handles this for you at the function boundary: you do not write server.close, the platform does. The shape above is the one we paste into every new service shipped through our SaaS web-app development engagement, and the one our maintenance retainer owns post-launch alongside the deploy pipeline.
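
One way to keep the hard cutoff below each platform's SIGKILL window is to derive it from the grace period instead of hard-coding it. A minimal sketch: the helper name, the platform keys and the safety margin are hypothetical, and the default grace values are simply the figures quoted in this post (120s on App Platform, ~30s on Render, 5s for Fly's default kill_timeout), so replace them with whatever you actually configure.

```javascript
// Default grace periods in seconds, taken from the figures quoted in this post.
const DEFAULT_GRACE_SECONDS = {
  "app-platform": 120,
  render: 30,
  fly: 5,
};

// Returns a hard-cutoff value in ms that lands before the platform's SIGKILL.
// Pass graceSeconds when you have tuned the platform setting yourself.
function shutdownTimeoutMs(platform, { graceSeconds, marginSeconds = 5 } = {}) {
  const grace = graceSeconds ?? DEFAULT_GRACE_SECONDS[platform];
  if (!Number.isFinite(grace) || grace <= 0) {
    throw new Error(`unknown grace period for platform: ${platform}`);
  }
  // Never let the margin eat more than half of a short grace window.
  const margin = Math.min(marginSeconds, grace / 2);
  return Math.round((grace - margin) * 1000);
}

// Hypothetical usage in server.js:
//   const SHUTDOWN_TIMEOUT_MS = shutdownTimeoutMs("fly", { graceSeconds: 60 });
```

With the defaults this gives 115 seconds on App Platform, 25 on Render and 2.5 on Fly, which makes the case for raising kill_timeout more plainly than any prose does.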


Frequently asked questions

What does zero-downtime deploy actually require?
Five behaviours: no 5xx during traffic shift, in-flight requests drain on the old version, WebSocket connections close cleanly with a reconnect signal, database migrations stay backward-compatible during the overlap, and rollback is the same command as deploy. Most platforms call any non-502 deploy 'zero-downtime'; that is a low bar.
Which PaaS is cheapest for a typical long-process API?
DigitalOcean App Platform. A 2-instance Professional XS deployment runs around $73/mo against a comparable Vercel function bill in the $300+/mo range. Render sits in between with a more forgiving starter tier. Fly.io is cheapest at the very low end but requires the most configuration.
Why is Vercel weak for WebSocket workloads?
Vercel's deploy mechanism is an atomic alias swap on top of the function pool: the swap is instant, but anything stateful across it is the developer's problem. Long-lived sockets on serverless are inherently fragile. For persistent connections, App Platform, Render or Fly are all better-shaped.
