Engineering playbook

    Zero-downtime push-to-deploy on DigitalOcean App Platform — and how it actually compares to Vercel, Render and Fly

    We have moved 11 production apps onto DigitalOcean App Platform in the last 18 months — most from Vercel or Render. This is the side-by-side deploy story, with timings, traffic-shift mechanics, and rollback paths on all four platforms.

    May 12, 2026 · 21 min read · By Ritesh

    What “zero-downtime” actually has to mean

    Most platforms call any deploy that does not return a 502 “zero-downtime”. That is a low bar. The five behaviours we actually test for, on every platform, before we sign off on a deploy pipeline:

    1. No 5xx during traffic shift — measured by hitting /health at 50 RPS for the full deploy window (a probe sketch follows this list).
    2. In-flight requests drain — long-running POSTs started against the old version finish on the old version.
    3. WebSocket / SSE connections survive — or close cleanly with a reconnect signal, not RST.
    4. DB migrations stay backward-compatible — old and new versions run side-by-side for at least the drain window.
    5. Rollback is the same command as deploy, not a separate operations procedure.
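
    Behaviour 1 is the easiest to verify mechanically. A minimal probe sketch in plain Node; the URL and the run length are placeholders to adjust to your own service and deploy window, and the 50 RPS target mirrors the probe described in item 1:

    deploy-probe.js — count 5xx answers across the deploy window (sketch)
    javascript
    // Hits the health path at a fixed rate and exits non-zero if any 5xx
    // (or connection reset) was seen. Swap http for node:https if the
    // endpoint is TLS-terminated.
    import http from "node:http";

    const TARGET = process.env.PROBE_URL || "http://localhost:8080/healthz";
    const RPS = 50;               // matches the probe rate described above
    const DURATION_S = 10 * 60;   // long enough to cover the deploy window

    let sent = 0;
    let failed = 0;

    const tick = setInterval(() => {
      for (let i = 0; i < RPS; i++) {
        sent++;
        http
          .get(TARGET, (res) => {
            if (res.statusCode >= 500) failed++;
            res.resume(); // discard the body so sockets get reused
          })
          .on("error", () => failed++); // RSTs and refused connections count too
      }
    }, 1000);

    setTimeout(() => {
      clearInterval(tick);
      console.log(`sent=${sent} failed=${failed}`);
      process.exit(failed === 0 ? 0 : 1);
    }, DURATION_S * 1000);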

    Below, all four platforms are scored against those five behaviours, then we walk through one deploy on each. The rest of the post covers the gotchas we hit moving an app between them.

    The matrix

    Behaviour | DO App Platform | Vercel | Render | Fly.io
    No 5xx during shift | Yes if health-check path configured | Yes (atomic alias swap) | Yes if zero-downtime enabled | Yes with bluegreen strategy
    Drain in-flight requests | ~120s SIGTERM grace | Function-bound, ~15s | ~30s by default, configurable | Configurable via kill_timeout
    WS / SSE survive deploy | Closes, client reconnects | Edge-only, fragile for long sockets | Closes cleanly | Cleanest of the four (per-machine)
    Side-by-side runtime | ~2 min overlap during shift | Atomic; no overlap | ~30-90s overlap | Long, depends on rollout
    One-click rollback | Yes (per deployment) | Yes (per deployment, instant) | Yes | Manual: fly deploy --image
    Median deploy time, 250MB image | 4–6 min | 35–90 s | 3–5 min | 90 s–3 min

    Sources: per-platform documentation; deploy timings averaged across our 11 production migrations (Node, Python, Laravel, Next.js).

    The numbers tell a consistent story: Vercel is the fastest at the swap mechanism itself (it is just an alias) but the worst at long-lived connections. App Platform and Render are the most predictable for typical CRUD APIs. Fly is the most controllable but expects you to design the strategy yourself.

    Deploy 1 — DigitalOcean App Platform, second by second

    The platform builds your container in a managed Cloud-Native Buildpacks runner, pushes the resulting image to DOCR, then performs a rolling deploy across the configured number of instances. The health-check path is the single most important config — get it wrong and you ship downtime.

    The two settings that matter are http_path and initial_delay_seconds. The default initial_delay_seconds is 0, which fails the health check before the app boots; set it to your real cold-start time plus 5 seconds.

    .do/app.yaml — production health-check config
    yaml
    name: api-prod
    services:
      - name: web
        github:
          repo: appycodes/api
          branch: main
          deploy_on_push: true
        instance_size_slug: professional-xs
        instance_count: 2
        http_port: 8080
        health_check:
          http_path: /healthz
          initial_delay_seconds: 25
          period_seconds: 10
          timeout_seconds: 5
          success_threshold: 1
          failure_threshold: 3
        routes:
          - path: /
        envs:
          - key: DATABASE_URL
            scope: RUN_TIME
            type: SECRET
        autoscaling:
          min_instance_count: 2
          max_instance_count: 8
          metrics:
            cpu:
              percent: 70

    The deploy lifecycle, as we measure it from the DO control panel:

    DO App Platform — observed deploy timeline (250MB Node image, 2 instances)
    text
    t = 0:00    push to main, webhook fires
    t = 0:08    build container starts in DO managed runner
    t = 2:55    build complete, image pushed to DOCR
    t = 3:00    new instance 1 starts in parallel to old 1
    t = 3:25    new instance 1 passes health check
    t = 3:25    load balancer adds new instance 1, removes old instance 1
    t = 3:30    old instance 1 receives SIGTERM, has 120s to drain
    t = 3:55    new instance 2 starts in parallel to old 2
    t = 4:20    new instance 2 passes health check
    t = 4:20    load balancer swap; old instance 2 starts draining
    t = 5:30    old instance 1 finishes drain, exits cleanly
    t = 6:25    old instance 2 finishes drain, exits cleanly
    t = 6:25    deploy marked complete

    The 120-second SIGTERM grace window is generous compared to most PaaS — Heroku gives 30s, Vercel functions are effectively bound by their function timeout. For our Laravel and Node APIs, 120s is enough to drain even the longest legitimate request (a CSV export hitting a slow third-party).

    Two things to know about how App Platform handles long-lived connections. First, WebSocket connections are terminated at SIGTERM — the client sees a clean close, not an RST. Second, the load balancer does not currently forward sticky-session cookies for WS connections by default, so any reconnection-based recovery needs to tolerate landing on a different instance. We design every WS handler we ship to be re-entrant for this reason.
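
    A minimal sketch of that shape, assuming the ws package on the server (the port and the close reason are placeholders): on SIGTERM every socket gets a normal close frame with a reconnect hint, and the handler derives everything from the request or a shared store rather than instance-local memory, so the reconnect can land on any instance.

    ws-server.js — clean close on SIGTERM, re-entrant handler (sketch)
    javascript
    import { WebSocketServer } from "ws";

    const wss = new WebSocketServer({ port: 8080 });

    wss.on("connection", (ws) => {
      // Re-entrant: nothing the handler needs lives only in this process.
      // Subscriptions, cursors and session state come from the request or a
      // shared store (Redis, Postgres), so a reconnect to another instance
      // starts from the same place.
      ws.on("message", (data) => {
        // handle message ...
      });
    });

    process.on("SIGTERM", () => {
      // 1012 = "service restart": a clean close frame, not an RST.
      // Clients treat it as "reconnect with backoff".
      for (const client of wss.clients) {
        client.close(1012, "redeploying, please reconnect");
      }
      wss.close();
    });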

    Deploy 2 — Vercel

    Vercel's deploy mechanism is fundamentally different. Each push builds a new immutable deployment with its own URL (my-app-xyz.vercel.app). The production alias is then atomically swapped to point at that deployment. There is no rolling shift; the old deployment keeps running its in-flight functions until they finish, and new traffic goes straight to the new deployment.

    Vercel — observed deploy timeline (same app)
    text
    t = 0:00    push to main
    t = 0:05    build starts on Vercel build container
    t = 0:48    build complete
    t = 0:55    new deployment marked Ready
    t = 1:00    production alias swap — atomic, < 1 s
    t = 1:00    new traffic 100% on new deployment
    t = 1:00 -> ~15s  in-flight functions on old deployment continue
                until they finish or the 15s default timeout

    Strengths: the swap itself is instant, builds are fast, and every preview deployment is a real, queryable URL. Rollback is one click and equally instant.

    Weaknesses: anything stateful across the swap is your problem. Long-running HTTP responses (file downloads, streaming AI responses) can be cut at the function timeout. WebSocket support on Vercel relies on the edge runtime and is not a primary use case — long-lived sockets on serverless are inherently fragile. If you need persistent connections, Vercel is the wrong shape.
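
    For long responses that are genuinely HTTP rather than sockets, the usual mitigation is to raise the per-route limit instead of fighting the platform. A hedged sketch, assuming a Next.js App Router route handler and a plan that allows the longer duration; the path and the 300-second value are assumptions:

    app/api/export/route.js — raising the function limit for a streaming route (sketch)
    javascript
    // Route segment config read by Vercel: maximum execution time in seconds.
    export const maxDuration = 300;

    export async function GET() {
      const encoder = new TextEncoder();
      const stream = new ReadableStream({
        async start(controller) {
          // Push rows (or tokens) as they are produced so the client sees
          // progress instead of a connection cut at the default timeout.
          controller.enqueue(encoder.encode("id,name\n"));
          controller.close();
        },
      });
      return new Response(stream, { headers: { "Content-Type": "text/csv" } });
    }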

    Deploy 3 — Render

    Render sits in between. It runs your service as a long-lived process (like App Platform), not as a function pool (like Vercel), but the deploy mechanic is simpler than DO's rolling shift. With Zero-downtime deploys enabled on a paid plan, Render starts the new instance, waits for it to pass its health check, then routes 100% of traffic over and SIGTERMs the old instance.

    Render — observed deploy timeline
    text
    t = 0:00    push to main
    t = 0:06    build starts
    t = 2:30    build complete, container starts
    t = 2:55    health check passes (default: HTTP 200 on /)
    t = 2:55    100% traffic switched to new instance
    t = 2:55    old instance starts ~30s SIGTERM drain
    t = 3:25    old instance terminated

    The 30-second drain is the part you usually need to tune. For most CRUD APIs it is fine; for anything that holds a connection open it is short. Increase it via the RENDER_DRAIN_SECONDS environment variable if your workload needs it.

    The health-check on Render defaults to TCP, not HTTP. If you want behaviour comparable to App Platform's /healthz probe, set the Health Check Path in the dashboard explicitly. We have audited several Render deployments where the team thought they had a real health check and actually had a TCP probe that passed long before the app was ready.

    Deploy 4 — Fly.io

    Fly does not assume a strategy: you declare one in fly.toml and the platform executes it. The two we use most: rolling (default, similar to DO) and bluegreen (full parallel fleet, then swap). Bluegreen is the one to use for serious traffic shifts.

    fly.toml — bluegreen deploy with health checks
    toml
    app = "api-prod"
    primary_region = "lhr"
    
    [deploy]
      strategy = "bluegreen"
      release_command = "node scripts/migrate.js"
    
    [http_service]
      internal_port = 8080
      force_https = true
      auto_stop_machines = false
      min_machines_running = 2
      processes = ["app"]
    
    [[http_service.checks]]
      grace_period = "20s"
      interval = "10s"
      method = "GET"
      timeout = "5s"
      path = "/healthz"
    
    [[vm]]
      cpu_kind = "shared"
      cpus = 2
      memory_mb = 1024

    Fly bluegreen — observed deploy timeline
    text
    t = 0:00    fly deploy (or push if CI)
    t = 0:20    image build complete (locally or in fly builder)
    t = 0:35    release_command runs migrations, exits 0
    t = 0:40    full set of "green" machines started in parallel
    t = 1:05    green machines pass health checks
    t = 1:05    proxy switches traffic from blue -> green
    t = 1:05    blue machines start draining (kill_timeout)
    t = 1:35    blue machines terminated

    Fly gives you the most knobs but expects you to use them. The default kill_timeout is 5 seconds, which is too short for most production workloads. Bump it to 60 or 120 explicitly. WebSocket handling is the cleanest of the four platforms because each app is a real persistent process on a real VM — the proxy will respect existing connections during the swap window.
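
    The release_command in the config above is the other half of the safety story: it runs once before the green machines take traffic, and a non-zero exit aborts the deploy. A minimal sketch of what scripts/migrate.js can look like, assuming node-postgres, a migrations/ directory of plain .sql files applied in name order, and a schema_migrations bookkeeping table (all assumptions, not necessarily what the app in this post uses):

    scripts/migrate.js — minimal release-command migration runner (sketch)
    javascript
    // Applies migrations/*.sql in name order, recording each in a
    // schema_migrations table. Exits non-zero on any failure so the
    // bluegreen deploy aborts before traffic shifts.
    import { readdir, readFile } from "node:fs/promises";
    import path from "node:path";
    import pg from "pg";

    const client = new pg.Client({ connectionString: process.env.DATABASE_URL });

    async function main() {
      await client.connect();
      await client.query(
        `CREATE TABLE IF NOT EXISTS schema_migrations (
           name text PRIMARY KEY,
           applied_at timestamptz NOT NULL DEFAULT now()
         )`
      );

      const files = (await readdir("migrations"))
        .filter((f) => f.endsWith(".sql"))
        .sort();

      for (const file of files) {
        const seen = await client.query(
          "SELECT 1 FROM schema_migrations WHERE name = $1",
          [file]
        );
        if (seen.rowCount > 0) continue; // already applied on a previous deploy

        const sql = await readFile(path.join("migrations", file), "utf8");
        await client.query("BEGIN");
        await client.query(sql);
        await client.query("INSERT INTO schema_migrations (name) VALUES ($1)", [file]);
        await client.query("COMMIT");
        console.log("applied", file);
      }

      await client.end();
    }

    main().catch((err) => {
      console.error("migration failed:", err);
      process.exit(1); // non-zero exit aborts the Fly deploy
    });

    The runner is deliberately dumb; the property that matters is behaviour 4 from the checklist: every migration has to be safe to run while the blue machines are still serving the old code.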

    One real migration — Vercel to App Platform

    The cleanest before/after we have is a Next.js + Postgres app we moved off Vercel onto App Platform last quarter. The reason was not performance — Vercel is faster — it was bill predictability. The app had a Stripe webhook handler that occasionally triggered a 90-second batch reconciliation, and the Vercel function-runtime cost was creeping up.

    The work, week by week:

    • Week 1 — runtime audit. List every code path that assumes serverless: per-request DB connections (fine, but pool them now), file uploads to /tmp (rewrite to stream to object storage), ISR caches (replace with real Redis or KV cache). Six edits across the codebase.
    • Week 2 — Dockerfile + health endpoint. Wrote a minimal Next.js standalone Dockerfile and exposed /api/healthz, which returns { ok: true } once the DB pool warms (sketch after this list).
    • Week 3 — staging on App Platform. Connected the repo, deployed to a staging app spec, ran the same load test we ran on Vercel. Cold start: 8s on App Platform vs 700ms on Vercel functions — expected, mitigated by keeping min_instance_count: 2.
    • Week 4 — cutover. Pointed the Cloudflare-managed apex record at App Platform via an ALIAS, kept Vercel running for 48 hours behind a feature flag for instant fallback. No 5xx during the shift.
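
    The week-2 health endpoint is worth spelling out, because it is what the health_check probe in the earlier app spec actually hits. A minimal sketch, assuming a Next.js App Router route and a shared node-postgres pool module; the @/lib/db import is an assumption:

    app/api/healthz/route.js — health endpoint gated on the DB pool (sketch)
    javascript
    import { NextResponse } from "next/server";
    import { pool } from "@/lib/db"; // shared pg.Pool singleton (assumed module)

    export async function GET() {
      try {
        await pool.query("SELECT 1"); // proves a connection can be checked out
        return NextResponse.json({ ok: true });
      } catch (err) {
        console.error("healthz failed", err);
        // 503 keeps this instance out of the load balancer until the DB answers.
        return NextResponse.json({ ok: false }, { status: 503 });
      }
    }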

    Bill comparison after the first full month: Vercel was $312/mo (Pro plus function execution), App Platform was $73/mo (2 × Professional XS instances + DB) for the same traffic shape. The latency at p95 went from 145ms to 195ms — real but acceptable for an internal-facing API.

    The pattern that sets up this migration cleanly is the same engineering hygiene that helps a SaaS get through a Series A audit — see our companion Series A codebase audit study for the broader picture, and the multi-tenant cost study for what to budget per-tenant once you are on the new platform. We run end-to-end PaaS migrations like this through our tech-stack migration engagement — usually 3-5 weeks for a Next.js or Node app, including the staging soak and the 48-hour fallback window.

    Choosing between the four, in 60 seconds

    • You are mostly serving HTTP responses under 10s, with edge or static caching desirable. Vercel. The function model lines up with the workload, the build is fast, the rollback is instant.
    • You run a long-process API or Laravel / Rails monolith and want predictable infrastructure cost. DigitalOcean App Platform. Boring is the feature; the rolling shift is well-behaved; the bill at $73/mo for two instances is a fifth of equivalent Vercel.
    • You want the App Platform shape but at a lower price floor. Render. The free / starter tiers are more forgiving than DO's; the trade-off is fewer instance types and a less-detailed deploy log.
    • You need regional placement, per-machine control, or stable WebSocket / SSE workloads. Fly. The deeper the workload, the more Fly's knobs become an advantage instead of a tax.

    Reference: production-ready configs

    The four configs we ship by default, in roughly the same shape:

    Dockerfile — Node app, multi-stage, used on DO / Render / Fly
    dockerfile
    FROM node:20-alpine AS build
    WORKDIR /app
    COPY package*.json ./
    RUN npm ci
    COPY . .
    RUN npm run build
    
    FROM node:20-alpine
    WORKDIR /app
    ENV NODE_ENV=production
    COPY --from=build /app/node_modules ./node_modules
    COPY --from=build /app/dist ./dist
    COPY --from=build /app/package.json ./package.json
    EXPOSE 8080
    USER node
    # Exec-form CMD runs node as PID 1, so SIGTERM reaches the app directly;
    # catch it in app code (see server.js below).
    CMD ["node", "dist/server.js"]

    Captures SIGTERM, stops accepting new connections, drains in-flight requests, then exits. Tuned for the 120s App Platform grace window and the Fly kill_timeout.

    src/server.js — graceful shutdown that works on all four
    javascript
    import http from "http";
    import app from "./app.js";
    
    const PORT = process.env.PORT || 8080;
    const server = http.createServer(app);
    server.listen(PORT, () => console.log("listening on", PORT));
    
    const SHUTDOWN_TIMEOUT_MS = 60_000;
    
    function shutdown(signal) {
      console.log("received", signal, "— draining");
      server.close((err) => {
        if (err) {
          console.error("server close error", err);
          process.exit(1);
        }
        console.log("drain complete");
        process.exit(0);
      });
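      // On Node 18, server.close() leaves idle keep-alive sockets open; call
      // server.closeIdleConnections() (added in 18.2) here as well so they do
      // not stall the drain. Newer Node versions close idle sockets in close().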
    
      // Hard cutoff in case clients hold connections open past
      // platform timeout. Pick a value < your platform's SIGKILL window.
      setTimeout(() => {
        console.warn("forcing shutdown after", SHUTDOWN_TIMEOUT_MS, "ms");
        process.exit(0);
      }, SHUTDOWN_TIMEOUT_MS).unref();
    }
    
    process.on("SIGTERM", () => shutdown("SIGTERM"));
    process.on("SIGINT", () => shutdown("SIGINT"));

    The same graceful-shutdown shape runs unmodified on App Platform, Render and Fly. Vercel handles this for you at the function boundary — you do not write server.close, the platform does. The shape above is the one we paste into every new service shipped through our SaaS web-app development engagement, and the one our maintenance retainer owns post-launch alongside the deploy pipeline.

    Frequently asked questions

    What does zero-downtime deploy actually require?
    Five behaviours: no 5xx during traffic shift, in-flight requests drain on the old version, WebSocket connections close cleanly with a reconnect signal, database migrations stay backward-compatible during the overlap, and rollback is the same command as deploy. Most platforms call any non-502 deploy 'zero-downtime' — that is a low bar.
    Which PaaS is cheapest for a typical long-process API?
    DigitalOcean App Platform. A 2-instance Professional XS deployment runs around $73/mo against a comparable Vercel function bill in the $300+/mo range. Render sits in between with a more forgiving starter tier. Fly.io is cheapest at the very low end but requires the most config knobs.
    Why is Vercel weak for WebSocket workloads?
    Vercel's deploy mechanism is an atomic alias swap on top of the function pool — the swap is instant but anything stateful across it is the developer's problem. Long-lived sockets on serverless are inherently fragile. For persistent connections, App Platform, Render or Fly are all better-shaped.

    About the author

    Ritesh — Founding Partner, Appycodes


    Ritesh leads engineering at Appycodes. The 11 migrations behind this post include three Next.js apps off Vercel to App Platform, two Node APIs onto Fly bluegreen, and a Laravel monolith that has lived on Render since 2022. The graceful-shutdown shape in the reference section is the one we paste into every new service we ship.

    Last reviewed: May 12, 2026
