
Why your engineers are firefighting deploys at 2am

By Anthony Odole · April 25, 2026 · 10 min read

It's 2:14am. Your senior engineer is staring at a deploy log that has gone silent for fifty-eight minutes. The build container is still running. The status in your deploy dashboard says "in progress." Nothing has been logged since the line that read Adding build arguments to Docker Compose build command.

She has been at this for three hours. The first attempt failed at minute twenty-one with a one-line error: Command execution failed (exit code 255). The second attempt — same code, same configuration, same host — failed silently at the sixty-minute timeout. She is on the third attempt now, which is why she is still awake.

At 3:47am she sends a Slack message to the team channel: "Think it's done. Going to bed. Will check in the morning."

The next morning she checks. The container is running. The container is also crash-looping. Both are true. The dashboard shows green. The log shows Error: Cannot find module 'dotenv' repeating every two seconds.

This story isn't unusual. Some version of it happens in engineering organizations every week, and most of the time the people involved write it off as the cost of doing business. The team's senior engineers absorb the firefighting because they're the ones who can. Eventually they stop noticing how often it happens.

If you're a CTO reading this, the engineer in that scene isn't the subject of this article. You are.

The 2am session is a symptom. The system that produces 2am sessions is the disease. Your engineers can't fix the system because they don't have the lever. You do.

The three systemic reasons this keeps happening

I have spent the better part of two decades looking at the inside of enterprise deploy pipelines, and I have stopped being surprised by what I find. The 2am session always traces back to one of three things, usually all three at once.

Vendor defaults aren't your defaults

Tools ship with settings calibrated so they won't break your data, never so they won't break your team. There is a vast gap between those two configurations.

Take container orchestration platforms. The default Docker cleanup runs once a night and removes dangling images. It does not aggressively prune build cache, which means BuildKit cache mounts grow unbounded between cleanups. On a host that runs four service builds in a single push window, that cache can accumulate thirty-six gigabytes inside a day. The cleanup runs nightly and successfully. The disk fills anyway. The build that fails at 2am does so because BuildKit's image export step is fighting cache contention that nobody told the cleanup to handle.
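
If you want to see what explicitly owning that default looks like, here is a minimal sketch. It assumes Docker with BuildKit on the build host and a scheduler already in place (cron, a systemd timer, a CI job); the 20GB cap is a placeholder you would tune to your own push volume, not a recommendation.

```typescript
// prune-build-cache.ts — a minimal sketch of an *explicit* cleanup policy.
// Assumes Docker with BuildKit on the build host, run after each push window.
import { execFileSync } from "node:child_process";

// Placeholder threshold: tune to your host's disk and daily build volume.
const KEEP_BUILD_CACHE = "20GB";

function run(cmd: string, args: string[]): string {
  return execFileSync(cmd, args, { encoding: "utf8" }).trim();
}

// What most platform defaults already do: drop dangling images.
console.log(run("docker", ["image", "prune", "--force"]));

// What they usually do not do: cap the BuildKit cache so it cannot grow
// unbounded between nightly cleanups. (--keep-storage is the older flag name;
// newer Docker releases call it --max-used-space.)
console.log(run("docker", ["builder", "prune", "--force", `--keep-storage=${KEEP_BUILD_CACHE}`]));
```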

Take Node.js applications. Most deploy platforms inject NODE_ENV=production as a build-time argument by default. This is a sensible default for a generic Node application. It is the wrong default for a Next.js application that needs devDependencies during the build to compile. The platform considers it correct. The application breaks. Neither the platform vendor nor the application framework is wrong — the combination is, and nobody at your company explicitly chose to use them together.
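
One cheap mitigation is to make the mismatch fail loudly at minute one instead of cryptically at minute forty. Here is a sketch of a prebuild guard, with the caveat that the package names below are placeholders for whatever your build genuinely pulls from devDependencies.

```typescript
// check-build-deps.ts — hypothetical prebuild guard for a Next.js service.
// Fails fast, with a readable message, if build-time-only dependencies were
// skipped because NODE_ENV=production (or --omit=dev) leaked into the install.
import { createRequire } from "node:module";

const require = createRequire(import.meta.url);

// Placeholder list: replace with the devDependencies your build actually needs.
const buildTimeDeps = ["typescript", "tailwindcss"];

const missing = buildTimeDeps.filter((pkg) => {
  try {
    require.resolve(pkg);
    return false;
  } catch {
    return true;
  }
});

if (missing.length > 0) {
  console.error(
    `Build-time dependencies missing: ${missing.join(", ")}.\n` +
      "Most likely cause: NODE_ENV=production was set during install, so devDependencies were skipped."
  );
  process.exit(1);
}
```

The cleaner fix is usually to stop injecting NODE_ENV=production at build time entirely and set it only on the running container; the guard just makes the failure legible until you get there.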

The unspoken rule: every default on your critical path is a decision someone should have explicitly made. If no one has made it, you've outsourced it to whichever vendor calibrated for the average user. Your workload is not the average user's. Compound that across a deploy pipeline with five vendors and twenty defaults, and the average outcome is that you're firefighting at 2am for reasons that won't show up on any dashboard.

Compound state in shared infrastructure

Each individual decision is small.

A container orchestrator leaves a "helper" container running after a failed deploy. It looks harmless. It is holding a lock on a BuildKit cache mount.

A Node monorepo has both a workspace lockfile at the root and a per-package lockfile inside each service directory. Both work fine. The build process uses the per-package one. The development environment uses the workspace one. They drift independently for six months.

A Dockerfile copies the entire builder node_modules into the runner stage. It includes 500MB of testing libraries that production never uses. The image takes seven and a half minutes to export. Until one day it doesn't, because BuildKit hits a resource limit during export that's invisible in any monitoring tool.

The failure mode is what happens when these stack. The third retry of the same deploy fails differently from the first two — not because the code changed, but because the environment did. The orphan helper holds a lock for a while, then releases it, then a different orphan holds a different lock. The cache regrows between cleanups in a way that never quite reaches the manual prune threshold. The lockfile drift produces a Cannot find module error that didn't exist in any prior deploy.

This is the part your engineers can't see in real time. Each individual piece looks fine in isolation. The interaction is what bites. They fix what they can see — they kill the orphan, they prune the cache, they retry — and they get the deploy through eventually. The interaction itself goes undocumented because there's no incident to file. Until the next 2am session, when a different combination of the same compounding pieces produces a different failure mode and the documented fix from last time doesn't work.
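
A few of these compounding pieces can at least be caught mechanically before they stack, and the lockfile drift is one of them. Here is a sketch of a CI guard, assuming an npm workspace with services under a services/ directory; the paths and lockfile names are assumptions you would adjust to your own layout.

```typescript
// check-lockfile-drift.ts — hypothetical CI guard for an npm workspace.
// Assumption: the monorepo keeps one lockfile at the root and services live
// under services/<name>/. Per-package lockfiles are treated as drift.
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

const SERVICES_DIR = "services"; // placeholder: your actual workspace layout

const strays = readdirSync(SERVICES_DIR, { withFileTypes: true })
  .filter((entry) => entry.isDirectory())
  .map((entry) => join(SERVICES_DIR, entry.name, "package-lock.json"))
  .filter((lockfile) => existsSync(lockfile));

if (strays.length > 0) {
  console.error("Per-package lockfiles found (these drift from the workspace lockfile):");
  for (const lockfile of strays) console.error(`  ${lockfile}`);
  process.exit(1);
}
```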

The visibility gap

"Deployed" in your dashboard lags reality by minutes to hours, sometimes days.

A container can be crash-looping on a 60-second restart cycle while your platform's health check shows green for the first 30 seconds of each cycle. If your monitoring polls once a minute, every poll can land in the green half of the cycle, and you never catch the red.

Build success is the easiest thing to measure. Runtime success is what actually matters. Most platforms conflate the two: they report "deploy succeeded" the moment the container starts, which is technically accurate and operationally useless. The engineer who's awake at 2am knows this. The CTO who reads the dashboard at 7am does not.

Health checks measure liveness — is the process responding to a TCP connection? — not correctness. A Next.js server can respond to /api/health with a 200 while the BullMQ worker that runs alongside it has crashed five times in the last ten minutes and is now wedged. Both are true. Both are visible. Neither is what the dashboard reports.
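
Closing that gap is not exotic. Here is a sketch of a health endpoint that reports correctness rather than liveness, under assumptions: an Express server, a worker process that writes a heartbeat timestamp to Redis every few seconds, and ioredis as the client. The key name and threshold are placeholders.

```typescript
// health.ts — a sketch of a health check that reports correctness, not just liveness.
// Assumption: the worker process does something like
//   setInterval(() => redis.set("worker:heartbeat", Date.now().toString()), 5_000);
import express from "express";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const app = express();

const HEARTBEAT_KEY = "worker:heartbeat"; // written by the worker process
const MAX_HEARTBEAT_AGE_MS = 30_000;      // tune to your worker's cadence

app.get("/api/health", async (_req, res) => {
  const heartbeat = await redis.get(HEARTBEAT_KEY);
  const age = heartbeat ? Date.now() - Number(heartbeat) : Infinity;

  if (age > MAX_HEARTBEAT_AGE_MS) {
    // The web process is alive, but the system is not healthy.
    res.status(503).json({ web: "ok", worker: "stale", heartbeatAgeMs: age });
    return;
  }
  res.status(200).json({ web: "ok", worker: "ok", heartbeatAgeMs: age });
});

app.listen(3000);
```

The design choice is the 503: the web process is alive, but the endpoint refuses to claim the system is healthy while the worker is wedged.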

Every CTO learns this once. Learning it twice means the org didn't internalize the lesson the first time.

The actual cost of the 2am session

Let me show you the math, because most CTOs are systematically undercounting this.

Take a team of eight engineers averaging one deploy fire per week. Each fire consumes roughly three hours of a senior engineer's time and one hour of a junior's. Use fully loaded rates — salary plus benefits plus overhead, the rate your CFO actually thinks about. For a US-based team, that's around $200/hour for a senior and $120/hour for a junior.

One fire: 3 × $200 + 1 × $120 = $720.

Fifty-two weeks in a year: $37,440 in direct hours.

That's the visible part. The invisible part is roughly four times bigger:

  • Sleep debt costs. Your senior engineer who's up until 4am loses 30-40% of her productivity the next day. The PR she would have shipped doesn't ship. The code review she would have given gets pushed. The architectural decision she would have weighed in on goes through without her input. None of these show up in any timesheet.
  • Calendar slip. Releases that were supposed to ship on Tuesday ship Thursday. Sales demos miss the new feature. Marketing's launch sequence gets replanned. The engineering team takes the blame for "moving slowly" when in fact they spent half their week firefighting deploys that should have been frictionless.
  • Attrition risk. Senior engineers correlate "this org has recurring deploy fires" with "this org has technical debt I can't fix from my seat." They start interviewing. Replacing a senior engineer costs $150K-$250K in recruitment, ramp-up, and lost productivity. One avoidable departure per year wipes out anything you saved by not investing in the underlying system.

Total annual cost of compound infrastructure debt for a team of eight: roughly $150K-$225K, and that's a conservative estimate. Larger teams scale this proportionally — sometimes worse than linear, because more services share more infrastructure.
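
If you want to pressure-test those numbers with your own rates, the arithmetic is small enough to keep in a script. Everything below is an assumption taken from this section, including the 4x multiplier for the invisible costs; swap in your own figures.

```typescript
// deploy-fire-cost.ts — back-of-envelope cost model using the assumptions above.
const SENIOR_RATE = 200;        // fully loaded $/hour
const JUNIOR_RATE = 120;        // fully loaded $/hour
const SENIOR_HOURS_PER_FIRE = 3;
const JUNIOR_HOURS_PER_FIRE = 1;
const FIRES_PER_WEEK = 1;
const WEEKS_PER_YEAR = 52;
const INVISIBLE_MULTIPLIER = 4; // sleep debt, calendar slip, attrition risk

const directPerFire =
  SENIOR_HOURS_PER_FIRE * SENIOR_RATE + JUNIOR_HOURS_PER_FIRE * JUNIOR_RATE; // $720
const directPerYear = directPerFire * FIRES_PER_WEEK * WEEKS_PER_YEAR;       // $37,440
const totalPerYear = directPerYear * (1 + INVISIBLE_MULTIPLIER);             // $187,200

console.log({ directPerFire, directPerYear, totalPerYear });
```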

Most CTOs I talk to are surprised by these numbers. They've been counting only the direct hours. The direct hours are the smallest line item. (The full forensic accounting is here.)

The pattern that almost no internal team can see

The most important thing I will say in this article:

The engineers in the 2am scene did nothing wrong. They followed the runbook. They escalated when they should have. They tried sensible diagnostics. They got the deploy through.

The system was designed to produce that session.

This matters because most CTOs default to one of two responses when 2am sessions become a pattern: my engineers need to be more careful, or we need better processes. Both responses put the burden on the people who can't fix the underlying issue. Neither one will work.

The fix is architectural. Architecture is the only lever you have, and it's a lever your team can't pull from inside the system. They're too close to it. The orphan container they're killing tonight will be back next week because the system that produces orphan containers hasn't changed. The cache prune they ran today will need to run again tomorrow. They are, correctly, treating symptoms. They lack the authority and the vantage point to treat the disease.

This is the moment most CTOs realize they need an outside view. The engineers can't see the pattern from inside. The CTO can see the pattern but can't credibly diagnose it without putting their own team on the defensive. The work of mapping the compound state and proposing the architectural intervention is best done by someone who has seen this pattern in five other companies and can name the specific shape of the failure.

Six diagnostic questions

Read these slowly. Each one is binary — you either know the answer or you don't, and "I'd have to ask" is a no.

  1. When was the last time someone explicitly audited your container cleanup defaults across your environments? "The platform came with cleanup turned on" doesn't count — that's the vendor's audit, not yours.

  2. Can you list which dependencies in your package.json (or equivalent) land in your production runtime versus only at build time? If not, you're either shipping dev tools to prod or breaking runtime when you decide to stop.

  3. Of the deploy failures in your last 30 days, what percentage share the same root cause? If most failures cluster on one root cause, you're hitting the same wall repeatedly without fixing it. The fact that "we got it working eventually" each time is masking the cost.

  4. Has any deploy in the last quarter required SSH-into-host intervention? If yes, your platform abstraction is leaking. The platform exists precisely so engineers don't need shell access to fix deploys. Every SSH session is a debt your platform owes back to your team.

  5. When deploys fail, do your engineers reach for Stack Overflow or for your runbook first? If Stack Overflow, your runbook is stale or doesn't exist. Either way, you're paying engineering time to redocument knowledge you should already own.

  6. Do you know which of your environments share infrastructure with which others — same host, same network, same BuildKit cache, same Docker daemon? If "I'd have to ask," you don't know your blast radius. A failure in staging that takes down production because they share a daemon is a one-paragraph postmortem and a board-level conversation.

Four or more answers in the "no" or "I'd have to ask" column means you've crossed from working-with-friction into architectural debt. That's a different problem than what your team is currently solving, and it needs a different intervention. (A more thorough six-indicator self-test is here.)
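
Question six is the one you can answer mechanically in an afternoon. Here is a sketch that groups containers by Compose project on a single Docker daemon; run it once per host. It assumes your environments deploy as Compose projects, which may not match your setup, but the shape of the exercise is the same.

```typescript
// map-blast-radius.ts — a sketch of answering question six mechanically.
// Assumption: environments are deployed as Docker Compose projects; containers
// on the same daemon share its BuildKit cache, network defaults, and disk.
import { execFileSync } from "node:child_process";

const out = execFileSync(
  "docker",
  ["ps", "--all", "--format", '{{.Names}}\t{{.Label "com.docker.compose.project"}}'],
  { encoding: "utf8" }
);

// Group containers by Compose project; every project in this map shares the
// same daemon, and therefore the same blast radius.
const byProject = new Map<string, string[]>();
for (const line of out.trim().split("\n").filter(Boolean)) {
  const [name, project] = line.split("\t");
  const key = project || "(no compose project)";
  byProject.set(key, [...(byProject.get(key) ?? []), name]);
}

for (const [project, containers] of byProject) {
  console.log(`${project}: ${containers.join(", ")}`);
}
```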

What the intervention actually looks like

Not a sales pitch. The shape of an architectural review.

A typical engagement runs four to six weeks. Three deliverables:

  • A current-state map of your deploy infrastructure: what runs where, which environments share which substrates, where the implicit decisions are. This is the thing your team can't produce on their own because they can't see their own pattern from inside it.
  • A prioritized risk register — every compound state issue I find, ordered by likelihood × blast radius. Some of them are urgent. Most of them are not. The ones that are not need to be in the register anyway, because they will become urgent and your team will be firefighting them at 2am if you don't sequence them earlier.
  • An intervention sequence with cost-benefit per item, written in the language a CFO can read. This is what you take into a budget conversation when you need to justify infrastructure investment that the company has been deferring.

The work is built around three inputs: interviews with the engineers doing the firefighting (they know more than anyone what hurts), reading the actual deploy logs from the last 90 days (the patterns are in there), and auditing the actual configurations across your environments (where the vendor defaults you never explicitly chose are sitting).

The output is a document you can use, a backlog your team can execute, and — usually — a 60-90% reduction in deploy fires within two quarters.


The engineer firefighting at 2am isn't the problem to solve. The system that produces those sessions is. The lever to change it is yours, not theirs.

If your team is working harder than the system seems to justify — if the 2am sessions are recurring with shifting symptoms — the cause is upstream of where you've been looking.


Considering an audit of your own? I run four-to-six-week deploy infrastructure audits for engineering organizations carrying compound infrastructure debt. The output is a current-state map, a prioritized risk register, and an intervention sequence with cost-benefit per item — written for your CFO. Investment: $55,000–$75,000 fixed-fee.

Read more about the audit →

— Anthony Odole · ex-IBM Senior Managing Consultant · 18 years in enterprise architecture