For the system you inherited that won't stop paging you.
Architecture review, targeted fixes, pipeline and observability retrofit, and a written handover the next team can actually use.
You’ll recognise it if -
- Incidents come in the plural - and the postmortems are getting repetitive.
- The last vendor left, and the code they wrote is the only documentation you have.
- Throughput, latency, or reliability has fallen off a cliff and nobody is sure when it started.
- Every change carries the risk of a bigger incident than the one you are fixing.
- Half the team is firefighting full-time and cannot ship new features.
By the time we leave -
- A stabilised system with the fires out. Incident volume drops by an order of magnitude within the engagement.
- A written postmortem and architectural review your team can learn from and point to.
- Observability retrofitted so the next incident is caught in minutes, not hours.
- Runbooks and dashboards your on-call rotation will actually use.
- A system your team inherits with confidence, not dread.
5 steps, in this order.
Every engagement follows roughly the same shape. The details change with your situation; the order rarely does.
- 01
Triage (week 1)
Read-only inside your system: commit history, incident logs, configuration, infrastructure. We map what's burning, what's smouldering, and what looks fine but isn't. You get a written triage report at the end of the week.
- 02
Stabilise (weeks 2–3)
Fix the highest-severity issues first - the ones paging on-call now. Every fix ships with tests so it stays fixed. Usually involves a mix of config changes, hot patches, and targeted rewrites of the worst-behaving components.
- 03
Instrument
Observability retrofit in parallel. Tracing through the request path, structured logs, alerts that actually correlate to user pain. The goal is that the next incident pages on the right signal at the right time.
- 04
Modernise the critical path
The parts that keep breaking get rewritten - event-sourced, testable, deployable on their own. Shadow-run alongside the old code. Cut over behind a feature flag (a sketch of this pattern follows the list). Nothing that is not broken gets rewritten.
- 05
Document & hand over
A written architectural review, runbooks for the top ten incidents, a deprecation roadmap for the worst remaining code, and pairing with your team until they can operate the system without us.
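Step 04 in miniature: a minimal sketch of the shadow-run-then-cutover pattern, in Python for concreteness. The flag names (shadow_rewrite, serve_rewrite) and the handler shapes are illustrative assumptions, not a real API or any particular engagement.

```python
# A sketch of step 04's shadow-run pattern. Flag names and the
# handler signature are illustrative, not from a real system.
import logging

log = logging.getLogger("cutover")

def handle(request: dict, flags: dict, legacy, rewrite):
    """Serve the legacy path; shadow-run the rewrite; cut over by flag."""
    legacy_result = legacy(request)

    if flags.get("shadow_rewrite"):
        try:
            new_result = rewrite(request)
            if new_result != legacy_result:
                # Divergences are logged for review, never user-visible.
                log.warning("shadow divergence on request %s: legacy=%r rewrite=%r",
                            request.get("id"), legacy_result, new_result)
        except Exception:
            log.exception("rewrite raised; legacy result still served")

    if flags.get("serve_rewrite"):
        # Flip this flag only once the divergence log has been boring for a while.
        return rewrite(request)
    return legacy_result
```

The point of the shape: the rewrite earns trust on production traffic before it ever serves a user, and the cutover is a flag flip you can reverse.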
From the inside.
A typical mid-engagement week. Details shift with the project; the rhythm does not.
- Monday
Review the weekend's incident log (if any). Pick the top-severity issue and spend the day on root cause. Root cause written up by end of day whether or not the fix lands.
- Tuesday
Fix and test. Every fix ships with regression tests so the same incident does not return (a sketch of one such test follows this week's rundown). If the fix is small enough, it ships today; if not, it goes behind a feature flag.
- Wednesday
Observability day. Wire up the trace or log signal that would have caught Monday's incident faster (also sketched below). The next incident should page on the right signal, not three hours late.
- Thursday
Pair with your team on whatever part of the system is next. Knowledge transfer happens in the seat next to us, not in a doc handed over at the end.
- Friday
Write the weekly status: what we fixed, what we learned, what we think is next. The system is noticeably calmer by week two, not week eight.
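What Tuesday's "every fix ships with regression tests" looks like in practice - a minimal sketch around an invented idempotency bug (a retried webhook applying the same credit twice). The function and the incident are hypothetical, chosen only to show the shape.

```python
# Hypothetical incident: a retried webhook applied the same credit twice.
# The fix makes apply_credit idempotent; the test keeps it that way.
def apply_credit(account: dict, credit_id: str, amount: int) -> None:
    if credit_id in account["applied_credits"]:
        return  # already applied; retries are a no-op
    account["applied_credits"].add(credit_id)
    account["balance"] += amount

def test_retried_webhook_applies_credit_once():
    account = {"balance": 100, "applied_credits": set()}
    apply_credit(account, "c-1", 25)
    apply_credit(account, "c-1", 25)  # the retry that caused the page
    assert account["balance"] == 125  # applied exactly once
```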
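And Wednesday's retrofit in the same sketch style: one structured log line at the point that would have caught Monday's incident, so an alert can key on a field instead of a grep. The event and field names are illustrative assumptions.

```python
# One JSON object per event: alerting can threshold on gateway_ms or
# on the rate of ok=False instead of string-matching free-text logs.
import json
import logging
import sys
import time

handler = logging.StreamHandler(sys.stdout)
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

def record_payment_attempt(request_id: str, gateway_ms: float, ok: bool) -> None:
    log.info(json.dumps({
        "event": "payment_attempt",
        "request_id": request_id,
        "gateway_ms": round(gateway_ms, 1),
        "ok": ok,
        "ts": time.time(),
    }))

record_payment_attempt("req-42", 1840.0, ok=False)  # the line that pages
```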
A one-week triage we can start within days. You get the written triage report whether or not the longer engagement follows.
A small senior team, embedded with yours.
No pyramid, no rotating juniors, no partnership-level engineers who never open the editor. Everyone on the engagement ships code, writes the runbook, and answers the 3am page.
We size the team to the work, not the other way around. If you need fewer people, you get fewer. If the scope grows, we tell you - and you decide.
$25k – $120k per stabilisation
Rescue engagements usually cost less than greenfield work because the scope is narrower - get the fires out, document what you find. The range holds because the triage week gives us enough information to commit to it.
- 01 Severity and volume of active incidents
- 02 How many people still have context on the system
- 03 State of observability and logs - can we reconstruct what happened?
- 04 Whether the fix is a set of targeted patches or requires rewriting a critical path
- 05 Access and security clearance timelines
Things we’ll say no to.
We’d rather lose the sale on Monday than lose the trust by month three.
- No.01
The system works fine and you want to "modernise" it for its own sake. Rescue is for systems in pain. If yours is not in pain, a platform engagement is likely the right play.
- No.02
You want a full rewrite and you want it now. Full rewrites are where rescue engagements go to die. We rewrite the parts that keep breaking, not the rest. If a full rewrite is truly the only path, we will tell you - but it will not be our first move.
- No.03
You want us to absorb blame for the original team. We work with the people in the room, not against the people who have left. If the engagement is politically charged, we will say so before we start.
- No.04
You are looking for a long-term on-call replacement. We fix the system and hand it over; we do not become the permanent on-call rotation.
Questions & answers.
Will you work on legacy stacks - PHP, Rails 3, old .NET?
Yes. We have rescued systems written in every mainstream language going back two decades. The rescue pattern is stack-agnostic - triage, stabilise, instrument, document - and we hire engineers who don't panic at unfamiliar code.
How fast can you start?
For rescue engagements, usually within the week. We keep capacity for incident-mode work because by the time you are reading this page, you probably don't have three weeks to wait.
Do you do full rewrites?
Rarely, and never as the first move. Rewrites are where rescue engagements go to die. We rewrite the parts that keep breaking, leave the rest, and shadow-run the new alongside the old until cutover is boring. Most systems stabilise with 20% of the code rewritten, not 100%.
What about the original team or vendor?
We work with whoever is there. If the original team is still around, we pair with them - they know things we never will. If they are gone, we learn the hard way from commits, logs, and config. We don't blame people who aren't in the room.
Tell us what you’re shipping.
30 minutes, no pitch deck. We’ll ask what you’re building, what hurts, and whether we’re the right fit. If we’re not, we usually know someone who is.