
Skill · Backup and disaster recovery

Backup and disaster recovery.

Prepare for the worst case before it happens, not during it.

Plan for the worst case: the database is gone, the host is down for a week, a deploy was poisoned, ransomware encrypted everything. The skill is advance preparation, not reaction, and it answers four questions explicitly: what needs to be recoverable, how much data loss is acceptable, how much downtime is acceptable, and what the disaster actually is.

A plan that handles only hardware failure covers the easy case; it is not yet a strategy. The discipline is planning for the specific disasters, and proving the restore works before a real one arrives.

Audience: engineers and ops teams setting up backups, defining RPO and RTO targets, designing backup architecture, or running a disaster-recovery drill.

The framework

Four questions every DR plan answers.

Every disaster-recovery plan answers these four explicitly. The first one drives all the rest.

  1. What needs to be recoverable? Every stateful system, tiered: Tier 1 (the business stops without it), Tier 2 (painful but survivable), Tier 3 (easy to rebuild). The tier drives RPO, RTO, frequency, and spend.
  2. How much data loss is acceptable (RPO)? The maximum age of data you can lose. An RPO of 1 hour needs hourly backups or continuous replication; 1 day allows daily backups; near-zero is for critical financial systems.
  3. How much downtime is acceptable (RTO)? The maximum time to restore service. Under 5 minutes needs a hot standby with automatic failover; under 24 hours allows a cold backup with a documented restore. Aggressive RTOs are expensive.
  4. What is the disaster? Plan for the specific scenarios, because each has different implications, from hardware failure through ransomware and insider threat.
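The first three answers can be written down as data and checked mechanically. The sketch below is one illustrative way to do that, not part of the skill itself: the `System` record, the `rpo_violations` helper, and the example systems are all hypothetical, and it assumes the worst-case data loss equals the interval between backups.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class System:
    """One stateful system in the inventory (hypothetical record shape)."""
    name: str
    tier: int                    # 1 = business stops, 2 = painful, 3 = easy to rebuild
    rpo: timedelta               # maximum acceptable age of lost data
    rto: timedelta               # maximum acceptable time to restore service
    backup_interval: timedelta   # how often a backup is actually taken


def rpo_violations(systems):
    """Return the names of systems whose backup interval is longer than their RPO.

    Assumption: losing everything since the last backup is the worst case,
    so the backup interval is the effective data-loss window.
    """
    return [s.name for s in systems if s.backup_interval > s.rpo]


# Example inventory; the names and numbers are made up for illustration.
inventory = [
    System("orders-db", 1, timedelta(hours=1), timedelta(minutes=5), timedelta(minutes=15)),
    System("team-wiki", 2, timedelta(days=1), timedelta(hours=24), timedelta(days=7)),
    System("search-cache", 3, timedelta(days=7), timedelta(days=7), timedelta(days=7)),
]

print(rpo_violations(inventory))
```

Here the wiki is backed up weekly against a one-day RPO, so it is the one system flagged; either the schedule tightens or the target is revised.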

Plan for the specific ones

Seven scenarios, each with its own defense.

A backup strategy that only survives a dead disk handles the easiest case. These are the disasters a real plan accounts for.

  1. Hardware failure

    A disk dies. Standard backups solve it, and most modern hosts handle it automatically. The easy case.

  2. Provider outage

    A region or vendor goes down. Cross-region or cross-provider redundancy is needed for a low RTO.

  3. Data corruption

    A bad migration, a bug, an accidental delete. Point-in-time restore is needed, because the latest backup may be corrupted too.

  4. Ransomware or compromise

    An attacker encrypts or deletes. Backups must be immutable or air-gapped, or the attacker takes them with everything else.

  5. Account compromise

    An attacker holds admin credentials and deletes everything. The same defense applies: immutable backups under separate access control.

  6. Vendor lock-out

    An account is suspended or a vendor disappears. Backups held outside the vendor are needed.

  7. Insider threat

    A disgruntled employee deletes or exfiltrates. Audit logs, separation of duties, and immutable backups.
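One way to use the seven scenarios is as a coverage checklist: list the defenses actually in place and see which scenarios remain exposed. The sketch below is a hypothetical illustration; the defense labels paraphrase the list above, and the `REQUIRED_DEFENSES` table and `coverage_gaps` function are invented names, not part of the skill.

```python
# Scenario -> defenses the list above calls for (labels paraphrased, hypothetical).
REQUIRED_DEFENSES = {
    "hardware failure": {"standard backups"},
    "provider outage": {"cross-region or cross-provider copy"},
    "data corruption": {"point-in-time restore"},
    "ransomware or compromise": {"immutable or air-gapped copy"},
    "account compromise": {"immutable or air-gapped copy", "separate access control"},
    "vendor lock-out": {"copy outside the vendor"},
    "insider threat": {"audit logs", "separation of duties", "immutable or air-gapped copy"},
}


def coverage_gaps(defenses_in_place):
    """Map each uncovered scenario to the sorted list of defenses it is missing."""
    gaps = {}
    for scenario, required in REQUIRED_DEFENSES.items():
        missing = required - set(defenses_in_place)
        if missing:
            gaps[scenario] = sorted(missing)
    return gaps


# A setup with only nightly backups and point-in-time restore.
in_place = {"standard backups", "point-in-time restore"}
for scenario, missing in coverage_gaps(in_place).items():
    print(f"{scenario}: missing {', '.join(missing)}")
```

A plan with only the first and third defenses covers hardware failure and data corruption and leaves the other five scenarios open, which is the point the list makes in prose.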

What makes a backup real

An untested backup is not a backup.

Untested backups are the single most common failure: they appear to work and the restore fails. The first restore should never happen during a real disaster, so drill on a cadence (a tabletop quarterly, a partial restore to a non-production environment annually, a full drill before major launches), and treat RPO and RTO as measured rather than aspirational. If the actual restore took six hours against a one-hour target, the target is fiction until the gap is fixed or the target is revised.
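Treating targets as measured can be as simple as recording each drill's actual number next to the target. The following is a minimal sketch under that assumption; the `DrillResult` record and its fields are hypothetical, and the figures reuse the six-hour restore against a one-hour target from the paragraph above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillResult:
    """Outcome of one restore drill for one objective (hypothetical record shape)."""
    objective: str          # "RPO" or "RTO"
    target: timedelta       # what the plan promises
    measured: timedelta     # what the drill actually achieved

    @property
    def met(self) -> bool:
        return self.measured <= self.target

    @property
    def gap(self) -> timedelta:
        """How far the drill overshot the target; zero when the target was met."""
        return max(self.measured - self.target, timedelta(0))


result = DrillResult("RTO", target=timedelta(hours=1), measured=timedelta(hours=6))
if not result.met:
    print(f"{result.objective} missed by {result.gap}: fix the gap or revise the target")
```

Keeping the gap as a number, rather than a pass or fail flag, makes it easier to see whether successive drills are closing it.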

The 3-2-1 rule is the starting point: three copies of the data, on two storage types, with one offsite or off-account. Backups in the same account or region as the source fall to the same outage or compromise that takes the source, so separation is the whole point, and at least one copy stays immutable or air-gapped so ransomware cannot reach it.
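The rule is concrete enough to check in code. The sketch below is illustrative only: `DataCopy` and `check_3_2_1` are hypothetical names, the list of copies is assumed to include the production copy, and "offsite or off-account" is modelled simply as a location string that differs from the source's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of the data, including the production copy (hypothetical shape)."""
    location: str          # account, region, or site holding this copy
    storage_type: str      # e.g. "block", "object", "tape"
    immutable: bool = False


def check_3_2_1(copies, source_location):
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.storage_type for c in copies}) < 2:
        problems.append("fewer than two storage types")
    if not any(c.location != source_location for c in copies):
        problems.append("no copy offsite or off-account")
    if not any(c.immutable for c in copies):
        problems.append("no immutable or air-gapped copy")
    return problems


# Everything lives in the production account, and nothing is immutable.
copies = [
    DataCopy("prod-account", "block"),
    DataCopy("prod-account", "object"),
    DataCopy("prod-account", "object"),
]
print(check_3_2_1(copies, source_location="prod-account"))
```

Three copies on two storage types pass the counting part of the rule, yet this setup still fails on separation and immutability, which is the failure mode the paragraph warns about.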

The plan is the runbook plus the architecture plus the drilling, not the sentence 'we have backups'. Write the restore runbook for the tired, panicked version of you on the worst night of your career, and do not leave it in one person's head. Back up the backup system itself, encryption keys included, because a backup encrypted with a key you have lost is useless.

Reference files

The reference that goes alongside the SKILL.md.

  • references/restore-runbook-template.md

    A fillable restore-runbook template covering detection, authorization, steps, verification, and rollback.

Browse all reference files on GitHub

Bridges to other skills

What this prepares for, and what it is not.

DR is advance preparation. These skills cover the event it prepares for, the monitoring that detects it, the routine rollbacks it is not, and the migration it belongs inside.

  • The event

    incident-response

    When the disaster is actually happening, response takes over. DR is the preparation that makes the restore possible; incident response runs the live event.

  • Detection and review

    monitoring-and-alerting

    Detecting the disaster is a monitoring concern, and the routine review of snapshots lives there too. DR sets the backups up; monitoring watches that they keep running.

  • Routine rollbacks

    launch-runbook

    A routine deploy rollback is a launch-runbook concern, a planned reverse. DR is for the catastrophic case where the data itself is gone.

  • DR in the move

    content-migration

    When migrating to a new platform, the DR planning belongs in the migration plan. Run the two together so the new platform launches with its backups already designed.

Open source under MIT

Read the SKILL.md on GitHub.

The skill source lives in the rampstackco/claude-skills repository alongside dozens of other skills covering the full lifecycle of brand and product work. This page is a structured overview; the SKILL.md is the source. MIT licensed.

Frequently asked questions.

What four questions does a DR plan answer?
What needs to be recoverable (every stateful system, tiered by criticality), how much data loss is acceptable (the RPO), how much downtime is acceptable (the RTO), and what the disaster is (the specific scenarios). The tiering from the first question drives the rest: a Tier 1 system whose loss stops the business earns a tighter RPO, a tighter RTO, more frequent backups, and more storage spend than a Tier 3 cache that is easy to rebuild.
What is the difference between RPO and RTO?
RPO (recovery point objective) is the maximum age of data it is acceptable to lose, so an RPO of 1 hour requires hourly backups or continuous replication, while 1 day allows daily backups. RTO (recovery time objective) is the maximum time to restore service, so an RTO under 5 minutes needs a hot standby with automatic failover, under 1 hour needs a warm standby or fast restore, and days-to-weeks is cheap and best-effort. Both drive architecture spend: aggressive targets are expensive, loose ones are not, and an aspirational target without the infrastructure to back it is fiction.
What is the 3-2-1 rule?
Keep three copies of the data, on two different storage types, with one of them offsite (or off-account, off-platform). It is a starting point, and the reason behind it is concrete: backups in the same account or region as the source fall to the same account compromise or region outage that takes the source. Separation is the whole point, which is why at least one copy also needs to sit outside the source service entirely.
Why do backups need to be immutable?
Because ransomware and account compromise delete or encrypt the backups too if they can reach them, so a backup an attacker can overwrite is no defense against the attacker. Use object lock or air-gapped storage for at least some copies, and keep at least one backup outside the source service, because point-in-time recovery within a managed database is gone if the database service itself is compromised. Versioning, replication, and object lock together protect production-critical object stores.
Why test restores?
Because untested backups are the single most common failure: they appear to work, and the restore fails when it matters. The first restore should never be during a real disaster, so drill on a cadence (a tabletop walkthrough quarterly, a partial restore to a non-production environment annually, and a full drill before major launches or after major architecture changes). Document each drill's actual RPO and RTO against the targets, and if the actual numbers are worse, either fix the gap or revise the target, because an unmeasured target is just a hope.