Availability & Recovery

Disaster Recovery Plan

Our formal, documented process for restoring FindTheBreach platform operations after a disruptive incident — aligned with SOC 2 Availability (A1.2, A1.3), NIST SP 800-34, and ISO 27001 Annex A.5.29–A.5.30.

Last updated: February 23, 2026  |  Version 1.0

1. Overview

This Disaster Recovery Plan (DRP) establishes a structured approach for FindTheBreach to restore platform operations, data, and critical services following a disruptive event — including hardware failures, data corruption, cyberattacks, natural disasters, or extended cloud provider outages.

Purpose: To minimize downtime and data loss, ensure timely restoration of services, and meet contractual SLA commitments to our customers.

Scope: This plan covers all critical systems and data, including:

  • FindTheBreach platform infrastructure (API servers, web servers, scanning engine)
  • PostgreSQL database (customer data, scan results, vulnerability findings, account information)
  • Application configuration, secrets, and environment variables
  • Scan report artifacts and historical data
  • Third-party integrations and webhook configurations

Plan Owner: The Chief Technology Officer (CTO) is responsible for maintaining and activating this plan. The DR team includes engineering leadership, DevOps, and customer success representatives.

2. Recovery Objectives

FindTheBreach defines the following recovery objectives to bound acceptable downtime and data loss:

  • RTO (Recovery Time Objective): 4 hours. Maximum acceptable time from incident declaration to full service restoration.
  • RPO (Recovery Point Objective): 1 hour. Maximum acceptable data loss, achieved via PostgreSQL WAL (Write-Ahead Log) archival with continuous streaming.
  • MTTR (Mean Time to Recovery): 2 hours. Expected average time to restore services in a typical disaster scenario.

These targets are reviewed quarterly and adjusted based on infrastructure changes and DR drill results.

3. Backup Procedures

3.1 PostgreSQL Database Backups

  • Frequency: Daily automated full backups via pg_dump, with continuous WAL archival for point-in-time recovery (RPO: 1 hour).
  • Encryption: All backups are encrypted at rest using AES-256 encryption before transfer to offsite storage.
  • Storage: Backups are stored in geographically separate offsite storage with redundancy, isolated from primary infrastructure.
  • Retention: 90-day retention policy — daily backups retained for 90 days, with monthly snapshots retained for 1 year.
  • Integrity: Automated checksum verification after each backup. Weekly restore-to-staging validation tests.
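
For illustration, the dump-encrypt-verify sequence above might look like the following minimal sketch, assuming pg_dump is available with credentials supplied via the environment and the Python cryptography package is installed; the database name, paths, and key handling are hypothetical, not our production values:

```python
# Hypothetical sketch of the daily backup pipeline: dump, checksum, encrypt.
# Assumes pg_dump on PATH (credentials via PGPASSWORD/.pgpass) and the
# 'cryptography' package. All names and paths are illustrative.
import hashlib
import os
import subprocess
from datetime import datetime, timezone

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

DB_NAME = "findthebreach"  # hypothetical database name
# 32-byte (AES-256) key; fetched from a secret store in practice.
KEY = bytes.fromhex(os.environ["BACKUP_AES256_KEY"])


def run_backup(out_dir: str = "/var/backups") -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"{out_dir}/{DB_NAME}-{stamp}.dump"

    # Full logical backup in pg_dump's compressed custom format.
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={dump_path}", DB_NAME],
        check=True,
    )
    plaintext = open(dump_path, "rb").read()

    # Record a SHA-256 checksum for post-transfer integrity verification.
    with open(dump_path + ".sha256", "w") as f:
        f.write(hashlib.sha256(plaintext).hexdigest() + "\n")

    # Encrypt at rest with AES-256-GCM before offsite transfer; the nonce
    # is prepended so the restore side can split it back off.
    nonce = os.urandom(12)
    with open(dump_path + ".enc", "wb") as f:
        f.write(nonce + AESGCM(KEY).encrypt(nonce, plaintext, None))
    os.remove(dump_path)  # keep only the encrypted copy
    return dump_path + ".enc"


if __name__ == "__main__":
    print("encrypted backup written to", run_backup())
```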

3.2 Application Data Backups

  • Configuration & Secrets: Application configuration, environment variables, and secrets are version-controlled and backed up separately from the database with encrypted storage.
  • Scan Reports: Generated scan report artifacts (PDF, HTML, JSON) are backed up daily to offsite storage.
  • Docker Images: Container images are stored in a private registry with tagged releases for rapid redeployment.
  • Infrastructure as Code: All infrastructure definitions (Docker Compose, nginx configs, Dockerfiles) are version-controlled in Git.
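
A hedged sketch of the daily report-artifact sync, assuming an S3-compatible offsite bucket and the boto3 client; the bucket name and artifact directory are placeholders rather than production values:

```python
# Hypothetical sketch of the daily scan-report sync to offsite object storage.
# Assumes an S3-compatible bucket and the 'boto3' package; bucket and local
# paths are illustrative only.
import pathlib

import boto3

BUCKET = "ftb-dr-artifacts"                # hypothetical bucket name
REPORT_DIR = pathlib.Path("/srv/reports")  # hypothetical artifact directory


def sync_reports() -> int:
    s3 = boto3.client("s3")
    uploaded = 0
    for path in REPORT_DIR.rglob("*"):
        if path.suffix.lower() in {".pdf", ".html", ".json"}:
            # Mirror the local directory layout under a common prefix;
            # encryption at rest is assumed to be enforced by bucket policy.
            key = f"scan-reports/{path.relative_to(REPORT_DIR)}"
            s3.upload_file(str(path), BUCKET, key)
            uploaded += 1
    return uploaded


if __name__ == "__main__":
    print(f"uploaded {sync_reports()} report artifacts")
```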

4. Recovery Procedures

The following step-by-step procedure is executed upon declaration of a disaster event:

Step 1: Assess Incident
Determine the nature, scope, and severity of the disruption. Classify the incident and confirm that DR activation is warranted. Document the initial assessment with timestamps.

Step 2: Activate DR Team
Notify the DR team via the escalation chain (see Communication Plan). Assemble the response team and assign roles: Incident Commander, Technical Lead, Communications Lead.

Step 3: Restore from Backup
Provision replacement infrastructure (if needed). Restore PostgreSQL from the most recent verified backup: run pg_restore against the logical dump, or restore a physical base backup and replay archived WAL for point-in-time recovery (a sketch of the WAL replay setup follows this procedure). Redeploy application containers from the registry.

Step 4: Verify Integrity
Run data integrity checks against the restored database. Verify application functionality with automated health checks and manual smoke tests. Confirm that the scan engine, API, and web interface are operational.

Step 5: Resume Operations
Switch DNS and traffic to the restored environment. Re-enable customer access and resume queued scans. Update the status page to reflect service restoration.

Step 6: Post-Incident Review
Conduct a blameless post-mortem within 48 hours. Document the root cause, timeline, actions taken, data loss (if any), and lessons learned. Update this DRP based on findings.
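
To make step 3's point-in-time recovery concrete, here is a minimal sketch assuming PostgreSQL 12+ and a physical base backup already unpacked into the data directory; all paths and the recovery target are hypothetical. (Logical pg_dump backups are instead restored with pg_restore and cannot replay WAL.)

```python
# Hypothetical sketch of the point-in-time recovery setup from step 3.
# Assumes a physical base backup is unpacked into PGDATA and archived WAL
# segments are readable at WAL_ARCHIVE; values below are illustrative.
import pathlib
import subprocess

PGDATA = pathlib.Path("/var/lib/postgresql/data")  # hypothetical data directory
WAL_ARCHIVE = "/mnt/wal-archive"                   # hypothetical WAL archive mount
RECOVERY_TARGET = "2026-02-23 03:15:00+00"         # last known-good moment


def configure_pitr() -> None:
    # PostgreSQL 12+ reads recovery settings from postgresql.auto.conf and
    # enters recovery mode when a recovery.signal file is present.
    with (PGDATA / "postgresql.auto.conf").open("a") as conf:
        conf.write(f"restore_command = 'cp {WAL_ARCHIVE}/%f %p'\n")
        conf.write(f"recovery_target_time = '{RECOVERY_TARGET}'\n")
        conf.write("recovery_target_action = 'promote'\n")
    (PGDATA / "recovery.signal").touch()

    # Start the server; it replays archived WAL up to the target, then promotes.
    subprocess.run(["pg_ctl", "-D", str(PGDATA), "start"], check=True)


if __name__ == "__main__":
    configure_pitr()
```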

5. Failover Architecture

FindTheBreach employs the following failover mechanisms to maximize availability:

5.1 Docker Container Restart Policy

  • All production containers run with the restart: unless-stopped policy, ensuring automatic recovery from process crashes.
  • Health checks are configured for critical services (PostgreSQL, API server) with automatic restart on failure.
  • Container orchestration monitors resource usage and restarts containers that exceed memory limits.
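
As an illustration of the restart-on-failure behavior described above, here is a hypothetical watchdog sketch that polls Docker's built-in health status via the docker CLI; container names are placeholders, and in production this logic is expressed in the Compose healthcheck and restart policy itself:

```python
# Hypothetical watchdog sketch for the container-restart behavior in 5.1.
# Polls Docker's health status and restarts any container reporting
# 'unhealthy'. Container names are illustrative.
import subprocess
import time

CRITICAL = ["ftb-postgres", "ftb-api"]  # hypothetical container names


def health(name: str) -> str:
    # Reads the healthcheck result Docker records for the container.
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", name],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


if __name__ == "__main__":
    while True:
        for name in CRITICAL:
            if health(name) == "unhealthy":
                # Mirrors the automatic restart-on-failure behavior above.
                subprocess.run(["docker", "restart", name], check=True)
        time.sleep(30)
```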

5.2 PostgreSQL Streaming Replication

  • When available, a hot standby replica receives continuous WAL streaming from the primary database.
  • Automatic promotion of the standby to primary is supported via failover scripts, reducing RTO for database failures.
  • Replication lag is monitored with alerts triggered if lag exceeds 60 seconds.
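
The lag check in the last bullet could be implemented along these lines, assuming the psycopg2 package and a monitoring role that can connect to the standby; the DSN and alerting hook are illustrative:

```python
# Hypothetical sketch of the standby replication-lag check in 5.2.
# Assumes 'psycopg2' and a monitoring role on the standby; DSN is illustrative.
import psycopg2

STANDBY_DSN = "host=standby.internal dbname=postgres user=monitor"  # hypothetical
LAG_THRESHOLD_SECONDS = 60


def replay_lag_seconds() -> float:
    with psycopg2.connect(STANDBY_DSN) as conn, conn.cursor() as cur:
        # Seconds since the last transaction replayed on the standby; note
        # this also grows while the primary is idle, so alerting should
        # account for write activity.
        cur.execute(
            "SELECT COALESCE(EXTRACT(EPOCH FROM now() - "
            "pg_last_xact_replay_timestamp()), 0)"
        )
        return float(cur.fetchone()[0])


if __name__ == "__main__":
    lag = replay_lag_seconds()
    if lag > LAG_THRESHOLD_SECONDS:
        print(f"ALERT: replication lag {lag:.0f}s exceeds {LAG_THRESHOLD_SECONDS}s")
```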

5.3 DNS Failover

  • DNS records are managed via Cloudflare with low TTLs (5 minutes) to enable rapid traffic redirection.
  • In the event of a complete infrastructure failure, DNS can be pointed to a maintenance page or secondary environment within minutes.
  • Health-check-based DNS failover is available to automatically route traffic away from unhealthy endpoints.
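
For illustration, a minimal sketch of the manual DNS switch using Cloudflare's v4 DNS records API; the zone and record IDs, token variable, and standby address are placeholders:

```python
# Hypothetical sketch of the DNS failover step in 5.3, using Cloudflare's
# v4 DNS records API via 'requests'. All identifiers are placeholders.
import os

import requests

API = "https://api.cloudflare.com/client/v4"
ZONE_ID = "0123456789abcdef"    # hypothetical zone ID
RECORD_ID = "fedcba9876543210"  # hypothetical record ID for the app hostname
STANDBY_IP = "203.0.113.10"     # documentation-range placeholder address


def point_dns_at_standby() -> None:
    resp = requests.put(
        f"{API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
        headers={"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"},
        json={
            "type": "A",
            "name": "app",
            "content": STANDBY_IP,
            "ttl": 300,  # matches the 5-minute TTL noted above
            "proxied": True,
        },
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    point_dns_at_standby()
```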

6. Communication Plan

Clear, timely communication is critical during a disaster event. The following escalation and notification procedures apply:

6.1 Escalation Chain

  • Stage 1 (Engineering): On-call engineer and DevOps lead. Notified immediately (within 15 minutes).
  • Stage 2 (Management): CTO, CEO, and department heads. Notified within 30 minutes.
  • Stage 3 (Customers): Customer success and account managers. Notified within 24 hours.

6.2 Status Page Updates

  • The system status page is updated within 30 minutes of incident declaration.
  • Updates are posted at minimum every 60 minutes during active recovery.
  • A final "resolved" update is posted upon full service restoration with a summary of the event.

6.3 Customer Notification

  • Affected customers are notified within 24 hours via email, with details on the nature of the disruption, data impact (if any), and expected resolution timeline.
  • Enterprise customers with dedicated account managers receive direct phone or video communication.
  • A post-incident summary is shared with all affected customers within 5 business days.

7. Testing Schedule

Regular testing validates the effectiveness of this plan and identifies gaps before a real disaster occurs.

  • Tabletop DR Drill (quarterly): Walk-through of DR procedures with the DR team, using simulated scenarios to validate communication chains, role assignments, and decision-making.
  • Backup Restore Test (quarterly): Full restore of a PostgreSQL backup to a staging environment to verify data integrity and application functionality and to measure actual recovery time.
  • Full Failover Test (annual): Complete failover to backup infrastructure, validating end-to-end recovery including DNS switchover, data restoration, and service verification.
  • Post-Test Documentation (after each test): Test results, measured RTO/RPO, issues discovered, and corrective actions are documented; reports are retained for audit and compliance review.

8. Last DR Test

Last tested: Q1 2026 (quarterly DR drill: tabletop exercise and backup restore).

Results Summary:

  • Scenario: Simulated complete database corruption with primary server failure.
  • Measured RTO: 2 hours 45 minutes (within 4-hour target).
  • Measured RPO: 38 minutes of data loss from WAL replay (within 1-hour target).
  • Data Integrity: 100% — all restored records matched pre-disaster checksums.
  • Communication: Escalation chain activated within 12 minutes. Status page updated within 20 minutes.
  • Corrective Actions: Improved WAL archival monitoring alerts; updated runbook with additional verification steps for scan queue re-processing.

Full test report available to auditors and enterprise customers upon request.

9. Compliance Mapping

This Disaster Recovery Plan is designed to satisfy the following regulatory and compliance requirements:

  • SOC 2 Availability (A1.2, A1.3): A1.2 (recovery infrastructure) is addressed by the defined backup, failover, and restoration procedures; A1.3 (recovery plan testing) by quarterly DR drills and annual full failover tests with documented results.
  • ISO 27001 (A.5.29, A.5.30): A.5.29 (Information Security During Disruption) is addressed by the recovery objectives, backup procedures, and failover architecture; A.5.30 (ICT Readiness for Business Continuity) by regular testing and plan updates.
  • NIST SP 800-34 (full framework): This plan follows the structure of NIST's Contingency Planning Guide for Federal Information Systems: preventive controls, recovery strategies, and plan testing.
  • HIPAA (§164.308(a)(7)): The Contingency Plan standard's required data backup plan, disaster recovery plan, and emergency mode operation plan are covered here, with testing and revision procedures documented in Sections 7 and 8.

Questions About Our DR Plan?

For questions about our disaster recovery procedures or to request DR test reports, contact us: