
Iceberg Backup, Recovery, and Disaster Recovery: A Complete Guide

15 min read · Cazpian Engineering, Platform Engineering Team


Someone dropped the table. Or worse — they dropped it and ran expire_snapshots and remove_orphan_files. The catalog entry is gone. The metadata cleanup already happened. Your Slack channel is on fire. Can you recover?

The answer depends entirely on what you set up before the disaster. Apache Iceberg does not have a built-in backup command. There is no UNDROP TABLE that magically restores everything. But Iceberg's architecture — with its layered metadata files, immutable snapshots, and absolute file paths — gives you powerful building blocks for backup and recovery if you understand how they work.

This guide covers three scenarios: recovering a dropped table when data files still exist on S3, building a proper backup strategy so you are always prepared, and setting up cross-region disaster recovery for production-critical tables.

Understanding What You Need to Protect

Before building a backup strategy, you need to understand the three layers of an Iceberg table:

[Diagram: Iceberg backup and recovery architecture — the three protection layers, the S3 versioning recovery flow, cross-region replication, and the register_table recovery workflow]

Layer 1: Catalog Entry
└── Pointer to the current metadata.json file
    (stored in Hive Metastore, Glue, Polaris, Nessie, etc.)

Layer 2: Metadata Files
└── metadata.json → manifest lists → manifest files
    (stored in the table's metadata/ directory on S3)

Layer 3: Data Files
└── Parquet, ORC, or Avro files
    (stored in the table's data/ directory on S3)

Each layer can fail independently:

| Failure | Catalog | Metadata | Data | Recovery Difficulty |
|---|---|---|---|---|
| DROP TABLE (non-purge) | Gone | Exists | Exists | Easy — re-register |
| DROP TABLE PURGE | Gone | Deleted | Deleted | Hard — need S3 versioning |
| Catalog corruption | Broken | Exists | Exists | Easy — re-register |
| Accidental expire_snapshots | Exists | Partial | Partial | Medium — older snapshots gone |
| Accidental remove_orphan_files | Exists | Exists | Partial | Varies — depends on what was removed |
| S3 bucket deletion | Exists | Gone | Gone | Need cross-region replica or S3 versioning |
| Region outage | Exists | Unavailable | Unavailable | Need cross-region DR |

The key insight: the catalog entry is just a pointer. As long as you have a valid metadata.json file and the data files it references, you can always reconstruct the table.
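To make the pointer chain concrete, here is a minimal sketch in plain Python (no Iceberg libraries) that walks a parsed metadata.json down to its current manifest list. The field names (`current-snapshot-id`, `snapshots`, `manifest-list`) are real keys from the Iceberg table-metadata spec; the sample values and truncated paths are illustrative only.

```python
# A stripped-down metadata.json, as a catalog entry would point to it.
# Field names follow the Iceberg table-metadata spec; values are made up.
metadata = {
    "table-uuid": "a1b2c3d4-0000-0000-0000-000000000000",
    "location": "s3://bucket/warehouse/db/events",
    "current-snapshot-id": 42,
    "snapshots": [
        {"snapshot-id": 41, "manifest-list": "s3://bucket/.../snap-41.avro"},
        {"snapshot-id": 42, "manifest-list": "s3://bucket/.../snap-42.avro"},
    ],
}

def current_manifest_list(meta):
    """Follow the catalog pointer one level down: metadata.json ->
    current snapshot -> manifest list. Manifests and data files hang
    off the manifest list the same way, all by absolute path."""
    current_id = meta["current-snapshot-id"]
    snapshot = next(s for s in meta["snapshots"] if s["snapshot-id"] == current_id)
    return snapshot["manifest-list"]

print(current_manifest_list(metadata))  # s3://bucket/.../snap-42.avro
```

Everything below the catalog entry is reachable from this one file, which is why recovering the right metadata.json is the whole game.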

Scenario 1: Recovering a Dropped Table

When Data Files Still Exist on S3

The most common recovery scenario: someone ran DROP TABLE but the data files are still on S3. This happens because many catalogs (Hive Metastore, Glue) only remove the catalog entry by default — they do not delete the underlying files unless PURGE is specified.

Step 1: Find the latest metadata.json file.

This is the most critical step, and the approach depends on how your table was created. Iceberg uses two different metadata file naming conventions:

Hadoop-catalog tables (path-based tables created via Iceberg's HadoopCatalog, which tracks versions on the filesystem) use sequential versioned names:

s3://bucket/warehouse/db/events/metadata/
  v1.metadata.json
  v2.metadata.json
  v3.metadata.json   ← latest (highest version number)

For these tables, simply pick the highest-numbered version.
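Note that "highest" means numerically highest: a plain string sort ranks v9.metadata.json above v10.metadata.json. A small sketch of the safe selection (the helper name is ours, not part of any Iceberg tooling):

```python
import re

def latest_versioned_metadata(keys):
    """Pick the highest vN.metadata.json by parsing N as an integer;
    a lexicographic sort would wrongly put v9 ahead of v10."""
    def version(key):
        m = re.search(r"v(\d+)\.metadata\.json$", key)
        if not m:
            raise ValueError(f"not a versioned metadata file: {key}")
        return int(m.group(1))
    return max(keys, key=version)

keys = [f"metadata/v{n}.metadata.json" for n in (1, 2, 9, 10, 3)]
print(latest_versioned_metadata(keys))  # metadata/v10.metadata.json
```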

Catalog-managed tables (created via Hive Metastore, Glue, Polaris, Nessie, or other REST catalogs) use UUID-based names:

s3://bucket/warehouse/db/events/metadata/
  00000-a1b2c3d4-e5f6-7890-abcd-ef1234567890.metadata.json
  00001-f9e8d7c6-b5a4-3210-fedc-ba9876543210.metadata.json
  00002-1a2b3c4d-5e6f-7890-abcd-1234567890ab.metadata.json

With UUID-based names, you cannot identify the latest version by name alone. Here are three approaches to find the correct file:

Approach A: Sort by last-modified timestamp (most reliable).

# List all metadata.json files sorted by last-modified time
aws s3api list-objects-v2 \
  --bucket your-bucket \
  --prefix warehouse/db/events/metadata/ \
  --query "Contents[?contains(Key, 'metadata.json')].[Key,LastModified,Size]" \
  --output text | sort -k2

The file with the most recent LastModified timestamp is the latest metadata version.
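In script form, the same selection over the parsed ListObjectsV2 response is just a max over LastModified. This is a hypothetical helper for illustration, operating on the `Key`/`LastModified` fields that the S3 API actually returns:

```python
def latest_by_timestamp(objects):
    """Given entries shaped like the S3 ListObjectsV2 'Contents' list
    (Key, LastModified), return the most recently written
    metadata.json key. ISO-8601 timestamps in a uniform format
    compare correctly as strings."""
    candidates = [o for o in objects if o["Key"].endswith("metadata.json")]
    return max(candidates, key=lambda o: o["LastModified"])["Key"]

objects = [
    {"Key": "metadata/00000-aaa.metadata.json", "LastModified": "2026-01-01T00:00:00Z"},
    {"Key": "metadata/00002-ccc.metadata.json", "LastModified": "2026-03-01T00:00:00Z"},
    {"Key": "metadata/00001-bbb.metadata.json", "LastModified": "2026-02-01T00:00:00Z"},
    {"Key": "metadata/snap-42.avro", "LastModified": "2026-03-02T00:00:00Z"},
]
print(latest_by_timestamp(objects))  # metadata/00002-ccc.metadata.json
```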

Approach B: Check the metadata-log inside any metadata.json.

Every metadata.json file contains a metadata-log array that lists all previous metadata files in order. Open any metadata file and look at the last entry:

# Download any metadata file and inspect it
aws s3 cp s3://your-bucket/warehouse/db/events/metadata/00002-1a2b3c4d.metadata.json - \
| python3 -c "
import json, sys
meta = json.load(sys.stdin)
if 'metadata-log' in meta:
    entries = meta['metadata-log']
    print(f'This file is version {len(entries) + 1}')
    print(f'Previous versions: {len(entries)}')
else:
    print('This is the first metadata version')
print(f'Last updated: {meta.get(\"last-updated-ms\", \"unknown\")}')
"

If the file you opened is not the latest, the metadata-log will point you to the chain — but the latest file is always the one that is not referenced by any other metadata file's metadata-log.
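That "not referenced by any other metadata-log" rule can be checked mechanically. A sketch, assuming you have already downloaded and parsed every metadata.json in the directory (the function name and dict shape are ours; the `metadata-file` key inside `metadata-log` entries is from the Iceberg spec):

```python
def latest_unreferenced(metadata_files):
    """Given {s3_key: parsed metadata.json dict}, return the key that
    no other file's metadata-log points to, i.e. the head of the
    chain and the newest version."""
    referenced = set()
    for meta in metadata_files.values():
        for entry in meta.get("metadata-log", []):
            referenced.add(entry["metadata-file"])
    unreferenced = [k for k in metadata_files if k not in referenced]
    assert len(unreferenced) == 1, "expected exactly one head of the chain"
    return unreferenced[0]

files = {
    "s3://b/m/00000-a.metadata.json": {},
    "s3://b/m/00001-b.metadata.json": {"metadata-log": [
        {"timestamp-ms": 1, "metadata-file": "s3://b/m/00000-a.metadata.json"}]},
    "s3://b/m/00002-c.metadata.json": {"metadata-log": [
        {"timestamp-ms": 1, "metadata-file": "s3://b/m/00000-a.metadata.json"},
        {"timestamp-ms": 2, "metadata-file": "s3://b/m/00001-b.metadata.json"}]},
}
print(latest_unreferenced(files))  # s3://b/m/00002-c.metadata.json
```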

Approach C: Query the catalog before it is fully gone.

If the catalog entry still exists (for example, in Glue or HMS) but is corrupt or the table was soft-deleted, you can extract the metadata pointer directly:

# AWS Glue: get the metadata location from the Glue catalog
aws glue get-table \
  --database-name db \
  --name events \
  --query 'Table.Parameters.metadata_location' \
  --output text

-- Hive Metastore: query the HMS database directly
SELECT param_value
FROM TABLE_PARAMS
WHERE param_key = 'metadata_location'
  AND tbl_id = (SELECT tbl_id FROM TBLS WHERE tbl_name = 'events');

Our recommendation: Always use Approach A (sort by timestamp) as the primary method. It works regardless of naming convention and does not require parsing JSON files. Keep Approach C in your runbook as a shortcut for cases where the catalog entry is partially intact.

Step 2: Re-register the table using register_table.

CALL spark_catalog.system.register_table(
  table => 'db.events',
  metadata_file => 's3://your-bucket/warehouse/db/events/metadata/00002-1a2b3c4d-5e6f-7890-abcd-1234567890ab.metadata.json'
);

That is it. The table is back in your catalog, pointing to the latest metadata, which references all the data files. All snapshots, schema history, and partition specs are restored.

Step 3: Verify the recovery.

-- Check the table exists and has data
SELECT COUNT(*) FROM db.events;

-- Check snapshot history is intact
SELECT * FROM db.events.snapshots ORDER BY committed_at DESC LIMIT 10;

-- Check file count
SELECT COUNT(*) FROM db.events.files;

When DROP TABLE PURGE Was Used

If DROP TABLE PURGE was used, both the catalog entry and the files on S3 were deleted. Recovery depends on whether you have S3 versioning enabled.

With S3 versioning enabled:

With versioning enabled, a delete does not actually remove the object; S3 adds a "delete marker" on top of it. The original files are still there as previous versions.

# List all versions of metadata files (including deleted ones)
aws s3api list-object-versions \
  --bucket your-bucket \
  --prefix warehouse/db/events/metadata/ \
  --query "Versions[?contains(Key, 'metadata.json')].[Key,VersionId,LastModified]" \
  --output table

# Restore the latest metadata.json by removing the delete marker
aws s3api delete-object \
  --bucket your-bucket \
  --key warehouse/db/events/metadata/v3.metadata.json \
  --version-id "DELETE_MARKER_VERSION_ID"

You also need to restore the manifest files and data files. The metadata.json references specific manifest lists, which reference manifest files, which reference data files — all with absolute S3 paths.

# Bulk restore all files in the table directory
# First, list all delete markers
aws s3api list-object-versions \
  --bucket your-bucket \
  --prefix warehouse/db/events/ \
  --query 'DeleteMarkers[].{Key:Key,VersionId:VersionId}' \
  --output json > delete-markers.json

# Then remove each delete marker to restore the files
# (Use a script to iterate through delete-markers.json)
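That iteration can be sketched in plain Python with no AWS SDK: read delete-markers.json and emit one delete-object command per marker, so you can review the commands before running them. The helper name is illustrative, not an existing tool:

```python
import json

def restore_commands(markers_json, bucket):
    """Turn the list-object-versions DeleteMarkers output into
    'aws s3api delete-object' commands; removing a delete marker
    restores the object version underneath it."""
    markers = json.loads(markers_json)
    return [
        f"aws s3api delete-object --bucket {bucket} "
        f"--key {m['Key']} --version-id {m['VersionId']}"
        for m in markers
    ]

sample = json.dumps([
    {"Key": "warehouse/db/events/metadata/v3.metadata.json",
     "VersionId": "abc123"},
])
for cmd in restore_commands(sample, "your-bucket"):
    print(cmd)
```

Piping the generated commands to a shell (or swapping in boto3 calls) is then a one-line change.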

Once the files are restored, use register_table to re-create the catalog entry.

Without S3 versioning: The data is gone. This is why S3 versioning is a non-negotiable prerequisite for any production Iceberg deployment. Enable it now.

When expire_snapshots Went Too Far

If someone ran expire_snapshots with an aggressive retention (for example, retaining only 1 hour), older snapshots and their referenced data files may have been deleted. The current snapshot is still intact, but you cannot time-travel to previous states.

This is not recoverable through Iceberg procedures — the data files for expired snapshots are physically deleted. Your options:

  1. If S3 versioning is enabled: Restore the deleted data files and an older metadata.json that still references them, then use register_table to point to that metadata version.
  2. If you have a backup: Restore from your most recent backup (covered in the next section).
  3. If neither: The historical data is gone. Only the current table state survives.

Lesson: Always set expire_snapshots retention to at least 7 days in production. For regulated tables, use tags to protect critical snapshots from expiration:

-- Protect a snapshot from expiration
ALTER TABLE db.events
CREATE TAG `quarterly_backup_2026_Q1`
AS OF VERSION 42
RETAIN 365 DAYS;

Scenario 2: Building a Proper Backup Strategy

What to Back Up

A complete Iceberg table backup must include:

| Component | Location | Why |
|---|---|---|
| metadata.json (all versions) | s3://bucket/warehouse/db/table/metadata/ | Contains schema, partition spec, snapshot list, and pointers to manifest lists. |
| Manifest list files (.avro) | s3://bucket/warehouse/db/table/metadata/ | Each snapshot's file listing. |
| Manifest files (.avro) | s3://bucket/warehouse/db/table/metadata/ | Detailed file-level metadata (paths, statistics, partitions). |
| Data files (.parquet) | s3://bucket/warehouse/db/table/data/ | The actual data. |
| Delete files (if MOR) | s3://bucket/warehouse/db/table/data/ | Position or equality delete files for merge-on-read tables. |

Important: You cannot just back up metadata.json alone. The metadata references manifest files, which reference data files — all by absolute path. You need the entire tree.

Strategy 1: S3 Versioning + Lifecycle Rules

The simplest and most effective backup strategy for Iceberg on S3:

{
  "Rules": [
    {
      "ID": "keep-iceberg-versions-30-days",
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 30
      },
      "Filter": {
        "Prefix": "warehouse/"
      }
    }
  ]
}

This keeps every deleted or overwritten file for 30 days. Combined with register_table, you can recover from any deletion or corruption within that window.

Advantages: Zero operational overhead. No backup jobs to schedule. Every file change is automatically versioned.

Limitations: Same-region only. If the S3 bucket or the entire region goes down, versions are inaccessible.

Strategy 2: S3 Cross-Region Replication (CRR)

For disaster recovery across AWS regions:

{
  "Role": "arn:aws:iam::role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-iceberg-warehouse",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {
        "Prefix": "warehouse/"
      },
      "Destination": {
        "Bucket": "arn:aws:s3:::your-bucket-dr-region",
        "StorageClass": "STANDARD_IA"
      },
      "DeleteMarkerReplication": {
        "Status": "Disabled"
      }
    }
  ]
}

Critical setting: Set DeleteMarkerReplication to Disabled. This ensures that if someone deletes files in the primary region, those deletions are not replicated to the DR region — preserving your backup.

Strategy 3: Scheduled Table Snapshots with rewrite_table_path

For cross-bucket or cross-account backups where you want explicit control:

Step 1: Rewrite metadata paths to the backup location.

CALL spark_catalog.system.rewrite_table_path(
  table => 'db.events',
  source_prefix => 's3://production-bucket/warehouse/db/events',
  target_prefix => 's3://backup-bucket/warehouse/db/events',
  staging_location => 's3://backup-bucket/staging/db/events'
);

This produces:

  • Rewritten metadata files in the staging location (all absolute paths updated to point to backup-bucket)
  • A file list of all data files that need to be copied

Step 2: Copy data files to the backup bucket.

# Copy data files listed in the rewrite output
aws s3 sync \
  s3://production-bucket/warehouse/db/events/data/ \
  s3://backup-bucket/warehouse/db/events/data/ \
  --storage-class STANDARD_IA

Step 3: Move staged metadata to final location.

aws s3 sync \
  s3://backup-bucket/staging/db/events/ \
  s3://backup-bucket/warehouse/db/events/metadata/

Step 4: (If needed) Register the backup as a table in a DR catalog.

CALL dr_catalog.system.register_table(
  table => 'db.events',
  metadata_file => 's3://backup-bucket/warehouse/db/events/metadata/v3.metadata.json'
);

Incremental Backups with rewrite_table_path

For large tables, you do not want to copy terabytes of data on every backup. The rewrite_table_path procedure supports incremental mode:

CALL spark_catalog.system.rewrite_table_path(
  table => 'db.events',
  source_prefix => 's3://production-bucket/warehouse/db/events',
  target_prefix => 's3://backup-bucket/warehouse/db/events',
  start_version => 'v10.metadata.json',
  end_version => 'v20.metadata.json',
  staging_location => 's3://backup-bucket/staging/db/events'
);

This only rewrites metadata files and lists data files added between versions 10 and 20. You only need to copy the new data files — not the entire table.
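One way to drive this incrementally is to persist the last backed-up version between runs and derive start_version/end_version from it. A sketch of that bookkeeping (a hypothetical helper, not part of Iceberg):

```python
def next_backup_range(all_versions, last_backed_up):
    """Given metadata versions in commit order and the version
    already covered by the previous backup, return the
    (start_version, end_version) pair to pass to
    rewrite_table_path, or None if there is nothing new."""
    idx = all_versions.index(last_backed_up)  # raises if unknown
    if idx == len(all_versions) - 1:
        return None  # backup is already up to date
    return all_versions[idx], all_versions[-1]

versions = [f"v{n}.metadata.json" for n in range(1, 21)]
print(next_backup_range(versions, "v10.metadata.json"))
# ('v10.metadata.json', 'v20.metadata.json')
```

After a successful copy, record the returned end_version as the new checkpoint for the next run.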

Automating Backups

Here is a PySpark script that automates the backup workflow:

from pyspark.sql import SparkSession
import subprocess
from datetime import datetime

spark = SparkSession.builder \
    .appName("iceberg-backup") \
    .getOrCreate()

tables_to_backup = [
    "db.events",
    "db.user_profiles",
    "db.transactions",
]

backup_bucket = "s3://backup-bucket/warehouse"
source_bucket = "s3://production-bucket/warehouse"

for table in tables_to_backup:
    table_path = table.replace(".", "/")

    try:
        # Step 1: Rewrite metadata paths for the backup location
        spark.sql(f"""
            CALL spark_catalog.system.rewrite_table_path(
                table => '{table}',
                source_prefix => '{source_bucket}/{table_path}',
                target_prefix => '{backup_bucket}/{table_path}',
                staging_location => '{backup_bucket}/staging/{table_path}'
            )
        """)

        # Step 2: Sync data files
        subprocess.run([
            "aws", "s3", "sync",
            f"{source_bucket}/{table_path}/data/",
            f"{backup_bucket}/{table_path}/data/",
            "--storage-class", "STANDARD_IA",
            "--only-show-errors",
        ], check=True)

        # Step 3: Move staged metadata into place
        subprocess.run([
            "aws", "s3", "sync",
            f"{backup_bucket}/staging/{table_path}/",
            f"{backup_bucket}/{table_path}/metadata/",
            "--only-show-errors",
        ], check=True)

        print(f"Backup complete: {table} at {datetime.now()}")

    except Exception as e:
        print(f"Backup FAILED for {table}: {e}")
        # Alert your monitoring system here

Schedule this script daily (or more frequently for critical tables) using Airflow, Step Functions, or a cron job.

Scenario 3: Cross-Region Disaster Recovery

For production-critical tables where a full AWS region outage is an unacceptable risk.

Architecture

Primary Region (us-east-1)               DR Region (us-west-2)
┌─────────────────────────┐              ┌─────────────────────────┐
│ Polaris/Glue Catalog    │              │ Polaris/Glue Catalog    │
│ (primary catalog)       │              │ (DR catalog)            │
│                         │              │                         │
│ s3://prod-bucket/       │ ── CRR ──►   │ s3://dr-bucket/         │
│   warehouse/            │              │   warehouse/            │
│     db/events/          │              │     db/events/          │
│       metadata/         │              │       metadata/         │
│       data/             │              │       data/             │
└─────────────────────────┘              └─────────────────────────┘

Setting Up Cross-Region DR

Step 1: Enable S3 Cross-Region Replication.

Replicate the entire warehouse prefix from the primary bucket to a DR bucket in another region. Disable delete marker replication so that deletions in the primary region do not propagate.

Step 2: Set up a DR catalog.

Create a second Iceberg catalog (Glue, Polaris, or REST) in the DR region. This catalog will remain dormant until a failover is needed.

Step 3: Create a registration script for failover.

# failover.py — run only during a disaster
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("iceberg-dr-failover") \
    .config("spark.sql.catalog.dr_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.dr_catalog.type", "rest") \
    .config("spark.sql.catalog.dr_catalog.uri", "https://polaris.us-west-2.your-cloud.com/api/v1") \
    .config("spark.sql.catalog.dr_catalog.warehouse", "s3://dr-bucket/warehouse") \
    .getOrCreate()

tables = [
    ("db.events", "s3://dr-bucket/warehouse/db/events/metadata/v42.metadata.json"),
    ("db.user_profiles", "s3://dr-bucket/warehouse/db/user_profiles/metadata/v15.metadata.json"),
    ("db.transactions", "s3://dr-bucket/warehouse/db/transactions/metadata/v28.metadata.json"),
]

for table_name, metadata_path in tables:
    try:
        spark.sql(f"""
            CALL dr_catalog.system.register_table(
                table => '{table_name}',
                metadata_file => '{metadata_path}'
            )
        """)
        print(f"Registered {table_name} in DR catalog")
    except Exception as e:
        print(f"Failed to register {table_name}: {e}")

Step 4: Handle the absolute path problem.

Iceberg stores absolute S3 paths in its metadata. If your primary bucket is s3://prod-bucket and your DR bucket is s3://dr-bucket, the metadata in the DR bucket still references s3://prod-bucket. During a failover, this is a problem.

Two solutions:

Option A: Use rewrite_table_path before failover (proactive).

Run rewrite_table_path as part of your regular backup cycle to pre-stage metadata with DR paths. This makes failover instant — just run register_table.
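Conceptually, the rewrite is a prefix swap applied to every absolute path in the metadata tree. A toy illustration of that operation, not the actual procedure:

```python
def swap_prefix(path, source, target):
    """Rewrite one absolute reference from the primary bucket to the
    DR bucket, as rewrite_table_path does for every path it finds in
    metadata files, manifest lists, and manifests."""
    if not path.startswith(source):
        raise ValueError(f"{path} is outside {source}")
    return target + path[len(source):]

print(swap_prefix(
    "s3://prod-bucket/warehouse/db/events/data/part-0001.parquet",
    "s3://prod-bucket/warehouse",
    "s3://dr-bucket/warehouse",
))  # s3://dr-bucket/warehouse/db/events/data/part-0001.parquet
```

The hard part the procedure handles for you is that these paths live inside Avro manifest files and JSON metadata, so every affected file must be rewritten and re-staged, not edited in place.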

Option B: Use S3 Access Points or bucket aliasing (reactive).

Configure the DR bucket to respond to requests for the primary bucket's paths using S3 Access Points or DNS aliasing. This avoids rewriting metadata entirely but requires network infrastructure changes.

Recovery Time Objective (RTO)

| Component | RTO | Notes |
|---|---|---|
| S3 CRR replication lag | 5-15 minutes | For most object sizes. Large objects may take longer. |
| register_table per table | Seconds | Just creates a catalog entry. |
| rewrite_table_path per table | Minutes | Depends on metadata size (number of snapshots/manifests). |
| Query routing switchover | Minutes | Update DNS, connection strings, or catalog configuration. |
| Total failover time | 15-30 minutes | With pre-staged DR metadata. |

Production Backup Checklist

Day 1 (Non-Negotiable)

  • Enable S3 versioning on all buckets storing Iceberg tables.
  • Set lifecycle rules to expire old versions after 30 days (balances recovery window and storage cost).
  • Set expire_snapshots retention to at least 7 days on all tables.
  • Use format-version 2 on all tables (required for position deletes and MOR).
  • Set up S3 Cross-Region Replication for production-critical tables.
  • Tag critical snapshots (quarterly closes, ML training datasets, regulatory checkpoints).
  • Document the metadata path for every production table — you need to know where to find metadata.json during a crisis.
  • Test register_table on a non-production table to validate the workflow before you need it in an emergency.

Monthly (Operational)

  • Run a DR drill. Drop a test table, recover it using register_table, and verify data integrity.
  • Verify CRR replication lag. Write a test file to the primary bucket and confirm it appears in the DR bucket within your SLA.
  • Audit expire_snapshots schedules. Ensure no table has a retention shorter than your backup frequency.

Common Pitfalls

  1. Not enabling S3 versioning. This is the single most important protection for Iceberg on S3. Without it, DROP TABLE PURGE, accidental remove_orphan_files, or S3 deletions are permanent. Enable it on every bucket.

  2. Forgetting that Iceberg uses absolute paths. Every data file, manifest, and metadata file is referenced by its full S3 path. If you copy files to a different bucket, the metadata still points to the original location. You must use rewrite_table_path to update the references.

  3. Running remove_orphan_files too aggressively. This procedure deletes data files that are not referenced by any current snapshot. If a concurrent write is in progress, it may delete files that are about to be committed. Always set older_than to at least 3 days and never run it during active writes.

  4. Backing up only metadata. A metadata.json file is useless without the manifest files and data files it references. Back up the entire table directory tree.

  5. Not testing recovery. The worst time to discover that your register_table workflow has a bug is during an actual outage. Run quarterly DR drills.

  6. Replicating delete markers in CRR. If you replicate delete markers to your DR bucket, an accidental deletion in the primary region destroys your backup too. Always disable delete marker replication.

  7. Ignoring catalog backup. Even with S3 backups, you need to know which metadata.json version to register. Keep a metadata inventory — a simple table or file listing each production table and its current metadata path.


This post is part of our Apache Iceberg deep-dive series. For table design guidance, see Iceberg Table Design: Properties, Partitioning, and Commit Best Practices. For time travel and snapshot management, see Time Travel in Apache Iceberg: Beyond the Basics. Check out the full series on our blog.