Fix data loss when inserting duplicate values during a migration#1633
ggilder wants to merge 3 commits into github:master
Conversation
This patch modifies gh-ost to use panic() instead of log.Fatale() in listenOnPanicAbort. When using gh-ost as a library, this allows the calling application to recover from panics instead of having the entire process terminate via os.Exit().

When PanicOnWarnings is enabled and SQL warnings occur, the error flows through these channels:
1. terminateRowIteration sends the error to rowCopyComplete
2. consumeRowCopyComplete forwards the error to PanicAbort
3. listenOnPanicAbort calls Log.Fatale(err)
4. Fatale calls os.Exit(1), which kills the entire process

By changing Fatale to panic, calling applications can catch it with defer/recover().
…igration indexes
gh-ost could silently lose data when adding a unique index to a column during
a migration, even with the PanicOnWarnings flag enabled. This occurred when:
1. A migration adds a unique index to a column (e.g., email)
2. Rows with duplicate values are inserted into the original table after the
bulk copy phase completes (during postponed cutover)
3. These duplicate rows are applied via binlog replay to the ghost table
**Expected behavior:** Migration fails with a clear error
**Actual behavior:** Rows in the original table with duplicate values are silently deleted; data is lost
Original table: id PRIMARY KEY, email (no unique constraint)
Ghost table: id PRIMARY KEY, email UNIQUE (being added)
Initial state (after bulk copy):
- Ghost table: (id=1, email='bob@example.com')
(id=2, email='alice@example.com')
During postponed cutover:
- INSERT (id=3, email='bob@example.com') into original table
Binlog replay attempts:
- INSERT (id=3, email='bob@example.com') into ghost table
- Duplicate email='bob@example.com' (conflicts with id=1)
- Row with id=1 silently deleted → DATA LOSS
The DMLInsertQueryBuilder used `REPLACE` statements for binlog event replay:
```sql
REPLACE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
```
REPLACE behavior:
- If duplicate PRIMARY KEY: deletes the old row, inserts the new row (no warning/error)
- If duplicate on any OTHER unique index: deletes the conflicting row, inserts the new row
- NO warnings or errors are generated, so PanicOnWarnings cannot detect the issue
The original design assumed REPLACE was needed to handle timing edge cases
where binlog replay might process a row before bulk copy, but this caused
silent data corruption when other unique indexes had duplicates.
Changed DMLInsertQueryBuilder to use INSERT IGNORE instead of REPLACE:
```sql
INSERT IGNORE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
```
INSERT IGNORE behavior:
- If duplicate on ANY unique index: skip insert, generate WARNING
- Does not delete existing rows
Added warning detection to ApplyDMLEventQueries() when PanicOnWarnings is
enabled:
- Checks SHOW WARNINGS after batch execution
- Ignores duplicates on migration unique key (expected - row already copied)
- FAILS migration for duplicates on other unique indexes
- Transaction rollback ensures no partial state
Edge Case: DELETE+INSERT Conversion
When an UPDATE modifies the migration unique key (e.g., PRIMARY KEY), gh-ost
converts it to DELETE+INSERT within a single transaction:
```sql
BEGIN;
DELETE FROM ghost WHERE id=2;
INSERT IGNORE INTO ghost VALUES (3, 'bob@example.com');
COMMIT;
```
If the INSERT encounters a duplicate on a non-migration unique index:
- With PanicOnWarnings: Warning detected, transaction rolled back, both
DELETE and INSERT undone → no data loss ✓
- Without PanicOnWarnings: DELETE succeeds, INSERT silently skips →
data loss. This further reinforces that PanicOnWarnings should default
to on.
Pull request overview
This PR addresses a critical correctness issue in gh-ost binlog replay when adding unique indexes by switching DML replay semantics to avoid REPLACE-driven row deletion, and adds warning-based failure behavior for PanicOnWarnings during replay (plus a change intended to make panic-aborts recoverable for library usage).
Changes:
- Switch DML insert replay from `REPLACE` to `INSERT IGNORE` (SQL builder + tests).
- Add `SHOW WARNINGS` detection in `ApplyDMLEventQueries()` to fail migrations on unexpected warnings when `PanicOnWarnings` is enabled.
- Change panic-abort handling from `os.Exit`-style fatal logging to `panic(err)` and add regression tests for duplicate/warning scenarios.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| go/sql/builder.go | Updates generated DML insert statement from REPLACE to INSERT IGNORE. |
| go/sql/builder_test.go | Updates expectations for the new insert statement form. |
| go/logic/applier.go | Adds warning detection/rollback path for replay when PanicOnWarnings is enabled. |
| go/logic/applier_test.go | Adds integration test ensuring duplicate on non-migration unique index triggers failure without data loss. |
| go/logic/applier_delete_insert_test.go | Adds integration tests around UPDATE→DELETE+INSERT behavior with INSERT IGNORE + warnings. |
| go/logic/migrator.go | Changes panic-abort handler from fatal exit to panic(err). |
```diff
 func (this *Migrator) listenOnPanicAbort() {
 	err := <-this.migrationContext.PanicAbort
-	this.migrationContext.Log.Fatale(err)
+	panic(err)
 }
```
listenOnPanicAbort is started in its own goroutine (go this.listenOnPanicAbort()), so panic(err) here will crash the process and cannot be recovered by a library caller using defer/recover() around Migrate() (recover only works within the same goroutine). If the intent is to make PanicOnWarnings usable as a library, consider propagating the abort error back to the goroutine running Migrate() (e.g., return an error / cancel a context / signal via a channel that Migrate() selects on) rather than panicking in a background goroutine.
```go
// TestUpdateModifyingUniqueKeyWithDuplicateOnOtherIndex tests the scenario where:
// 1. An UPDATE modifies the unique key (converted to DELETE+INSERT)
// 2. The INSERT would create a duplicate on a NON-migration unique index
// 3. With INSERT IGNORE, the DELETE succeeds but INSERT skips = DATA LOSS
```
The header comment describes this scenario as resulting in data loss with INSERT IGNORE, but this test expects the transaction to roll back on warnings (preventing data loss). Consider updating the comment to clarify this is the failure mode being guarded against (i.e., what would happen without warning detection/rollback).
```diff
-// 3. With INSERT IGNORE, the DELETE succeeds but INSERT skips = DATA LOSS
+// 3. With INSERT IGNORE and no warning detection/rollback, the DELETE succeeds but INSERT skips = DATA LOSS (failure mode this test guards against)
```
```go
// Check for warnings when PanicOnWarnings is enabled
if this.migrationContext.PanicOnWarnings {
	rows, err := tx.Query("SHOW WARNINGS")
	if err != nil {
		return rollback(err)
	}
	defer rows.Close()
	if err = rows.Err(); err != nil {
		return rollback(err)
	}

	var sqlWarnings []string
	for rows.Next() {
		var level, message string
		var code int
		if err := rows.Scan(&level, &code, &message); err != nil {
			this.migrationContext.Log.Warningf("Failed to read SHOW WARNINGS row")
			continue
		}
		// Duplicate warnings are formatted differently across mysql versions, hence the optional table name prefix
		migrationUniqueKeyExpression := fmt.Sprintf("for key '(%s\\.)?%s'", this.migrationContext.GetGhostTableName(), this.migrationContext.UniqueKey.NameInGhostTable)
		matched, _ := regexp.MatchString(migrationUniqueKeyExpression, message)
		if strings.Contains(message, "Duplicate entry") && matched {
			// Duplicate entry on migration unique key is expected during binlog replay
			// (row was already copied during bulk copy phase)
			continue
		}
		sqlWarnings = append(sqlWarnings, fmt.Sprintf("%s: %s (%d)", level, message, code))
	}
	if len(sqlWarnings) > 0 {
		warningMsg := fmt.Sprintf("Warnings detected during DML event application: %v", sqlWarnings)
		return rollback(fmt.Errorf(warningMsg))
	}
```
This warning check runs only once after executing a multi-statement batch (the DML queries are concatenated with ;). SHOW WARNINGS only reports warnings for the most recently executed statement, so warnings from earlier statements in the batch can be missed, causing PanicOnWarnings to silently allow ignored inserts. Consider disabling multi-statement batching when PanicOnWarnings is enabled, or executing statements one-by-one and checking warnings (or @@warning_count) after each statement.
Suggested change (replacing the warning-check block above):
```go
// When PanicOnWarnings is enabled, we must not rely on a single SHOW WARNINGS
// after executing a multi-statement batch, because MySQL only exposes warnings
// for the most recent statement. Earlier statements in the batch might have
// generated warnings that are no longer visible here.
if this.migrationContext.PanicOnWarnings {
	return rollback(fmt.Errorf("PanicOnWarnings is enabled but DML events are applied using a multi-statement batch; cannot reliably detect all statement warnings"))
}
```
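The statement-by-statement alternative raised in the review could be sketched roughly as below. `execStmt` and `applyOneByOne` are hypothetical names, not gh-ost's API; a real implementation would execute each DML statement through `database/sql` and gather its warnings with `SHOW WARNINGS` (or check `@@warning_count`) before moving on.

```go
package main

import "fmt"

// execStmt is a hypothetical stand-in for executing one SQL statement inside
// the transaction and returning the warnings that statement generated.
type execStmt func(query string) (warnings []string, err error)

// applyOneByOne applies each DML statement individually and inspects warnings
// after every statement, instead of batching with ";" and checking only once
// (which would surface warnings for the final statement only).
func applyOneByOne(queries []string, exec execStmt) error {
	for _, q := range queries {
		warnings, err := exec(q)
		if err != nil {
			return err
		}
		if len(warnings) > 0 {
			return fmt.Errorf("warnings on %q: %v", q, warnings)
		}
	}
	return nil
}

func main() {
	// Fake executor: the second statement generates a warning.
	fake := func(q string) ([]string, error) {
		if q == "q2" {
			return []string{"Duplicate entry 'x' for key 'idx_email'"}, nil
		}
		return nil, nil
	}
	err := applyOneByOne([]string{"q1", "q2", "q3"}, fake)
	fmt.Println(err != nil) // the warning on the middle statement is caught
}
```

The trade-off is one extra round-trip per statement, which is why the simpler option of refusing multi-statement batching under PanicOnWarnings may also be acceptable.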
```go
var sqlWarnings []string
for rows.Next() {
	var level, message string
	var code int
	if err := rows.Scan(&level, &code, &message); err != nil {
```
rows.Err() is checked before iterating, but errors that occur during iteration will only surface after the loop. Add a rows.Err() check after the for rows.Next() loop so iteration errors don't get ignored in PanicOnWarnings mode.
```go
// Duplicate warnings are formatted differently across mysql versions, hence the optional table name prefix
migrationUniqueKeyExpression := fmt.Sprintf("for key '(%s\\.)?%s'", this.migrationContext.GetGhostTableName(), this.migrationContext.UniqueKey.NameInGhostTable)
matched, _ := regexp.MatchString(migrationUniqueKeyExpression, message)
if strings.Contains(message, "Duplicate entry") && matched {
```
migrationUniqueKeyExpression is built from table/index names without escaping regex metacharacters, and regexp.MatchString errors are ignored. If an index name contains regex characters, the match can fail (or error) and cause expected duplicates on the migration unique key to be treated as unexpected warnings. Use regexp.QuoteMeta for dynamic components and handle the regex/match error (also consider compiling once outside the loop).
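A minimal sketch of the hardened match the review suggests, escaping the dynamic table/key names with `regexp.QuoteMeta` and not discarding the compile error. `isExpectedDuplicate` is a hypothetical helper name, not gh-ost's actual code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// isExpectedDuplicate reports whether a SHOW WARNINGS message is a duplicate
// on the migration's own unique key, which is expected during binlog replay
// (the row was already copied in the bulk copy phase). Any other warning
// should be treated as fatal.
func isExpectedDuplicate(message, ghostTable, uniqueKey string) bool {
	if !strings.Contains(message, "Duplicate entry") {
		return false
	}
	// MySQL versions differ on whether the key name carries a table prefix.
	// QuoteMeta prevents index/table names with regex metacharacters from
	// corrupting the pattern.
	re, err := regexp.Compile(fmt.Sprintf("for key '(%s\\.)?%s'",
		regexp.QuoteMeta(ghostTable), regexp.QuoteMeta(uniqueKey)))
	if err != nil {
		// An unmatchable pattern is treated as "not expected", so the
		// warning fails the migration rather than being silently ignored.
		return false
	}
	return re.MatchString(message)
}

func main() {
	// Duplicate on the migration unique key: expected, skip.
	fmt.Println(isExpectedDuplicate(
		"Duplicate entry '1' for key '_t_gho.PRIMARY'", "_t_gho", "PRIMARY"))
	// Duplicate on another unique index: unexpected, fail the migration.
	fmt.Println(isExpectedDuplicate(
		"Duplicate entry 'bob@example.com' for key 'uniq_email'", "_t_gho", "PRIMARY"))
}
```

In real code the regexp could also be compiled once outside the warning-scanning loop, since the table and key names do not change per row.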
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Thanks for the PR @ggilder. This was a known issue when @grodowski introduced
Related issues: #1635, #1636
Description
This PR fixes a critical data loss bug in gh-ost's `--panic-on-warnings` mode that could silently delete rows when adding unique indexes to columns with duplicate values during binlog replay. Additionally, it makes `PanicOnWarnings` usable when gh-ost is used as a library rather than as a standalone binary.

Problem

gh-ost could silently lose data when migrations add unique indexes to columns, even with `--panic-on-warnings` enabled. This occurred due to the use of `REPLACE INTO` for binlog event replay, which silently deletes conflicting rows instead of generating warnings.

Example Scenario

Migration: add a unique index to the `email` column.

- Original table: `id` (PRIMARY KEY), `email` (no unique constraint)
- Ghost table: `id` (PRIMARY KEY), `email` (UNIQUE, being added)

Initial state (after bulk copy): the ghost table contains `(id=1, email='bob@example.com')` and `(id=2, email='alice@example.com')`.

During postponed cutover:

- `INSERT (id=3, email='bob@example.com')` into the original table
- Binlog replay runs `REPLACE INTO ghost_table VALUES (3, 'bob@example.com')`
- `email='bob@example.com'` conflicts with the existing row `id=1`, which is silently deleted

Root Cause

The `REPLACE` statement deletes the conflicting row (whether the conflict is on the primary key or on any other unique index) and inserts the new row, generating no warning or error, so `PanicOnWarnings` cannot detect the conflict.

Solution

Replace `REPLACE` with `INSERT IGNORE` + Warning Detection

Changed DML replay from:

```sql
REPLACE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
```

To:

```sql
INSERT IGNORE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
```

`INSERT IGNORE` skips the conflicting insert, leaves existing rows untouched, and generates a warning that can be detected.

Added warning detection in `ApplyDMLEventQueries()` when `PanicOnWarnings` is enabled: it checks `SHOW WARNINGS` after batch execution, tolerates expected duplicates on the migration unique key, and fails the migration (rolling back the transaction) on any other warning.

Use `panic()` Instead of `os.Exit()` for Library Usage

Changed `listenOnPanicAbort()` from `Log.Fatale(err)` (which calls `os.Exit(1)`) to `panic(err)`. When using gh-ost as a library, applications can now recover from panics using `defer/recover()` instead of having the entire process terminated.

- [x] `script/cibuild` returns with no formatting errors, build errors or unit test errors.