Fix data loss when inserting duplicate values during a migration#1633

Open
ggilder wants to merge 3 commits into github:master from ggilder:fix-panic-on-warnings

Conversation

ggilder commented Feb 28, 2026

Related issues: #1635, #1636

Description

This PR fixes a critical data loss bug in gh-ost's --panic-on-warnings mode that could silently delete rows when adding unique indexes to columns with duplicate values during binlog replay. Additionally, it makes PanicOnWarnings usable when gh-ost is used as a library rather than as a standalone binary.

Problem

gh-ost could silently lose data when migrations add unique indexes to columns, even with --panic-on-warnings enabled. This occurred due to the use of REPLACE INTO for binlog event replay, which silently deletes conflicting rows instead of generating warnings.

Example Scenario

Migration: Add unique index to email column

Original table: id (PRIMARY KEY), email (no unique constraint)
Ghost table: id (PRIMARY KEY), email (UNIQUE - being added)

Initial state (after bulk copy):

Ghost table:
  (id=1, email='bob@example.com')
  (id=2, email='alice@example.com')

During postponed cutover:

  • INSERT (id=3, email='bob@example.com') into original table
  • Binlog replay applies to ghost table: REPLACE INTO ghost_table VALUES (3, 'bob@example.com')
  • Duplicate email='bob@example.com' conflicts with existing row id=1
  • Row with id=1 silently deleted → DATA LOSS

Root Cause

The REPLACE statement behavior:

  • Deletes conflicting rows on ANY unique index violation
  • Does NOT generate warnings or errors

Solution

Replace REPLACE with INSERT IGNORE + Warning Detection

Changed DML replay from:

```sql
REPLACE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
```

To:

```sql
INSERT IGNORE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
```

INSERT IGNORE behavior:

  • Skips insert on ANY unique index violation
  • Generates WARNING (detectable by PanicOnWarnings)
  • Does NOT delete existing rows

Added warning detection in ApplyDMLEventQueries() when PanicOnWarnings is enabled:

  1. Executes SHOW WARNINGS after batch execution
  2. Ignores duplicates on migration unique key (expected - row already copied in bulk phase)
  3. Fails migration for duplicates on other unique indexes
  4. Transaction rollback ensures no partial state

Use panic() Instead of os.Exit() for Library Usage

Changed listenOnPanicAbort() from:

```go
log.Fatale(err) // Calls os.Exit(1), kills entire process
```

To:

```go
panic(err) // Allows recovery by calling applications
```

When using gh-ost as a library, applications can now recover from panics using defer/recover() instead of having the entire process terminated.
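A minimal sketch of that library-usage pattern, with illustrative names rather than gh-ost's actual API; note that recover only catches panics raised on the same goroutine as the deferred function:

```go
package main

import (
	"errors"
	"fmt"
)

// runMigration is a stand-in for a library call that aborts via panic(err)
// instead of os.Exit(1); the name is hypothetical, not gh-ost's API.
func runMigration() error {
	return errors.New("panic-on-warnings: duplicate entry on unique index")
}

// safeMigrate shows how a calling application can convert such a panic
// back into an ordinary error with defer/recover.
func safeMigrate() (err error) {
	defer func() {
		if r := recover(); r != nil {
			if e, ok := r.(error); ok {
				err = e
			} else {
				err = fmt.Errorf("migration panicked: %v", r)
			}
		}
	}()
	panic(runMigration()) // simulates the abort path's panic(err)
}

func main() {
	if err := safeMigrate(); err != nil {
		fmt.Println("recovered:", err)
	}
}
```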

  • Contributed code uses the same conventions as the original code
  • script/cibuild returns with no formatting errors, build errors, or unit test errors

This patch modifies gh-ost to use panic() instead of log.Fatale() in
listenOnPanicAbort. When using gh-ost as a library, this allows the
calling application to recover from panics instead of having the entire
process terminate via os.Exit().

When PanicOnWarnings is enabled and SQL warnings occur, the error flows
through these channels:
  1. terminateRowIteration sends error to rowCopyComplete
  2. consumeRowCopyComplete forwards error to PanicAbort
  3. listenOnPanicAbort calls Log.Fatale(err)
  4. Fatale calls os.Exit(1) which kills the entire process

By changing Fatale to panic, calling applications can catch it with defer/recover().
…igration indexes

gh-ost could silently lose data when adding a unique index to a column during
a migration, even with the PanicOnWarnings flag enabled. This occurred when:

1. A migration adds a unique index to a column (e.g., email)
2. Rows with duplicate values are inserted into the original table after the
    bulk copy phase completes (during postponed cutover)
3. These duplicate rows are applied via binlog replay to the ghost table

**Expected behavior:** Migration fails with clear error
**Actual behavior:** Original rows with duplicate values silently
deleted, data lost

Original table: id PRIMARY KEY, email (no unique constraint)
Ghost table:    id PRIMARY KEY, email UNIQUE (being added)

Initial state (after bulk copy):
- Ghost table: (id=1, email='bob@example.com')
               (id=2, email='alice@example.com')

During postponed cutover:
- INSERT (id=3, email='bob@example.com') into original table

Binlog replay attempts:
- INSERT (id=3, email='bob@example.com') into ghost table
- Duplicate email='bob@example.com' (conflicts with id=1)
- Row with id=1 silently deleted → DATA LOSS

The DMLInsertQueryBuilder used `REPLACE` statements for binlog event replay:

```sql
REPLACE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
```

REPLACE behavior:
- If duplicate PRIMARY KEY: deletes old row, inserts new row (no warning/error)
- If duplicate on OTHER unique index: deletes conflicting row, inserts new row
- NO warnings or errors generated, so PanicOnWarnings cannot detect the issue

The original design assumed REPLACE was needed to handle timing edge cases
where binlog replay might process a row before bulk copy, but this caused
silent data corruption when other unique indexes had duplicates.

Changed DMLInsertQueryBuilder to use INSERT IGNORE instead of REPLACE:

```sql
INSERT IGNORE INTO ghost_table (id, email) VALUES (3, 'bob@example.com');
```

INSERT IGNORE behavior:
- If duplicate on ANY unique index: skip insert, generate WARNING
- Does not delete existing rows

Added warning detection to ApplyDMLEventQueries() when PanicOnWarnings is
enabled:
- Checks SHOW WARNINGS after batch execution
- Ignores duplicates on migration unique key (expected - row already copied)
- FAILS migration for duplicates on other unique indexes
- Transaction rollback ensures no partial state

Edge Case: DELETE+INSERT Conversion

When an UPDATE modifies the migration unique key (e.g., PRIMARY KEY), gh-ost
converts it to DELETE+INSERT within a single transaction:

```sql
BEGIN;
DELETE FROM ghost WHERE id=2;
INSERT IGNORE INTO ghost VALUES (3, 'bob@example.com');
COMMIT;
```

If the INSERT encounters a duplicate on a non-migration unique index:
- With PanicOnWarnings: Warning detected, transaction rolled back, both
DELETE and INSERT undone → no data loss ✓
- Without PanicOnWarnings: DELETE succeeds, INSERT silently skips →
data loss. This further reinforces that PanicOnWarnings should default
to on.
@ggilder ggilder marked this pull request as ready for review March 2, 2026 19:54
Copilot AI review requested due to automatic review settings March 2, 2026 19:54

Copilot AI left a comment


Pull request overview

This PR addresses a critical correctness issue in gh-ost binlog replay when adding unique indexes by switching DML replay semantics to avoid REPLACE-driven row deletion, and adds warning-based failure behavior for PanicOnWarnings during replay (plus a change intended to make panic-aborts recoverable for library usage).

Changes:

  • Switch DML insert replay from REPLACE to INSERT IGNORE (SQL builder + tests).
  • Add SHOW WARNINGS detection in ApplyDMLEventQueries() to fail migrations on unexpected warnings when PanicOnWarnings is enabled.
  • Change panic-abort handling from os.Exit-style fatal logging to panic(err) and add regression tests for duplicate/warning scenarios.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Summary per file:

  • go/sql/builder.go: Updates the generated DML insert statement from REPLACE to INSERT IGNORE.
  • go/sql/builder_test.go: Updates expectations for the new insert statement form.
  • go/logic/applier.go: Adds warning detection/rollback path for replay when PanicOnWarnings is enabled.
  • go/logic/applier_test.go: Adds integration test ensuring a duplicate on a non-migration unique index triggers failure without data loss.
  • go/logic/applier_delete_insert_test.go: Adds integration tests around UPDATE→DELETE+INSERT behavior with INSERT IGNORE + warnings.
  • go/logic/migrator.go: Changes panic-abort handler from fatal exit to panic(err).


Comment on lines 269 to +271:

```diff
 func (this *Migrator) listenOnPanicAbort() {
 	err := <-this.migrationContext.PanicAbort
-	this.migrationContext.Log.Fatale(err)
+	panic(err)
```

Copilot AI Mar 2, 2026


listenOnPanicAbort is started in its own goroutine (go this.listenOnPanicAbort()), so panic(err) here will crash the process and cannot be recovered by a library caller using defer/recover() around Migrate() (recover only works within the same goroutine). If the intent is to make PanicOnWarnings usable as a library, consider propagating the abort error back to the goroutine running Migrate() (e.g., return an error / cancel a context / signal via a channel that Migrate() selects on) rather than panicking in a background goroutine.

```go
// TestUpdateModifyingUniqueKeyWithDuplicateOnOtherIndex tests the scenario where:
// 1. An UPDATE modifies the unique key (converted to DELETE+INSERT)
// 2. The INSERT would create a duplicate on a NON-migration unique index
// 3. With INSERT IGNORE, the DELETE succeeds but INSERT skips = DATA LOSS
```

Copilot AI Mar 2, 2026


The header comment describes this scenario as resulting in data loss with INSERT IGNORE, but this test expects the transaction to roll back on warnings (preventing data loss). Consider updating the comment to clarify this is the failure mode being guarded against (i.e., what would happen without warning detection/rollback).

Suggested change:

```diff
-// 3. With INSERT IGNORE, the DELETE succeeds but INSERT skips = DATA LOSS
+// 3. With INSERT IGNORE and no warning detection/rollback, the DELETE succeeds but INSERT skips = DATA LOSS (failure mode this test guards against)
```

Comment on lines +1561 to +1593
```go
// Check for warnings when PanicOnWarnings is enabled
if this.migrationContext.PanicOnWarnings {
	rows, err := tx.Query("SHOW WARNINGS")
	if err != nil {
		return rollback(err)
	}
	defer rows.Close()
	if err = rows.Err(); err != nil {
		return rollback(err)
	}

	var sqlWarnings []string
	for rows.Next() {
		var level, message string
		var code int
		if err := rows.Scan(&level, &code, &message); err != nil {
			this.migrationContext.Log.Warningf("Failed to read SHOW WARNINGS row")
			continue
		}
		// Duplicate warnings are formatted differently across mysql versions, hence the optional table name prefix
		migrationUniqueKeyExpression := fmt.Sprintf("for key '(%s\\.)?%s'", this.migrationContext.GetGhostTableName(), this.migrationContext.UniqueKey.NameInGhostTable)
		matched, _ := regexp.MatchString(migrationUniqueKeyExpression, message)
		if strings.Contains(message, "Duplicate entry") && matched {
			// Duplicate entry on migration unique key is expected during binlog replay
			// (row was already copied during bulk copy phase)
			continue
		}
		sqlWarnings = append(sqlWarnings, fmt.Sprintf("%s: %s (%d)", level, message, code))
	}
	if len(sqlWarnings) > 0 {
		warningMsg := fmt.Sprintf("Warnings detected during DML event application: %v", sqlWarnings)
		return rollback(fmt.Errorf(warningMsg))
	}
```

Copilot AI Mar 2, 2026


This warning check runs only once after executing a multi-statement batch (the DML queries are concatenated with ;). SHOW WARNINGS only reports warnings for the most recently executed statement, so warnings from earlier statements in the batch can be missed, causing PanicOnWarnings to silently allow ignored inserts. Consider disabling multi-statement batching when PanicOnWarnings is enabled, or executing statements one-by-one and checking warnings (or @@warning_count) after each statement.

Suggested change (the removed lines duplicate the SHOW WARNINGS block quoted above; the proposed replacement is):

```go
// When PanicOnWarnings is enabled, we must not rely on a single SHOW WARNINGS
// after executing a multi-statement batch, because MySQL only exposes warnings
// for the most recent statement. Earlier statements in the batch might have
// generated warnings that are no longer visible here.
if this.migrationContext.PanicOnWarnings {
	return rollback(fmt.Errorf("PanicOnWarnings is enabled but DML events are applied using a multi-statement batch; cannot reliably detect all statement warnings"))
```

Copilot uses AI. Check for mistakes.
Comment on lines +1572 to +1576:

```go
var sqlWarnings []string
for rows.Next() {
	var level, message string
	var code int
	if err := rows.Scan(&level, &code, &message); err != nil {
```

Copilot AI Mar 2, 2026


rows.Err() is checked before iterating, but errors that occur during iteration will only surface after the loop. Add a rows.Err() check after the for rows.Next() loop so iteration errors don't get ignored in PanicOnWarnings mode.

Comment on lines +1580 to +1583:

```go
// Duplicate warnings are formatted differently across mysql versions, hence the optional table name prefix
migrationUniqueKeyExpression := fmt.Sprintf("for key '(%s\\.)?%s'", this.migrationContext.GetGhostTableName(), this.migrationContext.UniqueKey.NameInGhostTable)
matched, _ := regexp.MatchString(migrationUniqueKeyExpression, message)
if strings.Contains(message, "Duplicate entry") && matched {
```

Copilot AI Mar 2, 2026


migrationUniqueKeyExpression is built from table/index names without escaping regex metacharacters, and regexp.MatchString errors are ignored. If an index name contains regex characters, the match can fail (or error) and cause expected duplicates on the migration unique key to be treated as unexpected warnings. Use regexp.QuoteMeta for dynamic components and handle the regex/match error (also consider compiling once outside the loop).

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

meiji163 commented Mar 3, 2026

Thanks for the PR @ggilder. This was a known issue when @grodowski introduced --panic-on-warnings. Unfortunately, we can't use INSERT IGNORE instead of REPLACE INTO, because the binlog applier must overwrite rows that were already copied to the ghost table: the applier is responsible for inserting the latest version of each row, while INSERT IGNORE would keep the old one.
