Kafka Connect: Surface commit failures instead of silently swallowing them #16237

Open
yadavay-amzn wants to merge 1 commit into apache:main from yadavay-amzn:fix/iceberg_15878
Contributor

@yadavay-amzn yadavay-amzn commented May 7, 2026

Fixes #15878.

Problem

The Kafka Connect Coordinator previously caught Exception around doCommit() and only logged a warning, so when a commit failed (e.g., a CommitFailedException from Glue detecting a concurrent table update), the connector stayed RUNNING while silently dropping the data that was in flight.

Fix

Replace the catch-all around doCommit(), which only logged a warning, with handling that logs at ERROR level with the task id and commit id and then rethrows. CoordinatorThread.run() already terminates the thread on uncaught exceptions, which transitions the Kafka Connect task to FAILED, so failures are now surfaced rather than dropped.

The finally block that calls commitState.endCurrentCommit() is preserved so per-commit state is cleaned up regardless of the outcome.
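The shape of the change can be sketched in plain Java. This is a minimal illustration, not the code in this PR: `CommitSketch`, its nested `CommitFailedException`, the `events` list (standing in for both the logger and `commitState`), and the `commit` method signature are all hypothetical, and the sketch catches `RuntimeException` only because `Runnable` cannot throw checked exceptions.

```java
import java.util.ArrayList;
import java.util.List;

public class CommitSketch {
  /** Hypothetical stand-in for Iceberg's CommitFailedException. */
  public static class CommitFailedException extends RuntimeException {
    public CommitFailedException(String message) {
      super(message);
    }
  }

  /** Records what happened, in place of a real logger and commit state. */
  public static final List<String> events = new ArrayList<>();

  /** Runs the commit, logs and rethrows on failure, and always cleans up. */
  public static void commit(Runnable doCommit, String taskId, String commitId) {
    try {
      doCommit.run();
    } catch (RuntimeException e) {
      // Log with enough context to find the failed commit, then rethrow
      // so the caller (and ultimately the task) sees the failure.
      events.add("ERROR task=" + taskId + " commit=" + commitId + ": " + e.getMessage());
      throw e;
    } finally {
      // Stand-in for commitState.endCurrentCommit(): runs on success and on failure.
      events.add("endCurrentCommit");
    }
  }
}
```

Because the cleanup lives in `finally`, it still runs after the `catch` block rethrows, which is why the per-commit state is reset even when the exception propagates.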

Testing

  • Added testCommitFailedExceptionPropagates, which mocks a catalog-side CommitFailedException on AppendFiles.commit() and asserts that it propagates out of Coordinator.process(). Without the fix, this test fails because the exception is swallowed.
  • Updated two existing tests (testCoordinatorWithBadDataFile and testCoordinatorCommittedOffsetValidation) that previously relied on silent-swallow behaviour; they now assert the specific exception propagates (IllegalArgumentException for bad partition spec, ValidationException for stale offsets).
  • Full TestCoordinator suite passes locally (8/8).
  • spotlessApply passes.
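
The idea behind the new propagation test can be illustrated without the real test harness. The sketch below is an assumption-laden stand-in, not the test added in this PR: `PropagationCheck`, `processSwallowing`, `processPropagating`, and `throwsCommitFailed` are hypothetical names, and the nested `CommitFailedException` substitutes for the Iceberg class so the example has no dependencies.

```java
public class PropagationCheck {
  /** Hypothetical stand-in for a catalog-side commit failure. */
  public static class CommitFailedException extends RuntimeException {
    public CommitFailedException(String message) {
      super(message);
    }
  }

  /** Old behaviour: the failure is swallowed (the real code logged a warning here). */
  public static void processSwallowing(Runnable commit) {
    try {
      commit.run();
    } catch (RuntimeException e) {
      // Swallowed: the caller never learns that the commit failed.
    }
  }

  /** New behaviour: the failure reaches the caller. */
  public static void processPropagating(Runnable commit) {
    commit.run();
  }

  /** Returns true if running the action throws CommitFailedException. */
  public static boolean throwsCommitFailed(Runnable action) {
    try {
      action.run();
      return false;
    } catch (CommitFailedException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Runnable failingCommit = () -> {
      throw new CommitFailedException("concurrent table update");
    };
    // The same check fails against the old behaviour and passes against the new one.
    System.out.println(throwsCommitFailed(() -> processSwallowing(failingCommit)));  // prints false
    System.out.println(throwsCommitFailed(() -> processPropagating(failingCommit))); // prints true
  }
}
```

A check of this form is what makes the test a regression guard: it only passes when the exception actually escapes the processing call.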

… them

The Coordinator previously caught all exceptions from doCommit() and only
logged a warning, causing the connector to stay RUNNING after a
CommitFailedException (e.g., Glue concurrent update) while silently
dropping data. Propagate the exception so CoordinatorThread terminates
and the Kafka Connect task transitions to FAILED.

Fixes apache#15878