
Conversation

@razvan razvan commented Jan 30, 2026

Description

See #646

Definition of Done Checklist

  • Not all of these items are applicable to every PR; the author should update this template to leave only the relevant boxes
  • Please make sure all these things are done and tick the boxes

Author

  • Changes are OpenShift compatible
  • CRD changes approved
  • CRD documentation for all fields, following the style guide.
  • Helm chart can be installed and the deployed operator works
  • Integration tests passed (for non-trivial changes)
  • Changes need to be "offline" compatible
  • Links to generated (nightly) docs added
  • Release note snippet added

Reviewer

  • Code contains useful comments
  • Code contains useful logging statements
  • (Integration-)Test cases added
  • Documentation added or updated. Follows the style guide.
  • Changelog updated
  • Cargo.toml only contains references to git tags (not specific commits or branches)

Acceptance

  • Feature Tracker has been updated
  • Proper release label has been added
  • Links to generated (nightly) docs added
  • Release note snippet added
  • Add type/deprecation label & add to the deprecation schedule
  • Add type/experimental label & add to the experimental features tracker

@razvan razvan self-assigned this Jan 30, 2026
@razvan razvan moved this to Development: Waiting for Review in Stackable Engineering Jan 30, 2026
@Techassi Techassi moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering Feb 2, 2026
@sbernauer sbernauer left a comment

I think I would rename the PR to something like feat: Support configuring number of retries on failure, as it has now evolved

Co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>
@razvan razvan changed the title fix: do not resubmit applications when spark-submit fails feat: add CRD property for retry on failure Feb 3, 2026
@sbernauer sbernauer left a comment

Please wait to merge until the decision has been accepted

@razvan razvan added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 3, 2026
@razvan
Copy link
Member Author

razvan commented Feb 3, 2026

Release note

Starting with this version, Spark applications are no longer resubmitted automatically in case of failure.
Users can restore the previous behavior by setting the new spec.job.retryOnFailureCount property to a non-negative value.

In addition, small improvements have been made to clean up resources created by Spark applications.

  • Application driver pods are now deleted as soon as they reach a terminal state.
  • Application executor pods are now deleted as soon as possible in case of driver or submit failure.
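
As a sketch, the new property could be set in a SparkApplication manifest like this (the surrounding fields and values are illustrative assumptions; only spec.job.retryOnFailureCount is confirmed by this PR):

```yaml
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-app  # hypothetical name
spec:
  # ... other application fields elided ...
  job:
    # Retry the spark-submit Job up to 3 times on failure.
    # Omitting the property keeps the new default of no resubmission.
    retryOnFailureCount: 3
```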

@razvan razvan added this pull request to the merge queue Feb 4, 2026
Merged via the queue into main with commit ff3caa4 Feb 4, 2026
12 checks passed
@razvan razvan deleted the fix/no-job-retry branch February 4, 2026 10:25
