
[SPARK-56374][BUILD] Align SBT assembly shade rules with Maven #55307

Open
yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/SPARK-56374-sbt-shade-alignment

Conversation

yadavay-amzn commented Apr 11, 2026

What changes were proposed in this pull request?

Add missing shade rules to project/SparkBuild.scala to align SBT assembly output with Maven for three connect modules:

  1. SparkConnect (server): Add com.google.common → org.sparkproject.guava and com.google.thirdparty → org.sparkproject.guava.thirdparty relocations. Maven's sql/connect/server/pom.xml has these but SBT was missing them.

  2. SparkConnectClient (jvm): Add an org.apache.arrow → org.sparkproject.connect.client.org.apache.arrow relocation. Maven's connector/connect/client/jvm/pom.xml has this but SBT was missing it.

  3. SparkConnectJdbc: Add org.apache.arrow relocation for consistency with Maven's sql/connect/client/jdbc/pom.xml.
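For illustration, a minimal sketch of what such sbt-assembly rules could look like in project/SparkBuild.scala — the exact rule lists and where they are wired in are assumptions here; the PR diff itself is authoritative:

```scala
import sbtassembly.ShadeRule

// SparkConnect (server): mirror the Guava relocations from
// sql/connect/server/pom.xml (sketch; the actual rule list may differ).
lazy val connectServerShadeRules = Seq(
  ShadeRule.rename("com.google.common.**" -> "org.sparkproject.guava.@1").inAll,
  ShadeRule.rename("com.google.thirdparty.**" -> "org.sparkproject.guava.thirdparty.@1").inAll
)

// SparkConnectClient (jvm) / SparkConnectJdbc: mirror the Arrow relocation.
lazy val connectClientShadeRules = Seq(
  ShadeRule.rename(
    "org.apache.arrow.**" -> "org.sparkproject.connect.client.org.apache.arrow.@1").inAll
)

// Hypothetical wiring into a module's assembly task:
//   assembly / assemblyShadeRules ++= connectServerShadeRules
```

The `@1` capture in sbt-assembly's `ShadeRule.rename` plays the same role as maven-shade-plugin's pattern/shadedPattern pair, rewriting both class files and the references inside them.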

Why are the changes needed?

SBT assembly shade rules were out of sync with Maven, causing differences in the assembled JARs:

  • Server: Without the guava relocation, grpc classes in the server assembly reference unshaded com.google.common.*. At runtime these fail to resolve because Guava is shaded to org.sparkproject.guava in spark-network-common. Verified by inspecting ManagedChannelImpl.class — after the fix, references correctly point to org/sparkproject/guava/base/MoreObjects instead of com/google/common/base/MoreObjects.

  • Client JVM: Arrow classes were not being shaded. After the fix, they appear under org/sparkproject/connect/client/org/apache/arrow/.

Other Maven modules with shade rules (core, sql/core, network-yarn, root pom.xml) were verified to be intentionally different in SBT — those modules don't produce separate assembly JARs in SBT, and their shading is handled at a different level in the SBT build architecture.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Built all three affected SBT assemblies and verified the output JARs:

build/sbt connect/assembly              # ✅ success
build/sbt connect-client-jvm/assembly   # ✅ success
build/sbt connect-client-jdbc/assembly  # ✅ success

Verified shading in the output JARs:

  • Server assembly: strings ManagedChannelImpl.class shows org/sparkproject/guava/base/MoreObjects (was com/google/common/base/MoreObjects before fix)
  • Client JVM assembly: jar tf shows arrow classes under org/sparkproject/connect/client/org/apache/arrow/ with zero unshaded org/apache/arrow entries

Was this patch authored or co-authored using generative AI tooling?

No

Add missing shade rules to SparkBuild.scala for three connect modules:

1. SparkConnect (server): Add guava and guava.thirdparty relocations to
   match Maven. Without these, grpc classes in the assembly reference
   unshaded com.google.common.* which fails at runtime since Guava is
   shaded to org.sparkproject.guava in spark-network-common.

2. SparkConnectClient (jvm): Add org.apache.arrow relocation to match
   Maven. Arrow classes are now shaded under
   org/sparkproject/connect/client/org/apache/arrow/.

3. SparkConnectJdbc: Add org.apache.arrow relocation for consistency
   with Maven and the jvm client module.

Closes SPARK-56374
yadavay-amzn (Author) commented:

Tried closing and re-opening #55306 to trigger the automated check workflows, but that didn't work.

Opened this fresh PR, but still no checks were triggered.
Requesting help from committers. cc: @sarutak

Tested the builds locally and they passed.

sarutak (Member) commented Apr 11, 2026

Hi @yadavay-amzn, your GitHub Actions workflows still seem to be disabled.
[screenshot: yadavay-amzn-ga-disabled]

Could you confirm it?

sarutak (Member) commented Apr 13, 2026

Hi @yadavay-amzn, I have a few concerns about this PR.

  1. When all modules are built with SBT, Guava and Arrow are never relocated in any module. So the issue described in the PR description (referencing unshaded com.google.common.* failing at runtime) should not occur in a pure SBT build — all modules consistently reference the original namespaces.

  2. The described problem could occur if some modules are built with Maven (where the parent pom's shade plugin relocates Guava globally) and Spark Connect is built with SBT. However, mixing Maven-built and SBT-built JARs is not a normal workflow.

  3. Since SBT does not relocate Guava or Arrow in other modules, applying these relocation rules only to the Connect assembly JARs would cause the relocated references (e.g., org.sparkproject.guava.*) to point to classes that don't exist on the classpath. This could actually introduce runtime issues rather than fix them.

  4. The "How was this patch tested?" section verifies that the assembly builds succeed and that bytecode was rewritten, but this is a compile-time / packaging-time check. The concern here is about runtime behavior — whether the relocated references can actually be resolved. A runtime test (e.g., bin/spark-shell --remote local with the SBT-built distribution) would be more appropriate to validate this change.

cc: @LuciferYang who raised SPARK-56374

LuciferYang (Contributor) commented Apr 14, 2026

@yadavay-amzn Thank you for submitting the PR. I'd like to clarify the intent behind it:

Why I’m proposing this work

Right now, build/sbt package / build/sbt assembly and mvn package produce different JARs from the same Spark source code. Most differences come from how shading is handled:

  • Maven uses maven-shade-plugin to relocate dependencies into org.sparkproject.*. SBT often uses different prefixes or doesn’t relocate at all.
  • Some Maven modules include only a small set of dependencies in their final JAR. SBT typically bundles the full transitive closure, leading to larger JARs and potential class conflicts.
  • common/network-yarn uses an antrun step in Maven to rename native Netty libraries into the shaded namespace. SBT has no equivalent logic.
  • SBT’s CopyDependencies doesn’t always pick the shaded JARs when assembling the distribution.
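On the second bullet, sbt-assembly does allow trimming the bundled dependency set. A hedged sketch — the allow-list below is purely illustrative, not Maven's actual include list, which must be read from the poms:

```scala
// Illustrative allow-list; Maven's actual <include> set comes from the poms.
lazy val mavenIncludedArtifacts = Set("spark-connect-common", "guava", "failureaccess")

assembly / assemblyExcludedJars := {
  val cp = (assembly / fullClasspath).value
  // Keep only jars matching an artifact Maven bundles, instead of the
  // full transitive closure SBT would otherwise include by default.
  cp.filterNot(entry => mavenIncludedArtifacts.exists(a => entry.data.getName.startsWith(a)))
}
```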

The end result is that an SBT-built Spark distribution cannot replace a Maven-built one at runtime: downstream code that relies on shaded, relocated classes will fail. The release process currently uses Maven builds as the source of truth, which prevents us from adopting a faster build path for releases.

What “done” looks like

My target end state:

  1. Byte-equivalent JARs. For every module that Maven shades, sbt assembly produces a JAR with identical class layout, relocation prefixes, and included dependencies. A jar tf diff should show only trivial differences like timestamps.
  2. Runtime compatibility. A distribution built with dev/make-distribution.sh --sbt must pass the full PySpark test suite, Spark Connect client/server flows, and the YARN external shuffle service.
  3. Build-time validation. We may need to add a new SBT task that fails the build directly if the shaded JAR is missing expected relocated classes or still contains unshaded packages.
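As a hedged sketch of item 3, such a validation task could look roughly like this in sbt — the task name, forbidden-package list, and wiring are all illustrative assumptions:

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Hypothetical task: fail the build if the assembly jar still contains
// classes under packages that Maven's shade plugin would have relocated.
lazy val validateShading = taskKey[Unit]("Check assembly jar for unshaded packages")

validateShading := {
  val jarPath = (assembly / assemblyOutputPath).value
  val unshadedPrefixes = Seq("com/google/common/", "org/apache/arrow/")
  val jar = new JarFile(jarPath)
  try {
    val leaked = jar.entries().asScala.map(_.getName)
      .filter(name => unshadedPrefixes.exists(name.startsWith))
      .toList
    if (leaked.nonEmpty) {
      sys.error(s"Unshaded classes in ${jarPath.getName}: ${leaked.take(5).mkString(", ")}")
    }
  } finally jar.close()
}
```

A complementary check could assert that expected relocated prefixes (e.g. org/sparkproject/guava/) are actually present, catching rules that silently failed to apply.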

Explicit non-goals:

  • Replacing maven-shade-plugin or changing Maven behavior. Maven remains the source of truth.
  • Refactoring the overall SBT build structure. Work is limited to assembly and shading settings and downstream consumers like CopyDependencies.
  • Modifying runtime code in core, sql, connect, or network-*. No bytecode changes outside build definitions.

Where the work lives

  • Only file modified: project/SparkBuild.scala
  • Maven source of truth:
    • Root pom.xml (inherited shade rules)
    • core/pom.xml
    • sql/core/pom.xml (note combine.self="override")
    • sql/connect/server/pom.xml
    • sql/connect/common/pom.xml (note combine.self="override", SPARK-54177)
    • sql/connect/client/jvm/pom.xml
    • sql/connect/client/jdbc/pom.xml
    • common/network-yarn/pom.xml
    • streaming/pom.xml
    • connector/protobuf/pom.xml (already partially handled in SBT — verify parity)
    • connector/kafka-0-10-assembly/pom.xml
    • connector/kinesis-asl-assembly/pom.xml

Read all these poms first. They are the specification. The modules covered above are the scope we need to align.

How to test

What I can think of now: tests for PySpark and Connect should run against the shaded jars to verify normal runtime behavior. The YARN shuffle service needs to start properly and load the Netty native libraries correctly. And the Spark Connect JDBC client should run normally with no class-not-found errors.

This is quite a challenging task, and I’m really glad you’re interested in it. Feel free to ping me anytime if there are updates. Thanks ~
