
feat: built-in databricks unity catalog sync #802

Open
parisni wants to merge 11 commits into apache:main from leboncoin:pr-databricks-uc-sync

Conversation

@parisni parisni commented Feb 11, 2026

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

This pull request adds Databricks Unity Catalog (DATABRICKS_UC) catalog sync support to XTable, with a new xtable-databricks module, UC config wiring, a UC sync client implementation, and documentation updates.

Brief change log

  • Added new xtable-databricks module + ServiceLoader wiring.
  • Added DatabricksUnityCatalogConfig and DatabricksUnityCatalogSyncClient (DDL-based UC registration).
  • Implemented UC sync methods: hasDatabase, createDatabase, getTable, createTable, createOrReplaceTable, dropTable, refreshTable.
  • Added UC-specific unit tests for all catalog sync methods.
  • Documented Databricks UC sync flow, auth (OAuth M2M), limitations (Delta-only, unique location), and schema evolution strategy (drop/recreate).

Verify this pull request

This change added tests and can be verified as follows:

  • Added TestDatabricksUnityCatalogSyncClient in xtable-databricks.
  • Run:
`mvn -Drat.skip=true -Dspotless.skip=true -Dtest=TestDatabricksUnityCatalogSyncClient -pl xtable-databricks test`

Not sure if Unity Catalog refreshes the table metadata by itself; will have to investigate. For now, skip refresh table:

- ALTER COLUMN statements are meant for Delta, not UC, so they try to upgrade the Delta logs.
- The REFRESH TABLE statement does not refresh the columns.
- CREATE OR REPLACE statements are also meant for Delta, not UC, so they try to upgrade the Delta logs.
- DROP TABLE + create a new table does the work; it also sounds like UC keeps track of the table history, so we don't lose the associated assets (top users, related assets, creation date, ...).
@parisni parisni changed the title feat: databricks unity catalog sync feat: built-in databricks unity catalog sync Feb 11, 2026

```java
DatabricksUnityCatalogSyncClient(
    ExternalCatalogConfig catalogConfig,
    String tableFormat,
```

SQL injection risk: fullName is derived from user-provided tableIdentifier and interpolated directly into the SQL string. While escapeSqlString is used for location, the table name itself is not sanitized.

If hierarchicalId contains ; DROP TABLE ..., this would be injected. Consider validating that catalog/schema/table names match [a-zA-Z0-9_]+ before interpolating them into DDL statements.
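A minimal sketch of the validation the comment suggests. The class and method names here are hypothetical (not from the PR); only the `[a-zA-Z0-9_]+` pattern comes from the review comment:

```java
import java.util.regex.Pattern;

// Hypothetical helper: validates each Unity Catalog identifier part before it
// is interpolated into a DDL string, rejecting anything that could carry an
// injected statement.
public class UcIdentifiers {
  private static final Pattern SAFE_IDENTIFIER = Pattern.compile("[a-zA-Z0-9_]+");

  /** Returns the part unchanged if it is safe to embed in DDL, else throws. */
  public static String requireSafe(String part) {
    if (part == null || !SAFE_IDENTIFIER.matcher(part).matches()) {
      throw new IllegalArgumentException("Unsafe identifier: " + part);
    }
    return part;
  }

  /** Builds catalog.schema.table only from validated parts. */
  public static String fullName(String catalog, String schema, String table) {
    return requireSafe(catalog) + "." + requireSafe(schema) + "." + requireSafe(table);
  }
}
```

With this guard in place, an identifier like `t; DROP TABLE x` fails fast instead of reaching the SQL string.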

```java
if (table == null) {
  return null;
}
return table.getStorageLocation();
```

MSCK REPAIR TABLE ... SYNC METADATA is a Hive-specific command. Does Databricks UC's SQL Warehouse support this syntax? The standard Databricks approach for refreshing external table metadata is ALTER TABLE <table> SET TBLPROPERTIES or reading from the Delta log directly.

If this doesn't actually work on Databricks, schema drift would silently be ignored. Has this been tested against a real SQL Warehouse?

```java
String catalog = hierarchical.getCatalogName();
if (StringUtils.isBlank(catalog)) {
  throw new CatalogSyncException(
      "Databricks UC requires a catalog name (expected catalog.schema.table)");
```

createOrReplaceTable drops then recreates the table. This is not atomic — if createTable fails after dropTable succeeds, the table is gone. Consider using CREATE OR REPLACE TABLE DDL instead (if supported for external tables), or at minimum catch the create failure and log a clear error that the table was dropped but not recreated.
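A sketch of the fallback the comment asks for (names are assumptions, not the PR's actual code): wrap the drop/create sequence so that a failed create surfaces a clear error instead of silently leaving the table gone.

```java
// Hypothetical helper illustrating the reviewer's "at minimum" suggestion:
// the sequence is still not atomic, but a create failure after a successful
// drop is reported explicitly so operators know a manual re-sync is needed.
public class ReplaceTableHelper {
  public static void dropThenCreate(String fullName, Runnable drop, Runnable create) {
    drop.run();
    try {
      create.run();
    } catch (RuntimeException e) {
      throw new IllegalStateException(
          "Table " + fullName + " was dropped but re-creation failed; manual re-sync required",
          e);
    }
  }
}
```

The preferred fix, if UC supports it for external tables, remains a single `CREATE OR REPLACE TABLE` statement rather than two separate ones.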

```java
} catch (Exception e) {
  throw new CatalogSyncException("Failed to get schema: " + fullName, e);
}
}
```

The init method can create up to 3 separate WorkspaceClient instances (lines 153, 158, 164) if statementExecution, tablesApi, and schemasApi are all null. Create the WorkspaceClient once at the top:

```java
if (this.workspaceClient == null) {
    this.workspaceClient = new WorkspaceClient(buildConfig(databricksConfig));
}
if (this.statementExecution == null) {
    this.statementExecution = workspaceClient.statementExecution();
}
...
```


```java
private ExternalCatalogConfig catalogConfig;
private DatabricksUnityCatalogConfig databricksConfig;
private Configuration hadoopConf;
```

All fields (catalogConfig, databricksConfig, hadoopConf, tableFormat, workspaceClient, etc.) are mutable and set via init(). This pattern is fragile — calling any method before init() would NPE.

Consider making these final and removing the no-arg constructor, or at least adding a null-check guard in methods that use these fields. The ServiceLoader no-arg constructor + init() pattern should be documented with a warning.
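A minimal sketch of the null-check guard option, using stand-in names (the real class holds a `WorkspaceClient` and several config fields; this example reduces them to one placeholder field):

```java
// Hypothetical illustration of guarding the ServiceLoader no-arg constructor
// + init() lifecycle: any method called before init() fails fast with a clear
// message instead of an NPE deep inside the call.
public class InitGuardExample {
  private Object workspaceClient; // stands in for the real WorkspaceClient

  public void init(Object client) {
    this.workspaceClient = client;
  }

  private void ensureInitialized() {
    if (workspaceClient == null) {
      throw new IllegalStateException("init() must be called before using this client");
    }
  }

  public Object getTable(String name) {
    ensureInitialized();
    return workspaceClient; // placeholder for the real table lookup
  }
}
```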

```java
props.get(CLIENT_SECRET),
props.get(TOKEN));
}
}
```

The token field is defined and parsed but the docs say PAT auth is "intentionally not wired" and "not supported yet". If it's not supported, don't add the field — it creates confusion. Remove TOKEN constant and token field until PAT auth is actually implemented.


```xml
<dependencies>
  <dependency>
    <groupId>org.apache.xtable</groupId>
```

Pin the Databricks SDK version to a property in the parent pom.xml instead of hardcoding 0.85.0 here. Follow the pattern used for other dependency versions in the project.
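A sketch of that pattern. The property name is an assumption, and the dependency coordinates should be checked against the module's actual pom; match whatever naming convention the parent pom.xml already uses for other versions:

```xml
<!-- parent pom.xml, inside <properties> (property name is hypothetical) -->
<properties>
  <databricks-sdk.version>0.85.0</databricks-sdk.version>
</properties>

<!-- xtable-databricks/pom.xml -->
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-sdk-java</artifactId>
  <version>${databricks-sdk.version}</version>
</dependency>
```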

```java
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
```

Tests have massive boilerplate — every test method repeats the same 15-line client construction block. Extract a createClient(String tableFormat) helper method to reduce duplication and improve readability.

```java
.thenReturn(
    new StatementResponse()
        .setStatus(new StatementStatus().setState(StatementState.SUCCEEDED)));
```

Missing test: refreshTable with matching schema (no-op path). You test the schema-mismatch path but not the case where schemas match and no MSCK REPAIR is issued. Also missing: refreshTable with null catalogTable and refreshTable with null/empty schema.

```js
'hms',
'glue-catalog',
'unity-catalog',
'databricks-unity-catalog',
```

The sidebar reference was changed from unity-catalog to databricks-unity-catalog, but the docs file is still named unity-catalog.md (not renamed). This will break the sidebar link. Either rename the md file or keep the sidebar reference as unity-catalog.
