feat: built-in databricks unity catalog sync #802

parisni wants to merge 11 commits into apache:main
Conversation
Not sure if Unity Catalog refreshes the table metadata by itself; will have to investigate. For now, skip the table refresh.

- `ALTER COLUMN` statements are meant for Delta, not UC, so they try to upgrade the Delta logs
- `REFRESH TABLE` does not refresh the columns
- `CREATE OR REPLACE` is also meant for Delta, not UC, so it tries to upgrade the Delta logs
- drop table + create new table does the work; UC also seems to keep track of the table history, so we don't lose the associated metadata (top users, related assets, creation date, ...)
```java
DatabricksUnityCatalogSyncClient(
    ExternalCatalogConfig catalogConfig,
    String tableFormat,
```
SQL injection risk: `fullName` is derived from the user-provided `tableIdentifier` and interpolated directly into the SQL string. While `escapeSqlString` is used for the location, the table name itself is not sanitized.
If `hierarchicalId` contains `; DROP TABLE ...`, it would be injected. Consider validating that catalog/schema/table names match `[a-zA-Z0-9_]+` before interpolating them into DDL statements.
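The suggested whitelist check could look roughly like the following sketch (class and method names are hypothetical, not the PR's actual code):

```java
import java.util.regex.Pattern;

/** Hypothetical helper: whitelist-validate UC identifiers before DDL interpolation. */
public class DdlIdentifiers {
  private static final Pattern SAFE = Pattern.compile("[a-zA-Z0-9_]+");

  public static boolean isValid(String part) {
    return part != null && SAFE.matcher(part).matches();
  }

  /** Validates each part of catalog.schema.table, then joins them with backticks. */
  public static String safeFullName(String catalog, String schema, String table) {
    for (String part : new String[] {catalog, schema, table}) {
      if (!isValid(part)) {
        throw new IllegalArgumentException("Illegal identifier: " + part);
      }
    }
    return String.format("`%s`.`%s`.`%s`", catalog, schema, table);
  }

  public static void main(String[] args) {
    // → `main`.`bronze`.`events`
    System.out.println(safeFullName("main", "bronze", "events"));
  }
}
```

Rejecting anything outside `[a-zA-Z0-9_]+` is stricter than what UC allows for identifiers, but for generated DDL that trade-off is usually acceptable.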
```java
if (table == null) {
  return null;
}
return table.getStorageLocation();
```
`MSCK REPAIR TABLE ... SYNC METADATA` is a Hive-specific command. Does Databricks UC's SQL Warehouse support this syntax? The standard Databricks approach for refreshing external table metadata is `ALTER TABLE <table> SET TBLPROPERTIES` or reading from the Delta log directly.
If this doesn't actually work on Databricks, schema drift would silently be ignored. Has this been tested against a real SQL Warehouse?
```java
String catalog = hierarchical.getCatalogName();
if (StringUtils.isBlank(catalog)) {
  throw new CatalogSyncException(
      "Databricks UC requires a catalog name (expected catalog.schema.table)");
```
`createOrReplaceTable` drops then recreates the table. This is not atomic: if `createTable` fails after `dropTable` succeeds, the table is gone. Consider using `CREATE OR REPLACE TABLE` DDL instead (if supported for external tables), or at minimum catch the create failure and log a clear error that the table was dropped but not recreated.
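The fallback suggested above could be sketched like this (the `TableOps` interface is a stand-in for the PR's real client, not its actual API):

```java
/**
 * Hedged sketch (hypothetical names): make the non-atomic drop-then-create
 * failure mode explicit instead of silently losing the table.
 */
public class ReplaceTableSketch {

  /** Minimal stand-in for the catalog operations used by the real client. */
  public interface TableOps {
    void dropTable(String fullName);
    void createTable(String fullName);
  }

  public static void createOrReplace(TableOps ops, String fullName) {
    ops.dropTable(fullName);
    try {
      ops.createTable(fullName);
    } catch (RuntimeException e) {
      // At this point the drop already succeeded: the table is gone.
      // Surface that clearly so operators know a manual re-create is needed.
      throw new IllegalStateException(
          "Table " + fullName + " was dropped but re-creation failed; it must be re-created manually",
          e);
    }
  }
}
```

This does not make the operation atomic, but it turns a silent data-loss scenario into an actionable error.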
```java
} catch (Exception e) {
  throw new CatalogSyncException("Failed to get schema: " + fullName, e);
  }
}
```
The `init` method can create up to 3 separate `WorkspaceClient` instances (lines 153, 158, 164) if `statementExecution`, `tablesApi`, and `schemasApi` are all null. Create the `WorkspaceClient` once at the top:

```java
if (this.workspaceClient == null) {
  this.workspaceClient = new WorkspaceClient(buildConfig(databricksConfig));
}
if (this.statementExecution == null) {
  this.statementExecution = workspaceClient.statementExecution();
}
// ...
```
```java
private ExternalCatalogConfig catalogConfig;
private DatabricksUnityCatalogConfig databricksConfig;
private Configuration hadoopConf;
```
All fields (`catalogConfig`, `databricksConfig`, `hadoopConf`, `tableFormat`, `workspaceClient`, etc.) are mutable and set via `init()`. This pattern is fragile: calling any method before `init()` would throw a `NullPointerException`.
Consider making these `final` and removing the no-arg constructor, or at least adding a null-check guard in methods that use these fields. The ServiceLoader no-arg constructor + `init()` pattern should be documented with a warning.
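The null-check guard mentioned above could be as simple as the following sketch (names are hypothetical stand-ins for the PR's classes):

```java
/**
 * Hedged sketch (hypothetical names): fail fast with a clear message when a
 * method is called before init(), instead of an opaque NullPointerException.
 */
public class InitGuardSketch {
  // Stand-in for the SDK client that init() is responsible for creating.
  private Object workspaceClient;

  public void init(Object client) {
    this.workspaceClient = client;
  }

  private void ensureInitialized() {
    if (workspaceClient == null) {
      throw new IllegalStateException(
          "Client used before init() was called; the ServiceLoader no-arg "
              + "constructor requires init() before any other method");
    }
  }

  public String describeClient() {
    ensureInitialized(); // guard every public method that relies on init() state
    return workspaceClient.toString();
  }
}
```

An `IllegalStateException` with an explicit message is much easier to diagnose than an NPE deep inside a sync call.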
```java
props.get(CLIENT_SECRET),
      props.get(TOKEN));
  }
}
```
The `token` field is defined and parsed, but the docs say PAT auth is "intentionally not wired" and "not supported yet". If it's not supported, don't add the field; it creates confusion. Remove the `TOKEN` constant and `token` field until PAT auth is actually implemented.
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.xtable</groupId>
```
Pin the Databricks SDK version to a property in the parent pom.xml instead of hardcoding 0.85.0 here. Follow the pattern used for other dependency versions in the project.
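A minimal sketch of the suggested pattern, assuming the parent `pom.xml` manages versions via `<properties>` (the property name `databricks.sdk.version` and the SDK coordinates are assumptions, not taken from this PR):

```xml
<!-- parent pom.xml -->
<properties>
  <databricks.sdk.version>0.85.0</databricks.sdk.version>
</properties>
```

```xml
<!-- xtable-databricks/pom.xml -->
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-sdk-java</artifactId>
  <version>${databricks.sdk.version}</version>
</dependency>
```

Keeping the version in one place makes SDK upgrades a single-line change across modules.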
```java
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
```
The tests have a lot of boilerplate: every test method repeats the same 15-line client construction block. Extract a `createClient(String tableFormat)` helper method to reduce duplication and improve readability.
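The suggested helper could look roughly like this sketch (the config and client types below are simplified stand-ins, not the real XTable classes):

```java
/**
 * Hedged sketch (type names are stand-ins, not the real XTable classes):
 * a shared factory so each test builds its client in one line.
 */
public class ClientFactorySketch {
  /** Stand-in for the external catalog config the real tests construct repeatedly. */
  public static class CatalogConfig {
    public final String catalogId;
    public CatalogConfig(String catalogId) { this.catalogId = catalogId; }
  }

  /** Stand-in for the UC sync client under test. */
  public static class SyncClient {
    public final CatalogConfig config;
    public final String tableFormat;
    public SyncClient(CatalogConfig config, String tableFormat) {
      this.config = config;
      this.tableFormat = tableFormat;
    }
  }

  private static final CatalogConfig SHARED_CONFIG = new CatalogConfig("test-catalog");

  /** The only per-test variation is the table format, so take just that. */
  public static SyncClient createClient(String tableFormat) {
    return new SyncClient(SHARED_CONFIG, tableFormat);
  }
}
```

With this in place, each test body starts with `createClient("ICEBERG")` (or `"DELTA"`, `"HUDI"`) instead of fifteen lines of setup.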
```java
.thenReturn(
    new StatementResponse()
        .setStatus(new StatementStatus().setState(StatementState.SUCCEEDED)));
```
Missing test: `refreshTable` with a matching schema (the no-op path). You test the schema-mismatch path, but not the case where schemas match and no `MSCK REPAIR` is issued. Also missing: `refreshTable` with a null `catalogTable`, and `refreshTable` with a null/empty schema.
```js
'hms',
'glue-catalog',
'unity-catalog',
'databricks-unity-catalog',
```
The sidebar reference was changed from `unity-catalog` to `databricks-unity-catalog`, but the docs file is still named `unity-catalog.md` (not renamed). This will break the sidebar link. Either rename the md file or keep the sidebar reference as `unity-catalog`.
What is the purpose of the pull request
This pull request adds Databricks Unity Catalog (DATABRICKS_UC) catalog sync support to XTable, with a new xtable-databricks module, UC config wiring, a UC sync client implementation, and documentation updates.
Brief change log
Verify this pull request
This change added tests and can be verified as follows: