ENG-2635: Support namespaces for Postgres DSR execution#7500
Conversation
Add PostgresNamespaceMeta schema and wire namespace support through PostgresQueryConfig, PostgreSQLConnector, and RDSPostgresConnector. Remove stubbed retrieve_data/mask_data from RDSPostgresConnector so it inherits SQLConnector's implementations, enabling DSR execution. Update RDS Postgres admin UI tags to include "DSR Automation". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
1 Skipped Deployment
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guard against None secrets in NamespaceMetaValidationStep. Update test_validate_unsupported_connection_type to use mariadb (postgres is now a supported namespace type). Add Postgres-specific validation tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Postgres defaults to the public schema when neither namespace_meta nor db_schema is configured. Return empty fallback fields so validation doesn't reject existing postgres connections that lack both. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove extra parentheses around bind parameters in expected query strings to match current SQLAlchemy output format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dry-run query test passes db=None to TaskResources, which flows into query_config() -> get_namespace_meta(). Return None early when no db session is available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a dataset's namespace_meta has no overlapping fields with the connection type's namespace schema (e.g. BigQuery namespace_meta on a Postgres connection), skip validation rather than raising an error. This happens when datasets of mixed types are bulk-linked to a single connection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add GoogleCloudSQLPostgresNamespaceMeta with database_name (optional) and schema (required), following the same pattern as Snowflake/BigQuery. Update GoogleCloudSQLPostgresQueryConfig with generate_table_name() for schema-qualified SQL, and GoogleCloudSQLPostgresConnector to fetch namespace_meta from DB and pass it to the query config. Also includes shared fixes from PR #7500: - Guard against None db session in get_namespace_meta - Guard against None secrets in namespace validation - Skip namespace validation for mismatched connection types Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add GoogleCloudSQLPostgresNamespaceMeta with database_name (optional) and schema (required), following the same pattern as Snowflake/BigQuery. Update GoogleCloudSQLPostgresQueryConfig with generate_table_name() for schema-qualified SQL, and GoogleCloudSQLPostgresConnector to fetch namespace_meta from DB and pass it to the query config. Also includes shared fixes from PR #7500: - Guard against None db session in get_namespace_meta - Guard against None secrets in namespace validation - Skip namespace validation for mismatched connection types Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@greptile please review |
adamsachs
left a comment
There was a problem hiding this comment.
OK, this is getting close - it's great that this is working with a proper RDS integration and i think we're basically using the 'right' mechanisms.
still just a few points that i think could use clean-up, let me know what you think!
| from fides.api.schemas.namespace_meta.namespace_meta import NamespaceMeta | ||
|
|
||
|
|
||
| class SQLNamespaceMeta(NamespaceMeta): |
|
|
||
| connection_type: Literal["rds_postgres"] = "rds_postgres" | ||
| database_instance_id: str | ||
| database_id: str |
There was a problem hiding this comment.
so i imagine we reverted this back to database_id so it'd work well with the existing code in rds_postgres_connector.py...but i do think we should be establishing our more consistent standard here. can we just update the code there to look for database_name on the meta? i don't think that's going to break anything downstream, either here or on the monitor side in fidesplus, i really think that was just stubbed out for future use.
we do still need to update the namespace meta populated by helios monitors if we make this change, but i do think that's a relatively simple adjustment that we should go ahead and make. if we do that, then we should be able to inherit from the new SQLNamespaceMeta here, right? that'll help maintain standardization :)
note: i also wonder while we're here whether renaming the namespace field here to database_instance_name instead of database_instance_id is wise. that just seems more accurate, and if we don't change it now, we're never gonna get to it 😅
There was a problem hiding this comment.
Done — RDSPostgresNamespaceMeta now inherits from SQLNamespaceMeta with standardized fields:
database_name(inherited, replacesdatabase_id)database_instance_name(replacesdatabase_instance_id)schema(inherited, optional)
Updated the connector and tests to match. The fidesplus monitor (rds_postgres_monitor.py) will need a matching update to emit the new field names.
Note: if any RDS Postgres datasets have already been created by Helios, we'll need a data migration to rename the JSON keys in ctl_datasets.fides_meta. Suggest we handle that in a separate PR to keep this one focused.
| from fides.api.schemas.namespace_meta.sql_namespace_meta import SQLNamespaceMeta | ||
|
|
||
|
|
||
| class PostgresNamespaceMeta(SQLNamespaceMeta): |
| """ | ||
| Prepends the schema to the base table name if RDS namespace meta | ||
| is provided. Unlike Postgres, RDS doesn't need database_name | ||
| qualification because the connection targets a specific database. |
There was a problem hiding this comment.
OK this is clear - i see why this is branched off now again. i'll throw this out there as a suggestion, but it may be overkill: we could continue to keep the logic centralized in a single method on the base classPostgresQueryConfig.generate_table_name by having that take an optional arg for whether to include the database name or not.
but i'm ambivalent...that may not even be clearer.
There was a problem hiding this comment.
Agreed it could go either way. I'll keep the current approach with the separate override — it's explicit about why RDS skips database qualification (the engine URL already targets the database), which I think is clearer than a boolean flag on the base class.
| if db is None: | ||
| return None |
There was a problem hiding this comment.
k, got it. i do feel like when app code has specific checks to guard against test-only codepaths it's a bit of a code smell, but i also understand that the problem here may lie deeper in the code organization and it'd take too broad of a refactor to fix, or too much mocking in the tests.
| @@ -15,10 +15,9 @@ class SnowflakeNamespaceMeta(NamespaceMeta): | |||
|
|
|||
| connection_type: Literal["snowflake"] = "snowflake" | |||
| database_name: str | |||
There was a problem hiding this comment.
we should be able to remove this too, right? shouldn't it inherit it from SQLNamespaceMeta?
There was a problem hiding this comment.
SQLNamespaceMeta declares database_name as Optional[str] = None since Postgres doesn't always need it. Snowflake requires it, so the override here tightens the type from optional to required. Added a comment to make this clearer.
| # Skip validation when the namespace_meta connection_type doesn't | ||
| # match this connection. This happens when datasets of mixed types | ||
| # (e.g. BigQuery, Snowflake) are linked to a single connection. |
There was a problem hiding this comment.
is this a valid scenario? i don't really think this can happen...
There was a problem hiding this comment.
You're right — this can't happen in normal app runtime. Namespace metadata is set by the type-specific Helios monitor, and datasets are always linked to a single connection via the scoped endpoint (PATCH /connection/{key}/dataset). There's no mechanism to re-link a dataset across connection types.
Removed the mismatch check and the legacy field-overlap/backfill block. Also removed the 5 tests that covered those code paths.
| if not provided_fields & expected_fields: | ||
| return | ||
| # Fields overlap — backfill connection_type so it's persisted | ||
| namespace_meta["connection_type"] = connection_type |
There was a problem hiding this comment.
i thought we'd be able to get rid of this block if we ensure that all of our namespace meta's have connection_type's, which they now seem to (🙏)?
maybe there's some other reason this is needed? but it just doesn't seem like a valid code path/scenario for normal app runtime. if it's just to make a test pass, i'd recommend we find an alternative, this just feels like too much fragile special-casing to have in our app code. the whole point of the connection_type mechanism is so that we don't have to do this!
i.e. are there really 'legacy' namespace_meta's lurking around...?
There was a problem hiding this comment.
Agreed — removed this entire block in the same change. All namespace metas now have connection_type via the __init_subclass__ registration mechanism, so this heuristic is no longer needed. The alembic migration (4ebe0766021b) already backfilled connection_type for existing datasets.
|
Thanks for the second pass! All points addressed:
|
- RDSPostgresNamespaceMeta: inherit from SQLNamespaceMeta, rename database_id → database_name and database_instance_id → database_instance_name - Remove mismatch check and legacy field-overlap/backfill block from namespace validation (can't happen in normal app runtime) - Add clarifying comment on SnowflakeNamespaceMeta.database_name override - Update connector and tests to use new field names Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Created PRs to start the id -> name migration as well #7589 and https://github.com/ethyca/fidesplus/pull/3203 |
The NamespaceMetaValidationStep was validating dataset namespace metadata against the connection config's type, causing failures when datasets with BigQuery namespace metadata were submitted through a Postgres connection (e.g. in the bulk create test). Now skips validation when the namespace's connection_type doesn't match the connection config's type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
adamsachs
left a comment
There was a problem hiding this comment.
OK, i think this is looking good, nice work! thanks for the iterations to get this in a good spot, i think this is a solid path forward and we've made some real gains towards standardizing our namespace meta construct, which has been needed for a while - so thank you for working through that 🙏
just a couple of non-blocking nits/questions lingering.
| connection_type: Literal["rds_postgres"] = "rds_postgres" | ||
| database_instance_name: str | ||
| database_name: str # type: ignore[assignment] # required, not optional | ||
| schema: Optional[str] = None # type: ignore[assignment] |
There was a problem hiding this comment.
hm, is this really optional for RDS postgres?
There was a problem hiding this comment.
Yes — the RDS Postgres discovery monitor populates the namespace dict at the schema level without including the schema name in the namespace itself (see rds_postgres_monitor.py:164-168 — it only includes connection_type, database_id, and database_instance_id). The schema name lives on the Schema staged resource model instead. Additionally, RDSPostgresQueryConfig.generate_table_name() handles the schema=None case by returning just the bare table name without schema qualification.
There was a problem hiding this comment.
Also added "schema": schema_name to the RDS Postgres monitor's namespace dict in the companion fidesplus PR (ethyca/fidesplus#3203) so the discovered schema is actually available for query qualification.
| if namespace_meta and namespace_meta.get("connection_type"): | ||
| namespace_conn_type = namespace_meta["connection_type"] | ||
| if namespace_conn_type != connection_type: | ||
| return |
There was a problem hiding this comment.
OK, if we're going to include this defensive check, probably worth a warning or at least info log indicating we hit an unexpected scenario?
There was a problem hiding this comment.
Good call — added a warning log for this case.
|
|
||
| connection_type: Literal["rds_postgres"] = "rds_postgres" | ||
| database_instance_name: str | ||
| database_name: str # type: ignore[assignment] # required, not optional |
There was a problem hiding this comment.
ah too bad that we have to do this...i guess we can't make it required on the base class :/
There was a problem hiding this comment.
Yeah — the base SQLNamespaceMeta has database_name as optional since not all SQL connectors require it, but RDS always discovers databases explicitly so it's always present. The type: ignore is the cleanest way to tighten the constraint in the subclass without restructuring the hierarchy.
Co-authored-by: Jade Wibbels <jade@ethyca.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Move database selection from connection secrets to dataset namespace_meta so one connection config can serve multiple databases. The connector now reads database_name from namespace_meta and creates per-database engines. - Add client_for_node(node) to SQLConnector base class (defaults to self.client()); PostgreSQLConnector overrides it to select engine by namespace_meta.database_name - Remove database_name from Postgres SQL generation (generate_table_name now produces schema.table only — DB selection is at connection level) - Cache per-database engines in PostgreSQLConnector._database_engines - Add optional engine param to table_exists for node-aware checking - Update tests to expect schema.table instead of database.schema.table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When client_for_node provides an explicit database name from namespace_meta, always build the URI directly so the override takes effect. The url secret is only used for the default engine. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address PR review feedback: log when dataset namespace_meta has a different connection_type than the connection config, since this defensive check silently skipping validation could mask issues. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add optional engine parameter to match SQLConnector.table_exists superclass signature. No behavior change — Snowflake continues to create its own client for DESC TABLE checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The PostgresColumnQuotingMixin now quotes column names in WHERE clauses (e.g., "email" = ? instead of email = ?). Update test expectations to match the new quoting behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The mixin's format_clause_for_query had a special IN branch that produced
`"col" IN ({operand})` without a `:` prefix, causing SQLAlchemy to treat
the operand as a column reference instead of a bind parameter. This made
`WHERE id IN (id)` always true, returning all rows.
Now delegates to super() for correct operator/operand handling and only
adds double-quoting of the column name.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Jade Wibbels <jade@ethyca.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Ticket ENG-2635
Description Of Changes
Add namespace support for Postgres connectors, enabling a single connection to handle multiple schemas for DSR execution. This follows the established Snowflake/BigQuery pattern exactly.
In cases where 200-300 datasets representing different Postgres schemas. Previously, each schema required a separate integration for DSRs. With namespace support, a single Postgres connection can execute DSRs across all schemas by using fully-qualified table names (
"schema"."table"or"database"."schema"."table").This also removes the stubbed
retrieve_data()/mask_data()fromRDSPostgresConnector, unblocking DSR execution for RDS Postgres connections.PR 1 of 3 — Core Postgres + RDS Postgres namespace support. Google Cloud SQL Postgres (PR 2) and
set_schema()reconciliation (PR 3) follow separately.Code Changes
src/fides/api/schemas/namespace_meta/postgres_namespace_meta.py— NewPostgresNamespaceMetaschema withdatabase_name(optional) andschema(required), auto-registered viaNamespaceMeta.__init_subclass__. Returns empty fallback fields for backward compat (Postgres defaults to public schema).src/fides/api/service/connectors/query_configs/postgres_query_config.py— Addednamespace_meta_schema,generate_table_name()for schema-qualified SQL, updatedget_formatted_query_string()andget_update_stmt()to use itsrc/fides/api/service/connectors/postgres_connector.py—query_config()now fetches namespace_meta from DB, addedget_qualified_table_name()for unquoted names in error handlingsrc/fides/api/service/connectors/rds_postgres_connector.py— Removed stubbedretrieve_data()/mask_data()(inheritsSQLConnector's implementations),query_config()passes namespace_meta, addedget_qualified_table_name()src/fides/api/service/connectors/sql_connector.py— AddedNoneguard inget_namespace_meta()for dry-run contexts where no DB session is availablesrc/fides/service/dataset/validation_steps/namespace_meta.py— AddedNoneguard for connection secrets; skip namespace validation when namespace_meta fields don't overlap with the connection type's schema (cross-type dataset linking)clients/admin-ui/src/features/integrations/integration-type-info/rdsPostgresInfo.tsx— Added "DSR Automation" to RDS Postgres tagstests/ops/service/connectors/test_postgres_query_config.py— New tests mirroring Snowflake test structure: parametrized namespace query generation, invalid meta validation, namespaced update statementstests/service/dataset_service/test_namespace_meta_validation.py— Updated for Postgres namespace support: 5 new Postgres-specific tests, cross-type namespace skip test, updated unsupported type testSteps to Confirm
1. Run namespace query config tests
Expected: All 9 tests pass (5 parametrized query generation, 1 invalid meta, 3 update statements).
2. Run namespace validation tests
Expected: All 15 tests pass, including 5 Postgres-specific and 1 cross-type namespace skip test.
3. Verify backward compatibility (no namespace = no breakage)
Expected: Existing Postgres query config tests pass unchanged — datasets without
namespace_metaproduce unqualified table names as before.4. Verify dataset creation with namespace_meta via API
Create a Postgres connection, then create a dataset with
namespace_meta, and link it:Expected: All return 200/201. The
datasetconfigresponse shows"succeeded": [...]with 1 entry.5. Verify namespace_meta is persisted and returned
Expected: The response includes
"fides_meta": {"namespace": {"schema": "billing"}}in thectl_datasetof the linked dataset config.6. Verify backward compat via API (no namespace_meta required for Postgres)
Expected: Both return 200/201 — no validation error. Postgres connections do NOT require namespace_meta (backward compatible, defaults to public schema).
7. Check RDS Postgres Admin UI tag
Navigate to the Integrations page in the Admin UI and find "RDS Postgres".
Expected: RDS Postgres shows "DSR Automation", "Discovery", and "Detection" tags.
Pre-Merge Checklist
CHANGELOG.mdupdatedmaindowngrade()migration is correct and works🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
Removed
Tests