feat: add more scalar array functions to bigframes.bigquery#17213
feat: add more scalar array functions to bigframes.bigquery#17213tswast wants to merge 9 commits into
bigframes.bigquery#17213Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive set of BigQuery array operations into the global namespace, moving existing functions and adding new ones like array_concat and generate_array. The generator script was updated to handle single-argument tuples, and new unit tests were implemented. Review feedback identifies a syntax error in the generated code where single-element tuples lack trailing commas, leading to potential runtime errors. Additionally, the reviewer suggests consolidating repetitive imports for better maintainability and correcting type mismatches and argument passing in the unit tests to ensure they properly validate the new operations.
| result = array.flatten( | ||
| cast(bpd.Series, scalar_types_df["string_col"]), | ||
| cast(bpd.Series, scalar_types_df["string_col"]), | ||
| ).to_frame() | ||
|
|
||
| snapshot.assert_match(result.sql.rstrip() + "\n", "out.sql") |
There was a problem hiding this comment.
In test_flatten, the depth argument is passed as a positional argument, but the YAML definition specifies it as keyword_only: true. Additionally, passing string_col for depth is a type mismatch as it expects an integer (i64). This ensures the test aligns with the expected fail-fast behavior for unsupported types.
| result = array.flatten( | |
| cast(bpd.Series, scalar_types_df["string_col"]), | |
| cast(bpd.Series, scalar_types_df["string_col"]), | |
| ).to_frame() | |
| snapshot.assert_match(result.sql.rstrip() + "\n", "out.sql") | |
| def test_flatten(scalar_types_df: bpd.DataFrame, snapshot): | |
| result = array.flatten( | |
| cast(bpd.Series, scalar_types_df["string_col"]), | |
| depth=1, | |
| ).to_frame() | |
| snapshot.assert_match(result.sql.rstrip() + "\n", "out.sql") |
References
- When a function receives parameters of an unsupported type, it should raise an error (e.g., ProgrammingError) instead of silently returning empty values. This ensures fail-fast behavior and prevents potential issues with missing parameter values in database operations.
| def test_generate_array(scalar_types_df: bpd.DataFrame, snapshot): | ||
| result = array.generate_array( | ||
| cast(bpd.Series, scalar_types_df["string_col"]), | ||
| cast(bpd.Series, scalar_types_df["string_col"]), | ||
| cast(bpd.Series, scalar_types_df["string_col"]), | ||
| ).to_frame() | ||
|
|
||
| snapshot.assert_match(result.sql.rstrip() + "\n", "out.sql") |
There was a problem hiding this comment.
In test_generate_array, string_col is used for the start, end, and step expressions. GENERATE_ARRAY in BigQuery only supports numeric types (INT64, FLOAT64, etc.). The test should use appropriate numeric columns to align with the expected fail-fast behavior for unsupported types.
| def test_generate_array(scalar_types_df: bpd.DataFrame, snapshot): | |
| result = array.generate_array( | |
| cast(bpd.Series, scalar_types_df["string_col"]), | |
| cast(bpd.Series, scalar_types_df["string_col"]), | |
| cast(bpd.Series, scalar_types_df["string_col"]), | |
| ).to_frame() | |
| snapshot.assert_match(result.sql.rstrip() + "\n", "out.sql") | |
| def test_generate_array(scalar_types_df: bpd.DataFrame, snapshot): | |
| result = array.generate_array( | |
| cast(bpd.Series, scalar_types_df["int64_col"]), | |
| cast(bpd.Series, scalar_types_df["int64_col"]), | |
| cast(bpd.Series, scalar_types_df["int64_col"]), | |
| ).to_frame() | |
| snapshot.assert_match(result.sql.rstrip() + "\n", "out.sql") |
References
- When a function receives parameters of an unsupported type, it should raise an error (e.g., ProgrammingError) instead of silently returning empty values. This ensures fail-fast behavior and prevents potential issues with missing parameter values in database operations.
|
|
||
| class Flatten(Func): | ||
| pass | ||
| arg_types = {"this": True, "depth": False} |
There was a problem hiding this comment.
I wonder why it made this change?
There was a problem hiding this comment.
Jetski says "Enhanced SQLGlot ( third_party/bigframes_vendored/sqlglot/expressions.py ):
• Fixed a ValueError during test_flatten by updating the vendored Flatten class definitions. It now explicitly supports the optional depth parameter ( arg_types = {"this": True,
"depth": False} ), matching BigQuery's actual FLATTEN function signature.
"
| _ARRAY_CONCAT_OP = googlesql.GoogleSqlScalarOp( | ||
| "ARRAY_CONCAT", | ||
| args=(googlesql.ArgSpec(), googlesql.ArgSpec()), | ||
| signature=lambda *args: None, | ||
| ) |
There was a problem hiding this comment.
Could we place the GoogleSqlScalarOp defs somewhere public/central so that we can also use them in the compiler. My intention with the GoogleSqlScalarOp is that it can be the single definition of a google sql function for both internal and user-facing purposes.
| SELECT | ||
| `rowindex`, | ||
| ARRAY_CONCAT(`string_col`, `string_col`) AS `0` | ||
| FROM `bigframes-dev`.`sqlglot_test`.`scalar_types` AS `bft_0` |
There was a problem hiding this comment.
These start to be a bit silly now that they all exercise the same exact emitter code eh?
🦕