fix: cast JSON and nested struct columns to string for anywidget rendering#17189
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for displaying JSON and nested JSON columns in interactive widgets by converting them to strings using BigQuery's TO_JSON_STRING function, addressing a limitation in Arrow/Pandas batch processing. The changes include updates to the DataFrame display logic, HTML representation, and unit tests. A review comment suggests optimizing the transformation loop in dataframe.py by consolidating multiple assign calls into a single operation to improve performance and reduce object creation overhead.
|
|
||
|
|
||
| def test_json_column_converted_to_string_for_display(): | ||
| from bigframes.core.blocks import Block |
There was a problem hiding this comment.
Is it possible to import everything at the top of the file, so that we don't have to repeat these lines in each test?
There was a problem hiding this comment.
Moved all duplicate local imports (Block, DataFrame, JSON_DTYPE, STRING_DTYPE, struct_type, SqlScalarOp, and TableWidget) to the top of test_anywidget.py. Placed them immediately following the pytest.importorskip guards to ensure module import safety.
|
|
||
|
|
||
| def test_repr_mimebundle_head(): | ||
| from unittest.mock import Mock, patch |
There was a problem hiding this comment.
likewise, import at the top of the file?
| @@ -820,8 +820,24 @@ def __repr__(self) -> str: | |||
| ) | |||
|
|
|||
| def _get_display_df_and_blob_cols(self) -> tuple[DataFrame, list[str]]: | |||
There was a problem hiding this comment.
Maybe we can ditch the second return value and just return the display df? This this a private function so it should be safe to change its return type.
The only concern I have is that, if this function is called in many places, updating the callers might be a chore. However, if it only has around 5 callers or so, we can make the change in this PR.
There was a problem hiding this comment.
Simplified _get_display_df_and_blob_cols to _get_display_df which returns only the processed DataFrame. Updated all matching test mocks and calls in both test_anywidget.py and test_html.py to use the new single-return signature.
| def _to_display_df( | ||
| obj: Union[bigframes.dataframe.DataFrame, bigframes.series.Series], | ||
| ) -> bigframes.dataframe.DataFrame: | ||
| from bigframes.series import Series |
There was a problem hiding this comment.
Is it possible to put this import to the top of the file?
| ) -> bigframes.dataframe.DataFrame: | ||
| from bigframes.series import Series | ||
|
|
||
| df = obj.to_frame() if isinstance(obj, Series) else obj |
There was a problem hiding this comment.
Here is another design suggestion for you: instead of performing the type check in the _to_display_df(), define a "_get_display_df_and_blob_cols" method directly on the series.Series class, which basically looks like this:
def _get_display_df_and_blob_cols(self):
return self.to_frame()._get_display_df_and_blob_cols()
There are two benefits:
- You don't need to perform type checks here any more
- You provide better functional parity between the
Dataframeclass and theSeriesclass
There was a problem hiding this comment.
Added a polymorphic _get_display_df(self) method directly to the Series class, implemented as return self.to_frame()._get_display_df(). Leveraged this signature in html.py to simplify _to_display_df(obj) to a clean, type-check-free one-liner: return obj._get_display_df().
| def _to_display_df( | ||
| obj: Union[bigframes.dataframe.DataFrame, bigframes.series.Series], | ||
| ) -> bigframes.dataframe.DataFrame: | ||
| return obj._get_display_df() | ||
|
|
||
|
|
There was a problem hiding this comment.
This function now becomes so simple that we can simply remove it. Just use obj._get_display_df() at its call sites.
There was a problem hiding this comment.
Good catch, it is removed.
This Pull Request resolves visualization crashes when rendering DataFrames or Series containing raw JSON and nested JSON struct structures. It ensures that these columns are safely pre-serialized into clean, flat JSON string format on the database level prior to visual layout rendering.
Verified at: screen/424ojbuqyBPinTb
Fixes #<514763826> 🦕