[SYNPY-1749] Allow quote, apostrophe and ellipsis in store_row_async #1316
Conversation
@danlu1 Is this still WIP, or are you looking for reviews?
@andrewelamb sorry, I should have marked this as a draft.
…ctly when upload data from a dataframe
…ger output json string
The integration test failures are in the recordset and submission modules and do not appear to be related to my changes.
linglp left a comment
I think overall it looks good. The tests can be consolidated a bit to cover all the edge cases in fewer integration tests, which would improve performance, and the docstring should be updated to reflect the new state of the code since json.dumps() was removed. There is also some redundant logic that can be simplified: sample_values is created but never actually used. The function name could also be more descriptive of what it actually does now (sanitizing special values rather than just converting dtypes).
tests/integration/synapseclient/models/async/test_table_async.py
tests/integration/synapseclient/models/synchronous/test_table.py
BryanFauble left a comment
This is looking great, once we get the last few items handled (and develop merged in), I can approve!
BryanFauble left a comment
Nice work on this fix -- the backslash-escaping approach for embedded quotes makes sense, and the Ellipsis/pd.NA handling is solid. I flagged one issue that I think needs to be addressed before merge, plus a few cleanup nits.
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
tests/integration/synapseclient/models/async/test_table_async.py
Hi @danlu1 ! Thanks for your hard work. I think I found two bugs after a few rounds of testing myself:
Bug 1: Nested np.nan not handled
df = pd.DataFrame({"val": [[1.0, np.nan], [2.0, 3.0]]})
The nested np.nan would currently pass through to JSON serialization, which could cause issues since np.nan is not valid JSON.
This happens because convert_dtypes() only converts top-level np.nan to pd.NA; nested np.nan remains unchanged. And even though your code has:
if obj is pd.NA:
But this won't handle np.nan. The fix would just be:
def _reformat_special_values(obj):
if pd.isna(obj): # Catches pd.NA, np.nan, and None
return None
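The reviewer's point can be verified directly: pd.isna catches all three scalar missing markers, while an identity check against pd.NA misses the other two. A minimal check:

```python
import numpy as np
import pandas as pd

# pd.isna covers every scalar "missing" marker pandas may produce
for missing in (None, np.nan, pd.NA):
    assert pd.isna(missing)

# an identity check against pd.NA misses np.nan and None
assert np.nan is not pd.NA
assert None is not pd.NA
```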
Bug 2: Top-level missing values not handled
def _serialize_json_value(x):
if isinstance(x, (list, dict)):
# pd.NA handling is only here, inside _reformat_special_values
...
if x is ...:
return "..."
return x # <-- Top-level pd.NA, np.nan, None just pass through
Top-level pd.NA (or np.nan) isn't converted to None. It only works now because .replace({pd.NA: None}) is called later.
df[col] = (
    df[col].replace({pd.NA: None}).astype(object)
)  # this will convert the int64 and float64 columns to object columns
sample_values = df[col].dropna()
The dropna() + len() check is probably unnecessary.
Since apply() works fine on empty series and the function handles nulls, we could just apply the function directly:
for col in df.columns:
df[col] = df[col].apply(_serialize_json_value)
I tried this change and all the unit tests still pass.
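The claim that apply() works fine on an empty Series can be checked in isolation (a minimal sketch, not the PR's actual code):

```python
import pandas as pd

# applying a function to an empty Series is a no-op, so no length guard is needed
empty = pd.Series([], dtype=object)
result = empty.apply(lambda v: v)
assert result.empty
```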
df[col] = df[col].apply(_serialize_json_value)
# restore the original values of the column especially for the int64 and float64 columns since apply function changes the dtype
df[col] = df[col].convert_dtypes()
df[col] = df[col].replace({pd.NA: None}).astype(object)
the .replace({pd.NA: None}) is doing two things:
- Catching top-level pd.NA that _serialize_json_value missed
- Converting any pd.NA reintroduced by convert_dtypes()
I think it might be cleaner to have _serialize_json_value handle top-level NaN like this:
def _serialize_json_value(x):
if pd.isna(x): # Handle top-level NA
return None
As an example, the pd.NA here in the dataframe is handled by df[col].replace({pd.NA: None}) rather than _serialize_json_value.
df = pd.DataFrame({
"name": ["Alice", pd.NA, "Charlie"],
"age": [25, 30, pd.NA]
})
df = convert_dtypes_to_json_serializable(df)
I am not sure if this is intended.
        }
    ],
}).convert_dtypes()
Do we still need to call .convert_dtypes() here, since convert_dtypes_to_json_serializable already calls convert_dtypes() in one of its steps?
def _reformat_special_values(obj):
    if obj is ...:
        return "..."
    if obj is pd.NA:
I tried:
df = pd.DataFrame({"val": [[1.0, np.nan], [3.0, pd.NA], [None, None]]})
convert_dtypes_to_json_serializable(df)
print(df)
and then I saw:
val
0 [1.0, nan]
1 [3.0, None]
2 [None, None]
As you can see, np.nan remains as nan. Should this be changed to if pd.isna(obj) to catch all the different representations of missing values, including pd.NA, np.nan, and None?
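The behavior described above can be reproduced in isolation: convert_dtypes only promotes top-level scalars, so a raw np.nan nested inside a list cell is left untouched (a minimal sketch with made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": [[1.0, np.nan], [3.0, 2.0]]})
df = df.convert_dtypes()

# the np.nan inside the first list cell is still a raw float nan
nested = df["val"].iloc[0]
assert any(isinstance(v, float) and np.isnan(v) for v in nested)
```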
Problem:
A JSON serialization issue occurs when a DataFrame passed to store_row_async contains a list or dictionary with strings that include both double quotes and apostrophes.
Solution:
"escapechar": "\\"toto_csv_kwargsinstore_rows_asyncto_csv_kwargsto_stream_and_update_from_dfso it can take the passedto_csv_kwargsvalues for downstream data processing.Testing:
Unit test and integration test have been added.