61bd579 to 7054761
src/coldfront_plugin_cloud/management/commands/fetch_daily_billable_usage.py
knikolla
left a comment
I believe the right way to go about this is to explicitly specify the engine.
It seems that the error points to an issue with the default `c` engine, and switching to `pyarrow` fixes it.
df = pandas.read_csv(
    location,
    engine="pyarrow",
    dtype={INVOICE_COLUMN_COST: pandas.ArrowDtype(pyarrow.decimal128(12, 2))},
)
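As a side note on the dtype in the snippet above: `decimal128(12, 2)` asks for exact decimal values with at most 12 significant digits, 2 of them after the decimal point. The sketch below expresses the same parsing intent with only the standard library. It is not how pandas or pyarrow implement it, and the column name `Cost`, the sample rows, and the helper names are hypothetical stand-ins rather than code from this plugin. Rejecting over-precise values is this sketch's own choice.

```python
import csv
import io
from decimal import Decimal

# Hypothetical stand-in for the plugin's INVOICE_COLUMN_COST constant.
INVOICE_COLUMN_COST = "Cost"

# decimal128(12, 2): at most 12 significant digits, 2 of them fractional.
PRECISION = 12
CENT = Decimal("0.01")


def parse_cost(text):
    """Parse a cost string exactly; reject values that do not fit decimal(12, 2)."""
    value = Decimal(text)
    quantized = value.quantize(CENT)
    if quantized != value:
        raise ValueError(f"{text!r} has more than 2 fractional digits")
    if len(quantized.as_tuple().digits) > PRECISION:
        raise ValueError(f"{text!r} exceeds {PRECISION} total digits")
    return quantized


def read_costs(csv_text):
    """Return the cost column of an invoice CSV as exact Decimal values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [parse_cost(row[INVOICE_COLUMN_COST]) for row in reader]


# Hypothetical sample invoice rows.
SAMPLE_CSV = "Project,Cost\nalpha,12.34\nbeta,0.10\n"
costs = read_costs(SAMPLE_CSV)
total = sum(costs, Decimal(0))
```

Summing `Decimal` values keeps cents exact, which is the usual reason to prefer a decimal type over floating point for invoice costs.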
@knikolla I see. That makes sense. I'll make the change too on the invoicing code later today. Out of curiosity, how did you arrive at this solution? I didn't realize
@QuanMPhm For obscure internal errors that require digging into documentation, I find something like Gemini to be pretty helpful (50% of the time). I pasted the stack trace and it gave me the following information. Then I verified that by reading the pandas docs with regards to the

The error `pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16)` is a bit of a "low-level" protest from Arrow. In short: the pandas C engine (the default) is trying to pass data to the Arrow dtype, but they aren't speaking the same language. The C engine processes the CSV as strings/bytes, and when it hands those bytes to the Arrow decimal handler, the internal byte-length doesn't match what a `decimal128` expects.

To fix this, you need to tell pandas to use the Arrow engine for the entire reading process, not just for the final data type.

The Fix: Switch the Engine
7054761 to 326dd45
Going to trial having Copilot help with code reviews.
Pull request overview
Adjusts invoice CSV ingestion in the daily billable usage management command to avoid a pandas 3.0 dtype-casting regression when reading cost values.
Changes:
- Forces `pandas.read_csv` to use the `pyarrow` engine for invoice CSV parsing.
As mentioned by Kristi[1], the better solution to the Pandas read_csv bug is to specify the engine as "pyarrow", rather than keeping the loading and casting steps separate. [1] nerc-project/coldfront-plugin-cloud#290 (review)
Interesting
@knikolla I'll let you resolve all the comments and merge.
While not entirely clear, it seems the recent pandas release (3.0.0) changed how `read_csv()` casts to pyarrow data types, causing an error. Specifying the `pyarrow` engine seems to fix the issue.

Pinned pandas version to >=3.0, <4.0
326dd45 to b508c66