Fix pandas bug #290

Merged
knikolla merged 1 commit into nerc-project:main from QuanMPhm:fix/pandas.3
Feb 12, 2026
@QuanMPhm
Contributor

@QuanMPhm QuanMPhm commented Jan 28, 2026

While the cause is not entirely clear, it seems the recent pandas release (3.0.0)
changed how read_csv() casts to pyarrow datatypes, causing an error.
Specifying the pyarrow engine seems to fix the issue.

@QuanMPhm QuanMPhm force-pushed the fix/pandas.3 branch 3 times, most recently from 61bd579 to 7054761 Compare January 28, 2026 18:02
Collaborator

@knikolla knikolla left a comment

I believe the right way to go about this is to explicitly specify the engine.

It seems that the error points to an issue with the default engine ("c"), and switching to "pyarrow" fixes it.

df = pandas.read_csv(
    location,
    engine="pyarrow",
    dtype={INVOICE_COLUMN_COST: pandas.ArrowDtype(pyarrow.decimal128(12, 2))},
)

@QuanMPhm
Contributor Author

@knikolla I see. That makes sense. I'll make the same change in the invoicing code later today. Out of curiosity, how did you arrive at this solution? I didn't realize engine was an option, or a solution to this, at least from the googling I did. The error stack trace referred to deep pandas internals that I didn't look closely into.

@knikolla
Collaborator

@QuanMPhm For obscure internal errors that require digging into documentation, I find something like Gemini to be pretty helpful (50% of the time). I pasted the stack trace and it gave me the following information.

Then I verified that by reading the pandas docs with regards to the engine option and tested the code myself.


The error pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16) is a bit of a "low-level" protest from Arrow.

In short: the pandas C engine (the default) is trying to pass data to the Arrow dtype, but they aren't speaking the same language. The C engine processes the CSV as strings/bytes, and when it hands those bytes to the Arrow decimal handler, the internal byte-length doesn't match what a decimal128 expects.

To fix this, you need to tell pandas to use the Arrow engine for the entire reading process, not just for the final data type.

The Fix: Switch the Engine
Add engine="pyarrow" to your read_csv call. This ensures that PyArrow handles the parsing from the very first byte.
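
To make the length mismatch in that error message more concrete, here is a small stdlib-only sketch. It does not reproduce the pandas failure; it only shows that a decimal128 value occupies 16 bytes while a float64 occupies 8. The helper name `decimal128_bytes` is made up for illustration, the little-endian layout is an assumption about the platform, and whether an 8-byte value is what the C engine actually handed to Arrow is a guess that the stack trace does not confirm.

```python
import struct
from decimal import Decimal

# decimal128(12, 2) keeps two digits after the decimal point.
SCALE = 2


def decimal128_bytes(value: str, scale: int = SCALE) -> bytes:
    """Hypothetical helper: encode a decimal string the way a 128-bit
    decimal is laid out, i.e. as a 16-byte two's-complement integer
    holding the unscaled value (12.34 with scale 2 -> 1234).
    Little-endian byte order is assumed here."""
    unscaled = int(Decimal(value).scaleb(scale))
    return unscaled.to_bytes(16, byteorder="little", signed=True)


cost = "12.34"
as_decimal128 = decimal128_bytes(cost)       # 16 bytes
as_float64 = struct.pack("<d", float(cost))  # 8 bytes

print(len(as_decimal128), len(as_float64))
```

Seen this way, "length 8 (expected 16)" reads as a value with an 8-byte representation arriving where a 16-byte decimal was expected, which is consistent with letting one engine own both the parsing and the typing.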

@knikolla
Collaborator

Going to trial having Copilot help with code reviews.

Contributor

Copilot AI left a comment

Pull request overview

Adjusts invoice CSV ingestion in the daily billable usage management command to avoid a pandas 3.0 dtype-casting regression when reading cost values.

Changes:

  • Forces pandas.read_csv to use the pyarrow engine for invoice CSV parsing.

QuanMPhm added a commit to QuanMPhm/process_csv_report that referenced this pull request Feb 12, 2026
As mentioned by Kristi[1], the better solution to the Pandas
read_csv bug is to specify the engine as "pyarrow", rather
than keeping the loading and casting steps separate.

[1] nerc-project/coldfront-plugin-cloud#290 (review)
@QuanMPhm
Contributor Author

Interesting

@QuanMPhm
Contributor Author

@knikolla I'll let you resolve all the comments and merge.

While the cause is not entirely clear, it seems the recent pandas release (3.0.0)
changed how `read_csv()` casts to pyarrow datatypes, causing an error.
Specifying the `pyarrow` engine seems to fix the issue.

Pinned pandas version to >=3.0, <4.0
@knikolla knikolla merged commit aa758db into nerc-project:main Feb 12, 2026
4 of 7 checks passed
2 participants