Summary
Free-threaded Python 3.14t disables the GIL, exposing numerous thread-safety issues throughout the driver that were previously "accidentally safe" under the GIL. This issue tracks all identified problems beyond the shutdown segfault (#717).
The driver extensively uses shared mutable state (dicts, sets, counters) accessed from multiple threads without proper synchronization. Under CPython with the GIL, many of these were benign. Under free-threaded Python, they cause segfaults, data corruption, lost updates, and race conditions.
1. CRITICAL: Load Balancing Policy Counter/Host Races
Files: cassandra/policies.py
All round-robin-based policies have unprotected _position counter increments and unsynchronized reads of host lists during make_query_plan():
- RoundRobinPolicy (lines 190-191): _position read-modify-write without lock; _live_hosts read at line 193 without lock while on_up()/on_down() modify it concurrently
- DCAwareRoundRobinPolicy (lines 279-280): same _position pattern; _dc_live_hosts.get() at line 282 without lock
- RackAwareRoundRobinPolicy (lines 395-396): same pattern
The code even has a comment acknowledging this: "not thread-safe, but we don't care much about lost increments" — this was written assuming GIL protection for the underlying integer object, which no longer holds.
Impact: Duplicate round-robin positions, queries seeing inconsistent host lists, possible crashes in islice(cycle(hosts)) if host set changes mid-iteration.
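A minimal sketch of one possible fix (the class name LockedRoundRobin and its shape are illustrative, not driver code): take the lock only to read-and-advance the position, and have topology callbacks replace an immutable host tuple rather than mutate a list, so the query plan iterates a stable snapshot.

```python
import threading
from itertools import islice, cycle

class LockedRoundRobin:
    def __init__(self, hosts):
        self._lock = threading.Lock()
        self._live_hosts = tuple(hosts)   # immutable; replaced, never mutated
        self._position = 0

    def on_up(self, host):
        with self._lock:
            if host not in self._live_hosts:
                self._live_hosts = self._live_hosts + (host,)

    def make_query_plan(self):
        with self._lock:
            hosts = self._live_hosts      # snapshot taken under the lock
            pos = self._position
            self._position = (pos + 1) % max(len(hosts), 1)
        # iterating the snapshot: concurrent on_up/on_down swap in a new
        # tuple, so this slice can never observe a half-updated host list
        return list(islice(cycle(hosts), pos, pos + len(hosts))) if hosts else []
```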
2. CRITICAL: Connection._requests Dict Race
File: cassandra/connection.py
- Line 1104: self._requests[request_id] = (cb, decoder, result_metadata) — written outside self.lock
- Lines 1029-1030: error_all_requests() snapshots and clears _requests inside the lock, but send_msg() can write to it concurrently without the lock
- Lines 1290-1297: response handling pops from _requests without consistent locking
Impact: Dict corruption, lost requests, segfaults during concurrent dict mutation.
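One way to close this race, sketched with a hypothetical RequestMap helper (not the driver's actual API): every mutation of the request dict goes through the same lock, and the drain path swaps the dict out atomically so a concurrent sender can never write into a map that is being cleared.

```python
import threading

class RequestMap:
    def __init__(self):
        self.lock = threading.Lock()
        self._requests = {}

    def register(self, request_id, handler):
        with self.lock:                   # same lock as the drain path
            self._requests[request_id] = handler

    def pop(self, request_id):
        with self.lock:
            return self._requests.pop(request_id, None)

    def error_all(self):
        with self.lock:
            # atomic swap: late register() calls land in the fresh dict
            drained, self._requests = self._requests, {}
        return drained
```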
3. CRITICAL: Request ID Duplication
File: cassandra/connection.py, lines 1067-1078
get_request_id() documents it must be called with self.lock held, but highest_request_id increment is a plain Python integer read-modify-write. Under free-threaded Python, even if locks are held by callers, the integer object itself is not atomic.
Additionally, request_ids deque is appended to at lines 1296, 1332, 1344 with inconsistent locking (lock is re-acquired after a gap).
Impact: Duplicate stream IDs on the same connection → protocol errors, response routing to wrong callbacks.
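A sketch of lock-consistent stream-id allocation (StreamIds is an illustrative name, not the driver's class): the high-water mark and the free-list live behind one lock for the entire read-modify-write, so two threads can never be handed the same id.

```python
import threading
from collections import deque

class StreamIds:
    def __init__(self, max_id):
        self._lock = threading.Lock()
        self._max_id = max_id
        self._highest = -1                # high-water mark of issued ids
        self._free = deque()              # released ids available for reuse

    def acquire(self):
        with self._lock:                  # whole read-modify-write is atomic
            if self._free:
                return self._free.popleft()
            if self._highest < self._max_id:
                self._highest += 1
                return self._highest
            raise RuntimeError("no stream ids available")

    def release(self, request_id):
        with self._lock:                  # never appended outside the lock
            self._free.append(request_id)
```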
4. HIGH: Session._pools Dict Races
File: cassandra/cluster.py
- Line 3214:
self._pools.get(host)outside lock - Line 3234:
self._pools[host] = new_poolinside lock (but earlier read was outside) - Line 3245:
self._pools.pop(host, None)inremove_pool()without lock - Line 3369:
get_pools()returnsself._pools.values()— a live view, not a snapshot
Impact: Dict corruption during concurrent pool addition/removal, RuntimeError during iteration.
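A minimal sketch of the usual remedy (PoolRegistry is a hypothetical stand-in): make the check-then-insert a single critical section and return a snapshot list rather than a live view, so iteration can never race a concurrent add/remove.

```python
import threading

class PoolRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._pools = {}

    def get_or_create(self, host, factory):
        with self._lock:                  # read and insert in one section
            pool = self._pools.get(host)
            if pool is None:
                pool = self._pools[host] = factory(host)
            return pool

    def remove(self, host):
        with self._lock:
            return self._pools.pop(host, None)

    def get_pools(self):
        with self._lock:
            return list(self._pools.values())   # snapshot, safe to iterate
```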
5. HIGH: HostConnection State Races
File: cassandra/pool.py
- _is_replacing flag (lines 578-580): check-then-act without lock — two threads can both read False, both set True, both submit _replace() → double replacement
- _trash set (lines 582-591): membership check and remove without atomicity → KeyError or double-close
- _connections dict (lines 450, 512-515): read without lock while _replace() modifies it → NoConnectionsAvailable or choice from an empty dict
- _excess_connections set (lines 827-848): size check and add/close without lock
- in_flight counter (line 781): read without lock for comparison → stale value → premature connection close
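The _is_replacing race is the classic check-then-act bug; a hedged sketch of the fix (ReplaceGuard is an illustrative name) turns it into an atomic test-and-set under the pool lock, so exactly one thread wins the right to start a replacement.

```python
import threading

class ReplaceGuard:
    def __init__(self):
        self._lock = threading.Lock()
        self._is_replacing = False

    def try_start_replace(self):
        with self._lock:                  # check and set are one atomic step
            if self._is_replacing:
                return False              # another thread already won
            self._is_replacing = True
            return True

    def finish_replace(self):
        with self._lock:
            self._is_replacing = False
```

The caller submits _replace() only when try_start_replace() returns True, which makes a double replacement impossible by construction.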
6. HIGH: concurrent.py Executor Shared State
File: cassandra/concurrent.py
- _exception (lines 193-194): written from multiple callback threads without lock → lost errors
- _results_queue (line 189): append() without lock while _results() (line 207) sorts/reads it → list corruption
- _exec_depth counter (lines 130, 145): += 1 / -= 1 from multiple threads → wrong recursion depth tracking
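All three fields can share one lock; a minimal sketch (ConcurrentResults is a hypothetical helper, not the executor's real class) funnels every callback-side mutation through it and keeps only the first exception so no error is lost.

```python
import threading

class ConcurrentResults:
    def __init__(self):
        self._lock = threading.Lock()
        self._results = []
        self._exception = None

    def on_result(self, idx, value):
        with self._lock:                  # appends serialized across threads
            self._results.append((idx, value))

    def on_error(self, exc):
        with self._lock:
            if self._exception is None:   # keep the first error, drop later ones
                self._exception = exc

    def results(self):
        with self._lock:                  # sort/read under the same lock
            if self._exception is not None:
                raise self._exception
            return [v for _, v in sorted(self._results)]
```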
7. MEDIUM: Metadata / Token Map Races
File: cassandra/metadata.py
- token_map replacement (lines 311-312): self.token_map = TokenMap(...) without lock while query threads read self.token_map at line 319 → queries route using a partially-built or freed map
- keyspaces dict (lines 208, 223, 231, 238): accessed without locks from both schema refresh (ControlConnection thread) and user queries
- _tablets access (lines 269, 278): drop_tablets() called without synchronization during topology changes
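For the token map, the standard pattern is publish-by-replacement, sketched below with a hypothetical MetadataHolder (the dict stands in for TokenMap(...)): build the new map fully off to the side, publish it with a single attribute store, and have readers take one local reference so they never observe a half-built map.

```python
import threading

class MetadataHolder:
    def __init__(self):
        self._lock = threading.Lock()
        self.token_map = None

    def rebuild_token_map(self, ring):
        new_map = {"ring": tuple(ring)}   # fully built before publishing
        with self._lock:
            self.token_map = new_map      # single atomic reference swap

    def get_replicas(self):
        tm = self.token_map               # read once; use the local reference
        return tm["ring"] if tm else ()
```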
8. MEDIUM: Cluster._prepared_statements WeakValueDictionary
File: cassandra/cluster.py, lines 1448-1449
Writes are locked (_prepared_statement_lock), but reads during query execution may not hold the lock. WeakValueDictionary is not thread-safe — values can be GC'd on another thread during iteration.
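A sketch of symmetric locking (PreparedStatements is an illustrative wrapper): reads take the same lock as writes and copy the value out while holding it, since WeakValueDictionary entries can vanish mid-iteration when the GC runs on another thread.

```python
import threading
import weakref

class PreparedStatements:
    def __init__(self):
        self._lock = threading.Lock()
        self._statements = weakref.WeakValueDictionary()

    def put(self, query_id, statement):
        with self._lock:
            self._statements[query_id] = statement

    def get(self, query_id):
        with self._lock:
            # a strong reference is taken while the lock is held
            return self._statements.get(query_id)
```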
9. MEDIUM: Global _clusters_for_shutdown Set
File: cassandra/cluster.py, lines 243-256
Module-level _clusters_for_shutdown set is modified via add()/discard() without any lock. The atexit handler _shutdown_clusters() calls .copy() but races with concurrent register/unregister.
10. LOW: Session.__del__() Accessing Shared State
File: cassandra/cluster.py, lines 3181-3188
__del__ calls shutdown() which accesses _lock, _pools, is_shutdown. In free-threaded Python, __del__ can run on any thread at any time.
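Since a finalizer's thread cannot be controlled, the usual mitigation is to make shutdown() idempotent behind its own lock; a hedged sketch (SafeSession is a stand-in, not the driver's class):

```python
import threading

class SafeSession:
    def __init__(self):
        self._lock = threading.Lock()
        self.is_shutdown = False
        self._pools = {}

    def shutdown(self):
        with self._lock:
            if self.is_shutdown:          # a second caller becomes a no-op
                return
            self.is_shutdown = True
            pools, self._pools = self._pools, {}   # atomic swap under lock
        for pool in pools.values():       # close outside the lock
            pool.close()

    def __del__(self):
        try:
            self.shutdown()
        except Exception:
            pass                          # never raise from a finalizer
```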
Summary
| # | Severity | Area | Core Problem |
|---|---|---|---|
| 1 | CRITICAL | policies.py | Unprotected counter increment + host list reads in query plan |
| 2 | CRITICAL | connection.py | _requests dict written outside lock |
| 3 | CRITICAL | connection.py | Request ID duplication from non-atomic increment |
| 4 | HIGH | cluster.py | _pools dict concurrent mutation |
| 5 | HIGH | pool.py | _is_replacing, _trash, _connections races |
| 6 | HIGH | concurrent.py | Callback shared state without synchronization |
| 7 | MEDIUM | metadata.py | Token map and keyspaces replaced during reads |
| 8 | MEDIUM | cluster.py | WeakValueDictionary not thread-safe |
| 9 | MEDIUM | cluster.py | Global set mutations without lock |
| 10 | LOW | cluster.py | __del__ on arbitrary thread |
Related
- #717 — Segfault in free-threaded Python 3.14t during cluster shutdown (logging race in Cythonized cluster.so)