
Free-threaded Python 3.14t: comprehensive thread-safety audit #718

@dkropachev


Summary

Free-threaded Python 3.14t disables the GIL, exposing numerous thread-safety issues throughout the driver that were previously "accidentally safe" under the GIL. This issue tracks all identified problems beyond the shutdown segfault (#717).

The driver makes extensive use of shared mutable state (dicts, sets, counters) that is accessed from multiple threads without proper synchronization. Under GIL-enabled CPython, many of these accesses were benign. Under free-threaded Python, they can cause segfaults, data corruption, lost updates, and race conditions.


1. CRITICAL: Load Balancing Policy Counter/Host Races

Files: cassandra/policies.py

All round-robin-based policies have unprotected _position counter increments and unsynchronized reads of host lists during make_query_plan():

  • RoundRobinPolicy (lines 190-191): _position read-modify-write without lock; _live_hosts read at line 193 without lock while on_up()/on_down() modify it concurrently
  • DCAwareRoundRobinPolicy (lines 279-280): same _position pattern; _dc_live_hosts.get() at line 282 without lock
  • RackAwareRoundRobinPolicy (lines 395-396): same pattern

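The code even has a comment acknowledging this: "not thread-safe, but we don't care much about lost increments". That trade-off was accepted under the GIL. Python integers are immutable, so the counter object itself cannot be corrupted, but without the GIL the read-modify-write on _position runs truly in parallel, so lost increments and duplicate positions are likely to occur more often.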

Impact: Duplicate round-robin positions, queries seeing inconsistent host lists, possible crashes in islice(cycle(hosts)) if host set changes mid-iteration.
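
One possible shape for a fix, as a minimal sketch rather than the driver's actual code: take the counter value and a host snapshot inside one critical section, then build the plan from the local snapshot with no lock held. The class name RoundRobinSketch and the use of an immutable tuple for _live_hosts are assumptions made for illustration.

```python
import threading
from itertools import cycle, islice


class RoundRobinSketch:
    """Hypothetical stand-in for a round-robin policy; not the driver's class."""

    def __init__(self, hosts):
        self._lock = threading.Lock()
        self._position = 0
        # Immutable snapshot that is replaced wholesale on membership changes.
        self._live_hosts = tuple(hosts)

    def on_up(self, host):
        with self._lock:
            if host not in self._live_hosts:
                self._live_hosts = self._live_hosts + (host,)

    def on_down(self, host):
        with self._lock:
            self._live_hosts = tuple(h for h in self._live_hosts if h != host)

    def make_query_plan(self):
        # Read the counter and the host snapshot in one critical section,
        # then iterate over the local snapshot without holding the lock.
        with self._lock:
            pos = self._position
            self._position += 1
            hosts = self._live_hosts
        if not hosts:
            return []
        start = pos % len(hosts)
        return list(islice(cycle(hosts), start, start + len(hosts)))
```

Because the plan is built from a tuple that is never mutated, a concurrent on_up()/on_down() cannot change the sequence under islice(cycle(...)).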


2. CRITICAL: Connection._requests Dict Race

File: cassandra/connection.py

  • Line 1104: self._requests[request_id] = (cb, decoder, result_metadata) — written outside self.lock
  • Lines 1029-1030: error_all_requests() snapshots and clears _requests inside lock, but send_msg() can write to it concurrently without the lock
  • Lines 1290-1297: Response handling pops from _requests without consistent locking

Impact: Dict corruption, lost requests, segfaults during concurrent dict mutation.
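
A minimal sketch of consistent locking around a request table, assuming a single lock guards every read and write. RequestTableSketch and its method names register() and pop() are hypothetical; only _requests, lock, and error_all_requests() come from the description above.

```python
import threading


class RequestTableSketch:
    """Hypothetical request table; every access goes through self.lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self._requests = {}

    def register(self, request_id, cb):
        # Write path: the counterpart of the unlocked write in send_msg().
        with self.lock:
            self._requests[request_id] = cb

    def pop(self, request_id):
        # Response path: remove the entry under the same lock.
        with self.lock:
            return self._requests.pop(request_id, None)

    def error_all_requests(self, exc):
        # Swap the dict out under the lock, then run callbacks outside it
        # so a slow callback cannot block other threads.
        with self.lock:
            pending, self._requests = self._requests, {}
        for cb in pending.values():
            cb(exc)
        return len(pending)
```

Swapping in a fresh dict means a register() that races with error_all_requests() lands either in the snapshot or in the new dict, never in a half-cleared one.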


3. CRITICAL: Request ID Duplication

File: cassandra/connection.py, lines 1067-1078

get_request_id() documents that it must be called with self.lock held, but the highest_request_id increment is a plain Python integer read-modify-write. A lock held consistently by every caller does make that increment safe, under free-threaded Python as well; the hazard is any call path that reaches get_request_id() without the lock, which can hand out the same ID twice.

Additionally, the request_ids deque is appended to at lines 1296, 1332, 1344 with inconsistent locking (the lock is re-acquired after a gap).

Impact: Duplicate stream IDs on the same connection → protocol errors, response routing to wrong callbacks.
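
One way to remove the dependence on callers holding the lock is to have the allocator take it itself. The following is a sketch under that assumption; StreamIdAllocatorSketch, acquire(), and release() are hypothetical names, and the reuse-before-increment order is an illustrative choice, not a statement about the driver.

```python
import threading
from collections import deque


class StreamIdAllocatorSketch:
    """Hypothetical allocator that owns its lock instead of trusting callers."""

    def __init__(self, max_request_id):
        self._lock = threading.Lock()
        self._max_request_id = max_request_id
        self._highest_request_id = -1
        self._free_ids = deque()

    def acquire(self):
        with self._lock:
            # Prefer a released ID; otherwise mint the next one.
            if self._free_ids:
                return self._free_ids.popleft()
            if self._highest_request_id >= self._max_request_id:
                raise RuntimeError("no stream IDs available")
            self._highest_request_id += 1
            return self._highest_request_id

    def release(self, request_id):
        # Returning an ID uses the same lock as handing one out.
        with self._lock:
            self._free_ids.append(request_id)
```

With both paths under one lock, no ID can be held by two requests at once, regardless of which thread releases it.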


4. HIGH: Session._pools Dict Races

File: cassandra/cluster.py

  • Line 3214: self._pools.get(host) outside lock
  • Line 3234: self._pools[host] = new_pool inside lock (but earlier read was outside)
  • Line 3245: self._pools.pop(host, None) in remove_pool() without lock
  • Line 3369: get_pools() returns self._pools.values() — a live view, not a snapshot

Impact: Dict corruption during concurrent pool addition/removal, RuntimeError during iteration.
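
A sketch of how the pool map could be guarded, assuming one lock covers the read, the write, and the removal, and that get_pools() returns a snapshot instead of a live view. PoolRegistrySketch and add_or_replace() are hypothetical names.

```python
import threading


class PoolRegistrySketch:
    """Hypothetical pool registry; not the driver's Session class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pools = {}

    def add_or_replace(self, host, new_pool):
        # Read the previous pool and install the new one in one critical
        # section, so no other thread can slip in between the two steps.
        with self._lock:
            previous = self._pools.get(host)
            self._pools[host] = new_pool
        return previous  # the caller can shut this down outside the lock

    def remove_pool(self, host):
        with self._lock:
            return self._pools.pop(host, None)

    def get_pools(self):
        # Return a snapshot: callers can iterate it while pools are
        # added or removed concurrently.
        with self._lock:
            return tuple(self._pools.values())
```

Iterating over the returned tuple cannot raise RuntimeError for a dict that changes size, at the cost of possibly seeing a pool that was removed a moment earlier.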


5. HIGH: HostConnection State Races

File: cassandra/pool.py

  • _is_replacing flag (lines 578-580): check-then-act without lock — two threads can both read False, both set True, both submit _replace() → double replacement
  • _trash set (lines 582-591): membership check and remove without atomicity → KeyError or double-close
  • _connections dict (lines 450, 512-515): read without lock while _replace() modifies it → NoConnectionsAvailable or choice from empty dict
  • _excess_connections set (lines 827-848): size check and add/close without lock
  • in_flight counter (line 781): read without lock for comparison → stale value → premature connection close
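
The _is_replacing check-then-act race can be closed by making the test and the set a single step under a lock. This is a minimal sketch; ReplaceGuardSketch, maybe_replace(), and the submit callable are hypothetical and stand in for however the pool schedules _replace().

```python
import threading


class ReplaceGuardSketch:
    """Hypothetical guard ensuring only one replacement is scheduled."""

    def __init__(self, submit):
        self._lock = threading.Lock()
        self._is_replacing = False
        self._submit = submit  # assumed to schedule a callable on an executor

    def maybe_replace(self):
        # Test and set under one lock: exactly one caller wins.
        with self._lock:
            if self._is_replacing:
                return False
            self._is_replacing = True
        self._submit(self._replace)
        return True

    def _replace(self):
        try:
            pass  # the reconnect work would go here
        finally:
            # Always clear the flag, even if the reconnect raises.
            with self._lock:
                self._is_replacing = False
```

The submit call happens outside the lock so that an executor which runs the task inline cannot deadlock on the non-reentrant lock.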

6. HIGH: concurrent.py Executor Shared State

File: cassandra/concurrent.py

  • _exception (lines 193-194): written from multiple callback threads without lock → lost errors
  • _results_queue (line 189): append() without lock while _results() (line 207) sorts/reads it → list corruption
  • _exec_depth counter (lines 130, 145): += 1 / -= 1 from multiple threads → wrong recursion depth tracking
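
For the callback-side state, one option is a single lock around both the first-error slot and the results list, with sorting done on a private copy. The sketch below assumes that design; ResultCollectorSketch and its method names are hypothetical.

```python
import threading


class ResultCollectorSketch:
    """Hypothetical collector for results arriving on callback threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._exception = None
        self._results = []

    def on_success(self, idx, result):
        with self._lock:
            self._results.append((idx, True, result))

    def on_error(self, idx, exc):
        with self._lock:
            # Keep the first error; later ones are still recorded per index.
            if self._exception is None:
                self._exception = exc
            self._results.append((idx, False, exc))

    def results(self):
        # Copy under the lock, then sort the private copy outside it.
        with self._lock:
            snapshot = list(self._results)
            first_error = self._exception
        snapshot.sort(key=lambda item: item[0])
        return first_error, [(ok, value) for _, ok, value in snapshot]
```

Sorting a copy means a callback that appends during results() never observes, or disturbs, a list that is mid-sort.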

7. MEDIUM: Metadata / Token Map Races

File: cassandra/metadata.py

  • token_map replacement (lines 311-312): self.token_map = TokenMap(...) without lock while query threads read self.token_map at line 319 → queries route using partially-built or freed map
  • keyspaces dict (lines 208, 223, 231, 238): accessed without locks from both schema refresh (ControlConnection thread) and user queries
  • _tablets access (lines 269, 278): drop_tablets() called without synchronization during topology changes
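
A common pattern for a replaced-wholesale structure such as token_map is to build the new value completely in a local variable, publish it with one assignment under a lock, and have readers copy the reference once before using it. The sketch below illustrates that pattern with a plain dict standing in for TokenMap; MetadataSketch, rebuild_token_map(), and get_replicas() as written here are hypothetical.

```python
import threading


class MetadataSketch:
    """Hypothetical metadata holder; a dict stands in for TokenMap."""

    def __init__(self):
        self._lock = threading.Lock()
        self.token_map = None

    def rebuild_token_map(self, ring):
        # Build the replacement fully before any reader can see it.
        new_map = {token: tuple(hosts) for token, hosts in ring.items()}
        with self._lock:
            self.token_map = new_map  # publish in one step

    def get_replicas(self, token):
        # Take one reference under the lock and use only that reference,
        # so a concurrent rebuild cannot switch maps mid-lookup.
        with self._lock:
            token_map = self.token_map
        if token_map is None:
            return []
        return list(token_map.get(token, ()))
```

A reader holding the old reference keeps the old map alive until it finishes, so the lookup always runs against one complete map.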

8. MEDIUM: Cluster._prepared_statements WeakValueDictionary

File: cassandra/cluster.py, lines 1448-1449

Writes are locked (_prepared_statement_lock), but reads during query execution may not hold the lock. WeakValueDictionary is not thread-safe — values can be GC'd on another thread during iteration.
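
A sketch of reads that take the same lock as writes and return strong references, so the entry cannot disappear while the caller is using it. PreparedCacheSketch and PreparedStatementSketch are hypothetical; only _prepared_statements and _prepared_statement_lock are names from the description above.

```python
import threading
import weakref


class PreparedStatementSketch:
    """Hypothetical statement object; plain classes support weak references."""

    def __init__(self, query_id):
        self.query_id = query_id


class PreparedCacheSketch:
    def __init__(self):
        self._prepared_statement_lock = threading.Lock()
        self._prepared_statements = weakref.WeakValueDictionary()

    def add(self, stmt):
        with self._prepared_statement_lock:
            self._prepared_statements[stmt.query_id] = stmt

    def get(self, query_id):
        # The returned value is a strong reference, so the statement stays
        # alive for as long as the caller holds it.
        with self._prepared_statement_lock:
            return self._prepared_statements.get(query_id)

    def snapshot(self):
        # Materialize strong references under the lock before iterating.
        with self._prepared_statement_lock:
            return list(self._prepared_statements.values())
```

Callers iterate over the list returned by snapshot() instead of the dictionary itself, which avoids iterating a mapping whose entries can vanish.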


9. MEDIUM: Global _clusters_for_shutdown Set

File: cassandra/cluster.py, lines 243-256

Module-level _clusters_for_shutdown set is modified via add()/discard() without any lock. The atexit handler _shutdown_clusters() calls .copy() but races with concurrent register/unregister.


10. LOW: Session.__del__() Accessing Shared State

File: cassandra/cluster.py, lines 3181-3188

__del__ calls shutdown() which accesses _lock, _pools, is_shutdown. In free-threaded Python, __del__ can run on any thread at any time.


Summary

| # | Severity | Area | Core Problem |
|---|----------|------|--------------|
| 1 | CRITICAL | policies.py | Unprotected counter increment + host list reads in query plan |
| 2 | CRITICAL | connection.py | `_requests` dict written outside lock |
| 3 | CRITICAL | connection.py | Request ID duplication from non-atomic increment |
| 4 | HIGH | cluster.py | `_pools` dict concurrent mutation |
| 5 | HIGH | pool.py | `_is_replacing`, `_trash`, `_connections` races |
| 6 | HIGH | concurrent.py | Callback shared state without synchronization |
| 7 | MEDIUM | metadata.py | Token map and keyspaces replaced during reads |
| 8 | MEDIUM | cluster.py | WeakValueDictionary not thread-safe |
| 9 | MEDIUM | cluster.py | Global set mutations without lock |
| 10 | LOW | cluster.py | `__del__` on arbitrary thread |
