Conversation
Introduce `LinearOperator<V>` trait with `dot`, `dot_transpose`, `dot_many`, `dot_chunk`, `dot_transpose_chunk`, `trace`, and `onenorm` methods. Add `QMatrixOperator` adapter that wraps `&QMatrix<M,I,C>` with baked-in coefficients. Refactor `algorithm`, `norm_est`, and `params` to be generic over `Op: LinearOperator<V>`, removing all `&QMatrix + &[coeffs]` parameters from the Taylor-series core. Update high-level API functions in `mod.rs` and `Hamiltonian::expm_dot`/`expm_dot_many` in `ham.rs` to construct a `QMatrixOperator` at the call site. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parallelize single vector expm_multiply
Pull request overview
This PR refactors the matrix-exponential (expm) subsystem to operate over a new LinearOperator abstraction (instead of being hard-wired to QMatrix + coeffs) and introduces a persistent-thread parallel implementation for single-vector expm_multiply dispatch.
Changes:
- Introduce `LinearOperator` + `QMatrixOperator` adapter and refactor `expm` parameter selection, norm estimation, and algorithms to use the trait.
- Add `expm_multiply_par` and auto-dispatch to the parallel path above a size threshold; add atomic-accumulation utilities to support chunked transpose operations.
- Update benchmarking script and dependency lockfile.
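The shape of this abstraction can be illustrated with a small duck-typed sketch (Python, with hypothetical names — not the crate's Rust API): any object exposing `dot`/`dot_transpose`/`trace` can be handed to generic code that was previously hard-wired to a concrete matrix type, which is exactly the role `QMatrixOperator` plays for `QMatrix` + coefficients.

```python
class DenseOperator:
    """Toy adapter: wraps a dense matrix (list of rows) behind a minimal
    operator interface, analogous to wrapping QMatrix in QMatrixOperator."""

    def __init__(self, rows):
        self.rows = rows

    def dot(self, v):
        # y = A @ v
        return [sum(a * x for a, x in zip(row, v)) for row in self.rows]

    def dot_transpose(self, v):
        # y = A^T @ v
        n = len(self.rows[0])
        return [sum(self.rows[i][j] * v[i] for i in range(len(self.rows)))
                for j in range(n)]

    def trace(self):
        return sum(self.rows[i][i] for i in range(len(self.rows)))


def apply_power(op, v, k):
    """Generic routine: only touches the operator interface, never the matrix."""
    for _ in range(k):
        v = op.dot(v)
    return v


A = DenseOperator([[0.0, 1.0], [1.0, 0.0]])  # Pauli-X as a dense matrix
```

Applying `A` twice returns the input vector, since X² = I; the point is that `apply_power` never mentions `DenseOperator`.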
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `uv.lock` | Updates pinned Python toolchain/dev dependency versions. |
| `scripts/benchmark.py` | Expands benchmark script usage of new Rust bindings / APIs. |
| `crates/quspin-core/src/hamiltonian/ham.rs` | Wraps `QMatrix` + evaluated coeffs in `QMatrixOperator` before calling `expm_*`. |
| `crates/quspin-core/src/expm/mod.rs` | Exposes new operator abstraction and parallel entry points; updates auto APIs to use `ArrayViewMut1`. |
| `crates/quspin-core/src/expm/linear_operator.rs` | Adds `LinearOperator` trait and `QMatrixOperator` implementation, including chunked methods. |
| `crates/quspin-core/src/expm/compute.rs` | Adds `AtomicAccum` and concrete atomic accumulators; extends `ExpmComputation` with an associated atomic type. |
| `crates/quspin-core/src/expm/algorithm.rs` | Refactors core algorithms to use `LinearOperator` and adds persistent-thread parallel `expm_multiply_par`. |
| `crates/quspin-core/src/expm/params.rs` | Refactors adaptive parameter selection to operate on `LinearOperator`; updates tests accordingly. |
| `crates/quspin-core/src/expm/norm_est.rs` | Refactors 1-norm estimator to use `LinearOperator` for matvecs. |
scripts/benchmark.py
Outdated
```diff
 def _make_operator(L: int) -> PauliOperator:
     """Single-particle XX+YY hopping on L sites with periodic boundary conditions."""
     zz_bonds = [[1.0, i, (i + 1) % L] for i in range(L)]
     x_bonds = [[1.0, i] for i in range(L)]
-    return PauliOperator([("zz", zz_bonds)], [("x", x_bonds)])
+    return PauliOperator([("xx", zz_bonds)], [("z", x_bonds)])
```
_make_operator docstring says this builds an "XX+YY hopping" operator, but the returned PauliOperator uses ("xx", zz_bonds) and ("z", x_bonds). This mismatch makes the benchmark misleading and likely constructs the wrong Hamiltonian. Please update the operator terms (and/or variable names) to match the intended XX+YY model.
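For comparison, a bond construction that matches the docstring could look like the following (a sketch: the `[coupling, site_i, site_j]` bond format is taken from the snippet above, but the names `xx_bonds`/`yy_bonds` and the term labels are assumptions about the intended model):

```python
def make_xx_yy_bonds(L: int):
    """Bond lists for an XX+YY hopping chain with periodic boundary conditions."""
    # Each two-site bond is [coupling, site_i, site_j]; (i + 1) % L closes the ring.
    xx_bonds = [[1.0, i, (i + 1) % L] for i in range(L)]
    yy_bonds = [[1.0, i, (i + 1) % L] for i in range(L)]
    return xx_bonds, yy_bonds


xx_bonds, yy_bonds = make_xx_yy_bonds(4)
# The final bond wraps site L-1 back to site 0: [1.0, 3, 0].
```

These would then presumably be passed as `[("xx", xx_bonds), ("yy", yy_bonds)]` (assuming the same `PauliOperator` call shape as the snippet), so that docstring and operator agree.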
scripts/benchmark.py
Outdated
```python
ham = Hamiltonian(Q, [Static(), Hx])
input = np.zeros(basis.size, dtype=np.complex128)
print(basis.index("0000000000"))
input[basis.index("0000000000")] = 1.0
output = np.zeros_like(input)
```
basis.index("0000000000") hard-codes a 10-site bitstring, but L is set to 4 above, so this will error/panic for most runs. Consider using "0" * L (or the chosen seed) consistently when indexing and initializing the state vector.
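A minimal sketch of that suggestion — deriving the seed bitstring from `L` so it cannot drift out of sync with the system size (pure string logic; the `basis` object itself comes from the script above):

```python
L = 4  # matches the system size chosen earlier in the script

# Derive the all-zeros seed from L instead of hard-coding a 10-character
# literal; the index lookup then stays valid for any choice of L.
seed = "0" * L

# Usage in the script would be (basis assumed from the snippet above):
#   state[basis.index(seed)] = 1.0
```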
scripts/benchmark.py
Outdated
```python
input = np.zeros(basis.size, dtype=np.complex128)
print(basis.index("0000000000"))
input[basis.index("0000000000")] = 1.0
output = np.zeros_like(input)

# TODO:
# * see if we can merge `dot` and `dot_many` into a single python biniding.
# * expose dot_transpose in the same way.
# * Try to make output and overwrite optional
#
ham.dot_many(0.5, input.reshape((-1, 1)), output.reshape((-1, 1)), True)

# TODO: Fix this this interface, should have input and output arrays
a = 0.01j
print(input)
ham.expm_dot(0.0, a, input1 := input.copy())
```
input is used as a variable name here, which shadows Python’s built-in input() function. Renaming it (e.g., psi0/state_in) would avoid confusion when extending this script.
Suggested change:
```diff
-input = np.zeros(basis.size, dtype=np.complex128)
-print(basis.index("0000000000"))
-input[basis.index("0000000000")] = 1.0
-output = np.zeros_like(input)
-# TODO:
-# * see if we can merge `dot` and `dot_many` into a single python biniding.
-# * expose dot_transpose in the same way.
-# * Try to make output and overwrite optional
-#
-ham.dot_many(0.5, input.reshape((-1, 1)), output.reshape((-1, 1)), True)
-# TODO: Fix this this interface, should have input and output arrays
-a = 0.01j
-print(input)
-ham.expm_dot(0.0, a, input1 := input.copy())
+state_in = np.zeros(basis.size, dtype=np.complex128)
+print(basis.index("0000000000"))
+state_in[basis.index("0000000000")] = 1.0
+output = np.zeros_like(state_in)
+# TODO:
+# * see if we can merge `dot` and `dot_many` into a single python biniding.
+# * expose dot_transpose in the same way.
+# * Try to make output and overwrite optional
+#
+ham.dot_many(0.5, state_in.reshape((-1, 1)), output.reshape((-1, 1)), True)
+# TODO: Fix this this interface, should have input and output arrays
+a = 0.01j
+print(state_in)
+ham.expm_dot(0.0, a, input1 := state_in.copy())
```
```rust
// tmp and f are split into per-thread chunks using safe Rust: each thread
// receives exclusive ownership of its contiguous slice (tmp) or strided
// sub-view (f) with no aliasing.
let tmp_chunks: Vec<&mut [V]> = tmp_buf.chunks_mut(chunk).collect();

let mut f_parts: Vec<ArrayViewMut1<'_, V>> = Vec::with_capacity(n_threads);
{
    let mut remaining = f;
    for _ in 0..n_threads - 1 {
        let (lo, hi) = remaining.split_at(Axis(0), chunk);
        f_parts.push(lo);
        remaining = hi;
    }
    f_parts.push(remaining);
}

std::thread::scope(|scope| {
    for (tid, (tmp_chunk, mut f_chunk)) in tmp_chunks.into_iter().zip(f_parts).enumerate() {
        let begin = tid * chunk;
        let n_local = tmp_chunk.len(); // rows owned by this thread
```
n_threads is used to size the Barrier and per-thread norm buffers, but the actual number of spawned threads is derived from tmp_buf.chunks_mut(chunk) and can be less than n_threads (e.g., n=257, n_threads=128 ⇒ chunk=3 ⇒ only 86 chunks). This will deadlock at the barrier (or panic during split_at when remaining < chunk). Please ensure the chunking logic produces exactly n_threads disjoint ranges (e.g., build explicit (begin,end) ranges for tid in 0..n_threads, with the last range shorter), and spawn exactly n_threads workers to match the Barrier size.
Suggested change:
```diff
-// tmp and f are split into per-thread chunks using safe Rust: each thread
-// receives exclusive ownership of its contiguous slice (tmp) or strided
-// sub-view (f) with no aliasing.
-let tmp_chunks: Vec<&mut [V]> = tmp_buf.chunks_mut(chunk).collect();
-let mut f_parts: Vec<ArrayViewMut1<'_, V>> = Vec::with_capacity(n_threads);
-{
-    let mut remaining = f;
-    for _ in 0..n_threads - 1 {
-        let (lo, hi) = remaining.split_at(Axis(0), chunk);
-        f_parts.push(lo);
-        remaining = hi;
-    }
-    f_parts.push(remaining);
-}
-std::thread::scope(|scope| {
-    for (tid, (tmp_chunk, mut f_chunk)) in tmp_chunks.into_iter().zip(f_parts).enumerate() {
-        let begin = tid * chunk;
-        let n_local = tmp_chunk.len(); // rows owned by this thread
+// Build exactly `n_threads` disjoint row ranges so the number of spawned
+// workers always matches the Barrier size and per-thread scratch buffers.
+let n_rows = tmp_buf.len();
+let ranges: Vec<(usize, usize)> = (0..n_threads)
+    .map(|tid| {
+        let begin = (tid * chunk).min(n_rows);
+        let end = (begin + chunk).min(n_rows);
+        (begin, end)
+    })
+    .collect();
+// tmp and f are split into exactly the same per-thread partitions using
+// safe Rust: each thread receives exclusive ownership of its contiguous
+// slice (tmp) or sub-view (f) with no aliasing. Some trailing partitions
+// may be empty when `n_threads > n_rows`.
+let mut tmp_chunks: Vec<&mut [V]> = Vec::with_capacity(n_threads);
+{
+    let mut remaining = tmp_buf;
+    for &(begin, end) in &ranges {
+        let len = end - begin;
+        let (lo, hi) = remaining.split_at_mut(len);
+        tmp_chunks.push(lo);
+        remaining = hi;
+    }
+}
+let mut f_parts: Vec<ArrayViewMut1<'_, V>> = Vec::with_capacity(n_threads);
+{
+    let mut remaining = f;
+    for &(begin, end) in &ranges {
+        let len = end - begin;
+        let (lo, hi) = remaining.split_at(Axis(0), len);
+        f_parts.push(lo);
+        remaining = hi;
+    }
+}
+std::thread::scope(|scope| {
+    for (tid, (((begin, end), tmp_chunk), mut f_chunk)) in ranges
+        .into_iter()
+        .zip(tmp_chunks)
+        .zip(f_parts)
+        .enumerate()
+    {
+        let n_local = end - begin; // rows owned by this thread
```
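The partition arithmetic behind this fix can be checked in isolation (a standalone Python sketch, not the crate's code): fixed-size chunking of `n` rows produces `ceil(n / chunk)` pieces, which is generally fewer than `n_threads`, while clamped `(begin, end)` ranges always yield exactly `n_threads` pieces.

```python
import math


def partition_rows(n_rows: int, n_threads: int):
    """Split n_rows into exactly n_threads disjoint (begin, end) ranges.

    Mirrors the suggested fix: trailing ranges may be empty, but the
    partition count always equals n_threads, so every spawned worker
    reaches the barrier.
    """
    chunk = math.ceil(n_rows / n_threads)
    ranges = []
    for tid in range(n_threads):
        begin = min(tid * chunk, n_rows)
        end = min(begin + chunk, n_rows)
        ranges.append((begin, end))
    return ranges


# The failure case from the review: n=257, n_threads=128 gives chunk=3,
# so naive fixed-size chunking yields only ceil(257/3) = 86 chunks —
# 42 workers short of the 128-participant Barrier.
chunk = math.ceil(257 / 128)
naive_chunks = math.ceil(257 / chunk)
ranges = partition_rows(257, 128)
```

The clamped version covers all 257 rows, stays contiguous, and pads with empty trailing ranges instead of under-spawning.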
```rust
/// Returns a lower bound on `‖B^p‖_1`.
pub fn onenorm_matrix_power_nnm<V, Op>(op: &Op, a: V, mu: V, p: usize, ell: usize) -> V::Real
where
    M: Primitive,
    I: Index,
    C: CIndex,
    V: ExpmComputation,
    Op: LinearOperator<V>,
{
```
The 1-norm estimator now runs in the generic scalar type V and returns V::Real. For V=f32 / Complex<f32>, this reduces the estimator precision compared to the previous f64 implementation and can change parameter selection (m*, s) in fragment_3_1. If the intent is to keep norm/parameter selection in full precision regardless of compute dtype, consider making onenorm_matrix_power_nnm (and its internal arithmetic) always operate in f64/Complex<f64> and return f64.
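To illustrate the precision point with a standalone sketch (plain Python, not the crate's code; the numeric value is made up): squeezing a norm estimate through single precision quantizes it, and the error survives the `d(p) = est^(1/p)` step that feeds parameter selection.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


est = 1234.5678901234   # hypothetical 1-norm estimate computed in f64
est32 = to_f32(est)     # the same estimate after an f32 round-trip

p = 3
d_p_f64 = est ** (1.0 / p)
d_p_f32 = est32 ** (1.0 / p)

# The two d(p) values differ at roughly f32 resolution; near an (m*, s)
# decision threshold that perturbation can flip the selected parameters.
rel_err = abs(d_p_f64 - d_p_f32) / d_p_f64
```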
```diff
 let est = onenorm_matrix_power_nnm(self.op, self.a, self.mu, p, self.ell);
+// Convert V::Real → f64 via round-trip through Complex<f64>.
+let est_f64 = V::from_real(est).to_complex().re;
 // d(p) = ||B^p||_1^(1/p)
-let d_p = if est == 0.0 {
+let d_p = if est_f64 == 0.0 {
     0.0
 } else {
-    est.powf(1.0 / p as f64)
+    est_f64.powf(1.0 / p as f64)
 };
```
LazyNormInfo::d converts the estimator result to f64 via V::from_real(est).to_complex().re. Besides being indirect, this round-trip will quantize when V::Real is f32, further reducing the accuracy of d(p)/alpha(p) and therefore (m*, s) selection. If norm estimation is meant to be done in full precision, returning f64 directly from the estimator would avoid these lossy conversions.
```rust
/// Unlike `dot_chunk`, the transpose scatters writes to arbitrary column
/// indices, so different threads processing disjoint row ranges will
/// concurrently write to the same output slots. The output is therefore
/// typed as `&[V::Atomic]` — a shared, atomically-acumulatable array —
```
Typo in comment: "atomically-acumulatable" should be "atomically-accumulatable" (or "atomically accumulatable").
Suggested change:
```diff
-/// typed as `&[V::Atomic]` — a shared, atomically-acumulatable array —
+/// typed as `&[V::Atomic]` — a shared, atomically-accumulatable array —
```
- algorithm.rs: replace chunks_mut with explicit (begin,end) ranges so exactly n_threads workers are spawned, matching the Barrier size and preventing a potential deadlock when n is not a multiple of chunk. Same fix applied to f_parts construction (latent panic when (n_threads-1)*chunk > n).
- linear_operator.rs: fix typo "acumulatable" → "accumulatable"
- norm_est.rs: document f32 precision tradeoff in onenorm_matrix_power_nnm
- benchmark.py: fix variable names (zz_bonds→xx_bonds, x_bonds→z_bonds), update docstring to match actual operator, replace hardcoded 10-site bitstring with "0"*L, rename `input` (shadows builtin) to `psi0`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
```python
L = 30
# TODO build example onf PZ symmetry group, particle-hole symmetry, etc. need to extend symmetry groups
```
Spelling: "onf" should be "of" in this TODO comment.
Suggested change:
```diff
-# TODO build example onf PZ symmetry group, particle-hole symmetry, etc. need to extend symmetry groups
+# TODO build example of PZ symmetry group, particle-hole symmetry, etc. need to extend symmetry groups
```
```python
output = np.zeros_like(psi0)

# TODO:
# * see if we can merge `dot` and `dot_many` into a single python biniding.
```
Spelling: "biniding" should be "binding".
Suggested change:
```diff
-# * see if we can merge `dot` and `dot_many` into a single python biniding.
+# * see if we can merge `dot` and `dot_many` into a single python binding.
```
```python
ham.dot_many(0.5, psi0.reshape((-1, 1)), output.reshape((-1, 1)), True)

# TODO: Fix this this interface, should have input and output arrays
```
Grammar: duplicated word "this" in the TODO comment; consider rewriting to "Fix this interface".
Suggested change:
```diff
-# TODO: Fix this this interface, should have input and output arrays
+# TODO: Fix this interface, should have input and output arrays
```
No description provided.