Conversation
Introduce `LinearOperator<V>` trait with `dot`, `dot_transpose`, `dot_many`, `dot_chunk`, `dot_transpose_chunk`, `trace`, and `onenorm` methods. Add `QMatrixOperator` adapter that wraps `&QMatrix<M,I,C>` with baked-in coefficients. Refactor `algorithm`, `norm_est`, and `params` to be generic over `Op: LinearOperator<V>`, removing all `&QMatrix + &[coeffs]` parameters from the Taylor-series core. Update high-level API functions in `mod.rs` and `Hamiltonian::expm_dot`/`expm_dot_many` in `ham.rs` to construct a `QMatrixOperator` at the call site. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parallelize single vector expm_multiply
Pull request overview
This PR refactors the matrix-exponential (expm) subsystem to operate over a new LinearOperator abstraction (instead of being hard-wired to QMatrix + coeffs) and introduces a persistent-thread parallel implementation for single-vector expm_multiply dispatch.
Changes:
- Introduce `LinearOperator` + `QMatrixOperator` adapter and refactor `expm` parameter selection, norm estimation, and algorithms to use the trait.
- Add `expm_multiply_par` and auto-dispatch to the parallel path above a size threshold; add atomic-accumulation utilities to support chunked transpose operations.
- Update benchmarking script and dependency lockfile.
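The shape of this abstraction can be illustrated with a small duck-typed sketch (Python, with hypothetical names — not the crate's Rust API): any object exposing `dot`/`dot_transpose`/`trace` can be handed to generic code that was previously hard-wired to a concrete matrix type, which is exactly the role `QMatrixOperator` plays for `QMatrix` + coefficients.

```python
class DenseOperator:
    """Toy adapter: wraps a dense matrix (list of rows) behind a minimal
    operator interface, analogous to wrapping QMatrix in QMatrixOperator."""

    def __init__(self, rows):
        self.rows = rows

    def dot(self, v):
        # y = A @ v
        return [sum(a * x for a, x in zip(row, v)) for row in self.rows]

    def dot_transpose(self, v):
        # y = A^T @ v
        n = len(self.rows[0])
        return [sum(self.rows[i][j] * v[i] for i in range(len(self.rows)))
                for j in range(n)]

    def trace(self):
        return sum(self.rows[i][i] for i in range(len(self.rows)))


def apply_power(op, v, k):
    """Generic routine: only touches the operator interface, never the matrix."""
    for _ in range(k):
        v = op.dot(v)
    return v


A = DenseOperator([[0.0, 1.0], [1.0, 0.0]])  # Pauli-X as a dense matrix
```

Applying `A` twice returns the input vector, since X² = I; the point is that `apply_power` never mentions `DenseOperator`.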
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `uv.lock` | Updates pinned Python toolchain/dev dependency versions. |
| `scripts/benchmark.py` | Expands benchmark script usage of new Rust bindings / APIs. |
| `crates/quspin-core/src/hamiltonian/ham.rs` | Wraps `QMatrix` + evaluated coeffs in `QMatrixOperator` before calling `expm_*`. |
| `crates/quspin-core/src/expm/mod.rs` | Exposes new operator abstraction and parallel entry points; updates auto APIs to use `ArrayViewMut1`. |
| `crates/quspin-core/src/expm/linear_operator.rs` | Adds `LinearOperator` trait and `QMatrixOperator` implementation, including chunked methods. |
| `crates/quspin-core/src/expm/compute.rs` | Adds `AtomicAccum` and concrete atomic accumulators; extends `ExpmComputation` with an associated atomic type. |
| `crates/quspin-core/src/expm/algorithm.rs` | Refactors core algorithms to use `LinearOperator` and adds persistent-thread parallel `expm_multiply_par`. |
| `crates/quspin-core/src/expm/params.rs` | Refactors adaptive parameter selection to operate on `LinearOperator`; updates tests accordingly. |
| `crates/quspin-core/src/expm/norm_est.rs` | Refactors 1-norm estimator to use `LinearOperator` for matvecs. |
scripts/benchmark.py
Outdated
```diff
 def _make_operator(L: int) -> PauliOperator:
     """Single-particle XX+YY hopping on L sites with periodic boundary conditions."""
     zz_bonds = [[1.0, i, (i + 1) % L] for i in range(L)]
     x_bonds = [[1.0, i] for i in range(L)]
-    return PauliOperator([("zz", zz_bonds)], [("x", x_bonds)])
+    return PauliOperator([("xx", zz_bonds)], [("z", x_bonds)])
```
_make_operator docstring says this builds an "XX+YY hopping" operator, but the returned PauliOperator uses ("xx", zz_bonds) and ("z", x_bonds). This mismatch makes the benchmark misleading and likely constructs the wrong Hamiltonian. Please update the operator terms (and/or variable names) to match the intended XX+YY model.
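For comparison, a bond construction that matches the docstring could look like the following (a sketch: the `[coupling, site_i, site_j]` bond format is taken from the snippet above, but the names `xx_bonds`/`yy_bonds` and the term labels are assumptions about the intended model):

```python
def make_xx_yy_bonds(L: int):
    """Bond lists for an XX+YY hopping chain with periodic boundary conditions."""
    # Each two-site bond is [coupling, site_i, site_j]; (i + 1) % L closes the ring.
    xx_bonds = [[1.0, i, (i + 1) % L] for i in range(L)]
    yy_bonds = [[1.0, i, (i + 1) % L] for i in range(L)]
    return xx_bonds, yy_bonds


xx_bonds, yy_bonds = make_xx_yy_bonds(4)
# The final bond wraps site L-1 back to site 0: [1.0, 3, 0].
```

These would then presumably be passed as `[("xx", xx_bonds), ("yy", yy_bonds)]` (assuming the same `PauliOperator` call shape as the snippet), so that docstring and operator agree.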
scripts/benchmark.py
Outdated
```python
ham = Hamiltonian(Q, [Static(), Hx])
input = np.zeros(basis.size, dtype=np.complex128)
print(basis.index("0000000000"))
input[basis.index("0000000000")] = 1.0
output = np.zeros_like(input)
```
basis.index("0000000000") hard-codes a 10-site bitstring, but L is set to 4 above, so this will error/panic for most runs. Consider using "0" * L (or the chosen seed) consistently when indexing and initializing the state vector.
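A minimal sketch of that suggestion — deriving the seed bitstring from `L` so it cannot drift out of sync with the system size (pure string logic; the `basis` object itself comes from the script above):

```python
L = 4  # matches the system size chosen earlier in the script

# Derive the all-zeros seed from L instead of hard-coding a 10-character
# literal; the index lookup then stays valid for any choice of L.
seed = "0" * L

# Usage in the script would be (basis assumed from the snippet above):
#   state[basis.index(seed)] = 1.0
```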
scripts/benchmark.py
Outdated
```python
input = np.zeros(basis.size, dtype=np.complex128)
print(basis.index("0000000000"))
input[basis.index("0000000000")] = 1.0
output = np.zeros_like(input)

# TODO:
# * see if we can merge `dot` and `dot_many` into a single python biniding.
# * expose dot_transpose in the same way.
# * Try to make output and overwrite optional
#
ham.dot_many(0.5, input.reshape((-1, 1)), output.reshape((-1, 1)), True)

# TODO: Fix this this interface, should have input and output arrays
a = 0.01j
print(input)
ham.expm_dot(0.0, a, input1 := input.copy())
```
input is used as a variable name here, which shadows Python’s built-in input() function. Renaming it (e.g., psi0/state_in) would avoid confusion when extending this script.
Suggested change:
```diff
-input = np.zeros(basis.size, dtype=np.complex128)
-print(basis.index("0000000000"))
-input[basis.index("0000000000")] = 1.0
-output = np.zeros_like(input)
-# TODO:
-# * see if we can merge `dot` and `dot_many` into a single python biniding.
-# * expose dot_transpose in the same way.
-# * Try to make output and overwrite optional
-#
-ham.dot_many(0.5, input.reshape((-1, 1)), output.reshape((-1, 1)), True)
-# TODO: Fix this this interface, should have input and output arrays
-a = 0.01j
-print(input)
-ham.expm_dot(0.0, a, input1 := input.copy())
+state_in = np.zeros(basis.size, dtype=np.complex128)
+print(basis.index("0000000000"))
+state_in[basis.index("0000000000")] = 1.0
+output = np.zeros_like(state_in)
+# TODO:
+# * see if we can merge `dot` and `dot_many` into a single python biniding.
+# * expose dot_transpose in the same way.
+# * Try to make output and overwrite optional
+#
+ham.dot_many(0.5, state_in.reshape((-1, 1)), output.reshape((-1, 1)), True)
+# TODO: Fix this this interface, should have input and output arrays
+a = 0.01j
+print(state_in)
+ham.expm_dot(0.0, a, input1 := state_in.copy())
```
```rust
// tmp and f are split into per-thread chunks using safe Rust: each thread
// receives exclusive ownership of its contiguous slice (tmp) or strided
// sub-view (f) with no aliasing.
let tmp_chunks: Vec<&mut [V]> = tmp_buf.chunks_mut(chunk).collect();

let mut f_parts: Vec<ArrayViewMut1<'_, V>> = Vec::with_capacity(n_threads);
{
    let mut remaining = f;
    for _ in 0..n_threads - 1 {
        let (lo, hi) = remaining.split_at(Axis(0), chunk);
        f_parts.push(lo);
        remaining = hi;
    }
    f_parts.push(remaining);
}

std::thread::scope(|scope| {
    for (tid, (tmp_chunk, mut f_chunk)) in tmp_chunks.into_iter().zip(f_parts).enumerate() {
        let begin = tid * chunk;
        let n_local = tmp_chunk.len(); // rows owned by this thread
```
n_threads is used to size the Barrier and per-thread norm buffers, but the actual number of spawned threads is derived from tmp_buf.chunks_mut(chunk) and can be less than n_threads (e.g., n=257, n_threads=128 ⇒ chunk=3 ⇒ only 86 chunks). This will deadlock at the barrier (or panic during split_at when remaining < chunk). Please ensure the chunking logic produces exactly n_threads disjoint ranges (e.g., build explicit (begin,end) ranges for tid in 0..n_threads, with the last range shorter), and spawn exactly n_threads workers to match the Barrier size.
Suggested change:
```diff
-// tmp and f are split into per-thread chunks using safe Rust: each thread
-// receives exclusive ownership of its contiguous slice (tmp) or strided
-// sub-view (f) with no aliasing.
-let tmp_chunks: Vec<&mut [V]> = tmp_buf.chunks_mut(chunk).collect();
-let mut f_parts: Vec<ArrayViewMut1<'_, V>> = Vec::with_capacity(n_threads);
-{
-    let mut remaining = f;
-    for _ in 0..n_threads - 1 {
-        let (lo, hi) = remaining.split_at(Axis(0), chunk);
-        f_parts.push(lo);
-        remaining = hi;
-    }
-    f_parts.push(remaining);
-}
-std::thread::scope(|scope| {
-    for (tid, (tmp_chunk, mut f_chunk)) in tmp_chunks.into_iter().zip(f_parts).enumerate() {
-        let begin = tid * chunk;
-        let n_local = tmp_chunk.len(); // rows owned by this thread
+// Build exactly `n_threads` disjoint row ranges so the number of spawned
+// workers always matches the Barrier size and per-thread scratch buffers.
+let n_rows = tmp_buf.len();
+let ranges: Vec<(usize, usize)> = (0..n_threads)
+    .map(|tid| {
+        let begin = (tid * chunk).min(n_rows);
+        let end = (begin + chunk).min(n_rows);
+        (begin, end)
+    })
+    .collect();
+// tmp and f are split into exactly the same per-thread partitions using
+// safe Rust: each thread receives exclusive ownership of its contiguous
+// slice (tmp) or sub-view (f) with no aliasing. Some trailing partitions
+// may be empty when `n_threads > n_rows`.
+let mut tmp_chunks: Vec<&mut [V]> = Vec::with_capacity(n_threads);
+{
+    let mut remaining = tmp_buf;
+    for &(begin, end) in &ranges {
+        let len = end - begin;
+        let (lo, hi) = remaining.split_at_mut(len);
+        tmp_chunks.push(lo);
+        remaining = hi;
+    }
+}
+let mut f_parts: Vec<ArrayViewMut1<'_, V>> = Vec::with_capacity(n_threads);
+{
+    let mut remaining = f;
+    for &(begin, end) in &ranges {
+        let len = end - begin;
+        let (lo, hi) = remaining.split_at(Axis(0), len);
+        f_parts.push(lo);
+        remaining = hi;
+    }
+}
+std::thread::scope(|scope| {
+    for (tid, (((begin, end), tmp_chunk), mut f_chunk)) in ranges
+        .into_iter()
+        .zip(tmp_chunks)
+        .zip(f_parts)
+        .enumerate()
+    {
+        let n_local = end - begin; // rows owned by this thread
```
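The partition arithmetic behind this fix can be checked in isolation (a standalone Python sketch, not the crate's code): fixed-size chunking of `n` rows produces `ceil(n / chunk)` pieces, which is generally fewer than `n_threads`, while clamped `(begin, end)` ranges always yield exactly `n_threads` pieces.

```python
import math


def partition_rows(n_rows: int, n_threads: int):
    """Split n_rows into exactly n_threads disjoint (begin, end) ranges.

    Mirrors the suggested fix: trailing ranges may be empty, but the
    partition count always equals n_threads, so every spawned worker
    reaches the barrier.
    """
    chunk = math.ceil(n_rows / n_threads)
    ranges = []
    for tid in range(n_threads):
        begin = min(tid * chunk, n_rows)
        end = min(begin + chunk, n_rows)
        ranges.append((begin, end))
    return ranges


# The failure case from the review: n=257, n_threads=128 gives chunk=3,
# so naive fixed-size chunking yields only ceil(257/3) = 86 chunks —
# 42 workers short of the 128-participant Barrier.
chunk = math.ceil(257 / 128)
naive_chunks = math.ceil(257 / chunk)
ranges = partition_rows(257, 128)
```

The clamped version covers all 257 rows, stays contiguous, and pads with empty trailing ranges instead of under-spawning.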
```rust
/// Returns a lower bound on `‖B^p‖_1`.
pub fn onenorm_matrix_power_nnm<V, Op>(op: &Op, a: V, mu: V, p: usize, ell: usize) -> V::Real
where
    M: Primitive,
    I: Index,
    C: CIndex,
    V: ExpmComputation,
    Op: LinearOperator<V>,
{
```
The 1-norm estimator now runs in the generic scalar type V and returns V::Real. For V=f32 / Complex<f32>, this reduces the estimator precision compared to the previous f64 implementation and can change parameter selection (m*, s) in fragment_3_1. If the intent is to keep norm/parameter selection in full precision regardless of compute dtype, consider making onenorm_matrix_power_nnm (and its internal arithmetic) always operate in f64/Complex<f64> and return f64.
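To illustrate the precision point with a standalone sketch (plain Python, not the crate's code; the numeric value is made up): squeezing a norm estimate through single precision quantizes it, and the error survives the `d(p) = est^(1/p)` step that feeds parameter selection.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


est = 1234.5678901234   # hypothetical 1-norm estimate computed in f64
est32 = to_f32(est)     # the same estimate after an f32 round-trip

p = 3
d_p_f64 = est ** (1.0 / p)
d_p_f32 = est32 ** (1.0 / p)

# The two d(p) values differ at roughly f32 resolution; near an (m*, s)
# decision threshold that perturbation can flip the selected parameters.
rel_err = abs(d_p_f64 - d_p_f32) / d_p_f64
```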
```diff
 let est = onenorm_matrix_power_nnm(self.op, self.a, self.mu, p, self.ell);
+// Convert V::Real → f64 via round-trip through Complex<f64>.
+let est_f64 = V::from_real(est).to_complex().re;
 // d(p) = ||B^p||_1^(1/p)
-let d_p = if est == 0.0 {
+let d_p = if est_f64 == 0.0 {
     0.0
 } else {
-    est.powf(1.0 / p as f64)
+    est_f64.powf(1.0 / p as f64)
 };
```
LazyNormInfo::d converts the estimator result to f64 via V::from_real(est).to_complex().re. Besides being indirect, this round-trip will quantize when V::Real is f32, further reducing the accuracy of d(p)/alpha(p) and therefore (m*, s) selection. If norm estimation is meant to be done in full precision, returning f64 directly from the estimator would avoid these lossy conversions.
```rust
/// Unlike `dot_chunk`, the transpose scatters writes to arbitrary column
/// indices, so different threads processing disjoint row ranges will
/// concurrently write to the same output slots. The output is therefore
/// typed as `&[V::Atomic]` — a shared, atomically-acumulatable array —
```
Typo in comment: "atomically-acumulatable" should be "atomically-accumulatable" (or "atomically accumulatable").
Suggested change:
```diff
-/// typed as `&[V::Atomic]` — a shared, atomically-acumulatable array —
+/// typed as `&[V::Atomic]` — a shared, atomically-accumulatable array —
```
- algorithm.rs: replace chunks_mut with explicit (begin,end) ranges so exactly n_threads workers are spawned, matching the Barrier size and preventing a potential deadlock when n is not a multiple of chunk. Same fix applied to f_parts construction (latent panic when (n_threads-1)*chunk > n).
- linear_operator.rs: fix typo "acumulatable" → "accumulatable"
- norm_est.rs: document f32 precision tradeoff in onenorm_matrix_power_nnm
- benchmark.py: fix variable names (zz_bonds→xx_bonds, x_bonds→z_bonds), update docstring to match actual operator, replace hardcoded 10-site bitstring with "0"*L, rename `input` (shadows builtin) to `psi0`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
```python
L = 30
# TODO build example onf PZ symmetry group, particle-hole symmetry, etc. need to extend symmetry groups
```
Spelling: "onf" should be "of" in this TODO comment.
Suggested change:
```diff
-# TODO build example onf PZ symmetry group, particle-hole symmetry, etc. need to extend symmetry groups
+# TODO build example of PZ symmetry group, particle-hole symmetry, etc. need to extend symmetry groups
```
```python
output = np.zeros_like(psi0)

# TODO:
# * see if we can merge `dot` and `dot_many` into a single python biniding.
```
Spelling: "biniding" should be "binding".
Suggested change:
```diff
-# * see if we can merge `dot` and `dot_many` into a single python biniding.
+# * see if we can merge `dot` and `dot_many` into a single python binding.
```
```python
ham.dot_many(0.5, psi0.reshape((-1, 1)), output.reshape((-1, 1)), True)

# TODO: Fix this this interface, should have input and output arrays
```
Grammar: duplicated word "this" in the TODO comment; consider rewriting to "Fix this interface".
Suggested change:
```diff
-# TODO: Fix this this interface, should have input and output arrays
+# TODO: Fix this interface, should have input and output arrays
```
No description provided.