Skip to content

make FASTA reference optional#48

Open
killercup wants to merge 1 commit intomainfrom
feature/fasta-optional
Open

make FASTA reference optional#48
killercup wants to merge 1 commit intomainfrom
feature/fasta-optional

Conversation

@killercup
Copy link
Copy Markdown
Member

@killercup killercup commented May 5, 2026

Quick exploration for #46.

This PR implements:

  • Open BAM/SAM/CRAM without a FASTA: bases that don't need a reference decode as before
  • pileup() reports Base::Unknown for the reference column.
  • CRAM follows the writer's RR flag and per-slice embedded_reference to decide whether external reference bytes are needed; if a slice needs them and no FASTA was supplied, fetch returns CramError::MissingReference rather than silently producing N's.

Not sure I like this as it might be surprising to the user, but it is kinda what htslib does. Alternative would be to have type-state for Readers<T> with T: NoReference | WithReference and only support pileups on the WithReference variant. CRAM throwing errors when reference is missing is very unfortunate tho.

Open BAM/SAM/CRAM without a FASTA: bases that don't need a reference
decode as before, and `pileup()` reports `Base::Unknown` for the
reference column. CRAM follows the writer's `RR` flag and per-slice
`embedded_reference` to decide whether external reference bytes are
needed; if a slice needs them and no FASTA was supplied, fetch returns
`CramError::MissingReference` rather than silently producing N's.

`Readers::open` now takes `impl Into<Option<&Path>>` for the FASTA
argument, so `Readers::open(bam, &fa)`, `Readers::open(bam, Some(&fa))`,
and `Readers::open(bam, None)` all work. Closes #46.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@killercup
Copy link
Copy Markdown
Member Author

LLM critique


7 findings: 0 showstoppers, 3 gaps, 0 inconsistencies, 1 underspecified, 3 suggestions.


[Gap] [1] Multi-ref CRAM containers with RR=true and no FASTA silently produce N bases

The missing-reference gate in decode_slice (slice.rs:231-239) checks sh.ref_seq_id >= 0 as one of its four conditions. For multi-ref containers (ref_seq_id == -2), this condition is false, so the gate never fires. When no FASTA is attached and the container has RR=true with no embedded reference, effective_ref falls through to reference_seq (an empty &[u8]), and every ref_base_at() call returns b'N'. The spec says "silently producing N bases is forbidden" (r[cram.fasta.optional]), but the code does exactly that for multi-ref containers.

Multi-ref containers without embedded references are uncommon but not impossible — htslib can produce them for multi-contig regions where individual contigs are too small to fill a container.

Scenario: Open a CRAM file written with samtools view -C --output-fmt-option multi_seq_per_slice=1 but without embed_ref, then read it through Readers::open(path, None). Single-ref slices correctly hit MissingReference; multi-ref slices decode all records with N bases. A downstream caller expecting a hard error gets garbage data instead.

Sources: slice.rs:223-239, slice.rs:830-846 (ref_base_at)


[Gap] [2] FASTA fetch happens before RR=false containers, wasting I/O

In fetch_into_customized (reader.rs:684-687), the FASTA fetch is guarded by ch.preservation.reference_required — so RR=false skips the fetch. But the decision to fetch uses the container's compression header, while decode_slice also checks the slice's embedded_reference >= 0. If a container has RR=true but every listed slice has embedded_reference >= 0, the FASTA fetch is pure waste — the fetched bytes are used for MD5 verification only (reference_seq is passed to decode_slice for the MD5 check at slice.rs:162-184, but effective_ref prefers the embedded data at slice.rs:223-224).

For most production files this is a minor performance issue rather than a correctness one — RR=true without an embedded ref is the common case. But the tiny MD5-only use of the fetched reference degrades to a correctness issue: when the slice has an embedded ref AND the external FASTA's MD5 doesn't match, the code still verifies the external FASTA's MD5 against the slice header (slice.rs:162), which is a spec violation — the embedded reference should be the canonical one.

Scenario: A CRAM written with embed_ref=1 (embed only when small) against one reference panel, then read against a different FASTA where that contig exists and is long enough but has a different sequence. The code fetches the external FASTA, verifies its MD5 against the slice header (which was computed against the embedded reference!), and either spuriously rejects a valid file or spuriously accepts an invalid one, depending on whether the MD5 happens to match.

Sources: reader.rs:684-710, slice.rs:162-184, slice.rs:223-224


[Gap] [3] Working directory has uncommitted WIP that breaks compilation

crates/seqair/src/bam.rs has uncommitted edits adding pub mod record_traits; and pub mod slab;. The untracked files record_traits.rs and slab.rs reference a StoreLayout type that doesn't exist yet, so the crate doesn't compile with -D warnings. This isn't part of the committed branch, but it pollutes any reviewer's or CI's view — RUSTFLAGS="-D warnings" cargo check fails on this branch even though the committed code is clean.

Scenario: A reviewer checks out the branch and runs RUSTFLAGS="-D warnings" cargo clippy --all-targets to verify CI parity. The build fails with E0432: unresolved import super::record_store::StoreLayout. They have to manually identify and revert the unstaged changes before they can evaluate the branch.

Sources: unstaged diff in crates/seqair/src/bam.rs, untracked crates/seqair/src/bam/record_traits.rs, crates/seqair/src/bam/slab.rs


[Underspecified] [4] Multi-ref absence-of-reference decision is deferred to per-record decode where it can't be checked

The gate at slice.rs:231-239 makes a slice-level decision (error or proceed) based on sh.ref_seq_id >= 0. For multi-ref slices (-2), this skips the check, and the per-record decode_record receives effective_ref (empty). Inside decode_record, record_ref_id is decoded per-record from the CRAM data series (ds.ref_id.decode(ctx)? at slice.rs:362), and a record could have ref_id >= 0 even when the slice's ref_seq_id == -2. So individual records within a multi-ref container are also at risk of silent N bases.

You might want to decide now whether to (a) add a per-record gate in decode_record for the !fasta_available && RR && ref_id >= 0 case, replicating the slice-level check but using record_ref_id instead of sh.ref_seq_id, or (b) pre-scan the CRAI entries before decoding to determine whether any record in the multi-ref container could reference a contig not embedded, and gate at container level.

Sources: slice.rs:231-239, slice.rs:362 (record_ref_id), slice.rs:271 (effective_ref passed per-record)


[Suggestion] [5] The fasta_available: bool flag in decode_slice is a poor name

The parameter name describes an external state (was FASTA provided at open?) rather than what the function needs to decide (can we reconstruct bases?). Since the function also knows ch.preservation.reference_required and sh.embedded_reference >= 0, the boolean introduces a subtle 3-way coupling between fetch_into_customized (sets it), decode_slice (checks it), and the spec (constrains the 3-field condition). A name like can_fallback_to_external_ref or external_reference_available would make the condition self-documenting.

This is a naming suggestion only — the logic is correct (aside from gap [1]).

Sources: slice.rs:117, reader.rs:710


[Suggestion] [6] FastaError::NotConfigured is dead code in the no-FASTA path

FastaError::NotConfigured is only returned from Readers::fetch_base_seq (readers.rs:431). But the no-FASTA pileup path (readers.rs:377-394) uses match self.fasta.as_mut() directly — it doesn't call fetch_base_seq. If a user calls fetch_base_seq on a no-FASTA Readers, they get NotConfigured. But fetch_base_seq is a public method — consider adding a direct Readers::has_fasta() -> bool accessor so users can check before calling, rather than making them handle an error after the fact. Or at minimum, add a doc example showing the NotConfigured path.

Sources: reader.rs:421-437, readers.rs:421-437


[Suggestion] [7] The readers.rs doctest examples all use Path::new("reference.fa") — none show the no-FASTA case

The Readers::open doc comments show both &Path and Option<&Path> forms (readers.rs:149-151), but the structural doc examples at lines 46-87, 95-119, and 269-279 all use the with-FASTA form. A multi-threaded example showing Readers::open(path, None) + fork() + pileup() with Base::Unknown reference base would help users discover the no-reference workflow. The feature's primary value-add is under-documented by example.


What looks good

  • The impl Into<Option<&Path>> signature is ergonomic — &Path and Option<&Path> both work at the call site, and the semicolon-chained doc examples demonstrate all three forms. All existing examples were updated cleanly.
  • The test coverage is comprehensive: IndexedReader::open accepts CRAM, Readers::open(path, None) fetches BAM records, pileup returns Base::Unknown without FASTA, embedded-ref CRAM decodes without external FASTA, external-ref CRAM correctly surfaces MissingReference, fork preserves the no-FASTA state. The old rejecting test was correctly replaced rather than just modified.
  • The decode_slice flow correctly prefers embedded references over external FASTA at line 223-224, and the RR=false fast path at reader.rs:685 skips the wasteful FASTA fetch.
  • The MissingReference error message includes REF_PATH/REF_CACHE hints for htslib-fluent users, matching the spec's SHOULD requirement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant