Skip to content

feat(nanno): if snpeff deems a variant to be in multiple genes report them all#400

Draft
holmeso wants to merge 4 commits intomasterfrom
nanno_multi_gene
Draft

feat(nanno): if snpeff deems a variant to be in multiple genes report them all#400
holmeso wants to merge 4 commits intomasterfrom
nanno_multi_gene

Conversation

@holmeso
Copy link
Contributor

@holmeso holmeso commented Feb 27, 2026

This pull request refactors and improves the annotation extraction logic for VCF and SnpEff VCF files, optimizes TSV annotation field extraction, and enhances test coverage. The main focus is on making the annotation extraction more robust, efficient, and maintainable, especially for cases with multiple consequences or large INFO fields.

VCF and SnpEff VCF annotation improvements:

  • Refactored AnnotationSourceSnpEffVCF to extract all relevant consequences per gene, not just the first, and to support multiple consequences per alternate allele. This results in more comprehensive and accurate annotation extraction. [1] [2]
  • Simplified and consolidated logic in getAnnotation and related methods, reducing code duplication and improving maintainability. [1] [2]

Performance and maintainability enhancements:

  • Improved VCF INFO field parsing in AnnotationSourceVCF to efficiently handle large fields or requests for many annotation fields by parsing the INFO field once into a map when needed. [1] [2] [3]
  • Optimized TSV annotation extraction in AnnotationSourceTSV by precomputing field name and position arrays for faster lookup during annotation extraction. [1] [2] [3]

Testing improvements:

  • Added and expanded tests for SnpEff VCF annotation extraction, including setup for temporary files and use of ChrPositionRefAlt utilities.

Minor code cleanups:

  • Replaced deprecated or verbose code patterns (e.g., use of contains(",") replaced with more efficient indexOf(',') >= 0 checks).
  • Removed commented-out and redundant code for clarity. [1] [2]

These changes collectively improve the accuracy, efficiency, and maintainability of the annotation extraction logic across VCF, SnpEff VCF, and TSV sources.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves annotation extraction across TSV, generic VCF INFO, and SnpEff VCF sources, with a focus on supporting multiple-gene SnpEff annotations and reducing per-record parsing overhead.

Changes:

  • Update SnpEff VCF parsing to return consequences for multiple genes (and add/extend tests around multi-gene behavior).
  • Optimize VCF INFO parsing by optionally parsing INFO once into a map for large INFO strings or many requested fields.
  • Speed up TSV extraction by precomputing field name/position arrays.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
qannotate/test/au/edu/qimr/qannotate/nanno/AnnotationSourceTSVTest.java Updates tests to match new TSV extraction signature (arrays rather than map).
qannotate/test/au/edu/qimr/qannotate/nanno/AnnotationSourceSnpEffVCFTest.java Adds coverage for multi-gene and multi-allele SnpEff behaviors using temp VCF files.
qannotate/src/au/edu/qimr/qannotate/nanno/AnnotationSourceVCF.java Adds INFO parsing heuristics to parse once into a map for performance on large fields/many keys.
qannotate/src/au/edu/qimr/qannotate/nanno/AnnotationSourceTSV.java Precomputes field arrays for faster per-record TSV extraction.
qannotate/src/au/edu/qimr/qannotate/nanno/AnnotationSourceSnpEffVCF.java Refactors annotation matching and expands extraction to return all relevant gene consequences.
qannotate/src/au/edu/qimr/qannotate/nanno/AnnotationSource.java Simplifies getAnnotation flow and makes the “requested < next” early-return explicit.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

String worstConsequence = "";
Map<String, String> worstByGene = new java.util.LinkedHashMap<>();
for (String aa : annArray) {
if (aa.startsWith(alt)) {
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getWorstConsequence uses aa.startsWith(alt) to select consequences for an allele. This can incorrectly match alleles where alt is a prefix (e.g., requested alt A will also match an ANN entry starting with AA|...). To avoid mis-annotating multi-allelic records, match the allele token exactly (e.g., aa.startsWith(alt + "|") or parse up to the first '|').

Suggested change
if (aa.startsWith(alt)) {
int pipeIndex = aa.indexOf('|');
if (pipeIndex <= 0) {
// Malformed ANN entry or missing allele token; skip
continue;
}
String alleleToken = aa.substring(0, pipeIndex);
if (alleleToken.equals(alt)) {

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Coplit please implement your suggestion

Comment on lines +52 to +57
// precompute arrays for fast extraction
this.fieldNames = headerNameAndPosition.keySet().toArray(new String[0]);
this.fieldPositions = new int[this.fieldNames.length];
for (int i = 0; i < this.fieldNames.length; i++) {
this.fieldPositions[i] = headerNameAndPosition.get(this.fieldNames[i]);
}
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The precomputed fieldNames array is built from headerNameAndPosition.keySet(), which has non-deterministic iteration order for a HashMap. This can cause extracted annotations to be emitted in an arbitrary order that may not match the user-requested field order (and can diverge from emptyRecordResult, which preserves the input order). Consider building fieldNames/fieldPositions by iterating over the requested fieldNames string (split by ',') or using an order-preserving map (e.g., LinkedHashMap).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedbacl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Comment on lines +111 to +118
public static String extractFieldsFromRecord(String[] record, String[] fieldNames, int[] fieldPositions) {
StringBuilder dataToReturn = new StringBuilder();
int recordLength = null != record ? record.length : 0;
if ( recordLength > 0 && null != fields) {
// String [] recordArray = TabTokenizer.tokenize(record);
for (Entry<String, Integer> entry : fields.entrySet()) {
/*
* make sure that array length is not shorter than entry value
*/
if (recordLength > entry.getValue()) {
dataToReturn.append(( ! dataToReturn.isEmpty()) ? FIELD_DELIMITER_TAB : "").append(entry.getKey()).append("=").append(record[entry.getValue()]);
if (recordLength > 0 && null != fieldNames && null != fieldPositions) {
for (int i = 0; i < fieldNames.length; i++) {
int pos = fieldPositions[i];
if (recordLength > pos) {
dataToReturn.append((!dataToReturn.isEmpty()) ? FIELD_DELIMITER_TAB : "")
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractFieldsFromRecord iterates over fieldNames.length but indexes into fieldPositions[i] without verifying the arrays are the same length. If a caller passes mismatched arrays, this will throw ArrayIndexOutOfBoundsException. Consider validating lengths up front (and returning "" / throwing IllegalArgumentException), or iterating up to Math.min(fieldNames.length, fieldPositions.length).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Copy link

Copilot AI commented Feb 27, 2026

@holmeso I've opened a new pull request, #401, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI commented Feb 27, 2026

@holmeso I've opened a new pull request, #402, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: holmeso <7066552+holmeso@users.noreply.github.com>
Copy link

Copilot AI commented Feb 27, 2026

@holmeso I've opened a new pull request, #403, to work on those changes. Once the pull request is ready, I'll request review from you.

fix(nanno): guard against mismatched array lengths in extractFieldsFromRecord
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants