feat(nanno): if snpeff deems a variant to be in multiple genes report them all#400
feat(nanno): if snpeff deems a variant to be in multiple genes report them all#400
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves annotation extraction across TSV, generic VCF INFO, and SnpEff VCF sources, with a focus on supporting multiple-gene SnpEff annotations and reducing per-record parsing overhead.
Changes:
- Update SnpEff VCF parsing to return consequences for multiple genes (and add/extend tests around multi-gene behavior).
- Optimize VCF INFO parsing by optionally parsing INFO once into a map for large INFO strings or many requested fields.
- Speed up TSV extraction by precomputing field name/position arrays.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| qannotate/test/au/edu/qimr/qannotate/nanno/AnnotationSourceTSVTest.java | Updates tests to match new TSV extraction signature (arrays rather than map). |
| qannotate/test/au/edu/qimr/qannotate/nanno/AnnotationSourceSnpEffVCFTest.java | Adds coverage for multi-gene and multi-allele SnpEff behaviors using temp VCF files. |
| qannotate/src/au/edu/qimr/qannotate/nanno/AnnotationSourceVCF.java | Adds INFO parsing heuristics to parse once into a map for performance on large fields/many keys. |
| qannotate/src/au/edu/qimr/qannotate/nanno/AnnotationSourceTSV.java | Precomputes field arrays for faster per-record TSV extraction. |
| qannotate/src/au/edu/qimr/qannotate/nanno/AnnotationSourceSnpEffVCF.java | Refactors annotation matching and expands extraction to return all relevant gene consequences. |
| qannotate/src/au/edu/qimr/qannotate/nanno/AnnotationSource.java | Simplifies getAnnotation flow and makes the “requested < next” early-return explicit. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| String worstConsequence = ""; | ||
| Map<String, String> worstByGene = new java.util.LinkedHashMap<>(); | ||
| for (String aa : annArray) { | ||
| if (aa.startsWith(alt)) { |
There was a problem hiding this comment.
getWorstConsequence uses aa.startsWith(alt) to select consequences for an allele. This can incorrectly match alleles where alt is a prefix (e.g., requested alt A will also match an ANN entry starting with AA|...). To avoid mis-annotating multi-allelic records, match the allele token exactly (e.g., aa.startsWith(alt + "|") or parse up to the first '|').
| if (aa.startsWith(alt)) { | |
| int pipeIndex = aa.indexOf('|'); | |
| if (pipeIndex <= 0) { | |
| // Malformed ANN entry or missing allele token; skip | |
| continue; | |
| } | |
| String alleleToken = aa.substring(0, pipeIndex); | |
| if (alleleToken.equals(alt)) { |
There was a problem hiding this comment.
@Coplit please implement your suggestion
| // precompute arrays for fast extraction | ||
| this.fieldNames = headerNameAndPosition.keySet().toArray(new String[0]); | ||
| this.fieldPositions = new int[this.fieldNames.length]; | ||
| for (int i = 0; i < this.fieldNames.length; i++) { | ||
| this.fieldPositions[i] = headerNameAndPosition.get(this.fieldNames[i]); | ||
| } |
There was a problem hiding this comment.
The precomputed fieldNames array is built from headerNameAndPosition.keySet(), which has non-deterministic iteration order for a HashMap. This can cause extracted annotations to be emitted in an arbitrary order that may not match the user-requested field order (and can diverge from emptyRecordResult, which preserves the input order). Consider building fieldNames/fieldPositions by iterating over the requested fieldNames string (split by ',') or using an order-preserving map (e.g., LinkedHashMap).
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedbacl
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
| public static String extractFieldsFromRecord(String[] record, String[] fieldNames, int[] fieldPositions) { | ||
| StringBuilder dataToReturn = new StringBuilder(); | ||
| int recordLength = null != record ? record.length : 0; | ||
| if ( recordLength > 0 && null != fields) { | ||
| // String [] recordArray = TabTokenizer.tokenize(record); | ||
| for (Entry<String, Integer> entry : fields.entrySet()) { | ||
| /* | ||
| * make sure that array length is not shorter than entry value | ||
| */ | ||
| if (recordLength > entry.getValue()) { | ||
| dataToReturn.append(( ! dataToReturn.isEmpty()) ? FIELD_DELIMITER_TAB : "").append(entry.getKey()).append("=").append(record[entry.getValue()]); | ||
| if (recordLength > 0 && null != fieldNames && null != fieldPositions) { | ||
| for (int i = 0; i < fieldNames.length; i++) { | ||
| int pos = fieldPositions[i]; | ||
| if (recordLength > pos) { | ||
| dataToReturn.append((!dataToReturn.isEmpty()) ? FIELD_DELIMITER_TAB : "") |
There was a problem hiding this comment.
extractFieldsFromRecord iterates over fieldNames.length but indexes into fieldPositions[i] without verifying the arrays are the same length. If a caller passes mismatched arrays, this will throw ArrayIndexOutOfBoundsException. Consider validating lengths up front (and returning "" / throwing IllegalArgumentException), or iterating up to Math.min(fieldNames.length, fieldPositions.length).
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
Co-authored-by: holmeso <7066552+holmeso@users.noreply.github.com>
fix(nanno): guard against mismatched array lengths in extractFieldsFromRecord
This pull request refactors and improves the annotation extraction logic for VCF and SnpEff VCF files, optimizes TSV annotation field extraction, and enhances test coverage. The main focus is on making the annotation extraction more robust, efficient, and maintainable, especially for cases with multiple consequences or large INFO fields.
VCF and SnpEff VCF annotation improvements:
AnnotationSourceSnpEffVCFto extract all relevant consequences per gene, not just the first, and to support multiple consequences per alternate allele. This results in more comprehensive and accurate annotation extraction. [1] [2]getAnnotationand related methods, reducing code duplication and improving maintainability. [1] [2]Performance and maintainability enhancements:
AnnotationSourceVCFto efficiently handle large fields or requests for many annotation fields by parsing the INFO field once into a map when needed. [1] [2] [3]AnnotationSourceTSVby precomputing field name and position arrays for faster lookup during annotation extraction. [1] [2] [3]Testing improvements:
ChrPositionRefAltutilities.Minor code cleanups:
contains(",")replaced with more efficientindexOf(',') >= 0checks).These changes collectively improve the accuracy, efficiency, and maintainability of the annotation extraction logic across VCF, SnpEff VCF, and TSV sources.