Skip to content

HBASE-30115 Introduce approximate progress estimation for TableRecordReader based on row key position#8134

Open
jinhyukify wants to merge 3 commits intoapache:masterfrom
jinhyukify:HBASE-30115
Open

HBASE-30115 Introduce approximate progress estimation for TableRecordReader based on row key position#8134
jinhyukify wants to merge 3 commits intoapache:masterfrom
jinhyukify:HBASE-30115

Conversation

@jinhyukify
Copy link
Copy Markdown
Contributor

if (row == null) {
return 0;
}
double d = 0;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double avoids unsigned arithmetic issues. Java has no unsigned long, so interpreting row key bytes as a raw long would be cumbersome. With double, all values are naturally non-negative and standard arithmetic just works.

Comment on lines +181 to +182
LOG.warn("Failed to probe first row for progress estimation", e);
return null;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are any issues with the scan, this will simply report 0 progress.

@junegunn
Copy link
Copy Markdown
Member

junegunn commented May 4, 2026

Thanks, here are my initial thoughts.

Pluggable RowKeyProgress feels like overkill

Allowing a custom class gives maximum flexibility, but I think few users will author and deploy a custom class just for an optional progress signal. It costs us an @InterfaceAudience.Public interface to maintain. Could we drop it and pick the algorithm via a simple string config (e.g. uniform / hex)?

Please let me know if you have a plan to use a custom implementation.

Naming could align with RegionSplitter

Users already know HexStringSplit, UniformSplit, and DecimalStringSplit from RegionSplitter. But this PR introduces new vocabulary (HexPrefix, ByteBased) for the similar, albeit not identical, concepts. Can we reuse the existing words?

Generalize HexPrefix to handle any-length hex keys (HexString?)

Would it be possible to simplify your HexPrefix implementation so that users don't have to specify a prefix? Some tables might have rowkeys that consist entirely of hex strings, and it can be confusing for them to specify a prefix. I think we can simplify the model and clamp bytes that go beyond the hex string boundary. For example:

  • startrow: a1
  • rowkey: bea\x200\x201\x202
  • stoprow: cf123

becomes

  • a10000
  • beafff
  • cf1230

for comparison.

@jinhyukify
Copy link
Copy Markdown
Contributor Author

@junegunn Thank you for your feedback.

Pluggable RowKeyProgress feels like overkill

Thank you for raising this! I agree the abstraction comes with a real maintenance cost, and a string-config switch is appealing.

The reason I'd like to keep the pluggable interface is that we have a concrete in-house use case where the row-key salt follows a [a-z][0-9]{N} pattern. Neither uniform nor hex reports a sensible curve for that schema (the byte-level interpretation under-reports through most of the alphabet, and the hex interpretation collapses every key past f to zero). With the pluggable interface, we can just ship a small custom implementation. Otherwise, users with similar non-uniform (but not quite hex) schemas would be stuck needing a follow-up patch.

Naming could align with RegionSplitter

Great point. I hadn't connected it to RegionSplitter clearly enough. Renamed:

  • ByteBasedRowKeyProgressUniformRowKeyProgress
  • HexPrefixRowKeyProgressHexStringRowKeyProgress

Generalize HexStringRowKeyProgress to handle any-length hex keys (HexString?)

Thank you, this was the right call. Removed the hbase.mapreduce.rowkey.progress.hex.prefix.length config. the class now auto-derives the prefix length from the start/stop rows (longest common prefix + a small resolution margin, capped to fit double precision). Non-hex trailing bytes contribute zero:

  • startrow a1a10000
  • rowkey bea\x20\x20\x21\x22bea0000
  • stoprow cf123cf1230

No prefix length to tune, and works for both pure-hex and hex-prefixed strings.

@jinhyukify
Copy link
Copy Markdown
Contributor Author

I've just fixed test failures in TestTableRecordReader and TestTableInputFormat

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR (HBASE-30115) adds an approximate progress estimation mechanism for TableRecordReader by mapping the last-read row key into a normalized fraction of the scan’s start/stop key range. This improves MapReduce task progress reporting without requiring tuple counting.

Changes:

  • Added a pluggable RowKeyProgress interface with default (UniformRowKeyProgress) and hex-specific (HexStringRowKeyProgress) implementations.
  • Updated TableRecordReaderImpl#getProgress() to return an estimated fraction based on the last successfully read row key (with optional probing for empty start/stop bounds).
  • Added unit tests for both progress implementations.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableRecordReaderImpl.java Initializes a RowKeyProgress estimator and uses it to report approximate scan progress; includes probing logic for empty bounds.
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/RowKeyProgress.java Introduces the progress-estimation SPI and configuration key.
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/UniformRowKeyProgress.java Default progress estimator treating row keys as big-endian unsigned byte sequences.
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HexStringRowKeyProgress.java Progress estimator for ASCII hex-encoded row keys (e.g., hash prefixes).
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestUniformRowKeyProgress.java Unit tests for UniformRowKeyProgress.
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestHexStringRowKeyProgress.java Unit tests for HexStringRowKeyProgress.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +189 to +197
private byte[] probeLastRow() {
try {
Scan probeScan = new Scan(scan);
probeScan.setReversed(true);
probeScan.setOneRowLimit();
try (ResultScanner probeScanner = htable.getScanner(probeScan)) {
Result result = probeScanner.next();
return result != null ? result.getRow() : null;
}
}
if (b >= 'a' && b <= 'f') {
return 10 + (b - 'a');
}
Comment on lines +36 to +37
* Hex characters past the start/stop divergence point to include for resolution. 4 hex chars = 65
* 536 buckets, finer than any progress bar can display.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants