HBASE-30115 Introduce approximate progress estimation for TableRecordReader based on row key position by jinhyukify · Pull Request #8134 · apache/hbase

jinhyukify · 2026-04-24T16:51:04Z

Jira https://issues.apache.org/jira/browse/HBASE-30115

…Reader based on row key position

jinhyukify · 2026-04-24T16:56:50Z

+    if (row == null) {
+      return 0;
+    }
+    double d = 0;


double avoids unsigned arithmetic issues. Java has no unsigned long, so interpreting row key bytes as a raw long would be cumbersome. With double, all values are naturally non-negative and standard arithmetic just works.

jinhyukify · 2026-04-24T16:57:48Z

+      LOG.warn("Failed to probe first row for progress estimation", e);
+      return null;


If there are any issues with the scan, this will simply report 0 progress.

junegunn · 2026-05-04T02:56:54Z

Thanks, here are my initial thoughts.

Pluggable `RowKeyProgress` feels like overkill

Allowing a custom class gives maximum flexibility, but I think few users will author and deploy a custom class just for an optional progress signal. It costs us an @InterfaceAudience.Public interface to maintain. Could we drop it and pick the algorithm via a simple string config (e.g. uniform / hex)?

Please let me know if you have a plan to use a custom implementation.

Naming could align with `RegionSplitter`

Users already know HexStringSplit, UniformSplit, and DecimalStringSplit from RegionSplitter. But this PR introduces new vocabulary (HexPrefix, ByteBased) for the similar, albeit not identical, concepts. Can we reuse the existing words?

Generalize `HexPrefix` to handle any-length hex keys (`HexString`?)

Would it be possible to simplify your HexPrefix implementation so that users don't have to specify a prefix? Some tables might have rowkeys that consist entirely of hex strings, and it can be confusing for them to specify a prefix. I think we can simplify the model and clamp bytes that go beyond the hex string boundary. For example:

startrow: a1
rowkey: bea\x200\x201\x202
stoprow: cf123

becomes

a10000
beafff
cf1230

for comparison.

…h config

jinhyukify · 2026-05-04T11:39:24Z

@junegunn Thank you for your feedback.

Pluggable `RowKeyProgress` feels like overkill

Thank you for raising this! I agree the abstraction comes with a real maintenance cost, and a string-config switch is appealing.

The reason I'd like to keep the pluggable interface is that we have a concrete in-house use case where the row-key salt follows a [a-z][0-9]{N} pattern. Neither uniform nor hex reports a sensible curve for that schema (the byte-level interpretation under-reports through most of the alphabet, and the hex interpretation collapses every key past f to zero). With the pluggable interface, we can just ship a small custom implementation. Otherwise, users with similar non-uniform (but not quite hex) schemas would be stuck needing a follow-up patch.

Naming could align with `RegionSplitter`

Great point. I hadn't connected it to RegionSplitter clearly enough. Renamed:

ByteBasedRowKeyProgress → UniformRowKeyProgress
HexPrefixRowKeyProgress → HexStringRowKeyProgress

Generalize `HexStringRowKeyProgress` to handle any-length hex keys (HexString?)

Thank you, this was the right call. Removed the hbase.mapreduce.rowkey.progress.hex.prefix.length config. the class now auto-derives the prefix length from the start/stop rows (longest common prefix + a small resolution margin, capped to fit double precision). Non-hex trailing bytes contribute zero:

startrow a1 → a10000
rowkey bea\x20\x20\x21\x22 → bea0000
stoprow cf123 → cf1230

No prefix length to tune, and works for both pure-hex and hex-prefixed strings.

jinhyukify · 2026-05-06T03:31:04Z

I've just fixed test failures in TestTableRecordReader and TestTableInputFormat

Copilot

Pull request overview

This PR (HBASE-30115) adds an approximate progress estimation mechanism for TableRecordReader by mapping the last-read row key into a normalized fraction of the scan’s start/stop key range. This improves MapReduce task progress reporting without requiring tuple counting.

Changes:

Added a pluggable RowKeyProgress interface with default (UniformRowKeyProgress) and hex-specific (HexStringRowKeyProgress) implementations.
Updated TableRecordReaderImpl#getProgress() to return an estimated fraction based on the last successfully read row key (with optional probing for empty start/stop bounds).
Added unit tests for both progress implementations.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableRecordReaderImpl.java	Initializes a `RowKeyProgress` estimator and uses it to report approximate scan progress; includes probing logic for empty bounds.
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/RowKeyProgress.java	Introduces the progress-estimation SPI and configuration key.
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/UniformRowKeyProgress.java	Default progress estimator treating row keys as big-endian unsigned byte sequences.
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HexStringRowKeyProgress.java	Progress estimator for ASCII hex-encoded row keys (e.g., hash prefixes).
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestUniformRowKeyProgress.java	Unit tests for `UniformRowKeyProgress`.
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestHexStringRowKeyProgress.java	Unit tests for `HexStringRowKeyProgress`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  private byte[] probeLastRow() {
+    try {
+      Scan probeScan = new Scan(scan);
+      probeScan.setReversed(true);
+      probeScan.setOneRowLimit();
+      try (ResultScanner probeScanner = htable.getScanner(probeScan)) {
+        Result result = probeScanner.next();
+        return result != null ? result.getRow() : null;
+      }


+    }
+    if (b >= 'a' && b <= 'f') {
+      return 10 + (b - 'a');
+    }


+   * Hex characters past the start/stop divergence point to include for resolution. 4 hex chars = 65
+   * 536 buckets, finer than any progress bar can display.


HBASE-30115 Introduce approximate progress estimation for TableRecord…

468a39b

…Reader based on row key position

jinhyukify commented Apr 24, 2026

View reviewed changes

HBASE-30115 Align names with RegionSplitter and drop hex prefix lengt…

400c39b

…h config

HBASE-30115 Fix failing tests

22c6baa

junegunn requested a review from Copilot May 6, 2026 06:18

Copilot started reviewing on behalf of junegunn May 6, 2026 06:19 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HBASE-30115 Introduce approximate progress estimation for TableRecordReader based on row key position#8134

HBASE-30115 Introduce approximate progress estimation for TableRecordReader based on row key position#8134
jinhyukify wants to merge 3 commits intoapache:masterfrom
jinhyukify:HBASE-30115

jinhyukify commented Apr 24, 2026

Uh oh!

jinhyukify Apr 24, 2026

Uh oh!

jinhyukify Apr 24, 2026

Uh oh!

junegunn commented May 4, 2026 •

edited

Loading

Uh oh!

jinhyukify commented May 4, 2026

Uh oh!

jinhyukify commented May 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		LOG.warn("Failed to probe first row for progress estimation", e);
		return null;

		* Hex characters past the start/stop divergence point to include for resolution. 4 hex chars = 65
		* 536 buckets, finer than any progress bar can display.

Conversation

jinhyukify commented Apr 24, 2026

Uh oh!

jinhyukify Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

jinhyukify Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

junegunn commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pluggable RowKeyProgress feels like overkill

Naming could align with RegionSplitter

Generalize HexPrefix to handle any-length hex keys (HexString?)

Uh oh!

jinhyukify commented May 4, 2026

Pluggable RowKeyProgress feels like overkill

Naming could align with RegionSplitter

Generalize HexStringRowKeyProgress to handle any-length hex keys (HexString?)

Uh oh!

jinhyukify commented May 6, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

junegunn commented May 4, 2026 •

edited

Loading

Pluggable `RowKeyProgress` feels like overkill

Naming could align with `RegionSplitter`

Generalize `HexPrefix` to handle any-length hex keys (`HexString`?)

Pluggable `RowKeyProgress` feels like overkill

Naming could align with `RegionSplitter`

Generalize `HexStringRowKeyProgress` to handle any-length hex keys (HexString?)