Skip to content

Conversation

@ChrisDryden
Copy link
Collaborator

This PR is fairly large in scope and has a few design decisions that probably warrant some additional discussion. I have been prototyping many approaches to how to deal with the issue of getting the multi-byte character sequence lengths and all of the solutions I could think of that would call the libc implementation ultimately would take a bunch of unsafe code and require FFI's and still create compatibility issues when it comes to different platforms not supporting that API.

To work backwards from what the goal was, I went through all of the glibc locales to see which ones actually had multi-byte decoding rules that were unique from UTF-8 and how many of those overlapped with one another. That led to to discover that currently there are only 5 different rule sets for this calculation and that you can determine which encoding to use from the env variables for which locale to use. Then it appears that some of the locales are hardcoded to specific encodings and the only two that I could find when searching all of the glibc locales was "zh_CN" | "zh_SG" and "zh_TW" | "zh_HK"

There were two new GNU tests that were related to this type of locale stuff: paste.pl and multi-byte.sh that this PR addresses.

This can help with the other issues in the queue:
#10184
#9712
#7600
#3075
#5831
#3997

@ChrisDryden ChrisDryden force-pushed the paste-multibyte-delimiters branch from 5865c10 to 321cb08 Compare February 9, 2026 19:43
@ChrisDryden
Copy link
Collaborator Author

Seems like github is down?

@oech3

This comment was marked as outdated.

@ChrisDryden ChrisDryden force-pushed the paste-multibyte-delimiters branch from 321cb08 to b084aaa Compare February 9, 2026 19:51
@oech3
Copy link
Contributor

oech3 commented Feb 9, 2026

I think you can re-run 1 job only instead of force-pushing,

@ChrisDryden ChrisDryden force-pushed the paste-multibyte-delimiters branch from b084aaa to 1650f0e Compare February 9, 2026 20:06
grep -rl '\$abs_path_dir_' tests/*/*.sh | xargs -r "${SED}" -i "s|\$abs_path_dir_|${UU_BUILD_DIR//\//\\/}|g"
# Some tests use $abs_top_builddir/src for shebangs: point them to the uutils build dir.
grep -rl '\$abs_top_builddir/src' tests/*/*.sh tests/*/*.pl | xargs -r "${SED}" -i "s|\$abs_top_builddir/src|${UU_BUILD_DIR//\//\\/}|g"
grep -rl '\$ENV{abs_top_builddir}/src' tests/*/*.pl | xargs -r "${SED}" -i "s|\$ENV{abs_top_builddir}/src|${UU_BUILD_DIR//\//\\/}|g"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pixelb Additional abs_top_builddir

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops that is unrelated but it changes two skips to passes, was working on that locally and it got added

Copy link
Contributor

@oech3 oech3 Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But does not mean that we are not using uutils binaries at here if we don't sed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the logs I think we deleted the gnu coreutils binaries from that env so it means that it just skips because its unable to find a binary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simply symlink our bins to abs_top_builddir for all tests at once?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions
Copy link

github-actions bot commented Feb 9, 2026

GNU testsuite comparison:

GNU test failed: tests/pr/bounded-memory. tests/pr/bounded-memory is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/env/env-S is no longer failing!
Congrats! The gnu test tests/install/basic-1 is no longer failing!
Congrats! The gnu test tests/paste/multi-byte is no longer failing!
Congrats! The gnu test tests/paste/paste is no longer failing!
Congrats! The gnu test tests/pwd/pwd-long is no longer failing!
Note: The gnu test tests/misc/coreutils is now being skipped but was previously passing.
Congrats! The gnu test tests/env/env is now passing!
Congrats! The gnu test tests/env/env-S-script is now passing!
Congrats! The gnu test tests/tail/tail-n0f is now passing!

@ChrisDryden ChrisDryden force-pushed the paste-multibyte-delimiters branch from 1650f0e to 143fec3 Compare February 9, 2026 21:22
@github-actions
Copy link

github-actions bot commented Feb 9, 2026

GNU testsuite comparison:

Congrats! The gnu test tests/env/env-S is no longer failing!
Congrats! The gnu test tests/install/basic-1 is no longer failing!
Congrats! The gnu test tests/paste/multi-byte is no longer failing!
Congrats! The gnu test tests/paste/paste is no longer failing!
Congrats! The gnu test tests/pwd/pwd-long is no longer failing!
Note: The gnu test tests/misc/coreutils is now being skipped but was previously passing.
Congrats! The gnu test tests/env/env is now passing!
Congrats! The gnu test tests/env/env-S-script is now passing!
Congrats! The gnu test tests/tail/tail-n0f is now passing!

@sylvestre
Copy link
Contributor

oh, nice

/// Returns 1 for empty, invalid, or incomplete sequences.
pub fn mb_char_len(bytes: &[u8]) -> usize {
if bytes.is_empty() {
return 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning 1 when empty is confusing
can you please document why ?

translate!("paste-error-delimiter-unescaped-backslash", "delimiters" => delimiters),
));
fn parse_delimiters(delimiters: &OsString) -> UResult<Box<[Box<[u8]>]>> {
let bytes = uucore::os_string_to_vec(delimiters.clone())?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use OsStr::as_bytes() on Unix or borrowing instead of cloning the OsString ?

_ => {
// Unknown escape: strip backslash, use the following character(s)
let len = mb_char_len(&bytes[i..]);
vec.push(Box::from(&bytes[i..i + len]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential panic if mb_char_len returns non-zero but bytes[i..] is shorter than returned length, no?

@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/env/env-S is no longer failing!
Congrats! The gnu test tests/install/basic-1 is no longer failing!
Congrats! The gnu test tests/paste/multi-byte is no longer failing!
Congrats! The gnu test tests/paste/paste is no longer failing!
Congrats! The gnu test tests/pr/bounded-memory is no longer failing!
Congrats! The gnu test tests/pwd/pwd-long is no longer failing!
Note: The gnu test tests/misc/coreutils is now being skipped but was previously passing.
Congrats! The gnu test tests/env/env is now passing!
Congrats! The gnu test tests/env/env-S-script is now passing!

@sylvestre sylvestre merged commit b4a4a38 into uutils:main Feb 10, 2026
157 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants