Skip to content

Add an "ascii character" type to reduce unsafe needs in conversions #179

@scottmcm

Description

@scottmcm

Proposal

Problem statement

For individual bytes, the backend can plausibly be expected to optimize things when it knows that a char came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on unsafe blocks because the safe equivalents have material performance impact.

We should offer a nice way that people can write such "the problem fundamentally only produces ASCII Strings" code without needing unsafe and without needing spurious UTF-8 validation checks.

After all, to quote std::ascii,

However, at times it makes more sense to only consider the ASCII character set for a specific operation.

Motivation, use-cases

I was reminded about this by this comment in rust-lang/rust#105076:

    pub fn as_str(&self) -> &str {
        // SAFETY: self.data[self.alive] is all ASCII characters.
        unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
    }

But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an examplar of problems where problem is fundamentally only producing ASCII. But the code is doing a String::from_utf8(outdata).unwrap() at the end because other options aren't great.

One might say "oh, just build a String as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a Vec<u8> generates substantially less complicated code than pushing one to a String (https://rust.godbolt.org/z/xMYxj5WYr) since a 0..=255 USV might still take 2 bytes in UTF-8. That problem could be worked around with a BString instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a &str or String.

There should be a core type for an individual ASCII character so that having proven to LLVM at the individual character level that things are in-range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.

[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like .filter_map(AsciiChar::new).

[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 about how on embedded the "is ascii" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size on embedded. cc @kupiakos

Solution sketches

In core::ascii,

/// One of the 128 Unicode characters from U+0000 through U+007F, often known as
/// the [ASCII](https://www.unicode.org/glossary/index.html#ASCII) subset.
///
/// AKA the characters codes from ANSI X3.4-1977, ISO 646-1973,
/// or [NIST FIPS 1-2](https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-1977.pdf).
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127);

impl Debug for Char {}
impl Display for Char {}

impl Char {
    const fn new(c: char) -> Option<Self> {}
    const fn from_u8(x: u8) -> Option<Self> {}
    const unsafe fn from_u8_unchecked(x: u8) -> Self {}
}

impl From<Char> for char {}
impl From<&[Char]> for &str {}

In alloc::string:

impl From<Vec<ascii::Char>> for String {}

^ this From being the main idea of the whole thing

Safe code can Char::new(…).unwrap() since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq) or they can do it in consts, then use the non-reallocating infallible Froms later if they need Strings or &strs.

Other possibilities

I wouldn't put any of these in an initial PR, but as related things

  • This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.
  • There could be associated constants (or variants) named ACK and DEL and such.
  • Lots of AsRef<str>s are possible, like on ascii::Char itself or arrays/vectors thereof
    • And potentially AsRef<[u8]>s too
  • Additional methods like String::push_ascii
  • More implementations like String: Extend<ascii::Char> or String: FromIterator<ascii::Char>
  • Checked conversions (using the well-known ASCII fast paths) from &str (or &[u8]) back to &[ascii::Char]
  • The base85 example would really like a [u8; N] -> [ascii::Char; N] operation it can use in a const so it can have something like const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();. Long-term that's called array::map and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.

And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like ascii::AsciiChar, despite the stuttering name, to avoid Char vs char confusion?

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

Metadata

Metadata

Assignees

No one assigned

    Labels

    ACP-acceptedAPI Change Proposal is accepted (seconded with no objections)T-libs-apiapi-change-proposalA proposal to add or alter unstable APIs in the standard libraries

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions