Fix Uppercase/Lowercase intrinsic types to apply Unicode special case mappings #3930
Copilot wants to merge 5 commits into
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/7ac37be7-07a1-49c6-be87-36aa6ffa208f
Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
… mappings

Implement JavaScript-compatible full Unicode case mapping for intrinsic string types (Uppercase, Lowercase, Capitalize, Uncapitalize). Go's strings.ToUpper/ToLower use simple case mapping (1:1), while JavaScript's toUpperCase()/toLowerCase() use full Unicode case mapping from SpecialCasing.txt, where a single character can map to multiple characters.

Key examples fixed:
- Uppercase<"ß"> now correctly resolves to "SS" (not "ß")
- Lowercase<"İ"> now correctly resolves to "i̇" (not "i")
- Uppercase<"ﬁ"> now correctly resolves to "FI" (not "ﬁ")

Fixes #3489

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/7ac37be7-07a1-49c6-be87-36aa6ffa208f
Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
// upperSpecialCasings contains unconditional special case mappings for toUpperCase
// from Unicode SpecialCasing.txt. These are cases where a single code point maps
// to multiple code points when uppercased, matching JavaScript's behavior.
var upperSpecialCasings = map[rune]string{
@copilot Where did you get this, and how can we be sure to keep this up to date? Is this not something that is exposed from the unicode package?
The original table was manually extracted from Unicode's SpecialCasing.txt. You're right that maintaining it by hand is fragile. golang.org/x/text/cases (already a project dependency) handles this: cases.Upper(language.Und) and cases.Lower(language.Und) perform full Unicode case mapping that matches JS behavior, and they stay current via the x/text module. Refactored in e3ee094.
…-rolled table

Replace the manually maintained Unicode SpecialCasing.txt table with the golang.org/x/text/cases package (already a project dependency), which provides full Unicode case mapping that matches JavaScript's behavior and stays up to date automatically via the x/text module.

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/47f91f8b-f932-4212-8285-794b2ee100e8
Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
upperCaser = cases.Upper(language.Und)
lowerCaser = cases.Lower(language.Und)
@copilot We should avoid this as much as possible; special case ASCII so we aren't always going through this, and make loading these lazy via sync.OnceValue.
Added ASCII fast paths for all four functions — ASCII-only strings use strings.ToUpper/strings.ToLower directly (or inline byte manipulation for first-rune variants). The cases.Upper/cases.Lower casers are now initialized lazily via sync.OnceValue so the x/text tables are only loaded if non-ASCII input is actually encountered. Done in 53de6b0.
ASCII strings use strings.ToUpper/ToLower directly, avoiding x/text/cases overhead. The x/text casers are initialized lazily via sync.OnceValue so tables are only loaded when non-ASCII input is encountered.

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/76d200c6-8d8e-4f2c-875d-f8712343859f
Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
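The combination of an ASCII fast path and lazy initialization described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the PR: to stay within the standard library, a tiny hypothetical table (`upperTable`) stands in for the x/text caser, and `isASCII`, `toUpperCase`, and `tableLoads` are names chosen for this example. `sync.OnceValue` requires Go 1.21 or later.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"unicode"
	"unicode/utf8"
)

// tableLoads counts how often the table is built (demonstration only).
var tableLoads int

// upperTable is a hypothetical stand-in for the x/text caser. sync.OnceValue
// defers building it until the first call and caches the result afterwards.
var upperTable = sync.OnceValue(func() map[rune]string {
	tableLoads++
	return map[rune]string{
		'\u00DF': "SS", // ß
		'\uFB01': "FI", // ﬁ
	}
})

// isASCII reports whether every byte of s is below utf8.RuneSelf (0x80).
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// toUpperCase takes the cheap path for ASCII input and touches the lazily
// built table only when a non-ASCII byte is present.
func toUpperCase(s string) string {
	if isASCII(s) {
		return strings.ToUpper(s)
	}
	table := upperTable()
	var b strings.Builder
	for _, r := range s {
		if mapped, ok := table[r]; ok {
			b.WriteString(mapped)
		} else {
			b.WriteRune(unicode.ToUpper(r))
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toUpperCase("hello"), tableLoads)  // HELLO 0
	fmt.Println(toUpperCase("straße"), tableLoads) // STRASSE 1
	fmt.Println(toUpperCase("\uFB01x"), tableLoads) // FIX 1
}
```

The counter shows the intended effect: ASCII-only input never triggers the load, and repeated non-ASCII input triggers it once.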
Go's strings.ToUpper/strings.ToLower use simple Unicode case mapping (1 rune → 1 rune), while JavaScript's toUpperCase()/toLowerCase() use full case mapping from Unicode SpecialCasing.txt (1 char → potentially multiple chars). This caused intrinsic string types to produce wrong results for ~103 code points.

- Added internal/checker/stringcase.go with toUpperCase/toLowerCase and first-rune variants using golang.org/x/text/cases with the language.Und locale, which provides full Unicode case mapping matching JavaScript's behavior and stays up to date automatically via the x/text module
- ASCII-only strings use strings.ToUpper/strings.ToLower directly, avoiding the x/text/cases overhead for the common case
- The x/text/cases casers are lazily initialized via sync.OnceValue so Unicode tables are only loaded when non-ASCII input is actually encountered
- Updated applyStringMapping to use these instead of strings.ToUpper/strings.ToLower
- Added tests covering ß, İ, ligatures (ﬁ, ﬂ, ﬀ), Capitalize, Uncapitalize, and mixed strings
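The first-rune variants behind Capitalize might be structured along the lines below. This is a hedged sketch rather than the PR's implementation: `capitalize` and `standInUpper` are hypothetical names, the full-mapping function is passed in as a parameter, and `standInUpper` only special-cases ß in place of a real x/text caser.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// capitalize uppercases only the first rune of s. ASCII first bytes are
// handled inline; anything else is delegated to fullUpper, which is assumed
// to implement full Unicode case mapping.
func capitalize(s string, fullUpper func(string) string) string {
	if s == "" {
		return s
	}
	if c := s[0]; c < utf8.RuneSelf {
		if 'a' <= c && c <= 'z' {
			// ASCII lowercase and uppercase letters differ by a fixed offset.
			return string(c-('a'-'A')) + s[1:]
		}
		return s
	}
	// Non-ASCII: map the first rune on its own, since it may expand.
	_, size := utf8.DecodeRuneInString(s)
	return fullUpper(s[:size]) + s[size:]
}

// standInUpper is a hypothetical substitute for a full-mapping caser. It
// knows a single special case and otherwise uses simple mapping.
func standInUpper(s string) string {
	if s == "ß" {
		return "SS"
	}
	return strings.ToUpper(s)
}

func main() {
	fmt.Println(capitalize("hello", standInUpper)) // Hello
	fmt.Println(capitalize("élan", standInUpper))  // Élan
	fmt.Println(capitalize("ßeta", standInUpper))  // SSeta
}
```

Mapping the first rune separately keeps the rest of the string untouched even when the first rune expands to more than one character.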