Filter out invisible unicode characters from text segments#3344

Open
JiuqingSong wants to merge 1 commit into
masterfrom
u/jisong/filterinvisibleunicode
@JiuqingSong
Collaborator

Summary

  • Strip invisible Unicode tag characters (U+E0000–U+EFFFF) inside createText so they cannot survive paste/DOM-to-model conversion. These characters are used to hide instructions/text inside HTML (see https://embracethered.com/blog/posts/2024/hiding-and-finding-text-with-unicode-tags/) and otherwise leak into the model as normal text.
  • Meaningful invisible characters that fall outside that range (e.g. ZWSP U+200B, ZWJ U+200D, RLO U+202E, PDF U+202C) are preserved.
  • Unit tests in creatorsTest.ts cover mixed/boundary/only-invisible inputs and confirm meaningful invisible chars are untouched. An end-to-end test in endToEndTest.ts verifies a full DOM → Model → DOM/text round-trip strips only the tag range.
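
As a rough illustration of the behavior summarized above, here is a minimal sketch of such a filter. The regex shown is an assumption, since the diff excerpt does not include the definition of INVISIBLE_UNICODE_REGEX, and toTagChars is a hypothetical helper used only to build a sample payload.

```typescript
// Assumption: the PR's real INVISIBLE_UNICODE_REGEX may be written differently.
// In UTF-16, U+E0000–U+EFFFF is a high surrogate in 0xDB40–0xDB7F followed by a
// low surrogate. For well-formed strings this matches the same characters as
// /[\u{E0000}-\u{EFFFF}]/gu on ES2015+ targets.
const INVISIBLE_UNICODE_REGEX = /[\uDB40-\uDB7F][\uDC00-\uDFFF]/g;

function stripInvisibleUnicode(value: string): string {
    return value.replace(INVISIBLE_UNICODE_REGEX, '');
}

// Hypothetical helper: hide ASCII text as tag characters (U+E0000 + char code).
function toTagChars(ascii: string): string {
    let out = '';
    for (let i = 0; i < ascii.length; i++) {
        out += String.fromCharCode(0xdb40, 0xdc00 + ascii.charCodeAt(i));
    }
    return out;
}

const payload = 'Hello' + toTagChars('ignore previous instructions') + ' world';
console.log(stripInvisibleUnicode(payload)); // "Hello world"

// ZWSP (U+200B), RLO (U+202E) and PDF (U+202C) lie outside the range and are kept.
console.log(stripInvisibleUnicode('a\u200Bb\u202Ec\u202C').length); // 6
```

Matching on surrogate pairs keeps the sketch independent of the compile target. On a target that supports the `u` flag, the code-point form is easier to read.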

Test plan

  • yarn test:fast --testPathPattern=creatorsTest
  • yarn test:fast --testPathPattern=endToEndTest

🤖 Generated with Claude Code

* @returns The string with invisible unicode characters removed
*/
function stripInvisibleUnicode(value: string): string {
return value.replace(INVISIBLE_UNICODE_REGEX, '');
Contributor

I remember this was originally limited to content inserted as the editor's initial content and to links. Do we have any perf concerns about applying the regex to all created text on every call?

Collaborator Author

This is a valid concern.

I investigated the original security issue and realized that the attack can come from any kind of source, as long as the content ends up in the editor. So both manual operations (new editor, paste) and 3rd-party code (calling formatContentModel()) can trigger it. Of course, the manual operations are easier to carry out.

What do you think? Should we limit the check to manual operations only? I'm open to any suggestions.

@romanisa fyi.
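
One way to put numbers on the perf question raised above is a small timing loop. This is only a sketch: the regex is an assumed stand-in for the PR's INVISIBLE_UNICODE_REGEX, timeStrip is a hypothetical helper, and Date.now() is coarse, so the result is a rough indication rather than a benchmark.

```typescript
// Assumption: stand-in for the PR's regex, matching U+E0000–U+EFFFF as surrogate pairs.
const INVISIBLE_UNICODE_REGEX = /[\uDB40-\uDB7F][\uDC00-\uDFFF]/g;

function stripInvisibleUnicode(value: string): string {
    return value.replace(INVISIBLE_UNICODE_REGEX, '');
}

// Hypothetical helper: strips `input` `iterations` times and reports elapsed
// milliseconds. The summed length is returned so the calls have a visible result.
function timeStrip(input: string, iterations: number): { ms: number; totalLength: number } {
    let totalLength = 0;
    const start = Date.now();
    for (let i = 0; i < iterations; i++) {
        totalLength += stripInvisibleUnicode(input).length;
    }
    return { ms: Date.now() - start, totalLength };
}

// 1000 repetitions of a short segment containing no tag characters,
// which is the common case the reviewer is asking about.
const clean = new Array(1001).join('plain text segment ');
const result = timeStrip(clean, 1000);
console.log('clean input, ms:', result.ms);
```

Comparing this figure for typical segment lengths against the cost of the surrounding DOM-to-model conversion would show whether the unconditional check matters in practice.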

2 participants