diff --git a/dev-guide/src/grammar.md b/dev-guide/src/grammar.md
index 2d9b22756d..af1c23ba6e 100644
--- a/dev-guide/src/grammar.md
+++ b/dev-guide/src/grammar.md
@@ -154,5 +154,19 @@ The [`mdbook-spec`] plugin automatically adds Markdown link definitions for all
 
 In some cases, there might be name collisions with the automatic linking of rule names. In that case, disambiguate with the `grammar-` prefix, such as `[Type][grammar-Type]`. The prefix can also be used when explicitness would aid clarity.
 
+Production names can also be used in link reference definitions to provide custom link text, both with and without the `grammar-` prefix.
+
+```markdown
+We accept any [type].
+
+[type]: grammar-Type
+```
+
+```markdown
+We accept any [type].
+
+[type]: Type
+```
+
 [`mdbook-spec`]: tooling/mdbook-spec.md
 [Notation]: https://doc.rust-lang.org/nightly/reference/notation.html
diff --git a/dev-guide/src/links.md b/dev-guide/src/links.md
index fb2199c0dc..20458758a6 100644
--- a/dev-guide/src/links.md
+++ b/dev-guide/src/links.md
@@ -74,6 +74,10 @@ Link definitions are automatically generated for all grammar production names. S
 This attribute uses the [MetaWord] syntax.
 
 Explicit grammar links can have the `grammar-` prefix like [Type][grammar-Type].
+
+Grammar links can also appear in link reference definitions, e.g. [type].
+
+[type]: grammar-Type
 ```
 
 ## Outside book links
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 49caabfdd8..b0c390ae50 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -6,6 +6,7 @@
 
 - [Lexical structure](lexical-structure.md)
     - [Input format](input-format.md)
+    - [Frontmatter](frontmatter.md)
     - [Keywords](keywords.md)
     - [Identifiers](identifiers.md)
     - [Comments](comments.md)
diff --git a/src/frontmatter.md b/src/frontmatter.md
new file mode 100644
index 0000000000..57aca4a351
--- /dev/null
+++ b/src/frontmatter.md
@@ -0,0 +1,66 @@
+r[frontmatter]
+# Frontmatter
+
+r[frontmatter.syntax]
+```grammar,lexer
+@root FRONTMATTER ->
+    WHITESPACE_ONLY_LINE*
+    !FRONTMATTER_INVALID
+    FRONTMATTER_MAIN
+
+WHITESPACE_ONLY_LINE -> (!LF WHITESPACE)* LF
+
+FRONTMATTER_INVALID -> (!LF WHITESPACE)+ `---` ^ ⊥
+
+FRONTMATTER_MAIN ->
+    `-`{n:3..=255} ^ FRONTMATTER_REST
+
+FRONTMATTER_REST ->
+    FRONTMATTER_FENCE_START
+    FRONTMATTER_LINE*
+    FRONTMATTER_FENCE_END
+
+FRONTMATTER_FENCE_START ->
+    MAYBE_INFOSTRING_OR_WS LF
+
+FRONTMATTER_FENCE_END ->
+    `-`{n} HORIZONTAL_WHITESPACE* ( LF | EOF )
+
+FRONTMATTER_LINE -> !`-`{n} ~[LF CR]* LF
+
+MAYBE_INFOSTRING_OR_WS ->
+    HORIZONTAL_WHITESPACE* INFOSTRING? HORIZONTAL_WHITESPACE*
+
+INFOSTRING -> (XID_Start | `_`) ( XID_Continue | `-` | `.` )*
+```
+
+r[frontmatter.intro]
+Frontmatter is an optional section of metadata whose syntax allows external tools to read it without parsing Rust.
+
+> [!EXAMPLE]
+>
+> ```rust,ignore
+> #!/usr/bin/env cargo
+> --- cargo
+> package.edition = 2024
+> ---
+>
+> fn main() {}
+> ```
+
+r[frontmatter.position]
+Frontmatter may appear at the start of the file (after the optional [byte order mark]) or after a [shebang]. In either case, it may be preceded by [whitespace].
+
+r[frontmatter.fence]
+Frontmatter must start and end with a *fence*. Each fence must start at the beginning of a line. The opening fence must consist of at least 3 and no more than 255 hyphens (`-`). The closing fence must have exactly the same number of hyphens as the opening fence. The hyphens of either fence may be followed by [horizontal whitespace].
+
+r[frontmatter.infostring]
+The opening fence, after optional [horizontal whitespace], may be followed by an infostring that identifies the format or purpose of the body. An infostring may be followed by horizontal whitespace.
+
+r[frontmatter.body]
+No line in the body may start with a sequence of hyphens (`-`) equal to or longer than the opening fence. The body may not contain carriage returns.
+
+[byte order mark]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
+[horizontal whitespace]: grammar-HORIZONTAL_WHITESPACE
+[shebang]: input-format.md#shebang-removal
+[whitespace]: whitespace.md
diff --git a/src/input-format.md b/src/input-format.md
index 2d7a2124c1..065f827da4 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -69,6 +69,25 @@ The shebang may appear immediately at the start of the file or after the optiona
 r[input.shebang.removal]
 The shebang is removed from the input sequence (and is therefore ignored).
 
+r[input.frontmatter]
+## Frontmatter removal
+
+r[input.frontmatter.removal]
+If the remaining input begins with a [frontmatter] fence, optionally preceded by lines containing only [whitespace], the [frontmatter] and any preceding whitespace are removed.
+
+For example, given the following file:
+
+
+```rust,ignore
+--- cargo
+package.edition = 2024
+---
+
+fn main() {}
+```
+
+The first three lines (the opening fence, body, and closing fence) would be removed, leaving an empty line followed by `fn main() {}`.
+
 r[input.tokenization]
 ## Tokenization
 
@@ -79,7 +98,7 @@ The resulting sequence of characters is then converted into tokens as described
 >
 > - Byte order mark removal.
 > - CRLF normalization.
-> - Shebang removal when invoked in an item context (as opposed to expression or statement contexts).
+> - Shebang and frontmatter removal when invoked in an item context (as opposed to expression or statement contexts).
 >
 > The [`include_str!`] and [`include_bytes!`] macros do not apply these transformations.
 
@@ -88,4 +107,5 @@ The resulting sequence of characters is then converted into tokens as described
 [comments]: comments.md
 [Crates and source files]: crates-and-source-files.md
 [shebang]: https://en.wikipedia.org/wiki/Shebang_(Unix)
+[frontmatter]: frontmatter.md
 [whitespace]: whitespace.md
diff --git a/src/items/modules.md b/src/items/modules.md
index 3cc015025b..2164051f84 100644
--- a/src/items/modules.md
+++ b/src/items/modules.md
@@ -123,7 +123,7 @@ r[items.mod.attributes]
 ## Attributes on modules
 
 r[items.mod.attributes.intro]
-Modules, like all items, accept outer attributes. They also accept inner attributes: either after `{` for a module with a body, or at the beginning of the source file, after the optional BOM and shebang.
+Modules, like all items, accept outer attributes. They also accept inner attributes: either after `{` for a module with a body, or at the beginning of the source file, after the optional BOM, shebang, and frontmatter.
 
 r[items.mod.attributes.supported]
 The built-in attributes that have meaning on a module are [`cfg`], [`deprecated`], [`doc`], [the lint check attributes], [`path`], and [`no_implicit_prelude`]. Modules also accept macro attributes.
diff --git a/src/notation.md b/src/notation.md
index 7537c67ddc..fc98c36462 100644
--- a/src/notation.md
+++ b/src/notation.md
@@ -45,6 +45,20 @@ Mizushima et al. introduced [cut operators][cut operator paper] to parsing expre
 
 The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.
 
+r[notation.grammar.bottom]
+### The bottom rule
+
+In logic, ⊥ (*bottom*) represents absurdity --- a proposition that is always false. In type theory, it is the *empty type*: a type with no inhabitants. The grammar borrows both senses: the rule ⊥ matches nothing --- not any character, not even the end of input.
+
+```grammar,notation
+// The bottom rule does not match anything.
+⊥ -> !(CHAR | EOF)
+```
+
+Placed after a [hard cut operator], `^ ⊥` makes a rule fail unconditionally once the parser has committed past the cut. This gives the grammar a way to express *recognition without acceptance*: the parser identifies the input, commits so that no other alternative can be tried, and then rejects it. In the frontmatter grammar, for example, [`FRONTMATTER_INVALID`] uses `^ ⊥` to recognize an opening fence preceded by whitespace on the same line --- input that is close enough to frontmatter to rule out other interpretations, but that is not valid.
+
+[`FRONTMATTER_INVALID`]: frontmatter.md#grammar-FRONTMATTER_INVALID
+
 r[notation.grammar.string-tables]
 ### String table productions
 
diff --git a/src/whitespace.md b/src/whitespace.md
index 7e16c51d41..da0d8502b5 100644
--- a/src/whitespace.md
+++ b/src/whitespace.md
@@ -16,6 +16,10 @@ WHITESPACE ->
     | U+2028 // Line separator
     | U+2029 // Paragraph separator
 
+HORIZONTAL_WHITESPACE ->
+      U+0009 // Horizontal tab, `'\t'`
+    | U+0020 // Space, `' '`
+
 TAB -> U+0009 // Horizontal tab, `'\t'`
 
 LF -> U+000A // Line feed, `'\n'`
@@ -26,6 +30,9 @@ CR -> U+000D // Carriage return, `'\r'`
 r[lex.whitespace.intro]
 Whitespace is any non-empty string containing only characters that have the [`Pattern_White_Space`] Unicode property.
 
+r[lex.whitespace.horizontal]
+[HORIZONTAL_WHITESPACE] is the horizontal space subset of [`Pattern_White_Space`] as categorized by [UAX #31, Section 4.1][uax31-4.1].
+
 r[lex.whitespace.token-sep]
 Rust is a "free-form" language, meaning that all forms of whitespace serve only to separate _tokens_ in the grammar, and have no semantic significance.
 
@@ -33,3 +40,4 @@ r[lex.whitespace.replacement]
 A Rust program has identical meaning if each whitespace element is replaced with any other legal whitespace element, such as a single space character.
 
 [`Pattern_White_Space`]: https://www.unicode.org/reports/tr31/
+[uax31-4.1]: https://www.unicode.org/reports/tr31/#Whitespace_and_Syntax
diff --git a/tools/mdbook-spec/src/grammar.rs b/tools/mdbook-spec/src/grammar.rs
index 12ece5df7a..8aa98d3608 100644
--- a/tools/mdbook-spec/src/grammar.rs
+++ b/tools/mdbook-spec/src/grammar.rs
@@ -75,6 +75,47 @@ pub fn insert_grammar(grammar: &Grammar, chapter: &Chapter, diag: &mut Diagnosti
     content
 }
 
+/// Converts link reference definitions that point to a grammar rule
+/// to the correct link.
+///
+/// For example:
+///
+/// ```markdown
+/// We accept any [token].
+///
+/// [token]: grammar-Token
+/// ```
+///
+/// This will convert the `[token]` definition to point
+/// to the actual link.
+///
+/// This supports both a `grammar-` prefixed form (e.g.
+/// `grammar-Token`) and a bare rule name (e.g. `Token`).
+pub fn grammar_link_references(chapter: &Chapter, grammar: &Grammar) -> String {
+    let current_path = chapter.path.as_ref().unwrap().parent().unwrap();
+    let for_summary = is_summary(chapter);
+    crate::MD_LINK_REFERENCE_DEFINITION
+        .replace_all(&chapter.content, |caps: &Captures<'_>| {
+            let dest = &caps["dest"];
+            let name = dest.strip_prefix("grammar-").unwrap_or(dest);
+            if let Some(production) = grammar.productions.get(name) {
+                let label = &caps["label"];
+                let relative = pathdiff::diff_paths(&production.path, current_path).unwrap();
+                // Adjust paths for Windows.
+                let relative = relative.display().to_string().replace('\\', "/");
+                let id = render_markdown::markdown_id(name, for_summary);
+                if for_summary {
+                    format!("[{label}]: #{id}")
+                } else {
+                    format!("[{label}]: {relative}#{id}")
+                }
+            } else {
+                caps.get(0).unwrap().as_str().to_string()
+            }
+        })
+        .to_string()
+}
+
 /// Creates a map of production name -> relative link path.
 fn make_relative_link_map(grammar: &Grammar, chapter: &Chapter) -> HashMap {
     let current_path = chapter.path.as_ref().unwrap().parent().unwrap();
diff --git a/tools/mdbook-spec/src/lib.rs b/tools/mdbook-spec/src/lib.rs
index 918508a6df..b94d296940 100644
--- a/tools/mdbook-spec/src/lib.rs
+++ b/tools/mdbook-spec/src/lib.rs
@@ -168,6 +168,7 @@ impl Preprocessor for Spec {
             }
             ch.content = admonitions::admonitions(&ch, &mut diag);
             ch.content = self.rule_link_references(&ch, &rules);
+            ch.content = grammar::grammar_link_references(&ch, &grammar);
             ch.content = self.auto_link_references(&ch, &rules);
             ch.content = self.render_rule_definitions(&ch.content, &tests, &git_ref);
             if ch.name == "Test summary" {