
Implement nom-based ESI parser with streaming support #43

Open
vagetman wants to merge 118 commits into main from streaming-processing

vagetman (Collaborator) commented Nov 13, 2025

Summary

Complete rewrite of the ESI parser from XML-based to nom-based parsing with full streaming support, comprehensive expression evaluation, and a rich function library.

New Features

ESI Tags

  • <esi:eval> — fetches content and always parses it as ESI (blocking operation), with dca support for two-phase processing
  • <esi:param> — nested inside include/eval for query parameter injection
  • <esi:foreach> / <esi:break> — iteration over lists and dicts
  • <esi:function> / <esi:return> — user-defined functions with recursion depth control
  • <esi:text> — raw passthrough (content emitted verbatim, no ESI processing)

Expression Features

  • List literals (['a', 'b', 'c']) and dictionary literals ({'key1': 'val1', 'key2': 'val2'})
  • Mixed-type lists: lists can contain strings, integers, dicts, and nested lists (e.g. ['one', 2, ['nested']])
  • Range literals ([1..10])
  • Dynamic subkeys: $(VAR{$(dynamic_key)})
  • Operators: has, has_i, matches, matches_i, +, -, *, /, %
  • Type coercion for operators: mixed-type operands are stringified for comparison (e.g. 3 == '3' is true). The + operator does integer addition when both operands are integers (e.g. 3 + 4 = 7), list concatenation for two lists, and string concatenation otherwise (e.g. 3 + '4' = '34'). The * operator does integer multiplication, or string/list repetition with an integer count (e.g. 3 * 'ab' = 'ababab'). Expressions evaluate left to right, so 2 + 8 + ' days' = '10 days' (integer addition first, then string coercion). The -, /, and % operators require integer operands.
  • Function argument coercion: built-in functions coerce arguments as needed (e.g. $int parses strings to integers, $substr coerces index args)
  • Function calls: $fn_name(args...) with nested calls supported
  • Backslash escaping in strings and interpolated content (\', \\, \$, \<)

Function Library

  • String: $upper, $lstrip, $rstrip, $strip, $substr
  • Encoding: $html_decode, $url_encode, $url_decode, $base64_encode, $base64_decode, $convert_to_unicode, $convert_from_unicode
  • Quote helpers: $dollar, $dquote, $squote
  • Collections: $len, $exists, $is_empty, $index, $rindex, $string_split, $join, $list_delitem
  • Type conversion: $int, $str
  • Crypto: $digest_md5, $digest_md5_hex, $bin_int
  • Time: $time, $http_time, $strftime
  • Random: $rand, $last_rand
  • Response manipulation: $add_header, $set_response_code, $set_redirect
  • User-defined: nesting, recursion, and positional argument passing via <esi:function> / <esi:return>

Streaming Processing

  • Pre-parsed attribute expressions — all include/eval attributes (src, alt, dca, ttl, method, entity, headers, params) are parsed into expression ASTs during parsing, then fully evaluated before each request is dispatched (old code treated src/alt as raw strings with $(VAR) interpolation only)
  • Flat-buffer slot design — all content (text, include responses, try-block outputs) assigned sequential buf slots for ordered flushing
  • Concurrent fragment fetching via fastly::http::request::select — replaces sequential .wait() calls; all pending includes share a single pool and responses are harvested as they arrive while preserving document order
  • Request correlation using RequestKey (method + URL) mapped through url_map to SlotEntry
  • DCA (Dynamic Content Assembly): dca="none" (raw insertion) and dca="esi" (parse response as ESI)
  • Try-block resolution: TryBlockTracker with per-attempt slot tracking, failure propagation, and except-block fallback via assemble_try_block
  • TTL tracking for rendered document caching with CacheConfig
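
To illustrate the flat-buffer slot idea, here is a minimal sketch of ordered flushing: every piece of output reserves a sequential slot, slots may be filled out of order as responses arrive, and only the contiguous filled prefix is written out. `SlotBuffer` and its methods are hypothetical names for this example and are not taken from the PR.

```rust
/// Hypothetical slot buffer: output pieces reserve slots in document order.
pub struct SlotBuffer {
    /// `None` means the slot is still pending (or has already been flushed).
    slots: Vec<Option<String>>,
    /// Index of the first slot that has not been flushed yet.
    next_to_flush: usize,
}

impl SlotBuffer {
    pub fn new() -> Self {
        SlotBuffer { slots: Vec::new(), next_to_flush: 0 }
    }

    /// Reserve the next sequential slot and return its index.
    pub fn reserve(&mut self) -> usize {
        self.slots.push(None);
        self.slots.len() - 1
    }

    /// Fill a slot. Plain text can be filled immediately, include responses later.
    pub fn fill(&mut self, slot: usize, content: String) {
        self.slots[slot] = Some(content);
    }

    /// Append every filled slot to `out`, stopping at the first pending one,
    /// so output always stays in document order.
    pub fn flush_ready(&mut self, out: &mut String) {
        while self.next_to_flush < self.slots.len() {
            match self.slots[self.next_to_flush].take() {
                Some(content) => {
                    out.push_str(&content);
                    self.next_to_flush += 1;
                }
                None => break,
            }
        }
    }
}
```

With three reserved slots where the first and third are filled, `flush_ready` emits only the first; once the middle slot (say, a slow include) is filled, the next call emits the remaining two in order.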

Variable System

  • Types: integer, string, list, dict, boolean, null — with subscript get/set (strings support char index access)
  • Default values: $(VAR|'fallback') — if undefined, the default expression is used
  • Reference semantics: lists and dicts are assigned by reference — mutations are shared across aliases
  • Request headers: $(HTTP_*) maps any HTTP_-prefixed variable name to the corresponding request header; $(HTTP_COOKIE{'name'}) reads individual cookies and $(QUERY_STRING{'param'}) reads individual query parameters
  • Request metadata: $(REQUEST_METHOD), $(REQUEST_PATH), $(REMOTE_ADDR)
  • Function arguments: $(ARGS) / $(ARGS{n}) for positional access
  • Regex captures: $(MATCHES{n}) populated by matches / matches_i operators
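
The reference semantics for lists and dicts can be pictured with shared ownership. The sketch below uses `Rc<RefCell<...>>` as one plausible way to get aliasing behaviour in Rust; whether the PR uses this mechanism is not stated, and `SharedList`, `new_list`, and `alias` are hypothetical names.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical shared list handle: cloning the handle does not copy the elements.
pub type SharedList = Rc<RefCell<Vec<String>>>;

/// Build a new list value from string slices.
pub fn new_list(items: &[&str]) -> SharedList {
    Rc::new(RefCell::new(items.iter().map(|s| s.to_string()).collect()))
}

/// Assigning a list to another variable yields a second handle to the same data.
pub fn alias(list: &SharedList) -> SharedList {
    Rc::clone(list)
}
```

Pushing an element through the alias makes it visible through the original handle as well, which is the "mutations are shared across aliases" behaviour described above.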

Configuration

  • chunk_size — streaming read buffer size
  • function_recursion_depth — max user-defined function call depth
  • CacheConfig — rendered output caching and cache-control header generation

Validation

  • Variable name validation: names must start with an alphabetic character, contain only ASCII alphanumerics and underscores, and be at most 256 characters long
  • Invalid assign names silently skipped (no parse error)
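
A minimal sketch of the name rule, assuming "alpha" means an ASCII letter and that the 256 limit counts bytes (equivalent to characters for ASCII-only names). The function and constant names are hypothetical.

```rust
/// Maximum variable name length stated in the PR description.
pub const MAX_VAR_NAME_LEN: usize = 256;

/// Returns true if `name` satisfies the rule sketched above.
/// Assumption: the leading character must be an ASCII letter.
pub fn is_valid_var_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) if first.is_ascii_alphabetic() => {}
        _ => return false, // empty, or starts with a digit/underscore/other
    }
    name.len() <= MAX_VAR_NAME_LEN
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
```

Under the "silently skipped" behaviour, a caller would simply ignore an assignment whose name fails this check rather than raising a parse error.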

Improved Features

Parser Architecture (rewritten)

  • Nom-based streaming parser replacing the old quick-xml parser — supports incremental parsing with Incomplete signals for partial input
  • Zero-copy parsing using Bytes slices from the original input buffer wherever possible (slice_as_bytes)
  • Unified tag dispatcher (tag_handler) that parses the tag name once and routes to specific handlers
  • Streaming gate pattern (esi_opening_tag) ensures full opening tags are buffered before dispatching to complete-mode attribute parsers
  • Script tag handling with case-insensitive matching and proper content scanning
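
The streaming gate idea can be illustrated without nom: before attribute parsing runs, check whether the whole opening tag has arrived, and otherwise signal that more input is needed. The sketch below is a standard-library analogue of nom's `Incomplete` signal, not the PR's `esi_opening_tag`. The `Gate` type and `opening_tag_gate` function are hypothetical, and backslash escapes inside attribute values are ignored for brevity.

```rust
/// Result of looking for a complete opening tag at the start of the input.
#[derive(Debug, PartialEq)]
pub enum Gate<'a> {
    /// The full opening tag (up to and including `>`) and the remaining input.
    Complete { tag: &'a [u8], rest: &'a [u8] },
    /// The closing `>` has not arrived yet; the caller should buffer more data.
    Incomplete,
}

/// Scan for the first `>` that is not inside a quoted attribute value.
pub fn opening_tag_gate(input: &[u8]) -> Gate<'_> {
    let mut quote: Option<u8> = None;
    for (i, &b) in input.iter().enumerate() {
        match (quote, b) {
            // Closing quote: leave quoted mode.
            (Some(q), _) if b == q => quote = None,
            // Any other byte inside quotes is skipped, including `>`.
            (Some(_), _) => {}
            // Opening quote: remember which quote character is active.
            (None, b'"') | (None, b'\'') => quote = Some(b),
            // Unquoted `>` ends the opening tag; slices borrow from the input.
            (None, b'>') => {
                let (tag, rest) = input.split_at(i + 1);
                return Gate::Complete { tag, rest };
            }
            _ => {}
        }
    }
    Gate::Incomplete
}
```

Because both variants borrow from the input slice, the gate itself copies nothing, which is in the spirit of the zero-copy approach listed above.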

Improved ESI Tags

  • <esi:include> — now with full attribute set: src, alt, dca, ttl, maxwait, no-store, method, entity, onerror, appendheader, setheader, removeheader (previously only src, alt, onerror)
  • <esi:try> / <esi:attempt> / <esi:except> — now supports parallel execution with multiple <esi:attempt> blocks
  • <esi:vars> — now supports short form (name= attribute) and long form (with body)
  • <esi:assign> — now with short and long form
  • <esi:choose> / <esi:when> / <esi:otherwise> — now with pre-parsed expression evaluation

Improved Functions

  • $lower — now handles edge cases properly
  • $html_encode — encodes 4 special characters per ESI spec (>, <, &, ")
  • $replace — now supports optional count parameter
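
As a rough sketch of the two behaviours above: the first function encodes only the four characters listed, and the second shows one way an optional replacement count could work. The entity names are the standard HTML ones and are assumed here; `html_encode` and `replace_with_count` are hypothetical Rust helpers, not the PR's function signatures.

```rust
/// Encode only `>`, `<`, `&`, and `"`; every other character passes through.
pub fn html_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            other => out.push(other),
        }
    }
    out
}

/// Replace occurrences of `from` with `to`.
/// `count = None` replaces all; `Some(n)` replaces at most the first `n`.
pub fn replace_with_count(haystack: &str, from: &str, to: &str, count: Option<usize>) -> String {
    match count {
        Some(n) => haystack.replacen(from, to, n),
        None => haystack.replace(from, to),
    }
}
```

Note that a single quote is left untouched by this encoder, since it is not among the four characters listed.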

Testing

  • Comprehensive unit tests in parser.rs: tag parsing, expression parsing, operator precedence, backslash escapes, variable name validation, subkey assignment
  • Integration tests in esi-tests.rs: end-to-end processing with fragment dispatching, configuration options, variable evaluation
  • Streaming behavior tests in streaming_behavior.rs: incomplete tag detection for all ESI tag types
  • Parser tests in tests/parser.rs: try/attempt/except, include attributes, header manipulation
  • Eval tests in tests/eval_tests.rs: DCA modes, eval vs include behavior
  • Function tests in functions.rs: all built-in functions
  • Expression tests in expression.rs: function calls, HTML encoding, evaluation

Benchmarks

  • parser_benchmarks — direct comparison with old XML parser using identical test cases (esi_documents group)
  • nom_parser_features — HTML comments, script tags, assigns, advanced expressions, mixed content
  • parser_scaling — 100 to 10,000 element documents
  • expression_parsing — variable access, comparisons, logical operators, function calls
  • interpolated_strings — text with embedded expressions

Examples (updated)

All existing examples were updated to work with the new API — no new examples were added.

  • esi_example_minimal — updated fragment dispatcher signature (|req, _maxwait|)
  • esi_example_advanced_error_handling — migrated from Reader/Writer/parse_tags to process_stream with BufReader; direct output stream writing
  • esi_try_example — updated fragment dispatcher signature
  • esi_vars_example — updated fragment dispatcher signature
  • esi_example_variants — migrated from parse_tags + process_parsed_document + URL map to process_stream with inline URL rewriting in the dispatcher

tyler and others added 30 commits January 14, 2025 09:47
…ut esi:vars and recognize basic interpolation.
This merge brings in the new nom-based parser implementation up to commit f216023:

Key changes:
- Added nom dependency for modern parsing capabilities
- Introduced new_parse.rs with comprehensive nom-based ESI parser
- Added parser_types.rs with new Expr and Operator types for better type safety
- Enhanced interpolation handling with process_interpolated_chars function
- Improved error handling and debugging output
- Fixed HTML tag closure issues in examples
- Added support for expression comparisons (matches, matches_i operators)
- Better integration between parser and interpreter with consistent types

The new parser provides better performance, more robust parsing, and improved
maintainability while maintaining full compatibility with existing ESI functionality.
…a nested vector structure for improved handling of events
…nt handling

- Introduced a request field in the Fragment struct to retain the original request.
- Updated the Processor to utilize the provided fragment response processor.
- Added tests to verify behavior with is_escaped_content configuration and response processing.
- Modify the `lower` function to return `Null` for `Value::Null` arguments.
- Refactor `process_nom_chunk_to_output` to handle expression evaluation more robustly, including skipping output for `Null` values.
- Replace the simple expression evaluator with a more comprehensive parser that supports additional operators and expressions.
- Introduce parsing for interpolated strings and standalone ESI expressions.
- Add support for logical, comparison, and negation operators in the parser.
- Update `parser_types` to include new expression types and operators.
- Enhance tests for expression parsing to cover new functionality.
- Introduced `WhenBranch` struct to represent branches in choose blocks.
- Updated `Tag` enum to use `WhenBranch` and changed `Chunk` to `Element`.
- Modified `Assign` tag to accept `Expr` instead of `String` for value.
- Enhanced parsing logic to support interpolated expressions in assignments.
- Added tests for long form assignments with interpolation and multiple variables.
- Updated existing tests to reflect changes in parser structure and expression handling.
…ess_include function for improved clarity and error management; update parser types to use string slices.
vagetman added 26 commits March 3, 2026 12:45
complex_document benchmark: ~15µs → ~12.5µs
- Updated tests in `streaming_behavior.rs` to utilize the `Parser` trait for improved clarity and consistency.
- Changed instances of `tag` and `is_not` to use the `.parse()` method for better error handling with incomplete input.
- Ensured that all relevant tests correctly assert the expected `Incomplete` results when parsing incomplete data.
- Clarified some documentation mistakes in README
- Improved ESI expression evaluation by optimizing null value handling.
- Refactored caching logic to utilize `Cow<str>` for better performance and reduced allocations.
- Enhanced error handling and logging in cache configuration and request processing.
- Updated various functions to use zero-copy techniques for string manipulation, improving efficiency.
- Cleaned up code formatting and comments for better readability and maintainability.
…caping, arithmetic safety

- Change OP_AND/OP_OR from && || to & | per ESI spec (when bitwise not configured)
- Add backslash escape support (\X → literal X) in strings, interpolated content, and attribute values
- Check arithmetic for integer overflow and return errors
- Short-circuit And/Or evaluation
- Propagate errors from esi:assign expression evaluation instead of swallowing them
- Pass fragment_response_handler through on_eval with `dca=esi` mode
- Emit HTML comment on failed fragment request only for HTML content (is_escaped_content)
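
The overflow checks and short-circuit evaluation mentioned in this commit can be sketched as below. This is an assumption-laden illustration: the `ArithOp` and `ArithError` types, the use of `i64`, and the separate division-by-zero error are hypothetical and may not match the PR.

```rust
/// Hypothetical integer operators.
#[derive(Debug, Clone, Copy)]
pub enum ArithOp { Add, Sub, Mul, Div, Rem }

/// Hypothetical arithmetic errors.
#[derive(Debug, PartialEq)]
pub enum ArithError { Overflow, DivisionByZero }

/// Apply an operator, returning an error instead of wrapping or panicking.
pub fn checked_arith(op: ArithOp, a: i64, b: i64) -> Result<i64, ArithError> {
    if matches!(op, ArithOp::Div | ArithOp::Rem) && b == 0 {
        return Err(ArithError::DivisionByZero);
    }
    let result = match op {
        ArithOp::Add => a.checked_add(b),
        ArithOp::Sub => a.checked_sub(b),
        ArithOp::Mul => a.checked_mul(b),
        ArithOp::Div => a.checked_div(b), // None only for i64::MIN / -1 here
        ArithOp::Rem => a.checked_rem(b),
    };
    result.ok_or(ArithError::Overflow)
}

/// Short-circuit AND: the right-hand side is only evaluated if the left is true.
pub fn eval_and(
    lhs: bool,
    rhs: impl FnOnce() -> Result<bool, ArithError>,
) -> Result<bool, ArithError> {
    if !lhs { Ok(false) } else { rhs() }
}

/// Short-circuit OR: the right-hand side is only evaluated if the left is false.
pub fn eval_or(
    lhs: bool,
    rhs: impl FnOnce() -> Result<bool, ArithError>,
) -> Result<bool, ArithError> {
    if lhs { Ok(true) } else { rhs() }
}
```

Passing the right-hand side as a closure means a failing sub-expression (for example a division by zero) never runs, and so never reports an error, when the left operand already decides the result.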
- Renamed `ExecutionError` to `ESIError` for clarity and consistency.
- Enhanced error descriptions for better debugging and understanding.
- Modified tests to reflect the changes in error handling and ensure correctness.
Introduce a new ParsingMode::Eof that treats incomplete ESI tags as
truncation errors (ESIError::UnexpectedEndOfDocument) while still
consuming trailing non-ESI text normally. This distinguishes genuine
document truncation from the streaming "need more data" case.

- Add parse_eof() for final-chunk parsing that errors on truncated tags
- Early-return from parse_loop when all input is consumed
- Simplify buffer carry-forward logic in the streaming processor
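
The distinction between "need more data" and genuine truncation might look roughly like this. `ParsingMode::Eof` and the `UnexpectedEndOfDocument` error name come from the commit message; everything else (the `Streaming` variant name, the `Outcome` type, `split_chunk`, and the naive tag detection) is a hypothetical simplification. In particular, this sketch only recognises a dangling tag once the full `<esi:` prefix is present and does not account for `>` inside quoted attribute values.

```rust
/// Parsing modes; `Streaming` is an assumed name for the non-final case.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ParsingMode {
    Streaming,
    Eof,
}

/// Hypothetical outcome of examining one chunk.
#[derive(Debug, PartialEq)]
pub enum Outcome<'a> {
    /// `text` can be processed now; `carry` is prepended to the next chunk.
    Emit { text: &'a str, carry: &'a str },
    /// The document ended in the middle of an ESI tag.
    UnexpectedEndOfDocument,
}

/// Split off a trailing, unterminated `<esi:` tag if there is one.
pub fn split_chunk(input: &str, mode: ParsingMode) -> Outcome<'_> {
    match input.rfind("<esi:") {
        // The last ESI tag opening has no `>` after it: it is incomplete.
        Some(start) if !input[start..].contains('>') => match mode {
            // More data may still arrive, so carry the partial tag forward.
            ParsingMode::Streaming => Outcome::Emit {
                text: &input[..start],
                carry: &input[start..],
            },
            // Final chunk: a partial tag means the document was truncated.
            ParsingMode::Eof => Outcome::UnexpectedEndOfDocument,
        },
        // No dangling tag: everything, including trailing text, is usable.
        _ => Outcome::Emit { text: input, carry: "" },
    }
}
```

Trailing non-ESI text in the final chunk still comes back as ordinary output, matching the behaviour described in the commit message.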
@vagetman vagetman marked this pull request as ready for review March 16, 2026 04:13
@vagetman vagetman requested review from acme, kailan and tyler March 16, 2026 16:07