Skip to content

Parser performance#273

Merged
swannman merged 3 commits into
microsoft:masterfrom
jdu2600:parser_performance
Mar 27, 2026
Merged

Parser performance#273
swannman merged 3 commits into
microsoft:masterfrom
jdu2600:parser_performance

Conversation

@jdu2600
Copy link
Copy Markdown
Contributor

@jdu2600 jdu2600 commented Mar 25, 2026

I'm writing a modern Rust ETW parsing library, and was doing some parsing benchmarking.

I was surprised at just how more performant my approach was againt existing approaches so I did a little digging.

The main issue for krabs seemed to be a linear scan over a per-event property cache. This is effectively an O(N2) scan if you grab all properties - and the cache actually only hits if you grab the same property twice. 😱

Plus wstring temporaries.

Event Type krabsetw 4.4.8 krabsetw 4.4.6 ferrisetw 1.2.0 TdhFormatProperty TdhFormatProperty (naive) TdhGetProperty
Process Start (16 fields) 1286 ns 2522 ns 2503 ns 541 ns 1218 ns 3684 ns
RPC Client Call (9 fields) 516 ns 771 ns 1213 ns 338 ns 481 ns 1755 ns
Registry CreateKey (6 fields) 331 ns 467 ns 749 ns 548 ns 700 ns 1577 ns
File Create (7 fields) 410 ns 872 ns 867 ns 294 ns 489 ns 1146 ns
Thread Start (10 fields) 485 ns 1055 ns 905 ns 201 ns 849 ns 1482 ns
  • All measurements are per-event averages (ns) over the same captured set of events run on my laptop.

John Uhlmann and others added 3 commits March 25, 2026 16:59
Replace the per-event deque/linear-scan property cache with a two-level
scheme: a persistent name-to-index map in schema_locator (built once per
event type, shared across all events) and a flat vector<property_info>
in parser indexed by property position. This eliminates per-event hash
table allocation and removes string comparisons during the blob walk.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace const std::wstring& with std::wstring_view in parser's
public API (parse, try_parse, view_of) and internal helpers
(find_property, assert_valid_assignment, throw_if_invalid).

This eliminates heap-allocating std::wstring temporaries when
callers pass string literals (the common case). Property names
longer than MSVC's SSO threshold of 7 wchar_t previously caused
a heap allocation and free per field access.

Not a breaking change - std::wstring_view is implicitly
constructible from both std::wstring and const wchar_t*.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@swannman swannman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding work @jdu2600 - thank you for doing this!

@swannman swannman merged commit f0934fd into microsoft:master Mar 27, 2026
2 checks passed
Copy link
Copy Markdown
Member

@kylereedmsft kylereedmsft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. I know this has already been merged, but this change raises some new ideas that we should consider.

Comment thread krabs/krabs/schema_locator.hpp
Comment thread krabs/krabs/schema_locator.hpp
Comment thread krabs/krabs/parser.hpp

// Maintain a mapping from property name to blob data index.
std::deque<std::pair<const wchar_t *, property_info>> propertyCache_;
// Persistent name to index map shared across all events of the same type.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this layer of caching even makes sense. It's just overhead.
I can't think of any reason not to just maintain a map of name->property_info from the schema.

I'm fuzzy on all the various event type schemas, but looking at this code, it seems like the property_info blobs are expected to be consistent. Just trying to think through whether we ever need any of the old fallback code if we pre-populate and cache.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no. These were just the smallest and quickest wins that I could port across.

Since a cache existed, I kept it and just addressed its bugged performance characteristics.
Though, in practice, the O(1) map lookup only seems to have gains for events with 12+ fields. (Though I see much beter gains in Rust because of UTF-8 - smaller keys to hash).

We could just delete it - and use the (fallback) hinted linear scan.
This is actually even faster for my benchmark which accesses properties in event order.

I have implementioned both approaches - and will open two PRs for you.


property_info has dynamic per-event information for events with more than one variable length field - whereas the name->index mapping is static per-schema.

I actually do a lot more work upfront in my library - I allocate a parser, calculate the offsets for all fixed location properties and cache this. Variable location property offsets are still calculated lazily. The lookup step just resets the parser.

So you could pre-populate and cache a partial property_info[] also, but that's a bigger lift.

I'm happy to provide you both pre-release access to my library if you are interested.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PRs. Did you compare the two against each other? Just wondering whether you have a preference.
Re. lookup, yes the linear scans are SUPER fast and are almost always faster than the overhead of the hashing and node memory management from the maps.

I'm still trying to convince myself we need all that overhead in the parser to get the property_info. There's a lot more code in there than I would have expected taking a look with fresh eyes.

Hinting and/or simply tracking the last index between calls is probably more than enough and wouldn't need a cache at all.

After looking at them both, I really like the idea of having less than more. I can see the merits of 274 -- and it would line up most accurately with the current model. 275 seems better to me though, and I'm thinking it doesn't go far enough.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some performance numbers for both approaches are in #275.
The hinted linear scan wins best case, and loses worst case. And the loss scales with the number of fields. TL;DR - O(1) versus O(N) scaling

If you can improve the map performance in a non-API-breaking way I would do that...

Otherwise, due to its simplicity, I would lean towards the linear scan using the highest accessed hint - despite it having worse performance in (arguably perverse) scenarios. I always consume properties in roughly the event order as this feels natural to me. It feels reasonable to assume that other developers mostly do the same...and will hit the fast path...

My only worry is that the user needs to understand the implemention details if they want best performance though. Whereas the map always gives adequate performance.

re: not far enough
What else are you thinking?


In my new library I optimised the map's performance, and provided a second API.

  • UTF-16LE -> UTF-8 keys
  • SipHash -> FxHash
// grab the parser from the cache - it's pre-warmed and reset for use
let mut parser = schema_cache.parser(event)?;

// convenience API - FxHashMap lookup keyed by UTF-8 property name
let process_id: u32 = parser.get("ProcessID")?;

// performance API - check hinted index first, then fall back to map lookup
let image_name: String = parser.get_hinted("ImageName", 10)?;

Copy link
Copy Markdown
Member

@kylereedmsft kylereedmsft Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to decide whether even the vector in the parser makes sense. I will take some time today and look more closely to see if it's even worth THAT allocation.

I like the idea of the hint as a way to allow callers to help mitigate non-scan lookups. It's a nice fallback that isn't too hard to explain and doesn't require any caching in the core layer. Well, aside from the schema caching which has a huge perf benefit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eager caching is definitely the winner of the three approaches for my all-fields-sequentially benchmarks.

Event Type 4.4.6 PR#274 PR#275 PR#276 PR#276 hinted
Process Start 2620 ns 1322 ns 824 ns 624 ns 465 ns
Thread Start 1269 ns 495 ns 280 ns 152 ns 158 ns
RPC Client Call 871 ns 549 ns 304 ns 187 ns 194 ns
File Create 653 ns 415 ns 256 ns 192 ns 175 ns
Registry CreateKey 532 ns 351 ns 232 ns 153 ns 152 ns

It's less good for sparse/partial access though - especially for long events where only early properties are accessed.
Here are the times for accessing only the 2nd/4th/6th properties.
For an eager caching approach, this can potentially be mitigated by providing an (optional) max_eager_index limit to the constructor, and additional logic to handle lookups beyond the eager cache... at this point you are probably better off taking my approach of doing upfront parser pre-work and caching / resetting that.

Event Props PR#274 PR#275 PR#276 PR#276 hinted PR#276 + max_eager(5) PR#276 + max_eager(5) + hinted
Process Start 16 233 ns 147 ns 272 ns 260 ns 102 ns 91 ns
RPC ClientCall 9 204 ns 158 ns 121 ns 114 ns 115 ns 103 ns
Thread Start 8 206 ns 140 ns 103 ns 88 ns 94 ns 81 ns
File Create 7 202 ns 137 ns 126 ns 107 ns 94 ns 80 ns
Registry CreateKey 6 237 ns 170 ns 129 ns 119 ns 129 ns 120 ns

In practice though #276 is probably good enough for most real world use cases?

Do we have any of the history on why the lazy approach was chosen initially?
I assume there was something about how the team was using it internally at the time?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An earlier iteration of our ETW library parsing looked more like a map from name->functor to accept the data. That allowed a single, forward-only pass through the event. Only fields with a binding were actually extracted. When krabs was written, the original dev tried to keep that support for sparse access intact.
Krabs is nice at the consumer side, but the lazy, pull-through model comes with some major costs -- especially when using C# because of the thunk (hence the schema locator was added). I think #275 is probably good enough, especially with hinting plumbed through. If we really wanted to chase the dragon and squeeze out every last ns, a forward only, no-cache walk with binding layer would probably be the fastest, but it'd be a much more complicated change and would change the client API shape a lot.

Copy link
Copy Markdown
Contributor

@vmurthysuhas vmurthysuhas Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at our internal usage patterns; in most cases we do not end up reading all the properties on the event, we only care about a handful of properties (< 40% of the total properties available).
The benchmarks suggest #275 is the fastest for this pattern. Even in cases where we end up reading all the properties #276 will only be significantly better if the number of properties on the event is large (no such internal usage exists today), for a smaller number of properties the performance looks nearly the same.
I'm leaning towards 275 as the preferred approach to go with.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @vmurthysuhas. That's my take too. It might still be worth pulling in the hint parameter changes from 276. We can even hold off on touching the managed layer for now and add that later if we need to.
@jdu2600 do you want to update 275 or would you like me to take your changes and apply the hinted overloads as a new PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 hinted overloads added to #275

we only care about a handful of properties (< 40% of the total properties available).
The benchmarks suggest #275 is the fastest for this pattern. Even in cases where we end up reading all the properties #276 will only be significantly better if the number of properties on the event is large (no such internal usage exists today), for a smaller number of properties the performance looks nearly the same.

The performance is less to do with the number of properties accessed, and more to do with the maximum property index accessed. And also to do with whether any slow size calculations (esp uncounted strings) can be skipped.

A better sparse benchmark might be every third property.
This always accesses a property near the end so #276's eager eval wins.

Event Props #275 #275 reverse #275 hinted (+reverse ) #276 #276 reverse #276 hinted (+reverse )
Process 6 of 16 482.9 1105.4 498.3 385.4 779.2 307.3
Thread 4 of 10 199.9 309.6 183.7 117.0 240.0 94.4
RPC 3 of 9 166.7 220.3 155.1 121.0 193.4 99.2
File 3 of 7 208.3 237.4 197.5 162.5 213.5 134.1
Registry 2 of 6 111.3 120.4 109.4 88.3 104.9 81.4

But if only the middle property is accessed, then #275's lazy eval sometimes wins.
But this depends on the event layout - and where the fixed/variable length fields sit.

Event type #275 #275 hinted #276 #276 hinted
Process 184.4 145.2 277.8 234.1
Thread 130.7 127.0 87.1 68.7
File 97.6 93.0 98.4 85.5
Registry 98.8 93.3 82.3 70.3
RPC 139.5 127.5 120.3 98.9

@swannman
Copy link
Copy Markdown
Member

Thanks @kylereedmsft for the review - @jdu2600 if you have a moment, would you take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants