Skip to content

Supporting interoperable probabilistic defense structures through shared attack eval datasets #106

@johannhof

Description

@johannhof

As noted in various discussions (most recently in #97), it's unlikely that security challenges around WebMCP can be addressed purely through deterministic specification rules. Browser and/or agent developers will have to deploy custom probablistic defense architectures, such as improved model understanding, "critic" models that prevent misaligned actions, or sophisticated tool security architectures like CaMeL.

This risks being a departure from the traditional Web specification security workflow: New implementers of the specification would not be able to use some of the lessons learned by early adopters and the community in delivering a secure feature.

One thing we can (and should) do is write very good and precise security considerations. However, there are limits to that, particularly in that it is hard to quantify what e.g. "should be resistant to prompt injections" means in practice.

As such, I think this group should embrace the default approach by agent makers to quantify the performance of an agentic system: evaluations. A robust dataset of prompt injection attacks that any implementer of WebMCP MUST (or SHOULD?) prevent with >99% recall would be a good and pragmatic step forward in interoperably enabling security for WebMCP.

Obviously there are a lot of open questions here around both maintenance and architecture for running these evals.

cc @jpagnucco @bvandersloot-mozilla @victorhuangwq

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions