Payload limit configuration and validation #1288
Conversation
This reverts commit 95a66d7.
temporalio/converter.py
Outdated
payload_upload_error_limit: int | Literal["disabled"] | None = None
"""The limit at which a payloads size error is created."""
payload_upload_warning_limit: int | Literal["disabled"] | None = None
"""The limit at which a payloads size warning is created."""
Literal "disabled" works for me, though we could also use < 0 (e.g. -1); no strong preference. But what does None mean in this case? Doesn't that also mean disabled?
I'll remove "disabled" in favor of just setting it to 0 or less for disablement. None will be used for a future change where we'll get the defaults from the server and merge those to make the effective payload limits for workers.
We can't really make None mean something different later than it does now, IMO. I think we need the proper default here, and we should not accept None right now; we can add it later if we want (ideally pre-GA). Can you clarify what the default is meant to be today for payload limits (recognizing we may not get them from the server in all cases, including at the moment)? Is the default meant to be no limits?
None will mean "use default values of the context in which the data converter is used". So when within a worker, it will get default values from the server. When used in a client or replayer, it will have no default values and thus limit checks are disabled. Specifying a value other than None and > 0 will unconditionally enable the limit checks.
Maybe there's a better representation of "use default values", perhaps Literal["default"] or a sentinel class, in which case the config becomes:
payload_upload_error_limit: int | Literal["default"] = "default"
or
class DefaultPayloadLimit:
pass
...
payload_upload_error_limit: int | DefaultPayloadLimit = DefaultPayloadLimit
A sentinel class is slightly more difficult if we decide to allow specifying these via env vars or envconfig.
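To illustrate the wrinkle, here is a rough sketch of the sentinel approach combined with env-var parsing; every name below, including the env var, is hypothetical rather than actual SDK API:

import os
from typing import Union

class DefaultPayloadLimit:
    """Marker meaning 'defer to the default limit of the surrounding context'."""

def limit_from_env(var: str) -> Union[int, DefaultPayloadLimit]:
    """Parse a limit from an env var, falling back to the sentinel when unset."""
    raw = os.environ.get(var)
    if raw is None:
        return DefaultPayloadLimit()
    # Env vars are strings, so the sentinel cannot be expressed there directly;
    # a reserved string such as "default" would be needed anyway.
    return int(raw)

# Hypothetical env var name, for illustration only.
payload_upload_error_limit = limit_from_env("TEMPORAL_PAYLOAD_UPLOAD_ERROR_LIMIT")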
None will mean "use default values of the context in which the data converter is used". So when within a worker, it will get default values from the server. When used in a client or replayer, it will have no default values and thus limit checks are disabled.
This is inconsistent and difficult to comprehend from a user POV. IMO it needs to be easier to understand and therefore cohesive. I think we should consider a common set of defaults across the SDKs regardless of how payloads are used.
Maybe there's a better representation of the "use default values"
It's less about "use default values" and more about inconsistent behavior for serialization depending on where the converter is used.
At the very least, I think we should have an integer literal warning default in all SDKs that is not server-derived and therefore usable everywhere, since warnings are mostly harmless from a compatibility POV. If we have to have erroring use inconsistent defaults based on where it is used, we can (begrudgingly, with API docs clarifying why we have chosen to be inconsistent).
At the very least, I think we should have an integer literal warning default in all SDKs that is not server-derived and therefore usable everywhere, since warnings are mostly harmless from a compatibility POV.
I've updated the warning default to be 512 KiB.
The error limit default will still be context dependent, e.g. a client has no error limit while a worker has the error limit prescribed by the server. The error limit can be overridden or disabled in options.
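For concreteness, a minimal sketch (not SDK code, names illustrative) of how that context-dependent resolution could behave:

from typing import Optional

DEFAULT_WARNING_LIMIT = 512 * 1024  # 512 KiB warning default, same in every context

def effective_error_limit(
    configured: Optional[int], server_limit: Optional[int]
) -> Optional[int]:
    """Return the error limit in bytes, or None when error checks are disabled."""
    if configured is not None:
        # Explicit override in options: 0 or less disables the check.
        return configured if configured > 0 else None
    # No override: a worker uses the server-prescribed limit, a client has none.
    return server_limit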
temporalio/worker/_activity.py
Outdated
elif isinstance(
    err,
    temporalio.exceptions.PayloadsSizeError,
):
    temporalio.activity.logger.warning(
        "Activity task failed: payloads size exceeded the error limit. Size: %d bytes, Limit: %d bytes",
        err.payloads_size,
        err.payloads_limit,
        extra={"__temporal_error_identifier": "ActivityFailure"},
    )
    await data_converter.encode_failure(
        temporalio.exceptions.ApplicationError(
            type="PayloadsTooLarge",
            message="Payloads size has exceeded the error limit.",
        ),
        completion.result.failed.failure,
    )
    # TODO: Add force_cause to activity Failure bridge proto?
    # TODO: Add WORKFLOW_TASK_FAILED_CAUSE_PAYLOADS_TOO_LARGE to API
    # completion.result.failed.force_cause = WorkflowTaskFailedCause.WORKFLOW_TASK_FAILED_CAUSE_PAYLOADS_TOO_LARGE
I think we should let the traditional error path run here. Assuming a readable error message on the error, the only reason I can see for not doing so is to have the ApplicationError.type be PayloadsTooLarge instead of PayloadsSizeError, but we can either alter the failure converter or make just that slight specialization in the catch here.
- Are you suggesting that PayloadSizeError should extend from ApplicationError?
- We want a specialized warning message that better describes what the issue is and gives better guidance in the worker output about what went wrong and how to fix it, rather than the standard "Completing activity as failed" message. My understanding is that log messages from sdk-core won't be surfaced in the worker unless a logger is configured via the telemetry configuration. And even then, I haven't seen activity failures get routed through there (see https://github.com/temporalio/sdk-python/pull/1288/changes#diff-3162a4b842d45d546da93b825218f7f863b6b481684d2b9570a38b04facb266bR8833), but maybe I'm doing something wrong with that.
On extending from ApplicationError, note that PayloadSizeError will be thrown from Client invocations if payload limits are configured. This is in contrast to the description of the ApplicationError which states "Error raised during workflow/activity execution."
Are you suggesting that PayloadSizeError should extend from ApplicationError?
No, in these situations, all non-Temporal-failure exceptions automatically convert to application error with their unqualified class name as the error type
We want a specialized warning message that better describes what the issue is and gives better guidance in the worker output about what went wrong and how to fix it, rather than the standard "Completing activity as failed" message.
It sounds like everyone should get this message and not just this log statement. Therefore, such a message should be part of the error, not the log.
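As a sketch of that direction, the size details and guidance could live on the exception itself so the ordinary failure-conversion path carries them (stand-in class below, not the actual SDK definition):

class PayloadsSizeError(Exception):
    """Stand-in for temporalio.exceptions.PayloadsSizeError."""

    def __init__(self, payloads_size: int, payloads_limit: int) -> None:
        self.payloads_size = payloads_size
        self.payloads_limit = payloads_limit
        super().__init__(
            f"Payloads size of {payloads_size} bytes exceeded the {payloads_limit} byte limit. "
            "Consider storing large values externally or splitting them into smaller payloads."
        )

With the message on the error, the worker can let the usual exception-to-ApplicationError conversion run instead of special-casing this type in the completion handler.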
    temporalio.exceptions.CancelledError("Cancelled"),
    completion.result.cancelled.failure,
)
elif isinstance(
Note: this doesn't catch cases where encoding the failure itself produces payloads that are too large; we may want to handle this error in the outer except. This can happen when application error details are too large, or when the stack trace is too large and is moved to encoded attributes (see temporalio/features#597).
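A simplified sketch of handling that case in the outer except, assuming the failure encoding itself can raise the size error and reusing the handler's existing variables (not the real handler code):

try:
    await data_converter.encode_failure(err, completion.result.failed.failure)
except temporalio.exceptions.PayloadsSizeError as size_err:
    # Fall back to a minimal failure that fits, so the task still completes
    # as failed instead of the completion itself being rejected.
    completion.result.failed.failure.Clear()
    await data_converter.encode_failure(
        temporalio.exceptions.ApplicationError(
            "Failure details exceeded the payload size limit "
            f"({size_err.payloads_size} > {size_err.payloads_limit} bytes).",
            type="PayloadsTooLarge",
        ),
        completion.result.failed.failure,
    )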
What was changed
Update the SDK to check the size of payload collections and issue warnings and errors when the size exceeds limits.
This is done by:
- Adding a PayloadLimitsConfig class for configuring the warning and error limits. The error limit is not defined by default, whereas the warning limit is set to 512 KiB.
- Adding a payload_limits property to DataConverter, of type PayloadLimitsConfig.
- Adding methods to DataConverter for encoding and decoding payloads in all forms. The encoding methods call _validate_payload_limits before returning.
- Updating callers to use the DataConverter instead of using payload_codec directly.
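For illustration, configuring a converter with these limits might look roughly like the following; the exact constructor arguments are assumptions based on the fields shown in this PR:

import dataclasses

import temporalio.converter

converter = dataclasses.replace(
    temporalio.converter.DataConverter.default,
    payload_limits=temporalio.converter.PayloadLimitsConfig(
        payload_upload_warning_limit=512 * 1024,  # warn at 512 KiB (the default)
        payload_upload_error_limit=2 * 1024 * 1024,  # error at 2 MiB; 0 or less disables
    ),
)

# The converter is then passed to the client or worker as usual, e.g. via
# Client.connect("localhost:7233", data_converter=converter).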
Examples
Log output when an activity attempts to return a result that exceeds the error limit:
Log output when a workflow attempts to provide an activity input that exceeds the error limit:
Note that the above example is missing the extra context that the activity result failure example has. This is due to the logging infrastructure available where these errors are raised; it can be fixed separately with some log refactoring (see deferred items).
Log output when a workflow attempts to provide activity input that is above the warning threshold but below the error limit:
Same note about the missing extra context.
Deferred
Work that has been deferred to later PRs (unless requested to pull back in):
- Adding a new WorkflowTaskFailedCause to indicate the specific cause of the failure scenario. Pending "Add a new workflow failure cause for oversized payloads" api#697, its integration into sdk-core, and sdk-core into sdk-python.
- Refactoring logging around _validate_payload_limits to get rich information about the execution context when issuing a warning.
- Refactoring logging in _WorkflowWorker::_handle_activation to get rich information about the execution context when issuing a warning for exceeding the payload error limit.
Why?
Users need to know when payload sizes are approaching or have exceeded size limits. This will help prevent workflow outages and help users adjust their workflows to use alternate storage methods or to break their payloads down more granularly.
Checklist
Closes #1284: Warn if the SDK tried to send a payload above a specific size
Closes #1285: SDK should fail workflow task if payloads size is known to be too large
How was this tested: Unit tests
Any docs updates needed? Yes