Conversation

@jmaeagle99 (Contributor) commented Jan 20, 2026

What was changed

Update the SDK to check the size of payload collections and issue warnings and errors when the size exceeds limits:

  • Exceeding the error threshold will raise a PayloadsSizeError. When running within a worker, this exception is caught and logged as a warning, and the corresponding workflow task or activity is failed.
  • Exceeding the warning threshold will log a warning.

This is done by:

  • Create a PayloadLimitsConfig class for configuring the warning and error limits. The error limit is not defined by default, whereas the warning limit is set to 512 KiB (see the sketch after this list).
  • Add a payload_limits property of type PayloadLimitsConfig to DataConverter.
  • Add non-public helper methods to DataConverter for encoding and decoding payloads in all forms. The encoding methods call _validate_payload_limits before returning.
  • Update the rest of the SDK to use the non-public helper methods on DataConverter instead of using payload_codec directly.
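
A rough sketch of how the config and validation hook fit together. The PayloadLimitsConfig, payload_limits, and _validate_payload_limits names come from this description and the field names mirror the diff discussed below; everything else (signatures, constructor arguments, where the size is computed) is an assumption for illustration:

import logging
from dataclasses import dataclass
from typing import Optional, Sequence

import temporalio.api.common.v1
import temporalio.exceptions

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class PayloadLimitsConfig:
    """Warning and error thresholds for payload collection sizes."""

    # Warning defaults to 512 KiB; the error limit is not defined by default.
    payload_upload_warning_limit: Optional[int] = 512 * 1024
    payload_upload_error_limit: Optional[int] = None


def _validate_payload_limits(
    payloads: Sequence[temporalio.api.common.v1.Payload],
    limits: PayloadLimitsConfig,
) -> None:
    # Total serialized size of the payload collection.
    size = sum(payload.ByteSize() for payload in payloads)
    error_limit = limits.payload_upload_error_limit
    warning_limit = limits.payload_upload_warning_limit
    if error_limit is not None and error_limit > 0 and size > error_limit:
        # Constructor arguments for PayloadsSizeError are assumed here.
        raise temporalio.exceptions.PayloadsSizeError(size, error_limit)
    if warning_limit is not None and warning_limit > 0 and size > warning_limit:
        logger.warning(
            "Payloads size has exceeded the warning limit. Size: %d bytes, Limit: %d bytes",
            size,
            warning_limit,
        )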

Examples

Log output when an activity attempts to return a result that exceeds the error limit:

13:26:25 [ WARNING] Activity task failed: payloads size exceeded the error limit. Size: 12325 bytes, Limit: 5120 bytes ({'activity_id': '1', 'activity_type': 'large_payload_activity', 'attempt': 3, 'namespace': 'default', 'task_queue': 'b6a43612-b0dc-4e74-9fc8-f112bb64f264', 'workflow_id': 'workflow-e645f290-513d-44a1-8a7e-6ef8d72d63e4', 'workflow_run_id': '019bdd4d-3268-7395-bb2e-451b692ec217', 'workflow_type': 'LargePayloadWorkflow'}) (_activity.py:387)

Log output when a workflow attempts to provide an activity input that exceeds the error limit:

16:20:22 [ WARNING] Workflow task failed: payloads size exceeded the error limit. Size: 12346 bytes, Limit: 5120 bytes (_workflow.py:371)

Note that the above example is missing the extra context that the activity result failure example has. This is due to the logging infrastructure available where these errors are raised, and it can be fixed separately with some log refactoring (see deferred items).

Log output when a workflow attempts to provide activity input that is above the warning threshold but below the error limit:

16:24:46 [ WARNING] Payloads size has exceeded the warning limit. Size: 4154 bytes, Limit: 1024 bytes (converter.py:1291)

Same note about the missing extra context.

Deferred

Work that has been deferred to later PRs (unless requested to pull back in):

  • Use a new WorkflowTaskFailedCause to indicate the specific cause of the failure scenario. Pending api#697 (Add a new workflow failure cause for oversized payloads), its integration into sdk-core, and the corresponding sdk-core update in sdk-python.
  • Use a context-aware logger in _validate_payload_limits to get rich information about the execution context when issuing a warning.
  • Use a context-aware logger in _WorkflowWorker::_handle_activation to get rich information about the execution context when logging that the payload error limit has been exceeded.
  • Get payload error limits from the server to be used as defaults for workers.

Why?

Users need to know when payload sizes are approaching or have exceeded size limits. This will help prevent workflow outages and inform users to adjust their workflows to make use of alternate storage methods or to break down their payloads more granularly.

Checklist

  1. Closes #1284: Warn if the SDK tried to send a payload above a specific size

  2. Closes #1285: SDK should fail workflow task if payloads size is known to be too large

  3. How was this tested: Unit tests

  4. Any docs updates needed? Yes

@jmaeagle99 jmaeagle99 marked this pull request as ready for review January 21, 2026 05:04
@jmaeagle99 jmaeagle99 requested a review from a team as a code owner January 21, 2026 05:04
Comment on lines 1245 to 1248
payload_upload_error_limit: int | Literal["disabled"] | None = None
"""The limit at which a payloads size error is created."""
payload_upload_warning_limit: int | Literal["disabled"] | None = None
"""The limit at which a payloads size warning is created."""
Member:

Literal disabled works for me, though could also use < 0 (e.g. -1), no strong preference. But what does None mean in this case, doesn't that also mean disabled?

Contributor Author:

I'll remove "disabled" in favor of just setting it to 0 or less for disablement. None will be used for a future change where we'll get the defaults from the server and merge those to make the effective payload limits for workers.
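
For illustration, that proposal would make the fields look roughly like this (docstring wording is assumed, and the defaults were revisited later in this thread):

payload_upload_error_limit: int | None = None
"""Limit at which a payloads size error is raised. Values <= 0 disable the check; None defers to context defaults (a future change)."""
payload_upload_warning_limit: int | None = None
"""Limit at which a payloads size warning is logged. Values <= 0 disable the check; None defers to context defaults (a future change)."""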

Member:

We can't really make None mean something different later than it does now, IMO. I think we need the proper default here; we should not accept None right now and can add it later if we want (ideally pre-GA). Can you clarify what the default is meant to be today for payload limits (recognizing we may not get them from the server in all cases, including at the moment)? Is the default meant to be no limits?

Contributor Author:

None will mean "use the default values of the context in which the data converter is used". So when used within a worker, it will get default values from the server. When used in a client or replayer, it will have no default values and thus limit checks are disabled. Specifying a value other than None that is > 0 will unconditionally enable the limit checks.

Maybe there's a better representation of "use default values", e.g. Literal["default"] or a sentinel class, so the config becomes:

payload_upload_error_limit: int | Literal["default"] = "default"

or

class DefaultPayloadLimit:
    pass

...

payload_upload_error_limit: int | DefaultPayloadLimit = DefaultPayloadLimit()

Sentinel class is slightly more difficult if we decide to allow specifying these via env vars or envconfig.

Member @cretz commented Jan 21, 2026:

> None will mean "use the default values of the context in which the data converter is used". So when used within a worker, it will get default values from the server. When used in a client or replayer, it will have no default values and thus limit checks are disabled.

This is inconsistent and difficult to comprehend from a user POV. IMO it needs to be easier to understand and therefore cohesive. I think we should consider a common set of defaults across the SDKs regardless of how payloads are used.

> Maybe there's a better representation of "use default values"

It's less about "use default values" and more about inconsistent behavior for serialization depending on where the converter is used.

At the very least, I think we should have an integer literal warning default in all SDKs that is not server derived and therefore usable everywhere, since warnings are mostly harmless from a compatibility POV. If erroring has to have inconsistent defaults based on where it is used, we can accept that (begrudgingly, with API docs clarifying why we have chosen to be inconsistent).

Contributor Author:

> At the very least, I think we should have an integer literal warning default in all SDKs that is not server derived and therefore usable everywhere, since warnings are mostly harmless from a compatibility POV.

I've updated the warning default to be 512 KiB.

The error limit default will still be context dependent, e.g. a client has no error limit whereas a worker has the error limit prescribed by the server. The error limit can be overridden or disabled in options.
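
A sketch of what that context-dependent resolution could look like (function and parameter names here are hypothetical, not from the PR):

def _effective_error_limit(
    configured: int | None,
    server_default: int | None,
) -> int | None:
    """Resolve the effective payload error limit for the current context.

    A worker would pass the server-prescribed limit as server_default; a
    client or replayer would pass None, leaving the check disabled unless
    explicitly configured.
    """
    if configured is not None:
        # An explicit value wins; 0 or less disables the check entirely.
        return configured if configured > 0 else None
    return server_default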

Comment on lines 383 to 402
elif isinstance(
    err,
    temporalio.exceptions.PayloadsSizeError,
):
    temporalio.activity.logger.warning(
        "Activity task failed: payloads size exceeded the error limit. Size: %d bytes, Limit: %d bytes",
        err.payloads_size,
        err.payloads_limit,
        extra={"__temporal_error_identifier": "ActivityFailure"},
    )
    await data_converter.encode_failure(
        temporalio.exceptions.ApplicationError(
            type="PayloadsTooLarge",
            message="Payloads size has exceeded the error limit.",
        ),
        completion.result.failed.failure,
    )
    # TODO: Add force_cause to activity Failure bridge proto?
    # TODO: Add WORKFLOW_TASK_FAILED_CAUSE_PAYLOADS_TOO_LARGE to API
    # completion.result.failed.force_cause = WorkflowTaskFailedCause.WORKFLOW_TASK_FAILED_CAUSE_PAYLOADS_TOO_LARGE
Member:

I think we should let the traditional error path run here. Assuming a readable error message on the error, I think the only reason I can see not doing so is to have the ApplicationError.type be PayloadsTooLarge instead of PayloadsSizeError, but we can either alter the failure converter, or make just that slight specialization in the catch here.

Contributor Author:

  1. Are you suggesting that PayloadSizeError should extend from ApplicationError?
  2. We want a specialized warning message that better describes what the issue is and gives better guidance in the worker output as to what went wrong and how to fix it, rather than the standard "Completing activity as failed" message. My understanding is that log messages from sdk-core won't be surfaced in the worker unless a logger is configured via the telemetry configuration. And even then, I haven't seen activity failures get routed through there (see https://github.com/temporalio/sdk-python/pull/1288/changes#diff-3162a4b842d45d546da93b825218f7f863b6b481684d2b9570a38b04facb266bR8833), but maybe I'm doing something wrong with that.

Contributor Author:

On extending from ApplicationError, note that PayloadSizeError will be thrown from Client invocations if payload limits are configured. This is in contrast to the description of the ApplicationError which states "Error raised during workflow/activity execution."

Member:

> Are you suggesting that PayloadSizeError should extend from ApplicationError?

No, in these situations all non-Temporal-failure exceptions automatically convert to an application error with their unqualified class name as the error type.

> We want a specialized warning message that better describes what the issue is and gives better guidance in the worker output as to what went wrong and how to fix it, rather than the standard "Completing activity as failed" message.

It sounds like everyone should get this message and not just this log statement. Therefore, such a message should be part of the error, not the log.
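
For readers unfamiliar with that behavior: the default failure converter represents a non-Temporal exception as an ApplicationError whose type is the exception's unqualified class name, roughly like this (a simplified sketch, not the converter's actual code):

import temporalio.exceptions


def _as_application_error(err: BaseException) -> temporalio.exceptions.ApplicationError:
    # Simplified: any non-Temporal-failure exception surfaces as an
    # ApplicationError whose type is the class name, so PayloadsSizeError
    # would already appear as type "PayloadsSizeError" without special-casing.
    return temporalio.exceptions.ApplicationError(
        str(err) or err.__class__.__name__,
        type=err.__class__.__name__,
    )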

temporalio.exceptions.CancelledError("Cancelled"),
completion.result.cancelled.failure,
)
elif isinstance(
Member:

Note, this doesn't catch cases where encoding the failure includes payloads too large, may want to handle this error in the outer except. This can happen when application error details are too large, or when stack trace is too large and it's moved to encoded attributes (see temporalio/features#597).
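
One way to cover that case would be to also catch the error around the failure-encoding call in the outer except, e.g. (a sketch following the variable names in the snippet above; the exact handling is an assumption):

try:
    await data_converter.encode_failure(err, completion.result.failed.failure)
except temporalio.exceptions.PayloadsSizeError:
    # The encoded failure itself (e.g. oversized error details or an encoded
    # stack trace) exceeded the limit, so fall back to a small, fixed failure.
    completion.result.failed.failure.Clear()
    await data_converter.encode_failure(
        temporalio.exceptions.ApplicationError(
            "Payloads size has exceeded the error limit.",
            type="PayloadsTooLarge",
        ),
        completion.result.failed.failure,
    )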

@jmaeagle99 jmaeagle99 marked this pull request as draft January 22, 2026 04:54