From b44949d362b9752d35819c002a0631d6cc5a68e0 Mon Sep 17 00:00:00 2001 From: Dmitrii Safronov Date: Sat, 6 Dec 2025 17:43:28 +0400 Subject: [PATCH 1/9] chore(workflow): improve main branch synchronization logic - Refined the GitHub Actions workflow for synchronizing the main branch with the release branch. Updated fetch commands to ensure the latest state is retrieved and enhanced checks for fast-forwarding or rebasing. This ensures a more robust and accurate update process for the main branch after releases. Signed-off-by: Dmitrii Safronov --- .github/workflows/release-stable.yaml | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/.github/workflows/release-stable.yaml b/.github/workflows/release-stable.yaml index a217572..3f509ad 100644 --- a/.github/workflows/release-stable.yaml +++ b/.github/workflows/release-stable.yaml @@ -42,25 +42,24 @@ jobs: # Fetch latest state after semantic-release pushed commits # This ensures we get all commits that semantic-release created - git fetch origin release:release - git fetch origin main:main + git fetch origin - # Check if main is already up to date - if git diff --quiet main release; then + # Check if main is already up to date with release + if git diff --quiet origin/main origin/release; then echo "main is already up to date with release" exit 0 fi # Check if main is ancestor of release (can fast-forward) - if git merge-base --is-ancestor main release; then + if git merge-base --is-ancestor origin/main origin/release; then echo "Fast-forwarding main to release" - git checkout main - git merge --ff-only release + git checkout -B main origin/main + git merge --ff-only origin/release git push origin main else echo "Rebasing main onto release (force-with-lease required)" - git checkout main - git rebase release + git checkout -B main origin/main + git rebase origin/release # Use --force-with-lease for safety: only push if remote hasn't changed git push --force-with-lease origin main fi From b8fd99d8da7e90d99e126d479acff9ce8fcd3cc4 Mon Sep 17 00:00:00 2001 From: Dmitrii Safronov Date: Sun, 7 Dec 2025 16:55:23 +0400 Subject: [PATCH 2/9] refactor(logging): update error handling to use JSON format for validation errors - Modified the error handling in SchemaLogger and schema validation functions to log validation errors in a structured JSON format. This change enhances machine readability and consistency in error reporting. - Updated the README to reflect the new error message format, replacing the previous string format with a JSON structure that includes field, error, and value information. - Adjusted tests to verify the new JSON error message format is correctly logged and processed. Signed-off-by: Dmitrii Safronov --- README.md | 17 +++--- .../schema_applier.py | 52 +++++++++++++------ .../schema_logger.py | 10 ++-- tests/test_formatter.py | 22 +++++--- tests/test_formatter_basic.py | 6 ++- tests/test_integration.py | 6 ++- 6 files changed, 77 insertions(+), 36 deletions(-) diff --git a/README.md b/README.md index 5e79a8d..a5d4da3 100644 --- a/README.md +++ b/README.md @@ -263,12 +263,11 @@ logger.info( ``` In this case the `tags` value is rejected, a `DataProblem` is recorded with a -message similar to: - -> Field 'tags' is a list but contains elements with types [...]; expected all elements to be of type str +JSON message containing field, error, and value information. and an ERROR message is logged **after** the log record has been emitted with -the format: `"Log data does not match schema: {problem1}; {problem2}; ..."`. +the JSON format: `{"validation_errors": [{"field": "...", "error": "...", "value": "..."}]}`. +All fields are serialized via `repr()` for safety and consistency. ### Multiple leaves with the same source @@ -387,8 +386,9 @@ terminated after logging the error to stderr. In all of these cases a `DataProblem` is recorded for each offending field, and if at least one problem is present, a single ERROR message is logged -**after** the log record has been emitted. The error message format is: -`"Log data does not match schema: {problem1}; {problem2}; ..."`. +**after** the log record has been emitted. The error message format is JSON: +`{"validation_errors": [{"field": "...", "error": "...", "value": "..."}]}`. +All fields are serialized via `repr()` for safety and consistency. - When a `source` is used in multiple leaves (see "Multiple leaves with the same source" above), the value is validated and written independently for each leaf @@ -404,9 +404,10 @@ High-level algorithm inside `SchemaLogger`: 2. A new structured payload is built from the schema and the given `extra`. 3. Only this structured payload is passed to the underlying stdlib logger. 4. After logging, if any validation problems were detected, a single - ERROR message is logged with the format: - `"Log data does not match schema: {problem1}; {problem2}; ..."` + ERROR message is logged with the JSON format: + `{"validation_errors": [{"field": "...", "error": "...", "value": "..."}]}` (no exception is raised, ensuring 100% compatibility with standard logger behavior). + All fields are serialized via `repr()` for safety and consistency. ## Error handling diff --git a/src/logging_objects_with_schema/schema_applier.py b/src/logging_objects_with_schema/schema_applier.py index e6f544c..525d1f1 100644 --- a/src/logging_objects_with_schema/schema_applier.py +++ b/src/logging_objects_with_schema/schema_applier.py @@ -6,6 +6,7 @@ from __future__ import annotations +import json from collections import defaultdict from collections.abc import Mapping, MutableMapping from typing import Any @@ -14,6 +15,26 @@ from .schema_loader import CompiledSchema, SchemaLeaf +def _create_validation_error_json(field: str, error: str, value: Any) -> str: + """Create JSON string for a single validation error. + + Args: + field: Field name that caused the validation error. + error: Error description. + value: Invalid value that caused the error. + + Returns: + JSON string with field, error, and value (all via repr() for safety). + """ + return json.dumps( + { + "field": repr(field), + "error": repr(error), + "value": repr(value), + } + ) + + def _validate_list_value( value: list, source: str, @@ -34,9 +55,8 @@ def _validate_list_value( DataProblem if validation fails, None if validation succeeds. """ if item_expected_type is None: - return DataProblem( - f"Field '{source}' is list but has no item type configured", - ) + error_msg = "is a list but has no item type configured" + return DataProblem(_create_validation_error_json(source, error_msg, value)) if len(value) == 0: # Empty lists are always valid @@ -47,12 +67,13 @@ def _validate_list_value( } if invalid_item_types: - return DataProblem( - f"Field '{source}' is a list but contains elements " + error_msg = ( + f"is a list but contains elements " f"with types {sorted(invalid_item_types)}; " f"expected all elements to be of type " - f"{item_expected_type.__name__}", + f"{item_expected_type.__name__}" ) + return DataProblem(_create_validation_error_json(source, error_msg, value)) return None @@ -107,11 +128,12 @@ def _validate_and_apply_leaf( # bool is a subclass of int). This ensures that the actual # runtime type matches the schema type exactly. if type(value) is not leaf.expected_type: + error_msg = ( + f"has type {type(value).__name__}, " + f"expected {leaf.expected_type.__name__}" + ) problems.append( - DataProblem( - f"Field '{source}' has type {type(value).__name__}, " - f"expected {leaf.expected_type.__name__}", - ), + DataProblem(_create_validation_error_json(source, error_msg, value)) ) return @@ -225,10 +247,9 @@ def _apply_schema_internal( # Check for None values explicitly (None values are not allowed) # This check must be done once per source, not once per leaf if value is None: + error_msg = "is None" problems.append( - DataProblem( - f"Field '{source}' is None", - ), + DataProblem(_create_validation_error_json(source, error_msg, None)) ) continue @@ -244,10 +265,11 @@ def _apply_schema_internal( else (key for key in extra_values.keys() if key not in used_sources) ) for key in redundant_keys: + error_msg = "is not defined in schema" problems.append( DataProblem( - f"Field '{key}' is not defined in schema", - ), + _create_validation_error_json(key, error_msg, extra_values[key]) + ) ) cleaned_extra = _strip_empty(extra) diff --git a/src/logging_objects_with_schema/schema_logger.py b/src/logging_objects_with_schema/schema_logger.py index caffcd1..994b131 100644 --- a/src/logging_objects_with_schema/schema_logger.py +++ b/src/logging_objects_with_schema/schema_logger.py @@ -10,6 +10,7 @@ from __future__ import annotations import inspect +import json import logging import os import sys @@ -185,9 +186,12 @@ def _log( fn, lno, func, sinfo = self.findCaller( stack_info=False, stacklevel=stacklevel + 1 ) - # Format error message with details of all problems - problem_messages = [problem.message for problem in data_problems] - error_msg = f"Log data does not match schema: {'; '.join(problem_messages)}" + # Format error message as JSON for machine processing + validation_errors = [] + for problem in data_problems: + error_obj = json.loads(problem.message) + validation_errors.append(error_obj) + error_msg = json.dumps({"validation_errors": validation_errors}) error_record = self.makeRecord( self.name, logging.ERROR, diff --git a/tests/test_formatter.py b/tests/test_formatter.py index b96da13..da0bf1f 100644 --- a/tests/test_formatter.py +++ b/tests/test_formatter.py @@ -73,14 +73,19 @@ def test_strip_empty_preserves_non_dict_values() -> None: def test_apply_schema_empty_schema_returns_empty() -> None: """apply_schema_internal with empty schema should return empty dict and problems.""" + import json + schema = CompiledSchema(leaves=[]) extra = {"field1": "value1", "field2": 42} result, problems = apply_schema_internal(schema, extra) assert result == {} # All fields are considered redundant even when schema has no leaves. - problem_messages = [p.message for p in problems] - assert "Field 'field1' is not defined in schema" in problem_messages - assert "Field 'field2' is not defined in schema" in problem_messages + problem_fields = [] + for p in problems: + error_obj = json.loads(p.message) + problem_fields.append(error_obj["field"]) + assert "'field1'" in problem_fields + assert "'field2'" in problem_fields def test_apply_schema_nested_structure() -> None: @@ -393,6 +398,8 @@ def test_apply_schema_redundant_fields_with_non_empty_schema() -> None: def test_apply_schema_redundant_fields_with_empty_schema() -> None: """Redundant fields should produce problems when schema is empty.""" + import json + schema = CompiledSchema(leaves=[]) extra = { "unknown_field": "value", @@ -401,9 +408,12 @@ def test_apply_schema_redundant_fields_with_empty_schema() -> None: result, problems = apply_schema_internal(schema, extra) assert result == {} # Both fields should be reported as redundant. - problem_messages = [p.message for p in problems] - assert "Field 'unknown_field' is not defined in schema" in problem_messages - assert "Field 'another_unknown' is not defined in schema" in problem_messages + problem_fields = [] + for p in problems: + error_obj = json.loads(p.message) + problem_fields.append(error_obj["field"]) + assert "'unknown_field'" in problem_fields + assert "'another_unknown'" in problem_fields def test_apply_schema_strips_empty_dicts() -> None: diff --git a/tests/test_formatter_basic.py b/tests/test_formatter_basic.py index f803c3d..5f6ae10 100644 --- a/tests/test_formatter_basic.py +++ b/tests/test_formatter_basic.py @@ -61,7 +61,8 @@ def test_schema_logger_type_mismatch_logs_error_after_logging( assert "user_id" not in main_log # Validation error should be logged as ERROR assert "ERROR" in output - assert "Log data does not match schema" in output + # Error message should be JSON with validation_errors + assert "validation_errors" in output # Details of the problem should be included in the error message assert "user_id" in output @@ -155,7 +156,8 @@ def test_function() -> None: # Check ERROR record for validation error assert error_record.levelno == logging.ERROR - assert "Log data does not match schema" in str(error_record.msg) + # Error message should be JSON with validation_errors + assert "validation_errors" in str(error_record.msg) # Verify that funcName is set (not None) - this is the main fix we made # Previously, func was not passed to makeRecord, causing funcName to be None diff --git a/tests/test_integration.py b/tests/test_integration.py index 50a54c4..f5fe90f 100644 --- a/tests/test_integration.py +++ b/tests/test_integration.py @@ -306,7 +306,8 @@ def test_logger_validates_data_after_logging( # Validation error should be logged as ERROR assert "ERROR" in output - assert "Log data does not match schema" in output + # Error message should be JSON with validation_errors + assert "validation_errors" in output def test_logger_with_empty_schema( @@ -340,7 +341,8 @@ def test_logger_with_empty_schema( # Validation error should be logged as ERROR assert "ERROR" in output - assert "Log data does not match schema" in output + # Error message should be JSON with validation_errors + assert "validation_errors" in output # Details of problems should be included in the error message assert "unknown_field" in output assert "another" in output From 0417dc90498fe694b413e2c4abc4fcc2c02cc136 Mon Sep 17 00:00:00 2001 From: Dmitrii Safronov Date: Sun, 7 Dec 2025 17:01:32 +0400 Subject: [PATCH 3/9] fix(logging): enhance error handling in SchemaLogger for JSON parsing and serialization - Added defensive error handling in the SchemaLogger to manage potential JSON parsing and serialization issues when logging validation errors. This ensures that even if unexpected data corruption occurs, the logger can still produce a meaningful error message. - Implemented fallback mechanisms to create structured error objects when parsing or serialization fails, improving the robustness of the logging system. Signed-off-by: Dmitrii Safronov --- .../schema_logger.py | 42 +++++++++++++++++-- 1 file changed, 39 insertions(+), 3 deletions(-) diff --git a/src/logging_objects_with_schema/schema_logger.py b/src/logging_objects_with_schema/schema_logger.py index 994b131..00a1c0b 100644 --- a/src/logging_objects_with_schema/schema_logger.py +++ b/src/logging_objects_with_schema/schema_logger.py @@ -189,9 +189,45 @@ def _log( # Format error message as JSON for machine processing validation_errors = [] for problem in data_problems: - error_obj = json.loads(problem.message) - validation_errors.append(error_obj) - error_msg = json.dumps({"validation_errors": validation_errors}) + try: + error_obj = json.loads(problem.message) + validation_errors.append(error_obj) + except (json.JSONDecodeError, TypeError) as exc: + # Defensive handling: if problem.message is not valid JSON, + # create a fallback error object. This should never happen in + # normal operation since problem.message is always created via + # _create_validation_error_json, but protects against unexpected + # data corruption or manual DataProblem creation. + validation_errors.append( + { + "field": repr("unknown"), + "error": repr(f"Failed to parse validation error: {exc}"), + "value": repr(problem.message), + } + ) + + try: + error_msg = json.dumps({"validation_errors": validation_errors}) + except (TypeError, ValueError) as exc: + # Defensive handling: if serialization fails, create a fallback + # error message. This should never happen in normal operation since + # validation_errors contains only dicts with primitive values, but + # protects against unexpected data corruption. + error_msg = json.dumps( + { + "validation_errors": [ + { + "field": repr("unknown"), + "error": repr( + f"Failed to serialize validation errors: {exc}" + ), + "value": repr( + "validation errors could not be serialized" + ), + } + ] + } + ) error_record = self.makeRecord( self.name, logging.ERROR, From 305144bc019bba11dbcc2a6126b6b8dc325c0e14 Mon Sep 17 00:00:00 2001 From: semantic-release-bot Date: Sun, 7 Dec 2025 13:03:30 +0000 Subject: [PATCH 4/9] chore(release): 0.1.3-rc.1 ## [0.1.3-rc.1](https://github.com/disafronov/python-logging-objects-with-schema/compare/v0.1.2...v0.1.3-rc.1) (2025-12-07) ### Bug Fixes * **logging:** enhance error handling in SchemaLogger for JSON parsing and serialization ([0417dc9](https://github.com/disafronov/python-logging-objects-with-schema/commit/0417dc90498fe694b413e2c4abc4fcc2c02cc136)) Signed-off-by: Release Bot --- pyproject.toml | 2 +- uv.lock | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/pyproject.toml b/pyproject.toml index 0fa9734..28e46be 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "uv_build" [project] name = "logging-objects-with-schema" -version = "0.1.2" +version = "0.1.3rc1" description = "Proxy logging wrapper that validates extra fields against a JSON schema." readme = "README.md" requires-python = ">=3.10" diff --git a/uv.lock b/uv.lock index d3c00de..7e42f1b 100644 --- a/uv.lock +++ b/uv.lock @@ -1394,7 +1394,7 @@ version = "0.6.3" [[package]] name = "logging-objects-with-schema" -version = "0.1.2" +version = "0.1.3rc1" [package.source] editable = "." From be9e1e6f99875ccd8533f7c52dde75598a7a2ee0 Mon Sep 17 00:00:00 2001 From: Dmitrii Safronov Date: Sun, 7 Dec 2025 17:07:40 +0400 Subject: [PATCH 5/9] docs(README): add cache invalidation details for schema management - Expanded the README to include information on the library's caching mechanisms, detailing the path cache and compiled schema cache. This addition clarifies how the library handles schema file changes and caching behavior, ensuring users understand the implications of schema file modifications on caching. Signed-off-by: Dmitrii Safronov --- README.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/README.md b/README.md index a5d4da3..0dbd7a2 100644 --- a/README.md +++ b/README.md @@ -355,6 +355,17 @@ terminated after logging the error to stderr. - **Note**: The library does not provide a mechanism to reload the schema without restarting the application. This is a deliberate design decision to ensure schema consistency throughout the application lifecycle. +- **Cache invalidation**: The library maintains two levels of caching: + - Path cache: If a previously found schema file is deleted, the path cache + is invalidated and a new search is performed. If the file is recreated at + the same path, the cached compiled schema will be reused (since the cache + key is the absolute path). If the file is recreated at a different path, + it will be treated as a new schema and compiled separately. + - Compiled schema cache: Once a schema has been compiled for a given path, + the compiled result is cached for the lifetime of the process. Even if the + schema file is deleted and recreated at the same path with different + content, the old compiled schema will continue to be used until the + application is restarted. ## Schema root key restrictions From d50a28aae09e3d047a55d3c61455f2ffdbfa95f4 Mon Sep 17 00:00:00 2001 From: Dmitrii Safronov Date: Sun, 7 Dec 2025 17:20:53 +0400 Subject: [PATCH 6/9] docs(README): enhance schema validation and error handling details - Updated the README to provide comprehensive information on schema validation processes, including the creation of `SchemaLogger` instances, error handling for schema problems, and the implications of schema structure on logging behavior. - Added sections on schema root key restrictions and clarified the handling of validation errors, ensuring users understand how the library manages schema-related issues and the conditions under which fields are included in log records. Signed-off-by: Dmitrii Safronov --- README.md | 145 +++++++++++++++--------------------------------------- 1 file changed, 41 insertions(+), 104 deletions(-) diff --git a/README.md b/README.md index 0dbd7a2..0ceb7f0 100644 --- a/README.md +++ b/README.md @@ -166,11 +166,30 @@ logger.info("processing", extra={"user_id": "not-an-int"}) # Wrong type - Schema tree depth is limited to a maximum nesting level (currently 100). Any branch that exceeds this depth is ignored and reported as a schema problem. - If the schema file is not found, or cannot be read/parsed/validated, the - logger instance is not created, schema problems are logged to stderr, and - the application is terminated via `os._exit(1)`. The error message contains - a detailed description of all issues, including the path where the file was - expected (based on the current working directory at schema discovery time). +When a `SchemaLogger` instance is created (via `logging.getLogger()` after +`logging.setLoggerClass(SchemaLogger)`), the library searches for the schema +file, parses the JSON, and walks the entire tree to collect all problems with +the schema. + +If there are **any** problems with the schema (missing file, broken JSON, +invalid `type` values, conflicting root fields that match system logging +fields, malformed structure, etc.): + +- the logger instance is not created (schema validation happens before + the logger is initialized); +- schema problems are logged to stderr in the format: + `"Schema has problems: {problem1}; {problem2}; ..."`; +- the application is terminated via `os._exit(1)`. + +If there are no problems, the schema is compiled and the logger is created. +A valid empty schema (e.g., `{}` or a schema with only inner nodes and no +leaves) is treated as valid and does not cause errors. The logger is created +successfully, but no `extra` fields will be included in log records. + +**Note**: System-level errors (OSError, ValueError, RuntimeError) that occur +during schema compilation are converted to `SchemaProblem` instances and +handled the same way as schema validation problems - the application is +terminated after logging the error to stderr. An example schema: @@ -214,6 +233,16 @@ no `extra` fields will be included in log records, and any attempt to log with list elements. - `source` is the name of the field in `extra` from which the value is taken. +### Schema root key restrictions + +- The library protects system fields from the standard `logging` module + (attributes of `LogRecord` and logger internals) by preventing their use as + root keys in the schema. +- If a root key in the schema conflicts with a system logging field, a + `SchemaProblem` is generated and the schema validation fails. +- Responsibility for ensuring compatibility with other logging libraries and + formatters lies with the developer when writing the schema. + ### List-typed fields When a leaf declares `"type": "list"`, the runtime value must be a Python list @@ -265,7 +294,7 @@ logger.info( In this case the `tags` value is rejected, a `DataProblem` is recorded with a JSON message containing field, error, and value information. -and an ERROR message is logged **after** the log record has been emitted with +And an ERROR message is logged **after** the log record has been emitted with the JSON format: `{"validation_errors": [{"field": "...", "error": "...", "value": "..."}]}`. All fields are serialized via `repr()` for safety and consistency. @@ -305,77 +334,16 @@ where the type matches, and validation problems will be reported for the mismatched locations. It is the schema author's responsibility to ensure consistent type expectations when reusing a `source` field. -## Behaviour when loading the schema - -- When a `SchemaLogger` instance is created (via `logging.getLogger()` after - `logging.setLoggerClass(SchemaLogger)`), the library: - - searches for `logging_objects_with_schema.json` by walking upward from - the current working directory until it finds the file or reaches the - filesystem root; - - parses the JSON; - - walks the entire tree and collects all problems with the schema. -- If there are **any** problems with the schema (missing file, broken JSON, - invalid `type` values, conflicting root fields that match system logging - fields, malformed structure, etc.): - - the logger instance is not created (schema validation happens before - the logger is initialized); - - schema problems are logged to stderr in the format: - `"Schema has problems: {problem1}; {problem2}; ..."`; - - the application is terminated via `os._exit(1)`. -- If there are no problems: - - the schema is compiled into a `CompiledSchema`; - - the logger is created and starts using this schema to validate `extra` - fields. - - A valid empty schema (e.g., `{}` or a schema with only inner nodes and no - leaves) is treated as valid and does not cause errors. The logger is created - successfully, but no `extra` fields will be included in log records. - -**Note**: System-level errors (OSError, ValueError, RuntimeError) that occur -during schema compilation are converted to `SchemaProblem` instances and -handled the same way as schema validation problems - the application is -terminated after logging the error to stderr. - ## Schema caching and thread safety -- The library caches compiled schemas to avoid recompiling the same schema file - on every logger creation. -- The cache is keyed by the absolute schema file path and stores both the - compiled schema and any discovered problems. Once a schema (or its absence / - invalidity) has been observed for a given path, subsequent calls reuse the - cached result within the same process without re-reading or re-compiling the - schema file. -- This includes invalid schemas: once a schema file has been found to be - invalid, the corresponding (typically empty) `CompiledSchema` and its - `SchemaProblem`s are reused for the lifetime of the process. +- The library caches compiled schemas after the first load. The schema is + effectively loaded and compiled **once per process** for a given schema path. + Subsequent logger instances reuse the cached compiled schema. - Schema compilation and cache access are **thread-safe**: multiple threads can safely create `SchemaLogger` instances concurrently without race conditions. -- The schema is effectively loaded and compiled **once per process** for a - given schema path. Subsequent logger instances reuse the cached compiled - schema. - **Note**: The library does not provide a mechanism to reload the schema without restarting the application. This is a deliberate design decision to ensure schema consistency throughout the application lifecycle. -- **Cache invalidation**: The library maintains two levels of caching: - - Path cache: If a previously found schema file is deleted, the path cache - is invalidated and a new search is performed. If the file is recreated at - the same path, the cached compiled schema will be reused (since the cache - key is the absolute path). If the file is recreated at a different path, - it will be treated as a new schema and compiled separately. - - Compiled schema cache: Once a schema has been compiled for a given path, - the compiled result is cached for the lifetime of the process. Even if the - schema file is deleted and recreated at the same path with different - content, the old compiled schema will continue to be used until the - application is restarted. - -## Schema root key restrictions - -- The library protects system fields from the standard `logging` module - (attributes of `LogRecord` and logger internals) by preventing their use as - root keys in the schema. -- If a root key in the schema conflicts with a system logging field, a - `SchemaProblem` is generated and the schema validation fails. -- Responsibility for ensuring compatibility with other logging libraries and - formatters lies with the developer when writing the schema. ## Behaviour when logging data @@ -386,43 +354,12 @@ terminated after logging the error to stderr. 3. The runtime type of the value strictly matches the declared Python type (exact type match is used, e.g. `bool` is not accepted where `int` is expected, and vice versa). -- If: - - the compiled schema has no valid leaves (it is effectively empty, so no - `extra` keys are ever allowed to contribute data); - - a field is not described in the schema; - - the type does not match the declared type; - - the value is None (None values are not allowed); - - the field is considered redundant, - **it is simply not included in the final log record**. - -In all of these cases a `DataProblem` is recorded for each offending field, and -if at least one problem is present, a single ERROR message is logged -**after** the log record has been emitted. The error message format is JSON: -`{"validation_errors": [{"field": "...", "error": "...", "value": "..."}]}`. -All fields are serialized via `repr()` for safety and consistency. - +- Fields that do not meet these conditions are simply not included in the final + log record (see "Strictness guarantees" above for details on validation + error reporting). - When a `source` is used in multiple leaves (see "Multiple leaves with the same source" above), the value is validated and written independently for each leaf location. - Before emitting the final structured payload, the library strips empty dictionaries and `None` values from nested structures so that only meaningful data appears in the resulting log record. - -High-level algorithm inside `SchemaLogger`: - -- For every call to `logger.info` / `logger.error` / `logger.log`: - 1. All user-provided fields are taken from `extra`. - 2. A new structured payload is built from the schema and the given `extra`. - 3. Only this structured payload is passed to the underlying stdlib logger. - 4. After logging, if any validation problems were detected, a single - ERROR message is logged with the JSON format: - `{"validation_errors": [{"field": "...", "error": "...", "value": "..."}]}` - (no exception is raised, ensuring 100% compatibility with standard logger behavior). - All fields are serialized via `repr()` for safety and consistency. - -## Error handling - -- Schema problems are handled internally: errors are logged to stderr and - the application is terminated via `os._exit(1)`. -- No exceptions are raised by `SchemaLogger` during initialization, making - it a true drop-in replacement for `logging.Logger`. From 0390ba72461e98325c2d68b20159c6e06a7229fd Mon Sep 17 00:00:00 2001 From: Dmitrii Safronov Date: Sun, 7 Dec 2025 17:36:32 +0400 Subject: [PATCH 7/9] docs(logging): enhance schema validation error reporting and handling - Expanded the `SchemaProblem` and `DataProblem` classes to provide detailed descriptions of validation errors, including examples for better clarity. - Improved the `_create_validation_error_json` function to ensure consistent JSON serialization of error messages, enhancing security and readability. - Updated the `SchemaLogger` to log validation errors in a structured JSON format, allowing for easier parsing and analysis by consumers of the log data. - Added performance considerations and limitations in the schema compilation and validation processes to improve documentation and user understanding. Signed-off-by: Dmitrii Safronov --- src/logging_objects_with_schema/errors.py | 43 ++++- .../schema_applier.py | 70 +++++++- .../schema_loader.py | 152 +++++++++++++++--- .../schema_logger.py | 48 +++++- 4 files changed, 279 insertions(+), 34 deletions(-) diff --git a/src/logging_objects_with_schema/errors.py b/src/logging_objects_with_schema/errors.py index 2746985..fef5f16 100644 --- a/src/logging_objects_with_schema/errors.py +++ b/src/logging_objects_with_schema/errors.py @@ -9,8 +9,27 @@ class SchemaProblem: """Describes a single problem encountered while loading the schema. + This class is used to report schema validation errors during schema + compilation. Schema problems are fatal: if any are detected during + logger initialization, the application is terminated after logging + all problems to stderr via ``os._exit(1)``. + + Schema problems can occur due to: + - Missing or inaccessible schema file + - Invalid JSON syntax + - Invalid schema structure (non-object top-level, missing required fields) + - Unknown type names or invalid type declarations + - Root key conflicts with reserved logging fields + - Excessive nesting depth (exceeds MAX_SCHEMA_DEPTH) + Attributes: - message: Human-readable description of the problem. + message: Human-readable description of the problem. Examples include: + - "Schema file not found: /path/to/logging_objects_with_schema.json" + - "Failed to parse JSON schema: Expecting ',' delimiter: line 5 column 10" + - "Unknown type 'string' at ServicePayload.RequestID" + - "Root key 'name' conflicts with reserved logging fields" + - "Schema nesting depth exceeds maximum allowed depth of 100 at " + "path ServicePayload.Metrics" """ message: str @@ -20,8 +39,28 @@ class SchemaProblem: class DataProblem: """Describes a single problem encountered while validating log data. + This class is used to report validation errors when applying the compiled + schema to user-provided ``extra`` fields during logging. Unlike + :class:`SchemaProblem`, data problems are not fatal: they are collected + and logged as ERROR messages *after* the main log record has been emitted, + ensuring 100% compatibility with standard logger behavior. + + Data problems can occur due to: + - Type mismatches (e.g., providing str where int is expected) + - None values (None is never allowed for any type) + - Invalid list elements (non-homogeneous lists, non-primitive elements) + - Redundant fields (fields not defined in the schema) + Attributes: - message: Human-readable description of the validation problem. + message: JSON string containing structured error information. The message + is always a valid JSON object with the following structure: + ``{"field": "...", "error": "...", "value": "..."}`` + All values are serialized via ``repr()`` for safety and consistency. + Examples: + - ``{"field": "'user_id'", "error": "'has type str, expected int'", "value": "'abc-123'"}`` + - ``{"field": "'request_id'", "error": "'is None'", "value": "None"}`` + - ``{"field": "'tags'", "error": "'is a list but contains elements with types dict; expected all elements to be of type str'", "value": "[{'key': 'color'}]"}`` # noqa: E501 + - ``{"field": "'unknown_field'", "error": "'is not defined in schema'", "value": "'some_value'"}`` """ message: str diff --git a/src/logging_objects_with_schema/schema_applier.py b/src/logging_objects_with_schema/schema_applier.py index 525d1f1..29885ec 100644 --- a/src/logging_objects_with_schema/schema_applier.py +++ b/src/logging_objects_with_schema/schema_applier.py @@ -18,6 +18,12 @@ def _create_validation_error_json(field: str, error: str, value: Any) -> str: """Create JSON string for a single validation error. + All values are wrapped in repr() before JSON serialization. This ensures: + - Any value type can be safely serialized (even non-JSON-serializable types) + - The error message always contains a valid Python representation of the value + - Security: prevents issues with special characters or control sequences + - Consistency: all error messages have the same format regardless of value type + Args: field: Field name that caused the validation error. error: Error description. @@ -62,6 +68,10 @@ def _validate_list_value( # Empty lists are always valid return None + # Collect unique type names of items that don't match the expected type. + # We use a set comprehension to get unique type names (not the types themselves) + # for the error message. This gives a clear, readable error message showing + # which types were found (e.g., "int, str") vs what was expected. invalid_item_types = { type(item).__name__ for item in value if type(item) is not item_expected_type } @@ -94,13 +104,22 @@ def _set_nested_value( value: The value to set at the target location. """ current = target + # Navigate through intermediate dictionaries, creating them as needed. + # We iterate through all keys except the last one (path[:-1]) to build + # the nested structure. for key in path[:-1]: child = current.get(key) + # If the key doesn't exist or exists but is not a dict, create a new dict. + # This overwrites any non-dict value that might have been there (which + # shouldn't happen in normal operation, but we handle it defensively). + # We use isinstance() instead of checking for None because we need to + # ensure the value is actually a dict, not just that the key exists. if not isinstance(child, dict): child = {} current[key] = child current = child + # Set the final value at the last key in the path current[path[-1]] = value @@ -160,6 +179,12 @@ def _strip_empty(node: Any) -> Any: This helper is used by ``_apply_schema_internal`` on the final payload to avoid leaving empty containers created during schema application. + Note: Lists are not processed (they are returned as-is). This is intentional: + - Lists in the schema are always homogeneous primitive types (validated earlier) + - Empty lists are valid and should be preserved + - We only need to clean up empty dicts that were created as intermediate + containers during schema application but ended up empty + Note: This function is part of the internal implementation details and is not considered a public API. Its signature and behaviour may change @@ -220,6 +245,24 @@ def _apply_schema_internal( The function itself does not raise exceptions; it only accumulates :class:`DataProblem` instances for the caller to handle. + Performance considerations: + Time complexity is O(n + m) where n is the number of leaves in the + compiled schema and m is the number of fields in ``extra_values``. + The function groups leaves by source field name to avoid redundant + validation when a single source is used by multiple leaves. For typical + schemas (< 100 leaves) and typical extra dictionaries (< 50 fields), + the overhead is negligible (< 1ms). Memory complexity is O(n + m) for + the output structures. + + Limitations: + - Very large ``extra_values`` dictionaries (> 1000 fields) may cause + noticeable overhead, but this is uncommon in practice + - Deeply nested output structures (limited by schema depth) may increase + memory usage, but the depth is already limited by MAX_SCHEMA_DEPTH + - All validation errors are collected before returning; for schemas with + many leaves and many validation failures, the problems list may grow + large, but this is expected behavior for debugging purposes + Note: This function is used internally by :class:`SchemaLogger` and is not considered part of the public API. Its signature and behaviour may @@ -231,21 +274,33 @@ def _apply_schema_internal( extra: dict[str, Any] = {} problems: list[DataProblem] = [] - # Group leaves by source for efficient processing + # Group leaves by source field name. This is necessary because a single source + # can be referenced by multiple leaves (allowing the same value to appear in + # different locations in the output structure). Grouping allows us to process + # all leaves for a given source together, which is more efficient and allows + # us to validate the value once per source (e.g., checking for None) rather + # than once per leaf. source_to_leaves: dict[str, list[SchemaLeaf]] = defaultdict(list) for leaf in compiled.leaves: source_to_leaves[leaf.source].append(leaf) used_sources = set(source_to_leaves.keys()) + # Process each source that appears in the schema. If a source is missing from + # extra_values, we silently skip it (this is normal - not all sources need to + # be present in every log call). We only validate and apply sources that are + # actually provided. for source, leaves in source_to_leaves.items(): if source not in extra_values: + # Source not provided - this is normal, not an error. Skip it. continue value = extra_values[source] - # Check for None values explicitly (None values are not allowed) - # This check must be done once per source, not once per leaf + # Check for None values explicitly. None is never allowed for any type, + # so we check it once per source (not once per leaf) before attempting + # type-specific validation. This avoids redundant checks when a source + # is used by multiple leaves. if value is None: error_msg = "is None" problems.append( @@ -253,10 +308,17 @@ def _apply_schema_internal( ) continue + # Validate the value against each leaf that references this source. + # Each leaf validates independently, so a value might pass validation + # for some leaves (where type matches) but fail for others (where type + # doesn't match). The value is written only to locations where validation + # succeeds. for leaf in leaves: _validate_and_apply_leaf(leaf, value, source, extra, problems) - # Report redundant fields for any keys not referenced by schema leaves. + # Report redundant fields: any keys in extra_values that are not referenced + # by any schema leaf. These are fields that the user provided but which are + # not defined in the schema, so they cannot be included in the log output. # Optimization: if schema is empty (no used_sources), all fields are redundant, # so we can skip the membership check for each key. redundant_keys = ( diff --git a/src/logging_objects_with_schema/schema_loader.py b/src/logging_objects_with_schema/schema_loader.py index c53ba66..64c02cc 100644 --- a/src/logging_objects_with_schema/schema_loader.py +++ b/src/logging_objects_with_schema/schema_loader.py @@ -79,26 +79,45 @@ def _create_empty_compiled_schema_with_problems( "list": list, } -# Module-level cache for compiled schemas. -# Key: absolute schema_path, Value: (CompiledSchema, list[SchemaProblem]) -# This cache is thread-safe: all read and write operations are protected by -# _cache_lock. Double-checked locking is used in _compile_schema_internal() -# to prevent multiple threads from compiling the same schema simultaneously. +# Two-level caching system for schema loading: +# +# 1. Path cache (_resolved_schema_path, _cached_cwd): Caches the result of +# searching for the schema file. This avoids re-walking the directory tree +# on every logger creation. The cache behavior differs based on whether +# the file was found: +# - If found: The absolute path is cached (CWD-independent) since the file +# location doesn't change even if CWD changes. +# - If not found: The path is based on current CWD (where we expected to +# find it), so we cache both the path and the CWD. If CWD changes, we +# invalidate and re-search from the new location. +# +# 2. Compiled schema cache (_SCHEMA_CACHE): Caches the compiled schema and +# validation problems for a given schema file path. This avoids re-parsing +# and re-compiling the schema JSON on every logger creation. The cache key +# is the absolute schema file path from the path cache. +# +# These caches work together: path cache finds the file location, compiled +# cache stores the result of compiling that file. Both are thread-safe and +# use double-checked locking to avoid race conditions. + +# Compiled schema cache: Key is absolute schema_path, Value is tuple of +# (CompiledSchema, list[SchemaProblem]). This cache stores both successful +# compilations and failures (with problems list). _SCHEMA_CACHE: dict[Path, tuple[CompiledSchema, list[SchemaProblem]]] = {} _cache_lock = threading.RLock() -# Cache for resolved schema path (to avoid re-searching when CWD changes) +# Path cache: Cached absolute path to schema file (or expected location if not found) _resolved_schema_path: Path | None = None -# CWD that was used when caching a path for a file that was not found -# None means the cached path is an absolute path to a found file (CWD-independent) -# Non-None means the cached path is an absolute path based on CWD -# when file was not found +# CWD that was used when caching a path for a file that was not found. +# None means the cached path is an absolute path to a found file (CWD-independent). +# Non-None means the cached path is an absolute path based on CWD when file +# was not found. _cached_cwd: Path | None = None # RLock for thread-safe access to path cache variables. -# RLock is used to allow recursive locking when helper functions -# (_check_cached_found_file_path, _check_cached_missing_file_path) are called -# from _get_schema_path() which already holds the lock. +# RLock (not Lock) is needed because helper functions (_check_cached_found_file_path, +# _check_cached_missing_file_path) are called from _get_schema_path() which already +# holds the lock, and these helpers also need to acquire the lock. _path_cache_lock = threading.RLock() @@ -117,6 +136,9 @@ def _find_schema_file() -> Path | None: while True: schema_path = current / SCHEMA_FILE_NAME if schema_path.exists(): + # Use resolve() to get an absolute path and resolve any symbolic links. + # This ensures we have a canonical path that can be used as a cache key + # and for consistent error reporting. return schema_path.resolve() # Move to parent directory @@ -266,12 +288,20 @@ def _get_schema_path() -> Path: def _load_raw_schema(schema_path: Path) -> tuple[dict[str, Any], Path]: """Load raw JSON schema from the application root. - This function always attempts to read the schema file and records - any problems as :class:`SchemaProblem` instances. + This function attempts to read and parse the schema file. If any problems + occur (file not found, I/O errors, invalid JSON, wrong top-level type), + it raises exceptions (FileNotFoundError or ValueError). These exceptions + are then converted to :class:`SchemaProblem` instances by the caller + (_compile_schema_internal). Args: schema_path: Absolute path to the schema file. + Raises: + FileNotFoundError: If the schema file does not exist. + ValueError: If the file cannot be read, contains invalid JSON, or + the top-level value is not a JSON object. + Note: This helper is part of the internal implementation and is not considered a public API. @@ -324,6 +354,11 @@ def _format_path(path: tuple[str, ...], key: str | None = None) -> str: def _is_empty_or_none(value: Any) -> bool: """Check if a value is None or an empty string. + This function is used during schema validation to check if required fields + (type, source, item_type) have valid values. Both None and empty/whitespace-only + strings are considered invalid because they don't provide meaningful information + for schema compilation. + Args: value: The value to check. @@ -339,6 +374,11 @@ def _is_leaf_node(value_dict: dict[str, Any]) -> bool: A leaf node is identified by having either 'type' or 'source' field. Inner nodes have neither of these fields. + We use `.get()` with `is not None` check instead of `in` operator because: + - A field might be present but have a None value (which indicates an error) + - We need to distinguish between "field not present" (inner node) and + "field present but None" (invalid leaf node that will be caught later) + Args: value_dict: Dictionary containing node data. @@ -391,6 +431,9 @@ def _validate_and_create_leaf( if type_invalid or source_invalid: return None + # Convert to string before lookup to handle cases where the JSON parser + # might return non-string types (though this shouldn't happen with valid JSON). + # This ensures type safety and consistent behavior. expected_type = _TYPE_MAP.get(str(leaf_type)) if expected_type is None: problems.append( @@ -430,6 +473,9 @@ def _validate_and_create_leaf( return SchemaLeaf( path=path + (key,), + # Convert to string to ensure type consistency. Even though source should + # be a string from JSON, this guards against unexpected types and ensures + # the SchemaLeaf always has a string source. source=str(leaf_source), expected_type=expected_type, item_expected_type=item_expected_type, @@ -443,6 +489,23 @@ def _compile_schema_tree( ) -> Iterable[SchemaLeaf]: """Recursively compile a schema node into SchemaLeaf objects. + This function recursively walks the schema tree structure, identifying leaf + nodes (those with ``type`` and ``source`` fields) and inner nodes (those + without these fields). Leaf nodes are validated and converted to + :class:`SchemaLeaf` objects, while inner nodes are recursively processed. + + Performance considerations: + Time complexity is O(n) where n is the total number of nodes in the + schema tree. Memory complexity is O(d) where d is the maximum nesting + depth (limited by MAX_SCHEMA_DEPTH). For typical schemas (< 1000 nodes, + depth < 10), the overhead is negligible. + + Limitations: + - Maximum nesting depth is limited to MAX_SCHEMA_DEPTH (currently 100) + to prevent stack overflow and excessive memory usage + - Very large schemas (> 10,000 nodes) may cause noticeable compilation + overhead, but this is uncommon in practice + Args: node: The schema node to compile. path: Current path in the schema tree. @@ -452,7 +515,9 @@ def _compile_schema_tree( SchemaLeaf objects found in the tree. """ # Check for excessive nesting depth (DoS protection: prevent deeply nested - # schemas that could cause stack overflow or excessive memory usage) + # schemas that could cause stack overflow or excessive memory usage). + # This check happens before processing the node to avoid unnecessary work + # and to provide clear error messages about the problematic path. if len(path) > MAX_SCHEMA_DEPTH: problems.append( SchemaProblem( @@ -471,6 +536,9 @@ def _compile_schema_tree( ) continue + # Create a mutable copy of the value dict. This is necessary because + # we may need to modify it during processing, and the original value + # might be a read-only Mapping (e.g., from JSON parsing). value_dict = dict(value) if _is_leaf_node(value_dict): @@ -495,6 +563,11 @@ def get_builtin_logrecord_attributes() -> set[str]: to discover all non-callable, non-private attributes. This is the standard way to get attribute names from a class instance in Python. + The result is cached (maxsize=1) because LogRecord attributes are fixed + for a given Python version and don't change at runtime. This avoids + recreating the LogRecord instance and running introspection on every + schema compilation. + Returns: Set of attribute names that are reserved by the logging system. Examples include: 'name', 'levelno', 'pathname', 'lineno', 'msg', etc. @@ -571,6 +644,21 @@ def _compile_schema_internal() -> tuple[CompiledSchema, list[SchemaProblem]]: compiled schema together with a list of problems detected during processing (an empty ``CompiledSchema`` when the schema is missing or invalid). + Performance considerations: + First compilation of a schema involves file I/O, JSON parsing, and tree + traversal. For typical schemas (< 1000 nodes), this takes < 10ms. All + subsequent calls for the same schema path return immediately from cache + (< 0.1ms). The cache is process-wide and persists for the application + lifetime. + + Limitations: + - Schema changes on disk are not automatically reloaded; the application + must be restarted to pick up changes + - Very large schemas (> 10,000 nodes) may cause noticeable compilation + overhead on first load, but this is uncommon in practice + - The cache uses absolute file paths as keys, so schema files must be + accessible via the same path throughout the application lifetime + Note: This function is used internally by :class:`SchemaLogger` and by the test suite. It is not part of the public package API and may change @@ -581,13 +669,21 @@ def _compile_schema_internal() -> tuple[CompiledSchema, list[SchemaProblem]]: """ schema_path = _get_schema_path() - # Fast-path: if we have already attempted to compile schema for this path, - # return cached result (successful, missing-file, or invalid-schema). + # Fast-path: First check (without holding lock) if we have already attempted + # to compile schema for this path. This avoids lock contention in the common + # case when the schema is already cached. with _cache_lock: cached = _SCHEMA_CACHE.get(schema_path) if cached is not None: return cached + # Schema not in cache. We need to compile it. However, between the check above + # and now, another thread might have started (or even finished) compiling the + # same schema. We use double-checked locking to handle this race condition: + # after doing the expensive work (loading/compiling), we check the cache again + # while holding the lock. If another thread already compiled it, we use that + # result instead of storing our own (which might be different if the file changed). + problems: list[SchemaProblem] = [] try: @@ -595,8 +691,9 @@ def _compile_schema_internal() -> tuple[CompiledSchema, list[SchemaProblem]]: except (FileNotFoundError, ValueError) as exc: problems.append(SchemaProblem(str(exc))) result = _create_empty_compiled_schema_with_problems(problems) - # Double-checked locking: another thread might have compiled the schema - # while we were handling the error. + # Double-checked locking: Check cache again while holding lock. Another + # thread might have compiled the schema (or handled the same error) while + # we were processing the exception. If so, use the cached result. with _cache_lock: cached = _SCHEMA_CACHE.get(schema_path) if cached is not None: @@ -604,8 +701,13 @@ def _compile_schema_internal() -> tuple[CompiledSchema, list[SchemaProblem]]: _SCHEMA_CACHE[schema_path] = result return result + # Check root key conflicts before compiling the tree. This allows us to + # catch conflicts early and report them as schema problems. We do this + # before tree compilation to avoid unnecessary work if there are conflicts. _check_root_conflicts(raw_schema, problems) + # Compile the schema tree into leaves. Each root key becomes a separate + # tree that we compile recursively. Problems are collected as we go. leaves: list[SchemaLeaf] = [] for key, value in raw_schema.items(): if not isinstance(value, Mapping): @@ -614,14 +716,20 @@ def _compile_schema_internal() -> tuple[CompiledSchema, list[SchemaProblem]]: ) continue + # Create a mutable copy of the value dict for recursive compilation. + # This ensures we can safely process nested structures without modifying + # the original schema dictionary. for leaf in _compile_schema_tree(dict(value), (key,), problems): leaves.append(leaf) compiled = CompiledSchema(leaves=leaves) result = (compiled, problems) - # Double-checked locking: another thread might have compiled the schema - # while we were compiling it. + # Double-checked locking: Check cache again while holding lock. Another thread + # might have compiled the schema while we were doing the expensive compilation + # work (parsing JSON, validating, building leaves). If so, use the cached result + # instead of overwriting it with our own (which might be different if the file + # was modified between reads). with _cache_lock: cached = _SCHEMA_CACHE.get(schema_path) if cached is not None: diff --git a/src/logging_objects_with_schema/schema_logger.py b/src/logging_objects_with_schema/schema_logger.py index 00a1c0b..70bfabd 100644 --- a/src/logging_objects_with_schema/schema_logger.py +++ b/src/logging_objects_with_schema/schema_logger.py @@ -30,6 +30,12 @@ def _log_schema_problems_and_exit(problems: list[SchemaProblem]) -> None: """Log schema problems to stderr and terminate the application. + Uses os._exit(1) instead of sys.exit(1) to ensure immediate termination + without running cleanup handlers (atexit, finally blocks, etc.). This is + important because schema problems indicate a fatal configuration error that + should stop the application immediately, and we don't want any cleanup code + to run (which might try to use the broken logger or cause additional errors). + Args: problems: List of schema problems to log. """ @@ -39,6 +45,7 @@ def _log_schema_problems_and_exit(problems: list[SchemaProblem]) -> None: error_msg = f"Schema has problems: {'; '.join(problem_messages)}\n" sys.stderr.write(error_msg) sys.stderr.flush() + # Use os._exit() for immediate termination without cleanup handlers os._exit(1) @@ -138,6 +145,12 @@ def _log( extra or {}, ) + # Emit the main log record first, even if there are validation problems. + # This ensures 100% compatibility with standard logger behavior: the user's + # log message is always emitted, and validation errors are reported separately + # as additional ERROR messages. This approach guarantees that the application + # continues to work normally even when validation problems occur (no exceptions + # are raised, no log records are lost). super()._log( level, msg, @@ -150,9 +163,12 @@ def _log( stacklevel=stacklevel + 1, ) + # If there were validation problems, log them as separate ERROR messages + # after the main log record has been emitted. This ensures the main message + # is always logged first, and validation errors are clearly separated. if data_problems: - # Log validation errors as ERROR messages - # Get caller information for the error log record. + # Get caller information for the error log record so that validation + # errors point to the same location in user code as the original log call. # Python 3.11+ has improved findCaller() with proper stacklevel support, # so we use it as the primary method for better performance. # For Python < 3.11, we fall back to inspect.stack() due to known issues @@ -186,10 +202,23 @@ def _log( fn, lno, func, sinfo = self.findCaller( stack_info=False, stacklevel=stacklevel + 1 ) - # Format error message as JSON for machine processing + # Format error message as JSON for machine processing. + # Each DataProblem.message is already a JSON string (created by + # _create_validation_error_json) with structure: + # {"field": "...", "error": "...", "value": "..."} + # We parse them back to dicts and combine into a single JSON object + # with all validation errors. The final structure is: + # {"validation_errors": [{"field": "...", "error": "...", + # "value": "..."}, ...]} + # This allows consumers to parse the error message as structured data + # and programmatically extract field names, error types, and values. validation_errors = [] for problem in data_problems: try: + # Parse the JSON string back to a dict so we can combine + # all errors into a single JSON object. Each error object + # maintains the same structure: field, error, value + # (all via repr()). error_obj = json.loads(problem.message) validation_errors.append(error_obj) except (json.JSONDecodeError, TypeError) as exc: @@ -197,7 +226,9 @@ def _log( # create a fallback error object. This should never happen in # normal operation since problem.message is always created via # _create_validation_error_json, but protects against unexpected - # data corruption or manual DataProblem creation. + # data corruption or manual DataProblem creation. The fallback + # preserves the same structure (field, error, value) for + # consistency. validation_errors.append( { "field": repr("unknown"), @@ -207,12 +238,17 @@ def _log( ) try: + # Combine all validation errors into a single JSON object. + # The resulting string will be used as the log message for the + # ERROR record, allowing structured parsing by log consumers. error_msg = json.dumps({"validation_errors": validation_errors}) except (TypeError, ValueError) as exc: # Defensive handling: if serialization fails, create a fallback # error message. This should never happen in normal operation since - # validation_errors contains only dicts with primitive values, but - # protects against unexpected data corruption. + # validation_errors contains only dicts with primitive values (all + # values are already serialized via repr()), but protects against + # unexpected data corruption or edge cases in JSON serialization. + # The fallback ensures we always have a valid JSON error message. error_msg = json.dumps( { "validation_errors": [ From eacbd1d1668fe590b85fa85658daecd37f475485 Mon Sep 17 00:00:00 2001 From: Dmitrii Safronov Date: Sun, 7 Dec 2025 17:50:21 +0400 Subject: [PATCH 8/9] docs(README): streamline schema validation and logging behavior explanations - Revised the README to clarify the role of the JSON schema in validating `extra` fields for logging, emphasizing its importance as a contract between log producers and consumers. - Enhanced sections on strictness guarantees, schema structure, and error handling to provide clearer guidance on how the library manages logging behavior and validation errors. - Updated examples to illustrate the expected behavior when logging with valid and invalid data, ensuring users understand the implications of schema compliance. Signed-off-by: Dmitrii Safronov --- README.md | 253 +++++++++++++++--------------------------------------- 1 file changed, 70 insertions(+), 183 deletions(-) diff --git a/README.md b/README.md index 0ceb7f0..227b1a3 100644 --- a/README.md +++ b/README.md @@ -2,49 +2,38 @@ This library provides a logger subclass built on top of the standard `logging` module that strictly controls additional `extra` fields using a JSON schema. - -## Core idea - -- The standard `logging` package is used (your application configures handlers - and formatting as usual). -- `SchemaLogger` is a subclass of `logging.Logger` designed to be used as a - drop-in replacement via `logging.setLoggerClass(SchemaLogger)`. -- The schema is stored in a JSON file named `logging_objects_with_schema.json` - in the application root directory. -- Any user-provided `extra` fields are included in the log **only if** they are - described in the schema and match the declared Python type. -- The library is universal and works with any formatters from the standard - `logging` module. It is not tied to any specific log format or logging library. +`SchemaLogger` is a drop-in replacement for `logging.Logger` that validates +`extra` fields against a JSON schema file (`logging_objects_with_schema.json`) +in your application root directory. ## Schema as a contract The JSON schema is treated as a **contract** between all parties that produce -and consume logs in the system. It defines which structured fields are allowed -to appear in logs and which types they must have. - -- Application code must only send `extra` fields that are described in the schema - and match the declared Python types. Any deviation (unknown fields, wrong types, - `None` values, disallowed list elements) is logged as an ERROR message - *after* the log record has been emitted. -- The schema file (`logging_objects_with_schema.json`) is a shared, versioned - artifact that defines the shape of structured log payloads for all downstream - consumers (search, alerts, dashboards, external systems). - -## Strictness guarantees - -- Only fields explicitly described in the JSON schema (as leaves with `type` and - `source`) can ever reach your logs. -- Any `extra` field that is **not** described in the schema is treated as a data - error: it is dropped from the log output and recorded as a validation problem. -- Any mismatch between runtime values and the declared types is also treated as - a data error. -- All validation problems (unknown fields, wrong types, disallowed list - elements, `None` values, etc.) are aggregated and logged as a single - ERROR message **after** the log record has been emitted, ensuring 100% - compatibility with standard logger behavior (no exceptions are raised). -- The schema is treated as the only source of truth for which `extra` fields - are allowed to appear in logs. Any deviation from the schema is considered a - contract violation between the producer of `extra` and the schema author. +and consume logs in the system. The schema file (`logging_objects_with_schema.json`) +is a shared, versioned artifact that defines which structured fields are allowed +to appear in logs and which Python types they must have. This contract ensures +that all downstream consumers (search systems, alerts, dashboards, external +systems) can rely on a consistent log structure. + +**Strictness guarantees:** + +- **`extra` fields never go directly into logs.** They are always projected + through the schema: values from `extra` are taken by `source` field names + and placed into the log structure according to the schema paths. Only fields + explicitly described in the schema (as leaves with `type` and `source`) can + ever reach your logs. The schema is the only source of truth for which + `extra` fields are allowed. +- Any `extra` field that is **not** described in the schema is treated as a + data error: it is dropped from the log output and recorded as a validation + problem. +- Any mismatch between runtime values and the declared types (wrong types, + `None` values, disallowed list elements) is also treated as a data error. +- All validation problems are aggregated and logged as a single ERROR message + **after** the log record has been emitted, ensuring 100% compatibility with + standard logger behavior (no exceptions are raised). +- Application code must only send `extra` fields that are described in the + schema and match the declared Python types. Any deviation from the schema + is considered a contract violation. ## Installation @@ -144,52 +133,22 @@ logger = logging.getLogger("service") # logged as ERROR messages. No exception handling is needed. logger.info("processing", extra={"user_id": "not-an-int"}) # Wrong type # The valid part of the log is emitted, and validation errors are logged -# as ERROR messages with details about the problems. +# as ERROR messages in JSON format: {"validation_errors": [{"field": "...", "error": "...", "value": "..."}]} ``` -### API compatibility with ``logging.Logger`` +## Schema location and format -- ``SchemaLogger`` is a subclass of ``logging.Logger`` and can be used as a - drop-in replacement via ``logging.setLoggerClass(SchemaLogger)``. -- The public methods of ``SchemaLogger`` mirror the standard ``logging.Logger`` - API and accept the same arguments: ``msg, *args, **kwargs``. -- The only behavioural difference is that the named ``extra`` argument is - intercepted, validated according to the JSON schema, and only the validated - subset is passed further into the standard logging pipeline. +The schema file `logging_objects_with_schema.json` must be located in your +application root directory. The library searches upward from the current working +directory until it finds the file or reaches the filesystem root. -## Schema location and format +**Important**: If there are any problems with the schema (missing file, broken +JSON, invalid structure, etc.), the application is terminated after logging +schema problems to stderr. Schema validation happens when the first logger +instance is created. -- Schema file: `logging_objects_with_schema.json`. -- Location: **application root directory**. The library searches upward from the - current working directory for the schema file itself, walking up the directory - tree until it finds the file or reaches the filesystem root. -- Schema tree depth is limited to a maximum nesting level (currently 100). Any - branch that exceeds this depth is ignored and reported as a schema problem. - -When a `SchemaLogger` instance is created (via `logging.getLogger()` after -`logging.setLoggerClass(SchemaLogger)`), the library searches for the schema -file, parses the JSON, and walks the entire tree to collect all problems with -the schema. - -If there are **any** problems with the schema (missing file, broken JSON, -invalid `type` values, conflicting root fields that match system logging -fields, malformed structure, etc.): - -- the logger instance is not created (schema validation happens before - the logger is initialized); -- schema problems are logged to stderr in the format: - `"Schema has problems: {problem1}; {problem2}; ..."`; -- the application is terminated via `os._exit(1)`. - -If there are no problems, the schema is compiled and the logger is created. -A valid empty schema (e.g., `{}` or a schema with only inner nodes and no -leaves) is treated as valid and does not cause errors. The logger is created -successfully, but no `extra` fields will be included in log records. - -**Note**: System-level errors (OSError, ValueError, RuntimeError) that occur -during schema compilation are converted to `SchemaProblem` instances and -handled the same way as schema validation problems - the application is -terminated after logging the error to stderr. +A valid empty schema (`{}`) is allowed and will result in no `extra` fields +being included in log records. An example schema: @@ -216,44 +175,23 @@ An example of a valid empty schema (no leaves, no problems): {} ``` -An empty schema is valid and does not cause errors. When using an empty schema, -no `extra` fields will be included in log records, and any attempt to log with -`extra` fields will result in validation errors being logged as ERROR messages. - -- An inner node is an object without `type` and `source` (neither field is present). -- A leaf node is an object that has at least one of `type` or `source` fields. - However, a valid leaf node must have both `type` and `source` fields. If a - leaf node is missing either field or has an empty value, it will be reported - as a schema problem during validation. -- `type` is one of the allowed Python type names: `"str"`, `"int"`, `"float"`, - `"bool"`, or `"list"`. -- For `"list"` type, an additional `item_type` field is required to declare - the element type. Only primitive element types (`"str"`, `"int"`, `"float"`, - `"bool"`) are allowed. Nested lists and dictionaries are not permitted as - list elements. -- `source` is the name of the field in `extra` from which the value is taken. - -### Schema root key restrictions - -- The library protects system fields from the standard `logging` module - (attributes of `LogRecord` and logger internals) by preventing their use as - root keys in the schema. -- If a root key in the schema conflicts with a system logging field, a - `SchemaProblem` is generated and the schema validation fails. -- Responsibility for ensuring compatibility with other logging libraries and - formatters lies with the developer when writing the schema. - -### List-typed fields - -When a leaf declares `"type": "list"`, the runtime value must be a Python list -with **homogeneous primitive elements**. The element type is defined by the -mandatory `item_type` field: - -- Allowed `item_type` values: `"str"`, `"int"`, `"float"`, `"bool"` -- All elements in the list must have exactly the declared Python type - (e.g. `type(item) is int` for `"item_type": "int"`) -- Empty lists are allowed -- Nested lists and dictionaries inside the list are not allowed +**Schema structure:** + +- **Inner nodes**: Objects without `type` and `source` fields (used for nesting). +- **Leaf nodes**: Objects with both `type` and `source` fields. A valid leaf + must have both fields present and non-empty. +- **`type`**: One of `"str"`, `"int"`, `"float"`, `"bool"`, or `"list"`. +- **`source`**: The name of the field in `extra` from which the value is taken. +- **Root key restrictions**: Root keys cannot conflict with standard `logging` + module fields (e.g., `name`, `levelno`, `pathname`). Such conflicts cause + schema validation to fail. + +**List-typed fields:** + +For `"type": "list"`, you must also specify `"item_type"` (one of `"str"`, +`"int"`, `"float"`, `"bool"`). Lists must contain homogeneous primitive elements +of the declared type. Empty lists are allowed; nested lists and dictionaries are +not permitted. Example of a valid list field: @@ -280,39 +218,21 @@ logger.info( ) ``` -Example of an invalid list field (non-primitive elements): +Invalid example (non-primitive elements are rejected): ```python -logger.info( - "request processed", - extra={ - "tags": [{"key": "color", "value": "blue"}], # list[dict] – invalid - }, -) +logger.info("request processed", extra={"tags": [{"key": "color"}]}) # Invalid +# Validation error is logged as ERROR after the log record is emitted ``` -In this case the `tags` value is rejected, a `DataProblem` is recorded with a -JSON message containing field, error, and value information. - -And an ERROR message is logged **after** the log record has been emitted with -the JSON format: `{"validation_errors": [{"field": "...", "error": "...", "value": "..."}]}`. -All fields are serialized via `repr()` for safety and consistency. +**Multiple leaves with the same source:** -### Multiple leaves with the same source +A single `source` field name can be used in multiple leaves. The value is +validated independently for each leaf and written to all matching locations. +If types conflict, the value is written only where the type matches, and +validation errors are reported for mismatched locations. -A single `source` field name can be used in multiple leaves of the schema. This -allows the same value from `extra` to be placed in different locations of the -output structure. - -When a `source` is referenced by multiple leaves: - -- The value is validated against each leaf's expected type independently. -- The value is written only to those leaf locations where the runtime type - matches the expected type. -- For leaf locations where the type does not match, a `DataProblem` is added - to the ERROR message that is logged after logging. - -Example schema with duplicate source usage: +Example: ```json { @@ -325,41 +245,8 @@ Example schema with duplicate source usage: } ``` -In this example, if `extra={"request_id": "abc-123"}`, the value `"abc-123"` -will be written to both `ServicePayload.RequestID` and `ServicePayload.Metadata.ID`. - -If the same `source` is used with conflicting types (e.g., one leaf expects -`str` and another expects `int`), the value will only be written to locations -where the type matches, and validation problems will be reported for the -mismatched locations. It is the schema author's responsibility to ensure -consistent type expectations when reusing a `source` field. - -## Schema caching and thread safety - -- The library caches compiled schemas after the first load. The schema is - effectively loaded and compiled **once per process** for a given schema path. - Subsequent logger instances reuse the cached compiled schema. -- Schema compilation and cache access are **thread-safe**: multiple threads can - safely create `SchemaLogger` instances concurrently without race conditions. -- **Note**: The library does not provide a mechanism to reload the schema - without restarting the application. This is a deliberate design decision to - ensure schema consistency throughout the application lifecycle. - -## Behaviour when logging data - -- Any `extra` provided by the application **never goes directly** into the log. -- The only way for additional fields to reach the log: - 1. The field is described as a leaf in the schema (`type` + `source`). - 2. `extra` contains a value under the given `source` name. - 3. The runtime type of the value strictly matches the declared Python type - (exact type match is used, e.g. `bool` is not accepted where `int` is - expected, and vice versa). -- Fields that do not meet these conditions are simply not included in the final - log record (see "Strictness guarantees" above for details on validation - error reporting). -- When a `source` is used in multiple leaves (see "Multiple leaves with the same - source" above), the value is validated and written independently for each leaf - location. -- Before emitting the final structured payload, the library strips empty - dictionaries and `None` values from nested structures so that only meaningful - data appears in the resulting log record. +With `extra={"request_id": "abc-123"}`, the value appears in both +`ServicePayload.RequestID` and `ServicePayload.Metadata.ID`. + +**Note**: The schema is compiled once per process and cached. Schema changes +require an application restart to take effect. The library is thread-safe. From b2e75ccc0ce5d9de4419e3e800af923f729de1c3 Mon Sep 17 00:00:00 2001 From: Dmitrii Safronov Date: Sun, 7 Dec 2025 17:55:24 +0400 Subject: [PATCH 9/9] docs(README): clarify schema caching behavior and thread safety - Added a note regarding the caching of the schema, specifying that it is compiled once per process and requires an application restart for changes to take effect. - Emphasized the thread safety of the library, ensuring users understand the implications of concurrent access to the schema. Signed-off-by: Dmitrii Safronov --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 227b1a3..b6d8ead 100644 --- a/README.md +++ b/README.md @@ -147,6 +147,9 @@ JSON, invalid structure, etc.), the application is terminated after logging schema problems to stderr. Schema validation happens when the first logger instance is created. +**Note**: The schema is compiled once per process and cached. Schema changes +require an application restart to take effect. The library is thread-safe. + A valid empty schema (`{}`) is allowed and will result in no `extra` fields being included in log records. @@ -247,6 +250,3 @@ Example: With `extra={"request_id": "abc-123"}`, the value appears in both `ServicePayload.RequestID` and `ServicePayload.Metadata.ID`. - -**Note**: The schema is compiled once per process and cached. Schema changes -require an application restart to take effect. The library is thread-safe.