Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Feat: Added AWS Athena data loader
```python
                return jsonify({
                    "status": "error",
                    "message": result.get('content', 'Unknown error during transformation')
                }), 400
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium)

Copilot Autofix (AI, 8 days ago):
To fix this, we should stop returning the detailed exception-derived message from the sandbox directly to the client, and instead (a) log the detailed error on the server, and (b) return a generic, non-sensitive message in the HTTP response. This applies at two levels:
- In `py_sandbox.run_in_main_process`, instead of building a verbose `error_message` that is later surfaced to the client, we should build both:
  - a detailed log message (kept server-side), and
  - a safer, generic or lightly sanitized error string intended to be propagated outward.
- In `agent_routes.refresh_derived_data`, we should ensure that the `"message"` field does not echo the detailed sandbox error verbatim, but only a generic message (optionally including a short code like the exception type, which is typically safe).

Because CodeQL’s taint tracking starts at `err` in `run_in_main_process` and flows through `error_message` to `result['error_message']` and then all the way to `refresh_derived_data`, the most robust fix is to break that propagation path. Concretely:

- Modify `run_in_main_process`:
  - Capture the exception details, including `traceback.format_exc()`, in a local variable.
  - Return a response object where:
    - `error_message` is a generic, non-detailed string (for example, `"Execution failed due to an error in the transformation code."` or at most `"Execution failed with ValueError"`).
    - The detailed traceback is not included in the returned structure (or, if needed internally, is kept in a separate key clearly not used for user-facing messages; given only the shown code, we’ll avoid returning it altogether).
- Optionally, if `py_sandbox.py` is not using logging yet and we are allowed to introduce it, we could add basic logging there; but since the logger setup in that file was not shown and the alert is about exposure to the user, not missing logging, the minimal compliant fix is to keep detailed info local and not return it.
- Modify `refresh_derived_data` in `agent_routes.py`:
  - When `result['status'] != 'ok'`, stop returning `result.get('content', ...)` directly.
  - Instead, send a generic error message like `"Error executing transformation code"` (optionally with a hint: `"Check your code and try again."`), independent of `result['content']`.

This preserves the existing function behavior in terms of control flow (success vs. error) and the structure of the returned JSON (`status`, `rows`, `message` keys), but prevents internal exception details from being exposed.
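As a rough sketch of the pattern described above — log details server-side, return only a generic message to callers (the function name `run_transformation` is illustrative, not from the PR):

```python
import logging
import traceback

logger = logging.getLogger(__name__)

def run_transformation(code: str, restricted_globals: dict) -> dict:
    """Execute untrusted transformation code, keeping error details server-side."""
    try:
        exec(code, restricted_globals)
    except Exception as err:
        # Detailed context (including the traceback) stays in the server logs only.
        logger.error("Transformation failed: %s\n%s", err, traceback.format_exc())
        # The caller, and ultimately the HTTP client, only sees the exception type.
        return {
            'status': 'error',
            'error_message': (
                "Execution failed due to an error in the transformation code "
                f"({type(err).__name__})."
            ),
        }
    return {'status': 'ok'}
```

The exception type name alone is usually safe to surface, while `str(err)` can embed file paths or data values and is kept out of the returned structure.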
```diff
@@ -778,9 +778,10 @@
                 "message": "Successfully refreshed derived data"
             })
         else:
+            # Do not expose detailed sandbox error information to the client.
             return jsonify({
                 "status": "error",
-                "message": result.get('content', 'Unknown error during transformation')
+                "message": "Error executing transformation code. Please check your code and try again."
             }), 400
```
```diff
@@ -106,8 +106,11 @@
         try:
             exec(code, restricted_globals)
         except Exception as err:
-            error_message = f"Error: {type(err).__name__} - {str(err)}"
-            return {'status': 'error', 'error_message': error_message}
+            # Build a generic, non-sensitive error message for callers.
+            generic_message = f"Execution failed due to an error in the transformation code ({type(err).__name__})."
+            # Note: full traceback and error details are intentionally not returned to callers
+            # to avoid leaking internal information. They should be logged by the caller if needed.
+            return {'status': 'error', 'error_message': generic_message}

         return {'status': 'ok', 'allowed_objects': {key: restricted_globals[key] for key in allowed_objects}}
```

```python
    except Exception as e:
        return jsonify({
            "status": "error",
            "message": str(e)
        }), 400
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium)

Copilot Autofix (AI, 8 days ago):
In general, to fix information exposure through exceptions, log detailed error information (including stack traces) only on the server side and return a generic, non-sensitive message to the client. Avoid echoing str(e) or any stack trace data back in HTTP responses.
For this specific endpoint (refresh_derived_data in py-src/data_formulator/agent_routes.py, lines 786–792), we should keep the logging behavior but replace the client-facing "message": str(e) with a generic message such as "An internal error occurred while refreshing derived data." or similar. This preserves existing functionality (the client still gets an "error" status and a message field) while removing the potential exposure of internal exception details. No new imports or helper methods are required; we only adjust the JSON payload in the except block. The rest of the function remains unchanged.
Concretely:

- Keep:
  - `logger.error(f"Error refreshing derived data: {str(e)}")`
  - `logger.error(traceback.format_exc())`
- Change:
  - In the `jsonify` call at lines 789–792, replace `str(e)` with a fixed generic message string.
```diff
@@ -788,5 +788,5 @@
         logger.error(traceback.format_exc())
         return jsonify({
             "status": "error",
-            "message": str(e)
+            "message": "An internal error occurred while refreshing derived data."
         }), 400
```
Pull request overview
This PR relocates and updates the derived-table refresh API, expands supported external data loaders (notably AWS Athena), and improves the refresh/streaming UX in both backend and frontend.
Changes:

- Moved derived table refresh to `/api/agent/refresh-derived-data` and updated frontend routing accordingly.
- Added `AthenaDataLoader` and registered it in the data loader registry + docs.
- Adjusted refresh messaging behavior and enhanced UI table metadata display; expanded demo streaming limits/buffers.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/views/MessageSnackbar.tsx | Removes unused state/vars related to message UI behavior. |
| src/views/DBTableManager.tsx | Displays imported-table provenance (data loader type + source table). |
| src/app/utils.tsx | Updates derived-refresh URL and removes old tables endpoint from session-required list. |
| src/app/useDataRefresh.tsx | Removes throttling of refresh messages; dispatches refresh status messages immediately. |
| py-src/data_formulator/tables_routes.py | Removes the old /api/tables/refresh-derived-data endpoint implementation. |
| py-src/data_formulator/agent_routes.py | Adds the new /api/agent/refresh-derived-data endpoint implementation. |
| py-src/data_formulator/demo_stream_routes.py | Increases ISS history buffer and loosens ISS query limits for demo streaming. |
| py-src/data_formulator/data_loader/athena_data_loader.py | Adds new AWS Athena integration for listing/ingesting/query sampling into DuckDB. |
| py-src/data_formulator/data_loader/__init__.py | Registers Athena loader in DATA_LOADERS and exports it. |
| py-src/data_formulator/data_loader/README.md | Documents additional loader integrations including Athena. |
```diff
 # Thread-safe storage for ISS position history
 _iss_track_lock = threading.Lock()
-_iss_track_history: deque = deque(maxlen=500)  # Keep last 500 positions (~40 min at 5s intervals)
+_iss_track_history: deque = deque(maxlen=10000)  # Keep last 10000 positions (~20000 min at 5s intervals)
```
The inline comment about how much history is retained is incorrect: 10,000 points at 5s intervals is ~833 minutes (~13.9 hours), not “~20000 min”. Please fix the comment to avoid misleading future maintenance.
```diff
-_iss_track_history: deque = deque(maxlen=10000)  # Keep last 10000 positions (~20000 min at 5s intervals)
+_iss_track_history: deque = deque(maxlen=10000)  # Keep last 10000 positions (~833 min / ~14 hours at 5s intervals)
```
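The retention math behind the corrected comment can be checked directly:

```python
maxlen = 10000      # positions retained by the deque
interval_s = 5      # one sample every 5 seconds

retention_min = maxlen * interval_s / 60
retention_hours = retention_min / 60
print(f"{retention_min:.0f} min (~{retention_hours:.1f} hours)")  # → 833 min (~13.9 hours)
```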
```python
    minutes = min(1440, max(1, int(request.args.get('minutes', 1440))))
    limit = min(10000, max(1000, int(request.args.get('limit', 10000))))
```
int(request.args.get(...)) will raise ValueError for non-numeric query params, causing a 500. Since this endpoint is user-facing, please parse minutes/limit defensively (e.g., try/except with a 400 or default) and also update the docstring above, which still documents the old defaults/max values.
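One way to parse such parameters defensively (a sketch, not the PR’s code; the helper name is illustrative, and the clamping bounds are taken from the snippet above):

```python
def parse_clamped_int(raw, default: int, lo: int, hi: int):
    """Coerce a query-string value to an int clamped to [lo, hi].

    Returns None for non-numeric input so the route can answer 400
    instead of crashing with an unhandled ValueError (an HTTP 500).
    """
    if raw is None:
        raw = default
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return None
    return min(hi, max(lo, value))

# In the route, something along the lines of:
#   minutes = parse_clamped_int(request.args.get('minutes'), 1440, 1, 1440)
#   if minutes is None:
#       return jsonify({"status": "error", "message": "minutes must be an integer"}), 400
```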
```python
        data = request.get_json()
```
data = request.get_json() can be None (or non-dict) if the client sends invalid JSON, and then data.get(...) will raise. Please validate request.is_json (and that data is a dict) and return a clear 400 like other agent endpoints (e.g., “Invalid request format”).
```diff
-        data = request.get_json()
+        if not request.is_json:
+            return jsonify({
+                "status": "error",
+                "message": "Invalid request format"
+            }), 400
+        data = request.get_json()
+        if not isinstance(data, dict):
+            return jsonify({
+                "status": "error",
+                "message": "Invalid request format"
+            }), 400
```
| "message": str(e) | ||
| }), 400 |
The exception handler returns message: str(e) with HTTP 400. This can leak internal details (tracebacks/paths/sandbox errors) to the client and also misclassifies unexpected server failures as a client error. Please sanitize the error message (and consider a generic message for unexpected exceptions) and return an appropriate status code (typically 500).
| "message": str(e) | |
| }), 400 | |
| "message": "An unexpected error occurred while refreshing derived data." | |
| }), 500 |
```python
        # Set AWS credentials for DuckDB
        self.duck_db_conn.execute(f"SET s3_region='{self.region_name}'")
        self.duck_db_conn.execute(f"SET s3_access_key_id='{self.aws_access_key_id}'")
        self.duck_db_conn.execute(f"SET s3_secret_access_key='{self.aws_secret_access_key}'")
        if self.aws_session_token:
            self.duck_db_conn.execute(f"SET s3_session_token='{self.aws_session_token}'")
```
These DuckDB SET ...='...' statements interpolate credential values directly into SQL. If any value contains a single quote, this will break the statement and can create SQL-injection style issues. Please escape/quote these values safely before embedding (e.g., reuse _escape_sql_string(...)), or use a safer DuckDB API for setting secrets if available.
```diff
-        # Set AWS credentials for DuckDB
-        self.duck_db_conn.execute(f"SET s3_region='{self.region_name}'")
-        self.duck_db_conn.execute(f"SET s3_access_key_id='{self.aws_access_key_id}'")
-        self.duck_db_conn.execute(f"SET s3_secret_access_key='{self.aws_secret_access_key}'")
-        if self.aws_session_token:
-            self.duck_db_conn.execute(f"SET s3_session_token='{self.aws_session_token}'")
+        # Set AWS credentials for DuckDB using parameterized queries to avoid SQL injection and quoting issues
+        self.duck_db_conn.execute("SET s3_region=?", [self.region_name])
+        self.duck_db_conn.execute("SET s3_access_key_id=?", [self.aws_access_key_id])
+        self.duck_db_conn.execute("SET s3_secret_access_key=?", [self.aws_secret_access_key])
+        if self.aws_session_token:
+            self.duck_db_conn.execute("SET s3_session_token=?", [self.aws_session_token])
```
```diff
@@ -34,18 +34,15 @@ export interface Message {
 }

 export function MessageSnackbar() {
```
useTheme is still imported but the theme variable was removed, so the import is now unused. Please remove the unused useTheme import (and any related dead code) to keep the module clean.
```tsx
<Typography component="span" sx={{fontSize: 12, color: 'text.secondary'}}>
    {currentTable.source_metadata && `imported from ${currentTable.source_metadata.data_loader_type}.${currentTable.source_metadata.source_table_name}`}
</Typography>
```
source_table_name is optional in the source_metadata type, but this string interpolation assumes it’s always present. If it’s missing, the UI will render imported from <type>.undefined. Consider guarding on both fields (or using fallbacks / optional chaining) and formatting without a trailing dot when source_table_name is absent.
```tsx
dispatch(dfActions.addMessages({
    timestamp: Date.now(),
    component: 'Data Refresh',
    type: 'info',
    value: `Table "${table.displayId || table.id}" data refreshed (${result.newRows.length} rows)`
}))
```
With throttling removed, auto-refresh can append messages on every refresh tick and dfSlice.addMessages currently grows state.messages unbounded. This can lead to memory growth and slower renders (e.g., MessageSnackbar groups messages by iterating the full array). Consider adding a max length cap in the reducer (drop oldest), or coalescing per-table refresh messages instead of always appending.
This pull request introduces a new endpoint for refreshing derived data tables, improves data loader support and documentation, and makes UX and backend improvements for data refresh and streaming features. The most significant changes are the addition and relocation of the derived data refresh endpoint, expanded data loader integrations, and UI/UX enhancements for table metadata and messaging.
Backend API and Data Refresh Improvements:
- Added the `/api/agent/refresh-derived-data` endpoint in `agent_routes.py` to re-run Python transformation code for derived tables, replacing the previous implementation in `tables_routes.py`. This endpoint validates input, executes transformation code in a sandbox, and returns refreshed data or error messages. [1] [2]
- Added `REFRESH_DERIVED_DATA` for derived table refreshes and removed the old endpoint from the session-required list. [1] [2]

Data Loader Integrations and Documentation:

- Added `AthenaDataLoader` support and included it in the `DATA_LOADERS` registry and `__all__` export, enabling AWS Athena integration. [1] [2]

Frontend Data Refresh and Messaging Enhancements:

- Removed throttling of refresh messages in `useDataRefresh` and `useDerivedTableRefresh`, ensuring all refresh status messages are dispatched immediately for better user feedback. [1] [2] [3] [4] [5] [6]

Streaming and Demo Data Improvements: