Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Feat: Added AWS Athena data loader
```python
                return jsonify({
                    "status": "error",
                    "message": result.get('content', 'Unknown error during transformation')
                }), 400
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium)

Copilot Autofix (AI, 8 days ago):
To fix this, we should stop returning the detailed exception-derived message from the sandbox directly to the client, and instead (a) log the detailed error on the server, and (b) return a generic, non-sensitive message in the HTTP response. This applies at two levels:
- In `py_sandbox.run_in_main_process`, instead of building a verbose `error_message` that is later surfaced to the client, we should build both:
  - a detailed log message (kept server-side), and
  - a safer, generic or lightly sanitized error string intended to be propagated outward.
- In `agent_routes.refresh_derived_data`, we should ensure that the `"message"` field does not echo the detailed sandbox error verbatim, but only a generic message (optionally including a short code like the exception type, which is typically safe).

Because CodeQL’s taint tracking starts at `err` in `run_in_main_process` and flows through `error_message` to `result['error_message']` and then all the way to `refresh_derived_data`, the most robust fix is to break that propagation path. Concretely:

- Modify `run_in_main_process`:
  - Capture the exception details, including `traceback.format_exc()`, in a local variable.
  - Return a response object where:
    - `error_message` is a generic, non-detailed string (for example, `"Execution failed due to an error in the transformation code."` or at most `"Execution failed with ValueError"`).
    - The detailed traceback is not included in the returned structure (or, if needed internally, is kept in a separate key clearly not used for user-facing messages; given only the shown code, we’ll avoid returning it altogether).
- Optionally, if `py_sandbox.py` is not using logging yet and we are allowed to introduce it, we could add basic logging there; but since the logger setup in that file was not shown and the alert is about exposure to the user, not missing logging, the minimal compliant fix is to keep detailed info local and not return it.
- Modify `refresh_derived_data` in `agent_routes.py`:
  - When `result['status'] != 'ok'`, stop returning `result.get('content', ...)` directly.
  - Instead, send a generic error message like `"Error executing transformation code"` (optionally with a hint: `"Check your code and try again."`), independent of `result['content']`.

This preserves the existing function behavior in terms of control flow (success vs. error) and the structure of the returned JSON (`status`, `rows`, `message` keys), but prevents internal exception details from being exposed.
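As a rough sketch of the pattern described above — log details server-side, return only a generic message to callers (the function name `run_transformation` is illustrative, not from the PR):

```python
import logging
import traceback

logger = logging.getLogger(__name__)

def run_transformation(code: str, restricted_globals: dict) -> dict:
    """Execute untrusted transformation code, keeping error details server-side."""
    try:
        exec(code, restricted_globals)
    except Exception as err:
        # Detailed context (including the traceback) stays in the server logs only.
        logger.error("Transformation failed: %s\n%s", err, traceback.format_exc())
        # The caller, and ultimately the HTTP client, only sees the exception type.
        return {
            'status': 'error',
            'error_message': (
                "Execution failed due to an error in the transformation code "
                f"({type(err).__name__})."
            ),
        }
    return {'status': 'ok'}
```

The exception type name alone is usually safe to surface, while `str(err)` can embed file paths or data values and is kept out of the returned structure.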
```diff
@@ -778,9 +778,10 @@
                 "message": "Successfully refreshed derived data"
             })
         else:
+            # Do not expose detailed sandbox error information to the client.
             return jsonify({
                 "status": "error",
-                "message": result.get('content', 'Unknown error during transformation')
+                "message": "Error executing transformation code. Please check your code and try again."
             }), 400
```
```diff
@@ -106,8 +106,11 @@
         try:
             exec(code, restricted_globals)
         except Exception as err:
-            error_message = f"Error: {type(err).__name__} - {str(err)}"
-            return {'status': 'error', 'error_message': error_message}
+            # Build a generic, non-sensitive error message for callers.
+            generic_message = f"Execution failed due to an error in the transformation code ({type(err).__name__})."
+            # Note: full traceback and error details are intentionally not returned to callers
+            # to avoid leaking internal information. They should be logged by the caller if needed.
+            return {'status': 'error', 'error_message': generic_message}

         return {'status': 'ok', 'allowed_objects': {key: restricted_globals[key] for key in allowed_objects}}
```

```python
    except Exception as e:
        return jsonify({
            "status": "error",
            "message": str(e)
        }), 400
```
Check warning — Code scanning / CodeQL: Information exposure through an exception (Medium)

Copilot Autofix (AI, 8 days ago):
In general, to fix information exposure through exceptions, log detailed error information (including stack traces) only on the server side and return a generic, non-sensitive message to the client. Avoid echoing str(e) or any stack trace data back in HTTP responses.
For this specific endpoint (refresh_derived_data in py-src/data_formulator/agent_routes.py, lines 786–792), we should keep the logging behavior but replace the client-facing "message": str(e) with a generic message such as "An internal error occurred while refreshing derived data." or similar. This preserves existing functionality (the client still gets an "error" status and a message field) while removing the potential exposure of internal exception details. No new imports or helper methods are required; we only adjust the JSON payload in the except block. The rest of the function remains unchanged.
Concretely:

- Keep:
  - `logger.error(f"Error refreshing derived data: {str(e)}")`
  - `logger.error(traceback.format_exc())`
- Change:
  - In the `jsonify` call at lines 789–792, replace `str(e)` with a fixed generic message string.
```diff
@@ -788,5 +788,5 @@
         logger.error(traceback.format_exc())
         return jsonify({
             "status": "error",
-            "message": str(e)
+            "message": "An internal error occurred while refreshing derived data."
         }), 400
```
Pull request overview
This PR relocates and updates the derived-table refresh API, expands supported external data loaders (notably AWS Athena), and improves the refresh/streaming UX in both backend and frontend.
Changes:

- Moved derived table refresh to `/api/agent/refresh-derived-data` and updated frontend routing accordingly.
- Added `AthenaDataLoader` and registered it in the data loader registry + docs.
- Adjusted refresh messaging behavior and enhanced UI table metadata display; expanded demo streaming limits/buffers.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/views/MessageSnackbar.tsx | Removes unused state/vars related to message UI behavior. |
| src/views/DBTableManager.tsx | Displays imported-table provenance (data loader type + source table). |
| src/app/utils.tsx | Updates derived-refresh URL and removes old tables endpoint from session-required list. |
| src/app/useDataRefresh.tsx | Removes throttling of refresh messages; dispatches refresh status messages immediately. |
| py-src/data_formulator/tables_routes.py | Removes the old /api/tables/refresh-derived-data endpoint implementation. |
| py-src/data_formulator/agent_routes.py | Adds the new /api/agent/refresh-derived-data endpoint implementation. |
| py-src/data_formulator/demo_stream_routes.py | Increases ISS history buffer and loosens ISS query limits for demo streaming. |
| py-src/data_formulator/data_loader/athena_data_loader.py | Adds new AWS Athena integration for listing/ingesting/query sampling into DuckDB. |
| py-src/data_formulator/data_loader/__init__.py | Registers Athena loader in DATA_LOADERS and exports it. |
| py-src/data_formulator/data_loader/README.md | Documents additional loader integrations including Athena. |
```diff
 # Thread-safe storage for ISS position history
 _iss_track_lock = threading.Lock()
-_iss_track_history: deque = deque(maxlen=500)  # Keep last 500 positions (~40 min at 5s intervals)
+_iss_track_history: deque = deque(maxlen=10000)  # Keep last 10000 positions (~20000 min at 5s intervals)
```
The inline comment about how much history is retained is incorrect: 10,000 points at 5s intervals is ~833 minutes (~13.9 hours), not “~20000 min”. Please fix the comment to avoid misleading future maintenance.
```diff
-_iss_track_history: deque = deque(maxlen=10000)  # Keep last 10000 positions (~20000 min at 5s intervals)
+_iss_track_history: deque = deque(maxlen=10000)  # Keep last 10000 positions (~833 min / ~14 hours at 5s intervals)
```
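The retention math behind the corrected comment can be checked directly:

```python
maxlen = 10000      # positions retained by the deque
interval_s = 5      # one sample every 5 seconds

retention_min = maxlen * interval_s / 60
retention_hours = retention_min / 60
print(f"{retention_min:.0f} min (~{retention_hours:.1f} hours)")  # → 833 min (~13.9 hours)
```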
```python
    minutes = min(1440, max(1, int(request.args.get('minutes', 1440))))
    limit = min(10000, max(1000, int(request.args.get('limit', 10000))))
```
int(request.args.get(...)) will raise ValueError for non-numeric query params, causing a 500. Since this endpoint is user-facing, please parse minutes/limit defensively (e.g., try/except with a 400 or default) and also update the docstring above, which still documents the old defaults/max values.
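One way to parse such parameters defensively (a sketch, not the PR’s code; the helper name is illustrative, and the clamping bounds are taken from the snippet above):

```python
def parse_clamped_int(raw, default: int, lo: int, hi: int):
    """Coerce a query-string value to an int clamped to [lo, hi].

    Returns None for non-numeric input so the route can answer 400
    instead of crashing with an unhandled ValueError (an HTTP 500).
    """
    if raw is None:
        raw = default
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return None
    return min(hi, max(lo, value))

# In the route, something along the lines of:
#   minutes = parse_clamped_int(request.args.get('minutes'), 1440, 1, 1440)
#   if minutes is None:
#       return jsonify({"status": "error", "message": "minutes must be an integer"}), 400
```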
```python
        data = request.get_json()
```
data = request.get_json() can be None (or non-dict) if the client sends invalid JSON, and then data.get(...) will raise. Please validate request.is_json (and that data is a dict) and return a clear 400 like other agent endpoints (e.g., “Invalid request format”).
```diff
-        data = request.get_json()
+        if not request.is_json:
+            return jsonify({
+                "status": "error",
+                "message": "Invalid request format"
+            }), 400
+        data = request.get_json()
+        if not isinstance(data, dict):
+            return jsonify({
+                "status": "error",
+                "message": "Invalid request format"
+            }), 400
```
| "message": str(e) | ||
| }), 400 |
The exception handler returns message: str(e) with HTTP 400. This can leak internal details (tracebacks/paths/sandbox errors) to the client and also misclassifies unexpected server failures as a client error. Please sanitize the error message (and consider a generic message for unexpected exceptions) and return an appropriate status code (typically 500).
| "message": str(e) | |
| }), 400 | |
| "message": "An unexpected error occurred while refreshing derived data." | |
| }), 500 |
```python
        # Set AWS credentials for DuckDB
        self.duck_db_conn.execute(f"SET s3_region='{self.region_name}'")
        self.duck_db_conn.execute(f"SET s3_access_key_id='{self.aws_access_key_id}'")
        self.duck_db_conn.execute(f"SET s3_secret_access_key='{self.aws_secret_access_key}'")
        if self.aws_session_token:
            self.duck_db_conn.execute(f"SET s3_session_token='{self.aws_session_token}'")
```
These DuckDB SET ...='...' statements interpolate credential values directly into SQL. If any value contains a single quote, this will break the statement and can create SQL-injection style issues. Please escape/quote these values safely before embedding (e.g., reuse _escape_sql_string(...)), or use a safer DuckDB API for setting secrets if available.
```diff
-        # Set AWS credentials for DuckDB
-        self.duck_db_conn.execute(f"SET s3_region='{self.region_name}'")
-        self.duck_db_conn.execute(f"SET s3_access_key_id='{self.aws_access_key_id}'")
-        self.duck_db_conn.execute(f"SET s3_secret_access_key='{self.aws_secret_access_key}'")
-        if self.aws_session_token:
-            self.duck_db_conn.execute(f"SET s3_session_token='{self.aws_session_token}'")
+        # Set AWS credentials for DuckDB using parameterized queries to avoid SQL injection and quoting issues
+        self.duck_db_conn.execute("SET s3_region=?", [self.region_name])
+        self.duck_db_conn.execute("SET s3_access_key_id=?", [self.aws_access_key_id])
+        self.duck_db_conn.execute("SET s3_secret_access_key=?", [self.aws_secret_access_key])
+        if self.aws_session_token:
+            self.duck_db_conn.execute("SET s3_session_token=?", [self.aws_session_token])
```
```diff
@@ -34,18 +34,15 @@ export interface Message {
 }

 export function MessageSnackbar() {
```
useTheme is still imported but the theme variable was removed, so the import is now unused. Please remove the unused useTheme import (and any related dead code) to keep the module clean.
```tsx
<Typography component="span" sx={{fontSize: 12, color: 'text.secondary'}}>
    {currentTable.source_metadata && `imported from ${currentTable.source_metadata.data_loader_type}.${currentTable.source_metadata.source_table_name}`}
</Typography>
```
source_table_name is optional in the source_metadata type, but this string interpolation assumes it’s always present. If it’s missing, the UI will render imported from <type>.undefined. Consider guarding on both fields (or using fallbacks / optional chaining) and formatting without a trailing dot when source_table_name is absent.
```tsx
dispatch(dfActions.addMessages({
    timestamp: Date.now(),
    component: 'Data Refresh',
    type: 'info',
    value: `Table "${table.displayId || table.id}" data refreshed (${result.newRows.length} rows)`
}))
```
With throttling removed, auto-refresh can append messages on every refresh tick and dfSlice.addMessages currently grows state.messages unbounded. This can lead to memory growth and slower renders (e.g., MessageSnackbar groups messages by iterating the full array). Consider adding a max length cap in the reducer (drop oldest), or coalescing per-table refresh messages instead of always appending.
This pull request introduces a new endpoint for refreshing derived data tables, improves data loader support and documentation, and makes UX and backend improvements for data refresh and streaming features. The most significant changes are the addition and relocation of the derived data refresh endpoint, expanded data loader integrations, and UI/UX enhancements for table metadata and messaging.
Backend API and Data Refresh Improvements:
- Added the `/api/agent/refresh-derived-data` endpoint in `agent_routes.py` to re-run Python transformation code for derived tables, replacing the previous implementation in `tables_routes.py`. This endpoint validates input, executes transformation code in a sandbox, and returns refreshed data or error messages. [1] [2]
- Added `REFRESH_DERIVED_DATA` for derived table refreshes and removed the old endpoint from the session-required list. [1] [2]

Data Loader Integrations and Documentation:

- Added `AthenaDataLoader` support and included it in the `DATA_LOADERS` registry and `__all__` export, enabling AWS Athena integration. [1] [2]

Frontend Data Refresh and Messaging Enhancements:

- Removed throttling of refresh messages in `useDataRefresh` and `useDerivedTableRefresh`, ensuring all refresh status messages are dispatched immediately for better user feedback. [1] [2] [3] [4] [5] [6]

Streaming and Demo Data Improvements: