Move the current benchmark configs to yaml files #566
R-Palazzo wants to merge 16 commits into feature/benchmark_launcher from
Conversation
```yaml
compute_privacy_score: false

compute:
  service: 'gcp'
```
In this PR, we only indicate which compute service to use. In a future issue, we will also move the compute config defined in SDGym/sdgym/_benchmark/config_utils.py (line 19 in 3c2a248) inside this yaml file.
```python
)


class BenchmarkLauncher:
```
Let me know if the naming works. This class could also be named Benchmark, since it does more than launching: one will be able to manage and terminate instances from it. Let me know your thoughts on it.
Given the file name and context, I think the naming makes sense; it is already defined in the issue and I think it is good. If you are looking for an alternative, BenchmarkManager or BenchmarkRunner could work.
Codecov Report

❌ Patch coverage is …. Additional details and impacted files:

```
@@               Coverage Diff                @@
##   feature/benchmark_launcher   #566  +/- ##
==============================================
+ Coverage             82.41%  83.47%   +1.05%
==============================================
  Files                    33      38       +5
  Lines                  2923    3195     +272
==============================================
+ Hits                   2409    2667     +258
- Misses                  514     528      +14
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
pvk-developer left a comment:
I think this is looking very good. I would suggest going over `_validation.py` and trying to modularize it a little if you have the time, since the long functions in there are hard to navigate.
Force-pushed from abe9001 to f670ba5.
```yaml
# @@ -0,0 +1,21 @@
method_params:
  timeout: 345600
  output_destination: 's3://sdgym-benchmark/Debug/Benchmark_Launcher/'
```
TODO: Update before merging
```shell
)
"
python -m pip install "sdgym[all]"
python -m pip install "sdgym[all] @ git+https://github.com/sdv-dev/SDGym.git@issue-532-define-yaml-files"
```
TODO: Revert before merging
amontanez24 left a comment:
I think we should restructure the credentials a bit. Let's define a format for the dict like:

```yaml
gcp:
  GCP_SERVICE_ACCOUNT_JSON: ...
  GCP_SERVICE_ACCOUNT_PATH: ...
  GCP_PROJECT_ID: ...
  GCP_ZONE: ...
sdv_enterprise:
  SDV_ENTERPRISE_USERNAME: ...
  SDV_ENTERPRISE_LICENSE_KEY: ...
aws:
  AWS_ACCESS_KEY_ID: ...
  AWS_SECRET_ACCESS_KEY: ...
```

We then load this dict and check for the expected keys. If a key isn't present, we try to load it from the environment. So if someone is running with GCP, we would attempt to load all the expected GCP keys from the dict and then check the environment for any the dict doesn't have.

If you want to make the names less redundant, you can remove the prefixes ('SDV', 'AWS', 'GCP') and say that if you are using environment variables you should store them as {service_name}_{key_name}.
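A minimal Python sketch of that dict-then-environment fallback; the helper name `resolve_credentials` and the error handling are assumptions for illustration, not the PR's implementation:

```python
import os


def resolve_credentials(config, service, expected_keys):
    """Resolve one service's credentials from a config dict, falling back
    to environment variables for any keys the dict does not provide."""
    service_config = (config or {}).get(service, {})
    resolved = {}
    for key in expected_keys:
        # Prefer the value from the config dict, then the environment.
        value = service_config.get(key, os.environ.get(key))
        if value is None:
            raise ValueError(f"Missing credential '{key}' for service '{service}'.")

        resolved[key] = value

    return resolved
```

With this shape, a partially filled config dict is completed transparently from the environment, and a hard error is raised only when neither source has the key.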
```yaml
compute:
  service: 'gcp'

credential_locations:
```
Why is it called `credential_locations`?
```yaml
gcp:
  service_account_json_env: 'GCP_SERVICE_ACCOUNT_JSON'
  project_id_env: 'GCP_PROJECT_ID'
  zone_env: 'GCP_ZONE'
sdv_enterprise:
  username_env: 'SDV_ENTERPRISE_USERNAME'
  license_key_env: 'SDV_ENTERPRISE_LICENSE_KEY'
aws:
  access_key_id_env: 'AWS_ACCESS_KEY_ID'
  secret_access_key_env: 'AWS_SECRET_ACCESS_KEY'
```
I think we should either pass the credentials here, or just set them as env variables with the same name as the key. For example:

```yaml
gcp:
  GCP_SERVICE_ACCOUNT_JSON: ...
  GCP_PROJECT_ID: sdgym-123
  GCP_ZONE: ...
```

When loading credentials we can either load from the config file or, if the key isn't present, load from the environment. I think that is a more common pattern.
```python
def _get_credentials(credential_locations):
    """Get resolved credentials dict."""
    config = credential_locations or {}
    filepath = config.get('credential_filepath')
```
In the case where the filepath is provided, is the structure of the credentials dict the same as `credential_locations`?
```python
sig = inspect.signature(method_to_run)
required = {
    parameter.name
    for parameter in sig.parameters.values()
    if parameter.default is inspect.Parameter.empty
    and parameter.kind
    in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
}
required_from_yaml = required - _INJECTED_PARAMS
missing = required_from_yaml - set(method_params)
if missing:
    errors.append(
        f'method_params: missing required parameters for {method_to_run.__name__}:'
        f' {sorted(missing)}'
    )
```
Is this necessary? Can't we just have defaults for whatever they don't provide? The only one that I can see as required is the output destination, and I don't think that one should be grouped with the other parameters.
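For context, the signature check under discussion can be reproduced standalone; the `benchmark` function below is a hypothetical stand-in, not the real SDGym benchmark method:

```python
import inspect


def benchmark(datasets, synthesizers, timeout=3600, output_destination=None):
    """Hypothetical benchmark method used only to demonstrate the check."""


# Collect keyword-passable parameters that have no default value.
signature = inspect.signature(benchmark)
required = {
    parameter.name
    for parameter in signature.parameters.values()
    if parameter.default is inspect.Parameter.empty
    and parameter.kind
    in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
}
print(sorted(required))  # ['datasets', 'synthesizers']
```

Only the parameters without defaults survive the filter, which is why giving every method parameter a default would make the missing-parameter error effectively unreachable.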
```
'method_params': dict of parameters to pass to the benchmark method (e.g. timeout),
'credentials': dict specifying how to resolve credentials (e.g. from env vars or a file),
'compute': dict specifying the compute configuration (e.g. service: 'gcp'),
'instance_jobs': list of dicts, each specifying a combination of synthesizers and datasets:
```
I think the output destination should be specified with the jobs
I was thinking it makes more sense to have all the results for a benchmark in the same location, but you’re right we could also set one output_destination per instance_job. Both options work for me. Let me know which we prefer.
```python
def _get_credentials(credential_locations):
    """Get resolved credentials dict."""
    config = credential_locations or {}
```
I'm not sure I understand why `credential_locations` can contain either the filepath or the env variable names. I think that at the point of validation, we should just have one credentials dict with all the actual values in it, ready to be validated. Before then, I think we have the user either provide the filepath or the dict itself; the dict shouldn't have the filepath in it.
I agree, and I've reviewed the logic here.
The YAML file now uses the `credential_filepath` key to point to a JSON file containing the resolved credentials, meaning the file already includes the actual credential values in the expected final dict structure. If `credential_filepath` is not provided, or if some credential values are missing, we build or complete the resolved credentials dict from environment variables instead.
As you pointed out, GCP is a special case. For GCP, we first look for a service account JSON file, and if it is not available, we fall back to reading the individual credential values from environment variables.
Once the resolved credentials dict is obtained, we validate it. Let me know if that was the idea here.
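A rough sketch of that resolution order; the helper name `load_resolved_credentials` is an assumption for illustration and deliberately omits the GCP service-account special case:

```python
import json
import os


def load_resolved_credentials(credential_filepath, expected_keys):
    """Sketch of the described order: read resolved credentials from a
    JSON file when one is given, then fill any still-missing values
    from environment variables."""
    resolved = {}
    if credential_filepath and os.path.exists(credential_filepath):
        with open(credential_filepath) as credential_file:
            resolved = json.load(credential_file)

    for key in expected_keys:
        if key not in resolved and key in os.environ:
            resolved[key] = os.environ[key]

    return resolved
```

The file, when present, wins; the environment only backfills gaps, which matches the "build or complete from environment variables" behavior described above.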
Resolves #545
CU-86b8h52bh
It's a pretty big PR, thanks in advance for your review :)
Currently the BenchmarkLauncher only works for GCP. To make it work on another compute service (e.g. AWS), we will have to define benchmark methods that have similar parameters to the GCP ones (with credentials/compute_config): SDGym/sdgym/_benchmark/benchmark.py, line 350 in 3c2a248.