Conversation
Co-authored-by: Jennifer A Clark <jennifer.clark@omsf.io>
Co-authored-by: Hannah Baumann <43765638+hannahbaumann@users.noreply.github.com>
Co-authored-by: Hannah Baumann <43765638+hannahbaumann@users.noreply.github.com>
| "solvent_inchikey": offmol_solute.to_inchikey(fixed_hydrogens=True), | ||
| "solvent_inchi": offmol_solute.to_inchi(fixed_hydrogens=True), | ||
| } | ||
| sys_data[key] = { |
There was a problem hiding this comment.
I think I'm not understanding yet what this system_data.json file is needed for. Is this because we cannot store the exp data, to at least have the rest here?
There was a problem hiding this comment.
@hannahbaumann Ya I thought about whether we "really needed it" because someone could just generate the experimental*.json file and use that. I guess there are two reasons to keep it:
- It provides the expected systems to run as of now, representing which ligands and systems we intend to be included in the dataset.
- It allows me to run testing and CI on the plan_asfe, but I can just use freesolv instead.
It is sort of a shame to include this file since it's almost a GB and I needed to introduce LFS because of it, but it felt like the correct "provenance thorough" thing to do. I'm open to excluding it though.
There was a problem hiding this comment.
Actually, I'm removing this. I don't think we are confident that including it would be valid given the license so better safe than sorry.
hannahbaumann
left a comment
There was a problem hiding this comment.
Thanks @jaclark5 , lgtm, just left some comments.
Would it be better to combine the two charge scripts (free solv and mnsol) into a single script?
|
|
||
| ## Charging Solutes / Solvents | ||
|
|
||
| Charges were generated using the [charge_mnsol.py](../../../data_generation/charge_freesolv.py) script using the [conda-lock_linux-64.yml](../../../data_generation/conda-lock_linux-64.yml) environment. |
There was a problem hiding this comment.
| Charges were generated using the [charge_mnsol.py](../../../data_generation/charge_freesolv.py) script using the [conda-lock_linux-64.yml](../../../data_generation/conda-lock_linux-64.yml) environment. | |
| Charges were generated using the [charge_mnsol.py](../../../data_generation/charge_mnsol.py) script using the [conda-lock_linux-64.yml](../../../data_generation/conda-lock_linux-64.yml) environment. |
| ## Notes | ||
|
|
||
| - Experimental uncertainties are set to 0.2 kcal/mol for all neutral entries, following the recommendation in the MNSol documentation. | ||
| - Solvent and solute SMILES are stored as canonical explicit-hydrogen SMILES generated by the OpenFF Toolkit. No newline at end of file |
There was a problem hiding this comment.
Maybe add a note that the exp data is not deposited here for licensing reasons, but can be generated on the fly using script X?
There was a problem hiding this comment.
@hannahbaumann would you mind if I "Resolve comments" that I think are handled or would you like to resolve the comments as the reviewer?
There was a problem hiding this comment.
I think it can be helpful for me to see them when I review again, but if it's better for you for tracking things, that's also fine!
There was a problem hiding this comment.
These changed happened when I ran pre-commit. Only formatting changes are present.
| The reference data was generated using the [generate_mnsol_data.py](../../../data_generation/generate_mnsol_data.py) script. Entries were excluded if: | ||
|
|
||
| - The solute or solvent name was not present in `mnsol-name-to-smiles.json` | ||
| - Molecules with ambiguous isomeric structure were excluded including: 'bromotoluene', 'chlorotoluene', 'dichloroethane', 'fluoroctane', 'trimethylbenzene' |
There was a problem hiding this comment.
Is the different intent intended?
There was a problem hiding this comment.
Nope thanks for catching!
| print(f"Skipping {key}: Charge is not zero") | ||
| continue | ||
| if solute_name == "water": | ||
| print(f"Skipping {key}: Water") |
There was a problem hiding this comment.
I think I'm just blanking on something we probably discussed, but why are we skipping water here again?
There was a problem hiding this comment.
This was an error, I was thinking I would skip writing it to the ligand.sdf but this has a larger effect, thanks for catching
There was a problem hiding this comment.
You're checking for solute_name != "water" and solvent_name != "water" in a few other places, is that still desired or would that also need to be removed?
There was a problem hiding this comment.
It is desired. Before I was accidentally skipping systems that had water as the solute. That is not desired.
Now I'm not including water in the ligands.sdf which is desired. This is because water is a special case that Pontibus handles separately to apply known water models like OPC3. Water won't have partial charges assigned and will cause our tests to fail for that reason, so we skip it and exclude it from the ligands_*.sdf.
hannahbaumann
left a comment
There was a problem hiding this comment.
Thanks @jaclark5 , I think this looks good, just the small comments. I think we should definitely run at least a subset of these, once we have the planning script ready, to validate this setup.
Close #61
Blocked by: