Skip to content

Add MNSol data#109

Merged
jaclark5 merged 29 commits intomainfrom
mnsol
Mar 5, 2026
Merged

Add MNSol data#109
jaclark5 merged 29 commits intomainfrom
mnsol

Conversation

@jaclark5
Copy link
Copy Markdown
Collaborator

@jaclark5 jaclark5 commented Feb 25, 2026

@jaclark5 jaclark5 self-assigned this Feb 25, 2026
@jaclark5 jaclark5 changed the base branch from main to update_freesolv February 27, 2026 18:49
@jaclark5 jaclark5 mentioned this pull request Feb 27, 2026
@jaclark5 jaclark5 marked this pull request as ready for review February 27, 2026 19:29
Comment thread openfe_benchmarks/data/data_generation/generate_mnsol_data.py Outdated
Comment thread openfe_benchmarks/data/data_generation/generate_mnsol_data.py Outdated
Comment thread openfe_benchmarks/data/data_generation/generate_mnsol_data.py Outdated
Comment thread openfe_benchmarks/data/data_generation/generate_mnsol_data.py Outdated
Comment thread openfe_benchmarks/data/data_generation/generate_mnsol_data.py
Comment thread openfe_benchmarks/data/data_generation/generate_mnsol_data.py Outdated
"solvent_inchikey": offmol_solute.to_inchikey(fixed_hydrogens=True),
"solvent_inchi": offmol_solute.to_inchi(fixed_hydrogens=True),
}
sys_data[key] = {
Copy link
Copy Markdown
Contributor

@hannahbaumann hannahbaumann Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm not understanding yet what this system_data.json file is needed for. Is this because we cannot store the exp data, to at least have the rest here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hannahbaumann Ya I thought about whether we "really needed it" because someone could just generate the experimental*.json file and use that. I guess there are two reasons to keep it:

  1. It provides the expected systems to run as of now, representing which ligands and systems we intend to be included in the dataset.
  2. It allows me to run testing and CI on the plan_asfe, but I can just use freesolv instead.

It is sort of a shame to include this file since it's almost a GB and I needed to introduce LFS because of it, but it felt like the correct "provenance thorough" thing to do. I'm open to excluding it though.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm removing this. I don't think we are confident that including it would be valid given the license so better safe than sorry.

Copy link
Copy Markdown
Contributor

@hannahbaumann hannahbaumann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jaclark5 , lgtm, just left some comments.
Would it be better to combine the two charge scripts (free solv and mnsol) into a single script?


## Charging Solutes / Solvents

Charges were generated using the [charge_mnsol.py](../../../data_generation/charge_freesolv.py) script using the [conda-lock_linux-64.yml](../../../data_generation/conda-lock_linux-64.yml) environment.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Charges were generated using the [charge_mnsol.py](../../../data_generation/charge_freesolv.py) script using the [conda-lock_linux-64.yml](../../../data_generation/conda-lock_linux-64.yml) environment.
Charges were generated using the [charge_mnsol.py](../../../data_generation/charge_mnsol.py) script using the [conda-lock_linux-64.yml](../../../data_generation/conda-lock_linux-64.yml) environment.

## Notes

- Experimental uncertainties are set to 0.2 kcal/mol for all neutral entries, following the recommendation in the MNSol documentation.
- Solvent and solute SMILES are stored as canonical explicit-hydrogen SMILES generated by the OpenFF Toolkit. No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a note that the exp data is not deposited here for licensing reasons, but can be generated on the fly using script X?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hannahbaumann would you mind if I "Resolve comments" that I think are handled or would you like to resolve the comments as the reviewer?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be helpful for me to see them when I review again, but if it's better for you for tracking things, that's also fine!

Comment thread openfe_benchmarks/data/data_generation/generate_mnsol_data.py Outdated
Base automatically changed from update_freesolv to main March 2, 2026 13:59
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changed happened when I ran pre-commit. Only formatting changes are present.

@jaclark5 jaclark5 requested a review from hannahbaumann March 4, 2026 15:52
The reference data was generated using the [generate_mnsol_data.py](../../../data_generation/generate_mnsol_data.py) script. Entries were excluded if:

- The solute or solvent name was not present in `mnsol-name-to-smiles.json`
- Molecules with ambiguous isomeric structure were excluded including: 'bromotoluene', 'chlorotoluene', 'dichloroethane', 'fluoroctane', 'trimethylbenzene'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the different intent intended?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope thanks for catching!

print(f"Skipping {key}: Charge is not zero")
continue
if solute_name == "water":
print(f"Skipping {key}: Water")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm just blanking on something we probably discussed, but why are we skipping water here again?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an error, I was thinking I would skip writing it to the ligand.sdf but this has a larger effect, thanks for catching

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're checking for solute_name != "water" and solvent_name != "water" in a few other places, is that still desired or would that also need to be removed?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is desired. Before I was accidentally skipping systems that had water as the solute. That is not desired.

Now I'm not including water in the ligands.sdf which is desired. This is because water is a special case that Pontibus handles separately to apply known water models like OPC3. Water won't have partial charges assigned and will cause our tests to fail for that reason, so we skip it and exclude it from the ligands_*.sdf.

Copy link
Copy Markdown
Contributor

@hannahbaumann hannahbaumann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jaclark5 , I think this looks good, just the small comments. I think we should definitely run at least a subset of these, once we have the planning script ready, to validate this setup.

@jaclark5 jaclark5 merged commit 59f994d into main Mar 5, 2026
6 checks passed
@jaclark5 jaclark5 deleted the mnsol branch March 5, 2026 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add SFE: Add MNSol to ref data

3 participants