{2023.06}[foss/2023a] PyTorch v2.1.2 w/ CUDA 12.1.1 #825
trz42 wants to merge 12 commits into EESSI:2023.06-software.eessi.io
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Build again after applying fix to find
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Also build for
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80
eb_hooks.py
Outdated
    if self.name == 'PyTorch' and self.version == '2.1.2':
        if 'cudaver' in self.cfg.template_values and self.cfg.template_values['cudaver'] == '12.1.1':
I don't see a reason to make this specific to a particular PyTorch version or CUDA version?
We only know that the failure happens in this specific case. If we apply it to other cases, we will not know whether it was necessary or not.
eb_hooks.py
Outdated
    _cupti_lib_dir = os.path.join(_eessi_software_path, 'software', 'CUDA', _cudaver, 'extras', 'CUPTI', 'lib64')
    print_msg("pre_configure_hook_pytorch_add_cupti_libdir: cupti_lib_dir: '%s'", _cupti_lib_dir)
    if _library_path:
        env.setvar('LIBRARY_PATH', ':'.join([_library_path, _cupti_lib_dir]))
This seems like a bug in our CUDA installation/module, no?
I'm fine with proceeding like this for now, even if we also fix it somewhere else this won't cause trouble, but there's probably a more general fix for this?
Right, we might find a better solution by changing the CUDA module, e.g., by adding the directory to LIBRARY_PATH through the module.
It could be a worthwhile effort to try.
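For reference, the path handling discussed in this thread can be sketched as a standalone snippet. This is only an illustration under assumptions, not the code from eb_hooks.py: the helper names `cupti_lib_dir` and `extend_library_path` are hypothetical, and the handling of an empty LIBRARY_PATH and of duplicate entries is an addition that the excerpt above does not show.

```python
import os


def cupti_lib_dir(software_root, cuda_version):
    """Build the CUPTI library directory below a CUDA installation.

    Mirrors the os.path.join() call in the hook excerpt; software_root
    stands in for the EESSI software path (hypothetical parameter name).
    """
    return os.path.join(software_root, 'software', 'CUDA', cuda_version,
                        'extras', 'CUPTI', 'lib64')


def extend_library_path(current, extra_dir):
    """Append extra_dir to a colon-separated search path.

    Assumption: an empty or unset path yields just extra_dir, and an
    entry that is already present is not added a second time.
    """
    entries = [p for p in current.split(':') if p] if current else []
    if extra_dir not in entries:
        entries.append(extra_dir)
    return ':'.join(entries)


if __name__ == '__main__':
    # '/path/to/stack' is a placeholder, not a real EESSI prefix.
    extra = cupti_lib_dir('/path/to/stack', '12.1.1')
    print(extend_library_path(os.environ.get('LIBRARY_PATH', ''), extra))
```

Whether the directory is set by a hook or exported by the CUDA module, the resulting LIBRARY_PATH entry would be the same; only the place where it is maintained differs.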
@boegel could it be that lib_path (at least parts of it) is missing in https://github.com/easybuilders/easybuild-easyblocks/blob/57c0eaed8dc29e223fe68a75f7bf195cca0c2d04/easybuild/easyblocks/c/cuda.py#L362 ?
A little before that line, lib_path is constructed as the list ['lib64', 'extras/CUPTI/lib64', 'nvvm/lib64'], but line 362 only uses ['lib64', 'stubs/lib64'].
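The gap described in this comment can be made explicit with a small comparison. This is a sketch only: both lists are copied from the comment above rather than read from the easyblock, and `missing_dirs` is a hypothetical helper name.

```python
def missing_dirs(constructed, used):
    """Return entries of `constructed` that do not appear in `used`, in order."""
    return [d for d in constructed if d not in used]


# Values as quoted in the comment; not taken from cuda.py itself.
constructed_lib_path = ['lib64', 'extras/CUPTI/lib64', 'nvvm/lib64']
used_at_line_362 = ['lib64', 'stubs/lib64']

missing = missing_dirs(constructed_lib_path, used_at_line_362)
print(missing)
```

Under these assumed values, the CUPTI directory is among the entries that never reach the list used at line 362, which would be consistent with the library not being found at link time.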
Try a different approach: rebuild the CUDA module so that it prepends the directory containing the libcupti library to LIBRARY_PATH, and do not use the hook from the previous builds.
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: help
bot: show_config
bot: show_config
bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80
Unable to download or merge changes between the source branch and the destination branch.
@trz42 Can you split up and retarget this PR?
Superseded by #973
Builds
Supersedes #718