Hi,
Thank you for the amazing work!
While experimenting with your code, we're seeing unstable training, and it persists across multiple runs. Here is an example of one of the rew_total graphs:

Is this behavior expected, or does it indicate an underlying problem? Is the maximum total reward reached here (around 350) comparable to what you obtained? Additionally, if you could share the graphs from one of your runs, it would help us track down the issue and understand the expected behavior.
Thanks!