
Resume from ckpt #135

Draft
kevssim wants to merge 22 commits into modelscope:main from kevssim:resume_from_ckpt

Conversation

Collaborator

@kevssim kevssim commented Mar 31, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information that belongs to this PR.

Experiment results

Paste your experiment results here (if needed).

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a comprehensive "Strict Resume" feature for Transformers models, enabling the restoration of full training state including optimizer, scheduler, scaler, RNG states, and data progress. Key changes involve implementing load_training_state and read_training_progress across the model, server, and client layers, alongside dataloader enhancements to support sample-level skipping for map-style datasets. Feedback highlights several critical improvements: ensuring deterministic RNG in distributed settings by avoiding unseeded random states, addressing the deprecated use of StopIteration in generators, improving security by using weights_only=True during checkpoint loading, and removing an accidental BOM character in the client generator. Additionally, a more robust approach for re-initializing the dataloader is suggested to avoid modifying private PyTorch attributes.
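The full-state restore summarized above can be sketched without the repo's actual API. This is a hypothetical minimal sketch: `save_training_state` / `load_training_state` are illustrative names (the PR's real `load_training_state` signature is not shown here), and Python's `random` module stands in for the torch/numpy RNG sources the PR also captures:

```python
import random

# Hypothetical sketch of the "strict resume" idea: capture every source of
# nondeterminism alongside the training step, then restore it exactly.
def save_training_state(step):
    return {
        'step': step,
        'python_rng': random.getstate(),
        # The real code would also store torch.get_rng_state(), the numpy
        # state, optimizer/scheduler/scaler state_dicts, and data progress.
    }

def load_training_state(state):
    random.setstate(state['python_rng'])
    return state['step']

random.seed(0)
state = save_training_state(step=100)
a = random.random()
load_training_state(state)      # rewind the RNG to the checkpointed state
b = random.random()
assert a == b                   # the resumed run reproduces the same draw
```

Restoring RNG state is what makes the resume "strict": without it, data shuffling and dropout diverge from the original run even if the optimizer state matches.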

if emitted >= target_total:
    return

for idx in np.random.RandomState().permutation(len(self.dataset)).tolist():

high

Using np.random.RandomState() without a seed creates a new, non-deterministic random state. In a distributed training environment, this can lead to different ranks generating different permutations during the retry/backfill phase, causing data divergence across ranks and potentially breaking the training process. It is recommended to use a seeded random state or the global np.random state if it has been properly seeded during initialization.
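One way to apply this recommendation is to derive the permutation seed from a fixed base seed plus the current epoch, so every rank computes the same order while still reshuffling between epochs. A minimal sketch, with `base_seed` and `epoch` as illustrative parameter names not taken from the PR:

```python
import numpy as np

# Hypothetical fix: a seeded RandomState makes the permutation a pure
# function of (base_seed, epoch), identical on every rank.
def deterministic_permutation(n, base_seed=42, epoch=0):
    rng = np.random.RandomState(base_seed + epoch)
    return rng.permutation(n).tolist()

# Two "ranks" computing independently agree on the order:
assert deterministic_permutation(8, epoch=3) == deterministic_permutation(8, epoch=3)
# Different epochs use a different seed, so the order still varies over time.
```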

Comment on lines +137 to +139
self.dataloader.__initialized = False
self._rebuild_sampler_stack()
self.dataloader.__initialized = True

medium

Accessing and modifying the private attribute __initialized of torch.utils.data.DataLoader is brittle and relies on internal implementation details of PyTorch that could change. A safer approach to update the sampler stack after the dataloader has been created is to simply re-instantiate the underlying self.dataloader using the stored self.dataloader_params.

Suggested change
- self.dataloader.__initialized = False
- self._rebuild_sampler_stack()
- self.dataloader.__initialized = True
+ if self.dataloader is not None:
+     self.dataloader = None
+ self._lazy_init_dataloader()
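The "re-instantiate instead of poking private attributes" pattern can be illustrated without a torch dependency. In this sketch, `FakeDataLoader` is a stand-in for `torch.utils.data.DataLoader` that mimics its post-initialization mutation guard, and `Wrapper`/`rebuild` are hypothetical names, not the PR's actual classes:

```python
# Stand-in for DataLoader: refuses attribute changes after construction,
# which is the behavior the private __initialized flag enforces in torch.
class FakeDataLoader:
    def __init__(self, dataset, batch_size=1):
        self.dataset = dataset
        self.batch_size = batch_size
        self._initialized = True

    def __setattr__(self, name, value):
        if getattr(self, '_initialized', False) and name != '_initialized':
            raise ValueError(f'{name} cannot be set after initialization')
        super().__setattr__(name, value)

class Wrapper:
    def __init__(self, dataset, **params):
        # Keep the construction params around so the loader can be rebuilt.
        self.dataloader_params = dict(params, dataset=dataset)
        self.dataloader = None
        self._lazy_init_dataloader()

    def _lazy_init_dataloader(self):
        if self.dataloader is None:
            self.dataloader = FakeDataLoader(**self.dataloader_params)

    def rebuild(self):
        # Safe alternative to toggling the private flag: discard and recreate.
        self.dataloader = None
        self._lazy_init_dataloader()

w = Wrapper(list(range(4)), batch_size=2)
old = w.dataloader
w.rebuild()
assert w.dataloader is not old  # a fresh, fully initialized loader
```

The rebuild costs one extra object construction but never touches name-mangled internals, so it survives PyTorch version changes.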

@@ -39,12 +45,11 @@ def __iter__(self):
else:
    raise StopIteration(f'Max retries exceeded: {self.max_retries}, no valid data found.')

medium

Manually raising StopIteration inside a generator (a function using yield) is deprecated since PEP 479 and will be converted into a RuntimeError in Python 3.7+. Since this represents an error condition (max retries exceeded), it is better to raise a RuntimeError or ValueError directly to provide a clear error message to the user.

Suggested change
- raise StopIteration(f'Max retries exceeded: {self.max_retries}, no valid data found.')
+ raise RuntimeError(f'Max retries exceeded: {self.max_retries}, no valid data found.')
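The PEP 479 behavior the reviewer describes is easy to demonstrate: a `StopIteration` raised inside a generator body surfaces to the caller as a `RuntimeError` on Python 3.7+, so the error is never catchable under its intended type. The generator below is a toy stand-in, not the PR's actual `__iter__`:

```python
# A StopIteration raised in a generator body is converted by the
# interpreter into RuntimeError (PEP 479, default since Python 3.7).
def retry_loop():
    yield 'sample'
    raise StopIteration('Max retries exceeded')

g = retry_loop()
assert next(g) == 'sample'
try:
    next(g)
except RuntimeError:
    pass  # surfaces as RuntimeError, not StopIteration

# Raising RuntimeError directly, as suggested, keeps the error type honest:
def retry_loop_fixed():
    yield 'sample'
    raise RuntimeError('Max retries exceeded')
```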

if hasattr(self.strategy, 'load_optimizer_checkpoint'):
    self.strategy.load_optimizer_checkpoint(self.model, optimizer_config.optimizer, optimizer_path)
else:
    state_dict = torch.load(optimizer_path, map_location='cpu', weights_only=False)

medium

Using weights_only=False when loading checkpoints via torch.load can be a security risk if the checkpoint file is untrusted, as it allows the execution of arbitrary code during unpickling. Since these are standard state dictionaries (optimizer, scheduler, RNG), they should be compatible with weights_only=True in modern PyTorch versions. This applies to lines 998, 1007, and 1029 as well.

Suggested change
- state_dict = torch.load(optimizer_path, map_location='cpu', weights_only=False)
+ state_dict = torch.load(optimizer_path, map_location='cpu', weights_only=True)
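The unpickling risk behind this recommendation can be shown with plain `pickle`, which is what `torch.load` uses under the hood when `weights_only=False`. The `MaliciousCheckpoint` class here is a toy illustration; a real attack could invoke any callable:

```python
import pickle

# Why weights_only=False is risky: plain unpickling calls whatever a
# crafted object's __reduce__ returns. This toy "checkpoint" merely runs
# print() during load, but the payload could be any callable.
class MaliciousCheckpoint:
    def __reduce__(self):
        return (print, ('arbitrary code executed during unpickling',))

blob = pickle.dumps(MaliciousCheckpoint())
result = pickle.loads(blob)  # executes print(...) instead of restoring data
assert result is None        # print's return value; no checkpoint came back

# torch.load(..., weights_only=True) defends against exactly this by using
# a restricted unpickler that only admits tensors and plain containers.
```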
