Update runtests.jl #1096

Merged: depial merged 5 commits into main from depial-patch-1 on Mar 7, 2026

Conversation

@depial (Contributor) commented Mar 7, 2026

Addresses Issue #1095.

Beyond testing for correct "fleet" size:

  • added testing for random "fleet" selection in Task 5
  • changed the input to a copy of `starships` in case a student uses the mutating `shuffle!` in Task 5
  • corrected the test for the correct prefix in Task 2
  • corrected the test for the correct numerical range in Task 2

The testing for randomness of fleet selection is rather weak (it only requires one ship to be out of the given order), but I figure that's probably good enough. If there are other ideas, feel free to discuss.

Update: I've adjusted the testing for randomness in Task 5 to better reflect that the distribution of fixed points for random permutations is Poisson(1). I've set the limit of fixed points to three, since the probability of having 4 or more fixed points is around 2%, but we could be more generous and set the limit to four. This test is aimed at the larger fleets, since it passes trivially for fleets of fewer than four starships. Whether there is a better option is still open to discussion.
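For illustration, here is a minimal sketch of the kind of fixed-point check described above. The function name `assemble_fleet` and the dummy `starships` list are stand-ins for illustration, not the exercise's actual names or tests:

using Test, Random

# Stand-ins for illustration only: the real exercise defines `starships` and the Task 5 function.
starships = ["Ship $i" for i in 1:100]
assemble_fleet(ships) = shuffle(ships)

fleet = assemble_fleet(copy(starships))   # pass a copy in case the solution uses the mutating shuffle!
# A fixed point is a ship that stays in its original position.
fixed_points = count(i -> fleet[i] == starships[i], eachindex(fleet))
# For a random permutation the number of fixed points is approximately Poisson(1),
# so four or more fixed points occurs only about 2% of the time.
@test fixed_points <= 3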

@github-actions (bot) commented Mar 7, 2026

This PR touches files which potentially affect the outcome of the tests of an exercise. This will cause all students' solutions to affected exercises to be re-tested.

If this PR does not affect the result of the tests (or, for example, adds an edge case that is not worth rerunning all tests for), please add the following to the merge-commit message, which will stop students' tests from re-running. Please copy-paste to avoid typos.

[no important files changed]

For more information, refer to the documentation. If you are unsure whether to add the message or not, please ping @exercism/maintainers-admin in a comment. Thank you!

@depial (Contributor, Author) commented Mar 7, 2026

I think these changes are big enough that we should rerun the ~180 students' solutions.

depial requested a review from colinleach on March 7, 2026 at 12:39
Update to reflect distribution of fixed points in random permutations (`Poisson(1)`).
@depial (Contributor, Author) commented Mar 7, 2026

A further thought: it's possible to use the chi-squared test for Task 3 if desired, since the test statistic is all that's really needed once the number of bins (and hence the degrees of freedom) and the alpha level are hard-coded. There's no need to actually compute the full p-value.

Edit: I've looked more closely at the RosettaCode example link that you provided in runtests.jl, and I don't understand where its numbers are coming from, for a number of reasons, though I'm not terribly well versed in chi-squared tests. In any case, from what I can tell, if you run that code on exemplar.jl with our current tests, you'll end up with pval = 0.0 every time.

Instead of that implementation, I would have expected to bin the results (into three bins, as you did, for example) and then work with counts. This produces more reasonable numbers, and we could use the following test:

third = length(random_stardates) / 3   # expected count per bin under a uniform distribution
test_stat = sum((bin - third)^2 for bin in (low, med, high)) / third
test_stat < 6   # 5.99 is the chi-squared critical value for 2 degrees of freedom at alpha = 0.05

where `low`, `med`, and `high` are the counts from the three bins you used originally. This avoids the full p-value calculation by working directly with the test statistic.

Just for reference, the test statistic for exemplar.jl is ~1.5, while a skewed distribution with all the points in the high bin gives 6666 (the test statistic for the RosettaCode implementation is ~20000).
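For anyone following along, a self-contained sketch of this binned chi-squared check might look like the following. The stardate range and bin edges here are illustrative, not the exercise's actual values:

using Test

# Simulated stand-in for the exercise's random stardates (illustrative range only).
random_stardates = [41000 + 1000 * rand() for _ in 1:10_000]

# Count the stardates falling into three equal-width bins.
low  = count(d -> d < 41333.3, random_stardates)
med  = count(d -> 41333.3 <= d < 41666.7, random_stardates)
high = count(d -> d >= 41666.7, random_stardates)

# Chi-squared test statistic against a uniform expectation of one third per bin.
third = length(random_stardates) / 3
test_stat = sum((bin - third)^2 for bin in (low, med, high)) / third

# 5.99 is the critical value for 2 degrees of freedom at alpha = 0.05.
@test test_stat < 6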

@colinleach (Contributor) commented:

I used to know a reasonable amount about statistics, but I seem to have forgotten most of it! That said, your suggestions look reasonable and I'll go with what you recommend.

Porting this exercise to R is fairly high on the TODO list, so it would be good to have a working strategy.

@depial (Contributor, Author) commented Mar 7, 2026

I've moved to the chi-squared test, but I'm a bit confused as to why Julia nightly is failing when the others pass. I tried to be informative and leave a comment for students, because the test itself gives no indication of what to aim for.

Should I try removing the comments? I really don't understand what else could have broken with the nightly release.

Edit: I'm trying different possible modifications in the two tests which failed. Hopefully I'll be able to pin down the issue.

@colinleach (Contributor) commented:

It may just have been a random glitch? Or a bug in the nightly code? I find it hard to see what difference a comment makes!

What are the implications of relaxing the test a bit? Maybe setting the threshold to 7 instead of 6 is not statistically rigorous, but ultimately the future of civilization doesn't depend on how we test a learning exercise?

@depial (Contributor, Author) commented Mar 7, 2026

It looks like it was just a glitch. I had removed the comment from one of the tests, left it in the other, and the run passed. I've put everything back to how it was and added some inclusive limits on some of the conditions while I was there.

That said, we can put it at 7 if you'd like; for alpha = 0.025 the threshold is 7.377.
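If it helps to sanity-check that number, the critical value can be pulled from Distributions.jl (a sketch, assuming three bins and therefore 2 degrees of freedom):

using Distributions

# 97.5th percentile of a chi-squared distribution with 2 degrees of freedom,
# i.e. the critical value for alpha = 0.025.
quantile(Chisq(2), 0.975)   # ≈ 7.378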

@colinleach (Contributor) commented:

> It looks like it was just a glitch.

Good outcome!

> we can put it at 7 if you'd like

If it now passes everything, I've no strong feelings. Just a general sense that this is a context where wrongly passing a student solution is less serious than wrongly failing.

@depial (Contributor, Author) commented Mar 7, 2026

> I used to know a reasonable amount about statistics, but I seem to have forgotten most of it!

I hope I'm getting my stats right. I'm in the process of learning a reasonable amount and then immediately forgetting it 😅

Let me go ahead and double-check that I've been doing everything correctly before merging.

depial merged commit 37bcdc7 into main on Mar 7, 2026 (5 checks passed)
depial deleted the depial-patch-1 branch on March 7, 2026 at 18:48
@depial (Contributor, Author) commented Mar 7, 2026

I'm already seeing a lot of failed solutions... It looks like the most common errors are either not shuffling properly or not using the correct length.

However, I'm also seeing some that look to have failed but shouldn't have... I counted eight false negatives, but I might have overcounted, given how Exercism displays the results.

@colinleach (Contributor) commented:

> I'm also seeing some which look to have failed, but shouldn't have

That's worrying! What are we going to do about it?

@depial (Contributor, Author) commented Mar 7, 2026

Well, I've gone through and made another count: about 10% of the current solutions are failing but pass when I run them locally. I would suggest pushing the threshold for the chi-squared test back up. I ran some basic tests and found that bins skewed by (-75, -75, +150) from the expected count (3,333) would fail, so we can push the alpha down if you'd like.

I haven't done all the statistics on the Poisson case, but there should be only about a 2% chance of producing 4 or more fixed points, and at least four of those would then have to find their way into the 1-20 elements out of 100 that we are asking for, so I find it highly unlikely that's the issue. We could push that limit up by one as well.
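For reference, the ~2% figure is just the upper tail of Poisson(1), which can be checked with a one-liner (a sketch using Distributions.jl):

using Distributions

# Probability of four or more fixed points when the count is Poisson(1)-distributed.
ccdf(Poisson(1), 3)   # P(X ≥ 4) ≈ 0.019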

In the end, I don't think it would be a huge problem for students, since the odd individual would be unlikely to hit more than one false negative, but it could be a pain for us in the future if exemplar.jl fails fairly regularly during CI checks.

@colinleach (Contributor) commented:

I (quite strongly) support relaxing the criteria.

