Conversation
Addresses Issue #1095.

Beyond testing for correct "fleet" size:
- testing added for random "fleet" selection in Task 5
- changed input to a copy of `starships` in case student uses the mutating `shuffle!`
- corrected testing for correct prefix in Task 2
- corrected testing for correct numerical range in Task 2

The testing for randomness of fleet selection is rather weak (it only requires one ship out of the given order), but I figure that's probably good enough. If there are other ideas, feel free to discuss.
This PR touches files which potentially affect the outcome of the tests of an exercise. This will cause all students' solutions to affected exercises to be re-tested. If this PR does not affect the result of the tests (or, for example, adds an edge case that is not worth rerunning all tests for), please add the following to the merge-commit message, which will stop students' tests from re-running. Please copy-paste to avoid typos. For more information, refer to the documentation. If you are unsure whether to add the message or not, please ping
I think these changes are big enough that we should rerun the ~180 students' solutions.
Update to reflect distribution of fixed points in random permutations (`Poisson(1)`).
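As a side note, the fixed-point count under discussion can be sketched as follows. This is an illustration in Python, not the exercise's actual Julia test code; `fleet` and the helper name are hypothetical stand-ins.

```python
import random

def count_fixed_points(original, shuffled):
    # A fixed point is a position where the shuffled sequence still agrees
    # with the original order.
    return sum(a == b for a, b in zip(original, shuffled))

fleet = [f"ship-{i}" for i in range(100)]            # stand-in for `starships`
shuffled_fleet = random.sample(fleet, k=len(fleet))  # non-mutating shuffle
# For a uniform random permutation the fixed-point count is approximately
# Poisson(1), so 4 or more fixed points occur only ~2% of the time.
print(count_fixed_points(fleet, shuffled_fleet))
```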
A further thought: it's possible to use the Chi-squared test for Task 3 if desired, since the test statistic is the only thing really needed if you hard-code the number of bins (and hence the degrees of freedom) and the level of alpha. No need to actually compute the full p-value.

Edit: I've looked more closely at the RosettaCode example link that you provided. Instead of that implementation, I would have expected to bin the results, as you did (into three bins, for example), then work with counts. This produces more reasonable numbers, and we could use the following test:

```julia
third = length(random_stardates) / 3
test_stat = sum((bin - third)^2 for bin in (low, med, high)) / third
test_stat < 6
```

where I've used the counts from the three bins you had originally. This avoids the full p-value calculation by working directly with the test statistic. Just for reference, the test statistic for the
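For illustration, the same check translates directly to Python. The helper name and the three counts below are hypothetical; with 3 bins there are 2 degrees of freedom, and the alpha = 0.05 critical value is about 5.991, which is where the `< 6` cutoff comes from.

```python
def chi_squared_stat(counts):
    # Pearson's chi-squared statistic against a uniform expectation:
    # sum of (observed - expected)^2 / expected over the bins.
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 for c in counts) / expected

# Hypothetical bin counts from 10,000 roughly uniform samples.
low, med, high = 3290, 3360, 3350
stat = chi_squared_stat((low, med, high))
print(stat < 6)  # True for counts this close to uniform
```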
I used to know a reasonable amount about statistics, but I seem to have forgotten most of it! That said, your suggestions look reasonable and I'll go with what you recommend. Porting this exercise to R is fairly high on the TODO list, so it would be good to have a working strategy.
I've moved to the Chi-squared test, but I'm a bit confused as to why the tests are failing. Should I try removing the comments? I really don't understand what else could have broken with the nightly release.

Edit: I'm trying different possible modifications in the two tests which failed. Hopefully I'll be able to pin down the issue.
It may just have been a random glitch? Or a bug in the nightly code? I find it hard to see what difference a comment makes! What are the implications of relaxing the test a bit? Maybe setting the threshold to 7 instead of 6 is not statistically rigorous, but ultimately the future of civilization doesn't depend on how we test a learning exercise?
It looks like it was just a glitch. I had removed the comment from one of the tests, left it in another, and it passed. I've put it all back to how it was now and added some inclusive limits on some of the conditions while I was there. That said, we can put it at 7 if you'd like; for alpha = 0.025 the threshold is 7.377.
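Incidentally, with 3 bins there are 2 degrees of freedom, and the chi-squared distribution with 2 degrees of freedom is an exponential with mean 2, so its critical values have a closed form. That makes the thresholds being discussed (the `< 6` cutoff for alpha = 0.05 and the 7.377 figure for alpha = 0.025) easy to verify. A quick sketch; the helper name is mine:

```python
import math

def chi2_critical_df2(alpha):
    # For 2 degrees of freedom, the chi-squared tail is P(X > x) = exp(-x/2),
    # so the critical value at level alpha is x = -2 * ln(alpha).
    return -2.0 * math.log(alpha)

print(chi2_critical_df2(0.05))   # ~5.991, hence the "< 6" test
print(chi2_critical_df2(0.025))  # ~7.378, the 7.377 quoted above (truncated)
```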
Good outcome!
If it now passes everything, I've no strong feelings. Just a general sense that this is a context where wrongly passing a student solution is less serious than wrongly failing.
I hope I'm getting my stats right. I'm in the process of learning a reasonable amount and then immediately forgetting it 😅 Let me go ahead and double-check I've been doing everything correctly before merging.
I'm already seeing a lot of failed solutions... Looks like the most common errors are either not shuffling properly or not using the correct length. However, I'm also seeing some which look to have failed but shouldn't have. I counted eight false negatives, but I might have over-counted based on how Exercism displays the results.
That's worrying! What are we going to do about it?
Well, I've gone through and made another count, and about 10% of the current solutions are failing but pass when I run them locally. I would suggest pushing the threshold for the Chi-squared test back up. I ran some basic tests and found that if the bins were skewed by (-75, -75, +150) from the expected count (3,333), the test would fail, so we can push the alpha down if you'd like. I haven't done all the statistics on the Poisson case, but there should be only about a 2% chance of producing 4 or more fixed points, and then at least four of those would have to find their way into the 1-20 elements out of 100 that we are asking for, so I find it highly unlikely that's the issue; still, we can push that limit up by one as well. In the end, I don't think it would be a huge problem for students, since there would not likely be more than one false negative for the odd individual, but it could be a pain for us in the future if the
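For what it's worth, the skew example can be checked directly: with 10,000 samples split into three bins the expected count is about 3,333 per bin, and offsets of (-75, -75, +150) give a statistic just over 10, which fails either threshold. A quick Python sketch; the helper name is mine:

```python
def chi_squared_stat(counts, expected):
    # Pearson's statistic: sum of (observed - expected)^2 / expected.
    return sum((c - expected) ** 2 for c in counts) / expected

expected = 10_000 / 3                                   # ~3,333.33 per bin
counts = (expected - 75, expected - 75, expected + 150)  # the skew above
stat = chi_squared_stat(counts, expected)
print(stat)  # 10.125 = (75^2 + 75^2 + 150^2) / 3333.33..., over both 6 and 7.377
```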
I (quite strongly) support relaxing the criteria.
Addresses Issue #1095.
Beyond testing for correct "fleet" size:
- testing added for random "fleet" selection in Task 5
- changed input to a copy of `starships` in case student uses the mutating `shuffle!`
- corrected testing for correct prefix in Task 2
- corrected testing for correct numerical range in Task 2

The testing for randomness of fleet selection is rather weak (it only requires one ship out of the given order), but I figure that's probably good enough. If there are other ideas, feel free to discuss.
Update: I've adjusted the testing for randomness in Task 5 to better reflect that the distribution of fixed points for random permutations is Poisson(1). I've set the limit of fixed points to three, since the probability of having 4 or more fixed points is around 2%, but we could be more generous by setting the limit to four. This test is aimed at the larger fleets, since it passes trivially for tests on fleets of under four starships. The question is still open for discussion if there is a better option.
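For reference, the ~2% figure comes from the Poisson(1) tail: P(X >= 4) = 1 - e^(-1) * (1/0! + 1/1! + 1/2! + 1/3!) ≈ 0.019. A quick check in Python; the helper is hypothetical:

```python
import math

def poisson_tail(k, lam=1.0):
    # P(X >= k) for X ~ Poisson(lam), via 1 minus the CDF up to k - 1.
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

print(poisson_tail(4))  # ~0.019, i.e. roughly a 2% chance of 4+ fixed points
```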