zfreq(x) only has meaning when x > 0, and zfreq(0) tells us nothing new because the code makes zfreq(0) = 2^32 - pfreq(0). The average non-zero run length = e ≈ 2.718 and the average zero run length = e/(e-1) ≈ 1.582; both follow from the geometric run-length distribution, since the mean non-zero run is 1/P(zero count) = e and the mean zero run is 1/P(non-zero count) = 1/(1-1/e). The former is greater than the latter because there are more non-zero pair values than zero ones, while the number of runs is the same for both by definition (to within one at worst). I think no more distribution tests are needed and I'll put a summary of the scores together soon.
Thanks a lot, Evan. zfreq(0) is correct now. Looking just at zfreq, [6,2,3,9] and [6,2,11,6] do very well. I haven't checked the others.
[6 2 11 6] comes up clean. There are three very poor scores, but they all check out as borderline BCFN_FF cases. So, really, only a single 256 MB and the rest are 512 MB or better.
Here are all nine of the every-aperture score tables, and also the extended PR reports for the three borderline cases of [6 2 11 6].
Thanks, Evan. Have you dropped a couple of candidates recently? Is it the case that the nine consist of the seven with the best scores, plus [14,2,7,5] and [14,2,7,6]?
The key question: based on PractRand tests which do you think is the best?
There was plenty more where those came from. The only other candidate I'd been working on lately was the poorly scoring one, for comparison.
I've been spending a lot more time coding than pondering the results.
It would be useful to have an every-aperture table with all the corrections for borderline cases.
Ha! That won't be happening unless someone finds a way to tell PractRand to ease up on the sensitivity of the BCFN_FF tests.
You're welcome to hand edit the tables yourself.
It would be good if Chris (the PractRand author) could tell us how to modify the BCFN_FF tests. I could edit the tables, but the amended borderline scores are scattered about in different files and I think not all of them have been posted.
It's interesting that for eight of the nine [a,b,c,d] candidates b = 2, which is a left shift, whereas a, c & d are all left rotates. The exception is [13,5,10,10].
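For reference, here is a minimal sketch (not from the original posts) of where the [a,b,c,d] constants act in xoroshiro32++, assuming the usual two 16-bit-word xoroshiro engine and the standard ++ scrambler rotl(s0 + s1, d) + s0; the constants [6,2,11,6] are used purely as an example.

```c
#include <stdint.h>

/* Minimal xoroshiro32++ sketch: state is two 16-bit words.
   a, c and d are left rotates; b is a left shift, matching the observation above.
   [6,2,11,6] is just an example; any full-period set can be substituted. */
static uint16_t s0, s1;

static inline uint16_t rotl16(uint16_t x, int k) {
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

uint16_t xoroshiro32pp_next(void) {
    uint16_t result = (uint16_t)(rotl16((uint16_t)(s0 + s1), 6) + s0); /* d = 6 (rotate) */
    s1 ^= s0;
    s0 = (uint16_t)(rotl16(s0, 6) ^ s1 ^ (uint16_t)(s1 << 2));         /* a = 6 (rotate), b = 2 (shift) */
    s1 = rotl16(s1, 11);                                               /* c = 11 (rotate) */
    return result;
}
```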
The nine were hand picked from the following table (which I don't seem to have posted before). These are the auto-culled results from the most recent complete run over the full set of candidates:
EDIT: I note it took only 9 hours to run. That would suggest the culling threshold was set very high, i.e. any score (except for the direct 16+0 aperture) below 512 MB would have terminated the testing of that candidate.
We need some simplified scoring method to compare the every-aperture results, maybe mean and standard deviation for each sample size. Another way could be visually, using 16 by 16 "randomgrams".
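As a rough sketch of the first idea, and assuming the per-aperture PractRand scores have already been converted to numbers (e.g. log2 of the failure size in MB, which is not something the posts specify), the mean and standard deviation per sample size could be computed like this:

```c
#include <math.h>

/* Sketch: mean and standard deviation of one candidate's per-aperture scores at a
   given sample size. 'scores' is assumed to hold one numeric value per aperture,
   e.g. log2(failure size in MB). */
void score_stats(const double *scores, int n, double *mean, double *stddev) {
    double sum = 0.0, sumsq = 0.0;
    for (int i = 0; i < n; i++) {
        sum   += scores[i];
        sumsq += scores[i] * scores[i];
    }
    *mean   = sum / n;
    *stddev = sqrt(sumsq / n - (*mean) * (*mean));
}
```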
I wonder how long it would take to run an every-aperture test for all possible [a,b,c,d] candidates (84 * 16 = 1344), then stop each test if a 16-bit sample score is less than 512 MB? That should weed out most of the duds. Ideally, the BCFN_FF test would be amended first.
2000 hours (84 days) without any culling.
Yeah, I'll have a nosy at the PractRand sources. I'm not likely to solve it myself though.
This issue has probably already made that culled list above smaller than it should be. It's been present all along but I hadn't gone looking at the report files that closely until the oddball scores really stuck out.
Notes
pfreq(x) = output pairs that occur x times
zfreq(x) = zero runs of length x
If N-1 = full period (2^32-1 for the above) then:
(a) Expected pfreq(x) = 1/x! * N/e
(b) Sum of pfreq(x) = N
(c) Sum of x * pfreq(x) = N-1
(d) Expected zero runs = (1-1/e) * N/e
(e) Expected mean zero run = 1/(1-1/e)
(f) Expected zfreq(x) = (1-1/e)^2 * N/e^x
(g) Sum of zfreq(x) = zero runs
(h) Sum of x * zfreq(x) = pfreq(0)
(i) Sum of x * zfreq(x) / Sum of zfreq(x) = mean zero run
x >= 0 for pfreq(x), x > 0 for zfreq(x)
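For a quick numeric cross-check of these expected values, here is a minimal sketch (not part of the original post) that simply evaluates (a), (d), (e) and (f) for N = 2^32:

```c
#include <math.h>
#include <stdio.h>

/* Sketch: evaluate the expected-value formulas (a), (d), (e) and (f) above for N = 2^32.
   pfreq follows a Poisson(1) shape scaled by N; zero run lengths are geometric. */
int main(void) {
    const double N = 4294967296.0;   /* 2^32 */
    const double e = exp(1.0);

    double fact = 1.0;
    for (int x = 0; x <= 13; x++) {
        if (x > 0) fact *= x;
        printf("expected pfreq(%2d) = %.1f\n", x, N / (e * fact));       /* (a) */
    }

    printf("expected zero runs = %.1f\n", (1.0 - 1.0 / e) * N / e);      /* (d) */
    printf("expected mean run  = %.6f\n", 1.0 / (1.0 - 1.0 / e));        /* (e) */
    for (int x = 1; x <= 22; x++)
        printf("expected zfreq(%2d) = %.1f\n", x,
               pow(1.0 - 1.0 / e, 2.0) * N / pow(e, x));                 /* (f) */
    return 0;
}
```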
Comments
These frequency distribution tests are intended as an independent "sanity check" but doing well does not imply the same will happen in other, more rigorous randomness tests. However, these frequency tests can detect certain bad candidates and also act as a "tie-breaker" for good candidates, as is the case here. For example, assuming all other scores are equal, [6,2,5,5] could be eliminated on pair frequency (which is more important in my opinion than zero run frequency).
I've made a multi-case launching script too. It uses all the RAM and now creates a lot of smaller files. I mainly chose this approach because I was wary of many programs concurrently trying to append to the same files.
There is a small error in the above code. If the last byte of the 4GB pair array is zero, then the last zero run will not be written to the zfreq array. Also, it would be better if zfreq[0] held the number of zero runs, i.e. it is incremented whenever zfreq[tally] is incremented. I think the amended code could be:
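The original listing is not reproduced here; the following is only a hypothetical sketch of the two changes described above, with pfreq (the 4 GB per-pair count array), zfreq and tally as assumed names, and assuming zfreq has room for the longest run length.

```c
#include <stdint.h>

/* Hypothetical sketch of the amended zero-run tally; pfreq, zfreq and tally are
   assumed names. zfreq[0] now counts the zero runs, and a run that ends on the
   very last byte of the pair array is no longer dropped. */
void tally_zero_runs(const uint8_t *pfreq, uint64_t n, uint64_t *zfreq) {
    uint64_t tally = 0;                  /* length of the zero run currently open */
    for (uint64_t i = 0; i < n; i++) {
        if (pfreq[i] == 0) {
            tally++;                     /* extend the current zero run */
        } else if (tally) {
            zfreq[tally]++;              /* record the completed run ... */
            zfreq[0]++;                  /* ... and count it */
            tally = 0;
        }
    }
    if (tally) {                         /* fix: flush a run still open at the end */
        zfreq[tally]++;
        zfreq[0]++;
    }
}
```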
The results two posts up are remarkably good. |Actual-Expected|/Expected shows this best: the smaller the better, and some of the errors are tiny. Note the consistency across pfreq0-pfreq3. I deduced the zero run exponential distribution from the results; I can't prove it!
I'd like to see how bad candidates perform, e.g. xoroshiro32++ [14,2,7,8] or xoroshiro32+p [14,2,7] or xoroshiro32+ [14,2,7], getting progressively worse. Repeat frequency tests suggest that xoroshiro32++ [14,2,7,7] and [14,2,7,9] will be adversely affected by their proximity to [14,2,7,8].
We could run the frequency tests and use pfreq0 to cull candidates before any PractRand culling. For example, |Actual-Expected|/Expected less than 0.001 is achievable, thus the |Actual-Expected| threshold could be set to Expected/1000 = 1580030. In other words, we look for better than 0.999 accuracy (> 99.9%). Harsh, but fair. Record the candidates that pass, then try these with 16-bit sample PractRand tests with minimum threshold 512M. This process should be a lot faster than using PractRand only.
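A minimal sketch of that cull, assuming the measured pfreq0 count is already available (the function name and threshold handling here are illustrative, not from the posts):

```c
#include <stdint.h>
#include <math.h>

/* Sketch of the proposed pfreq0 cull: pass only if |Actual-Expected|/Expected < 0.001.
   Expected pfreq(0) = N/e for N = 2^32, i.e. about 1,580,030,169. */
int passes_pfreq0_cull(uint64_t actual_pfreq0) {
    const double expected = 4294967296.0 / 2.718281828459045;
    double rel_err = fabs((double)actual_pfreq0 - expected) / expected;
    return rel_err < 0.001;   /* better than 99.9% accuracy */
}
```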
1. A pfreq13+ value that is the sum of all output pair frequencies of 13 and higher.
2. A zfreq22+ value that is the sum of all zero run frequencies for run lengths of 22 and higher.
3. A mean run value for average zero run length, which is the ratio of two sums calculated using test data and therefore arguably the most meaningful zero run result. The relative rankings between different candidates for mean run are very similar to pfreq0-pfreq3, which suggests the latter could be used solely as "tie-breakers".
Both 1 and 2 are expected to be zero for a good candidate, but could be significantly more than zero for a bad candidate (conjecture in the absence of an actual test).
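A sketch of those three summary values, assuming the pfreq[] and zfreq[] tallies have already been built (the array bounds and names here are assumptions):

```c
#include <stdint.h>

/* Hypothetical sketch: pfreq13+, zfreq22+ and mean zero run from existing tallies.
   pmax/zmax are the array sizes; zfreq[0] (the run count, if used) is skipped. */
typedef struct { uint64_t pfreq13_plus, zfreq22_plus; double mean_run; } summary_t;

summary_t summarise(const uint64_t *pfreq, int pmax, const uint64_t *zfreq, int zmax) {
    summary_t s = {0, 0, 0.0};
    uint64_t runs = 0, zeros = 0;
    for (int x = 13; x < pmax; x++) s.pfreq13_plus += pfreq[x];    /* pair frequencies of 13+ */
    for (int x = 22; x < zmax; x++) s.zfreq22_plus += zfreq[x];    /* zero runs of length 22+ */
    for (int x = 1; x < zmax; x++) {       /* mean run = sum(x*zfreq(x)) / sum(zfreq(x)) */
        runs  += zfreq[x];
        zeros += (uint64_t)x * zfreq[x];
    }
    s.mean_run = runs ? (double)zeros / (double)runs : 0.0;
    return s;
}
```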
Comments
Evan, this full table with 256 PractRand scores is excellent. Do you have similar tables for all the 11 candidates you've been looking at recently?
It's very good work. It should be published and accessible.
I haven't posted the sources in that topic. The testing source code is still evolving in some ways. And Tony has some of his own.
There are a number of archives in various posts of this topic though.
Maybe one more, if possible. [14,2,7,8] should do poorly and would be a good comparison to the others.
I think we should keep using this thread for test results and leave the other topic for general info about the algorithm plus recommended constants.
Here are the processed results for the selected xoroshiro32++ [a,b,c,d] in the reports:
Pair frequency: Actual and Expected
Pair frequency: |Actual-Expected|/Expected
Zero run frequency: Actual and Expected
Zero run frequency: |Actual-Expected|/Expected
And a response, on the PCG author's site:
http://www.pcg-random.org/posts/on-vignas-pcg-critique.html
Do you have the full-period constants for any of xoroshiro36/38/40?
Many thanks, Evan. xoroshiro40 has relatively few full-period constants.
Here's the same run for Xorshift "xs-s20" but including the log file. I don't seem to have kept very many logs.
EDIT: Looking at the s15 candidate list, it's only 73 lines, which is shorter than the s16 list. That makes the s16 list oddly long at 84 lines.