Thanks, Evan. Are the results repeated exactly if you do a second lot of battery tests, continuing from where the first one left off?
It only continues between tests within the battery because the executable doesn't exit. A repeat run of the executable starts from a fresh generator state each time. So, yes, a repeat run produces exactly identical results.
Reversing the bit order produces fewer failures but in more categories (LinearComp added to SerialOver and MaxOft).
I've had it Crush gridding for the last few hours and it's notable that there is a sort of randomness to the types of failures. The p-values of these failures are nowhere near as pronounced as what PractRand fails on; PractRand would be issuing "unusual" and "suspicious" for most of the values I'm seeing. And PractRand does have lots of exactly that in pretty much all of its reports.
In other words, I'm thinking the TestU01 failure threshold is set way too sensitive.
I don't know what to suggest, apart from testing Chris's generator next. Is there any correlation between PractRand scores and the number of TestU01 failures for [14,2,7,5] with the various lsb's?
You dug up some xoroshiro shift constants for me, so I am following up on a statement I made in the PRNG group about an xoroshiro variant that I feel is very promising (and fast).
I finished all of the ground-level work, and more. Have a look if you get a chance and let me know what you think. https://gist.github.com/XoroshiroNOT
Hi ChrisR,
If I can butt in ... For speed, the original xoroshiro uses local s0, s1 variables. By referencing the global state as little as possible, this clears the way for the optimiser to use any tricks it likes on the intermediate steps. Also, I'd move the "const_odd" subtraction up to the s[2] assignment line.
The other thing is I'm not keen on "const" declarations for non-constants. I know it was done in the original xoroshiro but I simply don't know the justification for it. Lying to the optimiser is not wise.
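For illustration, here's a minimal sketch of the local-copy pattern being suggested, written as a Xoroshiro32(16)++ step using candidate [14 2 7 5]. The mapping of the candidate tuple onto the engine constants and output rotation, and the helper names, are assumptions for the sketch only, not code from the gist:

```c
#include <stdint.h>

static uint16_t s[2];   /* 32 bits of generator state */

static inline uint16_t rotl16(uint16_t x, int k) {
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* One Xoroshiro32(16)++ step, assuming [14 2 7 5] maps to engine
   constants a=14, b=2, c=7 and '++' output rotation d=5. */
uint16_t next16(void) {
    uint16_t s0 = s[0];                      /* globals read once into locals */
    uint16_t s1 = s[1];
    uint16_t result = (uint16_t)(rotl16((uint16_t)(s0 + s1), 5) + s0);

    s1 ^= s0;
    s[0] = (uint16_t)(rotl16(s0, 14) ^ s1 ^ (uint16_t)(s1 << 2));
    s[1] = rotl16(s1, 7);                    /* globals written once at the end */
    return result;
}
```

Between the single read and the single write of s[], the optimiser is free to keep everything in registers.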
Is there any correlation between PractRand scores and the number of TestU01 failures for [14,2,7,5] with the various lsb's?
I think I need to filter them better. There are far too many minor fails. That's something to read up on, to see if there are controls in the test suite I can adjust ...
I included 'const' only for the aesthetic of being consistent with the original xoroshiro. Using Compiler Explorer, I did not see any effect on the ASM emitted by GCC -O3 or MSVC /Ox, but I am happy to remove them.
As for the other recommendations, they seem logical (first thing I tried), but in my testing:
Seemingly interfered with saturation and intentional steering of the Intel CPU pipeline under various loads by undoing the latent load of the S[2] state from L1/L2 cache, etc.
Did not result in fewer global references emitted by GCC or MSVC (as the compilers are smart enough to see what I am trying to do, and yet stay out of my way when I need them to).
Resulted in ASM that penalized either GCC or MSVC without providing a clear performance benefit to the other.
Moving the const_odd to the subtract line in the 2-D code over-saturates the pipeline on my CPU (too many back-to-back similar parallel operations taxing the same micro-ops), but it does benefit the 1-D code because the s0 ^ s1 is deferred.
The end result is code nearly as fast as xoroshiro** or ++ when I tested on a Xeon W3690 using 3 different benchmarks (with 'inline' and under both single and multi-threaded loads).
However, if you or anyone can demonstrate that any proposed code changes actually result in more than 5% better performance over the current code on more modern CPUs under both MSVC and GCC, I'm happy to make changes that will improve code readability.
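For concreteness, a minimal sketch of the sort of single-threaded throughput benchmark being described; the generator name and iteration count are illustrative, and the three actual benchmarks used were not posted:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

extern uint16_t next16(void);        /* generator step under test (assumed name) */

int main(void) {
    const uint64_t N = 1ULL << 30;   /* outputs per timed run */
    uint64_t sum = 0;

    clock_t t0 = clock();
    for (uint64_t i = 0; i < N; i++)
        sum += next16();             /* accumulate so the calls can't be optimised away */
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("%.1f M outputs/s (checksum %llu)\n",
           (double)N / secs / 1e6, (unsigned long long)sum);
    return 0;
}
```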
No worries. You've obviously got some high-powered pipeline analysis tools at hand. A separate approach is to declare the state store as static. It probably won't make any difference though.
Dynamically allocating the state, for re-entrant use, might be something else to measure with those fancy tools. I noticed this is supported in the TestU01 suite.
My benchmark code has both 'static' state and 'inline', but it got wiped from the gist because I copied/pasted code from Compiler Explorer (https://godbolt.org/) which does not seem to support those... not sure why. I'll add those back in (presuming they won't cause anyone trouble).
Not sure about dynamic state allocation, so I'll have to look into it... Thanks!
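As an aside, TestU01 can also be driven directly from a generator function instead of a pipe; a minimal sketch, assuming a next32() routine that returns 32 random bits per call (the routine name is illustrative):

```c
#include <stdint.h>
#include "unif01.h"
#include "bbattery.h"

extern uint32_t next32(void);        /* generator under test (assumed name) */

static unsigned int gen_bits(void) {
    return (unsigned int) next32();
}

int main(void) {
    /* Wrap the raw function as a TestU01 generator object and run Crush on it. */
    unif01_Gen *gen = unif01_CreateExternGenBits("scro-rarns32", gen_bits);
    bbattery_Crush(gen);
    unif01_DeleteExternGenBits(gen);
    return 0;
}
```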
Gridding config for this generator (I've named it "scro-rarns") is test cases at all 32 sampling aperture offsets, by aperture sizes of 32, 28, 24, 20, 16, then even sizes down to 2. Just started size 16.
Offset 0 of the first four sizes has been clean, but I've got 31 of the first 128 cases showing the same random-looking pattern of Crush fails that occurs with Xoroshiro32++.
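For anyone following the notation: "size>>offset" here reads as a sampling window of "size" bits taken from the 32-bit output starting at bit "offset". A sketch of one plausible extraction, assuming the window wraps around the word when offset + size exceeds 32 (the wrap handling is an assumption, not confirmed above):

```c
#include <stdint.h>

/* Extract a sampling aperture of `size` bits starting at bit `offset`
   from a 32-bit output word, wrapping around from bit 31 back to bit 0. */
static inline uint32_t aperture(uint32_t x, unsigned size, unsigned offset) {
    uint32_t rotated = (x >> offset) | (x << ((32u - offset) & 31u));
    return (size >= 32u) ? rotated : (rotated & ((1u << size) - 1u));
}

/* Example: aperture 16>>8 keeps bits 8..23 of the word. */
```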
As a headline comparison, I'm interested in how XORO32 compares with Chris's (scro-rarns) generator, for full 32-bit output with no rotation (lsb is bit 0). The latter passes Crush but I'm not sure about [13,5,10,9] as the files show "bit-order reversed".
You've got a point about the bit-order reversal though; I shifted that function out of the piped build of the Crush testing to gain some speed. It is present in the Xoroshiro generator build but not in the Scro-rarns generator build. I guess Scro-rarns needs to be gridded again with the reversal code added ...
EDIT: I tell a lie. Bit-order reversal is there and enabled in the Scro-rarns generator build. The reason it isn't stated in the reports is that they are produced by the piped Crush test executable. So the reports only record what the Crush testing is doing, not what the generator is doing.
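The bit-order reversal under discussion is the usual full 32-bit reversal applied to each output word before it reaches the battery. A typical implementation, shown only to illustrate the operation (not the actual build's code):

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word: bit 0 <-> bit 31, bit 1 <-> bit 30, ... */
static inline uint32_t reverse32(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);   /* swap adjacent bits */
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);   /* swap bit pairs     */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);   /* swap nibbles       */
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);   /* swap bytes         */
    return (x << 16) | (x >> 16);                               /* swap half-words    */
}
```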
Ok, I just eyeballed the end of each of the 384 gridded report files for Scro-rarns32 to see what the worst Crush fail came to. Turns out there's nothing worse than an e-7 p-value.
There are definitely some eps's occurring with Xoroshiro. So, I guess, ignoring the reported fails, there is a clear measured difference.
Looking more closely at the Crush reports for the Xoroshiro grids, there is actually only one test with one setting that is producing eps p-values. The test is SerialOver with parameter t = 2. That one stands by itself; all other test fails seem to be in the erratic low-level ballpark.
BTW: That makes the Crush reports for aperture 16>>0 of candidates [13 5 8 10] and [7 2 14 10] clean reports.
Aperture 16>>0 for [14 2 7 5] seems to be good in most of my tests, but there is one arrangement producing e-12 that I'd class as VERY SUSPICIOUS (Practrand naming). It was my experimental candidate so, until I get back to retesting it, I'm not completely sure what the various arrangements were.
Aperture 16>>0 for [14 2 7 6] produces an eps in all the arrangements so it's out.
Hmm, this is good: there are no eps's for any aperture size 16 of candidate [3 2 6 9]. The worst is two e-11's:
========= Summary results of Crush =========
Version: TestU01 1.2.3
Generator: Xorshiro32(16)++, candidate [3 2 6 9]
- single iterated, bit-order reversed, sampling aperture 16>>8
Number of statistics: 144
Total CPU time: 00:23:24.75
The following tests gave p-values outside [0.001, 0.9990]:
(eps means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):
Test p-value
----------------------------------------------
1 SerialOver, t = 2 1 - 2.5e-11
8 CollisionOver, t = 8 3.0e-6
----------------------------------------------
All other tests were passed
========= Summary results of Crush =========
Version: TestU01 1.2.3
Generator: Xorshiro32(16)++, candidate [3 2 6 9]
- single iterated, bit-order reversed, sampling aperture 16>>14
Number of statistics: 144
Total CPU time: 00:24:32.19
The following tests gave p-values outside [0.001, 0.9990]:
(eps means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):
Test p-value
----------------------------------------------
1 SerialOver, t = 2 1 - 2.5e-11
----------------------------------------------
All other tests were passed
Worst fails for size 16 sampling aperture of [6 2 11 3] are:
Generator: Xorshiro32(16)++, candidate [6 2 11 3]
- single iterated, bit-order reversed, sampling aperture 16>>6
1 SerialOver, t = 2 1 - 3.6e-11
41 MaxOft, t = 5 1 - 8.7e-5
42 MaxOft, t = 10 1 - 1.7e-11
- single iterated, bit-order reversed, sampling aperture 16>>11
1 SerialOver, t = 2 4.0e-11
42 MaxOft, t = 10 0.9999
43 MaxOft, t = 20 0.9996
50 AppearanceSpacings, r = 20 2.1e-4
Note there is also a MaxOft in that. Which means that, although SerialOver is predominant, it is not the only test getting poor results with the Xoroshiro++ algorithm.
Oh, and e-12 for [14 2 7 5] 16>>0 is correct. And it's not the worst size 16 fail either. 6 of the 16 cases have worse fails. So, not the greatest outcome there.
Generator: Xorshiro32(16)++, candidate [14 2 7 5]
- single iterated, bit-order reversed, sampling aperture 16>>0
1 SerialOver, t = 2 1 - 1.3e-12
41 MaxOft, t = 5 0.9998
71 LinearComp, r = 0 0.9991
- single iterated, bit-order reversed, sampling aperture 16>>1
1 SerialOver, t = 2 1 - eps1
- single iterated, bit-order reversed, sampling aperture 16>>2
1 SerialOver, t = 2 1 - eps1
41 MaxOft, t = 5 0.9996
67 RandomWalk1 R (L = 1000) 1.8e-4
- single iterated, bit-order reversed, sampling aperture 16>>7
1 SerialOver, t = 2 7.0e-15
42 MaxOft, t = 10 1 - 1.9e-5
44 MaxOft, t = 30 1 - 3.6e-5
- single iterated, bit-order reversed, sampling aperture 16>>9
1 SerialOver, t = 2 1 - 3.5e-13
- single iterated, bit-order reversed, sampling aperture 16>>11
1 SerialOver, t = 2 1 - eps1
- single iterated, bit-order reversed, sampling aperture 16>>13
1 SerialOver, t = 2 1 - eps1
41 MaxOft, t = 5 1 - 7.6e-5
42 MaxOft, t = 10 0.9997
Started some BigCrush testing of scro-rarns32. 100% hard fails on the Gap test for the first 8 cases. I presume Gap is the test that most easily detects the period repeating. Here's aperture 32>>0 with bit-order reversed:
========= Summary results of BigCrush =========
Version: TestU01 1.2.3
Generator: stdin32
Number of statistics: 160
Total CPU time: 03:03:24.71
The following tests gave p-values outside [0.001, 0.9990]:
(eps means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):
Test p-value
----------------------------------------------
33 CouponCollector, r = 27 0.9998
34 Gap, r = 0 5.4e-15
35 Gap, r = 25 eps
36 Gap, r = 0 eps
37 Gap, r = 20 eps
65 SumCollector 6.3e-5
----------------------------------------------
All other tests were passed
Test 34 fluctuates between e-XX and eps but other than that all 8 reports are much the same.
Thanks for all the results, Evan. I've emailed all the Crush reports I have from Seba. In his BigCrush tests, [14,2,7,5] and [14,2,7,6] failed on SerialOver, Gap, MaxOft and SumCollector. No SerialOver or MaxOft failures above with scro-rarns32, but there is a CouponCollector - is the normal bit order better, worse or the same?
When comparing [14,2,7,5] and [13,5,10,9], do all the apertures not listed pass all tests? And are they bit-reversed or do you always print that?
I've emailed all the Crush reports I have from Seba.
Looking at those I just noticed the generator state is emitted in his Crush reports. My reports don't have that because I'm not using TestU01's dynamic allocation facilities. That tells me I'll probably never get identical reports unless I change my integration with TestU01.
In his BigCrush tests [14,2,7,5] and [14,2,7,6] failed on SerialOver, Gap, MaxOft and SumCollector. No SerialOver and MaxOft failures above with scro-rarns32 but there is a CouponCollector - is the normal bit order better or worse.
From general inspection of mid-Crush reports, I'd say the reversed bit order seems to give slightly more severe fails.
When comparing [14,2,7,5] and [13,5,10,9], did all the apertures not listed pass all tests? And are they bit-reversed or do you always print that?
Well, "pass" is relative in my understanding these days. I've started treating anything without an "eps/eps1" as a pass. Which is similar e value to Practrand pass/fail threshold.
Put it this way, they didn't have worse fails. Very few reports are actually "All tests were passed".
For Xoroshiro Crush testing, the config is emitted in the reports. Bit-order reversal is listed for each case. For Scro-rarns32, the config is not in the report text.
In both situations though, the run scripts encode the config into the report names so I know which is which.
BTW: That makes the Crush reports for aperture 16>>0 of candidates [13 5 8 10] and [7 2 14 10] clean reports.
Aperture 16>>0 for [14 2 7 5] seems to be good in most of my tests, but there is one arrangement producing e-12 that I'd class as VERY SUSPICIOUS (Practrand naming). It was my experimental candidate so, until I get back to retesting it, I'm not completely sure what the various arrangements were.
Aperture 16>>0 for [14 2 7 6] produces an eps in all the arrangements so it's out.
Long holiday away from PRNGs now over.
Evan, which of [13,5,10,9] and [13,5,8,10] are better overall in PractRand and Crush tests?
[13 5 8 10] seems the better; it was already visible in its good PractRand scores before we used the distribution scores. I haven't done any exhaustive comparing of the Crushes. You've got the reports.
Comments
In other words, I'm thinking TestU01 failure threshold is set way too sensitive.
I guess part of the problem is the fixed length test ...
The other thing is I'm not keen on "const" declarations for non-constants. I know it was done in the original xoroshiro but I simply don't know the justification for it. Lying to the optimiser is not wise.
Consider it done.
Offset 0 of the first four sizes has been clean but I've got 31 of first 128 cases with the same looking randomness in the Crush fails that is occurring with Xoroshiro32++.
Attached are 160 Crush reports.
As a headline comparison, I'm interested in how XORO32 compares with Chris's (scro-rarns) generator, for full 32-bit output with no rotation (lsb is bit 0). The latter passes Crush but I'm not sure about [13,5,10,9] as the files show "bit-order reversed".
I wonder whether scro-rarns passes BigCrush.
My understanding is it's not possible to pass BigCrush with a 32-bit state size: BigCrush draws roughly 2^38 values, so a generator whose period is at most 2^32 cycles dozens of times within the battery, which is exactly the sort of thing the Gap failures above are picking up.