To complete the results, the 1 GB tests are:
One case on 8-cores with -multithread option: 59 secs
Eight parallel cases on 8-cores with -multithread option: 80 secs
One case without -multithread option: 99 secs
Eight parallel cases on 8-cores without -multithread option: 97 secs
The latter two are the same within the error margin. That's good to know.
I presume you got it second hand cheaply. It's performing really well!
I bought the PC used in 2015 for $400 US. My friend got the dual CPU X5680 at the same time for $525.
My dual CPU workstation is running E5-2690s, which I picked up for $600 last year.
It came with 64GB RAM (whereas my W3690 only has 12GB), but I don't have any jobs that require that much right now.
After looking over the data posted, I see that PractRand is reporting that s0 * 5 is up to 70% more random than the original s0 alone.
I was hoping for something closer to 100%.
I believe the pair frequency might be weighted too heavily in the ranking (perhaps culling candidates in the 20-60 overall rank range), as anything below about 500 for pfreq, zfreq and nzfreq seems plausible.
Regardless of the above, I am still curious whether my streaming mod with a single M, applied to a single candidate (e.g. 4,8,5,7, which is great for s0 * 5), will drive pfreq, zfreq and nzfreq all well below 100 simultaneously.
I am reasonably sure from my own testing that it will not substantially affect the PractRand scores on any aperture, but it will cause the pfreq, zfreq and nzfreq to wander (presumably, with some M, all to very near 0 total).
The actual mod I am currently using would look like this, if refactored slightly, to affect only the last line of code with C=5:
s[1] = rotl(s1 ^ M, C);
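For context, here is a minimal C sketch (mine, not the exact code under test) of how that one-line mod slots into a full xoroshiro32++ step. The 4,8,5,7 constants are the candidate discussed above (13,5,10,9 is the current XORO32 set), M=0 reproduces the unmodified generator, and the rotl16 helper name is arbitrary:

#include <stdint.h>

enum { A = 4, B = 8, C = 5, D = 7 };          /* candidate [A,B,C,D] discussed above */
static const uint16_t M = 0;                  /* stream constant; 0 = current behaviour */

static uint16_t rotl16(uint16_t x, int k) { return (uint16_t)((x << k) | (x >> (16 - k))); }

uint16_t next(uint16_t s[2])
{
    uint16_t s0 = s[0], s1 = s[1];
    uint16_t result = (uint16_t)(rotl16((uint16_t)(s0 + s1), D) + s0);  /* ++ scrambler; the s0*5 variant would add s0*5 here instead */
    s1 ^= s0;                                                           /* engine */
    s[0] = (uint16_t)(rotl16(s0, A) ^ s1 ^ (uint16_t)(s1 << B));
    s[1] = rotl16((uint16_t)(s1 ^ M), C);                               /* the streaming mod: M xor'd into the last line only */
    return result;
}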
Of course the missing value would swing to a non-zero one, which requires an E subtraction (or subtraction from E, if that would be better) in the scrambler, performed prior to analyzing for pfreq, zfreq and nzfreq.
However, running a few dozen random M without applying E would demonstrate the viability of the concept if a few of those M would rank at #1 above all previous.
Hopefully that all makes sense...
I could certainly grid score a lot more candidates from a longer list. With an average of 10 seconds per score, and assuming an average of 1 GB per candidate, a 3x16=48-score grid is 8 minutes per candidate, so 180 candidates in 24 hours.
EDIT: It would likely be somewhat faster than that (7.5 seconds per score) because the good averages are barely hitting an exponent of 29 (512 MB).
I am still curious whether my streaming mod with a single M, applied to a single candidate (e.g. 4,8,5,7, which is great for s0 * 5), will drive pfreq, zfreq and nzfreq all well below 100 simultaneously.
Ah, so, try lots of M's with everything else set. Correct?
The current code you are testing has M=0, i.e. M absent. However, when M<>0 you get a completely new decorrelated stream that, from a PractRand standpoint, is fundamentally identical, but the pfreq, zfreq and nzfreq will likely not be.
So if you try several random values of M with the above code mod patched in, see how these statistics from Tony are affected when redone for each M:
4, 8, 5, 7, 1, 47, 91, 10, 5, 192, 457
I suspect that you will quickly find an M that is superior, and if that is the case, then we could worry about repeating with an E present (which would change the stats once again toward perhaps a different M).
The end goal would be to force 4,8,5,7,E,M to be perfect, e.g. rank 1,1,1 for pf, zf and nzf (assuming that 4,8,5,7 is the best target to do this).
I am still curious whether my streaming mod with a single M, applied to a single candidate (e.g. 4,8,5,7, which is great for s0 * 5), will drive pfreq, zfreq and nzfreq all well below 100 simultaneously.
Ah, so, try lots of M's with everything else set. Correct?
Yes, but initially no E, since it is too much trouble to calculate it for a proof of concept run (though it would be required later and would again change the choice of M, perhaps by exhaustive search).
Hmm, including both the x5 and the +E makes it four adders in the summing, which makes the scrambler three deep again.
Depth one allows a single adder summing: Xoroshiro+.
Depth two allows three-adder summing: Xoroshiro++, Xoroshiro+e+, Xoroshiro+x+.
Depth three would allow up to seven adders! Hardware speed limitations would need to be looked at, but I presume this does open up a lot more options for more complex scramblers.
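To make those adder counts concrete: a balanced summing tree of depth d combines up to 2^d terms with 2^d - 1 adders. A trivial sketch with arbitrary 16-bit terms t0..t7:

uint16_t sum2 = (uint16_t)(t0 + t1);                                            /* depth 1: one adder, two terms */
uint16_t sum4 = (uint16_t)((t0 + t1) + (t2 + t3));                              /* depth 2: three adders, four terms */
uint16_t sum8 = (uint16_t)(((t0 + t1) + (t2 + t3)) + ((t4 + t5) + (t6 + t7)));  /* depth 3: seven adders, eight terms */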
So, I'm implying that's ultimately a non-option. EDIT: Either do more to really use depth three or confine to depth two only.
I'm pretty certain the only reason we got away with depth two was because the register size is only 16 bit. I doubt that leeway will extend to three levels.
The regular way to go more complex is to take more clocks over it. It sounds like Chip was saying it's a pain to make timing exceptions for lots of multi-clock paths, so he will usually go with multi-stage instead, which requires intermediate staging registers to be added.
Agreed**. The only possible work-around I can see for justifying the use of M is:
1. Drop E from the code, which simplifies the search for M.
2. If no M values result in pfreq/zfreq/nzfreq simultaneously significantly closer to zero, then abort the idea of using M completely and fall back to original path of considering s0 * 5.
3. If a superior M<>0 is found, calculate E and document it.
4. If the end-user magically* requires that the value that is short by one output is 0, then they can subtract or xor the documented value of E with both halves of XORO32 in their own code (a sketch follows after this post).
*'Magically' implies that it would be rare that a given application cares which value is short (only 65535 of that value in the full period), and those few that do care (if they exist?) might likely be just as unhappy about a missing zero. Therefore, the adaptive approach would be to write application code (that cares about such things) to either throw out zeros, or to xor a counter with the output and increment the counter once every full period so that all values are represented equally. In the latter case, it does not matter which value is short by one.
**Edit: Thinking about the 'non-option' of actually coding E a little deeper... since xor also works, inverting the set bits of E on the output read of XORO32 would have the same effect as subtraction/xor in the code... just requires an average of about 16 inverters (one for each set bit of E * 2 words) on the 32-bits of output stage... thus virtually tiny real estate compared to another adder and no clocking required. I hope that makes sense.
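As a concrete sketch of the end-user correction from point 4 above (the constant name is mine and 0xAAAA is only a placeholder; the real value would be the documented E for the chosen ABCM):

#define E_CORR 0xAAAAu                               /* placeholder for the documented E */
uint32_t r = xoro32_result;                          /* 32-bit XORO32 result, low word = first output */
r ^= ((uint32_t)E_CORR << 16) | E_CORR;              /* xor E with both 16-bit halves; per-half subtraction works too */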
**Edit: Thinking about the 'non-option' of actually coding E a little deeper... since xor also works, inverting the set bits of E on the output read of XORO32 would have the same effect as subtraction/xor in the code...
True, NOTs are the fastest gate type. I think I understand this idea now too, ie: It's a completely different value of E but serves the same purpose as the subtracting version. I just have to adapt the FindECorr() function to match.
E is the same value whether you subtract or xor it:
For example, if the short output is 0xAAAA, then E=0xAAAA and thus: If Output = E then (Output - E) = (Output ^ E) = 0, and if Output <> E then 'Don't Care'.
The PractRand results will not be impacted by the choice of subtraction/xor, nor should the pfreq, but zfreq and nzfreq would be. Thus xor E would ultimately affect the specific choice of M.
A few notes:
1. Refactoring the M into the last line of code might have a (hopefully minor) consequence, as the first pass of s1 to the output value and engine would not have M applied. I need to study this more to see what the actual consequence is. I avoided any potential issue in my own code by performing
uint16_t s1 = s[1] ^ M;
up front. I might have to refactor a test down to 16-bit state to study this. Edit: Non-issue... s[1] = rotl(s1 ^ M, C) works perfectly. I just couldn't make sense of it in my head, so had to run full simulations to ensure everything was correct.
2. I noticed that the original xoroshiro32pp full period has exactly two occurrences where certain 16-bit values repeat 3 times in a row (i.e. two triplets). Statistically this should only happen once, but either zero or twice is plausible. What is interesting is that M can be specifically chosen to set the occurrence of triplets to zero (or one), if desired. However, since two successive 16-bit outputs are combined into one XORO32 result, it would not be obvious that the 1st or the 3rd 16-bit output of such a triplet is present in the adjacent 32-bit output.
Let us recap the big picture anyway, just for posterity:
1. s0 * 5 would net approximately 70% better randomness, perhaps more with some other untested ABCD. Seemingly proven.
2. s0 * 5 might fit in the pipeline (barely). Possibly sound.
3. Use of M xor'd in the final stage will fit in the pipeline. Seems sound.
4. Use of M with near 8 bits set, will erase all traces of sparse set-bit state recovery effect on the output. Has been demonstrated.
5. Specific choices of M should manipulate pfreq, zfreq and nzfreq to superior target goals without compromising PractRand performance. Current evidence supports this, but it remains to be fully demonstrated.
6. M affects triplet generation and could be used to cancel it entirely, if desired (i.e. xoroshiro32pp 13,5,10,9 has two sets of triplets). Has been demonstrated.
7. E could possibly be factored into the final output using Not gates to assign the missing output value back to zero, with minimal impact to design efficiency. Logically sound.
8. The use of E would affect zfreq and nzfreq, but not PractRand results, so would be required for final tuning. Logically sound with freqs, but needs to be formally demonstrated. Has been demonstrated with PractRand.
9. Ultimately, the specific choice of M would be used to calculate/discover the bad-seed-value to xor mask against the normal non-zero seed (or as I do in 128-bit version: seed, pump, xor bit 0 with a 1 if new state = old state). Presumably implementable.
2. s0 * 5 might fit in the pipeline (barely). Possibly sound.
3. Use of M xor'd in the final stage will fit in the pipeline. Seems sound.
I'm happy with this. 16-bit adder depth of two fits easy enough in a single clock, and what little the NOTing of E would add doesn't really count. The engine doesn't have any summing so fits even easier than that, and M can be done as NOTed state output the same as E is NOTed result output.
5. Specific choices of M should manipulate pfreq, zfreq and nzfreq to superior target goals without compromising PractRand performance. Current evidence supports this, but it remains to be fully demonstrated.
What approach am I to take for this? Do I need to be running exhaustive, every M, distribution runs on preferred candidates?
9. Ultimately, the specific choice of M would be used to calculate/discover the bad-seed-value to xor mask against the normal non-zero seed (or as I do in 128-bit version: seed, pump, xor bit 0 with a 1 if new state = old state). Presumably implementable.
Evan and Chris, I think it would be useful if you were able to calculate Chi-Square sums so that you could rank the pair/zero/non-zero frequency distributions yourselves, without needing me.
The equations for the expected distributions are
pfreq(x) = 1/x! * N * 1/e
zfreq(x) = (1-1/e)^2 * N * (1/e)^x
nzfreq(x) = (1/e)^2 * N * (1-1/e)^x
The first is the well-known binomial (in its Poisson limit). The second and third have a very nice symmetry and although I cannot derive them mathematically, the results show that they are correct. N = 2^32 in this case.
The equation for "goodness of fit" is
Chi-Square(x) = [Actual(x) - Expected(x)]^2 / Expected(x)
The lower the value, the closer the distribution is to that expected. I sum Chi-Square(x) for all x for which the nearest integer to the expected value is non-zero. The nearest integer values are listed at the end of this post.
For pfreq, sum Chi-Square values for x = 0 to x = 12 incl.
For zfreq, sum Chi-Square values for x = 1 to x = 21 incl.
For nzfreq, sum Chi-Square values for x = 1 to x = 45 incl.
Defining prank/zrank/nzrank as the ranking of pfreq/zfreq/nzfreq for all full-length candidates, a ranking of 1 is the best and the overall ranking is prank+zrank+nzrank, thus the lowest overall is considered to be the best. However, one can argue that pfreq is more important than zfreq or nzfreq as the pfreq distribution is identical to the XORO32 output distribution. Having said that, the scro generator beats any xoroshiro so far tested for all three distributions, but is too large to be implemented and also has no equidistribution.
zfreq(0) and nzfreq(0) are not used for Chi-Square as they have no real meaning and instead could hold the total number of zero and non-zero runs. Note that a zero or non-zero run is terminated by a '1' or '0' and by the end of the 4GB of data.
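For Evan and Chris, here is a minimal C sketch of that recipe as I read it (the array and function names are mine, not from any existing tool):

#include <math.h>

#define INV_E 0.36787944117144232                 /* 1/e */
static const double N = 4294967296.0;             /* 2^32 */

static double pfreq_exp (int x) { return N * INV_E / tgamma(x + 1.0); }                      /* 1/x! * N * 1/e */
static double zfreq_exp (int x) { return (1.0 - INV_E) * (1.0 - INV_E) * N * pow(INV_E, x); }
static double nzfreq_exp(int x) { return INV_E * INV_E * N * pow(1.0 - INV_E, x); }

/* Sum Chi-Square(x) over [xmin, xmax], skipping bins whose expected count rounds to zero. */
static double chi_square_sum(const double *actual, int xmin, int xmax, double (*expected)(int))
{
    double sum = 0.0;
    for (int x = xmin; x <= xmax; x++) {
        double e = expected(x);
        if (llround(e) == 0) continue;
        double diff = actual[x] - e;
        sum += diff * diff / e;
    }
    return sum;
}

/* e.g. chi_square_sum(pcounts, 0, 12, pfreq_exp), chi_square_sum(zcounts, 1, 21, zfreq_exp),
   chi_square_sum(nzcounts, 1, 45, nzfreq_exp); lower sums fit the expected distributions better. */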
What approach am I to take for this? Do I need to be running exhaustive, every M, distribution runs on preferred candidates?
Only run on preferred candidates that are 2^29 min. and >2^30.0 avg. for PractRand. At least a few dozen random M on each so we can see freq results.
If you can calculate E, then use it. Otherwise, M alone is just a quick-check to see if the freq diffs are driven like I think (hopefully some toward zero).
Isn't that E's job?
E only corrects the output back to a missing zero. 'Bad-seed' prevents the state from starting in a single cycle loop... not sure how the XORO32 seeder works.
E only corrects the output back to a missing zero. 'Bad-seed' prevents the state from starting in a single cycle loop... not sure how the XORO32 seeder works.
XORO32 D reads D as the current state, iterates it twice and writes it back to D. The two 16-bit outputs are concatenated and the 32-bit result put into the source field of the next instruction (low word is first output, high word second). There is nothing in hardware to detect and change an invalid state of all zeroes.
The documentation should be modified to say the algorithm is now xoroshiro32++ and D must be > 0.
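A C model of the above behaviour, reusing the next() step sketched earlier (with M=0 it matches the current ++ generator); note that the packing of D into s0/s1 is my assumption, not taken from the hardware:

uint32_t xoro32(uint32_t *d)
{
    uint16_t s[2] = { (uint16_t)*d, (uint16_t)(*d >> 16) };   /* assumed: s0 in the low word of D */
    uint16_t lo = next(s);                                    /* first iteration  -> low word of result  */
    uint16_t hi = next(s);                                    /* second iteration -> high word of result */
    *d = (uint32_t)s[0] | ((uint32_t)s[1] << 16);             /* double-iterated state written back to D */
    return ((uint32_t)hi << 16) | lo;                         /* nothing traps an all-zero (or bad) state */
}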
In an email from February 2018, Seba said that PRN = (((s0 + s1) * 5) rotl 7) * 9 might work well for 32-bit state. I don't think this could be implemented within the same time constraints as the existing double-iterated xoroshiro32++, but it's a generator with a hybrid +** scrambler that we probably haven't tested yet.
E only corrects the output back to a missing zero. 'Bad-seed' prevents the state from starting in a single cycle loop... not sure how the XORO32 seeder works.
XORO32 D reads D as the current state, iterates it twice and writes it back to D. The two 16-bit outputs are concatenated and the 32-bit result put into the source field of the next instruction (low word is first output, high word second). There is nothing in hardware to detect and change an invalid state of all zeroes.
The documentation should be modified to say the algorithm is now xoroshiro32++ and D must be > 0.
Based on the above, I cannot predict a perfectly suitable work-around to seeding when using M, due to the fact that when M<>0 then 'bad-seed-value'<>0. That leaves only four options I can see:
1. Modify the documentation to say that D (as in XORO32 D) must be <> X, where X is the 'bad-seed-value' resulting from a specific choice of ABCM. That seems somewhat cumbersome (as compared to when X=0).
2. Keep the documentation to say 'D must be > 0' and magically not/xor the D bits with the set bits of X (aka 'bad-seed-value') in a way that does not interfere with the normal iteration process, but only affects seeding. I have no idea if this is possible.
3. Change the documentation to read something like 'D must be xor'd with 1 (or other non-zero value) and reseeded if D(i)=D(i+1)' (a sketch follows after this list). This would also work with the current xoroshiro32++, though is somewhat cryptic.
4. Abandon the pursuit of using M for the purpose of actually attempting to perfect the modified xoroshiro32++ implementation, but perhaps continue the pursuit of M for academic/future use.
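A software sketch of the option-3 (and point-9) style guard, using the xoro32() model above; the xor-with-1 nudge is just the example value from the earlier post:

uint32_t safe_seed(uint32_t d)
{
    uint32_t pumped = d;
    (void)xoro32(&pumped);          /* pump once (two engine iterations) */
    if (pumped == d)                /* state unchanged => d is the single-cycle bad-seed value */
        d ^= 1;                     /* nudge to a state on the full-period cycle */
    return d;
}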
In an email from February 2018, Seba said that PRN = (((s0 + s1) * 5) rotl 7) * 9 might work well for 32-bit state. I don't think this could be implemented within the same time constraints as the existing double-iterated xoroshiro32++, but it's a generator with a hybrid +** scrambler that we probably haven't tested yet.
That looks good from a theoretical perspective on multiple levels. I have looked at a few similar scramblers, but did not see them as suitable for parallelization based on what has already been discussed here. It seems that PRN = rotl(s0 + s1, D ) + s0 * 5 stretches the dependency chain about as far as it can go (and other minor variations on that simple modification concept seem to fall apart statistically).
Edit: I looked at a few candidates using Seba's new scrambler, but the PractRand results were not good. Either there are some good candidates to be found by a much more exhaustive search (perhaps by allowing D not exclusively 7), or the scrambler is overly symmetrical as-is, lacking a second occurrence of s0. The latter comment is based on my understanding of Melissa's small-state plotting of the xoroshiro state showing a criss-cross pattern, which I don't think can be cancelled sufficiently by a simple bivariate convolution.
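For reference, the "+ s0 * 5" form referred to throughout this thread, written out as a C expression (my reading of the formula above, using the rotl16 helper from the earlier sketch; not verified against any test harness):

uint16_t prn = (uint16_t)(rotl16((uint16_t)(s0 + s1), D) + (uint16_t)(s0 * 5));   /* ++ sum, with the free s0 term scaled by 5 */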
Evan and Chris, I think it would be useful if you were able to calculate Chi-Square sums so that you could rank the pair/zero/non-zero frequency distributions yourselves, without needing me.
I get most of what you posted, but have a few fundamental questions about the frequency distribution calculations (and since C is not my language of choice... I have to convert to/from either ASM or VB, which is tedious, at best):
1. What are the raw count values based on for each (e.g. for pairs, is it word pairs, byte pairs, bit pairs)?
2. Why terminate z / nz runs at the end of 4GB of data (as opposed to the 8GB of a single period or 16GB of a double period)?
That will fit in depth three by using more real estate. At this stage, three is considered too much, but that hasn't been proven.
For speed optimise, it would be:
x = (s0 + (s0<<2) + s1 + (s1<<2)) rotl 7
PRN = x + (x<<3)
Resolving to "x" is a full two levels, which means there is flexibility to freely adjust one of the shifter parameters to split the x5 into separate multipliers. The same could be done to fully use the third level, but it costs more transistors of course.
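A quick equivalence check of that split (assuming 16-bit words; wrap-around makes *5 = x + (x<<2) and *9 = x + (x<<3) mod 2^16):

#include <assert.h>
#include <stdint.h>

static uint16_t rotl16(uint16_t x, int k) { return (uint16_t)((x << k) | (x >> (16 - k))); }

void check(uint16_t s0, uint16_t s1)
{
    uint16_t direct = (uint16_t)(rotl16((uint16_t)((s0 + s1) * 5), 7) * 9);       /* (((s0 + s1) * 5) rotl 7) * 9 */
    uint16_t x      = rotl16((uint16_t)(s0 + (s0 << 2) + s1 + (s1 << 2)), 7);
    uint16_t split  = (uint16_t)(x + (x << 3));
    assert(direct == split);                                                      /* holds for all s0, s1 */
}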
In an email from February 2018, Seba said that PRN = (((s0 + s1) * 5) rotl 7) * 9 might work well for 32-bit state. I don't think this could be implemented within the same time constraints as the existing double-iterated xoroshiro32++, but it's a generator with a hybrid +** scrambler that we probably haven't tested yet.
That looks good from a theoretical perspective on multiple levels. I have looked at a few similar scramblers, but did not see them as suitable for parallelization based on what has already been discussed here. It seems that PRN = rotl(s0 + s1, D ) + s0 * 5 stretches the dependency chain about as far as it can go (and other minor variations on that simple modification concept seem to fall apart statistically).
Edit: I looked at a few candidates using Seba's new scrambler, but the PractRand results were not good. Either there are some good candidates to be found by a much more exhaustive search (perhaps by allowing D not exclusively 7), or the scrambler is overly symmetrical as-is, lacking a second occurrence of s0. The latter comment is based on my understanding of Melissa's small-state plotting of the xoroshiro state showing a criss-cross pattern, which I don't think can be cancelled sufficiently by a simple bivariate convolution.
OK, I thought it was worth mentioning but I don't think we should spend too much time on PRN = (((s0 + s1) * 5) rotl 7) * 9 as this ** scrambler is intended for s0 only when there are at least 128 bits of state (see the paper for more details).
I've been looking at xoroshiro32 scramblers using subtraction instead of addition and the following four permutations output non-zero values 2^16 times and zero 2^16-1 times.
I don't know whether the pair distributions (pfreq) are identical because I don't have 4GB of RAM to test them. Also I haven't looked at (s0 - s1) yet.