So you want double iterated 32-bit output from the Xoroshiro generator. And remove the pairing from the distribution test code so that it still only has to deal with 32-bit indexing. Right?
Do you have the code handy, Evan? It's quite late here. In a nutshell, don't save the previous output and do two new iterations to get the next 32-bit pair. Everything else is the same. No point testing more than one candidate.
The functional detail is it concatenates two consecutive 16-bit outputs into one 32-bit output that when sampled as 32>>0 still provides all the random data. And the second 2 Gwords has the the two halves of the output exchanged because of the 2^32-1 state space.
This looks functionally very similar to the pairing function in the distribution test code:
There is a difference if the sample size and accumulator size are different but for the distribution testing, so far, they are the same.
I think the pairing might be doing what you want already. Was that maybe the original intent of the pairing? I thought it was a feature of the distribution test though.
The functional detail is it concatenates two consecutive 16-bit outputs into one 32-bit output that when sampled as 32>>0 still provides all the random data. And the second 2 Gwords has the the two halves of the output exchanged because of the 2^32-1 state space.
This looks functionally very similar to the pairing function in the distribution test code:
There is a difference if the sample size and accumulator size are different but for the distribution testing, so far, they are the same.
I think the pairing might be doing what you want already. Was that maybe the original intent of the pairing? I thought it was a feature of the distribution test though.
Some thoughts about pair distribution. The C code iterates xoroshiro32++, concatenates successive 16-bit outputs and records how often each possible 32-bit value occurs. If the first 16-bit output is 1, the second 2, etc., then the C code produces the following 32-bit data:
Although the order is different the data are the same, i.e. the pair distribution is the XORO32 output distribution and therefore a good test of the XORO32 randomness. The xoroshiro32++ 16-bit output is equidistributed so that every non-zero values occurs 2^16 times and zero 2^16-1 times, but the XORO32 32-bit output is not equidistributed because the ++ scrambler is only 1-dimensionally equidistributed.
It would have been very easy to make the C code work exactly like XORO32 but it is quicker the way it is now. If one candidate has the same distributions for either method then they all will and it seems that is the case, so thanks for checking. Every 16-bit output occurs twice and the C code interleaves the two ~2GB blocks that are separate with XORO32. Although the byte value at any particular address is of no consequence as it's the overall frequency distribution we want, the byte addresses differ by a one bit rotation, more or less, in the two methods.
Originally, I looked at pairs of outputs to see whether they were duplicates. I knew the distribution would be binomial from one of Melissa's posts. The max frequency is less than 2^16 so it was easy to do on my PC. I soon realised a generalised pair distribution would be much more accurate, as every output would count instead of only one in 2^16, but my PC had insufficient RAM. A little while later it dawned on me that this pair distribution is identical to the XORO32 output distribution.
The distributions test XORO32 as a 32-bit generator. PractRand and the Crush suite test xoroshiro32++ as a 16-bit generator. A good score for one size does not imply a good score in the other. We need the overall best score in both and I think [13,5,10,9] gives us that.
Looks like the new code is running about 5% slower. Hmm, it's still doing 4 G loops of the pairs array filling, which makes it 8 G calls to nextxoro(). I think that might be over doing it.
Wow, twice the number of random numbers generated but it's still the same number of writes to main memory! DRAM speed is hitting hard!
Tony,
You realise, because of the 8 G calls, it's doubling up on every occurrence. All the pair[] tallies will be even values as a result.
You're doing 8G iterations to get 4G pairs now. Before it was 4G iterations as previous one was saved. Nothing else has changed. Remember period is 2^32-1, therefore a particular output that provides low word of address first time around provides high word the second time around and vice-versa.
The next three posts show distribution results using the Chi-Square goodness of fit test. These show most clearly the difference between the current [14,2,7,5] and proposed [13,5,10,9] constants in the XORO32 instruction when considering the output as a single 32-bit value.
Scroll to the extreme right to see the total for all frequencies.
Comments
Do you have the code handy, Evan? It's quite late here. In a nutshell, don't save the previous output and do two new iterations to get the next 32-bit pair. Everything else is the same. No point testing more than one candidate.
This looks functionally very similar to the pairing function in the distribution test code: There is a difference if the sample size and accumulator size are different but for the distribution testing, so far, they are the same.
I think the pairing might be doing what you want already. Was that maybe the original intent of the pairing? I thought it was a feature of the distribution test though.
I found my original post:
It would have been very easy to make the C code work exactly like XORO32 but it is quicker the way it is now. If one candidate has the same distributions for either method then they all will and it seems that is the case, so thanks for checking. Every 16-bit output occurs twice and the C code interleaves the two ~2GB blocks that are separate with XORO32. Although the byte value at any particular address is of no consequence as it's the overall frequency distribution we want, the byte addresses differ by a one bit rotation, more or less, in the two methods.
Originally, I looked at pairs of outputs to see whether they were duplicates. I knew the distribution would be binomial from one of Melissa's posts. The max frequency is less than 2^16 so it was easy to do on my PC. I soon realised a generalised pair distribution would be much more accurate, as every output would count instead of only one in 2^16, but my PC had insufficient RAM. A little while later it dawned on me that this pair distribution is identical to the XORO32 output distribution.
The distributions test XORO32 as a 32-bit generator. PractRand and the Crush suite test xoroshiro32++ as a 16-bit generator. A good score for one size does not imply a good score in the other. We need the overall best score in both and I think [13,5,10,9] gives us that.
Wow, twice the number of random numbers generated but it's still the same number of writes to main memory! DRAM speed is hitting hard!
Yes and it's only changing a byte at a time, but I don't see any way to make it faster.
XORO32 [13,5,10,9] outputs agree in BASIC and x86. Should be verified independently, though. Next stop is the modified Verilog.
You realise, because of the 8 G calls, it's doubling up on every occurrence. All the pair[] tallies will be even values as a result.
You're doing 8G iterations to get 4G pairs now. Before it was 4G iterations as previous one was saved. Nothing else has changed. Remember period is 2^32-1, therefore a particular output that provides low word of address first time around provides high word the second time around and vice-versa.
Thanks Evan, the two different ways to generate the distributions agree.
[14,2,7,5] current:
[13,5,10,9] proposed:
xoro32y, xoro32 and xoro32r to be changed.
If the above is correct, it should produce the output here.
P.S. Good to see the old code blocks back as the recent change was affecting productivity.
I'm part way through creating some tables that show how good it is. I'll finish later today.
Scroll to the extreme right to see the total for all frequencies.