So you want double iterated 32-bit output from the Xoroshiro generator. And remove the pairing from the distribution test code so that it still only has to deal with 32-bit indexing. Right?
Do you have the code handy, Evan? It's quite late here. In a nutshell, don't save the previous output and do two new iterations to get the next 32-bit pair. Everything else is the same. No point testing more than one candidate.
The functional detail is it concatenates two consecutive 16-bit outputs into one 32-bit output that when sampled as 32>>0 still provides all the random data. And the second 2 Gwords has the the two halves of the output exchanged because of the 2^32-1 state space.
This looks functionally very similar to the pairing function in the distribution test code:
There is a difference if the sample size and accumulator size are different but for the distribution testing, so far, they are the same.
I think the pairing might be doing what you want already. Was that maybe the original intent of the pairing? I thought it was a feature of the distribution test though.
The functional detail is it concatenates two consecutive 16-bit outputs into one 32-bit output that when sampled as 32>>0 still provides all the random data. And the second 2 Gwords has the the two halves of the output exchanged because of the 2^32-1 state space.
This looks functionally very similar to the pairing function in the distribution test code:
There is a difference if the sample size and accumulator size are different but for the distribution testing, so far, they are the same.
I think the pairing might be doing what you want already. Was that maybe the original intent of the pairing? I thought it was a feature of the distribution test though.
Some thoughts about pair distribution. The C code iterates xoroshiro32++, concatenates successive 16-bit outputs and records how often each possible 32-bit value occurs. If the first 16-bit output is 1, the second 2, etc., then the C code produces the following 32-bit data:
Although the order is different the data are the same, i.e. the pair distribution is the XORO32 output distribution and therefore a good test of the XORO32 randomness. The xoroshiro32++ 16-bit output is equidistributed so that every non-zero values occurs 2^16 times and zero 2^16-1 times, but the XORO32 32-bit output is not equidistributed because the ++ scrambler is only 1-dimensionally equidistributed.
It would have been very easy to make the C code work exactly like XORO32 but it is quicker the way it is now. If one candidate has the same distributions for either method then they all will and it seems that is the case, so thanks for checking. Every 16-bit output occurs twice and the C code interleaves the two ~2GB blocks that are separate with XORO32. Although the byte value at any particular address is of no consequence as it's the overall frequency distribution we want, the byte addresses differ by a one bit rotation, more or less, in the two methods.
Originally, I looked at pairs of outputs to see whether they were duplicates. I knew the distribution would be binomial from one of Melissa's posts. The max frequency is less than 2^16 so it was easy to do on my PC. I soon realised a generalised pair distribution would be much more accurate, as every output would count instead of only one in 2^16, but my PC had insufficient RAM. A little while later it dawned on me that this pair distribution is identical to the XORO32 output distribution.
The distributions test XORO32 as a 32-bit generator. PractRand and the Crush suite test xoroshiro32++ as a 16-bit generator. A good score for one size does not imply a good score in the other. We need the overall best score in both and I think [13,5,10,9] gives us that.
Looks like the new code is running about 5% slower. Hmm, it's still doing 4 G loops of the pairs array filling, which makes it 8 G calls to nextxoro(). I think that might be over doing it.
Wow, twice the number of random numbers generated but it's still the same number of writes to main memory! DRAM speed is hitting hard!
Tony,
You realise, because of the 8 G calls, it's doubling up on every occurrence. All the pair[] tallies will be even values as a result.
You're doing 8G iterations to get 4G pairs now. Before it was 4G iterations as previous one was saved. Nothing else has changed. Remember period is 2^32-1, therefore a particular output that provides low word of address first time around provides high word the second time around and vice-versa.
The next three posts show distribution results using the Chi-Square goodness of fit test. These show most clearly the difference between the current [14,2,7,5] and proposed [13,5,10,9] constants in the XORO32 instruction when considering the output as a single 32-bit value.
Scroll to the extreme right to see the total for all frequencies.
Comments
Do you have the code handy, Evan? It's quite late here. In a nutshell, don't save the previous output and do two new iterations to get the next 32-bit pair. Everything else is the same. No point testing more than one candidate.
#ifdef dITER random64 = rotr( (drnword_t)nextxoro() | ((drnword_t)nextxoro() << ACCUM_SIZE), SAMPLE_POSITION ); #else
The functional detail is it concatenates two consecutive 16-bit outputs into one 32-bit output that when sampled as 32>>0 still provides all the random data. And the second 2 Gwords has the the two halves of the output exchanged because of the 2^32-1 state space.This looks functionally very similar to the pairing function in the distribution test code:
next_prn = rotr( nextxoro(), SAMPLE_POSITION ) & SAMPLE_MASK; pairs[ prev_prn | (next_prn << SAMPLE_SIZE) ]++; // 8-bit limited tallies prev_prn = next_prn;
There is a difference if the sample size and accumulator size are different but for the distribution testing, so far, they are the same.I think the pairing might be doing what you want already. Was that maybe the original intent of the pairing? I thought it was a feature of the distribution test though.
I found my original post:
It would have been very easy to make the C code work exactly like XORO32 but it is quicker the way it is now. If one candidate has the same distributions for either method then they all will and it seems that is the case, so thanks for checking. Every 16-bit output occurs twice and the C code interleaves the two ~2GB blocks that are separate with XORO32. Although the byte value at any particular address is of no consequence as it's the overall frequency distribution we want, the byte addresses differ by a one bit rotation, more or less, in the two methods.
62690201 12A2AE16 D7194AE8 984B0C52 743C1DF1 BCC6DBA0 746C34C9 07FF3643 C642BBC0 594EAA85 701AD05A F3AEA328 695A67EE 93C6C140 4964F5E1 FC575E24 A638D7AD 181AE233 7BE766B5 5CBD8445 49A45F17 C687C751 175E7CA8 8BBA386F 46776DFD 62BFF63A 4F64FD4F 5E5A15D9 B4E1379D 2E1E01C0 6A806734 B2D8BDB8 C4645DBB 99D47BC4 650F3E11 C0E82B43 82C2B0A0 9D53DC65 E0C12D02 92249C67 84638C3F 2A36E209 640EF1B9 D236A707 184A9FE3 8A183209 947E9886 FC1232D4 F6A2D1CA FE8A985B B93481C2 22425E3A 8EE02987 E12D494E E72C03B4 EE576CA0 5C0B072E 6FFB0598 4D699199 16A97FAE DC090AAC 689A45D1 E280AF76 436BCA2D
#ifdef dITER next_prn = rotr( (drnword_t)nextxoro() | ((drnword_t)nextxoro() << ACCUM_SIZE), SAMPLE_POSITION ) & SAMPLE_MASK; pairs[next_prn]++; #else next_prn = rotr( nextxoro(), SAMPLE_POSITION ) & SAMPLE_MASK; pairs[ prev_prn | (next_prn << SAMPLE_SIZE) ]++; // 8-bit limited tallies prev_prn = next_prn; #endif
Originally, I looked at pairs of outputs to see whether they were duplicates. I knew the distribution would be binomial from one of Melissa's posts. The max frequency is less than 2^16 so it was easy to do on my PC. I soon realised a generalised pair distribution would be much more accurate, as every output would count instead of only one in 2^16, but my PC had insufficient RAM. A little while later it dawned on me that this pair distribution is identical to the XORO32 output distribution.
The distributions test XORO32 as a 32-bit generator. PractRand and the Crush suite test xoroshiro32++ as a 16-bit generator. A good score for one size does not imply a good score in the other. We need the overall best score in both and I think [13,5,10,9] gives us that.
Wow, twice the number of random numbers generated but it's still the same number of writes to main memory! DRAM speed is hitting hard!
Yes and it's only changing a byte at a time, but I don't see any way to make it faster.
XORO32 [13,5,10,9] outputs agree in BASIC and x86. Should be verified independently, though. Next stop is the modified Verilog.
You realise, because of the 8 G calls, it's doubling up on every occurrence. All the pair[] tallies will be even values as a result.
You're doing 8G iterations to get 4G pairs now. Before it was 4G iterations as previous one was saved. Nothing else has changed. Remember period is 2^32-1, therefore a particular output that provides low word of address first time around provides high word the second time around and vice-versa.
Thanks Evan, the two different ways to generate the distributions agree.
[14,2,7,5] current:
wire [15:0] xoro32z = d[31:16] ^ d[15:0]; // first iteration wire [31:0] xoro32y = { xoro32z[8:0], xoro32z[15:9], {d[1:0], d[15:2]} ^ {xoro32z[13:0], 2'b0} ^ xoro32z }; wire [15:0] xoro32x = xoro32y[31:16] ^ xoro32y[15:0]; // second iteration wire [31:0] xoro32 = { xoro32x[8:0], xoro32x[15:9], // xoro32 = d result (state store) {xoro32y[1:0], xoro32y[15:2]} ^ {xoro32x[13:0], 2'b0} ^ xoro32x }; wire [15:0] xoro32a = d[31:16] + d[15:0]; // hashing sum of input state wire [15:0] xoro32b = xoro32y[31:16] + xoro32y[15:0]; // hashing sum of first iterator state output assign xoro32r = { {xoro32b[10:0], xoro32b[15:11]} + xoro32y[15:0], {xoro32a[10:0], xoro32a[15:11]} + d[15:0] }; // xoro32r = prng result, next instruction's s value
[13,5,10,9] proposed:
wire [15:0] xoro32z = d[31:16] ^ d[15:0]; // first iteration wire [31:0] xoro32y = { xoro32z[5:0], xoro32z[15:6], {d[2:0], d[15:3]} ^ {xoro32z[10:0], 5'b0} ^ xoro32z }; wire [15:0] xoro32x = xoro32y[31:16] ^ xoro32y[15:0]; // second iteration wire [31:0] xoro32 = { xoro32x[5:0], xoro32x[15:6], // xoro32 = d result (state store) {xoro32y[2:0], xoro32y[15:3]} ^ {xoro32x[10:0], 5'b0} ^ xoro32x }; wire [15:0] xoro32a = d[31:16] + d[15:0]; // hashing sum of input state wire [15:0] xoro32b = xoro32y[31:16] + xoro32y[15:0]; // hashing sum of first iterator state output assign xoro32r = { {xoro32b[6:0], xoro32b[15:7]} + xoro32y[15:0], {xoro32a[6:0], xoro32a[15:7]} + d[15:0] }; // xoro32r = prng result, next instruction's s value
xoro32y, xoro32 and xoro32r to be changed.
If the above is correct, it should produce the output here.
P.S. Good to see the old code blocks back as the recent change was affecting productivity.
I'm part way through creating some tables that show how good it is. I'll finish later today.
// Verilog HDL for Xoroshiro32++ candidate [13 5 10 9] /////////////////////////////////////////////////////////// wire [15:0] xoro32z = d[31:16] ^ d[15:0]; // first iteration wire [31:0] xoro32y = { xoro32z[5:0], xoro32z[15:6], // rol #10 {d[2:0], d[15:3]} ^ // rol #13 {xoro32z[10:0], 5'b0} ^ xoro32z }; // shl #5 wire [15:0] xoro32x = xoro32y[31:16] ^ xoro32y[15:0]; // second iteration wire [31:0] xoro32 = { xoro32x[5:0], xoro32x[15:6], // rol #10, xoro32 = d result (state store) {xoro32y[2:0], xoro32y[15:3]} ^ // rol #13 {xoro32x[10:0], 5'b0} ^ xoro32x }; // shl #5 wire [15:0] xoro32a = d[31:16] + d[15:0]; // hashing sum of input state wire [15:0] xoro32b = xoro32y[31:16] + xoro32y[15:0]; // hashing sum of first iterator state output assign xoro32r = { {xoro32b[6:0], xoro32b[15:7]} + xoro32y[15:0], {xoro32a[6:0], xoro32a[15:7]} + d[15:0] }; // rol #9, xoro32r = prng result, next instruction's s value
# |Observed-Expected| # a, b, c, d, pfreq0, pfreq1, pfreq2, pfreq3, pfreq4, pfreq5, pfreq6, pfreq7, pfreq8, pfreq9, pfreq10, pfreq11, pfreq12 # scro , , , , 12447, 14664, 6423, 9195, 2453, 1491, 434, 120, 173, 34, 13, 4, 0 13, 5, 10, 9, 78899, 89687, 41504, 42541, 5417, 2691, 1688, 89, 26, 49, 28, 6, 1 13, 5, 8, 10, 119611, 135729, 56511, 50169, 14837, 6873, 1113, 677, 287, 14, 37, 5, 1 14, 2, 7, 6, 693240, 705595, 333559, 116940, 150145, 56973, 17844, 2842, 1044, 126, 11, 8, 2 6, 2, 3, 9, 700291, 715125, 325707, 114279, 137714, 64907, 18365, 4083, 1164, 58, 16, 13, 2 3, 2, 6, 9, 1852203, 1862261, 896846, 283035, 392180, 168742, 48919, 11809, 1791, 368, 53, 3, 3 14, 2, 7, 5, 3369116, 3393152, 1658965, 561575, 713103, 302470, 83700, 18360, 3320, 409, 63, 1, 0
# |Observed-Expected| # a, b, c, d, zfreq1, zfreq2, zfreq3, zfreq4, zfreq5, zfreq6, zfreq7, zfreq8, zfreq9, zfreq10, zfreq11, zfreq12, zfreq13, zfreq14, zfreq15, zfreq16, zfreq17, zfreq18, zfreq19, zfreq20, zfreq21 # scro , , , , 17271, 11651, 8070, 2037, 2957, 444, 775, 515, 427, 230, 275, 148, 72, 24, 15, 19, 8, 4, 11, 3, 0 13, 5, 10, 9, 274316, 12986, 14125, 38593, 19943, 10341, 7191, 4027, 574, 852, 317, 255, 74, 107, 7, 9, 2, 0, 4, 0, 2 13, 5, 8, 10, 399287, 37005, 16381, 29568, 11097, 8837, 2939, 1086, 453, 238, 53, 138, 106, 15, 19, 16, 9, 4, 1, 1, 0 14, 2, 7, 6, 62101, 44563, 23048, 37978, 31150, 23066, 11996, 5767, 4211, 1980, 830, 406, 222, 30, 9, 9, 1, 3, 3, 1, 3 6, 2, 3, 9, 262462, 352036, 114239, 57912, 10014, 11589, 11818, 5940, 3476, 1584, 765, 527, 137, 56, 32, 36, 10, 7, 3, 1, 1 3, 2, 6, 9, 1118583, 79497, 214756, 182201, 137546, 77988, 39744, 19549, 9537, 4176, 1966, 772, 315, 189, 33, 33, 18, 10, 2, 2, 1 14, 2, 7, 5, 1866306, 236796, 410816, 336427, 246302, 139743, 72213, 34671, 16523, 7142, 2820, 1207, 475, 232, 107, 51, 21, 9, 1, 1, 0
# |Observed-Expected| # a, b, c, d, nzfreq1, nzfreq2, nzfreq3, nzfreq4, nzfreq5, nzfreq6, nzfreq7, nzfreq8, nzfreq9, nzfreq10, nzfreq11, nzfreq12, nzfreq13, nzfreq14, nzfreq15, nzfreq16, nzfreq17, nzfreq18, nzfreq19, nzfreq20, nzfreq21, nzfreq22, nzfreq23, nzfreq24, nzfreq25, nzfreq26, nzfreq27, nzfreq28, nzfreq29, nzfreq30, nzfreq31, nzfreq32, nzfreq33, nzfreq34, nzfreq35, nzfreq36, nzfreq37, nzfreq38, nzfreq39, nzfreq40, nzfreq41, nzfreq42, nzfreq43, nzfreq44, nzfreq45 # scro , , , , 17548, 8203, 8451, 18153, 10472, 8301, 7820, 5494, 259, 3692, 14, 1199, 188, 227, 320, 756, 744, 283, 272, 184, 180, 61, 78, 21, 61, 16, 17, 31, 26, 5, 38, 12, 1, 2, 2, 0, 6, 2, 3, 4, 2, 0, 2, 1, 1 13, 5, 10, 9, 123916, 28906, 96761, 13003, 13403, 24422, 86, 9758, 20511, 9668, 5791, 4952, 1724, 3477, 1213, 975, 212, 691, 201, 20, 140, 20, 114, 188, 3, 30, 10, 94, 14, 55, 4, 20, 6, 1, 1, 6, 9, 1, 0, 2, 1, 2, 0, 0, 1 13, 5, 8, 10, 301733, 12768, 111630, 26604, 7345, 34203, 18037, 16628, 9519, 9368, 13916, 4820, 4130, 2393, 1137, 61, 892, 513, 299, 107, 222, 151, 51, 98, 210, 37, 44, 10, 36, 18, 41, 5, 12, 16, 18, 5, 10, 2, 9, 5, 2, 1, 0, 2, 0 14, 2, 7, 6, 461338, 61444, 84874, 129163, 87316, 32725, 40235, 27583, 13636, 3551, 3536, 8258, 8282, 5669, 4864, 3407, 3451, 2262, 2161, 1686, 1247, 752, 572, 396, 178, 189, 194, 36, 107, 47, 33, 29, 23, 15, 9, 6, 8, 7, 5, 0, 1, 2, 2, 0, 0 6, 2, 3, 9, 164117, 201875, 89806, 204802, 64224, 64097, 22963, 24718, 40474, 35969, 32163, 21653, 17188, 14458, 9364, 8544, 5381, 3922, 2496, 1889, 1204, 724, 661, 269, 281, 170, 153, 109, 88, 15, 42, 5, 6, 18, 18, 10, 4, 1, 0, 1, 0, 2, 1, 0, 1 3, 2, 6, 9, 356549, 212743, 188614, 299169, 147752, 84527, 32412, 21264, 3829, 13929, 20684, 18030, 16223, 11768, 10150, 7530, 5548, 4239, 2753, 1775, 1316, 898, 565, 427, 420, 169, 101, 110, 18, 59, 53, 24, 23, 2, 13, 1, 3, 7, 3, 0, 0, 1, 1, 1, 1 14, 2, 7, 5, 796603, 449094, 371827, 495986, 298229, 174131, 28618, 22581, 19650, 23117, 35575, 29984, 22917, 20059, 17747, 11176, 8308, 5917, 4938, 3120, 2176, 1456, 1062, 829, 579, 265, 301, 104, 89, 59, 20, 20, 24, 20, 8, 4, 8, 0, 1, 1, 0, 3, 1, 2, 1
Scroll to the extreme right to see the total for all frequencies.
# (Observed-Expected)^2/Expected # a, b, c, d, pfreq0, pfreq1, pfreq2, pfreq3, pfreq4, pfreq5, pfreq6, pfreq7, pfreq8, pfreq9, pfreq10, pfreq11, pfreq12, total # scro , , , , 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3 13, 5, 10, 9, 4, 5, 2, 7, 0, 1, 1, 0, 0, 1, 2, 1, 0, 24 13, 5, 8, 10, 9, 12, 4, 10, 3, 4, 1, 1, 2, 0, 3, 1, 0, 50 14, 2, 7, 6, 304, 315, 141, 52, 342, 247, 145, 26, 28, 4, 0, 2, 1, 1606 6, 2, 3, 9, 310, 324, 134, 50, 288, 320, 154, 53, 35, 1, 1, 4, 1, 1674 3, 2, 6, 9, 2171, 2195, 1018, 304, 2336, 2163, 1090, 445, 82, 31, 6, 0, 3, 11845 14, 2, 7, 5, 7184, 7287, 3484, 1198, 7724, 6948, 3192, 1075, 281, 38, 9, 0, 0, 38421