Here's the progress. It's still on the "degree 9" part of the list so the [14 2 7 x]'s are still to come.
Only first 10 candidates of [5 2 6 x] have max scores but the key thing is there's no distinction at all between them. And, as expected, that is recurring on many others. [14 2 7 x] ain't far away so I'll leave it running for the day ...
Thanks for the latest results, Evan. It's good to know we can get maximum scores in PractRand for 16-bit samples at last, without the FPF tests. xoroshiro32++ [14,2,7,x] is very strong with FPF removed, and it might be interesting to test xoroshiro32+p [14,2,7] without FPF as a comparison.
I think Xoroshiro+(p) [14 2 7] got knocked out by the extended testing, with or without FPF. It was good until I enabled PractRand's "-te 1" option. I'll double check when I get home.
Turns out both -te 1 and -tf 2 are required together to knock out Xoroshiro+ [14 2 7]. And it's only the [15:1] sampling aperture that gets clobbered, everything else looks normal ...
XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.
Oh! And improving its timing would help?
The result hash could be reordered to use the initial input state together with the first iterator output, with the second iterator output only going back to the state.
Hmm, a little unexpectedly, I've managed to get original Xoroshiro128+ to fail PractRand -te1 -tf2 testing at a mere 16 TB using [57:2] sampling. Took about 3 days (NOTE: sampling nomenclature of filenames is width and position rather than most and least):
Candidate sampling test-s64/testcase-xo-a55b14c36d0w8p0 - PractRand score: 512 KB
length= 512 kilobytes (2^19 bytes), time= 13.5 seconds
Test Name                   Raw       Processed     Evaluation
[Low1/8]BRank(18):256(1)    R= +2650  p~= 9.8e-799  FAIL !!!!!!!
[Low4/32]BCFN_FF(2+7):freq  R= +7.5   p~= 1e-6      unusual
...and 474 test result(s) without anomalies
Candidate sampling test-s64/testcase-xo-a55b14c36d0w32p0 - PractRand score: 2 MB
length= 2 megabytes (2^21 bytes), time= 19.5 seconds
Test Name                   Raw       Processed     Evaluation
[Low1/32]BRank(18):256(1)   R= +2650  p~= 9.8e-799  FAIL !!!!!!!
...and 584 test result(s) without anomalies
Candidate sampling test-s64/testcase-xo-a55b14c36d0w64p0 - PractRand score: 4 MB
length= 4 megabytes (2^22 bytes), time= 22.3 seconds
Test Name                   Raw       Processed     Evaluation
[Low1/64]BRank(18):256(1)   R= +2650  p~= 9.8e-799  FAIL !!!!!!!
...and 640 test result(s) without anomalies
Full width sampling achieves only 4 MB! All BRank fails, perfectly in line with the above sampling widths. That's not pretty at all.
Candidate sampling test-s64/testcase-xo-a55b14c36d0w48p0 - PractRand score: 32 MB
length= 32 megabytes (2^25 bytes), time= 31.4 seconds
Test Name                   Raw       Processed     Evaluation
[Low1/16]BRank(18):768(1)   R= +2650  p~= 9.8e-799  FAIL !!!!!!!
...and 795 test result(s) without anomalies
Another BRank fail, which suggests byte-aligned bit0 is the dominant factor.
Candidate sampling test-s64/testcase-xo-a55b14c36d0w8p1 - PractRand score: 64 GB
length= 64 gigabytes (2^36 bytes), time= 1507 seconds
Test Name                   Raw       Processed     Evaluation
[Low1/8]BRank(18):12K(1)    R=+86692  p~= 0         FAIL !!!!!!!!
...and 1433 test result(s) without anomalies
BRank again, due to byte aligned bit1 I guess. Bit1 is clearly an improvement over bit0.
Candidate sampling test-s64/testcase-xo-a55b14c36d0w48p1 - PractRand score: 512 GB
length= 512 gigabytes (2^39 bytes), time= 9767 seconds
Test Name                   Raw       Processed     Evaluation
DC6-9x1Bytes-1              R= +33.3  p = 2.1e-12   FAIL
...and 1622 test result(s) without anomalies
I'm guessing bit1 is the limiting factor here. It could also be because I'm using sampling widths that are multiples of 8 bits, which present byte alignments to PractRand. Odd-sized widths are probably the next thing to check.
Candidate sampling test-s64/testcase-xo-a55b14c36d0w56p0 - PractRand score: 16 TB
length= 16 terabytes (2^44 bytes), time= 263605 seconds
Test Name                   Raw       Processed     Evaluation
BCFN_FF(2+2):freq           R= +9.4   p~= 1e-9      very suspicious
DC6-6x2Bytes-1              R= +43.6  p = 6.8e-20   FAIL !!
DC6-5x4Bytes-1              R= +11.6  p = 2.9e-7    very suspicious
...and 1856 test result(s) without anomalies
The BCFN here is presumably due to bit0 included in the sampling data. 56-bit sampling width seems to hide the regularities from PractRand.
Candidate sampling test-s64/testcase-xo-a55b14c36d0w56p2 - PractRand score: 16 TB
length= 16 terabytes (2^44 bytes), time= 263652 seconds
Test Name                   Raw       Processed     Evaluation
DC6-6x2Bytes-1              R= +27.3  p = 1.1e-12   FAIL
DC6-5x4Bytes-1              R= +6.5   p = 3.6e-4    unusual
...and 1857 test result(s) without anomalies
I'm surprised this didn't get further than the [55:0] score.
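For readers trying to follow the w/p filenames, here is a minimal Python sketch of what a sampling aperture like w56p2 (bits [57:2]) could look like in software. The generator is the standard xoroshiro128+ with the [55 14 36] constants named in the filenames. The helper names (`sample_window`, `pack_samples`) and the least-significant-bit-first packing are my own assumptions for illustration, not necessarily how Evan's test harness builds its stream.

```python
from itertools import islice

MASK64 = (1 << 64) - 1


def rotl64(x, k):
    """Rotate a 64-bit value left by k bits (0 < k < 64)."""
    return ((x << k) | (x >> (64 - k))) & MASK64


def xoroshiro128plus(s0, s1, a=55, b=14, c=36):
    """Yield 64-bit outputs of xoroshiro128+ [a b c]; the seed must not be all zero."""
    while True:
        yield (s0 + s1) & MASK64          # the "+" output: plain sum of both state words
        s1 ^= s0
        s0 = rotl64(s0, a) ^ s1 ^ ((s1 << b) & MASK64)
        s1 = rotl64(s1, c)


def sample_window(value, width, pos):
    """Extract `width` bits starting at bit `pos`, e.g. w56p2 selects bits [57:2]."""
    return (value >> pos) & ((1 << width) - 1)


def pack_samples(values, width, pos):
    """Concatenate windows into one bit stream (LSB first, an assumption) and
    return the whole bytes; any trailing partial byte is dropped."""
    acc = 0
    nbits = 0
    out = bytearray()
    for v in values:
        acc |= sample_window(v, width, pos) << nbits
        nbits += width
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    return bytes(out)
```

Something like `pack_samples(islice(xoroshiro128plus(1, 2), 4), 56, 2)` gives 28 bytes that skip bits 0 and 1 of every output. Note that a width that is not a multiple of 8 makes each sample start at a different bit offset within a byte, which is the byte-alignment effect speculated about above.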
Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.
In the last couple of days I've been running lots of tests without the FPF tests in PractRand (after looking around PractRand's sources for something else, I worked out how to remove selected tests). As a result I've been back-testing with the original Xoroshiro+ algorithm, and the byte1 [8:1] scores are comparatively even less pretty now. Although, I don't know how much removing the FPF tests has nerfed PractRand's testing.
Chip,
On the free-running Xoroshiro128+, maybe the taps should be adjusted to no longer use bit1 of the summing output, i.e. restricted to selections from [63:2].
Done.
Ignoring bits 1 and 0 and using subsets of the remaining 62 for the cogs and other stuff is the best we can do with xoroshiro128+ and the results will be good enough for the intended purposes. Probably more than good enough.
I'm confused too, but I think that post applies only to Xoroshiro128+, which is not in P2?
There was some talk of speeding up XORO32 a few posts above (to move it further back from the critical path). Did that happen?
No, no changes. Maybe something for a future Prop3.
That's just me comparing the original Xoroshiro128+ (not even scratching a tiny fraction of full period scoring), which is only used in the free-running GETRND, with how far we've come with the XORO32 instruction (close to full period scoring).
PS: I might try a more oddball sampling width, not a multiple of 8. That should reach far better scores. It will probably take weeks or even months to finish.
xoroshiro128+ is used in the free-running hub-based 64-bit generator (low two bits ignored).
xoroshiro32++ is used in the cog-based 16-bit generator, double-iterated in XORO32 to give a 32-bit output.
Evan's tests prove that xoroshiro++ is much better than xoroshiro+. All the bits in xoroshiro++ can be equally good and there are none of the linear artifacts that are present in xoroshiro+. However, a xoroshiro128++ would have needed another 64-bit adder plus 64-bit pipelining and the decision was made that it was not worth the extra logic.
Well, what is the recipe for a xoroshiro128++, the same kind of thing that we have in the XORO32 instruction?
Yes, but only a single iteration obviously.
For a xoroshiro128++ [a,b,c,d], all we could do is use xoroshiro128+ [55,14,36,d] and guess the d value. Seba told me that in theory the best result will be if d is close to half-rotation but not exactly half (32), preferably odd and a prime number, but it is very much theoretical. On this basis, d = 29 or d = 31 might be good.
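To make that recipe concrete, here is a minimal Python sketch, assuming the 64-bit version uses the same output hash as the XORO32 code in this thread, rotl(s0 + s1, d) + s0, on top of the unchanged xoroshiro128+ [55,14,36] state update. The function name and the default d = 29 are placeholders only; as said above, d is a guess and none of this has been through PractRand.

```python
MASK64 = (1 << 64) - 1


def rotl64(x, k):
    """Rotate a 64-bit value left by k bits (0 < k < 64)."""
    return ((x << k) | (x >> (64 - k))) & MASK64


def xoroshiro128plusplus(s0, s1, a=55, b=14, c=36, d=29):
    """Yield 64-bit outputs of a hypothetical xoroshiro128++ [a,b,c,d].

    The state update is the xoroshiro128+ one; only the output changes.
    d = 29 is an untested guess taken from the discussion, not a vetted
    constant. The seed (s0, s1) must not be all zero.
    """
    while True:
        # "++" output: sum, rotate by d, then a second add of s0.
        yield (rotl64((s0 + s1) & MASK64, d) + s0) & MASK64
        s1 ^= s0
        s0 = rotl64(s0, a) ^ s1 ^ ((s1 << b) & MASK64)
        s1 = rotl64(s1, c)
```

Trying d = 31 instead is just `xoroshiro128plusplus(seed0, seed1, d=31)`. The second 64-bit add in the output line is the extra adder mentioned above as the hardware cost.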
Xoroshiro128 aside, we can still speed up XORO32 without any quality consequence, by starting the sum at the initial D inputs, instead of after the first iteration.
Here is the current code for XORO32:
wire [15:0] xoro32z = d[31:16] ^ d[15:0];                    // first iteration [14,2,7,5]
wire [31:0] xoro32y = { xoro32z[8:0], xoro32z[15:9],
                        {d[1:0], d[15:2]} ^
                        {xoro32z[13:0], 2'b0} ^ xoro32z };

wire [15:0] xoro32x = xoro32y[31:16] ^ xoro32y[15:0];        // second iteration
wire [31:0] xoro32  = { xoro32x[8:0], xoro32x[15:9],         // xoro32 = d result
                        {xoro32y[1:0], xoro32y[15:2]} ^
                        {xoro32x[13:0], 2'b0} ^ xoro32x };

wire [16:0] xoro32a = xoro32y[31:16] + xoro32y[15:0];        // first sum ** CRITICAL PATH, AFTER HUB RAMS **
wire [16:0] xoro32b = xoro32[31:16] + xoro32[15:0];          // second sum ** CRITICAL PATH, AFTER HUB RAMS **

assign xoro32r = { {xoro32b[10:0], xoro32b[15:11]} + xoro32[15:0],     // xoro32r = prng result, next instruction's s value
                   {xoro32a[10:0], xoro32a[15:11]} + xoro32y[15:0] };  // ** CRITICAL PATH, AFTER HUB RAMS **
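For anyone wanting to check the Verilog above in software, here is a minimal Python sketch of it. `xoro32_current` follows the posted wires one for one. `xoro32_reordered` is only my reading of the speed-up described earlier (hash the initial D state and the first iteration, feed the second iteration back as state); it is not the revised Verilog, and all function names are made up for illustration.

```python
MASK16 = 0xFFFF


def rotl16(x, k):
    """Rotate a 16-bit value left by k bits (0 < k < 16)."""
    return ((x << k) | (x >> (16 - k))) & MASK16


def step(state):
    """One xoroshiro32 [14,2,7] iteration on a 32-bit state {s1, s0}."""
    s0 = state & MASK16
    s1 = (state >> 16) & MASK16
    z = s0 ^ s1                                      # xoro32z / xoro32x
    new_s0 = rotl16(s0, 14) ^ ((z << 2) & MASK16) ^ z
    new_s1 = rotl16(z, 7)
    return (new_s1 << 16) | new_s0


def scramble(state):
    """The ++ output hash with d = 5: rotl(s0 + s1, 5) + s0, kept to 16 bits."""
    s0 = state & MASK16
    s1 = (state >> 16) & MASK16
    return (rotl16((s0 + s1) & MASK16, 5) + s0) & MASK16


def xoro32_current(d):
    """Model of the posted Verilog: both sums are taken after iterating."""
    y = step(d)                                      # xoro32y
    nxt = step(y)                                    # xoro32 (written back to D)
    result = (scramble(nxt) << 16) | scramble(y)     # xoro32r
    return nxt, result


def xoro32_reordered(d):
    """Assumed reordering: the first sum starts directly from the D inputs."""
    y = step(d)
    nxt = step(y)
    result = (scramble(y) << 16) | scramble(d)
    return nxt, result
```

In this model the state sequence is identical for both versions and the reordered 16-bit output stream is the current one delayed by a single word, which is consistent with the claim that there is no quality consequence. The gain would be that the first adder no longer waits for an iteration to finish.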
I'm trying a "-tf 2" run now to see if that has any significant impact.
I think the new XORO32 Verilog might be:
Evan, can you check this? I have to stop for the night now.
xoro32y, xoro32 and xoro32r are different from before.
I don't know how much time this will save.
But we'd only get a 16-bit PRN then.
New XORO32 test outputs here.
Thanks to Evan for remembering this from the C code.
Thanks. I will make this change first thing tomorrow.
Edit: Doh!
Please pretend this is in the Prop2 FPGA files!!! thread
Super, Garryj! Thanks for reporting.
BEGIN ... at 2018-03-31 02:29:55
PractRand scoring candidate [55 14 36] of Xoroshiro128(64)+ random generator.
PractRand v0.93 options: stdin -multithreaded -te 1 -tf 2 -tlmin 1KB
There was a change made on March 12th:
Ignoring bits 1 and 0 and using subsets of the remaining 62 for the cogs and other stuff is the best we can do with xoroshiro128+ and the results will be good enough for the intended purposes. Probably more than good enough.
I'm working on the change now...
It would be best to test the actual 32-bit subsets used by GETRND D or 8-bit (?) subsets used for dithering.
http://www.pcg-random.org/posts/xoroshiro-fails-truncated.html
I told her about the xoroshiro++ algorithm on February 22nd.