Random/LFSR on P2

evanh · 2018-03-13 21:43

Here's the progress. It's still on "degree 9" part of the list so the [14 2 7 x]'s are still to come.

Only first 10 candidates of [5 2 6 x] have max scores but the key thing is there's no distinction at all between them. And, as expected, that is recurring on many others. [14 2 7 x] ain't far away so I'll leave it running for the day ...

evanh · 2018-03-14 08:09

Here's where it got to when I stopped it. Not everything is a clean sweep but basically as expected.

I'm trying a "-tf 2" run now to see if that has any significant impact.

evanh · 2018-03-14 10:13

-tf 2 used on the [14 2 7 x] candidates produces an identical score set. That's proof enough to say FPF tests are needed for Xoroshiro++ scoring.

TonyB_ · 2018-03-15 02:10

Thanks for the latest results, Evan. It's good to know we can get maximum scores in PractRand for 16-bit samples at last, without the FPF tests. xoroshiro32++ [14,2,7,x] is very strong - FPF and it might be interesting to test xoroshiro32+p [14,2,7] - FPF as a comparison.

evanh · 2018-03-15 03:46

I think Xoroshiro+(p) [14 2 7] got knocked out by the extended testing, with or without FPF. It was good until I enabled PractRand's "-te 1" option. I'll double check when I get home.

evanh · 2018-03-15 11:03

Turns out both -te 1 and -tf 2 are required together to knock out Xoroshiro+ [14 2 7]. And it's only the [15:1] sampling aperture that gets clobbered, everything else looks normal ...

evanh · 2018-03-15 14:13

Took some effort to piece these all together.

TonyB_ · 2018-04-01 03:15

evanh wrote: »

cgracey wrote: »

XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.

Oh! And improving it's timing would help?

Reordering the result hash to use the initial input state with first iterator output could be done. With the second iterator output only going back to the state.

cgracey wrote: »

Yes, let's change XORO32 to cut the time down.

I think the new XORO32 Verilog might be:

wire [15:0] xoro32z	= d[31:16] ^ d[15:0];			// first iteration
wire [31:0] xoro32y	= { xoro32z[8:0], xoro32z[15:9],
			    {d[1:0], d[15:2]} ^
			    {xoro32z[13:0], 2'b0} ^ xoro32z };

wire [15:0] xoro32x	= xoro32y[31:16] ^ xoro32y[15:0];	// second iteration
wire [31:0] xoro32	= { xoro32x[8:0], xoro32x[15:9],	// xoro32 = d result
			    {xoro32y[1:0], xoro32y[15:2]} ^
			    {xoro32x[13:0], 2'b0} ^ xoro32x };

wire [15:0] xoro32a	= d[31:16] + d[15:0];			// first sum
wire [15:0] xoro32b	= xoro32y[31:16] + xoro32y[15:0];	// second sum

assign xoro32r	= { {xoro32b[10:0], xoro32b[15:11]} + xoro32[15:0],
		    {xoro32a[10:0], xoro32a[15:11]} + xoro32y[15:0] };

Evan, can you check this? I have to stop for the night now.

evanh · 2018-04-01 03:28

Here's what I get. I had it stashed away from earlier. The last equation, xoro32r, also differs.

wire [15:0] xoro32z	= d[31:16] ^ d[15:0];			// first iteration
wire [31:0] xoro32y	= { xoro32z[8:0], xoro32z[15:9],
			    {d[1:0], d[15:2]} ^
			    {xoro32z[13:0], 2'b0} ^ xoro32z };

wire [15:0] xoro32x	= xoro32y[31:16] ^ xoro32y[15:0];	// second iteration
wire [31:0] xoro32	= { xoro32x[8:0], xoro32x[15:9],	// xoro32 = d result (state store)
			    {xoro32y[1:0], xoro32y[15:2]} ^
			    {xoro32x[13:0], 2'b0} ^ xoro32x };

wire [15:0] xoro32a	= d[31:16] + d[15:0];			// hashing sum of input state
wire [15:0] xoro32b	= xoro32y[31:16] + xoro32y[15:0];	// hashing sum of first iterator state output

assign xoro32r	= { {xoro32b[10:0], xoro32b[15:11]} + xoro32y[15:0],
		    {xoro32a[10:0], xoro32a[15:11]} + d[15:0] };	// xoro32r = prng result, next instruction's s value

TonyB_ · 2018-04-01 03:37

Evan, your code looks correct.
xoro32y, xoro32 and xoro32r are different from before.

I don't know how much time this will save.

TonyB_ · 2018-04-01 04:01

Any time saving could be doubled by reverting to a single iteration.

cgracey · 2018-04-01 04:44

TonyB_ wrote: »

Any time saving could be doubled by reverting to a single iteration.

But we'd only get a 16-bit PRN then.

TonyB_ · 2018-04-01 18:26

More about the late change to XORO32 here.
New XORO32 test outputs here.

Thanks to Evan for remembering this from the C code.

cgracey · 2018-04-01 18:44

TonyB_ wrote: »

More about the late change to XORO32 here.
New XORO32 test outputs here.

Thanks to Evan for remembering this from the C code.

Thanks. I will make this change first thing tomorrow.

garryj · 2018-04-01 19:49

The USB demo is running fine on P2v32.

Edit: Doh!
Please pretend this is in the Prop2 FPGA files!!! thread

cgracey · 2018-04-01 20:38

garryj wrote: »

The USB demo is running fine on P2v32.

Edit: Doh!
Please pretend this is in the Prop2 FPGA files!!! thread

Super, Garryj! Thanks for reporting.

evanh · 2018-04-02 22:24

Hmm, a little unexpectedly, I've managed to get original Xoroshiro128+ to fail PractRand -te1 -tf2 testing at a mere 16 TB using [57:2] sampling. Took about 3 days (NOTE: sampling nomenclature of filenames is width and position rather than most and least):

BEGIN ... at 2018-03-31 02:29:55

PractRand scoring candidate [55 14 36] of Xoroshiro128(64)+ random generator.
PractRand v0.93 options: stdin -multithreaded -te 1 -tf 2 -tlmin 1KB

Candidate sampling test-s64/testcase-xo-a55b14c36d0w8p0 - PractRand score: 512 KB
length= 512 kilobytes (2^19 bytes), time= 13.5 seconds
Test Name Raw Processed Evaluation
[Low1/8]BRank(18):256(1) R= +2650 p~= 9.8e-799 FAIL !!!!!!!
[Low4/32]BCFN_FF(2+7):freq R= +7.5 p~= 1e-6 unusual
...and 474 test result(s) without anomalies

Candidate sampling test-s64/testcase-xo-a55b14c36d0w32p0 - PractRand score: 2 MB
length= 2 megabytes (2^21 bytes), time= 19.5 seconds
Test Name Raw Processed Evaluation
[Low1/32]BRank(18):256(1) R= +2650 p~= 9.8e-799 FAIL !!!!!!!
...and 584 test result(s) without anomalies

Candidate sampling test-s64/testcase-xo-a55b14c36d0w64p0 - PractRand score: 4 MB
length= 4 megabytes (2^22 bytes), time= 22.3 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(18):256(1) R= +2650 p~= 9.8e-799 FAIL !!!!!!!
...and 640 test result(s) without anomalies

Full width sampling achieves only 4 MB! All BRank fails, perfectly inline with the above sampling widths. That's not pretty at all.

Candidate sampling test-s64/testcase-xo-a55b14c36d0w48p0 - PractRand score: 32 MB
length= 32 megabytes (2^25 bytes), time= 31.4 seconds
Test Name Raw Processed Evaluation
[Low1/16]BRank(18):768(1) R= +2650 p~= 9.8e-799 FAIL !!!!!!!
...and 795 test result(s) without anomalies

Another BRank, suggests byte aligned bit0 is the dominant factor.

Candidate sampling test-s64/testcase-xo-a55b14c36d0w8p1 - PractRand score: 64 GB
length= 64 gigabytes (2^36 bytes), time= 1507 seconds
Test Name Raw Processed Evaluation
[Low1/8]BRank(18):12K(1) R=+86692 p~= 0 FAIL !!!!!!!!
...and 1433 test result(s) without anomalies

BRank again, due to byte aligned bit1 I guess. Bit1 is clearly an improvement over bit0.

Candidate sampling test-s64/testcase-xo-a55b14c36d0w48p1 - PractRand score: 512 GB
length= 512 gigabytes (2^39 bytes), time= 9767 seconds
Test Name Raw Processed Evaluation
DC6-9x1Bytes-1 R= +33.3 p = 2.1e-12 FAIL
...and 1622 test result(s) without anomalies

I'm guessing bit1 is the limiting factor here. It could also be because I'm using sampling widths of multiples of 8 bits which presents byte alignments to PractRand. Odd sized width's probably the next thing to check.

Candidate sampling test-s64/testcase-xo-a55b14c36d0w56p0 - PractRand score: 16 TB
length= 16 terabytes (2^44 bytes), time= 263605 seconds
Test Name Raw Processed Evaluation
BCFN_FF(2+2):freq R= +9.4 p~= 1e-9 very suspicious
DC6-6x2Bytes-1 R= +43.6 p = 6.8e-20 FAIL !!
DC6-5x4Bytes-1 R= +11.6 p = 2.9e-7 very suspicious
...and 1856 test result(s) without anomalies

The BCFN here is presumably due to bit0 included in the sampling data. 56-bit sampling width seems to hide the regularities from PractRand.

Candidate sampling test-s64/testcase-xo-a55b14c36d0w56p2 - PractRand score: 16 TB
length= 16 terabytes (2^44 bytes), time= 263652 seconds
Test Name Raw Processed Evaluation
DC6-6x2Bytes-1 R= +27.3 p = 1.1e-12 FAIL
DC6-5x4Bytes-1 R= +6.5 p = 3.6e-4 unusual
...and 1857 test result(s) without anomalies

I'm surprised this didn't get further than the [55:0] score.

cgracey · 2018-04-02 23:45

What do we do? Anything?

Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.

TonyB_ · 2018-04-03 00:34

evanh wrote: »

Hmm, a little unexpectedly, I've managed to get original Xoroshiro128+ to fail PractRand -te1 -tf2 testing at a mere 16 TB using [57:2] sampling. Took about 3 days ...

cgracey wrote: »

What do we do? Anything?

Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.

There was a change made on March 12th:

cgracey wrote: »

evanh wrote: »

In the last couple of days I've been running lots of tests without the FPF tests in PractRand (After looking around PractRand's sources for something else I worked out how to remove selected tests). As a result I've been back testing with original Xoroshiro+ algorithm and the byte1 [8:1] scores are comparatively even less pretty now. Although, I don't know how much removing FPF tests has nerf'd PractRand's testing.

Chip,
On the free running Xoroshiro128+, maybe the taps should be adjusted to no longer use bit1 of the summing output, ie: restricted to selections from [63:2].

Done.

Ignoring bits 1 and 0 and using subsets of the remaining 62 for the cogs and other stuff is the best we can do with xoroshiro128+ and the results will be good enough for the intended purposes. Probably more than good enough.

jmg · 2018-04-03 00:34

cgracey wrote: »

What do we do? Anything?

Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.

I'm confused too, but I think that post applies only to Xoroshiro128+ - which is not in P2 ?

There was some talk of speeding up XORO32 a few posts above ( to move it further back from the critical path) - did that happen ?

evanh · 2018-04-03 00:38

No, no changes. Maybe something for a future Prop3.

That's just me comparing the original Xoroshriro128+ (not even scratching a tiny fraction of full period scoring), which is only used in the free running GETRND, with how far we've come with the XORO32 instruction (close to full period scoring).

PS: I'll might try a more oddball sampling width - not multiples of 8. That should reach far better scores. Probably take weeks or even months to finish.

evanh · 2018-04-03 00:48

Actually, I'll run more selections of 32-bit widths first. That's what the Cogs use after all ...

TonyB_ · 2018-04-03 00:59

jmg wrote: »

cgracey wrote: »

What do we do? Anything?

Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.

I'm confused too, but I think that post applies only to Xoroshiro128+ - which is not in P2 ?

There was some talk of speeding up XORO32 a few posts above ( to move it further back from the critical path) - did that happen ?

xoroshiro128+ is used in the free-running hub-based 64-bit generator (low two bits ignored).
xoroshiro32++ is used in the cog-based 16-bit generator, double-iterated in XORO32 to give a 32-bit output.

Evan's tests prove that xoroshiro++ is much better than xoroshiro+. All the bits in xoroshiro++ can be equally good and there is none of the linear artifacts that are present in xoroshiro+. However, a xoroshiro128++ would have needed another 64-bit adder plus 64-bit pipelining and the decision was made that it was not worth the extra logic.

cgracey · 2018-04-03 01:09

Well, what is the recipe for a xoroshiro 128++, the same kind of thing that we have in the XORO32 Instruction?

TonyB_ · 2018-04-03 01:25

cgracey wrote: »

Well, what is the recipe for a xoroshiro 128++, the same kind of thing that we have in the XORO32 Instruction?

Yes, but only a single iteration obviously.

For a xoroshiro128++ [a,b,c,d], all we could do is use xoroshiro128+ [55,14,36,d] and guess the d value. Seba told me that in theory the best result will be if d is close to half-rotation but not exactly half (32), preferably odd and a prime number, but it is very much theoretical. On this basis, d = 29 or d = 31 might be good.

TonyB_ · 2018-04-03 01:30

deleted

cgracey · 2018-04-03 01:34

Xoroshiro128 aside, we can still speed up XORO32 without any quality consequence, by starting the sum at the initial D inputs, instead of after the first iteration.

Here is the current code for XORO32:

wire [15:0] xoro32z	=   d[31:16] ^ d[15:0];				// first iteration [14,2,7,5]
wire [31:0] xoro32y	= { xoro32z[8:0], xoro32z[15:9],
			    {d[1:0], d[15:2]} ^
			    {xoro32z[13:0], 2'b0} ^ xoro32z };

wire [15:0] xoro32x	=   xoro32y[31:16] ^ xoro32y[15:0];		// second iteration
wire [31:0] xoro32	= { xoro32x[8:0], xoro32x[15:9],		// xoro32 = d result
			    {xoro32y[1:0], xoro32y[15:2]} ^
			    {xoro32x[13:0], 2'b0} ^ xoro32x };

wire [16:0] xoro32a	=   xoro32y[31:16] + xoro32y[15:0];		// first sum       ** CRITICAL PATH, AFTER HUB RAMS **
wire [16:0] xoro32b	=   xoro32[31:16] + xoro32[15:0];		// second sum      ** CRITICAL PATH, AFTER HUB RAMS **

assign xoro32r		= { {xoro32b[10:0], xoro32b[15:11]} + xoro32[15:0],		// xoro32r = prng result, next instruction's s value
			    {xoro32a[10:0], xoro32a[15:11]} + xoro32y[15:0] };		// ** CRITICAL PATH, AFTER HUB RAMS **

I'm working on the change now...

TonyB_ · 2018-04-03 01:48

evanh wrote: »

Actually, I'll run more selections of 32-bit widths first. That's what the Cogs use after all ...

It would be best to test the actual 32-bit subsets used by GETRND D or 8-bit (?) subsets used for dithering.

evanh · 2018-04-03 01:51

Yep, just update XORO32. Far too much to do to solve a xoroshiro128++

TonyB_ · 2018-04-03 02:06

I've just spotted tonight that there is a new post about xoroshiro128+ on Melissa O'Neill's "PCG, A Better Random Number Generator" blog:
http://www.pcg-random.org/posts/xoroshiro-fails-truncated.html

I told her about the xoroshiro++ algorithm on February 22nd.

Random/LFSR on P2

Comments