Shop OBEX P1 Docs P2 Docs Learn Events
Random/LFSR on P2 - Page 41 — Parallax Forums

Random/LFSR on P2

1383941434492

Comments

  • evanhevanh Posts: 15,916
    Here's the progress. It's still on "degree 9" part of the list so the [14 2 7 x]'s are still to come.

    Only first 10 candidates of [5 2 6 x] have max scores but the key thing is there's no distinction at all between them. And, as expected, that is recurring on many others. [14 2 7 x] ain't far away so I'll leave it running for the day ...
  • evanhevanh Posts: 15,916
    Here's where it got to when I stopped it. Not everything is a clean sweep but basically as expected.

    I'm trying a "-tf 2" run now to see if that has any significant impact.

  • evanhevanh Posts: 15,916
    -tf 2 used on the [14 2 7 x] candidates produces an identical score set. That's proof enough to say FPF tests are needed for Xoroshiro++ scoring.

  • Thanks for the latest results, Evan. It's good to know we can get maximum scores in PractRand for 16-bit samples at last, without the FPF tests. xoroshiro32++ [14,2,7,x] is very strong - FPF and it might be interesting to test xoroshiro32+p [14,2,7] - FPF as a comparison.
  • evanhevanh Posts: 15,916
    I think Xoroshiro+(p) [14 2 7] got knocked out by the extended testing, with or without FPF. It was good until I enabled PractRand's "-te 1" option. I'll double check when I get home.

  • evanhevanh Posts: 15,916
    Turns out both -te 1 and -tf 2 are required together to knock out Xoroshiro+ [14 2 7]. And it's only the [15:1] sampling aperture that gets clobbered, everything else looks normal ...
  • evanhevanh Posts: 15,916
    Took some effort to piece these all together.
  • TonyB_TonyB_ Posts: 2,178
    edited 2018-04-01 03:42
    evanh wrote: »
    cgracey wrote: »
    XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.

    Oh! And improving it's timing would help?

    Reordering the result hash to use the initial input state with first iterator output could be done. With the second iterator output only going back to the state.
    cgracey wrote: »
    Yes, let's change XORO32 to cut the time down.

    I think the new XORO32 Verilog might be:
    wire [15:0] xoro32z	= d[31:16] ^ d[15:0];			// first iteration
    wire [31:0] xoro32y	= { xoro32z[8:0], xoro32z[15:9],
    			    {d[1:0], d[15:2]} ^
    			    {xoro32z[13:0], 2'b0} ^ xoro32z };
    
    wire [15:0] xoro32x	= xoro32y[31:16] ^ xoro32y[15:0];	// second iteration
    wire [31:0] xoro32	= { xoro32x[8:0], xoro32x[15:9],	// xoro32 = d result
    			    {xoro32y[1:0], xoro32y[15:2]} ^
    			    {xoro32x[13:0], 2'b0} ^ xoro32x };
    
    wire [15:0] xoro32a	= d[31:16] + d[15:0];			// first sum
    wire [15:0] xoro32b	= xoro32y[31:16] + xoro32y[15:0];	// second sum
    
    assign xoro32r	= { {xoro32b[10:0], xoro32b[15:11]} + xoro32[15:0],
    		    {xoro32a[10:0], xoro32a[15:11]} + xoro32y[15:0] };
    

    Evan, can you check this? I have to stop for the night now.
  • evanhevanh Posts: 15,916
    Here's what I get. I had it stashed away from earlier. The last equation, xoro32r, also differs.
    wire [15:0] xoro32z	= d[31:16] ^ d[15:0];			// first iteration
    wire [31:0] xoro32y	= { xoro32z[8:0], xoro32z[15:9],
    			    {d[1:0], d[15:2]} ^
    			    {xoro32z[13:0], 2'b0} ^ xoro32z };
    
    wire [15:0] xoro32x	= xoro32y[31:16] ^ xoro32y[15:0];	// second iteration
    wire [31:0] xoro32	= { xoro32x[8:0], xoro32x[15:9],	// xoro32 = d result (state store)
    			    {xoro32y[1:0], xoro32y[15:2]} ^
    			    {xoro32x[13:0], 2'b0} ^ xoro32x };
    
    wire [15:0] xoro32a	= d[31:16] + d[15:0];			// hashing sum of input state
    wire [15:0] xoro32b	= xoro32y[31:16] + xoro32y[15:0];	// hashing sum of first iterator state output
    
    assign xoro32r	= { {xoro32b[10:0], xoro32b[15:11]} + xoro32y[15:0],
    		    {xoro32a[10:0], xoro32a[15:11]} + d[15:0] };	// xoro32r = prng result, next instruction's s value
    
  • TonyB_TonyB_ Posts: 2,178
    edited 2018-10-05 19:16
    Evan, your code looks correct.
    xoro32y, xoro32 and xoro32r are different from before.

    I don't know how much time this will save.
  • Any time saving could be doubled by reverting to a single iteration.
  • cgraceycgracey Posts: 14,155
    TonyB_ wrote: »
    Any time saving could be doubled by reverting to a single iteration.

    But we'd only get a 16-bit PRN then.
  • TonyB_TonyB_ Posts: 2,178
    edited 2018-04-01 18:28
    More about the late change to XORO32 here.
    New XORO32 test outputs here.

    Thanks to Evan for remembering this from the C code.
  • cgraceycgracey Posts: 14,155
    TonyB_ wrote: »
    More about the late change to XORO32 here.
    New XORO32 test outputs here.

    Thanks to Evan for remembering this from the C code.

    Thanks. I will make this change first thing tomorrow.
  • garryjgarryj Posts: 337
    edited 2018-04-01 20:15
    The USB demo is running fine on P2v32.

    Edit: Doh!
    Please pretend this is in the Prop2 FPGA files!!! thread :blush:
  • cgraceycgracey Posts: 14,155
    garryj wrote: »
    The USB demo is running fine on P2v32.

    Edit: Doh!
    Please pretend this is in the Prop2 FPGA files!!! thread :blush:

    Super, Garryj! Thanks for reporting.
  • evanhevanh Posts: 15,916
    edited 2018-04-02 23:04
    Hmm, a little unexpectedly, I've managed to get original Xoroshiro128+ to fail PractRand -te1 -tf2 testing at a mere 16 TB using [57:2] sampling. Took about 3 days (NOTE: sampling nomenclature of filenames is width and position rather than most and least):

    BEGIN ... at 2018-03-31 02:29:55

    PractRand scoring candidate [55 14 36] of Xoroshiro128(64)+ random generator.
    PractRand v0.93 options: stdin -multithreaded -te 1 -tf 2 -tlmin 1KB

    Candidate sampling test-s64/testcase-xo-a55b14c36d0w8p0 - PractRand score: 512 KB
    length= 512 kilobytes (2^19 bytes), time= 13.5 seconds
    Test Name Raw Processed Evaluation
    [Low1/8]BRank(18):256(1) R= +2650 p~= 9.8e-799 FAIL !!!!!!!
    [Low4/32]BCFN_FF(2+7):freq R= +7.5 p~= 1e-6 unusual
    ...and 474 test result(s) without anomalies

    Candidate sampling test-s64/testcase-xo-a55b14c36d0w32p0 - PractRand score: 2 MB
    length= 2 megabytes (2^21 bytes), time= 19.5 seconds
    Test Name Raw Processed Evaluation
    [Low1/32]BRank(18):256(1) R= +2650 p~= 9.8e-799 FAIL !!!!!!!
    ...and 584 test result(s) without anomalies

    Candidate sampling test-s64/testcase-xo-a55b14c36d0w64p0 - PractRand score: 4 MB
    length= 4 megabytes (2^22 bytes), time= 22.3 seconds
    Test Name Raw Processed Evaluation
    [Low1/64]BRank(18):256(1) R= +2650 p~= 9.8e-799 FAIL !!!!!!!
    ...and 640 test result(s) without anomalies

    Full width sampling achieves only 4 MB! All BRank fails, perfectly inline with the above sampling widths. That's not pretty at all.


    Candidate sampling test-s64/testcase-xo-a55b14c36d0w48p0 - PractRand score: 32 MB
    length= 32 megabytes (2^25 bytes), time= 31.4 seconds
    Test Name Raw Processed Evaluation
    [Low1/16]BRank(18):768(1) R= +2650 p~= 9.8e-799 FAIL !!!!!!!
    ...and 795 test result(s) without anomalies

    Another BRank, suggests byte aligned bit0 is the dominant factor.


    Candidate sampling test-s64/testcase-xo-a55b14c36d0w8p1 - PractRand score: 64 GB
    length= 64 gigabytes (2^36 bytes), time= 1507 seconds
    Test Name Raw Processed Evaluation
    [Low1/8]BRank(18):12K(1) R=+86692 p~= 0 FAIL !!!!!!!!
    ...and 1433 test result(s) without anomalies

    BRank again, due to byte aligned bit1 I guess. Bit1 is clearly an improvement over bit0.


    Candidate sampling test-s64/testcase-xo-a55b14c36d0w48p1 - PractRand score: 512 GB
    length= 512 gigabytes (2^39 bytes), time= 9767 seconds
    Test Name Raw Processed Evaluation
    DC6-9x1Bytes-1 R= +33.3 p = 2.1e-12 FAIL
    ...and 1622 test result(s) without anomalies

    I'm guessing bit1 is the limiting factor here. It could also be because I'm using sampling widths of multiples of 8 bits which presents byte alignments to PractRand. Odd sized width's probably the next thing to check.


    Candidate sampling test-s64/testcase-xo-a55b14c36d0w56p0 - PractRand score: 16 TB
    length= 16 terabytes (2^44 bytes), time= 263605 seconds
    Test Name Raw Processed Evaluation
    BCFN_FF(2+2):freq R= +9.4 p~= 1e-9 very suspicious
    DC6-6x2Bytes-1 R= +43.6 p = 6.8e-20 FAIL !!
    DC6-5x4Bytes-1 R= +11.6 p = 2.9e-7 very suspicious
    ...and 1856 test result(s) without anomalies

    The BCFN here is presumably due to bit0 included in the sampling data. 56-bit sampling width seems to hide the regularities from PractRand.


    Candidate sampling test-s64/testcase-xo-a55b14c36d0w56p2 - PractRand score: 16 TB
    length= 16 terabytes (2^44 bytes), time= 263652 seconds
    Test Name Raw Processed Evaluation
    DC6-6x2Bytes-1 R= +27.3 p = 1.1e-12 FAIL
    DC6-5x4Bytes-1 R= +6.5 p = 3.6e-4 unusual
    ...and 1857 test result(s) without anomalies

    I'm surprised this didn't get further than the [55:0] score.
  • cgraceycgracey Posts: 14,155
    edited 2018-04-02 23:48
    What do we do? Anything?

    Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.
  • TonyB_TonyB_ Posts: 2,178
    edited 2018-04-03 01:58
    evanh wrote: »
    Hmm, a little unexpectedly, I've managed to get original Xoroshiro128+ to fail PractRand -te1 -tf2 testing at a mere 16 TB using [57:2] sampling. Took about 3 days ...
    cgracey wrote: »
    What do we do? Anything?

    Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.

    There was a change made on March 12th:
    cgracey wrote: »
    evanh wrote: »
    In the last couple of days I've been running lots of tests without the FPF tests in PractRand (After looking around PractRand's sources for something else I worked out how to remove selected tests). As a result I've been back testing with original Xoroshiro+ algorithm and the byte1 [8:1] scores are comparatively even less pretty now. Although, I don't know how much removing FPF tests has nerf'd PractRand's testing.

    Chip,
    On the free running Xoroshiro128+, maybe the taps should be adjusted to no longer use bit1 of the summing output, ie: restricted to selections from [63:2].

    Done.

    Ignoring bits 1 and 0 and using subsets of the remaining 62 for the cogs and other stuff is the best we can do with xoroshiro128+ and the results will be good enough for the intended purposes. Probably more than good enough.
  • jmgjmg Posts: 15,173
    cgracey wrote: »
    What do we do? Anything?

    Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.

    I'm confused too, but I think that post applies only to Xoroshiro128+ - which is not in P2 ?

    There was some talk of speeding up XORO32 a few posts above ( to move it further back from the critical path) - did that happen ?

  • evanhevanh Posts: 15,916
    edited 2018-04-03 00:41
    No, no changes. Maybe something for a future Prop3.

    That's just me comparing the original Xoroshriro128+ (not even scratching a tiny fraction of full period scoring), which is only used in the free running GETRND, with how far we've come with the XORO32 instruction (close to full period scoring).

    PS: I'll might try a more oddball sampling width - not multiples of 8. That should reach far better scores. Probably take weeks or even months to finish.
  • evanhevanh Posts: 15,916
    Actually, I'll run more selections of 32-bit widths first. That's what the Cogs use after all ...

  • TonyB_TonyB_ Posts: 2,178
    edited 2018-04-03 02:22
    jmg wrote: »
    cgracey wrote: »
    What do we do? Anything?

    Didn't we change from the original recipe, at some point? I remember we have more adders in there than we used to. We could go back to the original. Just better do it in the next 12 hours. Today is the last day for a new Verilog drop, according to our project schedule.

    I'm confused too, but I think that post applies only to Xoroshiro128+ - which is not in P2 ?

    There was some talk of speeding up XORO32 a few posts above ( to move it further back from the critical path) - did that happen ?

    xoroshiro128+ is used in the free-running hub-based 64-bit generator (low two bits ignored).
    xoroshiro32++ is used in the cog-based 16-bit generator, double-iterated in XORO32 to give a 32-bit output.

    Evan's tests prove that xoroshiro++ is much better than xoroshiro+. All the bits in xoroshiro++ can be equally good and there is none of the linear artifacts that are present in xoroshiro+. However, a xoroshiro128++ would have needed another 64-bit adder plus 64-bit pipelining and the decision was made that it was not worth the extra logic.
  • cgraceycgracey Posts: 14,155
    Well, what is the recipe for a xoroshiro 128++, the same kind of thing that we have in the XORO32 Instruction?
  • TonyB_TonyB_ Posts: 2,178
    edited 2018-04-03 01:56
    cgracey wrote: »
    Well, what is the recipe for a xoroshiro 128++, the same kind of thing that we have in the XORO32 Instruction?

    Yes, but only a single iteration obviously.

    For a xoroshiro128++ [a,b,c,d], all we could do is use xoroshiro128+ [55,14,36,d] and guess the d value. Seba told me that in theory the best result will be if d is close to half-rotation but not exactly half (32), preferably odd and a prime number, but it is very much theoretical. On this basis, d = 29 or d = 31 might be good.
  • TonyB_TonyB_ Posts: 2,178
    edited 2018-04-03 01:52
    deleted
  • cgraceycgracey Posts: 14,155
    Xoroshiro128 aside, we can still speed up XORO32 without any quality consequence, by starting the sum at the initial D inputs, instead of after the first iteration.

    Here is the current code for XORO32:
    wire [15:0] xoro32z	=   d[31:16] ^ d[15:0];				// first iteration [14,2,7,5]
    wire [31:0] xoro32y	= { xoro32z[8:0], xoro32z[15:9],
    			    {d[1:0], d[15:2]} ^
    			    {xoro32z[13:0], 2'b0} ^ xoro32z };
    
    wire [15:0] xoro32x	=   xoro32y[31:16] ^ xoro32y[15:0];		// second iteration
    wire [31:0] xoro32	= { xoro32x[8:0], xoro32x[15:9],		// xoro32 = d result
    			    {xoro32y[1:0], xoro32y[15:2]} ^
    			    {xoro32x[13:0], 2'b0} ^ xoro32x };
    
    wire [16:0] xoro32a	=   xoro32y[31:16] + xoro32y[15:0];		// first sum       ** CRITICAL PATH, AFTER HUB RAMS **
    wire [16:0] xoro32b	=   xoro32[31:16] + xoro32[15:0];		// second sum      ** CRITICAL PATH, AFTER HUB RAMS **
    
    assign xoro32r		= { {xoro32b[10:0], xoro32b[15:11]} + xoro32[15:0],		// xoro32r = prng result, next instruction's s value
    			    {xoro32a[10:0], xoro32a[15:11]} + xoro32y[15:0] };		// ** CRITICAL PATH, AFTER HUB RAMS **
    

    I'm working on the change now...
  • evanh wrote: »
    Actually, I'll run more selections of 32-bit widths first. That's what the Cogs use after all ...

    It would be best to test the actual 32-bit subsets used by GETRND D or 8-bit (?) subsets used for dithering.
  • evanhevanh Posts: 15,916
    Yep, just update XORO32. Far too much to do to solve a xoroshiro128++

  • TonyB_TonyB_ Posts: 2,178
    edited 2018-04-03 02:07
    I've just spotted tonight that there is a new post about xoroshiro128+ on Melissa O'Neill's "PCG, A Better Random Number Generator" blog:
    http://www.pcg-random.org/posts/xoroshiro-fails-truncated.html

    I told her about the xoroshiro++ algorithm on February 22nd.
Sign In or Register to comment.