Thanks, Evan. Are the results repeated exactly if you do a second lot of battery tests, continuing from where the first one left off?
It only continues between tests within the battery because the executable doesn't exit. A repeat run of the executable starts from a fresh generator state each time. So, yes, a repeat run produces exactly identical results.
Reversing the bit order produces fewer failures but in more categories (LinearComp added to SerialOver and MaxOft).
I've had it Crush gridding for the last few hours and it's notable that there is a sort of randomness to the types of failures. The p-values of these failures are nowhere near as pronounced as what PractRand fails on; PractRand would be issuing "unusual" and "suspicious" for most of the values I'm seeing. And PractRand does have lots of exactly that in pretty much all of its reports.
In other words, I'm thinking the TestU01 failure threshold is set way too sensitive.
I don't know what to suggest, apart from testing Chris's generator next. Is there any correlation between PractRand scores and the number of TestU01 failures for [14,2,7,5] with the various lsb's?
You dug up some xoroshiro shift constants for me, so I am following up on a statement I made in the PRNG group about an xoroshiro variant that I feel is very promising (and fast).
I finished all of the ground-level work, and more. Have a look if you get a chance and let me know what you think. https://gist.github.com/XoroshiroNOT
Hi ChrisR,
If I can butt in ... For speed, the original xoroshiro uses local s0, s1 variables. By referencing the global state as little as possible, this clears the way for the optimiser to use any tricks it likes on the intermediate steps. Also, I'd move the "const_odd" subtraction up to the s[2] assignment line.
The other thing is I'm not keen on "const" declarations for non-constants. I know it was done in the original xoroshiro but I simply don't know the justification for it. Lying to the optimiser is not wise.
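For illustration, here's a minimal sketch of the local-copy pattern being suggested, written as a Xoroshiro32(16)++ step using candidate [14 2 7 5]. The mapping of the candidate tuple onto the engine constants and output rotation, and the helper names, are assumptions for the sketch only, not code from the gist:

```c
#include <stdint.h>

static uint16_t s[2];   /* 32 bits of generator state */

static inline uint16_t rotl16(uint16_t x, int k) {
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* One Xoroshiro32(16)++ step, assuming [14 2 7 5] maps to engine
   constants a=14, b=2, c=7 and '++' output rotation d=5. */
uint16_t next16(void) {
    uint16_t s0 = s[0];                      /* globals read once into locals */
    uint16_t s1 = s[1];
    uint16_t result = (uint16_t)(rotl16((uint16_t)(s0 + s1), 5) + s0);

    s1 ^= s0;
    s[0] = (uint16_t)(rotl16(s0, 14) ^ s1 ^ (uint16_t)(s1 << 2));
    s[1] = rotl16(s1, 7);                    /* globals written once at the end */
    return result;
}
```

Between the single read and the single write of s[], the optimiser is free to keep everything in registers.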
Is there any correlation between PractRand scores and the number of TestU01 failures for [14,2,7,5] with the various lsb's?
I think I need to filter them better. There are far too many minor fails. That's something to read up on, to see if there are controls in the test suite I can adjust ...
I included 'const' only for the aesthetic of being consistent with the original xoroshiro. Using Compiler Explorer, I did not see any effect on the ASM emitted by GCC -O3 or MSVC /Ox, but I am happy to remove them.
As for the other recommendations, they seem logical (first thing I tried), but in my testing:
Seemingly interfered with saturation and intentional steering of the Intel CPU pipeline under various loads by undoing the latent load of the S[2] state from L1/L2 cache, etc.
Did not result in fewer global references emitted by GCC or MSVC (as the compilers are smart enough to see what I am trying to do, and yet stay out of my way when I need them to).
Resulted in ASM that penalized either GCC or MSVC without providing a clear performance benefit to the other.
Moving the const_odd to the subtract line in the 2-D code over-saturates the pipeline on my CPU (too many back-to-back similar parallel operations taxing the same micro-ops), but it does benefit the 1-D code because the s0 ^ s1 is deferred.
The end result is code nearly as fast as xoroshiro** or ++ when I tested on a Xeon W3690 using 3 different benchmarks (with 'inline' and under both single and multi-threaded loads).
However, if you or anyone can demonstrate that any proposed code changes actually result in more than 5% better performance over the current code on more modern CPUs under both MSVC and GCC, I'm happy to make changes that will improve code readability.
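For concreteness, a minimal sketch of the sort of single-threaded throughput benchmark being described; the generator name and iteration count are illustrative, and the three actual benchmarks used were not posted:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

extern uint16_t next16(void);        /* generator step under test (assumed name) */

int main(void) {
    const uint64_t N = 1ULL << 30;   /* outputs per timed run */
    uint64_t sum = 0;

    clock_t t0 = clock();
    for (uint64_t i = 0; i < N; i++)
        sum += next16();             /* accumulate so the calls can't be optimised away */
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("%.1f M outputs/s (checksum %llu)\n",
           (double)N / secs / 1e6, (unsigned long long)sum);
    return 0;
}
```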
No worries. You've obviously got some high-powered pipeline analysis tools at hand. A separate approach is to declare the state store as static. It probably won't make any difference though.
Dynamically allocating the state, for re-entrant use, might be something else to measure with those fancy tools. I noticed this is supported in the TestU01 suite.
My benchmark code has both 'static' state and 'inline', but it got wiped from the gist because I copied/pasted code from Compiler Explorer (https://godbolt.org/) which does not seem to support those... not sure why. I'll add those back in (presuming they won't cause anyone trouble).
Not sure about dynamic state allocation, so I'll have to look into it... Thanks!
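As an aside, TestU01 can also be driven directly from a generator function instead of a pipe; a minimal sketch, assuming a next32() routine that returns 32 random bits per call (the routine name is illustrative):

```c
#include <stdint.h>
#include "unif01.h"
#include "bbattery.h"

extern uint32_t next32(void);        /* generator under test (assumed name) */

static unsigned int gen_bits(void) {
    return (unsigned int) next32();
}

int main(void) {
    /* Wrap the raw function as a TestU01 generator object and run Crush on it. */
    unif01_Gen *gen = unif01_CreateExternGenBits("scro-rarns32", gen_bits);
    bbattery_Crush(gen);
    unif01_DeleteExternGenBits(gen);
    return 0;
}
```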
Gridding config for this generator (I've named it "scro-rarns") is test cases at all 32 sampling aperture offsets, by aperture sizes of 32, 28, 24, 20, 16, then even sizes down to 2. Just started size 16.
Offset 0 of the first four sizes has been clean, but I've got 31 of the first 128 cases showing the same random-looking pattern of Crush fails that occurs with Xoroshiro32++.
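For anyone following the notation: "size>>offset" here reads as a sampling window of "size" bits taken from the 32-bit output starting at bit "offset". A sketch of one plausible extraction, assuming the window wraps around the word when offset + size exceeds 32 (the wrap handling is an assumption, not confirmed above):

```c
#include <stdint.h>

/* Extract a sampling aperture of `size` bits starting at bit `offset`
   from a 32-bit output word, wrapping around from bit 31 back to bit 0. */
static inline uint32_t aperture(uint32_t x, unsigned size, unsigned offset) {
    uint32_t rotated = (x >> offset) | (x << ((32u - offset) & 31u));
    return (size >= 32u) ? rotated : (rotated & ((1u << size) - 1u));
}

/* Example: aperture 16>>8 keeps bits 8..23 of the word. */
```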
As a headline comparison, I'm interested in how XORO32 compares with Chris's (scro-rarns) generator, for full 32-bit output with no rotation (lsb is bit 0). The latter passes Crush but I'm not sure about [13,5,10,9] as the files show "bit-order reversed".
You've got a point about the bit-order reversal though; I shifted that function out of the piped build of the Crush testing to gain some speed. It is present in the Xoroshiro generator build but not in the Scro-rarns generator build. I guess Scro-rarns needs to be gridded again with the reversal code added ...
EDIT: I tell a lie. Bit-order reversal is there and enabled in the Scro-rarns generator build. The reason it isn't stated in the reports is that they are produced by the piped Crush test executable. So the reports only record what the Crush testing is doing, not what the generator is doing.
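The bit-order reversal under discussion is the usual full 32-bit reversal applied to each output word before it reaches the battery. A typical implementation, shown only to illustrate the operation (not the actual build's code):

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word: bit 0 <-> bit 31, bit 1 <-> bit 30, ... */
static inline uint32_t reverse32(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);   /* swap adjacent bits */
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);   /* swap bit pairs     */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);   /* swap nibbles       */
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);   /* swap bytes         */
    return (x << 16) | (x >> 16);                               /* swap half-words    */
}
```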
Ok, I just eyeballed the end of each of the 384 gridded report files for Scro-rarns32 to see what the worst Crush fail came to. Turns out there's nothing worse than an e-7 p-value.
There are definitely some eps's occurring with Xoroshiro. So, I guess, ignoring the reported fails, there is a clear measured difference.
Looking more closely at the Crush reports for the Xoroshiro grids, there is actually only one test with one setting that is producing eps p-values. The test is SerialOver with parameter t = 2. That one stands by itself; all other test fails seem to be in the erratic low-level ballpark.
BTW: That makes the Crush reports for aperture 16>>0 of candidates [13 5 8 10] and [7 2 14 10] clean reports.
Aperture 16>>0 for [14 2 7 5] seems to be good in most of my tests, but there is one arrangement producing e-12 that I'd class as VERY SUSPICIOUS (Practrand naming). It was my experimental candidate so, until I get back to retesting it, I'm not completely sure what the various arrangements were.
Aperture 16>>0 for [14 2 7 6] produces an eps in all the arrangements so it's out.
Hmm, this is good: there are no eps's for any aperture size 16 of candidate [3 2 6 9]. The worst is two e-11's:
========= Summary results of Crush =========
Version: TestU01 1.2.3
Generator: Xorshiro32(16)++, candidate [3 2 6 9]
- single iterated, bit-order reversed, sampling aperture 16>>8
Number of statistics: 144
Total CPU time: 00:23:24.75
The following tests gave p-values outside [0.001, 0.9990]:
(eps means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):
Test p-value
----------------------------------------------
1 SerialOver, t = 2 1 - 2.5e-11
8 CollisionOver, t = 8 3.0e-6
----------------------------------------------
All other tests were passed
========= Summary results of Crush =========
Version: TestU01 1.2.3
Generator: Xorshiro32(16)++, candidate [3 2 6 9]
- single iterated, bit-order reversed, sampling aperture 16>>14
Number of statistics: 144
Total CPU time: 00:24:32.19
The following tests gave p-values outside [0.001, 0.9990]:
(eps means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):
Test p-value
----------------------------------------------
1 SerialOver, t = 2 1 - 2.5e-11
----------------------------------------------
All other tests were passed
Worst fails for size 16 sampling aperture of [6 2 11 3] are:
Generator: Xorshiro32(16)++, candidate [6 2 11 3]
- single iterated, bit-order reversed, sampling aperture 16>>6
1 SerialOver, t = 2 1 - 3.6e-11
41 MaxOft, t = 5 1 - 8.7e-5
42 MaxOft, t = 10 1 - 1.7e-11
- single iterated, bit-order reversed, sampling aperture 16>>11
1 SerialOver, t = 2 4.0e-11
42 MaxOft, t = 10 0.9999
43 MaxOft, t = 20 0.9996
50 AppearanceSpacings, r = 20 2.1e-4
Note there is also a MaxOft in that. Which means that, although SerialOver is predominant, it is not the only test getting poor results with the Xoroshiro++ algorithm.
Oh, and e-12 for [14 2 7 5] 16>>0 is correct. And it's not the worst size 16 fail either. 6 of the 16 cases have worse fails. So, not the greatest outcome there.
Generator: Xorshiro32(16)++, candidate [14 2 7 5]
- single iterated, bit-order reversed, sampling aperture 16>>0
1 SerialOver, t = 2 1 - 1.3e-12
41 MaxOft, t = 5 0.9998
71 LinearComp, r = 0 0.9991
- single iterated, bit-order reversed, sampling aperture 16>>1
1 SerialOver, t = 2 1 - eps1
- single iterated, bit-order reversed, sampling aperture 16>>2
1 SerialOver, t = 2 1 - eps1
41 MaxOft, t = 5 0.9996
67 RandomWalk1 R (L = 1000) 1.8e-4
- single iterated, bit-order reversed, sampling aperture 16>>7
1 SerialOver, t = 2 7.0e-15
42 MaxOft, t = 10 1 - 1.9e-5
44 MaxOft, t = 30 1 - 3.6e-5
- single iterated, bit-order reversed, sampling aperture 16>>9
1 SerialOver, t = 2 1 - 3.5e-13
- single iterated, bit-order reversed, sampling aperture 16>>11
1 SerialOver, t = 2 1 - eps1
- single iterated, bit-order reversed, sampling aperture 16>>13
1 SerialOver, t = 2 1 - eps1
41 MaxOft, t = 5 1 - 7.6e-5
42 MaxOft, t = 10 0.9997
Started some BigCrush testing of scro-rarns32. 100% hard fails on the Gap test for the first 8 cases. I presume Gap is the test that most easily detects the period repeating. Here's aperture 32>>0 with bit-order reversed:
========= Summary results of BigCrush =========
Version: TestU01 1.2.3
Generator: stdin32
Number of statistics: 160
Total CPU time: 03:03:24.71
The following tests gave p-values outside [0.001, 0.9990]:
(eps means a value < 1.0e-300):
(eps1 means a value < 1.0e-15):
Test p-value
----------------------------------------------
33 CouponCollector, r = 27 0.9998
34 Gap, r = 0 5.4e-15
35 Gap, r = 25 eps
36 Gap, r = 0 eps
37 Gap, r = 20 eps
65 SumCollector 6.3e-5
----------------------------------------------
All other tests were passed
Test 34 fluctuates between e-XX and eps but other than that all 8 reports are much the same.
Thanks for all the results, Evan. I've emailed all the Crush reports I have from Seba. In his BigCrush tests, [14,2,7,5] and [14,2,7,6] failed on SerialOver, Gap, MaxOft and SumCollector. No SerialOver or MaxOft failures above with scro-rarns32, but there is a CouponCollector - is the normal bit order better, worse or the same?
When comparing [14,2,7,5] and [13,5,10,9], do all the apertures not listed pass all tests? And are they bit-reversed or do you always print that?
I've emailed all the Crush reports I have from Seba.
Looking at those I just noticed the generator state is emitted in his Crush reports. My reports don't have that because I'm not using TestU01's dynamic allocation facilities. That tells me I'll probably never get identical reports unless I change my integration with TestU01.
In his BigCrush tests [14,2,7,5] and [14,2,7,6] failed on SerialOver, Gap, MaxOft and SumCollector. No SerialOver and MaxOft failures above with scro-rarns32 but there is a CouponCollector - is the normal bit order better or worse.
From general inspection of mid-Crush reports, I'd say the reversed bit order seems to give slightly more severe fails.
When comparing [14,2,7,5] and [13,5,10,9], did all the apertures not listed pass all tests? And are they bit-reversed or do you always print that?
Well, "pass" is relative in my understanding these days. I've started treating anything without an "eps/eps1" as a pass. Which is similar e value to Practrand pass/fail threshold.
Put it this way, they didn't have worse fails. Very few reports are actually "All tests were passed".
For Xoroshiro Crush testing, the config is emitted in the reports. Bit-order reversal is listed for each case. For Scro-rarns32, the config is not in the report text.
In both situations though, the run scripts encode the config into the report names so I know which is which.
BTW: That makes the Crush reports for aperture 16>>0 of candidates [13 5 8 10] and [7 2 14 10] clean reports.
Aperture 16>>0 for [14 2 7 5] seems to be good in most of my tests, but there is one arrangement producing e-12 that I'd class as VERY SUSPICIOUS (Practrand naming). It was my experimental candidate so, until I get back to retesting it, I'm not completely sure what the various arrangements were.
Aperture 16>>0 for [14 2 7 6] produces an eps in all the arrangements so it's out.
Long holiday away from PRNGs now over.
Evan, which of [13,5,10,9] and [13,5,8,10] are better overall in PractRand and Crush tests?
[13 5 8 10] seems the better; it was already visible in its good PractRand scores before we used the distribution scores. I haven't done any exhaustive comparing of the Crushes. You've got the reports.
Comments
In other words, I'm thinking TestU01 failure threshold is set way too sensitive.
I guess part of the problem is the fixed length test ...
The other thing is I'm not keen on "const" declarations for non-constants. I know it was done in the original xoroshiro but I simply don't know the justification for it. Lying to the optimiser is not wise.
Consider it done.
Offset 0 of the first four sizes has been clean but I've got 31 of first 128 cases with the same looking randomness in the Crush fails that is occurring with Xoroshiro32++.
Attached are 160 Crush reports.
As a headline comparison, I'm interested in how XORO32 compares with Chris's (scro-rarns) generator, for full 32-bit output with no rotation (lsb is bit 0). The latter passes Crush but I'm not sure about [13,5,10,9] as the files show "bit-order reversed".
I wonder whether scro-rarns passes BigCrush.
My understanding is it's not possible to pass BigCrush with a 32-bit state size: BigCrush draws roughly 2^38 values, so a generator whose period is at most 2^32 cycles dozens of times within the battery, which is exactly the sort of thing the Gap failures above are picking up.