Shop OBEX P1 Docs P2 Docs Learn Events
Random/LFSR on P2 - Page 85 — Parallax Forums

Random/LFSR on P2

1828385878892

Comments

  • evanhevanh Posts: 16,028
    Chris/Tony,
    I had commented out the older result collating code that we were using for Tony to process. Re-enabling that, here's them and associated CSV files that might be handy if you want to do your own distribution stats.
  • xoroshironotxoroshironot Posts: 309
    edited 2020-01-16 05:01
    The pfChi is perfect... it should be about 12.
    The zfChi is only a little low... it should be about 20.
    The nzfChi is only a little low... it should be about 44.

    Edit: Each perfect value is approximately the total number of > 0 freqs summed, with each individual value that created the freqsum being about 1 (non-linear, but usually in the range 0.05 to 5).
    #a	b	c	d	pfreq0	pfreq1	pfreq2	pfreq3	pfreq4	pfreq5	pfreq6	pfreq7	pfreq8	pfreq9	pfreq10	pfreq11	pfreq12	pfreq13	total
    #normal				~1	~1	~1	~1	~1	~1	~1	~1	~1	~1	~1	~1	~1	~1	~14
    
    The above is an approximation, as are the following: as each '~1' is most likely to be in the range of 0.15 to 6, and the sum '~14' is likely to be in the range of 4 to 30.
    With only a single trial, high freq values near the end may carry excess weight, so stopping before actual '0's appear can minimize bias, which would eventually dissipate with the addition of more 'randomly seeded trials'* of the same PRNG.
    TonyB_ wrote: »
    Bear in mind that different initial acc values will affect how closely the distributions match the ideal ones, therefore more than one initial acc should be tested. Other [a,b,c,d] will be better than [13,5,10,9] for init_acc = X, but could [13,5,10,9] do just as well if not better when init_acc = Y?

    Do pfreq/zfreq/nzfreq ever repeat for different init_acc or is there always one best value? If the latter, do Chi-square scores fall away smoothly either side of this peak init_acc? Probably the only practicable way to test this is to look at xoroshiro16++ with all possible 256 acc1 values.
    evanh wrote: »
    So just completed full distribution run of ++1a. CSV file attached
    This latest run of ++1a supports everything above.
    From the file, look at the freq Chi^2 sum averages across all candidates: pfChiAvg=12, zfChiAvg=21, nzfChiAvg=43
    These averages match the values I predicted very closely (12, 20 and 44).
    Also from the file, the min and max values for pfChi are 2 and 34, respectively, which also match the values I predicted very closely (4 and 30), but are slightly more extreme due to the large number of values.
    Based on that, the numbers in the ++1a file are (log?) normally distributed and exactly what would be found if a perfect PRNG was run 1344 times with a different set of seeds each time.
    To actually distinguish between them (and eliminate most of the skew of the resulting data), it would require running freqs on each of the 1344 PRNGs at least 12-16 times (30 would be best) using completely independent sets of seeds for states 0,1 and 2 each time. If you pick the seeds ahead of time for use in all candidates, that is fine.
    The winning candidates would be those closest to having an overal average of about 12, 20 and 44... depending on the max x in freq(x) used to create the Chi^2 sums.
    Due to the nature of this ++1a PRNG design, you have a distinct possibility that a majority of candidates are acceptably random, but there will be a subset more obviously closer to theoretical values I listed. Anything within 10-15% would be expected and acceptable.
    Tony would likely prefer to use the raw freq data for each trial, which allows for a more exact calculation (as he pointed out my math was sloppy) which would then reduce the list of candidates to a few dozen... with [13 5 10 9] likely not near the top of the list (from previous testing only about 6 trials were required to prove it was not truly random).
  • evanhevanh Posts: 16,028
    To actually distinguish between them (and eliminate most of the skew of the resulting data), it would require running freqs on each of the 1344 PRNGs at least 12-16 times (30 would be best) using completely independent sets of seeds for states 0,1 and 2 each time. If you pick the seeds ahead of time for use in all candidates, that is fine.
    The winning candidates would be those closest to having an overal average of about 12, 20 and 44... depending on the max x in freq(x) used to create the Chi^2 sums.
    So you want this full set distribution run repeated several times with each new full set using another seed in the accumulator, right? Successive seeds of 1,2,3,4,5,6,7,... would be fine I assume.

    BTW, This is all still single iterated. I've not tried engaging double iterating.
  • TonyB_TonyB_ Posts: 2,195
    edited 2020-01-16 11:02
    evanh wrote: »
    To actually distinguish between them (and eliminate most of the skew of the resulting data), it would require running freqs on each of the 1344 PRNGs at least 12-16 times (30 would be best) using completely independent sets of seeds for states 0,1 and 2 each time. If you pick the seeds ahead of time for use in all candidates, that is fine.
    The winning candidates would be those closest to having an overal average of about 12, 20 and 44... depending on the max x in freq(x) used to create the Chi^2 sums.
    So you want this full set distribution run repeated several times with each new full set using another seed in the accumulator, right? Successive seeds of 1,2,3,4,5,6,7,... would be fine I assume.

    BTW, This is all still single iterated. I've not tried engaging double iterating.

    Evan, if you are using the previous p/z/nzfreq C program with acc code added as I suggested, how can it be single iterated? There will be 2^32-1 double iterations, each producing a 32-bit output. If it were equidistributed (which it is NOT), each non-zero output would occur once and zero never. I assume you're calculating Chi-Square by comparing with the ideal binomial output (N/e outputs never occur, N/e occur once and the rest more than once).

    The new acc increases effective state bits by 50% to state + top half of acc (acc1). It would be good if we knew how changing initial acc1 but keeping state the same affects p/z/nzfChi and p/z/nzRank. (Different initial states could be tested later.)

    Successive seeds of 1,2,3,4,5,6,7 would be OK for acc1, but not acc as a whole. As it would take a very long time to fully test xoroshiro32++1a, I suggest testing xoroshiro16(8)++1a with every possible initial acc = 0000, 0100, 0200, ..., FD00, FE00, FF00. That's 256 separate acc seeds for 26 [a,b,c] * 8 = 208 [a,b,c,d] which is more than 50000 separate tests, however each one will have only 2^16-1 double iterations and require only 64KB for the byte array.
  • TonyB_TonyB_ Posts: 2,195
    edited 2020-01-16 11:29
    Brute forcing this is the only way to understand what is going on.

    There could be separate files for every xoroshiro16(8)++1a [a,b,c,d] with 256 lines for each acc1, or one file for each [a,b,c] with 2048 lines for each d = 0-7 and each acc1.

    Binomial distributions for N = 2^16 samples:
    pfreq(x)
    
     x	Expected count		Nearest dec	Nearest hex
    
     0	N/0!e  =  24109.347	24109		5E2D
     1	N/1!e  =  24109.347	24109		5E2D
     2	N/2!e  =  12054.674	12055		2F17
     3	N/3!e  =   4018.225	 4018		0FB2
     4	N/4!e  =   1004.556	 1005		03ED
     5	N/5!e  =    200.911	  201		00C9
     6	N/6!e  =     33.485	   33		0021
     7	N/7!e  =      4.784	    5		0005
     8	N/8!e  =       0.598	    1		0001
     9	N/9!e  =       0.0664	    0		0000
    
    zfreq(x)
    
     x				Nearest dec	Nearest hex
    
     1				 9634		25A2
     2				 3544		0DD8
     3				 1304  		0518
     4				  480  		01E0
     5				  177  		00B1
     6				   65   	0041
     7				   24  		0018
     8				    9   	0009
     9				    3 		0003
    10				    1  		0001
    11				    0  		0000
    
    nzfreq(x)
    
     x				Nearest dec	Nearest hex
    
     1				 5606		15E6
     2				 3544		0DD8
     3				 2240		08C0
     4				 1416		0588
     5				  895		037F
     6				  566		0236
     7				  358		0166
     8				  226		00E2
     9				  143		008F
    10				   90		005A
    11				   57		0039
    12				   36		0024
    13				   23		0017
    14				   14		000E
    15				    9		0009
    16				    6		0006
    17				    4		0004
    18				    2		0002
    19				    1		0001
    20				    1		0001
    21				    1		0001
    22				    0		0000
    

    Floating-point would be more accurate than nearest integer, though.
  • xoroshironotxoroshironot Posts: 309
    edited 2020-01-16 12:05
    TonyB_ wrote: »
    Evan, if you are using the previous p/z/nzfreq C program with acc code added as I suggested
    I assume you mean the base double-iterated code you recently posted, with e=1 only in the first iteration (no rotation on the second)?
    acc0 := acc1 + prn0 + e
    acc1 := acc0 + prn1
    
    TonyB_ wrote: »
    It would be good if we knew how changing initial acc1 but keeping state the same affects p/z/nzfChi and p/z/nzRank. (Different initial states could be tested later.)
    I agree, but it runs more of a risk of skewing the results, unless perhaps that the acc1 seeds were randomly chosen to increase the chance that they are maximally de-correlated from each other.
    TonyB_ wrote: »
    I suggest testing xoroshiro16(8)++1a with every possible initial acc = 0000, 0100, 0200, ..., FD00, FE00, FF00. That's 256 separate acc seeds for 26 [a,b,c] * 8 = 208 [a,b,c,d] which is more than 50000 separate tests, however each one will have only 2^16-1 double iterations and require only 64KB for the byte array.
    That would certainly answer the question of whether or not skew might play a part when sequential acc1 seeds are used (which is normally a bad idea in actual usage).
    I think the freq code Evan is using would be close to being able to handle the 16-bit outputs of of a double-iterated xoroshiro16(8)++1a.
  • TonyB_TonyB_ Posts: 2,195
    edited 2020-01-16 12:04
    I assume you mean the base double-iterated code you recently posted, with e=1 only in the first iteration (no rotation on the second)?
    acc0 := acc1 + prn0 + e
    acc1 := acc0 + prn1
    

    Yes, no rotation of prn1 just yet, for simplicity, not because I think it's not worthwhile.

    Testing xoroshiro16(8)++1a with every possible initial acc1 should tell us a lot more than xoroshiro32(16)++1a with a few randomly chosen acc1.
  • evanh wrote: »

    Not right. Adding acc to your working C prog (post unknown) is the way to do it..

  • xoroshironotxoroshironot Posts: 309
    edited 2020-01-16 13:19
    Does this seem correct?
    rngword_t nextrnd( void )
    {
    	rngword_t  s0 = s[0];
    	rngword_t  s1 = s[1];
    
    #if (ACCUM_SIZE == 16) || (ACCUM_SIZE == 32) || (ACCUM_SIZE == 64)
    	static int e = CONSTANT_E; // CONSTANT_E is 1
    	static int f = CONSTANT_F; // CONSTANT_F is 0, for now
    #if (TOGGLE_ROTATE == 1) // TOGGLE_ROTATE is 0, for now, 1 later 
    	f ^= CONSTANT_f;
    #endif	
    	s[2] += rotl(rotl(s0 + s1, CONSTANT_D) + s0, f) + e; // f=0, for now
    #if (TOGGLE_SUM == 1) // TOGGLE_SUM is 1 to double-iterate
    	e ^= CONSTANT_E;
    #endif
    
    #else
    .
    .
    .
    
    Or just declare acc[0] and acc[1], which would make double-iterated more readable?
  • Does this seem correct?
    rngword_t nextrnd( void )
    {
    	rngword_t  s0 = s[0];
    	rngword_t  s1 = s[1];
    
    #if (ACCUM_SIZE == 16) || (ACCUM_SIZE == 32) || (ACCUM_SIZE == 64)
    	static int e = CONSTANT_E; // CONSTANT_E is 1
    	static int f = CONSTANT_F; // CONSTANT_F is 0, for now
    #if (TOGGLE_ROTATE == 1) // TOGGLE_ROTATE is 0, for now, 1 later 
    	f ^= CONSTANT_f;
    #endif	
    	s[2] += rotl(rotl(s0 + s1, CONSTANT_D) + s0, f) + e; // f=0, for now
    #if (TOGGLE_SUM == 1) // TOGGLE_SUM is 1 to double-iterate
    	e ^= CONSTANT_E;
    #endif
    
    #else
    .
    .
    .
    
    Or just declare acc[0] and acc[1], which would make double-iterated more readable?

    Doing a double iteration using acc[0] and acc[1] would be better than the confusing s[2], I think. The +1 added to s[2] every iteration was the error.
  • evanhevanh Posts: 16,028
    What I've got is based off of https://forums.parallax.com/discussion/comment/1486680/#Comment_1486680 but with a +s0 appended. I'm not doing any code toggling since that constitutes extra state bits.
  • evanhevanh Posts: 16,028
    TonyB_ wrote: »
    Adding acc to your working C prog (post unknown) is the way to do it..
    Since it's single iterated, that makes no diff.
  • evanh wrote: »
    What I've got is based off of https://forums.parallax.com/discussion/comment/1486680/#Comment_1486680 but with a +s0 appended. I'm not doing any code toggling since that constitutes extra state bits.
    That code for 32-bit variables has no 32-bit equidistribution when translated to 16-bit variables, as written (but does include the comment on how to do so).
  • evanhevanh Posts: 16,028
    edited 2020-01-16 19:43
    Do tell. EDIT: Hang-on. Chris, haven't you just been saying how well the numbers have panned out in the most recent run? I don't get what's wrong here.
  • xoroshironotxoroshironot Posts: 309
    edited 2020-01-16 19:52
    evanh wrote: »
    Do tell. EDIT: Hang-on. Chris, haven't you just been saying how well the numbers have panned out in the most recent run? I don't get what's wrong here.
    The only issue is with the equidistribution at full period... the numbers will look good either way at a fraction of the entire period (2^48-2^16, or 2^49-2^17 for toggling +1), but might change slightly, which is what Tony is worried about. The PractRand results will likely come out better also.
    You could combine Tony's code above using ACC with my TOGGLE_SUM/CONSTANT_E/e in the code I posted to make things more readable and still configurable, perhaps.
  • TonyB_TonyB_ Posts: 2,195
    edited 2020-01-16 21:34
    evanh wrote: »
    Do tell. EDIT: Hang-on. Chris, haven't you just been saying how well the numbers have panned out in the most recent run? I don't get what's wrong here.
    The only issue is with the equidistribution at full period... the numbers will look good either way at a fraction of the entire period (2^48-2^16, or 2^49-2^17 for toggling +1), but might change slightly, which is what Tony is worried about. The PractRand results will likely come out better also.
    You could combine Tony's code above using ACC with my TOGGLE_SUM/CONSTANT_E/e in the code I posted to make things more readable and still configurable, perhaps.

    Evan, the new xoroacc code (that will become logic sometime, I hope) produces 32-bit equidistribution after 64K streams and any C code that you use for testing now must work in the same way, even though the testing length is only one stream, otherwise all the testing will need to be done again 'properly' later.

    The critical point is the + 1 happens only once every double iteration, i.e. every other single iteration, hence Chris's TOGGLE_SUM. The convention we have used for some time is that + 1 (or more generally + e) applies to acc0, the low half of acc. It could have been acc1 instead, but we have to choose one or the other and stick to it.
    acc0 := acc1 + prn0 + 1
    acc1 := acc0 + prn1
    
    EDIT:
    It is possible there might not be any differences in the overall distributions between the "+1,+1" tests done so far and the "+1,+0" needed for 32-bit equidistribution. Perhaps a few [a,b,c,d] should be tried first rather than the whole lot and this also applies to testing all possible acc1 with xoroshiro16++a.
  • evanhevanh Posts: 16,028
    edited 2020-01-16 21:44
    I stated I had chosen option #3, the single iterated one, at the outset. You're wanting me to abandon it entirely and use #1 or #2 instead.

    EDIT: Okay, maybe not quite stated as such, but certainly the only one of interest to me - https://forums.parallax.com/discussion/comment/1486685/#Comment_1486685

  • xoroshironotxoroshironot Posts: 309
    edited 2020-01-17 00:28
    Sorry, I hadn't been clear about the third option, which Tony and I discussed briefly.
    It was written for single iteration in the case that someone wanted to access perfect (not just near-perfect) 32-bit equidistribution and didn't mind translating that 32-bit variable code directly for use in the P2, etc.
    Originally I thought you were focusing on that code translated to 16-bit variables so that you could model the + 1 toggle mode more easily by adding a state bit, as required.

    What you have done with the non-toggle so far is get a taste of the difficulty in using freq information, since the randomness is capped at either 8G or 16GB in PractRand depending on toggle usage, but does not give meaningful data about how random the average stream is.
    The ultimate goal with freqs would be to ascertain how well [13 5 10 9] compares on average against other constants through multiple double-iterated streams, in the fashion Tony mentioned (by modeling a double-iterated 8-bit variable version using toggle in which all constants could all complete 256 seperate times, once per extended state, very quickly), before stepping up to the 16-bit toggle version.
  • evanhevanh Posts: 16,028
    I hadn't seen any difficulty sorry, I've treated the freq stuff just as a means of culling the candidates before bringing Practrand to bear.
  • evanhevanh Posts: 16,028
    edited 2020-01-16 23:57
    The Practrand gridding runs I've done are with the accumulator included. EDIT: Haven't done the +s0+1 candidates yet though.
  • xoroshironotxoroshironot Posts: 309
    edited 2020-01-17 22:38
    Evan, some extra detail on chi^2 sums that was not as important for XORO32, but now is more important on XOROACC, since much of the variant results will be already plausibly random at only one double-iterated stream.

    If you want to dig in to an appoximate probability that a given Chi^Sum could occur, you can use Open Office Calc or Microsoft Excel.
    x = Sum of freqs, y= Number of (non-zero) freqs summed - 1
    Calc: '=CHIDIST(x;y)'
    Excel: '=CHIDIST(x,y)'

    When looking at hundreds of 'p-values', just assume that the vast majority should be in the range 0.001 (not random/uniform) to 0.999 (too random/uniform).

    Regardless of the above, when calculating chi^2 freq sums for storing to a file, it is better to use a more exact floating point value during calculations for the 'ideal' (e.g. pfreq(0) 1580030168.7021, instead of 1580030169).
    This will not make much difference for the larger values, but becomes more important for the smaller values (e.g. pfreq(12) 3.298591, instead of 3), which will affect the overall sum.

    When you have a series of actual freq(x) values from multiple streams, you can do the following for each x to generate a composite freq(x) before summing:
    ('sum of all freq(x)' - 'total number of freq(x)' * 'ideal freq(x)')^2/('total number of freq(x)' * 'ideal freq(x)')
    On XOROACC, what you will find is that as 'total number of freq(x)' (i.e. the number of streams) increases, the sum will trend upward.
    This upward trend will not happen within 30 streams on a much better quality PRNG, and knowing that helps categorize candidates by multi-stream quality to help choose PractRand candidates, or if there is no visible difference in PractRand results, as a tie-breaker.

    I'm not sure how far Tony wants you to go with this, as the likely outcome is that some candidates will be somewhat measurably superior to [13 5 10 9]++a, but ultimately distinguishing between different variants of that one (e.g. with rotl(,1), etc.) will be important to justify any added complexity.

    I've attached 50 randomly seeded double-iterated streams of [13 5 10 9]++a with +1 on first iteration and rotl(,1) on the second, just as a reference.
  • TonyB_ wrote: »
    Just to confirm, are you saying that +1 and rotl(XOROACC,1) are acceptable with xoroshiro32++ and anything more complicated is unnecessary? If so, then the +1 could be implemented as a simple add-with-carry.
    Just to be clear, at this state size +1 and revbits(XOROACC) is still noticeably superior to +1 and rotl(XOROACC,1). At larger state sizes the difference would not be noticeable.
    To illustrate that point, have a look at the attached plots for revbits, where I have converted the chi^2 sums to p-values for each of 100+ random double-iterated stream trials, then sorted the p-values.
    I have not found anything better than this that did not involve much more complicated code.

    Regarding alternative scramblers/constants using only +1 and +0 (have not tested rotl/revbits):
    state[2] += rotl(state[0] * 5, D) + 1; // toggle +1 on/off for extended equidistribution
    .
    .
    .
    return state[2];
    
    Presumably revbits (without +1) would also be superior on the second iteration here.
    This is about as simple as it could possibly get, if a fresh implementation were ever needed.
  • TonyB_ wrote: »
    Just to confirm, are you saying that +1 and rotl(XOROACC,1) are acceptable with xoroshiro32++ and anything more complicated is unnecessary? If so, then the +1 could be implemented as a simple add-with-carry.
    Just to be clear, at this state size +1 and revbits(XOROACC) is still noticeably superior to +1 and rotl(XOROACC,1). At larger state sizes the difference would not be noticeable.
    To illustrate that point, have a look at the attached plots for revbits, where I have converted the chi^2 sums to p-values for each of 100+ random double-iterated stream trials, then sorted the p-values.
    I have not found anything better than this that did not involve much more complicated code.

    Apologies for the slow reply, partly due to computer problems. Reversing bits would be simple in hardware (less so for testing with x86 code, though).
    Regarding alternative scramblers/constants using only +1 and +0 (have not tested rotl/revbits):
    state[2] += rotl(state[0] * 5, D) + 1; // toggle +1 on/off for extended equidistribution
    .
    .
    .
    return state[2];
    
    Presumably revbits (without +1) would also be superior on the second iteration here.
    This is about as simple as it could possibly get, if a fresh implementation were ever needed.

    * 5 would require * 4 + * 1 and I think there would not be enough time for a modified P2 to cascade 16-bit state[2] to get a combined 32-bit result.
  • xoroshironotxoroshironot Posts: 309
    edited 2020-02-15 20:01
    TonyB_ wrote: »
    Reversing bits would be simple in hardware (less so for testing with x86 code, though).
    I added the function 'bitreverse16' to the attached c++ header file (from Lemire)... uses fast look-up table that works really well (for testing).
    TonyB_ wrote: »
    * 5 would require * 4 + * 1 and I think there would not be enough time for a modified P2 to cascade 16-bit state[2] to get a combined 32-bit result.
    I thought it would be helpful to list expected randomness from most-random to least-random of variants:
    1. XOROACC xoroshiro32pp+1, revbits(xoroshiro32pp)
    2. XOROACC xoroshiro32pp+1, rotl(xoroshiro32pp,1)
    3. Alternate (from above) rotl(state[0] * 5, D) + 1, rotl(state[0] * 5, D)
    4. XORO32 (based on ++)

    Obviously revbits would be the way to go, especially considering the preexisting hardware basis.

    The 'Alternate' (in optimized form, currently named 'sps') will be what I release for x64, 192-bit state, plus a faster version ('psp') that is intended to replace xoroshiro+.
    Those are only beta names... 'acc' will be added later (in some fashion?).
    Some Linux benchmark results attached that attempt to cover real-world performance of 64-bit code for integer and floating-point.
  • xoroshironotxoroshironot Posts: 309
    edited 2020-06-06 17:47
    Tony and Evan, just a quick status update on my work.
    I plan on reaching out to Seba in the near future, as I have have discovered a level above #1 in the list above which is simpler (e.g. no revbits), faster (up to ~35%), more random, and has potential crypto applications.
    In terms of XORO32 (and previous XOROACC's discussed), this new XOROACC variant I have discovered fails PractRand at 256GB forward bits and 2TB reverse bits (in the general purpose version, at 49 bits state),
    I have attached 1 set 2 sets of benchmarks (of many I have run on various old and new Intel and AMD CPUs) for floating point and general purpose variants, that hopefully demonstrates the full potential of extended state xoroshiro.
    I appreciate all of the effort both of you have put into entertaining my explorations of scramblers for the xoroshiro engine.

    Edited for clarity, in bold.
    Added AMD benchmark.
    Evan, I also attached a buffered version of TestU01_stdin, which is ~2x faster.
  • evanhevanh Posts: 16,028
    Evan, I also attached a buffered version of TestU01_stdin, which is ~2x faster.[/b]
    I never measured the speed of mine but did use same general idea. I think it was only 1 kB buffer size though.

  • evanhevanh Posts: 16,028
    Is that your own 3800X?
  • evanh wrote: »
    Is that your own 3800X?
    Not exactly, but I helped my son put it together, so he lets me borrow it occasionally.
  • evanhevanh Posts: 16,028
    Ah, that makes sense. I figured you'd be looking at at least a 3900X for yourself and it's not cheap to be buying new all the time.
Sign In or Register to comment.