Random/LFSR on P2

17980828485

Comments

  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-06 - 14:46:24
    TonyB_ wrote: »
    Averaging the chi^2 total is allowable, I think, but not chi^2 for individual frequencies.
    I believe that with sufficient individual frequency results to avoid skew/bias, the average should converge to the ideal.
    However, a good random sequence should certainly not always produce the ideal for each independent trial, but should produce values that have a certain deviation (MAD and/or Std) from the ideal.
    Since the chi^2 throws away the direction of bias, you could have seemingly good results, but all biased in one direction, which would be inferior to the same chi^2 value based on results that had no directional bias.

    Hopefully the xoroshiro128** results will shed some light on this so it can be codified at the 16GB mark as an additional statistic for consideration.
    I will try to do 30+ trials (and prefer to use StdDev).


  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-08 - 18:20:36
    Here are the pfreq(x) final stats for 32 independent trials of 'xoroshiro128** high 32 bits':
    Average frequencies chi^2 total = 0.2 (very close to zero, as I predicted)
    Average chi^2 frequencies total = 11.8 12.5 (lowest value seen was 4)
    StdDev chi^2 frequencies total = 5.3 (values from 0-2 above should occasionally be seen, based on this and the above total)

    For comparison, here are the pfreq(x) stats for 32 independent trials of xoroacc_rol7^AD65h_revbits(rol7):
    Average frequencies chi^2 total = 2.7 (not as close to zero as I had hoped, but very respectable)
    Average chi^2 frequencies total = 15.7 14.3 (lowest value seen was 4)
    StdDev chi^2 frequencies total = 5.3 (values from 0-2 above will rarely be seen, based on this and the above total)

    I have attached the ods files, just in case the above is not clear.

    This should set some clear targets for other variants, and I will process the zfreq and nzfreq data when I have a chance.

    There was a bug with the binary data when migrating to Windows (i.e. some of the files were larger than expected), which I have just fixed (by opening files with wb, instead of w, which is now cross-platform compatible).

    Edit: Corrected 'Average chi^2 frequencies total' above, and re-attached corrected ods files.
  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-09 - 14:45:39
    It seems that (on average) the chi^2 sums total to a value that is equal to where: expected freq(x) = 1.
    So the 12.5 total for xoroshiro128 in the previous post, would make perfect sense, as ideal pfreq(12.47) = 1.00 and sum of good PRNG pfreqs 0 to 13 =~12.47.
    I am not sure if that is an exact expectation (probably documented somewhere), but the values I see would bear out the idea, as well.
    TonyB_ wrote: »
    Also I now regard pfreq, zfreq & nzfreq as having equal worth.
    Therefore, in order to combine the three sums: Average(pfreqchi^2sum / 12.47, zfreqchi^2sum / 21.26, nzfreqchi^2sum / 44.00) = 1 for an 'ideal' PRNG.
    The range 0.5-2.0 would be typical. For 32 runs of xoroshiro128** high 32 bits, I got:
    Min = 0.54 from Average(0.49, 0.41, 0.70)
    Max = 1.85 from Average(1.29, 2.34, 1.91)
    Avg = 1.11

    Edit: Improved formula, added more notes, xoroshiro128** stats and added quote.
    Edit2: The ideal divisor values for zfreq=21.26 and nzfreq=44.00 might be slightly wrong, since x=0 is not included in the sum. Xoroshiro128** averaged divisor values are zfreq=23.74 and nzfreq=53.33.
    However, they are close enough to allow for rational comparisons between pfreq, zfreq and nzfreq chi^2 sums.

    Edit3: Attached 3-page 'xoroshiro128** high 32 bits' ods file which shows all pfreq, zfreq and nzfreq data supporting the above.
  • TonyB_TonyB_ Posts: 1,456
    edited 2019-12-10 - 01:45:51
    TonyB_ wrote: »
    Also I now regard pfreq, zfreq & nzfreq as having equal worth.
    Therefore, in order to combine the three sums: Average(pfreqchi^2sum / 12.47, zfreqchi^2sum / 21.26, nzfreqchi^2sum / 44.00) = 1 for an 'ideal' PRNG.
    The range 0.5-2.0 would be typical. For 32 runs of xoroshiro128** high 32 bits, I got:
    Min = 0.54 from Average(0.49, 0.41, 0.70)
    Max = 1.85 from Average(1.29, 2.34, 1.91)
    Avg = 1.11
    I'm not mad keen on the fractional divisors. Integers equal to the degrees of freedom seem better.
    http://onlinestatbook.com/2/chi_square/distribution.html states:
    "A standard normal deviate is a random sample from the standard normal distribution. The Chi Square distribution is the distribution of the sum of squared standard normal deviates. The degrees of freedom of the distribution is equal to the number of standard normal deviates being summed... The mean of a Chi Square distribution is its degrees of freedom."
    Are the few nzfreq(51)=23 values for xoroshiro128** due to the larger state?
    If you have time, could you find pfreq/zfreq/nzfreq for xoroacc ^1, no bit reversal when ^AD65H is finished?
  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-10 - 03:56:01
    TonyB_ wrote: »
    I'm not mad keen on the fractional divisors. Integers equal to the degrees of freedom seem better.
    The fractions have minimal impact. I was just extrapolating from the xoroshiro128 data, but would need more data from a other quality random sources to draw more sound conclusions.
    TonyB_ wrote: »
    Are the few nzfreq(51)=23 values for xoroshiro128** due to the larger state?
    The nzfreq(51) hits did not escape my attention. Looking into those is related to my response to the fractional divisors. Not something I am willing to dig into at this time.
    For the moment, let us assume that it is just random noise that had to appear somewhere. In my experience, deviations like that (in non-chaotic PRNGs) can only progress with state size: XORO32 < XOROACC < xoroshiro128.
    TonyB_ wrote: »
    If you have time, could you find pfreq/zfreq/nzfreq for xoroacc ^1, no bit reversal when ^AD65H is finished?
    I am working on automating many of the variants by including passable parameters for 'e', rol() and revbits()... so this simplest variant 'xoroacc^1' would be e=1, rol=0, and revbits=0. Since xoroacc+1 is likely superior to ^1, I will have to run those variants as a separate set, but will do ^1 and +1 first.
    Initial evidence is that there will be less distinction than expected between most variants at one double-iterated stream, so it will not be as useful as I thought it would be for an 'e' search. Still need to confirm.

    I believe that pfreq, etc., performance at 12 to 24 sequential streams will be where most of the distinction will be. But it was never originally my intent to put too much weight into it until I saw how compellingly well some variants perform at it.
    Ultimately, weighting factors will have to be created to balance 1 stream freqs, multi-stream pfreqs and PractRand rotation results. Gjrand at 16GB and 32GB seems useful, as it produces fairly easy to understand result... perhaps useful for cross-check or tie-breaker.

    Distracted right now, as my car has failed me in a way that it will cost more to fix than it is worth, so I may have to dig into my new server funds to handle it. I will console myself by noting that TSMC's forthcoming 5nm process (and then 3nm that follows) will ensure that whatever I could buy now will be obsolete in a matter of 12-24 months.
  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-12 - 02:58:22
    The data I just ran from xoroshiro128+ has no hits at all on nzfreq(51) like ** does. I need to ensure a good baseline, so testing xoroshiro128++ to complete a 3-way comparison (which seem apropos).

    I have much of the simple XOROACC variant data complete, just need to cross-check and package it.
  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-16 - 02:14:43
    I have much of the simple XOROACC variant data complete, just need to cross-check and package it.
    I have attached only bin files for ^1/+1/AD65h variants. Notes: rol0 = no rotation, rev0 = not reversed, rev1rol7 = rotate bits left 7 places, then reverse all 16 bits.
    I have the chi^2 sum data that I automated within the c++ code, but it is not entirely correct and needs to be redone (so I did not attached it).
    I finally realized that the chi^2 set formula for multiple independent streams must be: (sum(actuals) - expected * actuals_count)^2 / (expected * actuals_count)
    Also been studying Yate's Correction for Continuity, and the impact of misuse of Degrees of Freedom. I think we are good with the latter (which is one of the best explanations on the web), but the former might explain the excessive chi^2 value for the occasional nzfreq(51) hits on xoroshiro128**.

    Edit: Fixed multi-stream chi^2 formula
  • TonyB_TonyB_ Posts: 1,456
    edited 2019-12-16 - 01:35:31
    I have much of the simple XOROACC variant data complete, just need to cross-check and package it.
    I have attached only bin files for ^1/+1/AD65h variants. Notes: rol0 = no rotation, rev0 = not reversed, rev1rol7 = rotate bits left 7 places, then reverse all 16 bits.
    Thanks for the files. What exactly are the hex digits at the end of the filenames?
    The data I just ran from xoroshiro128+ has no hits at all on nzfreq(51) like ** does. I need to ensure a good baseline, so testing xoroshiro128++ to complete a 3-way comparison (which seem apropos).

    I have much of the simple XOROACC variant data complete, just need to cross-check and package it.
    I agree it would be better for the baseline to have a ++ scrambler, not **, however I doubt xoroshiro128++ would ever be used on the P2. The more practicable (but considerably slower) software alternative to a hardware xoroacc32 is xoroshiro64++ and it would be fairer to compare these two.

    It's interesting that the xoroshiro128** in the P2 rev B needs less logic than xoroshiro128++, which is why we skipped the latter when upgrading from xoroshiro128+ in rev A.
  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-18 - 01:29:06
    TonyB_ wrote: »
    What exactly are the hex digits at the end of the filenames?
    Those are the non-deterministic seeds, which get further processed to seed the specific PRNG state... really, I just needed unique filenames, so that worked doubly well.

    I ran those over again, and more variants, with the fixed chi^2 multi-stream sums:
    Variants		pfreq		zfreq		nzfreq		Average 
    (N=45 streams)		chi^2 sum	chi^2 sum	chi^2 sum	(ideal ~= 1)
    xoroshiro128**		 16		 38		 60		 1.32
    xoroshiro128+		 15		 14		 52		 0.90
    xoroshiro128++		  3		 24		 44		 0.71
    rol0plus1h_rev0rol0	338		176		360		12.95
    rol0plus1h_rev1rol0	218		186		366		10.28
    rol7plus1h_rev0rol7	 91		145		338		 6.48 !
    rol7plus1h_rev1rol7	265		229		149		10.60
    rol0xor1h_rev0rol0	190		235		474		11.03
    rol0xor1h_rev1rol0	107		249		379		 8.63
    rol7xor1h_rev0rol7	172		205		502		10.35
    rol7xor1h_rev1rol7	131		249		480		 9.86
    rol0xorAD65h_rev0rol0	267		187		375		11.52
    rol0xorAD65h_rev1rol0	128		163		324		 7.53 *
    rol7xorAD65h_rev0rol7	122		151		339		 7.31 **
    rol7xorAD65h_rev1rol7	162		286		458		11.00 !!
    (scroll down)
    Formula for chi^2 sum:  (sum(actuals) - expected * N)^2 / (expected * N)
    Formula for Average  :  Average(pfreqsum/14 + zfreqsum/23 + nzfreqsum/51)
    Freqs used           : pfreq 0-13, zfreq 1-23, nzfreq 1-51
    
    !  Fails PractRand miserably at 16GB (instead of 32GB), so this success was unexpected
    *  Highly recommended and previously explored, so this success was fully expected
    ** Very good, but not previously explored. Probably not recommended, since no revbits, which is important in other tests
    !! Not as good as expected based on previous PractRand and other testing
    
    Keep in mind that having an average ~7.5x greater than xoroshiro128 in 45 double-iterated streams is good considering that known (mostly GAP and autocorrelation) defects exist just after 1 stream (16GB, which is the absolute maximum rating for this PRNG for most stringent applications, or perhaps only 8GB for quality hard research-level work).

    For comparison, here is the recommended variant at 5 streams:
    Variants		pfreq		zfreq		nzfreq		Average 
    (N=5 streams)		chi^2 sum	chi^2 sum	chi^2 sum	(ideal ~= 1)
    xoroshiro128**		8		17		50		0.75
    rol0xorAD65h_rev1rol0	15		30		56		1.17 *
    
    * About as good as xoroshiro128** for at least 5 streams, with respect to these specific freq tests.
    
  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-18 - 01:28:18
    TonyB_ wrote: »
    The more practicable (but considerably slower) software alternative to a hardware xoroacc32 is xoroshiro64++ and it would be fairer to compare these two.
    xoroshiro64++ (with quality constants) would likely be nearly indistinguishable from xoroshiro128++, unless perhaps many thousands of 16GB streams are run.
    If it were possible to implement, then the truncated form of XOROACC would also be easy to implement: acc = acc + rol(s0 + s1, D) + 1
    That would have near-perfect 2-dimensional 32-bit equidistribution (and fully 1D), but no 64-bit equidistribution (unless the +1 were applied every other iteration).
    Much harder to vet constants for such endeavors, but using Seba's recommended ABC would only require searching for best D constant.
    I'll run a few hundred on xoroshiro64++, if I can guess a suitable D constant (since I don't think one is published).

    Edit: Added clarifications.
    Edit2: I notice xoshiro128++ (32-bit output, like xoroshiro64) uses D=7, which is likely a good starting point for xoroshiro64++.
  • ^926Bh_rev1 produces an N=45 average of 6.06 (with good PractRand performance). The N=12 average is ~2, which is on the border of being statistically plausible.
    Much lower averages seem statistically possible. An N=45 average of 4 or less would be a target goal.
    I will write a program that will search for all 16-bit primes that have runs of 1 or 2 consecutive 0 or 1 bits, which seems to me related.
  • TonyB_TonyB_ Posts: 1,456
    edited 2019-12-18 - 04:54:15
    TonyB_ wrote: »
    The more practicable (but considerably slower) software alternative to a hardware xoroacc32 is xoroshiro64++ and it would be fairer to compare these two.
    xoroshiro64++ (with quality constants) would likely be nearly indistinguishable from xoroshiro128++, unless perhaps many thousands of 16GB streams are run.
    If it were possible to implement, then the truncated form of XOROACC would also be easy to implement: acc = acc + rol(s0 + s1, D) + 1
    That would have near-perfect 2-dimensional 32-bit equidistribution (and fully 1D), but no 64-bit equidistribution (unless the +1 were applied every other iteration).
    Much harder to vet constants for such endeavors, but using Seba's recommended ABC would only require searching for best D constant.
    I'll run a few hundred on xoroshiro64++, if I can guess a suitable D constant (since I don't think one is published).

    Edit: Added clarifications.
    Edit2: I notice xoshiro128++ (32-bit output, like xoroshiro64) uses D=7, which is likely a good starting point for xoroshiro64++.

    A couple of earlier posts that might be of interest:

    xoroshiro128++ vs. xoroshiro128**
    http://forums.parallax.com/discussion/comment/1448782/#Comment_1448782

    xoroshiro64++
    http://forums.parallax.com/discussion/comment/1449224/#Comment_1449224

    Seba's suggestion for what we call constant d for xoroshiro++ when state too large to iterate fully:
    If output width = w, then d ~ w/2, in theory prime number is better, consider w/2 +/- 1 first.
    However, this is just a rule-of-thumb and "'not set in stone."
  • xoroshironotxoroshironot Posts: 231
    edited 2019-12-19 - 04:29:22
    TonyB_ wrote: »
    Seba's suggestion for what we call constant d for xoroshiro++ when state too large to iterate fully:
    If output width = w, then d ~ w/2, in theory prime number is better, consider w/2 +/- 1 first.
    However, this is just a rule-of-thumb and "'not set in stone."
    I remember having read that, which is why the choice of 7 for for xoshiro128++ threw me, since it does not follow that logic.
    My guesses are that the use of 7 in that case:
    1. Is a safe minimum value for masking binary matrix / linear complexity issues.
    2. Minimizes correlations with the other shift / rotation constants in use.
    3. May be more specific to 'xo' rather than 'xoro'-shiro to improve bit mixing (due to 4, instead of 2, state variables).

    I had actually thought about exploring an xoshiro32 version of xoroacc (or the truncated version), with a period of 2^40-2^8 (using 5 8-bit state variables).
    It could be iterated fully to perhaps help provide answers to some fundamental questions.
  • xoroshironotxoroshironot Posts: 231
    edited 2020-01-09 - 02:19:38
    Update: I have identified many of the most useful and/or simplest variants related to previous discussions, and adapted the smallest and fastest of those to 96-bit and 192-bit state versions for in-depth testing. My testing so far shows that there is nothing else available that can compete with the simplest version when simultaneously considering the five main desirable properties of speed, randomness, equidistribution, mathematical soundness and code/implementation simplicity. The only superficial quality that suffers is that the extended state variable is not used optimally (i.e. XOROACC is not 65536x more random than XORO32), unless you weight the other advantages more highly, which I believe are more important and thus largely justifiable.

    Obviously more complexity would be useful for the 48-bit state versions to work around the comparatively smaller size. Here is what I found:
    1. The smallest / fastest version (based on '+') that I am testing with larger states is still about 8x more random at 48-bit state than xoroshiro32++ (and has the added equidistribution and sparse-state-recovery benefits). Since a basic XORACC32 based on '+' is nearly identical in hardware implementation complexity to XORO32, it is unfortunate that this scrambler variant was not previously known.
    2. The best version for maximum randomness at any state size (based on the '++' that we have been looking at) requires an additional xor of 's0' against the ACC output before subsequent addition (which can be done in parallel with the rotl(s0+s1) before final addition). This in intriguing, since it forces multiple collision carries with the 'x0' in the ++ part, that I originally assumed would lead to less randomness, but was wrong (obviously so, after observing forward PractRand fail well after the 32GB mark).
    3. None of the versions I am testing or recommending have an xor/addition constant greater than 1 (+1 is currently recommended, per Tony's original intuition), since large values increase complexity without fully solving the base issue with multi-stream correlation.
    4. When creating a double-iterated version for maximum randomness (~32x that of XORO32, and including the addition of near-perfect 32-bit equidistribution), the modification described in #2 coupled with rotl(XOROACC,1) on the second iteration is the best. Revbits(XOROACC) no longer becomes worth considering (also because it created headaches for mathematical analysis and a slower software implementation).

    I'll post test code for some of the above variants at various state sizes after I have finished further evaluations.
  • Ah, I haven't been following but, any replacement for XORO32 must use 32 bits for its state store.
  • xoroshironotxoroshironot Posts: 231
    edited 2020-01-09 - 04:43:58
    evanh wrote: »
    Ah, I haven't been following but, any replacement for XORO32 must use 32 bits for its state store.
    Nice to see you pop in here. You will have to go back several pages to brush up.

    In summary, we are not exactly discussing replacement, but if we were, then in that context (at least 1 of) the XORO32 output registers would become bi-directional, thus not technically using any more state than currently used by XORO32. (Then an explosion of benefits ensues which initially were difficult to believe).
  • xoroshironotxoroshironot Posts: 231
    edited 2020-01-09 - 04:44:50
    Evan, assuming I didn't make any mistakes in translation to a form that should make sense to you for Px, here is one example of the 16-bit code I referred to in #4 above (previously unpublished):
    // most random version of XOROACC by C. Rutz, period 2^49-2^17, perfect 16-bit 1 dimensional equidistribution, and almost perfect 32-bit 1D and 16-bit 2D
    uint16_t state[2] = { 1, 0 }; // state must not be seeded everywhere zero
    uint16_t   ACC[2] = { 0, 0 }; // bi-directional hardware output register, any initial values allowed (but only ACC[1] will be used)
    const uint16_t e  =   1     ; // 1 is recommended, but must be odd, without high-bit set unless e is omitted from second iteration.
    
    inline uint16_t rotl(const uint16_t x, int k) { return (x << k) | (x >> (16 - k)); }
    
    // XORO32 single-iteration
    inline uint16_t xoroshiro32pp(void) {
    	uint16_t s0 = state[0];
    	uint16_t s1 = state[1];
    	uint16_t result = rotl(s0 + s1, 9) + s0; // 7 should be better than 9 for XOROACC
    	s1 ^= s0;
    	state[0] = rotl(s0, 13) ^ s1 ^ (s1 << 5);
    	state[1] = rotl(s1, 10);
    	return result;
    }
    // Call this...
    inline uint32_t XOROACC(void) {
    	ACC[0] = (ACC[1] ^ state[0]) +      xoroshiro32pp() + e;     // ^ e works also 
    	ACC[1] = (ACC[0] ^ state[0]) + rotl(xoroshiro32pp() + e, 1); // e is optional here
    	return ACC[0] | (uint32_t(ACC[1]) << 16); // return output register as 32-bits for testing (not for hardware implementation)
    }
    
    The code can be simplified greatly when extended to 32-bit or larger state variables, as no programmatic mechanism exists that I am aware of to prove the loss of random quality that is apparent at 16-bit.
  • xoroshironotxoroshironot Posts: 231
    edited 2020-01-09 - 05:40:17
    ... and here is the smallest / fastest version that is usable for double-iteration, per #1 above (also previously unpublished):
    // small and fast version of XOROACC by C. Rutz, period 2^49-2^17, perfect 16-bit 1 dimensional equidistribution, and almost perfect 32-bit 1D and 16-bit 2D
    uint16_t state[2] = { 1, 0 }; // state must not be seeded everywhere zero
    uint16_t   ACC[2] = { 1, 0 }; // bi-directional hardware output register, any initial values allowed (but only ACC[1] will be used)
    const uint16_t e  =   1     ; // 1 is recommended, but must be odd
    
    inline uint16_t rotl(const uint16_t x, int k) { return (x << k) | (x >> (16 - k)); }
    
    // plus version, with rotate
    inline uint16_t xoroshiro32p(void) {
    	uint16_t s0 = state[0];
    	uint16_t s1 = state[1];
    	uint16_t result = rotl(s0 + s1, 7); // using 7 for D
    	s1 ^= s0;
    	state[0] = rotl(s0, 13) ^ s1 ^ (s1 << 5);
    	state[1] = rotl(s1, 10);
    	return result;
    }
    // Call this...
    inline uint32_t XOROACC(void) {
    	ACC[0] = ACC[1] + xoroshiro32p() + e;     // ^ e works also 
    	ACC[1] = ACC[0] + xoroshiro32p()    ;     // missing e creates 32-bit near-perfect 1D
    	return ACC[0] | (uint32_t(ACC[1]) << 16); // return output register for testing (not for hardware implementation)
    }
    
    Let me know if you find any issues with this code or that in the post above.
  • evanhevanh Posts: 8,975
    edited 2020-01-09 - 05:12:38
    ... 
    // Call this...
    inline uint32_t XOROACC(void) {
    	ACC[0] = ACC[1] + xoroshiro32p() + e;     // ^ e works also 
    	ACC[1] = ACC[0] + xoroshiro32p()    ;     // missing e creates 32-bit near-perfect 1D
    	return ACC[0] | (uint32_t(ACC[1]) << 16); // return output register for testing (not for hardware implementation)
    }
    
    Let me know if you find any issues with this code or that in the post above.
    That's really pushing what could be done. It would have to somehow reach forward for an early fetch of the following instruction's operand to acquire the accumulation data before forwarding the result on. And it would have to be the D operand so that ACC is then written back by the following instruction.

    Chip may be able to indicate how hard that would be to do.

  • There is only one write port window per each 2-clock instruction execution. That said, if XORO32 were to take a third clock to execute then maybe two 32-bit writes would be doable.

  • xoroshironotxoroshironot Posts: 231
    edited 2020-01-13 - 04:34:19
    And finally, per Tony's suggestion, here is the user-mode code for a single-iterated xoroshiro64 variant (full testing has not finished, but preliminary results are outstanding and it is mathematically near-perfect... full documentation eventually to follow):
    // fast xoroshiro64+a by C. Rutz, based on xoroshiro engine by S. Vigna and D. Blackman... public domain applies to the greatest extent possible
    // period 2^96-2^32, perfect 32-bit 1 dimensional equidistribution, and almost perfect 32-bit 2D (2^32 pairs occurring once less often over full period) 
    // exploits a trivial property of the xoroshiro full-period output modulo sum to recover equidistribution, extend randomness period and increase execution speed
    
    uint32_t state[3] = { 1, 0, 0 }; // states 0 and 1 must not be seeded everywhere zero, state 2 should not to be used in isolation as a stream selector
    
    inline uint32_t rotl(const uint32_t x, int k) { return (x << k) | (x >> (32 - k)); }
    
    inline uint32_t xoroshiro64plusacc(void) {
    	uint32_t s0 = state[0];
    	uint32_t s1 = state[1];
    	uint32_t s2 = state[2];
    	uint32_t tmp = s2 + rotl(s0 + s1, 15) + 1; // Include '+ s0' to improve randomness further, if desired, and toggle '+ 1' on and off to extend equidistribution to double output word size
    	s1 ^= s0;
    	state[0] = rotl(s0, 26) ^ s1 ^ (s1 << 9);
    	state[2] = tmp;
    	state[1] = rotl(s1, 13);
    	return s2;
    }
    
    I cannot thinks of anything more to contribute here, but if you need anything or have any questions, just ask.

    Edit: Improved code execution speed by deferring state[2] reassignment through use of tmp variable.
    Edit2: Better documentation added here.
  • evanhevanh Posts: 8,975
    edited 2020-01-09 - 06:50:34
    Huh. Have you also been testing that last single iterated algorithm at 16-bit word size? That would then work as a XORO32 replacement when prefixing a plain ADD instruction. XORO32 also then stays with S operand insertion and the ADD performs the accumulation step.

  • Ah, that'll be why some prefixing instructions, like XORO32, use the Q register. I suspect it'll be the difference between field vs operand insertion. Those that insert directly to the ALU operand port require a holding register, Q being convenient, while those that perform an instruction field substitution don't need the holding register. Just a hunch at this stage. :)
  • TonyB_TonyB_ Posts: 1,456
    edited 2020-01-09 - 20:44:24
    Evan, if you have time to spare, I suggest reading the posts after the last one you made in this thread before today. Some posts are complicated, but the basic innovations are apparent quite early on.

    Modified XORO32 logic is needed in a future P2 or P3 to provide a new XOROACC function, yet only two instructions will be required, the same as now, as follows:
    	XORO32 state
    	XOROACC	result
    
    Some of the existing XORO32 adders could be re-used for XOROACC, probably, saving logic.

    result is the new 32-bit PRN result, made up of two separate 16-bit accumulations in a similar way to the current XORO32 PRN. The high word of result is used to generate the next result, thus it must not be destroyed and the state is now 48 bits. The extra state bits mean the overall period is larger, consisting of 2^16 streams of 2^33-2 XORO32 iterations. Each stream is correlated with every other stream, with constant differences between outputs for corresponding iterations.

    Instead of one stream following another, there is a jump to a different stream after each xoroshiro32++ double-iteration. These jumps produce near-perfect 2D equidistribution, i.e. each possible 32-bit result occurs almost equally over the full extended period. With or without jumps, there is perfect 1D equidistribution, i.e. each 16-bit word occurs exactly equally, including zero. Compare this with the current XORO32, which has almost perfect 1D (16-bit zero occurs one time less than non-zero outputs) and a good approximation to random 32-bit output but nowhere near 2D (e.g. 37% of possible long values never occur).

    Chris, thanks for the update. Just to confirm, are you saying that +1 and rotl(XOROACC,1) are acceptable with xoroshiro32++ and anything more complicated is unnecessary? If so, then the +1 could be implemented as a simple add-with-carry.

    Note:
    This extra +1 every double-iteration is enough to cause a jump to another stream and these jumps are crucial in achieving near-perfect 2D.
  • May as well make it one 3-clock instruction that takes two operands.
  • xoroshironotxoroshironot Posts: 231
    edited 2020-01-09 - 20:23:13
    evanh wrote: »
    Huh. Have you also been testing that last single iterated algorithm at 16-bit word size?
    To reach near-perfect 32-bit equidistribution with the 16-bit word size, the +1 must be toggled (e.g. on for first iteration and off for second).
    All previous code that Tony and I discussed was based on the original ++, which is about 2x-4x more random than + only. Therefore, we had not discussed altering the underlying ++ to drop the '+ s0'.

    Recently, I found that keeping the underlying XORO32 (with its ++) could be improved further by xoring the extended state with state[0] prior to adding the original scrambler +1... but this assumes access to state information that is no longer available from XORO32. Therefore, it would have to be embedded in the base scrambler, which was not on the table as an option. Interesting though.
    TonyB_ wrote: »
    Chris, thanks for the update. Just to confirm, are you saying that +1 and rotl(XOROACC,1) are acceptable with xoroshiro32++ and anything more complicated is unnecessary? If so, then the +1 could be implemented as a simple add-with-carry.
    Yes, for the most part. That gets us up to 16GB randomness and acknowledges that any minor attempt (e.g. xor 0xAD65h instead of +1) to mask issues going beyond that is denying the issues with subsequent double-iterated stream correlation that only 'looks' less correlated.

    You can optionally add the +1 to the second iteration in addition to the first (if that consistency were somehow advantageous), but it must be prior to the (now required) rotation, else it breaks the 32-bit equidistribution properties. I don't recall comparing the randomness directly with and without the +1 in the second iteration when the rotation is used. Reminder that the second iteration rotation was used to break the early GAP issue and improve overall randomness from 8GB up to 16GB.
  • xoroshironotxoroshironot Posts: 231
    edited 2020-01-10 - 00:54:22
    You can optionally add the +1 to the second iteration in addition to the first (if that consistency were somehow advantageous)...
    Probably not a good idea to add +1 to second iteration, but somewhat subjective. Here are the forward bits results for each from PractRand pre0.95 (source code needs manual repair to fix mod3n, etc. issues... see here):
    ACC[0] = ACC[1]  +      xoroshiro32pp() + 1;
    ACC[1] = ACC[0]  + rotl(xoroshiro32pp(), 1); 
    rng=RNG_stdin, seed=unknown
    length= 16 gigabytes (2^34 bytes), time= 366 seconds
      Test Name                         Raw       Processed     Evaluation
      [Low4/16]FPF-14+6/4:all           R=  -7.0  p =1-2.1e-6   mildly suspicious
      ...and 1904 test result(s) without anomalies
    
    rng=RNG_stdin, seed=unknown
    length= 32 gigabytes (2^35 bytes), time= 670 seconds
      Test Name                         Raw       Processed     Evaluation
      DC7-9x1Bytes-1:even               R= +12.9  p =  5.0e-7   suspicious
      DC7-9x1Bytes-1:odd                R= +24.3  p =  6.6e-13    FAIL
      DC7-9x1Bytes-1:both               R= +17.6  p =  8.1e-12    FAIL
      DC7-9x1Bytes-1:indep              R= +40.5  p =  0          FAIL !!!!!!!!
      Gap-16:A                          R=+555.0  p =  1.3e-366   FAIL !!!!!!!
      Gap-16:B                          R=+259.2  p =  1.5e-221   FAIL !!!!!!
      [Low1/8]Gap-16:B                  R=  +6.5  p =  1.9e-5   unusual
      [Low1/16]FPF-14+6/4:(0,14-0)      R=  -9.0  p =1-5.8e-8   mildly suspicious
      [Low1/16]FPF-14+6/4:all           R=  -7.0  p =1-1.8e-6   mildly suspicious
      [Low1/16]Gap-16:A                 R= +20.6  p =  1.4e-14    FAIL
      [Low1/16]Gap-16:B                 R= +22.1  p =  9.0e-19    FAIL !
      [Low1/32]Gap-16:A                 R= +68.3  p =  1.6e-59    FAIL !!!!
      [Low1/32]Gap-16:B                 R= +45.4  p =  1.0e-38    FAIL !!!
      [Low4/16]FPF-14+6/16:all          R=  -7.2  p =1-1.4e-6   mildly suspicious
      [Low4/16]FPF-14+6/4:(0,14-0)      R= -14.3  p =1-7.4e-13   VERY SUSPICIOUS
      [Low4/16]FPF-14+6/4:(1,14-0)      R=  -8.4  p =1-2.0e-7   unusual
      [Low4/16]FPF-14+6/4:(2,14-0)      R= -10.4  p =1-2.9e-9   suspicious
      [Low4/16]FPF-14+6/4:all           R= -14.8  p =1-4.6e-14    FAIL
      ...and 1977 test result(s) without anomalies
    
    ACC[0] = ACC[1]  +      xoroshiro32pp() + 1;
    ACC[1] = ACC[0]  + rotl(xoroshiro32pp() + 1, 1); 
    rng=RNG_stdin, seed=unknown
    length= 16 gigabytes (2^34 bytes), time= 358 seconds
      Test Name                         Raw       Processed     Evaluation
      [Low4/16]FPF-14+6/4:all           R=  -7.2  p =1-1.2e-6   mildly suspicious
      ...and 1902 test result(s) without anomalies
    
    rng=RNG_stdin, seed=unknown
    length= 32 gigabytes (2^35 bytes), time= 663 seconds
      Test Name                         Raw       Processed     Evaluation
      mod3n(0):(0,9-0)                  R=+132.4  p =  3.3e-72    FAIL !!!!
      mod3n(0):(1,9-0)                  R=+144.1  p =  1.4e-78    FAIL !!!!
      mod3n(0):(2,9-0)                  R=+134.1  p =  4.1e-73    FAIL !!!!
      DC7-9x1Bytes-1:odd                R=  +8.8  p =  6.4e-5   unusual
      DC7-9x1Bytes-1:both               R=  +7.1  p =  9.5e-5   unusual
      DC7-9x1Bytes-1:indep              R= +12.2  p =   1e-5    mildly suspicious
      Gap-16:A                          R=+560.6  p =  2.7e-370   FAIL !!!!!!!
      Gap-16:B                          R=+264.9  p =  1.9e-226   FAIL !!!!!!
      [Low1/16]Gap-16:A                 R= +22.0  p =  1.5e-15    FAIL
      [Low1/16]Gap-16:B                 R= +20.3  p =  2.8e-17    FAIL !
      [Low1/32]Gap-16:A                 R= +68.0  p =  3.1e-59    FAIL !!!!
      [Low1/32]Gap-16:B                 R= +42.2  p =  5.3e-36    FAIL !!!
      [Low4/16]FPF-14+6/4:(0,14-0)      R= -12.7  p =1-2.3e-11  very suspicious
      [Low4/16]FPF-14+6/4:(1,14-0)      R=  -8.3  p =1-2.5e-7   unusual
      [Low4/16]FPF-14+6/4:(2,14-0)      R=  -9.1  p =1-4.5e-8   mildly suspicious
      [Low4/16]FPF-14+6/4:(3,14-0)      R=  -8.0  p =1-4.7e-7   unusual
      [Low4/16]FPF-14+6/4:all           R= -12.6  p =1-6.8e-12   VERY SUSPICIOUS
      [Low4/32]Gap-16:A                 R=  +7.1  p =  6.8e-5   unusual
      [Low4/32]Gap-16:B                 R=  +7.7  p =  1.6e-6   mildly suspicious
      [Low8/32]mod3n(0):(0,9-0)         R=+130.7  p =  2.6e-71    FAIL !!!!
      ...and 1973 test result(s) without anomalies
    
  • All previous code that Tony and I discussed was based on the original ++, which is about 2x-4x more random than + only. Therefore, we had not discussed altering the underlying ++ to drop the '+ s0'.
    Ah, I thought that final listing, replacing the "+ s0" with "+ 1", was indicating an advantange over "+ s0".

    My takeaway is the existing XORO32 is still about as good as it can be without taking longer to execute or using more state store. Probably both.

  • xoroshironotxoroshironot Posts: 231
    edited 2020-01-10 - 04:19:36
    evanh wrote: »
    All previous code that Tony and I discussed was based on the original ++, which is about 2x-4x more random than + only. Therefore, we had not discussed altering the underlying ++ to drop the '+ s0'.
    Ah, I thought that final listing, replacing the "+ s0" with "+ 1", was indicating an advantange over "+ s0".

    My takeaway is the existing XORO32 is still about as good as it can be without taking longer to execute or using more state store. Probably both.
    Actually, ++ is has only approximately 2x more random period than + when considering the double-iterated version, which is why I chose + for larger word sizes in the single iterated variant where the difference between the two likely cannot be discerned in any reasonable time frame (e.g. several years of analysis at 100GB/s with the 32-bit word version).
    Here are the PractRand results for a comparable double-iterated + variant using the rotl(xoroshiro32prol,1) on the second iteration (for which you can extrapolate the 32-bit word variant properties):
    uint16_t result = rotl16(s0 + s1, 7); // xoroshiro32prol is xoroshiro32++ scrambler with '+ s0' removed and D=7 (W/2-1)
    ACC[0] = ACC[1]  +      xoroshiro32prol() + 1;
    ACC[1] = ACC[0]  + rotl(xoroshiro32prol(), 1); 
    rng=RNG_stdin, seed=unknown
    length= 8 gigabytes (2^33 bytes), time= 181 seconds
      Test Name                         Raw       Processed     Evaluation
      [Low1/16]FPF-14+6/4:(5,14-0)      R=  +7.6  p =  1.2e-6   unusual
      ...and 1795 test result(s) without anomalies
    
    rng=RNG_stdin, seed=unknown
    length= 16 gigabytes (2^34 bytes), time= 358 seconds
      Test Name                         Raw       Processed     Evaluation
      [Low1/16]FPF-14+6/4:(5,14-0)      R= +16.1  p =  1.6e-14    FAIL
      [Low1/16]FPF-14+6/4:(7,14-0)      R=  +9.3  p =  2.9e-8   mildly suspicious
      [Low1/16]FPF-14+6/4:(8,14-1)      R=  +7.8  p =  1.2e-6   unusual
      [Low1/16]FPF-14+6/4:all           R= +11.2  p =  4.9e-10   VERY SUSPICIOUS
      [Low4/16]FPF-14+6/4:(0,14-0)      R=  -8.3  p =1-2.6e-7   unusual
      [Low4/16]FPF-14+6/4:(1,14-0)      R=  -8.7  p =1-1.1e-7   mildly suspicious
      [Low4/16]FPF-14+6/4:all           R=  -6.3  p =1-1.1e-5   unusual
      ...and 1895 test result(s) without anomalies
    
    Notice the fail is very gentle... it could have just as easily failed at 32GB with different seeds.
    Chasing the rabbit down the hole with more complicate code using ++ will not extend the forward failure beyond 32GB (though I recall the reverse failure is extended to 128GB):
    uint16_t result = rotl16(s0 + s1, 7) + s0; // xoroshiro32pp with D=7 (W/2-1)
    ACC[0] = (ACC[1] ^ s0) + xoroshiro32pp() + 1;
    ACC[1] = (ACC[0] ^ s0) + rotl(xoroshiro32pp(), 1); 
    
    rng=RNG_stdin, seed=unknown
    length= 16 gigabytes (2^34 bytes), time= 359 seconds
      Test Name                         Raw       Processed     Evaluation
      [Low4/16]FPF-14+6/16:all          R=  -6.8  p =1-3.6e-6   mildly suspicious
      [Low4/16]FPF-14+6/4:all           R=  -8.2  p =1-1.3e-7   suspicious
      ...and 1901 test result(s) without anomalies
    
    rng=RNG_stdin, seed=unknown
    length= 32 gigabytes (2^35 bytes), time= 662 seconds
      Test Name                         Raw       Processed     Evaluation
      [Low1/16]Gap-16:A                 R= +24.9  p =  1.4e-17    FAIL !
      [Low1/16]Gap-16:B                 R= +22.2  p =  7.3e-19    FAIL !
      [Low1/32]Gap-16:A                 R= +68.6  p =  8.5e-60    FAIL !!!!
      [Low1/32]Gap-16:B                 R= +45.2  p =  1.4e-38    FAIL !!!
      [Low4/16]FPF-14+6/16:all          R=  -7.1  p =1-1.5e-6   mildly suspicious
      [Low4/16]FPF-14+6/4:(0,14-0)      R=  -8.1  p =1-3.8e-7   unusual
      [Low4/16]FPF-14+6/4:all           R= -10.9  p =1-2.8e-10   VERY SUSPICIOUS
      ...and 1988 test result(s) without anomalies
    
    All of these variants are at least 16x more random than XORO32 (and all with the superior equidistribution properties).
  • evanhevanh Posts: 8,975
    edited 2020-01-10 - 04:18:44
    All of these variants are at least 16x more random than XORO32 (and all with the superior equidistribution properties).
    But only the last one will fit the constraints when as a single iterated 16-bit implementation.
Sign In or Register to comment.