They're only warnings so compile and link still succeeds. I haven't tried using it yet.

Yes, I eventually got the compile to succeed, but then found the results were missing some entries that were present in 0.94... some of that was intentional, but some was not and needed the source to be hand modified.
All the notes are in that discussion.
I hope it is finalized soon.

Strong scores but has a weak spot with rotated 8-bit window. Will need search for a better candidate.

I am unable to replicate your 8-bit or 16-bit window results using this scrambler seeded { 1, 0, 0 }: Edit: I found you are using rotr instead of rotl, which allows me to replicate your results.

state[2] += rotl(s0 + s1, 9) + 1;

In the case of the 8-bit window, I am attempting an easy fail at 32MB in PractRand 0.94 by:
1. Rotating the 16-bit output by 11 bits.
2. Concatenating the low 8 bits with those from the next output shifted left 8 bits.
3. Buffer the result from #2 and repeat.
4. Using this code with a 16-bit buffer array:

I only just remembered why so many of my old scripts are wildly out of date. Tony's approach of using frequency distribution sped up the culling process a lot.

1: Identify full period engine candidates.
2: Run distribution tests on all those.
3: Select a decent number of top distribution scoring to run gridding with Practrand.
4: The best few of those can then be Crush tested.

I ran lots of distribution tests on various candidates (attached to several of my previous posts), but none on the one candidate you focused on with the missing + s0.
That simplest/fastest candidate using three 32-bit state variables is not expected to show any weakness until after 256TB... but I have not finished testing it to be sure it gets even that far.
I was already at least vaguely aware that it was unsuitable for 16-bit state variables, which is another reason why Tony and I put little, if any, effort into discussing it.
Using 16-bit state variables, I will stand by the [13 5 10 9] scrambler:

state[2] += rotl(s0 + s1, 9) + s0 + 1; // toggle + 1 on/off each iteration to increase equidistribution

... which should fail at 8GB under the same conditions that caused the 'no + s0' version to fail at 32MB.
At other windows, most fails would be at 16GB (or perhaps greater), without other changes.

All of that being said, I have not searched for other ABCD candidates that might redeem the 'no + s0' version for 16-bit variable use.

Evan, are you confident that the existing XORO32 32-bit output accessed as two 16-bit words cannot be used to feed the scrambler above, as Tony and I had hoped for creating a XOROACC extension?

As far as Crush testing goes, all variants Tony and I discussed passed BigCrush.

state[2] += rotl(s0 + s1, 9) + s0 + 1; // toggle + 1 on/off each iteration to increase equidistribution

Ignoring the accumulation component, that fits in the the execution speed criteria of XORO32. Although, toggling the +1 would likely be dropped. That is effectively an extra state bit.

Evan, are you confident that the existing XORO32 32-bit output accessed as two 16-bit words cannot be used to feed the scrambler above, as Tony and I had hoped for creating a XOROACC extension?

I have no idea. The prop's first SIMD instruction. EDIT: Add a bunch of 2x16-bit SIMD's. Catch is they'd need to be single operand. The double operand instruction spaces are all but gone.

state[2] += rotl(s0 + s1, 9) + s0 + 1; // toggle + 1 on/off each iteration to increase equidistribution

Ignoring the accumulation component, that fits in the the execution speed criteria of XORO32. Although, toggling the +1 would likely be dropped. That is effectively an extra state bit.

Evan, are you confident that the existing XORO32 32-bit output accessed as two 16-bit words cannot be used to feed the scrambler above, as Tony and I had hoped for creating a XOROACC extension?

I have no idea. The prop's first SIMD instruction. EDIT: Add a bunch of 2x16-bit SIMD's. Catch is they'd need to be single operand. The double operand instruction spaces are all but gone.

The extra state bit increases the period to 2^49-2^17 in order to achieve the 32-bit near-perfect equidistribution... a nice touch, but obviously only if possible.

The SIMD on x86/x64 is part of the reason that this approach (without toggling the +1) might be extremely desirable, as a double sum is a good way to automatically invoke compilers to generate an LEA instruction (where the +1 is free) and the accumulator is a good way to defer the scrambler in the pipeline, since its result is not immediately required. To wit:

Which, if properly vetted, might make the accumulator versions the fastest all-purpose (non-cryptographic) PRNGs available that also:
1. Pass all known randomness tests.
2. Have fixed long period that is an easy, mathematically provable, extension to the base xoroshiro engine properties.
3. Achieve near-perfect 2 dimensional equidistribution.
4. Produces small and easily comprehensible source code.
5. Are non-trivially invertible.
6. Fairly easily implementable in hardware.
7. Have inherited Jump support using existing xoroshiro subroutine.
8. Allows for extension to near-perfect 1 dimensional equidistribution at twice the output word size (by toggling the +1 on and off, at the expense of less speed / higher implementation complexity).
9. Reduces sparse state (a.k.a. escape-from-zero) recovery time and occurrence substantially.

The 64-bit variable 'a' versions benchmarked here (I'm using Tony's naming convention), that we have been discussing in 16-bit variable form, are using a slightly more CPU pipeline optimized version of the original code that I posted for the 32-bit variable version (which I just updated and linked back to).
I will release the final code in a few months, if all goes well.

That's looking nice. Good even when up against algorithms that are using magic multiplies. Huh, I wonder if avoiding the use of multiplies allows tighter loop times. There's bound to be more execution stages needed for big multiplies. In a tight loop that'll have an impact even in fat x86 CPUs.

Right, I've just been checking progress of the full set distribution run of Xoroshiro64plusacc, I've named it Xoroacc32+1 for now. It's about 2/3rd way through. After sorting, candidate [13 5 10 9] is sitting at position 380. That's well outside the normal group cut-off that I'd then run gridding on. It has sum-ranking of 1812.

[6 2 3 7] is in the lead at the moment with sum-ranking of 118.

SumRanking formula is: sumRank = pRank * 2 + zRank + nzRank

Right, I've just been checking progress of the full set distribution run of Xoroshiro64plusacc, I've named it Xoroacc32+1 for now. It's about 2/3rd way through. After sorting, candidate [13 5 10 9] is sitting at position 380. That's well outside the normal group cut-off that I'd then run gridding on. It has sum-ranking of 1812.

[6 2 3 7] is in the lead at the moment with sum-ranking of 118.

SumRanking formula is: sumRank = pRank * 2 + zRank + nzRank

Using the name xoroshiro32++a for xoroshiro32++ plus accumulator for the time being at least, two new constants are needed: e for low word increment and f for high word left rotation. e must be odd for 2D (32-bit) equidistribution, but f > 0 is not vital. Chris, does f = 1 improve PractRand scores when lsb is bit 0?

Evan, what exactly are you testing and for how many iterations? As defined in the previous paragraph, is it xoroshiro32++a [a,b,c,d,1,1] or xoroshiro32++a [a,b,c,d,1,0]? To achieve full equidistribution requires 2^32-1 * 65536 double-iterations, however at just * 1 [13,5,10,9] should be bettered by other [a,b,c,d] depending on initial state and acc.

f = 1 is easy for hardware, but a little tricky in software. For f = 0, I think the code below is the shortest possible for a software emulation of the new XOROACC functionality (six extra instructions). Could you please test this to see if results are the same as in C?

P.S.
I prefer equally weighting, i.e. sumRank = pRank + zRank + nzRank.
Note that xoroshiro32++a [a,b,c,d,e,f) has two a's and I'm open to other suggestions, e.g. xoroshiro32++acc [a,b,c,d,e,f].

'XOROACC instruction pseudo-code:
'acc_out[15:0] := acc_in[31:16] + prn[15:0] + 1
'acc_out[31:16] := acc_out[15:0] + prn[31:16]
'P2 code equivalent
'acc[31:16] = current acc word, acc[15:0] = previous acc word
'N.B. rotl(prn[31:16],1) omitted for speed
xoro32 state 'iterate state twice, PRN in next S
mov prn,0 'prn = XORO32 PRN
shr acc,#16 'acc = {16'b0,acc[31:16]}
add acc,prn 'acc = {16'bX,acc[31:16]+prn[15:0]}
add acc,#1 'acc = {16'bX,acc[31:16]+prn[15:0]+1}
movbyts acc,#%%1010 'acc = {acc[31:16]+prn[15:0]+1,acc[31:16]+prn[15:0]+1}
bitl prn,#111100000 'prn = {prn[31:16],16'b0}
add acc,prn 'acc = {acc[31:16]+prn[15:0]+1+prn[31:16],acc[31:16]+prn[15:0]+1}
state long 1
prn long 0
acc long 0
'In future hardware, the above code could be replaced by
xoro32 state
xoroacc acc,0 'preserve acc[31:16] for next xoroacc
state long 1
acc long 0

Evan, regarding your comments, what is even more surprising, is that doing the following 'magic' will still result in slightly faster execution than that of xoroshiro128++ or Lehmer:

I need something other than my current Xeons to confirm these results on a more modern architecture... still considering a modern Threadripper.
The speed delta I am seeing between + and ++ matches the 15% that Seba reported, which provides some comfort.

Thanks for looking at other constants... that you have found a 118 without using '+ s0' is very surprising to me!
An 'ideal' sum the way you are doing it is equal to about 92, or 80 when using Tony's equal weighting, but could vary substantially from about 40 all the way up to about 160 on these OR a perfect PRNG, depending entirely on seed.

Extra detail:
The SumRanking formula I had been using in my code, as I recall, was something like: average(pranksum/N1 + zranksum/N2 + nzranksum/N3)
N1, N2 and N3 are are either degrees of freedom (n-1) or the number of values added (n), resulting in a weighted average between 0.5 and 2.0.
Since I was combining data from multiple randomly seeded runs, I was using 14, 24 and 51, respectively, since my column sums were decidely non-zero out to that range.
The larger values are due a larger number of x in freq(x) since larger state and/or more random PRNGs will commonly bump into those values across multiple runs.
Obviously none of this 'extra detail' is required for fast vetting of candidates.

Using the name xoroshiro32++a for xoroshiro32++ plus accumulator for the time being at least, two new constants are needed: e for low word increment and f for high word left rotation. e must be non-zero for 2D (32-bit) equidistribution, but f > 0 is not vital. Chris, does f = 1 improve PractRand scores when lsb is bit 0?

Tony, we found e must be odd, normally 1, with larger (usually prime, like AD65h) values only considered for improving multiple (more than 1) double-iterated stream freqs (which was a questionable pursuit, but not entirely without merit).
I noticed f = 1 for rotation of the second iteration solved multiple issues simultaneously, primarily improving PractRand/freqs and enabling de-correlation of the second stream for double-iterated use.
I later found that the rotation would suffice for the second iteration, in lieu of using a reverse-bits approach that would also (like rotl 1) superimpose the msb from the first iteration on the lsb for the second.
Other non-zero f values near 1 might provide some minor benefit, but generally as f increases, we lose the benefit seen in PractRand, but I never looked a freqs in that case.

Here's the automated distribution results. I had to append the .txt suffix for the forum attachment.

EDIT: And a zip of the individual case outputs

PS: Tony, these are all single iterated only. It doesn't look like I ever got the double iterating ported to new automation. Ah, found a long forgotten Bash variable to set to enable double iterating. I should make that something for the parameters I think.

EDIT2: Also attached the source file for my algorithm implementation.

I've just modified the algorithm source. It doesn't affect the results, just reduces number of explicit masks applied when testing a non-native C word size.

I looked at the TXT file and all of those look good to me if I am interpreting the data properly.

Take a random example from half way down the list:
Xoroacc32(16)+1 [2 11 3 8] 12 17 40 779 441 558 2557

The pfChi is perfect... it should be about 12.
The zfChi is only a little low... it should be about 20.
The nzfChi is only a little low... it should be about 44.

Edit: Each perfect value is approximately the total number of > 0 freqs summed, with each individual value that created the freqsum being about 1 (non-linear, but usually in the range 0.05 to 5).

Gridding takes 40 mins I think per candidate. Err, it'll take longer with larger Practrand scores, maybe a few hours. I'll let you select what you want gridded.

Since you are doing 'no + s0', in order to test a theory I would be interested in full grids on:
[9 8 14 15]
[13 9 8 7]
[4 8 5 1] // should benefit from the same type of logic as rotl 1 being used on second iteration???

I picked these from near the top of a chi square sum list I created using data from '(ideal freq sum - actual freq sum)^2/ideal freq sum' where ideal is the degrees of freedom I believe you are using.

Edit: DF for zfChi is likely 20, not 22. Updated list.
Sums less than 0.5 (which is kind of overly restrictive):

This list might have a side effect of predicting that [13 5 10 2] might be about the best choice if '+ s0' were used.
I am curious how [13 5 10 2] might have fared as the original XORO32. List was slightly incorrect.

Here's gridding results of that whole list. Well, as it was 14 hours ago at least. And a quick summary of the grids. Wasn't as high scoring as I was expecting. EDIT: Added the individual Practrand reports as well.

There's a problem with the C algorithm implementation. The scrambler is not being applied to the immediate result output. That means the first result is always bare state data.

I was just setting up for a new run with +s0+1 when it sunk in. It really needs to return the accumulator state[2] directly, not s2. s2 becomes redundant.

Here's updated algorithm:

#include "xorotypes.h"
static rngword_t s[3] = { 1, 0, 0 }; // states 0 and 1 must not be seeded everywhere zero, state 2 should not to be used in isolation as a stream selector
rngword_t nextrnd( void )
{
rngword_t s0 = s[0];
rngword_t s1 = s[1];
#if (ACCUM_SIZE == 16) || (ACCUM_SIZE == 32) || (ACCUM_SIZE == 64)
s[2] += rotl(s0 + s1, CONSTANT_D) + s0 + 1;
s1 ^= s0;
s[0] = rotl( s0, CONSTANT_A ) ^ s1 ^ (s1 << CONSTANT_B);
#else
s[2] = (s[2] + rotl(s0 + s1, CONSTANT_D) + 1) & ACCUM_MASK;
s1 ^= s0;
s[0] = rotl( s0, CONSTANT_A ) ^ s1 ^ (s1 << CONSTANT_B) & ACCUM_MASK;
#endif
s[1] = rotl(s1, CONSTANT_C);
return s[2];
}

The 'ideal' freq values of '0' are what cause a potential issue. If all of the freqs were '0' then the PRNG that produced them would be a perfectly uniform generator.
However, 'uniform' <> 'random'. Truly random numbers are not completely uniform, otherwise you could predict their frequency distribution at some point in the period.
This was great for xoroshiro32++, where it was known where to stop and look for uniformity... single and/or double iterated period.
Therefore, that helped constants search, since it was a struggle to achieve full period randomness on freqs.
However, now there will be an impact, since the entire period cannot be inspected directly.
So now a new entry is required in the table:

#a b c d pfreq0 pfreq1 pfreq2 pfreq3 pfreq4 pfreq5 pfreq6 pfreq7 pfreq8 pfreq9 pfreq10 pfreq11 pfreq12 pfreq13 total
#normal ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~14 (Actually closer to 12, see below).

The above is an approximation, as are the following: as each '~1' is most likely to be in the range of 0.15 to 6, and the sum '~14' is likely to be in the range of 3 to 30. The actual value is normally one less to express Degrees of Freedom, but in this case it is approximately even one less than that due to the final point being so close to zero it causes skew... so effectively ~12 if Yates correction is applied.
With only a single trial, high freq values near the end may carry excess weight, so stopping before actual '0's appear can minimize bias, which would eventually dissipate with the addition of more 'randomly seeded trials'* of the same PRNG. (see below)
In that case you can sum the freqs and apply expected * #trials. For example (sum_of_pfreq0's - expected * count_of_pfreq0's)^2/(expected * count_of_pfreq0's).
Normally, for long period PRNGs, you might just run your trial for a longer period and adjust your expected values accordingly.
However, with these 'a' or 'acc' PRNGs, it is a known issue that each stream is correlated to the next, and subsequent streams are likely more strongly correlated with each other... and perhaps in freq results.
*Thus the recommendation to not use consecutive streams for statistical evaluation.

Edit: Clarify how degrees of freedom is calculated, and suggest why it is not the number of data points - 1 in all cases.

Given what I just posted above, the odds of missing the best candidate, without testing all possible candidates, is very high.
That being said, I am completely surprised with what [11 2 6 11] accomplished.

This +1a design (as you call it) was intended to exceed xoroshiro+ in most all relevant considerations, which I believe it will do with most constants of sufficient linear complexity.
The ++1a design (using '+ s0') was intended to do the same when compared with xoroshiro++, while offering the extended benefit of allowing the randomness period to be more predictable (e.g. 8GB for ++1a), thus compatible with a broader range of constants.

I mostly followed Tony by asking for exact code that I could incorporate.

I did the same thing, if you look back through the posts over the last several months, and ended up porting most of your freq code after multiple discussions with Tony.
Then I started looking at the freq data from high bits of xoroshiro128+ and **, which prompted me to agree with Tony that PractRand, Crush, etc. are insufficient for these evaluations.
However, the status quo of how to interpret the freq data no longer applied, and after several false starts, it sunk in reasonably well.

The biggest disappointment was finding that although the freqs for certain more complex variants would look good across multiple independent and consecutive streams, there was no getting around the fact that the streams are correlated, no matter how I tried to mask it... with the exception of stumbling onto rotl(xoroshiro32pp, 1) for use on the second iteration, which granted a solid doubling of randomness period (which I am still somewhat suspicious of, but cannot demonstrate any substantial correlation fault when [13, 5, 10, 9] was used).

I was curious, so ran 16-bit aperture rotations on [11 2 6 11] +1a (no '+ s0') with the additional modification of applying rotl(, 1) to the second iteration:

The accumulator algorithm is existing XORO32 followed by new XOROACC. XORO32 on its own and XORO32 + XOROACC both generate a pseudo-32-bit PRN by concatenating two successive 16-bit ones. Let prn = XORO32 PRN and acc = new XOROACC PRN, then

where e = new odd constant (and left rotation of prn1 by new constant f omitted for now for simplicity). Note that xoroshiro32++ state variable is not needed to calculate acc once prn is known.

A stream is a unique sequence of 2^32-1 XOROACC iterations (i.e. 2^32-1 double iterations or 2^33-2 single iterations of xoroshiro32++) with a unique initial acc value. To be more precise, the high word of acc (acc1) is unique as it represents the current 16-bit output, whereas acc0 is the previous output. Thus there are 2^16 streams and the total XOROACC period is 2^48-2^16 double or 2^49-2^17 single iterations. This is the number of iterations needed for state and acc1 to repeat their initial values at the same time.

Each stream is correlated with all the others, i.e. the difference between the outputs of any two streams for a particular iteration i is the same for all i. One stream leads automatically to the next one and there wil be perfect 1D (16-bit) equidistribution, including all zeroes, over the total period. However, for 2D (32-bit) equidistribution that is near perfect, the minimum requirement is that an odd value (e) must be added to either acc0 or acc1 just once for each stream output and e = 1 is the smallest and easiest value to use.

To achieve 2D equidistribution, each stream must be output in full, with low and high words repeated but swapped over after 2^32-1 single iterations. However, the "-1" means acc will be one less than it should be at this midway point of the double iteration period and so adding 1 will correct this error. This + e adjustment could be done at any time during every other 2^32-1 single iteration period, but that would mean testing the state all the time. It is quicker to do the + e once every double iteration, which has the effect of interleaving all the streams by continually jumping from one to another. Each stream will be output fully over the total period.

Evan, I'm not quite sure what you've been testing but I think you should be doing the following:

Amend the pfreq/zfreq/nzfreq C program used to create the binary data files that showed [13,5,10,9] was the best overall candidate when combining PractRand scores and frequency distributions. You will need to add code for acc as shown above and use the 32-bit acc value as the index to the 4GB array instead of prn. There must be 2^32-1 double iterations.

I still prefer equally weighting of pfreq/zfreq/nzfreq.

Bear in mind that different initial acc values will affect how closely the distributions match the ideal ones, therefore more than one initial acc should be tested. Other [a,b,c,d] will be better than [13,5,10,9] for init_acc = X, but could [13,5,10,9] do just as well if not better when init_acc = Y?

Do pfreq/zfreq/nzfreq ever repeat for different init_acc or is there always one best value? If the latter, do Chi-square scores fall away smoothly either side of this peak init_acc? Probably the only practicable way to test this is to look at xoroshiro16++ with all possible 256 acc1 values.

Some clarification on equidistribution as it relates to the full double-iterated implementation:
We know that XORO32 has near-perfect 1D 16-bit equidistribution, with two missing zeros (since two periods of underlying xoroshiro).
XOROACC (in any discussed form) will always have perfect 1D 16-bit and near-perfect 2D 16-bit equidistribution.
However, without removing the +1 from (A), and/or using rotl(,1) on (B), the second iteration, we cannot achieve the near-perfect 1D 32-bit equidistribution.
That is because about half of the 2D 16-bit distribution is not aligned on same 32-bit boundary as the 32-bit output DWord.
Doing A and/or B forces the 32-bit alignment, with both A and B together likely producing the highest degree of randomness.
This is obviously a nice addition, since a true improvement on XORO32 would benefit from a concert of simultaneous improvements, if practical.
Since the 32-bit 1D is not exactly perfect (65536 32-bit values occur once less often), a value judgement had to be made.
Tony and I agreed in favor of, based on the same logic that allows for the missing 0s in XORO32...
Not ideal, but close enough to be disregarded as being a problem, since it would be very difficult to detect that any given 32-bit value did not occur exactly the same number of times as any other value.
That, as opposed to allowing only a complete normal 32-bit distribution (which was an incomplete normal with XORO32), where it is much easier to prove that all 32-bit values do/will not occur equally.

Regarding +1:
We know that +1 and ^1 are interchangeable (or any odd value, likely prime if larger values were used), and can be substituted with the same effects discussed above.
The only requirement is that if xor is used, it be applied to PRN prior to addition with ACC.
However, it is possible that the randomness will be affected, sometimes in a positive way, thus can be checked for comparison with +, if desired.

Evan, I'm not quite sure what you've been testing but I think you should be doing the following:

Amend the pfreq/zfreq/nzfreq C program used to create the binary data files that showed [13,5,10,9] was the best overall candidate when combining PractRand scores and frequency distributions. You will need to add code for acc as shown above and use the 32-bit acc value as the index to the 4GB array instead of prn. There must be 2^32-1 double iterations.

Yep all that is automatic with each posted algorithm. I just plug it in now and the rest is sorted for me.

I make a new algorithm file then name it matching a new entry in the config. Here's my plugging in config lines (bottom one is newest):

## Comments

297All the notes are in that discussion.

I hope it is finalized soon.

297Edit: I found you are using rotr instead of rotl, which allows me to replicate your results.In the case of the 8-bit window, I am attempting an easy fail at 32MB in PractRand 0.94 by:1. Rotating the 16-bit output by 11 bits.

2. Concatenating the low 8 bits with those from the next output shifted left 8 bits.

3. Buffer the result from #2 and repeat.

4. Using this code with a 16-bit buffer array: I also tried a single element char buffer: Both with same 4GB fail result.

Do you see something obviously wrong in my approach (or in your code)?

Edit: I found my issue. See above.11,29111,2911: Identify full period engine candidates.

2: Run distribution tests on all those.

3: Select a decent number of top distribution scoring to run gridding with Practrand.

4: The best few of those can then be Crush tested.

297That simplest/fastest candidate using three 32-bit state variables is not expected to show any weakness until after 256TB... but I have not finished testing it to be sure it gets even that far.

I was already at least vaguely aware that it was unsuitable for 16-bit state variables, which is another reason why Tony and I put little, if any, effort into discussing it.

Using 16-bit state variables, I will stand by the [13 5 10 9] scrambler: ... which should fail at 8GB under the same conditions that caused the 'no + s0' version to fail at 32MB.

At other windows, most fails would be at 16GB (or perhaps greater), without other changes.

All of that being said, I have not searched for other ABCD candidates that might redeem the 'no + s0' version for 16-bit variable use.

Evan, are you confident that the existing XORO32 32-bit output accessed as two 16-bit words cannot be used to feed the scrambler above, as Tony and I had hoped for creating a XOROACC extension?

As far as Crush testing goes, all variants Tony and I discussed passed BigCrush.

11,291I have no idea. The prop's first SIMD instruction. EDIT: Add a bunch of 2x16-bit SIMD's. Catch is they'd need to be single operand. The double operand instruction spaces are all but gone.

297The SIMD on x86/x64 is part of the reason that this approach (without toggling the +1) might be extremely desirable, as a double sum is a good way to automatically invoke compilers to generate an LEA instruction (where the +1 is free) and the accumulator is a good way to defer the scrambler in the pipeline, since its result is not immediately required. To wit: Which, if properly vetted, might make the accumulator versions the fastest all-purpose (non-cryptographic) PRNGs available that also:

1. Pass all known randomness tests.

2. Have fixed long period that is an easy, mathematically provable, extension to the base xoroshiro engine properties.

3. Achieve near-perfect 2 dimensional equidistribution.

4. Produces small and easily comprehensible source code.

5. Are non-trivially invertible.

6. Fairly easily implementable in hardware.

7. Have inherited Jump support using existing xoroshiro subroutine.

8. Allows for extension to near-perfect 1 dimensional equidistribution at twice the output word size (by toggling the +1 on and off, at the expense of less speed / higher implementation complexity).

9. Reduces sparse state (a.k.a. escape-from-zero) recovery time and occurrence substantially.

The 64-bit variable 'a' versions benchmarked here (I'm using Tony's naming convention), that we have been discussing in 16-bit variable form, are using a slightly more CPU pipeline optimized version of the original code that I posted for the 32-bit variable version (which I just updated and linked back to).

I will release the final code in a few months, if all goes well.

Edit: Expanded documentation11,291Right, I've just been checking progress of the full set distribution run of Xoroshiro64plusacc, I've named it Xoroacc32+1 for now. It's about 2/3rd way through. After sorting, candidate [13 5 10 9] is sitting at position 380. That's well outside the normal group cut-off that I'd then run gridding on. It has sum-ranking of 1812.

[6 2 3 7] is in the lead at the moment with sum-ranking of 118.

SumRanking formula is: sumRank = pRank * 2 + zRank + nzRank

1,752Using the name xoroshiro32++a for xoroshiro32++ plus accumulator for the time being at least, two new constants are needed: e for low word increment and f for high word left rotation. e must be odd for 2D (32-bit) equidistribution, but f > 0 is not vital. Chris, does f = 1 improve PractRand scores when lsb is bit 0?

Evan, what exactly are you testing and for how many iterations? As defined in the previous paragraph, is it xoroshiro32++a [a,b,c,d,1,1] or xoroshiro32++a [a,b,c,d,1,0]? To achieve full equidistribution requires 2^32-1 * 65536 double-iterations, however at just * 1 [13,5,10,9] should be bettered by other [a,b,c,d] depending on initial state and acc.

f = 1 is easy for hardware, but a little tricky in software. For f = 0, I think the code below is the shortest possible for a software emulation of the new XOROACC functionality (six extra instructions). Could you please test this to see if results are the same as in C?

P.S.

I prefer equally weighting, i.e. sumRank = pRank + zRank + nzRank.

Note that xoroshiro32++a [a,b,c,d,e,f) has two a's and I'm open to other suggestions, e.g. xoroshiro32++acc [a,b,c,d,e,f].

297The speed delta I am seeing between + and ++ matches the 15% that Seba reported, which provides some comfort.

Thanks for looking at other constants... that you have found a 118 without using '+ s0' is very surprising to me!

An 'ideal' sum the way you are doing it is equal to about 92, or 80 when using Tony's equal weighting, but could vary substantially from about 40 all the way up to about 160 on these OR a perfect PRNG, depending entirely on seed.

Extra detail:

The SumRanking formula I had been using in my code, as I recall, was something like: average(pranksum/N1 + zranksum/N2 + nzranksum/N3)

N1, N2 and N3 are are either degrees of freedom (n-1) or the number of values added (n), resulting in a weighted average between 0.5 and 2.0.

Since I was combining data from multiple randomly seeded runs, I was using 14, 24 and 51, respectively, since my column sums were decidely non-zero out to that range.

The larger values are due a larger number of x in freq(x) since larger state and/or more random PRNGs will commonly bump into those values across multiple runs.

Obviously none of this 'extra detail' is required for fast vetting of candidates.

297I noticed f = 1 for rotation of the second iteration solved multiple issues simultaneously, primarily improving PractRand/freqs and enabling de-correlation of the second stream for double-iterated use.

I later found that the rotation would suffice for the second iteration, in lieu of using a reverse-bits approach that would also (like rotl 1) superimpose the msb from the first iteration on the lsb for the second.

Other non-zero f values near 1 might provide some minor benefit, but generally as f increases, we lose the benefit seen in PractRand, but I never looked a freqs in that case.

11,291EDIT: And a zip of the individual case outputs

PS: Tony, these are all single iterated only. It doesn't look like I ever got the double iterating ported to new automation. Ah, found a long forgotten Bash variable to set to enable double iterating. I should make that something for the parameters I think.

EDIT2: Also attached the source file for my algorithm implementation.

11,29111,291297Take a random example from half way down the list:

Xoroacc32(16)+1 [2 11 3 8] 12 17 40 779 441 558 2557

The pfChi is perfect... it should be about 12.

The zfChi is only a little low... it should be about 20.

The nzfChi is only a little low... it should be about 44.

Edit: Each perfect value is approximately the total number of > 0 freqs summed, with each individual value that created the freqsum being about 1 (non-linear, but usually in the range 0.05 to 5).11,291297[9 8 14 15]

[13 9 8 7]

[4 8 5 1] // should benefit from the same type of logic as rotl 1 being used on second iteration???

I picked these from near the top of a chi square sum list I created using data from '(ideal freq sum - actual freq sum)^2/ideal freq sum' where ideal is the degrees of freedom I believe you are using.

Edit: DF for zfChi is likely 20, not 22. Updated list.Sums less than 0.5 (which is kind of overly restrictive): This list might have a side effect of predicting that [13 5 10 2] might be about the best choice if '+ s0' were used.

I am curious how [13 5 10 2] might have fared as the original XORO32. List was slightly incorrect.

11,29111,291I was just setting up for a new run with +s0+1 when it sunk in. It really needs to return the accumulator state[2] directly, not s2. s2 becomes redundant.

Here's updated algorithm:

11,291and this one is Xoroshiro128++1a

297Just in case I wasn't clear on the totals I'm using, look at this familiar table: The 'ideal' freq values of '0' are what cause a potential issue. If all of the freqs were '0' then the PRNG that produced them would be a perfectly uniform generator.

However, 'uniform' <> 'random'. Truly random numbers are not completely uniform, otherwise you could predict their frequency distribution at some point in the period.

This was great for xoroshiro32++, where it was known where to stop and look for uniformity... single and/or double iterated period.

Therefore, that helped constants search, since it was a struggle to achieve full period randomness on freqs.

However, now there will be an impact, since the entire period cannot be inspected directly.

So now a new entry is required in the table: The above is an approximation, as are the following: as each '~1' is most likely to be in the range of 0.15 to 6, and the sum '~14' is likely to be in the range of

3to 30.The actual value is normally one less to express Degrees of Freedom, but in this case it is approximately even one less than that due to the final point being so close to zero it causes skew... so effectively ~12 if Yates correction is applied.With only a single trial, high freq values near the end may carry excess weight, so stopping before actual '0's appear can minimize bias, which would eventually dissipate with the addition of more 'randomly seeded trials'* of the same PRNG. (see below)

In that case you can sum the freqs and apply expected * #trials. For example (sum_of_pfreq0's - expected * count_of_pfreq0's)^2/(expected * count_of_pfreq0's).

Normally, for long period PRNGs, you might just run your trial for a longer period and adjust your expected values accordingly.

However, with these 'a' or 'acc' PRNGs, it is a known issue that each stream is correlated to the next, and subsequent streams are likely more strongly correlated with each other... and perhaps in freq results.

*Thus the recommendation to not use consecutive streams for statistical evaluation.

Edit: Clarify how degrees of freedom is calculated, and suggest why it is not the number of data points - 1 in all cases.11,291I mostly followed Tony by asking for exact code that I could incorporate.

297That being said, I am completely surprised with what [11 2 6 11] accomplished.

This +1a design (as you call it) was intended to exceed xoroshiro+ in most all relevant considerations, which I believe it will do with most constants of sufficient linear complexity.

The ++1a design (using '+ s0') was intended to do the same when compared with xoroshiro++, while offering the extended benefit of allowing the randomness period to be more predictable (e.g. 8GB for ++1a), thus compatible with a broader range of constants.

297Then I started looking at the freq data from high bits of xoroshiro128+ and **, which prompted me to agree with Tony that PractRand, Crush, etc. are insufficient for these evaluations.

However, the status quo of how to interpret the freq data no longer applied, and after several false starts, it sunk in reasonably well.

The biggest disappointment was finding that although the freqs for certain more complex variants would look good across multiple independent and consecutive streams, there was no getting around the fact that the streams are correlated, no matter how I tried to mask it... with the exception of stumbling onto rotl(xoroshiro32pp, 1) for use on the second iteration, which granted a solid doubling of randomness period (which I am still somewhat suspicious of, but cannot demonstrate any substantial correlation fault when [13, 5, 10, 9] was used).

11,2912971,752where e = new odd constant (and left rotation of prn1 by new constant f omitted for now for simplicity). Note that xoroshiro32++ state variable is not needed to calculate acc once prn is known.

A stream is a unique sequence of 2^32-1 XOROACC iterations (i.e. 2^32-1 double iterations or 2^33-2 single iterations of xoroshiro32++) with a unique initial acc value. To be more precise, the high word of acc (acc1) is unique as it represents the current 16-bit output, whereas acc0 is the previous output. Thus there are 2^16 streams and the total XOROACC period is 2^48-2^16 double or 2^49-2^17 single iterations. This is the number of iterations needed for state and acc1 to repeat their initial values

at the same time.Each stream is correlated with all the others, i.e. the difference between the outputs of any two streams for a particular iteration i is the same for all i. One stream leads automatically to the next one and there wil be perfect 1D (16-bit) equidistribution, including all zeroes, over the total period. However, for 2D (32-bit) equidistribution that is near perfect, the minimum requirement is that an odd value (e) must be added to either acc0 or acc1 just once for each stream output and e = 1 is the smallest and easiest value to use.

To achieve 2D equidistribution, each stream must be output in full, with low and high words repeated but swapped over after 2^32-1 single iterations. However, the "-1" means acc will be one less than it should be at this midway point of the double iteration period and so adding 1 will correct this error. This + e adjustment could be done at any time during every other 2^32-1 single iteration period, but that would mean testing the state all the time. It is quicker to do the + e once every double iteration, which has the effect of interleaving all the streams by continually jumping from one to another. Each stream will be output fully over the total period.

Evan, I'm not quite sure what you've been testing but I think you should be doing the following:

Amend the pfreq/zfreq/nzfreq C program used to create the binary data files that showed [13,5,10,9] was the best overall candidate when combining PractRand scores and frequency distributions. You will need to add code for acc as shown above and use the 32-bit acc value as the index to the 4GB array instead of prn. There must be 2^32-1

doubleiterations.I still prefer equally weighting of pfreq/zfreq/nzfreq.

Bear in mind that different initial acc values will affect how closely the distributions match the ideal ones, therefore more than one initial acc should be tested. Other [a,b,c,d] will be better than [13,5,10,9] for init_acc = X, but could [13,5,10,9] do just as well if not better when init_acc = Y?

Do pfreq/zfreq/nzfreq ever repeat for different init_acc or is there always one best value? If the latter, do Chi-square scores fall away smoothly either side of this peak init_acc? Probably the only practicable way to test this is to look at xoroshiro16++ with all possible 256 acc1 values.

297We know that XORO32 has near-perfect 1D 16-bit equidistribution, with two missing zeros (since two periods of underlying xoroshiro).

XOROACC (in any discussed form) will always have perfect 1D 16-bit and near-perfect 2D 16-bit equidistribution.

However, without removing the +1 from (A), and/or using rotl(,1) on (B), the second iteration, we cannot achieve the near-perfect 1D 32-bit equidistribution.

That is because about half of the 2D 16-bit distribution is not aligned on same 32-bit boundary as the 32-bit output DWord.

Doing A and/or B forces the 32-bit alignment, with both A and B together likely producing the highest degree of randomness.

This is obviously a nice addition, since a true improvement on XORO32 would benefit from a concert of simultaneous improvements, if practical.

Since the 32-bit 1D is not exactly perfect (65536 32-bit values occur once less often), a value judgement had to be made.

Tony and I agreed in favor of, based on the same logic that allows for the missing 0s in XORO32...

Not ideal, but close enough to be disregarded as being a problem, since it would be very difficult to detect that any given 32-bit value did not occur exactly the same number of times as any other value.

That, as opposed to allowing only a complete normal 32-bit distribution (which was an incomplete normal with XORO32), where it is much easier to prove that all 32-bit values do/will not occur equally.

Regarding +1:

We know that +1 and ^1 are interchangeable (or any odd value, likely prime if larger values were used), and can be substituted with the same effects discussed above.

The only requirement is that if xor is used, it be applied to PRN prior to addition with ACC.

However, it is possible that the randomness will be affected, sometimes in a positive way, thus can be checked for comparison with +, if desired.

11,291Yep all that is automatic with each posted algorithm. I just plug it in now and the rest is sorted for me.

I make a new algorithm file then name it matching a new entry in the config. Here's my plugging in config lines (bottom one is newest):

11,291