I am glad you got it working to see what I was seeing.
I was up till 4AM testing variations on this, but there is no simple way I can find that will get around the lsb issue (which likely is what also maintains the structural coherency).

I've been looking at the final pfreq distribution and ignoring individual pair outputs. Just to clarify, the lsb issue is that the xored bit has lower quality than if no xor?

Another issue for me with xoring the high word is that the low and high words are treated differently, with only the latter modified within the same pair. I think it might be better to xor the low word instead, but my current preference is adding 1 to the low word so that both words are one more than they would have been otherwise.

The jump action then becomes clearer: rather than jumping after every double-iteration period, the jumping occurs after every double-iteration, thus the streams are interleaved not sequential. It is jumping that produces 2D equidistribution. (This will be obvious to you and is intended mainly for others who might be interested.)

'XOROACC instruction pseudo-code:
'result_out[15:0] := result_in[31:16] + prn[15:0] + 1
'result_out[31:16] := result_out[15:0] + prn[31:16]
'P2 code equivalent
'result[31:16] = current result word, result[15:0] = previous result word
xoro32 state 'update state, PRN in next S
mov prn,0 'prn = XORO32 PRN
getword tmp,prn,#0 'tmp = prn[15:0]
add tmp,#1 'tmp = prn[15:0]+1
shl tmp,#16 'tmp = {prn[15:0]+1,16'b0}
add result,tmp 'result = {result[31:16]+prn[15:0]+1,16'bX}
movbyts result,#%%3232 'result = {result[31:16]+prn[15:0]+1,
'result[31:16]+prn[15:0]+1}
setword prn,tmp,#0 'prn = {prn[31:16],16'b0}
add result,prn 'result = {result[31:16]+prn[15:0]+1+prn[31:16],
'result[31:16]+prn[15:0]+1}
state long 1
prn long 0
result long 0
tmp long 0
'In future hardware, the above code could be replaced by
xoro32 state
xoroacc result,0 'preserve result[31:16] for next xoroacc
state long 1
result long 0

I've been looking at the final pfreq distribution and ignoring individual pair outputs. Just to clarify, the lsb issue is that the xored bit has lower quality than if no xor?

No, the lsb is the only bit still tied entirely to the underlying xoroshiro++, which has a period of 2^32-1. Modulating that bit with xor in the described fashion should double the period, but not in a way that will subtantially affect randomness tests related to that bit.[/quote]

Another issue for me with xoring the high word is that the low and high words are treated differently, with only the latter modified within the same pair. I think it might be better to xor the low word instead, but my current preference is adding 1 to the low word so that both words are one more than they would have been otherwise.

It is fine to modify the low word (instead of the high), but my testing has indicated that adding 1 (instead of xor) propagates in a way that exactly repeats every other word in the other half of the next stream. I will post my exact findings shortly.

Another issue for me with xoring the high word is that the low and high words are treated differently, with only the latter modified within the same pair. I think it might be better to xor the low word instead, but my current preference is adding 1 to the low word so that both words are one more than they would have been otherwise.

It is fine to modify the low word (instead of the high), but my testing has indicated that adding 1 (instead of xor) propagates in a way that exactly repeats every other word in the other half of the next stream. I will post my exact findings shortly.

I prefer modifying the low word and I find it easier to visualize what is happening with +1, but if ^1 is better then no problem as changes to software or hardware are trivial. As mentioned, I haven't been studying pair values to spot undesirable patterns.

Just to back up a step and 'explore undesirable patterns', I want to illustrate some of the variants output from the small version. The data makes the case for xor... and further for using odd value > 1.

Single-iterated streams of 65535 Bytes (small version xoroshiro[3,7,4,4] variants, seeds s0=0,s1=1,accH/L=0[b], first four bytes of each stream shown.[/b])
Xoroshiro16++ (non-extended, no acc)
First : 10 9A 80 AE
Second: 10 9A 80 AE 'repeats all bytes of previous and next stream
Third : 10 9A 80 AE
Fourth: 10 9A 80 AE
...
3 Last: 10 9A 80 AE
2 Last: 10 9A 80 AE
Last : 10 9A 80 AE
First : 10 9A 80 AE
Xoroshiro16++a +1L/+1H (original extended, w/acc)
First : 11 AC 2D DC
Second: 10 AB 2C DB 'correlated to previous and next stream
Third : F AA 2B DA
Fourth: E A9 2A D9
...
3 Last: 14 AF 30 DF
2 Last: 13 AE 2F DE
Last : 12 AD 2E DD
First : 11 AC 2D DC
Xoroshiro16++a +1L/+0H (half-adding)
First : 11 AB 2C DA
Second: 10 AB 2B DA 'correlated and repeat
Third : 10 AA 2B D9
Fourth: F AA 2A D9
...
3 Last: 12 AD 2D DC
2 Last: 12 AC 2D DB
Last : 11 AC 2C DB
First : 11 AB 2C DA
Xoroshiro16++a ^1L/^0H (half-xoring)
First : 11 AB 2C DA
Second: C2 5D DD 8C 'not obviously correlated to previous stream
Third : 10 AA 2B D9 'correlated to first stream
Fourth: C1 5C DC 8B 'correlated to second stream
...
3 Last: C4 5F DF 8E
2 Last: 12 AC 2D DB
Last : C3 5E DE 8D
First : 11 AB 2C DA
Xoroshiro16++a +0x73L/+0H (half-adding >1 odd value)
First : 83 1D 10 BE
Second: 10 1D 9D BE 'every output correlated to either previous or next stream
Third : 10 AA 9D 4B
Fourth: 9D AA 2A 4B
...
3 Last: F6 3 83 A4
2 Last: F6 90 83 31
Last : 83 90 10 31
First : 83 1D 10 BE
Xoroshiro16++a ^0x73L/^0H (half-xoring >1 odd value)
First : 63 FD F0 9E
Second: 3E 27 A7 84 'no obvious visual correlation to previous or next stream
Third : F0 8A 7D 2B
Fourth: CB B4 34 11
...
3 Last: 24 D 8D 6A
2 Last: D6 70 63 11
Last : B1 9A 1A F7
First : 63 FD F0 9E

If you can replicate the last one (xor of low half with 115 decimal / 0x73 hex) from what I have provided, I think it speaks for itself. Edit: Likely not necessary, as I verified exactly against your full-size output.

All but the first two examples have the following remarkable distribution properties:

1D-8 : 131070 : 256 131070 : 256 'all 256 byte values occur 131070 times
1D-16 : 255 : 256 256 : 65280 '256 word values occur 255 times and 65280 word values occur 256 times
2D-8 : 511 : 512 512 : 65024 '512 pairs of byte values occur 511 times and 65024 pairs of byte values occur 512 times

Edit: Various changes for correctness and clarity.

Abreviated PractRand results for xor 0x27D3 (an arbitrary odd number).

Forward:

rng=RNG_stdin, seed=unknown
length= 16 gigabytes (2^34 bytes), time= 598 seconds
Test Name Raw Processed Evaluation
[Low1/16]FPF-14+6/4:all R= -5.3 p =1-1.0e-4 unusual
[Low1/16]Gap-16:A R= +69.9 p = 6.3e-61 FAIL !!!!
[Low1/16]Gap-16:B R= +43.9 p = 1.7e-37 FAIL !!!
[Low4/16]FPF-14+6/4:all R= -9.5 p =1-7.5e-9 very suspicious
...and 1636 test result(s) without anomalies

Similar forward failures as compared to xor 1.

Reverse (16-bits at a time):

rng=RNG_stdin, seed=unknown
length= 32 gigabytes (2^35 bytes), time= 967 seconds
Test Name Raw Processed Evaluation
Gap-16:A R=+553.9 p = 7.1e-366 FAIL !!!!!!!
Gap-16:B R=+258.8 p = 3.2e-221 FAIL !!!!!!
...and 1716 test result(s) without anomalies

The reverse failures are remarkably fewer as compared to xor 1 (which logged nearly 150 failures at 32GB).

Abbreviated PractRand results for xor 0x27D3 (an arbitrary odd number)...

Thanks for the results.

Inverting about half the bits of prn before the low accumulation looks optimal, based on nothing more than a gut feeling, and nine seems better than eight for a 16-bit output, which is true for 27D3h = 0010011111010011b. It does have a run of five 1s and is not prime, though.

Are a few more PractRand tests worth doing?

5E2Dh = 24109d = 0101111000101101b is prime, has nine 1s with a longest run of four and is very close to 65536 * 1/e (the expected number of 16-bit outputs that occur once or never for a random generator with period 2^16-1). A couple of other, more control-type xor values are 00FFh and 5555h.

5E2Dh = 24109d = 0101111000101101b is prime, has nine 1s with a longest run of four and is very close to 65536 * 1/e (the expected number of 16-bit outputs that occur once or never for a random generator with period 2^16-1). A couple of other, more control-type xor values are 00FFh and 5555h.

I did some preliminary tests on those values (00FFh is comparatively horrible, BTW), and several things occurred to me while looking at the results:
1. The low nibble of the xor value drives the forward failures, with perhaps 3 or 5 being optimal.
2. Perhaps need to use a better seed set (i.e. I suspect than 1/0/0 is not representative enough).
3. Should force-test forward up to 32GB, and reverse up to 64G, in order to provide more data points.

I recorded the first eight acc pair outputs at the start of each of the double-iteration periods or streams for xoroshiro16++a[3,7,4,4]:

If
i = double-iteration (0-65534)
j = iteration period (0-255)
acc(i,j) = acc pair
x = xor value
then
acc(i,j) - acc(i,j+1) = x * 256 + x or (x-1) * 256 + x

Thus, low difference = x, high difference = x or x-1 and correlation between periods is obvious. I tested many values of x using xor and replacing it with add produced same results for the one or two I tested. initial pair values for each stream.

Thus, low difference = x, high difference = x or x-1 and correlation between periods is obvious. I tested many values of x using xor and replacing it with add produced same results for the one or two I tested.

The difference between +x or ^x is purely visual, regardless of whether x=1 or another odd value.
In the the last of the example outputs I gave above:

First : 63 FD F0 9E
Second: 3E 27 A7 84 'no obvious visual correlation to previous or next stream
Third : F0 8A 7D 2B
Fourth: CB B4 34 11

... It is clear that, for example Abs(FD-27)=Abs(8A-B4).
The only questions are related to how desirable it is to have the correlation appear less obvious, and will it affect statistical use case scenarios?
If the answer to either is 'yes', then xor seems to be the better choice.
For example, is a single or double stream actually more random, since the flipped bits produce new carries into ACC?
The answer seems to be 'yes' when looking at forward PractRand results at the 8GB mark where the xor >1 seems to help completion of the single stream without any suspicious results.
However, my original statement on values >1 still partially stands (though could be more broadly applied to using sum, as well):

This only visually dresses the subsequent stream pairs, so I am not convinced it is worth the effort over just xor 0/1... however, it is an option to consider...

Regarding xor values, I have attached a file of results that gives a clue about PractRand's statistical power.
Spoiler: 0x3725h would be the best constant I have tried (found by using an iterative technique to converge on a seemingly good value).

For example, is a single or double stream actually more random, since the flipped bits produce new carries into ACC?
The answer seems to be 'yes' when looking at forward PractRand results at the 8GB mark where the xor >1 seems to help completion of the single stream without any suspicious results.

I've attached the full set of 16-bit rotations for xor 3725h, which sets the bar for 8GB performance (the point at which all subsequent output is provably correlated to previous output), which can be compared to previous results.
I am using the arbitrary seeds (s0 = B49Fh, s1 = 888Eh, acc = 945Bh) for this and re-runs on +1 and ^1, which I will add here and summarize, for comparison.

Tony, given everything we have looked at, let me know which additional data / summaries / BigCrush results, etc. that you need on any specific + or xor constants.

For example, is a single or double stream actually more random, since the flipped bits produce new carries into ACC?
The answer seems to be 'yes' when looking at forward PractRand results at the 8GB mark where the xor >1 seems to help completion of the single stream without any suspicious results.

I've attached the full set of 16-bit rotations for xor 3725h, which sets the bar for 8GB performance (the point at which all subsequent output is provably correlated to previous output), which can be compared to previous results.
I am using the arbitrary seeds (s0 = B49Fh, s1 = 888Eh, acc = 945Bh) for this and re-runs on +1 and ^1, which I will add here and summarize, for comparison.

Tony, given everything we have looked at, let me know which additional data / summaries / BigCrush results, etc. that you need on any specific + or xor constants.

Chris, thanks for the PractRand results.

I can't replicate your outputs for xoroshiro16++a when xoring 73h and I see the same correlation between every adjacent stream, for 73h or any other value I tested.

In the P2 hardware, there is enough time for only two cascaded 16-bit additions, where the result from the first adder is used in the second. +1 could be done by add with carry then add, but adding large(r) values would take too long. In contrast, xoring means just inverting some of the first adder inputs and any value would work.

It is clear that +1 or ^1 are enough to hide the 1D equidistribution from PractRand so we get maximum scores, which we never achieved with xoroshiro32++. In theory, all the failures should be 16GB and I'm not sure why some are 32GB.

Adjacent streams that differ by more than 1 make me feel better, even though they are just as correlated. I suggest BigCrush testing with +1 (but no more +), ^1, ^5E2Dh and other ^ you feel are worthwhile. That should be enough to finish ++a.

I can't replicate your outputs for xoroshiro16++a when xoring 73h...

I think we should be able to agree on this... note my inadvertent non-standard seeding (s0=0,s1=1,acc=0), and post your first several outputs from the first two streams of xor 1 and xor 73h.

... In theory, all the failures should be 16GB and I'm not sure why some are 32GB.

I believe PractRand is only fully able to consider the weak lsb in isolation on the forward non-rotated test, and has insufficient tests / statistical power to find the obvious issue on other variations (reverse and/or non-zero rotations).
That is part of the reason why I consider the 8GB passing results for all variants on fwd/rev and rotations so important as a catch-all for pushing forward to BigCrush (which must find at least some issue(s) with any given variant, I presume).

Adjacent streams that differ by more than 1 make me feel better, even though they are just as correlated. I suggest BigCrush testing with +1 (but no more +), ^1, ^5E2Dh and other ^ you feel are worthwhile. That should be enough to finish ++a.

I can't replicate your outputs for xoroshiro16++a when xoring 73h...

I think we should be able to agree on this... note my inadvertent non-standard seeding (s0=0,s1=1,acc=0), and post your first several outputs from the first two streams of xor 1 and xor 73h.

I didn't notice s0 & s1 were swapped from the usual seed of 1. Also, my x86 code was calculating PRN after the state iteration. Both now fixed, I can match your data and see the problem. The period is 65535 * 2 as we are dealing with pairs and the start of your third stream is actually start of the second.

I didn't notice s0 & s1 were swapped from the usual seed of 1. Also, my x86 code was calculating PRN after the state iteration. Both now fixed, I can match your data and see the problem. The period is 65535 * 2 as we are dealing with pairs and the start of your third stream is actually start of the second.

Correct, the stream orientation was for comparison forward from the base non-extended xoroshiro stream to illustrate the progression.

I need to write an accurate XOROACC module for BigCrush... must be XOROACC, since the 32-bit format is required.
Actually, the analysis is mostly 30-bit, which will certainly put an interesting spin on most of the results. Created and running attached '+ 1' code, following Lemire's convention for testing.
Note: There is no mention of ++a, as the transition directly from ++ to XOROACC is cleaner to code in C, as no xor toggled state variable is required to apply e to every other ++ iteration.

Any plot result that is entirely within +/- 4 Sigma, with the vast majority of points within +/- 3 Sigma, is considered normal and acceptable...
^1 and +1 did not pass BigCrush, with +1 clearly being the better of the two.
^3725h and ^5E2Dh did pass BigCrush, with ^5E2Dh being marginally better (enough to be recommended).

^1 and +1 did not pass BigCrush, with +1 clearly being the better of the two.
^3725h and ^5E2Dh did pass BigCrush, with ^5E2Dh being marginally better (enough to be recommended).

Many thanks for running the BigCrush tests. The 48 logs are for 3 tests x 16 rotations? The filenames are rather cryptic. xoroshiro32++[13,5,10,9] does not pass all of BigCrush. I'm pleased that ^5E2Dh does and is the best of those tested.

I haven't looked at '2D-8' for xoroshiro16++a yet. Another thing I'm wondering about is how many iterations must be done to output all possible acc pairs.

Many thanks for running the BigCrush tests. The 48 logs are for 3 tests x 16 rotations? The filenames are rather cryptic. xoroshiro32++[13,5,10,9] does not pass all of BigCrush. I'm pleased that ^5E2Dh does and is the best of those tested.

I haven't looked at '2D-8' for xoroshiro16++a yet. Another thing I'm wondering about is how many iterations must be done to output all possible acc pairs.

I am very pleased also... throwing 48 bits of state at the problem would be almost a wasted effort if it could not pass BigCrush, which requires a minimum of about 35 or 36 state bits (if all bits are used optimally)..

I ran 16 seed sets (provided by SplitMix and starting seed documented in the filename) for each of forward bits, reverse bits, and reverse bytes, in order to look for p-value bias... so no rotations.
With about 30 of the p-values, I have applied corrections (to compensate for defects in TestU01). However, only a few of those corrections would be required when looking at only 48 BigCrush runs.
When processed through meta-analysis (required to compete with PractRand), abnormal bias was noticed in only a few of the passing p-values, mostly when the bits were reversed.
P-value # 50 (of 254 total) was the most notable offender, but only when bits were reversed: sknuth_CouponCollector: N=1,n=200000000,r=0,d=8

If you feel you must steer other aspects of the distribution (without disrupting the existing equidistribution or randomness quality) and find you cannot do so when trying other odd constants, then try this:
1. Add 1 to the second xoroshiro iteration result AND
2. Use an even xor constant on the first xoroshiro iteration (e.g. 5E2Ch)

I am happy to run a few more PractRand/BigCrush, if you have specific changes / constants in mind.

Tony, you asked "Would it be best to output acc with bits reversed?"... it did not occur to me to reverse the xoroshiro32++ result bits prior to applying the xor (on first iteration) and subsequent sum with the accumulator (on both iterations). So I ran a quick PractRand on 5E2Dh using reverse-result-bits... remember that the lsb is the master clock, and now that lsb has the highest linear complexity, so its carry is no longer discarded. Have a look at the results:

PractRand Forward XOROACC bits:
rng=RNG_stdin, seed=unknown
length= 16 gigabytes (2^34 bytes), time= 572 seconds
Test Name Raw Processed Evaluation
[Low1/16]Gap-16:A R= +68.2 p = 1.7e-59 FAIL !!!!
[Low1/16]Gap-16:B R= +44.0 p = 1.4e-37 FAIL !!!
...and 1636 test result(s) without anomalies
Practrand Reverse XOROACC bits:
rng=RNG_stdin, seed=unknown
length= 32 gigabytes (2^35 bytes), time= 943 seconds
Test Name Raw Processed Evaluation
Gap-16:A R=+553.3 p = 1.9e-365 FAIL !!!!!!!
Gap-16:B R=+256.6 p = 2.4e-219 FAIL !!!!!!
[Low4/16]FPF-14+6/16:all R= -5.4 p =1-6.9e-5 unusual
[Low4/16]FPF-14+6/4:all R= -7.9 p =1-2.7e-7 suspicious
...and 1715 test result(s) without anomalies

FPs disappeared from forward XOROACC, but popped-up on reverse XOROACC... better overall, though.

Then I got bold and tried the same reverse-result-bits experiment, combined with using the methodology described in my previous post, with 5E2Ch (an even xor value) on the first iteration and +1 on the second iteration:

PractRand Forward XOROACC bits:
rng=RNG_stdin, seed=unknown
length= 16 gigabytes (2^34 bytes), time= 581 seconds
Test Name Raw Processed Evaluation
[Low1/16]Gap-16:A R= +68.2 p = 1.7e-59 FAIL !!!!
[Low1/16]Gap-16:B R= +44.0 p = 1.4e-37 FAIL !!!
...and 1636 test result(s) without anomalies
Practrand Reverse XOROACC bits:
rng=RNG_stdin, seed=unknown
length= 32 gigabytes (2^35 bytes), time= 943 seconds
Test Name Raw Processed Evaluation
Gap-16:A R=+558.7 p = 4.9e-369 FAIL !!!!!!!
Gap-16:B R=+262.3 p = 3.4e-224 FAIL !!!!!!
...and 1715 test result(s) without anomalies

Suddenly all FP issues at the failure point for the forward and reverse XOROACC bits have nearly evaporated (though one or two 'unusual' might 'randomly' pop in at ~5e-5 on either forward at 16GB or reverse at 32GB, with some initial seeds).
This leaves only the endemic GAP issue (which likely has no simplistic solution).

Sorry to drag this out... but often these kinds of seemingly incremental (pun not originally intended) improvements positively impact other aspects of the distribution properties that you may find desirable. Attached PractRand rotations and will attach BigCrush results of the latter modification that uses [reverse ++, e= ^5E2Ch, f = +1.] Edit: BigCrush results on these variants is a step backward, so will not post. Will do reverse ++, e= ^5E2Dh, f = 0 also... Attached, and also looks very good (e.g. the zip file is even smaller, indicating fewer total failures). Edited multiple times for clarity (and humorous effect).

This leaves only the endemic GAP issue (which likely has no simplistic solution).

GAP issue is deferred to > 16GB PractRand failure by:
Accumulate forward bits of xoroshiro++ result Xor'd with 5E2Dh on first iteration (i.e. normal) and then accumulate reverse bits of xoroshiro++ result on second iteration. Same equidistribution properties.

I will post all PractRand and BigCrush results on this variant tomorrow, then I am finally done playing with this. Edit: PractRand results posted... looks very good! Edit: BigCrush results posted... looks good, with nearly ideal failure rate (i.e. about half of all runs on a good PRNG should show 1 fail). However, meta-analysis shows p-value #50 bias is now visible in both forward and reverse bits (but not reverse bytes).

Accumulate forward bits of xoroshiro++ result Xor'd with 5E2Dh on first iteration (i.e. normal) and then accumulate reverse bits of xoroshiro++ result on second iteration. Same equidistribution properties.

I will post all PractRand and BigCrush results on this variant tomorrow, then I am finally done playing with this. Edit: PractRand results posted... looks very good!

Thanks for the results. They do look good, 32GB across the board.

Another thing I'm wondering about is how many iterations must be done to output all possible acc pairs.

I haven't had time to look at this yet for xoroshiro16++a, perhaps over the weekend. I don't have enough RAM to test xoroshiro32++a[13,5,10,9] but is it possible that different variants might need different numbers of iterations before all 2^32 outputs occur? If so, the lower the better?

... I don't have enough RAM to test xoroshiro32++a[13,5,10,9] but is it possible that different variants might need different numbers of iterations before all 2^32 outputs occur? If so, the lower the better?

I am not sure the answer, but regarding RAM, several weeks ago I did a 32-bit distribution test that simply accumulated counts in the ranges 0x00000000-0x0000FFFF and the same range again after rotating 16 bits... I guessed that this was a good proxy for the full distribution.

Edit: My intuition says: If 1 full stream is 2^32 double-iterations (16GB), then each stream should accomplish 50% coverage of the unsampled 32-bit values, which should deplete, on average, by the 33rd stream. Therefore, in my example above, complete coverage of 2^17 specific 32-bit values should occur in about 18 streams.
My intuition is obviously wrong... decimation of unsampled values is by 1/e, so complete depletion should occur by about the 23rd stream, so therefore complete coverage of 2^16 specific values should occur in about 12 streams.
I tested this logic out on the small version and it works as expected (at about 12 and 6 streams respectively, in that case).

Attached is one final submission: e=0x5E13,f=RevBits(0). Includes PractRand rotations, BigCrush results w/meta-analysis plot. 5E13h is a prime.

The PractRand results have no mod3n artifact at 32GB fail on Fwd00, and the BigCrush meta-analysis is the first one I have found for f=Revbits where the p-value #50 bias is only on reversed bits and is minimized to the point where it does not manifest in the full meta-analysis of combined forward, reverse-bits and reverse-bytes.

This might be a keeper... too many primes to test for e. Final hex digit of 3 or 5 seems to minimize most artifacts, if I were to try more.

... I don't have enough RAM to test xoroshiro32++a[13,5,10,9] but is it possible that different variants might need different numbers of iterations before all 2^32 outputs occur? If so, the lower the better?

I am not sure the answer, but regarding RAM, several weeks ago I did a 32-bit distribution test that simply accumulated counts in the ranges 0x00000000-0x0000FFFF and the same range again after rotating 16 bits... I guessed that this was a good proxy for the full distribution.

Edit: My intuition says: If 1 full stream is 2^32 double-iterations (16GB), then each stream should accomplish 50% coverage of the unsampled 32-bit values, which should deplete, on average, by the 33rd stream. Therefore, in my example above, complete coverage of 2^17 specific 32-bit values should occur in about 18 streams.

Rather than one word constant, I think the 64K outputs with both words the same would be better, as it would test the immediate effect of the xor and bit reversal in all of these outputs. I'll make some time this week and also check the equidistribution of low and high bytes separately for xoroshiro16++a to see whether they are the same. Some time ago Evan ran some PractRand tests of xoroshiro32++ with every other word skipped, which is how XORO32 or XOROACC might be used.

Corrected above information and generated evidence that supports the idea that XOROACC32 should generally perform as well as XOROACC16 actually does in producing all possible outputs.

Edit: My intuition says: If 1 full stream is 2^32 double-iterations (16GB), then each stream should accomplish 50% coverage of the unsampled 32-bit values, which should deplete, on average, by the 33rd stream. Therefore, in my example above, complete coverage of 2^17 specific 32-bit values should occur in about 18 streams.
My intuition is obviously wrong... decimation of unsampled values is by 1/e, so complete depletion should occur by about the 23rd stream, so therefore complete coverage of 2^16 specific values should occur in about 12 streams.
I tested this logic out on the small version and it works as expected (at about 12 and 6 streams respectively, in that case).

Obviously there will be some variation among candidates, but they all should be in the vicinity of being statistically plausible. In any case, should be explicitly tested (since PractRand nor BigCrush will find that information).

...with every other word skipped, which is how XORO32 or XOROACC might be used.

Do you mean with every other dword skipped (as I can't visualize how an individual word could be skipped in the actual output, since I assumed that only fully processed pairs of words are available to be read from XOROACC by the end user)?

...with every other word skipped, which is how XORO32 or XOROACC might be used.

Do you mean with every other dword skipped (as I can't visualize how an individual word could be skipped in the actual output, since I assumed that only fully processed pairs of words are available to be read from XOROACC by the end user)?

It's possible the end user might ignore the low or high word of every XORO32/XOROACC output and the randomness of this could be tested by skipping every other word in PractRand or BigCrush, 2^33-2 iterations producing 2^32-1 used words. Instead of being interleaved, the even and odd words would be separated by a full single-iteration period.

It's possible the end user might ignore the low or high word of every XORO32/XOROACC output and the randomness of this could be tested by skipping every other word in PractRand or BigCrush, 2^33-2 iterations producing 2^32-1 used words. Instead of being interleaved, the even and odd words would be separated by a full single-iteration period.

Of course... did some quick non-rotational tests: High and low words each fail (with no precursor issues) in PractRand on GAP at 16GB, regardless of whether revbits was used on the 2nd xoroshiro++ iteration.

This makes some sense, as every other word benefits from the bit-mixing of the unused word, but still races twice as fast toward failure. However, I would expect BigCrush would notice more issues with the non-revbits version that PractRand somehow missed.

Edit: Instead of BigCrush, I ran gjrand on 32GB of the various combinations of both/high/low words, forward/reverse output bits, and forward/reverse 2nd iteration using ^5E13. Results attached, w/readme.
There really is not much very exciting (i.e. statistically differentiating) going on with any of the results of gjrand or BigCrush, other than the conclusion that only PractRand seems to be aware that an improvement occurred (within its statistical purview) with GAP on 32-bit outputs (regarding only the lsb) when only the 2nd xoroshiro++ iteration result bits are reversed. My best guess is that reversing the bits of the 2nd iteration is a good thing (beyond the lsb improvement), but it is also a fine point that would not necessarily be relevant in a purely software implementation.

Additionally, by random trial I have demonstrated that there are some odd values of e that might be very slightly better than ^5E13h, but many more that are substantially worse.
Some of those e values are so bad they cause abort issues to occur in PractRand pre0.95 (and might actually trigger a NaN lock-up in 0.94), which I am beta testing only in the random e trial. Edit: Readme file in attachment mistakenly mentions (twice) 'reversed before xor', which should instead read 'reversed before sum'.

My best guess is that reversing the bits of the 2nd iteration is a good thing (beyond the lsb improvement), but it is also a fine point that would not necessarily be relevant in a purely software implementation.

Probably true, but I have discovered that reversing the bits is not exactly what creates the perceived improvement, but more specifically it is the introduction of the msb into the lsb position...
Using ROL16(xoroshiro32++, 1) instead of RevBits16(xoroshiro32++) in the second iteration would accomplish the same goal and be more explicitly compatible/performant with a software only implementation, if desired.

More later, plus I will attach high and low word rotations once I have re-vetted some of the xor constants using this approach, which will take some time.

## Comments

1,414I've been looking at the final pfreq distribution and ignoring individual pair outputs. Just to clarify, the lsb issue is that the xored bit has lower quality than if no xor?

Another issue for me with xoring the high word is that the low and high words are treated differently, with only the latter modified within the same pair. I think it might be better to xor the low word instead, but my current preference is adding 1 to the low word so that both words are one more than they would have been otherwise.

The jump action then becomes clearer: rather than jumping after every double-iteration period, the jumping occurs after every double-iteration, thus the streams are interleaved not sequential. It is jumping that produces 2D equidistribution.

(This will be obvious to you and is intended mainly for others who might be interested.)1,414231It is fine to modify the low word (instead of the high), but my testing has indicated that adding 1 (instead of xor) propagates in a way that exactly repeats every other word in the other half of the next stream. I will post my exact findings shortly.

1,414I prefer modifying the low word and I find it easier to visualize what is happening with +1, but if ^1 is better then no problem as changes to software or hardware are trivial. As mentioned, I haven't been studying pair values to spot undesirable patterns.

1,414Initial state = 1, initial acc = 0

+1 or ^1 applied to low word of acc

231Edit: Likely not necessary, as I verified exactly against your full-size output.All but the first two exampleshave the following remarkable distribution properties:Edit: Various changes for correctness and clarity.231231Forward: Similar forward failures as compared to xor 1.

Reverse (16-bits at a time): The reverse failures are remarkably fewer as compared to xor 1 (which logged nearly 150 failures at 32GB).

1,414Inverting about half the bits of prn before the low accumulation looks optimal, based on nothing more than a gut feeling, and nine seems better than eight for a 16-bit output, which is true for 27D3h = 0010011111010011b. It does have a run of five 1s and is not prime, though.

Are a few more PractRand tests worth doing?

5E2Dh = 24109d = 0101111000101101b is prime, has nine 1s with a longest run of four and is very close to 65536 * 1/e (the expected number of 16-bit outputs that occur once or never for a random generator with period 2^16-1). A couple of other, more control-type xor values are 00FFh and 5555h.

I used QuickBASIC, not the P2 code I posted!

I intend to check xoroshiro16++a with xor value of 115d, on Friday with luck.

231comparatively horrible, BTW), and several things occurred to me while looking at the results:1. The low nibble of the xor value drives the forward failures, with perhaps 3 or 5 being optimal.

2. Perhaps need to use a better seed set (i.e. I suspect than 1/0/0 is not representative enough).

3. Should force-test forward up to 32GB, and reverse up to 64G, in order to provide more data points.

1,414Thus, low difference = x, high difference = x or x-1 and correlation between periods is obvious. I tested many values of x using xor and replacing it with add produced same results for the one or two I tested. initial pair values for each stream.

231In the the last of the example outputs I gave above: ... It is clear that, for example Abs(FD-27)=Abs(8A-B4).

The only questions are related to how desirable it is to have the correlation appear less obvious, and will it affect statistical use case scenarios?

If the answer to either is 'yes', then xor seems to be the better choice.

For example, is a single or double stream actually more random, since the flipped bits produce new carries into ACC?

The answer seems to be 'yes' when looking at forward PractRand results at the 8GB mark where the xor >1 seems to help completion of the single stream without any suspicious results.

However, my original statement on values >1 still partially stands (though could be more broadly applied to using sum, as well): Regarding xor values, I have attached a file of results that gives a clue about PractRand's statistical power.

Spoiler: 0x3725h would be the best constant I have tried (found by using an iterative technique to converge on a seemingly good value).

231I am using the arbitrary seeds (s0 = B49Fh, s1 = 888Eh, acc = 945Bh) for this and re-runs on +1 and ^1, which I will add here and summarize, for comparison.

Tony, given everything we have looked at, let me know which additional data / summaries / BigCrush results, etc. that you need on any specific + or xor constants.

Added more variants using above seeds.1,414Chris, thanks for the PractRand results.

I can't replicate your outputs for xoroshiro16++a when xoring 73h and I see the same correlation between every adjacent stream, for 73h or any other value I tested.

In the P2 hardware, there is enough time for only two cascaded 16-bit additions, where the result from the first adder is used in the second. +1 could be done by add with carry then add, but adding large(r) values would take too long. In contrast, xoring means just inverting some of the first adder inputs and any value would work.

It is clear that +1 or ^1 are enough to hide the 1D equidistribution from PractRand so we get maximum scores, which we never achieved with xoroshiro32++. In theory, all the failures should be 16GB and I'm not sure why some are 32GB.

Adjacent streams that differ by more than 1 make me feel better, even though they are just as correlated. I suggest BigCrush testing with +1 (but no more +), ^1, ^5E2Dh and other ^ you feel are worthwhile. That should be enough to finish ++a.

231That is part of the reason why I consider the 8GB passing results for all variants on fwd/rev and rotations so important as a catch-all for pushing forward to BigCrush (which must find at least some issue(s) with any given variant, I presume). I agree... will do.

1,414I didn't notice s0 & s1 were swapped from the usual seed of 1. Also, my x86 code was calculating PRN after the state iteration. Both now fixed, I can match your data and see the problem. The period is 65535 * 2 as we are dealing with pairs and the start of your third stream is actually start of the second.

231I need to write an accurate XOROACC module for BigCrush... must be XOROACC, since the 32-bit format is required.

Actually, the analysis is mostly 30-bit, which will certainly put an interesting spin on most of the results.

Created and running attached '+ 1' code, following Lemire's convention for testing.Note: There is no mention of ++a, as the transition directly from ++ to XOROACC is cleaner to code in C, as no xor toggled state variable is required to apply e to every other ++ iteration.

231Lemire's naming convention: Any plot result that is entirely within +/- 4 Sigma, with the vast majority of points within +/- 3 Sigma, is considered normal and acceptable...

^1 and +1 did not pass BigCrush, with +1 clearly being the better of the two.

^3725h and ^5E2Dh did pass BigCrush, with ^5E2Dh being marginally better (enough to be recommended).

1,414Many thanks for running the BigCrush tests. The 48 logs are for 3 tests x 16 rotations? The filenames are rather cryptic. xoroshiro32++[13,5,10,9] does not pass all of BigCrush. I'm pleased that ^5E2Dh does and is the best of those tested.

I haven't looked at '2D-8' for xoroshiro16++a yet. Another thing I'm wondering about is how many iterations must be done to output all possible acc pairs.

231, which requires a minimum of about 35 or 36 state bits (if all bits are used optimally)..I ran 16 seed sets (provided by SplitMix and starting seed documented in the filename) for each of forward bits, reverse bits, and reverse bytes, in order to look for p-value bias... so no rotations.

With about 30 of the p-values, I have applied corrections (to compensate for defects in TestU01). However, only a few of those corrections would be required when looking at only 48 BigCrush runs.

When processed through meta-analysis

(required to compete with PractRand), abnormal bias was noticed in only a few of the passing p-values, mostly when the bits were reversed.P-value # 50 (of 254 total) was the most notable offender, but only when bits were reversed: sknuth_CouponCollector: N=1,n=200000000,r=0,d=8

If you feel you must steer other aspects of the distribution (without disrupting the existing equidistribution

or randomness quality) and find you cannot do so when trying other odd constants, then try this:1. Add 1 to the second xoroshiro iteration result AND

2. Use an even xor constant on the first xoroshiro iteration (e.g. 5E2Ch)

I am happy to run a few more PractRand/BigCrush, if you have specific changes / constants in mind.

231remember that the lsb is the master clock, and now that lsb has the highest linear complexity, so its carry is no longer discarded. Have a look at the results: FPs disappeared from forward XOROACC, but popped-up on reverse XOROACC... better overall, though.Then I got bold and tried the same reverse-result-bits experiment, combined with using the methodology described in my previous post, with 5E2Ch (an even xor value) on the first iteration and +1 on the second iteration: Suddenly all FP issues at the failure point for the forward and reverse XOROACC bits have nearly evaporated (though one or two 'unusual' might 'randomly' pop in at ~5e-5 on either forward at 16GB or reverse at 32GB, with some initial seeds).

This leaves only the endemic GAP issue (which likely has no simplistic solution).

Sorry to drag this out... but often these kinds of seemingly incremental

(pun not originally intended)improvements positively impact other aspects of the distribution properties that you may find desirable.AttachedPractRand rotations and will attach BigCrush results of the latter modification that uses [reverse ++, e= ^5E2Ch, f = +1.]Edit: BigCrush results on these variants is a step backward, so will not post.Will do reverse ++, e= ^5E2Dh, f = 0 also... Attached, and also looks very good (e.g. the zip file is even smaller, indicating fewer total failures).Edited multiple times for clarity (and humorous effect).231Accumulate forward bits of xoroshiro++ result Xor'd with 5E2Dh on first iteration (i.e. normal) and then accumulate reverse bits of xoroshiro++ result on second iteration. Same equidistribution properties.

I will post all PractRand and BigCrush results on this variant tomorrow, then I am finally done playing with this.

Edit: PractRand results posted... looks very good!Edit: BigCrush results posted... looks good, with nearly ideal failure rate (i.e. about half of all runs on a good PRNG should show 1 fail). However, meta-analysis shows p-value #50 bias is now visible in both forward and reverse bits (but not reverse bytes).1,414I haven't had time to look at this yet for xoroshiro16++a, perhaps over the weekend. I don't have enough RAM to test xoroshiro32++a[13,5,10,9] but is it possible that different variants might need different numbers of iterations before all 2^32 outputs occur? If so, the lower the better?

231Edit: My intuition says: If 1 full stream is 2^32 double-iterations (16GB), then each stream should accomplish 50% coverage of the unsampled 32-bit values, which should deplete, on average, by the 33rd stream. Therefore, in my example above, complete coverage of 2^17 specific 32-bit values should occur in about 18 streams.My intuition is obviously wrong... decimation of unsampled values is by 1/e, so complete depletion should occur by about the 23rd stream, so therefore complete coverage of 2^16 specific values should occur in about 12 streams.

I tested this logic out on the small version and it works as expected (at about 12 and 6 streams respectively, in that case).

231The PractRand results have no mod3n artifact at 32GB fail on Fwd00, and the BigCrush meta-analysis is the first one I have found for f=Revbits where the p-value #50 bias is only on reversed bits and is minimized to the point where it does not manifest in the full meta-analysis of combined forward, reverse-bits and reverse-bytes.

This might be a keeper... too many primes to test for e. Final hex digit of 3 or 5 seems to minimize most artifacts, if I were to try more.

1,414Rather than one word constant, I think the 64K outputs with both words the same would be better, as it would test the immediate effect of the xor and bit reversal in all of these outputs. I'll make some time this week and also check the equidistribution of low and high bytes separately for xoroshiro16++a to see whether they are the same. Some time ago Evan ran some PractRand tests of xoroshiro32++ with every other word skipped, which is how XORO32 or XOROACC might be used.

231Edited for clarity.1,414It's possible the end user might ignore the low or high word of every XORO32/XOROACC output and the randomness of this could be tested by skipping every other word in PractRand or BigCrush, 2^33-2 iterations producing 2^32-1 used words. Instead of being interleaved, the even and odd words would be separated by a full single-iteration period.

231This makes some sense, as every other word benefits from the bit-mixing of the unused word, but still races twice as fast toward failure. However, I would expect BigCrush would notice more issues with the non-revbits version that PractRand somehow missed.

Edit: Instead of BigCrush, I ran gjrand on 32GB of the various combinations of both/high/low words, forward/reverse output bits, and forward/reverse 2nd iteration using ^5E13. Results attached, w/readme.

There really is not much very exciting (i.e. statistically differentiating) going on with any of the results of gjrand or BigCrush, other than the conclusion that only PractRand seems to be aware that an improvement occurred (within its statistical purview) with GAP on 32-bit outputs (regarding only the lsb) when only the 2nd xoroshiro++ iteration result bits are reversed. My best guess is that reversing the bits of the 2nd iteration is a good thing (beyond the lsb improvement), but it is also a fine point that would not necessarily be relevant in a purely software implementation.

Additionally, by random trial I have demonstrated that there are some odd values of e that might be very slightly better than ^5E13h, but many more that are substantially worse.

Some of those e values are so bad they cause abort issues to occur in PractRand pre0.95 (and might actually trigger a NaN lock-up in 0.94), which I am beta testing only in the random e trial.

Edit: Readme file in attachment mistakenly mentions (twice) 'reversed before xor', which should instead read 'reversed before sum'.231Using ROL16(xoroshiro32++, 1) instead of RevBits16(xoroshiro32++) in the second iteration would accomplish the same goal and be more explicitly compatible/performant with a software only implementation, if desired.

More later, plus I will attach high and low word rotations once I have re-vetted some of the xor constants using this approach, which will take some time.