It looks like you are exploiting a property of the incremental state that I had neglected (based partially on the 1-lag idea I mentioned, but couldn't make work).
I am not sure that it is possible to improve upon the distribution you have achieved.
I need to review this further once I have replicated your results.

Taken as a whole, I think this is more than just noteworthy when viewed in the proper context as an extensibility option.

'now
XORO32 state 'prn[31:0] into next S
MOV prn,0-0 'prn = existing PRN
'future option
XORO32 state 'prn[31:0] into next S
XOROACC acc,0-0 'acc = improved PRN
'xoroacc operation
'acc_out[15:0] := prn[15:0] + acc_in[15:0] + e*
'acc_out[31:16] := prn[31:16] + acc_out[15:0] + e
'e is new constant, maybe 1, to go with unchanged a,b,c,d
'* force e = 0 if state = mid_state

acc holds two 16-bit accumulator values, with current one in the high word and previous one in the low word. Therefore xoroacc code above is wrong and it should read:

I have a good understanding of how to get 2D equidistribution with the new xoroshiro++ and accumulator algorithm. Is xoroshiro++a an acceptable short name?

Defining two new constants, e and f, the accumulator output is given by:

acc := prn + acc + e for all states except one (2^w-2)
acc := prn + acc + f for one state
w = state width
prn = xoroshiro++ PRN output
N.B. e <> f

An easy-to-implement scheme is e = 0 and f = 1, but let's consider the case of e = f = 0 first. No midpoint correction is needed for the first half of a stream to be followed directly by the second half. However, at the end there is a wraparound to the beginning, therefore one stream will repeat ad infinitum. A non-zero f is required at one particular state in order to jump into a different stream. Note that this second stream is output in its entirety, due to the wraparound effect, before jumping to a third stream, etc.

Another simple scheme is e = 1. As the xoroshiro period is one less than a power of 2, after half the stream has been output the value of acc will be one less than it should be to continue into the second half and if this is desired f should be 2. At the end of the stream the next one begins automatically (no wraparound). There is no reason why f must be 2, it could be 0 or some large number to jump into a distant stream. If e = 1 and f <> 2, then the first half of one stream is followed by a jump to the second half of another, which flows into the first half of the following stream, then another jump, etc. This means that the two halves of every stream could be output with gaps between them, possibly large gaps.

The 'best' values of e and f could be determined by the level of hardware complexity and the results of testing the randomness with PractRand or TestUI.

Since that is sufficiently descriptive, I am good with it if it refers to a generic version of the code.
Then, I assume XOROACC, or similar, will refer to your specific double-iterated 2D-recovery modification?

An obvious question that might be asked with regard to either of the above is: Why not xoshiro (or PRNG 'X') with a larger state?
I believe some obvious answers would be with regard to reduced footprint, testing time, and predictable stream randomness.
Additional answers might be derived from previous posts.

Regarding the stream randomness, I have found that '++' is slightly more complicated than necessary when used as a segue to the 'a' version.
The logic is fairly easy to follow (even if not fully complete):
Why was '+' not used for XORO32 originally? Due to endemic linear complexity, Hamming-weight and binary matrix rank issues of LSBs.
All of those issues were hidden well by '++', but '++a', by definition, includes another '+' (i.e. 'a') which can be used to negate the need for one of the original '+'.
It would not be immediately apparent, but the answer of how to accomplish that is from the lesson learned from attempting 'D=8' with xoroshiro32++.
Why did that not work? Presumably because it creates a type of parity between the two halves when '+ x' is added again. [Forgive the poor description].
Once the 'a' became available, that parity can be automatically broken by the constant stream of carry bits across the half-rotation divide... thus '+ x' can be dropped entirely.
In fact, one can just substitute the original '-1' (or &HFFFF) for '+x', since '+x' creates an unnecessary constructive (i.e. unwanted) interference.
Suffice it to say that it is possible for each individual stream to complete without sufficient hint that it will do so ahead of time (e.g. like watching a demolition video in reverse).
There is more to it than that... I will post more later.

Since that is sufficiently descriptive, I am good with it if it refers to a generic version of the code.
Then, I assume XOROACC, or similar, will refer to your specific double-iterated 2D-recovery modification.

To be pedantic, specifying engine & scrambler & accumulator separately, it should be xoroshiro[a,b,c]++[d]a[e,f].

XOROACC is the suggested name for a future enhanced P2 or P3 instruction that does the accumulation and is not intended to have a wider application.

An obvious question that might be asked with regard to either of the above, is 'why not xoshiro?' (or PRNG 'X') with a larger state.
I believe the obvious answers would be with regard to reduced footprint, testing time, and predictable stream randomness.

Regarding the latter, I have found that '++' is slightly more complicated than necessary when used as a segue to the 'a' version.
The logic is fairly simple:
Why was '+' not used for XORO32 originally? Due to endemic linear complexity, Hamming-weight and binary matrix rank issues of LSBs.
All of those issues were hidden well by '++', but '++a', by definition, includes another '+' (i.e. 'a') which can be used constructively to remove one of the original '+'.
It would not be immediately apparent, but the answer of how to accomplish that is from the lesson learned from attempting 'D=8' with xoroshiro32++.
Why did that not work? Presumably because it creates a type of parity between the two halves when '+ x' is added again. [Forgive the poor description].
Once the 'a' became available, that parity can be automatically broken by the constant stream of carry bits across the half-rotation divide... thus '+ x' can be dropped entirely.
In fact, one can just substitute the original '-1' (or &HFFFF) for '+x'. There is more to it than that... I will post more later.

I can't honestly say whether or not a xoroshiro+a would be good enough quality as I don't run PractRand or TestUI, but our starting point for the accumulator extension is XORO32 and that uses ++. We did use plain xoroshiro+ early on but dropped the poor bit 0 so the output was only 15-bit. It was only after we had 16 good bits, thanks to either parity or ++, that Chip had the idea of double-iteration to give a full 32-bit output.

xoshiro came much too late in the P2 process and anyway fitting the entire state in one 32-bit register is an important design feature.

To be explicit: 'result = result + rotl(s0 + s1, 8 ) - 1' (i.e. no '+ s0' is present)
This is 'nearly' equivalent in randomness to its more complicated sibling we have been discussing. I will post some PractRand results, just for comparison...
But for my purpose, it brings the ability to make a few additional small changes, which will simultaneously:
1. Better exploit parallelism (e.g. nearly saturate the typical x86/x64 pipeline)
2. Approach the theoretical limit of stream randomness... stream is probably not such a good term unless qualified, as some might incorrectly take it to imply a lack of correlation with other streams.*
3. Provide speed equivalent to xoroshiro+, given enough spare thread / memory bandwidth availability (which is usually, but not always, the case).

*[The accumulator should not be used by itself as a stream selector without also changing starting bits of the underlying xoroshiro state that are unique for each stream].

A few further observations on the single-iterated distribution, which I believe are correct:
1. +1 in the base ++a code seems to prevent all triplets from re-occuring.
2. -1 in the base ++a code seems to create 3D, but 1 excess for each triplet than would be expected.
3. +1 or -1 in the modified scrambler where '+ s0' has been removed creates perfect 3D, in the expected proportion.

I am not sure if knowing that would help with double-iterated distributions. Let me know if you see anything wrong with my assessment. Edit: The above is not correct.

To be explicit: 'result = result + rotl(x + y, 8 ) - 1' (i.e. no '+ x' is present)

To avoid confusion, I think it would be better to keep the state names s0 and s1, as introduced by David Blackman and Sebastiano Vigna, not x and y. Also, I prefer result for the scrambler result to be consistent with their C code and acc for the new accumulator. Adding new constants e and f, the code is as follows:

xoroshiro++a

result = rotl(s0 + s1, d) + s0
acc = acc + result + e ;for all states bar one
acc = acc + result + f ;for one state

Could you run some PractRand tests for xoroshiro32[13,5,10,9]++a? I'm interested mainly in e > 0, f = 0 and e = 0, f > 0 and as they are simple to do in hardware and software in particular e = 1, f = 0 and e = 0, f = 1. I don't need the results for all 16 possible low bits, just 0 and 8 would do.

The jump state (where f replaces e) should be the same as the initial state. As there is only one jump for every two single-iteration periods, there are two alternative sequences when the state = initial/jump state:

1. + e, + f, + e, + f, ...
2. + f, + e, + f, + e, ...

This is 'nearly' equivalent in randomness to its more complicated sibling we have been discussing. I will post some PractRand results, just for comparison...
But for my purpose, it brings the ability to make a few additional small changes, which will simultaneously:
1. Better exploit parallelism (e.g. nearly saturate the typical x86/x64 pipeline)
2. Approach the theoretical limit of stream randomness... stream is probably not such a good term unless qualified, as some might incorrectly take it to imply a lack of correlation with other streams.*
3. Provide speed equivalent to xoroshiro+, given enough spare thread / memory bandwidth availability (which is usually, but not always, the case).

*[The accumulator should not be used by itself as a stream selector without also changing starting bits of the underlying xoroshiro state that are unique for each stream].

I'm happy to call a stream something else, e.g. sequence or set, the shorter and snappier the better. Let's say set for now, if you don't like stream. I define set0 and set1 as having the same initial state and initial acc = 0 and 1, respectively.

To avoid confusion, I think it would be better to keep the state names s0 and s1, as introduced by David Blackman and Sebastiano Vigna, not x and y. Also, I prefer result for the scrambler result to be consistent with their C code and acc for the new accumulator. Adding new constants e and f, the code is as follows:

xoroshiro++a

result = rotl(s0 + s1, d) + s0
acc = acc + result + e ;for all states bar one
acc = acc + result + f ;for one state

I have corrected some of the 'x/y' above to 's0/s1'... copy/paste issue, where 'x' is a holdout from Marsaglia that I use in my own VB code base.
My own code has evolved from your above example, but I can translate yours (using your nomenclature, which is suitable):

xoroshiro++a (optimized for x86/64)

result = acc
tmp = rotl(s0 + s1, d) + s0
acc = acc + tmp + e ;for all states bar one
acc = acc + tmp + f ;for one state

The seed correlation to first result was defended by Vigna for performance reasons, and also since one can burn the first result(s) if they do no trust their seed source.
In your code, 'acc' is the output (since 'result' includes no reference to 'acc'), whereas in mine 'result' is the output.
The above can be simplified further by adding 'acc' to 'tmp' (e.g. tmp = acc + rotl(s0 + s1, d) + s0 ), reducing the final 'acc' value to a state-counter conditional expression calculated from tmp and e/f.

Could you run some PractRand tests for xoroshiro32[13,5,10,9]++a? I'm interested mainly in e > 0, f = 0 and e = 0, f > 0 and as they are simple to do in hardware and software in particular e = 1, f = 0 and e = 0, f = 1. I don't need the results for all 16 possible low bits, just 0 and 8 would do...

My servers are loaded fully right now running TestU01 on fully alternate code at 192-bit state, but I will slot your most recent code in on PractRand as soon as I confirm I have each e/f implemented properly with correct distributions.

I'm happy to call a stream something else, e.g. sequence or set, the shorter and snappier the better. Let's say set for now, if you don't like stream. I define set0 and set1 as having the same initial state and initial acc = 0 and 1, respectively.

Regarding stream nomenclature, Seba criticized Melissa's use of the term and/or implementation on her PCG ext variants, which have a similar stream correlation issue. Thus, needs to be well-documented (regardless of terminology used).
In this case (unlike PCG, as I recall), the randomness of an individual stream is improved substantially by the extension, thus can more easily be defended when randomness, equidistribution, easy extensibilty, and speed (i.e. using my 'result = acc' modification above) are taken as a whole.

To avoid confusion, I think it would be better to keep the state names s0 and s1, as introduced by David Blackman and Sebastiano Vigna, not x and y. Also, I prefer result for the scrambler result to be consistent with their C code and acc for the new accumulator. Adding new constants e and f, the code is as follows:

xoroshiro++a

result = rotl(s0 + s1, d) + s0
acc = acc + result + e ;for all states bar one
acc = acc + result + f ;for one state

I have corrected some of the 'x/y' above to 's0/s1'... copy/paste issue, where 'x' is a holdout from Marsaglia that I use in my own VB code base.
My own code has evolved from your above example, but I can translate yours (using your nomenclature, which is suitable):

xoroshiro++a (optimized for x86/64)

result = acc
tmp = rotl(s0 + s1, d) + s0
acc = acc + tmp + e ;for all states bar one
acc = acc + tmp + f ;for one state

The seed correlation to first result was defended by Vigna for performance reasons, and also since one can burn the first result(s) if they do no trust their seed source.
In your code, 'acc' is the output (since 'result' includes no reference to 'acc'), whereas in mine 'result' is the output.
The above can be simplified further by adding 'acc' to 'tmp' (e.g. tmp = acc + rotl(s0 + s1, d) + s0 ), reducing the final 'acc' value to a state-counter conditional expression calculated from tmp and e/f.

The order of my code above matches that for XORO32 and a new XOROACC. Replacing = with := to indicate sequential instructions, the pseudo-code and equivalent P2 instructions could be written as:

result := rotl(s0 + s1, d) + s0 ;XORO32 state 'state = {s1,s0}
acc := acc + result + e_or_f ;XOROACC acc

The XORO32 logic calculates result in parallel with the state iteration to save time, in the same way as the C code. This result is placed into the source field of the next instruction, here XOROACC, because there is only one destination for XORO32, here state.

The result = acc in your code means the first result is not a function of the state, which is incorrect. A software implementation could be written as:

result := rotl(s0 + s1, d) + s0 ;use entry state
acc := acc + result + e_or_f
s0 := ... ;generate next state
s1 := ...

I prefer result for the normal xoroshiro++ PRN and acc for the exhanced PRN.

P.S. Thanks in advance for running the PractRand tests.

Re PractRand testing, I expect you'd like to also try e = -1, f= 0 and e = 0, f = -1. It might be interesting to test 0 and one other value, e.g. 24109 which is prime and has about the same number of 1 and 0 bits.

I believe I understand how it works well enough... I just need to make it work.

I came home to a crashed server... apparently the new RAM I was testing is faulty (under about 64 total threads from both PractRand and BigCrush simultaneously), so backed down Intel TurboBoost using Intel XTU (hoping that will magically help) until I can swap the RAM back out.

But, others don't share my appreciation for awe-inspiring processors.

Apparently those "others" do not like the noise of my current servers... a couple days down-time required to move servers and get back on the network.

I was having trouble implementing 'e/f' in the small version to get exactly the expected distribution, per last instructions.
I might have figured out why, but will have to test later. I'll let you know if I cannot get it sorted.

For my verification on the small version, are you sure 2^25-2^9 and 2^24-2^8 are the correct periods for and 'e/f' implemented double-iterated byte and word 1D distributions?
On the small version I was seeing something closer to 2^32-2^16... any confirmation or clarification would be appreciated.

I struck the post several up concerning single-iterated 1/2/3D, since there was an error in my accumulation and enumeration, which I fixed (but didn't re-post results for).

But, others don't share my appreciation for awe-inspiring processors.

Apparently those "others" do not like the noise of my current servers... a couple days down-time required to move servers and get back on the network.

I was having trouble implementing 'e/f' in the small version to get exactly the expected distribution, per last instructions.
I might have figured out why, but will have to test later. I'll let you know if I cannot get it sorted.

For my verification on the small version, are you sure 2^25-2^9 and 2^24-2^8 are the correct periods for and 'e/f' implemented double-iterated byte and word 1D distributions?
On the small version I was seeing something closer to 2^32-2^16... any confirmation or clarification would be appreciated.

I struck the post several up concerning single-iterated 1/2/3D, since there was an error in my accumulation and enumeration, which I fixed (but didn't re-post results for).

The total period for double-iterated xoroshiro16++a = (2^16-1) * 2 * 2^8 = single-iteration period * 2 for double-iterating * different initial acc values / streams. So, yes, 2^25-2^9 single or 2^24-2^8 double iterations.

In my x86 code, I count single iterations and stop when I get to 0FFFF00H * 2 = 1FFFE00H = 33553920 = 2^25-2^9.

I got it sorted. Must apply f to the pair of xoroshiro outputs, which is obvious now.
Based on the small version, anything more complicated than e=0, f=1 may not be worth the effort, but I’ll run a few tests tomorrow.

I got it sorted. Must apply f to the pair of xoroshiro outputs, which is obvious now.
Based on the small version, anything more complicated than e=0, f=1 may not be worth the effort, but I’ll run a few tests tomorrow.

f is added to acc instead of e only once per double-iteration. It obviously affects the current pair (and subsequent pairs due to the accumulation) but it only applies to half of the current pair.

Tony, sorry about the delay, but have a look at this:

acc = acc + (result ^ e) // recommend dynamic initialized to e = 0; + could be used instead of ^
e = e ^ f // recommend static constant f = 1

I noticed your code modification fractured the already imperfect 16-bit 2D slightly more than it was.
I believe the above modification:
1. Maintains the same perfect 16-bit 1D over period 2^49-2^17.
2. Maintains the same nearly perfect 32-bit 1D over period 2^48-2^16.
3. Repairs the fractured 16-bit 2D back to a nearly perfect 2D over period 2^49-2^17.
4. Is more random, based on preliminary PractRand testing and observation of half-stream hex output.
5. Is probably easier to implement (especially using recommended e/f, which is a 1-bit square wave generator; no e/f carry/addition required).
6. Produces a cleaner C implementation.

Here are the distribution results of both (small version):

I am still re-tooling to do PractRand rotations... I had a "mishap" while moving servers. Edit: Finished re-tooling, running '++a [13,5,10,9,1,0]' now.
Let me know what you think, and assuming you can confirm my results.

The 1D-8 and 1D-16 above are what I would call 1D and 2D, for single outputs and two consecutive outputs. I'm not at all sure what 2D-8 is.

For 2D-8 in the small version, the first 16-bit is built from two 8-bit outputs, then the second 16-bit is built from the last/previous 8-bit of the first and a new 8-bit output.

I've tried some xor-ing, but without success. How often is e = e ^ f done? My algorithm uses f instead of e only once for each double-iteration.

To achieve the same effect when testing with pairs of 8-bit outputs to build ACC(16), calculate the first 8-bit result half without e/f being used, and xor (or add, if you choose) 1 to the second 8-bit result half.

On other matters... I am having trouble replicating my previous PractRand results on rotations > 0 after rebuilding some of my code base.
Now all rotations > 0 are also failing at 16GB... which taken at face-value would indicate that PractRand is not as blind to the rotations as I had previously found.
One possible explanation is that my old rotation code had a bug, so I am investigating further while generating new results.

acc holds two 16-bit accumulator values, with current one in the high word and previous one in the low word. Therefore xoroacc code above is wrong and it should read:

*As previously noted, either '^ 1' or '+ 1' may be used. Edit: I added parentheses and moved the location of the '+ 1' in order to support '^ 1' correctly. Edit2: Deprecated '+ 1', since '^ 1' is very noticeably superior... supporting data in next post.

The key point for achieving 2D equidistribution is that acc must be modified in a different way to normal for one particular state once during the double-iteration.

I've been looking at my code again and when I modify acc as described above the change takes effect on the first output of the next pair, not the second output of the pair just calculated. After some testing, here is what I find works:

'Either
if state = jump_state then
acc_out[15:0] := prn[15:0] + acc_in[31:16] + e
else
acc_out[15:0] := prn[15:0] + acc_in[31:16]
endif
acc_out[31:16] := prn[31:16] + acc_out[15:0] + f
'or
if state = jump_state then
acc_out[15:0] := prn[15:0] + acc_in[31:16] + e
else
acc_out[15:0] := prn[15:0] + acc_in[31:16]
endif
acc_out[31:16] := prn[31:16] + acc_out[15:0] ^ f

I've re-introduced e and f, but in a modified form. e <> 0, f can be anything and the simplest values are e = 1, f = 0. Replacing + e with ^ e did not work in my tests.

PractRand attachments related to my above post demonstrating the superiority of ^ (xor) compared to + (sum).
All but 2 total fwd/rev rotations fail at 32GB in the xor version, which is much better than ALL of the other sum-only variants that have been discussed previously (but will post results on those also).

The order of operations must be correct, then it will interleave both half-streams perfectly with no apparent correlation in anything but the lsb of the two halves.

Here is a tested working VB code example with best achievable distribution properties (given the shorter period of the underlying engine) and, I believe, best randomness:

Dim acc As Byte 'May be initialized to any value
Dim bit As Byte 'Must be initialized to zero
Function XOROACC() as Integer 'Returns 16-bit word from period 2^24-2^8. Function will overflow unless compiled with checks disabled.
XOROACC = Xoroshiro25ppa 'This half will effectively not have Xor applied, as 'bit' will be equal to zero
XOROACC = XOROACC Or (Xoroshiro25ppa * 256) 'This half will have Xor applied, as 'bit' will be equal to one. '* 256' = '<< 8'
End Function
Function Xoroshiro25ppa() As Byte 'Returns 8-bit byte from period 2^25-2^9. Function will overflow unless compiled with checks disabled.
acc = acc + (Xoroshiro16pp Xor bit) 'Xoroshiro16pp returns byte from period 2^16-1
bit = bit Xor 1& 'bit may be XORed with any odd value, 1 is recommended for simplicity.
Xoroshiro25ppa = acc
End Function

Edit: Added 16-bit output code and a few clarifications.

I think your VB code has the same effect as the bottom two lines below with f = 1. Doing the xor before or after the addition might alter the randomness somewhat but it didn't change the equidistribution in my tests. The critical thing for 2D equidistribution is the first line below and without it or something similar I don't see how one can jump to a different stream.

'XOROACC
if state = jump_state then
acc_out[15:0] := prn[15:0] + acc_in[31:16] + e
else
acc_out[15:0] := prn[15:0] + acc_in[31:16]
endif
acc_out[31:16] := prn[31:16] + acc_out[15:0] ^ f

I doubt the xor is necessary in a microcontroller, to be honest. The simplest code is e = 1, f = 0, but it might be interesting to test a much larger e with PractRand, e.g.

'XOROACC
if state = jump_state then
acc_out[15:0] := prn[15:0] + acc_in[31:16] + 24109
else
acc_out[15:0] := prn[15:0] + acc_in[31:16]
endif
acc_out[31:16] := prn[31:16] + acc_out[15:0]

A new XOROACC instruction would follow immediately after XORO32

XORO32 state 'prn[31:0] into next S
XOROACC acc,0-0 'acc = improved PRN

however its only operands are acc and prn, not state, therefore XORO32 would need to test for one hardwired state and set an internal flag or C or Z, or an arbitrary state could be tested for at the cost of a third instruction, or both.

I think your VB code has the same effect as the bottom two lines below with f = 1. Doing the xor before or after the addition might alter the randomness somewhat but it didn't change the equidistribution in my tests. The critical thing for 2D equidistribution is the first line below and without it or something similar I don't see how one can jump to a different stream

The xor must be done exactly as shown on the result (i.e. prn) prior to adding to acc, otherwise the perfect equidistribution is completely broken.
Then each byte call is equivalent to a jump to an entirely different stream, with distance between streams growing at each call until all streams have been sampled twice (once by each half of the double-iteration) at 2^17-2 calls.
PractRand is positively dazzled by this approach... even the failures at no-rotation forward-bits is minimized to just GAP.

I have tried provable e/f variations on all of the code we discussed, but PractRand is not really impressed by any of them any more than just using 1/0 or 0/1.

For comparison, I have attached PractRand results for the original e=1, f=0 you requested, where f is added (instead of e) to each 4294967295th result of twice that value of iterations.
I am actually testing for 4294967294, as my counter is zero-based and post incremented.

The xor must be done exactly as shown on the result (i.e. prn) prior to adding to acc, otherwise the perfect equidistribution is completely broken.
Then each byte call is equivalent to a jump to an entirely different stream, with distance between streams growing at each call until all streams have been sampled twice (once by each half of the double-iteration) at 2^17-2 calls.
PractRand is positively dazzled by this approach... even the failures at no-rotation forward-bits is minimized to just GAP.

I have tried provable e/f variations on all of the code we discussed, but PractRand is not really impressed by any of them any more than just using 1/0 or 0/1.

I have the "acc with xor" algorithm working perfectly now. Apologies for not doing so earlier. It's half past midnight here and most of my tests yesterday were done much later than that.

I've tried xoring different single bits of acc(high), starting with the msb and it's interesting to see how the 1D equidistribution converges towards perfection and reaches it for bit 0. In contrast, the 2D equidistribution still looks way off for bit 1.

There is no need to test any state at all, which makes this even better. I'll post the modified XOROACC later. Maybe a better name would be XORACC for "xor and accumulate"? Would it be best to output acc with bits reversed?

EDIT1:
Have you run any Crush/BigCrush tests? No-rotation forward and reverse are all that really matter, I think.

EDIT2:
Xoring acc(low) instead of acc(high) also works, as expected.

I am glad you got it working to see what I was seeing.
I was up till 4AM testing variations on this, but there is no simple way I can find that will get around the lsb issue (which likely is what also maintains the structural coherency).
One variation is to alternate between any pair of even (probably 0) and odd (prime?) values, XORing the first half result with the even value and the second half result with the odd value.
This only visually dresses the subsequent stream pairs, so I am not convinced it is worth the effort over just xor 0/1... however, it is an option to consider.

My original goal was simply to achieve a solid 8GB random stream with better distribution properties.
Having increased the distribution quality beyond that while creating an almost-solidly random 16GB stream is good enough, in my opinion.

Maybe a better name would be XORACC for "xor and accumulate"? Would it be best to output acc with bits reversed?

With respect to Seba, XOROACC is probably better. Bits reversed... why exactly? If you are referring to the PractRand results, keep in mind the other Chris could have just as easily focused on high-bit tests for low bits (and vice-versa), but he is instead writing for known-faults in existing PRNGs. So things like that and my JMT are really just over-sights on his part. Wait to see what the pending 0.95 brings to the table (which I have not been able to get the beta to compile in .NET, as yet).

I have two things planned:
1. Document example input seeds and expected output results so we are on the same page.
2. Run TestU01 BigCrush (because almost 16GB is close-enough to try... I saw your edit after I wrote this... I use Lemire's TestU01 module, which only does fwd/rev/byte-rev).

## Comments

265I am not sure that it is possible to improve upon the distribution you have achieved.

I need to review this further once I have replicated your results.

Taken as a whole, I think this is more than just noteworthy when viewed in the proper context as an extensibility option.

1,521I think the acc adjustment, which produces 2D equidistribution (almost), could be simply missing out the "+ e" for one particular state.

2651,521acc holds two 16-bit accumulator values, with current one in the high word and previous one in the low word. Therefore xoroacc code above is wrong and it should read:

Original post has been corrected.

1,521xoroshiro++aan acceptable short name?Defining two new constants, e and f, the accumulator output is given by:

If e and f are different then 2D equidistribution is assured. What this means is every possible 32-bit acc value is output the same number of times (almost). For the exact details, see

http://forums.parallax.com/discussion/comment/1479426/#Comment_1479426

An easy-to-implement scheme is e = 0 and f = 1, but let's consider the case of e = f = 0 first. No midpoint correction is needed for the first half of a stream to be followed directly by the second half. However, at the end there is a wraparound to the beginning, therefore one stream will repeat

ad infinitum. A non-zero f is required at one particular state in order to jump into a different stream. Note that this second stream is output in its entirety, due to the wraparound effect, before jumping to a third stream, etc.Another simple scheme is e = 1. As the xoroshiro period is one less than a power of 2, after half the stream has been output the value of acc will be one less than it should be to continue into the second half and if this is desired f should be 2. At the end of the stream the next one begins automatically (no wraparound). There is no reason why f must be 2, it could be 0 or some large number to jump into a distant stream. If e = 1 and f <> 2, then the first half of one stream is followed by a jump to the second half of another, which flows into the first half of the following stream, then another jump, etc. This means that the two halves of every stream could be output with gaps between them, possibly large gaps.

The 'best' values of e and f could be determined by the level of hardware complexity and the results of testing the randomness with PractRand or TestUI.

1,521jump state(where f is used instead of e) could be any non-zero value and does not have to be the same as the seed state.I studied the 'midpoint' of a stream to understand what was going on, but there are no real midpoints as one state or point is as good as any other.

265Then, I assume XOROACC, or similar, will refer to your specific double-iterated 2D-recovery modification?

An obvious question that might be asked with regard to either of the above is: Why not xoshiro (or PRNG 'X') with a larger state?

I believe some obvious answers would be with regard to reduced footprint, testing time, and predictable stream randomness.

Additional answers might be derived from previous posts.

Regarding the stream randomness, I have found that '++' is slightly more complicated than necessary when used as a segue to the 'a' version.

The logic is fairly easy to follow (even if not fully complete):

Why was '+' not used for XORO32 originally? Due to endemic linear complexity, Hamming-weight and binary matrix rank issues of LSBs.

All of those issues were hidden well by '++', but '++a', by definition, includes another '+' (i.e. 'a') which can be used to negate the need for one of the original '+'.

It would not be immediately apparent, but the answer of how to accomplish that is from the lesson learned from attempting 'D=8' with xoroshiro32++.

Why did that not work? Presumably because it creates a type of parity between the two halves when '+ x' is added again. [Forgive the poor description].

Once the 'a' became available, that parity can be automatically broken by the constant stream of carry bits across the half-rotation divide... thus '+ x' can be dropped entirely.

In fact, one can just substitute the original '-1' (or &HFFFF) for '+x', since '+x' creates an unnecessary constructive (i.e. unwanted) interference.

Suffice it to say that it is possible for each individual stream to complete without sufficient hint that it will do so ahead of time (e.g. like watching a demolition video in reverse).

There is more to it than that... I will post more later.

[Extensively edited, several times.]

1,521To be pedantic, specifying engine & scrambler & accumulator separately, it should be

xoroshiro[a,b,c]++[d]a[e,f].XOROACC is the suggested name for a future enhanced P2 or P3 instruction that does the accumulation and is not intended to have a wider application.

I can't honestly say whether or not a xoroshiro+a would be good enough quality as I don't run PractRand or TestUI, but our starting point for the accumulator extension is XORO32 and that uses ++. We did use plain xoroshiro+ early on but dropped the poor bit 0 so the output was only 15-bit. It was only after we had 16 good bits, thanks to either parity or ++, that Chip had the idea of double-iteration to give a full 32-bit output.

xoshiro came much too late in the P2 process and anyway fitting the entire state in one 32-bit register is an important design feature.

The meaning of '+ x' is not totally clear to me.

EDIT:Is '+ x' what I call '+ e' or '+ f'?

265s0 + s1, 8 ) - 1' (i.e. no '+s0' is present)This is 'nearly' equivalent in randomness to its more complicated sibling we have been discussing. I will post some PractRand results, just for comparison...

But for my purpose, it brings the ability to make a few additional small changes, which will simultaneously:

1. Better exploit parallelism (e.g. nearly saturate the typical x86/x64 pipeline)

2. Approach the theoretical limit of stream randomness... stream is probably not such a good term unless qualified, as some might incorrectly take it to imply a lack of correlation with other streams.*

3. Provide speed equivalent to xoroshiro+, given enough spare thread / memory bandwidth availability (which is usually, but not always, the case).

*

[The accumulator should not be used by itself as a stream selector without also changing starting bits of the underlying xoroshiro state that are unique for each stream].265on the single-iterated distribution, which I believe are correct:1. +1 in the base ++a code seems to prevent all triplets from re-occuring.

2. -1 in the base ++a code seems to create 3D, but 1 excess for each triplet than would be expected.

3. +1 or -1 in the modified scrambler where '+

s0' has been removed creates perfect 3D, in the expected proportion.I am not sure if knowing that would help with double-iterated distributions.

Let me know if you see anything wrong with my assessment.Edit: The above is not correct.1,521To avoid confusion, I think it would be better to keep the state names

s0ands1, as introduced by David Blackman and Sebastiano Vigna, not x and y. Also, I preferresultfor the scrambler result to be consistent with their C code andaccfor the new accumulator. Adding new constants e and f, the code is as follows:xoroshiro++a

Could you run some PractRand tests for xoroshiro32[13,5,10,9]++a? I'm interested mainly in e > 0, f = 0 and e = 0, f > 0 and as they are simple to do in hardware and software in particular e = 1, f = 0 and e = 0, f = 1. I don't need the results for all 16 possible low bits, just 0 and 8 would do.

The jump state (where f replaces e) should be the same as the initial state. As there is only one jump for every two single-iteration periods, there are two alternative sequences when the state = initial/jump state:

1. + e, + f, + e, + f, ...

2. + f, + e, + f, + e, ...

I'm happy to call a stream something else, e.g. sequence or set, the shorter and snappier the better. Let's say set for now, if you don't like stream. I define set0 and set1 as having the same initial state and initial acc = 0 and 1, respectively.

265My own code has evolved from your above example, but I can translate yours (using your nomenclature, which is suitable):

xoroshiro++a (optimized for x86/64) The seed correlation to first result was defended by Vigna for performance reasons, and also since one can burn the first result(s) if they do no trust their seed source.

In your code, 'acc' is the output (since 'result' includes no reference to 'acc'), whereas in mine 'result' is the output.

The above can be simplified further by adding 'acc' to 'tmp' (e.g. tmp = acc + rotl(s0 + s1, d) + s0 ), reducing the final 'acc' value to a state-counter conditional expression calculated from tmp and e/f.

My servers are loaded fully right now running TestU01 on fully alternate code at 192-bit state, but I will slot your most recent code in on PractRand as soon as I confirm I have each e/f implemented properly with correct distributions.

Regarding stream nomenclature, Seba criticized Melissa's use of the term and/or implementation on her PCG ext variants, which have a similar stream correlation issue. Thus, needs to be well-documented (regardless of terminology used).

In this case (unlike PCG, as I recall), the randomness of an individual stream is improved substantially by the extension, thus can more easily be defended when randomness, equidistribution, easy extensibilty, and speed (i.e. using my 'result = acc' modification above) are taken as a whole.

1,521=with:=to indicate sequential instructions, the pseudo-code and equivalent P2 instructions could be written as:The XORO32 logic calculates

resultin parallel with the state iteration to save time, in the same way as the C code. This result is placed into the source field of the next instruction, here XOROACC, because there is only one destination for XORO32, herestate.The

result = accin your code means the first result is not a function of the state, which is incorrect. A software implementation could be written as:I prefer

resultfor the normal xoroshiro++ PRN andaccfor the exhanced PRN.P.S. Thanks in advance for running the PractRand tests.

1,521265I came home to a crashed server... apparently the new RAM I was testing is faulty (under about 64 total threads from both PractRand and BigCrush simultaneously), so backed down Intel TurboBoost using Intel XTU (hoping that will magically help) until I can swap the RAM back out.

265I was having trouble implementing 'e/f' in the small version to get exactly the expected distribution, per last instructions.

I might have figured out why, but will have to test later. I'll let you know if I cannot get it sorted.

For my verification on the small version, are you sure 2^25-2^9 and 2^24-2^8 are the correct periods for and 'e/f' implemented double-iterated byte and word 1D distributions?

On the small version I was seeing something closer to 2^32-2^16... any confirmation or clarification would be appreciated.

I struck the post several up concerning single-iterated 1/2/3D, since there was an error in my accumulation and enumeration, which I fixed (but didn't re-post results for).

1,521The total period for double-iterated xoroshiro16++a = (2^16-1) * 2 * 2^8 = single-iteration period * 2 for double-iterating * different initial acc values / streams. So, yes, 2^25-2^9 single or 2^24-2^8 double iterations.

In my x86 code, I count single iterations and stop when I get to 0FFFF00H * 2 = 1FFFE00H = 33553920 = 2^25-2^9.

265Based on the small version, anything more complicated than e=0, f=1 may not be worth the effort, but I’ll run a few tests tomorrow.

1,521f is added to acc instead of e only once per double-iteration. It obviously affects the current pair (and subsequent pairs due to the accumulation) but it only applies to half of the current pair.

265I believe the above modification:

1. Maintains the same perfect 16-bit 1D over period 2^49-2^17.

2. Maintains the same nearly perfect 32-bit 1D over period 2^48-2^16.

3. Repairs the fractured 16-bit 2D back to a nearly perfect 2D over period 2^49-2^17.

4. Is more random, based on preliminary PractRand testing and observation of half-stream hex output.

5. Is probably easier to implement (especially using recommended e/f, which is a 1-bit square wave generator; no e/f carry/addition required).

6. Produces a cleaner C implementation.

Here are the distribution results of both (small version):

I am still re-tooling to do PractRand rotations... I had a "mishap" while moving servers.

Edit: Finished re-tooling, running '++a [13,5,10,9,1,0]' now.Let me know what you think, and assuming you can confirm my results.

Multiple correction and clarity edits.1,521I've tried some xor-ing, but without success. How often is e = e ^ f done? My algorithm uses f instead of e only once for each double-iteration.

265On other matters... I am having trouble replicating my previous PractRand results on rotations > 0 after rebuilding some of my code base.

Now all rotations > 0 are also failing at 16GB... which taken at face-value would indicate that PractRand is not as blind to the rotations as I had previously found.

One possible explanation is that my old rotation code had a bug, so I am investigating further while generating new results.

265Edit: I added parentheses and moved the location of the '+ 1' in order to support '^ 1' correctly.Edit2: Deprecated '+ 1', since '^ 1' is very noticeably superior... supporting data in next post.1,521I've been looking at my code again and when I modify acc as described above the change takes effect on the first output of the next pair, not the second output of the pair just calculated. After some testing, here is what I find works:

I've re-introduced e and f, but in a modified form. e <> 0, f can be anything and the simplest values are e = 1, f = 0. Replacing + e with ^ e did not work in my tests.

265All but 2 total fwd/rev rotations fail at 32GB in the xor version, which is much better than ALL of the other sum-only variants that have been discussed previously (but will post results on those also).

265Here is a tested working VB code example with best achievable distribution properties (given the shorter period of the underlying engine) and, I believe, best randomness:

Edit: Added 16-bit output code and a few clarifications.1,521I think your VB code has the same effect as the bottom two lines below with f = 1. Doing the xor before or after the addition might alter the randomness somewhat but it didn't change the equidistribution in my tests. The critical thing for 2D equidistribution is the first line below and without it or something similar I don't see how one can jump to a different stream.

I doubt the xor is necessary in a microcontroller, to be honest. The simplest code is e = 1, f = 0, but it might be interesting to test a much larger e with PractRand, e.g.

A new XOROACC instruction would follow immediately after XORO32

however its only operands are acc and prn, not state, therefore XORO32 would need to test for one hardwired state and set an internal flag or C or Z, or an arbitrary state could be tested for at the cost of a third instruction, or both.

265Then each byte call is equivalent to a jump to an entirely different stream, with distance between streams growing at each call until all streams have been sampled twice (once by each half of the double-iteration) at 2^17-2 calls.

PractRand is positively dazzled by this approach... even the failures at no-rotation forward-bits is minimized to just GAP.

I have tried provable e/f variations on all of the code we discussed, but PractRand is not really impressed by any of them any more than just using 1/0 or 0/1.

For comparison, I have attached PractRand results for the original e=1, f=0 you requested, where f is added (instead of e) to each 4294967295th result of twice that value of iterations.

I am actually testing for 4294967294, as my counter is zero-based and post incremented.

1,521I have the "acc with xor" algorithm working perfectly now. Apologies for not doing so earlier. It's half past midnight here and most of my tests yesterday were done much later than that.

I've tried xoring different single bits of acc(high), starting with the msb and it's interesting to see how the 1D equidistribution converges towards perfection and reaches it for bit 0. In contrast, the 2D equidistribution still looks way off for bit 1.

There is no need to test any state at all, which makes this even better. I'll post the modified XOROACC later. Maybe a better name would be XORACC for "xor and accumulate"? Would it be best to output acc with bits reversed?

EDIT1:Have you run any Crush/BigCrush tests? No-rotation forward and reverse are all that really matter, I think.

EDIT2:Xoring acc(low) instead of acc(high) also works, as expected.

265I was up till 4AM testing variations on this, but there is no simple way I can find that will get around the lsb issue (which likely is what also maintains the structural coherency).

One variation is to alternate between any pair of even (probably 0) and odd (prime?) values, XORing the first half result with the even value and the second half result with the odd value.

This only visually dresses the subsequent stream pairs, so I am not convinced it is worth the effort over just xor 0/1... however, it is an option to consider.

My original goal was simply to achieve a solid 8GB random stream with better distribution properties.

Having increased the distribution quality beyond that while creating an almost-solidly random 16GB stream is good enough, in my opinion.

With respect to Seba, XOROACC is probably better. Bits reversed... why exactly?

If you are referring to the PractRand results, keep in mind the other Chris could have just as easily focused on high-bit tests for low bits (and vice-versa), but he is instead writing for known-faults in existing PRNGs. So things like that and my JMT are really just over-sights on his part. Wait to see what the pending 0.95 brings to the table (which I have not been able to get the beta to compile in .NET, as yet).I have two things planned:

1. Document example input seeds and expected output results so we are on the same page.

2. Run TestU01 BigCrush (because almost 16GB is close-enough to try...

I saw your edit after I wrote this... I use Lemire's TestU01 module, which only does fwd/rev/byte-rev).