It looks like you are exploiting a property of the incremental state that I had neglected (based partially on the 1-lag idea I mentioned, but couldn't make work).
I am not sure that it is possible to improve upon the distribution you have achieved.
I need to review this further once I have replicated your results.

Taken as a whole, I think this is more than just noteworthy when viewed in the proper context as an extensibility option.

'now
XORO32 state 'prn[31:0] into next S
MOV prn,0-0 'prn = existing PRN
'future option
XORO32 state 'prn[31:0] into next S
XOROACC acc,0-0 'acc = improved PRN
'xoroacc operation
'acc_out[15:0] := prn[15:0] + acc_in[15:0] + e*
'acc_out[31:16] := prn[31:16] + acc_out[15:0] + e
'e is new constant, maybe 1, to go with unchanged a,b,c,d
'* force e = 0 if state = mid_state

acc holds two 16-bit accumulator values, with current one in the high word and previous one in the low word. Therefore xoroacc code above is wrong and it should read:

I have a good understanding of how to get 2D equidistribution with the new xoroshiro++ and accumulator algorithm. Is xoroshiro++a an acceptable short name?

Defining two new constants, e and f, the accumulator output is given by:

acc := prn + acc + e for all states except one (2^w-2)
acc := prn + acc + f for one state
w = state width
prn = xoroshiro++ PRN output
N.B. e <> f

An easy-to-implement scheme is e = 0 and f = 1, but let's consider the case of e = f = 0 first. No midpoint correction is needed for the first half of a stream to be followed directly by the second half. However, at the end there is a wraparound to the beginning, therefore one stream will repeat ad infinitum. A non-zero f is required at one particular state in order to jump into a different stream. Note that this second stream is output in its entirety, due to the wraparound effect, before jumping to a third stream, etc.

Another simple scheme is e = 1. As the xoroshiro period is one less than a power of 2, after half the stream has been output the value of acc will be one less than it should be to continue into the second half and if this is desired f should be 2. At the end of the stream the next one begins automatically (no wraparound). There is no reason why f must be 2, it could be 0 or some large number to jump into a distant stream. If e = 1 and f <> 2, then the first half of one stream is followed by a jump to the second half of another, which flows into the first half of the following stream, then another jump, etc. This means that the two halves of every stream could be output with gaps between them, possibly large gaps.

The 'best' values of e and f could be determined by the level of hardware complexity and the results of testing the randomness with PractRand or TestUI.

Since that is sufficiently descriptive, I am good with it if it refers to a generic version of the code.
Then, I assume XOROACC, or similar, will refer to your specific double-iterated 2D-recovery modification?

An obvious question that might be asked with regard to either of the above is: Why not xoshiro (or PRNG 'X') with a larger state?
I believe some obvious answers would be with regard to reduced footprint, testing time, and predictable stream randomness.
Additional answers might be derived from previous posts.

Regarding the stream randomness, I have found that '++' is slightly more complicated than necessary when used as a segue to the 'a' version.
The logic is fairly easy to follow (even if not fully complete):
Why was '+' not used for XORO32 originally? Due to endemic linear complexity, Hamming-weight and binary matrix rank issues of LSBs.
All of those issues were hidden well by '++', but '++a', by definition, includes another '+' (i.e. 'a') which can be used to negate the need for one of the original '+'.
It would not be immediately apparent, but the answer of how to accomplish that is from the lesson learned from attempting 'D=8' with xoroshiro32++.
Why did that not work? Presumably because it creates a type of parity between the two halves when '+ x' is added again. [Forgive the poor description].
Once the 'a' became available, that parity can be automatically broken by the constant stream of carry bits across the half-rotation divide... thus '+ x' can be dropped entirely.
In fact, one can just substitute the original '-1' (or &HFFFF) for '+x', since '+x' creates an unnecessary constructive (i.e. unwanted) interference.
Suffice it to say that it is possible for each individual stream to complete without sufficient hint that it will do so ahead of time (e.g. like watching a demolition video in reverse).
There is more to it than that... I will post more later.

Since that is sufficiently descriptive, I am good with it if it refers to a generic version of the code.
Then, I assume XOROACC, or similar, will refer to your specific double-iterated 2D-recovery modification.

To be pedantic, specifying engine & scrambler & accumulator separately, it should be xoroshiro[a,b,c]++[d]a[e,f].

XOROACC is the suggested name for a future enhanced P2 or P3 instruction that does the accumulation and is not intended to have a wider application.

An obvious question that might be asked with regard to either of the above, is 'why not xoshiro?' (or PRNG 'X') with a larger state.
I believe the obvious answers would be with regard to reduced footprint, testing time, and predictable stream randomness.

Regarding the latter, I have found that '++' is slightly more complicated than necessary when used as a segue to the 'a' version.
The logic is fairly simple:
Why was '+' not used for XORO32 originally? Due to endemic linear complexity, Hamming-weight and binary matrix rank issues of LSBs.
All of those issues were hidden well by '++', but '++a', by definition, includes another '+' (i.e. 'a') which can be used constructively to remove one of the original '+'.
It would not be immediately apparent, but the answer of how to accomplish that is from the lesson learned from attempting 'D=8' with xoroshiro32++.
Why did that not work? Presumably because it creates a type of parity between the two halves when '+ x' is added again. [Forgive the poor description].
Once the 'a' became available, that parity can be automatically broken by the constant stream of carry bits across the half-rotation divide... thus '+ x' can be dropped entirely.
In fact, one can just substitute the original '-1' (or &HFFFF) for '+x'. There is more to it than that... I will post more later.

I can't honestly say whether or not a xoroshiro+a would be good enough quality as I don't run PractRand or TestUI, but our starting point for the accumulator extension is XORO32 and that uses ++. We did use plain xoroshiro+ early on but dropped the poor bit 0 so the output was only 15-bit. It was only after we had 16 good bits, thanks to either parity or ++, that Chip had the idea of double-iteration to give a full 32-bit output.

xoshiro came much too late in the P2 process and anyway fitting the entire state in one 32-bit register is an important design feature.

To be explicit: 'result = result + rotl(s0 + s1, 8 ) - 1' (i.e. no '+ s0' is present)
This is 'nearly' equivalent in randomness to its more complicated sibling we have been discussing. I will post some PractRand results, just for comparison...
But for my purpose, it brings the ability to make a few additional small changes, which will simultaneously:
1. Better exploit parallelism (e.g. nearly saturate the typical x86/x64 pipeline)
2. Approach the theoretical limit of stream randomness... stream is probably not such a good term unless qualified, as some might incorrectly take it to imply a lack of correlation with other streams.*
3. Provide speed equivalent to xoroshiro+, given enough spare thread / memory bandwidth availability (which is usually, but not always, the case).

*[The accumulator should not be used by itself as a stream selector without also changing starting bits of the underlying xoroshiro state that are unique for each stream].

A few further observations on the single-iterated distribution, which I believe are correct:
1. +1 in the base ++a code seems to prevent all triplets from re-occuring.
2. -1 in the base ++a code seems to create 3D, but 1 excess for each triplet than would be expected.
3. +1 or -1 in the modified scrambler where '+ s0' has been removed creates perfect 3D, in the expected proportion.

I am not sure if knowing that would help with double-iterated distributions. Let me know if you see anything wrong with my assessment. Edit: The above is not correct.

To be explicit: 'result = result + rotl(x + y, 8 ) - 1' (i.e. no '+ x' is present)

To avoid confusion, I think it would be better to keep the state names s0 and s1, as introduced by David Blackman and Sebastiano Vigna, not x and y. Also, I prefer result for the scrambler result to be consistent with their C code and acc for the new accumulator. Adding new constants e and f, the code is as follows:

xoroshiro++a

result = rotl(s0 + s1, d) + s0
acc = acc + result + e ;for all states bar one
acc = acc + result + f ;for one state

Could you run some PractRand tests for xoroshiro32[13,5,10,9]++a? I'm interested mainly in e > 0, f = 0 and e = 0, f > 0 and as they are simple to do in hardware and software in particular e = 1, f = 0 and e = 0, f = 1. I don't need the results for all 16 possible low bits, just 0 and 8 would do.

The jump state (where f replaces e) should be the same as the initial state. As there is only one jump for every two single-iteration periods, there are two alternative sequences when the state = initial/jump state:

1. + e, + f, + e, + f, ...
2. + f, + e, + f, + e, ...

This is 'nearly' equivalent in randomness to its more complicated sibling we have been discussing. I will post some PractRand results, just for comparison...
But for my purpose, it brings the ability to make a few additional small changes, which will simultaneously:
1. Better exploit parallelism (e.g. nearly saturate the typical x86/x64 pipeline)
2. Approach the theoretical limit of stream randomness... stream is probably not such a good term unless qualified, as some might incorrectly take it to imply a lack of correlation with other streams.*
3. Provide speed equivalent to xoroshiro+, given enough spare thread / memory bandwidth availability (which is usually, but not always, the case).

*[The accumulator should not be used by itself as a stream selector without also changing starting bits of the underlying xoroshiro state that are unique for each stream].

I'm happy to call a stream something else, e.g. sequence or set, the shorter and snappier the better. Let's say set for now, if you don't like stream. I define set0 and set1 as having the same initial state and initial acc = 0 and 1, respectively.

To avoid confusion, I think it would be better to keep the state names s0 and s1, as introduced by David Blackman and Sebastiano Vigna, not x and y. Also, I prefer result for the scrambler result to be consistent with their C code and acc for the new accumulator. Adding new constants e and f, the code is as follows:

xoroshiro++a

result = rotl(s0 + s1, d) + s0
acc = acc + result + e ;for all states bar one
acc = acc + result + f ;for one state

I have corrected some of the 'x/y' above to 's0/s1'... copy/paste issue, where 'x' is a holdout from Marsaglia that I use in my own VB code base.
My own code has evolved from your above example, but I can translate yours (using your nomenclature, which is suitable):

xoroshiro++a (optimized for x86/64)

result = acc
tmp = rotl(s0 + s1, d) + s0
acc = acc + tmp + e ;for all states bar one
acc = acc + tmp + f ;for one state

The seed correlation to first result was defended by Vigna for performance reasons, and also since one can burn the first result(s) if they do no trust their seed source.
In your code, 'acc' is the output (since 'result' includes no reference to 'acc'), whereas in mine 'result' is the output.
The above can be simplified further by adding 'acc' to 'tmp' (e.g. tmp = acc + rotl(s0 + s1, d) + s0 ), reducing the final 'acc' value to a state-counter conditional expression calculated from tmp and e/f.

Could you run some PractRand tests for xoroshiro32[13,5,10,9]++a? I'm interested mainly in e > 0, f = 0 and e = 0, f > 0 and as they are simple to do in hardware and software in particular e = 1, f = 0 and e = 0, f = 1. I don't need the results for all 16 possible low bits, just 0 and 8 would do...

My servers are loaded fully right now running TestU01 on fully alternate code at 192-bit state, but I will slot your most recent code in on PractRand as soon as I confirm I have each e/f implemented properly with correct distributions.

I'm happy to call a stream something else, e.g. sequence or set, the shorter and snappier the better. Let's say set for now, if you don't like stream. I define set0 and set1 as having the same initial state and initial acc = 0 and 1, respectively.

Regarding stream nomenclature, Seba criticized Melissa's use of the term and/or implementation on her PCG ext variants, which have a similar stream correlation issue. Thus, needs to be well-documented (regardless of terminology used).
In this case (unlike PCG, as I recall), the randomness of an individual stream is improved substantially by the extension, thus can more easily be defended when randomness, equidistribution, easy extensibilty, and speed (i.e. using my 'result = acc' modification above) are taken as a whole.

To avoid confusion, I think it would be better to keep the state names s0 and s1, as introduced by David Blackman and Sebastiano Vigna, not x and y. Also, I prefer result for the scrambler result to be consistent with their C code and acc for the new accumulator. Adding new constants e and f, the code is as follows:

xoroshiro++a

result = rotl(s0 + s1, d) + s0
acc = acc + result + e ;for all states bar one
acc = acc + result + f ;for one state

I have corrected some of the 'x/y' above to 's0/s1'... copy/paste issue, where 'x' is a holdout from Marsaglia that I use in my own VB code base.
My own code has evolved from your above example, but I can translate yours (using your nomenclature, which is suitable):

xoroshiro++a (optimized for x86/64)

result = acc
tmp = rotl(s0 + s1, d) + s0
acc = acc + tmp + e ;for all states bar one
acc = acc + tmp + f ;for one state

The seed correlation to first result was defended by Vigna for performance reasons, and also since one can burn the first result(s) if they do no trust their seed source.
In your code, 'acc' is the output (since 'result' includes no reference to 'acc'), whereas in mine 'result' is the output.
The above can be simplified further by adding 'acc' to 'tmp' (e.g. tmp = acc + rotl(s0 + s1, d) + s0 ), reducing the final 'acc' value to a state-counter conditional expression calculated from tmp and e/f.

The order of my code above matches that for XORO32 and a new XOROACC. Replacing = with := to indicate sequential instructions, the pseudo-code and equivalent P2 instructions could be written as:

result := rotl(s0 + s1, d) + s0 ;XORO32 state 'state = {s1,s0}
acc := acc + result + e_or_f ;XOROACC acc

The XORO32 logic calculates result in parallel with the state iteration to save time, in the same way as the C code. This result is placed into the source field of the next instruction, here XOROACC, because there is only one destination for XORO32, here state.

The result = acc in your code means the first result is not a function of the state, which is incorrect. A software implementation could be written as:

result := rotl(s0 + s1, d) + s0 ;use entry state
acc := acc + result + e_or_f
s0 := ... ;generate next state
s1 := ...

I prefer result for the normal xoroshiro++ PRN and acc for the exhanced PRN.

P.S. Thanks in advance for running the PractRand tests.

Re PractRand testing, I expect you'd like to also try e = -1, f= 0 and e = 0, f = -1. It might be interesting to test 0 and one other value, e.g. 24109 which is prime and has about the same number of 1 and 0 bits.

I believe I understand how it works well enough... I just need to make it work.

I came home to a crashed server... apparently the new RAM I was testing is faulty (under about 64 total threads from both PractRand and BigCrush simultaneously), so backed down Intel TurboBoost using Intel XTU (hoping that will magically help) until I can swap the RAM back out.

But, others don't share my appreciation for awe-inspiring processors.

Apparently those "others" do not like the noise of my current servers... a couple days down-time required to move servers and get back on the network.

I was having trouble implementing 'e/f' in the small version to get exactly the expected distribution, per last instructions.
I might have figured out why, but will have to test later. I'll let you know if I cannot get it sorted.

For my verification on the small version, are you sure 2^25-2^9 and 2^24-2^8 are the correct periods for and 'e/f' implemented double-iterated byte and word 1D distributions?
On the small version I was seeing something closer to 2^32-2^16... any confirmation or clarification would be appreciated.

I struck the post several up concerning single-iterated 1/2/3D, since there was an error in my accumulation and enumeration, which I fixed (but didn't re-post results for).

But, others don't share my appreciation for awe-inspiring processors.

Apparently those "others" do not like the noise of my current servers... a couple days down-time required to move servers and get back on the network.

I was having trouble implementing 'e/f' in the small version to get exactly the expected distribution, per last instructions.
I might have figured out why, but will have to test later. I'll let you know if I cannot get it sorted.

For my verification on the small version, are you sure 2^25-2^9 and 2^24-2^8 are the correct periods for and 'e/f' implemented double-iterated byte and word 1D distributions?
On the small version I was seeing something closer to 2^32-2^16... any confirmation or clarification would be appreciated.

I struck the post several up concerning single-iterated 1/2/3D, since there was an error in my accumulation and enumeration, which I fixed (but didn't re-post results for).

The total period for double-iterated xoroshiro16++a = (2^16-1) * 2 * 2^8 = single-iteration period * 2 for double-iterating * different initial acc values / streams. So, yes, 2^25-2^9 single or 2^24-2^8 double iterations.

In my x86 code, I count single iterations and stop when I get to 0FFFF00H * 2 = 1FFFE00H = 33553920 = 2^25-2^9.

I got it sorted. Must apply f to the pair of xoroshiro outputs, which is obvious now.
Based on the small version, anything more complicated than e=0, f=1 may not be worth the effort, but I’ll run a few tests tomorrow.

I got it sorted. Must apply f to the pair of xoroshiro outputs, which is obvious now.
Based on the small version, anything more complicated than e=0, f=1 may not be worth the effort, but I’ll run a few tests tomorrow.

f is added to acc instead of e only once per double-iteration. It obviously affects the current pair (and subsequent pairs due to the accumulation) but it only applies to half of the current pair.

Tony, sorry about the delay, but have a look at this:

acc = acc + (result ^ e) // recommend dynamic initialized to e = 0; + could be used instead of ^
e = e ^ f // recommend static constant f = 1

I noticed your code modification fractured the already imperfect 16-bit 2D slightly more than it was.
I believe the above modification:
1. Maintains the same perfect 16-bit 1D over period 2^49-2^17.
2. Maintains the same nearly perfect 32-bit 1D over period 2^48-2^16.
3. Repairs the fractured 16-bit 2D back to a nearly perfect 2D over period 2^49-2^17.
4. Is more random, based on preliminary PractRand testing and observation of half-stream hex output.
5. Is probably easier to implement (especially using recommended e/f, which is a 1-bit square wave generator; no e/f carry/addition required).
6. Produces a cleaner C implementation.

Here are the distribution results of both (small version):

I am still re-tooling to do PractRand rotations... I had a "mishap" while moving servers. Edit: Finished re-tooling, running '++a [13,5,10,9,1,0]' now.
Let me know what you think, and assuming you can confirm my results.

The 1D-8 and 1D-16 above are what I would call 1D and 2D, for single outputs and two consecutive outputs. I'm not at all sure what 2D-8 is.

For 2D-8 in the small version, the first 16-bit is built from two 8-bit outputs, then the second 16-bit is built from the last/previous 8-bit of the first and a new 8-bit output.

I've tried some xor-ing, but without success. How often is e = e ^ f done? My algorithm uses f instead of e only once for each double-iteration.

To achieve the same effect when testing with pairs of 8-bit outputs to build ACC(16), calculate the first 8-bit result half without e/f being used, and xor (or add, if you choose) 1 to the second 8-bit result half.

On other matters... I am having trouble replicating my previous PractRand results on rotations > 0 after rebuilding some of my code base.
Now all rotations > 0 are also failing at 16GB... which taken at face-value would indicate that PractRand is not as blind to the rotations as I had previously found.
One possible explanation is that my old rotation code had a bug, so I am investigating further while generating new results.

## Comments

135I am not sure that it is possible to improve upon the distribution you have achieved.

I need to review this further once I have replicated your results.

Taken as a whole, I think this is more than just noteworthy when viewed in the proper context as an extensibility option.

1,289I think the acc adjustment, which produces 2D equidistribution (almost), could be simply missing out the "+ e" for one particular state.

1351,289acc holds two 16-bit accumulator values, with current one in the high word and previous one in the low word. Therefore xoroacc code above is wrong and it should read:

Original post has been corrected.

1,289xoroshiro++aan acceptable short name?Defining two new constants, e and f, the accumulator output is given by:

If e and f are different then 2D equidistribution is assured. What this means is every possible 32-bit acc value is output the same number of times (almost). For the exact details, see

http://forums.parallax.com/discussion/comment/1479426/#Comment_1479426

An easy-to-implement scheme is e = 0 and f = 1, but let's consider the case of e = f = 0 first. No midpoint correction is needed for the first half of a stream to be followed directly by the second half. However, at the end there is a wraparound to the beginning, therefore one stream will repeat

ad infinitum. A non-zero f is required at one particular state in order to jump into a different stream. Note that this second stream is output in its entirety, due to the wraparound effect, before jumping to a third stream, etc.Another simple scheme is e = 1. As the xoroshiro period is one less than a power of 2, after half the stream has been output the value of acc will be one less than it should be to continue into the second half and if this is desired f should be 2. At the end of the stream the next one begins automatically (no wraparound). There is no reason why f must be 2, it could be 0 or some large number to jump into a distant stream. If e = 1 and f <> 2, then the first half of one stream is followed by a jump to the second half of another, which flows into the first half of the following stream, then another jump, etc. This means that the two halves of every stream could be output with gaps between them, possibly large gaps.

The 'best' values of e and f could be determined by the level of hardware complexity and the results of testing the randomness with PractRand or TestUI.

1,289jump state(where f is used instead of e) could be any non-zero value and does not have to be the same as the seed state.I studied the 'midpoint' of a stream to understand what was going on, but there are no real midpoints as one state or point is as good as any other.

135Then, I assume XOROACC, or similar, will refer to your specific double-iterated 2D-recovery modification?

An obvious question that might be asked with regard to either of the above is: Why not xoshiro (or PRNG 'X') with a larger state?

I believe some obvious answers would be with regard to reduced footprint, testing time, and predictable stream randomness.

Additional answers might be derived from previous posts.

Regarding the stream randomness, I have found that '++' is slightly more complicated than necessary when used as a segue to the 'a' version.

The logic is fairly easy to follow (even if not fully complete):

Why was '+' not used for XORO32 originally? Due to endemic linear complexity, Hamming-weight and binary matrix rank issues of LSBs.

All of those issues were hidden well by '++', but '++a', by definition, includes another '+' (i.e. 'a') which can be used to negate the need for one of the original '+'.

It would not be immediately apparent, but the answer of how to accomplish that is from the lesson learned from attempting 'D=8' with xoroshiro32++.

Why did that not work? Presumably because it creates a type of parity between the two halves when '+ x' is added again. [Forgive the poor description].

Once the 'a' became available, that parity can be automatically broken by the constant stream of carry bits across the half-rotation divide... thus '+ x' can be dropped entirely.

In fact, one can just substitute the original '-1' (or &HFFFF) for '+x', since '+x' creates an unnecessary constructive (i.e. unwanted) interference.

Suffice it to say that it is possible for each individual stream to complete without sufficient hint that it will do so ahead of time (e.g. like watching a demolition video in reverse).

There is more to it than that... I will post more later.

[Extensively edited, several times.]

1,289To be pedantic, specifying engine & scrambler & accumulator separately, it should be

xoroshiro[a,b,c]++[d]a[e,f].XOROACC is the suggested name for a future enhanced P2 or P3 instruction that does the accumulation and is not intended to have a wider application.

I can't honestly say whether or not a xoroshiro+a would be good enough quality as I don't run PractRand or TestUI, but our starting point for the accumulator extension is XORO32 and that uses ++. We did use plain xoroshiro+ early on but dropped the poor bit 0 so the output was only 15-bit. It was only after we had 16 good bits, thanks to either parity or ++, that Chip had the idea of double-iteration to give a full 32-bit output.

xoshiro came much too late in the P2 process and anyway fitting the entire state in one 32-bit register is an important design feature.

The meaning of '+ x' is not totally clear to me.

EDIT:Is '+ x' what I call '+ e' or '+ f'?

135s0 + s1, 8 ) - 1' (i.e. no '+s0' is present)This is 'nearly' equivalent in randomness to its more complicated sibling we have been discussing. I will post some PractRand results, just for comparison...

But for my purpose, it brings the ability to make a few additional small changes, which will simultaneously:

1. Better exploit parallelism (e.g. nearly saturate the typical x86/x64 pipeline)

2. Approach the theoretical limit of stream randomness... stream is probably not such a good term unless qualified, as some might incorrectly take it to imply a lack of correlation with other streams.*

3. Provide speed equivalent to xoroshiro+, given enough spare thread / memory bandwidth availability (which is usually, but not always, the case).

*

[The accumulator should not be used by itself as a stream selector without also changing starting bits of the underlying xoroshiro state that are unique for each stream].135on the single-iterated distribution, which I believe are correct:1. +1 in the base ++a code seems to prevent all triplets from re-occuring.

2. -1 in the base ++a code seems to create 3D, but 1 excess for each triplet than would be expected.

3. +1 or -1 in the modified scrambler where '+

s0' has been removed creates perfect 3D, in the expected proportion.I am not sure if knowing that would help with double-iterated distributions.

Let me know if you see anything wrong with my assessment.Edit: The above is not correct.1,289To avoid confusion, I think it would be better to keep the state names

s0ands1, as introduced by David Blackman and Sebastiano Vigna, not x and y. Also, I preferresultfor the scrambler result to be consistent with their C code andaccfor the new accumulator. Adding new constants e and f, the code is as follows:xoroshiro++a

Could you run some PractRand tests for xoroshiro32[13,5,10,9]++a? I'm interested mainly in e > 0, f = 0 and e = 0, f > 0 and as they are simple to do in hardware and software in particular e = 1, f = 0 and e = 0, f = 1. I don't need the results for all 16 possible low bits, just 0 and 8 would do.

The jump state (where f replaces e) should be the same as the initial state. As there is only one jump for every two single-iteration periods, there are two alternative sequences when the state = initial/jump state:

1. + e, + f, + e, + f, ...

2. + f, + e, + f, + e, ...

I'm happy to call a stream something else, e.g. sequence or set, the shorter and snappier the better. Let's say set for now, if you don't like stream. I define set0 and set1 as having the same initial state and initial acc = 0 and 1, respectively.

135My own code has evolved from your above example, but I can translate yours (using your nomenclature, which is suitable):

xoroshiro++a (optimized for x86/64) The seed correlation to first result was defended by Vigna for performance reasons, and also since one can burn the first result(s) if they do no trust their seed source.

In your code, 'acc' is the output (since 'result' includes no reference to 'acc'), whereas in mine 'result' is the output.

The above can be simplified further by adding 'acc' to 'tmp' (e.g. tmp = acc + rotl(s0 + s1, d) + s0 ), reducing the final 'acc' value to a state-counter conditional expression calculated from tmp and e/f.

My servers are loaded fully right now running TestU01 on fully alternate code at 192-bit state, but I will slot your most recent code in on PractRand as soon as I confirm I have each e/f implemented properly with correct distributions.

Regarding stream nomenclature, Seba criticized Melissa's use of the term and/or implementation on her PCG ext variants, which have a similar stream correlation issue. Thus, needs to be well-documented (regardless of terminology used).

In this case (unlike PCG, as I recall), the randomness of an individual stream is improved substantially by the extension, thus can more easily be defended when randomness, equidistribution, easy extensibilty, and speed (i.e. using my 'result = acc' modification above) are taken as a whole.

1,289=with:=to indicate sequential instructions, the pseudo-code and equivalent P2 instructions could be written as:The XORO32 logic calculates

resultin parallel with the state iteration to save time, in the same way as the C code. This result is placed into the source field of the next instruction, here XOROACC, because there is only one destination for XORO32, herestate.The

result = accin your code means the first result is not a function of the state, which is incorrect. A software implementation could be written as:I prefer

resultfor the normal xoroshiro++ PRN andaccfor the exhanced PRN.P.S. Thanks in advance for running the PractRand tests.

1,289135I came home to a crashed server... apparently the new RAM I was testing is faulty (under about 64 total threads from both PractRand and BigCrush simultaneously), so backed down Intel TurboBoost using Intel XTU (hoping that will magically help) until I can swap the RAM back out.

135I was having trouble implementing 'e/f' in the small version to get exactly the expected distribution, per last instructions.

I might have figured out why, but will have to test later. I'll let you know if I cannot get it sorted.

For my verification on the small version, are you sure 2^25-2^9 and 2^24-2^8 are the correct periods for and 'e/f' implemented double-iterated byte and word 1D distributions?

On the small version I was seeing something closer to 2^32-2^16... any confirmation or clarification would be appreciated.

I struck the post several up concerning single-iterated 1/2/3D, since there was an error in my accumulation and enumeration, which I fixed (but didn't re-post results for).

1,289The total period for double-iterated xoroshiro16++a = (2^16-1) * 2 * 2^8 = single-iteration period * 2 for double-iterating * different initial acc values / streams. So, yes, 2^25-2^9 single or 2^24-2^8 double iterations.

In my x86 code, I count single iterations and stop when I get to 0FFFF00H * 2 = 1FFFE00H = 33553920 = 2^25-2^9.

135Based on the small version, anything more complicated than e=0, f=1 may not be worth the effort, but I’ll run a few tests tomorrow.

1,289f is added to acc instead of e only once per double-iteration. It obviously affects the current pair (and subsequent pairs due to the accumulation) but it only applies to half of the current pair.

135I believe the above modification:

1. Maintains the same perfect 16-bit 1D over period 2^49-2^17.

2. Maintains the same nearly perfect 32-bit 1D over period 2^48-2^16.

3. Repairs the fractured 16-bit 2D back to a nearly perfect 2D over period 2^49-2^17.

4. Is more random, based on preliminary PractRand testing and observation of half-stream hex output.

5. Is probably easier to implement (especially using recommended e/f, which is a 1-bit square wave generator; no e/f carry/addition required).

6. Produces a cleaner C implementation.

Here are the distribution results of both (small version):

I am still re-tooling to do PractRand rotations... I had a "mishap" while moving servers.

Edit: Finished re-tooling, running '++a [13,5,10,9,1,0]' now.Let me know what you think, and assuming you can confirm my results.

Multiple correction and clarity edits.1,289I've tried some xor-ing, but without success. How often is e = e ^ f done? My algorithm uses f instead of e only once for each double-iteration.

135On other matters... I am having trouble replicating my previous PractRand results on rotations > 0 after rebuilding some of my code base.

Now all rotations > 0 are also failing at 16GB... which taken at face-value would indicate that PractRand is not as blind to the rotations as I had previously found.

One possible explanation is that my old rotation code had a bug, so I am investigating further while generating new results.