Right, I can see the 2-D equidistribution for the variable result properly now.
During a complete result period of xoroshiro16+++, every 8-bit result occurs 65535 times and every 16-bit result pair occurs 255 times for 256 pairs and 256 times for 65536-256 = 65280 pairs. Equidistribution is perfect for 1-D and not quite perfect for 2-D (because the period of 65535 * 256 is not exactly divisible by 65536).
Thus the intrinsic 2-D equidistribution that is destroyed by the adder scrambler can be restored (almost).
I forgot to mention that initializing 'result = 0' is not a strict requirement. Only s0 and s1 not both simultaneously zero... which is why the full period is short by 2^16 (or 2^8 for small version).
I forgot to mention that initializing 'result = 0' is not a strict requirement. Only s0 and s1 not both simultaneously zero... which is why the full period is short by 2^16 (or 2^8 for small version).
Changing the initial value of result acts as a type of jump function and I have used
values other than zero, e.g. 255 for xoroshiro16+++ so that the state periods start with 255, 254, ..., 1, 0.
The pfreq distribution for one state period of xoroshiro16+++ appears to show a significant overall improvement compared to the full (state) period of xoroshiro16++, for the small number of [a,b,c,d] candidates I looked at.
It seems to me that you could quickly experiment with a 16-bit PRNG of some type and then apply the lessons learned to a 32, 64, or 128-bit PRNG of the same type. I imagine you could figure out how to best employ the carry-out, for example, to get the quality you are looking for.
It seems to me that you could quickly experiment with a 16-bit PRNG of some type and then apply the lessons learned to a 32, 64, or 128-bit PRNG of the same type. I imagine you could figure out how to best employ the carry-out, for example, to get the quality you are looking for.
Yes, a xoroshiro16+++ is very fast to test and that's what we have been doing. A xoroshiro32+++ would take 65536 times longer than xoroshiro32++, which is too long to be practical for the whole period but smaller versions could be tested to confirm the principle.
xoroshiro+++ consists of an engine that iterates the state, a scrambler (in this case a double addition) and a new accumulator which can be initialised to any value. If the state width is w bits, then the period between state and accumulator repeating their initial values is (2^w-1)*2^(w/2). For w = 32 as in XORO32, this is (2^32-1)*2^16 and instead of a single period of 2^32-1 there are 65536 such periods that begin with the accumulator one lower/higher than the period before/after.
I'm not sure whether there is any advantage in using xoroshiro32+++ over xoroshiro32++ for single 16-bit outputs but there is for paired 32-bit outputs. XORO32 longs are a good approximation to ideal randomness; instead of all values occuring once during a state period, some occur multiple times and others not at all. In fact, 1/e or ~37% of the possible outputs never happen, ~37% once and the remainder twice or more.
In contrast, xoroshiro32+++ outputs all 32-bit paired values the same number of times (2-dimensional equidistribution) or very nearly so. Extrapolating, 65536 values should occur 65535 times and 4294967296-65536 = 4294901760 values should occur 65536 times.
It seems to me that you could quickly experiment with a 16-bit PRNG of some type and then apply the lessons learned to a 32, 64, or 128-bit PRNG of the same type. I imagine you could figure out how to best employ the carry-out, for example, to get the quality you are looking for.
To complement Tony's answer:
All of the ground-work was laid out with the current implementation of xoroshiro++.
My opinion is that it is the best general purpose PRNG available, thus worth exploring in further detail.
There are some minor deficits that leave a little room for improvement, which I had previously put considerable thought into, but was never able to simplify to be potentially useful until recently.
All of the current discussion is involving a general purpose equidistibution recovery extension that is a simplification of my original experiments, increasing code complexity by only a small amount.
It has some merit on multiple non-trivial points, with only a few limitations that are seemingly unavoidable without moving to even greater state and/or word size... where the same logic could be applied, as well.
xoroshiro+++ can be implemented as a software add-on, i.e. XORO32 plus extra instructions. Perhaps in the future XORO32 could be extended in hardware. There is a spare L opcode bit that could select ++ or +++ and there is a free time for an extra 16-bit addition (s0 + X) in parallel with (s0 + s1). Just adding an optional {wc} to XORO32 that sets C would save an instruction.
Another thought: if the accumulator (acc) is reset after two state periods (one period of paired outputs), the pair frequency could be closer to the ideal random binomial distribution than with plain XORO32, based on brief testing with 16-bit states. The initial value of acc does not matter, but would the resetting create a discontinuity?
Another thought: if the accumulator (acc) is reset after two state periods (one period of paired outputs), the pair frequency could be closer to the ideal random binomial distribution than with plain XORO32, based on brief testing with 16-bit states. The initial value of acc does not matter, but would the resetting create a discontinuity?
Resetting part of the state almost always causes issues with PRNGs in general. However, there may be a way to do so (on a 1 lag?, not sure) that would work. [Edit]I cannot find a lag that matches a period to maintain equidistribution, if one exists.[/Edit]
The only other possible way I have found is to xor mask s1 (s1 = s1 xor M) prior to the scrambler (yielding a non-correlated stream), where M possibly be updated periodically. Does not work when previous result is summed with new result.
After more study, I think what follows describes what is happening with xoroshiro+++. There is 2-d equidistribution (almost) only when each result is used twice. For xoroshiro32+++, each result forms the high word of one long and the low word of the next long or vice-versa. Therefore only one state iteration is required per long due to this result repetition.
XORO32 does two state iterations of xoroshiro++ to generate a long. An option for xoroshiro+++ as a software extension of XORO32 is that only one word from the long could be used, twice, and the other word ignored. At the end of the state period, which is an odd length, the halves would swap over so that all the outputs would be used in the end. Words output by successive states would be the entire state period apart, i.e. all the high words would come after all the low words or vice-versa. Tests show that all possible long outputs can be generated but the output is not equidistributed. If xoroshiro+++ uses both words from XORO32 in succession only once the output is similar, i.e. no equidistribution.
Was the PractRand testing done with repeating and skipping, or was it a continuous stream?
EDIT1:
Updated second paragraph after more testing.
Was the PractRand testing done with repeating and skipping, or was it a continuous stream?
Continuous stream of 16-bit outputs, or in the case of JMT, swapping pairs of 16-bit outputs.
I am not sure I understand exactly how skipping would work, as the entire state (including 'result') is an even period.
I believe I have actually found a simple method of switching streams (that are de-correlated) after 2^wordsize iterations by counting... and maintain the same 2D at a period of 2^32-2-16 2^64-2^32.
However, the counter itself introduces another (very simple, but powerful) 16-bit state variable.
At that point you at, or very close to, needing to just move on to using xoshiro, instead of xoroshiro.
Was the PractRand testing done with repeating and skipping, or was it a continuous stream?
Continuous stream of 16-bit outputs, or in the case of JMT, swapping pairs of 16-bit outputs.
I am not sure I understand exactly how skipping would work, as the entire state (including 'result') is an even period.
Second paragraph of my previous post should be ignored. Do you agree with the first one?
No need for any skipping in PractRand tests but, if I am right in para. 1 above, all values should be repeated for 2-d equidistribution, which probably won't score well.
Here is exactly what I see when sequencing the small state verison:
Each 8-bit value occurs exactly 65535 times. Each possible pair of 8-bit values occurs exactly 256 times.
However, half of the total 8-bit 2D equidistribution is split between the second half of a 16-bit pair and the first half of the next 16-bit pair.
This breaks all equidistribution at the 16-bit level.
This does produce an excellent normal distribution in the 16-bits, but I see no way back toward actual pair equidistribution without either compromising randomness or using 16-bit variables.
The only way I know of to achieve the (near) perfect 1D 8+8=16 pairing with xoroshiro along with reasonable randomness is to use the ** variant, which brings other issues to the table.
Here is exactly what I see when sequencing the small state verison:
Each 8-bit value occurs exactly 65535 times. Each possible pair of 8-bit values occurs exactly 256 times.
However, half of the total 8-bit 2D equidistribution is split between the second half of a 16-bit pair and the first half of the next 16-bit pair.
This breaks all equidistribution at the 16-bit level.
This does produce an excellent normal distribution in the 16-bits, but I see no way back toward actual pair equidistribution without either compromising randomness or using 16-bit variables.
The only way I know of to achieve the (near) perfect 1D 8+8=16 pairing with xoroshiro along with reasonable randomness is to use the ** variant, which brings other issues to the table.
It seems we are seeing the same thing. 1/256th of the pair values occur 255 times, as mentioned earlier.
The period is doubled when dealing with pair values and at the midpoint the high and low halves are swapped due to the odd state period (as shown below).
To get pair equidistribution, I use the following sequence
0_1, 1_2, 2_3, 3_4, ..., 65533_65534, 65534_0
and record the pair frequencies. The sequence could be rewritten as
which avoids repetitions and thus corresponds to the linear stream sent to PractRand. 65534_0 is the midpoint of the double-length pair period and I wonder whether pair equidistribution could be achieved by reloading the accumulator at this point.
I wonder whether pair equidistribution could be achieved by reloading the accumulator at this point.
I cannot see a solution involving the accumulator, or to defer recovery of the pairing (yet **, or just plain 'result=s0', does so effortlessly, but missing a 0).
In fact the method I thought I had found using 's1 ^ M' will not work either (I think because s0 ^ M2 needs to be done simultaneously).
I cannot see any issue in bitmap simulation... but PractRand seems to be displeased with this small version (as it is with any good PRNG where the pair equidistribution is obvious too early).
Can you confirm what I did above? I will retool and test with larger state... On the large state, PractRand and SmallCrush easily see through my trick, even though it passes most tests in the latter.
I am surprised that SmallCrush did as well as it did (passing 10 out of 15 tests)... considering that 1 iteration of the engine is being used to generate both the high and low values!.
which avoids repetitions and thus corresponds to the linear stream sent to PractRand. 65534_0 is the midpoint of the double-length pair period and I wonder whether pair equidistribution could be achieved by reloading the accumulator at this point.
I have pair equidistribution working now for xoroshiro16+++.
0_1, 2_3, ..., 65534_0, 1_2, 3_4, ..., 65533_65534, 0_1, ..., 65534_0, ...
^ * ^ *
load acc load acc load acc
dec load value dec load value
Before iteration at * the accumulator must be reloaded with the same value as it has before the preceding ^, then this value must be decremented so that at the next * acc is one lower than previous *, etc. Loading acc takes place before the first ^ and every *.
I have pair equidistribution working now for xoroshiro16+++.
0_1, 2_3, ..., 65534_0, 1_2, 3_4, ..., 65533_65534, 0_1, ..., 65534_0, ...
^ * ^ *
load acc load acc load acc
dec load value dec load value
Before iteration at * the accumulator must be reloaded with the same value as it has before the preceding ^, then this value must be decremented so that at the next * acc is one lower than previous *, etc. Loading acc takes place before the first ^ and every *.
I am not getting this to work... probably over-thinking what you wrote (and suffering from company dinner). Do you have some code I can try?
On another aspect, I have an issue with naming, as Seba's original '+' scrambler cannot be extended with a second '+' to signify the protracted equidistribution state variant, since that is already taken.
Using xoroshiro32++ as an example, I am thinking xoroshiro48p++, or for Seba's original xoroshiro32+ using xoroshiro48p+.
Of the above two variants, for the 192-bit version they can be expressed in C (if done properly) to be at least as fast as the original (using Seba's benchmark).
I have pair equidistribution working now for xoroshiro16+++.
0_1, 2_3, ..., 65534_0, 1_2, 3_4, ..., 65533_65534, 0_1, ..., 65534_0, ...
^ * ^ *
load acc load acc load acc
dec load value dec load value
Before iteration at * the accumulator must be reloaded with the same value as it has before the preceding ^, then this value must be decremented so that at the next * acc is one lower than previous *, etc. Loading acc takes place before the first ^ and every *.
I am not getting this to work... probably over-thinking what you wrote (and suffering from company dinner). Do you have some code I can try?
My code is x86 and I use Windows Debug.com to run it and then examine RAM - not a setup designed for sharing, to be honest.
I suggest proving first of all that this sequence repeated 256 times produces pair equidistribution:
0_1, 1_2, 2_3, 3_4, ..., 65533_65534, 65534_0
After each xoroshiro++ iteration, the current acc is the high word and the previous acc is the low word (or vice-versa) of the 16-bit pair. The order in which these pairs occur does not affect their frequency, therefore the following modified sequence will give the same pfreq:
The important point is that acc must have the same value before the second instance of acc/state X as the first instance of X. This means acc must be reloaded at the midpoints, not just initialised once at the start.
The second sequence is the way XORO32 works with its double-iteration and is a lot more random than the first, which repeats each acc immediately.
On another aspect, I have an issue with naming, as Seba's original '+' scrambler cannot be extended with a second '+' to signify the protracted equidistribution state variant, since that is already taken.
Using xoroshiro32++ as an example, I am thinking xoroshiro48p++, or for Seba's original xoroshiro32+ using xoroshiro48p+.
Of the above two variants, for the 192-bit version they can be expressed in C (if done properly) to be at least as fast as the original (using Seba's benchmark).
Before we knew about the ++ scrambler, we used xoroshiro+ and replaced bit 0 with the parity of (s0+s1) including the carry, i.e 17-bit parity for xoroshiro32+. The quality of this parity is better than any other bit of the sum and we called our new algorithm xoroshiro32+p. However, it had a very short life because Seba told me about ++ soon afterwards.
On another aspect, I have an issue with naming, as Seba's original '+' scrambler cannot be extended with a second '+' to signify the protracted equidistribution state variant, since that is already taken.
Using xoroshiro32++ as an example, I am thinking xoroshiro48p++, or for Seba's original xoroshiro32+ using xoroshiro48p+.
Of the above two variants, for the 192-bit version they can be expressed in C (if done properly) to be at least as fast as the original (using Seba's benchmark).
Before we knew about the ++ scrambler, we used xoroshiro+ and replaced bit 0 with the parity of (s0+s1) including the carry, i.e 17-bit parity for xoroshiro32+. The quality of this parity is better than any other bit of the sum and we called our new algorithm xoroshiro32+p. However, it had a very short life because Seba told me about ++ soon afterwards.
Here is what we currently have for the P2 chip's XORO32 instruction:
The important point is that acc must have the same value before the second instance of acc/state X as the first instance of X. This means acc must be reloaded at the midpoints, not just initialised once at the start.
Now that I have thought about it, I believe that can work, but I need to play with it to get the pairs exact so I can run it in PractRand to see the effect.
I easily think in x86 and 65xx, since I started on and mastered the latter, and I've been using the former for over 35 years.
Assuming I get that working and tested... I cannot exactly visualize a P2 implementation, but I gather you are thinking cyclical load/store/decrement is no challenge under that architecture (whereas it will induce a performance penalty in compiled C, which is not really a problem since it likely would not be used extensively in that form).
Regarding naming, for non-double-iterated, I will probably do something like 'rs+' for '+', and 'rs++' for '++'. The 'rs' would probably stand for 'result summation' and help ensure clarity about which base scrambler is being used. If that sounds rational (and jumping ahead), then perhaps 'xoroshiro48rsm++' (i.e. result summation modified) might do for what you are thinking.
Now that I have thought about it, I believe that can work, but I need to play with it to get the pairs exact...
I believe I have the correct small code, and the double original period 8-bit equidistribution is correct, but the double-iterated 16-bit distribution seems to be slightly off.
I see 512 values occurred 255 times, 256 values occurred 257 times, and the balance (64768 values) occurred 256 times.
PractRand results on normal code with this mod still fail as expected at 16GB of output using [13R,5R,10R,9R,1], but more severe due to the exact swap of high and low words:
Now that I have thought about it, I believe that can work, but I need to play with it to get the pairs exact...
I believe I have the correct small code, and the double original period 8-bit equidistribution is correct, but the double-iterated 16-bit distribution seems to be slightly off.
I see 512 values occurred 255 times, 256 values occurred 257 times, and the balance (64768 values) occurred 256 times.
PractRand results on normal code with this mod still fail as expected at 16GB of output using [13R,5R,10R,9R,1], but more severe due to the exact swap of high and low words:
Enjoy.
I get the same results. There are 256 pair values X and X+1 such that pfreq(X) = 257 and pfreq(X+1) = 255 if e = +1 or pfreq(X) = 255 and pfreq(X+1) = 257 if e = -1, instead of both being 256. X and X+1 differ only in the low byte. The accumulator seems to be one less or more than it should be sometimes. I think we should be able to get rid of this error because, as mentioned already, the order in which the pair values occur should not affect their frequency. Maybe a particular X with pfreq(X) = 257 should be trapped each time the frequency is incremented and the iteration numbers recorded.
Re naming, I suggest xoroshiro++a for ++ scrambler and accumulator.
In the scrambler, I tried changing the +1 back to a -1, which removed all 2D component artifacts (overlapping pairs that are present), but unfortunately that had no effect on the 1D pfreqs.
Perhaps more advanced decrementing (e.g. flipping between bit inverted pairs: -2 +1... ) prior to reload? I am worried about getting too complex. No, that does not work.
[Edit]Perhaps change the reload point (full state?) to full period -/+ 1 or some variation on that idea? Harder to investigate without using an even smaller state xoroshiro.[/Edit]
I have perfect 1D and almost perfect 2D equidistribution working together now, after much testing. It is simple.
0_1,2_3,...,65534_0,1_2,3_4,...,65533_65534
*
The asterisk above shows precisely when the accumulator must be adjusted (not loaded) near the midpoint. If the state before the previous iteration is the initial seed state, then acc should be incremented/decremented (assuming e = +/-1) to move to the next/previous stream, where a stream is defined as a single xoroshiro++ period with a unique initial accumulator value.
To put it another way, if double-iterating do inc acc or dec acc when state equals next one after initial state. This is how it must be done using XORO32, as the initial state value is not returned and is therefore untestable when it applies to the second iteration.
I have tried different small values of e apart from +/- 1 with acc adjustment and they work successfully, including zero. For non-zero values of e and no acc adjust, 2D (pfreq) looks to be a normal distribution. PractRand testing might show that certain e are better than others.
I have perfect 1D and almost perfect 2D equidistribution working together now, after much testing. It is simple.
0_1,2_3,...,65534_0,1_2,3_4,...,65533_65534
*
The asterisk above shows precisely when the accumulator must be adjusted (not loaded) near the midpoint. If the state before the previous iteration is the initial state, then acc should be incremented/decremented to move to the next/previous stream, where a stream is defined as a single xoroshiro++ period with a unique initial accumulator value.
To put it another way, if double-iterating do inc acc or dec acc when state equals next one after initial state. This is how it must be done using XORO32, as the initial state value is not returned and is therefore untestable when it applies to the second iteration.
I have tried different small values of e apart from +/- 1 with acc adjustment and they work successfully, including zero. For non-zero values of e and no acc adjust, 2D (pfreq) looks to be a normal distribution. PractRand testing might show that certain e are better than others.
So, to make XORO32 more perfect, we would need the hardware to detect a certain value (that evolves midway through the overall sequence) and then change it.
So, to make XORO32 more perfect, we would need the hardware to detect a certain value (that evolves midway through the overall sequence) and then change it.
Of all the random number ideas we've looked at since P2 revB was fixed, I think this latest one is the best. We could improve XORO32 with or without hardwiring a particular state. The first thing we need is an extra instruction that does the accumulation, say XOROACC, which could use the same opcode as XORO32 except the immediate bit = 1 (similar to WRNZ & MODCZ).
Here is how the code might look:
'now
XORO32 state 'prn[31:0] into next S
MOV prn,0-0 'prn = existing PRN
' all word values occur almost equally
' many long values never occur while
' other longs occur more than once
'future option
XORO32 state 'prn[31:0] into next S
XOROACC acc,0-0 'acc = improved PRN
' all word values occur exactly equally
' all long values occur almost equally
'xoroacc operation
'acc_out[15:0] := prn[15:0] + acc_in[31:16] + e*
'acc_out[31:16] := prn[31:16] + acc_out[15:0] + e
'e is new constant, maybe 1, to go with unchanged a,b,c,d
'* force e = 0 if state = mid_state
EDIT1:
It is better to hardwire the mid-state acc adjustment.
More info below, extrapolating from 16-bit state tests:
XORO32 (existing) double-iterated xoroshiro++
Output All PRN All PRN Period Number of Output
width values values possible frequency
output? output PRN
equally? values
word yes almost 2^33-2 2^16 2^17 for 2^16-1 values (> 0)
2^17-2 for 1 value (= 0)
long no no 2^32-1 2^32 mostly 1 or 0, good
approximation to ideal
random distribution
XORO32 and accumulator (proposed) double-iterated "xoroshiro++a"
Output All PRN All PRN Period Number of Output
width values values possible frequency
output? output PRN
equally? values
word yes yes 2^49-2^17 2^16 2^33-2 for all values
long yes almost 2^48-2^16 2^32 2^16 for 2^32-2^16 values
2^16-1 for 2^16 values
Comments
During a complete result period of xoroshiro16+++, every 8-bit result occurs 65535 times and every 16-bit result pair occurs 255 times for 256 pairs and 256 times for 65536-256 = 65280 pairs. Equidistribution is perfect for 1-D and not quite perfect for 2-D (because the period of 65535 * 256 is not exactly divisible by 65536).
Thus the intrinsic 2-D equidistribution that is destroyed by the adder scrambler can be restored (almost).
Changing the initial value of result acts as a type of jump function and I have used
values other than zero, e.g. 255 for xoroshiro16+++ so that the state periods start with 255, 254, ..., 1, 0.
The pfreq distribution for one state period of xoroshiro16+++ appears to show a significant overall improvement compared to the full (state) period of xoroshiro16++, for the small number of [a,b,c,d] candidates I looked at.
Yes, a xoroshiro16+++ is very fast to test and that's what we have been doing. A xoroshiro32+++ would take 65536 times longer than xoroshiro32++, which is too long to be practical for the whole period but smaller versions could be tested to confirm the principle.
xoroshiro+++ consists of an engine that iterates the state, a scrambler (in this case a double addition) and a new accumulator which can be initialised to any value. If the state width is w bits, then the period between state and accumulator repeating their initial values is (2^w-1)*2^(w/2). For w = 32 as in XORO32, this is (2^32-1)*2^16 and instead of a single period of 2^32-1 there are 65536 such periods that begin with the accumulator one lower/higher than the period before/after.
I'm not sure whether there is any advantage in using xoroshiro32+++ over xoroshiro32++ for single 16-bit outputs but there is for paired 32-bit outputs. XORO32 longs are a good approximation to ideal randomness; instead of all values occuring once during a state period, some occur multiple times and others not at all. In fact, 1/e or ~37% of the possible outputs never happen, ~37% once and the remainder twice or more.
In contrast, xoroshiro32+++ outputs all 32-bit paired values the same number of times (2-dimensional equidistribution) or very nearly so. Extrapolating, 65536 values should occur 65535 times and 4294967296-65536 = 4294901760 values should occur 65536 times.
All of the ground-work was laid out with the current implementation of xoroshiro++.
My opinion is that it is the best general purpose PRNG available, thus worth exploring in further detail.
There are some minor deficits that leave a little room for improvement, which I had previously put considerable thought into, but was never able to simplify to be potentially useful until recently.
All of the current discussion is involving a general purpose equidistibution recovery extension that is a simplification of my original experiments, increasing code complexity by only a small amount.
It has some merit on multiple non-trivial points, with only a few limitations that are seemingly unavoidable without moving to even greater state and/or word size... where the same logic could be applied, as well.
Another thought: if the accumulator (acc) is reset after two state periods (one period of paired outputs), the pair frequency could be closer to the ideal random binomial distribution than with plain XORO32, based on brief testing with 16-bit states. The initial value of acc does not matter, but would the resetting create a discontinuity?
The only other possible way I have found is to xor mask s1 (s1 = s1 xor M) prior to the scrambler (yielding a non-correlated stream), where M possibly be updated periodically. Does not work when previous result is summed with new result.
XORO32 does two state iterations of xoroshiro++ to generate a long. An option for xoroshiro+++ as a software extension of XORO32 is that only one word from the long could be used, twice, and the other word ignored. At the end of the state period, which is an odd length, the halves would swap over so that all the outputs would be used in the end. Words output by successive states would be the entire state period apart, i.e. all the high words would come after all the low words or vice-versa. Tests show that all possible long outputs can be generated but the output is not equidistributed. If xoroshiro+++ uses both words from XORO32 in succession only once the output is similar, i.e. no equidistribution.
Was the PractRand testing done with repeating and skipping, or was it a continuous stream?
EDIT1:
Updated second paragraph after more testing.
EDIT2:
Second paragraph obliterated.
I am not sure I understand exactly how skipping would work, as the entire state (including 'result') is an even period.
I believe I have actually found a simple method of switching streams (that are de-correlated) after 2^wordsize iterations by counting... and maintain the same 2D at a period of 2^32-2-16 2^64-2^32.
However, the counter itself introduces another (very simple, but powerful) 16-bit state variable.
At that point you at, or very close to, needing to just move on to using xoshiro, instead of xoroshiro.
I'll post the code here, after some more study.
Second paragraph of my previous post should be ignored. Do you agree with the first one?
No need for any skipping in PractRand tests but, if I am right in para. 1 above, all values should be repeated for 2-d equidistribution, which probably won't score well.
Each 8-bit value occurs exactly 65535 times. Each possible pair of 8-bit values occurs exactly 256 times.
However, half of the total 8-bit 2D equidistribution is split between the second half of a 16-bit pair and the first half of the next 16-bit pair.
This breaks all equidistribution at the 16-bit level.
This does produce an excellent normal distribution in the 16-bits, but I see no way back toward actual pair equidistribution without either compromising randomness or using 16-bit variables.
The only way I know of to achieve the (near) perfect 1D 8+8=16 pairing with xoroshiro along with reasonable randomness is to use the ** variant, which brings other issues to the table.
It seems we are seeing the same thing. 1/256th of the pair values occur 255 times, as mentioned earlier.
The period is doubled when dealing with pair values and at the midpoint the high and low halves are swapped due to the odd state period (as shown below).
To get pair equidistribution, I use the following sequence
and record the pair frequencies. The sequence could be rewritten as
which avoids repetitions and thus corresponds to the linear stream sent to PractRand. 65534_0 is the midpoint of the double-length pair period and I wonder whether pair equidistribution could be achieved by reloading the accumulator at this point.
In fact the method I thought I had found using 's1 ^ M' will not work either (I think because s0 ^ M2 needs to be done simultaneously).
Can you confirm what I did above? I will retool and test with larger state...
On the large state, PractRand and SmallCrush easily see through my trick, even though it passes most tests in the latter.
I am surprised that SmallCrush did as well as it did (passing 10 out of 15 tests)... considering that 1 iteration of the engine is being used to generate both the high and low values!.
I have pair equidistribution working now for xoroshiro16+++.
Before iteration at * the accumulator must be reloaded with the same value as it has before the preceding ^, then this value must be decremented so that at the next * acc is one lower than previous *, etc. Loading acc takes place before the first ^ and every *.
Using xoroshiro32++ as an example, I am thinking xoroshiro48p++, or for Seba's original xoroshiro32+ using xoroshiro48p+.
Of the above two variants, for the 192-bit version they can be expressed in C (if done properly) to be at least as fast as the original (using Seba's benchmark).
My code is x86 and I use Windows Debug.com to run it and then examine RAM - not a setup designed for sharing, to be honest.
I suggest proving first of all that this sequence repeated 256 times produces pair equidistribution: After each xoroshiro++ iteration, the current acc is the high word and the previous acc is the low word (or vice-versa) of the 16-bit pair. The order in which these pairs occur does not affect their frequency, therefore the following modified sequence will give the same pfreq: The important point is that acc must have the same value before the second instance of acc/state X as the first instance of X. This means acc must be reloaded at the midpoints, not just initialised once at the start.
The second sequence is the way XORO32 works with its double-iteration and is a lot more random than the first, which repeats each acc immediately.
Before we knew about the ++ scrambler, we used xoroshiro+ and replaced bit 0 with the parity of (s0+s1) including the carry, i.e 17-bit parity for xoroshiro32+. The quality of this parity is better than any other bit of the sum and we called our new algorithm xoroshiro32+p. However, it had a very short life because Seba told me about ++ soon afterwards.
Here is what we currently have for the P2 chip's XORO32 instruction:
I easily think in x86 and 65xx, since I started on and mastered the latter, and I've been using the former for over 35 years.
Assuming I get that working and tested... I cannot exactly visualize a P2 implementation, but I gather you are thinking cyclical load/store/decrement is no challenge under that architecture (whereas it will induce a performance penalty in compiled C, which is not really a problem since it likely would not be used extensively in that form).
Regarding naming, for non-double-iterated, I will probably do something like 'rs+' for '+', and 'rs++' for '++'. The 'rs' would probably stand for 'result summation' and help ensure clarity about which base scrambler is being used. If that sounds rational (and jumping ahead), then perhaps 'xoroshiro48rsm++' (i.e. result summation modified) might do for what you are thinking.
I see 512 values occurred 255 times, 256 values occurred 257 times, and the balance (64768 values) occurred 256 times.
PractRand results on normal code with this mod still fail as expected at 16GB of output using [13R,5R,10R,9R,1], but more severe due to the exact swap of high and low words:
Enjoy.
I get the same results. There are 256 pair values X and X+1 such that pfreq(X) = 257 and pfreq(X+1) = 255 if e = +1 or pfreq(X) = 255 and pfreq(X+1) = 257 if e = -1, instead of both being 256. X and X+1 differ only in the low byte. The accumulator seems to be one less or more than it should be sometimes. I think we should be able to get rid of this error because, as mentioned already, the order in which the pair values occur should not affect their frequency. Maybe a particular X with pfreq(X) = 257 should be trapped each time the frequency is incremented and the iteration numbers recorded.
Re naming, I suggest xoroshiro++a for ++ scrambler and accumulator.
EDIT:
Added pfreq(X), pfreq(X+1) for e = -1
Perhaps more advanced decrementing (e.g. flipping between bit inverted pairs: -2 +1... ) prior to reload? I am worried about getting too complex. No, that does not work.
[Edit]Perhaps change the reload point (full state?) to full period -/+ 1 or some variation on that idea? Harder to investigate without using an even smaller state xoroshiro.[/Edit]
The asterisk above shows precisely when the accumulator must be adjusted (not loaded) near the midpoint. If the state before the previous iteration is the initial seed state, then acc should be incremented/decremented (assuming e = +/-1) to move to the next/previous stream, where a stream is defined as a single xoroshiro++ period with a unique initial accumulator value.
To put it another way, if double-iterating do inc acc or dec acc when state equals next one after initial state. This is how it must be done using XORO32, as the initial state value is not returned and is therefore untestable when it applies to the second iteration.
I have tried different small values of e apart from +/- 1 with acc adjustment and they work successfully, including zero. For non-zero values of e and no acc adjust, 2D (pfreq) looks to be a normal distribution. PractRand testing might show that certain e are better than others.
So, to make XORO32 more perfect, we would need the hardware to detect a certain value (that evolves midway through the overall sequence) and then change it.
Of all the random number ideas we've looked at since P2 revB was fixed, I think this latest one is the best. We could improve XORO32 with or without hardwiring a particular state. The first thing we need is an extra instruction that does the accumulation, say XOROACC, which could use the same opcode as XORO32 except the immediate bit = 1 (similar to WRNZ & MODCZ).
Here is how the code might look:
EDIT1:
It is better to hardwire the mid-state acc adjustment.
EDIT2:
xoroacc operation corrected.
XORO32 (existing) double-iterated xoroshiro++
XORO32 and accumulator (proposed) double-iterated "xoroshiro++a"
word means bits [31:16] and [15:0] only