Chris, Is that for me to do something with? I'm not making any sense of it, sorry.

The seed corrections are for use with the seed generator, and are likely not that important right now, unless you are performing random seeding and accidentally use the pair of values from the seed columns DE with the triplets specified in columns ABC (which would create a single cycle loop).

The remaining 16 columns F through U are the 16 values (0-15) relating to the choice of Constant_D for a given ABC columns triplet, used for the final output xor correction to replace the missing value with a 0 in the output stream.

Regarding the latter, you may want to double-check the statistics on a promising candidate by applying the appropriate constant to the output:
1. By xor'ing it with the output.
2. By subtracting it from the output (as Tony first suggested).
There will be some tiny effect on non-zero pairs and a larger effect on zero-pairs, for example. Both 1&2 should have the same effect.
It is unlikely that other statistics would be substantially affected, however, Practrand may be subtly aware of the choice of 1 or 2 in slightly different ways.

Edit: The output corrections listed in the table (which represent the missing value for a given set of ABCD constants) were calculated by replacing the s0 and s1 values in the scrambler with the seed correction values:
output xor/subtraction constant = Seed_Correction_S0 + rotl(Seed_Correction_S0+Seed_Correction_S1, Constant_D )

Ah, okay, I'll call that one constant E. And, for option 2, I presume there is no issue with paralleling two of the adders in the scrambler, ie: rnd = (x - E) + rotl(x + y, D). That's the only way I'd be happy to include a third adder circuit.

That is actually quite splendid! Now I understand why Tony suggested subtraction.
It should have been more obvious to me that subtraction would be easier to parallelize, and it might perhaps be (slightly) statistically superior to xor.

EDIT: Let me know if you want the corresponding table for Xorograyro32++ ('>>'), as:
1. I hadn't considered the slight possibility of constant subtraction affecting the reverse FPF behavior.
2. There are some candidates that are seemingly not adversely affected by the use of '>>'.

... Is there code I could add for auto-generating constant E from the others?

This should be good code for calculating constant E using columns DE from the table:
const uint16_t E = SEED_CORRECTION_0 + rotl(SEED_CORRECTION_0 + SEED_CORRECTION_1, D );

Presumes SEED_CORRECTION_0 and SEED_CORRECTION_1 are #DEFINED (passed in).

If necessary, I can try and come up with the exact formula for calculating the seed corrections for the above using ABC (for both '<<' or '>>').
I am not sure how tedious it would be to do so...

I was guessing you were already passing ABCD as parameters for compilation, so the two additional seed correction constants could be passed for calculating E, correct?

The problem is they have to be stored as a sequential lookup. The ABC components are not sequential. The only sequential component is the index on the full-period triplets. So that means the full-period file, or a matching file pulled in at the same time, has to be the store of the table. It doesn't make any diff if there is one extra or hundred extra entries per record line.

Yep, off to bed now. FYI, it's 5 minutes per candidate to run, with 7 candidates in parallel. I think a solo candidate runs about 60% faster. That's all down to cache thrashing.

Just in case...
Attached brute-force code for finding the two seed correction values (combined into a single 32-bit return value).
Obviously, once split back out to 16-bits, could be passed along with D to previous code I posted to calculate E constant.

Ouch! That is a tad burdensome to be run every time.

It takes about 3-4 seconds on my PC... < 5% of the total time for a single PractRand run.
The slow speed is still kind of annoying though.
In reality, I'm just loading the entire table from a file into an array for use.
In any case, the brute-force code is a good verification and can help generate all of the '>>' constants, as well.
If I can fathom the modular equations for replacing it, that would be magic.
But I think only really useful from a documentation standpoint (and extending state), as few would need to effortlessly flip back and forth.

Oh, when D=0 then E is ignored because that case is implemented as original single summing algorithm. Err, got the reason wrong. It's just the simpler scrambler from the original algorithm so I didn't try to fit constant E into it.

static rngword_t nextrnd( void )
{
#if (ACCUM_SIZE == 16) || (ACCUM_SIZE == 32) || (ACCUM_SIZE == 64)
rngword_t s0 = s[0];
rngword_t s1 = s[1];
#else
rngword_t s0 = s[0] & ACCUM_MASK;
rngword_t s1 = s[1] & ACCUM_MASK;
#endif
#if CONSTANT_D == 0
// rngword_t result = s0 + s1; // output hash (scrambler) for Xoroshiro+
rngword_t result = s0 + s1 - CONSTANT_E; // with experimental CONSTANT_E
#else
// New addition from authors: Second stage summing (scrambler) for Xoroshiro++
// rngword_t result = rotl( s0 + s1, CONSTANT_D ) + s0;
rngword_t result = rotl( s0 + s1, CONSTANT_D ) + (s0 - CONSTANT_E); // with experimental CONSTANT_E
#endif
// Iterator (engine)
s1 ^= s0;
#if (ACCUM_SIZE == 16) || (ACCUM_SIZE == 32) || (ACCUM_SIZE == 64)
s[0] = rotl( s0, CONSTANT_A ) ^ (-1) ^ s1 ^ (s1 << CONSTANT_B); // where integer word size matches the engine word size
#else
s[0] = rotl( s0, CONSTANT_A ) ^ (-1) ^ s1 ^ (s1 << CONSTANT_B) & ACCUM_MASK;
#endif
s[1] = rotl( s1, CONSTANT_C );
#if (ACCUM_SIZE == 16) || (ACCUM_SIZE == 32) || (ACCUM_SIZE == 64)
return( result ); // where integer word size matches the engine word size
#else
return( result & ACCUM_MASK );
#endif
}

EDIT: Added D=0 application of constant E.
EDIT2: Fixed a bug in the D=0 change.

Attached is frequency distribution data for Xoroshiron32, with new correcting constant E from the provided table.

Evan, thanks for doing the tests.

Compared with xoroshiro32++, it seems there are very small differences with xoroshiron32++. [13,5,10,9] slightly worse and now ranked second, with original second improved a bit. I could post some comparisons later. The first idea with a right shift produces some pfreq distributions that are a little better than xoroshiro32++[13,5,10,9].

I did a lot of work yesterday on my fast Z80 interpreter but something strange happened and I lost everything new, despite saving the file regularly. Anyway, I had to recreate the changes early this morning while they were fresh in my mind, hence little time to spend on this thread lately.

Constant E for candidate [2 1 15 10] is 42665 and took 1.443 seconds at 02:27:50
Pairs and Zero-run frequency distribution data for candidate [2 1 15 10], correction 42665
Started at 93175.310 Malloc size = 0x100000000
Constant E for candidate [2 1 15 11] is 19796 and took 1.443 seconds at 02:27:52
Pairs and Zero-run frequency distribution data for candidate [2 1 15 11], correction 19796
Xoroshiron32 [2 1 15 5] finished at 93176.563 - total duration = 295.242
Started at 93177.369 Malloc size = 0x100000000
Constant E for candidate [2 1 15 12] is 39593 and took 1.440 seconds at 02:27:54
Pairs and Zero-run frequency distribution data for candidate [2 1 15 12], correction 39593
Xoroshiron32 [2 1 15 6] finished at 93179.080 - total duration = 295.739
Started at 93179.424 Malloc size = 0x100000000
Xoroshiron32 [2 1 15 7] finished at 93179.577 - total duration = 294.174
Constant E for candidate [2 1 15 13] is 13652 and took 1.445 seconds at 02:27:57
Pairs and Zero-run frequency distribution data for candidate [2 1 15 13], correction 13652
Xoroshiron32 [2 1 15 8] finished at 93181.844 - total duration = 294.322
Started at 93182.513 Malloc size = 0x100000000
Xoroshiron32 [2 1 15 9] finished at 93183.092 - total duration = 291.462
Constant E for candidate [2 1 15 14] is 27305 and took 1.444 seconds at 02:27:59
Pairs and Zero-run frequency distribution data for candidate [2 1 15 14], correction 27305
Started at 93184.543 Malloc size = 0x100000000
Constant E for candidate [2 1 15 15] is 54611 and took 1.459 seconds at 02:28:01
Pairs and Zero-run frequency distribution data for candidate [2 1 15 15], correction 54611
Started at 93186.651 Malloc size = 0x100000000
Constant E for candidate [15 1 2 0] is 21845 and took 0.739 seconds at 02:28:02
Pairs and Zero-run frequency distribution data for candidate [15 1 2], correction 21845
Started at 93188.282 Malloc size = 0x100000000

Based on the observations so far, I have a new working theory on all of this:
Since all of the explored variations on the base engine include: x ^ (x shift B ) // shift is either << or >>
The expanded form would be: M ^ x ^ ( x shift B ) // M could be any value: e.g. M=0 for base, and M=0xFFFF for NOT.
The optimal value for M is likely:
1. A value that flips about half of the untouched bits after the shift is performed. e.g. 2-3 of the 5 lsb for the original 13,5,10
2. And flips about half of the remaining touched bits.
3. Therefore flips about half of the total bits.
4. Perhaps contains a rich mixture of modulo bit sub-sequence patterns. e.g. 0x46F5

I think that the most that easily could be done with this idea would be to demonstrate that using using random values for M on any given A,B,C,D,shift direction, and the derived E subtraction, would generate statistics that would on average describe that candidate more accurately, which might help with selection when there is more than one competing set of candidates.
From there, a specific value for M could be settled on for the winning candidate.

For sake of argument, if we were to assume that the original 13,5,10,9 is optimal, then it is proposed that there is likely an optimal value for M that is non-zero, possibly having the qualities described above.
We could stop there, as it is unlikely that even an exhaustive search for an optimal M would produce a significant statistical improvement (on any candidate set).

A more robust implementation (given the above likely limitation) would be available if the equations were derived to calculate seed corrections directly, and thus E as well (assuming E and M are field programmable):
1. Allow M to be chosen at random and/or from a table (starting with M=0) on boot
2. Choose seed with applied seed correction.
3. Apply M to the engine and the derived E to the scrambler .
Now we have 65536 possible mostly de-correlated streams available, with the possibility of choosing a new stream at will (by the end user, perhaps).

What's up with the seed correction thingy? I'm following the rest I think but that piece doesn't seem to fit in anywhere. Is it just the initial state value? Why would initial state matter?

What's up with the seed correction thingy? I'm following the rest I think but that piece doesn't seem to fit in anywhere. Is it just the initial state value? Why would initial state matter?

Just as you cannot seed xoroshiro with all zeros (because M=0) in order to avoid a single cycle loop, you cannot seed any variant (M<>0) with some other value, the seed correction (which probably should be called ’bad seed value’).

M has more combinations than all of ABCD combined.

All M are valid. You have been testing with M=0xffff and it is ok. Any M used with 13,5,10,9, for example, should have roughly similar statistical behavior, some a little better, some a little worse.

I just ran a quick check on two xoroshiro 13,5,10,9, one normal with M=0,E=0 and the other with M<>0 (and using the corresponding E value in the scrambler, that I brute-forced from M). Starting both with the same state, xoring their output together, PractRand failed at 16GB. The two streams are effectively de-correlated.

This illustrates that instead of just a few good sets of candidates, you have a large resource of potential additional randomness from any of those few good candidates just by adding a single xor to the engine and a single subtraction to the scrambler.

A few dozen random choices for M (and calculating E from each) would likely yield at least a few slightly superior statistical results from which to make a single choice.

Going deeper to allow M to be user selectable would depend on being able to calculate good seeds and E on the fly from ABCD & M.

Edit: I am double-checking all of the above for correctness, since it could make a fine addition to the xoroshiro playbook, even if not used by the P2.

Currently, full-period search is engine based, namely on ABC only. Including E (via M) in the engine implies also including D in full-period search. That's going to make that step much longer.

EDIT: Lol, I think it's circular. Can't find M or E without having A,B and C first. A, B and C can't be found without M.

Currently, full-period search is engine based, namely on ABC only. Including E (via M) in the engine implies also including D in full-period search. That's going to make that step much longer.

I am looking for evidence of any non-full-period (minus one) M value... adding a note to the end of the previous post.
My intuition says that they are all full period, due to the nature of how xor works.
D and E do not factor in, since they are not part of the engine iterator.
If all looks well, I need to magic the ‘bad seed value’ equations.

I just ran a quick check on two xoroshiro 13,5,10,9, one normal with M=0,E=0 and the other with M<>0 (and using the corresponding E value in the scrambler, that I brute-forced from M). Starting both with the same state, xoring their output together, PractRand failed at 16GB. The two streams are effectively de-correlated.

Is that operating two separate generators and feeding Practrand with the XOR of the two outputs? Have I understood?

M={bad seed values}=E=0 only applies to the original xoroshiro32++, so therefore those varibles don't (need to) exist in the original code.
Let us call the the 'bad seed values' BS: BS0 and BS1 for each 16-bit half of the state.
When M=0xFFFF (as in xoroshiroN):
1. There exists a different non-zero BS (when M<>0) for each A,B,C triplet (to do: calculate BS directly, rather than brute-force).
2. Once we have BS (as a function of A,B,C), we can calculate E as a function of (BS,D): E = BS0 + ROL(BS0+BS1,D).
3. 1 and 2 are true for all values of M, not just for 0 and 0xFFFF (proved experimentally, and logically sound).
4. Each M (with its derived E), for a given A,B,C,D, results in a random stream de-correlated from any other M,E stream (to do: search for counter-examples).
5. [Edit] Each A,B,C,D,M,E stream will have similar statistical properties to the base A,B,C,D stream (to do: characterize the extents of statistical variation with various M,E).

Is that operating two separate generators and feeding Practrand with the XOR of the two outputs? Have I understood?

Yes, per #4 above.

This is kind of cool, as M becomes a key, which has the side-effect of bringing a marginal level of cryptographic security when paired with the strong mixing of the ++ scrambler.

## Comments

10,440Is that for me to do something with? I'm not making any sense of it, sorry.

280The remaining 16 columns F through U are the 16 values (0-15) relating to the choice of Constant_D for a given ABC columns triplet, used for the final output xor correction to replace the missing value with a 0 in the output stream.

Regarding the latter, you may want to double-check the statistics on a promising candidate by applying the appropriate constant to the output:

1. By xor'ing it with the output.

2. By subtracting it from the output (as Tony first suggested).

There will be some tiny effect on non-zero pairs and a larger effect on zero-pairs, for example. Both 1&2 should have the same effect.

It is unlikely that other statistics would be substantially affected, however, Practrand may be subtly aware of the choice of 1 or 2 in slightly different ways.

Edit: The output corrections listed in the table (which represent the missing value for a given set of ABCD constants) were calculated by replacing the s0 and s1 values in the scrambler with the seed correction values:

output xor/subtraction constant = Seed_Correction_S0 + rotl(Seed_Correction_S0+Seed_Correction_S1, Constant_D )

10,440280It should have been more obvious to me that subtraction would be easier to parallelize, and it might perhaps be (slightly) statistically superior to xor.

EDIT: Let me know if you want the corresponding table for Xorograyro32++ ('>>'), as:

1. I hadn't considered the slight possibility of constant subtraction affecting the reverse FPF behavior.

2. There are some candidates that are seemingly not adversely affected by the use of '>>'.

10,440280const uint16_t E = SEED_CORRECTION_0 + rotl(SEED_CORRECTION_0 + SEED_CORRECTION_1, D );

Presumes SEED_CORRECTION_0 and SEED_CORRECTION_1 are #DEFINED (passed in).

If necessary, I can try and come up with the exact formula for calculating the seed corrections for the above using ABC (for both '<<' or '>>').

I am not sure how tedious it would be to do so...

10,440I'm back working on using the table by integrating it into the full-period records. That's the only sane way to do the lookup.

280Sleep?

10,44010,44010,440280Attached brute-force code for finding the two seed correction values (combined into a single 32-bit return value).

Obviously, once split back out to 16-bits, could be passed along with D to previous code I posted to calculate E constant.

10,440280The slow speed is still kind of annoying though.

In reality, I'm just loading the entire table from a file into an array for use.

In any case, the brute-force code is a good verification and can help generate all of the '>>' constants, as well.

If I can fathom the modular equations for replacing it, that would be magic.

But I think only really useful from a documentation standpoint (and extending state), as few would need to effortlessly flip back and forth.

28010,440Here's a massaged version for arbitrary word size. Still needs error handling for fall-through condition.

EDIT: Updated to include parameter constant D.

EDIT2: Fix "break" bug - removed.

10,44010,440EDIT: Added D=0 application of constant E.

EDIT2: Fixed a bug in the D=0 change.

1,626Evan, thanks for doing the tests.

Compared with xoroshiro32++, it seems there are very small differences with xoroshiron32++. [13,5,10,9] slightly worse and now ranked second, with original second improved a bit. I could post some comparisons later. The first idea with a right shift produces some pfreq distributions that are a little better than xoroshiro32++[13,5,10,9].

I did a lot of work yesterday on my fast Z80 interpreter but something strange happened and I lost everything new, despite saving the file regularly. Anyway, I had to recreate the changes early this morning while they were fresh in my mind, hence little time to spend on this thread lately.

10,44010,440280Since all of the explored variations on the base engine include: x ^ (x shift B ) // shift is either << or >>

The expanded form would be: M ^ x ^ ( x shift B ) // M could be any value: e.g. M=0 for base, and M=0xFFFF for NOT.

The optimal value for M is likely:

1. A value that flips about half of the untouched bits after the shift is performed. e.g. 2-3 of the 5 lsb for the original 13,5,10

2. And flips about half of the remaining touched bits.

3. Therefore flips about half of the total bits.

4. Perhaps contains a rich mixture of modulo bit sub-sequence patterns. e.g. 0x46F5

I think that the most that easily could be done with this idea would be to demonstrate that using using random values for M on any given A,B,C,D,shift direction, and the derived E subtraction, would generate statistics that would on average describe that candidate more accurately, which might help with selection when there is more than one competing set of candidates.

From there, a specific value for M could be settled on for the winning candidate.

For sake of argument, if we were to assume that the original 13,5,10,9 is optimal, then it is proposed that there is likely an optimal value for M that is non-zero, possibly having the qualities described above.

We could stop there, as it is unlikely that even an exhaustive search for an optimal M would produce a significant statistical improvement (on any candidate set).

A more robust implementation (given the above likely limitation) would be available if the equations were derived to calculate seed corrections directly, and thus E as well (assuming E and M are field programmable):

1. Allow M to be chosen at random and/or from a table (starting with M=0) on boot

2. Choose seed with applied seed correction.

3. Apply M to the engine and the derived E to the scrambler .

Now we have 65536 possible mostly de-correlated streams available, with the possibility of choosing a new stream at will (by the end user, perhaps).

10,44010,440280I just ran a quick check on two xoroshiro 13,5,10,9, one normal with M=0,E=0 and the other with M<>0 (and using the corresponding E value in the scrambler, that I brute-forced from M). Starting both with the same state, xoring their output together, PractRand failed at 16GB. The two streams are effectively de-correlated.

This illustrates that instead of just a few good sets of candidates, you have a large resource of potential additional randomness from any of those few good candidates just by adding a single xor to the engine and a single subtraction to the scrambler.

A few dozen random choices for M (and calculating E from each) would likely yield at least a few slightly superior statistical results from which to make a single choice.

Going deeper to allow M to be user selectable would depend on being able to calculate good seeds and E on the fly from ABCD & M.

Edit: I am double-checking all of the above for correctness, since it could make a fine addition to the xoroshiro playbook, even if not used by the P2.

10,440EDIT: Lol, I think it's circular. Can't find M or E without having A,B and C first. A, B and C can't be found without M.

280My intuition says that they are all full period, due to the nature of how xor works.

D and E do not factor in, since they are not part of the engine iterator.

If all looks well, I need to magic the ‘bad seed value’ equations.

Must sleep.

10,44010,440280Let us call the the 'bad seed values' BS: BS0 and BS1 for each 16-bit half of the state.

When M=0xFFFF (as in xoroshiroN):

1. There exists a different non-zero BS (when M<>0) for each A,B,C triplet (to do: calculate BS directly, rather than brute-force).

2. Once we have BS (as a function of A,B,C), we can calculate E as a function of (BS,D): E = BS0 + ROL(BS0+BS1,D).

3. 1 and 2 are true for all values of M, not just for 0 and 0xFFFF (proved experimentally, and logically sound).

4. Each M (with its derived E), for a given A,B,C,D, results in a random stream de-correlated from any other M,E stream (to do: search for counter-examples).

5. [Edit] Each A,B,C,D,M,E stream will have similar statistical properties to the base A,B,C,D stream (to do: characterize the extents of statistical variation with various M,E). Yes, per #4 above.

This is kind of cool, as M becomes a key, which has the side-effect of bringing a marginal level of cryptographic security when paired with the strong mixing of the ++ scrambler.