5. [Edit] Each A,B,C,D,M,E stream will have similar statistical properties to the base A,B,C,D stream (to do: characterize the extents of statistical variation with various M,E).

Are you thinking this could be improved within the restrictive confines we're working under?

Yes, to some extent... try a dozen or so random/arbitrary 16-bit values for M (replacing 0xFFFF) with 13,5,10,9,E; calculating E using your exhaustive search code with M in effect. Edit: ... calculating BS from the table in the following post, and then E from BS, both with M in effect.
If you find a value for M that is statistically superior, great, then you could modify the seeder, engine and scrambler to use BS, M and E, respectively.

If you find all values of (M,E) are statistically similar, then consider whether there is room to apply M and E as replaceable parameters, perhaps adjustable by the end user:
Could the P2 support replaceable parameters for M and E within the engine and scrambler?... if so:
1. User/Firmware may pick M (starting with 0, for default P2 behavior) and store it.
2. Firmware will (hopefully be able to) calculate BS from M and use it to verify any current/new seed is valid (i.e. seed <> BS), adding 1, if not.
3. Firmware will automatically calculate E from BS and store it (again, starting with 0, for default P2 behavior).
4. (Optional) User may override firmware calculated E (e.g. to create a stream with an equal number of 1 and 0 bits in the entire period).

I am working on calculating BS directly from M.

However, if there were no constraint on automatically supplying an E that restores the single short output value of the full period to a zero, then BS could be eliminated as well:
1. Generate an arbitrary 32-bit seed.
2. Choose any value for M.
3. Run a single iteration of xoroshiro and then compare the state to the original seed from step 1.
4. If the state is equal to the original seed, add 1 to the state.
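The four steps above can be sketched in C for the 32-bit state (two 16-bit words). This is only a sketch: the placement of M (XORed into s1 as it is loaded) is an assumption modelled on the Xoroshiro128psp beta code quoted later in this thread, and the names `xoro32m_t`, `xoro32m_step` and `xoro32m_seed` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container: two 16-bit state words plus the constant M. */
typedef struct { uint16_t s0, s1, m; } xoro32m_t;

/* 16-bit rotate left, valid for k = 1..15. */
static uint16_t rotl16(uint16_t x, int k) {
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* One engine iteration with A=13, B=5, C=10.
   Assumption: M is XORed into s1 on load, as s[2] is in Xoroshiro128psp. */
static void xoro32m_step(xoro32m_t *g) {
    uint16_t s0 = g->s0;
    uint16_t s1 = (uint16_t)(g->s1 ^ g->m);
    s1 ^= s0;
    g->s0 = (uint16_t)(rotl16(s0, 13) ^ s1 ^ (uint16_t)(s1 << 5));
    g->s1 = rotl16(s1, 10);
}

/* Steps 1-4: store seed and M, iterate once, bump the state if it did not move. */
static void xoro32m_seed(xoro32m_t *g, uint16_t seed0, uint16_t seed1, uint16_t m) {
    g->s0 = seed0;
    g->s1 = seed1;
    g->m = m;
    xoro32m_step(g);
    if (g->s0 == seed0 && g->s1 == seed1) {
        g->s0++;  /* the seed was the single fixed point for this M */
    }
}
```

With M=0 the fixed point is the all-zero state, so seeding with (0,0) ends at (1,0), while any other seed is simply left one iteration ahead.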

For example, suppose I pick an arbitrary M=21 (0x15).
Therefore bits 0, 2 and 4 are set.
So we xor the table values for those 3 set bits together: BS = 0x105A06DE ^ 0x61DC16C4 ^ 0x6D75302E = 0x1CF32034
Thus BS0=0x2034 and BS1=0x1CF3.
Now we calculate E from BS and D : E = 0x2034 + ROL(0x2034 + 0x1CF3, 9) = 0x6EAE
Finally, we ensure the current state <> BS: If (state==BS) {state++;}
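The worked example above can be checked with a short C sketch. The function names `bs_from_m` and `e_from_bs` are hypothetical, and the table is passed in by the caller because only three of its sixteen entries appear in this post.

```c
#include <stdint.h>

/* 16-bit rotate left, valid for k = 1..15. */
static uint16_t rotl16(uint16_t x, int k) {
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* XOR together the table entries selected by the set bits of M.
   bs_table[i] is the BS value found by brute force for M = 1 << i. */
static uint32_t bs_from_m(uint16_t m, const uint32_t bs_table[16]) {
    uint32_t bs = 0;
    for (int bit = 0; bit < 16; bit++) {
        if (m & (1u << bit)) {
            bs ^= bs_table[bit];
        }
    }
    return bs;
}

/* E = BS0 + ROL(BS0 + BS1, D), all in 16-bit arithmetic. */
static uint16_t e_from_bs(uint32_t bs, int d) {
    uint16_t bs0 = (uint16_t)(bs & 0xFFFFu);
    uint16_t bs1 = (uint16_t)(bs >> 16);
    return (uint16_t)(bs0 + rotl16((uint16_t)(bs0 + bs1), d));
}
```

Filling entries 0, 2 and 4 with the three values quoted above (in the order given, which is assumed to match the bit order) and calling `bs_from_m(0x15, table)` gives 0x1CF32034; `e_from_bs(0x1CF32034, 9)` then gives 0x6EAE.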

The BS table values were generated by passing A=13,B=5,C=10 with each of the 16 single-bit set values for M to my original brute force search algorithm which returns BS, but modified to receive M (instead of only using 0xFFFF).

Edit: Fixed a lookup/math error when manually calculating BS and E in the example.
Edit2: Obviously a better way to calculate BS directly from M would be required for implementing this on larger state versions of xoroshiro... but only in order to restore the missing zero to the output, which would be kind of pointless for a larger state generator. Otherwise, just ensure 1 iteration does not return the same state as the seed.

If you find a value for M that is statistically superior, great, then you could modify the seeder, engine and scrambler to use BS, M and E, respectively.

I did a PractRand gridding run last night. It took about 5 hours to produce 1900 scores. Being a grid run, the average score is higher than a plain full-width sampling aperture. So 400 scores per hour should be possible for ungridded singles. This is an 8-core 4 GHz CPU.

84 x 15 x 65534 = 82572840 test cases to run. At 400/hr comes to about 23.5 years to run.
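For anyone wanting to redo that estimate with a smaller candidate set, the arithmetic can be written as two small C helpers. The function names are hypothetical and a 365.25-day year is assumed.

```c
#include <stdint.h>

/* Total test cases: triplets x D rotations x untested M values. */
static uint64_t total_cases(uint64_t triplets, uint64_t rotations, uint64_t m_values) {
    return triplets * rotations * m_values;
}

/* Wall-clock years at a given scores-per-hour rate (365.25-day years). */
static double years_at_rate(uint64_t cases, double per_hour) {
    return (double)cases / per_hour / 24.0 / 365.25;
}
```

`total_cases(84, 15, 65534)` is 82572840, and `years_at_rate(82572840, 400.0)` comes out at a little over 23.5.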

If you find all values of (M,E) are statistically similar, then consider whether there is room to apply M and E as a replaceable parameters, perhaps adjustable by the end user:

Aside from the fact that that isn't an option due to enormous increase in hardware requirements, so far there isn't any real sign of a measurable improvement over the existing Xoroshiro32++. If anything is found, it'll either be fixed constants ... or can only usefully be a software oriented solution.

The moment a function has adjustable parameters it incurs hardware cost. Simple logic is least costly but even that still needs a parameter store register. A barrel shifter is a great example of a costly function. If it has a constant shift parameter then it is 0% logic and 0% store cost. Only costs a little routing. But the moment the shift parameter becomes adjustable you have a small parameter register plus a huge mass of logic to mux each bit of the input data to selectively all bits of the output data. And the routing bloats out with that.
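A rough behavioural model in C may make the difference concrete. It is not the P2's actual logic, only an illustration: a fixed rotation is a rewiring of bits, whereas one common way to build a variable rotation is a chain of log2(width) mux stages, each conditionally rotating by a power of two. The function names are hypothetical.

```c
#include <stdint.h>

/* Constant rotate by 9: in hardware this is pure wiring, no gates. */
static uint16_t rotl16_by_9(uint16_t x) {
    return (uint16_t)((x << 9) | (x >> 7));
}

/* Variable rotate modelled as four mux stages (rotate by 1, 2, 4, 8).
   Each stage stands for a 16-bit-wide 2:1 mux driven by one bit of k. */
static uint16_t rotl16_var(uint16_t x, unsigned k) {
    for (unsigned stage = 0; stage < 4; stage++) {
        unsigned amount = 1u << stage;
        uint16_t rotated = (uint16_t)((x << amount) | (x >> (16 - amount)));
        x = ((k >> stage) & 1u) ? rotated : x;
    }
    return x;
}
```

Both functions return the same value when k is 9, but the second needs a register to hold k and four layers of 16 muxes, which is the cost being described.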

Aside from the fact that that isn't an option due to enormous increase in hardware requirements

Heh, apologies, that was a tad harsh. All that's needed is two parameter registers for an xor and adder. In both cases the logic doesn't change in size so it's not a huge increase at all. But even those extra registers have to be worthy. The quality improvement would need to be significant, as in reaching up to the 8 GB mark.

84 x 15 x 65534 = 82572840 test cases to run. At 400/hr comes to about 23.5 years to run.

84 triplets... likely only 5-10 of the candidates from your original testing (that eventually settled on 13,5,10,9) would be required. Other candidates should still fail scrutiny with any value of M.
15 D rotations... D = 1,2,14 and 15 have little chance to make the grade in my testing on several candidates, but automating all 15 is fine.
65534 M values (as 0 and ffff were tested)... only a few dozen should be required for each ABCD to decide which few candidates to focus on.

Once a few ABCD candidates are chosen, then 1 week for each to find the best M at 400/hr... or stop short, as it is unlikely that the best M of all 65536 will be significantly better than the best of 1000 randomly chosen values.
Take the best ABCD from those few.

If all M for the best ABCD candidate look good within statistical reason, that would be evidence to allow M to be user programmable (if that were possible). Otherwise, use the best M found.

On a related note... if for some reason you think that the engine cannot handle the extra ^ M term... it could be moved to the last line: s[1] = rotl(s1 ^ M, C);
However, that would take me back to the beginning on double-checking basic statistics, verifying de-correlation of parallel streams, and solving for BS, etc.
If I release code for 128-bit state xoroshiro, it will likely have that change made, but only if all looks good at 32-bit state.
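As a sketch of that alternative placement, the engine with M moved to the last line might look like this in C for the 32-bit state with A=13, B=5, C=10. The function name `step_m_last` is hypothetical, and none of the statistics discussed above have been checked for this form.

```c
#include <stdint.h>

/* 16-bit rotate left, valid for k = 1..15. */
static uint16_t rotl16(uint16_t x, int k) {
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* Engine iteration with the ^ M term applied only in the last line,
   i.e. s[1] = rotl(s1 ^ M, C). With M = 0 this is the plain engine. */
static void step_m_last(uint16_t s[2], uint16_t m) {
    uint16_t s0 = s[0];
    uint16_t s1 = (uint16_t)(s[1] ^ s0);
    s[0] = (uint16_t)(rotl16(s0, 13) ^ s1 ^ (uint16_t)(s1 << 5));
    s[1] = rotl16((uint16_t)(s1 ^ m), 10);
}
```

In this form M only touches the new s[1], so the s[0] update path is identical to the unmodified engine.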

The quality improvement would need to be significant, as in reaching up to the 8 GB mark.

The only ways that could happen would be:
1. Break equidistribution (e.g. 'scro-rarns')
2. Increase state size.
3. Xor the output of two good 32-bit prngs (which is basically an inferior version of #2).

If you shuffled all output values from one full period and fed them into PractRand, it would likely fail at 2GB, but still probably 1GB.
At best, I was hoping to find an ABCDEM candidate that would consistently make it to >= 1GB on all apertures (or whatever criteria makes sense for a measured improvement).
If you find one, that would be awesome.

For my part, I'm very pleased to have stumbled on a viable mechanism for spawning de-correlated xoroshiro128 streams that benchmark about 90% as fast as the original.

Chris,
I can hand over all my sources if you like. The automation is all done with Bash scripts though. And it's an evolving beast, every day some changes are made. The useful parts in C are mostly the reconfigurable enhancements on the basic algorithms.

Tony can vouch for my ability to keep the computer on for more than a couple of days at a time.

I have 6 + 16 cores (two workstations) so 44 threads... and that is not always enough.
I sometimes still have to reach out to a friend (who is 'off-the-grid', so to speak) for his additional 24 threads for weeks at a time.

Sometimes I envy those crypto-jackers for having been able to bring to bear a huge amount of parallel processing power through unaware web-browsers.
Some of my years long calculation ideas would be done in hours.

Chris,
I can hand over all my sources if you like. The automation is all done with Bash scripts though. And it's an evolving beast, every day some changes are made. The useful parts in C are mostly the reconfigurable enhancements on the basic algorithms.

That would be great! I do most of my development on a combination of MSVC and Windows Subsystem for Linux (Ubuntu).

The only pipe-able version of TestU01 Big/Small/Crush that I have is for Windows, that I compiled with Cygwin... slow, but reliable. I compile for Linux for bigger jobs, though.
I've generated about 30000 BigCrush results across about 50 PRNGs so I could perform meta-analysis to find weaknesses in my PRNGs and identify P-Val corrections for 20-30 of the BigCrush statistics.

My work with PRNGs has been a mostly thank-less endeavor... but it does help fine-tune my logic skills for my day job (where I routinely connect with professionals all over the globe in diagnosing issues with analytical equipment... which requires proficiency in computer science, chemistry, physics, electronics, robotics, optics and mechanics).

If you shuffled all output values from one full period and fed them into PractRand, it would likely fail at 2GB, but still probably 1GB.

I didn't have a clue. So that's the nature of equidistribution then. Which I know we want to keep.

I find 1-dimensional equidistribution (e.g. xoroshiro32++) adequate for my work; however, higher might be required for some uses.
The theory goes that once you reach the square root of the period on a good PRNG (not sure how many dimensions), then an optimized analysis would reveal a statistically significant issue with some of the outputs that are currently in deficit and excess suddenly changing toward excess and deficit at the same time (where normally only about half of those in either excess and deficit would head the other way).

Some PRNGs (e.g. xoroshiro raw full-state output) will fail birthday spacings tests. Xoroshiro32+ will certainly fail tests looking for 3 consecutive same outputs.

I am unaware of any statistical packages that attempt to make some of those judgements over longer periods and/or larger states, as the memory requirements are large.
Melissa wrote one for birthday spacings that could identify and fail a single-dimensional 64-bit PRNG that uses only 1 state variable.
PractRand is aware enough of various common issues to throw up a flag eventually, but not as soon as is theoretically possible in some cases.
I found that Big Crush is only marginally aware of some issues, until you perform a meta-analysis of enough runs (sometimes hundreds) on a given PRNG.

Thanks, Evan. I've found the right QuickBASIC ranking programs amongst the mass of xoroshiro-related QB stuff, which means I can compare the xoroshiro32++ and xoroshiron32++ distributions and post the results.

I'm looking at my own collection right now and it's so many old scripts I'm not sure what still works even. Scripts call scripts that compile code which calls more code ... Parameter passing changes. Naming schemes change. Directory structures change. Source data changes.

prank/zrank/nzrank = pfreq/zfreq/nzfreq ranking
rank = pfreq+zfreq+nzfreq ranking
pchi/zchi/nzchi = Chi-Square total for pair/zero run/non-zero run frequency distributions

I had mentioned upper limits on what could be expected from PractRand for a 1-dimensionally equidistributed PRNG like xoroshiro32++, so I did a few tests to re-confirm.

Here is an example scrambler modification for use with the venerable 13,5,10,9, which I think pushes the upper limits of what is possible:

Although I have no idea how that looks with regard to 'pair/zero run/non-zero run frequency', it does nearly exactly what I said in terms of PractRand:
Almost passes at 1GB, such that 2GB seems that it will happen in a dozen or so tries. That is on both forward and reverse bits, with no notable differences between the two.

That is the best measuring stick I have come up with.

Edit: * 3 also works in the first line of the above code (i.e. 'result = s0 + s1 + s1 +s1' ), but slightly less balanced.
However, it certainly can often make the 2GB mark before failing PractRand, but on forward bits only. Reverse bits fails at 1GB often (but not quite as gracefully as *5) or at 512MB sometimes.

Those two examples do not quite work (I initially had not checked the last one you edited in; it works, but not well), but it got me thinking.
This is within a hair of my previous * 5 example (on both forward and reverse bits):

rngword_t result = rotl( s0 + s1, CONSTANT_D ) + s1 + (s1 << 2); // for some candidates '... + s0 + (s0 << 2)' will be better

Quickly running out of simple variations that will work well.
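For reference, here is that scrambler line made self-contained in C next to the stock xoroshiro32++ scrambler. It assumes `rngword_t` is a 16-bit word and `CONSTANT_D` is 9 (from the 13,5,10,9 candidate); the function names `scramble_y5` and `scramble_pp` are hypothetical.

```c
#include <stdint.h>

typedef uint16_t rngword_t;  /* assumed: 16-bit output word */
#define CONSTANT_D 9         /* assumed: D from the 13,5,10,9 candidate */

/* 16-bit rotate left, valid for k = 1..15. */
static rngword_t rotl(rngword_t x, int k) {
    return (rngword_t)((x << k) | (x >> (16 - k)));
}

/* Scrambler from the post: rotl(s0 + s1, D) + s1 * 5,
   with the multiply written as s1 + (s1 << 2). */
static rngword_t scramble_y5(rngword_t s0, rngword_t s1) {
    return (rngword_t)(rotl((rngword_t)(s0 + s1), CONSTANT_D)
                       + s1 + (rngword_t)(s1 << 2));
}

/* Stock xoroshiro32++ scrambler, for comparison: rotl(s0 + s1, D) + s0. */
static rngword_t scramble_pp(rngword_t s0, rngword_t s1) {
    return (rngword_t)(rotl((rngword_t)(s0 + s1), CONSTANT_D) + s0);
}
```

Both take the current state words and leave the engine untouched, so either can be dropped into an existing test harness by swapping only the output line.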

Apologies for the silence. I decided to refactor the way the algorithms get compiled so that they now get linked instead of bundled in the one compile. It's decluttered the C sources quite a lot.

Got a little sidetracked with the other work so still not finished ...

For use with 128-bit state, I've been playing with the last * 5 scrambler that I posted on smaller state sizes and noticed that if it is paired with M<>0:
1. 1-dimensional equidistribution is maintained in all cases, as expected... good.
2. Single missing state, and thus the short output value (if left uncorrected by subtraction) is almost always non-zero, as expected... which is irrelevant with larger state sizes.
3. Some values of M may produce, in addition to the normal output pairs, occasional triples, or even a rare quad depending on where M was inserted, which was unexpected... but likely ok.
4. The 2-dimensional distribution accumulated across all values of M as a superset is not perfectly equidistributed, but so close that it looks viable. 1-d is perfect across the superset when y*5 is used, (but not x*5... probably ok).
5. Speed with 128-bit state can be made equal to xoroshiro128starstar, but with these notable differences: no accidental partial-invertibility (Melissa criticized, perhaps unfairly), only 1-D equidistribution, no perceptible escape-from-zero issues with vast majority of values of M, and since stream selection is via M, no jump function is required (but must test current state for single cycle loop when seeding and increment state, if necessary). This gives access to 2^64 total 2^128-1 period de-correlated streams... good.

A few notes on the beta code I linked, as to how it would relate to use at 32-bit state with the choice of a single M (s[2] in the example):
1. x * 5, was chosen for both its effect on superset near-perfect equidistribution and superior x64 compiled speed.
2. y * 5 is likely the superior choice for a single value of M (even if M=0 or absent).
3. Some otherwise unusable triplets may suddenly become viable with y * 5 or x * 5.
4. A half-rotation in the scrambler may become viable. I looked specifically at 'rotl( x + y, 8 ) + y * 5;' (with or without M present).
5. The position of M in the code is mostly relevant to speed/parallelization.
6. None of the changes I have explored allow for simultaneously failing PractRand at 2GB with both forward and reverse bits (i.e. obtaining 2GB fail on reverse bits, usually fails at 512MB forward).
7. The use of x/y * 5 in the latter half of the output scrambler makes the escape-from-zero issue much less apparent, thus has allowed me to focus more on using M as a stream selector.

That's a little over my head but I think I'm learning.
Comments:
- I've never identified what constitutes a dimension in this world.
- What's a stream?
- "2^64 2^128-1" seemed a broken number until I looked at the linked page where it says "2^64 De-correlated, Jump-less Streams Each With Period 2^128-1 !!!"
- The source code posted on that linked webpage is truncated. All I get is

// Xoroshiro128psp Beta Test Code - Copyright Christopher Rutz - Free for all uses.
// Based on xoroshiro128++ by S. Vigna / D. Blackman, and discussions of it with the Parallax development team.
// Designed to address all common issues and criticisms of PRNGs, and to allow for most real usage scenarios without caveat.
// Nearly-perfect 1-dimensional equidistribution (one output occurring 2^64-1 times and the rest 2^64 times).
// All streams are nearly-perfectly de-correlated from each other, and form a super-set that is nearly-perfectly 2-dimensionally equidistributed.
// Though not intended for cryptographic uses, the basic design is reasonably secure when seeded properly.
// Any subset of output bits may be used for floating-point conversion.
// Pipeline optimized for Intel CPUs, to meet the same performance speed of xoroshiro128**.
#include <stdint.h>
// Current state, seeded with any values by calling xoroshiro128psp_seed
// Do not modify s[2] (stream selector) after seeding.
uint64_t s[3] = { 0 , 0 , 1 };
// 64-bit barrel-rotation, simplifies to single ROL op-code by most compilers
inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k)); }
// Returns a single 64-bit output value from the currently selected stream of 2^128-1 possible values
inline uint64_t xoroshiro128psp() {
uint64_t s0 = s[0];
uint64_t s1 = s[1] ^ s[2];
uint64_t result = rotl(s0 + s1, 33) + s0 * 5; // s0 * 5 = s0 + (s0 << 2)
s1 ^= s0;
s[0] = rotl(s0, 24) ^ s1 ^ (s1 << 16);
s[1] = rotl(s1, 37);
return result; }
// Must use this function to seed both state variables and select a stream
// It is recommended to use SplitMix to generate the seeds and stream selector for passing to this function
void xoroshiro128psp_seed(uint64_t seed0, uint64_t seed1, uint64_t stream) {
s[0] = seed0; s[1] = seed1; s[2] = stream;
next();
if (s[0]==seed0 && s[1]==seed1) { s[0] ^= 1ULL; }

1-Dimensional Equidistribution = Every output value appears an equal number of times (except some good PRNGs are short by 1 occurrence of a single output value).
2-Dimensional Equidistribution = In addition to the above, each output value will occur in pairs an equal number of times, as well as every possible pair of different values.
3-Dimensional Equidistribution = In addition to the above, all possible triplets are produced equally, which is good if you want the possibility of filling any/all points in a cube, for example.
A higher-dimensional distribution may be obtained by using only a subset of bits of the above, thus a 64-bit output 1-D PRNG may be converted to floating point by dropping 12 bits and might possibly achieve up to 13-dimensional equidistribution... more than enough for working on the vast majority of problems.
Some problems, like the (theoretical) ability to randomly generate all possible shuffles of a deck of 52 cards, (as I understand it) require a PRNG with a minimum of about 8*10^67 6-bit outputs. This is not quite within reach of xoroshiro128psp, even using all streams. Xoshiro256 can handle this easily. (I can't help but think there is a flaw in the logic of what I have read on card shuffling PRNGs, but never took the time to look).
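The card-shuffle figure can be checked with integer arithmetic alone. The sketch below (the function name `factorial_bits` is hypothetical) counts how many bits are needed to write n! and so shows where 52! sits relative to 192 bits (64 stream bits plus 128 state bits) and 256 bits. It does not settle whether counting starting points this way is the right criterion, which is the doubt raised above.

```c
#include <stdint.h>

/* Number of bits needed to represent n!, using a 256-bit little-endian
   integer held in eight 32-bit limbs (so n! must fit in 256 bits). */
static int factorial_bits(unsigned n) {
    uint32_t limb[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };
    for (unsigned i = 2; i <= n; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 8; j++) {
            uint64_t t = (uint64_t)limb[j] * i + carry;
            limb[j] = (uint32_t)t;
            carry = t >> 32;
        }
    }
    /* Scan down from the top bit to find the highest one set. */
    for (int bit = 255; bit >= 0; bit--) {
        if (limb[bit / 32] & (1u << (bit % 32))) {
            return bit + 1;
        }
    }
    return 0;
}
```

`factorial_bits(52)` is 226, which is more than 192 and less than 256, consistent with the statement that all streams of a 128-bit state generator fall short while a 256-bit state does not.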

Streams are useful when parallelizing problem solving, as each stream provides a source that is not related to the others to maximize coverage and avoid invalid statistical inference.

I fixed the odd way I wrote the numbers in the post.

The code is complete now... it just needed an extra } at the end (which I wrap back to the end of the previous line out of poor habit), and I fixed the bad function call within the seed function.
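For readers following along with the truncated listing quoted earlier, a compilable version with those two fixes applied might look like the sketch below. It assumes the "bad function call" was `next()` and that it should call the generator itself; `static` has been added to the `inline` functions so the file links on its own under C99, which is a change from the posted source.

```c
#include <stdint.h>

/* State words s[0], s[1] and the stream selector s[2].
   Do not modify s[2] after seeding. */
static uint64_t s[3] = { 0, 0, 1 };

/* 64-bit rotate left, valid for k = 1..63. */
static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

/* Returns one 64-bit output from the currently selected stream. */
static inline uint64_t xoroshiro128psp(void) {
    const uint64_t s0 = s[0];
    uint64_t s1 = s[1] ^ s[2];
    const uint64_t result = rotl(s0 + s1, 33) + s0 * 5;
    s1 ^= s0;
    s[0] = rotl(s0, 24) ^ s1 ^ (s1 << 16);
    s[1] = rotl(s1, 37);
    return result;
}

/* Seeds both state words and selects a stream. One iteration is run and,
   if the state did not move, it is nudged off the single fixed point. */
static void xoroshiro128psp_seed(uint64_t seed0, uint64_t seed1, uint64_t stream) {
    s[0] = seed0;
    s[1] = seed1;
    s[2] = stream;
    (void)xoroshiro128psp();  /* assumed replacement for the undefined next() */
    if (s[0] == seed0 && s[1] == seed1) {
        s[0] ^= 1ULL;
    }
}
```

Seeding with all zeros and stream 0 exercises the fixed-point branch and leaves s[0] equal to 1.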

Thanks... now all I need is to run a minimum of 32TB of PractRand, 1PB of Hamming weight tests, a few hundred Big Crush tests for meta-analysis and a 10TB gjrand test.

Thanks heaps. That's made it clear.

Streams is way simpler than I expected. I hadn't considered you were referring to uses. Okay so M can be used as evenly spaced offsets to the state. But it won't be as simple as, say, invert msb of M to jump 50% through state space. It won't be that linear, right?

## Comments

http://forums.parallax.com/discussion/comment/1447783/#Comment_1447783

Yep, that does seem to be correct, an almost complete set of single iterated freq distribution data.

I can see I did a small run of double iterated the next day, but nothing since. I can't remember what I found with it.

xoroshiro32++[a,b,c,d]

xoroshiron32++[a,b,c,d]

prank/zrank/nzrank = pfreq/zfreq/nzfreq ranking

rank = pfreq+zfreq+nzfreq ranking

pchi/zchi/nzchi = Chi-Square total for pair/zero run/non-zero run frequency distributions

or:

265This is within a hair of my previous * 5 example (on both forward and reverse bits): Quickly running out of simple variations that will work well.

9,827Got a little sidetracked with the other work so still not finished ...

2651. 1-dimensional equidistribution is maintained in all cases, as expected... good.

2. Single missing state, and thus the short output value (if left uncorrected by subtraction) is almost always non-zero, as expected... which is irrelevant with larger state sizes.

3. Some values of M may produce, in addition to the normal output pairs, occasional triples, or even a rare quad depending on where M was inserted, which was unexpected... but likely ok.

4. The 2-dimensional distribution accumulated across all values of M as a superset is not perfectly equidistributed, but so close that it looks viable. 1-d is perfect across the superset when y*5 is used, (but not x*5... probably ok).

5. Speed with 128-bit state can be made equal to xoroshiro128starstar, but with these notable differences: no accidental partial-invertibility (Melissa criticized, perhaps unfairly), only 1-D equidistribution, no perceptible escape-from-zero issues with vast majority of values of M, and since stream selection is via M, no jump function is required (but must test current state for single cycle loop when seeding and increment state, if necessary). This gives access to 2^64 total 2^128-1 period de-correlated streams... good.

Beta code here: Xoroshiro128psp

Even if you do not find anything useful out of all of this, I have benefited greatly from some of your and Tony's insights... Thanks!

1. x * 5 was chosen for both its effect on superset near-perfect equidistribution and superior x64 compiled speed.

2. y * 5 is likely the superior choice for a single value of M (even if M=0 or absent).

3. Some otherwise unusable triplets may suddenly become viable with y * 5 or x * 5.

4. A half-rotation in the scrambler may become viable. I looked specifically at 'rotl( x + y, 8 ) + y * 5;' (with or without M present).

5. The position of M in the code is mostly relevant to speed/parallelization.

6. None of the changes I have explored allow for simultaneously failing PractRand at 2GB with both forward and reverse bits (i.e. obtaining 2GB fail on reverse bits, usually fails at 512MB forward).

7. The use of x/y * 5 in the latter half of the output scrambler makes the escape-from-zero issue much less apparent, and thus has allowed me to focus more on using M as a stream selector.
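One way to see why 1-dimensional equidistribution can survive a scrambler like 'rotl( x + y, 8 ) + y * 5' is that, with y held fixed, the output is a one-to-one function of x. The sketch below checks that by brute force for 16-bit words. It assumes x and y are the two state words and ignores M entirely, so it says nothing about the superset results; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Rotate a 16-bit word left by k, valid for 1 <= k <= 15. */
static uint16_t rotl16(uint16_t x, int k)
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* Half-rotation scrambler quoted above, on 16-bit words.
   x and y are assumed to be the two state words. */
static uint16_t scramble_y5(uint16_t x, uint16_t y)
{
    return (uint16_t)(rotl16((uint16_t)(x + y), 8) + y * 5u);
}

/* Returns 1 if, with y fixed, every 16-bit output occurs exactly once
   as x sweeps all 65536 values; returns 0 on the first repeat. */
int y5_is_bijective_in_x(uint16_t y)
{
    static uint8_t seen[65536];
    memset(seen, 0, sizeof seen);
    for (uint32_t x = 0; x < 65536u; x++) {
        uint16_t out = scramble_y5((uint16_t)x, y);
        if (seen[out])
            return 0;
        seen[out] = 1;
    }
    return 1;
}
```

If that holds for every y, then over all 2^32 word pairs each output value appears exactly 2^16 times, and dropping the single missing state leaves one value short by one.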

Comments:

- I've never identified what constitutes a dimension in this world.

- What's a stream?

- "2^64 2^128-1" seemed a broken number until I looked at the linked page where it says "2^64 De-correlated, Jump-less Streams Each With Period 2^128-1 !!!"

- The source code posted on that linked webpage is truncated. All I get is

2-Dimensional Equidistribution = In addition to the above, each output value will occur in pairs an equal number of times, as well as every possible pair of different values.

3-Dimensional Equidistribution = In addition to the above, all possible triplets are produced equally, which is good if you want the possibility of filling any/all points in a cube, for example.

A higher-dimensional distribution may be obtained by using only a subset of bits of the above, thus a 64-bit output 1-D PRNG may be converted to floating point by dropping 12 bits and might possibly achieve up to 13-dimensional equidistribution... more than enough for working on the vast majority of problems.
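A minimal sketch of the bit-dropping conversion mentioned above, assuming the 12 low bits are the ones discarded and the remaining 52 bits are scaled into [0, 1). The function name is hypothetical. (Many libraries keep 53 bits instead, dropping 11; the figure of 12 from the text is used here.)

```c
#include <stdint.h>

/* Map a 64-bit PRNG output to a double in [0, 1) by keeping the
   top 52 bits (dropping the low 12) and scaling by 2^-52. */
double u64_to_unit_double(uint64_t x)
{
    return (double)(x >> 12) * 0x1.0p-52;
}
```

Every retained value is exactly representable in a double, so the mapping introduces no rounding of its own.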

Some problems, like the (theoretical) ability to randomly generate all possible shuffles of a deck of 52 cards, (as I understand it) require a PRNG with a minimum of about 8*10^67 6-bit outputs. This is not quite within reach of xoroshiro128psp, even using all streams. Xoshiro256 can handle this easily. (I can't help but think there is a flaw in the logic of what I have read on card shuffling PRNGs, but never took the time to look).
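The arithmetic behind that comparison can be sketched as follows, taking the usual argument at face value: a generator cannot produce more distinct shuffles than it has distinct starting points. Here 2^192 stands for 2^64 streams times roughly 2^128 states, and 2^256 for a 256-bit state. The function names are mine, and the factorial is only a double-precision approximation.

```c
/* n! as a double: approximate for large n, but accurate enough for an
   order-of-magnitude comparison. */
double factorial_approx(int n)
{
    double f = 1.0;
    for (int i = 2; i <= n; i++)
        f *= (double)i;
    return f;
}

/* 2^bits as a double by repeated doubling (valid for 0 <= bits < 1024). */
double two_to_the(int bits)
{
    double p = 1.0;
    for (int i = 0; i < bits; i++)
        p *= 2.0;
    return p;
}

/* Returns 1 if 2^state_bits starting points are at least as many as
   the number of orderings of a deck with the given number of cards. */
int can_cover_shuffles(int cards, int state_bits)
{
    return two_to_the(state_bits) >= factorial_approx(cards);
}
```

With 52 cards this gives no for 192 bits and yes for 256 bits, since 52! is about 8*10^67, which is a little under 2^226.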

Streams are useful when parallelizing problem solving, as each stream provides a source that is not related to the others to maximize coverage and avoid invalid statistical inference.

I fixed the odd way I wrote the numbers in the post.

The code is complete now... it just needed an extra } at the end (which I wrap back to the end of the previous line out of poor habit), and I fixed the bad function call within the seed function.

Thanks... now all I need is to run a minimum of 32TB of PractRand, 1PB of Hamming weight tests, a few hundred Big Crush tests for meta-analysis and a 10TB gjrand test.

Streams is way simpler than I expected. I hadn't considered you were referring to uses. Okay so M can be used as evenly spaced offsets to the state. But it won't be as simple as, say, invert msb of M to jump 50% through state space. It won't be that linear, right?

What about something like this, without a D?

EDIT: Oh, that still has the bad bit0 doesn't it.