On the above screenshot, the CPU Load graph is a sum of all 16 of the CPU's threads, which is just 6.25% of the vertical each. So you can see the vertical wall jumps up when all 15 iterator tasks are launched together (16 running tasks would have filled all 16 threads and hit 100%) and then the stepping down as they finish one by one.

PS: The three colours (Blue, Red, and Yellow) are, respectively, system tasks, user tasks, and "niced" tasks.

One thing I realized was kind of neat about this xoroshiro topology, as opposed to a simple LFSR, is that by summing its halves together, it's possible to output a zero, even though its internal state is always non-0. And iterating this topology is just as fast as an LFSR, because there's only some shallow XOR operations. The adder gives it its final high quality output, but is not needed for iteration.

You know, you now have a toolset which you can use to come up with new topologies and prove them in the manageable 32-bit domain.

To save trying to deliver all three lots of 1331 files, for a total of 53.6 MB, of the much smaller s12 scoring run, here's a directory listing of the s12 byte scores. They're sorted by file size, suitable as a summary of scores. File sizes are listed in left most column in rounded up KByte.

The hardware xoroshiro generator will create the next value in its sequence every system clock and the tests you have done show that such sequences are close to random but it is unlikely and perhaps impossible for a cog to read every consecutive value.

A cog that needs a regular flow of random numbers without using its own code will be sampling the xoroshiro output at interval n, where n > 1 probably. How do the tests perform for n > 1, for example 8, 16 or 32? Or 7, 15 or 31? Will some of the 63 bits be less random as a result and if so which ones?

The lsb of the summing component is avoided for that very reason. Not all bits are equal for sure. That'll be why the msb bit-truncated scoring is so much flatter than the other two variants. Though, generally the quality is good according to PractRand. I'd think any poor bit positions would trigger a fail in PractRand.

I've been pondering digging up the TestU01 BigCrush suite but haven't got around to it.

As for regular interval sampling, no idea. I'd be surprised if anything shows up but it's certainly something to test ...

I've just done some other iteration runs, namely s15 and s17. s15 took 1.5 minutes while s17 took a surprising 30 minutes using the Ryzen flat out! Even the dynamic clocking settled on full rate for every core for the duration - not a bump to be seen. Larger sizes are going to explode. Whaaa!

Chip,
It don't look promising. Here's the modified generator code. Attached is the s12 max iterate findings without any filter. It's across the board rapid terminations in less than 100 120 iterations. EDIT: And the scoring is 100% instant fail on every combination. EDIT2: Lol, we have actually have 3 of the 1331 scores that just scraped passed an instant fail.

s0 = S[0]
s1 = S[1]
S[0] = s0 rol a ^ s0 ^ s1 ^ s0 shl b ^ s1 shl b
S[1] = s0 rol c ^ s1 rol c

You can imagine all the possibilities of rotates, shifts, and XORs. I imagine the xoroshiro topology economized by using the 's1 ^= s0' operation early. This made it practical for efficient software implementation. For hardware, all that matters is that the net XOR depth is not high. Bits don't even need to be ordered. What kind of quality can be economically achieved without software-constrained thinking?

I threw away the call to next() and almost eliminated the intermediate steps from it. EDIT: When I shrunk it down to just that loop I had to declare s0 and s1 as statics to prevent GCC from optimising that loop into oblivion.

The compilers are too smart now. Out-thinking them is an increasingly tenuous art.

I think the quality of xoroshiro can be attributed to two things:

1) an LFSR-type iterator
2) an adder used only for output

Those are the take-aways.

The xoroshiro algorithm lies at the intersection of a few critical concepts and what can be practically computed by a CPU. If the core matters were to be optimized for gate count, it would look a lot different. But what would it be?

The hardware xoroshiro generator will create the next value in its sequence every system clock and the tests you have done show that such sequences are close to random but it is unlikely and perhaps impossible for a cog to read every consecutive value.

A cog that needs a regular flow of random numbers without using its own code will be sampling the xoroshiro output at interval n, where n > 1 probably. How do the tests perform for n > 1, for example 8, 16 or 32? Or 7, 15 or 31? Will some of the 63 bits be less random as a result and if so which ones?

If anyone thinks it's worthwhile, a couple of tests with n > 1 should be enough, not too close together, say n = 16 and n = 256.

The hardware xoroshiro generator will create the next value in its sequence every system clock and the tests you have done show that such sequences are close to random but it is unlikely and perhaps impossible for a cog to read every consecutive value.

A cog that needs a regular flow of random numbers without using its own code will be sampling the xoroshiro output at interval n, where n > 1 probably. How do the tests perform for n > 1, for example 8, 16 or 32? Or 7, 15 or 31? Will some of the 63 bits be less random as a result and if so which ones?

If anyone thinks it's worthwhile, a couple of tests with n > 1 should be enough, not too close together, say n = 16 and n = 256.

I think the PRNG test batteries check at bit intervals, so this should be covered, already, by xoroshiro128+.

That would be just me - unless you can handle a bunch of Bash scripts?

I've never done Bash scripts. I don't want you to do any work on Easter, either. I'm just pondering what kind of high-quality 32-bit PRNG may be possible in hardware (without software-constrained thinking).

I've started an s12 scoring run at iteration intervals of 16. It's generating decent output so far, actually is getting some 128 MB scores. Will take another 30 minutes I think ...

I don't want you to do any work on Easter, either.

This is my holiday fun. :D I've not had anything I can quickly sink my teeth into for a very long time.

I'm just pondering what kind of high-quality 32-bit PRNG may be possible in hardware (without software-constrained thinking).

I'm not sure there is any real shortcuts as such. Just packing it down for small space, or chopping it up for power efficiency is about all I can imagine.

I don't want you to do any work on Easter, either.

This is my holiday fun. :D I've not had anything I can quickly sink my teeth into for a very long time.

I'm just pondering what kind of high-quality 32-bit PRNG may be possible in hardware (without software-constrained thinking).

I'm not sure there is any real shortcuts as such. Just packing it down for small space, or chopping it up for power efficiency is about all I can imagine.

Think about this matter of rotating and shifting. The important part is that bits are changing places, not that they are necessarily all moving in a certain direction. The direction business comes from what a CPU can efficiently do.

I've never written a Bash script in my life before. I'll credit PractRand documentation for getting me started with pipes and automating things. And Google/Stackexchange for giving the answers on doing stuff with Bash, and C for that matter. It's easier to ask Google than try to find the answers in the manuals.

Oh, now I get it. Effectively perform all ops on a register in a single op and maybe make it even better to boot. EDIT: Of course it's anyone's guess as to what changes would be a good improvement. Best I can do is pure trial and error.

## Comments

3,9760Vote UpVote DownPS: The three colours (Blue, Red, and Yellow) are, respectively, system tasks, user tasks, and "niced" tasks.

7,7500Vote UpVote Down7,7500Vote UpVote DownYou know, you now have a toolset which you can use to come up with new topologies and prove them in the manageable 32-bit domain.

3,9760Vote UpVote Down3,9760Vote UpVote Down3,9760Vote UpVote Down3,9760Vote UpVote Down730Vote UpVote DownA cog that needs a regular flow of random numbers without using its own code will be sampling the xoroshiro output at interval n, where n > 1 probably. How do the tests perform for n > 1, for example 8, 16 or 32? Or 7, 15 or 31? Will some of the 63 bits be less random as a result and if so which ones?

3,9760Vote UpVote DownI've been pondering digging up the TestU01 BigCrush suite but haven't got around to it.

As for regular interval sampling, no idea. I'd be surprised if anything shows up but it's certainly something to test ...

3,9760Vote UpVote Down3,9760Vote UpVote Down7,7500Vote UpVote DownIf that ROL+ROL+SHL activity were changed to ROL+ROL+ROL, then the algorithm would become reversible, but would it still be any good?

3,9760Vote UpVote DownIt don't look promising. Here's the modified generator code. Attached is the s12 max iterate findings without any filter. It's across the board rapid terminations in less than

~~100~~120 iterations. EDIT: And the scoring is 100% instant fail on every combination. EDIT2: Lol, we have actually have 3 of the 1331 scores that just scraped passed an instant fail.7,7500Vote UpVote Down7,7500Vote UpVote Downs1 = S[1]

s1 ^= s0;

s[0] = (rotl( s0, CONSTANT_A ) ^ s1 ^ rotl( s1, CONSTANT_B )) & ACCUM_MASK;

s[1] = rotl(s1, CONSTANT_C);

...Could be rewritten as...

s0 = S[0]

s1 = S[1]

S[0] = s0 rol a ^ s0 ^ s1 ^ s0 shl b ^ s1 shl b

S[1] = s0 rol c ^ s1 rol c

You can imagine all the possibilities of rotates, shifts, and XORs. I imagine the xoroshiro topology economized by using the 's1 ^= s0' operation early. This made it practical for efficient software implementation. For hardware, all that matters is that the net XOR depth is not high. Bits don't even need to be ordered. What kind of quality can be economically achieved without software-constrained thinking?

3,9760Vote UpVote Down7,7500Vote UpVote DownI think the quality of xoroshiro can be attributed to two things:

1) an LFSR-type iterator

2) an adder used only for output

Those are the take-aways.

The xoroshiro algorithm lies at the intersection of a few critical concepts and what can be practically computed by a CPU. If the core matters were to be optimized for gate count, it would look a lot different. But what would it be?

730Vote UpVote DownIf anyone thinks it's worthwhile, a couple of tests with n > 1 should be enough, not too close together, say n = 16 and n = 256.

3,9760Vote UpVote Down7,7500Vote UpVote DownI think the PRNG test batteries check at bit intervals, so this should be covered, already, by xoroshiro128+.

7,7500Vote UpVote DownI've never done Bash scripts. I don't want you to do any work on Easter, either. I'm just pondering what kind of high-quality 32-bit PRNG may be possible in hardware (without software-constrained thinking).

3,9760Vote UpVote Down3,9760Vote UpVote DownI'm not sure there is any real shortcuts as such. Just packing it down for small space, or chopping it up for power efficiency is about all I can imagine.

7,7500Vote UpVote DownThink about this matter of rotating and shifting. The important part is that bits are changing places, not that they are necessarily all moving in a certain direction. The direction business comes from what a CPU can efficiently do.

7,7500Vote UpVote Down3,9760Vote UpVote Down3,9760Vote UpVote Down7,7500Vote UpVote Down7,7500Vote UpVote Down3,9760Vote UpVote Down