Random/LFSR on P2

evanh · 2017-04-16 07:45

On the above screenshot, the CPU Load graph is a sum of all 16 of the CPU's threads, which is just 6.25% of the vertical each. So you can see the vertical wall jumps up when all 15 iterator tasks are launched together (16 running tasks would have filled all 16 threads and hit 100%) and then the stepping down as they finish one by one.

PS: The three colours (Blue, Red, and Yellow) are, respectively, system tasks, user tasks, and "niced" tasks.

cgracey · 2017-04-16 07:48

Very neat!

cgracey · 2017-04-16 07:59

One thing I realized was kind of neat about this xoroshiro topology, as opposed to a simple LFSR, is that by summing its halves together, it's possible to output a zero, even though its internal state is always non-0. And iterating this topology is just as fast as an LFSR, because there's only some shallow XOR operations. The adder gives it its final high quality output, but is not needed for iteration.

You know, you now have a toolset which you can use to come up with new topologies and prove them in the manageable 32-bit domain.

evanh · 2017-04-16 12:19

Can now specify the word size to iterate on. Here's the s12 outcome:

   Combinations 	Full	Byte	Bit
==================	====	====	====
 1 1 a    1, 1,10			4MB
 1 5 6    1, 5, 6			4MB
 2 2 3    2, 2, 3			4MB
 3 2 2    3, 2, 2			
 3 2 8    3, 2, 8			4MB
 4 2 b    4, 2,11			4MB
 4 4 7    4, 4, 7	32MB		4MB
 4 4 9    4, 4, 9			4MB
 4 9 7    4, 9, 7	16MB		4MB
 6 5 1    6, 5, 1			
 7 4 4    7, 4, 4			4MB
 7 4 8    7, 4, 8	16MB	16MB	4MB
 7 9 4    7, 9, 4	32MB		4MB
 7 a a    7,10,10	32MB	4MB	4MB
 8 2 3    8, 2, 3	64MB	8MB	4MB
 8 2 b    8, 2,11		8MB	4MB
 8 3 9    8, 3, 9	32MB	8MB	4MB
 8 4 7    8, 4, 7			4MB
 8 9 b    8, 9,11	32MB		4MB
 9 3 8    9, 3, 8			4MB
 9 4 4    9, 4, 4	16MB		4MB
 a 1 1   10, 1, 1			4MB
 a a 7   10,10, 7			4MB
 b 2 4   11, 2, 4		16MB	4MB
 b 2 8   11, 2, 8			4MB
 b 9 8   11, 9, 8			4MB

evanh · 2017-04-16 12:21

My favourite of that lot is {8,2,3}. It conforms nicely and I do like that 64MB score.

evanh · 2017-04-16 13:14

To save trying to deliver all three lots of 1331 files, for a total of 53.6 MB, of the much smaller s12 scoring run, here's a directory listing of the s12 byte scores. They're sorted by file size, suitable as a summary of scores. File sizes are listed in left most column in rounded up KByte.

File size	Score
=========	=====
152 KB		2 GB
137 KB 		1 GB
123 KB		512 MB
109 KB		256 MB
97 KB		128 MB
85 KB		64 MB
75 KB		32 MB
64 KB		16 MB
55 KB		8 MB
47 KB		4 MB

evanh · 2017-04-16 13:24

And here's the summary of s16 scores (EDIT: This is a listing of over 10000 files that fills 346 MB of score data!)

TonyB · 2017-04-16 15:00

The hardware xoroshiro generator will create the next value in its sequence every system clock and the tests you have done show that such sequences are close to random but it is unlikely and perhaps impossible for a cog to read every consecutive value.

A cog that needs a regular flow of random numbers without using its own code will be sampling the xoroshiro output at interval n, where n > 1 probably. How do the tests perform for n > 1, for example 8, 16 or 32? Or 7, 15 or 31? Will some of the 63 bits be less random as a result and if so which ones?

evanh · 2017-04-16 15:14

The lsb of the summing component is avoided for that very reason. Not all bits are equal for sure. That'll be why the msb bit-truncated scoring is so much flatter than the other two variants. Though, generally the quality is good according to PractRand. I'd think any poor bit positions would trigger a fail in PractRand.

I've been pondering digging up the TestU01 BigCrush suite but haven't got around to it.

As for regular interval sampling, no idea. I'd be surprised if anything shows up but it's certainly something to test ...

evanh · 2017-04-16 15:21

I've just done some other iteration runs, namely s15 and s17. s15 took 1.5 minutes while s17 took a surprising 30 minutes using the Ryzen flat out! Even the dynamic clocking settled on full rate for every core for the duration - not a bump to be seen. Larger sizes are going to explode. Whaaa!

evanh · 2017-04-16 15:33

Here's the start of the next size up, s18, showing the brick wall on both CPU load and dynamic clocking:

cgracey · 2017-04-16 15:43

Evanh,

If that ROL+ROL+SHL activity were changed to ROL+ROL+ROL, then the algorithm would become reversible, but would it still be any good?

evanh · 2017-04-16 16:42

Chip,
It don't look promising. Here's the modified generator code. Attached is the s12 max iterate findings without any filter. It's across the board rapid terminations in less than 100 120 iterations. EDIT: And the scoring is 100% instant fail on every combination. EDIT2: Lol, we have actually have 3 of the 1331 scores that just scraped passed an instant fail.

static uint64_t  next(void) {
	uint64_t  s0 = s[0];
	uint64_t  s1 = s[1];
	uint64_t  result = (s0 + s1) & ACCUM_MASK;

	s1 ^= s0;
	s[0] = (rotl( s0, CONSTANT_A ) ^ s1 ^ rotl( s1, CONSTANT_B )) & ACCUM_MASK;
	s[1] = rotl(s1, CONSTANT_C);

	return result;
}

cgracey · 2017-04-16 16:48

So, not reversible.

cgracey · 2017-04-16 17:19

s0 = S[0]
s1 = S[1]
s1 ^= s0;
s[0] = (rotl( s0, CONSTANT_A ) ^ s1 ^ rotl( s1, CONSTANT_B )) & ACCUM_MASK;
s[1] = rotl(s1, CONSTANT_C);

...Could be rewritten as...

s0 = S[0]
s1 = S[1]
S[0] = s0 rol a ^ s0 ^ s1 ^ s0 shl b ^ s1 shl b
S[1] = s0 rol c ^ s1 rol c

You can imagine all the possibilities of rotates, shifts, and XORs. I imagine the xoroshiro topology economized by using the 's1 ^= s0' operation early. This made it practical for efficient software implementation. For hardware, all that matters is that the net XOR depth is not high. Bits don't even need to be ordered. What kind of quality can be economically achieved without software-constrained thinking?

evanh · 2017-04-16 17:34

Funny you bring that up. Here's my longiter inner loop:

	do {
		count = (count + 1) & CNT_MASK;
		work = s0 ^ s1;
		s0 = (rotl( s0, constant_a ) ^ work ^ (work << constant_b)) & ACCUM_MASK;
		s1 = rotl( work, constant_c );
		if( s0 == 1 && s1 == 0 )
			break;
	} while( count );

I threw away the call to next() and almost eliminated the intermediate steps from it. EDIT: When I shrunk it down to just that loop I had to declare s0 and s1 as statics to prevent GCC from optimising that loop into oblivion.

cgracey · 2017-04-16 18:03

The compilers are too smart now. Out-thinking them is an increasingly tenuous art.

I think the quality of xoroshiro can be attributed to two things:

1) an LFSR-type iterator
2) an adder used only for output

Those are the take-aways.

The xoroshiro algorithm lies at the intersection of a few critical concepts and what can be practically computed by a CPU. If the core matters were to be optimized for gate count, it would look a lot different. But what would it be?

TonyB · 2017-04-16 18:13

TonyB wrote: »

The hardware xoroshiro generator will create the next value in its sequence every system clock and the tests you have done show that such sequences are close to random but it is unlikely and perhaps impossible for a cog to read every consecutive value.

A cog that needs a regular flow of random numbers without using its own code will be sampling the xoroshiro output at interval n, where n > 1 probably. How do the tests perform for n > 1, for example 8, 16 or 32? Or 7, 15 or 31? Will some of the 63 bits be less random as a result and if so which ones?

If anyone thinks it's worthwhile, a couple of tests with n > 1 should be enough, not too close together, say n = 16 and n = 256.

evanh · 2017-04-16 18:34

That would be just me - unless you can handle a bunch of Bash scripts?

cgracey · 2017-04-16 18:39

TonyB wrote: »

TonyB wrote: »

The hardware xoroshiro generator will create the next value in its sequence every system clock and the tests you have done show that such sequences are close to random but it is unlikely and perhaps impossible for a cog to read every consecutive value.

A cog that needs a regular flow of random numbers without using its own code will be sampling the xoroshiro output at interval n, where n > 1 probably. How do the tests perform for n > 1, for example 8, 16 or 32? Or 7, 15 or 31? Will some of the 63 bits be less random as a result and if so which ones?

If anyone thinks it's worthwhile, a couple of tests with n > 1 should be enough, not too close together, say n = 16 and n = 256.

I think the PRNG test batteries check at bit intervals, so this should be covered, already, by xoroshiro128+.

cgracey · 2017-04-16 18:47

evanh wrote: »

That would be just me - unless you can handle a bunch of Bash scripts?

I've never done Bash scripts. I don't want you to do any work on Easter, either. I'm just pondering what kind of high-quality 32-bit PRNG may be possible in hardware (without software-constrained thinking).

evanh · 2017-04-16 18:53

I've started an s12 scoring run at iteration intervals of 16. It's generating decent output so far, actually is getting some 128 MB scores. Will take another 30 minutes I think ...

evanh · 2017-04-16 18:58

cgracey wrote: »

I don't want you to do any work on Easter, either.

This is my holiday fun.

I've not had anything I can quickly sink my teeth into for a very long time.

I'm just pondering what kind of high-quality 32-bit PRNG may be possible in hardware (without software-constrained thinking).

I'm not sure there is any real shortcuts as such. Just packing it down for small space, or chopping it up for power efficiency is about all I can imagine.

cgracey · 2017-04-16 19:06

evanh wrote: »

cgracey wrote: »

I don't want you to do any work on Easter, either.

This is my holiday fun. I've not had anything I can quickly sink my teeth into for a very long time.

I'm just pondering what kind of high-quality 32-bit PRNG may be possible in hardware (without software-constrained thinking).

I'm not sure there is any real shortcuts as such. Just packing it down for small space, or chopping it up for power efficiency is about all I can imagine.

Think about this matter of rotating and shifting. The important part is that bits are changing places, not that they are necessarily all moving in a certain direction. The direction business comes from what a CPU can efficiently do.

cgracey · 2017-04-16 19:08

Imagine a few different bit remaps, instead of those rotates. That's what those SEUSSF/SEUSSF instructions do, with ~half the bits being NOT'd.

evanh · 2017-04-16 19:10

I've never written a Bash script in my life before. I'll credit PractRand documentation for getting me started with pipes and automating things. And Google/Stackexchange for giving the answers on doing stuff with Bash, and C for that matter. It's easier to ask Google than try to find the answers in the manuals.

evanh · 2017-04-16 19:13

cgracey wrote: »

Imagine a few different bit remaps, instead of those rotates. That's what those SEUSSF/SEUSSF instructions do, with ~half the bits being NOT'd.

The HDL compilers will do well on optimising fixed value rotates, it'll boil down to a fixed mapping too.

cgracey · 2017-04-16 19:13

It's very tedious to remap bits in software, so algorithm development probably hasn't proceeded in that direction. It's nothing in hardwarw, though.

cgracey · 2017-04-16 19:15

Maybe pick a cadywumpus remap for 32 bits and try using it at different rotates and shifts.

evanh · 2017-04-16 19:25

Oh, now I get it. Effectively perform all ops on a register in a single op and maybe make it even better to boot. EDIT: Of course it's anyone's guess as to what changes would be a good improvement. Best I can do is pure trial and error.

Random/LFSR on P2

Comments