Random/LFSR on P2

TonyB_ · 2017-10-28 23:09

evanh wrote: »

In this case it resolves to a constant of 2. The outer for() loop only does two iterations.

And normally you'd have {} brackets following a for() but since there isn't the subsequent indentation is a hint that that for() loop only contains the singular inner for() loop. Ie; s[0] = s0; s[1] = s1; are executed only after all loops are completed.

Thanks Evan, but still enough mumbo-jumbo for me not to understand. Plain old pseudo-code would help.

We'd need new bit patterns for xoroshiro32+ based on the characteristic polynomial or some such.

cgracey · 2017-10-28 23:11

With these randomness test batteries, we can do PRNG development without understanding the theory behind it. Kind of works for layout, too.

evanh · 2017-10-28 23:25

cgracey wrote: »

With these randomness test batteries, we can do PRNG development without understanding the theory behind it. Kind of works for layout, too.

Totally. We were dead in the water without PractRand. It tucks the science away so mere mortals can do a little engineering.

TonyB_ · 2017-10-28 23:27

The idea of the jump function is to find well-spaced states so that lots of XORO32 sequences could be running that don't overlap with each other, for a while at least.

To do this properly requires deeper understanding but shall we just brute force it instead? Start with state = 1, iterate 64K times and save that state, rinse and repeat 64K times. End result is file with 64K longs.

Heater. · 2017-10-28 23:30

Only on thing bothers me.

Published PRNGs, from the likes of Marsaglia, come with statements about their cycle length. (2 ^ 64) - 1 or whatever.

I have no idea how they come to make such statements but I naively assume they have some mathematical proof to back it up.

There is no way our batteries of tests can determine the cycle length of a PRNG.

Are there any such statements and proofs for the PRNGs that have been presented here?

evanh · 2017-10-28 23:31

TonyB_ wrote: »

We'd need new bit patterns for xoroshiro32+ based on the characteristic polynomial or some such.

Ya, I'm afraid that part is in need of real knowledge and math skills. No amount of understanding of that C source code is going to magic up the needed key for your desired Xoroshiro48+. Or in our case we'd now be wanting the key for Xoroshiro48+p.

EDIT: Struck out my brain fart. Full period candidates aren't affected by those extensions.

TonyB_ · 2017-10-28 23:36

Heater. wrote: »

Only on thing bothers me.

Published PRNGs, from the likes of Marsaglia, come with statements about their cycle length. (2 ^ 64) - 1 or whatever.

I have no idea how they come to make such statements but I naively assume they have some mathematical proof to back it up.

There is no way our batteries of tests can determine the cycle length of a PRNG.

Are there any such statements and proofs for the PRNGs that have been presented here?

xoroshiro32+ has period of 2^32-1, proven by several people here by iterating until initial state repeats.

Heater. · 2017-10-28 23:47

Hmm...OK. If the state space is small enough proof of the cycle length can be brute forced by iteration.

Why do we need such a small PRNG?

Even a simple thing like JKISS32 has a cycle length of ~2^121.

There is no way to prove that by brute force.

evanh · 2017-10-29 00:16

Heater. wrote: »

There is no way our batteries of tests can determine the cycle length of a PRNG.

Are you going senile?! - http://forums.parallax.com/discussion/comment/1407742/#Comment_1407742

Xoroshiro32 is only 4G iterations for best case full period. Chip tested every triplet combination on a Prop1 and found 84 triplet combinations reached full period. And I later double checked his finding when I got with the programme.

evanh · 2017-10-29 00:17

Since then, all size 16 quality tests have been conducted on that set of 84 full period triplets.

AJL · 2017-10-29 00:32

potatohead wrote: »

Great, interesting work. I've read with interest.

I like that you have agreement across a PC, P2 and a Z80. Had I the time, I would love to add 6809, 6502 to that list.

I'll need to tweak my 6502 implementation of XORO32 to address the changes made recently.

Regards,
Anthony.

TonyB_ · 2017-10-29 01:05

AJL wrote: »

potatohead wrote: »

Great, interesting work. I've read with interest.

I like that you have agreement across a PC, P2 and a Z80. Had I the time, I would love to add 6809, 6502 to that list.

I'll need to tweak my 6502 implementation of XORO32 to address the changes made recently.

Regards,
Anthony.

Is the convention now to add "p" to the name for the new parity method, hence xoroshiro32+p?

xoroshiro32+p is dead easy on the Z80, but adds 25% to the execution time making it almost the same as xoroshiro34+.

Best xoroshiro32+p has better pair of top and bottom bytes overall than best xoroshiro34+ (ignoring bit 0) with same word scores so might as well use xoroshiro32+p.

evanh · 2017-10-29 01:30

TonyB_ wrote: »

Is the convention now to add "p" to the name for the new parity method, hence xoroshiro32+p?

I'd come to realise the naming often descriptively reflects the algorithm. In this case I'd noticed the "+" appeared to represent the non-iterator summing, so I started using p+ when we were doing the parity pre-summing. I only just flipped it around to +p because the parity is now post-summing.

scro · 2017-10-29 01:56

Any particular reason for the focus on LFSRs and LFibs and the like? I know in the PractRand docs I included a section on what PRNGs to use on exotic hardware, including this snippet:

7. If you need to minimize the number of transistors required to implement the
RNG in hardware:
Trivium, sfc*, and jsf* can all be implemented with very little hardware.
Of those, Trivium offers the highest quality and is the only one of those
to possess any degree of cryptographic security.

evanh · 2017-10-29 02:15

Welcome scro,
I gather Tony has been bouncing lots of forums.

Xoroshiro is quite good for hardware. One requirement on our efforts here is that all state data must fit in 32 bits.

evanh · 2017-10-29 02:30

PS: There isn't any cryptographic objectives here but as well as the Xoroshiro32 based instruction that we've spun, we also have a totally hardware based full spec'd Xoroshiro128+ as a free running iterator per clock for a more robust data source.

msrobots · 2017-10-29 04:50

Besides the awesomeness of this thread and the goal to produce a most repeatable random set of 32^2-1 numbers, I still stumble over this repeatable.

A long time ago, in Germany last century, we had 'Geldspielautomaten' (like one armed-bandits) just in every bar. Just worked with coins, but still lucrative. There was a market for ROM copies and software to find the actual state of the machine, while just playing a couple of hit and miss games. the you had a sequence, found the sequence in the file and from that moment you owned the thing. Full predictable. Full repeatable.

Same for those ATARI/COMMODORE things, RANDOM was not RANDOM but predictable. Sure you can trick with new seeding at keystrokes, or time BUT the player would simply repeat them keystrokes timely and run the same sequence again.

predictable, not RANDOM. And if predictable does the pseudo randomness even matter?

As far as I understand the random generator in the P1 uses some free running -there I got lost - thing but did produce a not repeatable sequence of numbers. That I would call RANDOM.

still confused,

Mike
still confused.

Heater. · 2017-10-29 05:51

evanh,

Are you going senile?!

Quite possibly. I was overlooking the detail that Xoroshiro32 only has 4G of states. Unlike things like JKISS32 that have far more.

I hope my mind is still somewhat intact by the time the P2 comes out.

Heater. · 2017-10-29 06:07

msrobots,

Sometimes repeatability is required in a "random" number generator and actual randomness is not. For example when writing simulations it might be nice to re-run the thing and demonstrate the exact same outcomes. It's handy when debugging and testing for regressions in simulation code. One might rewrite a simulation and want to check the new implementation is correct.

The major need for simple PRNGs is speed. Simulations often need billions/trillions of random numbers, perhaps spread over many parallel processes. Turns out it's hard to get a lot of real random numbers quickly.

Certainly one should not be using such simple sources of randomness in games machines where money is involved. There are many stories of such systems being gamed. For example early online poker games got stung by this http://www.lauradhamilton.com/random-lessons-online-poker-exploit

evanh · 2017-10-29 07:14

Certainly the limited period is another factor in making this approach viable. If the period was too large I think Chip wouldn't have engaged in the first place.

scro · 2017-10-29 08:21

evanh wrote: »

Welcome scro,
I gather Tony has been bouncing lots of forums.

Xoroshiro is quite good for hardware. One requirement on our efforts here is that all state data must fit in 32 bits.

I actually found the discussion via google.

A lot of algorithms are quite good for hardware. Although... 32 bits of state?!? That I did not expect. That puts a pretty low hard limit on PRNG quality, and almost requires a single-cycle algorithm. Hm... have you considered hybrid options?

If, instead of having a 32 bit PRNG opcode, you moved the PRNG state in to software and added a decent 32 bit (reversible) hash opcode in hardware, that would make a software PRNG with 32 bits of state almost as fast as a PRNG opcode (it can be done with just two upcodes if I understand your ISA correctly - an increment opcode and a hash opcode). By moving the state requirement from hardware to software, that would allow software that didn't need a PRNG to have an extra 32 bits, while software that needed a higher quality PRNG wouldn't have the overhead of keeping a low quality one around in addition. Furthermore, the same hash opcode would be useful for other things too (hashtables, higher quality PRNGs, custom checksum implementations, that kind of thing). Though I have no clue know how common things like that are on microcontrollers for robots.

evanh · 2017-10-29 11:38

scro wrote: »

evanh wrote: »

Welcome scro,
I gather Tony has been bouncing lots of forums.

Xoroshiro is quite good for hardware. One requirement on our efforts here is that all state data must fit in 32 bits.

I actually found the discussion via google.

Excellent. Chip speculated this may happen at some stage.

Just to make sure you're following where we're at, here's Chip's latest ALU wiring of the XORO32 instruction that's currently in the Prop2 instruction set - http://forums.parallax.com/discussion/comment/1423925/#Comment_1423925

This uses a 32-bit general processor register for state store. On top of that, it does a double iteration and alters the S operand of the following instruction to insert the concatenated 32-bit result.

If, instead of having a 32 bit PRNG opcode, you moved the PRNG state in to software and added a decent 32 bit (reversible) hash opcode in hardware, that would make a software PRNG with 32 bits of state almost as fast as a PRNG opcode (it can be done with just two upcodes if I understand your ISA correctly - an increment opcode and a hash opcode). By moving the state requirement from hardware to software, that would allow software that didn't need a PRNG to have an extra 32 bits, while software that needed a higher quality PRNG wouldn't have the overhead of keeping a low quality one around in addition. Furthermore, the same hash opcode would be useful for other things too (hashtables, higher quality PRNGs, custom checksum implementations, that kind of thing). Though I have no clue know how common things like that are on microcontrollers for robots.

I guess the question about that approach is how flexible can a hash instruction be? There is a number of convoluted specifics with the bit shuffling of the average random generator. Eg: Producing a parity bit is horribly slow if there isn't any hardware in the ALU to help out. It's still just a single function that can't be adjusted.

The XORO32 has 3 calibrated constants for the amounts to shift/rotate by. The 3 constants differ for each state size. How to make that adjustable but generalised?

I can see a rotate with bounds being generally useful beyond RNGs.

EDIT: An example might be helpful, to get me following you.

TonyB_ · 2017-10-29 14:47

evanh wrote: »

In this case it resolves to a constant of 2. The outer for() loop only does two iterations.

And normally you'd have {} brackets following a for() but since there isn't the subsequent indentation is a hint that that for() loop only contains the singular inner for() loop. Ie; s[0] = s0; s[1] = s1; are executed only after all loops are completed.

So only two iterations of next(); ?

evanh · 2017-10-29 21:16

Two times this:

		for(int b = 0; b < 64; b++) {
			if (JUMP[i] & 1ULL << b) {
				s0 ^= s[0];
				s1 ^= s[1];
			}
			next();
		}

So, 128 iterations of next(). 64 iterations with i = 0, and 64 iterations with i = 1.

Between i and b, it tests each bit of the key array once. If the key bit tested is a 1 then the local state (s0, s1) is xor'd with the iterated state (s[0], s[1]) and stored back in the local state.

evanh · 2017-10-29 21:20

And finally, after all those xor's of the local state, it is copied back over the iterated state. Erasing whatever the iterator had.

	s[0] = s0;
	s[1] = s1;

evanh · 2017-10-29 21:49

cgracey wrote: »

wire [16:0] xoro32a	= xoro32y[31:16] + xoro32y[15:0];	// first iteration sum
wire [16:0] xoro32b	= xoro32[31:16] + xoro32[15:0];		// second iteration sum

wire [31:0] xoro32r	= {xoro32b[15:1], ^xoro32b,		// xoro32r = prng result
			   xoro32a[15:1], ^xoro32a};

Chip,
I gather the prefixed ^ operator on ^xoro32b, and likewise ^xoro32a, there on the last two lines, xor's all bits of the group into a single bit?

cgracey · 2017-10-29 22:04

evanh wrote: »
cgracey wrote: »
wire [16:0] xoro32a	= xoro32y[31:16] + xoro32y[15:0];	// first iteration sum
wire [16:0] xoro32b	= xoro32[31:16] + xoro32[15:0];		// second iteration sum

wire [31:0] xoro32r	= {xoro32b[15:1], ^xoro32b,		// xoro32r = prng result
			   xoro32a[15:1], ^xoro32a};
Chip,
I gather the prefixed ^ operator on ^xoro32b, and likewise ^xoro32a, there on the last two lines, xor's all bits of the group into a single bit?

That is correct.

TonyB_ · 2017-10-29 22:14

evanh wrote: »
Two times this:
		for(int b = 0; b < 64; b++) {
			if (JUMP[i] & 1ULL << b) {
				s0 ^= s[0];
				s1 ^= s[1];
			}
			next();
		}
So, 128 iterations of next(). 64 iterations with i = 0, and 64 iterations with i = 1.

Between i and b, it tests each bit of the key array once. If the key bit tested is a 1 then the local state (s0, s1) is xor'd with the iterated state (s[0], s[1]) and stored back in the local state.

Thanks Evan. xoroshiro32+ jump function would have only 32 bits in the bit pattern.

As you might already know, Sebastiano Vigna (co-inventor of xoroshiro+) thinks PractRand is not specific enough and we need to test with BigCrush as well.

TonyB_ · 2017-10-29 22:43

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

Ia 14,2,7 still the optimal setting for how XORO32 is now working?

Yes, accepting the 128M "Byte1" score, this is the best triplet for current algorithm. All other byte variants of this triplet came out at 512M or 1G.

Okay. Great!

So, we can consider the improved XORO32 instruction to be optimized and DONE.

I think it is quite complete now and it has maybe realized its full potential as a seedable/steppable 32-bit PRNG for software use. It couldn't be higher quality for 32 bits and it couldn't be any easier to use.

Would it be useful if XORO32 could modify Z, to report an invalid seed? Users could give it a random seed to get started and check after the MOV. "Shoot first, ask questions later" programming.

scro · 2017-10-29 23:33

For a PRNG suited for this:
32 bits of state necessitates a minimum cycle length close to the full possible statespace, ruling out most chaotic state transition functions. Lack of easy multiplication rules out most forms of LCGs. That leaves LFSRs, the very worst of LCGs, counters, and a few more obscure possibilities along the same lines. LFSRs are at least as good as any other possibility there.
Of course, LFSRs on their own aren't very good, so an output hashing function is also necessary.
You're using a reduced-size variant of xoroshiro128plus at the moment, which is a decent LFSR (as far as software LFSRs go anyway...) coupled with a bad output hashing function. You're double-pumping the LFSR to get a big enough output word atm, which gives you an effective cycle length of almost 8 gigabytes instead of the almost 16 gigabytes it would otherwise be.
Here's an alternative (expressed in C, but it's very easy to translate to verilog or most asms):

unsigned int rng_state = 1;
unsigned int get_rng_output() {
	unsigned int tmp = state;
	state ^= state << 13;
	state ^= state >> 17;
	state ^= state << 5;
	tmp += (tmp << 18) | (tmp >> 14);//barrel shift and add
	tmp += ((tmp << 7) | (tmp >> 25)) ^ ((tmp << 11) | (tmp >> 21));//two barrel shifts, an xor, and an add
	return tmp ^ state;
}

That has a cycle length of almost 16 gigabytes (because it uses a full-sized output function, so it doesn't have to be double-pumped), and fails PractRand at 32 GB. In order to get that far, the output hashing quality had to be increased, plus it also had to be deliberately biased. If the output hashing was 1-to-1, no matter how high the quality was, PractRand would notice the results getting bizarrely regular as it got towards a significant fraction of a cycle, resulting in the FPF:all subtest producing a p-value too near 1.0 at a test length of 1 GB - yet another hazard to having just 32 bits of state.
That's about the best you can do while being efficient in both hardware and software and limiting yourself to 32 bits of state. Using the state transition function from your xoroshiro32+ might improve it slightly. If you're willing to forgo efficient software implementations you could probably manage yet higher quality LFSRs, though you'd still need output hashing to get decent results on statistical tests no matter how high the LFSR quality got since some problems are simply fundamental to LFSRs.

For doing a hardware hash function instead of a hardware PRNG:
I hadn't gotten to really thinking of what algorithm might be appropriate, but... just naming possibilities off the top of my head... how about something like (C-like pseudocode here):

x = (x ^ rotate32(x, K1) ^ rotate32(x, K2));
x += x << K0;
x = permutate_bit_order(x) + K3;
//now do it all a second time:
x = (x ^ rotate32(x, K1) ^ rotate32(x, K2));
x += x << K0;
x = permutate_bit_order(x) + K3;

It's not great, but adequate for typical uses. Probably better versions could be thought up with a little more work or research, though I'm very fuzzy on what your limitations are. K0, K1, and K2 are fairly arbitrary constants in the range of 1 to 31, while K3 is a full 32 bit integer constant, possibly any value would do, maybe even zero. The constants for the 2nd half could be made different from those for the 1st half to improve quality minutely.
That's all for a unary hash function though. Better would be a binary hash function. One that takes one two inputs hash(A,B), and ideally guarantees 1-to-1 for one of its inputs if the value of the other is known. A simple implementation would be to take the function above, and have some of its constants modulated by a 2nd parameter. K3 could reasonably have 29 or so bits vary with the 2nd parameter, K0/K1/K2 could each have 1 bit vary with the parameter, to all add up to 32 bits of the 2nd parameter. I'm not sure if that scheme would be adequate for all possible values of the 2nd parameter or not, investigation would be required.

Anyway, with such a hash function exposed in C as an intrinsic, a PRNG function could be as simple as:

unsigned int rng_state;
unsigned int get_rng_output() {return intrinsic_unary_hash(rng_state++);} // if the hash is unary, or...
unsigned int get_rng_output() {return intrinsic_binary_hash(rng_state++, 0);} // if the hash is binary

If the hash was halfway decent that wouldn't fail PractRand until about 1 GB of output.
Then if they wanted a higher quality of PRNG, numerous possibilities would be both fast and reasonable quality:

unsigned int rng_state_A, rng_state_B;
unsigned int get_rng_output() {return rng_state_A = intrinsic_unary_hash(rng_state_A + rng_state_B++);}

That might pass PractRand out to a few hundred GB for the example hash function above at a guess, or a lot more if a higher quality hash function could be used.
Plus such an intrinsic would be useful in hashtable implementations and other such things.
Sorry all my examples are in C, but I can write C much quicker (and with a much lower error rate) than a I can write Verilog.

Random/LFSR on P2

Comments