Random/LFSR on P2 - Page 35 — Parallax Forums

# Random/LFSR on P2

• evanh wrote: »
cgracey wrote: »
What was this other algorithm that got to 32GB-quality in 32 bits of state? That is the best yet!
Here's my compile of it - http://forums.parallax.com/discussion/comment/1424408/#Comment_1424408

Here's the original as supplied by Scro - http://forums.parallax.com/discussion/comment/1424224/#Comment_1424224

Scro says he intentionally biased it. What that implies I'm not sure. Maybe it's a sort of cheat to beat up on PractRand and it ain't quite as good as the score suggests.
Re: Intentionally biased.

Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from an random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, although, because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.

Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.

If the PRNG state were bigger, this question wouldn't actually matter. A sequence of 2^128 128 bit random values is functionally impossible to distinguish from a random permutation of 2^128 bit values, so most people don't care anymore at that size. Even at 64 bits, a random permutation vs a random sequence is very hard to tell apart, though it can just barely be done if you use tons of time and memory working on it (PractRand doesn't bother, at the moment).
• scro,

Thanks for that explanation.

Previously I would have naively expected a "good" PRNG with 32 bits of state to visit every possible state once in a 4 giga sample space. Clearly this is not what would happen with a true random selection of 32 bits four billion times.
Is that actually better? Well... yes, mostly... Some people feel that...
Looks to me like "randomness" is in the eye of the beholder.

At the moment I'm leaning towards preferring that a PRNG with 32 bits of state does visit every possible output exactly once in 4 giga samples.

In what situation is that detrimental, apart from satisfying statistical tests like PractRand ?

Sure it means that somebody collecting enough samples can tell that you have a PRNG instead of a source of real randomness.

But presumably they can do that if there is a bias added as well.

• edited 2017-11-02 01:06
cgracey wrote: »
TonyB_ wrote: »
cgracey wrote: »
If XORO32 can be used to produce a high-quality 16-bit result per iteration, then I could enhance the instruction to do TWO iterations, get 16-bit sums from each, and substitute the high-quality 32-bit PRNG result into the next instruction's S value. Meanwhile, the double iteration would be written back to the original D in the XORO32 instruction.

So, do a XORO32 iteration and store the 32-bit state in S. Do another XORO32 iteration, then sum the two states as two parallel 16-bit adds, with the parity jiggery-pokery, to give a 32-bit PRN in D. Is that correct and is there time to do all this?

I'm all for XORO32 doing the summing itself, if possible. We'd need a XORO32 write for the seed and a XORO32 read for the PRN.

It's all in one!

Because the iteration is just a bunch of XOR's, we could do a double iteration at once, grabbing results from each to perform the separate 16-bit adds and parity computations, in order to compute the compound PRNG output for the next instruction's S value. The 'XORO32 D' will write the double-iterated value (which is separate from the PRNG output) back into D, which is just a register. To seed the PRNG, just put a non-0 value into the register being used.

XORO32 D 'iterate D and put 32-bit PRNG result into next instruction's S

Example:
```        XORO32	rnd             'iterate rnd
MOV	outb,0          'write XORO32 PRNG result to outb
```

Just wondering whether the MOV could be replaced by a "XORO32B" that does seem extra bit scrambling. In other words, XORO32 could do two different things, determined by a spare opcode bit. The same adder could be used in both parts and the new second part would not be compulsory.
```        XORO32	rnd,0           'iterate rnd
MOV	outb,0          'write XORO32 PRNG result to outb

XORO32	rnd,0           'iterate rnd
XORO32	outb,1          'write XORO32 PRNG result after some extra processing to outb
```
• scro wrote: »
Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from an random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, although, because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.

Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.

If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.
• TonyB_ wrote: »
scro wrote: »
Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from an random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, although, because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.

Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.

If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.
True. But xoroshiro32plus is (mostly) unbiased when considered as a PRNG that produces 16 bits at a time, rather than 32 bits. There's almost exactly equal numbers of each possible 16 bit value produced over the course of 2^32 16 bit outputs.
Heater. wrote: »
scro,

Thanks for that explanation.

Previously I would have naively expected a "good" PRNG with 32 bits of state to visit every possible state once in a 4 giga sample space. Clearly this is not what would happen with a true random selection of 32 bits four billion times.
Is that actually better? Well... yes, mostly... Some people feel that...
Looks to me like "randomness" is in the eye of the beholder.

At the moment I'm leaning towards preferring that a PRNG with 32 bits of state does visit every possible output exactly once in 4 giga samples.

In what situation is that detrimental, apart from satisfying statistical tests like PractRand ?

Sure it means that somebody collecting enough samples can tell that you have a PRNG instead of a source of real randomness.

But presumably they can do that if there is a bias added as well.
When considered from the context of a single run of a program using no more than the cycle length of the PRNG, the biased form is just better. It behaves like a PRNG should, in such a narrow case (assuming it was well written). For instance, duplicated 32 bit values occur at about the right frequency and spacings.

On the other hand, when things are going to be wrong anyway, the unbiased form may have things be wrong in a more intuitive way. Maybe. For instance, if you have a program that uses 2^32 32 bit bit values, and you run that program ten thousand times with different seeds (and thus different starting points in the cycle), the distribution of results will be quite wrong for both cases, but the way it is skewed for the unbiased PRNG might be easier to understand than the way it is skewed for the biased PRNG. Maybe.
• Sounds reasonable scro.

Given the speed of a Propeller and the applications it might get used for are any of these statistical subtleties worth worrying about?

There is some kind of madness in the search for the perfect PRNG.
• edited 2017-11-02 00:27
cgracey wrote: »
Because the iteration is just a bunch of XOR's, we could do a double iteration at once, grabbing results from each to perform the separate 16-bit adds and parity computations, in order to compute the compound PRNG output for the next instruction's S value. The 'XORO32 D' will write the double-iterated value (which is separate from the PRNG output) back into D, which is just a register. To seed the PRNG, just put a non-0 value into the register being used.

XORO32 D 'iterate D and put 32-bit PRNG result into next instruction's S

Example:
```        XORO32	rnd             'iterate rnd
MOV	outb,0          'write XORO32 PRNG result to outb
```

Another thought on this, simpler than the previous one.

XORO32 stores two 16-bit PRN outputs in the S field of the next instruction and the above code can be re-written as
```        XORO32	rnd
INS	D,PRNs
```

where INS is an instruction with D and S fields which does not have to be a MOV. Some of the useful instructions after an XORO32 are
```	MOV
SUB
XOR
MUL
```

MUL is D[31:0] := D[15:0] * S[15:0]. Multiplying is a good way of scrambling bits, however the PRN in S[31:16] would be wasted. To avoid this, one of the spare CZ opcode bits could select between single and double iterations, leaving the other for forward/reverse.

EDIT
This is largely about saving one instruction and not so exciting on second reading.
• TonyB_,

It's already that way. Whatever the next instruction is, its S value is the PRNG output. That next instruction can be anything.
• There could be some weird and wacky instructions after XORO32.
• Wow!
I have been following along without understanding most of it.

Thanks scro for the recent post that put it into perspective. Like heater, I presumed that a 32-bit PRNG that produced 1 of each number in a 4GB sample would be the perfect PRNG. But I now see why this isn't really.

However, if we have a truly 4GB where 1 of each number is given, does it really matter for P2?
I guess that is the real question at hand. Are the 4GB of numbers truly random though?

Chip, could a wider number of bits be used and then select just 32 bits from the wider answer?
eg use 40 bits, do the appropriate "xoro40" and select say b38..b7 as the 32 bits?
That might get over the double add which doesn't work in a clock.
• edited 2017-11-02 02:03
scro wrote: »
TonyB_ wrote: »
If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.
True. But xoroshiro32plus is (mostly) unbiased when considered as a PRNG that produces 16 bits at a time, rather than 32 bits. There's almost exactly equal numbers of each possible 16 bit value produced over the course of 2^32 16 bit outputs.

scro, what would be good 16-bit constant to multiply with the xoroshiro32+ 16-bit output to scramble the bits better? Or is it too late after the addition to improve things by multiplying?

What I don't understand about multiplying by a constant is how people handle the result. There are bound to be leading zeroes, so do they shift and truncate it?

Melissa O'Neill says a xoroshiro* would be better than xoroshiro+, right at the end of this webpage:
http://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html
• edited 2017-11-02 01:40
evanh wrote: »
100% coverage parity is the only one that is a contender. Anything else is at least 32x sub-par. I don't think there is any way to make up a second bonus bit with this approach.

Yes, parity is a one-trick pony but it's a good trick and I'm glad I found it.

It would be interesting to see how bit 1 on its own scores in tests sometime and whether it is that much worse than bit 2, say.

• Cluso99 wrote: »
Wow!
Chip, could a wider number of bits be used and then select just 32 bits from the wider answer?
eg use 40 bits, do the appropriate "xoro40" and select say b38..b7 as the 32 bits?
That might get over the double add which doesn't work in a clock.

xoroshiro40+ needs 40 state bits and they wouldn't fit into one register. Also the output would be only 20 bits.
• edited 2017-11-02 14:54
TonyB_ wrote: »
scro wrote: »
TonyB_ wrote: »
If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.
True. But xoroshiro32plus is (mostly) unbiased when considered as a PRNG that produces 16 bits at a time, rather than 32 bits. There's almost exactly equal numbers of each possible 16 bit value produced over the course of 2^32 16 bit outputs.

scro, what would be good 16-bit constant to multiply with the xoroshiro32+ 16-bit output to scramble the bits better? Or is it too late after the addition to improve things by multiplying?

What I don't understand about multiplying by a constant is how people handle the result. There are bound to be leading zeroes, so do they shift and truncate it?

Melissa O'Neill says a xoroshiro* would be better than xoroshiro+, right at the end of this webpage:
http://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html
Um... it's not too late, though it might not be ideal either. We're talking about a 1-to-1 hash of 16 bit values using multiplication? I'd say, generally, for a multiplier constant like that, you want an odd number that, when expressed in binary, has no obvious repeating patterns. Multiplication is very good mixing, but tends to be too left-wards in the flow of information (lower bits of input effect higher bits of output, but not the reverse), so it's often paired with a right-shift and xor (edit: as in x *= K; x ^= x >> 8; (edit2: corrected shift value from 16 to 8, this was a 16 bit multiplication not 32)). Multiplication is not, I think, cost-effective from a hardware perspective though, I think multiplication tends to be kind of expensive compared to the alternatives, unless you just happen to have a multiplication unit sitting around that's not doing anything worthwhile. I think you could probably fit some reasonable PRNGs or even CSPRNGs in a comparable amount of die area to a 16x16 multiplication, and have the result run faster too.
• edited 2017-11-04 12:07
More on running xoroshiro32+ backwards. I'm sure some people would find this feature extremely useful. The ability to go backwards at any time means there would be no need to store any previous states apart from the current one. This could be really handy during long simulations, or in games which could have random starting levels, but I think it's so cool we should do it anyway.

Here's the reverse algorithm , the first piece of Verilog I have ever written:
```
//	xoroshiro forward & reverse algorithms
//	for triplet [14,2,7]
//	s0 = state 0, s1 = state 1

// s0->s1
//	x   =  s0h XOR s0l
//	s1l = {s0l ROL 14} XOR {x SHL 2} XOR x
//	s1h =  x ROL 7

// s0<-s1
//	x   =  s1h ROR 7 =  s0h XOR s0l
//	s0l = {s1l XOR {x SHL 2} XOR x} ROR 14
//	s0h =  s0l XOR x

// reverse algorithm	{s1h,s1l} = {d[31:16],d[15:0]}

wire [15:0] xorr32x	= {d[22:16], d[31:23]};		 	// x = s1h ROR 7 = s0h ^ s0l
wire [15:0] xorr32xx	=  d[15:0] ^
{xorr32x[13:0], 2'b0} ^ xorr32x;	// s1l ^ {x SHL 2} ^ x
wire [31:0] xorr32	= {xorr32x ^
{xorr32xx[13:0], xorr32xx[15:14]},	// s0h ^ s0l ^ s0l = s0h,
{xorr32xx[13:0], xorr32xx[15:14]}};	// {s1l ^ {x SHL 2} ^ x} ROR 14 = s0l

```

Here's how the combined double iteration forward and reverse code might look:
```
// needs new xoro32dir bit from opcode: 0 = forward, 1 = reverse

wire [15:0] xoro32z	= d[31:16] ^ d[15:0];			// first forward iteration
wire [31:0] xoro32y	= {xoro32z[8:0], xoro32z[15:9],
{d[1:0], d[15:2]} ^
{xoro32z[13:0], 2'b0} ^ xoro32z};

wire [15:0] xoro32x	= xoro32y[31:16] ^ xoro32y[15:0];	// second forward iteration
wire [31:0] xoro32	= {xoro32x[8:0], xoro32x[15:9],		// xoro32 = forward result
{xoro32y[1:0], xoro32y[15:2]} ^
{xoro32x[13:0], 2'b0} ^ xoro32x};

wire [15:0] xorr32z	= {d[22:16], d[31:23]};		 	// first reverse iteration
wire [15:0] xorr32zz	=  d[15:0] ^
{xorr32z[13:0], 2'b0} ^ xorr32z;
wire [31:0] xorr32y	= {xorr32z ^
{xorr32zz[13:0], xorr32zz[15:14]},
{xorr32zz[13:0], xorr32zz[15:14]}};

wire [15:0] xorr32x	= {xorr32y[22:16], xorr32y[31:23]}	// second reverse iteration
wire [15:0] xorr32xx	=  xorr32y[15:0] ^
{xorr32x[13:0], 2'b0} ^ xorr32x;
wire [31:0] xorr32	= {xorr32x ^				// xorr32 = reverse result
{xorr32xx[13:0], xorr32xx[15:14]},
{xorr32xx[13:0], xorr32xx[15:14]}};

wire	    xoro32d	= xoro32dir ? xorr32 : xoro32;		// xoro32d = d result

wire [16:0] xoro32a	= xoro32dir ? 				// first iteration sum
xorr32y[31:16] + xorr32y[15:0]
: xoro32y[31:16] + xoro32y[15:0];

wire [16:0] xoro32b	= xoro32dir ? 				// second iteration sum
xorr32[31:16] + xorr32[15:0]
: xoro32[31:16] + xoro32[15:0];

wire [31:0] xoro32r	= {xoro32b[15:1], ^xoro32b,		// xoro32r = prng result
xoro32a[15:1], ^xoro32a};

```

The XORO32 instruction would need a second operand to select the direction:
```		XORO32	D,#0		' forward
XORO32	D,#1		' reverse
```

but the new xoro32dir signal could be a constant = 1 to test things work in reverse. To check the output, the following file has states and sums for 2048 single iterations backwards and forwards from state \$0000_0001: http://forums.parallax.com/discussion/download/121567/xoro.txt
• I think it would be better to just have:

XORO32F
XORO32R

...for forward and reverse.
• Please do. It's common sense. Easy peasy.

Very cool, BTW. You guys did a lot with PRNGs here. Can't wait to see what people end up doing with this feature.
• edited 2017-11-04 20:31
There is an issue with XORO32 as it is now. The double iteration means that even states are written back to D but odd ones are not. The only way XORO32 can tell you the state immediately after your initial seed is by iterating it 2^31 or 2147483648 times. Combining two 16-bit PRNs into one 32-bit value is great but it halves the period and some users will want only 16 bits, always in the low word, and the full 2^32-1 period.

In a nutshell, we have lost the ability to single-step through the states. However, if we implement forward and reverse, then we could get single-stepping back for very little extra logic. Here is how the modified Verilog code might look:
```
// XORO32 code that outputs 16-bit or 32-bit PRN values, both forward and reverse
// 16-bit PRN in low word with zeroes in high word and second iteration discarded
// new state written is next one for 16-bit PRN or next but one for 32-bit PRN

// needs new xoro32ctl[1:0] using CZ opcode bits, similar to SETNIB/GETNIB/etc.

// xoroctl[1:0]	= 00 for 16-bit PRN, forward		XORO16F
//		  01 for 32-bit PRN, forward		XORO32F
//		  10 for 16-bit PRN, reverse		XORO16R
//		  11 for 32-bit PRN, reverse		XORO32R

// wire [1:0] xoroctl	= ?

wire [15:0] xoro32z	= d[31:16] ^ d[15:0];
wire [31:0] xoro32y	= {xoro32z[8:0], xoro32z[15:9],		// first forward state
{d[1:0], d[15:2]} ^
{xoro32z[13:0], 2'b0} ^ xoro32z};

wire [15:0] xoro32x	= xoro32y[31:16] ^ xoro32y[15:0];
wire [31:0] xoro32	= {xoro32x[8:0], xoro32x[15:9],		// second forward state
{xoro32y[1:0], xoro32y[15:2]} ^
{xoro32x[13:0], 2'b0} ^ xoro32x};

wire [15:0] xorr32z	= {d[22:16], d[31:23]};
wire [15:0] xorr32zz	=  d[15:0] ^
{xorr32z[13:0], 2'b0} ^ xorr32z;
wire [31:0] xorr32y	= {xorr32z ^		 		// first reverse state
{xorr32zz[13:0], xorr32zz[15:14]},
{xorr32zz[13:0], xorr32zz[15:14]}};

wire [15:0] xorr32x	= {xorr32y[22:16], xorr32y[31:23]}
wire [15:0] xorr32xx	=  xorr32y[15:0] ^
{xorr32x[13:0], 2'b0} ^ xorr32x;
wire [31:0] xorr32	= {xorr32x ^				// second reverse state
{xorr32xx[13:0], xorr32xx[15:14]},
{xorr32xx[13:0], xorr32xx[15:14]}};

wire	    xoro32d	= xoro32ctl ?				// xoro32d = d state result
xorr32
: xorr32y
: xoro32
: xoro32y;

wire [16:0] xoro32a	= xoro32ctl ? 			// first iteration sum
xorr32y[31:16] + xorr32y[15:0]
: xoro32y[31:16] + xoro32y[15:0];

wire [16:0] xoro32b	= xoro32ctl ? 				// second iteration sum
xorr32[31:16] + xorr32[15:0]
: 16'b0
: xoro32[31:16] + xoro32[15:0]
: 16'b0;

wire [31:0] xoro32r	= {xoro32b[15:1], ^xoro32b,		// xoro32r = prng result
xoro32a[15:1], ^xoro32a};

```

Please bear in mind I'm on day two as a Verilog coder.

The four different options are selected by the otherwise unused CZ opcode bits, as already done by several other instructions.
cgracey wrote: »
I think it would be better to just have:

XORO32F
XORO32R

...for forward and reverse.

We could expand this to

XORO16F
XORO16R
XORO32F
XORO32R

These mnenomics make it clear what the operation is. Before anyone objects to using XORO16 for what we have been calling xoroshiro32+, Sebastiano Vigna says its proper name should be xoroshiro32+.

Chip, I've spent hours on this for several reasons (a) because I think XORO32 would be better with a reverse gear, (b) so that you could see this working for yourself, (c) to save you the time and trouble of implementing the reverse algorithm, (d) to bring back single-stepping and (e) to make the P2 the best it can be.
• Good work, TonyB_!

But don't we actually have a 2^32-1 period, already. At 2^31 iterations, don't we step into the 'odd' domain and then go another 2^31-1 iterations before arriving at our original value?
• cgracey wrote: »
Good work, TonyB_!

But don't we actually have a 2^32-1 period, already. At 2^31 iterations, don't we step into the 'odd' domain and then go another 2^31-1 iterations before arriving at our original value?
Yes, technically. After 2^31-1, you're repeating the same bytes, but what used to show up in the upper 16 bits will now show up in the lower 16 bits instead, and vise versa, I think. Technically that means your period is a full 2^32-1 calls, and for some purposes that can actually matter. However, if looked at from a byte-stream perspective, the period remains approximately 2^31-1 outputs - and for some purposes that counts for a lot more than the technical period.
• edited 2017-11-04 22:56
Yes, it would repeat the sequence, but with results shifted by 16 bits, so that what used two appear as a unified long, now appears as two words in consecutive longs. So, the words would swap high/low positions, but be spread across two longs.

Scro, what do you think about the utility of having XORO32 be reversible?
• edited 2017-11-17 13:36
I vote for reversible and potatohead likes it too. If anyone else thinks it would be good for XORO32 to run forwards and backwards please say so now.

Is the P2 going to be an ONC18: 0.18 µm CMOS Standard Cell and will it have individual gates or FPGA-like LEs? I'm wondering whether a 4-to-1 mux would be any bigger than a 2-to-1 mux. If the answer is no, or the difference is insignificant, then reversibility gives you single-stepping for free and single-stepping gives you reversibility for free.

Even if you always use 32-bit PRN data, what if you want to have two sequences running, offset by one state? A XORO32 that only does double-stepping cannot tell you the next state unless you go through the whole period. A XORO16 would tell you in one instruction, then you could use that state in XORO32 thereafter.

Without reversibility, states would need to be saved if you ever want to stop and restart further back. Even then, you wouldn't be able to run a simulation, say, backwards, just jump back so far depending on how much RAM is available for storing states and then run it forwards again.

I get the feeling Chip doesn't really believe this thing can run backwards. There's only one way to find out!
• edited 2017-11-05 01:01
Am I missing a point? What possible use is running a PRNG backwards?

Generally in a simulation we have a state S and we evolve a new state from the old one plus some random number(s) like so:

Set intitial state S0
Get random number N0. Calculate new simulation state S1 as a function S1 = f(S0, N0)
Get random number N1. Calculate new simulation state S2 as a function S2 = f(S1, N1)
Get random number N2. Calculate new simulation state S3 as a function S3 = f(S2, N2)
Get random number N3. Calculate new simulation state S4 as a function S4 = f(S3, N3)
...

Now, if we reverse our PRNG we continue as:

Get random number N2. Calculate new simulation state S5 as a function S5 = f(S4, N2)
Get random number N1. Calculate new simulation state S6 as a function S6 = f(S5, N1)
Get random number N0. Calculate new simulation state S7 as a function S7 = f(S6, N0)
...

It's pretty unlikely that our simulation states are going to run backwards just because the PRNG does. S5 will not be S3, S6 will not be S2 and so on.

Once we have stirred up our simulation state with the PRNG it will not magically get un-stirred by running the PRNG backwards.
• cgracey wrote: »
Yes, it would repeat the sequence, but with results shifted by 16 bits, so that what used two appear as a unified long, now appears as two words in consecutive longs. So, the words would swap high/low positions, but be spread across two longs.

Scro, what do you think about the utility of having XORO32 be reversible?

Mostly pointless. It lets you rewind faster than you could already, but what are you going to use it for? Use cases for backwards PRNG stepping are pretty rare. Plus, since you're limiting it to 32 bits of state, I'm suspecting that the state is mapped to local accessible memory anyway, so I think you'll already have fast state save/restore. 90% of use cases for backwards PRNG stepping can be done about as easily (or better) using fast state save/restore instead.

Can anyone come up with a use case that *isn't* easily & efficiently handled with fast state save/restore instead?
• edited 2017-11-05 01:11
cgracey wrote: »
evanh wrote: »
cgracey wrote: »
But it does score 32 times better, right?
Ya, 32GB score is 8G-Iterations. PractRand is detecting the full period repeating, so it's a perfect score for a 32 bit state with 32 bit output.
The showstopper is that we don't have enough time in a clock cycle to do two 32-bit additions, one after the other.
Chip,
I'm thinking there is the whole two execution clock cycles available. As per the original Xoroshiro128+ source code, the summing can be in parallel with the iterating. EDIT: In fact, Scro even arranges his algorithm the same way.
• Can anyone come up with a use case that *isn't* easily & efficiently handled with fast state save/restore instead?

I can't, other than procedural data generation. Like game world, open, roaming. Walking one way, vs walking the opposite way or any other way and seeing the environment present consistently.

State / restore would do this, but would do it at a much higher cost, in that there would need to be state markers in various places to know what the state to return to actually was.

With a forward / reverse, one can just run the same computation and generate routines without regard to state at all, other than the current one.
• Running forwards all the time gives you one sequence.
Running backwards all the time gives you another sequence.
Isn't that the real utility? In effect two PRNGs in one.

I'd be more worried about knowing only half the states if single-stepping is not possible.
• edited 2017-11-05 02:12
I'd be happy if anyone could give me even one case where being able to run a PRNG backwards is useful at all !

With a forward / reverse, one can just run the same computation and generate routines without regard to state at all, other than the current one.

It can work if your state is only a function of the PRNG value. Then if you reverse the PRNG you reverse the state. But that is never true. As I tried to show in my post above.

Let's think of a really simple simulation. A simulated robot that makes a left or right turn at every iteration and takes one step forwards. The state here is a position on an X, Y plane. The left or right is determined by a PRNG.

If the PRNG produces steps left(L) and right(R), say "LRLLR", then the robot ends up at some new X,Y coordinate.

If we now reverse the PRNG we get "LLRL".

The robot will not end up where it started.

I don't see how you can run a simulation backwards just by using a PRNG does.

So far, I see no use for a reversible PRNG at all.

• edited 2017-11-05 02:07
evanh wrote: »
cgracey wrote: »
evanh wrote: »
cgracey wrote: »
But it does score 32 times better, right?
Ya, 32GB score is 8G-Iterations. PractRand is detecting the full period repeating, so it's a perfect score for a 32 bit state with 32 bit output.
The showstopper is that we don't have enough time in a clock cycle to do two 32-bit additions, one after the other.
Chip,
I'm thinking there is the whole two execution clock cycles available. As per the original Xoroshiro128+ source code, the summing can be in parallel with the iterating. EDIT: In fact, Scro even arranges his algorithm the same way.

But the second add uses the result of the first add, doesn't it?
scro wrote: »
Here's an alternative (expressed in C, but it's very easy to translate to verilog or most asms):
```unsigned int rng_state = 1;
unsigned int get_rng_output() {
unsigned int tmp = state;
state ^= state << 13;
state ^= state >> 17;
state ^= state << 5;
tmp += (tmp << 18) | (tmp >> 14);//barrel shift and add
tmp += ((tmp << 7) | (tmp >> 25)) ^ ((tmp << 11) | (tmp >> 21));//two barrel shifts, an xor, and an add
return tmp ^ state;
}
```

scro, the three state xors are Xorshift, I believe. Did you invent the hashing function after it and does it have a name?
• edited 2017-11-05 02:30
TonyB_ wrote: »
But the second add uses the result of the first add, doesn't it?
Yep, Prop2 execution is two clocks long, that could execute in the second cycle. Unless, that is, the subsequent S operand replacement occurs at that stage. I presume that'll depend on the addressing modes allowed - If the random number is solely an immediate operand then both execution cycles are available but if the random number is allowed to be a register address then only the first execution cycle is available.