Can anyone come up with a use case that *isn't* easily & efficiently handled with fast state save/restore instead?
I can't, other than procedural data generation. Like an open, roaming game world: walking one way vs. walking the opposite way, or any other way, and seeing the environment present consistently.
Save/restore would do this, but at a much higher cost, in that there would need to be state markers in various places to know what state to return to.
With a forward / reverse, one can just run the same computation and generation routines without regard to state at all, other than the current one.
...I'm not following. I've written a few procedural game world type things, and for what I've done save/restore state was essential, while reverse iteration was pointless. If all the world was generated at once, then the compressed form of the world was simply a PRNG seed, whereas if it was generated tile-by-tile or somesuch then the compressed form of the world was, IIRC, a PRNG seed to which I added the coordinates (slightly transformed to merge X & Y coordinates) of the tile to get the seed from which the tile would be generated. In both cases PRNG backstepping was unused because there was no 1-dimensional sequence for which I wanted to be able to walk bidirectionally one step at a time.
You have a different conception of how such a thing would go?
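For concreteness, the coordinate-seeded scheme described above can be sketched like this. The function name tile_seed and the mixing constants are illustrative (a murmur3-style finalizer), not taken from any particular engine:

```c
#include <stdint.h>

/* Sketch of coordinate-seeded world generation: each tile's seed is
   derived only from the world seed and the tile coordinates, so no
   PRNG backstepping and no saved state markers are ever needed. */
static uint32_t tile_seed(uint32_t world_seed, int32_t x, int32_t y) {
    /* merge X and Y with the world seed into a single word */
    uint32_t h = world_seed ^ ((uint32_t)x * 0x9E3779B9u) ^ ((uint32_t)y * 0x85EBCA6Bu);
    /* scramble so that adjacent tiles get unrelated seeds */
    h ^= h >> 16; h *= 0x85EBCA6Bu;
    h ^= h >> 13; h *= 0xC2B2AE35u;
    h ^= h >> 16;
    return h;
}
```

Because the seed depends only on the world seed and the coordinates, the same tile always regenerates identically no matter which way the player approaches it.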
Running forwards all the time gives you one sequence.
Running backwards all the time gives you another sequence.
Isn't that the real utility? In effect two PRNGs in one.
And you could get this for free, probably, with single-stepping.
I'd be more worried about knowing only half the states if single-stepping is not possible.
The reverse order of the same sequence is not remotely close to being an independent sequence.
Ya, 32GB score is 8G-Iterations. PractRand is detecting the full period repeating, so it's a perfect score for a 32 bit state with 32 bit output.
The showstopper is that we don't have enough time in a clock cycle to do two 32-bit additions, one after the other.
Chip,
I'm thinking there is the whole two execution clock cycles available. As per the original Xoroshiro128+ source code, the summing can be in parallel with the iterating. EDIT: In fact, Scro even arranges his algorithm the same way.
But the second add uses the result of the first add, doesn't it?
Here's an alternative (expressed in C, but it's very easy to translate to verilog or most asms):
#include <stdint.h>

static uint32_t state = 1;

uint32_t get_rng_output() {
    uint32_t tmp = state;
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    tmp += (tmp << 18) | (tmp >> 14);                                // barrel shift and add
    tmp += ((tmp << 7) | (tmp >> 25)) ^ ((tmp << 11) | (tmp >> 21)); // two barrel shifts, an xor, and an add
    return tmp ^ state;
}
scro, the three state xors are Xorshift, I believe. Did you invent the hashing function after it and does it have a name?
If the output hashing is removed then it is indeed an xorshift-type LFSR. It doesn't have a name, though it's based upon one of my PRNGs named "rarns" which is broadly similar though with a significantly larger state size and slightly different details.
Note that I remain skeptical of the idea of a PRNG with only 32 bits of state, and uncertain what other limitations you may be laboring under (maximum gate delays? transistor count?).
The second add does indeed use the result of the first add. I could redo that to use only one add if need be, though to keep quality high I'd probably need to either increase the number of xors or introduce some constructs that would be slow in software. Also I'll point out that there are high quality PRNGs designed specifically for hardware like Trivium, which uses only bitwise-xor and bitwise-and and fixed width shifts (though its state is much larger than 32 bits... I think something like 288 bits?).
You might be amused to know, when only using 16-bits or less, the quality scores for our XORO32 doing the even-odd looping trick are generally a little better than straight incremental. There are a few 1 GB scores in the word0 column, and even some 4 GB and 8 GB scores in other columns too. It's also more of a mixed bag, [14 2 7] no longer looks so great for example.
For the testing, I modified the source to double iterate on every call to next(). Nothing else changed.
You might be amused to know, when only using 16-bits or less, the quality scores for our XORO32 doing the even-odd looping trick are generally a little better than straight incremental. There are a few 1 GB scores in the word0 column, and even some 4 GB and 8 GB scores in other columns too. It's also more of a mixed bag, [14 2 7] no longer looks so great for example.
For the testing, I modified the source to double iterate on every call to next(). Nothing else changed.
That's basically skipping half the output of xoroshiro32plus, right? It makes sense that it would be slightly higher quality, the output hashing is the same but the underlying LFSR is... still an LFSR, but one that's no longer efficient to do in software, probably a slightly higher quality LFSR, more taps.
Evan, if you're double iterating why are you only using 16 bits or less? Don't we need a new long score that tests 32 bits?
The data order for the new XORO32 with concatenated dual 16-bit sums is the same as the regular method with just a 16-bit sum per iteration. PractRand will see the same sequence either way.
By double iterating I'm sampling every second 16 bits all the way through then every other 16 bits on the second time around. Which is the same effect as the new XORO32 produces if you only use 16 bits of its result.
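For anyone following along in software, here is a rough sketch of the two sampling schemes being compared. This is an assumption-laden model, not the Verilog: the [15,3,6] triplet and the low-half-first concatenation order are placeholders for illustration:

```c
#include <stdint.h>

/* xoroshiro32+ stepped once per call versus twice per call with the
   two 16-bit sums concatenated into one 32-bit result. */
typedef struct { uint16_t s0, s1; } xoro_t;

static uint16_t rotl16(uint16_t x, int k) {
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* one iteration, returning the 16-bit sum */
static uint16_t xoro_next(xoro_t *g) {
    uint16_t r = (uint16_t)(g->s0 + g->s1);
    g->s1 ^= g->s0;
    g->s0 = (uint16_t)(rotl16(g->s0, 15) ^ g->s1 ^ (uint16_t)(g->s1 << 3));
    g->s1 = rotl16(g->s1, 6);
    return r;
}

/* two iterations per call, sums concatenated (order assumed) */
static uint32_t xoro_next32(xoro_t *g) {
    uint16_t first  = xoro_next(g);
    uint16_t second = xoro_next(g);
    return ((uint32_t)second << 16) | first;
}
```

Taking only the low 16 bits of each xoro_next32() result reproduces every second output of the single-stepped stream, which is why PractRand sees the same data either way.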
I'd be happy if anyone could give me even one case where being able to run a PRNG backwards is useful at all !
potatohead,
With a forward / reverse, one can just run the same computation and generation routines without regard to state at all, other than the current one.
I don't follow your idea Spud.
It can work if your state is only a function of the PRNG value. Then if you reverse the PRNG you reverse the state. But that is never true, as I tried to show in my post above.
Let's think of a really simple simulation. A simulated robot that makes a left or right turn at every iteration and takes one step forwards. The state here is a position on an X, Y plane. The left or right is determined by a PRNG.
If the PRNG produces steps left(L) and right(R), say "LRLLR", then the robot ends up at some new X,Y coordinate.
If we now reverse the PRNG we get "RLLRL".
The robot will not end up where it started.
I don't see how you can run a simulation backwards just by running a PRNG backwards.
So far, I see no use for a reversible PRNG at all.
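The robot example above can be made concrete with a few lines of C. This is just a demonstration of the argument, with the turn rule and coordinate system chosen arbitrarily:

```c
/* Replaying the PRNG's decisions in reverse order does NOT walk the
   robot back home, because its position depends on the whole history
   of moves, not just the most recent draw. */
typedef struct { int x, y, heading; } robot_t;   /* heading: 0=N,1=E,2=S,3=W */

static const int dx[4] = { 0, 1, 0, -1 };
static const int dy[4] = { 1, 0, -1, 0 };

static void step(robot_t *r, char turn) {
    r->heading = (r->heading + (turn == 'R' ? 1 : 3)) & 3;  /* 'L' is -1 mod 4 */
    r->x += dx[r->heading];
    r->y += dy[r->heading];
}

static void run(robot_t *r, const char *turns) {
    for (; *turns; ++turns) step(r, *turns);
}
```

Starting at (0,0) facing north, running "LRLLR" and then the reversed sequence "RLLRL" leaves the robot at (-5,-1); reversing the PRNG outputs does not rewind the simulation state.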
The robot isn't directly impacting the state. The environment the robot is in is procedurally driven by said state. Being procedural, it can be much bigger than the memory resources available, and is dynamically generated as needed. Instead of paging in big data, it's just computed on the fly.
Now, that can be done with a state capture, but the question was "more efficiently", and I think my case is an answer. There are probably others.
Do we need reversibility? No.
Would people do cool stuff with it, if it were there? Unknown.
I don't care if it's reversible or not. It's quite good now.
To implement a backwards XORO32, I would need to remap some instructions. I don't want to just use the WC/WZ bits. This was looking like a headache. That's why I asked if it was really useful. It wasn't because I was principally against it.
A reversible PRNG could be used in games. Two XORO32s starting from the same seed in different directions would use unique data for 2^30 double iterations. A 64K jump function would allow the same if iterated 32K times and if we knew how to create the jump function.
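The two-directions idea scales down to a toy example. Here an 8-bit full-period LCG stands in for XORO32 (purely illustrative, since its inverse is easy to write down; XORO32's actual reverse is a different matter):

```c
#include <stdint.h>

/* A reversible full-period generator run forwards and backwards from
   the same seed yields two streams that stay disjoint for half the
   period, since they occupy opposite arcs of one 256-element cycle. */
static uint8_t fwd(uint8_t x) { return (uint8_t)(x * 5u + 1u); }               /* full period mod 256 */
static uint8_t rev(uint8_t x) { return (uint8_t)((uint8_t)(x - 1u) * 205u); }  /* 205 = 5^-1 mod 256 */
```

From any seed, the first 127 forward outputs and the first 127 backward outputs never collide; that is the small-scale version of the 2^30 double-iterations figure above.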
To implement a backwards XORO32, I would need to remap some instructions. I don't want to just use the WC/WZ bits. This was looking like a headache. That's why I asked if it was really useful. It wasn't because I was principally against it.
Chip, I wrongly thought the CZ bits would be easy to use. We could always do a reverse xoroshiro32+ in software and a forward version (to get the odd states) if needed.
Evan, if you're double iterating why are you only using 16 bits or less? Don't we need a new long score that tests 32 bits?
The data order for the new XORO32 with concatenated dual 16-bit sums is the same as the regular method with just a 16-bit sum per iteration. PractRand will see the same sequence either way.
By double iterating I'm sampling every second 16 bits all the way through then every other 16 bits on the second time around. Which is the same effect as the new XORO32 produces if you only use 16 bits of its result.
A test of bit 1 compared to bit 0 or bit 2 would have been more useful!
To implement a backwards XORO32, I would need to remap some instructions. I don't want to just use the WC/WZ bits. This was looking like a headache. That's why I asked if it was really useful. It wasn't because I was principally against it.
Chip, I wrongly thought the CZ bits would be easy to use. We could always do a reverse xoroshiro32+ in software and a forward version (to get the odd states) if needed.
The C/Z bits *could* be used, but it breaks a convention and creates a caveat that lengthens logic times. I'd just rather have a unique opcode for a new instruction. To implement one, I'd have to shuffle some things around from their current states. It just gets messy quick.
The robot isn't directly impacting the state. The environment the robot is in is procedurally driven by said state. Being procedural, it can be much bigger than the memory resources available, and is dynamically generated as needed. Instead of paging in big data, it's just computed on the fly.
I don't understand what you are saying there.
In my example simulation the state is the robot, the robot is the state. The next state is determined by the previous state and some PRNG and whatever rule/function. You can't wind the state backwards simply by winding the PRNG backwards.
The only way I can see to do that is if the state is a simple 1 to 1 mapping (function) of only the PRNG value at any time. That is generally not the case and does not sound very interesting.
Do you have a simple example of where running a PRNG backwards is useful in winding the state of a system backwards?
Even though an instruction takes two clocks, we must have the ALU result done by the end of the first clock. After that, it goes through a series of result muxes which take time, too. I had to reorder the way the CORDIC works because of this problem of trying to stack up two 32+ bit additions in one clock cycle. There's just not enough time for it.
You might be amused to know, when only using 16-bits or less, the quality scores for our XORO32 doing the even-odd looping trick are generally a little better than straight incremental. There are a few 1 GB scores in the word0 column, and even some 4 GB and 8 GB scores in other columns too. It's also more of a mixed bag, [14 2 7] no longer looks so great for example.
For the testing, I modified the source to double iterate on every call to next(). Nothing else changed.
This is what people will get if they use only 16 bits of each 32-bit double-iterated sum, from the even domain then the odd domain. [15,3,6] is the best overall, but [14,2,7] is nowhere near second best. The second [15,3,6] score is better than both of the first scores.
[15,3,6] could be the triplet to go for. What we need, that we've never had, is a full bit[31:0] test for double iterations.
TonyB_,
To implement a backwards XORO32, I would need to remap some instructions. I don't want to just use the WC/WZ bits. This was looking like a headache. That's why I asked if it was really useful. It wasn't because I was principally against it.
Chip, I wrongly thought the CZ bits would be easy to use. We could always do a reverse xoroshiro32+ in software and a forward version (to get the odd states) if needed.
The C/Z bits *could* be used, but it breaks a convention and creates a caveat that lengthens logic times. I'd just rather have a unique opcode for a new instruction. To implement one, I'd have to shuffle some things around from their current states. It just gets messy quick.
There's an empty D,S slot next to SETNIB. Could XORO32 move there so it could do more than it does now? Instead of XORO32 D, it would be XORO32 D,{#}S.
The idea is that the new S would affect the sum. Is there enough time to XOR the sum results with S? Or, perhaps better, include it in the additions, i.e. 2 x (state high + state low + S high/low)?
What we need, that we've never had, is a full bit[31:0] test for double iterations.
That's your first score group there. There is no difference for PractRand between single iterated 16-bit scores and double iterated 32-bit scores. Even the endianness should be identical.
What we need, that we've never had, is a full bit[31:0] test for double iterations.
That's your first score group there. There is no difference for PractRand between single iterated 16-bit scores and double iterated 32-bit scores. Even the endianness should be identical.
So PractRand can't tell the difference between a sequence of 16-bit values and a sequence of 32-bit values?
Here is what scro said earlier:
Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from a random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, though, that because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.
Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.
If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.
True. But xoroshiro32plus is (mostly) unbiased when considered as a PRNG that produces 16 bits at a time, rather than 32 bits. There's almost exactly equal numbers of each possible 16 bit value produced over the course of 2^32 16 bit outputs.
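The "every value exactly once" point quoted above is easy to reproduce at a smaller scale. A 16-bit Weyl sequence (just an odd-increment counter, used here purely as a stand-in for an unbiased small-state PRNG) has full period, so one pass over its period yields each output exactly once:

```c
#include <stdint.h>

/* Any odd increment gives a full 65536-step period, so over one period
   every 16-bit output occurs exactly once: the too-perfect distribution
   an "unbiased" small-state PRNG exhibits. */
static uint16_t weyl_next(uint16_t x) { return (uint16_t)(x + 0x9E37u); }
```

A genuine random source drawing 65536 16-bit values would instead show a spread of duplicate counts, roughly Poisson-distributed around 1.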
You might be amused to know, when only using 16-bits or less, the quality scores for our XORO32 doing the even-odd looping trick are generally a little better than straight incremental. There are a few 1 GB scores in the word0 column, and even some 4 GB and 8 GB scores in other columns too. It's also more of a mixed bag, [14 2 7] no longer looks so great for example.
For the testing, I modified the source to double iterate on every call to next(). Nothing else changed.
Remember when you swapped bits 1 and 9? [14,2,7] got worse but [15,3,6] got better. I wonder what the double-iteration score would be with that swapping?
It's still only a sequence of 16 bits. The concatenation done in the XORO32 instruction is no different to the unending concatenating that is in effect when generating the test data. PractRand is seeing the same thing either way.
What we need, that we've never had, is a full bit[31:0] test for double iterations.
That's your first score group there. There is no difference for PractRand between single iterated 16-bit scores and double iterated 32-bit scores. Even the endianness should be identical.
So PractRand can't tell the difference between a sequence of 16-bit values and a sequence of 32-bit values?
Here is what scro said earlier:
Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from a random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, though, that because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.
Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.
If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.
True. But xoroshiro32plus is (mostly) unbiased when considered as a PRNG that produces 16 bits at a time, rather than 32 bits. There's almost exactly equal numbers of each possible 16 bit value produced over the course of 2^32 16 bit outputs.
There is no fundamental difference between a random sequence of 8-bit values and a random sequence of 32-bit values. PractRand accepts as input any sequence of 8192-bit (1 kilobyte) random values (effectively, input length is rounded down to a multiple of 1 kilobyte). It cannot infer from the random numbers what sort of PRNG made them, but it does check metadata for that information. For data piped in to standard input, this is done by changing the "stdin" command line option to "stdin8", "stdin16", "stdin32" or "stdin64" for PRNGs that generate 8/16/32/64 bits at a time; plain "stdin" indicates that the number of bits generated at once is unknown or unspecified; no other values are supported. If that information is present, PractRand adjusts its behavior slightly: the "folding" portion, which creates data-streams out of subsets of the bits in the original data-stream, will adjust which bits get used. For instance, if it's told the data was originally generated 32 bits at a time, it will make a data-stream using the 1st bit out of each set of 32 bits (on little-endian systems), whereas if it thought the original was 16-bit values, the equivalent data-stream would hold the 1st bit out of each 16 bits.
Individual tests with PractRand do their own things. In the core test battery, the DC6 parameterization treats any data-stream it sees as a sequence of 8-bit values, the FPF parameterization treats any data-stream it sees as a sequence of 16-bit values, as does the Gap16 test, while the BCFN parameterization treats any data-stream it sees as a sequence of 32-bit values, the BRank test treats its input as... I think 64 bit values, though my memory is fuzzy on that one, and mod3n treats its input as a sequence of 8-bit values. More or less. In some of those cases the word size the data-stream is treated as is not actually meaningful.
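As a concrete example of feeding PractRand, a generator need only write raw binary to stdout and be piped into the tester, e.g. ./gen | RNG_test stdin16 (binary name as PractRand's tools are usually built; check the local build). The xorshift16 triple used here is a placeholder, not XORO32:

```c
#include <stdio.h>
#include <stdint.h>

/* Placeholder 16-bit generator; any generator under test goes here. */
static uint16_t state16 = 1;

static uint16_t gen16(void) {
    state16 ^= (uint16_t)(state16 << 7);
    state16 ^= (uint16_t)(state16 >> 9);
    state16 ^= (uint16_t)(state16 << 8);
    return state16;
}

/* emit n raw 16-bit values (native endianness) to the given stream;
   a real harness would loop forever writing to stdout */
static size_t emit(FILE *out, size_t n) {
    size_t written = 0;
    while (n--) {
        uint16_t v = gen16();
        written += fwrite(&v, sizeof v, 1, out);
    }
    return written;
}
```

With stdin16, the folding stage knows the natural word size is 16 bits, so the derived bit-streams are taken per 16-bit word as described above.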
Chip,
Isn't the ONC18 RAM really fast, something on the order of 320+ MHz?
The current (anticipated) clock is 160MHz, and the instruction sequence of "I, S&D, e, R" takes 4 clocks, but due to pipelining it appears as 2 clocks to the user.
Could the "I" and "S&D" cycles be combined into 1 clock, with the "I" and "S&D" accesses being done as DDR (double data rate)? This might free up a whole additional clock for the "e" (ALU) operation.
Alternately, by slowing down the clock a little, it might appear as 1.5 clocks per instruction.
If I remember correctly, the AVR fetched the S&D and executed all in a single cycle. Simplest instructions are single-clock execution. It would've only worked because of the tiny adders involved and fast access general register space.
The Program Counter and Instruction Register, together, form a minimum two stage pipeline. So the following instruction will be fetched irrespective of what the current instruction does. This shows in the branches being 2, 3, or 4 clocks and untaken conditional branches are only 1 clock.
EDIT: Hmm, Cluso, maybe, on second thought, your idea isn't so bad. The fetching time of the instruction is probably the biggest factor for allocating a whole cycle to it. The Prop has a speed advantage over even something like the AVR - The Cog's instructions are always fetched from inside the Cog core. Either CogRAM, LUT or FIFO. There is no direct addressing of HubRAM, for example. Maybe it could be possible to do all three, I, S and D, fetches on a single cycle. That would absolutely need three read ports to the general registers though.
I think those RAMs can run up to 288MHz, or so. To get that kind of speed, though, there would be no time to do anything but register the output. We need to do some decoding and stuff, so there's some logic in there that slows things down. The alternative would be to add a stage of pipeline, but that would increase branch times by one clock. Too late now, anyway.
I think those RAMs can run up to 288MHz, or so. To get that kind of speed, though, there would be no time to do anything but register the output. We need to do some decoding and stuff, so there's some logic in there that slows things down. The alternative would be to add a stage of pipeline, but that would increase branch times by one clock. Too late now, anyway.
Thanks Chip.
A long time ago you posted the basic clock processing for the P2. Not sure where that was and if it is even current. How does it work now?
Here's the P1 for example
Is [14,2,7] still the best for 32-bit samples?
Have I got this idea correct? - http://forums.parallax.com/discussion/comment/1424891/#Comment_1424891
Thanks, Evan. I've snipped the full results which can be seen at
http://forums.parallax.com/discussion/comment/1424903/#Comment_1424903
Here is a comparison of just the two best triplets from the previous tests.
The first scores are for single-iterations, from tests done when the carry was included in the parity.
[14,2,7] and [15,3,6] are better than all the other triplets.
The second scores are for double-iterations, done earlier today.
Just a thought to get more time for the ALU.
The usual fix for gaining extra execution time is to extend the pipeline further - Which Chip recently did to the CORDIC.
P2 is something like...