I assume the new FPGA images with this new random number generator will have the same instruction set and encoding as the current v16. Is that correct?
Yes. This is a very subtle change that almost nobody would recognize, in practice.
Of the 63 PRNG bits, I figure all 63 will get used 8 times and 8 of them will get used one more time, since 16 cogs × 32 taps = 512 = 63 × 8 + 8. This will keep the fan-out as low as possible. How to parcel this out, though?
The XOR'ing idea is good, I think, but it would add delay, since there'd be some logic to get through before the interconnect amps and their delays. Coming straight out of a flop is a lot faster.
I made a program which randomly selects sixteen 32-bit tap sets from the 63-bit PRNG output. It uses RealRandom to shuffle the data and then it optimizes the tap usage to minimize fan-out. It generates this, for example:
I wrote the generator in Spin. I realize in Spin2, we'll need a SWAP(v1,v2) instruction. We'll also need to clean up the string syntax and make it terser.
Here's the Spin program that generates the random tap sets:
I tried to solve Chip's programming challenge using brute force: randomize everything and check whether each bit gets tapped 8 or 9 times. Of course I expected it to take a long time, but with a little math, and by measuring the factor between getting 3 and 4 bits right, I can say it's not feasible. It would take 1.1013671599500632751e50 seconds on my computer!
I made a program to do it in the post above. I just shuffle the deck, use the first 32, and track how many times each tap has been used. After shuffling the deck, though, I need to swap less-used taps into the lower 32 positions. It turns out that 55 bits get used 8 times and the other 8 bits get used 9 times, which is optimal.
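For anyone following along, here's a hedged C sketch of that shuffle-and-balance idea. It is not the actual Spin program; rand() stands in for RealRandom, and the names are made up:

#include <stdio.h>
#include <stdlib.h>

#define BITS 63     /* usable PRNG output bits */
#define COGS 16     /* tap sets to generate */
#define TAPS 32     /* taps per set */

int main(void) {
    int use[BITS] = {0};    /* how many times each bit has been tapped */
    int deck[BITS];

    for (int cog = 0; cog < COGS; cog++) {
        for (int i = 0; i < BITS; i++) deck[i] = i;

        /* Fisher-Yates shuffle of the 63 bit indices */
        for (int i = BITS - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = deck[i]; deck[i] = deck[j]; deck[j] = t;
        }

        /* Swap less-used taps into the lower 32 positions */
        for (int i = 0; i < TAPS; i++)
            for (int j = TAPS; j < BITS; j++)
                if (use[deck[j]] < use[deck[i]]) {
                    int t = deck[i]; deck[i] = deck[j]; deck[j] = t;
                }

        for (int i = 0; i < TAPS; i++) use[deck[i]]++;

        printf("cog %2d taps:", cog);
        for (int i = 0; i < TAPS; i++) printf(" %2d", deck[i]);
        printf("\n");
    }
    /* the tallies should end up as 55 bits x 8 uses and 8 bits x 9 uses */
    return 0;
}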
Thanks for working on this, Ahle2. I didn't know if anybody was going to work on it, so I spent some time figuring it out. It's been really quiet on the forum all day, after maybe 60 posts yesterday.
Chip,
If there was a desire to even that out to 8 uses for each bit then extending the ss result to 64 bits wouldn't take much. Just make the s0 + s1 adder have 65-bit result.
That would effectively be the carry output, right? I wonder if its behavior would pass all those tests that the other bits did.
What happens with your shuffle algo if the 64-bit random result turns out to have just one "1" bit in it? It could occur.
-Phil
I don't understand. That doesn't seem like a problem.
Someone said something that made me see things differently: if all those tests were passed in Dieharder, then every bit of output is like its own TRNG. Therefore, it doesn't matter if you switch their positions around or statically invert some of them.
I guess what I was thinking is that such an imbalance of ones vs. zeroes would increase the likelihood that two or more of the derivative 32-bit results would be identical -- in my hypothetical case, all zeroes. But maybe that's not a problem?
-Phil
Extreme imbalances are rare, but part of the idea, I guess.

This was pretty funny.
Perhaps I'm mistaken, but I believe what you want is for the 16 32-bit values derived from a single 64-bit value to be statistically independent from each other. Right? I guess what concerns me is that, as the imbalance of ones to zeroes in that 64-bit value increases, the likelihood of statistical independence of the 32-bit values decreases dramatically. Admittedly, this is just a gut feeling. Hopefully, the analysis of some Monte Carlo simulations will prove me wrong, but I would not put a wager on that being the case.
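For what it's worth, here is a hedged sketch of the kind of Monte Carlo check being described. The taps table and derive() are hypothetical stand-ins, filled with rand() rather than the real RealRandom-chosen tables, and inversions are left out to keep it short:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical tap tables: 16 cogs x 32 source-bit indices (0..62). */
static uint8_t taps[16][32];

static uint32_t derive(uint64_t x, const uint8_t t[32]) {
    uint32_t r = 0;
    for (int k = 0; k < 32; k++)
        r |= (uint32_t)((x >> t[k]) & 1) << k;
    return r;
}

int main(void) {
    for (int i = 0; i < 16; i++)
        for (int k = 0; k < 32; k++)
            taps[i][k] = rand() % 63;

    uint64_t x = 1ULL << 17;     /* Phil's pathological case: one '1' bit */
    int collisions = 0;
    for (int a = 0; a < 16; a++)
        for (int b = a + 1; b < 16; b++)
            if (derive(x, taps[a]) == derive(x, taps[b]))
                collisions++;
    printf("colliding pairs among 16 derived values: %d\n", collisions);
    return 0;
}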
Yep, that would effectively include the carry. The sequence doesn't change, since the add sits outside the loop, but obviously bit 63 of the result pops into existence, so that would have to be factored into any comparison.
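In software terms the 65-bit result is easy to picture. Here's a hedged sketch (ss64() is just an illustrative name, not anything from the real Verilog):

#include <stdint.h>

/* Hedged sketch of the 65-bit-adder idea: the carry out of s0 + s1
   becomes the new top bit once the LFSR-ish LSB is dropped. */
uint64_t ss64(uint64_t s0, uint64_t s1) {
    uint64_t sum   = s0 + s1;
    uint64_t carry = (sum < s0);           /* bit 64 of the 65-bit sum */
    return (carry << 63) | (sum >> 1);     /* sum[64:1], now 64 bits */
}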
Yes, that is the problem: how to magically create 16 PRNGs running in parallel from a single PRNG in such a way that one cannot tell they are related. The guys doing lots of simulation work on supercomputers worry about this a lot.
I'm not sure if that is even possible.
The best thing is to have 16 independent PRNGs. That boils down to the problem of generating 16 seeds from a single seed in such a way that the PRNGs all operate in different parts of the state space and will not overlap during any feasible run time.
xoroshiro128+ includes a method to solve this problem. One can take a single seed and generate up to 2^64 other seeds from it, each of which can be used to drive a PRNG through its own 2^64-long stretch of the sequence that will not overlap with any other.
This is done with the JUMP function:
#include <stdint.h>

/* xoroshiro128+ state and step function (the original 55/14/36 rotation
   constants), added here so the snippet compiles on its own. */
static uint64_t s[2];    /* the 128-bit generator state; seed to nonzero */

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

uint64_t next(void) {
    const uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    const uint64_t result = s0 + s1;        /* the "+" output */
    s1 ^= s0;
    s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14);
    s[1] = rotl(s1, 36);
    return result;
}

/* This is the jump function for the generator. It is equivalent
   to 2^64 calls to next(); it can be used to generate 2^64
   non-overlapping subsequences for parallel computations. */
void jump(void) {
    static const uint64_t JUMP[] = { 0xbeac0467eba5facb, 0xd86b048b86aa9922 };
    uint64_t s0 = 0;
    uint64_t s1 = 0;
    for (int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
        for (int b = 0; b < 64; b++) {
            if (JUMP[i] & 1ULL << b) {
                s0 ^= s[0];
                s1 ^= s[1];
            }
            next();
        }
    s[0] = s0;
    s[1] = s1;
}
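In that scheme, carving out 16 non-overlapping streams (one per cog, say) would look something like this hedged sketch; seed_cogs() and cog_seed are illustrative names, not anything from real code:

/* Give each of 16 consumers its own 2^64-long subsequence: snapshot the
   state as that consumer's seed, then jump() past its whole range. */
uint64_t cog_seed[16][2];

void seed_cogs(uint64_t seed0, uint64_t seed1) {
    s[0] = seed0;
    s[1] = seed1;
    for (int i = 0; i < 16; i++) {
        cog_seed[i][0] = s[0];
        cog_seed[i][1] = s[1];
        jump();               /* advance 2^64 steps to the next stream */
    }
}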
My guess is this solution starts to get too big to put into hardware.
I'd be inclined to just arrange that no single PRNG value is ever used more than once. With or without any bit shuffling/inverting. That is to say be sure the PRNG advances by one step for every random number requested by anyone.
Not a good idea, imho. The simple way would sacrifice speed and a little determinism. The more complex way sacrifices determinism and a little speed. The expensive way, as you've already noted, takes too much silicon.
Cogs have no need to be secured from each other. The Propeller isn't targeting general computing.
I thought the idea was that the PRNG is continuously being clocked in the background, endlessly stepping through its sequence whether or not anybody reads a random number from it. That is to say, most of the random numbers get missed.

As such, I would suspect that the independence of the random sequence any cog sees from any other is good enough for all practical purposes, with or without bit shuffling.

Cogs will be waiting on I/O and timers, etc., so multiple cogs won't even be reading the PRNG in the same order all the time, which randomizes things even more.

The downside of this is perhaps power consumption: stepping the PRNG when its output is not used.
The words "secure" and "cryptographic" have popped up on this thread.
I'm pretty sure the xoroshiro128+ PRNG is not regarded as a cryptographically secure PRNG (CSPRNG), and security gurus would frown at using it for cryptographic purposes.

Depends how secure you want to be, I guess.
Yep, all of the above. I imagine its power cost will be about that of four 32-bit timers; adders are linear and not much more than a counter, I think. The 16-way shuffled ordering doesn't cost any extra silicon or power, which is why it's being included.
The state of the PRNG does advance on every clock, regardless of whether it's being used anywhere.
Each cog receives a unique subset of 32 bits from the PRNG's 63 output bits. Those bits were randomly chosen the other day by the Spin program I made to pick them out, and they are in random order. Some of the bits are always inverted.
Each smart pin gets a unique subset of 8 bits from the PRNG, with some bits always inverted.
All these bit groups and inversions were chosen using RealRandom as a TRNG source.
While all cogs and smart pins receive data from the same PRNG, they each get different versions, with different bits and different inversions. This is probably overkill, but it was done to make every point of usage unique.
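As a hedged illustration of what each cog's wiring amounts to in software terms (cog_random(), taps, and invert here are placeholders, not the real RealRandom-chosen tables):

#include <stdint.h>

/* taps[k] = which of the 63 PRNG output bits feeds cog result bit k;
   invert  = mask of result bits that are wired through an inverter.
   Both are hypothetical stand-ins for the actual tables. */
uint32_t cog_random(uint64_t x, const uint8_t taps[32], uint32_t invert) {
    uint32_t r = 0;
    for (int k = 0; k < 32; k++)
        r |= (uint32_t)((x >> taps[k]) & 1) << k;
    return r ^ invert;    /* static inversion is free in hardware */
}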
Here are the 16 unique longs that go to each cog, and each byte's worth goes to a different smart pin (64 total). The source of this data is the xoroshiro128+ PRNG, which outputs x[62:0]; those bits were randomly spread all over and randomly inverted to make these tap sets:
Why was the LSB of ss removed, giving a 63-bit result?
Because it's only an LFSR, according to the xoroshiro128+ documentation. It wouldn't pass the tests.
As you pointed out, though, the carry output of the final adder would seem to be of good quality. But it must not be, if they didn't mention it, because it would have been a great way to get a full 64 bits out of the PRNG.

Actually, I might have found the source of it: Heater dug up an example 32-bit conversion that used [32:01] - http://forums.parallax.com/discussion/comment/1402202/#Comment_1402202
Because there is a comment in the original source code of xoroshiro128+ which says:
"
Beside passing BigCrush, this generator passes the PractRand test suite
up to (and included) 16TB, with the exception of binary rank tests,
which fail due to the lowest bit being an LFSR; all other bits pass all
tests. We suggest to use a sign test to extract a random Boolean value.
"
So basically, the LSB taken alone is not as "random" as the other bits; it's a simple linear feedback shift register PRNG. It probably has a very short sequence compared to the 2^128 sequence length of the rest of the bits.
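To illustrate the sign test that comment suggests (a minimal hedged sketch, reusing the next() from Heater's snippet above; random_bool() is an illustrative name):

/* Sign test: extract a random Boolean from the top bit (bit 63),
   which passes the test suites, rather than the LFSR-ish bit 0. */
int random_bool(void) {
    return (int64_t)next() < 0;
}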