Instead of the taps, why not make the RNG work like CORDIC?
It's just a HUB resource. Pipeline it, like CORDIC, and we get good randomness and a reasonable throughput.
Yes, I know that is a Verilog change, but I see it as a bug fix, not a new feature. It's actually a downgrade.
And this needs a device test cycle anyway.
Might be worth it.
No taps means no potential correlation between COGs. With static taps, I find it hard to believe there won't be correlations. It's running random output through something that is not itself random, and that always equals not random.
What I suspect will happen is people will end up making a random COG to avoid those correlations. So why not just share the generator CORDIC style?
I personally see no need for a REAL random number, whatever this will be...
The most pressing case for real random numbers is in cryptography: key generation, secure communication. Most PRNGs are not strong enough for such use, but real random numbers can be used as seeds for cryptographically strong PRNGs (CSPRNGs).
It's often required to seed PRNGs to get repeatable runs of software, for testing if nothing else. That is not the use case here, and I suspect it's better if such software contains its own PRNG.
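A minimal sketch of that idea, assuming nothing about the P2 itself: the program carries its own xoroshiro-style state and seeds it from one fixed number via splitmix64 (the seeder the xoroshiro authors recommend), so every test run replays the identical sequence. The function names here are just for illustration.

```c
#include <stdint.h>

/* splitmix64: the seeder recommended by the xoroshiro authors. */
static uint64_t splitmix64(uint64_t *x) {
    uint64_t z = (*x += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Fill a local 128-bit PRNG state from one fixed seed so runs repeat. */
void seed_local_prng(uint64_t s[2], uint64_t seed) {
    s[0] = splitmix64(&seed);
    s[1] = splitmix64(&seed);
}
```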
I'm not sure about this pipelining idea. What do you guys mean by 'pipelining'?
I could imagine stuffing random numbers into a FIFO and having COGs read from the end of the FIFO. But then they are all running in lock step, following each other along the same random sequence. That is no good.
The way to do this is to have 16 sets of 128-bit state registers: basically 16 independent PRNGs. But that is only sure to work nicely if all those state registers are seeded properly (see the jump() function in the previous post), such that the sub-sequences each PRNG goes through don't overlap with any others. That seeding business, the jump() function, is perhaps too much to do in hardware.
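For reference, this is the authors' public-domain xoroshiro128+ code with its jump() function, plus a sketch (cog_state and seed_cogs are made-up names) of seeding 16 disjoint streams from it. jump() advances the state by 2^64 steps, so no stream can wander into another's subsequence.

```c
#include <stdint.h>

static uint64_t s[2];  /* 128-bit state; must be seeded non-zero */

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

/* One xoroshiro128+ step: output the sum, then advance the state. */
uint64_t next(void) {
    const uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    const uint64_t result = s0 + s1;
    s1 ^= s0;
    s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14);
    s[1] = rotl(s1, 36);
    return result;
}

/* Advance the state by 2^64 steps without producing output. */
void jump(void) {
    static const uint64_t JUMP[] = { 0xbeac0467eba5facb, 0xd86b048b86aa9922 };
    uint64_t s0 = 0, s1 = 0;
    for (int i = 0; i < 2; i++)
        for (int b = 0; b < 64; b++) {
            if (JUMP[i] & UINT64_C(1) << b) { s0 ^= s[0]; s1 ^= s[1]; }
            next();
        }
    s[0] = s0;
    s[1] = s1;
}

/* Sketch: give each of the 16 cogs its own non-overlapping subsequence. */
uint64_t cog_state[16][2];
void seed_cogs(void) {
    for (int cog = 0; cog < 16; cog++) {
        cog_state[cog][0] = s[0];
        cog_state[cog][1] = s[1];
        jump();  /* next cog starts 2^64 steps further along */
    }
}
```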
I like the idea of making the PRNG a hub resource. It obviates any need for those dodgy taps, since each cog will access a different value from the sequence. The only thing to check on, then, is the autocorrelation of the PRNG's values spaced 16 apart.
Edit: Of course, this presumes that the PRNG can spew out a new value on every clock cycle. 'Seems pretty far-fetched.
> Of course, this presumes that the PRNG can spew out a new value on every clock cycle. 'Seems pretty far-fetched.
I believe that is exactly what it does. Looking at the Verilog Chip posted, it produces a new output on every positive edge of its clock input, which I presume is driven from the system clock at full speed. Remember Chip said this would be running continuously, whether or not anyone reads its output.
> I like the idea of making the PRNG a hub resource.
Is whatever PRNG instruction we get going to be a HUB op or a regular instruction? I thought the latter.
If it's a HUB op that means it's going to run slower than other instructions.
If it's a regular instruction then potentially all COGs could read the same thing at the same time! Hence the rat's nest of bit shuffling on the output. Is that a good idea? Still not sure.
I have no idea how one would test any correlation between the numbers received by each COG. The tests we have been using don't cater for that.
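One plausible harness, sketched under my own assumptions (the two 32-bit slices below are stand-ins for the real per-cog scrambles): interleave the values two cogs would read into one stream and feed it to PractRand. Correlation between the streams that neither shows alone should make the merged stream fail earlier.

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t s[2] = { 0x853c49e6748fea9bULL, 0xda3e39cb94b95bdbULL };

static uint64_t rotl(uint64_t x, int k) { return (x << k) | (x >> (64 - k)); }

static uint64_t next(void) {  /* one xoroshiro128+ step */
    uint64_t s0 = s[0], s1 = s[1], r = s0 + s1;
    s1 ^= s0;
    s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14);
    s[1] = rotl(s1, 36);
    return r;
}

/* Usage:  ./interleave | ./RNG_test stdin32   (PractRand) */
int main(void) {
    for (;;) {
        uint64_t v = next();
        uint32_t cog_a = (uint32_t)(v >> 31);  /* bits 62..31: stand-in view */
        uint32_t cog_b = (uint32_t)(v >> 1);   /* bits 32..1: overlapping view */
        fwrite(&cog_a, sizeof cog_a, 1, stdout);
        fwrite(&cog_b, sizeof cog_b, 1, stdout);
    }
}
```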
The scrambled MUX means each COG can access a different value, all on the same SysCLK.
( & we still have not proven those copies pass random tests ?)
That HW has to come at quite a cost in metal layer routing paths, whilst the HUB idea uses muxes & routing that is already there, for other tasks. (but at a small access cost)
I'm not sure the PRNG needs such a high bandwidth availability ?
Cogs have no need to be secured from each other. Propeller isn't targeting general computing.
I like that it works as a fast instruction.
Pipelined is expensive compared to what we have now. As Heater says, it requires 16x the working registers. I think it would be smaller as an explicit Cog resource in each Cog.
I was under the impression that each cog was going to have its own PRNG circuit and be seeded on startup from that XORed array that Chip created. That should be random enough, no?
Other way round. The array map is for 16 unique views of the single PRNG generator.
In what sense ? As a HUB-Slotted access, it has no 'pipeline', just a time slot.
When someone says they want pipelined ... and like the CORDIC ... I interpret that as said.
I don't think there is any advantage in forcing time slots for each Cog on to the single generator. It's still the same single source. Keeping it fast is a feature.
Do you mean 16 copies - that's now bumped the size to 4 smart pins. (but has saved some routing)
> When someone says they want pipelined ... and like the CORDIC ... I interpret that as said.
Yes. Buffered would be a better term, if we even do that. I'm realizing that makes no sense, unless the numbers themselves are kept to meet request demands. Seems overkill to me.
The minimum would be COG exclusive. Only allow one request at a time.
I'm having trouble thinking of a use case for all COGS to get random numbers at full clip.
The only saving I feel could be made is a power saving: idling the generator most of the time and only cycling it after the current result has been read. Although this would also make it more predictable, due to the elimination of instruction timing variations.
OK fine by me, just so long as at least one COG is not hashed so we get the benefit of a good source.
We've already done multiple tests of subsets of the 63 bits, admittedly contiguous bit groups, and found no degradation of quality. I did a few variations just yesterday before deciding that a 130-bit accumulator was likely to work.
If they're doing independent things, yes it's high quality in all cases. If multiple Cogs are comparing their results then likely not so ideal. This can be tested ...
The current xoroshiro128+ implementation is as small as can be.
Every cog gets a uniquely-picked/ordered/inverted set of 32 bits from the 63-bit source. Every smart pin gets 8 such bits. Smart pins need these for DAC dithering and noise generation. Cogs need them for the GETRND instruction. I think cogs and pins will all be happy with this arrangement. These patterns really only need to serve as apparently-uncoordinated white noise sources.
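A rough software model of that arrangement, purely illustrative (the tables below are arbitrary placeholders, not Chip's actual wiring):

```c
#include <stdint.h>

static uint8_t pick[16][32];      /* source bit (0..62) per output bit */
static uint32_t invert_mask[16];  /* per-cog inversion pattern */

/* Build stand-in tables; in silicon these would be fixed wiring. */
void init_tables(void) {
    for (int cog = 0; cog < 16; cog++) {
        for (int i = 0; i < 32; i++)
            pick[cog][i] = (uint8_t)((cog * 2 + i * 5) % 63);  /* arbitrary spread */
        invert_mask[cog] = (cog & 1) ? 0x55555555u : 0xAAAAAAAAu;  /* arbitrary */
    }
}

/* Assemble one cog's 32-bit view of the shared 63-bit PRNG word. */
uint32_t cog_view(uint64_t prng63, int cog) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        r |= (uint32_t)((prng63 >> pick[cog][i]) & 1u) << i;
    return r ^ invert_mask[cog];
}
```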
On start-up, the PRNG will be 1/4 seeded maybe 64 times with 4 bits of pure thermal noise, along with another 27 bits of process/voltage/temperature-dependent bits.
If you think about those 128 PRNG bits being totally randomized on each start-up, and then the PRNG iterating at 160MHz thereafter for, let's say, a whole year before a reset occurs, that's 2^52 iterations, which only spans 1/(2^76) the gamut of the PRNG. That's like landing somewhere along the Earth's equator, and then heading westward at a pace of 160M steps per second, only to travel 17 picometers over the next year.
The chance of any chip ever experiencing the same PRNG state in its life is negligible. Even if billions of chips were made and ran concurrently for a millennium, chances are near zero that any of them would ever experience any of the same PRNG states during that time.
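Those exponents check out; a quick back-of-envelope in C (assuming a 365.25-day year):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double iters = 160e6 * 365.25 * 24 * 3600;  /* one year at 160 MHz */
    printf("iterations/year = %.3g = 2^%.1f\n", iters, log2(iters));
    printf("fraction of the 2^128 gamut = 2^%.1f\n", log2(iters) - 128.0);
    return 0;
}
/* Prints roughly:  iterations/year = 5.05e+15 = 2^52.2
                    fraction of the 2^128 gamut = 2^-75.8 */
```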
Heh, I've just been testing overlapped results and noticed that multiples of eight bits register a denser fail rate than non-multiples of eight bits. I'm guessing that's more an artefact of the tester than any real difference in randomness. They all fail the very first block, so it's just splitting hairs anyway.
Yeah, so, the cross-Cog multi-tapping does show contamination when the streams are compared with each other, as expected ... if that matters. For most situations it doesn't matter at all.
1) If you pull 32 bits out of the 63 "good" bits available, are those 32 bits as random as we like to think? Gut says that any permutation of 63 random bits should be equally random. Is that really true given that they are not actually random bits but PRNG output?
2) There is a worry over the possible correlation between the numbers fetched by each COG.
Obviously there is such correlation; all COGs are fetching from the same sequence. The shuffling helps hide that a little. More significantly, the PRNG is running at full speed all the time, so COGs are going to be skipping much of its output.
Actually un-correlating those 16 streams requires 16 instances of the state variables, and requires that they are seeded very carefully (see the jump() function posted earlier). This is not going to happen. Anything else is a bodge, but I think we have to live with it.
3) That pesky LSB is said to be less "random". I'm wondering if that is even a worry. The comment in the xoroshiro128+ source code points out that, using all 64 bits of output, this PRNG easily gets through the Diehard and BigCrush tests. It only fails PractRand on the binary rank tests. I just tried that for myself and it is so. Ignore the binary rank tests and PractRand works fine.
So, even with that dodgy LSB, xoroshiro128+ is an incredibly good PRNG.
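To reproduce that check, stream the full 64-bit sums into PractRand; only the BRank (binary rank) tests should complain:

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t s[2] = { 1, 2 };  /* any non-zero seed */

static uint64_t rotl(uint64_t x, int k) { return (x << k) | (x >> (64 - k)); }

/* Usage:  ./gen | ./RNG_test stdin64   (PractRand) */
int main(void) {
    for (;;) {
        uint64_t s0 = s[0], s1 = s[1], v = s0 + s1;  /* the '+' output */
        s1 ^= s0;
        s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14);
        s[1] = rotl(s1, 36);
        fwrite(&v, sizeof v, 1, stdout);
    }
}
```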
We can go with Chip's solution. Unless someone comes up with a test that shows a weakness.
Perhaps Parallax could offer a challenge to create a P2 program that demonstrates any correlation between the PRNG sequence seen by two or more COGS. A thousand dollar reward perhaps.
> 1) If you pull 32 bits out of the 63 "good" bits available, are those 32 bits as random as we like to think?
Thoroughly proven, yes.
> 3)...
> So, even with that dodgy LSB, xoroshiro128+ is an incredibly good PRNG.
The result is currently only 63 bits. The summing lsb isn't included. This is why I went to the trouble yesterday to test out extending the accumulator size from 128 bits to 130 bits, so as to make the summed result a full 64 bits. Which was a success, btw. Chip can definitely make this change.
ERR: Heater! You know all that ... I thought that was Doug I was answering.
Comments:
Dieharder - 3 "WEAK" tests.
PractRand (up to 16GB) - two 'unusual' tests, at 64 and 128 MBytes.
Well done.
> Of course, this presumes that the PRNG can spew out a new value on every clock cycle. 'Seems pretty far-fetched.
For a PRNG, this means buffering some requests. Good catch.
> ( & we still have not proven those copies pass random tests ?)
We only have one PRNG. 16 different hashes of it don't add to the randomness. How can they, when those hashes aren't themselves random?
For multi COG use, can't we employ the locks?
> Pipelined is expensive compared to what we have now. As Heater says, it requires 16x the working registers. I think it would be smaller as an explicit Cog resource in each Cog.
It's currently a small footprint. Quarter of a Smartpin I think Chip said ... here we go - http://forums.parallax.com/discussion/comment/1402335/#Comment_1402335
I guess the question is, what size is the PRNG, with fanout MUX removed, and multiply that x16.
Seems that x16 will have a lot of chip-area impact.
> I don't think there is any advantage in forcing time slots for each Cog on to the single generator. It's still the same single source. Keeping it fast is a feature.
Yes.
> The minimum would be COG exclusive. Only allow one request at a time.
That gains us nothing in terms of higher quality over the existing implementation and loses in determinism and responsiveness.
> I'm having trouble thinking of a use case for all COGS to get random numbers at full clip.
True. At the very least, no need to update any more frequently than every 2 clocks.
> OK fine by me, just so long as at least one COG is not hashed so we get the benefit of a good source.
Is that true? If so, I missed it.
Beyond that, I guess I don't personally care. If it's less random when a lot of COGS get values, fine. People can pick speed vs quality.
Maybe that's not even the case.