Random/LFSR on P2 - Page 10 — Parallax Forums


Comments

  • Heater. Posts: 21,230
    The fixed up evanh PRNG passes:

    Dieharder - 3 "WEAK" tests.
    PractRand (Up to 16GB) - Two 'unusual' tests. At 64 and 128 MBytes.

    Well done.

  • potatohead Posts: 10,253
    edited 2017-03-08 18:21
    Instead of the taps, why not make the RNG work like CORDIC?

    It's just a HUB resource. Pipeline it, like CORDIC, and we get good randomness and a reasonable throughput.

    Yes, I know that is a VERILOG change, but I see it as a bug fix, not a new feature. It's actually a downgrade.

    And this needs a device, test cycle anyway.

    Might be worth it.

    No taps, no potential correlation between COGS. I find it hard to believe there won't be. The taps are static. There will be correlations. It's running random through not RANDOM. This always equals not random.

    What I suspect will happen is people will end up making a random COG to avoid those correlations. So why not just share the generator CORDIC style?





  • Heater. Posts: 21,230
    MJB,
    I personally see no need for a REAL random number, whatever this will be...
    The most pressing case for real random numbers is in cryptography: key generation, secure communication. Most PRNGs are not strong enough for such use, but real random numbers can be used as seeds for cryptographically strong PRNGs (CSPRNGs).

    It's often required to seed PRNGs to get repeatable runs of software. For testing if nothing else. This is not the use case here and I suspect it's better if such software contains its own PRNG.

  • Heater. Posts: 21,230
    I'm not sure of this pipelining idea. What do you guys mean by 'pipelining'?

    I could imagine stuffing random numbers into a FIFO and have COGS read from the end of the FIFO. But then they are all running in lock step following each other along the random sequence. That is no good.

    The way to do this is to have 16 sets of 128 bit state registers. Basically 16 independent PRNGs. But that is only sure to work nicely if all those state registers are seeded properly (See jump() function in previous post) such that the sub-sequences each PRNG goes through don't overlap with any others. That seeding business, the jump() function, is perhaps too much to do in hardware.
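For reference, the jump() mechanism Heater describes is part of the public-domain xoroshiro128+ code by Blackman and Vigna. A sketch in C of the generator step and the jump function (the rotation constants 55/14/36 and the jump polynomial are the published ones for xoroshiro128+); calling jump() between copying out state gives 16 generators whose sub-sequences are guaranteed not to overlap for 2^64 steps each:

```c
#include <stdint.h>

static uint64_t s[2]; /* 128 bits of generator state */

static inline uint64_t rotl(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

/* One xoroshiro128+ step: the output is the sum of the two state
   words; only the low bit of that sum is the "weak" LFSR-like bit. */
uint64_t next(void) {
    const uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    const uint64_t result = s0 + s1;
    s1 ^= s0;
    s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14);
    s[1] = rotl(s1, 36);
    return result;
}

/* jump(): advances the state by 2^64 steps in O(128) iterations,
   equivalent to 2^64 calls of next().  Seed one generator, copy its
   state out, jump(), copy again, etc., to get non-overlapping
   sub-sequences for 16 independent PRNGs. */
void jump(void) {
    static const uint64_t JUMP[] = { 0xbeac0467eba5facbULL,
                                     0xd86b048b86aa9922ULL };
    uint64_t s0 = 0, s1 = 0;
    for (int i = 0; i < 2; i++)
        for (int b = 0; b < 64; b++) {
            if (JUMP[i] & (UINT64_C(1) << b)) {
                s0 ^= s[0];
                s1 ^= s[1];
            }
            next();
        }
    s[0] = s0;
    s[1] = s1;
}
```

As Heater says, doing this seeding in hardware for 16 cogs would cost 16 x 128 bits of state plus the jump logic, which is the crux of the area argument later in the thread.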




  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2017-03-08 19:30
    I like the idea of making the PRNG a hub resource. It obviates any need for those dodgy taps, since each cog will access a different value from the sequence. The only thing to check on, then, is the autocorrelation of the PRNG's values spaced 16 apart.

    Edit: Of course, this presumes that the PRNG can spew out a new value on every clock cycle. 'Seems pretty far-fetched.

    -Phil
  • potatohead Posts: 10,253
    edited 2017-03-08 19:26
    Heater, CORDIC is pipelined so multiple COGS can access it and get fast results.

    For a PRNG, this means buffering some requests. Good catch.
  • Heater. Posts: 21,230
    edited 2017-03-08 20:05
    @Phil,
    Of course, this presumes that the PRNG can spew out a new value on every clock cycle. 'Seems pretty far-fetched.
    I believe that is exactly what it does. Looking at the Verilog Chip posted, it produces a new output on every positive edge of its clock input. Which I presume is driven from the system clock at full speed. Remember Chip said this would be running continuously, no matter if anyone reads its output or not.
    I like the idea of making the PRNG a hub resource.
    Is whatever PRNG instruction going to be a HUB op or a regular instruction? I thought the latter.

    If it's a HUB op that means it's going to run slower than other instructions.

    If it's a regular instruction then potentially all COGs could read the same thing at the same time! Hence the rats nest of bit shuffling on the output. Is that a good idea? Still not sure.

    I have no idea how one would test any correlation between the numbers received by each COG. The tests we have been using don't cater for that.


  • jmg Posts: 15,140
    Heater. wrote: »
    @Phil,
    Of course, this presumes that the PRNG can spew out a new value on every clock cycle. 'Seems pretty far-fetched.
    I believe that is exactly what it does. Looking at the Verilog Chip posted, it produces a new output on every positive edge of its clock input. Which I presume is driven from the system clock at full speed. Remember Chip said this would be running continuously, no matter if anyone reads its output or not.
    I like the idea of making the PRNG a hub resource.
    Is whatever PRNG instruction going to be a HUB op or a regular instruction? I thought the latter.

    If it's a HUB op that means it's going to run slower than other instructions.

    If it's a regular instruction then potentially all COGs could read the same thing at the same time! Hence the rats nest of bit shuffling on the output. Is that a good idea? Still not sure.

    I have no idea how one would test any correlation between the numbers received by each COG. The tests we have been using don't cater for that.
    The scrambled MUX means each COG can access a different value, all on the same SysCLK.
    ( & we still have not proven those copies pass random tests ?)


    That HW has to come at quite a cost in metal layer routing paths, whilst the HUB idea uses muxes & routing that is already there, for other tasks. (but at a small access cost)

    I'm not sure the PRNG needs such a high bandwidth availability ?

  • This is exactly why making it a HUB resource like CORDIC makes the best sense.

    We only have one PRNG. 16 different hashes of it doesn't add to the randomness. How can it, when those aren't themselves random?

    For multi COG use, can't we employ the locks?
  • evanh Posts: 15,091
    edited 2017-03-09 00:34
    Cogs have no need to be secured from each other. Propeller isn't targeting general computing.

    I like that it works as a fast instruction.

    Pipelined is expensive compared to what we have now. As Heater says, it requires 16x the working registers. I think it would be smaller as an explicit Cog resource in each Cog.

    It's currently a small footprint. Quarter of a Smartpin I think Chip said ... here we go - http://forums.parallax.com/discussion/comment/1402335/#Comment_1402335
  • jmg Posts: 15,140
    evanh wrote: »
    Pipelined is expensive compared to what we have now.
    In what sense? As a HUB-slotted access, it has no 'pipeline', just a time slot.
    evanh wrote: »
    I think it would be smaller as an explicit Cog resource in each Cog.
    It's currently a small footprint. Quarter of a Smartpin I think Chip said ... here we go - http://forums.parallax.com/discussion/comment/1402335/#Comment_1402335
    Do you mean 16 copies - that's now bumped the size to 4 smart pins. (but has saved some routing)

    I guess the question is, what size is the PRNG, with fanout MUX removed, and multiply that x16.
    Seems that x16 will have a lot of chip-area impact.

  • I was under the impression that each cog was going to have its own PRNG circuit and be seeded on startup from that XORed array that Chip created. That should be random enough, no?
  • evanh Posts: 15,091
    Seairth wrote: »
    I was under the impression that each cog was going to have its own PRNG circuit and be seeded on startup from that XORed array that Chip created. That should be random enough, no?

    Other way round. The array map is for 16 unique views of the single PRNG generator.
  • evanh Posts: 15,091
    edited 2017-03-09 01:52
    jmg wrote: »
    In what sense? As a HUB-slotted access, it has no 'pipeline', just a time slot.
    When someone says they want pipelined ... and like the CORDIC ... I interpret that as said.

    I don't think there is any advantage in forcing time slots for each Cog on to the single generator. It's still the same single source. Keeping it fast is a feature.

    Do you mean 16 copies - that's now bumped the size to 4 smart pins. (but has saved some routing)
    Yes.
  • potatohead Posts: 10,253
    edited 2017-03-09 02:09
    >When someone says they want pipelined ... and like the CORDIC ... I interpret that as said.

    Yes. Buffered would be a better term, if we even do that. I'm realizing that makes no sense, unless the numbers themselves are kept to meet request demands. Seems overkill to me.

    The minimum would be COG exclusive. Only allow one request at a time.

    I'm having trouble thinking of a use case for all COGS to get random numbers at full clip.

  • jmg Posts: 15,140
    potatohead wrote: »
    I'm having trouble thinking of a use case for all COGS to get random numbers at full clip.
    Me too, but allowing for that seems to incur a chip-area cost that could well mean less RAM.

  • jmg wrote:
    I'm not sure the PRNG needs such a high bandwidth availability ?
    I shouldn't think so, either. Who needs a new random number on every clock cycle?

    -Phil
  • evanh Posts: 15,091
    It's the no-latency that's nice. And in case no one noticed my earlier statements - It is very compact the way it is now!
  • evanh Posts: 15,091
    potatohead wrote: »
    The minimum would be COG exclusive. Only allow one request at a time.

    That gains us nothing in terms of higher quality over the existing implementation and loses in determinism and responsiveness.
  • jmg wrote:
    I'm not sure the PRNG needs such a high bandwidth availability ?
    I shouldn't think so, either. Who needs a new random number on every clock cycle?

    -Phil

    True. At the very least, no need to update any more frequently than every 2 clocks.
  • evanh Posts: 15,091
    The only saving I feel that could be made is a power saving - By idling the generator most of the time and only cycling it after the current result has been read. Although, this would also make it more predictable due to elimination of instruction timing variations.
  • evanh Posts: 15,091
    Seairth wrote: »
    True. At the very least, no need to update any more frequently than every 2 clocks.
    That works too. A minor power saving and the Cogs will be none the wiser.
  • Frankly, if the hashed results are not quality randomness, then a single cog will be used by those who need that.

    OK fine by me, just so long as at least one COG is not hashed so we get the benefit of a good source.

  • evanh Posts: 15,091
    potatohead wrote: »
    OK fine by me, just so long as at least one COG is not hashed so we get the benefit of a good source.
    We've already done multiple tests of subsets of the 63 bits, admittedly contiguous bit groups, and found no degradation of quality. I did a few variations just yesterday before deciding that a 130-bit accumulator was likely to work.
  • If, one COG pulling values is nice and random, great! If that is true for any COG, no matter which one, awesome!

    Is that true? If so, I missed it.

    Beyond that, I guess I don't personally care. If it's less random when a lot of COGS get values, fine. People can pick speed vs quality.

    Maybe that's not even the case. :D





  • evanh Posts: 15,091
    If they're doing independent things, yes it's high quality in all cases. If multiple Cogs are comparing their results then likely not so ideal. This can be tested ...
  • cgracey Posts: 14,131
    edited 2017-03-09 09:02
    The current xoroshiro128+ implementation is as small as can be.

    Every cog gets a uniquely-picked/ordered/inverted set of 32 bits from the 63-bit source. Every smart pin gets 8 such bits. Smart pins need these for DAC dithering and noise generation. Cogs need them for the GETRND instruction. I think cogs and pins will all be happy with this arrangement. These patterns really only need to serve as apparently-uncoordinated white noise sources.
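The per-cog scramble Chip describes could be modeled in software roughly as follows. This is a hypothetical sketch only: the actual pick/order/invert tables live in the Verilog and are not reproduced in this thread, so `pick` and `flip` below are illustrative placeholders, not the real wiring.

```c
#include <stdint.h>

/* Hypothetical model of the per-cog output scramble: 32 bits picked
   from the 63 usable PRNG bits in a cog-specific order, with some
   bits inverted.  pick[j] names the source bit wired to output bit j;
   flip is the cog's inversion mask. */
uint32_t cog_view(uint64_t raw63, const uint8_t pick[32], uint32_t flip) {
    uint32_t out = 0;
    for (int j = 0; j < 32; j++)
        out |= (uint32_t)((raw63 >> pick[j]) & 1u) << j; /* pick + reorder */
    return out ^ flip; /* cog-specific inversions */
}
```

In hardware this is pure wiring plus inverters, which is why it costs routing rather than gates; the smart pins would get 8-bit views built the same way.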

    On start-up, the PRNG will be 1/4 seeded maybe 64 times with 4 bits of pure thermal noise, along with another 27 bits of process/voltage/temperature-dependent bits.

    If you think about those 128 PRNG bits being totally randomized on each start-up, and then the PRNG iterating at 160MHz thereafter for, let's say, a whole year before a reset occurs, that's 2^52 iterations, which only spans 1/(2^76) of the gamut of the PRNG. That's like landing somewhere along the Earth's equator, and then heading westward at a pace of 160M steps per second, only to travel 17 picometers over the next year.

    The chance of any chip ever experiencing the same PRNG state in its life is negligible. Even if billions of chips were made and ran concurrently for a millennium, chances are near zero that any of them would ever experience any of the same PRNG states during that time.
  • evanh Posts: 15,091
    edited 2017-03-09 09:11
    Heh, I've just been testing overlapped results and noticed that multiples of eight bits register a denser fail rate than non-multiples of eight bits. I'm guessing that's more an artefact of the tester than any real difference in randomness. They all fail the very first block so it's just splitting hairs anyway.

    Yeah, so, the cross Cog multi-tapping does have contamination when compared to each other, as expected, ... if that matters. For most situations it doesn't matter at all.
  • Heater. Posts: 21,230
    edited 2017-03-09 09:22
    There are three points of contention here:

    1) If you pull 32 bits out of the 63 "good" bits available. Are those 32 bits as random as we like to think? Gut says that any permutation of 63 random bits should also be equally random. Is that really true given that they are not actually random bits but a PRNG?

    2) There is a worry over the possible correlation between the numbers fetched by each COG.
    Obviously there is such correlation: all COGS are fetching from the same sequence. The shuffling helps hide that a little. More significantly, the PRNG is running at full speed all the time, so COGs are going to be skipping much of its output.

    Actually un-correlating those 16 PRNGs requires 16 instances of the state variables. It requires they are seeded very carefully (see the jump() function posted earlier). This is not going to happen. Anything else is a bodge but I think we have to live with it.

    3) That pesky LSB is said to be less "random". I'm wondering if that is even a worry. The statement in the xoroshiro128+ source code points out that using all 64 bits of output this PRNG easily gets through the Diehard and BigCrush tests. It only fails PractRand on the binary rank tests. I just tried that for myself and it is so. Ignore the binary rank tests and PractRand works fine.

    So, even with that dodgy LSB xoroshiro128+ is an incredibly good PRNG.

    We can go with Chip's solution. Unless someone comes up with a test that shows a weakness.

    Perhaps Parallax could offer a challenge to create a P2 program that demonstrates any correlation between the PRNG sequence seen by two or more COGS. A thousand dollar reward perhaps.
  • evanh Posts: 15,091
    edited 2017-03-09 10:19
    Heater. wrote: »
    1) If you pull 32 bits out of the 63 "good" bits available. Are those 32 bits as random as we like to think?
    Thoroughly proven, yes.
    3)...
    So, even with that dodgy LSB xoroshiro128+ is an incredibly good PRNG.
    The result is currently only 63 bits. The summing lsb isn't included. This is why I went to the trouble yesterday to test out extending the accumulator size from 128 bits to 130 bits, so as to make the summed result a full 64 bits. Which was a success, btw. Chip can definitely make this change.

    ERR: Heater! You know all that ... I thought that was Doug I was answering.