Shop OBEX P1 Docs P2 Docs Learn Events
Random/LFSR on P2 - Page 65 — Parallax Forums

Random/LFSR on P2

1626365676892

Comments

  • Heater.Heater. Posts: 21,230
    evanh,

    I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.

    As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome to suggestions to improve that.
  • Heater.Heater. Posts: 21,230
    edited 2018-10-20 22:58
    TonyB_,
    If you posted your results here then I could see them, instead of Secure Connection Failed, blah, blah, blah.
    What, are you saying my link to the github repository at https://github.com/ZiCog/xoroshiro-plusplus does not work for you?

    That would be most odd.

    Anyway, cutting and pasting from the page at that link:

    The C++ output does this:
        $ ./xoroshiroPlusPlus
        First 16 outputs of Xoroshiro32PlusPlus:
        6269
        ae16
        12a2
        4ae8
        d719
        0c52
        984b
        1df1
        743c
        dba0
        bcc6
        34c9
        746c
        3643
        07ff
        bbc0
    
        First 16 outputs of Xoroshiro64PlusPlus:
        48020a01
        81662931
        cd2b5253
        d3e6cbe6
        cd5af43d
        860aa4ba
        b7bea7fb
        63dcaff3
        762d74c9
        3e7d7e8f
        e10e0616
        5788242d
        d8ece2a3
        7a242fab
        add23d97
        98ef01be
    

    The Verilog output does this:
    [info] [Progress] Start Xoroshiro64PlusPlus test_Xoroshiro64PlusPlus simulation with seed -296360197591574595, wave in             /mnt/c/Users/zicog/Documents/xoroshiro-plusplus/./simWorkspace/Xoroshiro64PlusPlus/test_Xoroshiro64PlusPlus.vcd
        [info] 48020a01
        [info] 81662931
        [info] cd2b5253
        [info] d3e6cbe6
        [info] cd5af43d
        [info] 860aa4ba
        [info] b7bea7fb
        [info] 63dcaff3
        [info] 762d74c9
        [info] 3e7d7e8f
        [info] e10e0616
        [info] 5788242d
        [info] d8ece2a3
        [info] 7a242fab
        [info] add23d97
        [info] 98ef01be
        [info] [Done] Simulation done in 38.373 ms
        [success] Total time: 6 s, completed Oct 20, 2018 10:01:41 AM
    
    Which I think matches the results we have discussed earlier in this thread.

    Is this what a P2 actually does today ?




  • Heater. wrote: »
    evanh,

    I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.

    As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome to suggestions to improve that.

    What is this HDL implementation to which you allude?
  • Heater.Heater. Posts: 21,230
    edited 2018-10-20 23:16
    TonyB_
    What is this HDL implementation to which you allude
    The one that I sort of announced today in this thread: http://forums.parallax.com/discussion/comment/1450145/#Comment_1450145

    Which is here: https://github.com/ZiCog/xoroshiro-plusplus

    Includes a dinky demo to run on a DE0 Nano board.

    In case you cannot reach it the Verilog looks like this:
    // Generator : SpinalHDL v1.1.5    git head : 0310b2489a097f2b9de5535e02192d9ddd2764ae
    // Date      : 20/10/2018, 10:15:16
    // Component : Xoroshiro64PlusPlus
    
    
    module Xoroshiro64PlusPlus (
          output [31:0] io_prng,
          input   io_next,
          input   clk,
          input   reset);
      wire [31:0] _zz_3;
      reg [31:0] s0;
      reg [31:0] s1;
      wire [31:0] s0_;
      wire [31:0] _zz_1;
      wire [31:0] s1_;
      wire [31:0] _zz_2;
      assign _zz_3 = ((s1 ^ s0) <<< 9);
      assign s0_ = (({s0[5 : 0],s0[31 : 6]} ^ (s1 ^ s0)) ^ _zz_3);
      assign _zz_1 = (s1 ^ s0);
      assign s1_ = {_zz_1[18 : 0],_zz_1[31 : 19]};
      assign _zz_2 = (s0_ + s1_);
      assign io_prng = ({_zz_2[14 : 0],_zz_2[31 : 15]} + s0_);
      always @ (posedge clk or posedge reset) begin
        if (reset) begin
          s0 <= (32'b00000000000000000000000000000001);
          s1 <= (32'b00000000000000000000000000000000);
        end else begin
          if(io_next)begin
            s0 <= s0_;
            s1 <= s1_;
          end
        end
      end
    
    endmodule
    
    The SpinalHDL source from which that is generated looks much nicer:
    class  Xoroshiro64PlusPlus extends Component {
      val io = new Bundle {
        val prng = out UInt(32 bits)
        val next = in Bool
      }
    
      // The PRNG state
      val s0 = Reg(UInt(32 bits)) init(1)
      val s1 = Reg(UInt(32 bits)) init(0)
    
      // The xoroshiro64++ magic numbers
      val a = 26
      val b = 9
      val c = 13
      val d = 17
    
      // Calculate the next PRNG state.
      val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
      val s1_ = (s1 ^ s0).rotateLeft(c)
    
      when(io.next) {
        // Update the PRNG state
        s0 := s0_
        s1 := s1_
      }
    
      // Deliver the "++" scrambled output
      io.prng := (s0_ + s1_).rotateLeft(d) + s0_
    }
    







  • TonyB_TonyB_ Posts: 2,190
    edited 2018-10-20 23:33
    Heater. wrote: »
    The SpinalHDL source from which that is generated looks much nicer:
    class  Xoroshiro64PlusPlus extends Component {
      val io = new Bundle {
        val prng = out UInt(32 bits)
        val next = in Bool
      }
    
      // The PRNG state
      val s0 = Reg(UInt(32 bits)) init(1)
      val s1 = Reg(UInt(32 bits)) init(0)
    
      // The xoroshiro64++ magic numbers
      val a = 26
      val b = 9
      val c = 13
      val d = 17
    
      // Calculate the next PRNG state.
      val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
      val s1_ = (s1 ^ s0).rotateLeft(c)
    
      when(io.next) {
        // Update the PRNG state
        s0 := s0_
        s1 := s1_
      }
    
      // Deliver the "++" scrambled output
      io.prng := (s0_ + s1_).rotateLeft(d) + s0_
    }
    

    This is the "PRN calculated at end of algorithm after new state generated" version. The output will match my software P2 asm version:
    http://forums.parallax.com/discussion/comment/1449224/#Comment_1449224

    You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware and double-iterate to match the XORO32 output.
  • evanhevanh Posts: 16,007
    edited 2018-10-20 23:39
    Heater. wrote: »
    As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome to suggestions to improve that.
    Yep, that's good. Standard synchronous design.

    However, in hardware, you have to be thinking a level lower as well. The fastest clock rate achievable is set by the longest cascading logic time, or the longest access time. This longest time gets labelled "critical path".

    The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're force adding the two durations together. Nothing any optimiser can do about it.

    As Chip was working his way through tightening each critical path to push above 160 MHz, producing new shorter criticals in the process, ... at one stage XORO32 became it. Which he then resolved by moving the scrambler to operate parallel to the engine.


    As for the C++ edition. Parallel thinking has the same impact, and for very similar reason. Both a compiler optimiser and processor instruction reordering can see a tight parallel opportunity with ease. EDIT: Allows multiple instructions per clock to flow in a superscalar CPU.

  • Heater.Heater. Posts: 21,230
    That is great and all but I have no idea what your P2 assembler code might output.

    I feel like I'm standing on quick sand here. There is no reference to check against.

  • TonyB_TonyB_ Posts: 2,190
    edited 2018-10-20 23:56
    First 32 outputs for xoroshiro64++ [26,9,13,17] when seed = 1 {s1 = 0, s0 = 1}
    PRN calculated at end of algorithm after new state generated
    48020A01
    81662931
    CD2B5253
    D3E6CBE6
    CD5AF43D
    860AA4BA
    B7BEA7FB
    63DCAFF3
    762D74C9
    3E7D7E8F
    E10E0616
    5788242D
    D8ECE2A3
    7A242FAB
    ADD23D97
    98EF01BE
    E9C6F835
    126167D3
    A7DCC109
    9FE42570
    F9D0CC5F
    50C44C1C
    EA63FC4D
    17489AD9
    50213E15
    C2B6785F
    B88837DA
    0D83D95A
    AADF08B1
    B682DDFD
    BD2B53A7
    FEDA90E8
    
  • evanhevanh Posts: 16,007
    edited 2018-10-21 00:09
    Chip posted the Propeller 2 verilog, for XORO32 instruction, a few pages back - https://forums.parallax.com/discussion/comment/1448460/#Comment_1448460

    Tony, you could add that to your information topic.

  • evanhevanh Posts: 16,007
    Xoroshiro32++ [13 5 10 9]
    
    0201
    6269
    ae16
    12a2
    4ae8
    d719
    0c52
    984b
    1df1
    743c
    dba0
    bcc6
    34c9
    746c
    3643
    07ff
    bbc0
    c642
    aa85
    594e
    
    XORO32
    
    62690201
    12a2ae16
    d7194ae8
    984b0c52
    743c1df1
    bcc6dba0
    746c34c9
    07ff3643
    c642bbc0
    594eaa85
    701ad05a
    f3aea328
    695a67ee
    93c6c140
    4964f5e1
    fc575e24
    a638d7ad
    181ae233
    7be766b5
    5cbd8445
    
  • Heater.Heater. Posts: 21,230
    Thank you TonyB_.

    That matches my xoroshiro64++ output.

  • evanh wrote: »
    Chip posted the Propeller 2 verilog, for XORO32 instruction, a few pages back - https://forums.parallax.com/discussion/comment/1448460/#Comment_1448460

    Tony, you could add that to your information topic.

    Good idea, Evan. I've edited an existing post to add a link:
    http://forums.parallax.com/discussion/comment/1448894/#Comment_1448894
  • Heater.Heater. Posts: 21,230
    edited 2018-10-21 12:27
    TonyB_ and evanh,

    Seems I missed a couple of you points last night.
    You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware
    OK, changed that for my xoroshiro32++.
    ..."critical path"... The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're force adding the two durations together.
    Quite so.

    If one was really pushed for time one could also pipeline the addition. Perhaps like this:
    class  Xoroshiro32PlusPlus extends Component {
      val io = new Bundle {
        val prng = out UInt(16 bits)
        val next = in Bool
      }
    
      // The PRNG state
      val s0 = Reg(UInt(16 bits)) init(1)
      val s1 = Reg(UInt(16 bits)) init(0)
      val p0 = Reg(UInt(16 bits)) init(0)
      val p1 = Reg(UInt(16 bits)) init(0)
    
      // The xoroshiro32++ magic numbers
      val a = 13
      val b = 5
      val c = 10
      val d = 9
    
      // Calculate the next PRNG state.
      val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
      val s1_ = (s1 ^ s0).rotateLeft(c)
    
      when(io.next) {
        // Update the PRNG state
        s0 := s0_
        s1 := s1_
        p0 := (s0 + s1).rotateLeft(d)
        p1 := s0
      }
    
      // Deliver the "++" scrambled output
      io.prng := p0 + p1
    }
    
    Or in Verilog:
    module Xoroshiro32PlusPlus (
          output [15:0] io_prng,
          input   io_next,
          input   clk,
          input   reset);
      wire [15:0] _zz_3;
      reg [15:0] s0;
      reg [15:0] s1;
      reg [15:0] p0;
      reg [15:0] p1;
      wire [15:0] s0_;
      wire [15:0] _zz_1;
      wire [15:0] s1_;
      wire [15:0] _zz_2;
      assign _zz_3 = ((s1 ^ s0) <<< 5);
      assign s0_ = (({s0[2 : 0],s0[15 : 3]} ^ (s1 ^ s0)) ^ _zz_3);
      assign _zz_1 = (s1 ^ s0);
      assign s1_ = {_zz_1[5 : 0],_zz_1[15 : 6]};
      assign _zz_2 = (s0 + s1);
      assign io_prng = (p0 + p1);
      always @ (posedge clk or posedge reset) begin
        if (reset) begin
          s0 <= (16'b0000000000000001);
          s1 <= (16'b0000000000000000);
          p0 <= (16'b0000000000000000);
          p1 <= (16'b0000000000000000);
        end else begin
          if(io_next)begin
            s0 <= s0_;
            s1 <= s1_;
            p0 <= {_zz_2[6 : 0],_zz_2[15 : 7]};
            p1 <= s0;
          end
        end
      end
    endmodule
    
    With the down side of using more LE's and introducing an invalid output at start up:
    [info] [Progress] Start Xoroshiro32PlusPlus test_Xoroshiro32PlusPlus simulation with seed 1966752197249320824, wave in /mnt/c/Users/michael/Documents/xoroshiro-plusplus/./simWorkspace/Xoroshiro32PlusPlus/test_Xoroshiro32PlusPlus.vcd
    [info] 0000
    [info] 2010
    [info] 6269
    [info] ae16
    [info] 12a2
    [info] 4ae8
    [info] d719
    [info] c520
    [info] 984b
    [info] 1df1
    [info] 743c
    [info] dba0
    [info] bcc6
    [info] 34c9
    [info] 746c
    [info] 3643
    [info] [Done] Simulation done in 10.724 ms
    
    xoroshiro32pluplus.png

    1245 x 636 - 58K
  • evanhevanh Posts: 16,007
    TonyB_ wrote: »
    TonyB_ wrote: »
    You could try xoroshiro64++ [26,9,13,17].

    The paper mentions the xoroshiro64 [26,9,13] engine and the +, * and ** scramblers. + is not good enough and ++ would be a much better alternative if there are no multipliers available. In an email to me Seba suggested 15 or 17 for the ++ rotation, i.e. half the output width +/- 1.

    I chose d = 17 as Seba told me that, in theory, an odd prime has a little more mojo than plain odd (although not in those exact words). Also our chosen xoroshiro32++ has d = w/2 +1.

    Nobody really knows whether d = 17 is the best choice or not, as xoroshiro64++ is far too big to test fully.

    I've done a little testing. 64+ untruncated netted me impressively poor scores of 128 KB for aperture 32>>0 and no more than 2 MB for all other 32 bit apertures. The moment bit0 was remove the score went to 2 GB, and with both bit0 and bit1 it went way up. I stopped it at 512 GB.

    I'm now running four 64++ candidates at aperture 32>>0: [26 9 13 17], [26 9 13 1], [26 9 13 2], [26 9 13 3]. So far, have cleared 2 TB at highest Practrand strength and all is well. Nothing more than a sprinkling of "unusual's".

    2 TB was about 12 hours. At that rate, some back of the envelope sums tells me that full period is 46000 years away.

  • TonyB_TonyB_ Posts: 2,190
    edited 2018-10-21 13:22
    Heater. wrote: »
    TonyB_ and evanh,

    Seems I missed a couple of you points last night.
    You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware
    OK, changed that for my xoroshiro32++.

    Remember you'll have to double-iterate to emulate the XORO32 instruction.
    Heater. wrote: »
    ..."critical path"... The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're force adding the two durations together.
    Quite so.

    If one was really pushed for time one could also pipeline the addition.

    The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined as two cascaded 64-bit adds in one clock is fantasy. If you would like to practise your skills some more you could try implementing this generator, which is fully explained in the paper and doesn't require multiplications as such.

    Aside.
    The generator engines and scramblers are separate and their constants could be separate, too, e.g.
    xoroshiro32++ [13,5,10,9] -> xoroshiro32[13,5,10]++[9]
    xoroshiro128** [24,16,37,5,7,9] -> xoroshiro128[24,16,37]**[5,7,9]

    First 16 outputs of xoroshiro128**[24,16,37,5,7,9] when seed = 1.
    PRN calculated at start of algorithm before new state generated.
    0000000000001680
    0000001696801680
    E682E68000001680
    800016F099C09692
    DB9CE96699C368C0
    ACD4A897E816F3BC
    736225BD8540F37D
    D5EF4BC8DB65564A
    E6844F92C4467F20
    ECBFDD2089C19276
    3E3C8F32277DC824
    6135641F1DADD2CA
    0695119B0DE23CFA
    8837DED221B7094D
    99F5FDC2C04A92D6
    57262A94B712228C
    

    Chip sent the low 16 bits to some pins temporarily to test the algorithm was working. It's impossible to verify in software.
  • evanhevanh Posts: 16,007
    Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.

  • TonyB_TonyB_ Posts: 2,190
    edited 2018-10-21 13:18
    I think Seba stops his tests after 32 TB.

    xoroshiro64 period is ~584 years at one iteration per nanosecond.
    xoroshiro64++ would last ~64000 years in P2 asm @ 200 MHz.
  • Heater.Heater. Posts: 21,230
    edited 2018-10-21 20:30
    TonyB_,
    Remember you'll have to double-iterate to emulate the XORO32 instruction.
    I'm leaving that as an exercise for later.
    The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined...
    Just when I thought I was getting things straight I find I'm confused again. I did didn't know the P2 used xoroshiro128**. Does it use xoroshiro64++? Do I have my wires crossed?
    If you would like to practise your skills some more...
    Oh yeah. Here is my xoroshiro128** in Spinal, no pipelining or such yet but no multiply used:
    class  Xoroshiro128StarStar extends Component {
      val io = new Bundle {
        val prng = out UInt(64 bits)
        val next = in Bool
      }
    
      // The PRNG state
      val s0 = Reg(UInt(64 bits)) init(1)
      val s1 = Reg(UInt(64 bits)) init(0)
    
      // The xoroshiro128** magic numbers
      def a = 24
      def b = 16
      def c = 37
      def s = 5
      def r = 71
      def t = 9
    
      // Calculate the next PRNG state.
      val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
      val s1_ = (s1 ^ s0).rotateLeft(c)
    
      when(io.next) {
        // Update the PRNG state
        s0 := s0_
        s1 := s1_
      }
    
      // Deliver the "**" scrambled output: "rotl(s0 * S, R) * T"
      val temp = ((s0 |<< 2) + s0).rotateLeft(r)
      io.prng := (temp |<< 3) + temp
    }
    
    Which produces the following iky looking Verilog:
    module Xoroshiro128StarStar (
          output [63:0] io_prng,
          input   io_next,
          input   clk,
          input   reset);
      wire [63:0] _zz_3;
      wire [63:0] _zz_4;
      wire [63:0] _zz_5;
      reg [63:0] s0;
      reg [63:0] s1;
      wire [63:0] s0_;
      wire [63:0] _zz_1;
      wire [63:0] s1_;
      wire [63:0] _zz_2;
      wire [63:0] temp;
      assign _zz_3 = ((s1 ^ s0) <<< 16);
      assign _zz_4 = (s0 <<< 2);
      assign _zz_5 = (temp <<< 3);
      assign s0_ = (({s0[39 : 0],s0[63 : 40]} ^ (s1 ^ s0)) ^ _zz_3);
      assign _zz_1 = (s1 ^ s0);
      assign s1_ = {_zz_1[26 : 0],_zz_1[63 : 27]};
      assign _zz_2 = (_zz_4 + s0);
      assign temp = {_zz_2[56 : 0],_zz_2[63 : 57]};
      assign io_prng = (_zz_5 + temp);
      always @ (posedge clk or posedge reset) begin
        if (reset) begin
          s0 <= (64'b0000000000000000000000000000000000000000000000000000000000000001);
          s1 <= (64'b0000000000000000000000000000000000000000000000000000000000000000);
        end else begin
          if(io_next)begin
            s0 <= s0_;
            s1 <= s1_;
          end
        end
      end
    
    endmodule
    
    Which produces the results you showed above.
    Chip sent the low 16 bits to some pins temporarily to test the algorithm was working. It's impossible to verify in software.
    How so? If you can write the Verilog you can convert it to C++ and run it with the Verilator. Which is what the Spinal system does for it's test benches.


  • TonyB_TonyB_ Posts: 2,190
    edited 2018-10-21 21:09
    Heater. wrote: »
    Just when I thought I was getting things straight I find I'm confused again. I didn't know the P2 used xoroshiro128**. Does it use xoroshiro64++?

    No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.

    There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them, but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
    Heater. wrote: »
    Here is my xoroshiro128** in Spinal, no pipelining or such yet but no multiply used:
    ...
      def r = 71
    ...
    

    r = 71 or r = 7, whatever! :D
  • Heater.Heater. Posts: 21,230
    edited 2018-10-22 06:11
    TonyB_,
    No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
    No worries. Thanks, it's all good stuff. I quite like xoroshiro64++ as a middle ground. xoroshiro128** eats LE's. Like 1% of a DE0 Nano.
    There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them,
    Ah OK. We won't go there then.
    ...but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
    I see what you mean.
    r = 71 or r = 7, whatever!
    WTF, where did that "1" come from? I even cut and pasted those numbers from your post.

    Eerily it turns out not to matter. The generated Verilog is the same with "def r = 7" or "def r = 71".

    Presumably because (7 % 64) = (71 % 64) = 7.

    What a flukey typo!

  • evanhevanh Posts: 16,007
    evanh wrote: »
    Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.

    They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...

  • evanh wrote: »
    evanh wrote: »
    Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.

    They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...

    Try not to switch off the power this time! :)
  • evanhevanh Posts: 16,007
    I have to pull the plug out now that the switch is welded on.

  • evanhevanh Posts: 16,007
    edited 2018-10-24 16:07
    Huh, chalked up the 16 TB mark 8 hours early. I guess there was a little too much CPU demand before. Here's progress reports. You'll note [26 9 13 17] currently has zero "unusual"s. I'm suitably impressed.

    When I killed off the 4th case to free up the RAM, half an hour late the automatics script, that was still active, started up a fresh case, [26 9 13 4], and it rolled out a VERY SUSPICIOUS on a BCFN test at 8 MB.
  • ErNaErNa Posts: 1,752
    cgracey wrote: »
    Heater. wrote: »
    I just wish Americans would put the "l" back into "solder" :)

    I wish we could put the Pb back in.

    Maybe just inform somebody that removing Pb was ordered by his predecessor.
  • evanhevanh Posts: 16,007
    evanh wrote: »
    I have to pull the plug out now that the switch is welded on.

    I guess it never was a power switch problem. I just did a shutdown in a moment of distraction. :(

    I'm away a few days now anyway so maybe it's for the better.

  • xoroshironotxoroshironot Posts: 309
    edited 2019-03-31 23:23
    The below slightly modified code (addition of bit inversion '~' and B shift direction change, requiring different constants) has slightly better escape from zero behavior and PractRand results, as compared to:
    Xoroshiro32pp 13,5,10,9
    Not sure about 'Pair frequency', if someone had time to check?
    There are many variations on this theme... too many to test.
    // Xoroshiro32ppg (variation of XoroshiroPlusPlus from Vigna/Blackman), by CRutz, free for all uses
    
    #include <stdint.h>
    
    #define  ACCUM_SIZE  16
    #define  CONSTANT_A  5
    #define  CONSTANT_B  1 // 1 is required for inverted Gray Code modification below
    #define  CONSTANT_C  14
    #define  CONSTANT_D  9 // 9 seems optimal when the underlying x/y distribution is optimal and correlation is minimized
    
    typedef  uint16_t  rngword_t;
    
    static inline rngword_t  rotl( rngword_t value, int shift )  {
    	return (value << shift) | (value >> (ACCUM_SIZE - shift));
    }
    
    // Seed the state array.  Seed MUST not be all zero.
    static rngword_t  s[2] = {1, 0};
    
    static rngword_t  nextxoro( void )
    {
    	rngword_t  s0 = s[0];
    	rngword_t  s1 = s[1];
    	rngword_t  result = s0 + s1;  // output hash (scrambler) for Xoroshiro+
    //New addition from authors:  Rotation and second summing
    	result = rotl( result, CONSTANT_D ) + s0;
    
    	s1 ^= s0;
    	s[0] = rotl( s0, CONSTANT_A ) ^ ~s1 ^ (s1 >> CONSTANT_B); // Note '~s1 ^ (s1 >> 1)' forms an inverted Gray Code
    	s[1] = rotl( s1, CONSTANT_C );
    
    	return( result );
    }
    
    Additional information here.
  • evanhevanh Posts: 16,007
    Very neat. A third summing circuit but in parallel to the other two. It looks like that'd fit speed wise. And I think you are saying that CONTANT_B should always be 1. Which brings the searching down to three parameters again.
  • xoroshironotxoroshironot Posts: 309
    edited 2019-03-31 23:55
    Same summing (sorry, I left the original comments in place, which will be confusing), basically just added a single negation and reversed the bit shift direction.
    CONSTANT_B does not actually need to be 1... I was researching Gray Codes (which require 1) at the time I tested this and it worked so well inverted that I left B alone.
    Feel free to try other variations... the '~s1' is perhaps the most important change. Then find all ABC for both '>>' and '<<'. Permute D on those.
    Like I said... too many variations. I simply guessed that the changes I made would be best, and they do seem really good on initial inspection.
    You guys probably have all the scripts for proving me right/wrong on an optimal choice of changes to make.
    If this proved out, I guess Seba would have the know-how to calculate constants for larger state sizes (i.e. 64/128/256).
  • It's still alive! The thread, I mean.

    Where's the third sum? B = 1 implies less scrambling than before, but bit inversion could make up for that.
Sign In or Register to comment.