I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome suggestions to improve that.
What is this HDL implementation to which you allude?
The SpinalHDL source from which that is generated looks much nicer:
class Xoroshiro64PlusPlus extends Component {
  val io = new Bundle {
    val prng = out UInt(32 bits)
    val next = in Bool
  }
  // The PRNG state
  val s0 = Reg(UInt(32 bits)) init(1)
  val s1 = Reg(UInt(32 bits)) init(0)
  // The xoroshiro64++ magic numbers
  val a = 26
  val b = 9
  val c = 13
  val d = 17
  // Calculate the next PRNG state.
  val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
  val s1_ = (s1 ^ s0).rotateLeft(c)
  when(io.next) {
    // Update the PRNG state
    s0 := s0_
    s1 := s1_
  }
  // Deliver the "++" scrambled output
  io.prng := (s0_ + s1_).rotateLeft(d) + s0_
}
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware and double-iterate to match the XORO32 output.
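For what it's worth, here is a minimal C sketch of how I read that: the "++" scrambler applied to the current state before the engine advances, iterated twice per call. The function names and the packing of the two 16-bit results into one 32-bit word (low half first) are my assumptions, not taken from the P2 documentation, so check against the real XORO32 output.
#include <stdint.h>

static uint16_t s0 = 1, s1 = 0;              /* seed must not be all zero */

static inline uint16_t rotl16(uint16_t x, int k)
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* One xoroshiro32++ [13,5,10,9] step, PRN taken from the state BEFORE it advances */
static uint16_t step(void)
{
    uint16_t result = (uint16_t)(rotl16((uint16_t)(s0 + s1), 9) + s0);
    uint16_t t = (uint16_t)(s1 ^ s0);
    s0 = (uint16_t)(rotl16(s0, 13) ^ t ^ (uint16_t)(t << 5));   /* engine [13,5,10] */
    s1 = rotl16(t, 10);
    return result;
}

/* Double-iterate to approximate XORO32; the packing order is a guess */
static uint32_t xoro32_like(void)
{
    uint16_t lo = step();
    uint16_t hi = step();
    return ((uint32_t)hi << 16) | lo;
}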
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome suggestions to improve that.
Yep, that's good. Standard synchronous design.
However, in hardware, you have to be thinking a level lower as well. The fastest clock rate achievable is set by the longest cascading logic time, or the longest access time. This longest time gets labelled "critical path".
The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're forced to add the two durations together. Nothing any optimiser can do about it.
As Chip was working his way through tightening each critical path to push above 160 MHz, producing new shorter criticals in the process, at one stage XORO32 became it. He then resolved that by moving the scrambler to operate in parallel with the engine.
As for the C++ edition: parallel thinking has the same impact, and for a very similar reason. Both a compiler optimiser and processor instruction reordering can see a tight parallel opportunity with ease. EDIT: Allows multiple instructions per clock to flow in a superscalar CPU.
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware
OK, changed that for my xoroshiro32++.
..."critical path"... The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're forced to add the two durations together.
Quite so.
If one was really pushed for time one could also pipeline the addition. Perhaps like this:
class Xoroshiro32PlusPlus extends Component {
  val io = new Bundle {
    val prng = out UInt(16 bits)
    val next = in Bool
  }
  // The PRNG state
  val s0 = Reg(UInt(16 bits)) init(1)
  val s1 = Reg(UInt(16 bits)) init(0)
  val p0 = Reg(UInt(16 bits)) init(0)
  val p1 = Reg(UInt(16 bits)) init(0)
  // The xoroshiro32++ magic numbers
  val a = 13
  val b = 5
  val c = 10
  val d = 9
  // Calculate the next PRNG state.
  val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
  val s1_ = (s1 ^ s0).rotateLeft(c)
  when(io.next) {
    // Update the PRNG state
    s0 := s0_
    s1 := s1_
    p0 := (s0 + s1).rotateLeft(d)
    p1 := s0
  }
  // Deliver the "++" scrambled output
  io.prng := p0 + p1
}
The paper mentions the xoroshiro64 [26,9,13] engine and the +, * and ** scramblers. + is not good enough and ++ would be a much better alternative if there are no multipliers available. In an email to me Seba suggested 15 or 17 for the ++ rotation, i.e. half the output width +/- 1.
I chose d = 17 as Seba told me that, in theory, an odd prime has a little more mojo than plain odd (although not in those exact words). Also our chosen xoroshiro32++ has d = w/2 +1.
Nobody really knows whether d = 17 is the best choice or not, as xoroshiro64++ is far too big to test fully.
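For anyone wanting to play along in software, here's a minimal C sketch of xoroshiro64++ [26,9,13,17] as described above, i.e. the xoroshiro64 engine (two 32-bit words) with the "++" scrambler and d = 17. I've used the "output taken from the current state before it is advanced" ordering, which is the opposite of the Spinal code earlier, so expect the stream to be offset by one step relative to it:
#include <stdint.h>

static uint32_t s0 = 1, s1 = 0;   /* seed must not be all zero */

static inline uint32_t rotl32(uint32_t x, int k)
{
    return (x << k) | (x >> (32 - k));
}

static uint32_t xoroshiro64pp(void)
{
    uint32_t result = rotl32(s0 + s1, 17) + s0;   /* "++" scrambler, d = 17 */
    uint32_t t = s1 ^ s0;
    s0 = rotl32(s0, 26) ^ t ^ (t << 9);           /* engine [26,9,13] */
    s1 = rotl32(t, 13);
    return result;
}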
I've done a little testing. 64+ untruncated netted me impressively poor scores of 128 KB for aperture 32>>0 and no more than 2 MB for all other 32-bit apertures. The moment bit0 was removed the score went to 2 GB, and with both bit0 and bit1 removed it went way up. I stopped it at 512 GB.
I'm now running four 64++ candidates at aperture 32>>0: [26 9 13 17], [26 9 13 1], [26 9 13 2], [26 9 13 3]. So far they have cleared 2 TB at the highest PractRand strength and all is well. Nothing more than a sprinkling of "unusual"s.
2 TB was about 12 hours. At that rate, some back-of-the-envelope sums tell me that full period is 46000 years away.
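As a rough cross-check on that figure: the full period is 2^64 - 1 outputs of 4 bytes each, about 7.4 x 10^19 bytes. At 2 TB per 12 hours, call it 5 x 10^7 bytes per second, that works out to around 1.5 x 10^12 seconds, which is indeed in the region of 46000 years.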
..."critical path"... The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're forced to add the two durations together.
Quite so.
If one was really pushed for time one could also pipeline the addition.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined, as two cascaded 64-bit adds in one clock is fantasy. If you would like to practise your skills some more you could try implementing this generator, which is fully explained in the paper and doesn't require multiplications as such.
Aside.
The generator engines and scramblers are separate and their constants could be separate, too, e.g.
xoroshiro32++ [13,5,10,9] -> xoroshiro32[13,5,10]++[9]
xoroshiro128** [24,16,37,5,7,9] -> xoroshiro128[24,16,37]**[5,7,9]
First 16 outputs of xoroshiro128**[24,16,37,5,7,9] when seed = 1.
PRN calculated at start of algorithm before new state generated.
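In case anyone wants to reproduce those in software, here's a minimal C model, assuming "seed = 1" means s0 = 1 and s1 = 0, with the PRN taken from the current state before it advances as stated above. The *5 and *9 are written as shift-and-add to mirror a multiplier-free hardware version; the function name is mine:
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

static uint64_t s0 = 1, s1 = 0;

static inline uint64_t rotl64(uint64_t x, int k)
{
    return (x << k) | (x >> (64 - k));
}

static uint64_t xoroshiro128ss(void)
{
    uint64_t x5 = (s0 << 2) + s0;             /* s0 * 5 without a multiplier */
    uint64_t r  = rotl64(x5, 7);
    uint64_t result = (r << 3) + r;           /* * 9 without a multiplier */
    uint64_t t = s1 ^ s0;
    s0 = rotl64(s0, 24) ^ t ^ (t << 16);      /* engine [24,16,37] */
    s1 = rotl64(t, 37);
    return result;
}

int main(void)
{
    for (int i = 0; i < 16; i++)
        printf("%016" PRIX64 "\n", xoroshiro128ss());
    return 0;
}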
Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.
Remember you'll have to double-iterate to emulate the XORO32 instruction.
I'm leaving that as an exercise for later.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined...
Just when I thought I was getting things straight I find I'm confused again. I didn't know the P2 used xoroshiro128**. Does it use xoroshiro64++? Do I have my wires crossed?
If you would like to practise your skills some more...
Oh yeah. Here is my xoroshiro128** in Spinal, no pipelining or such yet but no multiply used:
class Xoroshiro128StarStar extends Component {
  val io = new Bundle {
    val prng = out UInt(64 bits)
    val next = in Bool
  }
  // The PRNG state
  val s0 = Reg(UInt(64 bits)) init(1)
  val s1 = Reg(UInt(64 bits)) init(0)
  // The xoroshiro128** magic numbers
  def a = 24
  def b = 16
  def c = 37
  def s = 5
  def r = 71
  def t = 9
  // Calculate the next PRNG state.
  val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
  val s1_ = (s1 ^ s0).rotateLeft(c)
  when(io.next) {
    // Update the PRNG state
    s0 := s0_
    s1 := s1_
  }
  // Deliver the "**" scrambled output: "rotl(s0 * S, R) * T"
  val temp = ((s0 |<< 2) + s0).rotateLeft(r)
  io.prng := (temp |<< 3) + temp
}
Chip sent the low 16 bits to some pins temporarily to test the algorithm was working. It's impossible to verify in software.
How so? If you can write the Verilog you can convert it to C++ and run it with Verilator. Which is what the Spinal system does for its test benches.
Just when I thought I was getting things straight I find I'm confused again. I didn't know the P2 used xoroshiro128**. Does it use xoroshiro64++?
No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them, but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
No worries. Thanks, it's all good stuff. I quite like xoroshiro64++ as a middle ground. xoroshiro128** eats LEs, like 1% of a DE0 Nano.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them,
Ah OK. We won't go there then.
...but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
I see what you mean.
r = 71 or r = 7, whatever!
WTF, where did that "1" come from? I even cut and pasted those numbers from your post.
Eerily it turns out not to matter. The generated Verilog is the same with "def r = 7" or "def r = 71".
Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.
They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...
Huh, chalked up the 16 TB mark 8 hours early. I guess there was a little too much CPU demand before. Here are the progress reports. You'll note [26 9 13 17] currently has zero "unusual"s. I'm suitably impressed.
When I killed off the 4th case to free up the RAM, half an hour later the automatic script, which was still active, started up a fresh case, [26 9 13 4], and it rolled out a VERY SUSPICIOUS on a BCFN test at 8 MB.
The slightly modified code below (addition of bit inversion '~' and a change of B shift direction, requiring different constants) has slightly better escape-from-zero behavior and PractRand results, as compared to: Xoroshiro32pp 13,5,10,9
Not sure about 'Pair frequency', if someone had time to check?
There are many variations on this theme... too many to test.
// Xoroshiro32ppg (variation of XoroshiroPlusPlus from Vigna/Blackman), by CRutz, free for all uses
#include <stdint.h>
#define ACCUM_SIZE 16
#define CONSTANT_A 5
#define CONSTANT_B 1   // 1 is required for inverted Gray Code modification below
#define CONSTANT_C 14
#define CONSTANT_D 9   // 9 seems optimal when the underlying x/y distribution is optimal and correlation is minimized
typedef uint16_t rngword_t;
static inline rngword_t rotl( rngword_t value, int shift ) {
    return (value << shift) | (value >> (ACCUM_SIZE - shift));
}
// Seed the state array. Seed MUST not be all zero.
static rngword_t s[2] = {1, 0};
static rngword_t nextxoro( void )
{
    rngword_t s0 = s[0];
    rngword_t s1 = s[1];
    rngword_t result = s0 + s1;   // output hash (scrambler) for Xoroshiro+
    // New addition from authors: Rotation and second summing
    result = rotl( result, CONSTANT_D ) + s0;
    s1 ^= s0;
    s[0] = rotl( s0, CONSTANT_A ) ^ ~s1 ^ (s1 >> CONSTANT_B);   // Note '~s1 ^ (s1 >> 1)' forms an inverted Gray Code
    s[1] = rotl( s1, CONSTANT_C );
    return( result );
}
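A tiny driver like this (my addition, to be appended to the code above) prints the first few outputs in hex for a quick eyeball check:
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 8; i++)
        printf("%04X\n", (unsigned)nextxoro());
    return 0;
}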
Very neat. A third summing circuit but in parallel to the other two. It looks like that'd fit speed-wise. And I think you are saying that CONSTANT_B should always be 1. Which brings the searching down to three parameters again.
Same summing (sorry, I left the original comments in place, which will be confusing), basically just added a single negation and reversed the bit shift direction.
CONSTANT_B does not actually need to be 1... I was researching Gray Codes (which require 1) at the time I tested this and it worked so well inverted that I left B alone.
Feel free to try other variations... the '~s1' is perhaps the most important change. Then find all ABC for both '>>' and '<<'. Permute D on those.
Like I said... too many variations. I simply guessed that the changes I made would be best, and they do seem really good on initial inspection.
You guys probably have all the scripts for proving me right/wrong on an optimal choice of changes to make.
If this proved out, I guess Seba would have the know-how to calculate constants for larger state sizes (i.e. 64/128/256).
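To make the "too many to test" point concrete, here's a rough C sketch of the kind of sweep used for the plain engine: run each [a,b,c] candidate from a fixed seed and see whether it returns to that seed after exactly 2^32 - 1 steps, i.e. whether the engine is full period. This covers only the standard '<<' engine; the '~' inverted variant is no longer the plain linear map, so it would need its own check, and a full sweep takes hours even before any PractRand scoring:
#include <stdint.h>
#include <stdio.h>

static inline uint16_t rotl16(uint16_t x, int k)
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* Does engine [a,b,c] visit all 2^32 - 1 nonzero states before returning to the seed? */
static int full_period(int a, int b, int c)
{
    uint16_t s0 = 1, s1 = 0;
    uint64_t n = 0;
    do {
        uint16_t t = (uint16_t)(s1 ^ s0);
        s0 = (uint16_t)(rotl16(s0, a) ^ t ^ (uint16_t)(t << b));
        s1 = rotl16(t, c);
        n++;
    } while (!(s0 == 1 && s1 == 0));
    return n == 0xFFFFFFFFULL;
}

int main(void)
{
    for (int a = 1; a < 16; a++)
        for (int b = 1; b < 16; b++)
            for (int c = 1; c < 16; c++)
                if (full_period(a, b, c))
                    printf("[%d,%d,%d] is full period\n", a, b, c);
    return 0;
}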
I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome suggestions to improve that.
That would be most odd.
Anyway, cutting and pasting from the page at that link:
The C++ output does this:
The Verilog output does this: Which I think matches the results we have discussed earlier in this thread.
Is this what a P2 actually does today?
What is this HDL implementation to which you allude?
Which is here: https://github.com/ZiCog/xoroshiro-plusplus
Includes a dinky demo to run on a DE0 Nano board.
In case you cannot reach it the Verilog looks like this:
The SpinalHDL source from which that is generated looks much nicer:
This is the "PRN calculated at end of algorithm after new state generated" version. The output will match my software P2 asm version:
http://forums.parallax.com/discussion/comment/1449224/#Comment_1449224
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware and double-iterate to match the XORO32 output.
However, in hardware, you have to be thinking a level lower as well. The fastest clock rate achievable is set by the longest cascading logic time, or the longest access time. This longest time gets labelled "critical path".
The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're forced to add the two durations together. Nothing any optimiser can do about it.
As Chip was working his way through tightening each critical path to push above 160 MHz, producing new shorter criticals in the process, at one stage XORO32 became it. He then resolved that by moving the scrambler to operate in parallel with the engine.
As for the C++ edition: parallel thinking has the same impact, and for a very similar reason. Both a compiler optimiser and processor instruction reordering can see a tight parallel opportunity with ease. EDIT: Allows multiple instructions per clock to flow in a superscalar CPU.
I feel like I'm standing on quicksand here. There is no reference to check against.
PRN calculated at end of algorithm after new state generated
Tony, you could add that to your information topic.
That matches my xoroshiro64++ output.
Good idea, Evan. I've edited an existing post to add a link:
http://forums.parallax.com/discussion/comment/1448894/#Comment_1448894
Seems I missed a couple of your points last night. OK, changed that for my xoroshiro32++. Quite so.
If one was really pushed for time one could also pipeline the addition. Perhaps like this: Or in Verilog: With the downside of using more LEs and introducing an invalid output at start-up:
I've done a little testing. 64+ untruncated netted me impressively poor scores of 128 KB for aperture 32>>0 and no more than 2 MB for all other 32-bit apertures. The moment bit0 was removed the score went to 2 GB, and with both bit0 and bit1 removed it went way up. I stopped it at 512 GB.
I'm now running four 64++ candidates at aperture 32>>0: [26 9 13 17], [26 9 13 1], [26 9 13 2], [26 9 13 3]. So far they have cleared 2 TB at the highest PractRand strength and all is well. Nothing more than a sprinkling of "unusual"s.
2 TB was about 12 hours. At that rate, some back-of-the-envelope sums tell me that full period is 46000 years away.
Remember you'll have to double-iterate to emulate the XORO32 instruction.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined, as two cascaded 64-bit adds in one clock is fantasy. If you would like to practise your skills some more you could try implementing this generator, which is fully explained in the paper and doesn't require multiplications as such.
Aside.
The generator engines and scramblers are separate and their constants could be separate, too, e.g.
xoroshiro32++ [13,5,10,9] -> xoroshiro32[13,5,10]++[9]
xoroshiro128** [24,16,37,5,7,9] -> xoroshiro128[24,16,37]**[5,7,9]
First 16 outputs of xoroshiro128**[24,16,37,5,7,9] when seed = 1.
PRN calculated at start of algorithm before new state generated.
Chip sent the low 16 bits to some pins temporarily to test the algorithm was working. It's impossible to verify in software.
xoroshiro64 period is ~584 years at one iteration per nanosecond.
xoroshiro64++ would last ~64000 years in P2 asm @ 200 MHz.
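(Quick check on the first of those: 2^64 ns is about 1.8 x 10^10 seconds, which is a little over 584 years, so that figure holds up.)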
No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them, but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
r = 71 or r = 7, whatever!
Eerily it turns out not to matter. The generated Verilog is the same with "def r = 7" or "def r = 71".
Presumably because (7 % 64) = (71 % 64) = 7.
What a flukey typo!
They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...
Try not to switch off the power this time!
When I killed off the 4th case to free up the RAM, half an hour later the automatic script, which was still active, started up a fresh case, [26 9 13 4], and it rolled out a VERY SUSPICIOUS on a BCFN test at 8 MB.
Maybe just inform somebody that removing Pb was ordered by his predecessor.
I guess it never was a power switch problem. I just did a shutdown in a moment of distraction.
I'm away a few days now anyway so maybe it's for the better.
Xoroshiro32pp 13,5,10,9
Not sure about 'Pair frequency', if someone had time to check?
There are many variations on this theme... too many to test. Additional information here.
CONSTANT_B does not actually need to be 1... I was researching Gray Codes (which require 1) at the time I tested this and it worked so well inverted that I left B alone.
Feel free to try other variations... the '~s1' is perhaps the most important change. Then find all ABC for both '>>' and '<<'. Permute D on those.
Like I said... too many variations. I simply guessed that the changes I made would be best, and they do seem really good on initial inspection.
You guys probably have all the scripts for proving me right/wrong on an optimal choice of changes to make.
If this proved out, I guess Seba would have the know-how to calculate constants for larger state sizes (i.e. 64/128/256).
Where's the third sum? B = 1 implies less scrambling than before, but bit inversion could make up for that.