I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome to suggestions to improve that.
I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome to suggestions to improve that.
What is this HDL implementation to which you allude?
The SpinalHDL source from which that is generated looks much nicer:
classXoroshiro64PlusPlusextendsComponent{
val io = new Bundle {
val prng = out UInt(32 bits)
val next = in Bool
}
// The PRNG stateval s0 = Reg(UInt(32 bits)) init(1)
val s1 = Reg(UInt(32 bits)) init(0)
// The xoroshiro64++ magic numbersval a = 26val b = 9val c = 13val d = 17// Calculate the next PRNG state.val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
val s1_ = (s1 ^ s0).rotateLeft(c)
when(io.next) {
// Update the PRNG state
s0 := s0_
s1 := s1_
}
// Deliver the "++" scrambled output
io.prng := (s0_ + s1_).rotateLeft(d) + s0_
}
The SpinalHDL source from which that is generated looks much nicer:
classXoroshiro64PlusPlusextendsComponent{
val io = new Bundle {
val prng = out UInt(32 bits)
val next = in Bool
}
// The PRNG stateval s0 = Reg(UInt(32 bits)) init(1)
val s1 = Reg(UInt(32 bits)) init(0)
// The xoroshiro64++ magic numbersval a = 26val b = 9val c = 13val d = 17// Calculate the next PRNG state.val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
val s1_ = (s1 ^ s0).rotateLeft(c)
when(io.next) {
// Update the PRNG state
s0 := s0_
s1 := s1_
}
// Deliver the "++" scrambled output
io.prng := (s0_ + s1_).rotateLeft(d) + s0_
}
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware and double-iterate to match the XORO32 output.
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome to suggestions to improve that.
Yep, that's good. Standard synchronous design.
However, in hardware, you have to be thinking a level lower as well. The fastest clock rate achievable is set by the longest cascading logic time, or the longest access time. This longest time gets labelled "critical path".
The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're force adding the two durations together. Nothing any optimiser can do about it.
As Chip was working his way through tightening each critical path to push above 160 MHz, producing new shorter criticals in the process, ... at one stage XORO32 became it. Which he then resolved by moving the scrambler to operate parallel to the engine.
As for the C++ edition. Parallel thinking has the same impact, and for very similar reason. Both a compiler optimiser and processor instruction reordering can see a tight parallel opportunity with ease. EDIT: Allows multiple instructions per clock to flow in a superscalar CPU.
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware
OK, changed that for my xoroshiro32++.
..."critical path"... The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're force adding the two durations together.
Quite so.
If one was really pushed for time one could also pipeline the addition. Perhaps like this:
classXoroshiro32PlusPlusextendsComponent{
val io = new Bundle {
val prng = out UInt(16 bits)
val next = in Bool
}
// The PRNG stateval s0 = Reg(UInt(16 bits)) init(1)
val s1 = Reg(UInt(16 bits)) init(0)
val p0 = Reg(UInt(16 bits)) init(0)
val p1 = Reg(UInt(16 bits)) init(0)
// The xoroshiro32++ magic numbersval a = 13val b = 5val c = 10val d = 9// Calculate the next PRNG state.val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
val s1_ = (s1 ^ s0).rotateLeft(c)
when(io.next) {
// Update the PRNG state
s0 := s0_
s1 := s1_
p0 := (s0 + s1).rotateLeft(d)
p1 := s0
}
// Deliver the "++" scrambled output
io.prng := p0 + p1
}
The paper mentions the xoroshiro64 [26,9,13] engine and the +, * and ** scramblers. + is not good enough and ++ would be a much better alternative if there are no multipliers available. In an email to me Seba suggested 15 or 17 for the ++ rotation, i.e. half the output width +/- 1.
I chose d = 17 as Seba told me that, in theory, an odd prime has a little more mojo than plain odd (although not in those exact words). Also our chosen xoroshiro32++ has d = w/2 +1.
Nobody really knows whether d = 17 is the best choice or not, as xoroshiro64++ is far too big to test fully.
I've done a little testing. 64+ untruncated netted me impressively poor scores of 128 KB for aperture 32>>0 and no more than 2 MB for all other 32 bit apertures. The moment bit0 was remove the score went to 2 GB, and with both bit0 and bit1 it went way up. I stopped it at 512 GB.
I'm now running four 64++ candidates at aperture 32>>0: [26 9 13 17], [26 9 13 1], [26 9 13 2], [26 9 13 3]. So far, have cleared 2 TB at highest Practrand strength and all is well. Nothing more than a sprinkling of "unusual's".
2 TB was about 12 hours. At that rate, some back of the envelope sums tells me that full period is 46000 years away.
..."critical path"... The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're force adding the two durations together.
Quite so.
If one was really pushed for time one could also pipeline the addition.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined as two cascaded 64-bit adds in one clock is fantasy. If you would like to practise your skills some more you could try implementing this generator, which is fully explained in the paper and doesn't require multiplications as such.
Aside.
The generator engines and scramblers are separate and their constants could be separate, too, e.g.
xoroshiro32++ [13,5,10,9] -> xoroshiro32[13,5,10]++[9]
xoroshiro128** [24,16,37,5,7,9] -> xoroshiro128[24,16,37]**[5,7,9]
First 16 outputs of xoroshiro128**[24,16,37,5,7,9] when seed = 1.
PRN calculated at start of algorithm before new state generated.
Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.
Remember you'll have to double-iterate to emulate the XORO32 instruction.
I'm leaving that as an exercise for later.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined...
Just when I thought I was getting things straight I find I'm confused again. I did didn't know the P2 used xoroshiro128**. Does it use xoroshiro64++? Do I have my wires crossed?
If you would like to practise your skills some more...
Oh yeah. Here is my xoroshiro128** in Spinal, no pipelining or such yet but no multiply used:
classXoroshiro128StarStarextendsComponent{
val io = new Bundle {
val prng = out UInt(64 bits)
val next = in Bool
}
// The PRNG stateval s0 = Reg(UInt(64 bits)) init(1)
val s1 = Reg(UInt(64 bits)) init(0)
// The xoroshiro128** magic numbers
def a = 24
def b = 16
def c = 37
def s = 5
def r = 71
def t = 9// Calculate the next PRNG state.val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
val s1_ = (s1 ^ s0).rotateLeft(c)
when(io.next) {
// Update the PRNG state
s0 := s0_
s1 := s1_
}
// Deliver the "**" scrambled output: "rotl(s0 * S, R) * T"val temp = ((s0 |<< 2) + s0).rotateLeft(r)
io.prng := (temp |<< 3) + temp
}
Chip sent the low 16 bits to some pins temporarily to test the algorithm was working. It's impossible to verify in software.
How so? If you can write the Verilog you can convert it to C++ and run it with the Verilator. Which is what the Spinal system does for it's test benches.
Just when I thought I was getting things straight I find I'm confused again. I didn't know the P2 used xoroshiro128**. Does it use xoroshiro64++?
No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them, but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
No worries. Thanks, it's all good stuff. I quite like xoroshiro64++ as a middle ground. xoroshiro128** eats LE's. Like 1% of a DE0 Nano.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them,
Ah OK. We won't go there then.
...but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
I see what you mean.
r = 71 or r = 7, whatever!
WTF, where did that "1" come from? I even cut and pasted those numbers from your post.
Eerily it turns out not to matter. The generated Verilog is the same with "def r = 7" or "def r = 71".
Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.
They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...
Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.
They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...
Huh, chalked up the 16 TB mark 8 hours early. I guess there was a little too much CPU demand before. Here's progress reports. You'll note [26 9 13 17] currently has zero "unusual"s. I'm suitably impressed.
When I killed off the 4th case to free up the RAM, half an hour late the automatics script, that was still active, started up a fresh case, [26 9 13 4], and it rolled out a VERY SUSPICIOUS on a BCFN test at 8 MB.
The below slightly modified code (addition of bit inversion '~' and B shift direction change, requiring different constants) has slightly better escape from zero behavior and PractRand results, as compared to: Xoroshiro32pp 13,5,10,9
Not sure about 'Pair frequency', if someone had time to check?
There are many variations on this theme... too many to test.
// Xoroshiro32ppg (variation of XoroshiroPlusPlus from Vigna/Blackman), by CRutz, free for all uses#include<stdint.h>#define ACCUM_SIZE 16#define CONSTANT_A 5#define CONSTANT_B 1 // 1 is required for inverted Gray Code modification below#define CONSTANT_C 14#define CONSTANT_D 9 // 9 seems optimal when the underlying x/y distribution is optimal and correlation is minimizedtypedefuint16_trngword_t;
staticinlinerngword_trotl( rngword_t value, int shift ){
return (value << shift) | (value >> (ACCUM_SIZE - shift));
}
// Seed the state array. Seed MUST not be all zero.staticrngword_t s[2] = {1, 0};
staticrngword_tnextxoro( void ){
rngword_t s0 = s[0];
rngword_t s1 = s[1];
rngword_t result = s0 + s1; // output hash (scrambler) for Xoroshiro+//New addition from authors: Rotation and second summing
result = rotl( result, CONSTANT_D ) + s0;
s1 ^= s0;
s[0] = rotl( s0, CONSTANT_A ) ^ ~s1 ^ (s1 >> CONSTANT_B); // Note '~s1 ^ (s1 >> 1)' forms an inverted Gray Code
s[1] = rotl( s1, CONSTANT_C );
return( result );
}
Very neat. A third summing circuit but in parallel to the other two. It looks like that'd fit speed wise. And I think you are saying that CONTANT_B should always be 1. Which brings the searching down to three parameters again.
Same summing (sorry, I left the original comments in place, which will be confusing), basically just added a single negation and reversed the bit shift direction.
CONSTANT_B does not actually need to be 1... I was researching Gray Codes (which require 1) at the time I tested this and it worked so well inverted that I left B alone.
Feel free to try other variations... the '~s1' is perhaps the most important change. Then find all ABC for both '>>' and '<<'. Permute D on those.
Like I said... too many variations. I simply guessed that the changes I made would be best, and they do seem really good on initial inspection.
You guys probably have all the scripts for proving me right/wrong on an optimal choice of changes to make.
If this proved out, I guess Seba would have the know-how to calculate constants for larger state sizes (i.e. 64/128/256).
Comments
I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome to suggestions to improve that.
That would be most odd.
Anyway, cutting and pasting from the page at that link:
The C++ output does this:
$ ./xoroshiroPlusPlus First 16 outputs of Xoroshiro32PlusPlus: 6269 ae16 12a2 4ae8 d719 0c52 984b 1df1 743c dba0 bcc6 34c9 746c 3643 07ff bbc0 First 16 outputs of Xoroshiro64PlusPlus: 48020a01 81662931 cd2b5253 d3e6cbe6 cd5af43d 860aa4ba b7bea7fb 63dcaff3 762d74c9 3e7d7e8f e10e0616 5788242d d8ece2a3 7a242fab add23d97 98ef01be
The Verilog output does this:
[info] [Progress] Start Xoroshiro64PlusPlus test_Xoroshiro64PlusPlus simulation with seed -296360197591574595, wave in /mnt/c/Users/zicog/Documents/xoroshiro-plusplus/./simWorkspace/Xoroshiro64PlusPlus/test_Xoroshiro64PlusPlus.vcd [info] 48020a01 [info] 81662931 [info] cd2b5253 [info] d3e6cbe6 [info] cd5af43d [info] 860aa4ba [info] b7bea7fb [info] 63dcaff3 [info] 762d74c9 [info] 3e7d7e8f [info] e10e0616 [info] 5788242d [info] d8ece2a3 [info] 7a242fab [info] add23d97 [info] 98ef01be [info] [Done] Simulation done in 38.373 ms [success] Total time: 6 s, completed Oct 20, 2018 10:01:41 AM
Which I think matches the results we have discussed earlier in this thread.Is this what a P2 actually does today ?
What is this HDL implementation to which you allude?
Which is here: https://github.com/ZiCog/xoroshiro-plusplus
Includes a dinky demo to run on a DE0 Nano board.
In case you cannot reach it the Verilog looks like this:
// Generator : SpinalHDL v1.1.5 git head : 0310b2489a097f2b9de5535e02192d9ddd2764ae // Date : 20/10/2018, 10:15:16 // Component : Xoroshiro64PlusPlus module Xoroshiro64PlusPlus ( output [31:0] io_prng, input io_next, input clk, input reset); wire [31:0] _zz_3; reg [31:0] s0; reg [31:0] s1; wire [31:0] s0_; wire [31:0] _zz_1; wire [31:0] s1_; wire [31:0] _zz_2; assign _zz_3 = ((s1 ^ s0) <<< 9); assign s0_ = (({s0[5 : 0],s0[31 : 6]} ^ (s1 ^ s0)) ^ _zz_3); assign _zz_1 = (s1 ^ s0); assign s1_ = {_zz_1[18 : 0],_zz_1[31 : 19]}; assign _zz_2 = (s0_ + s1_); assign io_prng = ({_zz_2[14 : 0],_zz_2[31 : 15]} + s0_); always @ (posedge clk or posedge reset) begin if (reset) begin s0 <= (32'b00000000000000000000000000000001); s1 <= (32'b00000000000000000000000000000000); end else begin if(io_next)begin s0 <= s0_; s1 <= s1_; end end end endmodule
The SpinalHDL source from which that is generated looks much nicer:class Xoroshiro64PlusPlus extends Component { val io = new Bundle { val prng = out UInt(32 bits) val next = in Bool } // The PRNG state val s0 = Reg(UInt(32 bits)) init(1) val s1 = Reg(UInt(32 bits)) init(0) // The xoroshiro64++ magic numbers val a = 26 val b = 9 val c = 13 val d = 17 // Calculate the next PRNG state. val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b) val s1_ = (s1 ^ s0).rotateLeft(c) when(io.next) { // Update the PRNG state s0 := s0_ s1 := s1_ } // Deliver the "++" scrambled output io.prng := (s0_ + s1_).rotateLeft(d) + s0_ }
This is the "PRN calculated at end of algorithm after new state generated" version. The output will match my software P2 asm version:
http://forums.parallax.com/discussion/comment/1449224/#Comment_1449224
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware and double-iterate to match the XORO32 output.
However, in hardware, you have to be thinking a level lower as well. The fastest clock rate achievable is set by the longest cascading logic time, or the longest access time. This longest time gets labelled "critical path".
The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're force adding the two durations together. Nothing any optimiser can do about it.
As Chip was working his way through tightening each critical path to push above 160 MHz, producing new shorter criticals in the process, ... at one stage XORO32 became it. Which he then resolved by moving the scrambler to operate parallel to the engine.
As for the C++ edition. Parallel thinking has the same impact, and for very similar reason. Both a compiler optimiser and processor instruction reordering can see a tight parallel opportunity with ease. EDIT: Allows multiple instructions per clock to flow in a superscalar CPU.
I feel like I'm standing on quick sand here. There is no reference to check against.
PRN calculated at end of algorithm after new state generated
48020A01 81662931 CD2B5253 D3E6CBE6 CD5AF43D 860AA4BA B7BEA7FB 63DCAFF3 762D74C9 3E7D7E8F E10E0616 5788242D D8ECE2A3 7A242FAB ADD23D97 98EF01BE E9C6F835 126167D3 A7DCC109 9FE42570 F9D0CC5F 50C44C1C EA63FC4D 17489AD9 50213E15 C2B6785F B88837DA 0D83D95A AADF08B1 B682DDFD BD2B53A7 FEDA90E8
Tony, you could add that to your information topic.
Xoroshiro32++ [13 5 10 9] 0201 6269 ae16 12a2 4ae8 d719 0c52 984b 1df1 743c dba0 bcc6 34c9 746c 3643 07ff bbc0 c642 aa85 594e
XORO32 62690201 12a2ae16 d7194ae8 984b0c52 743c1df1 bcc6dba0 746c34c9 07ff3643 c642bbc0 594eaa85 701ad05a f3aea328 695a67ee 93c6c140 4964f5e1 fc575e24 a638d7ad 181ae233 7be766b5 5cbd8445
That matches my xoroshiro64++ output.
Good idea, Evan. I've edited an existing post to add a link:
http://forums.parallax.com/discussion/comment/1448894/#Comment_1448894
Seems I missed a couple of you points last night. OK, changed that for my xoroshiro32++. Quite so.
If one was really pushed for time one could also pipeline the addition. Perhaps like this:
class Xoroshiro32PlusPlus extends Component { val io = new Bundle { val prng = out UInt(16 bits) val next = in Bool } // The PRNG state val s0 = Reg(UInt(16 bits)) init(1) val s1 = Reg(UInt(16 bits)) init(0) val p0 = Reg(UInt(16 bits)) init(0) val p1 = Reg(UInt(16 bits)) init(0) // The xoroshiro32++ magic numbers val a = 13 val b = 5 val c = 10 val d = 9 // Calculate the next PRNG state. val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b) val s1_ = (s1 ^ s0).rotateLeft(c) when(io.next) { // Update the PRNG state s0 := s0_ s1 := s1_ p0 := (s0 + s1).rotateLeft(d) p1 := s0 } // Deliver the "++" scrambled output io.prng := p0 + p1 }
Or in Verilog:module Xoroshiro32PlusPlus ( output [15:0] io_prng, input io_next, input clk, input reset); wire [15:0] _zz_3; reg [15:0] s0; reg [15:0] s1; reg [15:0] p0; reg [15:0] p1; wire [15:0] s0_; wire [15:0] _zz_1; wire [15:0] s1_; wire [15:0] _zz_2; assign _zz_3 = ((s1 ^ s0) <<< 5); assign s0_ = (({s0[2 : 0],s0[15 : 3]} ^ (s1 ^ s0)) ^ _zz_3); assign _zz_1 = (s1 ^ s0); assign s1_ = {_zz_1[5 : 0],_zz_1[15 : 6]}; assign _zz_2 = (s0 + s1); assign io_prng = (p0 + p1); always @ (posedge clk or posedge reset) begin if (reset) begin s0 <= (16'b0000000000000001); s1 <= (16'b0000000000000000); p0 <= (16'b0000000000000000); p1 <= (16'b0000000000000000); end else begin if(io_next)begin s0 <= s0_; s1 <= s1_; p0 <= {_zz_2[6 : 0],_zz_2[15 : 7]}; p1 <= s0; end end end endmodule
With the down side of using more LE's and introducing an invalid output at start up:[info] [Progress] Start Xoroshiro32PlusPlus test_Xoroshiro32PlusPlus simulation with seed 1966752197249320824, wave in /mnt/c/Users/michael/Documents/xoroshiro-plusplus/./simWorkspace/Xoroshiro32PlusPlus/test_Xoroshiro32PlusPlus.vcd [info] 0000 [info] 2010 [info] 6269 [info] ae16 [info] 12a2 [info] 4ae8 [info] d719 [info] c520 [info] 984b [info] 1df1 [info] 743c [info] dba0 [info] bcc6 [info] 34c9 [info] 746c [info] 3643 [info] [Done] Simulation done in 10.724 ms
I've done a little testing. 64+ untruncated netted me impressively poor scores of 128 KB for aperture 32>>0 and no more than 2 MB for all other 32 bit apertures. The moment bit0 was remove the score went to 2 GB, and with both bit0 and bit1 it went way up. I stopped it at 512 GB.
I'm now running four 64++ candidates at aperture 32>>0: [26 9 13 17], [26 9 13 1], [26 9 13 2], [26 9 13 3]. So far, have cleared 2 TB at highest Practrand strength and all is well. Nothing more than a sprinkling of "unusual's".
2 TB was about 12 hours. At that rate, some back of the envelope sums tells me that full period is 46000 years away.
Remember you'll have to double-iterate to emulate the XORO32 instruction.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined as two cascaded 64-bit adds in one clock is fantasy. If you would like to practise your skills some more you could try implementing this generator, which is fully explained in the paper and doesn't require multiplications as such.
Aside.
The generator engines and scramblers are separate and their constants could be separate, too, e.g.
xoroshiro32++ [13,5,10,9] -> xoroshiro32[13,5,10]++[9]
xoroshiro128** [24,16,37,5,7,9] -> xoroshiro128[24,16,37]**[5,7,9]
First 16 outputs of xoroshiro128**[24,16,37,5,7,9] when seed = 1.
PRN calculated at start of algorithm before new state generated.
0000000000001680 0000001696801680 E682E68000001680 800016F099C09692 DB9CE96699C368C0 ACD4A897E816F3BC 736225BD8540F37D D5EF4BC8DB65564A E6844F92C4467F20 ECBFDD2089C19276 3E3C8F32277DC824 6135641F1DADD2CA 0695119B0DE23CFA 8837DED221B7094D 99F5FDC2C04A92D6 57262A94B712228C
Chip sent the low 16 bits to some pins temporarily to test the algorithm was working. It's impossible to verify in software.
xoroshiro64 period is ~584 years at one iteration per nanosecond.
xoroshiro64++ would last ~64000 years in P2 asm @ 200 MHz.
class Xoroshiro128StarStar extends Component { val io = new Bundle { val prng = out UInt(64 bits) val next = in Bool } // The PRNG state val s0 = Reg(UInt(64 bits)) init(1) val s1 = Reg(UInt(64 bits)) init(0) // The xoroshiro128** magic numbers def a = 24 def b = 16 def c = 37 def s = 5 def r = 71 def t = 9 // Calculate the next PRNG state. val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b) val s1_ = (s1 ^ s0).rotateLeft(c) when(io.next) { // Update the PRNG state s0 := s0_ s1 := s1_ } // Deliver the "**" scrambled output: "rotl(s0 * S, R) * T" val temp = ((s0 |<< 2) + s0).rotateLeft(r) io.prng := (temp |<< 3) + temp }
Which produces the following iky looking Verilog:module Xoroshiro128StarStar ( output [63:0] io_prng, input io_next, input clk, input reset); wire [63:0] _zz_3; wire [63:0] _zz_4; wire [63:0] _zz_5; reg [63:0] s0; reg [63:0] s1; wire [63:0] s0_; wire [63:0] _zz_1; wire [63:0] s1_; wire [63:0] _zz_2; wire [63:0] temp; assign _zz_3 = ((s1 ^ s0) <<< 16); assign _zz_4 = (s0 <<< 2); assign _zz_5 = (temp <<< 3); assign s0_ = (({s0[39 : 0],s0[63 : 40]} ^ (s1 ^ s0)) ^ _zz_3); assign _zz_1 = (s1 ^ s0); assign s1_ = {_zz_1[26 : 0],_zz_1[63 : 27]}; assign _zz_2 = (_zz_4 + s0); assign temp = {_zz_2[56 : 0],_zz_2[63 : 57]}; assign io_prng = (_zz_5 + temp); always @ (posedge clk or posedge reset) begin if (reset) begin s0 <= (64'b0000000000000000000000000000000000000000000000000000000000000001); s1 <= (64'b0000000000000000000000000000000000000000000000000000000000000000); end else begin if(io_next)begin s0 <= s0_; s1 <= s1_; end end end endmodule
Which produces the results you showed above. How so? If you can write the Verilog you can convert it to C++ and run it with the Verilator. Which is what the Spinal system does for it's test benches.No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them, but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
r = 71 or r = 7, whatever!
Eerily it turns out not to matter. The generated Verilog is the same with "def r = 7" or "def r = 71".
Presumably because (7 % 64) = (71 % 64) = 7.
What a flukey typo!
They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...
Try not to switch off the power this time!
When I killed off the 4th case to free up the RAM, half an hour late the automatics script, that was still active, started up a fresh case, [26 9 13 4], and it rolled out a VERY SUSPICIOUS on a BCFN test at 8 MB.
Maybe just inform somebody that removing Pb was ordered by his predecessor.
I guess it never was a power switch problem. I just did a shutdown in a moment of distraction.
I'm away a few days now anyway so maybe it's for the better.
Xoroshiro32pp 13,5,10,9
Not sure about 'Pair frequency', if someone had time to check?
There are many variations on this theme... too many to test.
// Xoroshiro32ppg (variation of XoroshiroPlusPlus from Vigna/Blackman), by CRutz, free for all uses #include <stdint.h> #define ACCUM_SIZE 16 #define CONSTANT_A 5 #define CONSTANT_B 1 // 1 is required for inverted Gray Code modification below #define CONSTANT_C 14 #define CONSTANT_D 9 // 9 seems optimal when the underlying x/y distribution is optimal and correlation is minimized typedef uint16_t rngword_t; static inline rngword_t rotl( rngword_t value, int shift ) { return (value << shift) | (value >> (ACCUM_SIZE - shift)); } // Seed the state array. Seed MUST not be all zero. static rngword_t s[2] = {1, 0}; static rngword_t nextxoro( void ) { rngword_t s0 = s[0]; rngword_t s1 = s[1]; rngword_t result = s0 + s1; // output hash (scrambler) for Xoroshiro+ //New addition from authors: Rotation and second summing result = rotl( result, CONSTANT_D ) + s0; s1 ^= s0; s[0] = rotl( s0, CONSTANT_A ) ^ ~s1 ^ (s1 >> CONSTANT_B); // Note '~s1 ^ (s1 >> 1)' forms an inverted Gray Code s[1] = rotl( s1, CONSTANT_C ); return( result ); }
Additional information here.CONSTANT_B does not actually need to be 1... I was researching Gray Codes (which require 1) at the time I tested this and it worked so well inverted that I left B alone.
Feel free to try other variations... the '~s1' is perhaps the most important change. Then find all ABC for both '>>' and '<<'. Permute D on those.
Like I said... too many variations. I simply guessed that the changes I made would be best, and they do seem really good on initial inspection.
You guys probably have all the scripts for proving me right/wrong on an optimal choice of changes to make.
If this proved out, I guess Seba would have the know-how to calculate constants for larger state sizes (i.e. 64/128/256).
Where's the third sum? B = 1 implies less scrambling than before, but bit inversion could make up for that.