I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome suggestions to improve that.
What is this HDL implementation to which you allude?
The SpinalHDL source from which that is generated looks much nicer:
class Xoroshiro64PlusPlus extends Component {
  val io = new Bundle {
    val prng = out UInt(32 bits)
    val next = in Bool
  }
  // The PRNG state
  val s0 = Reg(UInt(32 bits)) init(1)
  val s1 = Reg(UInt(32 bits)) init(0)
  // The xoroshiro64++ magic numbers
  val a = 26
  val b = 9
  val c = 13
  val d = 17
  // Calculate the next PRNG state.
  val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
  val s1_ = (s1 ^ s0).rotateLeft(c)
  when(io.next) {
    // Update the PRNG state
    s0 := s0_
    s1 := s1_
  }
  // Deliver the "++" scrambled output
  io.prng := (s0_ + s1_).rotateLeft(d) + s0_
}
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware and double-iterate to match the XORO32 output.
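For what it's worth, here is a minimal C sketch of how I read that: the "++" scrambler applied to the current state before the engine advances, iterated twice per call. The function names and the packing of the two 16-bit results into one 32-bit word (low half first) are my assumptions, not taken from the P2 documentation, so check against the real XORO32 output.
#include <stdint.h>

static uint16_t s0 = 1, s1 = 0;              /* seed must not be all zero */

static inline uint16_t rotl16(uint16_t x, int k)
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* One xoroshiro32++ [13,5,10,9] step, PRN taken from the state BEFORE it advances */
static uint16_t step(void)
{
    uint16_t result = (uint16_t)(rotl16((uint16_t)(s0 + s1), 9) + s0);
    uint16_t t = (uint16_t)(s1 ^ s0);
    s0 = (uint16_t)(rotl16(s0, 13) ^ t ^ (uint16_t)(t << 5));   /* engine [13,5,10] */
    s1 = rotl16(t, 10);
    return result;
}

/* Double-iterate to approximate XORO32; the packing order is a guess */
static uint32_t xoro32_like(void)
{
    uint16_t lo = step();
    uint16_t hi = step();
    return ((uint32_t)hi << 16) | lo;
}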
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome suggestions to improve that.
Yep, that's good. Standard synchronous design.
However, in hardware, you have to be thinking a level lower as well. The fastest clock rate achievable is set by the longest cascading logic time, or the longest access time. This longest time gets labelled "critical path".
The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're forced to add the two durations together. Nothing any optimiser can do about it.
As Chip was working his way through tightening each critical path to push above 160 MHz, producing new shorter criticals in the process, at one stage XORO32 became it. He then resolved that by moving the scrambler to operate in parallel with the engine.
As for the C++ edition: parallel thinking has the same impact, and for a very similar reason. Both a compiler optimiser and processor instruction reordering can see a tight parallel opportunity with ease. EDIT: Allows multiple instructions per clock to flow in a superscalar CPU.
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware
OK, changed that for my xoroshiro32++.
..."critical path"... The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're forced to add the two durations together.
Quite so.
If one was really pushed for time one could also pipeline the addition. Perhaps like this:
class Xoroshiro32PlusPlus extends Component {
  val io = new Bundle {
    val prng = out UInt(16 bits)
    val next = in Bool
  }
  // The PRNG state
  val s0 = Reg(UInt(16 bits)) init(1)
  val s1 = Reg(UInt(16 bits)) init(0)
  val p0 = Reg(UInt(16 bits)) init(0)
  val p1 = Reg(UInt(16 bits)) init(0)
  // The xoroshiro32++ magic numbers
  val a = 13
  val b = 5
  val c = 10
  val d = 9
  // Calculate the next PRNG state.
  val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
  val s1_ = (s1 ^ s0).rotateLeft(c)
  when(io.next) {
    // Update the PRNG state
    s0 := s0_
    s1 := s1_
    p0 := (s0 + s1).rotateLeft(d)
    p1 := s0
  }
  // Deliver the "++" scrambled output
  io.prng := p0 + p1
}
The paper mentions the xoroshiro64 [26,9,13] engine and the +, * and ** scramblers. + is not good enough and ++ would be a much better alternative if there are no multipliers available. In an email to me Seba suggested 15 or 17 for the ++ rotation, i.e. half the output width +/- 1.
I chose d = 17 as Seba told me that, in theory, an odd prime has a little more mojo than plain odd (although not in those exact words). Also our chosen xoroshiro32++ has d = w/2 +1.
Nobody really knows whether d = 17 is the best choice or not, as xoroshiro64++ is far too big to test fully.
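For anyone wanting to play along in software, here's a minimal C sketch of xoroshiro64++ [26,9,13,17] as described above, i.e. the xoroshiro64 engine (two 32-bit words) with the "++" scrambler and d = 17. I've used the "output taken from the current state before it is advanced" ordering, which is the opposite of the Spinal code earlier, so expect the stream to be offset by one step relative to it:
#include <stdint.h>

static uint32_t s0 = 1, s1 = 0;   /* seed must not be all zero */

static inline uint32_t rotl32(uint32_t x, int k)
{
    return (x << k) | (x >> (32 - k));
}

static uint32_t xoroshiro64pp(void)
{
    uint32_t result = rotl32(s0 + s1, 17) + s0;   /* "++" scrambler, d = 17 */
    uint32_t t = s1 ^ s0;
    s0 = rotl32(s0, 26) ^ t ^ (t << 9);           /* engine [26,9,13] */
    s1 = rotl32(t, 13);
    return result;
}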
I've done a little testing. 64+ untruncated netted me impressively poor scores of 128 KB for aperture 32>>0 and no more than 2 MB for all other 32-bit apertures. The moment bit0 was removed the score went to 2 GB, and with both bit0 and bit1 removed it went way up. I stopped it at 512 GB.
I'm now running four 64++ candidates at aperture 32>>0: [26 9 13 17], [26 9 13 1], [26 9 13 2], [26 9 13 3]. So far they have cleared 2 TB at the highest PractRand strength and all is well. Nothing more than a sprinkling of "unusual"s.
2 TB was about 12 hours. At that rate, some back-of-the-envelope sums tell me that full period is 46000 years away.
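As a rough cross-check on that figure: the full period is 2^64 - 1 outputs of 4 bytes each, about 7.4 x 10^19 bytes. At 2 TB per 12 hours, call it 5 x 10^7 bytes per second, that works out to around 1.5 x 10^12 seconds, which is indeed in the region of 46000 years.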
..."critical path"... The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're forced to add the two durations together.
Quite so.
If one was really pushed for time one could also pipeline the addition.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined, as two cascaded 64-bit adds in one clock is fantasy. If you would like to practise your skills some more you could try implementing this generator, which is fully explained in the paper and doesn't require multiplications as such.
Aside.
The generator engines and scramblers are separate and their constants could be separate, too, e.g.
xoroshiro32++ [13,5,10,9] -> xoroshiro32[13,5,10]++[9]
xoroshiro128** [24,16,37,5,7,9] -> xoroshiro128[24,16,37]**[5,7,9]
First 16 outputs of xoroshiro128**[24,16,37,5,7,9] when seed = 1.
PRN calculated at start of algorithm before new state generated.
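In case anyone wants to reproduce those in software, here's a minimal C model, assuming "seed = 1" means s0 = 1 and s1 = 0, with the PRN taken from the current state before it advances as stated above. The *5 and *9 are written as shift-and-add to mirror a multiplier-free hardware version; the function name is mine:
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

static uint64_t s0 = 1, s1 = 0;

static inline uint64_t rotl64(uint64_t x, int k)
{
    return (x << k) | (x >> (64 - k));
}

static uint64_t xoroshiro128ss(void)
{
    uint64_t x5 = (s0 << 2) + s0;             /* s0 * 5 without a multiplier */
    uint64_t r  = rotl64(x5, 7);
    uint64_t result = (r << 3) + r;           /* * 9 without a multiplier */
    uint64_t t = s1 ^ s0;
    s0 = rotl64(s0, 24) ^ t ^ (t << 16);      /* engine [24,16,37] */
    s1 = rotl64(t, 37);
    return result;
}

int main(void)
{
    for (int i = 0; i < 16; i++)
        printf("%016" PRIX64 "\n", xoroshiro128ss());
    return 0;
}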
Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.
Remember you'll have to double-iterate to emulate the XORO32 instruction.
I'm leaving that as an exercise for later.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined...
Just when I thought I was getting things straight I find I'm confused again. I didn't know the P2 used xoroshiro128**. Does it use xoroshiro64++? Do I have my wires crossed?
If you would like to practise your skills some more...
Oh yeah. Here is my xoroshiro128** in Spinal, no pipelining or such yet but no multiply used:
class Xoroshiro128StarStar extends Component {
  val io = new Bundle {
    val prng = out UInt(64 bits)
    val next = in Bool
  }
  // The PRNG state
  val s0 = Reg(UInt(64 bits)) init(1)
  val s1 = Reg(UInt(64 bits)) init(0)
  // The xoroshiro128** magic numbers
  def a = 24
  def b = 16
  def c = 37
  def s = 5
  def r = 71
  def t = 9
  // Calculate the next PRNG state.
  val s0_ = s0.rotateLeft(a) ^ (s1 ^ s0) ^ ((s1 ^ s0) |<< b)
  val s1_ = (s1 ^ s0).rotateLeft(c)
  when(io.next) {
    // Update the PRNG state
    s0 := s0_
    s1 := s1_
  }
  // Deliver the "**" scrambled output: "rotl(s0 * S, R) * T"
  val temp = ((s0 |<< 2) + s0).rotateLeft(r)
  io.prng := (temp |<< 3) + temp
}
Chip sent the low 16 bits to some pins temporarily to test the algorithm was working. It's impossible to verify in software.
How so? If you can write the Verilog you can convert it to C++ and run it with Verilator. Which is what the Spinal system does for its test benches.
Just when I thought I was getting things straight I find I'm confused again. I didn't know the P2 used xoroshiro128**. Does it use xoroshiro64++?
No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them, but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
No worries. Thanks, it's all good stuff. I quite like xoroshiro64++ as a middle ground. xoroshiro128** eats LEs, like 1% of a DE0 Nano.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them,
Ah OK. We won't go there then.
...but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
I see what you mean.
r = 71 or r = 7, whatever!
WTF, where did that "1" come from? I even cut and pasted those numbers from your post.
Eerily it turns out not to matter. The generated Verilog is the same with "def r = 7" or "def r = 71".
Oh, I might have to cancel one of the running tests. I just realised that RAM usage has been slowly climbing all day. Already over 20 GB between the four of them.
They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...
Huh, chalked up the 16 TB mark 8 hours early. I guess there was a little too much CPU demand before. Here are the progress reports. You'll note [26 9 13 17] currently has zero "unusual"s. I'm suitably impressed.
When I killed off the 4th case to free up the RAM, half an hour later the automatic script, which was still active, started up a fresh case, [26 9 13 4], and it rolled out a VERY SUSPICIOUS on a BCFN test at 8 MB.
The slightly modified code below (addition of bit inversion '~' and a change of B shift direction, requiring different constants) has slightly better escape-from-zero behavior and PractRand results, as compared to: Xoroshiro32pp 13,5,10,9
Not sure about 'Pair frequency', if someone had time to check?
There are many variations on this theme... too many to test.
// Xoroshiro32ppg (variation of XoroshiroPlusPlus from Vigna/Blackman), by CRutz, free for all uses
#include <stdint.h>
#define ACCUM_SIZE 16
#define CONSTANT_A 5
#define CONSTANT_B 1   // 1 is required for inverted Gray Code modification below
#define CONSTANT_C 14
#define CONSTANT_D 9   // 9 seems optimal when the underlying x/y distribution is optimal and correlation is minimized
typedef uint16_t rngword_t;
static inline rngword_t rotl( rngword_t value, int shift ) {
    return (value << shift) | (value >> (ACCUM_SIZE - shift));
}
// Seed the state array. Seed MUST not be all zero.
static rngword_t s[2] = {1, 0};
static rngword_t nextxoro( void )
{
    rngword_t s0 = s[0];
    rngword_t s1 = s[1];
    rngword_t result = s0 + s1;   // output hash (scrambler) for Xoroshiro+
    // New addition from authors: Rotation and second summing
    result = rotl( result, CONSTANT_D ) + s0;
    s1 ^= s0;
    s[0] = rotl( s0, CONSTANT_A ) ^ ~s1 ^ (s1 >> CONSTANT_B);   // Note '~s1 ^ (s1 >> 1)' forms an inverted Gray Code
    s[1] = rotl( s1, CONSTANT_C );
    return( result );
}
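A tiny driver like this (my addition, to be appended to the code above) prints the first few outputs in hex for a quick eyeball check:
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 8; i++)
        printf("%04X\n", (unsigned)nextxoro());
    return 0;
}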
Very neat. A third summing circuit but in parallel to the other two. It looks like that'd fit speed-wise. And I think you are saying that CONSTANT_B should always be 1. Which brings the searching down to three parameters again.
Same summing (sorry, I left the original comments in place, which will be confusing), basically just added a single negation and reversed the bit shift direction.
CONSTANT_B does not actually need to be 1... I was researching Gray Codes (which require 1) at the time I tested this and it worked so well inverted that I left B alone.
Feel free to try other variations... the '~s1' is perhaps the most important change. Then find all ABC for both '>>' and '<<'. Permute D on those.
Like I said... too many variations. I simply guessed that the changes I made would be best, and they do seem really good on initial inspection.
You guys probably have all the scripts for proving me right/wrong on an optimal choice of changes to make.
If this proved out, I guess Seba would have the know-how to calculate constants for larger state sizes (i.e. 64/128/256).
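To make the "too many to test" point concrete, here's a rough C sketch of the kind of sweep used for the plain engine: run each [a,b,c] candidate from a fixed seed and see whether it returns to that seed after exactly 2^32 - 1 steps, i.e. whether the engine is full period. This covers only the standard '<<' engine; the '~' inverted variant is no longer the plain linear map, so it would need its own check, and a full sweep takes hours even before any PractRand scoring:
#include <stdint.h>
#include <stdio.h>

static inline uint16_t rotl16(uint16_t x, int k)
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* Does engine [a,b,c] visit all 2^32 - 1 nonzero states before returning to the seed? */
static int full_period(int a, int b, int c)
{
    uint16_t s0 = 1, s1 = 0;
    uint64_t n = 0;
    do {
        uint16_t t = (uint16_t)(s1 ^ s0);
        s0 = (uint16_t)(rotl16(s0, a) ^ t ^ (uint16_t)(t << b));
        s1 = rotl16(t, c);
        n++;
    } while (!(s0 == 1 && s1 == 0));
    return n == 0xFFFFFFFFULL;
}

int main(void)
{
    for (int a = 1; a < 16; a++)
        for (int b = 1; b < 16; b++)
            for (int c = 1; c < 16; c++)
                if (full_period(a, b, c))
                    printf("[%d,%d,%d] is full period\n", a, b, c);
    return 0;
}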
I agree. Details of implementation can have a big effect on performance. But getting the correct result trumps all of that. So far I cannot even verify that.
As it stands my HDL implementation is busy calculating the next output whilst one is busy reading the current output. Or at least that is the plan. I welcome suggestions to improve that.
That would be most odd.
Anyway, cutting and pasting from the page at that link:
The C++ output does this:
The Verilog output does this: Which I think matches the results we have discussed earlier in this thread.
Is this what a P2 actually does today?
What is this HDL implementation to which you allude?
Which is here: https://github.com/ZiCog/xoroshiro-plusplus
Includes a dinky demo to run on a DE0 Nano board.
In case you cannot reach it the Verilog looks like this:
The SpinalHDL source from which that is generated looks much nicer:
This is the "PRN calculated at end of algorithm after new state generated" version. The output will match my software P2 asm version:
http://forums.parallax.com/discussion/comment/1449224/#Comment_1449224
You'll need the "PRN calculated at start of algorithm before new state generated" version of xoroshiro32++[13,5,10,9] to match the P2 hardware and double-iterate to match the XORO32 output.
However, in hardware, you have to be thinking a level lower as well. The fastest clock rate achievable is set by the longest cascading logic time, or the longest access time. This longest time gets labelled "critical path".
The layered xors and shifts become a cascade of gates. Summing has its own cascade. Having one serially depend on the other means you're forced to add the two durations together. Nothing any optimiser can do about it.
As Chip was working his way through tightening each critical path to push above 160 MHz, producing new shorter criticals in the process, at one stage XORO32 became it. He then resolved that by moving the scrambler to operate in parallel with the engine.
As for the C++ edition: parallel thinking has the same impact, and for a very similar reason. Both a compiler optimiser and processor instruction reordering can see a tight parallel opportunity with ease. EDIT: Allows multiple instructions per clock to flow in a superscalar CPU.
I feel like I'm standing on quicksand here. There is no reference to check against.
PRN calculated at end of algorithm after new state generated
Tony, you could add that to your information topic.
That matches my xoroshiro64++ output.
Good idea, Evan. I've edited an existing post to add a link:
http://forums.parallax.com/discussion/comment/1448894/#Comment_1448894
Seems I missed a couple of your points last night. OK, changed that for my xoroshiro32++. Quite so.
If one was really pushed for time one could also pipeline the addition. Perhaps like this: Or in Verilog: With the downside of using more LEs and introducing an invalid output at start-up:
I've done a little testing. 64+ untruncated netted me impressively poor scores of 128 KB for aperture 32>>0 and no more than 2 MB for all other 32-bit apertures. The moment bit0 was removed the score went to 2 GB, and with both bit0 and bit1 removed it went way up. I stopped it at 512 GB.
I'm now running four 64++ candidates at aperture 32>>0: [26 9 13 17], [26 9 13 1], [26 9 13 2], [26 9 13 3]. So far they have cleared 2 TB at the highest PractRand strength and all is well. Nothing more than a sprinkling of "unusual"s.
2 TB was about 12 hours. At that rate, some back-of-the-envelope sums tell me that full period is 46000 years away.
Remember you'll have to double-iterate to emulate the XORO32 instruction.
The addition in xoroshiro128** [24,16,37,5,7,9] in the P2 is pipelined, as two cascaded 64-bit adds in one clock is fantasy. If you would like to practise your skills some more you could try implementing this generator, which is fully explained in the paper and doesn't require multiplications as such.
Aside.
The generator engines and scramblers are separate and their constants could be separate, too, e.g.
xoroshiro32++ [13,5,10,9] -> xoroshiro32[13,5,10]++[9]
xoroshiro128** [24,16,37,5,7,9] -> xoroshiro128[24,16,37]**[5,7,9]
First 16 outputs of xoroshiro128**[24,16,37,5,7,9] when seed = 1.
PRN calculated at start of algorithm before new state generated.
Chip sent the low 16 bits to some pins temporarily to test the algorithm was working. It's impossible to verify in software.
xoroshiro64 period is ~584 years at one iteration per nanosecond.
xoroshiro64++ would last ~64000 years in P2 asm @ 200 MHz.
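(Quick check on the first of those: 2^64 ns is about 1.8 x 10^10 seconds, which is a little over 584 years, so that figure holds up.)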
No, xoroshiro64++ is not in the P2 hardware. I offered it up because you seemed less than enthused about the period of xoroshiro32++.
There is a single xoroshiro128** in the hub (originally 128+) and scrambled subsets of the 64-bit output are sent to the cogs and pins. Reading a 64-bit output could be done with difficulty by using multiple cogs in sync to read 32-bit chunks then unscrambling them, but this generator cannot be seeded fully in software hence my "impossible to verify in software" comment.
r = 71 or r = 7, whatever!
Eerily it turns out not to matter. The generated Verilog is the same with "def r = 7" or "def r = 71".
Presumably because (7 % 64) = (71 % 64) = 7.
What a flukey typo!
They reached 27 GB RAM usage at the 8 TB mark. I terminated [26 9 13 1] to make room, back to about 20 GB again. 2, 3, and 17 are still going, two days until 16 TB ...
Try not to switch off the power this time!
When I killed off the 4th case to free up the RAM, half an hour later the automatic script, which was still active, started up a fresh case, [26 9 13 4], and it rolled out a VERY SUSPICIOUS on a BCFN test at 8 MB.
Maybe just inform somebody that removing Pb was ordered by his predecessor.
I guess it never was a power switch problem. I just did a shutdown in a moment of distraction.
I'm away a few days now anyway so maybe it's for the better.
Xoroshiro32pp 13,5,10,9
Not sure about 'Pair frequency', if someone had time to check?
There are many variations on this theme... too many to test. Additional information here.
CONSTANT_B does not actually need to be 1... I was researching Gray Codes (which require 1) at the time I tested this and it worked so well inverted that I left B alone.
Feel free to try other variations... the '~s1' is perhaps the most important change. Then find all ABC for both '>>' and '<<'. Permute D on those.
Like I said... too many variations. I simply guessed that the changes I made would be best, and they do seem really good on initial inspection.
You guys probably have all the scripts for proving me right/wrong on an optimal choice of changes to make.
If this proved out, I guess Seba would have the know-how to calculate constants for larger state sizes (i.e. 64/128/256).
Where's the third sum? B = 1 implies less scrambling than before, but bit inversion could make up for that.