Sebastiano Vigna kindly emailed me all the xoroshiro32 characteristic polynomials, essential info for creating jump functions. The weight is the number of terms in the polynomial and, in theory, a weight close to 16 is preferable. Our best triplet [14,2,7] has a weight of 11 and the second best [15,3,6] has a weight of 13.
I hadn't looked closely at all the full-period triplets before. For every [a,b,c] there is a corresponding [c,b,a] that shares the same characteristic polynomial. Knowing this in advance means brute-force searches can be halved.
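As a rough illustration of how that symmetry prunes a search (this is not the actual search code used here; the is_full_period() check below is the naive walk-the-whole-cycle version, far too slow for a real sweep):

#include <stdint.h>
#include <stdio.h>

static uint16_t rotl16(uint16_t x, int k)
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* One xoroshiro32 state step with candidate constants [a,b,c]. */
static void step(uint16_t *s0, uint16_t *s1, int a, int b, int c)
{
    uint16_t x = *s0, y = (uint16_t)(*s1 ^ *s0);
    *s0 = rotl16(x, a) ^ y ^ (uint16_t)(y << b);
    *s1 = rotl16(y, c);
}

/* Naive full-period test: walk the cycle containing state (1,0) and
   check that its length is 2^32 - 1, i.e. all non-zero states. */
static int is_full_period(int a, int b, int c)
{
    uint16_t s0 = 1, s1 = 0;
    uint64_t n = 0;
    do {
        step(&s0, &s1, a, b, c);
        n++;
    } while (s0 != 1 || s1 != 0);
    return n == 0xFFFFFFFFULL;
}

int main(void)
{
    /* Only visit triples with c >= a: the mirror [c,b,a] shares the
       characteristic polynomial, so it need not be tested separately. */
    for (int a = 1; a < 16; a++)
        for (int b = 1; b < 16; b++)
            for (int c = a; c < 16; c++)
                if (is_full_period(a, b, c))
                    printf("[%d,%d,%d] and mirror [%d,%d,%d]\n",
                           a, b, c, c, b, a);
    return 0;
}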
Hey Tony, thanks for getting that. It perfectly matches our 84 brute-forced findings. I've only just now looked at it properly.
And the recommendation to target only the candidates whose "degree" closely matches the word size will be valuable for what's coming up ... I've spoken with Sebastiano again and, same as you, I just had to ask and he immediately sent me a complete list of full-period candidates for a Xoroshiro64 iterator! Amazing stuff. It isn't nicely sorted into mirror pairs like yours was, though.
Excellent news! Thanks for doing that, Evan. I could sort the list into mirror pairs, if you wish.
Tony,
In what I presented to Chip, I chose to revert to just the earlier iterator-only type of function, so as to limit the logic/flop burden that the larger number of bits would impose on the instruction. Also, it keeps the execution to a single clock cycle, which I suspect is not the case with your approach.
If there are no timing issues with doing a couple of XORs before a 32-bit addition then my XORO64 idea will work.
EDIT:
Two parallel xoroshiro32s would be replaced by one xoroshiro64. The inputs to one of the two parallel 16-bit additions have a two xoroshiro delay in XORO32.
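For reference, one step of a xoroshiro iterator with two 32-bit state words looks like this in C: a couple of XORs, a shift and two rotates around a single 32-bit addition. The [26,9,13] constants are just a placeholder taken from the candidate list further down the thread, not a final choice.

#include <stdint.h>

static uint32_t s0, s1;            /* 64 bits of state as two 32-bit words */

static uint32_t rotl32(uint32_t x, int k)
{
    return (x << k) | (x >> (32 - k));
}

/* One xoroshiro[32]64+ step: the output is one 32-bit addition, the state
   update is the usual xoroshiro XOR/shift/rotate network. */
uint32_t xoroshiro64plus_next(void)
{
    uint32_t result = s0 + s1;
    s1 ^= s0;
    s0 = rotl32(s0, 26) ^ s1 ^ (s1 << 9);
    s1 = rotl32(s1, 13);
    return result;
}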
What we have now is ideal, I think, and it's as easy to use as can be. 32 bits is plenty for most uses, I imagine.
And if a *real* random number is needed, you can always just do a GETRND, without any contingencies.
XORO64 could produce 32-bit PRNs in three instructions. There is a new, simple and optional post-processing step that should improve its output. I cannot say any more at the moment. We are getting tremendous help from Sebastiano Vigna, who has been testing a number of xoroshiro[32]64 triples (listed further down the thread):
All xoroshiro generators (without additional backend reworking) at some point have some Hamming dependency kicking in. That is, if you look at the number of ones and zeros in a sequence of, say, 10 consecutive outputs, the resulting 10-tuple doesn't have exactly the distribution it should, as linear generators tend to create correlations between the number of 1s of consecutive outputs.
In xoroshiro128 this happens after several TB of data (at least 32 TB), so it's not a real concern. But in smaller-sized generators it will happen sooner (e.g., it shows up in the "DC6" test from PractRand).
In this case, testing is fundamental because theory cannot really say much beyond what is full period and what is not.
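A rough way to see the effect Vigna describes (my own sketch, not part of his testing): feed raw 32-bit outputs into the filter below, which sums the population counts of 10 consecutive words. For independent uniform words the windowed total is Binomial(320, 1/2), so the sample mean should sit near 160 and the variance near 80; correlation between the Hamming weights of consecutive outputs shows up as the variance drifting away from 80.

#include <stdint.h>
#include <stdio.h>

/* Read raw 32-bit words from stdin and accumulate the number of 1 bits
   in non-overlapping windows of 10 consecutive outputs. */
int main(void)
{
    uint32_t w;
    double sum = 0.0, sumsq = 0.0;
    uint64_t windows = 0;
    unsigned bits = 0;
    int n = 0;

    while (fread(&w, sizeof w, 1, stdin) == 1) {
        bits += (unsigned)__builtin_popcount(w);   /* GCC/Clang builtin */
        if (++n == 10) {
            sum   += bits;
            sumsq += (double)bits * bits;
            windows++;
            bits = 0;
            n = 0;
        }
    }
    if (windows > 1) {
        double mean = sum / windows;
        double var  = (sumsq - sum * mean) / (double)(windows - 1);
        printf("windows=%llu  mean=%.3f (expect 160)  var=%.3f (expect 80)\n",
               (unsigned long long)windows, mean, var);
    }
    return 0;
}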
That email address was just a throwaway and it's actually been stopped by the provider now. I seriously dislike the "social media" tendrils, not to mention the horrid comment entry interface that Google Groups has. I'll be a very sad puppy if Parallax ever tries to use such rubbish for its forums.
It has definitely been helpful reading both Scro's and Seba's comments. It's helped to know what to look for when 100% testing isn't an option ... Here's a summary of what I've got so far:
Note: The full width 32-bit tests top out at 32 GB. All failed with a massive brick wall BRank. I believe this is due to the summed bit1 being shifted into the more sensitive bit0 position for PractRand.
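For anyone wanting to reproduce these runs, the usual arrangement is to write the generator's raw 32-bit outputs to stdout and pipe them into PractRand's RNG_test reading from stdin. A minimal writer along those lines is sketched below; the constants and seed are placeholders rather than any particular candidate, and the exact RNG_test invocation depends on the PractRand build.

#include <stdint.h>
#include <stdio.h>

/* Placeholder xoroshiro[32]64+ generator (same structure as the sketch
   earlier in the thread); seed values are arbitrary non-zero fillers. */
static uint32_t s0 = 0x12345678, s1 = 0x9ABCDEF0;

static uint32_t rotl32(uint32_t x, int k)
{
    return (x << k) | (x >> (32 - k));
}

static uint32_t next32(void)
{
    uint32_t r = s0 + s1;
    s1 ^= s0;
    s0 = rotl32(s0, 26) ^ s1 ^ (s1 << 9);
    s1 = rotl32(s1, 13);
    return r;
}

/* Emit raw 32-bit words forever, e.g.  ./gen | ./RNG_test stdin32  */
int main(void)
{
    uint32_t buf[4096];
    for (;;) {
        for (int i = 0; i < 4096; i++)
            buf[i] = next32();
        if (fwrite(buf, sizeof buf[0], 4096, stdout) != 4096)
            return 0;
    }
}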
EDIT:
The second column of scores, [31:1], which I've also called Word1 before, shows more substantial results. These all failed on the DC6-9x1Bytes-1 test in PractRand.
PS: The byte-sized sampling variants (except for LSByte) feel like they're going to go a lot further. I tested a pair of low-quality candidates out to 4 TB (it took more than a day) and they weren't giving up there.
LSByte, [7:0], always dies at 8 GB (exactly a quarter of the full 32-bit sampling). These have the same massive BRank fails as the full-width variant.
EDIT2:
When placing the parity at bit0, the BRank test fails the [31:0] variant at 512 GB instead of 32 GB. There is a mix of DC6 failures at 256 GB with these. The group of good scores is diminished and not as obvious.
I'm running the parity at bit0 sweep at the moment.
Just remember, Tony, shifting a low-quality bit further away from bit0 does not improve quality. It's just tricking PractRand into giving higher scores.
Could you post these Xoroshiro64+p parity at bit0 scores some time?
I overwrote one of the good output files before I'd done the score table. I can post it minus that score ... EDIT: Correction, I did have that score noted already.
Here's the full list, I've only done the Word0 variant:
Yeah, it's probably not as bad as it looks, because I think the processed score value gets scaled down more for larger datasets, but those were bad enough to all fail DC6 at the 512 MB mark as well as the BRank failure.
EDIT: The reason I included them is that the 35.6 is the lowest. Just another data point to consider.
The current xoroshiro128+ implementation is as small as can be.
Every cog gets a uniquely-picked/ordered/inverted set of 32 bits from the 63-bit source. Every smart pin gets 8 such bits. Smart pins need these for DAC dithering and noise generation. Cogs need them for the GETRND instruction. I think cogs and pins will all be happy with this arrangement. These patterns really only need to serve as apparently-uncoordinated white noise sources.
On start-up, the PRNG will be 1/4 seeded maybe 64 times with 4 bits of pure thermal noise, along with another 27 bits of process/voltage/temperature-dependent bits.
A few questions:
1. Has the parity improvement been applied to xoroshiro128+? That would make a 64-bit source and could mean each bit is used the same number of times for the different cogs and smart pins.
2. Is there enough time to do the 64-bit addition in a single clock?
3. Have the different cog and smart pin scramblings been made public? (I don't want to hold up Chip with this one.)
A few bits get used 9 times, while the rest get used 8 times in those longs. This is because, as you know, we only have 63 bits in x[62:0]: 16 longs × 32 taps = 512 taps, and 512 = 8 × 63 + 8, so eight of the bits pick up a ninth use. No long uses the same bit twice (cog GETRND sources), nor does any byte (smart pin noise sources). I think the distribution is good enough that adding that parity fix wouldn't make any meaningful difference.
Here is the Spin program for the Prop1 that I wrote to randomly come up with these patterns:
''*********************************************
''* Pick 16 random 32-tap sets from 63 bits  *
''*********************************************

CON

  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

OBJ

  fds : "FullDuplexSerial"
  rr  : "RealRandom"

VAR

  byte poolsize
  byte pool[63]
  byte uses[63]
  byte taps[16*32]

PUB start | i, j, k

  'start serial
  fds.start(31,30,0,112500)

  'start RealRandom
  rr.start

  'init pool
  poolsize := 63
  repeat i from 0 to poolsize - 1
    pool[i] := i

  'pick random tap sets
  fds.str(string(13,13,"Picking taps...",13))
  repeat j from 0 to 15
    'scramble current pool
    fds.str(string(13,"Pool "))
    fds.hex(j,1)
    scramble_pool
    optimize_pool
    show_pool
    'fill a set of taps, track uses
    repeat i from 0 to 31
      taps[j*32+i] := pool[i]
      uses[pool[i]]++

  'output tap sets
  fds.str(string(13,13,"16 tap sets in Verilog:",13,13))
  repeat j from 0 to 15
    fds.tx("{")
    repeat i from 0 to 31
      if rr.random & 1
        fds.tx("!")
      else
        fds.tx(" ")
      fds.str(string("x["))
      k := taps[j*32+i]
      if k < 10
        fds.tx("0")
      fds.dec(k)
      fds.tx("]")
      if i <> 31
        fds.tx(",")
      else
        fds.str(string("},",13))

PRI scramble_pool | i, j

  repeat 100
    repeat i from 0 to poolsize - 1
      if rr.random & 3  'mainly rotate
        j := pool[i]
        if i == poolsize - 1
          pool[i] := pool[0]
          pool[0] := j
        else
          pool[i] := pool[i+1]
          pool[i+1] := j

PRI optimize_pool | i, j, k

  repeat 10
    repeat i from 0 to 31
      'look for a used pool member in 0..31
      if uses[pool[i]] > 0
        'see if it can be swapped out with a less-used member in 32+
        repeat j from 32 to poolsize - 1
          if uses[pool[j]] < uses[pool[i]]
            k := pool[i]
            pool[i] := pool[j]
            pool[j] := k
            quit

PRI show_pool | i

  'show pool as a bunch of characters with usage underneath, space at 32
  fds.tx(13)
  repeat i from 0 to poolsize - 1
    fds.tx($40 + pool[i])
    if i == 31
      fds.tx(" ")
  fds.tx(13)
  repeat i from 0 to poolsize - 1
    fds.hex(uses[pool[i]],1)
    if i == 31
      fds.tx(" ")
  fds.tx(13)
Regarding question 2 (is there enough time to do the 64-bit addition in a single clock?): if the answer is yes, I'm wondering whether two 16-bit additions could be done in one clock, with the second using the result of the first, perhaps with bespoke carry logic.
Comments
No and no, but we do need a separate subtract to get s1 on its own.
Code when keeping sum[0]:
Code when replacing sum[0] with parity:
Weights of 29-33 might not give best results. Good starting point, though.
It is possible to reduce the above to four or only three instructions.
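The instruction-level listings referred to above weren't preserved in this capture. As a stand-in, here is a hedged C sketch of the two output variants as I understand them, assuming "parity" means the XOR of all bits of the sum folded into bit0; the [26,9,13] constants are again just a placeholder:

#include <stdint.h>

static uint32_t s0, s1;

static uint32_t rotl32(uint32_t x, int k)
{
    return (x << k) | (x >> (32 - k));
}

static void step(void)                 /* placeholder state update */
{
    s1 ^= s0;
    s0 = rotl32(s0, 26) ^ s1 ^ (s1 << 9);
    s1 = rotl32(s1, 13);
}

/* Variant 1: keep sum[0] as it is (plain xoroshiro64+ output). */
uint32_t next_keep_sum0(void)
{
    uint32_t sum = s0 + s1;
    step();
    return sum;
}

/* Variant 2: replace sum[0] with the parity of the sum
   (assumption: parity of all 32 sum bits). */
uint32_t next_parity_at_bit0(void)
{
    uint32_t sum = s0 + s1;
    uint32_t p = (uint32_t)__builtin_parity(sum);   /* GCC/Clang builtin */
    step();
    return (sum & ~1u) | p;
}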
The first of the xoroshiro[32]64 triples (listed a little further down) is his current recommendation.
XORO32 could probably be improved with a little bit of re-ordering. This is not guesswork, but PractRand tests are needed to be sure.
Evan, I tried to contact you by private message from the PRNG forum. If you didn't receive it, please try the same from your end, thanks.
Fire away, what's the improvement?
Example:
[Low1/32]BRank(12):3K(1) R=+21249 p~= 9e-6398 FAIL !!!!!!!!
Email just sent.
The xoroshiro[32]64 triples Vigna has been testing:
[17 10 6]
[20 7 27]
[27 7 20]
[13 5 28]
[26 9 13]
I've attached the xoroshiro[32]64+ characteristic polynomials sorted by weight/degree in CSV format.
Big difference in two sets of R values.