I've arbitrarily tried a few of those ideas myself. Most end up with total fails on Practrand, some have given equivalent quality ... hmm, I've not tried to use all bits at once though, since that part I'd long moved outside the generator code. Good point, I'll have another fiddle ...
Tony,
Ha! That was silly of me, not doing a non-truncated scoring run. Turns out I had succeeded there all by myself. Here's two score tables. The first table is of the regular Xoroshiro algorithm with simple summing of the two halves of the state.
The second table is one of my earlier spins, a couple of weeks back last weekend, from your prompting - The iterator hasn't changed but the summing gets dynamically modified. It sums the two halves still but a copy of the state is rotated based on high three bits of s1 before the summing.
PS: Word0 is the full summing, without any truncation, that produces complete 16-bit results for every iteration.
Someone is saying that Xoroshiro32+ cannot possibly be as a good as an 8-bit Complementary Multiply-With-Carry PRNG and that CMWC in fact "is in an entirely different league."
Here's the pseudo-code:
;set constants
a = 253
;b = 8
;r = 8
;initialise variables
c = 0
i = 0
x[0] = 82
x[1] = 97
x[2] = 120
x[3] = 111
x[4] = 102
x[5] = 116
x[6] = 20
x[7] = 12
;iterate CMWC
t := a * x[i] + c
c := t / 256 ;high byte of t
x[i] := (t mod 256) xor 255 ;1's complement of low byte of t
prn := x[i]
i := (i + 1) and 7
;a,c,i,x[i] are 8-bit
;t is 16-bit
;Period > 2^66
I like to know how this compares to the top byte of Xoroshiro32+ [14,2,7] but I don't have the testing tools.
It's virtually identical scores to original Xoroshiro (including the lsbit) when the lsbit is replaced with parity, just a few more non-512K word0 scores. When the sum is shifted down one bit and the parity place as msbit then it all goes to Smile.
Someone is saying that Xoroshiro32+ cannot possibly be as a good as an 8-bit Complementary Multiply-With-Carry PRNG and that CWMC in fact "is in an entirely different league."
Probably quite right. The geometric tests that Melissa demo'd always came up nice and fuzzy looking with that type of algorithm. She strongly advocates for them because they fit modern CPUs really well.
However, the problem with those type is they're costly to have a hardware implementation because of the multiply.
Someone is saying that Xoroshiro32+ cannot possibly be as a good as an 8-bit Complementary Multiply-With-Carry PRNG and that CMWC in fact "is in an entirely different league."
Probably quite right. The geometric tests that Melissa demo'd always came up nice and fuzzy looking with that type of algorithm. She strongly advocates for them because they fit modern CPUs really well.
However, the problem with those type is they're costly to have a hardware implementation because of the multiply.
CMWC is yet another PRNG by the late Prof. Marsaglia.
The numbers in the pseudo-code were chosen to make it as easy as possible to implement on 8-bit CPUs but does that affect the quality? Multiply by 253 is a byte shift and three subtracts.
There are different versions, e.g. 32-bit with r=256 or r=4096, but if you're interested please test this 8-bit version first! I could generate the first 16 PRNs tonight as a check. Seeding the lag table correctly is an issue that Xoroshiro+ doesn't have.
Interesting, examining the results I now see that the best word0 score is 16M. Still doesn't make it to the nominal good scores, I'd want at least 256M, but of note the whole column is pretty much 32x better than what was achieved without the parity.
Because PRN[n] is a function of PRN[n-r] not PRN[n-1]. If r=4096, a new PRN won't be used to generate another PRN until 4096 (or 4095) more iterations have been done, hence the lag.
If XORO32 can be used to produce a high-quality 16-bit result per iteration, then I could enhance the instruction to do TWO iterations, get 16-bit sums from each, and substitute the high-quality 32-bit PRNG result into the next instruction's S value. Meanwhile, the double iteration would be written back to the original D in the XORO32 instruction.
Not sure if I've got me head fully around your intent there but two consecutive Xoroshiro32's can not a make a one Xoroshiro64. The period in particular is still that of the Xoroshiro32.
That said, I'm all for squeezing 16 useful bits out of the XORO32 instruction.
Comments
Tony,
Ha! That was silly of me, not doing a non-truncated scoring run. Turns out I had succeeded there all by myself. Here's two score tables. The first table is of the regular Xoroshiro algorithm with simple summing of the two halves of the state.
The second table is one of my earlier spins, a couple of weeks back last weekend, from your prompting - The iterator hasn't changed but the summing gets dynamically modified. It sums the two halves still but a copy of the state is rotated based on high three bits of s1 before the summing.
PS: Word0 is the full summing, without any truncation, that produces complete 16-bit results for every iteration.
I remember getting the idea from watching Melissa's lecture where she talks about using the high bits for dynamically rotating in her algorithms.
Have you had a chance to try the parity idea, sum and state?
Here's the pseudo-code:
I like to know how this compares to the top byte of Xoroshiro32+ [14,2,7] but I don't have the testing tools.
Here's the two modifications to algorithm:
EDIT: Oops, I see a bug in the second code snippet. Doesn't matter though, the first one proves that parity is no good.
EDIT2: Hmm, I've also left out the lsbit of s0 and s1 ... I'll do that all again ... but I doubt it'll make it better ...
However, the problem with those type is they're costly to have a hardware implementation because of the multiply.
And the score table still looks the same:
CMWC is yet another PRNG by the late Prof. Marsaglia.
The numbers in the pseudo-code were chosen to make it as easy as possible to implement on 8-bit CPUs but does that affect the quality? Multiply by 253 is a byte shift and three subtracts.
Here's the scores from sum parity replacing the lsbit:
And the relevant source snippet:
This also proves that the state on its own without summing is hopeless.
Because PRN[n] is a function of PRN[n-r] not PRN[n-1]. If r=4096, a new PRN won't be used to generate another PRN until 4096 (or 4095) more iterations have been done, hence the lag.
EDIT:
This would take too long to be worth doing on the P2, probably.
That said, I'm all for squeezing 16 useful bits out of the XORO32 instruction.
The outcomes are just boggling me now! For this tiny change, I think it's the worst lsbit result I've had yet!
So the only change is start by setting parity to zero instead of whatever the sum was. This effectively skips the lsbit of the sum for the parity.
EDIT: I tried starting parity as 1 instead of 0 for the hell of it but no surprise word0 stays full of 8K and 16K scores.