First twenty 17-bit (Xoroshiro34) sums for triplet [7,8,12] is:
00001 01181 1608d 0a75d 0bf68 1c71d 158e8 028f2 1c4a8 1c24f 14e6c 1f938 1f135 07a8e 068cd 15404 1d908 02648 1e45c 13161
First twenty 20-bit (Xoroshiro40) sums for triplet [7,9,12] is:
00001 01281 c6013 5a2f1 2f5ac 7590c fb594 b7ac1 2a644 33649 ebe23 66f73 18a54 fd4a2 4674a 590d3 37bc4 2cce9 3a9db 0ee4e
Evan, do you have a list of the best triplets in your opinion for different widths, say Xoroshiro24 to Xoroshiro40? If not, don't worry and if so, score tables not needed.
Yep, I've got a set of score tables for all those. Size 20 is the largest I've done though. Anything bigger took forever to brute force search for full period candidates.
Here's my selection. Some of them were a real toss up and only chosen by least number of lowest scores. A method that's somewhat dependent on the sampling variants I've scored with.
EDIT: The "Bit" column exhibits a perfect scaling of scores! That blows me away. I don't have a clue why it's so rigidly alone like that. Pity it's also the scoring variant that takes many times longer to run.
As MUL takes the same time as ADD on the P2 there would be no performance penalty. It's not clear to me whether the product could be truncated to 16 bits as well as 8 bits. If so, would it perform better in randomness tests than the 16-bit sum? If not, two iterations would be needed to get a high-quality 16-bit result, but that's just the same as with adding.
I can certainly replace the summing with multiplying. Don't even have to re-run the full period searches.
Tony,
With a quick look at that webpage, I'm not making much sense of how he's done the Xorshift algorithm. I'm suspicious he's just looking at the state data directly, because otherwise PractRand would be throwing a wobbly on our testing quick smart I'd think.
... It's not clear to me whether the product could be truncated to 16 bits as well as 8 bits. If so, would it perform better in randomness tests than the 16-bit sum? If not, two iterations would be needed to get a high-quality 16-bit result, but that's just the same as with adding.
Assuming XORO32 based, why would multiplying two 16-bit values only produce 8 bits? Or is that not what you were talking about?
Well, simply replacing the addition with a multiply produces rubbish PractRand scores. Amusingly, the "bit" variant is still the same scores but most everything else barely makes it passed the starting blocks.
I've added the author's name to my previous post. It's hard to be sure, but I think her "xoroshiro16star8" uses two 8-bit states that are multiplied by a 16-bit magic number ("MCG multiply") and only the high byte of the product kept.
All of them, but let's put this to one side for a while. Try doing a hexdump of the first few Xoroshiro32+ sums (eight is plenty) when seed is $32F3_1EDB
Dang, it just dawned on me, far too late, that DRAM prices have been going through the roof all this year! And even with those prices, supply appears to have gone to virtually nil.
One of my new 16GB DIMMs failed and there is no exact replacements in stock. I've got the full credit for them (Returned the pair) and can claim it as a refund, but the options for replacement are slim pickings.
I'm back using me old PC for the moment, and the remainder of the year looks bleak.
Well, things have progressed. I got the full refund for the faulty RAM and went elsewhere to get something faster. I figured if I was going to be paying inflated prices to get a decent set of DIMMs I may as well get something faster than I had. The alternative was something slower but still more expensive than original.
I've since discovered, and verified, that at least one program was having issues with the SMT flaw in the early Ryzens. I would have been happy to just disable SMT in the BIOS but the silly BIOS has a problem with that - It insists I had an unexpected power down every time it powers up - have to press F1 and ESC and ENTER every time I power it up with SMT disabled!
So now I'm sending the CPU back for a free exchange tomorrow - never done anything international like this before - AMD have directly emailed me a FedEx number to book it to. Conveniently there is a depot only ten minutes drive away ...
First twenty 17-bit (Xoroshiro34) sums for triplet [7,8,12] is:
00001 01181 1608d 0a75d 0bf68 1c71d 158e8 028f2 1c4a8 1c24f 14e6c 1f938 1f135 07a8e 068cd 15404 1d908 02648 1e45c 13161
First twenty 20-bit (Xoroshiro40) sums for triplet [7,9,12] is:
00001 01281 c6013 5a2f1 2f5ac 7590c fb594 b7ac1 2a644 33649 ebe23 66f73 18a54 fd4a2 4674a 590d3 37bc4 2cce9 3a9db 0ee4e
Evan, do you have a list of the best triplets in your opinion for different widths, say Xoroshiro24 to Xoroshiro40? If not, don't worry and if so, score tables not needed.
Yep, I've got a set of score tables for all those. Size 20 is the largest I've done though. Anything bigger took forever to brute force search for full period candidates.
I'll zip them up when home ...
I've done Xoroshiro40+ on the Z80 and the triplet numbers were quite friendly. A final one I'd like to try is Xoroshiro48+ as every bit in two sets of three 8-bit registers would be used. There is a trick for finding full-period triplets without brute-force as that would take forever with Xoroshiro128+, if only we knew what the trick is.
Yes, Xoroshiro128+ and 1024+ both have "jump" functions supplied in the sources from the authors.
Starting from bit 0 of both word halves of the PRNG state, it does an overlaying XOR of a working state with the current state based on what looks to be a specially crafted key, then calls the iterator, stepping the current state by one. This repeats for each bit of the word size (ie: half the state length).
And at the end it copies the working state over the current state.
Here's the actual code from original Xoroshiro128+ jump() function:
/* This is the jump function for the generator. It is equivalent
to 2^64 calls to next(); it can be used to generate 2^64
non-overlapping subsequences for parallel computations. */
void jump(void) {
static const uint64_t JUMP[] = { 0xbeac0467eba5facb, 0xd86b048b86aa9922 };
uint64_t s0 = 0;
uint64_t s1 = 0;
for(int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
for(int b = 0; b < 64; b++) {
if (JUMP[i] & 1ULL << b) {
s0 ^= s[0];
s1 ^= s[1];
}
next();
}
s[0] = s0;
s[1] = s1;
}
C is mostly gibberish to me. I don't see how a jump function proves a certain triplet is full-period. I've just looked through the prng Google group and it wasn't any help really.
Somewhere, some how, some mathematician can prove that some pseudo random number generator algorithm with some parameters will run though all, or most, of its combinations of states before returning to where it started.
I have no idea how.
Of course, running through a huge sequence does not imply a random looking sequence. A simple binary counter will run through such a sequence, step by step in order. We need something better to defeat the tests of "randomness".
I think understanding C is the least of the problems in understanding how these things work.
Certainly brute force running and checking them does not help. The periods can be longer to work through than the age of the universe!
What is certainly true is that tweaking the number of bits in the state or messing with the parameters can ruin a good PRNG.
I start to believe that the search for a better pseudo random number generator drives people insane. As this thread shows I felt it myself before I got help from "PRNG Anonymous". For example this from Stanford:
The top three lines of the source is a comment, so that's plain english at least. Basically, this particular jump() config is not useful for finding any full period triplets because even it still has to be called 2^64 times to hop all the way along.
So, presumably it's possible to built jump functions that steps further than a square-root into the period. But they're not inclined to explain, or maybe it's just really hard to work out. I have no idea.
What is certainly true is that tweaking the number of bits in the state or messing with the parameters can ruin a good PRNG.
Full period may be solved at a higher analytical level but the quality certainly isn't. Tony's link to Melissa's visualisation work is a good demo of how quality is accessed. And that's also why PractRand and BigCrush are heavily relied on.
As we all know, bit 0 of the Xoroshiro+ sum is a simple LFSR bit and fails the binary-rank test but is probably usable in many cases. More details here: http://xoroshiro.di.unimi.it/
It's a bit of a dense read and I think what David Roberts does is use xoroshiro128+ to create a 64-bit sum, then splits that into a pair of 32-bit values to get two PRNs for the price of one iteration. He replaces bit 0 of the low long with its parity, on the basis that parity should be random.
This would take two extra instructions with XORO32 on the P2, one to get the parity of sum[15:0] into C and the second to put C into sum[0]. To be honest, it wouldn't take a lot more to do two iterations but that would waste half the XORO32 bits.
I'm wondering whether the parity of the state could be used instead. If this produced equally random results and if the XORO32 instruction flagged parity (and zero) the same way as AND/OR/XOR/TEST, then only one extra instruction BITC sum,#0 would be needed to improve the XORO32 output, with no change to the underlying algorithm.
I've arbitrarily tried a few of those ideas myself. Most end up with total fails on Practrand, some have given equivalent quality ... hmm, I've not tried to use all bits at once though, since that part I'd long moved outside the generator code. Good point, I'll have another fiddle ...
Before he used the parity trick, David Roberts XORed the two halves of the 64-bit sum to get one 32-bit value and he said both methods passed through PractRand.
That, and I've only just got the replacement CPU on warranty exchange from AMD. They took a tad longer than promised to get it out to me. I'm still trying to verify if the SMT flaw is completely solved or not. I think everyone is worried that AMD are not telling the whole story with it.
In my latest testing it has hiccup'd once already, but is a specific set of circumstances that may not be directly related. Lots more testing to still go on this one as well ...
I have found out why the original code had BRank failures.
I ran tests dumping just the lower Dword of the Qword output and tests dumping just the upper Dword of the Qword output. It seemed that BRank failures only occurred with the lower Dword.
I did some further reading and found this with regard Xoroshiro128+ and, I suspect, it is true for Xorhift128+ as well.
Beside passing BigCrush, this generator passes the PractRand test suite up to (and included) 16TB, with the exception of binary rank tests, which fail due to the lowest bit being an LFSR; all other bits pass all tests. We suggest to use a sign test to extract a random Boolean value.
So, we have an errant bit 0 in the Qword.
If we have two Dwords, X random and Y not random then X Xor Y will be random. That is why the Xor method worked. However, bit 0 will still be errant for the double precision values.
We could simply cast out the lower Dword saving us time with the Xor method. However, we will now have to go past the 'engine' twice to collect two 'good' Dwords for the Double precision values; not good for a 64 bit generator. Alternatively, we can try and rectify bit 0.
I tried the suggestion of using a sign test but all I got was a toggling of bit 0, which simply moved the location of a potential failure.
"We suggest to use a sign test ..." confused me for a while as it could be interpreted as a suggestion for solving the bit 0 "problem" by using the sign bit, but all it really means is choose the msb if you want a single random bit (because that's the most random one).
... but all it really means is choose the msb if you want a single random bit (because that's the most random one).
Yeah, I'm guessing that's entirely common knowledge in the field for this type of algorithm. Melissa refers to it that way. No doubt there is a math theory to back this up.
Chip asked for the msb as one of my score variants and it's the only bit position I've singly sampled in my tests. And Practrand scores on this variant test have, without fail, squared for each pair of extra bits added to the state size. It's a little eye popping for me.
BTW: The reason why the "bit" variant takes so much longer to run the scoring is because it iterates 8 times and stuffs 8 msb's into one byte before piping that byte out to Practrand. Practrand is multithreaded and I've got 8 cores, so Practrand can absorb the data stream as fast as the generator can feed it the data.
Comments
Evan, do you have a list of the best triplets in your opinion for different widths, say Xoroshiro24 to Xoroshiro40? If not, don't worry and if so, score tables not needed.
I'll zip them up when home ...
http://forums.parallax.com/discussion/comment/1410683/#Comment_1410683
EDIT: The "Bit" column exhibits a perfect scaling of scores! That blows me away. I don't have a clue why it's so rigidly alone like that. Pity it's also the scoring variant that takes many times longer to run.
I've just been reading this post in which Melissa O'Neill argues (right at the end) that it would be better to multiply the xoroshiro states rather than add them:
http://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html
As MUL takes the same time as ADD on the P2 there would be no performance penalty. It's not clear to me whether the product could be truncated to 16 bits as well as 8 bits. If so, would it perform better in randomness tests than the 16-bit sum? If not, two iterations would be needed to get a high-quality 16-bit result, but that's just the same as with adding.
Tony,
With a quick look at that webpage, I'm not making much sense of how he's done the Xorshift algorithm. I'm suspicious he's just looking at the state data directly, because otherwise PractRand would be throwing a wobbly on our testing quick smart I'd think.
Here's the s12 multiplying scores for comparison:
I've since discovered, and verified, that at least one program was having issues with the SMT flaw in the early Ryzens. I would have been happy to just disable SMT in the BIOS but the silly BIOS has a problem with that - It insists I had an unexpected power down every time it powers up - have to press F1 and ESC and ENTER every time I power it up with SMT disabled!
So now I'm sending the CPU back for a free exchange tomorrow - never done anything international like this before - AMD have directly emailed me a FedEx number to book it to. Conveniently there is a depot only ten minutes drive away ...
I've done Xoroshiro40+ on the Z80 and the triplet numbers were quite friendly. A final one I'd like to try is Xoroshiro48+ as every bit in two sets of three 8-bit registers would be used. There is a trick for finding full-period triplets without brute-force as that would take forever with Xoroshiro128+, if only we knew what the trick is.
Yes, Xoroshiro128+ and 1024+ both have "jump" functions supplied in the sources from the authors.
Starting from bit 0 of both word halves of the PRNG state, it does an overlaying XOR of a working state with the current state based on what looks to be a specially crafted key, then calls the iterator, stepping the current state by one. This repeats for each bit of the word size (ie: half the state length).
And at the end it copies the working state over the current state.
Here's the actual code from original Xoroshiro128+ jump() function:
I have no idea how.
Of course, running through a huge sequence does not imply a random looking sequence. A simple binary counter will run through such a sequence, step by step in order. We need something better to defeat the tests of "randomness".
I think understanding C is the least of the problems in understanding how these things work.
Certainly brute force running and checking them does not help. The periods can be longer to work through than the age of the universe!
What is certainly true is that tweaking the number of bits in the state or messing with the parameters can ruin a good PRNG.
I start to believe that the search for a better pseudo random number generator drives people insane. As this thread shows I felt it myself before I got help from "PRNG Anonymous". For example this from Stanford:
PCG: A Family of Better Random Number Generators:
So, presumably it's possible to built jump functions that steps further than a square-root into the period. But they're not inclined to explain, or maybe it's just really hard to work out. I have no idea.
Certainly jump functions can make huge strides into the state space. I have no idea about the square-root thing.
Me too. I have no idea.
I have found a trick that apparently makes all the bits of the sum pass the binary-rank test:
https://forum.powerbasic.com/forum/user-to-user-discussions/source-code/749403-xorshift-prng?p=751037#post751037 (post #8).
It's a bit of a dense read and I think what David Roberts does is use xoroshiro128+ to create a 64-bit sum, then splits that into a pair of 32-bit values to get two PRNs for the price of one iteration. He replaces bit 0 of the low long with its parity, on the basis that parity should be random.
This would take two extra instructions with XORO32 on the P2, one to get the parity of sum[15:0] into C and the second to put C into sum[0]. To be honest, it wouldn't take a lot more to do two iterations but that would waste half the XORO32 bits.
I'm wondering whether the parity of the state could be used instead. If this produced equally random results and if the XORO32 instruction flagged parity (and zero) the same way as AND/OR/XOR/TEST, then only one extra instruction BITC sum,#0 would be needed to improve the XORO32 output, with no change to the underlying algorithm.
In my latest testing it has hiccup'd once already, but is a specific set of circumstances that may not be directly related. Lots more testing to still go on this one as well ...
"We suggest to use a sign test ..." confused me for a while as it could be interpreted as a suggestion for solving the bit 0 "problem" by using the sign bit, but all it really means is choose the msb if you want a single random bit (because that's the most random one).
Chip asked for the msb as one of my score variants and it's the only bit position I've singly sampled in my tests. And Practrand scores on this variant test have, without fail, squared for each pair of extra bits added to the state size. It's a little eye popping for me.
BTW: The reason why the "bit" variant takes so much longer to run the scoring is because it iterates 8 times and stuffs 8 msb's into one byte before piping that byte out to Practrand. Practrand is multithreaded and I've got 8 cores, so Practrand can absorb the data stream as fast as the generator can feed it the data.