Random/LFSR on P2

TonyB_ · 2017-10-06 00:45

evanh wrote: »

First twenty 17-bit (Xoroshiro34) sums for triplet [7,8,12] is:
00001 01181 1608d 0a75d 0bf68 1c71d 158e8 028f2 1c4a8 1c24f 14e6c 1f938 1f135 07a8e 068cd 15404 1d908 02648 1e45c 13161

First twenty 20-bit (Xoroshiro40) sums for triplet [7,9,12] is:
00001 01281 c6013 5a2f1 2f5ac 7590c fb594 b7ac1 2a644 33649 ebe23 66f73 18a54 fd4a2 4674a 590d3 37bc4 2cce9 3a9db 0ee4e

Evan, do you have a list of the best triplets in your opinion for different widths, say Xoroshiro24 to Xoroshiro40? If not, don't worry and if so, score tables not needed.

evanh · 2017-10-06 03:22

Yep, I've got a set of score tables for all those. Size 20 is the largest I've done though. Anything bigger took forever to brute force search for full period candidates.

I'll zip them up when home ...

evanh · 2017-10-06 03:48

Actually, these previous early tables might sufficient for your needs -
http://forums.parallax.com/discussion/comment/1410683/#Comment_1410683

evanh · 2017-10-06 10:01

Okay, I completed the current effort a few days later - http://forums.parallax.com/discussion/comment/1411074/#Comment_1411074 - and re-ran all the scoring but didn't post the full set. Here they go:

evanh · 2017-10-06 11:29

Here's my selection. Some of them were a real toss up and only chosen by least number of lowest scores. A method that's somewhat dependent on the sampling variants I've scored with.

 Algorithm      Triplet     Word1   Word2  Byte99  Byte49   Byte2   Byte1    Bit
=================================================================================
Xoroshiro24+  [ 8  2  3]     64M     32M      8M              8M      8M      4M
Xoroshiro26+  [ 9 12  3]     64M    128M     16M             16M     16M     16M
Xoroshiro28+  [11  5  2]      1G    256M     32M     64M     32M     32M     64M
Xoroshiro30+  [ 6 10  8]    512M    512M    256M    128M    256M     32M    256M
Xoroshiro32+  [14  2  7]      1G    512M      1G    512M    512M    128M      1G
Xoroshiro34+  [ 7  8 12]    512M      4G      2G      2G      2G    128M      4G
Xoroshiro36+  [ 8 12 13]      1G    512M      2G      8G    512M    128M     16G
Xoroshiro38+  [11  5 16]      8G      8G     32G      8G      4G    128M     64G
Xoroshiro40+  [ 7  9 12]     16G     16G     64G     32G     32G    256M    256G

EDIT: The "Bit" column exhibits a perfect scaling of scores! That blows me away. I don't have a clue why it's so rigidly alone like that. Pity it's also the scoring variant that takes many times longer to run.

TonyB_ · 2017-10-07 01:12

Many thanks, Evan.

I've just been reading this post in which Melissa O'Neill argues (right at the end) that it would be better to multiply the xoroshiro states rather than add them:
http://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html

As MUL takes the same time as ADD on the P2 there would be no performance penalty. It's not clear to me whether the product could be truncated to 16 bits as well as 8 bits. If so, would it perform better in randomness tests than the 16-bit sum? If not, two iterations would be needed to get a high-quality 16-bit result, but that's just the same as with adding.

evanh · 2017-10-07 02:16

I can certainly replace the summing with multiplying. Don't even have to re-run the full period searches.

Tony,
With a quick look at that webpage, I'm not making much sense of how he's done the Xorshift algorithm. I'm suspicious he's just looking at the state data directly, because otherwise PractRand would be throwing a wobbly on our testing quick smart I'd think.

evanh · 2017-10-07 02:24

TonyB_ wrote: »

... It's not clear to me whether the product could be truncated to 16 bits as well as 8 bits. If so, would it perform better in randomness tests than the 16-bit sum? If not, two iterations would be needed to get a high-quality 16-bit result, but that's just the same as with adding.

Assuming XORO32 based, why would multiplying two 16-bit values only produce 8 bits? Or is that not what you were talking about?

evanh · 2017-10-07 02:57

Hmm, what we need is the randogram code so I can reproduce his pretty pictures with selected triplets.

evanh · 2017-10-07 05:46

Well, simply replacing the addition with a multiply produces rubbish PractRand scores. Amusingly, the "bit" variant is still the same scores but most everything else barely makes it passed the starting blocks.

Here's the s12 multiplying scores for comparison:


    Xorshift24* PractRand Score Table

Combination    Word1   Word2  Byte04  Byte02   Byte2   Byte1    Bit
====================================================================
 [ 1  7  1]      8K     16K     64K              8K      2K      4M
 [ 1  7  5]      8K     16K     64K              8K      2K      4M
 [ 2  5 11]      8K     16K     64K              4K      2K      4M
 [ 3  1  8]      8K     16K     64K              8K      2K      4M
 [ 3  2 11]      8K     32K     64K              4K      2K      4M
 [ 3  5  3]      4K     16K     64K              4K      2K      4M
 [ 4  5  1]      8K     32K     64K              8K      2K      4M
 [ 5  1  4]      4K     16K    128K              4K      2K      4M
 [ 5  2 11]      8K     64K     64K              4K      2K      4M
 [ 5  7  9]      8K     32K     64K              8K      2K      4M
 [ 7  5  8]      4K     16K     64K              4K      2K      4M
 [ 8  3  1]      8K     32K    128K              8K      2K      4M
 [ 9  1  2]      8K     32K     64K              4K      2K      4M
 [ 9  1  5]      4K     32K    128K              2K      1K      4M


    Xoroshiro24* PractRand Score Table

Combination    Word1   Word2  Byte04  Byte02   Byte2   Byte1    Bit
====================================================================
 [ 1  1 10]      4K     16K    128K              8K      2K      4M
 [ 1  5  6]      4K      8K     64K              4K      2K      4M
 [ 2  2  3]      4K     16K    128K              8K      2K      4M
 [ 3  2  2]      8K     16K     64K              8K      2K      4M
 [ 3  2  8]      4K     16K     64K              4K      1K      4M
 [ 4  2 11]      4K     16K    128K              4K      2K      4M
 [ 4  4  7]      4K     32K     64K              4K      2K      4M
 [ 4  4  9]      4K     32K     64K              4K      2K      4M
 [ 4  9  7]      8K     32K     64K              8K      1K      4M
 [ 6  5  1]      4K     16K     32K              8K      2K      4M
 [ 7  4  4]      8K     16K     64K              4K      2K      4M
 [ 7  4  8]      8K     16K     64K              8K      1K      4M
 [ 7  9  4]      8K     32K     64K              8K      2K      4M
 [ 7 10 10]      4K     16K     64K              4K      2K      4M
 [ 8  2  3]      4K     32K     64K              8K      2K      4M  (Same triplet as favourites above)
 [ 8  2 11]      4K     16K    128K              4K      2K      4M
 [ 8  3  9]      8K      8K     64K              4K      2K      4M
 [ 8  4  7]      4K     16K     64K              8K      2K      4M
 [ 8  9 11]      4K      4K     64K              8K      2K      4M
 [ 9  3  8]      8K     16K     64K              4K      2K      4M
 [ 9  4  4]      8K     32K    128K              8K      2K      4M
 [10  1  1]      4K     16K     64K              4K      2K      4M
 [10 10  7]      4K     32K     64K              4K      2K      4M
 [11  2  4]      4K     16K     64K              8K      2K      4M
 [11  2  8]      4K     16K     64K              8K      2K      4M
 [11  9  8]      4K     16K    128K              4K      1K      4M

TonyB_ · 2017-10-07 12:37

I've added the author's name to my previous post. It's hard to be sure, but I think her "xoroshiro16star8" uses two 8-bit states that are multiplied by a 16-bit magic number ("MCG multiply") and only the high byte of the product kept.

evanh · 2017-10-07 13:45

Which one of my posts are you answering there Tony?

TonyB_ · 2017-10-08 02:03

All of them, but let's put this to one side for a while. Try doing a hexdump of the first few Xoroshiro32+ sums (eight is plenty) when seed is $32F3_1EDB

evanh · 2017-10-08 14:08

Seeding with s0 = $1EDB and s1 = $32F3, I get the following 16-bit sums: 51ce 6f54 796e fde9 1dbf dac1 7e47 0486 85d8 48f7 7df9 0102 0601 bf23 1f34 c6b5 39be 1909 7824 c941

evanh · 2017-10-12 12:33

evanh wrote: »

Dang, it just dawned on me, far too late, that DRAM prices have been going through the roof all this year! And even with those prices, supply appears to have gone to virtually nil.

One of my new 16GB DIMMs failed and there is no exact replacements in stock. I've got the full credit for them (Returned the pair) and can claim it as a refund, but the options for replacement are slim pickings.

I'm back using me old PC for the moment, and the remainder of the year looks bleak.

Well, things have progressed. I got the full refund for the faulty RAM and went elsewhere to get something faster. I figured if I was going to be paying inflated prices to get a decent set of DIMMs I may as well get something faster than I had. The alternative was something slower but still more expensive than original.

I've since discovered, and verified, that at least one program was having issues with the SMT flaw in the early Ryzens. I would have been happy to just disable SMT in the BIOS but the silly BIOS has a problem with that - It insists I had an unexpected power down every time it powers up - have to press F1 and ESC and ENTER every time I power it up with SMT disabled!

So now I'm sending the CPU back for a free exchange tomorrow - never done anything international like this before - AMD have directly emailed me a FedEx number to book it to. Conveniently there is a depot only ten minutes drive away ...

TonyB_ · 2017-10-20 11:58

TonyB_ wrote: »

evanh wrote: »

First twenty 17-bit (Xoroshiro34) sums for triplet [7,8,12] is:
00001 01181 1608d 0a75d 0bf68 1c71d 158e8 028f2 1c4a8 1c24f 14e6c 1f938 1f135 07a8e 068cd 15404 1d908 02648 1e45c 13161

First twenty 20-bit (Xoroshiro40) sums for triplet [7,9,12] is:
00001 01281 c6013 5a2f1 2f5ac 7590c fb594 b7ac1 2a644 33649 ebe23 66f73 18a54 fd4a2 4674a 590d3 37bc4 2cce9 3a9db 0ee4e

Evan, do you have a list of the best triplets in your opinion for different widths, say Xoroshiro24 to Xoroshiro40? If not, don't worry and if so, score tables not needed.

evanh wrote: »

Yep, I've got a set of score tables for all those. Size 20 is the largest I've done though. Anything bigger took forever to brute force search for full period candidates.

I'll zip them up when home ...

I've done Xoroshiro40+ on the Z80 and the triplet numbers were quite friendly. A final one I'd like to try is Xoroshiro48+ as every bit in two sets of three 8-bit registers would be used. There is a trick for finding full-period triplets without brute-force as that would take forever with Xoroshiro128+, if only we knew what the trick is.

evanh · 2017-10-20 13:57

Cool.

Yes, Xoroshiro128+ and 1024+ both have "jump" functions supplied in the sources from the authors.

Starting from bit 0 of both word halves of the PRNG state, it does an overlaying XOR of a working state with the current state based on what looks to be a specially crafted key, then calls the iterator, stepping the current state by one. This repeats for each bit of the word size (ie: half the state length).

And at the end it copies the working state over the current state.

Here's the actual code from original Xoroshiro128+ jump() function:

/* This is the jump function for the generator. It is equivalent
   to 2^64 calls to next(); it can be used to generate 2^64
   non-overlapping subsequences for parallel computations. */

void jump(void) {
	static const uint64_t JUMP[] = { 0xbeac0467eba5facb, 0xd86b048b86aa9922 };

	uint64_t s0 = 0;
	uint64_t s1 = 0;
	for(int i = 0; i < sizeof JUMP / sizeof *JUMP; i++)
		for(int b = 0; b < 64; b++) {
			if (JUMP[i] & 1ULL << b) {
				s0 ^= s[0];
				s1 ^= s[1];
			}
			next();
		}

	s[0] = s0;
	s[1] = s1;
}

evanh · 2017-10-20 14:07

Oops, sorry, it runs the word length again with a second key before copying the working state over the current state.

TonyB_ · 2017-10-21 00:03

C is mostly gibberish to me. I don't see how a jump function proves a certain triplet is full-period. I've just looked through the prng Google group and it wasn't any help really.

Heater. · 2017-10-21 01:03

Somewhere, some how, some mathematician can prove that some pseudo random number generator algorithm with some parameters will run though all, or most, of its combinations of states before returning to where it started.

I have no idea how.

Of course, running through a huge sequence does not imply a random looking sequence. A simple binary counter will run through such a sequence, step by step in order. We need something better to defeat the tests of "randomness".

I think understanding C is the least of the problems in understanding how these things work.

Certainly brute force running and checking them does not help. The periods can be longer to work through than the age of the universe!

What is certainly true is that tweaking the number of bits in the state or messing with the parameters can ruin a good PRNG.

I start to believe that the search for a better pseudo random number generator drives people insane. As this thread shows

I felt it myself before I got help from "PRNG Anonymous". For example this from Stanford:

PCG: A Family of Better Random Number Generators:

evanh · 2017-10-21 01:10

The top three lines of the source is a comment, so that's plain english at least.

Basically, this particular jump() config is not useful for finding any full period triplets because even it still has to be called 2^64 times to hop all the way along.

So, presumably it's possible to built jump functions that steps further than a square-root into the period. But they're not inclined to explain, or maybe it's just really hard to work out. I have no idea.

evanh · 2017-10-21 01:20

Heater. wrote: »

What is certainly true is that tweaking the number of bits in the state or messing with the parameters can ruin a good PRNG.

Full period may be solved at a higher analytical level but the quality certainly isn't. Tony's link to Melissa's visualisation work is a good demo of how quality is accessed. And that's also why PractRand and BigCrush are heavily relied on.

Heater. · 2017-10-21 01:23

Hmm... if it has to be called 2^64 times I would hardly call it a "jump" function.

Certainly jump functions can make huge strides into the state space. I have no idea about the square-root thing.

Me too. I have no idea.

TonyB_ · 2017-10-26 19:51

As we all know, bit 0 of the Xoroshiro+ sum is a simple LFSR bit and fails the binary-rank test but is probably usable in many cases. More details here: http://xoroshiro.di.unimi.it/

I have found a trick that apparently makes all the bits of the sum pass the binary-rank test:
https://forum.powerbasic.com/forum/user-to-user-discussions/source-code/749403-xorshift-prng?p=751037#post751037 (post #8).

It's a bit of a dense read and I think what David Roberts does is use xoroshiro128+ to create a 64-bit sum, then splits that into a pair of 32-bit values to get two PRNs for the price of one iteration. He replaces bit 0 of the low long with its parity, on the basis that parity should be random.

This would take two extra instructions with XORO32 on the P2, one to get the parity of sum[15:0] into C and the second to put C into sum[0]. To be honest, it wouldn't take a lot more to do two iterations but that would waste half the XORO32 bits.

I'm wondering whether the parity of the state could be used instead. If this produced equally random results and if the XORO32 instruction flagged parity (and zero) the same way as AND/OR/XOR/TEST, then only one extra instruction BITC sum,#0 would be needed to improve the XORO32 output, with no change to the underlying algorithm.

Currently:
sum[0]	= state[0] XOR state[16]

Sum parity:
sum[0]	= sum[0] XOR sum[1] ... XOR sum[15]
	= state[0] XOR state[16] XOR sum[1] ... XOR sum[15]

State parity:
sum[0]	= state[0] XOR state[1] ... XOR state[16] ... XOR state[31]

evanh · 2017-10-26 21:38

I've arbitrarily tried a few of those ideas myself. Most end up with total fails on Practrand, some have given equivalent quality ... hmm, I've not tried to use all bits at once though, since that part I'd long moved outside the generator code. Good point, I'll have another fiddle ...

TonyB_ · 2017-10-26 21:57

Before he used the parity trick, David Roberts XORed the two halves of the 64-bit sum to get one 32-bit value and he said both methods passed through PractRand.

evanh · 2017-10-26 22:13

That, and I've only just got the replacement CPU on warranty exchange from AMD. They took a tad longer than promised to get it out to me. I'm still trying to verify if the SMT flaw is completely solved or not. I think everyone is worried that AMD are not telling the whole story with it.

In my latest testing it has hiccup'd once already, but is a specific set of circumstances that may not be directly related. Lots more testing to still go on this one as well ...

TonyB_ · 2017-10-26 22:40

The results of sum[15:1,parity] don't need to be as good as sum[15:1] as long as they are better than sum[15:0].

TonyB_ · 2017-10-27 00:54

On the PowerBASIC forum, David Roberts said:

I have found out why the original code had BRank failures.

I ran tests dumping just the lower Dword of the Qword output and tests dumping just the upper Dword of the Qword output. It seemed that BRank failures only occurred with the lower Dword.

I did some further reading and found this with regard Xoroshiro128+ and, I suspect, it is true for Xorhift128+ as well.

Beside passing BigCrush, this generator passes the PractRand test suite up to (and included) 16TB, with the exception of binary rank tests, which fail due to the lowest bit being an LFSR; all other bits pass all tests. We suggest to use a sign test to extract a random Boolean value.

So, we have an errant bit 0 in the Qword.

If we have two Dwords, X random and Y not random then X Xor Y will be random. That is why the Xor method worked. However, bit 0 will still be errant for the double precision values.

We could simply cast out the lower Dword saving us time with the Xor method. However, we will now have to go past the 'engine' twice to collect two 'good' Dwords for the Double precision values; not good for a 64 bit generator. Alternatively, we can try and rectify bit 0.

I tried the suggestion of using a sign test but all I got was a toggling of bit 0, which simply moved the location of a potential failure.

"We suggest to use a sign test ..." confused me for a while as it could be interpreted as a suggestion for solving the bit 0 "problem" by using the sign bit, but all it really means is choose the msb if you want a single random bit (because that's the most random one).

evanh · 2017-10-27 01:43

TonyB_ wrote: »

... but all it really means is choose the msb if you want a single random bit (because that's the most random one).

Yeah, I'm guessing that's entirely common knowledge in the field for this type of algorithm. Melissa refers to it that way. No doubt there is a math theory to back this up.

Chip asked for the msb as one of my score variants and it's the only bit position I've singly sampled in my tests. And Practrand scores on this variant test have, without fail, squared for each pair of extra bits added to the state size. It's a little eye popping for me.

BTW: The reason why the "bit" variant takes so much longer to run the scoring is because it iterates 8 times and stuffs 8 msb's into one byte before piping that byte out to Practrand. Practrand is multithreaded and I've got 8 cores, so Practrand can absorb the data stream as fast as the generator can feed it the data.

Random/LFSR on P2

Comments