xoroshiro++ random number generator
TonyB_
Posts: 2,178
xoroshiro++ was created by Sebastiano Vigna in collaboration with David Blackman towards the end of 2017 and is an improvement of their xoroshiro+ algorithm. Details of xoroshiro128+, a 128-bit state version of xoroshiro+ with 64-bit output, can be found at http://xoroshiro.di.unimi.it/
Sebastiano told me about xoroshiro++ in November 2017 and at that time it was a candidate to replace xoroshiro128+. Evan (who has done a huge amount of xoroshiro testing using PractRand) and Chip were informed but the details could not be made public. Sebastiano and David have decided a different algorithm will supersede xoroshiro128+ and now we are able to discuss xoroshiro++ freely.
Our tests have shown that xoroshiro++ is a far better random number generator than xoroshiro+. It deserves to be known about and used and that is the main purpose of this post. The story of how various xoroshiro versions have been implemented on the P2 can be read in this long topic:
https://forums.parallax.com/discussion/166176/random-lfsr-on-p2
What is missing from that thread is the xoroshiro++ algorithm, which is the same as xoroshiro+ with two extra steps at the end. The (s0 + s1) sum is rotated left by the number of bits given by a new constant d, then this rotated sum is added to s0 to give the output. The extra rotation and addition scramble the output bits more than with xoroshiro+, thus making them more random. As there are now two additions the enhanced algorithm is called xoroshiro++ and a quadruple of four constants [a,b,c,d] defines it, instead of the [a,b,c] triple in xoroshiro+.
Here is the xoroshiro++ pseudo code:
Note that constant d affects only the PRN output and the state calculations in xoroshiro+ and xoroshiro++ are identical. To test xoroshiro++, the same full-period [a,b,c] triples as in xoroshiro+ are used, then the possible values of d tried in turn. The results can vary greatly for different d, even between those that differ by only one.
The P2 XORO32 D instruction does a double-iteration of xoroshiro32++. The state D is advanced by two iterations and the resulting state is written back to D, the intermediate state is not available to the user. The 32-bit PRN output is a concatenation of two successive 16-bit outputs, the low word is the first and the high word the second. This 32-bit value is copied to the S field of the instruction following XORO32.
To save some time, both in hardware and in software if instructions can be executed in parallel, the PRN output can be calculated at the start of the algorithm, not at the end. The XORO32 logic was modified in this way for the final version and here is the corresponding xoroshiro++ pseudo code:
After much testing with PractRand the quadruple [14,2,7,6] was selected in December 2017. Following more tests a decision was made in the last few days to change to [14,2,7,5], however this was probably not really necessary. (The free-running hub-based PRNG uses the original xoroshiro128+ [55,14,36] algorithm.)
An intrinsic property of both xoroshiro+ and xoroshiro++ is equidistribution. All outputs occur the same number of times, except for zero. For xoroshiro32+/++, the frequency of every non-zero 16-bit output is 2^16 and the frequency of zero is 2^16-1. A similar equidistribution applies to bytes and bits. The initial state for seeding can be any non-zero value.
xoroshiro++ can be ported easily to other CPUs and languages, for example x86, Z80, C and BASIC. Tests have shown that good results can be obtained even with a 16-bit state, 8-bit output and period of 2^16-1.
EDIT:
Quadruple changed to [13,5,10,9] for Rev B and later. See post 16 below.
Sebastiano told me about xoroshiro++ in November 2017 and at that time it was a candidate to replace xoroshiro128+. Evan (who has done a huge amount of xoroshiro testing using PractRand) and Chip were informed but the details could not be made public. Sebastiano and David have decided a different algorithm will supersede xoroshiro128+ and now we are able to discuss xoroshiro++ freely.
Our tests have shown that xoroshiro++ is a far better random number generator than xoroshiro+. It deserves to be known about and used and that is the main purpose of this post. The story of how various xoroshiro versions have been implemented on the P2 can be read in this long topic:
https://forums.parallax.com/discussion/166176/random-lfsr-on-p2
What is missing from that thread is the xoroshiro++ algorithm, which is the same as xoroshiro+ with two extra steps at the end. The (s0 + s1) sum is rotated left by the number of bits given by a new constant d, then this rotated sum is added to s0 to give the output. The extra rotation and addition scramble the output bits more than with xoroshiro+, thus making them more random. As there are now two additions the enhanced algorithm is called xoroshiro++ and a quadruple of four constants [a,b,c,d] defines it, instead of the [a,b,c] triple in xoroshiro+.
Here is the xoroshiro++ pseudo code:
;{s1,s0} = state (input and output) ;prn = pseudo-random number (output) ;tmp and prn can be the same register ;xoroshiro+ xor s1,s0 ;s1 = s1 ^ s0 mov tmp,s1 rol s0,a ;s0 = s0 rol a shl tmp,b ;tmp = (s1 ^ s0) shl b xor s0,tmp xor s0,s1 ;s0 = s0 rol a ^ (s1 ^ s0) shl b ^ s1 ^ s0 rol s1,c ;s1 = (s1 ^ s0) rol c mov prn,s0 add prn,s1 ;prn = s0 + s1 ;xoroshiro++ enhancement rol prn,d add prn,s0 ;prn = (s0 + s1) rol d + s0
Note that constant d affects only the PRN output and the state calculations in xoroshiro+ and xoroshiro++ are identical. To test xoroshiro++, the same full-period [a,b,c] triples as in xoroshiro+ are used, then the possible values of d tried in turn. The results can vary greatly for different d, even between those that differ by only one.
The P2 XORO32 D instruction does a double-iteration of xoroshiro32++. The state D is advanced by two iterations and the resulting state is written back to D, the intermediate state is not available to the user. The 32-bit PRN output is a concatenation of two successive 16-bit outputs, the low word is the first and the high word the second. This 32-bit value is copied to the S field of the instruction following XORO32.
To save some time, both in hardware and in software if instructions can be executed in parallel, the PRN output can be calculated at the start of the algorithm, not at the end. The XORO32 logic was modified in this way for the final version and here is the corresponding xoroshiro++ pseudo code:
;{s1,s0} = state (input and output) ;prn = pseudo-random number (output) mov prn,s0 add prn,s1 ;prn = s0 + s1 rol prn,d add prn,s0 ;prn = (s0 + s1) rol d + s0 xor s1,s0 ;s1 = s1 ^ s0 mov tmp,s1 rol s0,a ;s0 = s0 rol a shl tmp,b ;tmp = (s1 ^ s0) shl b xor s0,tmp xor s0,s1 ;s0 = s0 rol a ^ (s1 ^ s0) shl b ^ s1 ^ s0 rol s1,c ;s1 = (s1 ^ s0) rol c
After much testing with PractRand the quadruple [14,2,7,6] was selected in December 2017. Following more tests a decision was made in the last few days to change to [14,2,7,5], however this was probably not really necessary. (The free-running hub-based PRNG uses the original xoroshiro128+ [55,14,36] algorithm.)
An intrinsic property of both xoroshiro+ and xoroshiro++ is equidistribution. All outputs occur the same number of times, except for zero. For xoroshiro32+/++, the frequency of every non-zero 16-bit output is 2^16 and the frequency of zero is 2^16-1. A similar equidistribution applies to bytes and bits. The initial state for seeding can be any non-zero value.
xoroshiro++ can be ported easily to other CPUs and languages, for example x86, Z80, C and BASIC. Tests have shown that good results can be obtained even with a 16-bit state, 8-bit output and period of 2^16-1.
EDIT:
Quadruple changed to [13,5,10,9] for Rev B and later. See post 16 below.
Comments
This is very confusing.
That's what makes it so random. Even more than before.
Heater, ask some questions. It would be good to know if some terms need explained or more fleshing out of the events and decisions.
I don't know what questions to ask. Having been fascinated by this topic for some years I have made some observations:
1) Back in the early days of computing people slapped all kind of operations together, in a haphazard way, in the hope that the output would be kind of random.
2) As the years went by all kinds of criticisms were leveled at those early PRNGs. They have short periods, they have a bias, and so on.
3) Then there were guys like Prof. George Marsaglia, creating new PRNGs but also backing them up with some mathematical proofs about their properties. Like the period and such.
4) Today, with the xoroshio guys, we seem to be back to step 1). For example today I notice they have xorshift1024*φ, which smacks of "Oh, lets throw in a multiply by some weird number to stir things up a bit" without any explanation. I mean, why φ? Why not π? Or e? Or 2 for that matter?
5) If your PRNG is supposed to have some huge period, like 2^1024 − 1 or whatever, then please show why that might be. It's impossible to test for this. Even if I don't understand the proof it would be nice to know there is one that somebody does.
And finally, I do wish creators of PRNG algorithms would provide some test data. Some sequence that is generated from a known initial state. It's so easy for the algorithms to me implemented wrongly when translated to other programming languages. Or just implemented badly through not understanding the specification.
PRN calculated at end of algorithm after new state generated
First 32 outputs for xoroshiro32++ [14,2,7,5] when seed = 1 {s1 = 0, s0 = 1}
PRN calculated at start of algorithm before new state generated
First 16 outputs for XORO32 when seed = 1 {s1 = 0, s0 = 1}
PRN calculated at start of algorithm before new state generated
Heater,
To answer the last point first, I've posted some test data above, with more to come.
4) Some of the work started by the late Prof. Marsaglia is being carried on and improved by Sebastiano Vigna and David Blackman. I have no doubt that they know what they are doing so you need to be a bit careful what you write - they might be reading! From my personal experience of it, I think xoroshiro++ is very good and yet a forthcoming paper will not even mention it, which means (a) the replacement for xoroshiro128+ must be rather special and (b) it is up to us here to publicize xoroshiro++ because nobody else will.
There are some rigorous random number tests available and a representation of φ (arguably the most irrational number) would not have been chosen just on a whim. Multiplication scrambles bits better than addition and lots of odd numbers would do a good job probably, but a discussion of xorshift1024*φ and other generators that require a multiply is off-topic here.
5) The proof that certain classes of generator with n-bit states have a periods of 2^n-1 is not difficult, I understand, for mathematicians anyway. As I've mentioned before, we know that xoroshiro32 has a period of a 2^32-1 because the full period can be iterated in a matter of seconds.
There was some polynomial associated with each listed candidate. Chris may have provided something for his example too. Certainly he wasn't phased/daunted by large periods.
It seems these sort of capabilities are well understood in the research field so no one much details them.
I guess Xoroshiro isn't trying to be an educational product.
I am careful what I write. No disrespect to Sebastiano and David, just posing the questions.
I do like the xoroshiro generators for their small size, speed and lack of multiply, divide or modulo. Which makes them ideal for the Propeller. That is why I posted a Spin version of David Jones' JKISS32 here years ago. Another derivative of Marsaglia's work.
http://www0.cs.ucl.ac.uk/staff/d.jones/GoodPracticeRNG.pdf
https://forums.parallax.com/discussion/136975/jkiss32-high-quality-prng-with-realrandom-seed-save-a-cog
As for the maths behind all this. Likely I'm never going to understand that. Which is why I like to see sample output from the algorithms creators. Something to test my, likely buggy, implementations against.
Scrambled Linear Pseudorandom Number Generators
More info and PDF download at http://xoshiro.di.unimi.it/
PDF also available at https://arxiv.org/abs/1805.01407
Please read the paper before passing comment. Even if you don't understand some or much of it, the amount of thought and testing that has been done should be apparent.
It's good advanced publicity for Parallax and we can talk about all of the PRNG logic in the P2 freely now. Each cog has a xoroshiro32++ and the hub has a xoroshiro128**. The scrambler for the latter was changed from + to ** in early April, improving the quality and allowing all 64 output bits to be used equally.
The P2 does not use the new xoshiro generator, which has four times as many state as output bits. Something for the future, possibly.
You are right I understand very little of it.
Only one niggle:
Would it be so hard to include complete, working, examples together with some output for test data? In C, pseudo code or whatever language.
Nice to see that Parallax, Chip and co. get a mention.
I think we've got the best-bang-for-the-buck PRNG's in the Prop2.
Thanks for all of your and Evan's work, and also Sebastiano's help.
Note: For anyone reading my prior postings and sources - Engine and scrambler is what I've been calling iterator and hash respectively.
The modified Verilog code:
https://forums.parallax.com/discussion/comment/1448460/#Comment_1448460
PRN calculated at start of algorithm before new state generated
First 16 outputs for XORO32 when seed = 1
PRN calculated at start of algorithm before new state generated
I'm not making any sense out of the results you posted above for "First 16 outputs for XORO32 using xoroshiro32++[13,5,10,9] with seed = 1.
PRN calculated at start of algorithm before new state generated.". The first output you have is 62690201 which according to the present algorithm is not possible.
If you are doing the PRN calculation first then on the first run it is :
prn = (s0 + s1) rol d + s0
= (0x00000001 + 0x00000000) rol 9 + 0x00000001
= 0x00000201
Given that the seed is single 1 bit there is no way to get all the bits required to make 0x62690201
I was trying to make a C++ version of the algorithm above but it produces different results. The results: The only thing that matches is the low 12 bits of the first result, which is the 0x201 I calculated above. Where did all the extra bits in your 62690201 come from?
Is there an xoroshiro++ in C or C++ with example output anywhere? I could not find one.
I can see your variables and shifters are 32-bit. They should all be 16-bit.
Tony has always done it as engine iteration first, then scrambler summing. For quite a while Chip had XORO32 also following that order of Tony's, but later changed it to summing first followed by engine iterate when XORO32 had become the longest (critical) timing path.
It doesn't change the random data, just swaps the shortwords around.
Many moons ago I made the comment here: "Would the real xoroshiro please stand up! This is very confusing."
Well, it's still confusing
The results I was attempting to recreate are for what TonyB_ described above as "XORO32 using xoroshiro32++[13,5,10,9] with seed = 1. PRN calculated at start of algorithm before new state generated."
There is nothing there that indicates 16 bit processing. The name has "32" in it. The presented code shows 32 bit operations. The present results are 32 bit. He says right there that he is extracting PRN before generating the new state.
Why only 16 bit variables when we have a 32 bit P2?
More confused than ever. What is it you guys have been testing and is now in the P2?
XORO32 does a double-iteration of xoroshiro32++. When seed = 1, 0201 is the first output and low word, 6269 the second output and high word, of the first 32-bit XORO32 result.
I've added single-iterated test outputs to
http://forums.parallax.com/discussion/comment/1448906/#Comment_1448906
I did post actual C source code for you yesterday - https://forums.parallax.com/discussion/comment/1448759/#Comment_1448759
http://forums.parallax.com/discussion/166176/random-lfsr-on-p2#latest
Somehow it totally slipped my mind that you had posted that code. Been a bit busy around here.
Let me see...