xoroshiro++ random number generator

TonyB_ · 2018-03-12 13:45

xoroshiro++ was created by Sebastiano Vigna in collaboration with David Blackman towards the end of 2017 and is an improvement of their xoroshiro+ algorithm. Details of xoroshiro128+, a 128-bit state version of xoroshiro+ with 64-bit output, can be found at http://xoroshiro.di.unimi.it/

Sebastiano told me about xoroshiro++ in November 2017 and at that time it was a candidate to replace xoroshiro128+. Evan (who has done a huge amount of xoroshiro testing using PractRand) and Chip were informed but the details could not be made public. Sebastiano and David have decided a different algorithm will supersede xoroshiro128+ and now we are able to discuss xoroshiro++ freely.

Our tests have shown that xoroshiro++ is a far better random number generator than xoroshiro+. It deserves to be known about and used and that is the main purpose of this post. The story of how various xoroshiro versions have been implemented on the P2 can be read in this long topic:
https://forums.parallax.com/discussion/166176/random-lfsr-on-p2

What is missing from that thread is the xoroshiro++ algorithm, which is the same as xoroshiro+ with two extra steps at the end. The (s0 + s1) sum is rotated left by the number of bits given by a new constant d, then this rotated sum is added to s0 to give the output. The extra rotation and addition scramble the output bits more than with xoroshiro+, thus making them more random. As there are now two additions the enhanced algorithm is called xoroshiro++ and a quadruple of four constants [a,b,c,d] defines it, instead of the [a,b,c] triple in xoroshiro+.

Here is the xoroshiro++ pseudo code:

;{s1,s0} = state (input and output)
;prn = pseudo-random number (output)
;tmp and prn can be the same register

;xoroshiro+
	xor s1,s0		;s1 = s1 ^ s0
	mov tmp,s1
	rol s0,a		;s0 = s0 rol a
	shl tmp,b		;tmp = (s1 ^ s0) shl b
	xor s0,tmp
	xor s0,s1		;s0 = s0 rol a ^ (s1 ^ s0) shl b ^ s1 ^ s0
	rol s1,c		;s1 = (s1 ^ s0) rol c
	mov prn,s0
	add prn,s1		;prn = s0 + s1
;xoroshiro++ enhancement
	rol prn,d
	add prn,s0		;prn = (s0 + s1) rol d + s0

Note that constant d affects only the PRN output and the state calculations in xoroshiro+ and xoroshiro++ are identical. To test xoroshiro++, the same full-period [a,b,c] triples as in xoroshiro+ are used, then the possible values of d tried in turn. The results can vary greatly for different d, even between those that differ by only one.

The P2 XORO32 D instruction does a double-iteration of xoroshiro32++. The state D is advanced by two iterations and the resulting state is written back to D, the intermediate state is not available to the user. The 32-bit PRN output is a concatenation of two successive 16-bit outputs, the low word is the first and the high word the second. This 32-bit value is copied to the S field of the instruction following XORO32.

To save some time, both in hardware and in software if instructions can be executed in parallel, the PRN output can be calculated at the start of the algorithm, not at the end. The XORO32 logic was modified in this way for the final version and here is the corresponding xoroshiro++ pseudo code:

;{s1,s0} = state (input and output)
;prn = pseudo-random number (output)

	mov prn,s0
	add prn,s1		;prn = s0 + s1
	rol prn,d
	add prn,s0		;prn = (s0 + s1) rol d + s0
	xor s1,s0		;s1 = s1 ^ s0
	mov tmp,s1
	rol s0,a		;s0 = s0 rol a
	shl tmp,b		;tmp = (s1 ^ s0) shl b
	xor s0,tmp
	xor s0,s1		;s0 = s0 rol a ^ (s1 ^ s0) shl b ^ s1 ^ s0
	rol s1,c		;s1 = (s1 ^ s0) rol c

After much testing with PractRand the quadruple [14,2,7,6] was selected in December 2017. Following more tests a decision was made in the last few days to change to [14,2,7,5], however this was probably not really necessary. (The free-running hub-based PRNG uses the original xoroshiro128+ [55,14,36] algorithm.)

An intrinsic property of both xoroshiro+ and xoroshiro++ is equidistribution. All outputs occur the same number of times, except for zero. For xoroshiro32+/++, the frequency of every non-zero 16-bit output is 2^16 and the frequency of zero is 2^16-1. A similar equidistribution applies to bytes and bits. The initial state for seeding can be any non-zero value.

xoroshiro++ can be ported easily to other CPUs and languages, for example x86, Z80, C and BASIC. Tests have shown that good results can be obtained even with a 16-bit state, 8-bit output and period of 2^16-1.

EDIT:
Quadruple changed to [13,5,10,9] for Rev B and later. See post 16 below.

Heater. · 2018-03-12 18:12

Would the real xoroshiro please stand up!

This is very confusing.

cgracey · 2018-03-12 20:08

Heater. wrote: »

Would the real xorshiro please stand up!

This is very confusing.

That's what makes it so random. Even more than before.

evanh · 2018-03-12 22:01

Thanks Tony, Good coverage.

Heater, ask some questions. It would be good to know if some terms need explained or more fleshing out of the events and decisions.

Heater. · 2018-03-12 22:43

evanh,

I don't know what questions to ask. Having been fascinated by this topic for some years I have made some observations:

1) Back in the early days of computing people slapped all kind of operations together, in a haphazard way, in the hope that the output would be kind of random.

2) As the years went by all kinds of criticisms were leveled at those early PRNGs. They have short periods, they have a bias, and so on.

3) Then there were guys like Prof. George Marsaglia, creating new PRNGs but also backing them up with some mathematical proofs about their properties. Like the period and such.

4) Today, with the xoroshio guys, we seem to be back to step 1). For example today I notice they have xorshift1024*φ, which smacks of "Oh, lets throw in a multiply by some weird number to stir things up a bit" without any explanation. I mean, why φ? Why not π? Or e? Or 2 for that matter?

5) If your PRNG is supposed to have some huge period, like 2^1024 − 1 or whatever, then please show why that might be. It's impossible to test for this. Even if I don't understand the proof it would be nice to know there is one that somebody does.

And finally, I do wish creators of PRNG algorithms would provide some test data. Some sequence that is generated from a known initial state. It's so easy for the algorithms to me implemented wrongly when translated to other programming languages. Or just implemented badly through not understanding the specification.

TonyB_ · 2018-03-12 23:03

First 32 outputs for xoroshiro32++ [14,2,7,5] when seed = 1 {s1 = 0, s0 = 1}
PRN calculated at end of algorithm after new state generated

50AD
B89A
A3D9
0C87
9BD9
3CAA
3502
8840
3D24
7287
71F5
90AC
A818
B4C1
2CE3
CC58
B31D
DF3D
B7A4
A2E3
3B2E
2C20
BFFF
AFE9
5B9A
3BB8
35F4
921C
6A2D
CD9A
A217
207F

First 32 outputs for xoroshiro32++ [14,2,7,5] when seed = 1 {s1 = 0, s0 = 1}
PRN calculated at start of algorithm before new state generated

0021
50AD
B89A
A3D9
0C87
9BD9
3CAA
3502
8840
3D24
7287
71F5
90AC
A818
B4C1
2CE3
CC58
B31D
DF3D
B7A4
A2E3
3B2E
2C20
BFFF
AFE9
5B9A
3BB8
35F4
921C
6A2D
CD9A
A217

First 16 outputs for XORO32 when seed = 1 {s1 = 0, s0 = 1}
PRN calculated at start of algorithm before new state generated

50AD0021
A3D9B89A
9BD90C87
35023CAA
3D248840
71F57287
A81890AC
2CE3B4C1
B31DCC58
B7A4DF3D
3B2EA2E3
BFFF2C20
5B9AAFE9
35F43BB8
6A2D921C
A217CD9A

TonyB_ · 2018-03-13 01:15

Reserved for xoroshiro64++ code and test data.

TonyB_ · 2018-03-13 01:56

Heater. wrote: »

I don't know what questions to ask. Having been fascinated by this topic for some years I have made some observations:

1) Back in the early days of computing people slapped all kind of operations together, in a haphazard way, in the hope that the output would be kind of random.

2) As the years went by all kinds of criticisms were leveled at those early PRNGs. They have short periods, they have a bias, and so on.

3) Then there were guys like Prof. George Marsaglia, creating new PRNGs but also backing them up with some mathematical proofs about their properties. Like the period and such.

4) Today, with the xoroshiro guys, we seem to be back to step 1). For example today I notice they have xorshift1024*φ, which smacks of "Oh, lets throw in a multiply by some weird number to stir things up a bit" without any explanation. I mean, why φ? Why not π? Or e? Or 2 for that matter?

5) If your PRNG is supposed to have some huge period, like 2^1024 − 1 or whatever, then please show why that might be. It's impossible to test for this. Even if I don't understand the proof it would be nice to know there is one that somebody does.

And finally, I do wish creators of PRNG algorithms would provide some test data. Some sequence that is generated from a known initial state. It's so easy for the algorithms to me implemented wrongly when translated to other programming languages. Or just implemented badly through not understanding the specification.

Heater,

To answer the last point first, I've posted some test data above, with more to come.

4) Some of the work started by the late Prof. Marsaglia is being carried on and improved by Sebastiano Vigna and David Blackman. I have no doubt that they know what they are doing so you need to be a bit careful what you write - they might be reading! From my personal experience of it, I think xoroshiro++ is very good and yet a forthcoming paper will not even mention it, which means (a) the replacement for xoroshiro128+ must be rather special and (b) it is up to us here to publicize xoroshiro++ because nobody else will.

There are some rigorous random number tests available and a representation of φ (arguably the most irrational number) would not have been chosen just on a whim. Multiplication scrambles bits better than addition and lots of odd numbers would do a good job probably, but a discussion of xorshift1024*φ and other generators that require a multiply is off-topic here.

5) The proof that certain classes of generator with n-bit states have a periods of 2^n-1 is not difficult, I understand, for mathematicians anyway. As I've mentioned before, we know that xoroshiro32 has a period of a 2^32-1 because the full period can be iterated in a matter of seconds.

evanh · 2018-03-13 05:52

Regarding point 5, I was getting the impression it's easy-as to proof every combination. Seba was providing detailed complete lists of full period candidates, which perfectly matched our empirically found list, on the click of the finger. It didn't seem to matter how big the state space was.

There was some polynomial associated with each listed candidate. Chris may have provided something for his example too. Certainly he wasn't phased/daunted by large periods.

It seems these sort of capabilities are well understood in the research field so no one much details them.

I guess Xoroshiro isn't trying to be an educational product.

Heater. · 2018-03-13 07:53

TonyB_,

It's even more confusing if the name is wrong. It's xoroshiro - short for xor-rotate-shift-rotate.

Odd that. I cut and pasted xorshift1024*φ from their web page and did not notice the missing "o" there.

I am careful what I write. No disrespect to Sebastiano and David, just posing the questions.

I do like the xoroshiro generators for their small size, speed and lack of multiply, divide or modulo. Which makes them ideal for the Propeller. That is why I posted a Spin version of David Jones' JKISS32 here years ago. Another derivative of Marsaglia's work.
http://www0.cs.ucl.ac.uk/staff/d.jones/GoodPracticeRNG.pdf
https://forums.parallax.com/discussion/136975/jkiss32-high-quality-prng-with-realrandom-seed-save-a-cog

As for the maths behind all this. Likely I'm never going to understand that. Which is why I like to see sample output from the algorithms creators. Something to test my, likely buggy, implementations against.

TonyB_ · 2018-05-04 12:51

New paper by David Blackman & Sebastiano Vigna out today:
Scrambled Linear Pseudorandom Number Generators

More info and PDF download at http://xoshiro.di.unimi.it/
PDF also available at https://arxiv.org/abs/1805.01407

Please read the paper before passing comment. Even if you don't understand some or much of it, the amount of thought and testing that has been done should be apparent.

It's good advanced publicity for Parallax and we can talk about all of the PRNG logic in the P2 freely now. Each cog has a xoroshiro32++ and the hub has a xoroshiro128**. The scrambler for the latter was changed from + to ** in early April, improving the quality and allowing all 64 output bits to be used equally.

The P2 does not use the new xoshiro generator, which has four times as many state as output bits. Something for the future, possibly.

Heater. · 2018-05-04 21:32

Fascinating read. Thanks for the heads up TonyB_.

You are right I understand very little of it.

Only one niggle:

Would it be so hard to include complete, working, examples together with some output for test data? In C, pseudo code or whatever language.

Nice to see that Parallax, Chip and co. get a mention.

cgracey · 2018-05-04 21:49

That's neat, TonyB_!

I think we've got the best-bang-for-the-buck PRNG's in the Prop2.

Thanks for all of your and Evan's work, and also Sebastiano's help.

evanh · 2018-05-04 22:13

Reading the paper feels easy but the descriptive maths just completely leaps over my head. This isn't a criticism of the paper, just naming my own limits here.

Note: For anyone reading my prior postings and sources - Engine and scrambler is what I've been calling iterator and hash respectively.

TonyB_ · 2018-05-05 00:58

We were fortunate to have information about the new algorithms in advance. Having to wait for the paper would have been too late for the P2.

TonyB_ · 2018-10-11 11:13

After further tests the XORO32 instruction was changed in October 2018 to use xoroshiro32++[13,5,10,9].

The modified Verilog code:
https://forums.parallax.com/discussion/comment/1448460/#Comment_1448460

TonyB_ · 2018-10-11 11:49

First 32 outputs for xoroshiro32++ [13,5,10,9] when seed = 1 {s1 = 0, s0 = 1}
PRN calculated at start of algorithm before new state generated

0201
6269
AE16
12A2
4AE8
D719
0C52
984B
1DF1
743C
DBA0
BCC6
34C9
746C
3643
07FF
BBC0
C642
AA85
594E
D05A
701A
A328
F3AE
67EE
695A
C140
93C6
F5E1
4964
5E24
FC57

First 16 outputs for XORO32 when seed = 1
PRN calculated at start of algorithm before new state generated

62690201
12A2AE16
D7194AE8
984B0C52
743C1DF1
BCC6DBA0
746C34C9
07FF3643
C642BBC0
594EAA85
701AD05A
F3AEA328
695A67EE
93C6C140
4964F5E1
FC575E24

Heater. · 2018-10-11 15:23

TonyB_,

I'm not making any sense out of the results you posted above for "First 16 outputs for XORO32 using xoroshiro32++[13,5,10,9] with seed = 1.
PRN calculated at start of algorithm before new state generated.". The first output you have is 62690201 which according to the present algorithm is not possible.

If you are doing the PRN calculation first then on the first run it is :

prn = (s0 + s1) rol d + s0
= (0x00000001 + 0x00000000) rol 9 + 0x00000001
= 0x00000201

Given that the seed is single 1 bit there is no way to get all the bits required to make 0x62690201

I was trying to make a C++ version of the algorithm above but it produces different results.

#include <cstdint>
#include <stdlib.h>
#include <iostream>
#include <iomanip>

class XoroshiroPlusPlus
{
  const uint32_t a = 13;
  const uint32_t b = 5;
  const uint32_t c = 10;
  const uint32_t d = 9;

  static inline uint32_t rol(uint32_t x, uint32_t k) noexcept
  {
    return (((x) << (k)) | ((x) >> (32 - (k))));
  }

  uint32_t s0;
  uint32_t s1;

public:
  explicit XoroshiroPlusPlus(uint32_t s0 = 1, uint32_t s1 = 0) noexcept
    : s0(s0), s1(s1)
  {
  }

  inline uint32_t operator()() noexcept
  {
    uint32_t prn = s0;
    prn + = s1;
    prn = rol(prn, d);
    prn += s0;
    s1 ^= s0;
    uint32_t tmp = s1;
    s0 = rol(s0, a);
    tmp = tmp << b;
    s0 ^= tmp;
    s0 ^= s1;
    s1 = rol(s1, c);
    return prn;
  }
};

int main(int argc, char* argv[])
{
  XoroshiroPlusPlus prng(1, 0);

  for (int i = 0; i <= 16; i++)
  {
    std::cout << std::hex << std::setfill('0') << std::setw(8) << prng() << std::endl;
  }
}

The results:

00000201
00486221
2608820a
c7e20c2e
08aabc0a
5f77c44d
9056f91c
ed37ad47
092dfb84
62bd58fb
d66e0e30
106547c8
b82d3d89
72bc16ca
ac4a76b8
f6d6d4ef
a5661e00

The only thing that matches is the low 12 bits of the first result, which is the 0x201 I calculated above. Where did all the extra bits in your 62690201 come from?

Is there an xoroshiro++ in C or C++ with example output anywhere? I could not find one.

evanh · 2018-10-11 15:42

Heater,
I can see your variables and shifters are 32-bit. They should all be 16-bit.

Tony has always done it as engine iteration first, then scrambler summing. For quite a while Chip had XORO32 also following that order of Tony's, but later changed it to summing first followed by engine iterate when XORO32 had become the longest (critical) timing path.

It doesn't change the random data, just swaps the shortwords around.

Heater. · 2018-10-11 15:57

evanh,

Many moons ago I made the comment here: "Would the real xoroshiro please stand up! This is very confusing."

Well, it's still confusing

The results I was attempting to recreate are for what TonyB_ described above as "XORO32 using xoroshiro32++[13,5,10,9] with seed = 1. PRN calculated at start of algorithm before new state generated."

There is nothing there that indicates 16 bit processing. The name has "32" in it. The presented code shows 32 bit operations. The present results are 32 bit. He says right there that he is extracting PRN before generating the new state.

Why only 16 bit variables when we have a 32 bit P2?

More confused than ever. What is it you guys have been testing and is now in the P2?

TonyB_ · 2018-10-11 16:01

Heater.,

XORO32 does a double-iteration of xoroshiro32++. When seed = 1, 0201 is the first output and low word, 6269 the second output and high word, of the first 32-bit XORO32 result.

I've added single-iterated test outputs to
http://forums.parallax.com/discussion/comment/1448906/#Comment_1448906

evanh · 2018-10-11 16:07

Heater,
I did post actual C source code for you yesterday - https://forums.parallax.com/discussion/comment/1448759/#Comment_1448759

Heater. · 2018-10-11 16:08

What exactly do I need to my code above to get the same results?

evanh · 2018-10-11 16:12

Copy mine! Then build you own when confident.

TonyB_ · 2018-10-11 16:17

This thread was meant for announcements. Please continue discussion at
http://forums.parallax.com/discussion/166176/random-lfsr-on-p2#latest

evanh · 2018-10-11 16:25

Opps, I was wrong about Tony's output. The reason for the apparent rapid transition is simply because there is two concatenated 16-bit random numbers.

62690201
12a2ae16
d7194ae8
984b0c52
743c1df1
bcc6dba0
746c34c9
07ff3643
c642bbc0
594eaa85
701ad05a
f3aea328
695a67ee
93c6c140
4964f5e1
fc575e24
a638d7ad
181ae233
7be766b5
5cbd8445

Heater. · 2018-10-11 16:39

Oops, we were cross posting there.

Somehow it totally slipped my mind that you had posted that code. Been a bit busy around here.

Let me see...

evanh · 2019-10-05 00:18

Screen shot of verilog code for revB XORO32 [13 5 10 9]

xoroshiro++ random number generator

Comments