Random/LFSR on P2

cgracey · 2017-04-25 01:59

D is the data to be worked on and S provides the function code.

evanh · 2017-04-25 02:59

Hmm, here's one example from the spreadsheet documentation:

EEEE 1001111 111 DDDDDDDDD 000000111	RGBEXP  D	Expand 5:6:5 RGB value in S[15:0] into 8:8:8 value in D[31:8]. D = {S[15:11,15:13], S[10:5,10:9], S[4:0,4:2], 8'b0}.

How is the S[15:0] addressing specified?

evanh · 2017-04-25 03:51

Chip,
Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!

Hmm, I'm pondering the idea of adding a definable mask to the output of the ALU. Have a hidden register that holds the mask.

Its impact would be wide ranging and the details will take time to sort out - Something to think about for the Prop3 I guess.

jmg · 2017-04-25 04:04

evanh wrote: »

Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!

What exactly are you suggesting here ?

There are two forms of Rotate in MCUs, one rotates a single-step, and another takes the rotate-count as a parameter.
It would be practical to modify the first form to have a reach, so allow ROTATE of any element width from 2-32 bits, and that would have immediate use for Rotate of 8b and 16b elements.
However, I'm not sure a Rotate with two params, one for rotate-count, and another for rotate-reach would be practical, seems to be quite logic hungry.

evanh · 2017-04-25 04:29

The rotate logic probably would be bit hideous, haha.

Don't worry, I was just musing on record.

jmg · 2017-04-25 04:55

evanh wrote: »

The rotate logic probably would be bit hideous, haha.
Don't worry, I was just musing on record.

ok.
The single-bit form could be useful, (and compact logic) if combined with REP.
Trades off a little speed for logic savings.
Saves logic, code is compact and covers any reach and any count, for those cases that require it.
I think REP can manage multiple lines, for > 32b, N shifts ?

ozpropdev · 2017-04-25 05:47

evanh wrote: »
Hmm, here's one example from the spreadsheet documentation:
EEEE 1001111 111 DDDDDDDDD 000000111	RGBEXP  D	Expand 5:6:5 RGB value in S[15:0] into 8:8:8 value in D[31:8]. D = {S[15:11,15:13], S[10:5,10:9], S[4:0,4:2], 8'b0}.
How is the S[15:0] addressing specified?

The RGB value to be expanded is in D[15:0] not S[15:0].

evanh · 2017-04-25 06:09

Ah, thanks Oz, docs not quite right then. So that means they can go with the single operand instructions - freeing that slot for something else of need.

And vanishing SFUNC from the list is a nice bonus too.

evanh · 2017-04-25 08:28

I've been running some more tests and not doing so well ... I sat down and went over the numbers and worked out that, give or take quite a large margin for case variations, it's a 32x time multiplier per word size increase to check every combination. So the difference in time taken between a s16 full combination search and a s32 full combination search is a 32^16 = 1.2x10^24 (quadrillion, apparently) multiplier!

It's impractical to use 100% brute force beyond about 20 bits word size.

This issue applies heavily to the max-iteration search. I've tried to reduce the burden by limiting the combination ranges but it's not anywhere near enough. The inner most loop is a 4x per bit as a starting point.

cgracey · 2017-04-26 00:03

It seemed to me that we were both able to discover all max-length shift settings for xoroshiro32+, and then later you tested them all out for quality and determined that a single one was best, right?

evanh · 2017-04-26 00:12

Yep, that's all good and sorted - I call that a word size of 16. I'm just trying to make these tools useful to 32 bits (Xoroshiro64+) at least.

cgracey · 2017-04-26 01:51

evanh wrote: »

Yep, that's all good and sorted - I call that a word size of 16. I'm just trying to make these tools useful to 32 bits (Xoroshiro64+) at least.

Oh, yeah, that would take forever!

cgracey · 2017-04-29 12:45

I've got it working:

dat	org

	bmask	dirb,#15	'drive LEDs

loop	getword	outb,state,#1	'output sum to LEDs (LSB is low-quality)
	add	outb,state
	waitx	##80_000_000/10

	xoro32	state		'iterate xoroshiro32+

	jmp	#loop


state	long	1		'seed {s1,s0} with 1

I checked it for max-run-length and it certainly looks random. This is the 14,2,7 set.

Heater. · 2017-04-29 13:21

I thought participants of this thread might like this lecture on video:
"In Search of Randomness"

cgracey · 2017-04-30 01:50

Here's the final fruit of all the xoroshiro32+ testing that Evanh did:

wire [15:0] xoro32x  = d[31:16] ^ d[15:0];
wire [31:0] xoro32   = {xoro32x[8:0], xoro32x[15:9], {d[1:0], d[15:2]} ^ xoro32x ^ {xoro32x[13:0], 2'b0}};

That's the whole iterator, now realized in the XORO32 instruction. It uses any cog register to track the 32-bit PRNG state. Evanh found that the ROL/SHL settings used in this implementation cause the PRNG to pass all random quality tests out to 1GB, whether from using the top bit, the top byte, or the top 15 bits of the sum of the two 16-bit fields that comprise the state. This is golden.

This all means that you can now have high-quality seedable and repeatable pseudo-random number generators in PASM, without having to implement any algorithm. This was modeled after the best PRNG topology known on the interwebs, and the very best settings were determined through a lot of testing. Thanks to Ahle, Evanh, and whoever came up with the xoroshiro128+ algorithm!

evanh · 2017-05-02 07:43

Nice! All comes down to some mux's and xor's.

So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.

PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.

All good now though.

cgracey · 2017-05-02 08:03

evanh wrote: »

Nice! All comes down to some mux's and xor's.

So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.

PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.

All good now though.

Wow! I'm glad you're back now.

evanh · 2017-05-02 08:23

Thanks Chip.

evanh · 2017-05-07 14:24

Yaaaay! I've got there at last.

Attached is all the sources/scripts and here's a sample (s16) score table that's been auto generated:



    Xorshift32+ PractRand Score Table

Combination   Word  Byte3  Byte2  Byte1   Bit
===============================================
 [11  8  5]    32M     4M   256M            1G
 [13 12  3]     4M   512K   512K          512M
 [14  1 15]   256K    64K    32K            4K
 [15 10  1]   256K    32K    32K          512M
 [ 1  1  7]     2M   128K     4M            4K
 [ 1  1 12]     1M    64K   256K            4K
 [ 1  1 13]   512K    64K   256K            4K
 [ 2  5  8]     8M   256K    32M          256K
 [ 2  5 13]     8M   256K    16M          256K
 [ 2 13 15]     1M   128K   256K          256K
 [ 2 15 13]     1M   128K   256K          256K
 [ 3  7  6]     8M   256K     4M           32M
 [ 5  3  1]   512M     8M   128M            1G
 [ 5  3  8]   128M     2M   512M            1G
 [ 5  3 13]   128M     2M   128M            1G
 [ 5  7  4]    64M     2M    64M           32M
 [ 6  3  8]   256M     8M   512M            1G
 [ 7  1  6]   128M    16M    32M           64M
 [ 7  1 15]     2M     8M     1M          128M
 [ 7  2  1]    64M    32M    32M            1G
 [ 8  3  9]   128M     1G   128M          256M
 [ 9 14  5]    16M    16M     8M          256M


    Xoroshiro32+ PractRand Score Table

Combination   Word  Byte3  Byte2  Byte1   Bit
===============================================
 [10  2  7]   256M    16M    16M            1G
 [10  3  3]    64M    32M    32M          256M
 [10  3 11]     2G   256M   256M            1G
 [10  5 13]     2G     1G   512M          512M
 [10  7 11]   512M   512M   128M            1G
 [10 10  7]    16M    16M    16M          512M
 [11  1  2]   256M    32M    32M            1G
 [11  1 14]     1G    32M    16M            1G
 [11  2  6]   512M    32M    64M          512M
 [11  3 10]    32M    64M     1M          128M
 [11  7 10]     4M    32M   128K            1G
 [11  8 12]   256M   512M   512M            1G
 [11 10 12]   128M     1G   512M            1G
 [11 11 14]    64M    16M    16M            1G
 [11 14  6]   256M   256M    64M            1G
 [12  4 15]   512M   128M    64M            1G
 [12  8 11]     4M    64M   128K            1G
 [12 10 11]     2M   128K   128K            1G
 [12 10 13]    32M    16M    16M            1G
 [12 14  5]    32M   128M   128M            1G
 [13  3 14]   512M    64M    32M            1G
 [13  5  8]   256M   512M   128M            1G
 [13  5 10]    64M    64M    16M            1G
 [13  9  8]   128M   512M   512M            1G
 [13 10 12]     2M   128K   128K          512M
 [13 11 14]    16M     8M     8M            1G
 [13 12 14]    16M     8M     8M            1G
 [13 13 14]    16M     8M     8M            1G
 [14  1 11]    64M   256K   256K            1G
 [14  2  7]     1G     1G   512M            1G
 [14  2  9]   512M   128M   128M            1G
 [14  3 13]    16M     1M   256K            1G
 [14  8  9]   128M     1G    64M            1G
 [14 11  3]   512M   256M   128M            1G
 [14 11 11]    16M   256K   256K            1G
 [14 11 13]   512K   128K    64K            1G
 [14 12 13]   512K   128K    64K            1G
 [14 13 13]   128K    64K    64K          256M
 [15  1  2]     2M     1M   512K            1G
 [15  3  6]     1G   512M   512M            1G
 [15  4 12]   128M   128M     8M            1G
 [15  6  2]   128M   128M   128M            1G
 [15  7  4]   512M   512M   512M            1G
 [15  7  8]    64M    32M    32M          512M
 [ 1  2  8]    32M    16M    64M            1G
 [ 2  1 11]     8M   128K   128K            1G
 [ 2  1 15]   128K    32K    32K            1G
 [ 2  2  7]     8M    32M    32M            1G
 [ 2  6 15]     2M    16M   128K            1G
 [ 2  8  7]     8M   128M    32M            1G
 [ 2  9  9]   256K    16K   128K            1G
 [ 2 11  3]     2M   512K   256K            1G
 [ 3  2  6]    64M     2M     2M            1G
 [ 3  3 10]    16M   256K   256K            1G
 [ 3 11  2]   256K   128K    64K            1G
 [ 3 11 14]     4M   256K   256K            1G
 [ 4  1  9]   512K   512K     8M          512M
 [ 4  7 15]    64M   128M   512K            1G
 [ 4  8  5]   512M    32M     8M            1G
 [ 5  2  6]   128M     8M     4M            1G
 [ 5  8  4]     4M     4M     1M           64M
 [ 5 14 12]   128M     2M     4M            1G
 [ 6  2  3]    32M     1M     1M            2M
 [ 6  2  5]    16M   256K   512K            4M
 [ 6  2 11]   512M     8M     8M            1G
 [ 6  3 15]    32M     1M     1M            1G
 [ 6 14 11]   128M    32M    64M            1G
 [ 7  1  8]    64M    16M   128M            1G
 [ 7  2  2]    16M     2M     2M            4M
 [ 7  2 10]     1G   128M   256M            1G
 [ 7  2 14]   512M     4M     8M          256M
 [ 7  8  2]   512M    64M     8M          512M
 [ 7 10 10]   128M    32M    32M            1G
 [ 7 15  8]     2M   256K     1M            1G
 [ 8  1  7]     1M     1M     1M           64M
 [ 8  2  1]    32M     8M     8M           16M
 [ 8  5 13]    32M    16M     8M            1G
 [ 8  7 15]     2M   512K     2M            1G
 [ 8  9 13]   512M   512M     1G            1G
 [ 8 15  7]   512K   256K   512K            1G
 [ 9  1  4]    32M     8M    16M           64M
 [ 9  2 14]     2G   128M   256M            1G
 [ 9  8 14]    16M   256M    32M            1G
 [ 9  9  2]    32M    16M    16M           64M

evanh · 2017-05-07 14:42

I may as well dump out all the scores I've generated myself I guess ... score tables for word sizes 12 to 20 are attached. s20 took a couple of days on a Ryzen 8-core running at 3800MHz on all cores.

Next step is look into using GPU for doing the full-period candidate searches ... The CPU takes enormous time when the word size is notched up - With respect to the word size, the per-combination run time is a square law 4^x exponential law and the number of combinations to search is a cube law. Overall candidate search time multiplies the two together.

cgracey · 2017-05-07 23:39

Looks good.

Could you please explain, again maybe, what the non-combination columns mean?

evanh · 2017-05-08 01:04

Yeah, that was a tad short-hand labelling:
- "Word" is the full summed word size minus the lsb, for s16 (Xoroshiro32+) that is the top 15 bits [15:1].
- "Byte3" is the most significant 8 bits of the summed word, for s16 that is bits [15:8].
- "Byte2" is half way down the summed word, for s16 that is bits [11:4].
- "Byte1" is always bits [8:1] for all word sizes.
- "Bit" is msb.

PS: You'll note I've included Xoroshiro's predecessor, Xorshift, for comparison. It performs notably worse on the PractRand scores even though the author indicates it's still a perfectly okay algorithm. To me that says that Xoroshiro is exceptionally good for it's compactness and speed.

cgracey · 2017-05-08 01:15

Okay. Super. Thanks.

cgracey · 2017-05-08 01:20

What is xorshift32+ ?

Was it a predecessor to xoroshiro32+ ?

evanh · 2017-05-08 01:25

yep, I'd just answered that as you were typing the question.

potatohead · 2017-05-08 02:50

Nice work Evanh! I learned a lot about random numbers following this.

And we know we've got good numbers in the P2. The best, probably.

evanh · 2017-05-08 03:36

Thanks Doug,
The mathematicians that do the real work formulating these algorithms must work at an entirely different level though. Somehow, for a period of 2^128-1, the author of Xoroshiro must have picked full-period candidates without empirically ever having tested the period on even one combination.

evanh · 2017-05-08 03:40

And, just to emphasise that a little more, there is algorithms with virtual non-repeating periods of 2^(some ten digit power) or something like that.

potatohead · 2017-05-08 04:45

must have picked full-period candidates without empirically ever having tested the period on even one combination.

It's crazy. Glad they are around, and we can use their work.

Heater. · 2017-05-08 05:01

Well, at some point the period of a PRNG is untestable. The universe will not last long enough for your computer to test it!

It's amazing what these number theory guys can deduce.

Random/LFSR on P2

Comments