Hmm, here's one example from the spreadsheet documentation:
EEEE1001111111 DDDDDDDDD 000000111 RGBEXP D Expand 5:6:5 RGB value in S[15:0] into 8:8:8 value in D[31:8]. D = {S[15:11,15:13], S[10:5,10:9], S[4:0,4:2], 8'b0}.
Chip,
Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!
Hmm, I'm pondering the idea of adding a definable mask to the output of the ALU. Have a hidden register that holds the mask.
Its impact would be wide ranging and the details will take time to sort out - Something to think about for the Prop3 I guess.
Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!
What exactly are you suggesting here ?
There are two forms of Rotate in MCUs, one rotates a single-step, and another takes the rotate-count as a parameter.
It would be practical to modify the first form to have a reach, so allow ROTATE of any element width from 2-32 bits, and that would have immediate use for Rotate of 8b and 16b elements.
However, I'm not sure a Rotate with two params, one for rotate-count, and another for rotate-reach would be practical, seems to be quite logic hungry.
The rotate logic probably would be bit hideous, haha.
Don't worry, I was just musing on record.
ok.
The single-bit form could be useful, (and compact logic) if combined with REP.
Trades off a little speed for logic savings.
Saves logic, code is compact and covers any reach and any count, for those cases that require it.
I think REP can manage multiple lines, for > 32b, N shifts ?
Hmm, here's one example from the spreadsheet documentation:
EEEE1001111111 DDDDDDDDD 000000111 RGBEXP D Expand 5:6:5 RGB value in S[15:0] into 8:8:8 value in D[31:8]. D = {S[15:11,15:13], S[10:5,10:9], S[4:0,4:2], 8'b0}.
How is the S[15:0] addressing specified?
The RGB value to be expanded is in D[15:0] not S[15:0].
Ah, thanks Oz, docs not quite right then. So that means they can go with the single operand instructions - freeing that slot for something else of need.
And vanishing SFUNC from the list is a nice bonus too.
I've been running some more tests and not doing so well ... I sat down and went over the numbers and worked out that, give or take quite a large margin for case variations, it's a 32x time multiplier per word size increase to check every combination. So the difference in time taken between a s16 full combination search and a s32 full combination search is a 32^16 = 1.2x10^24 (quadrillion, apparently) multiplier!
It's impractical to use 100% brute force beyond about 20 bits word size.
This issue applies heavily to the max-iteration search. I've tried to reduce the burden by limiting the combination ranges but it's not anywhere near enough. The inner most loop is a 4x per bit as a starting point.
It seemed to me that we were both able to discover all max-length shift settings for xoroshiro32+, and then later you tested them all out for quality and determined that a single one was best, right?
datorgbmaskdirb,#15'drive LEDs
loop getwordoutb,state,#1'output sum to LEDs (LSB is low-quality)addoutb,state
waitx ##80_000_000/10xoro32 state 'iterate xoroshiro32+jmp #loop
state long1'seed {s1,s0} with 1
I checked it for max-run-length and it certainly looks random. This is the 14,2,7 set.
That's the whole iterator, now realized in the XORO32 instruction. It uses any cog register to track the 32-bit PRNG state. Evanh found that the ROL/SHL settings used in this implementation cause the PRNG to pass all random quality tests out to 1GB, whether from using the top bit, the top byte, or the top 15 bits of the sum of the two 16-bit fields that comprise the state. This is golden.
This all means that you can now have high-quality seedable and repeatable pseudo-random number generators in PASM, without having to implement any algorithm. This was modeled after the best PRNG topology known on the interwebs, and the very best settings were determined through a lot of testing. Thanks to Ahle, Evanh, and whoever came up with the xoroshiro128+ algorithm!
So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.
PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.
So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.
PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.
I may as well dump out all the scores I've generated myself I guess ... score tables for word sizes 12 to 20 are attached. s20 took a couple of days on a Ryzen 8-core running at 3800MHz on all cores.
Next step is look into using GPU for doing the full-period candidate searches ... The CPU takes enormous time when the word size is notched up - With respect to the word size, the per-combination run time is a square law 4^x exponential law and the number of combinations to search is a cube law. Overall candidate search time multiplies the two together.
Yeah, that was a tad short-hand labelling:
- "Word" is the full summed word size minus the lsb, for s16 (Xoroshiro32+) that is the top 15 bits [15:1].
- "Byte3" is the most significant 8 bits of the summed word, for s16 that is bits [15:8].
- "Byte2" is half way down the summed word, for s16 that is bits [11:4].
- "Byte1" is always bits [8:1] for all word sizes.
- "Bit" is msb.
PS: You'll note I've included Xoroshiro's predecessor, Xorshift, for comparison. It performs notably worse on the PractRand scores even though the author indicates it's still a perfectly okay algorithm. To me that says that Xoroshiro is exceptionally good for it's compactness and speed.
Thanks Doug,
The mathematicians that do the real work formulating these algorithms must work at an entirely different level though. Somehow, for a period of 2^128-1, the author of Xoroshiro must have picked full-period candidates without empirically ever having tested the period on even one combination.
Comments
EEEE 1001111 111 DDDDDDDDD 000000111 RGBEXP D Expand 5:6:5 RGB value in S[15:0] into 8:8:8 value in D[31:8]. D = {S[15:11,15:13], S[10:5,10:9], S[4:0,4:2], 8'b0}.
How is the S[15:0] addressing specified?
Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!
Hmm, I'm pondering the idea of adding a definable mask to the output of the ALU. Have a hidden register that holds the mask.
Its impact would be wide ranging and the details will take time to sort out - Something to think about for the Prop3 I guess.
There are two forms of Rotate in MCUs, one rotates a single-step, and another takes the rotate-count as a parameter.
It would be practical to modify the first form to have a reach, so allow ROTATE of any element width from 2-32 bits, and that would have immediate use for Rotate of 8b and 16b elements.
However, I'm not sure a Rotate with two params, one for rotate-count, and another for rotate-reach would be practical, seems to be quite logic hungry.
Don't worry, I was just musing on record.
The single-bit form could be useful, (and compact logic) if combined with REP.
Trades off a little speed for logic savings.
Saves logic, code is compact and covers any reach and any count, for those cases that require it.
I think REP can manage multiple lines, for > 32b, N shifts ?
And vanishing SFUNC from the list is a nice bonus too.
It's impractical to use 100% brute force beyond about 20 bits word size.
This issue applies heavily to the max-iteration search. I've tried to reduce the burden by limiting the combination ranges but it's not anywhere near enough. The inner most loop is a 4x per bit as a starting point.
Oh, yeah, that would take forever!
dat org bmask dirb,#15 'drive LEDs loop getword outb,state,#1 'output sum to LEDs (LSB is low-quality) add outb,state waitx ##80_000_000/10 xoro32 state 'iterate xoroshiro32+ jmp #loop state long 1 'seed {s1,s0} with 1
I checked it for max-run-length and it certainly looks random. This is the 14,2,7 set.
"In Search of Randomness"
wire [15:0] xoro32x = d[31:16] ^ d[15:0]; wire [31:0] xoro32 = {xoro32x[8:0], xoro32x[15:9], {d[1:0], d[15:2]} ^ xoro32x ^ {xoro32x[13:0], 2'b0}};
That's the whole iterator, now realized in the XORO32 instruction. It uses any cog register to track the 32-bit PRNG state. Evanh found that the ROL/SHL settings used in this implementation cause the PRNG to pass all random quality tests out to 1GB, whether from using the top bit, the top byte, or the top 15 bits of the sum of the two 16-bit fields that comprise the state. This is golden.
This all means that you can now have high-quality seedable and repeatable pseudo-random number generators in PASM, without having to implement any algorithm. This was modeled after the best PRNG topology known on the interwebs, and the very best settings were determined through a lot of testing. Thanks to Ahle, Evanh, and whoever came up with the xoroshiro128+ algorithm!
So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.
PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.
All good now though.
Wow! I'm glad you're back now.
Attached is all the sources/scripts and here's a sample (s16) score table that's been auto generated:
Xorshift32+ PractRand Score Table Combination Word Byte3 Byte2 Byte1 Bit =============================================== [11 8 5] 32M 4M 256M 1G [13 12 3] 4M 512K 512K 512M [14 1 15] 256K 64K 32K 4K [15 10 1] 256K 32K 32K 512M [ 1 1 7] 2M 128K 4M 4K [ 1 1 12] 1M 64K 256K 4K [ 1 1 13] 512K 64K 256K 4K [ 2 5 8] 8M 256K 32M 256K [ 2 5 13] 8M 256K 16M 256K [ 2 13 15] 1M 128K 256K 256K [ 2 15 13] 1M 128K 256K 256K [ 3 7 6] 8M 256K 4M 32M [ 5 3 1] 512M 8M 128M 1G [ 5 3 8] 128M 2M 512M 1G [ 5 3 13] 128M 2M 128M 1G [ 5 7 4] 64M 2M 64M 32M [ 6 3 8] 256M 8M 512M 1G [ 7 1 6] 128M 16M 32M 64M [ 7 1 15] 2M 8M 1M 128M [ 7 2 1] 64M 32M 32M 1G [ 8 3 9] 128M 1G 128M 256M [ 9 14 5] 16M 16M 8M 256M Xoroshiro32+ PractRand Score Table Combination Word Byte3 Byte2 Byte1 Bit =============================================== [10 2 7] 256M 16M 16M 1G [10 3 3] 64M 32M 32M 256M [10 3 11] 2G 256M 256M 1G [10 5 13] 2G 1G 512M 512M [10 7 11] 512M 512M 128M 1G [10 10 7] 16M 16M 16M 512M [11 1 2] 256M 32M 32M 1G [11 1 14] 1G 32M 16M 1G [11 2 6] 512M 32M 64M 512M [11 3 10] 32M 64M 1M 128M [11 7 10] 4M 32M 128K 1G [11 8 12] 256M 512M 512M 1G [11 10 12] 128M 1G 512M 1G [11 11 14] 64M 16M 16M 1G [11 14 6] 256M 256M 64M 1G [12 4 15] 512M 128M 64M 1G [12 8 11] 4M 64M 128K 1G [12 10 11] 2M 128K 128K 1G [12 10 13] 32M 16M 16M 1G [12 14 5] 32M 128M 128M 1G [13 3 14] 512M 64M 32M 1G [13 5 8] 256M 512M 128M 1G [13 5 10] 64M 64M 16M 1G [13 9 8] 128M 512M 512M 1G [13 10 12] 2M 128K 128K 512M [13 11 14] 16M 8M 8M 1G [13 12 14] 16M 8M 8M 1G [13 13 14] 16M 8M 8M 1G [14 1 11] 64M 256K 256K 1G [14 2 7] 1G 1G 512M 1G [14 2 9] 512M 128M 128M 1G [14 3 13] 16M 1M 256K 1G [14 8 9] 128M 1G 64M 1G [14 11 3] 512M 256M 128M 1G [14 11 11] 16M 256K 256K 1G [14 11 13] 512K 128K 64K 1G [14 12 13] 512K 128K 64K 1G [14 13 13] 128K 64K 64K 256M [15 1 2] 2M 1M 512K 1G [15 3 6] 1G 512M 512M 1G [15 4 12] 128M 128M 8M 1G [15 6 2] 128M 128M 128M 1G [15 7 4] 512M 512M 512M 1G [15 7 8] 64M 32M 32M 512M [ 1 2 8] 32M 16M 64M 1G [ 2 1 11] 8M 128K 128K 1G [ 2 1 15] 128K 32K 32K 1G [ 2 2 7] 8M 32M 32M 1G [ 2 6 15] 2M 16M 128K 1G [ 2 8 7] 8M 128M 32M 1G [ 2 9 9] 256K 16K 128K 1G [ 2 11 3] 2M 512K 256K 1G [ 3 2 6] 64M 2M 2M 1G [ 3 3 10] 16M 256K 256K 1G [ 3 11 2] 256K 128K 64K 1G [ 3 11 14] 4M 256K 256K 1G [ 4 1 9] 512K 512K 8M 512M [ 4 7 15] 64M 128M 512K 1G [ 4 8 5] 512M 32M 8M 1G [ 5 2 6] 128M 8M 4M 1G [ 5 8 4] 4M 4M 1M 64M [ 5 14 12] 128M 2M 4M 1G [ 6 2 3] 32M 1M 1M 2M [ 6 2 5] 16M 256K 512K 4M [ 6 2 11] 512M 8M 8M 1G [ 6 3 15] 32M 1M 1M 1G [ 6 14 11] 128M 32M 64M 1G [ 7 1 8] 64M 16M 128M 1G [ 7 2 2] 16M 2M 2M 4M [ 7 2 10] 1G 128M 256M 1G [ 7 2 14] 512M 4M 8M 256M [ 7 8 2] 512M 64M 8M 512M [ 7 10 10] 128M 32M 32M 1G [ 7 15 8] 2M 256K 1M 1G [ 8 1 7] 1M 1M 1M 64M [ 8 2 1] 32M 8M 8M 16M [ 8 5 13] 32M 16M 8M 1G [ 8 7 15] 2M 512K 2M 1G [ 8 9 13] 512M 512M 1G 1G [ 8 15 7] 512K 256K 512K 1G [ 9 1 4] 32M 8M 16M 64M [ 9 2 14] 2G 128M 256M 1G [ 9 8 14] 16M 256M 32M 1G [ 9 9 2] 32M 16M 16M 64M
Next step is look into using GPU for doing the full-period candidate searches ... The CPU takes enormous time when the word size is notched up - With respect to the word size, the per-combination run time is a square law 4^x exponential law and the number of combinations to search is a cube law. Overall candidate search time multiplies the two together.
Could you please explain, again maybe, what the non-combination columns mean?
- "Word" is the full summed word size minus the lsb, for s16 (Xoroshiro32+) that is the top 15 bits [15:1].
- "Byte3" is the most significant 8 bits of the summed word, for s16 that is bits [15:8].
- "Byte2" is half way down the summed word, for s16 that is bits [11:4].
- "Byte1" is always bits [8:1] for all word sizes.
- "Bit" is msb.
PS: You'll note I've included Xoroshiro's predecessor, Xorshift, for comparison. It performs notably worse on the PractRand scores even though the author indicates it's still a perfectly okay algorithm. To me that says that Xoroshiro is exceptionally good for it's compactness and speed.
Was it a predecessor to xoroshiro32+ ?
And we know we've got good numbers in the P2. The best, probably.
The mathematicians that do the real work formulating these algorithms must work at an entirely different level though. Somehow, for a period of 2^128-1, the author of Xoroshiro must have picked full-period candidates without empirically ever having tested the period on even one combination.
It's crazy. Glad they are around, and we can use their work.
It's amazing what these number theory guys can deduce.