Hmm, here's one example from the spreadsheet documentation:
EEEE 1001111 111 DDDDDDDDD 000000111 RGBEXP D Expand 5:6:5 RGB value in S[15:0] into 8:8:8 value in D[31:8]. D = {S[15:11,15:13], S[10:5,10:9], S[4:0,4:2], 8'b0}.
Chip,
Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!
Hmm, I'm pondering the idea of adding a definable mask to the output of the ALU. Have a hidden register that holds the mask.
Its impact would be wide ranging and the details will take time to sort out - Something to think about for the Prop3 I guess.
Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!
What exactly are you suggesting here ?
There are two forms of Rotate in MCUs, one rotates a single-step, and another takes the rotate-count as a parameter.
It would be practical to modify the first form to have a reach, so allow ROTATE of any element width from 2-32 bits, and that would have immediate use for Rotate of 8b and 16b elements.
However, I'm not sure a Rotate with two params, one for rotate-count, and another for rotate-reach would be practical, seems to be quite logic hungry.
The rotate logic probably would be bit hideous, haha.
Don't worry, I was just musing on record.
ok.
The single-bit form could be useful, (and compact logic) if combined with REP.
Trades off a little speed for logic savings.
Saves logic, code is compact and covers any reach and any count, for those cases that require it.
I think REP can manage multiple lines, for > 32b, N shifts ?
Hmm, here's one example from the spreadsheet documentation:
EEEE 1001111 111 DDDDDDDDD 000000111 RGBEXP D Expand 5:6:5 RGB value in S[15:0] into 8:8:8 value in D[31:8]. D = {S[15:11,15:13], S[10:5,10:9], S[4:0,4:2], 8'b0}.
How is the S[15:0] addressing specified?
The RGB value to be expanded is in D[15:0] not S[15:0].
Ah, thanks Oz, docs not quite right then. So that means they can go with the single operand instructions - freeing that slot for something else of need.
And vanishing SFUNC from the list is a nice bonus too.
I've been running some more tests and not doing so well ... I sat down and went over the numbers and worked out that, give or take quite a large margin for case variations, it's a 32x time multiplier per word size increase to check every combination. So the difference in time taken between a s16 full combination search and a s32 full combination search is a 32^16 = 1.2x10^24 (quadrillion, apparently) multiplier!
It's impractical to use 100% brute force beyond about 20 bits word size.
This issue applies heavily to the max-iteration search. I've tried to reduce the burden by limiting the combination ranges but it's not anywhere near enough. The inner most loop is a 4x per bit as a starting point.
It seemed to me that we were both able to discover all max-length shift settings for xoroshiro32+, and then later you tested them all out for quality and determined that a single one was best, right?
dat org
bmask dirb,#15 'drive LEDs
loop getword outb,state,#1 'output sum to LEDs (LSB is low-quality)
add outb,state
waitx ##80_000_000/10
xoro32 state 'iterate xoroshiro32+
jmp #loop
state long 1 'seed {s1,s0} with 1
I checked it for max-run-length and it certainly looks random. This is the 14,2,7 set.
That's the whole iterator, now realized in the XORO32 instruction. It uses any cog register to track the 32-bit PRNG state. Evanh found that the ROL/SHL settings used in this implementation cause the PRNG to pass all random quality tests out to 1GB, whether from using the top bit, the top byte, or the top 15 bits of the sum of the two 16-bit fields that comprise the state. This is golden.
This all means that you can now have high-quality seedable and repeatable pseudo-random number generators in PASM, without having to implement any algorithm. This was modeled after the best PRNG topology known on the interwebs, and the very best settings were determined through a lot of testing. Thanks to Ahle, Evanh, and whoever came up with the xoroshiro128+ algorithm!
So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.
PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.
So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.
PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.
I may as well dump out all the scores I've generated myself I guess ... score tables for word sizes 12 to 20 are attached. s20 took a couple of days on a Ryzen 8-core running at 3800MHz on all cores.
Next step is look into using GPU for doing the full-period candidate searches ... The CPU takes enormous time when the word size is notched up - With respect to the word size, the per-combination run time is a square law 4^x exponential law and the number of combinations to search is a cube law. Overall candidate search time multiplies the two together.
Yeah, that was a tad short-hand labelling:
- "Word" is the full summed word size minus the lsb, for s16 (Xoroshiro32+) that is the top 15 bits [15:1].
- "Byte3" is the most significant 8 bits of the summed word, for s16 that is bits [15:8].
- "Byte2" is half way down the summed word, for s16 that is bits [11:4].
- "Byte1" is always bits [8:1] for all word sizes.
- "Bit" is msb.
PS: You'll note I've included Xoroshiro's predecessor, Xorshift, for comparison. It performs notably worse on the PractRand scores even though the author indicates it's still a perfectly okay algorithm. To me that says that Xoroshiro is exceptionally good for it's compactness and speed.
Thanks Doug,
The mathematicians that do the real work formulating these algorithms must work at an entirely different level though. Somehow, for a period of 2^128-1, the author of Xoroshiro must have picked full-period candidates without empirically ever having tested the period on even one combination.
Comments
How is the S[15:0] addressing specified?
Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!
Hmm, I'm pondering the idea of adding a definable mask to the output of the ALU. Have a hidden register that holds the mask.
Its impact would be wide ranging and the details will take time to sort out - Something to think about for the Prop3 I guess.
There are two forms of Rotate in MCUs, one rotates a single-step, and another takes the rotate-count as a parameter.
It would be practical to modify the first form to have a reach, so allow ROTATE of any element width from 2-32 bits, and that would have immediate use for Rotate of 8b and 16b elements.
However, I'm not sure a Rotate with two params, one for rotate-count, and another for rotate-reach would be practical, seems to be quite logic hungry.
Don't worry, I was just musing on record.
The single-bit form could be useful, (and compact logic) if combined with REP.
Trades off a little speed for logic savings.
Saves logic, code is compact and covers any reach and any count, for those cases that require it.
I think REP can manage multiple lines, for > 32b, N shifts ?
And vanishing SFUNC from the list is a nice bonus too.
It's impractical to use 100% brute force beyond about 20 bits word size.
This issue applies heavily to the max-iteration search. I've tried to reduce the burden by limiting the combination ranges but it's not anywhere near enough. The inner most loop is a 4x per bit as a starting point.
Oh, yeah, that would take forever!
I checked it for max-run-length and it certainly looks random. This is the 14,2,7 set.
"In Search of Randomness"
That's the whole iterator, now realized in the XORO32 instruction. It uses any cog register to track the 32-bit PRNG state. Evanh found that the ROL/SHL settings used in this implementation cause the PRNG to pass all random quality tests out to 1GB, whether from using the top bit, the top byte, or the top 15 bits of the sum of the two 16-bit fields that comprise the state. This is golden.
This all means that you can now have high-quality seedable and repeatable pseudo-random number generators in PASM, without having to implement any algorithm. This was modeled after the best PRNG topology known on the interwebs, and the very best settings were determined through a lot of testing. Thanks to Ahle, Evanh, and whoever came up with the xoroshiro128+ algorithm!
So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.
PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.
All good now though.
Wow! I'm glad you're back now.
Attached is all the sources/scripts and here's a sample (s16) score table that's been auto generated:
Next step is look into using GPU for doing the full-period candidate searches ... The CPU takes enormous time when the word size is notched up - With respect to the word size, the per-combination run time is a square law 4^x exponential law and the number of combinations to search is a cube law. Overall candidate search time multiplies the two together.
Could you please explain, again maybe, what the non-combination columns mean?
- "Word" is the full summed word size minus the lsb, for s16 (Xoroshiro32+) that is the top 15 bits [15:1].
- "Byte3" is the most significant 8 bits of the summed word, for s16 that is bits [15:8].
- "Byte2" is half way down the summed word, for s16 that is bits [11:4].
- "Byte1" is always bits [8:1] for all word sizes.
- "Bit" is msb.
PS: You'll note I've included Xoroshiro's predecessor, Xorshift, for comparison. It performs notably worse on the PractRand scores even though the author indicates it's still a perfectly okay algorithm. To me that says that Xoroshiro is exceptionally good for it's compactness and speed.
Was it a predecessor to xoroshiro32+ ?
And we know we've got good numbers in the P2. The best, probably.
The mathematicians that do the real work formulating these algorithms must work at an entirely different level though. Somehow, for a period of 2^128-1, the author of Xoroshiro must have picked full-period candidates without empirically ever having tested the period on even one combination.
It's crazy. Glad they are around, and we can use their work.
It's amazing what these number theory guys can deduce.