I'm ditching the "D=0 means plus scrambler" special case. The new mechanism for selecting which algorithm to test has made it redundant. From now on, data and reports with a D constant of 0 will really be the specified algorithm.
Okay, so M can be used as evenly spaced offsets into the state. But it won't be as simple as, say, inverting the MSB of M to jump 50% through the state space. It won't be that linear, right?
M is a key to an entirely different sequence, so the values that occur (pick any dozen in a row) with one M will likely never occur with another M present.
Inverting any bit in M (using the seed function) will switch to a stream unrelated to the previous one (except that all streams form a master set, like threads in a rope that complexly intertwine, but never join each other).
So, nothing remotely linear.
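To make that concrete, here is a purely illustrative toy, not the author's actual seed function or streaming construction: a xoroshiro32-like engine (engine constants [13,5,10] taken from the candidate list in this thread) with a hypothetical stream constant M folded into the state each step. It demonstrates the behaviour described above: flip one bit of M and the sequence diverges entirely, rather than shifting linearly.

```c
#include <stdint.h>

/* HYPOTHETICAL illustration only -- NOT the actual streaming construction
 * discussed in the thread. A toy xoroshiro32-like engine where a stream
 * constant M is added into the state every step. */

static uint16_t rotl16( uint16_t x, int k )
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

typedef struct { uint16_t s0, s1, m; } toy_t;

static uint16_t toy_next( toy_t *g )
{
    uint16_t result = (uint16_t)(g->s0 + g->s1);   /* plain "+" scrambler */

    /* Engine constants [13,5,10] borrowed from the candidate list. */
    g->s1 ^= g->s0;
    g->s0 = (uint16_t)(rotl16( g->s0, 13 ) ^ g->s1 ^ (uint16_t)(g->s1 << 5));
    g->s1 = rotl16( g->s1, 10 );

    g->s0 += g->m;    /* hypothetical stream term: M perturbs the state */
    return result;
}
```

Seeding two copies identically but with one bit of M inverted gives the same first output (the scrambler runs before M bites), then unrelated-looking output from the second value onward.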
Regarding scrambler modifications: it is difficult. The issue is carry propagation, which is why the original sum-rotate-sum works so well, so viable options would likely do something similar.
Honestly, I believe that any small improvements would be easily solved in assembly/hardware (if there were no need to publish comprehensible code in a language like C... is there a need?).
Ah, nice, I see, more an alternative to a jump.
Cool, I guess it's less universal that way but I'm interested. There's always new hardware getting designed.
Updated beta code: Xoroshiro128psp
I need to restart statistical testing on it.
A 128-bit equivalent to xoroshiro32pp (no streaming), using the scrambler: uint16_t result = rotl( (s0 + s1) * 5, 8 ) + s0 * 5;
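As a sketch of what the 128-bit version might look like structurally: the thread only gives the 16-bit form of the scrambler, so everything scaled up here is an assumption. The engine constants [24,16,37] are borrowed from the published xoroshiro128+, and the rotate of 32 simply scales the 16-bit version's rotate of 8 (half the word size); the actual Xoroshiro128psp constants are not stated.

```c
#include <stdint.h>

/* Sketch of a 64-bit-output "+x+" (plus-mult-plus) scrambler on a 128-bit
 * xoroshiro engine. ASSUMPTIONS: engine constants [24,16,37] are taken
 * from the published xoroshiro128+, and the rotate of 32 scales the
 * 16-bit version's rotate of 8; the real Xoroshiro128psp may differ. */

static uint64_t rotl64( uint64_t x, int k )
{
    return (x << k) | (x >> (64 - k));
}

static uint64_t s[2] = { 1, 2 };   /* any nonzero seed */

static uint64_t next( void )
{
    uint64_t s0 = s[0], s1 = s[1];
    uint64_t result = rotl64( (s0 + s1) * 5, 32 ) + s0 * 5;

    s1 ^= s0;
    s[0] = rotl64( s0, 24 ) ^ s1 ^ (s1 << 16);
    s[1] = rotl64( s1, 37 );
    return result;
}
```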
This refactoring road is feeling like a rabbit hole at times. I keep finding more ideas to make things tidier as I move through each script... causing a recursion. Lots of copying of replacement code that then needs tweaking for each different job.
Time for bed, I think. I just spent some hours trying to work out why it was reporting that every file of a certain aging directory, with over 90,000 files, didn't exist, only to discover it was entirely correct: an unremembered change made some months back had produced a surprisingly subtle naming change.
It is a rabbit hole... only a little to be gained with 32-bit state.
The above example scrambler should set the upper limit of what is possible at 32-bit, but that is not necessarily what is reasonable for a decent hardware implementation.
In light of that, the existing xoroshiro32pp has only a little room to grow... I guess an almost 'Pass' on PractRand at 2GB would be considered about double.
With 128-bit state, I will take that improvement, and the x64 compiler/CPU pipeline efficiency relieves the performance penalty. However, I still have to be concerned with all other details.
Oh, I'm talking about posting more usable versions of my automation scripts and C sources.
I understand what you mean. I'm just afraid to touch anything I have created for fear of breaking it.
I'm gearing up for automating BigCrush again. Pretty ugly coding: my BigCrush piped version for Windows is too slow, so I use Windows batch files, auto-generated from VB, that call 'Bash on Ubuntu on Windows' shell scripts (from D. Lemire, modified by me), and I automatically parse the output files on Windows with another VB statistical analysis and visualization program I wrote.
See attached screenshot of AES-counter-mode statistical cloud meta-analysis. Cloud-center deviation is on left and boundary is on right. The various lines are (all-bits, high-bits, low-bits) and (forward-bits, reverse-bits, byte-reverse-bits), and combinations of those groups.
I never wrote a parser for PractRand output, as I would not know what to do with the full p-value dump (-a), since so many of the statistics are deeply correlated (unlike BigCrush, whose 254 p-values are minimally correlated).
For PractRand, because it grows the testing as the tests keep passing, I just use the fail point, in processed data size, as the final score. It's so much easier to handle this way.
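As an illustration of that scoring scheme (this is not the thread's actual tooling), a minimal scanner could pull the fail point out of a saved report by remembering the last data-length header seen before the first failing test. The line patterns below only approximate PractRand's report format; adjust them for your build.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch, NOT the author's tooling: scan a PractRand-style
 * report and copy into 'out' the last "length=" header line seen before
 * the first line containing "FAIL". The patterns approximate PractRand's
 * report format. Returns 1 if a FAIL was found, 0 otherwise. */
int fail_point( FILE *report, char *out, size_t outsz )
{
    char line[512], len[512] = "";

    while( fgets( line, sizeof line, report ) != NULL )
    {
        if( strncmp( line, "length=", 7 ) == 0 )
            strcpy( len, line );                /* remember latest length header */
        else if( strstr( line, "FAIL" ) != NULL )
        {
            strncpy( out, len, outsz - 1 );     /* report the fail point */
            out[outsz - 1] = '\0';
            return 1;
        }
    }
    return 0;
}
```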
I never put much effort into a solution for Crush but I had observed that counting the number of pass/fail tests was one possible way.
Late in the effort I went looking for a way to implement Crush in my testing and got the best answer from Melisa's work. It resulted in two versions: one as a generic piped input, the other as a compiled-in tuned generator. Both were simpler than what I could glean from the documentation, which never worked anyway.
Here's the piped version (it's a lot cleaner looking). I've used fread() instead of getchar() in the hopes it might have less overhead:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include "unif01.h"
#include "bbattery.h"

uint32_t chunk_bits( void )
{
    static uint32_t buff[256];
    static int emitidx = 0;
    uint32_t output;

    if( emitidx == 0 )
    {
        if( fread( buff, sizeof(uint32_t), 256, stdin ) != 256 )
        {
            printf( "Error! Stream incomplete.\n" );
            exit(2);
        }
    }

    // Siphon out the random data. Maybe more sensitive with all 32 bits swapped.
    output = buff[emitidx++];
    if( emitidx >= 256 )
        emitidx = 0;

    return output;
}

int main( void )
{
    // Create TestU01 PRNG object for our generator
    unif01_Gen *gen = unif01_CreateExternGenBits( "stdin32", chunk_bits );

    // Run the selected test battery.
#if CRUSH == 1
    bbattery_SmallCrush( gen );
#elif CRUSH == 2
    bbattery_Crush( gen );
#elif CRUSH == 3
    bbattery_BigCrush( gen );
#else
    printf( "No test-battery selected!\n" );
#endif

    // Clean up.
    unif01_DeleteExternGenBits( gen );
    return 0;
}
I'm calling this xoroshiro32+x+, as I can't use +*+ in a filename. Below is the overall top 20 for prank+zrank+nzrank, followed by the top 10 prank/zrank/nzrank not in the overall top 20, plus [13,5,10,9].
Here we go. I've tacked a zero on the name to distinguish it.
EDIT: To put that name in context, it's kind of formally named as part of the new collection. See the third attachment. So I'm using this to test out the revised scripts.
EDIT2: The second attachment is three CSV files for the three distributions. I don't know if they'll be any use, but I've added building these to the automation anyway.
Here's the ten best, I think*, grid scores from last year's ++ effort. Notably, there are three that have minimums of 29, [13 5 10 9] being one, while, just as notably, the averages are all within 0.5 of each other.
* We spent a lot of time doing double iterated scoring so I'm not sure where these scores actually came from.
EDIT: Ah, these were the ten best distributions by the looks. Actually, there were twelve, but a couple were lower scoring.
Ha, that selection definitely produces better grid scores. A top average of 30.2 is notably better than anything else so far. There's a lot of variation, but seven minimums of 29 also whips the smile off (s1 * 5), which got none at all.
I've now adopted the shorthand scrambler naming, eg: +x+, in the complete names for the algorithms. It seems to flow nicely through the scripts and filenames without issue.
That makes some sense now, since s0 * 5 was required for maintaining 2-D equidistribution of my streaming version across all streams.
That would tend to indicate a form of hidden order that would extend to the full period statistics you are measuring.
That can't happen with s1 * 5 (for some reason), but makes for some good PractRand scores at 32-bit.
Hopefully the combination of s0 * 5, without the ( s0+s1 ) * 5 that I am using, will have enough effect on all stats for some candidates.
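For concreteness, here are the two competing 16-bit back-end terms written out in the "+x+" form used earlier in the thread (the shared (s0 + s1) * 5 front end and the rotate of 8 follow the quoted example; the engine feeding s0/s1 is omitted):

```c
#include <stdint.h>

static uint16_t rotl16( uint16_t x, int k )
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* Common plus-mult front end, then the two competing final terms. */
static uint16_t scramble_s0( uint16_t s0, uint16_t s1 )   /* ... + s0 * 5 */
{
    return (uint16_t)(rotl16( (uint16_t)((s0 + s1) * 5), 8 ) + s0 * 5);
}

static uint16_t scramble_s1( uint16_t s0, uint16_t s1 )   /* ... + s1 * 5, the "smile" */
{
    return (uint16_t)(rotl16( (uint16_t)((s0 + s1) * 5), 8 ) + s1 * 5);
}
```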
And on that note, here's my sources in their latest glory for you to try out. It's not everything but what I've updated for the runs we've been doing in the last couple of weeks.
All the scripts expect the src directory to be present, and they will build all workings in a fresh directory called tests. I'm using Ubuntu (Kubuntu) as my OS, so there are likely some assumptions about what is installed by default, like GCC, Bash, Awk, Grep and others of similar ilk.
General flow is: find-fullp-candidates -> run-distribution -> merge-distribution -> sort the result and trim to acceptable number of candidates (Tony is doing this for me) -> paste into candidates.cfg file -> run-grid-scoring -> assemble-grids
You will find that results like the grid tables I've posted will be created within the relevant subdirectory of the tests directory.
Oh, I forgot about PractRand. The compiled copy I've made is oddly named RNG_test.unchanged, and it sits in the same base directory as those scripts. The name and path to PractRand can be edited at the top of run-grid-scoring.sh.
Ha, and looking at the top of the scripts, I see I've accidentally left the double-iteration switch enabled in them all. You'll most likely want to comment that out.
I'll have to study the *nix PractRand build, as I have never used it.
I am curious about that... how many seconds (one instance) to 1 GB using 'stdin -tf 2 -te 1 -tlmin 1KB -multithreaded' on your 8-core 4 GHz CPU?
I've long forgotten the answer... although I did just add something to the report files... 2 min 40 s for 2 GB with eight parallel single-tasking jobs.
97 seconds under the same conditions but for 1 GB. Both were double-iterated cases.
Doing a special run (back at single iterated):
- Used PractRand v0.93 options: stdin -multithreaded -te 1 -tf 2 -tlmin 1KB
- tests/engine-xom0-s16/pract/a5b14c12/case-a5b14c12d8w16p0 produced a failure score at 1 GB.
- Forcing a single case at a time netted me 59 seconds.
Allowing parallel tasks again, for the same single-iterated candidate, the per-case time jumps from 59-60 seconds to 80 seconds for a 1 GB score. Which is bang on half the 2:40.
Not sure what to make of that 97 seconds now. Maybe I misread it; I'm working with datestamps. EDIT: Oh, it was during normal use of the computer, and all these scripts intentionally lower the task priority so I can do other stuff while waiting for results. Maybe I'd done something that was slowing the gridding down for that particular case.
Okay, it's not that simple. I've got a 2 GB score here in just 95 seconds. Looks like there is a lot of variation in PractRand speed, depending on the data stream content. Maybe single-tasking is more consistent.
Thanks for the results... those seem consistent with Windows binaries.
Just brought my small workstation to idle, a 6-core/12-thread Intel Xeon W3690 @ 3.8 GHz.
Piping the output of a 32-bit exe to PractRand:
1 GB = 80 seconds
2 GB = 105 seconds
PractRand reporting overhead is quite a substantial fraction until you get to about 32 GB.
I know my dual-CPU workstation is about 25% faster since it can satisfy all of the threads from PractRand.
Doh! Of course, although both were running in parallel with 7 other cases, the 160 seconds (2:40) one was without the -multithread option whereas the 95 seconds was with it.
BTW: I just looked up the W3690 and I can see it's longer in the tooth than I expected, eg: the architecture dates from 2011, with DDR3 RAM. It looks to be the first of the "Core i#" generation with a built-in north bridge. I presume you got it second-hand cheaply. It's performing really well!
Here's another distribution run dataset. This one is from the recent streamlined "plusmultplus".
Is s1 * 5 better than s0 * 5?
xoroshiro32+x+[a,b,c,d]
xoroshiro32++ rankings here:
http://forums.parallax.com/discussion/comment/1468718/#Comment_1468718
:shrug: I'll do a distribution run for that too ...
EDIT4: Refresh the algorithms file.
I also bumped into a slightly higher grid score with candidate [14 1 11 8]
I'm thinking that with a more exhaustive search of Chris's newest scramblers we'll get some higher averages, say 30.5.
xom0 rankings:
EDIT: Accordingly, I've updated the algorithms zip file above - https://forums.parallax.com/discussion/comment/1469195/#Comment_1469195