Chris/Tony,
I had commented out the older result collating code that we were using for Tony to process. Re-enabling that, here's them and associated CSV files that might be handy if you want to do your own distribution stats.
The pfChi is perfect... it should be about 12.
The zfChi is only a little low... it should be about 20.
The nzfChi is only a little low... it should be about 44.
Edit: Each perfect value is approximately the total number of > 0 freqs summed, with each individual value that created the freqsum being about 1 (non-linear, but usually in the range 0.05 to 5).
#a b c d pfreq0 pfreq1 pfreq2 pfreq3 pfreq4 pfreq5 pfreq6 pfreq7 pfreq8 pfreq9 pfreq10 pfreq11 pfreq12 pfreq13 total
#normal ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~1 ~14
The above is an approximation, as are the following: as each '~1' is most likely to be in the range of 0.15 to 6, and the sum '~14' is likely to be in the range of 4 to 30.
With only a single trial, high freq values near the end may carry excess weight, so stopping before actual '0's appear can minimize bias, which would eventually dissipate with the addition of more 'randomly seeded trials'* of the same PRNG.
Bear in mind that different initial acc values will affect how closely the distributions match the ideal ones, therefore more than one initial acc should be tested. Other [a,b,c,d] will be better than [13,5,10,9] for init_acc = X, but could [13,5,10,9] do just as well if not better when init_acc = Y?
Do pfreq/zfreq/nzfreq ever repeat for different init_acc or is there always one best value? If the latter, do Chi-square scores fall away smoothly either side of this peak init_acc? Probably the only practicable way to test this is to look at xoroshiro16++ with all possible 256 acc1 values.
So just completed full distribution run of ++1a. CSV file attached
This latest run of ++1a supports everything above.
From the file, look at the freq Chi^2 sum averages across all candidates: pfChiAvg=12, zfChiAvg=21, nzfChiAvg=43
These averages match the values I predicted very closely (12, 20 and 44).
Also from the file, the min and max values for pfChi are 2 and 34, respectively, which also match the values I predicted very closely (4 and 30), but are slightly more extreme due to the large number of values.
Based on that, the numbers in the ++1a file are (log?) normally distributed and exactly what would be found if a perfect PRNG was run 1344 times with a different set of seeds each time.
To actually distinguish between them (and eliminate most of the skew of the resulting data), it would require running freqs on each of the 1344 PRNGs at least 12-16 times (30 would be best) using completely independent sets of seeds for states 0,1 and 2 each time. If you pick the seeds ahead of time for use in all candidates, that is fine.
The winning candidates would be those closest to having an overal average of about 12, 20 and 44... depending on the max x in freq(x) used to create the Chi^2 sums.
Due to the nature of this ++1a PRNG design, you have a distinct possibility that a majority of candidates are acceptably random, but there will be a subset more obviously closer to theoretical values I listed. Anything within 10-15% would be expected and acceptable.
Tony would likely prefer to use the raw freq data for each trial, which allows for a more exact calculation (as he pointed out my math was sloppy) which would then reduce the list of candidates to a few dozen... with [13 5 10 9] likely not near the top of the list (from previous testing only about 6 trials were required to prove it was not truly random).
To actually distinguish between them (and eliminate most of the skew of the resulting data), it would require running freqs on each of the 1344 PRNGs at least 12-16 times (30 would be best) using completely independent sets of seeds for states 0,1 and 2 each time. If you pick the seeds ahead of time for use in all candidates, that is fine.
The winning candidates would be those closest to having an overal average of about 12, 20 and 44... depending on the max x in freq(x) used to create the Chi^2 sums.
So you want this full set distribution run repeated several times with each new full set using another seed in the accumulator, right? Successive seeds of 1,2,3,4,5,6,7,... would be fine I assume.
BTW, This is all still single iterated. I've not tried engaging double iterating.
To actually distinguish between them (and eliminate most of the skew of the resulting data), it would require running freqs on each of the 1344 PRNGs at least 12-16 times (30 would be best) using completely independent sets of seeds for states 0,1 and 2 each time. If you pick the seeds ahead of time for use in all candidates, that is fine.
The winning candidates would be those closest to having an overal average of about 12, 20 and 44... depending on the max x in freq(x) used to create the Chi^2 sums.
So you want this full set distribution run repeated several times with each new full set using another seed in the accumulator, right? Successive seeds of 1,2,3,4,5,6,7,... would be fine I assume.
BTW, This is all still single iterated. I've not tried engaging double iterating.
Evan, if you are using the previous p/z/nzfreq C program with acc code added as I suggested, how can it be single iterated? There will be 2^32-1 double iterations, each producing a 32-bit output. If it were equidistributed (which it is NOT), each non-zero output would occur once and zero never. I assume you're calculating Chi-Square by comparing with the ideal binomial output (N/e outputs never occur, N/e occur once and the rest more than once).
The new acc increases effective state bits by 50% to state + top half of acc (acc1). It would be good if we knew how changing initial acc1 but keeping state the same affects p/z/nzfChi and p/z/nzRank. (Different initial states could be tested later.)
Successive seeds of 1,2,3,4,5,6,7 would be OK for acc1, but not acc as a whole. As it would take a very long time to fully test xoroshiro32++1a, I suggest testing xoroshiro16(8)++1a with every possible initial acc = 0000, 0100, 0200, ..., FD00, FE00, FF00. That's 256 separate acc seeds for 26 [a,b,c] * 8 = 208 [a,b,c,d] which is more than 50000 separate tests, however each one will have only 2^16-1 double iterations and require only 64KB for the byte array.
Brute forcing this is the only way to understand what is going on.
There could be separate files for every xoroshiro16(8)++1a [a,b,c,d] with 256 lines for each acc1, or one file for each [a,b,c] with 2048 lines for each d = 0-7 and each acc1.
It would be good if we knew how changing initial acc1 but keeping state the same affects p/z/nzfChi and p/z/nzRank. (Different initial states could be tested later.)
I agree, but it runs more of a risk of skewing the results, unless perhaps that the acc1 seeds were randomly chosen to increase the chance that they are maximally de-correlated from each other.
I suggest testing xoroshiro16(8)++1a with every possible initial acc = 0000, 0100, 0200, ..., FD00, FE00, FF00. That's 256 separate acc seeds for 26 [a,b,c] * 8 = 208 [a,b,c,d] which is more than 50000 separate tests, however each one will have only 2^16-1 double iterations and require only 64KB for the byte array.
That would certainly answer the question of whether or not skew might play a part when sequential acc1 seeds are used (which is normally a bad idea in actual usage).
I think the freq code Evan is using would be close to being able to handle the 16-bit outputs of of a double-iterated xoroshiro16(8)++1a.
That code for 32-bit variables has no 32-bit equidistribution when translated to 16-bit variables, as written (but does include the comment on how to do so).
Do tell. EDIT: Hang-on. Chris, haven't you just been saying how well the numbers have panned out in the most recent run? I don't get what's wrong here.
Do tell. EDIT: Hang-on. Chris, haven't you just been saying how well the numbers have panned out in the most recent run? I don't get what's wrong here.
The only issue is with the equidistribution at full period... the numbers will look good either way at a fraction of the entire period (2^48-2^16, or 2^49-2^17 for toggling +1), but might change slightly, which is what Tony is worried about. The PractRand results will likely come out better also.
You could combine Tony's code above using ACC with my TOGGLE_SUM/CONSTANT_E/e in the code I posted to make things more readable and still configurable, perhaps.
Do tell. EDIT: Hang-on. Chris, haven't you just been saying how well the numbers have panned out in the most recent run? I don't get what's wrong here.
The only issue is with the equidistribution at full period... the numbers will look good either way at a fraction of the entire period (2^48-2^16, or 2^49-2^17 for toggling +1), but might change slightly, which is what Tony is worried about. The PractRand results will likely come out better also.
You could combine Tony's code above using ACC with my TOGGLE_SUM/CONSTANT_E/e in the code I posted to make things more readable and still configurable, perhaps.
Evan, the new xoroacc code (that will become logic sometime, I hope) produces 32-bit equidistribution after 64K streams and any C code that you use for testing now must work in the same way, even though the testing length is only one stream, otherwise all the testing will need to be done again 'properly' later.
The critical point is the + 1 happens only once every double iteration, i.e. every other single iteration, hence Chris's TOGGLE_SUM. The convention we have used for some time is that + 1 (or more generally + e) applies to acc0, the low half of acc. It could have been acc1 instead, but we have to choose one or the other and stick to it.
acc0 := acc1 + prn0 + 1
acc1 := acc0 + prn1
EDIT:
It is possible there might not be any differences in the overall distributions between the "+1,+1" tests done so far and the "+1,+0" needed for 32-bit equidistribution. Perhaps a few [a,b,c,d] should be tried first rather than the whole lot and this also applies to testing all possible acc1 with xoroshiro16++a.
Sorry, I hadn't been clear about the third option, which Tony and I discussed briefly.
It was written for single iteration in the case that someone wanted to access perfect (not just near-perfect) 32-bit equidistribution and didn't mind translating that 32-bit variable code directly for use in the P2, etc.
Originally I thought you were focusing on that code translated to 16-bit variables so that you could model the + 1 toggle mode more easily by adding a state bit, as required.
What you have done with the non-toggle so far is get a taste of the difficulty in using freq information, since the randomness is capped at either 8G or 16GB in PractRand depending on toggle usage, but does not give meaningful data about how random the average stream is.
The ultimate goal with freqs would be to ascertain how well [13 5 10 9] compares on average against other constants through multiple double-iterated streams, in the fashion Tony mentioned (by modeling a double-iterated 8-bit variable version using toggle in which all constants could all complete 256 seperate times, once per extended state, very quickly), before stepping up to the 16-bit toggle version.
Evan, some extra detail on chi^2 sums that was not as important for XORO32, but now is more important on XOROACC, since much of the variant results will be already plausibly random at only one double-iterated stream.
If you want to dig in to an appoximate probability that a given Chi^Sum could occur, you can use Open Office Calc or Microsoft Excel.
x = Sum of freqs, y= Number of (non-zero) freqs summed - 1
Calc: '=CHIDIST(x;y)'
Excel: '=CHIDIST(x,y)'
When looking at hundreds of 'p-values', just assume that the vast majority should be in the range 0.001 (not random/uniform) to 0.999 (too random/uniform).
Regardless of the above, when calculating chi^2 freq sums for storing to a file, it is better to use a more exact floating point value during calculations for the 'ideal' (e.g. pfreq(0) 1580030168.7021, instead of 1580030169).
This will not make much difference for the larger values, but becomes more important for the smaller values (e.g. pfreq(12) 3.298591, instead of 3), which will affect the overall sum.
When you have a series of actual freq(x) values from multiple streams, you can do the following for each x to generate a composite freq(x) before summing:
('sum of all freq(x)' - 'total number of freq(x)' * 'ideal freq(x)')^2/('total number of freq(x)' * 'ideal freq(x)')
On XOROACC, what you will find is that as 'total number of freq(x)' (i.e. the number of streams) increases, the sum will trend upward.
This upward trend will not happen within 30 streams on a much better quality PRNG, and knowing that helps categorize candidates by multi-stream quality to help choose PractRand candidates, or if there is no visible difference in PractRand results, as a tie-breaker.
I'm not sure how far Tony wants you to go with this, as the likely outcome is that some candidates will be somewhat measurably superior to [13 5 10 9]++a, but ultimately distinguishing between different variants of that one (e.g. with rotl(,1), etc.) will be important to justify any added complexity.
I've attached 50 randomly seeded double-iterated streams of [13 5 10 9]++a with +1 on first iteration and rotl(,1) on the second, just as a reference.
Just to confirm, are you saying that +1 and rotl(XOROACC,1) are acceptable with xoroshiro32++ and anything more complicated is unnecessary? If so, then the +1 could be implemented as a simple add-with-carry.
Just to be clear, at this state size +1 and revbits(XOROACC) is still noticeably superior to +1 and rotl(XOROACC,1). At larger state sizes the difference would not be noticeable.
To illustrate that point, have a look at the attached plots for revbits, where I have converted the chi^2 sums to p-values for each of 100+ random double-iterated stream trials, then sorted the p-values.
I have not found anything better than this that did not involve much more complicated code.
Regarding alternative scramblers/constants using only +1 and +0 (have not tested rotl/revbits):
Presumably revbits (without +1) would also be superior on the second iteration here.
This is about as simple as it could possibly get, if a fresh implementation were ever needed.
Just to confirm, are you saying that +1 and rotl(XOROACC,1) are acceptable with xoroshiro32++ and anything more complicated is unnecessary? If so, then the +1 could be implemented as a simple add-with-carry.
Just to be clear, at this state size +1 and revbits(XOROACC) is still noticeably superior to +1 and rotl(XOROACC,1). At larger state sizes the difference would not be noticeable.
To illustrate that point, have a look at the attached plots for revbits, where I have converted the chi^2 sums to p-values for each of 100+ random double-iterated stream trials, then sorted the p-values.
I have not found anything better than this that did not involve much more complicated code.
Apologies for the slow reply, partly due to computer problems. Reversing bits would be simple in hardware (less so for testing with x86 code, though).
Presumably revbits (without +1) would also be superior on the second iteration here.
This is about as simple as it could possibly get, if a fresh implementation were ever needed.
* 5 would require * 4 + * 1 and I think there would not be enough time for a modified P2 to cascade 16-bit state[2] to get a combined 32-bit result.
* 5 would require * 4 + * 1 and I think there would not be enough time for a modified P2 to cascade 16-bit state[2] to get a combined 32-bit result.
I thought it would be helpful to list expected randomness from most-random to least-random of variants:
1. XOROACC xoroshiro32pp+1, revbits(xoroshiro32pp)
2. XOROACC xoroshiro32pp+1, rotl(xoroshiro32pp,1)
3. Alternate (from above) rotl(state[0] * 5, D) + 1, rotl(state[0] * 5, D)
4. XORO32 (based on ++)
Obviously revbits would be the way to go, especially considering the preexisting hardware basis.
The 'Alternate' (in optimized form, currently named 'sps') will be what I release for x64, 192-bit state, plus a faster version ('psp') that is intended to replace xoroshiro+.
Those are only beta names... 'acc' will be added later (in some fashion?).
Some Linux benchmark results attached that attempt to cover real-world performance of 64-bit code for integer and floating-point.
Tony and Evan, just a quick status update on my work.
I plan on reaching out to Seba in the near future, as I have have discovered a level above #1 in the list above which is simpler (e.g. no revbits), faster (up to ~35%), more random, and has potential crypto applications.
In terms of XORO32 (and previous XOROACC's discussed), this new XOROACC variant I have discovered fails PractRand at 256GB forward bits and 2TB reverse bits (in the general purpose version, at 49 bits state),
I have attached 1 set2 sets of benchmarks (of many I have run on various old and new Intel and AMD CPUs) for floating point and general purpose variants, that hopefully demonstrates the full potential of extended state xoroshiro.
I appreciate all of the effort both of you have put into entertaining my explorations of scramblers for the xoroshiro engine.
Edited for clarity, in bold.
Added AMD benchmark.
Evan, I also attached a buffered version of TestU01_stdin, which is ~2x faster.
Comments
I had commented out the older result collating code that we were using for Tony to process. Re-enabling that, here's them and associated CSV files that might be handy if you want to do your own distribution stats.
From the file, look at the freq Chi^2 sum averages across all candidates: pfChiAvg=12, zfChiAvg=21, nzfChiAvg=43
These averages match the values I predicted very closely (12, 20 and 44).
Also from the file, the min and max values for pfChi are 2 and 34, respectively, which also match the values I predicted very closely (4 and 30), but are slightly more extreme due to the large number of values.
Based on that, the numbers in the ++1a file are (log?) normally distributed and exactly what would be found if a perfect PRNG was run 1344 times with a different set of seeds each time.
To actually distinguish between them (and eliminate most of the skew of the resulting data), it would require running freqs on each of the 1344 PRNGs at least 12-16 times (30 would be best) using completely independent sets of seeds for states 0,1 and 2 each time. If you pick the seeds ahead of time for use in all candidates, that is fine.
The winning candidates would be those closest to having an overal average of about 12, 20 and 44... depending on the max x in freq(x) used to create the Chi^2 sums.
Due to the nature of this ++1a PRNG design, you have a distinct possibility that a majority of candidates are acceptably random, but there will be a subset more obviously closer to theoretical values I listed. Anything within 10-15% would be expected and acceptable.
Tony would likely prefer to use the raw freq data for each trial, which allows for a more exact calculation (as he pointed out my math was sloppy) which would then reduce the list of candidates to a few dozen... with [13 5 10 9] likely not near the top of the list (from previous testing only about 6 trials were required to prove it was not truly random).
BTW, This is all still single iterated. I've not tried engaging double iterating.
Evan, if you are using the previous p/z/nzfreq C program with acc code added as I suggested, how can it be single iterated? There will be 2^32-1 double iterations, each producing a 32-bit output. If it were equidistributed (which it is NOT), each non-zero output would occur once and zero never. I assume you're calculating Chi-Square by comparing with the ideal binomial output (N/e outputs never occur, N/e occur once and the rest more than once).
The new acc increases effective state bits by 50% to state + top half of acc (acc1). It would be good if we knew how changing initial acc1 but keeping state the same affects p/z/nzfChi and p/z/nzRank. (Different initial states could be tested later.)
Successive seeds of 1,2,3,4,5,6,7 would be OK for acc1, but not acc as a whole. As it would take a very long time to fully test xoroshiro32++1a, I suggest testing xoroshiro16(8)++1a with every possible initial acc = 0000, 0100, 0200, ..., FD00, FE00, FF00. That's 256 separate acc seeds for 26 [a,b,c] * 8 = 208 [a,b,c,d] which is more than 50000 separate tests, however each one will have only 2^16-1 double iterations and require only 64KB for the byte array.
There could be separate files for every xoroshiro16(8)++1a [a,b,c,d] with 256 lines for each acc1, or one file for each [a,b,c] with 2048 lines for each d = 0-7 and each acc1.
Binomial distributions for N = 2^16 samples:
Floating-point would be more accurate than nearest integer, though.
I think the freq code Evan is using would be close to being able to handle the 16-bit outputs of of a double-iterated xoroshiro16(8)++1a.
Yes, no rotation of prn1 just yet, for simplicity, not because I think it's not worthwhile.
Testing xoroshiro16(8)++1a with every possible initial acc1 should tell us a lot more than xoroshiro32(16)++1a with a few randomly chosen acc1.
Not right. Adding acc to your working C prog (post unknown) is the way to do it..
Doing a double iteration using acc[0] and acc[1] would be better than the confusing s[2], I think. The +1 added to s[2] every iteration was the error.
You could combine Tony's code above using ACC with my TOGGLE_SUM/CONSTANT_E/e in the code I posted to make things more readable and still configurable, perhaps.
Evan, the new xoroacc code (that will become logic sometime, I hope) produces 32-bit equidistribution after 64K streams and any C code that you use for testing now must work in the same way, even though the testing length is only one stream, otherwise all the testing will need to be done again 'properly' later.
The critical point is the + 1 happens only once every double iteration, i.e. every other single iteration, hence Chris's TOGGLE_SUM. The convention we have used for some time is that + 1 (or more generally + e) applies to acc0, the low half of acc. It could have been acc1 instead, but we have to choose one or the other and stick to it.
EDIT:
It is possible there might not be any differences in the overall distributions between the "+1,+1" tests done so far and the "+1,+0" needed for 32-bit equidistribution. Perhaps a few [a,b,c,d] should be tried first rather than the whole lot and this also applies to testing all possible acc1 with xoroshiro16++a.
EDIT: Okay, maybe not quite stated as such, but certainly the only one of interest to me - https://forums.parallax.com/discussion/comment/1486685/#Comment_1486685
It was written for single iteration in the case that someone wanted to access perfect (not just near-perfect) 32-bit equidistribution and didn't mind translating that 32-bit variable code directly for use in the P2, etc.
Originally I thought you were focusing on that code translated to 16-bit variables so that you could model the + 1 toggle mode more easily by adding a state bit, as required.
What you have done with the non-toggle so far is get a taste of the difficulty in using freq information, since the randomness is capped at either 8G or 16GB in PractRand depending on toggle usage, but does not give meaningful data about how random the average stream is.
The ultimate goal with freqs would be to ascertain how well [13 5 10 9] compares on average against other constants through multiple double-iterated streams, in the fashion Tony mentioned (by modeling a double-iterated 8-bit variable version using toggle in which all constants could all complete 256 seperate times, once per extended state, very quickly), before stepping up to the 16-bit toggle version.
If you want to dig in to an appoximate probability that a given Chi^Sum could occur, you can use Open Office Calc or Microsoft Excel.
x = Sum of freqs, y= Number of (non-zero) freqs summed - 1
Calc: '=CHIDIST(x;y)'
Excel: '=CHIDIST(x,y)'
When looking at hundreds of 'p-values', just assume that the vast majority should be in the range 0.001 (not random/uniform) to 0.999 (too random/uniform).
Regardless of the above, when calculating chi^2 freq sums for storing to a file, it is better to use a more exact floating point value during calculations for the 'ideal' (e.g. pfreq(0) 1580030168.7021, instead of 1580030169).
This will not make much difference for the larger values, but becomes more important for the smaller values (e.g. pfreq(12) 3.298591, instead of 3), which will affect the overall sum.
When you have a series of actual freq(x) values from multiple streams, you can do the following for each x to generate a composite freq(x) before summing:
('sum of all freq(x)' - 'total number of freq(x)' * 'ideal freq(x)')^2/('total number of freq(x)' * 'ideal freq(x)')
On XOROACC, what you will find is that as 'total number of freq(x)' (i.e. the number of streams) increases, the sum will trend upward.
This upward trend will not happen within 30 streams on a much better quality PRNG, and knowing that helps categorize candidates by multi-stream quality to help choose PractRand candidates, or if there is no visible difference in PractRand results, as a tie-breaker.
I'm not sure how far Tony wants you to go with this, as the likely outcome is that some candidates will be somewhat measurably superior to [13 5 10 9]++a, but ultimately distinguishing between different variants of that one (e.g. with rotl(,1), etc.) will be important to justify any added complexity.
I've attached 50 randomly seeded double-iterated streams of [13 5 10 9]++a with +1 on first iteration and rotl(,1) on the second, just as a reference.
To illustrate that point, have a look at the attached plots for revbits, where I have converted the chi^2 sums to p-values for each of 100+ random double-iterated stream trials, then sorted the p-values.
I have not found anything better than this that did not involve much more complicated code.
Regarding alternative scramblers/constants using only +1 and +0 (have not tested rotl/revbits): Presumably revbits (without +1) would also be superior on the second iteration here.
This is about as simple as it could possibly get, if a fresh implementation were ever needed.
Apologies for the slow reply, partly due to computer problems. Reversing bits would be simple in hardware (less so for testing with x86 code, though).
* 5 would require * 4 + * 1 and I think there would not be enough time for a modified P2 to cascade 16-bit state[2] to get a combined 32-bit result.
1. XOROACC xoroshiro32pp+1, revbits(xoroshiro32pp)
2. XOROACC xoroshiro32pp+1, rotl(xoroshiro32pp,1)
3. Alternate (from above) rotl(state[0] * 5, D) + 1, rotl(state[0] * 5, D)
4. XORO32 (based on ++)
Obviously revbits would be the way to go, especially considering the preexisting hardware basis.
The 'Alternate' (in optimized form, currently named 'sps') will be what I release for x64, 192-bit state, plus a faster version ('psp') that is intended to replace xoroshiro+.
Those are only beta names... 'acc' will be added later (in some fashion?).
Some Linux benchmark results attached that attempt to cover real-world performance of 64-bit code for integer and floating-point.
I plan on reaching out to Seba in the near future, as I have have discovered a level above #1 in the list above which is simpler (e.g. no revbits), faster (up to ~35%), more random, and has potential crypto applications.
In terms of XORO32 (and previous XOROACC's discussed), this new XOROACC variant I have discovered fails PractRand at 256GB forward bits and 2TB reverse bits (in the general purpose version, at 49 bits state),
I have attached 1 set 2 sets of benchmarks (of many I have run on various old and new Intel and AMD CPUs) for floating point and general purpose variants, that hopefully demonstrates the full potential of extended state xoroshiro.
I appreciate all of the effort both of you have put into entertaining my explorations of scramblers for the xoroshiro engine.
Edited for clarity, in bold.
Added AMD benchmark.
Evan, I also attached a buffered version of TestU01_stdin, which is ~2x faster.