Rather than one word constant, I think the 64K outputs with both words the same would be better, as it would test the immediate effect of the xor and bit reversal in all of these outputs.
I see the wisdom in this approach, as it might be useful for vetting constants much faster (and perhaps more reliably), as compared to running PractRand and BigCrush first.
This will test 192GB of output out to ~12 double iterated streams, which I can normalize against the following table, which I would like you to verify:
Assuming the above is correct, here is the results for 'n ^ AD65h/RevBits(n)', as a goodbad example (data here is incorrect, will create a post with good data later):
Edit: Fixed double-iteration formula, and also notice that AD65h and its bit reverse, A6B5h, are both prime (but not sure of implications). Edit2: There was a correlation issue with my seeding/re-seeding, which somehow managed to make all of the above AD65h results closer to theoretical than they should be.
I will post correct results in response to Tony's request below.
Thanks for the results, I'll attempt to verify. I've been preoccupied with something else recently, but rest assured I haven't lost any interest. Are the results significantly worse if no bit reversal, nor anything else, is applied to the high word?
Attached file (.txt as .csv is not allowed) containing processed results consisting of 2 trials of 16 runs each for several variants on ^AD65h and ^A6B5h, where the total non-occurring value results were summed in each trial for each successive stream, then averaged, and finally subtracted from the theoretical value to create a difference result (which should approach zero).
The three variants tested (for each first iteration xor constant) were only different in how the second iteration value was processed: no-operation [NOP(xororshiro++)], rotate left 1 bit [ROL(xororshiro++,1)], and reverse bits [REVBITS(xororshiro++)]
Larger differences in the first several streams might be tolerable for a PRNG of much larger period (that did not have to satisfy equidistribution requirements in 65536 16GB streams), but (average of 16 run) differences greater than ~15 in the 6th stream or later would be an issue for any PRNG of larger period.
Edit: Keep in mind that the NOP results are not just obviously worse, but also possibly a side-effect indicated by the early GAP fail and FPF issues with PractRand. Edit2: There is a small error in the expected zeros and diff results in the attachment. Streams 1-6 (only) are offset by 10, 8, 4, 2, 0 and 1, respectively, which only minimally impacts the comparison of results.
This will test 192GB of output out to ~12 double iterated streams, which I can normalize against the following table, which I would like you to verify:
I think the expected values above are not quite right for what I call pfreq0, the frequency of pair outputs that occur 0 times. If possible outputs = N and stream = S, then pfreq0 = pfreq1 = N/e^S. Here are my values to the nearest integer:
As the current emphasis is on the 64K duplicates, I've listed only the Stream 1 expected for All and the actual value for xoroshiro32++[13,5,10,9] = 1580109068, which is better than 99.99% accurate.
I think the expected values above are not quite right for what I call pfreq0, the frequency of pair outputs that occur 0 times...
Thanks for finding that. I replicated the issue... I typed e from memory and accidentally transposed two pairs of digits.
Next time I will just type 'Exp(1)', as reciting e is a dyslexic endeavor for me (and perhaps many others, I guess).
Luckily, the maximum error in my results is still small enough to accurately discern RevBits as the better choice when considering only those two constants.
I planned to write an automated search for least (perhaps weighted) differential sum of 12 (or 13 streams), considering only primes that have 7 to 9 bits set.
However, in order to establish automated vetting criteria (which will be used to expedite the process), I need to create some more reference data using a good PRNG (that is beyond reproach).
I planned to write an automated search for least (perhaps weighted) differential sum of 12 (or 13 streams), considering only primes that have 7 to 9 bits set.
However, in order to establish automated vetting criteria (which will be used to expedite the process), I need to create some more reference data using a good PRNG (that is beyond reproach).
Although it seems logical to consider primes, I am unaware of any proof that primes are directly applicable to this scenario. Prime, or not, bit pattern distribution is certainly relevant.
Additionally, a 'good' PRNG will have a much greater StdDev of trial stream differences, so is not very useful with XOROACC due to the relatively short protraction of the 16/17 bits of state extension.
Therefore, I have written a search for odd xor values that produce an absolute difference (average of 16 trials) sum of less than 80 (which is the current absolute difference sum for AD65h).
I will take the best of those and look at some basic PractRand and gjrand stats (since they are quick to produce), before considering deeper PractRand and BigCrush analysis.
Tony, have a look at the attachment, which is a fairly comprehensive analysis of 'xoroshiro32pp[13,5,10,9]^AD65h,Revbits(xoroshiro32pp[13,5,10,9])'.
In comparison to all of the other variations looked at, this is the best so far.
Let me know if you have any questions about the data, or if any further information is required.
Thanks for these latest results.
Re the ideal vs. actual pfreq0 distributions for the 64K duplicates (both words of 32-bit value the same), I tested xoroshiro64++[26,9,13,17] just to see how it behaved and here are the results for a single run with initial state = 1:
Note that a "stream" for xoroshiro64++ is a period of 2^32-1 iterations to match that for xoroshiro32. (It makes no difference in this case if 2^32 is used instead.) I'm hoping that using the amended ideal values I posted your averaged results for xoroshiro32++a/pp will improve a little.
Being able to record all the outputs to see how many iterations it takes for each of them to occur at least once would be more accurate than the duplicates and it would also give us the distributions of zero and non-zero runs (zfreq and nzfreq). 512MB would be enough to record all 2^32 outputs at one bit per output, but I still couldn't do it because that's all the RAM there is in my current PC.
A Chi-Square sum of (Actual - Expected)^2 / Expected for streams 1-11 would tell us the "goodness of fit" to the ideal pfreq0 distribution, the lower the better.
Slight bias in streams 1, 2 and 5 is obvious, and greater than expected given the nature of this type of PRNG... however, I am in the process of trying hundreds of odd values, and have found nothing obviously better as yet, but am exploring a few contenders... I am fairly sure they are all prime, but need to double-check. Edit: 5B69h is NOT PRIME, and by initial inspection of pfreq0, PractRand and gjrand 32GB looks better than AD65!
I can likely perform a full test in 64-bit mode, in order to quickly allocate a full count array. Byte should be fine, I guess. Edit2: Code written (attached... I really dislike writing in C++, but I cannot argue with the performance) and some tests already done.
5B69h is no good, as it takes 27 streams to complete, instead of the expected 23. AD65h is still the top contender.
Now looking at some other prime/non-prime values of seemingly good merit (i.e. low chi-square sum).
Will post some data in the next several days.
Here is pfreq0 w/chi-square sum data on '^AD65h,RevBits', and 'Xoroshiro128** High 32-Bits', for comparison (single test, randomly seeded, scroll to bottom of table for sums):
By contrast, the simple variants using '^1,rol1', '+1,rol1' and '+1,revbits' all have lower pfreq0 chi-squared sums (20, 27, and 21, respectively)...
Yet PractRand and gjrand agree that the simpler variants are significantly inferior to '^AD65h,RevBits' at 32GB (in multiple ways), which is only 2 streams.
The dozens of the other constants I have considered with any depth have not made it through the gauntlet of tests I have done without showing some obvious weakness.
Due to the vast amount of testing required to adequately explore more candidates, I should say that I believe that '^AD65h,RevBits' must be within the top 99.5%, and that any improvements would be statistically very small by comparison.
I'm hoping that using the amended ideal values I posted your averaged results for xoroshiro32++a/pp will improve a little.
I had used the correct values for '^AD65,revbits' in the last zip attachment here, but did not calculate the chi-squared values or sums in the ods file... all of the required data is there.
Being able to record all the outputs to see how many iterations it takes for each of them to occur at least once would be more accurate than the duplicates and it would also give us the distributions of zero and non-zero runs (zfreq and nzfreq).
Being able to record all the outputs to see how many iterations it takes for each of them to occur at least once would be more accurate than the duplicates and it would also give us the distributions of zero and non-zero runs (zfreq and nzfreq).
That post is a concise description of the distributions and there isn't much more I can say apart from a good distribution does not imply a good score in randomness tests. [13,5,10,9] was chosen for xoroshiro32++ as it had the best (lowest) prank+zrank+nzrank and it performed well in PractRand tests.
XOROACC should be capable of slightly better pfreq, zfreq & nzfreq distributions than xoroshiro32++ [13,5,10,9] within a single stream. In other words, users wanting random 32-bit outputs that are a bit closer to the ideal could use XOROACC for 2^32-1 double-iterations instead of XORO32. Have you compared stream 1 Chi-Square pfreq/zfreq/nzfreq values for '^AD65,revbits' with the simple variants?
Chi-Square for multiple streams is something new and I need to think whether zfreq and nzfreq are still relevant and if so how. Maybe something else should be measured, such as the interval in double-iterations between outputs that have not occurred before. Note that the highest Chi-Square values for '^AD65,revbits' and most of its total are for streams 13-17. How do the simple variants perform for streams 1-12?
That post is a concise description of the distributions and there isn't much more I can say apart from a good distribution does not imply a good score in randomness tests. [13,5,10,9] was chosen for xoroshiro32++ as it had the best (lowest) prank+zrank+nzrank and it performed well in PractRand tests.
To be more clear, I have studied Evan's code for pfreq, zfreq & nzfreq, but my ability to read other's c code is poor and I have little use for or experience with Linux/gcc other than to run TestU01 and Seba's Hamming Weight Dependency (under Windows Subsystem for Linux). Therefore, I must somehow translate his work to mathematical / statistical fundamentals (using a standard reference book), then to Visual C++ .NET, and also expand it to cover the nuances of multiple streams (both sequential and/or randomly selected). I have heavily relied on statistical analysis packages (e.g. TestU01, PractRand, gjrand, RaBiGeTe and HWD) to avoid re-inventing the wheel. However, I have seen the value of implementing focused tests such as pfreq, etc., since it serves well in fast vetting and as a reality check... especially given that the extra streams make a mess out of the normal statistical package results.
So basic questions need answered:
How does pair_frequency(0) equate to what I have done so far with counting zerosnon-occurring values per stream, as I do not see the logical relationship?
Can you provide an example description of what pfreq(1), etc. is testing detailed enough so I can code it?
Note that the highest Chi-Square values for '^AD65,revbits' and most of its total are for streams 13-17. How do the simple variants perform for streams 1-12?
Since that was a single trial, the bubbles in streams 13-17 should be just a normal statistical variance. I will run 16+ randomly selected stream sequences for that, xoroshiro128**, and simple variants, like I did for high16=low16.
However, I need to look into automated seeding under .NET (probably using Crypto APIs, since my VB6 AMLS bit fountain would take a lot of work to translate)... that will take some time to come together.
To be more clear, I have studied Evan's code for pfreq, zfreq & nzfreq, but my ability to read other's c code is poor and I have little use for or experience with Linux/gcc other than to run TestU01 and Seba's Hamming Weight Dependency (under Windows Subsystem for Linux). Therefore, I must somehow translate his work to mathematical / statistical fundamentals (using a standard reference book), then to Visual C++ .NET, and also expand it to cover the nuances of multiple streams (both sequential and/or randomly selected). I have heavily relied on statistical analysis packages (e.g. TestU01, PractRand, gjrand, RaBiGeTe and HWD) to avoid re-inventing the wheel. However, I have seen the value of implementing focused tests such as pfreq, etc., since it serves well in fast vetting and as a reality check... especially given that the extra streams make a mess out of the normal statistical package results.
So basic questions need answered:
How does pair_frequency(0) equate to what I have done so far with counting zerosnon-occurring values per stream, as I do not see the logical relationship?
Can you provide an example description of what pfreq(1), etc. is testing detailed enough so I can code it?]
Evan translated my ideas into C. As I was concerned we were being overly reliant on PractRand, I came up with the pfreq distribution as an additional, reality-based test. Chris Doty-Humphrey, the author of PractRand, suggested also looking at zero and non-zero runs, which led to zfreq and nzfreq.
Some definitions. A pair is two successive 16-bit xoroshiro/xoroacc PRN outputs concatenated in one 32-bit value. Pair frequency(n) or pfreq(n) is the number of different pairs that occur exactly n times, thus pfreq(0) is the number of pairs that occur zero times and the answer to your first question as amended is the two are the same. The expected set of values of pfreq(n) for different n is the well-known binomial distribution.
In our case, 4GB of RAM is required to find the actual pfreq(n) values, configured as an array of 2^32 bytes, called pair, say. Zero the entire pair array, then iterate xoroshiro/xoroacc for 2^32-1 double-iterations. Use each 32-bit pair output as an index into the pair array and increment that particular byte value. Now we need another array, called pfreq, which is array of 256 longs. Zero pfreq first, then for n = 0 to 2^32-1, read byte at pair(n) and use that as index into pfreq and increment that long value. At the end, we will know the pfreq distribution in its entirety and most pfreq(n) will be zero. It is possible to calculate pfreq(0) without a pfreq array, by reading the pair array bytes before writing to check whether they are zero. If so, decrement a 32-bit counter, say pfreq0, which is initially zero.
So far, no mention of zfreq nor nzfreq. The pair array will have runs of consecutive zero or non-zero bytes, e.g. pair(0) to pair(10) might all be non-zero and pair(11) zero, in which case the non-zero run length is 11. When reading all of the pair array from pair(0) to pair(2^32-1) and writing to pfreq, it is quite easy to also record the zero and non-zero runs and write these to two other arrays of 256 longs called zfreq and nzfreq. The final run, zero or non-zero, is terminated by end of data, not a non-zero or zero value. So now we have three real-world distributions which we can compare against the expected values using Chi-Square goodness of fit tests. My maths is not good to prove the expected equations for zfreq and nzfreq, but results show them to be correct.
Evan created pfreq, zfreq & nzfreq distributions for all 84 full-length [a,b,c,d] xoroshiro32++ candidates, except for a handful of complete duds from earlier PractRand tests. I calculated Chi-Square values from the distribution data, ranking each of pfreq/zfreq/nzfreq separately to give prank/zrank/nzrank and adding these using equal weighting the lowest prank+zrank+nzrank was [13,5,10,9]. Its PractRand scores were very acceptable and therefore this [a,b,c,d] was chosen for the P2 rev B silicon.
Thanks for that explanation... it should be enough to move forward.
So far, I have some preliminary pfreq0 results from 14 (only, due to RAM limit, but should suffice) independent trials of xoroshiro128** high bits, ^AD65_RevBits, +1_RevBits and ^1_Revbits.
I will run a few more variants and post an expanded data set based on new code, once it is proofed to my satisfaction.
My initial preview of pfreq0 results indicates that there is indeed greater statistical bias in streams 13-18 with the variants tested so far.
Stream 13 is actually composited of single iterated streams 25 and 26, which might indicate an endemic artifact of the short protraction of equidistribution.
The way it occurred in XORO32 with the failure in PractRand at 512MB-1GB signaled a pending issue with double-iterated pfreq (if it had been carried beyond one double-iterated stream).
With XOROACC being based on that same underlying limit, the best that can be hoped for is to optimize the e xor mask to smooth the avalanche behavior.
For now, here is the pfreq0 data for ^AD65_Revbits (14 trials): Edit: Replaced table with ods attachment.
The explicit near-perfect statistical termination at stream 23 is not something shared in common with simpler variants.
I have compared two extended statistics with 'Xoroshiro128** High 32 bits': chi-squared sum of stream averages, and average of chi-squared sums.
^AD65_RevBits : 39.65 and 43.48
Xoroshiro128**: 1.89 and 19.43
The former value ideally should approach zero with more trials, and I can only guess that ~20 is very close to the ideal normal for the latter value.
My belief is that an optimal e constant may exist which would allow XOROACC to reach a lower limit of ~30 for both statistics while maintaining the stream 23 termination... a marginal improvement for the first statistic and a meaningful improvement for the last statistic.
'+1_Revbits' has just a little above the 30/30 that I propose... but terminates too soon, decidedly at stream 22. I had discarded it early on due the excessive PractRand failures... but still a viable option if one is concerned with only pfreq0 performance on just 1 double-iterated stream, as it out-performs '^AD65_RevBits' by a considerable margin on stream 1 pfreq0 chi-square of the average of 14 streams: 0.0068 vs. 0.42 (the former value nearly ideal, and the latter value being, I believe, much below 0.05 probability).
I have compared two extended statistics with 'Xoroshiro128** High 32 bits': chi-squared sum of stream averages, and average of chi-squared sums.
^AD65_RevBits : 39.65 and 43.48
Xoroshiro128**: 1.89 and 19.43
The former value ideally should approach zero with more trials, and I can only guess that ~20 is very close to the ideal normal for the latter value.
Extrapolating from what I have learned, I tried a few +1 variants.
Here is the best (which excels at PractRand, also):
(xoroshiro32pp ^ AD65h)_RevBits(xoroshiro32pp + 1): 19.51 and 22.88
What these pfreq0 statistics (ods file attached) imply is that an individual 23 stream trial is now similar to an individual xoroshiro128** (also attached) trial, but it only takes several trials to prove that that the per-stream biases are fixed to specific stream #s in the sequence.
I tried (xoroshiro32pp + 1)_RevBits(xoroshiro32pp + 1), and the results on PractRand and pfreq0 were not a step forward.
And yes, these hybrid variants somehow maintain the same equidistribution (which is quite extraordinary).
Tony, I think this type of simple hybrid variant is worth pursuing, but want your take on the implementation complexity (as this one adds carry implementation on one of the double iterations, in addition to the xor on the other).
Edited for clarity and to add attachments. Edit2: I am currently looking at (xoroshiro32pp ^ AD65h)_rotl16(xoroshiro32pp + 1, 1), which shows NO precursor issues in PractRand at 16GB... the only variant that has ever accomplished this! Edit3: (xoroshiro32pp ^ AD65h)_rotl16(xoroshiro32pp + 1, 1) is the best when completing 1 double-iterated stream, but decidedly completes pfreq0 two streams too early. I bet it would do well in BigCrush.
For now, here is the pfreq0 data for ^AD65_Revbits (14 trials): Edit: Replaced table with ods attachment.
How do initial state and initial acc differ for each of the 14 trials? Are you sure there is no overlap between any of them for at least 23 streams?
The main reason for using xoroacc is to get 2D equidistribution. I was concerned that outputting all possible outputs might take longer with our 'pseudo' 32-bit generator than with a 'proper' 32-bit or 64-bit one. However, tests have shown this concern to be completely misplaced. Taking one stream or even two fewer than the ideal is erring on the good side and I think we should not be too worried about 22 instead of a perfect 23.
Studying the results, pfreq0 in general is on the low side from stream 9 onwards and more clearly so from streams 13 to 18 or 19. There are ~500 fewer missing outputs after stream 13 compared to the ideal remainder of ~15000. Some trials have very low Chi-Squared errors for streams 1 to 9, say, during which more than 99.99% of the possible outputs occur.
Thus it seems likely that, by choosing good constants and the appropriate initial state and acc values, a single stream of 2^32-1 double-iterations could be extremely close to the ideal random distribution for a 32-bit output. I think further tests should look at the total pfreq, zfreq and nzfreq distributions for stream 1 and these results considered along with those for multi-stream pfreq0.
Re hardware implementation, there ought to be enough time for an addition of X+Y+1, but I think we should avoid too much complexity. I'm not sure what (xoroshiro32pp ^ AD65h)_rotl16(xoroshiro32pp + 1, 1) means because the two acc additions are only 16-bit.
How do initial state and initial acc differ for each of the 14 trials? Are you sure there is no overlap between any of them for at least 23 streams?
Non-deterministic random seeding, so there is no guarantee of no overlap, but basic birthday spacings logic applies, so (if I calculated correctly) there is about a 3.3% chance an overlap would occur.
Re Chi-Squared sums, it seems wrong to ignore an actual non-zero value that is expected to be zero. Of course, one cannot divide by zero, but perhaps the expected value could be changed to 1? In the end, though, this would affect the total only slightly.
I am considering the fractional expectation, so the first 0 (stream 23) is actually 0.44.
Taking one stream or even two fewer than the ideal is erring on the good side and I think we should not be too worried about 22 instead of a perfect 23.
Agreed, but it is very compelling to discover that the AD65h constant I previously found though great effort by other means also happens to satisfy the 23 streams so exactly.
Thus it seems likely that, by choosing good constants and the appropriate initial state and acc values, a single stream of 2^32-1 double-iterations could be extremely close to the ideal random distribution for a 32-bit output. I think further tests should look at the total pfreq, zfreq and nzfreq distributions for stream 1 and these results considered along with those for multi-stream pfreq0.
Agreed, but striking a good balance with meshing those results with results from PractRand/BigCrush/gjrand might be problematic, as the current evidence suggests.
However, a perhaps larger issue at the moment concerns how to gather total pfreq, zfreq, and nzfreq: i.e. I am currently averaging 14 of each x statistic (takes about 2.3 hours total), admittedly generated via a seeding method that will eventually cause collisions.
Re hardware implementation, there ought to be enough time for an addition of X+Y+1, but I think we should avoid too much complexity. I'm not sure what (xoroshiro32pp ^ AD65h)_rotl16(xoroshiro32pp + 1, 1) means because the two acc additions are only 16-bit.
My notation is poor (thus open to suggestion), but simply indicates that the first xoroshiro result is xor'd with AD65h (prior to accumulation), and the second result has 1 added, then a left rotation by 1 (prior to accumulation).
However, I don't think the extra complexity will be required. For instance, that particular example using the +1 with rotate looks horrible under BigCrush meta-analysis (often exceeding 3 StdDev). Think of the meta-analysis as performing a very similar function to pfreq, zfreq and nzfreq, except that it is meshing a few perhaps similar statistics, along with lots of others.
Limiting BigCrush to just 1 trial (instead of the meta-analysis 16 each of forward bits, reverse bits and byte reverse) is pointless, since most well-conceived variants I have tried pass it easily.
Thus it seems likely that, by choosing good constants and the appropriate initial state and acc values, a single stream of 2^32-1 double-iterations could be extremely close to the ideal random distribution for a 32-bit output. I think further tests should look at the total pfreq, zfreq and nzfreq distributions for stream 1 and these results considered along with those for multi-stream pfreq0.
Agreed, but striking a good balance with meshing those results with results from PractRand/BigCrush/gjrand might be problematic, as the current evidence suggests.
However, a perhaps larger issue at the moment concerns how to gather total pfreq, zfreq, and nzfreq: i.e. I am currently averaging 14 of each x statistic (takes about 2.3 hours total), admittedly generated via a seeding method that will eventually cause collisions.
I appreciate the time this all takes. pfreq, zfreq & nzfreq data could be generated at the same time. Perhaps you could test one of the 14 initially, the one with pfreq0 closest to the ideal? The fourth is 1580027099.
From testing xoroshiro32++, any pfreq0 that begins 15800... or 15799... could be considered excellent.
Re hardware implementation, there ought to be enough time for an addition of X+Y+1, but I think we should avoid too much complexity. I'm not sure what (xoroshiro32pp ^ AD65h)_rotl16(xoroshiro32pp + 1, 1) means because the two acc additions are only 16-bit.
My notation is poor (thus open to suggestion), but simply indicates that the first xoroshiro result is xor'd with AD65h (prior to accumulation), and the second result has 1 added, then a left rotation by 1 (prior to accumulation).
Notation understood now, it was the 16 in rotl16 that was confusing me.
I appreciate the time this all takes. pfreq, zfreq & nzfreq data could be generated at the same time. Perhaps you could test one of the 14 initially, the one with pfreq0 closest to the ideal? The fourth is 1580027099.
I was not recording the seeds, but had already started writing code to do so and add the additional statistics.
In any case, it takes the same amount of time to do 1 trial as it does 14 (with 15 perhaps possible).
I do not like the idea of hand picking data. However, for example, with electronic analytical balances, the old 3 of 5 algorithm suffices... so perform 5 independent trials and discard the high and low for each stream #, taking the average of the remaining 3. This approach might allow vetting (15/5) 3 candidate algorithms/constants simultaneously, for a total of ~30 per day (if fully automated).
Regarding other comments: '15800... or 15799' describes all of these... it is simply the perceived expectation of greater precision on stream #1 in particular, and > #1 as a bonus. I think it is within grasp, but that remains to be seen. I used the rotl16 notation, as other bit-size rotls existing in my code (so I have to consciously be careful in constructing logic in my head).
I appreciate the time this all takes. pfreq, zfreq & nzfreq data could be generated at the same time. Perhaps you could test one of the 14 initially, the one with pfreq0 closest to the ideal? The fourth is 1580027099.
I was not recording the seeds, but had already started writing code to do so and add the additional statistics.
In any case, it takes the same amount of time to do 1 trial as it does 14 (with 15 perhaps possible).
I do not like the idea of hand picking data. However, for example, with electronic analytical balances, the old 3 of 5 algorithm suffices... so perform 5 independent trials and discard the high and low for each stream #, taking the average of the remaining 3. This approach might allow vetting (15/5) 3 candidate algorithms/constants simultaneously, for a total of ~30 per day (if fully automated).
Regarding other comments: '15800... or 15799' describes all of these... it is simply the perceived expectation of greater precision on stream #1 in particular, and > #1 as a bonus. I think it is within grasp, but that remains to be seen. I used the rotl16 notation, as other bit-size rotls existing in my code (so I have to consciously be careful in constructing logic in my head).
If 14 trials are as easy as one, then please find the 14 pfreq/zfreq/nzfreq distributions for stream 1 and calculate the Chi-Squared values. It is possible that all 14 might be better than xoroshiro32++[13,5,10,9] which was 6th/5th/1st best for pfreq/zfreq/nzfreq individually and best overall.
... it is simply the perceived expectation of greater precision on stream #1 in particular, and > #1 as a bonus. I think it is within grasp, but that remains to be seen.
As I understand it, when implemented in hardware, reversing the bits is just as easy as not reversing, or as rotating, etc., as these are all topological equivalents. There are even more topological equivalents, such as '(rol(xoroshiro32pp, 1) ^ AD65h)_revbits(rol(xoroshiro32pp, 1))' and many more... dozens of easily expressed ones. And these are just waiting for discovery of any additional constants that surpass AD65h (which I need to get my other server working on again, now that I understand a bit more about the nature of that value).
Taken together, I am quite optimistic this concept should exceed any reasonable expectation (and already is far beyond my original expectations).
I am still working on formatting data in the file output, so I just manually put this pfreq(x) together from 5 random stream trials of: (rotl(xoroshiro32pp, 7) ^ AD65h)_(revbits(rotl(xoroshiro32pp, 7)))
The rotate left by 7 normalizes the highest entropy xoroshiro32++[13,5,10,9] bits to the ends (left for first iteration and right for second, due to revbits).
Edit: Fixed table to use floating point (instead of integer) values for calculation (which had minimal impact).
Tony, for the zfreq(0), how is 'mean run' calculated from it, or is it something different?
For example, I have a zfreq(0) from an arbitrary output file of 998774320, but an ideal 'mean run' from this table is 1.58197671.
I see the result is not used with the chi^2, so perhaps it is not important?
Tony, for the zfreq(0), how is 'mean run' calculated from it, or is it something different?
For example, I have a zfreq(0) from an arbitrary output file of 998774320, but an ideal 'mean run' from this table is 1.58197671.
I see the result is not used with the chi^2, so perhaps it is not important?
Chris, thanks for the pfreq results. It will be interesting to see the corresponding zfreq & nzfreq.
The following post defined mean zero run, see term (i) under Notes: http://forums.parallax.com/discussion/comment/1438409/#Comment_1438409
Mean non-zero run can be calculated in the same way and its expected value = e. I didn't know about Chi-Squared when I wrote this post and I used the incorrect |Actual-Expected|/Expected instead.
zfreq(0) and nzfreq(0) have no real meaning, so I thought of using their table entries for something else. The expected values were easy to deduce from a few results, which gives us another measure of quality or two. I thought at one time that mean run might be the most accurate, but I'm not so sure now because it is a ratio and errors could cancel out (or rather be divided out). zfreq(0)/nzfreq(0) could be used to store the number of zero/non-zero runs when generating the data and the two values will either be the same or differ by one.
P.S.
I think that mean run is important and Chi-Squared could be applied to it.
Also I now regard pfreq, zfreq & nzfreq as having equal worth.
It will be interesting to see the corresponding zfreq & nzfreq.
Thanks for the extra info. Sorry about skipping ahead abruptly in variants... I was curious to try and project limits to the possible randomness, and can back-fill stats for previous / simpler variants later (which may actually catch something I missed).
Here is the zfreq(x) for that same advanced rol7 variant (without addressing the meanrun):
Notice the 5th stream trial glitched at the very end... which only succeeded in demoting that particular stream sum to place it in the same class as 'scrorarns'.
Still manually constructing these tables... nzfreq will be a little bit harder.
Still manually constructing these tables... nzfreq will be a little bit harder.
The scro generator is excellent on pfreq, but your results show it can be beaten on zfreq. I'm not sure that averaging the 7x scores is allowed, though.
If it's too time consuming for you, I could use pfreq/zfreq/nzfreq binary data files (preferably named *.pb/*.zb/*.nzb) as inputs to a QuickBASIC program that generates tables. 256 longs per file, starting at freq(0), with zeroes for zfreq(0) and pfreq(0) if it is easier as total runs and mean run can be derived from freq(1+).
I'm not sure that averaging the 7x scores is allowed, though.
It is the only way I can think of to represent the generator as a whole, though I intend to use 14 (or 15) scores for the average.
To put the question to rest, we can look at scores for xoroshiro128** high 32 bits... I'll run them tomorrow and post the binaries for you to look at.
I was able to get Evan's code to work in .NET, with a few adjustments (e.g. string and monotonic clock related code changed for cross-compatibility... I will not call myself a C++ programmer, so still amazed I can get anything to work right across two platforms).
I'm not sure that averaging the 7x scores is allowed, though.
It is the only way I can think of to represent the generator as a whole, though I intend to use 14 (or 15) scores for the average.
Averaging the chi^2 total is allowable, I think, but not chi^2 for individual frequencies. Summing prank+zrank+nzrank would give a winner for 7A-7E, based on frequency distribution.
Comments
This will test 192GB of output out to ~12 double iterated streams, which I can normalize against the following table, which I would like you to verify: Assuming the above is correct, here is the results for 'n ^ AD65h/RevBits(n)', as a good bad example (data here is incorrect, will create a post with good data later): Edit: Fixed double-iteration formula, and also notice that AD65h and its bit reverse, A6B5h, are both prime (but not sure of implications).
Edit2: There was a correlation issue with my seeding/re-seeding, which somehow managed to make all of the above AD65h results closer to theoretical than they should be.
I will post correct results in response to Tony's request below.
The three variants tested (for each first iteration xor constant) were only different in how the second iteration value was processed: no-operation [NOP(xororshiro++)], rotate left 1 bit [ROL(xororshiro++,1)], and reverse bits [REVBITS(xororshiro++)]
Larger differences in the first several streams might be tolerable for a PRNG of much larger period (that did not have to satisfy equidistribution requirements in 65536 16GB streams), but (average of 16 run) differences greater than ~15 in the 6th stream or later would be an issue for any PRNG of larger period.
Edit: Keep in mind that the NOP results are not just obviously worse, but also possibly a side-effect indicated by the early GAP fail and FPF issues with PractRand.
Edit2: There is a small error in the expected zeros and diff results in the attachment. Streams 1-6 (only) are offset by 10, 8, 4, 2, 0 and 1, respectively, which only minimally impacts the comparison of results.
As the current emphasis is on the 64K duplicates, I've listed only the Stream 1 expected for All and the actual value for xoroshiro32++[13,5,10,9] = 1580109068, which is better than 99.99% accurate.
Next time I will just type 'Exp(1)', as reciting e is a dyslexic endeavor for me (and perhaps many others, I guess).
Luckily, the maximum error in my results is still small enough to accurately discern RevBits as the better choice when considering only those two constants.
I planned to write an automated search for least (perhaps weighted) differential sum of 12 (or 13 streams), considering only primes that have 7 to 9 bits set.
However, in order to establish automated vetting criteria (which will be used to expedite the process), I need to create some more reference data using a good PRNG (that is beyond reproach).
In comparison to all of the other variations looked at, this is the best so far.
Let me know if you have any questions about the data, or if any further information is required.
Additionally, a 'good' PRNG will have a much greater StdDev of trial stream differences, so is not very useful with XOROACC due to the relatively short protraction of the 16/17 bits of state extension.
Therefore, I have written a search for odd xor values that produce an absolute difference (average of 16 trials) sum of less than 80 (which is the current absolute difference sum for AD65h).
I will take the best of those and look at some basic PractRand and gjrand stats (since they are quick to produce), before considering deeper PractRand and BigCrush analysis.
Re the ideal vs. actual pfreq0 distributions for the 64K duplicates (both words of 32-bit value the same), I tested xoroshiro64++[26,9,13,17] just to see how it behaved and here are the results for a single run with initial state = 1:
Note that a "stream" for xoroshiro64++ is a period of 2^32-1 iterations to match that for xoroshiro32. (It makes no difference in this case if 2^32 is used instead.) I'm hoping that using the amended ideal values I posted your averaged results for xoroshiro32++a/pp will improve a little.
Being able to record all the outputs to see how many iterations it takes for each of them to occur at least once would be more accurate than the duplicates and it would also give us the distributions of zero and non-zero runs (zfreq and nzfreq). 512MB would be enough to record all 2^32 outputs at one bit per output, but I still couldn't do it because that's all the RAM there is in my current PC.
A Chi-Square sum of (Actual - Expected)^2 / Expected for streams 1-11 would tell us the "goodness of fit" to the ideal pfreq0 distribution, the lower the better.
EDIT:
Corrected Chi-Square value.
Edit: 5B69h is NOT PRIME, and by initial inspection of pfreq0, PractRand and gjrand 32GB looks better than AD65!
I can likely perform a full test in 64-bit mode, in order to quickly allocate a full count array. Byte should be fine, I guess.
Edit2: Code written (attached... I really dislike writing in C++, but I cannot argue with the performance) and some tests already done.
5B69h is no good, as it takes 27 streams to complete, instead of the expected 23. AD65h is still the top contender.
Now looking at some other prime/non-prime values of seemingly good merit (i.e. low chi-square sum).
Will post some data in the next several days.
Yet PractRand and gjrand agree that the simpler variants are significantly inferior to '^AD65h,RevBits' at 32GB (in multiple ways), which is only 2 streams.
The dozens of the other constants I have considered with any depth have not made it through the gauntlet of tests I have done without showing some obvious weakness.
Due to the vast amount of testing required to adequately explore more candidates, I should say that I believe that '^AD65h,RevBits' must be within the top 99.5%, and that any improvements would be statistically very small by comparison.
Can you provide a link or description that elaborates on pfreq(x), zfreq(x) and nzfreq(x) in addition to what you wrote here: https://forums.parallax.com/discussion/comment/1469463/#Comment_1469463?
That post is a concise description of the distributions and there isn't much more I can say apart from a good distribution does not imply a good score in randomness tests. [13,5,10,9] was chosen for xoroshiro32++ as it had the best (lowest) prank+zrank+nzrank and it performed well in PractRand tests.
XOROACC should be capable of slightly better pfreq, zfreq & nzfreq distributions than xoroshiro32++ [13,5,10,9] within a single stream. In other words, users wanting random 32-bit outputs that are a bit closer to the ideal could use XOROACC for 2^32-1 double-iterations instead of XORO32. Have you compared stream 1 Chi-Square pfreq/zfreq/nzfreq values for '^AD65,revbits' with the simple variants?
Chi-Square for multiple streams is something new and I need to think whether zfreq and nzfreq are still relevant and if so how. Maybe something else should be measured, such as the interval in double-iterations between outputs that have not occurred before. Note that the highest Chi-Square values for '^AD65,revbits' and most of its total are for streams 13-17. How do the simple variants perform for streams 1-12?
So basic questions need answered:
How does pair_frequency(0) equate to what I have done so far with counting zeros non-occurring values per stream, as I do not see the logical relationship?
Can you provide an example description of what pfreq(1), etc. is testing detailed enough so I can code it? Since that was a single trial, the bubbles in streams 13-17 should be just a normal statistical variance. I will run 16+ randomly selected stream sequences for that, xoroshiro128**, and simple variants, like I did for high16=low16.
However, I need to look into automated seeding under .NET (probably using Crypto APIs, since my VB6 AMLS bit fountain would take a lot of work to translate)... that will take some time to come together.
Edited to correct ambiguity, in bold.
Evan translated my ideas into C. As I was concerned we were being overly reliant on PractRand, I came up with the pfreq distribution as an additional, reality-based test. Chris Doty-Humphrey, the author of PractRand, suggested also looking at zero and non-zero runs, which led to zfreq and nzfreq.
Some definitions. A pair is two successive 16-bit xoroshiro/xoroacc PRN outputs concatenated in one 32-bit value. Pair frequency(n) or pfreq(n) is the number of different pairs that occur exactly n times, thus pfreq(0) is the number of pairs that occur zero times and the answer to your first question as amended is the two are the same. The expected set of values of pfreq(n) for different n is the well-known binomial distribution.
In our case, 4GB of RAM is required to find the actual pfreq(n) values, configured as an array of 2^32 bytes, called pair, say. Zero the entire pair array, then iterate xoroshiro/xoroacc for 2^32-1 double-iterations. Use each 32-bit pair output as an index into the pair array and increment that particular byte value. Now we need another array, called pfreq, which is array of 256 longs. Zero pfreq first, then for n = 0 to 2^32-1, read byte at pair(n) and use that as index into pfreq and increment that long value. At the end, we will know the pfreq distribution in its entirety and most pfreq(n) will be zero. It is possible to calculate pfreq(0) without a pfreq array, by reading the pair array bytes before writing to check whether they are zero. If so, decrement a 32-bit counter, say pfreq0, which is initially zero.
So far, no mention of zfreq nor nzfreq. The pair array will have runs of consecutive zero or non-zero bytes, e.g. pair(0) to pair(10) might all be non-zero and pair(11) zero, in which case the non-zero run length is 11. When reading all of the pair array from pair(0) to pair(2^32-1) and writing to pfreq, it is quite easy to also record the zero and non-zero runs and write these to two other arrays of 256 longs called zfreq and nzfreq. The final run, zero or non-zero, is terminated by end of data, not a non-zero or zero value. So now we have three real-world distributions which we can compare against the expected values using Chi-Square goodness of fit tests. My maths is not good to prove the expected equations for zfreq and nzfreq, but results show them to be correct.
Evan created pfreq, zfreq & nzfreq distributions for all 84 full-length [a,b,c,d] xoroshiro32++ candidates, except for a handful of complete duds from earlier PractRand tests. I calculated Chi-Square values from the distribution data, ranking each of pfreq/zfreq/nzfreq separately to give prank/zrank/nzrank and adding these using equal weighting the lowest prank+zrank+nzrank was [13,5,10,9]. Its PractRand scores were very acceptable and therefore this [a,b,c,d] was chosen for the P2 rev B silicon.
So far, I have some preliminary pfreq0 results from 14 (only, due to RAM limit, but should suffice) independent trials of xoroshiro128** high bits, ^AD65_RevBits, +1_RevBits and ^1_Revbits.
I will run a few more variants and post an expanded data set based on new code, once it is proofed to my satisfaction.
My initial preview of pfreq0 results indicates that there is indeed greater statistical bias in streams 13-18 with the variants tested so far.
Stream 13 is actually composited of single iterated streams 25 and 26, which might indicate an endemic artifact of the short protraction of equidistribution.
The way it occurred in XORO32 with the failure in PractRand at 512MB-1GB signaled a pending issue with double-iterated pfreq (if it had been carried beyond one double-iterated stream).
With XOROACC being based on that same underlying limit, the best that can be hoped for is to optimize the e xor mask to smooth the avalanche behavior.
For now, here is the pfreq0 data for ^AD65_Revbits (14 trials):
Edit: Replaced table with ods attachment.
The explicit near-perfect statistical termination at stream 23 is not something shared in common with simpler variants.
I have compared two extended statistics with 'Xoroshiro128** High 32 bits': chi-squared sum of stream averages, and average of chi-squared sums.
^AD65_RevBits : 39.65 and 43.48
Xoroshiro128**: 1.89 and 19.43
The former value ideally should approach zero with more trials, and I can only guess that ~20 is very close to the ideal normal for the latter value.
My belief is that an optimal e constant may exist which would allow XOROACC to reach a lower limit of ~30 for both statistics while maintaining the stream 23 termination... a marginal improvement for the first statistic and a meaningful improvement for the last statistic.
'+1_Revbits' has just a little above the 30/30 that I propose... but terminates too soon, decidedly at stream 22. I had discarded it early on due the excessive PractRand failures... but still a viable option if one is concerned with only pfreq0 performance on just 1 double-iterated stream, as it out-performs '^AD65_RevBits' by a considerable margin on stream 1 pfreq0 chi-square of the average of 14 streams: 0.0068 vs. 0.42 (the former value nearly ideal, and the latter value being, I believe, much below 0.05 probability).
Here is the best (which excels at PractRand, also):
(xoroshiro32pp ^ AD65h)_RevBits(xoroshiro32pp + 1): 19.51 and 22.88
What these pfreq0 statistics (ods file attached) imply is that an individual 23 stream trial is now similar to an individual xoroshiro128** (also attached) trial, but it only takes several trials to prove that that the per-stream biases are fixed to specific stream #s in the sequence.
I tried (xoroshiro32pp + 1)_RevBits(xoroshiro32pp + 1), and the results on PractRand and pfreq0 were not a step forward.
And yes, these hybrid variants somehow maintain the same equidistribution (which is quite extraordinary).
Tony, I think this type of simple hybrid variant is worth pursuing, but want your take on the implementation complexity (as this one adds carry implementation on one of the double iterations, in addition to the xor on the other).
Edited for clarity and to add attachments.
Edit2: I am currently looking at (xoroshiro32pp ^ AD65h)_rotl16(xoroshiro32pp + 1, 1), which shows NO precursor issues in PractRand at 16GB... the only variant that has ever accomplished this!
Edit3: (xoroshiro32pp ^ AD65h)_rotl16(xoroshiro32pp + 1, 1) is the best when completing 1 double-iterated stream, but decidedly completes pfreq0 two streams too early. I bet it would do well in BigCrush.
How do initial state and initial acc differ for each of the 14 trials? Are you sure there is no overlap between any of them for at least 23 streams?
The main reason for using xoroacc is to get 2D equidistribution. I was concerned that outputting all possible outputs might take longer with our 'pseudo' 32-bit generator than with a 'proper' 32-bit or 64-bit one. However, tests have shown this concern to be completely misplaced. Taking one stream or even two fewer than the ideal is erring on the good side and I think we should not be too worried about 22 instead of a perfect 23.
Studying the results, pfreq0 in general is on the low side from stream 9 onwards and more clearly so from streams 13 to 18 or 19. There are ~500 fewer missing outputs after stream 13 compared to the ideal remainder of ~15000. Some trials have very low Chi-Squared errors for streams 1 to 9, say, during which more than 99.99% of the possible outputs occur.
Thus it seems likely that, by choosing good constants and the appropriate initial state and acc values, a single stream of 2^32-1 double-iterations could be extremely close to the ideal random distribution for a 32-bit output. I think further tests should look at the total pfreq, zfreq and nzfreq distributions for stream 1 and these results considered along with those for multi-stream pfreq0.
Re hardware implementation, there ought to be enough time for an addition of X+Y+1, but I think we should avoid too much complexity. I'm not sure what (xoroshiro32pp ^ AD65h)_rotl16(xoroshiro32pp + 1, 1) means because the two acc additions are only 16-bit.
However, a perhaps larger issue at the moment concerns how to gather total pfreq, zfreq, and nzfreq: i.e. I am currently averaging 14 of each x statistic (takes about 2.3 hours total), admittedly generated via a seeding method that will eventually cause collisions. My notation is poor (thus open to suggestion), but simply indicates that the first xoroshiro result is xor'd with AD65h (prior to accumulation), and the second result has 1 added, then a left rotation by 1 (prior to accumulation).
However, I don't think the extra complexity will be required. For instance, that particular example using the +1 with rotate looks horrible under BigCrush meta-analysis (often exceeding 3 StdDev). Think of the meta-analysis as performing a very similar function to pfreq, zfreq and nzfreq, except that it is meshing a few perhaps similar statistics, along with lots of others.
Limiting BigCrush to just 1 trial (instead of the meta-analysis 16 each of forward bits, reverse bits and byte reverse) is pointless, since most well-conceived variants I have tried pass it easily.
From testing xoroshiro32++, any pfreq0 that begins 15800... or 15799... could be considered excellent.
Notation understood now, it was the 16 in rotl16 that was confusing me.
In any case, it takes the same amount of time to do 1 trial as it does 14 (with 15 perhaps possible).
I do not like the idea of hand picking data. However, for example, with electronic analytical balances, the old 3 of 5 algorithm suffices... so perform 5 independent trials and discard the high and low for each stream #, taking the average of the remaining 3. This approach might allow vetting (15/5) 3 candidate algorithms/constants simultaneously, for a total of ~30 per day (if fully automated).
Regarding other comments: '15800... or 15799' describes all of these... it is simply the perceived expectation of greater precision on stream #1 in particular, and > #1 as a bonus. I think it is within grasp, but that remains to be seen. I used the rotl16 notation, as other bit-size rotls existing in my code (so I have to consciously be careful in constructing logic in my head).
If 14 trials are as easy as one, then please find the 14 pfreq/zfreq/nzfreq distributions for stream 1 and calculate the Chi-Squared values. It is possible that all 14 might be better than xoroshiro32++[13,5,10,9] which was 6th/5th/1st best for pfreq/zfreq/nzfreq individually and best overall.
Taken together, I am quite optimistic this concept should exceed any reasonable expectation (and already is far beyond my original expectations). Will do... still reviewing Evan's code for details on relevant C++ constructs related to managing the forthcoming flood of data.
I may be out of touch for a little bit... too many balls in the air this week, but rest assured I am working on it.
Edit: Fixed table to use floating point (instead of integer) values for calculation (which had minimal impact).
For example, I have a zfreq(0) from an arbitrary output file of 998774320, but an ideal 'mean run' from this table is 1.58197671.
I see the result is not used with the chi^2, so perhaps it is not important?
Chris, thanks for the pfreq results. It will be interesting to see the corresponding zfreq & nzfreq.
The following post defined mean zero run, see term (i) under Notes:
http://forums.parallax.com/discussion/comment/1438409/#Comment_1438409
Mean non-zero run can be calculated in the same way and its expected value = e. I didn't know about Chi-Squared when I wrote this post and I used the incorrect |Actual-Expected|/Expected instead.
zfreq(0) and nzfreq(0) have no real meaning, so I thought of using their table entries for something else. The expected values were easy to deduce from a few results, which gives us another measure of quality or two. I thought at one time that mean run might be the most accurate, but I'm not so sure now because it is a ratio and errors could cancel out (or rather be divided out). zfreq(0)/nzfreq(0) could be used to store the number of zero/non-zero runs when generating the data and the two values will either be the same or differ by one.
P.S.
I think that mean run is important and Chi-Squared could be applied to it.
Also I now regard pfreq, zfreq & nzfreq as having equal worth.
Here is the zfreq(x) for that same advanced rol7 variant (without addressing the meanrun): Notice the 5th stream trial glitched at the very end... which only succeeded in demoting that particular stream sum to place it in the same class as 'scrorarns'.
Still manually constructing these tables... nzfreq will be a little bit harder.
The scro generator is excellent on pfreq, but your results show it can be beaten on zfreq. I'm not sure that averaging the 7x scores is allowed, though.
If it's too time consuming for you, I could use pfreq/zfreq/nzfreq binary data files (preferably named *.pb/*.zb/*.nzb) as inputs to a QuickBASIC program that generates tables. 256 longs per file, starting at freq(0), with zeroes for zfreq(0) and pfreq(0) if it is easier as total runs and mean run can be derived from freq(1+).
To put the question to rest, we can look at scores for xoroshiro128** high 32 bits... I'll run them tomorrow and post the binaries for you to look at.
I was able to get Evan's code to work in .NET, with a few adjustments (e.g. string and monotonic clock related code changed for cross-compatibility... I will not call myself a C++ programmer, so still amazed I can get anything to work right across two platforms).
Averaging the chi^2 total is allowable, I think, but not chi^2 for individual frequencies. Summing prank+zrank+nzrank would give a winner for 7A-7E, based on frequency distribution.