Welcome to the Parallax Discussion Forums, sign-up to participate.

- 101.5K All Categories
- 812 Announcements
- 51 Propeller Code
- 22 PASM2/Spin2 (P2)
- 4 PASM/Spin (P1)
- 14 BASIC (for Propeller)
- 61 Forth
- 10 C/C++
- 2.8K Propeller 2
- 27.6K Propeller 1
- 18.9K BASIC Stamp
- 9 micro:bit
- 21.1K General Discussion
- 2K Learn with BlocklyProp
- 8.2K Robotics
- 124 Customer Projects
- 3.3K Accessories

## Comments

12,202One case on 8-cores with -multithread option: 59 secs

Eight parallel cases on 8-cores with -multithread option: 80 secs

One case without -multithread option: 99 secs

Eight parallel cases on 8-cores without -multithread option: 97 secs

The later two are same within error margin. That's good to know.

304My dual CPU workstation is running E5-2690s, which I picked up for $600 last year.

It came with 64GB RAM (whereas my W3690 only has 12GB), but I don't have any jobs that require that much right now.

304I was hoping for something closer to 100%.

I believe the pair frequency might be overly weighing on the ranking (perhaps culling the list of 20-60 overall rank), as anything below about 500 for pfreq, zfreq and nzfreq seems plausible.

Regardless of the above, I am still curious if my streaming mod with a single M and using only a single candidate (e.g. 4,8,5,7 is great for s0 * 5 ) will drive pfreq, zfreq and nzfreq all well below 100 simultaneously?

I am very reasonably sure from my own testing that it will not substantially affect the PractRand scores on any aperture, but will cause the pfreq, zfreq and nzfreq to wander (presumably with some M, all to very near 0 total).

The actual mod I am currently using would look like this, if refactored slightly, to affect only the last line of code with C=5: Of course the missing value would swing to a non-zero, which requires an E subtraction (or subtraction from E, if that would be better) in the scrambler, which would have to be performed prior to analyzing for pfreq, zfreq and nzfreq.

However, running a few dozen random M without applying E would demonstrate the viability of the concept if a few of those M would rank at #1 above all previous.

Hopefully that all makes sense...

12,202EDIT: It would likely be twice that fast faster (7.5 seconds per score) because the good averages are barely hitting exponent of 29 (512 MB).

12,20212,202304So if you try several random values of M with the above code mod patched in, see how these statistics from Tony are affected when redone for each M: I suspect that you will quickly find an M that is superior, and if that is the case, then we could worry about repeating with an E present (which would change the stats once again toward perhaps a different M).

The end goal would be to force 4,8,5,7,E,M to be perfect, e.g. rank 1,1,1 for pf, zf and nzf (assuming that 4,8,5,7 is the best target to do this). Yes, but initially no E, since it is too much trouble to calculate it for a proof of concept run (though it would be required later and would again change the choice of M, perhaps by exhaustive search).

12,202Depth one allows a single adder summing: Xoroshiro+.

Depth two allows three adder summing: Xoroshiro++, Xoroshiron+e+, Xoroshiro+x+.

Depth three would allow up to seven adders! Hardware speed limitations would need looked at but I presume this does open up a lot more options for more complex scramblers.

12,202I'm pretty certain the only reason we got away with depth two was because the register size is only 16 bit. I doubt that leeway will extend to three levels.

The regular way to go more complex is take more clocks about it. It sounds like Chip was saying it's a pain to make timing exceptions for lots of multi-clock paths so will usually go with multi-stage instead. Which requires intermediate staging registers to be added.

3041. Drop E from the code, which simplifies the search for M.

2. If no M values result in pfreq/zfreq/nzfreq simultaneously significantly closer to zero, then abort the idea of using M completely and fall back to original path of considering s0 * 5.

3. If a superior M<>0 is found, calculate E and document it.

4. If the end-user magically* requires that the value that is short by one output is 0, then they can subtract or xor the documented value of E with both halves of XORO32 in their own code.

*'Magically' implies that it would be rare that a give application cares which value is short (only 65535 of that value in the full period), and those few that do care (if they exist?) might likely be just as unhappy about a missing zero. Therefore, the adaptive approach would be to write application code (that cares about such things) to either throw out zeros, or to xor a counter with the output and increment the counter once every full period so that all values are represented equally. In the latter case, it does not matter which value is short by one.

**Edit: Thinking about the 'non-option' of actually coding E a little deeper... since xor also works, inverting the set bits of E on the output read of XORO32 would have the same effect as subtraction/xor in the code... just requires an average of about 16 inverters (one for each set bit of E * 2 words) on the 32-bits of output stage... thus virtually tiny real estate compared to another adder and no clocking required. I hope that makes sense.

12,20212,202304For example, if the short output is 0xAAAA, then E=0xAAAA and thus: If Output = E then (Output - E) = (Output ^ E) = 0, and if Output <> E then 'Don't Care'.

The PractRand results will not be impacted by the choice of subtraction/xor, nor should the pfreq, but zfreq and nzfreq would be. Thus xor E would ultimately affect the specific choice of M.

Too tired... must sleep to get up in 5 hours.

12,2023041. Refactoring the M into the last line of code might have a (hopefully minor) consequence, as the first pass of s1 to the output value and engine would not have M applied. I need to study this more to see what the actual consequence is. I avoided any potential issue in my own code by performing up front. I might have to refactor a test down to 16-bit state to study this. Edit: Non-issue... s[1] = rotl(s1 ^ M, C) works perfect. I just couldn't make sense of it in my head, so had to run full simulations to ensure everything was correct.

2. I noticed that the original xoroshiro32pp full period has exactly two occurrences where certain 16-bit values repeat 3 times in a row (i.e. two triplets). Statistically this should only happen once, but either zero or twice is plausible. What is interesting is that M can be specifically chosen to set the occurrence of triplets to zero (or one), if desired. However, since two 16-bit outputs are combined into XORO32, it would not be obvious that either the 1st or the 3rd 16-bit output occurrence in a row is present in the adjacent 32-bit output.

3041. s0 * 5 would net approximately 70% better randomness, perhaps more with some other untested ABCD. Seemingly proven.

2. s0 * 5 might fit in the pipeline (barely). Possibly sound.

3. Use of M xor'd in the final stage will fit in the pipeline. Seems sound.

4. Use of M with near 8 bits set, will erase all traces of sparse set-bit state recovery effect on the output. Has been demonstrated.

5. Specific choices of M should manipulate pfreq, zfreq and nzfreq to superior target goals without compromising PractRand performance. Current evidences supports this, but remains to be fully demonstrated.

6. M affects triplet generation and could be used to cancel it entirely, if desired (i.e. xoroshiro32pp 13,5,10,9 has two sets of triplets). Has been demonstrated.

7. E could possibly be factored into the final output using Not gates to assign the missing output value back to zero, with minimal impact to design efficiency. Logically sound.

8. The use of E would affect zfreq and nzfreq, but not PractRand results, so would be required for final tuning. Logically sound with freqs, but needs to be formally demonstrated. Has been demonstrated with PractRand.

9. Ultimately, the specific choice of M would be used to calculate/discover the bad-seed-value to xor mask against the normal non-zero seed (or as I do in 128-bit version: seed, pump, xor bit 0 with a 1 if new state = old state). Presumably implementable.

I hope I didn't miss anything.

12,202What approach am I to take for this? Do I need to be running exhaustive, every M, distribution runs on preferred candidates?

Isn't that E's job?

1,8081,808The equations for the expected distributions are

pfreq(x) = 1/x! * N * 1/ezfreq(x) = (1-1/e)^2 * N * (1/e)^xnzfreq(x) = (1/e)^2 * N * (1-1/e)^xThe first is the well-known binomial. The second and third have a very nice symmetry and although I cannot derive them mathematically the results show that they are correct. N = 2^32 in this case.

The equation for "goodness of fit" is

Chi-Square(x) = [Actual(x)-Expected(x)]^2/Expected(x)The lower the value, the closer the distribution is to that expected. I sum Chi-Square(x) for all x for which the nearest integer to the expected value is non-zero. The nearest integer values are listed at the end of this post.

For pfreq, sum Chi-Square values for x = 0 to x = 12 incl.

For zfreq, sum Chi-Square values for x = 1 to x = 21 incl.

For nzfreq, sum Chi-Square values for x = 1 to x = 45 incl.

Defining prank/zrank/nzrank as the ranking of pfreq/zfreq/nzfreq for all full-length candidates, a ranking of 1 is the best and the overall ranking is prank+zrank+nzrank, thus the lowest overall is considered to be the best. However, one can argue that pfreq is more important than zfreq or nzfreq as the pfreq distribution is identical to the XORO32 output distribution. Having said that, the scro generator beats any xoroshiro so far tested for all three distributions, but is too large to be implemented and also has no equidistribution.

zfreq(0) and nzfreq(0) are not used for Chi-Square as they have no real meaning and instead could hold the total number of zero and non-zero runs. Note that a zero or non-zero run is terminated by a '1' or '0' and by the end of the 4GB of data.

pfreq(x)zfreq(x)nzfreq(x)304If you can calculate E, then use it. Otherwise, M alone is just a quick-check to see if the freq diffs are driven like I think (hopefully some toward zero).

E only corrects the output back to a missing zero. 'Bad-seed' prevents the state from starting in a single cycle loop... not sure how the XORO32 seeder works.

1,808XORO32 D reads D as the current state, iterates it twice and writes it back to D. The two 16-bit outputs are concatenated and the 32-bit result put into the source field of the next instruction (low word is first output, high word second). There is nothing in hardware to detect and change an invalid state of all zeroes.

The documentation should be modified to say the algorithm is now xoroshiro32++ and D must be > 0.

1,8083041. Modify the documentation to say that D (as in XORO32 D) must be <> X, where X is the 'bad-seed-value' resulting from a specific choice of ABCM. That seems somewhat cumbersome (as compared to when X=0).

2. Keep the documentation to say 'D must be > 0' and magically not/xor the D bits with the set bits of X (aka 'bad-seed-value') in a way that does not interfere with the normal iteration process, but only affects seeding. I have no idea if this is possible.

3. Change the documentation to read something like 'D must be xor'd with 1 (or other non-zero value) and reseeded if D(i)=D(i+1). This would also work with the current xoroshiro32++, though is somewhat cryptic.

4. Abandon the pursuit of using M for the purpose of actually attempting to perfect the modified xoroshiro32++ implementation, but perhaps continue the pursuit of M for academic/future use.

12,202304Edit: I looked at a few candidates using Seba's new scrambler, but the PractRand results were not good. Either there are some good candidates to be found by a much more exhaustive search (perhaps by allowing D not exclusively 7), or the scrambler is overly symmetrical as-is, lacking a second occurrence of s0. The latter comment is based on my understanding of Melissa's small-state plotting of the xoroshiro state showing a cris-cross pattern, which I don't think can be cancelled sufficiently by a simple bivariate convolution.

3041. What are the raw count values based on for each (e.g. for pairs, is it word pairs, byte pairs, bit pairs)?

2. Why terminating z / nz runs at end of 4GB of data (as opposed to the 8GB of a single period or 16GB of a double period)?

12,202For speed optimise, it would be: Resolving to "x" is a full two levels. Which means there is flexibility to freely adjust one of the shifter parameters to split the x5 into separate multipliers. Same could be done to fully use the third level but it costs more transistors of course.

12,2021,808OK, I thought it was worth mentioning but I don't think we should spend too much time on PRN = ((s0 + s1) * 5) rotl 7) * 9 as this ** scrambler is intended for s0 only when there are at least 128 bits of state (see the paper for more details).

1,808I don't know whether the pair distributions (pfreq) are identical because I don't have 4GB of RAM to test them. Also I haven't looked at (s0 - s1) yet.