Here's a QUOTE: "Remember that we're going for an all-core overclock and that means a lower clock frequency than the highest Turbo bin offers. What you need to do:
Enable and start at 4200 MHz (42 Multiplier)
Apply 1.40V to the CPU (or simply leave it at auto)
Work your way upwards from there until the system becomes unstable and then back down at least 100 MHz."

His first wrong assumption is that a multiplier beyond max boost isn't achievable.
Second is he started from 1.40 volts!
And ultimately, the cooler likely wasn't up to the job. No mention of changing it, which presumably means he used the AMD supplied cooler "we use the Wraith stock AMD cooler" and thermal transfer paste.

Hmm, the dude doing this review hasn't really tried to lower the voltage...

I agree... bottom-up on voltage / clock (with better than stock cooler) would be best practice for these, it seems, based on the results of extreme-underclocking here.
Apparently clock-stretching is the wild card that complicates under/over-clocking.

Huh, intriguing, I wonder if that is new to the 3k parts, or that I've also been benefiting from it. I've not tried comparing my own benchmarks with different voltages. In fact, I've not tried lowing the voltage much below default. Of course, the default for the 3k series will be lower than the default voltage for my 1700X so I don't know how much lower the 1.00 volts is that they talk about ...

I'm not seeing any speed variations here. Although, Cinebench is far from consistent in its scoring. Have to run it a number of times to decide if a reliable score has occurred or not. The power/current does follow the expected gradient. While running Cinebench R15 at 4 GHz, system power went from a high of 265 W at 1.5 volts, down to 186 W at 1.287 volts.

Under load, the 1700X fails to a reset/lockup when the voltage is too low. At my set 4 GHz clock, below 1.300 core volts I had to reduce the temperature for max RPM on the fan to keep it stable. I was able to go down to 1.287 volts. All this would seem to support the idea that the 3k series has the clock-stretching (or possibly skipping) as a new feature.

PS: System power being 1700X CPU, B350 mobo, GTX960 GPU, 1x SSD, 1x HDD, 1x ODD, gold rated power supply, mouse, keyboard, USB extender, card reader, and some cooling fans.

PPS: I've worked out that it isn't Cinebench's fault for the inconsistent scores. I just have to wait longer after each reboot for the OS to settle down and stop running its little background jobs. Conveniently, I can see the transition on the power meter with idle system power always at 55 watts.

I've just done some more experimenting with the minimum voltage and using Cinebench to load test and verify constant processing speed. Just by changing the multiplier to 3800 MHz I could go down to 1.175 volts on the CPU core voltage. And remember this is with a 14 nm first gen part that has a default of 1.35 volts for 3400 MHz. Not that I would rely on it that close to the edge.

I had another go at 4100 MHz, which had proved too unstable historically, and found that indeed it is creating too much heat for my S40 cooler. With all cores going flat out, the temperature rises to needing more volts for stability which in turn heats higher. There's no sweet spot.

Now I am understanding more about some of the conflicting information I have read... if the write-up here is accurate.
Excerpt: "... if you want to overclock a single core inside a CCX, the second core must run at a 1 GHz difference, meaning that if one core is OC'd to 4.5 GHz, the second core must run at 3.5 GHz. Such design is to be blamed on CPU's internal clock divider..."
No worries... Shamino to the rescue with a new version of Work Tool.
EDIT: Required reading (and some scary stuff in one of the links before using Work Tool).

On another note, that "data fabric" in the I/O die is huge. So far I've not seen a single official comment on what is in there. I've seen one off-hand comment by a journo speculating it could be DRAM.

I see Seba has formally posted code links for larger state xoroshiro++ and xoshiro++ on his PRNG Shootout page.
I have already tested many of the xoroshiro++ constants at 128-bit, but wasn't perfectly satisfied with the ones I looked at via meta-analysis.
It is just like the search for 32-bit state constants, but without any easy ability to exhaustively test.

I am in the process of re-fitting my BigCrush meta-analysis engine with mathematical improvements (e.g. I noticed I had been using the sample average as part of the StdDev calculation, where the known population average of 0.5 seems more appropriate and is reasonably mathematically scalable for n > 2, and almost perfectly so for n > ~30) to better explore this.

I don't know 'when' about the decision itself, but the announcement was August 1st.
On the whole I think it is overdue, as I believe large state ++ (with correct choice of constants) has the potential to obsolete most other PRNGs for most common tasks (that do not require a CSPRNG).

The only real failing (in very specific use-case scenarios) of a good ++ is sparse-state recovery (i.e., lag in propagating a single state bit change to other bits), which my +*+ w/mask idea addresses fairly well.
Some might argue that lack of multidimensional equidistribution is a 'failing' with + (or ++), but I don't think that argument carries much weight for the vast majority of uses, and even less so at larger state sizes.

Evan,
I gather you’re trying to overclock you pc to get actual results from the png. If so, could you use the free GPU/TPU from google? There is a version that runs python and you can get about 10 hr time blocks at a time although you can get kicked off for paying customers. Just a thought.

The PRNG testing hasn't been touched for months. I've been doing streamer and smartpin testing of late.

I guess the reason I bring up the Ryzen in this topic is because I've done all my significant PRNG testing using the original 8-core product from early 2017 and have mentioned more than once how much of an upgrade it was from the dual-core. That and Chris has shown interest in adding to his extensive collection of PCs.

I see Seba has formally posted code links for larger state xoroshiro++ and xoshiro++ on his PRNG Shootout page.

When did they decide to put ++ in there?

I don't know 'when' about the decision itself, but the announcement was August 1st.
On the whole I think it is overdue, as I believe large state ++ (with correct choice of constants) has the potential to obsolete most other PRNGs for most common tasks (that do not require a CSPRNG).

The new version of the paper has some differences from the original and is worth downloading (however I suggest keeping a copy of the old one). The ++ scrambler section is now section 10.6, not 10.7. Our constant d is called r in the paper, but we started using [a,b,c,d] long before the paper was published.

Seba and David now suggest d aka r values of 5 and 11 for w = 16 (32-bit state, 16-bit output as used by XORO32). What's quite amusing is that Seba knows that we've changed from [14,2,7,5] to [13,5,10,9] and the former is mentioned in the first paper whereas the latter is not in the second presumably because it conflicts with their new advice! As mentioned, though, test results are what really matter. Also the double-iteration in XORO32 is a unique feature of the P2 and others would use a 64-bit or larger state to get a 32-bit PRN.

I think the amended paper still gives the (misleading) impression that there is not a lot to choose between + and ++ on quality. Perhaps it's hard to tell the difference with states > 32-bit, but our tests show that ++ is much better. + is faster and easier to analyse theoretically, but if there is time to do ++ I can't see any reason to do + instead.

Regarding the PRNG shootout, I don't understand how a footprint of 1068 bits arises when the text says it is always padded to a multiple of 64.

That and Chris has shown interest in adding to his extensive collection of PCs.

Interest is all I have at the moment... choosing between paying to send a kid through college and buying one of the new EPYC processors announced yesterday is easy for me. But, others don't share my appreciation for awe-inspiring processors.

Seba and David now suggest d aka r values of 5 and 11 for w = 16...

My test results at various bit-widths suggest a gravitation toward certain D (aka R) constants in general, but indeed the specific ABC choice decides whether any given D will create excess bit correlations.

Perhaps it's hard to tell the difference with states > 32-bit.

I believe I can tell the difference with my meta-analysis (and it is a given that I would need to intentionally ignore the obvious low-bit binary matrix and linear complexity issues with +), but it is difficult to wrap up all the details to publish credibly while demonstrating proof. Part of the proof issue has to do with the statistics of the most important values, which are located on the outside of my cloud model... those points are non-normally distributed, thus the uninitiated would attempt to apply their knowledge of normally distributed statistics to my results. To avoid this conundrum, it takes at least 1000 BigCrush results to come close enough to the Central Limit Theorem expectation where the normal and non-normal merge well (in this case) and actual biases in 'so called' good PRNGs begin to emerge. At that point we begin to see a correlation between the most biased statistical tests and the bit positions they are testing, or other patterns that emerge (which are not present in AES or other simulated TRNG data).

Regarding the PRNG shootout, I don't understand how a footprint of 1068 bits arises when the text says it is always padded to a multiple of 64.

I don't understand either. Perhaps they meant 1088, except that the 1024 index is an 'int', which would suggest 1056 as the real answer (also in conflict with the 'multiple of 64').

If anyone is interested, I found a paper that discusses the idea of what I am trying to accomplish with my TestU01 meta-analysis: Normal Cloud Distribution Probability Fast
My implementation will be different, but hopefully comparable.

Just a few possibly useful realizations regarding the current Xoroshiro32++ implementation:

1. There is no obligation to assemble XORO32 as 'f(n) | ( f(n+1) << 16 )', as 'f(n+1) | ( f(n) << 16 )' is topologically equivalent. However, the latter appears more random due to greater de-correlation at the boundary of XORO32(n) and XORO32(n+1), at least when using [13,5,10,9]. This is what I call a JMT ('Jedi Mind Trick'), since it is equivalent to 'ROL(XORO32,16)', but it is enough to fool PractRand into believing that the output is nearly twice as random in most all of the forward and reverse rotations.

2. If the output buffer of xoroshiro32++ was bi-directional (i.e. 'result' is a static state variable initialized to zero), then this becomes possible: 'result = result + rotl( s0 + s1, 9 ) + s0 - 1;'. This produces a random stream guaranteed up to 8GB (failing in PractRand @ 16GB), fully 2-dimensionally equidistributed, with a period 2^48-2^16.

#1, by itself, is not entirely pointless, and is easy to implement. #1 and #2 together are more difficult to implement, but work even better than #2 alone (with an abrupt hard-fail in PractRand @ 16GB).

Just a few possibly useful realizations regarding the current Xoroshiro32++ implementation:

1. There is no obligation to assemble XORO32 as 'f(n) | ( f(n+1) << 16 )', as 'f(n+1) | ( f(n) << 16 )' is topologically equivalent. However, the latter appears more random due to greater de-correlation at the boundary of XORO32(n) and XORO32(n+1), at least when using [13,5,10,9]. This is what I call a JMT ('Jedi Mind Trick'), since it is equivalent to 'ROL(XORO32,16)', but it is enough to fool PractRand into believing that the output is nearly twice as random in most all of the forward and reverse rotations.

2. If the output buffer of xoroshiro32++ was bi-directional (i.e. 'result' is a static state variable initialized to zero), then this becomes possible: 'result = result + rotl( s0 + s1, 9 ) + s0 - 1;'. This produces a random stream guaranteed up to 8GB (failing in PractRand @ 16GB), fully 2-dimensionally equidistributed, with a period 2^48-2^16.

#1, by itself, is not entirely pointless, and is easy to implement. #1 and #2 together are more difficult to implement, but work even better than #2 alone (with an abrupt hard-fail in PractRand @ 16GB).

Thanks for this new info, I'm pleased this thread has been revived. Comments:

1. The XORO32 instruction injects the 32-bit PRN into the next instruction's S (source) field. Had the next D (dest) field been chosen instead the 'Jedi Mind Trick' could have been achieved without the need for a separate rotate instruction:

'P2 implementation
XORO32 state 'state iterated twice, PRN -> next S
MOV prn,0
ROL prn,#16
'Alternative not implemented
XORO32 state 'state iterated twice, PRN -> next D
ROL prn,#16

2. The addition could be implemented in logic as follows

'result = result + rotl( s0 + s1, 9 ) + s0 - 1
p = s0 + s1
q = result + s0 - 1
result = rotl (p, 9) + q

and probably would take no more time than the current algorithm. The problem is the extra register needed to hold result, which might make multiple XORO32 instances difficult or impossible.

Are you sure about the period of 2^48-2^16? I envisage a much higher PractRand failure if so. Also the state is only 32 bits.

Are you sure about the period of 2^48-2^16? I envisage a much higher PractRand failure if so. Also the state is only 32 bits.

Yes, I am quite reasonably sure about the period (and a formal proof should not be very difficult).
[Edit}I am considering the persistent result accumulator part of the state.[/Edit]
You can run the simulation easily on the 16-bit state (+ 8-bit result state accumulator) Xoroshiro16+++(-1) using [3,7,4,4]. Period is 2^24-2^8.
Here is the result:
1D 256:65535 (all 256 values occur 65535 times)
2D 256:256 (all 256 values occur twice in a row 65535 256 times)

The purpose of the result state variable is NOT to extend the randomness period extensively. It is to:
1. Recover the missing zero, like LCG.
2. Recover from loss of 2D, since 1D is a normal result of either + or ++ (but not * or **).
3. Make the randomness period more deterministic (i.e. =2^(2/3 the full-state size), when 'result' is considered part of the state) by use of 'protracted equidistribution'.
4. Make the randomness over the defined randomness period reasonably beyond reproach (but yet not intended as a CSPRNG).
5. Enable a simple, logical extension to ALL similar LFSR PRNGs.
6. Encourage the completion of mathematical construct in PRNGs, even if doing so has an undesirable marginal impact on target implementation and/or speed.
[Edit]7. Prevent the 2D from disrupting the randomness, as exact 2D over the normal 2^32-1 period fails many randomness tests more easily.[/Edit]

[Edit]Caveat1: I do not have a jump function (in this version), which is not so much of an issue for a free-running hardware PRNG of ~2^48 period.[/Edit]
[Edit]Caveat2: Original 32-bit state must still be seeded non-zero.[/Edit]

Not to detract from the current points, but I have just finished testing of a ++ variant that uses JMT, but not the persistent result state, which can often pass TestU01 Crush with only 32-bits.
AFAIK, this has never been done before.
[Edit]Of course this has not been done, and still hasn't... my code is 1-dimensionally equidistributed at the 16-bit level, not the 32-bit level. Still perhaps slightly impressive, though.[/Edit]

A few subtle points on JMT:
JMT (if used) breaks the 2-Dimensional Equidistibution created by +++(-1) due to the period being short of an exact power of 2, falling back to 1D.
JMT creates a rift in the 1D boundary at full period, which recovers as the number of full period cycles approaches infinity. [Edit]I need to double-check this statement.[/Edit]
The idea for JMT partially results from my observation that many LFSR 64-bit PRNG results from TestU01 BigCrush look too good when the high and low 32-bits are tested sequentially, rather than in isolation.

Way too much thought put into this for such a simple concept.

I'm understanding this more, result provides the PRN output and extra state bits. I've verified the period of 2^24 - 2^8 for w = 16, which could also be written as 2^w - 1 << w/2. 1-D equidistribution including zero checks out, but 2-D including zero doesn't and appears impossible given the period.

I'm understanding this more, result provides the PRN output and extra state bits. I've verified the period of 2^24 - 2^8 for w = 16, which could also be written as 2^w - 1 << w/2. 1-D equidistribution including zero checks out, but 2-D including zero doesn't and appears impossible given the period.

You are correct. 2D can only apply to w=8 (on the small test version), so also only to w=16 on modified XORO32. This is the expected behavior when you concatenate an output word from two components. This is incorrect.

I believe you would find the same to be true with the existing XORO32, if the scrambler was simply 'result = s0', which is 2D at 16-bit output (but with a single missing zero), but loses 2D when inspected as a concatenated 32-bits (i.e. ~half the pairs disappear, but some are consequently replaced by other pairs).
Try it with the small version to see. I believe this is incorrect, as xoroshiro using 'result = s0' is maximally equidistributed (except for missing zero), whereas I was only trying to achieve 1 and 2-dimensional equidistribution of the 'result' output, which indeed does not extend to 2x output pairs, which are normally distributed... which is actually quite exceptional when you look at the raw data (and compare to it to that of the ++ scrambler over the same period).

Therefore, full 1D and 2D at 32-bits, while simultaneously maintaining all of the other desirable properties discussed, would require twice the state (e.g. 64-bits is minimum, and 96-bits guarantees statistical randomness all the way out to 2^64 outputs)... I am already running that 96-bit generator in VB6 for research, along-side Xoshiro128+ for floating point values.

BTW, the 1D boundary rift issue I mentioned with JMT when inspecting the 'result', also occurs to a much smaller extent in the current XORO32, but gracefully recovers fully after two full periods of the underlying xoroshiro32++ (but leaves 2 missing zeros in its wake).

This new idea can be expressed as xoroshiro+++ [a,b,c,d,e] where:

result = result + rotl( s0 + s1, d ) + s0 + e

I can see no difference in single and pair outputs between e = -1 and +1. Is there a qualitative difference in PractRand or TestUI? (I cannot run either of these.)

The existing XORO32 output could be used with new result calculation done in software, with more optimum code if e = +1. As a way of producing a PRNG with significantly longer period than xoroshiro32++, this code would be smaller and quicker than a fully software xoroshiro64++.

I can see no difference in single and pair outputs between e = -1 and +1. Is there a qualitative difference in PractRand or TestUI?

I assume you are referring to rolling e = +1 in as a carry bit into an existing summation?
I am running PractRand on all 16-bit rotations of randomly seeded streams. +1 and -1 looks similar so far.
TestU01... if PractRand is any indication, might require BigCrush to tell the difference between +1 and -1.
[Edit]Attached -1,+1 and JMT+1, all 16-bit output rotations. Summary:[/Edit]

The xoroshiro+ sum (s0+s1) discards the highest entropy carry bit of the sum, which would be desirable to recover, as you know.
In the context of the original code, attempting to do so is of little value, since it would create some other issues, thus negating the benefit.
Now that I find e=+1 is desirable (more so than -1) for implementation, but causes minor issues statistically.
JMT, although seemingly advantageous, might also create implementation issues, but does improves statistics measurably.
I believe I have a solution that would recover the high entropy carry bit, allow e=1 and not require JMT.
This solution can be expressed as:

xoroshiro+++ [aR,bR,cR,dR,e]

In the above, all rotations and shifts run to the right, rather than left. The same [13,5,10,9,1] can be used. The engine output is bit reversed, and the scrambler output carry bits flow from least bit (high entropy) to highest bit (low entropy).
Since the result accumulates the entropy from least to greatest due to carries, issues with the high bits will not be noticed.
Here is the preliminary result (see attached, with JMT to follow for comparison): [Edit:]Added JMT to summary and attached zip file.[/Edit]

The failures at 00R and 00F are noticeably more graceful, but an inevitable consequence of the simplicity of the scrambler code modification.
Additionally, I have done preliminary studies on the 32-bit distribution, and it seems flawlessly normal, likely due to the 16-bit 2D equidistribution property. Edit: The above non-00x rotation results (and those in the attached files) are likely wrong, thus should be disregarded.

e = +1 rather than -1 permits add with carry (called ADDX in the P2), saving two instructions but adding one to set carry, net saving of one.

The new result variable creates a PRNG that consists of multiple xoroshiro periods, e.g. 65536 such periods for xoroshiro32+++. Extrapolating from xoroshiro16+++, each period is identical except that result is one fewer than the previous period and one more than the next period for the corresponding iterations. It appears the same effect could be achieved by adding an offset that decrements after each period to the xoroshiro++ output.

The new result variable creates a PRNG that consists of multiple xoroshiro periods, e.g. 65536 such periods for xoroshiro32+++. Extrapolating from xoroshiro16+++, each period is identical except that result is one fewer than the previous period and one more than the next period for the corresponding iterations.

One more + the next scrambler result, as I recall, which in the worst case would yield a much more highly correlated (but not identical) stream... all streams after 8GB of output are correlated anyway. My original intent was to introduce 'e' as an odd seed value of high merit, but realized that it could not solve the correlation issue entirely, so I focused on maximizing the randomness over the entire underlying xoroshiro period, which is now nearly as complete as possible.
Regardless of randomness tests, to fall victim to this limitation would require using more that 8GB at once (which is outside the design specification). Even so, in many cases exceeding that limit (again, not recommend) would only be noticed if user code consumed streams of exactly 2^32-1 xoroshiro outputs. This is ignoring the double-iteration of xoroshiro for XORO32, which swaps the order after 2^32-1, where now the high and low will have different, but correlated values to the low and high of the previous stream.

It appears the same effect could be achieved by adding an offset that decrements after each period to the xoroshiro++ output.

Any offset that decrements would be part of the state. Doing so after each period requires a conditional check.
Using the inherent property of the xoroshiro++ output sums, we get the same effect with less code complexity and greater random behavior, I believe...
Just in case, let me see some pseudo-code of what you are thinking?

The new result variable creates a PRNG that consists of multiple xoroshiro periods, e.g. 65536 such periods for xoroshiro32+++. Extrapolating from xoroshiro16+++, each period is identical except that result is one fewer than the previous period and one more than the next period for the corresponding iterations.

One more + the next scrambler result, as I recall, which in the worst case would yield a much more highly correlated (but not identical) stream... all streams after 8GB of output are correlated anyway. My original intent was to introduce 'e' as an odd seed value of high merit, but realized that it could not solve the correlation issue entirely, so I focused on maximizing the randomness over the entire underlying xoroshiro period, which is now nearly as complete as possible.
Regardless of randomness tests, to fall victim to this limitation would require using more that 8GB at once (which is outside the design specification). Even so, in many cases exceeding that limit (again, not recommend) would only be noticed if user code consumed streams of exactly 2^32-1 xoroshiro outputs. This is ignoring the double-iteration of xoroshiro for XORO32, which swaps the order after 2^32-1, where now the high and low will have different, but correlated values to the low and high of the previous stream.

The double iteration does reduce the correlation, that's a good point.

It appears the same effect could be achieved by adding an offset that decrements after each period to the xoroshiro++ output.

Any offset that decrements would be part of the state. Doing so after each period requires a conditional check.
Using the inherent property of the xoroshiro++ output sums, we get the same effect with less code complexity and greater random behavior, I believe...
Just in case, let me see some pseudo-code of what you are thinking?

This was more an observation as to what is going on, rather than a suggestion. Code with offset would be larger and slower than with result and the former is not worth doing.

What concerns me is the equidistribution or lack of it. Let's define two new terms. For state width of w bits, the state period = 2^w-1 and the result period = (2^w-1) * 2^(w/2).

1-D equidistribution is perfect for a complete result period, but not good for one state period. There is a small improvement after each subsequent state period as the frequencies of each result output converge to the same value. I'm concerned about how slow this convergence is and it could take a long time to reach an 'acceptable' equidistribution.

1-D equidistribution is perfect for a complete result period, but not good for one state period. There is a small improvement after each subsequent state period as the frequencies of each result output converge to the same value. I'm concerned about how slow this convergence is and it could take a long time to reach an 'acceptable' equidistribution.

I agree. The question is about use case, and oddly, seeding:
Typically when someone needs random numbers, they are not concerned about equidistribution over a finite period of numbers they will consume, just that the possibility of getting the numbers is even over the long run.
If they are concerned about specific equidistribution over a shorter period, then they usually need a very specific PRNG written or adapted for their purpose.

Here, we have a PRNG that will appear only normally distributed for 8GB of output, but we are not stuck entirely... it will continue to be normally distributed until it reaches its full period, but yet fail randomness tests after any period of 8GB.

The only key to this is initial seeding. The sequence generated after the initial seed is not as obviously correlated to another that is from another initial seed.

## Comments

9,809Here's a QUOTE:

"Remember that we're going for an all-core overclock and that means a lower clock frequency than the highest Turbo bin offers. What you need to do:Enable and start at 4200 MHz (42 Multiplier)

Apply 1.40V to the CPU (or simply leave it at auto)

Work your way upwards from there until the system becomes unstable and then back down at least 100 MHz."

His first wrong assumption is that a multiplier beyond max boost isn't achievable.

Second is he started from 1.40 volts!

And ultimately, the cooler likely wasn't up to the job. No mention of changing it, which presumably means he used the AMD supplied cooler

"we use the Wraith stock AMD cooler"and thermal transfer paste.265Apparently clock-stretching is the wild card that complicates under/over-clocking.

9,8099,809Under load, the 1700X fails to a reset/lockup when the voltage is too low. At my set 4 GHz clock, below 1.300 core volts I had to reduce the temperature for max RPM on the fan to keep it stable. I was able to go down to 1.287 volts. All this would seem to support the idea that the 3k series has the clock-stretching (or possibly skipping) as a new feature.

PS: System power being 1700X CPU, B350 mobo, GTX960 GPU, 1x SSD, 1x HDD, 1x ODD, gold rated power supply, mouse, keyboard, USB extender, card reader, and some cooling fans.

PPS: I've worked out that it isn't Cinebench's fault for the inconsistent scores. I just have to wait longer after each reboot for the OS to settle down and stop running its little background jobs. Conveniently, I can see the transition on the power meter with idle system power always at 55 watts.

9,809I had another go at 4100 MHz, which had proved too unstable historically, and found that indeed it is creating too much heat for my S40 cooler. With all cores going flat out, the temperature rises to needing more volts for stability which in turn heats higher. There's no sweet spot.

265Excerpt: "... if you want to overclock a single core inside a CCX, the second core must run at a 1 GHz difference, meaning that if one core is OC'd to 4.5 GHz, the second core must run at 3.5 GHz. Such design is to be blamed on CPU's internal clock divider..."

No worries... Shamino to the rescue with a new version of Work Tool.

EDIT: Required reading (and some scary stuff in one of the links before using Work Tool).

9,809On another note, that "data fabric" in the I/O die is huge. So far I've not seen a single official comment on what is in there. I've seen one off-hand comment by a journo speculating it could be DRAM.

265I have already tested many of the xoroshiro++ constants at 128-bit, but wasn't perfectly satisfied with the ones I looked at via meta-analysis.

It is just like the search for 32-bit state constants, but without any easy ability to exhaustively test.

I am in the process of re-fitting my BigCrush meta-analysis engine with mathematical improvements (e.g. I noticed I had been using the sample average as part of the StdDev calculation, where the known population average of 0.5 seems more appropriate and is reasonably mathematically scalable for n > 2, and almost perfectly so for n > ~30) to better explore this.

9,809265On the whole I think it is overdue, as I believe large state ++ (with correct choice of constants) has the potential to obsolete most other PRNGs for most common tasks (that do not require a CSPRNG).

The only real failing (in very specific use-case scenarios) of a good ++ is sparse-state recovery (i.e., lag in propagating a single state bit change to other bits), which my +*+ w/mask idea addresses fairly well.

Some might argue that lack of multidimensional equidistribution is a 'failing' with + (or ++), but I don't think that argument carries much weight for the vast majority of uses, and even less so at larger state sizes.

9,80916,570I gather you’re trying to overclock you pc to get actual results from the png. If so, could you use the free GPU/TPU from google? There is a version that runs python and you can get about 10 hr time blocks at a time although you can get kicked off for paying customers. Just a thought.

9,809I guess the reason I bring up the Ryzen in this topic is because I've done all my significant PRNG testing using the original 8-core product from early 2017 and have mentioned more than once how much of an upgrade it was from the dual-core. That and Chris has shown interest in adding to his extensive collection of PCs.

1,517The new version of the paper has some differences from the original and is worth downloading (however I suggest keeping a copy of the old one).

The ++ scramblersection is now section 10.6, not 10.7. Our constant d is called r in the paper, but we started using [a,b,c,d] long before the paper was published.Seba and David now suggest d aka r values of 5 and 11 for w = 16 (32-bit state, 16-bit output as used by XORO32). What's quite amusing is that Seba knows that we've changed from [14,2,7,5] to [13,5,10,9] and the former is mentioned in the first paper whereas the latter is not in the second presumably because it conflicts with their new advice! As mentioned, though, test results are what really matter. Also the double-iteration in XORO32 is a unique feature of the P2 and others would use a 64-bit or larger state to get a 32-bit PRN.

I think the amended paper still gives the (misleading) impression that there is not a lot to choose between + and ++ on quality. Perhaps it's hard to tell the difference with states > 32-bit, but our tests show that ++ is much better. + is faster and easier to analyse theoretically, but if there is time to do ++ I can't see any reason to do + instead.

Regarding the PRNG shootout, I don't understand how a footprint of 1068 bits arises when the text says it is always padded to a multiple of 64.

265265My implementation will be different, but hopefully comparable.

2651. There is no obligation to assemble XORO32 as 'f(n) | ( f(n+1) << 16 )', as 'f(n+1) | ( f(n) << 16 )' is topologically equivalent. However, the latter appears more random due to greater de-correlation at the boundary of XORO32(n) and XORO32(n+1), at least when using [13,5,10,9]. This is what I call a JMT ('Jedi Mind Trick'), since it is equivalent to 'ROL(XORO32,16)', but it is enough to fool PractRand into believing that the output is nearly twice as random in most all of the forward and reverse rotations.

2. If the output buffer of xoroshiro32++ was bi-directional (i.e. 'result' is a static state variable initialized to zero), then this becomes possible: 'result = result + rotl( s0 + s1, 9 ) + s0 - 1;'. This produces a random stream guaranteed up to 8GB (failing in PractRand @ 16GB), fully 2-dimensionally equidistributed, with a period 2^48-2^16.

#1, by itself, is not entirely pointless, and is easy to implement. #1 and #2 together are more difficult to implement, but work even better than #2 alone (with an abrupt hard-fail in PractRand @ 16GB).

1,517Thanks for this new info, I'm pleased this thread has been revived. Comments:

1. The XORO32 instruction injects the 32-bit PRN into the next instruction's S (source) field. Had the next D (dest) field been chosen instead the 'Jedi Mind Trick' could have been achieved without the need for a separate rotate instruction:

2. The addition could be implemented in logic as follows

and probably would take no more time than the current algorithm. The problem is the extra register needed to hold result, which might make multiple XORO32 instances difficult or impossible.

Are you sure about the period of 2^48-2^16? I envisage a much higher PractRand failure if so. Also the state is only 32 bits.

265[Edit}I am considering the persistent result accumulator part of the state.[/Edit]

You can run the simulation easily on the 16-bit state (+ 8-bit result state accumulator) Xoroshiro16+++(-1) using [3,7,4,4]. Period is 2^24-2^8.

Here is the result:

1D 256:65535 (all 256 values occur 65535 times)

2D 256:256 (all 256 values occur twice in a row 65535 256 times)

The purpose of the result state variable is NOT to extend the randomness period extensively. It is to:

1. Recover the missing zero, like LCG.

2. Recover from loss of 2D, since 1D is a normal result of either + or ++ (but not * or **).

3. Make the randomness period more deterministic (i.e. =2^(2/3 the full-state size), when 'result' is considered part of the state) by use of 'protracted equidistribution'.

4. Make the randomness over the defined randomness period reasonably beyond reproach (but yet not intended as a CSPRNG).

5. Enable a simple, logical extension to ALL similar LFSR PRNGs.

6. Encourage the completion of mathematical construct in PRNGs, even if doing so has an undesirable marginal impact on target implementation and/or speed.

[Edit]7. Prevent the 2D from disrupting the randomness, as exact 2D over the normal 2^32-1 period fails many randomness tests more easily.[/Edit]

[Edit]Caveat1: I do not have a jump function (in this version), which is not so much of an issue for a free-running hardware PRNG of ~2^48 period.[/Edit]

[Edit]Caveat2: Original 32-bit state must still be seeded non-zero.[/Edit]

Enjoy.

265AFAIK, this has never been done before.

[Edit]Of course this has not been done, and still hasn't... my code is 1-dimensionally equidistributed at the 16-bit level, not the 32-bit level. Still perhaps slightly impressive, though.[/Edit]

265JMT (if used) breaks the 2-Dimensional Equidistibution created by +++(-1) due to the period being short of an exact power of 2, falling back to 1D.

JMT creates a rift in the 1D boundary at full period, which recovers as the number of full period cycles approaches infinity. [Edit]I need to double-check this statement.[/Edit]

The idea for JMT partially results from my observation that many LFSR 64-bit PRNG results from TestU01 BigCrush look too good when the high and low 32-bits are tested sequentially, rather than in isolation.

Way too much thought put into this for such a simple concept.

1,517andextra state bits. I've verified the period of 2^24 - 2^8 for w = 16, which could also be written as 2^w - 1 << w/2. 1-D equidistribution including zero checks out, but 2-D including zero doesn't and appears impossible given the period.265modifiedXORO32.This is the expected behavior when you concatenate an output word from two components. This is incorrect.

I believe you would find the same to be true with the existing XORO32, if the scrambler was simply 'result = s0', which is 2D at 16-bit output (but with a single missing zero), but loses 2D when inspected as a concatenated 32-bits (i.e. ~half the pairs disappear, but some are consequently replaced by other pairs).

Try it with the small version to see. I believe this is incorrect, as xoroshiro using 'result = s0' is maximally equidistributed (except for missing zero), whereas I was only trying to achieve 1 and 2-dimensional equidistribution of the 'result' output, which indeed does not extend to 2x output pairs, which are normally distributed... which is actually quite exceptional when you look at the raw data (and compare to it to that of the ++ scrambler over the same period).

Therefore, full 1D and 2D at 32-bits, while simultaneously maintaining all of the other desirable properties discussed, would require twice the state (e.g. 64-bits is minimum, and 96-bits guarantees statistical randomness all the way out to 2^64 outputs)... I am already running that 96-bit generator in VB6 for research, along-side Xoshiro128+ for floating point values.

BTW, the 1D boundary rift issue I mentioned with JMT when inspecting the 'result', also occurs to a much smaller extent in the current XORO32, but gracefully recovers fully after two full periods of the underlying xoroshiro32++ (but leaves 2 missing zeros in its wake).

Let me know if you see any discrepancies.

[Edit]Major edits above, so re-read[/Edit]

1,517The existing XORO32 output could be used with new result calculation done in software, with more optimum code if e = +1. As a way of producing a PRNG with significantly longer period than xoroshiro32++, this code would be smaller and quicker than a fully software xoroshiro64++.

265I am running PractRand on all 16-bit rotations of randomly seeded streams. +1 and -1 looks similar so far.

TestU01... if PractRand is any indication, might require BigCrush to tell the difference between +1 and -1.

[Edit]Attached -1,+1 and JMT+1, all 16-bit output rotations. Summary:[/Edit]

Edit: The above non-00x rotation results (and those in the attached files) are likely wrong, thus should be disregarded.265In the context of the original code, attempting to do so is of little value, since it would create some other issues, thus negating the benefit.

Now that I find e=+1 is desirable (more so than -1) for implementation, but causes minor issues statistically.

JMT, although seemingly advantageous, might also create implementation issues, but does improves statistics measurably.

I believe I have a solution that would recover the high entropy carry bit, allow e=1 and not require JMT.

This solution can be expressed as: In the above, all rotations and shifts run to the right, rather than left. The same [13,5,10,9,1] can be used. The engine output is bit reversed, and the scrambler output carry bits flow from least bit (high entropy) to highest bit (low entropy).

Since the result accumulates the entropy from least to greatest due to carries, issues with the high bits will not be noticed.

Here is the preliminary result (see attached, with JMT to follow for comparison): [Edit:]

Added JMT to summary and attached zip file.[/Edit] The failures at 00R and 00F are noticeably more graceful, but an inevitable consequence of the simplicity of the scrambler code modification.Additionally, I have done preliminary studies on the 32-bit distribution, and it seems flawlessly normal, likely due to the 16-bit 2D equidistribution property.

Edit: The above non-00x rotation results (and those in the attached files) are likely wrong, thus should be disregarded.1,517e = +1 rather than -1 permits add with carry (called ADDX in the P2), saving two instructions but adding one to set carry, net saving of one.

The new result variable creates a PRNG that consists of multiple xoroshiro periods, e.g. 65536 such periods for xoroshiro32+++. Extrapolating from xoroshiro16+++, each period is identical except that result is one fewer than the previous period and one more than the next period for the corresponding iterations. It appears the same effect could be achieved by adding an offset that decrements after each period to the xoroshiro++ output.

265My original intent was to introduce 'e' as an odd seed value of high merit, but realized that it could not solve the correlation issue entirely, so I focused on maximizing the randomness over the entire underlying xoroshiro period, which is now nearly as complete as possible.Regardless of randomness tests, to fall victim to this limitation would require using more that 8GB at once (which is outside the design specification). Even so, in many cases exceeding that limit (again, not recommend) would only be noticed if user code consumed streams of exactly 2^32-1 xoroshiro outputs. This is ignoring the double-iteration of xoroshiro for XORO32, which swaps the order after 2^32-1, where now the high and low will have different, but correlated values to the low and high of the previous stream. Any offset that decrements would be part of the state. Doing so after each period requires a conditional check.

Using the inherent property of the xoroshiro++ output sums, we get the same effect with less code complexity and greater random behavior, I believe...

Just in case, let me see some pseudo-code of what you are thinking?

[Edited above]

1,517The double iteration does reduce the correlation, that's a good point.

This was more an observation as to what is going on, rather than a suggestion. Code with offset would be larger and slower than with result and the former is not worth doing.

What concerns me is the equidistribution or lack of it. Let's define two new terms. For state width of w bits, the

state period= 2^w-1 and theresult period= (2^w-1) * 2^(w/2).1-D equidistribution is perfect for a complete result period, but not good for one state period. There is a small improvement after each subsequent state period as the frequencies of each result output converge to the same value. I'm concerned about how slow this convergence is and it could take a long time to reach an 'acceptable' equidistribution.

265Typically when someone needs random numbers, they are not concerned about equidistribution over a finite period of numbers they will consume, just that the possibility of getting the numbers is even over the long run.

If they are concerned about specific equidistribution over a shorter period, then they usually need a very specific PRNG written or adapted for their purpose.

Here, we have a PRNG that will appear only normally distributed for 8GB of output, but we are not stuck entirely... it will continue to be normally distributed until it reaches its full period, but yet fail randomness tests after any period of 8GB.

The only key to this is initial seeding. The sequence generated after the initial seed is not as obviously correlated to another that is from another initial seed.