## Comments

Here's a QUOTE:

"Remember that we're going for an all-core overclock and that means a lower clock frequency than the highest Turbo bin offers. What you need to do:

- Enable and start at 4200 MHz (42 Multiplier)
- Apply 1.40V to the CPU (or simply leave it at auto)
- Work your way upwards from there until the system becomes unstable and then back down at least 100 MHz."

His first wrong assumption is that a multiplier beyond max boost isn't achievable.

Second is he started from 1.40 volts!

And ultimately, the cooler likely wasn't up to the job. No mention of changing it, which presumably means he used the AMD-supplied cooler.

"we use the Wraith stock AMD cooler" and thermal transfer paste.

Apparently clock-stretching is the wild card that complicates under/over-clocking.

Under load, the 1700X fails to a reset/lockup when the voltage is too low. At my set 4 GHz clock, below 1.300 core volts I had to lower the temperature at which the fan reaches max RPM to keep it stable. I was able to go down to 1.287 volts. All this would seem to support the idea that the 3k series has clock-stretching (or possibly skipping) as a new feature.

PS: System power being 1700X CPU, B350 mobo, GTX960 GPU, 1x SSD, 1x HDD, 1x ODD, gold rated power supply, mouse, keyboard, USB extender, card reader, and some cooling fans.

PPS: I've worked out that it isn't Cinebench's fault for the inconsistent scores. I just have to wait longer after each reboot for the OS to settle down and stop running its little background jobs. Conveniently, I can see the transition on the power meter with idle system power always at 55 watts.

I had another go at 4100 MHz, which had proved too unstable historically, and found that indeed it is creating too much heat for my S40 cooler. With all cores going flat out, the temperature rises to the point of needing more volts for stability, which in turn raises the heat further. There's no sweet spot.

Excerpt: "... if you want to overclock a single core inside a CCX, the second core must run at a 1 GHz difference, meaning that if one core is OC'd to 4.5 GHz, the second core must run at 3.5 GHz. Such design is to be blamed on CPU's internal clock divider..."

No worries... Shamino to the rescue with a new version of Work Tool.

EDIT: Required reading (and some scary stuff in one of the links before using Work Tool).

On another note, that "data fabric" in the I/O die is huge. So far I've not seen a single official comment on what is in there. I've seen one off-hand comment by a journo speculating it could be DRAM.

I have already tested many of the xoroshiro++ constants at 128-bit, but wasn't perfectly satisfied with the ones I looked at via meta-analysis.

It is just like the search for 32-bit state constants, but without any easy ability to exhaustively test.

I am in the process of re-fitting my BigCrush meta-analysis engine with mathematical improvements (e.g. I noticed I had been using the sample average as part of the StdDev calculation, where the known population average of 0.5 seems more appropriate and is reasonably mathematically scalable for n > 2, and almost perfectly so for n > ~30) to better explore this.
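To make the StdDev point concrete, here is a minimal Python sketch. It assumes the values being summarised are test p-values, which are uniform on [0, 1] with a known mean of 0.5 when the generator is good; the function names and sample data are hypothetical and not taken from the meta-analysis engine itself.

```python
import math

def stddev_about_known_mean(values, mean=0.5):
    """RMS deviation about a known population mean (divides by n)."""
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def stddev_about_sample_mean(values):
    """Usual sample standard deviation about the sample average (divides by n - 1)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

if __name__ == "__main__":
    # Hypothetical p-values that cluster on the high side of 0.5.
    p_values = [0.6, 0.8]
    print(stddev_about_known_mean(p_values))   # measures spread around 0.5
    print(stddev_about_sample_mean(p_values))  # measures spread around 0.7
```

With biased inputs like these, the sample-mean version reports a smaller spread because it measures distance from the batch's own average, so a systematic drift away from 0.5 goes unnoticed. Measuring from 0.5 keeps that drift in the result.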

On the whole I think it is overdue, as I believe large state ++ (with correct choice of constants) has the potential to obsolete most other PRNGs for most common tasks (that do not require a CSPRNG).

The only real failing (in very specific use-case scenarios) of a good ++ is sparse-state recovery (i.e., lag in propagating a single state bit change to other bits), which my +*+ w/mask idea addresses fairly well.

Some might argue that lack of multidimensional equidistribution is a 'failing' with + (or ++), but I don't think that argument carries much weight for the vast majority of uses, and even less so at larger state sizes.

I gather you're trying to overclock your PC to get actual results from the PRNG. If so, could you use the free GPU/TPU from Google? There is a version that runs Python and you can get blocks of about 10 hours at a time, although you can get kicked off for paying customers. Just a thought.

I guess the reason I bring up the Ryzen in this topic is because I've done all my significant PRNG testing using the original 8-core product from early 2017 and have mentioned more than once how much of an upgrade it was from the dual-core. That and Chris has shown interest in adding to his extensive collection of PCs.

The new version of the paper has some differences from the original and is worth downloading (however I suggest keeping a copy of the old one).

The ++ scrambler section is now section 10.6, not 10.7. Our constant d is called r in the paper, but we started using [a,b,c,d] long before the paper was published.

Seba and David now suggest d (aka r) values of 5 and 11 for w = 16 (32-bit state, 16-bit output as used by XORO32). What's quite amusing is that Seba knows that we've changed from [14,2,7,5] to [13,5,10,9]; the former is mentioned in the first paper whereas the latter is not in the second, presumably because it conflicts with their new advice! As mentioned, though, test results are what really matter. Also the double-iteration in XORO32 is a unique feature of the P2 and others would use a 64-bit or larger state to get a 32-bit PRN.

I think the amended paper still gives the (misleading) impression that there is not a lot to choose between + and ++ on quality. Perhaps it's hard to tell the difference with states > 32-bit, but our tests show that ++ is much better. + is faster and easier to analyse theoretically, but if there is time to do ++ I can't see any reason to do + instead.

Regarding the PRNG shootout, I don't understand how a footprint of 1068 bits arises when the text says it is always padded to a multiple of 64.

My implementation will be different, but hopefully comparable.

1. There is no obligation to assemble XORO32 as 'f(n) | ( f(n+1) << 16 )', as 'f(n+1) | ( f(n) << 16 )' is topologically equivalent. However, the latter appears more random due to greater de-correlation at the boundary of XORO32(n) and XORO32(n+1), at least when using [13,5,10,9]. This is what I call a JMT ('Jedi Mind Trick'), since it is equivalent to 'ROL(XORO32,16)', but it is enough to fool PractRand into believing that the output is nearly twice as random in almost all of the forward and reverse rotations.

2. If the output buffer of xoroshiro32++ was bi-directional (i.e. 'result' is a static state variable initialized to zero), then this becomes possible: 'result = result + rotl( s0 + s1, 9 ) + s0 - 1;'. This produces a random stream guaranteed up to 8GB (failing in PractRand @ 16GB), fully 2-dimensionally equidistributed, with a period 2^48-2^16.

#1, by itself, is not entirely pointless, and is easy to implement. #1 and #2 together are more difficult to implement, but work even better than #2 alone (with an abrupt hard-fail in PractRand @ 16GB).
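Point #1 can be checked with a short Python sketch. It assumes the published xoroshiro++ form with 16-bit words and the [13,5,10,9] constants, and it keeps s0 and s1 as two separate integers; how the P2 packs its 32-bit state is not modelled, and the function names are made up for illustration.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def rotl16(x, k):
    """Rotate a 16-bit value left by k bits (0 < k < 16)."""
    return ((x << k) | (x >> (16 - k))) & MASK16

def rotl32(x, k):
    """Rotate a 32-bit value left by k bits (0 < k < 32)."""
    return ((x << k) | (x >> (32 - k))) & MASK32

def xoroshiro32pp(s0, s1, a=13, b=5, c=10, d=9):
    """One xoroshiro32++ iteration; returns (16-bit output, new s0, new s1)."""
    out = (rotl16((s0 + s1) & MASK16, d) + s0) & MASK16
    s1 ^= s0
    s0 = rotl16(s0, a) ^ s1 ^ ((s1 << b) & MASK16)
    s1 = rotl16(s1, c)
    return out, s0, s1

def jmt_matches_rotation(steps=1000, s0=1, s1=0):
    """Check that swapping the two halves equals ROL(XORO32, 16) at every step."""
    for _ in range(steps):
        f_n, s0, s1 = xoroshiro32pp(s0, s1)    # first iteration
        f_n1, s0, s1 = xoroshiro32pp(s0, s1)   # second iteration
        normal = f_n | (f_n1 << 16)            # 'f(n) | ( f(n+1) << 16 )'
        jmt = f_n1 | (f_n << 16)               # 'f(n+1) | ( f(n) << 16 )'
        if jmt != rotl32(normal, 16):
            return False
    return True

if __name__ == "__main__":
    print(jmt_matches_rotation())
```

The check holds for any generator, since swapping two 16-bit halves is a 16-bit rotation by definition. What changes is only which outputs end up adjacent at the word boundary when the stream is read as 32-bit words.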

Thanks for this new info, I'm pleased this thread has been revived. Comments:

1. The XORO32 instruction injects the 32-bit PRN into the next instruction's S (source) field. Had the next D (dest) field been chosen instead the 'Jedi Mind Trick' could have been achieved without the need for a separate rotate instruction:

2. The addition could be implemented in logic as follows

and probably would take no more time than the current algorithm. The problem is the extra register needed to hold result, which might make multiple XORO32 instances difficult or impossible.

Are you sure about the period of 2^48-2^16? I envisage a much higher PractRand failure if so. Also the state is only 32 bits.

[Edit]I am considering the persistent result accumulator to be part of the state.[/Edit]

You can run the simulation easily on the 16-bit state (+ 8-bit result state accumulator) Xoroshiro16+++(-1) using [3,7,4,4]. Period is 2^24-2^8.

Here is the result:

1D 256:65535 (all 256 values occur 65535 times)

2D 256:256 (all 256 values occur twice in a row 256 times)

The purpose of the result state variable is NOT to extend the randomness period extensively. It is to:

1. Recover the missing zero, like LCG.

2. Recover from loss of 2D, since 1D is a normal result of either + or ++ (but not * or **).

3. Make the randomness period more deterministic (i.e. =2^(2/3 the full-state size), when 'result' is considered part of the state) by use of 'protracted equidistribution'.

4. Make the randomness over the defined randomness period reasonably beyond reproach (but yet not intended as a CSPRNG).

5. Enable a simple, logical extension to ALL similar LFSR PRNGs.

6. Encourage the completion of mathematical construct in PRNGs, even if doing so has an undesirable marginal impact on target implementation and/or speed.

[Edit]7. Prevent the 2D from disrupting the randomness, as exact 2D over the normal 2^32-1 period fails many randomness tests more easily.[/Edit]

[Edit]Caveat1: I do not have a jump function (in this version), which is not so much of an issue for a free-running hardware PRNG of ~2^48 period.[/Edit]

[Edit]Caveat2: Original 32-bit state must still be seeded non-zero.[/Edit]

Enjoy.

AFAIK, this has never been done before.

[Edit]Of course this has not been done, and still hasn't... my code is 1-dimensionally equidistributed at the 16-bit level, not the 32-bit level. Still perhaps slightly impressive, though.[/Edit]

JMT (if used) breaks the 2-Dimensional Equidistribution created by +++(-1) due to the period being short of an exact power of 2, falling back to 1D.

JMT creates a rift in the 1D boundary at full period, which recovers as the number of full period cycles approaches infinity. [Edit]I need to double-check this statement.[/Edit]

The idea for JMT partially results from my observation that many LFSR 64-bit PRNG results from TestU01 BigCrush look too good when the high and low 32-bits are tested sequentially, rather than in isolation.

Way too much thought put into this for such a simple concept.

and extra state bits. I've verified the period of 2^24 - 2^8 for w = 16, which could also be written as (2^w - 1) << w/2. 1-D equidistribution including zero checks out, but 2-D including zero doesn't and appears impossible given the period.

modified XORO32. This is the expected behavior when you concatenate an output word from two components. This is incorrect.

I believe you would find the same to be true with the existing XORO32, if the scrambler was simply 'result = s0', which is 2D at 16-bit output (but with a single missing zero), but loses 2D when inspected as a concatenated 32-bits (i.e. ~half the pairs disappear, but some are consequently replaced by other pairs).

Try it with the small version to see. I believe this is incorrect, as xoroshiro using 'result = s0' is maximally equidistributed (except for missing zero), whereas I was only trying to achieve 1 and 2-dimensional equidistribution of the 'result' output, which indeed does not extend to 2x output pairs, which are normally distributed... which is actually quite exceptional when you look at the raw data (and compare it to that of the ++ scrambler over the same period).

Therefore, full 1D and 2D at 32-bits, while simultaneously maintaining all of the other desirable properties discussed, would require twice the state (e.g. 64-bits is minimum, and 96-bits guarantees statistical randomness all the way out to 2^64 outputs)... I am already running that 96-bit generator in VB6 for research, alongside Xoshiro128+ for floating point values.

BTW, the 1D boundary rift issue I mentioned with JMT when inspecting the 'result', also occurs to a much smaller extent in the current XORO32, but gracefully recovers fully after two full periods of the underlying xoroshiro32++ (but leaves 2 missing zeros in its wake).

Let me know if you see any discrepancies.

[Edit]Major edits above, so re-read[/Edit]

The existing XORO32 output could be used with new result calculation done in software, with more optimal code if e = +1. As a way of producing a PRNG with significantly longer period than xoroshiro32++, this code would be smaller and quicker than a fully software xoroshiro64++.
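A rough Python sketch of that software step follows. It takes "new result calculation" to mean result = result + output + e, which matches the formula given earlier in the thread because the xoroshiro32++ output already equals rotl(s0 + s1, 9) + s0. It also assumes the low half of the XORO32 word is the first iteration, per 'f(n) | ( f(n+1) << 16 )'. The function names are hypothetical, and prn32 stands in for a value read from the hardware.

```python
MASK16 = 0xFFFF

def accumulate(result, prn16, e=1):
    """One accumulator update: result + 16-bit xoroshiro32++ output + e, mod 2^16."""
    return (result + prn16 + e) & MASK16

def accumulate_xoro32(result, prn32, e=1):
    """Fold both halves of one XORO32 word into the accumulator, low half first.

    Returns (new accumulator value, 32-bit word built from the two results).
    """
    r_lo = accumulate(result, prn32 & MASK16, e)
    r_hi = accumulate(r_lo, prn32 >> 16, e)
    return r_hi, r_lo | (r_hi << 16)

if __name__ == "__main__":
    result = 0
    # Hypothetical hardware outputs, used only to exercise the arithmetic.
    for prn32 in (0x00020001, 0xFFFF0003):
        result, word = accumulate_xoro32(result, prn32)
        print(hex(result), hex(word))
```

The only extra state is the 16-bit result, so each independent stream needs one more register or memory word than plain XORO32.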

I am running PractRand on all 16-bit rotations of randomly seeded streams. +1 and -1 look similar so far.

TestU01... if PractRand is any indication, might require BigCrush to tell the difference between +1 and -1.

[Edit]Attached -1,+1 and JMT+1, all 16-bit output rotations. Summary:[/Edit]

Edit: The above non-00x rotation results (and those in the attached files) are likely wrong, thus should be disregarded.

In the context of the original code, attempting to do so is of little value, since it would create some other issues, thus negating the benefit.

I now find that e=+1 is desirable (more so than -1) for implementation, but it causes minor issues statistically.

JMT, although seemingly advantageous, might also create implementation issues, but does improve statistics measurably.

I believe I have a solution that would recover the high entropy carry bit, allow e=1 and not require JMT.

This solution can be expressed as:

In the above, all rotations and shifts run to the right, rather than left. The same [13,5,10,9,1] can be used. The engine output is bit reversed, and the scrambler output carry bits flow from least bit (high entropy) to highest bit (low entropy).

Since the result accumulates the entropy from least to greatest due to carries, issues with the high bits will not be noticed.

Here is the preliminary result (see attached, with JMT to follow for comparison):

[Edit]Added JMT to summary and attached zip file.[/Edit]

The failures at 00R and 00F are noticeably more graceful, but an inevitable consequence of the simplicity of the scrambler code modification.

Additionally, I have done preliminary studies on the 32-bit distribution, and it seems flawlessly normal, likely due to the 16-bit 2D equidistribution property.

Edit: The above non-00x rotation results (and those in the attached files) are likely wrong, thus should be disregarded.

e = +1 rather than -1 permits add with carry (called ADDX in the P2), saving two instructions but adding one to set carry, for a net saving of one.

The new result variable creates a PRNG that consists of multiple xoroshiro periods, e.g. 65536 such periods for xoroshiro32+++. Extrapolating from xoroshiro16+++, each period is identical except that result is one fewer than the previous period and one more than the next period for the corresponding iterations. It appears the same effect could be achieved by adding an offset that decrements after each period to the xoroshiro++ output.
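The period-to-period relationship can be explored with a small Python sketch of xoroshiro16+++ (8-bit words, 8-bit result), using the [3,7,4,4] constants quoted earlier in the thread. The update order follows the published xoroshiro++ form; the seed, the function names and the choice to stop when the state returns to the seed are assumptions of this sketch.

```python
MASK8 = 0xFF

def rotl8(x, k):
    """Rotate an 8-bit value left by k bits (0 < k < 8)."""
    return ((x << k) | (x >> (8 - k))) & MASK8

def step(s0, s1, a=3, b=7, c=4, d=4):
    """One xoroshiro16++ iteration; returns (8-bit output, new s0, new s1)."""
    out = (rotl8((s0 + s1) & MASK8, d) + s0) & MASK8
    s1 ^= s0
    s0 = rotl8(s0, a) ^ s1 ^ ((s1 << b) & MASK8)
    s1 = rotl8(s1, c)
    return out, s0, s1

def one_state_period(seed=(1, 0), e=-1, result=0):
    """Iterate until the state returns to the seed; return (results, period).

    The state update is invertible, so a non-zero seed always recurs.
    """
    s0, s1 = seed
    results = []
    while True:
        out, s0, s1 = step(s0, s1)
        result = (result + out + e) & MASK8   # result = result + output + e
        results.append(result)
        if (s0, s1) == seed:
            return results, len(results)

def period_shift(seed=(1, 0), e=-1):
    """Compare two consecutive state periods of the accumulated result.

    Returns (state period, set of differences second - first, mod 256).
    """
    first, period = one_state_period(seed, e, 0)
    second, _ = one_state_period(seed, e, first[-1])
    shifts = {(y - x) & MASK8 for x, y in zip(first, second)}
    return period, shifts

if __name__ == "__main__":
    print(period_shift())
```

The set of differences always has exactly one element, because the gap between corresponding iterations is the sum of (output + e) over one whole state period. The figures reported in the thread would correspond to a period of 65535 and a shift of 255 (that is, -1 mod 256) for e = -1; this sketch prints whatever it measures and does not assume either. If the shift is odd, 256 consecutive state periods visit every offset once, so each 8-bit value occurs the same number of times over the full result period, which is where 1D equidistribution including zero comes from.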

My original intent was to introduce 'e' as an odd seed value of high merit, but I realized that it could not solve the correlation issue entirely, so I focused on maximizing the randomness over the entire underlying xoroshiro period, which is now nearly as complete as possible.

Regardless of randomness tests, to fall victim to this limitation would require using more than 8GB at once (which is outside the design specification). Even so, in many cases exceeding that limit (again, not recommended) would only be noticed if user code consumed streams of exactly 2^32-1 xoroshiro outputs. This is ignoring the double-iteration of xoroshiro for XORO32, which swaps the order after 2^32-1, where now the high and low will have different, but correlated values to the low and high of the previous stream.

Any offset that decrements would be part of the state. Doing so after each period requires a conditional check.

Using the inherent property of the xoroshiro++ output sums, we get the same effect with less code complexity and greater random behavior, I believe...

Just in case, let me see some pseudo-code of what you are thinking?

[Edited above]

The double iteration does reduce the correlation, that's a good point.

This was more an observation as to what is going on, rather than a suggestion. Code with offset would be larger and slower than with result and the former is not worth doing.

What concerns me is the equidistribution or lack of it. Let's define two new terms. For a state width of w bits, the state period = 2^w-1 and the result period = (2^w-1) * 2^(w/2).

1-D equidistribution is perfect for a complete result period, but not good for one state period. There is a small improvement after each subsequent state period as the frequencies of each result output converge to the same value. I'm concerned about how slow this convergence is and it could take a long time to reach an 'acceptable' equidistribution.
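As a quick arithmetic check of these two definitions, the Python sketch below evaluates them for the 16-bit and 32-bit state sizes discussed in the thread. The helper names are hypothetical, and the bytes-per-state-period figure assumes each output is w/2 bits wide.

```python
def state_period(w):
    """Period of the underlying xoroshiro state for a w-bit state: 2^w - 1."""
    return 2**w - 1

def result_period(w):
    """Period with a w/2-bit result accumulator: (2^w - 1) * 2^(w/2)."""
    return state_period(w) * 2**(w // 2)

def bytes_per_state_period(w):
    """Bytes emitted in one state period, at w/2 bits per output."""
    return state_period(w) * (w // 2) // 8

if __name__ == "__main__":
    for w in (16, 32):
        print(w, state_period(w), result_period(w), bytes_per_state_period(w))
```

For w = 16 the result period is 2^24 - 2^8, and for w = 32 it is 2^48 - 2^16, matching the periods quoted earlier. One state period at w = 32 is 2^33 - 2 bytes, just under 8 GiB, which appears to be where the 8GB figure comes from.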

Typically when someone needs random numbers, they are not concerned about equidistribution over a finite period of numbers they will consume, just that the possibility of getting the numbers is even over the long run.

If they are concerned about specific equidistribution over a shorter period, then they usually need a very specific PRNG written or adapted for their purpose.

Here, we have a PRNG that will appear only normally distributed for 8GB of output, but we are not stuck entirely... it will continue to be normally distributed until it reaches its full period, but yet fail randomness tests after any period of 8GB.

The only key to this is initial seeding. The sequence generated after the initial seed is not as obviously correlated to another that is from another initial seed.