Random/LFSR on P2

xoroshironot · 2020-07-30 23:17

Several points (forgive my rambling):
1. I have only performed preliminary PractRand analysis, with e=0, f=1. It looks to be up to 2x better (judging by GAP and FPF failures) than the previously discussed RevBits(). That is not proof of better freqs performance (which I showed for RevBits plotted per stream across random seeds, and which you requested should also be done across sequential streams).

2. Do not forget the parentheses in the modified code, for absolute clarity: the '+ 1' must be the final step (the alternative is to sum it with prn[31:16] first, which I recall is not as good).

3. Using Xors only (without addition) does not interact adjacent bits, so no equidistribution recovery (or desired randomness benefit).

4. My unpublished code currently uses the above 'modified xoroacc' variant (which is extremely simple and fast), but further has its own added final output scrambler (required to get to 2TB+ randomness). My final output scrambler is not compatible with achieving (in the case of 16-bit word size) 32-bit near-perfect equidistribution when prn[15:0]/prn[31:16] are used (as a near-perfect 1D source), but works fine when s0/s1 are used (as a near-perfect 2D source). Therefore, once published, it will use its own xoroshiro engine (and return two outputs that are near-perfectly equidistributed, with some pairs occurring once less often over the full period). Also, since s0/s1 must be used, the final output scrambler must also be used to fix linear complexity and binary matrix rank issues. This is not an issue when using the prn output from XORO32, since linear complexity/matrix rank has already been dealt with, but does create issues with XORO32 regarding promotion of randomness beyond 16 or 32GB (without resorting to 'ugly' methods).

5. The 'modified xoroacc' could be further modified to replace either one of the ^ with a +, which (obviously) re-introduces more complexity, adds some noticeable randomness benefit biased toward high bits (up to 64GB, perhaps), and (more importantly for me) some small x86-x64 speed improvement (due to enabling micro-op fusion / SIMD). Investigating the modified xoroacc change, and now this other valid possibility, is hanging up my research (not to mention the issue of finding my own optimal D value, used differently, which cannot be 0).
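
As a small illustration of point 3, the sketch below flips a single input bit and compares which output bits change under xor versus under addition. The example values are arbitrary and chosen only so that a carry chain is visible; it says nothing about equidistribution itself.

```python
MASK16 = 0xFFFF  # work on 16-bit words, as in the 16-bit word size case


def changed_bits(a: int, b: int) -> int:
    """Return a mask of the bit positions where a and b differ."""
    return (a ^ b) & MASK16


x, y = 0x00FF, 0x0001      # arbitrary example values (assumption)
x_flipped = x ^ 0x0001     # flip only bit 0 of x

# Xor: each output bit depends only on the same-numbered input bits.
xor_diff = changed_bits(x ^ y, x_flipped ^ y)

# Add: the carry lets a low-bit change ripple into higher bits.
add_diff = changed_bits((x + y) & MASK16, (x_flipped + y) & MASK16)

print(f"xor changes: {xor_diff:016b}")
print(f"add changes: {add_diff:016b}")
```

With these values the xor result changes in one bit position only, while the sum changes in nine.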

xoroshironot · 2020-08-09 20:51

I've attached 128 sequential streams starting at seed 1 0 0 for 'Modified XOROACC' and 'Bifurcated Modified XOROACC' (#5 in my post above).

'Modified XOROACC' (xor_xorplus1)

result_out[15:0]  := prn[15:0]  ^ result_in[31:16]
result_out[31:16] := (prn[31:16] ^ result_out[15:0]) + 1

'Bifurcated Modified XOROACC' (xor_plusplus1)

result_out[15:0]  := prn[15:0]  ^ result_in[31:16]
result_out[31:16] := prn[31:16] + result_out[15:0] + 1
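
For anyone wanting to experiment off-chip, here is a minimal Python sketch of the two update rules exactly as written above. It assumes prn is the 32-bit generator output, result_in is the previous 32-bit result, and that each half wraps modulo 2^16; the function names simply reuse the labels in parentheses.

```python
MASK16 = 0xFFFF


def xor_xorplus1(prn: int, result_in: int) -> int:
    """'Modified XOROACC': low half is an xor, high half is (xor) + 1."""
    lo = (prn & MASK16) ^ ((result_in >> 16) & MASK16)
    hi = ((((prn >> 16) & MASK16) ^ lo) + 1) & MASK16  # '+ 1' is the final step
    return (hi << 16) | lo


def xor_plusplus1(prn: int, result_in: int) -> int:
    """'Bifurcated Modified XOROACC': low half is an xor, high half is a sum + 1."""
    lo = (prn & MASK16) ^ ((result_in >> 16) & MASK16)
    hi = (((prn >> 16) & MASK16) + lo + 1) & MASK16
    return (hi << 16) | lo


# Arbitrary example inputs (not taken from any test run).
print(hex(xor_xorplus1(0x12345678, 0xABCD0000)))   # 0xef82fdb5
print(hex(xor_plusplus1(0x12345678, 0xABCD0000)))  # 0xfeafdb5
```

Both variants share the same low half; they differ only in whether the high half combines prn[31:16] with the new low half by xor or by addition.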

While the Chi-Squared sum of all pfreq results for 'Modified XOROACC' is just slightly better, the 'Bifurcated' version is better overall in column frequencies, and has much better high-bits behavior in PractRand.

Results for both nzfreq and zfreq are statistically normal out to 128 streams, even when combined. All of these results are far better than the plus plus versions (including Rol1 and RevBits), where even randomly selected streams struggled with freqs well within a dozen streams, whereas these new ones still look statistically plausible in pfreq up to about two dozen streams, or more.

I also checked the bifurcated version in my unreleased code, and it is also statistically better (in both high and low bits, due to the final stage parallel scrambler).

I am affectionately calling the bifurcated version 'Slice and Sweep', since the xor followed by sum makes a certain logical sense, and in the case of XORO32, ensures that each half of the double-iterated stream is mathematically processed differently on a pair-wise basis, since the source pairs swap halfway through. Additionally, it helps with allowing E to remain equal to 1, since the sum now operates on a better de-correlated value, which makes odd E > 1 mostly irrelevant.

xoroshironot · 2020-08-25 00:08

Evan, my son let me know about this new video:
Linus Tech Tips - 1usmus Zen 2 Undervolt/Overclock Utility Video

evanh · 2020-08-25 05:46

xoroshironot wrote: »

Evan, my son let me know about this new video:

Hehe, good quote near the end - "Custom levels of tuning overclocking in the hands of people who only really know how to press, one button."

The volts reduction is good, brought the power down nicely. In my case though, I prevented the BIOS from automatically raising the voltage rather than actually reducing it. This'll depend on what that 1.181 Volts represents. Presumably it's the base 3.7 GHz voltage of the 3970X. The thing is, I doubt the volts is dynamically driven up and down when boosting a core or two to 4.5 GHz. The cores are all together when it comes to what the set voltage is. That means that, assuming enough cooling, 1.181 Volts is good for all-cores 4.5 GHz (the rated single core boost speed). And with a little more I don't see why 4.7 or 4.8 GHz isn't doable.

He claimed it doesn't nerf any of the power management. That might be the clue for why he couldn't get it to go faster. My approach is all-cores and certainly it disables at least the boost feature. 280 Watts shouldn't be a problem for that size cooler he was using. I'd expect to see 400 Watts on that 3970X with a decent all-core run.

He also talked about "CPU load line calibration". I certainly didn't play around with any of those types of BIOS controls. And having a look I see mine is not listed as levels 1 to 4 but Auto, Regular, Medium, High and Extreme. The default is Auto. I don't see these types of settings as important factors due to the relatively reduced power from the under-volting.

PS: Comparing my 1700X (3.4 GHz boosts to 3.8 GHz) at stock CPU settings (with DDR4-2933 CL14-14-14) vs all-cores at 4.0 GHz settings - Using Geekbench 4.4.2 for compare:

	Stock 1700X	All-cores	Improved
Single:	4933		5106		3.5%
Multi:	28574		31803		11.3%

A second run at stock got a little better:

	Stock 1700X	All-cores	Improved
Single:	4943		5106		3.3%
Multi:	28824		31803		10.3%

evanh · 2020-08-25 09:48

Ah, that's more like it, Linus mentioned that 1usmus is getting 9% extra speed on his 3900X. Though, looking at the numbers Linus put on a slide, with the average clock speed sitting around 4.25 GHz, I feel it should be able to go further still.

evanh · 2020-08-25 10:41

evanh wrote: »

... 280 Watts shouldn't be a problem for that size cooler he was using. I'd expect to see 400 Watts on that 3970X with a decent all-core run.

Hmm, maybe the 280 Watts is limited by the cooler. I keep forgetting my power measurements are for the whole PC box, not just the CPU, including power conversion losses in the power supply. There could easily be 50 Watts difference at load.

xoroshironot · 2020-08-26 00:10

I am trying to hold out for a few months before building a new PC... Anthony/Linus said that utility should still work on Zen 3 (assuming I decide to go that route).

There is still spurious evidence that overclocking Zen 2 (without under-volting) might be degrading the silicon... I'm not convinced, since needing to back down the clock from an edge stability condition could come from any number of other factors (fans, motherboard, power supply, heatsink film buildup, etc.).

On other matters, I found what I believe to be proper scrambler constants for the fast 32-bit output version of my unpublished code, but still testing (and already passed 32TB PractRand, working on 64TB). I may release that code along with 8-bit and 16-bit output code once I'm done (if all goes well), as the 64-bit output variant will likely take much longer than I originally imagined to vet. Having another PC to help out would help a little, but I decided to write a custom computer algebra system (CAS) to help me make razor-sharp decisions on ABCD, etc., constants to test (as it would otherwise take more than just another few workstations full of cores to brute-force).

I had already finished the fast version of the 16-bit output code many weeks ago. The 'fast' version fails PractRand at either 64GB or 128GB (depending on rotation), which is far shy of the 2TB 'random' version, but honestly I don't feel it matters too much beyond 16GB when the underlying xoroshiro engine wrap-around has begun to generate a statistically detectable bit correlation (even if PractRand has trouble seeing that issue, I know it is there, which is enough).

On really other matters, I was allowed to remote in to a PC on a Fortune 100 company network today... they spent about a month diagnosing and coming up with reasons why I couldn't remote in, before finally capitulating. Found evidence of buggy outdated chipset drivers and mis-adjusted network adapter device properties. Should be good to go now. Reminds me of that recent quote by Max Brooks concerning COVID-19: "... If I'm the smartest guy in the room, we're in big trouble".

evanh · 2020-08-26 09:16

xoroshironot wrote: »

There is still spurious evidence that overclocking Zen 2 (without under-volting) might be degrading the silicon... I'm not convinced, since needing to back down the clock from an edge stability condition could come from any number of other factors (fans, motherboard, power supply, heatsink film buildup, etc.).

It's the voltage that can do the damage. Overclocking normally presumes a lifted core volts too. My BIOS does it automatically when increasing the base multiplier.

xoroshironot · 2020-08-26 15:21

Right, the normal PBO will lift the core volts when required, but the concern was whether that maximum lift is too much. Some are theorizing that a normal PBO uplift to ~1.35-1.40v on Zen 2 might be too much (with 1.45v likely being way too excessive), but capping to ~1.24-1.35v might be noticeably more reliable in the long run. Technically, using PBO voids the warranty, so I guess AMD knows the degradation figures. Dr. Cutress was mentioning about 1% degradation over 10 years (without having access to AMD's figures) at the higher PBO voltages, but I suggested he was likely way off because electro-migration calculations cannot be guessed at accurately (due to 11th-order, or so, equations).

evanh · 2020-08-27 07:19

Experience has taught me the automatics for that don't do a good job at all.

evanh · 2020-09-10 05:41

You know, I don't consider any of this as a true overclock unless the rated boost speed is exceeded. It's just raising the power rating to get the all-core frequency closer to rated boost.

xoroshironot · 2020-09-11 00:19

That seems lucid to me... Zen 3 will announce in one month, and I'll bet that it will be even harder to squeeze more performance out at the higher end of the product stack. This is, I believe, a side-effect of the improved, optimized, and extra-binned chiplet design, where a true overclock exists (for most who would try) only at the bottom end of the product stack (e.g. over-sized tires on a compact car, which may suit some people).

Edit: Or, just as a tease at the other extreme, reasonably well-binned and speed-locked chiplets crammed together in a nearly unreachable OEM variant:
TR Pro 3995WX

evanh · 2020-09-11 04:45

Ha! Anand has just reviewed exactly the sort of setup I envisioned being possible. The power budget is blown completely but there are good results for it - https://www.anandtech.com/show/16070/a-rendering-powerhouse-the-armari-magnetar-x64t-workstation-with-4-ghz-allcore-threadripper-3990x/3

evanh · 2020-09-11 04:56

That PRO 3995WX box must have extra powerful cooling too. The PassMark compared to the average for the 3990X is about 11% higher.

xoroshironot · 2020-09-12 00:38

evanh wrote: »

The PassMark compared to the average for the 3990X is about 11% higher.

The 3995WX has 8 memory channels (compared to 4 on the 3990X), which PassMark favors in several of the sub-benchmarks that make up the final score. David at PassMark illustrated that well when I and others were discussing with him the excessive bias against AMD CPUs (caused by a nasty Microsoft triggered performance issue with the specific PRNG they were using at that time, which AMD helped to get sorted out).

Bandwidth was a big concern of mine (considering the price) even when looking at the 3970X, where forcing 32 cores/64 threads down 4 channels would have a noticeable impact on many of my workloads. Many people (perhaps with a lesser need for more cores) have seemingly realized that, and went with the 3960X instead, which fares a bit better with per-chiplet available bandwidth. The 3995WX solves the issues mostly, and opens the door wider for future RAM upgrades, but total capacity was less of a concern of mine (as, right now, I can't foresee ever needing more than 256GB, with 128GB being a reasonable start for such a chip).

Hopefully the next-gen TR has 8-channel at entry level... those cores will be hungry and will need to be fed properly to move forward (and DDR5 is a little ways off).

evanh · 2020-09-12 02:01

Agreed, the 4-channel cap is an outright nerf that AMD should not have done, even for the first gen Threadrippers.

PS: I didn't know about the PRO edition having 8-channels. Would be nice if that is a sign that AMD will allow all future board makers to have 8-channels.

xoroshironot · 2020-09-14 02:12

evanh wrote: »

Ha! Anand has just reviewed exactly the sort of setup I envisioned being possible.

I had seen that article before you mentioned it, but just finally had a chance to sift through it.
What I can speculate when comparing it to the 3995WX system is interesting... if the 3995WX could be similarly overclocked, then it would steal some of the thunder from the next gen TR, but not doing so still leaves potential performance on the table for AMD to release such a beast in a pinch, only if necessary, to compete with any 'magical' offering from Intel before next-gen TR is ready (from both a technology and a bean-counting standpoint).

If I were AMD, I might do something dumb and just hold off on next-gen TR until DDR5 support is baked in... but it is probably too late to change much at this point.

On other matters, I am just about finished with the stand-alone version of XOROACC32GP statistical testing... I had to make a last minute change to the ABC constants. They are [10,5,13] now, which might take me several paragraphs to fully try and explain (some of which being speculation, but basically the C constant is better being both odd and slightly farther from the half-way point of 8).

Even if Parallax has no use for a from-scratch version, it should still serve as a near-bullet-proof example for my class of 'double output per engine iteration scramblers'.

evanh · 2020-09-14 04:32

xoroshironot wrote: »

If I were AMD, I might do something dumb and just hold off on next-gen TR until DDR5 support is baked in... but it is probably too late to change much at this point.

I think the idea is that TR can be experimented with - Have the consumer level churn of the Ryzen parts. DDR5 needs a new socket for sure though. So it has always been Zen4 parts for DDR5.

EDIT: On the other hand, I guess it would be possible to mix'n'match future I/O dies to provide DDR5 to older chiplets. I don't see that happening unless there is a big gap in Zen design releases.

rogloh · 2020-09-15 05:49

Any way to seed the P2 PRNG in SPIN2 with a known value so some random sequence can be repeated? Was hoping to use this GETRND SPIN2 method for randomizing data for a memory test and then reading back to compare. If not possible I guess I can retain a local buffer and test with smaller sequences that fit.

Update: Ok it looks like I need to do it via a HUBSET.

Update 2: Ok it looks like there is too much variation in cycle timing during Fastspin HUB exec mode to get a consistent random sequence after seeding with a HUBSET($80000000), given this PRNG updates per clock. So it would probably only be useful for PASM2-based usage with exact timing when calling getrnd etc.

evanh · 2020-09-15 06:40

Yeah, not the free running generator. That's difficult to do reruns with.

The one you want to use is the XORO32 instruction. It takes a register operand as its seed (state store). So you can keep reloading it with whatever seed you like and iterate it one step at a time to get exactly repeatable sequences.

EDIT: The way the XORO32 instruction works is that it takes in the stored state (or new seed) from the D operand and writes the iterated state back to D, with the output random number fed forward into the following instruction's S operand. It achieves this via the hidden Q register, which is a little different to how ALTS works: ALTS modifies the bit-fields of the following instruction.

rogloh · 2020-09-15 07:32

Thanks evanh. The XORO32 based random numbers don't seem to be accessible from SPIN2, so I'll need to hack some inline PASM2 for that. With any luck there is a way to do that so it works with both FastSpin and PNut SPIN2...

Update: actually it looks like we can use a ?? operator to access XORO32 from SPIN2.

evanh · 2020-09-15 08:04

I used it like that for the bit error measuring with the HyperRAM burst testing. When doing the verify pass I could compare a live XORO32 output against the transferred block. No need to keep a second copy of the block. Just had to remember the seed - which progressed for each new block tested.
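
The idea can be sketched off-chip as follows. This is only an illustration: Python's `random.Random` stands in for XORO32, a plain list stands in for the HyperRAM, and the helper names `fill_block` and `verify_block` are made up for the example.

```python
import random


def fill_block(memory, start, count, seed):
    """Write `count` pseudo-random 32-bit words, generated from `seed`."""
    rng = random.Random(seed)          # stand-in for a seeded XORO32 state
    for i in range(count):
        memory[start + i] = rng.getrandbits(32)


def verify_block(memory, start, count, seed):
    """Regenerate the same sequence from `seed` and count differing bits."""
    rng = random.Random(seed)
    bit_errors = 0
    for i in range(count):
        expected = rng.getrandbits(32)
        bit_errors += bin(memory[start + i] ^ expected).count("1")
    return bit_errors


memory = [0] * 1024                    # simulated RAM block
fill_block(memory, 0, 1024, seed=1)
clean = verify_block(memory, 0, 1024, seed=1)

memory[100] ^= 0x00000005              # inject two bit errors
dirty = verify_block(memory, 0, 1024, seed=1)
print(clean, dirty)                    # 0 2
```

Only the seed has to be remembered between the write pass and the verify pass, so no second copy of the block is needed.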

evanh · 2020-09-16 05:55

rogloh wrote: »

Update: actually it looks like we can use a ?? operator to access XORO32 from SPIN2.

Did you work out the syntax?

rogloh · 2020-09-16 20:26

evanh wrote: »

Did you work out the syntax?

Yep. I got it to generate pseudo random sequences from a seed with the ?? operator in SPIN2. I can use GETRND for the seed too, to get a new group each time I test with the next batch of numbers. So I can keep writing a memory test demo in SPIN2.

evanh · 2020-09-17 05:05

That was me wanting to see the code in question quoted.

The Spin2 manual isn't clear how the two parts - state store and generator output - are syntactically arranged.

evanh · 2020-09-17 07:40

Okay, I've found the debug() print formatting extensions in the manual and managed to bootstrap my learning of Spin2 ... so far the only syntax I've found for ?? is as a prefixing operator. This results in the prefixed variable being used for state store. The XORO32 generator's random output is lost.

Tony posted the output sequence when the state is seeded with 1 - https://forums.parallax.com/discussion/comment/1448906/#Comment_1448906

Here's a dump with output and state changes side by side:

First 20 iterations with seed of 1
output    state
62690201  84908405
12a2ae16  dfda9401
d7194ae8  51bcedd3
984b0c52  13cb03c2
743c1df1  45d4fd1b
bcc6dba0  bf81765e
746c34c9  4d0fbe2d
07ff3643  aead221f
c642bbc0  0cd29b35
594eaa85  f918baf2
701ad05a  ef0352a5
f3aea328  10e75522
695a67ee  5a8bceee
93c6c140  255da64a
4964f5e1  cbac5bd5
fc575e24  8b49f8a6
a638d7ad  51b38c77
181ae233  f61e2284
7be766b5  62a3eba9
5cbd8445  6efcaedc

evanh · 2020-09-17 08:13

Ah! Solved! After reading my own writings above I realised that prefixing means it can be combined with any other operator, eg: assigned to second variable.

num := ??state

That has the odd effect of assigning the random output from XORO32 to num, instead of the value from the state variable.
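
For checking sequences away from the hardware, the behaviour can be modelled in Python as below. This is a sketch under assumptions not stated in this thread: two iterations of xoroshiro32++ with constants [13,5,10,9], with s0 held in the low 16 bits of the state and the first iteration's result in the low 16 bits of the output. With those assumptions it reproduces the first row of the seed-1 dump above; later rows were not checked here.

```python
MASK16 = 0xFFFF


def rotl16(x: int, k: int) -> int:
    """Rotate a 16-bit value left by k bits."""
    return ((x << k) | (x >> (16 - k))) & MASK16


def xoroshiro32pp_step(s0: int, s1: int):
    """One xoroshiro32++ iteration; returns (result, new_s0, new_s1)."""
    result = (rotl16((s0 + s1) & MASK16, 9) + s0) & MASK16
    s1 ^= s0
    s0 = rotl16(s0, 13) ^ s1 ^ ((s1 << 5) & MASK16)
    s1 = rotl16(s1, 10)
    return result, s0, s1


def xoro32(state: int):
    """Model of `num := ??state`; returns (num, new_state)."""
    s0, s1 = state & MASK16, (state >> 16) & MASK16
    lo, s0, s1 = xoroshiro32pp_step(s0, s1)   # first iteration -> low half
    hi, s0, s1 = xoroshiro32pp_step(s0, s1)   # second iteration -> high half
    return (hi << 16) | lo, (s1 << 16) | s0


num, state = xoro32(1)
print(f"{num:08x}  {state:08x}")   # 62690201  84908405
```

Each call consumes the state and returns both the random output and the iterated state, which is the pairing the Spin2 expression hides behind a single operator.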

rogloh · 2020-09-17 09:22

Yep the ?? was how I did it. I'm about to post a binary and some source SPIN2 code for a HyperRAM memory test that shows this.

evanh · 2020-09-18 04:44

The syntax around using ?? is what took the effort.

xoroshironot · 2020-10-27 01:12

Evan/Tony,

Just an update (and I hope you are both well):

My main dual-CPU server motherboard died a few weeks ago, apparently due to one of the CPUs starting to fail and repetitively glitching the VRs. Since I couldn't wait for the new AMD 5950X CPUs (check the PassMark results) to become available, I went ahead and replaced the MB, and upgraded both the CPUs to Xeon E5-2697 V2 in the process. It cost about $400 US total for the MB and CPUs... that would have been almost $1000 about a year ago, and about $4000 a year and a half ago.

Since I am able to complete about 230 BigCrush runs per day now, I was able to finish completing and evaluating my modified xoroacc64gp results using cloud density analysis. It looks good after about 650 x 3 runs (forward, bit reverse, and byte reverse, each 1.3TB). The forward and reverse PractRand results also look good out to at least 64TB. Hamming Weight Dependency is good to at least 1PB and gjrand good to at least 10TB. The icing-on-the-cake is that even though the output is 2x32 bits (using only uint32_t variables), it is faster than many 64-bit native output PRNGs, and has their 64-bit equidistribution (though a tiny fraction of 64-bit (double 32-bit) outputs occur 1 less time than the rest over the entire 2^97 period). Of course the 32-bit (half-output) is fully 1-dimensionally equidistributed over the entire period.

The effort to complete testing on xoroacc128gp has been and will be extreme, but oddly, the faster fp/floating point version is basically done (since binary matrix rank, linear complexity, and Hamming issues with LSBs are of less concern). I really would like to publish the complete set of 8/10/16/32/64-bit general purpose PRNGs (with only a floating point version for 64-bit) by the end of the year, but I don't think I'm going to make it... I can barely remember what real sleep is like, but recall it is good, so would like to get back to doing it.

Cheers,

Chris
