Random/LFSR on P2

evanh · 2021-01-13 03:20

xoroshironot wrote: »

...
Edit 2: The switch back to WSL 1 was easy (through PowerShell): 'wsl --set-version Ubuntu-20.04 1'... then re-launch my bash scripts, and done. Before the switch-back, all of my workloads had been quickly installed and recompiled 'native' under WSL 2 specifically for my (pair of) Xeon E5-2697 V2 processors (which I failed to do when I upgraded the CPUs), so it might be slightly faster than before.

That's a relief for me too. I didn't know what to say.

So the WSL 1 you're using now is both invoked and installed differently than before?

xoroshironot · 2021-01-13 04:12

As far as I can tell, the installation of WSL 2 left WSL 1 intact. I could have simply switched my original Ubuntu from WSL 1 to 2 without re-installing Ubuntu, toolchain, apps, etc., but I didn't realize it at the time (i.e. trial by fire). Alternatively, I could have pinned one version of Ubuntu to WSL 1 and another to WSL 2 (which is what I will do on my laptop), but I would have to manage which is the default for launching bat files that call bash from Windows. Opening bash directly for either Ubuntu via shortcut is easy, though. However, I think that since switching back-and-forth the WSL version is also very easy, I'll keep just the one Ubuntu for now on that server for simplicity. I'll play with some of this on my small workstation (hopefully without getting too stressed).

xoroshironot · 2021-02-10 02:45

Evan, I'm picking up an AMD 3700X CPU/5700 XT GPU system tomorrow cheap (guy needs money for a new Xbox)... it should perform at about 70% of my 24C/48T server fully loaded, I guess.

evanh · 2021-02-10 11:40

Really? 33% of the cores. What about power consumption? You're just gonna run them all anyway aren't you.

xoroshironot · 2021-02-11 03:37

I have it up and running (water cooler pre-installed), but it came with Windows 10 Home, so I will have to figure out how to get WSL working on it. WSL is supported on W10 Home.
We will just have to see how it actually performs fully loaded... I was basing my performance assessment on the PassMark Cross-Platform rating of 43000 vs. 60000, but it might not be correct for my workloads on this CPU due to only 2 memory channels, etc.
PassMark baseline: Here
Edit: Win 10 Home does not support RDP. I installed a shim obtained from GitHub, and can remote out, but not remote in from another PC, as yet.
Edit2: Looking for a spare Win 8 Pro license that I never used, which should be sufficient to bump Win 10 Home to Pro.

xoroshironot · 2021-02-13 17:38

@evanh said:
Really? 33% of the cores. What about power consumption? You're just gonna run them all anyway aren't you.

The preliminary performance figure for all-cores-loaded BigCrush on my AMD 3700X is 58% (after letting threads normalize) of my dual Intel E5-2697 V2s.
That is 16 threads vs. 48 threads, using DDR4 3000 RAM (vs 1866 on Intel), and the exact same Intel native executable.
On a new AMD native compile and/or with DDR4 3600 and/or with a 3800 XT I might expect about 60% performance.
It would take a 5800X to get the 70% performance figure I had guessed for the 3700X based on the PassMark cross-platform values.
Therefore a 5900X should easily match the dual 2697 V2s, so 2x the performance per core/thread under full load.
Knowing that will make it easier to calculate the performance of the next-gen Threadripper, hopefully out by late this year.

I'm not worried about power consumption right now, but not much heat coming out of that case, unlike the dual Xeons which turn the PC into a space heater.

Edit: The published PassMark dual-CPU cross-platform results are significantly lower than twice the single-CPU result. A simple equation that seems to better predict observed performance for all-cores-loaded BigCrush when comparing these two types of CPUs: '3700X cross-platform / (single-2697V2 cross-platform * 2 CPUs)', so 42854 / (35885 * 2) = 0.597.
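The arithmetic in the edit above can be checked with a few lines of Python. This is only a sketch of the calculation as described in the post: the scores are the PassMark cross-platform figures quoted in this thread, and the function and variable names are made up for illustration.

```python
# PassMark cross-platform scores as quoted in the thread.
RYZEN_3700X = 42854          # single 3700X
XEON_2697V2_SINGLE = 35885   # one E5-2697 V2
# Rounded figures used for the first guess (3700X vs. published dual-Xeon result).
ROUGH_3700X = 43000
ROUGH_DUAL_XEON_PUBLISHED = 60000


def ratio_vs_published_dual(single_score, published_dual_score):
    """First guess: compare against the published dual-CPU score."""
    return single_score / published_dual_score


def ratio_vs_doubled_single(single_score, other_single_score, cpu_count=2):
    """Adjusted guess: scale the single-CPU score by the number of CPUs."""
    return single_score / (other_single_score * cpu_count)


first_guess = ratio_vs_published_dual(ROUGH_3700X, ROUGH_DUAL_XEON_PUBLISHED)
adjusted = ratio_vs_doubled_single(RYZEN_3700X, XEON_2697V2_SINGLE)

print(f"vs. published dual score: {first_guess:.3f}")
print(f"vs. 2 x single score:     {adjusted:.3f}")
```

The second figure is the 0.597 given in the edit; the first is where the original guess of roughly 70% came from.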

evanh · 2021-03-08 06:04

Oh wow, the pricey Threadripper Pros have shown up locally. I wasn't really expecting to ever see one listed as a part. And I can even buy a motherboard for it too: https://www.pbtech.co.nz/product/CPUAMD03995WX/AMD-Ryzen-Threadripper-Pro-3995WX-64-Cores--128-Th https://www.pbtech.co.nz/product/MBDASU92011/ASUS-Pro-WS-WRX80E-SAGE-SE-WIFI

xoroshironot · 2021-03-09 19:10

They should have had the TR Pro on track sooner, as I suspect the pending Zen 3 EPYC release next Monday may turn some heads.
The expected IPC improvement is fairly well understood moving from Zen 2 to Zen 3, but the base/boost clocks also might see a significant increase, as well as other improvements, which early reports suggest are up to 40% better than Zen 2 EPYC under some workloads.
If that is true, AMD would have to drive the price way up to avoid some potential competition with TR Pro.

Wuerfel_21 · 2021-03-09 20:29

@xoroshironot said:
up to 40% better than Zen 2 EPYC under some workloads.

Zen3 actually implements PDEP/PEXT BMI2 instructions in hardware (vs up to some hundred cycles of microcode), so if "some workloads" use those, it's an easy win.
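For anyone unfamiliar with the two instructions, here is a rough software model of what PEXT and PDEP compute. It is only an illustration of the bit-level behaviour, written in Python; the function names are hypothetical and this is nothing like how the hardware (or the microcode) actually does it.

```python
def pext(src, mask):
    """Gather the bits of src selected by mask into the low bits of the result."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask          # isolate the lowest set bit of the mask
        if src & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1               # clear that mask bit and continue
    return result


def pdep(src, mask):
    """Scatter the low bits of src to the bit positions selected by mask."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if src & (1 << in_bit):
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result


# Pull one byte out of a word, then put it back in the same place.
field = pext(0x12345678, 0x0000FF00)
print(hex(field), hex(pdep(field, 0x0000FF00)))
```

The loop runs once per set bit in the mask, which gives a feel for why a microcoded version can cost so many cycles compared with a single-instruction hardware implementation.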

evanh · 2021-03-09 22:15

@Wuerfel_21 said:

@xoroshironot said:
up to 40% better than Zen 2 EPYC under some workloads.

Zen3 actually implements PDEP/PEXT BMI2 instructions in hardware (vs up to some hundred cycles of microcode), so if "some workloads" use those, it's an easy win.

Ah, looking that up, I see that's part of AVX2. It's notable that all Zen processors are listed on paper as supporting AVX2 but Steam consistently excludes them from the AVX2 supported list.

xoroshironot · 2021-03-10 14:34

@Wuerfel_21 said:

@xoroshironot said:
up to 40% better than Zen 2 EPYC under some workloads.

Zen3 actually implements PDEP/PEXT BMI2 instructions in hardware (vs up to some hundred cycles of microcode), so if "some workloads" use those, it's an easy win.

I need to look into that more deeply, as I am planning to write a statistical analyzer for random numbers that will make extensive use of bit manipulation.
It is based on an old drinking game where two people each pick an integer from 1 to infinity, and the larger value buys the round, unless it is only 1 larger, in which case the smaller value buys the next two rounds; ties are discarded. A full analysis of the game shows that picking randomly from the integers 1-5 with the following frequency distribution is provably (1) the best strategy: 2 and 4 picked 5/16 each, 3 picked 4/16, and 1 and 5 picked 1/16 each. There is a trivial way to use 4 random bits per person to create these ratios. My goal was to put this up against a Hamming weight dependency distribution analysis to see how it compares.

1. From M. Gardner's 'Time Travel and Other Mathematical Bewilderments' (pg. 112): "For a proof of the strategy see "A Psychological Game," by N. S. Mendelsohn (American Mathematical Monthly 53, February 1946, pp. 86-88) and pages 212-215 of I. N. Herstein and I. Kaplansky's Matters Mathematical (Harper & Row, 1974)".
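The claimed strategy can be sanity-checked numerically. The sketch below is written in Python and assumes the payoff is counted in rounds (+1 or -1 normally, +2 or -2 when the picks differ by exactly 1); it computes the expected result of the 1/5/4/5/1 mix against each fixed pick. The 16-entry lookup table is just one possible way to get the ratios from 4 random bits and is an assumption, not necessarily the mapping meant in the post.

```python
from fractions import Fraction

# Mixed strategy from the post: pick -> probability.
STRATEGY = {
    1: Fraction(1, 16),
    2: Fraction(5, 16),
    3: Fraction(4, 16),
    4: Fraction(5, 16),
    5: Fraction(1, 16),
}


def payoff(mine, theirs):
    """Rounds gained (+) or owed (-) by the player who picked `mine`."""
    if mine == theirs:
        return 0                        # ties are discarded
    if abs(mine - theirs) == 1:
        # Exactly 1 apart: the smaller value buys the next two rounds.
        return -2 if mine < theirs else 2
    # Otherwise the larger value buys the round.
    return 1 if mine < theirs else -1


def expected_payoff(opponent_pick):
    """Expected rounds gained by the mixed strategy against a fixed pick."""
    return sum(p * payoff(pick, opponent_pick) for pick, p in STRATEGY.items())


# Hypothetical 4-bit mapping: 16 table slots in the ratio 1:5:4:5:1.
PICK_TABLE = (1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5)


def pick_from_nibble(nibble):
    """Turn 4 random bits into a pick from 1-5 with the required frequencies."""
    return PICK_TABLE[nibble & 0xF]


for opponent in range(1, 9):
    print(opponent, expected_payoff(opponent))
```

Against any pick from 1 to 5 the expectation comes out to exactly zero, and against anything larger the mixed strategy comes out ahead, which is what you would expect from an optimal strategy in a symmetric game.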

evanh · 2021-04-02 11:49

Crazy, that $9000 TR Pro I linked above is now listed as restocking and is the most popular of the Threadrippers sold there. Three other models in stock. It'd have to be someone like Weta Digital as a guess.

xoroshironot · 2021-04-05 21:44

@evanh said:
Crazy, that $9000 TR Pro I linked above is now listed as restocking and is the most popular of the Threadrippers sold there. Three other models in stock. It'd have to be someone like Weta Digital as a guess.

That is crazy... the U.S. price for a 3995WX at Newegg is $5,488.99, and they are in stock. The way they are handing out money here, I could just about buy one, but then regret it 6-9 months down the road.

xoroshironot · 2021-04-29 14:19

I was running randomness tests on xoroacc32gp (my stand-alone xoroshiro32++/XORO32 variant) and found a specific weakness in byte-reversed output when testing with gjrand, which fails auto-correlation above 8GB of output, but fine up to 8GB.

On a hunch (based on output function similarities), I tested byte-reversed XORO32 with gjrand and found that it fails the same test above 64MB of output, and nearly fails at 64MB (with a p-value of about 1e-11).

To be clear, this is just one statistical test, in one statistical test suite, run on one specific variation of the output.
Therefore, I am undeterred from releasing xoroacc gp in time, as long as no other big surprises occur... so far the indications are that xoroacc64gp and xoroacc128gp scale as expected.

Edit: I tested Bifurcated Modified XOROACC with gjrand, and it is fine on byte-reversed auto-correlation up to at least 16GB, but it emits only one 16-bit word per xoroshiro++ engine iteration, thus it is slower and more computationally expensive.
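For reference, "byte-reversed" is taken here to mean reversing the order of the four bytes within each 32-bit output word before it is fed to the test suite. That reading is an assumption, and the helper name below is made up; a minimal Python sketch:

```python
def byte_reverse32(word):
    """Reverse the byte order of a 32-bit word (0xAABBCCDD -> 0xDDCCBBAA)."""
    return int.from_bytes((word & 0xFFFFFFFF).to_bytes(4, "little"), "big")


def byte_reversed_stream(words):
    """Apply the reversal to every 32-bit word of a generator's output."""
    for word in words:
        yield byte_reverse32(word)


print(hex(byte_reverse32(0x12345678)))
```

Reversing the bytes moves the least significant byte into the position a test suite would otherwise see as most significant, which is why it can expose weaknesses that the forward output hides.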

evanh · 2021-05-03 02:44

@xoroshironot said:
ASRock, so far: Here
Not sure if 300 chipset support of Zen3 will spread to other OEMs.

AMD aren't in the mood for it - https://hothardware.com/news/amd-preventing-ryzen-5000-cpu-on-x370
I'm happy with what I've got and the prices need to come down anyway.

xoroshironot · 2021-05-03 12:56

The AIO water cooler on my new system is small, but should handle a 5950X (or refresh) ok once prices come down.
That raises the question of what to do with the 3700X I remove. Maybe just a cheap 400 chipset, or perhaps a 550.
It is still somewhat academic at that point, since low-cost reasonably performing GPUs no longer grow on trees.

xoroshironot · 2021-06-01 03:31

@xoroshironot said:
... should handle a 5950X (or refresh)...

Refresh indeed, this is getting insane. See here.

evanh · 2021-06-01 05:32

Huh, that seems to imply the SRAM is stacked fully over top of the cores ... in multiple layered dies! I gather then that the SRAM will conduct the heat through to the heat spreader without issue. Will be interesting to see if that impacts max boost and/or all-cores clock rate.

What's cool is, functionally, L3 is very suited to this sort of separation. Kind of finally gives it a real place in CPU design. And frees up space for larger L2 caches, which will be important to offset the resulting longer L3 latency.

evanh · 2021-06-01 14:57

It also means a 100% TSMC made product. Err, maybe not, it's still per chiplet basis.

Will be premium priced for some time. Existing products stay.

xoroshironot · 2021-06-01 21:33

Since these (5000 XT?) will ship by end of this year, the implication is that the 6000 series that should be shipping next year should not be threatened by this new announcement. Therefore, I would expect yet another 15% improvement from the 6000 series w/DDR5 support.

evanh · 2021-06-01 21:46

It'll certainly be interesting to see if cache stacking rolls out for all products at Zen4. It'll probably benefit APU performance even better than the chiplets.

xoroshironot · 2021-06-02 00:55

@evanh said:
It'll probably benefit APU performance even better than the chiplets.

Right, I hadn't considered that. It might boil down to whether AMD plans to commit to APUs across the majority of Desktop SKUs like Intel has.
On the other hand, who knows how much non-stacked cache will fit on 5nm...
For all we know they will start stacking GPU or cores instead, though I've yet to see a convincing methodology for heat dissipation in that context, but Intel has discussed it, as I recall.

evanh · 2021-06-02 07:15

I was thinking just laptops and low-end office boxes. Where power efficiency or cheaper solutions are desired.

The chiplet approach suits desktops. I can see the GPU becoming another chiplet in the CPU package for mid-range desktops. Offering decent GPU performance but without the extra price of discrete GPU, positioned above the IGP APU range.

xoroshironot · 2021-06-02 20:32

@evanh said:
Offering decent GPU performance but without the extra price of discrete GPU, positioned above the IGP APU range.

Indeed... Nvidia might have something to worry about. Similarly, I wonder what kind of pressure AMD vs. AMD GPU partners are under due to this possibility. At 5nm, a 100W power envelope is more than enough to fit everything in one APU package that is good enough for all but the high end user. It sounds like a shake-up is coming.

evanh · 2021-06-02 21:20

Huh, I've just noted there is a general presumption among the press pundits that the Zen3+ with stacked cache will retain the existing 32 MB of L3 cache on the base die. That seems a mad idea! It defeats the space-freeing advantage, and there's no way to just tack on extra without re-laying the circuits on the base chiplet. And at the very least, even if nothing else is added (like larger L2), there will be substantial base die space needed for the interconnect and its drive circuits. I'm confident the L3 will all be in the stacked dies.

The alternative is a notable increase in the chiplet die size.

xoroshironot · 2021-06-02 21:30

The 32M, as described, will remain on the base chiplet for Zen 3, so 32M+64M=96M per chiplet stack.
I was guessing the vias were already present in the base chiplet design, as they have stated there will be no latency increase in accessing the additional 64M.
I will have to see it to believe it, as it could have profound implications for some of the code (you and) I run.
I get what you are saying, but that would require substantial re-layout of the die, which would likely only occur with Zen 4.

evanh · 2021-06-02 21:55

@xoroshironot said:
I was guessing the vias were already present in the base chiplet design ...

Ah, I guess that's a possibility. Would explain a lot. Also means it has been in the plan for a while. It's either that or a new layout - which does happen.

TonyB_ · 2021-06-03 09:40

Any news on further PRNG tests applicable to P2+ or P3?

I'll need a reminder of our earlier progress with interleaved streams as I've forgotten which algorithm is best and even what we called it.

xoroshironot · 2021-06-03 19:09

@TonyB_ said:
Any news on further PRNG tests applicable to P2+ or P3?

I'll need a reminder of our earlier progress with interleaved streams as I've forgotten which algorithm is best and even what we called it.

The best overall (considering simplicity and randomness) code I found (which you are free to use):
'Bifurcated Modified XOROACC' (xor_plusplus1)

result_out[15:0]  := prn[15:0]  ^ result_in[31:16]
result_out[31:16] := prn[31:16] + result_out[15:0] + 1

It should pass all statistical randomness tests to (at least) 16GB (confirmed by BigCrush, PractRand, gjrand and freq tests, but of course requires independent verification).
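A rough software model of the two lines above, in Python, for anyone who wants to experiment. It assumes that prn is the raw 32-bit value from the generator, that result_in is the previous result_out, and that the addition wraps at 16 bits; none of that is spelled out above, and the function name is made up.

```python
MASK16 = 0xFFFF


def xoroacc_step(prn, result_in):
    """One 'xor_plusplus1' output step on 32-bit prn and 32-bit previous result."""
    # result_out[15:0] := prn[15:0] ^ result_in[31:16]
    low = (prn & MASK16) ^ ((result_in >> 16) & MASK16)
    # result_out[31:16] := prn[31:16] + result_out[15:0] + 1  (assumed mod 2^16)
    high = (((prn >> 16) & MASK16) + low + 1) & MASK16
    return (high << 16) | low


def xoroacc_stream(prn_values, result=0):
    """Feed each result back in as result_in for the next step (assumed)."""
    for prn in prn_values:
        result = xoroacc_step(prn, result)
        yield result


# The prn inputs here are arbitrary placeholders, not real generator output.
for value in xoroacc_stream([0x12345678, 0x9ABCDEF0, 0x0F1E2D3C]):
    print(f"{value:08X}")
```

The generator producing prn is deliberately left out, so any 32-bit source can be plugged in for testing.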

Additionally, I would simply call it XOROACC32 for hardware implementation, as that code is consistent with parts of my standalone version, which I call xoroacc_gp (and there is also a xoroacc_fp version).
Xoroacc32gp is only good to (at least) 8GB, but requires just one xoroshiro engine iteration to generate 32 bits of output (and should be the fastest general purpose PRNG devised, at least while using C on generic x86/x64).
Let me know if you are interested in a preliminary peek at the tiny Xoroacc16gp (that you could implement in QB) while I am working on documentation (which I am in no hurry to complete).

evanh · 2021-06-12 09:58

Interesting, DDR5 DIMMs have a couple of big changes:
- 2 x 40 bit wide instead of 1 x 72 bit. So dual-channel and doubling the ECC redundancy.
- Control pins are also reduced and divided into two duplicate groups to match the 2 x 40 databus.
- Power input goes from 26 pins to just 3 pins, but instead of being at VDD voltage they're now at 12 Volts with on-module regulation.
- VSS/GND is now 127 pins, up from 94 pins. So, closing in on half the pins being ground.
