Random/LFSR on P2


Comments

  • xoroshironot Posts: 266
    edited 2020-06-07 - 18:06:44
    The new XT variants should be out soon, but not sure if 3900XT/3950XT will be among them.
    Once I finish building a deck for my wife, if I have money left over I might go TR 3970X.
    My current server with more RAM (and twice the cores) slightly beats that 3800X in intensive workloads, so 3950X seems like a minimum for me to move forward.
    It will all be 'old' in a year and a half or two when Zen 4 comes out with DDR5 support (and PCI-e 5 / USB4).
  • Evan, I am not very accomplished at C programming, and have a related question you might know the answer to (or where to find it):
    1. You have a pair of bifurcated functions A and B, with a function pointer to A.
    2. Function A is called by pointer, and within function A it modifies the pointer to point to function B.
    3. Function B is then called by the pointer, and within function B it reverts the pointer to A.
    4. The process repeats, ad-infinitum.

    I want to play with an example to see how performance compares to standard conditional logic on various CPUs.

    Though only rarely an issue, I have found that in some gcc native (but not x86-64) compiled code, simple conditional logic can occasionally wreak havoc with branch prediction unless care is taken. On Intel, at least, I've seen cases where seemingly every branch was mispredicted, causing a 75% performance loss.
  • evanh Posts: 9,982
    edited 2020-06-08 - 06:12:44
    A simple cooperative task switcher. Yeah, it would be an interesting experiment to know how effective branch prediction is with pointers in general. C++ must be chocker with them behind the facade.

    So you want an example of doing that in C, I gather?

  • Yes, that would be great.

    I put some small effort into declaring function templates and pointers, but couldn't get the syntax correct within the functions.

    Technically, using an 'if...else' statement is wrong for what I want to do in this case, since it is always known that the choice will be opposite of last time. Pointer manipulation was the only logic I could think of to avoid the asm conditional jump... hoping it performs well.
  • I thought it would be helpful to list expected randomness from most-random to least-random of variants:
    1. XOROACC xoroshiro32pp+1, revbits(xoroshiro32pp)
    2. XOROACC xoroshiro32pp+1, rotl(xoroshiro32pp,1)
    3. Alternate (from above) rotl(state[0] * 5, D) + 1, rotl(state[0] * 5, D)
    4. XORO32 (based on ++)
    Tony and Evan, just a quick status update on my work.
    I plan on reaching out to Seba in the near future, as I have discovered a level above #1 in the list above which is simpler (e.g. no revbits), faster (up to ~35%), more random, and has potential crypto applications.
    In terms of XORO32 (and previous XOROACC's discussed), this new XOROACC variant I have discovered fails PractRand at 256GB forward bits and 2TB reverse bits (in the general purpose version, at 49 bits state).

    Chris, is your new algorithm applicable to the P2?
  • #include <stdio.h>
    #include <unistd.h>
    
    
    void  task1( void );
    void  task2( void );
    
    
    void  (*task)( void ) = task1;
    
    
    
    void  task1( void )
    {
    	printf("task1\n");
    	task = task2;
    }
    
    
    void  task2( void )
    {
    	printf("task2\n");
    	task = task1;
    }
    
    
    
    int  main( void )
    {
    	printf( "Testing of branch prediction with reassigning function pointers\n\n" );
    
    	while( 1 )
    	{
    		sleep( 1 );
    		task();
    	}
    
    	return 0;
    }
    
    
  • Thanks, Evan... I'll let you know the result.
    TonyB_ wrote: »
    Chris, is your new algorithm applicable to the P2?
    Yes, but it adds some extra complexity that is negated when used as a stand-alone PRNG, rather than as an extension to XORO32, but it would work there, as well.
    The issue with the need for a conditional disappears when double-iterating, since the output is twice the word size (which is what XORO32 uses).
    I'm still in the midst of testing larger word size versions for general purpose and floating point and will have more to share in several weeks.
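For anyone following along without the earlier pages, here is a minimal sketch of the base generator and of what "double-iterating" means. It shows plain xoroshiro32++ (16-bit words) and a wrapper that steps it twice to build one 32-bit result. The rotation constants [13,5,10,9] and the low-word-first packing are assumptions about how XORO32 is arranged, and the names `xoro32_state`, `xoroshiro32pp_next` and `xoro32_double` are made up for illustration. This is not the new XOROACC variant, which hasn't been specified in this thread.

```c
#include <stdint.h>

/* Assumed rotation constants for xoroshiro32++ (the [13,5,10,9] set). */
enum { XORO_A = 13, XORO_B = 5, XORO_C = 10, XORO_D = 9 };

/* Hypothetical state holder: two 16-bit words, must not both be zero. */
typedef struct { uint16_t s0, s1; } xoro32_state;

/* Rotate a 16-bit word left by k bits (k in 1..15). */
static inline uint16_t rotl16(uint16_t x, int k)
{
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* One iteration of xoroshiro32++: returns 16 bits and advances the state. */
uint16_t xoroshiro32pp_next(xoro32_state *st)
{
    uint16_t s0 = st->s0;
    uint16_t s1 = st->s1;

    /* "++" scrambler: rotl(s0 + s1, D) + s0 */
    uint16_t result = (uint16_t)(rotl16((uint16_t)(s0 + s1), XORO_D) + s0);

    /* xoroshiro state update */
    s1 ^= s0;
    st->s0 = (uint16_t)(rotl16(s0, XORO_A) ^ s1 ^ (uint16_t)(s1 << XORO_B));
    st->s1 = rotl16(s1, XORO_C);

    return result;
}

/* Double iteration: two 16-bit outputs packed into one 32-bit word.
 * Putting the first output in the low half is an assumption. */
uint32_t xoro32_double(xoro32_state *st)
{
    uint32_t lo = xoroshiro32pp_next(st);
    uint32_t hi = xoroshiro32pp_next(st);
    return (hi << 16) | lo;
}
```

Because each call to `xoro32_double` always consumes exactly two iterations, anything that alternates per iteration is back in the same phase at every call, which is why the conditional is not needed in that arrangement.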
  • xoroshironot Posts: 266
    edited 2020-06-08 - 23:11:16
    Evan, the code works perfectly (based on checksum result), but is slower than using conditional logic:
    Conditional:            xoroacc128fp-speed: INT= 10.71 GB/s  FP=  8.33 GB/s  AVG=  9.52 GB/s  10.02 s  CHK= 738CA3A1
    Pointers:              xoroacc128fpP-speed: INT=  5.06 GB/s  FP=  4.33 GB/s  AVG=  4.69 GB/s  10.02 s  CHK= 738CA3A1
    
    De-referencing and loss of in-lining are discussed here as sensible causes:
    does-function-pointer-make-the-program-slow
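A rough way to reproduce this kind of comparison is sketched below. It is only a toy: `work_a` and `work_b` are hypothetical stand-ins for the two halves of a generator step, not the xoroacc128fp code, and the timing helper uses `clock()`, which is coarse. It runs the same alternating sequence three ways (an `if`/`else` on a phase flag, a self-reassigning function pointer, and a loop unrolled by two that needs neither) and returns a checksum so the variants can be checked against each other.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for the two alternating halves of the work. */
static inline uint32_t work_a(uint32_t x) { return x * 5u + 1u; }
static inline uint32_t work_b(uint32_t x) { return ((x << 7) | (x >> 25)) ^ 0x9E3779B9u; }

/* Pointer variant: each step does its work, then repoints 'step' at the other. */
static uint32_t acc;
static void step_a(void);
static void step_b(void);
static void (*step)(void) = step_a;

static void step_a(void) { acc = work_a(acc); step = step_b; }
static void step_b(void) { acc = work_b(acc); step = step_a; }

/* Variant 1: ordinary conditional on a phase flag that flips every pass. */
uint32_t run_conditional(uint32_t n)
{
    uint32_t x = 0;
    int phase = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (phase) x = work_b(x); else x = work_a(x);
        phase ^= 1;
    }
    return x;
}

/* Variant 2: indirect call through the self-reassigning pointer. */
uint32_t run_pointer(uint32_t n)
{
    acc = 0;
    step = step_a;
    for (uint32_t i = 0; i < n; i++)
        step();
    return acc;
}

/* Variant 3: unrolled by two, so neither a branch nor a pointer is needed. */
uint32_t run_unrolled(uint32_t n)
{
    uint32_t x = 0;
    for (uint32_t i = 0; i + 1 < n; i += 2) {
        x = work_a(x);
        x = work_b(x);
    }
    if (n & 1u)
        x = work_a(x);   /* odd count: one trailing A step */
    return x;
}

/* Time one variant in CPU seconds and hand back its checksum. */
double time_variant(uint32_t (*fn)(uint32_t), uint32_t n, uint32_t *chk)
{
    clock_t t0 = clock();
    *chk = fn(n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_variant(run_pointer, n, &chk)` and so on from `main` with a large `n` gives a crude per-variant time. With optimisation on, the compiler can typically inline `work_a`/`work_b` in the conditional and unrolled loops but has to emit a real indirect call for `step()`, so the gap it shows is as much about lost in-lining as about branch prediction.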
  • Not too surprised. It'll be one of the reasons C++ bloats so easily.

  • Oh, static declares might make a difference. eg:
    #include <stdio.h>
    #include <unistd.h>
    
    
    static void  task1( void );
    static void  task2( void );
    
    
    static void  (*task)( void ) = task1;
    
    
    
    int  main( void )
    {
    	printf( "\nTesting of branch prediction with reassigning function pointers\n" );
    
    	while( 1 )
    	{
    		sleep( 1 );
    		task();
    	}
    
    	return 0;
    }
    
    
    static void  task1( void )
    {
    	printf( "task1\n" );
    	task = task2;
    }
    
    
    static void  task2( void )
    {
    	printf( "task2\n" );
    	task = task1;
    }
    
  • evanh Posts: 9,982
    edited 2020-06-09 - 08:21:26
    The new XT variants should be out soon, but not sure if 3900XT/3950XT will be among them.
    Looks like there is 3900XT in early leaks.

    You know how I said I managed to push the all-cores up to 4000 MHz (full boost of the 1800X) on my 1700X. Basically treat it like full boost is the base and anything less is thermal throttling (Down-clocking due to idling not counted). I've been in this config for the past two years now without issue. Played a few games too. Has felt like my most stable PC ever in fact.

    I presume the same can be done with the Zen2 parts. Would your son be interested in being my guinea pig?

    There were a couple of critical things that made it possible. One was the predictably high-spec cooler needed. The other was what I've noticed is now called "throttle-stop" or, more blandly, "under-volting". The premise is that the regulated core voltage is decided by the full boost frequency. Laptops apply further power savings on top of this, but that doesn't apply to desktops. Anyway, overriding the auto-calculated voltage allows a significant reduction in heating: manually set a core voltage lower than the BIOS would calculate, while setting the all-cores base multiplier to somewhere around the highest boost of the whole family. In the case of the Zen2 family I've previously guessed at 4800 MHz.

  • xoroshironot Posts: 266
    edited 2020-06-10 - 03:02:20
    I had already tried the Static declares, but it didn't help.
    I just checked the code on godbolt.org and see that every call results in an asm call instruction, so in-lining is not used.
    That being said... this code is likely faster than anything else when in-lining is also not used... which I'll check later. (Edit: Yes, significantly faster than xoroshiro128plus in 12 of 14 benchmarks, but the two it was slower in showed some horrible bottle-neck, which ideally I would explore further. However, since the pointer idea is a fail at ~50% performance, maybe not.)

    I need to run a power throttle check on my son's system before I set him up as a guinea pig:
    Ryzen Power Cheating by Motherboard Manufacturers

    Edit: I checked his 3800X / motherboard power calculation and it was reading 115%... in the conservative direction, which seems like an odd value. Asus TUF Gaming X570 Plus WiFi, BTW. Anyway, going through PassMark database, I don't see any evidence of more than all-core 4.3 or 4.4 GHz being possible, and those at the upper limit are both likely winners of the silicon lottery and water-cooled, as most of the maximum all-core results are at 4.2 GHz.

    My son is using PBO (Precision Boost Overclock) and Turbos up to 4.5 GHz, but not on all cores... based on that, and the above, what did you have in mind? Keep in mind that he runs it most of the day as a multi-party gaming server, so I cannot touch it till after 11:30PM (which is when the house WiFi gets shut-down).
  • evanh Posts: 9,982
    edited 2020-06-10 - 04:27:07
    Oh, many will notice if it dies then. ;)

    The power consumption I plan on won't be particularly high. Just will push the stock cooler too far I suspect. My motherboard is a boring Asus Prime B350 chipset. It doesn't have beefy regulators. The cooler I'm using is the Deepcool S40. It's quite an effective unit yet relatively cheap. I chose it because my case is narrow, the taller coolers don't fit.

    When the voltage is manually set the automatics are disabled. Same for the multiplier, once you manually set it the auto boost (or throttling as I prefer to think of it as) is disabled. Power limit is capped this way.

    The normal BIOS response to setting a high base multiplier is to ratchet up the regulator volts. This has an immediate knock-on effect of hugely increasing the power draw, which in turn blows past what the cooler can handle, or maybe what the regulators can deliver, and you've got yourself a space heater that keeps crashing when loaded up.

    So it's vital to make sure the core voltage is manually set first.

  • Here's my 1700X settings. Most are defaults and some are the automatic XMP settings from the DIMM. These show up like manually adjusted.

    The relevant ones I've adjusted are the CPU Core Ratio (40.00), and VDDCR CPU Offset, Mode and Voltage. The mode is an offset arrangement only, dunno why it's not just an absolute voltage. I've raised the voltage from default of 1.35 V to 1.3625 V. Just a minor buff, nothing like what happens if you leave it to the BIOS.

    I've also lowered the VDDCR SOC as well. This was done when I was testing my DRAM limits. Not important to touch this.

    The 3800X will have different, smaller, voltage values and might not even be named the same. Note down the default voltages before you start.
    [Two BIOS settings screenshots attached, 1632 x 1224 each]
  • Oh, don't forget to use top end thermal transfer paste. Well worth buying the best for this. Even the stock cooler would benefit from being remounted with some Arctic Silver 5 or similar.

  • I'll run this past him, but I doubt he will want to remove the Wraith Spire cooler to use my Arctic Silver (5?, not sure, if I can find it... got two cheap from Walmart, and gave one away). We would have used it originally, but he specifically said he didn't plan on overclocking at that time (other than basic PBO and DOCP).
  • xoroshironot Posts: 266
    edited 2020-06-10 - 22:42:14
    I spoke to him, but it is hard to catch his attention. I think he is good to go, short of removing the cooler.
    What are you looking to do exactly, as compared to this?:
    PBO Undervolt Analysis Ryzen 3800x
  • evanh Posts: 9,982
    edited 2020-06-11 - 00:24:25
    That doesn't give you all-cores. You have to set the multiplier manually to get all-cores. And doing so disables all the boosting automatics.

  • There's only the two things to adjust: CPU core voltage and CPU core ratio (the multiplier).

    The screenshots above show the sort of naming and my adjustments. It may not be the same naming for the 3800X, and it definitely won't be the same values. You'll have to take some photos of the BIOS options if you want more specifics from me.

  • evanh Posts: 9,982
    edited 2020-06-12 - 08:12:33
    Here's a reason to put the effort in, and this is running cool and quiet - https://browser.geekbench.com/v4/cpu/15558360

    PS: That's Kubuntu (KDE), with all its background activity as well.

  • xoroshironot Posts: 266
    edited 2020-06-13 - 02:39:19
    That is sweet. For Linux, I use both Ubuntu under WSL (Windows Subsystem for Linux) and a static USB bootable Knoppix. Edit: WSL is interesting (and WSL2 will be better, when I have time to vet it): for example, I can pipe the output of a windows exe into a gcc compiled linux executable, or run bash shell scripts from a bat file. Sloppy, but expedient, in some cases.

    My son is fighting a slowdown issue right now after anywhere from 1 to 5 days of running his 3800X as a game server.
    We just updated the BIOS and AMD chipset drivers to see the effect.
    I discussed under-clocking with him in terms of how hot his room gets while running the way he is... the air conditioner cannot keep up.
    However, I suspect something else is going on besides a warm room. For example his Oculus 'OVRServer_x64.exe' racks up more I/O than the server code, even when he isn't using the Rift... could be causing an issue over the long run, so he ended the process for now (as there is no tray icon), just in case. Edit: Uninstalling Oculus and installing new version, since he was having 'black screen' issues, also, which that should fix.

    On the subject of PRNGs... I am now in the same position you were in with candidate search, but it is not practical to change ABC to test a 64-bit output word. I therefore only can change D (for which ABC already have documented jump functions), and am using the behavior of 8 and 16-bit (single iterated) output word versions to predict a good starting point for D. PractRand is of no value now, since the 32-bit output word version passed it to 32TB without any evidence of even the slightest issue. So now I am in the process of running thousands of BigCrush (hi/lo,fwd/rev/byterev) on some select D to look for any suspicious trends on any test (analyzing over 1PB in the process). Final candidates will get pumped through a 10TB gjrand and alternating hi/lo BigCrush (as Seba originally used to document no failures in +). Obviously I need to get a new server (to turn months into weeks).
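To make the "hi/lo, fwd/rev/byterev" wording concrete, here is a small sketch of one way a 64-bit output word could be cut down and reordered into the 32-bit values a test battery consumes. The interpretation is an assumption (hi/lo as the upper or lower 32 bits, rev as full bit reversal, byterev as byte-order reversal), and `revbits32`, `byterev32`, `bit_order` and `select_word` are hypothetical names, not taken from any test harness mentioned in this thread.

```c
#include <stdint.h>

/* Reverse all 32 bits (bit 0 <-> bit 31, and so on). */
uint32_t revbits32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Reverse the order of the four bytes, leaving bits within each byte alone. */
uint32_t byterev32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Assumed set of orderings to feed to the test battery. */
typedef enum { ORDER_FWD, ORDER_REV, ORDER_BYTEREV } bit_order;

/* Pick the high or low half of a 64-bit output, then apply the ordering. */
uint32_t select_word(uint64_t out, int use_hi, bit_order order)
{
    uint32_t w = use_hi ? (uint32_t)(out >> 32) : (uint32_t)out;

    switch (order) {
    case ORDER_REV:     return revbits32(w);
    case ORDER_BYTEREV: return byterev32(w);
    default:            return w;
    }
}
```

Running every combination of half and ordering is what multiplies the number of BigCrush runs: two halves times three orderings is six test streams per candidate D before any seeds are varied.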
  • My son is fighting a slowdown issue right now after anywhere from 1 to 5 days of running his 3800X as a game server.
    Check DRAM with memtest86. Might be getting RAM errors. Could be just a re-socketing needed ... or there's a faulty DIMM.

    Re-socket the graphics card. I think I've heard those can do weird shit when not seated fully.

    Check the 3800X's performance with long run heavy computation. And compare with typical results of a 3800X. It might be throttling unduly. This would point to thermal transfer paste needing redone.

    Power supply can be the problem too. Swapping with a known good one would be best way to check.

  • Yes, DRAM (replaced and upgraded a month ago from 3200 to 3600) and graphics (using 2070 Super) re-seating were next on my list, after I take care of known issues. I'm still looking for my Arctic Silver, as well (since he is still using the AMD thermal pad that came with the Wraith Spire cooler). In my day job, one of my functions is to provide global IT support to our customers (as well as electronics, chemistry, physics, metallography, robotics, etc., and in my old position, ICP and GDS atomic/optical emissions spectroscopy, but my eyesight degraded so it was a chore to perform precision field work).
  • Yes, DRAM (replaced and upgraded a month ago from 3200 to 3600) and graphics (using 2070 Super) re-seating were next on my list, after I take care of known issues. I'm still looking for my Arctic Silver, as well (since he is still using the AMD thermal pad that came with the Wraith Spire cooler). In my day job, one of my functions is to provide global IT support to our customers (as well as electronics, chemistry, physics, metallography, robotics, etc., and in my old position, ICP and GDS atomic/optical emissions spectroscopy, but my eyesight degraded so it was a chore to perform precision field work).
    Cool, that explains your ease at absorbing far more than me with the maths on the random numbers.

    One more thing with the DRAM, 3600 MHz is approaching an internal limit of the northbridge (I/O die) inside the 3800X. It's supposed to be good for that but I've seen some complaints from users having to reduce DRAM speed, or adjust a divider, for stability.

  • evanh wrote: »
    One more thing with the DRAM, 3600 MHz is approaching an internal limit of the northbridge (I/O die) inside the 3800X.
    I had him buy the original 3200, and suggested that when he needed more RAM he save money by just buying 2 more sticks of the same. No, he wanted the 3600, which is only a marginal improvement in real world performance, except in some narrow cases (e.g. much better Physics score in Passmark).

    I have to be real careful when I put together a system... my random numbers already session freeze Microsoft RDP when pumped over the network using my custom high frame rate visualization software. I believe it is a bug in RDP, since I've replicated it on a few systems and networks. Imagine that kind of abuse on RAM that already has issues with its basic design (e.g. Rowhammer), which affects many manufacturers. DDR5 will have additional mitigations, which is the main reason I was thinking to put together a 'band-aid' server now so I will have the cash to 'go-big' in another year and a half or two when systems get 'really' interesting.
  • Refrigeration is the answer! A heat pump direct on the heat spreader. :D

  • xoroshironot Posts: 266
    edited 2020-06-14 - 15:53:31
    I have an old Peltier heat pump module on my shelf that might still work... nah, the heat sink required for that is as big as my hand, not including fan (though wouldn't require much voltage to hold 20C). One of my friends and I stacked one of those on an old 286 processor when overclocking once... works until you run out of places to move the waste heat, since the heat pump is so inefficient while running at 12VDC. We didn't have anything like a Noctua then.
  • heh, yeah, those only work at low power levels. I was jokingly referring to big-iron air-con types with the liquid-gas high efficiency phase change along with its ease of heat transport.

  • evanh Posts: 9,982
    edited 2020-06-21 - 00:05:17
    I just bumped into an online test on a 3900X that indicates that manually setting the core multiplier disables the auto-boosting features, so you get an all-cores fixed multiplier. Same as I'm getting with my 1700X.

    Their test doesn't address lowering the core voltage (they actually raise it to 1.4 V) but everything points to that being a perfectly doable action. https://www.techspot.com/review/2044-amd-b550-motherboard-vrm/

    EDIT: I just ran the same render test they used and got result of 33 mins 39 secs. My CPU temp stayed below 73 degC. To compare, I note another 1700X of 44 mins 36 secs on the blender webpage - https://gooseberry.blender.org/gooseberry-production-benchmark-file/

    EDIT2: Running this render benchmark again this morning, with lower room temperature, the CPU topped out at 67 degC. The fans don't max out until 70 so they were changing their tones the whole time.

    Power meter peaked at 200 W for whole desktop box. Idle is 55 W, so only 145 W delta. And that includes extra for DRAM and fans and power conversion losses as well.

  • evanh Posts: 9,982
    edited 2020-06-21 - 01:43:01
    Here's the AMD slide about DRAM speeds vs the northbridge clock rate. They're saying 3600 MT/s is all good to go at 1:1 ratio. So it'd be bad luck if that speed is a problem.
    [Attached image: page1_2.jpg, AMD slide on DRAM speed vs. northbridge clock]