Random/LFSR on P2

18283848587

Comments

  • xoroshironotxoroshironot Posts: 259
    edited 2020-07-26 - 23:53:20
    evanh wrote: »
    I just bumped into an online test on a 3900X that indicates that manually setting the core multiplier disables the auto-boosting features, so you get an all-cores fixed multiplier.
    That was why we originally set up the 3800X with PBO, so that it can boost to 4.5 GHz when it (as a server) is lightly loaded, but yet not over-heat his room when it is heavily loaded and maxed at 4.2 GHz. That is why I'm curious about under-volting improving thermal ceiling, which I understand is easier on the 3800X compared to other variants, though not much wiggle room anyway.

    My younger son solved any potential issues (e.g. bandwidth, heat, fighting the older one for server time, etc.) all on his own... he is renting remote servers, making a good profit (and likely won't let me 'borrow' them). If he keeps it up, he might be able to put himself through college. It seems odd to me to have children who think it is normal to run servers (like me). Oh well, at least they are not pursuing (my or others) bad habits.

    I've finished my primitive multiplier M (for FP and GP variants) and D (for GP variant) search with +1 increment at 64-bit word size. M was fairly easy, as there are only 3 choices that are typically optimized for x86-64: 3 5 9, but not 17. D was nail-biting, due to Seba's use of a near half-width C constant, which is very close to the theoretically optimal D value. Now replacing +1 with random odd increments for testing (xorwow style, but unlike xorwow, values greater than 1 are not required to improve randomness... they are only present to eliminate the statistically insignificant number of sparse states, and are almost free to execute in my code context due to parallelization. Sweet.).

    Edit: Corrected optimal primitive multipliers.
  • evanhevanh Posts: 9,702
    edited 2020-06-21 - 03:21:31
    Even with a fixed multiplier, it still down-clocks a lot when idling. Mine spends most of its time well below 2 GHz. This'll be OS controlled rather than the BIOS/CPU firmware that does the PBO.

  • xoroshironotxoroshironot Posts: 259
    edited 2020-06-21 - 04:01:04
    I'm curious what your idle speed is if you leave your browser open to the main page at tomshardwareguide?

    He is running the AMD power profile in Windows. I'm not sure how that compares to the one by created by 1usmus.
  • evanhevanh Posts: 9,702
    edited 2020-06-21 - 04:43:10
    As per most websites I visit, there is a lot of blank space on that webpage. The articles titles are there. That's about it.

    I have all scripts blocked by default. You might have to be more specific about what I should be looking for.

    EDIT: Lol, and enabling first-party scripts nets me an adblocker whine.
    EDIT2: Okay, got first-party scripts operating now, without the whinging. Had to allow two more domains - futurecdn.net and footprint.net, and one extra script domain - futurecdn.net. Both of which brought in images and graphs for the articles.

  • xoroshironotxoroshironot Posts: 259
    edited 2020-06-21 - 04:58:27
    Yeah, the scripts start running videos, and if you adblock, then the whining starts.
    AnandTech is not quite as bad, but bad enough.

    Working at home on a laptop, I sometimes forget and leave the browser running on some page like that when I go eat lunch... Dell is toasty when I get back. PB&J won't help.

    [Rant]I wonder if anybody in the ad business realizes how much I loathe them on this subject of eating my CPU time, racking up my power bill and burning my legs (as I sit here, since I was just on their site less than 5 minutes ago). They might as well install mining scripts on their webpage, since the current state of things feels criminal to me anyway... 'Hey lets see what we can do to destroy the Internet for everybody'.[/Rant]

    Oh, for the sites I adblock that won't let you read their content when you do, I added a Print Preview extension and just click the button on the toolbar just before the whining starts, which works for many.
  • evanhevanh Posts: 9,702
    edited 2020-06-21 - 05:05:15
    For me, it's all about the tracking scripts. The fact that any ads get blocked is just incidental because of the way they all work these days.

    EDIT: Heh, although, since I'm not seeing many ads because so much gets blocked maybe I'd change my tune if I did get hit with the full brunt. :D
  • Oh, for the sites I adblock that won't let you read their content when you do, I added a Print Preview extension and just click the button on the toolbar just before the whining starts, which works for many.
    I'm generally satisfied with being selective about domains. It's the third-party scripts that do the nasty stuff.
  • evanh wrote: »
    ... Their test doesn't address lowering the core voltage (they actually raise it to 1.4 V) ...
    Just checked my BIOS automatics for this. It raises my 1700X core voltage from 1.35 V to 1.5 V when I set the base multiplier for 4 GHz. Since I'm using 1.362 V you could say I've lowered the core voltage by 0.138 V.

  • evanhevanh Posts: 9,702
    edited 2020-06-21 - 07:00:08
    With a better cooler the 1700X would go a little faster still on that curve. There's a balancing act between cooling performance and the heat generated by frequency and voltage. And since frequency is capped by temperature and voltage, the heating vs cooling matters a lot.

    In a hot room my setup may become unstable before the thermal throttling temperature threshold. By running the voltage lower, that's likely what I'm trading off to get these better results.

    The Wrath Spire cooler will limit your results when comparing to what I'm showing here.

  • He is running a Wraith Prism, which (by some accounts) runs about 8C cooler than the Spire. Even so, Prism is considered too small for anything larger than the 3800X.
  • Oops, yeah, Prism. Still that needs to be considered with whatever results you get.

  • xoroshironotxoroshironot Posts: 259
    edited 2020-06-24 - 04:16:50
    I just noticed that there is an exploitable asymmetry that I missed which simplifies my code, without reducing randomness (and perhaps improving it, as fails PractRand at 4MB with 25 bits state, which I believe is maximum). In addition to PractRand testing on 8-bit output word, I 'xor' pairs of outputs and run a multidimensional equidistribution verification of the full period, comparing it to the theoretical distribution. For 16-bit output, I will have to fall back to using (your) standard freq analysis to ensure I can replicate the normal distribution at the 16GB boundary (i.e. that I previously posted showing linear freqs p-value plots with RevBits). Edit: New modification fails Hamming Weight Dependencies prematurely... learning from this, I will exploit the asymmetry differently, but my code will revert to its previous 'fast' speed.

    Upside: Should be even faster than the excellent benchmark results I already posted. Edit: Yes, attached... but a failure, so removed.
    Downside: The last month of testing wasted (other than having reasonably established a solid proof-of-concept), and I have to re-optimize and verify new code before starting testing again.

    Edited for clarity.
    Removed failed attempt attachment.
  • Any progress with the Ryzen stability?

  • xoroshironotxoroshironot Posts: 259
    edited 2020-07-06 - 03:41:12
    evanh wrote: »
    Any progress with the Ryzen stability?
    We updated the UEFI BIOS and AMD chipset driver, and also found his Oculus Rift S software was busy even when it wasn't being used, so I had him terminate parts of it. It seems fine now, but he hasn't tried using the Oculus since. We uninstalled another piece of software that caught my attention (excess handles) that he was going to update later, but I can't remember what it was.

    He has lost interest in under-clocking, but I may build a 3950X system and take up the mantle. I am actually strongly considering a GIGABYTE B550I Aorus Pro AX motherboard for a mini-ITX build. Not really a system to replace my current dual Xeon server, but it could since it should be about 1.8x faster. Waiting on the XT processors release to see if it has any effect on existing CPU pricing. If I went Threadripper, I have a feeling I would be disappointed with anything less than a 3990x, but can't currently justify the cost in this uncertain economy. At least with the 3950X I have multiple uses, so its abilities won't go wasted, even if I have to wait extra weeks for certain jobs to complete that are better suited to TR.
  • AMD definitely haven't been going cheap on the large core counts. Those 3990X's are just too pricey, imho. I guess they don't want to flood the market with underused products just because the price was right and then end up with shortages.

    Thanks for the update anyway.

  • If the 3990X dropped to about $3000 US, then it would be much more difficult for me to turn down. The next gen of TR will likely force this gen down close to that price (and bring something else to the table that will make you look hard at the new one... hopefully they don't put Cinebench in microcode in the new one so that it spends a few seconds to generate a CB score every time you power it on, as I am getting sick of watching CB runs).
  • hehe
  • xoroshironotxoroshironot Posts: 259
    edited 2020-07-09 - 01:39:08
    Tony and Evan, while waiting on my slow server to churn out 1000's of BigCrush results on 64-bit (actually 128-bit, due to double-iteration*) output for TestU01 meta-analysis (which looks good so far, BTW), I took a dip to the bottom-end and ran SmallCrush on 10-bit and 11-bit output generators, which are 31 and 34 bits of state, respectively.

    I am happy to say that 10-bit only fails the Birthday Spacing test (which is expected, passing all other test easily, which was unexpected), while 11-bit passes all tests completely. Apparently my new methodology makes the repetition of the underlying xoroshiro (at 20 and 22 bits of state) nearly completely invisible (unless you know exactly what to look for).

    Since this all seems (even more so) too good to be true, I'll be happy to hand this off for others to try and poke holes in once I've spent several more weeks on it and have time to revise my preliminary documentation.

    I haven't fully vetted what this might mean for XORO32, but I'll play with it. As I said before, this new approach is more likely stand-alone, but since XORO32 has already done a lot of the up-front scrambling work, maybe there is a way to slot it into a carefully simplified version of my current code and keep the PractRand failure point at 128GB, or above.

    Edit: * The term 'double-iteration' is no longer technically correct with respect to my new code, which I'll explain more on later.
  • Using a logic simplification (e.g. fewer transistors, without causing Hamming Weight issues this time, but a few percent slower on x86-64), I've increased 10-bit output / 31-bit state randomness to pass SmallCrush on both forward and reverse (3x 10-bit outputs concatenated into 30 bits, then shifted by 2 bits left since SmallCrush does not look at the 2 LSBs).

    I thought 31-bit state SmallCrush pass should be impossible, except perhaps that restriction and/or observation only applies to 31-bit output, 31-bit state, maximally equidistributed PRNGs (i.e. each 31-bit output value occurs once per full period). Not sure.

    I tried a similar modification on 2x 16-bit output / 49-bit state, and (after playing with constants) it fails PractRand both forward and reverse at about 2TB. So, in some sense it is about a 2000x improvement over XORO32, and at least a 16x improvement in all cases.

    Correlation issues with xoroshiro engine are now nearly practically invisible to TestU01 and PractRand (significantly extending the confidence in normally distributed freqs at, and well beyond, the stream limit, allowing any given stream to to be more useful in producing differentiated experimental results as compared to the majority of other correlated streams, thus many non-research use cases could over-run into multiple streams with no ill effect).

    I'm going to run some more BigCrush on 16, 32 and 64-bit output of this new code, then will post source.
  • Sounds intriguing.
  • TonyB_TonyB_ Posts: 1,493
    edited 2020-07-25 - 10:49:51
    The big question is whether this new as-yet undisclosed algorithm could be implemented in a future version of the P2 in a similar way to the proposed XOROACC, which interleaves 65536 correlated streams to achieve 32-bit equidistribution (almost) by using an accumulator.

    XORO32 (existing)
    Output   All PRN   All PRN   Period      Number of  Output
    width    values    values                possible   frequency
             output?   output                PRN
    		   equally?              values
    
    word     yes       almost    2^33-2      2^16       2^17 for 2^16-1 values (> 0)
                                                        2^17-2 for 1 value (= 0)
    		
    long     no        no        2^32-1      2^32       mostly 1 or 0, good 
                                                        approximation to ideal
                                                        random distribution
    
    XOROACC = XORO32 and accumulator (proposed)
    Output   All PRN   All PRN   Period      Number of  Output
    width    values    values                possible   frequency
             output?   output                PRN
    		   equally?              values
    
    word     yes       yes       2^49-2^17   2^16       2^33-2 for all values
    		
    long     yes       almost    2^48-2^16   2^32       2^16 for 2^32-2^16 values
                                                        2^16-1 for 2^16 values
    
    word means bits [31:16] and [15:0] only
  • Wuerfel_21Wuerfel_21 Posts: 824
    edited 2020-07-25 - 19:30:44
    Just popping in with a random question: A looooong time ago, I pulled a 64bit state/32bit output xoroshiro algorithm from this thread.

    in P1ASM:
                  ''generate number   
                  mov vm_res1,vm_seed1
                  add vm_res1,vm_seed0
                  rol vm_res1,#17
                  add vm_res1,vm_seed0
                  ''advance state
                  xor vm_seed1,vm_seed0
                  rol vm_seed0,#26
                  xor vm_seed0,vm_seed1
                  mov vm_junk,vm_seed1
                  shl vm_junk,#9
                  xor vm_seed0,vm_junk
                  rol vm_seed1,#13
    

    Is this still good or is there a better 64bit state/32bit output algo in the same number of instructions? (or, like, 1 more?)
  • evanhevanh Posts: 9,702
    edited 2020-07-25 - 20:32:47
    That's the plusplus extension to the algorithm. https://forums.parallax.com/discussion/168188/xoroshiro-random-number-generator/p1
    More specifically Xoroshiro64++[26 9 13 17]

    Tony might have something else for you but that's still good as is.

  • xoroshironotxoroshironot Posts: 259
    edited 2020-07-25 - 21:44:42
    TonyB_ wrote: »
    The big question is whether this new as-yet undisclosed algorithm could be implemented in a future version of the P2 in a similar way to the proposed XOROACC, which interleaves 65536 correlated streams to achieve 32-bit equidistribution (almost) by using an accumulator.
    Wuerfel_21 wrote: »
    Is this still good or is there a better 64bit state/32bit output algo in the same number of instructions? (or, like, 1 more?)
    Wuerfel, the code you listed looks like a standard ++ implementation, which is good... but the specific constants in use are hard to test for any high degree of randomness beyond, say, 32TB (or only as little as 16GB, if you follow the square root of the period rule), due to the large state size... which leads to my research to demonstrate a methodology of extending the state in a way that can have some reasonable assurance to apply to larger output word sizes and increase statistical randomness.

    Tony and Wuerfel, so, I asked Seba if s0 and s1 are equally random. He said 'yes'. So I inferred that the only reason that one would not use both s0 and s1 as a double output (since all possible combinations of s0 and s1 occur except for [0,0]) is their simultaneous pair correlation with each other AND successive pairs. Therefore, if you can break the pair correlation, then you can return both f(s0) and f(s1). The end result is that with only a few more instructions than the above code example, you get TWO usable random values.

    Tony, that is what I meant before when I stated that double-iterated is no longer technically correct, as the xoroshiro engine itself need only be executed once per pair of values, and the s0/s1 de-correlation is fairly simple (if you ignore the months of research required to derive it).

    Wuerfel, unlike Tony's specific interest in double-iteration, I'm not sure if you have a best use-case for two values at once (e.g. calculating Pi from a pair of random numbers is a benchmark I wrote), but regardless, it is indeed simpler and faster. Wait several weeks and I will post code here.

    Tony, to point out the problem I was worried about with integrating XORO32 with my new code... it will likely work fine, but the existing ++ scrambler in XORO32 will be somewhat redundant to the intrinsic scrambler in my new code (that is required for extending freqs range). Perhaps not really a problem, but slightly more complicated than necessary, which is why I called it a stand-alone approach. Who knows... perhaps the combination of XORO32 with my new code (i.e. substituting XORO32:lo16 for s0 and XORO32:hi16 for s1) will push PractRand out to 8TB fwd/rev failure (which I have some indications that is a possibility).
  • Wuerfel_21 wrote: »
    Just popping in with a random question: A looooong time ago, I pulled a 64bit state/32bit output xoroshiro algorithm from this thread.

    in P1ASM:
                  ''generate number   
                  mov vm_res1,vm_seed1
                  add vm_res1,vm_seed0
                  rol vm_res1,#17
                  add vm_res1,vm_seed0
                  ''advance state
                  xor vm_seed1,vm_seed0
                  rol vm_seed0,#26
                  xor vm_seed0,vm_seed1
                  mov vm_junk,vm_seed1
                  shl vm_junk,#9
                  xor vm_seed0,vm_junk
                  rol vm_seed1,#13
    

    Is this still good or is there a better 64bit state/32bit output algo in the same number of instructions? (or, like, 1 more?)

    When this xoroshiro64++ is implemented on the P1 or P2, "advance state" can be placed before "generate number" and the same register can be used for vm_res1 and vm_junk. The code is the same speed and size but one long is saved overall. In hardware or software with true parallel processing, it is best to have "generate number" before "advance state" as that is quickest.
  • Ahh, thanks everyone. No, I don't really need 64 bits at once. Infact, I often only use one bit of the 32bit result. I think I've read somewhere that the top bits are better than the bottom bits? And not using the junk register doesn't really bring any advantage in the code I've copied that particular snippet from, but I'll remember that optimization for when it is useful.
  • Wuerfel_21 wrote: »
    I think I've read somewhere that the top bits are better than the bottom bits?
    That is true with the + scrambler, but not necessarily the case with the ++ scrambler you are using, as it was designed to (in theory) balance the randomness across all bits. Thus, if using only 1 bit, generally you can feel free to use the bottom bit, then shift the result 31x to extract the rest before generating a new number / advancing state.
  • TonyB_TonyB_ Posts: 1,493
    edited 2020-07-27 - 09:59:07
    Wuerfel_21 wrote: »
    Ahh, thanks everyone. No, I don't really need 64 bits at once. Infact, I often only use one bit of the 32bit result. I think I've read somewhere that the top bits are better than the bottom bits? And not using the junk register doesn't really bring any advantage in the code I've copied that particular snippet from, but I'll remember that optimization for when it is useful.

    Before the final rotation and addition, the xoroshiro++ code generates the xoroshiro+ output. You could use bit 31 of the latter if you need only one bit and would like to save two instructions. Higher bits have better quality with the + scrambler whereas quality is considered to be the same for all bits with ++. If code size and execution speed are not important, then use ++ as our tests have shown it is much better (i.e more random) than +.

    Omit instructions with * below for xoroshiro64+
    	''generate number   
    	mov vm_res1,vm_seed1
    	add vm_res1,vm_seed0	' xoroshiro64+ PRN
    	rol vm_res1,#17		'*
    	add vm_res1,vm_seed0	'*xoroshiro64++ PRN
    	''advance state
    	xor vm_seed1,vm_seed0
    	rol vm_seed0,#26
    	xor vm_seed0,vm_seed1
    	mov vm_junk,vm_seed1
    	shl vm_junk,#9
    	xor vm_seed0,vm_junk
    	rol vm_seed1,#13
    
    Again, "advance state" before "generate number" avoids the separate junk register.
  • xoroshironotxoroshironot Posts: 259
    edited 2020-07-29 - 14:01:06
    Tony, to re-hash, I was not entirely satisfied with the proposed code:
    'xoroacc operation
    'result_out[15:0]  := prn[15:0]  + result_in[31:16] + 1 ' ^ 1 also works, but seemingly not as well
    'result_out[31:16] := prn[31:16] + result_out[15:0]
    
    We then looked at the results of applying different combinations of modifications to one or both lines. The most promising from a randomness standpoint, as I recall, was applying RevBits(prn[31:16]), but we agreed it was unwieldy for software operation/simulation, even if it is easy to implement in hardware. There was also some benefit to search for a different odd constant > 1, which cannot be entirely disregarded in terms of protection against sparse states (though there are very few to worry about).

    Recently I mentioned having simplified my new unpublished code (to reduce hardware footprint and increase randomness), and I just realized (somewhat embarrassingly so) that the simplification also applies to the above code, with benefits I believe are better than using RevBits. To wit:
    'modified xoroacc operation
    'result_out[15:0]  :=    ( prn[15:0]        ^ result_in[31:16] ) + 1 ' or other odd constant of merit (to further benefit high bits). 
    'result_out[31:16] := rol( prn[31:16] , D ) ^ result_out[15:0]       ' optimal D (0-15) must be searched, with most values being similar. 
    
    Because the ++ scrambler performs addition, it is seemingly harmful to randomness to follow it with another addition, but an xor is good, which is then followed with the odd constant sum to further the benefit and preserve equidistribution (which was not obvious to me by inspection). Having used xors in place of previous additions, the odd constant must now be summed to the result, as xoring it would break equidistribution (which is perhaps slightly more obvious).
  • TonyB_TonyB_ Posts: 1,493
    edited 2020-07-31 - 09:05:41
    Tony, to re-hash, I was not entirely satisfied with the proposed code:
    'xoroacc operation
    'result_out[15:0]  := prn[15:0]  + result_in[31:16] + 1 ' ^ 1 also works, but seemingly not as well
    'result_out[31:16] := prn[31:16] + result_out[15:0]
    
    We then looked at the results of applying different combinations of modifications to one or both lines. The most promising from a randomness standpoint, as I recall, was applying RevBits(prn[31:16]), but we agreed it was unwieldy for software operation/simulation, even if it is easy to implement in hardware. There was also some benefit to search for a different odd constant > 1, which cannot be entirely disregarded in terms of protection against sparse states (though there are very few to worry about).

    Chris, thanks for that accurate summary, which has refreshed my memory. Regarding your new simplification
    'result_out[15:0]  :=    (prn[15:0]     ^ result_in[31:16]) + 1
    'result_out[31:16] := rol(prn[31:16],D) ^ result_out[15:0]
    
    is functionally equivalent to
    'result_out[15:0]  := rol(prn[15:0],D) ^ result_in[31:16]
    'result_out[31:16] :=    (prn[31:16]   ^ result_out[15:0]) + 1
    
    and I think the latter is slightly easier to handle as the addition can be done last (and separately if necessary).

    As we already use [a,b,c,d] constants for the xoroshiro engine and ++ scramber, I suggest renaming D to e and using f for the add constant, thus
    'result_out[15:0]  := rol(prn[15:0],e) ^ result_in[31:16]
    'result_out[31:16] :=    (prn[31:16]   ^ result_out[15:0]) + f
    
    Setting e = 0 and f = 1, have you checked how these two compare in randomness tests?

    1. Original xoroacc
    'result_out[15:0]  := prn[15:0]  + result_in[31:16]
    'result_out[31:16] := prn[31:16] + result_out[15:0] + 1
    
    2. Modified xoroacc
    'result_out[15:0]  := prn[15:0]  ^ result_in[31:16]
    'result_out[31:16] := (prn[31:16] ^ result_out[15:0]) + 1
    
    Finally for now, I guess that replacing + 1 with ^ 1 breaks the equidistribution?
    'result_out[15:0]  := prn[15:0]  ^ result_in[31:16]
    'result_out[31:16] := prn[31:16] ^ result_out[15:0] ^ 1
    
Sign In or Register to comment.