At some point, increasing e will decrease quality as fewer bits of s0 are used twice (in different positions) in the s0 * e addition. Is e = 5 best, or is it 3 or 7 or 9?
Chris recommended x5. I just followed instructions.
s0 * 5 is an extra 14-bit adder for each iteration compared to just s0, therefore the double iteration in XORO32 would use 92 add/subtract bits instead of 64 now. No extra routing resources should be needed for s0 + s0 * 4 as both operands are the same.
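To make the adder arithmetic above concrete, here is a rough sketch. It assumes 16-bit state words and takes the 64-bit baseline figure from the post as given; the function names are just for illustration.

```python
WORD_BITS = 16                 # assumed width of each XORO32 state word
MASK = (1 << WORD_BITS) - 1

def mul_shift_add(s0: int, k: int) -> int:
    """s0 * (2**k + 1) modulo 2**16, computed as s0 + (s0 << k)."""
    return (s0 + (s0 << k)) & MASK

def adder_bits(k: int) -> int:
    """Bits that need a real adder: the low k bits of s0 pass straight through."""
    return WORD_BITS - k

BASELINE_ADD_BITS = 64         # figure quoted in the post for the current double iteration
total_bits = BASELINE_ADD_BITS + 2 * adder_bits(2)   # two extra s0 * 5 adders

for k in (1, 2, 3):            # e = 3, 5, 9
    print(f"e = {(1 << k) + 1}: {adder_bits(k)}-bit adder")
print("double iteration with * 5:", total_bits, "add/subtract bits")
```

The same count shows why a larger e reuses fewer bits of s0: each extra shift position removes one bit from the overlap.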
I notice: s0 + s0 * 4, from a hardware implementation standpoint, would likely be nearly the same as s0 + rotl(s0, 2).
The worth of those two (normally discarded) bits might be high, with the right constants, but equidistribution would have to be verified.
I have not tested this idea yet in my own code (which uses two * 5s), but from a software implementation standpoint, I suspect it would likely overburden an already saturated CPU pipeline.
s0 + rotl(s0, 2) would use a 16-bit adder, but would not take any more time than s0 + s1.
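A quick way to see how close the two forms are, again assuming 16-bit words: the rotate only differs from the shift by bringing the top two bits back in at the bottom, so s0 + rotl(s0, 2) equals s0 * 5 plus those two bits (modulo 2^16). This sketch checks that over every 16-bit value; it says nothing about equidistribution.

```python
WORD_BITS = 16                 # assumed word width
MASK = (1 << WORD_BITS) - 1

def rotl(x: int, k: int) -> int:
    """Rotate a 16-bit value left by k bits (0 < k < 16)."""
    return ((x << k) | (x >> (WORD_BITS - k))) & MASK

def times5(s0: int) -> int:
    """s0 * 5, i.e. s0 + (s0 << 2), truncated to 16 bits."""
    return (s0 * 5) & MASK

def add_rotl2(s0: int) -> int:
    """s0 + rotl(s0, 2), truncated to 16 bits."""
    return (s0 + rotl(s0, 2)) & MASK

# The only difference is the two bits that the shift would have discarded.
mismatches = sum(
    add_rotl2(s) != (times5(s) + (s >> 14)) & MASK for s in range(1 << WORD_BITS)
)
print("mismatches:", mismatches)
```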
I'd be interested to know PractRand score for xoroshiro[12,4,15,7,5]-+- and whether s0 + s0 * 8 is better or worse than s0 * 5. Summing pchi (and zchi & nzchi) for top 20 would give a measure of quality. Re PractRand, only the top candidate really needs testing.
Here's the first part of that. I'm just finishing off the x5 runs now. Decided to add a few extra candidates.
EDIT: Bugger! xom11(-+-) isn't finished. Oh well, you can see that [12 4 15 7] is a ways from the top of the best scoring. 18th position for xom02(+-+).
Here's the rerun, in PractRand v0.94, of the xo (XORO32) scrambler grid scores. The best of this selection is 29.895, candidate [6 2 3 9]. That's a whole order of magnitude lower.
I am impressed enough that I have halted BigCrush testing on the 2x * 5 (i.e., rotl((s0 + s1) * 5, D ) + s0 * 5) 128-bit state variant, and switched to the +++ xom (with E as a stream selector).
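For anyone following along, the 2x * 5 output expression quoted above can be written out like this. It is only a sketch of that one formula, assuming two 64-bit words for the 128-bit state; the state update and the stream selector are not shown, and the function name is made up.

```python
MASK64 = (1 << 64) - 1

def rotl64(x: int, k: int) -> int:
    """Rotate a 64-bit value left by k bits."""
    k %= 64
    return ((x << k) | (x >> (64 - k))) & MASK64

def out_double_x5(s0: int, s1: int, d: int) -> int:
    """rotl((s0 + s1) * 5, D) + s0 * 5, with all arithmetic modulo 2**64."""
    return (rotl64(((s0 + s1) * 5) & MASK64, d) + s0 * 5) & MASK64

print(hex(out_double_x5(1, 2, 1)))
```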
Preliminary results are better than those with 2x, and also marginally faster since the CPU pipeline has some slack now. It could be months before I have any conclusive results.
Thanks for all of your effort on this... I hope you find some use for it.
Ah, you mean the weightings on the three ranks in the dist score files. So sort the distribution rankings with sumRank=pRank+zRank+nzRank and attach the relevant grid score to each, right?
You do realise there are 1344 ranked distributions in each file and only about 50 of each with a grid score average done.
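Something like the following is what I understand by that sort. It is a sketch only: the record layout, the names and the numbers are placeholders, not the real dist score file format, and candidates without a grid score simply get None.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    constants: tuple   # e.g. (a, b, c, d); placeholder values below
    p_rank: int
    z_rank: int
    nz_rank: int

def rank_by_sum(candidates, grid_scores):
    """Sort by sumRank = pRank + zRank + nzRank and attach a grid score if one exists."""
    rows = [
        (c.p_rank + c.z_rank + c.nz_rank, c.constants, grid_scores.get(c.constants))
        for c in candidates
    ]
    return sorted(rows, key=lambda row: row[0])

# Made-up ranks and scores, purely to show the shape of the result.
candidates = [
    Candidate((1, 1, 1, 1), 10, 3, 7),
    Candidate((2, 2, 2, 2), 1, 2, 4),
    Candidate((3, 3, 3, 3), 5, 5, 5),
]
grid_scores = {(2, 2, 2, 2): 12.5}   # only a few candidates have a grid score

for sum_rank, constants, score in rank_by_sum(candidates, grid_scores):
    print(sum_rank, constants, score)
```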
Evan, to diverge for a moment... I noticed your signature concerning the recent development of predicting quantum jumps.
Think about the escape-from-zero (a.k.a. sparse-set-bit-state) issue, where suddenly only a small number of bits are set.
It takes time to propagate the bits back to an ~50% set-bits state.
Now, run the problem backwards, where the quiescence begins to accumulate, resulting in progressively fewer set bits until there is only one...
Then suddenly there are many.
I have lightly mused about the parallels between PRNGs and quantum systems in some of my writings.
Now there is more fodder for my thoughts as a result of those quantum jump experiments.
EDIT: I've not read a great deal on the new quantum discovery. Mostly I was just amused by that quote, which actually came from a journalist's interpretation rather than the original paper. I had a quick look at the paper but was drowned immediately.
When run forward, an arbitrary LFSR PRNG state might look something like this (in binary):
T-1: 100110110001 (~50% bits set)
T0: 001110010011 (status quo)
T1: 000001000000 (state collapses to a single set bit)
T2: 010000100000 (gradually repopulating bits)
T3: 010010001001
T4: 110110110110 (state recovers to ~50% set bits)
Running the PRNG in reverse from T4 to T0 is interesting, not as much because we can see the sparse state approaching, but because the recovery at T0 is instantaneous... perhaps similar to the 'quantum jump' (or not).
That reverse behavior would normally be difficult to achieve with an LFSR, which is why my own code is using a stream selector xor'd with S1 to allow for both the forward and reverse possibilities all within the same stream.
Without a stream selector, the * 5 helps with bit-interleaving (to offset the block transfer nature of shifts and rotates) and also approximately doubles the effective sparse set bit re-population rate in the output.
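The collapse and rebuild described above is easy to play with on a small state. This is a sketch, assuming the usual xoroshiro state update on two 16-bit words; the default constants [13, 5, 10] are my assumption for illustration, not something taken from this thread. It steps forward from a single set bit, counting set bits, and also steps backward to show the update is reversible.

```python
WORD_BITS = 16
MASK = (1 << WORD_BITS) - 1

def rotl(x: int, k: int) -> int:
    return ((x << k) | (x >> (WORD_BITS - k))) & MASK

def rotr(x: int, k: int) -> int:
    return ((x >> k) | (x << (WORD_BITS - k))) & MASK

def step(s0, s1, a, b, c):
    """One forward xoroshiro state update."""
    s1 ^= s0
    s0 = rotl(s0, a) ^ s1 ^ ((s1 << b) & MASK)
    s1 = rotl(s1, c)
    return s0, s1

def step_back(s0, s1, a, b, c):
    """Undo one state update by reversing each operation in turn."""
    s1 = rotr(s1, c)                              # recovers old s0 ^ old s1
    s0 = rotr(s0 ^ s1 ^ ((s1 << b) & MASK), a)    # recovers old s0
    s1 ^= s0                                      # recovers old s1
    return s0, s1

def popcounts(s0, s1, n, a=13, b=5, c=10):
    """Set-bit count of the whole 32-bit state over n successive steps."""
    counts = []
    for _ in range(n):
        counts.append(bin(s0).count("1") + bin(s1).count("1"))
        s0, s1 = step(s0, s1, a, b, c)
    return counts

print(popcounts(0x0001, 0x0000, 12))   # start from the sparsest non-zero state
```

Running step_back from a well-mixed state walks toward whatever came before it, which is the "run the problem backwards" view.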
What I'm thinking is it occurs once per state bit throughout the full period, ie: 32-bit state store will have 32 of these "teeth". And then there is also the possibility for collapse to two set bits instead of one. Lots more of those cases. I'm guessing they could all be the sawtooth shaped collapse and rebuild.
That is correct. Also, since the * 5 cuts the apparent rebuild time in half (as it affects the output) and my 128-bit stream selector has a worst-case (assuming at least one bit is set, thus 0 stream is not used), then the effective sparse state recovery is always at least 4x faster... both of which have a positive effect on some statistics, especially at larger state sizes (where even a collapse to five set bits would otherwise be problematic).
BTW, I've noticed that while testing the single * 5 (as compared to double * 5), it is noticeably more difficult at 128 bits to find a particular D value that distributes maximal statistical randomness across all bits.
Wow, I just saw a Geekbench multicore rating for the new 12 core Ryzen 3900X - socket AM4 (dual channel). It's quite something coming in at 40% up on the 12 core Threadripper 2920X - socket TR4 (quad channel) and matches the 18-core i9-9980XE - socket LGA2066 (quad channel).
Lol, I've spent more than enough myself. But still, I never thought I'd see AMD pull this off to be honest. It always seemed like Intel was just keeping AMD around so that they didn't get slammed with anti-trust penalties.
PS: More than double the score of my 8-core. EDIT: Ah, cool, there is a Linux build of Geekbench ... ha, it uploaded the score and doesn't tell the owner what it was without a payment! ... Ah, the link didn't need any payment at all. How odd. My score is 5055 single core and 30991 multi-core. That's quite a lot, 44%, higher than my CPU (Ryzen 1700X) is rated at ... struth, higher than even the Threadripper 1950X, which is 16 cores.
Clearly the score depends a huge amount on setup. It's really easy to set all cores at max clockrate just by using a better cooler than is shipped. Well, the Ryzen 1xxx series didn't have a cooler shipped so I had to buy that extra anyway. So I spent more on getting one rated double the CPU's TDP.
So, it raises the question of how many CPUs are effectively spending their entire existence throttled for want of better cooling.
And this also means the newly posted 3900X score might have had better cooling than the prior generation 2xxx parts.
I guess it is possible Linux scores could always be higher as well. Even my single core score is 24% above the average rating. It's hard to know if single core boosting is operating when there is a good chance it was still being throttled. It's then an 18% clockrate difference if staying with the base 3400 MHz as top speed vs my 4000 MHz on all cores.
The Geekbench website is cool, it's easy to do simple text searches on names. Here's one for all 1700X's where you can see my score as the newest entry with the set 4000 MHz clock - https://browser.geekbench.com/v4/cpu/search?q=1700X
Lol, there are a bunch of iMacs in that list. Apart from the ridiculous scores (and with 16 cores!), I know Apple never used any AMD CPUs. I presume Geekbench has misidentified some Intel CPU as a Ryzen 1700X. Or someone has hacked the database.
Check out the 3600 (no 'X') Single Thread Passmark score: Here.
It should be easy to find on the chart... because it is at the top. Almost unbelievable... it would be great if it is correct (as that chip is low end in the product stack).
Wow, wasn't expecting that either. The problem all these benchmarking programs have now is they don't (can't) record the clock speed that was operating during the tests. It's very hard to work out how much throttling is occurring.
Boosting=1/throttling EDIT: Crossing that out, there is an operational difference between boosting and throttling. I think boosting is something the OS decides on, whereas throttling is forced by the hardware. I have my BIOS clock multiplier set higher than the max boost so there is no room left for that decision. And as long as the cooling holds up then no throttling will occur.
Interestingly, on top of all that, for the Ryzens at least, the hardware still dynamically jumps down to 2.2 GHz, I think it is, for each core when that core is idle. I can disable this in the BIOS but annoyingly it then starts giving me power-up error messages about incorrect CMOS settings. A flaw of my particular BIOS.
Applying all this blather about not knowing the clockrate to those public databases of benchmarking: The still unreleased products like that Ryzen 5 3600 can easily be operating under good conditions, ie: an upgraded cooler, to get a good first score. Might even be a new stock cooler that got an upgrade at the same time. And then that is compared to well established averages that can be terribly nerf'd by their stock cooler. Or even worse, cheapo consumer grade laptops with rubbish cooling.
EDIT: Assuming clockrate measurements can be gathered during benchmarking, a remedy for those benchmarking databases would be to include, right beside every score, a clockrate for that score. And each website-listed CPU model score average should also have a CPU model clockrate average.
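As a rough sketch of what I mean, each database entry would carry the clock it ran at, and the per-model averages would carry both numbers. The record layout is hypothetical; the first entry uses my own multi-core score and set clock from above, the second is a placeholder and not a real measurement.

```python
from collections import defaultdict
from typing import NamedTuple

class Result(NamedTuple):
    model: str
    score: int
    clock_mhz: int   # clock actually sustained during the run

def model_averages(results):
    """Per-model (average score, average clock in MHz)."""
    sums = defaultdict(lambda: [0, 0, 0])   # score total, clock total, count
    for r in results:
        acc = sums[r.model]
        acc[0] += r.score
        acc[1] += r.clock_mhz
        acc[2] += 1
    return {m: (s / n, c / n) for m, (s, c, n) in sums.items()}

results = [
    Result("Ryzen 7 1700X", 30991, 4000),   # score and set clock from my run
    Result("Ryzen 7 1700X", 20000, 3400),   # placeholder, not a real measurement
]
print(model_averages(results))
```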
Finally found one graph showing relative performance. It's not as good as knowing what the actual clock frequencies were but still way better than all the rest of the benchmarking that is happening these days.
EDIT: Would be good to also have estimated wattages with those.
Ha! Here's a good indication of how AMD have given the benchmarks a better overall outcome with the Ryzen 3k series. The 3700X is boosting beyond the rated TDP whereas the 2700X is falling short.
Admittedly, it's more showing that the 2700X was nerf'd rather than the 3700X cheating. But that indicates exactly why the score I got for my 1700X system is so much higher than the reference numbers ever were.
Cool, I see both a box and tray listing for the 3900X. There wasn't any tray option previously. The 1k parts had no coolers with a mostly empty box and the 2k parts all had underrated coolers.
No price for the tray part yet but without a cooler it has to be cheaper.
Only question mark now is whether the XFR listed parts are more unlocked than the others. I suspect it makes no diff and all models are fully unlocked, just waiting for better cooling.
With the 3900X at the top of both the Single and Multi Thread Passmark charts (though only a few samples), and given the power ratings, it looks like time to research a small rack server build.
Not necessarily for the 3900X, but it would be hard to argue against it for a 2U or 4U considering the value, as opposed to waiting for the 3950X (or new Threadripper / Epyc or some 'magical' Intel competitive release).
I need to bring my minimal BigCrush meta-analysis of (released and candidate) PRNGs down to 1 day (or less) per PRNG, as it is currently taking me 5 days.
That would allow for a complete preliminary screening of all D values for xoroshiro128psp (A=24,B=16,C=37) in about 2 months.
It looks like the end goal with the 3900X would be to replicate the 4.50/4.51GHz of this build: https://passmark.com/baselines/V9/display.php?id=124301243525
Why? The CPU Mark (hard to fake) and 2D Graphics Mark (more Intel-like, but could be faked) seem almost impossible.
Possibly water-cooled, but I am not familiar enough with these specific builds to be sure.
I'm guessing that some aspect of that speed and/or configuration is able to skip a wait state.
That specific memory (G Skill F4-3600C15-8GTZ) is likely part of the key (and magically does not suffer from the high latency issue that some generic builds have while using it). Techpowerup would seem to indicate that PC3600 is optimal for 3900X.
Yeah, that guy will be using a better cooler than stock (I note the average 3900X CPUmark is 31880 vs his 35382). And possibly raised the clock multiplier to 4.5 GHz. He's kept it quite conservative I suspect. Full power burns, although higher than the stock cooler, won't be aggressive.
Plain air cooling should be good for 4.8 GHz multiplier (meaning all cores). Actually, water cooling is getting trickier these days, the motherboards need the flow of air cooling coming from the CPU cooler to circulate the heat away from the voltage regulators.
Right, there is a gear-change effect above 3600 MHz where the internals of the I/O die, I think it is, halve in speed. Or it could be the whole of the Infinity Fabric. I haven't looked too hard.
Oh, I forgot to mention there is an important but simple step to controlling the heat. And that is: don't let the automatics set the CPU core voltage for the chosen clock multiplier setting. Basically, there is a sweet spot in voltage for efficient MIPS vs power. If you ask for a higher clock then the conservative automatics give too much corresponding voltage.
Eg: First photo is default auto CPU settings but with XMP DIMM setting selected, hence the auto raised DRAM voltage. My Ryzen 1700X default clock multiplier is 3400 MHz.
Second photo is same as first but with CPU clock multiplier set to 4000 MHz. You can see the voltage has been automatically raised from 1.35 volts to 1.50 volts.
In the third photo, I've manually overridden the voltage back down to 1.3625 volts. So slightly higher than default. With it this way, it's still completely reliable without overloading the regulators or thermally throttling the CPU clock. Doesn't even need a top grade motherboard, but obviously does require one that has those settings.
I must admit, I haven't seen anyone try this with, say, the AMD Prism cooler. So I don't know how effective at cooling it would be in a direct comparison. There weren't any AMD coolers when I got the 1700X, so I'd have to spend money to find out myself.
PS: My cooler is a reasonably humble Deep Cool S40.
And that is don't let the automatics set the CPU core voltage for the chosen clock multiplier setting.
Yes. I too have an ASUS AM4 motherboard and it autodetected the core voltage waay high, so as soon as more than a couple cores were utilized, it would immediately shut down. Still happens occasionally when something loads all cores for a longer time, but I think that has more to do with insufficient cooling.
Yes. I too have an ASUS AM4 motherboard and it autodetected the core voltage waay high, so as soon as more than a couple cores were utilized, it would immediately shut down.
Behaviour sounds familiar. The cause of the instant crash could also be an overloaded voltage regulator. Raising the voltage also naturally increases the current. Creating crash/reset from excessive ripple or over-current.
Still happens occasionally when something loads all cores for a longer time, but I think that has more to do with insufficient cooling.
What brand/model cooler is that?