2. Is there enough time to do the 64-bit addition in a single clock?
If the answer is yes, I'm wondering whether two 16-bit additions could be done in one clock, with second using result of the first, perhaps with bespoke carry logic.
Apparently, there IS time to do a 64-bit addition at 160MHz, since OnSemi didn't complain about it when we gave them the Verilog that, among other things, implemented xoroshiro128+. It was actually the 16x16 multiplier in the cogs' ALUs that were the critical paths in the design. The FPGA hid it well, having hard multiplier blocks in 28nm.
I think that two 16-bit additions could be done, one after the other. I know that at 160MHz, there is NOT time for two 32-bit additions, one after the other, as this was a problem in the CORDIC solver that I had to work around by adding additional pipeline stages.
FPF - "floating point frequency" test; contrary to the name
it's purely integer math. Can be slow on some settings.
This checks for very short range correlations, even shorter
than DC6, especially those correlations involving lots of 0
bits.
Roughly speaking, this test does a frequency test applied to
the binary format of floating point numbers storing the
integer values of overlapping windows of the original data
stream.
Hmm, reading that, I'm thinking there could be some sort of pattern forming when sampling is aligned with LSbit. I'll reorder the Word1/2 variants to be LSbit aligned rather than the current MSbit ... Nope, no quality change.
FPF: A frequency test. At the parameterization used in core test set it's almost a non-overlapping frequency test. And not on hamming weights this time! Every 16 bits in the datastream it does a sample. The sample value consists of the number of consecutive zero bits starting from the lowest bit of the sample (up to a maximum of like ~50 or so, but that never happens, realistically it's almost always a small number), and the first 14 bits above the lowest 1 bit in the sample. Which is basically equivalent to a bit-backwards unsigned integer to floating-point value conversion.
I'll do some more configurations tomorrow if Scro doesn't tell me I'm wasting my time.
I don't know what you'd be doing to wasting your time or not waste your time.
The main attempts right now are shuffling the summing output bit order. This is the question I was trying to articulate earlier - Is there is any real quality advantage to a fixed post-summing shuffle? Or are we just tricking PrandRand?
Shuffling bit order in the final output? Yeah, that sounds pretty suspicious. Exactly how useless that is depends upon what test failures you're managing to avoid that way, but in general it sounds like a bad idea.
scro,
Is the FPF test in PractRand important for something like xoroshiro32+?
That particular question was asked back when DC6 tests were the failure point.
2. Is there enough time to do the 64-bit addition in a single clock?
If the answer is yes, I'm wondering whether two 16-bit additions could be done in one clock, with second using result of the first, perhaps with bespoke carry logic.
Apparently, there IS time to do a 64-bit addition at 160MHz, since OnSemi didn't complain about it when we gave them the Verilog that, among other things, implemented xoroshiro128+. It was actually the 16x16 multiplier in the cogs' ALUs that were the critical paths in the design. The FPGA hid it well, having hard multiplier blocks in 28nm.
I think that two 16-bit additions could be done, one after the other. I know that at 160MHz, there is NOT time for two 32-bit additions, one after the other, as this was a problem in the CORDIC solver that I had to work around by adding additional pipeline stages.
Chip, do you think there would be time in XORO32 for a second addition that uses the result of the first, for each of the two iterations? No need to calculate parity (no longer required) and the second sums would be the outputs.
2. Is there enough time to do the 64-bit addition in a single clock?
If the answer is yes, I'm wondering whether two 16-bit additions could be done in one clock, with second using result of the first, perhaps with bespoke carry logic.
Apparently, there IS time to do a 64-bit addition at 160MHz, since OnSemi didn't complain about it when we gave them the Verilog that, among other things, implemented xoroshiro128+. It was actually the 16x16 multiplier in the cogs' ALUs that were the critical paths in the design. The FPGA hid it well, having hard multiplier blocks in 28nm.
I think that two 16-bit additions could be done, one after the other. I know that at 160MHz, there is NOT time for two 32-bit additions, one after the other, as this was a problem in the CORDIC solver that I had to work around by adding additional pipeline stages.
Chip, do you think there would be time in XORO32 for a second addition that uses the result of the first, for each of the two iterations? No need to calculate parity (no longer required) and the second sums would be the outputs.
Chip, do you think there would be time in XORO32 for a second addition that uses the result of the first, for each of the two iterations? No need to calculate parity (no longer required) and the second sums would be the outputs.
I think so. What are you thinking about?
Just curious.
EDIT March 2018
Implementing xoroshiro++ was the real reason for my question.
Chip, do you think there would be time in XORO32 for a second addition that uses the result of the first, for each of the two iterations? No need to calculate parity (no longer required) and the second sums would be the outputs.
I think so. What are you thinking about?
Chip, there is a simple way to improve the xoroshiro+ output considerably and all LFSR artifacts can be eliminated. Sebastiano Vigna told me the way, however I am sworn to secrecy until it is announced officially. Evan knows and has been doing tests and I am allowed to tell you but nobody else.
Random is what comes to mind when I think about the computers at work these days. We've recently had IT infrastructure moved overseas so now the M$ terminal server desktops show up via some laggy and congested shared data pipe. Even something as simple as selecting some text takes a long time to get right. I end up using the cursor keys a whole lot now because then I can precisely predict what it will do without having to wait to see what happens.
At least the local guy was able to help things by setting the mouse pointer to be locally hosted. It was pure agony until he did that.
and the real fun begins when you from your remote desktop access another server via remote desktop, then you can watch how often programs redraw their buttons and borders!
Well, the marathon effort to decide on the best Xoroshiro32 shifter combinations, aka triplets, aka candidates, has finally come to an end. It was engrossing and enjoyable. Many thanks to everyone posting here and especially to:
- Johannes, aka Ahle2, for starting the ball rolling.
- Chip's openness to ideas and proving of implementation.
- Tony's enthusiasm, engagement and taking the lead to find even better.
- Sebastiano Vigna, aka Seba, and David Blackman for the Xoroshiro algorithm.
- Chris Doty-Humphrey, aka Scro, for the wonderful PractRand!!!! and helpful comments. I would never have gone this far without PractRand.
- And on that note of just how much work was done with PractRand I'll have to credit the coincidental timing of AMD launching the Ryzen CPU when they did. Stepping up from an ageing dual-core to an octa-core was perfect for me. It provided excellent flexibility for testing hair brained ideas and just throwing brute MIPS at the job. At one moment of unrestrained task launching I even hit 29 GB of RAM usage ... and it came out fine. Something that would have needed a hard reset on the old system I suspect.
Random is what comes to mind when I think about the computers at work these days. We've recently had IT infrastructure moved overseas so now the M$ terminal server desktops show up via some laggy and congested shared data pipe. Even something as simple as selecting some text takes a long time to get right. I end up using the cursor keys a whole lot now because then I can precisely predict what it will do without having to wait to see what happens.
At least the local guy was able to help things by setting the mouse pointer to be locally hosted. It was pure agony until he did that.
Sounds like a major boon for productivity. Nothing could be smarter than moving your desktop to the other side of the world.
Okay, I'm ready to implement the improved XORO32, but I have some questions. I emailed you and TonyB_. I'm not clear on how this new scheme works when it comes to the adding.
Once this is implemented, I'll have a new release ready in a short time.
I should write up here the details of what we've done. If what Tony says is holds true then XORO32 is an algorithm that isn't ever planned to be published officially as a Xoroshiro PRNG.
In the last couple of days I've been running lots of tests without the FPF tests in PractRand (After looking around PractRand's sources for something else I worked out how to remove selected tests). As a result I've been back testing with original Xoroshiro+ algorithm and the byte1 [8:1] scores are comparatively even less pretty now. Although, I don't know how much removing FPF tests has nerf'd PractRand's testing.
Chip,
On the free running Xoroshiro128+, maybe the taps should be adjusted to no longer use bit1 of the summing output, ie: restricted to selections from [63:2].
In the last couple of days I've been running lots of tests without the FPF tests in PractRand (After looking around PractRand's sources for something else I worked out how to remove selected tests). As a result I've been back testing with original Xoroshiro+ algorithm and the byte1 [8:1] scores are comparatively even less pretty now. Although, I don't know how much removing FPF tests has nerf'd PractRand's testing.
Chip,
On the free running Xoroshiro128+, maybe the taps should be adjusted to no longer use bit1 of the summing output, ie: restricted to selections from [63:2].
In the last couple of days I've been running lots of tests without the FPF tests in PractRand (After looking around PractRand's sources for something else I worked out how to remove selected tests). As a result I've been back testing with original Xoroshiro+ algorithm and the byte1 [8:1] scores are comparatively even less pretty now. Although, I don't know how much removing FPF tests has nerf'd PractRand's testing.
Chip,
On the free running Xoroshiro128+, maybe the taps should be adjusted to no longer use bit1 of the summing output, ie: restricted to selections from [63:2].
We could have saddled up the one-trick parity pony for another ride!
Evan, that's good news as you know I asked about PractRand tests without FPF some time ago. What do the xoroshiro32++ scores look like?
I haven't been game enough to run any of those yet. I've also turned on -tf 2 option in PractRand and that pushes the scoring time out a lot. I guess I should turn that off again for the ++ scoring. The scores are close to max, except for the known weak cases, even with original algorithm.
... as you know I asked about PractRand tests without FPF some time ago.
It would be fair to say I had kept request pending in my mind. I certainly wasn't very happy with the alternative of forcing the testing beyond fail point and trying to deeper interpret PractRand's reports.
Okay, I had already started tonight ... third triple in the list is [5 2 6] and has always been one of the better ones - now, pretty much it's entire quad group is looking to be flat out max scores.
It's looking to be all honey and sunshine, it's all so perfect there's no single stand-out to pick.
I'll leave it running over night to see bland it really gets ...
Okay, I had already started tonight ... third triple in the list is [5 2 6] and has always been one of the better ones - now, pretty much it's entire quad group is looking to be flat out max scores.
It's looking to be all honey and sunshine, it's all so perfect there's no single stand-out to pick.
I'll leave it running over night to see bland it really gets ...
Any earlier failures or unusual/suspicious scores than the rest for [5,2,6,8] and [5,2,6,11] ?
And maybe for [5,2,6,3] and [5,2,6,7] and [5,2,6,9] ?
[5,2,6,0] should not be good, but I don't think you test d = 0.
[14,2,7,d] is the main quadruple of interest.
Any earlier failures or unusual/suspicious scores than the others for [14,2,7,8] ?
And maybe [14,2,7,7] and [14,2,7,9] ?
Comments
Apparently, there IS time to do a 64-bit addition at 160MHz, since OnSemi didn't complain about it when we gave them the Verilog that, among other things, implemented xoroshiro128+. It was actually the 16x16 multiplier in the cogs' ALUs that were the critical paths in the design. The FPGA hid it well, having hard multiplier blocks in 28nm.
I think that two 16-bit additions could be done, one after the other. I know that at 160MHz, there is NOT time for two 32-bit additions, one after the other, as this was a problem in the CORDIC solver that I had to work around by adding additional pipeline stages.
Is the FPF test in PractRand important for something like xoroshiro32+?
http://pracrand.sourceforge.net/Tests_engines.txt
The scores for all 16 bits appear to be worse than when testing fewer bits, e.g. 15.
EDIT:
This question has been answered already here:
http://forums.parallax.com/discussion/comment/1424224/#Comment_1424224
Hmm, reading that, I'm thinking there could be some sort of pattern forming when sampling is aligned with LSbit. I'll reorder the Word1/2 variants to be LSbit aligned rather than the current MSbit ... Nope, no quality change.
Chip, do you think there would be time in XORO32 for a second addition that uses the result of the first, for each of the two iterations? No need to calculate parity (no longer required) and the second sums would be the outputs.
I think so. What are you thinking about?
Just curious.
EDIT March 2018
Implementing xoroshiro++ was the real reason for my question.
Cool! When is he going to announce it?
Enjoy!
Mike
Might have been in the old days. Now a days I can never tell what my Win 10 machine is going to do next.
They could get much better and randomness by replacing all those Lava Lamps with Surface Pro 4's
At least the local guy was able to help things by setting the mouse pointer to be locally hosted. It was pure agony until he did that.
and the real fun begins when you from your remote desktop access another server via remote desktop, then you can watch how often programs redraw their buttons and borders!
interesting,
Mike
- Johannes, aka Ahle2, for starting the ball rolling.
- Chip's openness to ideas and proving of implementation.
- Tony's enthusiasm, engagement and taking the lead to find even better.
- Sebastiano Vigna, aka Seba, and David Blackman for the Xoroshiro algorithm.
- Chris Doty-Humphrey, aka Scro, for the wonderful PractRand!!!! and helpful comments. I would never have gone this far without PractRand.
- And on that note of just how much work was done with PractRand I'll have to credit the coincidental timing of AMD launching the Ryzen CPU when they did. Stepping up from an ageing dual-core to an octa-core was perfect for me. It provided excellent flexibility for testing hair brained ideas and just throwing brute MIPS at the job. At one moment of unrestrained task launching I even hit 29 GB of RAM usage ... and it came out fine. Something that would have needed a hard reset on the old system I suspect.
I just need to understand what to do exactly, and I'll do it.
Sounds like a major boon for productivity. Nothing could be smarter than moving your desktop to the other side of the world.
Okay, I'm ready to implement the improved XORO32, but I have some questions. I emailed you and TonyB_. I'm not clear on how this new scheme works when it comes to the adding.
Once this is implemented, I'll have a new release ready in a short time.
https://forums.parallax.com/discussion/168188/xoroshiro-random-number-generator
Chip,
On the free running Xoroshiro128+, maybe the taps should be adjusted to no longer use bit1 of the summing output, ie: restricted to selections from [63:2].
Done.
We could have saddled up the one-trick parity pony for another ride!
Evan, that's good news as you know I asked about PractRand tests without FPF some time ago. What do the xoroshiro32++ scores look like?
I'll get back ...
It would be fair to say I had kept request pending in my mind. I certainly wasn't very happy with the alternative of forcing the testing beyond fail point and trying to deeper interpret PractRand's reports.
It's looking to be all honey and sunshine, it's all so perfect there's no single stand-out to pick.
I'll leave it running over night to see bland it really gets ...
Any earlier failures or unusual/suspicious scores than the rest for [5,2,6,8] and [5,2,6,11] ?
And maybe for [5,2,6,3] and [5,2,6,7] and [5,2,6,9] ?
[5,2,6,0] should not be good, but I don't think you test d = 0.
[14,2,7,d] is the main quadruple of interest.
Any earlier failures or unusual/suspicious scores than the others for [14,2,7,8] ?
And maybe [14,2,7,7] and [14,2,7,9] ?