Thanks Chip.
A long time ago you posted the basic clock processing for the P2. Not sure where that was and if it is even current. How does it work now?
Here's the P1 for example
I d S D e R
I d S D e R
Drawn that way, Prop2 is:
I SD e e R
I SD e e R
I SD e e R
EDIT: The S and D ALU ports of the second instruction can be fed straight from the first instruction completion - effectively a forced immediate addressing mode on the second instruction. Or the second instruction's SD operand substitution can be done after the first's first execution stage.
I'm assuming Chip prefers operand substitution where possible but what use does a random number have as anything other than an immediate number?
EDIT2: Chip did say the second execution stage is basically all mux'ing. I'm not sure the implications there. Does that mean that none of the prefixing instructions, eg: ALTS, can do operand substitution? I've long forgotten how those instructions work ... (Now I'm late for work!)
You might be amused to know, when only using 16-bits or less, the quality scores for our XORO32 doing the even-odd looping trick are generally a little better than straight incremental. There is a few 1 GB scores in the word0 column, and even some 4 GB and 8 GB scores in other columns too. It's also more of a mixed bag, [14 2 7] no longer looks so great for example.
For the testing, I modified the source to double iterate on every call to next(). Nothing else changed.
Remember when you swapped bits 1 and 9? [14,2,7] got worse but [15,3,6] got better. I wonder what the double-iteration score would be with that swapping?
Evan, could you please try a double-iteration test with bits 1 and 9 swapped? Also it would be good to add an lsBit test to this to see how bit 0 performs.
I think this will be the last test, I hope so anyway. PractRand has been an excellent tool for comparing triplets and [14,2,7] has always been the one in the Verilog, however XORO32 has changed and I think [15,3,6] is the better choice now. That's the only modification XORO32 really needs.
In that case, I'd recommend also moving the parity bit to the most significant bit position.
My impression is, without subsequent summing, a final bit shuffle is only hiding things from PractRand. I doubt it's a quality improvement.
To change the subject a little, that last odd-even scoring also had a lowly 32MB score for [15 3 6] as well as [14 2 7]. Tony is picking the score columns that will likely be used in production code but there is many more possible sampling variants that can show differing quality.
I don't know how relevant such aligned scores are to the true quality of an algorithm. Maybe what Tony is picking out is the right answer.
The question isn't all that rigidly defined is it.
The question is one of over-all quality. Doing lots of sampling variations like I've been doing, is there ever going to be a flat set of scores for a given algorithm? Or should we only expect to settle for finding a combination and config that makes the common uses produce good scores?
Evan, could you please try a double-iteration test with bits 1 and 9 swapped? Also it would be good to add an lsBit test to this to see how bit 0 performs.
Ouch, the whole set of candidates is taking up to an hour to run now! That increase will be solely down to the average increase in score sizes. Those old times, ~12 minutes, must have been just before we'd found the full-width parity improvement.
I'll do some more configurations tomorrow if Scro doesn't tell me I'm wasting my time.
I don't know what you'd be doing to wasting your time or not waste your time.
But if your trying xoroshiro32plus with a parity bit replacing one bit of output, could you instead have the parity bit act as the carry-in on the addition that produces the output? That might be better, maybe.
I'll do some more configurations tomorrow if Scro doesn't tell me I'm wasting my time.
I don't know what you'd be doing to wasting your time or not waste your time.
But if your trying xoroshiro32plus with a parity bit replacing one bit of output, could you instead have the parity bit act as the carry-in on the addition that produces the output? That might be better, maybe.
The idea of using the parity is a bit of a novelty and it's probably not easy to see in all our previous discussions where the parity comes from. It's the parity of the full 17-bit result after the addition of the two 16-bit halves of the state.
Evan, could you please try a double-iteration test with bits 1 and 9 swapped? Also it would be good to add an lsBit test to this to see how bit 0 performs.
Ouch, the whole set of candidates is taking up to an hour to run now! That increase will be solely down to the average increase in score sizes. Those old times, ~12 minutes, must have been just before we'd found the full-width parity improvement.
Many thanks for today's results, Evan. I think you have done enough tests.
Here is a summary of the most important scores for the top two triplets with three combinations of sum and parity:
Word size is 16 bits. Parity as msb has the worst scores but was a convenient way of doing a parity-only test and the top bit here has been copied to the bottom bit of the other two sets. Swapping bits 1 and 9 has worse scores overall than not swapping which leaves parity as lsb as the best option and here it is again on its own:
Top Bot Top Bot
Triplet Word Byte Byte Bit Bit Iter. PRN[15:0] =
------------------------------------------------------------------
[14,2,7] 512M 1G 512M 1G 1G single
[15,3,6] 512M 512M 1G 1G 1G "
sum[15:1],parity
[14,2,7] 512M 256M 1G 1G 1G double
[15,3,6] 1G 1G 2G 1G 1G "
------------------------------------------------------------------
XORO32 does a double iteration and concatenates two 16-bit sum & parity results. Treating those as one long PRN, [14,2,7] is better in the top bytes and [15,3,6] better in the bottom bytes but otherwise they score the same.
If the best possible word or byte PRN is wanted from each XORO32 instruction, then [15,3,6] is clearly superior. The word score of 1G is the best since using all 16 bits and the bottom byte score of 2G is the highest we have had.
We have always relied on the PractRand scores and they are saying choose [15,3,6].
I'll do some more configurations tomorrow if Scro doesn't tell me I'm wasting my time.
I don't know what you'd be doing to wasting your time or not waste your time.
The main attempts right now are shuffling the summing output bit order. This is the question I was trying to articulate earlier - Is there is any real quality advantage to a fixed post-summing shuffle? Or are we just tricking PrandRand?
I'll do some more configurations tomorrow if Scro doesn't tell me I'm wasting my time.
I don't know what you'd be doing to wasting your time or not waste your time.
The main attempts right now are shuffling the summing output bit order. This is the question I was trying to articulate earlier - Is there is any real quality advantage to a fixed post-summing shuffle? Or are we just tricking PrandRand?
Shuffling bit order in the final output? Yeah, that sounds pretty suspicious. Exactly how useless that is depends upon what test failures you're managing to avoid that way, but in general it sounds like a bad idea.
Many thanks for today's results, Evan. I think you have done enough tests.
The thing is, the shuffling is pretty arbitrary. The original bit9<->bit1 swap, intentionally 8 bits apart, was just to test if PracrRand can be fooled - which seems easily done.
If we're going to start down this path we should do a decent amount of it to see what else comes up.
Many thanks for today's results, Evan. I think you have done enough tests.
The thing is, the shuffling is pretty arbitrary. The original bit9<->bit1 swap, intentionally 8 bits apart, was just to test if PracrRand can be fooled - which seems easily done.
If we're going to start down this path we should do a decent amount of it to see what else comes up.
I think you have done enough tests for the decision about switching to [15,3,6]. Someone should double check my summary, though.
You could try swapping bits 1 and 8/10/11/12/13/14 but swapping bits 1 and 9 made things worse overall.
Tony,
You are somewhat cherry picking. There is other triplets that also look better than either of those two when the bits are shuffled appropriately.
Tony,
You are somewhat cherry picking. There is other triplets that also look better than either of those two when the bits are shuffled appropriately.
I chose the triplets that performed best in the three sets of tests that you did as listed in my summary post above. However, in my view the most important set is the first one and we need the best triplet for both single and double iterations when bit 0 is parity, i.e. just a single bit change from the original algorithm. I don't like the idea of shuffling bits about after the addition unless that gives better scores, in which case there is probably something wrong with the algorithm. What I am saying needs checking but I'm confident that [15,3,6] scores the best.
Jump functions work with xoroshiro32. It is possible to jump by 64K or 128K states with only 32 single iterations plus some XORs according to the jump bit patterns:
JUMP_64K = 1A57604D
JUMP_128K = 2FD3E1B9
Bit 31 is the first bit to test.
The jump function needs single-stepping of states which XORO32 can no longer do, unless it is modified. The easiest way, avoiding use of CZ opcode bits, would be to move XORO32 to an empty D,S slot with S selecting single or double iterations.
Here are the first 32 * 64K jumps from state 00000001:
Comments
You'll need a wide display to see this properly:
clk _________------------____________------------____________------------____________------------____________------------____________------------____________- | | | | | | | rdRAM Ib |-------+ | rdRAM Ic |-------+ | rdRAM Id |-------+ | rdRAM Ie | | | | | | | | | | | latch Da |---+ +----> rdRAM Db |------------> latch Db |---+ +----> rdRAM Dc |------------> latch Dc |---+ +----> rdRAM Dd |------------> latch Dd | latch Sa |---+ +----> rdRAM Sb |------------> latch Sb |---+ +----> rdRAM Sc |------------> latch Sc |---+ +----> rdRAM Sd |------------> latch Sd | latch Ia |---+ +----> latch Ib |------------> latch Ib |---+ +----> latch Ic |------------> latch Ic |---+ +----> latch Id |------------> latch Id | | | | | | | | | | | | +------------------ALU-----------> wrRAM Ra | +------------------ALU-----------> wrRAM Rb | +------------------ALU-----------> wrRAM Rc | | | | | | | | | | stall/done = 'gox' | | stall/done = 'gox' | | stall/done = 'gox' | | 'get' | done = 'go' | 'get' | done = 'go' | 'get' | done = 'go' |I SD e e R I SD e e R I SD e e REDIT: The S and D ALU ports of the second instruction can be fed straight from the first instruction completion - effectively a forced immediate addressing mode on the second instruction. Or the second instruction's SD operand substitution can be done after the first's first execution stage.I'm assuming Chip prefers operand substitution where possible but what use does a random number have as anything other than an immediate number?
EDIT2: Chip did say the second execution stage is basically all mux'ing. I'm not sure the implications there. Does that mean that none of the prefixing instructions, eg: ALTS, can do operand substitution? I've long forgotten how those instructions work ... (Now I'm late for work!)
I SDe R I SDe R I SDe REvan, could you please try a double-iteration test with bits 1 and 9 swapped? Also it would be good to add an lsBit test to this to see how bit 0 performs.
I think this will be the last test, I hope so anyway. PractRand has been an excellent tool for comparing triplets and [14,2,7] has always been the one in the Verilog, however XORO32 has changed and I think [15,3,6] is the better choice now. That's the only modification XORO32 really needs.
My impression is, without subsequent summing, a final bit shuffle is only hiding things from PractRand. I doubt it's a quality improvement.
To change the subject a little, that last odd-even scoring also had a lowly 32MB score for [15 3 6] as well as [14 2 7]. Tony is picking the score columns that will likely be used in production code but there is many more possible sampling variants that can show differing quality.
I don't know how relevant such aligned scores are to the true quality of an algorithm. Maybe what Tony is picking out is the right answer.
Scro, is there a rigid answer to this question?
EDIT: Typo
The question is one of over-all quality. Doing lots of sampling variations like I've been doing, is there ever going to be a flat set of scores for a given algorithm? Or should we only expect to settle for finding a combination and config that makes the common uses produce good scores?
Here's the earlier scoring with parity at msBit position - http://forums.parallax.com/discussion/comment/1424022/#Comment_1424022
Scoring with double-iterating, parity at msBit and exchanged bit8 with bit0:
I don't know what you'd be doing to wasting your time or not waste your time.
But if your trying xoroshiro32plus with a parity bit replacing one bit of output, could you instead have the parity bit act as the carry-in on the addition that produces the output? That might be better, maybe.
The idea of using the parity is a bit of a novelty and it's probably not easy to see in all our previous discussions where the parity comes from. It's the parity of the full 17-bit result after the addition of the two 16-bit halves of the state.
Many thanks for today's results, Evan. I think you have done enough tests.
Here is a summary of the most important scores for the top two triplets with three combinations of sum and parity:
Top Bot Top Bot Triplet Word Byte Byte Bit Bit Iter. PRN[15:0] = ------------------------------------------------------------------ [14,2,7] 512M 1G 512M 1G 1G single [15,3,6] 512M 512M 1G 1G 1G " sum[15:1],parity [14,2,7] 512M 256M 1G 1G 1G double [15,3,6] 1G 1G 2G 1G 1G " ------------------------------------------------------------------ [14,2,7] 512M 256M 512M 1G 1G single [15,3,6] 512M 512M 1G 1G 1G " sum[15:10],sum[1],sum[8:2],sum[9],parity [14,2,7] 512M 512M 512M 1G 1G double [15,3,6] 1G 2G 256M 1G 1G " ------------------------------------------------------------------ [14,2,7] 512M 256M 128M 1G ? single [15,3,6] 512M 512M 128M 1G ? " parity,sum[15:1] [14,2,7] 512M 64M 32M 1G ? double [15,3,6] 512M 1G 128M 1G ? " ------------------------------------------------------------------Word size is 16 bits. Parity as msb has the worst scores but was a convenient way of doing a parity-only test and the top bit here has been copied to the bottom bit of the other two sets. Swapping bits 1 and 9 has worse scores overall than not swapping which leaves parity as lsb as the best option and here it is again on its own:
Top Bot Top Bot Triplet Word Byte Byte Bit Bit Iter. PRN[15:0] = ------------------------------------------------------------------ [14,2,7] 512M 1G 512M 1G 1G single [15,3,6] 512M 512M 1G 1G 1G " sum[15:1],parity [14,2,7] 512M 256M 1G 1G 1G double [15,3,6] 1G 1G 2G 1G 1G " ------------------------------------------------------------------XORO32 does a double iteration and concatenates two 16-bit sum & parity results. Treating those as one long PRN, [14,2,7] is better in the top bytes and [15,3,6] better in the bottom bytes but otherwise they score the same.
If the best possible word or byte PRN is wanted from each XORO32 instruction, then [15,3,6] is clearly superior. The word score of 1G is the best since using all 16 bits and the bottom byte score of 2G is the highest we have had.
We have always relied on the PractRand scores and they are saying choose [15,3,6].
If we're going to start down this path we should do a decent amount of it to see what else comes up.
And our output is still {sum[15:1], ^sum[16:0]}, right?
wire [15:0] xoro32z = d[31:16] ^ d[15:0]; // first iteration wire [31:0] xoro32y = { xoro32z[9:0], xoro32z[15:10], {d[0], d[15:1]} ^ {xoro32z[12:0], 3'b0} ^ xoro32z }; wire [15:0] xoro32x = xoro32y[31:16] ^ xoro32y[15:0]; // second iteration wire [31:0] xoro32 = { xoro32x[9:0], xoro32x[15:10], // xoro32 = d result {xoro32y[0], xoro32y[15:1]} ^ {xoro32x[12:0], 3'b0} ^ xoro32x }; wire [16:0] xoro32a = xoro32y[31:16] + xoro32y[15:0]; // first sum wire [16:0] xoro32b = xoro32[31:16] + xoro32[15:0]; // second sum assign xoro32r = { xoro32b[15:1], ^xoro32b, // xoro32r = prng result, next instruction's s value xoro32a[15:1], ^xoro32a };I think you have done enough tests for the decision about switching to [15,3,6]. Someone should double check my summary, though.
You could try swapping bits 1 and 8/10/11/12/13/14 but swapping bits 1 and 9 made things worse overall.
Yes, yes and yes, I reckon.
I'll run my prog to get some states and sums when seed is $0000_0001.
Do these look correct?
You are somewhat cherry picking. There is other triplets that also look better than either of those two when the bits are shuffled appropriately.
Yes, my first 32 results:
I chose the triplets that performed best in the three sets of tests that you did as listed in my summary post above. However, in my view the most important set is the first one and we need the best triplet for both single and double iterations when bit 0 is parity, i.e. just a single bit change from the original algorithm. I don't like the idea of shuffling bits about after the addition unless that gives better scores, in which case there is probably something wrong with the algorithm. What I am saying needs checking but I'm confident that [15,3,6] scores the best.
Links to test results:
Bit 0 is parity, single iteration
http://forums.parallax.com/discussion/comment/1423900/#Comment_1423900
http://forums.parallax.com/discussion/comment/1423918/#Comment_1423918 (bottom byte added)
http://forums.parallax.com/discussion/comment/1424022/#Comment_1424022 (bit 0 test)
Bit 0 is parity, double iteration
http://forums.parallax.com/discussion/comment/1424903/#Comment_1424903
Bit 0 is parity, bits 1 and 9 swapped, single iteration
http://forums.parallax.com/discussion/comment/1423918/#Comment_1423918
Bit 0 is parity, bits 1 and 9 swapped, double iteration
http://forums.parallax.com/discussion/comment/1425052/#Comment_1425052
Bit 15 is parity, single iteration
http://forums.parallax.com/discussion/comment/1424022/#Comment_1424022
Bit 15 is parity, double iteration
http://forums.parallax.com/discussion/comment/1425053/#Comment_1425053
The jump function needs single-stepping of states which XORO32 can no longer do, unless it is modified. The easiest way, avoiding use of CZ opcode bits, would be to move XORO32 to an empty D,S slot with S selecting single or double iterations.
Here are the first 32 * 64K jumps from state 00000001:
EDIT:
Added 64K jumps.