There will be a way to do the jumping using a double iterator function. But it's not like there is a great need for this with a Xoroshiro32. It's just too small a state space. At 160 MHz and using the REP instruction, a cog can iterate the XORO32 instruction 64k states in 0.4096 ms. So, about 13.1 ms to produce those 32 seeds above.

Jump functions work with xoroshiro32. It is possible to jump by 64K or 128K states with only 32 single iterations plus some XORs according to the jump bit patterns:

JUMP_64K = $1A57604D = %00011010010101110110000001001101
JUMP_128K = $2FD3E1B9 = %00101111110100111110000110111001
Bit 31 is the first bit to test.

The jump function needs single-stepping of states which XORO32 can no longer do, unless it is modified. The easiest way, avoiding use of CZ opcode bits, would be to move XORO32 to an empty D,S slot with S selecting single or double iterations.

There will be a way to do the jumping using a double iterator function. But it's not like there is a great need for this with a Xoroshiro32. It's just too small a state space. At 160 MHz and using the REP instruction, a cog can iterate the XORO32 instruction 64k states in 0.4096 ms. So, about 13.1 ms to produce those 32 seeds above.

Any jump, no matter how big, would take under 2 µs at 160 MHz. Here's a good one:

The jump function is quite good fun, if nothing else.

I've iterated xoroshiro32+ for the full 2^32-1 period to check each bit of the sum has one more '1' than '0' (as zero state is invalid). This is guaranteed with the original algorithm but we have changed bit 0 and I tested the other bits anyway. Each of sum[16:0], the parity of sum[15:0] and the parity of sum[16:0] do indeed have one more '1'.

Clearly, you have some code or tool at your disposal for jump key generation for the regular version of Xoroshiro. I wonder what it would take to modify that to handle a double iterating version like XORO32?

BTW: I tested the REP idea and it functionally works without error and at estimated speed. The S port insertion at the ALU doesn't corrupt the working state.

Chip,
Many thanks for the CORDIC! It's so painless and convenient to use that you'd hardly know it wasn't in the Cog.

Thanks. With the built-in K-factor compensation, it really became nice to use. To preserve the compensation and meet timing for OnSemi, I had to add 16 stages to it. That seemed like a lot, but to have lost the compensation would have been huge. That thing does stuff (rotate, vector, log, exp) that would take very lengthy routines to accomplish. Having those things in hardware is like sci-fi, to me. You can put a few instructions together and do amazing things. We really need the silicon with analog I/O to demonstrate what can be done with the CORDIC.

Tony,
For me, the most use I'd put a key generator to is finding all the full-period triplets of larger word sizes, eg: Xoroshiro48. By the looks it can generate any arbitrary position in the state sequence.

Though, we'd have to prove behaviour of when jumping past the end of a given period.

With each CORDIC iteration, X and Y grow. With infinite iterations, you get a fixed growth factor of:

After maybe a dozen iterations, you are pretty much there. We've got 32 iterations.

Anyway, this growth needs to be compensated for by multiplying the X and Y results by ~0.607. I made a BASIC program to find at what stages we could subtract a right-shifted amount of the current value from the current value, in order to arrive at the right compensation. It took 16 of these simple subtractors, but they each needed a stage, as I couldn't stack up two 40-bit adders in a single clock cycle. The reason the adders are 40 bits, instead of 32, is so that 8 fractional bits can be maintained. With those fractional bits and rounding, you get binary-perfect results.

Huh, that one is not in the Wikipedia list of disambiguous K-Factors. I'll have a go at adding it ... I put an entry in the talk page instead. I think it would have needed a separate article otherwise.

Brute force and ignorance of how to do it properly. Choose a start state, then iterate xoroshiro32+ the desired jump size to get an end state. Using the start state, run the jump function and test whether the resulting jump state is the end state for every possible 32-bit pattern until a match is found or not.

If the pattern bits are shifted out one at a time, the jump function can exit early when the remaining bits are all zero to save time. Finding the 64K and 128K jump bit patterns was quick as they are fairly close to the starting bit pattern count of zero, however I couldn't find any matching 32-bit pattern for 256K and 512K jumps, which took quite a while to fail.

Clearly, you have some code or tool at your disposal for jump key generation for the regular version of Xoroshiro. I wonder what it would take to modify that to handle a double iterating version like XORO32?

The tool was MASM and the code x86 assembler. The jump function needs single-stepping through the states, as '1' bits could be anywhere in the bit pattern, not just the even bits. I think a double-iterating jump function is impossible. XORO32 that can double- and single-step is possible.

Well, the Prop can already do a 32-bit parity in a single instruction. So if we went back to the first incarnation of XORO, and have the second half of the state being forwarded on to the following instruction's S-port, then a XORO64 would work quite well.

Still don't have any way to find full-period triplets though.

Yep, but the whole state at once. Here's an example snippet for using it:

' iterate the state
xoro64 s0,s1 ' xoro64 directly updates s0 itself and
mov s1,0 ' passes s1 to the following instruction
' perform the hash (3 instructions with a shift to remove bit 0)
mov prn,s0
add prn,s1 wc ' primary summing
' extend the hash to 32 bits (+3 instructions)
bitc carry,#0 ' save the summed carry
or prn,#0 wc ' generate parity into C flag
testb carry,#0 xorc ' include the carry in the parity
' bitc prn,#0 ' replace bit 0 with the new parity
rcr prn,#1 ' or shift parity into bit 31

EDIT: Ah, I see, 'next' is next instruction.
EDIT2: Improved commenting.

Sebastiano Vigna kindly emailed me all the xoroshiro32 characteristic polynomials, essential info for creating jump functions. The weight is the number of terms in the polynomial and in theory close to 16 is preferable. Our best triplet [14,2,7] has weight of 11 and second best [15,3,6] has weight of 13.

I hadn't looked closely at all the full-period triplets before. For every [a,b,c] there is a corresponding [c,b,a] that shares the same characteristic polynomial. Knowing this in advance means brute force searches can be reduced by half.

Hey Tony, thanks for getting that. It perfectly matches our 84 brute forced findings. I only just really looked at it now.

And the recommendation of targeting only the ones of closely matching "degree" with word size will be valuable with what's coming up ... I've spoken with Sebastiano again and, same as you, I just had to ask and he immediately popped me a complete list of full-period candidates for a Xoroshiro64 iterator! Amazing stuff. It isn't nicely sorted in mirror pairs like your one was though.

Hmm, 250 in total, and restricting the candidate set to a degree range of 27-33 it is still 110 candidates. I could halve that by taking out all the mirrors but not too keen on doing that just yet.

Restricting the degree range further to 29-33 drops that to 66 candidates.

We need only two registers, for 32 bits of state and 32 bits of PRN.
Period is over 20000 years for six clocks at 160 MHz.

Only outstanding issue is parity for bit 0. XORO64 could report changed bit 0 from normal addition in the carry flag.

I think my XORO64 idea has not been understood fully. There is no need to save s1 separately because prn = s0 + s1. The three instructions above are the entire code if original sum[0] is acceptable. If XORO64 could flag when sum[0] and full parity are different (Z might be better choice than C), then code would increase to five instructions.

## Comments

9,956There will be a way to do the jumping using a double iterator function. But it's not like there is a great need for this with a Xoroshiro32. It's just too small a state space. At 160 MHz and using the REP instruction, a cog can iterate the XORO32 instruction 64k states in 0.4096 ms. So, about 13.1 ms to produce those 32 seeds above.

1,524Any jump, no matter how big, would take under 2 µs at 160 MHz. Here's a good one:

1,061,939,636 is 2^30 approx.

EDIT:

Changed big jump above.

1,524I've iterated xoroshiro32+ for the full 2^32-1 period to check each bit of the sum has one more '1' than '0' (as zero state is invalid). This is guaranteed with the original algorithm but we have changed bit 0 and I tested the other bits anyway. Each of sum[16:0], the parity of sum[15:0] and the parity of sum[16:0] do indeed have one more '1'.

9,956BTW: I tested the REP idea and it functionally works without error and at estimated speed. The S port insertion at the ALU doesn't corrupt the working state.

9,956Many thanks for the CORDIC! It's so painless and convenient to use that you'd hardly know it wasn't in the Cog.

13,076Thanks. With the built-in K-factor compensation, it really became nice to use. To preserve the compensation and meet timing for OnSemi, I had to add 16 stages to it. That seemed like a lot, but to have lost the compensation would have been huge. That thing does stuff (rotate, vector, log, exp) that would take very lengthy routines to accomplish. Having those things in hardware is like sci-fi, to me. You can put a few instructions together and do amazing things. We really need the silicon with analog I/O to demonstrate what can be done with the CORDIC.

9,9569,956For me, the most use I'd put a key generator to is finding all the full-period triplets of larger word sizes, eg: Xoroshiro48. By the looks it can generate any arbitrary position in the state sequence.

Though, we'd have to prove behaviour of when jumping past the end of a given period.

13,076https://en.wikipedia.org/wiki/CORDIC

With each CORDIC iteration, X and Y grow. With infinite iterations, you get a fixed growth factor of:

After maybe a dozen iterations, you are pretty much there. We've got 32 iterations.

Anyway, this growth needs to be compensated for by multiplying the X and Y results by ~0.607. I made a BASIC program to find at what stages we could subtract a right-shifted amount of the current value from the current value, in order to arrive at the right compensation. It took 16 of these simple subtractors, but they each needed a stage, as I couldn't stack up two 40-bit adders in a single clock cycle. The reason the adders are 40 bits, instead of 32, is so that 8 fractional bits can be maintained. With those fractional bits and rounding, you get binary-perfect results.

9,95613,076And don't forget about Circle-K and Special-K. Then there's K-Kauppa in Finland.

1,524Brute force and ignorance of how to do it properly. Choose a start state, then iterate xoroshiro32+ the desired jump size to get an end state. Using the start state, run the jump function and test whether the resulting jump state is the end state for every possible 32-bit pattern until a match is found or not.

If the pattern bits are shifted out one at a time, the jump function can exit early when the remaining bits are all zero to save time. Finding the 64K and 128K jump bit patterns was quick as they are fairly close to the starting bit pattern count of zero, however I couldn't find any matching 32-bit pattern for 256K and 512K jumps, which took quite a while to fail.

The tool was MASM and the code x86 assembler. The jump function needs single-stepping through the states, as '1' bits could be anywhere in the bit pattern, not just the even bits. I think a double-iterating jump function is impossible. XORO32 that can double- and single-step is possible.

9,9561,524XORO64using an empty D,S instruction slot and the same next S replacement trick as XORO32?Period is 2^64-1 and it will work!

13,076We can only write one register per instruction, not two.

1,524Yes, I know.

9,95613,0761,524Another instruction is required:

We need only two registers, for 32 bits of state and 32 bits of PRN.

Period is over 20000 years for six clocks at 160 MHz.

Only outstanding issue is parity for bit 0. XORO64 could report changed bit 0 from normal addition in the carry flag.

9,956Still don't have any way to find full-period triplets though.

13,0769,956EDIT: Ah, I see, 'next' is next instruction.

EDIT2: Improved commenting.

13,076And if a *real* random number is needed, you can always just do a GETRND, without any contingencies.

9,9569,956Hey Tony, thanks for getting that. It perfectly matches our 84 brute forced findings. I only just really looked at it now.

And the recommendation of targeting only the ones of closely matching "degree" with word size will be valuable with what's coming up ... I've spoken with Sebastiano again and, same as you, I just had to ask and he immediately popped me a complete list of full-period candidates for a Xoroshiro64 iterator! Amazing stuff. It isn't nicely sorted in mirror pairs like your one was though.

9,9561,524EDIT:

Written before previous two posts appeared.

9,956Restricting the degree range further to 29-33 drops that to 66 candidates.

1,5241,524I think my XORO64 idea has not been understood fully. There is no need to save s1 separately because prn = s0 + s1. The three instructions above are the entire code if original sum[0] is acceptable. If XORO64 could flag when sum[0] and full parity are different (Z might be better choice than C), then code would increase to five instructions.