Hmm, I haven't quite got my head around it, but the FIFO fetch speed, 1 clock per instruction, is clearly a factor here. Total ticks ranges from a minimum of 309 to a maximum of 316, in increments of 1 for each extra NOP.
PS: I'm not sure how 309 is even possible. One of the above code comments must be wrong.
Doesn't adding a NOP also shift the memory address, which is eggbeater-LSB-aligned in time?
Another test would be to add your test NOPs while removing an earlier NOP, so as to keep the memory alignment of the REP under test constant.
It takes five clocks to get data back from the hub. ...
I've seen indications of that. The code below works as commented, but if I throw a couple of extra NOPs in after the CALL and before the GETCT, the total ticks increases by 6. This is obviously due to the number of instructions between the first and second FIFO reload.
A little more oddly, though, I can get increments of 1 tick as well. I have no idea how.
call #puts
getct ticks ' Importantly, the return from #puts reloads HubExec FIFO
rep @.endl,#20 ' 2 clks + (16 clks x 19 repeats) = 306 clocks
xoro32 state ' XORO32 ignores prior S value feed through
.endl
mov parm,0-0 ' final random value appears in S port
getct ticke ' 306 + 4 clks = 310 total clocks from GETCT to GETCT ($0136)
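For what it's worth, the arithmetic in those comments can be checked with a quick sketch. This is plain Python, not P2 code, and the function name is made up for illustration. It only reproduces the numbers as commented (2 clocks of setup, 16 clocks per pass, 19 passes, plus 4 clocks between the GETCTs); it does not model the FIFO or hub rotation, so it says nothing about why 309 can be observed.

```python
def rep_clocks(setup_clks, clks_per_pass, passes):
    """Clock count for a REP block, using the figures from the code comments."""
    return setup_clks + clks_per_pass * passes


# Figures taken directly from the comments in the listing above.
loop_clks = rep_clocks(setup_clks=2, clks_per_pass=16, passes=19)
total_clks = loop_clks + 4  # extra clocks from GETCT to GETCT

print(loop_clks, total_clks, hex(total_clks))
```

The total comes out as 310, which is $0136 in hex, so the comments are at least self-consistent; the measured minimum of 309 is one tick below that.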
I think you need to put an instruction after the XORO32 to utilize the upcoming S data. Well, maybe it doesn't matter.
Does that code run in the hub? If so, there will be timing differences due to ongoing hub rotation and code length differences.
XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.
Oh! And improving its timing would help?
Reordering the result hash to use the initial input state with first iterator output could be done. With the second iterator output only going back to the state.
At one time MUL was the slowest instruction. I'm a bit surprised that the 64-bit adder for xoroshiro128+ is quicker than two 16-bit adds for xoroshiro32++. Also, from page 1, is 180 MHz now likely to be the worst-case speed?
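To make the "two 16-bit adds" concrete, here is a rough Python model of one xoroshiro32++ iteration, with a helper that runs two iterations back to back the way a single XORO32 is described as doing in this thread. Treat it as a sketch under assumptions: the rotate/shift constants (13, 5, 10, 9) are what I believe the P2 uses but are not stated anywhere in this thread, the function names are made up, and how the two 16-bit outputs are packed into the 32-bit result is deliberately left out.

```python
MASK16 = 0xFFFF


def rotl16(x, k):
    """Rotate a 16-bit value left by k bits (0 < k < 16)."""
    return ((x << k) | (x >> (16 - k))) & MASK16


def xoroshiro32pp_step(s0, s1, a=13, b=5, c=10, d=9):
    """One iteration; constants a, b, c, d are assumed, not from this thread."""
    # The "++" output hash: two 16-bit additions, one feeding the other.
    result = (rotl16((s0 + s1) & MASK16, d) + s0) & MASK16
    # State update: XORs, shifts and rotates only, no adders.
    s1 ^= s0
    new_s0 = rotl16(s0, a) ^ s1 ^ ((s1 << b) & MASK16)
    new_s1 = rotl16(s1, c)
    return result, new_s0, new_s1


def double_step(s0, s1):
    """Two chained iterations; the second hash depends on the first state update."""
    r1, s0, s1 = xoroshiro32pp_step(s0, s1)
    r2, s0, s1 = xoroshiro32pp_step(s0, s1)
    return r1, r2, s0, s1


print(xoroshiro32pp_step(1, 0))
```

In this model the second iteration's output hash cannot start until the first iteration's state update is done, which is the kind of stacking that the reordering suggestion is trying to shorten.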
I figured out that you can think of chip timing like the baggage conveyor at the airport. If most everybody can stand 10 ft back from the conveyor, those that need to get in will be able to, with minimal hindrance. Having a person step further back will always help the general situation. It's not so much about him as it is about him posing less of an impediment to others.
It seems odd that a function that probably won't be used by 99% of the applications is the gating item on a chip. Most programs won't require the level of randomness that XORO32 provides. In fact, I can't think of any real applications that need it. Maybe Parallax can sell a variant of the P2 without XORO32 that can run at a faster clock rate.
The hub RAMs are the current speed limiters. XORO32 is right behind them, followed by a bunch of other dubious circuits. It's okay.
OK, thanks. I guess my brain missed the "right after the hub memories" part. Will a DE2 FPGA image be available? I would like to update my development tools, and run my tests on the latest version.
Comments
BTW: It produces the correct result at the end of the 20 iterations.
Okay. Good!
Hub execution explains the timing differences.
The timing code was later added explicitly to see how varied HubExec behaves.
REP works in hub for code compatibility, but the timing is brutal, as FIFO reloads must occur regularly, instead of just PC adjustments.
If you really needed to.
Yes, to skip the random numbers we don't like.
Use the new WAITX WC/WZ variant for inline jitter injection.
It might be best to use the original thread for the details of that:
https://forums.parallax.com/discussion/166176/random-lfsr-on-p2#latest
With adders, getting the sum out is much slower than carry propagation, when carry-select adders are available.
Yes, let's change XORO32 to cut the time down.
Is this necessary?
No, but it would help.
I seem to be having issues with V32, in particular JCTx events.
Code that runs fine in V31 misfires in V32.
Trying to isolate it now....
Remember that we changed the count events to trigger on the MSB of the difference, not on equality. Until the MSB situation is remedied, the event remains true.
One of these should have reacted to a CT event.
It's a one-off use of the CT event in my case, so it doesn't need clearing.
The JCTx should branch, but not clear the event. Only an ADDCTx that solves the MSB situation should clear the event.
Can you find out if it works the way I explained? It doesn't sound right. Maybe some fix is in order.
Ok, fixed it.
The value I was adding to CT was 80_000_000 * 30 (30 seconds).
In V32 I had to reduce it down to 26 seconds to avoid CT MSB difference.
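The 30 versus 26 second limit falls out of 32-bit wraparound, which a small Python model shows. The assumption here, which is mine and not confirmed in this thread, is that the event reads as true whenever bit 31 of (CT - target) is clear, i.e. the target looks like it is already in the past. The names are made up for illustration and the 80 MHz clock is the figure used above.

```python
MASK32 = 0xFFFFFFFF
CLOCK_HZ = 80_000_000  # clock rate from the post above


def ct_event_pending(ct, target):
    """Assumed V32 rule: event is true while the MSB of (CT - target) is clear."""
    diff = (ct - target) & MASK32
    return (diff >> 31) == 0


def misfires_immediately(seconds, ct=0):
    """True if a target set 'seconds' ahead of CT already reads as reached."""
    target = (ct + CLOCK_HZ * seconds) & MASK32
    return ct_event_pending(ct, target)


# Longest whole-second delay that keeps the offset below 2**31 ticks.
max_seconds = (2**31 - 1) // CLOCK_HZ

print(misfires_immediately(30), misfires_immediately(26), max_seconds)
```

Under that assumption, 80_000_000 * 30 is more than 2^31 ticks, so the target lands in the "already passed" half of the counter range and the event is true straight away, while 26 seconds is the longest whole-second delay that stays under the limit.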
Ok. Whew!
Do you feel the current MSB scheme works reasonably?
That change caught me out, but shouldn't be a problem if explained in the documentation.
Nothing to see here, move along (to further testing..)