Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i

evanh · 2018-04-01 01:59

Hmm, haven't quite got my head around it, but it's clearly the FIFO fetching speed, 1 clock per instruction, that is a factor here. Total ticks range, in increments of 1 for each extra NOP, from 309 minimum to 316 maximum.

PS: I'm not sure how 309 is even possible. One of the above code comments must be wrong.

jmg · 2018-04-01 02:03

evanh wrote: »

Hmm, haven't quite got my head around it, but it's clearly the FIFO fetching speed, 1 clock per instruction, that is a factor here. Total ticks range, in increments of 1, from 309 minimum to 316 maximum.

Doesn't adding a NOP also shift the memory address, which is eggbeater LSB-aligned in time?
Another test would be to add your test NOPs while removing an earlier NOP, so as to keep the memory alignment of the REP-under-test constant?

cgracey · 2018-04-01 02:04

evanh wrote: »
cgracey wrote: »

It takes five clocks to get data back from the hub. ...

I've seen indications of that. The below code works as commented but if I throw an extra couple of NOPs after the CALL, but before the GETCT, the total ticks increases by 6. This obviously is due to the number of instructions between first and second reload.

A little more oddly, though, I can get increments of 1 tick as well. I have no idea how.
		call    #puts
		getct   ticks               ' Importantly, the return from #puts reloads HubExec FIFO

		rep     @.endl,#20          ' 2 clks + (16 clks x 19 repeats) = 306 clocks
		xoro32  state               ' XORO32 ignores prior S value feed through
.endl
		mov     parm,0-0            ' final random value appears in S port
		getct   ticke               ' 306 + 4 clks = 310 total clocks from GETCT to GETCT ($0136)

I think you need to put an instruction after the XORO32 to utilize the upcoming S data. Well, maybe it doesn't matter.

Does that code run in the hub? If so, there will be timing differences due to ongoing hub rotation and code length differences.
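
For reference, the arithmetic in the code comments above can be checked with a few lines of Python. This is only a restatement of the commented figures (2 clocks for REP, 16 clocks per repeat, 19 repeats, 4 clocks for the trailing instructions) next to the measured 309 to 316 range reported earlier in the thread. It does not model the FIFO or hub rotation.

```python
# Figures taken from the code comments; this is not a model of the hardware.
REP_SETUP = 2          # clocks charged to REP itself
CLKS_PER_REPEAT = 16   # HubExec cost per loop pass, per the comment
REPEATS = 19           # repeat count used in the comment (REP is given #20)
TAIL = 4               # clocks from loop exit to the second GETCT

loop_clocks = REP_SETUP + CLKS_PER_REPEAT * REPEATS
expected_total = loop_clocks + TAIL

observed = range(309, 317)   # measured totals: 309 minimum, 316 maximum

print(loop_clocks, expected_total, hex(expected_total))
print(expected_total in observed, len(observed))
```

The commented total of 310 ($0136) falls inside the measured range, and that range spans eight distinct values.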

evanh · 2018-04-01 02:14

cgracey wrote: »

I think you need to put an instruction after the XORO32 to utilize the upcoming S data.

Does that code run in the hub?

That was the original idea for writing that. I wanted to know if the S port overwriting did anything weird. Yes, it is HubExec.

BTW: It produces the correct result at the end of the 20 iterations.

cgracey · 2018-04-01 02:16

evanh wrote: »

cgracey wrote: »

I think you need to put an instruction after the XORO32 to utilize the upcoming S data.

Does that code run in the hub?

That was the original idea for writing that. I wanted to know if the S port overwriting did anything weird. Yes, it is HubExec.

BTW: It produces the correct result at the end of the 20 iterations.

Okay. Good!

Hub execution explains the timing differences.

evanh · 2018-04-01 02:16

cgracey wrote: »

If so, there will be timing differences due to ongoing hub rotation and code length differences.

The timing code was later added explicitly to see how varied HubExec behaves.

TonyB_ · 2018-04-01 02:19

So this means we can iterate the xoroshiro state at double speed?

		xoro32  state
		xoro32  state
		xoro32  state
		...
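
A rough Python sketch of what back-to-back XORO32 instructions would do to the state, assuming each XORO32 advances a xoroshiro32++ generator by two iterations. The constants (13, 5, 10, 9), the state packing, and the names `step` and `xoro32` are assumptions for illustration and may not match this FPGA version.

```python
MASK16 = 0xFFFF
A, B, C, D = 13, 5, 10, 9   # assumed xoroshiro32++ constants

def rotl16(x, k):
    """Rotate a 16-bit value left by k bits."""
    return ((x << k) | (x >> (16 - k))) & MASK16

def step(s0, s1):
    """One xoroshiro32++ iteration: returns (output, new_s0, new_s1)."""
    out = (rotl16((s0 + s1) & MASK16, D) + s0) & MASK16   # two adds in series
    s1 ^= s0
    new_s0 = rotl16(s0, A) ^ s1 ^ ((s1 << B) & MASK16)
    new_s1 = rotl16(s1, C)
    return out, new_s0, new_s1

def xoro32(state):
    """Hypothetical model of one XORO32: two iterations on a packed 32-bit state."""
    s0, s1 = state & MASK16, state >> 16      # assumed packing: s0 low, s1 high
    lo, s0, s1 = step(s0, s1)
    hi, s0, s1 = step(s0, s1)
    return (hi << 16) | lo, (s1 << 16) | s0   # (result, new state)

state = 1
for _ in range(3):          # three back-to-back XORO32s, results discarded
    _, state = xoro32(state)
print(hex(state))
```

Under these assumptions, three XORO32s in a row leave the state six iterations on, with the intermediate results simply dropped.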

cgracey · 2018-04-01 02:19

Try running that code in the Cog. It will probably go 8 times faster.

REP works in hub for code compatibility, but the timing is brutal, as FIFO reloads must occur regularly, instead of just PC adjustments.
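
As a back-of-envelope check on the "8 times faster" estimate: the 16 clocks per pass comes from the HubExec code comment earlier in the thread, while the 2 clocks per pass for cog execution is an assumption, not a figure stated here.

```python
HUB_CLKS_PER_PASS = 16  # from the HubExec code comment earlier in the thread
COG_CLKS_PER_PASS = 2   # assumed cost of a lone XORO32 under REP in cog RAM

speedup = HUB_CLKS_PER_PASS / COG_CLKS_PER_PASS
print(speedup)
```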

cgracey · 2018-04-01 02:21

TonyB_ wrote: »
So this means we can iterate the xoroshiro state at double speed?
		xoro32  state
		xoro32  state
		xoro32  state
		...

If you really needed to.

evanh · 2018-04-01 02:23

cgracey wrote: »

Try running that code in the Cog. It will probably go 8 times faster.

REP works in hub for code compatibility, but the timing is brutal, as FIFO reloads must occur regularly, instead of just PC adjustments.

Then it won't be telling me the worst case any longer.

cgracey · 2018-04-01 02:24

XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.

TonyB_ · 2018-04-01 02:25

cgracey wrote: »
TonyB_ wrote: »
So this means we can iterate the xoroshiro state at double speed?
		xoro32  state
		xoro32  state
		xoro32  state
		...
If you really needed to.

Yes, to skip the random numbers we don't like.

cgracey · 2018-04-01 02:26

evanh wrote: »

cgracey wrote: »

Try running that code in the Cog. It will probably go 8 times faster.

REP works in hub for code compatibility, but the timing is brutal, as FIFO reloads must occur regularly, instead of just PC adjustments.

Then it won't be telling me the worst case any longer.

Use the new WAITX WC/WZ variant for inline jitter injection.

evanh · 2018-04-01 02:29

cgracey wrote: »

XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.

Oh! And improving its timing would help?

Reordering the result hash to use the initial input state with first iterator output could be done. With the second iterator output only going back to the state.

TonyB_ · 2018-04-01 02:41

evanh wrote: »

cgracey wrote: »

XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.

Oh! And improving its timing would help?

Reordering the result hash to use the initial input state with first iterator output could be done. With the second iterator output only going back to the state.

It might be best to use the original thread for the details of that:
https://forums.parallax.com/discussion/166176/random-lfsr-on-p2#latest

At one time MUL was the slowest instruction. I'm a bit surprised that the 64-bit adder for xoroshiro128+ is quicker than two 16-bit adds for xoroshiro32++. Also, from page 1, is 180 MHz now likely to be the worst-case speed?

evanh · 2018-04-01 02:47

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

Try running that code in the Cog. It will probably go 8 times faster.

Then it won't be telling me the worst case any longer.

Use the new WAITX WC/WZ variant for inline jitter injection.

I mean I'm measuring HubExec performance. I wasn't trying to make the Prop2 do work.

cgracey · 2018-04-01 02:51

Yes, 180MHz should be easy to achieve.

With adders, getting the sum out is much slower than carry propagation, when carry-select adders are available.

Yes, let's change XORO32 to cut the time down.

Seairth · 2018-04-01 03:18

cgracey wrote: »

Yes, let's change XORO32 to cut the time down.

Is this necessary?

cgracey · 2018-04-01 03:50

Seairth wrote: »

cgracey wrote: »

Yes, let's change XORO32 to cut the time down.

Is this necessary?

No, but it would help.

I figured out that you can think of chip timing like the baggage conveyor at the airport. If most everybody can stand 10 ft back from the conveyor, those who need to get in will be able to, with minimal hindrance. Having a person step further back will always help the general situation. It's not so much about him as it is about him posing less impedance to others.

ozpropdev · 2018-04-01 04:13

Chip
I seem to be having issues with V32, in particular JCTx events.
Code that runs fine in V31 misfires in V32.
Trying to isolate it now....

cgracey · 2018-04-01 04:29

ozpropdev wrote: »

Chip
I seem to be having issues with V32, in particular JCTx events.
Code that runs fine in V31 misfires in V32.
Trying to isolate it now....

Remember that we changed the count events to trigger on MSB of difference, not equality. Until the MSB situation is remedied, the event remains true.

cgracey · 2018-04-01 04:42

After the JCTx, you need to do the ADDCTx to clear the event.

ozpropdev · 2018-04-01 05:12

cgracey wrote: »

Remember that we changed the count events to trigger on MSB of difference, not equality. Until the MSB situation is remedied, the event remains true.

I tried a JNCTx instead, same result as JCTx.
One of these should have reacted to a CT event.

cgracey wrote: »

After the JCTx, you need to do the ADDCTx to clear the event.

It's a one-off use of the CT event in my case, doesn't need clearing.

cgracey · 2018-04-01 05:37

ozpropdev wrote: »

cgracey wrote: »

Remember that we changed the count events to trigger on MSB of difference, not equality. Until the MSB situation is remedied, the event remains true.

I tried a JNCTx instead, same result as JCTx.
One of these should have reacted to a CT event.

cgracey wrote: »

After the JCTx, you need to do the ADDCTx to clear the event.

It's a one-off use of the CT event in my case, doesn't need clearing.

The JCTx should branch, but not clear the event. Only an ADDCTx that solves the MSB situation should clear the event.

Can you find out if it works the way I explained? It doesn't sound right. Maybe some fix is in order.

ozpropdev · 2018-04-01 05:38

cgracey wrote: »

Remember that we changed the count events to trigger on MSB of difference, not equality. Until the MSB situation is remedied, the event remains true.

Ok, fixed it.
The value I was adding to CT was 80_000_000 * 30 (30 seconds).
In V32 I had to reduce it down to 26 seconds to avoid CT MSB difference.
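
Why 30 seconds fails and 26 works can be sketched in Python. This assumes the event is true whenever bit 31 of the 32-bit difference (CT minus target) is clear; the exact polarity inside the hardware is an assumption, and `ct_event` and `armed_ok` are illustrative names, not P2 instructions.

```python
MASK32 = 0xFFFFFFFF
CLOCK_HZ = 80_000_000   # clock rate implied by 80_000_000 * 30 = 30 seconds

def ct_event(ct, target):
    """Assumed model: event is true while bit 31 of (ct - target) is clear."""
    return ((ct - target) & MASK32) >> 31 == 0

def armed_ok(delay_seconds, ct_now=0):
    """True if the event is still false right after the target is set."""
    target = (ct_now + delay_seconds * CLOCK_HZ) & MASK32
    return not ct_event(ct_now, target)

print(armed_ok(30))        # 2_400_000_000 ticks ahead: beyond 2**31
print(armed_ok(26))        # 2_080_000_000 ticks ahead: within 2**31
print(2**31 / CLOCK_HZ)    # approximate upper limit on the delay, in seconds
```

With these assumptions, a 30 second target looks like it is already in the past, so the event reads true at once, while anything up to roughly 26.8 seconds at 80 MHz behaves as a normal wait.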

cgracey · 2018-04-01 06:00

ozpropdev wrote: »

cgracey wrote: »

Remember that we changed the count events to trigger on MSB of difference, not equality. Until the MSB situation is remedied, the event remains true.

Ok, fixed it.
The value I was adding to CT was 80_000_000 * 30 (30 seconds).
In V32 I had to reduce it down to 26 seconds to avoid CT MSB difference.

Ok. Whew!

Do you feel the current MSB scheme works reasonably?

ozpropdev · 2018-04-01 06:33

cgracey wrote: »

Do you feel the current MSB scheme works reasonably?

I think it's Ok.
That change caught me out, but shouldn't be a problem if explained in the documentation.

Nothing to see here, move along (to further testing..)

Dave Hein · 2018-04-01 14:02

cgracey wrote: »

XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.

It seems odd that a function that probably won't be used by 99% of the applications is the gating item on a chip. Most programs won't require the level of randomness that XORO32 provides. In fact, I can't think of any real applications that need it. Maybe Parallax can sell a variant of the P2 without XORO32 that can run at a faster clock rate.

cgracey · 2018-04-01 14:44

Dave Hein wrote: »

cgracey wrote: »

XORO32 is the critical path in the actual silicon, right after the hub memories. This is because of the stacked 16-bit adders.

It seems odd that a function that probably won't be used by 99% of the applications is the gating item on a chip. Most programs won't require the level of randomness that XORO32 provides. In fact, I can't think of any real applications that need it. Maybe Parallax can sell a variant of the P2 without XORO32 that can run at a faster clock rate.

The hub RAMs are the current speed limiters. XORO32 is right behind them, followed by a bunch of other dubious circuits. It's okay.

Dave Hein · 2018-04-01 16:25

OK, thanks. I guess my brain missed the "right after the hub memories" part. Will a DE2 FPGA image be available? I would like to update my development tools, and run my tests on the latest version.
