New PUSH and POP functionality added for single hub cycle LMM and other hub transfers

Cluso99 · 2014-09-08 14:23

Bill,
Why do you need RDxxxx D,S WZ,WC to set the Z flag while also performing S++ ?
I can see the case for RDxxxx D,S WZ.
This keeps the instruction simpler to describe. No requirement for D=S or S=$1EF etc.

Since we have never used the WC modifier (can't think of any cases, and it doesn't perform to the manual spec anyway), I cannot see why this WC cannot change the behaviour of WZ.
BTW same applies to WRxxxx.

rjo__ · 2014-09-13 11:48

I finally got myself set up with Quartus 14 and had a chance to integrate the changes from the first post in this thread.
I made the changes to my P4X32aV. I counted the clocks in a PASM loop, with WC and without, and on my 4 Cog setup the results are almost exactly the same: 1604 clocks for 100 sequential reads with WC, and 1608 clocks without WC and with incrementing the pointer.

I am out the door again. So I can't do more right now. I didn't have a chance to play with this much. I only tested the timing.

Also, with my standard 4 Cog setup I can clock at 140MHz, but this fails with the PUSH/POP changes. 80MHz seemed fine.

Bill Henning · 2014-09-13 13:47

Because that is very useful for walking arrays or linked lists. Sorry for the delayed response, some urgent work has come in.

Cluso99 wrote: »

Bill,
Why do you need RDxxxx D,S WZ,WC to set the Z flag while also performing S++ ?
I can see the case for RDxxxx D,S WZ.
This keeps the instruction simpler to describe. No requirement for D=S or S=$1EF etc.

Since we have never used the WC modifier (can't think of any cases, and it doesn't perform to the manual spec anyway), I cannot see why this WC cannot change the behaviour of WZ.
BTW same applies to WRxxxx.

rogloh · 2014-09-13 20:24

rjo__ wrote: »

I finally got myself set up with Quartus 14 and had a chance to integrate the changes from the first post in this thread.
I made the changes to my P4X32aV. I counted the clocks in a PASM loop, with WC and without, and on my 4 Cog setup the results are almost exactly the same: 1604 clocks for 100 sequential reads with WC, and 1608 clocks without WC and with incrementing the pointer.

I am out the door again. So I can't do more right now. I didn't have a chance to play with this much. I only tested the timing.

Also, with my standard 4 Cog setup I can clock at 140MHz, but this fails with the PUSH/POP changes. 80MHz seemed fine.

@rjo__, I just want to note that this push/pop change is not going to speed up your access to the hub or increase maximum hub bandwidth in and of itself. It just combines an increment or decrement operation with the read or write instruction. That then allows certain critical loops to fit within a single hub cycle.
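To make the "fit within a single hub cycle" point concrete, here is a rough Python sketch of the loop timing. It assumes the stock 8-cog P1 figures (a hub slot every 16 clocks, 8 clocks for a hub instruction once it has its slot, 4 clocks for an ordinary instruction). The constant and function names are made up for illustration, and the numbers may differ on a modified P1V build.

```python
import math

HUB_WINDOW = 16  # clocks between hub slots for one cog (assumed: stock 8-cog P1)
HUB_OP = 8       # clocks for a hub instruction once it has its slot (assumed)
ALU_OP = 4       # clocks for an ordinary instruction such as ADD or DJNZ (assumed)

def clocks_per_iteration(alu_ops):
    """Steady-state clocks for a loop of one hub instruction plus alu_ops ordinary ones."""
    busy = HUB_OP + alu_ops * ALU_OP
    # The next hub instruction has to wait for the following slot, so the
    # loop period rounds up to a whole number of hub windows.
    return math.ceil(busy / HUB_WINDOW) * HUB_WINDOW

for n in range(5):
    print(n, clocks_per_iteration(n))
```

Under these assumptions, a loop with three ordinary instructions next to the hub access (say pointer ADD, one more instruction, and DJNZ) misses its slot and costs 32 clocks per pass. Folding the ADD into the hub instruction leaves two, and the loop drops to 16.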

Also I haven't had a chance to continue more with adding the LR thing, but am planning to get back to it when I can.

rjo__ · 2014-09-13 21:40

seems like it should... don't understand why it doesn't...

rogloh · 2014-09-13 23:31

rjo__ wrote: »

seems like it should... don't understand why it doesn't...

What are you doing inside your critical inner loop? Do you have example code? Are you copying data to/from hub RAM?

Update: I just re-read your post and think I probably understand your issue. If you were just doing something like this before...

WRLONG data, address
ADD address, #4
WRLONG data+1, address
ADD address, #4
WRLONG data+2, address
ADD address, #4
WRLONG data+3, address
ADD address, #4
WRLONG data+4, address
ADD address, #4

and then updated to use my Verilog and changed your code to something like this hoping to speed it up...

WRLONG data, address WC
WRLONG data+1, address WC
WRLONG data+2, address WC
WRLONG data+3, address WC
WRLONG data+4, address WC

then there will be no performance gain, because the hub is fully saturated in both cases. It's still one hub access to each COG every 16 clock cycles once you lock on to the initial hub cycle in your first access.
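A small Python model of the two sequences shows the same thing. This is only a sketch: it assumes a hub slot every 16 clocks, 8 clocks per hub instruction once in its slot, and 4 clocks per ADD, and the names `run`, `separate_add` and `auto_increment` are invented for this example.

```python
HUB_WINDOW = 16  # clocks between hub slots for one cog (assumed)
HUB_OP = 8       # clocks for a hub instruction once it has its slot (assumed)
ALU_OP = 4       # clocks for an ordinary instruction such as ADD (assumed)

def run(program):
    """Return total clocks for a list of 'hub' / 'alu' instructions."""
    t = 0
    for kind in program:
        if kind == "hub":
            # Stall until this cog's next hub slot (slots fall on multiples of HUB_WINDOW).
            t += -t % HUB_WINDOW
            t += HUB_OP
        else:
            t += ALU_OP
    return t

separate_add = ["hub", "alu"] * 5  # WRLONG followed by ADD address, #4
auto_increment = ["hub"] * 5       # WRLONG ... WC with the increment folded in

print(run(separate_add), run(auto_increment))  # 76 72
```

In this model the two versions differ only by the final ADD (4 clocks), because every WRLONG after the first waits for the next hub slot either way. That is the same 4-clock gap as the 1604 versus 1608 clocks measured earlier in the thread.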

Roger.

ozpropdev · 2014-09-13 23:49

@Rich
In your 4 cog hub access experiment, did you achieve the performance boost you required?
If not, by how much did you miss your target bandwidth?
brian

rjo__ · 2014-09-14 10:33

Roger,

Thanks for the explanation.

Brian,

At this point, I'm not really sure I have achieved anything other than getting a better understanding of hub access and a smaller footprint P1V, which compiles faster because it is smaller: 00:4:05 (AMD A8, with multiprocessing enabled in Quartus 14).

Along the way, some of my thinking has gotten marginally better about acquisition issues.

At this point, I'm fairly confident that (at least on the 4 Cog version) I am sitting somewhere around 7MHz (at 140MHz clkfreq) for a single cog.
Double that for interleaved cogs. That's pretty good. In previous experiments using an 80MHz Prop, with somewhat less than optimal logic, I was somewhere around 2MHz for one cog.
Chip has suggested that with proper constraints and tuning the P1V might be capable of 200MHz.

What we have now is random access reads and writes.

It seems reasonable to me that somehow, there must be a way to issue a start address and then either read or write sequentially from there... all the while saving a hub access.
But the fact that it seems "reasonable to me" doesn't mean that it is actually possible:)

If it were possible that change would boost useful single cog throughput from 7MHz to 8.75MHz at 140MHz clkfreq... and give 10MHz at 160MHz clkfreq.
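The arithmetic behind those figures can be sketched in Python. The 16 clocks per transfer is the hub window mentioned earlier in the thread. The 20 clocks per transfer is only inferred from the 7MHz at 140MHz figure, not something measured here, and `transfers_per_second` is a made-up helper name.

```python
def transfers_per_second(clkfreq_hz, clocks_per_transfer):
    """Transfers per second for one cog, given clocks spent per transfer."""
    return clkfreq_hz / clocks_per_transfer

print(transfers_per_second(140e6, 20) / 1e6)  # 7.0  (20 clocks/transfer is an inference)
print(transfers_per_second(140e6, 16) / 1e6)  # 8.75 (one transfer per 16-clock hub window)
print(transfers_per_second(160e6, 16) / 1e6)  # 10.0
```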

10MHz seems like a nice round number to shoot at:)

Rich
