P2 vs modern process limits

jmg · 2016-05-04 22:50

cgracey wrote: »

That's it. Each cog will be able to r/w the next cog's LUT via its second port that the streamer uses. The other cog will have priority over the local cog's streamer.

Sounds cool. Does this place & route ok and meet timing ok ?

Tubular · 2016-05-04 22:56

Ok, that's neat. So it's not the DAA instruction after all : )

At the end, once timing closes OK, can you take a look at shorter output pulses to match the streamer clock when streaming above SysClk/2?

Thanks for all your efforts Chip. Really looking forward to it.

jmg · 2016-05-04 23:17

cgracey wrote: »

The other cog will have priority over the local cog's streamer.

Is that the right way round ?
Wouldn't most apps need unbroken streaming, and anything streaming at SysCLK/2 or slower, will have spare time slots that the other COG could easily wait for ?

cgracey wrote: »

That's it. Each cog will be able to r/w the next cog's LUT via its second port that the streamer uses.

So that is laid out as half-looks-left, half-looks-right, as each COG has two 'next' COGs, and all COGs are equal ?
This would allow one COG to tightly couple to 2 others & maybe one COG can feed 3 streamers ?

cgracey · 2016-05-04 23:24

The streamer can output pixel-type data to DACs/pins that it looks up from the LUT. The streamer can also use the egg-beater hub access, but it's not the egg-beater. Some talk on here has shown confusion between the egg-beater and the streamer. They are separate things.

cgracey · 2016-05-04 23:26

jmg wrote: »

cgracey wrote: »

The other cog will have priority over the local cog's streamer.

Is that the right way round ?
Wouldn't most apps need unbroken streaming, and anything streaming at SysCLK/2 or slower, will have spare time slots that the other COG could easily wait for ?

cgracey wrote: »

That's it. Each cog will be able to r/w the next cog's LUT via its second port that the streamer uses.

So that is laid out as half-looks-left, half-looks-right, as each COG has two 'next' COGs, and all COGs are equal ?
This would allow one COG to tightly couple to 2 others & maybe one COG can feed 3 streamers ?

I figure that a cog accessing a LUT needs priority over the read-only streamer which outputs to DACs and pins. Really, you should never have a conflict, in the same sense that a pin shouldn't be controlled from multiple cogs. You wouldn't program that way. In case there is a conflict between the other cog and the local cog's streamer, the other cog wins.

cgracey · 2016-05-04 23:32

You do, in effect, get access to the lower AND upper cog. Your LUT is accessible by the lower cog and the upper cog's LUT is accessible by you. This doesn't take much hardware at all. Right now, we are using 111,758 ALMs out of 113,560. We've got about 112 ALMs left for each cog. I'm already using the 'area-aggressive' setting in the fitter. This is it!

User Name · 2016-05-04 23:36

Today is the day the P2 feels real. It's going to happen!

jmg · 2016-05-04 23:40

cgracey wrote: »

The streamer can output pixel-type data to DACs/pins that it looks up from the LUT.

yes, which makes this seem backwards ... ?

cgracey wrote: »

I figure that a cog accessing a LUT needs priority over the read-only streamer which outputs to DACs and pins. Really, you should never have a conflict, in the same sense that a pin shouldn't be controlled from multiple cogs. You wouldn't program that way. In case there is a conflict between the other cog and the local cog's streamer, the other cog wins.

I see a full speed streamer as having to work in bursts, if used with any second COG, and usually some co-operation is needed, but it seems safer if the streamer wins, and the COG waits for the next slot, if it has to.
Doing the reverse will affect pixel playback ?
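
The two priority choices being debated can be compared with a toy model. This is only a sketch in Python, not P2 code: the function name, the policy names and the one-access-per-clock port model are all assumptions made for illustration, and the real arbitration logic may differ.

```python
def simulate(policy, stream_period, cog_request_clocks, n_clocks):
    """Toy model of one LUT port shared by a streamer and the other cog.

    policy: "cog_wins" or "streamer_wins" (hypothetical names).
    stream_period: the streamer wants the port every N clocks
                   (2 models streaming at SysCLK/2).
    Returns (streamer_reads_missed, clocks_the_cog_spent_waiting).
    """
    requests = set(cog_request_clocks)
    pending = 0   # cog accesses waiting for the port
    missed = 0    # streamer slots lost to the cog
    waited = 0    # clocks the cog was stalled
    for clk in range(n_clocks):
        streamer_wants = clk % stream_period == 0
        if clk in requests:
            pending += 1
        if pending:
            if streamer_wants and policy == "streamer_wins":
                waited += pending      # cog stalls, streamer reads
            else:
                if streamer_wants:
                    missed += 1        # cog takes the slot, streamer loses it
                pending -= 1
                waited += pending      # any further requests keep waiting
    return missed, waited


print(simulate("cog_wins", 2, [0, 3, 4], 8))
print(simulate("streamer_wins", 2, [0, 3, 4], 8))
```

In this model, with the streamer at SysCLK/2, "cog wins" costs the streamer two reads while "streamer wins" costs the cog two clocks of waiting. With `stream_period=1` the "streamer wins" policy never lets the cog in, which is why a full speed streamer would have to work in bursts.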

jmg · 2016-05-04 23:42

cgracey wrote: »

You do, in effect, get access to the lower AND upper cog. Your LUT is accessible by the lower cog and the upper cog's LUT is accessible by you. This doesn't take much hardware, at all.

I'm unclear - is this paired-only access, or can N see N+1 (and also N-1) ?
Does Lower Cog mean N-1, or Even COG only ?

Rayman · 2016-05-04 23:52

Looks like we're going to keep piling on to P2 until we run out of A9 ALMs...

Do like the LUT direct connection idea though...

Could it be bidirectional between cog pairs instead of one way around?

Electrodude · 2016-05-05 00:16

Rayman wrote: »

Could it be bidirectional between cog pairs instead of one way around?

It is bidirectional:

cgracey wrote: »

You do, in effect, get access to the lower AND upper cog. Your LUT is accessible by the lower cog and the upper cog's LUT is accessible by you.

How do you address the other cog's LUT? Can you do LUTEXEC out of it?

jmg · 2016-05-05 00:16

Rayman wrote: »

Could it be bidirectional between cog pairs instead of one way around?

Oh, do you think Chip means one-direction / one way looking only ? ->
Not what I was expecting, but maybe he does mean that ?

cgracey · 2016-05-05 00:32

You can read and write the next cog's LUT via its second port, which is shared by the next cog's streamer.

I never thought about making it wait, in case the other cog's streamer was using it. Good idea!

Cog exec is a possibility. That didn't occur to me, either.

One thing I need to add is a SETQ3, to enable RD/WRLONG-repeat.

Seairth · 2016-05-05 00:34

cgracey wrote: »

You do, in effect, get access to the lower AND upper cog. Your LUT is accessible by the lower cog and the upper cog's LUT is accessible by you. This doesn't take much hardware at all. Right now, we are using 111,758 ALMs out of 113,560. We've got about 112 ALMs left for each cog. I'm already using the 'area-aggressive' setting in the fitter. This is it!

If I'm understanding correctly:

There are two additional instructions (e.g. RDAUX/WRAUX)
Cog 1 can WRAUX to Cog 2's LUT, then Cog 2 can RDLUT its own LUT
Cog 2 can WRLUT to its own LUT, then Cog 1 can RDAUX Cog 2's LUT
Cog 1 does not know when Cog 2 has written or read Cog 2's LUT
Cog 2 does not know when Cog 1 has written or read Cog 2's LUT

I get the desire to share 512 registers between cogs, but without efficient handshaking (i.e. read/write events), it seems to me that this approach will be no better than using HUB ram and the existing hub read/write events.

I advocate for the simplicity of two one-way 32-bit (33-bit!) registers that are accompanied by events. The events are critical, though. Without the events, I think that any inter-cog conduit will be underutilized.
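
The access pattern listed above can be modeled in a few lines. This is a Python sketch rather than P2 assembly. RDAUX/WRAUX are only the example names used in this post, not confirmed instructions, and the wrap-around from the last cog back to cog 0 is an assumption.

```python
LUT_SIZE = 512  # the "512 registers" mentioned above


class Cog:
    def __init__(self, index, ring):
        self.index = index
        self.ring = ring            # shared list of all cogs
        self.lut = [0] * LUT_SIZE

    # Access to the cog's own LUT.
    def wrlut(self, addr, value):
        self.lut[addr] = value

    def rdlut(self, addr):
        return self.lut[addr]

    # Access to the NEXT cog's LUT through its second port.
    def _next(self):
        # Assumption: the last cog's "next" is cog 0.
        return self.ring[(self.index + 1) % len(self.ring)]

    def wraux(self, addr, value):
        self._next().lut[addr] = value

    def rdaux(self, addr):
        return self._next().lut[addr]


def make_ring(n):
    ring = []
    for i in range(n):
        ring.append(Cog(i, ring))
    return ring


cogs = make_ring(4)
cogs[1].wraux(5, 0xABCD)      # Cog 1 writes into Cog 2's LUT
print(hex(cogs[2].rdlut(5)))  # Cog 2 reads it from its own LUT
cogs[2].wrlut(6, 7)           # Cog 2 writes its own LUT
print(cogs[1].rdaux(6))       # Cog 1 reads it back
```

Nothing in this model tells either cog that the other one has touched the LUT, which is the missing handshake described above.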

Rayman · 2016-05-05 00:39

I was just thinking that we need an event trigger for lut write or read from other cog

jmg · 2016-05-05 01:06

cgracey wrote: »

You can read and write the next cog's LUT via its second port, which is shared by the next cog's streamer.

So this (N+1) LUT, appears above 'own' (N)LUT in memory map for R/W purposes, or something else ?

cgracey wrote: »

I never thought about making it wait, in case the other cog's streamer was using it. Good idea!

If it is easy to do, that would be more generally useful.

cgracey wrote: »

Cog exec is a possibility. That didn't occur to me, either.

That was my next question

Seems this could also allow some seriously tricky 'self modifying code', where 'self' is not quite you, but your evil twin...

Looks like this would allow a 2nd COG as a great numeric and/or crypto co-processor, for precisions outside native support.

jmg · 2016-05-05 01:08

Seairth wrote: »

... The events are critical, though. Without the events, I think that any inter-cog conduit will be underutilized.

Handshakes in Dual Port memory are usually done with an agreed semaphore pair, using RAM ?
eg Write a block, then update a RAM flag that says Block Ready; the other side polls that Ready flag, and sets a Block Read Done flag after it has got all the data, repeat...
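
A minimal sketch of that Ready/Done handshake, written in Python with generators standing in for the two cogs. The flag addresses, the function names and the `run` helper are hypothetical, and a real implementation would poll or wait in PASM instead.

```python
READY, DONE = 0, 1   # hypothetical flag locations in the shared RAM
DATA = 2             # start of the data block


def producer(ram, blocks):
    for block in blocks:
        ram[DATA:DATA + len(block)] = block   # write the block
        ram[DONE] = 0
        ram[READY] = 1                        # flag: Block Ready
        while not ram[DONE]:                  # wait for Block Read Done
            yield


def consumer(ram, block_len, out):
    while True:
        while not ram[READY]:                 # poll the Ready flag
            yield
        out.append(list(ram[DATA:DATA + block_len]))
        ram[READY] = 0
        ram[DONE] = 1                         # flag: Block Read Done


def run(blocks, block_len):
    """Alternate the two sides one step at a time until all blocks are sent."""
    ram = [0] * (DATA + block_len)
    out = []
    p = producer(ram, blocks)
    c = consumer(ram, block_len, out)
    while True:
        try:
            next(p)
        except StopIteration:
            break
        next(c)
    return out


print(run([[1, 2], [3, 4]], 2))
```

Both sides spend their idle time in a polling loop here, which is the cost that a hardware event would remove.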

jmg · 2016-05-05 01:11

cgracey wrote: »

...This doesn't take much hardware at all. Right now, we are using 111,758 ALMs out of 113,560. We've got about 112 ALMs left for each cog. I'm already using the 'area-aggressive' setting in the fitter. This is it!

Hmm, 1.586% of spare space ? - fingers crossed about any bug-fixes or mode clean-ups...
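
The arithmetic behind those figures, as a quick Python check. The ALM totals are the ones quoted above. The count of 16 cogs is an assumption, since it is not stated in this thread.

```python
total_alms = 113_560
used_alms = 111_758
cogs = 16  # assumption: not stated in the thread

spare = total_alms - used_alms        # ALMs still free
spare_pct = 100 * spare / total_alms  # free share of the device
per_cog = spare / cogs                # free ALMs per cog

print(spare, round(spare_pct, 3), per_cog)
```

Under that assumption the spare count per cog comes out just above 112, matching the "about 112" quoted.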

T Chap · 2016-05-05 01:15

Is there ROM to put in USB code for a dedicated pair of pins for loading? Or are we still needing external USB interface?

Seairth · 2016-05-05 01:15

jmg wrote: »

Seairth wrote: »

... The events are critical, though. Without the events, I think that any inter-cog conduit will be underutilized.

Handshakes in Dual Port memory are usually done with an agreed semaphore pair, using RAM ?
eg Write a block, then update a RAM flag that says Block Ready; the other side polls that Ready flag, and sets a Block Read Done flag after it has got all the data, repeat...

I agree. But I certainly hope you're not suggesting that each cog should go into a busy loop to wait for that flag!

jmg · 2016-05-05 01:21

T Chap wrote: »

Is there ROM to put in USB code for a dedicated pair of pins for loading? Or are we still needing external USB interface?

I think that is a 'maybe'

- needs someone to craft a ROM-Ready USB loader small enough, and it can tack on the end.

jmg · 2016-05-05 01:24

Seairth wrote: »

I agree. But I certainly hope you're not suggesting that each cog should go into a busy loop to wait for that flag!

Any system has to wait for data - I'm unclear what you mean by 'events' ?
Do you mean an interrupt is triggered ?

rjo__ · 2016-05-05 01:35

Just excellent.

Seairth wrote: »

If I'm understanding correctly:
There are two additional instructions (e.g. RDAUX/WRAUX)

Cog 1 can WRAUX to Cog 2's LUT, then Cog 2 can RDLUT its own LUT

Cog 2 can WRLUT to its own LUT, then Cog 1 can RDAUX Cog 2's LUT

Cog 1 does not know when Cog 2 has written or read Cog 2's LUT

Cog 2 does not know when Cog 1 has written or read Cog 2's LUT

I get the desire to share 512 registers between cogs, but without efficient handshaking (i.e. read/write events), it seems to me that this approach will be no better than using HUB ram and the existing hub read/write events.

I advocate for the simplicity of two one-way 32-bit (33-bit!) registers that are accompanied by events. The events are critical, though. Without the events, I think that any inter-cog conduit will be underutilized.

The hub always seems like a complicated place. I know it isn't... but that is the way it feels. I know exactly when something is happening in a cog, but when it comes to hub access, I am never really sure about the "when." Having this kind of communication and signaling between cogs would remove these kinds of uncertainties. Strange as it sounds... these bits simplify issues of determinacy in a very elegant way.

Seairth · 2016-05-05 01:49

jmg wrote: »

Seairth wrote: »

I agree. But I certainly hope you're not suggesting that each cog should go into a busy loop to wait for that flag!

Any system has to wait for data - I'm unclear what you mean by 'events' ?
Do you mean an interrupt is triggered ?

I mean events. Whether you use WAITxxx or an interrupt to react is up to you.

jmg · 2016-05-05 02:00

Seairth wrote: »

I mean events. Whether you use WAITxxx or an interrupt to react is up to you.

"Events are tracked and can be polled, waited for, and used directly as interrupt sources."
ok, there may be room to add an event flag, to when 'other-cog access' occurs ?
Should that be across the whole memory map ( in which case it may trigger early, in block moves )
or act only across part of the memory map, or triggered by one location only ?
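
The difference between those options can be shown with a small Python sketch. The function name and the addresses are made up for illustration, and none of this reflects how the actual event hardware would be built.

```python
def writes_before_event(block_addrs, watched):
    """Return how many longs have been written when the event first fires.

    block_addrs: addresses written by a block move, in order.
    watched: addresses that raise the event (whole map, a part, or one).
    Returns None if the event never fires.
    """
    for count, addr in enumerate(block_addrs, start=1):
        if addr in watched:
            return count
    return None


block = range(0x100, 0x110)   # a 16-long block move

print(writes_before_event(block, range(512)))   # event on the whole map
print(writes_before_event(block, {0x10F}))      # event on the last location only
```

With the whole map watched, the event fires after the first long, before the block is complete. Watching only the last location delays it until all 16 longs are in place.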

78rpm · 2016-05-05 03:53

Tubular wrote: »

Ok, that's neat. So it's not the DAA instruction after all : )

DAA = Dead After All.

Heater. · 2016-05-05 04:29

When PullMoll and I were trying to get the emulation of the 8080/8085/Z80 DAA instruction to match what real chips do I started to think it was:

DAA = Do Anything At-all

They handle it differently for negative numbers and the documentation did not seem to be clear on that.

jmg · 2016-05-05 04:33

and I thought it was a NOP alias...

DAA = Don't Alter Anything ?

cgracey · 2016-05-05 04:55

We need to have an event, for sure. Maybe two events are needed. Maybe four? What should they be? We are not actually limited to 16 events. We could add one bit and go up to 32.

jmg · 2016-05-05 05:09

cgracey wrote: »

We need to have an event, for sure. Maybe two events are needed. Maybe four? What should they be? We are not actually limited to 16 events. We could add one bit and go up to 32.

If you consider block writing and moves, and handshakes, and FIFO action, it can take a lead from that ?

With block writing, you do not want to trigger event too early, but it is nice to have it auto-trigger when full.
For handshakes, you have more like Ready and Ack/done.
tight REP loops that {WAIT/Write} and {WAIT/Read} could work, with the right event details ? - and they would save power.

Speaking of handshakes, does the Streamer have simple FIFO style handshakes, so an external device can tell it to pause ?
