P2 vs modern process limits

cgracey · 2016-05-05 05:12

The streamer does what you command it to without any pauses.

jmg · 2016-05-05 05:18

cgracey wrote: »

The streamer does what you command it to without any pauses.

Hmm, that could make writing reliable streamer code to/from any OS related host, tricky.

On the FTDI parts that have FIFOs, and on High speed Serial modes, it seems handshakes are needed rarely, I suspect when the OS gets tangled elsewhere.
That means things will appear to work, until in some rare combination, data is lost. The handshake lines cover those cases.

If the Clock is via a separate PinCell, maybe that just needs a Clock-Enable type option, and the Streamer attaches to what is now a gated clock.
(This ignores for now, the SysCLK/1 streaming which needs a CLK (mentioned above) + handshake too...)

evanh · 2016-05-05 05:44

cgracey wrote: »

We need to have an event, for sure. Maybe two events are needed. Maybe four? What should they be? We are not actually limited to 16 events. We could add one bit and go up to 32.

For the LUTs, a LUTRAM-end-write matching the existing LUTRAM-end-read, and same again for events coming from neighbouring LUT. So, that's three more event sources.

I note all this detection circuitry will need to be moved over to the second port for the local LUT, but detecting the same on the neighbour's LUT will need detection on the neighbour's first port.

cgracey · 2016-05-05 05:55

jmg wrote: »

cgracey wrote: »

The streamer does what you command it to without any pauses.

Hmm, that could make writing reliable streamer code to/from any OS related host, tricky.

On the FTDI parts that have FIFOs, and on High speed Serial modes, it seems handshakes are needed rarely, I suspect when the OS gets tangled elsewhere.
That means things will appear to work, until in some rare combination, data is lost. The handshake lines cover those cases.

If the Clock is via a separate PinCell, maybe that just needs a Clock-Enable type option, and the Streamer attaches to what is now a gated clock.
(This ignores for now, the SysCLK/1 streaming which needs a CLK (mentioned above) + handshake too...)

The streamer would have to be given some sensitivity to an input pin. Mainly, what the streamer is used for is interfacing to continuous real-time I/O. It could be in spurts now, but under software control, where you give it a command for so many cycles of operation and it does it. If the OS-based receiver could say "stop", but then receive up to 16 more samples before filling up, you could issue streamer commands that output groups of 16 samples, checking for the "stop" signal between commands. That would take a small-enough time that operation would be continuous until the "stop" signal.
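The burst scheme described above can be sketched as a small simulation. This is only an illustrative model, not P2 code: the FIFO depth, the point at which the receiver raises "stop", and the drain rate are all hypothetical parameters. The one figure taken from the post is the burst size of 16 samples, with "stop" checked only between bursts.

```python
from collections import deque

BURST = 16  # samples per streamer command, as suggested in the post


def stream_in_bursts(samples, fifo_depth, stop_threshold, drain_per_burst):
    """Send samples in bursts of BURST, checking 'stop' only between bursts.

    Hypothetical receiver model: it asserts 'stop' once its FIFO holds
    stop_threshold or more entries, and it drains drain_per_burst entries
    between bursts. Returns (delivered, max_fill, overflowed).
    """
    if drain_per_burst <= 0:
        raise ValueError("receiver must drain at least one sample per step")
    fifo = deque()
    delivered = 0
    max_fill = 0
    overflowed = False
    pos = 0
    while pos < len(samples):
        stop = len(fifo) >= stop_threshold
        if not stop:
            burst = samples[pos:pos + BURST]
            pos += len(burst)
            for s in burst:
                if len(fifo) >= fifo_depth:
                    overflowed = True  # sample lost: no room in the FIFO
                else:
                    fifo.append(s)
            max_fill = max(max_fill, len(fifo))
        # The receiver drains a little between bursts.
        for _ in range(min(drain_per_burst, len(fifo))):
            fifo.popleft()
            delivered += 1
    delivered += len(fifo)  # whatever is left gets drained at the end
    return delivered, max_fill, overflowed


# Receiver keeps 16 entries of headroom after saying "stop": nothing is lost.
safe = stream_in_bursts(list(range(100)), 64, 48, 4)
# Receiver keeps only 4 entries of headroom: a burst can overrun the FIFO.
risky = stream_in_bursts(list(range(100)), 64, 60, 4)
```

In this model the transfer is lossless as long as the receiver raises "stop" while it still has at least one full burst of headroom, which is the condition the post relies on.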

cgracey · 2016-05-05 05:59

evanh wrote: »

cgracey wrote: »

We need to have an event, for sure. Maybe two events are needed. Maybe four? What should they be? We are not actually limited to 16 events. We could add one bit and go up to 32.

For the LUTs, a LUTRAM-end-write matching the existing LUTRAM-end-read, and same again for events coming from neighbouring LUT. So, that's three more event sources.

I note all this detection circuitry will need to be moved over to the second port for the local LUT, but detecting the same on the neighbour's LUT will need detection on the neighbour's first port.

So, I figure we could use these events in the local cog:

1) lower cog wrote my LUT
2) lower cog read my LUT
3) upper cog wrote its LUT
4) upper cog read its LUT

Maybe we could trigger these events at fixed LUT addresses, like $1FF, or maybe they need to be settable. Addresses could involve a mask, too. What do you guys think?

jmg · 2016-05-05 06:36

cgracey wrote: »

So, I figure we could use these events in the local cog:

1) lower cog wrote my LUT
2) lower cog read my LUT
3) upper cog wrote its LUT
4) upper cog read its LUT

Maybe we could trigger these events at fixed LUT addresses, like $1FF, or maybe they need to be settable. Addresses could involve a mask, too. What do you guys think?

A Mask would be flexible, if there is room, as you need sometimes to ignore a write, and sometimes to react to it.
Otherwise, map events into (asymmetric) address regions.

To clarify, here lower and upper are Even/Odd COGS, and 3rd COG cannot connect to 2nd COG at all , correct ? (but it can connect 3rd & 4th)
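As a rough illustration of the mask idea, here is a minimal sketch of an address comparator with a mask. The function name and the 9-bit address width (a 512-long LUT, so $1FF is the top address) are assumptions for the example, not something defined in this thread.

```python
def lut_event_hit(addr, match=0x1FF, mask=0x1FF):
    """Return True when the masked LUT address equals the masked match value.

    Hypothetical comparator: bits that are 0 in mask are ignored, so a
    narrower mask makes a whole address region trigger the event.
    """
    return (addr & mask) == (match & mask)


fixed_hit = lut_event_hit(0x1FF)               # fixed address, full mask
fixed_miss = lut_event_hit(0x1FE)              # any other address is ignored
pair_hit = lut_event_hit(0x1FE, mask=0x1FE)    # low bit ignored: $1FE or $1FF
region_hit = lut_event_hit(0x1C3, mask=0x1C0)  # top 64 longs all trigger
```

With a full mask this degenerates to the fixed-address case, so a settable mask would cover both options at the cost of the extra register bits.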

jmg · 2016-05-05 06:43

cgracey wrote: »

The streamer would have to be given some sensitivity to an input pin.

Yes.

cgracey wrote: »

Mainly, what the streamer is used for is interfacing to continuous real-time I/O.

I can see it is good for pixel streaming, or camera capture.
Not so good for real-world connections that need a 'hang-on-a-mo' signal.

cgracey wrote: »

It could be in spurts now, but under software control, where you give it a command for so many cycles of operation and it does it. If the OS-based receiver could say "stop", but then receive up to 16 more samples before filling up, you could issue streamer commands that output groups of 16 samples, checking for the "stop" signal between commands. That would take a small-enough time that operation would be continuous until the "stop" signal.

That might work, but not all FIFO interfaces are well documented, and most seem to be designed to connect to FPGAs, which react immediately.
To displace FPGAs, which a P2 is very close to doing, it needs to react the same way, and interface to the same HS UHS USB FIFOs.

Otherwise you limit your PC connect speeds.
I suppose you could ask FTDI for more information - some of their data is vague and incomplete, and restrictive.

JRetSapDoog · 2016-05-05 07:05

Not that I have an application in mind, but I was just wondering how long it would take to "ripple/wave" a single long of data through all 16 of the cogs' LUTs (bypassing the hub entirely), wherein a first cog writes a long to the LUT of a second cog (its neighbor) and that second cog "picks that long up" and writes it to the LUT of a third cog (its neighbor) and so on (assuming no coordination signaling involved). Wonder if it would be N instruction cycles, or N*2 (seems unlikely) or N(N-1)/2. Obviously, it wouldn't be 1 instruction cycle total. Would it just be N cycles? That's mostly for understanding's sake, as if one really wanted to share a long across all cogs, I guess one would just read it from the hub. And if one really wanted N-time (sharing across cogs in one instruction cycle), then one would need to use a bunch of pins (as there's no internal bus/ring).
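A back-of-the-envelope way to frame the question, assuming each hop (write to the neighbor's LUT, neighbor notices, neighbor forwards) costs the same fixed amount of time and the hops happen strictly one after another. The per-hop cost is a placeholder, since the thread does not give a figure for it.

```python
def ripple_time(n_cogs, cost_per_hop):
    """Total time to pass one long down a chain of n_cogs.

    Assumes a fixed, hypothetical cost_per_hop and strictly sequential hops,
    so the total grows linearly with the number of hops (n_cogs - 1).
    """
    hops = n_cogs - 1
    return hops * cost_per_hop


one_unit_per_hop = ripple_time(16, 1)   # 15 hops at 1 unit each
four_units_per_hop = ripple_time(16, 4)  # 15 hops at 4 units each
```

Under those assumptions the answer is linear in N rather than N(N-1)/2, because each long is forwarded once per cog and nothing is re-sent.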

JRetSapDoog · 2016-05-05 07:14

cgracey wrote: »

The streamer can output pixel-type data to DACs/pins that it looks up from the LUT. The streamer can also use the egg-beater hub access, but it's not the egg-beater [itself]. Some talk on here has shown confusion between the egg-beater and the streamer. They are separate things.

Sometimes saying what something is *not* can add a useful perspective for understanding.

Congrats, Chip, on nearly wrapping up the design (even though there are some events to work out).

evanh · 2016-05-05 07:26

jmg wrote: »

cgracey wrote: »

Mainly, what the streamer is used for is interfacing to continuous real-time I/O.

I can see it is good for pixel streaming, or camera capture.
Not so good for real-world connections that need a 'hang-on-a-mo' signal.

Real-world meaning artificial databuses.

Okay, given USB is already catered for, probably a good starting point would be SDRAM. Getting the Streamer up to handling such transfers would be worth it I'd say. Making use of the Streamer to max out a burst SDRAM transfer I think would be a good feature test.

evanh · 2016-05-05 07:29

cgracey wrote: »

So, I figure we could use these events in the local cog:

1) lower cog wrote my LUT
2) lower cog read my LUT
3) upper cog wrote its LUT
4) upper cog read its LUT

Maybe we could trigger these events at fixed LUT addresses, like $1FF, or maybe they need to be settable. Addresses could involve a mask, too. What do you guys think?

Keeping it simple with the fixed $1FF address is good.

Another set of events that trigger on every address might be handy. Wouldn't want to use it for interrupts, but polling/waiting could possibly speed up arbitrary writes by not needing to write to $1FF or anywhere specific.

cgracey · 2016-05-05 07:29

evanh wrote: »

jmg wrote: »

cgracey wrote: »

Mainly, what the streamer is used for is interfacing to continuous real-time I/O.

I can see it is good for pixel streaming, or camera capture.
Not so good for real-world connections that need a 'hang-on-a-mo' signal.

Real-world meaning artificial databuses.

Okay, given USB is already catered for, probably a good starting point would be SDRAM. Getting the Streamer up to handling such transfers would be worth it I'd say. Making use of the Streamer to max out a burst SDRAM transfer I think would be a good feature test.

It already does SDRAM. That was one of the original design goals.

evanh · 2016-05-05 09:00

Sexy! I'm all good then.

Rayman · 2016-05-05 11:11

If two cogs can write to some cog's LUT, seems you'd want to separate those events somehow...

cgracey · 2016-05-05 11:29

Rayman wrote: »

If two cogs can write to some cog's LUT, seems you'd want to separate those events somehow...

Yes, I was thinking the same thing. The smallest way would be to have the cog select the sensitivity. This way, you don't need more events. The one event could do either. We've got one empty event at the moment. We could use it for write sensing, since the other is already read sensing. A simple D/# instruction could set the sensitivities for both events.

At the moment, I'm compiling the whole thing, with the new RDLUTN/WRLUTN instructions added (N=next). I'm hoping we still fit. An ALM is almost 3 LE's, so we should be okay. We have a budget of about 112 ALMs per cog for this last feature.

Seairth · 2016-05-05 11:46

Well... I guess this is happening. Ok, then...

I suggest skipping the configurable events. Just do the following:

Cog N gets WRITE event for Cog N+1 writing to $1FF
Cog N gets READ event for Cog N+1 reading from $1FE
Cog N gets WRITE event for Cog N-1 writing to $1FE
Cog N gets READ event for Cog N-1 reading from $1FF

You will notice that the last event is actually an existing event (13), which means that you only need three new events. If you want to increase the event ID size to 5 bits, that's fine. But I think you could possibly get away with getting rid of one of the counters. (The additional counters were added when we didn't have the smart pins, so having 3 may not be as critical now. This will also free up a few ALMs on the FPGA. Speaking of ALMs, you might also free up a few by taking the HUB back to 512K.)
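The four fixed events listed above can be written out as a lookup table. This is just a restatement of the list for clarity; the names and the relative-offset encoding (+1 for Cog N+1, -1 for Cog N-1) are made up for the example.

```python
# (who accessed, relative to Cog N; operation; LUT address) -> event for Cog N
FIXED_EVENTS = {
    (+1, "write", 0x1FF): "WRITE",
    (+1, "read", 0x1FE): "READ",
    (-1, "write", 0x1FE): "WRITE",
    (-1, "read", 0x1FF): "READ",
}


def event_for_cog_n(accessor_offset, op, addr):
    """Return the event Cog N would see, or None if the access is ignored."""
    return FIXED_EVENTS.get((accessor_offset, op, addr))


upper_write = event_for_cog_n(+1, "write", 0x1FF)
lower_read = event_for_cog_n(-1, "read", 0x1FF)
ignored = event_for_cog_n(+1, "write", 0x100)
```

Because each direction uses its own address ($1FF one way, $1FE the other), the two neighbors never trigger the same event at the same address.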

ozpropdev · 2016-05-05 11:47

cgracey wrote: »

At the moment, I'm compiling the whole thing, with the new RDLUTN/WRLUTN instructions added (N=next). I'm hoping we still fit. An ALM is almost 3 LE's, so we should be okay. We have a budget of about 112 ALMs per cog for this last feature.

Fingers crossed!

It's all wrapping up nicely.

cgracey · 2016-05-05 12:19

It done blowed up!

Error (170012): Fitter requires 11361 LABs to implement the design, but the device contains only 11356 LABs

I'll get rid of the CT3 timer. Seairth is right that we don't need three timers with smart pins. Two is plenty.
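For scale, the fitter numbers in the error message can be checked with a few lines of arithmetic. The figure of 10 ALMs per LAB is an assumption about the Cyclone V architecture, and LAB packing is not a simple ALM count, so the ALM estimate is only a rough guide.

```python
ALMS_PER_LAB = 10  # assumption: Cyclone V packs 10 ALMs into each LAB


def fit_shortfall(required_labs, available_labs):
    """Return (LABs over budget, percentage over budget)."""
    shortfall = required_labs - available_labs
    percent_over = 100.0 * shortfall / available_labs
    return shortfall, percent_over


# Numbers taken from the fitter error message.
shortfall, percent_over = fit_shortfall(11361, 11356)
approx_alms = shortfall * ALMS_PER_LAB  # rough ALM equivalent of the overrun
```

The design misses by 5 LABs, well under a tenth of a percent of the device, which is why trimming a single timer looks like a plausible fix.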

Seairth · 2016-05-05 12:35

cgracey wrote: »

Rayman wrote: »

If two cogs can write to some cog's LUT, seems you'd want to separate those events somehow...

Yes, I was thinking the same thing. The smallest way would be to have the cog select the sensitivity. This way, you don't need more events. The one event could do either. We've got one empty event at the moment. We could use it for write sensing, since the other is already read sensing. A simple D/# instruction could set the sensitivities for both events.

So, you would have two events, one for reads and one for writes, with the following two settings:

* Own LUT or Next LUT
* LUT address

Seems reasonable. I'd change the event mnemonics slightly:

RDL   -> RDH (hub read)
WRL   -> WRH (hub write)
RLE   -> RDL (lut read)
(new) -> WRL (lut write)

For defaults, I'd do the following:

* READ: Own LUT, $1FF (which is what event 13 is currently)
* WRITE: Next LUT, $1FF

To enable two-way communication,
* Cog N would switch READ to Next LUT, $1FF
* Cog N+1 would switch WRITE to Own LUT, $1FE

With this,

Cog N+1 WAITWRL
Cog N WRLUTN $1FE
Cog N WAITRDL
Cog N+1 wakes from WAITWRL
Cog N+1 RDLUT $1FE
Cog N wakes from WAITRDL

and...

Cog N WAITWRL
Cog N+1 WRLUT $1FF
Cog N+1 WAITRDL
Cog N wakes from WAITWRL
Cog N RDLUTN $1FF
Cog N+1 wakes from WAITRDL
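To make the second sequence concrete, here is a toy model of a cog pair with one READ and one WRITE event per cog, each configured with "own or next LUT" plus an address. Everything here (class names, the 512-long LUT size, the boolean flags standing in for WAITRDL/WAITWRL wake-ups) is a hypothetical sketch, and it only walks the sequence that uses the proposed defaults. The model also does not distinguish which cog performed an access, which real hardware might.

```python
from dataclasses import dataclass, field

LUT_LONGS = 512  # assumed LUT size, so $1FF is the top address


@dataclass
class EventCfg:
    lut: str   # "own" or "next"
    addr: int  # LUT address that triggers the event


@dataclass
class Cog:
    lut: list = field(default_factory=lambda: [0] * LUT_LONGS)
    # Defaults as proposed in the post.
    read_cfg: EventCfg = field(default_factory=lambda: EventCfg("own", 0x1FF))
    write_cfg: EventCfg = field(default_factory=lambda: EventCfg("next", 0x1FF))
    rd_event: bool = False  # what WAITRDL would wake on
    wr_event: bool = False  # what WAITWRL would wake on


class CogPair:
    """Index 0 is Cog N, index 1 is Cog N+1."""

    def __init__(self):
        self.cogs = [Cog(), Cog()]

    def _notify(self, op, owner, addr):
        # Flag every cog whose event watches this LUT at this address.
        for i, cog in enumerate(self.cogs):
            cfg = cog.read_cfg if op == "read" else cog.write_cfg
            watched = i if cfg.lut == "own" else i + 1
            if watched == owner and cfg.addr == addr:
                if op == "read":
                    cog.rd_event = True
                else:
                    cog.wr_event = True

    def wrlut(self, cog, addr, value, nxt=False):
        owner = cog + 1 if nxt else cog  # nxt=True models WRLUTN
        self.cogs[owner].lut[addr] = value
        self._notify("write", owner, addr)

    def rdlut(self, cog, addr, nxt=False):
        owner = cog + 1 if nxt else cog  # nxt=True models RDLUTN
        value = self.cogs[owner].lut[addr]
        self._notify("read", owner, addr)
        return value


pair = CogPair()
pair.wrlut(1, 0x1FF, 0xCAFE)          # Cog N+1 WRLUT $1FF
woke_n = pair.cogs[0].wr_event        # Cog N wakes from WAITWRL
got = pair.rdlut(0, 0x1FF, nxt=True)  # Cog N RDLUTN $1FF
woke_n1 = pair.cogs[1].rd_event       # Cog N+1 wakes from WAITRDL
```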

Seairth · 2016-05-05 12:37

cgracey wrote: »

It done blowed up!

Error (170012): Fitter requires 11361 LABs to implement the design, but the device contains only 11356 LABs

I'll get rid of the CT3 timer. Seairth is right that we don't need three timers with smart pins. Two is plenty.

Why are you holding onto that extra 512K? I realize this is mostly memory, but some ALMs have got to be used! Isn't it more important to prove the actual design than to give a little bit extra to those few people that have the A9 board?

cgracey · 2016-05-05 12:51

Seairth wrote: »

cgracey wrote: »

It done blowed up!

Error (170012): Fitter requires 11361 LABs to implement the design, but the device contains only 11356 LABs

I'll get rid of the CT3 timer. Seairth is right that we don't need three timers with smart pins. Two is plenty.

Why are you holding onto that extra 512K? I realize this is mostly memory, but some ALMs have got to be used! Isn't it more important to prove the actual design than to give a little bit extra to those few people that have the A9 board?

I don't think that extra address bit amounts to much, so I haven't bothered, yet. The compiler is still moving along, further than it got last time.

cgracey · 2016-05-05 12:57

Seairth wrote: »

cgracey wrote: »

Rayman wrote: »

If two cogs can write to some cog's LUT, seems you'd want to separate those events somehow...

Yes, I was thinking the same thing. The smallest way would be to have the cog select the sensitivity. This way, you don't need more events. The one event could do either. We've got one empty event at the moment. We could use it for write sensing, since the other is already read sensing. A simple D/# instruction could set the sensitivities for both events.

So, you would have two events, one for reads and one for writes, with the following two settings:

* Own LUT or Next LUT
* LUT address

Seems reasonable. I'd change the event mnemonics slightly:

RDL   -> RDH (hub read)
WRL   -> WRH (hub write)
RLE   -> RDL (lut read)
(new) -> WRL (lut write)

For defaults, I'd do the following:

* READ: Own LUT, $1FF (which is what event 13 is currently)
* WRITE: Next LUT, $1FF

To enable two-way communication,
* Cog N would switch READ to Next LUT, $1FF
* Cog N+1 would switch WRITE to Own LUT, $1FE

With this,

Cog N+1 WAITWRL
Cog N WRLUTN $1FE
Cog N WAITRDL
Cog N+1 wakes from WAITWRL
Cog N+1 RDLUT $1FE
Cog N wakes from WAITRDL

and...

Cog N WAITWRL
Cog N+1 WRLUT $1FF
Cog N+1 WAITRDL
Cog N wakes from WAITWRL
Cog N RDLUTN $1FF
Cog N+1 wakes from WAITRDL

Looks good. I think I'm going to have to sleep for a few hours before I can do any more.

cgracey · 2016-05-05 22:49

The compiler ran for 9 hours and wasn't getting the job done. It should take about 80 minutes if things are good. To get this LUT sharing working, something else may have to be pared down.

I got rid of the CT3 event, which was superfluous with smart pins able to track so much timing.

Question: Do we need more than one timer event per cog?

If we got rid of CT2, also, we'd go back to GETCNT/ADDCNT/POLLCNT/WAITCNT, instead of GETCT/ADDCT1/ADDCT2/POLLCT1/POLLCT2/WAITCT1/WAITCT2.

Does a cog need more than one timer?

jmg · 2016-05-05 23:10

cgracey wrote: »

The compiler ran for 9 hours and wasn't getting the job done. It should take about 80 minutes if things are good. To get this LUT sharing working, something else may have to be pared down.

I got rid of the CT3 event, which was superfluous with smart pins able to track so much timing.

Question: Do we need more than one timer event per cog?

If we got rid of CT2, also, we'd go back to GETCNT/ADDCNT/POLLCNT/WAITCNT, instead of GETCT/ADDCT1/ADDCT2/POLLCT1/POLLCT2/WAITCT1/WAITCT2.

Does a cog need more than one timer?

Maybe the tools need a little headroom, meaning 100.00 % usage is not quite reachable ?

How much does that CT2 save ?
If timers can trigger events, I'd be reluctant to pare back to only one ?
Does having two help porting code from P1, which has 2 timers per COG ?

Rayman · 2016-05-05 23:10

I'd say no, at first. But, if someone did want to use all the new task switching stuff to do multiple things, then they might want all 3 timers...

Maybe make cog-cog only to next cog?

Is there an A10 you can try to compile to? Or, something with more ALMs?

cgracey · 2016-05-05 23:16

Rayman wrote: »

I'd say no, at first. But, if someone did want to use all the new task switching stuff to do multiple things, then they might want all 3 timers...

Maybe make cog-cog only to next cog?

Is there an A10 you can try to compile to? Or, something with more ALMs?

We are using the fattest of the Cyclone V devices.

jmg · 2016-05-05 23:24

Rayman wrote: »

Maybe make cog-cog only to next cog?

hmm.. I thought it was already limited to Even-Odd pairings ?
Seems that would save a lot of routing and Muxes, as the two associated LUTs can sit between their two owner COGs ?

cgracey · 2016-05-05 23:28

jmg wrote: »

Rayman wrote: »

Maybe make cog-cog only to next cog?

hmm.. I thought it was already limited to Even-Odd pairings ?
Seems that would save a lot of routing and Muxes, as the two associated LUTs can sit between their two owner COGs ?

Tubular proposed something similar, but I can't see how it would save any logic. It's a different connection pattern, but takes the same amount of glue.

Seairth · 2016-05-05 23:49

For now, maybe get rid of some smartpin cells. Our primary concern is proving out the various features. 32 smartpins is still plenty enough for testing purposes.

Unless there is a direct correlation between the ability to fit everything in the final silicon and fit everything in an A9, I wouldn't start dropping features just to fit on the A9.

Also, I know you say the 1MB HUB only has 1 extra addressing line, but wouldn't the FPGA potentially use multiple ALMs for signal routing? I know I'm sounding like a broken record, but it's not part of the final silicon, so there's no harm in seeing if it helps free up resources.

cgracey · 2016-05-05 23:52

Seairth wrote: »

For now, maybe get rid of some smartpin cells. Our primary concern is proving out the various features. 32 smartpins is still plenty enough for testing purposes.

Unless there is a direct correlation between the ability to fit everything in the final silicon and fit everything in an A9, I wouldn't start dropping features just to fit on the A9.

Also, I know you say the 1MB HUB only has 1 extra addressing line, but wouldn't the FPGA potentially use multiple ALMs for signal routing? I know I'm sounding like a broken record, but it's not part of the final silicon, so there's no harm in seeing if it helps free up resources.

I'll try it. It'll probably amount to 20, or so, ALMs.

We've actually got a LOT of room on the silicon die. We could just proceed by doing what you suggested: not have all pins be smart pins on the -A9. We could put them at 0..31 and then 58..63. That would cover all of port A and the flash and serial pins at the top of port B. Let's do that! Problem solved.
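For reference, the pin ranges mentioned above work out as follows. This is only a tally of the two ranges in the post; the function name is made up for the example.

```python
def a9_smart_pins():
    """Pins that would keep smart-pin cells on the -A9: 0..31 and 58..63."""
    port_a = set(range(0, 32))         # all of port A
    top_of_port_b = set(range(58, 64))  # flash and serial pins
    return sorted(port_a | top_of_port_b)


pins = a9_smart_pins()
smart_pin_count = len(pins)    # 32 + 6
plain_pin_count = 64 - len(pins)  # pins 32..57 lose their smart-pin cells
```

That leaves 38 smart pins on the -A9 image and frees the cells for 26 pins.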
