A different way to access smart pins
Seairth
If we dropped the shared LUT thing, I wonder if there would be enough room to double the bus width to the smart pins. That would speed up the special link mode AND all other smart pin interaction.
@cgracey, what is the current bus width to the smart pins? I have another thought (which I'll have to wait a few hours to share).
I've been thinking about the use of smart pins for fast data sharing. One issue with them is that you have to trade off latency with data width (i.e. 32 bits stall a cog much longer than 8 bits do). Conversely, if you look at something like the hub, careful timing can maximize data width while minimizing latency. So, the question is, what if we were to do the same with the smart pins...
Suppose the following changes:
* Get rid of the variable-length timing, returning to the old method where smart pin access was always 10 clocks
* Group the smart pins into 8 groups
* Widen the data path to 32 bits (which, if @jmg is correct, is effectively how many data lines 8 cogs have anyhow)
* Access to each smart pin becomes round-robin, where the entire 32 bits for each successive smart pin are transferred every clock cycle (see the slot sketch after this list)
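To make the rotation concrete, here is the slot schedule I have in mind for one group (assuming the 64 smart pins split into 8 groups of 8; a sketch of the proposal, not a settled design):

            ' clock:  0    1    2    3    4    5    6    7    8    9   ...
            ' pin:    P0   P1   P2   P3   P4   P5   P6   P7   P0   P1  ...
            '
            ' each pin's full 32 bits move in its slot, so a cog that just
            ' misses its slot waits at most 8 clocks, and a cog that times
            ' its code to hit the slot waits 0 extra clocks; same idea as
            ' hub timing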
Why?
1. It's already a familiar access pattern for a shared resource (i.e. hub)
2. Cogs can do other things between accesses to a smart pin.
3. The worst case latency is not worse than the current worst case. And by timing access, the best case latency will be the same as the current best case. In other words, latency is dependent on timing, not data width.
What can we do with this?
Well, I'm sure you all will be able to think of some great stuff! But, here are a couple simple ones...
A tiny serial echo block:
echo_loop   REP     #rep_end, #0    ' loop forever
            WAITEDG                 ' wait for RX
            PINACK  #32             ' ack (still 2 clocks)
            PINGETZ t0, #32         ' grab the incoming data (waits 6 clocks)
            PINSETY t0, #34         ' send to tx (0 extra clocks due to alignment)
rep_end
Actually, this one could be even more interesting if PINGETZ could also indicate whether the pin event was active. Because PINGETZ would wait until it gets the right slot, you could change the code as follows:
echo_loop   REP     #rep_end, #0    ' loop forever
            PINGETZ t0, #32 WC      ' grab the incoming data (C = event flag)
    if_c    PINSETY t0, #34         ' send to tx (0 extra clocks due to alignment)
    if_c    PINACK  #32             ' ack the rx (still 2 clocks)
rep_end
Of course, this would be much less energy efficient than using WAITEDG, but it would be crazy fast!
How about another example... fast cog-to-cog data sharing of multiple 32-bit values. On the sender's side:
send_data   PINSETY t0, #32         ' write 32 bits to smart pin (triggers edge event in receiver)
            ' grab next data into t0
            WAITEDG                 ' wait for falling edge (receiver ACKed)
            PINSETY t0, #32         ' write t0 to smart pin (waits 6 clocks)
            ' next three instructions set next t0
            PINSETY t0, #32         ' write t0 to smart pin (waits 0 clocks due to alignment)
            ' next three instructions set next t0
            PINSETY t0, #32         ' write t0 to smart pin (waits 0 clocks due to alignment)
            ' repeat as necessary...

On the receiver's side:

get_data    WAITEDG                 ' wait for other cog to write to smart pin
            PINACK  #32             ' ack (still 2 clocks)
            PINGETZ t0, #32         ' grab the incoming data (waits 6 clocks)
            ' next three instructions put t0 somewhere...
            PINGETZ t0, #32         ' grab next incoming data (waits 0 clocks due to alignment)
            ' next three instructions put t0 somewhere...
            ' repeat as necessary...
Anyhow, I think you get the idea. Just as with the hub, you can always safely write your code for the worst-case timing. But, in those few spots where latency is critical, you still have the ability to tune your code.
Comments
I can see the merit of multiple timed SET and GET, but I think the current scheme can be tuned to offer that ?
That's why I ask for end-to-end timings !!
I would suggest looking at DDR handling on the Pin cells first, to see if that can meet timing.
That has low logic cost, and doubles the bandwidth, without doubling the routing.
( We have just seen what large amounts of routing can do to a design... )
Grouping them in sets of eight and time-mux'ing the individual 32-bit Z outputs would maybe be a wash logic-wise, but would cut the read latency in half, down to 8 clocks, max.
Can that still meet USB path timings ?
I thought that was optimized down to work adequately, with the Length+Nibble approach ?
A byte goes from 4 down to 3 clocks, due to header and $00 terminus.
A word goes from 6 down to 4 clocks.
A long goes from 10 clocks down to 6 clocks.
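Those numbers are consistent with one header clock plus one terminating %0000 clock around the data nibbles, with DDR halving only the data-nibble clocks:

            ' byte: 1 header + 2 nibbles + 1 terminus = 4 clocks;  DDR: 1 + 1 + 1 = 3
            ' word: 1 header + 4 nibbles + 1 terminus = 6 clocks;  DDR: 1 + 2 + 1 = 4
            ' long: 1 header + 8 nibbles + 1 terminus = 10 clocks; DDR: 1 + 4 + 1 = 6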
Those don't seem like big-enough savings to warrant the logic required. I think we are at a happy medium using 4 bits.
1) Can you use DDR approach and still meet timing ?
i.e. the destination clocks some registers on the rising edge, some on the falling edge, and MUXes between them using SysCLK.
It is all still SysCLK based, just using both edges.
2) Can the combination of Tx and Rx overlap, to shrink the times from 10+10+extras for Long ?
e.g. a writing COG could signal the receiver, which then generates a GET in parallel; that waits a few clocks until data starts arriving and passes the data through, so the total end-to-end time is perhaps 2 or 3 pipeline clocks more than a single write time.
Biggest gain is on moving 32b.
I don't want to introduce dual-edge clocking, at this point, into the design.
As far as overlapping writes and reads in the smart pins go, things are not set up that way. When the full data byte/word/long arrives at the smart pin, it then issues a write_mode, write_x, or write_y. It's the smart pin mode that handles it from there. In the case of the DAC-alternate modes which convey data, they capture that write_x into Z and start outputting it on the next clock for RDPIN (was GETPINZ). It would take extra muxes to circumvent the existing mechanism for concurrent turn-around. I don't want to grow the smart pin for just that mode.
Alternatively, you could place them above Pin64; they would then be (much?) smaller than Pin Cells, and single-tasked, so they can be very focused on end-to-end optimization.
Still using the Smart Pin highways, and signaling and encoding.
I think the DAC_Data comes almost for free, so you could still keep that, just have some more (above 64) that avoid consuming vital Pin cells for internal use.
Then, you really can say 'Every Pin has a Smart Cell'.
If LUT COG-COG pathways really are all going away, it makes sense to use a little part of the logic that you were prepared to use for them, to boost the speed of the only remaining non-hub pathway.
If we're going to go through all that trouble, maybe we could just use the nibble-wide pin-write conduit to implement inter-cog LUT writing. The message would be 9 address bits plus 32 data bits = 41 bits, or 11 nibbles (clocks), including header. Because these writes would be implemented as messages over 11 clocks, there would need to be a practical rule that only one cog writes a particular cog's LUT at a time. That might be a decent value for the cost.
Meanwhile, I found a way to shave a clock cycle off of WRPIN/WXPIN/WYPIN by having the terminating %0000 be done automatically, after the message. In cases where another WRPIN/WXPIN/WYPIN immediately follows, there is no need for the %0000, as a new header is allowed right after the last nibble of the prior message. The %0000 is needed, though, when no message is present. I implemented it and it's working.
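A sketch of the nibble stream before and after, using the 10-clock long figure quoted earlier:

            ' long write over the 4-bit conduit:
            ' before: hdr n1 n2 n3 n4 n5 n6 n7 n8 %0000      = 10 clocks
            ' after:  hdr n1 n2 n3 n4 n5 n6 n7 n8 [hdr ...]  =  9 clocks
            '         (the %0000 terminus is driven automatically, and only
            '          appears on the bus when no new message follows)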
This makes me consider bumping the cog-to-pin messaging to 8 bits, as timing would be:
Byte to smart pin: 2 clocks
Word to smart pin: 3 clocks
Long to smart pin: 5 clocks
Write to any LUT: 6 clocks
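The byte/word/long figures are consistent with one header clock plus one clock per data byte; the LUT-write figure only works out if part of the address rides in the header, which is my inference:

            ' byte to smart pin: 1 header + 1 data byte  = 2 clocks
            ' word to smart pin: 1 header + 2 data bytes = 3 clocks
            ' long to smart pin: 1 header + 4 data bytes = 5 clocks
            ' LUT write:         1 header + 5 data bytes = 6 clocks
            '                    (9 adr + 32 data = 41 bits, vs 40 bits in
            '                     5 bytes, so presumably one address bit
            '                     travels in the header)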
If you stay at 4 bits, then reduce the LUT to 8 address bits (half the LUT) to get down to 10 clocks.
Presumably, with 8 bits, a LUT write could be 5 clocks with 8 address bits too.
Looks encouraging.
It seems not that good, when I think about it more. Consider that it would take 10, or even 5, clocks to send a long to some LUT. How about just ATN'ing that cog and having him do a SETQ+RDLONG. After the latency (5..22 clocks), that cog will be loading in 32 bits every clock from the hub. And that's not even using the egg-beater FIFO. The sending cog could write to hub, beforehand, using SETQ+WRLONG to move data just as quickly. I don't know. All these other avenues are somewhat expensive and only buy a little advantage over the conduits that are already in place.
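For reference, that alternative would look something like this (a sketch only; buf, BUF_ADDR, the cog mask, and the 16-long count are placeholders):

            ' sending cog: stage the data in hub, then get the receiver's attention
            SETQ    #16-1            ' block-write 16 longs...
            WRLONG  buf, ##BUF_ADDR  ' ...from cog registers starting at buf
            COGATN  #%0010           ' strobe attention to cog 1

            ' receiving cog: wait for attention, then block-read
            WAITATN                  ' wait for the sender's ATN strobe
            SETQ    #16-1            ' block-read 16 longs...
            RDLONG  buf, ##BUF_ADDR  ' ...loading ~1 long per clock once started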
By making it take one less clock for WRPIN/WXPIN/WYPIN, a few ALMs were shaved off each cog. Also, I made the instruction decoding for the D/#-only instruction not care about the S[8:6] bits, as they are always %000.
Placement is done and now it's routing.
EDIT: Woops! I had the CORDIC turned off. I'll try again...
Was he not able to use RDFAST or WRFAST, so that he could do RFBYTE/WFBYTE?
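For anyone following along, that hub-FIFO pattern is roughly this (BUF_ADDR and t0 are placeholders):

            RDFAST  #0, ##BUF_ADDR   ' start the hub read FIFO at BUF_ADDR (no wrap)
loop        RFBYTE  t0               ' fetch the next byte; no hub-window stall once primed
            ' ... process t0 ...
            JMP     #loop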
* It can only send 32b, which immediately pushes it to the slowest message end
* it adds 9 Adr bits of payload, slowing it down even more.
The 9 adr bits are flexible, but this is going to be slower than HUB-LUT, so no one is going to use it for large blocks of data.
Most cases will be COG-COG pipes, often Bytes.
The appeal of the Pin_cell path, is to be small and nimble.
Now that is more what I had in mind - compacting the Pin-Cell links as much as possible.
What is the End-To-End delay times, with the Pin-cell conduit, in this scheme ?
What could they be lowered to, with a shrunk cell, optimized for transfer, placed above pin 64 ?
Are you using the latest Quartus Prime 16.0?
No form of caching or block move helps. That's why I loved the LUT sharing, because the second cog could be in a rdlut/jump loop waiting on the byte. Making the byte at least 9 bits means that a zero can be tested for in the rdlut wz.
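That wait loop would be along these lines (BOX is a made-up LUT mailbox address; the writer stores the byte with bit 8 set, so even $00 reads as non-zero):

wait        RDLUT   t0, #BOX WZ      ' read the shared-LUT mailbox; Z=1 while it holds zero
    if_z    JMP     #wait            ' spin until the other cog deposits a value
            WRLUT   #0, #BOX         ' clear the mailbox for the next byte
            AND     t0, #$FF         ' strip the bit-8 'valid' flag to recover the byte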
So, latency was not your issue. Only instruction stalling.
So please accept my apologies.
My current thinking, however, is that both stalling (which prevents more instructions being executed in a given time) and latency combine to significantly reduce the number of instructions (code) that can execute within the short turnaround time demanded by the protocol.