A different way to access smart pins
Seairth
If we dropped the shared LUT thing, I wonder if there would be enough room to double the bus width to the smart pins. That would speed up the special link mode AND all other smart pin interaction.
@cgracey, what is the current bus width to the smart pins? I have another thought (which I'll have to wait a few hours to share).
I've been thinking about the use of smart pins for fast data sharing. One issue with them is that you have to trade off latency with data width (i.e. 32 bits stall a cog much longer than 8 bits do). Conversely, if you look at something like the hub, careful timing can maximize data width while minimizing latency. So, the question is, what if we were to do the same with the smart pins...
Suppose the following changes:
* Get rid of the variable-length timing, returning to the old method where smart pin access was always 10 clocks
* Group the smart pins into 8 groups
* Widen the data path to 32 bits (which, if @jmg is correct, is effectively how many data lines 8 cogs have anyhow)
* Access to each smart pin becomes round-robin, where the entire 32 bits for each successive smart pin are transferred every clock cycle (see the slot sketch after this list)
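To make the rotation concrete, here is the slot schedule I have in mind for one group (assuming the 64 smart pins split into 8 groups of 8; a sketch of the proposal, not a settled design):

            ' clock:  0    1    2    3    4    5    6    7    8    9   ...
            ' pin:    P0   P1   P2   P3   P4   P5   P6   P7   P0   P1  ...
            '
            ' each pin's full 32 bits move in its slot, so a cog that just
            ' misses its slot waits at most 8 clocks, and a cog that times
            ' its code to hit the slot waits 0 extra clocks; same idea as
            ' hub timing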
Why?
1. It's already a familiar access pattern for a shared resource (i.e. hub)
2. Cogs can do other things between accesses to a smart pin.
3. The worst case latency is not worse than the current worst case. And by timing access, the best case latency will be the same as the current best case. In other words, latency is dependent on timing, not data width.
What can we do with this?
Well, I'm sure you all will be able to think of some great stuff! But, here are a couple simple ones...
A tiny serial echo block:
echo_loop   REP     #rep_end, #0    ' loop forever
            WAITEDG                 ' wait for RX
            PINACK  #32             ' ack (still 2 clocks)
            PINGETZ t0, #32         ' grab the incoming data (waits 6 clocks)
            PINSETY t0, #34         ' send to tx (0 extra clocks due to alignment)
rep_end
Actually, this one could be even more interesting if PINGETZ could also indicate whether the pin event was active. Because PINGETZ would wait until it gets the right slot, you could change the code as follows:
echo_loop   REP     #rep_end, #0    ' loop forever
            PINGETZ t0, #32 WC      ' grab the incoming data (C = event flag)
    if_c    PINSETY t0, #34         ' send to tx (0 extra clocks due to alignment)
    if_c    PINACK  #32             ' ack the rx (still 2 clocks)
rep_end
Of course, this would be much less energy efficient than using WAITEDG, but it would be crazy fast!
How about another example... fast cog-to-cog data sharing of multiple 32-bit values. On the sender's side:
send_data   PINSETY t0, #32         ' write 32 bits to smart pin (triggers edge event in receiver)
            ' grab next data into t0
            WAITEDG                 ' wait for falling edge (receiver ACKed)
            PINSETY t0, #32         ' write t0 to smart pin (waits 6 clocks)
            ' next three instructions set next t0
            PINSETY t0, #32         ' write t0 to smart pin (waits 0 clocks due to alignment)
            ' next three instructions set next t0
            PINSETY t0, #32         ' write t0 to smart pin (waits 0 clocks due to alignment)
            ' repeat as necessary...

On the receiver's side:

get_data    WAITEDG                 ' wait for other cog to write to smart pin
            PINACK  #32             ' ack (still 2 clocks)
            PINGETZ t0, #32         ' grab the incoming data (waits 6 clocks)
            ' next three instructions put t0 somewhere...
            PINGETZ t0, #32         ' grab next incoming data (waits 0 clocks due to alignment)
            ' next three instructions put t0 somewhere...
            ' repeat as necessary...
Anyhow, I think you get the idea. Just as with the hub, you can always safely write your code for the worst-case timing. But, in those few spots where latency is critical, you still have the ability to tune your code.
Comments
I can see the merit of multiple timed SET and GET, but I think the current scheme can be tuned to offer that ?
That's why I ask for end-to-end timings !!
I would suggest looking at DDR handling on the Pin cells first, to see if that can meet timing.
That has low logic cost, and doubles the bandwidth, without doubling the routing.
( We have just seen what large amounts of routing can do to a design... )
Grouping them in sets of eight and time-mux'ing the individual 32-bit Z outputs would maybe be a wash logic-wise, but would cut the read latency in half, down to 8 clocks, max.
Can that still meet USB path timings ?
I thought that was optimized down to work adequately, with the Length+Nibble approach ?
A byte goes from 4 down to 3 clocks, due to header and $00 terminus.
A word goes from 6 down to 4 clocks.
A long goes from 10 clocks down to 6 clocks.
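Those numbers are consistent with one header clock plus one terminating %0000 clock around the data nibbles, with DDR halving only the data-nibble clocks:

            ' byte: 1 header + 2 nibbles + 1 terminus = 4 clocks;  DDR: 1 + 1 + 1 = 3
            ' word: 1 header + 4 nibbles + 1 terminus = 6 clocks;  DDR: 1 + 2 + 1 = 4
            ' long: 1 header + 8 nibbles + 1 terminus = 10 clocks; DDR: 1 + 4 + 1 = 6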
Those don't seem like big-enough savings to warrant the logic required. I think we are at a happy medium using 4 bits.
1) Can you use DDR approach and still meet timing ?
i.e. the destination clocks some registers on the rising edge, some on the falling edge, and MUXes between them using SysCLK.
It is all still SysCLK based, just using both edges.
2) Can the combination of Tx and Rx overlap, to shrink the times from 10+10+extras for Long ?
e.g. a writing COG could signal the receiver, which then generates a GET in parallel; that waits a few clocks until data starts arriving and passes the data through, so the total end-to-end time is perhaps 2 or 3 pipeline clocks more than a single write time.
Biggest gain is on moving 32b.
I don't want to introduce dual-edge clocking, at this point, into the design.
As far as overlapping writes and reads in the smart pins go, things are not set up that way. When the full data byte/word/long arrives at the smart pin, it then issues a write_mode, write_x, or write_y. It's the smart pin mode that handles it from there. In the case of the DAC-alternate modes which convey data, they capture that write_x into Z and start outputting it on the next clock for RDPIN (was GETPINZ). It would take extra muxes to circumvent the existing mechanism for concurrent turn-around. I don't want to grow the smart pin for just that mode.
Alternatively, you could place them above Pin64; they would then be (much?) smaller than Pin Cells, and single-tasked, so they can be very focused on end-to-end optimization.
Still using the Smart Pin highways, and signaling and encoding.
I think the DAC_Data comes almost for free, so you could still keep that, just have some more (above 64) that avoid consuming vital Pin cells for internal use.
Then, you really can say 'Every Pin has a Smart Cell'.
If LUT COG-COG pathways really are all going away, it makes sense to use a little part of the logic that you were prepared to use for them, to boost the speed of the only remaining non-hub pathway.
If we're going to go through all that trouble, maybe we could just use the nibble-wide pin-write conduit to implement inter-cog LUT writing. The message would be 9 address bits plus 32 data bits = 41 bits, or 11 nibbles (clocks), including header. Because these writes would be implemented as messages over 11 clocks, there would need to be a practical rule that only one cog writes a particular cog's LUT at a time. That might be a decent value for the cost.
Meanwhile, I found a way to shave a clock cycle off of WRPIN/WXPIN/WYPIN by having the terminating %0000 be done automatically, after the message. In cases where another WRPIN/WXPIN/WYPIN immediately follows, there is no need for the %0000, as a new header is allowed right after the last nibble of the prior message. The %0000 is needed, though, when no message is present. I implemented it and it's working.
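A sketch of the nibble stream before and after, using the 10-clock long figure quoted earlier:

            ' long write over the 4-bit conduit:
            ' before: hdr n1 n2 n3 n4 n5 n6 n7 n8 %0000      = 10 clocks
            ' after:  hdr n1 n2 n3 n4 n5 n6 n7 n8 [hdr ...]  =  9 clocks
            '         (the %0000 terminus is driven automatically, and only
            '          appears on the bus when no new message follows)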
This makes me consider bumping the cog-to-pin messaging to 8 bits, as timing would be:
Byte to smart pin: 2 clocks
Word to smart pin: 3 clocks
Long to smart pin: 5 clocks
Write to any LUT: 6 clocks
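The byte/word/long figures are consistent with one header clock plus one clock per data byte; the LUT-write figure only works out if part of the address rides in the header, which is my inference:

            ' byte to smart pin: 1 header + 1 data byte  = 2 clocks
            ' word to smart pin: 1 header + 2 data bytes = 3 clocks
            ' long to smart pin: 1 header + 4 data bytes = 5 clocks
            ' LUT write:         1 header + 5 data bytes = 6 clocks
            '                    (9 adr + 32 data = 41 bits, vs 40 bits in
            '                     5 bytes, so presumably one address bit
            '                     travels in the header)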
If you stay at 4 bits, then reduce the LUT to 8 address bits (half the LUT) to get down to 10 clocks.
Presumably, with 8 bits, a LUT write could be 5 clocks with 8 address bits too.
Looks encouraging.
It seems not that good, when I think about it more. Consider that it would take 10, or even 5, clocks to send a long to some LUT. How about just ATN'ing that cog and having him do a SETQ+RDLONG. After the latency (5..22 clocks), that cog will be loading in 32 bits every clock from the hub. And that's not even using the egg-beater FIFO. The sending cog could write to hub, beforehand, using SETQ+WRLONG to move data just as quickly. I don't know. All these other avenues are somewhat expensive and only buy a little advantage over the conduits that are already in place.
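For reference, that alternative would look something like this (a sketch only; buf, BUF_ADDR, the cog mask, and the 16-long count are placeholders):

            ' sending cog: stage the data in hub, then get the receiver's attention
            SETQ    #16-1            ' block-write 16 longs...
            WRLONG  buf, ##BUF_ADDR  ' ...from cog registers starting at buf
            COGATN  #%0010           ' strobe attention to cog 1

            ' receiving cog: wait for attention, then block-read
            WAITATN                  ' wait for the sender's ATN strobe
            SETQ    #16-1            ' block-read 16 longs...
            RDLONG  buf, ##BUF_ADDR  ' ...loading ~1 long per clock once started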
By making it take one less clock for WRPIN/WXPIN/WYPIN, a few ALMs were shaved off each cog. Also, I made the instruction decoding for the D/#-only instruction not care about the S[8:6] bits, as they are always %000.
Placement is done and now it's routing.
EDIT: Woops! I had the CORDIC turned off. I'll try again...
Was he not able to use RDFAST or WRFAST, so that he could do RFBYTE/WFBYTE?
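For anyone following along, that hub-FIFO pattern is roughly this (BUF_ADDR and t0 are placeholders):

            RDFAST  #0, ##BUF_ADDR   ' start the hub read FIFO at BUF_ADDR (no wrap)
loop        RFBYTE  t0               ' fetch the next byte; no hub-window stall once primed
            ' ... process t0 ...
            JMP     #loop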
* It can only send 32b, which immediately pushes it to the slowest message end
* it adds 9 Adr bits of payload, slowing it down even more.
The 9 adr bits are flexible, but this is going to be slower than HUB-LUT, so no one is going to use it for large blocks of data.
Most cases will be COG-COG pipes, often Bytes.
The appeal of the Pin_cell path, is to be small and nimble.
Now that is more what I had in mind - compacting the Pin-Cell links as much as possible.
What is the End-To-End delay times, with the Pin-cell conduit, in this scheme ?
What could they be lowered to, with a shrunk cell, optimized for transfer, placed above pin 64 ?
Are you using the latest Quartus Prime 16.0?
No form of caching or block move helps. That's why I loved the LUT sharing, because the second cog could be in a rdlut/jump loop waiting on the byte. Making the byte at least 9 bits means that a zero can be tested for in the rdlut wz.
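That wait loop would be along these lines (BOX is a made-up LUT mailbox address; the writer stores the byte with bit 8 set, so even $00 reads as non-zero):

wait        RDLUT   t0, #BOX WZ      ' read the shared-LUT mailbox; Z=1 while it holds zero
    if_z    JMP     #wait            ' spin until the other cog deposits a value
            WRLUT   #0, #BOX         ' clear the mailbox for the next byte
            AND     t0, #$FF         ' strip the bit-8 'valid' flag to recover the byte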
So, latency was not your issue. Only instruction stalling.
So please accept my apologies.
My current thinking, however, is that both stalling (which prevents more instructions being executed in a given time) and latency combine to significantly reduce the number of instructions (code) that can execute within the short turnaround time demanded by the protocol.