Prop2 Development Update

jmg · 2015-01-31 21:37

[QUOTE=mark

evanh · 2015-01-31 22:06

I don't really see any advantage there since throughput isn't an issue. The only potential problem with SmartPins is read latency ... which could be sorted with more complex buffering/DMA at the Cog end of things. Even that may not be worth the real-estate it takes up.

mark · 2015-02-01 14:04

jmg wrote: »

The simplest method is DDR, as that avoids needing any new clock domains, and new clock domains are non-trivial.

The SIM values I got from Verilog counters, suggested there was not a lot of headroom over the SysCLK numbers Chip was quoting - IIRC he gave some counter values, which were < 2x the SysCLK.
(which shows how high he has managed to push the relative SysCLK )

Another Clock Domain that gave better PWM and capture precision would be good, but the numbers did not suggest enough room.
That leaves just serial shifting, where DDR makes more sense. ( & DDR can actually also be used at the pins, for SPI modes)

From my limited knowledge on the topic, I'd agree that DDR would be the easiest method to increase bandwidth with any existing internal interfaces, and a 2X increase in speed certainly isn't negligible. I would say that this approach should be considered at the very least.

As for your SIM values, aren't those specific to the hardware you were targeting in the build? So those weren't representative of what's possible for an ASIC on a 180nm process node, but rather the specific FPGA you were working with, right?

evanh wrote:

I don't really see any advantage there since throughput isn't an issue. The only potential problem with SmartPins is read latency ... which could be sorted with more complex buffering/DMA at the Cog end of things. Even that may not be worth the real-estate it takes up.

I don't really know what the usefulness would be, as I'm very much an amateur in this field. However, I know there were performance concerns related to pin/cog interfacing, which Chip hoped to at least partially solve by assigning certain pins to cogs and giving them their own high-bandwidth interface. That leads me to believe such options are legitimately worth looking into. I wouldn't really consider any buffering or DMA in the traditional sense of CPU designs, as that doesn't inherently solve the bandwidth issues sufficiently in a chip that's supposed to be just about fully deterministic. I do, however, see pins being hub memory mapped as a potential option.

jmg · 2015-02-01 14:43

[QUOTE=mark

mark · 2015-02-01 15:50

jmg wrote: »

True but Chip also provided OnSemi 180nm Sim values for Counters, which made it possible to compare the FPGA with 180nm. (and also compare Counters with CPU SysCLK )
They were broadly similar in MHz - somewhat expected as the FPGA has faster process, but also incurs a routing/mux overhead from all the configuration.

Ah, ok. I don't remember that.

I wonder what balance could be struck between message "latency", bus width per pin, and DDR. A 4-bit-wide bus with DDR can transfer a 32-bit message in 4 clocks, which would sure beat 32 clocks, and would require a 256-wire data bus to interface with every pin. How reasonable a latency is that in practice? And what is the target DAC/ADC resolution? Assuming 16 bits or less, that would only require 2 clocks for a transfer. Not bad.
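The arithmetic in this post can be checked with a short sketch. The function names are made up for illustration, and the 64-pin count is an assumption inferred from the 256-wire figure (256 wires / 4 bits per pin); it is not stated in the thread.

```python
import math

def transfer_clocks(message_bits: int, bus_width: int, ddr: bool = True) -> int:
    """Clocks needed to move one message over a per-pin bus."""
    # DDR moves data on both clock edges, doubling the bits per clock.
    bits_per_clock = bus_width * (2 if ddr else 1)
    return math.ceil(message_bits / bits_per_clock)

def total_wires(pin_count: int, bus_width: int) -> int:
    """Data wires needed to give every pin its own bus of the given width."""
    return pin_count * bus_width

print(transfer_clocks(32, 4))             # 4-bit DDR bus, 32-bit message
print(transfer_clocks(16, 4))             # 4-bit DDR bus, 16-bit DAC/ADC value
print(transfer_clocks(32, 1, ddr=False))  # single-bit serial line, no DDR
print(total_wires(64, 4))                 # assumed 64 pins, 4 wires each
```

Under those assumptions the four calls give 4, 2, 32 and 256, matching the numbers quoted above.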

evanh · 2015-02-01 16:16

[QUOTE=mark

mark · 2015-02-01 16:39

evanh wrote: »

PS: Cogs are the way to go for memory mapping, Hub time slots are all taken already.

While I agree with your statement, especially since there's going to be hub execute, my line of thought was that it might simplify the routing to the pins. So instead of each and every pin having to be routed to every cog, each pin is instead routed to a hub memory register.

I guess any greater bandwidth requirements outside of the fast DACs will depend on just how "smart" these pins will be.

evanh · 2015-02-01 17:17

No, I was wrong. Hub or Cog, both can work. It would be another mux layer though, which works better inside a Cog. And there are more options for selective mapping within a Cog also. With the Hub, the whole of the SmartPins register map would be laid out as a fixed map, ie: memory mapped I/O.

mark · 2015-02-01 17:21

evanh wrote: »

No, I was wrong. Hub or Cog, both can work. It would be another mux layer though.

I meant actually mapped to some address in one of the 16 hub slices, and not some new 17th slice or something. No extra mux needed. But as you can imagine, it has some drawbacks.

evanh · 2015-02-01 17:26

[QUOTE=mark

jmg · 2015-02-01 19:39

What I remember of the Pin-Cell discussions, was Chip wanted to KISS this, and use existing pin-connect lines as much as possible.
A serial Pin-select-transaction (not fast) connects a COG to a Pin-cell, & configures that Cell, and then you get a single bit of OUT and IN, (which no longer have to be to the physical pin).
For Capture / Compare & signaling, those BITs are the ARM and DONE type flags, so you get single cycle granularity on control, provided you can map that to a single bit.

The Cell would also use the fast serial opcode, to send / receive data, from tiny FIFOs in the Pin cell - so that has a finite bandwidth.

DACs had a separate data path, via an optional CLUT.

Half duplex Serial UART would be simple with this flow.
Connect to COG, Write-SBUF, and Read.TxDone
COG code then does a repeat of
FastSerial.Write-SBUF
Wait for TxDone

Full duplex is trickier, as the flags are now TxDone and RxReady, and FIFO write & FIFO read are needed.
With suitable gymnastics on IN, OUT and DIR lines, it might be do-able without needing slower commands inside sending loops.
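The half-duplex flow described above (write SBUF, then wait on TxDone) can be modelled roughly as follows. This is only a sketch: `PinCell`, `write_sbuf`, `tx_done` and `tick` are hypothetical names, the FIFO depth of 1 is an assumption, and real hardware would shift data out on its own rather than being stepped from the wait loop.

```python
from collections import deque

class PinCell:
    """Hypothetical model of a pin cell with a tiny transmit FIFO."""

    def __init__(self, fifo_depth=1):
        self.fifo = deque()
        self.fifo_depth = fifo_depth  # assumed depth, not from the thread
        self.sent = []                # bytes that have left the pin

    @property
    def tx_done(self):
        # Stands in for the single-bit DONE flag seen on the IN line.
        return len(self.fifo) == 0

    def write_sbuf(self, byte):
        # Stands in for the fast serial write into the cell's FIFO.
        if len(self.fifo) >= self.fifo_depth:
            raise RuntimeError("FIFO full")
        self.fifo.append(byte & 0xFF)

    def tick(self):
        # Advance the model by one step: one queued byte goes out.
        if self.fifo:
            self.sent.append(self.fifo.popleft())

def send_half_duplex(cell, data):
    """Repeat 'write SBUF, wait for TxDone' for every byte; return poll count."""
    polls = 0
    for byte in data:
        cell.write_sbuf(byte)
        while not cell.tx_done:
            cell.tick()  # in hardware this happens by itself
            polls += 1
    return polls

cell = PinCell()
send_half_duplex(cell, b"P2")
print(cell.sent)
```

Full duplex would need a second flag (RxReady) and a FIFO read alongside this loop, which is where the "gymnastics" on IN, OUT and DIR come in.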

evanh · 2015-02-01 20:46

jmg wrote: »

What I remember of the Pin-Cell discussions, was Chip wanted to KISS this, and use existing pin-connect lines as much as possible.

It was a somewhat bigger issue than keeping it simple. The parallel-bus-like interconnect needed for all Cogs to have free access to all SmartPins at instruction cycle speeds made the interconnect massive! The single-bit-line nature and OR-gate cooperation mechanism from the Prop1 design is such a huge saving of real-estate that it became a no-brainer to jump through a few hoops and restrict performance, so as to have the flexibility without the real-estate cost.

Removing that interconnect was the point where the earlier Prop2 design gained a lot more space. I think it went from 128KB to 256KB of HubRAM with room to spare for bigger/hotter Cogs.
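For readers unfamiliar with the Prop1 mechanism referred to here: each Cog contributes one output bit and one direction bit per pin, and the per-Cog bits are simply ORed together. A minimal sketch of that combining rule, with a made-up function name and 4-bit example values:

```python
from functools import reduce

def combine_pins(cog_outputs, cog_directions):
    """OR together per-Cog OUT and DIR registers, Prop1 style.

    A Cog's output bit only counts where that Cog has set the pin as an
    output, so each OUT value is masked by the matching DIR value first.
    """
    dirs = reduce(lambda a, b: a | b, cog_directions, 0)
    outs = reduce(
        lambda a, b: a | b,
        (out & direction for out, direction in zip(cog_outputs, cog_directions)),
        0,
    )
    return outs, dirs

# Three Cogs sharing four pins; the third Cog drives nothing (DIR = 0).
outs, dirs = combine_pins(
    cog_outputs=[0b0001, 0b0100, 0b1000],
    cog_directions=[0b0011, 0b0100, 0b0000],
)
print(f"{outs:04b} {dirs:04b}")
```

Only one wire per pin per Cog is needed in each direction, which is the real-estate saving compared with routing a full parallel bus from every Cog to every pin.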

ErNa · 2015-02-02 11:33

For those who weren't aware of it: http://forums.parallax.com/showthread.php/159884-Appreciation-and-Thanks?p=1313096#post1313096

Tubular · 2015-02-02 15:07

Thanks ErNa, hadn't spotted that. September's a nice month to visit, I believe.
