P2 vs modern process limits

jmg · 2016-05-03 09:29

cgracey wrote: »

I just finished the 1/2/4-bit pin-to-hub streamer input. I still need to test it, though.

What feed ordering system did you choose for the 2 bit ?
4 bits is naturally nibble based, and 1 bit is easy too, but two bits has a couple of possible choices.

For JTAG use, it is probably easiest to feed from a 16 bit split, so upper 16 feed TMS and lower 16 feed TDI. All the state-engine stuff has data in both halves, and then data-pages need update just the lower 16, as TMS is then stable.
Data comes in from TDO
If it talks to Dual-mode SPI, then the bit ordering should follow that convention.
Or, an opcode to convert between the 2 choices ?

cgracey wrote: »

You're on track with Cluso's latch, but it's better than that.

Better hub-hub links

Is there a means to lower the COG clocks, separate from the peripheral SysCLK ?
There are cases where users may want 100+MHz on pins for resolution, but not actually need 100MIPS in every single COG they use.

WAIT opcodes effectively do that already, but of course only while the wait is actually active. I'm thinking along the lines of a simple, small prescaler. A core clock divider is common in many MCUs: some do /2^N and some do /N.
The eggbeater slots would interact with this, so maybe the prescaler alternates in a simple manner : WAIT resolves to 1 SysCLK, and a HUB Slot also resolves to the next slot on a SysCLK basis, leaving all 'other' opcodes as the throttled ones.

cgracey · 2016-05-03 09:38

jmg wrote: »

cgracey wrote: »

I just finished the 1/2/4-bit pin-to-hub streamer input. I still need to test it, though.

What feed ordering system did you choose for the 2 bit ?
4 bits is naturally nibble based, and 1 bit is easy too, but two bits has a couple of possible choices.

For JTAG use, it is probably easiest to feed from a 16 bit split, so upper 16 feed TMS and lower 16 feed TDI. All the state-engine stuff has data in both halves, and then data-pages need update just the lower 16, as TMS is then stable.
Data comes in from TDO
If it talks to Dual-mode SPI, then the bit ordering should follow that convention.
Or, an opcode to convert between the 2 choices ?

cgracey wrote: »

You're on track with Cluso's latch, but it's better than that.

Better hub-hub links

Is there a means to lower the COG clocks from the peripheral SysCLK ?
There are cases where users may want 100+MHz on pins for resolution, but not actually need 100MIPS.

WAIT opcodes naturally effectively do that, but of course only when the actual wait is active.

For the 1/2/4-bit modes, you can select either shift left or shift right by 1/2/4 bits. I never thought about splitting the two-bit mode like you said. Interesting idea. Wait! We have a SPLITW and a MERGEW instruction to do that post or pre. I don't think the hardware needs to worry about it.

There's no way to clock the cog at a fractional rate. It's hard to think about that with the hub timing.

JRetSapDoog · 2016-05-03 09:48

cgracey wrote: »

You're on track with Cluso's latch, but it's better than that.

So some kind of inter-cog communication facility, then, it seems. Wonder if it will simultaneously allow any cog to see a shared long from any other cog? I'm probably dreaming, as that idea was already shot down. But if it were that, I wonder what cool name it would be given. I'm a sucker for names. (Please don't call it the Precog Pool from that strange Minority Report movie)

cgracey · 2016-05-03 09:52

JRetSapDoog wrote: »

cgracey wrote: »

You're on track with Cluso's latch, but it's better than that.

So some kind of inter-cog communication facility, then, it seems. Wonder if it will simultaneously allow any cog to see a shared long from any other cog? I'm probably dreaming, as that idea was already shot down. But if it were that, I wonder what cool name it would be given. I'm a sucker for names.

That would take too much logic. Think more powerful, but shorter range.

JRetSapDoog · 2016-05-03 10:00

cgracey wrote: »

That would take too much logic. Think more powerful, but shorter range.

So it's like the strong nuclear force, powerful but acting at a short distance?

Hmm, perhaps it allows peeking at more than one long between neighboring cogs, but wouldn't that require another port on the cog memories? That's dreaming, too, I guess (though you did say you had more room a ways back). So maybe there will be a scratchpad or pool of longs that can be shared between neighboring cogs, something like that. Thinking grandiose again, maybe the entirety of each of the auxiliary/CLUT memories can be shared between neighbors (perhaps interleaved in timing somehow). Anyway, I should give up. This is all over my head, but I'll grab some popcorn to watch. But, hey, don't you hate it when you give someone a gift in a small box and they guess it's a Rolex watch or something when it's only a chunk of chocolate?

jmg · 2016-05-03 10:21

cgracey wrote: »

For the 1/2/4-bit modes, you can select either shift left or shift right by 1/2/4 bits. I never thought about splitting the two-bit mode like you said. Interesting idea. Wait! We have a SPLITW and a MERGEW instruction to do that post or pre. I don't think the hardware needs to worry about it.

Cool, so those interleave bits from upper/lower 16 ?
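
As a rough illustration of the interleave being asked about, here is a minimal Python sketch that converts between the "16-bit split" layout (upper 16 = TMS, lower 16 = TDI) and a bit-pair layout a 2-bit shifter could feed to two pins. The function names and the exact bit assignment (lower-half bit i lands on even bit 2*i) are assumptions for illustration, not a statement of what SPLITW and MERGEW actually do.

```python
MASK16 = 0xFFFF

def merge_halves(value: int) -> int:
    """Interleave the two 16-bit halves of a 32-bit value.

    Assumed layout: output bit 2*i comes from lower-half bit i,
    output bit 2*i+1 comes from upper-half bit i.
    """
    lower = value & MASK16
    upper = (value >> 16) & MASK16
    out = 0
    for i in range(16):
        out |= ((lower >> i) & 1) << (2 * i)
        out |= ((upper >> i) & 1) << (2 * i + 1)
    return out

def split_halves(value: int) -> int:
    """Inverse of merge_halves: even bits go to the lower 16, odd bits to the upper 16."""
    lower = 0
    upper = 0
    for i in range(16):
        lower |= ((value >> (2 * i)) & 1) << i
        upper |= ((value >> (2 * i + 1)) & 1) << i
    return (upper << 16) | lower
```

With this layout, a word holding TMS in the upper half and TDI in the lower half becomes sixteen (TMS, TDI) pairs after `merge_halves`, and `split_halves` undoes it.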

cgracey wrote: »

There's no way to clock the cog at a fractional rate. It's hard to think about that with the hub timing.

Yes, that is why I thought Wait and HUB would simply trump the slow-down.
Any of those opcodes work as usual, others (ones that have nothing special in the timing area), can have a slower paced Clock enable - just like a wait on every opcode.
- maybe that WAIT can be used, as it is already there, just add a mode to fire WAIT after any non-timing opcodes ?
PINSET etc would probably be in the 'as usual/full speed' basket too.
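
To make the idea concrete, here is a toy Python model of the throttle described above: timing-sensitive opcodes (WAIT, HUB, pin operations) cost their usual SysCLKs, and everything else is stretched by a prescaler. The opcode classes, the clock counts, and the name `sysclks_used` are all made up for illustration; nothing here reflects actual P2 timing.

```python
# Hypothetical opcode classes that would ignore the prescaler.
FULL_SPEED = {"WAIT", "HUB", "PIN"}

def sysclks_used(program, divider):
    """Total SysCLK cycles for a list of (kind, base_clocks) pairs.

    Kinds in FULL_SPEED cost their normal clocks; every other opcode is
    stretched by the divider, as if a wait followed it.
    """
    if divider < 1:
        raise ValueError("divider must be >= 1")
    total = 0
    for kind, base_clocks in program:
        if kind in FULL_SPEED:
            total += base_clocks
        else:
            total += base_clocks * divider
    return total

# Invented example program: two ALU ops, one hub access, one wait, one pin op.
program = [("ALU", 2), ("ALU", 2), ("HUB", 5), ("WAIT", 10), ("PIN", 2)]
```

In this invented example, `sysclks_used(program, 1)` is 21 and `sysclks_used(program, 4)` is 33: only the two ALU opcodes slow down, while pin and hub timing stay on a SysCLK basis.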

Seairth · 2016-05-03 10:56

I still think 8 4-port LUTs shared between pairs of cogs would be super powerful.

Edit: Or, now that I think about it, you could maybe have 16 3-port LUTs, where the first two ports are used as they are right now and the third port would be used by RDAUX/WRAUX to access each other's LUT (i.e. still paired).

Either way, add an event for writing to $1FF (not just reading from) and you're good to go!

KeithE · 2016-05-03 14:23

cgracey wrote: »

ozpropdev wrote: »

92,800 cogs! (said while experiencing sharp chest pains)
If Quartus takes 2 hours to compile Chip's 16 cog verilog code now, even 1000 cogs would take an eternity.

There is an FPGA cluster made by Cadence called Palladium that costs $1M, I heard. It's for ASIC prototyping. I bet that thing takes a lot of time to compile for. The user's design would have to be partitioned somehow.

That one is actually not FPGA based. It uses custom processors - it would be interesting to see the microarchitecture of those. I worked with people who used them, and don't think that the compile times were too bad. I recall facilities having to be involved in the installation. They did have ways to plug in boards which would allow the emulation to interact with the real world, but this often involved speed bridges since the emulations couldn't run at speed.

You might be thinking of the FPGA based Synopsys HAPS product line. (Follow on to CHIPit.) For the GNSS parts that I worked on, the partitioning wasn't bad due to the chip sizes. If I remember correctly you could iterate about one or two times a day. The work flow involves a lot of RTL simulation on compute clusters before proceeding to an FPGA, but that's typical.

cgracey · 2016-05-03 14:35

KeithE wrote: »

cgracey wrote: »

ozpropdev wrote: »

92,800 cogs! (said while experiencing sharp chest pains)
If Quartus takes 2 hours to compile Chip's 16 cog verilog code now, even 1000 cogs would take an eternity.

There is an FPGA cluster made by Cadence called Palladium that costs $1M, I heard. It's for ASIC prototyping. I bet that thing takes a lot of time to compile for. The user's design would have to be partitioned somehow.

That one is actually not FPGA based. It uses custom processors - it would be interesting to see the microarchitecture of those. I worked with people who used them, and don't think that the compile times were too bad. I recall facilities having to be involved in the installation. They did have ways to plug in boards which would allow the emulation to interact with the real world, but this often involved speed bridges since the emulations couldn't run at speed.

You might be thinking of the FPGA based Synopsys HAPS product line. (Follow on to CHIPit.) For the GNSS parts that I worked on, the partitioning wasn't bad due to the chip sizes. If I remember correctly you could iterate about one or two times a day. The work flow involves a lot of RTL simulation on compute clusters before proceeding to an FPGA, but that's typical.

Thinking about it, the difficulty of making a cutting-edge ASIC is always going to be high. At any point in time you can only use today's technology to make tomorrow's chips.

KeithE · 2016-05-03 15:04

cgracey wrote: »

Thinking about it, the difficulty of making a cutting-edge ASIC is always going to be high. At any point in time you can only use today's technology to make tomorrow's chips.

Our group always felt fortunate that our core could fit in the largest FPGA of the day. (Let's say an $8000 FPGA to emulate a $1 ASIC ;-) Then for larger chips which wanted to use the core, we could have a simple interface to another FPGA or set of FPGAs which contained the remainder of the design. Also we could run fast enough to interface to radio test chips, so we not only got to test those interfaces but we had a system which could be used for software development.

If we didn't have those FPGA boards we would have failed. No doubt about it.

Electrodude · 2016-05-03 18:47

cgracey wrote: »

You're on track with Cluso's latch, but it's better than that.

Could it be FIFOs (more than a latch) between cogs? Or connecting one cog's streamer to a neighbor's egg beater FIFO or something?

evanh · 2016-05-03 22:35

I'm pretty certain it's the dual-ported LUT being shareable to the next adjacent Cog. So, say, Cog1's LUT can also be accessed by Cog0. And following that around, Cog0's LUT will be accessible by Cog15.

evanh · 2016-05-03 22:36

I'm assuming the LUTs are now dual-ported.

Seairth · 2016-05-03 22:52

evanh wrote: »

I'm assuming the LUTs are now dual-ported.

They are, but that's to support RDLUT/WRLUT and streamer at the same time. You'd need to go to 3 or 4 ports to share the LUT. I think paired COG access would be more useful than n->n+1 access.
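
The two topologies being compared can be sketched in a few lines of Python. This assumes 16 cogs numbered 0..15 (the thread mentions a 16-cog build); the function names are hypothetical.

```python
NUM_COGS = 16  # assumption: cogs numbered 0..15

def paired_partner(cog):
    """Even/odd pairing: 0<->1, 2<->3, ... Each cog has exactly one partner."""
    return cog ^ 1

def next_cog(cog):
    """Ring (n -> n+1): each cog reaches the next one, wrapping from 15 to 0."""
    return (cog + 1) % NUM_COGS
```

Pairing is symmetric (`paired_partner(paired_partner(n)) == n`), so two cogs can talk both ways. The ring is one-directional per link, but lets data pass along a chain of cogs.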

78rpm · 2016-05-04 00:04

My guess:

Each cog can signal every other cog. It is a lot of wires and a mux, though many instances.

cgracey · 2016-05-04 00:06

Electrodude wrote: »

cgracey wrote: »

You're on track with Cluso's latch, but it's better than that.

Could it be FIFOs (more than a latch) between cogs? Or connecting one cog's streamer to a neighbor's egg beater FIFO or something?

You guys have more imagination than I do.

Seairth · 2016-05-04 01:13

cgracey wrote: »

Electrodude wrote: »

cgracey wrote: »

You're on track with Cluso's latch, but it's better than that.

Could it be FIFOs (more than a latch) between cogs? Or connecting one cog's streamer to a neighbor's egg beater FIFO or something?

You guys have more imagination than I do.

Hardly!

cgracey · 2016-05-04 03:10

What I'm implementing is something that one of you thought of. Many of the P2 ideas and most all of the refinements came from people on this forum.

potatohead · 2016-05-04 15:20

Oh that helps! We have thought about a ton of stuff!

Greets all in P2 land. Spud has been busy, but reading and programming some PASM.

Did you all read, "it will finish the design?"

Been a long road getting here. Thanks Chip. If nothing else, I know I have learned a ton through this process!

Bill Henning · 2016-05-04 19:09

I think Chip is having fun keeping us guessing...

Seairth · 2016-05-04 19:36

Bill Henning wrote: »

I think Chip is having fun keeping us guessing...

And guessing is what we'll do! I think he's implementing the following:

* two 32-bit registers between pairs of cogs
* two new instructions that write to the "outgoing" register and read from the "incoming" register
* two new events for when the "incoming" register is written to or when the "outgoing" register is read from

no imagination required.

and I think someone had actually suggested this already...

Edit: okay... a bit of imagination... 33-bit registers that also capture C on write and set C on read (if WC).

ErNa · 2016-05-04 20:40

SPS instruction: solve problem simply

kwinn · 2016-05-04 21:00

Seairth wrote: »

Bill Henning wrote: »

I think Chip is having fun keeping us guessing...

And guessing is what we'll do! I think he's implementing the following:

* two 32-bit registers between pairs of cogs
* two new instructions that write to the "outgoing" register and read from the "incoming" register
* two new events for when the "incoming" register is written to or when the "outgoing" register is read from

no imagination required. and I think someone had actually suggested this already...

I can see the need for two physical registers so there is a connection to the cogs on either side, but not for special instructions. Better to make it part of the regular register or LUT address space. That way any instruction that affects registers or LUT can be used.

jmg · 2016-05-04 21:14

kwinn wrote: »

I can see the need for two physical registers so there is a connection to the cogs on either side, but not for special instructions. Better to make it part of the regular register or lut address space. that way any instruction that affects registers or lut can be used

Yup, and once you do that, no need to stop at one register - you can Dual port a few locations... with half left-going, and half right-going.

Seairth · 2016-05-04 21:17

kwinn wrote: »

Seairth wrote: »

Bill Henning wrote: »

I think Chip is having fun keeping us guessing...

And guessing is what we'll do! I think he's implementing the following:

* two 32-bit registers between pairs of cogs
* two new instructions that write to the "outgoing" register and read from the "incoming" register
* two new events for when the "incoming" register is written to or when the "outgoing" register is read from

no imagination required. and I think someone had actually suggested this already...

I can see the need for two physical registers so there is a connection to the cogs on either side, but not for special instructions. Better to make it part of the regular register or lut address space. that way any instruction that affects registers or lut can be used

No, no, no! Too much imagination! Whatever he's implementing has got to be simple, so the design can be completed!

(on a technical note, I was suggesting that the two registers were between an odd/even cog pair, like USB, not one between each adjacent cog. Doing one register between each cog would either result in contention issues for two-way communication or would be of limited use for one-way communication.)

kwinn · 2016-05-04 21:39

Seairth wrote: »

kwinn wrote: »

Seairth wrote: »

Bill Henning wrote: »

I think Chip is having fun keeping us guessing...

And guessing is what we'll do! I think he's implementing the following:

* two 32-bit registers between pairs of cogs
* two new instructions that write to the "outgoing" register and read from the "incoming" register
* two new events for when the "incoming" register is written to or when the "outgoing" register is read from

no imagination required. and I think someone had actually suggested this already...

I can see the need for two physical registers so there is a connection to the cogs on either side, but not for special instructions. Better to make it part of the regular register or lut address space. that way any instruction that affects registers or lut can be used

No, no, no! Too much imagination! Whatever he's implementing has got to be simple, so the design can be completed!

(on a technical note, I was suggesting that the two registers were between an odd/even cog pair, like USB, not one between each adjacent cog. Doing one register between each cog would either result in contention issues for two-way communication or would be of limited use for one-way communication.)

Not necessarily. Two unidirectional registers can share an address where a write to that address writes the data into the register that can only be read by the adjacent cog. A read from that address reads from the register that can only be written to by the adjacent cog. Physically it is four registers to communicate with the two adjacent cogs, but as far as addressing goes it is only two registers and no contention.

On second thought, only two registers and two tristate buffers are required.
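
A minimal Python sketch of that shared-address scheme, for one link between two neighboring cogs. The class and variable names are hypothetical; the point is only that a write goes to the register the neighbor reads, and a read comes from the register the neighbor writes, so both directions use one address without contention.

```python
class Mailbox:
    """One 32-bit register, written by one cog and read by its neighbor."""
    def __init__(self):
        self.value = 0

class CogPort:
    """A single address as seen by one cog: writes go out, reads come in."""
    def __init__(self, outgoing, incoming):
        self.outgoing = outgoing
        self.incoming = incoming

    def write(self, value):
        # A write lands in the register only the neighbor can read.
        self.outgoing.value = value & 0xFFFFFFFF

    def read(self):
        # A read returns the register only the neighbor can write.
        return self.incoming.value

# Two unidirectional registers form one bidirectional link.
a_to_b = Mailbox()
b_to_a = Mailbox()
cog_a = CogPort(outgoing=a_to_b, incoming=b_to_a)
cog_b = CogPort(outgoing=b_to_a, incoming=a_to_b)
```

Both cogs can write in the same cycle without clobbering each other, since each write targets a different physical register.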

evanh · 2016-05-04 22:09

I don't see the problem with wiring the whole LUT if it's dual ported. Streamer is not going to be used very often so the second port is sitting idle in most cases.

jmg · 2016-05-04 22:38

evanh wrote: »

I don't see the problem with wiring the whole LUT if it's dual ported. Streamer is not going to be used very often so the second porting is sitting idle in most cases.

Not so much a 'problem' as a matter of details: you need more address lines routed, and may hit a timing ceiling.
Smaller memory pieces need fewer address lines, and access faster, giving more time for MUX overhead.
Practical placement & routing may work better with 3 memory areas:
a large Streamer-shared one, and two smaller ones on either side, physically placed close to the adjacent COG since they are dual port each way.

kwinn · 2016-05-04 22:39

evanh wrote: »

I don't see the problem with wiring the whole LUT if it's dual ported. Streamer is not going to be used very often so the second porting is sitting idle in most cases.

Good point, and having use of the whole LUT is a bonus.

cgracey · 2016-05-04 22:46

evanh wrote: »

I don't see the problem with wiring the whole LUT if it's dual ported. Streamer is not going to be used very often so the second porting is sitting idle in most cases.

That's it. Each cog will be able to r/w the next cog's LUT via its second port that the streamer uses. The other cog will have priority over the local cog's streamer.
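
The arbitration described here can be sketched as a toy Python model of one LUT's second port. Everything beyond "the other cog wins over the local streamer" is an assumption: the 512-long size (inferred from the $1FF mention earlier in the thread), the request format, and the method name. What the streamer does on a cycle it loses is not stated, so the model simply reports who owned the port.

```python
LUT_SIZE = 512  # assumption: addresses $000..$1FF

class SharedLut:
    def __init__(self):
        self.mem = [0] * LUT_SIZE

    def second_port_cycle(self, neighbor_req=None, streamer_addr=None):
        """Arbitrate one clock on the second port.

        neighbor_req: None, ("r", addr) or ("w", addr, value) from the other cog.
        streamer_addr: None, or an address the local streamer wants to read.
        Returns (owner, data); owner is "neighbor", "streamer" or None.
        """
        if neighbor_req is not None:
            # The other cog has priority over the local streamer.
            if neighbor_req[0] == "w":
                _, addr, value = neighbor_req
                self.mem[addr % LUT_SIZE] = value & 0xFFFFFFFF
                return ("neighbor", None)
            _, addr = neighbor_req
            return ("neighbor", self.mem[addr % LUT_SIZE])
        if streamer_addr is not None:
            return ("streamer", self.mem[streamer_addr % LUT_SIZE])
        return (None, None)
```

When both requesters are active in the same cycle, the neighbor's access goes through and the streamer's does not, which matches the priority rule stated above.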
