New Hub Scheme For Next Chip

Invent-O-Doc · 2014-05-22 12:36

repeat:
What Phil said....

jmg · 2014-05-22 12:59

Bill Henning wrote: »

Very interesting! Branches would have to go to 8 byte boundaries, which is not a major problem, but may throw off the memory stride, depending on step size and pattern.

There is also a Block read variant, which I think also imposes boundaries, as it can read 16 opcodes in just 8 Opcode cycles, and be memory linear, whilst I think Chip's unrolled example is still not going to be memory linear.

I think these are also still present, which open more scope for 'chunkier' LMM up to the inserted subroutine level

HUB2REG	D/#,S/#		- read S[8:0]+1 longs from read-FIFO starting at reg D[8:0]
HUB2LUT	D/#,S/#		- read S[7:0]+1 longs from read-FIFO starting at LUT D[7:0]
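
As a rough illustration of the HUB2REG description above, here is a minimal Python model. The `hub2reg` function, the `deque` standing in for the read-FIFO, and the wrap-around at register 511 are assumptions made for the sketch, not documented hardware behaviour.

```python
from collections import deque

def hub2reg(regs, fifo, d, s):
    """Move S[8:0]+1 longs from the read-FIFO into regs, starting at D[8:0]."""
    start = d & 0x1FF          # D[8:0]: first destination register
    count = (s & 0x1FF) + 1    # S[8:0]+1: number of longs to transfer
    for i in range(count):
        # Wrapping past register 511 is an assumption of this sketch.
        regs[(start + i) & 0x1FF] = fifo.popleft()
    return count

regs = [0] * 512                 # hypothetical cog register file
fifo = deque(range(100, 116))    # 16 longs already waiting in the read-FIFO
moved = hub2reg(regs, fifo, d=0x010, s=15)
```

HUB2LUT would look the same with 8-bit fields (`& 0xFF`) and a 256-long LUT in place of the register file.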

mark · 2014-05-22 21:42

Since the FIFO can be used to stream data directly between the hub and pins, what would it take for it to be useful as a SERDES?

Rayman · 2014-05-23 05:56

I think the ability to move data between hub/cog/pins at up to the full clock speed is the basic building block that Chip wants to include here.
Sounds like some people are asking what good is it for. I'm sure we'll think of something...

dMajo · 2014-05-23 09:52

Regarding memory schemes, lut & friends #98

User Name · 2014-05-23 19:34

Rayman wrote: »

I think the ability to move data between hub/cog/pins at up to the full clock speed is the basic building block that Chip wants to include here.
Sounds like some people are asking what good is it for. I'm sure we'll think of something...

+1

For whatever value my opinion may have in this discussion, I'm against slowing things down in order to dumb them down. A 2.5x improvement over P1 is hardly worth the bother.

koehler · 2014-05-23 20:06

+/-1

We're all just greedy ...

A 2.5x P1 is probably great and would offer more than many of Parallax's revenue customers want or even need at this time.
On the flip side, no matter what is offered, there is always going to be someone wanting more, more, more.

The question that few want asked, or are willing to answer, is how much is too much, and what happens if it ends up being overly complex to the point that large customers who are supporting its development give it a pass?

All this additional time, work, complexity.... and I don't see any realistic benefit at the end of the day insofar as Parallax getting additional revenue from it.
I think people would have a markedly different set of opinions if their 401K were invested in Parallax, or if their paycheck were dependent upon company revenue. Since they don't, there is always room to consider delay and performance improvements regardless of time/$$.

<This is a general statement, not directed at anyone>

RossH · 2014-05-23 20:21

Sounds like it is time to remind people that the overwhelming consensus was that a 5x improvement in speed, 16 cogs and 256k of Hub RAM was all that was required of the next chip (whatever it is now called).

About that chip, Chip said ...

cgracey wrote: »

This would be really easy to make happen and the chance of latent bugs would be very low.

All this other new stuff is really great to think about for some future Propeller - but it is most likely years away from becoming a reality.

Can we please just have the chip that most people wanted before we get lost in all this complexity all over again?

Ross.

koehler · 2014-05-23 21:12

RossH wrote: »

Sounds like it is time to remind people that the overwhelming consensus was that a 5x improvement in speed, 16 cogs and 256k of Hub RAM was all that was required of the next chip (whatever it is now called).

About that chip, Chip said ...

All this other new stuff is really great to think about for some future Propeller - but it is most likely years away from becoming a reality.

Can we please just have the chip that most people wanted before we get lost in all this complexity all over again?

Ross.

I have to do something I haven't done in a while, and agree with you.

Chip will probably have an FPGA bin out next week for everyone to start working on though.

It would have been great if Ken could have interceded several weeks ago, and reminded Chip of his statement that Parallax wanted something useable in a shorter rather than longer time frame, and that +'s could be addressed as future, incremental revisions.

We are where we are though, so if things seem to be working acceptably then we should be ok; if not, then hopefully the business reality* can be looked at a bit more seriously.

*And I have no way of knowing that Parallax doesn't in fact have more than enough resources to continue on with development.
I guess since I haven't seen Ken express that explicitly, that may not be an issue at this time.

I guess we're all good then!

RossH · 2014-05-23 21:31

koehler wrote: »

I have to do something I haven't done in a while, and agree with you.

This doesn't mean we're engaged or anything!

kwinn · 2014-05-23 21:49

Cluso99 wrote: »

Thanks Chip.

In fact, as soon as you have DJZ, DJNZ, TJZ, TJNZ, TJS, TJNS, JP and JNP D,S/@ working, then...

If you add JMP and CALL #abs/@rel where call saves the return in a fixed register (say $1EF) where there is up to 17 immediate address bits (9 will do for now) (this is the GCC LR call required version), then...

We can test the hubexec mode running from the cog ram (at full cog speed).
We don't even need the LUT as extended cog ram to test it
We don't even need the PC to be increased from 9 bits to test it

And please don't worry about implementing this until after you have done an FPGA. It doesn't matter what the FPGA is missing.

Not sure if a method to do long absolute or relative JMP/CALL instructions has been decided on, but there is a way to do them without using a dedicated stack or register. Use the S and D bits in the instruction to create an 18 bit address, which is all that is needed for the JMP. For the call a return address needs to be stored, so the assembler inserts a JMP instruction at the beginning of every subroutine, and the CALL instruction saves the PC+1 in that call instruction and starts executing the subroutine at the next address. The subroutine returns by jumping to that first JMP.

Somewhat like the way it is done with the P1, but with the destination and location of the return address specified by one 18 bit address.
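
A small Python sketch of that call/return idea, under stated assumptions: the instruction names, the tuple encoding, and the `run` interpreter are invented for illustration, and list indices stand in for the 18-bit addresses.

```python
def run(mem, max_steps=1000):
    """Tiny interpreter: CALL patches the JMP slot at the subroutine entry."""
    pc, acc = 0, 0
    for _ in range(max_steps):
        op, arg = mem[pc]
        if op == "HALT":
            return acc
        if op == "ADD":
            acc += arg
            pc += 1
        elif op == "JMP":
            pc = arg
        elif op == "CALL":
            mem[arg] = ("JMP", pc + 1)   # store return address in the slot
            pc = arg + 1                 # start executing just past the slot
    raise RuntimeError("step limit reached")

program = [
    ("CALL", 4),   # 0: first call
    ("CALL", 4),   # 1: second call
    ("HALT", 0),   # 2
    ("HALT", 0),   # 3: padding
    ("JMP", 0),    # 4: return slot the assembler inserts at the entry
    ("ADD", 5),    # 5: subroutine body
    ("JMP", 4),    # 6: return by jumping to the slot
]
result = run(program)
```

Each CALL overwrites the slot at address 4, so the subroutine returns to whichever caller invoked it last. As with the P1 scheme, this is not re-entrant.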

kwinn · 2014-05-23 22:04

+1 Such jumps or calls should have a relative version.

Cluso99 wrote: »

IMHO, irrespective of anything else...

DJNZ & friends should be relative !

JMPRET should at least have a relative mode (for the S = goto address anyway). There is an argument both ways for the D return address.

This is important for relocatable code. It should have been in P1, but Chip didn't conceive code running anywhere but cog.

Cluso99 · 2014-05-24 02:34

kwinn,
These instructions (and many others) were already implemented on the old P2 design, so the instruction bitmaps were solved.

cgracey · 2014-05-24 04:37

I think I might have the cog FIFO done that talks to the egg-beater hub memory. This has been a real pain to put together. It's not complicated, just new concepts that I haven't implemented before.

David Betz · 2014-05-24 04:45

cgracey wrote: »

I think I might have the cog FIFO done that talks to the egg-beater hub memory. This has been a real pain to put together. It's not complicated, just new concepts that I haven't implemented before.

Congratulations! Great progress! Maybe I should start dusting off my DE2-115 board... :-)

Cluso99 · 2014-05-24 05:48

cgracey wrote: »

I think I might have the cog FIFO done that talks to the egg-beater hub memory. This has been a real pain to put together. It's not complicated, just new concepts that I haven't implemented before.

Great news Chip. Anxiously looking forward to the FPGA binary.
Any ideas how many cogs may fit a DE0 yet (guess)?

Timing might just be right for my Korea trip - off to see our daughter and her husband and new twin boys. I will be there for 5 weeks

Peter Jakacki · 2014-05-24 06:14

cgracey wrote: »

I think I might have the cog FIFO done that talks to the egg-beater hub memory. This has been a real pain to put together. It's not complicated, just new concepts that I haven't implemented before.

Well done ol "chip", have DE2 sitting on the bench, shall I warm it up?

jazzed · 2014-05-24 12:57

Cluso99 wrote: »

Timing might just be right for my Korea trip - off to see our daughter and her husband and new twin boys. I will be there for 5 weeks

Twins? Congratulations!

cgracey · 2014-05-24 13:05

Peter Jakacki wrote: »

Well done ol "chip", have DE2 sitting on the bench, shall I warm it up?

Not yet. Today I will be doing a reality check on the FIFO. This has been really hard to think about and design. Now, I need to make sure it all makes sense. The last time I had such tedious work to do was when I made INDA/INDB work with multitasking across the pipeline stages, where odd cancellation patterns could occur. Once done, these things just work, of course, but getting there just about kills me, sometimes.

Once this seems done, I'll integrate it into the cog.

koehler · 2014-05-24 13:59

Just curious.

Any idea on an approximate transistor number for the current design?
I'm not familiar with conversion from LUT to transistor, however we seem to have:

16 Cores
2.5 Core of Hub
512KB of HRAM, not sure if it is already counted in 2.5 Core or not, or if it is 4T, 6T RAM
Smart Pins

One thing that actually woke me up last night was the fact that I never hear any talk of a switching fabric in the Prop.
Or, maybe it is, and I just don't recognize the terminology. Seems like the majority of the P2 is a switching fabric, with some
Cores on it. The Hub is obviously the main part of that, however all the other things like smartpins ?

If there are any big bumps in the near future, I'll probably ask this again as my subconscious seemed to think it was something...

And, since its Bar-B-Q weekend, here's a great blast from the past.
See what, where and how -your- predictions made out for the new P2....

evanh · 2014-05-24 14:53

koehler wrote: »

One thing that actually woke me up last night was the fact that I never hear any talk of a switching fabric in the Prop.
Or, maybe it is, and I just don't recognize the terminology. Seems like the majority of the P2 is a switching fabric, with some
Cores on it. The Hub is obviously the main part of that, however all the other things like smartpins ?

That's because the Prop1 is pretty simple. There is a basic DMA engine running the Hub that selects one Cog at a time, doing either a 32 bit read or write. It takes 16 system clock ticks for the Hub to cycle through all Cogs before looping back. It never skips any Cogs so instruction counting can be used to predict when the next service will occur. When a Cog executes a hub access instruction the Cog is stalled until the Hub has serviced the Cog.
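
To make the instruction-counting point concrete, here is a minimal Python sketch of the round-robin timing. It assumes the Prop1's 8 cogs share the 16-clock rotation as one slot every 2 clocks, and that cog N's slot starts at clock 2*N; that phase offset and the function name are assumptions for illustration.

```python
HUB_PERIOD = 16                  # clocks for the hub to visit every cog
NUM_COGS = 8                     # Prop1 cog count
SLOT = HUB_PERIOD // NUM_COGS    # clocks between consecutive cogs' slots

def clocks_until_hub_window(clock, cog):
    """Clocks a cog must stall before its next hub slot opens."""
    slot_start = cog * SLOT      # assumed phase of this cog's slot
    return (slot_start - clock) % HUB_PERIOD

# Worst case: a cog that has just missed its slot waits almost a full rotation.
worst = max(clocks_until_hub_window(t, 0) for t in range(HUB_PERIOD))
```

Because the rotation never skips a cog, the stall depends only on the clock count modulo 16, which is what makes it predictable from instruction counting.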

The latest Prop2 is pretty close to a full blown 16x16 crosspoint switch with each port containing something in the order of 50 signals. A particular feature/restriction of the switch implementation being all Cogs are kept phase locked at all times in order to provide equal timing for all Cogs and keep the arbitration sane. Conceptually, this is similar to 16 Hubs all servicing each Cog consecutively. So, not unlike 16 Prop1's. It'll be harder to predict when a particular Hub address can be accessed but there is also new buffering to help prevent Cog stalls.
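
The "16 Hubs servicing each Cog consecutively" picture can be sketched in Python as a rotating mapping from cogs to memory slices. The rotation direction, the use of the low nibble of the long address to pick a slice, and the function names are assumptions for this sketch, not the actual design.

```python
NUM_SLICES = 16   # one hub memory slice per cog port

def slice_for(cog, clock):
    """Slice a cog is connected to on a given clock (assumed rotation)."""
    return (cog + clock) % NUM_SLICES

def wait_for_address(cog, clock, long_addr):
    """Clocks until the slice holding long_addr rotates around to this cog."""
    target = long_addr % NUM_SLICES   # low nibble selects the slice
    return (target - cog - clock) % NUM_SLICES

# On any one clock, all 16 cogs see different slices, so nobody collides.
slices_now = {slice_for(cog, 5) for cog in range(NUM_SLICES)}

# Over any 16 consecutive clocks, one cog sees every slice exactly once.
seen_by_cog2 = sorted(slice_for(2, t) for t in range(7, 7 + NUM_SLICES))
```

Under these assumptions a single random access can wait up to 15 clocks, while a 16-long block transfer takes 16 clocks regardless of the starting phase.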

David Betz · 2014-05-24 19:25

evanh wrote: »

That's because the Prop1 is pretty simple....

That's one of the reasons I'm looking forward to Parallax releasing the Verilog code for P1. I think the P1 COG will be an interesting building block to experiment with different multi-core architectures.

kwinn · 2014-05-25 23:11

jazzed wrote: »

Twins? Congratulations!

Why congrats to Cluso99? It's the daughter and son in law that deserve the credit ;-)

Cluso99 · 2014-05-25 23:56

kwinn wrote: »

Why congrats to Cluso99? It's the daughter and son in law that deserve the credit ;-)

IIRC Steve's daughter has twins too.

jazzed · 2014-05-26 00:43

kwinn wrote: »

Why congrats to Cluso99? It's the daughter and son in law that deserve the credit ;-)

Congratulations in that he will have even more joy sharing as a grandfather (3 times now?). That is joy without mandatory full-time commitment as a parent day after day.

May sound odd to some, but drive-by grandfathering is a great pleasure.

Cluso99 · 2014-05-26 01:17

jazzed wrote: »

Congratulations in that he will have even more joy sharing as a grandfather (3 times now?). That is joy without mandatory full-time commitment as a parent day after day.

May sound odd to some, but drive-by grandfathering is a great pleasure.

How true. And you can hand them back when you get too tired of them - haha.
I now have 6 grandchildren, all boys except one. That is 2 each to my kids.

David Betz · 2014-05-26 07:44

jazzed wrote: »

Congratulations in that he will have even more joy sharing as a grandfather (3 times now?). That is joy without mandatory full-time commitment as a parent day after day.

May sound odd to some, but drive-by grandfathering is a great pleasure.

I think you're a bit more than a drive-by grandfather! :-)

kwinn · 2014-05-26 17:33

jazzed wrote: »

Congratulations in that he will have even more joy sharing as a grandfather (3 times now?). That is joy without mandatory full-time commitment as a parent day after day.

May sound odd to some, but drive-by grandfathering is a great pleasure.

Ah, good point. It is nice to have fun with kids and then be able to return them to their parents when they get cranky or you get tired.

Congrats Cluso99.

evanh · 2014-06-14 16:17

Chip, a question: Presumably the FIFOs are single ported. I'm going to guess that any Cog instruction that accesses its FIFO while the hub is accessing it will stall the Cog, right?

I ask, trying not to be too greedy, because a Cog working on the FIFO contents concurrently would seem the most streamlined ... Double buffering anyone? 2x8 FIFO per Cog maybe? /me ducks.

jmg · 2014-06-14 17:20

evanh wrote: »

Presumably the FIFOs are single ported. I'm going to guess that any Cog instruction that accesses its FIFO while the hub is accessing it will stall the Cog, right?

I ask, trying not to be too greedy, because a Cog working on the FIFO contents concurrently would seem the most streamlined ... Double buffering anyone?

I'm not following the angle here ?
FIFOs are pretty much always dual ported - a Write side and a Read side.
They can be implemented as either Dual port RAM based with Wr and Rd pointers, or as chained registers with Muxes.

Dual port RAM maps better onto FPGA resource (and probably also ASIC resource), as it uses a compilable core element that should trigger a memory generator. (I think Chip was using chained registers before the rewrite.)
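
For the RAM-plus-pointers form, a minimal Python sketch: the class name, the depth of 16, and the occupancy counter are assumptions for illustration, not Chip's implementation. The write side only touches `wr` and the read side only touches `rd`, which is what lets the two ports work independently.

```python
class PointerFifo:
    """FIFO built from a RAM array with separate write and read pointers."""

    def __init__(self, depth=16):
        self.ram = [0] * depth
        self.depth = depth
        self.wr = 0       # write-side pointer
        self.rd = 0       # read-side pointer
        self.count = 0    # occupancy, used for the full/empty flags

    def write(self, value):
        """Write port: returns False (writer must stall) when full."""
        if self.count == self.depth:
            return False
        self.ram[self.wr] = value
        self.wr = (self.wr + 1) % self.depth
        self.count += 1
        return True

    def read(self):
        """Read port: returns None (reader must stall) when empty."""
        if self.count == 0:
            return None
        value = self.ram[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return value

fifo = PointerFifo(depth=4)
accepted = [fifo.write(v) for v in (1, 2, 3, 4, 5)]   # fifth write finds it full
first = fifo.read()
```

In hardware both ports can be active in the same clock; this sequential model only shows the pointer bookkeeping.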

The Block transfer opcode will stall the COG, as it needs 16 SysClks to fill the fifo, make the transfer, no matter what the starting phase.
Slower DMA paced reads (SysClk/N) allow highspeed FIFO streaming for video use & IIRC there can be a direct opcode that does not use the FIFO, but those opcodes will need to wait for phase match on nibble LSN.
