New Hub Scheme For Next Chip

Phil Pilgrim (PhiPi) · 2014-05-21 08:41

My ears are starting to ring with the echoes of creeping featuritis, with the same people pushing for more, more, more. I fear that the FIFO is making things way too complicated. Why not just adjust the hub's address "firing order" to optimize software-mediated streaming and call it good? And forget hubexec; LMM is good enough, especially with the new hub bandwidth. Otherwise, this thing will never get finished. Sorry, Chip, but it's deja vu all over again.

-Phil

dMajo · 2014-05-21 08:42

cgracey wrote: »

It loops through the whole memory right now, but it could be made to wrap in a limited area. That's a great idea you have! That way, you could output to four 8-bit DACs a loop of longs at up to 200MHz. If we had one more control bit somewhere, we could make the buffer switchable in position, so that we could write one buffer while we output the other.

Yes, that would be nice and would facilitate many things. I am not an expert at video generation, so I do not know whether it could be used, e.g., for scanlines, parts of them, and so on.
I will hardly use the video features unless drivers written by others are available, and even then only for a simple text HMI (preferably VGA and S-Video, which can be merged externally into composite if needed, without any graphics). So while others target this part of the development at video, I am trying to understand how I can benefit from it for other purposes.

I even thought initially (for non-video applications) that the same buffer could be FIFO-read/streamed by one cog while another cog FIFO-writes it, by carefully syncing the timing and the cog order, that is, the order in which the hub windows pass in front of the cogs, so that a newly written value is available for the next cog to read. If this runs below full speed (not fsys/1), both cogs have time to do useful work in the unused FIFO slots, hence my question about the circular option, even with a single buffer.
In a way this can be seen as hidden tasking, a transfer offload.
A switchable buffer would perhaps allow this at fsys as well, although I think not continuously because, as I understand it, there would then be no hub slots available: they would all be used by the FIFO. But apart from video there are comms and signal/pattern generation/acquisition, which can work in bursts and don't need a continuous data rate.
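
The wrap-around and switchable-buffer idea discussed above can be sketched in a few lines. This is only an illustration of the addressing arithmetic, assuming addresses counted in longs; the function names and the single select bit are hypothetical and do not describe the actual hardware.

```python
def wrapped_address(base, length, index):
    """Address (in longs) of the index-th streamed long in a buffer that wraps."""
    return base + (index % length)


def stream_addresses(base, length, count):
    """Addresses visited when streaming `count` longs from a wrapping buffer."""
    return [wrapped_address(base, length, i) for i in range(count)]


def double_buffer_bases(base, length, select):
    """Hypothetical control bit: pick which of two adjacent buffers is
    streamed out; the other one is free to be written meanwhile."""
    out_base = base + (select & 1) * length
    fill_base = base + ((select & 1) ^ 1) * length
    return out_base, fill_base


if __name__ == "__main__":
    # Stream six longs from a four-long loop starting at long address 0x100.
    print([hex(a) for a in stream_addresses(0x100, 4, 6)])
    # Flip the select bit to swap the output and fill buffers.
    print(double_buffer_bases(0x100, 4, 0), double_buffer_bases(0x100, 4, 1))
```

With a scheme like this, software would fill one buffer while the other is being output, then flip the select bit at a convenient moment such as the wrap point.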

dMajo · 2014-05-21 09:30

Phil Pilgrim (PhiPi) wrote: »

My ears are starting to ring with the echoes of creeping featuritis, with the same people pushing for more, more, more. I fear that the FIFO is making things way too complicated. Why not just adjust the hub's address "firing order" to optimize software-mediated streaming and call it good? And forget hubexec; LMM is good enough, especially with the new hub bandwidth. Otherwise, this thing will never get finished. Sorry, Chip, but it's deja vu all over again.

-Phil

I hope it is not targeted at me ... perhaps I shouldn't reply ... as I have never pushed or asked for a feature. My questions are intended only to understand what is going on and how it can be used.

I am interested in uses in the industrial field: power controls, process controls, closed loops, measurement instrumentation and domotics (home/building automation). So, besides a small LCD (sometimes), the main things are determinism, number of IOs, good (internal) ADC/DAC (as external ones involve other ICs, communication with them ...), and good math.
I agree with Heater that this chip is for the real world, and that for bigger/OS uses it is better paired with, e.g., a Raspberry Pi. I currently use Tibbo modules for the Ethernet/WiFi HMI interface to it; I even program (firmware update) the P1 from the Tibbo, and I have never thought of doing Ethernet comms from the Prop. I simply think there are specific devices for specific jobs.

I will use whatever is built in if I find it suitable; otherwise I'll go with other uCs like I do now. I am not afraid to say that I also use, e.g., PICAXEs for very simple tasks: simple to program, low-cost, free development environment, and if the thing then grows (they have some limits) I switch to the equivalent PIC. I look at the Prop (and the upcoming big brother) in the same way: easy to program, free tools, higher cost, but that is compensated by the greater project scope and performance/complexity it allows. The on-chip flash will still be missed, but the ADC/DAC is a great addition for both simplicity and cost reduction. I still need to understand what the built-in pin comparator features will be; that interests me too.

Todd Marshall · 2014-05-21 11:51

Phil Pilgrim (PhiPi) wrote: »

My ears are starting to ring with the echoes of creeping featuritis, with the same people pushing for more, more, more. I fear that the FIFO is making things way too complicated. Why not just adjust the hub's address "firing order" to optimize software-mediated streaming and call it good? And forget hubexec; LMM is good enough, especially with the new hub bandwidth. Otherwise, this thing will never get finished. Sorry, Chip, but it's deja vu all over again.

-Phil

A very big thumbs up.

Heater. · 2014-05-21 12:25

Yep,

Skip all this horrible FIFO, hubexec, tortuous, messing around. Video is neat but anyone who wants video has much better options that the Propeller can never match.

I'm convinced the P1 is a beautiful and balanced design.

I'm also convinced it does not scale. Going to 16 processors brings us smack up against Amdahl's law. Going to much bigger RAM is great but the balance is lost because the performance is not there.

These issues are well known and have been studied by generations of comp. sci. types chasing the holy grail of parallel processing.

But, what we have in the Propeller concept is freedom from interrupts, the ease of mixing and matching code from here and there, the flexibility of the pin usage. A gloriously simple way to program in PASM.

Can't the P2 fly with "just" that?

David Betz · 2014-05-21 12:41

Heater. wrote: »

Yep,

Skip all this horrible FIFO, hubexec, tortuous, messing around. Video is neat but anyone who wants video has much better options that the Propeller can never match.

I'm convinced the P1 is a beautiful and balanced design.

I'm also convinced it does not scale. Going to 16 processors brings us smack up against Amdahl's law. Going to much bigger RAM is great but the balance is lost because the performance is not there.

These issues are well known and have been studied by generations of comp. sci. types chasing the holy grail of parallel processing.

But, what we have in the Propeller concept is freedom from interrupts, the ease of mixing and matching code from here and there, the flexibility of the pin usage. A gloriously simple way to program in PASM.

Can't the P2 fly with "just" that?

I think it would certainly fly with the P1 crowd. Whether it could attract additional users isn't clear. Of course, that isn't clear even with hub exec, fifos, etc. We'll only know when the chip actually exists. I still think it is unlikely that many will adopt either Propeller unless Parallax comes up with well tested and documented "soft peripheral" objects that are distributed outside of the OBEX context and officially branded as Parallax supported.

Bill Henning · 2014-05-21 12:43

FYI,

All numbers assume a 200 MHz system clock, two clocks per instruction, 16 cogs, and the information on the instructions/FIFO as Chip has posted it.

LMM on the chip Chip is currently discussing (with FIFO) would run at between 15MIPS and 24MIPS (24MIPS assumes no spacer instructions are needed after RDFLONG, which I believe is unrealistic. 15MIPS is with two spacers.)

LMM without FIFO on the currently discussed chip would run at 12.5MIPS

hubexec with FIFO would run at > 50MIPS

hubexec without FIFO would run at 12.5MIPS with 32 bit hub bus, up to 50MIPS with previous 128 bit bus

Anyone that thinks potential customers won't care about a 2x~4x C performance differential is naive.

FYI, a $1.50 LPC1114 runs 32 bit code at 48MIPS. It would look terrible if the P16X64 was slower at running C code.
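
As a rough check on the figures above, the sketch below recomputes the ones that follow directly from the stated assumptions and the speed ratios between the quoted numbers. Reading 12.5MIPS as one instruction fetched per 16-clock hub rotation is an interpretation, not something the post states; the 15, 24 and 50 MIPS values are simply copied from the post, not derived.

```python
# Assumptions taken from the post: 200 MHz system clock,
# two clocks per instruction, 16 cogs.
SYSCLK_MHZ = 200
CLOCKS_PER_INSTRUCTION = 2
COGS = 16

# Figures quoted in the post (MIPS); copied, not derived here.
LMM_WITH_FIFO_BEST = 24
LMM_WITH_FIFO_TWO_SPACERS = 15
HUBEXEC_WITH_FIFO_MIN = 50
LPC1114_MIPS = 48


def native_cog_mips():
    """Instruction rate when executing from cog memory."""
    return SYSCLK_MHZ / CLOCKS_PER_INSTRUCTION


def one_fetch_per_rotation_mips():
    """Interpretation (not from the post): one instruction fetched per
    full hub rotation of COGS clocks."""
    return SYSCLK_MHZ / COGS


def speedup(fast, slow):
    """How many times faster `fast` is than `slow`."""
    return fast / slow


if __name__ == "__main__":
    print("native cog MIPS:", native_cog_mips())
    print("one fetch per rotation MIPS:", one_fetch_per_rotation_mips())
    print("hubexec+FIFO vs LMM without FIFO:",
          speedup(HUBEXEC_WITH_FIFO_MIN, one_fetch_per_rotation_mips()))
    print("hubexec+FIFO vs best-case LMM with FIFO:",
          round(speedup(HUBEXEC_WITH_FIFO_MIN, LMM_WITH_FIFO_BEST), 2))
```

Under these assumptions the ratios come out at roughly 2x and 4x, which is where the quoted "2x~4x" range would come from.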

David Betz · 2014-05-21 12:45

Bill Henning wrote: »

Anyone that thinks potential customers won't care about a ~4x C performance differential is naive.

That presumes that customers are actually using C and not Spin or PASM. Ken has said many times that most commercial customers use Spin. If education is the main user of C then maybe performance isn't as big an issue. Parallax needs to decide these priorities and I'm sure they will.

Bill Henning · 2014-05-21 12:51

Chip wants large PASM programs, and has said next Spin will generate hubexec (or LMM code).

I think 2x-4x speed difference for large programs - regardless of the language - is a big deal.

If Parallax wants to stay just with the educational market, speed does not matter as much, but it will limit Parallax pretty much to the niche it is in, and it is unlikely they will break in to new markets.

The direction Parallax wants to go is obviously up to Parallax.

David Betz wrote: »

That presumes that customers are actually using C and not Spin or PASM. Ken has said many times that most commercial customers use Spin. If education is the main user of C then maybe performance isn't as big an issue. Parallax needs to decide these priorities and I'm sure they will.

Roy Eltham · 2014-05-21 12:51

Bill,
What about hubexec with the rdbloc mechanism, but without FIFO? My understanding is that we still have rd/wrbloc, unless I missed something.

Kerry S · 2014-05-21 12:55

The programming language is not the root issue here. The lack of usable CORE memory is what causes us to have to do all kinds of things to use this external shared ram to get enough space to run the programs that we need/want to run. All the while losing performance from what the chip SHOULD be able to deliver.

The Propeller still scales great, we are just trying to scale the WRONG thing.

Each CORE needs to get 32K of ram with maybe 32K of shared HUB ram that is only needed for data sharing.

No HUBEXEC, no LMM, no complex FIFO, no ever more convoluted hub data access methods.

Simple direct solution to the issue - the ability to write programs large enough to take full advantage of the new CORE and I/O performance features.

Bill Henning · 2014-05-21 12:57

Excellent question.

Without a really fancy gcc backend: 12.5MIPS

With a fancy VLIW style optimizing (read very expensive to do) gcc backend: 25MIPS-40MIPS best educated guess (due to "wheel sync" delays, very expensive (in cycles) jmp/call/ret primitives, and unused instructions in blocks of 16)

Roy Eltham wrote: »

Bill,
What about hubexec with the rdbloc mechanism, but without fifo. My understanding is that we still have rd/wrbloc, unless I missed something.

David Betz · 2014-05-21 12:59

Kerry S wrote: »

The programming language is not the root issue here. The lack of usable CORE memory is what causes us to have to do all kinds of things to use this external shared ram to get enough space to run the programs that we need/want to run. All the while losing performance from what the chip SHOULD be able to deliver.

The Propeller still scales great, we are just trying to scale the WRONG thing.

Each CORE needs to get 32K of ram with maybe 32K of shared HUB ram that is only needed for data sharing.

No HUBEXEC, no LMM, no complex FIFO, no ever more convoluted hub data access methods.

Simple direct solution to the issue - the ability to write programs large enough to take full advantage of the new CORE and I/O performance features.

32K is not enough memory. Look at what the education people had to do to shoehorn C into 32K. Even with CMM they had to create "simple libraries" that aren't standard and that limit the ability to port code from other processors to the Propeller. CMM wouldn't even be possible if you gave 32K to each COG and had only 32K of shared hub memory. The Propeller instruction set isn't designed to use instruction memory very efficiently, the way ARM Thumb mode does. Running C, or probably even Spin, programs of any size is going to require bigger memory.

Bill Henning · 2014-05-21 13:00

Cog dual port memory takes twice as many transistors as hub memory, so the 512KB of hub is equivalent to 256KB of cog memory.

Therefore your hypothetical 32KB per cog would simply not fit.

16KB per cog would probably fit.

Note customers have complained about the maximum size of C code on P1 (32KB) so they would not be happy with 16KB or 32KB.

I agree 16KB would be great for pasm programming, however the cog architecture would have to change greatly to be able to use even 16KB, and given the resistance to minor changes, that will never fly.
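
The sizing argument above is simple arithmetic, sketched below. It takes the 2x transistor cost of dual-port memory from the post at face value and assumes, for illustration only, that the whole 512KB hub budget would be converted to cog memory with nothing left over for shared hub RAM; the function names are hypothetical.

```python
HUB_KB = 512          # hub memory size discussed in the thread
COGS = 16
DUAL_PORT_COST = 2    # per the post: dual-port costs twice the transistors


def max_cog_kb_per_cog(hub_kb=HUB_KB, cogs=COGS, cost=DUAL_PORT_COST):
    """Cog memory per cog if the entire hub budget became dual-port cog RAM."""
    return hub_kb / cost / cogs


def hub_equivalent_kb(cog_kb_per_cog, cogs=COGS, cost=DUAL_PORT_COST):
    """Hub-memory-equivalent budget needed for a given cog memory size."""
    return cog_kb_per_cog * cogs * cost


if __name__ == "__main__":
    print("max KB per cog:", max_cog_kb_per_cog())
    print("hub-equivalent KB for 32KB per cog:", hub_equivalent_kb(32))
    print("hub-equivalent KB for 16KB per cog:", hub_equivalent_kb(16))
```

On these numbers 32KB per cog would need twice the available budget, while 16KB per cog would consume all of it.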

Kerry S wrote: »

The programming language is not the root issue here. The lack of usable CORE memory is what causes us to have to do all kinds of things to use this external shared ram to get enough space to run the programs that we need/want to run. All the while losing performance from what the chip SHOULD be able to deliver.

The Propeller still scales great, we are just trying to scale the WRONG thing.

Each CORE needs to get 32K of ram with maybe 32K of shared HUB ram that is only needed for data sharing.

No HUBEXEC, no LMM, no complex FIFO, no ever more convoluted hub data access methods.

Simple direct solution to the issue - the ability to write programs large enough to take full advantage of the new CORE and I/O performance features.

Phil Pilgrim (PhiPi) · 2014-05-21 13:11

heater wrote:

I'm convinced the P1 is a beautiful and balanced design. I'm also convinced it does not scale.

+1 for both sentiments. The P1 is orthogonal, simple, and exquisitely self-contained, but that's also its Achilles heel when it comes to where to go from here. Perhaps the best way to scale the P1 would be to put several of them on the same die with communication channels between them. I still think the new hub scheme as first introduced here is a great idea. But its elegance has become compromised by the layers of stuff being added to it.

-Phil

Kerry S · 2014-05-21 13:13

David Betz wrote: »

32K is not enough memory. Look at what the education people had to do to shoehorn C into 32K. Even with CMM they had to create "simple libraries" that aren't standard and that limit the ability to port code from other processors to the Propeller. CMM wouldn't even be possible if you gave 32K to each COG and had only 32K of shared hub memory. The Propeller instruction set isn't designed to use instruction memory very efficiently, the way ARM Thumb mode does. Running C, or probably even Spin, programs of any size is going to require bigger memory.

Yes but they were trying to fit EIGHT cores worth of programs into 32K, not ONE core and are FORCED to use some kind of interpreter to get around the core memory limits.

OK, we have a 512K hub and 16 cores. That means we now have 32K per core to use, on average, if the programs grow beyond the cog's register limit.

Yes, a big HUB would allow you to have a couple of 'big' programs. I get that, and it would be great. The problem is that we are killing the performance of this chip trying to get there.

Part of this is that I still don't get why we are still designing the chip so that it HAS to use interpreters due to the core memory limits. If we can compile to direct PASM and run native, 32K is a lot of space! Just because we did it with the P1, because there was no other option, is not a good reason to give this chip the same problems. Isn't much of the reason for designing the new chip to make improvements based on what was learned from the P1?

David Betz · 2014-05-21 13:16

Kerry S wrote: »

Yes but they were trying to fit EIGHT cores worth of programs into 32K, not ONE core and are FORCED to use some kind of interpreter to get around the core memory limits.

I'm not sure this is true. I think they were trying to fit one large main program and a bunch of small drivers into 32K. I doubt very much that the code was evenly distributed over all 8 COGs.

Kerry S · 2014-05-21 13:18

Bill Henning wrote: »

Cog dual port memory takes twice as many transistors as hub memory, so the 512KB of hub is equivalent to 256KB of cog memory.

Therefore your hypothetical 32KB per cog would simply not fit.

16KB per cog would probably fit.

Note customers have complained about the maximum size of C code on P1 (32KB) so they would not be happy with 16KB or 32KB.

I agree 16KB would be great for pasm programming, however the cog architecture would have to change greatly to be able to use even 16KB, and given the resistance to minor changes, that will never fly.

Bill,

Why is cog memory dual port? Is it to facilitate the hub system? If that is the case, then why not just put in a small dual-port buffer as the temporary storage location for reads/writes to hub? Then cog memory could be standard and fit much more.

Just asking because I would really like to know.

Also, the reason that 32K on the P1 is killer is because it gets eaten up by an interpreter to manage the hub<>cog memory movement for the program code. If the cog memory is large enough for native programs (PASM) then you don't need one. Would that not be a better solution?

Kerry S · 2014-05-21 13:20

David Betz wrote: »

I'm not sure this is true. I think they were trying to fit one large main program and a bunch of small drivers into 32K. I doubt very much that the code was evenly distributed over all 8 COGs.

That makes sense. However, isn't the 32K issue less about the actual size of the program they wanted to run and more about the NEED for a huge CMM program to handle the hub<>cog memory juggling in order to run a bigger program?

If they could have just compiled their desired program into PASM and loaded it directly into the core would they not have had a lot more room for the real code?

Phil Pilgrim (PhiPi) · 2014-05-21 13:20

Kerry S wrote:

Why is cog memory dual port?

It's to facilitate the execution pipeline. One instruction can be reading a destination register, at the same time the next instruction is being fetched. This doubles the execution speed per system clock compared to the P1.

-Phil

Heater. · 2014-05-21 13:21

Bill,

a $1.50 LPC1114 runs 32 bit code at 48MIPS. It would look terrible if the P16X64 was slower at running C code.

Would it?
Your worst case scenario is: "LMM without FIFO on the currently discussed chip would run at 12.5MIPS".

That is 200MIPS across 16 cores. Not bad. Especially as we can also have cores running "native" in-COG code much faster. Especially as we don't have 15 levels of interrupts dragging our real MIPS down from that theoretical 48MIPS of the LPCXpresso. A "slow" multi-core Propeller can do things a 700MIPS ARM can only dream of.

On top of that, I find it hard to believe that a FIFO can improve much on the random nature of C code execution and data access. It might gain a bit but can the compilers use it?

David Betz · 2014-05-21 13:24

Heater. wrote: »

On top of that, I find it hard to believe that a FIFO can improve much on the random nature of C code execution and data access. It might gain a bit but can the compilers use it?

The FIFO should help with LMM and CMM instruction fetch, shouldn't it?

jmg · 2014-05-21 13:26

cgracey wrote: »

I think direct read and writes need to yield to FIFO activity, and use slots the FIFO skips. For software FIFO activity, this is no problem, but for hardware streaming, this could introduce delays. Is this okay?

I think the FIFO flow decisions need to change depending on SW-FIFO or HW-FIFO.
So all discussions should make it clear if the use case is SW or HW

Default for HW-FIFO, which usually streams to Pins or DACS with/without LUT, needs to be hard real time.
That means HW-FIFO gets slots first.
If the design is burst-fed, as most video is, then there are block move windows at blanking times, and there will be a reduced number of SW slots during the active scan.
Those could use large or small block moves, or even SW-FIFO provided the user knows they have time to complete before they restart the HW-FIFO.

I think the continual-LUT scan case, of Waveform generation, will use COG Cycles for LUT, but not need any Rotate Slot cycles.

SW-FIFOs probably need the other action, where direct writes win over SW-FIFO.
An alternative is to make the choice configurable, with a default set for each of HW-FIFO and SW-FIFO.
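
The priority rules proposed above can be written as a small decision function. This is a minimal sketch of the idea only, assuming a single hub slot is contested by the FIFO and one direct read/write; the names `FifoMode`, `DEFAULT_FIFO_FIRST` and `grant_slot` are hypothetical and say nothing about how the logic would actually be built.

```python
from enum import Enum


class FifoMode(Enum):
    HW = "hw"   # hardware streaming to pins/DACs: hard real time
    SW = "sw"   # software-paced FIFO reads/writes


# Proposed defaults: HW-FIFO gets slots first, SW-FIFO yields to direct access.
DEFAULT_FIFO_FIRST = {FifoMode.HW: True, FifoMode.SW: False}


def grant_slot(mode, fifo_wants, direct_wants, fifo_first=None):
    """Decide who gets the current hub slot.

    `fifo_first` overrides the per-mode default, modelling the
    'make the choice configurable' alternative.
    Returns "fifo", "direct", or None if nobody wants the slot.
    """
    if fifo_first is None:
        fifo_first = DEFAULT_FIFO_FIRST[mode]
    if fifo_wants and direct_wants:
        return "fifo" if fifo_first else "direct"
    if fifo_wants:
        return "fifo"
    if direct_wants:
        return "direct"
    return None


if __name__ == "__main__":
    print(grant_slot(FifoMode.HW, True, True))    # contested slot, HW mode
    print(grant_slot(FifoMode.SW, True, True))    # contested slot, SW mode
    print(grant_slot(FifoMode.HW, False, True))   # FIFO skips, direct uses it
```

Either party takes a slot the other skips; only contested slots are affected by the priority setting.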

David Betz · 2014-05-21 13:28

What are the HW FIFO use cases? Is the FIFO essentially a generalization of the video FIFO in P1? Would it be used to stream out video to the DACs?

Bill Henning · 2014-05-21 13:29

Kerry S wrote: »

Bill,

Why is cog memory dual port? Is it to facilitate the hub system? If that is the case, then why not just put in a small dual-port buffer as the temporary storage location for reads/writes to hub? Then cog memory could be standard and fit much more.

Just asking because I would really like to know.

On the P1 it is four-ported; it was also four-ported in the P2 design.

For the P16X64 Chip went to two-cycle instructions, and dual ports.

To execute an instruction, four cog memory accesses are required:

- read instruction
- read source value
- read destination value
- write result
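
The link between port count and clocks per instruction can be shown with a line of arithmetic. The sketch below assumes each port performs one access per clock and ignores any pipelining overlap between instructions, which is a simplification; the function names are hypothetical.

```python
# The four cog memory accesses listed above.
ACCESSES = [
    "read instruction",
    "read source value",
    "read destination value",
    "write result",
]


def access_slots(ports, clocks_per_instruction):
    """Memory accesses available to one instruction (one access per port per clock)."""
    return ports * clocks_per_instruction


def min_clocks(ports, accesses=len(ACCESSES)):
    """Fewest clocks needed to fit all accesses, using ceiling division."""
    return -(-accesses // ports)


if __name__ == "__main__":
    print("dual port, two clocks:", access_slots(2, 2), "access slots")
    for ports in (1, 2, 4):
        print(ports, "port(s) ->", min_clocks(ports), "clock(s) per instruction")
```

Under this simplification, two ports and two clocks give exactly the four access slots an instruction needs.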

Kerry S wrote: »

Also, the reason that 32K on the P1 is killer is because it gets eaten up by an interpreter to manage the hub<>cog memory movement for the program code. If the cog memory is large enough for native programs (PASM) then you don't need one. Would that not be a better solution?

Sorry, incorrect.

The LMM interpreter, including the FCACHE and FLIB areas, is 2KB.

What is large is the generated C code, and all the required C libraries, and data structures.

Spin is far more memory efficient - but comes with fewer libraries etc.

You can easily go past 32KB by loading too many objects, or having large buffers.

Hope this helps.

Heater. · 2014-05-21 13:31

David.

I guess fetching a stream of instructions helps a lot when you are executing linear code.

That is not my code. It takes a turn this way or that every handful of instructions. Look at FullDuplexSerial. Can you even find four instructions in a "row" there?

My prediction is that such a FIFO only has minimal impact on execution speed.

Unless it is more than a FIFO but more like a cache and you can keep your code in cache and keep your data locality high.

Bill Henning · 2014-05-21 13:38

Heater,

I was comparing one cog to one LPC.

So if you want to say 200MIPS across 16 cores, I'll respond with 768MIPS across 16 LPC1114's.

Reason I was looking at one core at a time is for single threaded performance, not all apps can be cleanly distributed across multiple cores.

Regarding random nature of C code.

With the FIFOs, and the random RDxxx / WRxxx not disturbing them, the FIFO will help HUBEXEC greatly (it even helps LMM some, see my earlier post)

Statistically about one in eight - ten instructions will be a jump taken (or call taken)

when a jump or call is taken, we hit a FIFO reload, which on average will start feeding us longs after 8 sysclock cycles (4 instruction cycles)

then we can run about 7 instructions at the same rate as in cog memory

so 8+4 instruction times to run 8 cog instructions, 8/12 = about 67% of cog performance.

Which is why I wrote >50MIPS earlier (I was being conservative due to random hub access)

This is why I don't make arguments based on what I find hard to believe... I crunch numbers.
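
The estimate above can be reproduced directly. This is a minimal sketch using only the numbers given in the post (one taken jump or call per 8 to 10 instructions, a reload cost of 4 instruction times, 200 MHz at two clocks per instruction); the function name is hypothetical, and the model ignores random hub data accesses, which is why the post rounds down to >50MIPS.

```python
SYSCLK_MHZ = 200
CLOCKS_PER_INSTRUCTION = 2
NATIVE_MIPS = SYSCLK_MHZ / CLOCKS_PER_INSTRUCTION


def hubexec_efficiency(instr_per_taken_branch, reload_instr_times):
    """Fraction of native cog speed when every taken branch costs a FIFO reload."""
    return instr_per_taken_branch / (instr_per_taken_branch + reload_instr_times)


if __name__ == "__main__":
    for n in (8, 10):
        eff = hubexec_efficiency(n, 4)
        print(f"1 taken branch per {n} instructions: "
              f"{eff:.1%} of cog speed, about {eff * NATIVE_MIPS:.0f} MIPS")
```

Both ends of the one-in-eight to one-in-ten range stay above the 50MIPS floor quoted earlier.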

Heater. wrote: »

Bill,

Would it?
Your worst case scenario is: "LMM without FIFO on the currently discussed chip would run at 12.5MIPS".

That is 200MIPS across 16 cores. Not bad. Especially as we can also have cores running "native" in-COG code much faster. Especially as we don't have 15 levels of interrupts dragging our real MIPS down from that theoretical 48MIPS of the LPCXpresso. A "slow" multi-core Propeller can do things a 700MIPS ARM can only dream of.

On top of that, I find it hard to believe that a FIFO can improve much on the random nature of C code execution and data access. It might gain a bit but can the compilers use it?

David Betz · 2014-05-21 13:38

Heater. wrote: »

David.

I guess fetching a stream of instructions helps a lot when you are executing linear code.

That is not my code. It takes a turn this way or that every handful of instructions. Look at FullDuplexSerial. Can you even find four instructions in a "row" there?

My prediction is that such a FIFO only has minimal impact on execution speed.

Unless it is more than a FIFO but more like a cache and you can keep your code in cache and keep your data locality high.

I know that the C compiler probably won't generate long sequences of instructions, but my guess is that they will be longer than a few, so it might still benefit from the FIFO. Of course, an extra instruction is required on every branch, so maybe that will negate any benefit. You're right that it isn't likely to be a huge improvement.

Todd Marshall · 2014-05-21 13:38

Phil Pilgrim (PhiPi) wrote: »

My ears are starting to ring with the echoes of creeping featuritis, with the same people pushing for more, more, more. I fear that the FIFO is making things way too complicated. Why not just adjust the hub's address "firing order" to optimize software-mediated streaming and call it good? And forget hubexec; LMM is good enough, especially with the new hub bandwidth. Otherwise, this thing will never get finished. Sorry, Chip, but it's deja vu all over again.

-Phil

Flexible control of the "firing order" serves most of the objectives these new designs seem to be striving for. More memory in the COG and flexible control of the firing order should allow implementation in a single COG where multiple COG cooperation is currently required.

jmg · 2014-05-21 13:39

cgracey wrote: »

I've just about got the logic done for the interface between the cog, FIFO, and hub memory. It's been really challenging, even though it's not much logic.

Sounds good, this gives very good HW bandwidth, so is well worth the effort.

cgracey wrote: »

Does anyone see a strong need for separate read and write FIFOs that could operate concurrently (but not at top speeds, together)? This would be good for software reading and writing. In my experience, I usually need to input for a while, or output for a while, in which case a single FIFO, usable for either reading or writing, is adequate.

I would release a simplex FIFO first, and see if any brick-wall cases appear.
Sure, duplex is nice, but it comes at significant silicon cost, and those registers are better used in the Smart-Pins.

I see the primary FIFO use as HW-FIFO, where it solves problems SW alone cannot touch.
Next, the fast block moves will allow rapid LUT change, and even COG SW-Swaps.

SW-FIFO will be useful in cases where some small SW conditioning is needed in a streaming case - and USB seems a likely candidate here.
