HUBEXEC - Any alternatives ???
Cluso99
Chip has just mentioned that both quad access to hub and hubexec are not simple.
I wonder if there are any alternatives, simplifications, or other ways we could implement hubexec ???
It is at least worth a discussion - quite possibly there are no alternatives.
I developed a fast overlay loader for P1. While it performs better than LMM, particularly if there are any loops, I am not sure whether it could be used from an HLL (e.g. by GCC).
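For anyone who hasn't met it, LMM boils down to a tiny fetch-execute kernel in the cog that pulls one long at a time from hub and runs it, so every instruction costs a hub window. A toy C model of that idea (purely illustrative, not real P1 code, and the hub contents are just placeholders):

```c
/* Toy model of the LMM (Large Memory Model) idea, for illustration only.
 * On the real P1 this is a few PASM instructions; the point is that every
 * LMM instruction needs one hub read, which is what an overlay loader or
 * hubexec tries to avoid.
 */
#include <stdint.h>
#include <stdio.h>

#define HUB_LONGS 8

/* pretend "hub RAM" holding the LMM code image (placeholder values) */
static uint32_t hub[HUB_LONGS] = {1, 2, 3, 4, 5, 6, 7, 8};

/* stand-in for "run this fetched instruction inside the cog" */
static void execute(uint32_t instr) {
    printf("exec %08x\n", (unsigned)instr);
}

int main(void) {
    uint32_t pc = 0;                    /* hub byte address of next instruction */
    while (pc < HUB_LONGS * 4) {
        uint32_t instr = hub[pc / 4];   /* rdlong: one hub window per fetch */
        pc += 4;                        /* advance the LMM program counter  */
        execute(instr);
    }
    return 0;
}
```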
I am not a fan of caching either. I think it carries risk, and it certainly doesn't inspire confidence in me (and therefore probably not in others who want to use the chip commercially).
I don't have any concrete thoughts yet, but here is the thread to discuss ideas (and a break from slot sharing ha! ha!)
Comments
Chip has already done this, I think twice now: once for the P2, and again on the slightly different pipeline/cycle timing of the P1+.
Given it is proven, I'd think he has a good handle on how it works, and where any fat may be.
It works amazingly well. The C version of my FFT runs almost as fast as the hand-crafted PASM version, which has been tweaked and optimized by a number of forum members.
Not only that, using OpenMP propgcc can spread the inner loops of the FFT over two or four COGs for a further performance boost!
Alternatives to HUB exec? I have no idea.
I'd be inclined to chop the HUB memory into 4 areas of 128K bytes each and have 4 HUBs doing round-robin arbitration among 4 groups of 4 COGs.
BINGO, everybody is now able to run LMM or HUB exec 4 times faster (a hub window every 4 clocks instead of every 16). And there is 4 times less HUB access latency for normal COG code.
This is much simpler than any stupid HUB slot juggling mechanism and probably much faster overall.
Problems with this approach are:
1) Do we have room to run 3 extra buses from COGs to RAM and add 3 extra HUB switches?
2) How to arrange for communication between the four groups of COGs?
3) Will people bleat that they can no longer run more than 128K of code (including data) from any given COG?
Absolutely they will. There are a lot of large-data applications you have just killed, right there.
8/12/16 simple P1+ cogs (no quad hub access, no hubexec, 1:8/12/16 hub access, 32KB hub)
4 more powerful P1+/P2 cogs (quad hub access, hubexec only (no cog mode??), 1:4 hub access, 512KB hub)
Small register window between the 2 sets of processor cores.
The smaller simpler cogs are for simple I/O. They have their own 32KB hub.
The 4 more powerful cogs are for heavier I/O processing, video, and main code. They have their own 512KB hub.
Since they have 1:4 hub slots, hubexec can be executed straight out of hub. No need to cache hubexec!
If hub could be accessed at 200MHz, then each cog would get 50 MIPS in hubexec (one hub window, and hence one instruction fetch, every 4 clocks per cog).
The 4 cogs would implement a "super" PASM instruction set.
But is the P1+ really a "big data" kind of chip? My instinct, and posts in the last month, tell me not; they tell me that the "big data" users want lots of data (i.e. more than even 512K) and SDRAM hardware.
I thought so. Examples please!
Thing is if you have "big data" or perhaps "big code" you probably don't want to be accessing it really slowly as the Prop currently does.
"Big data" in itself is not the issue here. After all if you have 300K of data you can probably arrange it into multiple memory areas each with a COG or two banging on it.
If you need 300K of data and speed you probably don't want to be using a Propeller anyway.
As Brian says, the Prop is not a "big data" kind of chip. There are many other offerings already that will do that better for you.
Many of us see the Propeller as a real-world, real-time interfacing device, with huge advantages over run-of-the-mill MCUs in that it has multiple processors that can be dedicated to multiple interfaces very easily and flexibly.
Here the priority is getting data from those pins into RAM and vice versa as fast as possible. It is not going to be "big data".
That's the thing, isn't it? Over and over people say that whatever form memory execution and sharing takes, it must be deterministic. To me, deterministic = real-time. To me, "big data" = something with an MMU and a proper OS and filing system.
I obviously mean "big data" from an embedded perspective, not "big data" in the web-space sense.
Easy examples:
For any LCD/video application, 128K chunks will be positively painful.
T&M capture-and-generate systems are also far easier with a single address space.
Even 512K is just a tad small for the common 800x480 LCD standard (rough numbers below), so making it smaller is going the wrong way.
Especially as there are much smarter solutions to slot rate than dicing memory into separate blocks. That's very un-P1-like.
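To put rough numbers on that LCD point (my arithmetic, assuming a plain uncompressed framebuffer), a quick C sanity check:

```c
/* Framebuffer size for an 800x480 LCD at common colour depths.
 * Plain, uncompressed framebuffer assumed; illustrative arithmetic only.
 */
#include <stdio.h>

int main(void) {
    const long width  = 800;
    const long height = 480;
    const int  depths[] = {8, 16, 24};          /* bits per pixel */

    for (int i = 0; i < 3; i++) {
        long bytes = width * height * depths[i] / 8;
        printf("%2d bpp: %7ld bytes (%4.0f KB)\n",
               depths[i], bytes, bytes / 1024.0);
    }
    return 0;
}
```

At 16 bpp the frame alone is about 750 KB, so 512K really is a tad small, and even an 8 bpp frame (375 KB) won't fit in a 128K block.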
He he, Asymmetric COGs will give some a heart attack...
And, while I know it's not an alternative to hubexec, it might make sense to implement the 32-slot hub table. A few thoughts on how it could be used with LMM:
The way I read Chip's post was that while hubexec was going to be a chore to implement, he intended to do it. Both hubexec and quad-long transfers per hub access. I'm thinking that the combination of the two makes playing games with hub slots, etc. unneeded. Seems to me that 4 longs per hub access is a lot to process per hub access, especially when hubexec is added to the picture.
Just as I was warming up to the idea of hub slot sharing.
electrodude
Yes, that is what I read too.
Hubexec is already done (actually twice) and proven.
It has distinct marketing benefits, so would be a surprise to see it removed.
I think wide-fetch is almost free, now that they have changed from custom memory to OnSemi IP.
However, there are different ways to manage the details, which I guess Chip is looking at.
Opcode 'reach' is already done, and the biggest gain is to add some form of 'hardware feeder' to make opcode fetch from HUB transparent, even if slower than COG code.
Another step would be to support (in the same handler?) direct opcode fetch from QuadSPI memory, called Execute in Place (XIP).
Those QuadSPI devices continue to expand, and they deliver 104MHz clocks. They are cheap, and come in Flash/RAM/MRAM...
If anything, on a now slower device, a solution becomes more important, as 'just use a higher clock' is less of an option.
A very flexible solution that has no fSys impact has been found for that.
And before this thread gets a life of its own, this is what Chip actually said about those items:
I think those features are important enough, though, that the disruption they cause must be accommodated.
i.e. they make the cut, as he thinks the silicon impact is worth it.
One of the customer requests was better C support.
Hubexec with PTRA & PTRB provides much higher C performance and better code density.
Plus Chip already had it working on the P8X92A and the initial P16X64A version, as I recall, so as jmg said, it is a fait accompli!
Are there any other ways to achieve hubexec ???
Simpler, easier, alternatives, etc. Even radical ideas may provoke that magical solution.
We all understand pretty well what has to be done, but maybe there is just an alternative waiting to be discovered.
Once we get to 1:4 hub slots, hubexec could just run straight from hub without any icache (at the same speed or a little faster than LMM). Might this be fast enough???
A hub table would be simpler to implement, with 32 slots and 2-level tables. Remember, the cost is shared over all cogs.
With mooching, hubexec without an icache would be quite fast for the main program.
But I agree: even with a 32-slot table, with mooch, the icache is not strictly necessary.
We still need PTRA & PTRB for fast hub stacks and indexed addressing.
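To make the table idea concrete, here is a toy C model of the concept as I understand it (my own sketch, not Chip's design): a 32-entry table where each slot names the cog that owns that hub window, and any unassigned slot is handed to a designated "mooching" cog instead of being wasted.

```c
/* Toy model of a 32-slot hub table with "mooch" - illustrative only,
 * not an actual P1+/P2 design.  Each slot names the cog that owns that
 * hub window; a slot of -1 is unassigned and goes to the moocher cog.
 */
#include <stdio.h>

#define SLOTS 32
#define COGS  16

int main(void) {
    int table[SLOTS];
    int moocher = 0;                      /* cog allowed to grab spare windows */

    /* default: plain 1:16 round robin, every slot owned */
    for (int s = 0; s < SLOTS; s++)
        table[s] = s % COGS;

    /* example reallocation: give cog 0 two extra windows, free two others */
    table[1]  = 0;
    table[17] = 0;
    table[5]  = -1;
    table[21] = -1;

    for (int clock = 0; clock < SLOTS; clock++) {
        int owner = table[clock % SLOTS];
        if (owner < 0)
            owner = moocher;              /* unowned window gets mooched */
        printf("clock %2d: hub window for cog %d\n", clock, owner);
    }
    return 0;
}
```

Real hardware would hold this in a small lookup RAM rather than doing it in software, but the mapping is the same, which is why the cost can be shared over all cogs.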
I've updated the FPGA reports, for a 64x : 32x/32x : F16 table - it is slower than 32x, but ~ 350MHz in distributed Dual Port RAM
( 32x/32x is mooch, F16 is default force 1/16 scan, no pre-load needed )
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1266305&viewfull=1#post1266305
OMG Heater, simple hub sharing is too complex and, for some others, detracts from the 'beauty of simplicity' of the Prop, but this doesn't?????
I think #3 is a fair complaint given the request for memory was high on Ken's list.
Seeing as Ken just broke the news in the other thread though, maybe there is still hope for a simple form of hub exec if QUADs or something is too difficult.
I think that's good 'out of the box' thinking there, though. However, nothing like making the Prop 2 even weirder to anyone new to it.
Before we get all 'heated' up for the n'th time, I'd really like to hear a somewhat more complete posting by Chip that explains how things really stand and what issues/difficulties there are.
If we're going all out, why not take some of that extra die area we had from the old kitchen-sink P2, throw in some actual, real H/W peripherals to attract some customers, and go back to 8 faster Cores which have more memory (as per above) and use them more as Cores should be used, not to imitate simple, solved H/W?
Really, how much die area would it cost to get even 1x USART, I²C, and SPI?
Of course, Core RAM is limited, and all Cores have to share HUB.
Is there really no way to fix one of those so one of these issues could be put to bed?
Not that I'd hold my breath over that....
Well, as Seairth pointed out, with quad read alone, the P16X32B (or is it now P16X64A?) will already be something like 8 times faster than the P1 at executing high level languages.
Do we need more than that? Since (as I have pointed out before) most of the grunt work on most Propeller applications is done in the other 15 cogs anyway?
Ross.
I realize that; it's just that at least once every couple of years I simply have to get that off my chest...
However, IF they were there and available, I think Parallax would certainly get some real interest from the majority who currently pan the chip and the whole Core/soft-peripheral idea.
I really think not having these very basics just kills the Prop when it comes to uC selection.
Having just these several h/w peripherals would, I think, give many the confidence to actually order a demo board to experiment with and potentially consider using in a product.
For them, the utility of 8 Cores then becomes adding either additional custom code, or experimenting and proving to their own satisfaction the reliability of s/w peripherals, or custom objects, etc.
Again, not holding my breath.
See my analysis in ariba's "50MIPS qlmm" thread: due to slow fjmp/fcall/fret we are limited to 25..35 MIPS even with Andy's clever extension to qlmm.
Hubexec is about 50 MIPS.
Hubexec plus a slot table and mooch approaches 100 MIPS.
There is no magic non-hubexec solution.
I don't see the point in arguing the analysis since the chip hasn't even been designed yet ... but 35 MIPS per cog for high-level languages sounds OK to me.
Has anyone made a case for needing more than that?
Ross.
NXP, TI, Freescale, Atmel, etc. all have better than 25 MIPS, showing a need for more.
Frankly, all this opposition to hubexec is very ill-advised.
p.s.
I will likely not have internet for about a week.
I am not opposed to HubExec. I just point out that no-one has made a compelling case for it.
But what I am opposed to is turning the Propeller into a complex monstrosity in the vague and misguided hope of competing with "NXP, TI, Freescale, Atmel" or (these days) even relatively low-end ARM processors!
Ross.