HUBEXEC - Any alternatives ???
Cluso99
Chip has just mentioned that both quad access to hub and hubexec are not simple.
I wonder if there are any alternatives, simplifications, or other ways we could implement hubexec ???
It is at least worth a discussion - quite possibly there are no alternatives.
I developed a fast overlay loader for P1. While it performs better than LMM, particularly if there are any loops, I am not sure whether it could be used from an HLL (e.g. by GCC).
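For anyone who hasn't met it, LMM boils down to a tiny fetch-execute kernel in the cog that pulls one long at a time from hub and runs it, so every instruction costs a hub window. A toy C model of that idea (purely illustrative, not real P1 code, and the hub contents are just placeholders):

```c
/* Toy model of the LMM (Large Memory Model) idea, for illustration only.
 * On the real P1 this is a few PASM instructions; the point is that every
 * LMM instruction needs one hub read, which is what an overlay loader or
 * hubexec tries to avoid.
 */
#include <stdint.h>
#include <stdio.h>

#define HUB_LONGS 8

/* pretend "hub RAM" holding the LMM code image (placeholder values) */
static uint32_t hub[HUB_LONGS] = {1, 2, 3, 4, 5, 6, 7, 8};

/* stand-in for "run this fetched instruction inside the cog" */
static void execute(uint32_t instr) {
    printf("exec %08x\n", (unsigned)instr);
}

int main(void) {
    uint32_t pc = 0;                    /* hub byte address of next instruction */
    while (pc < HUB_LONGS * 4) {
        uint32_t instr = hub[pc / 4];   /* rdlong: one hub window per fetch */
        pc += 4;                        /* advance the LMM program counter  */
        execute(instr);
    }
    return 0;
}
```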
I am not a fan of caching either. I think it carries risk, and it certainly doesn't inspire confidence in me (and therefore probably not in others who want to use the chip commercially).
I don't have any concrete thoughts yet, but here is the thread to discuss ideas (and a break from slot sharing ha! ha!)
Comments
Chip has already done this, I think twice now: once for the P2, and again on the slightly different pipeline/cycle timing of the P1+.
Given it is proven, I'd think he has a good handle on how it works, and where any fat may be.
It works amazingly well. The C version of my FFT runs almost as fast as the hand-crafted PASM version, which has been tweaked and optimized by a number of forum members.
Not only that, using OpenMP propgcc can spread the inner loops of the FFT over two or four COGs for a further performance boost!
Alternatives to HUB exec? I have no idea.
I'd be inclined to chop the HUB memory into 4 areas of 128K bytes each and have 4 HUBs doing round-robin arbitration among 4 groups of 4 COGs.
BINGO, everybody is now able to run LMM or HUB exec 4 times faster (a hub window every 4 clocks instead of every 16). And there is 4 times less HUB access latency for normal COG code.
This is much simpler than any stupid HUB slot juggling mechanism and probably much faster overall.
Problems with this approach are:
1) Do we have room to run 3 extra buses from COGs to RAM and add 3 extra HUB switches?
2) How to arrange for communication between the four groups of COGs?
3) Will people bleat that they can no longer run more than 128K of code (including data) from any given COG?
Absolutely they will. There are a lot of large-data applications you have just killed, right there.
8/12/16 simple P1+ cogs (no quad hub access, no hubexec, 1:8/12/16 hub access, 32KB hub)
4 more powerful P1+/P2 cogs (quad hub access, hubexec only (no cog mode??), 1:4 hub access, 512KB hub)
Small register window between the 2 sets of processor cores.
The smaller simpler cogs are for simple I/O. They have their own 32KB hub.
The 4 more powerful cogs are for heavier I/O processing, video, and main code. They have their own 512KB hub.
Since they have 1:4 hub slots, hubexec can be executed straight out of hub. No need to cache hubexec!
If hub could be accessed at 200MHz, then each cog would get 50 MIPS in hubexec (one hub window, and hence one instruction fetch, every 4 clocks per cog).
The 4 cogs would implement a "super" PASM instruction set.
But is the P1+ really a "big data" kind of chip? My instinct, and posts in the last month, tell me not; they tell me that the "big data" users want lots of data (i.e. more than even 512K) and SDRAM hardware.
I thought so. Examples please!
Thing is if you have "big data" or perhaps "big code" you probably don't want to be accessing it really slowly as the Prop currently does.
"Big data" in itself is not the issue here. After all if you have 300K of data you can probably arrange it into multiple memory areas each with a COG or two banging on it.
If you need 300K of data and speed you probably don't want to be using a Propeller anyway.
As Brian says, the Prop is not a "big data" kind of chip. There are many other offerings already that will do that better for you.
Many of us see the Propeller as a real-world, real-time interfacing device, with huge advantages over run-of-the-mill MCUs in that it has multiple processors that can be dedicated to multiple interfaces very easily and flexibly.
Here the priority is getting data from those pins into RAM and vice versa as fast as possible. It is not going to be "big data".
That's the thing, isn't it? Over and over people say that whatever form memory execution and sharing takes, it must be deterministic. To me, deterministic = real-time. To me, "big data" = something with an MMU and a proper OS and filing system.
I obviously mean "big data" from an embedded perspective, not "big data" in the web-space sense.
Easy examples:
For any LCD/video application, 128K chunks will be positively painful.
T&M capture-and-generate systems are also far easier with a single address space.
Even 512K is just a tad small for the common 800x480 LCD standard (rough numbers below), so making it smaller is going the wrong way.
Especially as there are much smarter solutions to slot rate than dicing memory into separate blocks. That's very un-P1-like.
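To put rough numbers on that LCD point (my arithmetic, assuming a plain uncompressed framebuffer), a quick C sanity check:

```c
/* Framebuffer size for an 800x480 LCD at common colour depths.
 * Plain, uncompressed framebuffer assumed; illustrative arithmetic only.
 */
#include <stdio.h>

int main(void) {
    const long width  = 800;
    const long height = 480;
    const int  depths[] = {8, 16, 24};          /* bits per pixel */

    for (int i = 0; i < 3; i++) {
        long bytes = width * height * depths[i] / 8;
        printf("%2d bpp: %7ld bytes (%4.0f KB)\n",
               depths[i], bytes, bytes / 1024.0);
    }
    return 0;
}
```

At 16 bpp the frame alone is about 750 KB, so 512K really is a tad small, and even an 8 bpp frame (375 KB) won't fit in a 128K block.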
He he, Asymmetric COGs will give some a heart attack...
And, while I know it's not an alternative to hubexec, it might make sense to implement the 32-slot hub table. A few thoughts on how it could be used with LMM:
The way I read Chip's post was that while hubexec was going to be a chore to implement, he intended to do it. Both hubexec and quad-long transfers per hub access. I'm thinking that the combination of the two makes playing games with hub slots, etc. unneeded. Seems to me that 4 longs per hub access is a lot to process per hub access, especially when hubexec is added to the picture.
Just as I was warming up to the idea of hub slot sharing.
electrodude
Yes, that is what I read too.
Hubexec is already done (actually twice) and proven.
It has distinct marketing benefits, so would be a surprise to see it removed.
I think wide-fetch is almost free, now that they have changed from custom memory to OnSemi IP.
However, there are different ways to manage the details, which I guess Chip is looking at.
Opcode 'reach' is already done, and the biggest gain is to add some form of 'hardware feeder' to make opcode fetch from HUB transparent, even if slower than COG code.
Another step would be to support (in the same handler?) direct opcode fetch from QuadSPI memory, called Execute in Place (XIP).
Those QuadSPI devices continue to expand, and they deliver 104MHz clocks. They are cheap, and come in Flash/RAM/MRAM...
If anything, on a now slower device, a solution becomes more important, as 'just use a higher clock' is less of an option.
A very flexible solution that has no fSys impact has been found for that.
And before this thread gets a life of its own, this is what Chip actually said about those items:
I think those features are important enough, though, that the disruption they cause must be accommodated.
i.e. they make the cut, as he thinks the silicon impact is worth it.
One of the customer requests was better C support.
Hubexec with PTRA & PTRB provides much higher C performance and better code density.
Plus Chip already had it working on the P8X92A and the initial P16X64A version, as I recall, so as jmg said, it is a fait accompli!
Are there any other ways to achieve hubexec ???
Simpler, easier, alternatives, etc. Even radical ideas may provoke that magical solution.
We all understand pretty well what has to be done, but maybe there is just an alternative waiting to be discovered.
Once we get to 1:4 hub slots, hubexec could just run straight from hub without any icache (at the same speed or a little faster than LMM). Might this be fast enough???
A hub table would be simpler to implement, with 32 slots and 2-level tables. Remember, the cost is shared over all cogs.
With mooching, hubexec without an icache would be quite fast for the main program.
But I agree: even with a 32-slot table, with mooch, the icache is not strictly necessary.
We still need PTRA & PTRB for fast hub stacks and indexed addressing.
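To make the table idea concrete, here is a toy C model of the concept as I understand it (my own sketch, not Chip's design): a 32-entry table where each slot names the cog that owns that hub window, and any unassigned slot is handed to a designated "mooching" cog instead of being wasted.

```c
/* Toy model of a 32-slot hub table with "mooch" - illustrative only,
 * not an actual P1+/P2 design.  Each slot names the cog that owns that
 * hub window; a slot of -1 is unassigned and goes to the moocher cog.
 */
#include <stdio.h>

#define SLOTS 32
#define COGS  16

int main(void) {
    int table[SLOTS];
    int moocher = 0;                      /* cog allowed to grab spare windows */

    /* default: plain 1:16 round robin, every slot owned */
    for (int s = 0; s < SLOTS; s++)
        table[s] = s % COGS;

    /* example reallocation: give cog 0 two extra windows, free two others */
    table[1]  = 0;
    table[17] = 0;
    table[5]  = -1;
    table[21] = -1;

    for (int clock = 0; clock < SLOTS; clock++) {
        int owner = table[clock % SLOTS];
        if (owner < 0)
            owner = moocher;              /* unowned window gets mooched */
        printf("clock %2d: hub window for cog %d\n", clock, owner);
    }
    return 0;
}
```

Real hardware would hold this in a small lookup RAM rather than doing it in software, but the mapping is the same, which is why the cost can be shared over all cogs.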
I've updated the FPGA reports, for a 64x : 32x/32x : F16 table - it is slower than 32x, but ~ 350MHz in distributed Dual Port RAM
( 32x/32x is mooch, F16 is default force 1/16 scan, no pre-load needed )
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1266305&viewfull=1#post1266305
OMG Heater, simple hub sharing is too complex and, for some others, detracts from the 'beauty of simplicity' of the Prop, but this doesn't?????
I think #3 is a fair complaint given the request for memory was high on Ken's list.
Seeing as Ken just broke the news in the other thread though, maybe there is still hope for a simple form of hub exec if QUADs or something is too difficult.
I think that's good 'out of the box' thinking there, though. However, nothing like making the Prop 2 even weirder to anyone new to it.
Before we get all 'heated' up for the n'th time, I'd really like to hear a somewhat more complete posting by Chip that explains how things really stand and what issues/difficulties there are.
If we're going all out, why not take some of that extra die area we had from the old kitchen-sink P2, throw in some actual, real H/W peripherals to attract some customers, and go back to 8 faster Cores which have more memory (as per above) and use them more as Cores should be used, not to imitate simple, solved H/W?
Really, how much die area would it cost to get even 1x USART, I²C, and SPI?
Of course, Core RAM is limited, and all Cores have to share HUB.
Is there really no way to fix one of those so one of these issues could be put to bed?
Not that I'd hold my breath over that....
Well, as Seairth pointed out, with quad read alone, the P16X32B (or is it now P16X64A?) will already be something like 8 times faster than the P1 at executing high level languages.
Do we need more than that? Since (as I have pointed out before) most of the grunt work on most Propeller applications is done in the other 15 cogs anyway?
Ross.
I realize that; it's just that at least once every couple of years I simply have to get that off my chest...
However, IF they were there and available, I think Parallax would certainly get some real interest from the majority who currently pan the chip and the whole Core/soft-peripheral idea.
I really think not having these very basics just kills the Prop when it comes to uC selection.
Having just these several h/w peripherals would, I think, give many the confidence to actually order a demo board to experiment with and potentially consider using in a product.
For them, the utility of 8 Cores then becomes adding either additional custom code, or experimenting and proving to their own satisfaction the reliability of s/w peripherals, or custom objects, etc.
Again, not holding my breath.
See my analysis in ariba's "50MIPS qlmm" thread: due to slow fjmp/fcall/fret we are limited to 25..35 MIPS even with Andy's clever extension to qlmm.
Hubexec is about 50 MIPS.
Hubexec plus a slot table and mooch approaches 100 MIPS.
There is no magic non-hubexec solution.
I don't see the point in arguing the analysis since the chip hasn't even been designed yet ... but 35 MIPS per cog for high-level languages sounds OK to me.
Has anyone made a case for needing more than that?
Ross.
NXP, TI, Freescale, Atmel, etc. all have better than 25 MIPS, showing a need for more.
Frankly, all this opposition to hubexec is very ill-advised.
p.s.
I will likely not have internet for about a week.
I am not opposed to HubExec. I just point out that no-one has made a compelling case for it.
But what I am opposed to is turning the Propeller into a complex monstrosity in the vague and misguided hope of competing with "NXP, TI, Freescale, Atmel" or (these days) even relatively low-end ARM processors!
Ross.