I can see how hubexec might look easier for new users familiar with other micros. But if you remember, cog (PASM) was not difficult.
What percentage of P1 users ever go beyond Spin? I was under the impression that it was relatively uncommon for most Propeller users to use PASM other than what is included in drivers that they download from OBEX. This is as it should be. I would be willing to bet that very few Arduino users or even commercial users of the AVR, PIC, or ARM processors write much assembly code. We here in this forum are probably not representative of normal MCU users.
PLEASE DO NOT REMOVE HUBEXEC!!!! Yes, that was me yelling. Sorry if I hurt anybody's ears. It concerns me greatly that the chief designer (only designer) of the P2 would even consider removing hubexec. I think things could have been simplified if hubexec used the same addressing scheme that is used with cog/LUT memory. It seems like using long addressing throughout would simplify the hardware and the software. In fact, I would think that completely eliminating un-aligned accesses would simplify both instruction access and data access from hub RAM. We don't need un-aligned accesses.
Hubexec will be very useful on the P2. It will be used extensively with C programs, and I'm sure many people will write large PASM programs that will use it as well. It would be used even more if P2 Spin has a mode where it directly compiles into PASM. Just think about it. Spin programmers could actually write code that runs almost as fast as PASM programs.
The only problem that I have with hubexec is that it uses the streamer instead of an instruction cache. The streamer will have to be restarted every time a jump is done. This reduces the efficiency of code that does a lot of jumps. This really messes up the performance of tight loops running in hubexec mode. It also messes up the timing of hubexec code that calls cog code in an attempt to improve efficiency. The streamer is restarted when the cog routine returns instead of just continuing from where the CALL was made. Also, data cannot be streamed while running in hubexec mode (or at the very least data and instruction streaming will have to alternate).
An instruction cache would resolve this. Maybe there can be some tweaks made to the streamer to improve looping and calls, but a cache would be the ultimate fix.
Chip
It seems the "bug radar" is being jammed here with "chaff"
Not sure if you saw an earlier post highlighting an issue with INB operations. See "INB issue" here
*BUMP*
I was able to reproduce the same issue. There's definitely something going on with INB when the cog first starts.
I am getting $3FFFFFFF initially. Then, after some period, it goes to $7FFFFFFF. On my 1-2-3, it's taking ~2700 iterations of the loop before it seems to start working. Actually, it doesn't seem to matter if I'm reading INB repeatedly or not. It's the delay that seems to be important.
I don't think Chip has even considered removing hub exec. I was just objecting to the idea that it should be relegated to an "advanced feature". It seems to me that COG/LUT execution is the advanced feature since it involves loading another memory with code before you can use it. Hub execution treating COG addresses as registers seems like the simplest way to think of P2 programming.
... The trouble is mainly coming from addressing complexities. If we could establish some basic rules, maybe based on cog/lut vs. hub context, perhaps we could simplify things greatly.
I think trying to orient programming towards hub exec has been the downer. Living in the cog is more fun.
Based on what Chip said, maybe it's only the addressing difference between cog and hub exec that complicates things. That's why I suggest eliminating un-aligned hubexec. The program counter would continue to increment by 1 when going from cogexec to hubexec mode. A PC address of $400 would map to a byte address of $1000 in hub RAM. The only drawback I can see is that the first $1000 bytes of hub RAM can't be used for hubexec. I don't see that as a major problem. This memory could be used to hold a cog image that would be loaded into cog RAM. Most hubexec programs will call routines loaded in cog RAM for speed improvement.
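To make the proposed mapping concrete, here is a minimal sketch in Python (used purely as a calculator, not as P2 code). The function name and the `COG_LUT_LONGS` constant are hypothetical; the only facts taken from the post are that the PC counts longs and that PC $400 lands on hub byte address $1000.

```python
# Assumption: PCs below $400 address cog/LUT longs, so hubexec starts at $400.
COG_LUT_LONGS = 0x400

def pc_to_hub_byte_address(pc):
    """Map a long-granular program counter to a hub byte address
    under the proposed long-aligned hubexec scheme."""
    if pc < COG_LUT_LONGS:
        raise ValueError("PC below $400 refers to cog/LUT memory, not hub RAM")
    return pc << 2  # four bytes per long

print(hex(pc_to_hub_byte_address(0x400)))  # 0x1000
```

Under this scheme the lowest hub byte address reachable by hubexec is $1000, which is exactly the drawback mentioned above.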
I think some small tweaks can be done to the instruction streamer to improve performance. I think one simple tweak is for it to remember where it left off when calling routines in cog memory. This way it wouldn't have to be reloaded when returning from the cog memory routine. Another tweak would be to retain a certain number of longs in the streamer after execution. This way the streamer wouldn't have to be reloaded if just jumping back a few instructions. And one more tweak would be to not reload the streamer if jumping ahead by just a few instructions.
A cache would really be the best approach, but that may be too much to ask for at this point.
Removing hubexec would be the worst possible move, as hubexec will give good high-level language performance.
Personally, I'd address code in longs, with implied low order bits, in instructions. LOC etc can always add the 2 zeros, and data access for {RD|WR}{BYTE|WORD|LONG} (and streaming equivalents) in bytes.
Regarding streamer... I think it is GREAT for streaming data for video or other high-bandwidth data.
It is far better than LMM for code, but it is nowhere near as effective as a code cache.
Rule of thumb on most processors is every sixth instruction is a branch, which will flush the streamer.
It takes 8 instruction cycles (16 clocks) to refill the streamer.
Best guess: streamed hubexec code will (on average) be 1/2 the speed of cog code, which is not bad at all. (lut code will be 66% the speed of cog code)
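The "about half speed" guess can be checked with some back-of-envelope arithmetic. The sketch below is in Python and assumes, pessimistically, that every branch pays the full 8-instruction-cycle refill and that nothing else stalls; the function name and parameters are my own, not from any P2 documentation.

```python
def relative_hubexec_speed(instructions_per_branch=6, refill_cycles=8):
    """Fraction of cog speed when every branch costs a full streamer refill.

    Assumption: each run of `instructions_per_branch` instructions ends in
    one branch, and that branch stalls for `refill_cycles` instruction cycles.
    """
    useful = instructions_per_branch
    total = instructions_per_branch + refill_cycles
    return useful / total

print(round(relative_hubexec_speed(), 2))  # 0.43
```

With one branch every six instructions this gives 6/14, a bit under one half, so the estimate above is in the right range for this simple model.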
A 16 long instruction cache, would run at ~100% cog speed for loops <= 16 longs, which are frequent for things like mem*(), str*(), etc, but not for main UI code loops.
What would likely help quite a bit is making the instruction stream a bit smarter... only flushing what is needed, so if the target of the branch is already in the streamer, go to it immediately - heck Chip may already be doing this! This would automatically "cache" loops up to 16 instructions long.
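A toy model of that "smarter streamer" idea, in Python. This is only a sketch under my own assumptions: the streamer keeps a sliding window of the last 16 longs, a branch whose target is inside the window costs nothing, and any other branch forces a refill. It says nothing about how the actual hardware behaves.

```python
def count_refills(trace, window_longs=16, smart=True):
    """Count streamer refills for a sequence of executed long addresses.

    smart=True  : a branch whose target is still in the window is free.
    smart=False : every non-sequential fetch forces a refill.
    """
    refills = 0
    window_start = None
    prev = None
    for addr in trace:
        sequential = prev is not None and addr == prev + 1
        if not sequential:
            in_window = (smart and window_start is not None
                         and window_start <= addr < window_start + window_longs)
            if not in_window:
                refills += 1
                window_start = addr
        elif addr >= window_start + window_longs:
            # window slides forward; the oldest long is dropped
            window_start = addr - window_longs + 1
        prev = addr
    return refills

loop8 = list(range(100, 108)) * 10    # 8-long loop, 10 passes
loop20 = list(range(100, 120)) * 3    # 20-long loop, 3 passes
print(count_refills(loop8), count_refills(loop8, smart=False), count_refills(loop20))
```

In this model the 8-long loop refills once in total instead of once per pass, while the 20-long loop still refills on every pass because its start has already slid out of the window.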
David had a good idea - perhaps educational material should first document hubexec mode, as it is closer to conventional processors, and does not limit size. lut/cog execution as an "advanced" topic for drivers, libraries etc. makes sense.
Regarding C/C++ .. David, more people will use it on a P2 with a lot of memory.
@jmg, the real question to ask is whether or not there will be code that assembly language people write.
Having it standards compliant and all of that at this stage doesn't really mean much. Besides, we can write simple filters that'll move a lot of the code if it's really needed.
The reality is a lot of people ended up on P1 and did some neat stuff precisely because of the cool programming environment. If you want that to continue then we need another cool programming environment.
And I would advise that you abandon the idea of some grand unification at this stage; it's not going to happen.
In fact, on P2 one of the things I expect we will see is a few different kinds of programming environments. Having code that does cool stuff is a nice problem to have.
As for learning, let's whittle this syntax down and resolve addressing for real.
Then we can start thinking about how people get in. Just as a brief note, I found myself wondering where code was actually running. More clarity on that should be a driver on our addressing solution.
Sure, we need a cool programming environment, but does it have to be based on Spin? Can't we come up with a way to make a C or C++ based environment "cool"? It's true that having lots of cool programming environments is a nice problem to have. My question is whether we have enough resources to generate more than one. It took many years to create all of the environments we now have on P1. Do we want to wait that long for P2 environments?
Please, feel free to say more. But could you move it to another thread that's specifically about that topic? I'm afraid that Chip is missing bug reports (like the one that @ozpropdev mentioned earlier). And, in my opinion, getting verilog bugs fixed is much more important right now.
Coley wrote:
IMHO Hub exec mode is the icing on the cake and shouldn't be the driving force.
I use all sorts of micros depending on the task at hand, the Propeller is by far the easiest and most fun to use.
I really hope P2 can continue that tradition.
+1
David Betz wrote:
I was under the impression that it was relatively uncommon for most Propeller users to use PASM other than what is included in drivers that they download from OBEX.
It is FAR FAR FAR more likely for a Propeller user to use ASM than it is for an ARM user to use ASM. I will resist any effort to make the Propeller another ARM chip. At least 75% of all Propeller code I write is PASM because of the Propeller's irreplaceable and almost unique facility to allow a user to create precise custom high-speed signalling easily.
I don't doubt that that is true but are you a typical Propeller user or are you one of the few who provide OBEX code that the rest use?
Based on what Chip said, maybe it's only the addressing difference between cog and hub exec that complicates things. That's why I suggest eliminating un-aligned hubexec. The program counter would continue to increment by 1 when going from cogexec to hubexec mode. A PC address of $400 would map to a byte address of $1000 in hub RAM.
Complicates what ?
Chip has made the code relocatable, which is work now hidden from users.
The fine granularity of HUB is needed for BYTE data, and sure CODE does not have to be other than long aligned, but long aligned Code is a subset of what is working now.
Because a BYTE address is needed for Data, I'm guessing Chip has kept that for Code to keep the hardware simpler. (Plus, it does avoid wasting space with packers, but that is a less important point)
The only drawback I can see is that the first $1000 bytes of hub RAM can't be used for hubexec. I don't see that as a major problem. This memory could be used to hold a cog image that would be loaded into cog RAM. Most hubexec programs will call routines loaded in cog RAM for speed improvement.
This was why I suggested LUT:HUB could go to the top of the memory map.
I think some small tweaks can be done to the instruction streamer to improve performance. I think one simple tweak is for it to remember where it left off when calling routines in cog memory. This way it wouldn't have to be reloaded when returning from the cog memory routine. Another tweak would be to retain a certain number of longs in the streamer after execution. This way the streamer wouldn't have to be reloaded if just jumping back a few instructions. And one more tweak would be to not reload the streamer if jumping ahead by just a few instructions.
A cache would really be the best approach, but that may be too much to ask for at this point.
The question then becomes what size cache - and any cache still needs to be reloaded.
A Cache-reload would still need to wait for the COG-slot, as that is fundamental to the high memory bandwidth.
Given the good P2 support for block moves, I expect Software will eventually create some nifty resizable-caches, where it does load a block into COG and runs that.
The streamer design seems a nice, relatively simple, way to give every COG full memory bandwidth flows.
Sure, not quite a code-cache in conventional thinking, but there are 16 of these, so they cannot be overly large.
I would be willing to bet that very few Arduino users or even commercial users of the AVR, PIC, or ARM processors write much assembly code. We here in this forum are probably not representative of normal MCU users.
Yes, the general larger MCU use model, is to 'use ASM only when you have to'.
( Of course, that also gives laments of the hundreds of files the frameworks create and the tens of k bytes of overheads, seen in other forums.... )
The P2 is a little different, and I expect first-usage will follow other MCUs with a large HUB compiler model, but there are 16 COGS and they are finite in size.
That means there will be more ASM code in a P2, than in other MCUs and that ASM code resource needs to be 'user harvestable'.
I was responding to Chip's comment about the complications of hubexec. Yes, Chip has made the code relocatable and hidden the complexities from users, but it seems like he regretted how much effort it required.
The fine addressing granularity is required for data, but it's not necessary for execution, so why complicate execution with it? And non-aligned accesses aren't required for data or execution.
Chip has expressed a desire to intermix packed data with instructions and that makes non-aligned instructions necessary.
Cluso,
Why do you think having all 16 cogs running hubexec will be any hotter than 15 cogs running cogexec, and 1 running hubexec?
If anything, I would expect 16 cogs running hubexec to run slightly cooler than 16 cogs running cogexec. Mainly because the execution units will be stalled more often waiting on fifo refills on branches. The execution units are likely the hot points. The hub memory stuff is spread out around many separate memories, making it less likely to be a hot point because it's spread out.
I think the main thing that caused the P2-hot to be hot was the pipelining to get instructions down to 1 cycle each, since it had more of the cog execution stuff going at the same time.
This is because one of the P2-HOT issues was that the hub was a single block that was being accessed regularly. Chip broke this down into 16 blocks, with each block selected by the lowest 4 bits of the address (i.e. each consecutive address can be accessed by a different cog in parallel, hence the egg-beater name). So now we can have every cog accessing a separate hub block in parallel, and they will be doing so on almost every clock in hubexec. So all 16 blocks of hub will be accessed on each clock, one for each cog.
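For anyone trying to picture the egg-beater, here is a small Python model. It is a sketch only: the rotation rule `(cog + clock) mod 16` and the function names are my assumptions, chosen to match the description above (16 blocks picked by the low 4 address bits, every cog on a different block each clock), and the real Verilog may rotate differently.

```python
NUM_COGS = 16
NUM_SLICES = 16  # assumption: one hub RAM block per cog

def slice_of(long_address):
    """Hub block holding a long: selected by the low 4 address bits."""
    return long_address % NUM_SLICES

def slice_for_cog(cog, clock):
    """Block a cog is connected to on a given clock (assumed rotation)."""
    return (cog + clock) % NUM_SLICES

def clocks_to_read(cog, start_clock, first_long, count):
    """Clocks needed for a cog to read `count` consecutive longs."""
    clock = start_clock
    # wait for the rotation to reach the block holding the first long
    while slice_for_cog(cog, clock) != slice_of(first_long):
        clock += 1
    # consecutive longs sit in consecutive blocks, so one arrives per clock
    return (clock - start_clock) + count

# On any clock, the 16 cogs are spread over 16 different blocks.
print(len({slice_for_cog(cog, 7) for cog in range(NUM_COGS)}))  # 16
```

In this model every block is busy on every clock once all cogs are streaming, which is the basis of the power concern raised here.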
Sure, there will be a bit of a hub slot delay with hubexec, but not that much.
So, the power being used will be significantly higher with hubexec. Just how much, I have absolutely no idea. Cog and Lut exec will not access hub regularly, as happens now with the P1.
BTW I do understand that there were other issues with P2-HOT.
I just posted a new file at the top of this thread.
I made some assembler changes which are going to make programming a lot easier, regarding addresses and branching.
Notice two small differences here?
dat
                orgh    0
'
' launch cogs 15..0 with blink program
' cogs that don't exist won't blink
'
                org
:loop           coginit cognum,#`blink
                djns    cognum,:loop
cognum          long    15
'
' blink
'
                org
blink           cogid   x               'which cog am I?
                setb    dirb,x          'make that pin an output
                notb    outb,x          'flip its output state
                add     x,#16           'add to my id
                shl     x,#18           'shift up to make it big
                waitx   x               'wait that many clocks
                jmp     blink           'do it again
x               res     1               'variable at cog register 8
Comments
Hubexec could really use an instruction cache.
Yep, I found later that using waitx was effective as a workaround too.
I should have asked this here. However, I threw together some code to search for those pin outputs. Here's what I found:
inb[27] : PB3
inb[26] : PB2
inb[25] : PB1
inb[24] : PB0
note: active low (0 = pressed, 1 = not pressed)
I'd still like a full listing of the 1-2-3 to the P2 pins, though.
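In case it helps anyone else, the button table above can be turned into a quick decoder. This is a Python sketch for checking captured INB values on a PC, not code that runs on the P2; the `BUTTON_BITS` table and the function name are my own, and only the bit positions and the active-low polarity come from the list above.

```python
# Bit positions from the list above: inb[24]..inb[27] are PB0..PB3.
BUTTON_BITS = {"PB0": 24, "PB1": 25, "PB2": 26, "PB3": 27}

def pressed_buttons(inb):
    """Return the names of the buttons that read as pressed.

    The inputs are active low: a 0 bit means pressed, a 1 bit means released.
    """
    return [name for name, bit in BUTTON_BITS.items()
            if ((inb >> bit) & 1) == 0]

print(pressed_buttons(0xFFFFFFFF))               # []
print(pressed_buttons(0xFFFFFFFF & ~(1 << 24)))  # ['PB0']
```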
I have never used the words "grand unification"
What I have done is give examples of small steps to make new users' lives easier, and to make code more portable.
Heat sink it and ONWARD! ;p
You'll want to run it off a decent lipo (or other lithium based battery) for most unplugged cases, instead of the old school small dry cell(s).
I don't think so either. Just being flippant
I also added word and long alignment directives.
As for the removal of "@", I'm somewhat intrigued. Personally, I was fine with the "@", so I don't know what the value of removing it is.