48 said we would accept the P16X32B or P32X32B as Chip had originally specified it - i.e. without Hubexec (but there may have been one or two preferences for Hubexec in there somewhere).
4 wanted to hold out for the P2, or something else (EDIT: including Hubexec, as msrobots just pointed out).
You voted NO because of not including hubexec and slot-sharing.
It was NOT about P1/P2 it was about P1 on steroids.
Don't cheat
Mike
Mike,
Actually Chip added hubexec back in.
As he then went thru it he realised there were a few gotchas with the simplified version of hubexec.
But that didn't really mean the simplified hubexec would not work, only that a couple of nice things couldn't be done.
Together we (yes "we" meaning some help from the forum) found a solution for parts, but some other parts are currently too complex.
So the current hubexec discussion really has nothing to do with the original poll done by Ross.
Since there seemed to be no takers on a separate thread, I am posting here.
Please also see my post #971 below.
This is a possible solution to the cog stack for CALLA/RETA...
With just a new "LINKA" instruction and standard cog instructions in cog to support the LINKA & RETA instructions, we could make a cog stack ourselves.
These support cog instructions may or may not lose a hub cycle, depending on where the code currently is.
They would not increase user code (beyond the base cog support routine).
We do not have to use it if we don't want to. It would be at least as fast as hub based stack(s), and likely faster, especially if there were a few instructions replaced with new ones, or someone comes up with a better method.
Note I have not used PUSHA & POPA as these are currently not guaranteed.
dat
Note: I have not considered save/restore of Z & C Flags.
...
LINKA #<routine> ' <routine> is the 15-bit address of the hub-long/cog routine to be called
' This instruction writes a 32-bit long to the fixed cog address _SAVEA
' _SAVEA[31] = Z flag
' _SAVEA[30] = C flag
' _SAVEA[29:15] = 15-bit <routine> address (hub-long or cog)
' _SAVEA[14:0] = 15-bit <return> address (hub-long or cog)
' Then it jumps to the fixed cog address _CALLA
...
<routine> ...
RETA ' This instruction jumps to the fixed cog address _RETA
' which will ultimately return to the next instruction after LINKA.
' It could be simply coded as JMP #_RETA.
...
' The following routine must be setup in the cog ram at the fixed location $1Ex
' This routine supports the new instructions LINKA & RETA
' Note: new instructions could combine the following to simplify the code.
        org     $1Ex
_CALLA  movd    _PUSHA, _INDA   ' set the cog stack pointer
        add     _INDA, #1       ' INDA++
_PUSHA  mov     *-*, _SAVEA     ' push the return address onto the cog stack
        shr     _SAVEA, #15     ' get <routine> address into lower 15 bits
        jmp     _SAVEA          ' jump to hub-long/cog <routine> 15-bit address
_RETA   sub     _INDA, #1       ' --INDA
        movs    _POPA, _INDA    ' set the cog stack pointer
        nop                     ' (may not be required?)
_POPA   mov     _SAVEA, *-*     ' pop the return address off the cog stack
        jmp     _SAVEA          ' includes 17 bit jump address (Z&C??)
_SAVEA  long    0               ' Z,C,<routine>,<return>
_INDA   long    0               ' INDA cog stack pointer
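For anyone who wants to play with the idea without silicon, here is a rough Python model of the LINKA/RETA proposal above. It is only a sketch: the names pack_savea, unpack_savea and CogStack are made up for illustration, the stack is a plain dict rather than real cog RAM, and the jump target is masked to 15 bits on the assumption that the hardware would ignore the Z and C bits left above it after the shift.

```python
MASK15 = 0x7FFF  # both addresses in _SAVEA are 15 bits wide


def pack_savea(z, c, routine, ret):
    """Pack Z, C, <routine> and <return> using the _SAVEA layout from the post."""
    return ((z & 1) << 31) | ((c & 1) << 30) | ((routine & MASK15) << 15) | (ret & MASK15)


def unpack_savea(word):
    """Split a 32-bit _SAVEA word back into its four fields."""
    return {
        "z": (word >> 31) & 1,
        "c": (word >> 30) & 1,
        "routine": (word >> 15) & MASK15,
        "ret": word & MASK15,
    }


class CogStack:
    """Hypothetical model of the _CALLA/_RETA support routine."""

    def __init__(self):
        self.slots = {}  # stands in for the cog RAM used as the stack
        self.inda = 0    # models the _INDA stack pointer

    def linka(self, z, c, routine, ret):
        """Model LINKA: push the packed word, post-increment, return the jump target."""
        word = pack_savea(z, c, routine, ret)
        self.slots[self.inda] = word     # _PUSHA writes at the old pointer value
        self.inda += 1                   # add _INDA, #1
        return (word >> 15) & MASK15     # shr _SAVEA, #15 then jmp (assumed 15-bit target)

    def reta(self):
        """Model RETA: pre-decrement, pop the word, return the 15-bit return address."""
        self.inda -= 1                   # sub _INDA, #1
        word = self.slots[self.inda]     # _POPA reads at the new pointer value
        return word & MASK15
```

Nested calls come back in reverse order, which is all a stack needs to do. Restoring Z and C from the popped word is left out here, just as it is in the PASM above.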
Well the poll was about a P1+ NOT having all the clutter of the P2 and was about a specific Model WITHOUT hubexec.
So that poll still stands.
The same people who ruined the P2 thru excessive adding of functionality are doing the same thing now with the new P1+ based on Chip's model from the thread the poll was in.
Even when Chip then said he needs to take that hubexec out AGAIN because it is complicating the NEW design over his limits - there is no stopping of the feature creep.
WHO really needs this and WHY?
And would ANY of them put their own money on the block? Like people offered on that thread for a P1+ WITHOUT feature creep?
How about $10 per post for each additional feature you need? just 6000 posts for a shuttle run. We can do better than that in two weeks.
I don't have a full understanding of parallel processing, so I feel kind of like I'm window shopping: I see something interesting but I can't have it. If parallel means several tasks at the same time, then I would think that one processor could read/write I2C memory while another reads/writes a serial port, all at the same time, but that would seem to require a hardware USART, etc. So I would really need to see some programs to understand the advantage of cogs, or an explanation of how they are better than interrupts. The Propeller is a huge step up from the Basic Stamp, but it isn't easy to understand. At least some parts are not clear even to programmers, and taking full advantage of the analog DAC and ADC is really black magic. It is like a hobby magazine that turned into a trade magazine for science. Different audience. Many readers are left behind, unable to catch up. Nothing wrong with that, it's just different.
Don't worry about the "parallel programming" part.
If you need a UART then that UART is written in software and runs on one of those processors. Your program can Tx and Rx by calling functions provided by that software UART. You don't even have to know it is a software UART or that it is running anything in parallel. Same goes for I2C, SPI, video output, etc. Just use the library objects provided that create these devices for you.
One day when you are comfortable with this idea you might want to write your own drivers. Then you will start to understand why not having interrupts is wonderful: having a whole processor dedicated to do the job is so much easier.
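To make that concrete, here is a loose Python analogy, not Propeller code. A worker thread stands in for the cog that runs the software UART, and the caller just calls tx() without caring what runs where. The SoftUart class and its wire list are invented for this sketch.

```python
import queue
import threading


class SoftUart:
    """Hypothetical driver object: a worker thread plays the dedicated cog."""

    def __init__(self):
        self._tx = queue.Queue()   # mailbox between caller and driver
        self.wire = []             # stands in for the bytes shifted out on the TX pin
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def tx(self, data):
        """Queue bytes for sending; the caller never touches the pin itself."""
        for byte in data:
            self._tx.put(byte)

    def close(self):
        """Send a stop marker and wait for the driver to drain its mailbox."""
        self._tx.put(None)
        self._worker.join()

    def _run(self):
        # The driver loop: block on the mailbox, then "send" each byte in order.
        while True:
            byte = self._tx.get()
            if byte is None:
                break
            self.wire.append(byte)
```

Usage is just uart = SoftUart(), uart.tx(b"hi"), uart.close(). The main program never polls the pin and never services an interrupt, which is the point being made above.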
I think many would agree with you on the feature creep complaint.
However, having your mainline code running LMM at 12-15MIPS, no matter how fast your Cog PASM cores are running, seems to be well below current expectations of new customers. Even 25 probably.
I agree once you hit 50 + 10-12 super Cog cores at 100 MIPS, it's probably interesting enough to give Parallax a test run.
Otherwise, get an M3/M4 with the peripherals you need and Core MIPS sufficient for the job, and move on.
Parallax doesn't even get a shot.
I think people who are against this are simply seeing only their personal use case, and not the commercial use that Parallax needs to remain profitable for the future.
I doubt all the annual sales volume to forum participants (minus commercial lurkers) profit-wise covers even 1 of their 70 employees' annual pay.
Here is what I also meant to post with my hubexec post #965 above...
HUBEXEC
Here is where I see we are up to with hubexec...
* Hub can be accessed as 128bits (QUAD)
* We can read a block of 4 instructions into a cog cache (ICACHE)
* We can execute up to 4 instructions out of this ICACHE (presuming no branches and initial quad long aligned).
* We can have CALLA & CALLB using PTRA & PTRB as hub pointers where the return address is stored.
* RETA & RETB also use PTRA & PTRB to fetch the return address.
* We can have PUSHA & PUSHB and POPA & POPB too.
* So, support for hub stacks is possible.
* We can have LINK (a CALL where the return address is placed in a fixed cog register, currently $1EF). This supports GCC, etc.
* We can have a 4 deep 19bit buried LIFO for CALL & RET, PUSH & POP. There is a preference to remove this.
What we cannot have (currently anyway) is a Cog Stack using INDA & INDB because the instruction timing does not work.
* We can overcome INDA & INDB for indirect cog access by using a new ALTDS instruction. This can be hidden from the user by the compiler, but it adds an instruction either way.
Are there alternate solutions ???
* Maybe increase the depth of the buried LIFO ???
This may remove the need to use a PTRA/PTRB based cog stack.
* Maybe use a specialised LINKA & RETA and cog support sw as I described in my earlier post.
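As a rough illustration of the quad-aligned ICACHE point in the list above, the following Python sketch counts how many instructions are usable from one quad fetch and how many quad fetches a straight-line run needs. It assumes long-addressed code, 4 longs per quad and no branches; the function names are invented for the example, and it says nothing about real hub timing.

```python
QUAD_LONGS = 4  # one 128-bit hub read holds four 32-bit instructions


def usable_from_quad(long_addr):
    """Instructions available from the quad containing long_addr, starting at long_addr."""
    return QUAD_LONGS - (long_addr % QUAD_LONGS)


def quad_fetches(start_long, count):
    """Quad reads needed to run `count` straight-line instructions from start_long."""
    if count <= 0:
        return 0
    first_quad = start_long // QUAD_LONGS
    last_quad = (start_long + count - 1) // QUAD_LONGS
    return last_quad - first_quad + 1
```

So code entered at a quad-aligned address gets all four instructions from a single fetch, while a jump into the middle of a quad only gets the remainder of it.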
Just a random early morning thought but does it matter if all Cores don't have equal access to the RAM?
In other words there are 16 cores which have restricted access to act as intelligent IO and 4 cores which are closely coupled to the RAM in a central block, along with the CORDIC unit, that provide the heavy lifting.
The IO cores are ultra-simple with no support for threading. The central cores have threading capability.
I guess this is just a fixed version of the various slot allocation schemes proposed but it might result in simpler logic.
now we are down to 12-15 MIPs LMM? Bill was still talking about 25 MIPs and Ariba has a thread with 50 MIPs LMM.
The current P1 does 20 MIPS in PASM so C in LMM on the P1+ would be faster than PASM on the P1. Right?
The current customers paying for Parallax use the existing P1. Having a chip with
5 times the ram
5 times the speed,
2 times the pins
ADC/DAC
and NOT eating 5 Watt
Might give them a chance to enhance their products while keeping the Propellers.
remember. The featuritis already ate the P2.
Want to kill P1+ too?
me not.
Mike
Just because we voted for the P1+ didn't mean we really did not want hubexec. But the P2 was not going anywhere soon.
Chip took the P1+ onboard and added in a few P2 features - his choice, not ours - although obviously we embraced it.
Chip, and many others including myself, realise that hubexec, no matter how crippled, is way better than sw LMM. LMM runs 4x slower to begin with, then slower again every time a jump/call/ret occurs (because multiple hub instructions are executed each time), and slower again every time a hub data location is accessed. That is likely where the 12-15 MIPS that Bill is now referring to comes from. BUT, the cog is still executing at at least 4x the power of hubexec mode. Every hub instruction takes 4 cog instructions to execute.
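A back-of-envelope version of that arithmetic, as a Python sketch. The 4 cog instruction slots per LMM instruction comes from the post above and the 100 MIPS cog rate is the figure quoted earlier in the thread. The branch and hub-access fractions and their extra costs are made-up placeholders, not measurements, so treat the second result as illustrative only.

```python
def lmm_mips(cog_mips, branch_frac, branch_extra, hub_frac, hub_extra, base_slots=4):
    """Estimate LMM throughput from the average cog slots spent per LMM instruction."""
    avg_slots = base_slots + branch_frac * branch_extra + hub_frac * hub_extra
    return cog_mips / avg_slots


# Straight-line code only: 100 MIPS cog / 4 slots per LMM instruction.
ideal = lmm_mips(100, 0.0, 0, 0.0, 0)

# Hypothetical mix: 15% branches costing 12 extra slots, 20% hub accesses costing 4 extra.
mixed = lmm_mips(100, 0.15, 12, 0.20, 4)
```

With no branches or hub data accesses the model gives the 25 MIPS best case, and any realistic mix of jumps and hub accesses pulls the figure down from there.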
It is also considered a kludge. A very smart kludge in the P1 as there was otherwise no alternative. The silicon was done!
Here we have the opportunity to discuss the problems and try and find solutions.
IMHO hubexec was the single biggest benefit that the P2 ultimately uncovered. It would be such a shame to waste this without some constructive thoughts, without trashing this.
Otherwise, the P1+ (or whatever it is called) cannot really say with all honesty that it can run 512KB programs without real qualifiers. We are destined to the less than 496 instruction limit.
I realise you may not be able to contribute to valid alternatives, but there are some of us here who have the experience to suggest varying alternatives. And who knows, one of them may just be the seed that opens Chip's mind to a simple alternative. I have seen this often enough to know this works - Chip has even said so!
Is anyone going to be disappointed if we get rid of hardware multitasking in the cogs?
It's a cool feature, but does introduce jitter in tasks, depending on the instruction mix. It also takes some extra flops and logic to support properly, beyond the Z/C/PC's.
I'm thinking about the ROM_Monitor and realizing that I could code it with a single task by doing cooperative multitasking at a few time-critical points in the program. I wouldn't need hardware multitasking, after all.
No multitasking would keep the cogs very simple to understand, and keep them deterministic.
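For readers who have not seen cooperative multitasking, here is a small Python sketch of the general idea using generators. It is not the ROM_Monitor and not PASM. Each task runs until it reaches a yield point it chose itself, then a simple round-robin loop hands control to the next task. The task names and the log list are invented for the example.

```python
from collections import deque


def rx_task(log):
    """Pretend receiver: does a little work, then yields at a point it picks."""
    for i in range(3):
        log.append(f"rx{i}")
        yield  # cooperative hand-over point


def cmd_task(log):
    """Pretend command processor with fewer steps than the receiver."""
    for i in range(2):
        log.append(f"cmd{i}")
        yield


def run_round_robin(tasks):
    """Resume each task in turn until every task has finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task finished, drop it
        ready.append(task)      # otherwise put it at the back of the queue


log = []
run_round_robin([rx_task(log), cmd_task(log)])
```

Because each task decides where it gives up control, the timing between yield points stays fully deterministic, which is the property being discussed here.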
Chip,
I am fine without the hardware multitasking. I've never been a fan of it personally. I think it's much less important now that we have 16 COGs, and like you have said you can still do software based cooperative multitasking as needed.
I know there are some folks that really like it and wanted it... not sure how badly they will miss it...
If I left this reply blank I think everyone would know what I said. Yes, leave it out, we have multiple cores and smart I/O but no silicon so if this makes silicon possible sooner, then please do.
I am fine without multi-tasking, especially with 16 cogs!
Keep the silicon and power for killer features, not niceties that we don't really need.
BTW I don't consider the ROM Monitor essential either. We can always load it up from external flash.
Hardware multi-tasking is not essential, especially now that there are twice as many COGs.
The whole point of hardware multi-tasking was to make better use of those big "expensive" COGs when there were few. Now we have leaner, meaner COGs and there are many.
Does it make HUB exec any easier or is that still the same problem?
If dropping multi-tasking makes hub exec even a little simpler then that's a good reason to drop multi-tasking.
I would imagine that a simple hubexec that is faster than a software LMM loop, but perhaps not as fast as is theoretically possible (given an infinite number of gates and zero power consumption), is still very valuable.
The hardware threads were born out of two factors,
1: The lack of extra Cogs in the Prop2 design.
2: The significant increase in performance and real estate of each Cog.
So, with the now simpler 16 Cogs, that need has diminished.
That said, I'd still be interested in something that could exist as a Cog-only feature. Being able to partition off some fixed amount of MIPS for an assistant task was a nice idea all on its own.
I'd miss tasks. I have three asynchronous tasks running in a single (P3-) cog, and only one of those (video stream) needs to interact with the hub rather than the pins.
If tasks go then the replacement is probably a 4~6 cog "traditional P1 style" cog solution, with key functions divided into one P1+ cog each. It would have a heap of data going via hub, which would be unnecessary, but it'd still work. 4~6 cogs depending on how smart the pins are.
The good news is I'd have more cogs left over now, and perhaps get a P1+ slightly sooner, which has appeal too. But it worries me slightly that we're getting towards the point where two P1's (also giving 16 cogs, 64 pins, but available now) is looking like valid competition. I know that's simplistic.
I haven't looked in depth at software tasking, so I'd be interested in your proposed general approach for the monitor, Chip. I expect there would be lots of comms objects where you have an input stream, an output stream, and a command processor / state machine, and what kind of performance looks achievable with a software approach.
Any numbers ? ie what MHz is attained with/without this enabled, and what is the exact die impact.
Given this is an entirely optional feature, that does bring some important marketing edge, plus an ability to use that expensive COG Memory more fully, as well as Debug & watchdog abilities, it seems the sort of thing to remove only when there is a pressing reason to do so. ( like Area or Power envelopes being hit)
What is the size increase in the ROM monitor, without tasking ?
Not dealing with tasks makes hub exec a little simpler.
- but tasks are optional ? - so it could be that hub exec needs tasks disabled ?
Not every COG is going to need HubExec, and I'm not sure both Hub Exec and Tasks in the same COG is vital, but tasks will allow more to fit into a COG - just look what was packed into a P2 COG with tasks.
Comments
Not quite true.
48 said we would accept the P16X32B or P32X32B as Chip had originally specified it - i.e. without Hubexec (but there may have been one or two preferences for Hubexec in there somewhere).
4 wanted to hold out for the P2, or something else (EDIT: including Hubexec, as msrobots just pointed out).
Enjoy!
Mike
Not me: in fact I suggested something similar just two posts above you.
It's valuable for fine-grained timing that deals with I/O. If we have smart pins, it's needed much less.
In my opinion --->
The simplest possible HubExec has more value than tasks.
As the IC will have 16 COGs/cores, and with the internal I/O port, any task can be handed to a second COG/core.
-Phil
Agree.
Edit: "have" corrected to "drop"