Hrm, that's kinda interesting... like a virtual I/O port?
I've certainly had times when it would have been nice to be able to use portB as a hardware semaphore between cogs... WAITPEQ/WAITPNE and all that.
Anyway, I promised myself I'd not buy into this thread.. too busy with what the existing chip can do rather than getting my juices flowing over what may come in the future.
Firstly, thanks Chip for asking for public feedback. Your continual thinking outside the square is what makes your chip and the forum such a great place to be.
I am disappointed on two fronts:
1. Only 8 cogs, not 16
2. How far away the PropII must be
I was attracted to the Prop because of its unusual design: 8 cogs, Risc, 32 bit, 32 I/O and reasonably fast for a little chip. I did not need these features for my little project, which, by the way got sidetracked because I fell in love with the Prop design. My uses will not sell many props, other than by word of mouth in getting others to try it out, so my input should not carry much weight.
However, the simplicity of just using another cog for each task is the beauty of the Prop. That makes it easy for beginners and education, which is what I believe is currently the target audience. Objects can just be added into another cog and presto, it works. It is fairly easy to understand the code being used.
I did not like the prospect of multitasking within a cog. While I know this will work well (and will eventually happen), the target audience will not be able to understand this type of code. 8 cogs will force this to happen much sooner. The cog RAM size is a limitation, but with 16 cogs this would not have been such an issue. While many of the new instructions will address some of these concerns, they will certainly add to the complexity of understanding the chip. May I suggest the manual be divided into two sections of instructions (basic and advanced).
It is too late, but I would have preferred 16 cogs without the speed improvements. The four way access to cog ram has obviously blown the chip feature size. Unfortunately nothing comes for free - there are always trade-offs.
Possible suggestions:
1. Could one cog (say cog 0) be given hub access on unused cycles. This would mean this cog would have faster access to hub memory for priority tasks at the expense of determinability for this cog. If possible, it would be nice for this to be an option bit in a register. In other words, I would like at least one cog be given extra access to the hub whenever other cogs are not using the bandwidth.
2. I second the idea of separate internal I/O for inter-cog communications.
3. If there is some form of external communications between props, would it make sense for this to be on separate pins and I/O, given the available pin count on the package?
I am sure ultimately you will make the right decision, whatever that may be.
I have one more question on the VIDEO generator/counter.
You have announced R2R DACs in 3 groups in it, and it has a counter in it.
How much work would it take you to add one more mode to it:
R2R DACs in 2 groups of 16 bits, and counter modes to support WAV frequencies?
Then name it the VIDEO/WAV generator/counter.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Nothing is impossible, there are only different degrees of difficulty.
This is why we all love Parallax! Customer involvement! Thanks Chip & Ken.
This has been raised, but I am not clear on the answer. Is it possible to have the "cog-counters" changed to be "chip" counters, i.e. not tied to a specific cog? Any cog could then use any counter(s), up to the programmer, much in the same way all cogs can access (w/r) all pins. Thus you could use some cogs for pure logic/processing etc., and some for pure I/O activities, i.e. A/D, pulse generation, etc. This would make the Prop2 a very flexible IC!
Maybe:
mov phsa,#0 ..... to ..... mov phsq,#0
or,
mov phs1,#0 ..... to ..... mov phs16,#0
or,
mov phsa1,#0 ... to mov phsa2,#0 ... to ... mov phsb1,#0 ... to ... mov phsb2,#0 etc,......
With regard to cogs - I'd be happy with 8! A tutorial on the use of JMPRET would make getting our heads around it much easier and allow us to more easily use the full 100% of the cog time available. Many objects at present use 1 cog = 1 task, and you lose most of the capability of the cog. I'd personally not spend huge amounts of time & effort developing a complicated task switching system....
Oh and don't have a compatibility mode! No more paged memory!!
I've been working out a simple task-switching system which I will post here later today. It will work great, but I wonder how truly useful it will be, given the limitations of task-switched code. For example, each task will have indeterminate timing due to other tasks' RDxxxx/WRxxxx instructions. This will largely preclude things like cycle-accurate access to CTRs, leaving them diminished in value. All the schemes I've thought of to regulate thread timing do so at the expense of execution speed. And forget about doing a WAITCNT - it would temporarily hang every other thread.
I share many of your sentiments about not wanting the cogs to become too complex. In the case of this feature, it doesn't take much silicon to implement, but makes new things possible and efficient. If we make Spin code compile to LMM-type code, a single cog could run up to 16 threads of Spin very quickly, though Spin code could no longer do an accurate WAITCNT. In the case of multi-threaded execution, WAITCNT would have to become an assembly-level macro which would wait for the target count to pass (by a small, but indeterminate amount). But, if you single-thread your Spin execution, you'd get real WAITCNT behavior. For many apps, this doesn't matter.
QUESTION:
If up to 16 threads of Spin can execute on a single cog very quickly, would this alleviate the need many of you have for 16 cogs? In other words, of those 16 cogs, how many were you planning to run Spin on?
Theoretically it multiplies the cogs: 8x16.
If one cog can run 16 PASM tasks, it can act as up to 16 logical cogs for code that is not timing- or counter-critical.
That leaves 7 cogs for time-critical tasks.
P.S. If that system can run the Spin interpreter and my PASM code, each in its own task, it is perfect!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Nothing is impossible, there are only different degrees of difficulty.
I've been thinking through the stuff I've seen here on this thread, projects I plan to do, and the work of others. IMHO, 8 cogs with the added capability here is going to be sweet. Of all the things I think about doing with COG's, supervisor / UI type tasks are the most costly. One COG could now do a number of those, really freeing up the remaining COG's for those things where speed and accuracy really matter. Granularity appears to be the big worry. Something like 16 threads that work in a sane, straightforward way on a COG is gonna relieve that for a lot of use cases. (not every use case is cycle exact, nor demanding of peak speed)
SPIN to LMM is very intriguing!
I wonder about the program size implications. One very nice thing about SPIN is that it's small. There is more RAM coming, of course, but the current SPIN can do a lot with little space. Keeping that intact is worth it, IMHO. Perhaps a port of the SPIN interpreter makes sense? So then, one could write interpreted Spin, or compile it, leaving that a software choice, so that RAM use continues to be flexible?
Hi Chip, I'm standing right about where you are on this. But I think if this is provided, both of us will be surprised how much it's used.
First, there's still a sizable segment of users that don't program in assembly; their only use of assembly is through objects they include in their program. While technically Spin is deterministic, the granularity is mushy to the point that no one really uses Spin as a deterministic language, so the difference won't be noticed.
Personally I don't use waitxxx in spin for anything but generalized waiting which is not time critical. I know this isn't the case with everyone especially spin-only coders, but I won't miss the loss of waitxxx, you can always use traditional software mechanisms (hardcode compare with cnt, manual polling).
JMPRET is still there for assembly coders who want to split the processing time among multiple tasks (Perhaps I should write a template for writing assembly level TS code, bringing the technique front and center rather than leaving it buried in the FDSerial object).
Finally, almost all applications have a mix of time-critical and non-time-critical processes. If the programmer has at their disposal the ability to easily throw all of their non-time-critical processes on a single cog, that will make more cogs available for time-critical processes.
(128 processes, I don't know if anyone will find a use for all of those)
Cluso, I understand your disappointment to hear we are moving away from the 16 cog idea, but with the dramatic increase in cost for providing a full 16 cogs we feel the need to keep more in bounds on the cost.
PS, the one argument I can see overriding all arguments to include TS spin is an answer of yes to the following question: "Does this mechanism destroy the simplicity and easy to understand nature of the Propeller?". Unfortunately I see this as being somewhere on the borderline, and if it doesn't pass the smell test, perhaps it should be abandoned. Ultimately I think it boils down to how straightforward the mechanism is to use it. I'm interested in hearing Jeff's opinion on this scheme.
Ahh, Spin to LMM type code? That sounds good, but would it be through a bytecode-to-LMM built-in micro-JIT, or would the proptool
directly emit LMM codes?
I think multithreading a single LMM Cog is somewhat interesting, but I do believe 90% of the people will never use it (other than
play with it as a gee-whiz thing). And it does leave some nasty corners like waitcnt/waitpne that either need to be explicitly supported,
explicitly disallowed (and checked at compile time), or else there's even more gotchas for people.
I am afraid we're getting far away from the KISS principle that makes the Prop I so attractive. I think 8 fast cogs, 256KB, 160MHz
will blow the doors off the microcontroller world, especially if the low price and accessibility is still there. The more complicated we get,
the less attractive we get.
I think providing a *good* Basic to sit next to Spin on this new chip, ideally with the same level of support (ROM interpreter, LMM
micro-JIT if possible, or LMM-emitting proptool) would do more for the chip than trying to shoehorn some 16-way threading in there.
BTW, a LMM micro-JIT for Spin and/or Basic would be a *far* better use of ROM space than putting any sort of development environment
in there, in my opinion.
Chip, do you have an idea of the slowdown for LMM on the Prop II?
The reason is that I can see 4 uses of a cog:
1. Spin - this is not normally time deterministic, so running threads for Spin would be great
2. ASM but not time deterministic. Many devices have a max access rate but not a min, so running these in a thread wouldn't be a problem, e.g. many I2C/SPI devices. The main issue with these is code size, i.e. running out of cog memory; running these drivers using LMM would remove the memory limitation
3. ASM that needs a min time, e.g. PS2, serial. Again these run into the cog memory limit: a combined key/mouse driver is just under 512 longs, similar with the 4 port serial. Running PS2 under LMM is possible; for serial it depends on the slowdown LMM will give.
4. ASM - time deterministic - run these in a cog by themselves
At least with the stuff I have done, 4 is a small subset of the code running. Allowing threads, especially if LMM is fast, would allow a lot of effective cogs.
Not being able to do RDxxxx/WRxxxx in all tasks would be understandable given the architecture. Having extra threads of code running within a COG along side the hub accessor thread would allow more work to be done without involving more cogs, and this is a win. I doubt you will arbitrate write access to COG memory among threads, and this will likely need to be done in software. If a semaphore or mutex like LOCKxxx was available, arbitration would be easier and more likely ... but that would make tasking bigger.
Not having access to WAITCNT in any COG thread would be a big negative; I assume you mean at least one COG <added> thread </added> can do WAITCNT.
If Spin could do in-line PASM, that would help. Today's PASM interactive requirement eats cogs like a fat man (must be lunch time :o). If Spin could run an LMM snippet, in-line ASM or blocks of ASM would be cake. With Prop-II power and bigger memory, I will spend much more time with other languages though.
One of the problems that can't be overcome regarding the number of COGs problem is the limited memory in each COG. I know LMM helps tremendously and will be even better with Prop-II, but there will still be users who want native PASM speed on some apps, and the only way to get there is to use more COGs cooperatively for the application. What is the projected LMM:PASM speed ratio now?
WAITCNT (or its equivalent) would be pretty hard to sacrifice, since it gets used for so many things. Would it be possible to add a TASKSLP instruction to put a task to sleep for a prescribed number of clocks? That would eliminate the need for most WAITCNTs. If it could be done in such a way that timing errors didn't accumulate, it could eliminate nearly all WAITCNTs. That's harder to do with just one argument, though. Perhaps each task could have its own count register that got augmented by the single argument to TASKSLP. Then the task would be primed for awakening whenever its count register equalled CNT. I guess you'd need a TASKCNT instruction to load the count register initially, as well.
Come to think of it, with the instruction granularity now finer than the clock granularity, is WAITCNT in its present form all that useful? Perhaps WAITCNT could be the two-argument TASKSLP that using the ENC slot doesn't permit. In other words, jettison WAITCNT entirely in favor of TASKSLP.
Phil,
The way that would work would be if the single task degenerative case of TASKSLP worked like WAITCNT. For high speed bit-banging, you still need a WAITCNT functionality with resolution to a single clock cycle. For multiple tasks, you'd need some mechanism for a "fuzzy" system clock compare, particularly with 3 tasks running. For 1, 2, and 4, you'd ignore the appropriate number of low order bits.
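One way to picture the "ignore the low order bits" idea is the rough Python sketch below. It is only an illustration of the suggestion as I read it, not anything Parallax has specified; the function name `fuzzy_match` is made up, and the 3-task case (which is not a power of two) is deliberately left unmodeled.

```python
def fuzzy_match(cnt, target, tasks):
    """Compare CNT with target while ignoring log2(tasks) low-order bits.

    Hypothetical helper: only the 1, 2 and 4 task cases from the post
    are modeled, since those map onto whole bits.
    """
    if tasks not in (1, 2, 4):
        raise ValueError("only power-of-two task counts are modeled")
    shift = tasks.bit_length() - 1   # 1 -> 0 bits, 2 -> 1 bit, 4 -> 2 bits
    return (cnt >> shift) == (target >> shift)
```

With one task the compare is exact; with four tasks any CNT value in the same group of four clocks as the target counts as a match.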
jazzed said...
Not having access to WAITCNT in any COG thread would be a big negative; I assume you mean at least one COG <added> thread </added> can do WAITCNT.
Because waitxxx is a hardware function, allowing its use in TS Spin would end up freezing all tasks until the waitxxx condition cleared. While this might have its applications, it's not the behavior most people would expect.
jazzed said...
Not being able to do RDxxxx/WRxxxx in all tasks would be understandable given the architecture.
You could do RDxxxx/WRxxxx instructions in every thread, no problem. The side effect, though, is timing indeterminacy for each thread. Assume an 8-thread set: if each thread executed single-cycle instructions, the thread-to-thread delay would only be 8 clocks. Now imagine if every thread happened to execute a RDLONG in the same round; the thread-to-thread delay would jump to 64 clocks (8 clocks per RDLONG times 8 threads). You could set up 7 threads and then have one always do a RDxxxx/WRxxxx instruction (takes only two cycles, once aligned). This way, one thread could handle all the hub memory R/W's while the other 6 had deterministic timing, as long as they stuck to single-cycle instructions (easy).
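To make the arithmetic above concrete, here is a rough Python model of one round of round-robin task switching. It is a sketch, not Parallax code: it assumes, as the post states, 1 clock for an ordinary instruction and 8 clocks for a hub access, and the names `COST` and `round_delay` are invented for the illustration.

```python
# Hypothetical model of one round-robin pass over the threads in a cog.
# Clock costs are taken from the post above: 1 clock for a normal
# instruction, 8 clocks for a RDxxxx/WRxxxx that has to wait for the hub.
COST = {"alu": 1, "hub": 8}

def round_delay(instructions):
    """Clocks for one pass in which each thread executes one instruction."""
    return sum(COST[kind] for kind in instructions)

best = round_delay(["alu"] * 8)    # every thread runs a single-cycle instruction
worst = round_delay(["hub"] * 8)   # every thread hits a RDLONG in the same round
print(best, worst)
```

The gap between the two figures is the timing uncertainty each thread would see if hub accesses were allowed everywhere.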
Having extra threads of code running within a COG along side the hub accessor thread would allow more work to be done without involving more cogs, and this is a win. I doubt you will arbitrate write access to COG memory among threads, and this will likely need to be done in software. If a semaphore or mutex like LOCKxxx was available, arbitration would be easier and more likely ... but that would make tasking bigger.
Not having access to WAITCNT in any COG thread would be a big negative; I assume you mean at least one COG <added> thread </added> can do WAITCNT.
No threads are special. Any thread could do a WAITCNT, it would just hang everything until satisfied.
If Spin could do in-line PASM, that would help. Today's PASM interactive requirement eats cogs like a fat man (must be lunch time :o). If Spin could run an LMM snippet, in-line ASM or blocks of ASM would be cake. With Prop-II power and bigger memory, I will spend much more time with other languages though.
One of the problems that can't be overcome regarding the number of COGs problem is the limited memory in each COG. I know LMM helps tremendously and will be even better with Prop-II, but there will still be users who want native PASM speed on some apps, and the only way to get there is to use more COGs cooperatively for the application. What is the projected LMM:PASM speed ratio now? With 8 cogs, the LMM:PASM speed ratio would be 1:8.
Paul, can you clarify "not the behaviour most people would expect"? What is "TS Spin"? Not having any way at all to wait for a certain amount of time to pass would make it useless to me. Java has a wait that can be used by any object with a matching thread-id, though that is not so comparable. Methinks we're not communicating.
Added: since Chip's reply came about the same time I sent the paragraph above:
Chip says: "No threads are special. Any thread could do a WAITCNT, it would just hang everything until satisfied."
This of course makes sense per COG, Chip; I thought you meant something else.
Not having WAITCNT in the language syntax would make it impossible to have the same module work on both devices; while this will be less important in time, it may be unwelcome by many.
Mike Green said...
For high speed bit-banging, you still need a WAITCNT functionality with resolution to a single clock cycle.
WAITCNT is only necessary in the Prop I because there are multiple clocks per instruction. In the Prop II, there are two instructions per clock, and the clock rate can be as high as 160MHz. If the CNT register were clocked at clkfreq/2, the function of WAITCNT could be done entirely in software (single task mode) in the Prop II (ignoring sub-CNT-frequency jitter). ("clkfreq/2" is only necessary since JMP xxx in the Prop II flushes the pipe when the jump is taken.) The hardware WAITCNT would no longer be needed, and its opcode slot could be taken up by a two-argument TASKSLP instruction. In single-task mode, TASKSLP would have the same time granularity as WAITCNT would have had.
So with a LMM:PASM ratio of 1:8, you can run LMM at the same speed as existing propI cog, with threading you can run cogs using LMM 4x slower than an existing cog. This for example would allow you to run keyboard, mouse and 1 serial port (@115K). The drivers would all need to run using LMM to get them into the cog memory. You might be able to make it simpler by just moving the keyboard lookup table into hub memory, that I believe would free enough cog memory to fit the serial port driver in as well. In that case you wouldn't need to use LMM.
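The estimate above can be checked with a few lines of Python. The 160 MIPS Prop II figure and the 1:8 LMM:PASM ratio are taken from this thread; the Prop I figure assumes an 80 MHz clock at 4 clocks per instruction, and the 4-thread split is my assumption, chosen because it reproduces the "4x slower" number.

```python
# Back-of-envelope throughput figures (a sketch, not measured data).
PROP2_MIPS = 160          # one instruction per clock at 160 MHz (from the thread)
LMM_RATIO = 8             # LMM:PASM speed ratio of 1:8 (from the thread)
PROP1_MIPS = 80 / 4       # assumed: 80 MHz Prop I, 4 clocks per instruction
THREADS = 4               # assumed thread count for the "4x slower" case

lmm_mips = PROP2_MIPS / LMM_RATIO        # LMM speed of a single-threaded cog
per_thread_mips = lmm_mips / THREADS     # LMM speed of each thread

print(lmm_mips, PROP1_MIPS, per_thread_mips)
```

Under these assumptions a single LMM thread matches a native Prop I cog, and each of four LMM threads runs at a quarter of that.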
Steve, Chip has a better response to your questions. TS Spin = task-switching version of Spin, just a means of talking about it without having to write the full description each time. All of the waitxxx instructions (waitcnt/waitpeq/waitpne) in Spin have a direct hardware mechanism which supports them. This hardware behavior will be identical in TS Spin, meaning it will stop all cog activity until the waitxxx condition clears. In TS Spin this would mean that all tasks pause, not just the task that called it. In task switching this isn't what would be expected; most people would expect a waitxxx to pause only that task, not all of them.
There are always ways around the loss of these functions. After all, other microcontrollers don't have these instructions; you would just have to resort to the old ways of performing these tasks. Yes, there's more slop in handling the functionality (i.e. for waitcnt, do a comparison of cnt with targetcnt, while checking boundaries), but they are still achievable.
Mike Green said...
Phil,
The way that would work would be if the single task degenerative case of TASKSLP worked like WAITCNT. For high speed bit-banging, you still need a WAITCNT functionality with resolution to a single clock cycle. For multiple tasks, you'd need some mechanism for a "fuzzy" system clock compare, particularly with 3 tasks running. For 1, 2, and 4, you'd ignore the appropriate number of low order bits.
For threaded code, instead of a pure WAITCNT, you'd have to do the following:
1) compute what CNT target value you were looking for (a single ADD, usually)
2) keep computing (CNT - target) until the result MSB was clear:
wait    mov     x,CNT
        sub     x,target
        shl     x,#1      wc   'get x[31] into c
if_c    jmp     #wait
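The reason for testing the MSB of (CNT - target) rather than comparing the two values directly is that CNT wraps around at 32 bits. The Python sketch below simulates the same test; the function names and the simulated `step` (standing in for the clocks that pass between polls) are assumptions for illustration, and the test is only meaningful while the target is less than 2^31 counts ahead.

```python
MASK32 = 0xFFFFFFFF

def still_waiting(cnt, target):
    """True while CNT has not yet reached target, using 32-bit wraparound.

    Mirrors the loop above: compute (CNT - target) and test bit 31.
    """
    diff = (cnt - target) & MASK32
    return bool(diff >> 31)

def wait_until(target, start_cnt, step=1):
    """Advance a simulated CNT until the target has passed; return final CNT."""
    cnt = start_cnt & MASK32
    while still_waiting(cnt, target):
        cnt = (cnt + step) & MASK32
    return cnt
```

When the poll interval is more than one clock, the loop exits slightly after the target, which is the small, indeterminate overshoot mentioned earlier in the thread.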
Well, if we had *native* support for threading/task switching, and the support was turned "on" for a threaded cog, any
WAIT* instruction would instead simply disable the writeback portion of the instructions being executed until the next
task switch instruction, and on the task switch restore the PC to the instruction that did the wait. This way the WAIT
instructions would become *approximate* but they'd still pretty much work. (They may, of course, miss narrow
pulses in this mode . . .)
Paul Baker (Parallax) said...
... In task switching this isn't what would be expected, most people would expect a waitxxx to pause only that task, not all of them.
Indeed .... as I mentioned in the Java thread analogy in the post to which you were replying, which you apparently missed for some reason. Methinks we assume too much understanding of others' meanings in the forum sometimes.
Mike Green said...
For high speed bit-banging, you still need a WAITCNT functionality with resolution to a single clock cycle.
WAITCNT is only necessary in the Prop I because there are multiple clocks per instruction. In the Prop II, there are two instructions per clock, and the clock rate can be as high as 160MHz. If the CNT register were clocked at clkfreq/2, the function of WAITCNT could be done entirely in software (single task mode) in the Prop II (ignoring sub-CNT-frequency jitter). ("clkfreq/2" is only necessary since JMP xxx in the Prop II flushes the pipe when the jump is taken.) The hardware WAITCNT would no longer be needed, and its opcode slot could be taken up by a two-argument TASKSLP instruction. In single-task mode, TASKSLP would have the same time granularity as WAITCNT would have had.
-Phil
Phil, on the Prop II there will be only one instruction per clock, for a max of 160 MIPS per cog.
What about on the fly substitution? The interpreter checks if it's in task switching mode, if not execute WAITCNT, if yes execute WAITTSK (or whatever it ends up being called).
Sorry, I thought I remembered two instructions per clock for 320 MIPS. What I read probably implied two instructions for each equivalent Prop I clock, instead.
Anyway, here's how I see TASKSLP working:
1. In addition to a PCsave register, cc bits, and enable bit, each task would have a "sleep" bit and a count register.
2. When the task is started, its enable bit would be set and its sleep bit cleared, indicating that its next instruction can be executed whenever its turn comes up.
3. When a task executes a TASKSLP D,S instruction, D would get loaded into the task's count register, then S would be added to D. (TASKSLP uses the same opcode slot now occupied by WAITCNT, which allows two arguments.) Then, the enable bit would get cleared and the sleep bit set.
4. The hardware would monitor each task's count register for a match with CNT. Whenever a match occurs, and if sleep is set, it clears sleep and sets enable.
5. A task slot would only be available for a TASKNEW if both enable and sleep are clear. (TASKEND would clear both bits.)
For a single-task cog, TASKSLP would have the same time granularity (one clock/one instruction) as WAITCNT would have had, thus obviating the need for a separate WAITCNT.
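The five steps above can be modeled in a few lines of Python to check that the scheme wakes a task on a fixed period without accumulating error. This is only a sketch of the proposal as described: TASKSLP, TASKNEW and TASKEND are proposed instructions, not existing ones, and every name below (`Task`, `task_new`, `task_slp`, `tick`, `wake_times`) is hypothetical. The time the task spends executing between wake-ups is ignored.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF

@dataclass
class Task:
    enable: bool = False   # task may execute on its next turn
    sleep: bool = False    # task is waiting for its count register to match CNT
    count: int = 0         # per-task count register

def task_new(task):
    """Steps 2 and 5: start a task only if its slot is completely free."""
    if task.enable or task.sleep:
        raise RuntimeError("task slot is busy")
    task.enable, task.sleep = True, False

def task_slp(task, d, s):
    """Step 3: load D into the count register, sleep, and return D + S."""
    task.count = d & MASK32
    task.enable, task.sleep = False, True
    return (d + s) & MASK32   # updated D, ready for the next TASKSLP

def tick(tasks, cnt):
    """Step 4: hardware compare of every count register against CNT."""
    for t in tasks:
        if t.sleep and t.count == (cnt & MASK32):
            t.sleep, t.enable = False, True

def wake_times(first, period, wakes):
    """Clock values at which a task that keeps calling TASKSLP wakes up."""
    task = Task()
    task_new(task)
    d = task_slp(task, first, period)
    cnt, woke = 0, []
    while len(woke) < wakes:
        tick([task], cnt)
        if task.enable:                 # the task was woken on this clock
            woke.append(cnt)
            d = task_slp(task, d, period)
        cnt += 1
    return woke
```

Because D is advanced by S on every call rather than being recomputed from CNT, each wake-up lands exactly one period after the previous target.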
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again BTW: I type as I'm thinking, so please don't take any offence at my writing style
A quick look at some of my code: all three robots use waitcnt. The NES controller uses waitcnt. The Spyglass object (an LCD/switches board) and XBee do as well. The Ping object uses waitpeq and waitpne. I haven't looked at how hard it would be to get around all tasks blocking on one task's call to a waitxxx.
Comments
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Pull my finger!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha
Post Edited (Sapieha) : 9/1/2008 3:22:56 PM GMT
Cheers all,
James
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha
Post Edited (Sapieha) : 9/1/2008 7:27:24 PM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
Chat in real time with other Propellerheads on IRC #propeller @ freenode.net
Post Edited (potatohead) : 9/1/2008 5:50:52 PM GMT
First there's still a sizable segment of users that don't program in assembly, thier only use of assembly is through objects they include in thier program. While technically Spin is deterministic, the granularity is mushy to the point that no-one really uses spin as a deterministic language, so the difference won't be noticed.
Personally I don't use waitxxx in spin for anything but generalized waiting which is not time critical. I know this isn't the case with everyone especially spin-only coders, but I won't miss the loss of waitxxx, you can always use traditional software mechanisms (hardcode compare with cnt, manual polling).
JMPRET is still there for assembly coders who want to split the processing time among multiple tasks (Perhaps I should write a template for writing assembly level TS code, bringing the technique front and center rather than leaving it buried in the FDSerial object).
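As a rough analogy only (this is Python, not PASM, and not the FDSerial code), the JMPRET style of splitting one cog among tasks can be pictured as cooperative coroutines: each task runs until it reaches a switch point and then hands control to the next one. The names `worker` and `run_round_robin` are made up for this sketch.

```python
def worker(log, name, steps):
    """A task that does one unit of work, then yields at its switch point."""
    for i in range(steps):
        log.append((name, i))   # the "work" done in this time slice
        yield                   # switch point: give the cog to the next task


def run_round_robin(tasks):
    """Resume each task in turn until every task has finished."""
    tasks = list(tasks)
    while tasks:
        for task in list(tasks):
            try:
                next(task)              # run the task up to its next yield
            except StopIteration:
                tasks.remove(task)      # task finished, free its slot


log = []
run_round_robin([worker(log, "rx", 2), worker(log, "tx", 3)])
print(log)
```

Nothing preempts a task here; a task that never reaches a switch point starves the others, which is the same caveat that applies to a blocking wait inside a shared cog.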
Finally, almost all applications have a mix of time critical and time non-critical processes. If the programmer has at their disposal the ability to easily throw all of their time non-critical processes on a single cog, that will make more cogs available for time critical processes.
(128 processes, I don't know if anyone will find a use for all of those)
Cluso, I understand your disappointment to hear we are moving away from the 16 cog idea, but with the dramatic increase in cost for providing a full 16 cogs we feel the need to keep more in bounds on the cost.
PS, the one argument I can see overriding all arguments to include TS spin is an answer of yes to the following question: "Does this mechanism destroy the simplicity and easy to understand nature of the Propeller?". Unfortunately I see this as being somewhere on the borderline, and if it doesn't pass the smell test, perhaps it should be abandoned. Ultimately I think it boils down to how straightforward the mechanism is to use. I'm interested in hearing Jeff's opinion on this scheme.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Post Edited (Paul Baker (Parallax)) : 9/1/2008 6:32:27 PM GMT
directly emit LMM codes?
I think multithreading a single LMM Cog is somewhat interesting, but I do believe 90% of the people will never use it (other than
play with it as a gee-whiz thing). And it does leave some nasty corners like waitcnt/waitne that either need to be explicitly supported,
explicitly disallowed (and checked at compile time), or else there's even more gotchas for people.
I am afraid we're getting far away from the KISS principle that makes the Prop I so attractive. I think 8 fast cogs, 256KB, 160MHz
will blow the doors off the microcontroller world, especially if the low price and accessibility is still there. The more complicated we get,
the less attractive we get.
I think providing a *good* Basic to sit next to Spin on this new chip, ideally with the same level of support (ROM interpreter, LMM
micro-JIT if possible, or LMM-emitting proptool) would do more for the chip than trying to shoehorn some 16-way threading in there.
BTW, a LMM micro-JIT for Spin and/or Basic would be a *far* better use of ROM space than putting any sort of development environment
in there, in my opinion.
The reason is that I can see 4 uses of a cog
1. Spin - this is not normally time deterministic, so running threads for spin would be great
2. ASM but not time deterministic: many devices have a max access rate but not a min, so running these in a thread wouldn't be a problem, e.g. many I2C/SPI devices. The main issue with these is code size, i.e. running out of cog memory; running these drivers using LMM would remove the memory limitation
3. ASM that needs a min time, e.g. PS2, serial. Again these run into the cog memory limit: a combined key/mouse is just under 512 longs, similar with the 4 port serial. Running PS2 under LMM is possible; for serial it depends on the slowdown LMM will give.
4. ASM - time deterministic - run these in a cog by themselves
At least with the stuff I have done, 4 is a small subset of the code running. Allowing threads, esp. if LMM is fast, would allow a lot of effective cogs.
Not having access to WAITCNT in any COG thread would be a big negative; I assume you mean at least one COG <added> thread </added> can do WAITCNT.
If spin could do in-line PASM, that would help. Today's PASM interactive requirement eats cogs like a fat man (must be lunch time :o). If spin could run an LMM snippet, in-line ASM or blocks of ASM would be cake. With Prop-II power and bigger memory, I will spend much more time with other languages though.
One of the problems that can't be overcome regarding the number of COGs problem is the limited memory in each COG. I know LMM helps tremendously and will be even better with Prop-II, but there will still be users who want native PASM speed on some apps, and the only way to get there is to use more COGs cooperatively for the application. What is the projected LMM:PASM speed ratio now?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Post Edited (jazzed) : 9/1/2008 6:08:12 PM GMT
WAITCNT (or its equivalent) would be pretty hard to sacrifice, since it gets used for so many things. Would it be possible to add a TASKSLP instruction to put a task to sleep for a prescribed number of clocks? That would eliminate the need for most WAITCNTs. If it could be done in such a way that timing errors didn't accumulate, it could eliminate nearly all WAITCNTs. That's harder to do with just one argument, though. Perhaps each task could have its own count register that got augmented by the single argument to TASKSLP. Then the task would be primed for awakening whenever its count register equalled CNT. I guess you'd need a TASKCNT instruction to load the count register initially, as well.
Come to think of it, with the instruction granularity now finer than the clock granularity, is WAITCNT in its present form all that useful? Perhaps WAITCNT could be the two-argument TASKSLP that using the ENC slot doesn't permit. In other words, jettison WAITCNT entirely in favor of TASKSLP.
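To make the "timing errors didn't accumulate" point concrete, here is a small Python sketch (a simulation, not Propeller code) comparing two ways of scheduling a periodic wake-up. The function names and the `work` list of per-cycle processing times are assumptions invented for illustration; the only idea taken from the post is that a per-task count register is advanced by the sleep argument rather than recomputed from the current CNT.

```python
MASK32 = 0xFFFFFFFF  # CNT is a 32-bit counter, so all arithmetic wraps


def wake_times_relative(start, period, work):
    """Sleep 'period' clocks measured from whenever the work finished."""
    now = start
    times = []
    for clocks_busy in work:
        now = (now + clocks_busy + period) & MASK32  # error from work piles up
        times.append(now)
    return times


def wake_times_absolute(start, period, work):
    """Advance a stored target by 'period' each cycle (count-register style)."""
    target = start
    times = []
    for _clocks_busy in work:                 # work time is absorbed, as long
        target = (target + period) & MASK32   # as it is shorter than period
        times.append(target)
    return times


work = [10, 30, 20]  # hypothetical clocks spent working in each cycle
print(wake_times_relative(0, 1000, work))
print(wake_times_absolute(0, 1000, work))
```

The second form only stays exact while each cycle's work finishes before the next target; if the work overruns, the target is already in the past when the task goes back to sleep.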
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!
Post Edited (Phil Pilgrim (PhiPi)) : 9/1/2008 6:15:55 PM GMT
The way that would work would be if the single-task degenerate case of TASKSLP worked like WAITCNT. For high speed bit-banging, you still need a WAITCNT functionality with resolution to a single clock cycle. For multiple tasks, you'd need some mechanism for a "fuzzy" system clock compare, particularly with 3 tasks running. For 1, 2, and 4, you'd ignore the appropriate number of low order bits.
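A minimal Python sketch of the "ignore the low order bits" compare, under the assumption that a task in an N-task cog only gets to look at CNT once every N clocks. The function `fuzzy_match` is hypothetical; it only handles 1, 2 and 4 tasks, since those are the cases where masking bits gives a window exactly as wide as the sampling interval.

```python
MASK32 = 0xFFFFFFFF


def fuzzy_match(cnt, target, tasks):
    """Compare cnt with target, ignoring log2(tasks) low-order bits."""
    if tasks not in (1, 2, 4):
        # 3 tasks has no power-of-two window; it needs some other mechanism.
        raise ValueError("bit masking only covers 1, 2 or 4 tasks")
    ignored_bits = tasks.bit_length() - 1          # 1 -> 0, 2 -> 1, 4 -> 2
    mask = MASK32 & ~((1 << ignored_bits) - 1)     # clear the ignored bits
    return (cnt & mask) == (target & mask)


# With 4 tasks a task samples CNT every 4th clock; whatever its phase,
# exactly one of its samples falls inside the 4-clock window.
for phase in range(4):
    samples = range(96 + phase, 112, 4)
    hits = [c for c in samples if fuzzy_match(c, 100, 4)]
    print(phase, hits)
```

With 3 tasks a 4-clock window could be hit twice by samples 3 clocks apart, and a 2-clock window could be missed entirely, which is why that case is called out as the awkward one.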
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Added: since Chip's reply came about the same time I sent the paragraph above:
Chip Says: "No threads are special. Any thread could do a WAITCNT, it would just hang everything until satisfied"
This of course makes sense per COG, Chip; I thought you meant something else.
Not having WAITCNT in the language syntax would make it impossible to have the same module work on both devices; while this will be less important in time, it may be unwelcome by many.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Post Edited (jazzed) : 9/1/2008 7:06:49 PM GMT
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!
Post Edited (Phil Pilgrim (PhiPi)) : 9/1/2008 6:56:48 PM GMT
There are always ways around the loss of these functions. After all, every other microcontroller doesn't have these instructions; you would just have to resort to the old ways of performing these tasks. Yes, there's more slop in handling the functionality (i.e. for waitcnt, do a comparison of cnt with targetcnt, while checking boundaries), but they are still achievable.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Post Edited (Paul Baker (Parallax)) : 9/1/2008 7:14:41 PM GMT
1) compute what CNT target value you were looking for (a single ADD, usually)
2) keep computing (CNT - target) until the result MSB was clear:
wait    mov     x,CNT
        sub     x,target
        shl     x,#1      wc    'get x[31] into c
if_c    jmp     #wait
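The same test can be sketched in Python to show why looking at the MSB of (CNT - target) survives counter wraparound where a plain "CNT >= target" would not. This is a simulation only; `make_counter` is a made-up stand-in for reading CNT, and the step of 4 clocks per poll is an arbitrary assumption.

```python
MASK32 = 0xFFFFFFFF


def target_reached(cnt, target):
    """True once bit 31 of (cnt - target), taken modulo 2**32, is clear."""
    diff = (cnt - target) & MASK32
    return (diff >> 31) == 0


def wait_until(read_cnt, target):
    """Poll the counter until the target is reached; return the poll count."""
    polls = 0
    while not target_reached(read_cnt(), target):
        polls += 1
    return polls


def make_counter(start, step):
    """Hypothetical CNT source that advances 'step' clocks per read."""
    state = {"cnt": start & MASK32}

    def read():
        value = state["cnt"]
        state["cnt"] = (value + step) & MASK32
        return value

    return read


# Target lies just past the 32-bit rollover: 0xFFFFFFF0 + 0x20 wraps to 0x10.
print(wait_until(make_counter(0xFFFFFFF0, 4), 0x10))
```

The test treats any target up to 2**31 clocks ahead as "not yet reached", so it only works for waits shorter than half the counter range.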
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
WAIT* instruction would instead simply disable the writeback portion of the instructions being executed until the next
task switch instruction, and on the task switch restore the PC to the instruction that did the wait. This way the WAIT
instructions would become *approximate* but they'd still pretty much work. (They may, of course, miss narrow
pulses in this mode . . .)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Can you elaborate on how TASKSLP would work?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Sorry, I thought I remembered two instructions per clock for 320 MIPS. What I read probably implied two instructions for each equivalent Prop I clock, instead.
Anyway, here's how I see TASKSLP working:
1. In addition to a PCsave register, cc bits, and enable bit, each task would have a "sleep" bit and a count register.
2. When the task is started, its enable bit would be set and its sleep bit cleared, indicating that its next instruction can be executed whenever its turn comes up.
3. When a task executes a TASKSLP D,S instruction, D would get loaded into the task's count register, then S would be added to D. (TASKSLP uses the same opcode slot now occupied by WAITCNT, which allows two arguments.) Then, the enable bit would get cleared and the sleep bit set.
4. The hardware would monitor each task's count register for a match with CNT. Whenever a match occurs, and if sleep is set, it clears sleep and sets enable.
5. A task slot would only be available for a TASKNEW if both enable and sleep are clear. (TASKEND would clear both bits.)
For a single-task cog, TASKSLP would have the same time granularity (one clock/one instruction) as WAITCNT would have had, thus obviating the need for a separate WAITCNT.
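The five steps above can be sketched as a small Python state model. This is a guess at the bookkeeping only, not a description of any real hardware: the class names, the 4-slot default, and the one-clock `tick` method are all assumptions, and TASKSLP, TASKNEW and TASKEND are the proposed instructions from this thread.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass
class TaskSlot:
    enable: bool = False   # task may execute when its turn comes up
    sleep: bool = False    # task is parked until its count matches CNT
    count: int = 0         # per-task wake-up value


class Cog:
    def __init__(self, slots=4):            # slot count is an assumption
        self.cnt = 0
        self.tasks = [TaskSlot() for _ in range(slots)]

    def tasknew(self):
        """Step 5: a slot is free only if enable and sleep are both clear."""
        for index, task in enumerate(self.tasks):
            if not task.enable and not task.sleep:
                task.enable, task.sleep = True, False    # step 2
                return index
        return None

    def taskend(self, index):
        self.tasks[index].enable = False
        self.tasks[index].sleep = False

    def taskslp(self, index, d, s):
        """Step 3: load D into count, park the task, return the new D (D+S)."""
        task = self.tasks[index]
        task.count = d & MASK32
        task.enable, task.sleep = False, True
        return (d + s) & MASK32

    def tick(self):
        """Step 4: advance CNT one clock and wake any matching sleeper."""
        self.cnt = (self.cnt + 1) & MASK32
        for task in self.tasks:
            if task.sleep and task.count == self.cnt:
                task.sleep, task.enable = False, True


cog = Cog(slots=2)
a = cog.tasknew()
next_d = cog.taskslp(a, 5, 10)   # sleep until CNT == 5, next target is 15
for _ in range(5):
    cog.tick()
print(cog.tasks[a], next_d)
```

Returning the updated D mirrors how WAITCNT leaves the next target in its destination register, so a loop can call `taskslp` again with the returned value to get evenly spaced wake-ups.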
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!
I do hope these are optional - it's all sounding a bit scary to me (as a Spin-only programmer at present)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon
www.norfolkhelicopterclub.com
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again
BTW: I type as I'm thinking, so please don't take any offence at my writing style
You might want to rethink the name "TS Spin". It could acquire a bad connotation among users who find it ... um ... "tough" to understand.
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!
""
Mike Green said...
For high speed bit-banging, you still need a WAITCNT functionality with resolution to a single clock cycle.
""
Chip, if there is a fully programmable serialiser/deserialiser in every COG, then bit-banging is, in my opinion, a little off topic.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha
John Abshier
If multiple cog threads are easy to implement, then please do.
All,
For those who don't like the waitcnt limitation, then don't use threads. It's that simple.
EDIT: However, I think Phil's TASKSLP suggestion is the solution.
Mark
Post Edited (Mark Swann) : 9/1/2008 8:26:51 PM GMT