I didn't forget. It was the function syntax that held me up. Given the posted program, the entire program, how do I make it into a function? I'm interested in running the entire program in each cog, not just some part of it.
What exactly do you need? The example program will exit after having flashed the LEDs 4 times each. Not exactly worth throwing cogs at. Or does that not matter?
I'm still not sure about the syntax, and the code is missing something in the top object. What I want it to do: the top object launches the example object in each cog, which blinks one LED on the pin equal to the cog number. All cogs except one are disabled by commenting them out. That way, I can uncomment the remaining cogs step by step (to test each one) and verify it's working.
Right now LED 0 is flashing, but the LEDs on pins 1 - 7 remain on.
You need separate stack space for each cog. That's why I used long stack[8 * 20] and address offsets @stack[ID * 20]. You are allowed to copy my code. It's tested.
e.g. cog 0 gets stack at @stack[0], cog 6 at @stack[120].
Would something like this be more simple? - attached
In example2.spin you're still using the same stack for all coginit calls. You CAN'T do that (besides stacks grow upwards in SPIN so you'd have to use @stack[0] anyway). Otherwise it's fine (you didn't use cog 0 though).
Kuroneko, but the code automatically loads into cog 0 and if cog 0 is initialized then the program won't work, right?
I read page 78 of the book about StackPointer and do not know how to set it. There are several examples, one with 6, one with 10, one with 20, so I don't know how large to make the stack.
So for cogs 0 through 7, what are the respective stack numbers and how do you determine it? If 0 is used, then no stack space is set aside, right?
Kuroneko, but the code automatically loads into cog 0 and if cog 0 is initialized then the program won't work, right?
That's why cog 0 (or rather cogid) is initialised last.
start := cogid - 7
repeat 8
  ID := start++ & 7
  ...
The first line makes sure we deal with the other 7 cogs first. Assuming we run in cog 0, start is -7, which gives us the sequence 1 (== -7 & 7), 2, 3, 4, 5, 6, 7 and finally 0. You wanted the program to run in all 8 cogs, so you have to recycle the current cog somehow.
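To check the masking trick for a starting cog other than 0, here is a small Python model of the loop (not Propeller code; the function name launch_order is made up for this sketch). It only reproduces the arithmetic of start := cogid - 7 and ID := start++ & 7.

```python
def launch_order(current_cog, cogs=8):
    """Return the cog IDs in the order the loop would (re)start them."""
    start = current_cog - (cogs - 1)      # mirrors: start := cogid - 7
    order = []
    for _ in range(cogs):                 # mirrors: repeat 8
        order.append(start & (cogs - 1))  # mirrors: ID := start++ & 7
        start += 1
    return order


print(launch_order(0))
print(launch_order(3))
```

Whichever cog runs the loop, its own ID comes out last, so it is only recycled after the other seven have been started.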
So for cogs 0 through 7, what are the respective stack numbers and how do you determine it? If 0 is used, then no stack space is set aside, right?
cognew/coginit take an address which points to a number of longs (e.g. 20). You can declare 8 stack arrays by name (e.g. stack0[20], stack1[20] etc) or use one array and point at addresses in that array (e.g. @stack[0], @stack[20] etc).
coginit(ID, launch(ID), @stack[ID * length])
Cog N is assigned stack starting at long N*20 + 0, ending at N*20 + 19.
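The slicing of one shared array can be tabulated with a short Python sketch (again just a model of the index arithmetic, not Spin; stack_slice and stack_byte_offset are hypothetical helper names). It assumes 20 longs per cog as in the post, and uses the fact that a Propeller long is 4 bytes.

```python
STACK_LONGS = 20     # longs reserved per cog, as in the post
BYTES_PER_LONG = 4   # a Propeller long is 32 bits


def stack_slice(cog_id, length=STACK_LONGS):
    """First and last long index of the slice that cog_id would use."""
    first = cog_id * length
    return first, first + length - 1


def stack_byte_offset(cog_id, length=STACK_LONGS):
    """Byte distance of @stack[cog_id * length] from @stack[0]."""
    return cog_id * length * BYTES_PER_LONG


for cog in range(8):
    first, last = stack_slice(cog)
    print(f"cog {cog}: stack[{first}]..stack[{last}], "
          f"byte offset {stack_byte_offset(cog)}")
```

Each slice starts exactly one long after the previous one ends, which is why no two cogs share stack space.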
Kuroneko, your post #5 is exactly what I needed. In my quest for eight cogs running the same app, I had not considered all that you have, nor was my implementation as organized.
BTW, for my particular app, the Propeller is a bargain. Compared with an ARM Cortex M3 running highly optimized C code, a fully-deployed Propeller running the same algorithm in PASM is 5.7 times faster.
Meanwhile, the respective chips are within $1 of each other, the ARM being slightly more. So, if you can tolerate PASM and no JTAG, the Propeller conquers!
The core algorithm is entirely PASM and makes good use of Prop's instruction set - it's a good fit. And it's small enough to fit into a COG with room to spare for a couple of background tasks that keep everything clicking away.
I really like the Cortex M3 core. I was stunned from the outset by how fast it could execute compiled C code, and how well the compiler optimizations worked. I'm told that the Cortex M3 was specifically designed to run C. It shows.
So for a Prop to be, in aggregate, 5.7 times faster than a 100 MHz, 125 MIPS Cortex M3 is a real head spinner. It just shows that for the right app, the Prop is amazing.
As a final note, I still have as an exercise an attempt to unfold the C code further in order to avoid the relatively stiff penalty incurred anytime branching exceeds a certain distance. I commend Chip for optimizing conditional jumps on the Prop in favor of taking the jump. That's a real boon. I also like the NR effect, reminiscent of the PIC. That saves time, too.
Edit: Your questions need better answers. Two obvious ways in which the Cortex M3 trounces the Prop are in memory size and single-cycle integer multiplication. The particular algorithm I used in this comparison does not place much demand on memory and does not employ multiplication. That's why I say, "...for the right app..." I'm fortunate to have a useful task to occupy the combined throughput of several Propellers. And fortunate to have ARM chips for certain other tasks.
Kuroneko, thank you for the explanation - the code is working after making the changes outlined. Very nice! Yes, I always test code before posting - it was blinking the first LED and the others remained on. Just curious, how do you handle debugging your code?
The core algorithm is entirely PASM and makes good use of Prop's instruction set - it's a good fit. And it's small enough to fit into a COG with room to spare for a couple of background tasks that keep everything clicking away.
Could you show it to us (if permitted) so we can take it apart?
@kuroneko: Well, I've been sort of secretive about what I'm doing simply because its unknown nature is part of its value. But I can say that most of the things I work on were originally inspired by Donald Knuth's "The Art of Computer Programming."
Seems that the more I delve into numbers, the more interesting they become. Computers make all kinds of things possible. With the current MIPS/$ ratio in the stratosphere, it's a great time to be tinkering.
@K2: I am not sure you fully understand the conditional jumps (and calls). If the jmp is conditional, there is no penalty for a jump not taken: the condition is tested early in the pipeline, and if it is not met the instruction is 'converted' to a nop and the next instruction is fetched correctly. There is no miss and therefore no penalty, so it executes in 4 cycles and not 8. See the latest Prop manual (with proptool) for a better explanation.
Of course all the conditionals can be used on any instruction.
So, the only jumps that can stall the pipeline are djnz, tjnz and tjz, which take 8 clocks when the jump is not taken.
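As a rough illustration of what that means for a counted loop, here is a Python estimate (a sketch only; djnz_loop_clocks is a made-up name). It assumes every instruction in the loop body is a plain 4-clock instruction, so hub accesses and wait instructions are deliberately left out.

```python
CLOCKS_PER_INSTRUCTION = 4   # ordinary cog instruction (assumption: no hub ops or waits)
DJNZ_TAKEN = 4               # djnz when it jumps back
DJNZ_NOT_TAKEN = 8           # djnz when it falls through


def djnz_loop_clocks(iterations, body_instructions):
    """Estimate the clocks for a loop closed by djnz, under the assumptions above."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    body = body_instructions * CLOCKS_PER_INSTRUCTION
    taken = iterations - 1   # every pass but the last jumps back
    return iterations * body + taken * DJNZ_TAKEN + DJNZ_NOT_TAKEN


print(djnz_loop_clocks(10, 2))
```

The 8-clock case is paid once, on exit, so its cost shrinks relative to the total as the iteration count grows.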
As an aside, I have just seen that the jmp wz,wc will clear zero and set carry.
Well, that irked me enough to dig into it. The zero flag will always be cleared [A]. Carry is the unsigned borrow from the comparison between the value pointed to by the destination slot (normally register 0) and the target address (9-bit immediate or 32-bit register value).
So if I set a register target to $207 and register 0 contains $42, then jmp target wc will jump to location $7 == $207 & $1FF and set carry ($42 < $207). Doing the same with $7 will produce the same jump but doesn't set carry ($42 > $7).
I wonder who came up (for what reason) with the explanation in the manual?
FWIW, the same carry rule applies to mov[dis].
[A] So far I haven't seen any evidence that it can be set.
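The rule described in this post can be written down as a tiny Python model (not an emulator; jmp_wz_wc is a hypothetical name, and the behaviour is taken from the observations above rather than from the manual). The zero flag is modelled as always clear, per the footnote.

```python
MASK32 = 0xFFFFFFFF   # cog registers are 32 bits wide
ADDR_MASK = 0x1FF     # cog addresses are 9 bits wide


def jmp_wz_wc(dest_value, source_value):
    """Return (target, zero, carry) for jmp with wz,wc as observed in the post."""
    target = source_value & ADDR_MASK
    # carry = unsigned borrow of (destination value - source value)
    carry = (dest_value & MASK32) < (source_value & MASK32)
    zero = False   # never seen set, see footnote [A]
    return target, zero, carry


print(jmp_wz_wc(0x42, 0x207))
print(jmp_wz_wc(0x42, 0x7))
```

Both calls land on location $7; only the first one sets carry, matching the $207 versus $7 example.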
I'm not 100% sure what I said, but what I meant to say is that I like the fact that djnz is optimized for looping. Normally (i.e., on most processors) looping incurs an additional penalty. That's one reason loops are so often unrolled for speed. A bigger reason nowadays has to do with the prefetch queue. I like the freedom the Prop provides from such concerns.
There are lots of details that need to be cleared up in my head. Your and kuroneko's elucidations are very helpful! I'm going over them carefully.
BTW, eight cogs operating at full-throttle create a bit of heat, both for the Propeller and the linear regulator. The time has come to employ a switching device, and perhaps a fan, especially since the quantity of Props will shortly be multiplying (no pun intended).
Comments
cognew(fn, @stack0)
cognew(fn, @stack1)
cognew(fn, @stack2)
cognew(fn, @stack3)
cognew(fn, @stack4)
cognew(fn, @stack5)
cognew(fn, @stack6)
coginit(cogid, fn, @stack7)
Your stack space is just one long and you use the same (invalid) stack for all cogs. How is that supposed to work?
Just to clarify, object file refers to the included example.spin, not the top level object.
Thanks.
You should re-init cog 0 last (cogid in fact). Do you actually test your code before posting?
That makes me very curious as to what your mystery app is, or even just a clue.
Is it purely PASM within the COGs?
Is it Spin?
If the latter I'm amazed how well it compares with ARM.
How big is it?
Humanoido
Pencil and paper for timing issues, some LEDs (demoboard, "I got this far" style) or running the stuff in my head.
BTW, what is your advice to a person just starting with the prop chip and spin programming, who wants to become a master like you (and Heater)?
Humanoido
EDIT: (and Cluso..)