Trying to drive NeoPixels with inline assembly in CMM mode
ftguy2016
Hi,
I am trying to drive an ASD1293 NeoPixel using some inline assembly plus C code, and it works pretty well... until I tried to use a loop to set all 8 bits within the __asm__ block.
I am trying to implement the normally simple:
"PutByteLoop:"
"..."
"..."
"..."
" shr %[bitMASK], #1 wz \n\t"
" if_nz brw #PutByteLoop \n\t"
I am trying to loop over the 8 bits of data, but I have had no success so far. Is there any recommendation on how to use a jump in inline assembly in CMM mode?
Thank you.
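For reference, the control flow that the snippet is aiming for can be modelled in plain C. This is only a sketch of the loop logic, not a NeoPixel driver: the function name `put_byte_model`, the start value `0x80`, and the MSB-first order are assumptions, and no pins are touched.

```c
#include <stdint.h>

/*
 * Plain-C model of the intended loop: start with the mask on the top bit,
 * shift it right once per bit, and stop when the mask becomes zero
 * (the job of "shr ... wz" followed by "if_nz brw" in the snippet).
 * Returns the number of iterations; bits_out[i] receives the i-th bit.
 */
static int put_byte_model(uint8_t value, uint8_t bits_out[8])
{
    uint32_t bit_mask = 0x80;   /* assumed start value: MSB first */
    int count = 0;

    do {
        bits_out[count++] = (value & bit_mask) ? 1 : 0;
        bit_mask >>= 1;         /* shr bitMASK, #1 wz */
    } while (bit_mask != 0);    /* if_nz brw #PutByteLoop */

    return count;
}
```

With the mask starting at `0x80`, the body runs exactly eight times before the shift leaves zero in the mask, which is the behaviour the `wz` / `if_nz` pair is meant to provide.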
Comments
I remember having immense trouble writing loops in inlined assembly as well. I don't know that I ever figured it out without using fcache. In PropWare, all inline-assembly uses fcache. This slightly increases the overhead for invoking your snippet of assembly (because it has to be copied to cog RAM) but then greatly increases the performance of each iteration in your loop.
Here's PropWare's NeoPixel assembly:
https://github.com/parallaxinc/PropWare/blob/release-2.0/PropWare/ws2812.h#L136
And here's a small sample of looping with fcached inline assembly:
Notice the %= at the end of each label. That's a special format sequence recognized by GCC, which expands to a number that is unique to each inline-assembly statement. This allows you to use the word "loop" as a label in multiple inline-assembly snippets without the labels clashing.
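A minimal, target-independent sketch of the `%=` behaviour, assuming GCC or Clang with GAS-style `name:` label syntax. The function name `add_with_labels` and the label name `demo_loop` are hypothetical, and the asm statements contain only a label, so this is not Propeller code and does not use fcache.

```c
/*
 * Each asm statement below contains only a label, so the sketch does not
 * depend on any particular instruction set. Because of the %= suffix, the
 * compiler emits a different label name for each statement; without it,
 * the two identical labels would clash when the output is assembled.
 */
static int add_with_labels(int a, int b)
{
    int sum;

    __asm__ volatile ("demo_loop%=:\n\t" : : : "memory");
    sum = a + b;
    __asm__ volatile ("demo_loop%=:\n\t" : : : "memory");

    return sum;
}
```

Compiling with -save-temps and reading the generated assembly file should show two distinct label names, one per statement.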
I was not aware of that special %= after the labels; I will experiment with it... I am still new to the platform and in the learning process. Just a quick question: with fcache, will it still be OK to keep CMM mode in my project? I am rather short of memory when I select LMM.
Thank you.
Can you post your full code for this failure? Something is wrong about that. The snippet I provided above came from this file, which I have used in both CMM and LMM modes.
Definitely supported in CMM mode. That's how I'm able to get the same 4.4 MBaud burst transmit speed for UART in both LMM and CMM modes.
Without fcache, inline assembly only saves you from less-than-ideal compiler translation. It's just a way to execute specific instructions rather than whatever GCC thinks is best. But fcache brings in a whole different world.
Personally, I think it's absolutely amazing that we can have CMM mode and then combine that with fcache. You get density that approximates Spin and speed that approximates native assembly! What an awesome world this is with the Propeller and PropGCC!
That should work -- brw will be expanded by the assembler to the appropriate CMM mode branch (and in LMM would be expanded into an LMM mode branch). What is going wrong?
Yes, fcache will work in CMM. You do have to be careful to wrap the fcache'd code in ".compress off" / ".compress default" to make sure the assembler will produce real PASM instructions instead of compressed ones. David Zemon's example does this, and should work in all modes.
It's a pity that fcache is a pretty poor name choice for something that is a significant change in mode (and thus speed).
To make this more obvious, can a better name be found?
A name that carries information about what a new user could expect would be good.
fcache sounds like a subtle option switch for floating point cache control....
I'd never thought about that before, but it's a good point. Perhaps the branch of PropGCC that is based off of GCC 6 can make that change. What do you think @ersmith? And @jmg, do you have any suggestions? Maybe cogcache?
I think COG needs to be in the name, as it is a COG-resident fast native binary mode (not interpreted).
I am less warm to "cache", as that sounds like a buffer control. (Users are more interested in WHAT something does than in how the software does it.)
What are the limits on multiple 'cogcache' blocks?
One COG runs the interpreter, which implies that up to 7 software blocks can launch and run as COG-resident.
Each block has an upper ceiling of the COG code area.
Within those, can a single COG be packed with multiple COG-resident blocks, of which one at a time can be launched? (This would save reload overhead.)
Was this called something like Terminate and Stay Resident, back in the DOS days?
I've found the limit to be 64 instructions. More than that and your program will quietly fail to run correctly (I do wish a compiler error would be thrown... but I'm not sure how easy it would be to detect such a thing given the way I force fcache in the example above).
You can have multiple things run in fcache mode in a single cog. The code is not cached in cog ram until you're ready to execute it. The bad news is that there is a delay between when you hit the start of your fcache block and when your fcache block begins executing. The good news is, that means there is no limit on how many sections of your code can be run via fcache.
Is that 'same COG' the only mode supported? Can you not allocate the binary images to other COGs (where there are fewer caveats)?
The Propeller's native coginit instruction (available from PropGCC as cognew) is still there, so if you have a block of binary in your program, you can absolutely launch it in another cog. There are multiple ways to obtain said block of binary:
1) Extract the DAT section from a Spin file via spin2cpp and link it into your executable via PropGCC.
2) Write a pure GAS assembly file (usually .S or .s) and build it into an object file, which is linked into your program like anything else with PropGCC.
3) Write a "cogc" file (PropWare provides "cogcpp" as well). This simply means a C or C++ file which PropGCC will compile into assembly, as it does all C/C++ files, but then does some extra magic on the symbols before linking so that it can be launched in a new cog at runtime.
You cannot, however, just pick any old function willy-nilly and invoke it in a new cog running native instructions.
PropGCC will automatically cache certain functions that it deems are worth caching. I don't know how good it is at detecting when something should be cached... I've never looked into it. For those automatically cached functions, I assume it is smart enough to only load when necessary.
As for my manual method, I don't know. It would sure be nice! I've never looked. Hopefully @ersmith can answer that.
I agree that "fcache" is probably not the best name, but it's pretty well entrenched now (and has been for a long time, since Bill first invented the concept). But I have no objection to adding an alias for it.
PropGCC automatically puts loops (if they are small enough) and recursive functions (again, if small enough) into fcache, if the appropriate optimization options are set. For LMM this is -O2 or -Os, but for CMM only -O2 (because fcache code is not compressed, so it isn't optimized for size).
You can also explicitly put an __attribute__((fcache)) tag on a function to request that it be placed into fcache. This is very useful for functions with hard timing requirements (the PropGCC library uses this for its serial functions), and also for small, frequently called functions.
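A rough sketch of how such a tag might be applied, assuming that PropGCC defines `__PROPELLER__` and accepts the attribute spelled `fcache`. The macro `FCACHE_FN` and the function `count_set_bits` are hypothetical, and whether this particular function fits in the fcache area has not been checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Apply the attribute only when building with PropGCC (assumed to define
 * __PROPELLER__); with any other compiler the macro expands to nothing. */
#if defined(__PROPELLER__)
#define FCACHE_FN __attribute__((fcache))
#else
#define FCACHE_FN
#endif

/* Hypothetical small leaf function: counts the set bits in a buffer,
 * walking each byte with a mask from the top bit down to the bottom bit. */
FCACHE_FN static unsigned count_set_bits(const uint8_t *data, size_t len)
{
    unsigned total = 0;
    size_t i;
    uint8_t mask;

    for (i = 0; i < len; ++i) {
        for (mask = 0x80; mask != 0; mask >>= 1) {
            if (data[i] & mask)
                ++total;
        }
    }
    return total;
}
```

Keeping the attribute behind a macro lets the same source build on a desktop compiler for testing, where the tag is simply dropped.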
Well, it seems to totally ignore the jump and it locks up, which is pretty weird. At first I tried to see whether I could add fcache with this:
and in CMM I am getting:
(.text+0x6c): undefined reference to `__LMM_FCACHE_LOAD'
collect2: ld returned 1 exit status
Done. Build Failed!
I then switched to LMM; now it compiles fine, but CogNeoPixl hangs.
OK, so I switched back to CMM and edited the code to put the loop inside the asm block and to light P26, to see whether it hangs or not, with the following code:
And the Propeller hangs: I can see one flash of light on P26 and that's all. If I then comment out the BRW line, it does not hang, but of course it does not work either. That is the situation I have reached so far.
It is much harder to debug code without the full context. I pasted your snippet into my own file and there are a few undefined symbols. do_mask_ is easy enough to figure out, but NbNeoPixl and NeoPixelRGB are less obvious.
OK, here are the missing parts that I extracted, which should make everything run if you add the cogneopixl. I have 8 NeoPixels with the data line on my pin 0, so my do_mask_ is 1<<0, but to test, if you have an LED on some pin, you can just replace do_mask_ with whatever pin you want to activate.
I was only able to figure this out by looking at the assembly file (add -save-temps to your compile options)
I believe it would be nice if GCC could output a warning about this double fcache issue, because it is not something obvious to find. Do you know if it also explains why I have no luck with the brw?
It also points out why it's probably better to avoid inline assembly unless you really know what you're doing (as David does). An automatically fcache'd GCC loop is definitely going to run faster than a manually coded non-fcache'd loop.
I think I know what I am doing. Besides, you can't drive NeoPixels without using assembler on the Propeller; the short, tight delays cannot be reached easily. Personally, I would rather point out that GCC should be fixed so we can use it properly.
I'm sorry, my comment came across with the wrong tone -- I didn't mean to imply you don't know what you're doing. Thank you for the bug report, and I have fixed the fcache problem in GCC by preventing it from fcache'ing loops that have inline assembly in them.
I do disagree though that you can't drive the NeoPixels without using assembler. I think GCC should be able to produce efficient enough code. What I was trying to get at (and not phrasing it well) is that the automatic optimizations that GCC performs (like fcache) will produce very good code which will often run faster than hand written assembly. Certainly fcache'd C code will outperform non-fcache'd hand written assembly.
Does this mean that 'converging on a solution' is not so easy?
If a user asks for fcache and finds on inspection that the result is very close, but they need to modify one small part, how do they do that if any inline asm then disables fcache?
Does this mean two forms of inline asm are needed?