Trying to drive Neopixels with Inline assembly CMM mode.

ftguy2016 · 2016-01-30 06:39

Hi,

I am trying to drive an ASD1293 NeoPixel using some inline assembly plus C code, and it works pretty well... until I tried to use a loop inside the __asm__ block to send all 8 bits:

I am trying to implement the normally simple:

"PutByteLoop:"
"..."
"..."
"..."
" shr %[bitMASK], #1 wz \n\t"
" if_nz brw #PutByteLoop \n\t"

I am trying to loop over the 8 bits of data, but I have had no success so far. Is there any recommendation on how to use a jump in inline assembly in CMM mode?

Thank you.

DavidZemon · 2016-01-30 15:03

I remember having immense trouble writing loops in inlined assembly as well. I don't know that I ever figured it out without using fcache. In PropWare, all inline-assembly uses fcache. This slightly increases the overhead for invoking your snippet of assembly (because it has to be copied to cog RAM) but then greatly increases the performance of each iteration in your loop.
Here's PropWare's NeoPixel assembly:
https://github.com/parallaxinc/PropWare/blob/release-2.0/PropWare/ws2812.h#L136

And here's a small sample of looping with fcached inline assembly:

volatile uint32_t waitCycles = bitCycles;
__asm__ volatile (
        "        fcache #(ShiftOutDataEnd%= - ShiftOutDataStart%=)                 \n\t"
        "        .compress off                                                     \n\t"

        "ShiftOutDataStart%=:                                                      \n\t"
        "        add %[_waitCycles], CNT                                           \n\t"

        "loop%=:                                                                   \n\t"
        "        waitcnt %[_waitCycles], %[_bitCycles]                             \n\t"
        "        shr %[_data],#1 wc                                                \n\t"
        "        muxc outa, %[_mask]                                               \n\t"
        "        djnz %[_bits], #__LMM_FCACHE_START+(loop%= - ShiftOutDataStart%=) \n\t"

        "        jmp __LMM_RET                                                     \n\t"
        "ShiftOutDataEnd%=:                                                        \n\t"
        "        .compress default                                                 \n\t"
        : [_data] "+r"(data),
          [_waitCycles] "+r"(waitCycles),
          [_bits] "+r"(bits)
        : [_mask] "r"(txMask),
          [_bitCycles] "r"(bitCycles));

Notice the %= at the end of each label. That's a special macro used by GCC which will expand to a unique number in every instance of assembly code. This allows you to use the word "loop" in multiple inline-assembly snippets.
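
A minimal sketch of the %= behaviour described above, assuming a GCC-compatible compiler. The label stem demo_loop and the function two_asm_blocks are made up for illustration. Each statement contains only a label and no instructions, so it is not tied to the Propeller instruction set. Note that %= is only expanded in extended asm (a statement with at least one colon), which is why the empty operand lists are there.

```c
/* Hypothetical demo: two inline-asm statements reuse the same label stem.
 * GCC replaces %= with a number unique to each asm statement, so the
 * assembler sees two different labels instead of a duplicate definition. */
static int two_asm_blocks(void)
{
    int blocks = 0;

    /* Extended asm (note the colons); label only, no instruction. */
    __asm__ volatile("demo_loop%=:\n\t" : : : "memory");
    blocks++;

    /* Same stem, but %= expands to a different number here. */
    __asm__ volatile("demo_loop%=:\n\t" : : : "memory");
    blocks++;

    return blocks;
}
```

If the %= suffix were dropped from both statements, the same label would be defined twice in one assembly file, which the assembler would normally reject.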

ftguy2016 · 2016-01-31 06:50

Hi DavidZemon ,

I was not aware of that special %= after the labels; I will try to experiment... I am still new to the platform and in the learning process. Just a quick question: when using fcache, is it still OK to keep my project in CMM mode? I am kind of short on memory when I select LMM.

Thank you.

ftguy2016 · 2016-01-31 07:05

I tried it, and it seems the project needs to be in LMM mode, because otherwise the linker reports an undefined reference to `__LMM_FCACHE_LOAD'. My ASM block runs in a separate cog to control the NeoPixels (code below), and it works well in CMM on my board. But once I add the fcache directive, the two labels (start%= and end%=), and the jump to __LMM_RET, and set my project to LMM mode, the very same ASM block no longer seems to work. So I will keep my current working implementation for now, experiment a little more later, and see what is wrong in my implementation.

void CogNeoPixl()
{
  // Configure the data pin as an output, driven low
  OUTA &= ~(do_mask_);
  DIRA |= (do_mask_);

  int nextCNT;

  while (1)
  {
    // Wait until a refresh is requested
    while (SetColor == 0);

    for (int i = 0; i < 3 * NbNeoPixl; i++)
    {
      int byte = NeoPixelRGB[i];

      // One asm invocation per bit, MSB first
      for (int bitMASK = 128; bitMASK > 0; bitMASK >>= 1)
      {
        __asm__ volatile(
          "         test    %[databyte], %[bitMASK] wz \n\t" // test the bit into Z flag
          "         mov     %[nextCNT], #33            \n\t"
          "         if_nz   add %[nextCNT], #42        \n\t"
          "         add     %[nextCNT], cnt            \n\t"
          "         or      outa, %[DOmask]            \n\t"
          "         waitcnt %[nextCNT], #0             \n\t"
          "         andn    outa, %[DOmask]            \n\t"
          "         mov     %[nextCNT], #50            \n\t"
          "         if_z    add %[nextCNT], #42        \n\t"
          "         add     %[nextCNT], cnt            \n\t"
          "         waitcnt %[nextCNT], #0             \n\t"
          : [nextCNT] "=&r" (nextCNT)
          : [DOmask] "r" (do_mask_),
            [databyte] "r" (byte),
            [bitMASK] "r" (bitMASK)
        );
      }
    }
    waitcnt(CNT + 900);
    SetColor = 0;
  }
}
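
To make the delay constants in the assembly above easier to read, here is a small host-side sketch that converts them to nanoseconds. It assumes an 80 MHz system clock (12.5 ns per tick), which the post does not state, and it ignores instruction and interpreter overhead, so the figures are nominal delays rather than measured pulse widths. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 80 MHz system clock; the thread never states the clock rate. */
#define ASSUMED_CLKFREQ_HZ 80000000u

/* Delay loaded into nextCNT before the first waitcnt (pin high).
 * The asm adds 42 ticks when the tested bit is 1 (if_nz). */
static uint32_t high_delay_ticks(int bit) { return bit ? 33u + 42u : 33u; }

/* Delay loaded before the second waitcnt (pin low).
 * The asm adds 42 ticks when the tested bit is 0 (if_z). */
static uint32_t low_delay_ticks(int bit) { return bit ? 50u : 50u + 42u; }

/* Convert ticks to tenths of a nanosecond to stay in integer math. */
static uint32_t ticks_to_tenth_ns(uint32_t ticks)
{
    return (uint32_t)((uint64_t)ticks * 10000000000ull / ASSUMED_CLKFREQ_HZ);
}
```

Under these assumptions a 1 bit is nominally 75 ticks high and 50 ticks low, a 0 bit is 33 ticks high and 92 ticks low, and both add up to 125 ticks (about 1.56 µs) per bit.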

DavidZemon · 2016-01-31 16:19

ftguy2016 wrote: »

Hi DavidZemon ,

I was not aware of that special %= after the labels; I will try to experiment... I am still new to the platform and in the learning process. Just a quick question: when using fcache, is it still OK to keep my project in CMM mode? I am kind of short on memory when I select LMM.

Thank you.

Can you post your full code for this failure? Something is wrong about that. The snippet I provided above came from this file, which I have used in both CMM and LMM modes.

JasonDorie · 2016-01-31 16:59

Does CMM support inline assembly & fcache? I thought fcache was only in "native" PASM modes like XMM and LMM? (Happy to be proven wrong)

DavidZemon · 2016-01-31 17:23

JasonDorie wrote: »

Does CMM support inline assembly & fcache? I thought fcache was only in "native" PASM modes like XMM and LMM? (Happy to be proven wrong)

Definitely supported in CMM mode. That's how I'm able to get the same 4.4 MBaud burst transmit speed for UART in both LMM and CMM modes.

DavidZemon · 2016-01-31 17:27

What's important to note though, is that without using FCache, inline assembly in CMM mode is still converted to CMM instructions, so you still have the performance hit of retrieving instructions from HUB and interpreting those instructions.

Without fcache, inline assembly only saves you from less-than-ideal compiler translation. It's just a way to execute specific instructions rather than whatever GCC thinks is best. But fcache brings in a whole different world.

Personally, I think it's absolutely amazing that we can have CMM mode and then combine that with fcache. You get density that approximates Spin and speed that approximates native assembly! What an awesome world this is with the Propeller and PropGCC!

ersmith · 2016-01-31 21:34

ftguy2016 wrote: »

Hi,

I am trying to drive an ASD1293 NeoPixel using some inline assembly plus C code, and it works pretty well... until I tried to use a loop inside the __asm__ block to send all 8 bits:

I am trying to implement the normally simple:

"PutByteLoop:"
"..."
"..."
"..."
" shr %[bitMASK], #1 wz \n\t"
" if_nz brw #PutByteLoop \n\t"

I am trying to loop over the 8 bits of data, but I have had no success so far. Is there any recommendation on how to use a jump in inline assembly in CMM mode?

Thank you.

That should work -- brw will be expanded by the assembler to the appropriate CMM mode branch (and in LMM would be expanded into an LMM mode branch). What is going wrong?

ersmith · 2016-01-31 21:37

JasonDorie wrote: »

Does CMM support inline assembly & fcache? I thought fcache was only in "native" PASM modes like XMM and LMM? (Happy to be proven wrong)

Yes, fcache will work in CMM. You do have to be careful to wrap the fcache'd code in ".compress off" / ".compress default" to make sure the assembler will produce real PASM instructions instead of compressed ones. David Zemon's example does this, and should work in all modes.

jmg · 2016-01-31 21:49

DavidZemon wrote: »

Without fcache, inline assembly only saves you from less-than-ideal compiler translation. It's just a way to execute specific instructions rather than whatever GCC thinks is best. But fcache brings in a whole different world

It's a pity that fcache is a pretty poor name choice, for something that is a significant change in mode (and thus speed)
To make this more obvious, can a better name be found ?
A name that carries information about what a new user could expect would be good.

fcache sounds like a subtle option switch for floating point cache control....

DavidZemon · 2016-01-31 22:37

jmg wrote: »

DavidZemon wrote: »

Without fcache, inline assembly only saves you from less-than-ideal compiler translation. It's just a way to execute specific instructions rather than whatever GCC thinks is best. But fcache brings in a whole different world

It's a pity that fcache is a pretty poor name choice, for something that is a significant change in mode (and thus speed)
To make this more obvious, can a better name be found ?
A name that carries information about what a new user could expect would be good.

fcache sounds like a subtle option switch for floating point cache control....

I'd never thought about that before, but it's a good point. Perhaps the branch of PropGCC that is based off of GCC 6 can make that change. What do you think @ersmith? And @jmg, do you have any suggestions? Maybe cogcache?

jmg · 2016-01-31 23:02

DavidZemon wrote: »

And @jmg, do you have any suggestions? Maybe cogcache?

I think COG needs to be in the name, as it is a COG Resident Fast Native Binary Mode (not interpreted).

Cache I am less warm to, as that sounds like a buffer control. (Users are more interested in WHAT something does, than how the software does it.)

What are the limits on multiple 'cogcache' ?

One COG runs the interpreter, which implies up to 7 SW blocks can launch and run as COG-resident. Each block has an upper ceiling of the COG code area.

Within those, can a single COG be packed with multiple COG-resident blocks, of which one at a time can be launched? (This would save re-load overhead.)
Was this called something like Terminate and Stay Resident, back in the DOS days ?

DavidZemon · 2016-02-01 00:08

Actually, what makes fcache so cool is that it runs in the same cog as the CMM (or LMM) interpreter. You don't have to start a new cog to run code in fcache mode. That's what allows you to program with the speed of PASM but have your entire program run in a single cog. It doesn't help you do async of course, but the instructions themselves are quite fast.

I've found the limit to be 64 instructions. More than that and your program will quietly fail to run correctly (I do wish a compiler error would be thrown... but I'm not sure how easy it would be to detect such a thing given the way I force fcache in the example above).

You can have multiple things run in fcache mode in a single cog. The code is not cached in cog ram until you're ready to execute it. The bad news is that there is a delay between when you hit the start of your fcache block and when your fcache block begins executing. The good news is, that means there is no limit on how many sections of your code can be run via fcache.

jmg · 2016-02-01 00:35

DavidZemon wrote: »

Actually, what makes fcache so cool is that it runs in the same cog as the CMM (or LMM) interpreter. You don't have to start a new cog to run code in fcache mode. That's what allows you to program with the speed of PASM but have your entire program run in a single cog. It doesn't help you do async of course, but the instructions themselves are quite fast.

I've found the limit to be 64 instructions.

It's impressive that it can do that, but such 'packed' operation is sure to have caveats and fish-hooks, so your issues are not surprising.

Is that 'same COG' the only mode supported? Can you not allocate the binary images to other cogs (where there are fewer caveats)?

Electrodude · 2016-02-01 00:42

Is the LMM/CMM interpreter smart enough to keep a pointer to what was last loaded into the fcache buffer (per cog, obviously), so it can avoid reloading it needlessly if the pointers match? I would think this would create a very significant speed improvement when the same fcache segment is called many times in a row (like if you output lots of bytes at once, a very common operation). If fcache does do this (which it should), then it could truly be called a cache.

DavidZemon · 2016-02-01 00:55

jmg wrote: »

DavidZemon wrote: »

Actually, what makes fcache so cool is that it runs in the same cog as the CMM (or LMM) interpreter. You don't have to start a new cog to run code in fcache mode. That's what allows you to program with the speed of PASM but have your entire program run in a single cog. It doesn't help you do async of course, but the instructions themselves are quite fast.

I've found the limit to be 64 instructions.

It's impressive that it can do that, but such 'packed' operation is sure to have caveats and fish-hooks, so your issues are not surprising.

Is that 'same COG' the only mode supported? Can you not allocate the binary images to other cogs (where there are fewer caveats)?

The Propeller's native cogstart instruction is still available, so if you have a block of binary in your program, absolutely you can invoke cogstart on it. There are multiple ways to achieve said block of binary:

1) Extract the DAT section from a Spin file via spin2cpp, link into your executable via PropGCC
2) Write a pure GAS assembly file (.S or .s usually) and build it into an object file which is linked into your program like anything else with PropGCC
3) Write a "cogc" (or PropWare provides "cogcpp" as well) file. This simply means a C or C++ file which PropGCC will compile into assembly - as all C/C++ files are - but then does some extra magic on the symbols before linking so that it can be invoked via cogstart at runtime.

You can not, however, just pick any old function willy-nilly and invoke it in a new cog running native instructions.

Electrodude wrote: »

Is the LMM/CMM interpreter smart enough to keep a pointer to what was last loaded into the fcache buffer (per cog, obviously), so it can avoid reloading it needlessly if the pointers match? I would think this would create a very significant speed improvement when the same fcache segment is called many times in a row (like if you output lots of bytes at once, a very common operation). If fcache does do this (which it should), then it could truly be called a cache.

PropGCC will automatically cache certain functions that it deems are worth caching. I don't know how good it is at detecting when something should be cached... I've never looked into it. For those automatically cached functions, I assume it is smart enough to only load when necessary.
As for my manual method, I don't know. It would sure be nice! I've never looked. Hopefully @ersmith can answer that.

ersmith · 2016-02-02 00:14

DavidZemon wrote: »

jmg wrote: »

DavidZemon wrote: »

Without fcache, inline assembly only saves you from less-than-ideal compiler translation. It's just a way to execute specific instructions rather than whatever GCC thinks is best. But fcache brings in a whole different world

It's a pity that fcache is a pretty poor name choice, for something that is a significant change in mode (and thus speed)
To make this more obvious, can a better name be found ?
A name that carries information about what a new user could expect would be good.

fcache sounds like a subtle option switch for floating point cache control....

I'd never thought about that before, but it's a good point. Perhaps the branch of PropGCC that is based off of GCC 6 can make that change. What do you think @ersmith? And @jmg, do you have any suggestions? Maybe cogcache?

I agree that "fcache" is probably not the best name, but it's pretty well entrenched now (and has been for a long time, since Bill first invented the concept). But I have no objection to adding an alias for it.

ersmith · 2016-02-02 00:15

Electrodude wrote: »

Is the LMM/CMM interpreter smart enough to keep a pointer to what was last loaded into the fcache buffer (per cog, obviously), so it can avoid reloading it needlessly if the pointers match?

Yes.
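
The behaviour Electrodude asks about and ersmith confirms can be pictured with a small host-side model. This is not the PropGCC kernel source. It is a sketch in plain C in which fcache_model_t, fcache_load, and the 64-long capacity (taken from David's observation earlier in the thread) are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FCACHE_MODEL_LONGS 64   /* assumed capacity, per the limit reported above */

typedef struct {
    const uint32_t *last_src;                 /* hub address of the block last loaded */
    uint32_t        buf[FCACHE_MODEL_LONGS];  /* stands in for the cog RAM fcache area */
    unsigned        copies;                   /* number of real copies performed */
} fcache_model_t;

/* Returns 0 on success, -1 if the block does not fit.
 * The copy is skipped when the same block is already resident. */
static int fcache_load(fcache_model_t *fc, const uint32_t *src, size_t longs)
{
    if (longs > FCACHE_MODEL_LONGS)
        return -1;
    if (fc->last_src == src)
        return 0;                             /* already loaded: nothing to copy */
    memcpy(fc->buf, src, longs * sizeof(uint32_t));
    fc->last_src = src;
    fc->copies++;
    return 0;
}
```

Calling fcache_load twice in a row with the same source pointer performs only one copy in this model, which is the saving Electrodude had in mind for back-to-back calls. Unlike the model, the real mechanism reportedly fails silently when a block is too large rather than returning an error.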

ersmith · 2016-02-02 00:19

DavidZemon wrote: »

PropGCC will automatically cache certain functions that it deems are worth caching. I don't know how good it is at detecting when something should be cached... I've never looked into it. For those automatically cached functions, I assume it is smart enough to only load when necessary.
As for my manual method, I don't know. It would sure be nice! I've never looked. Hopefully @ersmith can answer that.

PropGCC automatically puts loops (if they are small enough) and recursive functions (again, if small enough) into fcache, if the appropriate optimization options are set. For LMM this is -O2 or -Os, but for CMM only -O2 (because fcache code is not compressed, so it isn't optimized for size).

You can also explicitly put an __attribute__((fcache)) tag on a function to request that it be placed into fcache. This is very useful for functions with hard timing requirements (the PropGCC library uses this for its serial functions), and also for small frequently called functions.
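
As a rough illustration of the explicit request described above, the sketch below tags a small, loop-heavy function. It assumes the attribute is spelled __attribute__((fcache)) and that PropGCC defines __PROPELLER__. The FCACHE_HINT macro and the reverse_bits8 function are hypothetical names, and on any other compiler the macro expands to nothing so the code still builds.

```c
#include <stdint.h>

#ifdef __PROPELLER__
#define FCACHE_HINT __attribute__((fcache))   /* assumed PropGCC spelling */
#else
#define FCACHE_HINT                            /* no-op on other compilers */
#endif

/* Small loop-heavy helper: reverse the bit order of one byte. */
FCACHE_HINT uint8_t reverse_bits8(uint8_t value)
{
    uint8_t result = 0;
    for (int bit = 0; bit < 8; bit++) {
        result = (uint8_t)((result << 1) | (value & 1u));
        value >>= 1;
    }
    return result;
}
```

Keeping the tagged function short matters here, since the thread reports a fairly small limit on how much code fits in the fcache area.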

Electrodude · 2016-02-02 02:22

ersmith wrote: »

Electrodude wrote: »

Is the LMM/CMM interpreter smart enough to keep a pointer to what was last loaded into the fcache buffer (per cog, obviously), so it can avoid reloading it needlessly if the pointers match?

Yes.

Cool!

ftguy2016 · 2016-02-02 02:52

ersmith wrote: »

That should work -- brw will be expanded by the assembler to the appropriate CMM mode branch (and in LMM would be expanded into an LMM mode branch). What is going wrong?

Well, it seems to ignore the jump completely and then locks up, which is pretty weird. First, I tried adding fcache like this:

void CogNeoPixl()
{
  
  // output on PIN0
  OUTA &=  ~(do_mask_);
  DIRA |= (do_mask_);
  
   int nextCNT;
        
   while(1)
   {
     while(SetColor==0);
          
        for(int i=0;i<3*NbNeoPixl;i += 3)
        {
          unsigned int byte = (NeoPixelRGB[i+0]);
          byte <<= 8;
          byte |= (NeoPixelRGB[i+1]);
          byte <<= 8;
          byte |= (NeoPixelRGB[i+2]);
          
          for(int bitMASK=1<<23; bitMASK >0 ; bitMASK >>= 1 )
          {
          
        __asm__ volatile(
                "        fcache #(BlockDataEnd%= - BlockDataStart%=)                 \n\t"
                "        .compress off                                                     \n\t"
                "BlockDataStart%=:                                                      \n\t"
                "         test    %[databyte],%[bitMASK] wz  \n\t" // test the bit into Z flag
                "         mov     %[nextCNT],#33                  \n\t"
                "         if_nz    add %[nextCNT],#42              \n\t"
                "         add     %[nextCNT],cnt                \n\t"
                "         or      outa,%[DOmask]           \n\t"
                "         waitcnt %[nextCNT],  #0                 \n\t"
                "         andn    outa,        %[DOmask]          \n\t"
                "         mov     %[nextCNT],  #50                \n\t"
                "         if_z   add %[nextCNT],#42              \n\t"
                "         add     %[nextCNT],  cnt                \n\t"
                "         waitcnt %[nextCNT],  #0                 \n\t"
                 "        jmp __LMM_RET                                                     \n\t"
                "BlockDataEnd%=:                                                        \n\t"
                ".compress default                                                 \n\t"
                :
                [nextCNT] "=&r" (nextCNT)
                :
                [DOmask] "r" (do_mask_),
                [databyte] "r" (byte),
                [bitMASK] "r" (bitMASK)
                );
           }

        }          
         waitcnt(CNT+900);
         SetColor = 0;
      
       }         
}

In CMM I get:

(.text+0x6c): undefined reference to `__LMM_FCACHE_LOAD'
collect2: ld returned 1 exit status
Done. Build Failed!

I then switched to LMM. Now it compiles fine, but CogNeoPixl hangs.

OK, so I switched back to CMM and moved the loop into the ASM block. I also blink P26 to see whether it hangs or not, with the following code:

void CogNeoPixl()
{
  
  // output on PIN0
  OUTA &=  ~(do_mask_);
  DIRA |= (do_mask_);
  
   int nextCNT;
        
   while(1)
   {
    // while(SetColor==0);
    
         high(26);
         pause(100);
         low(26);
         pause(100);
          
        for(int i=0;i<3*NbNeoPixl;i += 3)
        {
          unsigned int byte = (NeoPixelRGB[i+0]);
          byte <<= 8;
          byte |= (NeoPixelRGB[i+1]);
          byte <<= 8;
          byte |= (NeoPixelRGB[i+2]);
          
          int bitMASK;
          
       //   for(int bitMASK=1<<23; bitMASK >0 ; bitMASK >>= 1 )
          {
          
        __asm__ volatile(

                "mov      %[bitMASK],#1                  \n\t"
                "shl      %[bitMASK],#23                  \n\t" 
                "BlockDataStart%=:                                                      \n\t"
                "         test    %[databyte],%[bitMASK] wz  \n\t" // test the bit into Z flag
                "         mov     %[nextCNT],#33                  \n\t"
                "         if_nz    add %[nextCNT],#42              \n\t"
                "         add     %[nextCNT],cnt                \n\t"
                "         or      outa,%[DOmask]           \n\t"
                "         waitcnt %[nextCNT],  #0                 \n\t"
                "         andn    outa,        %[DOmask]          \n\t"
                "         mov     %[nextCNT],  #50                \n\t"
                "         if_z   add %[nextCNT],#42              \n\t"
                "         add     %[nextCNT],  cnt                \n\t"
                "         waitcnt %[nextCNT],  #0                 \n\t"
                "         shr %[bitMASK], #1 wz \n\t"
                "         if_nz brw #BlockDataStart%= \n\t"
 
                :
                [nextCNT] "=&r" (nextCNT),
                [bitMASK] "=&r" (bitMASK)
                :
                [DOmask] "r" (do_mask_),
                [databyte] "r" (byte)
              //  [bitMASK] "r" (bitMASK)
                );
           }

        }          
         waitcnt(CNT+900);
         
       
         SetColor = 0;
      
       }         
}

The Propeller hangs: I see one flash of light on P26 and that's all. If I comment out the BRW line it does not hang, but of course it does not work either. That is where I am so far.

ftguy2016 · 2016-02-02 02:59

Just for reference, here is the current implementation, which works and which I am using for now, with the loop in C. It is not a big deal because it works, but I wanted to understand why I cannot seem to get a loop working.

void CogNeoPixl()
{
  
  // output on PIN0
  OUTA &=  ~(do_mask_);
  DIRA |= (do_mask_);
  
   int nextCNT;
        
   while(1)
   {
     while(SetColor==0);
          
        for(int i=0;i<3*NbNeoPixl;i += 3)
        {
          unsigned int byte = (NeoPixelRGB[i+0]);
          byte <<= 8;
          byte |= (NeoPixelRGB[i+1]);
          byte <<= 8;
          byte |= (NeoPixelRGB[i+2]);
          
          for(int bitMASK=1<<23; bitMASK >0 ; bitMASK >>= 1 )
          {
          
        __asm__ volatile(
                "         test    %[databyte],%[bitMASK] wz  \n\t" // test the bit into Z flag
                "         mov     %[nextCNT],#33                  \n\t"
                "         if_nz    add %[nextCNT],#42              \n\t"
                "         add     %[nextCNT],cnt                \n\t"
                "         or      outa,%[DOmask]           \n\t"
                "         waitcnt %[nextCNT],  #0                 \n\t"
                "         andn    outa,        %[DOmask]          \n\t"
                "         mov     %[nextCNT],  #50                \n\t"
                "         if_z   add %[nextCNT],#42              \n\t"
                "         add     %[nextCNT],  cnt                \n\t"
                "         waitcnt %[nextCNT],  #0                 \n\t"
                :
                [nextCNT] "=&r" (nextCNT)
                :
                [DOmask] "r" (do_mask_),
                [databyte] "r" (byte),
                [bitMASK] "r" (bitMASK)
                );
           }

        }          
         waitcnt(CNT+900);
         SetColor = 0;
      
       }         
}
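
The packing and bit order used in the function above can be checked on a PC with a short sketch. The names pack24 and msb_first_bits are hypothetical; the logic mirrors the three-byte packing and the 1<<23 mask loop from the post.

```c
#include <stdint.h>

/* Pack three consecutive bytes into the low 24 bits, first byte highest. */
static uint32_t pack24(const unsigned char *bytes)
{
    uint32_t word = bytes[0];
    word <<= 8;
    word |= bytes[1];
    word <<= 8;
    word |= bytes[2];
    return word;
}

/* Walk the 24 bits MSB first, as the for loop over bitMASK does,
 * writing each bit (0 or 1) to out[0..23]. Returns the bit count. */
static int msb_first_bits(uint32_t word, unsigned char out[24])
{
    int n = 0;
    for (uint32_t mask = 1u << 23; mask > 0; mask >>= 1)
        out[n++] = (word & mask) ? 1 : 0;
    return n;
}
```

With the test data from the thread (255, 0, 0 per pixel), the packed word is 0xFF0000, so the first eight bits sent are ones and the remaining sixteen are zeros.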

DavidZemon · 2016-02-02 03:16

hi ftguy2016,

It is much harder to debug code without the full context. I pasted your snippet into my own file and there are a few undefined symbols. do_mask_ is easy enough to figure out, but NbNeoPixl and NeoPixelRGB are less obvious.

ftguy2016 · 2016-02-02 03:27

Hi DavidZemon,

OK, here are the missing parts that I extracted; everything should run if you add CogNeoPixl. I have 8 NeoPixels with the data line on pin 0, so my do_mask_ is 1<<0. To test with an LED on another pin, just replace do_mask_ with whatever pin you want to activate.

#include "simpletools.h"                      // Include simple tools
#include "wavplayer2.h"
//#include "adcDCpropab.h"

unsigned int stack1[40 + 25];

#define do_mask_  (1 << 0)
#define NbNeoPixl (8)

unsigned char NeoPixelRGB[3*NbNeoPixl];
unsigned char  NeoPixlIndex[NbNeoPixl];
volatile char SetColor = 0;
unsigned int  MasterLight=16;
unsigned int  VblCount = 0;

extern void CogNeoPixl();




int main()                                    
{
  
 for(int i=0;i<NbNeoPixl*3; i +=3)
 {
  NeoPixelRGB[i+0] = 255;
  NeoPixelRGB[i+1] = 0;
  NeoPixelRGB[i+2] = 0;
 }
 
  sd_mount(22, 23, 24, 25);
  
  int cog = cogstart(&CogNeoPixl, NULL, stack1, sizeof(stack1));
  
  
  while(1)
  {
    
    pause(20);
    
    if (SetColor==0)
    {
   
      //ask the Cog to send data
      SetColor = 1;
     
   
    }       
   
    
      
    VblCount++;    
    
  }  
}

DavidZemon · 2016-02-02 04:28

Wow! That was tough! Turns out, the CogNeoPixl function is so short that GCC is trying to run it in fcache already! So when you try for fcache inside fcache, it doesn't work so hot!

I was only able to figure this out by looking at the assembly file (add -save-temps to your compile options)

ftguy2016 · 2016-02-02 04:57

Hi DavidZemon, wow...you are pretty good at that ^^

I believe it would be nice if GCC could output a warning about this double fcache issue, because it is not obvious to find. Do you know if it also explains why I have no luck with brw?

ersmith · 2016-02-02 14:34

Interesting... I had thought that gcc wouldn't fcache loops that contain inline assembly, but it turns out that it does. That's a bug, and I've checked in a fix. This does explain all of the problems -- the rules for assembly language inside fcache are different than outside, and (for example) the brw wouldn't work.

It also points out why it's probably better to avoid inline assembly unless you really know what you're doing (as David does). An automatically fcache'd gcc loop is definitely going to run faster than a manually coded non fcache'd loop.

ftguy2016 · 2016-02-03 02:27

ersmith wrote: »

It also points out why it's probably better to avoid inline assembly unless you really know what you're doing (as David does). An automatically fcache'd gcc loop is definitely going to run faster than a manually coded non fcache'd loop.

I think I know what I am doing. Besides, you can't drive NeoPixels on the Propeller without using assembler; the short, tight delays cannot be reached easily otherwise. Personally, I would rather point out that GCC should be fixed so we can use it properly.

ersmith · 2016-02-03 13:11

ftguy2016 wrote: »

It also points out why it's probably better to avoid inline assembly unless you really know what you're doing (as David does). An automatically fcache'd gcc loop is definitely going to run faster than a manually coded non fcache'd loop.

I think I know what I am doing. Besides, you can't drive NeoPixels on the Propeller without using assembler; the short, tight delays cannot be reached easily otherwise. Personally, I would rather point out that GCC should be fixed so we can use it properly.

I'm sorry, my comment came across with the wrong tone -- I didn't mean to imply you don't know what you're doing. Thank you for the bug report, and I have fixed the fcache problem in GCC by preventing it from fcache'ing loops that have inline assembly in them.

I do disagree though that you can't drive the NeoPixels without using assembler. I think GCC should be able to produce efficient enough code. What I was trying to get at (and not phrasing it well) is that the automatic optimizations that GCC performs (like fcache) will produce very good code which will often run faster than hand written assembly. Certainly fcache'd C code will outperform non-fcache'd hand written assembly.

jmg · 2016-02-04 00:22

ersmith wrote: »

Thank you for the bug report, and I have fixed the fcache problem in GCC by preventing it from fcache'ing loops that have inline assembly in them.

Does this mean that 'converging on a solution' is not so easy ?
If a user asks for fcache, and finds on inspection that is very close, but they need to modify one small part, how do they do that if any in-line asm then disables fcache ?

Does this mean two forms of in-line ASM are needed ?
