flexspin compiler for P2: Assembly, Spin, BASIC, and C in one compiler

Comments

  • evanh Posts: 10,088
    edited 2020-05-23 - 23:47:03
    evanh wrote: »
    TonyB_ wrote: »
    Does bit 31 make a difference, then? And is RDFAST any quicker than the hidden jump in REP?
    It trades stalling for NOP'ing. The REP loop timing is affected but that is barely relevant when the execution path is all screwy. With bit 31 set, instructions go missing until the FIFO has reloaded. So, for example, if you time it to match the REP looping then the remainder of the REP block is not executed.
    Can you get it to do anything useful if you omit the REP entirely and rely on just the RDFAST to do your branching?
    I doubt it. The non-executing penalty is just like any other branch. If the FIFO wasn't being flushed then there'd be something to play with.
  • evanh wrote: »
    evanh wrote: »
    TonyB_ wrote: »
    Does bit 31 make a difference, then? And is RDFAST any quicker than the hidden jump in REP?
    It trades stalling for NOP'ing. The REP loop timing is affected but that is barely relevant when the execution path is all screwy. With bit 31 set, instructions go missing until the FIFO has reloaded. So, for example, if you time it to match the REP looping then the remainder of the REP block is not executed.
    Can you get it to do anything useful if you omit the REP entirely and rely on just the RDFAST to do your branching?
    I doubt it. The non-executing penalty is just like any other branch. If the FIFO wasn't being flushed then there'd be something to play with.

    If the FIFO were not flushed immediately when bit 31 is set, there might be time for a couple of instructions after the RDFAST, in which case the RDFAST "jump" might be placed two instructions early.

    This is a conjecture, though, without a full understanding of FIFO reloading.
  • TonyB_ wrote: »
    evanh wrote: »
    evanh wrote: »
    TonyB_ wrote: »
    Does bit 31 make a difference, then? And is RDFAST any quicker than the hidden jump in REP?
    It trades stalling for NOP'ing. The REP loop timing is affected but that is barely relevant when the execution path is all screwy. With bit 31 set, instructions go missing until the FIFO has reloaded. So, for example, if you time it to match the REP looping then the remainder of the REP block is not executed.
    Can you get it to do anything useful if you omit the REP entirely and rely on just the RDFAST to do your branching?
    I doubt it. The non-executing penalty is just like any other branch. If the FIFO wasn't being flushed then there'd be something to play with.

    If the FIFO were not flushed immediately when bit 31 is set, there might be time for a couple of instructions after the RDFAST, in which case the RDFAST "jump" might be placed two instructions early.

    This is a conjecture, though, without a full understanding of FIFO reloading.

    Early jump could be useless, it's very late here.
    RDFAST with bit 31 set is useful, though, for cog/LUT exec.
  • I would hope in some future version of the P2 that there will be separate instruction and data FIFOs. It would also be nice if the instruction FIFO could be implemented as a circular buffer, and the read pointer could be moved backwards a few instructions when jumping backwards. This would require that the instruction FIFO logic retain a certain number of longs -- maybe something like 16 should be sufficient. This would eliminate the need to reload the FIFO when executing small loops.
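The circular-buffer idea in the post above can be sketched in C as a small software model. This is only an illustration of the bookkeeping, not a description of how the P2 FIFO is built: the 16-long depth comes from the post, while the type and function names (`ififo_t`, `ififo_push`, `ififo_hit`, `ififo_peek`) are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define IFIFO_DEPTH 16u  /* retained longs; the post suggests about 16 */

/* Hypothetical model of an instruction FIFO that keeps recent longs. */
typedef struct {
    uint32_t longs[IFIFO_DEPTH]; /* most recently fetched instruction longs */
    uint32_t next_addr;          /* long address the next fetch will fill */
    uint32_t valid;              /* number of retained longs, 0..IFIFO_DEPTH */
} ififo_t;

/* Start fetching at start_addr with nothing retained. */
static void ififo_reset(ififo_t *f, uint32_t start_addr) {
    f->next_addr = start_addr;
    f->valid = 0;
}

/* Record one fetched long, overwriting the oldest entry once full. */
static void ififo_push(ififo_t *f, uint32_t value) {
    f->longs[f->next_addr % IFIFO_DEPTH] = value;
    f->next_addr++;
    if (f->valid < IFIFO_DEPTH) {
        f->valid++;
    }
}

/* True if a jump to target can be served from retained longs (no reload). */
static bool ififo_hit(const ififo_t *f, uint32_t target) {
    return target < f->next_addr && f->next_addr - target <= f->valid;
}

/* Read a retained long; only meaningful when ififo_hit() returned true. */
static uint32_t ififo_peek(const ififo_t *f, uint32_t target) {
    return f->longs[target % IFIFO_DEPTH];
}
```

In this model a backward jump of up to 16 longs reports a hit and moves the read pointer back, while anything older would still need a reload.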
  • Just a note... I think HyperRam was designed for execution in place... Can do a small circular buffer...
  • evanh Posts: 10,088
    edited 2020-05-24 - 03:41:49
    ersmith wrote: »
    In the new code I just checked in to github (source only for now) ORG/END will also flag its block of assembly to be copied to FCACHE, i.e. LUT memory, before execution. In C `__asm const` does the same, and in BASIC it's `ASM CPU`. FCACHE is turned on for P2 now by default. If you're able to build from source, please give it a try, I'd like to shake out the bugs.
    Rayman's sdspi_bashed.spin2 code is crapping out badly with this. It seems to be restarting all the time. I'm guessing it's most likely because it needed the timing delays that came with hubexec ... My earlier optimisations aren't any better ...

  • I'm seeing something with "round()" that I don't understand...

    I see that the VGA driver needs to do round() here to turn pixel rate into an integer:
    setxfrq ##round(fset)           'set transfer frequency to 25MHz
    

    What I tried to do is do this rounding in the CON section like this:
        fset            = round((float(fpix) / float(_clkfreq) * 2.0) * float($4000_0000))
    

    Gives this error:
    D:\Propeller2\HyperTest\HyperVGA_640_x_480_8bpp_3g.spin2:20: error: applying float to a non integer expression
    

  • Ok, I see now that fpix was already a float, so didn't need to do float() on it.
    Surprised it was working before, now that I think about it... In both fastspin and PNut.

    Somehow, doing the round() made it check for this in a different way?
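For reference, the arithmetic in that CON expression can be checked on a PC with a few lines of C. This is a sketch only: the function name `streamer_nco_freq` is made up, and the 250 MHz system clock used in the note below is an assumption (the thread gives only the 25 MHz pixel rate).

```c
#include <stdint.h>

/* Computes round((pixel_hz / clock_hz) * 2.0 * $4000_0000), the same
 * expression as the CON line above. Both inputs are already floating
 * point here, which mirrors the point that fpix needed no float(). */
static uint32_t streamer_nco_freq(double pixel_hz, double clock_hz) {
    double exact = pixel_hz / clock_hz * 2.0 * (double)0x40000000;
    return (uint32_t)(exact + 0.5); /* round to nearest for positive values */
}
```

With a 25 MHz pixel rate and the assumed 250 MHz clock, `streamer_nco_freq(25e6, 250e6)` gives 214748365.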
  • evanh wrote: »
    Rayman's sdspi_bashed.spin2 code is crapping out badly with this. It seems to be restarting all the time. I'm guessing it's most likely because it needed the timing delays that came with hubexec ... My earlier optimisations aren't any better ...

    Thanks Evan. I think the optimizer was trying to fcache a block with inlined fcache code, which failed. That should be fixed now.
  • ersmith wrote: »
    > Is there a way to force an FCACHE around a piece of inline pasm code, or something similar.

    Not yet, but there's clearly a desire for it, so I'll see what I can do.

    I'd also vote for something like that. What about a mechanism to load a small bunch of functions coded in assembly into LUT ram. Then it would be nice being able to declare those functions as LUT-resident. Maybe something like
    int MyFunction (int input) __fromLUT(&myLabel)
    
    That would speed up calls to regularly used functions a lot, I think.
  • ersmith wrote: »
    I think the optimizer was trying to fcache a block with inlined fcache code, which failed. That should be fixed now.
    Tested and working beautifully! I've rebuilt that sdspi code for extra fast pure bit-bashing via inline assembly. Next step is give the streamer a kick for the same job. See how fast an SD card in SPI mode can go.

  • ManAtWork wrote: »
    __asm {
    	rdlong  0,p
    	rdlong  0-0,p
      }
    
    The use of constants without "#" causes compiler warnings suggesting to use "-0" after the operand. But if I do exactly that, the compiler erroneously inserts a false "#". (I added some #defines because I thought it had something to do with the preprocessor, but that's not the case.)
    I can confirm this to be fixed now.

    However, Murphy says there's always one more bug (or two):
      int32_t* s= &myStruct;
      int32_t* d= 0x00;
      int i;
      __asm {
    	setq2	#256 // LUT block transfer
    	rdlong 	d,s 
    	rdlut	i,#0
      }
    
    is compiled to
    	setq2	#256
    	rdlut	result1, #0
    
    The rdlong is optimized away. I don't know if the reason is that d is the same as the source address of the RDLUT. However, RDLONG is supposed to copy 256 longs and RDLUT reads only one. It works correctly if I use __asm const.

    And another one:
        call #\0x200
    
    is compiled to
        call 512
    
    The "#\" (intended to be a jump into the LUT) seems to be ignored.
  • whicker Posts: 655
    edited 2020-05-25 - 15:48:11
    Yes I too have noticed the optimizer forget to remove the preceding setq or setq2 (or really any prefix) when deleting an instruction. Then the prefix errantly acts on the next instruction.
  • int32_t* d= 0x00;
      int i;
      __asm {
    	setq2	#256 // LUT block transfer
    	rdlong 	d,s 
    	rdlut	i,#0
      }
    
    Note that this will read 256 bytes into LUT starting at the address of d in COG, not at 0 (the value of d). I'm not entirely surprised the optimizer was confused by it :) although it probably should not have been optimized away; I'll try to add some more checks around SETQ.
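One conservative way to express "more checks around SETQ" is to treat a prefix and the instruction after it as an inseparable pair when removing dead code. The C sketch below illustrates that policy only; it is not flexspin's optimizer, and `instr_t`, `is_prefix`, and `remove_dead` are hypothetical names. It assumes prefixes themselves are never marked dead.

```c
#include <stddef.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, minimal instruction record. */
typedef struct {
    const char *opcode; /* mnemonic, e.g. "setq2" */
    bool dead;          /* set by an earlier pass that found the result unused */
} instr_t;

/* Only SETQ and SETQ2 are treated as prefixes in this sketch. */
static bool is_prefix(const instr_t *in) {
    return strcmp(in->opcode, "setq") == 0 || strcmp(in->opcode, "setq2") == 0;
}

/* Compact the list in place, dropping dead instructions. An instruction
 * directly after a prefix is always kept, so the prefix can never end up
 * acting on whatever instruction happens to follow. Returns the new count. */
static size_t remove_dead(instr_t *code, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        bool prefixed = i > 0 && is_prefix(&code[i - 1]);
        if (code[i].dead && !prefixed) {
            continue; /* safe to drop */
        }
        code[out] = code[i];
        code[out].dead = false;
        out++;
    }
    return out;
}
```

Applied to the `setq2` / `rdlong` / `rdlut` sequence from the report, this keeps all three instructions even if the `rdlong` result looks unused.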
  • I never said my code does something useful :wink: But there might be cases where somebody wants to do this. Today, I tried out so many different things that it's no wonder I'm totally confused. But yes, you are right, the way it should be is
    void InitLut ()
    {
      int32_t* s= &myStruct;
      __asm const {
    	setq2	#256
    	rdlong 	0,s 
      }
    }
    
    ... and this finally works.
  • I just noticed that Fastspin doesn't initialize variables in PUB methods whereas PNut does.
    Maybe I knew this already, don't know.
    But, it may break compatibility of PNut written code with Fastspin...
  • "If it hurts when you do that, don't do that." In other words, if you need a variable set to 0, set it to 0 explicitly.

    Maybe I can fix this (it sounds like another difference between Spin1 and Spin2), but honestly I don't think I'll be able to chase down every possible incompatibility between fastspin and PNut.
  • I thought Spin1 also initialized all variables...
  • Rayman,
    In Spin 1, I think globals are initialized, but locals are not (except maybe the result one?).
  • Dave Hein Posts: 6,205
    edited 2020-05-26 - 22:05:24
    In Spin 1, the local variables are on the stack, and they are not initialized, except for the result variable. It seems like always initializing local variables would waste cycles when entering a method. In most cases, a local variable does not need to be initialized. What would you do with local arrays? Would they be initialized also? Does PNut really do that?

    I agree with Eric. If a local variable needs to be initialized, it should be done explicitly.
  • Dave Hein wrote: »
    In Spin 1, the local variables are on the stack, and they are not initialized, except for the result variable. It seems like always initializing local variables would waste cycles when entering a method. In most cases, a local variable does not need to be initialized. What would you do with local arrays? Would they be initialized also? Does PNut really do that?

    I agree with Eric. If a local variable needs to be initialized, it should be done explicitly.

    Or just figure out if there is any chance of the variable being read while uninitialized and only then initialize it automatically (and potentially emit a warning? idk).
  • So what happens if I pass a local array to another method? Have the compiler generate code to zero the whole thing, even if it's not necessary? I think explicit initialization is better.
  • While I love default zero initialisation because it's great for dummies, so much waste happens because of it.

    So I'm happy to need to initialise. It needs to be in the gotchas or tricks and traps, though.

    It may be nice to have a global flag to force initialising everything to zero at some stage?
  • evanh Posts: 10,088
    edited 2020-05-27 - 07:15:17
    ersmith wrote: »
    Thanks Evan. I think the optimizer was trying to fcache a block with inlined fcache code, which failed. That should be fixed now.
    Uh-oh, not quite perfect after all. With -O2 the SD mount routine fails to return after certain compiles ... and it only takes a string length edit to trip it up. Basically the first timing sensitive code falls over.

    -O1 optimising seems to work every time. Edits don't upset it.


    EDIT: Ha! I think I may have found it. There is a one-liner inline assembly in the middle of the "start_explicit" mount routine within the sdspi file. All it does is fill a local variable from a GETCT instruction. I've replaced it with getct() instead.

    EDIT2: Nope, cleaning out that one-liner didn't fix it. After some more edits the SD mount problem is back with -O2 again. I guess if it's an alignment issue then -O1 is just realigning things.

  • ersmith wrote: »
    "If it hurts when you do that, don't do that." In other words, if you need a variable set to 0, set it to 0 explicitly.

    I totally agree. But we are all imperfect and forget it from time to time. Initializing everything to 0 is a great thing for debugging. It makes everything more deterministic. I hate nothing more than hunting a heisenbug for hours just to find out that I forgot to initialize a variable.

    So I'd suggest this: Pre-initialize everything with zero when compiling with -O0 and don't do it for higher optimization levels. This way I could fall back to no optimization if I suspect that there's something wrong with my code. If it works with -O0 and doesn't with -O1 then I know where to search and it saves me lots of time.
  • Stuff like RES in spin can't be initialized.

    C has rules about uninitialized variables, which need to be followed to make it standards compliant, as well.
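As a small C illustration of the explicit-initialization position: a local array has indeterminate contents until it is written, so the caller zeroes it itself before handing it to another function. The names `sum_longs` and `caller` are only for this sketch.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Callee that reads every element, so every element must hold a defined value. */
static uint32_t sum_longs(const uint32_t *buf, size_t count) {
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += buf[i];
    }
    return total;
}

static uint32_t caller(void) {
    uint32_t scratch[8];                /* contents are indeterminate here */
    memset(scratch, 0, sizeof scratch); /* explicit zeroing, paid for only where needed */
    scratch[3] = 5;
    return sum_longs(scratch, 8);       /* every element is now defined */
}
```

The cost of the `memset` is paid only in functions that need it, which is the trade-off the posts above are weighing against zeroing every local automatically.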
  • I had another one of those bugs that was driving me crazy...

    Turns out I had both a global and local variable named R2
    var 'variables
        byte    R2[17] 'r2 response (136 bits)
            
    PUB Main()|i,s,w,o,r1,r2,r3 'Give eMMC CMD0 then CMD1 and see if get a response.
    

    Tried it in PNut and that showed me the error in my ways...
  • There are good arguments both ways for conflicts between global and local variables.

    On the one hand, allowing you a local with any name at all means you can write a function without worrying about what global variable names there might be (so for example you can cut and paste functions between objects and the function will still work properly in its new place).

    On the other hand, if you forget and try to use the global in a function that already has a local with that name, you won't get what you want.
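The same trap exists in C, where it is called shadowing. The sketch below mirrors the R2 case with a 17-byte global and a same-named local; the function names are hypothetical. Spin names are not case-sensitive, so R2 and r2 collide there, and the C version uses one spelling for both.

```c
#include <stdint.h>
#include <stddef.h>

static uint8_t r2[17]; /* object-level buffer, like the 17-byte R2 response */

/* Inside this function the name r2 refers to the local int, not the array. */
static size_t size_seen_with_local(void) {
    int r2 = 0; /* local with the same name hides the global */
    (void)r2;
    return sizeof r2; /* size of the local int, not 17 */
}

/* Without a local of that name, r2 is the global array. */
static size_t size_seen_without_local(void) {
    return sizeof r2; /* 17 */
}
```

GCC and Clang accept this silently by default; building with `-Wshadow` makes them warn about the local declaration.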
  • Can you declare arrays as local variables like this?
    PUB SendCMD(op,parm)|o[6],i,c,r1,r3
        
        'Send command and sometimes get response
        'first, need to calculate CRC7
        o[0]:=$40+op
        o[1]:=(parm>>24)&$FF
        o[2]:=(parm>>16)&$FF
        o[3]:=(parm>>8)&$FF
        o[4]:=(parm>>0)&$FF
        o[5]:=0
    

    I thought you could, but it doesn't seem to work...
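For comparison, here is the same six-byte command frame built in C, including the CRC7 step the comment mentions. This is a sketch under assumptions: the original post only zeroes `o[5]`, so filling the last byte with the CRC in bits 7..1 and a 1 in bit 0 follows the usual SD/MMC command layout rather than anything shown in the thread, and the names `crc7` and `build_cmd_frame` are made up.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC7 with polynomial x^7 + x^3 + 1 (0x09), MSB first, initial value 0. */
static uint8_t crc7(const uint8_t *data, size_t len) {
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t d = data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (uint8_t)(crc << 1);
            if ((d ^ crc) & 0x80) {
                crc ^= 0x09;
            }
            d = (uint8_t)(d << 1);
        }
    }
    return crc & 0x7F;
}

/* Fill frame[0..5] the same way the Spin code fills o[0..5]. */
static void build_cmd_frame(uint8_t frame[6], uint8_t op, uint32_t parm) {
    frame[0] = (uint8_t)(0x40 + op);
    frame[1] = (uint8_t)(parm >> 24);
    frame[2] = (uint8_t)(parm >> 16);
    frame[3] = (uint8_t)(parm >> 8);
    frame[4] = (uint8_t)(parm >> 0);
    frame[5] = (uint8_t)((crc7(frame, 5) << 1) | 1); /* CRC7 plus end bit */
}
```

For CMD0 with a zero argument this produces the familiar frame 40 00 00 00 00 95.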