Does the compiler ever generate code that has pc as the destination? Or does it generate code that directly looks at the pc? If phsb were used for the PC, then I don't see an immediate problem as long as phsb only ever appears as the source of an instruction.
Yes, it generates code that modifies the pc. For short branches it generates instructions like "add pc, #N", and for subroutine returns it generates "mov pc, lr". I'm not sure if it ever reads the pc directly; I can't think of any cases off-hand.
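For concreteness, those two forms look roughly like this in the generated LMM code (the offset is made up for illustration):

                add     pc, #8              ' short forward branch: skip the next two longs
                mov     pc, lr              ' subroutine return: jump back to the saved return address

Both have pc as the destination, which is exactly the case a phsb-based pc couldn't absorb.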
Then what I am doing is futile... If the compiler doesn't use pseudo-ops to modify the instruction pointer, then there's no point in trying to make the instruction execution loop fast. "mov pc, lr" would not be handled by my code.
My goal is to make execution as fast as possible; I'm trying to avoid having to call a subroutine to execute each instruction.
I guess that would require modifying the compiler to generate macro calls for anything that modifies the pc. It might be interesting to do a back-of-the-envelope calculation to see whether adding those macros would negate the benefit of the faster straight-line instruction execution. I guess it depends on how much branching the program does.
What you appeared to be doing was investigating ways to execute instructions loaded from flash without a cache. This is not something we do today, but it could be useful if it's fast enough. Perhaps I misunderstood? Normal XMM execution today is done by pulling longs out of the cache cog.
Normal LMM execution is done like below. A few macros like LMM_PUSHM are used for specific services and are called when necessary.
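Roughly, the heart of that loop is the classic LMM fetch/execute sequence; register names here are assumed for illustration:

    lmm_loop        rdlong  lmm_ins, pc     ' fetch the next instruction long from hub RAM
                    add     pc, #4          ' advance the pc by one long
    lmm_ins         nop                     ' the fetched instruction executes in place here
                    jmp     #lmm_loop       ' go fetch the next one

    pc              long    0               ' hub address of the next instruction

Macros like LMM_PUSHM are just kernel routines that the fetched code jumps into when it needs them.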
Well, I think I can get the loop down to 12 longs if macros are used to adjust the PC.
This would get you a performance of 2 MIPS directly, without a cache.
One problem might be that a macro for short branches will increase code size, since it would replace an ADD or SUB instruction with a JMP to the macro plus an offset in the following long. I guess a function return would remain a single long, though.
I just realized that we've gradually been moving logic from the compiler to assembler for these kinds of branches. It isn't actually the compiler that generates the "add pc, #N" any more; in p2test it generates "brs #L" and lets the assembler translate this either to the add or to the CMM compressed form. Similarly the compiler generates "lret" for returning from subroutine. So it may be possible to support direct XMM with just some assembler macro changes (or with a preprocessor that modifies the compiler output before giving it to the assembler).
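A sketch of what that assembler-level choice could look like, with hypothetical kernel routine names and a made-up offset; today's LMM expansion is shown for comparison:

    ' compiler output:
                    brs     #.L4            ' short branch
                    lret                    ' subroutine return

    ' current LMM expansion (one long each):
                    add     pc, #16
                    mov     pc, lr

    ' possible direct-XMM expansion (the branch grows to two longs):
                    jmp     #__XMM_BRS      ' hypothetical kernel branch routine
                    long    16              ' offset, read from the instruction stream by the kernel
                    jmp     #__XMM_RET      ' hypothetical kernel return routine

As noted above, the return stays a single long; only the short branch pays for the extra operand long.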
Of course, all the libraries would have to be recompiled with this change, so it would add yet another entry to our matrix of libraries :-(.
Eric
Maybe Kye could work on the preprocessor that goes between the compiler and the assembler, at least to create a proof-of-concept implementation of this technique. Also, will we still be able to put data in the .text section (and hence in flash)? That will disturb the counter-based pc, so it will have to be restored before code execution can continue.
Yes, the library matrix is getting totally out of hand. It really would be nice to stick with just a few variations if possible. Even now we have LMM, CMM, XMMC, XMM-SPLIT, and XMM-SINGLE. COG is also a mode, but its libraries are built in. Then there are the P2 variations of the same libraries .... UGH.
If we can show enough benefit in having the kernel embed the memory access rather than using a cache COG, maybe we can move solely to that model.
Advantages:
- No extra COG
- More deterministic (though SPI flash, etc. will never be fully deterministic because of address loading, with one possible exception)
Disadvantages:
- User code (libraries, apps, etc.) loses access to CTRx, etc.
- Uncertainty about the required changes, and more work to finish
One question is how many different external memory solutions we can support in this mode. Kye has shown that we can support dual quad-SPI devices if we're willing to dedicate 10 pins to the interface. What about single-bit SPI devices, or a single quad SPI device? Can they be done as efficiently? I'm not sure we want to change the architecture to support a single solution.
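For a sense of scale, a bit-banged single-bit read costs about five instructions per bit; a sketch of the inner read loop, assuming the clock and data pin masks are set up elsewhere:

    read_long       mov     bits, #32
    rd_bit          test    miso_mask, ina  wc  ' sample MISO while the clock is low
                    rcl     data, #1            ' shift the bit in, MSB first
                    or      outa, clk_mask      ' clock high
                    andn    outa, clk_mask      ' clock low
                    djnz    bits, #rd_bit       ' 32 bits per long
    read_long_ret   ret

    bits            long    0
    data            long    0
    clk_mask        long    1 << 8              ' pin assignments are assumptions
    miso_mask       long    1 << 9

That's roughly 160 instructions per long before the command and address phases, so a single-bit device would be far slower per fetch than the dual quad-SPI arrangement, which moves eight bits per clock.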
We could #include the basic I/O stuff for XMMC with ADDR and READ macros. Then custom kernels could be done fairly easily. Writing to Flash or SRAM would need to be handled separately. I wouldn't even bother with XMM-SINGLE or XMM-SPLIT.
xmm-single and xmm-split just select different linker scripts. They don't have any effect on the GCC code generation.
Edit: or do you mean we shouldn't even support data in external memory at all?
Right. Not with a "direct access" kernel model.
Is there anything other than .text for XMMC to read from Flash (or SRAM in XMMC mode)?
You've done more of this than I have, so I'm sure there is more devil in the details.
I believe you can tell GCC to put const data in the .text section, which would mean variable reads would have to work from external memory as well as code execution.
I think you have to explicitly ask for that. In XMMC mode the compiler uses rdlong for all variable reads; in this case putting const data into .text wouldn't work, so it isn't the default (unless I'm misremembering something).
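In other words, a variable load in XMMC output is an ordinary hub access (register names are illustrative):

                    rdlong  r0, r7          ' works only when r7 holds a hub RAM address

A constant sitting in flash would instead need to go through the external-memory read code, which the compiler doesn't emit for loads in XMMC mode.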
Eric
You might be right about that. I seem to remember you modifying memcpy to handle addresses in flash. Maybe that's the only way to get data from flash in xmmc mode.
There's no need for what I want to do to be included in the propgcc libraries. I plan to have my own separate distribution of code for just the platform I'm working on. My goal is to optimize the software and hardware to work together. In this sense... the user isn't losing access to the counters because I never planned to give them access anyway.
For example, on the board I am building there is a user button and a user LED. Another cog handles the debouncing of that button (and all 14 digital I/O pins) in the background. To get the button state, the user calls a function that reads a register that is being updated by the cog in charge of button debouncing. Similarly, the user controls the LED through another cog, which is generating PWM waves.
As for the standard C libraries, I'll have to rework the interface for all of those anyway, since I'm not sure how they will work in the system setup I want (think Arduino-like). So I'll be recompiling the libraries anyway.
I should be able to just run a script on the intermediate ASM from the compiler to do what I want. This shouldn't be too hard. I just need to know all the different types of instructions it will output so that I can handle them all in the kernel.
---
Right now, my breakdown of cores is:
14 Pin Button Debouncer / RTC Interrupt Debouncer / User Button Debouncer / Millisecond Timer / Microsecond Timer (the ASM I wrote for this core is really tricky due to the microsecond timer update requirement)
14 Pin Regular Encoder / 14 Pin Quadrature Encoder / 14 Pin PWM Driver / User LED PWM Driver
Dual Serial Port Driver - 256-byte input and output FIFOs per port - up to 115200 bps per port
Kernel
User Controllable Helper Core 0 (the user can supply cog code for this core)
User Controllable Helper Core 1 (the user can supply cog code for this core)
14 Pin Servo Driver (haven't figured out how to use up all the clock cycles for this core yet... not really sure what to do...)
Free Core (was going to be the cache driver but now I can use it for something else - maybe SPI bus / I2C bus slave?)
Anyway, I'm not going for general purpose here.