Does the compiler ever generate code that has pc as the destination? Or does it generate code that directly looks at the pc? If phsb were used for the PC, then I don't see an immediate problem as long as phsb only ever appears as the source of an instruction.
Yes, it generates code that modifies the pc. For short branches it generates instructions like "add pc, #N", and for subroutine returns it generates "mov pc, lr". I'm not sure if it ever reads the pc directly; I can't think of any cases off-hand.
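For concreteness, those two forms look roughly like this in the generated LMM code (the offset is made up for illustration):

                add     pc, #8              ' short forward branch: skip the next two longs
                mov     pc, lr              ' subroutine return: jump back to the saved return address

Both have pc as the destination, which is exactly the case a phsb-based pc couldn't absorb.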
Then what I am doing is futile... If the compiler doesn't use pseudo-ops to modify the instruction pointer, then there's no point in trying to make the instruction execution loop fast. "mov pc, lr" would not be handled by my code.
My goal is to make execution as fast as possible; I'm trying to avoid having to call a subroutine to execute each instruction.
I guess that would require modifying the compiler to generate macro calls for anything that modifies the pc. It might be interesting to do a back-of-the-envelope calculation to see whether adding those macros would negate the benefit of the faster straight-line instruction execution. I guess it depends on how much branching the program does.
What you appeared to be doing was investigating ways to execute instructions loaded from flash without a cache. This is not something we do today, but it could be useful if it's fast enough. Perhaps I misunderstood? Normal XMM execution today is done by pulling longs out of the cache cog.
Normal LMM execution is done like below. A few macros like LMM_PUSHM are used for specific services and are called when necessary.
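Roughly, the heart of that loop is the classic LMM fetch/execute sequence; register names here are assumed for illustration:

    lmm_loop        rdlong  lmm_ins, pc     ' fetch the next instruction long from hub RAM
                    add     pc, #4          ' advance the pc by one long
    lmm_ins         nop                     ' the fetched instruction executes in place here
                    jmp     #lmm_loop       ' go fetch the next one

    pc              long    0               ' hub address of the next instruction

Macros like LMM_PUSHM are just kernel routines that the fetched code jumps into when it needs them.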
Well, I think I can get the loop down to 12 longs if macros are used to adjust the PC.
This would get you a performance of 2 MIPS directly, without a cache.
One problem might be that a macro for short branches will increase code size, since it would replace an ADD or SUB instruction with a JMP to the macro plus an offset in the following long. I guess a function return would remain a single long, though.
I just realized that we've gradually been moving logic from the compiler to assembler for these kinds of branches. It isn't actually the compiler that generates the "add pc, #N" any more; in p2test it generates "brs #L" and lets the assembler translate this either to the add or to the CMM compressed form. Similarly the compiler generates "lret" for returning from subroutine. So it may be possible to support direct XMM with just some assembler macro changes (or with a preprocessor that modifies the compiler output before giving it to the assembler).
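A sketch of what that assembler-level choice could look like, with hypothetical kernel routine names and a made-up offset; today's LMM expansion is shown for comparison:

    ' compiler output:
                    brs     #.L4            ' short branch
                    lret                    ' subroutine return

    ' current LMM expansion (one long each):
                    add     pc, #16
                    mov     pc, lr

    ' possible direct-XMM expansion (the branch grows to two longs):
                    jmp     #__XMM_BRS      ' hypothetical kernel branch routine
                    long    16              ' offset, read from the instruction stream by the kernel
                    jmp     #__XMM_RET      ' hypothetical kernel return routine

As noted above, the return stays a single long; only the short branch pays for the extra operand long.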
Of course, all the libraries would have to be recompiled with this change, so it would add yet another entry to our matrix of libraries :-(.
Eric
Maybe Kye could work on the preprocessor that goes between the compiler and the assembler, at least to create a proof-of-concept implementation of this technique. Also, will we still be able to put data in the .text section (and hence in flash)? That will disturb the counter-based pc, so it will have to be restored before code execution can continue.
Yes, the library matrix is getting totally out of hand. It really would be nice to stick with just a few variations if possible. Even now we have LMM, CMM, XMMC, XMM-SPLIT, and XMM-SINGLE. COG is also a mode, but its libraries are built in. Then there are the P2 variations of the same libraries .... UGH.
If we can show enough benefit in having the kernel embed the memory access rather than using a cache COG, maybe we can move solely to that model.
Advantages:
- No extra COG
- More deterministic (though SPI flash, etc. will never be fully deterministic because of address loading, with one possible exception)
Disadvantages:
- User code (libraries, apps, etc.) loses access to CTRx, etc.
- Uncertainty about the required changes, and more work to finish
One question is how many different external memory solutions we can support in this mode. Kye has shown that we can support dual quad-SPI devices if we're willing to dedicate 10 pins to the interface. What about single-bit SPI devices, or a single quad SPI device? Can they be done as efficiently? I'm not sure we want to change the architecture to support a single solution.
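For a sense of scale, a bit-banged single-bit read costs about five instructions per bit; a sketch of the inner read loop, assuming the clock and data pin masks are set up elsewhere:

    read_long       mov     bits, #32
    rd_bit          test    miso_mask, ina  wc  ' sample MISO while the clock is low
                    rcl     data, #1            ' shift the bit in, MSB first
                    or      outa, clk_mask      ' clock high
                    andn    outa, clk_mask      ' clock low
                    djnz    bits, #rd_bit       ' 32 bits per long
    read_long_ret   ret

    bits            long    0
    data            long    0
    clk_mask        long    1 << 8              ' pin assignments are assumptions
    miso_mask       long    1 << 9

That's roughly 160 instructions per long before the command and address phases, so a single-bit device would be far slower per fetch than the dual quad-SPI arrangement, which moves eight bits per clock.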
We could #include the basic I/O stuff for XMMC with ADDR and READ macros. Then custom kernels could be done fairly easily. Writing to Flash or SRAM would need to be handled separately. I wouldn't even bother with XMM-SINGLE or XMM-SPLIT.
xmm-single and xmm-split just select different linker scripts. They don't have any effect on the GCC code generation.
Edit: or do you mean we shouldn't even support data in external memory at all?
Right. Not with a "direct access" kernel model.
Is there anything other than .text for XMMC to read from Flash (or SRAM in XMMC mode)?
You've done more of this than I have, so I'm sure there is more devil in the details.
I believe you can tell GCC to put const data in the .text section, which would mean variable reads would have to work from external memory as well as code execution.
I think you have to explicitly ask for that. In XMMC mode the compiler uses rdlong for all variable reads; in this case putting const data into .text wouldn't work, so it isn't the default (unless I'm misremembering something).
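In other words, a variable load in XMMC output is an ordinary hub access (register names are illustrative):

                    rdlong  r0, r7          ' works only when r7 holds a hub RAM address

A constant sitting in flash would instead need to go through the external-memory read code, which the compiler doesn't emit for loads in XMMC mode.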
Eric
You might be right about that. I seem to remember you modifying memcpy to handle addresses in flash. Maybe that's the only way to get data from flash in xmmc mode.
There's no need for what I want to do to be included in the propgcc libraries. I plan to have my own separate distribution of code for just the platform I'm working on. My goal is to optimize the software and hardware to work together. In this sense... the user isn't losing access to the counters because I never planned to give them access anyway.
For example, on the board I am building there is a user button and a user LED. Another cog handles the debouncing of that button (and all 14 digital I/O pins) in the background. To get the button state, the user calls a function that reads a register that is being updated by the cog in charge of button debouncing. Similarly, the user controls the LED through another cog, which is generating PWM waves.
As for the standard C libraries, I'll have to rework the interface for all of those anyway, since I'm not sure how they will work in the system setup I want (think Arduino-like). So I'll be recompiling the libraries anyway.
I should be able to just run a script on the intermediate ASM from the compiler to do what I want. This shouldn't be too hard. I just need to know all the different types of instructions it will output so that I can handle them all in the kernel.
---
Right now, my breakdown of cores is:
14 Pin Button Debouncer / RTC Interrupt Debouncer / User Button Debouncer / Millisecond Timer / Microsecond Timer (the ASM I wrote for this core is really tricky due to the microsecond timer update requirement)
14 Pin Regular Encoder / 14 Pin Quadrature Encoder / 14 Pin PWM Driver / User LED PWM Driver
Dual Serial Port Driver - 256-byte input and output FIFOs per port - up to 115200 bps per port
Kernel
User Controllable Helper Core 0 (the user can supply cog code for this core)
User Controllable Helper Core 1 (the user can supply cog code for this core)
14 Pin Servo Driver (haven't figured out how to use up all the clock cycles for this core yet... not really sure what to do...)
Free Core (was going to be the cache driver but now I can use it for something else - maybe SPI bus / I2C bus slave?)
Anyway, I'm not going for general purpose here.