Kernel and Cache Driver Interworks — Parallax Forums

Kye Posts: 2,200
edited 2013-04-12 21:05 in Propeller 1
I'm looking at making a custom kernel for propgcc. I want to know if I can compile code for XMMC mode without including the kernel explicitly. Basically, I want to set up a driver package that turns the Prop chip into effectively an XMMC processor with a bunch of software-defined hardware using the other cogs.

As I understand it now, the kernel executes your code and the cache driver caches the instructions to execute on the request of the kernel. However, does the cache driver do look ahead? Or is it just a two cog solution because the code wouldn't fit in one processor?

I would like to be able to include the code necessary for caching data right in the kernel. Would this be smart? Or would it be better to put that code into the cache driver?

I have an octal SPI setup and I'm running the propeller chip at 96 MHz.

...

I think I can get a 1.2 megalong-per-second write to main memory for reading in a new cache line, and I should be able to execute instructions at a rate somewhere above 5 MIPS.

Comments

  • Kye Posts: 2,200
    edited 2013-04-08 16:38
    Mmm, something else I could also do...

    If I want to store the instructions directly in cog memory, I can still read them in at 1.2 megalongs per second using a 20-instruction loop, and then I could execute them at 24 MIPS. However, I would pay for this in that the cache would be reduced from about 8K to 1K.
  • Dr_Acula Posts: 5,484
    edited 2013-04-08 17:01
    If I want to store the instructions directly in cog memory, I can still read them in at 1.2 megalongs per second using a 20-instruction loop, and then I could execute them at 24 MIPS. However, I would pay for this in that the cache would be reduced from about 8K to 1K.

    That sounds clever. Ok, brainstorming that a bit, is that like a level 1 and a level 2 cache? Run from the cog cache, if there is a miss, check for hub cache, and if there is a miss, read from external memory?
  • ersmith Posts: 6,054
    edited 2013-04-08 17:02
    Yes, you could replace the XMM kernel with your own. The final inclusion of the kernel is done at link time, so you'd have to run the linker explicitly yourself to force your own kernel to be linked. It would also have to provide all of the functions that the compiler expects.

    The kernel is in a file called _crt0.o. The source code for this is in gcc/gcc/config/propeller/crt0.S, which (for any of the XMM modes) includes crt0_xmm.s. That's what you'd have to change.

    If you give the compiler driver propeller-elf-gcc the -v option, it will print all the commands it's running, in particular the linker, so you can see how to replace _crt0.o.

    Regarding your idea of caching directly into cog memory, unfortunately that wouldn't work without compiler changes. The compiler expects the LMM/XMM kernel to use a register called "pc" to fetch instructions from memory with rdlong, and it does things like "add pc, #4" to skip over instructions or data.
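    A toy model of that fetch cycle, in Python (the names and the instruction encoding here are made up for illustration; propgcc's actual kernel works on raw PASM longs):

```python
# Toy model of the LMM/XMM fetch loop the compiler assumes: the kernel
# keeps a "pc" register, fetches each instruction (the equivalent of
# "rdlong insn, pc"), executes it, and advances pc by 4 - with things
# like "add pc, #4" used to skip over embedded data.
def run_lmm(hub, pc):
    executed = []
    while True:
        insn = hub[pc // 4]              # rdlong insn, pc
        if insn == "halt":
            break
        if isinstance(insn, tuple) and insn[0] == "add_pc":
            pc += insn[1]                # e.g. add pc, #8 skips a data long
            continue
        executed.append(insn)            # stand-in for executing the long
        pc += 4                          # add pc, #4
    return executed

# The data long after "add_pc" is skipped, not executed:
print(run_lmm([("add_pc", 8), "data", "op1", "op2", "halt"], 0))
```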

    Eric
  • Kye Posts: 2,200
    edited 2013-04-08 17:20
    I can use a script to pre-process the output ASM code before it is converted to hex. I suppose I could fix this problem by changing all the JMPs to CALLs.

    My real question is what impact this would have on performance. I don't know how much the cache helps. I guess David would know this.
  • jazzed Posts: 11,803
    edited 2013-04-08 17:23
    The current kernel/cache design (call it the cache model) is a 2 COG affair.

    Having a second COG do the cache has some not-so-obvious advantages with respect to interacting with the I/O pins, as well as the obvious burst nature of the design:
    1) No need to "and-not-or bits" in the OUTA register for setting pins - i.e. you can only affect the outputs that are enabled by the COG. In a single-COG kernel/hardware controller, one needs to spend cycles making sure only the bits you care about get set, which may also involve shifting things around.
    2) The counters and other features of the cache COG are exclusively yours. In a single-COG kernel/hardware controller, the counters belong to the user.
    3) The cache COG can do bursts for slow devices - i.e. a single-bit SPI in a cache COG is almost as fast as a byte-wide access design with the same address setup. The SSF byte-wide 2x QuadSPI design is about 30% faster than the equivalent 1-bit SPI design.
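    For reference, the "and-not-or" sequence from point 1 is the usual read-modify-write on a shared OUTA value; sketched in Python with made-up pin assignments:

```python
def set_pins(outa, mask, value):
    """Update only the pins selected by `mask`, leaving every other bit
    of the 32-bit OUTA value untouched - the and-not-or work a
    single-COG kernel/hardware controller must spend cycles on."""
    return (outa & ~mask & 0xFFFFFFFF) | (value & mask)

outa = 0b1010_0000                           # P5 and P7 currently high
outa = set_pins(outa, 0b0000_1111, 0b0110)   # drive P0..P3, leave the rest
print(bin(outa))                             # 0b10100110
```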

    The current cache model does not look ahead. It is merely a fetcher/write-back storer at the moment. The cache is mostly direct-mapped, although David has done some n-way set-associative experiments.

    In a P1, the fastest direct kernel/hardware access model (call it the direct model) with a parallel bus design eats most of the Propeller pins. The best fetch/store rate would be about 4 instructions per access with such hardware - i.e. 5 MB/s at 80 MHz, or about 1 MIPS or less. That does not account for the other work the kernel has to do.

    Funny you mention using an in-COG cache though (the direct model). I've been talking with David and Eric about the possibility of using the CLUT on P2 as a tiny cache. I suspect that even with SDRAM setup times (and a single-line cache in the CLUT), the P2 direct hardware access would outperform a cached model because of the HUB bottleneck (even with those spiffy new multi-long instructions). Once a "line" is burst-read into the CLUT, getting a new instruction is mostly a matter of checking the address range and extracting the instruction. Writes to external memory could be via write-back, but I suspect a simpler write-through would be more efficient.
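    A minimal sketch of that single-line scheme in Python (a stand-in for the CLUT line; the 32-byte burst size and all names are assumptions, not P2 specifics):

```python
LINE_SIZE = 32   # bytes per burst line (SDRAM works naturally with 32-byte bursts)

class OneLineCache:
    """Single-line instruction cache: a hit is just an address-range
    check plus an index into the line; a miss burst-reads the whole
    line from the backing store."""
    def __init__(self, backstore):
        self.backstore = backstore   # external memory, long-addressable dict
        self.base = None             # external address of the cached line
        self.line = []

    def fetch_long(self, addr):
        base = addr & ~(LINE_SIZE - 1)
        if base != self.base:        # miss: refill the whole line
            self.base = base
            self.line = [self.backstore[base + i] for i in range(0, LINE_SIZE, 4)]
        return self.line[(addr & (LINE_SIZE - 1)) // 4]

mem = {a: a * 10 for a in range(0, 256, 4)}   # fake external memory
cache = OneLineCache(mem)
print(cache.fetch_long(0x44))   # miss: fills the line based at 0x40
print(cache.fetch_long(0x48))   # hit: same line, no refill
```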

    I really think the direct model on the P2 has merit. On P1, it's not as attractive, but it could be a stepping-stone for P2.

    Best regards,
    --Steve
  • Kye Posts: 2,200
    edited 2013-04-08 17:23
    @Dr_Acula,

    No, there would just be one cache in cog RAM. This would liberate 8K in hub RAM.
  • Kye Posts: 2,200
    edited 2013-04-08 17:38
    @Jazzed - What I am working on is purely a stepping stone for the P2. But I might as well give it my best shot for the P1.

    I'm making an Arduino-and-more operating system for the Propeller. It's going to be event-driven. I hope to have significant progress done by August (I have a few months off before I start my job after graduating college this May, and I'm going to try to work very hard to get the bulk of the low-level work done).

    In my system, there is no assumption about what the hardware is doing; the user does not own OUTA/DIRA/INA, and they can call C-level functions safely without messing with the hardware. I'm already 80% done designing and routing a specific board for this project. Attached is my current progress.

    It has an RTC, SD card slot, FTDI chip, ADC, Octal SPI Flash (16 Mb), EEPROM, and a nice little bread board area. I just have to clean up the schematic and fix the routing for the I/O pins area (this might be heroic...).

    Anyway, I'm doing a lot of stuff full custom. For example, to make the loader really fast, I'm not using a serial port; instead I'm going to use the D2XX driver to get 3 Mbaud speeds out of the FTDI chip. I should be able to get around 200 KB a second write speed to the flash for programming. ;)
  • jazzed Posts: 11,803
    edited 2013-04-08 18:01
    Kye wrote: »
    @Jazzed - What I am working on is purely a stepping stone for the P2. But, I might as well give it my best shot for the P1. ...

    Sounds pretty neat Kye.

    When I mentioned "user" with respect to OUTA, etc., I was talking about any software that the kernel runs, which includes libraries. The kernel cannot mess with pins that it does not "own" and thus must do "and-not-or" operations so it does not interfere with "user" pins such as the serial port and EEPROM pins.

    If you have time to work on a "direct model" for P1/P2, it would be great. I'm sure the current core team will be supportive where possible. David is the point man at the moment, so he makes the decisions - I retired from that job.

    Best regards,
    --Steve
  • David Betz Posts: 14,516
    edited 2013-04-08 20:29
    jazzed wrote: »
    Funny you mention using an in-COG cache though (the direct model). I've been talking with David and Eric about the possibility of using the CLUT on P2 as a tiny cache. I suspect that even with SDRAM setup times (and a single-line cache in the CLUT), the P2 direct hardware access would outperform a cached model because of the HUB bottleneck (even with those spiffy new multi-long instructions). Once a "line" is burst-read into the CLUT, getting a new instruction is mostly a matter of checking the address range and extracting the instruction. Writes to external memory could be via write-back, but I suspect a simpler write-through would be more efficient.

    I really think the direct model on the P2 has merit. On P1, it's not as attractive, but it could be a stepping-stone for P2.

    Best regards,
    --Steve
    This sounds like it would be an interesting approach but I don't really have the time to investigate it. Maybe we should put P2 XMM on hold until you have enough time to work on this?
  • David Betz Posts: 14,516
    edited 2013-04-08 20:40
    Kye wrote: »
    I'm looking at making a custom kernel for propgcc. I want to know if I can compile code for XMMC mode without including the kernel explicitly. Basically, I want to set up a driver package that turns the Prop chip into effectively an XMMC processor with a bunch of software-defined hardware using the other cogs.

    As I understand it now, the kernel executes your code and the cache driver caches the instructions to execute on the request of the kernel. However, does the cache driver do look ahead? Or is it just a two cog solution because the code wouldn't fit in one processor?

    I would like to be able to include the code necessary for caching data right in the kernel. Would this be smart? Or would it be better to put that code into the cache driver?

    I have an octal SPI setup and I'm running the propeller chip at 96 MHz.

    ...

    I think I can get a 1.2 megalong-per-second write to main memory for reading in a new cache line, and I should be able to execute instructions at a rate somewhere above 5 MIPS.
    The kernel and the cache driver are mostly in separate COGs because there isn't enough space in one COG. That may not be completely true for some types of memory: the SPI SRAM code is pretty small and could probably fit in with the kernel, although some of the FCACHE space might be lost. Also, cache drivers like Steve's SDRAM driver probably wouldn't fit, and there might be issues with SDRAM refresh. We wanted a general solution, and having the memory access code merged with the kernel only works some of the time. The two-COG solution is more general.
  • Kye Posts: 2,200
    edited 2013-04-08 21:05
    I see. But do you know about cache performance? Would executing the code from the cog with basically only 1 cache line be worse compared to executing the code from cache lines in main memory? I heard you say the default cache size is 8K. Then are the lines 512 bytes?
  • jazzed Posts: 11,803
    edited 2013-04-08 21:54
    Kye wrote: »
    I see. But do you know about cache performance? Would executing the code from the cog with basically only 1 cache line be worse compared to executing the code from cache lines in main memory? I heard you say the default cache size is 8K. Then are the lines 512 bytes?

    Today, the P1 cache-cog model uses roughly 20 instructions in the kernel talking to the cache cog via a mailbox. The direct-mapped cache-cog algorithm uses roughly 30 instructions on a cache hit - a miss is much longer. In both cases there are about 4 hub delays for communications. I don't have an analysis, but it seems to me that even on a miss with SDRAM and no address/data latching tricks, a single line could be filled in about the same time as a cache hit. The trick would be optimizing a CLUT cache-hit algorithm for P2. I begged for a magnitude comparator years ago that would have helped greatly.

    Cache lines > 128 bytes are good if you have small pieces of code and a slow backing-store device, but the fetch time starts to eat into performance with more code scattering. SDRAM is designed to work naturally with 32-byte bursts. Bigger bursts take special handling with SDRAM.

    David has a point on SDRAM refresh BTW - not sure how Chip handles that just now.
  • Kye Posts: 2,200
    edited 2013-04-08 22:26
    Well, for the P2, I worked this out with Chip and you'll be able to bring in a long from an octal SPI bus with I believe about 4 to 6 instructions. He put in some hardware which allows you to shift bytes in one at a time. Using that you get around the whole masking and shifting problem which kills the performance on the P1. It wouldn't be the fastest code. But, it would be simple and run pretty well.

    For the P1... at least for what I'm doing, I think I'll just go with a 20-instruction loop that uses no cache and just fetches one instruction at a time from flash. It would run at around 1.2 MIPS. This will save space in main memory.

    On another note, I was looking in the XMM kernel code and I noticed that there were multiple read and write pseudo instructions. There were simple versions, intermediate versions, and complex versions. What is all that for?

    ...

    One more thing. So, I assume that for XMMC you guys have just mapped the executable code to somewhere high in the 32-bit memory address range. This means that the user's code has no rdlongs and wrlongs in it but instead fake instructions which invoke the kernel code, right? These fake instructions then check if they need to get the value from RAM or go out to get the value from flash (in my case)? And... the wrlong would just fail for the flash.

    Also, does FCACHE code look like the rest of the code too? Or is it compiled differently?
  • David Betz Posts: 14,516
    edited 2013-04-09 04:00
    Kye wrote: »
    I see. But do you know about cache performance? Would executing the code from the cog with basically only 1 cache line be worse compared to executing the code from cache lines in main memory? I heard you say the default cache size is 8K. Then are the lines 512 bytes?
    The default cache size is 8K but you can adjust that to any power of 2. The cache line size can also be configured but it is 64 bytes by default. I think the only exception to this is the SD cache driver where the line size is 512 bytes because that's the smallest amount you can read from an SD card. I'm not sure how doing memory access directly in the kernel would compare with a cached approach. I've never tried doing that. Whether the memory access code would even fit depends on the type of memory. I guess a simple parallel memory interface might fit but would use a lot of pins.
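    Those defaults imply a simple power-of-two address split for a direct-mapped cache; a sketch in Python (illustrative only, not the driver's actual code):

```python
CACHE_SIZE = 8 * 1024                 # default cache size (any power of 2)
LINE_SIZE  = 64                       # default line size in bytes
NUM_LINES  = CACHE_SIZE // LINE_SIZE  # 128 direct-mapped lines

def split_address(addr):
    """Split an external address into (tag, line index, byte offset);
    with power-of-2 sizes this is just masking and shifting."""
    offset = addr & (LINE_SIZE - 1)
    index  = (addr // LINE_SIZE) % NUM_LINES
    tag    = addr // CACHE_SIZE
    return tag, index, offset

# Two addresses exactly one cache size apart collide on the same line
# but carry different tags:
print(split_address(0x30000040))
print(split_address(0x30000040 + CACHE_SIZE))
```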
  • David Betz Posts: 14,516
    edited 2013-04-09 04:06
    Kye wrote: »
    Well, for the P2, I worked this out with Chip and you'll be able to bring in a long from an octal SPI bus with I believe about 4 to 6 instructions. He put in some hardware which allows you to shift bytes in one at a time. Using that you get around the whole masking and shifting problem which kills the performance on the P1. It wouldn't be the fastest code. But, it would be simple and run pretty well.

    For the P1... at least for what I'm doing, I think I'll just go with a 20-instruction loop that uses no cache and just fetches one instruction at a time from flash. It would run at around 1.2 MIPS. This will save space in main memory.

    On another note, I was looking in the XMM kernel code and I noticed that there were multiple read and write pseudo instructions. There were simple versions, intermediate versions, and complex versions. What is all that for?
    We're currently using the intermediate version. We started with the simple version and added the other code to try to improve performance. I'm not sure why we didn't go to the COMPLEX version. I guess Eric decided it didn't buy us anything.
    ...

    One more thing. So, I assume that for XMMC you guys have just mapped the executable code to somewhere high in the 32-bit memory address range. This means that the user's code has no rdlongs and wrlongs in it but instead fake instructions which invoke the kernel code, right? These fake instructions then check if they need to get the value from RAM or go out to get the value from flash (in my case)? And... the wrlong would just fail for the flash.
    Yes, we use kernel macros for all memory access and those macros check to see if the address is above 0x20000000. If it is, we first check to see if it is in the same page as the last external memory access. If it is we can avoid calling the cache COG at all and we use the same translation that we used for the last access. This is like a one entry TLB. This is only done for reads, writes always go to the cache COG.
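    That one-entry TLB can be sketched in Python (the 0x20000000 boundary is from David's description; the 64-byte page and all names are illustrative):

```python
XMM_BASE  = 0x20000000      # addresses at or above this live in external memory
PAGE_MASK = ~63             # assume 64-byte pages (the default line size)

last_page = None            # page of the last external read
last_line = None            # its cached translation (line contents)
cache_cog_calls = 0         # how often we had to bother the cache COG

def read_long(addr, hub, external):
    """Hub addresses go straight to hub RAM.  External reads reuse the
    last translation when the page matches (the one-entry TLB);
    otherwise they call the cache COG.  Writes would always go to the
    cache COG."""
    global last_page, last_line, cache_cog_calls
    if addr < XMM_BASE:
        return hub[addr]
    page = addr & PAGE_MASK
    if page != last_page:                    # TLB miss
        cache_cog_calls += 1                 # mailbox round-trip to cache COG
        last_page = page
        last_line = {a: external[a] for a in range(page, page + 64, 4)}
    return last_line[addr]

hub = {0x100: 11}
ext = {a: a for a in range(0x20000000, 0x20000080, 4)}
print(read_long(0x100, hub, ext))        # hub access, no cache COG involved
print(read_long(0x20000004, hub, ext))   # first external read: one call
print(read_long(0x20000008, hub, ext))   # same page: translation reused
print(cache_cog_calls)                   # 1
```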
    Also, does FCACHE code look like the rest of the code too? Or is it compiled differently?
    Eric will have to answer this.
  • ersmith Posts: 6,054
    edited 2013-04-09 04:31
    Kye wrote: »
    On another note, I was looking in the XMM kernel code and I noticed that there were multiple read and write pseudo instructions. There were simple versions, intermediate versions, and complex versions. What is all that for?
    We were trying out different ways of interfacing the compiler to the XMM read/write functions. We ended up going with the intermediate form. The complex one would have saved some space, but couldn't work with the fcache (it relied on reading the instruction itself and decoding unused bits in the JMP). It'd also be harder to get working with P2, I think. You can safely delete the simple and complex versions and just keep the intermediate.
    One more thing. So, I assume that for XMMC you guys have just mapped the executable code to somewhere high in the 32-bit memory address range. This means that the user's code has no rdlongs and wrlongs in it but instead fake instructions which invoke the kernel code, right?
    Actually for XMMC the data is in hub, so the user's code does have rdlong/wrlong to access data (but not code). For regular XMM data accesses have to go through the kernel code, except variables the compiler knows are on the stack (the stack is always in the hub memory).
    Also, does FCACHE code look like the rest of the code too? Or is it compiled differently?

    FCACHE is compiled to a call to __LMM_FCACHE_LOAD, followed by a count of bytes and then the actual instructions. All of this is in code space, so __LMM_FCACHE_LOAD has to use the cache read functions to fetch the data into COG ram. The memory access instructions are the same as non-FCACHE'd (so in XMM they call in to the kernel); the only difference is that branches in the FCACHE'd code are changed to refer to COG memory at the address where the code will eventually run.
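    The load step can be modeled like this in Python (a toy long-addressed code array; the real __LMM_FCACHE_LOAD of course works on PASM longs and fetches through the cache read path):

```python
# FCACHE layout per the description above: after the call to
# __LMM_FCACHE_LOAD comes a count of bytes, then the instructions
# themselves, which get copied into COG RAM and run from there.
def fcache_load(code, pc, cog_ram):
    """`pc` points just past the __LMM_FCACHE_LOAD call: read the byte
    count, copy that many bytes' worth of longs into cog_ram, and
    return the pc of the first long after the block."""
    count = code[pc // 4]                 # count of bytes to load
    pc += 4
    n = count // 4
    cog_ram[:n] = code[pc // 4 : pc // 4 + n]
    return pc + count                     # resume address after the block

code = [8, "ins_a", "ins_b", "next"]      # count = 8 bytes = 2 longs
cog = [None] * 8
resume = fcache_load(code, 0, cog)
print(cog[:2], code[resume // 4])         # ['ins_a', 'ins_b'] next
```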

    Eric
  • Kye Posts: 2,200
    edited 2013-04-09 07:14
    What was the increment by the size of the data for?

    EDIT: Wait, I see, for auto incrementing the pointer.

    If I implemented the complex form anyway, would that cause problems? I suppose both are the same if the I bit is zero.
  • ersmith Posts: 6,054
    edited 2013-04-09 07:22
    Kye wrote: »
    What was the increment by the size of the data for?

    That was supposed to allow the compiler to optimize things like "foo = *bar++" by having the memory read also increment the pointer variable. The compiler does not use that, and I'm not even sure we ever implemented it in the rd/wr functions. There's no need for you to implement it.

    Eric
  • Kye Posts: 2,200
    edited 2013-04-09 09:21
    Can the kernel's 32-bit multiply and divide routines be used to support 64-bit numbers?
  • ersmith Posts: 6,054
    edited 2013-04-09 12:19
    Kye wrote: »
    Can the kernel's 32-bit multiply and divide routines be used to support 64-bit numbers?
    Not directly. The PropGCC libraries have 64 bit arithmetic in them, though.
  • Kye Posts: 2,200
    edited 2013-04-10 06:47
    I got the XIP loop down to 14 longs from 20. Not much of a speed boost however.
    ' Execute In Place (XIP)
    
    XMMCLoop                movi    frqa,               #00 ' made up number
    
    
                            movs    XMMCTemp0,          ina
                            ror     XMMCTemp0,          #8
                            
                            movs    XMMCTemp0,          ina
                            ror     XMMCTemp0,          #8
                            
                            movs    XMMCTemp0,          ina
                            ror     XMMCTemp0,          #8
                            
                            movs    XMMCTemp1,          ina
                            
                            mov     frqa,               #0
                            
                            or      XMMCTemp0,          XMMCTemp1
                            ror     XMMCTemp0,          #8
    
    
    XMMCTemp0               nop
    
    
                            add     pc,                 #4
                            jmp     #XMMCLoop          
    
    
    XMMCTemp1               long    0
    

    Assume that the octal SPI bus is P0 through P7 and that CS, which is held low, is on P8.

    I wonder if I can get this tighter. Right now it runs at 1,714,285.7 instructions per second (about 1.7 MIPS). I would need to cut 2 more longs to get to a nice round number.
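    A quick check of that figure (simple arithmetic, assuming every instruction in the 14-long loop takes the normal 4 cycles):

```python
clock_hz = 96_000_000          # the 96 MHz clock mentioned above
loop_longs = 14                # loop instructions executed per fetched long
cycles_per_insn = 4            # normal cog instruction timing
cycles_per_fetch = loop_longs * cycles_per_insn   # 56 cycles per long
rate = clock_hz / cycles_per_fetch
print(rate)                    # 1714285.71... instructions/second, ~1.7 MIPS
```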
  • David Betz Posts: 14,516
    edited 2013-04-10 07:03
    Kye wrote: »
    I got the XIP loop down to 14 longs from 20. Not much of a speed boost however.
    ' Execute In Place (XIP)
    
    XMMCLoop                movi    frqa,               #00 ' made up number
    
    
                            movs    XMMCTemp0,          ina
                            ror     XMMCTemp0,          #8
                            
                            movs    XMMCTemp0,          ina
                            ror     XMMCTemp0,          #8
                            
                            movs    XMMCTemp0,          ina
                            ror     XMMCTemp0,          #8
                            
                            movs    XMMCTemp1,          ina
                            
                            mov     frqa,               #0
                            
                            or      XMMCTemp0,          XMMCTemp1
                            ror     XMMCTemp0,          #8
    
    
    XMMCTemp0               nop
    
    
                            add     pc,                 #4
                            jmp     #XMMCLoop          
    
    
    XMMCTemp1               long    0
    

    Assume that the octal SPI bus is P0 through P7 and that CS, which is held low, is on P8.

    I wonder if I can get this tighter. Right now it runs at 1,714,285.7 instructions per second (about 1.7 MIPS). I would need to cut 2 more longs to get to a nice round number.
    Don't you need a nop just before XMMCTemp0?
  • jazzed Posts: 11,803
    edited 2013-04-10 08:41
    Use CTRA to start/stop the clock instead of FRQA.

    There is an alternative that uses 8 instructions, but it uses the carry flag. You end up with the same number of instructions but one less data long.
  • Kye Posts: 2,200
    edited 2013-04-10 10:06
    XMMC_loop    movi frqa,      #%0010_0000_0

                 movs XMMC_ins1, ina
                 ror  XMMC_ins1, #8
                 movs XMMC_ins1, ina
                 ror  XMMC_ins1, #8
                 movs XMMC_ins1, ina
                 ror  XMMC_ins1, #8
                 movs XMMC_ins0, ina

                 mov  frqa,      #0

                 or   XMMC_ins1, XMMC_ins0

    XMMC_ins0    long 0
    XMMC_ins1    nop

                 add  pc,        #4
                 jmp  #XMMC_loop


    This should do it. It gives 1.6 MIPS for 4-cycle instructions and 1.5 MIPS for hub instructions.

    There might be a problem with phase drift with the above code. Haven't checked.

    EDIT: Forum software seems to remove % things in the code. The number up there is %0010_0000_0.
  • jazzed Posts: 11,803
    edited 2013-04-10 10:41
    I assume you are counting on many instructions being in contiguous addresses. Of course that's not always the case, and some address-setup decision making will need to be done.

    Also, the Microchip SQI parts might be faster since the command and address can be passed in fairly quickly.
  • Kye Posts: 2,200
    edited 2013-04-10 16:41
    Oh, the code for changing addresses takes quite a bit of time. If you do it serially, it takes 40 SPI clock cycles, which translates into about 120-plus machine cycles. Address switching hurts.

    I'm trying to optimize the general case here. The Winbond device has a word quad-SPI read instruction which only takes 5 SPI clock cycles to change address with, but it requires me to output nibbles of the address at the same time in parallel to P0-P7. I suppose it would be faster to use than serially writing out the address.
  • jazzed Posts: 11,803
    edited 2013-04-10 18:53
    You should look at my cache driver for 2x QuadSPI chips. There is a slow read that is enabled and a fast read (the fast read needs some work that I never had time to finish). Look for BREAD. All the erase, write, and other commands are there. The top part of the driver is the generic direct-mapped cache handling stuff. Also, David wrote a 2x SQI driver for Rayman's rampage2.

    I know you won't adopt any of it wholesale, but at least they are functional examples. Getting the two nibbles together to form a byte was a little tricky to me.
  • Kye Posts: 2,200
    edited 2013-04-10 21:45
    Thank you Jazzed!

    I'll keep you informed on my progress of this. I hope to have the board done by the end of this month to send out for production and hopefully get it built by the middle of next month. Then I can start testing all this stuff.
  • Kye Posts: 2,200
    edited 2013-04-10 21:49
    Your fast read code gives me an idea... what if PC were PHSB? Then the PC += 4 would happen from the clock pulses.

    ... Probably too much other stuff would have to be changed to make it worth it.
  • jazzed Posts: 11,803
    edited 2013-04-10 22:28
    I talked with Kuroneko a few years ago about using PHSx and counters for the PC in an LMM program. He sent some demo code in a private message. Maybe he will post it or give me permission to post what he sent.

    As far as changing other stuff... yeah, some of the typical LMM macros use the PC. The PHSx PC would have to be captured and reset in the macros. Lots of knitting.