Kernel and Cache Driver Interworks
Kye
Posts: 2,200
I'm looking at making a custom kernel for propgcc. I want to know if I can compile code for XMMC mode without explicitly including the kernel. Basically, I want to set up a driver package that turns the Prop chip into, effectively, an XMMC processor with a bunch of software-defined hardware running on the other cogs.
As I understand it now, the kernel executes your code and the cache driver caches the instructions to execute at the request of the kernel. However, does the cache driver do look-ahead? Or is it just a two-cog solution because the code wouldn't fit in one processor?
I would like to be able to include the code necessary for caching data right in the kernel. Would this be smart? Or would it be better to put that code into the cache driver?
I have an octal SPI setup and I'm running the propeller chip at 96 MHz.
...
I think I can get a 1.2 megalong-per-second write rate to main memory when reading in a new cache line, and I should be able to execute instructions at somewhere above 5 MIPS.
Comments
If I want to store the instructions directly in cog memory, I can still read them in at 1.2 megalongs a second using a 20-instruction loop (96 MHz / (20 instructions x 4 clocks) = 1.2 M longs/s), and then I could execute them at 24 MIPS. However, I would pay for this in that the cache would shrink from about 8K to 1K.
That sounds clever. OK, brainstorming that a bit: is that like a level 1 and a level 2 cache? Run from the cog cache; if there is a miss, check the hub cache; and if there is a miss there, read from external memory?
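For illustration, a rough sketch in C of that two-level idea (all of the helper names here are hypothetical stand-ins, not propgcc functions):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical helpers: each returns a pointer to the cached long,
       or NULL on a miss. */
    extern uint32_t *cog_l1_lookup(uint32_t addr);   /* tiny in-cog cache */
    extern uint32_t *hub_l2_lookup(uint32_t addr);   /* larger hub cache  */
    extern uint32_t *external_fill(uint32_t addr);   /* burst from flash  */

    uint32_t fetch_long(uint32_t addr)
    {
        uint32_t *p;
        if ((p = cog_l1_lookup(addr)) != NULL) return *p;  /* fastest hit   */
        if ((p = hub_l2_lookup(addr)) != NULL) return *p;  /* hub-speed hit */
        return *external_fill(addr);                       /* slow miss     */
    }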
The kernel is in a file called _crt0.o. The source code for this is in gcc/gcc/config/propeller/crt0.S, which (for any of the XMM modes) includes crt0_xmm.s. That's what you'd have to change.
If you give the compiler driver propeller-elf-gcc the -v option, it will print all the commands it's running, in particular the linker, so you can see how to replace _crt0.o.
Regarding your idea of caching directly into cog memory, unfortunately that wouldn't work without compiler changes. The compiler expects the LMM/XMM kernel to use a register called "pc" to fetch instructions from memory with rdlong, and it does things like "add pc, #4" to skip over instructions or data.
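As a toy model of that contract (the helpers are stand-ins, not real propgcc code), the kernel's inner loop looks conceptually like:

    #include <stdint.h>

    extern uint32_t rdlong_hub(uint32_t addr);    /* models the rdlong      */
    extern void     execute_in_cog(uint32_t op);  /* run the fetched long   */

    /* The compiler assumes a kernel register named "pc" that is fetched
       through with rdlong and advanced with "add pc, #4"; fetching from
       cog memory instead would break this unless the compiler changed too. */
    void kernel_loop(uint32_t pc)
    {
        for (;;) {
            uint32_t op = rdlong_hub(pc);  /* fetch next instruction */
            pc += 4;                       /* the "add pc, #4"       */
            execute_in_cog(op);
        }
    }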
Eric
My real question is what impact this would have on performance. I don't know how much the cache helps. David, I guess, would know this.
Having a second COG do the cache has some not-so-obvious advantages with respect to interacting with the I/O pins, beyond the obvious burst nature of the design.
1) No need to "and-not-or" bits in the OUTA register when setting pins, i.e. you can only affect the outputs that are enabled by the COG. In a single-COG kernel/hardware controller, one needs to spend cycles making sure only the bits you care about get set, which may also involve shifting things around (see the sketch after this list).
2) The counters and other features of the cache COG are exclusively yours. In a single-COG kernel/hardware controller, the counters belong to the user.
3) The cache COG can do bursts for slow devices, i.e. a single-bit SPI in a cache COG is almost as fast as a byte-wide access design with the same address setup. The SSF byte-wide 2x QuadSPI design is about 30% faster than the equivalent 1-bit SPI design.
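A minimal sketch of that "and-not-or" masking (OWNED_MASK is an assumed constant for the pins the kernel owns):

    #include <stdint.h>

    #define OWNED_MASK 0x000001FFu   /* assumed: P0-P8 for octal SPI + CS */

    /* Update only the owned pins in an OUTA image, leaving "user" pins
       (serial port, EEPROM, etc.) untouched: and-not, then or. */
    uint32_t set_owned_pins(uint32_t outa, uint32_t value)
    {
        return (outa & ~OWNED_MASK) | (value & OWNED_MASK);
    }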
The current cache model does not look ahead. It is merely a fetcher/write-back storer at the moment. The cache is mostly direct-mapped, although David has done some n-way set associative experiments.
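Conceptually, a direct-mapped fetch looks something like this (line size, line count, and the backstore helper are assumptions for illustration):

    #include <stdint.h>

    #define LINE_SIZE  32u    /* bytes per line (assumed)  */
    #define NUM_LINES  256u   /* 8K of hub cache (assumed) */

    /* tags[] is assumed initialized to invalid addresses at startup. */
    static uint32_t tags[NUM_LINES];            /* external base of each line */
    static uint8_t  lines[NUM_LINES][LINE_SIZE];

    extern void backstore_read(uint32_t addr, uint8_t *dst, uint32_t len);

    /* Direct-mapped: the address picks exactly one line, so a hit is a
       single tag compare; a miss triggers a full line fill. */
    uint8_t *cache_fetch(uint32_t addr)
    {
        uint32_t base  = addr & ~(LINE_SIZE - 1);
        uint32_t index = (base / LINE_SIZE) & (NUM_LINES - 1);
        if (tags[index] != base) {
            backstore_read(base, lines[index], LINE_SIZE);
            tags[index] = base;
        }
        return &lines[index][addr & (LINE_SIZE - 1)];
    }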
In a P1, the fastest direct kernel/hardware access model (call it the direct model) with a parallel bus design eats most of the Propeller's pins. The best fetch/store rate would be 4 instructions per access with such hardware, i.e. 5 MB/s at 80 MHz, or about 1 MIPS or less. That does not account for the other work the kernel has to do.
Funny you should mention using an in-COG cache, though (the direct model). I've been talking with David and Eric about the possibility of using the CLUT on the P2 as a tiny cache. I suspect that even with SDRAM setup times (and a single-line cache in the CLUT), direct hardware access on the P2 would out-perform a cached model because of the hub bottleneck (even with those spiffy new multi-long instructions). Once a "line" is burst-read into the CLUT, getting a new instruction is mostly a matter of checking the address range and extracting the instruction. Writes to external memory could be via write-back, but I suspect a simpler write-through would be more efficient.
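The hit path for a single-line CLUT cache could be as simple as this sketch (the 32-byte line and the burst helper are assumptions):

    #include <stdint.h>

    static int      line_valid = 0;
    static uint32_t line_base;         /* base of the burst currently held  */
    static uint32_t clut[8];           /* stand-in for a 32-byte CLUT line  */

    extern void sdram_burst_read(uint32_t addr, uint32_t *dst, uint32_t nlongs);

    /* Hit: pc falls inside the held burst, so just extract the long.
       Miss: burst-read a fresh line, then extract. */
    uint32_t fetch_insn(uint32_t pc)
    {
        if (!line_valid || pc - line_base >= 32u) {  /* range check = miss test */
            line_base  = pc & ~31u;
            sdram_burst_read(line_base, clut, 8);
            line_valid = 1;
        }
        return clut[(pc - line_base) / 4];           /* extract the instruction */
    }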
I really think the direct model on the P2 has merit. On P1, it's not as attractive, but it could be a stepping-stone for P2.
Best regards,
--Steve
No, there would just be one cache in cog RAM. This would liberate 8K in hub RAM.
I'm making an Arduino-and-more operating system for the Propeller. It's going to be event-driven. I hope to have significant progress done by August (I have a few months off before I start my job after graduating college this May; I'm going to try to work very hard to get the bulk of the low-level work done).
In my system, there is no assumption about what the hardware is doing; the user does not own OUTA/DIRA/INA. They can call C-level functions safely and not mess with the hardware. I'm already 80% done designing and routing a specific board for this project. Attached is my current progress.
It has an RTC, SD card slot, FTDI chip, ADC, octal SPI flash (16 Mb), EEPROM, and a nice little breadboard area. I just have to clean up the schematic and fix the routing in the I/O pins area (this might be heroic...).
Anyway, I'm doing a lot of stuff full custom. For example, to make the loader really fast, I'm not using a serial port; instead I'm going to use the D2XX driver to get 3 Mbaud speeds out of the FTDI chip. I should be able to get around 200 KB/s writes to the flash for programming.
Sounds pretty neat, Kye.
When I mentioned the user with respect to OUTA, etc., I was talking about any software the kernel runs, which includes libraries. The kernel cannot mess with pins that it does not "own" and thus must do "and-not-or" operations so it does not interfere with "user" pins such as the serial port and EEPROM pins.
If you have time to work on a "direct model" for P1/P2, it would be great. I'm sure the current core team will be supportive where possible. David is the point man at the moment, so he makes the decisions - I retired from that job.
Best regards,
--Steve
Today, the P1 cache-cog model uses roughly 20 instructions in the kernel talking to the cache cog via a mailbox. The direct-mapped cache-cog algorithm uses roughly 30 instructions on a cache hit; a miss is much longer. In both cases there are about 4 hub delays for communications. I don't have an analysis, but it seems to me that even on a miss, with SDRAM and no address/data latching tricks, a single line could be filled in about the same time as a cache hit. The trick would be optimizing a CLUT cache-hit algorithm for the P2. I begged for a magnitude comparator years ago that would have helped greatly.
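In C terms, the kernel side of that mailbox handshake is roughly this (the layout and flag convention are assumptions, not the actual propgcc protocol):

    #include <stdint.h>

    /* Hypothetical two-long mailbox shared in hub RAM. */
    typedef struct {
        volatile uint32_t cmd;   /* external address + request flag; 0 = idle */
        volatile uint32_t data;  /* hub address of the data, filled on reply  */
    } mailbox_t;

    static mailbox_t mbox;

    /* Kernel side: post the request, then spin until the cache cog clears
       cmd; each poll costs a hub access, hence the several hub delays. */
    uint32_t cache_line_for(uint32_t ext_addr)
    {
        mbox.cmd = ext_addr | 1u;   /* set the request flag    */
        while (mbox.cmd != 0u)
            ;                       /* wait for the cache cog  */
        return mbox.data;           /* hub address of the data */
    }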
Cache lines > 128 bytes are good if you have small pieces of code and a slow backstore device, but the fetch time starts to eat into performance as the code scatters more. SDRAM is designed to work naturally with 32-byte bursts; bigger bursts take special handling with SDRAM.
David has a point on SDRAM refresh BTW - not sure how Chip handles that just now.
For the P1... at least for what I'm doing, I think I'll just go with a 20-instruction loop that uses no cache and fetches one instruction at a time from flash. It would run at around 1.2 MIPS. This will save space in main memory.
On another note, I was looking in the XMM kernel code and I noticed that there were multiple read and write pseudo-instructions. There were simple versions, intermediate versions, and complex versions. What is all that for?
...
One more thing. So, I assume that for XMMC you guys have just mapped the executable code somewhere high in the 32-bit memory address range. This means the user's code has no rdlongs and wrlongs in it, but instead fake instructions which invoke the kernel code, right? These fake instructions then check whether they need to get the value from RAM or go out to get the value from flash (in my case)? And... the wrlong would just fail for the flash.
Also, does FCACHE code look like the rest of the code too? Or is it compiled differently?
Actually, for XMMC the data is in hub, so the user's code does have rdlong/wrlong to access data (but not code). For regular XMM, data accesses have to go through the kernel code, except for variables the compiler knows are on the stack (the stack is always in hub memory).
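A sketch of that split (the hub boundary is an assumption, and cache_line_for reuses the earlier mailbox sketch):

    #include <stdint.h>

    #define HUB_TOP 0x8000u   /* P1 hub RAM is 32K */

    extern uint32_t cache_line_for(uint32_t ext_addr);  /* mailbox sketch above */

    /* XMMC: data sits below HUB_TOP, so plain rdlong/wrlong work and only
       code fetches go through the kernel. Regular XMM must also route data
       reads like this one through the cache. */
    uint32_t xmm_read_long(uint32_t addr)
    {
        if (addr < HUB_TOP)
            return *(volatile uint32_t *)(uintptr_t)addr;  /* direct hub access */
        return *(uint32_t *)(uintptr_t)cache_line_for(addr);
    }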
FCACHE is compiled to a call to __LMM_FCACHE_LOAD, followed by a count of bytes and then the actual instructions. All of this is in code space, so __LMM_FCACHE_LOAD has to use the cache read functions to fetch the data into COG ram. The memory access instructions are the same as non-FCACHE'd (so in XMM they call in to the kernel); the only difference is that branches in the FCACHE'd code are changed to refer to COG memory at the address where the code will eventually run.
Eric
EDIT: Wait, I see: it's for auto-incrementing the pointer.
If I implemented the complex form anyway, would that cause problems? I suppose both are the same if the i bit is zero.
That was supposed to allow the compiler to optimize things like "foo = *bar++" by having the memory read also increment the pointer variable. The compiler does not use that, and I'm not even sure we ever implemented it in the rd/wr functions. There's no need for you to implement it.
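In other words, the complex form would have fused these two steps into one kernel operation (illustrative only):

    #include <stdint.h>

    /* What "foo = *bar++" wants: one combined read-and-advance, instead
       of a read pseudo-op followed by a separate add on the pointer. */
    uint32_t read_post_increment(uint32_t **bar)
    {
        uint32_t foo = **bar;   /* the memory read                   */
        *bar += 1;              /* the pointer bump, folded together */
        return foo;
    }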
Eric
Assume the octal SPI bus is P0 through P7 and that CS, which is active low, is on P8.
I wonder if I can get this tighter. Right now it goes at 1,714,285.7 instructions per second, that is, 96 MHz divided by 14 instructions at 4 clocks each. I would need to cut 2 more longs to get to a nice round 2 MIPS.
There is an alternative that uses 8 instructions, but it uses the carry flag. It ends up with the same number of instructions but one less long of data.
This should do it. It gives 1.6 MIPS for 4-cycle instructions and 1.5 MIPS for hub instructions.
There might be a problem with phase drift in the above code. I haven't checked.
EDIT: The forum software seems to remove % characters in the code. The number up there is %0010_0000_0.
Also, the Microchip SQI parts might be faster since the command and address can be passed in fairly quickly.
I'm trying to optimize the general case here. The Winbond device has a Word Quad SPI read instruction that takes only 5 SPI clock cycles to change the address, but it requires me to output nibbles of the address at the same time, in parallel, on P0-P7. I suppose it would be faster to use than serially writing out the address.
I know you won't adopt any of it wholesale, but at least they are functional examples. Getting the two nibbles together to form a byte was a little tricky for me.
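For reference, the nibble merge itself boils down to something like this (assuming the high nibble of each byte arrives first on the quad pins):

    #include <stdint.h>

    /* Combine two successive 4-bit samples from the bus into one byte. */
    static inline uint8_t merge_nibbles(uint32_t first, uint32_t second)
    {
        return (uint8_t)(((first & 0xFu) << 4) | (second & 0xFu));
    }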
I'll keep you informed on my progress with this. I hope to have the board done by the end of this month to send out for production, and hopefully get it built by the middle of next month. Then I can start testing all this stuff.
... Probably too much other stuff would have to be changed to make it worth it.
As far as changing other stuff goes... yeah, some of the typical LMM macros use the PC. The PHSx PC would have to be captured and reset in the macros. Lots of knitting.