Is Spin2Cpp version of "?" (random) slow?
Rayman
Posts: 14,665
I just ported my Quick Media Player's Graphics Demo to PropGCC using Spin2Cpp.
I'm amazed it actually works, but I noticed that some parts were faster and some were slower.
This is in cmm mode.
This particular code uses a lot of random numbers as it draws points, circles, lines, etc.
to random positions on the screen and in random colors.
These parts of the code are noticeably slower.
But other parts, such as bitmap display using FSRW and text without random numbers, are faster.
I'm guessing the "?" (random) operator is the reason, but am not 100% sure yet...
The code implementation for "?" is also noticeably bigger in the Spin2Cpp version...
Any way to speed this up?
Comments
Added math.h and used rand() and it's now faster than Spin even in this cmm mode.
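A minimal sketch of that kind of substitution (rand() is declared in stdlib.h; plot_point(), width and height are placeholders, not names from the actual demo):

#include <stdlib.h>                     /* declares rand() */

extern void plot_point(int x, int y, int color);   /* hypothetical pixel call */

/* Spin's "?" operator is a 32-bit LFSR that the generated C code has to
   emulate on every call; using the C library's rand() avoids that overhead. */
static void random_point(int width, int height)
{
    int x     = rand() % width;         /* random X position */
    int y     = rand() % height;        /* random Y position */
    int color = rand() & 0xFF;          /* random 8-bit color */
    plot_point(x, y, color);
}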
Program doesn't work in any xmm mode though, have to track that down.
I'm guessing that I have to add "HUBDATA" or volatile or something somewhere...
XMM won't work if the program uses spin cognew.
We don't support that at the moment, except by using COGC, GAS, or PASM COG drivers, which work fine with cognew.
The cogstart (or the other _start_lmm_thread?) function is used to start LMM code, but not XMM code.
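A rough illustration of that distinction, as I understand the propeller.h API (blink(), pasm_driver and the stack size are made up for the example):

#include <propeller.h>

static int thread_stack[64];            /* stack for the LMM thread */

static void blink(void *arg)            /* hypothetical C worker function */
{
    (void)arg;
    for (;;)
        ;                               /* ...toggle a pin, etc. */
}

void start_examples(void)
{
    extern unsigned int pasm_driver[];  /* hypothetical PASM image in hub RAM */

    /* cognew() just loads a PASM image into a fresh COG, so it works in any
       memory model; cogstart() starts a C function under the LMM kernel,
       which is why it is not available to XMM programs. */
    cognew(pasm_driver, NULL);
    cogstart(blink, NULL, thread_stack, sizeof(thread_stack));
}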
David, Eric, Parallax, and I have discussed it for updating the P2 branch.
Maybe a good time to discuss that P2 SDRAM cache strategy with Chip since both kernel and cache interface will need to change.
That's not what I'm doing here. Actually, I very rarely do that. I'm just using cognew to launch assembly drivers for FSRW and the lcd display.
So, that should work in XMM modes, right? So, I'm back to thinking I must need "HUBDATA" or volatile added somewhere...
The data you're launching via cognew will have to be in the HUB, I think, so that may be where you have to add the HUBDATA.
Eric
Doing some troubleshooting, it seems like the driver is loading.
The first call to the driver works, but any second call goes haywire...
So, it is at least partially working.
Edit: Also, you'll need to declare the mailboxes as "volatile".
error: section attribute not allowed for 'Command'
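That error usually means the section attribute (which is what HUBDATA expands to) was placed on a local variable; GCC only allows it on file-scope variables. A minimal sketch of the shape being suggested, with display_driver and Command standing in for the real driver image and mailbox:

#include <propeller.h>
#include <stdint.h>

/* The PASM image launched with cognew must sit in hub RAM, so under XMM it
   needs HUBDATA; the mailbox must be volatile so the compiler re-reads it
   while the driver COG updates it.  Both are declared at file scope --
   HUBDATA on a local gives "section attribute not allowed". */
HUBDATA uint32_t display_driver[496];          /* placeholder for the real PASM image */
HUBDATA volatile uint32_t Command;             /* mailbox long shared with the driver COG */

void start_display(void)
{
    Command = 0;
    cognew(display_driver, (void *)&Command);  /* pass the hub address of the mailbox */
    while (Command == 0)                       /* wait for the driver's handshake */
        ;
}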
Hi David,
The whole SDRAM cache thing on P2 interests me right now as well, and I expect the XMMC model for a P2 device with that type of memory attached will become rather useful for some larger C/C++ applications. My sense, though, is that it could suffer quite slow performance without a good I-cache, particularly since the SDRAM could also be shared by multiple clients and has a significant hub cycle latency involved. There is another post already covering that topic over in the P2 forum.
I'm not sure if you PropGCC people have been looking at this, but if you have, can you tell me from your perspective how many hub windows per executed simple PASM instruction might be expected when operating in the XMMC VM model on P2, given a good I-cache and cache hits all the time? Basically I am thinking of something that would use Chip's SDRAM driver with a suitable I-cache in hub RAM. Instruction cache miss timing will obviously depend on the underlying memory technology, and for SDRAM on P2 it will be quite slow due to the overall driver latency, but cache hit performance depends more on the VM and its associated cache driver design (I am presuming they operate in the same COG). I was digging around for code samples of the existing VM / cache driver so I could get a sense of how fast it executes instructions during instruction cache hits, but didn't find any VM kernel source code in the prop2gcc zip file. Is the VM's PASM source available somewhere else?
Also, just for interest, today I was mucking about with some of my own cache ideas on paper and think I just about managed to get a rudimentary 32kB 2-way set-associative I-cache down to 2 hub windows per executed instruction using the new RDQUAD opcode, which could be nice: the VM reading from it could then operate at up to 10 MIPS during cache hits @160MHz. Any cache miss that has to go read from SDRAM probably brings it right down to ~1 MIPS, based on some rough estimates of Chip's current SDRAM driver performance, and that is only if it doesn't have to contend with video or other users. For, say, a 90% hit rate that equates to an average of ~5 MIPS for simple instructions. I'm not sure how that compares with what is already there today for XMMC, or what is anticipated at this point for P2 with SDRAM. That's why it would be interesting to see the VM code that already exists for these models and get a sense of how it might perform on P2 with SDRAM in the future.
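For reference, the ~5 MIPS figure follows directly from those assumptions: at 10 MIPS on hits and ~1 MIPS on misses with a 90% hit rate, the average time per instruction is 0.9 × 0.1 µs + 0.1 × 1 µs = 0.19 µs, which works out to roughly 5.3 MIPS.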
Cheers and good job by the way for the whole team effort on what has been achieved already with GCC on the Prop. It is great to have GCC for the Prop going forward and P2 will really benefit from it.
Roger.
First, you can find the kernel sources in the Google Code source tree in propgcc/gcc/gcc/config/propeller/crt0_xmm.s.
http://code.google.com/p/propgcc/
The cache drivers themselves are in propgcc/loader/spin. For instance, the SPI flash cache driver is spi_flash_cache.spin.
Currently the kernel runs in a separate COG from the cache driver. The two talk through a mailbox interface. The kernel caches the last address translation, so accesses on the same cache line don't result in a call to the cache driver. We only call the cache driver when we have a cache miss on the line the kernel remembers. We have two different architectures for the cache drivers themselves. The ones currently in use keep tags in COG memory, so cache hits do not involve any hub accesses other than the mailbox protocol. The newer cache architecture puts tags in hub memory to free up more space in the cache driver COG for more complex external memory interfaces.
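For anyone following along, a rough C sketch of the shape of that interface (the field names, command encoding, and helper function here are illustrative only; the real protocol is defined by crt0_xmm.s and the cache driver sources):

#include <stdint.h>

typedef struct {
    volatile uint32_t cmd;      /* external address plus request bits; the kernel
                                   writes this to ask the cache COG for a line */
    volatile uint32_t result;   /* hub address of the cache line, filled in by
                                   the cache driver when the request completes */
} xmem_mailbox_t;

/* The kernel remembers its last translation and only calls the cache
   driver when an access falls outside that cache line. */
static uint32_t last_tag = 0xFFFFFFFF, last_hubaddr;

static uint32_t translate(xmem_mailbox_t *mbox, uint32_t xaddr, uint32_t line_mask)
{
    uint32_t tag = xaddr & ~line_mask;
    if (tag != last_tag) {              /* miss on the remembered line */
        mbox->cmd = xaddr | 1;          /* request encoding is made up here */
        while (mbox->cmd != 0)          /* cache COG clears cmd when done */
            ;
        last_tag     = tag;
        last_hubaddr = mbox->result;
    }
    return last_hubaddr + (xaddr & line_mask);
}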
I'd love to hear about your ideas for speeding up cache access on the P2. We haven't really started working on the P2 XMM kernel or cache drivers yet but will be starting soon. We'd welcome any ideas you might have!
There will certainly be quite a bit to think about when doing an SDRAM XMM version on P2. There will need to be some decisions made about how much code to read from the SDRAM during a miss (i.e. the cache line size), given how the SDRAM driver behaves and also when it is shared by, say, video drivers.
I do also wonder if the new P2 COG's stack RAM could potentially be used either as an L1 I-cache or to hold the cache tags for further speedup when running XMMC. Still pondering in some of my few free moments...
Yes, that was what I was hoping to do too: have the I-cache handling in the VM COG so no third separate COG is required just for caching when there is already an SDRAM driver present in the system. The other way to achieve that COG reduction is to do the cache handling in some (modified) SDRAM driver, but that is quite a lot to deal with there as well, especially if other clients (e.g. video) use the same driver in their own ways. I also imagine performance may be better if the I-cache tag handling is done in the VM COG, especially if any of its state can be stored internally in the P2 COG's stack RAM, which would reduce the number of hub accesses needed to fetch results. It all depends on how it is implemented and whether all the extra P2 capabilities can be leveraged in a useful way.
One interesting problem for XMMC would be getting access to video memory in the data space if the SDRAM is also being used as a large frame buffer for video applications. I am wondering how one would map that memory back into a data space so GCC applications can use it. All writes back to SDRAM may become slower if a write-through cache is involved or needs to be invalidated, and if another video driver is continually reading directly from the SDRAM driver you'd not want to cache it elsewhere in the hub unless the SDRAM driver handled it all for you, which it probably won't.

It's almost as if one would want a variant of the XMMC model where the video frame buffer data and the instructions get stored in the SDRAM, while the normal data segment and stack for the application stay in hub RAM. Or perhaps this is already possible in general with the XMM kernel VM today, just by using linker scripts and the kernel knowing that memory below a certain address is hub RAM. So maybe one could steal an address bit to indicate uncached video space. E.g. 0x0 - 0x1fffffff is hub RAM (with foldover), 0x20000000 - 0x7fffffff is SDRAM for cacheable code (and XMM data), and 0x80000000 - 0xFFFFFFFF is video in SDRAM and is uncacheable, so it gets treated differently on all writes and other video drivers always read valid video data from the SDRAM.
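A tiny sketch of the address decode being proposed here (the split constants are just the ones from the paragraph above; nothing in propgcc currently implements this):

#include <stdint.h>

/* Proposed three-way split: hub RAM, cached SDRAM (code + XMM data),
   and an uncached SDRAM region for the video frame buffer. */
#define IS_HUB(a)    ((uint32_t)(a) <  0x20000000u)
#define IS_XMM(a)    ((uint32_t)(a) >= 0x20000000u && (uint32_t)(a) < 0x80000000u)
#define IS_VIDEO(a)  ((uint32_t)(a) >= 0x80000000u)   /* bypass the cache on access */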
It is certainly going to get interesting...
[EDIT] New thread is here: http://forums.parallax.com/showthread.php/147886-Instruction-caching-ideas-for-P2-in-XMMC-mode-with-SDRAM