
Is Spin2Cpp version of "?" (random) slow?

Rayman Posts: 14,665
edited 2013-05-08 22:06 in Propeller 1
I just ported my Quick Media Player's Graphics Demo to PropGCC using Spin2Cpp.

I'm amazed it actually works, but I noticed that some parts were faster and some were slower.
This is in cmm mode.

This particular code uses a lot of random numbers as it draws points, circles, lines, etc.
to random positions on the screen and in random colors.
These parts of the code are noticeably slower.

But other parts, such as bitmap display using FSRW and text output that doesn't use random numbers, are faster.
I'm guessing the random number generation is why, but I'm not 100% sure yet...

The code implementation for "?" is also noticeably bigger in the Spin2Cpp version...

Any way to speed this up?

Comments

  • Rayman Posts: 14,665
    edited 2013-05-06 09:33
    Never mind, I just decided to do it the C++ way and use rand() instead...
    Added math.h and used rand() and it's now faster than Spin even in this cmm mode.

    Program doesn't work in any xmm mode though, have to track that down.
    I'm guessing that I have to add "HUBDATA" or volatile or something somewhere...
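    Anyway, the rand() change was basically just this (a rough sketch, not the actual demo code; plot() here stands in for the demo's own drawing routine, and note that rand() is declared in stdlib.h):

        #include <stdlib.h>     /* rand(), srand() */
        #include <propeller.h>  /* CNT */

        #define SCREEN_W 128
        #define SCREEN_H 96

        extern void plot(int x, int y, int color);  /* hypothetical drawing routine */

        void random_pixels(void)
        {
            int i;
            srand(CNT);                        /* seed from the system counter */
            for (i = 0; i < 1000; i++) {
                int x     = rand() % SCREEN_W; /* random position */
                int y     = rand() % SCREEN_H;
                int color = rand() & 0xFF;     /* random 8-bit color */
                plot(x, y, color);
            }
        }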
  • jazzed Posts: 11,803
    edited 2013-05-06 09:40
    Rayman wrote: »
    Never mind, I just decided to do it the C++ way and use rand() instead...
    Added math.h and used rand() and it's now faster than Spin even in this cmm mode.

    Program doesn't work in any xmm mode though, have to track that down.
    I'm guessing that I have to add "HUBDATA" or volatile or something somewhere...

    XMM won't work if the program uses spin cognew.
  • Rayman Posts: 14,665
    edited 2013-05-06 09:57
    Really? I didn't know that... How do you get around this?
  • jazzed Posts: 11,803
    edited 2013-05-06 10:02
    Rayman wrote: »
    Really? I didn't know that... How do you get around this?

    We don't at the moment except by using COGC, GAS, or PASM COG drivers which work fine with cognew.
    The cogstart (or the other _start_lmm_thread?) functions are used to start LMM code, but not XMM code.

    David, Eric, Parallax, and I have discussed it for updating the P2 branch..
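    For reference, cogstart() usage is roughly like this (just a sketch; the blinker and the pin number are made up, and again this is for LMM/CMM code, not XMM):

        #include <propeller.h>

        static int blink_stack[64];          /* stack for the new COG (hub RAM) */

        static void blinker(void *arg)
        {
            unsigned pin = (unsigned)arg;
            DIRA |= 1 << pin;                /* make the pin an output */
            for (;;) {
                OUTA ^= 1 << pin;            /* toggle it */
                waitcnt(CNT + CLKFREQ / 2);  /* twice per second */
            }
        }

        int start_blinker(void)
        {
            /* returns the new COG's id, or -1 if no COG was free */
            return cogstart(blinker, (void *)16, blink_stack, sizeof(blink_stack));
        }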
  • Rayman Posts: 14,665
    edited 2013-05-06 10:29
    Ok, I'm not using COGNEW to start a new Spin thread, just to launch PASM drivers. So, maybe that should be working...
  • David Betz Posts: 14,516
    edited 2013-05-06 11:19
    jazzed wrote: »
    XMM won't work if the program uses spin cognew.
    Hmm... Looks like I need to spend some time to work on resolving this!
  • jazzed Posts: 11,803
    edited 2013-05-06 11:41
    David Betz wrote: »
    Hmm... Looks like I need to spend some time to work on resolving this!

    Maybe a good time to discuss that P2 SDRAM cache strategy with Chip since both kernel and cache interface will need to change.
  • Rayman Posts: 14,665
    edited 2013-05-06 11:49
    Again, I think by "spin cognew" jazzed meant using cognew to launch a new cog running the Spin interpreter. (At first I thought he meant using cognew in Spin in any way.)

    That's not what I'm doing here. Actually, I very rarely do that. I'm just using cognew to launch assembly drivers for FSRW and the lcd display.

    So, that should work in XMM modes, right? So, I'm back to thinking I just need "HUBDATA" or volatile added somewhere...
  • David Betz Posts: 14,516
    edited 2013-05-06 11:59
    jazzed wrote: »
    Maybe a good time to discuss that P2 SDRAM cache strategy with Chip since both kernel and cache interface will need to change.
    Not necessarily although that might be required if we want to make use of some of the techniques that have been discussed for improving kernel performance on P2. We could do a quick port with the same interface at least as a first step.
  • David Betz Posts: 14,516
    edited 2013-05-06 12:00
    Rayman wrote: »
    Again, I think by "spin cognew" jazzed meant using cognew to launch a new cog running the Spin interpreter. (At first I thought he meant using cognew in Spin in any way.)

    That's not what I'm doing here. Actually, I very rarely do that. I'm just using cognew to launch assembly drivers for FSRW and the lcd display.

    So, that should work in XMM modes, right? So, I'm back to thinking I just need "HUBDATA" or volatile added somewhere...
    Yes, this should be possible in XMM modes.
  • ersmith Posts: 6,054
    edited 2013-05-06 12:00
    Rayman wrote: »
    So, that should work in XMM modes, right? So, I'm back to thinking I just need "HUBDATA" or volatile added somewhere...

    The data you're launching via cognew will have to be in the HUB, I think, so that may be where you have to add the HUBDATA.

    Eric
  • David Betz Posts: 14,516
    edited 2013-05-06 12:03
    ersmith wrote: »
    The data you're launching via cognew will have to be in the HUB, I think, so that may be where you have to add the HUBDATA.

    Eric
    At one point we had code to launch COG images resident in external memory (flash) by copying it first into a 2K hub buffer. I guess I need to look through the code and remind myself of what options are available.
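    The idea was roughly this (sketch only, with placeholder names, not the actual loader code):

        #include <propeller.h>
        #include <string.h>
        #include <stdint.h>

        extern uint32_t driver_image[];           /* COG image; may live in flash under XMM */
        static HUBDATA uint32_t cog_buffer[496];  /* ~2K staging buffer in hub RAM */

        int launch_from_xmm(volatile void *mailbox)
        {
            /* Copy the image into hub first (the kernel handles the external-memory
               reads), then start the COG from the hub copy.  Assumes a full
               496-long image. */
            memcpy(cog_buffer, driver_image, sizeof(cog_buffer));
            return cognew(cog_buffer, (void *)mailbox);
        }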
  • Rayman Posts: 14,665
    edited 2013-05-06 13:22
    I added HUBDATA everywhere, but it still doesn't work right in xmmc mode...
    Doing some troubleshooting, it seems like the driver is loading.
    The first call to the driver works, but any second call goes haywire...
    So, it is at least partially working.
  • David Betz Posts: 14,516
    edited 2013-05-06 13:29
    Rayman wrote: »
    I added HUBDATA everywhere, but it still doesn't work right in xmmc mode...
    Doing some troubleshooting, it seems like the driver is loading.
    The first call to the driver works, but any second call goes haywire...
    So, it is at least partially working.
    The mailboxes used to communicate with the driver will also need to be in hub memory.

    Edit: Also, you'll need to declare the mailboxes as "volatile".
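    Something along these lines, as a sketch (the names are just placeholders, not from your actual code):

        #include <propeller.h>
        #include <stdint.h>

        typedef struct {
            uint32_t command;   /* command word the driver polls */
            uint32_t param;     /* argument / result */
        } mailbox_t;

        /* Under XMM both of these have to end up in hub RAM: the PASM image
           handed to cognew and the mailbox it communicates through (and the
           mailbox should be volatile so accesses aren't optimized away). */
        static HUBDATA uint32_t lcd_cog_image[496];     /* filled with the driver code elsewhere */
        static HUBDATA volatile mailbox_t lcd_mailbox;

        int start_lcd_driver(void)
        {
            lcd_mailbox.command = 0;
            return cognew(lcd_cog_image, (void *)&lcd_mailbox);
        }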
  • Rayman Posts: 14,665
    edited 2013-05-06 13:37
    How do you do that? If I try adding HUBDATA to non-static things, I get this error:
    error: section attribute not allowed for 'Command'
  • David Betz Posts: 14,516
    edited 2013-05-06 13:39
    Rayman wrote: »
    How do you do that? If I try adding HUBDATA to non-static things, I get this error:
    error: section attribute not allowed for 'Command'
    If it's on the stack it's automatically in hub memory. No matter what memory model you use, the stack is always in hub memory. So HUBDATA isn't required for local variables.
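    So, as a sketch, something like this needs no HUBDATA at all (driver_image is a hypothetical PASM image that's already in hub RAM):

        #include <propeller.h>
        #include <stdint.h>

        extern uint32_t driver_image[];   /* hypothetical PASM image in hub RAM */

        void run_driver_once(void)
        {
            /* Locals live on the stack, and the stack is in hub memory in every
               memory model, so the section attribute is neither needed nor allowed. */
            volatile uint32_t mailbox[2] = { 0, 0 };
            cognew(driver_image, (void *)mailbox);
            /* ...just don't return while the driver is still using 'mailbox'... */
        }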
  • rogloh Posts: 5,808
    edited 2013-05-07 06:55
    jazzed wrote: »
    Maybe a good time to discuss that P2 SDRAM cache strategy with Chip since both kernel and cache interface will need to change.
    David Betz wrote: »
    Not necessarily although that might be required if we want to make use of some of the techniques that have been discussed for improving kernel performance on P2. We could do a quick port with the same interface at least as a first step.

    Hi David,

    The whole SDRAM cache thing on P2 interests me right now as well, and I expect the XMMC model for the P2 device with that type of memory attached will become rather useful for some larger C/C++ applications, but I already have the sense it could suffer quite slow performance without a good I-cache, particularly considering the SDRAM could also be shared by multiple clients and has a significant hub cycle latency involved. There is another post already covering that topic over in the P2 forum.

    I'm not sure if you PropGCC people have been looking at this but if you have can you tell me from your perspective is there any feeling today for how many hub windows per executed simple PASM instruction might be expected when operating in the XMMC VM model for P2 if we have a good I-cache and we are getting cache hits all the time? Basically I am thinking here of something that would be using Chip's SDRAM driver with some suitable I-cache in hub RAM. Instruction cache miss timing will obviously depend on the underlying memory technology and for SDRAM on P2 will be quite slow due to its overall driver latency, but cache hit performance depends more on the VM and its associated cache driver design (I am presuming they are operating in the same COG). I was digging around for any code samples that show me the existing VM / Cache driver so I can get a sense of how fast it would be executing instructions during the instruction cache hits but didn't find any VM kernel source code in the prop2gcc zip file. Is the VM's PASM source available somewhere else?

    Also today just for interest I was mucking about with some of my own cache ideas on paper and think I just about managed to get a rudimentary 32kB 2 way set associative I-cache down to 2 hub windows per executed instruction using the new RDQUAD opcode which could be nice if the VM reading from it could then operate up to 10MIPs during the cache hits @160MHz. Though any cache misses that have to go read from SDRAM probably brings it right down to ~1MIP based on some rough estimates of Chip's current SDRAM driver performance, and that is only if it doesn't have to contend with video or other users. For say a 90% hit rate that equates to an average of ~5MIPs for simple instructions. Not sure how that compares with what is already there today for XMMC or anticipated at this point in time for P2 with SDRAM. That's why it would be interesting to see the VM code that already exists for these models and get a sense of how it might perform on the P2 with SDRAM in the future.

    Cheers and good job by the way for the whole team effort on what has been achieved already with GCC on the Prop. It is great to have GCC for the Prop going forward and P2 will really benefit from it.

    Roger.
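
    P.S. The ~5 MIPs figure is just a weighted average of the per-instruction times, assuming ~10 MIPs on a cache hit and ~1 MIP on a miss: 1 / (0.90/10 + 0.10/1) ≈ 5.3 MIPs.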
  • David Betz Posts: 14,516
    edited 2013-05-07 07:07
    rogloh wrote: »
    Hi David,

    The whole SDRAM cache thing on P2 interests me right now as well, and I expect the XMMC model for the P2 device with that type of memory attached will become rather useful for some larger C/C++ applications, but I already have the sense it could suffer quite slow performance without a good I-cache, particularly considering the SDRAM could also be shared by multiple clients and has a significant hub cycle latency involved. There is another post already covering that topic over in the P2 forum.

    I'm not sure if you PropGCC people have been looking at this but if you have can you tell me from your perspective is there any feeling today for how many hub windows per executed simple PASM instruction might be expected when operating in the XMMC VM model for P2 if we have a good I-cache and we are getting cache hits all the time? Basically I am thinking here of something that would be using Chip's SDRAM driver with some suitable I-cache in hub RAM. Instruction cache miss timing will obviously depend on the underlying memory technology and for SDRAM on P2 will be quite slow due to its overall driver latency, but cache hit performance depends more on the VM and its associated cache driver design (I am presuming they are operating in the same COG). I was digging around for any code samples that show me the existing VM / Cache driver so I can get a sense of how fast it would be executing instructions during the instruction cache hits but didn't find any VM kernel source code in the prop2gcc zip file. Is the VM's PASM source available somewhere else?

    Also today just for interest I was mucking about with some of my own cache ideas on paper and think I just about managed to get a rudimentary 32kB 2 way set associative I-cache down to 2 hub windows per executed instruction using the new RDQUAD opcode which could be nice if the VM reading from it could then operate up to 10MIPs during the cache hits @160MHz. Though any cache misses that have to go read from SDRAM probably brings it right down to ~1MIP based on some rough estimates of Chip's current SDRAM driver performance, and that is only if it doesn't have to contend with video or other users. For say a 90% hit rate that equates to an average of ~5MIPs for simple instructions. Not sure how that compares with what is already there today for XMMC or anticipated at this point in time for P2 with SDRAM. That's why it would be interesting to see the VM code that already exists for these models and get a sense of how it might perform on the P2 with SDRAM in the future.

    Cheers and good job by the way for the whole team effort on what has been achieved already with GCC on the Prop. It is great to have GCC for the Prop going forward and P2 will really benefit from it.

    Roger.

    First, you can find the kernel sources in the Google Code source tree in propgcc/gcc/gcc/config/propeller/crt0_xmm.s.

    http://code.google.com/p/propgcc/

    The cache drivers themselves are in propgcc/loader/spin. For instance, the SPI flash cache driver is spi_flash_cache.spin.

    Currently the kernel runs in a separate COG from the cache driver. The two talk through a mailbox interface. The kernel caches the last address translation so accesses on the same cache line don't result in a call to the cache driver. We only call the cache driver when we have a cache miss on the line the kernel remembers. We have two different architectures for the cache drivers themselves. The ones currently in use keep tags in COG memory so cache hits do not involve any hub accesses other than the mailbox protocol. The newer cache architecture puts tags in hub memory to free up more space in the cache driver COG for more complex external memory interfaces.

    I'd love to hear about your ideas for speeding up cache access on the P2. We haven't really started working on the P2 XMM kernel or cache drivers yet but will be starting soon. We'd welcome any ideas you might have!
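    In very rough pseudo-C, the fast path in the kernel is something like this (illustrative only; the real code is PASM and the actual mailbox layout and line size are different):

        #include <stdint.h>

        #define CACHE_LINE_SIZE 64               /* example line size */

        typedef struct {
            volatile uint32_t addr;              /* request: external address (0 = done) */
            volatile uint32_t line;              /* reply: hub address of the cache line */
        } cache_mailbox_t;

        static cache_mailbox_t cache_mailbox;    /* shared with the cache-driver COG */

        uint32_t xmm_read_long(uint32_t ext_addr)
        {
            static uint32_t last_tag  = 0xFFFFFFFF;  /* last translated line */
            static uint32_t last_base = 0;           /* hub address of that line */

            uint32_t tag = ext_addr & ~(uint32_t)(CACHE_LINE_SIZE - 1);
            if (tag != last_tag) {                   /* only call the driver on a miss */
                cache_mailbox.addr = ext_addr;       /* post the request */
                while (cache_mailbox.addr != 0)      /* driver clears it when done */
                    ;
                last_base = cache_mailbox.line;
                last_tag  = tag;
            }
            return *(uint32_t *)(last_base + (ext_addr & (CACHE_LINE_SIZE - 1)));
        }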
  • rogloh Posts: 5,808
    edited 2013-05-08 07:43
    Thanks David, I will try to take a look and see if I can figure out the code more and whether there's anything useful I can add. My own approach was to try to combine the VM and cache lookup code in the same kernel loop, but it turns out my 2 hub window loop was probably getting a little too ambitious given the delays involved in accessing the results from RDQUAD and the overhead in saving flags, so I'll need to rethink some of it again. It's probably at least 3 hub windows. I had noticed that you combine the hub memory space at the lower external memory address range, which adds a couple of extra cycles as well, so it starts to push things out a bit.

    There will certainly be quite a bit to think about when doing an SDRAM XMM version on P2. :smile: There will need to be some decisions made as to the amount of code to read from the SDRAM during a miss (ie. the cache line size) given how the SDRAM driver behaves and also when it is shared by say video drivers.

    I do also wonder if the new P2 COG's stack RAM can potentially be used either as a L1 I-cache or to hold the cache tags for further speedup somehow when running XMMC. Still pondering in some of my few free moments...
  • David Betz Posts: 14,516
    edited 2013-05-08 08:05
    rogloh wrote: »
    Thanks David, I will try to take a look and see if I can figure out the code more and whether there's anything useful I can add. My own approach was to try to combine the VM and cache lookup code in the same kernel loop, but it turns out my 2 hub window loop was probably getting a little too ambitious given the delays involved in accessing the results from RDQUAD and the overhead in saving flags, so I'll need to rethink some of it again. It's probably at least 3 hub windows. I had noticed that you combine the hub memory space at the lower external memory address range, which adds a couple of extra cycles as well, so it starts to push things out a bit.

    There will certainly be quite a bit to think about when doing an SDRAM XMM version on P2. :smile: There will need to be some decisions made as to the amount of code to read from the SDRAM during a miss (ie. the cache line size) given how the SDRAM driver behaves and also when it is shared by say video drivers.

    I do also wonder if the new P2 COG's stack RAM can potentially be used either as a L1 I-cache or to hold the cache tags for further speedup somehow when running XMMC. Still pondering in some of my few free moments...
    I did an experiment last night with merging the tag handling code into the kernel. I did this on the P1, but the idea was to use a similar approach on the P2 so that I can make use of Chip's SDRAM driver and not require a third COG just to handle the cache tags. I've attached the resulting kernel in case you want to look it over. Another possibility on P2 would be to use the CLUT memory to store tags.

    crt0_xmm.s
  • rogloh Posts: 5,808
    edited 2013-05-08 18:47
    Thanks David.

    Yes, that was what I was hoping to do too: have the I-cache handling in the VM COG so no third separate COG is required just for caching when there is already an SDRAM driver present in the system. The other way to achieve that COG reduction is to do the cache handling in some (modified) SDRAM driver, but that is quite a lot to deal with there as well, especially if other clients use the same driver (e.g. video) in their own ways. I also imagine performance may be better if the I-cache tag handling is done in the VM COG, especially if any of its state can be stored internally in the P2 COG's stack RAM, thereby reducing the number of hub accesses for fetching extra results. It all depends on how it is implemented and whether all the extra P2 capabilities can be leveraged in a useful way.

    One interesting problem for XMMC would be getting access to video memory in the data space if the SDRAM is also being used for video applications as a large frame buffer. I am wondering how one would map that memory back into a data space so GCC applications can use it. Writes back to SDRAM may become slower if a write-through cache is involved or the cache needs to be invalidated, and if another video driver is continually reading directly from the SDRAM driver you'd not want to cache it elsewhere in the hub unless the SDRAM driver handled it all for you, which it probably won't. It's almost like one would want a variant of XMMC where some video frame buffer data and instructions are stored in the SDRAM while the normal data segment and stack for the application are still stored in hub RAM. Or perhaps this is already possible in general with the XMM kernel VM today, just by using linker scripts and the kernel knowing that memory below a certain address is hub RAM. So maybe one could steal an address bit to indicate uncached video space. E.g. 0x0 - 0x1fffffff is hub RAM (with foldover), 0x20000000 - 0x7fffffff is SDRAM for cachable code (and XMM data), and 0x80000000 - 0xffffffff is video in SDRAM and is uncachable, so it gets treated differently on all writes and other video drivers always read valid video data from the SDRAM.

    It is certainly going to get interesting...
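    In C terms, the split I'm imagining would decode something like this (purely illustrative, just restating the ranges above):

        #include <stdint.h>

        typedef enum { MEM_HUB, MEM_XMM_CACHED, MEM_VIDEO_UNCACHED } mem_class_t;

        static mem_class_t classify(uint32_t addr)
        {
            if (addr < 0x20000000UL)       return MEM_HUB;            /* hub RAM (with foldover) */
            else if (addr < 0x80000000UL)  return MEM_XMM_CACHED;     /* SDRAM: cachable code and XMM data */
            else                           return MEM_VIDEO_UNCACHED; /* SDRAM frame buffer: bypass the cache */
        }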
  • rogloh Posts: 5,808
    edited 2013-05-08 22:06
    I think I might start a separate thread on this caching topic.

    [EDIT] New thread is here: http://forums.parallax.com/showthread.php/147886-Instruction-caching-ideas-for-P2-in-XMMC-mode-with-SDRAM