Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

Bill Henning · 2010-11-19 07:00

Hi Heater

I've been hiding, working on my boards and my lab, so that I can get back to software (like VMCOG) as soon as possible.

I have also designed (and half written) a utility I call "target" that gets rid of the #ifdef mess in VMCOG, which was making it difficult for me to maintain and modify VMCOG.

usage: "target file.spin platform"

which inserts/replaces the platform specific code in file.spin

Examples:

'$SEC CON
' target-specific constants
'$END CON

'$SEC BINIT
' target inserts/replaces code here for the specified platform
'$END BINIT

... lots of pasm code

'$SEC BREAD
' you get the idea
'$END BREAD
' and so on
'$SEC BREAD

targets/platform.def

would contain the definitions for each supported section for that target.

So in the new VMCOG, there will be a targets directory with files like:

PropCade.def
C3.def
Triblade1.def

... and so on

Then when localizing VMCOG for a specific platform, target will be used to ... umm ... target it for that platform

And for people working on new platforms, they only see the code for their platform, and can just:

target vmcog.spin sdram32mb save

to update the target file with the new changes
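
For readers who want to see the idea in code, here is a minimal C sketch of the section replacement described above. It is not the actual "target" utility (which is only half written at this point); the function name `replace_section` and its signature are hypothetical, and the marker format is assumed to be exactly `'$SEC NAME` and `'$END NAME` on lines of their own, as in the examples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the line [line, line+len) is exactly marker followed by name,
   ignoring a trailing '\r'. */
static int is_marker(const char *line, size_t len,
                     const char *marker, const char *name)
{
    size_t mlen = strlen(marker), nlen = strlen(name);
    if (len > 0 && line[len - 1] == '\r')
        len--;
    return len == mlen + nlen &&
           memcmp(line, marker, mlen) == 0 &&
           memcmp(line + mlen, name, nlen) == 0;
}

/* Appends n bytes to out at *pos; returns 0 on success, -1 if out is full. */
static int append(char *out, size_t outsz, size_t *pos,
                  const char *s, size_t n)
{
    if (*pos + n + 1 > outsz)
        return -1;
    memcpy(out + *pos, s, n);
    *pos += n;
    out[*pos] = '\0';
    return 0;
}

/* Hypothetical helper: copies src to out, replacing everything between
   '$SEC name and '$END name with body. The marker lines are kept.
   Returns the number of sections replaced, or -1 if out is too small. */
int replace_section(const char *src, const char *name, const char *body,
                    char *out, size_t outsz)
{
    size_t pos = 0;
    int inside = 0, replaced = 0;

    assert(src && name && body && out && outsz > 0);
    out[0] = '\0';

    while (*src) {
        const char *nl = strchr(src, '\n');
        size_t len = nl ? (size_t)(nl - src) : strlen(src);  /* without '\n' */
        size_t full = nl ? len + 1 : len;                    /* with '\n'    */

        if (!inside && is_marker(src, len, "'$SEC ", name)) {
            /* keep the marker, then emit the platform-specific body */
            if (append(out, outsz, &pos, src, full) < 0) return -1;
            if (append(out, outsz, &pos, body, strlen(body)) < 0) return -1;
            inside = 1;
            replaced++;
        } else if (inside && is_marker(src, len, "'$END ", name)) {
            if (append(out, outsz, &pos, src, full) < 0) return -1;
            inside = 0;
        } else if (!inside) {
            if (append(out, outsz, &pos, src, full) < 0) return -1;
        }
        /* lines inside the old section are dropped */
        src += full;
    }
    return replaced;
}
```

A real tool would read the body for each section from targets/platform.def and loop over all section names; the "save" direction would do the reverse and copy the text between the markers back into the .def file.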

Heater. wrote: »

Quickly runs away to read up on "multi-way unified cache....".

As you see I'm in no position to advise. Where is Bill when you need him?

Increasing the size of the cache must surely help, as a brute force approach.

I don't know yet. I have here a TriBlade and VMCOG which means nothing in your case due to the 64K limit. I also have a 32MB GadgetGangster card which due to work/flu/more work I have yet to find time to get running. Most of the timing experiments we have done were using fibo which is somewhat useless for this task.

Anywhere that is supported by VMCOG, the GadgetGangster 32MB setup, and now your C3 effort.

I know Bill has plans to greatly increase the address space handled by VMCOG, no idea how far along that has come.

I'm somewhat tempted to do what I always said I would not do, create a Zog with direct hardware access to the 512KBytes on the TriBlade card. Perhaps another for the DracBlade. This is the way to go for raw speed on those cards.

None of this will happen until I have the GadgetGangster running here, and floating point and...

lonesock · 2010-11-19 07:21

Heater, quick question for you: does Zog still use the float => unsigned integer cast? I'm almost ready for a final review of F32, then a post to the OBEX, so I just wanted to know if I should leave that function in.

Also, did you ever figure out what was going on when compiling with various optimization settings?

thanks,
Jonathan

Heater. · 2010-11-19 07:36

lonesock,

It's been such a long time since I've been able to look at this I will have to remind myself what is what and where did I get to. Just now I still have a fuzzy fluy head and I'm away from all my Prop kit so can't verify what works and what does not.

Which function exactly are you referring to?

I suspect that if it's a useful thing for Spin programmers you should leave it no matter what Zog does.

Never did get to the sensitivity to optimization setting problem. Again I don't think you should worry, it's unlikely to be float32's fault after all.

I was toying with the idea of making the float functions in-line, at least the basic maths ops anyway.

Hopefully I will be communing with my Propellers again tomorrow and get a handle on this for you.

lonesock · 2010-11-19 08:10

Heater,

No worries...get well! The function in question is UintTrunc, which casts a float to an unsigned integer. (I'm just not sure if it's of any use to Spin users in general, since Spin interprets longs as signed, and ditto for Zog, of course). The code uses 12 longs in the cog. It's not really a problem to leave it in for now, but let's say that function will be 1st against the wall if I want to add any more functionality [8^)

Jonathan

jazzed · 2010-11-19 08:44

David Betz wrote: »

I could certainly have separate code and data caches but I wonder if that would be better than just having a multi-way unified cache? Also, I guess I could try using more than 4k of hub RAM as a cache. I will probably play with some of these ideas to see if they help.

David, here is a 2 way SdramCache.spin. The fibo calculation results are somewhat slower because a few more instructions are required for real address comparison and virtual address delivery, but the over-all run time is marginally faster for SDRAM because the stack is maintained in the second set. Presumably your SRAM and FLASH live at different base addresses, so there should be more benefit there. Note the variable _SET2_ADDR - set that bit to the second set address. You will have to replace the SDRAM code with SPI code.

David Betz · 2010-11-19 09:08

jazzed wrote: »

David, here is a 2 way SdramCache.spin. The fibo calculation results are somewhat slower because a few more instructions are required for real address comparison and virtual address delivery, but the over-all run time is marginally faster for SDRAM because the stack is maintained in the second set. Presumably your SRAM and FLASH live at different base addresses, so there should be more benefit there. Note the variable _SET2_ADDR - set that bit to the second set address. You will have to replace the SDRAM code with SPI code.

Thanks for the two-way code! I'll take a look at it tonight. How did you decide which way to overwrite on a cache miss?

jazzed · 2010-11-19 09:36

David Betz wrote: »

Thanks for the two-way code! I'll take a look at it tonight. How did you decide which way to overwrite on a cache miss?

I decided to OR the read command as you had mentioned before. Here's the zog cache_access fragment I use.

'------------------------------------------------------------------------------
#ifdef USE_JCACHED_MEMORY
zpu_cache
                        mov     temp, addr                  'ptr + mboxdat = hub address of byte to load
                        andn    temp, #(cache#LINELEN-1)
                        cmp     cacheaddr,temp wz           'if cacheaddr == addr, just pull from cache
            if_ne       jmp     #zpu_cache_ret              'memp gets overwritten on a miss
                        mov     memp, addr                  'ptr + mboxdat = hub address of byte to load
                        and     memp, #(cache#LINELEN-1)
                        add     memp, mboxptr               'add ptr to memp to get data address
zpu_cache_ret           ret

cache_write             mov     memp, addr                  'save address for index
                        andn    addr, #cache#CMD_MASK       'ensure a write is not a read
                        or      addr, #cache#WRITE_CMD
                        jmp     #cache_access

cache_read              mov     memp, addr                  'save address for index
                        or      addr, #cache#READ_CMD       'read must be 3 to avoid needing andn addr,#cache#CMD_MASK

cache_access            ' if cacheaddr <> addr, load new cache line
                        wrlong  addr, mboxcmd
                        mov     cacheaddr,addr              'Save new cache address. it's free time here
                        andn    cacheaddr,#cache#LINELEN-1  'Kill command bits in free time
:waitres                rdlong  temp, mboxcmd
                        tjnz    temp, #:waitres             'We have room for tjnz 4 or 8 cycles
                        rdlong  mboxptr, mboxdat            'Get new buffer
                        and     memp, #(cache#LINELEN-1)    'memp is index into buffer
                        add     memp, mboxptr               'memp is now HUB buf address of data to read
cache_read_ret
cache_write_ret
cache_access_ret        ret
#endif
'------------------------------------------------------------------------------
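
For anyone following along in C, the address arithmetic in that fragment can be modelled as below. This is only a sketch of the masking steps, not a port of the cog code: the value of `LINELEN` (32 here) is an assumption, and the mailbox handshake with the cache cog is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

#define LINELEN 32u   /* assumed line size in bytes; must be a power of two */

/* Line base address: what the PASM keeps in cacheaddr
   (andn with LINELEN-1 clears the offset bits). */
static uint32_t line_base(uint32_t addr)
{
    return addr & ~(LINELEN - 1u);
}

/* Byte offset within the line: what the PASM ANDs into memp. */
static uint32_t line_offset(uint32_t addr)
{
    return addr & (LINELEN - 1u);
}

/* Hit test against the single line currently held by the interpreter. */
static int is_hit(uint32_t cacheaddr, uint32_t addr)
{
    return line_base(addr) == cacheaddr;
}

/* Hub address of the wanted byte, given the buffer pointer that the
   cache cog returned through the mailbox (mboxptr in the PASM). */
static uint32_t hub_address(uint32_t mboxptr, uint32_t addr)
{
    assert((LINELEN & (LINELEN - 1u)) == 0);   /* masks need a power of two */
    return mboxptr + line_offset(addr);
}
```

The same masking is also why the read/write command bits can be ORed into the low bits of addr before it goes into the mailbox: they fall inside the offset field and are cleared when cacheaddr is computed.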

David Betz · 2010-11-19 09:40

Thanks for your ZOG modifications. However, that doesn't answer my question. :-)
I probably worded it badly. You have a two-way cache so any address will match two cache lines, one from each "way". If neither of those cache lines match the required address you have to decide to overwrite one of those cache lines. My question is, how do you decide which? Do you use some sort of LRU algorithm? Do you take into account whether one of the lines is "dirty" and has to be written back to external memory?

Sorry for the badly worded question!

jazzed · 2010-11-19 10:06

David Betz wrote: »

If neither of those cache lines match the required address you have to decide to overwrite one of those cache lines. My question is, how do you decide which?
Do you use some sort of LRU algorithm?

Do you take into account whether one of the lines is "dirty" and has to be written back to external memory?

The answer is 2.

Up until a few weeks ago the Cache would always flush or write-back. That is fixed on any recently posted/published code.

I have not bothered with a replacement policy with the SdramCache since it takes less than 6us to read or write 32 bytes.
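
As an illustration of the "write back only when dirty" behaviour mentioned above, here is a small C model. It is a sketch under stated assumptions, not the SdramCache code: a single cache line, a 32-byte line size, and a plain array (`ext_mem`) standing in for the SDRAM. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINELEN  32u      /* assumed line size in bytes */
#define EXT_SIZE 1024u    /* size of the simulated external memory */

static uint8_t  ext_mem[EXT_SIZE];   /* stands in for the SDRAM */
static unsigned writebacks;          /* counts external writes, for illustration */

struct cache_line {
    uint32_t base;                   /* external address of the cached line */
    int      valid;
    int      dirty;                  /* set by writes, cleared on load */
    uint8_t  data[LINELEN];
};

/* Makes sure the line containing addr is loaded. The old line is written
   back only if it was modified; a clean line is simply discarded. */
static void load_line(struct cache_line *l, uint32_t addr)
{
    uint32_t base = addr & ~(LINELEN - 1u);

    assert(base + LINELEN <= EXT_SIZE);
    if (l->valid && l->base == base)
        return;                                       /* hit */
    if (l->valid && l->dirty) {
        memcpy(&ext_mem[l->base], l->data, LINELEN);  /* write-back */
        writebacks++;
    }
    memcpy(l->data, &ext_mem[base], LINELEN);         /* fill */
    l->base  = base;
    l->valid = 1;
    l->dirty = 0;
}

uint8_t cache_read_byte(struct cache_line *l, uint32_t addr)
{
    load_line(l, addr);
    return l->data[addr & (LINELEN - 1u)];
}

void cache_write_byte(struct cache_line *l, uint32_t addr, uint8_t value)
{
    load_line(l, addr);
    l->data[addr & (LINELEN - 1u)] = value;
    l->dirty = 1;
}
```

With an always-write-back cache every miss costs a line write plus a line read; with the dirty flag, misses on code or read-only data cost only the read.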

David Betz · 2010-11-19 10:11

I'm wondering how fast my basic compiler would run on your SdramCache implementation. What is the minimum amount of hardware I need to purchase to try it? Just the Propeller Platform board and your SDRAM board? Do you know if both are in stock?

jazzed · 2010-11-19 10:46

David Betz wrote: »

I'm wondering how fast my basic compiler would run on your SdramCache implementation. What is the minimum amount of hardware I need to purchase to try it? Just the Propeller Platform board and your SDRAM board? Do you know if both are in stock?

The pre-assembled Propeller Platform USB is in stock - it or another compatible board is required for using SDRAM. Nick is building new SDRAM boards today. Once I've had a look, new boards will be posted for sale next week.

The SDRAM module, which has its own SDCARD slot and optional TV A/V circuit, is compatible with any of the Propeller Platform versions. The new SDRAM module does not have a Keyboard/Mouse chip or connectors - those features are now on a separate board. There should be some cost advantage to the new SDRAM module since it is on a "short" board. Pricing is TBD.

David Betz · 2010-11-19 11:01

jazzed wrote: »

The SDRAM module, which has its own SDCARD slot and optional TV A/V circuit, is compatible with any of the Propeller Platform versions. The new SDRAM module does not have a Keyboard/Mouse chip or connectors - those features are now on a separate board. There should be some cost advantage to the new SDRAM module since it is on a "short" board. Pricing is TBD.

Are you saying that once the "short board" is available it won't be possible to get the all-in-one board anymore that contains the TV interface, PS2 jacks, and SD card slot?

jazzed · 2010-11-19 11:09

David Betz wrote: »

Are you saying that once the "short board" is available it won't be possible to get the all-in-one board anymore that contains the TV interface, PS2 jacks, and SD card slot?

Not exactly. We can still produce the all-in-one boards.

David Betz · 2010-11-19 11:12

jazzed wrote: »

Not exactly. We can still produce the all-in-one boards.

Okay, thanks!

And, I'm sorry to have hijacked this thread to talk about SDRAM boards. I'll get back on topic and start talking about ZOG again in my next message! :-)

Heater. · 2010-11-19 11:25

No worries about thread jacking, David. Zog depends heavily on external RAM, so any discussion of such boards is OK here.

David Betz · 2010-11-19 17:40

jazzed wrote: »

David, here is a 2 way SdramCache.spin. The fibo calculation results are somewhat slower because a few more instructions are required for real address comparison and virtual address delivery, but the over-all run time is marginally faster for SDRAM because the stack is maintained in the second set. Presumably your SRAM and FLASH live at different base addresses, so there should be more benefit there. Note the variable _SET2_ADDR - set that bit to the second set address. You will have to replace the SDRAM code with SPI code.

I must not understand set associative caches as well as I thought. It looks to me like you are using a high order address bit to select between the two sets. My understanding is that in a two way set associative cache the same addresses map to both sets and that the cache hardware tries to match both to determine a cache hit/miss. That's why I asked you how you decided which cache line to replace. Both match the same address. Maybe I'm wrong though. I'm not a hardware designer and never had to implement a set associative cache. :-)

On the other hand, your method might be quite useful to me since I can set your _SET2_ADDR to the bit I use to distinguish between SRAM and flash. That way I essentially get a separate instruction and data cache.
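
To make the difference between the two schemes concrete, here is a C sketch that tracks tags only (no data). Part (a) is the textbook two-way set-associative lookup, where both ways share one index and an LRU bit picks the victim; part (b) is the partitioned scheme, where an address bit selects one of two direct-mapped banks. The line size, the number of sets and the position of `SET2_BIT` are all assumptions made for the example, not values taken from SdramCache.spin.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BITS 5u            /* assumed 32-byte lines */
#define SETS      16u           /* assumed number of sets (power of two) */
#define SET2_BIT  (1u << 24)    /* hypothetical bit separating the two regions */

struct way { uint32_t tag; int valid; };

/* (a) Two-way set associative: one index, two candidate ways. */
struct set {
    struct way way[2];
    unsigned   lru;             /* index of the least recently used way */
};
static struct set assoc[SETS];

/* Returns 1 on a hit, 0 on a miss. *way_out is the way that now holds
   the line (on a miss, the LRU way that was replaced). */
int assoc_access(uint32_t addr, unsigned *way_out)
{
    uint32_t line = addr >> LINE_BITS;
    struct set *s = &assoc[line % SETS];
    uint32_t tag = line / SETS;
    unsigned w;
    int hit = 0;

    assert(way_out);
    for (w = 0; w < 2; w++)
        if (s->way[w].valid && s->way[w].tag == tag) { hit = 1; break; }
    if (!hit) {
        w = s->lru;             /* replace the least recently used way */
        s->way[w].tag = tag;
        s->way[w].valid = 1;
    }
    s->lru = 1u - w;            /* the other way is now the older one */
    *way_out = w;
    return hit;
}

/* (b) Partitioned: an address bit picks one of two direct-mapped banks,
   so there is never a choice of victim. */
static struct way bank[2][SETS];

int partitioned_access(uint32_t addr, unsigned *bank_out)
{
    unsigned b = (addr & SET2_BIT) ? 1u : 0u;
    uint32_t line = (addr & ~SET2_BIT) >> LINE_BITS;
    struct way *e = &bank[b][line % SETS];
    uint32_t tag = line / SETS;
    int hit = e->valid && e->tag == tag;

    assert(bank_out);
    e->tag = tag;               /* on a miss the old line is simply replaced */
    e->valid = 1;
    *bank_out = b;
    return hit;
}
```

In (a) two addresses that map to the same set can coexist wherever they live in memory; in (b) they can coexist only if they differ in the selector bit, which is what makes it behave like separate caches for two memory regions.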

jazzed · 2010-11-19 17:53

David Betz wrote: »

... On the other hand, your method might be quite useful to me since I can set your _SET2_ADDR to the bit I use to distinguish between SRAM and flash. That way I essentially get a separate instruction and data cache.

Artistic license

I did a survey of implementations last month and decided this is what I needed and what could be done with minimum impact on fetch time. Hope it helps. Good luck.

David Betz · 2010-11-19 18:27

I want to increase the amount of memory allocated to my cache but I'm not sure how to determine how much memory is available above the loaded Spin program. How do I find the first available address in hub RAM?

Heater. · 2010-11-19 18:36

Almost impossible:)

I think I'm right in saying that Spin's stack grows upwards and the stack pointer is set to just past the end of all your Spin code/data in HUB on start up.

So, your cache is somewhere at the top and the Spin stack grows upwards hoping not to crash into it. I already had this fight with the VMCOG working set page area.

Now I'm not sure why your cache area can't just be something defined in DAT space anyway. Then it would be added to the final Spin program size.

David Betz · 2010-11-19 19:26

Heater. wrote: »

Now I'm not sure why your cache area can't just be something defined in DAT space anyway. Then it would be added to the final Spin program size.

That's not a bad idea. Maybe I'll try that. Thanks!

David Betz · 2010-11-21 10:07

Heater. wrote: »

Now I'm not sure why your cache area can't just be something defined in DAT space anyway. Then it would be added to the final Spin program size.

Okay, I want to do this, define my cache in the DAT section and just pass the address to my cache initialization code. I'm not entirely clear on the DAT syntax for declaring an array of data. Is this correct?

DAT
cache byte 0[8192]

Seems like this defines the symbol 'cache' as an array of 8192 bytes initialized to zero. Is that correct? It's kind of an odd syntax so I want to make sure I understand it correctly.

jazzed · 2010-11-21 10:49

David Betz wrote: »
DAT
cache byte 0[8192]
Seems like this defines the symbol 'cache' as an array of 8192 bytes initialized to zero. Is that correct?

Yes.

SdramCache.spin has always used a statically defined cache array. Is there no value in using that file and replacing the PASM "flush" routine read/write methods? Just curious.

David Betz · 2010-11-21 12:07

jazzed wrote: »

Is there no value in using that file and replacing the PASM "flush" routine read/write methods? Just curious.

My cache code is already a combination of stuff I stole from SdramCache.spin and other stuff I stole from vmcog.spin. :-)

jazzed · 2010-11-21 12:45

David Betz wrote: »

My cache code is already a combination of stuff I stole from SdramCache.spin and other stuff I stole from vmcog.spin. :-)

So I can presume it is faster than both?

David Betz · 2010-11-21 12:57

jazzed wrote: »

So I can presume it is faster than both?

More likely slower than both! :-)
I tried making a change similar to what you did with your two-way cache code and at the same time I increased my cache to 8192 bytes and now my basic interpreter is running *much* faster. I'm still not sure if it is really fast enough but it's definitely okay for simple programs. My cache is now set up as follows:

8192 bytes of cache split into two 4096 byte sections, one for flash and one for SRAM
Each section is 32 cache lines of 128 bytes each.

I've attached my cache code in case you'd like to see how I mutilated your code.
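
For reference, the geometry above (8192 bytes, two 4096-byte sections, 32 lines of 128 bytes each) implies the following address split. This is a C sketch, not the attached cache code: the position of `FLASH_BIT` and all names are hypothetical, and only the sizes come from the description.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE    8192u
#define SECTION_SIZE  4096u                        /* one section per memory type */
#define LINE_SIZE     128u
#define LINES_PER_SEC (SECTION_SIZE / LINE_SIZE)   /* 32 lines per section */
#define FLASH_BIT     (1u << 20)   /* hypothetical bit telling flash from SRAM */

struct cache_slot {
    unsigned section;      /* 0 = SRAM section, 1 = flash section (assumed order) */
    unsigned line;         /* line within the section, 0..31 */
    uint32_t offset;       /* byte within the line, 0..127 */
    uint32_t tag;          /* identifies which external line occupies the slot */
    uint32_t buf_index;    /* index into the 8192-byte cache array */
};

/* Splits an external address into the fields a direct-mapped lookup needs. */
struct cache_slot locate(uint32_t addr)
{
    struct cache_slot s;
    uint32_t a = addr & ~FLASH_BIT;    /* address within its memory device */

    s.section   = (addr & FLASH_BIT) ? 1u : 0u;
    s.offset    = a % LINE_SIZE;
    s.line      = (a / LINE_SIZE) % LINES_PER_SEC;
    s.tag       = a / SECTION_SIZE;
    s.buf_index = s.section * SECTION_SIZE + s.line * LINE_SIZE + s.offset;
    assert(s.buf_index < CACHE_SIZE);
    return s;
}
```

Because the section is chosen by the selector bit alone, code fetched from one device can never evict data held for the other, which is where the separate instruction and data cache effect comes from.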

jazzed · 2010-11-21 14:01

David Betz wrote: »

I tried making a change similar to what you did with your two-way cache code and at the same time I increased my cache to 8192 bytes and now my basic interpreter is running *much* faster.

Well that's good news! Hope it's good enough for release.

David Betz · 2010-11-21 14:24

jazzed wrote: »

Well that's good news! Hope it's good enough for release.

Not ready for release but it might mean that it is worth putting more effort into. I guess my next step is to integrate SD file I/O without having to resort to using the syscall interface to do raw sector I/O.

jazzed · 2010-11-21 14:31

David Betz wrote: »

Not ready for release but it might mean that it is worth putting more effort into. I guess my next step is to integrate SD file I/O without having to resort to using the syscall interface to do raw sector I/O.

I've been re-thinking syscalls. The main reason I wanted to use them before is that SPIN is faster than ZOG. My current thinking is that I would rather just start a COG with a C interface mainly for portability and avoiding debug_zog or other dependent clutter. The current OBEX C code can be easily shared with ZOG as well.

Now, where are the examples for starting a COG from ZOG in C ?

David Betz · 2010-11-21 17:09

Heater: Could you explain how _memreg is used? It seems to be used to communicate between the syscall handler in debug_zog.spin and the ZPU code running under ZOG. How does that work?

Heater. · 2010-11-22 00:53

Jazzed,

...where are the examples for starting a COG from ZOG in C ?

Have a look at the second half of post #742 in this thread. There I attempted to describe this for David.

I am sensing a growing desire to move away from Spin and debug_zog like setups towards a ZPU and C only environment.

This is exactly what run_zog does at the moment. But only within the confines of a HUB based system so far. Basically when run_zog is up and running you have a Prop running C from ZPU interpreter instead of Spin from the Spin interpreter.

It would be quite possible for the Zog started by run_zog to start another Zog that used external memory. We could end up with a debug_zog written in C instead of Spin.

Ideally we would only use SYSCALL for Propeller services like COGNEW and WAITxxx that cannot be gotten at in any other way.

David,

Could you explain how _memreg is used?

Hmmm...Not really.

Apparently gcc is in need of two or three CPU registers for various operations. The ZPU has no registers so they are implemented in RAM at address zero. There is mention of this on the Zylin web site but no details.

_memreg is used for returning results from functions and therefore from SYSCALLs.
