Zog goes little endian.
...
With the bonus that fibo is now almost 7% faster!
Excellent news. Thanks for that. Hopefully a "HUB locals" version for the external memory platforms can be running soon. I also look forward to running programs fetched from byte-wide Flash - it will be more work, but at least I won't have to wait for any more tools to get it working.
I had a little look at what could be done with your suggestion. I'm afraid you might be a bit disappointed, because:
1) Zog already bypasses the dispatch loop for IMs. It just checks the top bit of every opcode for IM and then handles it directly. This was basically done to halve the dispatch table size and, according to my code comments, already sped fibo up by 7%.
2) That look-ahead involves fetching the next opcode, which means either in-lining another byte fetch for it (not nice, as we need the space) or making opcode fetch a subroutine (not nice, as it's slower).
There might be some mileage in the idea in that it gets rid of maintaining a flag for first versus subsequent IMs, but that's only a couple of instructions.
I just replaced Zog's main execution loop with something that I think implements your suggestion. Code below. It shows no difference in fibo execution time at all!
I was hoping for something, because as far as I can tell this code saves 1 PASM instruction for non-IM opcodes, 2 PASM instructions for the first IM in a sequence and none for following IMs.
Given that 25% of the fibo instructions are single IMs I might have expected to see some gain.
This code looks nicer than the older version, but as presented here it only works for execution from HUB. Adding the external memory opcode fetch would require two lots of inline code or two subroutine calls, neither of which appeals to me.
Looks like Zog will not be doing this look-ahead unless you have a sneakier way to do it :)
' Main ZPU fetch and execute loop
done_and_inc_pc add pc, #1
done
execute mov memp, pc 'Read an opcode from memory @ program counter
add memp, zpu_memory_addr
rdbyte data, memp
cmpsub data, #$80 wc 'Check for IM instruction. This saves dispatch table lookup
if_nc jmp #not_im 'for the most common op. 7% fibo speed gain!
call #push_tos 'Must be first IM of a sequence to get here.
mov tos, data 'Extract 7 bits of immediate data into TOS.
shl tos, #(32 - 7) 'Sign extend
sar tos, #(32 - 7)
im_loop add pc, #1 'Is there a following IM?
mov memp, pc 'Read an opcode from memory @ program counter
add memp, zpu_memory_addr
rdbyte data, memp
cmpsub data, #$80 wc 'Check for IM instruction. This saves dispatch table lookup
if_nc jmp #not_im
shl tos, #7 'For following IMs shift TOS and add 7 bits more data
or tos, data
jmp #im_loop
not_im mov address, data 'Some opcodes contain address offsets.
add data, dispatch_tab_addr 'Dispatch op through table.
rdbyte temp, data
jmp temp 'No # here we are jumping through temp.
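Roughly, in C terms, the loop above does the following. This is only a sketch for readers who don't speak PASM - zpu_memory, tos, push_tos() and dispatch() are illustrative stand-ins rather than the actual Zog symbols, and the real dispatch handlers may adjust pc themselves (branches jump back to done rather than done_and_inc_pc):

#include <stdint.h>

extern uint8_t zpu_memory[];     /* ZPU address space                   */
extern int32_t tos;              /* top-of-stack register               */
void push_tos(void);             /* push the old TOS onto the stack     */
void dispatch(uint8_t op);       /* table dispatch for a non-IM opcode  */

void execute_loop(uint32_t pc)
{
    for (;;) {
        uint8_t op = zpu_memory[pc];

        if (op & 0x80) {                      /* IM fast path, no table  */
            push_tos();                       /* first IM of a sequence  */
            int32_t imm = op & 0x7F;
            if (imm & 0x40)                   /* sign-extend 7 bits      */
                imm -= 0x80;
            tos = imm;
            while ((op = zpu_memory[++pc]) & 0x80)    /* following IMs   */
                tos = (int32_t)(((uint32_t)tos << 7) | (op & 0x7Fu));
        }
        dispatch(op);                         /* non-IM opcode           */
        pc++;                                 /* done_and_inc_pc; real
                                                 branch handlers skip it */
    }
}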
Why don't you add zpu_memory_addr to pc whenever you load the pc rather than every time you want to fetch a bytecode?
(putting on sneaky hat... suggestions marked with a comment)
' Main ZPU fetch and execute loop
done_and_inc_pc add pc, #1
done
execute
'** mov memp, pc 'Read an opcode from memory @ program counter
' ** add memp, zpu_memory_addr
' ** rdbyte data, memp
rdbyte data,pc ' pc already carries zpu_memory_addr, saves two instructions
cmpsub data, #$80 wc 'Check for IM instruction. This saves dispatch table lookup
if_nc jmp #not_im 'for the most common op. 7% fibo speed gain!
'** call #push_tos 'Must be first IM of a sequence to get here.
wrlong tos,sp ' saves call/ret overhead, also hits hub window
sub sp,#4 ' also maintain SP as correct address, only add zog_memory offset when initially loading sp
mov tos, data 'Extract 7 bits of immediate data into TOS.
shl tos, #(32 - 7) 'Sign extend
sar tos, #(32 - 7)
im_loop add pc, #1 'Is there a following IM?
'**mov memp, pc 'Read an opcode from memory @ program counter
'**add memp, zpu_memory_addr ' not needed if zpu_memory_addr is added when loading pc
rdbyte data, pc
cmpsub data, #$80 wc 'Check for IM instruction. This saves dispatch table lookup
if_nc jmp #not_im
and data,#$7f ' clear high bit so constant is good
shl tos, #7 'For following IMs shift TOS and add 7 bits more data
or tos, data
jmp #im_loop
not_im mov address, data 'Some opcodes contain address offsets.
add data, dispatch_tab_addr 'Dispatch op through table.
rdbyte temp, data
jmp temp 'No # here we are jumping through temp.
The changes above get rid of some instructions and hit a hub window on time.
Keeping the zog_memory_offset additions to the minimum (i.e. only when a new address is loaded) will help.
With the little-endian issue fixed there is no need for the xors... it should be OK to only add the offsets when the PC and SP addresses are loaded. I hope.
Little-endian Zog even uses Lonesock's F32 floating point object quite successfully. There are some fails, but I think they were there before:
ZOG (LITTLE ENDIAN) v1.6 (HUB)
zpu memory at 0000008C
Test F32:
F32 COG = 4
__fixsfsi (3.6) OK
__fixsfsi (-3.6) OK
__fixunssfsi (147.6) OK
__fixunssfsi (-147.6) OK
__addsf3 OK
sinf OK
asinf OK
cosf OK
acosf OK
tanf OK
atanf OK
atan2f OK
floorf OK
ceilf OK
logf OK
log2f OK
log10f OK
expf OK
exp2f OK
exp10f OK
powf OK
fabsf (pos) OK
fabsf (neg) OK
roundf FAIL
2147483647roundf FAIL
2147483647roundf FAIL
-2147483647roundf FAIL
-2147483647truncf FAIL
2147483647truncf FAIL
2147483647truncf FAIL
-2147483647truncf FAIL
-2147483647modf FAIL
float > OK
float >= OK
float < OK
float <= OK
The first 100000 terms of the Euler series sum to...164472528
Euler OK
All tests complete.
Jonathan,
No, Zog has to catch up with the latest F32 version.
Bill, I like the idea of only updating PC when necessary. However, I'm guilty of not posting all the code there - I deleted all the "#ifdef" parts. You see, every time we fetch an opcode it's not just RDBYTE: unless we are running from HUB, there are the other builds where the fetch is a run through the VMCOG or SDCACHE code, at which point catching HUB slots and such all goes out the window. In those cases memp is not used, as PC is the address parameter directly (roughly as in the C sketch below).
Just now I'm not convinced the savings are worth messing with it.
Now to see if little-endian works from external RAM...
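For illustration, the build-time fetch selection described above looks something like this in C. USE_VMCOG / USE_SDCACHE and the readbyte() entry points are invented names, not the real Zog #ifdefs or driver calls:

#include <stdint.h>

extern uint8_t hub_memory[];                /* HUB build: plain array      */
uint8_t vmcog_readbyte(uint32_t addr);      /* hypothetical VMCOG call     */
uint8_t sdcache_readbyte(uint32_t addr);    /* hypothetical SDCACHE call   */

/* One opcode fetch, selected at build time.  In PASM the same choice means
   either in-lining this twice (costs COG space) or calling it as a
   subroutine (costs time), which is the trade-off described above. */
static inline uint8_t fetch_opcode(uint32_t pc)
{
#if defined(USE_VMCOG)
    return vmcog_readbyte(pc);      /* pc used directly as the VM address  */
#elif defined(USE_SDCACHE)
    return sdcache_readbyte(pc);
#else
    return hub_memory[pc];          /* HUB build: a single rdbyte          */
#endif
}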
The suggestion would only help on sequences of IMs; it does not help much when Zog is executing from external memory.
David, I believe that you and Bill have the same idea. I start to think we need two Zogs: one running code from HUB and one running code from ext RAM. Just now the whole thing is very confused.
I thought Bill's suggestion had to do with parsing IM instructions. I was merely suggesting that you add the zpu_memory_offset to the pc every time you load it so that you don't have to do that on every bytecode fetch. You're right that it probably won't help much with the external memory models.
Actually I made the same suggestion (add zog_memory_offset) in a post after yours, as I did not see yours before I replied. It can save 16 cycles (one hub access).
What's the status of the little-endian ZOG VM and tool chain? I'd like to update my Google Code project to use the new version so I can resume my tests of putting the stack and locals in hub memory. Where can I get the new tool chain and updated zog.spin?
Little-endian Zog was looking quite good. I have run all the little test programs that come with zog_v1_6 successfully, only from HUB RAM so far. Perhaps I will have a moment to try some ext RAM tests later today. I'll try and post the goodies then.
Thanks! I'll be looking forward to your changes. Also, is there any way we can merge the ZOG development into a single tree? It would be nice not to have to do merges every time you modify zog_v1_x. What is missing from the stuff I checked into Google Code? I'd be happy to add whatever is needed for your development so we can share a single zog.spin. Or, if you have some other place (github maybe) where you'd like to keep the official tree I can move my zogload and library work there.
A little hold-up here. I recompiled my zpugcc tool chain with a cut-down crt0.S, removing all the EMULATE code to save 1.5 Kbytes. All my little tests work fine except dhrystone, which refuses to output anything. I can't see any reason for dhrystone to care about that yet. Perhaps I have to revert to the old crt0.
Merging developments is a good idea. If we can use the same ZOG for your stuff and the old run_zog/debug_zog that would be great. I haven't had much time to look at the changes you made yet.
I think I'd like to change the memory mapping around a bit. More on that when I have time.
Little-endian Zog is now running dhrystone and many other tests from HUB RAM and external RAM on a TriBlade.
There seems to be some issue with using malloc on ext RAM, as a couple of tests that use malloc fail there. Odd, because the same binaries run from HUB or under zpu_in_c on a PC, so the little-endian compiler is OK with them.
Sorry I have no internet connection this weekend, apart from this phone, so I cannot post anything.
Congratulations! I'll be looking forward to getting the new tool chain and VM so I can merge them into my zogload project.
Little-endian Zog is now happy with my malloc test running from external RAM on a Gadget Gangster board with SdramCache and on a TriBlade with VMCOG.
Turns out that initializing the data area in external RAM to zero prior to starting the ZPU gets things working. I guess HUB is zeroed on start-up, so I never saw the problem running from there. Surely the C run time is supposed to take care of such things?
Anyway, it looks like Andrey's little-endian patches for zpu-gcc and little-endian Zog are pretty solid now.
Again, I would post the thing, but I have no net connection at home. I'll have to sneaker-net it to work and try to find somewhere to post Zog and the compiler.
Perhaps I'll look at the following first, though:
1) Get Lonesock's latest F32 floating point support working from C.
2) Try some experiments with stack and/or data in HUB when running from ext RAM.
3) fft_bench.
4) Figure out why my cut-down crt0.S does not work. Perhaps that issue was just confused with the RAM init problem.
5) Make use of David's Zog modifications so we can merge our sources into a single tree.
Any other pressing issues?
Yes, .bss must be zero'd. David's loader has an auto-sizing .bss zero algorithm.
I look forward to performance comparisons with old/new code.
The biggest improvement should be with having HUB based stack.
A unified tree would be great.
One big goal for a unified tree would be to change the loader so that new platforms don't have to be added (David and I have to work this out - he knows I prefer a platform-agnostic approach). In some ways it's easier for the end user now, but why should we burden the developer with adding a new board type? Once this item is resolved, we'll have an easy front end for other GCC tool chains.
This thread has passed 2^15 views. Truly amazing!
I may have said this before, but I have a feeling that this thread is getting a lot of hits from those searching out a particular conspiracy theory that goes by the same name as my virtual machine. Bill Henning did warn me about my choice of name at the time. Still, if we get even one person off the conspiracy theory trail and into the Propeller/MCU hobby as a result then it's worthwhile.
Anyway, bad news today: running fft_bench from ext RAM is sort of working, but there seem to be 4 bits dropped off the amplitudes in the results. Most odd.
Yes, .bss must be zero'd. David's loader has an auto-sizing .bss zero algorithm.
Well, it doesn't really auto-size. It just uses a header that gives the start address and size of the .bss section. It also knows how to copy the .data section from flash to RAM before starting the program if .text is stored in flash.
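In outline, the start-up work being discussed looks like this in C. The symbol names below follow the usual linker-script conventions; they are not necessarily the ones zpu-gcc's crt0.S or David's loader actually use:

#include <stdint.h>
#include <string.h>

extern uint8_t __bss_start[], __bss_end[];      /* .bss bounds            */
extern uint8_t __data_load[];                   /* .data image in flash   */
extern uint8_t __data_start[], __data_end[];    /* .data run-time bounds  */

extern int main(void);

void _start_c(void)
{
    /* .bss must be zeroed.  HUB RAM evidently comes up zeroed, which is
       presumably why the malloc failures only showed up in external RAM. */
    memset(__bss_start, 0, (size_t)(__bss_end - __bss_start));

    /* If .text/.data are stored in flash, initialised data has to be
       copied to its run-time address before main() runs. */
    memcpy(__data_start, __data_load, (size_t)(__data_end - __data_start));

    main();
}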
One big goal for a unified tree would be to change the loader so that new platforms don't have to be added (David and I have to work this out - he knows I prefer a platform-agnostic approach). In some ways it's easier for the end user now, but why should we burden the developer with adding a new board type? Once this item is resolved, we'll have an easy front end for other GCC tool chains.
This will be somewhat difficult because of the different drivers used to interface to external memory. At a minimum the loader needs to know how to set up the UART for the correct baud rate, without necessarily knowing the clock rate that the board is using, and also which external memory driver to load. Unless you want these drivers to be loaded from separate files, all possible external memory drivers will have to be included in the loader program, which means you'll have to type some command line option to select one. It would be nice if we could standardize on a single interface to external memory. I would suggest that we consider using jazzed's JCACHE interface. That is what I used for the C3 and what he used for his Propeller Platform SDRAM board. Any VMCOG drivers would have to be converted to the JCACHE interface.
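Purely for illustration, the kind of single interface being talked about could be as small as a shared mailbox per driver COG. None of these names come from the real JCACHE or VMCOG code; this is just a sketch of the idea:

#include <stdint.h>

typedef struct {
    volatile uint32_t cmd;     /* 0 = idle, 1 = read, 2 = write          */
    volatile uint32_t addr;    /* external memory address                */
    volatile uint32_t data;    /* data in (write) or out (read)          */
} mem_mailbox_t;

enum { MEM_IDLE = 0, MEM_READ = 1, MEM_WRITE = 2 };

static uint32_t mem_read(mem_mailbox_t *mb, uint32_t addr)
{
    mb->addr = addr;
    mb->cmd  = MEM_READ;
    while (mb->cmd != MEM_IDLE)    /* driver COG clears cmd when done    */
        ;
    return mb->data;
}

static void mem_write(mem_mailbox_t *mb, uint32_t addr, uint32_t value)
{
    mb->addr = addr;
    mb->data = value;
    mb->cmd  = MEM_WRITE;
    while (mb->cmd != MEM_IDLE)
        ;
}

With a single mailbox layout like that, the loader would only need to know which driver image to start, not how each driver talks to its particular memory.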
I may have said this before but have a feeling that this thread is getting a lot of hits from those searching out a particular conspiracy theory that goes by the same name as my virtual machine.
Maybe the *REAL* conspiracy is that knowledge strengthened your resolve to use the ZOG acronym
[..]Still if we get even one person off the conspiracy theory trail and into the Propeller/MCU hobby as a result then it's worth while[..]
That would be me. I've spent the last week or so reading every post in this thread. Interesting stuff. I'm in the process of acquiring some Propeller hardware; then we'll see.
-Tor
Thanks,
David
Yes I will, but this week I'm far from my home machines and Zog.
Actually I've been visiting IBM's headquarters in Dublin. Wow, that place is huge.
I'll be back to Helsinki and Zog on Sunday.