Zog goes little endian.
...
With the bonus that fibo is now almost 7% faster!
Excellent news. Thanks for that. Hopefully a "HUB locals" version for the external memory platforms can be running soon. I also look forward to running programs fetched from byte-wide Flash - it will be more work, but at least I won't have to wait for any more tools to get it working.
I had a little look at what could be done with your suggestion. I'm afraid you might be a bit disappointed, because:
1) Zog already bypasses the dispatch loop for IMs. It just checks the top bit of every opcode for IM and then handles it directly. This was basically done to halve the dispatch table size and, according to my code comments, already sped fibo up by 7%.
2) That look-ahead involves fetching the next opcode, which means either in-lining another byte fetch for it (not nice, as we need the space) or making opcode fetch a subroutine (not nice, as it's slower).
There might be some mileage in the idea in that it gets rid of maintaining a flag for first versus subsequent IMs, but that's only a couple of instructions.
I just replaced Zog's main execution loop with something that I think implements your suggestion. Code below. It shows no difference in fibo execution time at all!
I was hoping for something, because as far as I can tell this code saves 1 PASM instruction for non-IM opcodes, 2 PASM instructions for the first IM in a sequence and none for following IMs.
Given that 25% of the fibo instructions are single IMs I might have expected to see some gain.
This code looks nicer than the older version, but as presented here it only works for execution from HUB. Adding the external memory opcode fetch would require two lots of inline code or two subroutine calls, neither of which appeals to me.
Looks like Zog will not be doing this look-ahead unless you have a sneakier way to do it :)
' Main ZPU fetch and execute loop
done_and_inc_pc add pc, #1
done
execute mov memp, pc 'Read an opcode from memory @ program counter
add memp, zpu_memory_addr
rdbyte data, memp
cmpsub data, #$80 wc 'Check for IM instruction. This saves dispatch table lookup
if_nc jmp #not_im 'for the most common op. 7% fibo speed gain!
call #push_tos 'Must be first IM of a sequence to get here.
mov tos, data 'Extract 7 bits of immediate data into TOS.
shl tos, #(32 - 7) 'Sign extend
sar tos, #(32 - 7)
im_loop add pc, #1 'Is there a following IM?
mov memp, pc 'Read an opcode from memory @ program counter
add memp, zpu_memory_addr
rdbyte data, memp
cmpsub data, #$80 wc 'Check for IM instruction. This saves dispatch table lookup
if_nc jmp #not_im
shl tos, #7 'For following IMs shift TOS and add 7 bits more data
or tos, data
jmp #im_loop
not_im mov address, data 'Some opcodes contain address offsets.
add data, dispatch_tab_addr 'Dispatch op through table.
rdbyte temp, data
jmp temp 'No # here we are jumping through temp.
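Roughly, in C terms, the loop above does the following. This is only a sketch for readers who don't speak PASM - zpu_memory, tos, push_tos() and dispatch() are illustrative stand-ins rather than the actual Zog symbols, and the real dispatch handlers may adjust pc themselves (branches jump back to done rather than done_and_inc_pc):

#include <stdint.h>

extern uint8_t zpu_memory[];     /* ZPU address space                   */
extern int32_t tos;              /* top-of-stack register               */
void push_tos(void);             /* push the old TOS onto the stack     */
void dispatch(uint8_t op);       /* table dispatch for a non-IM opcode  */

void execute_loop(uint32_t pc)
{
    for (;;) {
        uint8_t op = zpu_memory[pc];

        if (op & 0x80) {                      /* IM fast path, no table  */
            push_tos();                       /* first IM of a sequence  */
            int32_t imm = op & 0x7F;
            if (imm & 0x40)                   /* sign-extend 7 bits      */
                imm -= 0x80;
            tos = imm;
            while ((op = zpu_memory[++pc]) & 0x80)    /* following IMs   */
                tos = (int32_t)(((uint32_t)tos << 7) | (op & 0x7Fu));
        }
        dispatch(op);                         /* non-IM opcode           */
        pc++;                                 /* done_and_inc_pc; real
                                                 branch handlers skip it */
    }
}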
Why don't you add zpu_memory_addr to pc whenever you load the pc rather than every time you want to fetch a bytecode?
(putting on sneaky hat... suggestions marked with a comment)
' Main ZPU fetch and execute loop
done_and_inc_pc add pc, #1
done
execute
'** mov memp, pc 'Read an opcode from memory @ program counter
' ** add memp, zpu_memory_addr
' ** rdbyte data, memp
rdbyte data,pc ' pc already carries zpu_memory_addr, saves two instructions
cmpsub data, #$80 wc 'Check for IM instruction. This saves dispatch table lookup
if_nc jmp #not_im 'for the most common op. 7% fibo speed gain!
'** call #push_tos 'Must be first IM of a sequence to get here.
wrlong tos,sp ' saves call/ret overhead, also hits hub window
sub sp,#4 ' also maintain SP as correct address, only add zog_memory offset when initially loading sp
mov tos, data 'Extract 7 bits of immediate data into TOS.
shl tos, #(32 - 7) 'Sign extend
sar tos, #(32 - 7)
im_loop add pc, #1 'Is there a following IM?
'**mov memp, pc 'Read an opcode from memory @ program counter
'**add memp, zpu_memory_addr ' not needed if zpu_memory_addr is added when loading pc
rdbyte data, pc
cmpsub data, #$80 wc 'Check for IM instruction. This saves dispatch table lookup
if_nc jmp #not_im
and data,#$7f ' clear high bit so constant is good
shl tos, #7 'For following IMs shift TOS and add 7 bits more data
or tos, data
jmp #im_loop
not_im mov address, data 'Some opcodes contain address offsets.
add data, dispatch_tab_addr 'Dispatch op through table.
rdbyte temp, data
jmp temp 'No # here we are jumping through temp.
The changes above get rid of some instructions and hit a hub window on time.
Keeping the zog_memory_offset additions to the minimum (i.e. only when a new address is loaded) will help.
With the little-endian issue fixed there is no need for the xors... it should be OK to only add the offsets when the PC and SP addresses are loaded. I hope.
Little-endian Zog even uses Lonesock's F32 floating point object quite successfully. There are some fails, but I think they were there before:
ZOG (LITTLE ENDIAN) v1.6 (HUB)
zpu memory at 0000008C
Test F32:
F32 COG = 4
__fixsfsi (3.6) OK
__fixsfsi (-3.6) OK
__fixunssfsi (147.6) OK
__fixunssfsi (-147.6) OK
__addsf3 OK
sinf OK
asinf OK
cosf OK
acosf OK
tanf OK
atanf OK
atan2f OK
floorf OK
ceilf OK
logf OK
log2f OK
log10f OK
expf OK
exp2f OK
exp10f OK
powf OK
fabsf (pos) OK
fabsf (neg) OK
roundf FAIL
2147483647roundf FAIL
2147483647roundf FAIL
-2147483647roundf FAIL
-2147483647truncf FAIL
2147483647truncf FAIL
2147483647truncf FAIL
-2147483647truncf FAIL
-2147483647modf FAIL
float > OK
float >= OK
float < OK
float <= OK
The first 100000 terms of the Euler series sum to...164472528
Euler OK
All tests complete.
Jonathan,
No, Zog has to catch up with the latest F32 version.
Bill, I like the idea of only updating PC when necessary. However, I'm guilty of not posting all the code there - I deleted all the "#ifdef" parts. You see, every time we fetch an opcode it's not just RDBYTE: unless we are running from HUB, there are the other builds where the fetch is a run through the VMCOG or SDCACHE code, at which point catching HUB slots and such all goes out the window. In those cases memp is not used, as PC is the address parameter directly (roughly as in the C sketch below).
Just now I'm not convinced the savings are worth messing with it.
Now to see if little-endian works from external RAM...
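For illustration, the build-time fetch selection described above looks something like this in C. USE_VMCOG / USE_SDCACHE and the readbyte() entry points are invented names, not the real Zog #ifdefs or driver calls:

#include <stdint.h>

extern uint8_t hub_memory[];                /* HUB build: plain array      */
uint8_t vmcog_readbyte(uint32_t addr);      /* hypothetical VMCOG call     */
uint8_t sdcache_readbyte(uint32_t addr);    /* hypothetical SDCACHE call   */

/* One opcode fetch, selected at build time.  In PASM the same choice means
   either in-lining this twice (costs COG space) or calling it as a
   subroutine (costs time), which is the trade-off described above. */
static inline uint8_t fetch_opcode(uint32_t pc)
{
#if defined(USE_VMCOG)
    return vmcog_readbyte(pc);      /* pc used directly as the VM address  */
#elif defined(USE_SDCACHE)
    return sdcache_readbyte(pc);
#else
    return hub_memory[pc];          /* HUB build: a single rdbyte          */
#endif
}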
The suggestion would only help on sequences of IMs; it does not help much when Zog is executing from external memory.
David, I believe that you and Bill have the same idea. I start to think we need two Zogs: one running code from HUB and one running code from ext RAM. Just now the whole thing is very confused.
I thought Bill's suggestion had to do with parsing IM instructions. I was merely suggesting that you add the zpu_memory_offset to the pc every time you load it so that you don't have to do that on every bytecode fetch. You're right that it probably won't help much with the external memory models.
Actually I made the same suggestion (add zog_memory_offset) in a post after yours, as I did not see yours before I replied. It can save 16 cycles (one hub access).
What's the status of the little-endian ZOG VM and tool chain? I'd like to update my Google Code project to use the new version so I can resume my tests of putting the stack and locals in hub memory. Where can I get the new tool chain and updated zog.spin?
Little-endian Zog was looking quite good. I have run all the little test programs that come with zog_v1_6 successfully, only from HUB RAM so far. Perhaps I will have a moment to try some ext RAM tests later today. I'll try and post the goodies then.
Thanks! I'll be looking forward to your changes. Also, is there any way we can merge the ZOG development into a single tree? It would be nice not to have to do merges every time you modify zog_v1_x. What is missing from the stuff I checked into Google Code? I'd be happy to add whatever is needed for your development so we can share a single zog.spin. Or, if you have some other place (github maybe) where you'd like to keep the official tree I can move my zogload and library work there.
A little hold-up here. I recompiled my zpugcc tool chain with a cut-down crt0.S, removing all the EMULATE code to save 1.5 Kbytes. All my little tests work fine except dhrystone, which refuses to output anything. I can't see any reason for dhrystone to care about that yet. Perhaps I have to revert to the old crt0.
Merging developments is a good idea. If we can use the same ZOG for your stuff and the old run_zog/debug_zog that would be great. I haven't had much time to look at the changes you made yet.
I think I'd like to change the memory mapping around a bit. More on that when I have time.
Little-endian Zog is now running dhrystone and many other tests from HUB RAM and external RAM on a TriBlade.
There seems to be some issue with using malloc on ext RAM, as a couple of tests that use malloc fail there. Odd, because the same binaries run from HUB or under zpu_in_c on a PC, so the little-endian compiler is OK with them.
Sorry I have no internet connection this weekend, apart from this phone, so I cannot post anything.
Congratulations! I'll be looking forward to getting the new tool chain and VM so I can merge them into my zogload project.
Little-endian Zog is now happy with my malloc test running from external RAM on a Gadget Gangster board with SdramCache and on a TriBlade with VMCOG.
Turns out that initializing the data area in external RAM to zero prior to starting the ZPU gets things working. I guess HUB is zeroed on start-up, so I never saw the problem running from there. Surely the C run time is supposed to take care of such things?
Anyway, it looks like Andrey's little-endian patches for zpu-gcc and little-endian Zog are pretty solid now.
Again, I would post the thing, but I have no net connection at home. I'll have to sneaker-net it to work and try to find somewhere to post Zog and the compiler.
Perhaps I'll look at the following first, though:
1) Get Lonesock's latest F32 floating point support working from C.
2) Try some experiments with stack and/or data in HUB when running from ext RAM.
3) fft_bench.
4) Figure out why my cut-down crt0.S does not work. Perhaps that issue was just confused with the RAM init problem.
5) Make use of David's Zog modifications so we can merge our sources into a single tree.
Any other pressing issues?
Yes, .bss must be zero'd. David's loader has an auto-sizing .bss zero algorithm.
I look forward to performance comparisons with old/new code.
The biggest improvement should be with having HUB based stack.
A unified tree would be great.
One big goal for a unified tree would be to change the loader so that new platforms don't have to be added (David and I have to work this out - he knows I prefer a platform-agnostic approach). In some ways it's easier for the end user now, but why should we burden the developer with adding a new board type? Once this item is resolved, we'll have an easy front end for other GCC tool chains.
This thread has passed 2^15 views. Truly amazing!
I may have said this before, but I have a feeling that this thread is getting a lot of hits from those searching out a particular conspiracy theory that goes by the same name as my virtual machine. Bill Henning did warn me about my choice of name at the time. Still, if we get even one person off the conspiracy theory trail and into the Propeller/MCU hobby as a result then it's worthwhile.
Anyway, bad news today: running fft_bench from ext RAM is sort of working, but there seem to be 4 bits dropped off the amplitudes in the results. Most odd.
Yes, .bss must be zero'd. David's loader has an auto-sizing .bss zero algorithm.
Well, it doesn't really auto-size. It just uses a header that gives the start address and size of the .bss section. It also knows how to copy the .data section from flash to RAM before starting the program if .text is stored in flash.
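In outline, the start-up work being discussed looks like this in C. The symbol names below follow the usual linker-script conventions; they are not necessarily the ones zpu-gcc's crt0.S or David's loader actually use:

#include <stdint.h>
#include <string.h>

extern uint8_t __bss_start[], __bss_end[];      /* .bss bounds            */
extern uint8_t __data_load[];                   /* .data image in flash   */
extern uint8_t __data_start[], __data_end[];    /* .data run-time bounds  */

extern int main(void);

void _start_c(void)
{
    /* .bss must be zeroed.  HUB RAM evidently comes up zeroed, which is
       presumably why the malloc failures only showed up in external RAM. */
    memset(__bss_start, 0, (size_t)(__bss_end - __bss_start));

    /* If .text/.data are stored in flash, initialised data has to be
       copied to its run-time address before main() runs. */
    memcpy(__data_start, __data_load, (size_t)(__data_end - __data_start));

    main();
}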
One big goal for a unified tree would be to change the loader so that new platforms don't have to be added (David and I have to work this out - he knows I prefer a platform-agnostic approach). In some ways it's easier for the end user now, but why should we burden the developer with adding a new board type? Once this item is resolved, we'll have an easy front end for other GCC tool chains.
This will be somewhat difficult because of the different drivers used to interface to external memory. At a minimum the loader needs to know how to set up the UART for the correct baud rate, without necessarily knowing the clock rate that the board is using, and also which external memory driver to load. Unless you want these drivers to be loaded from separate files, all possible external memory drivers will have to be included in the loader program, which means you'll have to type some command line option to select one. It would be nice if we could standardize on a single interface to external memory. I would suggest that we consider using jazzed's JCACHE interface. That is what I used for the C3 and what he used for his Propeller Platform SDRAM board. Any VMCOG drivers would have to be converted to the JCACHE interface.
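Purely for illustration, the kind of single interface being talked about could be as small as a shared mailbox per driver COG. None of these names come from the real JCACHE or VMCOG code; this is just a sketch of the idea:

#include <stdint.h>

typedef struct {
    volatile uint32_t cmd;     /* 0 = idle, 1 = read, 2 = write          */
    volatile uint32_t addr;    /* external memory address                */
    volatile uint32_t data;    /* data in (write) or out (read)          */
} mem_mailbox_t;

enum { MEM_IDLE = 0, MEM_READ = 1, MEM_WRITE = 2 };

static uint32_t mem_read(mem_mailbox_t *mb, uint32_t addr)
{
    mb->addr = addr;
    mb->cmd  = MEM_READ;
    while (mb->cmd != MEM_IDLE)    /* driver COG clears cmd when done    */
        ;
    return mb->data;
}

static void mem_write(mem_mailbox_t *mb, uint32_t addr, uint32_t value)
{
    mb->addr = addr;
    mb->data = value;
    mb->cmd  = MEM_WRITE;
    while (mb->cmd != MEM_IDLE)
        ;
}

With a single mailbox layout like that, the loader would only need to know which driver image to start, not how each driver talks to its particular memory.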
I may have said this before but have a feeling that this thread is getting a lot of hits from those searching out a particular conspiracy theory that goes by the same name as my virtual machine.
Maybe the *REAL* conspiracy is that knowledge strengthened your resolve to use the ZOG acronym
[..]Still if we get even one person off the conspiracy theory trail and into the Propeller/MCU hobby as a result then it's worth while[..]
That would be me. I've spent the last week or so reading every post in this thread. Interesting stuff. I'm in the process of acquiring some Propeller hardware; then we'll see.
-Tor
Thanks,
David
Yes I will, but this week I'm far from my home machines and Zog.
Actually I've been visiting IBM's headquarters in Dublin. Wow, that place is huge.
I'll be back to Helsinki and Zog on Sunday.