Cluso99 said...
ZOG does not have a ring to it. BTW neither does ZPUCog. And I agree it cannot be ZyCog either. :-(
How about Z_Cog, SoftCog (since it is sort of a soft cog), ZPCog. Nothing sounds as good as ZiCog & MoCog.
CogZ or Cog-Z?
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Cluso99 said...
ZOG does not have a ring to it. BTW neither does ZPUCog. And I agree it cannot be ZyCog either. :-(
How about Z_Cog, SoftCog (since it is sort of a soft cog), ZPCog. Nothing sounds as good as ZiCog & MoCog.
Bill: A Spin version of ZPU is definitely in the works but first I'll get the C version running so that we know we understand the ops correctly. Having a debugger would be excellent.
Lonesock & Bill : CogZ it is then. I'll update the first post and thread title.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
For LMM programs CogZ won't even be in the race, since a LONG fetch from hub is the same as a BYTE fetch from hub - and once fetched, these compilers execute instructions at PASM speed. But CogZ may be competitive for XMM programs, where the benefit of fetching a byte vs a long can be significant. It will depend a bit on the hardware.
However, I think you may have found a good solution for "big SPIN"
I'll watch this thread with interest.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
RossH: I would not dream of challenging Catalina or ImageCraft to a drag race "along the straight" in LMM.
In that mode its only benefit might be its smaller code size, as all ops are only a byte wide. So let's see: could someone compile that FIBO I posted with Catalina and ImageCraft so we can see how the code sizes compare?
The ZPU will fare better in a speed comparison with XMM. But I fear even there it will trail in last place, because it is a stack-based machine with no registers, so there is an awful lot of memory access going on.
Here is an example of a simple C statement and the ZPU asm GCC generates. Annotations by me:
int a, b, c;
int main(int argc, char* argv[]) {
a = b + c;
return (a);
}
Generates:
im b ; Push the address of "b" to stack
load ; Push b to stack top using address popped from stack
im c ; Push the address of "c" to stack
load ; Push c to stack using address popped
add ; Pop and add the top two stack items and push result to stack
im a ; Push the address of "a" to stack
store ; Pop "a" address then "a" and store "a" to the address
Now: An "IM b" can be done in one byte opcode read + 4 byte writes for the PUSH to stack. Total 5 memory accesses.
A LOAD or STORE is one byte read for the op + two times PUSH/POP at 4 bytes each + a memory read of 4 bytes. Total 13 accesses.
The ADD is one byte for the op + two POPS and a PUSH for a total of 13 accesses.
I make that a grand total of 67 byte accesses to external RAM to add two 32 bit numbers !!!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
A very good question, Mike, and one I have already pondered.
I suspect it would be quite possible to keep the stack in HUB. That would save us all 48 PUSH/POP byte accesses out of the 67 in the above example. More than halving the external memory thrashing!!
In the general case, of course, it is quite possible for ZPU code to LOAD and STORE to addresses within the stack, in which case a stack in HUB would fail unless we checked the LOAD/STORE addresses every time and sorted things out accordingly.
I suspect that normal everyday GCC-generated code will never do that and we could get away with it. In which case we start closing in on Catalina and ImageCraft performance for XMM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Mike is right - your stack should always be in hub RAM. Catalina's LARGE memory mode has a flat address space, but the first $8000 addresses are the normal hub RAM addresses and the stack is always allocated in this address range.
Time/space tradeoffs always fascinate me. Your addresses will presumably be 32 bits, so the code will not be 1/4 the size of Catalina's - but it could be 1/2 the size. Whether it is worth it depends on how much speed you lose. I'll be interested to see the results.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
Mike, I posted as you edited your post. Yes, I think that's what I meant by "checking" and "sorting" the LOAD/STORE access addresses. That checking takes a bit of PASM, but then the gains are still worth it.
I reckon the first ZPU implementation will be in SPIN with no XMM, so the stack will already be in the right place, in HUB. Then we can experiment with moving the code, and then the data, to XMM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
heater: since the ZPU is only tiny, I wonder if there would be space in the cog for the stack? I know it is a tall ask, but it could speed things significantly if it could fit.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Blimey, the first version of CogZ does not exist yet and we already have a dozen optimizations on the table :)
The Stack in Cog may also be possible. Whilst the base ZPU is tiny it's probably worth more to fill up the COG with PASM implementations of the ZPU ops that would otherwise be emulated in ZPU ASM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Having the call stack in cog appealed to me when I was writing Catalina - but it REALLY complicates things. For instance, you can probably only fit the call stack, not the stack frame (which contains all your local variables and function arguments) - otherwise it is quite possible to use all your cog stack space on one function call. But this means you actually have TWO stacks - a call stack in cog ram, and a frame stack in hub ram. The added complexity just didn't seem worth the potential savings to me.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
I really don't want to get into the complication of separate call and data stacks. I'd rather make the restriction that if you want fast code then you don't put lots of local data in your functions.
So what we need is a flat memory space presented to the ZPU via an interface that maps some of it to HUB and some of it to external RAM. So if you use a lot of stack you will find the code slowing down dramatically.
Problem: I've just been looking at the Java ZPU simulator and it implements a stack that grows upwards! This seems to complicate that idea somewhat.
RossH: "Your addresses will presumably be 32 bits, so the code will not be 1/4 the size of Catalina's"
ZPU has an interesting feature. In the code snippet I posted above you will see things like "im b" where it is loading the address of "b" from immediate data. Thing is that "im" is a one byte instruction, the top bit "1" indicates "im" and the other seven bits are the immediate data. So small addresses can be loaded with a single one byte "im". For larger addresses just chain "im"s together and it will load another 7 bits and another 7 bits etc. In this way for small data sets the code can be quite small.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Can you compile the following:
int go_deep(int n) {
int b;
b += go_deep(n-1);
return &b;
}
It is a nonsense function, but it should stop the optimizer from mucking too much with the code.
I am interested in how it handles references to addresses of local variables in C functions. Reason: It might be difficult to put the stack into the hub, depending on how it handles it.
Bill, that turned out to be a deep question :)
This is long so please persevere.
If you compile a single module with -S to get just an assembler listing, you get this:
go_deep:
im _memreg+12
load
pushsp
im _memreg+12
store
im -2
pushspadd
popsp
im _memreg+12
load
im 8
add
load
im -1
add
loadsp 0
storesp 8
storesp 8
impcrel (go_deep)
callpcrel
im _memreg+0
load
im _memreg+12
load
im -4
add
load
addsp 4
im _memreg+12
load
im -4
add
store
im _memreg+12
load
im -4
add
loadsp 0
im _memreg+0
store
storesp 4
storesp 8
im 4
pushspadd
popsp
im _memreg+12
store
poppc
.size go_deep, .-go_deep
Problem is it contains things like "_memreg", "impcrel" and "callpcrel", which are not ZPU ops and don't mean anything to me yet. Furthermore, there is not a one-to-one correspondence between the asm instructions there and the actual op bytes in the finished program.
For example disassembling the compiled object with objdump gives this:
Whooa! Why is that so big? And what are all those breakpoints doing in there?
Turns out those breakpoints are holding places where the linker is going to put IM ops when it links the entire program.
And the linker can get rid of many of them.
So we move on and link the compiled program. The linker fills in the breakpoint slots with IM ops and if we use the -relax option it removes any redundant ones.
Output from objdump on linked program no optimization:
00000564 <go_deep>:
564: 8c im 12
565: 08 load
566: 02 pushsp
567: 8c im 12
568: 0c store
569: fe im -2
56a: 3d pushspadd
56b: 0d popsp
56c: 8c im 12
56d: 08 load
56e: 88 im 8
56f: 05 add
570: 08 load
571: ff im -1
572: 05 add
573: 70 loadsp 0
574: 52 storesp 8
575: 52 storesp 8
576: ed im -19
577: 3f callpcrel
578: 80 im 0
579: 08 load
57a: 8c im 12
57b: 08 load
57c: fc im -4
57d: 05 add
57e: 08 load
57f: 11 addsp 4
580: 8c im 12
581: 08 load
582: fc im -4
583: 05 add
584: 0c store
585: 8c im 12
586: 08 load
587: fc im -4
588: 05 add
589: 70 loadsp 0
58a: 80 im 0
58b: 0c store
58c: 51 storesp 4
58d: 52 storesp 8
58e: 84 im 4
58f: 3d pushspadd
590: 0d popsp
591: 8c im 12
592: 0c store
593: 04 poppc
Hooraay! That looks more like what we want. Moving on again, we optimize for space with -Os.
Output from objdump on linked program optimized for size:
00000564 <go_deep>:
564: ff im -1
565: 3d pushspadd
566: 0d popsp
567: ff im -1
568: 14 addsp 16
569: 51 storesp 4
56a: f9 im -7
56b: 3f callpcrel
56c: 82 im 2
56d: 3d pushspadd
56e: 80 im 0
56f: 0c store
570: 83 im 3
571: 3d pushspadd
572: 0d popsp
573: 04 poppc
Job done.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
By the way, if you compile and link the program normally (no -relax option for the linker), the linker replaces those breakpoints with as many IM ops as it needs to get the immediate data in, then replaces any spare following breakpoints with NOPs. So the code stays huge.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
pushspadd pretty much has to be implemented in pasm, it will help a LOT.
I can see that long reads/writes are going to be very common, so I will implement separate messages for BYTE/WORD/LONG, which will remove a bunch of hub accesses to 'vmbytes' - however I will do this after the first version works.
The Spin API won't have to change.
I think your Spin implementation of Cogz will be a great test for my Spin API - and VMCOG :)
For Cogz I think a delayed write page replacement policy will be far better than write-through.
What I've done is to create a C version of the ZPU simulator by cutting all the emulator methods out of the Zylin ZPU simulator, which is in Java, and throwing away any extraneous junk. This way we get a good handle on the required implementation without having to worry about correctly interpreting the documentation. For example, the docs don't talk about which way the stack grows, or which endianness the words and longs are, etc.
Next step will be to get some ZPU code running under it. Just now it has executed its first instructions, NOP and BREAKPOINT.
The current challenge is figuring out how to get GCC to compile stuff nicely and then how to extract a runnable binary image from the resulting executable. None of that process is documented very well either.
If I can get that all working then it's time to simplify the C simulator and make it look more like Spin so I can get on with a Spin version. Then the PASM version.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
I've gone ahead and re-arranged VMCOG for separate READVMB/W/L and WRITEVMB/W/L messages, updated the Spin API wrappers etc.
My plan for testing is:
- READB/WRITEB
Then once those work perfectly
- ALIGNED READW/L and WRITEW/L, cause a BUSERR to happen (so we can find software that causes unaligned access); I may add a STATUS long to the mailbox
Once those work perfectly
allow unaligned word/long access, and throw in a bit of optimization so that if the word/long is aligned, only check for presence in working set once.
heater said...
So far...
What I've done is to create a C version of the ZPU simulator by cutting all the emulator methods out of the Zylin ZPU simulator, which is in Java, and throwing away any extraneous junk. This way we get a good handle on the required implementation without having to worry about correctly interpreting the documentation. For example, the docs don't talk about which way the stack grows, or which endianness the words and longs are, etc.
Next step will be to get some ZPU code running under it. Just now it has executed its first instructions, NOP and BREAKPOINT.
The current challenge is figuring out how to get GCC to compile stuff nicely and then how to extract a runnable binary image from the resulting executable. None of that process is documented very well either.
If I can get that all working then it's time to simplify the C simulator and make it look more like Spin so I can get on with a Spin version. Then the PASM version.
RossH said...
Having the call stack in cog appealed to me when I was writing Catalina - but it REALLY complicates things. For instance, you can probably only fit the call stack, not the stack frame (which contains all your local variables and function arguments) - otherwise it is quite possible to use all your cog stack space on one function call. But this means you actually have TWO stacks - a call stack in cog ram, and a frame stack in hub ram. The added complexity just didn't seem worth the potential savings to me.
Ross.
Don't want to hijack the thread, which BTW, is pretty awesome in some sense - but where do you guys find the time?!!!
In any case, Ross, ICC in fact uses the COG stack for function calls and locals that can fit into registers. ICC has a sophisticated register allocator, so stack pressure is not the bottleneck. This significantly increases the performance of ICC-generated programs.
The lack of total memory space is of course the key limiting factor.
// richard
Richard: "Where do we find the time?" Recently I don't, but the boss did not need me today and the wife was out with friends, so it's been an intense day tackling the problem at hand. That might be it for a while now though.
This whole "using GCC on the Prop" idea is awesome in the sense that anyone would be crazy enough to attempt it :)
I just can't escape the idea that a VM that works with byte codes must be a good fit for the existing external memory solutions that Cluso, Bill and others have worked so hard on perfecting. I really don't have the skills to take on any serious compiler development so making use of the ZPU architecture and using a ready made GCC is the best I can manage.
ImageCraft and Catalina can rest assured that whatever comes out of this is never going to challenge them on the performance stakes. Although it looks like, with the help of Bill's memory caching setup, it might not be totally shamed.
By the way, it seems to me one way to get speed out of CogZ would be to have only code in external memory. Put the stack and data in HUB and use Bill's cache on the code. As the code is read-only, the cache never has to write anything back, and then we are flying!
Actually I think that is the memory model I would like to implement first.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Interesting - with no optimizer, Catalina generates a constant 26 instructions for 'go_deep' (i.e. 108 bytes). I don't know about ICC, but I would expect their code generator to be slightly more efficient (and hence generate slightly smaller code).
If I understand the architecture correctly, what you're saying is that the ZPU compiler may generate anywhere from 16 to 87 instructions for 'go_deep' (i.e. 16 to 87 bytes) - depending on how many IM's it takes to represent the addresses at link time. That's pretty cool - I'll reduce my code size estimates from 1/2 to 1/3 the size of C compiled PASM for real-world programs - i.e. a mix of many 1*IM cases, some 2*IM cases, and the occasional 3*IM and 4*IM case (for globals).
But I'm also going to increase my estimates for execution time - excluding the initial instruction fetch itself (which takes the same time for bytes as for longs), Zog has the added complexity of handling the multiple IM cases (which will take multiple instruction fetch cycles, unless you can somehow arrange to pre-fetch longs into a cache and then decode that as bytes - ugh!). Then there is the fact that because you have no registers, all operations (like add) take place on the stack - so most of your Zog instructions need to access hub ram multiple times during execution. Excluding instruction fetches, with local registers you can 'go_deep' with only about 5 or 6 hub operations - but Zog code will require closer to 20. I think you'll be doing well to get within 4 times the speed of C compiled direct to PASM.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
heater: Once it is running (even in spin) we can see where the time is spent and speed it up from there. Better optimisations can be done at the backend even if it takes longer.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
RossH, You understand correctly. But there are two optimizations going on here that can be applied independently.
One is the C compiler itself performing normal optimizations to remove redundant instructions etc. This shows up in the compiled object files. It does not change the number of bytes used for the IM instructions. They are all 5 bytes long as that is how many you need to fill a 32 bit value seven bits at a time.
The IM ops are optimized, shrunk, at link time when the "relax" option is given to the linker.
Let's look at the first instruction of go_deep. As assembler output from the compiler it looks like:
go_deep:
im _memreg+12
Where the IM instruction is a pseudo op representing five IMs.
If we compile to an object file and disassemble it with objdump we have:
Here the linker has plugged in the required immediate value. As it fits within 7 bits, only one IM is required; the rest of the space is filled with NOPs.
Now we do the link again with the magic "relax" option:
00000564 <go_deep>:
564: 8c im 12
Here all the redundant NOPs have been removed.
Reducing the complete go_deep to 16 instructions requires use of the "relax" linker option and the -Os optimization on the compiler.
You are right about the overheads of being a stack based machine with no registers.
You are not quite right about "the initial instruction fetch itself (which takes the same time for bytes as for longs)" at least not when executing code from external memory.
This is where the ZPU has a chance to shine. For external memory, byte-wide instructions are a good fit, so let's put the code out in external memory, even in a serial SPI FLASH, and keep the stack and data in the HUB.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Yes, I understand this particular 'go_deep' function will only ever need 16 bytes - I was generalizing to a more typical case where a function may refer to globals (which may always require 32 bits), and/or have many local variables (some of which may be structs or arrays occupying many bytes). In such cases, the number of times the IM's will occupy only one byte would be substantially reduced, and your code size correspondingly increased.
I agree about XMM - I was only considering the LMM case. In XMM code you will have a definite "fetch" advantage - say 3 to one - that will help make up for the inefficiencies of having no local registers. Whether you will end up being comparable with C compiled to PASM will depend on the efficiency of the XMM hardware.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
Toby, I'm sure 16-bit-wide access would help for lots of things, even the ZPU. No reason it is carved in stone: if you have a DracBlade-style external memory with a 16-bit-or-more address bus and latches, why not multiplex it with 16-bit data access?
Anyway, the game here is to see if the ZPU is a good match for 8-bit data access on all the existing external memory solutions, or even for serial access devices like SPI FLASH. Then we can have big code and save some valuable pins, hopefully with performance that is still useful.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
If there needs to be a single uniform address space, how about placing Hub RAM at one end of the address space and pointing the stack pointer there.
Post Edited (Mike Green) : 2/6/2010 6:29:19 AM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I suspect it would be quite possible to keep the stack in HUB. That would save us 36 external memory accesses in the above example. Halving the external memory thrashing!!
In the general case of course it is quite possible for ZPU to do LOAD and STORE to somewhere within the stack in which case a stack in HUB would fail unless we did some checking of the LOAD/STORE addresses every time and sorted things out accordingly.
I suspect that normal everyday GCC generated code will never do that and we could get away with it. In which case we start closing in on Catalina and Image Craft performance for XMM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Mike is right - your stack should always be in hub RAM. Catalina's LARGE memory mode has a flat address space, but the first $8000 addresses are the normal hub RAM addresses and the stack is always allocated in this address range.
Time/space tradeoffs always fascinate me. Your addresses will presumably be 32 bits, so the code will not be 1/4 the size of Catalina's - but it could be 1/2 the size. Whether it is worth it depends on how much speed you lose. I'll be interested to see the results.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
I reckon on the first ZPU implementation to be in SPIN and no XMM so the stack will already be in the right place, in HUB, then we can experiment with moving the code and then data to XMM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Post Edited (heater) : 2/6/2010 9:46:21 AM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
· Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)·
· Prop OS: SphinxOS·, PropDos , PropCmd··· Search the Propeller forums·(uses advanced Google search)
My cruising website is: ·www.bluemagic.biz·· MultiBlade Props: www.cluso.bluemagic.biz
The Stack in Cog may also be possible. Whilst the base ZPU is tiny it's probably worth more to fill up the COG with PASM implementations of the ZPU ops that would otherwise be emulated in ZPU ASM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Having the call stack in cog appealed to me when I was writing Catalina - but it REALLY complicates things. For instance, you can probably only fit the call stack, not the stack frame (which contains all your local variables and function arguments) - otherwise it is quite possible to use all your cog stack space on one function call. But this means you actually have TWO stacks - a call stack in cog ram, and a frame stack in hub ram. The added complexity just didn't seem worth the potential savings it to me.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
So what we need is a flat memory space presented to ZPU via an interface that maps some of it to to HUB and some of it to external RAM. So if you use a lot of stack you will find the code slowing down dramatically.
Problem: I've just been looking at the Java ZPU simulator and it implements a stack that grows upwards!. This seems to complicate that idea somewhat.
RossH: "Your addresses will presumably be 32 bits, so the code will not be 1/4 the size of Catalina's"
ZPU has an interesting feature. In the code snippet I posted above you will see things like "im b" where it is loading the address of "b" from immediate data. Thing is that "im" is a one byte instruction, the top bit "1" indicates "im" and the other seven bits are the immediate data. So small addresses can be loaded with a single one byte "im". For larger addresses just chain "im"s together and it will load another 7 bits and another 7 bits etc. In this way for small data sets the code can be quite small.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
humanoido
*Stamp SEED Supercomputer *Basic Stamp Supercomputer *TriCore Stamp Supercomputer
*Minuscule Stamp Supercomputer *Three Dimensional Computer *Penguin with 12 Brains
*Penguin Tech *StampOne News! *Penguin Robot Society
*Handbook of BASIC Stamp Supercomputing
*Ultimate List Propeller Languages
*MC Prop Computer
Can you compile the following:
int go deep(int n) {
int b;
b += go_deep(n-1);
return &b;
}
It is a nonsense function, but it should stop the optimizer from mucking too much with the code.
I am interested in how it handles references to addresses of local variables in C functions. Reason: It might be difficult to put the stack into the hub, depending on how it handles it.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
This is long so please persevere.
You if you compile a single module with -S to just get an assembler listing you get this:
Problem is it contains things like "_memreg", "impcrel" and "callpcrel" which are not ZPU ops and don't mean anything to me yet. Further more there is not a one to one correspondence between asm instructions there and actual op bytes in the finished program.
For example disassembling the compiled object with objdump gives this:
Whooa ! Why is that so big? And what are all those break points doing in there ?
Turns out those break points are holding places where the linker is going to put IM ops when it links the entire program.
And the linker can get rid of many of them.
So we move on and link the compiled program. The linker fills in the breakpoint slots with IM ops and if we use the -relax option it removes any redundant ones.
Hooraay ! That looks more like what we want. Moving on again we optimize for space with -Os
Job done.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
pushspadd pretty much has to be implemented in pasm, it will help a LOT.
I can see that long reads/writes are going to be very common, so I will implement separate messages for BYTE/WORD/LONG, which will remove a bunch of hub accesses to 'vmbytes' - however I will do this after the first version works.
The Spin API won't have to change.
I think your Spin implementation of Cogz will be a great test for my Spin API - and VMCOG [noparse]:)[/noparse]
For Cogz I think a delayed write page replacement policy will be far better than write-through.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
What I've done is to create a C version of the ZPU simulator by cutting all the emulator methods out of the ZyLin ZPU simulator which is in Java. Throwing away any extraneous junk. This way we get a good handle on the required implementation with out having to worry about correctly interpreting the documentation. For example the docs don't talk about which way the stack grows or which endian the words and longs are etc etc.
Next step will be to get some ZPU code running under it. Just now it has executed it's first instructions, NOP and BREAKPOINT.
The current challenge is figuring out how to get the GCC to compile stuff nicely and then how to extract a runnable binary image from the resulting executable. None of that process is documented very well either.
If I can get that all working then it's time to simplify the C simulator and make it look more like Spin so I can get on with a Spin version. Then the PASM version.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
I've gone ahead and re-arranged VMCOG for separate READVMB/W/L and WRITEVMB/W/L messages, updated the Spin API wrappers etc.
My plan for testing is:
- READB/WRITEB
Then once those work perfectly
- ALIGNED READW/L and WRITEW/L, cause a BUSERR to happen (so we can find software that causes unaligned access); I may add a STATUS long to the mailbox
Once those work perfectly
allow unaligned word/long access, and throw in a bit of optimization so that if the word/long is aligned, only check for presence in working set once.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Don't want to hijack the thread - which, BTW, is pretty awesome in some sense - but where do you guys find the time?!
In any case, Ross, ICC in fact uses the COG stack for function calls and for locals that can fit into registers. ICC has a sophisticated register allocator, so stack pressure is not the bottleneck. This significantly increases the performance of ICC-generated programs.
The lack of total memory space is of course the key limiting factor.
// richard
This whole "using GCC on the Prop" idea is awesome in the sense that anyone would be crazy enough to attempt it :)
I just can't escape the idea that a VM that works with byte codes must be a good fit for the existing external memory solutions that Cluso, Bill and others have worked so hard on perfecting. I really don't have the skills to take on any serious compiler development, so making use of the ZPU architecture and its ready-made GCC is the best I can manage.
ImageCraft and Catalina can rest assured that whatever comes out of this is never going to challenge them in the performance stakes. Although it looks like, with the help of Bill's memory caching setup, it might not be totally shamed.
By the way, it seems to me one way to get speed out of CogZ would be to keep only code in external memory. Put the stack and data in HUB and use Bill's cache on the code. As the code is read-only, the cache never has to write anything back - and then we are flying!
Actually I think that is the memory model I would like to implement first.
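To make the "read-only code cache" idea concrete, here is a sketch of a direct-mapped cache fetch in C. The line size, line count, and the flat array standing in for external memory are illustrative assumptions, not Bill's actual VMCOG design - the point is simply that a code-only cache has no dirty lines and therefore no write-back path:

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32            /* hypothetical cache geometry */
#define NUM_LINES 16

static uint8_t  xmem[4096];     /* stand-in for external memory / SPI flash */
static uint8_t  line_data[NUM_LINES][LINE_SIZE];
static uint32_t line_tag[NUM_LINES];
static int      line_valid[NUM_LINES];
static int      misses = 0;

/* Fetch one code byte through a direct-mapped, read-only cache. */
uint8_t code_fetch(uint32_t pc)
{
    uint32_t tag = pc / LINE_SIZE;
    uint32_t idx = tag % NUM_LINES;
    if (!line_valid[idx] || line_tag[idx] != tag) {
        /* Miss: burst-read one whole line from external memory. Since
           code is never written, there is never a line to write back. */
        memcpy(line_data[idx], &xmem[tag * LINE_SIZE], LINE_SIZE);
        line_tag[idx]   = tag;
        line_valid[idx] = 1;
        misses++;
    }
    return line_data[idx][pc % LINE_SIZE];
}
```

Sequential byte fetches within a line are then pure hub-speed hits, which is exactly where a byte-coded VM should spend most of its time.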
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Interesting - with no optimizer, Catalina generates a constant 26 instructions for 'go_deep' (i.e. 108 bytes). I don't know about ICC, but I would expect their code generator to be slightly more efficient (and hence generate slightly smaller code).
If I understand the architecture correctly, what you're saying is that the ZPU compiler may generate anywhere from 16 to 87 instructions for 'go_deep' (i.e. 16 to 87 bytes) - depending on how many IMs it takes to represent the addresses at link time. That's pretty cool - I'll reduce my code size estimates from 1/2 to 1/3 the size of C compiled to PASM for real-world programs - i.e. a mix of many 1*IM cases, some 2*IM cases, and the occasional 3*IM and 4*IM case (for globals).
But I'm also going to increase my estimates for execution time - excluding the initial instruction fetch itself (which takes the same time for bytes as for longs), Zog has the added complexity of handling the multiple IM cases (which will take multiple instruction fetch cycles, unless you can somehow arrange to pre-fetch longs into a cache and then decode them as bytes - ugh!). Then there is the fact that because you have no registers, all operations (like add) take place on the stack - so most of your Zog instructions need to access hub RAM multiple times during execution. Excluding instruction fetches, with local registers you can 'go_deep' with only about 5 or 6 hub operations - but Zog code will require closer to 20. I think you'll be doing well to get even 4 times the speed of C compiled direct to PASM.
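The hub-access cost Ross describes is easy to see by instrumenting a single stack-machine op. This is an illustrative sketch only - the hub array, counters and downward-growing stack pointer are assumptions, not Zog's real data paths - but it shows why even an ADD costs three memory operations on a register-free machine:

```c
#include <stdint.h>

static uint32_t hub[256];   /* stand-in for hub RAM */
static uint32_t sp = 64;    /* ZPU-style stack grows downward in memory */
static int hub_ops = 0;     /* count of hub reads/writes */

static uint32_t rd(uint32_t a)             { hub_ops++; return hub[a]; }
static void     wr(uint32_t a, uint32_t v) { hub_ops++; hub[a] = v;    }

/* ZPU-style ADD: pop two operands, push the sum - three hub accesses,
   where a machine with local registers needs none for the add itself. */
void zpu_add(void)
{
    uint32_t a = rd(sp);        /* top of stack */
    uint32_t b = rd(sp + 1);    /* next on stack */
    sp += 1;
    wr(sp, a + b);              /* replace both with the result */
}
```

Multiply that by every operand of every expression and the 20-ish hub operations per 'go_deep' estimate looks about right.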
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
· Prop OS: SphinxOS, PropDos, PropCmd · Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz · MultiBlade Props: www.cluso.bluemagic.biz
One is the C compiler itself performing normal optimizations to remove redundant instructions, etc. This shows up in the compiled object files. It does not change the number of bytes used for the IM instructions - they are all 5 bytes long, as that is how many you need to build a 32-bit value seven bits at a time.
The IM ops are optimized (shrunk) at link time when the "relax" option is given to the linker.
Let's look at the first instruction of go_deep. As assembler output from the compiler it looks like:
Where the IM instruction is a pseudo-op representing five IMs.
If we compile to an object file and disassemble it with objdump we have:
Here the compiler has reserved space for enough IMs to allow for a 32-bit immediate load.
Now we link the object into a complete executable:
Here the linker has plugged in the required immediate value. As it fits in 7 bits, only one IM is required; the rest of the space is filled with no-operations.
Now we do the link again with the magic "relax" option:
Here all the redundant NOPs have been removed.
Reducing the complete go_deep to 16 instructions requires use of the "relax" linker option and the -Os optimization on the compiler.
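The size rule the relax pass applies - 7 payload bits per IM, with the first IM sign-extended - can be captured in a few lines of C. This is my own illustration of the rule described above, not code from the GCC/binutils port:

```c
#include <stdint.h>

/* Number of IM instructions (= bytes) needed to load a signed 32-bit
   value: a value fits in n IMs when it is representable in 7*n bits as
   a signed quantity, so a full 32-bit value needs ceil(32/7) = 5 IMs. */
int im_count(int32_t v)
{
    int n = 1;
    while ((int64_t)v <  -((int64_t)1 << (7 * n - 1)) ||
           (int64_t)v >=  ((int64_t)1 << (7 * n - 1)))
        n++;
    return n;
}
```

So small constants and nearby branch targets cost a single byte, while only worst-case globals pay the full five - which is where the 16-vs-87 instruction spread for go_deep comes from.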
You are right about the overheads of being a stack based machine with no registers.
You are not quite right about "the initial instruction fetch itself (which takes the same time for bytes as for longs)" - at least not when executing code from external memory.
This is where the ZPU has a chance to shine. For external memory having byte wide instructions is a good fit so let's put the code out in external memory, even in a serial SPI FLASH. Let's keep the stack and data in the HUB.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Yes, I understand this particular 'go_deep' function will only ever need 16 bytes - I was generalizing to a more typical case where a function may refer to globals (which may always require 32 bits), and/or have many local variables (some of which may be structs or arrays occupying many bytes). In such cases, the number of times the IMs occupy only one byte would be substantially reduced, and your code size correspondingly increased.
I agree about XMM - I was only considering the LMM case. In XMM code you will have a definite "fetch" advantage - say 3 to one - that will help make up for the inefficiencies of having no local registers. Whether you will end up being comparable with C compiled to PASM will depend on the efficiency of the XMM hardware.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
Would 16 bit mem access help, or is 8 bit carved in stone?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Style and grace : Nil point
Anyway the game here is to see if ZPU is a good match for 8 bit data access on all the existing external memory solutions or even for serial access devices like SPI FLASH. Then we can have big code and save some valuable pins hopefully with a performance that is still useful.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.