If zpu can speak little-endian, it would solve many, many issues that slow down zog.
David Betz did a wonderful job creating a better infrastructure for external memory users like C3 and SDRAM.
His loader performance is top notch and will probably be leveraged in the new GCC tool-chain.
If zpu can speak little-endian, it would solve many, many issues that slow down zog.
You might have to elaborate. Of all the PASM instructions the interpreter has to go through to execute a ZPU instruction and access code/data in HUB and ext memory the overhead of the endian fixes has never looked very big to me.
When working with LONGS everything proceeds at full speed with no byte juggling to get the endianness right. That's why I reverse the byte order of every long in the binary prior to loading it to the Prop. For example I use Lonesock's F32 float object with no endianness fiddling.
When working with bytes and words there is only a couple of extra instructions involved to get the endianness right.
So where is endianness causing these many issues? Am I missing a point somewhere?
Apparently the endian-ness problem makes it impossible to use HUB RAM for stack and local data while using external memory for code/globals. David may be able to share more info on this. Having stack/locals in HUB RAM would speed up zog by about a factor of 4 and make it somewhat competitive with Catalina.
Having stack/data in HUB and only code/constants in EXT RAM was always one of my primary motivations for wanting to build ZOG. The idea being that bytecodes fetched from ext RAM/ROM would use less memory bandwidth than having to use 32 bit instructions as you do with XMM.
Now, quite why that does not work out is something I really have to find some time to investigate.
... idea being that bytecodes fetched from ext RAM/ROM would use less memory bandwidth than having to use 32 bit instructions as you do with XMM.
That reminds me. I should send you a SpinSocket-Flash kit. SpinSocket-Flash is a module that has a Propeller and 4MB byte-wide flash on a DIP32 footprint. Along with the SpinSocket Platform board, it makes a development environment that fits your idea very nicely.
Two power options are available. Would you like one with low power battery support or higher power 3.3V regulation? The SpinSocket Platform board will support either type and has a LiPo charger.
I'm running David's XBASIC on SpinSocket-Flash now. I'm also running it on GameBaby. The program is stored in flash and David's XBASIC PASM VM does all the work. Zog and Catalina are both hard pressed to interpret XBASIC byte codes fast enough, so he wrote a separate VM. It's a nice language as BASICs go ... like VB6 in some ways.
End of shameless plug and offer of a free sample .... Is the post still on strike over there?
OK Jazzed, I'm up for it. Battery power sounds good. You still have my address? Hopefully I get a bit of a long holiday during June/July when most of Finland shuts down for the summer. The post guys are working well but now Iceland is throwing a lot of s*** into the flight path again:)
Odd thing is that Zog runs fine with code, data and stack in HUB so I'm not sure where it goes wrong moving code/const to ext space. I might go back to my basic Zog version (2.6 or so) and see what I can see with David's memory map and modified linker script ideas. The linker scripts were always something that thwarted me.
Actually, ZOG runs okay with the stack in hub memory. The problem is that loading COGs with coginit doesn't work correctly anymore. I think it's because only long accesses to hub memory work correctly. Byte accesses are in the wrong order.
David,
Great, you may have saved me from a wild goose chase.
COGINIT from C used to work for me. That's how I start and use FullDuplexSerial and F32 for example. Time to start digging again.
Can you tell me the current memory map you are using, HUB, COG, EXT RAM addresses as seen by C? So that I can apply that to my basic ZOG set up?
OK Jazzed, I'm up for it. Battery power sounds good. You still have my address? Hopefully I get a bit of a long holiday during June/July when most of Finland shuts down for the summer.
I have your address assuming you haven't moved .... I believe you have my email address in case.
June/July? Short time-frame. I guess that's a great time to have vacation so far north. I'll try to get something off to you within a week or so. I have another care package to send out so one stop at the post will be nice.
I fully intend to get an SDRAM and a SpinSocket-Flash driver working on Catalina by summer's end.
Want to race? Just kidding. I'm very busy with GameBaby right now.
I still cannot get the endianness issue. I am compiling this program:
volatile unsigned long xxxx = 0x12345678;

int main(void); // declared here because _premain calls it before its definition

void _premain(void)
{
    main();
}

int fibo(unsigned int n)
{
    if (n < 2) return n;
    return fibo(n-2) + fibo(n-1);
}

int main(void)
{
    (void)xxxx; // do not optimize it out
    fibo(12);
    return 0;
}
I see the bytes in the binary file in the same order as they are in the disassembler dump. Analyzing the neqbranch and callpcrel instructions also does not show anything unusual, but the long in the data section (0x12345678) IS big endian.
I can't see how this matches what the docs say:
The instructions are stored big endian. That is the first instruction is stored in the most significant byte, and the fourth is in the least significant byte.
The Zog VM has many XORs to handle the endianness problems. If it was just built to be little endian, it would run faster on Propeller just because the XORs can be removed. Unfortunately zpu tools won't make a little endian image. I'd like to see a GCC port that emits PASM with macros to handle jumps and data manipulation, but that may only happen with Propeller 2 stuff ... hard to tell just now, so if you want Propeller 1 GCC, Zog may be the only choice and it would be best to optimize it while there is time.
After almost a whole day of searching, editing and compiling (mostly searching), I think I have made a version of GCC and binutils that produces something looking like a binary for little-endian ZPU!
I am going to do some tests and grab some beer (is that the correct order? maybe beer first?), and will post more info here.
To be honest, this endianness issue gives me a headache. If I think about it long enough the thing starts flipping from one end to the other and back again. Especially when interfacing ZPU bigendian code with Prop littleendian code. Or is that the other way around :) It's like looking at one of those optical illusions where, as you stare at it, the image flips from being one thing to being some other thing and I end up being hopelessly confused.
Now, what you have compiled and disassembled there looks perfectly normal and familiar. So let's concentrate on the first endian issue:
In that code you have defined an initialised integer xxxx = 0x12345678. As you see in the disassembly it is stored in the image, and hence memory, as the sequence of bytes 12, 34, 56, 78 going up memory. That is most significant byte first or bigendian.
Conversely the Prop is littleendian as can be seen by making a similar initialized long in PASM:
xxxx long $12345678
Results in this in the BST listing output:
0018(0000) 78 56 34 12 | xxxx long $12345678
So we have a problem. Let's say we had in C:
volatile unsigned long xxxx = 0xFF000000;
volatile unsigned long yyyy = 0x01000000;
result = xxxx + yyyy;
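We want the result to be 0x00000000 (the sum wraps around in 32 bits), but the Prop, operating on these as RDLONG, RDLONG, ADD, WRLONG, reads them as $000000FF and $00000001 and writes back $00000100, which the C code then sees as 0x00010000. (Is that right?)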
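To check that arithmetic, here is a small host-side sketch in C. It is not ZOG code; the helper names are made up for illustration. It stores the two values the way the big-endian ZPU image has them, adds them with little-endian loads and stores (standing in for RDLONG, ADD, WRLONG), and then reads the result back as big-endian.

```c
#include <stdint.h>

/* Little-endian load/store: what the Prop does with RDLONG/WRLONG. */
static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void store_le32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Big-endian load/store: how the ZPU image lays out a long. */
static uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void store_be32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Add two longs that sit in memory big-endian, using little-endian
   loads and stores, and return the result as the ZPU side reads it. */
static uint32_t naive_add_as_seen_by_zpu(uint32_t x, uint32_t y) {
    uint8_t xm[4], ym[4], rm[4];
    store_be32(xm, x);
    store_be32(ym, y);
    store_le32(rm, load_le32(xm) + load_le32(ym));
    return load_be32(rm);
}
```

Calling naive_add_as_seen_by_zpu(0xFF000000, 0x01000000) gives 0x00010000 rather than the 0x00000000 that a correct 32-bit add would produce.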
How to fix this?
a) Arrange that whenever the emulator reads a LONG all the bytes within the long are reversed in order prior to use. When writing results all the bytes are reversed again.
This is clearly going to slow the emulation down a lot. The ZPU spends most of its time dealing with LONGS on the stack. Not Good.
b) Let's reverse the byte order of every four bytes of the binary executable. Either as we load it to memory or as a last step in the build process.
That's nice, now all our initialised data and constants are the right way round. ZOG can operate on LONGS all day, at speed, without error. C under ZOG and other SPIN/PASM code can exchange LONGS without any worries about byte order.
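Option b) could be done on the host as a last build step with something like the following. This is only a sketch: the function name is hypothetical, and it assumes the image length is a multiple of four (any trailing bytes are left untouched).

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order inside every 4-byte group of the image.
   Assumes len is a multiple of 4; a trailing partial group is skipped. */
static void swap_longs_in_place(uint8_t *image, size_t len) {
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint8_t b0 = image[i];
        uint8_t b1 = image[i + 1];
        image[i]     = image[i + 3];
        image[i + 1] = image[i + 2];
        image[i + 2] = b1;
        image[i + 3] = b0;
    }
}
```

After this pass the bytes 12 34 56 78 become 78 56 34 12, which is the layout the Prop expects for the long $12345678.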
Ah but. Now we have all our bytecodes in the wrong order!
No worries, for the cost of a single XOR when reading a byte code we can be sure to always pick up the right one (xor memp, #%11). Why? Inverting bit 1 steers the access one WORD up or down from the given code address; inverting bit 0 steers the access up or down one byte from the given code address, resulting in the correct actual address in Propeller space.
Similarly reads and writes to BYTES are fixed with xor %11.
Similarly accesses to WORDS are fixed with xor %10, which steers to the correct WORD in Propeller memory space.
There we are, job done!
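The address fix-ups can be modelled on the host like this. It is a sketch, not the actual PASM: mem stands for memory holding an image whose longs have already been byte-reversed, and the function names are invented for the example.

```c
#include <stdint.h>

/* Read the byte that the ZPU expects at zpu_addr from a long-swapped image.
   Flipping the low two address bits undoes the reversal within each long. */
static uint8_t read_byte_swapped(const uint8_t *mem, uint32_t zpu_addr) {
    return mem[zpu_addr ^ 3u];
}

/* Read a 16-bit word. zpu_addr is assumed to be even. Flipping bit 1 selects
   the other half of the long; the little-endian read (like RDWORD) then
   yields the value the big-endian ZPU code expects. */
static uint16_t read_word_swapped(const uint8_t *mem, uint32_t zpu_addr) {
    uint32_t a = zpu_addr ^ 2u;
    return (uint16_t)(mem[a] | (mem[a + 1] << 8));
}
```

With the swapped bytes 78 56 34 12 in memory, byte addresses 0..3 read back as 12, 34, 56, 78 and word addresses 0 and 2 read back as 0x1234 and 0x5678, just as they would from the original big-endian image.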
Except... This does cause some issues with interfacing to PASM and SPIN code when you want to exchange BYTES and WORDS. For example, for Spin to write a string to memory so that ZOG gets it right, the Spin code has to reverse every four bytes. This can be done with the XOR trick as well.
N.B. This trick only works correctly because ZPU code should never access WORDS on an odd address or LONGS that are not 4 byte aligned.
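The string case might look like this. In the thread it is the Spin side that does the writing; the sketch below is in C only to stay in one language, and put_string_for_zog is a made-up name.

```c
#include <stdint.h>

/* Write a NUL-terminated string into long-swapped memory so that code on
   the ZPU side, reading bytes at zpu_addr, zpu_addr+1, ..., sees the
   characters in their normal order. The terminator is written too. */
static void put_string_for_zog(uint8_t *mem, uint32_t zpu_addr, const char *s) {
    do {
        mem[zpu_addr ^ 3u] = (uint8_t)*s;
        zpu_addr++;
    } while (*s++ != '\0');
}
```

Writing "ABCDE" at address 0 leaves the raw bytes D C B A in the first long and 0, 0, NUL, E in the second (assuming the buffer started zeroed), i.e. every group of four is reversed.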
After all that I'm not sure about the original question.
I see the bytes in binary file in same order as they are in disassembler dump.
Sure you do, they are both just displaying bytes going up memory.
I can't see how this matches what the docs say:
The instructions are stored big endian. That is the first instruction is stored in the most significant byte, and the fourth is in the least significant byte.
And so they do. If you were to write a C program to read an integer from address $4 you would get the number 0x82700b0b. I.e. bigendian.
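Put as code, the docs' statement amounts to the following. The helper name is hypothetical; the value 0x82700b0b is the one quoted above.

```c
#include <stdint.h>

/* Return the n-th opcode (n = 0..3) packed in a long that was read as a
   big-endian integer: opcode 0 is the most significant byte. */
static uint8_t opcode_in_long(uint32_t word, unsigned n) {
    return (uint8_t)(word >> (8u * (3u - (n & 3u))));
}
```

For 0x82700b0b this yields 0x82, 0x70, 0x0b, 0x0b for n = 0..3, which is the same order in which the bytes appear going up memory in the dump.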
If anyone can see a better way to sort this mess I'd like to hear it. Developing a little endian zpu-gcc is not on the cards for me.
I'd go for the beer first, take a break and then look at what there is:)
In the first post of this thread there is a ZPU VM in C that runs under Linux. It might help with testing. Of course its BYTE and WORD access routines are now backwards for you?
If you could put your compiler up somewhere I could try and find some time to do some testing as well.
Yep, on page 26 of this thread there is zog_v1_6 which is the last version I put out I think. Gosh that was a long time ago.
Anyway with #define USE_HUB_MEMORY uncommented and #define USE_JCACHED_MEMORY and #define USE_VIRTUAL_MEMORY commented out it should run fibo.bin that is a ZPU binary executable included via a Spin "file" statement.
You can try using #define SINGLE_STEP that will execute one byte code every time you press a key.
After almost a whole day of searching, editing and compiling (mostly searching), I think I have made a version of GCC and binutils that produces something looking like a binary for little-endian ZPU!
I am going to do some tests and grab some beer (is that the correct order? maybe beer first?), and will post more info here.
Wow! That's great news! I'd love to have a copy of the changes you made to get this to work.
Comments
You are really pushing the boundaries here. I thought LLVM for ZPU was in the very early stages of development.
Jonathan
Not that I don't trust you, but I want to understand how this works.
This is also interesting. What do the 2 lower bits have to do with endianness?
Very good questions and well spotted.
In the LOADSP and STORESP instructions an offset is extracted from the opcode and added to the stack pointer to get the required memory address.
It turns out that the top bit of that offset is inverted. This is an undocumented feature; at least, it's not mentioned on the ZPU architecture web page. Initially I thought it had something to do with signed offsets, with that bit being the sign bit, but that is not so. I did ask Zulin about this and
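A sketch of that address calculation in C follows. The post only says that an offset is taken from the opcode, its top bit is inverted, and the result is added to the stack pointer. The 5-bit field width and the scaling by four bytes per stack slot are assumptions added here, and the function name is hypothetical.

```c
#include <stdint.h>

/* Address used by LOADSP/STORESP, as described in the post above.
   Assumptions (not stated in the thread): the offset is the low five bits
   of the opcode and counts 4-byte stack slots. */
static uint32_t sp_relative_address(uint32_t sp, uint8_t opcode) {
    uint32_t offset = opcode & 0x1Fu; /* extract the offset field */
    offset ^= 0x10u;                  /* invert the top bit of the field */
    return sp + (offset << 2);        /* scale to bytes and add to SP */
}
```

Under those assumptions the opcode 0x70 addresses the top of stack itself, and 0x71 addresses the slot one long above it.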
Short time frame, yes. After that the Prop II comes out, don't forget. Then it's a long dark winter here porting everything to that :)
using a modified crt0.S
In the time it takes me to write all that you have fixed the issue at source
Heater, is there an easy way to run Zog on bare Propeller, without any external memory?
What with:
a) Andrey's little endian zpu-gcc
b) Your linker scripts that can get code/constants out into ext memory whilst stack/data are in HUB.
ZOG is finally going to show XMM solutions what it can do when you have code in ext serial FLASH/RAM:)
You guys are great.
No guarantee that it works as expected - I have not tested such things as relocations. Simple tests seem OK