Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

David Betz · 2010-08-24 05:20

I tried running fibo.c using debug_zog.spin and I can get a bit of output on the terminal but not all that I expect. I've attached my modified debug_zog.spin so you can see what I'm trying to do. So far, the only changes I've made are to add #ifdefs for the HYDRA clock settings and I've uncommented the serial port setup and some debug messages. If I run this I get only the following line on my terminal:

zpu memory at 000007B8

I would expect to have also gotten:

ZOG v0.21 (HUB)

What am I doing wrong here? I've also noticed that the ZOG version number is v0.21 but the zip file I got this from said it was v1.0. Maybe that's my problem?

Thanks,
David

Heater. · 2010-08-24 05:39

David,

You have an old version there. Get the latest (v1.3) zip package from here: http://forums.parallax.com/showpost.php?p=932516&postcount=363

One critical thing here is that the Makefiles for ZPU programs now reverse all the bytes in each LONG of the binaries. This saves having to do it in Spin/PASM at load time. So at some point the old ZPU binaries became incompatible with ZOG.

I would keep the updates in the first post on the first page but I can't get in as "heater" just yet.

David Betz · 2010-08-24 06:11

Thanks for the suggestions! I switched to v1.3 and I get a little more output now:

zpu memory at 00000088

#pc,opcode,sp,top_of_stack,next_on_stack
#----------

0X0000010 0X00 0X00004FF8 0X80FEAA04
BREAKPOINT

I still don't get the ZOG banner and the "(HUB)" text that it looks like debug_zog.spin is trying to print.

Also, USE_HUB_MEMORY is defined both in debug_zog.spin and zog.spin. I had to comment out the one in zog.spin to get debug_zog.spin to compile with USE_HUB_MEMORY defined.

Thanks,
David

David Betz · 2010-08-24 06:22

I got further and am now able to run fibo.c successfully. I still wonder why I don't get the ZOG banner line printed out. I get the message about "zpu memory at ..." but not the "ZOG v1.3 ..." string. I tried it using the "Parallax Serial Terminal" program and the text comes out there but not under putty. I guess I just have to use the Parallax program instead of putty for doing my ZOG work.

In any case, thanks for your help. I have my setup working now. I just have to try my own program to see if it works as well as fibo.c does! :-)

Heater. · 2010-08-24 06:46

David, excellent.

No idea about putty, sounds rather odd.

David Betz · 2010-08-24 07:23

Heater. wrote: »

No idea about putty, sounds rather odd.

I have a guess as to why putty doesn't work well. I suspect it's because putty operates (by default anyway) in the normal terminal emulator mode where CR goes back to the start of the current line and LF goes to the next line. I think the Propeller is expecting CR to go to the beginning of the next line. This would explain why I'm only getting a single line out of putty. Every line overwrites the previous line.
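The overwriting described above can be avoided on the receiving side by expanding each bare CR into CR LF before the text reaches the terminal. A minimal sketch in C, assuming the Propeller side sends a lone CR (13) as its newline; the function name `cr_to_crlf` is hypothetical and not part of any ZOG tool:

```c
#include <stddef.h>

/* Copy n bytes from src to dst, inserting LF after every CR that is not
   already followed by LF. dst must have room for 2*n bytes.
   Returns the number of bytes written to dst. */
size_t cr_to_crlf(const char *src, size_t n, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        dst[out++] = src[i];
        if (src[i] == '\r' && (i + 1 == n || src[i + 1] != '\n'))
            dst[out++] = '\n';   /* bare CR: add the missing line feed */
    }
    return out;
}
```

Text that already uses CR LF passes through unchanged, so the filter should be safe to leave in place for both conventions.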

jazzed · 2010-08-24 07:39

David Betz wrote: »

... terminal emulator mode where CR goes back to the start of the current line and LF goes to the next line. I think the Propeller is expecting CR to go to the beginning of the next line.

DOS :eyes:
That's one reason I suggested PST. Also the BST IDE has a terminal program that interprets LFCR "correctly" and automatically connects after download.

I'm like you though and prefer command line tools. It's hard to beat the VI editor for programming, and I had to find my own simple command line terminal program running on Linux. I ended up getting the source to nanocom and fixing it up for my needs - it doesn't have ymodem, but I expect that serial transfers will be way too slow soon anyway.

@Heater,

I was able to do some incremental testing with a 16MB block last night with no problem. I'll expand the test suite a little today then start migrating code to the latest zog. If you're close to zog 1.4, let me know so I don't have to integrate changes twice.

Cheers.

lonesock · 2010-08-24 08:25

here's a tiny speedup for zpu_neq:

zpu_neq                 call    #pop
                        sub     tos, data wz
              if_nz     mov     tos, #1
                        jmp     #done_and_inc_pc

Also, that multiplication code I posted will stop as soon as there are no more 1 bits in x, so it will be much faster on average. If you include those 1st 3 instructions, it makes sure x is the smaller of the two ops, x & y, which means it will terminate even faster on average. (note that both x & y have been absolute-value-ed, or that wouldn't work ;-)
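The multiplication code itself is not reproduced in this part of the thread, so the following is only a C sketch of the idea described above, not the PASM that was posted: a shift-and-add multiply on absolute values that swaps the operands so the loop runs over the smaller one and stops when it has no 1 bits left. The function names are hypothetical.

```c
#include <stdint.h>

/* Shift-and-add multiply on magnitudes. The loop runs once per bit
   position up to the highest set bit of the smaller operand. */
uint32_t mul_shift_add(uint32_t x, uint32_t y)
{
    if (x > y) {                 /* make x the smaller operand */
        uint32_t t = x;
        x = y;
        y = t;
    }
    uint32_t result = 0;
    while (x != 0) {             /* stop as soon as no 1 bits remain */
        if (x & 1u)
            result += y;
        x >>= 1;
        y <<= 1;
    }
    return result;
}

/* Signed wrapper: multiply absolute values, then restore the sign. */
int32_t mul_signed(int32_t a, int32_t b)
{
    uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;
    uint32_t m = mul_shift_add(ua, ub);
    return (int32_t)(((a < 0) != (b < 0)) ? 0u - m : m);
}
```

Comparing the operands only makes sense on magnitudes, which is why the absolute values are taken first.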

Jonathan

lonesock · 2010-08-24 08:42

Quick observation: zpu_nop and zpu_syscall both immediately jmp somewhere else (#done_and_inc_pc and #syscall, respectively). Could you just change the dispatch table to point directly to the final destination, and then relocate those two so they are within the 1st 256 longs?

Unrelated issue, zpu_swap looks really strange, I'm not sure what it's doing. Is there documentation for that opcode? I could not find it.

thanks,
Jonathan

Heater. · 2010-08-24 10:26

Lonesock:

zpu_nop could jump straight back to the execute loop if that were within the first 256 longs. In fact it would fit, but that leaves no wiggle room.

Same for zpu_syscall, except the syscall code won't fit within 256 longs.

zpu_swap does look strange. It seems to be wrong :)

It does not seem to be documented but it is in the ZPU Java simulator here http://repo.or.cz/w/zpu.git/blob/HEAD:/zpu/sw/simulator/com/zylin/zpu/simulator/Simulator.java

where it looks like:

case SWAP:
//                      if (feeble[SWAP])
//                      {
//                          emulate();
//                      } else
                      {
                          int swapVal=popIntStack();;
                          pushIntStack(((swapVal >>16)&0xffff)|(swapVal<<16));
                      }
                      break;

So just swapping high and low words of a long.

Meanwhile Zog's SWAP fails to post its result back to the top of stack. I guess it should look like:

zpu_swap                mov     data, tos
                        shr     data, #16
                        shl     tos, #16
                        or      tos, data
                        jmp     #done_and_inc_pc

We don't seem to have any code that uses swap.

That zpu_neq is brilliant. Sadly can't find any code that uses it.

Heater. · 2010-08-24 10:38

Lonesock: Could you post your complete mult solution. I'm getting too tired to see what I'm doing any more:)

jazzed · 2010-08-24 11:25

Doing some SDRAM cache testing on a malloc block with pseudo-random fill/readback ... This test which uses something like the LFSR PRNG takes about 2.5 minutes. The libc rand() is quite slow and I'm not sure if it's actually working.

Malloc buffer size: 80000
testprand 80000 20000 6b8b4567
0x4e00 - 0x84dfc
Writing Pseudo-random 0x00004e00 0x00000000 0xca3a62b3
Writing Pseudo-random 0x00014e00 0x00004000 0x829d1528
Writing Pseudo-random 0x00024e00 0x00008000 0x78d09443
Writing Pseudo-random 0x00034e00 0x0000c000 0xf988429a
Writing Pseudo-random 0x00044e00 0x00010000 0x9ae2f159
Writing Pseudo-random 0x00054e00 0x00014000 0x414e8a94
Writing Pseudo-random 0x00064e00 0x00018000 0xc3978a21
Writing Pseudo-random 0x00074e00 0x0001c000 0x7cc4214d
Reading Pseudo-random 0x00004e00 0x00000000
Reading Pseudo-random 0x00014e00 0x00004000
Reading Pseudo-random 0x00024e00 0x00008000
Reading Pseudo-random 0x00034e00 0x0000c000
Reading Pseudo-random 0x00044e00 0x00010000
Reading Pseudo-random 0x00054e00 0x00014000
Reading Pseudo-random 0x00064e00 0x00018000
Reading Pseudo-random 0x00074e00 0x0001c000
testprand passed.
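The source of testprand is not shown in the thread. As a rough sketch of a fill/readback test of this kind, assuming a 32-bit Galois LFSR as the generator (the tap mask below is an illustrative choice, and `lfsr_next` and `prand_test` are hypothetical names):

```c
#include <stddef.h>
#include <stdint.h>

/* One step of a 32-bit Galois LFSR. The tap mask is an assumption made
   for this sketch, not taken from the actual test program. */
static uint32_t lfsr_next(uint32_t s)
{
    return (s >> 1) ^ ((0u - (s & 1u)) & 0x80200003u);
}

/* Fill buf with the pseudo-random sequence, then regenerate the same
   sequence from the seed and compare. Returns the number of mismatches. */
size_t prand_test(uint32_t *buf, size_t words, uint32_t seed)
{
    uint32_t s = seed;
    for (size_t i = 0; i < words; i++) {
        s = lfsr_next(s);
        buf[i] = s;
    }

    size_t bad = 0;
    s = seed;
    for (size_t i = 0; i < words; i++) {
        s = lfsr_next(s);
        if (buf[i] != s)
            bad++;
    }
    return bad;
}
```

Because the readback pass regenerates the sequence rather than storing a reference copy, the test needs no memory beyond the block under test.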

I had to adjust the SDRAM Cache driver so that misses aren't so costly. Fibo results are about the same as the last post (all tests use 80MHz clock).

fibo(00) = 000000 (00000ms)
fibo(01) = 000001 (00000ms)
fibo(02) = 000001 (00000ms)
fibo(03) = 000002 (00000ms)
fibo(04) = 000003 (00001ms)
fibo(05) = 000005 (00001ms)
fibo(06) = 000008 (00003ms)
fibo(07) = 000013 (00005ms)
fibo(08) = 000021 (00008ms)
fibo(09) = 000034 (00014ms)
fibo(10) = 000055 (00022ms)
fibo(11) = 000089 (00037ms)
fibo(12) = 000144 (00060ms)
fibo(13) = 000233 (00097ms)
fibo(14) = 000377 (00157ms)
fibo(15) = 000610 (00254ms)
fibo(16) = 000987 (00411ms)
fibo(17) = 001597 (00668ms)
fibo(18) = 002584 (01082ms)
fibo(19) = 004181 (01748ms)
fibo(20) = 006765 (02825ms)
fibo(21) = 010946 (04569ms)
fibo(22) = 017711 (07401ms)
fibo(23) = 028657 (11988ms)
fibo(24) = 046368 (19398ms)

Now I'll see what happens testing 16MB block with pseudo-random fill/readback ... Guess I'll go catch a movie or something meanwhile. :sad: Only 1 hour 22 minutes to test 16000000 bytes - a good data retention test if nothing else. Test passed.

--Steve

lonesock · 2010-08-24 12:03

zpu_swap                ror     tos, #16
                        jmp     #done_and_inc_pc
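For comparison, here is a small C sketch showing that the shift/mask/or form from the Java simulator and a 16-bit rotate produce the same result. The function names are made up for the illustration. The mask in the simulator is needed because Java's `>>` on a signed int is an arithmetic shift; with unsigned C types it is redundant but harmless.

```c
#include <stdint.h>

/* Form used in the ZPU Java simulator: swap high and low 16-bit halves. */
uint32_t swap_shift_or(uint32_t v)
{
    return ((v >> 16) & 0xffffu) | (v << 16);
}

/* Same operation written as a rotate right by 16 bits. */
uint32_t swap_rotate(uint32_t v)
{
    return (v >> 16) | (v << (32 - 16));
}
```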

I also re-arranged a bunch of code, so both syscall and nop fit under 256 longs. However, it doesn't make a difference to the reported speed numbers in fibo (I'm guessing those aren't used much ;-).

Here's my modded version of 1.3:

Jonathan

Edit: the re-organization may not work for external RAM...I was compiling for HUB-only. Also, I did not add in the zpu_swap code, as I wasn't sure how it was supposed to behave. And I modified the zpu_addsp code, saved a long, the case where offset==0 is handled, but a bit slower than it was...same speed as it was if offset > 0.

David Betz · 2010-08-24 18:21

Bill Henning wrote: »

Later this week I am adding a driver for my FlexMem board (four bit wide bus using four SPI ram's for >2MB/sec burst transfer, up to 6.6MB/sec with the two cog special driver I am working on)

How is your FlexMem driver coming? I just made a board to plug into my Hydra that has two 23k256 SPI SRAM chips on it and I'd like to try using it with VMCOG. Ideally, I would use it as a two bit wide memory similar to the four bit wide memory you describe above. Short of that, I'd like to at least verify that each of the two SRAM chips is working by using it as a single 32k memory. Can you point me to where in the VMCOG code I need to look to get my chips hooked up to your code? What do I need to know other than the Propeller pins I used to hookup the SPI interface and the chip selects?

Thanks,
David

Heater. · 2010-08-24 19:11

Jazzed:

Excellent result.

Lonesock:

I just did a quick diff against your modified Zog. You have a lot of nice little mods in there to save space and or time. That zpu_swap makes me laugh, "ror tos, #16" was just too obvious for me to see:)

I'm loath to rearrange the code to put nop and syscall within the 255 limit. And here is why:

1) It eats all the "wiggle room" which makes me nervous, not that I can think of anything else we really have to put there.

2) It may save 2 JMPs but as SYSCALL is so long winded and rare the tiny speed gain will never show up. Similarly NOP is rare enough we won't notice the benefit.

3) The BIGGY, there is something we can put in that space that can have a dramatic speed gain. If you analyse a typical program you will find that there are a huge lot of zpu_loadsp and zpu_storesp used. And that there are a handful of offset values used in those instructions that predominate. So execution can be sped up by implementing some loadsp/storesp ops with hard wired offsets rather than decoding the offsets from the byte code as we do now. The jump table would have entries to these "special case" load/stores.

I tried this with fibo a while back and did get noticeable speed up. Problem is of course to tailor the selection of hardwired load/store offsets to match the application being run.

What would really help is to fix read_word and write_word so that zpu_loadh and zpu_storeh (load/store WORD) work properly and get rid of zpu_emulate which we should never need and eats space and time.

I think I'll put up v1.4 with all your little mods in place so Jazzed can put his RAM driver in.

lonesock · 2010-08-24 20:06

@Heater: Thanks for the kind words. I just wanted to clarify that syscall doesn't need to 'fit' within 255 longs, just that the entry point needs to start within 255 longs, am I right? And, you are so right, optimizing a NOP does seem a bit silly ("I got the NOP down to 0 clocks!!!!") [8^)

Regarding the zpu_loadsp and zpu_storesp, is this something where a tool could scan the ZPU bytecode and generate the N most common offsets? Or would you need some instrumentation/profiling code to keep track during a typical run? I'm just thinking since the prop has self-modifying code, maybe you could even do something as simple as storing the last used offset, with a shortcut if the next offset matches the last one.

I'd be happy to look over v1.4 as soon as it comes out, this is fun stuff!

Jonathan

Heater. · 2010-08-24 20:31

Lonesock,

You are right only the label zpu_syscall has to be in range.

With my C version of the ZPU running on linux I can do a little profiling, it counts the number of times each opcode is used in a program run. Using that data I did once make customized loadsp/storesp with offsets most commonly used in the fibo calculation and did get a few percent speed up when run on Zog.

That's an interesting idea about dynamically fixing loadsp/storesp at run time. No idea how we would do that yet. But building the profiler into Zog's loadsp/storesp might be a start. Only have to increment 64 counters somewhere.
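As a rough illustration of the "64 counters" idea, here is a C sketch that histograms loadsp/storesp opcodes in a byte stream. It assumes the ZPU encoding as I understand it, with storesp occupying opcodes 0x40..0x5F and loadsp 0x60..0x7F, the low five bits carrying the offset field; check this against the ZPU documentation before relying on it. The function name is hypothetical. A static scan like this also counts any data bytes that happen to fall in that range and says nothing about how often each instruction executes, so counting inside the interpreter loop would give better numbers.

```c
#include <stddef.h>
#include <stdint.h>

/* Histogram of stack-relative load/store opcodes.
   counts[0..31]  : storesp, indexed by the raw 5-bit offset field
   counts[32..63] : loadsp,  indexed by the raw 5-bit offset field
   The opcode ranges are an assumption about the ZPU encoding. */
void count_sp_ops(const uint8_t *code, size_t n, uint32_t counts[64])
{
    for (size_t i = 0; i < 64; i++)
        counts[i] = 0;

    for (size_t i = 0; i < n; i++) {
        uint8_t op = code[i];
        if ((op & 0xC0u) == 0x40u)       /* 0x40..0x7F */
            counts[op & 0x3Fu]++;
    }
}
```

Sorting the resulting table would give the handful of offsets worth hard-wiring into special-case jump table entries.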

"...this is fun stuff!"

So let's continue,

I now have a read_word that at least works from HUB memory so zpu_loadh now works without using EMULATE.

read_word               mov     memp, address
'                        and     memp, zpu_memory_mask
                        add     memp, zpu_memory_addr
                        xor     memp, #%10
                        rdword  data, memp
read_word_ret           ret

Edit: Yay, now have write_word and zpu_storeh works after a little tweaking.

Bill Henning · 2010-08-25 10:06

Hi David,

I am in a bit of a crunch because I need to get some PCB's out for production this week, so I won't be able to get to the driver for a day or two because:

- I am just finishing troubleshooting a problem on the new Mem+ with the MCP23S17 I/O expander .... my pcb manufacturer made changes to the gerbers, which resulted in several problems that I am having to identify and work around. I am almost done with the MCP23S17, the only thing left to check on Mem+ is the SD interface.

- After that, I have to check the SPI ram's on the new Morpheus and IR in / IR out. (There were manufacturing errors on Morpheus as well...)

After the above are done, I will be adding the FlexMem driver.

I'll add a two-bit wide mode for you as well

David Betz wrote: »

How is your FlexMem driver coming? I just made a board to plug into my Hydra that has two 23k256 SPI SRAM chips on it and I'd like to try using it with VMCOG. Ideally, I would use it as a two bit wide memory similar to the four bit wide memory you describe above. Short of that, I'd like to at least verify that each of the two SRAM chips is working by using it as a single 32k memory. Can you point me to where in the VMCOG code I need to look to get my chips hooked up to your code? What do I need to know other than the Propeller pins I used to hookup the SPI interface and the chip selects?

Thanks,
David

lonesock · 2010-08-25 11:01

@Heater: Great, I'm glad the word stuff is working. Did you find that endian-ness caused any issues? I'm thinking of things like a C union of char[4] and int or float.

Jonathan

David Betz · 2010-08-25 11:09

Bill Henning wrote: »

After the above are done, I will be adding the FlexMem driver.

I'll add a two-bit wide mode for you as well

I looked over the VMCOG code a little last night and it seems that two 32k SPI SRAM chips is already supported. I see where to define the MOSI, MISO, CLK pins but there is only one definition for CS. How do you derive the CS for the other chip? Just add one? If this two chip SPI SRAM configuration is already supported (under the PROPCADE conditional), then I can get started right away. Is my reading of the code correct (v0.975)?

Bill Henning · 2010-08-25 12:28

Close...

Ok, I pulled up the .975 source, and here is how to change it for two SPI ram's that share CLK, MOSI, MISO, but have individual chip selects...

1) Use the PropCade code as the template

2) PropCade selects SPI ram chip 0..5 by using

movd outa,#n ' where n=0..5

to set the 3 bit address of the 74hc138 on P9,P10,P11 to select the ram chip.

P3, which is /CS, actually enables the '138, thus generating Y0..Y5 (which are the actual SPI ram chip selects)

3) Go for single-bit wide SPI ram for now, doing the dual-bit conversion is significantly more complicated

4) In SPI ram constants, add a CS1, and change pins to match your hardware

5) In BINIT, instead of looping through six chips, initialize your two chips, using your CS and CS1 pins

so replace

        ' Initialize up to six SPI RAM's on PropCade to sequential mode
        mov  count2,#6
ini_lp  mov  outa,#0
        mov  addr,ramseq
        mov  bits,#16
        andn outa,#CS|CLK
        call #send
        or   outa,#CS
        add outa,dstinc
        djnz count2,#ini_lp

with

        ' init low chip
        mov  addr,ramseq
        mov  bits,#16
        andn outa,#CS|CLK
        call #send
        or   outa,#CS
        ' init high chip
        mov  addr,ramseq
        mov  bits,#16
        andn outa,#CS1|CLK
        call #send
        or   outa,#CS1

6) in BSTART, see the logic which decodes the high order VM address bits into dv, to select one of six ram chips.

Remove it

        mov   dv, addr
        shr   dv,#15            ' dv = 0..3, ie the chip select
        movd  outa,dv           ' select SPI device

Change

        andn  outa,#CS|CLK

to

       andn  outa,#CLK
       shr addr,#16 nr, wc
       ' select the lower / upper 32KB SPI ram's based on carry
       muxnc outa,CS
       muxc   outa,CS1

and at the end of both BREAD and BWRITE, where you see

        or    outa,#CS

change it to

        or    outa,#CS|CS1

The above is untested, but should work - it is what I will be doing on Morpheus CPU1.
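The address split behind the chip select can be sketched in C. This assumes two 32 KB chips, so bit 15 of the virtual address picks the chip and the low 15 bits address within it, matching the "shr dv,#15" in the PropCade code being removed. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Location of a byte in a two-chip, 32 KB-per-chip SPI RAM array. */
typedef struct {
    unsigned chip;     /* 0 = low chip (CS), 1 = high chip (CS1) */
    uint16_t offset;   /* address sent to the selected chip */
} sram_loc;

sram_loc decode_addr(uint32_t vaddr)
{
    sram_loc loc;
    loc.chip = (vaddr >> 15) & 1u;          /* bit 15 selects the chip */
    loc.offset = (uint16_t)(vaddr & 0x7FFFu);
    return loc;
}
```

Since the PASM above is marked untested, the carry-based select is worth checking on hardware: as far as I know, on the Propeller 1 "shr ... wc" sets C from the operand's original bit 0 rather than from the last bit shifted out, so a "test" against a bit-15 mask may be what is needed.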

David Betz wrote: »

I looked over the VMCOG code a little last night and it seems that two 32k SPI SRAM chips is already supported. I see where to define the MOSI, MISO, CLK pins but there is only one definition for CS. How do you derive the CS for the other chip? Just add one? If this two chip SPI SRAM configuration is already supported (under the PROPCADE conditional), then I can get started right away. Is my reading of the code correct (v0.975)?

David Betz · 2010-08-25 12:38

Bill Henning wrote: »

The above is untested, but should work - it is what I will be doing on Morpheus CPU1.

Thanks Bill! That should be enough to get me testing my double SRAM board for the Hydra. Now we'll see how well I soldered it together! :-)

Bill Henning · 2010-08-25 12:43

You are welcome!

David Betz wrote: »

Thanks Bill! That should be enough to get me testing my double SRAM board for the Hydra. Now we'll see how well I soldered it together! :-)

David Betz · 2010-08-25 17:47

Okay, I've run into a minor problem porting VMCOG to my Hydra SRAM board. It looks like the code for PropCade assumes that all of the I/O pins used by the SPI SRAMs are between 0 and 8 since the code expects to be able to put a pin mask in an immediate value. I guess I'll need to make these values COG variables instead of immediate constants unless there is some other clever way around this that I'm not thinking of. The pins I'm using are:

MOSI = P16
CLK = P17
MISO = P18
CS = P19
CS1 = P20

Is there some other way than defining COG variables with the various masks needed?

(obviously, I'm not an ASM wizard!)

Bill Henning · 2010-08-25 17:59

Nope.

Sorry, forgot to warn you about that.

MOSI long 1<<16
etc

is your friend
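The reason is that a PASM immediate source operand is only 9 bits wide, so a mask written as "#value" must be in the range 0..511. A small C sketch of that check, with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit mask for a single Propeller I/O pin (0..31). */
uint32_t pin_mask(unsigned pin)
{
    return (uint32_t)1 << pin;
}

/* True if the value can be encoded as a 9-bit PASM immediate (#value). */
bool fits_immediate(uint32_t value)
{
    return value <= 0x1FFu;
}
```

Masks for P0..P8 pass the check; anything on P9 or above, such as the P16..P20 assignments here, has to live in a COG long and be referenced without the "#".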

David Betz wrote: »

Okay, I've run into a minor problem porting VMCOG to my Hydra SRAM board. It looks like the code for PropCade assumes that all of the I/O pins used by the SPI SRAMs are between 0 and 8 since the code expects to be able to put a pin mask in an immediate value. I guess I'll need to make these values COG variables instead of immediate constants unless there is some other clever way around this that I'm not thinking of. The pins I'm using are:

MOSI = P16
CLK = P17
MISO = P18
CS = P19
CS1 = P20

Is there some other way than defining COG variables with the various masks needed?

(obviously, I'm not an ASM wizard!)

David Betz · 2010-08-25 18:24

Ummm... I hadn't counted on this. It looks like I have to get an SD card working to use VMCOG. I think I have enough space on my Hydra card for a microSD slot but I'm not sure what is needed to wire it up beyond just the microSD slot itself.

David Betz · 2010-08-25 18:37

Does anyone here have a simple program to test a SPI SRAM interface? I'd like to verify that my SRAM board works correctly before I start to add the microSD slot to it in order to run ZOG/VMCOG. There is a SPI SRAM driver in the object exchange but it works with a different SRAM chip. I'm using the 23k256.

Thanks,
David

Ariba · 2010-08-25 19:22

David Betz wrote: »

Does anyone here have a simple program to test a SPI SRAM interface? I'd like to verify that my SRAM board works correctly before I start to add the microSD slot to it in order to run ZOG/VMCOG. There is a SPI SRAM driver in the object exchange but it works with a different SRAM chip. I'm using the 23k256.

Thanks,
David

Yes, I posted this before in the VMCOG thread.
I wonder if it works for you, Bill had some troubles with the 10MHz SPI...

Andy

jazzed · 2010-08-25 19:41

Heater. wrote: »

I think I'll put up v1.4 with all your little mods in place so Jazzed can put his RAM driver in.

I'll be ready to integrate in the morning.

Thanks.
--Steve

Heater. · 2010-08-25 23:03

Lonesock:

"Did you find that endian-ness caused any issues? I'm thinking of things like a C union of char[4] and int or float."

The ZPU is opposite ended to the Propeller. So if you load a normal ZPU binary into the Prop and use rdlong to pick up LONG constants in the binary the bytes will be backwards.

Rather than fix this by reordering bytes as ZOG runs I just use objcopy to reverse all the bytes of the binary images. See Makefiles. This is good for performance and means LONGS can be used by ZPU and Spin directly but also means that when ZPU programs work on bytes and words you have to do some address tweaking to get the order right.

Fortunately the ZPU GCC ensures 32 bit ints are 4 byte aligned in memory and 16 bit ints are two byte aligned. So the reordering can be done by simply XORing the address, %11 fixes byte addresses, %10 fixes word addresses. So quite speedy. Weird things have happened when I got that wrong, programs would run just fine for thousands of instructions but fail in odd ways later on.
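The address tweak can be modelled in C on a plain byte array. This is a sketch of the scheme as described above, not the Zog source: the image is byte-reversed within each long, byte accesses XOR the address with %11, and word accesses XOR it with %10 and then read little-endian, as rdword would. The function names are chosen for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the bytes inside each 4-byte group, as the Makefiles do to the
   ZPU binary. len must be a multiple of 4. */
void reverse_longs(uint8_t *img, size_t len)
{
    for (size_t i = 0; i + 3 < len; i += 4) {
        uint8_t t = img[i];     img[i] = img[i + 3];     img[i + 3] = t;
        t = img[i + 1];         img[i + 1] = img[i + 2]; img[i + 2] = t;
    }
}

/* Byte read at a ZPU address from the reversed image. */
uint8_t read_byte(const uint8_t *mem, uint32_t addr)
{
    return mem[addr ^ 3u];
}

/* 16-bit read at a 2-byte-aligned ZPU address from the reversed image.
   The two bytes are combined low byte first, like a little-endian rdword. */
uint16_t read_word(const uint8_t *mem, uint32_t addr)
{
    uint32_t a = addr ^ 2u;
    return (uint16_t)(mem[a] | (mem[a + 1] << 8));
}
```

With the big-endian bytes 11 22 33 44 at ZPU addresses 0..3, the reversed image holds 44 33 22 11, yet read_byte(mem, 0) still returns 0x11 and read_word(mem, 2) still returns 0x3344.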

So yes there will be problems with code that uses unions as you describe and other things. WORD and LONG alignment could also be an issue.

The answer is, don't do that. The world is full of C code that runs on Intel and ARM which have opposite endianness, ARMs can have alignment limitations compared to Intel as well. So I don't expect much of a problem there.

One case I have that confuses things is the xxtea crypto which works on blocks of LONGs. Here you have to be careful to order chars in the LONGs when encrypting/decrypting strings.
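One way to follow the "don't do that" advice is to build LONGs from chars with explicit shifts instead of a union or pointer cast, so the byte order is fixed by the code rather than by the machine. A minimal C sketch with hypothetical helper names; the choice of most-significant-byte-first is an assumption for the example, not what the xxtea code necessarily uses:

```c
#include <stdint.h>

/* Pack four chars into a 32-bit value, first char in the top byte,
   independent of host endianness. */
uint32_t pack_be(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Inverse of pack_be. */
void unpack_be(uint32_t v, uint8_t b[4])
{
    b[0] = (uint8_t)(v >> 24);
    b[1] = (uint8_t)(v >> 16);
    b[2] = (uint8_t)(v >> 8);
    b[3] = (uint8_t)v;
}
```

Using the same pair of helpers on both the encrypting and decrypting side keeps strings consistent whichever processor runs the code.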
