Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S - Page 15 — Parallax Forums

Comments

  • David Betz Posts: 14,511
    edited 2010-08-24 05:20
    I tried running fibo.c using debug_zog.spin and I can get a bit of output on the terminal but not all that I expect. I've attached my modified debug_zog.spin so you can see what I'm trying to do. So far, the only changes I've made are to add #ifdefs for the HYDRA clock settings and I've uncommented the serial port setup and some debug messages. If I run this I get only the following line on my terminal:

    zpu memory at 000007B8

    I would expect to have also gotten:

    ZOG v0.21 (HUB)

    What am I doing wrong here? I've also noticed that the ZOG version number is v0.21 but the zip file I got this from said it was v1.0. Maybe that's my problem?

    Thanks,
    David
  • Heater. Posts: 21,230
    edited 2010-08-24 05:39
    David,

    You have an old version there. Get the latest (v1.3) zip package from here http://forums.parallax.com/showpost.php?p=932516&postcount=363

    One critical thing here is that the Makefiles for ZPU programs now reverse all the bytes in each LONG of the binaries. This saves having to do it in Spin/PASM at load time. So at some point the old ZPU binaries became incompatible with ZOG.
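
    The per-LONG byte reversal described here can be sketched in C as follows. This is only an illustration of the transformation, not the Makefile rule itself; `reverse_longs` is a hypothetical name, and the sketch assumes the image length is a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order inside every 4-byte group of a ZPU image, in place.
 * Hypothetical helper for illustration only.
 * Assumes len is a multiple of 4; any trailing bytes are left untouched. */
void reverse_longs(uint8_t *image, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint8_t b0 = image[i];
        uint8_t b1 = image[i + 1];
        image[i]     = image[i + 3];
        image[i + 1] = image[i + 2];
        image[i + 2] = b1;
        image[i + 3] = b0;
    }
}
```

    Applying the function twice restores the original image, which is why a binary built with the reversal cannot be mixed with a loader that also reverses the bytes.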

    I would keep the updates in the first post on the first page but I can't get in as "heater" just yet.
  • David Betz Posts: 14,511
    edited 2010-08-24 06:11
    Thanks for the suggestions! I switched to v1.3 and I get a little more output now:
    zpu memory at 00000088
    
    #pc,opcode,sp,top_of_stack,next_on_stack
    #----------
    
    0X0000010 0X00 0X00004FF8 0X80FEAA04
    BREAKPOINT
    

    I still don't get the ZOG banner and the "(HUB)" text that it looks like debug_zog.spin is trying to print.

    Also, USE_HUB_MEMORY is defined both in debug_zog.spin and zog.spin. I had to comment out the one in zog.spin to get debug_zog.spin to compile with USE_HUB_MEMORY defined.

    Thanks,
    David
  • David Betz Posts: 14,511
    edited 2010-08-24 06:22
    I got further and am now able to run fibo.c successfully. I still wonder why I don't get the ZOG banner line printed out. I get the message about "zpu memory at ..." but not the "ZOG v1.3 ..." string. I tried it using the "Parallax Serial Terminal" program and the text comes out there but not under putty. I guess I just have to use the Parallax program instead of putty for doing my ZOG work.

    In any case, thanks for your help. I have my setup working now. I just have to try my own program to see if it works as well as fibo.c does! :-)
  • Heater. Posts: 21,230
    edited 2010-08-24 06:46
    David, excellent.

    No idea about putty, sounds rather odd.
  • David Betz Posts: 14,511
    edited 2010-08-24 07:23
    Heater. wrote: »
    No idea about putty, sounds rather odd.
    I have a guess as to why putty doesn't work well. I suspect it's because putty operates (by default anyway) in the normal terminal emulator mode where CR goes back to the start of the current line and LF goes to the next line. I think the Propeller is expecting CR to go to the beginning of the next line. This would explain why I'm only getting a single line out of putty. Every line overwrites the previous line.
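
    If that guess is right, one workaround on the terminal side is to expand every bare CR into CR LF before display. A minimal sketch in C, assuming the Propeller side sends a lone CR as its line ending; `cr_to_crlf` is a hypothetical helper, not part of any tool named in this thread.

```c
#include <stddef.h>

/* Copy src to dst, emitting CR LF for every bare CR so that each line
 * starts on a new row. A CR already followed by LF is left alone.
 * Returns the number of bytes written; dst must hold up to 2 * n bytes. */
size_t cr_to_crlf(const char *src, size_t n, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        dst[out++] = src[i];
        if (src[i] == '\r' && (i + 1 == n || src[i + 1] != '\n'))
            dst[out++] = '\n';
    }
    return out;
}
```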
  • jazzed Posts: 11,803
    edited 2010-08-24 07:39
    David Betz wrote: »
    ... terminal emulator mode where CR goes back to the start of the current line and LF goes to the next line. I think the Propeller is expecting CR to go to the beginning of the next line.
    DOS :eyes:
    That's one reason I suggested PST. Also the BST IDE has a terminal program that interprets LFCR "correctly" and automatically connects after download.

    I'm like you though and prefer command line tools. It's hard to beat the VI editor for programming, and I had to find my own simple command line terminal program running on Linux. I ended up getting the source to nanocom and fixing it up for my needs - it doesn't have ymodem, but I expect that serial transfers will be way too slow soon anyway.


    @Heater,

    I was able to do some incremental testing with a 16MB block last night with no problem. I'll expand the test suite a little today then start migrating code to the latest zog. If you're close to zog 1.4, let me know so I don't have to integrate changes twice.

    Cheers.
  • lonesock Posts: 917
    edited 2010-08-24 08:25
    here's a tiny speedup for zpu_neq:
    zpu_neq                 call    #pop
                            sub     tos, data wz        ' Z is set (and tos is 0) when the operands are equal
                  if_nz     mov     tos, #1             ' otherwise the result is 1
                            jmp     #done_and_inc_pc
    
    Also, that multiplication code I posted will stop as soon as there are no more 1 bits in x, so it will be much faster on average. If you include those 1st 3 instructions, it makes sure x is the smaller of the two ops, x & y, which means it will terminate even faster on average. (note that both x & y have been absolute-value-ed, or that wouldn't work ;-)
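
    The multiplication code itself is not quoted on this page, so the following C sketch only illustrates the idea described: a shift-and-add multiply on absolute values that swaps the operands so the smaller one drives the loop, and stops as soon as that operand has no 1 bits left. `mul_early_exit` is a hypothetical name, and the result wraps modulo 2^32 like a 32-bit register would.

```c
#include <stdint.h>

/* Shift-and-add multiply that stops once the multiplier has no 1 bits left.
 * Illustrative only; not a transcription of the PASM discussed in the thread. */
int32_t mul_early_exit(int32_t a, int32_t b)
{
    /* Work on magnitudes in unsigned arithmetic and remember the sign. */
    uint32_t x = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t y = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;
    int negative = (a < 0) != (b < 0);

    /* Make x the smaller magnitude so the loop runs fewer times. */
    if (x > y) {
        uint32_t t = x;
        x = y;
        y = t;
    }

    uint32_t result = 0;
    while (x != 0) {            /* exits as soon as x runs out of 1 bits */
        if (x & 1u)
            result += y;
        x >>= 1;
        y <<= 1;
    }
    if (negative)
        result = 0u - result;
    return (int32_t)result;
}
```

    The swap only works because both operands are already non-negative at that point, which is the caveat made in the post.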

    Jonathan
  • lonesock Posts: 917
    edited 2010-08-24 08:42
    Quick observation: zpu_nop and zpu_syscall both immediately jmp somewhere else (#done_and_inc_pc and #syscall, respectively). Could you just change the dispatch table to point directly to the final destination, and then relocate those two so they are within the 1st 256 longs?

    Unrelated issue, zpu_swap looks really strange, I'm not sure what it's doing. Is there documentation for that opcode? I could not find it.

    thanks,
    Jonathan
  • Heater. Posts: 21,230
    edited 2010-08-24 10:26
    Lonesock:

    zpu_nop could jump straight back to the execute loop if that were within the first 256 longs. In fact it would fit, but that leaves no wiggle room.

    Same for zpu_syscall, except the syscall code won't fit within 256 longs.

    zpu_swap does look strange. It seems to be wrong :)

    It does not seem to be documented but it is in the ZPU Java simulator here http://repo.or.cz/w/zpu.git/blob/HEAD:/zpu/sw/simulator/com/zylin/zpu/simulator/Simulator.java

    Where it looks like :
    case SWAP:
    //                      if (feeble[SWAP])
    //                      {
    //                          emulate();
    //                      } else
                          {
                              int swapVal=popIntStack();;
                              pushIntStack(((swapVal >>16)&0xffff)|(swapVal<<16));
                          }
                          break;
    

    So just swapping high and low words of a long.
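
    For reference, the simulator expression translates directly to C on an unsigned 32-bit value, and it is equivalent to a rotate by 16 bits. The function names below are made up for this sketch.

```c
#include <stdint.h>

/* Exchange the high and low 16-bit halves of a 32-bit value,
 * mirroring the expression in the simulator's SWAP case. */
uint32_t swap_halves(uint32_t v)
{
    return ((v >> 16) & 0xffffu) | (v << 16);
}

/* The same operation written as a rotate by 16 bits. */
uint32_t rotate16(uint32_t v)
{
    return (v >> 16) | (v << 16);
}
```

    The `& 0xffff` mask matters in Java, where `>>` on an `int` is an arithmetic shift; on an unsigned C type the shift already fills with zeros.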

    Meanwhile Zog's SWAP fails to post its result back to the top of stack. I guess it should look like:
    zpu_swap                mov     data, tos
                            shr     data, #16
                            shl     tos, #16
                            or      tos, data
                            jmp     #done_and_inc_pc
    

    We don't seem to have any code that uses swap.

    That zpu_neq is brilliant. Sadly can't find any code that uses it.
  • Heater. Posts: 21,230
    edited 2010-08-24 10:38
    Lonesock: Could you post your complete mult solution. I'm getting too tired to see what I'm doing any more:)
  • jazzed Posts: 11,803
    edited 2010-08-24 11:25
    Doing some SDRAM cache testing on a malloc block with pseudo-random fill/readback ... This test which uses something like the LFSR PRNG takes about 2.5 minutes. The libc rand() is quite slow and I'm not sure if it's actually working.

    Malloc buffer size: 80000
    testprand 80000 20000 6b8b4567
    0x4e00 - 0x84dfc
    Writing Pseudo-random 0x00004e00 0x00000000 0xca3a62b3
    Writing Pseudo-random 0x00014e00 0x00004000 0x829d1528
    Writing Pseudo-random 0x00024e00 0x00008000 0x78d09443
    Writing Pseudo-random 0x00034e00 0x0000c000 0xf988429a
    Writing Pseudo-random 0x00044e00 0x00010000 0x9ae2f159
    Writing Pseudo-random 0x00054e00 0x00014000 0x414e8a94
    Writing Pseudo-random 0x00064e00 0x00018000 0xc3978a21
    Writing Pseudo-random 0x00074e00 0x0001c000 0x7cc4214d
    Reading Pseudo-random 0x00004e00 0x00000000
    Reading Pseudo-random 0x00014e00 0x00004000
    Reading Pseudo-random 0x00024e00 0x00008000
    Reading Pseudo-random 0x00034e00 0x0000c000
    Reading Pseudo-random 0x00044e00 0x00010000
    Reading Pseudo-random 0x00054e00 0x00014000
    Reading Pseudo-random 0x00064e00 0x00018000
    Reading Pseudo-random 0x00074e00 0x0001c000
    testprand passed.
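
    The test program itself is not shown. A pseudo-random fill/readback test of this general kind can be sketched in C as below. The generator is Marsaglia's xorshift32, swapped in here as a stand-in for "something like the LFSR PRNG"; the function names are hypothetical and the seed must be nonzero.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of Marsaglia's xorshift32 generator (state must be nonzero). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill buf with the sequence produced from seed. */
void prand_fill(uint32_t *buf, size_t words, uint32_t seed)
{
    uint32_t s = seed;
    for (size_t i = 0; i < words; i++)
        buf[i] = xorshift32(&s);
}

/* Regenerate the same sequence and count the words that do not match. */
size_t prand_verify(const uint32_t *buf, size_t words, uint32_t seed)
{
    uint32_t s = seed;
    size_t errors = 0;
    for (size_t i = 0; i < words; i++)
        if (buf[i] != xorshift32(&s))
            errors++;
    return errors;
}
```

    Because the readback pass regenerates the sequence from the seed, no second copy of the data is needed, which is what makes this approach practical for a block larger than HUB memory.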

    I had to adjust the SDRAM Cache driver so that misses aren't so costly. Fibo results are about the same as the last post (all tests use 80MHz clock).

    fibo(00) = 000000 (00000ms)
    fibo(01) = 000001 (00000ms)
    fibo(02) = 000001 (00000ms)
    fibo(03) = 000002 (00000ms)
    fibo(04) = 000003 (00001ms)
    fibo(05) = 000005 (00001ms)
    fibo(06) = 000008 (00003ms)
    fibo(07) = 000013 (00005ms)
    fibo(08) = 000021 (00008ms)
    fibo(09) = 000034 (00014ms)
    fibo(10) = 000055 (00022ms)
    fibo(11) = 000089 (00037ms)
    fibo(12) = 000144 (00060ms)
    fibo(13) = 000233 (00097ms)
    fibo(14) = 000377 (00157ms)
    fibo(15) = 000610 (00254ms)
    fibo(16) = 000987 (00411ms)
    fibo(17) = 001597 (00668ms)
    fibo(18) = 002584 (01082ms)
    fibo(19) = 004181 (01748ms)
    fibo(20) = 006765 (02825ms)
    fibo(21) = 010946 (04569ms)
    fibo(22) = 017711 (07401ms)
    fibo(23) = 028657 (11988ms)
    fibo(24) = 046368 (19398ms)
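
    The source of fibo.c is not shown in this thread. The timings above grow by a factor of about 1.6 per step (for example 19398 / 11988 is roughly 1.62), which is what a naive recursive implementation would produce, so the following C sketch is a guess at the shape of the benchmark rather than a copy of it.

```c
#include <stdint.h>

/* Naive doubly recursive Fibonacci. The number of calls grows roughly
 * as fast as the result itself, which makes it a simple CPU benchmark. */
uint32_t fibo(uint32_t n)
{
    if (n < 2)
        return n;
    return fibo(n - 1) + fibo(n - 2);
}
```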

    Now I'll see what happens testing 16MB block with pseudo-random fill/readback ... Guess I'll go catch a movie or something meanwhile. :sad: Only 1 hour 22 minutes to test 16000000 bytes - a good data retention test if nothing else. Test passed.

    --Steve
  • lonesock Posts: 917
    edited 2010-08-24 12:03
    zpu_swap                ror     tos, #16
                            jmp     #done_and_inc_pc
    
    I also re-arranged a bunch of code, so both syscall and nop fit under 256 longs. However, it doesn't make a difference to the reported speed numbers in fibo (I'm guessing those aren't used much ;-).

    Here's my modded version of 1.3:

    Jonathan

    Edit: the re-organization may not work for external RAM...I was compiling for HUB-only. Also, I did not add in the zpu_swap code, as I wasn't sure how it was supposed to behave. And I modified the zpu_addsp code, saved a long, the case where offset==0 is handled, but a bit slower than it was...same speed as it was if offset > 0.
  • David Betz Posts: 14,511
    edited 2010-08-24 18:21
    Bill Henning wrote: »
    Later this week I am adding a driver for my FlexMem board (four bit wide bus using four SPI ram's for >2MB/sec burst transfer, up to 6.6MB/sec with the two cog special driver I am working on)

    How is your FlexMem driver coming? I just made a board to plug into my Hydra that has two 23k256 SPI SRAM chips on it and I'd like to try using it with VMCOG. Ideally, I would use it as a two bit wide memory similar to the four bit wide memory you describe above. Short of that, I'd like to at least verify that each of the two SRAM chips is working by using it as a single 32k memory. Can you point me to where in the VMCOG code I need to look to get my chips hooked up to your code? What do I need to know other than the Propeller pins I used to hookup the SPI interface and the chip selects?

    Thanks,
    David
  • Heater. Posts: 21,230
    edited 2010-08-24 19:11
    Jazzed:

    Excellent result.

    Lonesock:

    I just did a quick diff against your modified Zog. You have a lot of nice little mods in there to save space and/or time. That zpu_swap makes me laugh, "ror tos, #16" was just too obvious for me to see :)

    I'm loath to rearrange the code to put nop and syscall within the 255 limit. And here is why:

    1) It eats all the "wiggle room" which makes me nervous, not that I can think of anything else we really have to put there.

    2) It may save 2 JMPs but as SYSCALL is so long winded and rare the tiny speed gain will never show up. Similarly NOP is rare enough that we won't notice the benefit.

    3) The BIGGY, there is something we can put in that space that can have a dramatic speed gain. If you analyse a typical program you will find that there are a huge lot of zpu_loadsp and zpu_storesp used. And that there are a handful of offset values used in those instructions that predominate. So execution can be sped up by implementing some loadsp/storesp ops with hard wired offsets rather than decoding the offsets from the byte code as we do now. The jump table would have entries to these "special case" load/stores.

    I tried this with fibo a while back and did get noticeable speed up. Problem is of course to tailor the selection of hardwired load/store offsets to match the application being run.
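
    The idea in point 3 can be illustrated with a toy dispatch table in C: most entries point at a general handler that decodes the offset from the opcode, while a few hot offsets get handlers with the offset hard wired. Everything here (the `vm_t` type, the handler names, the 5-bit offset field) is a simplified assumption for illustration and is not the real ZPU encoding or Zog's PASM.

```c
#include <stdint.h>

#define STACK_WORDS 64

typedef struct {
    uint32_t stack[STACK_WORDS];
    unsigned sp;                 /* index of the top of stack; grows downward */
} vm_t;

typedef void (*handler_t)(vm_t *vm, uint8_t opcode);

static void push(vm_t *vm, uint32_t v) { vm->stack[--vm->sp] = v; }

/* General case: decode the offset from the low 5 bits of the opcode. */
static void loadsp_generic(vm_t *vm, uint8_t opcode)
{
    unsigned offset = opcode & 0x1f;
    push(vm, vm->stack[vm->sp + offset]);
}

/* Special cases: the offset is a constant, so no decoding is needed. */
static void loadsp_0(vm_t *vm, uint8_t opcode)
{
    (void)opcode;
    push(vm, vm->stack[vm->sp]);
}

static void loadsp_1(vm_t *vm, uint8_t opcode)
{
    (void)opcode;
    push(vm, vm->stack[vm->sp + 1]);
}

/* Dispatch table for the 32 load-from-stack opcodes of this toy machine. */
static handler_t table[32];

void init_table(void)
{
    for (int i = 0; i < 32; i++)
        table[i] = loadsp_generic;
    table[0] = loadsp_0;         /* in practice chosen by profiling */
    table[1] = loadsp_1;
}

void execute_loadsp(vm_t *vm, uint8_t opcode)
{
    table[opcode & 0x1f](vm, opcode);
}
```

    The trade-off is the one described above: each specialised handler costs code space, so only the offsets that dominate a given application are worth hard wiring.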

    What would really help is to fix read_word and write_word so that zpu_loadh and zpu_storeh (load/store WORD) work properly and get rid of zpu_emulate which we should never need and eats space and time.

    I think I'll put up v1.4 with all your little mods in place so Jazzed can put his RAM driver in.
  • lonesock Posts: 917
    edited 2010-08-24 20:06
    @Heater: Thanks for the kind words. I just wanted to clarify that syscall doesn't need to 'fit' within 255 longs, just that the entry point needs to start within 255 longs, am I right? And, you are so right, optimizing a NOP does seem a bit silly ("I got the NOP down to 0 clocks!!!!") [8^)

    Regarding the zpu_loadsp and zpu_storesp, is this something where a tool could scan the ZPU bytecode and generate the N most common offsets? Or would you need some instrumentation/profiling code to keep track during a typical run? I'm just thinking since the prop has self-modifying code, maybe you could even do something as simple as storing the last used offset, with a shortcut if the next offset matches the last one.

    I'd be happy to look over v1.4 as soon as it comes out, this is fun stuff!

    Jonathan
  • Heater. Posts: 21,230
    edited 2010-08-24 20:31
    Lonesock,

    You are right only the label zpu_syscall has to be in range.

    With my C version of the ZPU running on linux I can do a little profiling, it counts the number of times each opcode is used in a program run. Using that data I did once make customized loadsp/storesp with offsets most commonly used in the fibo calculation and did get a few percent speed up when run on Zog.

    That's an interesting idea about dynamically fixing loadsp/storesp at run time. No idea how we would do that yet. But building the profiler into Zog's loadsp/storesp might be a start. Only have to increment 64 counters somewhere.
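
    A built-in profiler of that sort could be as small as the following C sketch. It reads the "64 counters" as 32 offsets for loads plus 32 for stores, which is an assumption, and all names are hypothetical.

```c
#include <stdint.h>

#define N_OFFSETS 32

/* 64 counters in total: one per offset for loads and one per offset for stores. */
static uint32_t load_count[N_OFFSETS];
static uint32_t store_count[N_OFFSETS];

/* Call from the load / store handlers with the decoded offset field. */
void profile_loadsp(unsigned offset)  { load_count[offset % N_OFFSETS]++; }
void profile_storesp(unsigned offset) { store_count[offset % N_OFFSETS]++; }

/* Return the offset with the highest count, or -1 if none was ever used. */
int hottest_offset(const uint32_t counts[N_OFFSETS])
{
    int best = -1;
    uint32_t best_count = 0;
    for (int i = 0; i < N_OFFSETS; i++) {
        if (counts[i] > best_count) {
            best_count = counts[i];
            best = i;
        }
    }
    return best;
}

int hottest_load(void)  { return hottest_offset(load_count); }
int hottest_store(void) { return hottest_offset(store_count); }
```

    After a typical run, the hottest offsets are the candidates for hard-wired loadsp/storesp handlers.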

    "...this is fun stuff!"

    So let's continue,

    I now have a read_word that at least works from HUB memory so zpu_loadh now works without using EMULATE.
    read_word               mov     memp, address
    '                        and     memp, zpu_memory_mask
                            add     memp, zpu_memory_addr
                            xor     memp, #%10
                            rdword  data, memp
    read_word_ret           ret
    

    Edit: Yay, now have write_word and zpu_storeh works after a little tweaking.
  • Bill Henning Posts: 6,445
    edited 2010-08-25 10:06
    Hi David,

    I am in a bit of a crunch because I need to get some PCB's out for production this week, so I won't be able to get to the driver for a day or two because:

    - I am just finishing troubleshooting a problem on the new Mem+ with the MCP23S17 I/O expander .... my pcb manufacturer made changes to the gerbers, which resulted in several problems that I am having to identify and work around. I am almost done with the MCP23S17, the only thing left to check on Mem+ is the SD interface.

    - After that, I have to check the SPI ram's on the new Morpheus and IR in / IR out. (There were manufacturing errors on Morpheus as well...)

    After the above are done, I will be adding the FlexMem driver.

    I'll add a two-bit wide mode for you as well :)
    David Betz wrote: »
    How is your FlexMem driver coming? I just made a board to plug into my Hydra that has two 23k256 SPI SRAM chips on it and I'd like to try using it with VMCOG. Ideally, I would use it as a two bit wide memory similar to the four bit wide memory you describe above. Short of that, I'd like to at least verify that each of the two SRAM chips is working by using it as a single 32k memory. Can you point me to where in the VMCOG code I need to look to get my chips hooked up to your code? What do I need to know other than the Propeller pins I used to hookup the SPI interface and the chip selects?

    Thanks,
    David
  • lonesock Posts: 917
    edited 2010-08-25 11:01
    @Heater: Great, I'm glad the word stuff is working. Did you find that endian-ness caused any issues? I'm thinking of things like a C union of char[4] and int or float.

    Jonathan
  • David Betz Posts: 14,511
    edited 2010-08-25 11:09
    Bill Henning wrote: »
    After the above are done, I will be adding the FlexMem driver.

    I'll add a two-bit wide mode for you as well :)

    I looked over the VMCOG code a little last night and it seems that two 32k SPI SRAM chips is already supported. I see where to define the MOSI, MISO, CLK pins but there is only one definition for CS. How do you derive the CS for the other chip? Just add one? If this two chip SPI SRAM configuration is already supported (under the PROPCADE conditional), then I can get started right away. Is my reading of the code correct (v0.975)?
  • Bill Henning Posts: 6,445
    edited 2010-08-25 12:28
    Close...

    Ok, I pulled up the .975 source, and here is how to change it for two SPI ram's that share CLK, MOSI, MISO, but have individual chip selects...

    1) Use the PropCade code as the template

    2) PropCade selects SPI ram chip 0..5 by using

    movd outa,#n ' where n=0..5

    to set the 3 bit address of the 74hc138 on P9,P10,P11 to select the ram chip.

    P3, which is /CS, actually enables the '138, thus generating Y0..Y5 (which are the actual SPI ram chip selects)

    3) Go for single-bit wide SPI ram for now, doing the dual-bit conversion is significantly more complicated

    4) In SPI ram constants, add a CS1, and change pins to match your hardware

    5) In BINIT, instead of looping through six chips, initialize your two chips, using your CS and CS1 pins

    so replace
            ' Initialize up to six SPI RAM's on PropCade to sequential mode
            mov  count2,#6
    ini_lp  mov  outa,#0
            mov  addr,ramseq
            mov  bits,#16
            andn outa,#CS|CLK
            call #send
            or   outa,#CS
            add outa,dstinc
            djnz count2,#ini_lp
    

    with
            ' init low chip
            mov  addr,ramseq
            mov  bits,#16
            andn outa,#CS|CLK
            call #send
            or   outa,#CS
            ' init high chip
            mov  addr,ramseq
            mov  bits,#16
            andn outa,#CS1|CLK
            call #send
            or   outa,#CS1
    

    6) in BSTART, see the logic which decodes the high order VM address bits into dv, to select one of six ram chips.

    Remove it
            mov   dv, addr
            shr   dv,#15            ' dv = 0..3, ie the chip select
            movd  outa,dv           ' select SPI device
    

    Change
            andn  outa,#CS|CLK
    

    to
           andn  outa,#CLK
           shr addr,#16 nr, wc
           ' select the lower / upper 32KB SPI ram's based on carry
           muxnc outa,CS
           muxc   outa,CS1
    

    and at the end of both BREAD and BWRITE, where you see
            or    outa,#CS
    

    change it to
            or    outa,#CS|CS1
    

    The above is untested, but should work - it is what I will be doing on Morpheus CPU1.
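
    The address decoding that the BSTART change is aiming for (lower 32KB on one chip, upper 32KB on the other) can be written out in C as a reference for testing. This sketch assumes a 64KB virtual range in which address bit 15 picks the chip; the type and function names are hypothetical, and it models only the arithmetic, not the SPI traffic.

```c
#include <stdint.h>

#define CHIP_BYTES 0x8000u   /* each SPI SRAM in this setup holds 32 KB */

typedef struct {
    unsigned chip;      /* 0 = device on CS, 1 = device on CS1 */
    uint16_t offset;    /* address sent to that device */
} sram_target_t;

/* Split a 16-bit virtual address across two 32 KB SPI SRAMs. */
sram_target_t sram_decode(uint16_t vaddr)
{
    sram_target_t t;
    t.chip   = (vaddr >> 15) & 1u;                    /* bit 15 picks the chip */
    t.offset = (uint16_t)(vaddr & (CHIP_BYTES - 1u)); /* low 15 bits go to the chip */
    return t;
}
```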
    David Betz wrote: »
    I looked over the VMCOG code a little last night and it seems that two 32k SPI SRAM chips is already supported. I see where to define the MOSI, MISO, CLK pins but there is only one definition for CS. How do you derive the CS for the other chip? Just add one? If this two chip SPI SRAM configuration is already supported (under the PROPCADE conditional), then I can get started right away. Is my reading of the code correct (v0.975)?
  • David Betz Posts: 14,511
    edited 2010-08-25 12:38
    Bill Henning wrote: »
    The above is untested, but should work - it is what I will be doing on Morpheus CPU1.

    Thanks Bill! That should be enough to get me testing my double SRAM board for the Hydra. Now we'll see how well I soldered it together! :-)
  • Bill Henning Posts: 6,445
    edited 2010-08-25 12:43
    You are welcome!
    David Betz wrote: »
    Thanks Bill! That should be enough to get me testing my double SRAM board for the Hydra. Now we'll see how well I soldered it together! :-)
  • David Betz Posts: 14,511
    edited 2010-08-25 17:47
    Okay, I've run into a minor problem porting VMCOG to my Hydra SRAM board. It looks like the code for PropCade assumes that all of the I/O pins used by the SPI SRAMs are between 0 and 8 since the code expects to be able to put a pin mask in an immediate value. I guess I'll need to make these values COG variables instead of immediate constants unless there is some other clever way around this that I'm not thinking of. The pins I'm using are:

    MOSI = P16
    CLK = P17
    MISO = P18
    CS = P19
    CS1 = P20

    Is there some other way than defining COG variables with the various masks needed?

    (obviously, I'm not an ASM wizard!)
  • Bill Henning Posts: 6,445
    edited 2010-08-25 17:59
    Nope.

    Sorry, forgot to warn you about that.

    MOSI long 1<<16
    etc

    is your friend :)
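
    The reason is that a PASM immediate (#value) source operand is only 9 bits wide, so a pin mask fits in an immediate only for P0..P8. A small C sketch of that check, with hypothetical helper names:

```c
#include <stdint.h>

/* Bit mask for Propeller I/O pin number pin (0..31). */
uint32_t pin_mask(unsigned pin)
{
    return (uint32_t)1 << pin;
}

/* A PASM immediate source operand is 9 bits wide, so only
 * values 0..511 can be encoded directly in the instruction. */
int fits_immediate(uint32_t value)
{
    return value <= 0x1FFu;
}
```

    With the pins listed in the question (P16..P20) none of the masks fit, so each one has to live in a COG long such as the `MOSI long 1<<16` shown above.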
    David Betz wrote: »
    Okay, I've run into a minor problem porting VMCOG to my Hydra SRAM board. It looks like the code for PropCade assumes that all of the I/O pins used by the SPI SRAMs are between 0 and 8 since the code expects to be able to put a pin mask in an immediate value. I guess I'll need to make these values COG variables instead of immediate constants unless there is some other clever way around this that I'm not thinking of. The pins I'm using are:

    MOSI = P16
    CLK = P17
    MISO = P18
    CS = P19
    CS1 = P20

    Is there some other way than defining COG variables with the various masks needed?

    (obviously, I'm not an ASM wizard!)
  • David Betz Posts: 14,511
    edited 2010-08-25 18:24
    Ummm... I hadn't counted on this. It looks like I have to get an SD card working to use VMCOG. I think I have enough space on my Hydra card for a microSD slot but I'm not sure what is needed to wire it up beyond just the microSD slot itself.
  • David Betz Posts: 14,511
    edited 2010-08-25 18:37
    Does anyone here have a simple program to test a SPI SRAM interface? I'd like to verify that my SRAM board works correctly before I start to add the microSD slot to it in order to run ZOG/VMCOG. There is a SPI SRAM driver in the object exchange but it works with a different SRAM chip. I'm using the 23k256.

    Thanks,
    David
  • Ariba Posts: 2,685
    edited 2010-08-25 19:22
    David Betz wrote: »
    Does anyone here have a simple program to test a SPI SRAM interface? I'd like to verify that my SRAM board works correctly before I start to add the microSD slot to it in order to run ZOG/VMCOG. There is a SPI SRAM driver in the object exchange but it works with a different SRAM chip. I'm using the 23k256.

    Thanks,
    David

    Yes, I posted this before in the VMCOG thread.
    I wonder if it works for you, Bill had some troubles with the 10MHz SPI...

    Andy
  • jazzed Posts: 11,803
    edited 2010-08-25 19:41
    Heater. wrote: »
    I think I'll put up v1.4 with all your little mods in place so Jazzed can put his RAM driver in.
    I'll be ready to integrate in the morning.

    Thanks.
    --Steve
  • Heater. Posts: 21,230
    edited 2010-08-25 23:03
    Lonesock:
    Did you find that endian-ness caused any issues? I'm thinking of things like a C union of char[4] and int or float.

    The ZPU is opposite ended to the Propeller. So if you load a normal ZPU binary into the Prop and use rdlong to pick up LONG constants in the binary the bytes will be backwards.

    Rather than fix this by reordering bytes as ZOG runs I just use objcopy to reverse all the bytes of the binary images. See Makefiles. This is good for performance and means LONGS can be used by ZPU and Spin directly but also means that when ZPU programs work on bytes and words you have to do some address tweaking to get the order right.

    Fortunately the ZPU GCC ensures 32 bit ints are 4 byte aligned in memory and 16 bit ints are two byte aligned. So the reordering can be done by simply XORing the address, %11 fixes byte addresses, %10 fixes word addresses. So quite speedy. Weird things have happened when I got that wrong, programs would run just fine for thousands of instructions but fail in odd ways later on.
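
    The address tweak can be modelled in C like this. The sketch assumes `hub` points at an image whose longs have already been byte-reversed, and it combines bytes explicitly so the result does not depend on the machine running the example; the function names are hypothetical.

```c
#include <stdint.h>

/* hub[] holds a ZPU image whose longs have already been byte-reversed,
 * so each 32-bit value is stored little-endian as the Propeller expects. */

/* Read the byte the ZPU program sees at byte address addr. */
uint8_t zpu_read_byte(const uint8_t *hub, uint32_t addr)
{
    return hub[addr ^ 3u];
}

/* Read the 16-bit value the ZPU program sees at an even address addr.
 * The two bytes are combined low byte first, as a little-endian word read would. */
uint16_t zpu_read_word(const uint8_t *hub, uint32_t addr)
{
    uint32_t p = addr ^ 2u;
    return (uint16_t)(hub[p] | (hub[p + 1] << 8));
}
```

    For example, the big-endian ZPU bytes 11 22 33 44 are stored as 44 33 22 11 after reversal; byte address 0 then maps to index 3 and yields 0x11, and word address 0 maps to index 2 and yields 0x1122.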

    So yes there will be problems with code that uses unions as you describe and other things. WORD and LONG alignment could also be an issue.

    The answer is, don't do that. The world is full of C code that runs on Intel and ARM which have opposite endianness, ARMs can have alignment limitations compared to Intel as well. So I don't expect much of a problem there.
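
    One common way to "not do that" is to build and split multi-byte values with shifts instead of overlaying a char[4] on an int. A minimal C sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Build a 32-bit value from four bytes, most significant byte first.
 * Shifts act on values, not on memory layout, so the result is the same
 * on big-endian and little-endian machines. */
uint32_t pack_be32(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Split a 32-bit value back into four bytes, most significant first. */
void unpack_be32(uint32_t v, uint8_t b[4])
{
    b[0] = (uint8_t)(v >> 24);
    b[1] = (uint8_t)(v >> 16);
    b[2] = (uint8_t)(v >> 8);
    b[3] = (uint8_t)v;
}
```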

    One case I have that confuses things is the xxtea crypto which works on blocks of LONGs. Here you have to be careful to order chars in the LONGs when encrypting/decrypting strings.