Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

jazzed · 2010-08-28 08:12

lonesock wrote: »

Do you happen to have a version of mall.c precompiled that I can test in Hub RAM?

Try this one. Meanwhile I'll try your suggested mmul changes.

jazzed · 2010-08-28 08:52

@lonesock, both versions seem to multipy ok for dereferencing the end of the block, but something else is happening. I pulled in an old version of math_F4 just to make sure the file I posted works for my own comfort.

testmarch 800 200
0x4e60 - 0x505c [b][color=orange]<-- correct now. 0x505c was 0x4e5c before recommended changes.[/color][/b]
W0^ = Write 0 march up.
W0v = Write 0 march down.
R0^ = Read  0 march up.
R0v = Read  0 march down.
W0^ 0x00004e60 0x00000000 0x55555555
W0^ 0x00005260 0x00000100 0x55555555
R0^W1^ 0x00004e60 0xaaaaaaaa
R0^W1^ 0x00005260 0xaaaaaaaa
R1vW0v 0x00004e5f  [b][color=orange]<-- this is wrong though, it should be R1vW0v 0x00004e60[/color][/b]
Error @ 0x4e5f[17f] Expected 0xaaaaaaaa Received 0x00000809

00004e5f: 0x00000809 0x55555555 0x55555555 0x55555555

lonesock · 2010-08-28 12:11

Found it. My version of mult didn't handle the top 32 bits. I'm not sure how to do a 64-bit result without a worst case that is slower than the original (zog 1.3) implementation. I think we should just punt for now, and revert to the 1.3 multiplication code, and I'll look at this entire math block hopefully this coming week.

Sorry for the confusion,
Jonathan

EDIT: Does ZPU even support returning the top 32 bits of a 64 bit multiplication? If not, I can strip out some of the logic in that math block. You would only need 32x32=>32, 32/32=>32, and 32remainder32=>32, right? And it currently seems to be all set for signed multiplication...is that what ZPU expects?

Heater. · 2010-08-28 12:36

Hey Guys,

I have a working v1.5 here now. At least mall(function) and fibo are working from HUB and VMCog.

Basically I have adopted Jazzed's SDRAM version whole sale, back fitted the v1.3 math_F4 and tweaked a few details around.

I want to make some fixes around the ZPU RAM - HUB RAM - I/O mapping in read/write_long then if nothing else comes up that is version 1.5.

Just now I have to eat some good food and drink some good wine:) Perhaps I can post 1.5 tomorrow afternoon.

Heater. · 2010-08-28 18:08

Lonesock:

Does ZPU even support returning the top 32 bits of a 64 bit multiplication?

The best reference we have for zpu_mult is ZyLin's simulator Java code. In there MULT is implimented as:

        pushIntStack(popIntStack() * popIntStack());

So, what we are looking at is signed 32 bit multiplication with the top 32 bits discarded. There aren't any longer integer operations in the ZPU.

The simulators DIV is:

                            int a;
                            int b;
                            a = popIntStack();
                            b = popIntStack();
                            if (b == 0)
                            {
                                throw new CPUException();
                            }
                            pushIntStack(a / b);

With MOD looking much the same.

Heater. · 2010-08-28 18:32

Out of curiosity I had to look at what zpu-gcc produces for integer multiplication, here is the listing:

int multest(int a, int b)
{
     622:       ff              im -1
     623:       3d              pushspadd
     624:       0d              popsp

00000625 <.LM17>:
/home/michael/zog_v1_5/test/hello/hello.c:59
    return(a * b);
     625:       73              loadsp 12
     626:       75              loadsp 20
     627:       29              mult
     628:       80              im 0
     629:       0c              store

0000062a <.LM18>:
/home/michael/zog_v1_5/test/hello/hello.c:60
}
     62a:       83              im 3
     62b:       3d              pushspadd
     62c:       0d              popsp
     62d:       04              poppc

Changing to longs instead of ints we get the same code.

Changing to unsigned we get the same code again.

Changing to short ints we get:

short multest(short a, short b)
{
     622:       fd              im -3
     623:       3d              pushspadd
     624:       0d              popsp
     625:       75              loadsp 20
     626:       90              im 16
     627:       2b              ashiftleft
     628:       77              loadsp 28
     629:       90              im 16
     62a:       2b              ashiftleft

0000062b <.LM17>:
/home/michael/zog_v1_5/test/hello/hello.c:59
    return(a * b);
     62b:       71              loadsp 4
     62c:       90              im 16
     62d:       2c              ashiftright
     62e:       71              loadsp 4
     62f:       90              im 16
     630:       2c              ashiftright
     631:       29              mult
     632:       70              loadsp 0
     633:       90              im 16
     634:       2b              ashiftleft
     635:       70              loadsp 0
     636:       90              im 16
     637:       2c              ashiftright
     638:       80              im 0
     639:       0c              store
     63a:       51              storesp 4
     63b:       52              storesp 8
     63c:       55              storesp 20
     63d:       53              storesp 12

0000063e <.LM18>:
/home/michael/zog_v1_5/test/hello/hello.c:60
}

So 16 bit working on Zog is going to tedious.

Note the "IM 0" and "STORE" used to return the results. Results are returned via the fake register at location zero in memory. This is a pain for running from read only memory. I think it may be possible to relocate it though.

jazzed · 2010-08-28 18:46

Did you try "long long" ? Many 32 bit machines use that for 64 bits.

Heater. · 2010-08-28 18:59

Ah yes, what am I thinking?

The long long multiplication gets a bit, well, long:

multest():
/home/michael/zog_v1_5/test/hello/hello.c:58

long long multest(long long a, long long b)
{
     622:       f7              im -9
     623:       3d              pushspadd
     624:       0d              popsp
     625:       7b              loadsp 44

00000626 <.LM17>:
/home/michael/zog_v1_5/test/hello/hello.c:59
    return(a / b);
     626:       7f              loadsp 60
     627:       61              loadsp 68
     628:       59              storesp 36
     629:       55              storesp 20
     62a:       77              loadsp 28
     62b:       56              storesp 24
     62c:       7d              loadsp 52
     62d:       7f              loadsp 60
     62e:       59              storesp 36
     62f:       53              storesp 12
     630:       77              loadsp 28
     631:       54              storesp 16
     632:       8c              im 12
     633:       3d              pushspadd
     634:       f8              im -8
     635:       05              add
     636:       52              storesp 8
     637:       58              storesp 32
     638:       83              im 3
     639:       cf              im -49
     63a:       3f              callpcrel
     63b:       78              loadsp 32
     63c:       7a              loadsp 40
     63d:       58              storesp 32
     63e:       78              loadsp 32
     63f:       0c              store
     640:       76              loadsp 24
     641:       84              im 4
     642:       19              addsp 36
     643:       0c              store

00000644 <.LM18>:
/home/michael/zog_v1_5/test/hello/hello.c:60
}
     644:       77              loadsp 28
     645:       80              im 0
     646:       0c              store
     647:       8b              im 11
     648:       3d              pushspadd
     649:       0d              popsp
     64a:       04              poppc

The worst thing we can do is unsigned mod and div. They dive off into some lengthy subroutines.

Heater. · 2010-08-29 08:42

Attached is Zog v1.5

This version has:

Adopted Jazzed's 32MByte SD RAM cache.
Adopted Jazzed's use of userdefs.spin for hardware configuration-
Backed out the math_F4 changes of v1.4, now mall.c works again.
Moved all user configurable items to nearer the top of debug_zog.spin.

I did not make any changes to the ZPU memory map, HUB RAM space, I/O space etc. Basically I could not decide on a memory layout.

I would like to have, moving up memory:

1) ZPU RAM space (HUB or external, 32K up to 32M)
2) HUB RAM access space(32KB)
3) COG RAM access space(512 LONGs, 2KB)
4) Memory mapped IO (The rest)

The memory map should be the same for all memory hardware solutions. This way C programs won't need any #defines or such to find HUB, COG, and I/O.

Currently it means the HUB area would start at 32MByte to accommodate Jazzed's SDRAM solution. This would be so even if the actual ZPU RAM available is less when using HUB or smaller external RAM.

Are we likely to want to go bigger? Seems unlikely.

One could define a ZPU RAM address space as, say, 1GByte but then ZPU needs more IM instructions to build the bigger addresses.

Any ideas?

jazzed · 2010-08-29 10:47

Looks good Heater

On memory size: I haven't tested anything other than 32MB yet, but theoretically one SDRAM chip can be 8MB, 16MB, 32MB, 64MB, or even 128MB. Here's a list of what's currently available.

TSOP 54, 3.3V SDRAM Part Numbers on Digikey

MT48LC8M8A2TG-75    64M   (8M x 8)
HYB39S128800FE-7    128M (16M x 8)
HYI39S128800FE-7    128M (16M x 8)
IS42S81600E-7TL     128M (16M x 8)
MT48LC32M8A2P-75    256M (32M x 8)
MT48LC32M8A2P-7E    256M (32M x 8)
IS42S83200D-7TL     256M (32M x 8) *
MT48LC64M8A2TG-75   512M (64M x 8)

*Note the IS42S83200D-7TL is the only part I've tested so far.

One thing to note about the USE_JCACHED_MEMORY design is the way cache memory is accessed.

The cache driver lives in a separate COG and provides buffers. So zog.spin keeps up with the address range represented by the buffer and accesses data with a single HUB transaction on a cache hit.

I've found one small performance improvement for zog's cache interface, and I think there is another in there.

Heater. · 2010-08-29 11:43

128MB Wow.

Question is: Is it really practical to make use of so much RAM. I mean, despite page tables or caches we are seriously lagging behind in performance compared to what a processor with a memory bus can do. Generating large addresses (many IMs) for data access and function calls slows things a bit more.

As I said, I'd like the memory map of COG, HUB and I/O access to be pretty well fixed for all time and all platforms so that one does not have to "port" C code from place to place.

So we could settle on 32Mb, or 64 or 128 and be done with it forever.

Of course one day this will all have to work on a Prop II with whatever help it has for external RAM in hardware. So perhaps we should push the envelope a bit on Prop I now.

What the heck, I'll throw my vote in for putting HUB space up at 128MB. That should be enough for anyone, as Bill Gates famously didn't say. It should be enough to keep uCLinux happy

Bill Henning · 2010-08-29 11:56

My vote is 256MB max - why?

4x IM = 28 bit constant

2^28 = 256MB

Heater. wrote: »

128MB Wow.

Question is: Is it really practical to make use of so much RAM. I mean, despite page tables or caches we are seriously lagging behind in performance compared to what a processor with a memory bus can do. Generating large addresses (many IMs) for data access and function calls slows things a bit more.

As I said, I'd like the memory map of COG, HUB and I/O access to be pretty well fixed for all time and all platforms so that one does not have to "port" C code from place to place.

So we could settle on 32Mb, or 64 or 128 and be done with it forever.

Of course one day this will all have to work on a Prop II with whatever help it has for external RAM in hardware. So perhaps we should push the envelope a bit on Prop I now.

What the heck, I'll throw my vote in for putting HUB space up at 128MB. That should be enough for anyone, as Bill Gates famously didn't say. It should be enough to keep uCLinux happy

Heater. · 2010-08-29 12:23

Bill, very compelling argument for 256M.

256MB = $10000000

The ZPU tool chain uses:

UART TX = $80000024
UART TX = $80000028
TIMER = $80000100

So it will all fit nicely.

jazzed · 2010-08-29 14:43

Thought I would experiment with in-zog one-line cache management as implemented in zog_v1_5 (extra-zog SDRAM Cache management is direct-mapped 128 line). The result is about 200ms improvements in fibo(20) and fibo(18) is under 1 second. Now this could just be tuning for fibo performance, but if my speculation is right, I think other programs would benefit. If nothing else, the small changes save a few longs in zog.spin.

Starting SD driver...0000FFFF
Mounting SD...00000000
Booting fibo.bin
00000000

Reading image... 17055 Bytes Loaded.
Done

Clearing bss: ....
Running Program!
fibo(00) = 000000 (00000ms)
fibo(01) = 000001 (00000ms)
fibo(02) = 000001 (00000ms)
fibo(03) = 000002 (00000ms)
fibo(04) = 000003 (00001ms)
fibo(05) = 000005 (00001ms)
fibo(06) = 000008 (00002ms)
fibo(07) = 000013 (00004ms)
fibo(08) = 000021 (00007ms)
fibo(09) = 000034 (00012ms)
fibo(10) = 000055 (00020ms)
fibo(11) = 000089 (00034ms)
fibo(12) = 000144 (00055ms)
fibo(13) = 000233 (00089ms)
fibo(14) = 000377 (00145ms)
fibo(15) = 000610 (00234ms)
fibo(16) = 000987 (00379ms)
fibo(17) = 001597 (00615ms)
fibo(18) = 002584 (00996ms)
fibo(19) = 004181 (01611ms)
fibo(20) = 006765 (02602ms)
fibo(21) = 010946 (04209ms)
fibo(22) = 017711 (06818ms)
fibo(23) = 028657 (11045ms)
fibo(24) = 046368 (17871ms)
fibo(25) = 075025 (28891ms)
fibo(26) = 121393 (05754ms)

32MB SDRAM with 80MHz Propeller Clock.

Cheers.

Heater. · 2010-08-29 22:33

I have not dared to look into VMCog or SdramCache code much.

Could some one briefly state the difference between VMCog's virtual working pages and SdramCache's cache lines?

As I see it ZPU memory access is constantly thrashing between code fetch and stack read/write at substantially different addresses. Then there is access to the programs actual data areas and the mysterious ZPU pseudo registers down at address zero.

So the smallest program needs at least 4 different buffers (VM pages or cache lines) in order to prevent constant thrashing of buffers between these 4 areas. As the program and data gets bigger further buffers are required to prevent thrashing between different code (and data) areas.

We should be careful comparing speeds using that fbo test. The speed of execution programs under VMCog depends heavily on the number of pages in the working set. The fibo loop itself is very small. This leads to an interesting effect on execution time:

fibo(20) 30 pages : 2782ms
fibo(20) 20 pages : 2782ms
fibo(20) 4 pages : 2782ms.

All the same?!!

Not quite, that fibo program only measures the fibo function execution time not the surrounding test loop and result printing. The fibo function is very small and as no data so it fits in 4 pages and runs fast. The program itself gets visibly much slower with decreasing pages.

jazzed · 2010-08-30 01:55

Heater. wrote: »

Could some one briefly state the difference between VMCog's virtual working pages and SdramCache's cache lines?

A VMCOG page is 512 bytes. An SdramCache line is 64 bytes.
VMCOG has a replacement policy, SdramCache does not.

VMCOG uses 2.5MB/s SDRAM burst read/write.
SdramCache uses 10MB/s burst read/write.

VMCOG can read a 512 byte SDRAM page in about 102us.
SdramCache can read a 64 byte page in less than 7us.

I chose to implement a separate interface because:
1) There is no room in VMCOG for 10MB/s burst code.
2) VMCOG is limits usable external memory to 64KB.
3) VMCOG apparently can not run fibo 26 to completion.
(surely this can be fixed*).

I recognized fibo performance testing is not especially useful.
Once I used a 256 byte cache line for testing. Fibo calculation
performance improved a bit, but everything else degraded.

I've run drhystone tests with SDRAM, but results are not obvious.

Heater. · 2010-08-30 03:07

Are you saying that SdramCache only has one 64 byte cache buffer?'

i.e. it has to refill the buffer when activity moves from a code fetch to a stack access for example?

jazzed · 2010-08-30 06:55

SdramCache has 256 buffers in a 16KB cache.

Zog.spin selects the buffer being used. If the current buffer contains the data, one HUB operation fetches the data after some comparisons (around 9 instructions). A new buffer is selected if the required address is not in the current buffer's range (around 29 instructions). When a new buffer is selected, it is up to the SdramCache.spin to deliver the buffer in tact and will swap a buffer if necessary (up to 15us at 80MHz, potentially 9us on reads).

I've considered exploring alternatives to the current method when I have time just to compare performance. Using a per data item request model with the current SdramCache driver, the best case cache fetch will be 8 instructions + SdramCache overhead of about 16 instructions.

My cycle counting could be wrong and that's why I'm willing to experiment. However, I don't expect the alternative to perform any better than what's there now.

Basically it's a choice of always taking 24 instructions -vs- a mix of 9 or 29 instructions on a cache hit.

lonesock · 2010-08-30 17:33

I took zog 1.5, stripped out the math and put in a faster and smaller version of multiply, and a slightly faster and smaller version of divide & modulus. This version also has the special cases for loadsp and storesp for the high bit, as well as the special cases for when offset = 0.

I'm also including the file I used for testing the multiply and divide and modulus routines against the SPIN equivalents, so you can verify it separately.

Jonathan

EDIT: added the 'and smaller' parts.

Heater. · 2010-08-31 01:56

Lonesock: No time to look into your code but a quick run shows almost 4% performance gain running from HUB and a tad more than 2% running from VMCog. Excellent:)

Jazzed, when are we going to see a self contained 32MB Propeller PC board, DracBlade style?

jazzed · 2010-08-31 08:50

Heater, the first generally available 32MB SDRAM board will be Gadget Gangster Propeller Platform compatible. A Propeller Single Board Computer will follow that.

The Gadget Gangster board is in FAB now and will allow a Propeller Computer solution with Keyboard, Mouse, Serial Port, TV, and uSD card.

I've spent most of this last week working on the I2C Keyboard/Mouse controller code. Hopefully by the time the FAB comes, I'll have the I2C solution ready. I have plenty of room for the controller code in the atTiny85.

A VGA connector is available on the FAB as a stuffing option. It is mainly intended for VGA graphics experimentation and is not usable with uSD without cut/jump rework. The VGA and uSD sections could be reworked to provide a black/white or some other grey-scale video with access to uSD. I've considered adding another latch to free more pins, but that's just floating in the air like vapor now.

lonesock · 2010-08-31 11:21

I just noticed something on zpu_*shift*: tos is ANDed with $3F. $3F is 63, so I'm guessing $1F was desired, but, SHL, SHR, and SAR all only use the low 5 bits of S anyway to shift D, so the AND is unnecessary.

Jonathan

Heater. · 2010-08-31 12:11

Interesting.

In the ZPU Java simulator they have ANDed with 0x3f so I just blindly used it in Zog.

Are you sure the Props shifts only use 5 bits of the src field?
According my prop manual:

INSTR ZCRI CON DEST    SRC
001011 001i 1111 ddddddddd sssssssss

lonesock · 2010-08-31 12:22

Here's my test results using shr in PASM:

Beginning test:
-1153374642 >> 0 = -1153374642
-1153374642 >> 1 = 1570796327
-1153374642 >> 2 = 785398163
-1153374642 >> 3 = 392699081
-1153374642 >> 4 = 196349540
-1153374642 >> 5 = 98174770
-1153374642 >> 6 = 49087385
-1153374642 >> 7 = 24543692
-1153374642 >> 8 = 12271846
-1153374642 >> 9 = 6135923
-1153374642 >> 10 = 3067961
-1153374642 >> 11 = 1533980
-1153374642 >> 12 = 766990
-1153374642 >> 13 = 383495
-1153374642 >> 14 = 191747
-1153374642 >> 15 = 95873
-1153374642 >> 16 = 47936
-1153374642 >> 17 = 23968
-1153374642 >> 18 = 11984
-1153374642 >> 19 = 5992
-1153374642 >> 20 = 2996
-1153374642 >> 21 = 1498
-1153374642 >> 22 = 749
-1153374642 >> 23 = 374
-1153374642 >> 24 = 187
-1153374642 >> 25 = 93
-1153374642 >> 26 = 46
-1153374642 >> 27 = 23
-1153374642 >> 28 = 11
-1153374642 >> 29 = 5
-1153374642 >> 30 = 2
-1153374642 >> 31 = 1
-1153374642 >> 32 = -1153374642
-1153374642 >> 33 = 1570796327
-1153374642 >> 34 = 785398163
-1153374642 >> 35 = 392699081
-1153374642 >> 36 = 196349540
-1153374642 >> 37 = 98174770
-1153374642 >> 38 = 49087385
-1153374642 >> 39 = 24543692
-1153374642 >> 40 = 12271846
-1153374642 >> 41 = 6135923
-1153374642 >> 42 = 3067961
-1153374642 >> 43 = 1533980
-1153374642 >> 44 = 766990
-1153374642 >> 45 = 383495
-1153374642 >> 46 = 191747
-1153374642 >> 47 = 95873
-1153374642 >> 48 = 47936
-1153374642 >> 49 = 23968
-1153374642 >> 50 = 11984
-1153374642 >> 51 = 5992
-1153374642 >> 52 = 2996
-1153374642 >> 53 = 1498
-1153374642 >> 54 = 749
-1153374642 >> 55 = 374
-1153374642 >> 56 = 187
-1153374642 >> 57 = 93
-1153374642 >> 58 = 46
-1153374642 >> 59 = 23
-1153374642 >> 60 = 11
-1153374642 >> 61 = 5
-1153374642 >> 62 = 2
-1153374642 >> 63 = 1

Jonathan

Heater. · 2010-08-31 12:51

OK. In it goes. Good catch.

Heater. · 2010-08-31 15:27

Attached: Zog v1.6.

This has Lonesock's multiply and divide routines and his other optimization tweaks.

I have set up the memory map for ZPU code as:

$00000000 to $0FFFFFFF Normal RAM (256MB) for ZPU code, data, stack. 
$10000000 to $10007FFF Maps to  HUB RAM
$10008000 to $100087FF Maps to ZPU COGs memory.
$10008800 to $FFFFFFFF Memory mapped IO

Note that the is no checking of memory addresses so for example an out of range write when running from HUB may well damage any Spin/PASM code running there.

The map allows access to all COG locations so programs running under ZOG can modify the ZPU instruction set and state at runtime for some interesting effects:)

Only 30 LONGs left in the ZPU COG now with SdRamCache.

lonesock · 2010-08-31 19:02

There are shorter division routines, they're just slower, but I'd be happy to stub one in if you'd like. And if you need, feel free to undefine USE_FASTER_MULT, that will save you 5 longs. (It just means that a*b will not execute at the same speed as b*a, but both will be faster than the old multiply.)

Jonathan.

Heater. · 2010-08-31 23:04

Lonesock,

I think we can stick with the fast math code unless we have to really scrape for LONGs at some point but that is looking unlikely.

30 LONGS should be enough to add some handling for the Propellers LOCKs.

What other Prop features is Zog missing support for?

What else would it be useful to squeeze into those remaining LONGs?

One thing would be unsigned divide and modulus operations. zpu-gcc inserts lengthy subroutines for those ops and it would be better to have the COG do it directly. It's would be a bit messy as we would have to use some currently unused opcodes or, preferably SYSCALL, and the provide some functions C/assembler to use it.

Zog needs to be able to handle an interrupt. If only for a timer tick. I'd like to see FreeRTOS running one day not to mention uCLinux:)

I just recently discovered that there is a VHDL implementation of ZPU that includes an interrupt and a new instruction "POPINT". The idea being that an interrupt automatically masks further interrupts. When the handler is done POPINT pops the return address and clears the interrupt mask.

Floating point support can be added without changes to the Zog interpreter. Just use float32 PASM code with a C wrapper and provide C functions to use it that override those provided by GCC's soft-float library. http://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html Perhaps those unsigned mul, div, mod functions could be included in the soft-float COG.

Bill Henning · 2010-09-01 16:35

heater,

Can you try the 0.981 I just posted to the vmcog thread? I think it might fix the fibo problem...

David found a bug - I used an 'andn' where I should have used an 'and' ... sheesh ... plus a performance boost for pre-setting counters!

Let me know...

jazzed · 2010-09-01 17:57

@Bill. Cool. I'll try to integrate SDRAM code with the new VMCOG tomorrow.

Meanwhile, I have some new test results. I found a way to cut out a window miss in the SdramCache.spin code and decided to chop out some other things that were giving me a headache.

Recent changes have shaved 12 minutes off the 16MB psrandom memory test (was 1:22, now 1:10 H:MM). Fibo 20 now runs in 2383ms and was 2825ms (all tests with an 80MHz system clock). All ZOG code remains the same as before. Kernel and memory enhancements are the only differences.

There are more longs free in zog.spin now. I'll post new versions of zog.spin and SdramCache.spin later.

--Steve

Zog - A ZPU processor core for the Prop + GNU C, C++ and FORTRAN.Now replaces S

Comments