@lonesock, both versions seem to multiply OK for dereferencing the end of the block, but something else is happening. I pulled in an old version of math_F4 just to make sure the file I posted works, for my own comfort.
testmarch 800 200
0x4e60 - 0x505c [b][color=orange]<-- correct now. 0x505c was 0x4e5c before recommended changes.[/color][/b]
W0^ = Write 0 march up.
W0v = Write 0 march down.
R0^ = Read 0 march up.
R0v = Read 0 march down.
W0^ 0x00004e60 0x00000000 0x55555555
W0^ 0x00005260 0x00000100 0x55555555
R0^W1^ 0x00004e60 0xaaaaaaaa
R0^W1^ 0x00005260 0xaaaaaaaa
R1vW0v 0x00004e5f [b][color=orange]<-- this is wrong though, it should be R1vW0v 0x00004e60[/color][/b]
Error @ 0x4e5f[17f] Expected 0xaaaaaaaa Received 0x00000809
00004e5f: 0x00000809 0x55555555 0x55555555 0x55555555
Found it. My version of mult didn't handle the top 32 bits. I'm not sure how to do a 64-bit result without a worst case that is slower than the original (zog 1.3) implementation. I think we should just punt for now, and revert to the 1.3 multiplication code, and I'll look at this entire math block hopefully this coming week.
Sorry for the confusion,
Jonathan
EDIT: Does ZPU even support returning the top 32 bits of a 64 bit multiplication? If not, I can strip out some of the logic in that math block. You would only need 32x32=>32, 32/32=>32, and 32remainder32=>32, right? And it currently seems to be all set for signed multiplication...is that what ZPU expects?
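A quick sanity check on that question: for 32x32=>32 the sign handling doesn't actually matter, because the low 32 bits of the product are the same whether the operands are treated as signed or unsigned. A minimal C illustration (not Zog code, just the arithmetic):

```c
#include <stdint.h>

/* Low 32 bits of a 32x32 product, computed two ways: with signed and
   with unsigned operands. They always agree, because multiplication
   mod 2^32 doesn't care about the sign interpretation. */
uint32_t mul_low_signed(int32_t a, int32_t b)
{
    return (uint32_t)((int64_t)a * b);   /* widen first to avoid C overflow UB */
}

uint32_t mul_low_unsigned(uint32_t a, uint32_t b)
{
    return a * b;                        /* wraps mod 2^32 by definition */
}
```

So a truncating mult needs no sign logic at all; signedness only matters if the top 32 bits are wanted.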
Out of curiosity I had to look at what zpu-gcc produces for integer multiplication, here is the listing:
int multest(int a, int b)
{
622: ff im -1
623: 3d pushspadd
624: 0d popsp
00000625 <.LM17>:
/home/michael/zog_v1_5/test/hello/hello.c:59
return(a * b);
625: 73 loadsp 12
626: 75 loadsp 20
627: 29 mult
628: 80 im 0
629: 0c store
0000062a <.LM18>:
/home/michael/zog_v1_5/test/hello/hello.c:60
}
62a: 83 im 3
62b: 3d pushspadd
62c: 0d popsp
62d: 04 poppc
Changing to longs instead of ints we get the same code.
Changing to unsigned we get the same code again.
Changing to short ints we get:
short multest(short a, short b)
{
622: fd im -3
623: 3d pushspadd
624: 0d popsp
625: 75 loadsp 20
626: 90 im 16
627: 2b ashiftleft
628: 77 loadsp 28
629: 90 im 16
62a: 2b ashiftleft
0000062b <.LM17>:
/home/michael/zog_v1_5/test/hello/hello.c:59
return(a * b);
62b: 71 loadsp 4
62c: 90 im 16
62d: 2c ashiftright
62e: 71 loadsp 4
62f: 90 im 16
630: 2c ashiftright
631: 29 mult
632: 70 loadsp 0
633: 90 im 16
634: 2b ashiftleft
635: 70 loadsp 0
636: 90 im 16
637: 2c ashiftright
638: 80 im 0
639: 0c store
63a: 51 storesp 4
63b: 52 storesp 8
63c: 55 storesp 20
63d: 53 storesp 12
0000063e <.LM18>:
/home/michael/zog_v1_5/test/hello/hello.c:60
}
So 16 bit working on Zog is going to be tedious.
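Those im 16 / ashiftleft / ashiftright pairs in the listing are 16-bit sign extension on a 32-bit stack slot. In C terms (illustrative only, not compiler output):

```c
#include <stdint.h>

/* im 16; ashiftleft; im 16; ashiftright -- truncate a stack slot to
   16 bits and sign-extend it back, which is how zpu-gcc keeps shorts
   honest on a 32-bit stack. The left shift is done on an unsigned
   value to sidestep C's undefined signed-shift overflow. */
int32_t sext16(int32_t x)
{
    return (int32_t)((uint32_t)x << 16) >> 16;
}
```

Every short operand and every short result pays for one of these round trips, which is where all the extra instructions come from.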
Note the "IM 0" and "STORE" used to return the results. Results are returned via the fake register at location zero in memory. This is a pain for running from read only memory. I think it may be possible to relocate it though.
Adopted Jazzed's 32MByte SD RAM cache.
Adopted Jazzed's use of userdefs.spin for hardware configuration.
Backed out the math_F4 changes of v1.4, now mall.c works again.
Moved all user configurable items to nearer the top of debug_zog.spin.
I did not make any changes to the ZPU memory map, HUB RAM space, I/O space etc. Basically I could not decide on a memory layout.
I would like to have, moving up memory:
1) ZPU RAM space (HUB or external, 32K up to 32M)
2) HUB RAM access space(32KB)
3) COG RAM access space(512 LONGs, 2KB)
4) Memory mapped IO (The rest)
The memory map should be the same for all memory hardware solutions. This way C programs won't need any #defines or such to find HUB, COG, and I/O.
Currently it means the HUB area would start at 32MByte to accommodate Jazzed's SDRAM solution. This would be so even if the actual ZPU RAM available is less when using HUB or smaller external RAM.
Are we likely to want to go bigger? Seems unlikely.
One could define a ZPU RAM address space as, say, 1GByte but then ZPU needs more IM instructions to build the bigger addresses.
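For a sense of that IM cost, here is a sketch of the IM semantics as I read the ZPU docs: the first IM pushes a sign-extended 7-bit immediate, and each consecutive IM shifts the top of stack left 7 and ORs in 7 more bits. So every extra 7 bits of address space costs one more IM per constant:

```c
#include <stdint.h>

/* First IM of a run: push a sign-extended 7-bit immediate. */
uint32_t im_first(uint8_t imm7)
{
    uint32_t v = imm7 & 0x7F;
    return (imm7 & 0x40) ? (v | 0xFFFFFF80u) : v;   /* sign-extend bit 6 */
}

/* Consecutive IM: shift the accumulator left 7, OR in 7 more bits. */
uint32_t im_next(uint32_t tos, uint8_t imm7)
{
    return (tos << 7) | (imm7 & 0x7F);
}

/* Run a sequence of consecutive IMs and return the built constant. */
uint32_t im_seq(const uint8_t *ops, int n)
{
    uint32_t tos = im_first(ops[0]);
    for (int i = 1; i < n; i++)
        tos = im_next(tos, ops[i]);
    return tos;
}
```

For example the single `ff im -1` in the listing earlier yields 0xFFFFFFFF in one instruction, while an address like $10008000 takes five IMs.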
On memory size: I haven't tested anything other than 32MB yet, but theoretically one SDRAM chip can be 8MB, 16MB, 32MB, 64MB, or even 128MB. Here's a list of what's currently available.
TSOP 54, 3.3V SDRAM Part Numbers on Digikey
MT48LC8M8A2TG-75 64M (8M x 8)
HYB39S128800FE-7 128M (16M x 8)
HYI39S128800FE-7 128M (16M x 8)
IS42S81600E-7TL 128M (16M x 8)
MT48LC32M8A2P-75 256M (32M x 8)
MT48LC32M8A2P-7E 256M (32M x 8)
IS42S83200D-7TL 256M (32M x 8) *
MT48LC64M8A2TG-75 512M (64M x 8)
*Note the IS42S83200D-7TL is the only part I've tested so far.
One thing to note about the USE_JCACHED_MEMORY design is the way cache memory is accessed.
The cache driver lives in a separate COG and provides buffers. So zog.spin keeps up with the address range represented by the buffer and accesses data with a single HUB transaction on a cache hit.
I've found one small performance improvement for zog's cache interface, and I think there is another in there.
Question is: is it really practical to make use of so much RAM? I mean, despite page tables or caches we are seriously lagging behind in performance compared to what a processor with a memory bus can do. Generating large addresses (many IMs) for data access and function calls slows things a bit more.
As I said, I'd like the memory map of COG, HUB and I/O access to be pretty well fixed for all time and all platforms so that one does not have to "port" C code from place to place.
So we could settle on 32MB, or 64 or 128, and be done with it forever.
Of course one day this will all have to work on a Prop II with whatever help it has for external RAM in hardware. So perhaps we should push the envelope a bit on Prop I now.
What the heck, I'll throw my vote in for putting HUB space up at 128MB. That should be enough for anyone, as Bill Gates famously didn't say. It should be enough to keep uCLinux happy.
Thought I would experiment with in-zog one-line cache management as implemented in zog_v1_5 (extra-zog SDRAM Cache management is direct-mapped 128 line). The result is about 200ms improvements in fibo(20) and fibo(18) is under 1 second. Now this could just be tuning for fibo performance, but if my speculation is right, I think other programs would benefit. If nothing else, the small changes save a few longs in zog.spin.
I have not dared to look into VMCog or SdramCache code much.
I took zog 1.5, stripped out the math and put in a faster and smaller version of multiply, and a slightly faster and smaller version of divide & modulus. This version also has the special cases for loadsp and storesp for the high bit, as well as the special cases for when offset = 0.
I'm also including the file I used for testing the multiply and divide and modulus routines against the SPIN equivalents, so you can verify it separately.
Lonesock: No time to look into your code but a quick run shows almost 4% performance gain running from HUB and a tad more than 2% running from VMCog. Excellent:)
Jazzed, when are we going to see a self-contained 32MB Propeller PC board, DracBlade style?
Heater, the first generally available 32MB SDRAM board will be Gadget Gangster Propeller Platform compatible. A Propeller Single Board Computer will follow that.
The Gadget Gangster board is in FAB now and will allow a Propeller Computer solution with Keyboard, Mouse, Serial Port, TV, and uSD card.
I've spent most of this last week working on the I2C Keyboard/Mouse controller code. Hopefully by the time the FAB comes, I'll have the I2C solution ready. I have plenty of room for the controller code in the ATtiny85.
A VGA connector is available on the FAB as a stuffing option. It is mainly intended for VGA graphics experimentation and is not usable with uSD without cut/jump rework. The VGA and uSD sections could be reworked to provide a black/white or some other grey-scale video with access to uSD. I've considered adding another latch to free more pins, but that's just floating in the air like vapor now.
I just noticed something on zpu_*shift*: tos is ANDed with $3F. $3F is 63, so I'm guessing $1F was desired; but SHL, SHR, and SAR all use only the low 5 bits of S to shift D anyway, so the AND is unnecessary.
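To illustrate (plain C, not the zog.spin source): masking the count with $3F leaves counts 32..63 unclamped, while the hardware's 5-bit behaviour is the same as masking with $1F:

```c
#include <stdint.h>

/* Model of a Prop shift: SHL/SHR/SAR look only at the low 5 bits of S,
   so an out-of-range count like 33 behaves as a shift by 1. */
uint32_t prop_shl(uint32_t x, uint32_t count)
{
    return x << (count & 0x1F);
}
```

A $3F mask would pass 33 through as 33, so it neither matches the hardware nor clamps anything useful; the $1F mask (or no mask, on the Prop) does.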
This has Lonesock's multiply and divide routines and his other optimization tweaks.
I have set up the memory map for ZPU code as:
$00000000 to $0FFFFFFF Normal RAM (256MB) for ZPU code, data, stack.
$10000000 to $10007FFF Maps to HUB RAM
$10008000 to $100087FF Maps to ZPU COGs memory.
$10008800 to $FFFFFFFF Memory mapped IO
Note that there is no checking of memory addresses, so for example an out-of-range write when running from HUB may well damage any Spin/PASM code running there.
The map allows access to all COG locations so programs running under ZOG can modify the ZPU instruction set and state at runtime for some interesting effects:)
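For reference, an illustrative decode of that map in C (this is not the actual zog.spin read/write_long code, just the ranges listed above):

```c
#include <stdint.h>

enum region { ZPU_RAM, HUB_RAM, COG_RAM, MMIO };

/* Pick a region by address range, matching the memory map above. */
enum region decode(uint32_t addr)
{
    if (addr <  0x10000000u) return ZPU_RAM;   /* 256MB for code/data/stack */
    if (addr <= 0x10007FFFu) return HUB_RAM;   /* 32KB HUB window           */
    if (addr <= 0x100087FFu) return COG_RAM;   /* 512 LONGs of COG window   */
    return MMIO;                               /* everything above          */
}
```

One nice property: since the top nibble of the HUB/COG windows is a single set bit, the common case (normal RAM) is decidable from one compare.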
Only 30 LONGs left in the ZPU COG now with SdramCache.
There are shorter division routines, they're just slower, but I'd be happy to stub one in if you'd like. And if you need, feel free to undefine USE_FASTER_MULT, that will save you 5 longs. (It just means that a*b will not execute at the same speed as b*a, but both will be faster than the old multiply.)
I think we can stick with the fast math code unless we have to really scrape for LONGs at some point but that is looking unlikely.
30 LONGs should be enough to add some handling for the Propeller's LOCKs.
What other Prop features is Zog missing support for?
What else would it be useful to squeeze into those remaining LONGs?
One thing would be unsigned divide and modulus operations. zpu-gcc inserts lengthy subroutines for those ops and it would be better to have the COG do it directly. It would be a bit messy, as we would have to use some currently unused opcodes or, preferably, SYSCALL, and then provide some C/assembler functions to use it.
Zog needs to be able to handle an interrupt. If only for a timer tick. I'd like to see FreeRTOS running one day not to mention uCLinux:)
I just recently discovered that there is a VHDL implementation of ZPU that includes an interrupt and a new instruction "POPINT". The idea being that an interrupt automatically masks further interrupts. When the handler is done POPINT pops the return address and clears the interrupt mask.
Floating point support can be added without changes to the Zog interpreter. Just use float32 PASM code with a C wrapper and provide C functions to use it that override those provided by GCC's soft-float library. http://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html Perhaps those unsigned mul, div, mod functions could be included in the soft-float COG.
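For example, the override could look like this (a sketch only: __mulsf3 is the single-precision multiply entry point named in the GCC soft-float docs linked above, and float32_mul is a hypothetical name for the PASM wrapper, stubbed here in plain C):

```c
/* Stand-in for the float32 COG call; on real hardware this would hand
   the operands to the PASM driver. The name is hypothetical. */
float float32_mul(float a, float b)
{
    return a * b;
}

/* Providing our own __mulsf3 makes zpu-gcc route every float multiply
   through it instead of pulling in its soft-float library routine. */
float __mulsf3(float a, float b)
{
    return float32_mul(a, b);
}
```

The same trick would cover __addsf3, __divsf3 and friends, one thin wrapper each.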
@Bill. Cool. I'll try to integrate SDRAM code with the new VMCOG tomorrow.
Meanwhile, I have some new test results. I found a way to cut out a window miss in the SdramCache.spin code and decided to chop out some other things that were giving me a headache.
Recent changes have shaved 12 minutes off the 16MB psrandom memory test (was 1:22, now 1:10 H:MM). Fibo 20 now runs in 2383ms and was 2825ms (all tests with an 80MHz system clock). All ZOG code remains the same as before. Kernel and memory enhancements are the only differences.
There are more longs free in zog.spin now. I'll post new versions of zog.spin and SdramCache.spin later.
I have a working v1.5 here now. At least mall(function) and fibo are working from HUB and VMCog.
Basically I have adopted Jazzed's SDRAM version wholesale, back-fitted the v1.3 math_F4, and tweaked a few details.
I want to make some fixes around the ZPU RAM - HUB RAM - I/O mapping in read/write_long; then, if nothing else comes up, that is version 1.5.
Just now I have to eat some good food and drink some good wine:) Perhaps I can post 1.5 tomorrow afternoon.
The best reference we have for zpu_mult is Zylin's simulator Java code. In there MULT is implemented as: So, what we are looking at is signed 32 bit multiplication with the top 32 bits discarded. There aren't any longer integer operations in the ZPU.
The simulator's DIV is: With MOD looking much the same.
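Putting those simulator semantics in C terms (assuming Java's truncating division and remainder rules, which is what the simulator would inherit):

```c
#include <stdint.h>

/* ZPU MULT/DIV/MOD as plain signed 32-bit C operations: product
   truncated to the low 32 bits, division truncating toward zero,
   remainder taking the dividend's sign. */
int32_t zpu_mult(int32_t a, int32_t b)
{
    return (int32_t)((int64_t)a * b);   /* widen, then drop the top 32 bits */
}

int32_t zpu_div(int32_t a, int32_t b)
{
    return a / b;                       /* truncates toward zero */
}

int32_t zpu_mod(int32_t a, int32_t b)
{
    return a % b;                       /* sign follows the dividend */
}
```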
The long long multiplication gets a bit, well, long:
The worst thing we can do is unsigned mod and div. They dive off into some lengthy subroutines.
4x IM = 28 bit constant
2^28 = 256MB
256MB = $10000000
The ZPU tool chain uses:
UART TX = $80000024
UART TX = $80000028
TIMER = $80000100
So it will all fit nicely.
32MB SDRAM with 80MHz Propeller Clock.
Cheers.
Could some one briefly state the difference between VMCog's virtual working pages and SdramCache's cache lines?
As I see it ZPU memory access is constantly thrashing between code fetch and stack read/write at substantially different addresses. Then there is access to the program's actual data areas and the mysterious ZPU pseudo registers down at address zero.
So the smallest program needs at least 4 different buffers (VM pages or cache lines) in order to prevent constant thrashing of buffers between these 4 areas. As the program and data gets bigger further buffers are required to prevent thrashing between different code (and data) areas.
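For a concrete picture of the thrashing, here is the direct-mapped geometry (128 lines of 64 bytes, per the numbers in this thread; the code is illustrative, not SdramCache itself). Two hot addresses 8KB apart land on the same line and evict each other on every alternation:

```c
#include <stdint.h>

enum { LINE_SHIFT = 6, LINE_COUNT = 128 };   /* 64-byte lines, 128 of them */

/* Which cache line an address maps to, and the tag that must match. */
uint32_t line_of(uint32_t addr) { return (addr >> LINE_SHIFT) % LINE_COUNT; }
uint32_t tag_of(uint32_t addr)  { return addr >> LINE_SHIFT; }
```

With e.g. code fetches at $1000 and stack accesses at $3000, line_of() collides while tag_of() differs, so every alternation is a miss; that is exactly why a working set needs several distinct buffers or lines.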
We should be careful comparing speeds using that fibo test. The speed of execution of programs under VMCog depends heavily on the number of pages in the working set. The fibo loop itself is very small. This leads to an interesting effect on execution time:
fibo(20) 30 pages : 2782ms
fibo(20) 20 pages : 2782ms
fibo(20) 4 pages : 2782ms.
All the same?!!
Not quite: that fibo program only measures the fibo function execution time, not the surrounding test loop and result printing. The fibo function is very small and has no data, so it fits in 4 pages and runs fast. The program itself gets visibly much slower with decreasing pages.
VMCOG has a replacement policy, SdramCache does not.
VMCOG uses 2.5MB/s SDRAM burst read/write.
SdramCache uses 10MB/s burst read/write.
VMCOG can read a 512 byte SDRAM page in about 102us.
SdramCache can read a 64 byte page in less than 7us.
I chose to implement a separate interface because:
1) There is no room in VMCOG for 10MB/s burst code.
2) VMCOG limits usable external memory to 64KB.
3) VMCOG apparently can not run fibo 26 to completion.
(surely this can be fixed*).
I recognize fibo performance testing is not especially useful. Once I used a 256 byte cache line for testing; fibo calculation performance improved a bit, but everything else degraded. I've run Dhrystone tests with SDRAM, but the results are not obvious.
i.e. it has to refill the buffer when activity moves from a code fetch to a stack access for example?
Zog.spin selects the buffer being used. If the current buffer contains the data, one HUB operation fetches the data after some comparisons (around 9 instructions). A new buffer is selected if the required address is not in the current buffer's range (around 29 instructions). When a new buffer is selected, it is up to SdramCache.spin to deliver the buffer intact, swapping a buffer in if necessary (up to 15us at 80MHz, potentially 9us on reads).
I've considered exploring alternatives to the current method when I have time just to compare performance. Using a per data item request model with the current SdramCache driver, the best case cache fetch will be 8 instructions + SdramCache overhead of about 16 instructions.
My cycle counting could be wrong and that's why I'm willing to experiment. However, I don't expect the alternative to perform any better than what's there now.
Basically it's a choice of always taking 24 instructions versus a mix of 9 instructions on a cache hit or 29 on a miss.
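The trade-off can be sketched in C like this (illustrative, not zog.spin; the instruction counts are the ones quoted above):

```c
#include <stdint.h>

enum { BUF_SIZE = 64 };

static uint32_t buf_base = 0xFFFFFFFFu;  /* start with nothing cached  */
static uint8_t  buf[BUF_SIZE];           /* HUB copy from the cache COG */
int misses;

/* Stand-in for asking the cache COG to map a new buffer in. */
static void cache_fill(uint32_t base)
{
    buf_base = base;
    misses++;
}

uint8_t read_byte(uint32_t addr)
{
    if (addr - buf_base >= BUF_SIZE)                 /* ~29-instruction path */
        cache_fill(addr & ~(uint32_t)(BUF_SIZE - 1));
    return buf[addr - buf_base];                     /* ~9-instruction hit   */
}
```

The unsigned subtract-and-compare does the whole range check in one go, which is the kind of trick that keeps the hit path short.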
In the ZPU Java simulator they have ANDed with 0x3f so I just blindly used it in Zog.
Are you sure the Prop's shifts only use 5 bits of the src field?
According to my Prop manual:
Jonathan
Can you try the 0.981 I just posted to the vmcog thread? I think it might fix the fibo problem...
David found a bug - I used an 'andn' where I should have used an 'and' ... sheesh ... plus a performance boost for pre-setting counters!
Let me know...