Catalina 3.3

AntoineDoinel · 2011-10-20 13:41

Martin Hodge wrote: »

You seem to be having more than your share of hardware issues lately. With the Gameduino+ASC and now this. And it all seems to be data corruption of some kind or another. I'd really check your environment for intense RFI/EMI or very noisy power supplies.

It's very possible that I managed to fry the RamBlades somewhat, despite the fact that I've been always behind some form of protection: UPS, USB power from PC, or at the very least the wall wart connected to an extender with a VDR.

Cluso99 · 2011-10-20 14:02

Antoine: Have you downloaded the latest test programs and ZiCog for the RamBlade? The earlier versions were time sensitive. The latest is at the end of the RamBlade thread (link in my signature). Let me know how you go please, perhaps on the RamBlade thread to save polluting this thread.

RossH · 2011-10-20 14:45

Rayman wrote: »

Well, I can sure see that Toledo chess program being a good test for the code generator...

Sure is - but it's very hard to debug!

Rayman wrote: »

BTW: Does Catalina produce a more readable version of the code somewhere in the pipeline?

No - but Code::Blocks does - on the Code::Blocks Plugins menu, select the Source Code Formatter while the file you want to format is open. Doesn't help much in this case, though!

Rayman wrote: »

BTW2: (Maybe showing my ignorance here) I had the "main" call at the top of the program with a call to "my_main" in it. Shouldn't that have produced an error in C, since (without a header), it doesn't know what "my_main" is, since it is at the bottom of the program?

The original C language was very forgiving in this respect (perhaps "lax" would be a better word). If a function is not declared it is assumed to be an external function that returns an "int". This was fixed up in the ANSI C standard, but LCC (like most C compilers) still allows old style C.

If you select the Warn about non-ANSI usage option (in the Miscellaneous Options in the build options) then the program will complain about this (and a lot of other things!)
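As a minimal sketch of the alternative to relying on the old implicit-declaration rule: declaring a prototype before the first call tells the compiler the function's return type and parameters up front, so the definition can still sit at the bottom of the file. The name entry() is a hypothetical stand-in for the main() at the top of Rayman's program, and the return value 42 is purely illustrative.

```c
#include <stdio.h>

/* Prototype: declares my_main() before its first use, so the compiler
   does not have to assume "external function returning int". */
static int my_main(void);

/* Hypothetical stand-in for the main() at the top of the program. */
int entry(void)
{
    return my_main();
}

/* The real work lives at the bottom of the file, as in the original. */
static int my_main(void)
{
    puts("hello from my_main");
    return 42;   /* illustrative value only */
}
```

With the prototype in place, the "Warn about non-ANSI usage" option should have nothing to complain about for this particular call.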

Ross.

Dr_Acula · 2011-10-20 19:01

Hi Ross,

If we can emulate computers from the 1970s and early 1980s, is there any reason we can't get Unix working? Not the new versions like Linux, but go back a few years and start with something much simpler.

I was browsing through some of the old versions here http://minnie.tuhs.org/cgi-bin/utree.pl and it seems to me this is very standard looking C.

Is this crazy talk?

RossH · 2011-10-20 19:16

Dr_Acula wrote: »

Is this crazy talk?

Not quite as crazy as your talk about C#

Some kind of Unix is certainly possible. But what's the point - except perhaps as a novelty?

Ross.

Dr_Acula · 2011-10-20 20:18

But what's the point - except perhaps as a novelty?

How can I explain? Well, back in the 1970s you had the 8080 and that spawned the Z80 and the x86 branch. If we emulate the Z80 then we are emulating a dead end branch. If we go down the x86 branch (eg by emulating a 286) it takes you into DOS and copyright issues with Microsoft. So it appears difficult to take the whole emulation process much further than the early 1980s.

However, researching the Unix path, things seem to have adjusted to many hardware platforms over the years and it seems C is the common thing to a number of platforms. And things seem more open source too. You can then maybe take it more incrementally.

I guess I am thinking of the reply you get asking about Linux: "No, you can't run Linux on a propeller", and to wind the clock back and say why can't you run some of those early Unix versions from the late 1970s, and then port those over to later versions. Especially since that porting process seems to have been documented too. See how far it could go?

Heater. · 2011-10-20 20:39

Dr_A,
I might be wrong but I think that all versions of Unix rely on memory management. That is paged on demand memory, process isolation, virtual memory etc.
Without this we only have a toy Unix.
The Prop does not have it and emulating a memory manager will get you a system so slow it will be hard to tell if it is actually running:)
On the other hand, have a Google around for "one man unix" OMU. That must be runnable on a Prop and some RAM and would be fun.

Dr_Acula · 2011-10-21 05:59

Hi Ross,

I don't seem to be having any luck with creating Catalyst. I've tried from within Codeblocks and I have tried following the instructions in the manual. Then I tried your instructions in post #203 and this is the printout

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Administrator>cd c:\program files\catalina

C:\Program Files\Catalina>cd utilities

C:\Program Files\Catalina\utilities>build_all DRACBLADE TV CACHED_1K
Could Not Find C:\Program Files\Catalina\utilities\*.binary

   ==================
   Building Utilities
   ==================



C:\Program Files\Catalina\utilities>call build_tmp_var DRACBLADE TV CACHED_1K
'build_tmp_var' is not recognized as an internal or external command,
operable program or batch file.

C:\Program Files\Catalina\utilities>homespun Payload_XMM_Loader -L "C:\Program Files\Catalina\target" -b -o XMM  -d
'homespun' is not recognized as an internal or external command,
operable program or batch file.
Copying ...

*.binary
The system cannot find the file specified.
        0 file(s) copied.
'unset' is not recognized as an internal or external command,
operable program or batch file.
'unset' is not recognized as an internal or external command,
operable program or batch file.

C:\Program Files\Catalina\utilities>

Am I missing some sort of command that allows windows to find files in other directories? There is one for Catalina

PATH=C:\Program Files\Catalina\bin;%PATH%

and I tried running that but the errors are the same. Is there another path instruction or something, or is there some other problem?

This is odd. In the file build_all.bat I see this list of switches at the beginning:

rem now build utilities for specific platforms (if any)

if "%1"=="TRIBLADEPROP" goto tribladeprop
if "%1"=="MORPHEUS" goto morpheus
if "%1"=="C3" goto C3
if "%1"=="SUPERQUAD" goto SUPERQUAD
if "%1"=="RAMPAGE" goto RAMPAGE
if "%1"=="HYBRID" goto hybrid
if "%1"=="HYDRA" goto hydra

but no DRACBLADE

Many thanks for your help.

RossH · 2011-10-21 06:16

Dr_Acula wrote: »

Hi Ross,

I don't seem to be having any luck with creating Catalyst. I've tried from within Codeblocks and I have tried following the instructions in the manual. Then I tried your instructions in post #203 and this is the printout

That's probably because I made a few errors in that post (now corrected).

The first is that I directed you to the utilities folder and not the catalyst folder. The second thing is that I didn't make it clear that you need to be in a Catalina Command Line window (selectable from the Catalina program group, or the desktop Icon of that name) not a normal command window.

You can build Catalyst in a normal command window, but you have to do something like the following (actually, this should work in any window):

cd %LCCDIR%
use_catalina
cd catalyst
build_all DRACBLADE TV CACHED_1K
dir bin

Sorry about the confusion!

Dr_Acula wrote: »

This is odd. In the file build_all.bat I see this list of switches at the beginning:

rem now build utilities for specific platforms (if any)

if "%1"=="TRIBLADEPROP" goto tribladeprop
if "%1"=="MORPHEUS" goto morpheus
if "%1"=="C3" goto C3
if "%1"=="SUPERQUAD" goto SUPERQUAD
if "%1"=="RAMPAGE" goto RAMPAGE
if "%1"=="HYBRID" goto hybrid
if "%1"=="HYDRA" goto hydra

but no DRACBLADE

You don't see the DracBlade (or the RamBlade, or the Demo board, or the ASC etc) because when building the utilities for these platforms the script doesn't need to do anything other than just build XMM.binary (which follows the lines you have shown). The platforms you see listed specially have to do other things as well (or instead).

Ross.

Tor · 2011-10-21 15:11

Heater. wrote: »

Dr_A, I might be wrong but I think that all versions of Unix rely on memory management. That is paged on demand memory, process isolation, virtual memory etc.

Unix didn't originally have memory management because it ran on the DEC PDP computers which didn't support it. So, with an old enough version..

-Tor

Edit: Well, I should say 'no virtual memory' support I think. I'm a bit fuzzy about exactly what the PDP systems supported. But there was no virtual memory for sure, because that came with the VAXen.

(BTW I'm working around my post-forum-upgrade posting problem by inserting html 'br' codes manually.)

Dr_Acula · 2011-10-22 05:36

cd %LCCDIR%
use_catalina
cd catalyst
build_all DRACBLADE TV CACHED_1K
dir bin

That works - thanks Ross!

I made a minor change and added "NTSC" to the command line so it displays better on my TV.

I now have two programs CATTV.EXE and CATVGA.EXE and it goes to the appropriate display. Or program one of those into eeprom.

Catalyst is a nifty operating system.

RossH · 2011-10-22 21:39

Dr_Acula wrote: »

That works - thanks Ross!

I made a minor change and added "NTSC" to the command line so it displays better on my TV.

I now have two programs CATTV.EXE and CATVGA.EXE and it goes to the appropriate display. Or program one of those into eeprom.

Catalyst is a nifty operating system.

Good news!

Ross.

RossH · 2011-10-22 23:13

RossH wrote: »

I have had a bit of a play with the "toledo" chess program. I can almost see why it is not working - I think it may have exposed a bug in the Catalina code generator. I need to do some more work on it to be sure. I hope to have a fix for it on the weekend.

Ross.

Well, Rayman - do you want the good news or the bad news? ...

The good news is that I found the reason toledo chess wouldn't run correctly under Catalina. It took me a while to track it down since it was actually two bugs (the little buggers are ganging up on me now!). One was in the setjmp/longjmp functions and one was in the Catalina code generator itself.

The bad news is that when playing chess against the Propeller, the program can take an hour to figure out each move! I let it run through a couple of moves just to check it was working, but I don't think anyone is ever going to sit through a whole game at these speeds. I think the program just does a brute-force evaluation of all available moves and chooses a random one from amongst the "best" moves (how it figures out "best" I have no idea!). A more sophisticated program would probably run faster.

However, thanks for making me try it, since neither of these bugs would have shown up in a normal C program - these obfuscated C programs are a great way to test out a compiler since they tend to do some really peculiar things (to make them harder to decipher).

Because the bug is deep in the code generator, I need to do some more testing before I'm confident I've not broken anything else. I've also got a lot to tidy up, but I'll try and get a new release out next week.

Ross.

Dr_Acula · 2011-10-22 23:23

Hi Ross,

Can I ask a question or two about how the cache works? Say I have a big program, 200k, stored in external ram. I imagine that this is loaded in bits into the cache.

Firstly, how big are those bits?

Secondly, let's say I have my 200k program, and then in that program I define an array that is bigger than hub, eg 100k (large model, right?) Will the contents of that array also end up being cached?

Cheers, Drac.

RossH · 2011-10-23 00:07

Dr_Acula wrote: »

Hi Ross,

Can I ask a question or two about how the cache works? Say I have a big program, 200k, stored in external ram. I imagine that this is loaded in bits into the cache.

Firstly, how big are those bits?

Secondly, let's say I have my 200k program, and then in that program I define an array that is bigger than hub, eg 100k (large model, right?) Will the contents of that array also end up being cached?

Cheers, Drac.

Hi Dr_A,

The cache works in "pages". The size of each page of cache depends on the total cache size you chose, since the number of such pages is constant (but is platform dependent).

On the DracBlade, an 8k cache has 128 pages of 64 bytes each. A 1k cache has 128 pages of 8 bytes each.

The cache doesn't care anything about what the stuff it is caching is actually used for, or how big it may be in the external RAM. It just caches individual pages.

Let's say we are using an 8k cache - then every possible 64 byte page of external RAM will be mapped to one or another of the 128 pages of cache space - so obviously many different external RAM pages must end up mapped to each cache page. Only one of these external pages can be held in the cache at a time (at least in a simple cache like the one we are using). If you subsequently want to read or write from another external page that maps to the same cache page, the existing page is checked to see if it is "dirty" (i.e. if it has changed since being read). If so, it is written back to the external RAM before the new page is read. Otherwise the new page simply replaces the old page.
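The scheme described above can be sketched in C roughly as follows. This is only an illustration, not Catalina's actual cache code: the modulo mapping from external page to cache page, the 64k xmm[] array standing in for the external RAM, and all the names are assumptions. The 128 pages of 64 bytes match the 8k DracBlade figures given above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_PAGES 128u             /* fixed number of cache pages */
#define PAGE_SIZE 64u              /* 8k cache / 128 pages */
#define XMM_SIZE  (64u * 1024u)    /* hypothetical external RAM size */

static uint8_t xmm[XMM_SIZE];      /* stand-in for the external RAM */

typedef struct {
    uint32_t tag;                  /* external page currently held */
    int      valid;                /* has this slot been loaded yet? */
    int      dirty;                /* changed since being read? */
    uint8_t  data[PAGE_SIZE];
} cache_page;

static cache_page cache[NUM_PAGES];
static unsigned write_backs;       /* dirty pages flushed, for illustration */

/* Return the cache page holding addr, loading it first if necessary. */
static cache_page *lookup(uint32_t addr)
{
    uint32_t ext_page = addr / PAGE_SIZE;
    /* Assumed mapping: many external pages share each cache page. */
    cache_page *p = &cache[ext_page % NUM_PAGES];

    assert(addr < XMM_SIZE);
    if (!p->valid || p->tag != ext_page) {      /* wrong page: a miss */
        if (p->valid && p->dirty) {             /* write the old page back */
            memcpy(&xmm[p->tag * PAGE_SIZE], p->data, PAGE_SIZE);
            write_backs++;
        }
        memcpy(p->data, &xmm[ext_page * PAGE_SIZE], PAGE_SIZE);
        p->tag = ext_page;
        p->valid = 1;
        p->dirty = 0;
    }
    return p;
}

uint8_t cache_read(uint32_t addr)
{
    return lookup(addr)->data[addr % PAGE_SIZE];
}

void cache_write(uint32_t addr, uint8_t value)
{
    cache_page *p = lookup(addr);
    p->data[addr % PAGE_SIZE] = value;
    p->dirty = 1;                  /* must be flushed before being replaced */
}
```

Note that lookup() runs on every single access, hit or miss; in this sketch that check is the only cost a program pays once its working set is entirely in the cache.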

Ross.

Dr_Acula · 2011-10-23 00:19

Very interesting.

The reason I ask is that I am thinking of ways to speed up memory access using 12 propeller pins.

There is the dracblade solution using 3 latches but I think it can be faster using counters. Instead of using 3 latches, replace the lower latch with an 8 bit counter. Synchronous counters might be a bit slow but asynchronous counters like two 74161 chips could work. 74F series if HC is not fast enough (130MHz vs 40MHz).

The pasm loop could be fast. Read byte. Increment clock by toggling a propeller pin. Read byte. etc. No need to change /rd or /wr in that loop.

(Writing might be a bit slower and might need toggling the /wr pin).

So you load up two latches with the address of the 64 bytes to access and then count the counter.

Just musing here...

RossH · 2011-10-23 01:25

Dr_Acula wrote: »

Just musing here...

Muse away - but remember that when a cache is in use, speeding up the underlying XMM RAM will have negligible impact on program speed. That's the very nature of caches - the better your cache, the less benefit there is to speeding up non-cached access.
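One rough way to see this is a back-of-envelope cost model, sketched below. It assumes the average cost of an access is a fixed cache overhead paid every time plus the external RAM cost paid only on a miss; the function name and any numbers fed to it are illustrative, not measured Catalina figures.

```c
/* Average cost of one memory access, in arbitrary time units.
   All three inputs are illustrative, not measured Catalina figures. */
double avg_access_cost(double cache_overhead, double miss_rate,
                       double xmm_page_cost)
{
    /* The overhead is paid on every access; the external RAM is only
       touched when the wanted page is not already in the cache. */
    return cache_overhead + miss_rate * xmm_page_cost;
}
```

For a loop that fits entirely in the cache the miss rate is effectively zero, so the xmm_page_cost term drops out and only the cache overhead is left. Under this model, halving the XMM access time changes nothing for such a program, while reducing the overhead helps every access.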

Unfortunately, the architecture of the Prop I makes the cache overhead quite high. I could modify the XMM kernel to make much better use of the cache, but I'd have to trade this off against something else - and it simply doesn't seem justified while the number of people using XMM RAM on the Prop is so low.

The situation may change when the Prop II arrives - it's too early to say yet.

Ross.

Dr_Acula · 2011-10-23 03:54

Ah, I see. So are you saying that an idealised program that happens to fit in an 8k cache entirely is still too slow?

Is that due to the pasm to hub transfers only happening 1/8th of the potential speed, due to the way cogs access hub? What is the thing that is limiting the xmm kernel using the cache more effectively?

I have pondered the idea of two caches - one in hub and another smaller cache in a cog. I am not sure what could use such an architecture - maybe an emulation of an ultra RISC instruction set. Or maybe the idea of relocatable pasm that can be loaded into a cog with relative rather than absolute jumps - ie jump 4 instructions behind/ahead. Load in blocks of code into a cache. I don't know if such code could be written in pasm.

Off the record, and within the confines of this thread, IMHO the Prop II does need to hurry up as my comments about running C# on an embedded controller (costing less than a Dracblade) are not made in jest. The world of microcontrollers changes rapidly.

But I do love the intellectual challenge of pushing a chip to its limits, and I think caching opens up a whole world that the propeller has not explored fully yet.

What would you be trading off against cache performance?

RossH · 2011-10-23 04:43

Dr_Acula wrote: »

Ah, I see. So are you saying that an idealised program that happens to fit in an 8k cache entirely is still too slow?

Yes - the cache overhead still applies, since you have to check that the correct page is in the cache - either on every access, or at least every time you change pages. I could modify the kernel to be more "cache aware" but a cached program would still be significantly slower than a non-cached program - even if it fits entirely in the cache.

Dr_Acula wrote: »

Is that due to the pasm to hub transfers only happening 1/8th of the potential speed, due to the way cogs access hub? What is the thing that is limiting the xmm kernel using the cache more effectively?

Partly the fact that there is no fast cog-to-cog communication channel (so all kernel-cache communication has to go via hub RAM) but also the need (as mentioned above) to check the correct page is in cache so often. I had hoped the omission of direct cog-to-cog communication would be rectified on the Prop II, but now I'm not sure it will be.

Dr_Acula wrote: »

I have pondered the idea of two caches - one in hub and another smaller cache in a cog. I am not sure what could use such an architecture - maybe an emulation of an ultra RISC instruction set. Or maybe the idea of relocatable pasm that can be loaded into a cog with relative rather than absolute jumps - ie jump 4 instructions behind/ahead. Load in blocks of code into a cache. I don't know if such code could be written in pasm.

Yes it could - in fact I think this is the technique the GCC team uses to achieve good performance. The cost is high in other ways, though.

Dr_Acula wrote: »

Off the record, and within the confines of this thread, IMHO the Prop II does need to hurry up as my comments about running C# on an embedded controller (costing less than a Dracblade) are not made in jest. The world of microcontrollers changes rapidly.

Not that rapidly! In my view, C# is a completely inappropriate language for microcontrollers. It's even a worse fit than C++ or Java. Many of the good things about C# (and I agree there are some) are irrelevant on a microcontroller, and some can't easily be supported anyway (similar to the way many of the nice things about C++ and Java can't be supported). And the overheads of C# are even higher than those languages. While I agree you could make C# work on the Prop, it would be very slow, very limited and probably very pointless. Of course, this is only my opinion, you're entitled to hold a contrary view - however wrong it may be.

Dr_Acula wrote: »

But I do love the intellectual challenge of pushing a chip to its limits, and I think caching opens up a whole world that the propeller has not explored fully yet.

Agreed!

Dr_Acula wrote: »

What would you be trading off against cache performance?

At the very least I would probably have to forgo multithreading and/or fast floating point. And both of these are much more important to me than improving cache performance.

Ross.

Heater. · 2011-10-23 04:52

Dr_A,
Hmmm...relocatable PASM...how would one write that?
We have no way to get at the COG's program counter to do relative jumps with.
Seems one would have to load the code and then fixup all the src/dest fields of all the jumps and calls according to the load address before running it.
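A load-time fixup pass of that kind might look something like the sketch below. It assumes the Prop I instruction layout with the source field in bits 8..0 and the destination field in bits 17..9, and it assumes a fixup table listing which fields hold addresses. That table (the fixup type and its contents) is hypothetical; something would have to generate it when the block is assembled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIELD_MASK 0x1FFu   /* both address fields are 9 bits wide */

/* Hypothetical fixup record: one address field that needs adjusting. */
typedef struct {
    size_t   index;   /* which instruction (long) within the block */
    unsigned shift;   /* 0 = source field, 9 = destination field */
} fixup;

/* Add the cog load address `base` to every address field listed in
   `fixups`, so code assembled at address 0 can run from `base`. */
void relocate(uint32_t *code, size_t code_len,
              const fixup *fixups, size_t fixup_count, uint32_t base)
{
    for (size_t i = 0; i < fixup_count; i++) {
        const fixup f = fixups[i];
        assert(f.index < code_len);

        uint32_t mask  = (uint32_t)FIELD_MASK << f.shift;
        uint32_t value = ((code[f.index] >> f.shift) & FIELD_MASK) + base;
        assert(value <= FIELD_MASK);   /* must still be a valid cog address */

        code[f.index] = (code[f.index] & ~mask) | (value << f.shift);
    }
}
```

The pass itself is cheap (one read-modify-write per listed field); the awkward part is producing and storing the fixup table alongside each block of code.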

Heater. · 2011-10-23 04:58

Hey, didn't I just prove that C++ is a perfect fit for small embedded systems? In fact exactly the same perfect fit as C as it can generate identical code for the same problem.

Dr_Acula · 2011-10-23 05:00

I did some speed tests on the Dracblade:

#include <stdio.h>

int main(void)
{
    long i;

    printf("Start\n");
    for (i = 0; i < 1000000; i++)
    {
    }
    printf("Finish\n");
    while (1);    // Prop reboots on exit from main()!
    return 0;
}

59 seconds without a cache and 28 seconds with an 8k cache.

So would this XMM program reside in its entirety in the cache once it has been loaded in? ie even with the slowest external ram, 28 secs is the fastest it can ever run?

Also, you mentioned multithreading? How does that work in Catalina?

@heater

Seems one would have to load the code and then fixup all the src/dest fields of all the jumps and calls according to the load address before running it.

Yes but could you do that? Maybe even you had to recompile every cog 'function' separately with it sitting in the correct artificial position that it might end up in? Then store that pasm code in external ram. How would the cog know any different if that 'function' was loaded later, and was in the correct position? Self modifying code taken to the extreme. eg you have 'functions' representing opcodes for a Z80 emulator but you only load the 10 most popular ones in a cog cache. Reload the others as needed.

Heater. · 2011-10-23 05:36

Dr_A,
That is an appalling result.
The compiler should notice that your loop has no output and optimize it away altogether. Execution time zero.

RossH · 2011-10-23 14:05

Heater. wrote: »

Hey, didn't I just prove that C++ is a perfect fit for small embedded systems? In fact exactly the same perfect fit as C as it can generate identical code for the same problem.

No, what you have shown is that a C program that limits itself to a small C-like subset of C++ generates identical code, which is not really surprising since they both use the same back-end code generator, and both are working from the same abstract syntax tree (due in part to the fairly contrived nature of the example given).

This is not too far removed from taking the old Cfront program (the original C++ to C translator) running it on a small C++ program and then pointing to both the original C++ source and the generated C source and saying "See! These two programs generate the same code, therefore C++ is no different to C!".

Also, you keep saying "embedded system" rather than "microcontroller" - the two are not the same, and in this forum we deal with microcontrollers. As I have said before, my network file system box is an embedded system - but a microcontroller it ain't! That runs C++ quite happily - under Linux, running on a fairly grunty microprocessor and with a couple of hundred megabytes of RAM.

I've agreed previously that there is a subset of C++ that can be useful on a microcontroller - i.e. the subset of C++ that (apart from some syntactic sugar) is virtually indistinguishable from C. But this is not the same thing as saying that C++ is a useful language on a microcontroller. It isn't - you have to avoid large parts of the language, and even those parts you can use are likely to confound beginners.

There is no doubt that C is a useful language on a microcontroller such as the Propeller. Not an arbitrary subset of C, but the whole kit and kaboodle, as people demonstrate here daily. When we have a C++ compiler that implements the whole C++ language on the Propeller then we can compare the two for utility.

Ross.

RossH · 2011-10-23 14:08

Heater. wrote: »

Dr_A,
That is an appalling result.
The compiler should notice that your loop has no output and optimize it away altogether. Execution time zero.

True - but if you add "volatile" to the definition of i, then the compiler cannot optimize the loop away, so this is just nitpicking.
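For illustration, a minimal sketch of the same timing loop with a volatile counter. Wrapping it in a function called spin() that takes the count as a parameter is an assumption made here for convenience; the original test simply loops to 1000000 inside main().

```c
/* Busy loop that counts up to n. Because i is volatile, every read and
   write of it must actually be performed, so the compiler is not free
   to delete the loop even though its body is empty. */
long spin(long n)
{
    volatile long i;

    for (i = 0; i < n; i++) {
        /* empty on purpose: the loop itself is what is being timed */
    }
    return i;   /* final counter value, equal to n for n >= 0 */
}
```

Calling spin(1000000) between the two printf() calls would reproduce the original measurement on a compiler that does optimize empty loops away.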

RossH · 2011-10-23 14:17

Dr_Acula wrote: »

I did some speed tests on the Dracblade:
...

59 seconds without a cache and 28 seconds with an 8k cache.

The DracBlade is the only board I have that uses parallel SRAM but which runs faster with the cache - this is because its XMM access is quite slow (part of this may be due to my XMM API implementation, but most of it is due to its XMM memory design which minimizes the number of pins). All my other parallel SRAM boards run faster without the cache.

Dr_Acula wrote: »

So would this XMM program reside in its entirety in the cache once it has been loaded in? ie even with the slowest external ram, 28 secs is the fastest it can ever run?

Using the Optimizer will give you a small boost. But unless I modify Catalina, this is essentially as fast as it will go.

Dr_Acula wrote: »

Also, you mentioned multithreading? How does that work in Catalina?

Check out the Catalina Reference Manual - page 38. There are examples in the demo\multithread folder.

EDIT: I just did some timing of your program. You can reduce those times to around 45s/20s (uncached/cached) just by making them SMALL programs rather than LARGE, and then further again by using the Optimizer - 37s/17s.

Dr_Acula · 2011-10-23 17:09

heater

That is an appalling result.
The compiler should notice that your loop has no output and optimize it away altogether

No it is an excellent result *grin*. If the compiler had optimised it away I would have had to add more dummy code in the loop! What I was testing was not the absolute speed of catalina, but the ratio of cached to non cached code for code that is small enough to fit in the cache completely. So the ratio is roughly 2:1 for the dracblade.

I'm intrigued by Ross' comments about the XMM speed. In particular I am wondering how this compares with a serial ram solution? In particular, is it worth thinking about hardware counters so that data can be read in bursts. Such a solution would make no sense for random access, like say the Z80 emulation, but it could make a lot of sense with caching as data is read in blocks. I'm thinking two 374s, one 138, two 161's and 12 prop pins.

RossH · 2011-10-23 18:07

Dr_Acula wrote: »

heater
I'm intrigued by Ross' comments about the XMM speed. In particular I am wondering how this compares with a serial ram solution? In particular, is it worth thinking about hardware counters so that data can be read in bursts. Such a solution would make no sense for random access, like say the Z80 emulation, but it could make a lot of sense with caching as data is read in blocks. I'm thinking two 374s, one 138, two 161's and 12 prop pins.

I've said this several times now, but it doesn't seem to be getting through: when the cache is in use, the speed of the XMM RAM makes negligible difference. There are significant benefits to be gained by improving the performance of the cache itself, but almost none to be gained by improving the speed of the XMM RAM.

To demonstrate this, I just ran this program on the C3, which has slow serial SRAM and serial FLASH. Using the cache, it takes 20 seconds to run - the same time as it took on the DracBlade, which has parallel SRAM. I can't run the program uncached on the C3, since with serial RAM and FLASH you must use the cache.

For comparison, I also ran the same program on the RamBlade - its times are 12s/14s (uncached/cached). You can't really compare the times directly with the DracBlade since the RamBlade is running at 104MHz - but it demonstrates that with fast parallel SRAM, uncached access is faster even when the program fits entirely in the cache.

However, while typing this and thinking about the way the cache operates, I've just worked out a way to dramatically improve the cache performance - but it will take a little while to implement, so it probably won't make it into the next Catalina release.

Ross.

Dr_Acula · 2011-10-23 18:57

I've said this several times now, but it doesn't seem to be getting through : when the cache is in use, the speed of the XMM RAM makes negligible difference.

Yes sorry I don't quite understand that. Because if that program ends up entirely in cache, then the performance should be independent of the external ram. But you got a value of 20 seconds and I got 28. Maybe there is some other variable? Is that with a 5MHz xtal and 8k cache?

I've just worked out a way to dramatically improve the cache performance

Sounds promising. So this brainstorming may be useful after all

RossH · 2011-10-23 19:12

Dr_Acula wrote: »

Yes sorry I don't quite understand that. Because if that program ends up entirely in cache, then the performance should be independent of the external ram. But you got a value of 20 seconds and I got 28. Maybe there is some other variable? Is that with a 5MHz xtal and 8k cache?

27s for a LARGE program, 20s for a SMALL program. 5MHz crystal on both DracBlade and C3, but I was using a 1K cache - the main body of the loop is so small it would fit completely in the cache with any cache size.

Dr_Acula wrote: »

...
Sounds promising. So this brainstorming may be useful after all

Sure! After discussions like these I always end up with more ideas than I could possibly ever have time to implement. That's one reason I like spending time in these forums.

Ross.
