Catalina 3.3

Dr_Acula · 2011-10-23 19:51

27s for a LARGE program, 20s for a SMALL program. 5MHz crystal on both DracBlade and C3, but I was using a 1K cache - the main body of the loop is so small it would fit completely in the cache with any cache size.

Great - well that saves me needing to design a new board! Caching is brilliant and if you can get it to run even faster that would be awesome.

RossH · 2011-10-23 20:02

Dr_Acula wrote: »

Great - well that saves me needing to design a new board! Caching is brilliant and if you can get it to run even faster that would be awesome.

I'll add it to my "to do" list

Rayman · 2011-10-24 03:34

RossH wrote: »

The bad news is that when playing chess against the Propeller, the program can take an hour to figure out each move! I let it run through a couple of moves just to check it was working, but I don't think anyone is ever going to sit through a whole game at these speeds. I think the program just does a brute-force evaluation of all available moves and chooses a random one from amongst the "best" moves (how it figures out "best" I have no idea!). A more sophisticated program would probably run faster.

Ross, Thanks for checking it out. Supposedly, the depth to which it searches is specified by the number of arguments on the command line...
Before, it seemed like it didn't recognize any arguments and went into 2-player mode.
One argument is what we want because then it goes into 5-ply 1-player mode.
Two arguments goes to 6-ply mode, which the author says can get very slow...

Just want to make sure you were testing with just one argument....
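
For anyone following along, the argument-count behaviour described above can be sketched roughly like this. This is only an illustration built from the description in this post, not the chess program's actual code; `chess_mode_t` and `mode_from_argc` are made-up names, and the `ply = 0` used for 2-player mode is just a placeholder assumption.

```c
/* Hypothetical illustration only: maps the number of command-line
   arguments to the modes described in the post above. */
typedef struct {
    int players;   /* 2 = human vs human, 1 = human vs program */
    int ply;       /* search depth; 0 is a placeholder for "no search" */
} chess_mode_t;

static chess_mode_t mode_from_argc(int argc)
{
    chess_mode_t mode;
    int args = argc - 1;            /* argv[0] is the program name */

    if (args <= 0) {                /* no arguments: 2-player mode */
        mode.players = 2;
        mode.ply = 0;
    } else if (args == 1) {         /* one argument: 5-ply, 1-player */
        mode.players = 1;
        mode.ply = 5;
    } else {                        /* two (or more) arguments: 6-ply */
        mode.players = 1;
        mode.ply = 6;
    }
    return mode;
}
```

So a program started with exactly one argument would see `argc == 2` and land in the 5-ply branch.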

RossH · 2011-10-24 03:43

Rayman wrote: »

Ross, Thanks for checking it out. Supposedly, the depth to which it searches is specified by the number of arguments on the command line...
Before, it seemed like it didn't recognize any arguments and went into 2-player mode.
One argument is what we want because then it goes into 5-ply 1-player mode.
Two arguments goes to 6-ply mode, which the author says can get very slow...

Just want to make sure you were testing with just one argument....

Yes, there was a bug in my setjmp/longjmp function that was corrupting the 1-player/2-player flag.

But I am now definitely specifying one parameter, and it definitely takes an hour per turn in "5-ply" mode. I'd hate to think how long it takes in "6-ply" mode!

Still, as heater (or is it mpark?) often quotes: "The wonder of the dancing bear is not how well it dances, but that it dances at all."

Ross.

Rayman · 2011-10-24 04:45

Ok, I can believe that. I haven't actually tried it on a PC to see how fast it is there. But, maybe 1-minute on PC = 60 minutes on Prop...

Have you figured it out enough to dial it down to 3 or 4 ply?

Heater. · 2011-10-24 05:27

RossH,

Is this the right room for an argument?

No, what you have shown is that a C program that limits itself to a small
C-like subset of C++ generates identical code,...

What "No"?
What I showed was the same functionality implemented in C and C++. The C++
version is definitely object oriented C++, it has classes and objects and
templates. The C version is written with an object oriented approach so that we
can have multiple instances of our objects.

Turns out the code generated is identical and that the source is much nicer in
the C++ version. That makes C++ a win.

You rightly point out that the C++ code is only using a subset of C++ language
features. But who says all programs have to use all features all the time? For
example I have worked on lots of systems in C that never used malloc().

...which is not really surprising...

Well it surprised me:)
Had I used a tool like Cfront to generate the C code from the C++ and then
presented you those versions there would obviously be no surprise. But that's
not what I did. The C version was written by hand in an attempt to introduce
some object "orientedness" into the code. So the surprise is that the way I did
it in C is actually the way C++ does it.
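
To make "object oriented C" concrete, here is a minimal sketch of the general pattern. It is not the code from the demonstration discussed here; `counter_t` and its functions are hypothetical names invented for illustration. The state lives in a struct, and every "method" takes a pointer to the instance explicitly, much as a C++ member function receives `this` implicitly.

```c
#include <stdint.h>

/* Hypothetical "class": all state for one instance lives in a struct. */
typedef struct {
    uint32_t count;
    uint32_t step;
} counter_t;

/* "Constructor": initialise one instance. */
static void counter_init(counter_t *self, uint32_t step)
{
    self->count = 0;
    self->step = step;
}

/* "Method": the instance pointer is passed by hand, where C++ would
   pass `this` for us. */
static uint32_t counter_tick(counter_t *self)
{
    self->count += self->step;
    return self->count;
}
```

Two `counter_t` variables initialised with different steps then behave as two independent objects, which is the "multiple instances" property mentioned above.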

...you keep saying "embedded system" rather than "microcontroller"

True. Technically I probably don't mean either of them. As you say embedded
systems can be pretty large. But then micro-controllers are growing all the
time as well. Perhaps I should be more rigorous and say "memory constrained
systems" or some such.

I've agreed previously that there is a subset of C++ that can be useful
on a microcontroller - i.e. the subset of C++ that (apart from some syntactic
sugar) is virtually indistinguishable from C.

I'm suggesting the subset is quite large and useful. Certainly more than
syntactic sugar.

But this is not the same thing as saying that C++ is a useful language on a
microcontroller. It isn't...

Is:)

When we have a C++ compiler that implements the whole C++ language on the
Propeller then we can compare the two for utility.

I believe it is almost there. Perhaps barring a huge pile of standard
libs/classes that we probably don't want to be using anyway. But that brings us
back to the debate about the distinction between language vs standard libs. I'm
quite happy to have the language without the libs on a MCU. "printf" who needs
it?

Rayman · 2011-10-24 06:11

I think "printf" along with floating point support is one of the great things about C on the Propeller.
I think it may be the single most useful feature...

jazzed · 2011-10-24 09:13

Rayman wrote: »

I think "printf" along with floating point support is one of the great things about C on the Propeller.
I think it may be the single most useful feature...

I agree with this to a floating point

It's also valuable without floating point.

RossH · 2011-10-24 14:13

Heater. wrote: »

RossH,

Is this the right room for an argument?

No, you want room 12A next door. This room is "Abuse".

Heater. wrote: »

...

I believe it is almost there. Perhaps barring a huge pile of standard
libs/classes that we probably don't want to be using anyway. But that brings us
back to the debate about the distinction between language vs standard libs. I'm
quite happy to have the language without the libs on a MCU. "printf" who needs
it?

... and here is where we will always part company. A subset of C++ is not C++, just like a subset of C is not C. What you really want is not C++ at all, it is more like embedded C++ (and look what happened to that idea!).

And I guess that is really my key point. It no longer makes sense to have a subset of C on any platform, since C is close to being the smallest useful language in which you can write absolutely anything you like (within a reasonable time frame). C++ is not a small language, and will always have to have subsets derived from it for use in constrained environments. And since we all seem to agree that we cannot stick to the standard, everybody will use a different subset. It is worth pointing out that this used to be the case with C as well - up to about 20 years ago, when it finally dawned on people that they needed a rigorous standard (this is one of the main reasons C is still around!). But users of C++ have obviously not yet learned this lesson, so when you move code from C++ platform to C++ platform you cannot be sure it will either compile or run.

Ross.

Rayman · 2011-10-24 14:26

I think C makes more sense for an embedded environment, where things have to be small and whatnot.
But, personally, I prefer to program in C++, mostly because it lets me be lazy. But also because of "objects".
As an educational device, C++ for Prop1 might make a lot of sense.

Prop2 might be a different story. Microsoft had their Windows Mobile stuff running on an ARM and ran C++ code just fine.
Android phones appear to support C and C++.

jazzed · 2011-10-24 15:18

RossH wrote: »

However, while typing this and thinking about the way the cache operates, I've just worked out a way to dramatically improve the cache performance - but it will take a little while to implement, so it probably won't make it into the next Catalina release.

Since you've gone down the path a little, I thought it might be useful for all of us to consider other ideas.
Let's have some Tea ....

I always wondered why you used the cache the way you have it now saving and loading a single data element at a time (maybe I have this wrong; you are welcome to correct me with a useful explanation). I have to assume it's mostly a matter of getting something to work without breaking other stuff.

As far as GCC improvements go, we have considered using 2-way or 4-way set associative cache, but with a sufficiently fast synchronous burst back-store it makes less difference especially with the extra instructions. I figure as long as the policy is faster than the back store, it doesn't matter too much.

The biggest bang for the buck, for the GCC design at least, seems to be to put the actual cache manager code in the xmm/c kernel itself and keep the actual back-store load/save code in a separate cog. Of course our current cache logic split of responsibility is really fine the way it is, but there is always room for experimenting.

--Steve

Dr_Acula · 2011-10-24 18:43

Hi jazzed,

Let's have some Tea ....

Excellent idea. Tea brewed and sipping it as I write.

Discussing caching sounds an excellent idea. It seems to me that if we all work together we could come up with a generic cache algorithm that could be useful for all sorts of programs - GCC, Catalina, other languages and emulations.

There are many algorithms to consider http://en.wikipedia.org/wiki/CPU_cache. And there are questions to answer, like what is the optimum size of each piece, do you store the cache lookup table in cog or hub and where the code resides (like you say).

I'd love to brainstorm caching - do we do it on this thread or on a new thread?

RossH · 2011-10-24 20:03

Jazzed, Dr_A ...

I don't mind if you want to conduct a discussion on caching in this thread. Catalina 3.4 is nearly ready for release (just got to check out the Linux versions) so discussion of more general Catalina issues will shortly move to a new thread.

As for Catalina's use of the cache, I agree it is less than optimal. But there are reasons for this. The main one is that I was hoping that the cache would provide me a mechanism for allowing multi-cog support for XMM C programs (currently I can only support multi-cog LMM C programs). At present, all the XMM memory designs make an implicit assumption that access to the RAM will always be through one cog only - hence the cache is a potential solution for multiple cogs to achieve this. But if multiple C programs running on multiple cogs are using the same cache then this means that the cache needs to be checked on every cache access - at least this is true with the current cache design. A different cache design (or a different XMM RAM design) might avoid the need for this.

Happy to hear of any suggestions anyone has on this score. For example, perhaps instead of an 8k cache, we could have 8x1k caches - i.e. one per cog? The current cache mechanism (which was originally designed by Bill Henning) doesn't support this, but it could be done. Another possible solution would be a comms mechanism that does not need to go via Hub RAM at all (which is very slow). This is not really feasible on the Prop I (it consumes too many pins) but perhaps the Prop II will add this capability (I've really lost touch with the Prop II project - as far as I know there is still no final design or even a proposed instruction set for it).

Anyway, as I say - I'm happy to hear other people's thoughts on this.

The improvement I have thought of for Catalina is simply a better way of using the existing cache that I can fit into the XMM kernel, and without requiring too much fundamental redesign - space in that cog is very tight, and I was trying to add caching without having to re-write (and hence have to re-test) very much. I now think I can do this - the idea is fairly simple, but the implementation of it will take some time simply because even the smallest modification to the XMM kernel usually means I need to find a few extra longs of space.

The main reason I don't want to spend too much time reworking the XMM kernel to improve the existing cache support is that I simply don't believe there are enough XMM users to justify the effort required. Again, happy for this to be discussed.

What we are all doing with things like XMM and caching is really trying to make up for the limitations in the Prop I design. Note that I don't mean deficiencies in the design - the Prop I is perfectly well designed to do what it was originally intended for - I just mean those things that stop us doing even more weird and wonderful things with it - limitations like not having enough Hub RAM, or not having a built-in cog-to-cog communications channel. Nobody ever thought these would be necessary on the original Prop I, and it is still unclear whether we will get them on the Prop II (some people apparently still don't see the necessity for them!).

Over to you guys ...

Ross.

Dr_Acula · 2011-10-24 21:20

Very interesting thoughts.

Re:

At present, all the XMM memory designs make an implicit assumption that access to the RAM will always be through one cog only - hence the cache is a potential solution for multiple cogs to achieve this. But if multiple C programs running on multiple cogs are using the same cache then this means that the cache needs to be checked on every cache access - at least this is true with the current cache design. A different cache design (or a different XMM RAM design) might avoid the need for this.

Given that the performance with a cache seems to level all the XMM hardware to the same speed, for the dracblade solution (12 pins) maybe one could think about putting 3 SPI ram chips on those 12 pins rather than a sram? Then you could have three threads running completely independently. It ought to simplify coding as well. Three caches, three external memories, no shared pins?

Am reading the rest of your post in more detail...

jazzed · 2011-10-24 22:21

RossH wrote: »

Jazzed, Dr_A ...

I don't mind if you want to conduct a discussion on caching in this thread. ....

I figured since Ross had discussed caching here, he wouldn't mind talking about it since it's technically interesting. There have been other threads on caching, and yet another wouldn't hurt anyone. It really would be nice if we could standardize a little to encapsulate hardware, but with that often comes unwanted constraints.

Ross has different goals for example as he mentioned. I've always considered an XMM kernel/interpreter itself a single COG, single thread business end code solution (with a Cache COG as a load/save machine) mainly because there is just not enough bus bandwidth really for anything else. Anyone could run multiple LMM, COG, PASM, or SPIN (not officially required, but doable) programs anyway if necessary until all the COGs are gone .... I've discussed threaded LMM with Parallax, and it seems to be viewed as more of a curiosity than anything else. Yes, XMM is more of a boutique feature on Propeller, and it exists only because it can and there are certain advantages to having some extra memory for apps when necessary.

Caching is very powerful. It's even usable with EEPROM as a back store - not the fastest media, but every stand-alone Propeller needs one, so XMM is essentially "free" with a 64KB or better board like the Hydra (128KB) or Propeller Protoboard and Quickstart (64KB). Putting a fast external memory device to work offers a nice speed improvement.

My own experience with XMM and caching has come down to this. XMMC (code only XMM) is the fastest model because no swapping is required. This is the main reason why I started experimenting with 2-wide QuadSPI - it's the fastest cached solution available because of its synchronous nature. Any SRAM can also use that model too, but of course an SD card or fast serial connection must be used to load the initial program.

Another XMM mode is one where code/constants and data are kept in the external memory back-store. If stack was also kept in external memory, it would be dog slow too. There are models like C3 where both Flash and SRAM are external. Of course all the memory is SPI on C3 so it is relatively slow.

Anyway the reason for mentioning the models is the cache type may be different for the various boards. A Flash, EEPROM, or SD card cache can be very simple. A dual C3 type cache demands at minimum 2 cache spaces - not a 2 way set necessarily. Any SRAM solution that contains code, data, and/or stack really needs to be set associative. A fully associative cache is probably best for that solution, but it will be slow and have limits too.

So there is no one size fits all, but it is at least possible to have an interface that allows access to many bytes within the cache at once without having too many hub accesses. As always, all you need is a little breathing room in the XMM kernel

Our current solution requires about 15 to 20 instructions, so it's not so bad. The interesting idea is to have the cache algorithm itself inside the XMM kernel. Of course one would need a few different kernel types to support the basic external memory models because the caching algorithms would be different for best performance as discussed above.

All that summarizes our experience with caching and XMM so far. Choices made have been reasonable and the opportunity to do something better is always there, but the reality is also that only so much can be done.

jazzed · 2011-10-24 22:33

Dr_Acula wrote: »

Given that the performance with a cache seems to level all the XMM hardware to the same speed, for the dracblade solution (12 pins) maybe one could think about putting 3 SPI ram chips on those 12 pins rather than a sram?

Different hardware implementations do impact performance like in any other model. I do firmly believe that a memory that is being treated as a synchronous burst device is much better than an address setup/byte at a time access device though for caching. Basically, the hardware needs to deliver a cache line's worth of data after one address setup phase. A cache line's worth of data might be 32 to 256 or more bytes. Of course how it's used is also very important.
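
As a rough illustration of why a burst-capable back store suits caching, here is a minimal sketch of a direct-mapped, read-only cache that fills one whole line per miss. It is not the GCC or Catalina cache code. The external memory is simulated with a plain array, and the sizes and all names (`LINE_SIZE`, `NUM_LINES`, `burst_read`, `cache_read`) are assumptions chosen for the example.

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE    32u    /* bytes per cache line (assumed) */
#define NUM_LINES    32u    /* 32 lines x 32 bytes = 1K cache (assumed) */
#define BACKING_SIZE 4096u  /* size of the simulated external memory */

static uint8_t  backing[BACKING_SIZE];        /* stand-in for external RAM */
static uint8_t  lines[NUM_LINES][LINE_SIZE];  /* cached data */
static uint32_t tags[NUM_LINES];              /* which line each slot holds */
static uint8_t  valid[NUM_LINES];             /* 0 until a slot is filled */
static unsigned burst_reads;                  /* counts simulated line fills */

/* Simulated burst: one address setup, then LINE_SIZE bytes in a row. */
static void burst_read(uint32_t line_addr, uint8_t *dst)
{
    memcpy(dst, &backing[line_addr], LINE_SIZE);
    burst_reads++;
}

/* Read one byte through the cache; addr must be below BACKING_SIZE. */
static uint8_t cache_read(uint32_t addr)
{
    uint32_t line_no = addr / LINE_SIZE;
    uint32_t index   = line_no % NUM_LINES;   /* slot this line maps to */
    uint32_t tag     = line_no / NUM_LINES;   /* identifies the line in the slot */

    if (!valid[index] || tags[index] != tag) {
        /* Miss: fetch the whole line with a single burst. */
        burst_read(line_no * LINE_SIZE, lines[index]);
        tags[index]  = tag;
        valid[index] = 1;
    }
    return lines[index][addr % LINE_SIZE];
}
```

With these sizes, addresses 5 and 6 share one line and cost a single burst between them, while addresses 5 and 1029 map to the same slot and evict each other. That conflict case is what a set associative design is meant to soften.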

If you do add 3 SPI chips, make sure they have independent chip selects - the C3 thing is more difficult than necessary.

I have 2 QuadSPI flash chips and 8 SPI RAMs on 12 pins for a 4MB fast flash solution with up to 512KB SRAM. The code solution is not fully implemented yet. I'll share schematics with you if you want to use the design - royalty free.

Dr_Acula · 2011-10-24 22:44

Thanks for the offer jazzed - is the design open source (just thinking ahead if I then use it in an open source design it might sort of become open source in the process?).

Yes I was thinking of 3 23A256 chips each with 4 pins, so separate /CS lines. Ultra simple. And looking at the spec sheets they have a mode exactly as you describe - send the address and then clock out bytes continuously.
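
A minimal sketch of the bytes involved, assuming the command set as I remember it from the Microchip 23A256/23K256 data sheet (READ = 0x03, WRSR = 0x01, sequential mode = 0x40 in the status register). Please check these values against the data sheet before relying on them. The function names are made up, and the actual SPI clocking is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Values recalled from the 23A256/23K256 data sheet (verify before use). */
#define SRAM_CMD_READ   0x03u  /* read data starting at a 16-bit address */
#define SRAM_CMD_WRSR   0x01u  /* write the status (mode) register */
#define SRAM_MODE_SEQ   0x40u  /* sequential mode: address auto-increments */

/* Hypothetical helper: bytes to send once to select sequential mode. */
static size_t build_set_sequential_mode(uint8_t out[2])
{
    out[0] = SRAM_CMD_WRSR;
    out[1] = SRAM_MODE_SEQ;
    return 2;
}

/* Hypothetical helper: header for a read. After these three bytes the
   host keeps clocking and the chip streams consecutive bytes. */
static size_t build_seq_read_header(uint16_t addr, uint8_t out[3])
{
    out[0] = SRAM_CMD_READ;
    out[1] = (uint8_t)(addr >> 8);     /* address high byte first */
    out[2] = (uint8_t)(addr & 0xFFu);  /* then address low byte */
    return 3;
}
```

The point is that one 3-byte header is paid per cache line rather than per byte, so longer lines spread the address setup cost over more data.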

As for how it is used - I'll defer to the caching experts here and I remain intrigued by what Ross has just thought of to improve the speed.

RossH · 2011-10-25 01:12

jazzed wrote: »

Ross has different goals for example as he mentioned
....
I've discussed threaded LMM with Parallax, and it seems to be viewed as more of a curiosity than anything else.

Great! This can be the first differentiator between GCC and Catalina. And it's really the main area I wanted to explore in the first place - the diversion into XMM and caching is lots of fun (and a necessary stopgap until the Prop II finally arrives) but ultimately it really only serves to turn the Propeller into an expensive Arduino clone. Some seem to think this is worthwhile goal for commercial reasons, so rather than continuing to argue against it, I'm perfectly happy to leave that avenue to those who want to explore it. More strength to their ARM!

Ross.

Dr_Acula · 2011-10-25 01:20

Hi Ross, can you explain threading for dummies? I read through the manual and saw all the functions you can call but I don't think I really grasped the overview. Does threading mean you can run two or more C programs in parallel? And if so, by implication, callable LMM code can be run in parallel as well?

If so, I have some programming ideas.

Also thinking about caches, three separate spi rams 32k each, cache into three 2k hub caches all running in parallel, one cog for each cache, each cog running its own LMM/C thread? Is this possible??

If it is, the prop could break out of the 2k cog limitation via C (which is arguably a more accessible language than pasm) and you could be doing things in parallel that are not so easy on other chips.

RossH · 2011-10-25 02:21

Dr_Acula wrote: »

Hi Ross, can you explain threading for dummies? I read through the manual and saw all the functions you can call but I don't think I really grasped the overview. Does threading mean you can run two or more C programs in parallel? And if so, by implication, callable LMM code can be run in parallel as well?

Rather than me describe it - look here or here. These documents are specifically about POSIX threads, but the concepts are all much the same for all threading implementations - if you read these articles you will see that all the fundamental thread operations are already present in the Catalina thread library.

Dr_Acula wrote: »

If so, I have some programming ideas.

Also thinking about caches, three separate spi rams 32k each, cache into three 2k hub caches all running in parallel, one cog for each cache, each cog running its own LMM/C thread? Is this possible??

I don't think so. Every cache consumes another cog, so we would have 3 kernel cogs and 3 cache cogs - i.e. only two cogs left for drivers etc. I don't think this is a good tradeoff. Even the overhead of one cache is hard to justify. In any case, I think you are confusing multi-cog with multi-thread. Both are useful - but they are not the same.

Dr_Acula wrote: »

If it is, the prop could break out of the 2k cog limitation via C (which is arguably a more accessible language than pasm) and you could be doing things in parallel that are not so easy on other chips.

Yes, exactly. That's the potential multi-threading on the Propeller offers. Why stop at 8 cogs when you can have several hundred threads? Or both? This potential has just not yet been quite realized.

Ross.

Heater. · 2011-10-25 03:58

RossH,

... and here is where we will always part company. A subset of C++ is not C++,
just like a subset of C is not C. What you really want is not C++ at all, it is
more like embedded C++

Who said anything about subsets?

If your compiler can produce assembler or object modules from the complete
range of language features then it is a full up compiler. I'm sure GCC will be
able to do this for C++.

Now it could happen that you are then unable to produce a runnable program
because you don't have the entire range of standard libraries available.
To me that is a different issue. Those libs may be missing for very good
reasons, like, they are never going to fit in the memory space available.

But then had you written your C code to perform similar functionality in similar ways,
after all you can adopt an object oriented approach to straight C programming, you would
find it also does not fit in the space available.

So yes that is where we differ. For you the language is the syntax, the semantics,
plus whatever supporting libs have been thrown into the standard. For me the
language stops at whatever you can write down in BNF.

We should leave that debate just there. Until next time:)

(Except, given 32MB of RAM and an SD card on a GadgetGangster Prop setup a full
up standards compliant C++ environment can be created. Thus meeting your
criteria. The sanity of doing such a thing is still questionable in my mind.)

But users of C++ have obviously not yet learned this lesson, so when you move
code from C++ platform to C++ platform you cannot be sure it will either compile
or run

Given the amount of rework that needs doing on a C program to move it from one
MCU to another I'd say C still has this problem in MCU world.

Rayman,

I think C makes more sense for an embedded environment, where things have to
be small and whatnot.

That was the whole point of my C++ demonstration on the Prop. For the
equivalent functionality in C and C++ the code size was exactly the same. In
fact the generated code was actually identical. The conclusion is that when
"things have to be small and whatnot" there is no penalty in using C++. There
is the advantage that the source becomes much prettier and easier to maintain.

Heater. · 2011-10-25 04:38

RossH,

I too have recently become fascinated by the idea of threads on the Prop. I know you have made a lot of progress there.
Have you looked at proto-threads?
I managed to create a version of FullDuplexSerial in C that runs in a COG as native code and works at 115200 baud.
However to do that I had to create my own proto-thread like threading macros that are smaller and faster.

RossH · 2011-10-25 04:53

Heater. wrote: »

...
We should leave that debate just there. Until next time:)

Or until we both forget we already had the argument.

Heater. wrote: »

(Except, given 32MB of RAM and an SD card on a GadgetGangster Prop setup a full
up standards compliant C++ environment can be created. Thus meeting your
criteria. The sanity of doing such a thing is still questionable in my mind.)

32MB???? On a Propeller??? Surely you don't mean that! That is way beyond just being questionable - it is completely and utterly insane! Especially given that in C you can do this in 32 kilobytes - not 32 megabytes!

Heater. wrote: »

Given the amount of rework that needs doing on a C program to move it from one
MCU to another I'd say C still has this problem in MCU world.

The amount of work required to move an ANSI C program? Typically Zip. Nada. Nil. Zilch. None. For an example, see the toledo chess "obfuscated C" program we have been discussing in this thread. Or any of the Catalyst demo programs. Usually you don't have to change a single line to run these on a supercomputer, a desktop PC, or on the Propeller (yes, in some cases I have to write some Catalina specific modules for things like cursor handling in vi etc - but this is because this is O/S dependent, not because I have to modify the C code).

Let's restart this argument when you can do the same thing with an arbitrary C++ program taken from the internet.

Ross.

RossH · 2011-10-25 04:58

Heater. wrote: »

RossH,

I too have recently become fascinated by the idea of threads on the Prop. I know you have made a lot of progress there.
Have you looked at proto-threads?
I managed to create a version of FullDuplexSerial in C that runs in a COG as native code and works at 115200 baud.
However to do that I had to create my own proto-thread like threading macros that are smaller and faster.

Proto-threads look too much like coroutines to me - yuck! I had enough of those horrors in my days of programming Modula-2. I want real threads ... but lightweight ones! Even POSIX threads are too heavy for use on a microcontroller - but they are quite comparable to Catalina's thread library, so that's why I pointed Dr_A to them for a general explanation of threading.

Ross.

Tor · 2011-10-25 05:05

Talking about Toledo and obfuscated C code.. another piece of obfuscated C code, by

RossH · 2011-10-25 05:12

Tor wrote: »

Talking about Toledo and obfuscated C code.. another piece of obfuscated C code, by

Rayman · 2011-10-25 06:17

8080 Emulator that runs CP/M. That's kind of interesting...

I thought of another extreme example to try: SPICE. I think it's just regular C. But, it's big and would probably be very slow.
Still, that might be fun to try. Might need that 32 MB for that one...

Heater. · 2011-10-25 07:15

RossH,

What argument?

The amount of work required to move an ANSI C program? Typically Zip. Nada. Nil.
Zilch.

That is true but this is the world of MCU's we are talking about. So most
programs will be dependent on hardware features, uarts, timers,
interrupt controller set up, register addresses, pin locations, memory bank
switching etc etc etc. Not to mention issues with type sizes, byte ordering,
structure packing and so on.
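
As one small example of the byte ordering and structure packing issues, here is a sketch of the usual defensive idiom: assemble multi-byte values from individual bytes instead of casting a buffer to a struct or integer pointer. The function names are hypothetical, and the little-endian wire format is just an assumption for the example.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from a byte buffer. The result does
   not depend on the host's byte order, alignment rules or struct packing. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit value to a byte buffer in little-endian order. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
    p[3] = (uint8_t)((v >> 24) & 0xFFu);
}
```

Code written this way moves between MCUs unchanged, but it only covers data layout. Register addresses, timers and interrupt setup still have to be redone per platform.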

Moving up in sophistication many programs will be dependent on some RTOS API
that has been used to protect the programmer from all the aforementioned
hardware difficulties.

All this portability of which you speak does not exist in MCU land. Well any
way that's my experience of having to pick up and migrate C code projects from platform
to platform in recent years.

Let's restart this argument when you can do the same thing with an arbitrary C++
program taken from the internet.

I suspect that most real world programs you can't do that in C or C++ or any
other language. Most programs have dependencies far beyond what the standard
libs satisfy.

Heater. · 2011-10-25 07:31

RossH,

Proto-threads look too much like coroutines to me - yuck!...

Oh good, something else we can argue about:)

Coroutines are wonderful things. Simple, small and fast. Cooperative multitasking saves all that overhead of interrupts and task scheduling. Also you don't need a stack in many cases.

As you know Chip's FullDuplexSerial.spin has two threads, one for transmitting and one for receiving. He has basically implemented two coroutines in PASM that hand over control to each other using just a JMPRET instruction.
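
For readers who have not seen the trick, here is a minimal sketch of stackless, protothread-style coroutines in plain C. These are not the macros used for the serial driver mentioned above and not the real protothreads library; the `CO_*` macros and the producer/consumer pair are hypothetical, written only to show two routines handing control back and forth without a scheduler or a stack per thread.

```c
#include <stdint.h>

/* Each coroutine remembers where to resume in a small integer. */
typedef struct { uint16_t resume_line; } co_t;

#define CO_BEGIN(co)  switch ((co)->resume_line) { case 0:
#define CO_WAIT_UNTIL(co, cond) \
    do { (co)->resume_line = __LINE__; case __LINE__: \
         if (!(cond)) return 0; } while (0)
#define CO_END(co)    } (co)->resume_line = 0; return 1

static int mailbox = -1;        /* -1 means empty */
static int received[3];
static int received_count;

/* Producer: posts 1, 2, 3, yielding while the mailbox is still full.
   Its loop variable is static because locals do not survive a yield. */
static int producer(co_t *co)
{
    static int next;
    CO_BEGIN(co);
    for (next = 1; next <= 3; next++) {
        CO_WAIT_UNTIL(co, mailbox < 0);
        mailbox = next;
    }
    CO_END(co);
}

/* Consumer: takes three values, yielding while the mailbox is empty. */
static int consumer(co_t *co)
{
    CO_BEGIN(co);
    while (received_count < 3) {
        CO_WAIT_UNTIL(co, mailbox >= 0);
        received[received_count++] = mailbox;
        mailbox = -1;
    }
    CO_END(co);
}

/* Call each coroutine in turn until both report completion. */
static void run_both(void)
{
    co_t p = {0}, c = {0};
    int p_done = 0, c_done = 0;
    while (!p_done || !c_done) {
        if (!p_done) p_done = producer(&p);
        if (!c_done) c_done = consumer(&c);
    }
}
```

Each coroutine costs one small integer of state plus whatever statics it needs, which is why the approach is attractive where memory is as tight as it is inside a COG.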

Turns out we can now do that in C compiled to native COG code and still get 115200 baud.

...I want real threads...

No you don't. Never mind pthreads, I came to the conclusion that even proto-threads are too heavyweight for in-COG code:)

jazzed · 2011-10-25 07:48

Dr_Acula wrote: »

Thanks for the offer jazzed - is the design open source (just thinking ahead if I then use it in an open source design it might sort of become open source in the process?).

If you copy the design, and it's to become open source then I guess you need to cite me as a reference.

RossH wrote: »

Great! This can be the first differentiator between GCC and Catalina.

It is not the first

There are already several other great differentiators. I won't repeat them here since you and Heater already opined them. One of the most important differentiators has never even been discussed publicly.

The threaded LMM feature is in our original requirements, is scoped, and has already had some work done. When Parallax decides multi-threaded LMM is important to make a priority, it won't be too hard to finish. They are happy to demonstrate the power of Propeller without needing xyz multi-threading though.
