Catalina 3.2

RossH · 2011-08-30 17:38

Rayman wrote: »

Ross, I just downloaded 3.2 and it installed without any problems.

I'm looking around for info on Star Trek, VI and JZIP... I found the startrek.c file. Is there no "project" file for these?
Maybe because they're just one file?

I just tried "build" on startrek.c, but nothing happens. Guess I'll have to do something crazy like read the manual...

BTW: Do these programs expect an 80-column display?
Are you outputting the display over serial port in these examples?
Thanks.

Hi Rayman,

From memory, I think you'll find the star trek project file in the xmm_demos.workspace file (in C:\Program Files\Catalina\codeblocks). Star Trek requires XMM RAM.

The vi and jzip project files will be in the catalyst_demos.workspace file. They also require XMM RAM.

Star Trek generally assumes an 80 column display. It therefore works best on a PC serial terminal or on a VGA display.

Ross.

RossH · 2011-08-30 18:50

jazzed wrote: »

Your cache driver doesn't seem to follow that model exactly. Catalina drivers have inter-dependencies and access methods that I don't understand just yet. I could just send you hardware, but am not really sure why I should need to do that. Maybe your update will help clear the fog.

What fog?

You don't need to worry about the cache - with Catalina, whether or not the cache is in use is transparent to you both from the C application level and the XMM API level. Cache handling and driver dependencies are managed by defining a couple of symbols (e.g. you will see symbols like CACHED and SHARED_XMM used in appropriate places) - but you don't even need to worry about these just to get the basic XMM API working.

To support a new XMM board all you need to do is provide implementations of the eight standard XMM API functions described on page 105 of the Catalina Reference Manual. These functions haven't changed since XMM support was added to Catalina a couple of years ago:

XMM_Activate & XMM_Tristate - these functions take care of initialization and any driver dependencies - and on some platforms they are simply empty functions.
XMM_ReadLong & XMM_WriteLong - read and write a single 32 bit value
XMM_ReadMult & XMM_WriteMult - read and write 1, 2 or 4 bytes
XMM_ReadPage & XMM_WritePage - read and write an arbitrary number of bytes

Flash support added some complications, but these are not relevant in your case. You will probably realize that all the read/write functions could be implemented using only two underlying PASM functions:

XMM_ReadByte & XMM_WriteByte - read and write a single byte value

The reason the XMM API does not define these functions, but instead has separate functions for different size reads and writes, is that many XMM implementations allow you to optimize access using different strategies when you know in advance what type of data you are reading. For instance, XMM_ReadLong is used for instruction reads - so if your XMM RAM is byte-oriented (and especially if your XMM RAM auto-increments the address after each read) you would not implement this as a loop of four single byte reads. Instead, you would set the address only once and unroll the loop for maximum speed. You can't do this if you only have a single "read byte" function.
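To make the point concrete, here is a minimal C sketch (not Catalina code, and not PASM) of a simulated byte-wide RAM that auto-increments its address after each read. Everything in it (the sim_xmm type, the sim_* helpers and the read_long_* names) is a hypothetical illustration rather than part of the XMM API. It only counts how often the address has to be set up, which is the cost the specialised long read avoids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-wide XMM device, simulated in ordinary memory. */
#define SIM_RAM_SIZE 256u

typedef struct {
    uint8_t  mem[SIM_RAM_SIZE];
    uint32_t addr;         /* address register latched inside the device */
    unsigned addr_setups;  /* how many times the address had to be set   */
} sim_xmm;

/* Latch a new address (assumed to be the expensive step on real hardware). */
void sim_set_address(sim_xmm *x, uint32_t addr) {
    assert(x != NULL);
    x->addr = addr % SIM_RAM_SIZE;
    x->addr_setups++;
}

/* Read the byte at the latched address; the device then auto-increments. */
uint8_t sim_next_byte(sim_xmm *x) {
    assert(x != NULL);
    uint8_t b = x->mem[x->addr];
    x->addr = (x->addr + 1u) % SIM_RAM_SIZE;
    return b;
}

/* Generic single-byte read: pays for an address setup on every call. */
uint8_t sim_read_byte(sim_xmm *x, uint32_t addr) {
    sim_set_address(x, addr);
    return sim_next_byte(x);
}

/* Long read as a loop of four byte reads: four address setups. */
uint32_t read_long_naive(sim_xmm *x, uint32_t addr) {
    uint32_t v = 0;
    for (unsigned i = 0; i < 4; i++) {
        v |= (uint32_t)sim_read_byte(x, addr + i) << (8u * i);
    }
    return v;
}

/* Specialised long read: one address setup, loop unrolled, little-endian. */
uint32_t read_long_fast(sim_xmm *x, uint32_t addr) {
    sim_set_address(x, addr);
    uint32_t v = (uint32_t)sim_next_byte(x);
    v |= (uint32_t)sim_next_byte(x) << 8;
    v |= (uint32_t)sim_next_byte(x) << 16;
    v |= (uint32_t)sim_next_byte(x) << 24;
    return v;
}
```

Both functions return the same value; the difference is that the naive version sets the address four times and the specialised one sets it once.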

You don't even need to mess about with Catalina itself (at least not the C compiler, the caching driver or the kernel) to develop and test your XMM API - the Ram_Test program (in the utilities folder) is a pure Spin/PASM program designed to exercise these API functions. Once you have this program working, the same XMM API should work in Catalina. This is how I develop new XMM functions. The main problem is not usually writing the API functions - it is trying to squeeze them down to fit in the available cog RAM space!

If you need a specific example, check out one of the <platform>_XMM.inc implementations in the Catalina target directory - e.g. the file HX512_XMM.inc is used for the Hydra eXtreme SRAM board on both the Hydra and Hybrid platforms.

You can send me a board if you like - but I should warn you that I currently have a backlog of other boards I have promised to support.

Ross.

jazzed · 2011-08-30 18:55

RossH wrote: »

What fog?

Same fog, different year.

I'll post the latest C3 performance numbers I can find for comparisons with xBasic external memory modes, and let you offer updated ones as you see fit.

RossH · 2011-08-30 20:02

jazzed wrote: »

Same fog, different year.

Jazzed,

If you can provide the following three PASM functions:

XMM_Activate - initialize the XMM RAM board. No parameters.
XMM_ReadByte - read a single byte from a specified address in XMM RAM. Parameters:
- Address
XMM_WriteByte - write a single byte to a specified address in XMM RAM. Parameters:
- Address
- Byte

Then I will provide you with target files suitable for use with your board. The performance will be lousy, but you can work on that separately later. I'm not really sure how this process could possibly be made any easier - but if you have any other suggestions I would be pleased to hear them.
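As a rough picture of why three primitives are enough (and why the result is slow), the following C sketch layers long and page access on top of nothing but an init, a read-byte and a write-byte function. This is an illustration in C, not the PASM that the target files would actually contain; the xmm_primitives struct, the generic_* functions and the fake_* "board" are all hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three board-specific primitives (hypothetical names and signatures). */
typedef struct {
    void    (*init)(void);
    uint8_t (*read_byte)(uint32_t addr);
    void    (*write_byte)(uint32_t addr, uint8_t value);
} xmm_primitives;

/* Slow but correct block read: one primitive call per byte. */
void generic_read_page(const xmm_primitives *p, uint32_t addr,
                       uint8_t *dst, size_t len) {
    assert(p != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        dst[i] = p->read_byte(addr + (uint32_t)i);
    }
}

/* Slow but correct block write: one primitive call per byte. */
void generic_write_page(const xmm_primitives *p, uint32_t addr,
                        const uint8_t *src, size_t len) {
    assert(p != NULL && src != NULL);
    for (size_t i = 0; i < len; i++) {
        p->write_byte(addr + (uint32_t)i, src[i]);
    }
}

/* 32-bit read assembled from four bytes, little-endian. */
uint32_t generic_read_long(const xmm_primitives *p, uint32_t addr) {
    uint8_t b[4];
    generic_read_page(p, addr, b, sizeof b);
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* 32-bit write split into four bytes, little-endian. */
void generic_write_long(const xmm_primitives *p, uint32_t addr, uint32_t v) {
    uint8_t b[4] = { (uint8_t)v, (uint8_t)(v >> 8),
                     (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
    generic_write_page(p, addr, b, sizeof b);
}

/* Stand-in "board": a plain array instead of real external RAM. */
static uint8_t fake_ram[64];

static void fake_init(void) {
    for (size_t i = 0; i < sizeof fake_ram; i++) fake_ram[i] = 0;
}
static uint8_t fake_read_byte(uint32_t addr) {
    return fake_ram[addr % sizeof fake_ram];
}
static void fake_write_byte(uint32_t addr, uint8_t value) {
    fake_ram[addr % sizeof fake_ram] = value;
}

const xmm_primitives fake_board = { fake_init, fake_read_byte, fake_write_byte };
```

Every byte costs a full primitive call here, which is where the "lousy" performance comes from; replacing the generic functions with board-specific ones is the later optimization step.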

Ross.

jazzed · 2011-08-30 20:55

There is no XMM_ReadByte or XMM_WriteByte in your code. You need to clarify this.

I'll look at your "prescription" in the next few days. As far as ease of use goes ...

It would be far easier to just use a cache interface. What happened to that?
With a cache I get 5MB/s (80MHz) transfers from flash.

Why would I want "lousy performance" for any of my hardware?
My boards use synchronous transfers similar to C3 except byte at a time.
I can read and save a byte in every single hub window in a cache COG.

Is this table correct or out of date?

Hardware        |  Language   |  FIBO(20) time  |  FIBO 0 to 26
----------------+-------------+-----------------+--------------
C3 LMM          |  Catalina C |  306ms          |  11s
C3 XMM cached   |  Catalina C |  1468ms         |  1m10s
C3 XMM uncached |  Catalina C |  7386ms         |  5m50s
----------------+-------------+-----------------+--------------

Below is current performance for xBasic using the cache interface.

----------------+-------------+-----------------+--------------
Hardware        |  Language   |  FIBO(20) time  |  FIBO 0 to 26
----------------+-------------+-----------------+--------------
C3 HUB          |  xBasic     |  584ms          |  17s
SSF80 Flash     |  xBasic     |  779ms          |  23s
C3 Flash        |  xBasic     |  803ms          |  24s
----------------+-------------+-----------------+--------------

RossH · 2011-08-30 22:17

jazzed wrote: »

There is no XMM_ReadByte or XMM_WriteByte in your code. You need to clarify this.

Jazzed, I am at a bit of a loss here. Are you being deliberately difficult, or do you really not understand?

Let me try one more time ...

The smallest amount of input I need to add support for your external RAM board to Catalina is three PASM functions.

You can name the functions whatever you like, and you can write them however you like.

The functions can use any mechanism you choose to accept and return values.

Here is a description of what each of the three functions should do:

Function 1: Initialize the board so that calling the other functions will succeed. Please let me know if it is ok to call this function at any time, or if it should only ever be called once (if no initialization is required, then just say so).

Function 2: Read a byte from external RAM, given the address to read from.

Function 3: Write a byte to external RAM, given the address and the byte to write.

Provide these and I will return you a set of target files suitable for use with Catalina.

Ross.

Rayman · 2011-08-31 06:25

Ross, Thanks, I found the workspace with all those projects in it.
I tried compile, and it appears to work. Can I ask one quick question?
I see that no compiler options are selected. I think I've read that "Hydra" is the default platform.
It appears that "Catalina_HiRes_TV_Text.spin" is included in the compile.
Does that mean the output is displayed on TV if you have a Hydra?

(wish I didn't give my Hydra away now)

There is one feature in "Code::Blocks" that I wish worked...
I selected "catalina_hmi.h" in "catalina.c" included project file.
Then, right click and tried "Open #include file:", but it doesn't work.
Is there an easy way to view all included files? Do I have to manually add them to the project?

RossH · 2011-08-31 23:20

Rayman wrote: »

Ross, Thanks, I found the workspace with all those projects in it.
I tried compile, and it appears to work. Can I ask one quick question?
I see that no compiler options are selected. I think I've read that "Hydra" is the default platform.
It appears that "Catalina_HiRes_TV_Text.spin" is included in the compile.
Does that mean the output is displayed on TV if you have a Hydra?

Yes, if you don't specify any other platform, Catalina uses the configuration data for the Hydra. And on most platforms, the TV output is the default if you don't select anything else (except on platforms that don't have a TV output, where PC is the default).

Rayman wrote: »

(wish I didn't give my Hydra away now )

Why on earth did you do that??? The Hydra is still my favorite platform - it is far and away the most versatile and fully featured of all the Propeller boards!

Rayman wrote: »

There is one feature in "Code::Blocks" that I wish worked...
I selected "catalina_hmi.h" in "catalina.c" included project file.
Then, right click and tried "Open #include file:", but it doesn't work.
Is there an easy way to view all included files? Do I have to manually add them to the project?

No, the problem is that (by default) Code::Blocks doesn't know where to look for include files other than those in your project - Catalina does, but Code::Blocks does not!

However, see this post, where I point out how to fix auto-completion so that it will work for all C library functions (which Code::Blocks learns about by scanning the include files). I think this should solve your problem as well.

Ross.

Dr_Acula · 2011-08-31 23:33

Interesting discussion re the XMM.

In a couple of other threads I have pushed video to the limit but the catch is that it is using an external ram and all the prop pins almost 100% of the time. So that leads to a different design consideration using two propellers, each with their own external ram.

Prop 1 running Catalina and Prop 2 running video.

The advantage here is that more pins are available for Prop 1.

A hypothetical for Ross. Say you have 24 pins free for external ram, and you latch in blocks of (say) 4096 bytes. The ram driver code is lean - assume a block has been set with this:

' pass latchcounter
selectlatch             mov     dira, blockmask         ' enable different pins for reading, i.e. P0-7 are outputs now
                        mov     t1, latchcounter        ' get the latch counter
                        and     t1, #%01111111          ' mask off high bit as 512k/4096/16 is 128
                        mov     outa, t1                ' output blockcount byte
                        and     outa, latchlow          ' set latch low
                        mov     dira, dira_pins         ' enable these pins for output  %00000000_11111111_11111111_00000000
                        mov     outa, oewrrd            ' set oe, wr, rd and latch high
selectlatch_ret         ret

Then you can read n bytes from external ram to hub with this

' Read a "len" byte block given by "sram_address" into hub at "hubaddr" - Cluso99's ram driver
' preserves sram_address
rdblock
                        mov     t1, sram_address        ' store sram address
                        shl     t1, #8                  ' shift left so in the right position
                        or      t1, oerd                ' %00000000_01010000_00000000_00000000 ' /oe and /rd low
                        ' outa pre-filled with the address shifted over by 8 and the /oe and /rd low
                        mov     outa, t1                ' send it out
                        'nop                            ' cluso's driver has a nop but it seems ok without one
rdloop                  mov     t2, ina                 ' read byte from SRAM \ ignores upper bits
                        wrbyte  t2, hubaddr             ' copy byte to hub    /
                        add     hubaddr, #1             ' inc hub pointer
                        add     outa, #(1 << 8)         ' inc sram address
                        djnz    len, #rdloop            ' loop for xxx bytes
                        mov     outa, oewrrd            ' oe, wr and rd and latch all high
rdblock_ret             ret

That ram read code is fast - probably faster than C.

So if Catalina goes and requests just one byte from the ram that is rather inefficient.

But how about Catalina requests a byte, the ram read code hasn't got much else to do so why not just keep reading bytes? Check every now and then that catalina is not now requesting a read, but most of the time, what it ought to do is read in 4096 bytes into hub ready for catalina to use.

Would this make a pretty fast cache - possibly almost as fast as running LMM catalina from hub ram?
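The idea can be sketched in C roughly as follows. This is a simplified model under several assumptions: external RAM and the hub buffer are plain arrays, the block read is a memcpy standing in for rdblock, only reads are handled, and the "keep reading in the background while idle" part is reduced to filling a whole 4096-byte block on the first miss. All names (xmm, hub_block, cached_read_byte and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096u
#define XMM_SIZE   (16u * BLOCK_SIZE)        /* small simulated external RAM */

static uint8_t  xmm[XMM_SIZE];               /* stand-in for the external SRAM   */
static uint8_t  hub_block[BLOCK_SIZE];       /* stand-in for the hub RAM buffer  */
static uint32_t cached_block = UINT32_MAX;   /* no block loaded yet              */
unsigned block_fills = 0;                    /* how many block reads were needed */

/* Stand-in for the PASM rdblock routine: copy one whole block into hub. */
static void read_block(uint32_t block) {
    memcpy(hub_block, &xmm[block * BLOCK_SIZE], BLOCK_SIZE);
    block_fills++;
}

/* A single-byte request fills the whole block on a miss; later requests
 * inside the same block are served from the hub buffer. */
uint8_t cached_read_byte(uint32_t addr) {
    assert(addr < XMM_SIZE);
    uint32_t block = addr / BLOCK_SIZE;
    if (block != cached_block) {
        read_block(block);
        cached_block = block;
    }
    return hub_block[addr % BLOCK_SIZE];
}
```

In this model, any number of reads within one 4096-byte block costs a single block transfer; only crossing into another block triggers a new fill.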

RossH · 2011-09-01 00:47

Dr_Acula wrote: »

Interesting discussion re the XMM.

In a couple of other threads I have pushed video to the limit but the catch is that it is using an external ram and all the prop pins almost 100% of the time. So that leads to a different design consideration using two propellers, each with their own external ram.

Prop 1 running Catalina and Prop 2 running video.

The advantage here is that more pins are available for Prop 1.

A hypothetical for Ross. Say you have 24 pins free for external ram, and you latch in (say) in blocks of 4096 bytes. ...
<<SNIP>>
... Would this make a pretty fast cache - possibly almost as fast as running LMM catalina from hub ram?

Way ahead of you, Dr_A!

Both the Morpheus and TriBladeProp boards can do pretty much what you describe - i.e. video on one Prop, Catalina running in XMM RAM on another. Works a treat with Catalina's proxy driver support - the fact that the video is running on a separate propeller is completely transparent to the C application (well, perhaps not completely transparent - it is a bit slower!).

Imagine what one could do with a Morpheus CPU #1 (which can do hi-res VGA but needs four cogs to do it) serially connected to one of Cluso's new boards that has nearly all pins dedicated to fast XMM RAM access!

Also, Catalina already has full cache support - on any platform. The intrinsic overheads of caching mean that the speed of executing programs from XMM is much the same on any cached platform - i.e. having very fast XMM access does not really improve the overall cache performance much.

However, while caching may be faster than some of the native XMM interfaces (most notably it is faster than any interface that uses SPI RAM) it does not approach the speeds of LMM programs executing from Hub RAM. For that we will need the Prop II.

Ross.

Heater. · 2011-09-01 01:50

RossH,

Imagine what one could do with a Morpheus CPU #1 (which can do hi-res VGA but needs four cogs to do it) serially connected to one of Cluso's new boards that has nearly all pins dedicated to fast XMM RAM access!

Actually I have a hard time imagining such things. Well, I can imagine it but I'm not sure what the point is, you have one Prop completely tied up doing graphics and comms to another Prop that is completely tied up with its external RAM interface.

Result is there is not much Prop goodness left anywhere for doing anything else useful. As a platform for large programs and video there are many other smaller, cheaper, faster, better solutions.

This will all look much better with the Prop II with all its bazillions of pins so don't let me discourage any of this development.

Dr_Acula · 2011-09-01 03:21

@heater

As a platform for large programs and video there are many other smaller, cheaper, faster, better solutions.

Hey heater, don't lose the faith!

I'd be interested to know what those solutions are. Whatever you do, you need a display, and you need a power supply/regulator. So ok, 2 props plus eeproms $18 and 2 memory chips $8. What else is out there? CPLD - still needs external memory and I gather they use more power and cost more. Other chips - can they really run C at full speed and video/serial ports/keyboard at the same time?

Show me something on a breadboard that uses less power, runs faster, costs less, has near instant software support via a forum and is just as easy to hook up as the prop.

Like Ross, I'm excited by the dual prop solutions. We know the prop can now do a decent sized full color display - good enough to show your photos and detailed enough to run windows calculator. And the video prop can do a lot more because at the moment, 90% of the ram is free and 5 cogs are free. So you can move things into that prop like multiple font libraries, multiple tiles and code to do things like display a text box as an abstract object.

And over on the C side, the programs I've played with have been a little constrained by having most of the hub memory devoted to video ram. Devote the hub to a bigger cache instead and that increases the probability of cache hits, which must make the code run faster.

No need to wait for the prop II, there is so much we can be doing now!

RossH · 2011-09-01 04:45

Dr_Acula wrote: »

Heater. wrote: »

...This will all look much better with the Prop II with all its bazillions of pins so don't let me discourage any of this development.

@heater

Hey heater, don't lose the faith!

...

No need to wait for the prop II, there is so much we can be doing now!

Exactly!

The problem in these forums today (I almost feel like saying "The problem with kids today ...!") is that everybody seems to be spending way too much time whinging about the limitations of the Propeller I, and also sitting around waiting for "The Next Big Thing". But the Propeller I has now been available for ... what? ... 5 years? And nobody is yet using it to anywhere near its full potential!

Few people seem any longer prepared just to take what's on offer and use it to create something new. We all used to do that when I first started frequenting these forums. What has gone wrong? Well, many things - but one of them is undoubtedly the Propeller II itself! The Propeller II has actually done a better job of burying the Propeller I than a certain well-known forum member has managed to do in all his years of relentless trolling for the competition!

But the Propeller II is at best a chimera. It's the vanishing point on the horizon. It's the white lines on the freeway. And it is sapping the creative energy we all used to see in these forums.

Why wait?

Ross.

P.S. From Lord of Light:

Vishnu, the Preserver, held the entire Celestial City within his mind, until the day he circled Milehigh Spire on the back of the Garuda Bird, stared downward and the City was captured perfect in a drop of perspiration on his brow.

Like the Celestial City, the Propeller II will one day emerge - perfectly formed - from a drop of perspiration on Chip's brow. Until then, we have the Propeller I - let's use it!

Cluso99 · 2011-09-01 05:15

I wholeheartedly agree, Ross. We are only beginning to touch the surface of what the prop can do. And we can use multiple props quite efficiently. It will never be a fully fledged arm. It will never be an iPad. But who really cares! We have so much to do with the prop and it is such a pleasure to program.

Heater. · 2011-09-01 05:57

Guys,

No, I have not lost faith.

Yes, I think there is a lot of unexplored territory surrounding the Prop I.

No, I'm not sitting around meditating on the perspiration on Chip's brow.

My point is that as a platform for running large programs in C, or whatever
other language, it does not make much sense. By the time you have bolted on
that super fast SRAM/SDRAM/FLASH interface you have blown away most of your
pins. The resulting setup may be good for running 32MB of code but then it has
few resources left for anything else. There are not many pins left and so any
free cogs remaining don't have room to breathe. The resulting concoction is also
pretty slow as a platform to run C code.

So then we get into the world of two Prop solutions, or more, piling on the
complexity and the expense.

That's not to say that XMM is altogether a bad idea. Sometimes you just need
more code space for things that are not time critical. It's just that I reckon
it's better to suck up the fact that it's going to be slow, compared to a myriad of
other devices, and use a memory interface that consumes as few resources as
possible. That is to say serial devices. Then you have your big code AND still
have most of the Prop left for doing what it does best.

Still that's just my feeling, I'm sure others feel differently.

I only mentioned Prop II because I'm sure a lot of development that is
happening now with XMM and Catalina or gcc or xbasic or whatever on Prop I
might not look very sensible given the attitude expressed above but is by no
means wasted effort.

Dr_A, other platforms, I almost dare not say, but for running C code with a nice
display in a small space, low power, wide temperature spec I use this:
http://www.igep.es/index.php?option=com_content&view=article&id=46&Itemid=55
Not to mention that it has ethernet, wifi, sd card, serial ports, USB,
bluetooth, audio in/out all for 160 Euros.

Or there is its smaller brother:
http://www.igep.es/index.php?option=com_content&view=article&id=109&Itemid=123

OK it's Smile for real-time I/O but that's only a Propeller on a serial link
away. Not only that I can develop Spin/PASM on it using HomeSpun:)

Next up I want to get my hands on this:
http://www.raspberrypi.org/
A slightly less functional board but with a projected cost of 25-35 dollars!

Rayman · 2011-09-01 06:04

I'm hoping Ross will get my Flashpoint modules working for XMM soon...
Then, I should be able to use two RamPage modules on one Prop for a single Prop computer.
One RamPage for video buffer and another for XMM. That's only 16 pins at most...

Heater. · 2011-09-01 06:21

Rayman,

Excellent, that's just the ticket.

Toby Seckshund · 2011-09-01 07:02

Hey Heater.

You have gone over to using that new fangled stuff, haven't you?

I tried to print off those pages about the cute little boards that I couldn't focus on, but my Creed 7B couldn't cope ;-)

RossH · 2011-09-02 05:24

jazzed wrote: »

...
Is this table correct or out of date?

Hardware        |  Language   |  FIBO(20) time  |  FIBO 0 to 26
----------------+-------------+-----------------+--------------
C3 LMM          |  Catalina C |  306ms          |  11s
C3 XMM cached   |  Catalina C |  1468ms         |  1m10s
C3 XMM uncached |  Catalina C |  7386ms         |  5m50s
----------------+-------------+-----------------+--------------

Below is current performance for xBasic using the cache interface.

----------------+-------------+-----------------+--------------
Hardware        |  Language   |  FIBO(20) time  |  FIBO 0 to 26
----------------+-------------+-----------------+--------------
C3 HUB          |  xBasic     |  584ms          |  17s
SSF80 Flash     |  xBasic     |  779ms          |  23s
C3 Flash        |  xBasic     |  803ms          |  24s
----------------+-------------+-----------------+--------------

Yes, this table is out of date - very much so. I have recently made a few changes to the Catalina code generator that bring the time for Fibo(20) on the C3 down from over 300ms to under 200ms. They also reduce the code size by around 20%. And I have further changes in progress that will reduce these numbers even further.

So why am I not crowing about this? Why have I not published this fantastic new code generator?

Well, the truth is that such improvements are only achievable for Fibo type programs. Gains of this magnitude would never be realized in a real world situation. The new code generator may give good improvements in some cases, and slight improvements in many others - but in a few cases it could actually increase the execution time.

This just goes to show that Fibo is a ridiculously bad benchmark. It is far too easy to be fooled by Fibo results because Fibo exercises only a minute fraction of the code generator. And the whole Fibo function is so small (less than 20 instructions) that even a trivial caching algorithm will give a completely spurious view of the real-world performance of larger programs.
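For anyone who has not seen it, the benchmark is usually some form of naive doubly-recursive Fibonacci. The exact source used for the tables above is not given in this thread, so the C version below is an assumption about its general shape, with fibo_calls added purely to show how little code is being exercised and how often.

```c
#include <stdint.h>

/* Naive doubly-recursive Fibonacci: the whole benchmark is this one function. */
uint32_t fibo(uint32_t n) {
    if (n < 2) {
        return n;
    }
    return fibo(n - 1) + fibo(n - 2);
}

/* Number of calls to fibo() that a single fibo(n) triggers, including itself. */
uint32_t fibo_calls(uint32_t n) {
    if (n < 2) {
        return 1;
    }
    return 1 + fibo_calls(n - 1) + fibo_calls(n - 2);
}
```

A single fibo(20) makes 21891 calls to the same tiny function, so almost all of the run time is spent in call, return, compare and add, and the entire working set fits in even the smallest cache.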

What this means is that any code generator design decisions based solely on Fibo results are likely to lead to very disappointing results in real-world situations.

I will publish the new Catalina code generator when I am confident the improvements can be realized in real-world programs, and are not limited simply to giving good Fibo results.

Ross.

RossH · 2011-09-02 05:29

Rayman wrote: »

I'm hoping Ross will get my Flashpoint modules working for XMM soon...
Then, I should be able to use two RamPage modules on one Prop for a single Prop computer.
One RamPage for video buffer and another for XMM. That's only 16 pins at most...

Next on my list, Ray! Promise!

Ross.

jazzed · 2011-09-02 07:41

I look forward to your new results with HUB, C3 Flash external memory, and other platform code performance.

FIBO testing is a starting point, not an end.

Heater's fft algorithm and other "benchmarks" discussed elsewhere should be considered.
We will have plenty of comparisons I'm sure.

RossH wrote: »

Yes, this table is out of date - very much so. I have recently made a few changes to the Catalina code generator that bring the time for Fibo(20) on the C3 down from over 300ms to under 200ms. They also reduce the code size by around 20%. And I have further changes in progress that will reduce these numbers even further.

Dr_Acula · 2011-09-02 16:08

Ross, quick question. How small can a C program be?

cognew(myarray)

where myarray contains 2048 bytes for a cog program.

This is all it does. Starts up, loads one cog program and sets it running. So no need for printf or floating point or any libraries at all.

How small would such a program be in Catalina?

RossH · 2011-09-02 17:17

jazzed wrote: »

We will have plenty of comparisons I'm sure.

Hi Jazzed,

I have loads of other stuff to work on first, though. Like XMM support for more platforms - including Ray's boards and also yours.

Have you implemented the three necessary functions (outlined here) yet?

Ross.

RossH · 2011-09-02 17:30

Dr_Acula wrote: »

Ross, quick question. How small can a C program be?

cognew(myarray)

where myarray contains 2048 bytes for a cog program.

This is all it does. Starts up, loads one cog program and sets it running. So no need for printf or floating point or any libraries at all.

How small would such a program be in Catalina?

Hi Dr_A,

Small in what sense? It would only be a couple of lines of C code, but the resulting binary file will be around 4k, since it would need 2k for the cog program, and another 2k for the kernel.

Ross.

jazzed · 2011-09-02 18:06

RossH wrote: »

Have you implemented the three necessary functions (outlined here) yet?

No. I can't find XMM_ReadByte or XMM_WriteByte in your code. You need to clarify this.
Did you mean XMM_ReadPage or XMM_WritePage?

RossH · 2011-09-02 18:20

jazzed wrote: »

No. I can't find XMM_ReadByte or XMM_WriteByte in your code. You need to clarify this.
Did you mean XMM_ReadPage or XMM_WritePage?

Jazzed,

I don't care what you call the functions. Please read this post.

Ross.

Dr_Acula · 2011-09-02 18:43

Small in what sense? It would only be a couple of lines of C code, but the resulting binary file will be around 4k, since it would need 2k for the cog program, and another 2k for the kernel.

Ok, great news. I'm thinking of a video driver with 28672 bytes for the screen buffer so that leaves 4096 bytes for code. The driver is virtually all pasm, so all the pasm arrays can be stored at startup in the video buffer. So if the kernel only takes 2k, that ought to fit with no problems.

Back to hardware (I'm working on a dual prop Gadget Gangster board with one prop running C and one prop doing the best color resolution the prop can do). I'd like the video prop to run C as well, for a pure C design.

jazzed · 2011-09-02 18:53

RossH wrote: »

Jazzed,

I don't care what you call the functions. Please read this post.

Ross.

I don't know what you mean. What APIs will you call?

David Betz · 2011-09-02 19:11

jazzed wrote: »

I don't know what you mean. What APIs will you call?

I think he means that if you write the Init, ReadByte, and WriteByte functions for your hardware and send them to him, he'll send you a set of target files that will let Catalina work with your board. In the case of your SDRAM board, I think those functions could be written on top of your JCACHE interface to the COG that manages the SDRAM access and refresh. Presumably, they are functions that run in his LMM kernel COG. The exact names don't matter because he'll adjust his code to use whatever you supply. At least that's my understanding of what he's saying.

Rayman · 2011-09-02 19:11

Jazzed, just to interject... As I recall, RAM access in Catalina is essentially long based, not byte based (fortunately for me.) But, you still need to provide byte level page access, I think.

SD card and flash access is byte based though...
