Challenge: Execute Prop II code from its CLUT space (CLMM)?

Heater. · 2012-12-16 22:12

This is a bit cheeky of me, as I have yet to get to grips with the Propeller II instruction set beyond a cursory glance at what has been going on in the II thread recently. But I'd like to suggest the idea of executing PASM code from the Propeller II CLUT memory space. Sorry if this has come up before. If so, just forget I said anything.

The Prop II has 256 longs of memory over and above the usual COG space. This was originally intended as a space for a Colour Look Up Table (CLUT) in the Prop II's new video set up. The P II has also grown some extra instructions enabling access to that memory space as a stack.

With that comes the possibility of using the CLUT memory for other purposes. Bill Henning has, for example, shown how it can be used as a store for dispatch tables in interpreters, VMs, emulators and the like.

Obviously the CLUT could be used for other look up tables. For example the PII ROM has no sin/cos/log tables so one could build sin/cos/log or other tables in CLUT memory for fast access. One little trick here is that the PII does not clear that CLUT space when loading a COG so the table content can be calculated and loaded by one COG program and then the program that needs the table can be loaded and run.
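To make the table idea concrete, here is a minimal sketch in Python (not PASM) of what a precomputed sine table sized for the CLUT might contain. The fixed-point scale and the function names are assumptions made for illustration; the thread does not specify a format.

```python
import math

CLUT_LONGS = 256          # size of the CLUT in 32-bit longs, per the thread
SCALE = 0x7FFFFFFF        # assumed fixed-point scale: 1.0 maps to 2**31 - 1

def build_sine_table(entries=CLUT_LONGS):
    """Return one full sine period as signed values wrapped into 32-bit longs."""
    table = []
    for i in range(entries):
        value = round(math.sin(2 * math.pi * i / entries) * SCALE)
        table.append(value & 0xFFFFFFFF)   # two's-complement wrap into a long
    return table

def lookup_sine(table, index):
    """Read an entry back as a signed integer; the index wraps around the table."""
    raw = table[index % len(table)]
    return raw - (1 << 32) if raw & 0x80000000 else raw

table = build_sine_table()
```

A loader program would compute something like `table` once and leave it in the CLUT for the program that is started afterwards.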

What about using CLUT space to increase the amount of executable space for PASM itself?

Currently on the Propellers we have the Large Memory Model (LMM) for PASM in HUB, and the External Memory Model (XMM) for PASM in external devices.
What about CLUTMM for PASM in CLUT memory? It would just require a tight fetch-execute loop, similar to LMM, but fetching from CLUT memory instead of HUB memory.
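The shape of such a loop can be modelled in a few lines of Python. This is only a conceptual sketch: the tuple "instructions" and the opcode names are invented for the illustration and are not Propeller II instructions, and the real loop would of course be a handful of PASM instructions.

```python
CLUT_SIZE = 256  # longs in the CLUT, per the thread

def run_clmm(clut, regs, max_steps=1000):
    """Toy fetch/execute loop: fetch from clut[pc], advance pc, then execute.

    Instructions are (op, dest, src) tuples. The opcodes are hypothetical
    and exist only for this illustration.
    """
    pc = 0
    for _ in range(max_steps):
        op, dest, src = clut[pc]
        pc = (pc + 1) % CLUT_SIZE        # the fetch pointer wraps inside the CLUT
        if op == "mov":
            regs[dest] = src
        elif op == "add":
            regs[dest] = (regs[dest] + regs[src]) & 0xFFFFFFFF
        elif op == "djnz":               # decrement dest, jump to src if non-zero
            regs[dest] = (regs[dest] - 1) & 0xFFFFFFFF
            if regs[dest] != 0:
                pc = src                 # a jump is just a write to the fetch pointer
        elif op == "halt":
            return regs
        else:
            raise ValueError(f"unknown op {op!r}")
    raise RuntimeError("step limit reached")

# Sum 5 + 4 + 3 + 2 + 1 with a loop that lives entirely in the "CLUT".
program = [
    ("mov", "acc", 0),
    ("mov", "n", 5),
    ("add", "acc", "n"),     # address 2: loop body
    ("djnz", "n", 2),
    ("halt", None, None),
]
clut = program + [("halt", None, None)] * (CLUT_SIZE - len(program))
result = run_clmm(clut, {"acc": 0, "n": 0})
```

The point of the model is that the only state the loop needs is a fetch pointer into the 256-entry memory, and that a jump is nothing more than a write to that pointer.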

Ah you say, CLUTMM is useless, it's slower than normal and only 256 instructions are available.

BUT, the Propeller II also has a means of running multiple threads at speed with hardware scheduling. That means we can have one thread executing normal PASM from COG space and another thread running a CLUTMM loop executing code from the CLUT memory. Alternatively, a single thread can be used and CLUTMM functions fired up as and when needed. There is no reason not to have a program that is a mixture of both native and CLUTMM code.

Bingo! We have now extended the COG instruction space by 50%. 768 instructions minus the thread/CLUTMM overhead.

As I have not studied the PII much, don't have an FPGA board to try out the PII emulator, and don't have much time available, I can only put the thought up here for others to explore if it tickles their fancy.

Sapieha · 2012-12-16 22:22

Hi Heater.

That is a good point.

Would it be possible for you to rewrite the I8080 emulator so that it runs entirely from COG+CLUT and only accesses HUB for the I8080 code?

Heater. · 2012-12-16 22:35

Sapieha,

Yes, yes. The 8080/8085 instruction set can be almost completely done in a single COG on the PI, apart from the dispatch table. There is an early version of PropAltair or ZiCog that does 8085 only and shows that it is possible. I could not get some things into COG, like the long-winded and little-used DAA instruction. They were done with LMM.

In the 8080 instruction set most of the op codes are MOV, you only need a dispatch table for half of the opcodes, 128 entries. That leaves 128 entries in CLUT for code space for some 8080 ops. I think it will probably all fit in COG.
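As an illustration of how a 128-entry table can be enough, here is a Python sketch of one possible split. The thread does not say exactly how the split was meant, so treat this as an assumption: the MOV block (`01 DDD SSS`) and the register ALU block (`10 AAA SSS`) are decoded from their bit fields, and only the remaining `00xxxxxx` and `11xxxxxx` opcodes go through a table. The function names are hypothetical.

```python
REGS = "BCDEHLMA"   # 8080 register field encoding, 0..7 (M is the memory operand)
ALU_OPS = ("ADD", "ADC", "SUB", "SBB", "ANA", "XRA", "ORA", "CMP")

def table_index(opcode):
    """Map the 128 table-dispatched opcodes onto indices 0..127.

    00xxxxxx -> 0..63 and 11xxxxxx -> 64..127: bit 7 is moved down to bit 6.
    """
    return (opcode & 0x3F) | ((opcode >> 1) & 0x40)

def classify_8080(opcode):
    """Decode regular opcodes from bit fields; send the rest to the table."""
    group = opcode >> 6
    dst_or_op = (opcode >> 3) & 7
    src = opcode & 7
    if group == 0b01:
        if opcode == 0x76:               # the MOV M,M slot is HLT
            return ("HLT",)
        return ("MOV", REGS[dst_or_op], REGS[src])
    if group == 0b10:
        return (ALU_OPS[dst_or_op], REGS[src])
    return ("TABLE", table_index(opcode))

table_opcodes = [op for op in range(256) if classify_8080(op)[0] == "TABLE"]
```

With this split, `table_opcodes` has 128 members and each one maps to a distinct index, which would leave the other half of the CLUT free for code.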

That is why I started to think about CLUT as PASM code space, and CLUT as dispatch table before that.

ctwardell · 2012-12-16 22:56

Yes, it works very well. I was working on it just now.
You sort of took away my leverage. I was going to release it under the same license as Bill, to show how bad this can get if we all plant flags, and then agree to make it an app note instead if Bill does the same.
Depending on how you set up the loop, you can even use the stack instructions setspa, addspa, and subspa to do absolute and relative jumps.

Also, depending on how you set up the loop (you can't use REP in this case), you can run threaded code.

I'm calling it CLMM (pronounced as CLEM)

C.W.

Cluso99 · 2012-12-16 23:00

I have mentioned using the CLUT to load overlays into cog RAM to be executed there. Loading can be done really fast using a single pop executed n times.
BTW, I have asked Chip for a special shift instruction to shift the 9-bit vectors and keep the top 5 bits unshifted. I use this method in the faster Spin interpreter (see near the end of the documentation thread).

We are certainly going to have some fun with P2. I am about to order my DE0-nano so hopefully will have it before the end of the week.

Heater. · 2012-12-16 23:06

ctwardell,

Now this is what I love about this forum, you only have to have an idea pop into mind and someone has already done it. Especially if it's obviously impossible:)
Of course I'm so slow that this is normally true in the rest of life as well.

I don't have any leverage. No code. Nothing. Just the muse.

I do like "CLMM" better than "CLUTMM". Do you mind if I change the thread title to use "CLMM"?

jazzed · 2012-12-16 23:08

CLUMP - CLUT Memory Program

ctwardell · 2012-12-16 23:10

Heater, go for the name change.

ctwardell · 2012-12-16 23:12

Attached is source for CLMM Engine #1

Released as MIT.

I'll do a better write up and post more engines soon.

Enjoy.

C.W.

CLMM_Engine1.spin

ctwardell · 2012-12-16 23:18

It's a relief to be able to discuss this, I didn't want to tip my hand and have anyone release non-MIT.

It is very curious. From my understanding of the pipeline I would think I would need at least three nops between the popar and the instruction slot, and I do if I want to manipulate the "PC" via setspa, addspa, and subspa.
If I just do balls-out execution with no jumps like that, I can have no nops between the popar and the instruction slot. That may be the overlapped stuff Chip mentioned on Bill's BCEE thread.

C.W.

Heater. · 2012-12-16 23:20

Cluso,

I also thought about overlays from CLUT memory. Probably a very good technique if the code in the overlay contains loops. Then the overhead of loading diminishes with respect to the execution time.

However for in line code a single instruction CLMM technique may be better. The problem I ran into using overlays in ZiCog was that you need a space as big as your biggest overlay in COG in which to load it. With a LMM style you only need space for the small LMM loop. As the ZiCog overlays were straight line code the speed difference between overlay and LMM is not great.

I noticed your request for the special shift operations. I must say that having had a quick look at the list of instructions for PII it's pretty overwhelming already. Perhaps adding more odd things is not a good idea. For sure 90 percent of those instructions are never going to be used by compilers.

It pains me I can't get into PII experiments with the nano board, I have so much going I just can't allow myself another toy just now.

Sapieha · 2012-12-16 23:24

Hi Heater.

It is why I don't like C type compilers ---

Heater. wrote: »

Cluso,

I also thought about overlays from CLUT memory. Probably a very good technique if the code in the overlay contains loops. Then the overhead of loading diminishes with respect to the execution time.

However for in line code a single instruction CLMM technique may be better. The problem I ran into using overlays in ZiCog was that you need a space as big as your biggest overlay in COG in which to load it. With a LMM style you only need space for the small LMM loop. As the ZiCog overlays were straight line code the speed difference between overlay and LMM is not great.

I noticed your request for the special shift operations. I must say that having had a quick look at the list of instructions for PII it's pretty overwhelming already. Perhaps adding more odd things is not a good idea. For sure 90 percent of those instructions are never going to be used by compilers.

It pains me I can't get into PII experiments with the nano board, I have so much going I just can't allow myself another toy just now.

pedward · 2012-12-16 23:35

Since the other thread got to be a mess, here's another use for the CLUT: a hash lookup table. You could use it as a translation table for generating hashes with expansion. You can use the CLUT for fast CRC16, which is used in Bluetooth and a variety of other protocols. You can also do a CRC32 with table-based long lookups: http://wiki.osdev.org/CRC32
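For reference, the table-driven CRC32 amounts to the following, sketched here in Python rather than PASM. It uses the common reflected polynomial 0xEDB88320; the 256-entry table of 32-bit values is exactly the size of the CLUT. The function names are illustrative.

```python
CRC32_POLY = 0xEDB88320   # reflected CRC-32 polynomial

def make_crc32_table():
    """Build the 256-entry lookup table, one 32-bit long per byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table

CRC_TABLE = make_crc32_table()

def crc32(data, table=CRC_TABLE):
    """One table lookup, one shift and two XORs per input byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ table[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

check = crc32(b"123456789")   # the standard CRC-32 check input
```

The inner loop of `crc32` is the part that would benefit from a fast indexed read of the CLUT.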

CRC is nothing more than hashing, in fact it shares a lot in common with the AES algorithm. AES has a table of polynomials instead of 1 polynomial, but the math to divide and reduce polynomials between CRC and AES is identical. There are 256 forward and 256 reverse polynomials in AES-128 (encrypt and decrypt s-boxes), so the CLUT works great for that too.

In fact, AES-128 needs 512 bytes for the whole implementation. You could store the encrypt and decrypt tables in the upper and lower words of each CLUT entry and use the same lookup mechanism for both, just adding a second step of SHR X,#16 to get the entry from the upper word.
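The packing scheme can be sketched as follows in Python. Note the loud assumption: the tables used here are a placeholder byte permutation and its inverse, NOT the real AES S-boxes, and the function names are made up for the example. Only the layout of two tables sharing one 32-bit entry is being shown.

```python
def pack_tables(low_table, high_table):
    """Pack two 256-entry tables into one list of 32-bit longs.

    low_table goes in the lower 16 bits, high_table in the upper 16 bits.
    """
    return [((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
            for lo, hi in zip(low_table, high_table)]

def lookup_low(packed, index):
    return packed[index] & 0xFFFF        # mask off the upper word

def lookup_high(packed, index):
    return packed[index] >> 16           # the extra "SHR X,#16" step

# Placeholder tables (NOT the AES S-boxes): an invertible byte mapping
# and its inverse, standing in for the encrypt and decrypt tables.
forward = [(i * 7 + 3) & 0xFF for i in range(256)]
inverse = [0] * 256
for i, f in enumerate(forward):
    inverse[f] = i

packed = pack_tables(forward, inverse)
round_trip_ok = all(lookup_high(packed, lookup_low(packed, i)) == i
                    for i in range(256))
```

Both lookups index the same 256 longs, so the pair of tables costs no more CLUT space than one of them would.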

The tables are what make AES a difficult algorithm to implement on the Prop 1, but with the CLUT you have another 1KB of RAM for data in each COG, which greatly expands what you can do.

Just some food for thought.

Heater. · 2012-12-16 23:39

ctwardell,

Blimey, it's done already. Great stuff. Now I just have to sit down with PII docs and figure out how it works exactly. This is not your father's Propeller.

Clearly we need a way to load up the CLUT with code first. A simple RDLONG, pusha (whatever) loop would do.

To maximize available COG space the COG could be started with a small CLUT loading program which stops when done and then the COG is started again with your real application code.

Actually, such a CLUT loading COG program, one that does pretty much nothing else, might be a useful thing to have stand-alone, as it has applicability to loading sin/cos/log tables and other such tasks prior to loading app code that uses those CLUT tables.

Sapieha · 2012-12-16 23:43

Hi Heater.

Relatively complete instructions set posted.

http://forums.parallax.com/showthread.php?144643-Propeller-II-emulation-for-Idiots&p=1151037&viewfull=1#post1151037

Heater. wrote: »

ctwardell,

Blimey, it's done already. Great stuff. Now I just have to sit down with PII docs and figure out how it works exactly. This is not your father's Propeller.

Clearly we need a way to load up the CLUT with code first. A simple RDLONG, pusha (whatever) loop would do.

To maximize available COG space the COG could be started with a small CLUT loading program which stops when done and then the COG is started again with your real application code.

Actually, such a CLUT loading COG program, one that does pretty much nothing else, might be a useful thing to have stand-alone, as it has applicability to loading sin/cos/log tables and other such tasks prior to loading app code that uses those CLUT tables.

Heater. · 2012-12-16 23:45

Sapieha,

Yes, compilers have that difficulty, not just C compilers. It's just damn hard for a compiler to look at your code, realize intent and then optimize for a complex array of instructions or weird architectures. Compilers are for general-purpose code that you want to be usable on many platforms. That is clearly not the kind of code that we will see inside Propeller COGs very much.

pedward,

Now we're cooking.

Heater. · 2012-12-16 23:54

For anyone stumbling across this thread in the future. The story continues over here http://forums.parallax.com/showthread.php?144677-Announcing-CLMM-(pronounced-as-Clem)-Execute-Code-from-the-CLUT

I think this thread is done already.

jmg · 2012-12-17 00:11

Heater. wrote: »

...
BUT, the Propeller II also has a means of running multiple threads at speed with hardware scheduling. That means we can have one thread executing normal PASM from COG space and another thread running a CLUTMM loop executing code from the CLUT memory. Alternatively, a single thread can be used and CLUTMM functions fired up as and when needed. There is no reason not to have a program that is a mixture of both native and CLUTMM code.

Bingo! We have now extended the COG instruction space by 50%. 768 instructions minus the thread/CLUTMM overhead.

The hardware sliced threading is the late addition that makes this truly useful.

It is a pity that QuadSPI(DDR) in hardware did not make the cut, as that is another useful 'opcode feeder path' for a 'background rate' slice - without the hard ceiling the CLUT has.

Cluso99 · 2012-12-17 00:33

Heater: In the spin interpreter, there are lots of little instruction sets that could be small overlays. I think the same would apply to ZiCog. The penalty is only a clock per instruction loaded plus a couple of setup instructions (which may only replace other decoding). So worst case is a 2:1 loss. But this is faster than LMM. Don't forget we are running 4x faster per instruction and 2x faster clock, and 2x hub access. So P2 should fly.
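The 2:1 figure can be checked with a small cost model, sketched here in Python. It assumes, as a simplification, that loading one overlay instruction and executing one instruction each cost one slot, and that setup costs two slots; these numbers are assumptions for the arithmetic, not measured P2 timings.

```python
def overlay_slots(n_instructions, passes=1, setup=2):
    """Slots to load an overlay once and run its body `passes` times."""
    return setup + n_instructions + n_instructions * passes

def slowdown(n_instructions, passes=1, setup=2):
    """Overlay cost relative to the same code already resident in the cog."""
    return overlay_slots(n_instructions, passes, setup) / (n_instructions * passes)

straight_line = slowdown(32)            # 66 / 32 = 2.0625
with_loop = slowdown(32, passes=10)     # 354 / 320 = 1.10625
```

Straight-line code executed once sits close to the 2:1 worst case, while an overlay whose body loops ten times is only about 10% slower than resident code.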

Another use for the CLUT...
A bit map where we are not using the CLUT for video. In my 1pin mono TV code (no need for 1 pin any more because it only uses 1 pin anyway) I just generate simple TV using VGA mode. I store the bitmap in the cog but this would free up space to do a full VT100 within the same cog too.
Now if only we could page map the 1K into a window of cog space - and yes I know it is not 4way like the cog ram

And we can use the CLUT for buffers too. For instance, setting up a string.

ctwardell · 2012-12-17 00:39

Cluso99 wrote: »

And we can use the CLUT for buffers too. For instance, setting up a string.

Chip shows using the CLUT as a receive buffer for serial in the Monitor code.

C.W.

Heater. · 2012-12-17 01:10

Cluso,

Early versions of ZiCog used your overlay technique. I persisted with it for a long time. I just was very tight for space and switching to LMM eased that up. I was not so worried about speed for those Z80 ops in overlay/LMM as they are almost never used in the CP/M world and probably not much elsewhere. On the PII I'm going to worry about the speed of those ops even less.

Cluso99 · 2012-12-18 01:38

Heater. wrote: »

Cluso,

Early versions of ZiCog used your overlay technique. I persisted with it for a long time. I just was very tight for space and switching to LMM eased that up. I was not so worried about speed for those Z80 ops in overlay/LMM as they are almost never used in the CP/M world and probably not much elsewhere. On the PII I'm going to worry about the speed of those ops even less.

Oh, I don't think I have that version. I am sure the version of ZiCog I use still uses overlays. My first P2 pcb will have provision for 512KB of SRAM and microSD

Can you solder smt 0.4mm pitch???

Heater. · 2012-12-18 02:17

Perhaps it was only the last couple of ZiCog versions that used LMM. I might have to spend Christmas sorting out what is what in the ZiCog department in preparation for PII.
I guess I can solder 0.4mm, under advice.
