Challenge: Execute Prop II code from its CLUT space (CLMM)?
Heater.
Posts: 21,230
This is a bit cheeky of me as I have yet to get to grips with the Propeller II instruction set other than a cursory glance at what has been going on in the II thread recently. But I'd like to suggest the idea of executing PASM code from the Propeller II CLUT memory space. Sorry if this has come up before. If so, just forget I said anything.
The Prop II has 256 longs of memory over and above the usual COG space. This was originally intended as a space for a Colour Look Up Table (CLUT) in the Prop II's new video set up. The P II has also grown some extra instructions enabling access to that memory space as a stack.
With that comes the possibility to use the CLUT memory for other purposes. Bill Henning has for example shown how it can be used as a store for dispatch tables in interpreters, VMs, emulators and the like.
Obviously the CLUT could be used for other look up tables. For example the PII ROM has no sin/cos/log tables so one could build sin/cos/log or other tables in CLUT memory for fast access. One little trick here is that the PII does not clear that CLUT space when loading a COG so the table content can be calculated and loaded by one COG program and then the program that needs the table can be loaded and run.
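A minimal sketch of the table-building idea, in Python rather than PASM. It models the CLUT as 256 longs and fills it with one full sine cycle. The table size, the signed 16-bit scaling and the names `build_sin_table` and `lookup_sin` are illustrative assumptions, not anything taken from the PII documentation.

```python
import math

CLUT_LONGS = 256  # the thread treats the CLUT as 256 longs


def build_sin_table(entries=CLUT_LONGS, amplitude=0x7FFF):
    """Return one full sine cycle, each entry stored as a 32-bit long."""
    table = []
    for i in range(entries):
        value = round(amplitude * math.sin(2 * math.pi * i / entries))
        table.append(value & 0xFFFFFFFF)  # two's complement in 32 bits
    return table


def lookup_sin(table, index):
    """Fetch an entry and convert it back to a signed integer."""
    raw = table[index % len(table)]
    return raw - 0x100000000 if raw & 0x80000000 else raw


clut = build_sin_table()
print(lookup_sin(clut, 64))  # value at a quarter of the cycle
```

In a real setup the first COG program would do the equivalent of `build_sin_table` and the second would only ever do the lookup.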
What about using CLUT space to increase the amount of executable space for PASM itself?
Currently on the Propellers we have Large Memory Model (LMM) for PASM in HUB and External Memory Model (XMM) for PASM in external devices.
What about CLUTMM for PASM in CLUT memory? It would just require a tight fetch, execute loop similar to LMM but fetching from CLUT memory instead of HUB memory.
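To make the fetch, execute loop concrete, here is a toy model in Python. It is only a sketch: the two-instruction "ISA" (`movi`, `add`), the tuple encoding and the name `run_clutmm` are invented for illustration and have nothing to do with real PASM encodings or timing.

```python
CLUT_LONGS = 256  # size of the CLUT as discussed in the thread


def run_clutmm(program, regs, max_steps=1000):
    """Fetch instructions one at a time from a 256-entry store and execute them."""
    if len(program) > CLUT_LONGS:
        raise ValueError("program does not fit in the CLUT")
    # Pad the unused part of the CLUT with halts.
    clut = list(program) + [("halt", None, None)] * (CLUT_LONGS - len(program))
    pc = 0
    for _ in range(max_steps):
        op, dst, src = clut[pc]          # fetch from CLUT, not from HUB
        pc = (pc + 1) % CLUT_LONGS       # advance, wrapping at the end
        if op == "halt":
            return regs
        if op == "movi":                 # load an immediate value
            regs[dst] = src & 0xFFFFFFFF
        elif op == "add":                # register + register, 32-bit wrap
            regs[dst] = (regs[dst] + regs[src]) & 0xFFFFFFFF
        else:
            raise ValueError(f"unknown op {op!r}")
    raise RuntimeError("step limit reached")


demo = [("movi", "a", 5), ("movi", "b", 7), ("add", "a", "b")]
print(run_clutmm(demo, {"a": 0, "b": 0}))
```

The point is only the shape of the loop: fetch, bump the pointer, execute, repeat. That is the same shape as an LMM loop with the HUB read swapped for a CLUT read.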
Ah you say, CLUTMM is useless, it's slower than normal and only 256 instructions are available.
BUT, the Propeller II also has a means of running multiple threads at speed with hardware scheduling. That means we can have one thread executing normal PASM from COG space and another thread running a CLUTMM loop executing code from the CLUT memory. Alternatively a single thread can be used and CLUTMM functions fired up as and when needed. No reason to not have a program that is a mixture of both native and CLUTMM.
Bingo! We have now extended the COG instruction space by 50%. 768 instructions minus the thread/CLUTMM overhead.
As I have not studied the PII much, don't have an FPGA board to try out the PII emulator and don't have much time available, I can only put the thought up here for others to explore if it tickles their fancy.
Comments
That is a good point.
Would it be possible for you to rewrite the I8080 emulator so that it runs entirely from COG+CLUT and only accesses HUB for the I8080 code?
Yes, yes. The 8080/8085 instruction set can be almost completely done in a single COG on the PI. Apart from the dispatch table. There is an early version of PropAltair or ZiCog that does 8085 only and shows that it is possible. I could not get some things into COG, like the long winded and little used DAA instruction. They were done with LMM.
In the 8080 instruction set most of the op codes are MOV, you only need a dispatch table for half of the opcodes, 128 entries. That leaves 128 entries in CLUT for code space for some 8080 ops. I think it will probably all fit in COG.
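A sketch of how such a split could look, in Python. It assumes the "regular" half is the block 0x40..0xBF (register MOVs plus the register ALU group), which is decoded from bit fields, while the remaining 128 opcodes go through a table. The function names and the table slot numbering are made up for illustration.

```python
REGS = ["B", "C", "D", "E", "H", "L", "M", "A"]
ALU = ["ADD", "ADC", "SUB", "SBB", "ANA", "XRA", "ORA", "CMP"]


def dispatch_index(opcode):
    """Return a table slot 0..127, or None if the opcode decodes from bit fields."""
    if 0x40 <= opcode <= 0xBF:
        return None                      # regular block, no table entry needed
    return opcode if opcode < 0x40 else opcode - 0x80


def decode_regular(opcode):
    """Decode an opcode in 0x40..0xBF directly from its bit fields."""
    if not 0x40 <= opcode <= 0xBF:
        raise ValueError("not in the regular block")
    if opcode == 0x76:
        return ("HLT",)                  # the one exception inside the MOV block
    src = REGS[opcode & 7]
    if opcode < 0x80:
        return ("MOV", REGS[(opcode >> 3) & 7], src)
    return (ALU[(opcode >> 3) & 7], src)


print(decode_regular(0x78), dispatch_index(0xC3))
```

With this numbering, opcodes 0x00..0x3F land in slots 0..63 and 0xC0..0xFF land in slots 64..127, leaving the other 128 CLUT longs free.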
That is why I started to think about CLUT as PASM code space and CLUT as dispatch table before that.
You sort of took away my leverage, I was going to release it under same license as Bill to show how bad this can get if we all plant flags and agree to instead make it an app note if Bill does the same.
Depending on how you set up the loop you can even use the stack instructions setspa, addspa, and subspa to do absolute and relative jumps.
Also, depending on how you set up the loop (you can't use REP in this case), you can run threaded code.
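The jump idea can be modelled as plain pointer arithmetic. The Python sketch below assumes a fetch that reads at the pointer and then increments it; the real direction and timing of the Prop II stack pointer are not modelled. The methods `set_spa`, `add_spa` and `sub_spa` only borrow the instruction names, and the toy ops (`jmp`, `dec`, `jnz_back`) are hypothetical.

```python
class ClutModel:
    """256-entry store with a movable fetch pointer (a stand-in for SPA)."""

    def __init__(self, program):
        self.mem = list(program) + [("halt", 0)] * (256 - len(program))
        self.spa = 0

    def fetch(self):
        word = self.mem[self.spa]
        self.spa = (self.spa + 1) & 0xFF   # assumed post-increment
        return word

    def set_spa(self, n):                  # absolute jump
        self.spa = n & 0xFF

    def add_spa(self, n):                  # relative jump forwards
        self.spa = (self.spa + n) & 0xFF

    def sub_spa(self, n):                  # relative jump backwards
        self.spa = (self.spa - n) & 0xFF


def run(clut, counter, max_steps=1000):
    """Execute until halt; return the final counter and the number of fetches."""
    for step in range(1, max_steps + 1):
        op, arg = clut.fetch()
        if op == "halt":
            return counter, step
        if op == "jmp":
            clut.set_spa(arg)
        elif op == "dec":
            counter -= arg
        elif op == "jnz_back" and counter != 0:
            clut.sub_spa(arg)              # offset is from the advanced pointer
    raise RuntimeError("step limit reached")


program = [("jmp", 2), ("halt", 0), ("dec", 1), ("jnz_back", 2), ("halt", 0)]
print(run(ClutModel(program), 3))
```

Note that relative offsets are counted from the pointer after it has already moved past the jump, which is the kind of detail that would depend on the real pipeline.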
I'm calling it CLMM (pronounced as CLEM)
C.W.
BTW I have asked Chip for a special shift instruction to shift the 9bit vectors and keep the top 5 bits unshifted. I use this method in the faster spin interpreter. (see the documentation thread at almost the end)
We are certainly going to have some fun with P2. I am about to order my DE0-nano so hopefully will have it before the end of the week.
Now this is what I love about this forum, you only have to have an idea pop into mind and someone has already done it. Especially if it's obviously impossible:)
Of course I'm so slow that this is normally true in the rest of life as well.
I don't have any leverage. No code. Nothing. Just the muse.
I do like "CLMM" better than "CLUTMM". Do you mind if I change the thread title to use "CLMM"?
Released as MIT.
I'll do a better write up and post more engines soon.
Enjoy.
C.W.
CLMM_Engine1.spin
It is very curious, from my understanding of the pipeline I would think I would need at least three nops between the popar and the instruction slot, and I do if I want to manipulate the "PC" via setspa, addspa, and subspa.
If I just do balls out execute with no jumps like that I can have no nops between the popar and the instruction slot, that may be the overlapped stuff Chip mentioned on Bill's BCEE thread.
C.W.
I also thought about overlays from CLUT memory. Probably a very good technique if the code in the overlay contains loops. Then the overhead of loading diminishes with respect to the execution time.
However for in line code a single instruction CLMM technique may be better. The problem I ran into using overlays in ZiCog was that you need a space as big as your biggest overlay in COG in which to load it. With a LMM style you only need space for the small LMM loop. As the ZiCog overlays were straight line code the speed difference between overlay and LMM is not great.
I noticed your request for the special shift operations. I must say that having had a quick look at the list of instructions for PII it's pretty overwhelming already. Perhaps adding more odd things is not a good idea. For sure 90 percent of those instructions are never going to be used by compilers.
It pains me I can't get into PII experiments with the nano board, I have so much going I just can't allow myself another toy just now.
It is why I don't like C type compilers ---
CRC is nothing more than hashing, in fact it shares a lot in common with the AES algorithm. AES has a table of polynomials instead of 1 polynomial, but the math to divide and reduce polynomials between CRC and AES is identical. There are 256 forward and 256 reverse polynomials in AES-128 (encrypt and decrypt s-boxes), so the CLUT works great for that too.
In fact, AES-128 needs 512 bytes for the whole implementation, you could store the encrypt and decrypt tables in the upper and lower words of each CLUT entry, just add a second step of SHR X,#16 to get the table entry, using the same lookup mechanism.
The tables are what make AES a difficult to implement algorithm on the Prop 1, but with the CLUT, you have another 1KB of RAM for data in each COG, which greatly expands what you can do.
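A sketch of the packing idea in Python: the AES S-box and its inverse are computed from the standard GF(2^8) definition, then stored in the lower and upper 16-bit halves of 256 longs. Which table goes in which half, and the helper names, are assumptions made for illustration.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a):
    """Multiplicative inverse as a^254 (0 maps to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r if a else 0


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(a):
    b = gf_inv(a)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [sbox_entry(i) for i in range(256)]
INV_SBOX = [0] * 256
for i, s in enumerate(SBOX):
    INV_SBOX[s] = i

# One long per entry: decrypt table in the upper word, encrypt in the lower.
clut = [(INV_SBOX[i] << 16) | SBOX[i] for i in range(256)]


def enc_lookup(table, x):
    return table[x] & 0xFF            # lower word, no shift needed


def dec_lookup(table, x):
    return (table[x] >> 16) & 0xFF    # the extra "SHR X,#16" step


print(hex(enc_lookup(clut, 0x01)), hex(dec_lookup(clut, 0x63)))
```

Both lookups use the same index into the same 256 longs; only the decrypt path pays for the extra shift.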
Just some food for thought.
Blimey, it's done already. Great stuff. Now I just have to sit down with PII docs and figure out how it works exactly. This is not your father's Propeller.
Clearly we need a way to load up the CLUT with code first. A simple RDLONG, pusha (whatever) loop would do.
To maximize available COG space the COG could be started with a small CLUT loading program which stops when done and then the COG is started again with your real application code.
Actually, such a CLUT loading COG program, that does pretty much nothing else, might be a useful thing to have stand alone as it has applicability to loading sin/cos/log tables and other such tasks prior to loading app code that uses those CLUT tables.
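The two-stage start can be modelled like this in Python. It is a sketch under the assumption, stated earlier in the thread, that starting a COG replaces its program but leaves the CLUT contents alone. `CogModel`, `clut_loader` and `app` are hypothetical names, and the HUB is just a list.

```python
class CogModel:
    """A COG whose CLUT survives being restarted with a new program."""

    def __init__(self):
        self.clut = [0] * 256
        self.program = None

    def start(self, program, hub):
        self.program = program        # COG RAM is replaced on every start
        return program(self, hub)     # the CLUT is deliberately not cleared


def clut_loader(cog, hub, base=0):
    """Stage 1: copy 256 longs from HUB into the CLUT, then stop."""
    for i in range(256):
        cog.clut[i] = hub[base + i] & 0xFFFFFFFF   # the RDLONG + push step
    return None


def app(cog, hub):
    """Stage 2: the real application, which finds the table already in place."""
    return cog.clut[10]


cog = CogModel()
hub = list(range(1000, 1256))
cog.start(clut_loader, hub)
print(cog.start(app, hub))
```

The loader costs no COG space in the final application because it is gone by the time the application is started.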
Relatively complete instructions set posted.
http://forums.parallax.com/showthread.php?144643-Propeller-II-emulation-for-Idiots&p=1151037&viewfull=1#post1151037
Yes, compilers have that difficulty, not just C compilers. It's just damn hard for a compiler to look at your code, realize intent and then optimize for a complex array of instructions or weird architectures. Compilers are for general purpose code, code that you want to be usable on many platforms. That is clearly not the kind of code that we will see inside Propeller COGs very much.
peward,
Now we're cooking.
I think this thread is done already.
The hardware sliced threading is the late addition that makes this truly useful.
It is a pity that QuadSPI(DDR) in hardware did not make the cut, as that is another useful 'opcode feeder path' for a 'background rate' slice - without the hard ceiling the CLUT has.
Another use for the CLUT...
A bit map where we are not using the CLUT for video. In my 1pin mono TV code (no need for 1 pin any more because it only uses 1 pin anyway) I just generate simple TV using VGA mode. I store the bitmap in the cog but this would free up space to do a full VT100 within the same cog too.
Now if only we could page map the 1K into a window of cog space - and yes I know it is not 4way like the cog ram
And we can use the CLUT for buffers too. For instance, setting up a string.
Chip shows using the CLUT as a receive buffer for serial in the Monitor code.
C.W.
Early versions of ZiCog used your overlay technique. I persisted with it for a long time. I just was very tight for space and switching to LMM eased that up. I was not so worried about speed for those Z80 ops in overlay/LMM as they are almost never used in the CP/M world and probably not much elsewhere. On the PII I'm going to worry about the speed of those ops even less.
Can you solder smt 0.4mm pitch???
I guess I can solder 0.4mm, under advice.