
XMM language musings

Dr_Acula Posts: 5,484
edited 2012-03-22 21:04 in Propeller 1
While working with average joe to speed up the touchscreen board, we have ended up with a design that has very fast external memory transfers. Using two sram chips in parallel, data is transferred two bytes at a time. Further, there is a hardware counter, so there is no need to load in an address each cycle. So this gives you access to 19 address bits (or more) plus 16 data bits, which is clearly more than the 32 pins on the propeller. I'm happy to stand corrected, but I believe this is the fastest XMM memory solution to date.

A data transfer cycle comes down to 4 pasm instructions (I think!)
loop   andn outa,readbit ' set the read bit low (or the write bit, depending on whether it is a read or write)
       mov myvalue,ina ' get the value
       or outa,readbit ' set the read bit high
       djnz count,#loop ' loop n times

The value in 'myvalue' is a word, so you need to do this twice to get a long, with a left shift on one of the words and an 'or' to combine the two values. And then you need to put that value somewhere - in hub or in the cog.
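Something like this, as a minimal sketch - 'lowword' and 'hubptr' are invented register names, and the exact strobe timing is glossed over:
   andn outa,readbit ' strobe low - fetch the first (low) word
   mov lowword,ina
   or outa,readbit ' strobe high, hardware counter advances the address
   andn outa,readbit ' strobe low - fetch the second (high) word
   mov myvalue,ina
   or outa,readbit
   shl myvalue,#16 ' shift the high word into place
   or myvalue,lowword ' combine the two words into one long
   wrlong myvalue,hubptr ' then put it somewhere - eg hub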

I've been thinking about virtual machines. Zog. Zicog. A whole myriad of stack and register based machines I found on wikipedia. But I've been thinking of something more specifically designed for the propeller.

Thinking about virtual machines with an arbitrarily large memory: with a fast external memory system, one can think about bypassing hub and working with just the cog and external memory (the hub will still be very useful, but it is nice to start with it all completely free for anything you want).

Everything seems to come back to the actual hardware we have to work with, which is a cog with 2k of memory. If you run a virtual machine, you have to fetch your virtual instruction, process it, store or load values, then get the next instruction. I've read some of the arguments of stack based machines vs register based machines, but neither is as fast as it could be. An example - storing a single long constant on the stack in Zog requires four byte stores. That is four emulated instructions, plus all the overheads, to do what you can do in pasm with one instruction.

So I have been thinking about working more closely with the actual propeller architecture and working within a subset of pasm.

Consider a long assembly program - 100k in length. Too big for LMM. If a subroutine/call happens to be under 1k that is good - just load that code fragment into the cog and run it as it is.

If, however, it is more than 1k with no calls or jumps, split it up into 1k fragments and load each fragment in turn.

The key here is that a 1k fragment of pasm code loaded from external memory runs exactly as pure pasm. There is an overhead to load it obviously, but if it happens to have some internal loops or jumps then those run at pasm speeds. A video driver would be possible if it happens to fit into 1k.

Why 1k? Well, that is just a guess, but I figure the other 1k would be needed for the code that loads blocks of external memory into the cog, plus the setup code, plus some common registers. Though that is probably an overestimate - maybe only a few hundred bytes?
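For what it's worth, here is a rough sketch of what the loader half might look like - the names ('frag_base', 'fragsize' etc) are invented and the strobe setup is omitted, so this is the shape of the idea rather than working code:
   load_frag movd :store,#frag_base ' point the store at the fragment's home in cog ram
             mov count,fragsize ' number of longs in this fragment
   :read     andn outa,readbit ' fetch low word from external ram
             mov lowword,ina
             or outa,readbit
             andn outa,readbit ' fetch high word
             mov temp,ina
             or outa,readbit
             shl temp,#16
             or temp,lowword ' assemble one 32-bit instruction
   :store    mov 0-0,temp ' self-modifying store into cog ram
             add :store,dstinc ' advance the destination field
             djnz count,#:read
             jmp #frag_base ' run the freshly loaded fragment

   dstinc    long 1 << 9 ' increments the destination field of :store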

From the point of view of a compiler like the proptool, you would build the program in 496-long groups, copying the same memory-access code into each one and then adding the different code fragments. Then link it all together. Each 496-long piece of code is internally consistent and hence easily compilable.

Ok, this is a subset of pasm, and you can't do everything:

Jumps and calls internal to a fragment would work.
But returns would jump out of the fragment and would require loading in new fragments. Ditto calls to code outside the fragment.

The hardest thing, I think, is that you can't have a list of variables at the bottom of the pasm code. So you need some way of handling constants. Maybe a pseudo-instruction that is translated by a precompiler into a short mov/shl/or sequence using immediate values.
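For example, a pseudo-instruction to load $12345678 might expand to something like this (pasm immediates are only 9 bits, so a byte at a time fits comfortably):
   mov myvalue,#$12 ' top byte
   shl myvalue,#8
   or myvalue,#$34
   shl myvalue,#8
   or myvalue,#$56
   shl myvalue,#8
   or myvalue,#$78 ' myvalue now holds $12345678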

It might be possible to have registers that stay constant between code fragments so you only have to load in those constants once. Or constants could be stored in hub.

There might be some advantages to declaring a list of 16 longs (call them registers) that are common to each code fragment so that you can pass data between fragments quickly.
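For instance, every fragment could be assembled with an identical header so the shared block always lands at the same cog addresses - just a sketch:
   shared long 0[16] ' 16 longs common to every fragment - parameters, results, scratch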

I'm not sure exactly of all the rules. In some ways it may be better not to have rules - write any pasm you like.

But I wonder if this could do something that LMM can't do, and that is to run a video display at full speed if you wanted to. Then shut that down and run some other code?

So it is not really a matter of creating a list of new instructions like in a virtual machine. Rather, it is a matter of excluding certain pasm instructions and replacing them with other ways of doing things. I think the things you need to replace are calls and jumps that are not local to the 1k code fragment, and all references to variables and constants that you would normally put at the bottom of your pasm code.

But you can have an almost unlimited list of variables and constants which you can't do in standard pasm, because they can all be stored in external ram. So in practice you might load in a group of them from external ram at the beginning of a code fragment, and then they are accessed like standard pasm.

Of course, what you also need is some sort of smart precompiler that chops up a huge pasm program into manageable bits. The rules are not too hard though - if a subroutine is less than 1k then load it as is. If >1k then do it in pieces. You would probably need some sort of local stack to handle the calls and returns between fragments, but that is not too hard.
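As a sketch of those call/return mechanics - everything here is hypothetical ('sp' is a hub stack pointer, 'load_frag' is the loader above, and the resume offset within a fragment is glossed over):
   do_call  wrlong resume_id,sp ' push the fragment to come back to
            add sp,#4
            mov frag_id,callee_id ' fragment holding the subroutine
            jmp #load_frag ' load it and jump in

   do_ret   sub sp,#4 ' pop the caller's fragment id
            rdlong frag_id,sp
            jmp #load_frag ' reload the caller and carry on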

Maybe it won't work? But I like the challenge of trying to think of a model that can run huge XMM pasm programs and where at least small program fragments can run at native pasm speed.

Thoughts would be most appreciated.

Comments

  • jazzed Posts: 11,803
    edited 2012-03-21 05:54
    Dr_Acula wrote: »
    While working with average joe to speed up the touchscreen board, we have ended up with a design that has very fast external memory transfers. Using two sram chips in parallel, data is transferred two bytes at a time. Further, there is a hardware counter, so there is no need to load in an address each cycle. So this gives you access to 19 address bits (or more) plus 16 data bits, which is clearly more than the 32 pins on the propeller. I'm happy to stand corrected, but I believe this is the fastest XMM memory solution to date.

    That's probably true. Losing 16+ pins hurts but it's better than losing almost all of them :)
    Word-at-a-time is 10MB/s with a synchronous clock at 80MHz. What kind of counters are you using? Schematic?

    I assume "readbit" in your PASM snippet is your counter strobe, so you have to toggle it to get the next address.
    Wouldn't want to assume too much though.

    You'll need a cache to take full advantage of the bus speed especially for byte-wide data like ASCII strings.
    So a cache read loop would look something like this after "readbit" CTRA and PHSB address increment initialization:
    ' magic CTRA/CTRB initialization goes here ....
    
    ' cache read loop
    :loop
      mov tmp, ina
      wrword tmp, phsb
      djnz length, #:loop
    

    A write cache loop will be slower unless you have special write-enable hardware.
  • SRLM Posts: 5,045
    edited 2012-03-21 10:20
    I think the compiler would also benefit from a feature that duplicates code to prevent switching fragments. For example, a large PASM program might have a multiply subroutine which is called from many areas of code. Rather than having to load the multiply fragment on each call, it would be nice to have the option to duplicate the routine in each fragment that uses it.

    If I understand right, your solution uses ~17 pins (16 data, 1 address increment)? How would you plan on loading an arbitrary fragment with a given address into memory?

    It sounds like there are two possible modes of operation:
    a) load one instruction, run it, repeat
    b) load many instructions (aka, fragment) run them, repeat

    The main idea in your post primarily is that you can use two RAM chips and an external hardware counter to reduce pin count while increasing bandwidth, and you can cache fragments of code in the cog before executing. Is this correct?

    As a side note, with the selection of fragments it would probably be better to base the size on the control flow graph rather than code-independent metrics. The control flow graph would allow the compiler to determine the structure of the code and the number of incoming/outgoing links for an arbitrary grouping of blocks, plus variable liveness analysis (which would allow you to determine how many "registers" you need to store for an arbitrary group of blocks). With this approach you could optimize for the structure of the program and save some time by not loading lines of code that are redundant.

    Some control flow lecture slides: www.cs.umd.edu/class/fall2003/cmsc631/lectures/l02.pdf
  • pedward Posts: 1,642
    edited 2012-03-21 11:45
    You need to get rid of the explicit read set/reset and use a counter to twiddle that in sync with a read loop.
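    Something like CTRA in NCO mode, perhaps - this is only a sketch, and the frqa value would need working out against the actual loop timing ('READ_PIN' is an invented constant):
      movs ctra, #READ_PIN       ' APIN = the read strobe (dira for the pin must also be set)
      movi ctra, #%0_00100_000   ' NCO single-ended mode
      mov frqa, nco_step         ' phsa[31] then drives the pin automatically
    
    nco_step long $2000_0000     ' 1<<29: pin toggles every 4 clocks, ie once per instruction time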
  • ersmith Posts: 6,099
    edited 2012-03-21 14:01
    Dr_Acula wrote: »
    Maybe it won't work? But I like the challenge of trying to think of a model that can run huge XMM pasm programs and where at least small program fragments can run at native pasm speed.

    Thoughts would be most appreciated.

    Your XMM solution sounds very nice (and fast!)

    propgcc can already run some program fragments (namely loops of < 1K) at native PASM speed, even from XMM. It does this by implementing an FCACHE pseudo-instruction which loads a loop (or a small recursive function) into local memory and then runs it. The advantage of concentrating on loops is that that is where most programs spend their time -- there's not usually much benefit to speeding up initialization code, for example.

    Your suggestion of splitting the program up into 1K chunks is a pretty good one. I had to do something similar on a custom RISC system once where part of the internal memory came out unusable; in that case I did it by modifying the assembler to automatically insert JUMP instructions to the next block at the end of each block of code. The tricky part is trying to avoid thrashing, which is where something like FCACHE comes in -- you want to make sure that loops don't get split across chunks unless they absolutely have to be. The other down side is that it will bloat the code a bit, although for XMM solutions it's not as big a deal.

    Eric
  • Dr_Acula Posts: 5,484
    edited 2012-03-21 15:30
    Some great suggestions there. It sounds like the GCC group has already looked at these problems.

    Ok, say you take a typical program - in C or Pasm or Propbasic - and try to chop it up into 1k fragments. Then do some optimising as mentioned above - eg if a code fragment calls a multiply routine, it may well be more efficient to copy that multiply routine many times into different parts of the program, making a larger overall program but a faster one.

    I can see two ways to use the code fragments:
    1) the cog loads the code fragment directly from external ram
    2) a driver program loads the cog from hub and a separate program manages caching fragments from external ram to hub

    I wonder if option 2 may end up a similar speed?

    Option 2 would be running cognew all the time, but with a different code fragment. Maybe you might devote half the hub ram to screen buffers and the like, and the other half to a cache?

    The advantage of option 2 is that with caching, the absolute external memory speed becomes less important.

    I think this has been done by the GCC people, but more for C than pasm. So, thinking this through, you start with a large 100k pasm program which you pass through a precompiler a number of times. Perhaps you place a restriction on the programmer that all functions must fit in about 1k or so?

    So - first pass of the compiler - look for functions that are called many times from different parts of the code (multiply, string handling) and see if it is possible to copy these into the functions that call them.

    Second pass - see if the new program still passes the "will a function fit in 1k" test. There is probably a mathematical way of doing this. Some functions might end up too big and then you have to ask - is it better to call another function and reload a cog each loop, or is it better to split that new code in two and only load the cog twice? I guess you have to analyse the code to see if it is a linear piece of code or whether it contains a loop. I think this is what Eric is saying above.

    Finally though, you might come up with a 100k program split into 100 1k chunks. Then you add in the overhead code that all cogs need - namely the code at the beginning that handles the passed variables, and the list of local variables at the end. So each chunk now becomes a standard 496 long piece of data that can be loaded into a cog. Cache maybe 8 of these in hub ram.

    There might be some advantages to running the 'main' program in a high level language like C or Spin.

    A smart cache driver could speed things up even more. Say your compiler determined that there is a linear piece of code that is going to take three cognews to run. So the compiler might put some instructions in the code to load the first cognew from xmm and set it running. Then fetch the second cognew data from xmm while the first one is running.

    I guess the difference here is that the cache driver is not using one of the standard algorithms like "most recently used", but rather, the compiler tries to work out the optimum cache load order at compile time and then puts instructions to do this into the code.
  • ersmith Posts: 6,099
    edited 2012-03-21 18:30
    Ok, say you take a typical program - in C or Pasm or Propbasic, and try to chop it up into 1k fragments.
    An important note -- many fragments may be less than 1k. For example, if a function is 256 bytes it really only makes sense to load those bytes; if you load a whole 1K then you'll be wasting 3/4 of the load time (unless the remainder of the 1K is also called along with the small function).
    I can see two ways to use the code fragments:
    1) the cog loads the code fragment directly from external ram
    2) a driver program loads the cog from hub and a separate program manages caching fragments from external ram to hub

    I wonder if option 2 may end up a similar speed?

    You left out:
    3) a separate program manages caching fragments from external ram to hub, and the cog loads its code from hub after communicating with the cache program

    That's how most compilers actually do XMM these days, I think. The big problem with option 2) is that the cognew function always loads 2KB. The code loading time can be a pretty significant part of the cost of LMM/XMM execution; in fact in straight code (no loops) it takes 8 cycles to load an instruction, with a further 8 cycles of wait time for the next hub window, and only 4 cycles to execute it.
    Second pass - see if the new program still passes the "will a function fit in 1k" test. There is probably a mathematical way of doing this. Some functions might end up too big and then you have to ask - is it better to call another function and reload a cog each loop, or is it better to split that new code in two and only load the cog twice? I guess you have to analyse the code to see if it is a linear piece of code or whether it contains a loop. I think this is what Eric is saying above.
    Well, not quite. GCC just looks at all loops and checks their size. If the loop is small enough to fit into internal memory, and if the FCACHE optimization is turned on, then we convert it into an internal loop. This requires a bit of work; branches have to be changed from LMM form to straight JMP instructions, and immediate constants have to be broken out into registers and loaded as well. A similar optimization is performed on recursive functions. We don't try to split up linear (non-looping) code into chunks to load into internal memory. It's an interesting idea, and perhaps there might be some benefit, but on the other hand in many cases one ends up adding overhead to code that is only run once.
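    Roughly, the transformation looks like this - an illustration only, with simplified register names and a simplified branching convention, not propgcc's actual output:
    ' LMM form: the kernel fetches each long from hub; a branch hands
    ' the kernel a new hub address for its program counter
        jmp #LMM_jmp
        long @loop_head ' hub address of the loop
    
    ' FCACHE form: the same loop, copied into cog ram, branches natively
    :loop add total, val
          djnz count, #:loop ' full pasm speed, no hub fetches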

    It sounds like you're suggesting doing the splitting up on arbitrary PASM code. That's actually harder than doing it on a high level language, since C or BASIC code is generally easier to analyze (loops are typically marked by FOR or WHILE).
  • SRLM Posts: 5,045
    edited 2012-03-22 01:39
    ersmith wrote: »
    An important note -- many fragments may be less than 1k. For example, if a function is 256 bytes it really only makes sense to load those bytes; if you load a whole 1K then you'll be wasting 3/4 of the load time (unless the remainder of the 1K is also called along with the small function).
    ...
    It sounds like you're suggesting doing the splitting up on arbitrary PASM code. That's actually harder than doing it on a high level language, since C or BASIC code is generally easier to analyze (loops are typically marked by FOR or WHILE).

    And here is where the control flow graph comes in...

    I disagree. I think the splitting of arbitrary PASM code is easier than doing it with a HLL. With PASM you only have to worry about GOTOs (or conditional variants), rather than whiles, ifs, do-whiles, switches, elses, gotos (legal sometimes!), function calls, and other assorted branching-type statements. With PASM, the compiler simply has to mark
    1. The first instruction
    2. The target of every branch*
    3. The instruction immediately following a GOTO
    Each mark then indicates a new flow control block, and the compiler can create a graph from this and determine loops and the like. With the branching-type statements a compiler would have a difficult time determining run-time size before code generation, and so it couldn't even come up with a guess. But with PASM* the compiler has enough information (see the marked-up fragment below). In either case I don't think the compiler could, in a general case, determine which loops/etc. are the heaviest hitters.

    *Although I'd like to add that my entire argument is probably pointless since PASM permits dynamic modification of jump addresses, so applying this principle to hand written PASM code is probably a useless effort anyway. But on an architecture that is more boring than the Propeller it might make sense!
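    To make the marking rules concrete, here is a toy fragment (hypothetical code) with the leaders flagged:
    start  mov count,#10        ' leader: the first instruction
    :loop  add total,count      ' leader: target of the djnz below
           djnz count,#:loop    ' conditional branch ends this block
           jmp #done            ' unconditional branch ends this block
           mov total,#0         ' leader: instruction following a jump
    done   wrlong total,hubptr  ' leader: target of the jmp above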
  • Heater. Posts: 21,230
    edited 2012-03-22 04:22
    SRLM,

    I am no compiler writer, but I am very sure that is entirely backwards.
    The whole point of "structured" languages, with their "for", "while", "if", "then", "else" and other block structures and the downer on "goto", is that if you have the structure in the source code it becomes easier to reason about the code and easier for the compiler to optimize things.

    For example, the propgcc has this fcache idea. The compiler can look at your "while" loop, for example. Check that it makes no calls in the while block or "gotos" out of the block. Check how big the compiled while block is. Etc etc and then decide if that "while" loop can be loaded into cog for full speed execution.

    This sort of thing is only possible with structured code. A "goto" only language like BASIC or assembler is not amenable to this kind of analysis. For example, the code flow is entirely dependent on the input data which the compiler knows nothing of. There are no clues in the source.
  • SRLM Posts: 5,045
    edited 2012-03-22 10:13
    Heater. wrote: »
    SRLM,

    I am no compiler writer, but I am very sure that is entirely backwards.
    The whole point of "structured" languages, with their "for", "while", "if", "then", "else" and other block structures and the downer on "goto", is that if you have the structure in the source code it becomes easier to reason about the code and easier for the compiler to optimize things.

    For example, the propgcc has this fcache idea. The compiler can look at your "while" loop, for example. Check that it makes no calls in the while block or "gotos" out of the block. Check how big the compiled while block is. Etc etc and then decide if that "while" loop can be loaded into cog for full speed execution.

    This sort of thing is only possible with structured code. A "goto" only language like BASIC or assembler is not amenable to this kind of analysis. For example, the code flow is entirely dependent on the input data which the compiler knows nothing of. There are no clues in the source.

    Ok, from the HLL how does the compiler know if a loop will fit in FCACHE? By checking the control flow graph and seeing how big the blocks are. GCC does use the control flow graph for the general case: http://gcc.gnu.org/onlinedocs/gccint/Control-Flow.html. Although I don't know if the propGCC team has used it for their fcache.

    My comment about branching-type statements (if, while, for, etc) is that a compiler cannot determine, in the general case, if a piece of code will repeat (loop) frequently. For things like "repeat 100", yes, it's easy to look at the HLL and determine that the block loops 100 times. But it's just as easy with assembly code* in that case to determine that the block loops 100 times. The advantage in all cases when FCACHEing is that you can determine the size and variable liveness from the assembly**-level code. A side advantage is that it's easier to determine if there is a statement that exits the FCACHEed block (just search for a GOTO that jumps to somewhere outside), but with a bit of effort the compiler could probably do this with the HLL.


    * The same note as my previous post #8: PASM's dynamic jumps may introduce hiccups into this theory.
    ** Usually an intermediate-level code is used, but typically this intermediate (eg three address code) is like a pseudo-code assembly without any machine-specific features or HLL features (FOR, WHILE, IF, etc.). So it's reasonable to make the assumption that if it applies to intermediate code then it applies to assembly, and vice versa.
  • pedward Posts: 1,642
    edited 2012-03-22 17:43
    WRT assembler, proper loops are always a jump instruction to a label. A compiler has to store label declarations for fixups later anyway, so when you hit an instruction that references a label, AND the instruction is some form of JMP, you have a loop. It does get harder, but it can be done.
  • Dr_Acula Posts: 5,484
    edited 2012-03-22 21:04
    Blue sky thinking - I'm wondering about a new language for the propeller.
    1) Programs as big as you like
    2) Compiled, not interpreted, so as fast as possible
    3) Inline native Pasm code if you want to add one or more lines
    4) Load and reload cogs on the fly and designed from the ground up to do this easily
    5) Pasm largely hidden from the user, especially things like the complexity of passing variables to cogs
    6) Global variables that exist in hub but can be used in a cog without having to think about it too much
    7) Running from external memory, but ability to interface directly with external memory from within the program
    8) Minimal number of 'cognew' for speed
    9) Run cogs forever if you want to
    10) Multiple subroutines in one cog and user control about how those subroutines are grouped.


    Why on earth a new language?!

    Well, the first problem I've come across is how to group subroutines in cogs. There are advantages to letting the compiler do this, but there could also be advantages to letting the user do this because the user knows the structure of their program. So you might group string functions in one cog because they are likely to be used together. So you need a new program word, like CogGroup(1) and End CogGroup

    I don't think C89 and C++ let you add words like this. These are not a "function". Maybe we could add this to Spin and call it XMM spin or something? I've called it XMM Basic because Basic has many variants so it is ok to add one more! But that is a very flexible name and could change down the track.

    The language is translated by a precompiler into Spin. Maybe it looks a bit like this:
    ' XMM Basic for the Propeller
    
    Public Myvariable as Long	' these global variables are in hub ram
    Public Bytevariable as Byte
    Public Mystring as String:20 ' string variable =  a byte array 20 bytes long
    
    Sub Main()
    	Dim A As Long
    	A = 5
    	MyProcedure(A) ' adds 10 to A and alters Myvariable 
    	PrintLong(Myvariable)	' print is a subroutine that needs to be written
    End Sub
    
    CogGroup(1) ' this group fits in a cog together. The number identifies the group, not the cog it runs in
      ' precompiler adds the max number of passed variables here
    
      Sub MyProcedure(N)
    	Dim A as Long		' this A is local to this procedure
    	Myvariable = N + 10     ' alters a hub variable, user can use public variables within cogs - precompiler writes the code
      End Sub
    
      Sub PrintLong(A)
    	' send A out to whatever display is on the board
    
      End Sub
    
      ... more subroutines
    
    End CogGroup
    

    There are some things here which are a bit unusual. I made some notes below. It is a sort of hybrid compiled/interpreted language. It is compiled in that every command is converted ultimately to pasm, like Propbasic. But it is also interpreted, in that there is a spin routine running in the background working out whether the next subroutine will result in a cognew or not.

    The number in CogGroup is not the cog it runs in. It is just a number that groups these functions together and there could be hundreds of these. Maybe you could think of them as Classes, but Class means something different I think.

    All public variables are converted to a VAR section.
    If a CogGroup is too big, the precompiler gives an error.
    The Main program is translated by a precompiler to Spin.
    If a public variable is mentioned in a procedure, the precompiler adds the code to pass the hub location, plus the wrlong or rdlong to the right location.
    Multiple procedures can fit in one cog.
    If a procedure is called from main, check if it is in the current cog.
    If not, then cognew from external memory (sram, SD) - possibly with a local hub buffer.
    When translating to pasm, the precompiler determines the maximum number of variables to pass for any of the included subroutines. Send just one variable to the cog, which is an array pointer (see the sketch below).
    Each subroutine extracts its passed variables from that array.
    Return variables are public global variables - pass the location of these in pasm, and the local cog code alters these in hub directly with a wrlong.
    When a subroutine in a cog finishes running it sets a flag so the main program knows to continue.


    Able to set cogs running forever
    Able to have inline pasm code that runs at full speed
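    As a sketch of the variable-passing side, the loaded cog might unpack everything from a single hub array whose address arrives in PAR - the names and layout here are invented:
    entry  mov ptr,par          ' par holds the hub address of the argument array
           rdlong arg0,ptr      ' first passed variable
           add ptr,#4
           rdlong arg1,ptr      ' second passed variable
           add ptr,#4
           rdlong flagptr,ptr   ' hub address of the 'finished' flag
           ...
           wrlong one,flagptr   ' set the flag so main knows to continue
    one    long 1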