Some Propeller reverse-engineering info

Cliff L. Biffle · 2006-10-19 05:35

Work has slowed down for a week or two, so I finally got around to reverse-engineering the Propeller binary format and some of the Spin bytecodes. (Chip sent me an example machine-code loader routine, but it didn't work, for reasons I outlined in a PM to him -- so I cleanroom-engineered the file, because it's more fun that way.)

Info here:
www.cliff.biffle.org/software/propeller/binary-format.html
www.cliff.biffle.org/software/propeller/spin-bytecode.html

These are both works in progress; I'd appreciate any additional info from the community.

My assembler is coming along quickly, and currently has frontends for the Parallax ASM syntax and a different, somewhat cleaner syntax (for folks who fondly remember 68000 assembler). It's feature-complete, but I haven't built out the full instruction set yet. Source and binaries soon.

Post Edited (Cliff L. Biffle) : 10/19/2006 5:53:13 AM GMT

Mike Green · 2006-10-19 06:13

I'm very much looking forward to seeing what you've done, and using it. Thanks for the effort.
Mike

Bill Henning · 2006-10-19 06:45

I'll second that... while I've grown to like spin, I'd also be very interested in going to the bare metal.

The best way to reverse engineer spin would be to write trivial sample programs, to exercise each feature of the language, and examine the resulting byte code; however personally, I am more interested in being able to totally take over the propeller and not having to worry about using spin to load code - now in an ideal world, it would also be interesting to invoke spin from assembly code for certain situations.

Basically, what I find limiting is the "32k static load on startup" - I'd like to see ability swap code in, dynamically building code on the fly for cogs, and much more; things that would not be easy to implement right now. Heck, I'd even be interested in the ability to run spin apps in a constrained portion of the hub memory... as I said, I like spin [noparse]:)[/noparse]

Peter Jakacki · 2006-10-19 07:12

Good one Cliff, I looked at the Spin bytecodes and it looks very Forth'y to me. It probably makes sense to implement a stack based VM for the Spin compiler "target" when memory constraints are so tight.

I am busy moving at the moment but as soon as I can I will have a look at the kernel myself.

*Peter*

Cliff L. Biffle · 2006-10-19 14:54

I'm only working out the SPIN bytecodes as it proves necessary for my other efforts; in particular, I have no intention of writing a SPIN compiler. (My compilers have type systems.)

That said, the current SPIN compiler seems to do nearly no compile-time optimization -- in particular, it does not do constant folding, does not inline short functions, and seems to keep local variables off the stack (resulting in more memory accesses).

These three are particularly unfortunate, because they encourage unreadable code in the name of speed. (Though, I suppose, you're already working in a language that uses |< and ~~ as unary operators...readability may not be a concern.)

Here are two concrete examples which, while semantically equivalent, differ in execution speed. They are nonsense code, of course, but were we using a real compiler (say, GCC), these would each be equivalent at runtime.

v := 2 + 2 + 2 ' slow.
v := constant(2 + 2 + 2) ' fast

' Slow version:
r := (v & $4) << 8
v := r | $42

' Fast version: inlining is good for you! Or something.
v := ((v & $4) << 8) | $42

' Slow version
PUB ILikeStructuredProgramming(v) :r
r := twiddle(v,2)
r := twiddle(r,4)
r := twiddle(r,8)
PRI twiddle(x,y) : r
r := (x >> y) | (y << (32 - y))

' Fast version
PUB ILovePoorCode(x) : r
r := ((((x >> 2) | constant(2 << 30)) >> 4) | constant(4 << 28)) >> 8) | constant(8 << 24)
' Or, after constant folding,
r := (x >> 14) | $8480000

cgracey · 2006-10-19 19:37

Cliff,

As you've noted, the compiler generates byte code based quite directly on source code. Though it does not perform optimizations, the intepreted byte-code language, itself,·is quite memory-efficient. In the future, an optimizing compiler·could shrink programs (somewhat) and make them (a little)·faster, but this is a huge step up in complexity (read 'reliability')·and is not a priority for us right now. What we have works perfectly and·it serves its purpose well. I do agree that constant folding would be a huge win. I'll look into implementing that, as it might not be too difficult to do.

I bet you'll make an awesome C compiler!

Cliff L. Biffle said...
I'm only working out the SPIN bytecodes as it proves necessary for my other efforts; in particular, I have no intention of writing a SPIN compiler. (My compilers have type systems.)

That said, the current SPIN compiler seems to do nearly no compile-time optimization -- in particular, it does not do constant folding, does not inline short functions, and seems to keep local variables off the stack (resulting in more memory accesses).

These three are particularly unfortunate, because they encourage unreadable code in the name of speed. (Though, I suppose, you're already working in a language that uses |< and ~~ as unary operators...readability may not be a concern.)

Here are two concrete examples which, while semantically equivalent, differ in execution speed. They are nonsense code, of course, but were we using a real compiler (say, GCC), these would each be equivalent at runtime.

v := 2 + 2 + 2 ' slow.
v := constant(2 + 2 + 2) ' fast

' Slow version:
r := (v & $4) << 8
v := r | $42

' Fast version: inlining is good for you! Or something.
v := ((v & $4) << 8) | $42

' Slow version
PUB ILikeStructuredProgramming(v) :r
r := twiddle(v,2)
r := twiddle(r,4)
r := twiddle(r,8)
PRI twiddle(x,y) : r
r := (x >> y) | (y << (32 - y))

' Fast version
PUB ILovePoorCode(x) : r
r := ((((x >> 2) | constant(2 << 30)) >> 4) | constant(4 << 28)) >> 8) | constant(8 << 24)
' Or, after constant folding,
r := (x >> 14) | $8480000

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Mike Green · 2006-10-19 20:07

Just an aside Cliff, "and seems to keep local variables off the stack (resulting in more memory accesses)" doesn't really apply in that the stack is implemented in the same space (HUB memory) as all other variables (and interpretive code for that matter). Using stack relative accesses does save on code space since the interpretive codes can be shorter (smaller addresses).

Cliff L. Biffle · 2006-10-19 21:43

Mike,

In this case, the code in question does something like this:

push 2
push 2
add
pop (or tuck) value to local variable
push that same local variable
push 2
add
pop to same local variable.

That's what I mean by more memory accesses: it's shuffling the stack a lot more than it ought to. Optimized stack code would omit the intermediate pop/push.

(Note: I'm still working out how exactly the local variable stores work. I suspect it's simply moving TOS down into the stack by an offset, but it's still unnecessary.)

Mike Green said...

Using stack relative accesses does save on code space since the interpretive codes can be shorter (smaller addresses).

I'm not disputing the value of a stack-oriented bytecode machine.

cgracey · 2006-10-19 21:46

This is, indeed,·the case. Local variables are kept on the stack and there are efficient single-byte read and write instructions for the first eight locals, as well as the first eight global long VARs.

Mike Green said...
Just an aside Cliff, "and seems to keep local variables off the stack (resulting in more memory accesses)" doesn't really apply in that the stack is implemented in the same space (HUB memory) as all other variables (and interpretive code for that matter). Using stack relative accesses does save on code space since the interpretive codes can be shorter (smaller addresses).

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

cgracey · 2006-10-19 22:33

Cliff,

Can you please please show the source that created this byte stream?

Cliff L. Biffle said...
Mike,

In this case, the code in question does something like this:

push 2
push 2
add
pop (or tuck) value to local variable
push that same local variable
push 2
add
pop to same local variable.

That's what I mean by more memory accesses: it's shuffling the stack a lot more than it ought to. Optimized stack code would omit the intermediate pop/push.

(Note: I'm still working out how exactly the local variable stores work. I suspect it's simply moving TOS down into the stack by an offset, but it's still unnecessary.)

Mike Green said...

Using stack relative accesses does save on code space since the interpretive codes can be shorter (smaller addresses).

I'm not disputing the value of a stack-oriented bytecode machine.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Cliff L. Biffle · 2006-10-19 23:48

Chip Gracey said...
Can you please please show the source that created this byte stream?

I don't have the original -- I'm playing around in PropTool and not saving my sources -- but here's an example. It's pathological, but it shows what I'm referring to.

PUB start : v | r
  r := 2
  r := r + 3
  r := r + 4
  v := r

Spin bytecodes:

37 00 ; push 2^1
65 ; tuck to local 1
64 ; push local 1
37 21 ; push 2^2-1
EC ; add
65 ; tuck to local 1
64 ; push local 1
37 01 ; push 2^2
EC ; add
65 ; tuck to local 1
64 ; push local 1
61 ; tuck to return value
32 ; return

(Intriguingly, r+=2 and r := r + 2 generate different bytecodes. More fodder for study!)

So, Chip, it's like you say -- each statement is being compiled separately, without regard for the stack effects of the others.

By contrast, a "canonical" compiler for a stack machine would leave locals as intermediate values on the stack, rather than stashing them in an activation record below the current data stack. The code generated would simply omit the 65/64 pairs:

37 00 ; push 2^1
37 21 ; push 2^2-1
EC ; add
37 01 ; push 2^2
EC ; add
61 ; tuck to return value
32 ; return

Between that and constant folding, you'd wind up with this:

38 09 ; push 9
61 ; tuck to return value
32 ; return

An interested party could write a relatively simple two-wide peephole optimizer to take care of this; the artifact (the tuck/push sequence) is very consistent. I had to address a similar issue in javac in JDK1.1, back when Java was interpreted. (Unfortunately, the PropTool can't be invoked from scripts and the like, so you'd wind up manually writing binaries to the disk, processing them, and loading them back in.)

cgracey · 2006-10-20 03:00

Well, until we have an optimizer, I will continue to write optimized Spin code. Then, after we have an optimizer,·I will·keep writing optimized Spin code.

Cliff L. Biffle said...

PUB start : v | r
  r := 2
  r := r + 3
  r := r + 4
  v := r

Spin bytecodes:

37 00 ; push 2^1
65 ; tuck to local 1
64 ; push local 1
37 21 ; push 2^2-1
EC ; add
65 ; tuck to local 1
64 ; push local 1
37 01 ; push 2^2
EC ; add
65 ; tuck to local 1
64 ; push local 1
61 ; tuck to return value
32 ; return

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Post Edited (Chip Gracey (Parallax)) : 10/20/2006 9:04:54 AM GMT

rokicki · 2006-10-20 07:39

Actually, it would be really cool if we could configure an external program to run on
source before it was compiled, and yet another external program to run on the
binary stuff that was so generated. This way we could add our own typing
restrictions to spin, or add a peephole optimizer, or the like, without having to
delve into Delphi. We could (for a simple case) run the m4 preprocessor on
source, so you can have full preprocessing semantics. Or someone can
(if they really really want to) change >= and <= to behave more like they
more "normally" do, and use something different like >:= and <:= or something
for the extended semantics.

Of course this does open a Pandora's box when it comes to compatibility and
the like, but for many people it would allow them to (for instance) write a
trivial Perl script to catch the errors they make time and again.

And perhaps enough experimentation with a simple peephole optimizer may
actually yield something that could be easily and profitably integrated with
the "normal" IDE.

I believe the Microsoft IDEs generally support this type of functionality.

Cliff L. Biffle · 2006-10-20 15:32

rokicki said...
I believe the Microsoft IDEs generally support this type of functionality.

It's also the basis for every Unix development tool, ever. I'm totally with you.

Though in my case, it'd need to somehow run these programs outside of my VM on my real computer, where I have scripting languages and m4.

Cliff L. Biffle · 2006-10-20 15:34

Howdy folks. Quick update.

I've refined my Propeller binary preamble and gotten it working for the general case. With some help from Chip, I've worked out the mystery 8-byte sequence that Propeller Tool requires to load a file (yet which the Propeller seems to disregard).

The file layout table and the preamble have been updated on my webpage:
www.cliff.biffle.org/software/propeller/binary-format.html

I've also got a lot more of the SPIN bytecode worked out; I'll update that page shortly.

Some Propeller reverse-engineering info

Comments