Announcing P2BEE: Propeller 2 Bytecode Execution Engine

Bill Henning · 2012-12-13 22:39

Propeller 2 BEE is the ultimate emulation engine for the Propeller 2!

I know, that is a very strong claim. I'll prove it.

Benefits:

- Maximum possible execution rate for byte-encoded instructions
6.5 clocks per byte average execution rate for single-cycle P2 instructions

- Makes writing processor emulators MUCH easier with much less code

- Makes writing any virtual machine much easier and run much faster

- Propeller 2 BEE is now the premier emulation platform

- For P2BEEs requiring fewer than 256 instructions, the rest of the STACK is available

Designed for:

- Specifically for Propeller 2
- Fastest possible Spin VM
- Fastest possible Java VM
- Fastest and Smallest "compressed mode" for C and other compilers
- Fastest possible 8 bit processor emulation
- Can be used for 16/32 bit emulators (in some cases)
- Provides a DRASTIC speedup for Spin, Forth, ZOG, Z80, 6809, 6502 and
every other emulator and virtual machine
- Retro Gaming
- Retro Console Emulation
- Retro Computer Emulation

License: Creative Commons Attribution-ShareAlike 3.0 Unported
http://creativecommons.org/licenses/by-sa/3.0/legalcode

History:

I've been having a lot of fun with the Propeller 2 on my DE0-Nano, and I came up with a really cute trick that led to developing P2BEE.

Today I verified that my P2BEE concept works, and that the engine works.

I could not wait to publish it - and I can't wait to see all the different emulators and virtual machines that will be based on it!

**********************************************************************
FAQ:
**********************************************************************

Why is P2BEE so fast?

Propeller 2 BEE pulls out all the stops, and uses all of the tricks I could
think of to execute byte codes as fast as possible. It uses a specific order
of cached byte read and stack access instructions optimized for the pipeline
details of the Propeller 2.

How did you think of it?

I have been developing software and hardware for the Parallax Propeller since it
became available. I always had a great interest in processor emulation for
retro computers and gaming.

Way back, I came up with the LMM virtual machine for the Propeller, allowing
it to execute larger programs than it could in the native "cog" memory.

Once I saw the specifications and instructions for the STACK memory, I started
thinking of non-obvious uses for it... and once Chip increased it to 256 longs
from the original 128 entry "CLUT" version, it became even more interesting.

256 longs... a very useful number. There are 256 possible values for a byte.

Thus P2BEE was born.

Single cycle propeller instructions stored in the STACK (CLUT) memory can be
executed by the inner P2BEE engine in 6.5 cycles (on average) - obviously
other instructions will take longer.

By storing various JMP instructions, sequences of instructions can be run
for every byte code - making coding VM's and emulators immensely easier.
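
To make the table idea concrete, here is a minimal sketch in Python rather than Propeller assembly. It is only a model of the dispatch scheme described above, not the P2BEE source: the 256-entry list stands in for the STACK (CLUT) memory, and the handler names, the `acc` register and the opcode assignments are all invented for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Handler = Callable[[State], None]


def op_nop(state: State) -> None:
    """Default entry: do nothing."""


def op_inc(state: State) -> None:
    """Stands in for a single-instruction table entry."""
    state["acc"] = (state["acc"] + 1) & 0xFFFFFFFF


def op_double(state: State) -> None:
    """Another single-instruction entry."""
    state["acc"] = (state["acc"] * 2) & 0xFFFFFFFF


def op_swap_nibbles(state: State) -> None:
    """Stands in for an entry that jumps to a longer handler sequence."""
    low = state["acc"] & 0x0F
    high = (state["acc"] >> 4) & 0x0F
    state["acc"] = (state["acc"] & ~0xFF) | (low << 4) | high


def build_table() -> List[Handler]:
    """One entry per possible byte value, like the 256 longs of STACK RAM."""
    table: List[Handler] = [op_nop] * 256
    table[0x01] = op_inc           # hypothetical opcode assignments
    table[0x02] = op_double
    table[0x10] = op_swap_nibbles
    return table


def run(bytecode: bytes, table: List[Handler]) -> State:
    """Fetch a byte, translate it through the table, execute the result."""
    state: State = {"acc": 0}
    for opcode in bytecode:
        table[opcode](state)
    return state


result = run(bytes([0x01, 0x01, 0x02, 0x10]), build_table())
```

Because every byte value has an entry, the inner loop needs no range check or decode step, which is the property the 256-long STACK size makes possible.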

Bill Henning · 2012-12-13 22:39

reserved for more FAQ, links etc

- added untested v.011 that uses a pipelined approach to hopefully reduce execution time to ~4.5 cycles for single cycle instructions

Cluso99 · 2012-12-14 00:44

Sounds extremely interesting Bill. Took a quick look at the source.

I expected the CLUT would be extremely useful for other things besides clut and stacks.

cgracey · 2012-12-14 01:17

Wow! What a neat idea!

Get a cached byte from hub memory, translate it to a long via the stack RAM, and execute it. How simple, but effective!

Good job!

Peter Jakacki · 2012-12-14 01:46

Bill Henning wrote: »

Merry Xmas Everyone!

I know, I am early.

Propeller 2 BEE is the ultimate emulation engine for the Propeller 2!

I know, that is a very strong claim. I'll prove it.
.

Everyone is having so much fun but it's really interesting seeing the stuff that has been done already. Chip knows that there is an up side with this and a down side because P2 better come out soon and it better work too!

Thanks Bill, I will sit down sometime next week and digest this tasty dish, sounds interesting, especially for Tachyon P2.

cgracey · 2012-12-14 01:48

Bill,

Have you thought about overlapped read/execute:

        repd    $1ff,#4         'infinite repeat of 4 instructions

        nop                     'spacers
        nop
        mov     ins,#0          'clear ins to nop

        rdbytec bop,ptra++      'get cached byte - 4 clocks per loop
        setspa  bop             'set stack pointer
        popar   ins             'read long
ins     nop                     'execute prior long
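
The overlap in the loop above can be modelled in a few lines of Python. This is a hedged sketch of the scheduling idea only, not a cycle-accurate model of the P2 pipeline: `pending` plays the role of the `ins` slot, the table of strings stands in for longs held in stack RAM, and the function and variable names are invented for illustration.

```python
from typing import List, Optional, Tuple


def run_overlapped(bytecode: bytes, table: List[str]) -> List[Tuple[int, str]]:
    """Execute each translated entry one iteration after it was fetched.

    Returns (iteration, entry) pairs showing when each entry ran.
    """
    trace: List[Tuple[int, str]] = []
    pending: Optional[str] = None      # the 'ins' slot, initially a nop
    for iteration, opcode in enumerate(bytecode):
        fetched = table[opcode]        # read byte, index table, read long
        if pending is not None:
            trace.append((iteration, pending))  # execute the prior long
        pending = fetched              # it runs on the next pass
    if pending is not None:
        # The real loop repeats forever; the model drains the last entry.
        trace.append((len(bytecode), pending))
    return trace


table = [f"op_{value:02x}" for value in range(256)]
trace = run_overlapped(bytes([0x03, 0x07, 0xFF]), table)
```

In this model every entry executes exactly one iteration after its byte was fetched, so a handler that changes the fetch pointer would see one further byte already fetched. Any VM built this way would have to account for that delay.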

Peter Jakacki · 2012-12-14 01:53

cgracey wrote: »

Bill,

Have you thought about overlapped read/execute:

        repd    $1ff,#4         'infinite repeat of 4 instructions

        nop                     'spacers
        nop
        mov     ins,#0          'clear ins to nop

        rdbytec bop,ptra++      'get cached byte - 4 clocks per loop
        setspa  bop             'set stack pointer
        popar   ins             'read long
ins     nop                     'execute prior long

I just came back to ask something similar, it certainly has turned the cog into fast spinning propeller churning up the bytecode waters.

Heater. · 2012-12-14 02:09

Bill,
If I understand the idea correctly:

a) An emulator/VM for a byte wide instruction machine often uses a look up table to dispatch op codes to instruction sequences. Think Z80, Zog, Java etc.
b) That dispatch table on a Prop is likely to be in HUB RAM, read using RDLONGs
c) You are proposing to put that dispatch table in the COG's CLUT/stack RAM for fast look up.

For things like a Z80 the dispatch table in STACK RAM would have to contain jumps to code sequences or even multiple code sequence addresses squeezed into a long. After all a Z80 op can take a lot of Prop instructions to complete.
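
As an illustration of squeezing several code sequence addresses into one long, here is a small Python sketch. It assumes 9-bit cog addresses (512 longs, matching the `$1ff` seen elsewhere in the thread), so three fit in a 32-bit long with bits to spare. The field layout and function names are hypothetical, not taken from any posted source.

```python
from typing import List, Sequence

ADDR_BITS = 9                       # assumed cog address width
ADDR_MASK = (1 << ADDR_BITS) - 1    # 0x1FF
MAX_SLOTS = 32 // ADDR_BITS         # 3 addresses per long


def pack_addresses(addrs: Sequence[int]) -> int:
    """Pack up to three 9-bit addresses into one 32-bit value, slot 0 lowest."""
    if len(addrs) > MAX_SLOTS:
        raise ValueError("too many addresses for one long")
    word = 0
    for slot, addr in enumerate(addrs):
        if not 0 <= addr <= ADDR_MASK:
            raise ValueError(f"address {addr:#x} does not fit in 9 bits")
        word |= addr << (slot * ADDR_BITS)
    return word


def unpack_addresses(word: int, count: int = MAX_SLOTS) -> List[int]:
    """Recover the packed addresses in slot order."""
    return [(word >> (slot * ADDR_BITS)) & ADDR_MASK for slot in range(count)]


packed = pack_addresses([0x040, 0x123, 0x1FF])
unpacked = unpack_addresses(packed)
```

A real engine would also need some convention for "no handler in this slot", which this sketch leaves out.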

Very cunning but hardly an idea that has not occurred to an emulator writer and Prop enthusiast, like myself. Although we may never have come up with the fastest way to do it:)

I had to stop reading your source when I got to the license which is incompatible with things like GPL or MIT.

Bill Henning · 2012-12-14 04:39

Thanks Cluso!

Cluso99 wrote: »

Sounds extremely interesting Bill. Took a quick look at the source.

I expected the CLUT would be extremely useful for other things besides clut and stacks.

Bill Henning · 2012-12-14 04:45

Thanks Chip!

I kept looking for other uses for the stack ram, and when I squinted just the right way, this idea popped out.

Translation tables have been done on a lot of architectures (including prop1) in the past, but the unique clut/stack ram access instructions allowed me to decouple translation from the hub, and more importantly, the cache, so the RDBYTEC runs mostly cached, making this the fastest method possible on a Prop2.

cgracey wrote: »

Wow! What a neat idea!

Get a cached byte from hub memory, translate it to a long via the stack RAM, and execute it. How simple, but effective!

Good job!

Bill Henning · 2012-12-14 04:46

Thanks Peter!

It should make Tachyon run MUCH faster. I am looking forward to seeing where you take it.

Peter Jakacki wrote: »

Everyone is having so much fun but it's really interesting seeing the stuff that has been done already. Chip knows that there is an up side with this and a down side because P2 better come out soon and it better work too!

Thanks Bill, I will sit down sometime next week and digest this tasty dish, sounds interesting, especially for Tachyon P2.

Bill Henning · 2012-12-14 04:58

Hi Chip.

Yesterday I experimented with the pipeline delays for instructions fetched from the stack, and I needed two delay slots between popping the instruction and executing it.

Using an "execute next time around" like RDQUAD based LMM should work and be faster, but would require the vm's to keep pipeline effects in mind. Here is that test version:

        REPS    #511,#4         ' not the infinite version as this saves me two delay slots when re-starting the loop after a subroutine
        mov     ins,#0

        RDBYTEC bop,ptra++
        SETSPA  bop
ins     NOP
        POPAR   ins

I'll try this one in a couple of hours - after breakfast

cgracey wrote: »

Bill,

Have you thought about overlapped read/execute:

        repd    $1ff,#4         'infinite repeat of 4 instructions

        nop                     'spacers
        nop
        mov     ins,#0          'clear ins to nop

        rdbytec bop,ptra++      'get cached byte - 4 clocks per loop
        setspa  bop             'set stack pointer
        popar   ins             'read long
ins     nop                     'execute prior long

Bill Henning · 2012-12-14 05:00

May I quote that in the documentation for P2BEE? Attributed to you, of course.

Peter Jakacki wrote: »

it certainly has turned the cog into fast spinning propeller churning up the bytecode waters

Bill Henning · 2012-12-14 05:10

Unlike the GPL, the license I picked allows commercial use, and the only difference from MIT is requiring attribution.

In academia, attribution is required, so I don't see the issue.

Of course lookup tables have been done in the past - far before the Propeller - CLUTs, TLBs, caches, lookup tables, microcode engines etc.

But I did come up with this first for P2, in a manner that avoids spoiling the RDxxxxC cache.

Today I am posting a potentially faster variant, that reduces it to four instructions, at the expense of making it more complicated with an additional pipeline.

Heater. wrote: »

Bill,
If I understand the idea correctly:

a) An emulator/VM for a byte wide instruction machine often uses a look up table to dispatch op codes to instruction sequences. Think Z80, Zog, Java etc.
b) That dispatch table on a Prop is likely to be in HUB RAM, read using RDLONGs
c) You are proposing to put that dispatch table in the COG's CLUT/stack RAM for fast look up.

For things like a Z80 the dispatch table in STACK RAM would have to contain jumps to code sequences or even multiple code sequence addresses squeezed into a long. After all a Z80 op can take a lot of Prop instructions to complete.

Very cunning but hardly an idea that has not occurred to an emulator writer and Prop enthusiast, like myself. Although we may never have come up with the fastest way to do it:)

I had to stop reading your source when I got to the license which is incompatible with things like GPL or MIT.

Tor · 2012-12-14 05:34

Bill Henning wrote: »

Unlike the GPL, the license I picked allows commercial use[..]

The GPL allows commercial use of course. There's all that about derived work though. But for something like this which is a separate "engine" type of module the LGPL (Lesser GPL) is just about a perfect fit, as it is for almost every library: You can use it freely without regards to the nature of your own code, as long as a) any changes you do to the module (the "engine" in this case) are made public, and b) the user is able to use another variant of the LGPL module with the (possibly commercial) work. Just like MIT or BSD the LGPL gives the right to use the code almost everywhere, but the LGPL also makes sure that changes/improvements by users are not kept hidden.

-Tor

Heater. · 2012-12-14 05:36

Bill,
I'm all for attribution, credit should be given where it is deserved.

However:

In the linked Spin file it says:
"You may not distribute derived works under a different license without written permission from William Henning."
Clearly if I update ZiCog or Zog, for example, to Prop II and use your code, I have created a derived work from it.
That means I have to put my existing code out under your same license, according to your terms above. Or I have to get permission from you to release under the MIT license, which it seems you don't want to give.

I have not read that specific licence but it seems to not be compatible with any OSI licence according to this page http://opensource.org/licenses/alphabetical where there is no mention of it.

It seems like an unusual choice for a software project.

It flies in the face of the majority of open Propeller code that is under the MIT license.

Heater. · 2012-12-14 05:45

The GPL allows commercial use of course, whilst at the same time reducing the value of your software to zero:)
1) If I use GPLed code I have to apply the GPL to my derivative work.
2) If I sell a binary of that now GPLed work I may be asked for the source and have to give it.
3) The user is then free to pass on that source under the same terms.
4) The sale value of my product is now zero.

LGPL may be better for this kind of thing, but has weird rules about what is linked in and not linked in and companies hate to mess with all that.

None of these work because Bill wants attribution.

ctwardell · 2012-12-14 05:51

So if I read these comments right, Bill basically has staked a claim to using the CLUT as a look up table?

Isn't that what it's there for?

C.W.

Heater. · 2012-12-14 06:03

Not exactly.
There is the idea of the Prop II CLUT/Stack RAM as an opcode look up table. Only protectable under patent if it has not been done before, which I'm sure it has.
Then there is the actual implementation as in Bill's source code, protectable under copyright like any published work.

Quite how this goes if you have or hear of the idea and then write your own, which quite likely will end up looking very similar, is unclear.

Could GCC ever make use of Bill's code given this license?

David Betz · 2012-12-14 06:06

Bill Henning wrote: »

Merry Xmas Everyone!

I know, I am early.

Propeller 2 BEE is the ultimate emulation engine for the Propeller 2!

I know, that is a very strong claim. I'll prove it.

Sounds like you've validated one of the intended uses of the CLUT. Congratulations! I'm glad you didn't find any hardware problems in the process! I think we're all going to have fun finding clever ways to use the new P2 features. Now if I could just get the P2 version of GAS done I might have some time to play with the new features myself! :-)

Tor · 2012-12-14 06:12

Heater. wrote: »

LGPL may be better for this kind of thing, but has weird rules about what is linked in and not linked in and companies hate to mess with all that.

LGPL is used a lot by companies because it's fundamentally different from GPL, and the criteria are not complicated: The user (of your software which uses the LGPL work) must be presented with a way of combining your work with the LGPL work (say, a different or newer version), one way or another. That's all it boils down to. It's popular for commercial companies because you simply use a shared library (DLL in Windows speak) version of the LGPL module (the alternative is to provide object files of your code which can be re-linked. But that's much more of a hassle).
GPL is a very different beast, it is definitely difficult to sort out if your use of it is derived work or not, and if all that even applies (e.g. if the GPL module is just another variant of many with the same API then your use of it is _not_ a derived work - think Libc or the Unix API).
Not that I'm in any way arguing about what license Bill should use - not at all - I'm only commenting about the interpretation of GPL variants as that came up in the thread. And wanting attribution, for example, is totally understandable (that said, copyright of course has to be retained in GPL/LGPL work too).

-Tor

ctwardell · 2012-12-14 06:16

Heater,

Well my intent is for emulator use, so it is of concern.

http://forums.parallax.com/showthread.php?144199-Propeller-II-Emulation-of-the-P2-on-DE0-NANO-amp-DE2-115-FPGA-boards&p=1148239&highlight=cosmacog#post1148239

All of my lookups will be jumps, so translating an opcode to a jump in every case.

I haven't looked at and don't plan to look at Bill's code for this.

For the 1802 it will still be running at the typical 3.579/2 MHz for 1861 support, so not looking for the speed of some tight loop doing fast lookup/execute anyway.

I just hope we don't enter an arms race of everyone looking for little snippets to claim, otherwise I have a DE0-Nano for sale...

C.W.

Bill Henning · 2012-12-14 06:27

Big mashup reply so I don't have to reply individually...

Tor:

thanks - if there was an "attribution required" version of LGPL, I might have chosen it.

Heater: (#16,17)

You are right, but that would also mean that people would have to attribute derivations of ZOG to you.

I'd love to see you do a ZOG using this technique.

ctwardell:

No. I just require attribution for using the clut/stack as an instruction store for a quite optimal byte code execution engine.

Heater: (#19)

Yes, GCC is welcome to use it, at no charge, as long as they attribute as I ask and use this license for a hypothetical byte code compressed engine based on P2BEE.

David: (#20)

The CLUT/STACK was designed as a color look up table; stack functionality was added later. It was not intended for storing executable code for byte code expansion.

So instead of "validating an intended use" I came up with a clever new way to use it in an unintended way that will have great benefits for all byte code execution.

It is rather similar to LMM actually, but fetching code from a memory smaller than the cog registers, instead of larger like the hub.

This could allow for an extremely compact drastically faster compressed mode for GCC and Spin ... with the only cost to Parallax being an attribution requirement.

David Betz · 2012-12-14 06:32

Bill Henning wrote: »

David: (#20)

The CLUT/STACK was designed as a color look up table; stack functionality was added later. It was not intended for storing executable code for byte code expansion.

So instead of "validating an intended use" I came up with a clever new way to use it in an unintended way that will have great benefits for all byte code execution.

It is rather similar to LMM actually, but fetching code from a memory smaller than the cog registers, instead of larger like the hub.

This could allow for an extremely compact drastically faster compressed mode for GCC and Spin ... with the only cost to Parallax being an attribution requirement.

CLUT stands for Color LookUp Table. It seems to me to be a fairly obvious extension to look up things other than colors. That's all I was saying. I haven't read your code and will probably try not to. I am not a big fan of people trying to lay claim to ideas even if they do publish them under a liberal license.

Dave Hein · 2012-12-14 06:48

I thought CLUT meant "Code LookUp Table", which is why I was confused by Bill's Xmas Present. Bill, can I put your license on the labels on all the Xmas presents I hand out this year? I want to make sure my family acknowledges the gifts I give them every time they use them.

Bill Henning · 2012-12-14 07:36

UPDATE:

I am adding P2BEE v0.12 to the first post

This version has an alternate pipelined mode that can execute a single cycle Propeller 2 instruction in 4.53 cycles on average!
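
For a rough sense of what the two quoted averages mean, the arithmetic can be sketched in Python. The 6.5 and 4.53 cycle figures come from this thread. The 160 MHz clock is purely a hypothetical value chosen for illustration, and the results apply only to single-cycle instructions.

```python
ORIGINAL_CYCLES = 6.5      # average cycles per bytecode, original engine
PIPELINED_CYCLES = 4.53    # average cycles per bytecode, pipelined mode
ASSUMED_CLOCK_HZ = 160e6   # hypothetical clock rate, not a measured value


def bytecodes_per_second(clock_hz: float, cycles_per_bytecode: float) -> float:
    """Throughput if every bytecode maps to one single-cycle instruction."""
    return clock_hz / cycles_per_bytecode


speedup = ORIGINAL_CYCLES / PIPELINED_CYCLES
original_rate = bytecodes_per_second(ASSUMED_CLOCK_HZ, ORIGINAL_CYCLES)
pipelined_rate = bytecodes_per_second(ASSUMED_CLOCK_HZ, PIPELINED_CYCLES)

print(f"speedup: {speedup:.2f}x")
print(f"original:  {original_rate / 1e6:.1f} M bytecodes/s")
print(f"pipelined: {pipelined_rate / 1e6:.1f} M bytecodes/s")
```

The ratio does not depend on the assumed clock; only the absolute rates do.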

potatohead · 2012-12-14 07:40

I won't look at this.

I've a DE2 for sale, if we are going to have a land grab on P2.

Sorry Bill.

Think of where LMM would be today with this license. "Hey, that nop is there to comply with non MIT licensing, if you want it to run at peak speed, talk to Bill..."

Heater. · 2012-12-14 07:52

Bill,

"You are right, but that would also mean that people would have to attribute derivations of ZOG to you."

Quite so, which is not something I ever wanted to impose on any user. Ergo I cannot use your code.

"I'd love to see you do a ZOG using this technique."

For sure Zog and ZiCog have been waiting for the Prop II (Zog has been quite happy to sleep in his iceberg until then) and for sure the CLUT as dispatch table is an idea that has already popped into mind.

"I just require attribution for using the clut/stack as an instruction store for a quite optimal byte code execution engine."

Thing is you cannot "require" attribution for that idea as such, you can only request it. Unless you could patent the idea, which is unlikely, you have no such rights to "require". You can require attribution for the published work under copyright, which is why I stopped reading. This may mean that I, for example, never reach the optimum speed of your code, unless I happen to realize all the little tricks involved, but I am quite free to use the idea in my way.

"GCC is welcome to use it, at no charge, as long as they attribute as I ask and use this license for a hypothetical byte code compressed engine based on P2BEE"

Except I'm not sure that they can, such restrictions may be in conflict with the GPLed GCC. Again, nothing stops them implementing their own clean room version of the idea and using that, even if it is a bit suboptimal.

"The CLUT/STACK was designed as a color look up table; stack functionality was added later. It was not intended for storing executable code for byte code expansion."

Really, I always thought "CLUT" stood for "Code Look Up Table":)

"...with the only cost to Parallax being an attribution requirement."

Except they don't have the right to impose that restriction on GCC either (as far as I can tell).

Sapieha · 2012-12-14 07:57

Hi.

I am not addressing this question to anyone in particular. Is it a rhetorical question?

But why do so many people have a problem giving credit for others' work?

P.S. I have seen many posts on this forum where people claimed others' work as their own.

With that license Bill does not say anyone needs to pay for it, only to give credit for his work.

My 5 cents

potatohead · 2012-12-14 08:15

Looking back, there have been a few GPL / Attribution type licenses attached to code here. One of the first was my original Potatotext, which people found compelling, but were unable to use in their projects. Hippy rewrote it to get around the GPL, and it became far more useful, ending up as AiGeneric. Always thought that was the best move, and I dropped the GPL on it, mostly being naive at the time, not for any other intent or purpose. Great lesson.

Another one was Eric Ball's software video technique. A similar kind of license was placed on it. Fantastic code that has color capability not seen in just about any other driver out there. I still think that one is the best, runs at 14 MHz, offers a great color set and it includes some very clever software code to render the color sub-carrier, allowing for a few things we've not seen exploited on the P1. That code was never widely used either. That's not a negative to Eric, who deserves credit for that driver, and where it was used, credit was given gladly. (me) It's excellent, and it demonstrates his very deep understanding of video. Over the years, lots of video drivers got done, and I'm quite sure that one just got forgotten amidst so many great ways to exploit P1 video hardware.

Frankly, a license like this will more or less ensure that the technique does not see wide use. Of course, we can encrypt now too, so who knows? Very interesting and new questions.

Jim Bagley authored Prop GFX, released in binary form and well documented. We really didn't adopt a binary blob well, despite his seriously good efforts to document how to use it. Again, not a negative statement against him. He's got serious graphics / game experience running over 30 years and probably has some stuff in there he would rather keep control over, and I thought it generous to work so hard to publish it in a way we could enjoy.

I put code here because I take code from here, and it's been a working arrangement that has served us all very well. If we start down this road with P2, it's going to get really interesting! The idea that code gets put here so that we all can benefit will go away, leaving lots of little islands of code and much less innovation in this little ecosystem.

Edit: You know, I saw the other thread about requiring code posted here be MIT or something. Maybe that is a great requirement! Anyone wanting to use another license can write their package up here, advertising essentially, leaving people free to consider it and enter into the appropriate licensing agreements. IMHO, it's an unfair exploitation of this forum to do otherwise as it limits discussion and puts people into odd circumstances, despite there being no nefarious intent to be there. Click on the wrong thread and? I would rather not see that happen.

That's my $.02

Bill, please don't quote that code out in the open here, or warn us so we can avoid the thread. I want to not see it. And I want to not see it for the difficulty doing so will bring, not that I don't want to recognize how bad *** smart you are about this stuff. Sorry man, and thanks for helping out.

Heater. · 2012-12-14 08:15

Sapieha,

I have no problem with giving credit to the creators of things I have used. Most of my code contains such "thank you notices" and references to original sources. I'm sure most here feel the same. After all "we stand on the shoulders of giants" as they say.

No, that is not the point. Rather, we are pointing out that with that string attached it is a very nice Christmas present that many projects, current and future, cannot use. Most importantly for Parallax and the Prop II, that includes the GCC effort.
