New Spin

cgracey · 2017-02-17 15:59

I've started working on the new Spin.

At first, it's going to be interpreted byte code. In-line assembly will be allowed, though. Also, every assembly instruction, except branches and context-dependent stuff like ALTI and AUGS, has a procedure form in Spin, making all hardware functions readily accessible. It actually takes a huge load off Spin, by circumventing the need to re-dress lots of functions. If you look at the spreadsheet linked to in the "Prop2 FPGA Files!!!" thread, you can see how they work. They will get people rapidly acquainted with how the actual instructions work. There are local bit variables, CF and ZF, which get used and updated by the pasm-instruction procedures.

For the run-time data and call stacks, the LUT will be used. The user can set how much is otherwise available for his own use. This means that there is no need to declare a stack in hub space. It's now implied. The limitation is up to 512 longs, but that's going to be fine for programs that don't call themselves recursively.

To fetch code bytes, 'RDBYTE b,PTRA++' could be used, but I came up with a faster way of doing it, while keeping the FIFO free for the user:

'
'
' Get new block, then bytes via 'calld b_ret,b_call'
'
new_block		setq	#15
			rdlong	block_base,block_addr
			mov	block_long,#0
			mov	b_call,#new_long
			ret


new_long		alts	block_long,#block_base	'get byte %xxxx00
			getbyte	b,0,#0
			calld	b_call,b_ret

			alts	block_long,#block_base	'get byte %xxxx01
			getbyte	b,0,#1
			calld	b_call,b_ret

			alts	block_long,#block_base	'get byte %xxxx10
			getbyte	b,0,#2
			calld	b_call,b_ret

			alts	block_long,#block_base	'get byte %xxxx11
			getbyte	b,0,#3
			calld	b_call,b_ret

			incmod	block_long,#15	wc	'another long?
	if_nc		jmp	#new_long

			add	block_addr,#64		'read next block
			setq	#15
			rdlong	block_base,block_addr
			jmp	#new_long

block_base		res	16
block_long		res	1
block_addr		res	1
b_call			res	1
b_ret			res	1

That only needs one instruction (a CALLD) per byte fetch and it takes 12 clocks, plus another ~50 clocks every 64th byte. The simple RDBYTE takes 9..24 clocks.
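For anyone who wants to check the bookkeeping off-chip, here is a rough Python model of the scheme above. It is only a sketch, not the interpreter itself: the names `byte_fetcher` and `amortized_clocks` are made up for illustration, hub memory is modeled as a plain `bytes` object, and the clock figures are simply the 12 and ~50 quoted above.

```python
import struct

BLOCK_LONGS = 16                 # SETQ #15 + RDLONG transfers 16 longs
BLOCK_BYTES = BLOCK_LONGS * 4    # 64 bytes per block

def byte_fetcher(hub, start):
    """Yield bytes from hub, refilling a 16-long buffer every 64 bytes."""
    block_addr = start
    while True:
        # Block read: copy 16 little-endian longs (zero-padded past the end).
        chunk = hub[block_addr:block_addr + BLOCK_BYTES].ljust(BLOCK_BYTES, b"\0")
        block_base = struct.unpack("<16L", chunk)
        for block_long in range(BLOCK_LONGS):
            for n in range(4):
                # Models GETBYTE: pick byte n of the current long.
                yield (block_base[block_long] >> (8 * n)) & 0xFF
        block_addr += BLOCK_BYTES

def amortized_clocks(per_byte=12, refill=50, block_bytes=BLOCK_BYTES):
    """Average clocks per byte when the refill cost is spread over a block."""
    return per_byte + refill / block_bytes

hub = bytes(range(200))
fetch = byte_fetcher(hub, 3)
first = [next(fetch) for _ in range(130)]   # crosses two block boundaries
```

With those figures the average works out to 12 + 50/64, or about 12.8 clocks per byte, which is inside the 9..24 clock range of the plain RDBYTE and below its midpoint.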

potatohead · 2017-02-17 17:44

Just took a look at that spreadsheet. Nice reference Chip. Thank you!

Will the LUT stacks work bottom up, or top down? Thinking top down might be better. Users can make their declarations and always work from 0 up to [limit], while Spin works from the top down. The programmer knows the stack is there, but doesn't have to worry about it much until it can no longer be ignored.
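A small Python sketch of the top-down idea, purely as an illustration. The class `LutLayout` and its methods are hypothetical; the only number taken from the thread is the 512-long LUT size.

```python
LUT_LONGS = 512   # LUT size in longs, per the first post

class LutLayout:
    """Hypothetical split: user data in 0..user_longs-1, stack grows down from 511."""

    def __init__(self, user_longs):
        if not 0 <= user_longs <= LUT_LONGS:
            raise ValueError("user area must fit in the LUT")
        self.user_limit = user_longs   # first index NOT owned by the user
        self.sp = LUT_LONGS            # empty descending stack
        self.lut = [0] * LUT_LONGS

    def push(self, value):
        # Refuse to grow into the user's area.
        if self.sp - 1 < self.user_limit:
            raise OverflowError("stack collided with user area")
        self.sp -= 1
        self.lut[self.sp] = value & 0xFFFFFFFF

    def pop(self):
        if self.sp >= LUT_LONGS:
            raise IndexError("stack empty")
        value = self.lut[self.sp]
        self.sp += 1
        return value
```

The point of the layout is that user indices never move when the stack depth changes; the only failure case is the collision check in `push`.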

David Betz · 2017-02-17 18:29

If you put the stack in the LUT what happens with Spin code that takes the address of a stack variable? Will that work? I see that done frequently in code that loads a COG.

Roy Eltham · 2017-02-17 19:15

David,
Aren't LUT addresses unique from HUB ones in cog land? Or did that change? So I think taking addresses shouldn't matter for LUT vs HUB, although which instruction is used to read the value will be different depending...

ke4pjw · 2017-02-17 19:16

I am so happy to see this! I know there has been a huge push for C. I like C and the C style syntax. However, I must say that when I get a chance to code on 3 day weekends, I *REALLY* like Spin and PASM. It's such a great fit for the prop.

I was thinking that this time around, Spin would be compiled, (And I am sure compilers will be written) but I was surprised to see it will be interpreted.

Anyway, YAY! Keep knocking them out of the park Chip! I follow the Prop2 forum like a lovesick crackhead, hoping to see when we might get some silicon.

It will be ready when it is ready, and I am good with that.

Dave Hein · 2017-02-17 19:23

ke4pjw wrote: »

I was thinking that this time around, Spin would be compiled, (And I am sure compilers will be written) but I was surprised to see it will be interpreted.

It looks like Spin will be compiled into bytecodes, which are then executed by a bytecode interpreter like in the P1.

potatohead · 2017-02-17 19:27

At first.

David Betz · 2017-02-17 19:46

Dave Hein wrote: »

ke4pjw wrote: »

I was thinking that this time around, Spin would be compiled, (And I am sure compilers will be written) but I was surprised to see it will be interpreted.

It looks like Spin will be compiled into bytecodes, which are then executed by a bytecode interpreter like in the P1.

There is already a P2 Spin compiler that produces native code by Eric Smith. I'm not sure why Parallax isn't interested.

jmg · 2017-02-17 19:56

David Betz wrote: »

There is already a P2 Spin compiler that produces native code by Eric Smith. I'm not sure why Parallax isn't interested.

Yes, that is impressive work. The risk of lacking a common code base is Spin2 fragmentation and incompatibility, which will put users off P2 rather than attract them.

Has the P2 testing really finished?

USB code seemed still in a state of flux last I looked, and video test cases are far from complete.
I've not seen anyone talk about cameras for a while...
Or motor controller use cases?
Or any examples using HyperRAM/HyperFLASH?

David Betz · 2017-02-17 20:11

Roy Eltham wrote: »

David,
Aren't LUT addresses unique from HUB ones in cog land? Or did that change? So I think taking addresses shouldn't matter for LUT vs HUB, although which instruction is used to read the value will be different depending...

Can one use the same instruction to fetch or store a value in hub and LUT?

Roy Eltham · 2017-02-17 20:56

jmg,
A lot of Spin is P1 centric, and not directly translatable to P2. Also, stuff has to be added to support the new features of the hardware.

The basic syntax and non-hardware-specific stuff can be the same, but there have to be differences, and programmers will need to know them to use the P2 for anything beyond the most basic stuff.

I suppose someone could do an emulation of P1 on P2, but matching the exact timings of the counter stuff will be difficult, and that's just one part of it.

David,
It's different instructions, but that can be handled for you in Spin.

jmg · 2017-02-17 21:09

Roy Eltham wrote: »

jmg,
A lot of Spin is P1 centric, and not directly translatable to P2. Also, stuff has to be added to support the new features of the hardware.

The basic syntax and non-hardware-specific stuff can be the same, but there have to be differences, and programmers will need to know them to use the P2 for anything beyond the most basic stuff.

Well, yes, that's yet another Spin-portability-downside-issue.

My main point was more related to spawning Multiple flavours of P2-Spin.

David Betz · 2017-02-17 22:02

Roy Eltham wrote: »

David,
It's different instructions, but that can be handled for you in Spin.

Then I guess there would need to be different syntax for accessing the LUT. Maybe lut[x] like we can already do long[x]. Or maybe pointers could have types. Seems like it could get messy.

David Betz · 2017-02-17 22:16

jmg wrote: »

My main point was more related to spawning Multiple flavours of P2-Spin.

I suppose we just need to wait to see what Chip comes up with.

Cluso99 · 2017-02-17 23:22

Having the local variables and stack in LUT will result in a big speed boost for Spin. In P1 (almost?) every bytecode results in a pop/execute/push where the pop and push are hub accesses resulting in stalls.
In addition to this, using the fast fill buffer from hub to LUT and then peeling off the required bytecodes will also result in a big clock saving.

In my faster P1 Spin Interpreter, I put the decode table into hub. There is probably enough LUT space for this and it saves quite a few clocks and space, meaning that the actual interpreter can be improved too.

There are lots of ways to tweak the interpreter because we have LUT and hub-exec. The little-used bytecodes can be hived off to hub-exec, giving more space for the interpreter to be optimised.

All these things can be improved once we have the base.

BTW If the P1 Bytecode structure does not change, there is no reason why there should be much difference to the P1 spin code, apart from extensions.

David Betz · 2017-02-17 23:33

Cluso99 wrote: »

Having the local variables and stack in LUT will result in a big speed boost for Spin. In P1 (almost?) every bytecode results in a pop/execute/push where the pop and push are hub accesses resulting in stalls.
In addition to this, using the fast fill buffer from hub to LUT and then peeling off the required bytecodes will also result in a big clock saving.

In my faster P1 Spin Interpreter, I put the decode table into hub. There is probably enough LUT space for this and it saves quite a few clocks and space, meaning that the actual interpreter can be improved too.

There are lots of ways to tweak the interpreter because we have LUT and hub-exec. The little-used bytecodes can be hived off to hub-exec, giving more space for the interpreter to be optimised.

All these things can be improved once we have the base.

BTW If the P1 Bytecode structure does not change, there is no reason why there should be much difference to the P1 spin code, apart from extensions.

But why do a byte code interpreter at all? Why not just go directly to a native compiler?

ke4pjw · 2017-02-18 01:14

David Betz wrote: »

But why do a byte code interpreter at all? Why not just go directly to a native compiler?

That is my thought. It *SEEMS* like building a compiler would be easier than building a tokenizer, optimizer for the programming tool and an interpreter in ROM. I only understand these things as concepts, so it's difficult for me to gauge what is "easy". The first time I came across the concept was with BASIC09 back when I was a kid. It's still all voodoo to me.

cgracey · 2017-02-18 01:15

David Betz wrote: »

Cluso99 wrote: »

Having the local variables and stack in LUT will result in a big speed boost for Spin. In P1 (almost?) every bytecode results in a pop/execute/push where the pop and push are hub accesses resulting in stalls.
In addition to this, using the fast fill buffer from hub to LUT and then peeling off the required bytecodes will also result in a big clock saving.

In my faster P1 Spin Interpreter, I put the decode table into hub. There is probably enough LUT space for this and it saves quite a few clocks and space, meaning that the actual interpreter can be improved too.

There are lots of ways to tweak the interpreter because we have LUT and hub-exec. The little-used bytecodes can be hived off to hub-exec, giving more space for the interpreter to be optimised.

All these things can be improved once we have the base.

BTW If the P1 Bytecode structure does not change, there is no reason why there should be much difference to the P1 spin code, apart from extensions.

But why do a byte code interpreter at all? Why not just go directly to a native compiler?

Without good optimization, a native compiler will make big code. The byte code starts out optimized. Plus, habit, I guess.

David Betz · 2017-02-18 01:25

cgracey wrote: »

David Betz wrote: »

Cluso99 wrote: »

Having the local variables and stack in LUT will result in a big speed boost for Spin. In P1 (almost?) every bytecode results in a pop/execute/push where the pop and push are hub accesses resulting in stalls.
In addition to this, using the fast fill buffer from hub to LUT and then peeling off the required bytecodes will also result in a big clock saving.

In my faster P1 Spin Interpreter, I put the decode table into hub. There is probably enough LUT space for this and it saves quite a few clocks and space, meaning that the actual interpreter can be improved too.

There are lots of ways to tweak the interpreter because we have LUT and hub-exec. The little-used bytecodes can be hived off to hub-exec, giving more space for the interpreter to be optimised.

All these things can be improved once we have the base.

BTW If the P1 Bytecode structure does not change, there is no reason why there should be much difference to the P1 spin code, apart from extensions.

But why do a byte code interpreter at all? Why not just go directly to a native compiler?

Without good optimization, a native compiler will make big code. The byte code starts out optimized. Plus, habit, I guess.

Well, I'm certainly not one to complain too loudly about byte code interpreters. I've written lots over the years.

David Betz · 2017-02-18 01:27

I don't really like the idea that parameters and local variables will be placed in the LUT. It means you can't take the address of them and use them like any other hub memory pointer. How about if only return addresses and stack frame management lives in the LUT and the actual parameters and local variables are in a hub stack?

cgracey · 2017-02-18 01:57

David Betz wrote: »

I don't really like the idea that parameters and local variables will be placed in the LUT. It means you can't take the address of them and use them like any other hub memory pointer. How about if only return addresses and stack frame management lives in the LUT and the actual parameters and local variables are in a hub stack?

As Cluso pointed out, things are way faster without having to do random hub r/w's.

Here's how we can solve this, based on address (@variable):

%0xxx_xxxx_xxxx_AAAA_AAAA_AAAA_AAAA_AAAA = hub variable address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx0A_AAAA_AAAA = cog register address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx1A_AAAA_AAAA = LUT register address (locals build upwards from LUT[0])

Or, maybe it would be better to do this, as it would protect lower hub memory and make address MSB's "don't-care":

$xxx0_0000..$xxx0_01FF = cog register address
$xxx0_0200..$xxx0_03FF = LUT register address (locals build upwards from LUT[0])
$xxx0_0400..$xxxF_FFFF = hub variable address

Either way, indexing math would work fine.
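A quick Python model of the second (flat) layout, to show what the software address check could look like. This is a sketch under assumptions, not the interpreter's actual code: the names `decode` and `read_long` are invented, the upper 12 address bits are treated as don't-care, cog and LUT addresses are taken to be register indices, and hub addresses are taken to be byte addresses.

```python
LUT_BASE = 0x200       # $200..$3FF map to LUT[0..511]
HUB_BASE = 0x400       # $400 and up are hub addresses
ADDR_MASK = 0xFFFFF    # keep 20 bits; the MSBs are "don't care"

def decode(addr):
    """Classify an address as ('cog' | 'lut' | 'hub', index)."""
    a = addr & ADDR_MASK
    if a < LUT_BASE:
        return ("cog", a)
    if a < HUB_BASE:
        return ("lut", a - LUT_BASE)
    return ("hub", a)

def read_long(addr, cog, lut, hub):
    """Address check before reading: pick the memory, then fetch one long."""
    space, index = decode(addr)
    if space == "cog":
        return cog[index]
    if space == "lut":
        return lut[index]
    # Hub is byte-addressed; assemble a little-endian long.
    return int.from_bytes(hub[index:index + 4], "little")
```

Because each region is contiguous, adding an index to a base address stays inside the same region as long as the variable does, which is why the indexing math carries over unchanged.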

evanh · 2017-02-18 01:58

Don't most people just plonk all the Spin variables down as globals anyway? This would give people a reason to start using locals.

David Betz · 2017-02-18 02:02

cgracey wrote: »
David Betz wrote: »

I don't really like the idea that parameters and local variables will be placed in the LUT. It means you can't take the address of them and use them like any other hub memory pointer. How about if only return addresses and stack frame management lives in the LUT and the actual parameters and local variables are in a hub stack?

As Cluso pointed out, things are way faster without having to do random hub r/w's.

Here's how we can solve this, based on address (@variable):
%0xxx_xxxx_xxxx_AAAA_AAAA_AAAA_AAAA_AAAA = hub variable address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx0A_AAAA_AAAA = cog register address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx1A_AAAA_AAAA = LUT register address (locals build upwards from LUT[0])
Or, maybe it would be better to do this, as it would protect lower hub memory and make address MSB's "don't-care":
$xxx0_0000..$xxx0_01FF = cog register address
$xxx0_0200..$xxx0_03FF = LUT register address (locals build upwards from LUT[0])
$xxx0_0400..$xxxF_FFFF = hub variable address
Either way, indexing math would work fine.

So you'll make the indexing operator do the right thing based on the address? That works for interpreted code but not so well if you pass the address to PASM code when loading a COG.

cgracey · 2017-02-18 02:05

David Betz wrote: »
cgracey wrote: »
David Betz wrote: »

I don't really like the idea that parameters and local variables will be placed in the LUT. It means you can't take the address of them and use them like any other hub memory pointer. How about if only return addresses and stack frame management lives in the LUT and the actual parameters and local variables are in a hub stack?

As Cluso pointed out, things are way faster without having to do random hub r/w's.

Here's how we can solve this, based on address (@variable):
%0xxx_xxxx_xxxx_AAAA_AAAA_AAAA_AAAA_AAAA = hub variable address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx0A_AAAA_AAAA = cog register address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx1A_AAAA_AAAA = LUT register address (locals build upwards from LUT[0])
Or, maybe it would be better to do this, as it would protect lower hub memory and make address MSB's "don't-care":
$xxx0_0000..$xxx0_01FF = cog register address
$xxx0_0200..$xxx0_03FF = LUT register address (locals build upwards from LUT[0])
$xxx0_0400..$xxxF_FFFF = hub variable address
Either way, indexing math would work fine.
So you'll make the indexing operator do the right thing based on the address? That works for interpreted code but not so well if you pass the address to PASM code when loading a COG.

You just do an address check before reading or writing. Yes, easy in an interpreter, harder in compiled code.

potatohead · 2017-02-18 02:10

Byte code does present a compact code option. We should have it IMHO. Byte code will allow really big programs in native RAM. Or small ones that provide for large buffers.

Remember, we get real in line PASM this time. Speed vs size will play out much differently. It will be faster and easier to optimize parts for speed.

David Betz · 2017-02-18 02:19

cgracey wrote: »
David Betz wrote: »
cgracey wrote: »
David Betz wrote: »

I don't really like the idea that parameters and local variables will be placed in the LUT. It means you can't take the address of them and use them like any other hub memory pointer. How about if only return addresses and stack frame management lives in the LUT and the actual parameters and local variables are in a hub stack?

As Cluso pointed out, things are way faster without having to do random hub r/w's.

Here's how we can solve this, based on address (@variable):
%0xxx_xxxx_xxxx_AAAA_AAAA_AAAA_AAAA_AAAA = hub variable address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx0A_AAAA_AAAA = cog register address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx1A_AAAA_AAAA = LUT register address (locals build upwards from LUT[0])
Or, maybe it would be better to do this, as it would protect lower hub memory and make address MSB's "don't-care":
$xxx0_0000..$xxx0_01FF = cog register address
$xxx0_0200..$xxx0_03FF = LUT register address (locals build upwards from LUT[0])
$xxx0_0400..$xxxF_FFFF = hub variable address
Either way, indexing math would work fine.
So you'll make the indexing operator do the right thing based on the address? That works for interpreted code but not so well if you pass the address to PASM code when loading a COG.
You just do an address check before reading or writing. Yes, easy in an interpreter, harder in compiled code.

Here I'm not talking about compiled code. I'm talking about an interpretive program passing parameters to a PASM COG image. The COG code would have to understand and act on your pointer bit encoding. Too bad the RDxxx and WRxxx instructions can't be made to work like hub exec to allow access to all three memory spaces with the same instruction.

cgracey · 2017-02-18 02:23

David Betz wrote: »
cgracey wrote: »
David Betz wrote: »
cgracey wrote: »
David Betz wrote: »

I don't really like the idea that parameters and local variables will be placed in the LUT. It means you can't take the address of them and use them like any other hub memory pointer. How about if only return addresses and stack frame management lives in the LUT and the actual parameters and local variables are in a hub stack?

As Cluso pointed out, things are way faster without having to do random hub r/w's.

Here's how we can solve this, based on address (@variable):
%0xxx_xxxx_xxxx_AAAA_AAAA_AAAA_AAAA_AAAA = hub variable address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx0A_AAAA_AAAA = cog register address
%1xxx_xxxx_xxxx_xxxx_xxxx_xx1A_AAAA_AAAA = LUT register address (locals build upwards from LUT[0])
Or, maybe it would be better to do this, as it would protect lower hub memory and make address MSB's "don't-care":
$xxx0_0000..$xxx0_01FF = cog register address
$xxx0_0200..$xxx0_03FF = LUT register address (locals build upwards from LUT[0])
$xxx0_0400..$xxxF_FFFF = hub variable address
Either way, indexing math would work fine.
So you'll make the indexing operator do the right thing based on the address? That works for interpreted code but not so well if you pass the address to PASM code when loading a COG.
You just do an address check before reading or writing. Yes, easy in an interpreter, harder in compiled code.
Here I'm not talking about compiled code. I'm talking about an interpretive program passing parameters to a PASM COG image. The COG code would have to understand and act on your pointer bit encoding. Too bad the RDxxx and WRxxx instructions can't be made to work like hub exec to allow access to all three memory spaces with the same instruction.

That would be complicated, when it comes to the FIFO implementation. We'll just filter in software.

Seairth · 2017-02-18 02:33

David Betz wrote: »
cgracey wrote: »
$xxx0_0000..$xxx0_01FF = cog register address
$xxx0_0200..$xxx0_03FF = LUT register address (locals build upwards from LUT[0])
$xxx0_0400..$xxxF_FFFF = hub variable address
You just do an address check before reading or writing. Yes, easy in an interpreter, harder in compiled code.

Here I'm not talking about compiled code. I'm talking about an interpretive program passing parameters to a PASM COG image. The COG code would have to understand and act on your pointer bit encoding. Too bad the RDxxx and WRxxx instructions can't be made to work like hub exec to allow access to all three memory spaces with the same instruction.

David, I'm not seeing what you're talking about. Wouldn't the above pattern work as-is in PASM?

David Betz · 2017-02-18 02:52

Something like this where the address of mbox is passed to the COG and the COG assumes it can use that as the base address of an array to access mbox, rxpin, txpin, mode, baudrate, rxsiz, txsiz, and buffs.

PUB startx(mbox, rxpin, txpin, mode, baudrate, rxsiz, txsiz, buffs) : okay

'' Start packet driver - starts a cog
'' returns false if no cog available
''
'' mode bit 0 = invert rx
'' mode bit 1 = invert tx
'' mode bit 2 = open-drain/source tx
'' mode bit 3 = ignore tx echo on rx

  ' stop the cog if it is already running
  stopx(mbox)

  ' compute the ticks per bit from the baudrate
  baudrate := clkfreq / baudrate

  ' start the driver cog
  okay := long[mbox][4] := cognew(@entry, @mbox) + 1

  ' if the cog started okay wait for it to finish initializing
  if okay
    repeat while mbox <> 0

David Betz · 2017-02-18 02:55

I just realized that my example wouldn't work even if RDxxx understood address spaces since I'm passing parameters that are in the LUT of one COG to another COG. That isn't possible.

Roy Eltham · 2017-02-18 09:11

If all locals/params are in LUT space then you can't use them to be passed to another COGs new/init.

Perhaps we can have a way to say what memory we want locals/params in? Otherwise we are forced to only use globals...
