Fast Bytecode Interpreter

cgracey · 2017-04-06 23:30

Ersmith,

If PA/PB/PTRA/PTRB could be your basereg, then maybe we could make an instruction which adds a 20-bit immediate, but rather than modify PA/PB/PTRA/PTRB, the sum appears as S in the next instruction.

	LOCOS	PA,#imm20offset
	WRLONG	data,0

cgracey · 2017-04-06 23:46

Ersmith,

How many concurrent basereg's do you think you need? Would PA and PB be sufficient, or might you need several?

Ariba · 2017-04-06 23:49

Load and Store with base register + #offset can be done with ALTS:

	alts	baseReg,#offset9
	rdlong	dataReg,0-0

	alts	baseReg,#offset9
	wrbyte	dataReg,0-0

	alts	baseReg,##offset32    '2 instr
	rdword	dataReg,0-0

Andy

ozpropdev · 2017-04-06 23:59

What about an AUG instruction that extends the 5-bit N field in PTRA/B hub instructions?

augn #imm_offset
wrlong data,PTRA[imm_offset]

This way you get the benefits of pre/post indexing too.

cgracey · 2017-04-07 00:09

Ariba wrote: »
Load and Store with base register + #offset can be done with ALTS:
	alts	baseReg,#offset9
	rdlong	dataReg,0-0

	alts	baseReg,#offset9
	wrbyte	dataReg,0-0

	alts	baseReg,##offset32    '2 instr
	rdword	dataReg,0-0
Andy

All those ALTxx instructions do is modify fields in the next instruction, not final S or D values. So, getting 20-bit alterations requires something more.

cgracey · 2017-04-07 00:18

ozpropdev wrote: »
What about an AUG instruction that extends the 5-bit N field in PTRA/B hub instructions?
augn #imm_offset
wrlong data,PTRA[imm_offset]
This way you get the benefits of pre/post indexing too.

Hey, that could be almost for free, but would preclude immediate absolute addressing using AUGS, as happens now. It's something to consider, though.

I was thinking we could just change the encoding of CALLD to eke out another instruction which has a 20-bit immediate.

Currently:

EEEE 11100WW RAA AAAAAAAAA AAAAAAAAA	CALLD   PA/PB/PTRA/PTRB,#A

...could become:

EEEE 111000W RAA AAAAAAAAA AAAAAAAAA	CALLD   PA/PB,#A
EEEE 111001W WAA AAAAAAAAA AAAAAAAAA	LOCOS   PA/PB/PTRA/PTRB,#A

This would limit CALLD to just PA and PB, but I don't really see why you'd want to jump to where PTRA/PTRB points.

This LOCOS instruction would sum PA/PB/PTRA/PTRB with a 20-bit immediate offset and put it into the next instruction's S value, which would be used by the subsequent RDxxxx/WRxxxx instruction.

ersmith · 2017-04-07 00:18

cgracey wrote: »

Ersmith,

How many concurrent basereg's do you think you need? Would PA and PB be sufficient, or might you need several?

The compiler would generally use arbitrary registers. If we can only use a limited set then we can stick with PTRA and PTRB (rather than PA and PB) but the more special cases there are the more tricky optimization becomes.

ersmith · 2017-04-07 00:19

Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...

JasonDorie · 2017-04-07 00:24

jmg wrote: »

There is nothing technically stopping more than one COG from reaching into XIP memory, except that would thrash the bus more, and increase latencies. In some cases, that could be tolerable.

That would need some kind of mutex on the bus though, right? You couldn't do it concurrently in multiple cogs I assume?

jmg wrote: »

Users can, and will, pull out 'driver level like' functions to run on other COGS. Some of those may creep into HUB space, but the general nature is they would be loaded and run without being moved.

XIP allows the 'main code' to place most often used functions in HUB, and then call into further memory, where the usual size-speed trade-offs apply. Code like POST, or Setup menus, are some examples of not-used-often, but still required to be available code.

That would also require compiler, linker, and loader support. This functionality is possible with P1 using PropGCC to compile overlays into eeprom, but almost no one has ever used it. Macca did a demo of it recently, but I think the latest versions of the Prop loader tool removed the support for overlay code, and it's not accessible through SimpleIDE - you need to hand code the command lines to make it go. It'd be nice to see overlays "officially supported", though probably less of an issue with more memory and fast compressed instruction decoding.

cgracey · 2017-04-07 00:24

ersmith wrote: »

Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...

There's no time for any math, only mux'ing, for the field substitution. The field substitution permits register indexing.

Substituting values is more relaxed on timing, but I need to figure out what would be useful.

cgracey · 2017-04-07 00:25

ersmith wrote: »

cgracey wrote: »

Ersmith,

How many concurrent basereg's do you think you need? Would PA and PB be sufficient, or might you need several?

The compiler would generally use arbitrary registers. If we can only use a limited set then we can stick with PTRA and PTRB (rather than PA and PB) but the more special cases there are the more tricky optimization becomes.

Would it be a problem if CALLD (the "LINK" instruction) only worked with PA/PB?

cgracey · 2017-04-07 00:40

ozpropdev wrote: »
What about an AUG instruction that extends the 5-bit N field in PTRA/B hub instructions?
augn #imm_offset
wrlong data,PTRA[imm_offset]
This way you get the benefits of pre/post indexing too.

I think this is the best idea. It's only a few more adders to add this and, like you said, you have the benefit of pre-/post-usage and update/no-update. This would do it all.

By the way, it could just work off the AUGS instruction. If AUGS is in play during RDxxxx/WRxxxx, then it is used to augment the normally-5-bit offset field. This could still allow for 20-bit immediate addressing without PTRA/PTRB by making the PTRA/PTRB use kick in only when bit 31 of the augmented S is set.
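
A rough Python model of the decode rule described above, offered as a sketch rather than the actual hardware logic. It assumes AUGS supplies the upper 23 bits of a 32-bit S value and the instruction's own 9-bit S field supplies the rest. The function name and the returned tuples are made up for illustration, and the bits that would pick PTRA versus PTRB or the update mode are left out because the post does not specify them.

```python
def decode_augmented_s(augs23, s9):
    """Hypothetical model of the proposed RDxxxx/WRxxxx rule."""
    # Assumption: AUGS provides bits 31..9, the instruction's S field bits 8..0.
    s32 = ((augs23 & 0x7FFFFF) << 9) | (s9 & 0x1FF)
    low20 = s32 & 0xFFFFF
    if s32 & 0x80000000:
        # Bit 31 set: the low 20 bits act as an offset from PTRA/PTRB.
        return ("ptr_offset", low20)
    # Bit 31 clear: plain 20-bit immediate hub address, as AUGS works today.
    return ("absolute", low20)
```

With bit 31 clear, existing code that uses AUGS for absolute addresses would decode the same way as before, which is the point of keying the new behavior off that one bit.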

jmg · 2017-04-07 00:52

JasonDorie wrote: »

jmg wrote: »

There is nothing technically stopping more than one COG from reaching into XIP memory, except that would thrash the bus more, and increase latencies. In some cases, that could be tolerable.

That would need some kind of mutex on the bus though, right? You couldn't do it concurrently in multiple cogs I assume?

Yes, hence my comment on increased latencies.
Instead of one COG having control, each would need to fetch small packets, then release the bus for possible other use.
Keeping delays under control gets challenging, but given the way P2 shares pins, there is no HW prevention.

JasonDorie wrote: »

That would also require compiler, linker, and loader support. This functionality is possible with P1 using PropGCC to compile overlays into eeprom, but almost no one has ever used it. Macca did a demo of it recently, but I think the latest versions of the Prop loader tool removed the support for overlay code, and it's not accessible through SimpleIDE - you need to hand code the command lines to make it go.

Yes, that did seem a good idea, given the caveats of P1, but I guess things slowing down use of this are the finite EEPROM size and somewhat slow i2c EEPROM speed. (far slower than XIP memory)

JasonDorie wrote: »

It'd be nice to see overlays "officially supported", though probably less of an issue with more memory and fast compressed instruction decoding.

My thinking is users will always run out of code space, eventually.
P2 is larger, but can do far more too. (I see other MCUs shipping now that offer 2MB Flash and 1MB SRAM.)

The question is what happens then, and how much of a hard ceiling is that?

ersmith · 2017-04-07 01:09

cgracey wrote: »

ersmith wrote: »

Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...

There's no time for any math, only mux'ing, for the field substitution. The field substitution permits register indexing.

Substituting values is more relaxed on timing, but I need to figure out what would be useful.

You know, having the full 20 bits is overkill -- this is for offsetting into a structure, and so it doesn't really need to span the whole address space. I think a lot of processors allow 12 bit signed offsets. 9 bits would be better than nothing: it would allow us to reduce the 3 instruction sequence above down to 2 (1 + an augment), and would be an improvement over the 5 bit offset available in PTRA/PTRB.

So I think my first choice would be an ALTS like instruction that replaced the source value (rather than source field). But an AUG that expands the range of offsets for PTRA/PTRB would be an OK second choice.

All of this is conditional on it not being too expensive either in gate counts or in your time!

Eric

cgracey · 2017-04-07 01:14

ersmith wrote: »

cgracey wrote: »

ersmith wrote: »

Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...

There's no time for any math, only mux'ing, for the field substitution. The field substitution permits register indexing.

Substituting values is more relaxed on timing, but I need to figure out what would be useful.

You know, having the full 20 bits is overkill -- this is for offsetting into a structure, and so it doesn't really need to span the whole address space. I think a lot of processors allow 12 bit signed offsets. 9 bits would be better than nothing: it would allow us to reduce the 3 instruction sequence above down to 2 (1 + an augment), and would be an improvement over the 5 bit offset available in PTRA/PTRB.

So I think my first choice would be an ALTS like instruction that replaced the source value (rather than source field). But an AUG that expands the range of offsets for PTRA/PTRB would be an OK second choice.

All of this is conditional on it not being too expensive either in gate counts or in your time!

Eric

Making AUGS behave differently with RDxxxx/WRxxxx is almost free. There is no new instruction in that case, only a change in behavior. The thing about any 20-bit immediate instruction is that it's going to have, at most, two bits available for register selection. If we can just say PTRA/PTRB is good enough, then it's really simple to make this happen.

cgracey · 2017-04-07 01:17

Ersmith,

Question:

Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?

potatohead · 2017-04-07 01:26

Shouldn't it be? Indexed data or code may not be aligned.

cgracey · 2017-04-07 01:29

potatohead wrote: »

Shouldn't it be? Indexed data or code may not be aligned.

Right, as long as we've got 20 bits, I see no need for scaling, as all it might do is make 3/4 of the memory unreachable (RDLONG). If the compiler knows what the scaled value should be, it knows the unscaled value, and it can just use that.
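
The "3/4 unreachable" figure can be checked with a small Python sketch. It assumes a 20-bit index and a 20-bit (1 MB) hub address range, as in the discussion; `reachable_fraction` is a hypothetical helper, not anything from the P2 tools.

```python
ADDRESS_BITS = 20
HUB_SPACE = 1 << ADDRESS_BITS  # byte addresses covered by a 20-bit pointer

def reachable_fraction(size_bytes, scaled):
    """Fraction of byte addresses a 20-bit index can express as an offset."""
    step = size_bytes if scaled else 1
    # First offset beyond what the 20-bit index can encode.
    max_offset = (1 << ADDRESS_BITS) * step
    # Offsets that still fall inside the hub address range.
    in_space = range(0, min(max_offset, HUB_SPACE), step)
    return len(in_space) / HUB_SPACE

long_scaled = reachable_fraction(4, scaled=True)    # 0.25
long_direct = reachable_fraction(4, scaled=False)   # 1.0
```

Scaling by 4 leaves only every fourth byte offset expressible, so a long at an unaligned offset inside a packed structure could not be addressed; the direct form covers every byte.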

ersmith · 2017-04-07 01:31

cgracey wrote: »

Question:

Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?

Direct without scaling would actually be easier for the compiler to handle. The scaling is nice for PASM programmers, but for the machine it's one more thing it has to adjust for. Also, for packed structures it would be nice to be able to access unaligned fields this way.

Eric

cgracey · 2017-04-07 01:33

ersmith wrote: »

cgracey wrote: »

Question:

Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?

Direct without scaling would actually be easier for the compiler to handle. The scaling is nice for PASM programmers, but for the machine it's one more thing it has to adjust for. Also, for packed structures it would be nice to be able to access unaligned fields this way.

Eric

All right. I'll get this change made, test it, and recompile the FPGA images.

potatohead · 2017-04-07 01:49

Nice!

This time is super high value. Lean and mean tools to make a lot of people happy.

Dave Hein · 2017-04-07 02:40

I haven't followed the details of the P2 for about a year, and I've forgotten much of what I knew about it from a year ago. So please bear with me on some possibly silly questions. Most of the time C code will be executing from HUB memory. It will use the CALLD instruction to call functions. This will result in using the long version of CALLD, which only allows using either PA, PB, PTRA or PTRB as the link register. PA/PB are cog RAM locations, and PTRA/PTRB are registers, correct? Are there any limitations on using PTRA/PTRB for the link register? That is, do any instructions access the shadow memory instead, which would not contain the same values as the registers?

It seems like a reasonable thing to do for C is to use PA for the link register and PTRA for the stack pointer. Does that make sense? For C the stack starts at the high end of memory, and a pre-decrement is done before writing to the stack. After a read is done a post-increment is performed. Is this consistent with the way the P2 handles the pointers?

What are CALLPA/CALLPB? They're new since I've last kept up with the P2. It appears that they put the return address on the hardware stack and also PA/PB. What's the purpose for doing that?

ozpropdev · 2017-04-07 03:28

@Dave
They are all cog locations

1F6	PA  (used to be called ADRA)
1F7	PB  (used to be called ADRB)
1F8	PTRA	
1F9	PTRB

and

CALLPA  {#}D,{#}S  'Call to S** by pushing {C, Z, PC[19:0]} onto stack, copy D to PA.
CALLPB  {#}D,{#}S  'Call to S** by pushing {C, Z, PC[19:0]} onto stack, copy D to PB.

cgracey · 2017-04-07 03:33

The purpose of CALLPA/CALLPB is to pass D as a parameter to PA/PB, while calling S.

cgracey · 2017-04-07 03:42

Dave, all your assumptions are correct. There are no shadow register problems with PTRA/PTRB, by the way.

Dave Hein · 2017-04-07 03:49

I understand that PA, PB, PTRA and PTRB are mapped to cog locations, but I was curious about how they were implemented. It appears that PA and PB are physically in cog RAM, but PTRA and PTRB are implemented as registers. That implies that there may be cog RAM also at the same addresses as PTRA and PTRB. When an instruction is fetched from $1F8 does it come from cog RAM or from the register that implements PTRA? There doesn't appear to be any way to write to cog RAM at $1f8, unless something like "mov PTRA, #5" puts a 5 in both the cog RAM and the PTRA register.

Chip, thanks for the clarification on CALLPA/CALLPB. Of course the documentation states exactly what you said. I just didn't read it correctly.

cgracey · 2017-04-07 03:53

For D and S register reading, PTRA/PTRB data come from the PTRA/PTRB hardware registers. For execution, cog RAM $1F8 would be read.

I believe writes to $1F8/$1F9 do write the RAM. I'll check soon.

ozpropdev · 2017-04-07 06:02

cgracey wrote: »

For D and S register reading, PTRA/PTRB data come from the PTRA/PTRB hardware registers. For execution, cog RAM $1F8 would be read.

I believe writes to $1F8/$1F9 do write the RAM. I'll check soon.

Curiosity got the better of me so I tried this

dat		org

		mov	ptra,inst	'load instruction into ram + ptra

		rep	@.loop,#123	'modify ptra register
		rdbyte	0,ptra++
.loop
		call	#ptra		'execute shadow ram
		drvh	#32
		jmp	#$

inst	_ret_	drvh	#33		'executes ok

Even after indexing the PTRA register the original $1F8 RAM contents persist.
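
The behavior this test shows can be summarized with a toy Python model. It is only a sketch of what the posts above describe: the class and method names are invented, register width is ignored, and the "writes land in both places" rule follows Chip's stated belief plus the test result, not a hardware reference.

```python
class PtraModel:
    """Toy model of PTRA at cog address $1F8 (hypothetical names)."""

    def __init__(self):
        self.cog_ram_1f8 = 0  # RAM cell behind address $1F8
        self.ptra = 0         # hardware PTRA register

    def mov_ptra(self, value):
        # A normal write to $1F8 lands in both the RAM cell and the register.
        self.cog_ram_1f8 = value
        self.ptra = value

    def rdbyte_ptra_postinc(self):
        # PTRA++ indexing updates only the hardware register.
        self.ptra += 1

    def read_as_operand(self):
        # D/S reads of PTRA come from the hardware register.
        return self.ptra

    def fetch_for_execution(self):
        # Executing from $1F8 reads the RAM cell.
        return self.cog_ram_1f8


model = PtraModel()
model.mov_ptra(0x100)
for _ in range(123):
    model.rdbyte_ptra_postinc()
```

After the loop, `model.read_as_operand()` has moved on by 123 while `model.fetch_for_execution()` still returns the value written by `mov_ptra`, mirroring the PASM test where the instruction at $1F8 still executes correctly.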

Dave Hein · 2017-04-07 13:33

Thanks for confirming this. I actually knew this stuff a year or two ago, but my memory had faded on these details. And things have changed a bit since then.

Bill Henning · 2017-04-08 00:36

20 bit unscaled offset will be great for both compilers and pasm2 code

cgracey wrote: »

ersmith wrote: »

cgracey wrote: »

Question:

Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?

Direct without scaling would actually be easier for the compiler to handle. The scaling is nice for PASM programmers, but for the machine it's one more thing it has to adjust for. Also, for packed structures it would be nice to be able to access unaligned fields this way.

Eric

All right. I'll get this change made, test it, and recompile the FPGA images.
