Fast Bytecode Interpreter - Page 18 — Parallax Forums

Comments

  • cgracey Posts: 14,134
    Ersmith,

    If PA/PB/PTRA/PTRB could be your basereg, then maybe we could make an instruction which adds a 20-bit immediate, but rather than modify PA/PB/PTRA/PTRB, the sum appears as S in the next instruction.
    	LOCOS	PA,#imm20offset
    	WRLONG	data,0
    
  • cgracey Posts: 14,134
    Ersmith,

    How many concurrent basereg's do you think you need? Would PA and PB be sufficient, or might you need several?
  • Ariba Posts: 2,690
    edited 2017-04-06 23:49
    Load and Store with base register + #offset can be done with ALTS:
    	alts	baseReg,#offset9
    	rdlong	dataReg,0-0
    
    	alts	baseReg,#offset9
    	wrbyte	dataReg,0-0
    
    	alts	baseReg,##offset32    '2 instr
    	rdword	dataReg,0-0
    

    Andy

  • What about an AUG instruction that extends the 5-bit N field in PTRA/B hub instructions?
    augn #imm_offset
    wrlong data,PTRA[imm_offset]
    
    This way you get the benefits of pre/post indexing too.
  • cgracey Posts: 14,134
    Ariba wrote: »
    Load and Store with base register + #offset can be done with ALTS:
    	alts	baseReg,#offset9
    	rdlong	dataReg,0-0
    
    	alts	baseReg,#offset9
    	wrbyte	dataReg,0-0
    
    	alts	baseReg,##offset32    '2 instr
    	rdword	dataReg,0-0
    

    Andy

    All those ALTxx instructions do is modify fields in the next instruction, not final S or D values. So, getting 20-bit alterations requires something more.
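A minimal sketch in C of the distinction drawn above, assuming ALTS replaces the 9-bit S field of the following instruction with the low 9 bits of (base + offset). The function name `alts_patch` and the field mask are illustrative, not taken from Parallax documentation. Because the result is only a 9-bit register number, it can pick a cog register but cannot carry a 20-bit hub address.

```c
#include <assert.h>
#include <stdint.h>

#define S_FIELD_MASK 0x1FFu  /* low 9 bits of an instruction: the S field */

/* Hypothetical model: patch the S field of the next instruction so it
   names register (base + offset) mod 512. All other bits are untouched. */
uint32_t alts_patch(uint32_t next_instr, uint32_t base, uint32_t offset)
{
    uint32_t new_s = (base + offset) & S_FIELD_MASK;
    return (next_instr & ~S_FIELD_MASK) | new_s;
}
```

For example, patching with base $1F0 and offset $20 wraps to register $010, which shows that the substituted field is a register index rather than a final S value.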
  • cgracey Posts: 14,134
    ozpropdev wrote: »
    What about an AUG instruction that extends the 5-bit N field in PTRA/B hub instructions?
    augn #imm_offset
    wrlong data,PTRA[imm_offset]
    
    This way you get the benefits of pre/post indexing too.

    Hey, that could be almost for free, but would preclude immediate absolute addressing using AUGS, as happens now. It's something to consider, though.

    I was thinking we could just change the encoding of CALLD to eke out another instruction which has a 20-bit immediate.

    Currently:
    EEEE 11100WW RAA AAAAAAAAA AAAAAAAAA	CALLD   PA/PB/PTRA/PTRB,#A
    

    ...could become:
    EEEE 111000W RAA AAAAAAAAA AAAAAAAAA	CALLD   PA/PB,#A
    EEEE 111001W WAA AAAAAAAAA AAAAAAAAA	LOCOS   PA/PB/PTRA/PTRB,#A
    

    This would limit CALLD to just PA and PB, but I don't really see why you'd want to jump to where PTRA/PTRB points.

    This LOCOS instruction would sum PA/PB/PTRA/PTRB with a 20-bit immediate offset and put it into the next instruction's S value, which would be used by the subsequent RDxxxx/WRxxxx instruction.
  • cgracey wrote: »
    Ersmith,

    How many concurrent basereg's do you think you need? Would PA and PB be sufficient, or might you need several?

    The compiler would generally use arbitrary registers. If we can only use a limited set then we can stick with PTRA and PTRB (rather than PA and PB) but the more special cases there are the more tricky optimization becomes.
  • Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...
  • jmg wrote: »
    There is nothing technically stopping more than one COG from reaching into XIP memory, except that would thrash the bus more, and increase latencies. In some cases, that could be tolerable.

    That would need some kind of mutex on the bus though, right? You couldn't do it concurrently in multiple cogs I assume?

    jmg wrote: »
    Users can, and will, pull out 'driver level like' functions to run on other COGS. Some of those may creep into HUB space, but the general nature is they would be loaded and run without being moved.

    XIP allows the 'main code' to place most often used functions in HUB, and then call into further memory, where the usual size-speed trade-offs apply. Code like POST, or Setup menus, are some examples of not-used-often, but still required to be available code.

    That would also require compiler, linker, and loader support. This functionality is possible with P1 using PropGCC to compile overlays into eeprom, but almost no one has ever used it. Macca did a demo of it recently, but I think the latest versions of the Prop loader tool removed the support for overlay code, and it's not accessible through SimpleIDE - you need to hand code the command lines to make it go. It'd be nice to see overlays "officially supported", though probably less of an issue with more memory and fast compressed instruction decoding.
  • cgracey Posts: 14,134
    ersmith wrote: »
    Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...

    There's no time for any math, only mux'ing, for the field substitution. The field substitution permits register indexing.

    Substituting values is more relaxed on timing, but I need to figure out what would be useful.
  • cgracey Posts: 14,134
    ersmith wrote: »
    cgracey wrote: »
    Ersmith,

    How many concurrent basereg's do you think you need? Would PA and PB be sufficient, or might you need several?

    The compiler would generally use arbitrary registers. If we can only use a limited set then we can stick with PTRA and PTRB (rather than PA and PB) but the more special cases there are the more tricky optimization becomes.

    Would it be a problem if CALLD (the "LINK" instruction) only worked with PA/PB?
  • cgracey Posts: 14,134
    edited 2017-04-07 00:44
    ozpropdev wrote: »
    What about an AUG instruction that extends the 5-bit N field in PTRA/B hub instructions?
    augn #imm_offset
    wrlong data,PTRA[imm_offset]
    
    This way you get the benefits of pre/post indexing too.

    I think this is the best idea. It's only a few more adders to add this and, like you said, you have the benefit of pre-/post-usage and update/no-update. This would do it all.

    By the way, it could just work off the AUGS instruction. If AUGS is in play during RDxxxx/WRxxxx, then it is used to augment the normally-5-bit offset field. This could still allow for 20-bit immediate addressing without PTRA/PTRB by making the PTRA/PTRB use kick in only when bit 31 of the augmented S is set.
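The bit-31 rule proposed above can be sketched in C as follows. This is only a model of the idea as described in the post: `resolve_hub_address` is a hypothetical name, the caller supplies whichever pointer (PTRA or PTRB) is in use, and the pre/post and update/no-update options are left out because the thread does not say how they would be encoded.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_MASK    0xFFFFFu      /* 20-bit hub address */
#define PTR_MODE_BIT 0x80000000u   /* bit 31 of the augmented S, per the post */

/* Hypothetical model: with AUGS in play, S is a full 32-bit value.
   Bit 31 set   -> pointer-relative: ptr plus a 20-bit offset.
   Bit 31 clear -> absolute 20-bit immediate address, as before. */
uint32_t resolve_hub_address(uint32_t augmented_s, uint32_t ptr)
{
    uint32_t low20 = augmented_s & ADDR_MASK;
    if (augmented_s & PTR_MODE_BIT) {
        return (ptr + low20) & ADDR_MASK;   /* wraps within 20 bits */
    }
    return low20;
}
```

Since the sum wraps at 20 bits, an offset of $FFFFC behaves like -4 in this model, so negative offsets need no special handling.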
  • jmg Posts: 15,171
    JasonDorie wrote: »
    jmg wrote: »
    There is nothing technically stopping more than one COG from reaching into XIP memory, except that would thrash the bus more, and increase latencies. In some cases, that could be tolerable.

    That would need some kind of mutex on the bus though, right? You couldn't do it concurrently in multiple cogs I assume?

    Yes, hence my comment on increased latencies.
    Instead of one COG having control, each would need to fetch small packets, then release the bus for possible other use.
    Keeping delays under control gets challenging, but given the way P2 shares pins, there is no hardware prevention.
    JasonDorie wrote: »
    That would also require compiler, linker, and loader support. This functionality is possible with P1 using PropGCC to compile overlays into eeprom, but almost no one has ever used it. Macca did a demo of it recently, but I think the latest versions of the Prop loader tool removed the support for overlay code, and it's not accessible through SimpleIDE - you need to hand code the command lines to make it go.
    Yes, that did seem a good idea, given the caveats of P1, but I guess things slowing down use of this are the finite EEPROM size and somewhat slow i2c EEPROM speed. (far slower than XIP memory)
    JasonDorie wrote: »
    It'd be nice to see overlays "officially supported", though probably less of an issue with more memory and fast compressed instruction decoding.
    My thinking is users will always run out of code space, eventually.
    P2 is larger, but can do far more too. (I see other MCUs shipping now that offer 2MB Flash and 1MB SRAM.)

    The question is what happens then, and how much of a hard ceiling is that?
  • cgracey wrote: »
    ersmith wrote: »
    Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...

    There's no time for any math, only mux'ing, for the field substitution. The field substitution permits register indexing.

    Substituting values is more relaxed on timing, but I need to figure out what would be useful.

    You know, having the full 20 bits is overkill -- this is for offsetting into a structure, and so it doesn't really need to span the whole address space. I think a lot of processors allow 12 bit signed offsets. 9 bits would be better than nothing: it would allow us to reduce the 3 instruction sequence above down to 2 (1 + an augment), and would be an improvement over the 5 bit offset available in PTRA/PTRB.

    So I think my first choice would be an ALTS like instruction that replaced the source value (rather than source field). But an AUG that expands the range of offsets for PTRA/PTRB would be an OK second choice.

    All of this is conditional on it not being too expensive either in gate counts or in your time!

    Eric
  • cgracey Posts: 14,134
    ersmith wrote: »
    cgracey wrote: »
    ersmith wrote: »
    Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...

    There's no time for any math, only mux'ing, for the field substitution. The field substitution permits register indexing.

    Substituting values is more relaxed on timing, but I need to figure out what would be useful.

    You know, having the full 20 bits is overkill -- this is for offsetting into a structure, and so it doesn't really need to span the whole address space. I think a lot of processors allow 12 bit signed offsets. 9 bits would be better than nothing: it would allow us to reduce the 3 instruction sequence above down to 2 (1 + an augment), and would be an improvement over the 5 bit offset available in PTRA/PTRB.

    So I think my first choice would be an ALTS like instruction that replaced the source value (rather than source field). But an AUG that expands the range of offsets for PTRA/PTRB would be an OK second choice.

    All of this is conditional on it not being too expensive either in gate counts or in your time!

    Eric

    Making AUGS behave differently with RDxxxx/WRxxxx is almost free. There is no new instruction in that case, only a change in behavior. The thing about any 20-bit immediate instruction is that it's going to have, at most, two bits available for register selection. If we can just say PTRA/PTRB is good enough, then it's really simple to make this happen.
  • cgracey Posts: 14,134
    edited 2017-04-07 01:19
    Ersmith,

    Question:

    Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?
  • Shouldn't it be? Indexed data or code may not be aligned.
  • cgracey Posts: 14,134
    edited 2017-04-07 01:30
    potatohead wrote: »
    Shouldn't it be? Indexed data or code may not be aligned.

    Right, as long as we've got 20 bits, I see no need for scaling, as all it might do is make 3/4 of the memory unreachable (RDLONG). If the compiler knows what the scaled value should be, it knows the unscaled value, and it can just use that.
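To illustrate the "3/4 unreachable" point, here is a rough C sketch that counts how many distinct byte addresses a 20-bit index can reach when it is used directly versus when it is scaled by the access size. It assumes the effective address is base plus the (possibly shifted) index, wrapped to 20 bits; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HUB_ADDR_SPACE (1u << 20)             /* 1 MB of byte addresses */
#define HUB_ADDR_MASK  (HUB_ADDR_SPACE - 1u)

/* shift = 0 models an unscaled index; shift = 1 or 2 models word or
   long scaling. The sum wraps within the 20-bit address space. */
uint32_t effective_address(uint32_t base, uint32_t index, unsigned shift)
{
    assert(shift <= 2u);
    return (base + (index << shift)) & HUB_ADDR_MASK;
}

/* Count distinct addresses reachable from base over every 20-bit index.
   Returns 0 if the scratch bitmap cannot be allocated. */
uint32_t reachable_addresses(uint32_t base, unsigned shift)
{
    uint8_t *seen = (uint8_t *)calloc(HUB_ADDR_SPACE, 1);
    uint32_t count = 0;
    if (seen == NULL) {
        return 0;
    }
    for (uint32_t index = 0; index < HUB_ADDR_SPACE; index++) {
        uint32_t addr = effective_address(base, index, shift);
        if (!seen[addr]) {
            seen[addr] = 1;
            count++;
        }
    }
    free(seen);
    return count;
}
```

With shift 0 every one of the 1,048,576 byte addresses is reachable; with shift 2 only every fourth address relative to the base is, which is a quarter of the space.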
  • cgracey wrote: »
    Question:

    Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?

    Direct without scaling would actually be easier for the compiler to handle. The scaling is nice for PASM programmers, but for the machine it's one more thing it has to adjust for. Also, for packed structures it would be nice to be able to access unaligned fields this way

    Eric
  • cgracey Posts: 14,134
    ersmith wrote: »
    cgracey wrote: »
    Question:

    Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?

    Direct without scaling would actually be easier for the compiler to handle. The scaling is nice for PASM programmers, but for the machine it's one more thing it has to adjust for. Also, for packed structures it would be nice to be able to access unaligned fields this way

    Eric

    All right. I'll get this change made, test it, and recompile the FPGA images.
  • Nice!

    This time is super high value. Lean and mean tools to make a lot of people happy.

  • I haven't followed the details of the P2 for about a year, and I've forgotten much of what I knew about it from a year ago. So please bear with me on some possibly silly questions. Most of the time C code will be executing from HUB memory. It will use the CALLD instruction to call functions. This will result in using the long version of CALLD, which only allows using either PA, PB, PTRA or PTRB as the link register. PA/PB are cog RAM locations, and PTRA/PTRB are registers, correct? Are there any limitations on using PTRA/PTRB for the link register? That is, do any instructions access the shadow memory instead, which would not contain the same values as the registers?

    It seems like a reasonable thing to do for C is to use PA for the link register and PTRA for the stack pointer. Does that make sense? For C the stack starts at the high end of memory, and a pre-decrement is done before writing to the stack. After a read is done a post-increment is performed. Is this consistent with the way the P2 handles the pointers?

    What are CALLPA/CALLPB? They're new since I've last kept up with the P2. It appears that they put the return address on the hardware stack and also PA/PB. What's the purpose for doing that?
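The C stack convention described in this post (start at the high end, pre-decrement before a write, post-increment after a read) can be sketched as below. The array stands in for hub memory and `sp` plays the role the post suggests for PTRA; the type and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_LONGS 16u

typedef struct {
    uint32_t mem[STACK_LONGS];  /* stand-in for hub RAM, one slot per long */
    uint32_t sp;                /* top of stack; one past the end when empty */
} Stack;

/* An empty descending stack points just past the high end of memory. */
void stack_init(Stack *s)
{
    s->sp = STACK_LONGS;
}

/* Push: pre-decrement the pointer, then write. */
void stack_push(Stack *s, uint32_t value)
{
    assert(s->sp > 0u);
    s->sp--;
    s->mem[s->sp] = value;
}

/* Pop: read, then post-increment the pointer. */
uint32_t stack_pop(Stack *s)
{
    assert(s->sp < STACK_LONGS);
    uint32_t value = s->mem[s->sp];
    s->sp++;
    return value;
}
```

Pushing 11 and then 22 leaves the pointer two slots below the top; popping returns 22 and then 11 and restores the pointer to its starting value.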
  • @Dave
    They are all cog locations
    1F6	PA  (used to be called ADRA)
    1F7	PB  (used to be called ADRB)
    1F8	PTRA	
    1F9	PTRB	
    

    and
    CALLPA  {#}D,{#}S  'Call to S** by pushing {C, Z, PC[19:0]} onto stack, copy D to PA.
    CALLPB  {#}D,{#}S  'Call to S** by pushing {C, Z, PC[19:0]} onto stack, copy D to PB.
    
  • cgracey Posts: 14,134
    The purpose of CALLPA/CALLPB is to pass D as a parameter to PA/PB, while calling S.
  • cgracey Posts: 14,134
    Dave, all your assumptions are correct. There are no shadow register problems with PTRA/PTRB, by the way.
  • I understand that PA, PB, PTRA and PTRB are mapped to cog locations, but I was curious about how they were implemented. It appears that PA and PB are physically in cog RAM, but PTRA and PTRB are implemented as registers. That implies that there may be cog RAM also at the same addresses as PTRA and PTRB. When an instruction is fetched from $1F8 does it come from cog RAM or from the register that implements PTRA? There doesn't appear to be any way to write to cog RAM at $1f8, unless something like "mov PTRA, #5" puts a 5 in both the cog RAM and the PTRA register.

    Chip, thanks for the clarification on CALLPA/CALLPB. Of course the documentation states exactly what you said. I just didn't read it correctly.
  • cgracey Posts: 14,134
    edited 2017-04-07 03:58
    For D and S register reading, PTRA/PTRB data come from the PTRA/PTRB hardware registers. For execution, cog RAM $1F8 would be read.

    I believe writes to $1F8/$1F9 do write the RAM. I'll check soon.
  • cgracey wrote: »
    For D and S register reading, PTRA/PTRB data come from the PTRA/PTRB hardware registers. For execution, cog RAM $1F8 would be read.

    I believe writes to $1F8/$1F9 do write the RAM. I'll check soon.

    Curiosity got the better of me so I tried this
    dat		org
    
    		mov	ptra,inst	'load instruction into ram + ptra
    
    		rep	@.loop,#123	'modify ptra register
    		rdbyte	0,ptra++
    .loop
    		call	#ptra		'execute shadow ram
    		drvh	#32
    		jmp	#$
    
    inst	_ret_	drvh	#33		'executes ok
    
    Even after indexing the PTRA register the original $1F8 RAM contents persist.
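A small C model of the behaviour this test demonstrates, assuming (as the posts above suggest) that a write to $1F8 lands in both the PTRA hardware register and the cog RAM behind it, that D/S reads come from the register, and that instruction fetch comes from the RAM. The struct and function names are hypothetical, and the register's real width is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COG_LONGS 512u
#define PTRA_ADDR 0x1F8u

typedef struct {
    uint32_t ram[COG_LONGS];  /* cog RAM, including the shadow long at $1F8 */
    uint32_t ptra;            /* separate PTRA hardware register */
} Cog;

void cog_reset(Cog *c)
{
    memset(c, 0, sizeof *c);
}

/* Register write (e.g. mov ptra,inst): updates RAM and, at $1F8, PTRA too. */
void cog_write(Cog *c, uint32_t addr, uint32_t value)
{
    assert(addr < COG_LONGS);
    c->ram[addr] = value;
    if (addr == PTRA_ADDR) {
        c->ptra = value;
    }
}

/* Pointer update from a hub access such as rdbyte 0,ptra++:
   only the hardware register changes, not the shadow RAM. */
void cog_ptra_advance(Cog *c, uint32_t delta)
{
    c->ptra += delta;
}

/* D/S operand read: $1F8 returns the hardware register. */
uint32_t cog_read_operand(const Cog *c, uint32_t addr)
{
    assert(addr < COG_LONGS);
    return (addr == PTRA_ADDR) ? c->ptra : c->ram[addr];
}

/* Instruction fetch: always comes from cog RAM. */
uint32_t cog_fetch(const Cog *c, uint32_t addr)
{
    assert(addr < COG_LONGS);
    return c->ram[addr];
}
```

After writing an instruction to $1F8 and advancing the pointer 123 times, the operand read sees the incremented value while the fetch still sees the original instruction, matching the result reported above.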




  • Thanks for confirming this. I actually knew this stuff a year or two ago, but my memory had faded on these details. And things have changed a bit since then.
  • 20 bit unscaled offset will be great for both compilers and pasm2 code
    cgracey wrote: »
    ersmith wrote: »
    cgracey wrote: »
    Question:

    Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?

    Direct without scaling would actually be easier for the compiler to handle. The scaling is nice for PASM programmers, but for the machine it's one more thing it has to adjust for. Also, for packed structures it would be nice to be able to access unaligned fields this way

    Eric

    All right. I'll get this change made, test it, and recompile the FPGA images.
