HUB EXEC Update Here

rogloh · 2014-02-12 14:24

When thinking about this more I just realized something. We can probably achieve what I was asking for with another simpler mechanism entirely and not necessarily require proposed LINKTAB changes.

Application code that calls relocatable functions can simply call into a static table; the code generator (or author) computes each static call address as table base + function number. The table acts purely as a "trampoline" that bounces off to the relocated functions themselves. We just need to patch the table with the absolute relocated address when the loader copies the relocatable code from SDRAM into free hub RAM, for example. The caller has already saved the return address for us, but the loader still needs the function index so it knows which function was called and can load it as required. This can be done using "LR" and the LINK instruction that writes it (assuming it works with 16 bit hub addresses). The loader can compute the index of the function being called from LR - TABLE_BASE, and use that index to reference a separate table holding SDRAM source function addresses and lengths, so it knows where to read the function code from during relocation.

E.g. here is a sample table for 6 functions: 4 have already been loaded and relocated to some dynamic hub addresses, and 2 are still (temporarily) pointing to the loader code.

FN1: JMP  #absolute_fn_1  
FN2: JMP  #absolute_fn_2
FN3: LINK #loader  ' fn_3 is not yet loaded, any callers will jump to the loader which will load it in and patch this entry 
FN4: JMP  #absolute_fn_4
FN5: LINK #loader  ' fn_5 is not yet loaded, ditto
FN6: JMP  #absolute_fn_6

To call a relocated application function you can just do standard absolute calls, your calling convention can use CALL, CALLA, CALLX, whatever, so long as the return method matches in the function itself.

CALLA #FN2

Pretty simple, and no need for any extra AUGS/AUGD overhead etc beyond what would be put in for you normally/automatically.

The only limitation I see is that LR itself might not be usable for leaf functions when running in an SDRAM+hubexec model because the loader requires its use. Though we could always try to use the small internal stack and use a CALL #loader instead of LINK #loader instruction in the table above to help remedy this. The loader routine can then pop the value to generate its function index from the calling row in the table itself. So I guess LR is still usable that way, it doesn't have to be clobbered by the trampoline/loader.
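To make the trampoline idea concrete, here is a small Python model (not PASM) of the table and the lazy loader. Everything specific in it is an assumption for illustration: the addresses, the 4-byte entry size, the names `TABLE_BASE`, `Trampoline` and `sources`, and in particular the guess that LINK leaves the address of the *following* table entry in LR, which is why the loader subtracts one entry when computing the index. If the hardware records the address of the LINK itself, the `- 1` goes away.

```python
TABLE_BASE = 0x1000   # hypothetical hub address of the trampoline table
ENTRY_SIZE = 4        # assumed: one long per table entry


class Trampoline:
    """Toy model of a patchable jump table with a load-on-first-call loader."""

    def __init__(self, sources):
        # sources[i] = (sdram_address, length): the separate lookup table
        self.sources = sources
        # Every slot starts out as "LINK #loader" (function not yet loaded).
        self.table = [("LINK", "loader")] * len(sources)
        self.next_free = 0x4000   # hypothetical start of free hub RAM
        self.loads = 0            # how many times the loader actually ran

    def call(self, fn_index):
        """Model of 'CALLA #FNn': returns the hub address that ends up running."""
        entry_addr = TABLE_BASE + fn_index * ENTRY_SIZE
        op, target = self.table[fn_index]
        if op == "LINK":
            # Assumed LINK behaviour: LR = address of the next entry.
            lr = entry_addr + ENTRY_SIZE
            target = self.loader(lr)
        return target

    def loader(self, lr):
        # Recover which slot was hit from LR alone.
        index = (lr - TABLE_BASE) // ENTRY_SIZE - 1
        sdram_addr, length = self.sources[index]
        hub_addr = self.next_free          # pretend we copied `length` bytes here
        self.next_free += length
        self.loads += 1
        self.table[index] = ("JMP", hub_addr)   # patch the trampoline slot
        return hub_addr
```

Calling the same slot twice only runs the loader once; the second call goes straight through the patched `JMP`, which is the whole point of the scheme.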

Roger.

cgracey · 2014-02-12 15:31

Baggers wrote: »

Awesome progress Chip, you're on a roll lately

I got all the WRWIDE masking implemented and it grew things by quite a bit. I backed out the WIDEBM/WM/LM and things shrunk way down. The problem with those last three instructions is that they take tons of gates to evaluate 32 bytes for non-0 values in parallel. I left SETMASK in to be able to control the write mask, but that's it. So, it's not everything you asked for, but it still allows for gated byte writes which are important. You can always do a RDWIDE to set up the background, modify the WIDES, and then write them all back with WRWIDE, without much time penalty.

cgracey · 2014-02-12 15:35

Cluso99 wrote: »

Yes Bill, scaled by 4 because the table is longs, and yes the offset is from the table start.

This should work whether the table is in hub or cog, so if D is < $200, no scaling of S. Chip is this possible to do?

The addresses for jump purposes will always be words.

I think what you guys are asking for is simply a link instruction that jumps to a base+offset, with the base being either relative or absolute. Is this the case?

Cluso99 · 2014-02-12 15:38

rogloh (Roger): Sorry about the spelling. Not only is it eyesight, when posting from my xoom I cannot use "reply with quote" and I cannot fix errors unless I totally backspace deleting everything. My iPad Mini Retina works nicely but my wife has absconded with it.

For an object that holds a jump table at the beginning, the jmps will be relative. Once loaded, the only thing required is to set the loaded base address to the absolute location in a register so that D points to the register for all LINKTABs. What would be nice is for the S value to be scaled (either as is = no scale if D stores a cog base address, or <<2 if D stores a hub base address). Therefore, if D < $200 then the address is in cog, otherwise hub (where the address is in bytes).
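The scaling rule proposed here can be written down as a one-liner. This is only a Python sketch of the proposal as described in this post, not of any implemented instruction; `COG_LIMIT` and `table_entry_address` are names made up for the example.

```python
COG_LIMIT = 0x200  # bases below this are treated as cog (long) addresses


def table_entry_address(base, index):
    """Address of table entry `index` under the proposed scaling rule."""
    if base < COG_LIMIT:
        # Cog addresses count longs, so the index is used unscaled.
        return base + index
    # Hub addresses count bytes, so a table of longs needs index << 2.
    return base + (index << 2)
```

For instance, entry 3 of a cog table at $100 sits at $103, while entry 3 of a hub table at $2000 sits at $200C.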

There will be many uses for LINKTAB. An example is in the spin interpreter (prop1) where I coded in a set of jmp vectors. The bytecode is automatically decoded by using the bytecode as the S (offset) value into the jump table.

Bill Henning · 2014-02-12 15:41

Yes, and the offsets being stored as an array of words (it would be wasteful to use a long for the offsets)

cgracey wrote: »

The addresses for jump purposes will always be words.

I think what you guys are asking for is simply a link instruction that jumps to a base+offset, with the base being either relative or absolute. Is this the case?

Cluso99 · 2014-02-12 15:49

cgracey wrote: »

The addresses for jump purposes will always be words.

I think what you guys are asking for is simply a link instruction that jumps to a base+offset, with the base being either relative or absolute. Is this the case?

Yes. If both are not possible, then absolute seems to be the best case, and we can add to D ourselves to make it relative.

But Bill wanted the result to be fetched from a word list and then jumped to. From what I understand, the 4-stage pipeline cannot perform this additional fetch.

Chip, a question...
Do the AUGD/S instructions take a lot of silicon/time?
Reason is I was wondering if you could simply implement a SCALES #n (n=0/1/2 and maybe 3/4/5) in a similar vein, where the S value in the following instruction is shifted left n positions.
Often we need to scale a value for pairs/nibbles/bytes/words/longs/etc and then do some maths on it.
SCALES #2
ADD D,S 'where S needs to be shifted left (scaled) before adding to D, but we need to keep the original S value.
Currently I am unsure how much this happens, but my memory suggests I do this often.
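The intended effect of the SCALES/ADD pair can be modelled in a couple of lines of Python. SCALES is only a suggestion in this post, so this is a sketch of the idea rather than of real behaviour; the function name `add_scaled` and the 32-bit wraparound are assumptions.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit registers


def add_scaled(d, s, n):
    """Model of 'SCALES #n' followed by 'ADD D,S'.

    Returns the new D value. S is only read, never written, so the
    caller's original S survives, which is the point of the proposal.
    """
    return (d + (s << n)) & MASK32
```

With `d = $100`, `s = 5` and `n = 2` this gives `$114`, the address of the sixth long in a table at `$100`, and `s` is still 5 afterwards.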

Cluso99 · 2014-02-12 16:02

Chip & All,
A while back a few of us discussed the SETX instruction that sets the ICCCC bits in an instruction.
I think that SETCCCC (or SETCOND) might be more useful to only set the CCCC bits in an instruction (ie replace SETX). We can always use the SETB to set/clr the I bit of an instruction if required. IMHO there is likely to be way more uses to set the conditional bits of an instruction.
What do you and others think?

rogloh · 2014-02-12 16:24

cgracey wrote: »

The addresses for jump purposes will always be words.

I think what you guys are asking for is simply a link instruction that jumps to a base+offset, with the base being either relative or absolute. Is this the case?

Yes, having both forms gives us the most flexibility, however in my previous post I believe I have now identified a way to achieve what I was originally pushing for without requiring the modified LINKTAB instruction. Unless there are other foreseen benefits to using the absolute form of LINKTAB, I really now don't have good reason to push hard for the change myself but others may like it for other purposes. If so, better chime in now...

Cluso99 · 2014-02-12 16:28

rogloh wrote: »

Yes, having both forms gives us the most flexibility, however in my previous post I believe I have now identified a way to achieve what I was originally pushing for without requiring the modified LINKTAB instruction. Unless there are other foreseen benefits to using the absolute form of LINKTAB, I really now don't have good reason to push hard for the change myself but others may like it for other purposes. If so, better chime in now...

I foresee more use for the absolute version of LINKTAB rather than the relative (current) version, but it's just my opinion.

cgracey · 2014-02-12 17:45

Cluso99 wrote: »

Chip & All,
A while back a few of us discussed the SETX instruction that sets the ICCCC bits in an instruction.
I think that SETCCCC (or SETCOND) might be more useful to only set the CCCC bits in an instruction (ie replace SETX). We can always use the SETB to set/clr the I bit of an instruction if required. IMHO there is likely to be way more uses to set the conditional bits of an instruction.
What do you and others think?

I totally agree. The I bit can be set/cleared discretely, while the CCCC field can be set with SETCOND (was SETX). I'll make that change. Very minor.

cgracey · 2014-02-12 17:51

Cluso99 wrote: »

I foresee more use for the absolute version of LINKTAB rather than the relative (current) version, but it's just my opinion.

LINKTAB uses @S to specify the base, with D providing the variable offset. I think this is very important to have within relocatable programs, as it provides a means of numbered-routine dispatch.

To achieve the effect you guys are talking about, you'd only need to add the absolute address (16 bits) to the relative address (16 bits) with an ADD instruction, and then do whatever JMP/LINK/CALLx you want. We can make a single instruction to jump to base+offset, as either a JMP or LINK, but a call is out of the question because there is no opcode space for all that those would require.

Cluso99 · 2014-02-12 18:19

cgracey wrote: »

LINKTAB uses @S to specify the base, with D providing the variable offset. I think this is very important to have within relocatable programs, as it provides a means of numbered-routine dispatch.

To achieve the effect you guys are talking about, you'd only need to add the absolute address (16 bits) to the relative address (16 bits) with an ADD instruction, and then do whatever JMP/LINK/CALLx you want. We can make a single instruction to jump to base+offset, as either a JMP or LINK, but a call is out of the question because there is no opcode space for all that those would require.

Aha. I now see why you have been using S/@/@@ for the table base and D as the offset. I can see its use when D is a register and its contents are the offset.

I have obviously been thinking differently about the use where the D register would be set once by the program and would be storing the base address of the table, and that S would either be an immediate value (being an index 0..511 into the table) or the S register would contain a value (being an index 0..xxx into the table - also useful for decoding bytecodes). This method allows for the program to simply calculate the absolute address of the table at runtime and set this into D. Once done, any LINKTAB TABLE,#INDEX could be performed without any relocation. This gives the possibility of using a numbered routine approach, as well as bytecode or similar decoding. Unfortunately the addition of PC precludes this form of table use.

Couldn't you use this method for your case too???

Yes, I realise we are out of those precious opcodes. That's why I was looking to see if any of them could have their WZ or WC recovered.

rogloh · 2014-02-12 18:24

cgracey wrote: »

LINKTAB uses @S to specify the base, with D providing the variable offset. I think this is very important to have within relocatable programs, as it provides a means of numbered-routine dispatch.

To achieve the effect you guys are talking about, you'd only need to add the absolute address (16 bits) to the relative address (16 bits) with an ADD instruction, and then do whatever JMP/LINK/CALLx you want. We can make a single instruction to jump to base+offset, as either a JMP or LINK, but a call is out of the question because there is no opcode space for all that those would require.

Agreed, a relative and self-contained LINKTAB within a functional block itself is still very useful in those cases. Also, if one can keep their LINKTAB instructions within 511 longs of the start of the jump table, the table sizes themselves can become very large and allow indexes > 511 in D with no need for the extra AUGS. I like that too. Best keep LINKTAB as is for now, methinks.

If there is instruction room, a separate JMP D,S could be nice where the action just becomes PC = D + S without clobbering LR, but I'm not pushing especially hard for it, and your time and the scarce instruction space may be better used by important remaining features. As you say, we can always do the "ADD D, S" and then a "JMP D". We just lose our D value, but if that needs to get set up as an index each time around, it doesn't matter so much if we clobber it; only one extra instruction is added there.
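The difference between the two approaches is just whether D survives. A tiny Python model, with registers held in a dict, shows it; `jump_base_offset` stands in for the hypothetical JMP D,S discussed here (it does not exist), and the 16-bit mask is an assumption based on the 16-bit addresses mentioned earlier in the thread.

```python
MASK16 = 0xFFFF  # assumed 16-bit jump addresses


def jump_via_add(regs, d, s):
    """'ADD D,S' then 'JMP D': returns the new PC; D is overwritten."""
    regs[d] = (regs[d] + regs[s]) & MASK16
    return regs[d]


def jump_base_offset(regs, d, s):
    """Hypothetical 'JMP D,S': PC = D + S, no register is modified."""
    return (regs[d] + regs[s]) & MASK16
```

Both return the same target; after `jump_via_add` the index register holds the target address instead of the index, so it has to be reloaded before the next dispatch.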

Roger.

Seairth · 2014-02-12 18:44

cgracey wrote: »

I totally agree. The I bit can be set/cleared discretely, while the CCCC field can be set with SETCOND (was SETX). I'll make that change. Very minor.

Taking this just a bit further, what if you were to move the CCCC field to the front of the instruction? Then, SETCOND simply becomes a SETNIB.

Cluso99 · 2014-02-12 18:48

cgracey wrote: »

I totally agree. The I bit can be set/cleared discretely, while the CCCC field can be set with SETCOND (was SETX). I'll make that change. Very minor.

Thanks Chip. I am sure this will be way more useful - easy to disable/enable an instruction by SETCOND instr,#0000 and SETCOND instr,#1111 or SETCOND instr,#cond.

Cluso99 · 2014-02-12 18:55

Seairth wrote: »

Taking this just a bit further, what if you were to move the CCCC field to the front of the instruction? Then, SETCOND simply becomes a SETNIB.

No, please no.

Opcodes are 7, 9, 10, 14 bits and all from the top down. Most are 7 or 9 with the 9 encompassing the Z and C bits. The next set use 10 bits, adding I. It is only the SET/FIXINDx that use 14 bits, adding CCCC.

It is just unfortunate that the cccc bits are not nibble aligned. Moving D and S would make it so much unlike the P1.
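To see why SETNIB cannot stand in for SETCOND, here is a Python sketch of the field update. It assumes the layout implied by the bit counts above (7 opcode bits, then Z, C, I, then CCCC), which puts CCCC at bits 21..18; the names `COND_SHIFT`, `set_cond` and `nibbles_touched` are invented for the example.

```python
COND_SHIFT = 18                 # assumed position of the CCCC field (bits 21..18)
COND_MASK = 0xF << COND_SHIFT


def set_cond(instr, cond):
    """Replace only the CCCC bits of a 32-bit instruction word."""
    return (instr & ~COND_MASK & 0xFFFFFFFF) | ((cond & 0xF) << COND_SHIFT)


def nibbles_touched():
    """Which 4-bit nibbles (0 = lowest) the CCCC field overlaps."""
    return sorted({bit // 4 for bit in range(COND_SHIFT, COND_SHIFT + 4)})
```

Under that assumption the field straddles nibbles 4 and 5, so a single nibble write would either miss half of CCCC or disturb the neighbouring I and D bits.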

cgracey · 2014-02-12 19:07

Seairth wrote: »

Taking this just a bit further, what if you were to move the CCCC field to the front of the instruction? Then, SETCOND simply becomes a SETNIB.

It's too bad those CCCC bits aren't already nibble-aligned. It would be a huge mess to move them, so I think we'll have to keep SETCOND.

ozpropdev · 2014-02-12 19:08

Seairth wrote: »

Taking this just a bit further, what if you were to move the CCCC field to the front of the instruction? Then, SETCOND simply becomes a SETNIB.

Do you mean move CCCC to bits 31..28 of opcode as front?

Seairth · 2014-02-12 19:56

ozpropdev wrote: »

Do you mean move CCCC to bits 31..28 of opcode as front?

Yes, but as Chip points out, it would be a mess to move it now. Ah well. I'm guessing that SETCOND is fairly small, so this wouldn't have saved much silicon anyhow. On the other hand, by moving this out of the middle, it might have also allowed for a more flexible opcode encoding. That might have also saved some silicon. Or not. I'm still trying to understand how these sorts of changes affect synthesis.

cgracey · 2014-02-12 23:25

I have a question for you guys:

Is it really important that 'LINKTAB D,@S' writes the return address to $000?

It seems to me that more often than not, only a jump-table dispatch is needed in cases of LINKTAB, and always having the return address written to $000 might be undesirable. The way LINKTAB works now is especially good for calling a numbered routine from a list, which is a very special case, but for a jump table, it just introduces the possibly troublesome side-effect of writing $000, which may already contain an important link address.

We have very specific LINK #/@/D instructions which can jump to anywhere, by variable (D) or constant (#/@) without needing constant augmentation. These are clean instructions which are used exactly for linking. I kind of hate LINKTAB getting into the mix, with caveat behaviors, like only supporting relative addresses and possibly needing AUGS.

I would like to turn LINKTAB back into JMPLIST. Anyone passionate about this not happening?

Roy Eltham · 2014-02-12 23:55

I agree with you Chip. Turn linktab back into jmplist.

cgracey · 2014-02-13 00:24

Roy Eltham wrote: »

I agree with you Chip. Turn linktab back into jmplist.

Done!

I feel a lot cleaner now.

Sapieha · 2014-02-13 00:44

Hi Chip.

That is exactly what I asked for, but as a CALL, not a jump.

cgracey wrote: »

The addresses for jump purposes will always be words.

I think what you guys are asking for is simply a link instruction that jumps to a base+offset, with the base being either relative or absolute. Is this the case?

Cluso99 · 2014-02-13 03:32

cgracey wrote: »

I have a question for you guys:

Is it really important that 'LINKTAB D,@S' writes the return address to $000?

It seems to me that more often than not, only a jump-table dispatch is needed in cases of LINKTAB, and always having the return address written to $000 might be undesirable. The way LINKTAB works now is especially good for calling a numbered routine from a list, which is a very special case, but for a jump table, it just introduces the possibly troublesome side-effect of writing $000, which may already contain an important link address.

We have very specific LINK #/@/D instructions which can jump to anywhere, by variable (D) or constant (#/@) without needing constant augmentation. These are clean instructions which are used exactly for linking. I kind of hate LINKTAB getting into the mix, with caveat behaviors, like only supporting relative addresses and possibly needing AUGS.

I would like to turn LINKTAB back into JMPLIST. Anyone passionate about this not happening?

To me there is way more use in it being a CALL/LINK than a JMP to a table+offset. I am fairly sure Bill sees it this way too.

For instance, what I did with my version of the spin interpreter was use calls to execute various blocks of code to perform the bytecode. Some of these were common routines used by a number of bytecodes. Although we did not have LINKTAB, I did build a similar pseudo mechanism. Currently I am not sure if these could be performed by the other CALLx instructions. I thought it would be excellent for the JMPLIST to be a LINKTAB (ie a CALL) where the return address is automatically stored in a fixed location. The return is optional (because there is no stack to be popped).

Since you have already implemented this, we can try it out and rediscuss if necessary.

Just re-read the post. I need to think again as maybe the LINK will do the job I was thinking of anyway

Ale · 2014-02-15 04:11

I see that the Propeller 2 now has many more opcodes than originally thought. I just ask myself how useful all these new opcodes are... Are they really worth it? Just asking. Because most of the assembly of many processors is normally ignored by C compilers, they use a very small subset. Do we have any statistics to show how gcc performs and what is used?
With execution from HUB now in place there is not much that separates the P2 from other uCs. What we found was needed with the P1 (LMM) is now here. Did we want an LMM-enabled P1, or did we want a multicore (so to say) ARM/MIPS/SH uC?... maybe just random thoughts... Now we need PLCC84 and we are all set

cgracey · 2014-02-15 04:40

Cluso99 wrote: »

To me there is way more use in it being a CALL/LINK than a JMP to a table+offset. I am fairly sure Bill sees it this way too.

For instance, what I did with my version of the spin interpreter was use calls to execute various blocks of code to perform the bytecode. Some of these were common routines used by a number of bytecodes. Although we did not have LINKTAB, I did build a similar pseudo mechanism. Currently I am not sure if these could be performed by the other CALLx instructions. I thought it would be excellent for the JMPLIST to be a LINKTAB (ie a CALL) where the return address is automatically stored in a fixed location. The return is optional (because there is no stack to be popped).

Since you have already implemented this, we can try it out and rediscuss if necessary.

Just re-read the post. I need to think again as maybe the LINK will do the job I was thinking of anyway

The basic building-block instructions are all there, I think. They can be used in combinations of 2 or 3 to realize all kinds of indexed branches.

cgracey · 2014-02-15 04:49

Ale wrote: »

I see that the Propeller 2 now has many more opcodes than originally thought. I just ask myself how useful all these new opcodes are... Are they really worth it? Just asking. Because most of the assembly of many processors is normally ignored by C compilers, they use a very small subset. Do we have any statistics to show how gcc performs and what is used?
With execution from HUB now in place there is not much that separates the P2 from other uCs. What we found was needed with the P1 (LMM) is now here. Did we want an LMM-enabled P1, or did we want a multicore (so to say) ARM/MIPS/SH uC?... maybe just random thoughts... Now we need PLCC84 and we are all set

C compilers probably won't use 1/5 of the total instructions, but when you're in PASM, you'll be able to do all kinds of stuff.

I wonder if the new architecture, with hub execution, will be less fun to program, because so much more is possible. I think some of us enjoy working with tight constraints because it makes for challenges. No, it will be fun because you can do all kinds of things, and lots of them, at once. You can have a cog acting as a very tight controller, and you can use another as a system-level GUI.

Ahle2 · 2014-02-15 06:51

I have not been very active lately and have not yet caught up on all the new things regarding hubexec, multitasking, etc., so this may be answered elsewhere and/or it may be a stupid question. Is there a way to read out PC+Z+C from a task? (Yes, I do have a reason for asking this.) I guess it would be possible by using trickery that involves such things as calls, AUX, push, pop and some special code running in the task that you want to "monitor".

/Johannes

Bill Henning · 2014-02-15 07:16

Ale wrote: »

I see that the Propeller 2 now has many more opcodes than originally thought. I just ask myself how useful all these new opcodes are...

They are very useful for pasm, and will allow much more to be accomplished - with fewer instructions.

Ale wrote: »

Are they really worth it? Just asking.

YES!!!!!!!!!!

Ale wrote: »

Because most of the assembly of many processors is normally ignored by C compilers, they use a very small subset.

True, mostly because it takes a LOT of work to optimize compilers to be able to make good use of all the instructions.

Ale wrote: »

Do we have any statistics to show how gcc performs and what is used?

Does not exist, and is largely irrelevant - as all the instructions will be very useful for hand crafted assembly code.

Ale wrote: »

With execution from HUB now in place there is not much that separates the P2 from other uCs.

Strongly disagree.

AUX, usable for CLUT, stacks, FIFOs, scratchpad storage, video buffers... nothing on ARMs like it.

Cog mode for totally deterministic timing.

Up to four tasks (hardware threads) within each cog; again, ARM has nothing like it (XMOS does).

Timers, very high speed UARTs and much, much more.

If you program only in C, using only the features C makes available, then a single cog in hubexec is similar to an ARM. But even that cog could run three other tasks in it - unlike ARM.

Once you get into PASM, it is a totally different, and wonderfully powerful, super-powered beast.

Ale wrote: »

What we found was needed with the P1 (LMM) is now here. Did we want an LMM-enabled P1, or did we want a multicore (so to say) ARM/MIPS/SH uC?...

For C use, it is now as good as other processors.

But there is more to the world than C, and in pasm it truly shines.

Ale wrote: »

maybe just random thoughts... Now we need PLCC84 and we are all set

I'd like a PLCC84 version! It would make working with P2 easier; but I'd hate to lose all those nice I/O's...

Bill Henning · 2014-02-15 07:21

cgracey wrote: »

C compilers probably won't use 1/5 of the total instructions, but when you're in PASM, you'll be able to do all kinds of stuff.

EXACTLY! Pasm2 is going to be even more fun than Pasm, as it gives a lot more power.

cgracey wrote: »

I wonder if the new architecture, with hub execution, will be less fun to program, because so much more is possible. I think some of us enjoy working with tight constraints because it makes for challenges.

For me, it will be more fun, as it will allow me to do far more.

Far more.

Besides, we get to push the envelope, and find the new limits.

cgracey wrote: »

No, it will be fun because you can do all kinds of things, and lots of them, at once. You can have a cog acting as a very tight controller, and you can use another as a system-level GUI.

Yep!

Not to mention that the unique three-memory architecture (cog, aux, hub) breaks the von Neumann bottleneck quite nicely.
