Yes, this is how I propose that BIG work. The BIG instruction supplies bits 31:9 to the following instruction's S field.
What I was proposing wasn't this complicated. The 32 bit value would be used as an immediate value in the instruction that follows the BIG instruction. There would be no range checking and no hub accesses unless the instruction happened to be RDxxxx or WRxxxx. In that case, the 32 bit immediate value would be the hub address for the hub access. Really, nothing in the COG processor would need to change except the handling of immediate values in the S field.
I don't think it would even affect the execution unit. It would only affect the instruction decoder that handles the forming of immediate operands.
I think there should be one 23 bit "big" register for each thread. That way the BIG instruction could be used even when threading was in use.
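As a rough illustration of the proposal above, here is a small Python model (not hardware code). The names `big_regs`, `big` and `immediate_s` are made up for this sketch, and the count of four threads is an assumption taken from the multi-tasking remark elsewhere in the thread.

```python
NUM_THREADS = 4  # assumption: four hardware tasks

# One 23-bit "big" register per thread (hypothetical model, not real hardware state).
big_regs = [0] * NUM_THREADS


def big(thread, value23):
    """Model of BIG: latch a 23-bit value for the given thread."""
    big_regs[thread] = value23 & 0x7FFFFF


def immediate_s(thread, s9):
    """Form the 32-bit immediate: BIG supplies bits 31:9, S supplies bits 8:0."""
    return (big_regs[thread] << 9) | (s9 & 0x1FF)
```

Each thread latches its own value, so a BIG issued by one thread would not disturb the immediate formed by another.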
How do you guys do quotes within quotes copied from the post you are referring to? Reply with quote does not do that?
Thanks for the answers David. Yes, I understand now. That sounds quite easy IMHO.
Might be possible to actually extend it to loading a 32 bit field and just "OR" in the lower #S 9 bits. This could provide more uses such that the BIG instruction would be 32 bits and the #S could be #0. I am not going to suggest adding the #S as this most likely takes too long within the pipeline although it could be useful.
BIG [#]D ' Loads a 32 bit register to be "OR"ed with the next instruction's #S (9 bits) to be used as a resultant 32 bit immediate S field.
Presuming we can free up a full instruction, then an immediate value of
xxxxxxx 00 x xxxx xxxxxxxxx SSSSSSSSS ' Loads the value stored in register S into the "BIG" register
xxxxxxx 10 n nnnn nnnnnnnnn nnnnnnnnn ' Load 23 immediate bits into the lower "BIG" register bits 22..0 and zero bits 31..23.
xxxxxxx 11 n nnnn nnnnnnnnn nnnnnnnnn ' Load 23 immediate bits into the upper "BIG" register bits 31..9 and zero bits 8..0
We would require 4 such registers for use in multi-tasking.
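To make the three load forms concrete, here is a hedged Python sketch. It assumes the two mode bits sit directly below the 7-bit opcode (bits 24..23), which is how the layout above reads; `big_load` and `augmented_s` are illustrative names only, and mode 01 is left undefined because the post does not describe it.

```python
def big_load(instr, cog_regs):
    """Return the value the proposed BIG forms would place in the BIG register."""
    mode = (instr >> 23) & 0b11           # assumed position of the two mode bits
    if mode == 0b00:
        return cog_regs[instr & 0x1FF] & 0xFFFFFFFF  # value of register S
    payload = instr & 0x7FFFFF            # the 23 immediate bits
    if mode == 0b10:
        return payload                    # bits 22..0 loaded, bits 31..23 zero
    if mode == 0b11:
        return payload << 9               # bits 31..9 loaded, bits 8..0 zero
    raise ValueError("mode 01 is not defined in the post")


def augmented_s(big_value, s9):
    """OR the BIG register with the next instruction's 9-bit #S."""
    return (big_value | (s9 & 0x1FF)) & 0xFFFFFFFF
```

With the upper form and any #S, or the lower form and #0, the OR yields the intended 32-bit immediate.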
As Sapieha pointed out, there is effectively no NOP instruction. There is a WAIT #n instruction.
But you can no longer assume that an instruction with cccc=0000 will not execute (ie as a NOP).
Currently the top 14 bits must be all zeros to ensure a NOP - well not precisely... this bit config
0xxxxxx xx x 0000 xxxxxxxxx xxxxxxxxx ensures a NOP
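A quick way to check a 32-bit value against that pattern, sketched in Python. It assumes the field order shown (7 opcode bits, 2 flag bits, 1 immediate bit, 4 condition bits, 9 D bits, 9 S bits), which puts the condition field at bits 21..18; `ensures_nop` is a made-up helper name.

```python
def ensures_nop(instr):
    """True if instr matches 0xxxxxx xx x 0000 xxxxxxxxx xxxxxxxxx."""
    top_bit_clear = ((instr >> 31) & 1) == 0   # leading opcode bit must be 0
    cond_never = ((instr >> 18) & 0xF) == 0    # cccc = 0000
    return top_bit_clear and cond_never
```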
Reply with Quote from your post gets me nothing.
So I have to manually cut & paste within manual quote and end quote tags? Or is there a simple way to copy someone's post that already includes quotes, while keeping those quotes?
DECOD5 takes the lower 5 bits and decodes it into a single bit mask.
DECOD5 reg1,reg2
replaces
MOV reg1,#1
SHL reg1,reg2
DECOD4 takes the lower 4 bits and creates a 16 bit mask. The resulting mask is copied to the 2 word positions.
DECOD3 takes the lower 3 bits and creates an 8 bit mask. The resulting mask is copied to the 4 byte positions.
ENCOD does the reverse.
Example
ENCOD reg1,#%10000 would return the value 5 to represent the 5th bit is set.
Values returned are in the range of 1 to 32.
A zero result represents no bit is set.
IIRC if multiple bits are set, it returns the most significant bit.
Maybe these can be shrunk into one opcode by using WZ,WC as suggested.
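For anyone who wants to play with these, here is a small Python model of the behaviour described above. It is a sketch from the description only, not from the Verilog; in particular the "most significant bit wins" rule for ENCOD is the IIRC statement above, and the function names are just for illustration.

```python
def decod5(x):
    """Lower 5 bits select one bit of a 32-bit mask."""
    return 1 << (x & 0x1F)


def decod4(x):
    """Lower 4 bits select one bit of a 16-bit mask, copied to both words."""
    mask16 = 1 << (x & 0xF)
    return (mask16 << 16) | mask16


def decod3(x):
    """Lower 3 bits select one bit of an 8-bit mask, copied to all four bytes."""
    mask8 = 1 << (x & 0x7)
    return mask8 * 0x01010101


def encod(x):
    """Position (1..32) of the most significant set bit, or 0 if no bit is set."""
    return (x & 0xFFFFFFFF).bit_length()
```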
So WZ & WC are not required for DECOD3/4/5 but WZ looks like being required for ENCOD?
That's OK as currently ENCOD is shared with BLMASK, but it only saves 2 instructions. I was hoping that I could find a place for BLMASK and free another instruction slot.
WZ with ENCOD I don't think has any effect. I'm firing up the FPGA now to check.
Edit: Just thinking about it.... It would reflect no bits set....Oops!
I wonder if there are other opportunities to combine instructions that don't need one or more of WZ WC and I ?
No, I have carefully scanned them. But there may be something that comes out of direct HUB-AUX transfers that could affect the RDBYTE/WORD/LONG Cache versions, or the RDAUX/RDAUXR, but I am not hopeful.
There are a couple in the 1000011-1111110 area that might yield something.
Then there is 1111111 & S=1xxxxxxxx that may also be available for an 8 bit S.
And I have a REPS/REPD alternative that partially frees 1111110.
Sounds like DECODExx may save us two dual-op instructions (if I read the above correctly), and that's all we need for (HJMP/HCALL/HCALLA/HCALLB) and BIG; HRET / HRETA / HRETB don't need any arguments.
Bill, yes it could be used in RD/WRxxxx
Sounds good although I'm not sure I see the value of being able to load the BIG register from another register. Also, I think Bill said that since BIG can't be encoded as a NOP, there may not be much reason to have the form that loads the low order bits with the BIG value. How would that even work? Would there be an extra bit to say which way BIG had been loaded so the instruction decode would know how to combine it with the S bits of the next instruction? That seems overly complicated to me. Maybe we'd better let Bill chime in on whether there is still value in loading the low bits rather than the high bits.
The low 23 bit option is still very useful as hub addresses ignore the high bits.
Okay but I would hate for this to be rejected because we tried to pile on too many features. Of course, I guess that's the standard approach with the P2 so far. :-)
Anyway, if you allow either the low or high 23 bits to be loaded you'll need one extra bit to remember which was requested by the BIG instruction so it can be combined correctly with the S field of the next instruction.
Following the ongoing discussion, I've thought of some new ideas targeted at the P2.
Over the years I have used (and, to be honest, often abused) a method in earlier applications to get "almost NOP" behaviour out of congested instruction set decoders.
I'll try to describe it here, but given my known difficulties writing in English, I beg your pardon and patience for any typos or confusing descriptions.
OCT-related code meant to execute from HUB memory must be aligned to eight consecutive longs, but execution need not start at address xxxxxxxxxxxxxxxxxxxx000. So, within the first eight longs fetched from HUB memory, all the 32 bit constants needed by the fewer-than-eight executable instructions of that block can sit in the first x longs of the block.
Since we don't need the full 32 bits of data, but only 23, there are 9 remaining "unused" bits.
Three of them will be used to set the entry point for the next "eight long aligned" code block, which will be fetched in advance while the present block is executing. This provides enough space to represent any number of constants we might need to reference in the next code block, and so on.
This allows straight-line execution of code blocks while leaving ample room for "almost" immediate values, without wasting a single JUMP to skip over data space.
It's kind of unaligned inline immediate values placement, and sure, almost for free.
Now the technique to fetch those values, usable from HUB, AUX, and even COG memory.
Whenever an operation, references the same place, as the source and destination, for a read or write operation, it is to be treated as a NOP inside the pipeline, as far as its WRITE phase at stage four is concerned.
First, because it's pointless to write a value over the same place where it already is.
Second, because the full 32 bits of the gathered value, are at disposal, to be used elsewhere; in the present case, to load the BIG register.
And this also gives us some 9 unused bits, three of them, sure, committed as above.
IMHO, the pipeline ALU will have no problem at all dealing with the operations described above, pending of course Chip's analysis and approval, and the comments, additions and criticism of the many other forum participants.
Naturally, the same technique works just as easily for AUX and COG memories too.
I hope it helps in the present situation.
Yanomani
P.S. When I wrote "Whenever an operation, references the same place, as the source and destination, for a read or write operation", I was talking about general memory, not the pin circuits, or any other special feature register, where writing over could be used for special purposes.
P.S. 2 - Sure, "It's kind of unaligned inline immediate values placement, and sure, almost for free." is not true.
You must place the WRLONG D,S, where D=S, in order to get the action done. My mistake and shame!
P.S. 3 - "Second, because the full 32 bits of the gathered value, are at disposal, to be used elsewhere; in the present case, to load the BIG register." To be accurate, the write phase will exist, directed to the BIG register and to the three bit "next OCT entry point" register. This must not be cleared until used for the first time, at the next block's execution entry.
The usage case with the low 32 bits is basically meant for the table case, and directly encoded hub addresses. If BIG is OR'd with a #0 in the S, that works as well... so I don't think the extra bit is needed.
Could you give a concrete example of the table case? I'm having a hard time visualizing what you're talking about. Sorry to be so dense!
The instruction "BIG" would load the 23 bits into the appropriate bits in the "BIG" register. The next executed instruction would not know, or care, where the bits were loaded. Its #S field would just be ORed with the BIG register.
I am still not sure of the requirement regarding HJMP, HCALL and HRET, and how they get used.
I presume you do not need to save/restore the Z/C flags with these instructions?
Could we simplify this whole thing a bit, and disregard multi-tasking for this mode of operation? Might simplify it quite a bit for Chip, etc.
Does the mapping/windowing of AUX into COG help if you could map larger blocks into COG?
I have written LMM pasm, so I understand how we FJMP, FCALL and FRET. Also I know that in this mode, it is better to have constants embedded as NOP instructions (18 bit constants) than have to setup fixed constants in cog.
I am presuming that the GCC compiler emits code in a similar fashion.
I presume that is why the BIG instruction is important, and I understand that.
Currently in LMM on P1 we run a tight 4 instruction loop. In P2 that loop is a 5 instruction loop - this is what I use in my P2 Debugger...
''-------[ LMM execution loop ]-------------------------
LmmLoop rdlong lmm_opcode, lmm_pc ' rdlong (read LMM hub instr into OPCODE using PC)
add lmm_pc, #4 ' PC++ (inc PC to next LMM hub instr)
lmm_op2 nop ' rdlong delay (optional 2nd instruction execution)
lmm_opcode nop ' rdlong result (execute the LMM hub instr)
jmp #LmmLoop ' loop
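For readers less familiar with LMM, the loop above can be pictured with this rough Python model. It is only a sketch: hub instructions are stood in for by Python callables, the `run_lmm` name and the "return a new PC to jump" convention (standing in for an FJMP-style primitive) are assumptions made for illustration, and the optional second instruction slot is not modelled.

```python
def run_lmm(hub, pc, regs, max_steps=1000):
    """Fetch/execute loop: hub maps byte addresses to callables taking regs."""
    steps = 0
    while pc in hub and steps < max_steps:
        opcode = hub[pc]         # rdlong lmm_opcode, lmm_pc
        pc += 4                  # add lmm_pc, #4
        new_pc = opcode(regs)    # execute the fetched instruction
        if new_pc is not None:   # an FJMP-like primitive rewrites the PC
            pc = new_pc
        steps += 1               # jmp #LmmLoop
    return steps
```

Every hub instruction costs one trip around the loop, which is the overhead the windowed approach is trying to remove.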
By being able to window some AUX ram into COG ram, we can now execute in place, saving the LMM execution loop.
Presuming we window only 8*Longs (the RDWIDE instruction width) of AUX into COG $1E0..$1E7 we might do something like this...
xxx: RDWIDE ddd,sss 'read 8*longs into aux which is windowed into the following instructions
' some delay to ensure the aux has been read
1e0: instr1 '\\ 8*longs read in by the RDWIDE instruction
1e1: instr2 '||
...
1e7: instr8 '//
1e8: JMP #xxx 'go fetch another 8*longs
... some instructions to accept the FJMP, FCALL, FRET instructions
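A rough Python picture of the windowed scheme above, for straight-line code only. The `run_windowed` helper and its `execute` callback are invented for this sketch; jumps, calls and the read delay are deliberately left out.

```python
LONGS_PER_WIDE = 8  # RDWIDE transfers eight longs, per the post


def run_windowed(hub_longs, execute):
    """Run straight-line code block by block; return the number of block fetches."""
    fetches = 0
    for base in range(0, len(hub_longs), LONGS_PER_WIDE):
        window = hub_longs[base:base + LONGS_PER_WIDE]  # models RDWIDE into $1E0..$1E7
        fetches += 1
        for instr in window:                            # instr1..instr8 run in place
            execute(instr)
    return fetches
```

For 20 straight-line instructions this model makes 3 block fetches, where the LMM loop would make one RDLONG per instruction.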
Extending the above HUBEXEC (named by Bill) model (replaces LMM model)...
This method would permit a tight DJNZ style instruction loop
1df: RDWIDE ddd,sss WC 'read 8*longs into aux which is windowed into the following instructions at $1E0..$1E7; WC means stall until read.
1e0: instr1 '\\ 8*longs read in by the RDWIDE instruction
1e1: instr2 '||
...
1e7: instr8 '//
1e8: JMP #$1df 'go fetch another 8*longs
... some instructions to receive the FJMP, FCALL, FRET instructions
An alternative, but the REPS instruction terminates with DJNZ style instructions..
1dd: REPS #9 'repeat next 9 instructions until a JMP is executed
1de: NOP 'spacer instruction
1df: RDWIDE ddd,sss WC 'read 8*longs into aux which is windowed into the following instructions at $1E0..$1E7; WC means stall until read.
1e0: instr1 '\\ 8*longs read in by the RDWIDE instruction
1e1: instr2 '||
...
1e7: instr8 '//
... some instructions to receive the FJMP, FCALL, FRET instructions
I have asked Chip if it were possible to
(1) Make the RDWIDE instruction capable of delivering up to a count of 32 x 8*Long reads into AUX in the background with a tiny state m/c
(2) If it would be possible to map up to the whole 32 x 8*Long AUX registers into COG ram
By mapping a large AUX block into COG, a good set of hub instructions could be executed inline at a time, and possibly small loops could be contained within those blocks read, giving an enormous boost to performance.
I got rid of the SETPIX0/1/2/3 instructions and made a new SETPIXW instruction that loads all eight PIX terms from the WIDE registers, all at once. So, there are four 'D/#,S/#' opcodes available now.
I've loosely read this thread and I understand that you are looking for some opcode space for HJMP/HCALL/...
I also see there is talk about how to have a 32-bit constant in-line. About that: I think the idea has already been posited, but we could have a dummy instruction that doesn't do anything, though its 23 LSBs are free for data payload. Any instruction that has an immediate D or S, with priority going to S, can look for this dummy instruction in the next-lower stage of the pipeline. If it sees it, and it hasn't been cancelled as trailing branch code, it will use its 23 LSBs to augment the 9-bit immediate value it already has, giving it a full 32-bit immediate for D or S. This would solve the problem, would it not?
That seems to fit the 32 bit constant concept quite well.
That would work well with MUL / DIV as well.
Now, what to name it?
Yes Chip. It's what was called the "BIG" instruction in that summary post of mine.
It would probably never be used for in-cog code, as it would waste a cycle as the dummy data-payload instruction floated through the pipeline, but it would provide code executing from the hub a way to have 32-bit constants without resorting to complicated means.
Super!
It would be used like this:
ADD reg,#bigconstant & $1FF
BIG #bigconstant >> 9
That would add bigconstant to reg.
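The split in that example is plain bit arithmetic, which can be checked with a few lines of Python. The helper names are made up for this sketch; the masks and shift come straight from the `& $1FF` and `>> 9` above.

```python
def split_constant(c):
    """Return (low 9 bits for the #S field, high 23 bits for the BIG payload)."""
    c &= 0xFFFFFFFF
    return c & 0x1FF, c >> 9


def join_constant(low9, high23):
    """Rebuild the 32-bit immediate from the two pieces."""
    return ((high23 << 9) | low9) & 0xFFFFFFFF
```

For example, `split_constant(0x12345678)` gives a low part of `0x078` and a high part of `0x91A2B`, and joining them returns the original constant.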
Instead of BIG, we should probably give it a name like AUGI for 'augment immediate'.
Any instruction having an immediate S or D would look for AUGI behind it. If it sees it and it's not cancelled, it extends the immediate value right in the pipeline, before it gets to stage 4. This was a really clever idea you guys came up with, and it turns out that it can be done by using the registers already in the pipeline, so it's almost free!
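Here is a toy Python model of that look-behind, just to make the idea concrete. The instruction stream is a list of `(mnemonic, value)` tuples invented for this sketch, only a 9-bit immediate S is modelled, and branch cancellation is not modelled at all.

```python
def resolve_immediates(stream):
    """Extend each instruction's 9-bit immediate if an AUGI directly follows it."""
    resolved = []
    for i, (mnemonic, value) in enumerate(stream):
        if mnemonic == "AUGI":
            continue                       # dummy instruction: no effect of its own
        imm = value & 0x1FF                # the 9-bit immediate already in the instruction
        if i + 1 < len(stream) and stream[i + 1][0] == "AUGI":
            imm |= (stream[i + 1][1] & 0x7FFFFF) << 9  # AUGI's 23 LSBs become bits 31..9
        resolved.append((mnemonic, imm))
    return resolved
```

An instruction with no AUGI behind it keeps its ordinary 9-bit immediate, so existing code would be unaffected in this model.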
Great news. That is clever just reversing the order of the two instructions. And working for #D as well. WTG Chip!
It doesn't work for multi-tasking? (That's fine I think)
Looks like we will get that "HUBEXEC" (execute in place) model working! What a performance boost over LMM!
I like your encoding! That would work well.
A small price to pay for a HUGE feature!