There's also a CALLAR, CALLBR, CALLARD, and CALLBRD. Since STACK is now AUX, I think the "R" refers to "reverse" (like the old PUSHA[R] and POPA[R]). For the RDAUX/WRAUX operations, I'm guessing that the "R" version only differs from the non-"R" version when using SPA/SPB. Going out on a limb, I'm guessing SPA/SPB is used when the MSB of the S-field is set (otherwise, it's an absolute offset between 0 and 255). However, why have the separate "R" version in that case? Couldn't the increment/decrement of SPA/SPB be encoded in the lower eight bits (similar to the INDA/INDB encoding)?
Ok. Maybe I should wait until Chip clarifies.
Things were getting complicated with reverse stack operations, so I just made new WRAUXR and RDAUXR instructions that NOT the address. This lets you run two stacks that build toward each other. All CALLs, RETurns, PUSHes and POPs work normally, but they NOT the final address if they have an "R" in them. This way, CALLs and PUSHes are always [SPx++] and RETurns and POPs are always [--SPx]. This is simpler than before and easier to think about.
Spin needs to maintain a run-time stack, but also needs to allow other use for the AUX RAM. So, its stack operations are all reversed, building from the top down, keeping the AUX RAM free from the bottom up.
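To picture how NOTing the address lets two stacks share the AUX RAM, here is a rough Python model. It is only a sketch under assumptions: AUX is treated as 256 longs, and the class and method names are made up for illustration; this is not how the hardware is actually wired.

```python
AUX_SIZE = 256  # assumption: AUX RAM modeled as 256 longs (addresses 0..255)


class AuxStack:
    """Toy model of one stack pointer; reverse=True mimics the "R" variants."""

    def __init__(self, ram, reverse=False):
        self.ram = ram          # shared AUX RAM (a plain list here)
        self.sp = 0             # the pointer itself always counts up from 0
        self.reverse = reverse  # when set, the final address is NOTed

    def _addr(self, a):
        a &= AUX_SIZE - 1
        # "R" variants invert the final address, so 0 maps to 255, 1 to 254, ...
        return (~a) & (AUX_SIZE - 1) if self.reverse else a

    def push(self, value):
        # models [SPx++]: write at the pointer, then increment
        self.ram[self._addr(self.sp)] = value
        self.sp = (self.sp + 1) & (AUX_SIZE - 1)

    def pop(self):
        # models [--SPx]: decrement, then read at the pointer
        self.sp = (self.sp - 1) & (AUX_SIZE - 1)
        return self.ram[self._addr(self.sp)]


ram = [0] * AUX_SIZE
fwd = AuxStack(ram)                # builds upward from address 0
rev = AuxStack(ram, reverse=True)  # builds downward from address 255
fwd.push(11)
fwd.push(22)
rev.push(33)
rev.push(44)
```

Both stacks use the same increment-on-push, decrement-on-pop pointer logic; only the address inversion differs, which is why they grow toward each other.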
Chip, may I suggest a replacement for the now-obsolete ESWAP8 instruction?
In many PCB I/O layouts it is quite common for pins to be connected to dual-row headers (0.1 inch, for example).
Most PCB layouts tend to fan out the pins so that one side of the connector is odd-numbered I/O and the other even.
The FPGA boards are an example of this: P0, P2, P4 on one side and P1, P3, P5 on the other.
In a lot of cases it would be nice to use a group of pins on one side of the connector.
My idea is a SPREAD instruction that does the following.
SPREAD D,S
where S = %11001, for example
D would = %0101000001 afterward
A simple shift could align to odd/even bits, or maybe WC could be used to optionally shift left by 1 bit.
Basically the 9 bits of an immediate or the lower 16 bits of S are spread over 32 bits.
A reverse instruction, SQUISH, could do the opposite: every 2nd bit is gathered to create a 16-bit result. WC could be used to grab odd/even bits.
This would simplify/reduce the code required to mask/modify I/O pins.
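A quick way to check the proposed behavior is to model it in Python. This is just a sketch: the function names and the `odd` flag (standing in for the suggested WC option) are assumptions for illustration, not a description of any actual instruction.

```python
def spread(s, odd=False):
    """Place bit i of the low 16 bits of s at bit 2*i (or 2*i+1 if odd)."""
    d = 0
    for i in range(16):
        d |= ((s >> i) & 1) << (2 * i)
    return (d << 1) if odd else d


def squish(d, odd=False):
    """Inverse of spread: gather every 2nd bit of d into a 16-bit result."""
    if odd:
        d >>= 1
    s = 0
    for i in range(16):
        s |= ((d >> (2 * i)) & 1) << i
    return s


# The example from the post: %11001 spreads to %0101000001
print(bin(spread(0b11001)))
```

With `odd=True` the same pattern lands on the odd-numbered bits, which would correspond to the other row of a dual-row header.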
Actually, couldn't you do that with MERGEW and SPLITW?
I just had another look at SPLITW and MERGEW. It's been a while since I last read about them.
They're close to what I was talking about. I think their names might have misled me.
I'll crank up the FPGA and have a proper look.
I figure it's just a basic MUX function?
I would call an instruction that changed
abcdefgh_ijklmnop_qrstuvwx_yz012345
to
aqbrcsdt_eufvgwhx_iyjzk0l1_m2n3o4p5
ZIP, and the inverse UNZIP. Would that do what you want?
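The letter pattern above can be checked with a few lines of Python that work on the bit labels rather than on real bits. The helper names are made up for illustration; this only demonstrates the permutation, under the assumption that the high half supplies the first bit of each pair.

```python
def zip_labels(labels):
    """Interleave the high half with the low half: h0 l0 h1 l1 ..."""
    half = len(labels) // 2
    hi, lo = labels[:half], labels[half:]
    return "".join(h + l for h, l in zip(hi, lo))


def unzip_labels(labels):
    """Inverse: even positions rebuild the high half, odd ones the low half."""
    return labels[0::2] + labels[1::2]


src = "abcdefghijklmnopqrstuvwxyz012345"  # one label per bit, MSB first
zipped = zip_labels(src)
print(zipped)
```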
In the new instruction list, I only see PUSHZC and POPZC, I don't see the PUSHA, PUSHAR, etc. Are these going to be aliases for WRAUX[R]?
Regardless, what is the bit encoding for RDAUX/WRAUX?
Also, it occurs to me that the terms "SPA" and "SPB" are a bit of a misnomer, now that you're no longer referring to it as a stack. But "APA" and "APB" don't feel right somehow.
Yes, PUSHes and POPs are special cases of WRAUX and RDAUX.
If that's the case, are they really necessary? For instance, could you instead do:
PUSHA D => WRAUX D, XPA++
POPA D => RDAUX D, --XPA
PUSHAR D => WRAUX D, XPA--
POPAR D => RDAUX D, ++XPA
' note the use of XPA instead of SPA, for auX Ptr A. Just a thought...
As you can see, it looks very similar to INDA/INDB and PTRA/PTRB usage. This would also allow you to get rid of WRAUXR and RDAUXR. You might even be able to cut the CALL/RET instructions to something like:
CALLX D, XPA++ 'was CALLA D
RETX --XPA 'was RETA
' they could also just be CALL and RET. The X was to emphasize the use of AUX, but it's not really necessary.
I needed more bits than were available in the %1111110 set. Do you see a better way? I may not be thinking straight.
Those last 3 instructions were real odd-balls, so I put them at the end. Everything before those three is pretty regular. Weirdos to the back of the bus!
Maybe the only suggestion I'd make is to move the %1111111 instructions to %1111101, then reserve %1111111 for later expansion. That way, the extended instructions would mask to %1111110. Maybe that's not at all helpful (in Verilog, synthesis, etc), in which case, I wouldn't suggest making the change.
Chip,
Yeah, I saw the ESWAP8 did the S -> D thing, but wasn't sure how important that would really be. Typically when you do an endian swap you would prefer it be in place. Often you are doing it to a packet received over the network, or to a file read from mass storage. It's fine to leave it if you don't need the instruction space.
ozpropdev:
MERGEW will do what you want there. You need to give it two WORDs, but one will just be all zeros, the other will be your value, and the result will be your value in every other bit (and zeros in the in-between bits, or whatever value was in the other input WORD).
SPLITW does the opposite, taking every other bit in a DWORD and putting the even ones in one WORD and the odd ones in the other WORD.
I do this in code on the PC using "dilated integers" (and the reverse, undilating). It's used for Morton order (Z-order), which is a common order for graphics textures on GPUs. It's also useful with quadtrees and octrees.
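For anyone who hasn't met dilated integers, here is a minimal Python sketch of the usual shift-and-mask approach and how it yields a 2D Morton code. The function names are my own, and putting x on the even bits and y on the odd bits is an assumed convention; it is not meant as a model of MERGEW/SPLITW operand order.

```python
def dilate16(x):
    """Spread the low 16 bits of x onto the even bit positions of a 32-bit value."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def undilate16(x):
    """Gather the even bit positions of x back into a 16-bit value."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x


def morton_encode(x, y):
    """Interleave x (even bits) and y (odd bits) into one Z-order index."""
    return dilate16(x) | (dilate16(y) << 1)


def morton_decode(code):
    """Split a Z-order index back into (x, y)."""
    return undilate16(code), undilate16(code >> 1)
```

Each step doubles the gap between groups of bits, so a 16-bit value is dilated in four shift-and-mask rounds instead of a 16-iteration loop.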
Yeah, that makes more sense. I've always thought in terms of "adding NOPs", but WAITX would probably be better.
Or just WAIT. I also like jmg's suggestion of SLEEP, even if it doesn't reduce power usage. This would distinguish it from the other WAITxxx operations that depend on an external trigger (CNT, PEQ, etc.)
Thanks Roy
I've been testing them on the FPGA today. I think their names might have misled me to think they did something else.
It pays to re-read the docs now and then to refresh the old grey matter.
Cheers
Brian
There are some very nice new instructions in that list. INCD/INCDS/DECD/DECDS for incrementing/decrementing the D address and optionally S too (previously done with ADD/SUB D,X200/X201; the new instructions save a variable).
For NRZI serial (USB), an instruction that XORs C with a pin and puts the result in C would reduce 3 instructions to 1. It would be helpful if WZ set Z according to the pin state too.
XORPC pinno [WC],[WZ]
replaces this sequence...
TEST K,INA WZ
MUXZ NRZI,MASK30
SHL NRZI,#1
and this is followed by
RCR DATA,#1
RCL STUFFCNT,#6 WZ
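For context, here is a rough Python sketch of what this kind of receive loop is computing. It assumes USB-style NRZI (a level change encodes 0, no change encodes 1) with a stuffed 0 inserted after six consecutive 1s. The function names and the `initial` idle level are made up for illustration, and this is not a model of the proposed XORPC instruction itself.

```python
def nrzi_decode(levels, initial=1):
    """Turn sampled line levels into data bits: transition -> 0, no transition -> 1."""
    prev = initial
    bits = []
    for level in levels:
        bits.append(0 if level != prev else 1)
        prev = level
    return bits


def unstuff(bits):
    """Drop the bit that follows every run of six consecutive 1s."""
    out = []
    ones = 0
    skip = False
    for b in bits:
        if skip:
            # this is the stuffed bit (expected to be 0); discard it
            skip = False
            ones = 0
            continue
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            skip = True
    return out
```

The per-bit work is one XOR against the previous level plus a run counter, which is why folding the XOR into a single pin-testing instruction would shorten the inner loop.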
Chip, I forgot to add: I really like the way you reused the R bit for a 7th opcode bit. Makes the set look nice. Love the L bit for immediate D. We have a lot of instructions to replace the lost NR instruction use.
I think a JNPEQ - jump if all selected pins are zero - would also be useful.
IMHO, one of the strong points of the Propeller architecture is bit manipulation. I often use instructions such as rcr, rev, muxc, andn, movi, movd for intended and unintended things. Using "movd" saves me 3 instructions in some situations, compared with doing "and dest" -> "and source" -> "shl source" -> "or dest, source". Designing your code (and formats) with Propeller instruction bit fields in mind can increase execution speed for inner loops. I often think of how nice it would be to have arbitrary bit field read/write instructions. That would be very useful for a lot of general cases.
Three instructions would be needed.
SMM - Set Multiplex Mask (maybe the accumulator could be used as a mask?)
MUXV - Multiplex Value
DMIV - Demultiplex Into Value
If you want to fill a destination address with data in some arbitrary bits, you could just do:
SMM bitMask
MUXV dest, source
The bits in source would fill up all the ones of the mask in destination, starting from the LSB, until there are no more "holes" to fill.
SMM -> DMIV would do the opposite of course.
Maybe these kinds of "dynamic" instructions eat more silicon and are harder to implement?
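To make the idea concrete, here is a small Python model of the two operations, similar in spirit to the x86 BMI2 PDEP/PEXT instructions. Everything here is an assumption for illustration: the function names, passing the mask as an argument instead of setting it with SMM, and leaving destination bits outside the mask untouched.

```python
def muxv(dest, source, mask):
    """Deposit the low bits of source into the 1-positions of mask, LSB first.

    Bits of dest outside the mask are kept (an assumed behavior).
    """
    result = dest & ~mask & 0xFFFFFFFF
    bit = 0
    for pos in range(32):
        if (mask >> pos) & 1:
            result |= ((source >> bit) & 1) << pos
            bit += 1
    return result


def dmiv(value, mask):
    """Collect the bits of value at the 1-positions of mask into a packed result."""
    result = 0
    bit = 0
    for pos in range(32):
        if (mask >> pos) & 1:
            result |= ((value >> pos) & 1) << bit
            bit += 1
    return result


# Mask %11010 has holes at bits 1, 3 and 4; source %101 fills them LSB first.
print(bin(muxv(0, 0b101, 0b11010)))
```

The loop shows why this is costlier in hardware than a fixed MUX: the position each source bit lands on depends on how many mask bits sit below it.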
Well, things are moving along well, despite a few delays due to hard-to-find bugs after the Big Change.
Right now, I'm working on adding the new pixel blending modes to the texture mapper, and then I must revisit the logic controlling the auxiliary RAM from the cog side. After that, I'll deal with the synchronous shifter issue. Then, updated test suites need to be made.
Last night, in trying to discover the source of a bug, I was hard-coding some internal cog signals out to I/O pins so that I could observe things on the logic analyzer. This helped me immensely. It occurred to me that this type of thing could be standardized almost for free, as it takes just a few muxes. So, I added a SETRACE D/#n (set trace) instruction which outputs that cog's internal signals onto a selectable word (as in 16 bits) of I/O pins. The signals output are, from top down: Z, C, GO, COND, VALID, TASK[1:0], PC[8:0]. This way, you can see, in real-time (or capture through internal port D to AUX RAM or external SDRAM) the sequence of a cog's activity. VALID indicates whether the instruction hasn't been cancelled as branch-trailing code, COND shows condition, and GO is high whenever execution is proceeding and low when the pipeline is being stalled. The rest are what you'd expect: the flag states (Z and C), the task number (TASK), and the program counter (PC). When you need to see what a cog is doing, this really spills its guts. You could make a trace from another cog by having it wait for an internal (port D) edge event, then log so many clock cycles of activity, which can then be mapped back to the code that is known to be in the cog of interest. Anyway, it doesn't take any special code to operate; you just make 16 pins outputs, then do a SETRACE #word to start the outputting. This should be helpful to people who want to get an understanding of what's actually going on with their code at the clock-cycle level.
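On the capture side, unpacking the 16-bit trace word could look something like the Python sketch below. It assumes "from top down" means Z sits on the highest pin of the word and PC[0] on the lowest; the `TraceSample` type and `decode_trace` function are hypothetical names for illustration only.

```python
from typing import NamedTuple


class TraceSample(NamedTuple):
    z: int      # Z flag
    c: int      # C flag
    go: int     # 1 while executing, 0 while the pipeline is stalled
    cond: int   # condition result
    valid: int  # 1 if the instruction was not cancelled
    task: int   # TASK[1:0]
    pc: int     # PC[8:0]


def decode_trace(word):
    """Unpack one 16-bit trace word (assumed layout: Z at bit 15 down to PC at bits 8..0)."""
    word &= 0xFFFF
    return TraceSample(
        z=(word >> 15) & 1,
        c=(word >> 14) & 1,
        go=(word >> 13) & 1,
        cond=(word >> 12) & 1,
        valid=(word >> 11) & 1,
        task=(word >> 9) & 0b11,
        pc=word & 0x1FF,
    )


sample = decode_trace(0b1011110000000101)
print(sample)
```

A logging cog or a PC-side tool could run each captured word through a decoder like this and map the `pc` values back to the known cog image.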
I REALLY like your SETTRACE instruction! It will make debugging much easier, as it will be possible to watch the PC and find out if a cog is stuck.
Here is an idea for the next shuttle run:
If a specific pin (P85? whichever is the highest "unused" pin) is found to be pulled low on startup, SETTRACE the boot loader / monitor to, say, P64-P79. That only adds 2 instructions to the "ROM".
The above, and a logic analyzer, would help verify the chip until it's fully tested; and since the bootloader/monitor source is published, it won't hurt to leave it in for a production run (to avoid a mask charge).
For development work, it will be fantastic - capturing the execution profile of 1..4 tasks, for post-capture analysis!
Comments
Interesting idea.
I have a question about NOPX.
Wouldn't it be better to rename it WAITX?
Yeah, that makes more sense. I've always thought in terms of "adding NOPs", but WAITX would probably be better.
Thanks.
To me too, it looks more logical.
If this drops the power significantly, it could even be called SLEEPx?
P.S. I like "SQUISH" as well
ZIP / UNZIP sounds good too.
You have just made P2 into what might be a crazy fast FORTH machine.
Just wondering if all the fancy new P2 features include relative addressing?
Cheers,
Peter (pjv)
/Johannes
P2_Instruction_16Oct2013.zip
That's excellent Chip!
Chip, that's cool! That feature could lead to debugging tools that are not possible today!
SETRACE is kind of like the NSA for cogs.
Nice debug facility. Can you add commands to operate it from the monitor?