Propeller II update - BLOG

cgracey · 2013-10-16 12:07

Seairth wrote: »

There's also a CALLAR, CALLBR, CALLARD, and CALLBRD. Since STACK is now AUX, I think the "R" refers to "reverse" (like the old PUSHA[R] and POPA[R]). For the RDAUX/WRAUX operations, I'm guessing that the "R" is only differentiated from the non-"R" version when using SPA/SPB. Going out on a limb, I'm guessing SPA/SPB is used when the MSB of S-field is set (otherwise, it's an absolute offset between 0 and 255). However, why have the separate "R" version in that case? Couldn't the increment/decrement of SPA/SPB be encoded in the lower eight bits (similar to the INDA/INDB encoding)?

Ok. Maybe I should wait until Chip clarifies.

Things were getting complicated with reverse stack operations, so I just made new WRAUXR and RDAUXR instructions that NOT the address. This lets you run two stacks that build toward each other. All CALLs, RETurns, PUSHes and POPs work normally, but NOT the final address if they have an "R" in them. This way, CALLs and PUSHes are always [SPx++] and RETurns and POPs are always [--SPx]. This is simpler than before and easier to think about.

Spin needs to maintain a run-time stack, but also needs to allow other use for the AUX RAM. So, its stack operations are all reversed, building from the top down, keeping the AUX RAM free from the bottom up.
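
A minimal Python sketch of the address-NOT idea, for readers who want to see why the two stacks build toward each other. It is a model, not the hardware: the 256-long AUX depth, the class and method names, the use of two separate pointers, and the pre-decrement pop are all assumptions made for illustration.

```python
AUX_SIZE = 256             # assumed AUX RAM depth in longs
ADDR_MASK = AUX_SIZE - 1


class AuxStacks:
    """Hypothetical model: one RAM, a normal stack and an 'R' (address-NOTed) stack."""

    def __init__(self):
        self.ram = [0] * AUX_SIZE
        self.spa = 0  # pointer used by the normal operations
        self.spb = 0  # pointer used by the "R" operations

    def push(self, value):
        # [SPx++]: write at the pointer, then increment it
        self.ram[self.spa & ADDR_MASK] = value
        self.spa = (self.spa + 1) & ADDR_MASK

    def pop(self):
        # assumed [--SPx]: decrement the pointer, then read
        self.spa = (self.spa - 1) & ADDR_MASK
        return self.ram[self.spa]

    def push_r(self, value):
        # same pointer arithmetic, but the final address is bitwise NOTed
        self.ram[~self.spb & ADDR_MASK] = value
        self.spb = (self.spb + 1) & ADDR_MASK

    def pop_r(self):
        self.spb = (self.spb - 1) & ADDR_MASK
        return self.ram[~self.spb & ADDR_MASK]


stacks = AuxStacks()
stacks.push(1)
stacks.push(2)        # lands at addresses 0 and 1
stacks.push_r(10)
stacks.push_r(20)     # lands at addresses 255 and 254
```

Because NOT maps pointer 0 to address 255, pointer 1 to 254, and so on, the "R" stack grows downward from the top while the pointer itself still counts up.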

cgracey · 2013-10-16 12:28

ozpropdev wrote: »

May I make a suggestion Chip for a replacement for the now obsolete ESWAP8 instruction.

In many PCB I/O layouts it is quite common for pins to be connected to dual-row headers (0.1 inch, for example).
Most PCB layouts tend to fan out the pins so that one side of the connector is odd-numbered I/O and the other even.
The FPGA boards are an example of this. P0,P2,P4 on one side and P1,P3,P5 the other.
In a lot of cases it would be nice to use a group of pins on one side of the connector.
My idea is a SPREAD instruction that does the following.

SPREAD D,S

where s = %11001 for example
D would = %0101000001 after
A simple shift could align to odd/even bits or maybe WC could be used to shift left 1 bit optionally.

Basically the 9 bits of an immediate or the lower 16 bits of S are spread over 32 bits
A reverse instruction SQUISH could do the opposite. Every 2nd bit is aligned to create a 16 bit result. WC could be used to grab odd/even bits.

This would simplify/reduce code required to mask/modify IO pins.

I figure it's just a basic MUX function?

Interesting idea.
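
For reference, a rough Python model of what the proposed SPREAD and SQUISH would compute. The function names and the `odd` parameter (standing in for the optional WC shift) are hypothetical; nothing here reflects an actual instruction encoding.

```python
def spread(value, odd=False):
    """Spread the low 16 bits of value over 32 bits, one zero between each bit."""
    value &= 0xFFFF
    result = 0
    for i in range(16):
        if value & (1 << i):
            result |= 1 << (2 * i)
    # odd=True models the optional one-bit left shift onto the odd pins
    return (result << 1) & 0xFFFFFFFF if odd else result


def squish(value, odd=False):
    """Inverse of spread: collect every second bit into a 16-bit result."""
    if odd:
        value >>= 1
    result = 0
    for i in range(16):
        if value & (1 << (2 * i)):
            result |= 1 << i
    return result


mask = spread(0b11001)  # matches the %11001 -> %0101000001 example above
```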

Sapieha · 2013-10-16 12:29

Hi Chip..

I have a question about NOPX.

Wouldn't it be better to rename it to WAITX?

cgracey · 2013-10-16 12:31

Sapieha wrote: »

Hi Chip..

I have a question about NOPX.

Wouldn't it be better to rename it to WAITX?

Yeah, that makes more sense. I've always thought in terms of "adding NOPs", but WAITX would probably be better.

Sapieha · 2013-10-16 12:32

Hi Chip.

Thanks

To me, too, it looks more logical.

cgracey wrote: »

Yeah, that makes more sense. I've always thought in terms of "adding NOPs", but WAITX would probably be better.

jmg · 2013-10-16 12:59

cgracey wrote: »

Yeah, that makes more sense. I've always thought in terms of "adding NOPs", but WAITX would probably be better.

If this drops the power significantly, it could even be called SLEEPx ?

Bill Henning · 2013-10-16 13:42

Thanks, now it makes sense.

cgracey wrote: »

Things were getting complicated with reverse stack operations, so I just made new WRAUXR and RDAUXR instructions that NOT the address. This lets you run two stacks that build toward each other. All CALLs, RETurns, PUSHes and POPs work normally, but NOT the final address if they have an "R" in them. This way, CALLs and PUSHes are always [SPx++] and RETurns and POPs are always [--SPx]. This is simpler than before and easier to think about.

Spin needs to maintain a run-time stack, but also needs to allow other use for the AUX RAM. So, its stack operations are all reversed, building from the top down, keeping the AUX RAM free from the bottom up.

ozpropdev · 2013-10-16 15:13

Seairth wrote: »

Actually, couldn't you do that with MERGEW and SPLITW?

I just had another look at SPLITW and MERGEW. It's been a while since I last read about them.
They're close to what I was talking about. I think their names might have misled me.
I'll crank up the FPGA and have a proper look.

P.S. I like "SQUISH" as well

Circuitsoft · 2013-10-16 15:22

ozpropdev wrote: »

where s = %11001 for example
D would = %0101000001 after
A simple shift could align to odd/even bits or maybe WC could be used to shift left 1 bit optionally.

Basically the 9 bits of an immediate or the lower 16 bits of S are spread over 32 bits
A reverse instruction SQUISH could do the opposite. Every 2nd bit is aligned to create a 16 bit result. WC could be used to grab odd/even bits.

This would simplify/reduce code required to mask/modify IO pins.

I figure it's just a basic MUX function?

I would call an instruction that changed

abcdefgh_ijklmnop_qrstuvwx_yz012345

to

aqbrcsdt_eufvgwhx_iyjzk0l1_m2n3o4p5

ZIP, and the inverse UNZIP. Would that do what you want?
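
A small Python sketch of the ZIP/UNZIP permutation as written out in the letter diagram above, treating the leftmost letter as bit 31. The names `zip32` and `unzip32` are made up for illustration.

```python
def zip32(x):
    """Interleave the high and low words: high-word bits land on odd positions."""
    result = 0
    for i in range(16):
        result |= ((x >> i) & 1) << (2 * i)             # low word  -> even bits
        result |= ((x >> (16 + i)) & 1) << (2 * i + 1)  # high word -> odd bits
    return result


def unzip32(x):
    """Inverse of zip32: even bits back to the low word, odd bits to the high word."""
    result = 0
    for i in range(16):
        result |= ((x >> (2 * i)) & 1) << i
        result |= ((x >> (2 * i + 1)) & 1) << (16 + i)
    return result
```

With the high word zeroed, `zip32` gives the spread-with-zeros pattern on the even bits.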

ozpropdev · 2013-10-16 15:33

That's the sort of thing I'm talking about, but with every 2nd bit zeroed so it can be used as a mask on a port operation.

a0b0c0d0_e0f0g0h0_i0j0k0l0_m0n0o0p0

or

0a0b0c0d_0e0f0g0h_0i0j0k0l_0m0n0o0p

ZIP / UNZIP sounds good too.

Seairth · 2013-10-16 17:27

cgracey wrote: »

Things were getting complicated with reverse stack operations, so I just made new WRAUXR and RDAUXR instructions that NOT the address. This lets you run two stacks that build toward each other. All CALLs, RETurns, PUSHes and POPs work normally, but NOT the final address if they have an "R" in them. This way, CALLs and PUSHes are always [SPx++] and RETurns and POPs are always [--SPx]. This is simpler than before and easier to think about.

Spin needs to maintain a run-time stack, but also needs to allow other use for the AUX RAM. So, its stack operations are all reversed, building from the top down, keeping the AUX RAM free from the bottom up.

In the new instruction list, I only see PUSHZC and POPZC, I don't see the PUSHA, PUSHAR, etc. Are these going to be aliases for WRAUX[R]?

Regardless, what is the bit encoding for RDAUX/WRAUX?

Also, it occurs to me that the terms "SPA" and "SPB" are a bit of a misnomer, now that you're no longer referring to it as a stack. But "APA" and "APB" don't feel right somehow.

cgracey · 2013-10-16 17:48

Seairth wrote: »

In the new instruction list, I only see PUSHZC and POPZC, I don't see the PUSHA, PUSHAR, etc. Are these going to be aliases for WRAUX[R]?

Regardless, what is the bit encoding for RDAUX/WRAUX?

Also, it occurs to me that the terms "SPA" and "SPB" are a bit of a misnomer, now that you're no longer referring to it as a stack. But "APA" and "APB" don't feel right somehow.

Yes, PUSHes and POPs are special cases of WRAUX and RDAUX.

Seairth · 2013-10-16 18:34

cgracey wrote: »

Yes, PUSHes and POPs are special cases of WRAUX and RDAUX.

If that's the case, are they really necessary? For instance, could you instead do:

PUSHA D    => WRAUX D, XPA++
POPA D     => RDAUX D, --XPA
PUSHAR D   => WRAUX D, XPA--
POPAR D    => RDAUX D, ++XPA

' note the use of XPA instead of SPA, for auX Ptr A.  Just a thought...

As you can see, it looks very similar to INDA/INDB and PTRA/PTRB usage. This would also allow you to get rid of WRAUXR and RDAUXR. You might even be able to cut the CALL/RET instructions to something like:

CALLX D, XPA++    'was CALLA D
RETX --XPA        'was RETA

' they could also just be CALL and RET.  The X was to emphasize the use of AUX, but it's not really necessary.

Seairth · 2013-10-16 18:47

cgracey wrote: »

I needed more bits than were available in the %1111110 set. Do you see a better way? I may not be thinking straight.

Those last 3 instructions were real odd-balls, so I put them at the end. Everything before those three is pretty regular. Weirdos to the back of the bus!

Maybe the only suggestion I'd make is to move the %1111111 instructions to %1111101, then reserve %1111111 for later expansion. That way, the extended instructions would mask to %1111110. Maybe that's not at all helpful (in Verilog, synthesis, etc), in which case, I wouldn't suggest making the change.

Roy Eltham · 2013-10-17 00:19

Chip,
Yeah, I saw the ESWAP8 did the S -> D thing, but wasn't sure how important that would really be. Typically when you do an endian swap you would prefer it be in place. Often you are doing it to a packet received over the network, or to a file read from mass storage. It's fine to leave it if you don't need the instruction space.

ozpropdev:
MERGEW will do what you want there. You need to give it two WORDs, but one will just be all zeros, the other will be your value, and the result will be your value in every other bit (and zeros in the between bits, or whatever value was in the other input WORD).
SPLITW does the opposite, taking every other bit in a DWORD and putting the even ones in one WORD and the odd ones in the other WORD.

I do this in code on the PC using "dilated integers" (and the reverse undilating). It's used for Morton Order (Z order) which is a common order for graphics textures on GPUs, it's also useful with quadtrees and octrees.
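
As a rough illustration of the dilated-integer trick on a PC, here is a Python sketch using the usual shift-and-mask steps. This is the common software technique, not a description of how MERGEW or SPLITW are implemented; the function names are placeholders.

```python
def dilate16(x):
    """Spread 16 bits so bit i moves to bit 2*i (zeros in between)."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def undilate16(x):
    """Collect the even bits of a 32-bit value back into 16 bits."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x


def morton_encode(x, y):
    """Z-order index: x on the even bits, y on the odd bits."""
    return dilate16(x) | (dilate16(y) << 1)


def morton_decode(code):
    return undilate16(code), undilate16(code >> 1)
```

Nearby (x, y) pairs tend to get nearby Morton indices, which is why the ordering suits textures, quadtrees and octrees.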

Seairth · 2013-10-17 04:55

cgracey wrote: »

Yeah, that makes more sense. I've always thought in terms of "adding NOPs", but WAITX would probably be better.

Or just WAIT. I also like jmg's suggestion of SLEEP, even if it doesn't reduce power usage. This would distinguish it from the other WAITxxx operations that depend on an external trigger (CNT, PEQ, etc.)

ozpropdev · 2013-10-17 05:27

Roy Eltham wrote: »

ozpropdev:
MERGEW will do what you want there. You need to give it two WORDs, but one will just be all zeros, the other will be your value, and the result will be your value in every other bit (and zeros in the between bits, or whatever value was in the other input WORD).
SPLITW does the opposite, taking every other bit in a DWORD and putting the even ones in one WORD and the odd ones in the other WORD.

Thanks Roy
I've been testing them on the FPGA today. I think their names might have misled me to think they did something else.
It pays to re-read the docs now and then to refresh the old grey matter.

Cheers
Brian

localroger · 2013-10-18 19:07

cgracey wrote: »

This lets you run two stacks that build toward each other.

You have just made P2 into what might be a crazy fast FORTH machine.

Cluso99 · 2013-10-19 22:07

There are some very nice new instructions in that list. INCD/INCDS/DECD/DECDS for incrementing/decrementing the D address and optionally S too (prev done with ADD/SUB D,X200/X201 - saves a variable).

For NRZI serial (USB), an instruction that XORs the C with a pin and puts the result in C would reduce 3 instructions to 1. It would be helpful if WZ set Z according to the pin state too.
XORPC pinno [WC],[WZ]
replaces this sequence...
TEST K,INA WZ
MUXZ NRZI,MASK30
SHL NRZI,#1
and this is followed by
RCR DATA,#1
RCL STUFFCNT,#6 WZ
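
To show where the XOR comes from, here is a minimal Python sketch of NRZI decoding and encoding. It assumes the USB convention (no transition means 1, a transition means 0) and an idle level of 1; the function names are hypothetical, and bit-stuff handling is left out.

```python
def nrzi_decode(levels, idle_level=1):
    """Turn sampled line levels into data bits (USB convention assumed)."""
    prev = idle_level            # plays the role of the previous level kept in C
    bits = []
    for level in levels:
        changed = level ^ prev   # the XOR of the old state with the pin
        bits.append(changed ^ 1) # no transition = 1, transition = 0
        prev = level
    return bits


def nrzi_encode(bits, idle_level=1):
    """Inverse of nrzi_decode: toggle the line for every 0 bit."""
    level = idle_level
    levels = []
    for bit in bits:
        if bit == 0:
            level ^= 1
        levels.append(level)
    return levels
```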

Cluso99 · 2013-10-19 22:23

Chip, I forgot to add I really like the way you reused the R bit for a 7th opcode bit. Makes the set look nice. Love the L bit for immediate D. We have a lot of instructions to replace the lost NR instruction use.

I think a JNPEQ - jump if all selected pins are zero - would also be useful.

pjv · 2013-10-20 11:49

Hi All,

Just wondering if all the fancy new P2 features include relative addressing ?

Cheers,

Peter (pjv)

Ahle2 · 2013-10-20 16:22

IMHO, one of the strong points of the Propeller architecture is bit manipulation. I often use instructions such as rcr, rev, muxc, andn, movi, movd for intended and unintended things. To use "movd", in some situations, saves me 3 instructions instead of doing "and dest" -> "and source" -> "shl source" -> "or dest, source". Designing your code (and formats) with Propeller instruction bit fields in mind can increase execution speed for inner loops. I often think of how nice it would be to have arbitrary bit field read/write instructions. That would be very useful for a lot of general cases.

3 instructions would be needed.

SMM - Set Multiplex Mask (maybe the accumulator could be used as a mask?)
MUXV - Multiplex Value
DMIV - Demultiplex Into Value

If you want to fill a destination address with data in some arbitrary bits, you could just do:

SMM bitMask
MUXV dest, source

The bits in source would fill up all the ones of the mask in destination, starting from the LSB and continuing until there are no more "holes" to fill.

SMM -> DMIV would do the opposite of course.

Maybe these kinds of "dynamic" instructions eat more silicon and are harder to implement?

/Johannes
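
A small Python model of the proposed MUXV/DMIV pair, to make the "fill the holes from the LSB" behaviour concrete. It assumes destination bits outside the mask are left unchanged, which the post does not state; the function names are placeholders.

```python
def mux_value(dest, source, mask):
    """Deposit the low bits of source into the 1-positions of mask, LSB first."""
    result = dest & ~mask & 0xFFFFFFFF  # assumed: bits outside the mask are kept
    src_bit = 0
    for pos in range(32):
        if mask & (1 << pos):
            if source & (1 << src_bit):
                result |= 1 << pos
            src_bit += 1
    return result


def demux_value(value, mask):
    """Gather the bits of value at the 1-positions of mask into a packed result."""
    result = 0
    out_bit = 0
    for pos in range(32):
        if mask & (1 << pos):
            if value & (1 << pos):
                result |= 1 << out_bit
            out_bit += 1
    return result
```

With a mask of %0101...01 this reduces to the spread/squish case discussed earlier in the thread.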

Cluso99 · 2013-10-21 00:48

Here are Chip's revised P2 Instructions (16Oct2013) in excel format
P2_Instruction_16Oct2013.zip

cgracey · 2013-10-29 09:46

Well, things are moving along well, despite a few delays due to hard-to-find bugs after the Big Change.

Right now, I'm working on adding the new pixel blending modes to the texture mapper, and then I must revisit the logic controlling the auxiliary RAM from the cog side. After that, I'll deal with the synchronous shifter issue. Then, updated test suites need to be made.

Last night, in trying to discover the source of a bug, I was hard-coding some internal cog signals out to I/O pins so that I could observe things on the logic analyzer. This helped me immensely. It occurred to me that this type of thing could be standardized almost for free, as it takes just a few muxes. So, I added a SETRACE D/#n (set trace) instruction which outputs that cog's internal signals onto a selectable word (as in 16 bits) of I/O pins. The signals output are, from top down: Z, C, GO, COND, VALID, TASK[1:0], PC[8:0]. This way, you can see, in real-time (or capture through internal port D to AUX RAM or external SDRAM) the sequence of a cog's activity. VALID indicates whether the instruction hasn't been cancelled as branch-trailing code, COND shows condition, and GO is high whenever execution is proceeding and low when the pipeline is being stalled. The rest are what you'd expect: the flag states (Z and C), the task number (TASK), and the program counter (PC). When you need to see what a cog is doing, this really spills its guts. You could make a trace from another cog by having it wait for an internal (port D) edge event, then log so many clock cycles of activity, which can then be mapped back to the code that is known to be in the cog of interest. Anyway, it doesn't take any special code to operate; you just make 16 pins outputs, then do a SETRACE #word to start the outputting. This should be helpful to people who want to get an understanding of what's actually going on with their code at the clock-cycle level.
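
For anyone post-processing a capture on a PC, here is a minimal Python sketch that unpacks one 16-bit trace sample. It assumes "from top down" means Z is bit 15 and PC[8:0] occupies bits 8..0; the `TraceSample` type and function name are made up for illustration.

```python
from collections import namedtuple

# Field order follows the list above: Z, C, GO, COND, VALID, TASK[1:0], PC[8:0]
TraceSample = namedtuple("TraceSample", "z c go cond valid task pc")


def decode_trace_word(word):
    """Unpack one captured 16-bit trace sample (bit layout is an assumption)."""
    word &= 0xFFFF
    return TraceSample(
        z=(word >> 15) & 1,
        c=(word >> 14) & 1,
        go=(word >> 13) & 1,
        cond=(word >> 12) & 1,
        valid=(word >> 11) & 1,
        task=(word >> 9) & 0b11,
        pc=word & 0x1FF,
    )


# Example sample: Z=1, C=0, GO=1, COND=1, VALID=1, task 2, PC = 21
sample = decode_trace_word((1 << 15) | (1 << 13) | (1 << 12) | (1 << 11) | (2 << 9) | 21)
```

The fields add up to exactly 16 bits (five single-bit flags, two task bits, nine PC bits), so one sample fills the selected pin word.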

potatohead · 2013-10-29 09:52

!!!

That's excellent Chip!

Bill Henning · 2013-10-29 09:59

I think everyone will thank you for the update

I REALLY like your SETRACE instruction! It will make debugging much easier, as it will be possible to watch the PC and find out if a cog is stuck.

Here is an idea for the next shuttle run:

If a specific pin (P85? whichever is the highest "unused" pin) is found to be pulled low on startup, SETRACE the boot loader / monitor to, say, P64-P79... only adds 2 instructions to the "ROM"

The above, and a logic analyzer, would help verify the chip until it's fully tested; and since the bootloader/monitor source is published, it won't hurt to leave it in for a production run (to avoid a mask charge)

For development work, it will be fantastic - capturing the execution profile of 1..4 tasks, for post-capture analysis!

cgracey wrote: »

Well, things are moving along well, despite a few delays due to hard-to-find bugs after the big change.

Right now, I'm working on adding the new pixel blending modes to the texture mapper, and then I must revisit the logic controlling the auxiliary RAM from the cog side. After that, I'll deal with the synchronous shifter issue. Then, updated test suites need to be made.

Last night, in trying to discover the source of a bug, I was hard-coding some internal cog signals out to I/O pins so that I could observe things on the logic analyzer. This helped me immensely. It occurred to me that this type of thing could be standardized almost for free, as it takes just a few muxes. So, I added a SETRACE D/#n (set trace) instruction which outputs that cog's internal signals onto a selectable word (as in 16 bits) of I/O pins. The signals output are, from top down: Z, C, GO, COND, VALID, TASK[1:0], PC[8:0]. This way, you can see, in real-time (or capture through internal port D to AUX RAM or external SDRAM) the sequence of a cog's activity. VALID indicates whether the instruction hasn't been cancelled as branch-trailing code, COND shows condition, and GO is high whenever execution is proceeding and low when the pipeline is being stalled.

Jeff Martin · 2013-10-29 09:59

cgracey wrote: »

...This way, you can see, in real-time (or capture through internal port D to AUX RAM or external SDRAM) the sequence of a cog's activity.

Chip, that's cool! That feature could lead to debugging tools that are not possible today!

localroger · 2013-10-29 10:02

Way cool, Chip. SETRACE is like TRON but all growed up.

cgracey · 2013-10-29 10:08

localroger wrote: »

Way cool, Chip. SETRACE is like TRON but all growed up.

SETRACE is kind of like the NSA for cogs.

Sapieha · 2013-10-29 10:11

Hi Chip.

Nice debug facility. Can you add commands to operate it from the monitor?

cgracey wrote: »

Well, things are moving along well, despite a few delays due to hard-to-find bugs after the Big Change.

Right now, I'm working on adding the new pixel blending modes to the texture mapper, and then I must revisit the logic controlling the auxiliary RAM from the cog side. After that, I'll deal with the synchronous shifter issue. Then, updated test suites need to be made.

Last night, in trying to discover the source of a bug, I was hard-coding some internal cog signals out to I/O pins so that I could observe things on the logic analyzer. This helped me immensely. It occurred to me that this type of thing could be standardized almost for free, as it takes just a few muxes. So, I added a SETRACE D/#n (set trace) instruction which outputs that cog's internal signals onto a selectable word (as in 16 bits) of I/O pins. The signals output are, from top down: Z, C, GO, COND, VALID, TASK[1:0], PC[8:0]. This way, you can see, in real-time (or capture through internal port D to AUX RAM or external SDRAM) the sequence of a cog's activity. VALID indicates whether the instruction hasn't been cancelled as branch-trailing code, COND shows condition, and GO is high whenever execution is proceeding and low when the pipeline is being stalled. The rest are what you'd expect: the flag states (Z and C), the task number (TASK), and the program counter (PC). When you need to see what a cog is doing, this really spills its guts. You could make a trace from another cog by having it wait for an internal (port D) edge event, then log so many clock cycles of activity, which can then be mapped back to the code that is known to be in the cog of interest. Anyway, it doesn't take any special code to operate; you just make 16 pins outputs, then do a SETRACE #word to start the outputting. This should be helpful to people who want to get an understanding of what's actually going on with their code at the clock-cycle level.
