Propeller II update - BLOG

Yanomani · 2013-12-11 15:45

cgracey wrote: »

Thanks for the link.

I remember this now. If it could magically fill up any arbitrary set of holes, it would be super useful, but that would take a lot of gate time, as the hole-seeking would be sequential. If it were a contiguous field of bits, without interruption, it would be a lot simpler. It could be done.

I love the idea of ANY arbitrary bits. That would be the ultimate, but maybe not practical.

Chip

Perhaps I'm missing something, surely due to aging eyes and/or a big mistake of mine, but if I have understood Ahle2's proposal correctly, why couldn't it be represented by the following boolean expression?

(DEST) <== ((SRC) AND (MASK)) OR ((DEST) AND NOT(MASK))

Why should it be performed bit by bit, starting from the LSB, and not in parallel? To save a lot of gates?

Yanomani
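
For comparison with what Ahle2 is asking for, here is a minimal Python sketch of the expression above. The function name and the test values are illustrative only and do not come from the thread. Note that every selected SRC bit lands at the same bit position in DEST, which is why this form can be evaluated fully in parallel.

```python
MASK32 = 0xFFFFFFFF  # results are kept to one 32-bit long


def masked_merge(dest: int, src: int, mask: int) -> int:
    """DEST <== (SRC AND MASK) OR (DEST AND NOT MASK), all bits at once.

    Hypothetical helper: a selected SRC bit is copied to the SAME bit
    position in DEST, so no bit ever moves sideways.
    """
    return ((src & mask) | (dest & ~mask)) & MASK32


# Bits 4..7 and 16..19 come from src; every other bit keeps its dest value.
merged = masked_merge(0xAAAAAAAA, 0x12345678, 0x000F00F0)
print(hex(merged))  # 0xaaa4aa7a
```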

cgracey · 2013-12-11 15:53

Yanomani wrote: »

Chip

Perhaps I'm missing something, surely due to aging eyes and/or a big mistake of mine, but if I have understood Ahle2's proposal correctly, why couldn't it be represented by the following boolean expression?

(DEST) <== ((SRC) AND (MASK)) OR ((DEST) AND NOT(MASK))

Why should it be performed bit by bit, starting from the LSB, and not in parallel? To save a lot of gates?

Yanomani

What Ahle was proposing was a means by which you could distribute source bits into many non-aligned bits of the destination. It would be a super-useful mechanism, but the part about the destination bits being potentially non-contiguous makes it complicated.

Cluso99 · 2013-12-11 21:24

Chip,
Are you in a position to publish the new opcode format while you get it into pnut, etc?
I wouldn't mind getting the instruction decoding into my P2 Debugger (disassembler section) while I have a bit of time.
And of course, I am totally curious about what you have achieved this time.

Tks.

cgracey · 2013-12-11 22:46

Cluso99 wrote: »

Chip,
Are you in a position to publish the new opcode format while you get it into pnut, etc?
I wouldn't mind getting the instruction decoding into my P2 Debugger (disassembler section) while I have a bit of time.
And of course, I am totally curious about what you have achieved this time.
Tks.

Almost. There is one thing I'm trying to resolve yet.

evanh · 2013-12-12 01:16

cgracey wrote: »

What Ahle was proposing was a means by which you could distribute source bits into many non-aligned bits of the destination. It would be a super-useful mechanism, but the part about the destination bits being potentially non-contiguous makes it complicated.

Sounds messy ... like a cascaded priority encoder. O_o

Yanomani · 2013-12-12 03:57

cgracey wrote: »

What Ahle was proposing was a means by which you could distribute source bits into many non-aligned bits of the destination. It would be a super-useful mechanism, but the part about the destination bits being potentially non-contiguous makes it complicated.

Chip

Sorry for the late reply, but I've been trying hard to understand the operations intended as bits travel from SRC to DEST.

I'm not sure about the role played by MASK at all.
Are its one-valued bits intended to determine where to put SRC bits, taken one by one starting from SRC's LSB, into the DEST positions marked by the ones in MASK?

Or is it the zero-valued bits already present in DEST that determine, by being interpreted as "holes", where to put the bits extracted from SRC one by one, starting from its LSB? And in that case, what role do MASK's contents play at all?

Perhaps it's only my perception, or perhaps someone else is also having a hard time trying to understand the intended behavior.

evanh wrote: »

Sounds messy ... like a cascaded priority encoder. O_o

Maybe Ahle2 or someone else will want to dive in and help us "poor mortals" acquire a bit of understanding.

Yanomani

Ahle2 · 2013-12-12 05:19

Yanomani wrote: »

Are its one-valued bits, intended to determine where to put SRC bits, taken one-by-one, starting from SRC's LSB, into DEST positions.....?

Exactly!

* It would be very useful when setting and getting HW registers that are not aligned in a nice way.
* It would also be useful for all kinds of encoding, decoding, scrambling, CRC, LFSR, interlacing, interleaving etc... tasks.
* It would be useful in combination with setting/getting pins to spread/despread unaligned data busses.
* It would be useful for all kinds of graphics manipulations/blitting to known and unknown formats.

/Johannes
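
A rough Python sketch of the behavior Yanomani describes and Ahle2 confirms: SRC bits are consumed one by one from the LSB and dropped into the DEST positions where MASK has a one. The function name is made up for illustration, and the loop only models the result, not how hardware would compute it. It is similar in spirit to the PDEP instruction in x86 BMI2, except that here the unmasked DEST bits are preserved rather than zeroed.

```python
def deposit_bits(dest: int, src: int, mask: int) -> int:
    """Scatter successive SRC bits (LSB first) into the one-positions of MASK.

    Hypothetical model of the proposed operation; DEST bits where MASK
    is zero are left untouched.
    """
    src_index = 0  # next SRC bit to hand out
    for pos in range(32):
        if (mask >> pos) & 1:
            bit = (src >> src_index) & 1
            dest = (dest & ~(1 << pos)) | (bit << pos)
            src_index += 1
    return dest & 0xFFFFFFFF


# MASK selects bits 1, 4 and 7; SRC supplies 1, 0, 1 (LSB first).
print(hex(deposit_bits(0x00000000, 0b101, 0b10010010)))  # 0x82
print(hex(deposit_bits(0xFFFFFFFF, 0b101, 0b10010010)))  # 0xffffffef
```

The sequential dependency is visible in `src_index`: which SRC bit goes to a given position depends on how many mask ones sit below it, which is the "hole-seeking" Chip mentions.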

cgracey · 2013-12-12 05:54

Ahle2 wrote: »

Exactly!

* It would be very useful when setting and getting HW registers that are not aligned in a nice way.
* It would also be useful for all kinds of encoding, decoding, scrambling, CRC, LFSR, interlacing, interleaving etc... tasks.
* It would be useful in combination with setting/getting pins to spread/despread unaligned data busses.
* It would be useful for all kinds of graphics manipulations/blitting to known and unknown formats.

/Johannes

It could work, but it would be huge. There would likely need to be 32 different 32:1 muxes, each with a 5-bit selector value and an additional 1-bit S/D selector (32*6 flip-flops). The heavy lifting could be done when it gets configured, so that when used, the source data flows through the muxes to the destination data. This would be an awesome feature for any processor to have. It would take a massive amount of logic, unless some trick could be discovered.
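
The configure-then-use split Chip describes can be sketched in Python as follows. This is only a software model under my own assumptions: `configure_muxes` and `apply_muxes` are invented names, and each destination bit gets one (S/D select, 5-bit selector) pair, which gives the 32*6 = 192 bits of stored state he mentions.

```python
def configure_muxes(mask: int) -> list:
    """Do the sequential 'heavy lifting' once: one (use_src, selector) per DEST bit."""
    config = []
    next_src = 0  # index of the next SRC bit to assign; never exceeds 31
    for pos in range(32):
        if (mask >> pos) & 1:
            config.append((True, next_src))  # S/D = source, 5-bit selector
            next_src += 1
        else:
            config.append((False, 0))        # S/D = keep destination bit
    return config


def apply_muxes(config: list, dest: int, src: int) -> int:
    """Use phase: every DEST bit is an independent mux lookup."""
    result = 0
    for pos, (use_src, selector) in enumerate(config):
        bit = (src >> selector) & 1 if use_src else (dest >> pos) & 1
        result |= bit << pos
    return result


cfg = configure_muxes(0b10010010)
print(hex(apply_muxes(cfg, 0xFFFFFFFF, 0b101)))  # 0xffffffef
```

Once `cfg` exists, `apply_muxes` has no dependency between bit positions, which is the property that would let the hardware version run in a single pass.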

ctwardell · 2013-12-12 05:59

cgracey wrote: »

It could work, but it would be huge. There would likely need to be 32 different 32:1 muxes, each with a 5-bit selector value and an additional 1-bit S/D selector (32*6 flip-flops). The heavy lifting could be done when it gets configured, so that when used, the source data flows through the muxes to the destination data. This would be an awesome feature for any processor to have. It would take a massive amount of logic, unless some trick could be discovered.

I wonder if we really need a full 32 x 32; it seems like 8 x 32 would be useful in many cases and better than nothing.

So the 8 low order bits of the source could be mapped to any 8 of 32 destination bits.

C.W.

cgracey · 2013-12-12 06:06

ctwardell wrote: »

I wonder if we really need a full 32 x 32; it seems like 8 x 32 would be useful in many cases and better than nothing.

So the 8 low order bits of the source could be mapped to any 8 of 32 destination bits.

C.W.

Good observation that maybe 8 bits is enough. It would be very 'complete' with 32, though.

ctwardell · 2013-12-12 06:17

cgracey wrote: »

Good observation that maybe 8 bits is enough. It would be very 'complete' with 32, though.

I guess the other trade off could be time. Process some number of bits at a time and build up the result to be retrieved some number of cycles later.

Say maybe:

1 cycle later the low order 8 bits have been processed
2 cycles later the next 8 bits
3 cycles later the next 8 bits
4 cycles later the final 8 bits

C.W.

Bill Henning · 2013-12-12 06:41

cgracey wrote: »

Good observation that maybe 8 bits is enough. It would be very 'complete' with 32, though.

WARNING: SUGGESTION FOR P3 ONLY

It would need two cog registers - GALIN and GALOUT, and 34 registers (32 bits each) DW, DI, and DA0..DA31

It is a configurable logic block, with one cycle translation - think CPLD/FPGA "mega LUT"

The following would require 32x32 bit registers or flip-flops or memory cells

D = {D0..D31} ' the output (GALOUT)
S = {S0..S31} ' the input (GALIN)

DW = {DW0..DW31} ' 32 bit mask as to which bits in the destination may change
DI = {DI0..DI31} ' 32 bit inverse register, invert "result" before writing to destination bit (DW mask allowing)
DA = {DA0..DA31} ' AND terms: DAn is the 32 bit term register for Dn, its bits corresponding to S0..S31

IF DWn then Dn = ( (DAn[0] AND S0) OR (DAn[1] AND S1) OR ... OR (DAn[31] AND S31) ) XOR DIn

This basically turns it into a programmable gate array - think 32 in 32 out GAL

For even more flexibility, also produce inverse bits for Sn as Sn' but that would double the size of the gate array.

One of these massive LUT's per cog

I got the idea from IO pin or-ing, PAL's and GAL's.

It would be capable of arbitrary binary translation, CMM->cog 1 cycle decode, and much else.
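
To make the equation concrete, here is a small Python model of the block as I read it: DAn is taken to be a 32-bit register whose bit k gates input Sk, the gated inputs are ORed together, and the result is XORed with DIn and written only if DWn is set. The function name `gal_eval` and the bit-reversal example are illustrative assumptions, not something Bill specified.

```python
def gal_eval(s: int, d: int, dw: int, di: int, da: list) -> int:
    """Evaluate all 32 outputs; da holds 32 AND-term masks, one per output bit."""
    out = d
    for n in range(32):
        if (dw >> n) & 1:                        # DWn: may this output bit change?
            term = 1 if (da[n] & s) != 0 else 0  # OR of the S bits selected by DAn
            bit = term ^ ((di >> n) & 1)         # optional inversion via DIn
            out = (out & ~(1 << n)) | (bit << n)
    return out & 0xFFFFFFFF


# Example configuration: output bit n watches input bit 31-n (a bit reversal).
reverse_da = [1 << (31 - n) for n in range(32)]
print(hex(gal_eval(0x12345678, 0, 0xFFFFFFFF, 0, reverse_da)))  # 0x1e6a2c48
```

With one selected input per output this degenerates into the bit-routing network discussed earlier in the thread; selecting several inputs per output gives the OR terms that make it behave more like a GAL.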

SUGGESTION FOR P4

If there is room for multiple GAL blocks, add programmable instructions

GALxx D,S/#

Which uses D and S for the input and output, and runs it through GAL block XX.

Presto, programmable, user defined single cycle instructions! Fits in 4 stage pipeline.

REPS
nop
GAL1 D,S/#

could implement a single-cycle loop of a GAL instruction

REPS
nop
GAL1 D,S/#
GAL2 D,S/#
GAL3 D,S/#
GAL4 D,S/#

Four stage state machine

ctwardell · 2013-12-12 06:54

Bill,

That's a lot of flops, but a lot of feature too.

Of course if we get the process size down those flops won't be such a big deal.

Would be very nice to have.

The FPGA's for emulation are going to get cost prohibitive moving forward.

We will need to set up a timeshared target system that lets users test apps over the web.

C.W.

Bill Henning · 2013-12-12 07:01

Thanks C.W.

I saw Chip thinking of 32 muxes, and started thinking is there any way to simplify that... and came up with this.

It would basically add a small CPLD (well, more of a GAL) to every cog; with S acting like the flip-flops for each input, and D as the output registers. So many possibilities....

ctwardell wrote: »

Bill,

That's a lot of flops, but a lot of feature too.

Of course if we get the process size down those flops won't be such a big deal.

Would be very nice to have.

C.W.

pedward · 2013-12-12 08:34

I suggested to Chip a while ago that it would have been nice if the P2 shipped with a small LE CPLD on die.

I think that may be possible for the P3. The idea was intended to solve any of the "now I really need this one hardware op" kind of features.

I think having it hook into Port D and have configurable I/O would be the way to go.

Seairth · 2013-12-12 11:16

Maybe a simpler approach (though not as general purpose) to Ahle2's request would be an instruction like

SETFLD #offset, #bits, #n
MOVF reg, [#]val, #n

The idea is that you specify the offset (0-31) of the "field" and the number of bits (1-32) in that field. Then the MOVF would copy that many least-significant bits from the s-field (register or nine-bit literal) to the offset within the d-field. The "n" value would allow for up to "n" (whatever Chip could get away with) fields. This would, of course, require two additional registers for each supported field: a 5-bit register to store the shift/offset value and a 32-bit register to store the generated mask. The MOVF would be ((D & !mask) ^ ((S << offset) & mask)).
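
A quick Python sketch of that pair, assuming SETFLD stores the offset together with a mask generated from the bit count. The function names simply mirror the proposed mnemonics; how the stored fields would be indexed by #n is left out.

```python
def setfld(offset: int, bits: int) -> tuple:
    """Model of SETFLD: remember the offset and build the field mask."""
    mask = (((1 << bits) - 1) << offset) & 0xFFFFFFFF
    return offset, mask


def movf(d: int, s: int, field: tuple) -> int:
    """Model of MOVF: ((D & !mask) ^ ((S << offset) & mask))."""
    offset, mask = field
    # The two halves never share a set bit, so XOR acts like OR here.
    return ((d & ~mask) ^ ((s << offset) & mask)) & 0xFFFFFFFF


d_field = setfld(9, 9)                       # a 9-bit field starting at bit 9
print(hex(movf(0xFFFFFFFF, 0x0AB, d_field)))  # 0xfffd57ff
```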

Kerry S · 2013-12-12 12:07

Seairth wrote: »

Maybe a simpler approach (though not as general purpose) to Ahle2's request would be an instruction like

SETFLD #offset, #bits, #n
MOVF reg, [#]val, #n

The idea is that you specify the offset (0-31) of the "field" and the number of bits (0-31) in that field. Then the MOVF would copy that many least-significant bits from the s-field (register or nine-bit literal) to the offset within the d-field. The "n" value would allow for up to "n" (whatever Chip could get away with) fields. This would, of course, require two additional registers for each supported field: a 5-bit register to store the shift/offset value and a 32-bit register to store the generated mask. The MOVF would be ((D & !mask) ^ ((S << offset) & mask)).

<just an idea, don't laugh too hard>

What about setting a reusable register (PTRA?) first with the mask of bits to change, then call the SETFLD?

Example:

MOV reg?, MASK
SETFLD Dest, Source

Where source contains the values for the masked bits you want to update in dest. That keeps the standard INST DEST, SOURCE format intact. The programmer would just have to know to set reg to their mask before calling SETFLD.

That would be really handy for setting outputs that are grouped by control function, but not physically grouped, to new values without having to worry about messing up other output values. It's the kind of thing you would do with a ladder logic scripting system.

Seairth · 2013-12-12 12:45

Seairth wrote: »

SETFLD #offset, #bits, #n
MOVF reg, [#]val, #n

Also, the SETD, SETS, SETI, and SETX (even MOV) could be considered special cases of this, where the offset and mask are hard-coded. This might allow all of those instructions to use the same internal logic, thereby saving a little silicon (over having distinct instructions).

Yanomani · 2013-12-12 15:34

Ahle2 wrote: »

Exactly!

* It would be very useful when setting and getting HW registers that are not aligned in a nice way.
* It would also be useful for all kinds of encoding, decoding, scrambling, CRC, LFSR, interlacing, interleaving etc... tasks.
* It would be useful in combination with setting/getting pins to spread/despread unaligned data busses.
* It would be useful for all kinds of graphics manipulations/blitting to known and unknown formats.

/Johannes

Ahle2

Thanks for this wonderful, almost Mike Nelson-style, deep dive coaching!
At least I've stopped believing that time has passed me by fast enough to turn my brain into a pot of guava jam!

And I HATE guava jam! Blaaarghhh! It hurts my entire (and long-lasting) dentition! Too much sugar in there for me!

I was about to bet half my savings that you intend to do some very nice scramble/descramble routines with those instructions. It seems that it would have been a good bet, anyway!

Yanomani

tonyp12 · 2013-12-12 16:00

Are MovS, MovD, MovI still there?
If so, it should just be one single instruction where the assembler puts in the destination field location and width for you.
The real underlying instruction would be MovXY, where X is any value between 0-31 for the dest field location and Y is 0-31 for the width. (Possible to do?)

Cluso99 · 2013-12-12 16:41

tonyp12 wrote: »

Are MovS, MovD, MovI still there?
If so, it should just be one single instruction where the assembler puts in the destination field location and width for you.
The real underlying instruction would be MovXY, where X is any value between 0-31 for the dest field location and Y is 0-31 for the width. (Possible to do?)

They are now called SETS, SETD, SETI and a new one SETX for the 5 remaining bits.

cgracey · 2013-12-12 16:57

tonyp12 wrote: »

Are MovS, MovD, MovI still there?
If so, it should just be one single instruction where the assembler puts in the destination field location and width for you.
The real underlying instruction would be MovXY, where X is any value between 0-31 for the dest field location and Y is 0-31 for the width. (Possible to do?)

It would take 2x5 bits to express an offset and a field length. It could be done with two instructions: one that configures (SETFLD D/#,S/#) and another that moves (MOVFLD D,S,#). That would be really handy. After I get hub execution done, I'll look into this. We've been talking about this kind of thing for a while and the only practical way is to have contiguous bits.

rogloh · 2013-12-12 17:24

cgracey wrote: »

It would take 2x5 bits to express an offset and a field length. It could be done with two instructions: one that configures (SETFLD D/#,S/#) and another that moves (MOVFLD D,S,#). That would be really handy. After I get hub execution done, I'll look into this. We've been talking about this kind of thing for a while and the only practical way is to have contiguous bits.

Yes, this type of feature would be very useful. Currently, to copy a subset of bits from one longword into another, I often have to do an AND, ROT, ANDN, OR sequence, which starts to burn quite a lot of cycles in tight loops and also COG memory to hold the mask, unless the bits happen to align with MOVD, MOVS, MOVI placement and I can leverage those to assist me (which is not always the case). Your proposed approach essentially halves the instruction count, which would be very nice.

PS. Note that the sequence above also destroys the source register whose field is being copied. If you don't want to destroy it and need it for other things, you also need to do a MOV to a temp register first, thus requiring 5 instructions and probably also a mask or two if they are not in the lower 9 bits, so 6-7 COG memory locations get burned in many cases. Moving down to only 2 for this type of operation would be a real win.
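
The five-step sequence described here can be modeled in Python roughly as below. The helper names are invented, the comments show approximate Propeller 1 style mnemonics, and the rotate direction and amount depend on the fields involved, so treat it as a sketch of the instruction count rather than real PASM.

```python
M32 = 0xFFFFFFFF


def rol32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & M32


def copy_field_long_way(dest, src, src_mask, rot, dest_mask):
    tmp = src              # MOV  tmp, src        (keep src intact)
    tmp &= src_mask        # AND  tmp, src_mask   (isolate the field)
    tmp = rol32(tmp, rot)  # ROL  tmp, #rot       (line it up with dest)
    dest &= ~dest_mask     # ANDN dest, dest_mask (punch the hole)
    dest |= tmp            # OR   dest, tmp       (drop the field in)
    return dest & M32


# Copy bits 0..3 of src into bits 8..11 of dest.
print(hex(copy_field_long_way(0xFFFFFFFF, 0x0000000C, 0xF, 8, 0xF00)))  # 0xfffffcff
```

Besides the five instructions, `src_mask` and `dest_mask` each occupy a long whenever they do not fit in a 9-bit immediate, which is where the 6-7 locations come from.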

Ariba · 2013-12-12 17:33

cgracey wrote: »

It would take 2x5 bits to express an offset and a field length. It could be done with two instructions: one that configures (SETFLD D/#,S/#) and another that moves (MOVFLD D,S,#). That would be really handy. After I get hub execution done, I'll look into this. We've been talking about this kind of thing for a while and the only practical way is to have contiguous bits.

The problem with a configure and a move instruction is that this does not work well with multitasking. You will need to store the configuration for every task separately.

Andy

cgracey · 2013-12-12 17:57

Ariba wrote: »

The problem with a configure and a move instruction is that this does not work well with multitasking. You will need to store the configuration for every task separately.

Andy

True. We also need to break MOVFLD into SETFLD and GETFLD, and maybe have a ROLFLD, too, which would be like GETFLD, but would rotate the gotten field into D rather than just zero-extending it. Also, a RORFLD would really round things out. We'd be set then!

These things are so easy to do, but I'm mired deep in changes to support hub exec now. Please remind me later if you don't see this implemented.
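
The thread does not spell out the exact semantics, so the following Python sketch is only one possible reading. GETFLD is modeled as extracting a zero-extended field, and ROLFLD is guessed to shift D left by the field width and place the field in the vacated low bits. Both function names mirror the proposed mnemonics; the argument order is my own assumption.

```python
def getfld(s: int, offset: int, bits: int) -> int:
    """Extract `bits` bits of S starting at `offset`, zero-extended."""
    return (s >> offset) & ((1 << bits) - 1)


def rolfld(d: int, s: int, offset: int, bits: int) -> int:
    """ASSUMED semantics: shift D left by the field width, then insert the field."""
    return ((d << bits) | getfld(s, offset, bits)) & 0xFFFFFFFF


print(hex(getfld(0x0003FE00, 9, 9)))             # 0x1ff
print(hex(rolfld(0x00000001, 0x0000AB00, 8, 8)))  # 0x1ab
```

Under this reading, repeated ROLFLD calls would pack several fields into one long, low field last.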

Seairth · 2013-12-12 19:09

Ariba wrote: »

The problem with a configure and a move instruction is that this does not work well with multitasking. You will need to store the configuration for every task separately.

Instead, allow up to 4 field offset/masks to be stored. If you want to use one per task, you can. But you could also use them in any other combination. For instance, by setting each of the 4 fields to the I, X, D, and S fields, you effectively recreate SETI, SETX, SETD, SETS. Of course, if the instructions could be encoded to support up to 8 fields, then it would be possible to hard-code 4 of them to the I, X, D, and S fields. But that would be a lot of instructions! Maybe if you gave up the Z/C flags...

I guess we'll have to wait and see what Chip comes up with when he gets around to it.

tonyp12 · 2013-12-12 19:54

>Maybe if you gave up the Z/C flags...
No one uses the Z/C flags for MovS on Prop1 now, as Z is only set if the whole 32-bit dest result is zero; when you move nibbles you are not interested in what the resulting longs are 99% of the time.

Cluso99 · 2013-12-12 20:01

Chip,
For after you get the hubexec done...

I have again been thinking of better ways to utilise the cog/aux/wide/cache rams.

What if these memories were all built from 32 * Four-Port (3 read, 1 write, as the cog is now) LONG blocks.
Eight (8) of these blocks (32 Longs each) would represent 256 Longs (half the cog ram; and also the current size of aux ram).
Therefore 2 sets of 8 blocks would make the cog ram, and one set (maybe space for 2 sets?) would make the aux ram.
By being multiples of 8, this matches the WIDE width for R/W to/from Hub.

Now, the aux ram, when used for the video gen (dac side), there would be a spare read port for that. When not used for the
video gen side, those 3 ports could be paralleled with the I, S & D bus of the ALU. The write port would be muxed (shared)
between the WIDE Hub access (for reading from hub and writing to aux) and the W (the writeback) from the ALU. One of the
read ports would be muxed (shared) between the WIDE Hub access (for writing to the hub and reading from the aux).

Now we have interchangeable Aux and Cog blocks. Hopefully this would permit some simple methods to access the aux ram
just like we access the cog ram, by standard instructions such as AND, XOR, etc (because they would feed the ALU's I, S & D
read ports and W write port). We could access by using the AUGS and/or AUGD instructions by setting b9=1 ($2xx), or by
setting a permanent b9=1 register.

Could we now use the new Aux and Cog blocks to be the hubexec instruction cache?

I realise there would be some manual layout effort required by Beau. He has already done the 4 port blocks of memory, but these
would need to be modified into blocks of 32 longs for the x8 WIDE use.

This is just another idea that may make better use of the aux ram, particularly with the new WIDE 8*Long accesses and hubexec, etc.

rogloh · 2013-12-12 21:42

Cluso99 wrote: »

Chip,
For after you get the hubexec done...

I have again been thinking of better ways to utilise the cog/aux/wide/cache rams.

What if these memories were all built from 32 * Four-Port (3 read, 1 write, as the cog is now) LONG blocks.
Eight (8) of these blocks (32 Longs each) would represent 256 Longs (half the cog ram; and also the current size of aux ram).
Therefore 2 sets of 8 blocks would make the cog ram, and one set (maybe space for 2 sets?) would make the aux ram.
By being multiples of 8, this matches the WIDE width for R/W to/from Hub.

Now, the aux ram, when used for the video gen (dac side), there would be a spare read port for that. When not used for the
video gen side, those 3 ports could be paralleled with the I, S & D bus of the ALU. The write port would be muxed (shared)
between the WIDE Hub access (for reading from hub and writing to aux) and the W (the writeback) from the ALU. One of the
read ports would be muxed (shared) between the WIDE Hub access (for writing to the hub and reading from the aux).

Now we have interchangeable Aux and Cog blocks. Hopefully this would permit some simple methods to access the aux ram
just like we access the cog ram, by standard instructions such as AND, XOR, etc (because they would feed the ALU's I, S & D
read ports and W write port). We could access by using the AUGS and/or AUGD instructions by setting b9=1 ($2xx), or by
setting a permanent b9=1 register.

Could we now use the new Aux and Cog blocks to be the hubexec instruction cache?

I realise there would be some manual layout effort required by Beau. He has already done the 4 port blocks of memory, but these
would need to be modified into blocks of 32 longs for the x8 WIDE use.

This is just another idea that may make better use of the aux ram, particularly with the new WIDE 8*Long accesses and hubexec, etc.

Probably need to see a diagram to fully understand how you intend to connect/mux all these buses, but it appears like there is a slight flaw due to a W port conflict if you are doing hub instruction reading in hubexec mode and the ALU wants to write a result in the same clock cycle, given there is only a single write port to these memories. Seems you would have to stall something, split up the buses or have two write ports, right?

Cluso99 · 2013-12-12 21:58

rogloh wrote: »

Probably need to see a diagram to fully understand how you intend to connect/mux all these buses, but it appears like there is a slight flaw due to a W port conflict if you are doing hub instruction reading in hubexec mode and the ALU wants to write a result in the same clock cycle, given there is only a single write port to these memories. Seems you would have to stall something, split up the buses or have two write ports, right?

We already have precisely the same conflict now. There would need to be a stall when there is a conflict.

However, if the blocks are all identical, then swapping blocks in and out becomes a much simpler reality and would probably result in much better usage of the various ram blocks.
I am just putting it out there in the hope of spurring some simple ideas/concepts that may result in yet another significant improvement.
