Propeller II update - BLOG

KC_Rob · 2013-12-03 13:31

Heater. wrote: »

And to think. All I ever wanted from a PII originally was an order of magnitude faster execution speed and 256K RAM and 64 pins....

Amen. And the discussion has in many ways come full circle - eg, all the recent talk about adding hub memory.

I don't visit the forums for a few days and things go off the rails - once again. My, what a flurry! lol

Cluso99 · 2013-12-03 14:01

I was going to keep this until a little later, but the pace has been frenetic overnight, and the hub execute-in-place discussion, together with Bill's comments, makes this more relevant now.

I put forward a proposal that the AUX & QUAD-CACHE be redone as AUX 32*8*longs, enabling a RD/WRWIDE to read/write 8*Longs straight into any of the 32 AUX 8*Long blocks in 1 hub clock. Assuming this comes to pass...

The RDWIDE[C] [#]D/PTR instruction becomes...

RD/WRWIDE [#]D/PTR,#bbb,#nnnnn [WC]
where:
#bbb: is the starting block# (0-31) of the 8*Long Aux Ram block (it will wrap if necessary)
#nnnnn: is the count of 8*Long hub transfers to perform
WC: is set to stall the cog until this/these transfer(s) complete

RD/WRWIDE is run by a tiny state m/c so that the transfer can be done in the background and the cog can continue processing. WC controls whether the cog can continue or not.
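As a rough illustration of the wrap-around behaviour described above, here is a minimal Python sketch of the read direction. It is only a model of the proposal, not of any real silicon: the names `rdwide`, `hub` and `aux`, and the choice to address the hub in longs, are assumptions made for the example. It also ignores the background/stall (WC) aspect and simply shows which AUX blocks end up filled.

```python
# Hypothetical model of the proposed multi-block RDWIDE (names are illustrative).
AUX_BLOCKS = 32        # proposal: AUX organised as 32 blocks
LONGS_PER_BLOCK = 8    # each block holds 8 longs (one hub "wide")

def rdwide(hub, aux, hub_long_addr, start_block, count):
    """Copy `count` 8-long groups from hub into AUX, wrapping the block index."""
    touched = []
    for i in range(count):
        block = (start_block + i) % AUX_BLOCKS      # "it will wrap if necessary"
        src = hub_long_addr + i * LONGS_PER_BLOCK
        aux[block] = list(hub[src:src + LONGS_PER_BLOCK])
        touched.append(block)
    return touched

# Usage: 4 transfers starting at block 30 wrap around to blocks 0 and 1.
hub = list(range(64))
aux = [[0] * LONGS_PER_BLOCK for _ in range(AUX_BLOCKS)]
blocks = rdwide(hub, aux, 0, 30, 4)
print(blocks)
```

In the proposal a small state machine would do this one block per hub slot while the cog keeps running; the loop above only captures the addressing.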

We will now have two methods to increase slot bandwidth (providing Chip still implements it): one using donor cog(s), and another using any free slots. If you just presume a donor slot of 1 extra slot, making 1:4, then at 200MHz you can transfer up to 1KB in 2*8*Long/8 transfers per hub cycle = 1KB @ 1600MB/s.
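The 1600MB/s figure can be checked with a few lines of Python. This is a back-of-envelope sketch that assumes an 8-clock hub cycle, 8 longs (32 bytes) moved per slot, and MB meaning 10^6 bytes; the function names are made up for the example.

```python
# Back-of-envelope check of the hub bandwidth figure (assumptions in comments).
CLOCK_HZ = 200_000_000       # 200MHz system clock, as in the post
CLOCKS_PER_HUB_CYCLE = 8     # assumed: one full hub rotation takes 8 clocks
BYTES_PER_WIDE = 8 * 4       # one slot moves 8 longs = 32 bytes

def hub_bandwidth_mb_per_s(slots_per_hub_cycle):
    """Bandwidth in MB/s (MB = 10**6 bytes) for a cog owning this many slots."""
    bytes_per_hub_cycle = slots_per_hub_cycle * BYTES_PER_WIDE
    hub_cycles_per_s = CLOCK_HZ // CLOCKS_PER_HUB_CYCLE
    return bytes_per_hub_cycle * hub_cycles_per_s // 1_000_000

def clocks_for_bytes(n_bytes, slots_per_hub_cycle):
    """Clocks needed to move n_bytes, rounding up to whole hub cycles."""
    transfers = -(-n_bytes // BYTES_PER_WIDE)            # ceiling division
    hub_cycles = -(-transfers // slots_per_hub_cycle)
    return hub_cycles * CLOCKS_PER_HUB_CYCLE

print(hub_bandwidth_mb_per_s(1))   # own slot only (1:8)
print(hub_bandwidth_mb_per_s(2))   # own slot plus one donor slot (1:4)
print(clocks_for_bytes(1024, 2))   # clocks to move 1KB at 1:4
```

Under these assumptions a single slot gives half the quoted rate, and 1KB at 1:4 takes 32 transfers spread over 16 hub cycles.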

VIDEO GEN:
Being able to transfer 1KB with a non-blocking instruction allows the cog to execute the RDWIDE instruction to transfer a whole 1KB using only 1 clock to start the transfer. So together with the waitvid and a few frame instructions, the whole generation is done (per 1KB of display data) in a few clocks, leaving the majority for some really serious other processing.
In fact, for P3 with a few more registers, the complete video frame could be generated with a state machine, leaving the cog completely free (without AUX) to process anything else.

LMM:
By being able to transfer large blocks up to 1KB in the background, LMM would be significantly enhanced. If "n" blocks of AUX could be windowed into cog, simulated execute in place (with the appropriate caveats and/or tweaks) could become a reality.
LMM would probably use only a couple of 8*Long blocks, leaving the remaining AUX to be used as stack space. The remaining cog space would be used for variables.

In the old LMM method, FJMP & FCALL instructions were followed by a NOP a..a holding the hub address. It could be beneficial to change the sequence to NOP a..a followed by FJMP & FCALL instructions. The NOP, FJMP and FCALL instructions might now become dedicated LMM instructions to save yet more instruction execution space/time.
BTW I only took a cursory look at Bill and others info about the execution model.

COG PASM PROGRAMS:
Being able to transfer large blocks of data between Hub and Aux using the non-blocking RD/WRWIDE instruction would provide an enormous speed and space improvement. Video update, GUI and game cogs would benefit greatly from these methods.

There are a few possibilities that could help significantly...

(1) The use of AUXA/AUXB pointers located at $1F0..$1F1. This would work precisely as does INDA/INDB, but would point to AUX Ram instead of Cog Ram.
All existing instructions would benefit from this.

(2) Windowing Aux Ram blocks into Cog Ram blocks would aid by instructions being able to work on the Aux directly for data usage, removing the requirement to transfer back and forth between Aux Ram and Cog Ram.

(3) A much better solution would be to allow a special instruction to "switch" the use of the top bit of the resultant D & S address bits ("x"=bit 8 of x_aaaa_aaaa) such that addresses $000..$0FF would use Aux Ram $00..$FF and addresses $100..$1FF would use Cog Ram.

This would permit all instructions to use Aux Ram $00..$FF as variable space, while still retaining the full Cog Ram as instruction space (shared with cog variable space anywhere above $100). The caveat here is that self-modifying instructions in the lower cog space of $000..$0FF would not be possible.
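To make the mapping in (3) concrete, here is a small Python sketch of how a 9-bit D or S address might be resolved once the "switch" is enabled. The function name `resolve` and the flag `aux_low_enabled` are invented for the example; the proposal itself does not name them.

```python
# Hypothetical address decode for proposal (3); names are illustrative only.
def resolve(addr, aux_low_enabled):
    """Return (memory, offset) for a 9-bit D/S address x_aaaa_aaaa."""
    if not 0 <= addr <= 0x1FF:
        raise ValueError("D/S addresses are 9 bits")
    top_bit = (addr >> 8) & 1                # the "x" bit in x_aaaa_aaaa
    if aux_low_enabled and top_bit == 0:
        return ("AUX", addr & 0xFF)          # $000..$0FF -> Aux Ram $00..$FF
    return ("COG", addr)                     # $100..$1FF (or switch off) -> Cog Ram

print(resolve(0x012, True))    # data access lands in Aux Ram
print(resolve(0x112, True))    # upper half still reaches Cog Ram
print(resolve(0x012, False))   # with the switch off, everything is Cog Ram
```

Instruction fetch would presumably keep using Cog Ram for the whole range, which is why self-modifying code in $000..$0FF stops working: the write would go to Aux Ram instead of the instruction.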

This would function similar to the way the shadow registers function on the P1 where it is possible to run sw from the shadow registers.

A different mapping scheme might also be possible by using a more complex enable instruction.

BTW I am going to dual post this in both Propeller II update and Hub Execution Model threads.

Bill Henning · 2013-12-03 14:17

I like your proposal for loading AUX quickly, however LMM would be better replaced with HUBEXEC - which would also simplify compiler writing.

BUT

RDOCTLAUX
WROCTLAUX

would be really, really nice for high speed graphics (1080p anyone?)

Chip may be able to implement it using the XFR path to the CLUT, assuming it is feasible of course.

Ym2413a · 2013-12-03 14:59

KC_Rob wrote: »

Amen. And the discussion has in many ways come full circle - eg, all the recent talk about adding hub memory.

Yeah it's pretty funny in that sense. I remember already 5 or more years ago when the debate of "More COGS or More RAM" first started I was 100% for the 256Kb 8Cog version of the P2... and that's where we are now.

ctwardell · 2013-12-03 15:51

This is now getting me worried, the changes are getting pretty 'out there' and are going to need some serious testing. I'd hate to see the P2 ship with an errata the size of a phone book.

I'm not sure what the date is to get it to the foundry, but once Chip is done with the changes it will need to go through synthesis and then Beau will then need to do some work to tie the core to the parts that don't get synthesized.

Ideally there would be at least a month or preferably more of testing with the FPGAs once Chip's changes are complete, and that should leave time for bug fixes. That has to happen before it goes off to synthesis.

It sounds like it's too big for the DE0-nano now so the number of available testers will drop, and it makes little sense to buy a DE2-115 at this point since a different FPGA platform will likely be used for the P3.

C.W.

Cluso99 · 2013-12-03 16:00

Posted on Hub Execution Model thread...

cgracey wrote: »

Sorry I haven't been contributing. After two hours of sleep, I had to get up to tend to some things and I haven't been back to sleep. My brain won't be sharp until I sleep again. I'm contemplating how this will all work, as well as I can. I'm kind of fuzzy on a few things. I need to draw some diagrams of how things work.

Too much buzzing around inside your head! You do need to get some well earned sleep.

A quick question:
What do DECOD3/4/5 do? Do these instructions (and ENCOD) require WZ and WC? If not, could they be folded into 1 opcode, freeing up 2 valuable full D&S instructions?

K2 · 2013-12-03 16:11

ctwardell wrote: »

It sounds like it's too big for the DE0-nano now so the amount of available testers will drop...

That is a great point. Wish I had a DE2-115. OTOH it is reassuring that ariba, ozpropdev, nutson, and others do have them.

jmg · 2013-12-03 16:16

ctwardell wrote: »

It sounds like it's too big for the DE0-nano now so the number of available testers will drop, and it makes little sense to buy a DE2-115 at this point since a different FPGA platform will likely be used for the P3.

It might still pack into a DE0, optimised for size - time will tell ?

There are a couple of FPGA alternatives in this thread

http://forums.parallax.com/showthread.php/150849-BeMicro-CV-FPGA-Board-for-P2

- the smallest has much more RAM than DE0 (but will still be under 256k?) and ~ 13% more Logic, and is one generation newer, so should build to higher MHz ?
The larger one is ~ 3 COGs and >> 256K RAM, at ~ $179

- but those Cyclone V Boards do need new FPGA builds.

ctwardell · 2013-12-03 16:25

K2 wrote: »

That is a great point. Wish I had a DE2-115. OTOH it is reassuring that ariba, ozpropdev, nutson, and others do have them.

Yes, but they need final code so they can start using them, currently everything is too much of a moving target.

There seems to be a feeling that it's good to go ahead and talk these ideas out that may go in the P2, but more likely the P3.

These ideas are like puzzles for Chip to solve and I'm sure he loves solving them, but that isn't getting the SERDES designed or any of the other things that need finalized.

Too damn many squirrels to chase and only one hunter, and I fear we are wearing him out.

C.W.

Ym2413a · 2013-12-03 20:03

Knowing right when to stop engineering and start selling is the hardest part of any project...
Part of the reason why I didn't jump on the FPGA bandwagon was I had a gut feeling the P2 was going to change a lot during development. Looking back now, I kind of wished I did get a FPGA board.
You guys were able to get so much input into the design process by being first to the gate!

KC_Rob · 2013-12-03 20:51

Ym2413a wrote: »

Knowing right when to stop engineering and start selling is the hardest part of any project...

I once worked at a place where there was a "sign" in the engineering area which said something to the effect that there comes a time in every project when the engineers must be killed so the damn thing can be shipped. I'm of course opposed to killing engineers, yours truly in particular, or anyone else for that matter, but I was able to appreciate the underlying point.

potatohead · 2013-12-03 21:12

Yes, but they need final code so they can start using them, currently everything is too much of a moving target.

Yep. Ideally, this change takes the time it takes, and then we have a fairly stable FPGA to jam on for a while.

ozpropdev · 2013-12-03 21:40

potatohead wrote: »

Yep. Ideally, this change takes the time it takes, and then we have a fairly stable FPGA to jam on for a while.

I'm excited and VERY confident that the next FPGA release will rapidly be referred to as "the one".
Can't wait for the next JAM session!

Ozpropdev

potatohead · 2013-12-03 21:50

Me neither.

I kind of missed out on this one. Good times ahead for sure!

KeithE · 2013-12-03 21:56

KC_Rob wrote: »

I once worked at a place where there was a "sign" in the engineering area which said something to the effect that there comes a time in every project when the engineers must be killed so the damn thing can be shipped. I'm of course opposed to killing engineers, yours truly in particular, or anyone else for that matter, but I was able to appreciate the underlying point.

Search on google for "go ahead make one more change" and the sign can be yours ;-)

cgracey · 2013-12-04 01:40

ctwardell wrote: »

It sounds like it's too big for the DE0-nano now so the number of available testers will drop, and it makes little sense to buy a DE2-115 at this point since a different FPGA platform will likely be used for the P3.

We can disable CTRB to free up quite a few LE's. It will probably still fit.

cgracey · 2013-12-04 01:54

You guys have lots of great ideas. I realize from reading this thread that I need to add two more modes to XFR: WIDEs-to-AUX and AUX-to-WIDEs. This would facilitate 8-long transfers between them. Also, I need to make this work by automatically issuing 'WRWIDE/RDWIDE PTRA++' instructions in the background. There may be reason, too, to add such automation to XFR's pins-to-AUX and AUX-to-pins, as well as to pins-to-WIDEs and WIDEs-to-pins. I just need to upgrade XFR to do more... after I get a little more sleep.

ozpropdev · 2013-12-04 01:58

cgracey wrote: »

You guys have lots of great ideas. I realize from reading this thread that I need to add two more modes to XFR: WIDEs-to-AUX and AUX-to-WIDEs. This would facilitate 8-long transfers between them. Also, I need to make this work by automatically issuing 'WRWIDE/RDWIDE PTRA++' instructions in the background. There may be reason, too, to add such automation to XFR's pins-to-AUX and AUX-to-pins, as well as to pins-to-WIDEs and WIDEs-to-pins. I just need to upgrade XFR to do more... after I get a little more sleep.

WIDE to AUX is a big one!
Lots to be gained there.

Ariba · 2013-12-04 02:03

K2 wrote: »

That is a great point. Wish I had a DE2-115. OTOH it is reassuring that ariba, ozpropdev, nutson, and others do have them.

Just want to let you know that I don't have a DE2-115. And I'm not willing to buy one, so if we don't get something for the DE0 I can't help with testing anymore.
Besides the drop in testers, I'm also concerned that what does not fit in a DE0 may also not fit 8 times, together with 256K of hub RAM, on the available die.

I think we don't really need HUBEXecute because with a 8 long read cache we get anyway a faster LMM speed (up to 1/2..1/3 native speed). Such fast execution makes most sense with a big program memory, 256kB may not be enough (that's 64k instructions). We would need fast execution from the big external RAM.

Andy

Invent-O-Doc · 2013-12-04 07:24

It looks like some recent decisions will allow for a big improvement in the capabilities of P2 at the cost of a few minor inconveniences. That's great. At some point, SOON, the design needs to be frozen so the company guys can go ahead and make something to sell. I don't want to buy an FPGA. I want to buy some chips.

K2 · 2013-12-04 07:45

Ariba wrote: »

Just want to let you know that I don't have a DE2-115.

Thanks for the clarification. Amazing that you and ozpropdev were so instrumental in the last debug, and you both did it with a DE0! What can I say but, "I'm not worthy!"

KC_Rob · 2013-12-04 08:51

KeithE wrote: »

Search on google for "go ahead make one more change" and the sign can be yours ;-)

Ha! Yes, the image makes the idea even more explicit.

Ym2413a · 2013-12-04 11:39

Invent-O-Doc wrote: »

SOON, the design needs to be frozen so the company guys can go ahead and make something to sell. I don't want to buy an FPGA. I want to buy some chips.

I want some chips too. : ]
I might give in and get the small FPGA board though to get a little jump start on development.

ozpropdev · 2013-12-04 12:38

K2 wrote: »

Thanks for the clarification. Amazing that you and ozpropdev were so instrumental in the last debug, and you both did it with a DE0! What can I say but, "I'm not worthy!"

For the record, I'm testing on both DE0-Nano and DE2-115 platforms back to back.

Circuitsoft · 2013-12-04 19:47

JRetSapDoog wrote: »

IIRC, recently, Chip mentioned the possibility of baking in some kind of true (or reasonably true) random number generator (not the pseudo kind).

The CPUs that Via makes digitize thermal noise to produce 20MB/sec of random data. Would that be doable?

Cluso99 · 2013-12-04 20:02

Chip,
At the risk of getting shot by others here on the forum, here are the instruction fixes and adds that have been proposed and that you were interested in.
While I scoured the recent forums, apologies to anything I missed.
At least this puts it in one place for you. It's up to you to do with this what you like.

BTW I have left the SERDES out of this.

Possible instruction fixes/changes/suggestions/additions...
=======================================================================================================
Here is a possible fix required:
WAITCNT
Thread: [URL]http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222701&viewfull=1#post1222701[/URL]
=======================================================================================================
Reason: Add new pin-pair instruction for use with USB bit-banging receive (similar to GETP/GETNP)
        The S value (sub-instruction bits) "yyyyyyyy" would use the next available slot after CACHEX
Thread: [URL]http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222515&viewfull=1#post1222515[/URL]
1111111 ZC L CCCC DDDDDDDDD xyyyyyyyy       GETXP   [#]D [WZ],[WC]  ' set flags for the pin-pair for usb bit-banging  
                                                                    '   D = PINx (0..127), PINy := PINx XOR $1 (it's complementary pin-pair)
                                                                    '   C = C XOR PINx via WC
                                                                    '   Z = !(PINx OR PINy) via WZ (ie ZERO if both PINx and PINy are both ZERO == SE0 in USB)
PINx and PINy are a pair of pins. If PINx is even then PINy := PINx + 1 else if PINx is odd then PINy := PINx - 1
The allowance for the PINx/PINy pair to be reversed is for USB LS & HS where J/K are effectively swapped between D-/D+.
WZ & WC would normally be used.
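The pin-pair rule and the proposed GETXP flag results can be sketched in Python as below. This is only a model of the description above: `pair_of` and `getxp` are names invented for the example, and `pins` is a plain list of 0/1 levels standing in for the real pin inputs.

```python
# Hypothetical model of the proposed GETXP flag logic (names are illustrative).
def pair_of(pin_x):
    """PINy := PINx XOR 1, i.e. even pins pair upward and odd pins pair downward."""
    return pin_x ^ 1

def getxp(pins, pin_x, c_in):
    """Return (Z, C) as described: C = C XOR PINx, Z = !(PINx OR PINy)."""
    x = pins[pin_x]
    y = pins[pair_of(pin_x)]
    c_out = c_in ^ x
    z_out = 0 if (x or y) else 1      # Z set only when both lines are low (SE0)
    return z_out, c_out

pins = [0] * 128
pins[4], pins[5] = 1, 0               # one differential state on pins 4/5
print(getxp(pins, 4, 0))              # naming pin 4 as PINx
print(getxp(pins, 5, 0))              # naming pin 5 swaps the roles
pins[4] = 0
print(getxp(pins, 4, 1))              # both low: SE0 detected, C unchanged
```

Naming the other pin of the pair as PINx is what gives the reversal mentioned for low-speed versus high-speed USB.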
=======================================================================================================
Reason: Add new instruction(s) for calculating/accumulating CRC for 1-bit using the Polynomial set in "ACCA"
        The S value (sub-instruction bits) "yyyyyyyy" would use the next available slot after CACHEX
        
Thread: [URL]http://forums.parallax.com/showthread.php/151992-CRC-generation?p=1222728&viewfull=1#post1222728[/URL]
1111111 xx x CCCC DDDDDDDDD xyyyyyyyy       CRCBIT  D   ' accumulate CRC
                                                        '   C    = current data bit (to be accumulated)
                                                        '   D    = CRC Register
                                                        '   ACCA = polynomial
The CRCBIT instruction performs the following...
(1) X := C XOR D[0]
(2) D := D >> 1
(3) if X == 1 then D := D XOR ACCA
Alternately, a special register to hold the polynomial "POLY" could be used, requiring the instruction(s)
1111111 x0 x xxxx DDDDDDDDD xyyyyyyyy       CRCBIT  D   ' accumulate CRC
1111111 x1 x xxxx DDDDDDDDD xyyyyyyyy       SETPOLY D   ' set the polynomial to be used in 
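The three CRCBIT steps listed above are the usual bit-serial, LSB-first (reflected) CRC update. The Python sketch below models that single step and then, as an assumed usage example, drives it with the standard CRC-32 parameters (reflected polynomial $EDB88320, initial value and final XOR of $FFFFFFFF) so the result can be compared with `zlib.crc32`. The helper names are made up, and the choice of CRC-32 is only an illustration; the proposal leaves the polynomial to whatever is loaded into ACCA (or POLY).

```python
# Model of the proposed CRCBIT step, checked against zlib's CRC-32.
import zlib

def crcbit(d, c, poly):
    """One CRCBIT step: X := C XOR D[0]; D := D >> 1; if X then D := D XOR poly."""
    x = (c ^ d) & 1
    d >>= 1
    if x:
        d ^= poly
    return d

def crc32_bitwise(data):
    """CRC-32 built from repeated CRCBIT steps, feeding each byte LSB first."""
    d = 0xFFFFFFFF                               # assumed CRC-32 initial value
    for byte in data:
        for bit in range(8):
            d = crcbit(d, (byte >> bit) & 1, 0xEDB88320)
    return d ^ 0xFFFFFFFF                        # assumed CRC-32 final XOR

message = b"123456789"
assert crc32_bitwise(message) == zlib.crc32(message)
print(hex(crc32_bitwise(message)))
```

In a bit-banged receiver the data bit would already be sitting in C from a pin read, so one such step per received bit keeps the CRC current without a table lookup.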
=======================================================================================================
Reason: Add new pin-pair variants for use with complementary/differential I/O 2 wire protocols
Thread: [URL]http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222689&viewfull=1#post1222689[/URL]

For reference only...
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00111000           SETZC   D/#             (D[1:0] into Z/C via WZ/WC)
                                                            presume this really means...(D[1:0] into !Z/C via WZ/WC)
Currently
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00110000           GETP    D/#             (pin into !Z/C via WZ/WC)
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00110001           GETNP   D/#             (pin into Z/!C via WZ/WC)
--L-            1111111 xx L CCCC DDDDDDDDD x10011000           OFFP    D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011001           NOTP    D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011010           CLRP    D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011011           SETP    D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011100           SETPC   D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011101           SETPNC  D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011110           SETPZ   D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011111           SETPNZ  D/#
Replace with...
ZCL-            1111111 00 L CCCC DDDDDDDDD x00110000           GETPP   D/#     (pin-pair PINy:PINx into !Z/C)
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00110000           GETP    D/#             (pin into !Z/C via WZ/WC)
ZCL-            1111111 00 L CCCC DDDDDDDDD x00110001           GETNPP  D/#     (pin-pair PINy:PINx into Z/!C)
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00110001           GETNP   D/#             (pin into Z/!C via WZ/WC)
These could share opcodes???
--L-            1111111 00 L CCCC DDDDDDDDD x10011000           OFFP    D/#             (pin#=0???  , dir#=0)
--L-            1111111 01 L CCCC DDDDDDDDD x10011000           NOTP    D/#             (pin#=!pin# , dir#=1)
--L-            1111111 10 L CCCC DDDDDDDDD x10011000           CLRP    D/#             (pin#=0     , dir#=1)
--L-            1111111 11 L CCCC DDDDDDDDD x10011000           SETP    D/#             (pin#=1     , dir#=1)
These could share opcodes???
--L-            1111111 00 L CCCC DDDDDDDDD x10011001           SETPC   D/#             (pin#=C     , dir#=1)
--L-            1111111 01 L CCCC DDDDDDDDD x10011001           SETPNC  D/#             (pin#=!C    , dir#=1)
--L-            1111111 10 L CCCC DDDDDDDDD x10011001           SETPZ   D/#             (pin#=Z     , dir#=1)
--L-            1111111 11 L CCCC DDDDDDDDD x10011001           SETPNZ  D/#             (pin#=!Z    , dir#=1)
New pin-pair instructions...(could use x10011010-x10011111 if freed above, or use new sub-opcodes avail following CACHEX)
--L-            1111111 00 L CCCC DDDDDDDDD x10011010           OFFPP   D/#     (pin-pair PINy:PINx=00???       , dir#=00)
--L-            1111111 01 L CCCC DDDDDDDDD x10011010           NOTPP   D/#     (pin-pair PINy:PINx=!PINy:!PINx), dir#=11)
--L-            1111111 10 L CCCC DDDDDDDDD x10011010           CLRPP   D/#     (pin-pair PINy:PINx=00          , dir#=11)
--L-            1111111 11 L CCCC DDDDDDDDD x10011010           SETPP   D/#     (pin-pair PINy:PINx=11          , dir#=11)
--L-            1111111 00 L CCCC DDDDDDDDD x10011011           SETPPLH D/#     (pin-pair PINy:PINx=01          , dir#=11)
--L-            1111111 01 L CCCC DDDDDDDDD x10011011           SETPPHL D/#     (pin-pair PINy:PINx=10          , dir#=11)
                                                                  Note: SETPPHL could be achieved by using SETPPLH PINy
I don't really see the need for these 2, but put it here in case you think it desirable...
--L-            1111111 10 L CCCC DDDDDDDDD x10011011           SETPPZC D/#     (pin-pair PINy:PINx=!Z/C        , dir#=1)
--L-            1111111 11 L CCCC DDDDDDDDD x10011011           SETPPNF D/#     (pin-pair PINy:PINx=Z/!C        , dir#=1)
D/# specifies PINx (0..127). PINy := PINx XOR #1 (ie its twin pin-pair)
 (ie PINx and PINy are a pair of pins. If PINx is even then PINy := PINx + 1 else if PINx is odd then PINy := PINx - 1)
=======================================================================================================
Reason: Combine to use 1 instruction with variants
        Frees up opcodes 1000000 & 1000001
        Remove WZ/WC options
        Providing ENCOD can remove WZ option, it can move from 1000011,
         freeing BLMASK to share with another instruction variant
        
Currently...
ZCWS            1000000 ZC I CCCC DDDDDDDDD SSSSSSSSS           DECOD3  D,S/#
ZCWS            1000001 ZC I CCCC DDDDDDDDD SSSSSSSSS           DECOD4  D,S/#
ZCWS            1000010 ZC I CCCC DDDDDDDDD SSSSSSSSS           DECOD5  D,S/#
Z-WS            1000011 Z0 I CCCC DDDDDDDDD SSSSSSSSS           ENCOD   D,S/#   (shared with BLMASK)

Replace with...
--WS            1000010 00 I CCCC DDDDDDDDD SSSSSSSSS           DECOD3  D,S/#
--WS            1000010 01 I CCCC DDDDDDDDD SSSSSSSSS           DECOD4  D,S/#
--WS            1000010 10 I CCCC DDDDDDDDD SSSSSSSSS           DECOD5  D,S/#
--WS            1000010 11 I CCCC DDDDDDDDD SSSSSSSSS           ENCOD   D,S/#   
=======================================================================================================
Reason: Combine to use 1 instruction with variants
        May facilitate later use of opcode 1111110
Currently...        
-----------------------------------------------------------------------------------------------------
1111110 10 n nnnn nnnnnnnnn nnniiiiii        REPS    #n,#i   'execute 1..64 inst's 1..131072 times  1
1111111 00 0 CCCC 111111111 001iiiiii        REPD    #i      'execute 1..64 inst's infinitely       1
1111111 00 0 CCCC DDDDDDDDD 001iiiiii        REPD    D,#i    'execute 1..64 inst's D+1 times        1
1111111 00 1 CCCC nnnnnnnnn 001iiiiii        REPD    #n,#i   'execute 1..64 inst's 1..512 times     1
-----------------------------------------------------------------------------------------------------
Replace with...
        fL *                                                 ' *=infinitely
1111111 00 0 xxxx DDDDDDDDD 001iiiiii        REPS    D,#i    'execute 1..64 inst's D+1 times        1+1
1111111 00 1 xxxx xxxxxxxxx 001iiiiii        REPS    #i      'execute 1..64 inst's infinitely       1+1
1111111 01 n nnnn nnnnnnnnn 001iiiiii        REPS    #n,#i   'execute 1..64 inst's 1..16384 times   1+1
1111111 10 0 CCCC DDDDDDDDD 001iiiiii        REPD    D,#i    'execute 1..64 inst's D+1 times        1+3
1111111 10 1 CCCC xxxxxxxxx 001iiiiii        REPD    #i      'execute 1..64 inst's infinitely       1+3
1111111 11 0 CCCC nnnnnnnnn 001iiiiii        REPD    #n,#i   'execute 1..64 inst's 1..512 times     1+3
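The repeat-count ranges quoted in these listings follow directly from how many "n" bits each encoding carries. A short Python check, using the bit patterns exactly as written above (the helper name `field_range` is invented for the example):

```python
# Derive the count ranges from the encoding patterns listed above.
def field_range(pattern, letter):
    """Number of distinct values a field can hold, from its width in the pattern."""
    return 2 ** pattern.count(letter)

current_reps = "1111110 10 n nnnn nnnnnnnnn nnniiiiii"
proposed_reps = "1111111 01 n nnnn nnnnnnnnn 001iiiiii"
proposed_repd = "1111111 11 0 CCCC nnnnnnnnn 001iiiiii"

print(field_range(current_reps, "n"))    # repeat counts available today
print(field_range(proposed_reps, "n"))   # repeat counts after the merge
print(field_range(proposed_repd, "n"))   # REPD #n,#i repeat counts
print(field_range(proposed_reps, "i"))   # instructions per block
```

So folding REPS into opcode 1111111 costs three bits of immediate repeat count (17 bits down to 14), which is the trade being offered in exchange for freeing opcode 1111110.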
=======================================================================================================
Reason: Swap instruction opcodes GETWORD/SETWORD, WAITPEQ and WAITPNE with TESTB, WRBYTE/WRWORD and SQRT64/QSINCOS
          so that SETNIB works with these instructions (ie all nibble #6 bits other than "n/nn/nnn" bits are zeros)
Of the instructions that have n, nn & nnn in their opcodes & WZ fields, only GETWORD/SETWORD, WAITPEQ and WAITPNE
 have opcodes that have "1" bits in the 6th nibble (other than "n" bits).
If these instruction opcodes were swapped with TESTB, WRBYTE/WRWORD and SQRT64/QSINCOS,
 their 6th nibble bits would have "0" bits in the non "n" bit positions.
This would permit the SETNIB D,[#]S,#6 instruction to be used to set the "n/nn/nnn" bits,
 providing the remaining nibble bits are "0".
Thread: [URL]http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222324&viewfull=1#post1222324[/URL]
=======================================================================================================
Reason: Suggested by David & Bill for GCC assistance
Thread: [URL]http://forums.parallax.com/showthread.php/152079-Hub-Execution-Model-Thread-(split-from-blog)?p=1224484&viewfull=1#post1224484[/URL]
(and also a little earlier for the history)
Background: Any instruction with an immediate value for #S is limited to 9-bits.
            GCC often needs to manipulate a larger value, and so performs a few instructions to utilise this.
            David & Bill can explain the purpose better than I can.
What is desired is a way to utilise an instruction to set an internal register, which, when combined with
 the following instruction, which will use an immediate #S value, the resultant S value is an immediate
 value of 32 bits. This would only work for the following instruction after "BIG", and the BIG would then
 be reset to zeros (or a flag cleared).
Originally what was asked for is this BIG instruction to set the upper bits 31..9 with the immediate 23-bit "n"
 field, and the lower bits 8..0 = 000000000.
By making this more general purpose, perhaps the following might be implemented instead...
 BIG #n sets an internal register "BIG" with the immediate 23 bits, either the top 23 bits or the bottom 23 bits,
 depending on another instruction bit "Z". (ie Z indicates n<<23)
If the ALU now takes any #S instruction, and if the previous instruction was a "BIG", then the ALU will combine
 the immediate 9 bits with the BIG register to form a new immediate value. Since there may be insufficient time
 to add the BIG value to the #S value in the pipeline, it was thought that an "OR" of the bits might be simpler,
 or alternately, just use the upper 23 bits of BIG with the lower 9 bits of #S.
              
Presuming we can free up a full instruction, then... 
xxxxxxx 10 n nnnn nnnnnnnnn nnnnnnnnn        BIG     #D      ' Load 23 immediate bits into the lower "BIG" register bits 22..0 and zero bits 31..23.
xxxxxxx 11 n nnnn nnnnnnnnn nnnnnnnnn        BIGU    #D      ' Load 23 immediate bits into the upper "BIG" register bits 31..9 and zero bits 8..0
4 such registers for use in multi-tasking.
=======================================================================================================

P2_Instr_suggestons_002.spin

David Betz · 2013-12-04 20:36

The BIG instruction can be a lot simpler. It can simply write its 23 bit immediate value into a hidden register that will supply bits 31:9 of the S field of the following instruction. That will allow 32 bit immediate constants with a two instruction sequence. If the hidden register is reset to zero after it is used then the combination of the hidden register with S can always be done and will usually have no effect since the hidden register will contain zero unless it has just been set by a BIG instruction.
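A minimal Python sketch of this simpler scheme, purely as a model: the class and method names are invented, and "BIG" itself is still only a proposed instruction. The hidden register is always combined with the 9-bit S field and cleared after use, so instructions not preceded by BIG see their immediate unchanged.

```python
# Hypothetical model of the simplified BIG prefix (all names are illustrative).
class CogModel:
    def __init__(self):
        self.big = 0                          # hidden 23-bit register, normally zero

    def big_instr(self, imm23):
        """BIG #n: latch 23 bits that will become S[31:9] of the next instruction."""
        self.big = imm23 & 0x7FFFFF

    def immediate_s(self, s9):
        """Form the S value for an instruction using an immediate #S."""
        value = (self.big << 9) | (s9 & 0x1FF)
        self.big = 0                          # reset after use
        return value

def split_constant(value32):
    """What an assembler or compiler would do: split into BIG part and #S part."""
    return (value32 >> 9) & 0x7FFFFF, value32 & 0x1FF

cog = CogModel()
hi, lo = split_constant(0xDEADBEEF)
cog.big_instr(hi)
print(hex(cog.immediate_s(lo)))    # two-instruction sequence yields the full constant
print(hex(cog.immediate_s(5)))     # following instruction is unaffected
```

Because the two fields occupy disjoint bit ranges, the OR here is just concatenation, so no adder is needed in the pipeline.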

Bill Henning · 2013-12-04 21:06

I hope you found another slot for the HJMP/HCALL instructions, as that will provide 4x speedup vs. regular LMM loop.... more tomorrow, wifey's home.

Cluso99 wrote: »

Chip,
At the risk of getting shot by others here on the forum, here are the instruction fixes and adds that have been proposed and that you were interested in.
While I scoured the recent forums, apologies to anything I missed.
At least this puts it in one place for you. It's up to you to do with this what you like.

BTW I have left the SERDES out of this.

Possible instruction fixes/changes/suggestions/additions...
=======================================================================================================
Here is a possible fix required:
WAITCNT
Thread: [URL]http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222701&viewfull=1#post1222701[/URL]
=======================================================================================================
Reason: Add new pin-pair instruction for use with USB bit-banging receive (similar to GETP/GETNP)
        The S value (sub-instruction bits) "yyyyyyyy" would use the next available slot after CACHEX
Thread: [URL]http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222515&viewfull=1#post1222515[/URL]
1111111 ZC L CCCC DDDDDDDDD xyyyyyyyy       GETXP   [#]D [WZ],[WC]  ' set flags for the pin-pair for usb bit-banging  
                                                                    '   D = PINx (0..127), PINy := PINx XOR $1 (it's complementary pin-pair)
                                                                    '   C = C XOR PINx via WC
                                                                    '   Z = !(PINx OR PINy) via WZ (ie ZERO if both PINx and PINy are both ZERO == SE0 in USB)
PINx and PINy are a pair of pins. If PINx is even then PINy := PINx + 1 else if PINx is odd then PINy := PINx - 1
The allowance for the PINx/PINy pair to be reversed is for USB LS & HS where J/K are effectively swapped between D-/D+.
WZ & WC would normally be used.
=======================================================================================================
Reason: Add new instruction(s) for calculating/accumulating CRC for 1-bit using the Polynomial set in "ACCA"
        The S value (sub-instruction bits) "yyyyyyyy" would use the next available slot after CACHEX
        
Thread: [URL]http://forums.parallax.com/showthread.php/151992-CRC-generation?p=1222728&viewfull=1#post1222728[/URL]
1111111 xx x CCCC DDDDDDDDD xyyyyyyyy       CRCBIT  D   ' accumulate CRC
                                                        '   C    = current data bit (to be accumulated)
                                                        '   D    = CRC Register
                                                        '   ACCA = polynomial
The CRCBIT instruction performs the following...
(1) X := C XOR D[0]
(2) D := D >> 1
(3) if X == 1 then D := D XOR ACCA
Alternately, a special register to hold the polynomial "POLY" could be used, requiring the instruction(s)
1111111 x0 x xxxx DDDDDDDDD xyyyyyyyy       CRCBIT  D   ' accumulate CRC
1111111 x1 x xxxx DDDDDDDDD xyyyyyyyy       SETPOLY D   ' set the polynomial to be used in 
=======================================================================================================
Reason: Add new pin-pair variants for use with complementary/differential I/O 2 wire protocols
Thread: [URL]http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222689&viewfull=1#post1222689[/URL]

For reference only...
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00111000           SETZC   D/#             (D[1:0] into Z/C via WZ/WC)
                                                            presume this really means...(D[1:0] into !Z/C via WZ/WC)
Currently
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00110000           GETP    D/#             (pin into !Z/C via WZ/WC)
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00110001           GETNP   D/#             (pin into Z/!C via WZ/WC)
--L-            1111111 xx L CCCC DDDDDDDDD x10011000           OFFP    D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011001           NOTP    D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011010           CLRP    D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011011           SETP    D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011100           SETPC   D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011101           SETPNC  D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011110           SETPZ   D/#
--L-            1111111 xx L CCCC DDDDDDDDD x10011111           SETPNZ  D/#
Replace with...
ZCL-            1111111 00 L CCCC DDDDDDDDD x00110000           GETPP   D/#     (pin-pair PINy:PINx into !Z/C)
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00110000           GETP    D/#             (pin into !Z/C via WZ/WC)
ZCL-            1111111 00 L CCCC DDDDDDDDD x00110001           GETNPP  D/#     (pin-pair PINy:PINx into Z/!C)
ZCL-            1111111 ZC L CCCC DDDDDDDDD x00110001           GETNP   D/#             (pin into Z/!C via WZ/WC)
These could share opcodes???
--L-            1111111 00 L CCCC DDDDDDDDD x10011000           OFFP    D/#             (pin#=0???  , dir#=0)
--L-            1111111 01 L CCCC DDDDDDDDD x10011000           NOTP    D/#             (pin#=!pin# , dir#=1)
--L-            1111111 10 L CCCC DDDDDDDDD x10011000           CLRP    D/#             (pin#=0     , dir#=1)
--L-            1111111 11 L CCCC DDDDDDDDD x10011000           SETP    D/#             (pin#=1     , dir#=1)
These could share opcodes???
--L-            1111111 00 L CCCC DDDDDDDDD x10011001           SETPC   D/#             (pin#=C     , dir#=1)
--L-            1111111 01 L CCCC DDDDDDDDD x10011001           SETPNC  D/#             (pin#=!C    , dir#=1)
--L-            1111111 10 L CCCC DDDDDDDDD x10011001           SETPZ   D/#             (pin#=Z     , dir#=1)
--L-            1111111 11 L CCCC DDDDDDDDD x10011001           SETPNZ  D/#             (pin#=!Z    , dir#=1)
New pin-pair instructions...(could use x10011010-x10011111 if freed above, or use new sub-opcodes avail following CACHEX)
--L-            1111111 00 L CCCC DDDDDDDDD x10011010           OFFPP   D/#     (pin-pair PINy:PINx=00???       , dir#=00)
--L-            1111111 01 L CCCC DDDDDDDDD x10011010           NOTPP   D/#     (pin-pair PINy:PINx=!PINy:!PINx , dir#=11)
--L-            1111111 10 L CCCC DDDDDDDDD x10011010           CLRPP   D/#     (pin-pair PINy:PINx=00          , dir#=11)
--L-            1111111 11 L CCCC DDDDDDDDD x10011010           SETPP   D/#     (pin-pair PINy:PINx=11          , dir#=11)
--L-            1111111 00 L CCCC DDDDDDDDD x10011011           SETPPLH D/#     (pin-pair PINy:PINx=01          , dir#=11)
--L-            1111111 01 L CCCC DDDDDDDDD x10011011           SETPPHL D/#     (pin-pair PINy:PINx=10          , dir#=11)
                                                                  Note: SETPPHL could be achieved by using SETPPLH PINy
I don't really see the need for these 2, but I have put them here in case you think they are desirable...
--L-            1111111 10 L CCCC DDDDDDDDD x10011011           SETPPZC D/#     (pin-pair PINy:PINx=!Z/C        , dir#=1)
--L-            1111111 11 L CCCC DDDDDDDDD x10011011           SETPPNF D/#     (pin-pair PINy:PINx=Z/!C        , dir#=1)
D/# specifies PINx (0..127). PINy := PINx XOR #1 (ie its twin in the pin-pair)
 (ie PINx and PINy are a pair of pins. If PINx is even then PINy := PINx + 1 else if PINx is odd then PINy := PINx - 1)
=======================================================================================================
Reason: Combine to use 1 instruction with variants
        Frees up opcodes 1000000 & 1000001
        Remove WZ/WC options
        Providing ENCOD can remove WZ option, it can move from 1000011,
         freeing BLMASK to share with another instruction variant
        
Currently...
ZCWS            1000000 ZC I CCCC DDDDDDDDD SSSSSSSSS           DECOD3  D,S/#
ZCWS            1000001 ZC I CCCC DDDDDDDDD SSSSSSSSS           DECOD4  D,S/#
ZCWS            1000010 ZC I CCCC DDDDDDDDD SSSSSSSSS           DECOD5  D,S/#
Z-WS            1000011 Z0 I CCCC DDDDDDDDD SSSSSSSSS           ENCOD   D,S/#   (shared with BLMASK)

Replace with...
--WS            1000010 00 I CCCC DDDDDDDDD SSSSSSSSS           DECOD3  D,S/#
--WS            1000010 01 I CCCC DDDDDDDDD SSSSSSSSS           DECOD4  D,S/#
--WS            1000010 10 I CCCC DDDDDDDDD SSSSSSSSS           DECOD5  D,S/#
--WS            1000010 11 I CCCC DDDDDDDDD SSSSSSSSS           ENCOD   D,S/#   
=======================================================================================================
Reason: Combine to use 1 instruction with variants
        May facilitate later use of opcode 1111110
Currently...        
-----------------------------------------------------------------------------------------------------
1111110 10 n nnnn nnnnnnnnn nnniiiiii        REPS    #n,#i   'execute 1..64 inst's 1..131072 times  1
1111111 00 0 CCCC 111111111 001iiiiii        REPD    #i      'execute 1..64 inst's infinitely       1
1111111 00 0 CCCC DDDDDDDDD 001iiiiii        REPD    D,#i    'execute 1..64 inst's D+1 times        1
1111111 00 1 CCCC nnnnnnnnn 001iiiiii        REPD    #n,#i   'execute 1..64 inst's 1..512 times     1
-----------------------------------------------------------------------------------------------------
Replace with...
        fL *                                                 ' *=infinitely
1111111 00 0 xxxx DDDDDDDDD 001iiiiii        REPS    D,#i    'execute 1..64 inst's D+1 times        1+1
1111111 00 1 xxxx xxxxxxxxx 001iiiiii        REPS    #i      'execute 1..64 inst's infinitely       1+1
1111111 01 n nnnn nnnnnnnnn 001iiiiii        REPS    #n,#i   'execute 1..64 inst's 1..16384 times   1+1
1111111 10 0 CCCC DDDDDDDDD 001iiiiii        REPD    D,#i    'execute 1..64 inst's D+1 times        1+3
1111111 10 1 CCCC xxxxxxxxx 001iiiiii        REPD    #i      'execute 1..64 inst's infinitely       1+3
1111111 11 0 CCCC nnnnnnnnn 001iiiiii        REPD    #n,#i   'execute 1..64 inst's 1..512 times     1+3
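
The repeat ranges quoted in the comments follow directly from the number of "n" and "i" bits in each encoding. The small Python check below counts them; the helper name `max_count` and the constant names are illustrative only. It also shows the cost of the merge: the immediate form of REPS drops from 17 count bits to 14.

```python
def max_count(encoding: str, letter: str = "n") -> int:
    """Return 2**(number of `letter` bits) for an encoding pattern string."""
    return 2 ** encoding.count(letter)


# Encoding patterns copied from the listing above.
OLD_REPS = "1111110 10 n nnnn nnnnnnnnn nnniiiiii"   # current REPS #n,#i
NEW_REPS = "1111111 01 n nnnn nnnnnnnnn 001iiiiii"   # proposed REPS #n,#i
REPD_IMM = "1111111 11 0 CCCC nnnnnnnnn 001iiiiii"   # proposed REPD #n,#i

print(max_count(OLD_REPS))        # repeat range of the current REPS
print(max_count(NEW_REPS))        # repeat range of the proposed REPS
print(max_count(REPD_IMM))        # repeat range of REPD #n,#i
print(max_count(NEW_REPS, "i"))   # instructions per block
```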
=======================================================================================================
Reason: Swap instruction opcodes GETWORD/SETWORD, WAITPEQ and WAITPNE with TESTB, WRBYTE/WRWORD and SQRT64/QSINCOS
          so that SETNIB works with these instructions (ie all nibble #6 bits other than "n/nn/nnn" bits are zeros)
Of the instructions that have n, nn & nnn in their opcodes & WZ fields, only GETWORD/SETWORD, WAITPEQ and WAITPNE
 have opcodes that have "1" bits in the 6th nibble (other than "n" bits).
If these instruction opcodes were swapped with TESTB, WRBYTE/WRWORD and SQRT64/QSINCOS,
 their 6th nibble bits would have "0" bits in the non "n" bit positions.
This would permit the SETNIB D,[#]S,#6 instruction to be used to set the "n/nn/nnn" bits,
 providing the remaining nibble bits are "0".
Thread: [URL]http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222324&viewfull=1#post1222324[/URL]
=======================================================================================================
Reason: Suggested by David & Bill for GCC assistance
Thread: [URL]http://forums.parallax.com/showthread.php/152079-Hub-Execution-Model-Thread-(split-from-blog)?p=1224484&viewfull=1#post1224484[/URL]
(and also a little earlier for the history)
Background: Any instruction with an immediate value for #S is limited to 9-bits.
            GCC often needs to manipulate a larger value, and so performs a few instructions to utilise this.
            David & Bill can explain the purpose better than I can.
What is desired is a way to utilise an instruction to set an internal register, which, when combined with
 the following instruction, which will use an immediate #S value, the resultant S value is an immediate
 value of 32 bits. This would only work for the following instruction after "BIG", and the BIG would then
 be reset to zeros (or a flag cleared).
Originally what was asked for is this BIG instruction to set the upper bits 31..9 with the immediate 23-bit "n"
 field, and the lower bits 8..0 = 000000000.
By making this more general purpose, perhaps the following might be implemented instead...
 BIG #n sets an internal register "BIG" with the immediate 23 bits, either the top 23 bits or the bottom 23 bits,
 depending on another instruction bit "Z". (ie Z indicates n<<9)
If the ALU now takes any #S instruction, and if the previous instruction was a "BIG", then the ALU will combine
 the immediate 9 bits with the BIG register to form a new immediate value. Since there may be insufficient time
 to add the BIG value to the #S value in the pipeline, it was thought that an "OR" of the bits might be simpler,
 or alternatively, just use the upper 23 bits of BIG with the lower 9 bits of #S.
              
Presuming we can free up a full instruction, then... 
xxxxxxx 10 n nnnn nnnnnnnnn nnnnnnnnn        BIG     #D      ' Load 23 immediate bits into the lower "BIG" register bits 22..0 and zero bits 31..23.
xxxxxxx 11 n nnnn nnnnnnnnn nnnnnnnnn        BIGU    #D      ' Load 23 immediate bits into the upper "BIG" register bits 31..9 and zero bits 8..0
4 such registers for use in multi-tasking.
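
To make the BIG/BIGU idea concrete, here is a minimal Python sketch of the OR-combine variant described above. The function names `big`, `bigu` and `effective_s` are illustrative, and the choice of OR (rather than an add, or a straight bit-field splice) is just one of the options mentioned.

```python
MASK32 = 0xFFFFFFFF


def big(n23: int) -> int:
    """BIG: 23 immediate bits go to register bits 22..0; bits 31..23 are zero."""
    return n23 & 0x7FFFFF


def bigu(n23: int) -> int:
    """BIGU: 23 immediate bits go to register bits 31..9; bits 8..0 are zero."""
    return (n23 & 0x7FFFFF) << 9


def effective_s(big_reg: int, s9: int) -> int:
    """OR the BIG register with the 9-bit #S of the following instruction."""
    return (big_reg | (s9 & 0x1FF)) & MASK32
```

With BIGU the two fields do not overlap, so any 32-bit constant `v` can be rebuilt as `effective_s(bigu(v >> 9), v & 0x1FF)`. With BIG the low 9 bits overlap the #S field, so the OR only fills in extra bits there.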
=======================================================================================================

P2_Instr_suggestons_002.spin

Cluso99 · 2013-12-04 22:25

David Betz wrote:

The BIG instruction can be a lot simpler. It can simply write its 23 bit immediate value into a hidden register that will supply bits 31:9 of the S field of the following instruction. That will allow 32 bit immediate constants with a two instruction sequence. If the hidden register is reset to zero after it is used then the combination of the hidden register with S can always be done and will usually have no effect since the hidden register will contain zero unless it has just been set by a BIG instruction.

That is what I was trying to convey, with some optional extensions. Does it need to apply only to the immediately following instruction? I know that is your current use, but it could simply stay set until an instruction uses it, at which point it would be zeroed.

Here are a couple of usage examples...

  SETBIG #(hubadr >> 9)
  RDWORD cogreg, #(hubadr & $1FF)   'read word into cogreg from hubadr
...
  SETBIG #(longvalue >> 9)
  XOR    cogreg, #(longvalue & $1FF)  'xor cogreg with longvalue

Of course both these could be done differently. David or Bill will need to explain their actual intended use.
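
For anyone wanting to play with the idea, the set-then-clear behaviour of the hidden register can be sketched in Python. The class `BigPrefix` and its method names are hypothetical; the model assumes the hidden register always supplies bits 31:9 and is zeroed by whichever instruction consumes it.

```python
class BigPrefix:
    """Hypothetical model of the hidden register written by SETBIG."""

    def __init__(self) -> None:
        self.hidden = 0                      # zero means no prefix is pending

    def setbig(self, n23: int) -> None:
        """SETBIG #n: latch 23 bits that will become S[31:9]."""
        self.hidden = n23 & 0x7FFFFF

    def immediate(self, s9: int) -> int:
        """Build the 32-bit S value for an instruction using #S, then clear."""
        value = (self.hidden << 9) | (s9 & 0x1FF)
        self.hidden = 0                      # zeroed once it has been used
        return value


cog = BigPrefix()
hubadr = 0x12345
cog.setbig(hubadr >> 9)
first = cog.immediate(hubadr & 0x1FF)        # rebuilt full address
second = cog.immediate(0x1FF)                # no prefix pending: plain 9-bit value
```

Because the register is zero unless a SETBIG has just run, ordinary instructions see their usual 9-bit immediate without any special-casing.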

Bill Henning wrote: »

I hope you found another slot for the HJMP/HCALL instructions, as that will provide 4x speedup vs. regular LMM loop.... more tomorrow, wifey's home.

I have found at least 2. One will be for the "BIG" instruction - need a better name IMHO.
That's providing Chip hasn't found a use for those instruction slots.

You will need to explain to me more about HJMP/HCALL - I will take another look at the Hub Execution Model thread.

ozpropdev · 2013-12-04 22:43

Cluso99 wrote: »

BIG instruction - need a better name IMHO.

What about EXTEND or EXPAND?
