Should the next Propeller be code-compatible?

cgracey · 2008-08-28 16:39

jazzed said...

...will you have any kind of post increment or other variations? This has been requested/suggested before many times.

Yes, and it's already been implemented in the FPGA design. It uses the same RDxxxx/WRxxxx instruction codes, but is activated when S is immediate and S[8] is high (as if you were going to immediately access $100..$1FF, which nobody has probably ever done, as immediate accesses tend to be focused on locations $000..$00F). The S coding looks like this:

%1_SUP_XXXXX

1 = Trigger pointer addressing, not immediate $000..$0FF addressing
S = Select pointer (0 = PTRA, 1 = PTRB)
U = Update PTRx (0 = don't update PTRx, 1 = add scaled index to PTRx)
P = Pre/Post usage for addressing (0 = use PTRx plus scaled index, 1 = use PTRx)
X = MSB-extended (sign-extended) index (-16..+15) which gets scaled according to xxBYTE/xxWORD/xxLONG

Here's how you use them:

        SETPTRA D              'set PTRA to D
        SETPTRB D              'set PTRB to D

        GETPTRA D              'get PTRA into D
        GETPTRB D              'get PTRB into D

        RDBYTE  D,PTRA         'read byte at PTRA into D (S = %1000_00000)
        RDWORD  D,PTRB[10]     'read word at PTRB+10*2 into D (S = %1100_01010)
        RDLONG  D,PTRA[-4]     'read long at PTRA-4*4 into D (S = %1000_11100)
        RDLONG  D,PTRB[--1]    'read long at PTRB-1*4 into D, subtract 1*4 from PTRB (S = %1110_11111)
        RDBYTE  D,PTRA[++3]    'read byte at PTRA+3*1 into D, add 3*1 to PTRA (S = %1010_00011)
        RDBYTE  D,PTRA[1++]    'read byte at PTRA into D, add 1*1 to PTRA (S = %1011_00001)
        WRWORD  D,PTRA[2--]    'write D to word at PTRA, subtract 2*2 from PTRA (S = %1011_11110)

Both PTRA and PTRB get initialized to PAR.
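
The field layout and the S values in the listing above can be cross-checked with a small model. This is only a sketch in Python, not Parallax tooling: the names `encode_s` and `effective_access` are made up for illustration, and the address/update behavior is an interpretation of the U and P descriptions above.

```python
SCALE = {"BYTE": 1, "WORD": 2, "LONG": 4}  # index scaling per xxBYTE/xxWORD/xxLONG


def encode_s(ptr, index=0, update=False, post=False):
    """Pack the 9-bit %1_SUP_XXXXX source field described in the post."""
    if ptr not in ("PTRA", "PTRB"):
        raise ValueError("ptr must be 'PTRA' or 'PTRB'")
    if not -16 <= index <= 15:
        raise ValueError("index must fit in 5 signed bits (-16..+15)")
    select = 1 if ptr == "PTRB" else 0
    return (
        (1 << 8)                # trigger pointer addressing
        | (select << 7)         # S: 0 = PTRA, 1 = PTRB
        | (int(update) << 6)    # U: add scaled index to PTRx
        | (int(post) << 5)      # P: 1 = address with PTRx itself
        | (index & 0x1F)        # X: 5-bit two's-complement index
    )


def effective_access(ptr_value, size, index=0, update=False, post=False):
    """Return (hub address used, pointer value afterwards)."""
    step = index * SCALE[size]
    address = ptr_value if post else ptr_value + step
    new_ptr = ptr_value + step if update else ptr_value
    return address, new_ptr


# RDLONG D,PTRB[--1] from the listing above
print(format(encode_s("PTRB", -1, update=True), "09b"))  # 111011111
```

Feeding the other rows of the listing through `encode_s` reproduces the S values given in their comments.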

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Post Edited (Chip Gracey (Parallax)) : 8/28/2008 5:02:56 PM GMT

Beau Schwabe · 2008-08-28 16:50

heater,

"Where did all that extra silicon magically appear from ?" - The current Propeller uses a 350nm process, while the Propeller under development is being done in a 180nm process.
If all of the transistors, capacitors, resistors, etc. scaled to a 1:1 translation between processes you would basically have a real estate gain of 3.78 times the current design. Unfortunately,
the components don't all scale the same, but by slightly altering the design and making functional improvements based on characteristics of the different process, you can get close.
The current die is about 6mm square... the new die will be slightly larger.
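
The 3.78 figure is the ideal case where area shrinks with the square of the feature size. A quick check of that arithmetic in Python (the function name `ideal_area_gain` is just for illustration, and real layouts will fall short of it for the reason given above):

```python
def ideal_area_gain(old_nm, new_nm):
    """Ideal 1:1 scaling: linear dimensions shrink by old/new, area by its square."""
    return (old_nm / new_nm) ** 2


print(f"{ideal_area_gain(350, 180):.2f}x")  # 3.78x
```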

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe

IC Layout Engineer
Parallax, Inc.

potatohead · 2008-08-28 16:51

Thanks Chip!

And you are just damn cool for talking CPU design with us.

Personally, I'm very excited about this:

        rep     [32,3]          'repeat 3 instructions 32 times
        nop                     'must execute two instructions here
        nop
        shl     x,#1            'begin 3-instruction block
        cmpsub  x,y     wc
        rcl     q,#1            'total cycles = 3 + 3*32 = 99

!!!!

Way to go on optimizing that COG instruction space! Unrolled loops without actually unrolling them -->at least that's what I'm seeing.
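
The cycle count in that last comment generalizes easily. A small Python sketch, assuming (as the comments in the snippet suggest) three cycles of overhead for the REP plus the two instructions that must follow it, and one cycle per repeated instruction; `rep_total_cycles` is a made-up name:

```python
def rep_total_cycles(block_len, count, overhead=3):
    """Cycles for REP [count, block_len]: fixed overhead plus the repeated block."""
    return overhead + block_len * count


print(rep_total_cycles(3, 32))  # 99, matching the comment in the snippet
```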

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!

Chat in real time with other Propellerheads on IRC #propeller @ freenode.net

jazzed · 2008-08-28 17:06

Excellent Chip!

One more ... I've often wanted to do this: "WRLONG INA, ptr"
Yes, no, maybe ?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Phil Pilgrim (PhiPi) · 2008-08-28 17:10

Chip,

That pointer addressing is too cool for words! Thanks for finding a way to make it happen!

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!

cgracey · 2008-08-28 17:19

jazzed said...
Excellent Chip!

One more ... I've often wanted to do this: "WRLONG INA, ptr"
Yes, no, maybe ?

I understand. That would mean making INA accessible via D. Not a problem, but you'll have six other instructions per hub cycle outside of the WRLONG, so you'll have plenty of time to do a 'MOV reg,INA'. If you want to quickly capture INA activity into cog RAM, you could do this:

         SETINDA buffptr   'set INDA's pointer to a 256-register buffer

again    REP     [256,1]
         NOP               'put something useful instead of these two NOPs
         NOP
         MOV     INDA,INA  'repeats 256 times, auto-inc'ing and wrapping INDA's pointer

         'buff = 256 snapshots of INA'

         JMP     #again

buffptr  PTRX    buff,256  'define circular buffer, same as LONG (buff+256-1)<<9 + buff
buff     RES     256
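
The PTRX comment gives the packing as LONG (buff+256-1)<<9 + buff. A Python sketch of just that packing follows; the helper names and the example base address $040 are hypothetical, and how the hardware uses the two fields when wrapping is not modeled because the post does not spell it out.

```python
COG_REGISTERS = 512  # cog addresses are 9 bits wide


def ptrx(base, size):
    """Pack a circular-buffer descriptor: (base + size - 1) << 9 + base."""
    last = base + size - 1
    if base < 0 or size < 1 or last >= COG_REGISTERS:
        raise ValueError("buffer must fit inside the 512-register cog space")
    return (last << 9) + base


def unpack_ptrx(value):
    """Split a descriptor back into (last address, base address)."""
    return (value >> 9) & 0x1FF, value & 0x1FF


descriptor = ptrx(0x040, 256)   # hypothetical buffer starting at register $040
last, base = unpack_ptrx(descriptor)
print(hex(last), hex(base))     # 0x13f 0x40
```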

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

rokicki · 2008-08-28 17:25

Break compatibility. Let your ideas run wild. If we get even *one* neat thing by breaking compatibility, it's worth it.

But at the same time, it would be cool to add conditional compilation to the IDE so we could make our code work on either.

cgracey · 2008-08-28 17:26

Phil Pilgrim (PhiPi) said...
Chip,

That pointer addressing is too cool for words! Thanks for finding a way to make it happen!

-Phil

Thank Paul Baker. He came up here one day and we got ALL SORTS of cool stuff figured out.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

jazzed · 2008-08-28 17:36

Awesome. So would this sample at 80MHz assuming 160MHz clock?

Entry     ORG       0
 
          MOV       PTR, PAR
          MOV       T0,  #511
:loop     WRLONG    INA, PTR[1++]
          NOP
          WRLONG    INA, PTR[1++]
          DJNZ      T0,  #:loop
 
PTR       PTRX      1
T0        RES       1

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

cgracey · 2008-08-28 17:45

jazzed said...
Awesome. So would this sample at 80MHz assuming 160MHz clock?

Entry     ORG       0
 
          MOV       PTR, PAR
          MOV       T0,  #511
:loop     WRLONG    INA, PTR[1++]
          NOP
          WRLONG    INA, PTR[1++]
          DJNZ      T0,  #:loop
 
PTR       PTRX      1
T0        RES       1

No. Each WRLONG must wait for its hub turn that comes every 8 clocks (assuming 8 cogs). So, it would sample at 20MHz. If you want really high bandwidth you need to use cog ram. Hub ram could only keep up for some periodically-distilled results, but not the whole data stream.
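
The 20MHz figure follows from one hub window per cog every N clocks. A Python sketch of that arithmetic, assuming a simple round-robin hub with one slot per cog (the function name is illustrative, and the 16-cog line is an extrapolation rather than something stated here):

```python
def hub_access_rate_mhz(clock_mhz, cogs):
    """Each cog gets one hub window every `cogs` clocks."""
    return clock_mhz / cogs


print(hub_access_rate_mhz(160, 8))    # 20.0 MHz, the figure given above
print(hub_access_rate_mhz(160, 16))   # 10.0 MHz if the hub served 16 cogs
```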

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Phil Pilgrim (PhiPi) · 2008-08-28 17:55

Chip Gracey said...
... assuming 8 cogs ...

??? Has the number of cogs not been decided yet, or are you counting active cogs? (I presume it's not the latter, since that would break determinism; but I had to ask.)

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!

cgracey · 2008-08-28 18:07

Phil Pilgrim (PhiPi) said...

??? Has the number of cogs not been decided yet, or are you counting active cogs? (I presume it's not the latter, since that would break determinism; but I had to ask.)

Well, the other day Beau and I did some floorplan checking and 16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8. I think it would be bad to break determinism, in any case.

Once I break from current compatibility with the FPGA, interpreter, and compiler, it's going to be a lot easier to move into the wild blue yonder. I came to the conclusion last night that the next order of business is to move to 32 bit addressing. This profoundly affects the compiler (now 8,153 lines of 80386 code), interpreter, and booter, not to mention the Windows app. It's an all-or-nothing prospect. This is going to be a difficult transition, but once made, the sky's the limit.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

jazzed · 2008-08-28 18:17

So we're back to this, assuming 160MHz? 10MHz sample rate?

              movd      :wrdata,bptr            ' set pointer
              mov       ndx,    #511            ' set count
:wrdata       mov       0-0,    ina             '   0ns ... get data ... next is 100ns
              add       bptr,   #1              '  25ns ... increment pointer
              movd      :wrdata,bptr            '  50ns ... save data
              djnz      ndx,    #:wrdata        '  75ns ... repeat until buffer done
              

ndx           res       1

Or maybe this? 13.3MHz sample rate?

              movd      :wrdata,bptr            ' set pointer
              mov       ndx,    #511            ' set count


:wrdata       mov       0-0,    ina             '   0ns ... get data ... next is 75ns
              movd      :wrdata,bptr[1++]       '  25ns ... save data
              djnz      ndx,    #:wrdata        '  50ns ... repeat until buffer done
ptr           ptrx      1

ndx           res       1

Apples to apples without REP which I don't exactly grasp yet. Looks like your REP example samples at 8MHz no?
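
The two sample-rate guesses above come straight from the per-instruction timings in the code comments. A Python sketch of that arithmetic, taking the 25ns-per-instruction figure from those comments as given rather than as a confirmed spec (`loop_sample_rate_mhz` is a made-up name):

```python
def loop_sample_rate_mhz(instructions_per_sample, ns_per_instruction=25.0):
    """One sample per pass through the loop; rate in MHz = 1000 / loop time in ns."""
    return 1000.0 / (instructions_per_sample * ns_per_instruction)


print(round(loop_sample_rate_mhz(4), 1))  # 10.0 for the four-instruction loop
print(round(loop_sample_rate_mhz(3), 1))  # 13.3 for the three-instruction loop
```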

BTW, how much of a schedule change does the 32-bit non-compatibility mode cause? I'll take the current design if the change has too much impact. Right now, you are ahead of the curve and have no competition in this class of microcontroller. Losing to the next leap-frog technology competitor can be devastating in the biggest markets.

Thanks for entertaining our questions so far.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

parsko · 2008-08-28 18:20

Chip Gracey (Parallax) said...
... and 16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8.

Huh? Did I read that correctly?

In my best Arnold voice: "Wha' chu talkin bout Willis?"

heater · 2008-08-28 18:26

Oh, heart sinking here. I had convinced myself from reading all the posts that you had found a magic way to get 16 COGs in there.
Still 8 gives us double the HUB access rate of 16 so not so bad.
Just have to figure out how to get two threads (or more) running in a cog at a super lick with all those new instructions/modes.
BUT hey, what about 12 COGs then ?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Phil Pilgrim (PhiPi) · 2008-08-28 18:37

Chip,

If you do limit it to eight, would there be any chance of including four counters per cog, instead of two? A program I wrote recently gobbled cogs, not because of processing requirements, but because it needed the counters. I know there are some addressing constraints in the cogs, so I don't know where you'd put CTRC .. PHSD without eating into code space or (shudder) banking them. (Actually banking might not be that bad if each cog had a writable address translation table for the SPRs. For example, if I'm using four counters, I probably don't need access to the video registers. And once the CTRXs are set up, I could shunt them out of address range as well, to gain access to INA and INB, say.)

Barring that, is a non-power-of-two cog count (i.e. 12) out of the question? (I'm not saying any of this stuff is necessary, just trying to probe what's possible.)

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!

QuattroRS4 · 2008-08-28 19:06

Chip Gracey said...
16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8.

As it has been decided to go with 32 bits and a clean slate ... I was really hoping you'd say 16 cogs.
If the slate is clean, go all out! I am sure anyone here wouldn't mind if the PropII launch was delayed to facilitate it.

Due to recent developments, will it be more cogs AND RAM, as opposed to more cogs OR RAM? ... oh to dream!

I am getting carried away here !

Regards,
John Twomey

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Necessity is the mother of invention'

Those who can, do. Those who can’t, teach.

Post Edited (QuattroRS4) : 8/28/2008 7:43:01 PM GMT

cgracey · 2008-08-28 19:18

Phil Pilgrim (PhiPi) said...

If you do limit it to eight, would there be any chance of including four counters per cog, instead of two?

We have an identical issue with # of ports. I've been thinking that we should have a single set of INP, OUTP, DIRP, and ALTP (new, for analog configuration) registers for all 32-bit ports. There would be a selection mechanism via special instruction (4-port example):

       setport D      'mux port D[6..5] into INP/OUTP/DIRP/ALTP register spaces (D is a pin#)
       setport [n]    'mux port n (0..3) into INP/OUTP/DIRP/ALTP register spaces (n is a port#)

We could do the same for however many CTRs we've got. This would also free up some special register space (4-counter example):

       setctr  D      'mux D[1..0] into CTR/FRQ/PHS register spaces
       setctr  [n]    'mux n (0..3) into CTR/FRQ/PHS register spaces

Once we get to 32-bit addressing, everything will be malleable and it will be easy to do stuff like this. Putting in 16 cogs (and a mechanism for selecting up to 64) would be really simple. It's going to take several days to get there, though.
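
For the 4-port example, the D[6..5] selection amounts to splitting a pin number into a port and a bit within that port. A minimal Python sketch of that split, assuming 32 pins per port and pin numbers 0..127; `split_pin` is a hypothetical helper, not a proposed instruction:

```python
def split_pin(pin):
    """Map a pin number (0..127) to (port 0..3, bit 0..31) as in 'setport D'."""
    if not 0 <= pin <= 127:
        raise ValueError("4-port example covers pins 0..127")
    port = (pin >> 5) & 0b11   # D[6..5] picks the 32-bit port
    bit = pin & 0x1F           # D[4..0] is the bit inside that port
    return port, bit


print(split_pin(37))   # (1, 5)
```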

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Capt. Quirk · 2008-08-28 19:33

Well, the other day Beau and I did some floorplan checking and 16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8. I think it would be bad to break determinism, in any case.

How big would a finished chip be with an 8mm sq die? Is the extra 4 sq mm mega bucks more? And which manufacturer or what chip(s) are you trying to compete with?

simonl · 2008-08-28 19:45

@Chip: I can count the number of uC manufacturers that would even contemplate this kind of consumer involvement on - well - NO hands! Kudos to you.

FWIW: I'd have said take the clean-slate approach anyway, so thanks for that decision.

What's all this about EIGHT COGs? Noooo! I'd pay extra for 16 - my visions for chip use would mostly need more than 8 COGs - I'm not clever enough to code multiple tasks into a COG, and just throwing a COG at a problem is THE appeal of the Propeller for me...

Oh; and if the IDE's got to be re-written anyway, PLEASE add conditional compilation

(BTW: How many pins are we gonna have on PII?)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,

Simon

www.norfolkhelicopterclub.co.uk
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again ;-)
BTW: I type as I'm thinking, so please don't take any offense at my writing style

Mike Green · 2008-08-28 19:59

For sure, if you're going to scrimp on pins or cogs or counters, cleanly allow room for 2 or 4 times that many in the instruction set and control registers to avoid having to change code in the future as chip density and process technology continues to improve.

Sapieha · 2008-08-28 20:13

Hi Chip

You had an open mind when you constructed PropI.
Open it more, to keep PropII from faults in its parallel processing power.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

heater · 2008-08-28 20:27

Sapieha: Please elaborate. What fault do you see? How should it be avoided?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Erik Friesen · 2008-08-28 20:30

Hmm. Do I understand the above to mean a semi- pic style banksel?

Well whatever but if you do it that way please make it easy to keep straight.

Rayman · 2008-08-28 20:31

I think 8 faster cogs (by 8x) would be enough for me. Considering I need 4 to do SXGA with cursor and this could probably be done with just one 8x faster cog...

Any way to have one cog (or the hub) access the unused counters in another cog?

Sapieha · 2008-08-28 20:40

Hi heater.

Sorry for my bad English, but:

1. The LMM that many talk about is not parallel processing power at full COG speed. Only semi-parallel!
2. If code compatibility decreases parallel processing power, skip it!
And there are many other aspects. Why decrease the COGs' capabilities?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

hippy · 2008-08-28 20:41

Chip Gracey (Parallax) said...
16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8.

I reckon there's going to be a lot of disappointment if that's the case as it seems everyone here has been thinking 16.

If there's a RDTXFR and WRTXFR which will magically transfer longs between two Propellers connected using a single pin at high speed that would mitigate the number of Cogs needed by making multi-chip arrays easier. Something like that would be nice to have even if it wasn't a blindingly fast link.

Paul Baker · 2008-08-28 20:45

Indeed, each cog will be 8x faster; higher use of JMPRET can fold multiple processes into the same cog. And the current multi-cog video drivers can be ported into 1 or 2 now.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Sapieha · 2008-08-28 20:47

Hi hippy.

That was my point in this thread:

http://forums.parallax.com/forums/default.aspx?f=25&p=1&m=212396

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Ken Peterson · 2008-08-28 20:51

Wow....I'm away from reading this forum for a couple of days and the PropII is being re-designed in real time!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone."

- Bjarne Stroustrup
