...will you have any kind of post increment or other variations? This has been requested/suggested before many times.
Yes, and it's already been implemented in the FPGA design. It uses the same RDxxxx/WRxxxx instruction codes, but is activated when S is immediate and S[8] is high (as if you were going to immediately access $100..$1FF, which probably nobody has ever done, as immediate accesses tend to be focused on locations $000..$00F). The S coding looks like this:
%1_SUP_XXXXX
1 = Trigger pointer addressing, not immediate $000..$0FF addressing
S = Select pointer (0 = PTRA, 1 = PTRB)
U = Update PTRx (0 = don't update PTRx, 1 = add scaled index to PTRx)
P = Pre/Post usage for addressing (0 = use PTRx plus scaled index, 1 = use PTRx)
X = MSB-extended index (-16..+15) which gets scaled according to xxBYTE/xxWORD/xxLONG
Here's how you use them:
               SETPTRA D              'set PTRA to D
               SETPTRB D              'set PTRB to D
               GETPTRA D              'get PTRA into D
               GETPTRB D              'get PTRB into D
               RDBYTE  D,PTRA         'read byte at PTRA into D (S = %1000_00000)
               RDWORD  D,PTRB[10]     'read word at PTRB+10*2 into D (S = %1100_01010)
               RDLONG  D,PTRA[-4]     'read long at PTRA-4*4 into D (S = %1000_11100)
               RDLONG  D,PTRB[--1]    'read long at PTRB-1*4 into D, subtract 1*4 from PTRB (S = %1110_11111)
               RDBYTE  D,PTRA[++3]    'read byte at PTRA+3*1 into D, add 3*1 to PTRA (S = %1010_00011)
               RDBYTE  D,PTRA[1++]    'read byte at PTRA into D, add 1*1 to PTRA (S = %1011_00001)
               WRWORD  D,PTRA[2--]    'write D to word at PTRA, subtract 2*2 from PTRA (S = %1011_11110)
Both PTRA and PTRB get initialized to PAR.
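The S coding above can be checked mechanically. What follows is only a sketch in Python: the helper names `encode_ptr_s` and `access` are made up for illustration, and the address arithmetic is one reading of the U, P, and X descriptions, not something taken from the FPGA source.

```python
SCALE = {"BYTE": 1, "WORD": 2, "LONG": 4}  # index scaling per xxBYTE/xxWORD/xxLONG

def encode_ptr_s(ptrb=False, update=False, post=False, index=0):
    """Build the 9-bit S field %1_SUP_XXXXX (hypothetical helper)."""
    if not -16 <= index <= 15:
        raise ValueError("index must be in -16..+15")
    # Bit 8 triggers pointer addressing; the index is stored as 5-bit two's complement.
    return (1 << 8) | (ptrb << 7) | (update << 6) | (post << 5) | (index & 0x1F)

def access(ptr, size, update=False, post=False, index=0):
    """Return (address used, new pointer value) for one RDxxxx/WRxxxx (assumed semantics)."""
    offset = index * SCALE[size]
    address = ptr if post else ptr + offset      # P=1 uses PTRx unmodified
    new_ptr = ptr + offset if update else ptr    # U=1 adds the scaled index to PTRx
    return address, new_ptr

# The seven S values quoted in the post
assert encode_ptr_s() == 0b1000_00000                                       # PTRA
assert encode_ptr_s(ptrb=True, index=10) == 0b1100_01010                    # PTRB[10]
assert encode_ptr_s(index=-4) == 0b1000_11100                               # PTRA[-4]
assert encode_ptr_s(ptrb=True, update=True, index=-1) == 0b1110_11111       # PTRB[--1]
assert encode_ptr_s(update=True, index=3) == 0b1010_00011                   # PTRA[++3]
assert encode_ptr_s(update=True, post=True, index=1) == 0b1011_00001        # PTRA[1++]
assert encode_ptr_s(update=True, post=True, index=-2) == 0b1011_11110       # PTRA[2--]
```

Under these assumptions, `access(1000, "LONG", update=True, index=-1)` models `RDLONG D,PTRB[--1]` with the pointer at 1000: it reads from 996 and leaves the pointer at 996.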
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 8/28/2008 5:02:56 PM GMT
"Where did all that extra silicon magically appear from ?" - The current Propeller uses a 350nm process, while the Propeller under development is being done in a 180nm process.
If all of the transistors, capacitors, resistors, etc. scaled 1:1 between processes, you would basically have a real estate gain of 3.78 times the current design. Unfortunately, the components don't all scale the same, but by slightly altering the design and making functional improvements based on the characteristics of the different process, you can get close.
The current die is about 6mm square... the new die will be slightly larger.
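The 3.78x figure is just the square of the linear shrink. A quick Python check, assuming ideal scaling where every feature shrinks with the process node (the function name `area_gain` is only for illustration):

```python
def area_gain(old_nm, new_nm):
    """Ideal area gain if every feature shrank linearly with the process node."""
    return (old_nm / new_nm) ** 2

print(round(area_gain(350, 180), 2))  # 3.78
```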
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe
IC Layout Engineer
Parallax, Inc.
One more ... I've often wanted to do this: "WRLONG INA, ptr"
Yes, no, maybe ?
I understand. That would mean making INA accessible via D. Not a problem, but you'll have six other instructions per hub cycle outside of the WRLONG, so you'll have plenty of time to do a 'MOV reg,INA'. If you want to quickly capture INA activity into cog RAM, you could do this:
         SETINDA buffptr   'set INDA's pointer to a 256-register buffer
again    REP     [256,1]
         NOP               'put something useful instead of these two NOPs
         NOP
         MOV     INDA,INA  'repeats 256 times, auto-inc'ing and wrapping INDA's pointer
         'buff = 256 snapshots of INA'
         JMP     #again
buffptr  PTRX    buff,256  'define circular buffer, same as LONG (buff+256-1)<<9 + buff
buff     RES     256
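The PTRX comment gives the packing formula directly, so it can be modelled in a few lines of Python. This is only a sketch: the names `ptrx` and `unpack` are hypothetical, the 9-bit field width is assumed from the `<<9` in the comment, and the example base address $080 is invented.

```python
def ptrx(base, size):
    """Pack a circular-buffer descriptor as the comment describes:
    (base + size - 1) << 9 + base, i.e. top address above the start address."""
    return ((base + size - 1) << 9) + base

def unpack(value):
    """Split a descriptor into (top address, start address), assuming 9-bit fields."""
    return value >> 9, value & 0x1FF

descriptor = ptrx(0x080, 256)            # hypothetical buffer at cog address $080
print(hex(descriptor))                   # 0x2fe80
print([hex(a) for a in unpack(descriptor)])  # ['0x17f', '0x80']
```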
No. Each WRLONG must wait for its hub turn that comes every 8 clocks (assuming 8 cogs). So, it would sample at 20MHz. If you want really high bandwidth you need to use cog ram. Hub ram could only keep up for some periodically-distilled results, but not the whole data stream.
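The 20MHz figure follows from one hub slot per cog per rotation. A small Python sketch of that arithmetic, assuming the 160MHz clock mentioned elsewhere in the thread (the function name is made up):

```python
def hub_sample_rate_mhz(clock_mhz, cogs):
    """One hub access per cog every `cogs` clocks."""
    return clock_mhz / cogs

print(hub_sample_rate_mhz(160, 8))   # 20.0
print(hub_sample_rate_mhz(160, 16))  # 10.0
```

The second line shows why 8 cogs gives double the hub access rate of 16.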
??? Has the number of cogs not been decided yet, or are you counting active cogs? (I presume it's not the latter, since that would break determinism; but I had to ask.)
Well, the other day Beau and I did some floorplan checking and 16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8. I think it would be bad to break determinism, in any case.
Once I break from current compatibility with the FPGA, interpreter, and compiler, it's going to be a lot easier to move into the wild blue yonder. I came to the conclusion last night that the next order of business is to move to 32 bit addressing. This profoundly affects the compiler (now 8,153 lines of 80386 code), interpreter, and booter, not to mention the Windows app. It's an all-or-nothing prospect. This is going to be a difficult transition, but once made, the sky's the limit.
So we're back to this, assuming 160MHz? 10MHz sample rate?
        movd    :wrdata, bptr   ' set pointer
        mov     ndx, #511       ' set count
:wrdata mov     0-0, ina        ' 0ns ... get data ... next is 100ns
        add     bptr, #1        ' 25ns ... increment pointer
        movd    :wrdata, bptr   ' 50ns ... update destination for next write
        djnz    ndx, #:wrdata   ' 75ns ... repeat until buffer done
ndx     res     1
Or maybe this? 13.3MHz sample rate?
        movd    :wrdata, bptr       ' set pointer
        mov     ndx, #511           ' set count
:wrdata mov     0-0, ina            ' 0ns ... get data ... next is 75ns
        movd    :wrdata, bptr[1++]  ' 25ns ... update destination, post-increment pointer
        djnz    ndx, #:wrdata       ' 50ns ... repeat until buffer done
ptr     ptrx    1
ndx     res     1
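The two sample rates quoted for these loops come straight from the per-instruction timings in the comments. A Python sketch, assuming 25ns per instruction as the annotations do (the function name is only for illustration):

```python
def loop_sample_rate_mhz(instructions, ns_per_instruction=25):
    """Sample rate of a capture loop that takes one sample per pass."""
    period_ns = instructions * ns_per_instruction
    return 1000 / period_ns

print(loop_sample_rate_mhz(4))            # 10.0  (mov, add, movd, djnz)
print(round(loop_sample_rate_mhz(3), 1))  # 13.3  (mov, movd, djnz)
```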
Apples to apples without REP which I don't exactly grasp yet. Looks like your REP example samples at 8MHz no?
BTW, how much of a schedule change does the 32-bit non-compatibility mode cause? I'll take the current design if the change has too much impact. Right now, you are ahead of the curve and have no competition in this class of microcontroller. Losing to the next leap-frog technology competitor can be devastating in the biggest markets.
Oh, heart sinking here. I had convinced myself from reading all the posts that you had found a magic way to get 16 COGs in there.
Still 8 gives us double the HUB access rate of 16 so not so bad.
Just have to figure out how to get two threads (or more) running in a cog at a super lick with all those new instructions/modes.
BUT hey, what about 12 COGs then ?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
If you do limit it to eight, would there be any chance of including four counters per cog, instead of two? A program I wrote recently gobbled cogs, not because of processing requirements, but because it needed the counters. I know there are some addressing constraints in the cogs, so I don't know where you'd put CTRC .. PHSD without eating into code space or (shudder) banking them. (Actually banking might not be that bad if each cog had a writable address translation table for the SPRs. For example, if I'm using four counters, I probably don't need access to the video registers. And once the CTRXs are set up, I could shunt them out of address range as well, to gain access to INA and INB, say.)
Barring that, is a non-power-of-two cog count (i.e. 12) out of the question? (I'm not saying any of this stuff is necessary — just trying to probe what's possible. )
Chip Gracey said...
16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8.
As it has been decided: 32 bits and a clean slate ... I was really hoping you'd say 16 cogs.
If the slate is clean - go all out! I am sure nobody here would mind if the PropII launch was delayed to facilitate that.
Due to recent developments, will it be more cogs AND RAM, as opposed to more cogs OR RAM? ... oh to dream!
I am getting carried away here !
Regards,
John Twomey
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Necessity is the mother of invention'
Those who can, do. Those who can’t, teach.
Post Edited (QuattroRS4) : 8/28/2008 7:43:01 PM GMT
If you do limit it to eight, would there be any chance of including four counters per cog, instead of two?
We have an identical issue with the number of ports. I've been thinking that we should have a single set of INP, OUTP, DIRP, and ALTP (new, for analog configuration) registers for all 32-bit ports. There would be a selection mechanism via a special instruction (4-port example):
       setport D      'mux port D[6..5] into INP/OUTP/DIRP/ALTP register spaces (D is a pin#)
       setport [n]    'mux port n (0..3) into register INP/OUTP/DIRP/ALTP spaces (n is a port#)
We could do the same for however many CTRs we've got. This would also free up some special register space (4-counter example):
       setctr  D      'mux D[1..0] into CTR/FRQ/PHS register spaces
       setctr  [n]    'mux n (0..1) into CTR/FRQ/PHS register spaces
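To make the pin-number form of setport concrete, here is a Python sketch of the D[6..5] selection for the 4-port example. The function names are hypothetical, and `bit_in_port` is an assumption that the low five bits of the pin number pick the bit within the selected 32-bit port.

```python
def port_of_pin(pin):
    """Port selected by 'setport D' when D is a pin number: bits 6..5."""
    return (pin >> 5) & 0b11

def bit_in_port(pin):
    """Bit position inside the selected 32-bit port (assumed: bits 4..0)."""
    return pin & 0x1F

for pin in (0, 31, 32, 37, 127):
    print(pin, port_of_pin(pin), bit_in_port(pin))
```

With these assumptions pin 37 lands in port 1 at bit 5, and pin 127 is the top bit of port 3.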
Once we get to 32-bit addressing, everything will be malleable and it will be easy to do stuff like this. Putting in 16 cogs (and a mechanism for selecting up to 64) would be really simple. It's going to take several days to get there, though.
Well, the other day Beau and I did some floorplan checking and 16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8. I think it would be bad to break determinism, in any case.
How big would a finished chip be with an 8mm sq die? Is the extra 4 sq mm mega bucks more? And which manufacturer or what chip(s) are you trying to compete with?
@Chip: I can count the number of uC manufacturers that would even contemplate this kind of consumer involvement on - well - NO hands! Kudos to you.
FWIW: I'd have said take the clean-slate approach anyway, so thanks for that decision.
What's all this about EIGHT COGs? Noooo! I'd pay extra for 16 - my visions for chip use would mostly need more than 8 COGs - I'm not clever enough to code multiple tasks into a COG, and just throwing a COG at a problem is THE appeal of the Propeller for me...
Oh; and if the IDE's got to be re-written anyway, PLEASE add conditional compilation
(BTW: How many pins are we gonna have on PII?)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon
www.norfolkhelicopterclub.co.uk
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again ;-)
BTW: I type as I'm thinking, so please don't take any offense at my writing style
For sure, if you're going to scrimp on pins or cogs or counters, cleanly allow room for 2 or 4 times that many in the instruction set and control registers to avoid having to change code in the future as chip density and process technology continues to improve.
I think 8 faster cogs (by 8x) would be enough for me. Considering I need 4 to do SXGA with cursor and this could probably be done with just one 8x faster cog...
Any way to have one cog (or the hub) access the unused counters in another cog?
1. The LMM that many people talk about is not parallel processing at full COG speed. Only semi-parallel!
2. If code compatibility reduces parallel processing power, skip it!
And there are many other aspects that decrease the COGs' capabilities.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Chip Gracey (Parallax) said...
16 cogs would result in a die about 8mm on the edge, which is huge, so I've kind of been thinking 8.
I reckon there's going to be a lot of disappointment if that's the case as it seems everyone here has been thinking 16.
If there's a RDTXFR and WRTXFR which will magically transfer longs between two Propellers connected using a single pin at high speed that would mitigate the number of Cogs needed by making multi-chip arrays easier. Something like that would be nice to have even if it wasn't a blindingly fast link.
Indeed, each cog will be 8x faster, and heavier use of JMPRET can fold multiple processes into the same cog. And the current multi-cog video drivers can now be ported into 1 or 2 cogs.
Wow....I'm away from reading this forum for a couple of days and the PropII is being re-designed in real time!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone."
Comments
And you are just damn cool for talking CPU design with us.
Personally, I'm very excited about this:
        rep     [32,3]          'repeat 3 instructions 32 times
        nop                     'must execute two instructions here
        nop
        shl     x,#1            'begin 3-instruction block
        cmpsub  x,y wc
        rcl     q,#1            'total cycles = 3 + 3*32 = 99
!!!!
Way to go on optimizing that COG instruction space! Unrolled loops without actually unrolling them -->at least that's what I'm seeing.
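The cycle count in that last comment generalizes easily. A Python sketch, assuming one cycle per instruction and a fixed three-instruction preamble (the REP itself plus the two instructions that must follow it), as the comment's arithmetic implies; the function names are made up:

```python
def rep_cycles(times, block_len, preamble=3):
    """Total cycles for REP [times, block_len]: preamble plus the repeated block."""
    return preamble + block_len * times

def unrolled_longs(times, block_len):
    """Cog registers the same work would take if the block were truly unrolled."""
    return block_len * times

print(rep_cycles(32, 3))       # 99
print(unrolled_longs(32, 3))   # 96 longs unrolled, versus 6 with REP
```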
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
Chat in real time with other Propellerheads on IRC #propeller @ freenode.net
One more ... I've often wanted to do this: "WRLONG INA, ptr"
Yes, no, maybe ?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
That pointer addressing is too cool for words! Thanks for finding a way to make it happen!
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!
But at the same time, it would be cool to add conditional compilation to the IDE so we could make our code work on either.
Thanks for entertaining our questions so far.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Huh? Did I read that correctly?
In my best Arnold voice: "Wha' chu talkin bout Willis?"
You had an open mind when you constructed the PropI.
Open it even more, to keep the PropII from falling short in its parallel processing power.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha
Well whatever but if you do it that way please make it easy to keep straight.
Any way to have one cog (or the hub) access the unused counters in another cog?
Sorry my bad English but.
1. The LMM that many people talk about is not parallel processing at full COG speed. Only semi-parallel!
2. If code compatibility reduces parallel processing power, skip it!
And there are many other aspects that decrease the COGs' capabilities.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha
I reckon there's going to be a lot of disappointment if that's the case as it seems everyone here has been thinking 16.
If there's a RDTXFR and WRTXFR which will magically transfer longs between two Propellers connected using a single pin at high speed that would mitigate the number of Cogs needed by making multi-chip arrays easier. Something like that would be nice to have even if it wasn't a blindingly fast link.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
That was my point in this thread:
http://forums.parallax.com/forums/default.aspx?f=25&p=1&m=212396
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone."
- Bjarne Stroustrup