Should the next Propeller be code-compatible?

OwenS · 2008-09-02 15:32

Chip,

You said that adding "cache-line style" accesses would require lots of muxes.

From my understanding, would it not be possible to just place two (per port) in front of the RAM data lines? (Assuming a 16byte read)

One to multiplex the 128-bits down to the 64 lsbs
One to multiplex the 64-bits down to the 32 lsbs

Each of the read muxes would need two inputs: one to enable it and one from its respective address line.
Each of the write muxes would need just one input: one to enable it, which fans out the data entering its lower half onto both of its outputs.

Now, I will admit, these are rather big muxes.

I selected 16 bytes for two reasons:
It's reasonably sized - enough to allow very fast data transfer, and
It's been proven to be effective - most desktop processors have 128-bit RAM buses, 128-bit cache lines, etc.

Now, another question: Do you really need the masking capabilities? I don't know of many applications which would need them. Most applications which need data fast just want to suck lots in and bash it out another end.

Of course, this all depends upon the difficulty of implementing 128-bit wide busses and RAM

Bill Henning · 2008-09-02 16:10

Chip Gracey (Parallax) said...

But wait! There's more!

JMPRETD INDA,INDB WC, WZ 'save old Z/C to D[10..9], load new Z/C from S[10..9]

Now this looks like a very viable, interesting, and easy software-only solution... I used to not like JMPRET for multi-threading because of having to come up with new addresses, but this way it's pretty automagic!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers

Beau Schwabe · 2008-09-02 16:31

Just an FYI... I'm thinking about this from a layout perspective. I need to confirm with Chip how wide the bus wires need to be, but for the illustration I used wires that were 1um wide. Typically, for good separation, the wires are spaced apart from one another at the same pitch that they are wide. So 128 bus wires at 1um wide would translate to 256 microns reserved for the bus. Fortunately, in this process we have 5 metal layers that we can work with.
If I designate the ODD metal layers (M1, M3, M5) for bus routing only and the EVEN metal layers (M2, M4) for COG routing to the bus, then I can at least halve the required space for the bus channel, down to 128 microns. If I use an overlapping technique with M1, M3, and M5, I can get it down to about a third of the original 256 microns, to about 92 microns.
Since M2 is in between M1 and M3, it can connect either up to M3 or down to M1 with a via connection... same with M4... it can connect up to M5 or down to M3.
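As a rough check on the figures above, here is a minimal Python sketch of the channel-width arithmetic. It assumes each wire occupies its own width plus an equal gap, and it models layer stacking as a simple divisor. The names `bus_channel_width_um` and `stack_factor` are illustrative only and do not come from any layout tool. The figure of about 92 microns for the overlapping technique is a layout estimate and is not derived here.

```python
def bus_channel_width_um(wires, wire_width_um=1.0, stack_factor=1):
    """Channel width in microns for a bus of parallel wires.

    Assumes spacing equal to wire width, so each wire takes 2 * width.
    stack_factor is a hypothetical divisor for wires stacked on separate
    metal layers (1 = everything side by side on one layer).
    """
    pitch_um = 2 * wire_width_um
    return wires * pitch_um / stack_factor


print(bus_channel_width_um(128))                  # single layer: 256.0
print(bus_channel_width_um(128, stack_factor=2))  # split across layers: 128.0
```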

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe

IC Layout Engineer
Parallax, Inc.

cgracey · 2008-09-02 20:15

OwenS said...
Chip,

You said that adding "cache-line style" accesses would require lots of muxes.

From my understanding, would it not be possible to just place two (per port) in front of the RAM data lines? (Assuming a 16byte read)

One to multiplex the 128-bits down to the 64 lsbs
One to multiplex the 64-bits down to the 32 lsbs

Each of the read muxes would need two inputs: one to enable it and one from its respective address line.
Each of the write muxes would need just one input: one to enable it, which fans out the data entering its lower half onto both of its outputs.

Now, I will admit, these are rather big muxes.

I selected 16 bytes for two reasons:
It's reasonably sized - enough to allow very fast data transfer, and
It's been proven to be effective - most desktop processors have 128-bit RAM buses, 128-bit cache lines, etc.

Now, another question: Do you really need the masking capabilities? I don't know of many applications which would need them. Most applications which need data fast just want to suck lots in and bash it out another end.

Of course, this all depends upon the difficulty of implementing 128-bit wide busses and RAM

Jeff Martin and I talked a lot about this issue this morning and I think we will pursue a 256-bit data bus (8 longs), after all. The plan is to use instruction %000011 (follows RD/WRBYTE, RD/WRWORD, RD/WRLONG) for a RD/WRLONGS instruction, where you specify 'data,address', as usual, but preload a counter with a special 'TRLONGS' instruction to establish the number of longs to be transferred:
        TRLONGS count
        RDLONGS cog_address,hub_address
This mechanism would be simple to use and would transfer longs at the sustained rate of the clock. At 160MHz, a whole cog could reload in 3.2us. It would handle alignment boundaries seamlessly, too. Because it would take EVERY clock cycle to keep up with the long transfers within the cog, this instruction would not be a background process, but would take all the processing time until it was completed.
Today I'm going to get the multitasking implemented, and then I want to pursue this.
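The 3.2us reload figure can be checked with a short Python sketch. It assumes one long moves per clock and that a cog holds 512 longs, as on the current Propeller. `COG_LONGS` and `transfer_time_us` are illustrative names, and setup overhead such as the TRLONGS instruction or waiting for a hub window is ignored.

```python
COG_LONGS = 512  # assumption: cog RAM size as on the current Propeller


def transfer_time_us(longs, clk_hz=160_000_000):
    """Microseconds needed to move `longs` longs at one long per clock."""
    return longs / clk_hz * 1e6


# A full cog image at 160 MHz comes out at about 3.2 microseconds.
print(round(transfer_time_us(COG_LONGS), 3))
```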

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Sapieha · 2008-09-02 20:31

Hi Chip Gracey.

Very nice decision.

If I understand, I can manipulate even single bytes to and from a COG in one instruction?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

cgracey · 2008-09-02 20:43

Sapieha said...
Hi Chip Gracey.

Very nice decision.

If I understand, I can manipulate even single bytes to and from a COG in one instruction?

Yes, the current RDxxxx/WRxxxx instructions would still work as they do. This would be an addition to them.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

OwenS · 2008-09-02 20:49

Chip,

That sounds VERY awesome! Now all you need to do is add a 5th write port on the SRAM! (I'm kidding! I'm kidding! :P )
I can only imagine the speeds this could bring to, for example, LMM

Cluso99 · 2008-09-02 20:57

Chip said...
JMPRETD INDA,INDB   WC, WZ   'save old Z/C to D[10..9], load new Z/C from S[10..9]

Chip, I presume if the WC was specified it would save and replace the C, and if not specified, no save/replace would occur. Similarly with WZ.

There are times you want to save and restore flags throughout a program, so this would be awesome. However, I am not sure of your use of INDA and INDB (will leave that for you).

Chip said...
        TRLONGS count
        RDLONGS cog_address,hub_address

This too would be awesome. It would alleviate the hub access bottleneck and give us fast hub stores and loads for any purpose (data, LMM, overlays, and ways not currently thought of).

Sapieha · 2008-09-02 21:01

Hi OwenS...

Your proposal is not so bad, with a little modification.

RD/WR HUB RAM direct to a PORT (the PORT registers), X bytes/words/longs in one instruction!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Post Edited (Sapieha) : 9/2/2008 9:09:29 PM GMT

J. A. Streich · 2008-09-02 21:04

I also vote for breaking things in the near term for flat memory for some time to come.

You mention "analog pins". Are these pins that have ADCs attached to them? What's the maximum sampling rate?

Sapieha · 2008-09-02 21:10

Hi J. A. Streich.

Chip mentioned that possibility.

I even mentioned a 2x16-bit R2R DAC output on every COG.

P.S. See the first post on this page.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Post Edited (Sapieha) : 9/2/2008 9:20:37 PM GMT

cgracey · 2008-09-02 21:20

J. A. Streich said...
I also vote for breaking things in the near term for flat memory for some time to come.

You mention "analog pins". Are these pins that have ADCs attached to them? What's the maximum sampling rate?

Each pin on the next Propeller will have an integrated delta-sigma ADC that will sample at the clock frequency. At 160MHz, you could get the following resolutions and sample rates:

1 bit @ 80 MHz
2 bits @ 40 MHz
3 bits @ 20 MHz
4 bits @ 10 MHz
5 bits @ 5 MHz
6 bits @ 2.5 MHz
7 bits @ 1.25 MHz
8 bits @ 625 kHz
9 bits @ 312 kHz
10 bits @ 156 kHz
11 bits @ 78 kHz
12 bits @ 39 kHz
...
16 bits @ 2.4 kHz
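The list above appears to follow a simple rule: every extra bit of resolution halves the sample rate, so the rate is the clock divided by 2^bits. The Python sketch below reproduces the listed values under that assumption. `sample_rate_hz` is an illustrative name, and the rule is inferred from the table rather than taken from a specification.

```python
def sample_rate_hz(bits, clk_hz=160_000_000):
    """Sample rate for `bits` of resolution, assuming rate = clk / 2**bits."""
    return clk_hz / (1 << bits)


for bits in (1, 8, 12, 16):
    print(bits, "bits @", sample_rate_hz(bits), "Hz")
```

Under this rule, 12 bits works out to 39062.5 Hz and 16 bits to roughly 2441 Hz, matching the rounded figures in the list.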

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Javalin · 2008-09-02 21:23

>Each pin on the next Propeller will have an integrated delta-sigma ADC that will sample at the clock frequency. At 160MHz, you could get the following resolutions and sample rates:

wow.

cgracey · 2008-09-02 21:27

Cluso99 said...

Chip said...
JMPRETD INDA,INDB   WC, WZ   'save old Z/C to D[10..9], load new Z/C from S[10..9]

Chip, I presume if the WC was specified it would save and replace the C, and if not specified, no save/replace would occur. Similarly with WZ.

Anytime a JMPRETD/JMPRET (or CALLD/CALL) executes, the return address is poked into D[8..0]. The new behavior would also poke Z and C into D[10..9]. Of course, all this is predicated on the WR bit being set, so that D will be written back. In the case of a JMPD/JMP or CALLD/CALL, D is not written back (WR=0).
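To make the bit layout concrete, here is a minimal Python sketch of the packing described above: a 9-bit return address in D[8..0] and the two flags in D[10..9]. The assignment of Z to bit 10 and C to bit 9 is an assumption based on the order "Z/C to D[10..9]", and the function names are hypothetical. It models only the register contents, not instruction timing.

```python
ADDR_MASK = 0x1FF  # D[8..0]: 9-bit return address
C_BIT = 9          # assumption: C flag stored in bit 9
Z_BIT = 10         # assumption: Z flag stored in bit 10


def save_return_state(d, return_addr, z, c):
    """Write the return address and Z/C into a 32-bit D, keeping bits 31..11."""
    field = ADDR_MASK | (1 << C_BIT) | (1 << Z_BIT)
    d &= ~field & 0xFFFFFFFF
    d |= return_addr & ADDR_MASK
    d |= (int(c) << C_BIT) | (int(z) << Z_BIT)
    return d


def load_jump_state(s):
    """Read the jump address and the new Z/C flags back out of S."""
    return s & ADDR_MASK, bool((s >> Z_BIT) & 1), bool((s >> C_BIT) & 1)


saved = save_return_state(0xFFFFFFFF, 0x123, z=True, c=False)
print(hex(saved), load_jump_state(saved))
```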
There are times you want to save and restore flags throughout a program, so this would be awesome. However, I am not sure of your use of INDA and INDB (will leave that for you).

Can you elaborate on this?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Sapieha · 2008-09-02 22:22

Hi Chip Gracey.

You said....
""....
    TRLONGS count
    RDLONGS cog_address,hub_address
This mechanism would be simple to use and transfer longs at the sustained rate of the clock. At 160MHz, a whole cog could reload in 3.2us. It would handle alignment boundaries seamlessly, too. Because it would take EVERY clock cycle to keep up with the long transfers within the cog, this instruction would not be a background process, but take all the processing time until it was completed.
"....

I said
""....
Hi OwenS...

Your proposal is not so bad, with a little modification.

RD/WR HUB RAM direct to a PORT (the PORT registers), X bytes/words/longs in one instruction!
".....

My proposal is to add to it:
    TRLONGS count,increment flags   < Increment only cog_address, only hub_address, or else both.
                                      (Incrementing only the HUB address is most useful with block
                                      transfer protocols from and to I/O. With only the HUB flag on,
                                      it is possible to fill a block with a specific value.)
    RDLONGS cog_address,hub_address

P.S. All instructions that save longs in the COG are welcome.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Post Edited (Sapieha) : 9/2/2008 10:36:13 PM GMT

Capt. Quirk · 2008-09-02 22:25

Are you now building a replacement for the Prop1 or a big brother with possibly 64 i/o pins?

Post Edited (Capt. Quirk) : 9/2/2008 10:34:36 PM GMT

Paul Baker · 2008-09-02 22:33

The original Prop isn't going anywhere; the chip under discussion will be sold alongside the current Propeller.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

heater · 2008-09-02 23:12

Sapieha: "I think 9 COGs is not an option, because of the nature of binary counter construction (more complicated)"

I don't see the problem. Of course, if you build a simple ripple counter, 9 might be a bit more tricky, but from my limited understanding of digital logic design it seems just as easy to build a state machine to count from 0 to 8 as to count from 0 to 7.

There might be other reasons why 9 is a crazy number of COGS, layout for example, but counting them seems trivial.
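As a small illustration of the counting argument, the Python sketch below models a cog selector as a state machine that wraps at an arbitrary count. It is a behavioral sketch only, not a gate-level design, and `counter_bits` and `next_cog` are illustrative names. It shows that counting 0 to 8 needs one more state bit than counting 0 to 7, plus an explicit wrap instead of a natural overflow.

```python
def counter_bits(num_cogs):
    """Number of state bits needed to count 0 .. num_cogs - 1."""
    return max(1, (num_cogs - 1).bit_length())


def next_cog(current, num_cogs):
    """Next state of a counter that wraps back to 0 after num_cogs - 1."""
    return 0 if current == num_cogs - 1 else current + 1


print(counter_bits(8), counter_bits(9))  # 3 bits versus 4 bits

state = 0
for _ in range(10):
    state = next_cog(state, 9)
print(state)  # after 10 steps of a 0..8 counter the state is 1
```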

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Sapieha · 2008-09-02 23:22

Hi heater.

I can explain it like this. You can build counters that count in both binary and decimal.

First: 0-7 = a 3-bit counter, a power of 2
Second: 0-8 = a 4-bit counter, and not a power of 2

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Tubular · 2008-09-02 23:37

@heater re: 9 cogs
Interesting concept. If you also connect and wrap the diagonals, each cog would talk to all 8 neighbors

J. A. Streich · 2008-09-03 00:13

Chip, looking at the list of sample frequencies, it looks like the best options available for audio, without an external ADC, would be:
11 bits @ 78 kHz
12 bits @ 39 kHz

Looks very nice, and I think I'll be one of the first in line for the Prop II to build the audio project I have in mind.

Will each pin also have a compatible DAC for output, or will we have to use the methods we've been using on the Prop I, or an external DAC?

Something to think about: CD sampling rate is 16 bit @ 44.1 kHz... Think the Prop III will be able to have each pin sample at this rate?

Bill Henning · 2008-09-03 00:54

Very interesting.

Assuming that one does an early "TRLONGS 8" during cog initialization, then thanks to the 256-bit-wide hub memory bus, I could do something like

loop:
rdlongs cog_buff,hub_buff[++32]
<do fun stuff for six cycles>
jmp #loop

with the above loop reading hub memory into the cog at 640 MB/sec?????? Wow... the cogs are not fast enough to chew data at that rate :)

If true, I am drooling already.

Best,

Bill

Chip Gracey (Parallax) said...

Jeff Martin and I talked a lot about this issue this morning and I think we will pursue a 256-bit data bus (8 longs), after all. The plan is to use instruction %000011 (follows RD/WRBYTE, RD/WRWORD, RD/WRLONG) for a RD/WRLONGS instruction, where you specify 'data,address', as usual, but preload a counter with a special 'TRLONGS' instruction to establish the number of longs to be transferred:
TRLONGS count
RDLONGS cog_address,hub_address
This mechanism would be simple to use and transfer longs at the sustained rate of the clock. At 160MHz, a whole cog could reload in 3.2us. It would handle alignment boundaries seamlessly, too. Because it would take EVERY clock cycle to keep up with the long transfers within the cog, this instruction would not be a background process, but take all the processing time until it was completed.
Today I'm going to get the multitasking implemented, and then I want to pursue this.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers

william chan · 2008-09-03 01:04

Will Propeller II consume less current than Propeller I at the same clock speed?

I think the Propeller has a lot of potential in battery powered applications.

I am working on one battery powered application myself.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my

Paul Baker · 2008-09-03 01:45

Hey Chip, I've been thinking about it and I'm not sure that TRLONGS count is strictly necessary. There will be the repeat command, and the compiler could be smart enough to break it into pieces. Now, if TRLONGS makes it easier from the compiler or programmer standpoint, then include it.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

cgracey · 2008-09-03 03:47

Paul Baker (Parallax) said...
Hey Chip, I've been thinking about it and I'm not sure that TRLONGS count is strictly necessary. There will be the repeat command, and the compiler could be smart enough to break it into pieces. Now, if TRLONGS makes it easier from the compiler or programmer standpoint, then include it.

We could not REPeat RDLONGS with maximum efficiency because after each hub access, it would go one cycle past the next hub opportunity. Being an instruction, though, it could issue another hub r/w command in the same cycle that it is storing the last long from the last fetch, so it never misses a beat.

And YES, it would transfer 640 MB/s to/from a cog. You could also say that the max hub memory bandwidth is 8 times that, or 5.12 GB/s.
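Both bandwidth figures follow from the clock rate and the bus widths already mentioned in the thread. The Python sketch below assumes a 160 MHz clock, one 32-bit long per clock into a cog, and a 256-bit hub bus delivering a full line per clock, with decimal megabytes and gigabytes. The function names are illustrative.

```python
def cog_bandwidth_bytes_per_s(clk_hz=160_000_000, bytes_per_long=4):
    """Bytes per second into one cog at one long per clock."""
    return clk_hz * bytes_per_long


def hub_bandwidth_bytes_per_s(clk_hz=160_000_000, bus_bits=256):
    """Bytes per second across the hub bus at one full line per clock."""
    return clk_hz * bus_bits // 8


print(cog_bandwidth_bytes_per_s())  # 640000000, i.e. 640 MB/s
print(hub_bandwidth_bytes_per_s())  # 5120000000, i.e. 5.12 GB/s
```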

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Cluso99 · 2008-09-03 03:52

Cluso99 said...

Chip said...
JMPRETD INDA,INDB   WC, WZ   'save old Z/C to D[10..9], load new Z/C from S[10..9]

Chip, I presume if the WC was specified it would save and replace the C, and if not specified, no save/replace would occur. Similarly with WZ.

Anytime a JMPRETD/JMPRET (or CALLD/CALL) executes, the return address is poked into D[8..0]. The new behavior would also poke Z and C into D[10..9]. Of course, all this is predicated on the WR bit being set, so that D will be written back. In the case of a JMPD/JMP or CALLD/CALL, D is not written back (WR=0).

Wouldn't that break the use of the jmpret for storing the return address into another jmp (or jmpret) source, as bits 1 & 0 of its Dest would also be overwritten? That is why I thought that the wc and wz options would specify if the c and z flags would be saved (and also loaded). Otherwise, the jmpretd could not be utilised for a normal jmpret and execute the extra 2 free instructions.

Cluso99 said...
There are times you want to save and restore flags throughout a program, so this would be awesome. However, I am not sure of your use of INDA and INDB (will leave that for you).

Can you elaborate on this?

Not sure what you mean, so...
I was not sure if INDA and INDB are special registers or just plain cog memory to be used as indirect pointers?
Programs such as the Interpreter often save flags before and restore after calling certain subroutines.

Paul Baker · 2008-09-03 03:53

Ah I understand now, thanks

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

heater · 2008-09-03 04:44

Sapieha: Yes, but I don't see any particular reason why the cog counter must be a power of 2. Just use 9 counts in 4 bits.

Tubular: Connecting all to all would be ideal, but I kind of assumed that would be too many tracks to lay down in the space available.

Anyway from recent posts it seems the data transfer rate through the HUB is going to be so vast that it makes any other COG to COG interconnect unnecessary. Just go through the HUB.

That only leaves the requirement for fast Prop to Prop links for those who need more COGS.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

heater · 2008-09-03 05:42

Power saving suggestion. WAITLOCK

Whilst we are discussing data transfer through the hub I would like to make the following observation:

In the Prop I if it is required to save power, in a battery powered application say, there are the options to:

1) Stop as many COGS as possible during "low power mode" then restart them when there is something to do.
2) Have the COGS wait on a pin (WAITPEQ, WAITPNE) or counter (WAITCNT), which presumably gets them to use as little power as 1).

Now 1) is a bit of a pain, as you have to program all that starting and stopping somewhere, and it destroys all the state of whatever process was running on the COG unless you take the trouble to save/restore it.

2) is fine as long as your COG is interacting with external hardware (inc. time). And one could always use a pin to "signal" from COG to COG that a wait (power down) is required.

So for COGS that are sourcing/sinking data from another COG through HUB RAM, I would like to see some kind of WAIT on the data source/sink.

That is WAITLOCK id (or LOCKWAIT I guess.)

With this a COG would automatically halt into low power mode until it had access to the data guarded by the lock. No more busy looping on LOCKSET.
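The difference between the two styles can be sketched in software terms with Python's `threading.Lock`. This is only an analogy and says nothing about real power draw on the chip. `busy_acquire` stands in for looping on LOCKSET, `waiting_acquire` stands in for the proposed WAITLOCK, and both names are hypothetical.

```python
import threading


def busy_acquire(lock):
    """Analogue of looping on LOCKSET: poll until the lock is obtained.

    Returns the number of polls made, each of which costs active cycles.
    """
    polls = 1
    while not lock.acquire(blocking=False):
        polls += 1
    return polls


def waiting_acquire(lock, timeout=None):
    """Analogue of the proposed WAITLOCK: block until the lock is free.

    Returns True once the lock is held, or False if `timeout` expires.
    """
    if timeout is None:
        return lock.acquire()
    return lock.acquire(timeout=timeout)


lock = threading.Lock()
print(busy_acquire(lock))   # lock was free, so a single poll succeeds
lock.release()
print(waiting_acquire(lock))
lock.release()
```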

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

heater · 2008-09-03 05:52

Actually, thinking about it, I don't see why LOCKSET doesn't just do a wait anyway.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
