The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

koehler · 2014-04-08 02:04

[ ]

Cluso99 · 2014-04-08 02:06

Brian Fairchild wrote: »

True, although with a headline number of 512k the P1+ shouldn't fall too short. Plus, with the P1+ you can trade off program against data.

At the risk of getting all nostalgic, my first microcontroller work was with 8751s. 4k of UV EPROM and all of 128 bytes of RAM. Program, burn, debug, erase, repeat.

Then you would recall the 8751's opponent, the 68705 - the first one in that family was MC68705P3S - 1.7KB UV EPROM and 128B RAM @ 1MHz. I developed a 2 x 68705 board - there was always a tube erasing chips and they originally cost $170 each

heater: I had forgotten they were initially ceramic

Brian Fairchild · 2014-04-08 02:13

Heater. wrote: »

We love nostalgia around here.

Sigh. Intel Intellec Series 2's, ASM51, ICE51, PL/M51....Actually, scrub that last one. PL/M51 was a dog.

David Betz · 2014-04-08 02:34

Cluso99 wrote: »

re HUBEXEC mode

It stops the detractors who see 2KB cog ram as too small.
It places the Call return address in a fixed location like GCC wants.
In its most basic form it runs at LMM speed and only uses 25% power.

You will note I used a separate JMPRETX instruction. This will force the user to know when he is writing hubexec code, and the compiler can be enhanced to check certain caveats.

The biggest thing to me is that I can write Hubexec Pasm simply. I went thru' all the problems of using LMM (without macros) for my P2 Debugger which made me understand the sw issues involved with LMM. Once Chip added hubexec, almost all of them went away. I could convert cog code to hubexec quite simply.

So, basic hubexec is definitely worth pursuing.

I'm not sure we really need to distinguish between JMP and JMPRETX unless that makes the hub exec logic simpler. It wastes an opcode otherwise.

David Betz · 2014-04-08 02:36

cgracey wrote: »

One aspect of having 4-long hub transfers is that with a few simple address tags (TLB, David?) we could direct out-of-cog addresses into 4-long register blocks which are serving as instruction caches. Part of the cog register RAM becomes the cache! No cache-line flipflops and mux's needed!

I think hub exec is going to happen, because it won't take much. What would really blow it wide open would be to have a 256-bit hub data path, so that each cog could do an 8-instruction fetch every 8 instructions. That would have the effect of jacking the power up quite a bit, I'm afraid. All cogs could run at 100% speed from the hub without branching or hub accesses.

I think Cluso thought up this possibility on the Prop2 effort.

Hmmm... TLB? Unfortunately, there is a little more to it than that. You need some sort of trap that will run code to fill the cache if no tag matches. You also need to be able to restart the instruction that caused the miss. Are you sure that's in the scope of this new chip? It would be nice but I wouldn't want it to jeopardize this effort.
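
The miss-and-refill behaviour described above can be sketched as a toy Python model. This is only an illustration under stated assumptions: `TaggedCache`, its direct-mapped layout and the two-line default are hypothetical choices, not anything specified in this thread, and hub memory is modelled as a plain list of longs.

```python
# Toy model of a tagged instruction cache backed by hub memory.
# All names here are hypothetical; this is not the chip's design.
LINE_LONGS = 4  # one 4-long hub transfer fills one cache line


class TaggedCache:
    def __init__(self, hub, num_lines=2):
        self.hub = hub                       # list of longs standing in for hub RAM
        self.tags = [None] * num_lines       # which hub line each cache line holds
        self.lines = [[0] * LINE_LONGS for _ in range(num_lines)]
        self.misses = 0

    def fetch(self, addr):
        """Return the long at hub long-address `addr`."""
        tag, offset = divmod(addr, LINE_LONGS)
        slot = tag % len(self.tags)          # direct-mapped: tag picks the line
        if self.tags[slot] != tag:
            # Miss: refill the line from hub (the job of the trap handler),
            # then fall through so the original fetch completes.
            self.misses += 1
            base = tag * LINE_LONGS
            self.lines[slot] = self.hub[base:base + LINE_LONGS]
            self.tags[slot] = tag
        return self.lines[slot][offset]


hub = list(range(100, 132))                  # 32 longs of fake "instructions"
cache = TaggedCache(hub)
fetched = [cache.fetch(a) for a in range(8)] # straight-line code, 8 longs
print(fetched, "misses:", cache.misses)
```

In this model straight-line code misses once per 4 longs. The hard part raised in the post, restarting the instruction that caused the miss, is hidden here by the simple fall-through after the refill.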

jmg · 2014-04-08 02:53

cgracey wrote: »

I know that a CLUT is really important for 8-bit pixel data, so that it can be expanded into 8:8:8 RGB. Even a tiny CLUT for 4-bit pixel data would be nice. We'll see.

Tasks are easy to implement. Having a simplified time-slot scheme would be best for this chip..

A tiny CLUT is likely to be too small, but a simplified time-slot scheme with shared access to COG RAM would allow the CLUT to be a full 8 bits (at the cost of code space).
It would halve the COG opcode speed, but avoids any extra RAM that costs die area.

Pixel clock speeds seem to be 33.3MHz and 38.1MHz, for the indicated SysCLKs of 133.2MHz and 152.4MHz at the /2/2 needed. That looks to just fit inside the MHz envelope.
I presume the rest of the video channel can feed at this rate, if the HW manages the CLUT ?
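
The pixel clock figures above follow from dividing the system clock by two, twice. A minimal Python check, assuming only the numbers quoted in the post (`pixel_clock_mhz` is a hypothetical helper name):

```python
def pixel_clock_mhz(sysclk_mhz, dividers=(2, 2)):
    """Apply each divide stage in turn to the system clock (MHz)."""
    clk = sysclk_mhz
    for d in dividers:
        clk /= d
    return clk


for sysclk in (133.2, 152.4):
    print(f"{sysclk} MHz sysclk -> {pixel_clock_mhz(sysclk):.1f} MHz pixel clock")
```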

RossH · 2014-04-08 02:53

David Betz wrote: »

Hmmm... TLB? Unfortunately, there is a little more to it than that. You need some sort of trap that will run code to fill the cache if no tag matches. You also need to be able to restart the instruction that caused the miss. Are you sure that's in the scope of this new chip? It would be nice but I wouldn't want it to jeopardize this effort.

And so it begins ...

Ross.

Cluso99 · 2014-04-08 03:01

David Betz wrote: »

I'm not sure we really need to distinguish between JMP and JMPRETX unless that makes the hub exec logic simpler. It wastes an opcode otherwise.

The new JMPRETX (JMPX/CALLX/RETX) is needed because we need to have #<17 bit address> and the return address needs to be stored in a fixed location (likely $1EF). This instruction can use WC & WZ to store/recall the Z & C flags (as per P2).
The existing P1 JMPRET (JMP/CALL/RET) should be kept for compatibility - uses D to store return address in cog, and S for the goto address in cog. WZ is set if result=0.
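
A quick check of why 17 bits would be enough, assuming the immediate is a long address (an assumption, the post only says "17 bit address") and the hub is the 512KB in the thread title:

```python
HUB_BYTES = 512 * 1024          # hub size from the thread title
BYTES_PER_LONG = 4

hub_longs = HUB_BYTES // BYTES_PER_LONG
long_addr_bits = (hub_longs - 1).bit_length()   # bits to index every long
byte_addr_bits = (HUB_BYTES - 1).bit_length()   # bits to index every byte

print(hub_longs, long_addr_bits, byte_addr_bits)
```

So a 17-bit long address would cover the whole hub, while a byte address would need 19 bits.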

Brian Fairchild · 2014-04-08 03:02

cgracey wrote: »

That would have the effect of jacking the power up quite a bit, I'm afraid. All cogs could run at 100% speed from the hub without branching or hub accesses.

Then don't do it. I really don't believe that this chip needs 100% from every COG.

In another topic I wrote...

Brian Fairchild wrote: »

So if a 16 COG P1+ appears I can have 8 COGs acting as very capable intelligent peripherals accessing HUB RAM at 1/128 to exchange data and 8 COGs running out of HUB at 15/128 or around 23.5 MIPS each! So that's a chip with a pile of intelligent peripherals and the equivalent of 8 typical 8-bit processors (except they're 32-bit) all in one package.

This was with Bill's simple slot allocator.
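
The slot arithmetic in that quote can be checked with a short Python sketch. The 128-slot table and the 1/128 and 15/128 shares are from the post; the "implied rate" line is only an inference from the quoted 23.5 MIPS, not a figure stated in the thread.

```python
from fractions import Fraction

SLOTS = 128                          # slot table length used in the post
peripheral_cogs, exec_cogs = 8, 8

peripheral_share = Fraction(1, SLOTS)
exec_share = Fraction(15, SLOTS)

# All 16 cogs together should use exactly the whole slot table.
total_share = peripheral_cogs * peripheral_share + exec_cogs * exec_share

# Inference only: the slot rate that would give 23.5 MIPS at a 15/128 share.
implied_rate_mips = 23.5 / float(exec_share)

print(total_share, round(implied_rate_mips, 1))
```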

Brian Fairchild · 2014-04-08 03:04

You know, the danger with feature creep is that the only person you end up competing with is yourself.

Cluso99 · 2014-04-08 03:10

jmg wrote: »

A tiny CLUT is likely to be too small, but a simplified time-slot scheme with shared access to COG RAM would allow the CLUT to be a full 8 bits (at the cost of code space).
It would halve the COG opcode speed, but avoids any extra RAM that costs die area.

Pixel clock speeds seem to be 33.3MHz and 38.1MHz, for the indicated SysCLKs of 133.2MHz and 152.4MHz at the /2/2 needed. That looks to just fit inside the MHz envelope.
I presume the rest of the video channel can feed at this rate, if the HW manages the CLUT ?

Could the COG RAM be divided as 2 x 256 long blocks, and one block be switched to be CLUT ?
In this mode, code could execute in parallel in one half, and also load the CLUT. When in this mode, instead of the instructions being read out of the CLUT section, could the video be read out of this section while the other half of the cog continues running code ? ie the other half can write to the clut (write port) while the video reads (read port).

I don't think it's a big silicon use provided the cog could be done in two blocks of 256 longs (2 sets of 4 x 64 longs). Is it worth doing ???
(I cannot answer this as I don't really understand how the clut gets used).

jmg · 2014-04-08 03:32

Cluso99 wrote: »

Could the COG RAM be divided as 2 x 256 long blocks, and one block be switched to be CLUT ?
In this mode, code could execute in parallel in one half, and also load the CLUT. When in this mode, instead of the instructions being read out of the CLUT section, could the video be read out of this section while the other half of the cog continues running code ? ie the other half can write to the clut (write port) while the video reads (read port).

I don't think it's a big silicon use provided the cog could be done in two blocks of 256 longs (2 sets of 4 x 64 longs). Is it worth doing ???
(I cannot answer this as I don't really understand how the clut gets used).

Yes, that is exactly what I am proposing, but you may also be meaning a non-mux split, so there is no time slice penalty ?

In my case, the Memory is one block, and address is MUX'd, with shared data out going to either Execute, or Video paths.

If the memory is split, that is both address and data MUXs, but the MHz figures can be halved.

Only Chip would know how complex these are relative to each other.

Baggers · 2014-04-08 03:35

Wow, go to sleep and a 13 page encyclopedia appears from nowhere lol
P.S. Bill, don't leave, you've been a great contributor to the Prop scene

A tiny CLUT would be fine. To go from 8bpp to 8:8:8 with the 128-bit waitvid, it can be like the old 4x8bpp waitvids we used for a single colour per pixel on the original P1: we could have the 4 longs as the values. This would give us awesome graphics capabilities and wouldn't need any 'CLUT' as such, because the cog RAM is 128-bit, allowing a single read in the waitvid to get 4 longs (8:8:8:x) instead of just one.
An 8bpp 640x480 VGA display at 8:8:8RGB is going to look amazing and would take 307,200 ( + 1K palette data ) = 301KB, and we'd still have 211KB - ( ROM size ) left. And this is without much modification too!
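
The memory budget above checks out. A minimal Python sketch, assuming a 256-entry palette stored as one long per entry (which is what gives the 1K palette figure):

```python
WIDTH, HEIGHT = 640, 480
HUB_KB = 512

frame_bytes = WIDTH * HEIGHT            # 8bpp: one byte per pixel
palette_bytes = 256 * 4                 # assumed: 256 entries, one long each

used_kb = (frame_bytes + palette_bytes) // 1024
remaining_kb = HUB_KB - used_kb         # before subtracting any ROM

print(frame_bytes, used_kb, remaining_kb)
```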

512KB is more than enough, even more so now we have more IO pins than 32! SDRAM/SRAM can be added, giving that extra memory should you need more! It's only 128KB short of the "640KB ought to be enough for everyone", remember.

512KB is more than most of the computers we had back then, and what a few of the people on this forum will be wanting to emulate too! most of them can be done in HUB ram now!

As for CMPR and ANDN although I have used it on a few occasions, I'd rather see them go for something more useful, as I can replace those instructions with two instructions.

RossH · 2014-04-08 03:45

Cluso99 wrote: »

The new JMPRETX (JMPX/CALLX/RETX) is needed because we need to have #<17 bit address> and the return address needs to be stored in a fixed location (likely $1EF). This instruction can use WC & WZ to store/recall the Z & C flags (as per P2).
The existing P1 JMPRET (JMP/CALL/RET) should be kept for compatibility - uses D to store return address in cog, and S for the goto address in cog. WZ is set if result=0.

I thought Chip was just implementing multiple loads to support a simple Hub Exec?

Why do we need all this other stuff?

Ross.

Cluso99 · 2014-04-08 04:05

Baggers wrote: »

Wow, go to sleep and a 13 page encyclopedia appears from nowhere lol

Yes, I know!!!

P.S. Bill, don't leave, you've been a great contributor to the Prop scene

+1

A tiny CLUT would be fine. To go from 8bpp to 8:8:8 with the 128-bit waitvid, it can be like the old 4x8bpp waitvids we used for a single colour per pixel on the original P1: we could have the 4 longs as the values. This would give us awesome graphics capabilities and wouldn't need any 'CLUT' as such, because the cog RAM is 128-bit, allowing a single read in the waitvid to get 4 longs (8:8:8:x) instead of just one.
An 8bpp 640x480 VGA display at 8:8:8RGB is going to look amazing and would take 307,200 ( + 1K palette data ) = 301KB, and we'd still have 211KB - ( ROM size ) left. And this is without much modification too!

I really value your comments on this.

512KB is more than enough, even more so now we have more IO pins than 32! SDRAM/SRAM can be added, giving that extra memory should you need more! It's only 128KB short of the "640KB ought to be enough for everyone", remember.

Then they said no-one will ever need 4GB (and this time I believed them, so how dumb am I)

512KB is more than most of the computers we had back then, and what a few of the people on this forum will be wanting to emulate too! most of them can be done in HUB ram now!

As for CMPR and ANDN although I have used it on a few occasions, I'd rather see them go for something more useful, as I can replace those instructions with two instructions.

ANDN OUTA,<mask> is required for turning off pins quickly, just like OR and XOR.
I use CMPSUB a lot, but have found CMPR extremely useful in P2 (it's not in P1).
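
For anyone following along, the pin operations being discussed can be modelled on a 32-bit register in Python. The function names are hypothetical stand-ins for the PASM instructions, and `andn_two_step` is just one way to read "replace it with two instructions", not a claim about what Baggers had in mind.

```python
MASK32 = 0xFFFFFFFF              # registers are 32 bits wide


def andn(d, s):
    """ANDN D,S: clear in D every bit that is set in S."""
    return d & ~s & MASK32


def or_(d, s):
    """OR D,S: set in D every bit that is set in S."""
    return (d | s) & MASK32


def xor(d, s):
    """XOR D,S: toggle in D every bit that is set in S."""
    return (d ^ s) & MASK32


def andn_two_step(d, s):
    """Same result as andn(), done as invert-then-AND with a temporary."""
    tmp = s ^ MASK32             # step 1: complement the mask
    return d & tmp               # step 2: plain AND


outa = 0b1111_0000
mask = 0b0011_0000
print(bin(andn(outa, mask)), bin(or_(0b1100_0000, mask)), bin(xor(outa, mask)))
```

The two-step version needs a scratch register and an extra instruction unless the inverted mask is precomputed, which is the cost being traded against an opcode slot.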

Cluso99 · 2014-04-08 04:11

RossH wrote: »

Cluso99 wrote: »

The new JMPRETX (JMPX/CALLX/RETX) is needed because we need to have #<17 bit address> and the return address needs to be stored in a fixed location (likely $1EF). This instruction can use WC & WZ to store/recall the Z & C flags (as per P2).
The existing P1 JMPRET (JMP/CALL/RET) should be kept for compatibility - uses D to store return address in cog, and S for the goto address in cog. WZ is set if result=0.

I thought Chip was just implementing multiple loads to support a simple Hub Exec?

Why do we need all this other stuff?

Ross.

How are you going to do a JMP <hubaddr> and CALL <hubaddr> without these ??? RET can be done with the JMP.
JMPRETX (JMPX/CALLX/RETX) is only one instruction, similar to JMPRET (JMP/CALL/RET).

I have just realised the DJNZ/TJNZ/TJZ will need to be relative in hubexec mode too.

Sapieha · 2014-04-08 04:15

Hi.

I'm disappointed with how things went while I was in hospital.

Even if I like some of the ideas,
it is not what I wanted, so I think it is time for me to write my own COG code that meets my needs for my experiments.
Much is already done, but I don't think I have time to wait until all the discussions settle on the new things people want in this new IC.

I was very happy with the P2 instruction set and how it worked. The mixed COG/HUB execution was clever and very usable.
It gave the best of two worlds of programming, BUT it looked too good to be true.
So now we will have a simpler IC, even if it will have more COGs and more HUB RAM. I don't think it will give as much programming power as the original P2 code.

All because every time Chip found a new possibility, people asked for more.
Now we have a nice P2 binary that works well, but we don't even have a complete instruction set description to use it to the full.
BAD, but life has taught me not to wait too much -- it will always be shortened.

No more complaints from me -- have a great day, all.

Seairth · 2014-04-08 04:31

Bill Henning wrote: »

Sorry, I intensely dislike cooperative tasking. Wastes a lot of cog space, have to insert manual yields.

Possibly true. My concern, however, is that we (the community) are making so many requests that affect fundamental aspects of the architecture that I feel like we are trying to make the P2-, not the P1+. I'm trying to limit my suggestions to those things which I perceive can be added without significant architectural change. Adding P2-style tasking would definitely require an architectural change which I believe to be significant. That adds both time and risk to getting the P1+ finished quickly.

Heater. wrote: »

Cooperative multi-tasking is a dog.

Example, a full duplex uart rx and tx threads.

You can insert a "yield/suspend" point in their thread loops. But then you find the latency in responding to incoming edges is too high and the outgoing edges are very jittery.

OK add more "yield/suspend" points in those loops. Now you are pushing up the instruction count, wasting valuable COG space. Not only that you are slowing things down. Those yield instructions take time.

In the extreme you have a "yield/suspend" in between every actual useful instruction. Now you have:
a) doubled the size of your code and
b) Halved its performance.

Hardware scheduled threads, with an instruction from each thread executed alternately, are much better because:
a) Code size is minimized.
b) Performance is doubled, due to a)
c) Latency in response to incoming edges and the tx clock is minimized, with a lot less jitter and more timing accuracy.
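
The code-size argument above can be put into numbers with a simple Python model. It assumes each yield costs exactly one instruction and ignores everything else, and `cooperative_cost` is a hypothetical helper, so treat it as a sketch of the trend only.

```python
def cooperative_cost(useful, per_yield):
    """Total instructions when one yield follows every `per_yield` useful ones."""
    yields = useful // per_yield
    return useful + yields


USEFUL = 64
for per_yield in (8, 4, 1):
    total = cooperative_cost(USEFUL, per_yield)
    print(f"yield every {per_yield}: {total} instructions, "
          f"{USEFUL / total:.0%} doing useful work")
```

With a yield after every useful instruction the code doubles and useful throughput halves, which is the extreme case described in the post.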

Anyway, if you want cooperative threading we already have it with jmpret don't we? As used in FullDuplexSerial on the P1.

Yeah, I looked at Chip's FullDuplexSerial. There are a total of 4 JMPRETs (plus two holding registers, which go away with a baked-in task register) out of a total of 85 instructions. With my suggested instructions, the code size doesn't change. I'd hardly consider those instructions to be a lot of wasted space and they are putting zero pressure on the overall code size. And the SWTASK instruction has the advantage of maintaining the flags, which JMPRET doesn't do. (Admittedly, FDS doesn't need that feature.)

Further, if Chip adds the pin state machines, I believe your concerns about I/O latency and jitter will be significantly mitigated. (True, that's only a guess. Not an unreasonable one.)

Remember that we are now going to have 16 cogs! There will be significantly less need to do "space invaders" type programming in a single cog. (hah! what an unexpected turn of phrase!). So why complicate the architecture to support a mode that will be even less necessary than on the P2?

Heater. · 2014-04-08 05:05

Seairth,

"space invaders" type programming

I love it. Very apt.

Yes FDS does not waste a lot of code space with yields. But then it is limited in the speeds it runs at and the accuracy of the edge timing it can generate/sample. Improving that accuracy requires inserting more yields. Which takes more code, which takes more time which at some point means things are getting worse rather than better. Perhaps FDS is already at that sweet spot, I don't know. Chip wrote it so I imagine it is.

Hardware threading allows you to push it much further. Just write the loops you need and let it rip.

Perhaps this can be done with pin state machines, I have no idea, they don't exist yet.

I did say above that threads were valuable when COGs were big, expensive and few. Now that we have many perhaps not so much.

Baggers · 2014-04-08 05:21

Cluso99 wrote: »

Yes, I know!!!

+1

I really value your comments on this.

Then they said no-one will ever need 4GB (and this time I believed them, so how dumb am I)

ANDN OUTA,<mask> is required for turning off pins quickly, just like OR and XOR.
I use CMPSUB a lot, but have found CMPR extremely useful in P2 (it's not in P1).

Haha yeah, I totally forgot CMPR is a P2 instruction haha, so yeah either way, we can just use other instructions!

haha on the 4GB, I remember back in 1999, taking my brother-in-law to go get parts to make up a new computer for him, and upon getting a hard drive, ( asking for a 10GB at least ) the guy behind the counter, saying you wouldn't need more than 2Gig haha we both looked at each other and laughed! haha Needless to say, we went to another stall!

Heater. · 2014-04-08 05:44

Seairth,

Your analysis of FullDuplexSerial needs a closer look:

Below is the receive loop. It consists of 9 instructions.

That means more than 10% of the code space (and time) is wasted on the yield (jmpret).

If we home in on the inner loop that is only 5 instructions so we could say 20% of the time is wasted on the yield.

Moreover there are 4 instructions used for what would normally be a single WAITPxx instruction if this were a thread.

So we are looking at about 50% of wasted time. And no hope of improving edge timing and sampling jitter. Looks like hardware scheduled threads would help a lot here.

:bit                    add     rxcnt,bitticks        'ready next bit period

:wait                   jmpret  rxcode,txcode         'run a chunk of transmit code, then return

                        mov     t1,rxcnt              'check if bit receive period done
                        sub     t1,cnt
                        cmps    t1,#0           wc
        if_nc           jmp     #:wait

                        test    rxmask,ina      wc    'receive bit on rx pin
                        rcr     rxdata,#1
                        djnz    rxbits,#:bit
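
The percentages above can be reproduced by counting the instructions in the listing. A small Python tally follows, where the "about 50%" figure is read as the yield plus the three instructions a single wait instruction would save (one plausible reading of the post, not necessarily the author's exact sum). These are static code-space counts, not cycle counts.

```python
# Mnemonics of the receive loop, in listing order.
rx_loop = ["add", "jmpret", "mov", "sub", "cmps", "jmp", "test", "rcr", "djnz"]
inner_wait = rx_loop[1:6]                 # :wait ... jmp #:wait
polling = ["mov", "sub", "cmps", "jmp"]   # what a single wait would replace

yield_share_loop = 1 / len(rx_loop)       # jmpret as a share of the whole loop
yield_share_inner = 1 / len(inner_wait)   # jmpret as a share of the inner loop

# Overhead if a thread could use one wait instruction instead of polling:
overhead = 1 + (len(polling) - 1)         # the yield, plus 3 saved instructions
overhead_share = overhead / len(rx_loop)

print(f"{yield_share_loop:.0%} {yield_share_inner:.0%} {overhead_share:.0%}")
```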

cgracey · 2014-04-08 05:46

Sapieha,

The new chip will have a lot of good things in it, including hub exec. We just couldn't get the P2 to fit in 180nm in any adequate manner. So, we are going back to the basics, but adding a few key elements from the Prop2 development. It's true that these cogs won't be as fast, but there will be more of them, so that the total MIPS will be higher, but the power will be 1/8th.

When this is done, we'll pick up where we left off on Prop2. That's the best we can do right now, in order to get a real chip into production. Hang in there, please.

Sapieha wrote: »

Hi.

I'm disappointed with how things went while I was in hospital.

Even if I like some of the ideas,
it is not what I wanted, so I think it is time for me to write my own COG code that meets my needs for my experiments.
Much is already done, but I don't think I have time to wait until all the discussions settle on the new things people want in this new IC.

I was very happy with the P2 instruction set and how it worked. The mixed COG/HUB execution was clever and very usable.
It gave the best of two worlds of programming, BUT it looked too good to be true.
So now we will have a simpler IC, even if it will have more COGs and more HUB RAM. I don't think it will give as much programming power as the original P2 code.

All because every time Chip found a new possibility, people asked for more.
Now we have a nice P2 binary that works well, but we don't even have a complete instruction set description to use it to the full.
BAD, but life has taught me not to wait too much -- it will always be shortened.

No more complaints from me -- have a great day, all.

Rayman · 2014-04-08 05:49

I don't think we need a special CLUT circuit... I think you can just use 1/2 of the cog RAM as CLUT. The rest of the cog would just be code that grabs the screen buffer data from HUB or maybe SDRAM, looks up the corresponding RGB data for each byte from the CLUT and then sends it to the DAC.
Doesn't that work?
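
In software terms the lookup described above is just a table index per pixel. A minimal Python sketch, assuming a 256-entry table of packed 0xRRGGBB values (256 longs, i.e. half of a 512-long cog RAM); `expand_8bpp` and the greyscale palette are hypothetical and only for illustration:

```python
def expand_8bpp(pixels, clut):
    """Map each 8-bit pixel value to its packed 24-bit RGB entry."""
    return [clut[p] for p in pixels]


# Hypothetical palette: a greyscale ramp, one packed 0xRRGGBB value per entry.
clut = [(i << 16) | (i << 8) | i for i in range(256)]

pixels = bytes([0, 128, 255])        # a few bytes from a screen buffer
rgb = expand_8bpp(pixels, clut)
print([f"{v:06X}" for v in rgb])
```

Whether a cog can do this per pixel fast enough at the pixel clocks discussed earlier is the open question; the sketch only shows the data flow.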

cgracey wrote: »

I know that a CLUT is really important for 8-bit pixel data, so that it can be expanded into 8:8:8 RGB. Even a tiny CLUT for 4-bit pixel data would be nice. We'll see.

Tasks are easy to implement. Having a simplified time-slot scheme would be best for this chip. Tubular had a great idea for making variable-length slot patterns that was implemented in Prop2, but that is too complex for this chip.

cgracey · 2014-04-08 05:52

Cluso99 wrote: »

I use CMPSUB a lot, but have found CMPR extremely useful in P2 (it's not in P1).

SUBR/CMPR will be in the next chip. I've been working on the instruction set all night and I pulled a few gems out of the Prop2 set.

Rayman · 2014-04-08 05:53

Sounds like a math-coprocessor located in the HUB...

cgracey wrote: »

CORDIC, MUL32X32, DIV64/32, SQRT (maybe) will be in the hub, but pipelined, so nobody has to wait for anybody else, only their turn at the hub.

Baggers · 2014-04-08 06:03

cgracey wrote: »

SUBR/CMPR will be in the next chip. I've been working on the instruction set all night and I pulled a few gems out of the Prop2 set.

Awesome Chip! do you have a final list of what instructions are going in it?

cgracey · 2014-04-08 06:21

Baggers wrote: »

Awesome Chip! do you have a final list of what instructions are going in it?

Yes, but I can't get it from my work computer (that hates the internet) onto my laptop, without my thumb drive that is back in the house.

It needs refining, anyway. It's still a mess, with opcodes undefined. Just the instructions are there properly.

For what it's worth, I added PTRA and PTRB into it, to facilitate efficient hub access and hub exec. There's also AUGS and AUGD and the immediate 17-bit JMP/CALL/LINK instructions. The hub exec cache is a section of cog ram that is used for hardware registers, so nothing there gets wasted. There's another section up top for DCACHE that is otherwise used for read-only registers like CNT/RND/INA/INB. I figure that for hub exec, there's no benefit in having more than one 4-long cache line, since it will be exhausted after every four instructions, anyway. So, it should run 50 MIPS without branches or hub reads/writes.

Cluso99 · 2014-04-08 06:28

cgracey wrote: »

SUBR/CMPR will be in the next chip. I've been working on the instruction set all night and I pulled a few gems out of the Prop2 set.

That is great Chip. I cannot wait to see what you think is required. Hope it doesn't add too much though.
But, you best get some sleep. I know you are excited again - you have pulled another rabbit out of your hat!

Baggers · 2014-04-08 06:40

cgracey wrote: »

Yes, but I can't get it from my work computer (that hates the internet) onto my laptop, without my thumb drive that is back in the house.

It needs refining, anyway. It's still a mess, with opcodes undefined. Just the instructions are there properly.

For what it's worth, I added PTRA and PTRB into it, to facilitate efficient hub access and hub exec. There's also AUGS and AUGD and the immediate 17-bit JMP/CALL/LINK instructions. The hub exec cache is a section of cog ram that is used for hardware registers, so nothing there gets wasted. There's another section up top for DCACHE that is otherwise used for read-only registers like CNT/RND/INA/INB. I figure that for hub exec, there's no benefit in having more than one 4-long cache line, since it will be exhausted after every four instructions, anyway. So, it should run 50 MIPS without branches or hub reads/writes.

And there was me being over the moon with the first proposed P16X32B now we get all these Super I/O pins, hub exec, ptrs!
Really looking forward to this now!

Dave Hein · 2014-04-08 06:43

I'm curious what the final version of the P1+ will look like. The following list shows the features that I think are going into P1+ based on the features that P2 had. Is this correct, or am I way off base here?

8-COG -> 16-COG P1 Core
4-port -> 2-port cog memory
20-bit -> 16-bit multiplier
256K -> 512K RAM/ROM
32-bit Multiply/Divide Engine in each cog -> the hub
Cordic Engine in each cog -> the hub
PTRA/PTRB
INDA/INDB
256-long CLUT/FIFO
PTRX/PTRY
Data Cache
360+ -> 200+ Instructions
256-bit -> 128-bit Hub Bus
4 tasks
hubex
4 Instruction Caches
serial I/O
Pre-emptive threads
