The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 10 — Parallax Forums

The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Comments

  • SeairthSeairth Posts: 2,474
    edited 2014-04-08 06:43
    Heater. wrote: »
    Seairth,

    Your analysis of FullDuplexSerial needs a closer look:

    Below is the receive loop. It consists of 9 instructions.

    That means more than 10% of the code space (and time) is wasted on the yield (jmpret).

    If we home in on the inner loop, that is only 5 instructions, so we could say 20% of the time is wasted on the yield.

    Moreover there are 4 instructions used for what would normally be a single WAITPxx instruction if this were a thread.

    So we are looking at about 50% wasted time. And no hope of improving edge timing and sampling jitter. Looks like hardware scheduled threads would help a lot here.
    :bit                    add     rxcnt,bitticks        'ready next bit period
    
    
    :wait                   jmpret  rxcode,txcode         'run a chunk of transmit code, then return
    
    
                            mov     t1,rxcnt              'check if bit receive period done
                            sub     t1,cnt
                            cmps    t1,#0           wc
            if_nc           jmp     #:wait
    
    
                            test    rxmask,ina      wc    'receive bit on rx pin
                            rcr     rxdata,#1
                            djnz    rxbits,#:bit
    

    I was not arguing your point about the execution time ("is a dog"), rather Bill's comment about the waste of cog space. My apologies for the lack of clarity. Again, I believe the new pin approach that Chip is discussing will mitigate your concerns (baud rate, jitter, etc) to some extent. No, it's not a perfect solution. No, you won't be able to push the I/O as far as other tasking approaches might allow. But there are always going to be use cases where the hardware doesn't suffice. So, as I said before, I am trying to offer a solution with minimal hardware requirement (and minimal impact on time and risk). I think we can afford to wait until the P2 for full tasking support.

    Frankly, I'd take the P1+ with no additional task support if it would mean we get it sooner.
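Heater.'s percentages in the quote above follow from counting instructions in the receive loop. The short Python check below reproduces them, assuming every instruction costs the same time (an approximation). The last figure is only one possible reading of the "about 50%" remark, not something the post spells out.

```python
# Instruction counts read off the quoted FullDuplexSerial receive loop.
LOOP_INSTRUCTIONS = 9         # add, jmpret, mov, sub, cmps, jmp, test, rcr, djnz
INNER_LOOP_INSTRUCTIONS = 5   # jmpret, mov, sub, cmps, jmp
YIELD_INSTRUCTIONS = 1        # the jmpret that hands control to the transmit code
POLL_INSTRUCTIONS = 4         # mov, sub, cmps, jmp: the software wait


def percent(part, whole):
    """Share of `whole` taken by `part`, as a percentage."""
    return 100.0 * part / whole


yield_share_full_loop = percent(YIELD_INSTRUCTIONS, LOOP_INSTRUCTIONS)
yield_share_inner_loop = percent(YIELD_INSTRUCTIONS, INNER_LOOP_INSTRUCTIONS)

# Assumed reading of "about 50%": the yield plus the three polling
# instructions that a single wait instruction would make unnecessary.
avoidable = YIELD_INSTRUCTIONS + POLL_INSTRUCTIONS - 1
avoidable_share = percent(avoidable, LOOP_INSTRUCTIONS)

print(f"yield, full loop:  {yield_share_full_loop:.1f}%")
print(f"yield, inner loop: {yield_share_inner_loop:.1f}%")
print(f"avoidable (assumed reading): {avoidable_share:.1f}%")
```

Under the equal-cost assumption this gives roughly 11%, 20%, and 44%, which lines up with "more than 10%", "20%", and "about 50%".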
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-08 06:56
    I'm starting to fear that 16 of these beasties will not end up fitting the available space.

    Even with the items that have been factored out into common resources, these are going to be way more complex than P1 cogs.

    C.W.
  • SeairthSeairth Posts: 2,474
    edited 2014-04-08 07:03
    Heater. wrote: »
    Seairth,

    Your analysis of FullDuplexSerial needs a closer look:

    Below is the receive loop. It consists of 9 instructions.

    That means more than 10% of the code space (and time) is wasted on the yield (jmpret).

    If we home in on the inner loop, that is only 5 instructions, so we could say 20% of the time is wasted on the yield.

    Moreover there are 4 instructions used for what would normally be a single WAITPxx instruction if this were a thread.

    So we are looking at about 50% wasted time. And no hope of improving edge timing and sampling jitter. Looks like hardware scheduled threads would help a lot here.
    :bit                    add     rxcnt,bitticks        'ready next bit period
    
    
    :wait                   jmpret  rxcode,txcode         'run a chunk of transmit code, then return
    
    
                            mov     t1,rxcnt              'check if bit receive period done
                            sub     t1,cnt
                            cmps    t1,#0           wc
            if_nc           jmp     #:wait
    
    
                            test    rxmask,ina      wc    'receive bit on rx pin
                            rcr     rxdata,#1
                            djnz    rxbits,#:bit
    

    Here's another thought as well: add a SWTASK #n/D variant that would be just like the zero-param SWTASK, but store #n/D instead of PC+1. The FDS receive code would then look like:
    receive                 test    rxtxmode,#%001  wz    'wait for start bit on rx pin
                            test    rxmask,ina      wc
            if_z_eq_c       swtask  #receive
    
                            mov     rxbits,#9             'ready to receive byte
                            mov     rxcnt,bitticks
                            shr     rxcnt,#1
                            add     rxcnt,cnt                          
    
    :bit                    add     rxcnt,bitticks        'ready next bit period
    
    :wait                   mov     t1,rxcnt              'check if bit receive period done
                            sub     t1,cnt
                            cmps    t1,#0           wc
            if_nc           swtask  #:wait
    
                            test    rxmask,ina      wc    'receive bit on rx pin
                            rcr     rxdata,#1
                            djnz    rxbits,#:bit
    
                            shr     rxdata,#32-9          'justify and trim received byte
                            and     rxdata,#$FF
                            test    rxtxmode,#%001  wz    'if rx inverted, invert byte
            if_nz           xor     rxdata,#$FF
    
                            rdlong  t2,par                'save received byte and inc head
                            add     t2,rxbuff
                            wrbyte  rxdata,t2
                            sub     t2,rxbuff
                            add     t2,#1
                            and     t2,#$0F
                            wrlong  t2,par
    
                            swtask  #receive              'byte done, receive next byte
    

    This has zero impact on code space or execution time. No, it doesn't get rid of the jitter issue and may only marginally increase baud rate. Again, that's where the new pin stuff comes in.
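The `SWTASK #n/D` idea above amounts to a cooperative switch in which the yielding task names its own resume point. Here is a minimal Python sketch of that control flow, not of the proposed hardware. Each task step returns the function to resume at next time, or `None` when it is finished. The names `run`, `make_receiver`, and `make_transmitter` are hypothetical and exist only for this illustration.

```python
def make_receiver(samples):
    """Task that polls `samples` until it sees a 0 (standing in for a start bit)."""
    state = {"seen": 0, "started": False}

    def wait_start():
        sample = samples[state["seen"]]
        state["seen"] += 1
        if sample == 1:
            # Line still idle: yield and resume at this same step,
            # the analogue of `if_z_eq_c swtask #receive`.
            return wait_start
        state["started"] = True
        return None  # task finished

    return wait_start, state


def make_transmitter(log):
    """Task with two steps; each step names the step to resume at."""
    def send_second():
        log.append("tx step 2")
        return None

    def send_first():
        log.append("tx step 1")
        return send_second

    return send_first


def run(entry_points, max_switches=100):
    """Round-robin over tasks; each call to a step is one task switch."""
    resume = list(entry_points)
    switches = 0
    current = 0
    while any(r is not None for r in resume) and switches < max_switches:
        if resume[current] is not None:
            resume[current] = resume[current]()
            switches += 1
        current = (current + 1) % len(resume)
    return switches


tx_log = []
receiver, rx_state = make_receiver([1, 1, 1, 0])
switch_count = run([receiver, make_transmitter(tx_log)])
print(switch_count, rx_state, tx_log)
```

The point of the sketch is that the "jump back to the wait loop" and the "yield" are one operation: the receiver never executes a separate branch after being resumed.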
  • John AbshierJohn Abshier Posts: 1,116
    edited 2014-04-08 07:46
    It looks like we are starting to stray (featuritis) from the chip described in post #1. Hopefully we will not go another 9 months and have another stillborn chip.

    Ken, time for the wet blanket. I recommend wool.

    John Abshier
  • User NameUser Name Posts: 1,451
    edited 2014-04-08 07:57
    cgracey wrote: »
    One aspect of having 4-long hub transfers is that with a few simple address tags (TLB, David?) we could direct out-of-cog addresses into 4-long register blocks which are serving as instruction caches. Part of the cog register RAM becomes the cache! No cache-line flipflops and mux's needed!

    I like this idea a lot!! Nothing would turn a 200 MHz peregrine falcon into an 80 MHz buzzard faster than throwing a lot of mux's into the critical execution path. At this point I don't much care what this chip has or has not, so long as it doesn't compromise speed for gadgets.
  • jazzedjazzed Posts: 11,803
    edited 2014-04-08 07:58
    John, wool is good. Just hope we don't need PKP foam.

    Chip,

    Is the full HUB RAM space accessible by one hubexec COG at byte granularity? Saw something about 17 bits and that isn't enough for 512KB a byte at a time.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-08 08:07
    Hi Brian!

    Thank you for the replay I asked for, NOW I remember :)

    Excellent question.

    (the pace of postings these last few days has been overwhelming.... stack overflow...)

    From what I have seen so far, a video cog, or a high speed sampling / signal generation cog could make use of every second instruction, which is actually every fourth hub slot.

    Which means we could have 3 "fast" cogs, and many "slow" peripheral cogs.

    Bandwidth

    200 MHz / 16 cogs * 4 slots to fast cog = 50M hub slots per cog, 50M * 16 bytes = 800MB/sec bandwidth (with 32 / 128 slot "fast" cog)

    Hubexec

    (assuming executing out of the hub 4-long buffer, same as above)

    200 MHz / 16 cogs * 4 slots to fast cog... we get prefetch for free! (with 32 / 128 slot "fast" cog)

    (as next hub cycle delivers the next 4 longs, all it needs is auto address increment)

    100MIPS :-) *for simple instructions, **4x faster than non-cached LMM

    You like? I LIKE!
    My question is now...."if your new proposal was implemented, given the above scenario, what speed increase would we see in the COGs running from HUB which currently run at 23.5MIPS?"
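The bandwidth figures in the post above can be checked with a few lines of Python. This sketch simply replays the arithmetic as written (200 MHz, 16 cogs, 4 hub slots per rotation for the "fast" cog, 16 bytes per slot). The 2-clocks-per-instruction value is an assumption chosen to match the 100 MIPS figure, and the constant names are invented for the example.

```python
CLOCK_HZ = 200_000_000        # system clock used in the post
COGS = 16                     # hub rotation length assumed by "200 MHz / 16 cogs"
FAST_COG_SLOTS = 4            # slots per rotation handed to the "fast" cog
BYTES_PER_SLOT = 16           # one slot moves 4 longs
CLOCKS_PER_INSTRUCTION = 2    # assumption, consistent with 100 MIPS at 200 MHz

slots_per_second = CLOCK_HZ // COGS * FAST_COG_SLOTS
bandwidth_bytes_per_second = slots_per_second * BYTES_PER_SLOT

# Each slot delivers 4 longs, i.e. 4 instructions' worth of code for hubexec.
instructions_fetched_per_second = slots_per_second * (BYTES_PER_SLOT // 4)
instructions_executed_per_second = CLOCK_HZ // CLOCKS_PER_INSTRUCTION

print(slots_per_second)                  # hub slots per second for the fast cog
print(bandwidth_bytes_per_second)        # bytes per second
print(instructions_fetched_per_second >= instructions_executed_per_second)
```

With these numbers the hub can supply instructions at least as fast as the cog consumes them, which is the sense in which prefetch comes "for free" in the post.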
  • cgraceycgracey Posts: 14,152
    edited 2014-04-08 08:08
    jazzed wrote: »
    John, wool is good. Just hope we don't need PKP foam.

    Chip,

    Is the full HUB RAM space accessible by one hubexec COG at byte granularity? Saw something about 17 bits and that isn't enough for 512KB a byte at a time.

    17 bits is the PC width, which addresses longs that are instructions. A hub exec program can do RDBYTE/RDBYTEC/WRBYTE using a 19-bit PTRA or PTRB. Because of the way the DCACHE works, only PTRA/B accesses are allowed - no S register. The reason is that I must force the DCACHE address ($1F0) into the S fetch to access the DCACHE quads. In all the code I wrote for the Prop2, only once did I use S for address. The pointers are very easy to load and use.
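The 17-bit versus 19-bit distinction falls out of the memory size. A small Python check, assuming 512KB of hub RAM and 4-byte longs as discussed in this thread (`bits_needed` and `pc_to_byte_address` are illustrative helpers, not part of any Parallax tool):

```python
HUB_BYTES = 512 * 1024   # hub RAM size from the thread title
BYTES_PER_LONG = 4


def bits_needed(count):
    """Number of address bits required to index `count` distinct items."""
    return (count - 1).bit_length()


def pc_to_byte_address(pc):
    """Convert a long-granular program counter to a byte address."""
    return pc * BYTES_PER_LONG


byte_address_bits = bits_needed(HUB_BYTES)                    # pointer width
long_address_bits = bits_needed(HUB_BYTES // BYTES_PER_LONG)  # PC width
last_instruction_byte = pc_to_byte_address((1 << long_address_bits) - 1)

print(byte_address_bits, long_address_bits, hex(last_instruction_byte))
```

Because instructions are always long-aligned, the two low address bits are implied, which is why the PC can be two bits narrower than the byte pointers.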
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-08 08:30
    Hardware threads...

    One of the problems with all the add-ons is they seem to get to *almost* something else, so "just one more little thing" and pretty soon we are back at P2.

    With the hardware threads it will soon become...

    - Can each thread do HUBEXEC?
    - Now that they do HUBEXEC can we have preemptive HUBEXEC?
    - Can each thread get its own set of pointers?
    - How do we divvy up hub access?
    - Can each thread...well you get the point.

    I don't hate threads, and I do see some nice use cases, so how about coming up with something really simple and closed ended.

    May I suggest:

    I know we hate modes, but...

    Normal Mode: One thread, can use HUBEXEC.
    Threaded Mode: Two threads, equal time each, NO HUBEXEC, NO Duplicate Pointers, only the flags and PC unique to each thread, first come first serve on HUB access.

    I would rather just leave threading out of this chip, but if it is deemed a requirement please keep it simple and don't let it be a stepping stone to "just one more thing"...

    C.W.
  • jazzedjazzed Posts: 11,803
    edited 2014-04-08 08:34
    cgracey wrote: »
    17 bits is the PC width, which addresses longs that are instructions. A hub exec program can do RDBYTE/RDBYTEC/WRBYTE using a 19-bit PTRA or PTRB. Because of the way the DCACHE works, only PTRA/B accesses are allowed - no S register. The reason is that I must force the DCACHE address ($1F0) into the S fetch to access the DCACHE quads. In all the code I wrote for the Prop2, only once did I use S for address. The pointers are very easy to load and use.

    Great.

    Can you post an update of the feature spec (minus instruction lists)?

    I know you are trying to exercise restraint and optimize too, but a spec that can be maintained can be followed. The P2 spec just sat there and never got updated.

    Please avoid too much threading.

    I suppose it's also useful to know what P1 compatibility, if any, may be disappearing. For example, I noticed HUBOP in the list(s) of unused instructions ... that is a misnomer of course because COGNEW is a HUBOP ;-)

    Thanks.
  • cgraceycgracey Posts: 14,152
    edited 2014-04-08 08:39
    jazzed wrote: »
    Great.

    Can you post an update of the feature spec (minus instruction lists)?

    I know you are trying to exercise restraint and optimize too, but a spec that can be maintained can be followed. The P2 spec just sat there and never got updated.

    Please avoid too much threading.

    I suppose it's also useful to know what P1 compatibility, if any, may be disappearing. For example, I noticed HUBOP in the list(s) of unused instructions ... that is a misnomer of course because COGNEW is a HUBOP ;-)

    Thanks.

    HUBOP is becoming a clearinghouse for all functions that have to do with the hub, or possibly even video! I'm trying to focus the cog on being efficient at flow control and computation. The more generic we can make its peripheral interfaces, the simpler and faster it can become. I'm really tired now, as I've been up over 24 hours, so I need to get some sleep.
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-04-08 08:39
    It seems my previous post was ignored, so I'll ask again. Is this the planned list of features for P1+?

    8-COG -> 16-COG P1 Core
    4-port -> 2-port cog memory
    20-bit -> 16-bit multiplier
    256K -> 512K RAM/ROM
    32-bit Multiply/Divide Engine in each cog -> the hub
    Cordic Engine in each cog -> the hub
    PTRA/PTRB
    INDA/INDB
    256-long CLUT/FIFO
    PTRX/PTRY
    Data Cache
    360+ -> 200+ Instructions
    256-bit -> 128-bit Hub Bus
    4 tasks
    hubex
    4 Instruction Caches
    serial I/O
    Pre-emptive threads
  • cgraceycgracey Posts: 14,152
    edited 2014-04-08 08:42
    ctwardell wrote: »
    Hardware threads...

    One of the problems with all the add-ons is they seem to get to *almost* something else, so "just one more little thing" and pretty soon we are back at P2.

    With the hardware threads it will soon become...

    - Can each thread do HUBEXEC?
    - Now that they do HUBEXEC can we have preemptive HUBEXEC?
    - Can each thread get its own set of pointers?
    - How do we divvy up hub access?
    - Can each thread...well you get the point.

    I don't hate threads, and I do see some nice use cases, so how about coming up with something really simple and closed ended.

    May I suggest:

    I know we hate modes, but...

    Normal Mode: One thread, can use HUBEXEC.
    Threaded Mode: Two threads, equal time each, NO HUBEXEC, NO Duplicate Pointers, only the flags and PC unique to each thread, first come first serve on HUB access.

    I would rather just leave threading out of this chip, but if it is deemed a requirement please keep it simple and don't let it be a stepping stone to "just one more thing"...

    C.W.

    I agree with not making duplicate pointers per task, etc. But to get bare bones multi-tasking, all we need is 1..3 more Z/C/PC's and some mux's for them. It's a big deal for a program like the ROM_Monitor. It wouldn't work in one cog without multi-tasking.
  • rjo__rjo__ Posts: 2,114
    edited 2014-04-08 08:48
    (related to Chip's estimate above) The fact that the new chip will have fabulous cog bandwidth but only 1/8 the effective processing power of the prior design, speaks to the power of the instruction set that is in that prior design. There have been references to design creep. To me it seemed that the driving ethic of the design effort was to optimize performance by replacing common functions with single instructions. Every time you can replace a sequence of common code with a single instruction, the instruction set seems to bloat…but in fact these new instructions function more like macros than the instructions in the original P1, and the impact that they can have on total functionality is obviously huge. In my view, knowing that Chip was looking for places in the design, where this ethic could be applied, there were lots of suggestions about potential opportunities. I hope we can eventually get back to this design ideology and can find a way to better characterize the instruction set… so that the core instructions are set well apart and easily distinguished from these compound instructions. This should be easy to do. The only issue I was concerned about was the complexity of the addressing. I have never been good with this kind of programming and I was really afraid that I might never fully understand it… Of course that wouldn't stop me from using it:)
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-08 08:52
    cgracey wrote: »
    I agree with not making duplicate pointers per task, etc. But to get bare bones multi-tasking, all we need is 1..3 more Z/C/PC's and some mux's for them. It's a big deal for a program like the ROM_Monitor. It wouldn't work in one cog without multi-tasking.

    My main concern is that it not lead to more feature creep.

    C.W.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-08 08:55
    Excellent suggestion!
    jmg wrote: »
    I'll relabel this a little with a more descriptive term that better reflects what actually happens here:

    Video (CLUT) Memory Sharing

    Borrowing a term from the PC world of cheap systems, where they have one memory array and code & video share the bus.
    Saves the die area of a separate CLUT, but shares cog RAM and slots (50%) to do this.

    The video HW shifts each 8b (/4b/2b/1b?) pixel and uses that as the CLUT index, and sends that 32-bit read, split to the DAC or direct to pins.
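As a rough illustration of the lookup jmg describes, here is a Python sketch that splits a 32-bit long into pixels and uses each one as a CLUT index. The least-significant-pixel-first order, the function names, and the sample palette are all assumptions for the example; the thread does not specify them.

```python
def unpack_pixels(long_value, bits_per_pixel):
    """Split a 32-bit long into pixel indices, least significant pixel first (assumed order)."""
    if bits_per_pixel not in (1, 2, 4, 8):
        raise ValueError("bits_per_pixel must be 1, 2, 4 or 8")
    mask = (1 << bits_per_pixel) - 1
    count = 32 // bits_per_pixel
    return [(long_value >> (i * bits_per_pixel)) & mask for i in range(count)]


def clut_lookup(long_value, bits_per_pixel, clut):
    """Return the 32-bit CLUT entry selected by each pixel in the long."""
    return [clut[index] for index in unpack_pixels(long_value, bits_per_pixel)]


# Hypothetical 4-entry palette held in what would be cog RAM.
palette = [0x00000000, 0x00FF0000, 0x0000FF00, 0x000000FF]

pixels_8bpp = unpack_pixels(0x03020100, 8)
colours_8bpp = clut_lookup(0x03020100, 8, palette)
pixels_2bpp = unpack_pixels(0xB4, 2)

print(pixels_8bpp, [hex(c) for c in colours_8bpp], pixels_2bpp[:4])
```

In hardware each lookup would cost one cog RAM read, which is where the shared-slot cost mentioned in the quote comes from.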
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-08 09:01
    Excellent! Saves a ton of hub memory, better for elegant pasm, better for compilers... THANK YOU!
    cgracey wrote: »
    Yes, but I can't get it from my work computer (that hates the internet) onto my laptop, without my thumb drive that is back in the house.

    It needs refining, anyway. It's yet a mess, with opcodes undefined. Just the instructions are there properly.

    For what it's worth, I added PTRA and PTRB into it, to facilitate efficient hub access and hub exec. There's also AUGS and AUGD and the immediate 17-bit JMP/CALL/LINK instructions. The hub exec cache is a section of cog ram that is used for hardware registers, so nothing there gets wasted. There's another section up top for DCACHE that is otherwise used for read-only registers like CNT/RND/INA/INB. I figure that for hub exec, there's no benefit in having more than one 4-long cache line, since it will be exhausted after every four instructions, anyway. So, it should run 50 MIPS without branches or hub reads/writes.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-08 09:03
    Only instruction addresses are encoded as 17-bit long addresses. That saves two bits of encoding, and Px's don't support non-aligned instructions.
    jazzed wrote: »
    John, wool is good. Just hope we don't need PKP foam.

    Chip,

    Is the full HUB RAM space accessible by one hubexec COG at byte granularity? Saw something about 17 bits and that isn't enough for 512KB a byte at a time.
  • cgraceycgracey Posts: 14,152
    edited 2014-04-08 09:05
    Dave Hein wrote: »
    It seems my previous post was ignored, so I'll ask again. Is this the planned list of features for P1+?

    8-COG -> 16-COG P1 Core
    4-port -> 2-port cog memory
    20-bit -> 16-bit multiplier
    256K -> 512K RAM/ROM
    32-bit Multiply/Divide Engine in each cog -> the hub
    Cordic Engine in each cog -> the hub
    PTRA/PTRB
    INDA/INDB
    256-long CLUT/FIFO
    PTRX/PTRY
    Data Cache
    360+ -> 200+ Instructions
    256-bit -> 128-bit Hub Bus
    4 tasks
    hubex
    4 Instruction Caches
    serial I/O
    Pre-emptive threads



    That looks correct, but INDA and INDB will exist. In a non-pipelined architecture like this, they are very simple to do.

    I'm counting about 170 instructions now. There is only one instruction cache line, since adding a bunch more would only make loops faster and necessitate a bunch of 15-bit comparators. With the planned setup, hub execution will be half the speed of cog execution. If it turns out we can go 256 bits wide, after all, it will be full speed in a straight line.
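The "half speed at 128 bits, full speed at 256 bits" statement can be reproduced with a toy model in Python. This is a sketch under stated assumptions, not Chip's own derivation: it assumes a 200 MHz clock, one hub window per cog every 16 clocks, 2 clocks per instruction, and a single cache line that must be refilled from the hub once it is exhausted. `hubexec_mips` is a hypothetical helper name.

```python
def hubexec_mips(bus_bits, clock_hz=200_000_000, cogs=16, clocks_per_instruction=2):
    """Straight-line hub exec rate in MIPS for a single cache line of `bus_bits`."""
    longs_per_window = bus_bits // 32            # instructions delivered per hub window
    clocks_between_windows = cogs                # assumed: one window per cog per rotation
    clocks_to_execute_line = longs_per_window * clocks_per_instruction
    # The cog either waits for the next window or is limited by its own speed.
    clocks_per_line = max(clocks_between_windows, clocks_to_execute_line)
    return longs_per_window / clocks_per_line * clock_hz / 1_000_000


mips_128 = hubexec_mips(128)
mips_256 = hubexec_mips(256)
cog_mips = 200_000_000 / 2 / 1_000_000   # native cog rate under the same assumptions

print(mips_128, mips_256, cog_mips)
```

With a 128-bit bus the cog runs four instructions in 8 clocks and then idles for 8 until its next window. Doubling the bus width fills that idle time, which matches the figures in the post.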

    Here is the new register map. The DCACHE and ICACHE areas are in locales where the RAM is neither read nor written by instructions:
    addr		read		write		name		background
    --------------------------------------------------------------------------
    000..1EF	RAM		RAM		-		-
    
    1F0		CNT		-		CNT		DCACHE0
    1F1		RND		-		RND		DCACHE1
    1F2		INA		-		INA		DCACHE2
    1F3		INB		-		INB		DCACHE3
    1F4		RAM		RAM+OUTA	OUTA		-
    1F5		RAM		RAM+OUTB	OUTB		-
    1F6		RAM		RAM+DIRA	DIRA		-
    1F7		RAM		RAM+DIRB	DIRB		-
    1F8		RAM		RAM+CTRA	CTRA		-
    1F9		RAM		RAM+CTRB	CTRB		-
    1FA		RAM		RAM+FRQA	FRQA		-
    1FB		RAM		RAM+FRQB	FRQB		-
    1FC		PHSA		PHSA		PHSA		ICACHE0
    1FD		PHSB		PHSB		PHSB		ICACHE1
    1FE		indirect	indirect	INDA		ICACHE2
    1FF		indirect	indirect	INDB		ICACHE3
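For readers who want to experiment with the map above, it can be transcribed into a small lookup table. This Python sketch only restates the table. `describe` is a hypothetical helper that returns the register name and, where the table lists one, the cache line that uses the underlying RAM in the background.

```python
# Register names for $1F0..$1FF, in table order.
NAMES = ["CNT", "RND", "INA", "INB", "OUTA", "OUTB", "DIRA", "DIRB",
         "CTRA", "CTRB", "FRQA", "FRQB", "PHSA", "PHSB", "INDA", "INDB"]

# Background use of the RAM behind registers whose RAM is never accessed
# by instructions, per the "background" column.
BACKGROUND = {
    0x1F0: "DCACHE0", 0x1F1: "DCACHE1", 0x1F2: "DCACHE2", 0x1F3: "DCACHE3",
    0x1FC: "ICACHE0", 0x1FD: "ICACHE1", 0x1FE: "ICACHE2", 0x1FF: "ICACHE3",
}


def describe(address):
    """Return (name, background use or None) for a 9-bit cog address."""
    if not 0x000 <= address <= 0x1FF:
        raise ValueError("cog addresses are 9 bits wide")
    if address <= 0x1EF:
        return ("RAM", None)
    return (NAMES[address - 0x1F0], BACKGROUND.get(address))


for addr in (0x100, 0x1F0, 0x1F4, 0x1FE):
    print(hex(addr), describe(addr))
```

The eight addresses with a background entry are exactly those where reads and writes go to something other than the RAM (a read-only source, a counter phase register, or an indirect access), which is what frees the RAM behind them for the caches.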
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-08 09:05
    Since without pipelining it seems cheap, I'd go for it. Monitor in one cog? Heck, I might leave it running all the time!!!!
    cgracey wrote: »
    I agree with not making duplicate pointers per task, etc. But to get bare bones multi-tasking, all we need is 1..3 more Z/C/PC's and some mux's for them. It's a big deal for a program like the ROM_Monitor. It wouldn't work in one cog without multi-tasking.
  • mindrobotsmindrobots Posts: 6,506
    edited 2014-04-08 09:13
    Since without pipelining it seems cheap, I'd go for it. Monitor in one cog? Heck, I might leave it running all the time!!!!

    16 cogs? A primitive monitor, debugger, serial HIM running all the time - you bet! (Forth kernel? - I think I can mention it parenthetically! :smile:)
  • Heater.Heater. Posts: 21,230
    edited 2014-04-08 09:32
    You know me. 170 instructions is already quite enough.
  • cgraceycgracey Posts: 14,152
    edited 2014-04-08 09:36
    Heater. wrote: »
    You know me. 170 instructions is already quite enough.


    Yes, I cringed when I thought you'd see that.

    It just takes that much to get program flow and computation running smoothly.

    I'd post the instruction set, but it's still a mess. I just got rid of the pixel blending instructions. They are totally fun to play with, but an excess in this chip.
  • SapiehaSapieha Posts: 2,964
    edited 2014-04-08 10:21
    Hi Chip.

    I understand your problem.
    And I think you now have good insight into what can be done, and even what is usable from the P2 work.
    I know you will still come up with clever solutions to make that IC usable.
    I will hang on as long as my health allows.

    BUT still -- is it possible to have instruction info for the last BIN FPGA code --- so I can have something to work on, even if it is not so usable in the NEXT STEP of the IC?
    Only working with electronics keeps me going a little longer (my life these last years has been sleeping and sitting at the computer programming/thinking); otherwise it would be only sleeping, and that does not help my health.

    cgracey wrote: »
    Sapieha,

    The new chip will have a lot of good things in it, including hub exec. We just couldn't get the P2 to fit in 180nm in any adequate manner. So, we are going back to the basics, but adding a few key elements from the Prop2 development. It's true that these cogs won't be as fast, but there will be more of them, so that the total MIPS will be higher, but the power will be 1/8th.

    When this is done, we'll pick up where we left off on Prop2. That's the best we can do right now, in order to get a real chip into production. Hang in there, please.
  • BaggersBaggers Posts: 3,019
    edited 2014-04-08 10:25
    I agree, the pixel blending will be overkill for this chip.
  • jazzedjazzed Posts: 11,803
    edited 2014-04-08 10:45
    Only instruction addresses are encoded as 17-bit long addresses. That saves two bits of encoding, and Px's don't support non-aligned instructions.
    Yes, Chip answered that already. Guess you missed that.
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2014-04-08 10:53
    cgracey wrote: »
    Yes, I cringed when I thought you'd see that.

    It just takes that much to get program flow and computation running smoothly.

    I'd post the instruction set, but it's still a mess. I just got rid of the pixel blending instructions. They are totally fun to play with, but an excess in this chip.

    Could they possibly live as a hub resource along side the cordic/math stuff? Maybe a trimmed down subset? They are not strongly needed, but they are super nice for doing GUIs.
  • Martin HodgeMartin Hodge Posts: 1,246
    edited 2014-04-08 11:03
    Seairth wrote: »
    Frankly, I'd take the P1+ with no additional task support if it would mean we get it sooner.

    +1x10^100
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-08 11:17
    Yep, I missed his answer to you.

    I was trying to save him time :)

    I really don't know how Chip manages with so little sleep.
    jazzed wrote: »
    Yes, Chip answered that already. Guess you missed that.
  • potatoheadpotatohead Posts: 10,261
    edited 2014-04-08 11:25
    Please write the monitor so that we can hook into it.

    Say, a U command, for user debugger, or anything really. Monitor arguments passed in, and our program can return to it easily. This allows an upload to include whatever debugging package the developer deems necessary, and it takes advantage of the serial link already set up and established.