Your analysis of FullDuplexSerial needs a closer look:
Below is the receive loop. It consists of 9 instructions.
That means more than 10% of the code space (and time) is wasted on the yield (jmpret).
If we home in on the inner loop, that is only 5 instructions, so we could say 20% of the time is wasted on the yield.
Moreover, there are 4 instructions used for what would normally be a single WAITPxx instruction if this were a thread.
So we are looking at about 50% wasted time, with no hope of improving edge timing and sampling jitter. Looks like hardware-scheduled threads would help a lot here.
:bit                 add     rxcnt,bitticks       'ready next bit period
:wait                jmpret  rxcode,txcode        'run a chunk of transmit code, then return
                     mov     t1,rxcnt             'check if bit receive period done
                     sub     t1,cnt
                     cmps    t1,#0            wc
          if_nc      jmp     #:wait
                     test    rxmask,ina       wc  'receive bit on rx pin
                     rcr     rxdata,#1
                     djnz    rxbits,#:bit
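The percentages quoted for this loop can be checked with a little arithmetic. The sketch below is only an illustration: the instruction counts are read off the listing, while the constant names and the choice to count the four polling instructions plus the jmpret as overhead are assumptions about how the "about 50%" figure was reached.

```python
# Instruction counts read off the FullDuplexSerial receive loop listing.
FULL_LOOP = 9   # :bit through djnz
INNER_LOOP = 5  # :wait through "if_nc jmp #:wait"
YIELD = 1       # the jmpret
POLL = 4        # mov / sub / cmps / jmp, the software wait (assumed grouping)


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


yield_of_full = share(YIELD, FULL_LOOP)            # "more than 10%"
yield_of_inner = share(YIELD, INNER_LOOP)          # "20%"
overhead_of_full = share(YIELD + POLL, FULL_LOOP)  # "about 50%"

print(yield_of_full, yield_of_inner, overhead_of_full)
```

Under that reading, the yield alone is roughly a ninth of the loop, and the yield plus the polling sequence is a little over half of it.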
I was not arguing your point about the execution time ("is a dog"), rather Bill's comment about the waste of cog space. My apologies for the lack of clarity. Again, I believe the new pin approach that Chip is discussing will mitigate your concerns (baud rate, jitter, etc) to some extent. No, it's not a perfect solution. No, you won't be able to push the I/O as far as other tasking approaches might allow. But there are always going to be use cases where the hardware doesn't suffice. So, as I said before, I am trying to offer a solution with minimal hardware requirement (and minimal impact on time and risk). I think we can afford to wait until the P2 for full tasking support.
Frankly, I'd take the P1+ with no additional task support if it would mean we get it sooner.
Here's another thought as well: add a SWTASK #n/D variant that would be just like the zero-param SWTASK, but store #n/D instead of PC+1. The FDS receive code would then look like:
receive              test    rxtxmode,#%001   wz  'wait for start bit on rx pin
                     test    rxmask,ina       wc
          if_z_eq_c  swtask  #receive
                     mov     rxbits,#9            'ready to receive byte
                     mov     rxcnt,bitticks
                     shr     rxcnt,#1
                     add     rxcnt,cnt
:bit                 add     rxcnt,bitticks       'ready next bit period
:wait                mov     t1,rxcnt             'check if bit receive period done
                     sub     t1,cnt
                     cmps    t1,#0            wc
          if_nc      swtask  #:wait
                     test    rxmask,ina       wc  'receive bit on rx pin
                     rcr     rxdata,#1
                     djnz    rxbits,#:bit
                     shr     rxdata,#32-9         'justify and trim received byte
                     and     rxdata,#$FF
                     test    rxtxmode,#%001   wz  'if rx inverted, invert byte
          if_nz      xor     rxdata,#$FF
                     rdlong  t2,par               'save received byte and inc head
                     add     t2,rxbuff
                     wrbyte  rxdata,t2
                     sub     t2,rxbuff
                     add     t2,#1
                     and     t2,#$0F
                     wrlong  t2,par
                     swtask  #receive             'byte done, receive next byte
This has zero impact on code space or execution time. No, it doesn't get rid of the jitter issue and may only marginally increase baud rate. Again, that's where the new pin stuff comes in.
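To make the difference between the two SWTASK forms concrete, here is a toy Python model of a cog with two tasks. It is a sketch of the proposal as described in this post, not of any real hardware: the class name, method name, and addresses are all hypothetical, and it models only the saved program counter, not the flags.

```python
class TwoTaskCog:
    """Toy model: one running PC plus one saved PC for the idle task."""

    def __init__(self, running_pc, saved_pc):
        self.running_pc = running_pc
        self.saved_pc = saved_pc

    def swtask(self, resume_at=None):
        """Switch to the other task.

        With no argument the current task will resume at PC+1
        (the zero-param form). With resume_at it will resume at
        that address instead (the proposed #n/D form).
        """
        resume = self.running_pc + 1 if resume_at is None else resume_at
        self.running_pc, self.saved_pc = self.saved_pc, resume


# Hypothetical addresses: receive task at 10, transmit task at 100.
cog = TwoTaskCog(running_pc=10, saved_pc=100)

cog.swtask()                   # receive yields and will resume at 11
after_plain = (cog.running_pc, cog.saved_pc)

cog.swtask(resume_at=50)       # transmit yields and will resume at 50
after_target = (cog.running_pc, cog.saved_pc)

print(after_plain, after_target)
```

Because the resume address is supplied by the yielding instruction itself, a conditional yield such as `if_nc swtask #:wait` does the work of both the jmpret and the jump back to the top of the polling loop.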
It looks like we are starting to stray (featuritis) from the chip described in post #1. Hopefully we will not go another 9 months and have another stillborn chip.
One aspect of having 4-long hub transfers is that with a few simple address tags (TLB, David?) we could direct out-of-cog addresses into 4-long register blocks which are serving as instruction caches. Part of the cog register RAM becomes the cache! No cache-line flipflops and mux's needed!
I like this idea a lot!! Nothing would turn a 200 MHz peregrine falcon into an 80 MHz buzzard faster than throwing a lot of mux's into the critical execution path. At this point I don't much care what this chip has or has not, so long as it doesn't compromise speed for gadgets.
John, wool is good. Just hope we don't need PKP foam.
Chip,
Is the full HUB RAM space accessible by one hubexec COG at byte granularity? Saw something about 17 bits and that isn't enough for 512KB a byte at a time.
Thank you for the replay I asked for, NOW I remember
Excellent question.
(the pace of postings these last few days has been overwhelming.... stack overflow...)
From what I have seen so far, a video cog, or a high speed sampling / signal generation cog could make use of every second instruction, which is actually every fourth hub slot.
Which means we could have 3 "fast" cogs, and many "slow" peripheral cogs.
Bandwidth
200 MHz / 16 cogs * 4 slots to fast cog = 50M hub slots per cog, 50M * 16 bytes = 800 MB/sec bandwidth (with 32 / 128 slot "fast" cog)
Hubexec
(assuming executing out of the hub 4-long buffer, same as above)
200 MHz / 16 cogs * 4 slots to fast cog... we get prefetch for free! (with 32 / 128 slot "fast" cog)
(as next hub cycle delivers the next 4 longs, all it needs is auto address increment)
100MIPS :-) *for simple instructions, **4x faster than non-cached LMM
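The figures in this post can be reproduced with a few lines of Python. This is a sketch under the post's own numbers (200 MHz, 16 cogs, 4 slots for the fast cog, 16 bytes per slot); the constant names are made up here, and the 2 clocks per instruction is an assumption inferred from "every second instruction, which is actually every fourth hub slot".

```python
CLOCK_HZ = 200_000_000      # 200 MHz system clock, as quoted in the post
COGS = 16                   # hub slots rotate over 16 cogs
FAST_SLOTS = 4              # slots per rotation handed to the "fast" cog
BYTES_PER_SLOT = 16         # one 4-long hub transfer
LONGS_PER_SLOT = 4
CLOCKS_PER_INSTRUCTION = 2  # assumption, see the note above

slots_per_second = CLOCK_HZ // COGS * FAST_SLOTS
bandwidth_bytes = slots_per_second * BYTES_PER_SLOT
longs_per_second = slots_per_second * LONGS_PER_SLOT
cog_mips = CLOCK_HZ // CLOCKS_PER_INSTRUCTION // 1_000_000

print(slots_per_second, bandwidth_bytes, longs_per_second, cog_mips)
```

With these numbers the hub can deliver instruction longs faster than the cog consumes them, which is why the post treats the prefetch as free.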
My question is now...."if your new proposal was implemented, given the above scenario, what speed increase would we see in the COGs running from HUB which currently run at 23.5MIPS?"
17 bits is the PC width, which addresses longs that are instructions. A hub exec program can do RDBYTE/RDBYTEC/WRBYTE using a 19-bit PTRA or PTRB. Because of the way the DCACHE works, only PTRA/B accesses are allowed - no S register. The reason is because I must force the DCACHE address ($1F0) into the S fetch to access the DCACHE quads. In all the code I wrote for the Prop2, only once did I use S for address. The pointers are very easy to load and use.
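A quick check of the address arithmetic, as a sketch: the constant names are made up here, and the 512 KB hub size is taken from the question being answered.

```python
PC_BITS = 17         # program counter width, counts longs
PTR_BITS = 19        # PTRA/PTRB width, counts bytes
BYTES_PER_LONG = 4

pc_reach_bytes = (1 << PC_BITS) * BYTES_PER_LONG  # long-granular reach
ptr_reach_bytes = 1 << PTR_BITS                   # byte-granular reach
hub_bytes = 512 * 1024                            # hub size from the question

print(pc_reach_bytes, ptr_reach_bytes, hub_bytes)
```

Both widths cover the same 512 KB; the PC simply does not need the two low address bits because instructions are long-aligned.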
One of the problems with all the add-ons is they seem to get to *almost* something else, so "just one more little thing" and pretty soon we are back at P2.
With the hardware threads it will soon become...
- Can each thread do HUBEXEC?
- Now that they do HUBEXEC can we have preemptive HUBEXEC?
- Can each thread get its own set of pointers?
- How do we divvy up hub access?
- Can each thread...well you get the point.
I don't hate threads, and I do see some nice use cases, so how about coming up with something really simple and closed ended.
May I suggest:
I know we hate modes, but...
Normal Mode: One thread, can use HUBEXEC.
Threaded Mode: Two threads, equal time each, NO HUBEXEC, NO Duplicate Pointers, only the flags and PC unique to each thread, first come first serve on HUB access.
I would rather just leave threading out of this chip, but if it is deemed a requirement please keep it simple and don't let it be a stepping stone to "just one more thing"...
Great.
Can you post an update of the feature spec (minus instruction lists)?
I know you are trying to exercise restraint and optimize too, but a spec that can be maintained can be followed. The P2 spec just sat there and never got updated.
Please avoid too much threading.
I suppose it's also useful to know what P1 compatibility if any may be disappearing. For example, I noticed HUBOP in the list(s) of unused instructions ... that is a misnomer of course because COGNEW is a HUBOP ;-)
Thanks.
HUBOP is becoming a clearinghouse for all functions that have to do with the hub, or possibly even video! I'm trying to focus the cog on being efficient at flow control and computation. The more generic we can make its peripheral interfaces, the simpler and faster it can become. I'm really tired now, as I've been up over 24 hours, so I need to get some sleep.
I agree with not making duplicate pointers per task, etc. But to get bare bones multi-tasking, all we need is 1..3 more Z/C/PC's and some mux's for them. It's a big deal for a program like the ROM_Monitor. It wouldn't work in one cog without multi-tasking.
(related to Chip's estimate above) The fact that the new chip will have fabulous cog bandwidth but only 1/8 the effective processing power of the prior design, speaks to the power of the instruction set that is in that prior design. There have been references to design creep. To me it seemed that the driving ethic of the design effort was to optimize performance by replacing common functions with single instructions. Every time you can replace a sequence of common code with a single instruction, the instruction set seems to bloat…but in fact these new instructions function more like macros than the instructions in the original P1, and the impact that they can have on total functionality is obviously huge. In my view, knowing that Chip was looking for places in the design, where this ethic could be applied, there were lots of suggestions about potential opportunities. I hope we can eventually get back to this design ideology and can find a way to better characterize the instruction set… so that the core instructions are set well apart and easily distinguished from these compound instructions. This should be easy to do. The only issue I was concerned about was the complexity of the addressing. I have never been good with this kind of programming and I was really afraid that I might never fully understand it… Of course that wouldn't stop me from using it:)
My main concern is that it not lead to more feature creep.
I'll relabel this a little with a more descriptive term that better reflects what actually happens here:
Video (CLUT) memory sharing
This borrows a term from the PC world of cheap systems, where there is one memory array and code and video share the bus.
It saves the die area of a separate CLUT, but shares cog RAM and slots (50%) to do this.
The video hardware shifts each 8-bit (/4-bit/2-bit/1-bit?) pixel, uses that as the CLUT index, and sends the resulting 32-bit read, split, to the DAC or direct to the pins.
Yes, but I can't get it from my work computer (that hates the internet) onto my laptop, without my thumb drive that is back in the house.
It needs refining, anyway. It's yet a mess, with opcodes undefined. Just the instructions are there properly.
For what it's worth, I added PTRA and PTRB into it, to facilitate efficient hub access and hub exec. There's also AUGS and AUGD and the immediate 17-bit JMP/CALL/LINK instructions. The hub exec cache is a section of cog RAM that is used for hardware registers, so nothing there gets wasted. There's another section up top for DCACHE that is otherwise used for read-only registers like CNT/RND/INA/INB. I figure that for hub exec, there's no benefit in having more than one 4-long cache line, since it will be exhausted after every four instructions, anyway. So, it should run 50 MIPS without branches or hub reads/writes.
It seems my previous post was ignored, so I'll ask again. Is this the planned list of features for P1+?
8-COG 16-COG P1 Core
4-port 2-port cog memory
20-bit 16-bit multiplier
256K 512K RAM/ROM
32-bit Multiply/Divide Engine in each cog the hub
Cordic Engine in each cog the hub
PTRA/PTRB
INDA/INDB
256-long CLUT/FIFO
PTRX/PTRY
Data Cache
360+ 200+ Instructions
256-bit 128-bit Hub Bus
4 tasks
hubex
4 Instruction Caches
serial I/O
Pre-emptive threads
That looks correct, but INDA and INDB will exist. In a non-pipelined architecture like this, they are very simple to do.
I'm counting about 170 instructions now. There is only one instruction cache line, since adding a bunch more would only make loops faster and necessitate a bunch of 15-bit comparators. With the planned setup, hub execution will be half the speed of cog execution. If it turns out we can go 256 bits wide, after all, it will be full speed in a straight line.
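The "half speed" and "full speed" statements follow from how many instruction longs one hub slot delivers. The sketch below is an illustration only: the function name is made up, and the 200 MHz clock, 16-cog rotation (one slot per cog every 16 clocks), and 2 clocks per cog instruction are assumptions drawn from other posts in this thread rather than from this one.

```python
def hubexec_mips(bus_bits, clock_hz=200_000_000, cogs=16):
    """Straight-line hub-exec rate in MIPS with one hub slot per rotation.

    Assumes every long fetched in a slot is executed before the next
    slot arrives (no branches, no hub reads or writes).
    """
    longs_per_slot = bus_bits // 32
    slots_per_second = clock_hz / cogs
    return slots_per_second * longs_per_slot / 1_000_000


# Cog-resident execution rate, assuming 2 clocks per instruction.
COG_MIPS = 200_000_000 / 2 / 1_000_000

narrow = hubexec_mips(128)  # planned 128-bit hub bus
wide = hubexec_mips(256)    # the 256-bit option

print(narrow, wide, narrow / COG_MIPS, wide / COG_MIPS)
```

Under these assumptions the 128-bit bus gives 50 MIPS, matching the figure quoted earlier for a single 4-long cache line, and doubling the bus width doubles the longs per slot and closes the gap to cog speed.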
Here is the new register map. The DCACHE and ICACHE areas are in locales where the RAM is neither read nor written by instructions:
You know me. 170 instructions is already quite enough.
Yes, I cringed when I thought you'd see that.
It just takes that much to get program flow and computation running smoothly.
I'd post the instruction set, but it's still a mess. I just got rid of the pixel blending instructions. They are totally fun to play with, but an excess in this chip.
I understand your problem.
And I think you now have good insight into what can be done, and even what is usable from the P2 work.
I know you will still come up with clever solutions to make that IC usable.
I will hang on as long as my health lets me.
But still: is it possible to have instruction info for the last FPGA BIN code, so that I have something to work on, even if it is not so usable in the next step of the IC?
Only work with electronics keeps me going a little longer (my life these last years has been sleeping and sitting at the computer programming and thinking); otherwise it will be only sleeping, and that does not help my health.
The new chip will have a lot of good things in it, including hub exec. We just couldn't get the P2 to fit in 180nm in any adequate manner. So, we are going back to the basics, but adding a few key elements from the Prop2 development. It's true that these cogs won't be as fast, but there will be more of them, so that the total MIPS will be higher, but the power will be 1/8th.
When this is done, we'll pick up where we left off on Prop2. That's the best we can do right now, in order to get a real chip into production. Hang in there, please.
Could they possibly live as a hub resource along side the cordic/math stuff? Maybe a trimmed down subset? They are not strongly needed, but they are super nice for doing GUIs.
Please write the monitor so that we can hook into it.
Say, a U command, for user debugger, or anything really. Monitor arguments passed in, and our program can return to it easily. This allows an upload to include whatever debugging package the developer deems necessary, and it takes advantage of the serial link already setup and established.
Comments
Even with the items that have been factored out into common resources, these are going to be way more complex than P1 cogs.
C.W.
Ken, time for the wet blanket. I recommend wool.
John Abshier
You like? I LIKE!
16 cogs? A primitive monitor, debugger, serial HIM running all the time - you bet! (Forth kernel? - I think I can mention it parenthetically! )
+1 x 10^100
I was trying to save him time
I really don't know how Chip manages with so little sleep.