Chip, regarding "RDWORD/WRWORD": I agree we could do without them at a pinch - but they would have to be simulated in software, and each 16-bit Hub access would then take a couple of extra instructions.
From a high-level language this is no problem at all - but my gut tells me it is not a good idea for a microcontroller that is still heavily oriented towards embedded applications.
People who have to develop fast and compact embedded code would never use 32 bits where 16 would do - but now, every time they use a 16-bit value, they will have to mentally juggle its increased access time against the increased code size.
They could easily end up hating this new chip every time they have to make that trade-off.
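To make that concrete, here's a minimal (untested!) sketch of what a simulated RDWORD might look like, assuming P1-style PASM where RDLONG ignores the low two address bits, and with val, addr, and mask_ffff as assumed registers:

        rdlong  val, addr           ' fetch the long containing the word
        test    addr, #2       wz   ' which word of the long do we want?
  if_nz shr     val, #16            ' odd word: move the upper half down
        and     val, mask_ffff      ' keep only the low 16 bits

mask_ffff long  $FFFF

A simulated WRWORD would be worse still - a read-modify-write of the containing long.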
Ross.
EDIT: I see Chip already responded to this issue at post #22.
If you can run Perl, you're welcome to use the program I used to analyze my files (attached). The first thing it does is to strip all comments, since things like "and" and "or" tend to be rife there. Then it looks only in DAT sections for opcodes. I also found that DAT-resident strings included opcode mimics, so I included the long, word, and byte pseudo-ops to short-circuit any further search in the line. It also handles Unicode files.
BTW, the last thing it prints is the number of waitpxx ops with immediate operands vs. the total number.
-Phil
Does anyone have any data on how other chips apportion VDD pins based on their core current? I know on very high-current chips, they place power pads right in the middle of the die and attach the package directly to them.
This is pasted from a 100-pin Atmel CPLD datasheet (333MHz internal spec):
ATF1508RE has up to 80 bi-directional I/O pins, four dedicated input pins, 1 internal voltage regulator supply input pin (VCCIRI), 6 I/O VCC pins (VCCIOA and VCCIOB), 8 ground pins (GND), and 1 internal voltage regulator output pin (VCCIRO).
- note that part specs ~30mA @ 100MHz on VccCore, so the VccCore pin count here is going to be lower than you need.
I'm really liking the sound of this! I agree that code compatibility isn't necessary. Some additional questions:
1) What is the expected clock speed?
2) Will all cogs still be identical?
3) Will it use the 2-clock or 4-clock design?
4) Will it keep the ROM lookup tables or CORDIC?
5) Will it keep the old or new bootstrap?
6) Will it have the new monitor?
7) Will it have the debug/trace?
8) Will the CLUT become AUX?
9) Will the HUB access be every 32 clock cycles?
10) Will there be an equivalent to PORT_D?
11) Will there be 16 HUB locks?
And a few thoughts on the above questions:
For PORT_D, if it would be easy to add a hardwired bus between pairs of cogs (0 and 1, 2 and 3, etc.), this might make it easier to write efficient protocols that require two cogs. Additionally, it might be possible to add software support for an 8-bit (assuming the limited number of I/O pins) external RAM driver that is controllable via PORT_D from the "main" program. In other words, the driver would run in COG 1 and the main program would run in COG 0, commanding it over the hardwired port. The driver would still most likely transfer between external and HUB RAM, which would allow for larger memory models in the "main" program. If I had a choice in the bus architecture, I'd say two 32-bit registers that are cross-coupled such that the first is write-only and the second is read-only (i.e. no need for DIRx).
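To sketch how the pairing might look from software (the OUTD/IND register names here are purely hypothetical, standing in for the proposed write-only/read-only pair):

        ' cog 0: post a command and spin until the driver in cog 1 replies
        mov     outd, request       ' write-only side: partner cog sees this
:wait   mov     t1, ind             ' read-only side: partner's reply register
        cmp     t1, #0         wz
  if_z  jmp     #:wait              ' wait for a nonzero acknowledge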
If there's not enough room for CORDIC in each cog, could you instead make a single instance available in the HUB? Since you will have a 128-bit data bus, it should be possible to start a CORDIC calculation with a single HUBOP (pointing to a block of registers) and read the results on the following hub slot. With this approach, there is obviously the potential for resource conflict. The simplest solution is to leave it up to the programmer to avoid accessing CORDIC from two cogs at the same time.
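Usage could then be as simple as this (pure speculation - the operand encoding and the CORDIC_xxx names are made up for illustration):

        hubop   block, #CORDIC_START    ' point the hub CORDIC at our register block
        ' ...a few free instructions while it grinds...
        hubop   block, #CORDIC_READ     ' collect results on our next hub slot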
1) 200MHz
2) yes
3) 2-clock
4) CORDIC
5) new, with authentication
6) yes
7) no, maybe?
8) no CLUT, WAITVID can convey 128 bits now
9) every 16 clocks / 8 instructions
10) Only A and B, for now
11) yes
CORDIC, MUL32X32, DIV64/32, SQRT (maybe) will be in the hub, but pipelined, so nobody has to wait for anybody else, only their turn at the hub.
Cogs could offset each other by 1 clock via WAITCLK to tag-team on the pins.
I don't know if cog-to-cog 32-bit links will be practical.
Also, I'm assuming that the following P2 features will not be ported:
SERDES
INDx
tasks
register remapping
If this were the case, would it be possible to add an extremely simple cooperative multitasking instruction set? I'm thinking something along the following lines:
Single internal TASK register for holding a PC/Z/C.
GETTASK instruction to read TASK.
SETTASK instruction to write TASK.
SWTASK instruction that takes PC+1/Z/C and swaps it with whatever is in TASK.
With just SETTASK and SWTASK, it would be possible to write drivers with "concurrent" read/write threads. With GETTASK, more complex schedulers could be developed. No, it's not as efficient as interleaved tasking, but it should add very little complexity and circuitry for a significant increase in usability over the current P1 approach(es).
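For illustration, a two-thread driver using the proposed instructions might look like this (everything here is hypothetical, including how the initial PC/Z/C image in tx_start would be formed):

        settask tx_start            ' prime TASK with the TX thread's entry point

rx_loop ' ...sample inbound bits...
        swtask                      ' yield: TX resumes where it last yielded
        jmp     #rx_loop

tx_loop ' ...drive outbound bits...
        swtask                      ' yield: RX resumes
        jmp     #tx_loop

tx_start long   tx_loop             ' hypothetical initial TASK image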
Right, no serdes, tasks, register remapping - though there may be some INDx. Tasks in a non-pipelined architecture are almost trivial to implement, but I don't want to go there yet. Having multiple tasks IS a lot of fun and makes some apps possible to do in one cog.
Nope, all the nice palette stuff is gone. I will really miss the 4bpp and palette modes. Unfortunately, those need AUX (formerly CLUT) and the new video engine... many gates.
On Morpheus, I use RRRGGGBB, works very well. A little external logic would allow RRGGBBII.
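For reference, packing 8-bit R/G/B down to RRRGGGBB is just shifts and masks - a minimal sketch (r, g, b, t1, and pix are assumed registers):

        mov     pix, r
        and     pix, #%1110_0000    ' top 3 red bits stay in bits 7..5
        mov     t1, g
        shr     t1, #3              ' top 3 green bits into bits 4..2
        and     t1, #%0001_1100
        or      pix, t1
        mov     t1, b
        shr     t1, #6              ' top 2 blue bits into bits 1..0
        or      pix, t1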
- INDx would be EXTREMELY useful, especially if it has the same modes as P2 (so it can be used as a stack, or FIFO). Minimum 2 please; four would be incredible.
- Personally, I'd prefer 32 cogs, with my hub slot mapping table. Failing that, task capability would be great
(putting on kevlar suit)
- I don't dare ask for the PTRA/PTRB support... even with the kevlar suit on ... although it is great for compiler support
Just chiming in to say I like the idea of going ahead with this chip. P2 was getting out of hand and this looks much more do-able and still a quantum jump over what we have now. Will be following more closely than I was following the P2 thread lately...
Reference devices like the SSD1963 support 16/18/24-bit pixel modes, and slave bus widths of 8, 9, 12, 16, 18, and 24 bits.
Where they have < 24 bpp, I think they left-justify in a 24-bit output field.
Exercising restraint can be so difficult and I encourage all of us to set limits now and design towards them. This morning we had only 16 cogs.
I don't think the 32 was serious, just a for-example point.
We still need an OnSemi Power/Speed Simulation pass on this, before even 16 COGS & 200MHz are actually confirmed as within the Power/Process envelopes.
32 COGs is unlikely to fit the power envelope, and even 16 still needs Sim confirmation.
Unused COGs burn quite a lot of die area, so I think simple tasking should be checked, once an OnSemi Sim confirms how many COGs can 'stay cool' inside the package.
Notes: ClusoDebugger_276.spin contains every instruction, so you can effectively subtract 1 from all of these. I didn't exclude comments, so "and" and "or" are artificially high; a few other things are also higher because of Spin keywords.
Here's the number of hits total for each instruction:
Simple tasking would mean no need for 32 cogs. Works for me. I know, feature creep.
So I guess we could do 8-bit fullscreen WVGA using a 384kB pixel buffer...
Guess we'd have one cog half full with a 256-long CLUT, and the rest of it would just push the pixel buffer out the DAC...
Ok... So then we would have 36 I/O available after SDRAM. With 4 used for VGA we are left with 32. Same as what I have now to work with (P1). I would have to give up my 4 hard inputs (direct to Prop) to get the Mouse and Keyboard serial ports that I am now getting from the grafted Raspberry Pi. Not optimal, but doable.
As for memory, if you are planning on this to have SDRAM typically, can you not (don't hang me) make the I/O pins for that interface just digital and drop the analog from them? That would free up area for more memory for the LCD/VGA guys. If you don't need the SDRAM they would still be available for regular digital I/O applications. Would there really be a practical use for 80 analog pins on one chip?
Even with the extra I/O, 16 cogs is fine. Please don't give Ken a stroke! He has been very good with our insanity up till now, and we need him at 100%, working his marketing magic.
Would it be possible to set the number of cogs that can access the hub memory, thereby increasing the bandwidth to hub memory?
I.e., have a setting so all 16 cogs access memory, one to set it to 8 cogs, and so on all the way down to one cog with full access, increasing bandwidth with each setting. That way hub exec could run at varying rates... the best of both worlds.
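For scale, assuming the 200MHz / hub-every-16-clocks figures above:
16 cogs: 200MHz/16 = 12.5M hub slots/s per cog
8 cogs: 200MHz/8 = 25M hub slots/s per cog
1 cog: 200MHz/1 = 200M hub slots/s (if the hub RAM could actually cycle that fast)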
Maybe this is a case for the simple tasking? Only instead of 2 programs slicing, one is the code and the other is the video gen, using the COG memory instead of its own local memory.
That preserves RAM, which is the costly die item here.
That's why I suggested the minimal cooperative approach. It requires zero modification of the pipeline or instruction processing, while enabling what I imagine to be the biggest use case: a single cog with separate I/O read and write threads. With 16 cogs, I see much less need for the 4-task approach in P2. This is a KISS solution that should have the least impact on the new chip.
By the way, what nickname are we giving this thing?
Try (untested!)
electrodude
How much RAM does that ~70% give, and what growth in die-edge would be needed to make it to 800x480 LCD numbers?
800x480 @ 8bpp = 384K
If you don't need a back buffer for page flipping, it will work.
Hmm - 8bpp? Does this design include a 256-entry palette RAM?
I.e., how does that 8bpp map onto the DACs Chip has mentioned?
That's 512KB. Each 128KB takes 5.7 square mm.
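(So the full 512KB is 4 x 5.7 = 22.8 square mm of die.)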
How much RAM is needed for the LCDs?
You guys want a wet blanket? I'm here. We might be starting again, but we're in the home stretch, so let's finish this one!
Ken Gracey
I thought PTRx was necessary in order to access larger HUB memory spaces.
In purely raw pixel storage, it is like this - of course, some will map to DACs easier than others.
8*800*480/8 = 384000
9*800*480/8 = 432000
10*800*480/8 = 480000
11*800*480/8 = 528000
12*800*480/8 = 576000
13*800*480/8 = 624000
14*800*480/8 = 672000
15*800*480/8 = 720000
16*800*480/8 = 768000
17*800*480/8 = 816000
18*800*480/8 = 864000
19*800*480/8 = 912000
20*800*480/8 = 960000
21*800*480/8 = 1008000
22*800*480/8 = 1056000
23*800*480/8 = 1104000
24*800*480/8 = 1152000
Here's the updated version of my original list:
Phil, I will try downloading and using your perl program now.
It sure does speed it up, and reduces operations needed greatly for compiled code stack operations.
This represents actual PASM usage in the files, eliminating comments, strings, etc, and only looking in DAT sections. Thanks Phil!
The wet blanket is a good start, you might want to get the firehose ready though. :-)