I'm guessing this is impossible, but the ideal for me would be a system where we had 12 cogs, but the memory architecture was changed so that the first 128kB was standard Hub RAM, but then each subsequent 2kB block of RAM (times 12) was aliased to the cogs' RAM. If a given cog was running, we would get read-only access to that cog-block (or maybe read/write if that were possible, and specified with a bit flag when starting the cog), but if the cog was not running it could transparently be used as extra RAM. In this scenario it might be nice to have the first cog to run be cog 12, and start the highest available cog, giving a nice contiguous memory boost to Hub RAM. (Alternatively, reverse map the order of the cog's RAM, so cog 12's RAM comes right after the regular 128kB, then cog 11's RAM, etc.)
I am not expecting to fit USB FS into 1 cog with the extra 2 instructions, just to make it easy to bit-bang the interface and do CRC. The reply timing is very tight. I do think 2 cogs will be quite easy though.
The prop's philosophy is to create simple reusable objects for standard drivers. However, some of us, including Chip/Parallax, will make some standard driver mixes. That is why I suggested keyboard/mouse/serial would be a single cog. But users are not going to want to jump into multitasking, with all of its complexities, just to solve a cog shortage. That is why I now believe cogs are going to be the most sought after. Driving all those pins with all sorts of interfaces will take cogs. It's no use telling prospective customers, "Look, we have this great system without interrupts - you just use simple code to drive this x interface using its own cog. Oh, and by the way, because cogs are in short supply, you will need to learn this alternative method of multi-tasking so you can fit some of your drivers together in 1 cog." Then you will get the reply, "Why didn't you just use interrupts and provide hw interfaces to these peripherals?" IMHO there won't be a very compelling argument now because you have traded one complexity for another.
The P2 cogs are really potent and useful. For your application above you could easily have a "HMI" cog that takes care of Video, Keyboard, Mouse, serial if you want. You could also probably fit the gui in as well. Then that cog becomes a useful, reusable block. This is pretty much exactly what OzPropDev has done with his invaders demo, although you'd probably put the game engine in with the main program, or hub ram.
Are you sure USB FS will require at least 2 cogs, if we get silicon clocking at 160MHz+? Or will those proposed decoding instructions get it into 1 cog? How demanding is the command processor?
I had a similar idea. With process control you have a lot of read/calculate/writes going on. Usually in real time from the I/O but the user interface only needs to update every so often, usually a much slower pace. If the Cog ram was dual port and mapped into the Hub ram then the display, or secondary data processing, cogs could invisibly read the current values without the I/O cog having to spend its valuable time, and limited program memory, writing those out with Hub Write commands. The compilers would need to have a way to identify and track the cog variables for mapping purposes but that should not be too hard to do.
An example is a CNC machine. The servo drive cog needs to read the encoders at very high rates, verify motor tracking and update current axis position yet the position information that is contained there only needs to be updated at the user interface level every 1/10 of a second or so. So we either waste code space, and time, writing every value change to the Hub or we complicate our code with some 'only update to hub memory every ### process cycles'.
The reading cog would still need to spend a Hub cycle to get the data, but that would be not as time critical as the process cog's job.
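The 'only update to hub memory every N process cycles' workaround mentioned above can be sketched roughly as follows. This is a plain Python model, not cog code; `run_servo_loop`, `encoder_deltas` and `publish_every` are hypothetical names chosen for illustration, and the list of "hub writes" simply stands in for the hub write instructions the I/O cog would otherwise issue.

```python
def run_servo_loop(encoder_deltas, publish_every):
    """Model of a servo cog that tracks position on every cycle but only
    publishes it to shared (hub) memory every `publish_every` cycles."""
    position = 0      # lives in cog RAM, updated at full rate
    hub_writes = []   # each entry models one write to hub memory
    for cycle, delta in enumerate(encoder_deltas, start=1):
        position += delta                  # time-critical work, every cycle
        if cycle % publish_every == 0:     # throttled, non-critical update
            hub_writes.append(position)
    return position, hub_writes


final_position, writes = run_servo_loop([1] * 10, publish_every=5)
print(final_position, writes)
```

The extra counter and compare are exactly the code-space and complexity cost being complained about; a hub-visible cog RAM would make them unnecessary.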
----
Oh and I fully agree with Cluso99's observation above. That is the PRIMARY reason I want to use the P2. Simplicity of development and maintenance.
I am not expecting to fit USB FS into 1 cog with the extra 2 instructions, just to make it easy to bit-bang the interface and do CRC. The reply timing is very tight. I do think 2 cogs will be quite easy though.
How does that currently split out, on the FPGA P2, for
COG M : USB FS Low Level Driver - How many Cycles / Memory used ?
COG M+1: USB FS Command Processor - - How many Cycles / Memory used ?
Any loops unrolled for the FPGA that can shrink, in a XX MHz chip ? What MHz is needed to re-roll any loops ?
What about USB-LS, does that buy much in packing ability ?
How much SerDES bit-level support would be needed, to shrink functional USB into one COG ?
0: Main program (spin or C)
1: Video generator (output)
2: Video manipulation (game engine, gui, etc)
3: SD Driver (at least 1 cog)
4: SDRAM and Cache Driver
5: USB Low Level Driver
6: USB Command Processor (USB FS will require at least 2 cogs)
Now I only have 1 cog left for keyboard/mouse/serial.
Desperately need more cogs!
That is a fully-fleshed out system, 2 more COGS would get headroom, and leave area for RAM as well ?
I'm guessing this is impossible, but the ideal for me would be a system where we had 12 cogs, but the memory architecture was changed so that the first 128kB was standard Hub RAM, but then each subsequent 2kB block of RAM (times 12) was aliased to the cogs' RAM.
Not impossible, but an impractical waste of silicon. COG memory is multi ported to allow opcode and operand access, in one clock.
It is also tightly integrated for speed. HUB RAM is larger and slower.
What has been suggested is mail-box style RAM, but that also has restrictions in use, and adding another memory area starts to complicate things.
An alternative is a FIFO Star, that adds no extra memory areas, and has a low impact on COG RAM, which could give an alternative COG-COG pathway.
Less clear, is how much this separate path would really buy, in terms of speed, given HUB can have wide-paths already.
Not impossible, but an impractical waste of silicon. COG memory is multi ported to allow opcode and operand access, in one clock.
It is also tightly integrated for speed. HUB RAM is larger and slower.
Ok, now I have a better understanding of cog ram.
One thing I am still foggy on is this new 'AUX' ram. What exactly is that and why is it there? I.E. what is the intended use?
>One thing I am still foggy on is this new 'AUX' ram. What exactly is that and why is it there? I.E. what is the intended use?
Pretty sure that is the old CLUT (color palette lookup table), which has been made larger and turned into more generic-access RAM.
You can still use it as a FIFO with dual stack pointers etc.
I normally don't put in my two cents on this thread. There are already a lot of very intelligent and smart people speaking and helping with the development as it is. But I feel right now is a good time to speak up. : ]
I think the clear answer and lowest risk is adding more HUB-RAM. The multitasking, pipelining and increased Cog speed already takes care of the performance jump.
I see a lot of code that I wrote on P1 in two cogs now fitting nicely into a single cog... so when comparing the new Prop2 cogs to the old Prop1 cogs isn't it like we now have 32 Prop1 cogs in computational comparison thanks to the pipeline and multitasking?
The HUB-RAM matter isn't about code space. It's about data. With the Prop we have amazing MIPS and computation power in a small package, with a teaspoon-sized memory footprint to work with. Having more HUB-RAM will close the gap between using the Prop2 alone on a board and having to give up pins and take on the added cost and complexity of interfacing an external RAM chip. I know in some cases, when logging or processing a lot of data, you will need a low-cost SRAM chip or SD card no matter what... But I can see a lot of benefit from adding another ~100k of memory.
For me adding more HUB-RAM makes for a better balance of computing and opens us to more data intense applications for the chip without the extra cost of external memory chips.
3D printer guys can get more vectors in memory, datalogging applications will be able to store larger snapshots of sensor information, audio applications will have more wavetable sample room, video guys will get dual-buffer video and higher resolutions, more fonts, etc. : ]
Frankly, I don't even care anymore. I don't care how many hub slots a cog gets, how many cogs there are, or even whether multitasking is in the mix, for that matter. What I do care -- and care deeply about -- is seeing a company and the people who work there who do matter to me -- and that I depend upon to a large extent for my livelihood -- get dragged down by an overly expensive, insanely long development cycle that has no end in sight and that's open to way too much input from people who will have virtually no impact on the chip's ultimate sales. The P2's development has to be more than an expensive, seven-year-and-counting hobby, or before long we won't even have a P1 to talk about. How many more "just two hour" mods will it take, Chip? It's time to end this insanity and get the P2 out the door, while there still is a door.
-Phil
P.S. In case you didn't catch it, subtlety and diplomacy are not my strong suits. And I'll probably regret this in the morning. For the time being, though, it felt good to get it off my chest.
Cogs vs. RAM. I'd go for the RAM. If I can keep from having to add external RAM, that would be a big deal. I see less need for more COGs because each cog can multitask. Maybe more COG ram would be nice, though.
The most important things:
1. Make the fewest changes possible in the design
2. Design to ensure success on next synthesis & fab (hopefully one that works)
3. Get the design out ASAP.
Chip, what's the estimate on how much additional AUX ram might be available if other factors are kept constant?
(ie for 8 cogs, it's 100kB extra hub ram, vs 50% extra cogs, vs how much additional stack ram?)
I talked to Beau this morning and got measurements for the various die features, currently. Here they are:
DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm
I thought the core was ~10 square mm, but I was kind of wrong. The actual core cell area is ~10, but the core layout area is 14.71. The place-and-route tools like initial cell density to be around 65% of the layout area to achieve a good route, while having room for the clock tree. ~10 divided by 14.71 is ~68%, so we were a little high, initially.
The DAC bus is going away, so that is going to free up 10.33 square mm. You can see that four more hub RAMs will take 7.04 square mm, and still leave about 3.3 square mm for more logic, which should be quite adequate for our newer logic. That means 256KB of hub RAM.
Also, having eight hub RAMs means we could go from QUADs to OCTs, so that you can move 8 longs per hub access, instead of 4. This would certainly get LMM code running at 1/2 speed of native PASM. It also means that two SDRAMs, making a 32-bit data bus, could be written/read to/from hub RAM at 1 long per clock, which means 800MB/s.
Getting rid of the DAC bus means that our custom routing largely goes away, so that the place-and-route tool will make all the connections from the core to both the memories and the pad frame. This is going to save tons of layout time.
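The area arithmetic in this post can be checked with a few lines of Python. This is only a back-of-envelope sketch using the measurements quoted above; the 32KB-per-block figure comes from the "4 * 32KB blocks" description elsewhere in this thread, and `hub_expansion` is a hypothetical helper name.

```python
# Areas in square mm, taken from the measurements quoted in this thread.
DAC_BUS_MM2 = 10.33        # freed when the DAC bus is removed
HUB_RAM_BLOCK_MM2 = 1.76   # one hub RAM block (currently x4)
HUB_BLOCK_KB = 32          # 4 blocks currently make up 128KB of hub RAM


def hub_expansion(extra_blocks, existing_blocks=4):
    """Return (area used, area left over, total hub KB) when the freed
    DAC bus area is spent on `extra_blocks` additional hub RAM blocks."""
    used = extra_blocks * HUB_RAM_BLOCK_MM2
    left = DAC_BUS_MM2 - used
    total_kb = (existing_blocks + extra_blocks) * HUB_BLOCK_KB
    return round(used, 2), round(left, 2), total_kb


used, left, total_kb = hub_expansion(4)
print(f"{used} mm2 used, {left} mm2 left for logic, {total_kb}KB hub RAM")
```

Four extra blocks use 7.04 square mm and leave about 3.3 square mm, matching the figures in the post.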
How does that currently split out, on the FPGA P2, for
COG M : USB FS Low Level Driver - How many Cycles / Memory used ?
COG M+1: USB FS Command Processor - - How many Cycles / Memory used ?
Any loops unrolled for the FPGA that can shrink, in a XX MHz chip ? What MHz is needed to re-roll any loops ?
What about USB-LS, does that buy much in packing ability ?
How much SerDES bit-level support would be needed, to shrink functional USB into one COG ?
All this remains to be seen. I expect LS could be done in 1 cog, but remember that is only 1.5Mb/s whereas FS is 12Mb/s, although in reality you are not going to see that speed when you add all the overhead. The devil is in the user's requirements.
Currently, at 80MHz, with the 2 new helper instructions, I can happily read (that's actually the hardest) the USB by looping. FYI it is 6.7 instructions / incoming bit @ 80MHz. IIRC the response has to be made within 7 USB bit times, but non-compliant USB stretches USB LS out to 16 bit times. I am unsure whether FS can cope with this stretched response. Also, in LS, the non-compliant tricks are to skip frames (ie pretend you missed a frame as it was bad, in order to gain more time to respond). Many of these tricks may not work in the faster FS. I am not aiming for full compliance, just working FS.
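For reference, the 6.7 figure is just the system clock divided by the USB bit rate. The sketch below assumes one instruction per clock (which is what the 6.7 instructions per bit at 80MHz figure implies); `clocks_per_bit` and `response_budget` are hypothetical helper names, and the 7 and 16 bit-time limits are the "IIRC" numbers from the post, not checked against the USB spec.

```python
USB_FS_BPS = 12_000_000   # full speed, 12Mb/s
USB_LS_BPS = 1_500_000    # low speed, 1.5Mb/s


def clocks_per_bit(clock_hz, bit_rate_bps):
    """System clocks available per incoming USB bit."""
    return clock_hz / bit_rate_bps


def response_budget(clock_hz, bit_rate_bps, bit_times):
    """Clocks available before a reply is due, given a limit in bit times."""
    return bit_times * clocks_per_bit(clock_hz, bit_rate_bps)


for mhz in (80, 160):
    hz = mhz * 1_000_000
    print(f"{mhz}MHz: FS {clocks_per_bit(hz, USB_FS_BPS):.1f} clocks/bit, "
          f"LS {clocks_per_bit(hz, USB_LS_BPS):.1f} clocks/bit, "
          f"FS reply budget {response_budget(hz, USB_FS_BPS, 7):.1f} clocks")
```

Doubling the clock to 160MHz doubles the per-bit budget for FS, which is why the clock speed of the final silicon matters so much to the 1-cog question.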
My loop also caters for bit-(un)stuffing too. SERDES would aid by assembling the bitstream up to 32 bits, so I could prepare for possible responses. It would be particularly beneficial to have NRZI and bit-stuffing and bit-unstuffing, but do not want that at the expense of a general purpose SERDES that has so many other purposes to offer. There is also the condition of SE0 (both input bits=00) which is an "end of frame" condition.
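To make the NRZI and bit-stuffing steps concrete, here is a simplified Python model of what the loop (or a SERDES helper) has to do per bit. It relies on two USB facts: a 0 bit is sent as a line transition and a 1 bit as no transition, and the transmitter inserts a 0 after six consecutive 1s. Everything else is an assumption for illustration: line states are modelled as abstract levels 0/1, SE0, sync and EOP handling are left out, and the function names are hypothetical.

```python
def nrzi_encode(bits, idle_level=1):
    """NRZI as used by USB: a 0 bit toggles the line, a 1 bit leaves it."""
    level, levels = idle_level, []
    for b in bits:
        if b == 0:
            level ^= 1
        levels.append(level)
    return levels


def nrzi_decode(levels, idle_level=1):
    """Recover bits: no change means 1, a change means 0."""
    prev, bits = idle_level, []
    for level in levels:
        bits.append(1 if level == prev else 0)
        prev = level
    return bits


def stuff(bits):
    """Insert a 0 after every run of six consecutive 1s."""
    out, ones = [], 0
    for b in bits:
        out.append(b)
        ones = ones + 1 if b == 1 else 0
        if ones == 6:
            out.append(0)   # stuffed bit forces a transition on the wire
            ones = 0
    return out


def unstuff(bits):
    """Drop the 0 that follows six consecutive 1s; a 1 there is an error."""
    out, ones = [], 0
    for b in bits:
        if ones == 6:
            if b != 0:
                raise ValueError("bit-stuff error")
            ones = 0
            continue        # discard the stuffed bit
        out.append(b)
        ones = ones + 1 if b == 1 else 0
    return out


payload = [1] * 8 + [0, 1, 1]
wire = nrzi_encode(stuff(payload))
assert unstuff(nrzi_decode(wire)) == payload
```

In software each of these is a compare, a counter and a conditional per bit, which is where the instruction budget per incoming bit goes.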
So I guess, rather than complete hw for the serial interface, I am leaning on the props philosophy of throwing a sw cog at the problem. However, there are not going to be enough cogs for this.
And this is now my realisation. At least Chip hinted at this when he offered another 4 cogs. IMHO this is now a no-brainer... we need the cogs!
That is a fully-fleshed out system, 2 more COGS would get headroom, and leave area for RAM as well ?
But I haven't room left for the user's interface/drivers of all the remaining pins. So presuming somehow the cogs I have for the system can be reduced by 1, that leaves 1 for the user's drivers of the remaining, say, 56 I/Os. This means the user will have to learn multi-tasking, so he doesn't get all the advantages of the prop's cogs.
So, I think +4 cogs is best.
But if at all possible, I think a small block (say 16 or 32 longs) of multiport ram between all cogs (likely at the centre of the die) would be extremely beneficial. Probably could even remove the D port if we had this. And IMHO it is an easy addin.
The DAC bus is going away, so that is going to free up 10.33 square mm. You can see that four more hub RAMs will take 7.04 square mm, and still leave about 3.3 square mm for more logic, which should be quite adequate for our newer logic. That means 256KB of hub RAM.
Also, having eight hub RAMs means we could go from QUADs to OCTs, so that you can move 8 longs per hub access, instead of 4. This would certainly get LMM code running at 1/2 speed of native PASM. It also means that two SDRAMs, making a 32-bit data bus, could be written/read to/from hub RAM at 1 long per clock, which means 800MB/s.
Getting rid of the DAC bus means that our custom routing largely goes away, so that the place-and-route tool will make all the connections from the core to both the memories and the pad frame. This is going to save tons of layout time.
Having the DAC bus go away sounds like a real winner for lowering risk and has some real performance benefits.
I thought about mentioning a 256 bit bus to do the OCTs last night, but didn't want to get shot! That is a real nice benefit of the extra HUB RAM.
Since that change would double HUB bandwidth to the COGs maybe we can drop the HUB slot sharing since the primary goal was to increase HUB slot bandwidth.
I like the HUB slot sharing, but maybe it isn't the time right now given all the ruckus it has caused.
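The bus widths and the 800MB/s figure being discussed work out as below. This is a rough sketch: a long is 32 bits, and the 200MHz clock is an assumption (it is the clock that makes 1 long per clock equal 800MB/s, and 200MHz is mentioned elsewhere in the thread). The helper names are hypothetical.

```python
BITS_PER_LONG = 32
BYTES_PER_LONG = 4


def hub_transfer_bits(longs_per_access):
    """Width of the hub data path for a given transfer size in longs."""
    return longs_per_access * BITS_PER_LONG


def stream_rate_mb_s(clock_hz, longs_per_clock=1):
    """Sustained transfer rate in MB/s at a given number of longs per clock."""
    return clock_hz * longs_per_clock * BYTES_PER_LONG / 1e6


print("QUAD:", hub_transfer_bits(4), "bits")   # current 4-long access
print("OCT: ", hub_transfer_bits(8), "bits")   # proposed 8-long access
print("SDRAM at an assumed 200MHz:", stream_rate_mb_s(200_000_000), "MB/s")
```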
I talked to Beau this morning and got measurements for the various die features, currently. Here they are:
DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm
I thought the core was ~10 square mm, but I was kind of wrong. The actual core cell area is ~10, but the core layout area is 14.71. The place-and-route tools like initial cell density to be around 65% of the layout area to achieve a good route, while having room for the clock tree. ~10 divided by 14.71 is ~68%, so we were a little high, initially.
The DAC bus is going away, so that is going to free up 10.33 square mm. You can see that four more hub RAMs will take 7.04 square mm, and still leave about 3.3 square mm for more logic, which should be quite adequate for our newer logic. That means 256KB of hub RAM.
Also, having eight hub RAMs means we could go from QUADs to OCTs, so that you can move 8 longs per hub access, instead of 4. This would certainly get LMM code running at 1/2 speed of native PASM. It also means that two SDRAMs, making a 32-bit data bus, could be written/read to/from hub RAM at 1 long per clock, which means 800MB/s.
Getting rid of the DAC bus means that our custom routing largely goes away, so that the place-and-route tool will make all the connections from the core to both the memories and the pad frame. This is going to save tons of layout time.
Chip,
Does this mean 4 more cogs is off the table?
If not, I am presuming a cog is 1x (cog ram + aux ram + (1/8)*core) ?
So put simply 8 cogs = 8*0.62 + 8*0.20 + 14.71 = 21.27, so 4 cogs = ~10.64mm2
So it is overly tight to say the least!
Could we have 10 cogs (works well with 200MHz and 1:10 hub loop), ie 2 more cogs = ~5.32mm2 ?
That would leave 5mm2
Increase aux ram to 512 longs (9 bits like cog ram) =8*0.20 = 1.6mm2
A small 32+2 * LONG block in the die centre to be shared as a single resource - size negligible - single port (round-robin priority without delays, ie could be cog 1,1,3,1,7,etc - ie after a cog gets a slot, the next slot is offered to the next cog, etc until someone wants it, so no slots are wasted while anyone wants one). The concept - use it between 2 cogs only for high speed transfers.
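The round-robin idea above might be modelled like this. It is only a sketch of the arbitration rule as described (start with the cog after the last one granted and give the slot to the first cog that wants it), not a proposal for the actual logic; `next_grant` is a hypothetical name.

```python
def next_grant(requests, last_granted, n_cogs=8):
    """Return the cog that gets the next slot, or None if nobody wants it.

    `requests` is the set of cog numbers currently asking for access.
    The search starts at the cog after `last_granted` and wraps around,
    so a lone requester can be granted every slot back to back.
    """
    for offset in range(1, n_cogs + 1):
        cog = (last_granted + offset) % n_cogs
        if cog in requests:
            return cog
    return None


# Cogs 1, 3 and 7 all want the shared block: they take turns.
last = 0
order = []
for _ in range(4):
    last = next_grant({1, 3, 7}, last)
    order.append(last)
print(order)
```

With only one requester the same cog wins every slot, which matches the "1,1,3,1,7" pattern in the post.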
Question:
Could the hub simply be redone to be 8 * 16KB blocks (currently 4 * 32KB blocks) to give 8*LONGs per hub access ? I presume it would be over the top to be 16*8KB and 16*long access ?
BTW I now see why you need to increase hub in regular block sizes
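The per-cog area estimate in this post can be reproduced as follows. It uses the poster's own presumption that one cog costs its cog RAM, its aux RAM and one eighth of the core layout area; that even split of the core is an assumption, and `area_for_cogs` is a hypothetical helper.

```python
# Areas in square mm, from the measurements quoted in this thread.
COG_RAM_MM2 = 0.62
AUX_RAM_MM2 = 0.20
CORE_MM2 = 14.71      # layout area for all 8 cogs' logic
DAC_BUS_MM2 = 10.33   # area freed by removing the DAC bus


def area_for_cogs(n_cogs, base_cogs=8):
    """Estimated area of n cogs, assuming the core splits evenly per cog."""
    per_cog = COG_RAM_MM2 + AUX_RAM_MM2 + CORE_MM2 / base_cogs
    return n_cogs * per_cog


for extra in (2, 4):
    need = area_for_cogs(extra)
    print(f"+{extra} cogs: {need:.2f} mm2, "
          f"{DAC_BUS_MM2 - need:.2f} mm2 of the freed area left")
```

On these numbers 4 extra cogs would slightly exceed the freed area, while 2 extra cogs would leave about 5 square mm.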
I talked to Beau this morning and got measurements for the various die features, currently. Here they are:
DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm
I thought the core was ~10 square mm, but I was kind of wrong. The actual core cell area is ~10, but the core layout area is 14.71. The place-and-route tools like initial cell density to be around 65% of the layout area to achieve a good route, while having room for the clock tree. ~10 divided by 14.71 is ~68%, so we were a little high, initially.
The DAC bus is going away, so that is going to free up 10.33 square mm. You can see that four more hub RAMs will take 7.04 square mm, and still leave about 3.3 square mm for more logic, which should be quite adequate for our newer logic. That means 256KB of hub RAM.
Also, having eight hub RAMs means we could go from QUADs to OCTs, so that you can move 8 longs per hub access, instead of 4. This would certainly get LMM code running at 1/2 speed of native PASM. It also means that two SDRAMs, making a 32-bit data bus, could be written/read to/from hub RAM at 1 long per clock, which means 800MB/s.
Getting rid of the DAC bus means that our custom routing largely goes away, so that the place-and-route tool will make all the connections from the core to both the memories and the pad frame. This is going to save tons of layout time.
Chip! Please please please make the HUB RAM actually be 256K and not the minus 2K for ROM. Instead use some of the new space for ROM, hopefully a bit bigger to include more boot up ability and more monitor features.
.....
The DAC bus is going away, so that is going to free up 10.33 square mm. You can see that four more hub RAMs will take 7.04 square mm, and still leave about 3.3 square mm for more logic, which should be quite adequate for our newer logic. That means 256KB of hub RAM.
.....
Getting rid of the DAC bus means that our custom routing largely goes away, so that the place-and-route tool will make all the connections from the core to both the memories and the pad frame. This is going to save tons of layout time.
This sounds like a win-win-win all round, and a no-brainer solution!
I wonder if such changes should go in the P3 bucket?
I'm saying that because what Chip has suggested now sounds very low risk and really does look like a 'super P1' and fits almost all of the 'prop' philosophy.
My suggestion is:
- Do the 256k HUB with OCT transfers now included.
- Leave the AUX alone unless we could still maybe expand from 256 to 512 longs while still doing the 256k HUB.
- Drop the HUB Slot Sharing
- Add the SERDES/CRC items.
Chip! Please please please make the HUB RAM actually be 256K and not the minus 2K for ROM. Instead use some of the new space for ROM, hopefully a bit bigger to include more boot up ability and more monitor features.
I don't actually see the monitor being that big a deal as long as some simple bits were left.
The main advantage of the monitor as I see it, is to provide a simple mechanism to load initial code to program the flash. It is an extended downloader.
The reason I say this is because its a simple matter of downloading monitor code into the flash. Once that is there, we can do what we want.
By being flash resident, the monitor can be extended easily, like I have done with the P2 debugger, which actually resides in hub and runs in LMM mode. So the 16 long stub can be inserted in anyone's cog program to provide access to the monitor/debugger from within their program - so they can examine hub, cog, single step, etc, while all the time still having their program in cog.
If it were implemented this way, then the boot cog would just load itself from a small block of ROM which could be totally separate from the hub ram.
It's a run once at boot/reboot and forget.
I don't actually see the monitor being that big a deal as long as some simple bits were left.
The main advantage of the monitor as I see it, is to provide a simple mechanism to load initial code to program the flash. It is an extended downloader.
The reason I say this is because its a simple matter of downloading monitor code into the flash. Once that is there, we can do what we want.
By being flash resident, the monitor can be extended easily, like I have done with the P2 debugger, which actually resides in hub and runs in LMM mode. So the 16 long stub can be inserted in anyone's cog program to provide access to the monitor/debugger from within their program - so they can examine hub, cog, single step, etc, while all the time still having their program in cog.
If it were implemented this way, then the boot cog would just load itself from a small block of ROM which could be totally separate from the hub ram.
It's a run once at boot/reboot and forget.
This sounds like a good idea. Have the tiny boot ROM map to hub address zero on reset and then have it unmap itself revealing RAM underneath after COG 0 has booted.
I wonder if such changes should go in the P3 bucket?
I'm saying that because what Chip has suggested now sounds very low risk and really does look like a 'super P1' and fits almost all of the 'prop' philosophy.
My suggestion is:
- Do the 256k HUB with OCT transfers now included.
If more cogs are not considered, then yes, I agree this is the WTG.
- Leave the AUX alone unless we could still maybe expand from 256 to 512 longs while still doing the 256k HUB.
If space, yes.
- Drop the HUB Slot Sharing
Absolutely not. Could be a simpler mechanism, but don't restrict the possibilities.
* A donor cog could yield or gift his slot
* A cog can utilise free slots
- Add the SERDES/CRC items.
Absolutely. Chip has already agreed.
But we must still recommend what is in the SERDES. There is time while Beau implements the changes required to the DAC bus.
This sounds like a good idea. Have the tiny boot ROM map to hub address zero on reset and then have it unmap itself revealing RAM underneath after COG 0 has booted.
I'd rather see the boot rom map into Cog 0's RAM area, because that's ultimately where it ends up being copied to anyway.
All this remains to be seen. I expect LS could be done in 1 cog, but remember that is only 1.5Mb/s whereas FS is 12Mb/s, although in reality you are not going to see that speed when you add all the overhead. The devil is in the user's requirements.
Yes, LS is slower, but for a good many apps, it will be just a PC serial link, and in that space, 1.5Mb is actually fast.
However, there are not going to be enough cogs for this.
And this is now my realisation. At least Chip hinted at this when he offered another 4 cogs. IMHO this is now a no-brainer... we need the cogs!
From the numbers Chip gave above, COGs are quite expensive in silicon, certainly a lot more expensive than a little extra SerDes help will be.
The question then is, just how much SerDes help is needed, to pack this into a single COG
My loop also caters for bit-(un)stuffing too. SERDES would aid by assembling the bitstream up to 32 bits, so I could prepare for possible responses. It would be particularly beneficial to have NRZI and bit-stuffing and bit-unstuffing, but do not want that at the expense of a general purpose SERDES that has so many other purposes to offer. There is also the condition of SE0 (both input bits=00) which is an "end of frame" condition.
What is the code-size cost of bit stuff/unstuff in SW ?
This sounds like a good idea. Have the tiny boot ROM map to hub address zero on reset and then have it unmap itself revealing RAM underneath after COG 0 has booted.
Actually, IMHO I think it would not even have to map itself there. The boot/reboot hw instruction would just copy it into cog from a ROM block.
Chip! Please please please make the HUB RAM actually be 256K and not the minus 2K for ROM. Instead use some of the new space for ROM, hopefully a bit bigger to include more boot up ability and more monitor features.
Currently ROM is made by patching RAM, so that naturally makes it want to be small.
It also means 256k of RAM + 2K ROM, bumps to another address bit = more decode = some speed impact.
Memory generators also often are not expecting fractional pages, so manual intervention could be required.
It should be possible to alias the ROM to some high memory area, (for future proof) but it will still overlay into the top of the physical on-chip memory.
The ROM has to appear somewhere, so apart from cosmetic, what is the issue with 256k-2k ?
This sounds like a good idea. Have the tiny boot ROM map to hub address zero on reset and then have it unmap itself revealing RAM underneath after COG 0 has booted.
Some loader state engine is always needed, to get the ROM data to where it can RUN.
What may be possible on a Tiny-ROM approach, is a serial ROM, which is more hidden than present ROM.
The ROM code cannot get too small, as there are a lot of boot choices to cover.
Comments
I'm guessing this is impossible, but the ideal for me would be a system where we had 12 cogs, but the memory architecture was changed so that the first 128kB was standard Hub RAM, but then each subsequent 2kB block of RAM (times 12) was aliased to the cogs' RAM. If a given cog was running, we would get read-only access to that cog-block (or maybe read/write if that were possible, and specified with a bit flag when starting the cog), but if the cog was not running it could transparently be used as extra RAM. In this scenario it might be nice to have the first cog to run be cog 12, and start the highest available cog, giving a nice contiguous memory boost to Hub RAM. (Alternatively, reverse map the order of the cog's RAM, so cog 12's RAM comes right after the regular 128kB, then cog 11's RAM, etc.)
Does that make sense?
Jonathan
The prop's philosophy is to create simple reusable objects for standard drivers. However, some of us, including Chip/Parallax, will make some standard driver mixes. That is why I suggested keyboard/mouse/serial would be a single cog. But users are not going to want to jump into multitasking, with all of its complexities, just to solve a cog shortage. That is why I now believe cogs are going to be the most sought after. Driving all those pins with all sorts of interfaces will take cogs. It's no use telling prospective customers, "Look, we have this great system without interrupts - you just use simple code to drive this x interface using its own cog. Oh, and by the way, because cogs are in short supply, you will need to learn this alternative method of multi-tasking so you can fit some of your drivers together in 1 cog." Then you will get the reply, "Why didn't you just use interrupts and provide hw interfaces to these peripherals?" IMHO there won't be a very compelling argument now because you have traded one complexity for another.
Just my thoughts and 2c
An example is a CNC machine. The servo drive cog needs to read the encoders at very high rates, verify motor tracking and update current axis position yet the position information that is contained there only needs to be updated at the user interface level every 1/10 of a second or so. So we either waste code space, and time, writing every value change to the Hub or we complicate our code with some 'only update to hub memory every ### process cycles'.
The reading cog would still need to spend a Hub cycle to get the data, but that would be not as time critical as the process cog's job.
----
Oh and I fully agree with Cluso99's observation above. That is the PRIMARY reason I want to use the P2. Simplicity of development and maintenance.
How does that currently split out, on the FPGA P2, for
COG M : USB FS Low Level Driver - How many Cycles / Memory used ?
COG M+1: USB FS Command Processor - - How many Cycles / Memory used ?
Any loops unrolled for the FPGA that can shrink, in a XX MHz chip ? What MHz is needed to re-roll any loops ?
What about USB-LS, does that buy much in packing ability ?
How much SerDES bit-level support would be needed, to shrink functional USB into one COG ?
That is a fully-fleshed out system, 2 more COGS would get headroom, and leave area for RAM as well ?
Not impossible, but an impractical waste of silicon. COG memory is multi ported to allow opcode and operand access, in one clock.
It is also tightly integrated for speed. HUB RAM is larger and slower.
What has been suggested is mail-box style RAM, but that also has restrictions in use, and adding another memory area starts to complicate things.
An alternative is a FIFO Star, that adds no extra memory areas, and has a low impact on COG RAM, which could give an alternative COG-COG pathway.
Less clear, is how much this separate path would really buy, in terms of speed, given HUB can have wide-paths already.
Ok, now I have a better understanding of cog ram.
One thing I am still foggy on is this new 'AUX' ram. What exactly is that and why is it there? I.E. what is the intended use?
Pretty sure that is the old CLUT (color palette lookup table), which has been made larger and turned into more generic-access RAM.
You can still use it as a FIFO with dual stack pointers etc.
I think the clear answer and lowest risk is adding more HUB-RAM. The multitasking, pipelining and increased Cog speed already takes care of the performance jump.
I see a lot of code that I wrote on P1 in two cogs now fitting nicely into a single cog... so when comparing the new Prop2 cogs to the old Prop1 cogs isn't it like we now have 32 Prop1 cogs in computational comparison thanks to the pipeline and multitasking?
The HUB-RAM matter isn't about code space. It's about data. With the Prop we have amazing MIPS and computation power in a small package, with a teaspoon-sized memory footprint to work with. Having more HUB-RAM will close the gap between using the Prop2 alone on a board and having to give up pins and take on the added cost and complexity of interfacing an external RAM chip. I know in some cases, when logging or processing a lot of data, you will need a low-cost SRAM chip or SD card no matter what... But I can see a lot of benefit from adding another ~100k of memory.
For me adding more HUB-RAM makes for a better balance of computing and opens us to more data intense applications for the chip without the extra cost of external memory chips.
3D printer guys can get more vectors in memory, datalogging applications will be able to store larger snapshots of sensor information, audio applications will have more wavetable sample room, video guys will get dual-buffer video and higher resolutions, more fonts, etc. : ]
Yeah, what he said.
The most important things:
1. Make the fewest changes possible in the design
2. Design to ensure success on next synthesis & fab (hopefully one that works)
3. Get the design out ASAP.
I echo what Phil and others have said earlier, get it out of the door whilst there still is one.
Regards,
Coley
No, that is not a typo.
Bean
I talked to Beau this morning and got measurements for the various die features, currently. Here they are:
DAC bus = 10.33 square mm - this is being removed, freeing up the area
hub RAM = 1.76 square mm (x4)
cog RAM = 0.62 square mm (x8)
aux RAM = 0.20 square mm (x8)
core = 14.71 square mm
I thought the core was ~10 square mm, but I was kind of wrong. The actual core cell area is ~10, but the core layout area is 14.71. The place-and-route tools like initial cell density to be around 65% of the layout area to achieve a good route, while having room for the clock tree. ~10 divided by 14.71 is ~68%, so we were a little high, initially.
The DAC bus is going away, so that is going to free up 10.33 square mm. You can see that four more hub RAMs will take 7.04 square mm, and still leave about 3.3 square mm for more logic, which should be quite adequate for our newer logic. That means 256KB of hub RAM.
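The area budget in the paragraph above can be checked with a few lines of Python. This is only a sketch built from the figures quoted in this thread; the 32KB-per-block size is taken from the "4 * 32KB blocks" mentioned elsewhere in the discussion, and the function name is made up for illustration.

```python
# Die areas in square mm, as quoted in the post above.
DAC_BUS_AREA = 10.33        # freed when the DAC bus is removed
HUB_RAM_BLOCK_AREA = 1.76   # one hub RAM block (four exist today)
HUB_RAM_BLOCK_KB = 32       # assumption: the 4 current blocks make up 128KB

def hub_ram_budget(extra_blocks, existing_blocks=4):
    """Return (area used, area left over, total hub KB) for extra hub RAM blocks."""
    used = extra_blocks * HUB_RAM_BLOCK_AREA
    left = DAC_BUS_AREA - used
    total_kb = (existing_blocks + extra_blocks) * HUB_RAM_BLOCK_KB
    return round(used, 2), round(left, 2), total_kb

used, left, total_kb = hub_ram_budget(extra_blocks=4)
print(used, left, total_kb)   # 7.04 3.29 256
```

So four more hub RAM blocks use 7.04 square mm of the freed 10.33, leaving roughly 3.3 square mm for new logic.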
Also, having eight hub RAMs means we could go from QUADs to OCTs, so that you can move 8 longs per hub access, instead of 4. This would certainly get LMM code running at 1/2 speed of native PASM. It also means that two SDRAMs, making a 32-bit data bus, could be written/read to/from hub RAM at 1 long per clock, which means 800MB/s.
Getting rid of the DAC bus means that our custom routing largely goes away, so that the place-and-route tool will make all the connections from the core to both the memories and the pad frame. This is going to save tons of layout time.
Currently, at 80MHz, with the 2 new helper instructions, I can happily read the USB by looping (reading is actually the hardest part). FYI it is 6.7 instructions / incoming bit @ 80MHz. IIRC the response has to be made within 7 USB bit times, but non-compliant USB implementations stretch this out to 16 bit times in LS. I am unsure whether FS can cope with this stretched response. Also, in LS, the non-compliant tricks are to skip frames (i.e. pretend you missed a frame as it was bad, in order to gain more time to respond). Many of these tricks may not work in the faster FS. I am not aiming for full compliance, just working FS.
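To put the 6.7 figure in context, here is a rough timing sketch. It assumes one instruction per clock (which is what 6.7 instructions per bit at 80MHz implies for a 12Mb/s FS stream); the function name is illustrative only.

```python
USB_FS_BIT_RATE = 12_000_000   # USB full speed, bits per second
USB_LS_BIT_RATE = 1_500_000    # USB low speed, bits per second

def clocks_per_bit(clock_hz, bit_rate):
    """System clocks available per USB bit time."""
    return clock_hz / bit_rate

fs_80 = clocks_per_bit(80_000_000, USB_FS_BIT_RATE)
print(round(fs_80, 1))       # 6.7 clocks per FS bit at 80MHz
print(round(7 * fs_80, 1))   # 46.7 clocks in a 7-bit-time response window
print(round(clocks_per_bit(80_000_000, USB_LS_BIT_RATE), 1))   # 53.3 clocks per LS bit
```

The gap between FS and LS is a factor of 8, which is why tricks that buy time in LS may not carry over.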
My loop also caters for bit-(un)stuffing too. SERDES would aid by assembling the bitstream up to 32 bits, so I could prepare for possible responses. It would be particularly beneficial to have NRZI and bit-stuffing and bit-unstuffing, but do not want that at the expense of a general purpose SERDES that has so many other purposes to offer. There is also the condition of SE0 (both input bits=00) which is an "end of frame" condition.
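For anyone unfamiliar with what the loop has to undo, here is a minimal Python model of USB NRZI coding and bit stuffing (a 0 is sent as a level change, a 1 as no change, and a 0 is inserted after six consecutive 1s). It is a sketch for illustration, not the PASM loop; it works on lists of 0/1 values, assumes the line idles at level 1, and does not model SE0.

```python
def stuff(bits):
    """Transmitter side: insert a 0 after every run of six 1s."""
    out, ones = [], 0
    for b in bits:
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            out.append(0)
            ones = 0
    return out

def nrzi_encode(bits, idle_level=1):
    """A 0 toggles the line level, a 1 leaves it unchanged."""
    levels, level = [], idle_level
    for b in bits:
        if b == 0:
            level ^= 1
        levels.append(level)
    return levels

def nrzi_decode(levels, idle_level=1):
    """No transition decodes as 1, a transition decodes as 0."""
    bits, prev = [], idle_level
    for level in levels:
        bits.append(1 if level == prev else 0)
        prev = level
    return bits

def unstuff(bits):
    """Receiver side: drop the 0 that follows six consecutive 1s."""
    out, ones, skip_next = [], 0, False
    for b in bits:
        if skip_next:
            if b != 0:
                raise ValueError("bit-stuff error: expected 0 after six 1s")
            skip_next, ones = False, 0
            continue
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            skip_next = True
    return out

data = [1] * 8 + [0, 1, 1]
line = nrzi_encode(stuff(data))
print(unstuff(nrzi_decode(line)) == data)   # True
```

In software each received bit costs a compare, a counter update and a conditional skip, which is the work a SERDES with NRZI and unstuffing support would take off the cog.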
So I guess, rather than complete hw for the serial interface, I am leaning on the props philosophy of throwing a sw cog at the problem. However, there are not going to be enough cogs for this.
And this is now my realisation. At least Chip hinted at this when he offered another 4 cogs. IMHO this is now a no-brainer... we need the cogs! But I haven't room left for the user's interface/drivers for all the remaining pins. So presuming that, somehow, the cogs I need for the system can be reduced by 1, that leaves 1 for the user's drivers of the remaining, say, 56 I/Os. This means the user will have to learn multi-tasking, so he doesn't get all the advantages of the Prop's cogs.
So, I think +4 cogs is best.
But if at all possible, I think a small block (say 16 or 32 longs) of multiport ram between all cogs (likely at the centre of the die) would be extremely beneficial. Probably could even remove the D port if we had this. And IMHO it is an easy addin.
I had to read that twice !!
Having the DAC bus go away sounds like a real winner for lowering risk and has some real performance benefits.
I thought about mentioning a 256 bit bus to do the OCTs last night, but didn't want to get shot! That is a real nice benefit of the extra HUB RAM.
Since that change would double HUB bandwidth to the COGs maybe we can drop the HUB slot sharing since the primary goal was to increase HUB slot bandwidth.
I like the HUB slot sharing, but maybe it isn't the time right now given all the ruckus it has caused.
C.W.
Does this mean 4 more cogs is off the table?
If not, I am presuming a cog is 1x (cog ram + aux ram + (1/8)*core) ?
So put simply 8 cogs = 8*0.62 + 8*0.20 + 14.71 = 21.27, so 4 cogs = ~10.64mm2
So it is overly tight to say the least!
Could we have 10 cogs (works well with 200MHz and 1:10 hub loop) = ~5.32mm2 ?
That would leave 5mm2
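The per-cog estimate above can be written out as a small Python sketch. It uses the presumption stated in this post, that one cog costs cog RAM + aux RAM + 1/8 of the core layout area; that model is a guess (the core area may well include non-cog logic), and the names are illustrative.

```python
# Areas in square mm, from the die measurements quoted earlier in the thread.
COG_RAM_AREA = 0.62
AUX_RAM_AREA = 0.20
CORE_AREA = 14.71      # core layout area, presumed to be shared by 8 cogs
DAC_BUS_AREA = 10.33   # area freed by removing the DAC bus

def cog_area(n_cogs):
    """Estimated area of n_cogs under the 1/8-of-core presumption."""
    per_cog = COG_RAM_AREA + AUX_RAM_AREA + CORE_AREA / 8
    return n_cogs * per_cog

for extra in (4, 2):
    cost = cog_area(extra)
    print(extra, round(cost, 2), round(DAC_BUS_AREA - cost, 2))
```

Under this model 4 extra cogs slightly exceed the freed area, while 2 extra cogs leave about 5 square mm.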
Increase aux ram to 512 longs (9 bits like cog ram) =8*0.20 = 1.6mm2
A small 32+2 * LONG block in the die centre to be shared as a single resource - size negligible - single port (round-robin priority without delays, i.e. could be cog 1,1,3,1,7,etc - i.e. after a cog gets a slot, the next slot is offered to the next cog, etc. until someone wants it, so no slots are wasted while any cog wants one). The concept - use it between 2 cogs only for high speed transfers.
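The arbitration idea above can be modelled in a few lines of Python. This is only a behavioural sketch of the scheme as described (offer each slot to the next cog after the last winner, skipping cogs that do not want it); the function names and the 8-cog default are assumptions.

```python
def next_grant(requests, last_granted):
    """Return the next requesting cog after last_granted, or None if nobody asks.

    requests is a list of booleans, one per cog. The last winner is checked
    last, so a lone requester can take back-to-back slots.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        cog = (last_granted + offset) % n
        if requests[cog]:
            return cog
    return None

def simulate(request_schedule, n_cogs=8):
    """request_schedule: one set of requesting cog numbers per slot."""
    grants = []
    last = n_cogs - 1   # start so that cog 0 is offered the first slot
    for wanting in request_schedule:
        requests = [cog in wanting for cog in range(n_cogs)]
        grant = next_grant(requests, last)
        grants.append(grant)
        if grant is not None:
            last = grant
    return grants

print(simulate([{1}, {1}, {3}, {1}, {7}]))   # [1, 1, 3, 1, 7]
print(simulate([{1, 3}, {1, 3}, set()]))     # [1, 3, None]
```

A slot only goes unused when no cog is requesting, which matches the "no wasted slots" goal.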
Question:
Could the hub simply be redone to be 8 * 16KB blocks (currently 4 * 32KB blocks) to give 8*LONGs per hub access ? I presume it would be over the top to be 16*8KB and 16*long access ?
BTW I now see why you need to increase hub in regular block sizes
Chip! Please please please make the HUB RAM actually be 256K and not the minus 2K for ROM. Instead use some of the new space for ROM, hopefully a bit bigger to include more boot up ability and more monitor features.
I wonder if such changes should go in the P3 bucket?
I'm saying that because what Chip has suggested now sounds very low risk and really does look like a 'super P1' and fits almost all of the 'prop' philosophy.
My suggestion is:
- Do the 256k HUB with OCT transfers now included.
- Leave the AUX alone unless we could still maybe expand from 256 to 512 longs while still doing the 256k HUB.
- Drop the HUB Slot Sharing
- Add the SERDES/CRC items.
C.W.
The main advantage of the monitor as I see it, is to provide a simple mechanism to load initial code to program the flash. It is an extended downloader.
The reason I say this is because its a simple matter of downloading monitor code into the flash. Once that is there, we can do what we want.
By being flash resident, the monitor can be extended easily, like I have done with the P2 debugger, which actually resides in hub and runs in LMM mode. So the 16 long stub can be inserted in anyone's cog program to provide access to the monitor/debugger from within their program - so they can examine hub, cog, single step, etc, while all the time still having their program in cog.
If it were implemented this way, then the boot cog would just load itself from a small block of ROM which could be totally separate from the hub ram.
It's a run once at boot/reboot and forget.
* A donor cog could yield or gift his slot
* A cog can utilise free slots
Absolutely. Chip has already agreed.
But we must still recommend what is in the SERDES. There is time while Beau implements the changes required to the DAC bus.
I'd rather see the boot rom map into Cog 0's RAM area, because that's ultimately where it ends up being copied to anyway.
Yes, LS is slower, but for a good many apps, it will be just a PC serial link, and in that space, 1.5Mb/s is actually fast.
That makes sense, to get something working.
From the numbers Chip gave above, COGs are quite expensive in silicon, certainly a lot more expensive than a little extra SerDes help will be.
The question then is, just how much SerDes help is needed, to pack this into a single COG ?
What is the code-size cost of bit stuff/unstuff in SW ?
Currently ROM is made by patching RAM, so that naturally makes it want to be small.
It also means 256k of RAM + 2K ROM, bumps to another address bit = more decode = some speed impact.
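The extra address bit is easy to see with a quick Python check, assuming a flat byte-addressed map with the ROM placed above the RAM (the function name is illustrative):

```python
def address_bits(n_bytes):
    """Bits needed to address n_bytes distinct byte locations."""
    return (n_bytes - 1).bit_length()

KB = 1024
print(address_bits(128 * KB))            # 17
print(address_bits(256 * KB))            # 18
print(address_bits(256 * KB + 2 * KB))   # 19
```

A full 256K of RAM fits in 18 bits; putting 2K of ROM on top of it spills into a 19th bit.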
Memory generators also often are not expecting fractional pages, so manual intervention could be required.
It should be possible to alias the ROM to some high memory area, (for future proof) but it will still overlay into the top of the physical on-chip memory.
The ROM has to appear somewhere, so apart from cosmetic, what is the issue with 256k-2k ?
Some loader state engine is always needed, to get the ROM data to where it can RUN.
What may be possible on a Tiny-ROM approach, is a serial ROM, which is more hidden than present ROM.
The ROM code cannot get too small, as there are a lot of boot choices to cover.