Going out on a limb
David Betz
Hub execution has solved the problem of being limited to 2K of COG memory for PASM programs. How about tackling how to handle programs and data that are bigger than 256K now? It's been suggested that SDRAM might be a common feature on P2 boards, so it would be nice to be able to use it for code and data rather than just frame buffers. This could, of course, be done with XMM, but now that we have hub execution the penalty for going to XMM is going to be huge.

I wonder if it might be possible to add a simple TLB on top of the hub execution model to translate any address beyond 256K to a hub address? The TLB could probably be fairly small and still have a huge effect on the performance of external code. The main catch is that you'd need some sort of address fault trap that branches to a known location, probably in COG memory, when an external address is not found in the TLB. You'd also need a way to restart the instruction that caused the fault. It seems to me that the only instructions that could cause faults are instruction fetches and hub read/write instructions; any other instruction would not need to be restartable.

I suppose this would have to be a P3 feature, but it might not be all that difficult to implement and would essentially give unlimited code and data space with much better performance than XMM.
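To make the idea concrete, here is a minimal C sketch of the kind of lookup I have in mind. The page size, the entry count, and the direct-mapped organisation are purely illustrative choices, not anything proposed for actual silicon:

#include <stdint.h>
#include <stdbool.h>

#define HUB_SIZE     0x40000u     /* 256K of hub RAM                     */
#define PAGE_SHIFT   10           /* 1K pages, purely illustrative       */
#define TLB_ENTRIES  8            /* "fairly small"                      */

typedef struct {
    uint32_t ext_page;            /* page number in external memory      */
    uint32_t hub_page;            /* hub page currently holding its copy */
    bool     valid;
} tlb_entry_t;

tlb_entry_t tlb[TLB_ENTRIES];

/* Translate a program address. Anything below 256K passes straight
 * through as a normal hub address; anything above is looked up in the
 * TLB. A miss raises the address-fault trap, and after the COG-resident
 * handler fills the entry the faulting fetch or rdlong/wrlong restarts. */
uint32_t translate(uint32_t addr, bool *fault)
{
    *fault = false;
    if (addr < HUB_SIZE)
        return addr;                              /* ordinary hub-exec address */

    uint32_t page = addr >> PAGE_SHIFT;
    tlb_entry_t *e = &tlb[page % TLB_ENTRIES];    /* direct-mapped lookup */

    if (e->valid && e->ext_page == page)
        return (e->hub_page << PAGE_SHIFT) | (addr & (((uint32_t)1 << PAGE_SHIFT) - 1));

    *fault = true;                                /* trap to the miss handler */
    return 0;
}

On a hit the instruction proceeds with the translated hub address; on a miss the COG-resident handler fills the entry and the faulting instruction restarts.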
(I wonder how long it will take for Ken's hit men to arrive at my house?)
Comments
What might be possible is a simple wait mechanism for addresses off chip, which would rely on a tiny SDRAM loader to signal ready.
That is a comparator, and a tiny state engine, and does not try to do too much.
The advantage of this is that it keeps future devices compatible; the only change would be that wait times get shorter with silicon support or different DRAMs. The loader-stub wait could also vary with the size allocated to any cache.
If you try to do this across COGs, you lose determinism, as re-issuing the SDRAM address takes significant time.
I think some SW assist will be needed here and to keep things simple some mix of HW and SW will be best.
Also, somewhere in all this, someone needs to manage SDRAM refresh too..., or at least yield to any Auto-refresh.
Cypress does show new SRAMs, so maybe off-chip fetch could be simpler and more granular without SDRAM?
(though there is a price penalty)
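A rough C model of that wait mechanism, just to pin down the handshake; the mailbox layout and the loader cog are my own assumptions, not anything in the current design:

#include <stdint.h>
#include <stdbool.h>

#define HUB_TOP 0x40000u                 /* top of on-chip hub RAM */

/* Mailbox shared with a tiny SDRAM-loader cog (layout is illustrative). */
typedef struct {
    volatile uint32_t req_addr;          /* address the stalled cog wants       */
    volatile uint32_t data;              /* long returned by the loader         */
    volatile bool     ready;             /* loader sets this when data is valid */
} sdram_mailbox_t;

/* The comparator plus tiny state engine: on-chip addresses need no wait,
 * off-chip addresses post a request and stall until the loader signals
 * ready. The loader is also where SDRAM auto-refresh gets interleaved. */
uint32_t fetch_long(uint32_t addr, sdram_mailbox_t *mb, const uint32_t *hub)
{
    if (addr < HUB_TOP)
        return hub[addr >> 2];           /* on-chip: deterministic as before    */

    mb->ready    = false;
    mb->req_addr = addr;                 /* comparator trips: raise the request */
    while (!mb->ready)
        ;                                /* wait for the loader's ready signal  */
    return mb->data;
}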
The classic way is a full MMU, traps and all. Maybe after using the upcoming hubexec loadable something other than an MMU (or segment registers) will become viable.
For the moment, I see the speed/size hierarchy as follows:
1) cog-mode pasm = smallest programs, fastest code
2) hubexec pasm = medium size programs, pretty fast code
3) SDRAM XMM = HUGE programs, ok speed code
I can potentially see many alternate approaches that may be faster than the cache cog - but only time will tell.
(Sorry for the late comment, wifey and I have been fighting the flu)
As you say though, these ideas are only viable in a P3 timeframe.
Also, sorry to hear you had the flu. I'm glad you've recovered!
It may be able to share some of the HUB wide-fetch logic, but even a very basic quad-random 32-bit read could be:
FastReadOctalWord = 12 clks + 8 clks for 32 bits, so 70 MHz QSPI can feed opcodes at ~285 ns, with no queue at all.
Every opcode generates a new address and a new read, so this is slow and simple: there are no decisions or checks.
It would suit a lowest-time-slice thread, leaving most of the COG power for other tasks.
FastReadOctalWord looks to fetch 16 bytes, or 4 opcodes, on a block basis, and can do that at a peak rate of 143 ns/opcode.
The silicon support needed is a 16-byte buffer and a compare decision.
Not every opcode triggers a re-read, and the logic needs to choose between a buffer read and a Flash read.
Worst case would be a short jump between 2 blocks, which would run at 571 ns/opcode.
Best case would be a loop of up to 4 opcodes that can stay in the buffer; that could actually spin well under 143 ns
(i.e. run at whatever rate the threads and fetch paths can deliver on-chip opcodes).
The Winbond data seems unclear on whether 16/32/48-byte reads are possible, or only 16.
Some small octal-word alignment steps in the compiler could improve the speeds.
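For what it's worth, here is the arithmetic behind those figures, parameterised in C. The 12-clock overhead is the number quoted above, and the 8-clock block overhead is inferred from the 143/571 ns figures (it would correspond to address plus mode clocks with the command byte skipped in continuous-read mode), so treat both as assumptions:

#include <stdio.h>

#define QSPI_MHZ        70.0    /* QSPI clock assumed above       */
#define CLKS_PER_OPCODE 8       /* 32 bits over 4 data lines      */

/* ns per opcode for a fetch with a given clock overhead, fetch size,
 * and number of opcodes actually consumed from that fetch. */
static double ns_per_opcode(int overhead_clks, int opcodes_fetched, int opcodes_used)
{
    double clks = overhead_clks + (double)CLKS_PER_OPCODE * opcodes_fetched;
    return clks * 1000.0 / QSPI_MHZ / opcodes_used;
}

int main(void)
{
    printf("single opcode, 12-clk overhead : %.0f ns\n", ns_per_opcode(12, 1, 1)); /* ~286 */
    printf("16-byte block, all 4 used      : %.0f ns\n", ns_per_opcode(8, 4, 4));  /* ~143 */
    printf("16-byte block, only 1 used     : %.0f ns\n", ns_per_opcode(8, 4, 1));  /* ~571 */
    return 0;
}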
But you still need to fetch external code, plus the storage, plus decisions about when to do so. That makes it a P3 problem?
In the simplest form, the Flash XIP needs none of that - it simply outputs a new address for each opcode.
Very simple in HW, and moderate but predictable speed results.
Most of the HW will come for free from the better SerDes anyway.
What's added is effectively a 'fetch opcodes via SerDes' handler state engine.
By doing the simple address packaging in silicon, you can use most of the COG for real work and feed a low-resource thread the XIP native opcodes. ~3.5 MOPS is OK for many tasks, and it could call COG-resident libraries for inner-loop speed if really needed.
The flash can accept a 24-bit address, so long jumps would be via @REG, and I think Chip was making all other jumps relative, so code in XIP flash would be the same as code in HUB.
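For reference, one such fetch might look like this at the command level, sketched in C against a Winbond-style Fast Read Quad I/O (EBh) sequence. The qspi_* helpers are placeholders for whatever the SerDes pin logic ends up providing, and the exact mode-byte value depends on the flash part:

#include <stdint.h>

/* Hypothetical SerDes helpers -- not real P2 hardware or library calls. */
extern void qspi_cmd_single(uint8_t cmd);              /* 8 clocks on IO0    */
extern void qspi_send_quad(const uint8_t *buf, int n); /* 2 clocks per byte  */
extern void qspi_dummy(int clocks);
extern void qspi_read_quad(uint8_t *buf, int n);       /* 2 clocks per byte  */

/* Fetch one 32-bit opcode from a 24-bit flash address. */
uint32_t xip_fetch(uint32_t addr24)
{
    uint8_t hdr[4] = {
        (uint8_t)(addr24 >> 16),    /* A23-A16                              */
        (uint8_t)(addr24 >> 8),     /* A15-A8                               */
        (uint8_t)(addr24 >> 0),     /* A7-A0                                */
        0xA0                        /* mode byte: request continuous read   */
    };
    uint8_t op[4];

    qspi_cmd_single(0xEB);          /* Fast Read Quad I/O; can be skipped on
                                       later fetches in continuous-read mode */
    qspi_send_quad(hdr, 4);         /* address + mode byte on 4 lines        */
    qspi_dummy(4);                  /* dummy clocks before data              */
    qspi_read_quad(op, 4);          /* one 32-bit opcode                     */

    return (uint32_t)op[0] | (uint32_t)op[1] << 8 |
           (uint32_t)op[2] << 16 | (uint32_t)op[3] << 24;
}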
http://en.wikipedia.org/wiki/Translation_lookaside_buffer
I'm suggesting that what they call the "page table walk" be handled entirely in software.
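Roughly, that software walk could look like this in C, building on the TLB sketch earlier in the thread. The xmem_read_page() helper, the hub page pool, and the round-robin replacement are all assumptions for illustration, and write-back of dirty data pages is left out entirely:

#include <stdint.h>

#define PAGE_SHIFT   10
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define TLB_ENTRIES  8
#define HUB_POOL     8                    /* hub pages set aside as a cache    */

typedef struct { uint32_t ext_page, hub_page; int valid; } tlb_entry_t;
extern tlb_entry_t tlb[TLB_ENTRIES];      /* same table the lookup sketch uses */

/* Placeholder for the SDRAM/flash driver that copies one page into hub RAM. */
extern void xmem_read_page(uint32_t ext_page, void *hub_dst);

static _Alignas(PAGE_SIZE) uint8_t pool[HUB_POOL][PAGE_SIZE];
static uint32_t next_victim;              /* trivial round-robin replacement   */

/* Entered from the address-fault trap with the faulting external address;
 * once it returns, the faulting instruction fetch or hub access is simply
 * restarted and now hits in the TLB. */
void tlb_miss(uint32_t fault_addr)
{
    uint32_t page   = fault_addr >> PAGE_SHIFT;
    uint32_t victim = next_victim++ % HUB_POOL;
    uint32_t victim_hub = (uint32_t)((uintptr_t)pool[victim] >> PAGE_SHIFT);

    /* Invalidate any stale entry still pointing at the victim hub page. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].hub_page == victim_hub)
            tlb[i].valid = 0;

    xmem_read_page(page, pool[victim]);   /* bring the page into hub RAM */

    tlb_entry_t *e = &tlb[page % TLB_ENTRIES];
    e->ext_page = page;
    e->hub_page = victim_hub;
    e->valid    = 1;
}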
You could write code specifically so that an overlay mechanism is offloaded to another cog, and this would work on the P2.
While a P3 with external hub might be a good version, it would be competing with ARM again. Why go there?
Anyway, with what we have in the P2, I think we won't need to worry about a P3 for quite some time.
Another gotcha: general-purpose IOs trying to meet 100 MHz in a 180 nm process would be a challenge. Many embedded controllers in 140 nm only spec to 80 MHz, and those use custom pads designed for high speed.
DDR and DDR2 are a totally different animal, with low-voltage signalling pins that are really hobbyist-unfriendly because most won't handle voltages above 2.5 V.
So I think this is all best revisited for a P3 in a couple years.
However, I do see a P2 "project" that could sell reasonably well to hobbyists/schools, where P2 software can be developed entirely on the P2. IMHO this would make a better RPi and a better Arduino platform, but it would require critical mass. I intend to build such an animal, and I am certain I am not alone (Bill & others, and even Chip when/if he gets time). The latest round of improvements has given this prospect one "huge" boost.
Another use may just be interaction code. That doesn't need to be very fast, but the assets and all the code needed to handle a UI consume a lot of room when making things friendly rather than bare-bones. Any speed we can get easily is worth it.
Maybe we are talking about different solutions.
Anything in software does not need Silicon changes.
The idea behind my simplest Execute In Place (XIP) is to need no software at all, and only dumb HW, in the most basic variant.
It uses Hardware to shift the bits, and leaves the words to SW.
It should be able to use the improved (coming?) SerDes for the Pin-level interface.
The CPU simply fetches an opcode; the added HW checks where the code is located, and if it is off-chip, it generates the address burst for the QuadSPI Flash and then reads the 32 bits.
The CPU waits until the opcode is ready.
Single-line waits could avoid the re-fetch, and so save power and reduce EMI.
The Octal Word command in Winbond Flash allows a variant of this, where 16-byte blocks are fetched and the HW serves up the opcodes contained within those 16 bytes.
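A small C model of that variant, mostly to show how little decision logic is involved; qspi_read_block() stands in for the SerDes-driven Octal Word read, and every name here is illustrative:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define HUB_TOP    0x40000u        /* below this the code is on-chip       */
#define BLOCK_SIZE 16u             /* one Octal Word = 4 opcodes           */

/* Placeholder for the SerDes-driven 16-byte flash read (e.g. Winbond E3h). */
extern void qspi_read_block(uint32_t flash_addr, uint8_t buf[BLOCK_SIZE]);

static uint8_t  block[BLOCK_SIZE]; /* the 16-byte buffer                   */
static uint32_t block_base = ~0u;  /* flash address the buffer holds       */

/* Model of the added fetch logic: on-chip addresses go to hub as usual;
 * off-chip addresses either hit the buffer or trigger a new block read,
 * with the (low time-slice) thread simply waiting for the data.
 * Assumes pc is long-aligned. */
uint32_t fetch_opcode(uint32_t pc, const uint32_t *hub)
{
    if (pc < HUB_TOP)
        return hub[pc >> 2];                          /* normal hub fetch     */

    uint32_t base = pc & ~(BLOCK_SIZE - 1);
    if (base != block_base) {                         /* the compare decision */
        qspi_read_block(base, block);                 /* CPU waits here       */
        block_base = base;
    }

    uint32_t op;
    memcpy(&op, &block[pc & (BLOCK_SIZE - 1)], sizeof op);  /* buffer hit */
    return op;
}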
I think you are indeed out on a limb - one which the rest of the industry is already in the process of sawing off.
Not only do you have the Raspberry Pi, which already has an awesome price/performance ratio compared to the P2, but then along come things like this, based on the 400 MHz Quark.
The P2 will struggle to find a niche against such high-powered alternatives. Its best chance would be to retain the simplicity of programming that the P1 had (compared to other similar microcontrollers) and thereby appeal strongly to the education and hobbyist markets. Unfortunately, I don't see this being a feature of the P2, which seems to increase in programming complexity every time I revisit these forums.
Ross.
It seems Intel is finally catching up with ARM in this space. There have been ARM devices running Linux in an SD-card form factor with WiFi for ages.
http://www.transcend-info.com/products/Catlist.asp?FldNo=24
I agree, trying to morph the Prop into a general-purpose machine like the ARM or now the little Intels can never be a winning strategy. And yes, the loss of simplicity in the P2 worries me too.