Going out on a limb

David Betz · 2014-01-04 17:42

Hub execution has solved the problem of being limited to 2K of COG memory for PASM programs. How about tackling how to handle programs and data that are bigger than 256K now? It's been suggested that SDRAM might be a common feature on P2 boards so it would be nice to be able to make use of it for code and data rather than just frame buffers. This could, of course, be done with XMM but now that we have hub execution the penalty for going to XMM is going to be huge. I wonder if it might be possible to add a simple TLB on top of the hub execution model to translate any address beyond 256K to a hub address? The TLB could probably be fairly small and still have a huge effect on the performance of external code. The main catch to this is that you'd have to have some sort of address fault trap that branched to a known location, probably in COG memory, when an external address was not found in the TLB. You'd also need a way to restart the instruction that caused the fault. It seems to me that the only instructions that could cause faults are instruction fetches and hub read/write instructions. Any other instruction would not need to be restartable. I suppose this would have to be a P3 feature but it might not be all that difficult to implement and would essentially give unlimited code and data space with much better performance than XMM.
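To make the lookup concrete, here is a minimal sketch in Python (for illustration only, not PASM or real hardware). The page size, the `AddressFault` exception, and the dictionary used as a TLB are all assumptions made for the example; only the 256K hub boundary comes from the post.

```python
HUB_SIZE = 256 * 1024   # bytes of hub RAM, from the post
PAGE_SIZE = 1024        # assumed page size, not specified in the thread


class AddressFault(Exception):
    """Hypothetical stand-in for the hardware address fault trap."""

    def __init__(self, address):
        super().__init__(f"no translation for {address:#x}")
        self.address = address


def translate(tlb, address):
    """Return the hub address to use for `address`.

    `tlb` maps an external page number to the hub base address
    where a copy of that page currently lives.
    """
    if address < HUB_SIZE:
        return address                   # ordinary hub address, no lookup needed
    page, offset = divmod(address, PAGE_SIZE)
    if page not in tlb:
        raise AddressFault(address)      # would branch to the trap handler
    return tlb[page] + offset


# External page 0x100 (addresses 0x40000..0x403FF) is resident at hub 0x3F000.
tlb = {0x100: 0x3F000}
```

With this table, `translate(tlb, 0x40010)` yields hub address `0x3F010`, while an address in a page that is not resident raises the fault.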

(I wonder how long it will take for Ken's hit men to arrive at my house?)

jmg · 2014-01-04 17:56

SDRAM is never going to quite be transparent, so there will always be a speed penalty crossing that threshold.

What might be possible is a simple wait mechanism for address off chip, which would rely on a tiny SDRAM loader to signal ready.
That is a comparator, and a tiny state engine, and does not try to do too much.

Advantage of this is it makes future devices compatible, the change would be wait times would become less with silicon support/different DRAMs. The loader stub wait could vary with size allocated to any cache

David Betz · 2014-01-04 17:59

jmg wrote: »

SDRAM is never going to quite be transparent, so there will always be a speed penalty crossing that threshold.

What might be possible is a simple wait mechanism for address off chip, which would rely on a tiny SDRAM loader to signal ready.
That is a comparator, and a tiny state engine, and does not try to do too much.

Advantage of this is it makes future devices compatible, the change would be wait times would become less with silicon support/different DRAMs. The loader stub wait could vary with size allocated to any cache

That would work but could be terribly slow. If you could pull in an entire page on a TLB fault then you could run at hub speeds until you crossed over onto a different page. I guess it depends on how much your program branches.
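A rough way to see how much branching matters is to count page faults for an address trace. This is only a sketch in Python: the page size, the LRU replacement policy, and the two example traces are assumptions for illustration, not anything specified for the P2.

```python
from collections import OrderedDict

PAGE_SIZE = 1024  # assumed page size


def count_faults(trace, tlb_entries):
    """Count TLB misses for a sequence of addresses, using LRU replacement."""
    resident = OrderedDict()
    faults = 0
    for address in trace:
        page = address // PAGE_SIZE
        if page in resident:
            resident.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            resident[page] = True
            if len(resident) > tlb_entries:
                resident.popitem(last=False)  # evict least recently used page
    return faults


# Straight-line code through four pages, one long (4 bytes) per instruction.
straight = range(0, 4 * PAGE_SIZE, 4)
# Code that bounces between two distant pages 1000 times.
ping_pong = [0, 8 * PAGE_SIZE] * 1000
```

Straight-line code faults once per page. The ping-pong trace faults only twice when the TLB holds both pages, but faults on every access when it holds just one.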

David Betz · 2014-01-04 18:03

jmg wrote: »

SDRAM is never going to quite be transparent, so there will always be a speed penalty crossing that threshold.

What might be possible is a simple wait mechanism for address off chip, which would rely on a tiny SDRAM loader to signal ready.
That is a comparator, and a tiny state engine, and does not try to do too much.

Advantage of this is it makes future devices compatible, the change would be wait times would become less with silicon support/different DRAMs. The loader stub wait could vary with size allocated to any cache

Actually though, the idea of stalling is good. Maybe if another COG was responsible for updating the TLB then the COG that hit the address fault could just stall waiting for a TLB hit. Then you wouldn't need restartable instructions or traps.

David Betz · 2014-01-04 18:33

David Betz wrote: »

Actually though, the idea of stalling is good. Maybe if another COG was responsible for updating the TLB then the COG that hit the address fault could just stall waiting for a TLB hit. Then you wouldn't need restartable instructions or traps.

Well, you wouldn't need traps but there would need to be a way for the COG that generates the address fault to let the other COG know what address was needed and the other COG would also need to be able to write to the TLB of the COG encountering the fault. I guess this is beginning to sound complicated after all. The trap might actually be easier since it would all be contained in a single COG.

jmg · 2014-01-04 19:03

David Betz wrote: »

Actually though, the idea of stalling is good. Maybe if another COG was responsible for updating the TLB then the COG that hit the address fault could just stall waiting for a TLB hit. Then you wouldn't need restartable instructions or traps.

If you try to do this across COGS, you lose determinism, as the SDRAM re-issue-address is a significant time.
I think some SW assist will be needed here and to keep things simple some mix of HW and SW will be best.

Also, somewhere in all this, someone needs to manage SDRAM refresh too..., or at least yield to any Auto-refresh.

Cypress do show new SRAMs so maybe off chip fetch could be simpler and more granular without SDRAM ?
(tho there is a price penalty )

David Betz · 2014-01-04 19:42

jmg wrote: »

If you try to do this across COGS, you lose determinism, as the SDRAM re-issue-address is a significant time.

I think any external memory solution will lose determinism. In fact, even hub execution loses determinism because you can't predict cache hits and misses. If you want deterministic code, you'll have to use COG-resident code or at least hub code that fits in a single cache line.

jmg wrote: »

I think some SW assist will be needed here and to keep things simple some mix of HW and SW will be best.

That is what I was trying to propose. The HW does nothing but check the TLB to see if there is a hub/external translation present for the address being fetched. If there isn't, it traps off to COG code that loads a page from external memory using Chip's SDRAM driver and updates the TLB before resuming the instruction that caused the address fault. This is a combination hardware/software solution. The hardware required would be similar to the cache tag lookup hardware that Chip is already planning on adding to support hub execution.

jmg wrote: »

Also, somewhere in all this, someone needs to manage SDRAM refresh too..., or at least yield to any Auto-refresh.

Chip's SDRAM driver will handle that.

jmg wrote: »

Cypress do show new SRAMs so maybe off chip fetch could be simpler and more granular without SDRAM ?
(tho there is a price penalty )

Simpler because they don't have to be refreshed? What do you mean by "granular"?

MJB · 2014-01-05 08:15

David Betz wrote: »

Actually though, the idea of stalling is good. Maybe if another COG was responsible for updating the TLB then the COG that hit the address fault could just stall waiting for a TLB hit. Then you wouldn't need restartable instructions or traps.

Wouldn't it be possible to save the extra COG and instead use a second thread/task to manage the TLB while the task/thread with the page fault is waiting?

David Betz · 2014-01-05 08:58

MJB wrote: »

Wouldn't it be possible to save the extra COG and instead use a second thread/task to manage the TLB while the task/thread with the page fault is waiting?

Yes, that would probably be better than another COG because a different task in the current COG would presumably be able to update the TLB directly. In fact, I guess that could eliminate the need for a trap as well. I can imagine a task that is idle waiting for a TLB fault to happen and gets started just as the task that caused the fault is stalled waiting for a TLB reload.

Bill Henning · 2014-01-06 13:25

I think adding hardware support for running directly out of SDRAM is a P3 thing

The classic way is a full MMU, traps and all. Maybe after using the upcoming hubexec loadable something other than an MMU (or segment registers) will become viable.

For the moment, I see the speed/size hierarchy as follows:

1) cog-mode pasm = smallest programs, fastest code

2) hubexec pasm = medium size programs, pretty fast code

3) SDRAM XMM = HUGE programs, ok speed code

I can potentially see many alternate approaches that may be faster than the cache cog - but only time will tell.

(Sorry for the late comment, wifey and I have been fighting the flu)

David Betz · 2014-01-06 13:39

Bill Henning wrote: »

I think adding hardware support for running directly out of SDRAM is a P3 thing

The classic way is a full MMU, traps and all. Maybe after using the upcoming hubexec loadable something other than an MMU (or segment registers) will become viable.

For the moment, I see the speed/size hierarchy as follows:

1) cog-mode pasm = smallest programs, fastest code

2) hubexec pasm = medium size programs, pretty fast code

3) SDRAM XMM = HUGE programs, ok speed code

I can potentially see many alternate approaches that may be faster than the cache cog - but only time will tell.

(Sorry for the late comment, wifey and I have been fighting the flu)

Essentially, a TLB is the least hardware you can add to support executing from external memory. You could use it to implement an MMU in software if you really wanted to do that but I'm not convinced it is necessary. One reason I mentioned it is that even with 256k of hub memory, P2 is still pretty wimpy compared to other modern MCUs that often have 512k of flash and 64k or more of SRAM. And that flash can be used more efficiently than hub memory because the ARM chip supports a "thumb" instruction set where each instruction is only 16 bits. This is like Eric's CMM instruction set but implemented in hardware. Another alternative that I suggested to Chip a while back would be to just implement CMM in hardware. That would have the advantage that it would be usable in hub execution mode, essentially doubling the amount of available program space.

As you say though, these ideas are only viable in a P3 timeframe.

Also, sorry to hear you had the flu. I'm glad you've recovered!

jmg · 2014-01-06 14:55

There is also execute in place on FLASH memory, which probably has simpler silicon support.
It may be able to share some of the HUB wide fetch logic, but even a very basic Quad-random 32 bit read, could be :

FastReadOctalWord = 12 clks+8clks for 32 bits, so 70MHz QSPI can feed opcodes at 285ns, no queue at all
Every opcode generates a new address and read, so this is slow-and-simple - there are no decisions or checks, every opcode generates a new address and gets new data.
It would suit a lowest-time-slice thread, leaving most of the COG power for other tasks.

FastReadOctalWord looks to fetch 16 bytes, or 4 opcodes, on a block basis, and can do that in a peak rate of 143ns/opcode.

Silicon support needed is a 16 byte buffer, and a compare decision.
Not all opcodes re-read, and the logic needs to choose buffer-rd or Flash read
Worst case would be a short jump between 2 blocks, which would run at 571ns/opcode
Best case would be a loop of up to 4 opcodes, that can stay in the buffer, & that could actually spin well under 143ns
(ie run at whatever rate the threads and fetch paths can deliver on-chip opcodes )

The Winbond data sheet seems unclear on whether 16, 32, or 48 byte reads are possible, or only 16.
Some small octal-align steps in the compiler could improve the speeds.
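The arithmetic behind these figures can be checked with a short Python sketch. The 20-clock count for a single 32-bit read and the 70 MHz clock come from the post. The 40 clocks per 16-byte block is inferred from the quoted 143 ns figure rather than taken from a data sheet, so treat it as an assumption.

```python
def ns_per_opcode(clocks, opcodes, clock_hz=70e6):
    """Average time per opcode when `clocks` flash clocks deliver `opcodes` opcodes."""
    return clocks / clock_hz * 1e9 / opcodes


single = ns_per_opcode(12 + 8, 1)  # one 32-bit read per opcode, about 285.7 ns
block = ns_per_opcode(40, 4)       # 16-byte block, all 4 opcodes used, about 142.9 ns
worst = ns_per_opcode(40, 1)       # 16-byte block, only 1 opcode used, about 571.4 ns

print(f"single read: {single:.1f} ns, about {1e3 / single:.1f}M opcodes/s")
print(f"block read:  {block:.1f} ns best case, {worst:.1f} ns worst case")
```

The worst case is simply the whole block fetch charged to a single opcode, which is what happens on a short jump back and forth between two blocks.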

potatohead · 2014-01-06 15:32

Aren't HUBEX programs limited to 64K? Thought I saw that but can't find the reference.

jazzed · 2014-01-06 15:50

16bit program counter (16 bits << 2) = 64K longs = 256KB hub ram.

David Betz · 2014-01-06 15:51

potatohead wrote: »

Aren't HUBEX programs limited to 64K? Thought I saw that but can't find the reference.

They're limited to 64k longs which is 256k bytes. I had forgotten about that. You'd have to use the BIG instruction or whatever Chip called it to get a full 32 bit address in external memory mode.

David Betz · 2014-01-06 15:56

jmg wrote: »

There is also execute in place on FLASH memory, which probably has simpler silicon support.

I think the TLB would be simpler since all it does is translate external addresses to hub addresses and then executes the instruction identically to a hub-resident instruction. However, I neglected to mention that the PC would have to be increased to 30 bits. I guess the translation might be in the critical path though so maybe this would limit clock speed.
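For reference, the address reach for each PC width works out as follows. This is a small Python sketch that assumes the PC counts 4-byte longs, as described earlier in the thread; the function name is made up for the example.

```python
LONG_BYTES = 4  # the PC addresses longs, so each count is 4 bytes


def addressable_bytes(pc_bits):
    """Bytes reachable by a program counter of `pc_bits` bits that counts longs."""
    return (1 << pc_bits) * LONG_BYTES


hub_reach = addressable_bytes(16)       # 64K longs = 256 KB, the current limit
external_reach = addressable_bytes(30)  # 2**32 bytes = 4 GiB with a 30-bit PC
```

So a 30-bit long-addressed PC covers the same range as a full 32-bit byte address.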

jmg · 2014-01-06 16:27

David Betz wrote: »

I think the TLB would be simpler since all it does is translate external addresses to hub addresses and then executes the instruction identically to a hub-resident instruction. However, I neglected to mention that the PC would have to be increased to 30 bits. I guess the translation might be in the critical path though so maybe this would limit clock speed.

But you still need to fetch external code, and the storage, and decisions when to do so. That makes it a P3 problem ?

In the simplest form, the Flash XIP needs none of that - it simply outputs a new address for each opcode.
Very simple in HW, and moderate but predictable speed results.

Most of the HW will come for free from the better SerDes anyway.
What's added is effectively a 'fetch opcodes via SerDes' handler state engine.

By doing the simple address-packaging in silicon, you can use most of the COG for real work, and feed a low-resource thread the XIP native opcodes ~ 3.5M OPS is ok for many tasks, and it could call COG-resident libraries, for inner-loop speed, if really needed.

The flash can accept 24 bit Address, so long jumps would be by @REG, and I think Chip was making all others relative so code in XIP flash, would be the same as code in HUB.

David Betz · 2014-01-06 16:31

jmg wrote: »

But you still need to fetch external code, and the storage, and decisions when to do so. That makes it a P3 problem ?

I guess I wasn't clear. The address fault code is what fetches data from external memory. The pipeline doesn't need to know anything about that. The TLB is used to translate an external address (say one with bit 31 set) to a hub address. If that translation fails the core generates a trap or starts another task. It is the job of the trap handler or other task to load the page containing that address from external memory. It then updates the TLB to include a translation for that page and resumes execution of the code that generated the trap. The core itself knows nothing about external memory or how to read/write it. That is all handled in software.

http://en.wikipedia.org/wiki/Translation_lookaside_buffer

I'm suggesting that what they call the "page table walk" be handled entirely in software.
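Roughly, the flow would look like the following Python simulation. It is only a sketch of the idea under stated assumptions: the `Core` class, the page size, the FIFO replacement, and the read-only treatment of pages are all invented for illustration. The only detail taken from the description above is marking external addresses with bit 31.

```python
PAGE_SIZE = 1024          # assumed page size
EXTERNAL_FLAG = 1 << 31   # external addresses have bit 31 set


class Core:
    """Toy model: a hardware TLB check plus a software fault handler."""

    def __init__(self, external_memory, slots):
        # `external_memory` stands in for SDRAM; its length is assumed
        # to be a whole number of pages.
        self.external = external_memory
        self.hub = bytearray(slots * PAGE_SIZE)  # hub area reserved for pages
        self.tlb = {}                            # external page -> hub slot
        self.free = list(range(slots))
        self.faults = 0

    def handle_fault(self, page):
        """Software side: load the page into hub memory and update the TLB."""
        self.faults += 1
        if self.free:
            slot = self.free.pop()
        else:
            # Evict the oldest entry (dicts keep insertion order, so this is FIFO).
            victim, slot = next(iter(self.tlb.items()))
            del self.tlb[victim]
        start = page * PAGE_SIZE
        base = slot * PAGE_SIZE
        self.hub[base:base + PAGE_SIZE] = self.external[start:start + PAGE_SIZE]
        self.tlb[page] = slot

    def read_byte(self, address):
        """Hardware side: translate, trapping to software on a miss."""
        if not address & EXTERNAL_FLAG:
            raise ValueError("this sketch only models external addresses")
        page, offset = divmod(address & ~EXTERNAL_FLAG, PAGE_SIZE)
        while page not in self.tlb:
            self.handle_fault(page)  # trap, then retry the same access
        return self.hub[self.tlb[page] * PAGE_SIZE + offset]


external = bytes(i % 251 for i in range(4 * PAGE_SIZE))
core = Core(external, slots=2)
```

The `read_byte` path never touches external memory itself; only `handle_fault` does, which is the point of keeping the page load in software.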

potatohead · 2014-01-06 16:48

Thanks guys. I missed the long addressing.

Cluso99 · 2014-01-06 17:09

Now we have 256KB hub and HUBEXEC, seems to me that XMM would mainly be competing with ARM style processors, not microcontrollers.
You could specifically code where you can offload an overlaying mechanism to another cog, and this would work for P2.

While a P3 with external hub might be a good version, it would be competing with ARM again. Why go there?

Anyway, with what we have in P2 I think we won't need to worry about P3 for quite some time

brucee · 2014-01-06 17:11

I am going to saw off that limb I think. I doubt a 200 MHz COG could actually control an SDRAM at 100MHz, and might actually have trouble bit banging a 50 MHz rate. It has to toggle the clock, and generate commands and addresses and data for the SDRAM for each of those clocks. I think you might need a little hardware help to do all this. I assume in any case one COG would be dedicated to handling requests for potentially more than one other COG ??

Another gotcha: getting general purpose IOs to meet 100 MHz in a 180nm process would be a challenge. Many embedded controllers in 140 nm only spec to 80 MHz, and these use custom pads designed for the high speed.

DDR and DDR2 are a totally different animal with LVDS pins that are really hobbyist unfriendly, because most won't handle voltages above 2.5V

So I think this is all best revisited for a P3 in a couple years.

David Betz · 2014-01-06 20:04

brucee wrote: »

I am going to saw off that limb I think. I doubt a 200 MHz COG could actually control an SDRAM at 100MHz, and might actually have trouble bit banging a 50 MHz rate. It has to toggle the clock, and generate commands and addresses and data for the SDRAM for each of those clocks. I think you might need a little hardware help to do all this. I assume in any case one COG would be dedicated to handling requests for potentially more than one other COG ??

Another gotcha: getting general purpose IOs to meet 100 MHz in a 180nm process would be a challenge. Many embedded controllers in 140 nm only spec to 80 MHz, and these use custom pads designed for the high speed.

DDR and DDR2 are a totally different animal with LVDS pins that are really hobbyist unfriendly, because most won't handle voltages above 2.5V

So I think this is all best revisited for a P3 in a couple years.

Why does it need to be able to clock the SDRAM at 100MHz to implement my proposed TLB scheme for executing code from external memory?

David Betz · 2014-01-06 20:06

Cluso99 wrote: »

Now we have 256KB hub and HUBEXEC, seems to me that XMM would mainly be competing with ARM style processors, not microcontrollers.
You could specifically code where you can offload an overlaying mechanism to another cog, and this would work for P2.

While a P3 with external hub might be a good version, it would be competing with ARM again. Why go there?

Anyway, with what we have in P2 I think we won't need to worry about P3 for quite some time

ARM processors are being used as microcontrollers right now and my guess is that they are cheaper than what P2 will cost. In fact, they are even starting to replace 8 bit MCUs. I'm not saying that the P2 doesn't have big advantages over these microcontroller ARM chips in some ways but I am saying that it doesn't have the code capacity of some of the larger ones.

Cluso99 · 2014-01-06 21:35

David: I guess I don't see "commercial" apps for the P2 that will require MBs of Hubexec speed code. I see SDRAM for video, and perhaps slower overlays, but not critical speed apps/drivers.

However, I do see a P2 "project" that could sell reasonably well to hobbyists/schools where P2 software can be developed totally on the P2. IMHO this would make a better RPi and a better Arduino platform, but it would require critical mass. I intend to build such an animal, and I am certain I am not alone (Bill & others, and even Chip when/if he gets time). The latest round of improvements has given this prospect one "huge" boost.

potatohead · 2014-01-06 21:43

CNC could be one commercial app requiring a large data and code space.

Another may just be interaction code. That doesn't need to be very fast, but assets and lots of code to handle the UI consume a lot of room making things friendly, not so barebones. Any speed we can get easy is worth it.

jmg · 2014-01-06 21:54

David Betz wrote: »

The core itself knows nothing about external memory or how to read/write it. That is all handled in software.

Maybe we are talking about different solutions.
Anything in software does not need Silicon changes.
The idea behind my simplest Execute in Place, (XIP) is to need no software at all, & dumb HW, in the most basic variant.
It uses Hardware to shift the bits, and leaves the words to SW.
It should be able to use the improved (coming?) SerDes for the Pin-level interface.

The CPU simply fetches an opcode, & the added HW checks where the code is located, and if off-chip, then it generates the address burst for QuadSPI Flash, and then reads the 32 bits.
The CPU waits until the opcode is ready.

Single-line waits could avoid the re-fetch, and so save power & reduce EMC.

The Octal Word command in Winbond Flash, allows a variant of this, where 16 byte blocks are fetched, and the HW gets the opcodes contained within that 16 bytes.

David Betz · 2014-01-07 03:17

jmg wrote: »

Maybe we are talking about different solutions.
Anything in software does not need Silicon changes.

You're not understanding what I'm proposing. I'm suggesting a hybrid hardware/software solution. The hardware is used to allow full speed hub execution when the PC points to an external address that is currently present in hub memory. The core knows this because when it sees an external address it looks it up in a hardware TLB. If it finds the page in the TLB then it is already present in hub memory so execution proceeds with no software intervention. However, if the page is not found in the TLB, the core issues a trap that causes software to be run that will read the page from external memory into hub memory and update the TLB to reflect that change. After the software is done resolving the address fault, it tells the core to resume execution of the code that caused the address fault. This can't be done entirely in software because it would require an LMM-like software execution loop which would slow down execution significantly. The TLB hardware is a fairly low-cost way to allow full speed execution from external addresses. The problem with XIP is that it requires that the core know how to fetch data from external memory. My solution does not.

RossH · 2014-01-07 03:59

Hi David,

I think you are indeed out on a limb - one which the rest of the industry is already in the process of sawing off.

Not only do you have the Raspberry Pi, which already has an awesome price/performance ratio compared to the P2, but then along come things like this based on the 400 MHz Quark.

The P2 will struggle to find a niche against such high-powered alternatives. Its best chance would be to retain the simplicity of programming that the P1 had (compared to other comparable micro-controllers) and thereby appeal strongly to the education and hobbyist markets. Unfortunately, I don't see this being a feature of the P2, which seems to increase in programming complexity every time I revisit these forums.

Ross.

Heater. · 2014-01-07 05:33

RossH,

Seems Intel is catching up with ARM in this space finally. There has been ARM running Linux in an SD card form factor with WIFI for ages.
http://www.transcend-info.com/products/Catlist.asp?FldNo=24

I agree, trying to morph the Prop into a general purpose machine like the ARM or now the little Intels can never be a winning strategy. And yes the loss of simplicity in the P2 worries me too.

David Betz · 2014-01-07 05:51

RossH wrote: »

Hi David,

I think you are indeed out on a limb - one which the rest of the industry is already in the process of sawing off.

Not only do you have the Raspberry Pi, which already has an awesome price/performance ratio compared to the P2, but then along come things like this based on the 400 MHz Quark.

The P2 will struggle to find a niche against such high-powered alternatives. Its best chance would be to retain the simplicity of programming that the P1 had (compared to other comparable micro-controllers) and thereby appeal strongly to the education and hobbyist markets. Unfortunately, I don't see this being a feature of the P2, which seems to increase in programming complexity every time I revisit these forums.

Ross.

I'm not trying to morph P2/P3 into something that can run Linux. I didn't suggest an MMU although I think one could be done with the TLB hardware I am proposing. I'm just trying to find a way to efficiently execute code from external memory with minimal hardware support. As I mentioned before, I think if you compare how much compiled code can fit in 256k of hub memory on the P2 with what will fit in 512k of flash memory on an ARM Cortex M series chip you'll find that we have far less code space. I was just trying to address that, nothing more. Anyway, since this is a feature that would appear in P3 if at all, we don't really know how much hub RAM we can expect in P3 so the discussion may be premature.

David Betz · 2014-01-07 06:16

Okay, in thinking about this some more I'm not sure that executing code from external memory with hardware assist really makes sense for the P2/P3/Px. What would probably be better for making more efficient use of on-chip memory would be some sort of CMM-like instruction set that could pack code more efficiently into hub memory. This again is probably a P3 feature but may fit better into the use cases anticipated for the Propeller than a TLB. We might also want to consider abandoning XMM for P2 since it has enough hub memory for significant microcontroller applications and we don't want to try to make the P2 into a general purpose computer so there may not be a need for larger programs than will fit in 256k bytes. It made sense to implement XMM for P1 because 32k bytes is very constraining. That argument can't be made for P2 with its much larger hub memory.
