Going out on a limb

Heater. · 2014-01-07 07:54

David,

We might also want to consider abandoning XMM for P2 since it has enough hub memory for significant microcontroller applications...

Except...Chip seems to be intent on shipping a P2 on a module with 32MB of SDRAM. So perhaps XMM should be retained for use with that. I have no idea what the performance would be or if it would be of practical use for many. I will want room for that Espruino JS engine though!

David Betz · 2014-01-07 08:47

Heater. wrote: »

David,

Except...Chip seems to be intent on shipping a P2 on a module with 32MB of SDRAM. So perhaps XMM should be retained for use with that. I have no idea what the performance would be or if it would be of practical use for many. I will want room for that Espruino JS engine though!

I guess my concern is that the difference between hub execution performance and XMM performance on P2 will be a lot bigger than the difference between LMM and XMM performance on P1. That was my motivation for suggesting the TLB idea.

Tor · 2014-01-07 08:48

Well.. I for one could have used a P2/P3 with a TLB.. I've been planning to emulate a specific type of minicomputer when the P2 comes out, and a TLB would have been handy. Combined with the typical Propeller features. I don't want to use a Linux-based ARM or Intel, I've already written an emulator for that platform and I'm looking for something different.

-Tor

jmg · 2014-01-07 14:41

David Betz wrote: »

The problem with XIP is that it requires that the core know how to fetch data from external memory. My solution does not.

The two are not mutually exclusive. Somewhere in the TLB, it needs to choose not only HUB/Off chip but it also needs to know which off chip to fetch from.

Provided the silicon can do the lowest level IO without SW, like Send Address : Read_32 or ReadOctW on QuadSPI Flash, and a similar operation on SDRAM (feed HW address, and HW does the SDRAM cycles to fetch 32 or more bits )

The absolute minimal form of this would be an Address trap that can hand over a memory read request to SW, if it is outside the on-chip Address range.

The requesting thread stalls until the opcode is delivered some cycles later, and the Trap handling thread, slumbers again, waiting on a TrapEvent

That at least makes SW future device proof.

jmg · 2014-01-07 14:50

David Betz wrote: »

.. We might also want to consider abandoning XMM for P2 since it has enough hub memory for significant microcontroller applications and we don't want to try to make the P2 into a general purpose computer so there may not be a need for larger programs than will fit in 256k bytes. It made sense to implement XMM for P1 because 32k bytes is very constraining. That argument can't be made for P2 with its much larger hub memory.

I'm not sure many would consider 32 -> 256 'much larger', especially given you can now pack many more threads into a COG.

Also, besides CODE reading, there are large-info cases like FONTS, where fast and easy access to external memory is very useful, and for easy handling there, it is good if the tools can manage larger addresses (even if just on Data).

Once all that is in place, the next natural step is to allow code to come from off-chip too, with as much bit-level silicon assist as can be practically provided.

David Betz · 2014-01-07 14:51

jmg wrote: »

The absolute minimal form of this would be an Address trap that can hand over a memory read request to SW, if it is outside the on-chip Address range.

That at least makes SW future device proof.

This is exactly what I am suggesting. The hardware notices that an address is not on chip. It first looks in the TLB to see if there is a translation for that address. If there is not, it traps to software that handles everything else. If there is a translation in the TLB then it uses the new address and goes on its merry way.
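The lookup described above can be sketched in C as a software model. This is only an illustration under stated assumptions, not a proposed hardware design: the 1 KB page size, the four-entry TLB, the round-robin replacement, and the names `translate` and `trap_handler` are all hypothetical. It also assumes that a translation maps an off-chip address to a hub buffer holding a copy of that page.

```c
#include <stdint.h>
#include <stdbool.h>

#define HUB_SIZE  (256u * 1024u)   /* on-chip hub memory size, per the thread */
#define PAGE_BITS 10u              /* assumed 1 KB pages (hypothetical) */
#define PAGE_SIZE (1u << PAGE_BITS)
#define TLB_SLOTS 4u               /* assumed TLB size (hypothetical) */

typedef struct {
    bool     valid;
    uint32_t vpage;     /* off-chip page number */
    uint32_t hub_base;  /* hub address of the buffer holding that page */
} tlb_entry;

static tlb_entry tlb[TLB_SLOTS];
static unsigned  next_victim;      /* round-robin replacement (assumption) */
static unsigned  trap_count;       /* how many times software was invoked */

/* Software side: runs only on a TLB miss. A real handler would first copy
   the page from external memory into the hub buffer; that is omitted here. */
static uint32_t trap_handler(uint32_t vpage)
{
    unsigned slot = next_victim;
    next_victim = (next_victim + 1u) % TLB_SLOTS;
    trap_count++;
    /* Each slot owns a fixed page-sized buffer at the top of hub memory. */
    uint32_t hub_base = HUB_SIZE - (slot + 1u) * PAGE_SIZE;
    tlb[slot] = (tlb_entry){ true, vpage, hub_base };
    return hub_base;
}

/* "Hardware" side: on-chip addresses pass through untouched; off-chip
   addresses are translated on a TLB hit and trap to software on a miss. */
uint32_t translate(uint32_t addr)
{
    if (addr < HUB_SIZE)
        return addr;
    uint32_t vpage  = addr >> PAGE_BITS;
    uint32_t offset = addr & (PAGE_SIZE - 1u);
    for (unsigned i = 0; i < TLB_SLOTS; i++)
        if (tlb[i].valid && tlb[i].vpage == vpage)
            return tlb[i].hub_base + offset;
    return trap_handler(vpage) + offset;
}
```

In this model the first access to an off-chip page costs one trap, and later accesses to the same page are translated without involving software, which is the "full speed most of the time" behaviour being argued for.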

jmg · 2014-01-07 14:55

David Betz wrote: »

This is exactly what I am suggesting. The hardware notices that an address is not on chip. It first looks in the TLB to see if there is a translation for that address. If there is not, it traps to software that handles everything else. If there is a translation in the TLB then it uses the new address and goes on its merry way.

The TLB is a superset - the Address Trap can skip the TLB step, and instead pass all off-chip addresses to the Software.
TLB is more powerful, but also more silicon.

David Betz · 2014-01-07 15:08

jmg wrote: »

The TLB is a superset - the Address Trap can skip the TLB step, and instead pass all off-chip addresses to the Software.
TLB is more powerful, but also more silicon.

I think that is too little hardware and not really worth implementing. It means every instruction that touches external memory generates a trap. The TLB is a compromise that lets external code run at full speed most of the time with very little hardware overhead.

ctwardell · 2014-01-07 15:31

I really don't want to see a silicon Rube Goldberg or Heath Robinson machine, but that really seems like where we are headed, if not already there.

The P1 was elegant, the P2 might still be, but sticking more and more patches on the COG concept to make it pretend it is a full on CPU seems to be getting out of hand.

I hate to say it, but I don't even know what the P2 is trying to target anymore.

C.W.

jmg · 2014-01-07 15:34

David Betz wrote: »

I think that is too little hardware and not really worth implementing. It means every instruction that touches external memory generates a trap.

That's up to Chip, but it is worthwhile detailing the trade offs for him.

Sure, 'Trap always' is lower bandwidth, but it is deterministic, and has very low system impact as well.
i.e. two COGs could run a QuadSPI each, and even launch identical code copies.

Once code goes into a local cache, everyone needs to agree on the rules and mapping of multiple copies.
A TLB design should have a Zero sized cache as one option, in which case it fully includes the other approach as well.

David Betz wrote: »

The TLB is a compromise that lets external code run at full speed most of the time with very little hardware overhead.

- but you still need to fetch that "full speed most of the time" into some local store, so this has a higher resource footprint, and needs more system level planning.

The ReadOctWord option in the Winbond QuadSPI parts, can be compatible with this locally cached approach.

Less clear to me is if you can continue clocking a ReadOctWord, for larger local-store fills ?

jmg · 2014-01-07 15:36

ctwardell wrote: »

I hate to say it, but I don't even know what the P2 is trying to target anymore.

It still targets what it did before, now it can also target more. You have not lost anything.

Seairth · 2014-01-07 15:46

jmg wrote: »

It still targets what it did before, now it can also target more. You have not lost anything.

I have to disagree. Something has been lost: simplicity. It is one of the best features of the P1. And it's increasingly missing from the P2 (in my opinion, of course).

David Betz · 2014-01-07 16:00

ctwardell wrote: »

I hate to say it, but I don't even know what the P2 is trying to target anymore.

C.W.

I'm not sure I ever understood it really. It seems like it will be a fun chip to program but I don't know enough about Parallax's target market to know if it has the right features. I assume Chip and Ken are worrying about that. I think they figure if they make an interesting chip then someone will find a use for it.

David Betz · 2014-01-07 16:09

jmg wrote: »

- but you still need to fetch that "full speed most of the time" into some local store, so this has a higher resource footprint, and needs more system level planning.

That part is handled by the cache logic that Chip is already adding to support hub execution.

David Betz · 2014-01-07 16:24

Seairth wrote: »

I have to disagree. Something has been lost: simplicity. It is one of the best features of the P1. And it's increasingly missing from the P2 (in my opinion, of course).

The main complexity that I'm a bit worried about is moving the Spin stack into AUX memory and the subsequent need to use the AUX[xxx] syntax for accessing pointers to variables on the stack. It removes the ability to do something I've seen done frequently in Spin programs, setting up parameters for COGINIT on the stack.

rod1963 · 2014-01-07 16:50

What is easy and fun for a guru level coder is different than what a noob or a prospective commercial evaluator of the P2 will see. A lot IMO will depend on the quality of the documentation and tools that Parallax supplies or fails to supply.

Hopefully Parallax will do a serious beta test of real silicon before an official launch of the P2 in order to accurately document the beast and make sure there are no gotchas with the development software.

jmg · 2014-01-07 17:04

David Betz wrote: »

That part is handled by the cache logic that Chip is already adding to support hub execution.

I'm not sure the HW used has quite the right aspect ratios/cycles, but we both agree if the Cache Silicon is sitting there, it certainly makes sense to be able to connect it to off chip memory, as well as to the Hub

jmg · 2014-01-07 17:06

Seairth wrote: »

I have to disagree. Something has been lost: simplicity. It is one of the best features of the P1. And it's increasingly missing from the P2 (in my opinion, of course).

That's easily solved with a subset data sheet.
It could even be called P2 for P1 users.

David Betz · 2014-01-07 17:09

jmg wrote: »

I'm not sure the HW used has quite the right aspect ratios/cycles, but we both agree if the Cache Silicon is sitting there, it certainly makes sense to be able to connect it to off chip memory, as well as to the Hub

I expect the cache logic and the TLB logic to be completely independent. The cache line size certainly doesn't have to match the page size. The idea is that the cache logic uses the address after it has been translated by the TLB logic. This is where my scheme may fail since it could lengthen a critical path and blow timing.

rogloh · 2014-01-07 20:10

In a separate post in another thread I had wondered earlier whether we could combine XMM with hub execution mode to gain performance benefits. That also started to lead me towards the idea of paging, detecting faults, etc., so I stopped discussing it: paging is a whole other can of worms and most likely couldn't and shouldn't make it into this next P2 Propeller.

However I am still contemplating whether it might be possible in software to create a good VM memory model somewhat like XMMC for those applications needing a larger program memory space. Ideally a model which could use external memory for code only and the internal hub memory for holding data and potentially now some hub execution code as well. This would be a hybrid between Harvard and Von Neumann architectures. It is basically the memory model that is supported on some ARM devices (e.g. STM32 family) with program code runnable from both flash and internal RAM, and program data stored in RAM only. On these devices code run from internal RAM can be run at full processor speed without wait states but extra wait states can be required if running code from flash, depending on clock speed.

Now on a future P2, given enough non-multiplexed address/data pins and with sub-10 ns, 32-bit-wide SRAM holding code only, I expect a VM can probably be made to run at up to 50 MIPS if Chip gets the device to hit 200 MHz. This is with a 4 cycle VM loop I envisaged repeating indefinitely using the instruction block repeater. That number is in the ballpark of some of these cheaper ARM microcontroller devices running from flash (as flash is typically slower than internal RAM). If we could get 50 MIPS peak when running from a larger external SRAM and then also hit 200 MIPS peak when running any specially nominated "fast" code in hub exec mode from internal hub RAM, we are really starting to reach decent performance numbers comparable to these ARM devices and it could also open up very large programs on P2 as well. Best of both worlds...
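The numbers above follow from simple arithmetic. The worked figures below assume the 200 MHz clock and the 4 cycle VM loop stated in the post, and additionally assume that the 200 MIPS hub exec peak means one instruction per clock:

```latex
\[
t_{\text{cycle}} = \frac{1}{200\ \text{MHz}} = 5\ \text{ns}, \qquad
t_{\text{loop}} = 4 \times 5\ \text{ns} = 20\ \text{ns}
\]
\[
\text{VM rate} = \frac{200\ \text{MHz}}{4\ \text{cycles/instruction}} = 50\ \text{MIPS}, \qquad
\text{hub exec peak} = \frac{200\ \text{MHz}}{1\ \text{cycle/instruction}} = 200\ \text{MIPS}
\]
```

A sub-10 ns SRAM access therefore fits inside the 20 ns loop period with two clock cycles or more to spare for issuing the address and latching the data.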

Now think about this: if hub memory is always used for the data space (ie. external program memory operates as read only), a VM running C code should basically be able to run most of the same compiled code in VM mode or in hub mode because all data lookups are the same using RDLONG/WRLONG instructions etc, the stack works the same way in both modes, and no special address range checks are required which would slow down the VM or prevent us using hub execution. The key thing that changes then are just all the branches/call instructions encoded in each mode. They are purely what determines what memory the code can be run from and how it can be called, just like COG mode vs hub mode code. All the internal VM registers would be exactly the same in both modes (apart from the PC) so this allows consistent argument passing between different functions executing from the two different memories.
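A toy model of this split, written in C, may make the point clearer. It is only a sketch under assumptions: the instruction encoding, the opcode names, and the `run` function are invented for illustration and do not correspond to any real P2 or PropGCC VM. Branches and calls are deliberately left out, since those are the instructions that would differ between the two modes.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcodes; the names only echo RDLONG/WRLONG from the post. */
enum { OP_HALT, OP_LOADI, OP_RDLONG, OP_WRLONG, OP_ADD };

typedef struct {
    uint8_t  op;   /* opcode */
    uint8_t  rd;   /* destination (or source, for OP_WRLONG) register */
    uint8_t  rs;   /* second source register, used by OP_ADD */
    uint16_t imm;  /* immediate value or hub long index */
} insn;

#define HUB_LONGS 64
static uint32_t hub[HUB_LONGS];    /* all program data lives in hub memory */

/* A program that could sit in external memory: it is only ever read. */
static const insn demo_prog[] = {
    { OP_RDLONG, 0, 0, 0 },        /* r0 = hub[0]   */
    { OP_LOADI,  1, 0, 7 },        /* r1 = 7        */
    { OP_ADD,    0, 1, 0 },        /* r0 = r0 + r1  */
    { OP_WRLONG, 0, 0, 1 },        /* hub[1] = r0   */
    { OP_HALT,   0, 0, 0 },
};

/* Fetch from whichever code memory is passed in; every data access goes
   to hub[], so no address range check is needed on loads and stores. */
uint32_t run(const insn *code, size_t len)
{
    uint32_t reg[4] = { 0 };
    size_t pc = 0;
    while (pc < len) {
        insn i = code[pc++];
        switch (i.op) {
        case OP_LOADI:  reg[i.rd & 3] = i.imm;                  break;
        case OP_RDLONG: reg[i.rd & 3] = hub[i.imm % HUB_LONGS]; break;
        case OP_WRLONG: hub[i.imm % HUB_LONGS] = reg[i.rd & 3]; break;
        case OP_ADD:    reg[i.rd & 3] += reg[i.rs & 3];         break;
        default:        return reg[0];   /* OP_HALT or unknown opcode */
        }
    }
    return reg[0];
}
```

Calling `run(demo_prog, 5)` with `hub[0]` set to 5 leaves 12 in `hub[1]`. The same call works unchanged if the instruction array is copied into a different memory, because only the fetch pointer depends on where the code lives.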

With the new jumps/calls Chip is adding for hub execution we now get a way to switch between VM execution mode and hub execution mode as we enter/exit the VM loop running in the COG. But we'd need a couple of extra things. We'd want a way to access look up table data stored in external program memory because this external memory wouldn't be directly accessible from any single real P2 instruction, only from a virtual instruction we'd have to invent. [Note that the Harvard based 8 bit AVR family has the same issue to resolve and uses special LPM instructions for this purpose along with special macros to get at data stored in program space. We'd need some P2 VM instruction equivalent for that.] Also we'd want some way to initially load up the external memory from some other store such as flash (like the SPM instruction on AVR). I'm sure there would be some other things to consider such as function pointer dereferences and any limitations, but you all hopefully get the gist of this idea.

Unfortunately an SDRAM based XMMC VM won't ever be as good as the non-multiplexed SRAM approach. The random read latency is the real killer there, and extra caching is required, which also means extra checking, and that rules out very fast execution. But with external SRAM things might work out well when you can afford the I/O pins and extra memory cost involved. Be nice to have a VM model that could take advantage of this.

Seairth · 2014-01-07 20:17

jmg wrote: »

That's easily solved with a subset data sheet.
It could even be called P2 for P1 users.

Just ignoring parts of the hardware/architecture doesn't make the hardware/architecture simpler. Further, at the point you've reduced the use of the P2 to what the P1 is capable of, what have you gained other than speed and more hub memory? And, obviously, there are new features that make the P2 unquestionably better while still keeping things P1-simple: ADC, DAC, CORDIC, more I/O pins, QUADs, etc. But, what you can't do right now (again, in my opinion) is say to someone "It's as simple to use as P1, but significantly more powerful." There are plenty of features that have edge cases, special side-effects, etc. that make the P2 non-simple (even if you choose to ignore those features).

David Betz · 2014-01-07 20:23

Seairth wrote: »

Further, at the point you've reduced the use of the P2 to what the P1 is capable of, what have you gained other than speed and more hub memory?

Actually, even just those two features would have made a P1+ chip appealing! :-)
