Another way to handle larger cog RAM is to increase the size of the program counter, and access cog locations beyond 511 using index registers. If the program counter were 10 bits wide instead of 9, programs could be 1024 instructions long. Instead of doing a jump as "jmp #label", you would do a jump as "jmp label_address", where the location at label_address would contain the address of "label". Code located within the first 512 cog locations could work as it does on P1, and code beyond the first 512 locations would need to use the indirect jump.
If we add some index registers we could access data locations beyond 511. I would suggest using the same method for indexing that P2 will have. Most variables could be defined within the first 512 longs of memory so that they could be accessed like they currently are in P1. Additional variables, arrays and tables could be stored in the high end of cog memory, and accessed through the index registers.
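A rough C model of the two jump flavors might look like this (a sketch only; the 10-bit width comes from the post above, and all names are made up for illustration, not real P1/P2 hardware):

    #include <stdint.h>

    #define COG_LONGS 1024
    static uint32_t cog[COG_LONGS];  /* cog RAM: one 32-bit long per location */
    static uint16_t pc;              /* 10-bit program counter */

    /* "jmp #label": direct jump with a 9-bit immediate, so it can only
       reach locations 0..511, just as on P1. */
    static void jmp_direct(uint16_t imm9)   { pc = imm9 & 0x1FF; }

    /* "jmp label_address": indirect jump; the long at label_address holds
       the full 10-bit target, so code beyond location 511 is reachable. */
    static void jmp_indirect(uint16_t addr) { pc = cog[addr] & 0x3FF; }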
Have to agree with Heater and Sapieha on bank switching. It can get really messy software-wise.
On the other hand, I like the index register idea, although not quite the way Dave Hein proposes it be done.
Increase the program counter size to 10 or more bits as required for the memory to be addressed.
Keep the first 512 locations general purpose, and allow any one of them to be used as an index register.
If the WC & WZ bits are not used for any jump instructions, as Bill Henning posted, can they be used to indicate an indirect jump? It may be nice to have more memory, but not being able to call subroutines may make it much less useful.
WC & WZ are not needed for any of the jump instructions; using them for two extra address bits would allow for 2048 cog locations (for jump destinations).
With 2048 cog locations, each of the four tasks could have 512 longs (minus SFRs).
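As a sketch of how those two extra bits could extend the 9-bit jump destination (bit positions borrowed from P1's ZCRI field, but only for illustration):

    #include <stdint.h>

    /* Extract an 11-bit jump destination (0..2047) from a 32-bit
       instruction, reusing the effect bits as high address bits. */
    static uint16_t jump_target(uint32_t ins) {
        uint16_t lo9 = ins & 0x1FF;       /* existing 9-bit S field   */
        uint16_t b9  = (ins >> 25) & 1;   /* WZ bit -> address bit 9  */
        uint16_t b10 = (ins >> 24) & 1;   /* WC bit -> address bit 10 */
        return (uint16_t)((b10 << 10) | (b9 << 9) | lo9);
    }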
Wouldn't you need to increase both the S and D addresses, increasing memory to 1K?
Reading the main SRAM takes as much time as a cog clock cycle, so we cannot speed it up even 2x.
Oh well.
Here's another thought, then (also probably worth ignoring). Now that the P2 has the ability to access external RAM, what if you got rid of the hub RAM altogether and instead figured out a way to extend the cog register space (maybe Bill's idea of using task remapping would work)? PORTD already provides some of the inter-cog communication that was typically done via the hub, and you could still leave a small "hub" for the mailboxes that have been discussed in the past. Beyond that, if large amounts of RAM are needed for data storage (and the expanded cog space wasn't enough), that's where the external RAM comes in.
Also, the COGINIT would always run the same startup ROM ("load my code from EEPROM") except that you would specify the EEPROM address (with POR being address zero). That way, you could have enormous codebases (much more than 128K) without having to juggle the loading process through the hub.
Okay, maybe this is more of a P3 idea than for P2. Fun to imagine nonetheless.
Reading the main SRAM takes as much time as a cog clock cycle, so we cannot speed it up even 2x.
Chip,
My point is that HUB access could be as fast as COG RAM access, and stuff like LMM could go away entirely.
If you wanted to, you could give say 4 COGs such super powers (on demand) and have 8 additional COGs that behave normally.
Any such "SuperCOG" could have 8 times faster main memory access and larger code space for business code and data. Also, you could implement a compressed instruction set or SPIN byte-code processor that has even better code footprint than current P1 ASM. Self-modifying code would no longer be used.
Additionally, each SuperCOG program size could be up to 128KB depending on how many "local registers" you wanted to allocate to ASM destination fields. This could also be determined at run time.
INA, OUTA, DIRA, etc would be global and follow the same rules as current COGs.
Sorry if such things are heresy, but you've already designed the perfect micro-controller core at least once.
Eventually it will be time to produce something bigger and faster IMHO. Faster could be a process shrink away on P1. However, there are limits to your TDM HUB idea that need to be overcome. Other CPU shops do it with arbitration, but that sacrifices determinism.
You could always just throw an ARM core in it (I know of course that is not appealing to you and will never happen).
Lots of the ideas here are good. I especially like using external memory as HUB RAM, but I really think there needs to be a way to directly execute programs rather than using some LMM/XMM paradigm (hacks devised out of necessity rather than created as original beauty).
Please no banks in the COG. We've got LMM for that, and it's going to be quick and fairly easy.
I think getting rid of the HUB in P3 may well make some sense. At that point, we've moved from a micro-controller to a CPU. When it's time to start really thinking about that, it will be an excellent discussion!
This could be address determined, so a small amount of code space could still self modify.
That would be std COG memory, as now, so we can call that Near-Code.
Far-Code has an extended PC and fetches from HUB Memory ( or even from a hardware QuadSPI-DDR block, Far-Far-Code ?)
That decision could be entirely memory-map based, and the Compiler tools would allocate code into 3 places, under user directives.
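As a sketch, the memory-map split might look like this to the tools (the boundaries here are invented for the example, not part of any real design):

    #include <stdint.h>

    #define NEAR_TOP  0x00200   /* 0x000..0x1FF: cog RAM, Near-Code, may self-modify */
    #define FAR_TOP   0x08000   /* 0x200..0x7FFF: Far-Code, fetched from hub RAM     */
                                /* above FAR_TOP: Far-Far-Code from QuadSPI-DDR      */

    static int is_near(uint32_t pc) { return pc < NEAR_TOP; }
    static int is_far (uint32_t pc) { return pc >= NEAR_TOP && pc < FAR_TOP; }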
My point is that HUB access could be as fast as COG RAM access, and stuff like LMM could go away entirely.
...
Any such "SuperCOG" could have 8 times faster main memory access and larger code space for business code and data.
It could not quite be 8 times faster than now, or it would block all other COGs, but what might be possible is a time-slice allocator.
If you had 16 or 32 slots, one scenario could be to map COG0 far-code fetches to 50% of 16 slots (every second one), and then the other COGs could get 1:16 allocations for their data access.
Far-Code fetches run at 50% of Near-Code speed - memory bandwidth is the same, you have just sliced and shared it in a varying manner.
Another mapping could be C.C.R in a 24-bit field, for 66% bandwidth, or C.C.C.R for 75% BW to Far-Code, and 1:32 for data.
Of course, there is no free lunch: to give one COG far-code access at reasonable speed, the other COGs' data bandwidth decreases.
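One way to picture the allocator is as a small slot table that the hub steps through, one entry per clock. This is only an illustration of the idea with a made-up mapping, not anything in the real design:

    #include <stdint.h>

    #define SLOTS 16
    static uint8_t slot_owner[SLOTS];   /* which COG owns each hub slot */

    /* Example mapping: COG0 gets every second slot (50% of bandwidth for
       far-code fetch) and COGs 1..7 share the remaining slots for data
       access, roughly the 1:16 case described above. */
    static void map_supercog(void) {
        for (int i = 0; i < SLOTS; i++)
            slot_owner[i] = (i & 1) ? 0 : (uint8_t)(1 + (i >> 1) % 7);
    }

    /* The hub grants this clock's slot to the table entry's owner. */
    static uint8_t hub_grant(uint32_t clock) { return slot_owner[clock % SLOTS]; }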
Making hub RAM external sounds great until you remember the read/write quad accesses are 128 bits wide. Where are these pins going to come from? And we haven't even considered the address and control pins yet. And then you will all ask for it in a DIP package.
Cog ram is quad port to permit 3 reads and 1 write per clock.
LMM does not support instruction modify. But Bill's suggestion caters for the far jumps found in LMM.
I think getting rid of the HUB in P3 may well make some sense. At that point, we've moved from a micro-controller to a CPU. When it's time to start really thinking about that, it will be an excellent discussion!
What is the hub RAM for right now? Only a holding area for the cogs. You can't execute out of it. You can't manipulate data in it (beyond WRxxx). This could all be (optionally) handled by external RAM with zero loss of functionality. Also, note that I didn't suggest getting rid of the HUB entirely. There may be other shared resources (LOCKs, MBOXs, Monitor ROM, etc.) that still must be shared between the cogs. And each cog would still have its own (expanded) memory registers and (expanded) stack/data space. For many existing applications, cogs with 8K of registers and 2-4K of stack/data would entirely mitigate the need for external RAM. But the other possibilities that would be opened up as a result would be incredible (e.g. serious DSP in a single cog).
Load and Store of course! The ultimate RISC machine. :P
Seriously, external RAM is not going to cut it as a replacement. HubRAM has aggregate bandwidth of something like 16x200MHz=3GB/s. And what's more it does it with tight access timing, on par with cache RAM.
A separate problem with that much general register space is it takes an additional 4 bits per direct register reference, with the typical D,S references taking 8 extra bits from the instruction encoding.
Although, there is probably no need for every instruction to have absolute addressing like that.
Load and Store of course! The ultimate RISC machine. :P
Seriously, external RAM is not going to cut it as a replacement. HubRAM has aggregate bandwidth of something like 16x200MHz=3GB/s. And what's more it does it with tight access timing, on par with cache RAM.
Where'd the 200MHz come from? With quad transfers and a best-case read-modify-write of 16 clock cycles (@ 160MHz), I come up with only 160MB/s. If transferring between two cogs (in one direction only), you'd get only 320MB/s. And I'm still suggesting that there be hub mailboxes for the asynchronous transfers where PORTD won't work. Yes, hub RAM would still be generally faster than external RAM, but that doesn't necessarily mean it's the ideal fit for the majority of uses. Or it might be.
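For comparison, the arithmetic behind both figures, using each post's stated clock assumptions (my back-of-envelope numbers, not measurements):

    #include <stdio.h>

    int main(void) {
        /* Aggregate estimate: one 16-byte quad moved per hub clock at an
           assumed 200 MHz, i.e. the "16x200MHz = ~3GB/s" figure. */
        double aggregate = 16.0 * 200e6;
        /* Per-cog estimate: one 16-byte quad per 16-clock read-modify-write
           round trip at 160 MHz, giving 160MB/s; a cog-to-cog copy (read in
           one cog, write in another) doubles the hub traffic. */
        double per_cog = 160e6 / 16.0 * 16.0;
        printf("aggregate: %.1f GB/s\n", aggregate / 1e9);
        printf("per-cog RMW: %.0f MB/s, cog-to-cog: %.0f MB/s\n",
               per_cog / 1e6, 2.0 * per_cog / 1e6);
        return 0;
    }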
A separate problem with that much general register space is it takes an additional 4 bits per direct register reference, with the typical D,S references taking 8 extra bits from the instruction encoding.
Although, there is probably no need for every instruction to have absolute addressing like that.
True. Instruction encoding would definitely have to be revamped.
Incidentally, I chose 8K (memory, or 2K registers) because it would give 2K per task, which would allow the existing instructions to work as-is, with the additional 2 bits implied by the task. This may not be the right approach, but it does minimize impact on instruction encoding.
Actually, if you only did 1K COG registers (though more would obviously be better), then doubled the instruction size to 64 bits, you would be able to have at least as many instructions as you do now (~500), but you'd have twice the working memory as you do now. Just think, the first 32 bits could be opcodes/flags/etc and destination address (16-bit?), while the second 32 bits could contain the source address or 32-bit immediate value! For instructions that might reference external (or hub) RAM, you could address up to 4GB of address space! Also, you wouldn't have to do any register remapping to access the additional memory. And there'd be ample growing room for the future versions of the Propeller!
Yes, I know I'm getting carried away and that these changes won't show up in the P2, but it's fun to think about anyhow.
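One way such a 64-bit instruction could be laid out (the field widths are invented for illustration; nothing here matches a real encoding):

    #include <stdint.h>

    /* First 32 bits: opcode, condition/effect flags, 16-bit destination.
       Second 32 bits: 16-bit source address or a full 32-bit immediate. */
    typedef struct {
        uint32_t op_flags_dst;  /* [31:24] opcode, [23:16] flags, [15:0] dst */
        uint32_t src_or_imm;    /* source address, or 32-bit immediate value */
    } ins64_t;

    static uint16_t dst_of(ins64_t i) { return (uint16_t)(i.op_flags_dst & 0xFFFF); }
    static uint32_t imm_of(ins64_t i) { return i.src_or_imm; }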
It would be far better to make it possible for all COGs to read any global RAM location simultaneously, and allow writing via round robin or strict priority, dependent on the user's needs.
I know it is difficult, but it is worth investigating further.
+1. Whether it's feasible or not I have no clue, but the idea itself has merit - and opens up new possibilities.
What might also, on a related note, be worth considering is some way to launch a cog pointing at code in memory external to the cog, thereby eliminating the need to copy code into the cog - ie, fast launches.
Of course, the above may just be another whacky idea.
No problem! We just divide the size of hub memory by 9 and end up with 14k! :-)
Best chance of that is with a cross-point switch. I haven't tried to work out how bulky an 8x8 switch is. It would presumably add fetching latency but that could prolly be managed. And presumably all instruction fetches would then come directly via Hub - reducing the demands on Cog registers.
Software would have to be much more aware of which Cog is accessing which Hub bank at any one time. If, say, Cog 0 was executing instructions from, say, bank 0 and Cog 1 decided it had to update a block of shared data ... actually, if this was going to work well then we'd prolly want two ports per Cog and 16 banks of HubRAM. That makes a 16x16 cross-point switch!
Chip,
...
Eventually it will be time to produce something bigger and faster IMHO. Faster could be a process shrink away on P1. However, there are limits to your TDM HUB idea that need to be overcome. Other CPU shops do it with arbitration, but that sacrifices determinism.
You could always just throw an ARM core in it (I know of course that is not appealing to you and will never happen).
Lots of the ideas here are good. I especially like using external memory as HUB RAM, but I really think there needs to be a way to directly execute programs rather than using some LMM/XMM paradigm (hacks devised out of necessity rather than created as original beauty).
--Steve
I could not agree more. Now having said that, I know it's not something that can be accomplished easily and painlessly. That is, for one thing we're quite possibly contemplating fundamental changes to Propeller architecture.
Notwithstanding, there have been many interesting ideas in this thread (eg, some of Seairth's). Another point-of-view to think about: perhaps relying so heavily on centralized memory is the issue. That is, instead allocate more directly to cogs, say 16 KB each, then use another scheme to share data (stored in cog RAM locations reserved for this?) between them - similarly in a time-sequenced fashion; not ideal either, but the concept has its merits.
The other thing that occurs to me is that maybe we have unrealistic notions about the variety of applications Propeller can thrive in. On the other hand, even with that concession, one has to admit that a relatively straight-forward IO aggregator/processor can, without too much difficulty, run up against a 496-instruction per-cog limit.
I have to agree with KC_rob, I think too many people do have unrealistic notions of what they can do with a Prop - like running Linux, powering video game consoles, a full blown multi-media processor like the Sitara, etc. People are better off just getting a Raspberry if they want those things. It can be done cheaper and easier than it can ever be done on Prop.
Now where the Prop shines IMO is in deeply embedded projects, whether medical or industrial, etc. I think people are better off coming up with synergistic solutions where you have an ARM/PPC/Intel/whatever and a Prop(2).
How right you are.
Trying to stretch a Prop to run big programs efficiently from external RAM and so on is a fool's errand. The world is already full of such machines from ARM, Intel and so on. They do it much better. They can do it cheaper and at lower power; check out the STM32 for example.
The Prop is fantastic for those real-time multi-tasking bit twiddling jobs, and hopefully serial streaming jobs with the PII.
As such it would make an excellent partner to an ARM core. ARM runs big code, even Linux for those that want it, Prop runs the real-time I/O interfacing stuff.
Clearly others have this concept in mind; that is why the Sitara ARM SoCs have PRU's.
If Parallax could license a Prop core to somebody, to be used alongside an ARM core on a SoC, I think they would be very happy. Who would not like to have a Sitara-class machine with the ease of programming and open source tools of the Prop instead of those PRU's?
Now where the Prop shines IMO is in deeply embedded projects, whether medical or industrial, etc. I think people are better off coming up with synergistic solutions where you have an ARM/PPC/Intel/whatever and a Prop(2).
This is absolutely spot on and has always been my preferred way to view the Prop, as tempting as it is to get carried away with everyone else here on the forums. But if P2 is going to be a "helpmate" to Sitara ARMs etc., its design should be scaled - in terms of cost, size, power use, and complexity - appropriately.
In short then. Parallax might want to look beyond the Prop as an MCU chip towards the Prop as a CPU core IP that can be licensed to others to use in their ARM SoCs.
That is not going to happen.
Why base a P3 on an ARM when anyone could take their licensed ARM core and put 8 of them on a chip with a HUB?
ARM does not have that tight immediate connection to its pins.
ARM probably needs cache to perform.
A decent ARM, for running Linux, needs its MMU and virtual memory.
And so on.
If any of that made sense to do, why would TI make a Sitara SoC with small PRU's for the real-world interfacing?
That is not going to happen.
Why base a P3 on an ARM when anyone could take their licensed ARM core and put 8 of them on a chip with a HUB? ARM does not have that tight immediate connection to its pins.
ARM probably needs cache to perform.
A decent ARM, for running Linux, needs its MMU and virtual memory.
And so on.
If any of that made sense to do, why would TI make a Sitara SoC with small PRU's for the real-world interfacing?
I think you need to revise that. I think Silicon Labs learned from Parallax!!
EFM32 Zero Gecko 32-bit ARM Cortex-M0+ Microcontroller.
What am I missing? The EFM32 looks like a typical ARM SoC, like the STM32 and so on, perhaps optimized for low power (only 24MHz).
I don't even see it doing what the P1's 8 cores can do at 20MHz each.
Quite so. And I hope that is not what Parallax wants to do with the Propeller. The world is full of ARMs already, what would be the point?