New concepts for a better Propeller
Kaio
Now that the Propeller 1 design has been open to everybody for more than seven months, we have seen some (IMO) small improvements, such as hub execution including AUGS/AUGDS, which are fine.
There have been many discussions about which features, how many cogs and how much RAM should be included in an extended version based on the Propeller 1.
But I have been missing new ideas for eliminating the existing bottleneck of the Propeller 1 design: hub access.
Therefore I will present three new concepts here that don't use the usual hub mechanism, yet provide shared access to main memory and, additionally, shared access to cog RAM, both with no delay.
I can hear some of you saying, "that's awesome!". Now here's the bad news: the cog count is (currently) limited. But I saw some people mention that they would be happy with 4 cogs if the cog RAM were bigger. Yes, you can have huge cog RAM.
Preface:
- The amounts of ROM mentioned in the concepts are only examples based on the current size. Nothing prevents you from changing them if you want.
- The amounts of RAM mentioned in the concepts, and their distribution over the different cogs, are only examples.
- AUGS/AUGDS instructions are required to access the huge RAM. We may need some more special instructions to work with it efficiently.
Best regards,
Thomas
Comments
2 super cogs 128 KB RAM, 32 KB ROM
==============================
Main memory is implemented as dual-port RAM and is virtually segmented into two parts. Each memory part is assigned to one of the cogs.
main memory
$0000_0000 64 KB mapped for super cog 1
...
$0001_0000 64 KB mapped for super cog 2
...
$0002_0000 32 KB ROM
...
$0002_7FFF
RAM view from cog
* shared access via rdXXXX / wrXXXX of the main RAM (and also super cog RAMs) from both cogs
- no solution for configuration area yet
The following table shows the address translation depending on super cog, as seen from the dual-port (main) RAM.
Advantages/Disadvantages:
+ 2 super cogs executing code in separate large RAMs
+ independent access of 2 super cogs and shared RAM without (hub) delay
+ independent access of cog RAM from other cog (shared cog RAM) without delay
- only 2 cogs
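The example memory map above can be sketched as a small address decoder. This is only an illustration of the segment layout given in the post (64 KB per super cog, followed by 32 KB of ROM); the function name `decode` and its return format are assumptions of this sketch, not the translation table from the post, and `n_cogs=4` gives the quad-port layout of the next concept.

```python
# Sketch only: decodes a main-memory address using the example map from the post.
SEGMENT_SIZE = 0x1_0000   # 64 KB mapped per super cog (example size)
ROM_SIZE = 0x8000         # 32 KB ROM (example size)


def decode(addr, n_cogs=2):
    """Return (region, owner, offset) for a main-memory address.

    region is "ram" or "rom"; owner is the 1-based number of the super cog
    the segment is mapped to (None for ROM); offset is relative to the
    start of that segment.
    """
    ram_top = n_cogs * SEGMENT_SIZE
    if 0 <= addr < ram_top:
        return ("ram", addr // SEGMENT_SIZE + 1, addr % SEGMENT_SIZE)
    if ram_top <= addr < ram_top + ROM_SIZE:
        return ("rom", None, addr - ram_top)
    raise ValueError(f"address ${addr:08X} is outside the memory map")
```

For example, `decode(0x0001_0004)` lands in the segment mapped for super cog 2, and `decode(0x0002_7FFF)` is the last byte of ROM.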
4 super cogs 256 KB RAM, 32 KB ROM
==============================
Same as SP2 but using a quad-port RAM.
Main memory is implemented as quad-port RAM and is virtually segmented in four parts. Each memory part is assigned to one of the cogs.
main memory
$0000_0000 64 KB mapped for super cog 1
...
$0001_0000 64 KB mapped for super cog 2
...
$0002_0000 64 KB mapped for super cog 3
...
$0003_0000 64 KB mapped for super cog 4
...
$0004_0000 32 KB ROM
...
$0004_7FFF
RAM view from cog
* shared access via rdXXXX / wrXXXX of the main RAM (and also super cog RAMs) from all cogs
- no solution for config area yet
The following table shows the address translation depending on super cog, as seen from the quad-port (main) RAM.
Advantages/Disadvantages:
+ 4 super cogs in total
+ all cogs executing code in separate large RAMs
+ independent access of 4 super cogs and shared RAM without (hub) delay
+ independent access of super cog RAM from each other super cog (shared cog RAM) without delay
- no solution for SPRs yet
- no solution for configuration area yet
2 super cogs + 2 co-cogs (2 KB RAM), 128 KB RAM, 32 KB ROM
==================================================
Main memory is implemented as dual-port RAM and is virtually segmented in two parts. Each memory part is assigned to one of the super cogs.
Additionally, one co-cog with 2 KB of dual-port RAM is assigned to each super cog.
main memory
$0000_0000 64 KB mapped for super cog 1
...
$0001_0000 64 KB mapped for super cog 2
...
$0002_0000 32 KB ROM
...
$0002_7FFF
RAM view from super cog
* each super cog has one co-cog with 512 longs RAM
* shared access via rdXXXX / wrXXXX of the main RAM (and also super cog RAMs) from both super cogs
RAM view from co-cog
* rdXXXX / wrXXXX instructions not permitted
* all other assembly instructions work as usual in the cog (9 bit address $000 - $1FF)
The following table shows the address translation depending on super cog, as seen from the dual-port (main) RAM.
The following table shows the address translation depending on co-cog, as seen from the dual-port (cog) RAM.
Advantages/Disadvantages:
+ 4 cogs in total
+ 2 super cogs executing code in separate large RAMs
+ independent access of 2 super cogs and shared RAM without (hub) delay
+ independent access of super cog RAM from other super cog (shared cog RAM) without delay
+ independent access of co-cog RAM from related super cog (shared cog RAM) without delay
+ 2 co-cogs, e.g. for low-level drivers
* In a FPGA, dual-port memory comes almost for free, but quad port does not.
* Present opcodes have 9 bit fields, dictated by binary compatible operation.
The memory-domain elasticity I can see in a P1V are things like
a) Local Indirect Data inside a COG (can share with an adjacent COG almost for free) - see other thread on this.
This is easy to add, in a FPGA, and keeps binary subset compatible operation.
b) Some small HUB area that is N-Ported, somewhat costly, but it does remove HUB waits in that area.
The key is to keep it small, for COG-COG messages.
c) Add XIP HW to have transparent read of QuadSPI memory - Waits are longer here, but the memory is transparent to the user. Lots of it, just slower, and burst-reads would have lower waits than random reads.
The indirect read HW can access all memory areas via the 32b pointer. What changes is the speed (waits).
a) removes HUB waits, if those were just to store local data.
FYI the DE2-115 can support 1 x P1V with 416K hub ram.
The Bemicro CV can support 1 x P1V with 128k hub ram.
http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=167&No=921
says
Cyclone V 5CEBA4F23C7N Device
49K Programmable Logic Elements
3080 Kbits embedded memory
4 Fractional PLLs
1 Hard Memory Controllers
Initial test compiles show 288K hub ram is the largest fit in the DE0-CV. I attempted 320K but Quartus reported the following error:
Error (170048): Selected device has 308 RAM location(s) of type M10K block. However, the current design needs more than 308 to successfully fit
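The error is consistent with a rough block count. The sketch below assumes that each M10K block contributes 8 Kbit (1 KB) of usable data when it is configured in a byte-multiple width such as x8 or x32, so the remaining 2 Kbit of each 10 Kbit block go unused. That per-block figure, the function names and the `reserved_blocks` parameter are assumptions of this sketch; only the 308-block limit comes from the Quartus message.

```python
M10K_BLOCKS = 308                  # from the Quartus error message
USABLE_BITS_PER_BLOCK = 8 * 1024   # assumption: byte-multiple widths use 8 of the 10 Kbit


def m10k_blocks_for(hub_kbytes):
    """Estimate the M10K blocks needed for hub RAM of the given size in KB."""
    bits = hub_kbytes * 1024 * 8
    return -(-bits // USABLE_BITS_PER_BLOCK)   # ceiling division


def fits(hub_kbytes, reserved_blocks=0):
    """True if the hub RAM plus blocks reserved for other uses fit the device.

    reserved_blocks is a hypothetical allowance for cog RAM, ROM and so on.
    """
    return m10k_blocks_for(hub_kbytes) + reserved_blocks <= M10K_BLOCKS
```

Under that assumption 288K of hub RAM needs 288 blocks and fits, while 320K needs 320 blocks and exceeds the 308 available, even before anything else in the design claims a block.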
Here's what I was thinking about: http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1267735#post1267735
Now that I look at it again, I'm not sure exactly how much of the rotating hub selector it actually describes. It would probably help if I actually knew Verilog.
I must have missed that one.
BTW It is the complete hub system code.
There are no good Bed & Breakfasts at the summit of most mountains, so why keep climbing them?
In today's computer environment, the processing power and graphic capabilities far exceed anything produced in the 70's and 80's so why build a retro-computer?
We're hobbyists (mostly), it's what we do! As FPGA development board prices drop and on-board resources increase, I could see using custom FPGA P1V's for some projects. A Propeller Project board costs $25 and gives you a standard P1, a BE-Micro costs $35 and can give you a P1 with 64 I/O pins and up to 128K of HUB, not bad for $10 more? What is the P2 Project Board going to cost? If all you need is a few extra I/O pins, isn't the P1V-64 on a BE-Micro worth $35?
Concepts for a better Propeller, or any thing else, are two a penny. Anyone can have concepts. We had a million of them on the interminably long P2 development thread that led nowhere.
What the concepts need is a carefully worked out plan of how they fit in with the current architecture. Or even HDL that does it.
Changing the addressing and hence instruction set is basically designing a different machine. It's not a Propeller anymore. Now you need to provide new software tools to use it. The assembler, the Spin interpreter, the C compiler.
Not only is there no Bed and Breakfast on top of the mountain, you have just made your mountain ten times higher. Or perhaps dug yourself a big hole to climb out of first!
Yes, mindrobots, I can see that having a P1 in an FPGA board may have its uses, and even cost and complexity benefits over other solutions. For example, if you want to add that missing I/O port, or somehow optimize the P1 performance, or surround it with some custom logic that you would otherwise have to build onto a custom circuit board with logic chips. All good stuff.
I just don't get major changes to the architecture. They just don't seem worth it.
The only time wasted is the time one freely chooses to spend following it... if you think that's wasted time, then don't follow the thread, but don't find fault with those who spend their time following it just because you disagree.
(any you's used above are the general you, not the specific you belonging to any particular previous poster.)
When one means the general "you" rather than the specific "you" one can use "one" instead. That is what "one" is for and it makes things much clearer.
Now, I suspect that your "one" was actually me so I have to say:
Of course I follow such threads. One is always keen to hear new ideas. If someone has put up an idea in all seriousness it is worthy of consideration and comment, else why would they have bothered posting here? Not all comments need be positive, some could be challenging. It's all about the debate. That is what we mean by "intellectual discourse". Have you ever wondered why a PhD student has to "defend" his thesis? It's very adversarial that way.
Ultimately we hope it all leads to good stuff, as you say.
Yes, the economics of actually getting to product are a barrier. But Heater recently posted a link about Mr. Peddle of how the 6502 evolved from the 6800. There was a sudden breakthrough in production costs, and people took advantage of it. You just never know when an opportunity might occur, and 'opportunity favors the prepared'.
I am certainly not going to rack my brains over each and every alternative Propeller architecture that is proposed, but there is real creative effort involved. So please don't discourage those efforts.
And yes, the P2 means a bit more to us... just because we long for more resources, not a redeployment of limits we have already suffered.
OK, you one! (darn spell checkers!!)
My defense:
My secondary defense: Not really. By the way, shouldn't that be "their" thesis instead of "his" thesis? I believe they have recently begun to allow females to pursue PhDs.
I kind'a sort'a agree with your defense. Well, both of them. I did say above that I can see the point in a P1 core in an FPGA plus some performance tweaks or external functionality added.
As to your second defense, here we go. My new concept for a better Propeller is as follows:
1) It's a 64 bit machine.
2) It has at least 16 COGs.
3) Each COG is a RISC V architecture with 1 megabyte of local RAM.
4) There are multiple megs of shared FLASH and RAM.
5) It makes use of clocked I/O and SERDES tightly coupled to the COGS for high speed real world interfacing.
This is a cool concept because there is a Bed and Breakfast at the top of the mountain. There already exists a GCC C/C++ compiler for RISCV.
No. We don't have any truck with any of that political correctness nonsense around here.
Taking that angle, are there any numbers for a RISC-V on a Cyclone V ?
It was not my goal to reduce the number of COGS. It's still the result of using n-port RAM. ;-)
You are right and I know that quad port memory is currently not standard in FPGA. But you can do it with some lines of code.
--> Advanced Synthesis Cookbook
There are no changes necessary to the instructions for these concepts. As you know, AUGS and AUGDS can be helpful for extending the D and S fields and avoiding the 9-bit limitation.
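As a rough illustration of how a prefix instruction can get around the 9-bit field, the sketch below assumes the P2-style convention in which the AUGS/AUGDS prefix supplies the upper 23 bits and the following instruction's 9-bit field supplies the lower 9 bits. The thread does not spell out the encoding used for these concepts, so the function names and the bit split are assumptions of this sketch.

```python
def augmented_immediate(aug23, field9):
    """Combine a 23-bit prefix value with a 9-bit S or D field into 32 bits."""
    if not 0 <= aug23 < (1 << 23):
        raise ValueError("prefix value must fit in 23 bits")
    if not 0 <= field9 < (1 << 9):
        raise ValueError("field value must fit in 9 bits")
    return (aug23 << 9) | field9


def split_immediate(value32):
    """Split a 32-bit value into (prefix part, 9-bit field part)."""
    if not 0 <= value32 < (1 << 32):
        raise ValueError("value must fit in 32 bits")
    return value32 >> 9, value32 & 0x1FF
```

With this split, an address such as $0001_0004 in the large RAM would be carried as a prefix value of $80 plus a 9-bit field of $004.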
I'll check this.
Thanks!
Thank you!
On a side note, it'll be interesting to see what nVidia can pull off with their push into "stacked" RAM. I hope they give some details like the number of interconnect wires for example.
That's true in an ASIC, but FPGAs move things around a little, and there, Dual port has very little additional cost, as the BlockRAMS are inherently dual port.
QuadPort does cost, as you overlay two dual ports to emulate that - but provided it is kept small, there could be a place for N-port message memory.
Meanwhile, FPGA's are getting ever-cheaper and there is a place for more modest, FPGA achievable targets.
Given current price-curves, I can see a place for a Prop1 and a P1V on a small module.
Such a module could even migrate to P2 with relative ease, when that becomes a disti-part-code.
What is a waste is the criticism of posting the ideas - it just wastes thread space where objective discussion of the ideas can be lost in the quagmire. So, instead of wasting time criticising each other, how about you "remain silent" or discuss the idea. Sometimes brilliant alternatives come out of ideas.
The concept of sharing "some" of the hub ram with some "cogs" seems like a good objective. However, making the hub totally 2-port memory wastes a lot of silicon which reduces the amount of hub available.
But the concept of having larger cog space is music to my ears, especially if some of this is shared with hub ram. Accessing extended cog space is easy enough with just a couple of extra instructions (I have shown this before). Basically it is simpler than hubexec, but would flow on from hubexec easily.
However, I don't want to see less cogs on a P1. So what about some alternatives...
One big P1 improvement would be to have 1-clock hub access - immediately doubles hub performance to 1:8 instead of 1:16
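To put a number on that, here is the arithmetic as a small sketch. It assumes one long (4 bytes) is transferred per hub window and uses an 80 MHz system clock as an example figure; both are assumptions of this sketch rather than figures from the post.

```python
BYTES_PER_LONG = 4


def hub_longs_per_second(clock_hz, clocks_per_window):
    """Peak longs per second one cog can move if it uses every hub window."""
    return clock_hz // clocks_per_window


def hub_bytes_per_second(clock_hz, clocks_per_window):
    """Same figure expressed in bytes per second."""
    return hub_longs_per_second(clock_hz, clocks_per_window) * BYTES_PER_LONG


CLOCK_HZ = 80_000_000                            # assumed example clock
current = hub_bytes_per_second(CLOCK_HZ, 16)     # 1:16 window, as on the P1
proposed = hub_bytes_per_second(CLOCK_HZ, 8)     # 1:8 window with 1-clock slots
```

Under these assumptions a cog goes from 20 MB/s to 40 MB/s of peak hub throughput, which is the doubling mentioned above.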
How about a P1 with...
* MUL and perhaps DIV
* Video only on 2 cogs (seems a waste of silicon on all cogs)
* A simple serial (not a uart, just a shifter in and a shifter out, clocked by counter or by external pin) on 6 cogs (replacing the video)
* Simple hubexec that also works for extended cog ram
* 4x 2KB or 4KB RAM blocks shared between adjacent pairs of cogs (as extended cog space) with option for
-- dedicated to either cog only at full speed
-- shared by both cogs at half speed (i.e. 1:2 clock access, round-robin, for deterministic use)
* 64KB minimum hub RAM (more preferred), single clock (1:8 access)
* ROM code sufficient to boot a cog (i.e. no ROM SIN/LOG/FONT tables as in the original P1 - these can be softloaded, as can SPIN)
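The two access options for the shared RAM blocks in the list above can be modelled in a few lines. This is a behavioural sketch only: the names `slot_owner` and `clocks_until_access`, the cog labels "a" and "b", and the choice that cog "a" takes the even clocks are all assumptions, not part of the proposal.

```python
def slot_owner(clock, mode="shared"):
    """Return which cog of a pair ("a" or "b") may access the block on this clock.

    mode "a" or "b": the block is dedicated to that cog at full speed.
    mode "shared":   the cogs alternate, one access every second clock.
    """
    if mode == "shared":
        # Assumption: cog "a" takes even clocks, cog "b" takes odd clocks.
        return "a" if clock % 2 == 0 else "b"
    if mode in ("a", "b"):
        return mode
    raise ValueError(f"unknown mode {mode!r}")


def clocks_until_access(cog, clock, mode="shared"):
    """Clocks a cog waits from `clock` for its next slot, or None if it has none."""
    if slot_owner(clock, mode) == cog:
        return 0
    return 1 if mode == "shared" else None
```

In shared mode the wait is always 0 or 1 clock, which is what keeps the timing deterministic; in dedicated mode the owning cog never waits and the other cog has no access.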
The limit to COGS in a FPGA is financial - you can buy 8 COGS now in a P1 for under $1/COG
Much of that laundry list is in the latest P1Vs with conditional builds, and there is a lot of 'natural resource' that comes for free in modern small FPGAs.
MUL is one example, PLLs are another.
Dual port memory is also free, so "shared by both cogs at half speed (ie 1:2 clock access round-robin for deterministic use)" may not be needed.
I like the idea of a Indirect-with-wait * approach to all the memory areas, then local COG memory has no added waits, COG-COG / Local Array memory likely has no added waits either, HUB access would be hub-slot paced, (new 1:8?) and off-chip access to QuadSPI (x1 or x2) can be HW managed to ~1-2 dozen clocks random access, & less with natural support for block moves.
Indirect-with-wait would likely be a 2 cycle opcode (min), as it needs to decode and read the index, then do the fetch.
* I detailed the possible (almost free) bit-level mapping of a @ Rn with ++/-- options here - via the upper 4 bits currently unused in FPGAs.
http://forums.parallax.com/showthread.php/160278-The-need-for-CTRx-addressable-registers?p=1321602&viewfull=1#post1321602
We had a lot of discussion on the implications of that which has yet to play out.
I know the P2 includes or is set to include fifo, etc... Maybe that renders just the egg beater a waste of time.
It would be compelling to write some P1 code to better understand what it means.
There are likely to be quite a few 'other changes' to get that working.
Such a design works best when the burst nature of the HW can be tapped, but with a 4 cycle opcode and no auto-inc, the core is a lot slower than the rotating LSB selector; so much slower that just moving from 16:1 to 8:1 HUB is likely to be a more useful change.
Video, via an optional CLUT, was one use that did use the full bandwidth of the rotating LSB selector, but even that needed a local FIFO for clock rate smoothing.