New concepts for a better Propeller — Parallax Forums


KaioKaio Posts: 257
edited 2015-03-25 10:45 in Propeller 1
Now that the Propeller 1 design has been open to everybody for more than seven months, we have seen some small (IMO) improvements, such as hub execution incl. AUGS/AUGDS, which are fine.
There were many discussions about which features, how many cogs, and how much RAM should be included in an extended version based on the Propeller 1.
But I was missing new ideas to eliminate the existing bottleneck of the Propeller 1 design: the hub access.

Therefore I will present three new concepts here which don't use the usual hub mechanism, yet provide shared access to main memory and, additionally, shared access to cog RAM, both with no delay.
I hear some of you now say, "that's awesome!". Now here's the bad news, the cog count is (currently) limited. But I saw some guys mention they would be happy with 4 cogs if the cog RAM would be bigger. Yes, you can have huge cog RAM.

Preface:
  • The amounts of ROM mentioned in the concepts are only examples based on the current size. They can be changed freely if you want.
  • The amounts of RAM mentioned in the concepts, and their distribution over the different cogs, are only examples.
  • AUGS/AUGDS instructions are required to access the huge RAM. Maybe we need some more special instructions to work efficiently.
You are all invited to discuss the 3 concepts (SP2, SP4 and SP2X2) and your improvements are welcome.

Best regards,
Thomas

Comments

  • KaioKaio Posts: 257
    edited 2015-03-23 17:29
    Concept 1: SP2

    2 super cogs 128 KB RAM, 32 KB ROM
    ==============================

    Main memory is implemented as dual-port RAM and is virtually segmented in two parts. Each memory part is assigned to one of the cogs.

    main memory
    $0000_0000 64 KB mapped for super cog 1
    ...
    $0001_0000 64 KB mapped for super cog 2
    ...
    $0002_0000 32 KB ROM
    ...
    $0002_7FFF

    RAM view from cog
    * shared access via rdXXXX / wrXXXX of the main RAM (and also super cog RAMs) from both cogs
    (18 bit address $0_0000 - $2_7FFF)
    * all other assembly instructions work as usual in the cog but with bigger RAM
    (14 bit address $0000 - $3FFF)
    - no solution for SPRs yet
    - no solution for configuration area yet

    The following table shows the address translation depending on super cog from view of the dual-port (main) RAM.
    assembly instr.   | Port A (s-cog 1)  | Port B (s-cog 2)   | translation
    ------------------+-------------------+--------------------+---------------------------------------------
    rdXXXX / wrXXXX   | $0_0000 - $2_7FFF | $0_0000 - $2_7FFF -> $0_0000 - $2_7FFF (no translation necessary)
    other in s-cog 1  | $0_0000 - $0_3FFF |                   -> $0_0000 - $0_FFFC (addr << 2)
    other in s-cog 2  |                   | $0_0000 - $0_3FFF -> $1_0000 - $1_FFFC (addr << 2 | $1_0000)
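
The table above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not hardware code; the names `COG_BASE`, `translate_hub` and `translate_cog` are made up for this example.

```python
# Sketch of the SP2 address translation table (all names are hypothetical).
COG_BASE = {1: 0x0_0000, 2: 0x1_0000}   # start of each super cog's 64 KB segment
MEM_TOP = 0x2_7FFF                      # last byte of the 32 KB ROM


def translate_hub(addr):
    """rdXXXX / wrXXXX: byte address into main memory, passed through unchanged."""
    if not 0 <= addr <= MEM_TOP:
        raise ValueError("address outside main memory")
    return addr


def translate_cog(cog, addr):
    """Other instructions: 14-bit long address inside the cog's own segment."""
    if not 0 <= addr <= 0x3FFF:
        raise ValueError("cog address must fit in 14 bits")
    # Long address -> byte address, then move into this cog's segment.
    return (addr << 2) | COG_BASE[cog]


if __name__ == "__main__":
    for cog in (1, 2):
        lo, hi = translate_cog(cog, 0x0000), translate_cog(cog, 0x3FFF)
        print(f"s-cog {cog}: ${lo:05X} - ${hi:05X}")
```

Because the cog-side address is a long index and the main-memory address is a byte address, the shift by 2 is all that is needed besides the segment base.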
    
    Advantages/Disadvantages:
    + 2 super cogs executing code in separate large RAMs
    + independent access of 2 super cogs and shared RAM without (hub) delay
    + independent access of cog RAM from other cog (shared cog RAM) without delay
    - only 2 cogs
  • KaioKaio Posts: 257
    edited 2015-03-23 17:30
    Concept 2: SP4

    4 super cogs 256 KB RAM, 32 KB ROM
    ==============================

    Same as SP2 but using a quad-port RAM.
    Main memory is implemented as quad-port RAM and is virtually segmented in four parts. Each memory part is assigned to one of the cogs.

    main memory
    $0000_0000 64 KB mapped for super cog 1
    ...
    $0001_0000 64 KB mapped for super cog 2
    ...
    $0002_0000 64 KB mapped for super cog 3
    ...
    $0003_0000 64 KB mapped for super cog 4
    ...
    $0004_0000 32 KB ROM
    ...
    $0004_7FFF

    RAM view from cog
    * shared access via rdXXXX / wrXXXX of the main RAM (and also super cog RAMs) from all cogs
    (19 bit address $0_0000 - $4_7FFF)
    * all other assembly instructions work as usual in the cog but with bigger RAM
    (14 bit address $0000 - $3FFF)
    - no solution for SPRs yet
    - no solution for config area yet

    The following table shows the address translation depending on super cog from view of the quad-port (main) RAM.
    assembly instr.   | Port A (s-cog 1)  | Port B (s-cog 2)   | translation
    ------------------+-------------------+--------------------+---------------------------------------------
    rdXXXX / wrXXXX   | $0_0000 - $4_7FFF | $0_0000 - $4_7FFF -> $0_0000 - $4_7FFF (no translation necessary)
    other in s-cog 1  | $0_0000 - $0_3FFF |                   -> $0_0000 - $0_FFFC (addr << 2)
    other in s-cog 2  |                   | $0_0000 - $0_3FFF -> $1_0000 - $1_FFFC (addr << 2 | $1_0000)
    
    
    assembly instr.   | Port C (s-cog 3)  | Port D (s-cog 4)   | translation
    ------------------+-------------------+--------------------+---------------------------------------------
    other in s-cog 3  | $0_0000 - $0_3FFF |                   -> $2_0000 - $2_FFFC (addr << 2 | $2_0000)
    other in s-cog 4  |                   | $0_0000 - $0_3FFF -> $3_0000 - $3_FFFC (addr << 2 | $3_0000)
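
For SP4 the segment base follows directly from the cog number, so the four "other" rows of the two tables collapse into one rule. A minimal sketch, assuming cogs are numbered 1 to 4 as in the tables; `translate_cog_sp4` and `SEGMENT_BITS` are hypothetical names.

```python
# Sketch of the SP4 translation for non-rdXXXX/wrXXXX instructions.
SEGMENT_BITS = 16   # 64 KB per super cog, so segments start every $1_0000 bytes


def translate_cog_sp4(cog, addr):
    """Map a 14-bit long address of super cog 1..4 to a main-memory byte address."""
    if cog not in (1, 2, 3, 4):
        raise ValueError("SP4 has super cogs 1 to 4")
    if not 0 <= addr <= 0x3FFF:
        raise ValueError("cog address must fit in 14 bits")
    segment_base = (cog - 1) << SEGMENT_BITS
    return segment_base | (addr << 2)


if __name__ == "__main__":
    for cog in range(1, 5):
        lo, hi = translate_cog_sp4(cog, 0), translate_cog_sp4(cog, 0x3FFF)
        print(f"s-cog {cog}: ${lo:05X} - ${hi:05X}")
```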
    
    Advantages/Disadvantages:
    + 4 super cogs in total
    + all cogs executing code in separate large RAM
    + independent access of 4 super cogs and shared RAM without (hub) delay
    + independent access of super cog RAM from each other super cog (shared cog RAM) without delay
    - no solution for SPRs yet
    - no solution for configuration area yet
  • KaioKaio Posts: 257
    edited 2015-03-23 17:31
    Concept 3: SP2X2

    2 super cogs + 2 co-cogs (2 KB RAM), 128 KB RAM, 32 KB ROM
    ==================================================

    Main memory is implemented as dual-port RAM and is virtually segmented in two parts. Each memory part is assigned to one of the super cogs.
    Additionally, one co-cog with 2 KB dual-port RAM is assigned to each super cog.

    main memory
    $0000_0000 64 KB mapped for super cog 1
    ...
    $0001_0000 64 KB mapped for super cog 2
    ...
    $0002_0000 32 KB ROM
    ...
    $0002_7FFF

    RAM view from super cog
    * each super cog has one co-cog with 512 longs RAM
    * shared access via rdXXXX / wrXXXX of the main RAM (and also super cog RAMs) from both super cogs
    (18 bit address $0_0000 - $2_7FFF)
    * shared access via rdXXXX / wrXXXX of the related co-cog RAM from each super cog
    (18 bit address $2_8000 - $2_87FF, last 2 KB (512 longs) shared with related co-cog incl. SPRs)
    * all other assembly instructions work as usual in the cog but with bigger RAM
    (14 bit address $0000 - $3FFF)
    - no solution for configuration area yet

    RAM view from co-cog
    * rdXXXX / wrXXXX instructions not permitted
    * all other assembly instructions work as usual in the cog (9 bit address $000 - $1FF)

    The following table shows the address translation depending on super cog from view of the dual-port (main) RAM.
    assembly instr.   | Port A (s-cog 1)  | Port B (s-cog 2)   | translation
    ------------------+-------------------+--------------------+---------------------------------------------------
    rdXXXX / wrXXXX   | $0_0000 - $2_7FFF | $0_0000 - $2_7FFF -> $0_0000 - $2_7FFF (no translation necessary)
    other in s-cog 1  | $0_0000 - $0_3FFF |                   -> $0_0000 - $0_FFFC (addr << 2)
    other in s-cog 2  |                   | $0_0000 - $0_3FFF -> $1_0000 - $1_FFFC (addr << 2 | $1_0000)
    

    The following table shows the address translation depending on co-cog from view of the dual-port (cog) RAM.
    assembly instr.   | Port A (co-cog X) | Port B (s-cog X)   | translation
    ------------------+-------------------+--------------------+---------------------------------------------------
    rdXXXX / wrXXXX   | not permitted     | $2_8000 - $2_87FF -> $0_0000 - $0_01FF ((addr & $7FF) >> 2) if addr >= $2_8000
    other in co-cog X | $0000 - $01FF     |                   -> $0_0000 - $0_01FF (no translation necessary)
    other in s-cog X  |                   | not permitted
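
From the super cog's side, the two SP2X2 tables amount to a small address decoder: addresses in the $2_8000 window go to the related co-cog RAM, everything below goes to main memory. A sketch of that decision, with hypothetical names (`route_super_cog_access` and the constants are not part of the proposal):

```python
# Sketch of how a super cog's rdXXXX / wrXXXX address could be routed in SP2X2.
MAIN_TOP = 0x2_7FFF            # end of main RAM + ROM
COCOG_WINDOW_BASE = 0x2_8000   # 2 KB window onto the related co-cog RAM
COCOG_WINDOW_TOP = 0x2_87FF


def route_super_cog_access(addr):
    """Return ('main', byte_address) or ('co-cog', long_index)."""
    if COCOG_WINDOW_BASE <= addr <= COCOG_WINDOW_TOP:
        # Byte offset inside the window -> long index $000 - $1FF.
        return ("co-cog", (addr & 0x7FF) >> 2)
    if 0 <= addr <= MAIN_TOP:
        return ("main", addr)
    raise ValueError("address is not mapped")


if __name__ == "__main__":
    for a in (0x0_0000, 0x2_7FFF, 0x2_8000, 0x2_87FC):
        print(f"${a:05X} ->", route_super_cog_access(a))
```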
    
    Advantages/Disadvantages:
    + 4 cogs in total
    + 2 super cogs executing code in separate large RAM
    + independent access of 2 super cogs and shared RAM without (hub) delay
    + independent access of super cog RAM from other super cog (shared cog RAM) without delay
    + independent access of co-cog RAM from related super cog (shared cog RAM) without delay
    + 2 co-cogs, e.g. for low-level drivers
  • jmgjmg Posts: 15,173
    edited 2015-03-23 18:17
    * Slashing the number of COGS is a pretty severe 'solution' to HUB access :)
    * In a FPGA, dual-port memory comes almost for free, but quad port does not.
    * Present opcodes have 9 bit fields, dictated by binary compatible operation.

    The memory-domain elasticity I can see in a P1V are things like
    a) Local Indirect Data inside a COG (can share with an adjacent COG almost for free) - see other thread on this.
    This is easy to add, in a FPGA, and keeps binary subset compatible operation.
    b) Some small HUB area that is N-Ported, somewhat costly, but it does remove HUB waits in that area.
    The key is to keep it small, for COG-COG messages.
    c) Add XIP HW to have transparent read of QuadSPI memory - Waits are longer here, but the memory is transparent to the user. Lots of it, just slower, and burst-reads would have lower waits than random reads.

    The indirect read HW can access all memory areas via the 32b pointer. What changes is the speed (waits).
    a) removes HUB waits, if those were just to store local data.
  • ElectrodudeElectrodude Posts: 1,658
    edited 2015-03-23 20:29
    Has anyone tried implementing the rotating hub selector on the P1 and adding a hub streamer and such? This makes sequential accesses take only 1 long/tick for every cog, but random accesses are still just as slow. I think Chip posted the Verilog for it somewhere a while back.
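
The sequential-versus-random difference can be shown with a toy model. This is only a sketch of one reading of the rotating selector, not Chip's Verilog: it assumes the hub RAM is split into one slice per cog by the low address bits, and that at tick t cog c is connected to slice (c + t) mod N. All names are hypothetical.

```python
# Toy model of a rotating hub selector (assumptions are marked).
N_SLICES = 8   # assumed: one RAM slice per cog, selected by the low address bits


def wait_ticks(cog, tick, long_addr):
    """Ticks the cog waits from `tick` until the slice holding long_addr reaches it."""
    wanted_slice = long_addr % N_SLICES
    # Assumed rotation: at tick t, cog c is connected to slice (c + t) % N_SLICES.
    return (wanted_slice - cog - tick) % N_SLICES


def transfer_ticks(cog, start_tick, addrs):
    """Total ticks for a cog to access the given long addresses in order."""
    tick = start_tick
    for addr in addrs:
        tick += wait_ticks(cog, tick, addr)   # wait for the slice to come round
        tick += 1                             # the access itself takes one tick
    return tick - start_tick


if __name__ == "__main__":
    print("16 sequential longs:", transfer_ticks(0, 0, range(16)), "ticks")
    print("same long 16 times: ", transfer_ticks(0, 0, [0] * 16), "ticks")
```

In this model a sequential burst always finds the next slice arriving just in time, while repeated or scattered addresses pay close to a full rotation each.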
  • ozpropdevozpropdev Posts: 2,792
    edited 2015-03-23 21:55
    AFAIK no Verilog code has been released of Chip's new rotating hub scheme.
    FYI the DE2-115 can support 1 x P1V with 416K hub ram.
    The Bemicro CV can support 1 x P1V with 128k hub ram. :)
  • jmgjmg Posts: 15,173
    edited 2015-03-23 22:54
    What's your estimate of what can fit in the new Terasic DE0-CV Board

    http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=167&No=921

    says

    Cyclone V 5CEBA4F23C7N Device
    49K Programmable Logic Elements
    3080 Kbits embedded memory
    4 Fractional PLLs
    1 Hard Memory Controllers
  • ozpropdevozpropdev Posts: 2,792
    edited 2015-03-24 02:35
    @jmg
    Initial test compiles show 288K hub ram is largest fit in DE0-CV.
    Device	5CEBA4F23C7
    Logic utilization (in ALMs)	7,733 / 18,480 ( 42 % )
    Total block memory bits	2,490,368 / 3,153,920 ( 79 % )
    
    Attempted 320K but Quartus reported the following error:
    Error (170048): Selected device has 308 RAM location(s) of type M10K block. However, the current design needs more than 308 to successfully fit
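
The reported numbers can be cross-checked with a little arithmetic. This is a sketch under two assumptions that are not stated in the post: each M10K block offers 8,192 usable data bits when its parity bits are not used, and the design needs 16 KB of block RAM besides the hub (for example eight 2 KB cog RAMs). The real fitter may pack memory differently.

```python
# Back-of-the-envelope check of the DE0-CV fit report (assumptions are marked).
M10K_RAW_BITS = 10_240    # 3,153,920 total bits / 308 blocks
M10K_DATA_BITS = 8_192    # assumed usable bits per block without parity bits
BLOCKS_AVAILABLE = 308    # from the Quartus error message
OTHER_RAM_KB = 16         # assumed: eight cogs with 2 KB of cog RAM each


def total_bits(hub_kb):
    """Block memory bits for the hub plus the assumed 16 KB of other RAM."""
    return (hub_kb + OTHER_RAM_KB) * 1024 * 8


def blocks_needed(hub_kb):
    """M10K blocks required, rounding up to whole blocks."""
    return -(-total_bits(hub_kb) // M10K_DATA_BITS)


def fits(hub_kb):
    return blocks_needed(hub_kb) <= BLOCKS_AVAILABLE


if __name__ == "__main__":
    for hub_kb in (288, 320):
        print(hub_kb, "KB hub:", total_bits(hub_kb), "bits,",
              blocks_needed(hub_kb), "blocks, fits:", fits(hub_kb))
```

Under these assumptions 288 KB of hub RAM gives exactly the 2,490,368 bits in the report, while 320 KB would still fit by raw bit count but not by block count, which is consistent with the error message.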
  • ElectrodudeElectrodude Posts: 1,658
    edited 2015-03-24 05:42
    ozpropdev wrote: »
    AFAIK no Verilog code has been released of Chip's new rotating hub scheme.

    Here's what I was thinking about: http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1267735#post1267735

    Now that I look at it again, I'm not sure exactly how much of the rotating hub selector it actually describes. It would probably help if I actually knew Verilog.
  • ozpropdevozpropdev Posts: 2,792
    edited 2015-03-24 06:25
    Thanks for the link Electrodude!
    I must have missed that one. :)

    BTW It is the complete hub system code.
  • Dave HeinDave Hein Posts: 6,347
    edited 2015-03-24 07:20
    What's the purpose of this thread? It's unlikely that a P1+ chip would ever be made, so why waste time discussing it?
  • mindrobotsmindrobots Posts: 6,506
    edited 2015-03-24 08:27
    Dave Hein wrote: »
    What's the purpose of this thread? It's unlikely that a P1+ chip would ever be made, so why waste time discussing it?

    There are no good Bed & Breakfasts at the summit of most mountains, so why keep climbing them? :D

    In today's computer environment, the processing power and graphic capabilities far exceed anything produced in the 70's and 80's so why build a retro-computer?

    We're hobbyists (mostly), it's what we do! As FPGA development board prices drop and on-board resources increase, I could see using custom FPGA P1V's for some projects. A Propeller Project board costs $25 and gives you a standard P1, a BE-Micro costs $35 and can give you a P1 with 64 I/O pins and up to 128K of HUB, not bad for $15 more? What is the P2 Project Board going to cost? If all you need is a few extra I/O pins, isn't the P1V-64 on a BE-Micro worth $35?
  • Heater.Heater. Posts: 21,230
    edited 2015-03-24 08:53
    Well, what is the point?

    Concepts for a better Propeller, or any thing else, are two a penny. Anyone can have concepts. We had a million of them on the interminably long P2 development thread that led nowhere.

    What the concepts need is a carefully worked out plan of how they fit in with the current architecture. Or even HDL that does it.

    Changing the addressing and hence instruction set is basically designing a different machine. It's not a Propeller anymore. Now you need to provide new software tools to use it. The assembler, the Spin interpreter, the C compiler.

    Not only is there no Bed and Breakfast on top of the mountain, you have just made your mountain ten times higher. Or perhaps dug yourself a big hole to climb out of first!

    Yes, mindrobots, I can see that having a P1 in an FPGA board may have its uses, and even cost and complexity benefits over other solutions. For example if you want to add that missing I/O port or somehow optimize the P1 performance or surround it with some custom logic that you would otherwise have to build onto a custom circuit board with logic chips. All good stuff.

    I just don't get major changes to the architecture. They just don't seem worth it.
  • mindrobotsmindrobots Posts: 6,506
    edited 2015-03-24 09:16
    The changes in this thread are radical and propose major changes in direction and architecture, as has been pointed out, but maybe they will give someone a P1V epiphany about something that isn't so radical, doesn't make the Propeller into something that isn't a Propeller, and could be built into the FPGA and used by the current tools.

    The only time wasted is the time one freely chooses to spend following it... If you think that's wasted time, then don't follow the thread, but don't find fault with those that spend their time following it just because you disagree.

    (any you's used above are the general you, not the specific you belonging to any particular previous poster.)
  • Dave HeinDave Hein Posts: 6,347
    edited 2015-03-24 09:30
    Perhaps the word "waste" was a bit too strong. Maybe it should have been "spend". I understand that developing a P1+ in an FPGA is interesting to some. I prefer to wait a few more months until Chip unveils the new and improved P2.
  • Heater.Heater. Posts: 21,230
    edited 2015-03-24 09:31
    mindrobots,

    When one means the general "you" rather than the specific "you" one can use "one" instead. That is what "one" is for and it makes things much clearer.

    Now, I suspect that your "one" was actually me so I have to say:

    Of course I follow such threads. One is always keen to hear new ideas. If someone has put up an idea in all seriousness it is worthy of consideration and comment, else why would they have bothered posting here? Not all comments need be positive, some could be challenging. It's all about the debate. That is what we mean by "intellectual discourse". Have you ever wondered why a PhD student has to "defend" his thesis? It's very adversarial that way.

    Ultimately we hope it all leads to good stuff, as you say.
  • LoopyBytelooseLoopyByteloose Posts: 12,537
    edited 2015-03-24 09:40
    Happy to see a review of possible new concepts. I tend to think that the European Propeller users think a bit more deeply about these things than Americans, while the Aussies are fantastic at exploiting every bit of what they can get their hands on.

    Yes, the economics of actually getting to product are a barrier. But Heater recently posted a link about Mr. Peddle of how the 6502 evolved from the 6800. There was a sudden breakthrough in production costs, and people took advantage of it. You just never know when an opportunity might occur, and 'opportunity favors the prepared'.

    I am certainly not going to rack my brains over each and every alternative Propeller architecture that is proposed, but there is real creative effort involved. So please don't discourage those efforts.

    And yes, the P2 means a bit more to us... just because we long for more resources, not a redeployment of limits we have already suffered.
  • mindrobotsmindrobots Posts: 6,506
    edited 2015-03-24 09:44
    Heater. wrote: »
    mindrobots,

    When one means the general "you" rather than the specific "you" one can use "one" instead. That is what "one" is for and it makes things much clearer.

    Now, I suspect that your "one" was actually me so I have to say:

    Of course I follow such threads. One is always keen to hear new ideas. If someone has put up an idea in all seriousness it is worthy of consideration and comment, else why would they have bothered posting here? Not all comments need be positive, some could be challenging. It's all about the debate. That is what we mean by "intellectual discourse". Have you ever wondered why a PhD student has to "defend" his thesis? It's very adversarial that way.

    Ultimately we hope it all leads to good stuff, as you say.

    OK, you one! :D (darn spell checkers!!)

    My defense:
    We're hobbyists (mostly), it's what we do! As FPGA development board prices drop and on-board resources increase, I could see using custom FPGA P1V's for some projects. A Propeller Project board costs $25 and gives you a standard P1, a BE-Micro costs $35 and can give you a P1 with 64 I/O pins and up to 128K of HUB, not bad for $15 more? What is the P2 Project Board going to cost? If all you need is a few extra I/O pins, isn't the P1V-64 on a BE-Micro worth $35?

    My secondary defense:
    maybe they will give someone a P1V epiphany about something that isn't so radical and doesn't make the Propeller into something that isn't a Propeller and could be built into the FPGA and used by the current tools.
    Heater. wrote: »
    Have you ever wondered why a PhD student has to "defend" his thesis?
    Not really. By the way, shouldn't that be "their" thesis instead of "his" thesis? I believe they have recently begun to allow females to pursue PhDs. :D
  • Heater.Heater. Posts: 21,230
    edited 2015-03-24 09:59
    mindrobots,

    I kind'a sort'a agree with your defense. Well, both of them. I did say above that I can see the point in a P1 core in an FPGA plus some performance tweaks or external functionality added.

    As to your second defense, here we go. My new concept for a better Propeller is as follows:

    1) It's a 64 bit machine.

    2) It has at least 16 COGs.

    3) Each COG is a RISC-V architecture with 1 megabyte of local RAM

    4) There are multiple megs of shared FLASH and RAM.

    5) It makes use of clocked I/O and SERDES tightly coupled to the COGS for the high speed real world interfacing.

    This is a cool concept because there is a Bed and Breakfast at the top of the mountain. There already exists a GCC C/C++ compiler for RISC-V.
    mindrobots wrote: »
    ...shouldn't that be "their" thesis instead of "his" thesis?...
    No. We don't have any truck with any of that political correctness nonsense around here.
  • jmgjmg Posts: 15,173
    edited 2015-03-24 11:59
    Heater. wrote: »
    3) Each COG is a RISC-V architecture with 1 megabyte of local RAM

    Taking that angle, are there any numbers for a RISC-V on a Cyclone V ?
  • KaioKaio Posts: 257
    edited 2015-03-24 15:54
    jmg wrote: »
    * Slashing the number of COGS is a pretty severe 'solution' to HUB access :)
    * In a FPGA, dual-port memory comes almost for free, but quad port does not.
    * Present opcodes have 9 bit fields, dictated by binary compatible operation.
    jmg, thanks for your response and your hints.

    It was not my goal to reduce the number of COGS. It's still the result of using n-port RAM. ;-)
    You are right, and I know that quad-port memory is currently not standard in FPGAs. But you can do it with some lines of code.
    --> Advanced Synthesis Cookbook

    No changes to the instructions are necessary for those concepts. As you know, AUGS and AUGDS can help to extend the D and S fields and avoid the 9 bit limitation.
    jmg wrote: »
    The memory-domain elasticity I can see in a P1V are things like
    a) Local Indirect Data inside a COG (can share with an adjacent COG almost for free) - see other thread on this.
    This is easy to add, in a FPGA, and keeps binary subset compatible operation.
    b) Some small HUB area that is N-Ported, somewhat costly, but it does remove HUB waits in that area.
    The key is to keep it small, for COG-COG messages.
    c) Add XIP HW to have transparent read of QuadSPI memory - Waits are longer here, but the memory is transparent to the user. Lots of it, just slower, and burst-reads would have lower waits than random reads.

    The indirect read HW can access via the 32b pointer, all memory areas, What changes is the speed (waits).
    a) removes HUB waits, if those were just to store local data.
    I'll check this.

    Thanks!
  • KaioKaio Posts: 257
    edited 2015-03-24 16:19
    mindrobots wrote: »
    The changes in this thread are radical and propose major changes in direction and architecture, as has been pointed out, but maybe they will give someone a P1V epiphany about something that isn't so radical, doesn't make the Propeller into something that isn't a Propeller, and could be built into the FPGA and used by the current tools.
    That's exactly what this thread was intended for. I have only proposed some ideas that came to me last weekend. Perhaps someone will find it motivating to see another way. Others are still waiting for the P2.

    Thank you!
  • evanhevanh Posts: 15,920
    edited 2015-03-24 17:38
    By having large cores, especially when dual/quad porting the local memory, the shared single ported on-chip performance memory, HubRAM in the case of the Propeller, becomes much smaller than it could have been. You can't avoid the space constraints of a single die.

    On a side note, it'll be interesting to see what nVidia can pull off with their push into "stacked" RAM. I hope they give some details like the number of interconnect wires for example.
  • jmgjmg Posts: 15,173
    edited 2015-03-24 17:47
    evanh wrote: »
    By having large cores, especially when dual/quad porting the local memory, the shared single ported on-chip performance memory, HubRAM in the case of the Propeller, becomes much smaller than it could have been.

    That's true in an ASIC, but FPGAs move things around a little, and there, Dual port has very little additional cost, as the BlockRAMS are inherently dual port.
    QuadPort does cost, as you overlay two dual ports to emulate that - but provided it is kept small, there could be a place for N-port message memory.
  • evanhevanh Posts: 15,920
    edited 2015-03-24 17:53
    Yeah, well, when reading about 64-bit cores with megabytes of core ram and multi-megabytes of shared I kind of figured FPGAs as only being the test bed.
  • jmgjmg Posts: 15,173
    edited 2015-03-24 18:03
    evanh wrote: »
    Yeah, well, when reading about 64-bit cores with megabytes of core ram and multi-megabytes of shared I kind of figured FPGAs as only being the test bed.
    hehe, all it needs is someone to fund that step.
    Meanwhile, FPGA's are getting ever-cheaper and there is a place for more modest, FPGA achievable targets.
    Given current price-curves, I can see a place for a Prop1 and a P1V on a small module.
    Such a module could even migrate to P2 with relative ease, when that becomes a disti-part-code.
  • Cluso99Cluso99 Posts: 18,069
    edited 2015-03-24 18:07
    IMHO any ideas are good. If they aren't proposed then there is no discussion and the thread dies. The only time wasted is in the ensuing discussion.

    What is a waste is the criticism of posting the ideas - it just wastes thread space where objective discussion of the ideas can be lost in the quagmire. So, instead of wasting time criticising each other, how about you "remain silent" or discuss the idea. Sometimes brilliant alternatives come out of ideas.

    The concept of sharing "some" of the hub ram with some "cogs" seems like a good objective. However, making the hub totally 2-port memory wastes a lot of silicon which reduces the amount of hub available.

    But the concept of having larger cog space is music to my ears, especially if some of this is shared with hub ram. Accessing extended cog space is easy enough with just a couple of extra instructions (I have shown this before). Basically it is simpler than hubexec, but would flow on from hubexec easily.

    However, I don't want to see fewer cogs on a P1. So what about some alternatives...

    One big P1 improvement would be to have 1-clock hub access - immediately doubles hub performance to 1:8 instead of 1:16
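
To put numbers on the 1:16 versus 1:8 ratio: with 8 cogs taking turns, the time between a cog's hub windows is 8 times the clocks each window takes. The sketch below assumes an 80 MHz system clock (a common P1 setting, not stated in the post) and one long transferred per window; the function names are hypothetical.

```python
# Peak per-cog hub throughput for a round-robin hub (illustrative arithmetic only).
COGS = 8


def window_period(clocks_per_window):
    """System clocks between two hub windows of the same cog."""
    return COGS * clocks_per_window


def peak_longs_per_second(clock_hz, clocks_per_window):
    """Upper bound on hub longs per second for one cog, one long per window."""
    return clock_hz // window_period(clocks_per_window)


if __name__ == "__main__":
    clock_hz = 80_000_000   # assumed system clock
    for clocks in (2, 1):   # 2 clocks per window gives 1:16, 1 clock gives 1:8
        longs = peak_longs_per_second(clock_hz, clocks)
        print(f"1:{window_period(clocks)} -> {longs:,} longs/s "
              f"({longs * 4 / 1e6:.0f} MB/s) per cog")
```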

    How about a P1 with...
    * MUL and perhaps DIV
    * Video only on 2 cogs (seems a waste of silicon on all cogs)
    * A simple serial (not a uart, just a shifter in and a shifter out, clocked by counter or by external pin) on 6 cogs (replacing the video)
    * Simple hubexec that also works for extended cog ram
    * 4x 2KB or 4KB RAM blocks shared between adjacent pairs of cogs (as extended cog space) with option for
    -- dedicated to either cog only at full speed
    -- shared by both cogs at half speed (ie 1:2 clock access round-robin for deterministic use)
    * 64KB minimum hub RAM (more preferred), single clock (1:8 access)
    * ROM code sufficient to boot a cog (ie no ROM SIN/LOG/FONT tables as in original P1 - these can be softloaded, as can SPIN)
  • jmgjmg Posts: 15,173
    edited 2015-03-24 18:32
    Cluso99 wrote: »
    However, I don't want to see fewer cogs on a P1. So what about some alternatives...
    Do you mean Silicon, or P1V ?
    The limit to COGS in a FPGA is financial - you can buy 8 COGS now in a P1 for under $1/COG
    Cluso99 wrote: »
    One big P1 improvement would be to have 1-clock hub access - immediately doubles hub performance to 1:8 instead of 1:16

    How about a P1 with...
    * MUL and perhaps DIV
    * Video only on 2 cogs (seems a waste of silicon on all cogs)
    * A simple serial (not a uart, just a shifter in and a shifter out, clocked by counter or by external pin) on 6 cogs (replacing the video)
    * Simple hubexec that also works for extended cog ram
    * 4x 2KB or 4KB RAM blocks shared between adjacent pairs of cogs (as extended cog space) with option for
    -- dedicated to either cog only at full speed
    -- shared by both cogs at half speed (ie 1:2 clock access round-robin for deterministic use)
    * 64KB minimum hub RAM (more preferred), single clock (1:8 access)
    * ROM code sufficient to boot a cog (ie no ROM SIN/LOG/FONT tables as in original P1 - these can be softloaded, as can SPIN)

    Much of that laundry list is in the latest P1Vs with conditional builds, and there is a lot of 'natural resource' that comes for free in modern small FPGAs.

    MUL is one example, PLLs are another.

    Dual port memory is also free, so "shared by both cogs at half speed (ie 1:2 clock access round-robin for deterministic use)" may not be needed.

    I like the idea of an Indirect-with-wait * approach to all the memory areas: then local COG memory has no added waits, COG-COG / Local Array memory likely has no added waits either, HUB access would be hub-slot paced (new 1:8?), and off-chip access to QuadSPI (x1 or x2) can be HW managed to ~1-2 dozen clocks random access, and less with natural support for block moves.

    Indirect-with-wait would likely be a 2 cycle opcode (min), as it needs to decode and read the index, then do the fetch.

    * I detailed the possible (almost free) bit-level mapping of a @ Rn with ++/-- options here - via the upper 4 bits currently unused in FPGAs.

    http://forums.parallax.com/showthread.php/160278-The-need-for-CTRx-addressable-registers?p=1321602&viewfull=1#post1321602
  • potatoheadpotatohead Posts: 10,261
    edited 2015-03-24 19:38
    I personally would love to see a P1 image with the egg beater running, no other changes, unless they are needed to make the HUB work.

    We had a lot of discussion on the implications of that which has yet to play out.

    I know the P2 includes or is set to include fifo, etc... Maybe that renders just the egg beater a waste of time.

    It would be compelling to write some P1 code to better understand what it means. :p
  • jmgjmg Posts: 15,173
    edited 2015-03-24 20:02
    potatohead wrote: »
    I personally would love to see a P1 image with the egg beater running, no other changes, unless they are needed to make the HUB work.

    There are likely to be quite a few 'other changes' to get that working.

    Such a design works best when the burst nature of the HW can be tapped, but with a 4 cycle opcode, and no auto-inc, the core is a lot slower than the rotating LSB selector. So much slower, that just moving from 16:1 to 8:1 HUB is likely to be a more useful change.

    Video, via an optional CLUT, was one use that did use the full bandwidth of the rotating LSB selector, but even that needed a local FIFO for clock rate smoothing.