What would you want more of, cogs or RAM? - Page 3 — Parallax Forums

What would you want more of, cogs or RAM?


Comments

  • Gadgetman Posts: 2,436
    edited 2006-11-25 23:01
    Because now each instruction only takes ONE clock pulse to execute instead of the four on the current Propeller, which quadruples the speed, and the clock-speed is also doubled...

    I'm not certain, but my guess is that 5V is avoided because it adds speed constraints.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Don't visit my new website...
  • parsko Posts: 501
    edited 2006-11-26 00:30
    Martin Hebel said...
    While I don't think the plan is to bring port B out to the I/O pins with this version, it would be great if it were available internally to allow a 32-bit bus between cogs for inter-cog communications.

    Great discussions,
    Martin

    Martin, you read my mind! I had remembered just that reading through the posts, then got to yours... I remember someone mentioning this a few months back, something of a "ghost" register. I haven't thought of how you would then coordinate the pin outputs if it was external. But internal use would be nice, I'd second that suggestion.

    -Parsko
  • lairdt Posts: 36
    edited 2006-11-26 00:51
    8 (faster) cogs and 256k RAM, would prefer 1024 word cog RAM as well.
  • Gavin Posts: 134
    edited 2006-11-26 01:08
    How about moving the current part to faster technologies?
    Leave everything the same except make the cogs run at 160mips.
    Just thinking it might be faster to get silicon so we can all get more power sooner. No major design change except shrinking the micron size. If the die gets smaller it should be cheaper than the current part.
    We can then call it the Turboprop and have a Supercharged Hydra.

    Gavin
  • IanM Posts: 40
    edited 2006-11-26 02:41
    Looks like a lot of people are voting more RAM. I would prefer more cogs. It is the cogs that distinguish the prop from all other uCs. 16 cogs would simplify (not complicate) a lot of applications. It doesn't seem to fit the prop philosophy if the next version just adds more memory and speed. If you need that much more memory you can go off chip or perhaps the prop is not the best option.

    However, whichever wins, looking forward to it!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Ian Mitchell
    www.research.utas.edu.au
  • lairdt Posts: 36
    edited 2006-11-26 03:12
    Coming from a MCU with 16MB RAM/EEPROM addressing, I don't think the 256k increase is too much to ask for in Propeller v2.0. Coupled with the added processing power of 8 cogs however, not having 256k is a real problem. It would only get worse in a system with 8 faster or 16 total cogs.
  • Matthew Hay Posts: 63
    edited 2006-11-26 03:42
    Okay, unfortunately I've never used the prop (only used the BS2, 8052, and Atmel Butterfly) but I'd say go for the 8/256K.

    Also I'd go for adding in the ability to run multiple chips, i.e. have a master chip and slave chips. You could have the slave chips act as extensions of the master (i.e. more cogs / I/O pins). With a little creative programming you could almost do that now (though not having used the chip I can't say for sure).

    Anyway, that's just a crazy idea I had, though I'm not sure how hard it would be to build that into a chip.

    -Matt Hay
  • Tracy Allen Posts: 6,656
    edited 2006-11-26 03:47
    I'd vote for 8 COGs and 256k RAM, with 8 cycles per hub access. Especially when combined with the pipeline for one instruction per cycle.

    Do branches take 4 cycles, when the pipeline has to be emptied and refilled?

    Would the 256 kbytes (loaded at startup from a 256k EEPROM) be available for Spin objects, images of Propasm programs, and for data processing? Just checking that the 256k hub RAM is not banked or limited in some way.

    The bandwidth selection ideas are intriguing, but it sounds like it might be a headache to document and support. There is something very easy to grasp about 1:8 KISS.

    I second the idea of implementing the port registers for the 64 pin device, if simply as a side door method of communicating between cogs. Or a port-independent register of that sort, that follows the same control rules.

    Also I second the idea of extended COG counters, to include the increment on hub read or write. There are other selections I'd like for input and output on the cog counters. For example, the capability to allow a selection of output from PHS or CARRY in the DETector modes. But that is another topic.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Tracy Allen
    www.emesystems.com
  • Mike Green Posts: 23,101
    edited 2006-11-26 04:29
    Sorry I've been away from the discussion ... I vote for 8 cogs and 256K. The faster cogs and hub access will allow more functionality with fewer cogs. I like the idea of adjustable hub access bandwidth with common-case defaults. Most of the time, the adjustable access won't be necessary, but it will save a cog or two occasionally (and the associated complexity) when the issue is indeed hub access bandwidth.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-26 05:10
    Chip Gracey said...
    Maybe when a cog is launched, its hub-access requirement could be stated, and then the launch would pass/fail based not just on whether or not a cog was available, but also on whether or not a requested-bandwidth hub slot was available. For example, you could have 1:4 being the highest, then 1:8, 1:16, and finally 1:32. Every program should use the lowest-possible setting. It would take only a bit of logic in the hub to negotiate the setup requests and then serve them deterministically thereafter.
    Chip, I really like the idea; but it could get a little tricky, depending on the order in which the requests are made. For optimal "time-packing", you'd want the most demanding cogs assigned first and the least-demanding last. But there's no guarantee that that's the order in which the requests will come. And you can't jiggle things after the fact, since it'll throw the timing off for cogs that've already been assigned.

    Did you already have an algorithm in mind for this? If so, I'm very curious what it might be.

    Also, would it be possible for a cog to request a different access priority after being launched? This would enable more efficient bandwidth sharing when rapid hub access is needed only in short bursts.

    -Phil
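
The ordering concern above can be made concrete with a small Python sketch. Everything here is an assumption for illustration: a 32-slot hub rotation, requests expressed as a period N (one access every N clocks, as in the 1:4 to 1:32 settings Chip lists), and a naive first-fit search for a free offset. It is not a description of any actual hub logic.

```python
WHEEL = 32  # hub slots per rotation (assumed for this sketch)

def allocate(requests):
    """First-fit placement of periodic hub slots.

    requests: list of periods N, one per cog, meaning the cog wants one
    slot every N clocks. Returns the offset chosen for each cog, or None
    where the request could not be satisfied.
    """
    owner = [None] * WHEEL
    results = []
    for cog, period in enumerate(requests):
        chosen = None
        for offset in range(period):
            slots = range(offset, WHEEL, period)
            if all(owner[s] is None for s in slots):
                for s in slots:
                    owner[s] = cog
                chosen = offset
                break
        results.append(chosen)
    return results

# Same five requests, two different launch orders.
demanding_first = allocate([4, 32, 32, 32, 32])
demanding_last = allocate([32, 32, 32, 32, 4])
print(demanding_first)  # [0, 1, 2, 3, 5]
print(demanding_last)   # [0, 1, 2, 3, None]
```

In the second order only 4 of the 32 slots are in use, yet the 1:4 request fails because the four light users happen to sit in all four residue classes. A smarter placement could avoid this particular case, but not once cogs have stopped and relaunched in arbitrary order.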
  • power mousey Posts: 2
    edited 2006-11-26 06:04
    hey Chip,

    how about both options. true.

    and also a third option....up to 16 cogs active and with a maximum of 256k of ram. also, maybe expand the rom too.

    also how about this?: have the capability in the hardware and software of the propeller chip for sharing and using some of the cogs' (other cogs') general purpose ram for some of the program code, memory for extra and fast memory and some of the code in their registers.

    for example: use a few cogs in an application...launch a few other cogs and use their general purpose ram for some of the code and data too. yet, even though these cogs are launched and active...their registers are used for some of the code and data.

    cheers,

    power mousey
  • cgracey Posts: 14,133
    edited 2006-11-26 06:14
    Phil Pilgrim (PhiPi) said...

    Chip, I really like the idea; but it could get a little tricky, depending on the order in which the requests are made. For optimal "time-packing", you'd want the most demanding cogs assigned first and the least-demanding last. But there's no guarantee that that's the order in which the requests will come. And you can't jiggle things after the fact, since it'll throw the timing off for cogs that've already been assigned.

    Did you already have an algorithm in mind for this? If so, I'm very curious what it might be.

    Also, would it be possible for a cog to request a different access priority after being launched? This would enable more efficient bandwidth sharing when rapid hub access is needed only in short bursts.

    -Phil
    Yes, in thinking more about it, it seems it would be hard to avoid fragmentation, especially after a few cogs have re-launched. I think it very quickly becomes a "memory management" type of problem, to which there is no (simple?) solution. Like you said, you can't reassign a cog's time-slot on the fly because it could potentially destroy its established function. About setting bandwidth during runtime: Any cog asking for more bandwidth probably needs it, and what if it can't get it? The more I program the Propeller, the more timing-centric everything is becoming. Timing and function are more often than not inseparable concepts. Potatohead pointed out something similar to this. I'm convinced that anything that introduces indeterminacy into timing is really poisonous. Determinism has that wonderful KISS quality, which is always right.

    BTW, here's what Potatohead wrote (red text is critical):

    I've dealt with high end applications for a lot of years. Many of these were running on SGI NUMA machines. Interesting philosophy that turned out to be very true in a lotta cases: Any compute problem, properly coded, becomes an I/O problem.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 11/26/2006 7:12:59 AM GMT
  • Bill Henning Posts: 6,445
    edited 2006-11-26 07:14
    Chip, I have a potentially interesting idea for allocating bandwidth.

    Basic assumptions:

    8 cogs competing for memory time slices.

    80 MHz HUB RAM speed (12.5 ns)

    Why not have a set of special registers in the hub memory that allocated timing slots?

    For yucks, let's use an 80 entry table

    by default, the table is filled as follows:

    0
    1
    2
    3
    4
    5
    6
    7
    0
    1
    2
    3
    4
    5
    6
    7
    ... until the end of the table

    HOWEVER

    the table can be re-written under cog control!

    Mind you, the table would have to be insanely fast - say 2 ns or less access speed

    That way, the memory access would be TOTALLY soft, totally programmable

    I would NOT want the entries packed, and it may be best to have the entries be a bit mask by bit position as it would simplify the decode logic (and make it a bit faster) - we can live with the limit of 32 cogs sharing hub memory that it would impose on a 32 bit prop.

    What do you think?

    A slightly more elaborate version would have every *second* slot or every fourth slot fixed, to guarantee a minimum certain bandwidth per cog.
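
As a rough illustration of the table idea, here is a Python sketch. The 80-entry size and the round-robin default come from the post; the helper names and the choice of which entries to rewrite are made up for the example.

```python
from collections import Counter

TABLE_SIZE = 80   # entries, as proposed above
NUM_COGS = 8

def default_table():
    """Round-robin default: 0, 1, ..., 7, 0, 1, ... until the table is full."""
    return [i % NUM_COGS for i in range(TABLE_SIZE)]

def worst_case_gap(table, cog):
    """Longest wait, in slots, between two accesses by `cog` (wrapping around)."""
    positions = [i for i, c in enumerate(table) if c == cog]
    if not positions:
        return None  # this cog never reaches the hub
    gaps = [
        (positions[(k + 1) % len(positions)] - p) % len(table) or len(table)
        for k, p in enumerate(positions)
    ]
    return max(gaps)

table = default_table()
default_gap = worst_case_gap(table, 3)

# Rewrite the table "under cog control": cog 0 takes every entry of cog 4.
table = [0 if c == 4 else c for c in table]
share = Counter(table)
print(default_gap, share[0], worst_case_gap(table, 0), worst_case_gap(table, 4))
```

With these assumptions cog 0 ends up with 20 of the 80 entries and a worst-case wait of 4 slots instead of 8, while cog 4 has no hub access at all. That starvation case is what the fixed every-second or every-fourth slot variant would rule out.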

    EDIT:

    A potentially easier/better idea:

    Allow for 32 max potential timing slots

    A new hub instruction, called SCHEDMEM, could be used to request a bitmask of timing slots; giving up slots normally scheduled for a cog that it did not want, and trying to allocate ones it wanted

    It could return the allocated slots

    so the default configuration with eight cogs would be something like

    cog0: 10000000100000001000000010000000
    cog1: 01000000010000000100000001000000
    cog2: 00100000001000000010000000100000
    cog3: 00010000000100000001000000010000
    cog4: 00001000000010000000100000001000
    cog5: 00000100000001000000010000000100
    cog6: 00000010000000100000001000000010
    cog7: 00000001000000010000000100000001

    the above would be the default access mask for the cogs, for the current behaviour

    however say cog4 only needed/wanted one hub access cycle in every 32 possible cycles

    it could release three cycles!

    there could be a globally available hub register showing currently allocated memory slots

    RAMUSED: 000100010010000111000001000000

    a cog could then tell what cycles it can request

    every time a cog released its slot, it would become available for another cog

    Btw, this is also easier to implement in gates than the time slot registers i suggested above

    Post Edited (Bill Henning) : 11/26/2006 7:25:18 AM GMT
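
A minimal Python sketch of the bitmask scheme, under stated assumptions: SCHEDMEM is the hypothetical instruction named in the post, modeled here as a method that swaps a cog's current slots for the requested ones wherever no other cog holds them. The leftmost character of each printed mask is slot 0, matching the listing above. The grant policy is a guess, since the post does not define it.

```python
SLOTS = 32
FULL = (1 << SLOTS) - 1

def default_mask(cog, num_cogs=8):
    """One slot every num_cogs positions; the leftmost bit is slot 0."""
    mask = 0
    for slot in range(cog, SLOTS, num_cogs):
        mask |= 1 << (SLOTS - 1 - slot)
    return mask

def show(mask):
    return format(mask, "032b")

class Hub:
    def __init__(self, num_cogs=8):
        self.masks = [default_mask(c, num_cogs) for c in range(num_cogs)]

    def ramused(self):
        """OR of all cog masks, like the RAMUSED register described above."""
        used = 0
        for m in self.masks:
            used |= m
        return used

    def schedmem(self, cog, wanted):
        """Hypothetical SCHEDMEM: slots not in `wanted` are released,
        wanted slots held by other cogs are refused, and the mask that
        was actually allocated is returned."""
        others = self.ramused() & ~self.masks[cog]
        granted = wanted & ~others & FULL
        self.masks[cog] = granted
        return granted

hub = Hub()
kept = hub.schedmem(4, 1 << (SLOTS - 1 - 4))   # cog 4 keeps only slot 4
used_after_release = bin(hub.ramused()).count("1")
granted = hub.schedmem(0, default_mask(0) | default_mask(4))
print(show(kept))
print(show(granted))
```

Here cog 4 drops to one access per 32 slots and cog 0 picks up the three slots it released, ending with 7 of 32. Cog 0 also asked for slot 4 and was refused, which is the kind of partial grant a cog would have to check for.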
  • cgracey Posts: 14,133
    edited 2006-11-26 07:36
    Bill,

    That would certainly be flexible, and if you knew exactly what you wanted for the whole system, it would be ideal. But, if cogs spawning from objects are trying to set up their own requirements, they could be clobbering the schedules of others. That whole thing might have to be locked for individual cog access. I could see a lot of cog code getting spent on iffy setup procedures. Do you know what I mean? This would preclude unknown cog schemes from deterministically starting with their bandwidth requirements under an RTOS' control. The RTOS would have to have some data on what the cog needed so that it, alone, could set it up. Nobody else had better interfere, either.


    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 11/26/2006 7:40:26 AM GMT
  • potatohead Posts: 10,254
    edited 2006-11-26 08:02
    Given this, I'm thinking this is not a good path. Enough complexity has been brought up to totally validate Phil's point --and it's not even fully defined yet.

    I guess this means the current symmetry in the design is another one of those don't touch items as well..
  • Bill Henning Posts: 6,445
    edited 2006-11-26 09:14
    I figured spin or a hypothetical rtos would manage bandwidth allocation roughly as follows:

    - a cog may give up any of its default allocation

    - a cog may request any slice, and for any requested slice, if it is not allocated, the request is granted

    - one way to enforce fairness is to hardwire one access in every 32 slots to a cog - if it asks for it or not; ie on startup, each cog gets one access cycle every 32 clocks (or if the default is 16 it matches the current scheme, while leaving 16 slices up for grabs for bw hungry apps!)

    - and cogs that did not need access every 1/16 slots would be allowed to give up ONE of their two slots

    - this scheme would also work for a 16 cog prop, but by default each cog would only get two time slices at hub access per 32 cycles

    ie

    *---*---*---*---*---*---*---*---
    0---1---2---3---4---5---6---7---

    slots with * are hard locked, may not be given up

    by default each cog gets one cycle in a 32 stroke "wheel"

    24 slots not allocated

    only unallocated slots may be claimed by cogs

    cogs can only free up claimed slots, not the "hard" slot they are initially allocated

    I agree it's non-trivial, but say a "hyper" speed app could claim 24 slots out of 32 if it needed it!
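
The rules listed above are simple enough to write down directly. This Python sketch assumes the 8-cog layout from the diagram (one hard-locked slot per cog, every fourth position on a 32-slot wheel). The class and method names are invented for the example.

```python
WHEEL = 32
NUM_COGS = 8
STRIDE = WHEEL // NUM_COGS   # a hard slot every 4 positions

class Wheel:
    def __init__(self):
        self.owner = [None] * WHEEL
        self.hard = [False] * WHEEL
        for cog in range(NUM_COGS):
            slot = cog * STRIDE
            self.owner[slot] = cog
            self.hard[slot] = True

    def claim(self, cog, slot):
        """Grant the slot only if nobody holds it."""
        if self.owner[slot] is None:
            self.owner[slot] = cog
            return True
        return False

    def release(self, cog, slot):
        """Free a claimed slot; hard slots can never be given up."""
        if self.owner[slot] == cog and not self.hard[slot]:
            self.owner[slot] = None
            return True
        return False

    def free_slots(self):
        return [s for s in range(WHEEL) if self.owner[s] is None]

    def picture(self):
        return "".join("-" if o is None else str(o) for o in self.owner)

w = Wheel()
start_picture = w.picture()
free_at_start = len(w.free_slots())
claimed = sum(w.claim(2, s) for s in w.free_slots())   # a "hyper" cog 2
late_claim = w.claim(5, 1)          # already taken by cog 2
drop_hard = w.release(2, 8)         # slot 8 is cog 2's hard slot
drop_soft = w.release(2, 9)         # a claimed slot can be returned
print(start_picture, free_at_start, claimed, late_claim, drop_hard, drop_soft)
```

Even in this toy version the first cog to ask can take all 24 open slots, so whoever launches first decides what everyone else can get.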




    Chip Gracey (Parallax) said...

    Bill,

    That would certainly be flexible, and if you knew exactly what you wanted for the whole system, it would be ideal. But, if cogs spawning from objects are trying to set up their own requirements, they could be clobbering the schedules of others. That whole thing might have to be locked for individual cog access. I could see a lot of cog code getting spent on iffy setup procedures. Do you know what I mean? This would preclude unknown cog schemes from deterministically starting with their bandwidth requirements under an RTOS' control. The RTOS would have to have some data on what the cog needed so that it, alone, could set it up. Nobody else had better interfere, either.


    Post Edited (Bill Henning) : 11/26/2006 9:21:59 AM GMT
  • Peter Jakacki Posts: 10,193
    edited 2006-11-26 11:50
    There is so much information here that I am afraid that the really useful and practical to implement stuff might get buried. From what I understand of silicon, digital design, embedded hardware and software requirements, and perhaps even the market viability part, I am offering my 2 cents' worth. Remember, most of us aren't all-round gurus, we each have our own experience, requirements, and expectations. Together it should make one mean pie.

    COG MEMORY
    I understand the limitation that you have with 512 longs/cog is directly related to the KISS/FAST instruction decode and cannot be changed without changing the code/cpu itself or extending the instruction width longitudinally perhaps to 40 bits or more (I won't mention banking). Ok, so we are stuck with 512 longs per cog, let's work from there.

    MORE COGS? YES! HOW?
    Someone mentioned why have multiple video registers when all we really need is one (even more) and that can be accessed centrally as part of the main memory map. That may not have been practical on the original but could indeed be with the proposed new design. If this is the case then I would like to see a simple 8/16/32-bit SPI-like interface on each cog as this would not take up any more silicon than the current video generator would. The use of SPI would permit pchips to communicate with other pchips effectively and efficiently and because we would have at least 8 SPI interfaces per chip that means we could connect them in the most suitable fashion, whether that be a simple chip to chip or a transputer like connection where they are connected in a 2D XY matrix or perhaps even a 3D matrix if we start getting really fancy.

    16 cogs would seem an advantage at first but not when an 8-cog chip could access more main memory much faster, remember, we only have 512 longs, we need efficient access to that main memory. Consider this, if 16 cogs would be beneficial then why not 32 or 64? It seems that at some point we run into a barrier with adding more cogs as there is no efficient inter-chip communications method, so I suggest the SPI-like method and simply add more chips when we need more cogs.

    DETERMINISTIC
    Keep the main memory access deterministic, even though my original thought when I first played with the pchip was: why didn't they have programmable access? The approach to mux'ing main memory access may seem a little bit plain and simple but it works. The Spin development environment of creating sharable objects is part of the success and ease of use of the Propeller. Imagine if one object required a certain type of access and another object required something different and you tried to use these objects together with one hogging what it needs but not leaving enough for what the other requires, or hand-tweaking the application, no thanks!

    KEEP US IN THE LOOP
    Make advance information available, even if it is tentative. As you know, there is a long evaluate/prototype/evaluate/development/production/whatever cycle etc in most commercial products. Having advance information plus the experience with working with the pchip now, plus the fact that we won't look elsewhere while we are in expectation and salivating means that Propeller II can expect a much faster end-user utilization, and a shorter development cost amortization than perhaps has been experienced with the original. We want you guys to stay in business.

    There are plenty of other good suggestions, some pie-in-the-sky, but there is only so much time and money available and these few things that I have outlined seem in my opinion both desirable and doable.

    *Peter*
  • parsko Posts: 501
    edited 2006-11-26 11:53
                                    NOW (8/32)      8/256           16/128
    Total Cog RAM (bytes)           16384           16384           32768
    Global RAM (bytes)              32k             256k (8x)       128k (4x)
    Hub Access Time                 1/16 = 200ns    1/8 = 100ns     1/16 = 200ns
    PASSY Command Execution Time    4clk = 50ns     1clk = 12.5ns   ??????
    Clock Speed (MHz/MIPS)          80/20           80/(160?)       80/(160?)

    Did I miss anything important? Did I get something wrong?

    After having a night to sleep on it, one important thing, to me (and I think likely Cliff and/or KaosKidd too) is the Total Cog Ram. Don't we gain some hidden benefits with having double the amount of COG ram available? Especially if COG-COG communication is faster...?

    -Parsko

    Post Edited (parsko) : 11/26/2006 2:27:29 PM GMT
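
The figures in the table can be checked with a few lines of Python. The 512 longs per cog and the 80 MHz clock come from the thread; the function and its parameter names are just for this sketch. Note that 1 clock = 12.5 ns corresponds to 80 MHz, which gives 80 MIPS per cog. The 160 MIPS figure only follows if the clock is also doubled, as suggested at the top of this page.

```python
LONGS_PER_COG = 512
BYTES_PER_LONG = 4

def summarize(cogs, hub_window_clocks, clocks_per_instruction,
              clock_hz=80_000_000):
    """Derive the table rows from the basic parameters."""
    clock_ns = 1e9 / clock_hz
    return {
        "total_cog_ram_bytes": cogs * LONGS_PER_COG * BYTES_PER_LONG,
        "hub_access_ns": hub_window_clocks * clock_ns,
        "instruction_ns": clocks_per_instruction * clock_ns,
        "mips_per_cog": clock_hz / clocks_per_instruction / 1e6,
    }

now = summarize(cogs=8, hub_window_clocks=16, clocks_per_instruction=4)
opt_8_256 = summarize(cogs=8, hub_window_clocks=8, clocks_per_instruction=1)
opt_16_128 = summarize(cogs=16, hub_window_clocks=16, clocks_per_instruction=1)
doubled = summarize(8, 8, 1, clock_hz=160_000_000)

for name, row in [("now", now), ("8/256", opt_8_256),
                  ("16/128", opt_16_128), ("8/256 @ 160 MHz", doubled)]:
    print(name, row)
```

The single-clock instruction time used for the 16/128 column is an assumption; the table leaves that entry as a question mark.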
  • ciw1973 Posts: 64
    edited 2006-11-26 12:51
    Having given this some more thought overnight, I'm finding my original preference for 16 cogs and 128K RAM is once again looking more appealing, but only if there were bandwidth allocation features implemented as well.

    My main reason for going with 8 faster cogs wasn't the additional memory, but that access to this memory would be slower. Being able to allocate more slots to processes requiring faster access to this memory would largely negate the issue. OK, so it would introduce other issues which would need to be overcome, but I think there is a lot of potential there.

    If we consider the allocation of hub slots to be a similar issue to the issue of allocating blocks of memory, where the slots are determined in the spin which prepares the assembly code for loading into a cog, then it becomes fairly simple to manage.

    I'm still very keen on the idea of any new Propeller also including some on-chip FLASH though, to keep the component count down for smaller designs. I know physical silicon space is an issue, so how about a version of the 8 cog chip that has 128K SRAM and 128K FLASH?

    Post Edited (ciw1973) : 11/26/2006 12:57:44 PM GMT
  • PVJohn Posts: 60
    edited 2006-11-26 13:06
    I prefer more memory.
    Can you make it in DIP 40 package and pin compatible with current version, so that we can continue to use HYDRA board?

    PVJohn
  • ciw1973 Posts: 64
    edited 2006-11-26 13:51
    Agreed, having a 40 pin DIP version makes it much easier for the hobbyist and for prototyping in general. Adding FLASH would make it even more so.
  • Cobalt Posts: 31
    edited 2006-11-26 15:57
    I think I'll change my vote from the 16 cogs to the 8 cogs - I didn't quite realise that it would be faster and that the increased memory would let things run on fewer cogs... although I would want more IO pins :D

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    while alive = 1
    wakeup
    program(propeller)
    eat(3)
    sleep(7)
  • potatohead Posts: 10,254
    edited 2006-11-26 16:57
    Bill, I like your idea of a micro OS to manage things. However, would that not fragment the body of prop code?

    This is gonna happen anyway as people build stuff and other people build on top of it. For a specific application, no biggie. One gets the bits they need, tweaks them to play together and moves on to the application.

    However, that change would more or less mandate Parallax to provide some sort of management scheme. That may or may not make sense to them. Also, I'm not sure for a lot of applications the additional throughput possible would be worth the overhead. A similar level of granularity, setting peak performance aside, could be achieved with COG threading as well. From the application's point of view, there would be very little difference.


    Thought of something else. A new chip cycle means another shot at what's on the ROM.

    Does it all have to be ROM? EEPROM maybe for those who don't care about anything beyond the SPIN interpreter and its necessary elements?

    Would different contents really matter, knowing what we do now?

    I personally would like to see a small 8x8 character set in there, among other things. Is this something open to discussion Chip?
  • nutson Posts: 242
    edited 2006-11-26 17:48
    Chip's question suggests 1 Cog + 512x32 bit registers equals 16k8 HUB ram silicon in real estate, so the Cog takes the biggest part of the real estate. Allow me to throw in some new variables into the discussion: backwards compatibility, hyperthreading, SPIN interpreter.

    Although I back up the 8/256 direction, I still have doubts: assigning a 160 MIPS processor to the task of inputting characters from the keyboard will make me feel guilty, I am sure. The 16 Cog advocates have a point that the simplicity of having 16 absolutely equal resources greatly reduces the problems in programming very diverse I/O tasks and eases the reuse of objects etc., which is one of the Prop's strong points. Bill's method of executing streams from HUB memory is a way out, but has the disadvantage of task and context switching overhead, limited register use, different programming methods for "full" and "shared" Cog use. But, can't we have one 160 MIPS processor "hyperthreading" 8 threads at 20 MIPS, which would give us backward compatibility with current objects?? Give each thread a separate program counter and C/Z set, and share all other resources.

    One Cog without local memory could execute one thread out of HUB memory at 20 MIPS, it could execute 8 threads out of HUB memory at 1.125MIPS. One Cog with 4K32 local memory could hyperthread 8 threads at 20 MIPS.

    What is the role of the SPIN interpreter in this. Could the SPIN interpreter be made to run multiple (low speed) threads from a single COG??

    Nico Hattink
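
The arithmetic behind "8 threads at 20 MIPS" can be sketched as a strict round-robin issue model in Python. This is only a toy: the per-thread state is the program counter and C/Z flags named in the post, one instruction issues per clock, and every instruction is treated as a plain step forward (no jumps, no hub waits). The `Thread` class and `run` function are invented for the example.

```python
from dataclasses import dataclass

COG_LONGS = 512     # cog RAM size, shared by all threads in this model
CORE_MIPS = 160     # assumed single-issue rate of the fast cog

@dataclass
class Thread:
    pc: int = 0         # private program counter
    c: bool = False     # private carry flag (unused in this toy model)
    z: bool = False     # private zero flag (unused in this toy model)
    executed: int = 0   # instructions issued so far

def run(threads, clocks):
    """Issue one instruction per clock, rotating through the threads."""
    for clock in range(clocks):
        t = threads[clock % len(threads)]
        t.pc = (t.pc + 1) % COG_LONGS
        t.executed += 1
    return threads

# Eight threads, each starting in its own 64-long region of cog RAM.
threads = run([Thread(pc=i * 64) for i in range(8)], clocks=1600)
per_thread_mips = CORE_MIPS / len(threads)
print([t.executed for t in threads], per_thread_mips)
```

Because the rotation is fixed, each thread sees exactly one issue slot in eight, so its timing stays as predictable as a present-day 20 MIPS cog. Sharing the 512 longs between eight threads is the obvious cost.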
  • iam7805 Posts: 14
    edited 2006-11-26 18:11
    I'd say go with more RAM. Would be useful for game programming. :)
  • Mike Green Posts: 23,101
    edited 2006-11-26 19:40
    Some support for high speed clocked serial communications would be very useful for multiple chip-to-chip communications as well as the Ethernet that's already been mentioned. Using a self-clocking system (like Manchester encoding) would save I/O pins, but, with a larger package needed anyway, that may not be as much of a problem. Most high speed serial chips now use SPI. If Chip decides to add a little FIFO buffering for video, it wouldn't take much logic to use the same buffer for SPI output as well. Input buffering would also be very useful, but would require more supporting logic since the original reason for putting in the FIFO is for video generation.

    If SPI support were to be added, it could also be used for cog-to-cog communications
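
For readers who have not met it, Manchester encoding puts a transition in the middle of every bit, so the receiver can recover the clock from the data line and no separate clock pin is needed. A minimal Python sketch follows. The polarity used here (0 as high then low, 1 as low then high) is the convention commonly associated with IEEE 802.3 and is an assumption for this example; the opposite polarity is also in use.

```python
def manchester_encode(bits):
    """Turn each data bit into two half-bit levels with a mid-bit transition.

    Assumed convention: 0 -> (high, low), 1 -> (low, high).
    """
    levels = []
    for b in bits:
        levels.extend((0, 1) if b else (1, 0))
    return levels

def manchester_decode(levels):
    """Recover data bits from pairs of half-bit levels."""
    if len(levels) % 2:
        raise ValueError("expected an even number of half-bit levels")
    bits = []
    for first, second in zip(levels[0::2], levels[1::2]):
        if first == second:
            raise ValueError("missing mid-bit transition")
        bits.append(1 if (first, second) == (0, 1) else 0)
    return bits

data = [1, 0, 1, 1, 0, 0, 1]
line = manchester_encode(data)
print(line)
print(manchester_decode(line) == data)
```

The price is that the line toggles at up to twice the bit rate, which is the trade against the extra clock pin that plain SPI needs.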
  • Phillip Y. Posts: 62
    edited 2006-11-26 19:40
    MORE Intercog communication would be useful.

    port A (same as always)
    port B w/wo real I/O pins

    OR;

    Register for access to the other cogs similar to the port A and B,
    One for all cogs (port C) but not for I/O,
    Using port C instead of port B would ELIMINATE issues of moving programs that use port B with 32 I/O to 64 I/O versions of the chips.

    OR;

    Two registers for adjacent cogs , i.e. cog to the right , cog to the left . (port R, port L)
    together Port R and port L connections would use silicon = to port C alone,
    many times 2 or 3 cogs work together closely and don't need special access to other cogs.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-26 20:00
    I agree with the need for better inter-cog comm support. And an unpinned port B may be all it takes. But, at 160MIPS, I'm not sure that any more is really needed in the hardware for fast serial I/O.

    What I would like to see, though, are more counters that could be combined in various ways to support hardware PWM, for example. The DUTY mode is just too fast for some D/A apps, especially those requiring MOSFETs to drive an inductive load, say.

    -Phil
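
The "too fast" complaint about DUTY mode is easy to see in a simulation. In the sketch below, DUTY mode is modeled as an accumulator whose carry drives the pin (PHS += FRQ every clock), shrunk from 32 bits to 8 so the numbers stay small. The PWM model and the 256-clock period are assumptions for comparison only.

```python
def duty_mode(frq, clocks, bits=8):
    """Output the carry of a running accumulator (PHS += FRQ each clock)."""
    phs, out = 0, []
    for _ in range(clocks):
        phs += frq
        out.append(phs >> bits)       # carry out of the accumulator
        phs &= (1 << bits) - 1
    return out

def pwm(duty, clocks, period=256):
    """High for `duty` clocks out of each `period` clocks."""
    return [1 if (t % period) < duty else 0 for t in range(clocks)]

def edges(signal):
    """Count level changes, i.e. how often an output switch would toggle."""
    return sum(a != b for a, b in zip(signal, signal[1:]))

d = duty_mode(frq=128, clocks=256)    # 50% output
p = pwm(duty=128, clocks=256)         # 50% output
print(sum(d), edges(d))
print(sum(p), edges(p))
```

Both outputs average 50%, but the DUTY-style output toggles on nearly every clock while the PWM output toggles once in the same window. A MOSFET driving an inductive load pays switching losses on every edge, which is why a slower, counter-based PWM would help.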
  • Phillip Y. Posts: 62
    edited 2006-11-26 20:13
    PhiPi;
    I am only talking about 32 bit parallel access within the Propeller chip, not serial or between chips.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-26 20:20
    Hi Phillip,

    'Sorry, I should've kept the two topics separate. The portion of my comment regarding serial I/O was in response to Mike's posting just above yours, which also alluded to inter-cog comms.

    -Phil