prop II or III architecture — Parallax Forums

prop II or III architecture

Cncjerry Posts: 64
edited 2012-06-19 17:28 in Propeller 1
The more I play with this chip the more I like it and keep thinking it could be better.

I would love to have completely shared memory so each cog is actually only an execution element. So you would write flat code and point an execution unit at a subroutine (or any address) and let it rip. Then keep them all in sync with semaphores and locks if necessary. That would allow the cogs to share memory more efficiently and think of the dynamics you could have by allowing each of the cogs to point its partners at blocks of code as needed without having to spool the code into the cog's independent memory with the waits involved. I also think this would be a simpler chip to design, no? The programming would also be simpler.

Comments?


Jerry

Comments

  • Cluso99 Posts: 18,069
    edited 2012-06-18 15:53
    Jerry,
    Yes, the prop is great and Prop 2 will be even better.

    On the P2, for the cogs to share memory, each cog would be slowed by a factor of 8. Cog memory has quad access to achieve 1 instruction per clock cycle. That is 3 reads and 1 write. Each instruction actually takes 4 clocks, but they are pipelined to achieve 1 per clock.

    Programming would not be simpler because there are only 9 bits for each of the source and destination, so most instructions would be indirect. Then I suppose if you could reserve a block of 2KB per cog for direct access, that might work.

    Perhaps you might join this thread http://forums.parallax.com/showthread.php?140730-Consulting-the-crystal-ball-What-comes-after-the-Prop2
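The 9-bit limit mentioned above can be made concrete with a short sketch. It assumes the Propeller 1 instruction layout (6-bit opcode, 4 ZCRI flag bits, 4 condition bits, 9-bit destination, 9-bit source); the helper names `encode` and `decode` are illustrative only and not part of any Parallax tool.

```python
# Assumed Propeller 1 instruction word layout (32 bits):
#   bits 31..26 opcode, 25..22 ZCRI flags, 21..18 condition,
#   bits 17..9  destination, bits 8..0 source.
FIELD_BITS = 9
COG_LONGS = 1 << FIELD_BITS   # 512 longs reachable by a 9-bit field
COG_BYTES = COG_LONGS * 4     # 2048 bytes, i.e. the "2KB per cog" block


def encode(opcode, zcri, cond, dest, src):
    """Pack fields into one instruction word (hypothetical helper)."""
    if not (0 <= dest < COG_LONGS and 0 <= src < COG_LONGS):
        # A flat shared-memory address would not fit in 9 bits,
        # which is why most accesses would have to become indirect.
        raise ValueError("address does not fit in a 9-bit field")
    return ((opcode & 0x3F) << 26) | ((zcri & 0xF) << 22) | \
           ((cond & 0xF) << 18) | (dest << 9) | src


def decode(word):
    """Split an instruction word back into its fields (hypothetical helper)."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "zcri": (word >> 22) & 0xF,
        "cond": (word >> 18) & 0xF,
        "dest": (word >> 9) & 0x1FF,
        "src": word & 0x1FF,
    }
```

Any address above 511 is rejected by `encode`, which is the practical meaning of "only 9 bits for each of the source and destination".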
  • jmg Posts: 15,183
    edited 2012-06-18 16:11
    Cncjerry wrote: »
    The more I play with this chip the more I like it and keep thinking it could be better.

    I would love to have completely shared memory so each cog is actually only an execution element. So you would write flat code and point an execution unit at a subroutine (or any address) and let it rip. Then keep them all in sync with semaphores and locks if necessary. That would allow the cogs to share memory more efficiently and think of the dynamics you could have by allowing each of the cogs to point its partners at blocks of code as needed without having to spool the code into the cog's independent memory with the waits involved. I also think this would be a simpler chip to design, no? The programming would also be simpler.

    There is another 'crystal ball thread' - but 'simpler chip' ? - not really.
    Cog memory is Opcode memory and needs to be 4 port and fast and highly deterministic.

    The opcode has a code-memory limit of 512L, and yes, you could create more than 8x 512L of 4 port memory, and let the cogs 'have at it', but 4 port memory is costly silicon.

    A low cost way to morph between cogs, would be to allow those 4 ports to cross-steer between siblings.
    ie Cog 0 could write-only to (say) a 2^N word block of Cog1 memory, which would be read only in that area.
    Cog1 could write only to 2^M words in Cog0, so you have (possibly asymmetric) two way memory sharing.
    This does not add ports to the RAM, it just adds a mux to the address lines.

    A little more complex would be a scheme allowing unused COG memory, from a compact process, to be visible/used by a more loaded cog.
    The issue here is that you now have RAM larger than the opcode's reach, so some index/page scheme is needed.

    A full any-to-any cog mux would likely be too costly in speed and size, so this would make most sense for 'adjacent cogs' in a ring or star.


    I think some SPARCs allowed function calls where the register bank mapping would move through memory, with half holding the call params and the other half serving as local scratch. The Infineon x166 also has a similar register pointer (IIRC into 1K?), but they both have much smaller register counts, and the memory this works in, being 4 port, is die-costly, so is relatively small.

    Other chips have register-bank switching, and in a Prop, doing this on all 512L would open up adding virtual cogs / threads.
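The cross-steering idea for two sibling cogs can be modelled as a toy in software. This is only a sketch of the proposal as described in the post, not of any real Propeller hardware: the class name, the method names, and the choice to place each shared window at the top of cog RAM (ignoring the special-purpose registers that live there on a real chip) are all assumptions made for illustration.

```python
COG_WORDS = 512  # words of RAM per cog


class CrossSteeredPair:
    """Toy model of two cogs, each with a window its sibling may write."""

    def __init__(self, n_bits, m_bits):
        self.ram = [[0] * COG_WORDS, [0] * COG_WORDS]
        # shared[i] is the size of the block inside cog i's RAM that the
        # sibling writes: 2^M words in cog 0, 2^N words in cog 1.
        self.shared = [1 << m_bits, 1 << n_bits]

    def _in_window(self, owner, addr):
        # Assumption: the window sits at the top of the owner's RAM.
        return COG_WORDS - self.shared[owner] <= addr < COG_WORDS

    def read(self, cog, addr):
        # A cog only ever reads its own RAM, window included.
        return self.ram[cog][addr]

    def write_local(self, cog, addr, value):
        if self._in_window(cog, addr):
            raise PermissionError("window is read-only for its owner")
        self.ram[cog][addr] = value

    def write_sibling(self, cog, addr, value):
        other = 1 - cog
        if not self._in_window(other, addr):
            raise PermissionError("address is outside the sibling's window")
        self.ram[other][addr] = value
```

With `CrossSteeredPair(4, 3)`, cog 0 can write 16 words of cog 1's RAM and cog 1 can write 8 words of cog 0's RAM, which is the asymmetric two-way sharing the post describes. The model only steers where a write lands, mirroring the point that no extra RAM ports are needed.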
  • kwinn Posts: 8,697
    edited 2012-06-18 17:01
    Cncjerry wrote: »
    The more I play with this chip the more I like it and keep thinking it could be better.

    I would love to have completely shared memory so each cog is actually only an execution element. So you would write flat code and point an execution unit at a subroutine (or any address) and let it rip. Then keep them all in sync with semaphores and locks if necessary. That would allow the cogs to share memory more efficiently and think of the dynamics you could have by allowing each of the cogs to point its partners at blocks of code as needed without having to spool the code into the cog's independent memory with the waits involved. I also think this would be a simpler chip to design, no? The programming would also be simpler.

    Comments?


    Jerry

    The more I play with this chip the more I like it as well, and a lot of that is due to the elegant simplicity and symmetry of the architecture. No special cases and no functions that can only be done on specific cogs. The only cog that gets special treatment is cog 0, which is used for booting, but in every other way is the same as the other 7 cogs.

    IMHO your suggestion would add a lot of complexity to the hardware and software without providing a commensurate benefit for its intended use.
  • Cluso99 Posts: 18,069
    edited 2012-06-18 17:10
    kwinn wrote: »
    The only cog that gets special treatment is cog 0, which is used for booting, but in every other way is the same as the other 7 cogs.
    Well, technically, when we re-boot the software with an OS style, we can actually end up with any cog doing the boot process, so you cannot actually rely on the boot cog being cog 0.
  • Cncjerry Posts: 64
    edited 2012-06-18 17:14
    Most of my experience is on massively parallel processors, some of which were the basis for Blue Gene and precursors like the original PLANET and ORBIT parallel machines. I've only recently become aware of this chip and I am trying to maximize the parallel performance to give a signal generator analog-smooth frequency steps of less than 1 Hz, so that when you spin the dial it doesn't sound like an organ as it crosses a beat frequency.

    There is another processor other than SPARC, can't remember the manufacturer, for some reason SGI comes to mind. It was running SUSE Enterprise Linux with real-time extensions and was used at Lockheed for real-time simulation of the F-22 and others. I was out there last year in the simulator room, couldn't look in the cockpit. The room is about 100 ft in diameter with a ball in the middle where the cockpit resides. Full audio visual to the cockpit screens, the works. The simulator is all fly by wire and they hook realized hardware up to ensure real-world lag with encoder feedback just like the plane. The ball is mounted on several large actuators, each about 40 ft long. I believe the processor was 16x, not massive, but from what I was told it has more than 4 port memory. I'm a sales executive, so that is the limit of my knowledge. The more amazing feat is how they fit all that in a plane.

    After writing the note above, I started thinking my solution might be to carry multiple subroutines within a cog and have it sit and wait for an instruction as to which subroutine to execute. Same concept, though I want to speed up the wrbyte/rdbyte cog signaling or use a few I/O pins for communications. Someone mentioned that was a faster way to talk between cogs than rd/wrbyte.

    Again, being new to the prop, I hope the prop II, which I will read into, doesn't lag so much it will be outdated. I think a 4-way chip with faster shared RAM access would be interesting. In my app, commenting out the rdbyte/wrbyte memory transfers noticeably increases the responsiveness of the pushbuttons on the QuickStart board, which I light as pressed and blink when bouncing. So once the user has a solid light they know the button is captured.
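The "several subroutines in one cog, waiting for a command" idea from the post above can be sketched as a mailbox dispatcher. This is a software analogy only: a Python thread stands in for the cog, a one-element list stands in for a shared hub byte polled with rdbyte/wrbyte, and the command numbers, the `STOP` value, and the `step_up`/`step_down` routines are all hypothetical.

```python
import threading
import time

IDLE = 0    # mailbox value meaning "no command pending"
STOP = 255  # hypothetical command that ends the worker loop


def step_up(state):
    state["freq_hz"] += 1


def step_down(state):
    state["freq_hz"] -= 1


def worker(mailbox, state, routines):
    """Stand-in for a cog: poll the mailbox, run the requested routine."""
    while True:
        cmd = mailbox[0]
        if cmd == IDLE:
            time.sleep(0.0001)  # a real cog would simply keep polling
            continue
        if cmd != STOP:
            routines[cmd](state)
        mailbox[0] = IDLE       # acknowledge: the routine has finished
        if cmd == STOP:
            return


def send(mailbox, cmd):
    """Post a command and wait until the worker acknowledges it."""
    mailbox[0] = cmd
    while mailbox[0] != IDLE:
        time.sleep(0.0001)


def demo():
    mailbox = [IDLE]
    state = {"freq_hz": 1000}
    routines = {1: step_up, 2: step_down}
    cog = threading.Thread(target=worker, args=(mailbox, state, routines),
                           daemon=True)
    cog.start()
    for cmd in (1, 1, 2, STOP):
        send(mailbox, cmd)
    cog.join()
    return state["freq_hz"]
```

Because the worker clears the mailbox only after its routine returns, the sender knows the work is done once it reads `IDLE` again. The cost of this scheme on real hardware is one shared-memory read per poll, which is what the hub-timing discussion later in the thread is about.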
  • Circuitsoft Posts: 1,166
    edited 2012-06-18 20:37
    Cncjerry wrote: »
    There is another processor other than SPARC, can't remember the manufacturer, for some reason SGI comes to mind...
    MIPS?
    Cncjerry wrote: »
    In my app, commenting out the memory transfers of rdbyte/wrbyte noticeably increases the responsiveness of the pushbuttons which on the quickstart board
    Since rdbyte/wrbyte are round-robin scheduled, you can speed them up by moving them around in your code so that the hub is always ready when the instruction comes by. Since there are 16 cycles (4 instruction cycles) between hub accesses, if you can get your hub instructions at multiple-of-4-instruction intervals, then you'll get much better performance from your application.
  • Circuitsoft Posts: 1,166
    edited 2012-06-18 20:39
    Cncjerry wrote: »
    Again being new to the prop, I hope the prop II, which I will read into, doesn't lag so much it will be outdated.
    Keep in mind that the Prop 1 is 6 years old now, and we're still finding new ways to use it. Check out the thread about running presentations off of it.
  • kwinn Posts: 8,697
    edited 2012-06-18 21:09
    Cluso99 wrote: »
    Well, technically, when we re-boot the software with an OS style, we can actually end up with any cog doing the boot process, so you cannot actually rely on the boot cog being cog 0.

    I guess I should have been more specific about what I meant by booting. Unless I am mistaken, when the prop is powered up or the reset pin is pulled low, the prop uses cog 0 to load the Spin interpreter and execute the code that is downloaded on the serial input or, if there is no serial input, the code from the EEPROM.
  • kwinn Posts: 8,697
    edited 2012-06-18 21:33
    Cncjerry wrote: »

    After writing the note above, I started thinking my solution might be to carry multiple subroutines within a cog and have it sit and wait for an instruction as to which subroutine to execute. Same concept, though I want to speed up the wrbyte/rdbyte cog signaling or use a few I/O pins for communications. Someone mentioned that was a faster way to talk between cogs than rd/wrbyte.

    Again, being new to the prop, I hope the prop II, which I will read into, doesn't lag so much it will be outdated. I think a 4-way chip with faster shared RAM access would be interesting. In my app, commenting out the rdbyte/wrbyte memory transfers noticeably increases the responsiveness of the pushbuttons on the QuickStart board, which I light as pressed and blink when bouncing. So once the user has a solid light they know the button is captured.

    I have a feeling your app must be written in Spin for the above to be true. If you were using machine language (PASM) in a cog, the cog would be executing several thousand instructions for each bounce of your pushbutton switch. Since the time required for a rd/wr byte, word, or long is at most 23 clocks (287.5 nanoseconds at 80 MHz), using PASM I doubt you would see any difference between writing or not writing a byte, word, or long after the debounce time. The debounce time would be several orders of magnitude greater.
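The arithmetic in the post above is easy to check. The sketch below uses the figures given in the thread (80 MHz clock, 4 clocks per ordinary instruction, 23 clocks worst case for a hub access); the 200 microsecond bounce and the 10 millisecond debounce window are assumed values chosen only for illustration, since the thread does not state them.

```python
CLOCK_HZ = 80_000_000        # system clock from the thread
CLOCKS_PER_INSTRUCTION = 4   # ordinary PASM instruction
WORST_HUB_CLOCKS = 23        # worst-case rd/wr byte, word, or long

ASSUMED_BOUNCE_US = 200      # assumption: length of one contact bounce
ASSUMED_DEBOUNCE_US = 10_000 # assumption: total debounce window


def clocks_to_ns(clocks, clock_hz=CLOCK_HZ):
    """Convert a clock count to nanoseconds."""
    return clocks * 1e9 / clock_hz


def instructions_in_us(microseconds, clock_hz=CLOCK_HZ):
    """Ordinary instructions one cog executes in the given time."""
    return microseconds * clock_hz // (CLOCKS_PER_INSTRUCTION * 1_000_000)


worst_hub_ns = clocks_to_ns(WORST_HUB_CLOCKS)            # 287.5 ns
per_bounce = instructions_in_us(ASSUMED_BOUNCE_US)       # 4000 instructions
debounce_ratio = ASSUMED_DEBOUNCE_US * 1000 / worst_hub_ns
```

Under these assumptions a single bounce spans about 4,000 instructions, and the debounce window is roughly 35,000 times longer than one worst-case hub access, which supports the point that a handful of hub operations should not be visible at the buttons.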
  • Heater. Posts: 21,230
    edited 2012-06-18 22:06
    Cncjerry,
    I don't buy it. The prop is executing 20 million instructions per second. Yes, I know hub access can be a bit slower if you miss your hub window. Removing an instruction time or two from your key scanning is not going to show up when hitting keys and watching LEDs.
    There must be more going on with your code than you let on.
  • Circuitsoft Posts: 1,166
    edited 2012-06-19 12:56
    How fast are you actually running the prop?
  • Duane Degn Posts: 10,588
    edited 2012-06-19 15:13
    Heater. wrote: »
    Cncjerry,
    I don't buy it. The prop is executing 20 million instructions per second. Yes, I know hub access can be a bit slower if you miss your hub window. Removing an instruction time or two from your key scanning is not going to show up when hitting keys and watching LEDs.
    There must be more going on with your code than you let on.

    I was thinking the same thing.

    I don't think the time required to make a few dozen hub reads and writes would be humanly perceptible.
  • Cluso99 Posts: 18,069
    edited 2012-06-19 17:28
    Circuitsoft wrote: »
    Since rdbyte/wrbyte are round-robin scheduled, you can speed them up by moving them around in your code so that the hub is always ready when the instruction comes by. Since there are 16 cycles (4 instruction cycles) between hub accesses, if you can get your hub instructions at multiple-of-4-instruction intervals, then you'll get much better performance from your application.
    Actually, the rd/wr hub instructions are longer than 4 clocks, so you can have 2, 6, 10... instructions between the rd/wr hub instructions and hit the "sweet" spot.
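The "sweet spot" can be reproduced with a simplified timing model. It assumes that a cog's hub window comes around every 16 clocks, that a hub instruction which hits its window takes 8 clocks, and that every other instruction takes 4 clocks; these match the 16-cycle interval mentioned in the thread and the point that hub instructions are longer than 4 clocks, but the model ignores any finer detail of where inside the instruction the synchronization happens.

```python
HUB_PERIOD = 16  # clocks between one cog's hub windows (assumed model)
HUB_MIN = 8      # clocks for a hub instruction that hits its window
INSTR = 4        # clocks for an ordinary instruction


def loop_clocks(n_between):
    """Steady-state cost of a loop with one hub instruction followed by
    n_between ordinary instructions. Returns (total_clocks, wait_clocks)."""
    busy = HUB_MIN + INSTR * n_between
    # The next hub instruction stalls until the following hub window.
    wait = -busy % HUB_PERIOD
    return busy + wait, wait


def sweet_spots(limit):
    """Instruction counts below `limit` that never stall on the hub."""
    return [n for n in range(limit) if loop_clocks(n)[1] == 0]
```

In this model `sweet_spots(12)` gives 2, 6 and 10, matching the correction above, while three ordinary instructions between hub accesses cost 32 clocks per loop, 12 of them spent waiting for the window.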