Propeller internals... shared memory, spin vs. asm etc... — Parallax Forums

ght_dak Posts: 15
edited 2007-06-30 02:01 in Propeller 1
I've been trolling through the documents and forums but I guess I'm still not clear on some of the key elements of the inner workings of the prop chip and how it may affect high performance applications.

First, there is some discussion of how Spin code works on multiple cogs... some documentation mentions that the Spin interpreter is "copied" into the new cog's RAM during cognew (is this true?). Then, there is mention that the Spin code is read from "main memory" during execution. But if the cog needs to read from main memory to get its Spin code at runtime, isn't there a memory contention issue? Does each long of Spin code to be interpreted need to wait for its cog's access time window?

This would apply to variables as well. If one kicks off a new cog with Spin code, the variables are shared with the parent... does this mean that each read and write to such a variable is also coordinated by the cog access control mechanism?

Which then raises the question: where does the speed improvement from ASM code come from? Obviously, ASM doesn't need to be interpreted, so it's going to be a lot faster... but there also seems to be a substantial speed improvement because, most of the time, the cog is accessing cog-local memory rather than the 32K of main memory.

Is it simply true that since spin code is so much slower, that memory access for code and data is a relatively small issue?

And, finally, what is the conventional wisdom on just how much faster ASM is than Spin?

Comments

  • Bean Posts: 8,129
    edited 2007-06-30 01:18
    I think you pretty much have it correct. There is a VAST VAST speed difference between spin and assembler.

    Spin needs to read code and variables from main (hub) memory, plus it needs to interpret the instructions. Of course the advantage is you are not limited to the 512 longs of RAM that each cog contains.

    Bean.

  • ght_dak Posts: 15
    edited 2007-06-30 01:33
    If each cog running Spin needs to access main memory for code and variables, isn't there an issue with the often-discussed claim of deterministic processing? At a minimum, the cog needs to wait up to 16 clock cycles to get its "window"... if no other cog is accessing shared resources. But if each cog may or may not be accessing memory, there is an additional 7 clock cycles per access... so there could be a pretty large variation.

    I imagine that most of the time this wouldn't be a huge problem since we're waiting for clock countdowns to do our next thing, but it still is something to be considered for some applications.
  • Mike Green Posts: 23,101
    edited 2007-06-30 01:34
    1) You need to distinguish between the Spin interpreter, which is a program written in the native Propeller instruction set, and Spin byte code, which is the code produced by the Spin compiler and interpreted by the Spin interpreter. When the chip is reset and whenever a COGNEW Spin statement is executed, a copy of the Spin interpreter is loaded into a cog from the ROM, then executed. This Spin interpreter begins to interpret byte codes read from hub (shared) memory. The address of the byte codes, the stack space to be used and the address of some tables needed are all supplied by the first 16 locations of hub (shared) memory and some information passed to the Spin interpreter in a register. The native COGINIT instruction and the assembly variant of the Spin COGNEW statement are used to start an assembly program. The program is provided in a 512-long (2K byte) area in hub (shared) memory and some parameter information can be passed (a 14-bit value). The instruction copies the program to an available cog (or a specific one) and then starts the cog at the first instruction.

    2) Shared memory can be accessed "simultaneously". The hub is designed to supply each cog with a "turn" to read or write the shared memory. If a cog's "turn" hasn't come up yet, the cog is made to wait until its "turn", so there's no contention for the shared memory. Code can be written so that, once a shared memory location is accessed, the cog stays in sync with the hub and this delay is minimized.

    3) The ratio between native instruction speed and Spin operation speed is probably around 80:1. It depends on the specific operations involved, but that ratio is a good starting point. Multiplication and division are done by subroutine (the native instruction set has no single-instruction multiply or divide), so one Spin operation replaces many native instructions and that improves the ratio. Other operations in Spin (like subscripting) would take several native instructions to implement even without the interpreter overhead, so that also improves the average ratio.

    4) I suspect that the Spin interpreter is very tightly optimized in terms of overhead (like the synchronization between cog and hub). In any event, because it would be very hard to handle determinism using the execution times of byte codes, this is done in Spin mostly by using the various WAITxxx instructions. Tighter control of timing generally has to be done using assembly language.
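
The "turn" mechanism in point 2 can be sketched as a small timing model. This is only an illustration in Python, not Propeller code: the function names are hypothetical, the 2-clock slot per cog follows from 8 cogs sharing a 16-clock rotation, and the fixed 8-clock cost of a hub instruction is an assumption that should be checked against the datasheet for your chip.

```python
HUB_PERIOD = 16  # clocks between one cog's hub windows (figure used in this thread)
SLOT = 2         # assumed slot width: 8 cogs sharing a 16-clock rotation
HUB_BASE = 8     # ASSUMED fixed cost of a hub instruction once its window arrives


def hub_wait(cog_id, clock):
    """Clocks a cog stalls at `clock` before its next hub window opens."""
    window_phase = (cog_id * SLOT) % HUB_PERIOD
    return (window_phase - clock) % HUB_PERIOD


def run_loop(cog_id, start_clock, other_clocks, iterations):
    """Simulate a loop of one hub instruction plus `other_clocks` of cog-local work.

    Returns the stall seen by the hub instruction on each pass.
    """
    clock = start_clock
    stalls = []
    for _ in range(iterations):
        stall = hub_wait(cog_id, clock)
        stalls.append(stall)
        clock += stall + HUB_BASE   # wait for the window, then do the access
        clock += other_clocks       # cog-local instructions, no hub involved
    return stalls


if __name__ == "__main__":
    # Body sized so the loop lands exactly on the next window: no stall after pass 1.
    print(run_loop(cog_id=3, start_clock=5, other_clocks=8, iterations=4))
    # Body 4 clocks longer: the window is missed and every later pass stalls 12 clocks.
    print(run_loop(cog_id=3, start_clock=5, other_clocks=12, iterations=4))
```

Under these assumptions the first call prints `[1, 0, 0, 0]` and the second `[1, 12, 12, 12]`, which is the "stay in sync with the hub" effect: only the first access pays an arbitrary wait, and a well-sized loop body pays none afterwards.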

    Post Edited (Mike Green) : 6/30/2007 1:45:28 AM GMT
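
To put the 80:1 estimate from point 3 into rough numbers, here is a back-of-the-envelope calculation in Python. The 80 MHz system clock and 4 clocks per native instruction are assumptions (neither figure is stated in this thread), and the ratio itself is only Mike's starting-point estimate, so treat the results as order-of-magnitude figures.

```python
CLOCK_HZ = 80_000_000   # ASSUMED system clock; not stated in the thread
CLOCKS_PER_INSTR = 4    # ASSUMED cost of a typical native (assembly) instruction
SPIN_RATIO = 80         # rough native:Spin speed ratio quoted in point 3


def native_instr_per_sec(clock_hz=CLOCK_HZ, clocks_per_instr=CLOCKS_PER_INSTR):
    """Native instructions per second for one cog."""
    return clock_hz // clocks_per_instr


def spin_ops_per_sec(ratio=SPIN_RATIO, clock_hz=CLOCK_HZ,
                     clocks_per_instr=CLOCKS_PER_INSTR):
    """Approximate Spin operations per second for one cog at the given ratio."""
    return native_instr_per_sec(clock_hz, clocks_per_instr) // ratio


if __name__ == "__main__":
    native = native_instr_per_sec()
    spin = spin_ops_per_sec()
    print(f"native: {native:,} instr/s per cog")
    print(f"spin:   {spin:,} ops/s per cog")
    print(f"one Spin op is roughly {1_000_000 / spin:.1f} microseconds")
```

With these assumed inputs a cog runs about 20 million native instructions per second and about 250,000 Spin operations per second, or roughly 4 microseconds per Spin operation.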
  • CardboardGuru Posts: 443
    edited 2007-06-30 02:01
    In addition to Mike's very comprehensive answer, it's worth simply pointing out that it makes no difference whether there's one cog or 8 running: the code in each cog will run at exactly the same speed regardless (unless you deliberately wait for another cog). It isn't a minimum of 16 cycles to access main memory. Each cog gets to access main memory once every 16 cycles regardless. Of course, when the program arrives at a WRLONG or RDLONG it will have to wait between 0 and 15 cycles for the next hub access window to arrive, but this delay is completely repeatable every time you run the program, or every time around a fixed-length loop, and so is deterministic.
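
The repeatability argument can be checked with a small model. The Python sketch below is an illustration under stated assumptions, not a cycle-accurate simulator: each cog's window is assumed to recur every 16 clocks at a fixed phase, the 8-clock fixed cost of a hub instruction is an assumption, and the function names are hypothetical. No other cog appears anywhere in the model, which mirrors the point that other cogs cannot change the timing.

```python
HUB_PERIOD = 16  # clocks between one cog's hub windows (figure used in this thread)
HUB_BASE = 8     # ASSUMED fixed cost of a hub instruction once its window arrives


def iteration_times(cog_id, start_clock, body_clocks, iterations):
    """Clocks taken by each pass of a loop: one hub access plus `body_clocks` of work."""
    window_phase = (cog_id * 2) % HUB_PERIOD  # assumed 2-clock slot per cog
    clock = start_clock
    times = []
    for _ in range(iterations):
        begin = clock
        stall = (window_phase - clock) % HUB_PERIOD  # 0..15 clocks
        clock += stall + HUB_BASE + body_clocks
        times.append(clock - begin)
    return times


def steady_period(body_clocks):
    """Predicted pass time once in sync: busy time rounded up to a whole hub period."""
    busy = HUB_BASE + body_clocks
    return -(-busy // HUB_PERIOD) * HUB_PERIOD  # ceiling division


if __name__ == "__main__":
    for start in (0, 3, 11):
        print(start, iteration_times(cog_id=0, start_clock=start,
                                     body_clocks=20, iterations=4))
    print("predicted steady period:", steady_period(20))
```

In this model only the first pass depends on where the loop happened to start. Every later pass takes the same number of clocks (32 for a 20-clock body here), for any cog and any starting phase.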