mixing spin and assembly?

mike56 · 2009-08-08 07:25

I'm writing some assembly and it seems that inline assembly can't be mixed with Spin? Nor can functions be written in assembly?

For example, I can declare "Pub Myfunc" then call it from the "pub main". This is normal subroutine behavior like in C.

But if I write an assembly code routine in DAT called "MyAsm" then I need to run a cognew(@MyAsm,0).

I can't just call "MyAsm" like a regular subroutine?

Is there any way around this?

Also, if I write some fast code in assembly, run a "cognew(@MyAsm,0)", and then want to continue in Spin once it is done, that's not possible because cognew runs in parallel.

Is there any way to call an assembly subroutine without starting a new cog?
If not, is there a way to pause operation until a cog is finished?

For example, it would be the same as:

Pub Main
  code1
  code2

  cognew(@MyAsm,0)
  ' don't process any more lines until MyAsm is finished

Dat
        org 0
MyAsm   code3
        code4

Any hints? Also, are there more examples for the assembly besides the manual?

There are many quirks that I have found only through trial and error.

The first is that the dira register needs to be set for each cog. The manual says the registers are all ORed together, but if I set dira[5]=1 in one cog and then set outa[5]=1 from another cog, it doesn't work unless I also set dira in that second cog, even though it has already been set in the first. Weird!
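
For what it's worth, this matches how the hardware is usually described: each cog has its own dira and outa, and a cog's outa bit only reaches the pin if that same cog's dira bit is set. A minimal sketch, assuming an LED on P5 (the pin, the Blink method name, and the 16-long stack are illustrative assumptions, not anything from the manual):

```spin
VAR
  long stack[16]                ' stack space for the second Spin cog (size is a guess)

PUB Main
  cognew(Blink, @stack)         ' run Blink in another cog
  repeat                        ' keep the first cog alive

PRI Blink
  dira[5] := 1                  ' this cog must set its own direction bit
  repeat
    !outa[5]                    ' toggle P5 from this cog
    waitcnt(clkfreq / 2 + cnt)  ' half-second delay
```

Setting dira[5] in Main instead would not help Blink, because Blink's outa is gated by Blink's own dira.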

Ale · 2009-08-08 07:35

You can "pause", i.e. wait for a variable to reach a predefined value that is written from assembler: a shared variable in HUB. Many objects (float, sdcard, etc.) use this exact technique.

mike56 · 2009-08-08 07:53

So you can't just call Asm like a subroutine?

mctrivia · 2009-08-08 08:02

Nope, that is one of the downfalls of the Propeller. Unlike a traditional MCU, it has two separate memory areas: hub RAM and cog RAM. A normal MCU has only the equivalent of a larger cog RAM. Code can only truly be executed from cog RAM, and the Prop does not have enough of it, so it walks the line and loads an interpreter into cog RAM for Spin. ASM by its nature cannot be run this way.

SamMishal · 2009-08-08 08:09

Mike,
Yes you can, but in a roundabout way.
What you need to do is load a cog with the assembly code and let it run there.
Now there is the problem, of course, with the cog running parallel to the cog that is
running the Spin code.
What you need is then to have the spin code WAIT for the PASM code to do its work
and return the results.
You can do this by having a sentinel variable that the Spin code will keep polling in a loop
until it is flagged and thus the Spin code knows that the PASM code has finished its job.
The PASM code has to, of course, flag the sentinel variable to indicate it has finished its work.
So you can have as many PASM routines as you need. Then every time you want to call one
you need to load it into a cog, wait for it to finish and then go on.
So this is as if you have called a subroutine but with a little extra work.
It is a minor shortcoming of the Propeller/SPIN that PASM code cannot be inline and has to be
run in a COG.
This is not really much of a shortcoming since there is the workaround as above, but also
most of the time you need PASM to do fast work and most of this work USUALLY needs to
be run in Parallel (e.g. UART, PWM, etc. etc.) to the normal program.
Sam
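
A minimal sketch of the load-and-wait approach described above. The names done, Main, MyAsm, and flag are placeholders, and the "work" is left as a comment, so treat this as a starting shape rather than a finished object:

```spin
VAR
  long done                     ' hub flag: the PASM cog writes 1 here when finished

PUB Main
  done := 0                     ' clear the flag before every launch
  cognew(@MyAsm, @done)         ' PAR will hold the hub address of done
  repeat until done             ' Spin waits here until the PASM code flags completion
  ' ... carry on in Spin, as if MyAsm had been a subroutine call ...

DAT
        org     0
MyAsm
        ' ... the fast PASM work goes here ...
        mov     flag, #1
        wrlong  flag, par       ' set the sentinel in hub RAM
        cogid   flag            ' get this cog's own ID...
        cogstop flag            ' ...and stop it so the cog is free again
flag    res     1
```

Each "call" made this way pays the full cog load time, so it only makes sense when the PASM work is long compared with the launch.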

jazzed · 2009-08-08 12:46

If you use one of the C compilers, you can call functions written in what is known as LMM ASM or use in-line LMM ASM.
LMM is not native PASM, but it is close. You have to learn a few extra things about how to do jmp, etc... and it's about 1/8th as fast.
You can find examples for ImageCraft C in the obex, and Catalina on one of the "release" threads.

SRLM · 2009-08-08 14:49

Take a look at this object. It's almost as if the spin were "calling" the asm.

Baggers · 2009-08-08 14:56

There has been some work done on the PASM source of the Spin interpreter to save space so that extra features can be included. Maybe this could be one of those extra features? I'm not sure, as I've been too busy to follow all the threads, especially lately now that the Prop has really opened up with some fantastic advances :)

JonnyMac · 2009-08-08 17:34

I've attached a little shell program that I wrote for myself as a starting point for creating a PASM-based command processor. This might be easier to grasp than wading through a full operational listing.

[Edit] Lest I create the impression that I came up with this structure... I didn't. After studying several interesting examples from some of the better known contributors on this forum I settled on this as my starting point.

Post Edited (JonnyMac) : 8/10/2009 3:21:40 AM GMT

RinksCustoms · 2009-08-09 04:19

just F.Y.I. - a COGNEW takes ~8300 clocks... 103.5uS @ 80MHz clkfreq

CounterRotatingProps · 2009-08-09 20:44

RinksCustoms said...
just F.Y.I. - a COGNEW takes ~8300 clocks... 103.5uS @ 80MHz clkfreq

Wow Rinks, I trust you figured that out correctly, but it's still a surprising amount of overhead! - H

ericball · 2009-08-10 00:48

RinksCustoms said...
just F.Y.I. - a COGNEW takes ~8300 clocks... 103.5uS @ 80MHz clkfreq

CounterRotatingProps said...
Wow Rinks, I trust you figured that out correctly, but it's still a surprising amount of overhead! - H

90% of that is the 16 x 496 clocks required to copy the data from HUB RAM to COG RAM. Each cog is only able to access HUB RAM every 16 clock cycles, and there are 496 "registers" in COG RAM. In a PASM app, one of the keys to performance is doing useful work between HUB access timeslots (and avoiding misses).
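
The arithmetic behind that estimate, written as Spin constants so the compiler does the sums. The constant names are made up for illustration, and the 80 MHz figure is the clock rate mentioned earlier in the thread:

```spin
CON
  HUB_WINDOW  = 16                       ' clocks between one cog's hub access slots
  COG_LONGS   = 496                      ' longs copied at launch, per the figure above
  LOAD_CLOCKS = HUB_WINDOW * COG_LONGS   ' = 7_936 clocks
  LOAD_US     = LOAD_CLOCKS / 80         ' about 99 us at 80 MHz (integer division)
```

That leaves a few hundred clocks of the ~8300 total for everything that is not the raw copy.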

mike56 · 2009-08-10 01:54

Thanks for all the answers!

I am very surprised you can't just run an Asm subroutine. Seems like the compiler should be written with this in mind. Oh well can't have everything.

The suggestion of starting the PASM routine with cognew and then waiting until it flags a shared variable makes sense.

Is there an example of this that I can just cut and paste my assembly code into the DAT section?

I think JonnyMac's example is close but my asm and prop skills aren't that good yet so more examples of using the structure would be great!

Mike Green · 2009-08-10 02:54

You can't just run an Asm subroutine because Spin is an interpreted language. Spin source is compiled into byte codes, instructions for a made up machine that are interpreted by a program stored in hub (shared) ROM which is copied into and executed from cog ram. This interpreter fits into 496 32-bit words of cog ram.

The actual cog instructions are 32-bit RISC instructions that are completely different from the Spin byte codes. They have to be executed in cog memory. They can't be executed from the shared hub memory which is actually treated as a special I/O device by the cog, accessed with special instructions for reading and writing.

There are several assembly tutorials with links in one of the "sticky threads" at the top of the thread list in this forum. Look at "Propeller: Getting Started and Key Thread Index". Have a look at them.

pjv · 2009-08-10 18:09

Whoa, just a minute here.....

I have been away from the propeller for over 3 years now, but with the EOL announcement of the SX, I'm having another look at whether the Prop can do what I need it to do. So I've been starting to experiment again, and reading some of the Prop threads, in particular assembler related ones. This particular thread indicates a very long time to launch an assembly routine, and my results vary greatly...... but perhaps I'm doing something wrong (I still consider myself a "Prop Newbie") and I will defer to the experts here.

My results in launching 7 consecutive identical assembly cogs with COGNEW show exactly 354 clocks between each launch..... not the 8300 posted above!

Cheers,

Peter (pjv)

Post Edit:....OOPS hit the wrong key...... the number of clocks I see are 754, not 354. Sorry about that!

Post Edited (pjv) : 8/10/2009 8:08:17 PM GMT

lonesock · 2009-08-10 18:24

pjv said...
Whoa, just a minute here.....

I have been away from the propeller for over 3 years now, but with the EOL announcement of the SX, I'm having another look at whether the Prop can do what I need it to do. So I've been starting to experiment again, and reading some of the Prop threads, in particular assembler related ones. This particular thread indicates a very long time to launch an assembly routine, and my results vary greatly...... but perhaps I'm doing something wrong (I still consider myself a "Prop Newbie") and I will defer to the experts here.

My results in launching 7 consecutive identical assembly cogs with COGNEW show exactly 354 clocks between each launch..... not the 8300 posted above!

Cheers,

Peter (pjv)

The cognew command is not blocking...it will return before the cog is fully loaded and running. This is why you will often see code similar to:

repeat while ASM_interface <> 0

or similar.
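
A loop like that usually belongs to a "mailbox" arrangement: the PASM cog is started once and then left running, and Spin hands it commands through a hub long. Here is a rough sketch of that pattern; command, Start, Call, Worker, req, and zero are hypothetical names, and the dispatch itself is only a comment:

```spin
VAR
  long command                  ' mailbox: 0 = idle, nonzero = work requested

PUB Start
  command := 0
  cognew(@Worker, @command)     ' launch once; the load time is paid only here

PUB Call(cmd)
  command := cmd                ' post the request
  repeat while command <> 0     ' wait until the PASM cog clears the mailbox

DAT
        org     0
Worker  rdlong  req, par wz     ' read the mailbox; Z is set while it is zero
  if_z  jmp     #Worker         ' nothing to do yet: keep polling
        ' ... dispatch on req and do the fast work here ...
        wrlong  zero, par       ' clear the mailbox to signal completion
        jmp     #Worker
zero    long    0
req     res     1
```

From the Spin side, Call(1) then behaves much like a blocking subroutine call, at the cost of dedicating a cog to the worker.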

Jonathan

pjv · 2009-08-10 20:15

Jonathan;

I'm not sure we understand each other.

I follow that the COGNEW instruction may be non-blocking, but clearly each whole cog needs to load, regardless of its program size, before it is released to run. On launching these identical cogs, I see a consistent delay of 754 clocks from cog to cog to cog as each starts running. Would this then not be the measurement of "how long it takes to launch a cog"?

Cheers,

Peter (pjv)

Mike Green · 2009-08-10 20:28

Peter,
The COGINIT cog instruction (which is executed by the Spin interpreter when you do a COGNEW or COGINIT) only initiates the operation before continuing. There's the usual variable hub synchronization plus another few clock cycles. Once initiated, the new cog independently loads up its memory from the hub, clears the I/O registers, and begins executing from location zero. All of this takes about 100us as others have mentioned. The 754 clocks you're seeing is probably the interpreter overhead to finish the COGNEW Spin operation and continue execution of the cog initiating the COGNEW. The newly started cog would still be in the process of loading up its memory and would not actually be starting execution of the new code for a while yet.

The best way to demonstrate this would be to use COGNEW to start an assembly routine where the first thing done would be to store the system clock in some known hub memory location and set the following location to zero. The Spin program that does the COGNEW would initialize the two longs involved to -1, store the system clock in another variable, do the COGNEW, then wait for the assembly program to set the 2nd long to zero. The difference between the two system clock values would give you a good idea of the total startup time for an assembly routine.

lonesock · 2009-08-10 20:28

pjv said...
Jonathan;

I'm not sure we understand each other.

I follow that the COGNEW instruction may be non-blocking, but clearly each whole cog needs to load, regardless of its program size, before it is released to run. On launching these identical cogs, I see a consistent delay of 754 clocks from cog to cog to cog as each starts running. Would this then not be the measurement of "how long it takes to launch a cog"?

Cheers,

Peter (pjv)

Hi, Peter.

Just to be clear, I'm saying there's a difference between how long it takes for the cognew command to return in Spin, and how long it takes before the new cog is active.

CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

OBJ
  term: "Simple_Serial"

VAR
  long stack1[100]
  long stack2[100]

PUB timing | rval, t1, t2
  term.init( 31, 30, 9600 )
  rval := 0

  term.rx

  ' test the immediate return
  t1 := cnt
  cognew( @PASM_entry, @rval ) ' the assembly engine sets rval to the cnt value at startup
  t2 := cnt

  repeat while rval == 0

  term.str( string( "No wait: " ) )
  term.dec( t2-t1 )
  term.str( string( 13, "With wait: " ) )
  term.dec( rval-t1 )
  term.str( string( 13, 13 ) )  

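DAT
        org 0

PASM_entry
        mov     tmp, cnt        ' capture the system counter as soon as this cog starts
        wrlong  tmp, par        ' report it to rval in hub RAM
        jmp     #$              ' park here so the cog does not run on into its data
tmp     long    0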

yields the following numbers:

No wait: 1152
With wait: 9158

Note that I added the "dec" function to Simple_Serial.spin, so this won't compile "out of the box".

Jonathan

<edit> Mike's idea of storing the system counter value directly was a better idea! [8^) Edited the code snippet to match.

Post Edited (lonesock) : 8/10/2009 9:03:08 PM GMT

pjv · 2009-08-10 21:18

Hi Jonathan and Mike;

I'm sorry, but I'm not yet up to the level of fully understanding the implications of SPIN code, and I can only grasp assembler, so your code examples, although appreciated, are not of great help to me. I have to revert to as close to the hardware as I can get, and for me that means assembler.

So my test code is as below, and the timings I observe are from the first rising edge of port pin1 to port pin2 to pin3 and so on.

If for illustrating my point we assumed for a moment that it only took one clock cycle for the launching cog0 to trigger the load of a cog (we know this is not correct), and then measured the time in clock cycles from rise of cog1 pin to the rise of cog2 pin, then the time for the second cog to be loaded and released to run would be equal to the time measured from cog1 to cog2 minus one clock. Now the cognew instruction takes more than 1 cycle to trigger the load, but isn't the total time to trigger the load and the loading itself equal to my observed time?

If so, is that not considered the "launch time" for a cog?

What am I not seeing here?

_clkmode = xtal1
_clkfreq = 5000000
 
PUB Launch
    cognew(@One,0)
    cognew(@Two,0)
    cognew(@Three,0)
    cognew(@Four,0)
    cognew(@Five,0)
    cognew(@Six,0)
    cognew(@Seven,0)
 
DAT
 
One        org
              mov       dira,#1
:Loop         xor       outa,#1
              jmp       #:Loop
 
Two        org
              mov       dira,#2
:Loop         xor       outa,#2
              jmp       #:Loop
 
Three        org
              mov       dira,#4
:Loop         xor       outa,#4
              jmp       #:Loop
 
Four       org
              mov       dira,#8
:Loop         xor       outa,#8
              jmp       #:Loop

Five       org
              mov       dira,#$10
:Loop         xor       outa,#$10
              jmp       #:Loop
 
Six       org
              mov       dira,#$20
:Loop         xor       outa,#$20
              jmp       #:Loop
 
Seven       org
              mov       dira,#$40
:Loop         xor       outa,#$40
              jmp       #:Loop

Thanks for your interest in helping me understand this!

Cheers,

Peter (pjv)

lonesock · 2009-08-10 22:28

pjv said...
Hi Jonathan and Mike;
...
If for illustrating my point we assumed for a moment that it only took one clock cycle for the launching cog0 to trigger the load of a cog (we know this is not correct), and then measured the time in clock cycles from rise of cog1 pin to the rise of cog2 pin, then the time for the second cog to be loaded and released to run would be equal to the time measured from cog1 to cog2 minus one clock. Now the cognew instruction takes more than 1 cycle to trigger the load, but isn't the total time to trigger the load and the loading itself equal to my observed time?

If so, is that not considered the "launch time" for a cog?

What am I not seeing here?
...
Peter (pjv)

To see what we mean, remove the "cognew(@One,0)" line, and instead use Spin to set pin 1 high. You are measuring the timing between when cogs 1 and 2 come online, cogs 2 and 3, etc. We are just mentioning that the time from the spin "cognew" command to when that same cog comes online is longer than what you are measuring.
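
A sketch of that modified test, cut down to a single PASM cog. It assumes the pin masks from Peter's listing, where "pin 1" is mask #1 (P0) and the second cog toggles mask #2 (P1); the trailing repeat is my addition so that the Spin cog keeps driving P0:

```spin
PUB Launch
  dira[0] := 1                  ' Spin cog drives P0 itself (mask #1 in the PASM version)
  outa[0] := 1                  ' rising edge on P0 marks the moment just before the cognew
  cognew(@Two, 0)               ' P1 starts toggling only once this cog is loaded
  repeat                        ' keep this cog alive so P0 stays driven

DAT

Two     org
        mov     dira, #2
:Loop   xor     outa, #2
        jmp     #:Loop
```

The time from the P0 edge to the first P1 edge is then the launch time of one cog as seen from Spin, rather than the spacing between two cogs that were both launched earlier.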

Jonathan

Mike Green · 2009-08-10 22:30

Peter,
You are seeing the time for the Spin interpreter to process a COGNEW call because the time difference between the execution of cog One and the execution of cog Two is offset by the difference in the execution times of their corresponding COGNEWs. Try what I suggested:

_clkmode = xtal1
_clkfreq = 5_000_000

VAR long start, time

PUB launch
   start := CNT
   COGNEW(@asmCode,@time)  ' start up the cog
   WAITCNT(clkfreq + CNT)   ' wait for a second
   result := time - start   ' compute elapsed time
   outa[ 0..1 ] := %00
   dira[ 0..1 ] := %11   ' pins 0 & 1 output mode
   repeat 32
      ' For each bit starting at the least significant,
      ' turn on pin 0 for the true value of the bit and
      ' turn on pin 1 for the false value of the bit
      ' with a 1/2 second delay between bits.
      ' This is intended to be used with LEDs on pins
      ' 0 and 1 wired to light when the pin is high.
      outa[ 0..1 ] := ((result & 1) == 1) ^ 1
      result >>= 1          ' shift the next bit into position
      waitcnt(clkfreq / 2 + cnt)

DAT
asmCode     org
                  mov   temp,cnt
                  wrlong  temp,PAR
loopHere     jmp    #loopHere
temp           res     1

ericball · 2009-08-10 22:51

time =     0  cognew(@One,0)
time =   754  cognew(@Two,0)
time =  1508  cognew(@Three,0)
time =  2262  cognew(@Four,0)
time =  3016  cognew(@Five,0)
time =  3770  cognew(@Six,0)
time =  4524  cognew(@Seven,0)

time =  8300  xor       outa,#1
time =  9054  xor       outa,#2
time =  9808  xor       outa,#4
time = 10562  xor       outa,#8
time = 11316  xor       outa,#$10
time = 12070  xor       outa,#$20
time = 12824  xor       outa,#$40

So there's a 754-clock delay between cognews, and an 8300-clock delay from each cognew to its cog starting to execute.

pjv · 2009-08-10 22:54

Thanks All,

Let me wrap my head around this for a few cycles!

Cheers,

Peter (pjv)

pjv · 2009-08-10 23:25

OK, got it.

I followed Jonathan's suggestion and measured again.... the outA instruction takes 370 clocks OUCH! and one cog launch procedure measures 8950 clocks.

Wow, that is UGLY !

So, can assembly cogs reasonably launch other assembly cogs? (I suppose I should start another thread..... have done enough damage here)

Cheers,

Peter (pjv)

Timmoore · 2009-08-10 23:57

http://forums.parallax.com/forums/default.aspx?f=25&m=114128 gives an example of starting a cog from asm

but that will not remove the ~8000 clks from starting a cog. That is the time spent loading the cog RAM: 512 * 16 clks

Post Edited (Timmoore) : 8/11/2009 12:02:48 AM GMT

pjv · 2009-08-11 03:48

Hi Tim;

That's a long thread, and I did not find the example. I know I messed with this over 3 years ago, but have forgotten much about the propeller, and focussed on squeezing performance out of the SX.

Has anyone found a faster way to load assembly programs from EEPROM without going through the HUB, or by using the counters or video facilities?

Cheers,

Peter (pjv)

Mike Green · 2009-08-11 04:15

You can't use the video generator for input transfers (from an external device to a cog) and the counters don't help except as clock generators.

Any of the I/O functions (including FullDuplexSerial or the SPI engine or the I2C / SPI driver) can be modified to use space in the cog as their buffer. FullDuplexSerial and the I2C / SPI driver use most of the memory in the cog, so they'd have to be significantly simplified to have much buffer memory available. It would be straightforward to make a cog loader that would load from EEPROM or from SPI flash or SRAM since all of these routines actually input the data first to a location in cog memory, then transfer it to hub memory. In most cases, you'd want to pack the data, 4 bytes per long word, but that's easy.

ericball · 2009-08-13 12:58

pjv said...
OK, got it.

I followed Jonathan's suggestion and measured again.... the outA instruction takes 370 clocks OUCH! and one cog launch procedure measures 8950 clocks.

Wow, that is UGLY !

So, can assembly cogs reasonably launch other assembly cogs? (I suppose I should start another thread..... have done enough damage here)

Cheers,

Peter (pjv)

Note: the 370 clocks you are seeing is the delay between COGNEWs. Once you get into the PASM routine, you're operating at full speed with each PASM instruction typically requiring 4 clock cycles.

Yes, the PASM COGINIT instruction may be used by an assembly cog to launch another assembly cog. However:

It still takes 8000+ clock cycles for the 496 LONGs to be copied from HUB RAM to COG RAM. (And don't forget the 8000+ clock delay to start the first assembly cog before it starts the second.)
It needs absolute HUB addresses which the SPIN code typically will need to provide.
It's dang ugly to use - 4 parameters (2 addresses, new flag & cog ID) are shoved into a single register.
Starting SPIN from PASM is impossible for mere mortals.
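
To make the "single register" point concrete, here is a rough sketch in which Spin supplies the absolute addresses and packs the COGINIT operand before the first PASM cog is loaded. The field layout is the one I understand the Propeller Manual to give for COGINIT (PAR address bits 15..2 in bits 31..18, code address bits 15..2 in bits 17..4, bit 3 = "new", bits 2..0 = cog ID), so check it there before relying on it. All names (shared, launchArg, First, Second, t) are placeholders:

```spin
VAR
  long shared                   ' hub long handed to the second cog through PAR

PUB Start
  ' Pack both long-aligned hub addresses plus the "new" flag (bit 3 = next free cog).
  launchArg := ((@shared >> 2) << 18) | ((@Second >> 2) << 4) | %1000
  cognew(@First, 0)             ' launchArg is patched in hub RAM before First loads

DAT
          org     0
First     coginit launchArg       ' the first PASM cog starts the second one
:idle     jmp     #:idle
launchArg long    0

          org     0
Second    mov     t, cnt
          wrlong  t, par          ' record in shared when the second cog came alive
:idle     jmp     #:idle
t         res     1
```

Note that the second cog still only comes alive after two full load times: one for First and one for Second.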

Therefore, it's typically easier for the SPIN code to start the assembly cogs. (Unless your app is doing things like reloading all of HUB RAM from the EEPROM and decompressing on the fly.) The typical design loads all of the cogs on startup (with PASM or another SPIN interpreter), then shuffles data between the threads via HUB RAM.

Oh. Before I forget: that 8000+ clock cycle latency also applies to SPIN threads started via COGNEW.

Fred Hawkins · 2009-08-13 14:51

I think you could mix small asm routines into a cog running a Forth interpreter. I'd try it with SpinForth, which had its source code published. And has an assembler...

http://forums.parallax.com/showthread.php?p=694135 SpinForth thread...

ps, @Graham, SpinForth ought to be added to the good thread list, yes?

Post Edited (Fred Hawkins) : 8/13/2009 2:58:13 PM GMT
