Understanding Assembler

janb · 2007-06-13 17:12

Hi,
I'm playing with assembler code. The attached object runs a short assembly routine, records
CNT at two positions in that routine, and reports the values back so I can look at them later on the TV terminal.

The attached code works, but if I try to call
call    #PackOutput

inside the loop, it does not fill the output values
 a:=ioData[2]
 b:=ioData[3]

I see '0'.
The LED stuff is only there to show that the cog is doing something. I understand that I learn nothing new by calling PackOutput over and over inside the loop, but it should work, right?
I'm probably missing something trivial.

Jan

Graham Stabler · 2007-06-13 17:29

  cognew(@myAss, @ioData) 'Launch new cog, pass params
  waitcnt(cnt+_1s/2)

This makes it work. The Spin code was printing to the screen before the first waitcnt in the assembly code had finished.

Reducing the delay also works, as does putting the display part in a loop.

Graham

Kaio · 2007-06-13 18:05

Jan,

your assembly code works fine. The problem is that the display routine finishes before the call instruction after the waitcnt in the assembly code is executed.

You must use a loop for your display routine, like the following. Otherwise the Cog which is interpreting the Spin code will die.

  repeat
    a:=ioData[2]
    b:=ioData[3]
    d:=b-a
    tv.str(string("t1="))
    tv.dec(a)
    tv.str(string(cr, "t1-t0="))
    tv.dec(d)

Thomas

janb · 2007-06-13 21:05

thank you guys
Jan

janb · 2007-06-14 16:28

Hi,
could I ask this forum for help again?
The attached code is my next step toward understanding assembler.
The goal is to
- declare a local array in a COG (256 longs) and clear it
- quickly increment individual bins, selecting cells at random (for now using CNT)
- after N 'events', write the whole array to HUB memory
- print the first ~20 cells on the screen

It works to some extent. The attached code presets the first 100 cells with the values 1100, 1099, 1098, ... and prints the contents of the first 30 cells on the TV.

Here are my questions:
1) The function 'PresetArray'
works, but I do not understand why I need to use the true address of the array at the start (line ***A1). If I replace it with the variable holding the same address (commented-out line ***A2), the code does not work. I'm wondering why?

2) The function 'FillArray' (commented out in the main assembler routine) runs, but gives incorrect results.
I wanted it to accept 3 'events' and increment 3 cells.
For every 'event' it would receive a ~random cell ID in the range [0,7] based on CNT. The content of that cell should be incremented by 1000, so it is easy to see on the TV which ones were picked.
The result of this code changes after each upload, since the starting value of CNT is always different.

In reality,
- I see cell #0 is always incremented
- cells #1, 2, 3 are never incremented
- cells #4, 8, 12, ... are sometimes incremented
Can you help me fix it?

3) The function PackOutput, which copies the COG-internal array to the HUB array, works fine but looks very clumsy to me. Is there a way to make it faster?

I appreciate your help
Jan

deSilva · 2007-06-14 17:16

1) Funny that you should expect that...
:self mov IPosAddr, x1
moves x1 into register IPosAddr. Why should it move x1 into the address it contains?
In the next line you (correctly) write
movd :self, IPosAddr
by which you move the content of register IPosAddr into the destination field of the instruction.

2) Why do you have this instruction
shl x1,#2 ' this is cell address offset
??
What do you mean by "cell"? The previous code shows that you are well aware that COG registers are addressed one by one (and not by four!)

3) What is wrong with:
:self2 wrlong IntPos ,IdxM
add IPosAddr, #1
add IdxM,#4
movd :self2, IPosAddr
djnz Idx, #:self2
??
Slightly better style however would be

:loop movd :self2, IPosAddr
add IdxM,#4
add IPosAddr, #1
:self2 wrlong 0 ,IdxM
djnz Idx, #:self2

Note that you need something in between the movd and :self2!

Kaio · 2007-06-14 17:50

Jan,

first, a correction to deSilva's code, since the loop would not work properly as written.

:loop   movd :self2, IPosAddr
        add IdxM,#4
        add IPosAddr, #1
:self2  wrlong 0 ,IdxM
        djnz Idx, #:loop

Now let me help you get the function 'FillArray' working. As deSilva mentioned, you must have at least one instruction between a modifying instruction and the instruction it modifies. So I have inserted a nop.

FillArray  {increment content of few cells at 'random' by baseVal=1000 }
        mov     Idx, #3 ' set # of events, (i.e. # cells to be incremented)
:next   mov  x1, cnt 'calc index based on CNT, tmp  
        shr  x1,#2 'drop 2 LSB to make this number ~random, tmp
        and  x1,#$7 ' this is cell #, range [0,7],tmp
        'shl   x1,#2 ' not needed: Cog memory is addressed by long, not by byte
        mov  IPosAddr, #IntPos      'set initial address
        add  IPosAddr, x1   'set final cell address
        movd    :self, IPosAddr 'replace address              
        nop
:self   add     IntPos, baseVal 'increment content of this cell
        djnz    Idx, #:next
FillArray_ret    ret
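
As a host-side illustration (Python on a PC, not Propeller code), the index arithmetic in FillArray can be modelled to see what the extra `shl x1,#2` did to the cell selection. The helper name `cell_offset` and the swept CNT values are made up for this sketch; only the shift and mask steps come from the code above.

```python
# Model of the cell-index arithmetic in FillArray (illustration only).
# cell_offset() is a hypothetical helper, not part of the attached object.

def cell_offset(cnt, with_shl):
    x1 = cnt >> 2          # shr x1,#2  : drop the 2 LSBs
    x1 &= 0x7              # and x1,#$7 : cell number 0..7
    if with_shl:
        x1 <<= 2           # shl x1,#2  : byte-style offset, wrong for cog registers
    return x1

# Sweep a range of made-up CNT values and collect every offset that can occur.
with_shift = sorted({cell_offset(c, True) for c in range(64)})
without_shift = sorted({cell_offset(c, False) for c in range(64)})

print(with_shift)     # [0, 4, 8, 12, 16, 20, 24, 28]
print(without_shift)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With the shift in place, offsets 1, 2 and 3 can never be produced, which is consistent with the cells Jan saw being skipped.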

Thomas

deSilva · 2007-06-15 16:44

Kaio said...
Jan,

first, a correction to deSilva's code, since the loop would not work properly as written.
:loop   movd :self2, IPosAddr
        add IdxM,#4
        add IPosAddr, #1
:self2  wrlong 0 ,IdxM
        djnz Idx, #:loop

I am sorry I did not test the code :-( and such things tend to stay wrong

the (hopefully) correct code could be:

:loop   movd :self2, IPosAddr
        add IPosAddr, #1
:self2  wrlong 0 ,IdxM
        add IdxM,#4
        djnz Idx, #:loop

I don't "like" this very much - in fact I don't "like" the Propeller machine code at all - it should be generated by a compiler in the first place...

Mike Green · 2007-06-15 17:25

deSilva,
You don't have to like the native instruction set. There will be a C compiler available this Winter from someone else, but it simply will not produce code good enough for time critical / space critical applications and it won't be free. Only hand optimized assembly language will work for that. Fortunately, most of what people want to do is not particularly time critical. Space critical depends on what you want to do since there are only 512 words for coding programs for a COG. The C compiler will probably use a hybrid form of assembly language being called "Large Memory Model" which lets you run programs mostly from HUB memory with maybe a 20-25% (perhaps better) hit on performance ... still not bad at all.

janb · 2007-06-15 19:12

Hi,
once again, thanks for all the advice.
I tried it last night and it did work ... after I figured out why the first bin was skipped.
It was nice to see that the same fix was suggested later on this forum. I have learned something over the last few days from you guys.
Jan

Kaio · 2007-06-15 21:12

deSilva said...

I am sorry I did not test the code :-( and such things tend to stay wrong

No problem, I did not test it either, so I did not notice the other mistake.

Jan,

congratulations on finding this mistake by yourself. It is nice to see that you have made such rapid progress in a short time. To learn more about assembly, I would suggest having a look at POD, if you have not already.
http://forums.parallax.com/showthread.php?p=639020

Then you can test your assembly code easily, directly on the Propeller.

Thomas

janb · 2007-06-16 17:05

Hi,
1)

Now that I have working Spin + assembly code, I'm trying to optimize it.
The COG running the assembler code should accumulate some values in an internal array of 100 longs for a short period of time.
Next it should export the array from COG --> HUB memory.
I have measured the transfer time. The core of the code is:
.............
        mov   t1,cnt
:loop   movd  :self2, IPosAddr
        add   IdxM,#4           ' incr Hub address
:self2  wrlong 0 ,IdxM
        add   IPosAddr, #1      ' incr Cog address
        djnz  Idx, #:loop
        mov   t2, cnt
.....................
The difference t2-t1 is the total transfer time of Idx=nA long variables.
I am trying to minimize (t2-t1)/nA.

It now costs 24 COG ticks to transfer a single long from COG to HUB.
This code has the flexibility to transfer longs stored at arbitrary locations in COG memory.
Is there a way to do a sort of 'block transfer', taking advantage of the fact that both the source and destination arrays are contiguous in memory?

I have tested that a single
wrlong a,b
may take 7..22 ticks, as the manual says.
But a series of hardcoded instructions seems to lock the phase between COG & HUB

wrlong a1,b1
wrlong a2,b2
wrlong a3,b3

so it takes only 7..22 + 2x16 clocks.
So naively, if I hardcoded 100 lines as above (a1,...a99), the average transfer time would be just 16 clocks per long. This would definitely be ugly and would need 100 of the 512 longs in COG memory. Later I want to use a COG array of ~300 longs, so it is a bad approach.
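
The estimate above can be written out as a small host-side calculation (Python, illustration only). It assumes, as observed in the post, that the first wrlong costs somewhere in 7..22 clocks and every following back-to-back wrlong costs exactly one 16-clock hub window; the function name `hardcoded_total` is invented for this sketch.

```python
# Clock estimate for n back-to-back wrlong instructions (illustration only).
HUB_WINDOW = 16   # clocks between hub access slots for one cog

def hardcoded_total(n_longs, first_wrlong):
    """Total clocks: the first wrlong (7..22) plus one hub window per further wrlong."""
    return first_wrlong + (n_longs - 1) * HUB_WINDOW

for first in (7, 22):                       # best and worst case for the first access
    total = hardcoded_total(100, first)
    print(first, total, total / 100)        # 7 1591 15.91  and  22 1606 16.06
```

For 100 longs the average lands within a fraction of a clock of 16 in both cases, which is where the "16 clocks per long" figure comes from.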

Q: Is there a way to reach an average wrlong transfer time of 16 ticks per long for a 'large'
COG array?
The full code is attached for convenience.

2)

I tried the POD package and ... sort of got it to work.
I have an external TV connected with 3 resistors, so in PODKeyboardTV.spin I changed
from
 screen : "PC_Text"
 keybrd : "PC_Keyboard"

to
 screen : "TV_Text"
 keybrd : "PC_Keyboard"
I did see the reassembled code on the TV, but could not control anything with the keyboard.

Thanks for all help so far

Jan

Paul Baker · 2007-06-16 19:03

Hi Jan,
What you are doing is similar in function to the digital storage scope acquisition routines I wrote: http://forums.parallax.com/showthread.php?p=606048
The assembly routines are highly optimized primarily for space, secondarily for speed. GetFast is the most streamlined of the assembly routines.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

janb · 2007-06-16 19:46

Wow,
this is complex.
Could you explain how this storage loop works?

:storeloop mov fbufstart, ina 'store pins state
add :storeloop, :d_inc 'increment destination in instruction above
djnz :i, #:storeloop 'go for next transition

Q1:
What does it mean to 'add' the value $200 to the address ':storeloop'? How does it increment the value of fbufstart by 1 long?
Is it a sort of self-editing code?

Q2:
The djnz instruction seems to decrement a constant :i? Why is :i not declared as a variable if it is changing?

I'll study it
Thanks a lot
Jan

deSilva · 2007-06-16 20:47

It might be a good thing to start at the beginning:

Each Propeller instruction has a destination and a source field. Both are register numbers of general-purpose registers in the COG, from 0 up to 511 (though the last 16 are special). There is also the option, for all instructions, to have a literal 0..511 as the source rather than a register number.
That's all!

Well.. nearly.

Some of those registers are occupied by the assembly code itself, which gives us the opportunity to modify that code by moving (MOVS, MOVD, MOVI) or adding into the register occupied by the instruction. This changes the number of the destination or source register, or the literal. BTW, the literal is always the source, even when you would rather call it the destination, as in the case of jumps or calls (=JMPRET).

By this very systematic approach a lot of interesting coding is possible, and necessary, as you often have to simulate a non-existent "indexed" addressing mode.

Don't get confused by the colons; they just mean that the name has local scope and can be re-used within another section.

ad Q1: The destination is incremented by one, as the destination field starts at bit 9 (just above the 9-bit source field in bits 0-8).
ad Q2: No, no, no! It decrements the register ":i", which is defined near the end.

Paul Baker · 2007-06-16 20:51

No problem,
Q1:
Yes, this code is self-modifying. The destination field occupies bits 9-17, and the least significant bit of this field is $200. So by adding $200 to the mov instruction, the destination address is incremented by 1, and since cog memory is organized in longs, the mov instruction points to the next long after the add. Something that isn't immediately apparent: this add occurs after the mov instruction, so the djnz becomes the necessary instruction between modifying an instruction and executing it.

Q2:
In a cog, nothing below the special purpose registers is truly a constant; :i is a "pre-initialized" value. Because this array has to be iterated over twice (first to store the data, then to write it to hub memory), the variable :i has to be reinitialized, and this happens before entering the second loop in "mov :i, #CogBufSz1"; by doing the pre-initialization, an instruction is saved for the first loop. This brings up another point about the mov instruction in the storeloop: there is a pre-initialization that is done by setting the destination initially to fbufstart. If this loop needed to be executed again, you would have to execute a "movd :storeloop, #fbufstart" to reinitialize it.
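
The effect of adding $200 to an instruction can be checked on a PC with a small model of the instruction word (Python, illustration only). The field positions are the ones given above (source in bits 0-8, destination in bits 9-17). The register number chosen for fbufstart and the value used for the upper bits are made-up placeholders; $1F2 is the register number of INA.

```python
# Host-side model of the D and S fields of a Propeller instruction word.
D_INC = 0x200                      # 1 << 9: least significant bit of the destination field

def dest_field(word):
    return (word >> 9) & 0x1FF     # bits 9-17

def src_field(word):
    return word & 0x1FF            # bits 0-8

def make_word(upper_bits, dest, src):
    # upper_bits stands in for bits 18-31 (opcode, flags, condition); arbitrary here
    return (upper_bits << 18) | (dest << 9) | src

FBUFSTART = 0x040                  # hypothetical register number of fbufstart
INA = 0x1F2                        # register number of INA
word = make_word(0x1234, FBUFSTART, INA)   # stands for "mov fbufstart, ina"

for _ in range(3):
    word += D_INC                  # what "add :storeloop, :d_inc" does on each pass

print(hex(dest_field(word)))       # 0x43: destination moved on by three registers
print(hex(src_field(word)))        # 0x1f2: source field untouched
```

Only the destination field changes, as long as the additions do not carry out of bit 17 into the rest of the word.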

Kaio · 2007-06-16 21:42

Jan,

as you have found, the fastest you can write to main memory is once every 16 system clocks. This is also described in the Propeller manual on page 24, with a nice clock diagram.

Now you must optimize your code so that the Cog can write to main memory within this short time in a loop. Since a hub instruction (e.g. wrlong) takes at least 7 system clocks, 9 clocks are left. So you have time for only 2 more instructions, and one of them has to be the djnz. That leaves one instruction to increment a pointer, and you could use the loop counter as the pointer into Cog memory as well. But you would have to fill the buffer in the same manner, since it will be transferred in reverse order.

              mov       Idx,nA                  'the buffer in Cog must be located from 1...nA
[s]:loop         wrlong    Idx,IdxM                'the buffer will be transfered in reverse order[/s]
              add       IdxM,#4                 'incr Hub address 
              djnz      Idx,#:loop              'decr Cog address               
        

nA            res       1                       'used size of the array
Idx           res       1
IdxM          res       1
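
The clock budget described above can be sketched as a simplified steady-state model (Python on a PC, illustration only). It uses the figures from this thread: one hub window per 16 clocks, 7 clocks for the wrlong once its window arrives, and 4 clocks for every other instruction. The function name `clocks_per_iteration` is invented here, and real hardware timing may differ in detail.

```python
import math

HUB_WINDOW = 16   # a cog gets hub access once every 16 system clocks
HUB_MIN = 7       # clocks a hub instruction needs once its window arrives
INSTR = 4         # clocks for an ordinary instruction

def clocks_per_iteration(other_instructions):
    """Loop period for one wrlong plus the given number of ordinary instructions."""
    busy = HUB_MIN + INSTR * other_instructions
    # The next wrlong has to wait for the first hub window at or after `busy`.
    return math.ceil(busy / HUB_WINDOW) * HUB_WINDOW

for n in range(5):
    print(n, clocks_per_iteration(n))   # 0..2 extra instructions -> 16, 3..4 -> 32
```

In this model a loop with wrlong plus two further instructions stays at 16 clocks, while a third extra instruction pushes the loop to the next window at 32 clocks.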

When you are using POD you have two ways to control it.
1. You can use the default and recommended configuration, which uses PropTerminal on your PC. It is delivered with POD, but you can also use the newer version if you have problems uploading your program with PropTerminal.
http://forums.parallax.com/showthread.php?p=649540

2. You can use a TV and a keyboard connected to your Prop board.

It is possible to mix these configurations, but it is not useful. PropTerminal provides a much larger screen, so you can see more information about the Propeller internals.

Thomas

Post Edited (Kaio) : 6/17/2007 12:18:23 PM GMT

deSilva · 2007-06-16 22:12

This code is absolutely cool

rjo_ · 2007-06-16 23:42

Jan,

I've been in and out so much I missed this thread. I had a very similar question.

In response to my question, Ariba posted a very simple but complete example. His code is a great way to approach Paul's example, which is incrementally more useful and slightly more complex.... really tight code :)

My problem was that I didn't have an application. It was an abstract issue for me ... so I understood the answers... and then promptly forgot them. Then when I wanted to actually implement some code... I had to go back and work through it all again.

http://forums.parallax.com/forums/default.aspx?f=25&m=190445&g=190993#m190993
So, for anyone similarly confused... my suggestion is go to Ariba's example... refresh yourself and then go on to Paul's code.

Rich

Kaio · 2007-06-16 23:50

deSilva and Jan,

sorry, I think this does not work as I described above. It would write the value of Idx to main memory, not the value addressed by Idx.

So far I have not found any code that transfers a buffer from Cog memory to main memory within 16 system clocks.
I don't know if it is possible.

Thomas

Ariba · 2007-06-17 01:48

Hello Kaio

It should be possible with this code:

              movd      :loop,#AddrCog          'begin of buffer in Cog
              movs      :loop,#AddrHub          'begin of buffer in HubRAM
              mov       Count,#Size             'Size of buffers
:loop         wrlong    0-0,0-0                 'the buffer will be transfered
              add       :loop,#$204             'incr Hub address (by 4) and Cog address (by 1) 
              djnz      Count,#:loop            'loop until done

Cheers
Andy

Paul Baker · 2007-06-17 03:11

Hi guys,
Before I was hired by Parallax, while working the kinks out of my dscope, Phil Pilgram and I tussled with this very issue: how to write an array to hub memory in every hub time slice. The answer is, it can't be done. The heart of the issue is that immediate values can only hold 0-511, and with the wrlong instruction this is the location in hub memory. This means only the first 512 locations in hub memory are addressable via Ariba's code. Since static mapping of hub memory is not supported by the compiler, this isn't possible in the general sense and is really a hack to attempt at all. And don't forget the first 16 locations have a very special meaning. For this reason, writing an array to hub memory in a single hub slice isn't possible. My :writeloop loop is the most compact "works in every situation" code possible with the architecture.

BTW Ariba, your code violates another rule: the immediate value in the add instruction exceeds the maximum value possible ($1FF).
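
The size problem with `#$204` is plain arithmetic and can be checked on a PC (Python, illustration only; the constant names are invented for this sketch):

```python
# Why "add :loop,#$204" cannot be encoded with an immediate source.
IMM_MAX = 0x1FF            # largest literal that fits the 9-bit source field
STEP = 0x204               # intended step: +1 in the destination field, +4 in the source field

print(STEP > IMM_MAX)      # True: the literal does not fit

# The two per-field increments that the single value was meant to carry.
dest_step = STEP >> 9      # 1 -> next cog register
src_step = STEP & 0x1FF    # 4 -> next long in hub memory (byte address + 4)
print(dest_step, src_step) # 1 4
```

A value that does not fit as a literal can instead be kept in a cog register and used as an ordinary (non-immediate) source operand.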

Post Edited (Paul Baker (Parallax)) : 6/17/2007 3:20:23 AM GMT

deSilva · 2007-06-17 07:00

Ariba's code is fine! Paul's issue (add $204) can easily be fixed by using a preset register with that value (which takes no extra time)! This is one of my own mental blocks from former machine language experience: I can hardly imagine that it makes no difference.

BTW: Is there an "Instruction timing" diagram?

I imagine the timing would be

instr fetch (1) - get registers (2) - operation (3) - store back (4)

On the other hand there is most likely an instruction prefetch happening during (3) which is the only phase without COG memory access. But we know that an instruction takes four, not 3 ticks.

And most likely "get registers" would take 2 ticks, when not in immediate mode.

So, second try:

instr fetch (1) - idle (decode?) (2) - get dest reg (3) - get source reg (4) - operate (5) = next instr fetch - store back (6)

Paul Baker · 2007-06-17 08:25

deSilva,
yes, my BTW is easily fixable, but the issue discussed in the paragraph still stands. The instruction is:

wrlong cog_address, hub_address

The cog address must be a direct value, i.e. the value contained in register cog_address is written to hub memory.

The hub address can be either a direct value or an immediate value. If it's direct, the value in register hub_address is the address in hub memory. This works just fine; however, to increment the address in hub memory, the value contained in register hub_address must be incremented, not the instruction itself, so an extra instruction is required. You cannot add $04 to the instruction. Why? Say you have your hub address stored in register $06 and add $04: the next time wrlong is executed, the contents of register $0A are used as the hub address, so each time you wrlong, the register used to figure out where in hub memory to write changes. This clearly won't work.

The other option for the hub address is immediate (# prefix). At first glance this would work, but there is a major catch: the only possible values to index into hub memory are $000 - $1FF, and $000 - $00F are off limits. Since hub memory is byte addressable, that means you have 124 longs addressable in hub memory. To further complicate things, how are you going to reserve that specific range of locations in hub memory? The compiler has no facility to do this. You could make sure they are the first longs reserved in the top object, but this is poor programming practice: your object must always be the top object, and this kind of strict restriction makes your code unusable as a distributable object. Also, because this behavior falls in the undocumented category, it could change in a future revision of the compiler, and all of a sudden the code no longer works.
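
Both points can be reproduced with a few lines of host-side arithmetic (Python, illustration only; the constant names are invented for this sketch, the numbers are the ones from the post):

```python
# How many longs an immediate hub address can reach.
SRC_BITS = 9
HUB_IMM_LIMIT = 1 << SRC_BITS        # immediates cover byte addresses $000..$1FF
RESERVED_BYTES = 0x10                # $000..$00F are off limits
BYTES_PER_LONG = 4

usable_longs = (HUB_IMM_LIMIT - RESERVED_BYTES) // BYTES_PER_LONG
print(usable_longs)                  # 124

# Direct mode: the source field names a register, so adding 4 to the field
# selects a different register instead of advancing the hub address.
src_register = 0x06
next_src_register = src_register + 4
print(hex(next_src_register))        # 0xa
```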

Post Edited (Paul Baker (Parallax)) : 6/17/2007 8:54:18 AM GMT

Paul Baker · 2007-06-17 08:37

The pipeline is SDIR: Source, Destination, Instruction, Result. The Instruction stage is for the following instruction (so the current instruction was fetched before the previous instruction's result was written); this is why an intervening instruction is required for self-modifying code. So each instruction actually takes 6 cycles, but because it is interleaved with the adjacent instructions, the throughput is 1 instruction every 4 cycles.
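
One way to read this description is as a simple timeline, sketched here in Python for a PC (illustration only). The mapping of stages to individual clocks is an assumption made for this sketch: instruction n reads its Source at clock 4n, its Destination at 4n+1, fetches the next instruction at 4n+2 and writes its Result at 4n+3. All function names are invented.

```python
# Simplified SDIR timeline (assumed stage-to-clock mapping, not a hardware spec).

def fetch_clock(n):
    """Clock at which instruction n is fetched (during instruction n-1's I stage)."""
    return 4 * (n - 1) + 2

def result_clock(n):
    """Clock at which instruction n writes its result back."""
    return 4 * n + 3

def sees_modification(modifier, target):
    """True if `target` is fetched after `modifier` has written its result."""
    return fetch_clock(target) > result_clock(modifier)

n = 10  # any instruction index
print(result_clock(n) - fetch_clock(n) + 1)   # 6 clocks from fetch to result
print(fetch_clock(n + 1) - fetch_clock(n))    # 4 clocks between successive instructions
print(sees_modification(n, n + 1))            # False: the next instruction was already fetched
print(sees_modification(n, n + 2))            # True: one instruction in between is enough
```

Under these assumptions the model reproduces the three statements in the post: 6 clocks per instruction, one instruction every 4 clocks, and the need for one intervening instruction in self-modifying code.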

deSilva · 2007-06-17 09:21

Pipeline: So I was not far from the truth.

Loop: So sorry, I definitely missed the catch! But Paul has made it absolutely clear now.
However, for a lot of applications it is no restriction to use the first fixed 500 bytes as a communication area.

The main obstacle, as I understand it, is that the SPIN interpreter starts interpreting at byte 16!?
So maybe a pure assembly program, consisting of just a single cognew(@asm,0) instruction, will leave space.

Could this work?

PUB Ex1
    cognew(@asm, 0)

DAT
           LONG 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ...
asm     ORG  0
.....

The memory dump looks good

I am well aware that it makes no sense to put data into the HUB when nobody is there to read it; but one can load the other COGS as well (making it 7 COGNEWs at the beginning).

BTW: How do you get rid of the SPIN interpreter in COG #0 ???

Post Edited (deSilva) : 6/17/2007 9:28:33 AM GMT

Kaio · 2007-06-17 12:52

deSilva,

if you use only assembly code, except for starting the Cogs, then once all the Spin code has run there is no more Spin code to process. So the Cog containing the Spin interpreter would die.

Thomas

Ariba · 2007-06-17 23:06

Hello Paul Baker

Yes, the code from my previous posting does not work, so I thought about it a little longer and found this solution:

              movd      :loop,#CogBuffer        'begin of buffer in Cog
              movs      :loop,#HubAddrTab       'begin of address table
              mov       Count,#Size             'Size of buffers
:loop         wrlong    0-0,0-0                 'the buffer will be transferred
              add       :loop,IncD_S            'incr Table Idx (by 1) and Cog address (by 1) 
              djnz      Count,#:loop            'loop until done
              ...               

IncD_S        long  $201

HubAddrTab    long  HubBuffer+0                 'Table with addresses of HubBuffer Indexes
              long  HubBuffer+4                 'must be initialized in Spin before Cog Start
              long  HubBuffer+8
              long  HubBuffer+12
              long  HubBuffer+16
              ...                               'max 240

CogBuffer     res   max. 240                    'the Buffer in CogRAM that will be transfered

It is not compact at all, but it runs at 16 clocks per loop. Half of the CogRAM is the buffer and the other half is the address table, so at most about 240 longs are possible, but they can go to any destination address in HubRAM!
And because the addresses are in a table, any ordering in the destination buffer is possible (for example, bit-reversed for an FFT).
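
The data flow of this table-driven loop can be followed in a small host-side simulation (Python, illustration only, not Propeller code). Cog memory is modelled as a list of 512 values and hub memory as a dictionary keyed by byte address. All register numbers, the hub base address and the buffer contents are made up for this sketch; only the D and S fields of the wrlong instruction are modelled.

```python
# Simulation of "wrlong 0-0,0-0" stepped by $201 with an address table.
COG_BUFFER = 0x100     # hypothetical first register of CogBuffer
HUB_ADDR_TAB = 0x080   # hypothetical first register of HubAddrTab
HUB_BUFFER = 0x4000    # hypothetical hub byte address of HubBuffer
SIZE = 8
INC_D_S = 0x201        # +1 in the destination field, +1 in the source field

cog = [0] * 512
for i in range(SIZE):
    cog[COG_BUFFER + i] = 1000 + i              # sample data to transfer
    cog[HUB_ADDR_TAB + i] = HUB_BUFFER + 4 * i  # table of hub addresses

hub = {}
instr = (COG_BUFFER << 9) | HUB_ADDR_TAB        # movd / movs set up both fields
count = SIZE
while count:
    d = (instr >> 9) & 0x1FF
    s = instr & 0x1FF
    hub[cog[s]] = cog[d]    # wrlong D,S: value of register D goes to the hub address held in register S
    instr += INC_D_S        # add :loop,IncD_S
    count -= 1              # djnz Count,#:loop

print(hub[HUB_BUFFER], hub[HUB_BUFFER + 4 * (SIZE - 1)])   # 1000 1007
```

Changing the order of the entries in the table changes where each long lands in the hub buffer, which is the reordering property mentioned above.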

Andy

Paul Baker · 2007-06-18 04:36

If you don't mind consuming 2 longs for each long of information, that will work.

janb · 2007-06-18 16:25

Hi,
the latest solution from Ariba seems to cost a bit more than 16 clocks per transferred long.
Setting up HubAddrTab adds a price tag of 4 ticks per long, so the total is 20 ticks per long.
But that is much better than 32. And one could argue that HubAddrTab needs to be set up only once, while the long transfer runs multiple times.
One could also change HubAddrTab so that it writes itself; this way the largest COG array must be below (512-const)/2 longs.
It was a long series of advice; now I'm as smart as you :) , almost.
For now, I'll lower the sampling fraction of the source and go with the 4-line ASM code transferring from COG to HUB at 32 ticks per long.

I hope you do not mind a small suggestion. For a student like me, it would help if every code example with a new idea
were accompanied by a clear statement:
- it has been tried and does work, or
- it may work but was not tested.
But I have learned a lot already.

Can I now ask a completely new question about the best strategy for allocating local variables (with ':')?
Q1:
Assume I have 3 subroutines in ASM, called in sequence by the master:
clearArray, fillArray, exportArray
Each needs some sort of internal running index: idx. Shall I declare:
a) ':idx' in every subroutine, 3 times, or
b) one
idx long 0
or
c) one
idx res 1

I think a) wastes 2 registers. I see no difference between b) and c).
Any suggestions?

Q2:
Assume a 4th subroutine, getValue, called by fillArray, needs a working variable x1.
Assume fillArray itself needs some other working variable.
Does it make sense to declare 2 local variables of the same name ':x1' in both subroutines
to save one register?
Or does it only add confusion to the code structure?

thanks
Jan

Mike Green · 2007-06-18 17:15

1) You will probably do better with ": idx" in each subroutine. Since "idx" is allocated on the stack, the space is available for other uses when the subroutines are not executing. Since the subroutines are executed sequentially, they will also reuse the space. Also, the first few words of the stack (and of the VAR area) are accessed with special instructions which are faster and smaller than other access. (c) will not work with Spin. It is intended only for use in assembly language where it allocates space in cog memory, but not in main (hub) memory.

2) Most times when you need working variables in a subroutine, it is better to declare them as local variables. It is common to use the same names over and over again for local working variables. I tend to use "i", "j", "k", "m", "n" and "a", "b", "c", "d", "e" in this way.
