Problems with wrlong

Tectu · 2011-07-11 16:41

Hello folks!

I have a little issue in an assembler section....
I want to copy the value from INA to the main RAM using wrlong. This is what my code looks like:

makescan                wrlong INA, address               
                        add address, #4                      
                        djnz samples, #makescan

When I execute that, the main RAM contains $0000_0000. Then I tried this one:

makescan                mov temp, INA
                        wrlong temp, address                               
                        add address, #4                                       
                        djnz samples, #makescan

That one works fine... but WHY???
And how can I fix it? It's very important that the subroutine (makescan) takes exactly 16 cycles...

Can anyone help me?

Greetings Tectu

localroger · 2011-07-11 16:46

INA cannot be used in the destination field of an instruction. Even though it's logically the "source" of your WRLONG, in the PASM instruction encoding it sits in the destination field, so the instruction doesn't read the pins. I've had the same issue myself.

On Edit: There's no way to fix this if your requirement is to really record all 32 bits of INA every 16 clocks, but if your real need is more limited it might be possible to do some kind of partial caching into Cog RAM. If you tell us more about the application there might be a workaround.

Tectu · 2011-07-11 16:57

Hello localroger,

Thank you for your very fast response.

I would like to build a logic analyzer. I know that there is the Parallax Digital Storage Logic Analyzer and also the Propalyzer, but I'd like to build one completely on my own, to learn... Another reason is that I want more than a 4.44MHz sample rate.

Now, should I say more about my ideas, or will you tell me something like "Don't reinvent the wheel"?

Thanks for your help!

Mike Green · 2011-07-11 17:18

You can't do "wrlong ina,address" as localroger mentioned. There's actually 512 longs of memory for each cog. There's special circuitry that chooses INA instead of the corresponding "shadow" memory location when that location occurs in the source field of an instruction. There's no such circuitry when INA occurs in the destination field of an instruction as in your case, so the "shadow" memory location is used.
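The source/destination asymmetry described above can be modelled on a PC. The following is a toy Python sketch, not real Propeller behaviour: the `Cog` class, the `TEMP` address and the pin value are made up for illustration, and the shadow location is assumed to hold 0, as in the first post.

```python
# Toy model of one cog's register file (assumption: heavily simplified).
INA_ADDR = 0x1F2  # cog address of the INA special register
TEMP = 0x100      # hypothetical scratch register

class Cog:
    def __init__(self, pins):
        self.ram = [0] * 512   # 512 longs, including the "shadow" locations
        self.pins = pins       # current state of the 32 I/O pins

    def read_source(self, addr):
        # Source field: special circuitry substitutes the pin states for INA.
        return self.pins if addr == INA_ADDR else self.ram[addr]

    def read_destination(self, addr):
        # Destination field: no such circuitry, plain (shadow) RAM is used.
        return self.ram[addr]

def mov(cog, d_addr, s_addr):
    cog.ram[d_addr] = cog.read_source(s_addr)

def wrlong(cog, hub, d_addr, hub_addr):
    # wrlong D, S: the value written to the hub comes from the destination field.
    hub[hub_addr // 4] = cog.read_destination(d_addr)

hub = [None] * 2
cog = Cog(pins=0x5F018003)

wrlong(cog, hub, INA_ADDR, 0)   # wrlong ina, address  -> writes the shadow value
mov(cog, TEMP, INA_ADDR)        # mov temp, ina        -> reads the pins
wrlong(cog, hub, TEMP, 4)       # wrlong temp, address -> writes the pin states

print(hex(hub[0]), hex(hub[1]))
```

In this model the first write stores 0 and the second stores the pin states, which matches what the first post observed.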

It's possible to use more than one cog to sample I/O pins so that one sample is made each system clock cycle. You have to synchronize the cogs so that their 4 clock instruction cycles are offset, each by one. This is done with a WAITCNT instruction with each cog waiting for a different system clock value to pick up execution. One cog waits for TIME+0. Another cog waits for TIME+1, etc. The cogs would store the samples in a buffer in their own memory, then copy the saved values to a common buffer in hub memory for processing. That's how the existing Propeller logic analyzers work.
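As a rough illustration of the interleaving, here is a small Python sketch of the sample times. It assumes each cog reads INA once per 4-clock instruction slot and that cog k is released by WAITCNT at TIME+k; the function name and the start value are hypothetical.

```python
def sample_times(cog_index, start, count, period=4):
    """Clock ticks at which one cog samples, assuming one read every `period` clocks."""
    return [start + cog_index + period * i for i in range(count)]

start = 1000  # hypothetical common TIME value
per_cog = [sample_times(k, start, 3) for k in range(4)]

# Merging the four streams gives one sample on every system clock.
merged = sorted(t for times in per_cog for t in times)
print(merged)
```

The merged list has no gaps, which is the point of offsetting each cog by one clock.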

Tectu · 2011-07-11 17:26

The Parallax Logic Analyzer uses more than one cog to sample the I/Os? I don't see that in the source code, where is that?

localroger · 2011-07-11 17:52

Mike has it, multiple cogs.

You write your cog program to do, say, 4x interleaved writes. Each instance doesn't care about the other instances; it just reads, then waitcnts until the right time to do the next 4x interleaved read. When you start the cogs, you give them each a starting CNT to wait for, chosen a judicious way into the future considering cogstart overhead, and offset by the appropriate number of clocks for each of the four cogs. When your four cogs start they each wait for their individual offset start points and go about their individual merry ways stuffing data into Hub RAM at 4 long / 16 byte intervals. What your top-level Spin app sees is a smooth stream of data.

kuroneko · 2011-07-11 17:56

localroger wrote: »

On Edit: There's no way to fix this if your requirement is to really record all 32 bits of INA every 16 clocks, ...

Does that count as impossible? Anyway, there is a way to get ina to the hub every 16 cycles as long as you don't mind transferring 2n longs starting at high addresses and going down from there. Check the thread "[POC] reverse overlay loader aka cog to hub transfer" (thread 129719). The only thing which has to change is the main loop, something like this should do:

                mov     tmp, ina                '  +8
                mov     phsa, size              '  -4   hub byte count (8n + 7)

:copy7          wrlong  tmp, phsa               '  +0 = transfer long between cog and hub
                mov     tmp, ina                '  +8
                sub     phsa, #7 wz             '  -4

:copy1          wrlong  tmp, phsa               '  +0 = transfer long between cog and hub
        if_nz   mov     tmp, ina                '  +8
        if_nz   djnz    phsa, #:copy7           '  -4

localroger · 2011-07-11 19:11

kuroneko, that is absolutely wicked brilliant.

Tectu · 2011-07-11 20:09

localroger wrote: »

kuroneko, that is absolutely wicked brilliant.

Too bad that I am so new to assembler and the Propeller that I cannot understand it

K2 · 2011-07-11 20:09

It took a while before my brain would indicate anything other than $deadbeef. But now I get it! I can't believe I actually understand one of kuroneko's masterpieces. One should earn another star just for that.

Tectu · 2011-07-11 20:11

Okay... Would anyone be kind enough to explain how it works, step by step?

I don't want to use things when I don't know how they work...

kuroneko · 2011-07-11 20:57

Tectu wrote: »

Too bad that I am so new to assembler and the Propeller that I cannot understand it

Well, you noticed that there is not much time left for all the things you have to do (reading ina, writing it (temp) to hub, incrementing the hub address, decrementing the loop counter etc).

So some people (Hi Phil!) came up with the sub #7/djnz approach. This exploits the fact that rdlong/wrlong ignores the lowest two address bits. Say you have a 4n address. You set the bottom two bits thereby making it 4n+3. Doing a rdlong will read data from 4n (lower two bits ignored). Then you subtract #7 and we end up with 4n-4 (the long address before that). Finally the djnz subtracts #1 and we end up with 4n-5 (or 4(n-2)+3). In the end we transferred two longs and adjusted the address by 8 (7+1) which is what we would have done anyway (4+4). It's a bit tricky the first time you see it but take the time to go through an example (on paper) and follow the steps.
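For anyone who prefers code to paper, the address walk can be traced with a short Python sketch. It only models the arithmetic described above (low two address bits ignored, then -7 and -1); the function names and the base address are made up for the example, and an even number of longs is assumed.

```python
def hub_long_address(addr):
    """rdlong/wrlong ignore the lowest two address bits."""
    return addr & ~3

def walk(base, longs):
    """Long addresses touched by the sub #7 / djnz pattern (longs must be even)."""
    addr = base + 4 * longs - 1                 # last long, bottom two bits set (4n+3)
    visited = []
    for _ in range(longs // 2):
        visited.append(hub_long_address(addr))  # first transfer of the pair
        addr -= 7                               # sub  addr, #7
        visited.append(hub_long_address(addr))  # second transfer of the pair
        addr -= 1                               # djnz addr, #loop
    return visited

print([hex(a) for a in walk(0x100, 4)])
```

Each pass moves two longs and lowers the address by 8 in total, so the transfers go downwards in steps of one long.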

Something is still missing. The loop size. For a read loop (hub to cog) that was initially handled in that the code being loaded overwrote the final djnz of the transfer loop and the code simply continued with what we just loaded.

This overwrite-drop-through approach didn't work too well in the opposite direction though (cog to hub), because there isn't anything to overwrite. But don't despair. This is where shadow registers come in. Basically some of the special registers ($1F0-$1FF) have special behaviour depending on whether you just read from or write to them (or both). One of them is the counter phase accumulator (phsx).

If you read from it you get the counter value (value := counter[phsx]).
If you write to it its shadow location and the counter are written to. (shadow[phsx] := counter[phsx] := value)

Finally, read-modify-write performs the operation based on shadow[phsx] but updates both shadow and counter.

' phsx read-modify-write issue

      mov     temp, phsx     ' temp := counter[phsx]
      shr     temp, #1       ' temp >>= 1
      mov     phsx, temp     ' update shadow and counter

      ' is equivalent to

      mov     phsx, phsx     ' shadow[phsx] := counter[phsx]
      shr     phsx, #1       ' r: operate on shadow[phsx]
                             ' m: shadow[phsx] >>= 1
                             ' w: update shadow and counter

So what I've done is to exploit this disconnect between the shadow and the counter register. I use the shadow as the loop counter (sub phsa, #7 wz/djnz phsa, #:copy7). A loop count for, e.g., 10 longs doesn't give me proper hub addresses on its own, though (the counter register is what wrlong uses as its target address). So what we do now is enable the counter and let it add the base address. Let's look at these two lines:

                mov     phsa, size              '  -4   hub byte count (8n + 7)
:copy7          wrlong  tmp, phsa               '  +0 = transfer long between cog and hub

Using the 10 longs as an example we feed phsa with 40-1 (one less than byte count). Up to the point when the wrlong has collected all its operands frqa has been added twice to phsa (that's simply something you have to know or figure out). Which means we divide our base address by 2 (it's 4n so no issues here) and place it into frqa. So the first write goes to base/2 + base/2 + 39 = base + 36 + 3. 36 is the offset of element 9, that's what we want (followed by 8 down to 0).
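The arithmetic can be checked with a small Python sketch. It is a simplified model resting on the statement above that frqa has been added twice by the time wrlong uses phsa, so the hub address is taken as shadow + 2*frqa; the function name and the base address are hypothetical.

```python
def simulate(base, longs):
    """Hub addresses written by the loop (base long-aligned, longs even)."""
    frqa = base // 2                 # base/2, added twice before each wrlong
    shadow = 4 * longs - 1           # mov phsa, size (byte count - 1)
    writes = []
    while True:
        writes.append((shadow + 2 * frqa) & ~3)   # :copy7 wrlong
        shadow -= 7                               # sub phsa, #7 wz
        zero = shadow == 0
        writes.append((shadow + 2 * frqa) & ~3)   # :copy1 wrlong
        if zero:                                  # if_nz fails, loop ends
            break
        shadow -= 1                               # if_nz djnz phsa, #:copy7
    return writes

base = 0x1000
writes = simulate(base, 10)
print([w - base for w in writes])
```

For 10 longs the first write lands at base + 36 (element 9) and the last at base + 0, after which the shadow reaches zero and the loop falls through.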

I'm not a great explainer but I hope that gives you an idea why this works. Feel free to ask more questions (preferably in the Propeller sub forum).

K2 · 2011-07-11 22:43

If kuroneko had been born earlier he probably would have discovered relativity or created the first polio vaccine.

Tectu · 2011-07-12 04:43

Well, I think I understand the concept of that code now, but I could never build that on my own.

I have just two more questions about your explanation:

What do 4n, 8n, 5n, etc. mean? A number of digits?
What is "size" for, and what value should it have?

Edit: Someone should move this thread to where it belongs. I could not figure out the right board for it - sorry

kuroneko · 2011-07-12 06:00

Tectu wrote: »

Well, I think I understand the concept of that code now, but I could never build that on my own.

Don't worry, neither could I when I started

What do 4n, 8n, 5n, etc. mean? A number of digits?

What is "size" for, and what value should it have?

4n simply stands for a number divisible by 4 without remainder and is used when the actual value isn't important, e.g. it could be 1024 or 44. A long variable is usually stored at long aligned addresses which - given its size of 4 bytes - is 4n.

As for size, this is the number of bytes you want to transfer, minus 1. Just follow the link to the POC thread and have a look at the code listing (64 longs transferred amounts to 64*4-1 = 255). Or have a look at the cog storage code (post 978929), which uses - IIRC - 484 longs (size = 1935). HTH
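As a quick sanity check of those numbers, here is a tiny Python helper. The function name is made up, and the even-count check reflects the "2n longs" restriction mentioned earlier in the thread.

```python
def transfer_size(longs):
    """Value for `size`: byte count minus 1, for an even number of longs."""
    if longs <= 0 or longs % 2:
        raise ValueError("long count must be a positive even number")
    return longs * 4 - 1

print(transfer_size(64))    # 255
print(transfer_size(484))   # 1935
```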

Tectu · 2011-07-12 08:39

Okay... I have now tried to implement that in my code (what follows is the whole program).
I just get H(+??????????? from the serial terminal, without any newlines.

The Spin code normally just reads the main RAM and sends it to the serial terminal, with a newline after every long.

CON
  _clkmode      = xtal1 + pll16x
  _xinfreq      = 5_000_000

VAR
  long data[100]
  byte i

OBJ
  Display       : "VGA_text"
  Key           : "Keyboard"
  Debug         : "Parallax Serial Terminal"

PUB main
  Debug.start(115200)

  cognew(@entry, @data{0})

  waitcnt(clkfreq + cnt)

  Debug.char(13)
  Debug.str(string("Begin now: "))                      'begin output now
  Debug.char(13)

  repeat i from 0 to 99
    Debug.bin(data[i], 32)
    Debug.char(13)

  Debug.str(string("Finished!!"))                       'output finished


DAT
                        org

entry                   mov     samples, #100           'make 100 samples
                        mov     size, #3                'we want to send 1 long (4 bytes) - 1 = 3 to main RAM

makescan                mov     tmp, ina                '  +8
                        mov     phsa, size              '  -4   hub byte count (8n + 7)

:copy7                  wrlong  tmp, phsa               '  +0 = transfer long between cog and hub
                        mov     tmp, ina                '  +8
                        sub     phsa, #7 wz             '  -4

:copy1                  wrlong  tmp, phsa               '  +0 = transfer long between cog and hub
              if_nz     mov     tmp, ina                '  +8
              if_nz     djnz    phsa, #:copy7           '  -4
                        djnz    samples, #makescan

:here                   jmp     #:here                  'never-ever-lands


tmp           RES       1
size          RES       1                               'need to send a long to main RAM
samples       RES       1                               'amount of samples

Sorry, for whatever I am doing wrong. It's my first project with a propeller, so please, tell me what I should do better ;-)

~ Tectu

Tectu · 2011-07-12 12:09

Edit: I get that string from the serial console more than 100 times; it repeats infinitely.

kuroneko · 2011-07-12 17:59

I'll prepare a sample for you unless you figure it out in the meantime

Tectu · 2011-07-12 18:29

I have not worked on it any further. I tried to add the multi-cog stuff, half successfully.
To try the routines, I took my old wrlong scan method; I would later replace it with your 16-cycle routine to be faster.

This is what I wrote:

CON
  _clkmode      = xtal1 + pll16x
  _xinfreq      = 5_000_000

VAR
  long waitcog                  'how long the cog has to wait
  long samples_amount           'how many samples should be made
  long coghptr                   'to switch to the right pointer
  long data[100]                'space on main RAM
  byte i

OBJ
  Display       : "VGA_text"
  Key           : "Keyboard"
  Debug         : "Parallax Serial Terminal"

PUB main
  Debug.start(115200)
  samples_amount := 5                                   'samples that will be done *4

  coghptr := 0
  waitcog := 100_000 + cnt

'----------------------------------

  coghptr += 0
  waitcog += 0
  waitcnt(10_000+cnt)
  cognew(@entry, @waitcog)

  coghptr += 4
  waitcog += 32
  waitcnt(10_000+cnt)
  cognew(@entry, @waitcog)

  coghptr += 4
  waitcog += 32
  waitcnt(10_000+cnt)
  cognew(@entry, @waitcog)

  coghptr += 4
  waitcog += 32
  waitcnt(10_000+cnt)
  cognew(@entry, @waitcog)

'----------------------------------

  waitcnt(clkfreq + cnt)

  Debug.char(13)
  Debug.str(string("Begin now: "))                      'begin output now
  Debug.char(13)

  repeat i from 0 to samples_amount*4-1
    Debug.bin(data[i], 32)
    Debug.char(13)

  Debug.str(string("Finished!!"))                       'output finished


DAT
                        org     0

entry                   mov     tmp, par
                        rdlong  wait, tmp
                        add     tmp, #4
                        rdlong  samples, tmp
                        add     tmp, #4
                        rdlong  addhptr, tmp
                        add     tmp, #4
                        mov     hptr, tmp               'copy pointer to begin of main RAM

                        add     hptr, addhptr

                        mov     dira, 0                 'make all pins input

                        waitcnt wait, #0                'wait for sync
                        nop

makescan                mov     tmp, ina                'write INA to tmp
                        wrlong  tmp, hptr               'write the sample to main RAM
                        add     hptr, #16               'point to the fourth-next long
                        nop
                        nop
                        nop
                        djnz    samples, #makescan      'do for amount of samples

here                    jmp #here


wait          RES       1                               'how many cycles should you wait
hptr          RES       1                               'pointer to main RAM
samples       RES       1                               'amount of samples that will be done
tmp           RES       1
addhptr       RES       1

and this is what I get on serial terminal:

Begin now:
00000000000000000000000000000000
01011111000000011000000000000011
01011111000000011000000000000011
01011111000000011000000000000011
00000000000000000000000000000000
01011111000000011000000000000011
01011111000000011000000000000011
01011111000000011000000000000011
00000000000000000000000000000000
01011111000000011000000000000011
01011111000000011000000000000011
01011111000000011000000000000011
00000000000000000000000000000000
01011111000000011000000000000011
01011111000000011000000000000011
01011111000000011000000000000011
00000000000000000000000000000000
01011111000000011000000000000011
01011111000000011000000000000011
01011111000000011000000000000011
Finished!!

no idea why the first cog is not working properly...
But anyway, this is not why this thread exists.

~ Tectu

kuroneko · 2011-07-12 18:42

Can you place the waitcnt after the cognew? The first cog's parameters get corrupted by the second's set.

kuroneko · 2011-07-12 20:01

Here is the 16 cycle sample code. You basically just missed a bit of the setup. This example generates a 2.5MHz wave (@80MHz) so you can see the sampler picking up different values.

CON
  _clkmode = XTAL1|PLL16X
  _xinfreq = 5_000_000

CON
  lcnt = 32                                             ' must be an even number of longs
  
VAR
  long  data[lcnt]

OBJ
  debug: "Parallax Serial Terminal"

PUB main | i

  debug.start(115200)
  waitcnt(clkfreq*3 + cnt)
  debug.char(0)

  ' generate data
  
  dira[16]~~
  ctra := constant(%0_00100_000 << 23 | 16)
  frqa := constant(%00001_0000 << 23)                   ' change every 16 cycles

  ' start sampler

  data[0] := constant(lcnt*4 -1)                        ' transfer array length
  cognew(@entry, @data{0})

  waitcnt(clkfreq + cnt)

  ' display data
  
  debug.str(string(13, "Begin now: ", 13))              ' begin output now

  repeat i from 0 to constant(lcnt -1)
    debug.bin(data[i], 32)
    debug.char(13)

  debug.str(string("Finished!!"))                       ' output finished

DAT             org     0

entry           movi    ctra, #%0_11111_000     '  -4   LOGIC always

                rdlong  size, par               '  +0 = read byte count -1
                                  
                mov     frqa, par               '  +8   data buffer (base address)
                shr     frqa, #1                '  -4   base/2

                long    0[2] {2 x nop}          '  +0 =
                
                mov     temp, ina               '  +8
                mov     phsa, size              '  -4   hub byte count (8n - 1)

:copy7          wrlong  temp, phsa              '  +0 = transfer long between cog and hub
                mov     temp, ina               '  +8
                sub     phsa, #7 wz             '  -4

:copy1          wrlong  temp, phsa              '  +0 = transfer long between cog and hub
      if_nz     mov     temp, ina               '  +8
      if_nz     djnz    phsa, #:copy7           '  -4

                cogid   cnt                     '
                cogstop cnt                     ' sayonara ...

' initialised data and/or presets

' uninitialised data and/or temporaries

size            res     1
temp            res     1

                fit
                
DAT             org     0                       ' array validation

                res     lcnt & 1                ' lcnt must be 2n (even)

                fit     0

DAT

Note that when you go multi cog the success/failure depends on how you do it. While this is blindingly obvious, using the 16 cycle loop has certain conditions in order to get it to work interleaved. For example each of the 4 cogs would sample 4 cycles apart. This also means that the hub access is 4 cycles apart. That's where the problem lies. Cog N and cog N+1 have their respective hub window slots 2 cycles apart. So you can't use them together however much you sync them with waitcnt. This means that for a 4 cog sampler using the above sample loop you have to use either all even (2n) or odd (2n+1) numbered cogs.
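The cog-numbering constraint can be checked with a short Python sketch. It assumes a 16-clock hub rotation in which cog N's window sits 2*N clocks into the rotation, as described above; the function names are hypothetical.

```python
HUB_PERIOD = 16  # clocks for one full hub rotation (assumed: 8 cogs x 2 clocks)

def hub_phase(cog_id):
    """Clock offset of a cog's hub window within one rotation."""
    return (2 * cog_id) % HUB_PERIOD

def evenly_interleaved(cog_ids, spacing=4):
    """True if the cogs' hub windows are exactly `spacing` clocks apart."""
    phases = sorted(hub_phase(c) for c in cog_ids)
    gaps = [(phases[(i + 1) % len(phases)] - phases[i]) % HUB_PERIOD
            for i in range(len(phases))]
    return all(g == spacing for g in gaps)

print(evenly_interleaved([0, 2, 4, 6]))   # all even cogs
print(evenly_interleaved([1, 3, 5, 7]))   # all odd cogs
print(evenly_interleaved([0, 1, 2, 3]))   # adjacent cogs
```

In this model only the all-even and all-odd sets give hub windows 4 clocks apart, which is why adjacent cogs cannot be combined for the 16 cycle loop.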
