slow CRC computation

Larry Martin · 2008-06-27 14:26

Hi, all -

I have a CRC computation that ran in tens of microseconds in SX assembly language, but is taking around a millisecond per byte in SPIN.

CON
  TM_MSG_CRC_INIT              = $FFFF
  TM_MSG_CCITT_CRC_POLY        = $1021
  pLLIOMarkOut                 = 5
  pLLIOOverrunOut              = 6

PUB TM_CalculateCRC(p_src_p, p_len) : crc | mask, bitctr, bytectr, ch
  crc := TM_MSG_CRC_INIT
  repeat bytectr from 0 to (p_len - 1)
    'Align test bit with leftmost bit of the message byte.
    mask := $80
    ch := BYTE[noparse][[/noparse]p_src_p + bytectr]
    outa[noparse][[/noparse]pLLIOOverrunOut] ~~ 'deleteme
    repeat bitctr from 0 to 7            ' times at 80 MHz (5 MHz crystal X 16 PLL)
      outa[noparse][[/noparse]pLLIOMarkOut] ~~ 'deleteme
      crc <<= 1                          ' 15 uS
      outa[noparse][[/noparse]pLLIOMarkOut] ~  'deleteme
      if ch & mask
        crc |= 1                         ' 16 uS
      outa[noparse][[/noparse]pLLIOMarkOut] ~~ 'deleteme
      if crc & $10000
        crc ^= TM_MSG_CCITT_CRC_POLY     ' 25 uS
      outa[noparse][[/noparse]pLLIOMarkOut] ~  'deleteme
      crc &= $FFFF                       ' 15 uS
      outa[noparse][[/noparse]pLLIOMarkOut] ~~ 'deleteme
      mask >>= 1                         ' 15 uS =~ 1200 clock cycles
      outa[noparse][[/noparse]pLLIOMarkOut] ~  'deleteme
    outa[noparse][[/noparse]pLLIOOverrunOut] ~ 'deleteme
  RESULT := crc & $FFFF

I have counted the resulting output pulses to verify that I am not mysteriously spinning through extra iterations. It takes 15 uS to shift left a long - that's 1200 machine cycles, right? I'm pretty sure SX48 did the whole computation in that time.

Questions:

1. Is there a way to speed this up by precompiling the SPIN code?
2. If I put in the time to learn PASM and recode the function, will I get performance closer to SX48?
3. Is there an example of assembly language computation, preferably CRC16, in the Object Exchange? Googling "crc pasm site: propeller.com" comes up dry.

[noparse][[/noparse]rant]
The context is, I'm nearly done porting a realtime control application from SX48 (single 8-bit core at 50 MHz) to Propeller (Multiple 32-bit cores at 80 MHz). Imagine my surprise to find that my program is now too slow in some ways. After a couple weeks' Propeller cavitation, I get that it's interpreted SPIN, etc., but it still seems kinda funny.
[noparse][[/noparse]/rant]

Thanks,
Larry

Larry Martin · 2008-06-27 14:32

These are my clock settings:

_CLKMODE = XTAL1 + PLL16X
_XINFREQ = 5_000_000

Leon · 2008-06-27 14:40

Spin is rather slow, being interpreted. You need to write it in assembler for a fair comparison.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle

Mike Green · 2008-06-27 14:52

The OUTAs are probably doubling the time it takes to do the CRC calculations. I'm sure there are some ways to speed it up even further by eliminating statements and combining statements.

1) You can't "precompile" the Spin code. It's never compiled to machine instructions, but to interpretive codes.
2) You'll certainly get comparable times to the SX48, particularly when you make use of the additional functionality of the Propeller's instructions
3) I don't think so, but maybe someone else might know of one.

Spin is certainly slower than assembly language, perhaps by a factor of 20. That's pretty typical for interpreters. On the other hand, the interpretive code is very compact and there are features not available directly in assembly (like multiplication and division, subscripting, subroutine calls with parameters).

allanlane5 · 2008-06-27 15:05

Yes, you can "precompile" the Spin code -- if by that you mean "convert the algorithm to SPIN assembly, put it in a SPIN 'DATA' section, use "initcog()" to send it to a cog, and run it out of a Cog".

And yes, at 80 mips (5 Mhz * 16 == 80 MHZ) you should get speed comparable to the SX48 running at 50 MIPS.

Larry Martin · 2008-06-27 15:21

I appreciate your point about the OUTA's, but that was the only way I could prove to myself what I was seeing. Without the OUTA's the CRC computation was taking 7 mS to do 14 bytes, which was how it got my attention in the first place. So 'mask >>= 1' takes only 7 mS (not 15), but the effective instruction clock is still 128 Kops/sec on an 80 MHz machine. I'm not just whining for fun, I have several algorithms that count on the code being >> faster than the machine. The application is web processing, currently 50 mS minimum cycle time. I was hoping to halve that with the new chip.

I'll probably have another look at ImageCraft next week.

Thanks again,
Larry

Chuck Rice · 2008-06-27 15:25

If you do not want to drop into assembly (the fastest) you might try using a lookup table to do the calculation. Google on "CRC16 LOOKUP TABLE"

Leon · 2008-06-27 15:34

I found some C code using a lookup table for a 32-bit CRC that I'm using on an ARM:

/* Initialized first time "crc32()" is called. If you prefer, you can
 * statically initialize it at compile time. [noparse][[/noparse]Another exercise.]
 */

u_long crc32(u_char *buf, int len)
{
        u_char *p;
        u_long  crc;

        if (!crc32_table)    /* if not already done, */
                init_crc32();   /* build table */
        crc = 0xffffffff;       /* preload shift register, per CRC-32 spec */
        for (p = buf; len > 0; ++p, --len)
                crc = (crc << 8) ^ crc32_table[noparse][[/noparse](crc >> 24) ^ *p];
        return ~crc;            /* transmit complement, per CRC-32 spec */
}

/*
 * Build auxiliary table for parallel byte-at-a-time CRC-32.
 */
#define CRC32_POLY 0x04c11db7     /* AUTODIN II, Ethernet, & FDDI */

init_crc32()
{
        int i, j;
        u_long c;

        for (i = 0; i < 256; ++i) {
                for (c = i << 24, j = 8; j > 0; --j)
                        c = c & 0x80000000 ? (c << 1) ^ CRC32_POLY : (c << 1);
                crc32_table[i] = c;
        }
}

[/i]

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle

Larry Martin · 2008-06-27 15:38

Allen - Actually I'm thinking about downloading the ImageCraft demo and using that to crunch the original C version of the function to PASM. Thanks for the pointer to initcog.

Chuck - Good idea, but even a table driven implementation has a fair number of instructions, and I'd have to find or write a table that matches this device's CRC. Even "standard" CRC implementations differ in subtle ways. In 10 years of fairly intense integration of devices with CRC protected proprietary commo protocols, I have rarely been able to reuse a CRC function between devices.

jazzed · 2008-06-27 15:57

Propeller cavitation [noparse]:)[/noparse] Well, at least you get more memory (added: yes, and cores).

An SX48 @ 50MHz will beat one Propeller COG @80 MHz in ASM.
SX48 uses a multi-level pipeline to do 1 instruction per clock tick.
Propeller executes 1 instruction every 4 clock ticks minimum (most instructions use 4).
Someone please correct me if I misunderstand this.

Now, if you apply multiple cogs to the problem, there is potential for Propeller to outperform.
I have not seen any examples of a "generic" solution though.

All CRC tables are built using the shift/xor algorithm with a given polynominal.
The starting value and inversion factor also come into play. Tables use memory of course.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Post Edited (jazzed) : 6/27/2008 4:13:44 PM GMT

Larry Martin · 2008-06-27 16:06

jazzed - yeah, memory and cores. I think I'm going to end up ahead of the game, it's just a little more work than I hoped for.

My SX48 was loafing in every respect except for the soft UARTs. It ran two independent UARTs at 57600 baud, and controlled a machine, but I came to need a third data channel and could not make that work without water-cooling the chip.

I am adding the third serial channel now to my Propeller now. That will tell if I made the right choice. The rest is a mere matter of software.

Thanks for help, all. If questions arise about the PASM CRC, I will start a new thread.

Larry

Larry Martin · 2008-06-27 22:27

This code got the CRC down from 7 mS to 150 uS. The assembly part spins forever, looking for stuff to do based on flag CC16ParamsReady. When done, it writes the result to CC16Result and sets flag CC16ResultValid.

The SPIN level sets up data and then sets the CC16ParamsReady flag. It then waits for CC16ResultValid and returns the result data. There is also an escape if something goes wrong (like you forget to Start the object [noparse]:)[/noparse]).

Sample call: crc := CompuCog.TM_CalcCRC16(@TargetTxBuf, lTTBCommandLength - 3) ' leave out STX and CRC bytes


VAR

  long  cog                     'cog flag/id

  long CC16Base
  long CC16Length
  long CC16Flags
  long CC16Result


CON
  CC16ParamsReady     = $01 'limit constants to 8 bits for Assembly
  CC16ResultValid     = $02
                     
''These methods execute in caller's address space and cog
''Only the assembler code below executes in a separate cog
PUB start : okay
  CC16Base ~
  CC16Length ~
  CC16Flags ~
  CC16Result ~
  okay := cog := cognew(@entry, @CC16Base) + 1    '@entry is the assembly driver below

PUB stop
  if cog
    cogstop(cog~ - 1)

' bits set in top word of RESULT mean there was an error
PUB TM_CalcCRC16(p_base_p, p_len) | t0
  RESULT := -1
  'don't anticipate other threads, just wipe out the Flags
  CC16Base := p_base_p
  CC16Length := p_len
  CC16Flags := CC16ParamsReady
  t0 := cnt
  repeat while not (CC16Flags & CC16ResultValid)
    if cnt > t0 + CLKFREQ
      RESULT := $FFFFFFFF
      RETURN
  RESULT := CC16Result & $FFFF


DAT

'******************************************
'* Assembly language computation "driver" *
'******************************************

                        org
'
'
' Entry
'
entry
                        'set up Spin addresses for Assembler - e.g., CC16Flags is not accessible here
                        mov     aCC16Base, par    'get data address passed into cognew param 2
                        mov     aCC16Length, aCC16Base
                        add     aCC16Length, #4
                        mov     aCC16Flags, aCC16Base
                        add     aCC16Flags, #8
                        mov     aCC16Result, aCC16Base
                        add     aCC16Result, #12

computations

'
'
' CC16
'
cc16
:CC16setup              
                        rdlong CC16scratch, aCC16Flags
                        test CC16scratch, #CC16ResultValid    wz   'wz means set Z if equal
                        'if a result is waiting to be consumed, move on to the next computation
        if_nz           jmp #cc16_done
                        test CC16scratch, #CC16ParamsReady    wz
                        'if no parameters are waiting to be consumed, move on to the next computation
        if_z            jmp #cc16_done
                        'it's really going to happen, so get the data
                        rdlong CC16ptr, aCC16Base
                        rdlong CC16len, aCC16Length
                        'set scratch to CRC seed and xor to key value
                        mov    CC16scratch, #0        'not sure next inst gets all bits
                        mov    CC16scratch, #$FF      'work around 9 bit max
                        shl    CC16scratch, #8
                        or     CC16scratch, #$FF
                        mov    CC16xor, #0        'not sure next inst gets all bits
                        mov    CC16xor, #$10      'work around 9 bit max
                        shl    CC16xor, #8
                        or     CC16xor, #$21
                        'setup count
                        mov    CC16count, #0
:CC16byte               ' if (count >= len) break ;
                        cmp    CC16count, CC16len   wz,wc  'wc means set carry if count < len
          if_z          jmp    #cc16_ready
          if_nc         jmp    #cc16_ready
                        'get current byte - ptr is incremented below
                        rdbyte CC16byte, CC16ptr
                        mov    CC16mask, #$80          'this is also the loop control variable - when it shifts to nothing, the byte is done
:CC16bit
                        ' crc <<= 1
                        shl    CC16scratch, #1
                        ' if (mask & byte) crc |= 1 ;
                        and    CC16byte, CC16mask   wz,nr
          if_nz         or     CC16scratch, #1
                        ' if (crc & 0x00010000) {crc ^= 0x1021; crc &= 0xFFFF; }
                        and    CC16scratch, xormask   wz,nr
          if_nz         xor    CC16scratch, CC16xor
                        and    CC16scratch, intlowmask
                        'set up for next bit, mask >>= 1
                        shr    CC16mask, #1         wz
          if_nz         jmp    #:CC16bit
                        'end of byte, set up for next byte
                        add    CC16count, #1
                        add    CC16ptr, #1
                        jmp    #:CC16byte
                        '--------------------------------------- Never Falls Through
cc16_ready              wrlong CC16scratch, aCC16Result
                        mov    CC16scratch, #3                  ' CC16ResultValid | CC16ParamsReady
                        wrlong CC16scratch, aCC16Flags
cc16_done

comp_done
                        jmp     #computations

' Constants
intlowmask         long    $0000FFFF
inthighmask        long    $0000FFFF
xormask            long    $00010000
' Variables
aCC16Base               res     1
aCC16Length             res     1
aCC16Flags              res     1
aCC16Result             res     1
CC16ptr                 res     1
CC16len                 res     1
CC16count               res     1
CC16scratch             res     1
CC16byte                res     1
CC16mask                res     1
CC16xor                 res     1

{{

+------------------------------------------------------------------------------------------------------------------------------+
¦                                                   TERMS OF USE: MIT License                                                  ¦                                                            
+------------------------------------------------------------------------------------------------------------------------------¦
¦Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation    ¦ 
¦files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy,    ¦
¦modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software¦
¦is furnished to do so, subject to the following conditions:                                                                   ¦
¦                                                                                                                              ¦
¦The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.¦
¦                                                                                                                              ¦
¦THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE          ¦
¦WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR         ¦
¦COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,   ¦
¦ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                         ¦
+------------------------------------------------------------------------------------------------------------------------------+
}}

Like I said, this pattern reduced my CRC computation time by nearly two orders of magnitude, from 7 mS to 150 mS for the same 14 bytes.

Thanks to all for help, support and sample code.

It's Miller time.

slow CRC computation

Comments