slow CRC computation
Larry Martin
Posts: 104
Hi, all -
I have a CRC computation that ran in tens of microseconds in SX assembly language, but is taking around a millisecond per byte in SPIN.
I have counted the resulting output pulses to verify that I am not mysteriously spinning through extra iterations. It takes 15 uS to shift left a long - that's 1200 machine cycles, right? I'm pretty sure SX48 did the whole computation in that time.
Questions:
1. Is there a way to speed this up by precompiling the SPIN code?
2. If I put in the time to learn PASM and recode the function, will I get performance closer to SX48?
3. Is there an example of assembly language computation, preferably CRC16, in the Object Exchange? Googling "crc pasm site: propeller.com" comes up dry.
[noparse][[/noparse]rant]
The context is, I'm nearly done porting a realtime control application from SX48 (single 8-bit core at 50 MHz) to Propeller (Multiple 32-bit cores at 80 MHz). Imagine my surprise to find that my program is now too slow in some ways. After a couple weeks' Propeller cavitation, I get that it's interpreted SPIN, etc., but it still seems kinda funny.
[noparse][[/noparse]/rant]
Thanks,
Larry
I have a CRC computation that ran in tens of microseconds in SX assembly language, but is taking around a millisecond per byte in SPIN.
CON
TM_MSG_CRC_INIT = $FFFF
TM_MSG_CCITT_CRC_POLY = $1021
pLLIOMarkOut = 5
pLLIOOverrunOut = 6
PUB TM_CalculateCRC(p_src_p, p_len) : crc | mask, bitctr, bytectr, ch
crc := TM_MSG_CRC_INIT
repeat bytectr from 0 to (p_len - 1)
'Align test bit with leftmost bit of the message byte.
mask := $80
ch := BYTE[noparse][[/noparse]p_src_p + bytectr]
outa[noparse][[/noparse]pLLIOOverrunOut] ~~ 'deleteme
repeat bitctr from 0 to 7 ' times at 80 MHz (5 MHz crystal X 16 PLL)
outa[noparse][[/noparse]pLLIOMarkOut] ~~ 'deleteme
crc <<= 1 ' 15 uS
outa[noparse][[/noparse]pLLIOMarkOut] ~ 'deleteme
if ch & mask
crc |= 1 ' 16 uS
outa[noparse][[/noparse]pLLIOMarkOut] ~~ 'deleteme
if crc & $10000
crc ^= TM_MSG_CCITT_CRC_POLY ' 25 uS
outa[noparse][[/noparse]pLLIOMarkOut] ~ 'deleteme
crc &= $FFFF ' 15 uS
outa[noparse][[/noparse]pLLIOMarkOut] ~~ 'deleteme
mask >>= 1 ' 15 uS =~ 1200 clock cycles
outa[noparse][[/noparse]pLLIOMarkOut] ~ 'deleteme
outa[noparse][[/noparse]pLLIOOverrunOut] ~ 'deleteme
RESULT := crc & $FFFF
I have counted the resulting output pulses to verify that I am not mysteriously spinning through extra iterations. It takes 15 uS to shift left a long - that's 1200 machine cycles, right? I'm pretty sure SX48 did the whole computation in that time.
Questions:
1. Is there a way to speed this up by precompiling the SPIN code?
2. If I put in the time to learn PASM and recode the function, will I get performance closer to SX48?
3. Is there an example of assembly language computation, preferably CRC16, in the Object Exchange? Googling "crc pasm site: propeller.com" comes up dry.
[noparse][[/noparse]rant]
The context is, I'm nearly done porting a realtime control application from SX48 (single 8-bit core at 50 MHz) to Propeller (Multiple 32-bit cores at 80 MHz). Imagine my surprise to find that my program is now too slow in some ways. After a couple weeks' Propeller cavitation, I get that it's interpreted SPIN, etc., but it still seems kinda funny.
[noparse][[/noparse]/rant]
Thanks,
Larry

Comments
_CLKMODE = XTAL1 + PLL16X
_XINFREQ = 5_000_000
Leon
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle
1) You can't "precompile" the Spin code. It's never compiled to machine instructions, but to interpretive codes.
2) You'll certainly get comparable times to the SX48, particularly when you make use of the additional functionality of the Propeller's instructions
3) I don't think so, but maybe someone else might know of one.
Spin is certainly slower than assembly language, perhaps by a factor of 20. That's pretty typical for interpreters. On the other hand, the interpretive code is very compact and there are features not available directly in assembly (like multiplication and division, subscripting, subroutine calls with parameters).
And yes, at 80 mips (5 Mhz * 16 == 80 MHZ) you should get speed comparable to the SX48 running at 50 MIPS.
I'll probably have another look at ImageCraft next week.
Thanks again,
Larry
/* Initialized first time "crc32()" is called. If you prefer, you can * statically initialize it at compile time. [noparse][[/noparse]Another exercise.] */ u_long crc32(u_char *buf, int len) { u_char *p; u_long crc; if (!crc32_table) /* if not already done, */ init_crc32(); /* build table */ crc = 0xffffffff; /* preload shift register, per CRC-32 spec */ for (p = buf; len > 0; ++p, --len) crc = (crc << 8) ^ crc32_table[noparse][[/noparse](crc >> 24) ^ *p]; return ~crc; /* transmit complement, per CRC-32 spec */ } /* * Build auxiliary table for parallel byte-at-a-time CRC-32. */ #define CRC32_POLY 0x04c11db7 /* AUTODIN II, Ethernet, & FDDI */ init_crc32() { int i, j; u_long c; for (i = 0; i < 256; ++i) { for (c = i << 24, j = 8; j > 0; --j) c = c & 0x80000000 ? (c << 1) ^ CRC32_POLY : (c << 1); crc32_table[i] = c; } } [/i]Leon
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle
Chuck - Good idea, but even a table driven implementation has a fair number of instructions, and I'd have to find or write a table that matches this device's CRC. Even "standard" CRC implementations differ in subtle ways. In 10 years of fairly intense integration of devices with CRC protected proprietary commo protocols, I have rarely been able to reuse a CRC function between devices.
An SX48 @ 50MHz will beat one Propeller COG @80 MHz in ASM.
SX48 uses a multi-level pipeline to do 1 instruction per clock tick.
Propeller executes 1 instruction every 4 clock ticks minimum (most instructions use 4).
Someone please correct me if I misunderstand this.
Now, if you apply multiple cogs to the problem, there is potential for Propeller to outperform.
I have not seen any examples of a "generic" solution though.
All CRC tables are built using the shift/xor algorithm with a given polynominal.
The starting value and inversion factor also come into play. Tables use memory of course.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Post Edited (jazzed) : 6/27/2008 4:13:44 PM GMT
My SX48 was loafing in every respect except for the soft UARTs. It ran two independent UARTs at 57600 baud, and controlled a machine, but I came to need a third data channel and could not make that work without water-cooling the chip.
I am adding the third serial channel now to my Propeller now. That will tell if I made the right choice. The rest is a mere matter of software.
Thanks for help, all. If questions arise about the PASM CRC, I will start a new thread.
Larry
The SPIN level sets up data and then sets the CC16ParamsReady flag. It then waits for CC16ResultValid and returns the result data. There is also an escape if something goes wrong (like you forget to Start the object [noparse]:)[/noparse]).
Sample call: crc := CompuCog.TM_CalcCRC16(@TargetTxBuf, lTTBCommandLength - 3) ' leave out STX and CRC bytes
VAR long cog 'cog flag/id long CC16Base long CC16Length long CC16Flags long CC16Result CON CC16ParamsReady = $01 'limit constants to 8 bits for Assembly CC16ResultValid = $02 ''These methods execute in caller's address space and cog ''Only the assembler code below executes in a separate cog PUB start : okay CC16Base ~ CC16Length ~ CC16Flags ~ CC16Result ~ okay := cog := cognew(@entry, @CC16Base) + 1 '@entry is the assembly driver below PUB stop if cog cogstop(cog~ - 1) ' bits set in top word of RESULT mean there was an error PUB TM_CalcCRC16(p_base_p, p_len) | t0 RESULT := -1 'don't anticipate other threads, just wipe out the Flags CC16Base := p_base_p CC16Length := p_len CC16Flags := CC16ParamsReady t0 := cnt repeat while not (CC16Flags & CC16ResultValid) if cnt > t0 + CLKFREQ RESULT := $FFFFFFFF RETURN RESULT := CC16Result & $FFFF DAT '****************************************** '* Assembly language computation "driver" * '****************************************** org ' ' ' Entry ' entry 'set up Spin addresses for Assembler - e.g., CC16Flags is not accessible here mov aCC16Base, par 'get data address passed into cognew param 2 mov aCC16Length, aCC16Base add aCC16Length, #4 mov aCC16Flags, aCC16Base add aCC16Flags, #8 mov aCC16Result, aCC16Base add aCC16Result, #12 computations ' ' ' CC16 ' cc16 :CC16setup rdlong CC16scratch, aCC16Flags test CC16scratch, #CC16ResultValid wz 'wz means set Z if equal 'if a result is waiting to be consumed, move on to the next computation if_nz jmp #cc16_done test CC16scratch, #CC16ParamsReady wz 'if no parameters are waiting to be consumed, move on to the next computation if_z jmp #cc16_done 'it's really going to happen, so get the data rdlong CC16ptr, aCC16Base rdlong CC16len, aCC16Length 'set scratch to CRC seed and xor to key value mov CC16scratch, #0 'not sure next inst gets all bits mov CC16scratch, #$FF 'work around 9 bit max shl CC16scratch, #8 or CC16scratch, #$FF mov CC16xor, #0 'not sure next inst gets all bits mov CC16xor, #$10 'work around 9 bit max shl CC16xor, #8 or CC16xor, #$21 'setup count mov CC16count, #0 :CC16byte ' if (count >= len) break ; cmp CC16count, CC16len wz,wc 'wc means set carry if count < len if_z jmp #cc16_ready if_nc jmp #cc16_ready 'get current byte - ptr is incremented below rdbyte CC16byte, CC16ptr mov CC16mask, #$80 'this is also the loop control variable - when it shifts to nothing, the byte is done :CC16bit ' crc <<= 1 shl CC16scratch, #1 ' if (mask & byte) crc |= 1 ; and CC16byte, CC16mask wz,nr if_nz or CC16scratch, #1 ' if (crc & 0x00010000) {crc ^= 0x1021; crc &= 0xFFFF; } and CC16scratch, xormask wz,nr if_nz xor CC16scratch, CC16xor and CC16scratch, intlowmask 'set up for next bit, mask >>= 1 shr CC16mask, #1 wz if_nz jmp #:CC16bit 'end of byte, set up for next byte add CC16count, #1 add CC16ptr, #1 jmp #:CC16byte '--------------------------------------- Never Falls Through cc16_ready wrlong CC16scratch, aCC16Result mov CC16scratch, #3 ' CC16ResultValid | CC16ParamsReady wrlong CC16scratch, aCC16Flags cc16_done comp_done jmp #computations ' Constants intlowmask long $0000FFFF inthighmask long $0000FFFF xormask long $00010000 ' Variables aCC16Base res 1 aCC16Length res 1 aCC16Flags res 1 aCC16Result res 1 CC16ptr res 1 CC16len res 1 CC16count res 1 CC16scratch res 1 CC16byte res 1 CC16mask res 1 CC16xor res 1 {{ +------------------------------------------------------------------------------------------------------------------------------+ ¦ TERMS OF USE: MIT License ¦ +------------------------------------------------------------------------------------------------------------------------------¦ ¦Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation ¦ ¦files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, ¦ ¦modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software¦ ¦is furnished to do so, subject to the following conditions: ¦ ¦ ¦ ¦The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.¦ ¦ ¦ ¦THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE ¦ ¦WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR ¦ ¦COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ¦ ¦ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ¦ +------------------------------------------------------------------------------------------------------------------------------+ }}Like I said, this pattern reduced my CRC computation time by nearly two orders of magnitude, from 7 mS to 150 mS for the same 14 bytes.
Thanks to all for help, support and sample code.
It's Miller time.