WS2812 Fast Multi - Driving ~10,000 Neopixels using an $8 Micro

I've been working on an upgrade to the "laundry stargate" and looking for ways to drive more WS2812 LED strips with the Prop.
I've come up with some code that comes tantalisingly close to driving 5 WS2811/12/12B strips at full rate, so I thought I'd ask more experienced eyes to see if there's a way to finish this, or whether I should be less ambitious and settle for 4 strips per cog.
What I want to do is find a way to advance the 5 pointers to the next LED data every 24 bits rather than every 32. I need an alternative to the 'test' instruction that makes the zero flag go high once every 24 bits, so the pointers advance to the next LED data. This allows data to be stored in the simpler form $00 Gg Rr Bb (one LED per long); otherwise I will have to pack 4 LEDs into 3 longs, which will work but perhaps isn't so user friendly.
No matter what we end up with, we should be able to use virtually all of hub RAM for data: something like 7,500 to 10,000 LEDs, split across 6 or 7 cogs and 28 output pins, at a refresh rate around 130 Hz. I think that'd be spectacular for an $8 micro in a DIP package.
Clock rate is 100 MHz and one pass through the loop takes 128 clocks, so bit timing ends up at 1.28 usec per bit.
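As a rough sanity check on those numbers, here is a small host-side Python sketch (not Prop code). It assumes the 128-clock loop implied by the cycle labels in the listing, an even split of LEDs across pins, and a 50 usec reset gap between frames; the reset figure is my assumption, so check the datasheet for your strips. The function names are made up for illustration.

```python
CLOCK_HZ = 100_000_000   # clock rate quoted in the post
CLOCKS_PER_BIT = 128     # loop length implied by the '128 label on the djnz
BITS_PER_LED = 24        # one WS2812 LED takes 24 bits
RESET_S = 50e-6          # ASSUMED latch/reset gap between frames


def bit_time_s(clock_hz=CLOCK_HZ, clocks_per_bit=CLOCKS_PER_BIT):
    """Time taken to output one bit on every strip, in seconds."""
    return clocks_per_bit / clock_hz


def leds_per_strip(total_leds, strips):
    """LEDs on the longest strip when total_leds are split evenly (rounded up)."""
    return -(-total_leds // strips)


def refresh_hz(strip_len, bits_per_led=BITS_PER_LED):
    """Frames per second for strips of strip_len LEDs driven in parallel."""
    frame_s = strip_len * bits_per_led * bit_time_s() + RESET_S
    return 1.0 / frame_s


if __name__ == "__main__":
    n = leds_per_strip(7500, 28)
    print(f"{n} LEDs per strip, about {refresh_hz(n):.1f} Hz")
    print(f"250 LEDs per strip, about {refresh_hz(250):.1f} Hz")
```

Under these assumptions, 7,500 LEDs over 28 pins works out to 268 LEDs per strip and roughly 120 Hz, while 250 LEDs per strip gives about 129 Hz, so "around 130 Hz" is in the right ballpark.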
'Main tight bit loop starts here:-
' Bit output order looks like this
'   state:  data   low    high
'   delay:  0.36   0.56   0.36  usec
:loop           rdlong  dat1, ptr1          '008 get first strip data, 00RRGGBB format
                test    dat1, bmask wc      '012 extract single data bit and copy it to C flag
        if_nc   andn    opmask, led1mask    '016 if led1 data bit was 0, zero the mask at the led1 pin position
                rdlong  dat2, ptr2          '024 get second strip data
                and     outa, out0mask      '028 all led outputs low <- near 0.36 usec mark after data output
                test    dat2, bmask wc      '032 get data bit into C
        if_nc   andn    opmask, led2mask    '036 zero led2 pin if data bit was 0
                test    bitctr, #31 wz      '040 check whether in last bit, store result in zero flag
        if_z    add     ptr1, inc1          '044 move onto next LED if last bit
        if_z    add     ptr2, inc2          '048 move onto next LED
                rdlong  dat3, ptr3          '056 get third channel data
                test    dat3, bmask wc      '060 get data bit into C
        if_nc   andn    opmask, led3mask    '064 if led3 data bit was 0, zero the mask at the led3 pin position
                rdlong  dat4, ptr4          '072 get fourth channel data
                test    dat4, bmask wc      '076 get data bit into C
        if_nc   andn    opmask, led4mask    '080 zero led4 pin if data bit was 0
                or      outa, out1mask      '084 all led outputs high <- near 0.56us after previous transition, hence slightly out of order
        if_z    add     ptr3, inc3          '088 move onto next LED
        if_z    add     ptr4, inc4          '092 move onto next LED
                ror     bmask, #1           '096 ready to extract next bit
                rdlong  dat5, ptr5          '104 get fifth strip led data
                test    dat5, bmask wc      '108 get data bit into C
        if_nc   andn    opmask, led5mask    '112 zero led5 pin only if data bit was 0
        if_z    add     ptr5, inc5          '116 move onto next led5 data
                and     outa, opmask        '120 clear the pin bits that must be zero; the others (1s) are cleared later when all go to zero <- 0.36us after previous transition
                mov     opmask, #$1ff       '124 reset the output mask for the next iteration
                djnz    bitctr, #:loop      '128 jump back and output the next bit (24/32 bits per LED)
Comments
I settled on the 5 strip per cog format, which means the data from 4 leds is packed into 3 longs. The upside of this is that we should be able to drive around 10,000 leds from a single Prop.
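For anyone preparing buffers on a PC, here is one way the 4-LEDs-in-3-longs packing could look, as a host-side Python sketch. The exact layout the driver expects isn't spelled out above, so this assumes the four 24-bit values are simply concatenated MSB-first into a 96-bit stream and cut into three 32-bit longs; `pack_leds` and `unpack_leds` are hypothetical helper names.

```python
def pack_leds(colors):
    """Pack 24-bit LED values into 32-bit longs, 4 LEDs per 3 longs.

    ASSUMED layout: the four 24-bit values are concatenated MSB-first
    into a 96-bit stream, which is then split into three longs.
    The final group is padded with zeros (LED off) if needed.
    """
    colors = list(colors)
    for c in colors:
        if not 0 <= c <= 0xFFFFFF:
            raise ValueError(f"not a 24-bit value: {c!r}")
    colors += [0] * (-len(colors) % 4)          # pad to a multiple of 4
    longs = []
    for i in range(0, len(colors), 4):
        a, b, c, d = colors[i:i + 4]
        stream = (a << 72) | (b << 48) | (c << 24) | d
        longs.append((stream >> 64) & 0xFFFFFFFF)
        longs.append((stream >> 32) & 0xFFFFFFFF)
        longs.append(stream & 0xFFFFFFFF)
    return longs


def unpack_leds(longs, count):
    """Inverse of pack_leds: recover the first `count` 24-bit values."""
    if len(longs) % 3:
        raise ValueError("expected a multiple of 3 longs")
    colors = []
    for i in range(0, len(longs), 3):
        stream = (longs[i] << 64) | (longs[i + 1] << 32) | longs[i + 2]
        for shift in (72, 48, 24, 0):
            colors.append((stream >> shift) & 0xFFFFFF)
    return colors[:count]


if __name__ == "__main__":
    packed = pack_leds([0x112233, 0x445566, 0x778899, 0xAABBCC])
    print([f"${v:08X}" for v in packed])
```

The payoff is the memory saving: 3 bytes per LED instead of 4, which is what pushes the ceiling from roughly 7,500 towards 10,000 LEDs.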
Testing on the weekend consisted of driving a long string of 576 leds successfully from one output, moving that long string from output pin to output pin. The cog driver keeps up with Hub memory so there's no limit on number of strings, or number of leds per string, other than the 32K hub limit.
Will post some photos of the setup soon.
Here are some ideas:
- Instead of using part of bitctr, you can use bmask to detect the wrap-around from bit0 to bit23/31.
- Copy the bit directly into opmask with muxc, instead of setting them all and then clearing them one by one with if_nc andn. This saves one instruction.
It will then look like that:
                mov     bmask, bit23        'begin with highest of 24 bits
:loop           rdlong  dat1, ptr1          '008 get first strip data, 00RRGGBB format
                test    dat1, bmask wc      '012 extract single data bit and copy it to C flag
                muxc    opmask, led1mask    '016 copy led1 data bit to the led1 pin position
                rdlong  dat2, ptr2          '024 get second strip data
                and     outa, out0mask      '028 all led outputs low <- near 0.36 usec mark after data output
                test    dat2, bmask wc      '032 get data bit into C
                muxc    opmask, led2mask    '036 copy led2 data bit to the led2 pin position
                test    bmask, #1 wz        '040 check whether in last bit, store result in zero flag
        if_nz   add     ptr1, inc1          '044 move onto next LED if last bit
        if_nz   add     ptr2, inc2          '048 move onto next LED
                rdlong  dat3, ptr3          '056 get third channel data
                test    dat3, bmask wc      '060 get data bit into C
                muxc    opmask, led3mask    '064 copy led3 data bit to the led3 pin position
                rdlong  dat4, ptr4          '072 get fourth channel data
                test    dat4, bmask wc      '076 get data bit into C
                muxc    opmask, led4mask    '080 copy led4 data bit to the led4 pin position
                or      outa, out1mask      '084 all led outputs high <- near 0.56us after previous transition, hence slightly out of order
        if_nz   add     ptr3, inc3          '088 move onto next LED
        if_nz   add     ptr4, inc4          '092 move onto next LED
                test    dat5, bmask wc      '096 get data bit into C (as posted, this runs before the rdlong below, so it sees dat5 from the previous pass)
                rdlong  dat5, ptr5          '104 get fifth strip led data
                muxc    opmask, led5mask    '108 copy led5 data bit to the led5 pin position
        if_nz   add     ptr5, inc5          '112 move onto next led5 data
                and     outa, opmask        '116 clear the pin bits that must be zero; the others (1s) are cleared later when all go to zero <- 0.36us after previous transition
                shr     bmask, #1 wz        '120 ready to extract next bit, on wrap around set Z flag
        if_z    mov     bmask, bit23        '124 wrap to bit23 after bit0
                djnz    bitctr, #:loop      '128 jump back and output the next bit (24/32 bits per LED)
...
bit23           long    1<<23
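To see the two ideas in isolation, here is a host-side Python model (not Prop code, and only a sketch). `advance_points` mimics the test bmask,#1 / shr bmask,#1 wz / if_z mov bmask,bit23 sequence and reports on which loop passes the pointers would advance; `muxc` mimics what the MUXC instruction does to the masked bits. Both function names are just for illustration.

```python
BIT23 = 1 << 23


def advance_points(num_bits, top_bit=BIT23):
    """Return the loop passes (0-based) on which the pointers advance.

    Models: test bmask,#1 wz      -> NZ only when the mask is on bit 0
            if_nz add ptrN,incN   -> advance on the last bit of an LED
            shr bmask,#1 wz       -> mask becomes 0 after bit 0
            if_z mov bmask,bit23  -> reload the mask at the top bit
    """
    bmask = top_bit
    advances = []
    for i in range(num_bits):
        if bmask & 1:               # last bit of the current LED
            advances.append(i)
        bmask >>= 1
        if bmask == 0:              # wrapped past bit 0
            bmask = top_bit
    return advances


def muxc(dest, mask, c):
    """Set the bits of dest selected by mask to the state of the C flag."""
    return (dest | mask) if c else (dest & ~mask)


if __name__ == "__main__":
    print(advance_points(72))               # 24-bit LEDs
    print(advance_points(64, 1 << 31))      # 32-bit (e.g. RGBW) LEDs
```

With the mask starting at bit 23 the pointers advance on every 24th pass, and changing only the reload value to bit 31 gives every 32nd pass, with no separate bit counter test needed.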
Andy
That 24/32 bit flexibility is good for the RGBW strips.
Thanks & regards
tubular