Can you guys please think about this? What are the ramifications of double-integrating the Goertzel summing terms?
On each clock, the ADC bit is now used to add/subtract an 8-bit cosine value and an 8-bit sine value to/from the X and Y accumulators.
If we did a SINC2 by integrating the accumulators, and then took their periodic readings and computed diffs, might we double the ENOB of our readings?
And how would all this work with the adder terms and accumulators and final integrators being all signed?
This would, at least, get us around the lack-of-windowing problem that we already have in the Goertzel.
1. Yes.
2. It would perform sinc2 filtering. Yay! Why not go to sinc3 if it's cheap enough? For software defined radio I would love to see a third order filter. Would this always be desirable? It would interfere with using the Goertzel hardware as a FIR window filter.
3. I don't know yet. The improvement should be similar to that seen for DC input. It should greatly improve rejection of frequencies not measured.
4. The bit growth calculations assume two's complement arithmetic. The input needs to be sign-extended. We were treating it as unsigned for the delta-sigma input. But our input is now 8 bits.
Intervals could be lengthened slightly by reducing the number of active bits in the LUT.
We'd have to read the accumulators 3 times to get one sample with sinc2. Or 4 times for sinc3. I'm not sure how this would work with clearing the accumulators upon read.
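A minimal software sketch of the "integrate twice, then difference the periodic readings" idea, to make the three-readings-per-sample point concrete. It assumes the accumulators are never cleared and start at zero; the function names and the test data are made up for illustration and do not model the actual smart pin hardware.

```python
def sinc2_readings(samples, period):
    """Integrate twice at the input rate, difference twice at the reading rate."""
    acc1 = acc2 = 0
    readings = [0]                      # accumulators are assumed to start at zero
    for n, x in enumerate(samples, start=1):
        acc1 += x                       # first integrator (the existing running sum)
        acc2 += acc1                    # second integrator (the proposed addition)
        if n % period == 0:
            readings.append(acc2)       # periodic reading, nothing is cleared
    # each output needs three consecutive readings (a second difference)
    return [readings[k] - 2 * readings[k - 1] + readings[k - 2]
            for k in range(2, len(readings))]


def triangle_weighted_sum(samples, period, index):
    """Direct triangular-window sum over the same 2*period samples, for comparison."""
    weights = list(range(period)) + list(range(period, 0, -1))
    start = index * period
    return sum(w * x for w, x in zip(weights, samples[start:start + 2 * period]))


PERIOD = 8
samples = [((37 * i) % 17) - 8 for i in range(64)]    # arbitrary signed test data
outputs = sinc2_readings(samples, PERIOD)
for index, value in enumerate(outputs):
    print(index, value, triangle_weighted_sum(samples, PERIOD, index))
```

Each output equals a triangular-weighted sum over two reading periods, and consecutive outputs share readings, so the windows overlap by half.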
...
If we did a SINC2 by integrating the accumulators, and then took their periodic readings and computed diffs, might we double the ENOB of our readings?
...
You can filter the accumulators in software when you read the accumulators, no need to add additional SINC2 hardware.
...
This would, at least, get us around the lack-of-windowing problem that we already have in the Goertzel.
Can you synchronize the reading/clearing of the accumulators with the wrap around of the LUT address that generates the sin/cosine output?
If so, you can have multiple periods in the LUT and can calculate a window over these sine/cosine samples in the LUT RAM. The window size is then the chosen Goertzel loop size.
Andy
I'm going to do this when my P2 ES board arrives.
Note that this does not restrict frequencies to a whole number of cycles per table. For a random frequency, the phase of the table output will differ each time it runs through compared to a continuous oscillator. Not a big problem, just rotate the Goertzel output to compensate and things should be fine. The only issue is not continuously responding to input. It might cost a little bit of sensitivity.
Andy, that is ingenious! Do the windowing operation via the LUT data. It never occurred to me before.
Can you synchronize the reading/clearing of the accumulators with the wrap around of the LUT address that generates the sin/cosine output?
That is exactly how it works. Furthermore, you can specify how many complete LUT cycles you want before X/Y accumulator posting and clearing. The upper two bytes of each LUT entry are the Goertzel adder values, while the bottom two bytes are what can be output, also, to the DACs. So, in the upper bytes, you can have your windowed sine/cosine pattern, while in the lower bytes you have your continuous sine/cosine pattern. This way, you can output steady sine/cosine signals of known phase and input windowed measurements that are a product of the simultaneous output.
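Here is a rough simulation of the windowed-table idea, comparing a plain sine/cosine table with a Hann-windowed one. Everything in it is an assumption for illustration: the table length, the number of cycles, the Hann shape, the idealised first-order modulator standing in for the ADC, and the helper names. It does not model the real LUT byte layout or the DAC output half of each entry.

```python
import math


def make_table(length, cycles, windowed):
    """Signed 8-bit cosine and sine tables, optionally Hann-windowed."""
    cos_t, sin_t = [], []
    for n in range(length):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * n / length) if windowed else 1.0
        phase = 2 * math.pi * cycles * n / length
        cos_t.append(round(127 * w * math.cos(phase)))
        sin_t.append(round(127 * w * math.sin(phase)))
    return cos_t, sin_t


def one_bit_adc(signal):
    """Idealised first-order delta-sigma modulator: floats in (-1, 1) to bits."""
    integ, feedback, bits = 0.0, 0.0, []
    for x in signal:
        integ += x - feedback
        bit = 1 if integ >= 0 else 0
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits


def goertzel_magnitude(bits, table):
    """Add the table value when the bit is 1, subtract it when the bit is 0."""
    cos_t, sin_t = table
    x = sum(c if b else -c for b, c in zip(bits, cos_t))
    y = sum(s if b else -s for b, s in zip(bits, sin_t))
    return math.hypot(x, y)


def tone(length, cycles, amplitude=0.5, phase=0.3):
    return [amplitude * math.sin(2 * math.pi * cycles * n / length + phase)
            for n in range(length)]


N, CYCLES = 4096, 128
on_bits = one_bit_adc(tone(N, CYCLES))           # tone at the measured frequency
off_bits = one_bit_adc(tone(N, CYCLES + 3.5))    # tone 3.5 cycles per table away
for name, windowed in (("rectangular", False), ("Hann", True)):
    table = make_table(N, CYCLES, windowed)
    ratio = goertzel_magnitude(off_bits, table) / goertzel_magnitude(on_bits, table)
    print(f"{name:12s} off-frequency / on-frequency response: {ratio:.4f}")
```

In this model the windowed table should respond noticeably less to the off-frequency tone, at the cost of a lower response to the wanted tone.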
Do you guys have any further thoughts on how to minimize the logic required to compute the Tukey window output? This could amount to a substantial logic savings.
Do you guys have any ideas on the performance improvements we could anticipate by using 8-bit Tukey samples, instead of 1-bit ADC samples, as input to the Goertzel computation? It would mean using 8x8 signed multipliers, instead of just 1-bit conditional negators, before the 32-bit accumulators.
I'm thinking it could take things to a whole other level.
Any news on logic size with Tukey in the smart pins?
If it's just too big to fit, could we have a mode where groups of eight ADC bits from each of four pins can be read/streamed in one long?
Putting a Tukey into every smart pin, taking advantage of the existing flops, was smaller than putting half that many into the cogs. Only by a little bit, though.
Was that with the 45-bit adder Tukey? Have you tried with FPGA using a counter for the +32 and +16 values? Or the add times 3 idea?
I think any savings are likely to be small, though. Adding 45 tap values will always need a substantial amount of logic. How does short Tukey/Hann-like compare to long Tukey for quality? The plateau could be longer without adding much logic by using a counter.
Do you guys have any further thoughts on how to minimize the logic required to compute the Tukey window output? This could amount to a substantial logic savings.
I've been trying my utmost to make the Tukey smaller. Having a small number of Tukey pins is one option.
We need a plan B if, as seems likely, it won't fit. Would sliding windows be so terrible in software? Obviously not ideal in terms of speed, but windows could be anything, stored in LUT. How to handle triggering?
Do you guys have any ideas on the performance improvements we could anticipate by using 8-bit Tukey samples, instead of 1-bit ADC samples, as input to the Goertzel computation? It would mean using 8x8 signed multipliers, instead of just 1-bit conditional negators, before the 32-bit accumulators.
I'm thinking it could take things to a whole other level.
According to the convolution theorem, multiplication in the time domain is equivalent to convolution in the frequency domain, and vice versa. That means that if we could perform a fast Fourier transform on the one-bit samples, or anything easily derived therefrom, then instead of multiplying each of the one-bit samples, or the eight-bit samples, by a raised cosine (read Tukey), you might want to find some way, by hook or by crook, to get the ADC signal into the frequency domain, where the convolution kernel for a raised cosine is just the set {-1,2,-1}. One way to do this might be to store the precomputed FFT values for every possible 8-bit sequence in a table, so that you can simply pick off 8 bits of raw ADC output at a time, with 4 bits of overlap, look up the precomputed FFT, and then just add; no multiplies required! Then you might try downsampling the precomputed FFT results, so that the next step would be to perform an inverse 4-point FFT on the downsampled, overlapped and anti-aliased summation, which for a 4-point FFT simply involves some additions and subtractions. According to Wikipedia, Winograd figured out sometime back in the 80's (I think) that it is possible to perform ANY FFT with nothing but a large number of additions and subtractions, provided you are willing to perform exactly 4*N multiplications at the very end. I don't know off the top of my head how to do some, or any, of the more advanced Winograd transforms, like the ones that involve cyclotomic polynomials derived from some transformation based on a Galois field that in turn allows flipping between different prime number factorings. For the smaller transforms, though, it is a slam dunk in terms of the theory. Like any software project, the devil is in the details when it comes time to debugging.
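A small numeric check of the {-1,2,-1} kernel claim, as a sketch only. It uses a plain Hann window, which is the fully tapered special case of a Tukey; a Tukey with a flat plateau would need a longer kernel. The signal, the length and the naive DFT helper are arbitrary choices for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform, fine for a small demonstration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


N = 32
signal = [((7 * t * t + 3 * t) % 11) - 5 for t in range(N)]    # arbitrary test data
hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / N) for t in range(N)]

# Route 1: multiply by the window in the time domain, then transform.
time_domain = dft([s * w for s, w in zip(signal, hann)])

# Route 2: transform first, then convolve neighbouring bins with {-1, 2, -1} / 4.
spectrum = dft(signal)
freq_domain = [(-spectrum[k - 1] + 2 * spectrum[k] - spectrum[(k + 1) % N]) / 4
               for k in range(N)]

max_err = max(abs(a - b) for a, b in zip(time_domain, freq_domain))
print("largest difference between the two routes:", max_err)
```

The bin indices wrap around, so the convolution is circular, and the factor of 1/4 is the scaling left out of the {-1,2,-1} shorthand.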
From what you are saying, it almost sounds like some kind of live FFT could be maintained in real time using one bit samples.
Saucy, about feeding 8-bit samples into the Goertzel...
I think it was you who said it would be rather pointless, because these samples are actually just one new bit per clock, anyway. Is that what you suppose? It stands to reason. If there was higher entropy in those 8-bit samples, there would be an advantage to using them. As it is, probably not.
The existing Goertzel:
  NCO-->LUT-->multiply-->sum
                 ^
      ADC bit __/
The proposed Goertzel:
  NCO-->LUT-->multiply-->sum-->integrate
                 ^
      ADC bit __/
Don't do this:
  NCO-->LUT-->multiply-->sum
                 ^
  ADC bit-->Tukey
The entropy is not a bad way to explain it. The low pass Tukey or sinc filter does not increase the entropy of the signal. It might decrease it.
The output of the Goertzel "multiplier" may have a greater need for windowing than the ADC output. The sine/cosine inputs are periodic and what part of the cycle we collect measurements does affect the readings.
The Goertzel output should be low pass filtered the same as the ADC output. The multiplication simply shifts the frequency we want down to zero. We can work the other way too, shifting the frequency of a lowpass filter up to become a bandpass filter. The plots show what the response of the Goertzel should be like. This is not a suggestion to use the Tukey on the Goertzel output. I used the Tukey because we've been studying it closely.
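To illustrate the point that the multiplication just shifts the measured frequency down to zero, here is a short sketch. It uses complex exponentials rather than the real sine/cosine tables and 1-bit input of the hardware, purely to keep the arithmetic compact; N, F0 and the helper name are hypothetical.

```python
import cmath
import math


def mix_and_sum(samples, f0):
    """Multiply by a complex oscillator at f0 (cycles per sample) and sum."""
    return sum(v * cmath.exp(-2j * math.pi * f0 * n) for n, v in enumerate(samples))


N = 256       # samples per measurement (hypothetical)
F0 = 0.125    # measured frequency in cycles per sample (hypothetical)

results = []
for offset_bins in (0.0, 0.5, 1.0, 1.5):
    delta = offset_bins / N
    tone = [cmath.exp(2j * math.pi * (F0 + delta) * n) for n in range(N)]
    baseband = [cmath.exp(2j * math.pi * delta * n) for n in range(N)]
    bandpass = abs(mix_and_sum(tone, F0))    # response to a tone at F0 + delta
    lowpass = abs(sum(baseband))             # plain sum of a tone at delta
    results.append((offset_bins, bandpass, lowpass))
    print(f"offset {offset_bins:3.1f} bins: bandpass {bandpass:8.3f}  lowpass {lowpass:8.3f}")
```

The two columns match, which is the sense in which the mix-and-sum is a lowpass filter moved up to F0, so whatever window improves the lowpass improves the bandpass in the same way.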
Whether receiving radio signals or doing Goertzel analysis we want to reject the undesired frequencies to the greatest extent practical. At the very least, the Goertzel should resist DC from influencing the result. Is that why it adds and subtracts instead of just adding?
The diagram is from the article "The USRP under 1.5X Magnifying Lens!" That's basically what the Goertzel does. It's got some serious filtering to keep out-of-band signals out. The FPGA in the USRP1 does not have multipliers so they used a cordic instead.
If you do this, the Tukey or whatever window on the Goertzel input will attenuate the high frequency you are trying to measure and pass the DC through at full strength. More DC will be picked up by the sidelobes of the rectangular window used by the Goertzel. It's worse than useless.
In practice the measured frequency would be in the passband of the Tukey. In that case the above doesn't apply, but there is still no benefit.
I just remembered something about our Goertzel. We can play portions of the LUT. So, we can have a window open, plateau, and window close section in the LUT for making long measurements.
It's interesting in that thing you posted that they are doing the decimation on the sine and cosine sums. That must improve acquisition time down to the square root of the number of cycles it would take, otherwise. For picking data out of a carrier wave, that must be crucial.
I did a quick test on the P1. Breadboarded on my Activity Board with 220pF caps instead of 1nF. Input was grounded.
          Triangle   Rectangle
Mean      1143.3     1153.9
Std Dev   0.13057    4.64507
Triangular window has a standard deviation 2.8% of the rectangular window. Have we been doing ADC wrong for 12 years? I'm probably going to build a second or third order modulator next.
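For comparison, an idealised model shows the same direction of effect. This is only a sketch: it assumes a perfect first-order modulator with a constant input and no analog noise, so it says nothing about the P1 circuit itself, and the input value, window length and step between readings are arbitrary.

```python
from statistics import mean, pstdev


def dsm_bits(num, den, count):
    """Ideal first-order modulator for a DC input of num/den (integer arithmetic)."""
    acc, bits = 0, []
    for _ in range(count):
        acc += num
        if acc >= den:
            acc -= den
            bits.append(1)
        else:
            bits.append(0)
    return bits


def windowed_readings(bits, weights, step):
    """Normalised weighted sums of the bitstream, one reading every `step` bits."""
    total = sum(weights)
    span = len(weights)
    return [sum(w * b for w, b in zip(weights, bits[i:i + span])) / total
            for i in range(0, len(bits) - span + 1, step)]


L = 64
rect = [1] * (2 * L)                                       # plain count over 128 bits
tri = list(range(1, L + 1)) + list(range(L - 1, 0, -1))    # two 64-bit boxcars convolved

bits = dsm_bits(3719, 10007, 20000)
rect_readings = windowed_readings(bits, rect, 97)
tri_readings = windowed_readings(bits, tri, 97)

print("rectangle: mean", mean(rect_readings), "std", pstdev(rect_readings))
print("triangle:  mean", mean(tri_readings), "std", pstdev(tri_readings))
```

Both windows cover about the same number of bits, so the difference in spread comes from the window shape rather than from a longer measurement.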
That's very cool! For the second order, could you please try this code? (I don't have a board with an ADC on it atm.)
Chip,
I've just tried to use smartpin mode %01100, inc on A-rise & B-high, for use on an externally clocked bitstream (generated by an AD7400 isolated ADC). Mode %01100 is not entirely ideal for this job because the settable (X) measurement period is in sysclocks rather than bitstream clocks. Funny how such details don't pop out on first look. I know it never was intended for external synchronous clocking but I thought I should say something anyway.
I actually ended up not using that mode at all. Just went back to mode %01111 instead, and ignored the external clock. A side effect is only one pin is used this way, so that's kind of cool. It seems to be functioning just fine this way.
So, the period needs to be in pin rises, not sysclks, right? It's not complicated to add some config bit to the mode to select such a thing.
I will review the modes and see where such things can be added.
External clock will be useful in most modes. (+ / -)
For external ADC, the Din is a count-enable.
Chip,
Is there any detailed info on DAC configuration for the Prop123 board? So far I've managed to get one going but don't know much about the settings I've used. The channel numbering, for starters, seems to be mixed up.
wrpin ##%1010000000000_01_00000_0, #1 'set DAC mode for DAC0 (maybe)
And this seems to activate DIR as well. Not nice when trying to use the digital pins at the same time.
I've got some fast action going on with this AD7400. Even though the ADC is only operating at 10 Mbps, I'm making use of the 80 MHz sysclock to boost the existing sinc1 to a sinc3 emulation in a tight 100 ns loop.
'==================================
' Sinc3 filter (cogexec in cog #1)
'==================================
                ORG
start_sinc3
                cogid   cid
                wrpin   #%00_01111_0, #tpin             'set adc/counter mode
                wypin   #0, #tpin                       'inc on high
                wxpin   #0, #tpin                       'totaliser
                dirh    #tpin                           'enable smart pin

'Sinc3 loop (8 sysclocks)
                rep     @.lend, #0                      'loop forever
                rdpin   acc1, #tpin
                add     acc2, acc1
                add     acc3, acc2
                wrlut   acc3, #(mailbox & $1ff)         'for the decimator (lut sharing is active)
.lend
                cogstop cid

acc1            long    0
acc2            long    0
acc3            long    0
cid             long    0

                ORG     $3ff
mailbox         long    0
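The loop above only runs the integrators; the decimator on the other cog is not shown. Below is a guess at what the whole chain computes, modelled with wrapping 32-bit arithmetic. The decimator, its period and the helper names are assumptions, and one list element here stands for one loop pass.

```python
MASK = 0xFFFFFFFF    # model 32-bit registers that wrap on overflow


def integrator_chain(bits):
    """Three cascaded integrators, one pass per input bit, never cleared."""
    acc1 = acc2 = acc3 = 0
    for bit in bits:
        acc1 = (acc1 + bit) & MASK     # stands in for the smart pin totaliser (rdpin)
        acc2 = (acc2 + acc1) & MASK    # add acc2, acc1
        acc3 = (acc3 + acc2) & MASK    # add acc3, acc2
        yield acc3


def decimator(acc3_values, period):
    """Hypothetical decimator: three differences taken every `period` passes."""
    prev1 = prev2 = prev3 = 0
    out = []
    for n, acc3 in enumerate(acc3_values, start=1):
        if n % period:
            continue
        diff1 = (acc3 - prev1) & MASK
        diff2 = (diff1 - prev2) & MASK
        diff3 = (diff2 - prev3) & MASK
        prev1, prev2, prev3 = acc3, diff1, diff2
        out.append(diff3)
    return out[2:]    # the first two outputs cover a partly empty window


PERIOD = 16
# 8000 passes is long enough for acc3 to wrap past 32 bits several times.
full_scale = decimator(integrator_chain([1] * 8000), PERIOD)
half_scale = decimator(integrator_chain([1, 0] * 4000), PERIOD)
print(full_scale[:3], half_scale[:3])
```

The point of the model is that the integrators can overflow freely: as long as the final result fits in 32 bits (PERIOD cubed at full scale), the wrapped differences still come out right.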
One question on pasm syntax: The mailbox ORG'ing of $3ff, is it useful in terms of allocation management? Is that the right way to handle lutRAM? I mean I could have just hard coded #$1ff for the WRLUT instruction.
On the Prop123 FPGA board, cog0 DAC channels 3/2/1 go to the DACs for R/G/B. Note that these are just the cog DAC channels, not particular smart pins.
About your code above, your "org $3ff" doesn't actually load anything subsequent into LUT. It just tracks LUT addressing, while putting code/data into hub space.
Right, yep, I could have used res 1 instead of long 0. I'm not confident of the tracking though. Is that recognised as lut space? Given wrlut can't resolve the address directly.
On the Prop123 FPGA board, cog0 DAC channels 3/2/1 go to the DACs for R/G/B. Note that these are just the cog DAC channels, not particular smart pins.
Okay, just three DACs. And should reference the RGB labels instead of the socket numbers on the board.
And to enable them I have to set the smartpin I/O config as if the related DAC was really in that I/O pad. Which allows them to be operated by the smartpins. Right?
The assembler will register that cog address as $3FF, which is a LUT address for branching. Meanwhile, it's address $1FF of the LUT. I know this may not be any new information to you.
Oh, I see now, that's done intentionally for 24-bit RGB colour. The least 8 bits of a 32-bit longword aren't used, so OUT0 DAC is also skipped to suit.
          Triangle   Rectangle
Mean      1164.5     1166.5
Std       0.20081    1.51132
relative_noise = 0.13287
          Quadratic  Rectangle
Mean      1171.0     1173.3
Std       0.16607    1.59741
relative_noise = 0.10396
N1=N3=64
N2=128 This was to get a 12 bit result from the rectangular window.
Breadboarding the ADC circuit can be problematic. In my previous tests it was intermittently oscillating at 40MHz. Removing the caps seems to produce better results, but the scope probe loads the circuit enough to affect it. In this test the improvement is less. But it's still an order of magnitude or 3 bits.
Comments
4. The bit growth calculations assume two's complement arithmetic. The input needs to be sign-extended. We were treating it as unsigned for the delta-sigma input. But our input is now 8 bits.
sample interval   order   accumulator size
256 clocks        2nd     24 bits
4096 clocks       2nd     32 bits
256 clocks        3rd     32 bits
Intervals could be lengthened slightly by reducing the number of active bits in the LUT.
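The three rows above are consistent with the usual bit-growth rule for cascaded integrators: input width plus order times log2 of the sample interval, with an 8-bit signed input. That this is the rule actually used for the table is an assumption; the sketch below just reproduces the numbers, and the function names are made up.

```python
import math


def accumulator_bits(input_bits, order, interval):
    """Bits needed so the final accumulator cannot overflow within one interval."""
    return input_bits + order * math.ceil(math.log2(interval))


def max_interval(accumulator_size, input_bits, order):
    """Longest sample interval (in clocks) that fits a given accumulator size."""
    return int(2 ** ((accumulator_size - input_bits) / order))


for interval, order in ((256, 2), (4096, 2), (256, 3)):
    print(interval, "clocks, order", order, "->", accumulator_bits(8, order, interval), "bits")

# Dropping one active LUT bit (7-bit input) stretches the 2nd-order, 32-bit case a little.
print(max_interval(32, 8, 2), "->", max_interval(32, 7, 2), "clocks")
```

Under this rule, giving up one LUT bit buys roughly a factor of the square root of two in interval for a second-order filter, which fits "lengthened slightly".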
That's very cool! For the second order, could you please try this code? (I don't have a board with an ADC on it atm.) thanks,
Jonathan
For external ADC, the Din is a count-enable.
Right.
The assembler will register that cog address as $3FF, which is a LUT address for branching. Meanwhile, it's address $1FF of the LUT. I know this may not be any new information to you.
I had to mask it to get that in the source above. So I'm thinking there is a better way.