Is anything about Prop2 proving to be a headache?
cgracey
Posts: 14,209
Having used and contemplated the Prop2 design for several weeks, are you finding some things to be more complicated or difficult than you feel they ought to be?
I'll get the SPI booter working next, as I know that is holding some people up.
Is there anything that seems to be in the way of carefree development?
Comments
The only "pain" is exploring all the new features! Haven't gotten to them all just yet. Nice "pain" to have, if you ask me.
I'm very curious to read what others have to say. I'm really liking this one.
The only pain suffered here was withdrawal from P2-Hot, I feel better now.
P2-Not_as_hot is coming together very nicely.
Looking forward to more speed again (80MHz+) and smart pins.
The MUL and MULS are nice, but you need a lot of instructions to do for example the inner loop of a FIR filter, something like:
With an additional instruction that does a SCL + ADD to an accu register and does not modify D it would only be: The accu can be a fixed cog register (most flexible) or a hidden additional register with a set and get instruction. And the MUL (SCL) is done with the existing 16x16 signed multiplier.
Andy
I was lamenting the loss of DSP, too. Adding an accumulator is not a big deal. It would also be simple to make power-of-2 wrapping with ALTDS, where it is achieved by masking.
Maybe it'd be better to have the default be for those instructions to stall until the result is ready?
Then, all the business about overlapping and waiting and counting clocks could be an optional extra...
But, I haven't actually tried it yet, so don't take that comment too seriously...
The CORDIC is all goodness now. The GETQx instructions were always meant to be blocking (even if they didn't quite work that way initially), and Chip has just cleaned up the last niggle where they could potentially hang.
Enjoy.
The one caveat is that there is only one result buffer for each Cog, and it has to be cleared before it will receive a new result. Meaning, if there is more than one command on the go, that Cog must fetch the current result before the next one pops off the pipeline. Otherwise that next result goes poof when it hits the result buffer.
I'm reading that section again now to see if it looks easier...
Ok, I guess it looks good. Like you said, just do GETQx and it will wait for result.
Ariba,
I've been thinking about what you said here.
These wrapping buffers are critical for FIRs. I'm looking into changing ALTI to allow simple wrapping buffers for power-of-2 sizes, where D/S field [n:0] is incremented/decremented and [8:n+1] stays the same. That's almost nothing in logic, as it avoids random comparisons. That will give us the buffers that we need.
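To make the masking idea concrete, here is a minimal Python sketch of a power-of-2 wrapping step on a 9-bit register address. The function name `wrap_step` and its parameters are made up for illustration; this models only the arithmetic described above, not how ALTI is actually encoded.

```python
def wrap_step(addr, n, step=1):
    """Step bits [n:0] of a 9-bit address by `step`; bits [8:n+1] stay the same.

    Hypothetical helper: models the wrapping arithmetic only.
    """
    low_mask = (1 << (n + 1)) - 1          # selects bits [n:0]
    base = addr & 0x1FF & ~low_mask        # bits [8:n+1], left untouched
    return base | ((addr + step) & low_mask)


# An 8-entry buffer (n = 2) located at registers $40..$47:
print(hex(wrap_step(0x47, 2)))       # incrementing past the end wraps to the start
print(hex(wrap_step(0x40, 2, -1)))   # decrementing past the start wraps to the end
```

No comparison against buffer bounds is needed, which is why the buffer size has to be a power of 2 and the buffer has to start on a multiple of its size.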
I'll add an accumulator and make a MAC instruction, too. 'REP+ALTI+MAC' will give us 4-clock-per-MAC throughput. We could even get that down to 1 clock if we did something like 'SETQ+RDLONG' does, where the cog RAM is cycled on each clock. That would be optimal for FIRs, but not do anything for random MACs.
I always prefer IIR filters, but they need the same type of MAC, and if you run more than one filter, fast loops are welcome there too.
How do you plan to implement MAC?
I think the simplest is a fractional multiply (like SCL of the old P2hot) that gets accumulated in a 32bit accu:
accu += (16bit x 16bit >> 15).
This will result in a 16bit DSP, with 32bit accumulators (15 bit headroom).
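As a rough illustration of that arithmetic, the following Python sketch models the fractional multiply-accumulate with a 32-bit wrapping accumulator. The helper names (`to_s16`, `to_s32`, `mac_q15`) are invented here, and the wrap-on-overflow behavior is an assumption, since overflow handling is not specified above.

```python
def to_s16(v):
    """Interpret the low 16 bits of v as a signed value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def to_s32(v):
    """Interpret the low 32 bits of v as a signed value (assumed wrap-around)."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v & 0x80000000 else v


def mac_q15(accu, a, b):
    """accu += (16bit x 16bit >> 15), using an arithmetic shift."""
    product = to_s16(a) * to_s16(b)
    return to_s32(accu + (product >> 15))   # Python's >> is arithmetic on ints


# 0.5 * 0.5 in Q15: 16384 * 16384 >> 15 = 8192, which is 0.25 in Q15
accu = mac_q15(0, 16384, 16384)
print(accu)
```

Each call adds one 16-bit-scaled product to the accumulator, so a FIR tap loop would call this once per coefficient and sample pair.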
If you accumulate the full 32-bit result of a 16x16 multiply, then the accu should be made wider than 32 bits (40 bits, for example), we need a fast way to scale it down (SARACC), and the overflow case has to be handled. This all gets more complicated, and the only advantage is higher resolution for intermediate results.
Also, single MACs are a pain: you need to scale before and after the instruction, so using the existing MUL and MULS may be faster in that case.
Andy
Okay. I like your idea of accumulating the top 16 bits of the product. Actually, a signed 16x16 multiply yields two sign bits, so we could maybe shift it right by 15, instead of 16, to get the intended effect.
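The "two sign bits" observation can be checked numerically. This is only an illustrative Python sketch with a made-up helper name (`top_bits`), not a description of the multiplier hardware.

```python
def top_bits(a, b):
    """Return (bit 31, bit 30) of the 32-bit product of two signed 16-bit values."""
    p = (a * b) & 0xFFFFFFFF
    return (p >> 31) & 1, (p >> 30) & 1


print(top_bits(32767, 32767))     # positive product: both top bits are 0
print(top_bits(-32768, 32767))    # negative product: both top bits are 1
print(top_bits(-32768, -32768))   # the single case where the two bits differ

# Effect of the shift amount on a Q15 result, 0.5 * 0.5:
p = 16384 * 16384
print(p >> 15)   # 8192 = 0.25 in Q15, the intended scaling
print(p >> 16)   # 4096 = 0.125 in Q15, one bit of scale lost
```

Bits 31 and 30 carry the same information for every input pair except -32768 x -32768, so shifting by 15 keeps one more useful bit of the product than shifting by 16.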
(See the bold part of my previous post.)
Andy
Sorry I didn't register that, at first.
I think we should have two accumulators for FFTs. Do you think so?
If you can arrange for a hardware multiply of complex numbers using that cordic engine that would be great!
The CORDIC supplies the sine and cosine which get multiplied by the data sample and then separately accumulated.
Duh... I just remembered that the CORDIC always generates scaled values, according to what you feed it. There's the multiply! 'QROTATE sample,angle' does the sine and cosine, already scaled by the sample. All that's left is the accumulation of those separate values. That's just five instructions:
QROTATE sample,angle
GETQX x
ADD xacc,x
GETQY y
ADD yacc,y
There's the heart of the FFT.
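For anyone who wants to see what that five-instruction loop accumulates, here is a floating-point Python stand-in. It assumes the rotation acts on the point (sample, 0) and that the angle is a 32-bit fraction of a full turn; both are assumptions for this sketch, and `qrotate_model` and `accumulate_bin` are hypothetical names, not CORDIC behavior taken from a spec.

```python
import math


def qrotate_model(sample, angle):
    """Model of rotating (sample, 0) by `angle`; returns the scaled (x, y) pair.

    Assumption: angle is an unsigned 32-bit fraction of one full turn.
    """
    theta = 2.0 * math.pi * (angle & 0xFFFFFFFF) / 2**32
    return round(sample * math.cos(theta)), round(sample * math.sin(theta))


def accumulate_bin(samples, k):
    """Accumulate x and y separately for frequency index k, as in the loop above."""
    n = len(samples)
    xacc = yacc = 0
    for i, s in enumerate(samples):
        angle = (k * i * 2**32 // n) & 0xFFFFFFFF
        x, y = qrotate_model(s, angle)   # QROTATE + GETQX + GETQY
        xacc += x                        # ADD xacc,x
        yacc += y                        # ADD yacc,y
    return xacc, yacc


# One cycle of a cosine over 8 samples, amplitude 1000:
samples = [1000, 707, 0, -707, -1000, -707, 0, 707]
print(accumulate_bin(samples, 1))   # energy shows up in xacc for k = 1
print(accumulate_bin(samples, 2))   # close to zero for k = 2
```

The two accumulators end up holding the in-phase and quadrature sums for one frequency, which is why no separate multiply is needed.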
If we just have a SCL instruction, which writes the result into a separate register (not D), we can do a MAC with 2 instructions: So we can have as many accus as we want, and we can also do ADD and SUBTRACT for example.
An FIR loop would then take 3 instructions: I think this is much simpler to implement than a MAC with one ACCU.
If the separate result destination for SCL is too difficult, it can be made with ALTI, which allows redirection of the result anyway. SCL would then just place the result in the D register if no ALTI changes that.
Andy
Andy, you're right. That would be way better. I was thinking, too, that MAC was not ideal, given the overhead.
Do you think a fixed >> 15 would be appropriate?
We could have an instruction to set the SCL destination register.
We wouldn't want it as a prefixing modifier, as that is just more instructions in the loop. I'm guessing you mean a hidden 9-bit presettable config register. I'm not sure there's enough gain. As with the PUSHD instruction, having a configurable link register location would be somewhat self-defeating.
PS: Embedding a small selection, like PUSHA/PUSHB, in the instruction would work well.
Yes, a fixed arithmetic shift right by 15 makes the most sense with a 16x16 multiplier. We can always use MULS followed by SAR if we need some special scaling.
I think a fixed result register for SCL is all we need. Mostly we just need the result as a source value for the next instruction, so SCL can even work like an ALT-type instruction that changes just the src input of the following ALU instruction.
Andy
Whoa! That's even better!
No interrupts allowed while SCL is executing, then, so that it will never be separated from the next instruction.
Can these instruction pairs perhaps have an implicit STALLI/ALLOWI around them?
Are there other sources of interrupt jitter?
A source of frustration on MCUs is the jitter in interrupts, and often I have wished for a switch that had a fixed, jitter free delay choice.
This would be set to the longest delay, and other paths would simply pad-out to that time, removing jitter.
Is that possible on P2?