Is anything about Prop2 proving to be a headache?

cgracey · 2015-11-25 04:06

Having used and contemplated the Prop2 design for several weeks, are you finding some things to be more complicated or difficult than you feel they ought to be?

I'll get the SPI booter working next, as I know that is holding some people up.

Is there anything that seems to be in the way of carefree development?

potatohead · 2015-11-25 04:34

Well, I've not had as much time as I would have liked. Still, it's not hard. I don't find myself struggling too much. So far, I've sat down to do something and in a reasonable time, did it.

The only "pain" is exploring all the new features! Haven't gotten to them all just yet. Nice "pain" to have, if you ask me.

I'm very curious to read what others have to say. I'm really liking this one.

ozpropdev · 2015-11-25 04:50

Chip
The only pain suffered here was withdrawal from P2-Hot, I feel better now.
P2-Not_as_hot is coming together very nicely.
Looking forward to more speed again (80MHz+) and smart pins.

Ariba · 2015-11-25 05:39

One thing that is not well supported in the new P2 is the typical DSP functions, like MAC. This was very good on the P2-hot, you could do a Multiply Accumulate in one cycle and modify two index registers at the same time (even with wrapping in a buffer).

The MUL and MULS are nice, but you need a lot of instructions to do for example the inner loop of a FIR filter, something like:

	mov   accu,#0
	rep   @loop,#128
	alti  bufp,#%001
	mov   t1,0-0
	alti  coef,#%001
	muls  t1,0-0
	sar   t1,#15
	add   accu,t1
loop

With an additional instruction that does a SCL + ADD to an accu register and does not modify D it would only be:

	clracc
	rep   @loop,#128
	alti  pointers,#%001_001
	mac   0-0,0-0              'does acc + (d * s ~> 15)
loop
	getacc result

The accu can be a fixed cog register (most flexible) or a hidden additional register with a set and get instruction. And the MUL (SCL) is done with the existing 16x16 signed multiplier.

Andy
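
For anyone who wants to check the arithmetic of the proposed instruction, here is a minimal Python sketch of the multiply, shift-right-by-15, accumulate step. It is only a model, not P2 code: `mac_q15` and `fir_q15` are hypothetical names, the inputs are assumed to be signed 16-bit values in 1.15 fixed point (often called Q15), and accumulator overflow is ignored.

```python
def mac_q15(acc, d, s):
    """Model of the proposed MAC: acc + ((d * s) ~> 15) on signed 16-bit inputs."""
    # Python's >> on an int is an arithmetic shift, like ~> / SAR.
    return acc + ((d * s) >> 15)


def fir_q15(samples, coeffs):
    """Inner loop of a FIR filter: clear the accumulator, then one MAC per tap."""
    acc = 0                      # clracc
    for d, s in zip(samples, coeffs):
        acc = mac_q15(acc, d, s)
    return acc                   # getacc result


# Four taps of 0.25 (8192 in Q15) applied to samples of +/-0.5 (16384 in Q15).
print(fir_q15([16384, 16384, -16384, 16384], [8192] * 4))
```

Each product of 0.5 and 0.25 contributes 4096 (0.125 in Q15), so the four taps above sum to 8192.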

cgracey · 2015-11-25 05:47

Ariba wrote: »
One thing that is not well supported in the new P2 is the typical DSP functions, like MAC. This was very good on the P2-hot, you could do a Multiply Accumulate in one cycle and modify two index registers at the same time (even with wrapping in a buffer).

The MUL and MULS are nice, but you need a lot of instructions to do for example the inner loop of a FIR filter, something like:

	mov   accu,#0
	rep   @loop,#128
	alti  bufp,#%001
	mov   t1,0-0
	alti  coef,#%001
	muls  t1,0-0
	sar   t1,#15
	add   accu,t1
loop

With an additional instruction that does a SCL + ADD to an accu register and does not modify D it would only be:

	clracc
	rep   @loop,#128
	alti  pointers,#%001_001
	mac   0-0,0-0              'does acc + (d * s ~> 15)
loop
	getacc result

The accu can be a fixed cog register (most flexible) or a hidden additional register with a set and get instruction. And the MUL (SCL) is done with the existing 16x16 signed multiplier.

Andy

I was lamenting the loss of DSP, too. Adding an accumulator is not a big deal. It would also be simple to make power-of-2 wrapping with ALTDS, where it is achieved by masking.

Rayman · 2015-11-25 12:35

Personally, I think the business about waiting for the CORDIC to be done is overly complex...

Maybe it'd be better to have the default be for those instructions to stall until the result is ready?

Then, all the business about overlapping and waiting and counting clocks could be an optional extra...

But, I haven't actually tried it yet, so don't take that comment too seriously...

evanh · 2015-11-25 12:45

Rayman,
The CORDIC is all goodness now. The GETQx instructions were always blocking instructions (even if they didn't quite work that way initially), and Chip has just cleaned up the last niggle where they could potentially hang.

Enjoy.

evanh · 2015-11-25 12:54

The ability to overlap commands is just a natural feature of the long pipeline. It can be processing up to 36 commands at once, starting a new command every clock. Each Cog gets to issue a new command every 16 clocks. So, up to 3 commands from each Cog can be in flight.

The one caveat is that there is only one result buffer for each Cog, and it has to be cleared before it will receive a new result. Meaning, if there is more than one command on the go, that Cog must fetch the current result before the next one pops off the pipeline. Otherwise that next result goes poof when it hits the result buffer.
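
The single-result-buffer caveat can be sketched as a small Python model. This is a simplification under stated assumptions, not a description of the actual hardware: `run_cog` is a hypothetical name, the latency of 38 clocks is a guess (the thread mentions both 36 and 38), and instruction timing is ignored, so a GETQx is treated as taking effect at the clock it starts.

```python
def run_cog(issue_clocks, fetch_clocks, latency=38):
    """Return (fetched, lost) command indices for one cog's result buffer.

    issue_clocks must be ascending. latency=38 is an assumption.
    """
    arrivals = [(t + latency, i) for i, t in enumerate(issue_clocks)]
    fetched, lost = [], []
    buffer = None      # the single result buffer; None means empty
    nxt = 0            # index of the next result still in the pipeline

    def deliver(until):
        nonlocal buffer, nxt
        while nxt < len(arrivals) and arrivals[nxt][0] <= until:
            idx = arrivals[nxt][1]
            if buffer is None:
                buffer = idx          # empty buffer accepts the result
            else:
                lost.append(idx)      # buffer still full: the result is dropped
            nxt += 1

    for want in fetch_clocks:
        deliver(want)
        if buffer is None and nxt < len(arrivals):
            # GETQx blocks until the next result pops off the pipeline.
            buffer = arrivals[nxt][1]
            nxt += 1
        if buffer is not None:
            fetched.append(buffer)
            buffer = None             # fetching clears the buffer
    deliver(float("inf"))             # results arriving after the last fetch
    return fetched, lost


# Three commands issued 16 clocks apart, fetched promptly, then fetched too late.
print(run_cog([0, 16, 32], [40, 56, 72]))
print(run_cog([0, 16, 32], [80]))
```

In the first call every result is fetched before the next one arrives, so nothing is lost. In the second call only the first result survives, because the buffer was never cleared in time for the other two.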

evanh · 2015-11-25 12:56

Or it might be 38 rather than 36. Small detail.

Rayman · 2015-11-25 14:21

The docs say 36 in one place and 38 in another...

I'm reading that section again now to see if it looks easier...

Ok, I guess it looks good. Like you said, just do GETQx and it will wait for result.

Seairth · 2015-11-25 15:00

The pipeline is 36 cycles. But the instruction overhead adds 2 cycles. I think.

Rayman · 2015-11-25 15:01

38 clocks sounds like a lot, but when you consider instructions take 2 clocks, it sounds better...

cgracey · 2015-11-25 18:12

Ariba wrote: »

One thing that is not well supported in the new P2 is the typical DSP functions, like MAC. This was very good on the P2-hot, you could do a Multiply Accumulate in one cycle and modify two index registers at the same time (even with wrapping in a buffer).

Ariba,

I've been thinking about what you said here.

These wrapping buffers are critical for FIRs. I'm looking into changing ALTI to allow simple wrapping buffers for power-of-2 sizes, where D/S field [n:0] is incremented/decremented and [8:n+1] stays the same. That's almost nothing in logic, as it avoids random comparisons. That will give us the buffers that we need.
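
The masking idea can be illustrated with a short Python sketch. The function name `wrap_step` and its parameters are hypothetical, and the 9-bit field width is taken from the [8:n+1] range mentioned above. It assumes the buffer is aligned to its own power-of-2 size.

```python
def wrap_step(addr, n, step=1):
    """Step bits [n:0] of a 9-bit register address, keeping bits [8:n+1] fixed."""
    mask = (1 << (n + 1)) - 1        # low n+1 bits form the index into the buffer
    index = (addr + step) & mask     # increment/decrement wraps by masking
    base = addr & ~mask & 0x1FF      # upper bits stay the same
    return base | index


# A 128-long buffer (n = 6) based at register $100.
print(hex(wrap_step(0x105, 6)))      # ordinary increment
print(hex(wrap_step(0x17F, 6)))      # wraps from the last entry to the first
print(hex(wrap_step(0x100, 6, -1)))  # decrement wraps from the first to the last
```

No comparison against a buffer limit is needed, which is why this costs almost nothing in logic.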

I'll add an accumulator and make a MAC instruction, too. 'REP+ALTI+MAC' will give us 4-clock-per-MAC throughput. We could even get that down to 1 clock if we did something like 'SETQ+RDLONG' does, where the cog RAM is cycled on each clock. That would be optimal for FIRs, but not do anything for random MACs.

Ariba · 2015-11-26 01:48

I don't think FIRs are so important that features optimized for them make much sense. The FIR is just one of the simplest DSP algorithms, and it shows what kind of instructions are needed for DSP. It's a bit like a benchmark in the DSP world.

I always prefer IIR filters, but they need the same type of MAC, and if you do more than one filter, fast loops are also welcome for that.

How do you plan to implement MAC?
I think the simplest is a fractional multiply (like SCL of the old P2hot) that gets accumulated in a 32bit accu:
accu += (16bit x 16bit >> 15).
This will result in a 16bit DSP, with 32bit accumulators (15 bit headroom).

If you accumulate the full 32-bit result of a 16x16 multiply, then the accu should be made wider than 32 bits (40 bits, for example), we need a fast way to scale it down (SARACC), and the overflow case should be handled. This all gets more complicated, with the only advantage being higher resolution for intermediate results.
Also, single MACs are a pain: you need to scale before and after the instruction, so using the existing MUL and MULS may be faster then.

Andy

cgracey · 2015-11-26 06:58

Ariba wrote: »

I don't think FIRs are so important that features optimized for them make much sense. The FIR is just one of the simplest DSP algorithms, and it shows what kind of instructions are needed for DSP. It's a bit like a benchmark in the DSP world.

I always prefer IIR filters, but they need the same type of MAC, and if you do more than one filter, fast loops are also welcome for that.

How do you plan to implement MAC?
I think the simplest is a fractional multiply (like SCL of the old P2hot) that gets accumulated in a 32bit accu:
accu += (16bit x 16bit >> 15).
This will result in a 16bit DSP, with 32bit accumulators (15 bit headroom).

If you accumulate the full 32-bit result of a 16x16 multiply, then the accu should be made wider than 32 bits (40 bits, for example), we need a fast way to scale it down (SARACC), and the overflow case should be handled. This all gets more complicated, with the only advantage being higher resolution for intermediate results.
Also, single MACs are a pain: you need to scale before and after the instruction, so using the existing MUL and MULS may be faster then.

Andy

Okay. I like your idea of accumulating the top 16 bits of the product. Actually, a signed 16x16 multiply yields two sign bits, so we could maybe shift it right by 15, instead of 16, to get the intended effect.
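
A quick worked example of the 15-versus-16 question, as a Python sketch. The helper name `scaled_product` is hypothetical, and the values are assumed to be signed 16-bit numbers in 1.15 fixed point (Q15), where 0x4000 represents 0.5.

```python
def scaled_product(a, b, shift):
    """Signed 16x16 multiply followed by an arithmetic shift right."""
    return (a * b) >> shift


half = 0x4000                              # 0.5 in Q15
print(scaled_product(half, half, 15))      # 8192, which is 0.25 in Q15
print(scaled_product(half, half, 16))      # 4096, which is 0.125: one bit too small

# The one case where the 15-bit shift does not fit back into 16 signed bits:
print(scaled_product(-32768, -32768, 15))  # 32768, i.e. -1.0 * -1.0 = +1.0
```

With the 15-bit shift the result is in the same format as the inputs. The only product that overflows a signed 16-bit result is -1.0 times -1.0.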

Ariba · 2015-11-26 07:35

Yes, definitely a 15-bit shift, so that the result has the same format as the inputs.
(See the bold part of my previous post.)

Andy

cgracey · 2015-11-26 18:13

Ariba wrote: »

Yes, definitely a 15-bit shift, so that the result has the same format as the inputs.
(See the bold part of my previous post.)

Andy

Sorry I didn't register that, at first.

I think we should have two accumulators for FFTs. Do you think so?

Heater. · 2015-11-26 19:48

Did someone say FFT?

If you can arrange for a hardware multiply of complex numbers using that CORDIC engine that would be great!

cgracey · 2015-11-26 21:28

Heater. wrote: »

Did someone say FFT?

If you can arrange for a hardware multiply of complex numbers using that CORDIC engine that would be great!

The CORDIC supplies the sine and cosine which get multiplied by the data sample and then separately accumulated.

Duh... I just remembered that the CORDIC always generates scaled values, according to what you feed it. There's the multiply! 'QROTATE sample,angle' does the sine and cosine, already scaled by the sample. All that's left is the accumulation of those separate values. That's just five instructions:

QROTATE sample,angle
GETQX x
ADD xacc,x
GETQY y
ADD yacc,y

There's the heart of the FFT.
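
To see what that five-instruction sequence computes, here is a floating-point Python sketch. It is a stand-in under assumptions, not the CORDIC itself: `qrotate` and `accumulate_bin` are hypothetical names, `math.cos`/`math.sin` replace the hardware rotation, and the angle is in radians rather than whatever format the hardware uses. Summing the rotated samples this way yields one frequency bin by direct correlation, which is the sum an FFT reorganizes for speed.

```python
import math


def qrotate(sample, angle):
    """Floating-point stand-in for 'QROTATE sample,angle': rotate (sample, 0)."""
    return sample * math.cos(angle), sample * math.sin(angle)


def accumulate_bin(samples, k):
    """Accumulate rotated samples for frequency bin k; returns (xacc, yacc)."""
    n = len(samples)
    xacc = yacc = 0.0
    for i, sample in enumerate(samples):
        x, y = qrotate(sample, 2 * math.pi * k * i / n)  # QROTATE, GETQX, GETQY
        xacc += x                                          # ADD xacc,x
        yacc += y                                          # ADD yacc,y
    return xacc, yacc


# A cosine with 3 cycles over 32 samples shows up in bin 3 and not in bin 5.
signal = [math.cos(2 * math.pi * 3 * i / 32) for i in range(32)]
print(accumulate_bin(signal, 3))
print(accumulate_bin(signal, 5))
```

For this test signal, bin 3 accumulates to roughly (16, 0) and bin 5 to roughly (0, 0), up to floating-point rounding.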

Dave Hein · 2015-11-26 21:30

I believe two QROTATE instructions can be used to implement a butterfly operation. This should be applicable to the FFT, DCT and other orthogonal transforms. However, I think just using the MULS instruction on 16-bit values will be sufficient for many applications, such as doing DCT for JPEG and MPEG-2. A MAC instruction would be nice for implementing FIR filters.

Heater. · 2015-11-26 21:37

There is something awesome about that, if only I could figure it out.

evanh · 2015-11-26 23:05

Could be amusing using multiCog HubExec and the CORDIC together to get full speed.

Ariba · 2015-11-28 06:36

After thinking a bit more about it, a MAC with a dedicated ACCU may not be the best approach. Especially with single random MACs we lose all the benefit if we have to fill or clear the accu first and read the result from the accu after MAC.
If we just have a SCL instruction, which writes the result into a separate register (not D), we can do a MAC with 2 instructions:

	scl	d,s
	add	accu1,sclresult

	scl	d,s
	add	accu2,sclresult

	scl	d,s
	sub	accu1,sclresult

So we can have as many accus as we want, and we can also do ADD and SUBTRACT for example.

An FIR loop would then take 3 instructions:

	mov	pointers,##coeffs<<9 + inbuff
	mov	acc1,#0

	rep	@end,#128
	alti	pointers,mode
	scl	0-0,0-0
	add	acc1,sclresult
end
	...

mode	long	%011_011_000_111_111   'incr D and S with wrapping inside 128

I think this is much simpler to implement than a MAC with one ACCU.

If the separate result destination for SCL is too difficult, it can be made with ALTI, which allows redirection of the result anyway. SCL would then just place the result in the D register if no ALTI changes that.

Andy
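
One place where several accumulators with both ADD and SUB are handy is a complex multiply. The Python sketch below models that with the proposed SCL-then-ADD/SUB pairs. It is an illustration only: `scl` models the proposed instruction as (d * s) shifted right arithmetically by 15, `complex_mul_q15` is a hypothetical name, and inputs are assumed to be signed 16-bit values in 1.15 fixed point.

```python
def scl(d, s):
    """Model of the proposed SCL: (d * s) ~> 15, left in a separate result."""
    return (d * s) >> 15


def complex_mul_q15(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using two accumulators and four SCL pairs."""
    accu1 = 0                 # real part
    accu2 = 0                 # imaginary part
    accu1 += scl(ar, br)      # scl d,s / add accu1,sclresult
    accu2 += scl(ar, bi)      # scl d,s / add accu2,sclresult
    accu1 -= scl(ai, bi)      # scl d,s / sub accu1,sclresult
    accu2 += scl(ai, br)      # scl d,s / add accu2,sclresult
    return accu1, accu2


# (0.5 + 0.5j) * (0.5 - 0.5j) = 0.5 + 0j, i.e. (16384, 0) in Q15.
print(complex_mul_q15(16384, 16384, 16384, -16384))
```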

cgracey · 2015-11-28 16:53

Ariba wrote: »
After thinking a bit more about it, a MAC with a dedicated ACCU may not be the best approach. Especially with single random MACs we lose all the benefit if we have to fill or clear the accu first and read the result from the accu after MAC.
If we just have a SCL instruction, which writes the result into a separate register (not D), we can do a MAC with 2 instructions:

	scl	d,s
	add	accu1,sclresult

	scl	d,s
	add	accu2,sclresult

	scl	d,s
	sub	accu1,sclresult

So we can have as many accus as we want, and we can also do ADD and SUBTRACT for example.

An FIR loop would then take 3 instructions:

	mov	pointers,##coeffs<<9 + inbuff
	mov	acc1,#0

	rep	@end,#128
	alti	pointers,mode
	scl	0-0,0-0
	add	acc1,sclresult
end
	...

mode	long	%011_011_000_111_111   'incr D and S with wrapping inside 128

I think this is much simpler to implement than a MAC with one ACCU.

If the separate result destination for SCL is too difficult, it can be made with ALTI, which allows redirection of the result anyway. SCL would then just place the result in the D register if no ALTI changes that.

Andy

Andy, you're right. That would be way better. I was thinking, too, that MAC was not ideal, given the overhead.

Do you think a fixed >> 15 would be appropriate?

We could have an instruction to set the SCL destination register.

evanh · 2015-11-28 23:26

cgracey wrote: »

We could have an instruction to set the SCL destination register.

We wouldn't want it as a prefixing modifier, as that is just more instructions in the loop. I'm guessing you mean a hidden 9-bit presettable config register. I'm not sure there's enough gain; like the PUSHD instruction, having a configurable link register location would be somewhat self-defeating.

PS: Embedding a small selection, like PUSHA/PUSHB, in the instruction would work well.

Ariba · 2015-11-29 04:41

cgracey wrote: »

....
Andy, you're right. That would be way better. I was thinking, too, that MAC was not ideal, given the overhead.

Do you think a fixed >> 15 would be appropriate?

We could have an instruction to set the SCL destination register.

Yes, a fixed arithmetic shift right by 15 makes the most sense with a 16x16 multiplier. We can always use MULS followed by SAR if we need some special scaling.

I think a fixed result register for SCL is all we need. Mostly we just need the result as a source value for the next instruction, so SCL can even work like an ALT-type instruction, which changes just the src input of the following ALU instruction.

Andy

cgracey · 2015-11-29 06:31

Ariba wrote: »

cgracey wrote: »

....
Andy, you're right. That would be way better. I was thinking, too, that MAC was not ideal, given the overhead.

Do you think a fixed >> 15 would be appropriate?

We could have an instruction to set the SCL destination register.

Yes, a fixed arithmetic shift right by 15 makes the most sense with a 16x16 multiplier. We can always use MULS followed by SAR if we need some special scaling.

I think a fixed result register for SCL is all we need. Mostly we just need the result as a source value for the next instruction, so SCL can even work like an ALT-type instruction, which changes just the src input of the following ALU instruction.

Andy

Whoa! That's even better!

No interrupts allowed while SCL is executing, then, so that it will never be separated from the next instruction.

evanh · 2015-11-29 11:23

Ahhhh, not happy ... that's a burden ... Do all ALTx instructions have this caveat?

Can these instruction pairs perhaps have an implicit STALLI/ALLOWI around them?

cgracey · 2015-11-29 16:50

It's already such that interrupts cannot occur on cycles where ALTI/ALTR/ALTD/ALTS are executing. In these cases, the interrupt is delayed by one instruction.

jmg · 2015-11-29 19:06

cgracey wrote: »

It's already such that interrupts cannot occur on cycles where ALTI/ALTR/ALTD/ALTS are executing. In these cases, the interrupt is delayed by one instruction.

Are there other sources of interrupt jitter?

A source of frustration on MCUs is the jitter in interrupts, and often I have wished for a switch that had a fixed, jitter-free delay choice.
This would be set to the longest delay, and other paths would simply pad out to that time, removing jitter.
Is that possible on P2?

potatohead · 2015-11-29 19:46

I can't remember what REP + Interrupt ended up doing.
