Propeller II MAC instruction

BradC · 2010-01-11 01:51

G'day Chip,

I've had a pretty good root around all the information currently available on the MAC instruction proposed for the new chip, and from what I've been able to ascertain it's a 16x16 bit multiply?

The reason for this post is to query the accumulator size. I've been making heavy use of the MAC on another small processor recently for digital audio filtering, and that chip has a 40-bit accumulator. I've found it absolutely essential to have the extra bits available to prevent overflow when using a Direct Form 1 biquad filter. Is the new chip going to have any facility for an accumulator > 32 bits?

The other "feature" that is really nice is selectable automatic writeback saturation, where you write the contents of the accumulator back to a user register and it can automatically saturate to that register size rather than truncating the high bits and causing a wraparound. This seems pretty essential when the accumulator is larger than the native word size.
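As a rough illustration of the difference between truncating and saturating on writeback, here is a minimal Python sketch. The helper names and the 40-bit accumulator value are made up for the example; nothing here reflects how the Propeller II or any particular chip implements it.

```python
def wrap_signed(value, bits):
    """Keep only the low `bits` bits and reinterpret them as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def saturate_signed(value, bits):
    """Clamp value to the signed range of a `bits`-wide register."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


# Hypothetical accumulator contents: just past the largest positive 32-bit value.
acc = (1 << 31) + 5

truncated = wrap_signed(acc, 32)      # high bits dropped, result wraps negative
saturated = saturate_signed(acc, 32)  # result pinned at the 32-bit maximum
print(truncated, saturated)
```

The truncated writeback flips sign, while the saturated one stays at full scale, which is usually the less audible failure in an audio path.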

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

Mike Green · 2010-01-11 01:57

The chip area occupied by a multiplier like this goes up pretty fast as the word length gets bigger. I'm sure it's at least related to the square of the number of bits and 16x16 was a good compromise. You can very easily do larger multiplications using multiple precision. 32x32 would take 4 multiplies and some shifts and adds.
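For anyone who wants to see the multiple-precision approach spelled out, here is a minimal Python sketch of an unsigned 32x32 multiply built from four 16x16 multiplies. The function names are invented for the example, and it covers unsigned operands only; a signed version needs extra correction steps.

```python
MASK16 = 0xFFFF


def mul16(a, b):
    """Stand-in for a 16x16 -> 32-bit unsigned hardware multiply."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32x32(a, b):
    """Unsigned 32x32 -> 64-bit product from four 16x16 partial products."""
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16
    low = mul16(a_lo, b_lo)                       # weight 2**0
    mid = mul16(a_hi, b_lo) + mul16(a_lo, b_hi)   # weight 2**16
    high = mul16(a_hi, b_hi)                      # weight 2**32
    return (high << 32) + (mid << 16) + low
```

The two shifts and three adds in the last line are the "some shifts and adds" part; on real hardware the carries out of the middle term are what need care.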

BradC · 2010-01-11 02:06

Mike Green said...
The chip area occupied by a multiplier like this goes up pretty fast as the word length gets bigger. I'm sure it's at least related to the square of the number of bits and 16x16 was a good compromise. You can very easily do larger multiplications using multiple precision. 32x32 would take 4 multiplies and some shifts and adds.

Oh, I'm not suggesting 16x16 is inadequate (although it's about the bare minimum you want for reasonable quality digital audio work), I'm just hoping the accumulator (the register that all MAC multiplies get added into) has an extension past 32 bits.

Let's say for example you have three MAC and 2 MSC instructions (a pretty basic implementation of a biquad filter).
Your first three MAC instructions (16x16) can generate a 32 bit result each, resulting in overflow of a 32 bit accumulator. Your next 2 MSC instructions (multiply and subtract) also generate 32 bit results. The end result is a number that will comfortably fit in a 32 bit register, but if your accumulator is 32 bits you have lost information in the actual filter process as your first couple of intermediate results have been truncated.
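To put numbers on the headroom question, here is a small Python sketch of one Direct Form 1 biquad output sample done as three MACs and two MSCs, tracking the largest intermediate value. The coefficients and samples are made-up near-full-scale values chosen only to stress the accumulator, not a real filter design.

```python
INT32_MAX = (1 << 31) - 1


def biquad_df1_step(coeffs, x_hist, y_hist):
    """One Direct Form 1 output sample.

    coeffs = (b0, b1, b2, a1, a2), x_hist = (x[n], x[n-1], x[n-2]),
    y_hist = (y[n-1], y[n-2]); all signed 16-bit integers.
    Returns (final accumulator, largest intermediate magnitude).
    """
    b0, b1, b2, a1, a2 = coeffs
    acc = 0
    peak = 0
    # Three MACs (multiply and add) over the input history.
    for c, s in ((b0, x_hist[0]), (b1, x_hist[1]), (b2, x_hist[2])):
        acc += c * s
        peak = max(peak, abs(acc))
    # Two MSCs (multiply and subtract) over the output history.
    for c, s in ((a1, y_hist[0]), (a2, y_hist[1])):
        acc -= c * s
        peak = max(peak, abs(acc))
    return acc, peak


# Illustrative values only: everything at positive full scale.
acc, peak = biquad_df1_step((32767,) * 5, (32767,) * 3, (32767,) * 2)
bits_needed = peak.bit_length() + 1  # magnitude bits plus a sign bit
print(acc, peak, bits_needed)
```

With these inputs the running sum after the third MAC is larger than a signed 32-bit register can hold, even though the final value fits again, and it sits comfortably inside 40 bits.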

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

Leon · 2010-01-11 04:48

I'd like a 32-bit MAC with a 64-bit accumulator, like the XMOS chips.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

hinv · 2010-01-11 05:09

Does XMOS have a forum like this?
Is there an inexpensive way to get started playing with the XMOS chips?
Do you get paid to constantly plug XMOS when unsolicited?

Doug

Leon · 2010-01-11 05:17

I would still like a 32-bit MAC, I think that most people would. I can't really see the point of a 16-bit MAC on a 32-bit device.

SFE has a nice little XS1-L1-64 XMOS board for $49.

I use both Propeller and XMOS chips, depending on the application. They don't really compete with each other, they are intended for completely different markets.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

Post Edited (Leon) : 1/11/2010 5:25:25 AM GMT

RossH · 2010-01-11 05:31

@hinv,

There is an XMOS forum (http://www.xmoslinkers.org/) but it doesn't seem to get much traffic. I guess the XMOSers spend most of their time on forums like this one.

Ross.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina

Leon · 2010-01-11 05:36

That's the old forum, this is the new one:

www.xcore.com

A 32-bit MAC with a 32-bit result, as is available on some ARM chips, would be a nice compromise.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

Post Edited (Leon) : 1/11/2010 5:41:50 AM GMT

heater · 2010-01-11 05:51

Except they have now messed things up by introducing a second forum, www.xcore.com/forum/, which gets most of the attention now. They are a long way short of the levels of action we see here.

It's telling that over Christmas and New Year this forum was humming with activity but over there it was, well, quiet.

Now the thing is that when we had those long threads about new Prop II features I would sometimes suggest features that were included in the Inmos Transputer of the 1980's, which was a very "out of the box" design for its time. The Transputer is dead and gone but now, decades later, parallelism in embedded systems is back in the form of the Propeller and the XMOS. That's why that name comes up here from time to time.

I still think there are some ideas Prop II could adapt from the Transputer and now XMOS, which continues in its footsteps.

I also believe Chip can teach XMOS a thing or two about making a usable device :)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

ErNa · 2010-01-11 06:42

Yes, there should be ! and ? That would really be great!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1181572927203_421963583_5511&partName=htmltext
Hello Rest Of The World
Hello Debris
Install a propeller and blow them away

heater · 2010-01-11 07:13

ErNa: "there should be ! and ? That would really be great!"

Exactly.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

heater · 2010-01-11 07:39

For those who wonder what ErNa is talking about, it's to do with communicating between parallel processes. From the Transputer Occam language book:

channel ! expression

sends the value of expression over a channel to another process.

channel ? variable

receives a value from the channel and stores it in the variable.

It would be damn nice to have such high speed communication between COGs sometimes. No going through HUB RAM.

I'm sure the Prop's idea of locks could be extended into channels. Two COGs wanting to communicate would "check out" a channel, rather like they do locks now, and then swap longs through it.
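Purely as a software analogy for the `!` and `?` operations, here is a toy Python sketch using threads. The `Channel` class and its method names are invented for the example, and a one-slot queue is only an approximation: a real Occam channel is an unbuffered rendezvous, and a hardware channel between COGs would look nothing like this.

```python
import queue
import threading


class Channel:
    """Toy stand-in for an Occam channel (one-slot buffer, not a true rendezvous)."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)

    def send(self, value):
        """Rough equivalent of: channel ! expression"""
        self._slot.put(value)  # blocks while the slot is still full

    def receive(self):
        """Rough equivalent of: channel ? variable"""
        return self._slot.get()  # blocks until a value arrives


def producer(ch, values):
    for v in values:
        ch.send(v)


def run_demo():
    ch = Channel()
    values = [1, 2, 3]
    worker = threading.Thread(target=producer, args=(ch, values))
    worker.start()
    received = [ch.receive() for _ in values]
    worker.join()
    return received
```

The point of the analogy is only that neither side touches shared memory directly; each blocks until the other is ready to exchange a value.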

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

BradC · 2010-01-11 08:27

Well, if you were going to do a 32 bit MAC (32x32 with a 64 bit result) then you really need between 70 and 80 bits in the accumulator to accommodate the intermediate overflows. On the plus side, you only need to be able to add or subtract to/from the accumulator, so no really funky behaviour required there.

Look, don't get me wrong, I'd wet my pants over a 32x32 MAC. Even a 24x24 (which is what you get on the coldfire with a 56 bit accumulator) would make me really, really excited and completely remove the need for me to be even looking at outboard DSP chips. Having said that, I'm managing with 16x16 with the dsPIC at the moment (although the low resolution does make you jump through hoops with regard to noise shaping).

The real beauty of the PIC MAC is it can do a MAC, 2 pre-fetches and a saturated writeback in every instruction cycle. That makes filtering pretty damn efficient, even if you do have to jump through hoops to save intermediate results for noise shaping.

If the XMOS does a 32x32 multiply and only has a 64 bit accumulator I'd reckon that was a pretty poor design decision.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

heater · 2010-01-11 09:15

BradC: You are just greedy :)

Speaking as an old 8 biter, 32 bits seems huge and 64 bits unimaginably so.

Now maybe I'm missing a point but when dealing with any real world quantities, like audio samples, 24 bits is the limit of sensible resolution nowadays. So isn't multiplying 24 by 24 into 48 and accumulating into 64 bits enough?

"They" may have made a poor design choice, on the other hand "They" seem to be doing quite well at handling multiple streams of 24bit audio, filtering etc etc.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

BradC · 2010-01-11 09:27

When doing digital filtering, more resolution == less noise.

24x24 into 64 would be great. If you are using 16x16 into 32 you may as well not actually have a MAC instruction and just do multiply and multi-word addition separately.

16x16 into 40 would do the job.

It comes down to the resolution of the filtering you are doing as to how close the poles are and therefore the size of the coefficients. 16 bit is really Q15 + sign, so your coefficients are only 15 bits of magnitude. As an example, I'm doing some work at ~78 kHz. Using a standard Butterworth LPF biquad I run out of resolution below about 600 Hz when I set my coefficients up. Again, because of the low resolution the coefficients need to be < 1. Where I have a coefficient > 1 I need to use multiple MAC instructions to accommodate the extra gain.
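One common way to handle a coefficient larger than 1 in Q15 is to store half of it and run the MAC twice. The Python sketch below shows that idea; the coefficient 1.9, the sample value, and the helper names are all made up for illustration, and this is not necessarily the exact scheme used on the dsPIC.

```python
Q15_ONE = 1 << 15


def to_q15(x):
    """Quantize a real coefficient in [-1, 1) to a signed 16-bit Q15 integer."""
    q = round(x * Q15_ONE)
    if not -Q15_ONE <= q < Q15_ONE:
        raise OverflowError(f"{x} does not fit in Q15")
    return q


def mac(acc, coeff_q15, sample):
    """Multiply-accumulate: add coeff * sample to the running accumulator."""
    return acc + coeff_q15 * sample


coeff = 1.9                  # illustrative coefficient, too large for Q15
half = to_q15(coeff / 2)     # 0.95 does fit
sample = 12345               # illustrative 16-bit sample

acc = 0
acc = mac(acc, half, sample)  # first MAC contributes 0.95 * sample
acc = mac(acc, half, sample)  # second MAC brings the total to ~1.9 * sample
approx = acc / Q15_ONE
print(approx, coeff * sample)
```

The cost is an extra MAC per oversized coefficient, and the halved coefficient still carries only 15 bits of magnitude, so the quantization error of the coefficient is doubled along with its value.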

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

heater · 2010-01-11 09:57

I'm totally ignorant of DSP but "More resolution == less noise" is clear enough.

Given that 16 bits is a signal to noise ratio of 98 dB and that we are working with 24 bits accumulating into 64, I would have guessed that any noise introduced by the filtering calcs was nigh on invisible. At least in audio work.
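The 98 dB figure matches the usual rule of thumb for an ideal N-bit quantizer driven by a full-scale sine wave, SNR = 6.02 N + 1.76 dB. A tiny Python sketch of that formula (the function name is just for the example):

```python
def ideal_snr_db(bits):
    """Ideal quantization SNR in dB for a full-scale sine: 6.02 * N + 1.76."""
    return 6.02 * bits + 1.76


snr_16 = ideal_snr_db(16)  # about 98 dB
snr_24 = ideal_snr_db(24)  # about 146 dB
print(round(snr_16, 2), round(snr_24, 2))
```

This describes the word length of the samples only; it says nothing about noise added by rounding inside a filter, which is the part under discussion here.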

One day I might want to pick your brains re: a 3-way digital crossover algorithm for active loudspeakers I have had running on a PC for some time. I had hoped it would be doable on the Prop so as to make a small standalone unit. As I say I know nothing of DSP, I just wrote my best simulation of a famous opamp based design and it works very well. It's a surprisingly simple and short piece of code. Just need to get away from floating point to fixed point :)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Leon · 2010-01-11 10:48

Unlike the dsPIC, which has a DSP engine with lots of good stuff like dual 40-bit accumulators and X and Y memories, as well as a conventional CPU, the Propeller, ARM and XMOS chips are general purpose devices with a few DSP instructions. They aren't primarily intended for DSP.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM

evanh · 2010-01-12 06:52

Brad: It seems to me that 16x16 to 32 is fine. I've not done any fixed point filtering myself but wouldn't it be just a case of shifting right by one more on each additional add to not need more bits?

BradC · 2010-01-12 07:44

evanh said...
Brad: It seems to me that 16x16 to 32 is fine. I've not done any fixed point filtering myself but wouldn't it be just a case of shifting right by one more on each additional add to not need more bits?

And then you lose those extra bits off to the right causing quantization errors and more noise.
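A toy Python comparison of the two approaches, using deliberately tiny made-up numbers so the lost bits are easy to see. For simplicity it applies the same shift to every product rather than one more per add.

```python
def accumulate_full(products, shift=2):
    """Sum the products at full precision, then scale once at the end."""
    return sum(products) >> shift


def accumulate_preshifted(products, shift=2):
    """Shift each product right before adding, so the sum needs fewer bits."""
    return sum(p >> shift for p in products)


products = [7, 7, 7, 7]  # illustrative values only

full = accumulate_full(products)              # (7 + 7 + 7 + 7) >> 2
preshifted = accumulate_preshifted(products)  # (7 >> 2) four times over
print(full, preshifted)
```

Each early shift throws away a fraction of a bit, and those fractions never get the chance to add up to something that would have survived the final scaling.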

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

evanh · 2010-01-12 08:06

Hmmm, it's not like the full 32 bits is needed for the final result. I'm thinking the problem is more about the extra instructions that were needed which slowed the loop down. And on that subject I'd speculate that there are special variations in some DSPs that have optional/conditional free shifting in many instructions.

BradC · 2010-01-12 08:32

16 bit math for filtering is noisy, there is no way around it. The coefficients are barely big enough as it is. To get the best results from this, you need to preserve every bit, and that means taking the lower 16 bits and feeding them back into the next round of the filter. It means having an accumulator big enough that you can accumulate several full range multiplies without saturating or wrapping around. This is why most DSPs have an accumulator that is much bigger than twice their word size. If you are careful with your bits, you can achieve good results with 16x16.
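One reading of "feeding the lower 16 bits back" is first-order error feedback: the bits dropped when the accumulator is reduced to the output word are added into the next sample's accumulator. The Python sketch below shows that reading; the function names and the input values are made up, and the actual scheme used in the filter described above may well differ.

```python
def requantize_truncate(acc_values, shift=16):
    """Reduce each accumulator value to the output word by dropping the low bits."""
    return [acc >> shift for acc in acc_values]


def requantize_with_feedback(acc_values, shift=16):
    """Same reduction, but carry the dropped low bits into the next sample."""
    mask = (1 << shift) - 1
    residue = 0
    out = []
    for acc in acc_values:
        acc += residue           # add back the bits dropped last time
        out.append(acc >> shift)
        residue = acc & mask     # low bits that did not fit in the output
    return out


# Illustrative input: a constant 1.5 in units of 2**16, four samples long.
acc_values = [0x18000] * 4

plain = requantize_truncate(acc_values)
shaped = requantize_with_feedback(acc_values)
print(plain, shaped)
```

Plain truncation outputs 1 every time and loses the half permanently, while the feedback version alternates between 1 and 2 so the average comes out at 1.5.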

I wrote quite an extensive post from here down, but deleted it as the original question stands. No point in any more speculation.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

evanh · 2010-01-12 09:11

Ok, makes sense, use the 16 bit quantities to their fullest.

The problem with the opening question is that every one of the 512 registers in a Cog is considered an accumulator in the general microprocessor sense. And 32 bits is their size. You are asking a big ask to change that.

That said, I guess to have the best performance, Chip will possibly be adding a special non-addressable register just for the MAC's accumulation. In which case there is nothing stopping it even being 64 bits. :)

BradC · 2010-01-12 09:29

evanh said...

The problem with the opening question is that every one of the 512 registers in a Cog is considered an accumulator in the general microprocessor sense. And 32 bits is their size. You are asking a big ask to change that.

Not really. If you do what the dsPIC does with its standard multiply results, you use any destination of a multiply as 2 paired registers. You must specify an even register for the destination, and the destination+1 automatically becomes its partner. Now, you only need logic to add or subtract to the destination. Destination +1 is simply the carried result of the destination arithmetic. Sure 64 bits is overkill, where 40 would do, but doing it this way just uses pairs of registers. I'd have thought the logic to be a bit simpler.
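A minimal Python sketch of the paired-register idea, modelling registers as a list of 32-bit values. The function names are invented, and the sign-extension step for negative products is an assumption added to make signed accumulation work; it is not a description of the dsPIC or of anything proposed for the Propeller II.

```python
MASK32 = 0xFFFFFFFF


def mac_paired(regs, dest, a, b):
    """Add the signed product a*b into the 64-bit value in regs[dest] (low)
    and regs[dest + 1] (high)."""
    if dest % 2:
        raise ValueError("destination must be an even register")
    product = a * b                            # signed 16x16 fits in 32 bits
    total = regs[dest] + (product & MASK32)    # 32-bit add into the low word
    carry = total >> 32
    regs[dest] = total & MASK32
    sign_ext = MASK32 if product < 0 else 0    # assumed: extend sign into high word
    regs[dest + 1] = (regs[dest + 1] + sign_ext + carry) & MASK32


def read_paired(regs, dest):
    """Read the register pair back as one signed 64-bit value."""
    value = (regs[dest + 1] << 32) | regs[dest]
    return value - (1 << 64) if value >> 63 else value


regs = [0] * 4
for _ in range(5):
    mac_paired(regs, 2, 32767, 32767)   # five full-scale products
total_positive = read_paired(regs, 2)

regs2 = [0] * 2
mac_paired(regs2, 0, -1, 1)             # accumulator goes to -1
mac_paired(regs2, 0, 2, 1)              # then back up to +1
total_mixed = read_paired(regs2, 0)
```

Only the low word needs a full adder fed by the multiplier; the high word just absorbs the carry and the sign.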

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

evanh · 2010-01-12 09:32

Definitely would lose speed that way as that needs to fetch 4 input registers and store 2 result registers for each MAC.

BradC · 2010-01-12 09:33

Good point

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

evanh · 2010-01-12 09:49

For saturation, the special move instruction needed to unload the MAC's accumulator could include saturation in it. I'd think a couple of priority comparators would suffice - That way both min and max can be encoded in the one instruction along with the destination register. Also eliminates the need for two summing comparators.

evanh · 2010-01-12 10:21

Err, I'll just eat my words there a little bit. No such thing as a priority comparator ... Thinking about it, what I wanted is really a combo of a binary decoder and a summing comparator per bound that selects one of three results - one of the related saturation powers or the truncated accumulator.

The max value in the move instruction could also be used for the final shifter/truncator.

evanh · 2010-01-12 11:25

Oh, Smile, a MAC ain't that simple to do. There is need for two tables to be traversed, with one or both being in Hub ram. That's a lot of register fiddling between each MAC instruction. May not be such an issue if the MAC instruction itself is a few clocks long ... I think I'll stop speculating now. O_o
