Byte Space Operators

rjo__ · 2015-10-22 21:26

I hate to do it, but I think I must.

Here is the real world issue... I'm going to be subtracting one unsigned byte from another and storing the signed 9 bit result out to a large memory space... Potentially, for some use cases, I might want to do this millions and millions and millions of times. In assembly, on a 68040, this could take up to twenty four hours to complete... I am older and wiser now and can do it better and I can scale back the processing by picking problems that can be more quickly solved, but time is always an issue.

It would be nice to have a single instruction, that would take a long... subtract one byte from another and then store the signed result in the unused word of the same long. Since we are talking about a very short loop, the gain in throughput could be substantial.
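For reference, a minimal C sketch of what that single instruction would compute. It assumes the two operands sit in bytes 0 and 1 of the long, that the result is byte 0 minus byte 1, and that it lands in the upper word; the function names are hypothetical, not part of any proposed instruction set.

```c
#include <stdint.h>

/* Hypothetical model: subtract byte 1 from byte 0 of a long and store
   the signed difference in the upper 16 bits of the same long. */
uint32_t sub_bytes_to_word(uint32_t packed)
{
    int32_t a = (int32_t)(packed & 0xFFu);         /* byte 0 (assumed minuend) */
    int32_t b = (int32_t)((packed >> 8) & 0xFFu);  /* byte 1 (assumed subtrahend) */
    int32_t diff = a - b;                          /* always within -255..255 */
    uint32_t word = (uint32_t)diff & 0xFFFFu;      /* 16-bit two's complement */
    return (packed & 0x0000FFFFu) | (word << 16);  /* keep the bytes, replace the word */
}

/* Read the signed result back out of the upper word. */
int32_t packed_result(uint32_t packed)
{
    int32_t word = (int32_t)(packed >> 16);
    return (word & 0x8000) ? word - 65536 : word;
}
```

The difference of two unsigned bytes needs nine bits, so a 16-bit word holds it with room to spare.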

Would this be useful for anyone else?

Rich

rjo__ · 2015-10-22 21:42

I should add that I think Chip has said that his intention is to release the Verilog sources... which will allow me to tinker with this. So, if you guys see no other compelling arguments, it can wait. Does seem like a hole in the instruction set though:)

cgracey · 2015-10-23 00:47

If this could be contained to 8-bit fields, it would be possible to have an instruction that performs 4 subtractions across 4 byte sets in two longs.

SUBBYTS D,S

D[31:24] - S[31:24] --> D[31:24]
D[23:16] - S[23:16] --> D[23:16]
D[15:08] - S[15:08] --> D[15:08]
D[07:00] - S[07:00] --> D[07:00]
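
A rough C model of the lane-by-lane behaviour described above. It assumes each byte result simply wraps modulo 256 (no saturation, no borrow into the neighbouring byte), which the post does not spell out; the name subbyts_model is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical software model of SUBBYTS D,S: four independent
   8-bit subtractions, each wrapping modulo 256 (an assumption). */
uint32_t subbyts_model(uint32_t d, uint32_t s)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = lane * 8;
        uint32_t a = (d >> shift) & 0xFFu;   /* D byte for this lane */
        uint32_t b = (s >> shift) & 0xFFu;   /* S byte for this lane */
        uint32_t diff = (a - b) & 0xFFu;     /* borrow out of the lane is dropped */
        result |= diff << shift;
    }
    return result;
}
```

For example, under that assumption subbyts_model(0x10203040, 0x01020304) gives 0x0F1E2D3C.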

evanh · 2015-10-23 00:59

So you've got something like the following?

struct  {
    uint8_t  input1;    /* first unsigned byte */
    uint8_t  input2;    /* second unsigned byte */
    int16_t  result;    /* signed difference */
}  BigArray[99999999];

and you're wanting a single instruction, right? The first thing that stands out is the need to load the data into Cog registers before the subtraction can take place. I've generally considered the Prop a load-and-store architecture, and I see this as a good example of why.

But maybe you are talking about data that is already packed in CogRAM? ... doing chunks at a time ...

rjo__ · 2015-10-23 01:06

Chip,

I love talking to experts... yes, that is way more better... as Hammer would say:)

Rich

rjo__ · 2015-10-23 01:08

And the fact that hub longs don't have to be aligned makes everything so nice...

rjo__ · 2015-10-23 01:31

My fear is that we are stretching those clocks out and one of these times, things are going to break in such a way that it will be hard to conjure. So... if it were me, I would make sure everything else is perfect before making a change like this.

Thanks,

Rich

jmg · 2015-10-23 01:36

rjo__ wrote: »

.. and storing the signed 9 bit result out to a large memory space... Potentially, for some use cases, I might want to do this millions and millions and millions of times.

How large? You mention "millions and millions and millions of times", which goes beyond P2 memory and onto external storage.

rjo__ wrote: »

It would be nice to have a single instruction, that would take a long... subtract one byte from another and then store the signed result in the unused word of the same long.

That sounds a very niche design for an opcode.

rjo__ wrote: »

Since we are talking about a very short loop, the gain in throughput could be substantial.

Given you mentioned large, and millions and millions, you likely have other, more serious bottlenecks.

rjo__ wrote: »

Would this be useful for anyone else?

Hard to imagine why it would.
What real world problem are you solving with this ?

ozpropdev · 2015-10-23 01:38

You mean something like this?

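	getbyte	temp1,myreg,#0		' temp1 = byte 0 of myreg
	getbyte	temp2,myreg,#1		' temp2 = byte 1 of myreg
	sub	temp1,temp2		' temp1 = byte 0 - byte 1
	setword	myreg,temp1,#1		' store the result in the upper word of myreg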

rjo__ · 2015-10-23 01:51

Brian

Yes... but I think Chip has already come up with something better.

Evanh

Machine vision in general.

Camera calibration.

Camera tracking ... Apple does it very well. Intel does it very well.
Microsoft doesn't do it very well and would probably be happy to get it from the public domain.

Stereo-analysis with or without structured light, etc.

Does your bot know exactly where he or she is at all times?... we could help that:)

There is a list of medical apps as well. I might go there again... I might not.

Lotsostuff

jmg · 2015-10-23 01:58

rjo__ wrote: »

Yes... but I think Chip has already come up with something better.

? Chip's example does not do this
'store the signed 9-bit result in the unused word of the same long'

His example is a 32b subtract, with carry chain sliced so it behaves as 4 x 8 bit subtracts to 4 x 8 bit results.
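
To illustrate the "sliced carry chain" idea in software, here is a sketch using the well-known SWAR (SIMD within a register) trick for byte-wise subtraction. This is not Chip's hardware design, just an equivalent computation on an ordinary 32-bit ALU; the function name is hypothetical.

```c
#include <stdint.h>

/* Byte-wise subtraction in one 32-bit word without borrows crossing
   byte boundaries (classic SWAR formulation). */
uint32_t sub_bytes_swar(uint32_t x, uint32_t y)
{
    const uint32_t H = 0x80808080u;      /* top bit of every byte lane */
    uint32_t low = (x | H) - (y & ~H);   /* forced top bits absorb any borrow inside the lane */
    return low ^ ((x ^ ~y) & H);         /* patch each lane's top bit back to its true value */
}
```

A plain 32-bit subtract of 0x01010101 from 0 gives 0xFEFEFEFF because the borrow ripples across bytes; the sliced version gives 0xFFFFFFFF, i.e. four independent results of 0xFF.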

Seairth · 2015-10-23 01:59

ozpropdev wrote: »
You mean something like this?
	getbyte	temp1,myreg,#0
	getbyte	temp2,myreg,#1
	sub	temp1,temp2
	setword	myreg,temp1,#1

That's what I thought he was asking for. That takes 16 clock cycles. Assuming another 104 clock cycles of overhead, then you would be able to do one million of these per second (@120MHz), or 8.64E10 per day.
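
The arithmetic behind those figures, as a small C sketch. The 16 and 104 cycle counts and the 120 MHz clock are the numbers from the post; the function names are made up, and cycles_per_op is assumed to be non-zero.

```c
#include <stdint.h>

/* Operations per second for a given clock and cycle cost per operation.
   Assumes cycles_per_op is non-zero. */
uint64_t ops_per_second(uint64_t clock_hz, uint64_t cycles_per_op)
{
    return clock_hz / cycles_per_op;
}

/* Operations per day: 86,400 seconds in a day. */
uint64_t ops_per_day(uint64_t clock_hz, uint64_t cycles_per_op)
{
    return ops_per_second(clock_hz, cycles_per_op) * 86400u;
}
```

With 16 + 104 = 120 cycles per operation at 120,000,000 Hz, this works out to 1,000,000 per second and 86,400,000,000 (8.64E10) per day.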

Seairth · 2015-10-23 02:02

I just realized that this is starting to sound like SIMD instructions. While that might be nice to have, I think that's a can of worms we really don't want to be opening this late in the game. If we do, I foresee another non-trivial delay in the P2 release...

rjo__ · 2015-10-23 02:57

JMG...

I agree, but he is doing a whole lot more than I was asking for.. to actually use it the way he is suggesting would add back a little overhead.

Seairth,

I can live with 8.6E10 subtractions per day:)

Of course I wasn't spending the whole day just subtracting numbers. At that time, I wasn't parallelizing anything. I tried Occam, but couldn't get it where I needed it to go. Now, I am trying to parallelize everything. The P2 is just so nice for this... it is just insane.

I agree with you on the can of worms issue. I would put this last on Chip's order of priorities and wouldn't think about it again until everything else is ready to go...
then if there is time... throw it in. It can't hurt, but I can certainly live happily without it.

Rich

evanh · 2015-10-23 03:14

Yes, Chip has totally given a simple SIMD solution.

rjo__ · 2015-10-23 03:15

He has done it before... he'll do it again. It sells chips.

rjo__ · 2015-10-23 03:17

AND it is fun!!!

rjo__ · 2015-10-23 03:19

I want to see someone honestly benchmark this beast. They didn't do it for the P1... will they have the courage to do it for the P2?

evanh · 2015-10-23 03:24

Just to echo Seairth's concern, SIMD does sound a bit "hot", so to speak, and a bit late in the game.

rjo__ · 2015-10-23 03:59

No doubt about it. Concerns all the way around. We all agree.
Lots of ifs.

rjo__ · 2015-10-23 04:02

My grandson always wants a chocolate milkshake... just because he wants the cherry on the top.
I don't care because I get the shake after he gets the cherry. In my mind it is a milkshake... good with or without the cherry.

evanh · 2015-10-23 04:03

Ramon · 2015-10-23 12:34

Chip's SIMD instruction result is unsigned, but you asked for a signed result.
A signed result could be easily implemented if memory registers had 36 bits.
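
A small C sketch of why 8 bits per lane are not enough for a signed result: the exact difference of two unsigned bytes spans -255..255, which needs 9 bits, and four 9-bit lanes add up to the 36 bits mentioned above. The function names are hypothetical.

```c
#include <stdint.h>

/* Exact signed difference of two unsigned bytes: always in -255..255. */
int exact_diff(uint8_t a, uint8_t b)
{
    return (int)a - (int)b;
}

/* What is left when only the low 8 bits of the difference are kept. */
uint8_t wrapped_diff(uint8_t a, uint8_t b)
{
    return (uint8_t)(a - b);
}
```

Both 5 - 10 (exactly -5) and 251 - 0 (exactly 251) wrap to the same byte, 0xFB, so the sign cannot be recovered from 8 bits alone.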

I think this instruction could be a perfect homework for P1 verilog.

Also, I have recently been using NIOS2/QSYS and find it quite flexible. I am starting to like it! It is very easy to create a complete system from just a single FPGA board with embedded JTAG (using the JTAG UART for serial input/output).

Why am I talking about this?

Because NIOS allows you to create custom instructions with 3 operands (two sources and a destination). I have also seen that Qsys lets you create custom 'ip'. This is just a way to create a reusable block of Verilog and use it together with their AVALON BUS. It can be connected to a NIOS cpu or used standalone (I really like this feature!!). Both P1V and NIOS II can be modified to implement that custom instruction.

A P1V cog 'ip' for QSYS used together with NIOS could be an incredibly powerful system.

Think twice about this. NIOS already has a lot of peripherals (SDRAM, DDR, Ethernet ...), a gcc compiler, and Linux support. Imagine that the NIOS avalon bus is used to fake a HUB ram for each P1v cog implemented. Some of you already have some FPGA boards, so using NIOS is almost for free.

Ramon · 2015-10-23 14:39

I have just noticed that it is a SUB (not ADD). So no overflow bit required.
