Byte Space Operators
rjo__
Posts: 2,114
in Propeller 2
I hate to do it, but I think I must.
Here is the real-world issue... I'm going to be subtracting one unsigned byte from another and storing the signed 9-bit result out to a large memory space... Potentially, for some use cases, I might want to do this millions and millions and millions of times. In assembly, on a 68040, this could take up to twenty-four hours to complete... I am older and wiser now and can do it better, and I can scale back the processing by picking problems that can be more quickly solved, but time is always an issue.
It would be nice to have a single instruction that would take a long... subtract one byte from another and then store the signed result in the unused word of the same long. Since we are talking about a very short loop, the gain in throughput could be substantial.
Would this be useful for anyone else?
Rich
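For readers following along, here is a rough Python model of what the requested operation might do. The bit layout is an assumption, since the post does not say which bytes are involved: the two unsigned bytes are taken from bits 7:0 and 15:8 of the long, and the signed difference is sign-extended into the upper word (bits 31:16). The function names are hypothetical.

```python
def sub_bytes_into_high_word(value: int) -> int:
    """Model of the requested op on a 32-bit long (layout is assumed)."""
    a = value & 0xFF           # assumed minuend: bits 7:0
    b = (value >> 8) & 0xFF    # assumed subtrahend: bits 15:8
    diff = a - b               # range -255..255, so 9 signed bits suffice
    high = diff & 0xFFFF       # two's complement, sign-extended to 16 bits
    return (high << 16) | (value & 0xFFFF)


def signed_high_word(value: int) -> int:
    """Read the upper word back as a signed 16-bit integer."""
    high = (value >> 16) & 0xFFFF
    return high - 0x10000 if high & 0x8000 else high


packed = sub_bytes_into_high_word(0x00000A03)  # a = 3, b = 10
print(hex(packed), signed_high_word(packed))
```

The low word is left untouched, so the original bytes and their difference travel together in one long.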
Comments
SUBBYTS D,S
D[31:24] - S[31:24] --> D[31:24]
D[23:16] - S[23:16] --> D[23:16]
D[15:08] - S[15:08] --> D[15:08]
D[07:00] - S[07:00] --> D[07:00]
But maybe you are talking about data that is already packed in CogRAM? ... doing chunks at a time ...
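A minimal Python sketch of the four-lane subtract listed above, for anyone who wants to experiment with it in software. It assumes each 8-bit lane wraps around modulo 256 with no borrow passed between lanes, which matches how a later reply in this thread describes it; the listing itself does not say whether lanes wrap or saturate. The function name simply mirrors the mnemonic.

```python
def subbyts(d: int, s: int) -> int:
    """Subtract the four bytes of s from the four bytes of d, lane by lane."""
    result = 0
    for shift in (0, 8, 16, 24):
        d_byte = (d >> shift) & 0xFF
        s_byte = (s >> shift) & 0xFF
        # Mask to 8 bits so a negative difference wraps instead of
        # borrowing from the neighbouring lane (assumed behaviour).
        result |= ((d_byte - s_byte) & 0xFF) << shift
    return result


print(hex(subbyts(0x10203040, 0x01020304)))
```

Note that the sign of each difference is lost here: 0x00 - 0x01 and 0xFF - 0x00 both give 0xFF in the lane.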
I love talking to experts... yes, that is way more better... as Hammer would say:)
Rich
Thanks,
Rich
That sounds a very niche design for an opcode.
Given that you mentioned "large" and "millions and millions", you likely have other, more serious bottlenecks.
Hard to imagine why it would?
What real-world problem are you solving with this?
Yes... but I think Chip has already come up with something better.
Evanh,
Machine vision in general.
Camera calibration.
Camera tracking ... Apple does it very well. Intel does it very well.
Microsoft doesn't do it very well and would probably be happy to get it from the public domain.
Stereo-analysis with or without structured light, etc.
Does your bot know exactly where he or she is at all times?... we could help with that :)
There is a list of medical apps as well. I might go there again... I might not.
Lotsostuff
? Chip's example does not do this:
'store the signed 9-bit result in the unused word of the same long'
His example is a 32-bit subtract, with the carry chain sliced so it behaves as 4 x 8-bit subtracts giving 4 x 8-bit results.
That's what I thought he was asking for. That takes 16 clock cycles. Assuming another 104 clock cycles of overhead, then you would be able to do one million of these per second (@120MHz), or 8.64E10 per day.
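The arithmetic behind that estimate can be checked with a few lines of Python. The 16-clock instruction cost, the 104-clock overhead and the 120 MHz clock are the figures quoted above; the variable names are only for illustration.

```python
CLOCK_HZ = 120_000_000        # 120 MHz system clock
INSTRUCTION_CLOCKS = 16       # quoted cost of the subtract itself
OVERHEAD_CLOCKS = 104         # assumed loop and memory overhead
SECONDS_PER_DAY = 24 * 60 * 60

clocks_per_op = INSTRUCTION_CLOCKS + OVERHEAD_CLOCKS   # 120 clocks
ops_per_second = CLOCK_HZ // clocks_per_op             # 1,000,000
ops_per_day = ops_per_second * SECONDS_PER_DAY         # 8.64E10

print(ops_per_second, ops_per_day)
```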
I agree, but he is doing a whole lot more than I was asking for... to actually use it the way he is suggesting would add back a little overhead.
Seairth,
I can live with 8.6E10 subtractions per day:)
Of course I wasn't spending the whole day just subtracting numbers. At that time, I wasn't parallelizing anything. I tried Occam, but couldn't get it where I needed it to go. Now, I am trying to parallelize everything. The P2 is just so nice for this... it is just insane.
I agree with you on the can of worms issue. I would put this last on Chip's order of priorities and wouldn't think about it again until everything else is ready to go...
then if there is time... throw it in. It can't hurt, but I can certainly live happily without it.
Rich
Lots of ifs.
I don't care because I get the shake after he gets the cherry. In my mind it is a milkshake... good with or without the cherry.
A signed result could be easily implemented if memory registers had 36 bits.
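One way to read the 36-bit remark is that four 9-bit signed lanes would fit exactly in a 36-bit register. The sketch below models that reading in Python; the lane layout (lane n in bits 9n+8 down to 9n) and both function names are assumptions made for illustration, not anything specified in the thread.

```python
def sub_bytes_9bit(d: int, s: int) -> int:
    """Return a 36-bit value holding four 9-bit two's-complement differences."""
    packed = 0
    for lane in range(4):
        d_byte = (d >> (8 * lane)) & 0xFF
        s_byte = (s >> (8 * lane)) & 0xFF
        # -255..255 always fits in 9 signed bits, so no information is lost.
        packed |= ((d_byte - s_byte) & 0x1FF) << (9 * lane)
    return packed


def unpack_9bit(packed: int) -> list:
    """Split a 36-bit value back into four signed differences."""
    lanes = []
    for lane in range(4):
        raw = (packed >> (9 * lane)) & 0x1FF
        lanes.append(raw - 0x200 if raw & 0x100 else raw)
    return lanes


print(unpack_9bit(sub_bytes_9bit(0x00FF0A03, 0xFF000A04)))
```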
I think this instruction could be a perfect homework exercise for P1 Verilog.
Also, I have recently been using NIOS2/QSYS and find it quite flexible. I am starting to like it! It is very easy to create a complete system from just a single FPGA board with embedded JTAG (using the JTAG UART for serial input/output).
Why am I talking about this?
Because NIOS allows you to create custom instructions with 3 operands (two sources and a destination). I have also seen that Qsys lets you create custom 'ip'. This is just a way to create a reusable block of Verilog and use it together with their AVALON BUS. It can be connected to a NIOS CPU or used standalone (I really like this feature!!). Both P1V and NIOS II can be modified to implement that custom instruction.
A P1V cog 'ip' for QSYS used together with NIOS could be an incredibly powerful system.
Think twice about this. NIOS already has a lot of peripherals (SDRAM, DDR, Ethernet ...), a gcc compiler, and Linux support. Imagine that the NIOS Avalon bus is used to fake a HUB RAM for each P1V cog implemented. Some of you already have FPGA boards, so using NIOS is almost free.