Propeller II update - BLOG

Baggers · 2013-12-22 06:11

Yes, it would then look up something else, or be a table of instructions, i.e., for a recording of a replay, where it will have bits for movement directions or anything, but I guess the GETNIB would work fine too

isn't there one for bytes too iirc. as that would be helpful also.
for getting a tile from a map, to then draw to screen, or other such use, not just tied to gaming.

edit: MOVF or something like that, can't remember off hand.

Cluso99 · 2013-12-22 06:31

Yes, there are GETBYTE, GETWORD, and some others too.
The GETNIB would save having to ROR the source, but you would need to AND the dest to mask the upper bits off. Though, if it were a table, that could be unnecessary.

ozpropdev · 2013-12-22 15:56

Baggers wrote: »

Yes, it would then look up something else, or be a table of instructions, i.e., for a recording of a replay, where it will have bits for movement directions or anything, but I guess the GETNIB would work fine too isn't there one for bytes too iirc. as that would be helpful also.
for getting a tile from a map, to then draw to screen, or other such use, not just tied to gaming.

edit: MOVF or something like that, can't remember off hand.

The MOVF/SETF (Byte mover) is now obsolete. It has been replaced with the following instructions:
GETNIB SETNIB GETBYTE SETBYTE GETWORD SETWORD ROLNIB SWBYTES STBYTES ESWAP4 ESWAP8
ROLBYTE ROLWORD. Some very useful instructions in that group.

Cluso99 wrote: »

Yes, there are GETBYTE, GETWORD, and some others too.
The GETNIB would save having to ROR the source, but you would need to AND the dest to mask the upper bits off. Though, if it were a table, that could be unnecessary.

GETNIB, GETBYTE, and GETWORD all zero the upper bits of the destination. No ANDing required.
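
To make the GET/SET distinction concrete, here is a rough Python model of the behaviour described above. The function names, the explicit field-index argument, and the 32-bit masking are assumptions made for illustration; this is not the P2 instruction encoding.

```python
MASK32 = 0xFFFFFFFF  # cog registers are 32 bits wide


def getnib(src, n):
    """Return nibble n of src (0 = least significant); upper 28 bits are zero."""
    return (src >> (4 * n)) & 0xF


def getbyte(src, n):
    """Return byte n of src (0 = least significant); upper 24 bits are zero."""
    return (src >> (8 * n)) & 0xFF


def setnib(dest, value, n):
    """Replace nibble n of dest with the low 4 bits of value; other bits are kept."""
    shift = 4 * n
    cleared = dest & ~(0xF << shift)
    return (cleared | ((value & 0xF) << shift)) & MASK32


word = 0x12345678
print(hex(getnib(word, 2)))        # 0x6
print(hex(getbyte(word, 1)))       # 0x56
print(hex(setnib(word, 0xA, 2)))   # 0x12345a78
```

The GET forms return a clean, zero-extended field, which is why no AND is needed afterwards. The SET form is the one that merges into an existing value.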

Cluso99 · 2013-12-22 17:21

ozpropdev wrote: »

The MOVF/SETF (Byte mover) is now obsolete. It has been replaced with the following instructions:
GETNIB SETNIB GETBYTE SETBYTE GETWORD SETWORD ROLNIB SWBYTES STBYTES ESWAP4 ESWAP8
ROLBYTE ROLWORD. Some very useful instructions in that group.

GETNIB, GETBYTE, and GETWORD all zero the upper bits of the destination. No ANDing required.

Of course. I was confusing it with SETNIB where you put it into an existing field. I am yet to understand how some of these work.

BTW Do you know what FRAC D/#,S/# does?

Baggers · 2013-12-22 17:54

Awesome set of instructions

shame I was too busy with work at the time they came out, and missed seeing that part of the thread, then again it does fill up quickly in here

Cluso99, what does FRAC D/#,S/# do?

rjo__ · 2013-12-22 19:23

Well, since there is no such thing as a stupid question around here, I think I'll take my shot at a stupid answer:)

I think the description in Prop2_Docs from 11/27 is a little thin, since it stops just when things are getting interesting.
Superficially, it looks like FRAC first applies a fixed scale to the fraction, but the resulting GETDIVQ and GETDIVR results are not really described. Are they the components of a binary fraction?

To start a 32-bit fraction calculation, use FRAC:

FRAC D/#,S/# - Begin calculating the unsigned fraction of D/# over S/#, where
D/# and S/# are unsigned 32-bit values and D/# is less than S/#.
Use GETDIVQ to get the result.

Examples:

FRAC #1,#2 yields $80000000 (1/2 of $1_00000000)
FRAC #1,#3 yields $55555555 (1/3 of $1_00000000)
FRAC #1,#4 yields $40000000 (1/4 of $1_00000000)
FRAC #15,#16 yields $F0000000
FRAC $80000000,$90000000 yields $E38E38E3
FRAC 31_250,80_000_000 yields $00199999

After starting the divider, you'll have 17 clock cycles to execute other code, if you
wish, before GETDIVQ/GETDIVR will return the quotient/remainder long(s) of the result:

GETDIVQ D - Get quotient result
GETDIVR D - Get remainder result

In single-task mode, GETDIVQ/GETDIVR will stall the pipeline until the result is ready.
In multi-task mode, GETDIVQ/GETDIVR will jump to themselves until the result is ready,
freeing clocks for other tasks.

Rich
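
The documented examples are consistent with FRAC computing the integer quotient of D times $1_00000000 over S. The Python sketch below reproduces them under that reading. The remainder returned alongside the quotient is an assumption about what GETDIVR would hold, since the quoted docs don't spell it out, and the name `frac` is just for illustration.

```python
def frac(d, s):
    """Model FRAC D,S as floor(D * 2**32 / S); returns (quotient, remainder).

    The quotient matches the examples in the quoted docs. Treating the
    remainder as the GETDIVR result is an assumption.
    """
    if not (0 <= d < s <= 0xFFFFFFFF):
        raise ValueError("FRAC expects unsigned 32-bit values with D < S")
    return divmod(d << 32, s)


examples = [
    (1, 2),
    (1, 3),
    (1, 4),
    (15, 16),
    (0x80000000, 0x90000000),
    (31_250, 80_000_000),
]

for d, s in examples:
    quotient, remainder = frac(d, s)
    print(f"FRAC {d},{s} -> ${quotient:08X}")
```

Read this way, the quotient is a 32-bit binary fraction: bit 31 is worth 1/2, bit 30 is worth 1/4, and so on.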

rjo__ · 2013-12-22 19:39

I'm trying to get my camera up and running on the P2… right now, everything I don't know how to do with the Prop2, I am contemplating doing with a Prop1 and then just frankenstein the hardware together.
Of course, the point is to get everything on the P2 as quickly as possible. How do I set up ctrb to output NCO on pin x(portA)? I understand how to assign frqb, it is the mode bits that I can't fathom.

Thanks,

Rich

cgracey · 2013-12-23 09:20

Cluso99 wrote: »
Ugh! I missed that part - just assumed the destination was rotated.

If he is wanting the source to be rotated, that is not possible as that would mean 2 writebacks to the cog in stage 4.

So this is what he was after...
Which means combining these instructions...
AND D,#mask    (where "2" uses mask=11, "4" uses mask=1111, "8" uses mask=11111111)
MOV tmp,S
AND tmp,#mask
OR D,tmp
followed by
ROR S,#n    (where n=2/4/8)

That's right. We only have one cog RAM register-write possibility per instruction, so the operation on S would have to become an operation on D in a separate instruction. As clock cycles go, this wouldn't be any worse - it would just take another instruction.

cgracey · 2013-12-23 09:26

rjo__ wrote: »

I'm trying to get my camera up and running on the P2… right now, everything I don't know how to do with the Prop2, I am contemplating doing with a Prop1 and then just frankenstein the hardware together.
Of course, the point is to get everything on the P2 as quickly as possible. How do I set up ctrb to output NCO on pin x(portA)? I understand how to assign frqb, it is the mode bits that I can't fathom.

Thanks,

Rich

Rich, sorry, those docs aren't done yet. You can find the LSB-justified bit pattern that does NCO in the latest docs file. Just put the pin number you want the output on into the D and/or I fields (via SETD/SETI). Pin numbers are seven bits, but it will be necessary to set relative bits +7 and +8 to either %01, %10, or %11 to get output. Remember to set the related DIR bit(s). You'll need to experiment (I'm not at my work computer now).

Baggers · 2013-12-23 11:25

Cheers Rich

rjo__ · 2013-12-23 17:55

Thanks Chip,

I thought it was probably described somewhere in a thread I had missed or that maybe we only had one counter on the FPGA.

Baggers…Happy Holidays!!!

Rich

koehler · 2014-01-07 01:43

Hi Chip,

Just a question regarding P2 power usage.
While there is some 'down time' waiting for the next shuttle run, I know you've been tweaking and adding features at a rip roaring rate.
However, I've not seen any mention of any sort of power-gating being implemented.
This may already be in place; however, if not, I think it would be very, very worthwhile to minimize power requirements as much as is reasonably possible.

cgracey · 2014-01-07 12:58

koehler wrote: »

Hi Chip,

Just a question regarding P2 power usage.
While there is some 'down time' waiting for the next shuttle run, I know you've been tweaking and adding features at a rip roaring rate.
However, I've not seen any mention of any sort of power-gating being implemented.
This may already be in place; however, if not, I think it would be very, very worthwhile to minimize power requirements as much as is reasonably possible.

The synthesis tools add clock gating automatically in cases where adequate setup times exist. In our last synthesis run, 89% of the flipflops were clock-gated, as opposed to always-clocked with enable inputs. This chip might dissipate a couple of watts at 1.8V.

KeithE · 2014-01-07 21:22

cgracey wrote: »

The synthesis tools add clock gating automatically in cases where adequate setup times exist. In our last synthesis run, 89% of the flipflops were clock-gated, as opposed to always-clocked with enable inputs. This chip might dissipate a couple of watts at 1.8V.

Here's a page with a picture - see Figure 1:

http://www.synopsys.com/Tools/Implementation/RTLSynthesis/Pages/PowerCompiler.aspx

MJB · 2014-01-09 14:10

@CHIP
In Automated Testing Systems we often have the need to measure isolated voltages.
This can easily be done with external Sigma/Delta encoders like http://www.analog.com/AD7401A
This part delivers a 10MHz bitstream which gives a 16bit ADC when filtered by a SINC3 filter
(Verilog code shown in datasheet). But then you need an FPGA to decode it, which is costly.
So I was wondering, if the P2 ADC HW might be used to take the EXTERNAL bitstream, instead of the internal one
to implement an isolated ADC.
IIRC the internal ADC works with 1st order S/D-encoder whereas this SINC3 gives MUCH better Signal to noise ratio.
Looking at the verilog code it might even be possible to use one COG @200MHz to do it in SW.
but some HW assist, if available, would be much better.
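
For readers unfamiliar with the SINC3 filter mentioned here, the following Python sketch shows the general technique: three cascaded integrators running at the bitstream rate, a decimator, then three differentiators running at the output rate. It is a generic illustration and not the datasheet's Verilog. The function name and the `ratio` parameter are assumptions, and a hardware version would use fixed-width wrapping accumulators rather than Python's unbounded integers.

```python
def sinc3_decimate(bits, ratio):
    """Third-order sinc (CIC) decimator for a 1-bit stream of 0/1 values.

    Returns one output sample per `ratio` input bits. The DC gain is
    ratio**3, so a steady stream of ones settles at that value.
    """
    i1 = i2 = i3 = 0   # integrator states, updated on every input bit
    d1 = d2 = d3 = 0   # previous inputs of the three differentiators
    out = []
    for count, bit in enumerate(bits, start=1):
        i1 += bit
        i2 += i1
        i3 += i2
        if count % ratio == 0:
            # Differentiators run only at the decimated rate.
            c1 = i3 - d1
            d1 = i3
            c2 = c1 - d2
            d2 = c1
            c3 = c2 - d3
            d3 = c2
            out.append(c3)
    return out


# A constant stream of ones with a decimation ratio of 4:
print(sinc3_decimate([1] * 20, 4))   # [20, 60, 64, 64, 64]
```

The first two outputs are filter start-up transients. The per-bit work is three additions, which is the part that would have to fit in a cog's instruction budget at the bitstream rate.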

cgracey · 2014-01-09 14:37

MJB wrote: »

@CHIP
In Automated Testing Systems we often have the need to measure isolated voltages.
This can easily be done with external Sigma/Delta encoders like http://www.analog.com/AD7401A
This part delivers a 10MHz bitstream which gives a 16bit ADC when filtered by a SINC3 filter
(Verilog code shown in datasheet). But then you need an FPGA to decode it, which is costly.
So I was wondering, if the P2 ADC HW might be used to take the EXTERNAL bitstream, instead of the internal one
to implement an isolated ADC.
IIRC the internal ADC works with 1st order S/D-encoder whereas this SINC3 gives MUCH better Signal to noise ratio.
Looking at the verilog code it might even be possible to use one COG @200MHz to do it in SW.
but some HW assist, if available, would be much better.

The CTRs have modes to sum up 1's in order to realize a 1st-order delta-sigma conversion. I don't know if there will be adequate bandwidth to do any 2nd-order conversions, unless you can use every Nth bit, or MSBs of short accumulations.
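
Summing up 1's for a first-order conversion amounts to counting the ones in each fixed-length window of the bitstream. A minimal Python sketch follows, with the function name and window length chosen purely for illustration:

```python
def count_ones_decimate(bits, window):
    """First-order decimation: count the ones in each full window of bits."""
    return [
        sum(bits[start:start + window])
        for start in range(0, len(bits) - window + 1, window)
    ]


# A bitstream with 75% ones density, decimated in windows of 16 bits:
stream = [1, 1, 0, 1] * 8
print(count_ones_decimate(stream, 16))   # [12, 12]
```

Each result is proportional to the input level over that window: here 12 out of 16, or 75% of full scale.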

cgracey · 2014-01-09 14:49

BIG NEWS!!!

I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.

Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.

When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.
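
The fetch rule described above can be sketched as a small Python model. It is only an illustration of the idea: the class and function names are made up, hub memory is modelled as a dict indexed by long address (an assumption), and the multi-line least-recently-used behaviour reflects the planned expansion rather than what was running at the time.

```python
from collections import OrderedDict

COG_LIMIT = 0x1FF   # PCs above this are fetched through the icache
LINE_LONGS = 8      # one cache line holds 8 longs


class ICache:
    """Instruction cache with least-recently-used line replacement."""

    def __init__(self, hub, lines=1):
        self.hub = hub               # long address -> instruction (assumed layout)
        self.lines = lines
        self.cache = OrderedDict()   # line tag -> 8 longs, least recent first
        self.hub_fetches = 0

    def read(self, pc):
        tag = pc // LINE_LONGS
        if tag in self.cache:
            self.cache.move_to_end(tag)          # mark as most recently used
        else:
            if len(self.cache) >= self.lines:
                self.cache.popitem(last=False)   # evict least recently used line
            base = tag * LINE_LONGS
            self.cache[tag] = [self.hub.get(base + i, 0) for i in range(LINE_LONGS)]
            self.hub_fetches += 1                # one 8-long hub fetch
        return self.cache[tag][pc % LINE_LONGS]


def fetch(pc, cog, icache):
    """No hub-mode bit is needed: the PC alone decides where to fetch from."""
    if pc <= COG_LIMIT:
        return cog[pc]
    return icache.read(pc)


cog = list(range(0x200))
hub = {addr: addr for addr in range(0x200, 0x240)}

# A loop that bounces between two different cache lines, four times round:
for lines in (1, 4):
    icache = ICache(hub, lines=lines)
    for _ in range(4):
        fetch(0x200, cog, icache)
        fetch(0x208, cog, icache)
    print(lines, "line(s):", icache.hub_fetches, "hub fetches")
```

With a single line, the loop refetches on every jump (8 hub fetches). With four lines, both halves stay cached after the first pass (2 hub fetches), which is the motivation for expanding the cache.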

DaveJenson · 2014-01-09 15:03

Congratulations! You are the man!

DL7PNP · 2014-01-09 15:09

That sounds very impressive and powerful!

Would it be possible to execute inline assembler in spin or even a kind of lmm assembler?

cgracey wrote: »

BIG NEWS!!!

I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.

Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.

When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.

David Betz · 2014-01-09 15:10

cgracey wrote: »

BIG NEWS!!!

I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.

Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.

When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.

Wow! That is wonderful news! Congratulations!!

Are you ready to release the new instruction encodings yet so we can think about updating propgcc?

Kye · 2014-01-09 15:11

Hub execution will give you 200 MHz LMM execution, that coupled with the ability to bring in DRAM data means big things.

I assume there's no d-cache... so this is only for making static read-only code faster right?

cgracey · 2014-01-09 15:19

Kye wrote: »

Hub execution will give you 200 MHz LMM execution, that coupled with the ability to bring in DRAM data means big things.

I assume there's no d-cache... so this is only for making static read-only code faster right?

That's right. If you need self-modifying code, you could load it into the cog RAM and execute it there. After I get this done, I want to make automatic transfers between hub, cog, aux, and pins.

cgracey · 2014-01-09 15:20

David Betz wrote: »

Wow! That is wonderful news! Congratulations!!

Are you ready to release the new instruction encodings yet so we can think about updating propgcc?

I think so. There will be a few instruction additions coming, but I don't see any big changes.

cgracey · 2014-01-09 15:21

DL7PNP wrote: »

That sounds very impressive and powerful!

Would it be possible to execute inline assembler in spin or even a kind of lmm assembler?

Yes. That's going to be important to Spin - to be able to execute PASM and even background PASM in other tasks of the same cog.

David Betz · 2014-01-09 15:22

cgracey wrote: »

I think so. There will be a few instruction additions coming, but I don't see any big changes.

I'm not at all concerned about instruction additions as long as there are no more big encoding changes across the whole instruction set.

David Betz · 2014-01-09 15:23

cgracey wrote: »

That's right. If you need self-modifying code, you could load it into the cog RAM and execute it there. After I get this done, I want to make automatic transfers between hub, cog, aux, and pins.

Are "automatic transfers" like DMA?

cgracey · 2014-01-09 15:25

David Betz wrote: »

Are "automatic transfers" like DMA?

You could say so. You just can't execute instructions that would try to alter those resources while the transfer is going on. For example, if you start a hub-to-AUX transfer, you better not do hub or AUX instructions while it's busy, or the transfer would get corrupted.

Bill Henning · 2014-01-09 15:27

Excellent news!

cgracey wrote: »

BIG NEWS!!!

I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.

Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.

When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.

David Betz · 2014-01-09 15:30

cgracey wrote: »

You could say so. You just can't execute instructions that would try to alter those resources while the transfer is going on. For example, if you start a hub-to-AUX transfer, you better not do hub or AUX instructions while it's busy, or the transfer would get corrupted.

That seems perfectly reasonable.

Baggers · 2014-01-09 15:46

Awesome news Chip

Also a quick question... Would the Spin interpreter be able to run from ROM? thus freeing even more COG ram?
