Propeller II update - BLOG - Page 148 — Parallax Forums

Propeller II update - BLOG


Comments

  • Baggers Posts: 2,991
    edited 2013-12-22 06:11
    Yes, it would then look up something else, or be a table of instructions, i.e., for a recording of a replay, where it will have bits for movement directions or anything, but I guess the GETNIB would work fine too :) isn't there one for bytes too iirc. as that would be helpful also.
    for getting a tile from a map, to then draw to screen, or other such use, not just tied to gaming.

    edit: MOVF or something like that, can't remember off hand.
  • Cluso99 Posts: 17,478
    edited 2013-12-22 06:31
    Yes, there are GETBYTE, GETWORD, and some others too.
    The GETNIB would save having to ROR the source, but you would need to AND the dest to mask the upper bits off. Though, if it were a table, that could be unnecessary.
  • ozpropdev Posts: 2,745
    edited 2013-12-22 15:56
    Baggers wrote: »
    Yes, it would then look up something else, or be a table of instructions, i.e., for a recording of a replay, where it will have bits for movement directions or anything, but I guess the GETNIB would work fine too :) isn't there one for bytes too iirc. as that would be helpful also.
    for getting a tile from a map, to then draw to screen, or other such use, not just tied to gaming.

    edit: MOVF or something like that, can't remember off hand.

    The MOVF/SETF (Byte mover) is now obsolete. It has been replaced with the following instructions
    GETNIB SETNIB GETBYTE SETBYTE GETWORD SETWORD ROLNIB SWBYTES STBYTES ESWAP4 ESWAP8
    ROLBYTE ROLWORD . Some very useful instructions in that group. :)
    Cluso99 wrote: »
    Yes, there are GETBYTE, GETWORD, and some others too.
    The GETNIB would save having to ROR the source, but you would need to AND the dest to mask the upper bits off. Though, if it were a table, that could be unnecessary.

    GETNIB, GETBYTE and GETWORD all zero the upper bits of the destination. No ANDing required :)
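For anyone trying to picture the difference between the GET and SET forms discussed above, here is a rough Python model. It is only a sketch: the function names mirror the mnemonics, but the field numbering (field 0 = least significant) and the exact operand forms are assumptions, not something taken from the Prop2 docs.

```python
MASK32 = 0xFFFFFFFF

def getnib(s, n):
    """Model of GETNIB: nibble n of s, with all upper result bits zero."""
    return (s >> (4 * n)) & 0xF

def getbyte(s, n):
    """Model of GETBYTE: byte n of s, with all upper result bits zero."""
    return (s >> (8 * n)) & 0xFF

def getword(s, n):
    """Model of GETWORD: word n of s, with all upper result bits zero."""
    return (s >> (16 * n)) & 0xFFFF

def setnib(d, s, n):
    """Model of SETNIB: put the low nibble of s into field n of d,
    leaving every other bit of d untouched."""
    shift = 4 * n
    cleared = d & ~(0xF << shift) & MASK32
    return cleared | ((s & 0xF) << shift)
```

The GET functions never need a follow-up AND because the mask is part of the operation, while the SET function is the one that merges into an existing value.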
  • Cluso99 Posts: 17,478
    edited 2013-12-22 17:21
    ozpropdev wrote: »
    The MOVF/SETF (Byte mover) is now obsolete. It has been replaced with the following instructions
    GETNIB SETNIB GETBYTE SETBYTE GETWORD SETWORD ROLNIB SWBYTES STBYTES ESWAP4 ESWAP8
    ROLBYTE ROLWORD . Some very useful instructions in that group. :)

    GETNIB, GETBYTE and GETWORD all zero the upper bits of the destination. No ANDing required :)
    Of course. I was confusing it with SETNIB where you put it into an existing field. I am yet to understand how some of these work.

    BTW Do you know what FRAC D/#,S/# does?
  • Baggers Posts: 2,991
    edited 2013-12-22 17:54
    Awesome set of instructions :D shame I was too busy with work at the time they came out, and missed seeing that part of the thread, then again it does fill up quickly in here ;)

    Cluso99, what does FRAC D/#,S/# do?
  • rjo__ Posts: 2,115
    edited 2013-12-22 19:23
    Well, since there is no such thing as a stupid question around here, I think I'll take my shot at a stupid answer:)

    I think the description in Prop2_Docs from 11/27 is a little thin, since it stops just when things are getting interesting.
    Superficially, it looks like Frac first applies a fixed scale to the fraction… but the resulting getdivq and getdivr results are not really described. Are they the components of a binary fraction?
    To start a 32-bit fraction calculation, use FRAC:

    FRAC D/#,S/# - Begin calculating the unsigned fraction of D/# over S/#, where
    D/# and S/# are unsigned 32-bit values and D/# is less than S/#.
    Use GETDIVQ to get the result.

    Examples:

    FRAC #1,#2 yields $80000000 (1/2 of $1_00000000)
    FRAC #1,#3 yields $55555555 (1/3 of $1_00000000)
    FRAC #1,#4 yields $40000000 (1/4 of $1_00000000)
    FRAC #15,#16 yields $F0000000
    FRAC $80000000,$90000000 yields $E38E38E3
    FRAC 31_250,80_000_000 yields $00199999


    After starting the divider, you'll have 17 clock cycles to execute other code, if you
    wish, before GETDIVQ/GETDIVR will return the quotient/remainder long(s) of the result:

    GETDIVQ D - Get quotient result
    GETDIVR D - Get remainder result

    In single-task mode, GETDIVQ/GETDIVR will stall the pipeline until the result is ready.
    In multi-task mode, GETDIVQ/GETDIVR will jump to themselves until the result is ready,
    freeing clocks for other tasks.

    Rich
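The documented examples are consistent with FRAC computing floor(D * 2^32 / S), i.e. a 32-bit binary fraction. A minimal Python sketch under that reading follows. The name `frac` and the idea that GETDIVR returns the remainder of that same division are assumptions; the quoted docs only show the quotient values.

```python
def frac(d, s):
    """Model of FRAC D,S followed by GETDIVQ / GETDIVR.

    Returns (quotient, remainder) of (d << 32) / s, assuming d < s so the
    quotient fits in 32 bits. The remainder part is an assumption.
    """
    if not (0 <= d < s <= 0xFFFFFFFF):
        raise ValueError("FRAC expects unsigned 32-bit values with D < S")
    return divmod(d << 32, s)

# Reproduce the examples from the docs excerpt above.
for d, s in [(1, 2), (1, 3), (1, 4), (15, 16),
             (0x80000000, 0x90000000), (31_250, 80_000_000)]:
    q, r = frac(d, s)
    print(f"FRAC {d},{s} yields ${q:08X}")
```

Read this way, the quotient is the value X such that X / 2^32 approximates D / S, which is the form a phase-accumulator frequency register would want.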
  • rjo__ Posts: 2,115
    edited 2013-12-22 19:39
    I'm trying to get my camera up and running on the P2… right now, everything I don't know how to do with the Prop2, I am contemplating doing with a Prop1 and then just frankenstein the hardware together.
    Of course, the point is to get everything on the P2 as quickly as possible. How do I set up ctrb to output NCO on pin x(portA)? I understand how to assign frqb, it is the mode bits that I can't fathom.

    Thanks,

    Rich
  • cgracey Posts: 13,412
    edited 2013-12-23 09:20
    Cluso99 wrote: »
    Ugh! I missed that part - just assumed the destination was rotated.

    If he is wanting the source to be rotated, that is not possible as that would mean 2 writebacks to the cog in stage 4.

    So this is what he was after...
    Which means combining these instructions...
    AND D,#mask
      (where "2" uses mask=11, "4" uses mask=1111, "8" uses mask=11111111)
    MOV tmp,S
    AND tmp,#mask
    OR D,tmp

    followed by
    ROR S,#n
      (where n=2/4/8)
    


    That's right. We only have one cog RAM register-write possibility per instruction, so the operation on S would have to become an operation on D in a separate instruction. As clock cycles go, this wouldn't be any worse - it would just take another instruction.
  • cgracey Posts: 13,412
    edited 2013-12-23 09:26
    rjo__ wrote: »
    I'm trying to get my camera up and running on the P2… right now, everything I don't know how to do with the Prop2, I am contemplating doing with a Prop1 and then just frankenstein the hardware together.
    Of course, the point is to get everything on the P2 as quickly as possible. How do I set up ctrb to output NCO on pin x(portA)? I understand how to assign frqb, it is the mode bits that I can't fathom.

    Thanks,

    Rich


    Rich, sorry, those docs aren't done yet. You can find the LSB-justified bit pattern that does NCO in the latest docs file. Just put the pin number you want the output on into the D and/or I fields (via SETD/SETI). Pin numbers are seven bits, but it will be necessary to set relative bits +7 and +8 to either %01, %10, or %11 to get output. Remember to set the related DIR bit(s). You'll need to experiment (I'm not at my work computer now).
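While the mode bits get sorted out, the FRQ side of an NCO can at least be checked on paper. The sketch below assumes the P2 counter behaves like the Propeller 1 NCO mode (a 32-bit phase accumulator that adds FRQ every clock and outputs its top bit); that carry-over is an assumption, and the function names are made up for illustration.

```python
def nco_frequency(frq, clk_hz):
    """Output frequency of a 32-bit phase-accumulator NCO (P1-style model)."""
    return frq * clk_hz / 2**32

def count_rising_edges(frq, clocks):
    """Step the accumulator and count 0->1 transitions of bit 31."""
    phase = 0
    prev_msb = 0
    edges = 0
    for _ in range(clocks):
        phase = (phase + frq) & 0xFFFFFFFF   # PHS += FRQ, wrapping at 32 bits
        msb = phase >> 31                    # pin output follows PHS[31]
        if msb and not prev_msb:
            edges += 1
        prev_msb = msb
    return edges

# $00199999 is the FRAC 31_250,80_000_000 result quoted earlier in the thread.
print(nco_frequency(0x00199999, 80_000_000))
```

Under this model the FRAC result for (wanted Hz, clock Hz) is directly usable as the frequency register value, slightly low because the fraction is truncated.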
  • Baggers Posts: 2,991
    edited 2013-12-23 11:25
    Cheers Rich :)
  • rjo__ Posts: 2,115
    edited 2013-12-23 17:55
    Thanks Chip,

    I thought it was probably described somewhere in a thread I had missed or that maybe we only had one counter on the FPGA.

    Baggers…Happy Holidays!!!

    Rich
  • koehler Posts: 599
    edited 2014-01-07 01:43
    Hi Chip,

    Just a question regarding P2 power usage.
    While there is some 'down time' waiting for the next shuttle run, I know you've been tweaking and adding features at a rip roaring rate.
    However, I've not seen any mention of any sort of power-gating being implemented.
    This may already be in place; however, if not, I think it would be very, very worthwhile to minimize power requirements as much as is reasonably possible.
  • cgracey Posts: 13,412
    edited 2014-01-07 12:58
    koehler wrote: »
    Hi Chip,

    Just a question regarding P2 power usage.
    While there is some 'down time' waiting for the next shuttle run, I know you've been tweaking and adding features at a rip roaring rate.
    However, I've not seen any mention of any sort of power-gating being implemented.
    This may already be in place; however, if not, I think it would be very, very worthwhile to minimize power requirements as much as is reasonably possible.


    The synthesis tools add clock gating automatically in cases where adequate setup times exist. In our last synthesis run, 89% of the flipflops were clock-gated, as opposed to always-clocked with enable inputs. This chip might dissipate a couple of watts at 1.8V.
  • KeithE Posts: 957
    edited 2014-01-07 21:22
    cgracey wrote: »
    The synthesis tools add clock gating automatically in cases where adequate setup times exist. In our last synthesis run, 89% of the flipflops were clock-gated, as opposed to always-clocked with enable inputs. This chip might dissipate a couple of watts at 1.8V.

    Here's a page with a picture - see Figure 1:

    http://www.synopsys.com/Tools/Implementation/RTLSynthesis/Pages/PowerCompiler.aspx
  • MJB Posts: 1,200
    edited 2014-01-09 14:10
    @CHIP
    In Automated Testing Systems we often have the need to measure isolated voltages.
    This can easily be done with external Sigma/Delta encoders like http://www.analog.com/AD7401A
    This part delivers a 10MHz bitstream which gives a 16bit ADC when filtered by a SINC3 filter
    (verilog code shown in datasheet). But then need an FPGA to decode it, which is high cost.
    So I was wondering, if the P2 ADC HW might be used to take the EXTERNAL bitstream, instead of the internal one
    to implement an isolated ADC.
    IIRC the internal ADC works with 1st order S/D-encoder whereas this SINC3 gives MUCH better Signal to noise ratio.
    Looking at the verilog code it might even be possible to use one COG @200MHz to do it in SW.
    but some HW assist, if available, would be much better.
  • cgracey Posts: 13,412
    edited 2014-01-09 14:37
    MJB wrote: »
    @CHIP
    In Automated Testing Systems we often have the need to measure isolated voltages.
    This can easily be done with external Sigma/Delta encoders like http://www.analog.com/AD7401A
    This part delivers a 10MHz bitstream which gives a 16bit ADC when filtered by a SINC3 filter
    (verilog code shown in datasheet). But then need an FPGA to decode it, which is high cost.
    So I was wondering, if the P2 ADC HW might be used to take the EXTERNAL bitstream, instead of the internal one
    to implement an isolated ADC.
    IIRC the internal ADC works with 1st order S/D-encoder whereas this SINC3 gives MUCH better Signal to noise ratio.
    Looking at the verilog code it might even be possible to use one COG @200MHz to do it in SW.
    but some HW assist, if available, would be much better.


    The CTRs have modes to sum up 1's in order to realize a 1st-order delta-sigma conversion. I don't know if there will be adequate bandwidth to do any 2nd-order conversions, unless you can use every Nth bit, or MSBs of short accumulations.
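To make the software option concrete, here is a rough Python sketch of a sinc3 decimator of the general kind MJB describes: three integrators running at the bitstream rate, followed by three differentiators running at the decimated rate. It is a generic textbook CIC structure, not the Verilog from the AD7401A datasheet, and the 32-bit accumulator width is an assumption.

```python
def sinc3_decimate(bits, ratio):
    """Filter a 1-bit delta-sigma stream with a 3-stage CIC (sinc3) decimator.

    bits  : iterable of 0/1 samples at the modulator rate
    ratio : decimation ratio (one output per `ratio` input bits)
    Returns the list of output samples; full scale is ratio**3.
    """
    mask = 0xFFFFFFFF            # accumulators wrap, as they would in hardware
    acc1 = acc2 = acc3 = 0       # integrator states
    d1 = d2 = d3 = 0             # differentiator delay registers
    out = []
    for i, b in enumerate(bits, start=1):
        # Integrators run on every input bit.
        acc1 = (acc1 + b) & mask
        acc2 = (acc2 + acc1) & mask
        acc3 = (acc3 + acc2) & mask
        if i % ratio == 0:
            # Differentiators run once per output sample.
            diff1 = (acc3 - d1) & mask
            diff2 = (diff1 - d2) & mask
            diff3 = (diff2 - d3) & mask
            d1, d2, d3 = acc3, diff1, diff2
            out.append(diff3)
    return out
```

The first two outputs after start-up are filter settling and should be discarded. A first-order conversion, by contrast, is just a count of 1s per window, which is what the CTR summing modes mentioned above provide. The inner loop is three adds per bit, so whether one cog can keep up with a 10 MHz stream would need to be checked against the real instruction timing.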
  • cgracey Posts: 13,412
    edited 2014-01-09 14:49
    BIG NEWS!!!

    I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.

    Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.

    When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.
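For readers who want to see the fetch rule spelled out, here is a toy Python model of it: addresses up to $01FF come from cog RAM, anything above goes through an instruction cache of 8-long lines with least-recently-used replacement. Only the $01FF boundary, the 8-long line, and the one-line/four-line LRU plan come from the post above; the class and function names, the line alignment, and the way the program counter indexes hub memory are all assumptions made for illustration.

```python
from collections import OrderedDict

COG_LIMIT = 0x1FF     # last address that executes from cog RAM
LINE_LONGS = 8        # one cache line = one 8-long hub fetch

class ICache:
    """Tiny LRU instruction cache model (hypothetical structure)."""

    def __init__(self, hub, lines=1):
        self.hub = hub                # hub memory as a list of longs
        self.lines = lines            # 1 today, 4 in the planned expansion
        self.cache = OrderedDict()    # line tag -> 8 longs, oldest first
        self.misses = 0

    def fetch(self, pc):
        tag = pc // LINE_LONGS
        if tag in self.cache:
            self.cache.move_to_end(tag)          # mark as most recently used
        else:
            self.misses += 1                     # costs an 8-long hub fetch
            if len(self.cache) >= self.lines:
                self.cache.popitem(last=False)   # evict least recently used
            base = tag * LINE_LONGS
            self.cache[tag] = self.hub[base:base + LINE_LONGS]
        return self.cache[tag][pc % LINE_LONGS]

def fetch_instruction(pc, cog_ram, icache):
    """No mode bit needed: the address alone decides where to fetch from."""
    if pc <= COG_LIMIT:
        return cog_ram[pc]
    return icache.fetch(pc)
```

With a single line, straight-line hub code misses once every eight instructions; with four lines, a loop that spans up to four lines can stay fully cached, which is where the "runs exactly as fast as it would from the cog" behaviour would show up.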
  • DaveJenson Posts: 338
    edited 2014-01-09 15:03
    Congratulations! You are the man!
  • DL7PNP Posts: 18
    edited 2014-01-09 15:09
    That sounds very impressive and powerful!

    Would it be possible to execute inline assembler in spin or even a kind of lmm assembler?
    cgracey wrote: »
    BIG NEWS!!!

    I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.

    Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.

    When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.
  • David Betz Posts: 14,365
    edited 2014-01-09 15:10
    cgracey wrote: »
    BIG NEWS!!!

    I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.

    Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.

    When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.

    Wow! That is wonderful news! Congratulations!!

    Are you ready to release the new instruction encodings yet so we can think about updating propgcc?
  • Kye Posts: 2,200
    edited 2014-01-09 15:11
    Hub execution will give you 200 MHz LMM execution, that coupled with the ability to bring in DRAM data means big things.

    I assume there's no d-cache... so this is only for making static read-only code faster right?
  • cgracey Posts: 13,412
    edited 2014-01-09 15:19
    Kye wrote: »
    Hub execution will give you 200 MHz LMM execution, that coupled with the ability to bring in DRAM data means big things.

    I assume there's no d-cache... so this is only for making static read-only code faster right?


    That's right. If you need self-modifying code, you could load it into the cog RAM and execute it there. After I get this done, I want to make automatic transfers between hub, cog, aux, and pins.
  • cgracey Posts: 13,412
    edited 2014-01-09 15:20
    David Betz wrote: »
    Wow! That is wonderful news! Congratulations!!

    Are you ready to release the new instruction encodings yet so we can think about updating propgcc?

    I think so. There will be a few instruction additions coming, but I don't see any big changes.
  • cgracey Posts: 13,412
    edited 2014-01-09 15:21
    DL7PNP wrote: »
    That sounds very impressive and powerful!

    Would it be possible to execute inline assembler in spin or even a kind of lmm assembler?

    Yes. That's going to be important to Spin - to be able to execute PASM and even background PASM in other tasks of the same cog.
  • David Betz Posts: 14,365
    edited 2014-01-09 15:22
    cgracey wrote: »
    I think so. There will be a few instruction additions coming, but I don't see any big changes.
    I'm not at all concerned about instruction additions as long as there are no more big encoding changes across the whole instruction set.
  • David Betz Posts: 14,365
    edited 2014-01-09 15:23
    cgracey wrote: »
    That's right. If you need self-modifying code, you could load it into the cog RAM and execute it there. After I get this done, I want to make automatic transfers between hub, cog, aux, and pins.
    Are "automatic transfers" like DMA?
  • cgracey Posts: 13,412
    edited 2014-01-09 15:25
    David Betz wrote: »
    Are "automatic transfers" like DMA?

    You could say so. You just can't execute instructions that would try to alter those resources while the transfer is going on. For example, if you start a hub-to-AUX transfer, you better not do hub or AUX instructions while it's busy, or the transfer would get corrupted.
  • Bill Henning Posts: 6,445
    edited 2014-01-09 15:27
    Excellent news!
    cgracey wrote: »
    BIG NEWS!!!

    I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.

    Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.

    When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.
  • David Betz Posts: 14,365
    edited 2014-01-09 15:30
    cgracey wrote: »
    You could say so. You just can't execute instructions that would try to alter those resources while the transfer is going on. For example, if you start a hub-to-AUX transfer, you better not do hub or AUX instructions while it's busy, or the transfer would get corrupted.
    That seems perfectly reasonable.
  • Baggers Posts: 2,991
    edited 2014-01-09 15:46
    Awesome news Chip :D

    Also a quick question... Would the Spin interpreter be able to run from ROM? thus freeing even more COG ram?