Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i - Page 110 — Parallax Forums


Comments

  • Cluso99Cluso99 Posts: 17,969
    edited 2017-12-15 04:22
    Here is a diagram of what is happening. CRC' is just the new value of CRC after the 1-bit addition.

    [Image: P2 CRCBIT.jpg]

    With the CRCBIT instruction as I have described it, it is a 2-clock instruction that can be executed as each bit is being received/sent. It therefore adds only one 2-clock instruction to the character assembly/sending loop.

    However, for this to work in this instance, any internal registers used plus any interrupts etc must still be as per normal.

    So perhaps there could be one or two internal registers (CRC and POLY), together with two instructions to load them, plus an instruction to accumulate the CRC from the C flag, plus an instruction to read the CRC register at the end?
    ' to use this...
            SETPOLY ##$8005             ' initial value of internal POLY register
            SETCRC  #0                  ' initial value of internal CRC register (could be #FFFF)
            .....
            REPS    #1,     #8          ' 2 instruction x 8 loops
            SHR     DATA,   #1    WC    ' C:=DATA[0]      
            CRCBIT                      ' accumulate 1-bit into CRC
            GETCRC  CRC16               ' read final CRC value
    
    To utilise the instruction as each bit in a block of characters is received/sent...
    ' or as receiving/sending each bit...
            SETPOLY ##$8005             ' initial value of internal POLY register
            SETCRC  #0                  ' initial value of internal CRC register (could be #FFFF)
            .....
            CRCBIT                      ' accumulate 1-bit (in Carry) into CRC (maybe 8*block)
            .....
            GETCRC  CRC16               ' read final CRC value
    

    In theory, we could use SETQ and SETQ2 instructions to load the initial values of CRC and POLY. Then we would only require the GETCRC D instruction and a simple CRCBIT instruction using the Carry flag.

    These could be used in conjunction with the NRZI helper instruction that I proposed for bit-banging USB, etc.
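The per-bit usage pattern above can be modelled in software. The following is only a Python sketch, not P2 code: the class `CrcUnit` and its method names are made up to mirror the proposed SETPOLY/SETCRC/CRCBIT/GETCRC, and it assumes the right-shifting (LSB-first) form of the CRC, which needs the bit-reversed polynomial $A001 instead of $8005.

```python
class CrcUnit:
    """Hypothetical software stand-in for the proposed internal CRC and POLY registers."""

    def __init__(self):
        self.poly = 0
        self.crc = 0

    def setpoly(self, poly):
        self.poly = poly & 0xFFFFFFFF

    def setcrc(self, value):
        self.crc = value & 0xFFFFFFFF

    def crcbit(self, c):
        # Assumed behaviour: X = C xor CRC[0]; CRC >>= 1; if X then CRC ^= POLY
        x = (c ^ self.crc) & 1
        self.crc >>= 1
        if x:
            self.crc ^= self.poly

    def getcrc(self):
        return self.crc


def receive_block(unit, data):
    """Accumulate the CRC one bit at a time, as each bit 'arrives'."""
    for byte in data:
        for _ in range(8):
            c = byte & 1          # models SHR DATA,#1 WC  (C := DATA[0])
            byte >>= 1
            unit.crcbit(c)        # models CRCBIT


unit = CrcUnit()
unit.setpoly(0xA001)              # bit-reversed $8005 (assumption, see above)
unit.setcrc(0)
receive_block(unit, b"123456789")
print(hex(unit.getcrc()))         # 0xbb3d, the CRC-16/ARC check value
```

Because the CRC is already up to date when the last bit arrives, nothing is left to compute at the end of the block except reading the register.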
  • rjo__rjo__ Posts: 2,115
    Chip,

    I don't have any suggestions... I have no idea what you guys are talking about most of the time, but:

    As God is my witness, I am going to have a massively parallel system to manage. I have no abstractions to offer, but I know you probably do.

    This seems like the last possible moment to think about it.

    Thanks,

    Rich
  • cgraceycgracey Posts: 13,627
    Cluso99,

    I think it will be much simpler to just use regular register conventions, like you first showed:

    CRCBIT crc,poly

    ...Where C is used by CRCBIT. What could be simpler than that?

    I made some instruction room for the CRCBIT instruction. I moved ZEROX and SIGNX to near TESTN/TEST, so they now have C and Z output. This opens up room where they used to be for the CRCBIT instruction. After making room, everything seems fine, so I'm going to sleep some and then implement the CRCBIT instruction.
  • Cluso99Cluso99 Posts: 17,969
    cgracey wrote: »
    Cluso99,

    I think it will be much simpler to just use regular register conventions, like you first showed:

    CRCBIT crc,poly

    ...Where C is used by CRCBIT. What could be simpler than that?

    I made some instruction room for the CRCBIT instruction. I moved ZEROX and SIGNX to near TESTN/TEST, so they now have C and Z output. This opens up room where they used to be for the CRCBIT instruction. After making room, everything seems fine, so I'm going to sleep some and then implement the CRCBIT instruction.
    Excellent thanks Chip!
  • cgraceycgracey Posts: 13,627
    Yanomani wrote: »
    Hi Chip

    If I understood it correctly; having 32 flip-flops to buffer the top level of the stack means that you've gained at least two clock cycles to access the stack ram and store/retrieve each pushed/popped item, having direct access to the buffered top level meanwhile.

    This also means that the stack ram size could increase at will, limited only by physical space availability.

    Based on the assumption that stack overflow/underflow is mostly a software concern, it means you could maintain two independent binary pointers that are only realigned (zeroed) at each cog start.


    Also, because it's not a dual ported Fifo, meant for simultaneous read and write access, all the complexity of Hamming coded empty/full counters/detectors and synced read/write clocks could be avoided.


    Hope I'd figured it right!

    P.S. Sorry, I didn't figure it right.

    Stacks are LIFOs by nature, so a single write/read pointer suffices for accessing them.

    If it is meant to be directly accessible only by call/push and ret/pop, its addressing pointer doesn't need to be initialized to any particular value at all.

    Sometimes, drinking a lot of coffee has the same effects of drinking a lot of beer or wine. My fault.

    Henrique
    cgracey wrote: »
    Seairth wrote: »
    Now that the stack has been widened to 32 bits, I expect more data to be put in it. Is 8 levels enough? I know it's not meant to be a general-purpose stack, but it does have the advantage of being a "standard location" for passing parameters and return values between third-party code.

    The problem is that the stack is currently built from flipflops. To get any real size increase, we would have to go to a RAM. That might be a big change with OnSemi. I think we would have to buffer the top of the stack with 32 flipflops.

    Right, it's a LIFO. Just one pointer would be needed. And the top could be buffered in flops to make it immediately accessible. It's too late to implement something like this, though. We've got OnSemi going with a fixed set of memories. Any deviation would probably cost more money.
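The arrangement described here (a RAM-based LIFO with one pointer and the top entry held in flops) can be sketched as a small behavioural model. This is an illustration only, written in Python; the class name, the depth of 8 and the wrap-around behaviour on overflow are assumptions, not a description of any actual P2 logic.

```python
class BufferedStack:
    """Hypothetical model: stack RAM plus a top-of-stack register held in flops."""

    def __init__(self, depth=8, start_ptr=0):
        self.ram = [0] * depth          # stands in for the stack RAM
        self.ptr = start_ptr % depth    # single read/write pointer
        self.top = 0                    # the buffered top (the "32 flipflops")

    def push(self, value):
        # The old top spills into RAM; the new value is readable from top at once.
        self.ram[self.ptr] = self.top
        self.ptr = (self.ptr + 1) % len(self.ram)
        self.top = value & 0xFFFFFFFF

    def pop(self):
        # The top is returned immediately; the next entry is refilled from RAM.
        value = self.top
        self.ptr = (self.ptr - 1) % len(self.ram)
        self.top = self.ram[self.ptr]
        return value


stack = BufferedStack(start_ptr=5)      # the starting pointer value does not matter
for v in (1, 2, 3):
    stack.push(v)
print([stack.pop() for _ in range(3)])  # [3, 2, 1]
```

Since pushes and pops only ever move the pointer relative to where it already is, the model behaves the same for any starting pointer value, which matches the point that the pointer needs no particular initialization.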
  • cgracey wrote: »
    Yanomani wrote: »
    Hi Chip

    If I understood it correctly; having 32 flip-flops to buffer the top level of the stack means that you've gained at least two clock cycles to access the stack ram and store/retrieve each pushed/popped item, having direct access to the buffered top level meanwhile.

    This also means that the stack ram size could increase at will, limited only by physical space availability.

    Based on the assumption that stack overflow/underflow is mostly a software concern, it means you could maintain two independent binary pointers that are only realigned (zeroed) at each cog start.


    Also, because it's not a dual ported Fifo, meant for simultaneous read and write access, all the complexity of Hamming coded empty/full counters/detectors and synced read/write clocks could be avoided.


    Hope I'd figured it right!

    P.S. Sorry, I didn't figure it right.

    Stacks are LIFOs by nature, so a single write/read pointer suffices for accessing them.

    If it is meant to be directly accessible only by call/push and ret/pop, its addressing pointer doesn't need to be initialized to any particular value at all.

    Sometimes, drinking a lot of coffee has the same effects of drinking a lot of beer or wine. My fault.

    Henrique
    cgracey wrote: »
    Seairth wrote: »
    Now that the stack has been widened to 32 bits, I expect more data to be put in it. Is 8 levels enough? I know it's not meant to be a general-purpose stack, but it does have the advantage of being a "standard location" for passing parameters and return values between third-party code.

    The problem is that the stack is currently built from flipflops. To get any real size increase, we would have to go to a RAM. That might be a big change with OnSemi. I think we would have to buffer the top of the stack with 32 flipflops.

    Right, it's a LIFO. Just one pointer would be needed. And the top could be buffered in flops to make it immediately accessible. It's too late to implement something like this, though. We've got OnSemi going with a fixed set of memories. Any deviation would probably cost more money.
    I don't think a bigger stack would be that useful unless you could also directly address elements other than the top. That would allow you to set up the stack frames needed by most high-level languages. Just being able to PUSH/POP wouldn't be that useful.

  • cgraceycgracey Posts: 13,627
    edited 2017-12-15 14:28
    Cluso99,

    I made the CRCBIT instruction as you described.

    I'm not getting any results that match online CRC calculators.

    Do you shift data in LSB-first or MSB-first?

    I suppose for a byte, you do 8 bits. What if it's CRC16?

    The online calculators seem to be shifting data left, not right. Any idea why things look this way?

    I made this:

    (1) X := C XOR D[0]
    (2) D := D >> 1
    (3) if X == 1 then D := D XOR POLY

    The Verilog looks like this:

    wire [31:0] crcbit = {1'b0, d[31:01]} ^ ({32{c ^ d[0]}} & s);

    What do you think?
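One possible explanation for the mismatch, offered as an assumption rather than something settled in the thread: the three steps above are the right-shifting, LSB-first form of a CRC, which needs the bit-reversed polynomial ($A001 rather than $8005 for CRC-16), while many calculators describe the left-shifting, MSB-first form. The Python sketch below models the steps and the Verilog expression and compares both forms; the helper names are made up for illustration.

```python
def crcbit(d, s, c):
    """One CRCBIT step as described above: d = CRC, s = POLY, c = incoming bit."""
    x = (c ^ d) & 1                 # (1) X := C XOR D[0]
    d >>= 1                         # (2) D := D >> 1
    if x:                           # (3) if X == 1 then D := D XOR POLY
        d ^= s
    return d & 0xFFFFFFFF


def crc_right(data, poly, init):
    """Feed every byte LSB-first through crcbit (right-shifting form)."""
    crc = init
    for byte in data:
        for i in range(8):
            crc = crcbit(crc, poly, (byte >> i) & 1)
    return crc


def crc16_left(data, poly, init):
    """Left-shifting, MSB-first 16-bit form."""
    crc = init
    for byte in data:
        for i in range(7, -1, -1):
            x = ((crc >> 15) ^ (byte >> i)) & 1
            crc = (crc << 1) & 0xFFFF
            if x:
                crc ^= poly
    return crc


def reverse_bits(value, width):
    return int(format(value, "0{}b".format(width))[::-1], 2)


msg = b"123456789"
right = crc_right(msg, 0xA001, 0xFFFF)
mirrored = bytes(reverse_bits(b, 8) for b in msg)
left = crc16_left(mirrored, 0x8005, 0xFFFF)
print(hex(right))                        # 0x4b37, the CRC-16/MODBUS check value
print(right == reverse_bits(left, 16))   # True
```

The two forms are mirror images of each other: the same bit sequence produces bit-reversed register contents, so a result can look wrong simply because the polynomial or the result is written in the other bit order.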
  • RaymanRayman Posts: 12,210
    FYI: For the USB CRC16, I used this reference and got a match:

    Use this web tool with hex data and look at the Modbus result:
    http://www.lammertbies.nl/comm/info/crc-calculation.html

    Then you have to invert all the bits, and you send the LSByte first.
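The recipe above can be written out as a short Python sketch. It assumes the usual CRC-16/MODBUS parameters (reflected polynomial $A001, initial value $FFFF); the function names are made up for illustration.

```python
def crc16_modbus(data):
    """Bitwise CRC-16/MODBUS: right-shifting, polynomial 0xA001, initial value 0xFFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


def usb_crc16_bytes(data):
    """Invert all bits of the Modbus result, then order the two bytes LSByte first."""
    crc = crc16_modbus(data) ^ 0xFFFF
    return crc.to_bytes(2, "little")


print(hex(crc16_modbus(b"123456789")))      # 0x4b37
print(usb_crc16_bytes(b"123456789").hex())  # c8b4
```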
  • From my reading on the hardware side, it's typical for a USB Serial Interface Engine to calculate a CRC MSB-first.

    Here's a Microchip doc that has an overview on both hardware and software implementations:
    CRC Generating and Checking
  • Cluso99Cluso99 Posts: 17,969
    edited 2017-12-15 17:46
    Chip,
    I will have to check it out. I did post working spin P1 code.

    There are MSB and LSB versions of CRC, plus inversion of bits, starting values, and inversions of the POLY. Ultimately they give the same results once you know how to correct the end result.

    I am fairly sure what I gave you is correct.

    Unfortunately I will not get time to verify today as we have a family Christmas Party :)
  • Cluso99Cluso99 Posts: 17,969
    edited 2017-12-15 18:08
    Chip,

    One thing I miss when converting P1 PASM programs is the JMPRET (JMP/CALL/RET) instruction.

    It would be nice if it's possible to add an equivalent.

    Thinking about handling COG and LUT, for starters, just handle COG. We are never going to be able to store the return address in LUT anyway.
  • Chip
    What was decided on moving the CZ flags to bits 31:30 on the hardware stack?
  • evanhevanh Posts: 11,806
    ozpropdev wrote: »
    Chip
    What was decided on moving the CZ flags to bits 31:30 on the hardware stack?
    Putting the stack back to a hardware width of 22 bits is the right answer.
  • evanhevanh Posts: 11,806
    Cluso99 wrote: »
    One thing I miss when converting P1 PASM programs is the JMPRET (JMP/CALL/RET) instruction.
    CALLD D,S is the Prop2 equivalent I think.
  • evanh wrote: »
    ozpropdev wrote: »
    Chip
    What was decided on moving the CZ flags to bits 31:30 on the hardware stack?
    Putting the stack back to a hardware width of 22 bits is the right answer.
    A 32 bit machine without a 32 bit wide stack seems odd to me.
    Making it 32 bits keeps things consistent with the PUSHx/POPx hub stack operations.


  • RaymanRayman Posts: 12,210
    Shouldn't stack be 34 bits?
  • evanhevanh Posts: 11,806
    It's not a normal stack though, is it? It's a special tiny hardware thing that can overflow in a blink. If the PUSH/POP instructions for that hardware stack were double operand, then I'd be saying those should be removed too.

    We have PTRA/B for the real thing. And I'm sure ALTx instructions can do wonders for fast Cog side operations.
  • cgraceycgracey Posts: 13,627
    I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).
  • evanhevanh Posts: 11,806
    I'd vote more PTRx registers if they can fit.
  • evanhevanh Posts: 11,806
    cgracey wrote: »
    I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

    Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.
  • cgraceycgracey Posts: 13,627
    evanh wrote: »
    cgracey wrote: »
    I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

    Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

    They now do this {C, Z, 10'b0, PC[19:0]}.

    Am I not seeing something?
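For readers who do not speak Verilog, the layout {C, Z, 10'b0, PC[19:0]} puts C in bit 31, Z in bit 30, ten zero bits in 29..20 and the 20-bit address in 19..0. A small Python sketch of packing and unpacking such a word (the function names are made up for illustration):

```python
def pack_return(c, z, pc):
    """Build the 32-bit word {C, Z, 10'b0, PC[19:0]}."""
    return ((c & 1) << 31) | ((z & 1) << 30) | (pc & 0xFFFFF)


def unpack_return(word):
    """Recover (C, Z, PC) from a packed word; bits 29..20 are ignored."""
    return (word >> 31) & 1, (word >> 30) & 1, word & 0xFFFFF


word = pack_return(1, 0, 0x12345)
print(hex(word))             # 0x80012345
print(unpack_return(word))   # (1, 0, 74565), i.e. PC = 0x12345
```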
  • cgraceycgracey Posts: 13,627
    evanh wrote: »
    I'd vote more PTRx registers if they can fit.

    Those take a lot of mux's to read. They are expensive, in a sense.
  • evanhevanh Posts: 11,806
    cgracey wrote: »
    evanh wrote: »
    I'd vote more PTRx registers if they can fit.

    Those take a lot of mux's to read. They are expensive, in a sense.

    Not to mention the instruction encoding space restrictions.
  • cgraceycgracey Posts: 13,627
    evanh wrote: »
    cgracey wrote: »
    evanh wrote: »
    I'd vote more PTRx registers if they can fit.

    Those take a lot of mux's to read. They are expensive, in a sense.

    Not to mention the instruction encoding space restrictions.

    That, too, and there's no unallocated registers from $1F0..$1FF.

  • evanhevanh Posts: 11,806
    cgracey wrote: »
    evanh wrote: »
    cgracey wrote: »
    I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

    Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

    They now do this {C, Z, 10'b0, PC[19:0]}.

    Am I not seeing something?

    Assuming we want consistency, I count 14 instructions as listed in the Prop2 Instructions v30:
    Register addressing versions:
    CALLD
    CALL/CALLPA/CALLPB/RET
    CALLA/RETA
    CALLB/RETB
    Oddly, your JMP also mentions loading of C & Z from the D register.

    Immediate addressing instructions:
    CALL
    CALLA
    CALLB
    CALLD
    The immediate addressed JMP doesn't mention C & Z.

    The hardware stack PUSH/POP were on that list but will be removed now that they do a full 32-bit operation.


  • AribaAriba Posts: 2,515
    cgracey wrote: »
    Cluso99 and Garryj,

    Isn't it the case that with USB, where you are getting bytes back, not bits, you'd like an efficient means to process a whole byte, and not want to piece it out, bit by bit?

    Is there ever a case where the data to be CRC'd is not a multiple of 8 bits? I know the initial USB stuff is 5 bits, so it might warrant a bitwise CRC function, but isn't most everything else multiples of bytes?

    Yes, a CRC instruction for 1 bit does not make much sense.
    We already have solutions that calculate a CRC16 for a whole byte in 11 clock cycles:
    forums.parallax.com/discussion/comment/1369775/#Comment_1369775
    A loop with 8 one bit calculations will always be slower.

    A CRC calculation in parallel would make sense: The instruction takes the byte and starts a calculation in hardware. After min. 8 cycles you can read the result and add it to the current CRC sum.

    Andy
  • cgraceycgracey Posts: 13,627
    Ariba wrote: »
    cgracey wrote: »
    Cluso99 and Garryj,

    Isn't it the case that with USB, where you are getting bytes back, not bits, you'd like an efficient means to process a whole byte, and not want to piece it out, bit by bit?

    Is there ever a case where the data to be CRC'd is not a multiple of 8 bits? I know the initial USB stuff is 5 bits, so it might warrant a bitwise CRC function, but isn't most everything else multiples of bytes?

    Yes, a CRC instruction for 1 bit does not make much sense.
    We already have solutions that calculate a CRC16 for a whole byte in 11 clock cycles:
    forums.parallax.com/discussion/comment/1369775/#Comment_1369775
    A loop with 8 one bit calculations will always be slower.

    A CRC calculation in parallel would make sense: The instruction takes the byte and starts a calculation in hardware. After min. 8 cycles you can read the result and add it to the current CRC sum.

    Andy

    Thanks, Andy. I'll investigate what it would take to do this in hardware, anyway, since it wouldn't require a big chunk of the LUT.
  • cgraceycgracey Posts: 13,627
    evanh wrote: »
    cgracey wrote: »
    evanh wrote: »
    cgracey wrote: »
    I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

    Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

    They now do this {C, Z, 10'b0, PC[19:0]}.

    Am I not seeing something?

    Assuming we want consistency, I count 14 instructions as listed in the Prop2 Instructions v30:
    Register addressing versions:
    CALLD
    CALL/CALLPA/CALLPB/RET
    CALLA/RETA
    CALLB/RETB
    Oddly, your JMP also mentions loading of C & Z from the D register.

    Immediate addressing instructions:
    CALL
    CALLA
    CALLB
    CALLD
    The immediate addressed JMP doesn't mention C & Z.

    The hardware stack PUSH/POP were on that list but will be removed now that they do a full 32-bit operation.


    All those instructions now handle C/Z at bits 31/30, with a 10-bit empty space before the 20-bit address.

    Thanks for making the list. I'll use it to review the code changes. I think I've got them all, already, but I'll verify.
  • Cluso99Cluso99 Posts: 17,969
    Ariba wrote: »
    cgracey wrote: »
    Cluso99 and Garryj,

    Isn't it the case that with USB, where you are getting bytes back, not bits, you'd like an efficient means to process a whole byte, and not want to piece it out, bit by bit?

    Is there ever a case where the data to be CRC'd is not a multiple of 8 bits? I know the initial USB stuff is 5 bits, so it might warrant a bitwise CRC function, but isn't most everything else multiples of bytes?

    Yes, a CRC instruction for 1 bit does not make much sense.
    We already have solutions that calculate a CRC16 for a whole byte in 11 clock cycles:
    forums.parallax.com/discussion/comment/1369775/#Comment_1369775
    A loop with 8 one bit calculations will always be slower.

    A CRC calculation in parallel would make sense: The instruction takes the byte and starts a calculation in hardware. After min. 8 cycles you can read the result and add it to the current CRC sum.

    Andy

    The reasoning behind requesting a single bit CRC instruction is the possibility of accumulating the CRC as each bit of the byte/block is transmitted/received.

    Often there is insufficient time at the end of a block to perform the CRC calculation for the last/all bytes and get a reply out (ack or nak) in the required time. This is the case with USB and P1.

    When I asked for this instruction 4+ years ago, I had spent a lot of time understanding the USB protocol. I previously spent many years designing hardware and writing synchronous communications in the 80's and 90's.

    Please, just leave it as a single-bit CRC instruction. A lookup table can be used if you want a byte-wide CRC calculation.
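The relationship between the two approaches can be sketched in Python: a 256-entry table is simply the single-bit step applied eight times to every possible byte, so the table lookup and eight single-bit steps give the same CRC. The sketch assumes the right-shifting CRC-16 form with polynomial $A001; all names are made up for illustration.

```python
POLY = 0xA001  # bit-reversed $8005 (assumed right-shifting CRC-16)


def crcbit(crc, c):
    """Single-bit step: X = C xor CRC[0]; CRC >>= 1; if X then CRC ^= POLY."""
    x = (crc ^ c) & 1
    crc >>= 1
    return crc ^ POLY if x else crc


def make_table():
    """Each entry is the byte value run through eight single-bit steps."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = crcbit(crc, 0)
        table.append(crc)
    return table


TABLE = make_table()


def crc_byte(crc, byte):
    """Byte-at-a-time update using the lookup table."""
    return (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]


def crc_bits(crc, byte):
    """The same update done as eight single-bit steps, LSB first."""
    for i in range(8):
        crc = crcbit(crc, (byte >> i) & 1)
    return crc


a = b = 0
for ch in b"123456789":
    a = crc_byte(a, ch)
    b = crc_bits(b, ch)
print(hex(a), hex(b))   # 0xbb3d 0xbb3d
```

The table version does the work once per byte after the byte is complete, while the single-bit version spreads it across the bit times.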
  • Cluso99Cluso99 Posts: 17,969
    cgracey wrote: »
    evanh wrote: »
    cgracey wrote: »
    I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

    Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

    They now do this {C, Z, 10'b0, PC[19:0]}.

    Am I not seeing something?

    As long as it is consistent, either way works for me.