Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i

Cluso99 · 2017-12-15 04:18

Here is a diagram of what is happening. CRC' is just the new value of CRC after the 1-bit addition.

[Attached image: P2 CRCBIT.jpg]

With the CRCBIT instruction as I have described it, it is a 2-clock instruction that can be executed as each bit is being received/sent. Thus it adds only one 2-clock instruction to the character assembly/sending loop.

However, for this to work in this instance, any internal registers used plus any interrupts etc must still be as per normal.

So perhaps there could be one or two internal registers (CRC and POLY), together with two instructions to load them, an instruction to accumulate the CRC from the C flag, and an instruction to read the CRC register at the end?

' to use this...
        SETPOLY ##$8005             ' initial value of internal POLY register
        SETCRC  #0                  ' initial value of internal CRC register (could be #FFFF)
        .....
        REPS    #1,     #8          ' 2 instruction x 8 loops
        SHR     DATA,   #1    WC    ' C:=DATA[0]      
        CRCBIT                      ' accumulate 1-bit into CRC
        GETCRC  CRC16               ' read final CRC value

To utilise the instruction as each bit in a block of characters is received/sent...

' or as receiving/sending each bit...
        SETPOLY ##$8005             ' initial value of internal POLY register
        SETCRC  #0                  ' initial value of internal CRC register (could be #FFFF)
        .....
        CRCBIT                      ' accumulate 1-bit (in Carry) into CRC (maybe 8*block)
        .....
        GETCRC  CRC16               ' read final CRC value

In theory, we could use SETQ and SETQ2 instructions to load the initial values of CRC and POLY. Then we would only require the GETCRC D instruction and a simple CRCBIT instruction using the Carry flag.

These could be used in conjunction with the NRZI helper instruction that I proposed for bit-banging USB, etc.

rjo__ · 2017-12-15 04:25

Chip,

I don't have any suggestions... I have no idea what you guys are talking about most of the time, but:

As God is my witness, I am going to have a massively parallel system to manage. I have no abstractions to offer, but I know you probably do.

This seems like the last possible moment to think about it.

Thanks,

Rich

cgracey · 2017-12-15 12:52

Cluso99,

I think it will be much simpler to just use regular register conventions, like you first showed:

CRCBIT crc,poly

...Where C is used by CRCBIT. What could be simpler than that?

I made some instruction room for the CRCBIT instruction. I moved ZEROX and SIGNX to near TESTN/TEST, so they now have C and Z output. This opens up room where they used to be for the CRCBIT instruction. After making room, everything seems fine, so I'm going to sleep some and then implement the CRCBIT instruction.

Cluso99 · 2017-12-15 13:02

cgracey wrote: »

Cluso99,

I think it will be much simpler to just use regular register conventions, like you first showed:

CRCBIT crc,poly

...Where C is used by CRCBIT. What could be simpler than that?

I made some instruction room for the CRCBIT instruction. I moved ZEROX and SIGNX to near TESTN/TEST, so they now have C and Z output. This opens up room where they used to be for the CRCBIT instruction. After making room, everything seems fine, so I'm going to sleep some and then implement the CRCBIT instruction.

Excellent thanks Chip!

cgracey · 2017-12-15 13:09

Yanomani wrote: »

Hi Chip

If I understood it correctly; having 32 flip-flops to buffer the top level of the stack means that you've gained at least two clock cycles to access the stack ram and store/retrieve each pushed/popped item, having direct access to the buffered top level meanwhile.

This also means that the stack RAM size could increase at will, limited only by physical space availability.

Based on the assumption that stack overflow/underflow is mostly a software concern, it means you could maintain two independent binary pointers that are only realigned (zeroed) at each cog start.

Also, because it's not a dual ported Fifo, meant for simultaneous read and write access, all the complexity of Hamming coded empty/full counters/detectors and synced read/write clocks could be avoided.

Hope I'd figured it right!

P.S. Sorry, I didn't figure it right.

Stacks are LIFOs by nature, so a single write/read pointer suffices for accessing them.

If it is meant to be directly accessible only by call/push and ret/pop, its addressing pointer doesn't need to be initialized to any particular value at all.

Sometimes, drinking a lot of coffee has the same effects as drinking a lot of beer or wine. My fault.

Henrique

cgracey wrote: »

Seairth wrote: »

Now that the stack has been widened to 32 bits, I expect more data to be put in it. Is 8 levels enough? I know it's not meant to be a general-purpose stack, but it does have the advantage of being a "standard location" for passing parameters and return values between third-party code.

The problem is that the stack is currently built from flipflops. To get any real size increase, we would have to go to a RAM. That might be a big change with OnSemi. I think we would have to buffer the top of the stack with 32 flipflops.

Right, it's a LIFO. Just one pointer would be needed. And the top could be buffered in flops to make it immediately accessible. It's too late to implement something like this, though. We've got OnSemi going with a fixed set of memories. Any deviation would probably cost more money.

David Betz · 2017-12-15 13:28

cgracey wrote: »

Yanomani wrote: »

Hi Chip

If I understood it correctly; having 32 flip-flops to buffer the top level of the stack means that you've gained at least two clock cycles to access the stack ram and store/retrieve each pushed/popped item, having direct access to the buffered top level meanwhile.

This also means that the stack RAM size could increase at will, limited only by physical space availability.

Based on the assumption that stack overflow/underflow is mostly a software concern, it means you could maintain two independent binary pointers that are only realigned (zeroed) at each cog start.

Also, because it's not a dual ported Fifo, meant for simultaneous read and write access, all the complexity of Hamming coded empty/full counters/detectors and synced read/write clocks could be avoided.

Hope I'd figured it right!

P.S. Sorry, I didn't figure it right.

Stacks are LIFOs by nature, so a single write/read pointer suffices for accessing them.

If it is meant to be directly accessible only by call/push and ret/pop, its addressing pointer doesn't need to be initialized to any particular value at all.

Sometimes, drinking a lot of coffee has the same effects as drinking a lot of beer or wine. My fault.

Henrique

cgracey wrote: »

Seairth wrote: »

Now that the stack has been widened to 32 bits, I expect more data to be put in it. Is 8 levels enough? I know it's not meant to be a general-purpose stack, but it does have the advantage of being a "standard location" for passing parameters and return values between third-party code.

The problem is that the stack is currently built from flipflops. To get any real size increase, we would have to go to a RAM. That might be a big change with OnSemi. I think we would have to buffer the top of the stack with 32 flipflops.

Right, it's a LIFO. Just one pointer would be needed. And the top could be buffered in flops to make it immediately accessible. It's too late to implement something like this, though. We've got OnSemi going with a fixed set of memories. Any deviation would probably cost more money.

I don't think a bigger stack would be that useful unless you could also directly address elements other than the top. That would allow you to setup stack frames needed by most high level languages. Just being able to PUSH/POP wouldn't be that useful.

cgracey · 2017-12-15 14:24

Cluso99,

I made the CRCBIT instruction as you described.

I'm not getting any results that match online CRC calculators.

Do you shift data in LSB-first or MSB-first?

I suppose for a byte, you do 8 bits. What if it's CRC16?

The online calculators seem to be shifting data left, not right. Any idea why things look this way?

I made this:

(1) X := C XOR D[0]
(2) D := D >> 1
(3) if X == 1 then D := D XOR POLY

The Verilog looks like this:

wire [31:0] crcbit = {1'b0, d[31:01]} ^ ({32{c ^ d[0]}} & s);

What do you think?
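[Editor's sketch] The three steps and the Verilog expression above can be modelled in a few lines of Python. This is only a software model, not the FPGA implementation; the function names are hypothetical. Because the register shifts right, this is the LSB-first ("reflected") form of the CRC, so the polynomial constant has to be the bit-reversed one: $A001 rather than $8005. Online calculators that show a left shift are using the MSB-first form, which may be why the results did not match. Assuming an initial value of $FFFF, the model reproduces the published CRC-16/MODBUS check value for the ASCII string "123456789".

```python
MASK32 = 0xFFFFFFFF

def crcbit(d, s, c):
    """Software mirror of: {1'b0, d[31:1]} ^ ({32{c ^ d[0]}} & s)."""
    x = (c ^ d) & 1                  # (1) X := C XOR D[0]
    spread = MASK32 if x else 0      # replicate X across all 32 bits
    return ((d >> 1) ^ (spread & s)) & MASK32   # (2) shift, (3) conditional XOR

def crc_lsb_first(data, poly, init):
    """Feed every byte to crcbit() one bit at a time, bit 0 first."""
    crc = init
    for byte in data:
        for i in range(8):
            crc = crcbit(crc, poly, (byte >> i) & 1)
    return crc

msg = b"123456789"
modbus = crc_lsb_first(msg, 0xA001, 0xFFFF)   # reflected form of $8005
print(hex(modbus))   # 0x4b37, the CRC-16/MODBUS check value
```

With an initial value of 0 instead of $FFFF, the same loop gives the CRC-16/ARC result.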

Rayman · 2017-12-15 16:17

FYI: For the USB CRC16, I used this reference and got a match:

Use this web tool with hex data and look at the Modbus result:
http://www.lammertbies.nl/comm/info/crc-calculation.html

Then you have to invert all the bits and send the LSByte first.
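[Editor's sketch] The recipe above (take the Modbus CRC16, invert all the bits, send the low byte first) might look like this in Python. The function names are hypothetical, and this is only an illustration of the recipe as described, not code from the referenced page.

```python
def crc16_modbus(data):
    """CRC-16/MODBUS: right-shifting, polynomial $A001, initial value $FFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

def usb_crc16_bytes(data):
    """Invert the Modbus result and return it in transmit order (LSByte first)."""
    crc = crc16_modbus(data) ^ 0xFFFF
    return crc.to_bytes(2, "little")

print(usb_crc16_bytes(b"123456789").hex())   # c8b4
```

For an empty payload the Modbus value stays at $FFFF, so the inverted CRC is $0000.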

garryj · 2017-12-15 17:32

From my reading on the hardware side, it's typical for a USB Serial Interface Engine to calculate a CRC MSB-first.

Here's a Microchip doc that has an overview on both hardware and software implementations:
CRC Generating and Checking

Cluso99 · 2017-12-15 17:45

Chip,
I will have to check it out. I did post working spin P1 code.

There are MSB and LSB versions of CRC, plus inversion of bits, starting values, and inversions of the POLY. Ultimately they give the same results once you know how to correct the end result.
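[Editor's sketch] As a small illustration of the point above, the Python below compares a right-shifting (LSB-first) CRC16 using $A001 with a left-shifting (MSB-first) CRC16 using $8005. All names are hypothetical. Assuming an initial value of 0, the two agree once the input bytes and the final result of the MSB-first version are bit-reversed.

```python
def rev(value, width):
    """Reverse the low `width` bits of value."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out

def crc16_lsb(data, poly_reflected=0xA001, init=0):
    """Right-shifting CRC16 (data enters at bit 0)."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly_reflected if crc & 1 else crc >> 1
    return crc

def crc16_msb(data, poly=0x8005, init=0):
    """Left-shifting CRC16 (data enters at bit 15)."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

msg = b"123456789"
lsb = crc16_lsb(msg)
msb = rev(crc16_msb(bytes(rev(b, 8) for b in msg)), 16)
print(hex(lsb), hex(msb))   # both 0xbb3d
```

Note that $A001 is simply $8005 with its 16 bits reversed.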

I am fairly sure what I gave you is correct.

Unfortunately I will not get time to verify today, as we have a family Christmas party.

Cluso99 · 2017-12-15 17:58

Chip,

One thing I miss when converting P1 PASM programs is the JMPRET (JMP/CALL/RET) instruction.

It would be nice if it's possible to add an equivalent.

Thinking about handling COG and LUT, for starters, just handle COG. We are never going to be able to store the return address in LUT anyway.

ozpropdev · 2017-12-15 23:54

Chip
What was decided on moving the CZ flags to bits 31:30 on the hardware stack?

evanh · 2017-12-16 00:10

ozpropdev wrote: »

Chip
What was decided on moving the CZ flags to bits 31:30 on the hardware stack?

Putting the stack back to a hardware width of 22 bits is the right answer.

evanh · 2017-12-16 00:17

Cluso99 wrote: »

One thing I miss when converting P1 PASM programs is the JMPRET (JMP/CALL/RET) instruction.

CALLD D,S is the Prop2 equivalent I think.

ozpropdev · 2017-12-16 00:30

evanh wrote: »

ozpropdev wrote: »

Chip
What was decided on moving the CZ flags to bits 31:30 on the hardware stack?

Putting the stack back to a hardware width of 22 bits is the right answer.

A 32 bit machine without a 32 bit wide stack seems odd to me.
Making it 32 bits keeps things consistent with the PUSHx/POPx hub stack operations.

Rayman · 2017-12-16 00:44

Shouldn't stack be 34 bits?

evanh · 2017-12-16 00:45

It's not a normal stack though, is it? It's a special tiny hardware thing that can overflow in a blink. If the PUSH/POP instructions for that hardware stack were double operand then I'd be saying those should be removed too.

We have PTRA/B for the real thing. And I'm sure ALTx instructions can do wonders for fast Cog side operations.

cgracey · 2017-12-16 01:02

I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

evanh · 2017-12-16 01:06

I'd vote more PTRx registers if they can fit.

evanh · 2017-12-16 01:11

cgracey wrote: »

I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

cgracey · 2017-12-16 01:13

evanh wrote: »

cgracey wrote: »

I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

They now do this {C, Z, 10'b0, PC[19:0]}.

Am I not seeing something?
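[Editor's sketch] The layout {C, Z, 10'b0, PC[19:0]} can be illustrated with a pair of pack/unpack helpers in Python. The helper names are hypothetical; only the bit positions (C at bit 31, Z at bit 30, ten zero bits, a 20-bit address) come from the post above.

```python
def pack_return(c, z, pc):
    """Build the 32-bit word {C, Z, 10'b0, PC[19:0]}."""
    return ((c & 1) << 31) | ((z & 1) << 30) | (pc & 0xFFFFF)

def unpack_return(word):
    """Recover (C, Z, PC) from the packed word; bits 29..20 are ignored."""
    return (word >> 31) & 1, (word >> 30) & 1, word & 0xFFFFF

word = pack_return(1, 0, 0x12345)
print(hex(word))             # 0x80012345
print(unpack_return(word))   # (1, 0, 74565), i.e. PC = 0x12345
```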

cgracey · 2017-12-16 01:17

evanh wrote: »

I'd vote more PTRx registers if they can fit.

Those take a lot of mux's to read. They are expensive, in a sense.

evanh · 2017-12-16 01:21

cgracey wrote: »

evanh wrote: »

I'd vote more PTRx registers if they can fit.

Those take a lot of mux's to read. They are expensive, in a sense.

Not to mention the instruction encoding space restrictions.

cgracey · 2017-12-16 01:22

evanh wrote: »

cgracey wrote: »

evanh wrote: »

I'd vote more PTRx registers if they can fit.

Those take a lot of mux's to read. They are expensive, in a sense.

Not to mention the instruction encoding space restrictions.

That, too, and there's no unallocated registers from $1F0..$1FF.

evanh · 2017-12-16 01:46

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

They now do this {C, Z, 10'b0, PC[19:0]}.

Am I not seeing something?

Assuming we want consistency, I count 14 instructions as listed in the Prop2 Instructions v30:
Register addressing versions:
CALLD
CALL/CALLPA/CALLPB/RET
CALLA/RETA
CALLB/RETB
Oddly, your JMP also mentions loading of C & Z from the D register.

Immediate addressing instructions:
CALL
CALLA
CALLB
CALLD
The immediate addressed JMP doesn't mention C & Z.

The hardware stack PUSH/POP were on that list but will be removed now that they do a full 32-bit operation.

Ariba · 2017-12-16 04:08

cgracey wrote: »

Cluso99 and Garryj,

Isn't it the case that with USB, where you are getting bytes back, not bits, you'd like an efficient means to process a whole byte, and not want to piece it out, bit by bit?

Is there ever a case where the data to be CRC'd is not a multiple of 8 bits? I know the initial USB stuff is 5 bits, so it might warrant a bitwise CRC function, but isn't most everything else multiples of bytes?

Yes, a CRC instruction for 1 bit does not make much sense.
We already have solutions that calculate a CRC16 for a whole byte in 11 clock cycles:
forums.parallax.com/discussion/comment/1369775/#Comment_1369775
A loop with 8 one bit calculations will always be slower.

A CRC calculation in parallel would make sense: The instruction takes the byte and starts a calculation in hardware. After min. 8 cycles you can read the result and add it to the current CRC sum.

Andy

cgracey · 2017-12-16 06:40

Ariba wrote: »

cgracey wrote: »

Cluso99 and Garryj,

Isn't it the case that with USB, where you are getting bytes back, not bits, you'd like an efficient means to process a whole byte, and not want to piece it out, bit by bit?

Is there ever a case where the data to be CRC'd is not a multiple of 8 bits? I know the initial USB stuff is 5 bits, so it might warrant a bitwise CRC function, but isn't most everything else multiples of bytes?

Yes, a CRC instruction for 1 bit does not make much sense.
We already have solutions that calculate a CRC16 for a whole byte in 11 clock cycles:
forums.parallax.com/discussion/comment/1369775/#Comment_1369775
A loop with 8 one bit calculations will always be slower.

A CRC calculation in parallel would make sense: The instruction takes the byte and starts a calculation in hardware. After min. 8 cycles you can read the result and add it to the current CRC sum.

Andy

Thanks, Andy. I'll investigate what it would take to do this in hardware, anyway, since it wouldn't require a big chunk of the LUT.

cgracey · 2017-12-16 06:44

evanh wrote: »

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

They now do this {C, Z, 10'b0, PC[19:0]}.

Am I not seeing something?

Assuming we want consistency, I count 14 instructions as listed in the Prop2 Instructions v30:
Register addressing versions:
CALLD
CALL/CALLPA/CALLPB/RET
CALLA/RETA
CALLB/RETB
Oddly, your JMP also mentions loading of C & Z from the D register.

Immediate addressing instructions:
CALL
CALLA
CALLB
CALLD
The immediate addressed JMP doesn't mention C & Z.

The hardware stack PUSH/POP were on that list but will be removed now that they do a full 32-bit operation.

All those instructions now handle C/Z at bits 31/30, with a 10-bit empty space before the 20-bit address.

Thanks for making the list. I'll use it to review the code changes. I think I've got them all, already, but I'll verify.

Cluso99 · 2017-12-16 07:50

Ariba wrote: »

cgracey wrote: »

Cluso99 and Garryj,

Isn't it the case that with USB, where you are getting bytes back, not bits, you'd like an efficient means to process a whole byte, and not want to piece it out, bit by bit?

Is there ever a case where the data to be CRC'd is not a multiple of 8 bits? I know the initial USB stuff is 5 bits, so it might warrant a bitwise CRC function, but isn't most everything else multiples of bytes?

Yes, a CRC instruction for 1 bit does not make much sense.
We already have solutions that calculate a CRC16 for a whole byte in 11 clock cycles:
forums.parallax.com/discussion/comment/1369775/#Comment_1369775
A loop with 8 one bit calculations will always be slower.

A CRC calculation in parallel would make sense: The instruction takes the byte and starts a calculation in hardware. After min. 8 cycles you can read the result and add it to the current CRC sum.

Andy

The reasoning behind requesting a single bit CRC instruction is the possibility of accumulating the CRC as each bit of the byte/block is transmitted/received.

Often there is insufficient time at the end of a block to perform the CRC calculation for the last/all bytes and get a reply out (ack or nak) in the required time. This is the case with USB and P1.

When I asked for this instruction 4+ years ago, I had spent a lot of time understanding the USB protocol. I previously spent many years designing hardware and writing synchronous communications in the 80's and 90's.

Please, just leave it as a single-bit CRC instruction. A lookup table can be used if you want a byte-wise CRC calculation.
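[Editor's sketch] To illustrate the lookup-table alternative mentioned above, the Python below builds a 256-entry table from the single-bit step and checks that the byte-wise update gives the same result as eight bit steps. The names are hypothetical, and the right-shifting form with polynomial $A001 and initial value $FFFF is an assumption carried over from the USB/Modbus discussion earlier in the thread.

```python
def crcbit(crc, poly, c):
    """One-bit step: X := C XOR CRC[0]; shift right; XOR poly if X is set."""
    x = (c ^ crc) & 1
    crc >>= 1
    return crc ^ poly if x else crc

def make_table(poly):
    """Table entry n is the CRC of byte n starting from a zero register."""
    table = []
    for value in range(256):
        crc = 0
        for i in range(8):
            crc = crcbit(crc, poly, (value >> i) & 1)
        table.append(crc)
    return table

def crc_bytewise(data, table, init):
    """One table lookup per byte."""
    crc = init
    for byte in data:
        crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF]
    return crc

def crc_bitwise(data, poly, init):
    """Eight single-bit steps per byte, LSB first."""
    crc = init
    for byte in data:
        for i in range(8):
            crc = crcbit(crc, poly, (byte >> i) & 1)
    return crc

table = make_table(0xA001)
msg = b"123456789"
print(hex(crc_bytewise(msg, table, 0xFFFF)), hex(crc_bitwise(msg, 0xA001, 0xFFFF)))
```

The table costs 256 entries of memory, which is the trade-off against running the bit step as each bit arrives.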

Cluso99 · 2017-12-16 07:51

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).

Chip, I note there is quite a number of instructions that do a 22 bit C+Z+addr store/restore.

They now do this {C, Z, 10'b0, PC[19:0]}.

Am I not seeing something?

As long as it is consistent, either way works for me.
