P2 Serial / Shift Register discussion

whicker · 2013-10-13 14:11

Cluso99 wrote: »

This generic shift register block diag adds an optional (configurable) feedback mux/xor required for NRZI as used in USB.

I have assumed that we can optionally invert the data input, data output, and the input and output clocks - these are not shown.

Looks great, very simple yet very powerful.

And yes, in addition to the inversion, there theoretically should also be a long row of FF's that are the latch / load portion below it.

How many systems shift LSB first as opposed to MSB first?
By default we should settle on the orientation that hits the most use cases.

Does the P2 already have an instruction to exchange bit order? (0 <-> 31, 1 <-> 30, 2 <-> 29, etc) ?

evanh · 2013-10-13 14:53

whicker wrote: »

How many systems shift lsb first as opposed to msb first?
by default we should settle on the orientation that hits the most use cases.

UARTs (RS232 and co) are little endian. SPI is big endian for the most part. Ethernet is big endian. IP is conceptually big endian (I'm not sure if there is hardware). USB is little endian.

Correction: From Wikipedia: "Ethernet transmission is strange, in that the octet order is big-endian (leftmost octet is sent first), but bit order little-endian (rightmost, or LSB (Least Significant Bit) of the octet is sent first)."

Mike Green · 2013-10-13 15:05

The REV instruction in both P1 and P2 reverses the bit order of a 32-bit (or smaller) value.
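For readers weighing the LSB-first versus MSB-first question, here is a minimal Python sketch of the bit reversal being asked about (0 <-> 31, 1 <-> 30, ...). It only models the effect; the `nbits` parameter is an assumption for illustration and says nothing about how the real REV operand is encoded.

```python
def rev(value, nbits=32):
    """Return the low `nbits` bits of `value` in reversed order."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)  # move the current LSB to the far end
        value >>= 1
    return result


# A byte captured LSB-first can be turned into MSB-first order with one reversal.
print(hex(rev(0x01, 8)))     # 0x80
print(hex(rev(0x00000001)))  # 0x80000000
```

If the shifter has a fixed direction, one such reversal per received word would presumably be enough to serve the other bit order.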

jazzed · 2013-10-28 11:52

Hi,

Looking at the content of this thread it seems there are lots of good details being discussed. I especially like this post even though I completely disagree with the first point because it makes buffering less usable with alternative MSB/LSB end first.

I'd like to know why buffering is not given more attention.

There are a lot of software concepts/solutions being discussed as an alternative to DMA-like buffering. It would of course be impossible for a SERDES buffer engine to directly access HUB memory because of the shared nature of that memory. However, there is the CLUT memory, which can act like one or more buffers. That memory can (or at least should!) transfer quickly to/from HUB memory ... which makes me wonder why a blocking HUB-CLUT transfer mechanism is not implemented.

One of the great points of having a SERDES integrated into the chip would be performance. If reasonable performance is not achievable, then maybe the entire idea is bogus.

Bill Henning · 2013-10-28 13:57

The CLUT is only dual ported, so if the SERDES used it, the video generator or SDRAM XFER would be blocked; making the clut triple ported would use too many transistors.

With a single byte/word/long buffer, a single task could easily be used to make a FIFO / LIFO / circular buffer in either CLUT or cog ram for serial data transfers up to clkfreq/2, leaving three tasks available for other purposes.
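As a rough illustration of the software side of this, here is a minimal Python sketch of a circular buffer that a task could fill from a one-word hardware latch. The class and method names are hypothetical, and the list stands in for a region of CLUT or cog RAM.

```python
class RingBuffer:
    """Fixed-size circular FIFO; `buf` stands in for CLUT or cog RAM."""

    def __init__(self, size):
        self.buf = [0] * size
        self.head = 0   # index of the next slot to write
        self.tail = 0   # index of the next slot to read
        self.count = 0  # number of words currently stored

    def put(self, word):
        """Store one word; return False (word dropped) if the buffer is full."""
        if self.count == len(self.buf):
            return False
        self.buf[self.head] = word
        self.head = (self.head + 1) % len(self.buf)
        self.count += 1
        return True

    def get(self):
        """Return the oldest word, or None if the buffer is empty."""
        if self.count == 0:
            return None
        word = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        self.count -= 1
        return word


fifo = RingBuffer(4)
for word in (0x11, 0x22, 0x33):
    fifo.put(word)          # receiving task: one put per latched word
print(hex(fifo.get()))      # 0x11
```

The receiving task only has to call `put` once per latched word, which is why a single latch plus software can stand in for a deeper hardware queue as long as the task keeps up.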

Basically I believe that a larger buffer for SERDES may make sense for Prop3, but I am itching too much to get Prop2 in my hands as soon as possible to support it.

Cluso99 · 2013-10-28 16:00

Here are my thoughts on a few things raised...

1. I don't mind if the CLUT is used as the buffer. I don't believe that SDRAM or Video would be used in this cog anyway, so if that were a restriction I would not see that as a problem.

2. I have tended to ignore the buffering idea. I wanted a SIMPLE SERDES. Everything we add complicates the design and puts more pressure on it not being done and/or risks that the whole concept fails. Currently we don't have anything - well Chip added async 8/32/36 but it's not usable for SERDES. I would rather have a simple set of 32(36) bit shift registers that could be chained/paralleled and able to do both directions because I see a lot of possibilities that others will think of later.

In my quest to do FS USB, I have found what I believe to be the biggest restrictions in software, and it is those hw assist features that I would like the most. Currently we have to do bit banging for both receive and transmit. With very simple hw assist we can do transmit easily (the video in P1 could do this, but not in P2 as I understand). If we can receive a chain of bits, then sw is at least free to do some other housekeeping while the hw receives "some" bits. As long as we can calculate the number of bits, and look at the compiled bits, we at least have removed the timing constraints of bit-banging the receive. A few simple hw instructions would help here immensely. And if done right, they could become useful for other purposes. For instance, there unfortunately are numerous CRC algorithms used. If the instruction could be general purpose acting on a cog register (cog ram) or else the ACCx internal register, we have improved the P2's capability beyond USB.

Summary

I guess that is why I would rather a simple serdes, based loosely on 74xx595's as Bill also suggested.

Unfortunately I must go out for most of the day.

What I am trying to draw up is "THE" base SERDES shift register with some "optional/configurable" additions. We only need to draw a single one of these. Once we get this base done (and can mostly agree on this) then we can progress to the next stage and see how we can modify a second one to become a "latch" or a "counter" or whatever else you think. Remember, we really need to start with the basics. If we get time to progress further (before Chip comes up for air) then great.

So how about we work together instead of fragmenting the discussions?

My last simple block diag was post #60.
Is this too restrictive? If so, how can you make it more generalised?
What do we need to add to this to make it better?
And just remember, we do have a limited transistor budget, and we don't want it to be too complex that it imposes risks - we don't have a lot of time to debug.

Bill Henning · 2013-10-28 16:46

I think we need an input latch, and an output latch, to decouple the timing some... we don't want a cog busy waiting to catch the last bit coming in, much better to wait for a "byte/Nbits/long" ready... also the data rate is not nearly as high as video or sdram, so I don't think we need to use the CLUT as a buffer.

Due to the transistor budget, a single latch per SERDES input and SERDES output is (IMHO) enough - it would allow one task of four to keep up with clkfreq/2 input or output. I also want to keep Chip's async modes

Once Chip surfaces from Verilog-ing the new instructions I am sure he will pipe up with what he has been considering.

Sapieha · 2013-10-28 17:14

Hi Cluso.

I haven't looked at USB 1.1, but USB 2.0 uses only

usbf_crc16.v usbf_crc5.v

jazzed · 2013-10-28 18:47

Cluso99 wrote: »

So how about we work together instead of fragmenting the discussions

You asked me to contribute here (twice!). I did (twice now.)

I'll take a simple SERDES without buffer queues. The question is, will it be useful enough to implement?

Bill's buffering notes are spot-on. Using CLUT would only be required for buffer queues since HUB access is relatively slow.

Cluso99 · 2013-10-28 20:30

jazzed/Steve,
I thought I did answer both your posts, maybe in general but I did answer. My last post was in haste as I had to go out (these things happen). I am back now.

Looking at the content of this thread it seems there are lots of good details being discussed. I especially like this post even though I completely disagree with the first point because it makes buffering less usable with alternative MSB/LSB end first.

I am interested to see why you think the REV instruction cannot be used to do the lesser case of MSB first?

I'd like to know why buffering is not given more attention.

There is a lot of software concept/solution being discussed as an alternative to DMA-like buffering. It would of course be impossible for a SERDES buffer engine to directly access HUB memory because of the shared nature of that memory. However, there is the CLUT memory which can act like one or more buffers. That memory can (or should at least!) be able to transfer quickly to/from HUB memory ... makes me wonder why a blocking HUB-CLUT transfer mechanism is not implemented .

Chip probably has this mastered already from the video in P1, and the P2 registers.
As I said, I prefer to get a simple agreed serdes first, then look at the buffering. If we run out of time or whatever, at least with simple serdes we can assist with sw to do some of these things, even without buffering. We can use the waitcnt to set us up at the right point in time to read the serial input, and to set the serial output. Hence my priority in working out a generic serdes initially.

Bill's suggestion of a wait byte/nbits/long seems very interesting. I think the merits of this need more discussion.
So Steve, I agree the CLUT is not necessary to be used in special hw.

One of the great points of having a SERDES integrated into the chip would be performance. If reasonable performance is not achievable, then maybe the entire idea is bogus.

Absolutely. But don't leave it out because you don't think it is achievable. USB FS is 12Mbps. HS is not achievable unless virtually all of it is in hw, so it's not an option, and too much risk to do in hw.

Cluso99 · 2013-10-28 20:54

jazzed/Steve,
Here is your reply in the "No Simple P2 SERDES". I did in fact reply, but here it is addressed in detail (though it was for Chip)

Chip,

Didn't realize you were considering USB as well, but a full USB device will be more difficult. Not sure if you are considering a USB host mode like OTG or not.

I am not sure this is what Chip really means. To do the whole USB FS in hw is not feasible for the P2 - too much risk and too little time.

Propeller would be much more cost competitive if it's possible to eliminate the FTDI chip and have P2 be able to emulate a USB serial port at least.

I agree here 100%. In fact, I believe it's just as imperative as many other things. But I do know it is doable, even in sw. It is a matter of just how many cogs vs hw assist. BTW I am happy for a non-compliant but working USB FS.

I would be happy with simple synchronous and asynchronous SERDES (2 per COG?).

Yes, me too, provided synchronous means also simple serial shifter.

It's probably too much work to add Manchester, 8b/10b, 64b/66b, or other encodings. NRZ(I) is a lot of work too, but USB uses NRZI, so if you're willing to go there well, OK.

Haven't looked at the others for years. However, NRZI is quite simple to do a low form of hw assist. The bit stuffing/unstuffing is more complex to get right, but again doable. But it needs to be optional because there are various methods besides USB.

Honestly, I think there is too much technical risk in going beyond a simple SERDES. Schedule risk is a big deal too; it has to be finished soon.

This is where I started. Chip seems happy enough to add some support for us to do other things. My BIG POINT is they need to be optional/configurable.
And a generic SERDES will have way more uses besides what we are all thinking about now.

What is a simple SERDES? In the simplest form, a synchronous serial to parallel, parallel to serial converter ....

I'm sure there are more details to consider, but to me a summary of features would be:

1. Synchronous SERDES could be a simple extension of the current SERA/B UART.

In fact it is less, with the async being configurable options.

2. Assuming 2 SERDES per COG/core, operation only needs to be half-duplex.
3. A master clock transmit/receive. P2 generates clock that devices follow for data IO.
4. A slave clock transmit/receive. P2 uses a clock provided by a separate master for data IO.
5. A clock edge select. Which edge is used to clock bits in or out (I.E. SPI device modes).

2,3,4,5 Agreed but could be enhanced quite simply.

6. Some form of chip select does not need to be controlled by the SERDES.

Yes, control by sw, no hw support.

7. SERDES should be able to transfer data up to clock/2 in master mode.

Perhaps possible only in its simplest form. Best to aim for this.

8. Only one data element at a time needs to be received/transmitted.

Not sure I really understand this... Do you mean half duplex? Do you mean SPI x1bit? Other?

9. A set of TX/RX queue pointers would be a nice bonus for buffering.

Do we now agree this is no longer a requirement since we are not using multiple buffering (CLUT)?

10. Input/output would be single ended or differential.

Both optional - IIRC this is already configurable in the I/O pins- otherwise we definitely need it for TX USB. RX USB could get around this in sw.

11. Per SERDES config and status registers.

Config register(s) include:
First bit mode which end is sent/received first.
Number of bits required per transfer (7,8,32,etc...).
Clock mode/phase and baudrate.
Optional buffer queue pointer, head, and tail.

Status register(s) include:
Data element receive ready and transmit complete.
Optional buffer queue status such as over-flow, byte count(?), etc...
Line error bits, etc....

This definition allows for SPI master/slave. The chip-select logic for SPI would be controlled only through code (not the SERDES). As you can see, I've had a change of heart over the need for SPI slave device mode. If someone wants P2 to be a peripheral for a CPU like ARM or x86, then SPI slave mode is necessary.

I'm sure there is more to consider.

Good luck Chip.
--Steve

No need to comment now as this will fall out of the ultimate specs provided all features can be optional.

Steve, I hope this addresses your issues.

Cluso99 · 2013-10-28 21:00

Sapieha wrote: »

Hi Cluso.

I dont have looked on USB 1.1 --- But USB 2.0 use only
usbf_crc16.v usbf_crc5.v

Yes, both use the same CRCs. Both CRC5 and CRC16 (known as IBM) are used. CRC5 can be done quite simply at a time when timing is not that important. There are a few CRC16 variants. It would be nice to do both the IBM variant x^16+x^15+x^2+1 ($A001 and initial value $FFFF) and the CCITT variant x^16+x^12+x^5+1 ($8408 and $FFFF). Xmodem incorrectly calculates the CCITT value (transmits reverse order of the CRC16 IIRC), and so is another variant.

Perhaps there is an opportunity to do other polynomials, perhaps not - anyone want to discuss??
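To make the "other polynomials" question concrete, here is a minimal Python sketch of a right-shifting (LSB-first) CRC where the polynomial and initial value are plain parameters. The function name is hypothetical. The polynomials are given in bit-reversed form, with x^16+x^12+x^5+1 taken as $8408. The sketch returns the raw register value; USB additionally inverts the result before sending it, which is left out here.

```python
def crc_reflected(data, poly, init):
    """Bitwise CRC, LSB of each byte first, using a bit-reversed polynomial."""
    crc = init
    for byte in data:
        for i in range(8):
            bit = (byte >> i) & 1
            feedback = (crc ^ bit) & 1  # incoming bit XOR current low bit
            crc >>= 1
            if feedback:
                crc ^= poly
    return crc


IBM_POLY = 0xA001    # x^16 + x^15 + x^2 + 1, bit-reversed
CCITT_POLY = 0x8408  # x^16 + x^12 + x^5 + 1, bit-reversed

msg = b"123456789"
print(hex(crc_reflected(msg, IBM_POLY, 0xFFFF)))    # 0x4b37
print(hex(crc_reflected(msg, CCITT_POLY, 0xFFFF)))  # 0x6f91
```

Only the constant changes between the two variants, which suggests a hardware version with a polynomial operand would cost little more than a fixed one.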

jazzed · 2013-10-28 21:26

Cluso99 wrote: »

jazzed/Steve,
I am interested to see why you think the REV instruction cannot be used to do the lesser case of MSB first?

It can be used if there are no buffer queues (the lesser case?), but like everything else software driven, it will be slower.

Cluso99 wrote: »

Chip probably has this mastered already from the video in P1, and the P2 registers.

I'm fairly sure he won't need (or even seek) help in the actual logic design. He doesn't strike me as a copy/paste kind of guy. Should have seen his reaction when I suggested putting an ARM core into P2 ... LOL.

Cluso99 wrote: »

Bill's suggestion of a wait byte/nbits/long seems very interesting. I think the merits of this needs more discussion.

Yes, I agree. To me it is obvious from looking at other design specifications that it is important. I.e. 5, 6, 7, 8, 9 bits among others come up. I expect other bit counts would be useful.

One problem with length is in packet and/or TDM communications .... even 128 bits is not big enough. Don't get me wrong, I do love the idea of chaining 4x32 registers together.

Packet length and NRZ/NRZI encoding are important considerations for a useful SERDES. Even USB uses NRZI and isochronous clocking right?

Cluso99 wrote: »

So Steve, I agree the CLUT is not necessary to be used in special hw.

Well I suppose that even with buffer queues the SERDES DMA engine would be able to do rdquad/wrquad through the HUB access window. Unfortunately, that isn't really practical either because other "tasks" in the COG would need to arbitrate for window access. Just dumping packets to/from CLUT for processing would be easier all around for any in-COG processing. Of course if you want to use another COG for SERDES I/O data processing, the CLUT wouldn't be worth much ... unless there is some kind of atomic CLUT-HUB transfer mechanism.

I'm fond of a CLUT-HUB transfer idea if that was possible. The rdquad and rdcache commands have turned out to be too slow in my opinion.

Cluso99 wrote: »

... too much risk to do in hw.

Ya, that is certainly on my mind. Schedule risk is also important. I'm afraid that if this puppy doesn't fly soon, we will all suffer.

The simplest thing to do would be to waste a COG generating start/stop bits and clocking in Chip's current design with SERA/B.

Cluso99 · 2013-10-28 22:42

Here is an updated block for a possible SERDES solution. Only one block is shown. Identical blocks could be used for tx and rx. Chaining should also be possible.
You will see I have muxes (each configurable by a bit) for things like
1. Start bit bypass
2. Stop bit bypass
3. Address bypass
4. NRZI bypass
5. Enable unstuffing (6 consecutive zeros)
6. Fill shifter with "1" as bits shifted out

Please note none of this logic is completely verified, so take it as a representative block diagram only.
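As a reference for item 5, here is a minimal Python sketch of USB-style bit stuffing and unstuffing on the decoded data stream. It assumes the USB convention that a 0 is inserted after six consecutive 1s in the data; the diagram counts zeros, which presumably reflects the polarity of the XOR output at that point. The function names are hypothetical.

```python
def stuff(bits):
    """Insert a 0 after every run of six consecutive 1s."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 6:
            out.append(0)  # the stuffed bit
            run = 0
    return out


def unstuff(bits):
    """Drop the bit that follows six consecutive 1s; it must be a 0."""
    out, run = [], 0
    stream = iter(bits)
    for b in stream:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 6:
            stuffed = next(stream, None)  # None if the stream ends here
            if stuffed == 1:
                raise ValueError("bit-stuff error: seven 1s in a row")
            run = 0
    return out


print(stuff([1] * 7))           # [1, 1, 1, 1, 1, 1, 0, 1]
print(unstuff(stuff([1] * 7)))  # [1, 1, 1, 1, 1, 1, 1]
```

In hardware the same behaviour would come from a small run counter that suppresses one shift clock when it reaches six, so the option stays a single configuration bit.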

Cluso99 · 2013-10-28 23:04

jazzed wrote: »

It can be used if there are no buffer queues (the lesser case?), but like everything else software driven, it will be slower.

Yes, but a 1 clock instruction per byte/nbits/long seems reasonable for the lesser used case of MSB first. I think there is a reasonable amount of logic to reverse it in hw, but we can ask. Just need to set the priorities as we see it so Chip can look at the options and decide what's best.

I'm fairly sure he won't need or (even seek) help in the actual logic design. He's doesn't strike me as a copy/paste kind of guy. Should have seen his reaction when I suggested putting an ARM core into P2 ... LOL.

Sure. But it's nothing like an ARM!

Yes, I agree. To me it is obvious from looking at other design specifications that it is important. I.e. 5, 6, 7, 8, 9 bits among others come up. I expect other bit counts would be useful.

It is easy to do a waitcnt for a number of bits, even pollcnt. But what caught my eye was Bill's "WAIT" instruction for a number of bits received, rather than just a blanket timing of bits with waitcnt - meaning we don't need to sync the sw on the start/first bit.

One problem with length is in packet and/or TDM communications .... even 128 bits is not big enough. Don't get me wrong, I do love the idea of chaining 4x32 registers together. Packet length and NRZ/NRZI encoding are important considerations for a useful SERDES.

Yes. Packet length seems to vary. But what I see so far on USB is they are fixed in blocks that the device can specify. So, USB is reading multiple bytes even though the protocol really specifies bits.

Even USB uses NRZI and isochronous clocking right?

Yes

Well I suppose that even with buffer queues the SERDES DMA engine would be able to do rdquad/wrquad through the HUB access window. Unfortunately, that isn't really practical either because other "tasks" in the COG would need to arbitrate for window access. Just dumping packets to/from CLUT for processing would be easier all around for any in-COG processing. Of course if you want to use another COG for SERDES I/O data processing, the CLUT wouldn't be worth much ... unless there is some kind of atomic CLUT-HUB transfer mechanism.

At this stage, I am not fond of any of this because of the risk and timing. Don't forget we do have the PortD bits and IIRC the 4 remaining PortC bits that are directly linked between cogs.

I'm fond of a CLUT-HUB transfer idea if that was possible. The rdquad and rdcache commands have turned out to be too slow in my opinion.

I disagree, but the point is moot anyway.

Ya, that is certainly on my mind. Schedule risk is also important. I'm afraid that if this puppy doesn't fly soon, we will all suffer.

The simplest thing to do would be to waste a COG generating start/stop bits and clocking in the Chips current design with SERA/B.

These were my thoughts too. But Chip went ahead and added the async UART features, so we can most likely have both. Otherwise I am very much in favour of removing the start/stop section.

One last point...
I would like 2 hw instructions to aid in USB (with likely other uses). I mentioned them above...
1. An XOR instruction to work on an input pin with the C flag, result in C. Optional Z set according to the input pin.
2. A 1bit CRC16 instruction using the C flag (from above). This could be more complex for other CRCs or simply just for USB. It is a simple XOR of a few bits if the C flag is set, and then the result is shifted 1 position (right IIRC). I will draw a block of this later (there are a number of blocks on the internet but there seems to be an error in the one I was thinking of posting). BTW I have implemented CRC16 (IBM) in sw in micro designs in the early 80's.

Cluso99 · 2013-10-28 23:56

Here is the instruction sequence required to read a bit at 12MHz for FS USB on the P2. It needs to execute every 6.667 clocks at 80MHz.

              waitcnt   delay, bittime
              test      K, pina wz              ' 1
              muxz      lastbit, mask wc        ' 2  - top 2 bits contain previous and new values..
              shl       lastbit, #1             ' 3  - parity check 00 or 11 is odd and 01 or 10 is even
              test      JK, pina wz             ' 4  - Check for EOP
        if_z  jmp       #wfe                    ' 5  - (wait for end)
              rcr       data, #1                ' 6  - rotate carry from xor into top bit of acc
              rcl       stuffcnt, #6 wz         ' 7  - If the entire register is zero, we need to unstuff (6 bits in)
        if_nz jmp       #rxnostuff              ' 8  - no need to unstuff next bit

This sequence is actually doing the NRZI for each bit...

              test      K, pina wz              ' 1
              muxz      lastbit, mask wc        ' 2  - top 2 bits contain previous and new values..
              shl       lastbit, #1             ' 3  - parity check 00 or 11 is odd and 01 or 10 is even

By keeping the last bit received in the C flag, and a special P2 instruction, we could hopefully do this...

              xorc      pina  wc                 ' C XOR Pina -> C

"wc" need not be specified. It would also be nice to optionally use WZ to set the Z flag to the value of pina.
This would take 1 clock instead of 3 instructions (3 clocks), a saving of 2 clocks.
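For reference, the XOR-with-previous-level idea can be modelled in a few lines of Python. This is only a sketch of NRZI as USB uses it (a 0 is sent as a level change, a 1 as no change), not a definition of the proposed instruction; the function names and the `idle` parameter are assumptions.

```python
def nrzi_encode(bits, idle=1):
    """Toggle the line level for every 0 bit; hold it for every 1 bit."""
    level, out = idle, []
    for b in bits:
        if b == 0:
            level ^= 1
        out.append(level)
    return out


def nrzi_decode(levels, idle=1):
    """XOR each sample with the previous one; a transition means data 0."""
    prev, out = idle, []
    for level in levels:
        transition = level ^ prev  # the XOR the proposed instruction would compute
        out.append(0 if transition else 1)
        prev = level
    return out


sync = [0, 0, 0, 0, 0, 0, 0, 1]       # USB sync field as data bits
line = nrzi_encode(sync)
print(line)                           # [0, 1, 0, 1, 0, 1, 0, 0]
print(nrzi_decode(line) == sync)      # True
```

Note that the raw XOR result is the inverse of the USB data bit, so either the hardware or the software has to account for that polarity.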

Cluso99 · 2013-10-29 00:00

Another help would be calculating the CRC16. A lookup table is often used to get a byte at a time.

A special instruction for calculating each bit into CRC16 could be (uses 1 clock per bit in sw)...

              crc16     crcbits, poly wc      ' C --> (crcbits + (crcbits[16] XOR $A001) ) >> 1

Note: crcbits is preset to $FFFF and the poly is $A001 (x^16+x^15+x^2+1). This can be represented by...
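One way to represent the per-bit update is the following Python sketch. It is my reading of the proposal (feedback = incoming bit XOR the low bit of the register, shift right, then XOR with the polynomial when feedback is set), not a confirmed definition of the instruction. The function names are hypothetical. The sketch also builds the byte-wide lookup table mentioned above from the same single-bit step, so the two approaches can be compared.

```python
def crc16_step(crc, c_flag, poly=0xA001):
    """One bit of a right-shifting CRC16, with the incoming bit in c_flag."""
    feedback = (crc ^ c_flag) & 1
    crc >>= 1
    if feedback:
        crc ^= poly
    return crc


def build_table(poly=0xA001):
    """256-entry table: the effect of eight single-bit steps per byte value."""
    table = []
    for value in range(256):
        crc = value
        for _ in range(8):
            crc = crc16_step(crc, 0, poly)
        table.append(crc)
    return table


def crc16_bitwise(data, crc=0xFFFF):
    """One step per received bit, LSB of each byte first."""
    for byte in data:
        for i in range(8):
            crc = crc16_step(crc, (byte >> i) & 1)
    return crc


def crc16_table(data, table, crc=0xFFFF):
    """One table lookup per byte; gives the same result as crc16_bitwise."""
    for byte in data:
        crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF]
    return crc


table = build_table()
msg = b"123456789"
print(hex(crc16_bitwise(msg)))                           # 0x4b37
print(crc16_bitwise(msg) == crc16_table(msg, table))     # True
```

The bitwise form fits a receiver that handles bits as they arrive, while the table form trades 256 words of memory for one lookup per byte.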

Sapieha · 2013-10-29 03:52

Hi Cluso.

If you want to look at the entire Verilog code for USB 2.0, look in the attachment.

ctwardell · 2013-10-29 04:07

Cluso99 wrote: »

Bill's suggestion of a wait byte/nbits/long seems very interesting. I think the merits of this needs more discussion.

I'm pretty sure Bill's suggestion went along with having input/output latches, otherwise the thread must still read/load before the next shift.

I'm glad Bill used the term 'latch' instead of 'buffer', I get the feeling people are taking 'buffer' to mean more than a single level latch.

C.W.

average joe · 2013-10-29 05:37

Captain obvious chiming in here... CRC16 and 4 SERDES would make 4 bit SD much faster.

Just saying...

Bill Henning · 2013-10-29 07:15

Good start...

IMHO, it also needs:

- insertion of optional start bit & stop bit (to keep async uart functionality without software baby sitting transfers)
- bit counter, with match register so that input data can be latched for the cog when bit count matches, and set "DREADY" bit
- match register for output data, so cog's buffer latch can be latched into shift register when it is ready (if available) with a DREQ bit to let cog know it needs to write next data

After thinking about everything, I think the un-implemented I/O bits should be mapped as follows (unique to every cog):

- IO92 = SERDES0 DREADY bit
- IO93 = SERDES0 DREQ bit
- IO94 = SERDES1 DREADY bit
- IO95 = SERDES1 DREQ bit

The beauty of the above is that no new instructions will be needed to wait for SERDES status, existing WAITPEQ/PNE would work well :-)
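To make the counter / match register / latch idea easier to discuss, here is a minimal Python model of the receive side. Everything in it (class name, attribute names, the `lsb_first` option) is hypothetical; it only illustrates the behaviour described above, where the cog sees a DREADY flag once the programmed number of bits has arrived.

```python
class SerdesRx:
    """Behavioural model: shifter + bit counter + match register + input latch."""

    def __init__(self, match_bits, lsb_first=True):
        self.match_bits = match_bits  # match register: bits per transfer
        self.lsb_first = lsb_first
        self.shift = 0                # shift register contents
        self.count = 0                # bit counter
        self.latch = 0                # input latch read by the cog
        self.dready = False           # would be wired to an otherwise unused IO bit

    def clock_in(self, bit):
        """Called once per serial clock with the sampled input bit."""
        if self.lsb_first:
            # Placing the bit at position `count` gives the same word as
            # shifting right from the top in hardware.
            self.shift |= (bit & 1) << self.count
        else:
            self.shift = (self.shift << 1) | (bit & 1)
        self.count += 1
        if self.count == self.match_bits:
            self.latch = self.shift   # hand the completed word to the latch
            self.dready = True
            self.shift = 0
            self.count = 0

    def read(self):
        """Cog reads the latch, which clears DREADY."""
        self.dready = False
        return self.latch


rx = SerdesRx(match_bits=8)
for bit in [1, 0, 0, 0, 0, 0, 1, 1]:  # 0xC1 sent LSB first
    rx.clock_in(bit)
print(rx.dready, hex(rx.read()))      # True 0xc1
```

The cog then has a full word time to call `read` before the next word completes, instead of having to catch the last bit as it arrives.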

Now if Chip can add USB style bit stuffing and NRZI that would be fantastic, if not, I suspect a lookup in CLUT would help a lot with that with the new single cycle random access CLUT instructions.

Cluso99 wrote: »

Here is an updated block for a possible SERDES solution. Only one block is shown. Identical blocks could be used for tx and rx. Chaining should also be possible.
You will see I have muxes (each configurable by a bit) for things like
1. Start bit bypass
2. Stop bit bypass
3. Address bypass
4. NRZI bypass
5. Enable unstuffing (6 consecutive zeros)
6. Fill shifter with "1" as bits shifted out

Please note none of this logic is completely verified, so take it as a representative block diagram only.

Bill Henning · 2013-10-29 07:24

yep, you are right! I wanted "input bits" counter / match register, "output bits" counter / match register, and a wait for them. Mind you, in the previous message I thought of a cute way of mapping it to existing WAITPEQ...

I think SERDES is only marginally useful without the latches, as in high speed cases a task would no longer be enough... a single simple latch for input, and another for output, allows a thread to process SERDES as fast as the data could be stored into the CLUT or cog ram, making a deeper buffer (FIFO, LIFO, circular) a simple software issue instead of more transistors.

We must keep the start/stop bit as an option as well... high speed async is not to be sneered at - I already have a lot of interesting ways to use it.

Regarding MSB/LSB first... we could simply control the shift direction for input and output - problem solved.

ctwardell wrote: »

I'm pretty sure Bill's suggestion went along with having input/output latches, otherwise the thread must still read/load before the next shift.

I'm glad Bill used the term 'latch' instead of 'buffer', I get the feeling people are taking 'buffer' to mean more than a single level latch.

C.W.

Bill Henning · 2013-10-29 07:25

Yes, but it would need 4 SERDES as the %!%$#@$! 4 bit standard requires a separate CRC16 for each of the four data lines.

I also seem to remember that it is patent encumbered (I could be wrong).

average joe wrote: »

Captain obvious chiming in here... CRC16 and 4 SERDES would make 4 bit SD much faster.

Just saying...

kuba · 2013-10-29 10:29

I think that any serious I/O functionality needs to take into account the reality for higher speed interfaces: I don't think a professional product would attach either USB or Ethernet physical lines to P2 pins, unless such pins were explicitly designed and qualified for compliance in such a use.

You get an external PHY that has some sort of a clocked parallel interface. That's how you get compliant Hi-Speed USB and 100Mbit Ethernet without going too protocol-specific. Usually it's clocked nibble-at-a-time.

Thus, I think that practical USB and Ethernet is wholly orthogonal from SPI, UART/ASYNC and I2S-like interfaces. If one wishes to support both 1-bit and more-than-one-bit clocked interfaces in one logic block, that's a valid choice, but not a necessary one (IMHO). Realistic USB and Ethernet support is greatly helped with a built-in bit pattern comparator to detect sync. Another helpful thing is a configurable-feedback CRC instruction, where you can set up the polynomial and feed it a settable number of bits at a time (ideally: any number, from 1 bit up to width of polynomial).
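As a rough picture of the bit pattern comparator, here is a minimal Python sketch that slides a window over an incoming bit stream and reports where a sync pattern ends. The function name and return convention are assumptions for illustration only.

```python
def find_pattern(bits, pattern):
    """Return the index of the first bit after `pattern`, or None if absent."""
    width = len(pattern)
    target = 0
    for b in pattern:
        target = (target << 1) | b
    mask = (1 << width) - 1
    window = 0
    for i, b in enumerate(bits):
        window = ((window << 1) | b) & mask  # shift register of the last `width` bits
        if i + 1 >= width and window == target:
            return i + 1
    return None


stream = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
print(find_pattern(stream, [0, 0, 0, 0, 0, 0, 0, 1]))  # 10
```

In hardware this would be a shift register compared against a configurable pattern on every clock, raising a flag that software can wait on.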

This is of course all rambling without a solid proposal to back it up, so I'll see if I can come up with something coherent.

Bill Henning · 2013-10-29 10:36

For a lot of applications, being able to do low speed USB (12Mbps) is plenty - so supporting it without an external PHY (that would chew up a lot of pins) is a big win.

High speed is 480Mbps - for this we need an external PHY.

10Mbps Ethernet should be bit-bangable on a P2 with just the addition of a mag jack, so it will be well worth doing as it provides a 10Mbps interface to the world...

Every penny saved on BOM costs saves customers many more pennies!

kuba · 2013-10-29 10:47

Bill Henning wrote: »

For a lot of applications, being able to do low speed USB (12Mbps) is plenty

This is full-speed. Low-speed is 1.5 Mbit/s and doesn't support bulk transfers, so it is way more limited and really not usable with stock Windows drivers for anything but HID-style devices. 480Mbps is hi-speed from USB 2 and is what one would consider a bare minimum for any sort of a data-storage device. It also has other benefits: since the entire upstream tree has to switch data rates, a full-speed device robs bandwidth from other devices on the tree. A flat-out full-speed device reduces the host's USB2 port to USB1.x performance. With no other devices sharing that port it's of no concern, of course, but many systems out there operate on internal hubs.

I personally would not consider designing-in full speed USB into any new product, it simply makes very little sense, especially given overally high data transfer rates possible with P2. A USB2 phy is about 1 USD in reasonable quantities. Are P2 I/O pins otherwise compliant even with USB 1.x?

Bill Henning · 2013-10-29 11:09

(Must get more coffee...)

You are of course correct about FS being 12Mbps, LS being 1.5 ... I mis-typed.

FYI, Linux works fine with CDC over LS, and look at all the AVR projects using bit-banged LS USB.

If the P2 can have FS USB without a PHY, that would matter to a lot of products. +$1 in BOM costs is significant at the retail end, not to mention using a lot fewer pins - I am looking forward to designing in FS USB into my future P2 products, for the pennies the resistors and USB socket will cost.

I believe the P2 pins will be compliant with FS, however external resistors may be needed.

For HS (480Mbps) a PHY is a reasonable choice, as it would be too much work to add support for it into the P2 at this point.

Cluso99 · 2013-10-29 14:57

Bill Henning wrote: »

Good start...

IMHO, it also needs:

- insertion of optional start bit & stop bit (to keep async uart functionality without software baby sitting transfers)

Perhaps my simplification didn't make this point clear - yes I would wish to retain the insertion/removal of start/stop bits (and addressing) that Chip has added, provided we have the option to disable them.

- bit counter, with match register so that input data can be latched for the cog when bit count matches, and set "DREADY" bit
- match register for output data, so cog's buffer latch can be latched into shift register when it is ready (if available) with a DREQ bit to let cog know it needs to write next data

After thinking about everything, I think the un-implemented I/O bits should be mapped as follows (unique to every cog):

- IO92 = SERDES0 DREADY bit
- IO93 = SERDES0 DREQ bit
- IO94 = SERDES1 DREADY bit
- IO95 = SERDES1 DREQ bit

The beauty of the above is that no new instructions will be needed to wait for SERDES status, existing WAITPEQ/PNE would work well :-)

Nice concept Bill.
Perhaps these bits (IO92-95) would be enabled by configuration bits, so we still retain the intercog usage. So perhaps the configuration switches a mux to use an internal set of IO92-95 bits, while the other 7 cogs retain the connection and usage of IO92-95 between them.
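As a rough model of how waiting on those status bits would look from software: a WAITPEQ-style wait completes when the masked port value equals a target. The sketch below only models that comparison. The mapping of an IO number to a bit position within a 32-bit port (IO number modulo 32) and all names here are assumptions for illustration, not a description of the actual P2 port layout.

```python
PORT_WIDTH = 32  # assumed width of one I/O port register

# Status bit assignments as proposed in the post above.
SERDES0_DREADY = 92
SERDES0_DREQ = 93
SERDES1_DREADY = 94
SERDES1_DREQ = 95


def port_mask(io_number):
    """Bit mask for an IO number within its port (assumes IO number mod 32)."""
    return 1 << (io_number % PORT_WIDTH)


def peq_matches(port_value, target, mask):
    """Condition a WAITPEQ-style wait tests: masked port value equals target."""
    return (port_value & mask) == target


def pne_matches(port_value, target, mask):
    """Condition a WAITPNE-style wait tests: masked port value differs."""
    return (port_value & mask) != target
```

A receive loop would then wait with `mask = port_mask(SERDES0_DREADY)` and `target = mask`, read the data, and wait again.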

Now if Chip can add USB style bit stuffing and NRZI that would be fantastic, if not, I suspect a lookup in CLUT would help a lot with that with the new single cycle random access CLUT instructions.

Agreed, but HUB rather than the CLUT would probably be better for the lookup - but this is the programmer's choice anyway.
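For reference, this is what USB-style bit stuffing and NRZI amount to in software. It is a bit-at-a-time sketch, not the table-lookup approach discussed above, and the function names and the abstract `level` (0 or 1 standing in for the two line states) are my own. USB inserts a 0 after six consecutive 1s, and NRZI encodes a 0 as a transition and a 1 as no transition.

```python
def stuff_bits(bits):
    """Insert a 0 after every run of six consecutive 1s (USB bit stuffing)."""
    out = []
    ones = 0
    for b in bits:
        out.append(b)
        if b:
            ones += 1
            if ones == 6:
                out.append(0)   # stuffed bit forces a transition on the wire
                ones = 0
        else:
            ones = 0
    return out


def nrzi_encode(bits, level=1):
    """NRZI as used by USB: a 0 toggles the line level, a 1 leaves it alone."""
    out = []
    for b in bits:
        if b == 0:
            level ^= 1
        out.append(level)
    return out


def nrzi_decode(levels, level=1):
    """Inverse of nrzi_encode: no change decodes to 1, a change decodes to 0."""
    out = []
    for lvl in levels:
        out.append(1 if lvl == level else 0)
        level = lvl
    return out
```

The transmit path is `nrzi_encode(stuff_bits(data_bits))`; the receiver decodes NRZI first and then drops the bit following each run of six 1s.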

Using 2 cogs I am sure we can do FS USB, hopefully with those 2 helper instructions, even without the serdes.

Bill, I will address the latch in the next post.

Cluso99 · 2013-10-29 15:20

Latch & Buffering...

You would have noticed I have not said much about this. Many of you have said buffering is mandatory.
I disagree. I think latching (single latch) is nice to have but not mandatory and here are my reasons. Please comment...

I was thinking that the serdes (at least sync mode) would operate more like the old video/waitvid in the P2 (how I think it works anyway). That is, you load up a value and set a timer (waitcnt/passcnt) and wait for tx to be done. Similarly, in rx, sw would set up the serdes to wait for a change in input level. Sw would also have to independently be looking for this (waitPXX or polling). Once detected, the sw would read the CNT value, add n bit-times, and use this later to wait on the completed frame. So, the sw is ultimately waiting for the frame to be completed and therefore could read the serdes register directly. This avoids the latching/buffering issue.
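The rx timing described above boils down to one addition on a wrapping counter. A small sketch, assuming a 32-bit free-running CNT and a fixed number of clocks per bit; the function names and the example numbers are hypothetical:

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def frame_done_count(cnt_at_edge, nbits, clocks_per_bit):
    """CNT value at which an nbits-long frame that began at cnt_at_edge ends."""
    return (cnt_at_edge + nbits * clocks_per_bit) & COUNTER_MASK


def deadline_passed(cnt_now, target):
    """Wrap-safe check that the counter has reached or passed the target."""
    return ((cnt_now - target) & COUNTER_MASK) < (1 << (COUNTER_BITS - 1))


# Example: edge seen near the top of the counter, 10 bits at 100 clocks each.
target = frame_done_count(0xFFFFFF00, nbits=10, clocks_per_bit=100)
```

Once the counter reaches `target`, the software reads the serdes register directly, which is why no separate latch is strictly needed in this scheme.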

The reason that I am not pushing for latching is silicon and Chip's time. Latching is preferred, but not mandatory IMHO, because I would rather a simple serdes than the whole thing become too complex for the time/silicon we have available and it not go in at all.

Also, Chip has done latching before so he will know how to do it if he has the time/silicon. He has changed the instruction set and that is a huge improvement but we don't yet know what impact that has on the available silicon.

To sum up...
I would rather define the serdes better, and leave latching to Chip. If it's in then great, if not I am not going to cry. I can see so many possibilities with a generalised serdes, way beyond just UART/USB/Ethernet.

Cluso99 · 2013-10-29 15:21

Bill,
Sorry I forgot. It is not simple to just reverse the serialiser to do MSB first. Lots of silicon is required, or else read/write needs to reverse this. That is why I see the REV instruction for this.
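For anyone unfamiliar with what that reversal does, here is a software model of the bit-order swap (0 <-> 31, 1 <-> 30, and so on). The function name and the `nbits` parameter are mine; this only illustrates the operation and makes no claim about the exact operand encoding of the REV instruction.

```python
def reverse_bits(value, nbits=32):
    """Reverse the order of the low `nbits` bits of `value`."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)   # move the current LSB to the top
        value >>= 1
    return result
```

So an LSB-first serialiser can send MSB-first data if the software applies the reversal before writing, and the same after reading on receive.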
