Prop123_A9_Prop2_v7y Intermediate Release - 14 March 2016 - Faster PINSET, all shift modes LSB-first

cgracey · 2016-03-15 14:44

Does anyone have any complaints about the always-LSB first issue?

It afforded a logic savings which made it possible to spend more logic on speeding up the PINSETx instructions, for a net performance gain, even after an extra instruction, or two, to reverse and/or shift data.

Seairth · 2016-03-15 15:24

cgracey wrote: »

Does anyone have any complaints about the always-LSB first issue?

It afforded a logic savings which made it possible to spend more logic on speeding up the PINSETx instructions, for a net performance gain, even after an extra instruction, or two, to reverse and/or shift data.

I'll try to find some time this evening to play with the changes. One thought that immediately comes to mind, though, is that the MSB receive manipulation feels odd. Not only are you using a mask (instead of a shift), but symbols greater than 9 bits in length will require an additional register to store the mask (or AUGS). One way around this would be for receive to shift left instead of shift right. With this, you end up with the following:

LSB Transmit: no manipulation
LSB Receive: REV, SHR

MSB Transmit: REV, SHR
MSB Receive: no manipulation

That way, the P2 isn't playing favorites between LSB and MSB, and it's a bit more symmetric.
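To illustrate the "REV, SHR" step in the table above, here is a minimal Python sketch of preparing an MSB-first symbol for a shifter that always sends bit 0 first. It assumes REV reverses all 32 bits of a register; the helper names `rev32` and `msb_first_tx` are made up for this example and are not P2 instructions.

```python
def rev32(value):
    """Reverse all 32 bits of value (assumed model of a full-register REV)."""
    result = 0
    for _ in range(32):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def msb_first_tx(symbol, nbits):
    """Return the word to load so an LSB-first shifter emits symbol MSB first.

    REV moves the symbol's bits to the top of the 32-bit word in reversed
    order; SHR by (32 - nbits) brings them back down to bit 0.
    """
    if not 1 <= nbits <= 32:
        raise ValueError("nbits must be between 1 and 32")
    masked = symbol & ((1 << nbits) - 1)
    return rev32(masked) >> (32 - nbits)


# 4-bit symbol %1101: the shifter must see %1011 to emit 1, 1, 0, 1.
word = msb_first_tx(0b1101, 4)
```

The same two steps work for any symbol length from 1 to 32 bits, which is why no per-length mask register is needed on this path.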

cgracey · 2016-03-15 15:48

Seairth wrote: »

cgracey wrote: »

Does anyone have any complaints about the always-LSB first issue?

It afforded a logic savings which made it possible to spend more logic on speeding up the PINSETx instructions, for a net performance gain, even after an extra instruction, or two, to reverse and/or shift data.

I'll try to find some time this evening to play with the changes. One thought that immediately comes to mind, though, is that the MSB receive manipulation feels odd. Not only are you using a mask (instead of a shift), but symbols greater than 9 bits in length will require an additional register to store the mask (or AUGS). One way around this would be for receive to shift left instead of shift right. With this, you end up with the following:

LSB Transmit: no manipulation
LSB Receive: REV, SHR

MSB Transmit: REV, SHR
MSB Receive: no manipulation

That way, the P2 isn't playing favorites between LSB and MSB, and it's a bit more symmetric.

That would be much better from a software standpoint, but it would bring back those 32 mux's we just got rid of.

Seairth · 2016-03-15 15:59

cgracey wrote: »

That would be much better from a software standpoint, but it would bring back those 32 mux's we just got rid of.

Ugh. Good point. Instead, could you flip them when they are sent over the 4-bit I/O channel?

cgracey · 2016-03-15 16:02

Seairth wrote: »

cgracey wrote: »

That would be much better from a software standpoint, but it would bring back those 32 mux's we just got rid of.

Ugh. Good point. Instead, could you flip them when they are sent over the 4-bit I/O channel?

What 4-bit channel?

Oh, I know what you mean. You are talking about the messaging from the cog to the pin.

That would definitely be simpler. We might need an extra instruction, though, to make it happen, since we sometimes don't want to have the data reversed.

Seairth · 2016-03-15 17:43

cgracey wrote: »

Seairth wrote: »

cgracey wrote: »

That would be much better from a software standpoint, but it would bring back those 32 mux's we just got rid of.

Ugh. Good point. Instead, could you flip them when they are sent over the 4-bit I/O channel?

What 4-bit channel?

Oh, I know what you mean. You are talking about the messaging from the cog to the pin.

That would definitely be simpler. We might need an extra instruction, though, to make it happen, since we sometimes don't want to have the data reversed.

Oh yeah, I suppose that buffer would contain data for other pin modes too. In that case, leave it as things are currently.

Tubular · 2016-03-15 19:05

I'm ok with the REV instruction reversal.

However, there might be an alternate way: use an alternative instruction to PINSETY (PINSETR? PINMSBY?) that would shift left rather than shift right on the originating data. The bulk of the logic would be at the cog end (x16 rather than x64), and most of it probably pre-exists from other instruction muxing.

jmg · 2016-03-15 19:24

cgracey wrote: »

Does anyone have any complaints about the always-LSB first issue?

It afforded a logic savings which made it possible to spend more logic on speeding up the PINSETx instructions, for a net performance gain, even after an extra instruction, or two, to reverse and/or shift data.

The litmus test here, is can the COG keep-up with the fastest Serial rates, for all bit-lengths ?

With the extra opcode-time, that is a yes ?

IIRC that is SysCLK/2, for Sync Master, and Async Transmit, and Aligned Async Receive ? (Ozpropdev seemed to hint at some issues with 40MHz ?)

ozpropdev · 2016-03-17 11:44

cgracey wrote: »

PINSETx instructions are now 10 clocks, down from 18.
All serial modes are LSB-first.

Looking good Chip!
Your smartpin tweaks have improved sync. comms to 20Mbaud through my Ft2232H (fast serial mode)
Now blasting 384k in 216mS.
Silicon should hit 40Mbaud Ok.
Nice work!!

jmg · 2016-03-17 20:35

ozpropdev wrote: »

Looking good Chip!
Your smartpin tweaks have improved sync. comms to 20Mbaud through my Ft2232H (fast serial mode)
Now blasting 384k in 216mS.
Silicon should hit 40Mbaud Ok.
Nice work!!

Yes, sounds great.

I get 20M/(384k/216m) = 11.25 bits, which seems to be the FTDI limit, not P2.
(Their fast serial has some added latency, made up for by 50MHz max FSCLKs)
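The 11.25 figure can be reproduced with a short Python calculation. This is only a sketch of the arithmetic: it assumes "384k" means 384,000 bytes (with 384 × 1024 bytes the result would be slightly under 11), and the function name is made up for the example.

```python
def bit_times_per_byte(bit_rate_hz, payload_bytes, elapsed_s):
    """Return how many bit periods elapse, on average, per byte transferred."""
    bytes_per_second = payload_bytes / elapsed_s
    return bit_rate_hz / bytes_per_second


# 20 Mbaud link, 384,000 bytes (assumed) moved in 216 ms.
bit_times = bit_times_per_byte(20e6, 384_000, 0.216)
print(round(bit_times, 2))
```

Anything above the bit count of one framed byte is idle time between bytes, so the average per-byte figure shows how much gap the link is inserting.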

Can you re-test this at 40MHz, I was unclear on what the exact issues were re 40MHz above ?

ozpropdev · 2016-03-18 03:01

@jmg
A 40Mhz clock is not achievable because of the limitation of the smartpin sync. transmitter.
From the P2 docs:

Words of 1 to 32 bits are shifted out on the pin, LSB first, with each new bit being output two internal clock cycles
after registering a positive edge on the B input.

jmg · 2016-03-18 03:08

ozpropdev wrote: »

@jmg
A 40Mhz clock is not achievable because of the limitation of the smartpin sync. transmitter.
From the P2 docs:

Words of 1 to 32 bits are shifted out on the pin, LSB first, with each new bit being output two internal clock cycles
after registering a positive edge on the B input.

I'm not seeing the problem, in this use case ? That sounds like just a phase delay ?

The FTDI device expects Async formatted data, so it self-syncs on the Start bit. ie Phase delays are tolerated.

I see two possible ways to generate Tests
a) use SYNC mode, and manually add Start and Stop, and define X bits long - I think you have done this ?

b) Use Async mode, 8b,(or more) Tx, and start a 40MHz Clock on FSCLK pin.
In this case some means to phase-lock TxD and 40MHz correct edge will be needed.
I don't think there is a Smart pin feature to output an AsyncMode Tx Clock on a nearby pin.

If there was, you would not need a separate 40MHz clock.

ozpropdev · 2016-03-18 04:30

The issue appears to be that the smartpin gets bound up at 40MHz.
My code hangs waiting for the shifter ready (inx=1) which never comes.

jmg · 2016-03-18 07:14

ozpropdev wrote: »

The issue appears to be that the smartpin gets bound up at 40MHz.
My code hangs waiting for the shifter ready (inx=1) which never comes.

Hmmm.. is the effect the same on both Sync(SPI) & Async(UART) modes ?

I guess one issue could be Smart pins seem to be slave in nature: you generate the clock, then it reacts to that.
In Async.TX, maybe that limit is less imposing ?

The MCU data I'm looking at here, says SysCLK/2 on Async.Txmit, and SysCLK/2 on SPI, (with some pin-delay caveats at very high MHz values) so I think this should be possible in Logic ?
Async RX (no phase align) is of course lower, and SPI slave is also lower.

cgracey · 2016-03-18 07:26

ozpropdev wrote: »

The issue appears to be that the smartpin gets bound up at 40MHz.
My code hangs waiting for the shifter ready (inx=1) which never comes.

Can you see the 40MHz clock? I'm just wondering if it's a clock-generation problem.

Seairth · 2016-03-18 11:04

@cgracey: in the 7y documentation, PINSETM has a "pin output enable control" field, but I don't understand what it's for. It appears that DIR still enables/disables the smart cell, so what is the purpose of the field?

cgracey · 2016-03-18 11:43

Seairth wrote: »

@cgracey: in the 7y documentation, PINSETM has a "pin output enable control" field, but I don't understand what it's for. It appears that DIR still enables/disables the smart cell, so what is the purpose of the field?

Bit 6, for smart pin modes, sets the output enable for the pin, since DIR from the cogs is used for a local reset.

Seairth · 2016-03-18 12:32

cgracey wrote: »

Seairth wrote: »

@cgracey: in the 7y documentation, PINSETM has a "pin output enable control" field, but I don't understand what it's for. It appears that DIR still enables/disables the smart cell, so what is the purpose of the field?

Bit 6, for smart pin modes, sets the output enable for the pin, since DIR from the cogs is used for a local reset.

Gotcha. It was the "regardless of DIR" in the bit 6 description that was throwing me off. So, for inputs, do you set it to 0? Or do input modes simply ignore that bit?

Also, for the async serial mode where you are providing the clock source, it looks like the current approach is to set a cell to "transition" mode for the clock and another cell to the serial mode. In this case, can you leave the clock output off (m[6]=0) and still have the serial mode see it on pin B?

Seairth · 2016-03-18 12:45

On a slightly related question, I notice that many of the outputs use X[15:0] to establish the high and low periods. When using the "transition" mode as a clock, you have to double the value for PINSETY. While this works, it feels slightly cumbersome. Instead, it would be nice if there were a "double transition" ("cycle", "clock", whatever) mode where X[31:16] specifies the high period, X[15:0] specifies the low period, and Y specifies the number of high-then-low (complete cycle) transitions. Other than the obvious use for clocking, it can be used for PWM, impulses, etc.
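As a rough Python sketch of the field layout proposed here: the packing below is hypothetical, since this "cycle" mode is only a suggestion in this post and not an existing smart pin mode. The function names are invented for illustration.

```python
def transitions_for_cycles(cycles):
    """Existing 'transition' mode: Y counts edges, so a clock needs 2 per cycle."""
    return 2 * cycles


def pack_cycle_x(high_clocks, low_clocks):
    """Pack the proposed X value: X[31:16] = high period, X[15:0] = low period."""
    for period in (high_clocks, low_clocks):
        if not 0 <= period <= 0xFFFF:
            raise ValueError("each period must fit in 16 bits")
    return (high_clocks << 16) | low_clocks


def unpack_cycle_x(x):
    """Split a packed X value back into (high_clocks, low_clocks)."""
    return (x >> 16) & 0xFFFF, x & 0xFFFF


# A 25% duty waveform: high for 1 clock, low for 3 clocks.
x_value = pack_cycle_x(1, 3)
```

With separate high and low fields, Y could count whole cycles directly, and asymmetric waveforms such as PWM fall out of the same mode.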

ozpropdev · 2016-03-18 12:47

The async serial modes generate their own clock.
If you meant sync. mode then yes, you do have to enable the clock output.

ozpropdev · 2016-03-18 12:54

cgracey wrote: »

ozpropdev wrote: »

The issue appears to be that the smartpin gets bound up at 40MHz.
My code hangs waiting for the shifter ready (inx=1) which never comes.

Can you see the 40MHz clock? I'm just wondering if it's a clock-generation problem.

Yep, 40MHz clock is there.
Here's my test code.
Runs fine at 20MHz, locks up at 40MHz.

Seairth · 2016-03-18 13:38

ozpropdev wrote: »

Yep, 40MHz clock is there.
Here's my test code.
Runs fine at 20MHz, locks up at 40MHz.

I wonder if it has to do with the fact that the output is two clock cycles after a rising edge. In the case of 1-clock transitions (40MHz), the output of the current bit is on exactly the same edge as the detection for outputting the next bit.

Seairth · 2016-03-18 14:08

Ugh... trying to get the synchronous serial mode working without a logic analyzer is turning out to be quite difficult!

cgracey · 2016-03-18 15:04

Seairth wrote: »

Ugh... trying to get the synchronous serial mode working without a logic analyzer is turning out to be quite difficult!

Seairth, I think at 40MHz there may be too much smearing. The receiver should see the clocks, though. Have you tried setting X[5] (I think it is) to cause the receiver to sample the bit on the same cycle as it sees the clock edge?

Seairth · 2016-03-18 15:06

Seairth wrote: »

ozpropdev wrote: »

Yep, 40MHz clock is there.
Here's my test code.
Runs fine at 20MHz, locks up at 40MHz.

I wonder if it has to do with the fact that the output is two clock cycles after a rising edge. In the case of 1-clock transitions (40MHz), the output of the current bit is on exactly the same edge as the detection for outputting the next bit.

Alternatively, looking at your (@ozpropdev) rep loop, you may just not be able to push the data fast enough. The PINSETY and PINACK alone will take 20 clock cycles. Add the additional instructions and you are talking a minimum of 30 clock cycles per loop.

Would it be reasonable for PINSETY to also implicitly perform a PINACK?

cgracey · 2016-03-18 15:09

Seairth wrote: »

Seairth wrote: »

ozpropdev wrote: »

Yep, 40MHz clock is there.
Here's my test code.
Runs fine at 20MHz, locks up at 40MHz.

I wonder if it has to do with the fact that the output is two clock cycles after a rising edge. In the case of 1-clock transitions (40MHz), the output of the current bit is on exactly the same edge as the detection for outputting the next bit.

Alternatively, looking at your (@ozpropdev) rep loop, you may just not be able to push the data fast enough. The PINSETY and PINACK alone will take 20 clock cycles. Add the additional instructions and you are talking a minimum of 30 clock cycles per loop.

Would it be reasonable for PINSETY to also implicitly perform a PINACK?

PINSETM/X/Y now takes only 10 clock cycles.

Seairth · 2016-03-18 15:13

cgracey wrote: »

Seairth wrote: »

Seairth wrote: »

ozpropdev wrote: »

Yep, 40MHz clock is there.
Here's my test code.
Runs fine at 20MHz, locks up at 40MHz.

I wonder if it has to do with the fact that the output is two clock cycles after a rising edge. In the case of 1-clock transitions (40MHz), the output of the current bit is on exactly the same edge as the detection for outputting the next bit.

Alternatively, looking at your (@ozpropdev) rep loop, you may just not be able to push the data fast enough. The PINSETY and PINACK alone will take 20 clock cycles. Add the additional instructions and you are talking a minimum of 30 clock cycles per loop.

Would it be reasonable for PINSETY to also implicitly perform a PINACK?

PINSETM/X/Y now takes only 10 clock cycles.

Right. But, PINACK is really just a PINSETM, correct? So PINSETY followed by a PINACK, would take 20 clock cycles.

cgracey · 2016-03-18 15:17

PINACK has the D LSB cleared, so it takes only two clocks. The cog only outputs %0001 and then %0000, without eight data nibbles in-between.

Seairth · 2016-03-18 15:57

cgracey wrote: »

PINACK has the D LSB cleared, so it takes only two clocks. The cog only outputs %0001 and then %0000, without eight data nibbles in-between.

Gotcha. Ok. And it turns out not to be a timing issue in the code. I took @ozpropdev's code and set "bits" to 32, which would ensure 64 clock cycles per PINSETY. Plenty of time for the WAITPAT to set up and wait. Also confirmed that the same problem exists with SETEDG/WAITEDG.
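The timing argument can be written out as a small Python budget check. It is a sketch only: it uses the clock counts quoted in this thread (PINSETY 10 clocks, PINACK 2 clocks) and assumes a 40MHz bit clock means two system clocks per bit (SysCLK/2). The `other_clocks` parameter is a placeholder for whatever else the loop does and is not taken from the actual test code.

```python
PINSETY_CLOCKS = 10  # per cgracey: PINSETM/X/Y now take 10 clocks
PINACK_CLOCKS = 2    # per cgracey: PINACK takes only two clocks


def clocks_per_word(bits, clocks_per_bit=2):
    """System clocks the shifter needs to send one word."""
    return bits * clocks_per_bit


def loop_keeps_up(bits, other_clocks, clocks_per_bit=2):
    """True if one loop iteration fits within the time to shift one word."""
    loop_clocks = PINSETY_CLOCKS + PINACK_CLOCKS + other_clocks
    return loop_clocks <= clocks_per_word(bits, clocks_per_bit)


# 32-bit words give 64 clocks per word, leaving ample room for the loop.
word_clocks = clocks_per_word(32)
fits = loop_keeps_up(32, other_clocks=10)
```

Under these assumptions a 32-bit word leaves more than 40 spare clocks per iteration, which is consistent with the hang not being a code timing problem.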

Seairth · 2016-03-18 16:05

cgracey wrote: »

Seairth wrote: »

Ugh... trying to get the synchronous serial mode working without a logic analyzer is turning out to be quite difficult!

Seairth, I think at 40MHz there may be too much smearing. The receiver should see the clocks, though. Have you tried setting X[5] (I think it is) to cause the receiver to sample the bit on the same cycle as it sees the clock edge?

Actually, that comment was for something else. I'm trying to get communication with a DS1620 working (at only 500 KHz). This uses half-duplex (3-wire) synchronous serial with an inverted clock and LSB first. Without the benefit of a logic analyzer, I'm only guessing that the RST, CLK, and DQ lines are sequencing correctly.

Edit: Actually, I'm not worried about RST, as this is simply an OUTx toggle. It's really just the CLK and DQ (which is TX for 8 bits and RX for 9 bits).
