Big update for DE2-115 and DE0-Nano users w/add-on boards

jazzed · 2013-10-04 15:06

jmg wrote: »

Rather than break Async, the focus is on adding SPI modes(master/slave), by allowing optional use of that Start/Stop state engine, and adding clock out choices, in both modes.

Those are the only changes that need to be made, IMHO. Anything more substantial is probably asking for trouble.

I'm guessing there is only about one synthesis run left in the patience queue.

jmg · 2013-10-04 15:22

USB LS & FS on the P2 ?

Cluso99 wrote: »

WIP

There may be a hybrid solution, at least for research.
If the P2 cannot quite do proper full speed USB @ ~80MHz, then a CPLD could assist. (it might even make FS on P1 practical ?)

This would need a Slave SPI mode, which is already discussed above.

The CPLD also needs a faster clock, some N x 12MHz ( here, maybe 72 or 84MHz ?)

The PLD does Phase-snap bit sync -not really a pure PLL, but does re-align on every edge, which SW solutions usually cannot do. ( SW solutions thus have issues with larger packets )

PLD can do bit stuffing in both directions, and it pauses the SPI slave clock when doing so.

Some mode/flag pins control Non Diff cases.

The P2 still works intra-packet, but now handles byte not bit, and toggle encode and bit-stuff are all removed into CPLD, as is bit-sync, so the P2 SW loading is slashed, in the time domain.
P2 still needs to do CRC & manage bus direction.

Each stuff bit slot on the USB, has no real data associated with it, so simply insert(tx) or remove(rx) and pause spiclk
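As a rough illustration of the insert/remove step described above, here is a minimal Python sketch of USB-style bit stuffing (a 0 is inserted after six consecutive 1s). The function names are made up for this example, and it models only the bit stream, not the CPLD logic or the SPI clock gating; each stuffed bit corresponds to one slot where the SPI clock would be paused.

```python
def stuff_bits(bits):
    """Transmit side: insert a 0 after every run of six consecutive 1s."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b else 0
        if run == 6:
            out.append(0)  # stuff bit: carries no data, SPI clock would pause here
            run = 0
    return out


def unstuff_bits(bits):
    """Receive side: drop the bit that follows six consecutive 1s."""
    out, run, skip = [], 0, False
    for b in bits:
        if skip:
            # This slot is the stuff bit; a 1 here is a bit-stuffing error.
            if b != 0:
                raise ValueError("bit-stuff error: seven consecutive 1s")
            skip = False
            run = 0
            continue
        out.append(b)
        run = run + 1 if b else 0
        if run == 6:
            skip = True
    return out


payload = [1] * 8
line_bits = stuff_bits(payload)
paused_clocks = len(line_bits) - len(payload)  # number of stuff-bit slots
```

The difference in length between the stuffed and unstuffed streams is the number of bit times for which the byte-level side sees no clock.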

An XC2C32A ( $1.15 QFN32) would probably do this, and an LCMXO2-256HC-4SG32C ( $2.80 QFN) would have a lot of spare space in the same size package. (XO2 could include CRC)

A quick skeleton test uses ~60% of a 32MC CPLD.

Full ESD spec would need a USB driver, but so would any native P2 USB solution.

Adding: Other reference points for USB, as reality checks, are
a) CP2105 USB to Dual UART bridge 1,000: $1.30
- works out of the box, but has lowish sustained speeds

b) C8051T623-GM 1,000: $0.89
Needs code, but should do higher sustained speeds

evanh · 2013-10-04 15:33

Sapieha wrote: »

BUT can that be run as a free-running counter without needing some code to execute it, as in my proposal? Otherwise it needs some code to do that.

Some code is needed just to read the value in all situations. Are you maybe wanting direct hardware output from this frequency measure, like a comparator?

Cluso99 · 2013-10-04 15:38

Seairth wrote: »

Actually, I was trying to go beyond that. In the grand tradition of generality on the propeller, I was aiming for some basic hardware that would allow for support of just about any serial interface (within reason). It may not be an ideal solution for any single interface, but it would hopefully let any single interface have a better implementation than the software-only approach would provide.

+1

Cluso99 · 2013-10-04 15:49

jmg wrote: »

The point is Chip's first pass is a great first pass, but do not break that, in the effort to make it more general.
You can do both Async and SPI properly, with very little effort.

Async is already coded, tested and in FPGA. Code is already using this feature.

Sure, expand it to make it more general, but do not remove it.

FTDI is just one example, from the real world.

But we are not just trying to do SPI. We are trying to add some general use shift registers. This is the Props philosophy.

Sure, you can use the counters, but do not force all users to use them. That becomes a wasteful constraint.

Small uC often give users the choice of timers OR local baud rate generation.
If there is room for a bit to choose Timer or local BRG, I'd be fine with that.
No constraints and no surprise.

But why add more hw when the existing hw would otherwise be left unused?
BTW the counters are not used as timers like other micros. We have CNT and WAITCNT/PASSCNT for that.

"In an ultra fast serial, the cog won't have time to do anything else so the counters will be unused."

?? Did you really say that

If the fast serial has the bits managed in HW, (as now) there are plenty of cycles for other threads

One nice feature of the FTDI 50MHz Clocked Async mode, is that you can feed it as slow as you want, or you can go up to the full speed in bursts as needed.
The user has the choice.

At 50MHz there will not be much time left for anything else.

But if we have some generic shifters, just imagine what they can be used for! A UART will always be a UART.

jmg · 2013-10-04 15:49

Seairth wrote: »

Actually, I was trying to go beyond that....

? But removing a feature already there is not "going beyond that" at all - it is offering a crippled subset.

Adding SPI as well as present Async, is a proper superset. ( 35c micros can manage SPI and Async. )

evanh · 2013-10-04 15:50

Seairth wrote: »

For serial (e.g. 8N1), you'd transmit by loading ten bits (start bit, 8-bit value, stop bit) into SRA, then clock it out. For receive, you use WAITPEQs to detect the start condition, then clock in 8 bits to SRB.

That sounds a bit prone to generating bit timing errors. It would at the very least require those wait instructions to be specialised start/stop of the shifter. I'd be intrigued if it could be done reliably.

The downside is there is no way this could handle gapless streaming.

If you wanted to add a parity bit, that would be handled (both generating and verifying) in software. In order to asynchronously send and receive, use two tasks and both shift registers.

Chip's existing design I suspect can't handle 8+parity as this would effectively be 9 bits of data. The recommended extension of fully definable frame length would solve this. But, yep, software dealing with parity will be the way.
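To make the frame-length point concrete, here is a small Python sketch of building and checking an async frame in software: start bit, 8 data bits LSB first, an optional parity bit, then a stop bit. The function names and the "even"/"odd" option are assumptions for illustration only; nothing here reflects Chip's actual register layout. It shows that 8N1 needs a 10-bit frame while 8 data bits plus parity needs 11, which is why a definable frame length matters.

```python
def build_frame(value, parity=None):
    """Return line bits in transmit order: start, 8 data (LSB first), [parity], stop."""
    data = [(value >> i) & 1 for i in range(8)]
    frame = [0] + data                      # start bit is a 0
    if parity == "even":
        frame.append(sum(data) & 1)         # makes the total count of 1s even
    elif parity == "odd":
        frame.append((sum(data) & 1) ^ 1)   # makes the total count of 1s odd
    frame.append(1)                         # stop bit is a 1
    return frame


def parse_frame(frame, parity=None):
    """Return (value, parity_ok); raise ValueError on a framing error."""
    expected_len = 10 if parity is None else 11
    if len(frame) != expected_len or frame[0] != 0 or frame[-1] != 1:
        raise ValueError("framing error")
    data = frame[1:9]
    value = sum(bit << i for i, bit in enumerate(data))
    if parity is None:
        return value, True
    want = sum(data) & 1
    if parity == "odd":
        want ^= 1
    return value, frame[9] == want
```

In this sketch the parity bit is just one more bit loaded into (or read out of) the shifter, with the generate/verify step done in software on either side.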

jmg · 2013-10-04 15:55

Cluso99 wrote: »

But we are not just trying to do SPI. We are trying to add some general use shift registers.

I'm perfectly fine with that, but most uC users idea of General use includes both SPI and Async modes.

I can see no point, in removing a trivial level of silicon, (already proven) driven by claims of 'philosophy'..

I prefer silicon to just work properly and simply : HW manages the bit level stuff, and SW manages the byte level stuff

jmg · 2013-10-04 15:57

evanh wrote: »

Chip's existing design I suspect can't handle 8+parity as this would effectively be 9 bits of data. The recommended extension of fully definable frame length would solve this. But, yep, software dealing with parity will be the way.

Agreed. His first pass, was a good first pass. Just needs a clean-up pass, expanding some choices, removing none.

Sapieha · 2013-10-04 16:27

Hi evanh.

A buffered value every measurement period, available to read at the same time as the counter starts a new measurement.

evanh wrote: »

Some code is needed just to read the value in all situations. Are you maybe wanting direct hardware output from this frequency measure, like a comparator?

evanh · 2013-10-04 16:30

Cluso99 wrote: »

But we are not just trying to do SPI. We are trying to add some general use shift registers.

More accurately, we're trying to come up with modifications to the UARTs so as to use their existing shift registers for synchronous use.

And maybe some strange sampling tricks also. Can anyone think of uses for the logic/feedback functions of the counters being attached to the shift registers?

evanh · 2013-10-04 16:32

Sapieha wrote: »

A buffered value every measurement period, available to read at the same time as the counter starts a new measurement.

Metronomic sampling of the counter?

KC_Rob · 2013-10-04 16:40

jmg wrote: »

I can see no point, in removing a trivial level of silicon, (already proven) driven by claims of 'philosophy'..

I just want to, hopefully, clarify something here without getting into nitty-gritty technical stuff for a moment. There is more to this "philosophy" than just mere philosophy; it is in fact an important market distinction. One of the main things that makes P1 special *and* worthwhile, despite its higher price point - and the P2 will be how much? $20+ in low volume? - is its generality and flexibility. It may not be - and need not be - the fastest or the best at any one thing, but it can do lots of things well enough, without much engineering time or trouble. It is not a typical microcontroller and would likely fail miserably in the marketplace if it were made into (or very nearly like) one. So, I think the argument for more general, flexible solutions - even where they cause a (tolerable) performance hit - is generally the right one; always allowing for the possible exception, of course.

That said, so much has been done with/added to the Prop2 already. The main focus now should be on getting a usable, functioning part in production.

Bill Henning · 2013-10-04 16:43

- I agree that we are trying to add some very powerful shift capabilities.

- I'd prefer to keep the counters separate, simply so other tasks in the same cog could use them

- assume P2 @ 160MHz, say 4 tasks running

50Mbps serial, let's say simple sync 8 bits per byte... 6.25M reads/sec of shift register, so one thread can easily handle the 50Mbps data stream, leaving three to do other work.

If sending/receiving longs, make that 1.5625M longs/sec, tons of time for a task

--> So one cog could handle an SD card, and three other fast SPI peripherals, with relatively simple code.

Without such a sync shifter, a cog would not be able to handle 50MHz SPI slave (never mind one task in a cog)

- Chip will hopefully keep the async mode he implemented (including optional 4 bit address), I can see a lot of uses for it, but hopefully he will add a flexible sync mode that can be used SPI master/slave as well as many other uses
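The arithmetic behind those rates can be checked with a few lines of Python. The 160MHz clock, 4 tasks and 50Mbps figures are from the post; the idea that each task gets an equal share of one instruction slot per clock is an assumption made here purely to get a ballpark, and the function name is invented for the example.

```python
def shifter_budget(bit_rate_hz, word_bits, sysclk_hz=160_000_000, tasks=4):
    """Return (reads per second, system clocks per read, clocks per read per task)."""
    reads_per_sec = bit_rate_hz / word_bits
    clocks_per_read = sysclk_hz / reads_per_sec
    # Assumption: tasks share the cog evenly, one slot per clock in turn.
    slots_per_task = clocks_per_read / tasks
    return reads_per_sec, clocks_per_read, slots_per_task


byte_budget = shifter_budget(50_000_000, 8)    # 8-bit reads at 50Mbps
long_budget = shifter_budget(50_000_000, 32)   # 32-bit reads at 50Mbps
```

With these assumptions the byte case comes out at 6.25M reads/sec with 25.6 clocks between reads, and the long case at about 1.56M reads/sec with 102.4 clocks between reads.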

Cluso99 wrote: »

But we are not just trying to do SPI. We are trying to add some general use shift registers. This is the Props philosophy.

But why add more hw when the existing hw would otherwise be left unused?
BTW the counters are not used as timers like other micros. We have CNT and WAITCNT/PASSCNT for that.

At 50MHz there will not be much time left for anything else.

But if we have some generic shifters, just imagine what they can be used for! A UART will always be a UART.

kwinn · 2013-10-04 17:46

KC_Rob wrote: »

I just want to, hopefully, clarify something here without getting into nitty-gritty technical stuff for a moment. There is more to this "philosophy" than just mere philosophy; it is in fact an important market distinction. One of the main things that makes P1 special *and* worthwhile, despite its higher price point - and the P2 will be how much? $20+ in low volume? - is its generality and flexibility. It may not be - and need not be - the fastest or the best at any one thing, but it can do lots of things well enough, without much engineering time or trouble. It is not a typical microcontroller and would likely fail miserably in the marketplace if it were made into (or very nearly like) one. So, I think the argument for more general, flexible solutions - even where they cause a (tolerable) performance hit - is generally the right one; always allowing for the possible exception, of course.

That said, so much has been done with/added to the Prop2 already. The main focus now should be on getting a usable, functioning part in production.

Well put. Better to make the shifters more general purpose and avoid optimizing them for one function/protocol while crippling them for others. One of the things I would like to see is a 32 bit shifter that can shift data in from one pin while simultaneously shifting data out on another.

evanh · 2013-10-04 17:54

kwinn wrote: »

One of the things I would like to see is a 32 bit shifter that can shift data in from one pin while simultaneously shifting data out on another.

All implementations do that. You don't have a full-duplex port without it.

Cluso99 · 2013-10-04 18:03

Bill Henning wrote: »

- I agree that we are trying to add some very powerful shift capabilities.

Absolutely!

- I'd prefer to keep the counters separate, simply so other tasks in the same cog could use them

I cannot see this. Is it just me?
Almost all of my current P1 projects don't even use the counters. I know some reasons are they are not general enough (no general shifting in/out capabilities).
I don't want to waste silicon for no good reason and the counters should make good clock generators. Remember the VGA uses a counter to drive it.
Could you give some examples please?

- assume P2 @ 160MHz, say 4 tasks running

50Mbps serial, let's say simple sync 8 bits per byte... 6.25M reads/sec of shift register, so one thread can easily handle the 50Mbps data stream, leaving three to do other work.

If sending/receiving longs, make that 1.5625M longs/sec, tons of time for a task

Presume a simple shift register, both for tx and rx...
* If there is bit stuffing then the unstuffing will take a number of instructions.
* If CRC then this will likewise take a number of instructions.
* Then there is the protocol overhead, and the response times allowed - USB is quite tight.
All this added together will most likely consume the whole cog. No time left for other tasks.
A typical example... USB FS
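To give a feel for the per-bit work behind the CRC point above, here is a minimal bitwise sketch of the 16-bit CRC that USB applies to data packets (polynomial x^16 + x^15 + x^2 + 1, register preset to all ones, result inverted). It is written in Python purely for illustration; the function name is made up here, and a real P2 version would be PASM and might well be table-driven or hardware-assisted instead.

```python
def crc16_usb(data):
    """Bitwise CRC16 as used on USB data packets (reflected form of 0x8005)."""
    crc = 0xFFFF                       # register preset to all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):             # one shift/conditional-XOR per data bit
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc ^ 0xFFFF                # result is inverted before sending


packet_crc = crc16_usb(bytes([0x00, 0x01, 0x02, 0x03]))
```

Even in this simple form there is a test and a conditional XOR for every data bit, which is the kind of load that adds up alongside unstuffing and protocol handling.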

--> So one cog could handle an SD card, and three other fast SPI peripherals, with relatively simple code.

As in 4 distinct sets, not daisy-chained, then only presuming we have 4 sets (as in tx & rx pair) of shifters. Currently Chip has 2 pairs, A & B.
If 4 daisy chained, then depending on speed, easy even in sw, but great with hw shifters (only 1 tx/rx pair required).

Without such a sync shifter, a cog would not be able to handle 50MHz SPI slave (never mind one task in a cog)

IMHO it's possible, but no real argument. Hw assist should be available.

- Chip will hopefully keep the async mode he implemented (including optional 4 bit address), I can see a lot of uses for it, but hopefully he will add a flexible sync mode that can be used SPI master/slave as well as many other uses

Agreed as long as the start/stop bits can be disabled - IMHO that is probably agreed now.
I see the 32+4 address bits as a nifty feature. It could also work as a single wire between props too, and super fast.

In fact, in RX it's a matter of whether the start bit is automatically removed.
In TX, it's a matter of whether it is added automatically.
For stop bits, it is really the same, although we could remove them by sw, simplifying the current implementation.

We must also consider the reading and writing of the shifter - such as wait/poll. But let's get the shifter right first.

Cluso99 · 2013-10-04 18:08

kwinn wrote: »

Well put. Better to make the shifters more general purpose and avoid optimizing them for one function/protocol while crippling them for others. One of the things I would like to see is a 32 bit shifter that can shift data in from one pin while simultaneously shifting data out on another.

That is what I suggested in post #162. This way, each shifter could be a tx or an rx, or a daisy-chained shifter (shifters could be daisy-chained within the P2, and daisy-chained between P2s). So it becomes a serial/parallel/serial shifter.

4x shifters would work for quad SPI.

jmg · 2013-10-04 18:09

Sapieha wrote: »

Hi evanh.

A buffered value every measurement period, available to read at the same time as the counter starts a new measurement.

Yes, this is the counter equivalent of gapless capture.

An ideal system can read the value thus far, and either clear, or continue counting, without missing an edge.
This allows longer term, background higher precision results.

Better still, is a scheme that can allow to load the buffer of both counters, on the same clock edge.
[either a SW read, or an external capture edge ] Pretty much exactly the same way a 64 bit read of CNT is handled now.

jmg · 2013-10-04 18:13

Cluso99 wrote: »

Agreed as long as the start/stop bits can be disabled - IMHO that is probably agreed now.

Great, That is what I have been saying all along

evanh · 2013-10-04 18:17

jmg wrote: »

Yes, this is the counter equivalent of gapless capture.

An ideal system can read the value thus far, and either clear, or continue counting, without missing an edge.
This allows longer term, background higher precision results.

Better still, is a scheme that can allow to load the buffer of both counters, on the same clock edge.
[either a SW read, or an external capture edge ] Pretty much exactly the same way a 64 bit read of CNT is handled now.

That's basic metronomic sampling. No need to ever load the counters even at the start. Take known regular intervals for the sampling and just keep a copy of the previous sample and subtract it, giving you the diff. Piece of cake.
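A minimal Python sketch of that subtract-the-previous-sample approach, assuming a free-running 32-bit counter that is never cleared and is read at a fixed, known interval. The names are hypothetical; the masking step is what keeps the difference correct across a counter rollover.

```python
MASK32 = 0xFFFFFFFF


def counts_per_interval(samples):
    """Differences between consecutive samples of a free-running 32-bit counter."""
    return [(cur - prev) & MASK32 for prev, cur in zip(samples, samples[1:])]


def frequency_hz(count, interval_s):
    """Edges counted in one sampling interval, converted to a frequency."""
    return count / interval_s


# Counter wraps between the first and second sample; the diffs are unaffected.
readings = [0xFFFFFFF0, 0x00000010, 0x00000030]
diffs = counts_per_interval(readings)
```

Because the counter is only ever read, no edges are missed between intervals; the only requirement is that fewer than 2^32 counts occur per interval.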

jmg · 2013-10-04 18:19

Cluso99 wrote: »

I cannot see this. Is it just me?

I think it is just your usage areas

Cluso99 wrote: »

Almost all of my current P1 projects don't even use the counters. I know some reasons are they are not general enough (no general shifting in/out capabilities).
I don't want to waste silicon for no good reason and the counters should make good clock generators. Remember the VGA uses a counter to drive it.
Could you give some examples please?

There are plenty of Sensor handling areas where you want to measure Period/Frequency/Phase and also send that information.
A wide dynamic range frequency counter will need both counters, unless Chip has expanded them a lot
(in which case, it makes even less sense to consume that resource, for only BRG )

If there is room for a bit to allow Timer or local BRG shifter clocking, that is fine.

Again, let the user choose, do not make the decision for them.

evanh · 2013-10-04 18:23

Cluso99 wrote: »

Almost all of my current P1 projects don't even use the counters. I know some reasons are they are not general enough (no general shifting in/out capabilities).
I don't want to waste silicon for no good reason and the counters should make good clock generators. Remember the VGA uses a counter to drive it.

I've thought about this a little and decided the counter, or shift, register itself is only a small part of their respective module. The functional logic and config take the lion's share of transistors. These parts will still exist even if the storage is shared. May as well leave the counters totally independent.

Cluso99 · 2013-10-04 18:33

jmg wrote: »

I think it is just your usage areas [re using counters as clock/baud generators]

There are plenty of Sensor handling areas where you want to measure Period/Frequency/Phase and also send that information.
A wide dynamic range frequency counter will need both counters, unless Chip has expanded them a lot
(in which case, it makes even less sense to consume that resource, for only BRG )

If there is room for a bit to allow Timer or local BRG shifter clocking, that is fine.

Again, let the user choose, do not make the decision for them.

But we only have so much silicon.
And in your example, that is precisely what cogs are for..
* One cog does the Period/Frequency/Phase measurement, plus optionally conversion/smoothing/feedback
* One cog does the transmission.

I would like to see an option of daisy-chaining the counters too - was it you who suggested the mux? I do think we can achieve that anyway by using the internally connected pins P92-P127 (I am presuming P92-P95 are also internally connected). Perhaps we can ask Chip for a freehand block diagram of the counters - I know he is too busy to fully document them ATM.

evanh · 2013-10-04 18:34

Cluso99 wrote: »

... or a daisy-chained shifter (shifters could be daisy-chained within the P2, and daisy-chained between P2s). So it becomes a serial/parallel/serial shifter.

Hmm, difficult to do if it's not a single shifter from in to out. And SPI explicitly allows for this on slaves. Do uC's handle being a chained slave? I've never looked through any docs for this.

4x shifters would work for quad SPI.

Well ... there is the eight Cogs ...

evanh · 2013-10-04 18:42

Cluso99 wrote: »

I would like to see an option of daisy-chaining the counters too - was it you who suggested the mux?

That was me - http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG/page134

I do think we can achieve that anyway by using the internally connected pins P92-P127.

Ouch, bussing a whole counter via I/O port, crazy idea.

You know what, there is some potential there but it does require having the adders in the counter modules able to do the same. But still pretty nutty, lol.

Cluso99 · 2013-10-04 19:16

evanh wrote: »

That was me - http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG/page134

Sorry, couldn't remember and didn't bother to find it.

"I do think we can achieve that anyway by using the internally connected pins P92-P127."
Ouch, bussing a whole counter via I/O port, crazy idea. You know what, there is some potential there but it does require having the adders in the counter modules able to do the same. But still pretty nutty, lol.

Actually, the I/O ports for P92-P127 are effectively buried - their pins do not come to the outside and they don't have all the I/O driver/analog options either. I was one of the proponents for them and when Chip added them in he also added some instructions to use them as comms between cogs. I am just not sure if P92-P95 are there.
It would therefore be possible, if each shift register had both input and output pins, to daisy-chain them by this method.

BTW Trying to combine 4 shifters using 4 cogs for Quad SPI would be a nightmare

evanh · 2013-10-04 19:33

Cluso99 wrote: »

It would therefore be possible, if each shift register had both input and output pins, to daisy-chain them by this method.

There's quite a bit of extra config for this. You'd have to define which bit of the shift register went out for chaining. What's interesting is this makes the UART's transmit register/buffer and shifter available for a second SPI port. I guess that makes the extra config worth it.

So, for every UART there are actually two SPIs.

Cluso99 · 2013-10-04 19:38

Here is a simple example of using a P2 and a number of the shift registers I propose...

This P2 is one of a number of P2s controlled by a master P2 (or other micro).
This P2 and other identical P2s are going to be intelligent controllers for a lot of I/O.
For cheapness, we load all props from a master P2 (or other micro) using P90-P91 & Reset. This saves the SPI & 3-4 pins.

Let's presume we have 4 shift registers per cog and we daisy-chain 3 of them via 3 internal port pins (P92-P127).
We utilise P91 as DIN, P90 as DOUT and P89 as CLK (common to all P2s).
That leaves 88 I/O left for intelligent I/O.

So, the master shifts out "n" x 96 bits of data, of which only 88 bits are used (perhaps the other 8 bits are some form of control commands), and receives the last DOUT back into itself (as a readback).

So, the master shifts out "n" x 96 data bits, each P2 takes its 96 bits of info from its 3 shift registers (read), then writes back into those 3 shift registers its results/input, and then the master does another "n" x 96 data bit shift. This gives the P2s a new set of data, while the updated (written) slave data is shifted back into the master.
The process continues.
In each slave P2, only 1 cog is required to read the data in, and write the data out, and it is quite a simple/mundane task. 7 cogs are left to control the intelligent 88 I/O.

For sure there are other ways to do this. It is just an example of what those shifters could do.
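The exchange described above can be modelled in a few lines of Python. This is only a behavioural sketch under stated assumptions: every slave is treated as a single 96-bit register (its 3 chained shifters lumped together), all slaves share one clock, and bits move MSB first. The function names and bit ordering are invented for the illustration and say nothing about how the real shifters would be configured.

```python
from collections import deque

WORD_BITS = 96  # 3 daisy-chained 32-bit shifters per slave (assumed lumped together)


def to_bits(word):
    """List of bits where index j holds bit j of the word (LSB at index 0)."""
    return [(word >> j) & 1 for j in range(WORD_BITS)]


def from_bits(bits):
    return sum(b << j for j, b in enumerate(bits))


def exchange(slave_regs, commands):
    """Clock n x 96 bits round the chain.

    slave_regs[k] is what slave k has written into its shifters (its results);
    commands[k] is what the master wants slave k to receive.
    Returns (new_slave_regs, readback), where readback[k] is slave k's old word.
    """
    n = len(slave_regs)
    chain = deque()
    for reg in slave_regs:          # left end: first slave's DIN side
        chain.extend(to_bits(reg))  # right end: last slave's DOUT side
    tx = []
    for k in reversed(range(n)):    # the last slave's word must be sent first
        tx.extend(reversed(to_bits(commands[k])))  # MSB first
    rx = []
    for bit in tx:
        rx.append(chain.pop())      # bit leaving the last slave, back to the master
        chain.appendleft(bit)       # master's output enters the first slave
    bits = list(chain)
    new_regs = [from_bits(bits[k * WORD_BITS:(k + 1) * WORD_BITS]) for k in range(n)]
    readback = [0] * n
    for i, k in enumerate(reversed(range(n))):  # last slave's word arrives first
        chunk = rx[i * WORD_BITS:(i + 1) * WORD_BITS]
        readback[k] = from_bits(list(reversed(chunk)))
    return new_regs, readback


results = [0x111, 0x222, 0x333]             # what three slaves wrote back
orders = [0xAAA, 0xBBB, 0xCCC << 80]        # what the master sends this round
slaves_now, master_got = exchange(results, orders)
```

After one pass of n x 96 clocks every slave holds its new word and the master holds every slave's previous word, which is the read-then-write cycle the post describes.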

evanh · 2013-10-04 19:55

Cluso99 wrote: »

Actually, the I/O ports for P92-P127 are effectively buried - their pins do not come to the outside and they don't have all the I/O driver/analog options either.

I was thinking more along the lines of feeding a fast 8-bit parallel flash A/D directly into the first stage counter for a digital filter. So, instead of the sigma delta signal source being just a 1 or 0, it has a range of 0-255. Making it way faster settling for a given target resolution ... Propscope for professionals.
