These conversations should help Chip make the Prop2 work with said products and protocols, not be a catch line that never works. What I'm trying to convey here is that listing a bunch of names also needs to be backed up with detail on what those names entail, and possibly even how it could be achieved.
Right now, we've got some desired features and performance goals. They aren't firm, and it's a balance between what can go on the die, process limits, etc... and what we can get.
A great example is the "Boot from SD" discussion. The high level is, "read the file, load it, boot." Chip asked, "What is the process we want? Bits, etc...", and that turned out to require some investigation. Some code was attempted, and more will probably appear soon when we move into the tools stage. It turns out that we remain unsure about what makes the best sense, but we do have an idea of what the basic hardware operations are.
The logic cost on smart pins is 64x. Just making a dedicated block that can do a protocol is expensive. That's why we got the state machine. Chip basically asked, "try and make stuff" with that state machine to see what might be possible. It's novel.
Just asking for what the other devices do identifies the features. Not a bad thing. It does not necessarily identify what can be done with the features on chip so far. The idea that using a COG and the POG in tandem "kills performance" ignores the production clock being well over 100 MHz, maybe 200 MHz.
It may be that we end up with more dedicated logic than planned. It may be that we end up with fewer POGS too, maybe extending their reach to the pins, or something. Or it may be that a different state option makes better sense.
Some iteration and some attempts to see what "kills performance" really means are warranted, that's all.
Try to make Chip's job a little easier.
I already do that.
When I talk in terms of MUX and D-FF details, I get asked why I need that, and when I talk in terms of functional use cases, I get asked for some vague "Enabling features".
(I still have no idea what that means)
- proves you cannot please everybody.
Enabling features means what is going to be in the Prop. Flip-flops are an example of something that is in the Prop. JTAG is an example of something that isn't in the Prop; SPI is another such example.
To achieve the capability of, say, SPI, there is a certain amount of bit-bashing software required. Maybe not the serial stream itself, but much of the support around it. Just having to set up a second SmartPin to achieve duplex is one example of this.
An example of discussing an enabling feature might be something like whether there is any way to daisy chain two SmartPins together to get an SPI-like duplex without software.
The direct feedback of pin-in to pin-out would be another one. This can be done on the Prop1 already.
Now you have further confused me (everyone?), as I would certainly say SPI is in the Prop.
Chip lists 8 Serial modes, Async and Sync.
There is always some software layer needed, so I prefer to focus on the simple and unambiguous : bit-level at the pin, and word level in SW.
I'm unclear on what "without software" could mean, but Async already needs two pins configured for Duplex, so Sync is similar.
The grey area I see in the SPI docs is around who manages the CLOCK.
Chip lists:
11000 = sync serial transmit byte (A-data, B-clock, MSB first)
11010 = sync serial receive byte (C-data, B-clock, MSB first)
I've tagged that for 3 pins: A, B, C.
Data Out and Data In are straightforward, but the clock has master/slave choices I do not see clarified.
i.e. if you configure 2 channels onto the same CLK pin, is that OK provided they start in sync?
Receive could default to CLK-IN, but there are cases where you want to generate a CLK-OUT on receive.
It's not clear if this is just missing from the docs, or needs a 2nd pass (like the variable Tx/Rx length needs another pass).
Yeah, good point, so there are three separate SmartPin configs needed for a duplex SPI configuration. And then there are the SPI chip selects (TMS on JTAG) too.
That's easy enough as a master. Maybe not even possible as a slave without purely bit-bashing the whole thing.
Regarding where software intervention has to seriously get involved: SmartPin sources are only direct pin inputs and Cog OUTs as far as I know. That means that, as an SPI slave device, a duplex mechanism requires software in the middle to pipe the data between the two SmartPin shifters.
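To make the duplex requirement concrete, here is a minimal Python sketch of what a full-duplex exchange has to achieve: two shifters clocked by one shared clock, MSB first as in the sync serial modes Chip lists. It is only a behavioural model, not SmartPin hardware; the function name `spi_exchange` and the sample-then-shift ordering are assumptions made for illustration.

```python
def spi_exchange(master_tx: int, slave_tx: int, bits: int = 8) -> tuple[int, int]:
    """Shift `bits` bits MSB first in both directions on one shared clock.

    Returns (master_rx, slave_rx). Hypothetical model, not the SmartPin logic.
    """
    mask = (1 << bits) - 1
    m_shift = master_tx & mask   # master's transmit shifter (drives MOSI)
    s_shift = slave_tx & mask    # slave's transmit shifter (drives MISO)
    m_rx = 0
    s_rx = 0
    for _ in range(bits):
        # Both sides present their current MSB on their data line.
        mosi = (m_shift >> (bits - 1)) & 1
        miso = (s_shift >> (bits - 1)) & 1
        # Sampling edge (assumed): each side captures the other's bit.
        s_rx = ((s_rx << 1) | mosi) & mask
        m_rx = ((m_rx << 1) | miso) & mask
        # Shifting edge (assumed): both transmit shifters advance.
        m_shift = (m_shift << 1) & mask
        s_shift = (s_shift << 1) & mask
    return m_rx, s_rx
```

After eight clocks the two bytes have simply swapped places. On the slave side the clock comes from outside, so both of its shifters have to be loaded and read in step with it, which is the piping job described above.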
I think it'd be cool if a whole SmartPin was documented as a single general block like what Andy posted, rather than just the "custom state machine" section. I suspect SmartPins aren't too far from being exactly that. It's coarser than your average FPGA cell, i.e. based around the 32-bit shifter/adder + Z register, but would still give FPGA-like flexibility in how that is utilised with regard to connections and modes.
Instead of asking a lot of questions - why not just try it out yourself - play with an FPGA board and a scope. That's where the fun is!
I just tried the TRANSITION OUTPUT mode. It lets you define the rate (in sysclock cycles) of the transitions and the number of transitions. A PINSETY starts one shot of the defined transitions. This is ideal for an SPI clock.
A Duplex SPI can look like this:
For a Slave configuration just let the external Master generate the clock.
SPI with 1..32 bits should be doable with the custom state machine modes.
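For anyone without an FPGA board, the one-shot behaviour of the TRANSITION OUTPUT mode described above can be sketched in Python. This is a rough model under stated assumptions: the pin idles low, the first transition happens at the start of the burst, and the name `transition_output` is made up for the example.

```python
def transition_output(cycles_per_transition: int, transitions: int,
                      idle_level: int = 0) -> list[int]:
    """Return the pin level for every sysclk cycle of one burst.

    cycles_per_transition: sysclk cycles between edges (the rate setting).
    transitions: number of edges in the burst (the one-shot count).
    idle_level: assumed level of the pin before the burst starts.
    """
    if cycles_per_transition < 1:
        raise ValueError("cycles_per_transition must be at least 1")
    if transitions < 0:
        raise ValueError("transitions must not be negative")
    level = idle_level
    wave = []
    for _ in range(transitions):
        level ^= 1                                  # one transition
        wave.extend([level] * cycles_per_transition)
    return wave
```

With a rate of 1 and a count of 8 this gives eight alternating levels, i.e. four full clock periods. An even transition count leaves the pin back at its idle level, which is what you want for an SPI clock.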
I don't think JMG has a suitable FPGA board. I didn't until recently, now, since Pnut doesn't work under Wine, I'm trying to get sources for Pnut and compile it for Linux.
Cool, sounds like there is a shift-in input to the shift-register then.
Andy, are you hand drawing those diagrams in a paint package? I had initially assumed you had a structured package for the job.
...
Yes I do. The standard Paint program of Windows has a lot of predefined shapes, and with a bit of practice you can use them for all kinds of symbols like MUX, inverter and so on.
Schematic programs have a lot of overhead before you can start drawing, and you have to define new symbols first. All very distracting if you want to sketch an idea...
Ohhhh, of course, Duh! The shift-in data from the data input pin is probably encoded in the rule/command so that the shifter pulls a 1/0 bit from the presets.
Yes, the ALU part in the diagram is heavily simplified.
You can MOVe 0,1,-1 to Z, or you can SHIFT by 1 or -1 (=left/right) or you can ADD 0,1,-1 (=INC/DEC).
It's all encoded in bits 2..0 of the command.
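The list above adds up to eight operations (three MOVs, two SHIFTs, three ADDs), which is exactly what three command bits can select. Here is a small Python sketch of those operations on a 32-bit Z. The operation names, the choice that "1" means left, and shifting in a 0 bit are all assumptions; the thread does not give the actual 3-bit encoding, so none is invented here.

```python
MASK32 = 0xFFFF_FFFF

# Hypothetical operation names; the real bit patterns in command
# bits 2..0 are not given in the thread.
Z_OPS = {
    "MOV_0":  lambda z: 0,
    "MOV_1":  lambda z: 1,
    "MOV_M1": lambda z: MASK32,              # -1 as a 32-bit value
    "SHL_1":  lambda z: (z << 1) & MASK32,   # assumed: shift by 1 = left, 0 shifted in
    "SHR_1":  lambda z: (z & MASK32) >> 1,   # assumed: shift by -1 = right
    "ADD_0":  lambda z: z & MASK32,          # effectively a no-op
    "ADD_1":  lambda z: (z + 1) & MASK32,    # INC
    "ADD_M1": lambda z: (z - 1) & MASK32,    # DEC
}


def apply_z_op(z: int, op: str) -> int:
    """Apply one of the eight Z operations and wrap the result to 32 bits."""
    return Z_OPS[op](z)
```

For example, `apply_z_op(0, "ADD_M1")` wraps around to `0xFFFF_FFFF`, the same way a 32-bit down-counter would.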
This TRANSITION OUT mode lets you output a clock in parallel with the executed instructions. With a 1 cycle timing you get exactly 1 clock per instruction (for 2 cycle instructions). And the first edge seems to happen at exactly the right time, in the middle of an output bit.
This allows a very fast QuadSPI output / input with an unrolled loop like this:
dat
        org
        pinsetm mode,#4         'set smartpin mode TRANSITIONS OUT
        pinsetx #1,#4           '1 cycle High, 1 cycle Low = 40 MHz
        mov     dira,#$01F
        mov     txd,##$3210     'test value
loop    pinsety #8,#4           'start 8 transitions = 4 full clocks
        getnib  outa,txd,#3     'output nibbles at OUTA[3..0]
        getnib  outa,txd,#2     ' (OUTA[4] clocks in parallel)
        getnib  outa,txd,#1
        getnib  outa,txd,#0
                                'clock transitions stop here automatically
        waitx   ##40_000
        jmp     #loop
mode    long    %1_00_00101_0000_0000_00_0_0000000000000 'transition out mode
txd     res     1
It's restricted to QSPI on bits 3..0 of PortA or PortB, but it works at half the sysclk speed (40 MHz now, 80 MHz on the real chip).
The screenshot here shows the sclk in blue and Bit0 of the Data in yellow. It's a 60MHz scope so all is a bit round at these frequencies.
Andy
EDIT: I think GETNIB overwrites the whole OUTA port and not only the lowest nibble, so for a real application we will need two instructions per nibble. This is still 1/4 of the sysclk frequency for QSPI.
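The nibble order and the clock arithmetic in this post can be checked with a few lines of Python. This is only a sketch: `getnib` here models nibble extraction as described in the code comments, and the helper names and the 2-cycle instruction default are assumptions taken from the text above, not from any official documentation.

```python
def getnib(value: int, index: int) -> int:
    """Model of extracting nibble `index` (0 = lowest) from a 32-bit value."""
    return (value >> (4 * index)) & 0xF


def qspi_nibbles(word: int, count: int = 4) -> list[int]:
    """Nibbles in the order the unrolled loop outputs them (highest first)."""
    return [getnib(word, i) for i in range(count - 1, -1, -1)]


def qspi_clock_hz(sysclk_hz: float, instructions_per_nibble: int = 1,
                  cycles_per_instruction: int = 2) -> float:
    """QSPI clock rate when one nibble goes out per clock period."""
    return sysclk_hz / (instructions_per_nibble * cycles_per_instruction)
```

With the test value $3210 the nibbles go out as 3, 2, 1, 0. An 80 MHz sysclk gives a 40 MHz QSPI clock with one instruction per nibble, and 20 MHz (1/4 of sysclk) once a second instruction per nibble is needed.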
That's nifty, but looks interrupt intolerant, and such inferred timing is a bear to debug.
How does that scale with changes in data rate (i.e. varying the TRANSITION OUT rate)?
A real system may have slower QuadSPI RAM connected with faster QuadSPI FLASH.
Here is an example that uses a Custom State Machine mode.
It just detects pos-edges on PA0 and generates a divided frequency on PA1. Divide by 2..33 is possible.
Once initiated, everything runs automatically in the pin cell; no cog code is needed anymore.
A possible application is generating the L/R framing clock for I2S.
The code also installs an NCO smartcell on PA0 to generate the frequency that gets divided.
' Custom State Machine Example
' Divides a clock on pin PA0 and outputs clock / N on pin PA1 (N=1..32)
' NEXTstat * N for ST=0
' ST=0 Bp Bc Ap Ac Posedge detector on B inp
' 0 1 x x = NEXTstate, Rules 4..7
' other = NOP
' ST=1 Bp Bc Ap Ac Posedge detector on B inp
' 0 1 x x = NEXTstate + INC, Rules 4..7
' other = NOP
dat
        org
        pinsetm NCOmd,#0            'set smartpin 0 to NCO
        pinsetx #1,#0               '1 cycle resolution
        pinsety ##$2000_0000,#0     '1/8 sysclk
        pinsetm mode,#1             'set smartpin 1 to Custom 1-bit 2-pattern mode
        pinsetx xset,#1             'rules + commands
        pinsety yset,#1             'rules + commands
        mov     dira,#$03
loop    jmp     #loop
NCOmd long %1_00_00110_0000_0000_00_0_0000000000000 'nco out mode
' Bin-1 Ain
mode long %1_00_10111_0111_0000_00_0_0000000000000 'custom 1bit 2 pattern + OUT
' :N NXT NOP 7..4
xset long %0_0_0_00_1_01000_00000_0000000011110000
' :N NXT+INC NOP 7..4 N-2
yset long %00000_0_01100_00000_0000000011110000 + 2 << 27
The scope shows that it works. I don't know yet where the phase shift comes from.
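The two pieces of this example can be modelled in Python as a sanity check: an NCO as a 32-bit phase accumulator whose top bit is the output, and a divider that counts rising edges. This is a behavioural sketch only. It does not reproduce the rule/command encoding of the custom mode, and the split into low and high edge counts (roughly 50% duty) is an assumption, as are all the function names.

```python
def nco_output(increment: int, cycles: int) -> list[int]:
    """Top bit of a 32-bit phase accumulator, one sample per sysclk cycle."""
    phase = 0
    out = []
    for _ in range(cycles):
        phase = (phase + increment) & 0xFFFF_FFFF
        out.append(phase >> 31)
    return out


def rising_edges(wave: list[int]) -> int:
    """Count 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


def divide_by_n(wave: list[int], n: int) -> list[int]:
    """Divide the input frequency by n by counting rising edges.

    Assumed duty split: low for n - n//2 input periods, high for n//2.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    limits = (n - n // 2, n // 2)   # edges to wait in state 0, state 1
    state = 0
    count = 0
    prev = wave[0] if wave else 0
    out = []
    for cur in wave:
        if prev == 0 and cur == 1:          # pos-edge detector
            count += 1
            if count >= limits[state]:
                state ^= 1
                count = 0
        out.append(state)
        prev = cur
    return out
```

An increment of $2000_0000 is 2^32 / 8, so the accumulator wraps once every 8 cycles, which matches the "1/8 sysclk" comment. Dividing 16 such periods by 4 leaves 4 output periods.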
Now that I have done an example with the Custom State Machine mode, I have a clearer picture of how it all fits together. So I rearranged the diagram a bit to make the function more obvious. I also added the Z-buffer; it turned out that Z reads only get updated with SIGNAL instructions.
To understand all this, I think it helps if you have some knowledge of FPGA design.
The LUT, for example, lets you program any logical combination of the 4 inputs by programming the truth table.
On FPGAs the output of the truth table is just a '0' or '1', but here this gets translated to a '0'-Instruction and a '1'-Instruction. These instructions can increment, decrement or shift the Z value, or just do nothing. An instruction can also change the state. This is a single flip-flop, so there are only 2 states, '0' and '1'. A counter lets you delay the state change by 1..32 NextState instructions.
In the 2 pattern mode the state selects between two sets of instructions, so you can for example count Z up in state 0, and count down in state 1.
There is also the possibility to feed back the state or some Z-bits into a LUT input.
Here is the new diagram of the 1-bit 2-pattern mode:
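The LUT-to-instruction idea described above can be sketched in Python. This is a simplified model under assumptions: the four inputs are taken as Bp, Bc, Ap, Ac forming LUT index bits 3..0 (inferred from the "Rules 4..7" comment in the divider example), instructions are reduced to a Z increment plus an optional state flip, and the N delay counter is left out. All type and function names are made up for the example.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    z_delta: int        # +1, -1 or 0, added to Z
    next_state: bool    # True = flip the single state flip-flop


class Pattern(NamedTuple):
    truth_table: int    # 16 bits, one per combination of the 4 inputs
    instr0: Instr       # executed when the selected truth-table bit is 0
    instr1: Instr       # executed when the selected truth-table bit is 1


NOP = Instr(0, False)
B_POSEDGE = 0b0000_0000_1111_0000   # rules 7..4: Bp=0, Bc=1, A don't care


def lut_bit(truth_table: int, bp: int, bc: int, ap: int, ac: int) -> int:
    """Look up the truth-table bit selected by the four inputs."""
    index = (bp << 3) | (bc << 2) | (ap << 1) | ac
    return (truth_table >> index) & 1


def step(state: int, z: int, patterns: tuple[Pattern, Pattern],
         bp: int, bc: int, ap: int, ac: int) -> tuple[int, int]:
    """One clock of the 2-pattern machine; returns the new (state, z)."""
    pattern = patterns[state]            # the state selects the instruction set
    selected = lut_bit(pattern.truth_table, bp, bc, ap, ac)
    instr = pattern.instr1 if selected else pattern.instr0
    z = (z + instr.z_delta) & 0xFFFF_FFFF
    if instr.next_state:
        state ^= 1
    return state, z
```

As a usage example, `(Pattern(B_POSEDGE, NOP, Instr(+1, False)), Pattern(B_POSEDGE, NOP, Instr(-1, False)))` counts Z up on each B pos-edge in state 0 and down in state 1, the case mentioned in the post.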
Ah, some extras around that st bit.
I had to look up what :N meant - it's a counter. Needs labelled.
I think having the ALU shown and its accumulator (working Z) represented separately again is superior.
Andy, do you feel these custom modes are arranged in a useful manner? Could we get better functions out of this much logic? I just figured counting and shifting were likely useful, but maybe a little twist could make this a lot better.
Chip,
I don't know if this is already possible, but internally daisy chaining the SmartPins (while still leaving the pin drivers freely available to the Cogs) may be effective.
I haven't got any particular use in mind though sorry. Just the idea is all.
PS: I do think the general approach is a good one btw.
Jmg was asking about logic usage for different modes.
Here is a table from Quartus of usage for two smart pins. The first column is 'Logic Cells' and the second column is "Dedicated Logic Registers". This is from the DE0-Nano compile, so this is a Cyclone IV device, not a Cyclone V, like on the A9 board. These are LE's as opposed to ALM's:
You can see that four of the blocks have no registers. That's because they are mux'd to the flops when selected. Here are their descriptions:
Of the 226 flops per SmartPin, 128 go to the Cog-accessible registers M, X, Y and Z, plus 32 flops for the accumulator. That leaves 66 flops for buffering, state holding, small counters and the like.
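The flop accounting above is easy to check with a little Python. The per-pin numbers come straight from the post; the 64-pin total is only an extrapolation from the "64x" remark earlier in the thread, and the function name is made up.

```python
FLOPS_PER_SMARTPIN = 226
COG_REGISTERS = ("M", "X", "Y", "Z")   # Cog-accessible registers
REGISTER_WIDTH = 32
ACCUMULATOR_FLOPS = 32


def flop_budget(pins: int = 64) -> dict[str, int]:
    """Split the per-pin flop count and scale it to `pins` SmartPins."""
    register_flops = len(COG_REGISTERS) * REGISTER_WIDTH
    remaining = FLOPS_PER_SMARTPIN - register_flops - ACCUMULATOR_FLOPS
    return {
        "registers": register_flops,
        "accumulator": ACCUMULATOR_FLOPS,
        "remaining": remaining,
        "total_all_pins": FLOPS_PER_SMARTPIN * pins,   # assumes identical pins
    }
```

So 4 x 32 = 128 register flops plus 32 accumulator flops leaves 66 per pin, and 64 identical pins would come to 14,464 flops in total.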
Comments
I think Chip has said sync in slave is ok.
Hopefully he has test examples of Duplex, with the 3 pins configured as Shared_CLK, MISO and MOSI.
If the Sync block reads the CLK pin, can the difference between master and slave CLK just be the CLK pin OE?
Scratch all that. I was reading the diagram - http://forums.parallax.com/discussion/comment/1365368/#Comment_1365368 -, it doesn't appear to be complete enough to show any shifter inputs at all ....
I never doubted your diagram at all Andy!
This may seem overly-manual, but the data will be received earlier than GETPINZ would have been able to relay it
Hmm....
Thanks
Overly-manual? Manual bit banging with exact timing is one of the big strengths of the Propeller chips.
Andy
Thanks for the diagrams too
Here is a table from Quartus of usage for two smart pins. The first column is 'Logic Cells' and the second column is "Dedicated Logic Registers". This is from the DE0-Nano compile, so this is a Cyclone IV device, not a Cyclone V, like on the A9 board. These are LE's as opposed to ALM's:
You can see that four of the blocks have no registers. That's because they are mux'd to the flops when selected. Here are their descriptions:
pin_mod = modulator: DAC, pulse, transitions, PWM
pin_mtr = metering: measuring/counting/timing
pin_pgm = programmable modes
pin_ser = serial modes
EDIT: Very good proof there!
Thanks for the table.
The key question around the state logic is how much that shrinks if you comment out the state mode.
With all the muxes, is the pin cell still comfortably faster than the critical path? i.e. does it clock easily >> sysclk in NCO modes?