Thanks for that info, Chip. I ended up removing the nop after I figured out that it was safe to perform a testp after the clock's falling edge. So this works at 80MHz, clocking data at 10MHz.
SPIRD   rep     @.end,#8        ' 8 bits
        outnot  sck             ' clock (low high)
        outnot  sck
        testp   miso wc         ' read data from card
        rcl     tos,#1          ' shift in msb first
.end    ret
Peter, that wound up being the same as my ROM code:
'
'
' SPI byte in
'
spi_in  rep     @.in,#8         'ready to input a byte
        outh    #spi_ck         'clock pin high
        outl    #spi_ck         'clock pin low
        testp   #spi_dq wc      'sample data pin ('testp' is from before 'outh')
        rcl     x,#1            'save data bit
.in
        ret
Modifying that a little, and I'm not sure how much this would be valued, but here's a slight unrolling to even out the clock duty:
'
'
' SPI byte in
'
spi_in  outh    #spi_ck         'clock pin high
        rep     @.in,#7         'ready to input a byte
        outl    #spi_ck         'clock pin low
        testp   #spi_dq wc      'sample data pin ('testin' is from before 'outl')
        outh    #spi_ck         'clock pin high
        rcl     x,#1            'save data bit
.in
        outl    #spi_ck         'clock pin low
        testp   #spi_dq wc      'sample data pin ('testin' is from before 'outl')
        rcl     x,#1            'save data bit
        ret
Looks good, Evanh. The SPI memory outputs new data on the falling edge of the clock, so you'd want to sample just before the clock drops, which all our code does in these snippets.
My SPI code just toggles the clock so that I can enter with it high or low to cater for various chips. The data that I'm reading though is already available before I do any clocking and so the clock is simply outputting the next data bit.
Chip, do you think it's a good idea to use smartpins for SPI for the boot code? I will probably try this out next just so I have the options.
Looks good, Evanh. The SPI memory outputs new data on the falling edge of the clock, so you'd want to sample just before the clock drops, which all our code does in these snippets.
The SPI slave has to be clocking out on the rising clock edge! Otherwise the first bit is not present for the first shift.
The extra clock cycle needed at 80MHz is due to the unconstrained timing paths not allowing same-clock feedback from a register, through logic, out to a pin, back in from a pin, through more logic, then back into a register, all in 12.5ns.
Peter also reported issues at 40MHz, so those delays look to be significant.
Your notes for IN say "registered from physical pin", so I'm unclear where this slack is sneaking in ...
There's no need to suppose that some variable number of NOPs might be needed, based on frequency. It is safest to write code for the 80MHz timing situation, as it will always work.
Problem is, the real world does not quite work that way.
As long as the silicon can add an arbitrary extra clock, based on SysCLK and PVT, it will bite users.
In reality, the user codes until it works, and then later someone else may adjust the clock speed.
Even if they strive to carefully count cycles, there is no proof they have that right.
As your numbers show, even counting opcodes is too coarse, as the sampling can be mid-opcode.
So long as the silicon has the risk, proving it is field safe is very hard indeed.
i.e. you may just have fluked a 'good' part.
I went through the Verilog code to determine the IN/OUT paths to/from the cogs:
OUT
        registered in cog on 'go' (on last clock of instruction)
        registered in hub after OR'ing all cogs' OUT signals *
        goes through smart pin logic to physical pin
        (total delay = 1 clock after last clock of instruction)

        DRVC #30        '2 (+1!) updates after 1 clock after 2-clock instruction

IN for D/S
        registered from physical pin
        goes through smart pin mux and logic/filtering
        registered in hub for fan-out to all cogs *
        registered in cog on 'go' (last clock of prior instruction)
        (total delay = 3 clocks)

        TESTB INB,#31   '(?3+) 2 samples 3 clocks before 2-clock instruction

IN for TESTP{N}
        registered from physical pin
        goes through smart pin mux and logic/filtering
        registered in hub for fan-out to all cogs *
        registered in cog on 'get' (first clock of instruction)
        (total delay = 2 clocks, since it arrives in first clock of instruction)

        TESTP #31       '(?2+) 2 samples 2 clocks before 2-clock instruction

* Can *maybe* be eliminated in ASIC
As you can see, TESTP/TESTPN get IN data that is one clock fresher than instructions which read INA/INB via D or S.
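One way to read the delay notes above is as plain clock arithmetic. The sketch below is a small Python model, not anything taken from the Verilog: it assumes every instruction takes exactly 2 clocks, that an OUT-type instruction reaches the pin 1 clock after its last clock, and that TESTP and TESTB INx see the pin as it was 2 and 3 clocks before their first clock. All function and constant names are made up for illustration.

```python
# Hypothetical timing model of the IN/OUT path delays listed above.
# Assumption: instruction number k occupies clocks 2k and 2k+1.

OUT_DELAY = 1       # pin updates 1 clock after the last clock of the instruction
TESTP_DELAY = 2     # TESTP sees the pin as it was 2 clocks before its first clock
TESTB_IN_DELAY = 3  # TESTB INx sees the pin as it was 3 clocks before its first clock


def pin_update_clock(k):
    """Clock at which OUT-type instruction number k changes the physical pin."""
    last_clock = 2 * k + 1
    return last_clock + OUT_DELAY


def sample_clock(k, in_delay):
    """Clock at which the pin state read by instruction number k was captured."""
    first_clock = 2 * k
    return first_clock - in_delay


def margin(out_k, test_k, in_delay):
    """Sample clock minus pin-update clock (negative: the sample is older)."""
    return sample_clock(test_k, in_delay) - pin_update_clock(out_k)


# Loop body of the ROM snippet: outh (0), outl (1), testp (2)
print(margin(1, 2, TESTP_DELAY))     # testp relative to outl
print(margin(0, 2, TESTP_DELAY))     # testp relative to outh
print(margin(0, 2, TESTB_IN_DELAY))  # a TESTB INx in the same slot, relative to outh
```

Under these assumptions the TESTP sample is two clocks older than the OUTL edge and lands on the same clock as the OUTH edge, while a TESTB INx in the same slot would be one clock staler still.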
I'm not easily seeing a long path anywhere there, especially with the "registered from physical pin" on Pin in ?
In terms of *possible* ASIC speed-ups, the IN path seems safer than OUT.
The OUT includes a wide OR, (slow), and you want the OUT pin to be close to the clock edge, to allow other external parts setup times. Smart pin bypass sounds a fast path.
For IN, what slack is 'registered in hub for fan-out to all cogs' swallowing, there is no wide-OR on IN ?
Could that be easier to remove than the OUT case ?
What exactly is "goes through smart pin mux and logic/filtering" ?
The mux here is a simple bypass for no-smart-pin, and the filtering is default-off right ?
JMG,
I read it as the issue is on the output. Input is good, albeit with a fixed lag.
It looks like the big OR is hidden behind a register in the Hub. So that leaves delays from long route from the Hub and also any mux'ing with associated Smartpin. The Smartpin propagation should be very short.
I suspect that, without any changes, the final silicon will be a lot better simply because the long route won't have all the FPGA's selectable interconnect.
I can almost bet the problem is just the FPGA ... but I won't.
That's a good point, a scope should show which of the OUT, or IN, paths gets the bonus added clock.
Triggering the scope could be tricky. Maybe trigger from a pin toggled by a Smart Pin and compare it with a SW-driven pin? That would presume the Smart Pin path never shows this effect, and so could not move.
An alternative could be to add a SysClk/N scope pin to the FPGA build, e.g. /128 using a fast sync counter, so that the SW can pace itself to the same total clocks. Such a fixed ref avoids all Smart Pin complexities, and hopefully 'fails last'.
That fast-test-pin could also check IN effects, by wiring to a Smart Pin.
From my understanding, one of the gotchas with the FPGA is that the "OR"ing of the various "OUT"s comes from a number of places...
1. each cog (ultimately the OUTx registers by way of various instructions)
2. smart pins
3. streamer
This results in a big OR gate. In the FPGA, only 4 or 6 lines can be "or"ed at a time, so OR gates need to be cascaded. On top of that, we have the routing lines adding delays.
In the real silicon, the "OR" gate can have as many inputs as required, resulting in a single OR gate with fewer routing delays.
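The cascading argument can be put into numbers with a rough sketch. The Python below only counts gate levels; it says nothing about routing delay, and the source count of 18 is a made-up figure for illustration, not the actual number of OUT sources per pin.

```python
# Sketch: depth of an OR tree built from gates with a limited number of inputs.

def or_tree_levels(num_signals, fan_in):
    """Number of cascaded gate levels needed to OR num_signals together."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    levels = 0
    remaining = num_signals
    while remaining > 1:
        remaining = -(-remaining // fan_in)  # ceiling division
        levels += 1
    return levels


# Hypothetical source count, chosen only for illustration
SOURCES = 18
for fan_in in (4, 6, SOURCES):
    print(fan_in, or_tree_levels(SOURCES, fan_in))
```

With 4-input gates this hypothetical case needs three levels, with 6-input gates two, and a single wide gate needs one.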
In SPI Mode0 (the preferred SD mode) & Mode3 (Flash supports modes 0 & 3):
Data is output on CLK falling edge (must be valid within 6-8ns after the falling edge)
Data is sampled on CLK rising edge (must be valid tds>2ns before, and thd>2ns after the rising edge)
Ah, thanks for the nudge, I've just had a nosy at a Microchip SPI flash datasheet. The detail I hadn't understood is that the last clock (rising) of the read command and address is also the first clock (falling) of the data reply.
The first data bit is always present for reading back before the SPI-In-routine clock even gets raised ... so only 7 data clocks will be needed in all those examples above.
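To check the "only 7 data clocks" point, here is a minimal Python model of a mode 0 slave that changes its output on the falling clock edge. It assumes, as described above, that the first data bit is already on the line when the read routine starts; the class and function names are hypothetical.

```python
# Sketch of an SPI mode 0 slave that shifts data out on the falling clock edge.
# Assumption: the slave already drives the MSB when the byte read begins.

class Mode0Slave:
    def __init__(self, byte):
        self.bits = [(byte >> i) & 1 for i in range(7, -1, -1)]  # MSB first
        self.index = 0
        self.falling_edges = 0

    @property
    def miso(self):
        return self.bits[self.index]

    def falling_edge(self):
        """Advance to the next data bit, as the slave does when CLK falls."""
        self.falling_edges += 1
        if self.index < 7:
            self.index += 1


def read_byte(slave):
    """Sample MISO 8 times, clocking only between samples."""
    value = 0
    for bit in range(8):
        value = (value << 1) | slave.miso  # sample before the clock drops
        if bit < 7:
            slave.falling_edge()           # one clock pulse; the fall changes MISO
    return value


slave = Mode0Slave(0xA5)
print(hex(read_byte(slave)), slave.falling_edges)
```

In this model the full byte comes back after seven falling edges, because the first sample is taken before any clocking.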
Chip,
I just bumped into a small group of TESTP/TESTPN instructions that are explicitly encoded for the pin I/O bits. I note there is also a general case of these instructions that looks to be 100% compatible. The similarity is such that the {#}S bit selection field from the general cases is also listed in the pin I/O cases but these ones don't actually use S field at all.
I'm thinking the pin I/O versions should just be aliases of the general versions.
For immediate pin numbers, you could use 'TESTB INx,#pin', but for variable pin numbers, you'd need extra code to resolve INx. That's where TESTP/TESTPN are needed.
Ha, I've probably asked the same thing in the past and forgotten the answer... it's a touch of deja vu just looking down the instruction list and seeing this virtually identical group of instructions.
I do now, however, see one difference. That is, the bit range is 0-63 for the TESTPx instructions versus 0-31 for the TESTBx instructions.
EDIT: Oh, lol, I've kind of noted the same thing in a different way. You can't just runtime alias the two INx registers. Understood now.
1. TESTPx are shown with ,{#}S but there is no S. Is this a copy-and-paste error from TESTB?
2. TESTPx and the following DIRx instructions have identical opcodes.
3. SPLITB to REGEXP descriptions mention S, but D must be S and it would be less confusing to use D only.
Thanks! That "D,{#}S" was a mistake. I changed it to "{#}D", like it should have been.
There is a bunch of overlap between the TESTPx and the DIRx/OUTx/FLTx/DRVx encodings. The CZ bits differentiate the two sets. If the CZ bits are %01 or %10, it's TESTPx. If the CZ bits are %00 or %11, it's DIRx/OUTx/FLTx/DRVx. In other words, if one flag is affected, it's TESTPx. If neither or both flags are affected, it's the others.
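The CZ rule above boils down to counting how many of the two flag bits are set. Here is a short Python sketch of just that rule; the function name is made up, and it ignores every other field of the encoding.

```python
# Sketch: telling the two overlapping instruction groups apart by the CZ bits.

def classify_cz(cz):
    """Classify a shared encoding by its 2-bit CZ field."""
    if cz not in (0b00, 0b01, 0b10, 0b11):
        raise ValueError("cz must be a 2-bit value")
    flags_affected = bin(cz).count("1")
    # Exactly one flag affected: TESTPx. Neither or both: the pin-drive group.
    return "TESTPx" if flags_affected == 1 else "DIRx/OUTx/FLTx/DRVx"


for cz in range(4):
    print(format(cz, "02b"), classify_cz(cz))
```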
I have V26 working and my program works.
So I retried programming V27a (BeMicro_A9_Prop2_v27.jic 11/11/2017 2:39AM) and it worked.
But when I repower (replug USB programming port) there are no cog LEDs lit (shouldn't Cog 0 be lit?).
I have an RGB LED connected to P5, P7 & P9 via resistors to GND. These are ON, indicating these pins are high. On v26 the RGB LEDs are OFF.
PNut_v27a.exe 19/11/2017 8:31AM (dd/mm/yyyy format) cannot find the USB port for downloading - ie it cannot find the prop.
Do I have the correct versions? Am I missing something?
https://drive.google.com/file/d/1omGhklqFgAEEoR0jrSNupT_UUxO7Cye8/view?usp=sharing
Here are the overlapping opcodes, expanded for clarity
Thanks Brian. v27z jic is working. Are the SD pins mapped on this version, and if so, are they the P3x or P6x sets and is SW1 used?
I also found v27zz which has fixed SD mapping but I missed where the SD is mapped.
Thanks Chip & ozpropdev.
I am not finding any lockups on v27z. I've run some code outputting to serial for >1hr so far.
Haven't yet figured out if SD is on ~P38 or ~P60.
Update: now been running for more than 6 hours, no lockups.
Last mention of "settask" instruction seems to be circa 2014...
I guess I didn't pay a whole lot of attention back then because there were 16 cores...
When HubExec was devised, coherent integration with the time-sliced threads was troublesome. HubExec was clearly the favourite of the two and, as you've noted, we suddenly had 16 true cores to play with, so no-one felt losing the threads was a significant loss.