P2 - DDR Data Capture
With the P2, how fast can I capture 8-bit data in sync with a clock signal, with data available on both rising and falling edges?
The data starts with a trigger pulse, then I need to read approx. 10K samples (8-bit 3V3 digital data). After the data capture I do some simple math on the captured samples until the beginning of the next cycle. I assume that at each edge I need to read the port and write it to hub memory.
So far I have been using a Xilinx Zynq FPGA, and I was thinking about switching to something less complex and more cost effective since I want to scale it after deploying multiple units. Maybe I can replace my FPGA SOM board with a custom P2 board.
Does a 20 MHz clock (40M samples/s) sound doable?
PS: I am not afraid of assembly. I did some cool stuff with P1 years ago.
Other things I need: one I2C, one SPI master, a fast (921600 bps) UART and 7 GPIOs. I have a JTAG connector on the carrier board to program the FPGA. I want to keep the carrier board as-is.
Cheers
Comments
Smartpins can functionally be externally clocked by another clock source but since they are independent serial inputs they would then be quite the burden to massage into 8-wide data. They can go as high as sysclock/3 in the hardware but the software wouldn't be able to keep up. As a guess, sysclock/8 might be achieved fetching and massaging the data of eight smartpins at a time.
Alternatively, each cog also contains one "streamer" that can DMA to hubRAM at 8-bit wide but it does not use any external clock. Therefore it requires the software to identify the start of data signalling and also configure the streamer's sampling rate to stay aligned. This in turn requires the Prop2 itself to be clocked by a common clock source.
It has been demonstrated, via an Ethernet RMII interface, that by using a fixed length preamble the receiver code can manage the appropriate streamer alignment timing so as to not miss any incoming data. This should be good for up to sysclock/2. Even sysclock/1 is possible but any phase error becomes extremely tight then.
Realistically you'll need a P2 operating at something greater than about 5x the sampled DDR rate to do this. So an overclocked 350 MHz P2 could capture close to 70M samples/s from the IO pins, assuming IO signal integrity is okay. The sample point within the bit does vary by up to one P2 clock, however, and you may need NOP/WAITX delays. These reduce the maximum sample rate but help to reliably capture the data in the middle portion of the data transitions.
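To put rough numbers on that rule of thumb, here is a small Python sketch. The factor of 5 is just the estimate quoted above, and the function names are made up for illustration. The real margin depends on the actual loop timing and on signal integrity.

```python
def min_sysclock_hz(ddr_clock_hz, clocks_per_sample=5):
    """Estimate the P2 system clock needed for a bit-bashed DDR capture loop."""
    sample_rate_hz = 2 * ddr_clock_hz  # DDR: one sample on each clock edge
    return sample_rate_hz * clocks_per_sample


def max_sample_rate_hz(sysclock_hz, clocks_per_sample=5):
    """Estimate the highest sample rate a given system clock can follow."""
    return sysclock_hz // clocks_per_sample


print(min_sysclock_hz(20_000_000))      # 200000000
print(max_sample_rate_hz(350_000_000))  # 70000000
```

So by this estimate a 20 MHz DDR clock (40M samples/s) would want a sysclock of about 200 MHz or more.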
This is probably the tightest way to do it, in the snippet below. Set up the 8-bit input data on pins 0-7 (ina) or pins 32-39 (inb) and use the FIFO to write to HUB as fast as possible. Depending on IO delays through the chip, you may find that you need to wait for the opposite edge (or you can introduce NOPs or WAITX to delay after you see the special edge event). The waitse1 and waitse2 instructions will take from 2..n cycles to occur; the other instructions in the REP loop should take 2 once the FIFO is primed. Note that writing bytes may not be the fastest way if the FIFO can't keep up, and wflong may be better than wfbyte (but will need post-processing to extract the byte from the long in memory afterwards). Certain capture rates are not as efficient as others with bytes being written into the FIFO, depending on the spacing of the executed wfbyte instructions.
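For the wflong variant, the post-processing step amounts to a mask and shift on each captured long. A rough sketch follows, written in Python for clarity (on the P2 itself the same thing would be done in PASM2 or Spin2). It assumes the data sits on the lowest 8 pins of the port, so it lands in the least significant byte of each little-endian long. The name extract_bytes and the shift parameter are illustrative only.

```python
import struct


def extract_bytes(raw, shift=0):
    """Pull one 8-bit sample out of each little-endian 32-bit long in raw."""
    count = len(raw) // 4
    longs = struct.unpack(f"<{count}I", raw[:count * 4])
    # Keep only the 8 data bits; the other pin states in the long are dropped.
    return bytes((value >> shift) & 0xFF for value in longs)


# Two captured longs: data byte in bits 7..0, unrelated pins in the upper bits.
raw = struct.pack("<2I", 0xDEADBE12, 0x00000034)
print(extract_bytes(raw).hex())  # 1234
```

If the data pins were elsewhere in the port, shift would be the bit position of the lowest data pin within that port.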
Depends on if you're taking the clock as an input or generating it on the P2. In the latter case, the hardware can just do it in the background. But it can't use an external clock input, so in that case you'd need to fall back to a hot loop (like what one'd have done on the P1) (though maybe single-cycle one-shot streamer commands could be used to shorten that loop...) @rogloh did parallel RGB capture from an LCD interface and that worked decently fine.
UART/SPI/I2C can be plentiful with varying level of hardware assist. No JTAG programming though, only regular UART.
EDIT: wow the other guys sniped me with more detailed info
JTAG at its heart is SPI.
Hmm, I'm thinking that instead of doing a WFBYTE directly from INA, you could do a XZERO or XINIT instead (with a length-1 command), which would allow using any pin group and delaying the signal by a few cycles. Haven't tried that but should work.
Also IDK what you're on about, the FIFO will/should take any amount of WFBYTE instructions just fine. The problems only happen when you write from the streamer into the FIFO and then try to do a non-FIFO memory access and that hangs the CPU for an extended period.
Think the hyperram driver does this with the P2 providing the clock at 300 MHz? or is it 150 MHz?
40 MHz from external clock sounds easy?
Wait for clock then do wfbyte in rep?
That's the Prop as the master providing the clock. Different story when trying to follow an external clock source.
Roger's bit-bashed solution above is very good really. It actually waits for each clock edge.
Though if your clock is always active and never turns off, you can use it as the P2's clock source and simplify your life that way.
Well, I know how to get 350 MHz now, just up Vdd to 2.0 V.
But if the clock can be made synchronous with the P2, I think it could be easier...
Or, if capture lengths are relatively small, maybe clocks don't have to be synchronous..
Yep, but still have to find the start of data. Which is where a preamble helps hugely. It's basically a big long chip-select. Plenty of time for software to react.
WFBYTE / WFWORD data are written to hub RAM as bytes / words and the FIFO can be filled faster than hub RAM can be written.
EDIT:
What I'm trying to say is there is a potential timing issue with WFBYTE in particular as bytes are not assembled into longs before being written to hub RAM. Same applies to WFWORD but issue is less acute.
Yeah I just went and tested this out too. The WFBYTE timing seems okay at any gap spacing from 0..40 clocks (except I could not test a 1 clock delay). I must have confused it with FIFO reads and something either evanh or TonyB_ had mentioned a while back.
EDIT: okay TonyB_ just mentioned this too above. But it must be in conjunction with other HUB accesses. Back to back wfbyte's with arbitrary fixed gaps between them seems possible in the absence of those extra accesses. Unless it is somehow hub window related - I didn't vary the start address in this test.
BTW - I found, after some initial problems, that if you move the last DAT block (with buf defined at the end of this snippet) so that it occurs just before the PASM DAT code block, it will hang in flexspin, even if you add an extra ORGH before the "pasm" label. Not sure why, but it's probably a bug in flexspin. I'm running 7.0.0-beta-v6.9.7-65-g3914edfd of that compiler. Alignment perhaps?
Ada is right, this is only a problem if trying to use RD/WRxxxx in combination with WFxxxx. The FIFO gets hoggy with WFxxxx instructions, for sure, but it never gets behind. The only problem is that hogginess then blocks any ordinary RD/WRxxxx instructions. As long as you don't do both together it's fine. I had to make a point of doing one at a time in the new streamer based SD Card driver.
If the data is slow enough then it won't be an issue.
It would be a problem if there are no gaps in the streaming data going to hubRAM, and that data rate is high. At some point the cog is going to want to write something else to hubRAM. But then at some point you are going to run out of hubRAM too.
Yes, WFxxxx can stall random reads and writes very easily. What I said is not true all the time. At very high data rates multiple WFBYTEs must be written to hub RAM at the same time as words or longs because it takes a minimum of 8 cycles to write a byte to the same slice as a previous write.
Thanks guys
The clock is an input for me, but it is derived from a free-running clock I am generating, so maybe I can treat it as an internal clock too.
With the FPGA, I am using around 12.5MHz DDR clock but probably twice that would work fine too.
So 12.5MHz is what I am aiming for now.
I capture approx. 10K samples. From time to time I also send them over a UART at 921600 bps, but I split the transmission into smaller chunks so I don't have to wait before starting a new sampling cycle.
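A quick back-of-envelope check shows why the chunking is needed. This sketch assumes 8N1 framing (10 bits on the wire per byte), takes "10K" as 10,000 raw bytes, and uses a 128-byte chunk purely as an example size. None of those details are confirmed above.

```python
def uart_tx_time_us(num_bytes, baud=921_600, bits_per_byte=10):
    """Time in whole microseconds to shift num_bytes out of the UART."""
    return num_bytes * bits_per_byte * 1_000_000 // baud


print(uart_tx_time_us(10_000))  # 108506 (about 108.5 ms for a full capture)
print(uart_tx_time_us(128))     # 1388 (about 1.4 ms for a 128-byte chunk)
```

So a full buffer takes far longer to send than one capture, and has to be spread over many cycles.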
One cycle is around 2 ms. So within that time I do the data capture, do the calculations (a simple thresholding and center of mass calculation), then start the new cycle.
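For readers following along, a thresholded center of mass over the captured samples might look roughly like this. The exact thresholding rule and weighting are not spelled out above, so this is only one common form, and the function name is made up.

```python
def thresholded_centroid(samples, threshold):
    """Value-weighted mean index of the samples at or above threshold, else None."""
    total = 0
    weighted = 0
    for index, value in enumerate(samples):
        if value >= threshold:
            total += value
            weighted += index * value
    if total == 0:
        return None  # nothing reached the threshold
    return weighted / total


samples = [0, 3, 10, 20, 10, 2, 0]
print(thresholded_centroid(samples, 5))  # 3.0
```

It is a single pass with two running sums, so it maps naturally onto a short loop over the capture buffer.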
Ah, it's measuring the physical world rather than a comms block transfer that I was thinking about.
So the exact start of sampling is not critical, which eliminates the need for a preamble. Roger's example bit-bashed code in post #3 above is about all you need then.