There's something about this VGA demo. I can't quite put my finger on it....
The slideshow blanks the display while it loads and flips each BMP file, then holds each image for 4 frames. The VGA driver reloads the palette and intensity every frame and advances a frame counter so that we can sync to it.
I've optimised the VGA driver so that I can load a BMP file directly into video memory with just a slight offset adjustment. So yes, the main bottleneck is loading a 300kB file. It is still quite fast, even though it uses the single-sector read command, which normally wastes a lot of time waiting for the SD controller to access each sector as if it were non-sequential. However, even at 180MHz the SPI clock is around 22.5MHz and the signal is already looking quite rounded. I will have to see what I need to do to improve the signal itself, but of course I could speed things up a lot simply by getting the sequential block read mode to work properly.
I am outputting a clock from the P2 here, but I think the rounding is a limitation of my 100MHz scope. Here is a 50MHz waveform using x10 probes and a 33R series termination resistor just to tame the reflections a bit. The probe ground goes directly to a ground pin on the P2D2, and the signal is from P42 near the center.
Is that 50MHz from the P2? Did you generate it with a 50% duty cycle?
P2 physical pin speeds were always going to impose some limit, the only question was where.
There is a lot of stuff hanging off a P2 Pin...
And then there's always that members-only 4-bit mode to investigate. I wonder if the pins slew faster in that mode.
Yeah, the P2 hopefully opens up the possibility of doing SDIO in 4-bit mode with proper CRC calculations at a reasonable speed, giving some benefit over SPI mode. In fact I'm laying out a board right now on the assumption that it can be done at some point, by including the extra two SD wires (DAT1 and DAT2) to P2 pins. I've had SD mode going before with my own custom P1V implementation, so I learned what is needed in terms of the command and data transfer state machines. You have to hunt around online for various snippets and samples of information and code to work that out, but it can be done with lots of patience.
I think streaming nibbles to/from the 4 pins can be done automagically by the P2 streamer, but the COG is also going to need to either precompute any sent CRC16/CRC7 values in advance or, more ideally, compute them as it goes. Maybe the COG can track the information being sent from the streamer to the pins on each new clock and inject the CRC at the end, using a table lookup in LUT memory on each byte. This would be a very interesting timing project...and a good way to learn P2 capabilities.
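For anyone wanting to play with the table-lookup idea on a PC before committing it to P2 code, here is a minimal sketch in Python (not P2 assembly, and not taken from any existing driver; the function names are my own). It builds a 256-entry table for the CRC16 used on SD data blocks (polynomial 0x1021, initial value 0) and does one lookup per byte, and it computes the CRC7 used on commands bit by bit. Note that in 4-bit SD mode a separate CRC16 is generated for each DAT line, so the byte-wise lookup below corresponds to the 1-bit and SPI cases; a 4-bit version would need the data split per line or a differently arranged table.

```python
# Sketch only: host-side model of SD CRC7 / CRC16, names are hypothetical.

def make_crc16_table(poly=0x1021):
    """Build the 256-entry lookup table for an MSB-first CRC16."""
    table = []
    for i in range(256):
        crc = i << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = (crc << 1) ^ poly
            else:
                crc <<= 1
            crc &= 0xFFFF
        table.append(crc)
    return table

CRC16_TABLE = make_crc16_table()

def crc16_update(crc, byte):
    """Fold one byte into the running CRC16 with a single table lookup."""
    return ((crc << 8) & 0xFFFF) ^ CRC16_TABLE[((crc >> 8) ^ byte) & 0xFF]

def crc16(data, crc=0):
    """CRC16 over a whole block, starting from 0."""
    for byte in data:
        crc = crc16_update(crc, byte)
    return crc

def crc7(data):
    """CRC7 (x^7 + x^3 + 1) over the command bytes, MSB first."""
    crc = 0
    for byte in data:
        for bit in range(7, -1, -1):
            in_bit = (byte >> bit) & 1
            top = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if in_bit ^ top:
                crc ^= 0x09
    return crc

def command_crc_byte(cmd_bytes):
    """Last byte of an SD command frame: CRC7 in bits 7..1, end bit = 1."""
    return (crc7(cmd_bytes) << 1) | 1

if __name__ == "__main__":
    cmd0 = bytes([0x40, 0x00, 0x00, 0x00, 0x00])
    print(hex(command_crc_byte(cmd0)))
    print(hex(crc16(b"123456789")))
```

The per-byte update is the part a COG would have to fit between streamer transfers; the table is 256 words, so it would fit in LUT memory, but whether the timing works out on real hardware is something I have not verified.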
Update: I've mapped my SD pins as follows to the P2. I hope the streamer supports this allocation for the optimized nibble transfers...though if it doesn't work at least SPI is still going to be possible.
CMD P26
CLK P27
DAT0 P28
DAT1 P29
DAT2 P30
DAT3 P31
Hey Roger, you reckon you could post that 4-bit code or whatever you have?
I can also see by the reflections that I will need series termination, so I will allow for a resnet or two.
Peter, here are some pictures from an 8GSa/s 1.5GHz Infiniium scope with active FET probes. I pulled it out and set it up after I saw your post above. It runs on an old version of Windows and needs a PS/2 mouse and the BIG round keyboard connector. It's a pain to set up, but works great. Have you ever seen 1ns/div? This was new to me. We bought this for $10k back in 2005 for the Prop1. These were $40k new. Today you can buy them for about $3k.
Anyway, here is a shot of a smart pin in NCO mode with F=$8000_0000, so that it toggles on every clock. The waveform is therefore 150MHz, half the 300MHz clock frequency:
That was cheating, though, because I had clocking enabled on the pin (C=1 in %CIOHHHLLL). It squared up the timing. Look at the duty cycle measurement.
Here it is without clocking enabled (C=0), so we are seeing what the core logic is outputting. See the difference in Tph vs. Tpl:
I found out that clocking (C=1) fails around 300MHz, unless you raise VIO a bit.
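As a quick check on the NCO numbers above, here is a small Python sketch of the arithmetic. It assumes the NCO adds F to a 32-bit phase accumulator on every system clock (base period of 1) and drives the pin from the accumulator's top bit; the function name is my own.

```python
def nco_output_hz(f_word, sysclk_hz):
    """Output frequency of a 32-bit phase accumulator NCO.

    Assumes F is added once per system clock and the pin follows
    the top bit of the accumulator, so F must not exceed $8000_0000.
    """
    if not 0 <= f_word <= 0x8000_0000:
        raise ValueError("F out of range for a clean square wave")
    return sysclk_hz * f_word / 2**32

if __name__ == "__main__":
    # F=$8000_0000 flips the top bit on every clock: half of 300MHz
    print(nco_output_hz(0x8000_0000, 300e6) / 1e6, "MHz")
```

With F=$8000_0000 each half-period is a single 300MHz clock, about 3.33ns, which is why any difference between rise and fall delay shows up so clearly in the Tph and Tpl measurements.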
That one really will be the long run OUT route. The final buffer/driver is squaring it up but the internal rise and fall slewing must be uneven.
EDIT: It won't affect the DIR to OUT relationship because they will both experience the same long paths. Or at least, the rules will ensure they are affected equally.
That would also explain the notch in reliable clock frequency for ADC operation. Beyond about 325MHz it appeared reliable again. The IN bitstream is possibly skewing so badly that the transitions are being sampled by the smartpin on the following clock. It's also possible this is producing a distortion between rise and fall at the seemingly reliable 340MHz sysclock.
Yep. I never imagined the I/O pads wouldn't be fast *enough*. Now, it's a good thing that clocking must be turned ON, and not the default, so that we have an asynchronous path to the I/O pins that allows the core to still function over I/O, at all, at such high speeds. Providence!!
So I've fixed the bug I had in the multiple block read mode, so that is working now, and I loaded a 300kB BMP file in...
LAP TIGER LAP .LAP 31,316,360 cycles = 97,863,625ns @320MHz ok
Less than 100ms! That means it is reading at a rate of over 3MB/sec !!!
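The arithmetic behind that figure, as a small Python sketch. The cycle count and clock are from the LAP output above; the file size of 300,000 bytes is only my round-number approximation of "300kB", and the function name is made up.

```python
def lap_to_throughput(cycles, sysclk_hz, nbytes):
    """Convert a cycle count into elapsed seconds and bytes per second."""
    seconds = cycles / sysclk_hz
    return seconds, nbytes / seconds

if __name__ == "__main__":
    # 31,316,360 cycles @ 320MHz; 300_000 bytes is an approximation
    seconds, rate = lap_to_throughput(31_316_360, 320e6, 300_000)
    print(f"{seconds * 1e3:.3f} ms, {rate / 1e6:.2f} MB/s")
```

If the file is nearer 300 x 1024 bytes, the rate comes out slightly higher, so "over 3MB/sec" holds either way.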
The SPI clock routines have an extra nop in the clock to stretch them out now too.
Who knows how fast I can push it if I used the smartpins and include some series termination on the next board.
BTW - this is all there is to the VIEW routine.
: VIEW ( sector -- )
--- read file header and align file palette to memory palette then read all
DUP FOPEN PALETTE 10 SDW@ $400 - - BMPSZ HIDE SDRDS SHOW
;
That doesn't offer any improvement since it has 5 instructions in the main loop, the same as the old routine, plus it adds 2 extra instructions. Nonetheless I tried it out and while it worked, it didn't run any faster. The SD clock is running at 30MHz for both. Maybe I will have to look at smartpin modes next.
EDIT: Sorry, yours has 4 instructions in the main loop but I would have to do the same to the SPIRX version as well which handles blocks. One of the problems we have is that we only have 16kB of ROM total and we really had to do some squeezing.
Comments
By the way hope to put a proto pcb panel in on Monday. There'll be room, if you want to run a proto (1 or 2 copies)
It's the Sonic Dreams image right?
:-)
J
Can anyone confirm what they get?
Hey Roger, you reckon you could post that 4-bit code or whatever you have?
I can also see by the reflections that I will need series termination, so I will allow for a resnet or two.
Good catch on the series termination. I might add some too, just in case. Cheers.
I found out that clocking (C=1) fails around 300MHz, unless you raise VIO a bit.
That'll be setup time of OUT rise getting too late.
Clearly OUT is falling faster than rising to produce such a skewed duty.
500ps rise and fall time
Overshoot doesn't look too bad either
Do you know why the high is shorter than the low?
That one really will be the long run OUT route. The final buffer/driver is squaring it up but the internal rise and fall slewing must be uneven.
EDIT: It won't affect the DIR to OUT relationship because they will both experience the same long paths. Or at least, the rules will ensure they are affected equally.
That partially explains the low glitch we get when DIR and OUT fall on the same clock.
Yep. I never imagined the I/O pads wouldn't be fast *enough*. Now, it's a good thing that clocking must be turned ON, and not the default, so that we have an asynchronous path to the I/O pins that allows the core to still function over I/O, at all, at such high speeds. Providence!!
Yeah, I guess, we're way beyond rated speed. Anything here is just bonus.
BTW - this is all there is to the VIEW routine. Just one line of code.
Use this as an improved clocking template:
PS: In case you're wondering, I derived that from your posting here - https://forums.parallax.com/discussion/comment/1426178/#Comment_1426178
That doesn't offer any improvement since it has 5 instructions in the main loop, the same as the old routine, plus it adds 2 extra instructions. Nonetheless I tried it out and while it worked, it didn't run any faster. The SD clock is running at 30MHz for both. Maybe I will have to look at smartpin modes next.
EDIT: Sorry, yours has 4 instructions in the main loop but I would have to do the same to the SPIRX version as well which handles blocks. One of the problems we have is that we only have 16kB of ROM total and we really had to do some squeezing.
I hope you've carefully replaced every RET with _RET_ :P
I'm running my loop now with P2 at 340MHz and I'm reading a 33.8MHz clock, so it is very close to 1/10 of the CPU clock.
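The 1/10 ratio fits a bit-banged loop of 5 instructions per bit, if you assume each instruction takes 2 clocks (that assumption, and the function name, are mine). A rough Python sketch of the estimate:

```python
def spi_clock_hz(sysclk_hz, instructions_per_bit, clocks_per_instruction=2):
    """Estimate the bit-banged SPI clock from the loop length.

    Assumes every instruction in the per-bit loop costs the same
    number of system clocks (2 by default) with no stalls.
    """
    return sysclk_hz / (instructions_per_bit * clocks_per_instruction)

if __name__ == "__main__":
    # 5-instruction loop at 340MHz: estimate vs. the 33.8MHz reading
    print(spi_clock_hz(340e6, 5) / 1e6, "MHz")
    # 4-instruction loop at 180MHz, matching the earlier 22.5MHz figure
    print(spi_clock_hz(180e6, 4) / 1e6, "MHz")
```

On the same assumption, a 5-instruction loop would give 30MHz at a 300MHz sysclock, and dropping to 4 instructions per bit would be worth about 25% more SPI clock, signal quality permitting.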