AHHH, this was one detail I missed. Is there any disadvantage to the way I'm doing things?
At that speed, it's fine. 100 MHz is the output of the PLL. There is a limit to how fast the PLL can go: I think it's good for about 400 MHz on the P2ES, and Chip has raised that to maybe 500 MHz for the finished Prop2.
Oops, I got that wrong. _XDIVP is to be used normally there. It's there to keep the PLL in the 100-400 MHz range when you only want a slower system clock, like 25 MHz for example. So you've chosen exactly the right settings.
I remember now: Chip did some simulation of the PLL and decided it had far too much leeway at the bottom end, right down to 1 MHz. So he tweaked the respin to sacrifice some of that to make it go faster at the top end.
It took me a few but I was able to figure out... if you change CLKSET as follows, you can use cluso's code very easily.
_XTALFREQ = 20_000_000 ' crystal frequency
_XDIV = 1 ' crystal divider to give 20MHz
_XMUL = 5 ' crystal / div * mul (100mhz)
_XDIVP = 4 ' crystal / div * mul /divp to give _CLKFREQ (1,2,4..30)
_XOSC = %10 'OSC ' %00=OFF, %01=OSC, %10=15pF, %11=30pF
_XPPPP = ((_XDIVP>>1) + 15) & $F ' 1->15, 2->0, 4->1, 6->2...30->14
_CLOCKFREQ = _XTALFREQ / _XDIV * _XMUL / _XDIVP ' internal clock frequency
_SETFREQ = 1<<24 + (_XDIV-1)<<18 + (_XMUL-1)<<8 + _XPPPP<<4 + _XOSC<<2 ' %0000_000e_dddddd_mmmmmmmmmm_pppp_cc_00 ' setup oscillator '' this is used by clkset
...
clkset(_SETFREQ, _CLOCKFREQ)
From what I understand it's best to keep xdiv low while keeping 100 MHz <= (xtal / xdiv * xmul) <= 200 MHz. In my example I'm setting for 25 MHz (20 / 1 * 5 / 4). I've been using xtal / 2 * 16 / 2 to match my P1 80 MHz clock. Hope this helps a little!
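A combination of _XDIV, _XMUL and _XDIVP can be sanity-checked off-chip before it goes into the CON block. The following is only a rough sketch in Python, not part of anyone's posted code: the function name pll_settings is made up for illustration, and the 100-200 MHz window is just the rule of thumb quoted above (other posts in this thread put the usable PLL range nearer 100-400 MHz), not a datasheet limit.

```python
def pll_settings(xtal_hz, xdiv, xmul, xdivp,
                 vco_min=100_000_000, vco_max=200_000_000):
    """Mirror the CON arithmetic for one divider/multiplier choice.

    vco_min/vco_max are an assumed rule-of-thumb window, not a hardware spec.
    """
    vco_hz = xtal_hz // xdiv * xmul        # PLL output before the post-divider
    clock_hz = vco_hz // xdivp             # same arithmetic as _CLOCKFREQ
    pppp = ((xdivp >> 1) + 15) & 0xF       # same as _XPPPP: 1->15, 2->0, 4->1 ... 30->14
    in_window = vco_min <= vco_hz <= vco_max
    return vco_hz, clock_hz, pppp, in_window


if __name__ == "__main__":
    # The listing above: 20 MHz crystal, /1, *5, /4
    print(pll_settings(20_000_000, 1, 5, 4))
    # The "match my P1" setting: 20 MHz crystal, /2, *16, /2
    print(pll_settings(20_000_000, 2, 16, 2))
```

With the values from the listing, the PLL runs at 100 MHz and the system clock comes out at 25 MHz; the /2 * 16 /2 variant gives a 160 MHz PLL and an 80 MHz system clock.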
In cluso's listing there is a constant
_XSEL = %11
that isn't in your listing (above). Does that make a difference?
I removed that to make it (more) obvious that the clock is set using clkset. _XSEL is the OSC enable which is actually set by par3, which is initialized to 3 before the call (thus enabling the osc). It really doesn't make a difference because it's automatically supplied by the CLKSET instruction. That was what took me so long to understand the difference between the two methods.
I did update my setup code. Here is the latest...
Just select the section for your board and comment out the other.
Note the divide works better with _XDIV being smaller. There is jitter if the divided crystal frequency (_XTALFREQ / _XDIV) comes out at 0.5 MHz or 1 MHz.
Choose your values _XDIV, _XMUL and _XDIVP and the calculations are done for you.
_1us is also defined so you can use
WAITX ##(_1us * n)-1
to give a precise delay, where n is the number of microseconds to wait.
CON
'+-------[ Select for P2-EVAL ]------------------------------------------------+
_XTALFREQ = 20_000_000 ' crystal frequency
_XDIV = 2 '\ '\ crystal divider to give 10.0MHz
_XMUL = 15 '| 150 MHz '| crystal / div * mul to give 150MHz
_XDIVP = 1 '/ '/ crystal / div * mul /divp to give 150MHz
_XOSC = %10 '15pF ' %00=OFF, %01=OSC, %10=15pF, %11=30pF
'+-------[ Select for P2D2 ]---------------------------------------------------+
_XTALFREQ = 12_000_000 ' crystal frequency
_XDIV = 4 '\ '\ crystal divider to give 3.0MHz
_XMUL = 99 '| 148.5MHz '| crystal / div * mul to give 297.0MHz
_XDIVP = 2 '/ '/ crystal / div * mul /divp to give 148.5MHz
_XOSC = %01 'OSC ' %00=OFF, %01=OSC, %10=15pF, %11=30pF
'+-----------------------------------------------------------------------------+
_XSEL = %11 'XI+PLL ' %00=rcfast(20+MHz), %01=rcslow(~20KHz), %10=XI(5ms), %11=XI+PLL(10ms)
_XPPPP = ((_XDIVP>>1) + 15) & $F ' 1->15, 2->0, 4->1, 6->2...30->14
_CLOCKFREQ = _XTALFREQ / _XDIV * _XMUL / _XDIVP ' internal clock frequency
_SETFREQ = 1<<24 + (_XDIV-1)<<18 + (_XMUL-1)<<8 + _XPPPP<<4 + _XOSC<<2 ' %0000_000e_dddddd_mmmmmmmmmm_pppp_cc_00 ' setup oscillator
_ENAFREQ = _SETFREQ + _XSEL ' %0000_000e_dddddd_mmmmmmmmmm_pppp_cc_ss ' enable oscillator
_1us = _clockfreq/1_000_000 ' 1us
DAT org 0
'+-------[ Set Xtal ]----------------------------------------------------------+
entry hubset #0 ' set 20MHz+ mode
hubset ##_SETFREQ ' setup oscillator
waitx ##20_000_000/100 ' ~10ms
hubset ##_ENAFREQ ' enable oscillator
'+-----------------------------------------------------------------------------+
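For anyone who wants to check the constants by hand, here is a small Python sketch that mirrors the CON expressions for _SETFREQ, _ENAFREQ and the WAITX operand. It is only an illustration: mode_words and waitx_operand are hypothetical helper names, and the bit layout is taken straight from the comment in the listing (%0000_000e_dddddd_mmmmmmmmmm_pppp_cc_ss), not from any other reference.

```python
def mode_words(xdiv, xmul, xdivp, xosc, xsel=0b11):
    """Return (_SETFREQ, _ENAFREQ) using the same fields as the CON block."""
    pppp = ((xdivp >> 1) + 15) & 0xF            # _XPPPP encoding
    setfreq = ((1 << 24)                        # e: PLL enable
               + ((xdiv - 1) << 18)             # dddddd
               + ((xmul - 1) << 8)              # mmmmmmmmmm
               + (pppp << 4)                    # pppp
               + (xosc << 2))                   # cc
    return setfreq, setfreq + xsel              # ss added for the enable word


def waitx_operand(clock_hz, microseconds):
    """Mirror WAITX ##(_1us * n)-1, including the integer division in _1us."""
    one_us = clock_hz // 1_000_000              # _1us = _clockfreq/1_000_000
    return one_us * microseconds - 1


if __name__ == "__main__":
    for name, args in (("P2-EVAL", (2, 15, 1, 0b10)), ("P2D2", (4, 99, 2, 0b01))):
        setfreq, enafreq = mode_words(*args)
        print(f"{name}: _SETFREQ=${setfreq:08X} _ENAFREQ=${enafreq:08X}")
    print(waitx_operand(150_000_000, 10))
    print(waitx_operand(148_500_000, 10))
```

One thing the sketch makes visible: with the P2D2 settings the clock is 148.5 MHz, so the integer division leaves _1us at 148 and delays built from it come out slightly short. Whether that matters depends on how precise the delay needs to be.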
_XSEL is an optional 3rd parameter to clkset, so the full version of clkset is:
clkset(_SETFREQ, _CLOCKFREQ, _XSEL)
If you don't pass in an explicit _XSEL then it defaults to %11, which is what you will usually want. But for some combinations (for setting rcfast or rcslow) you'll have to give an _XSEL yourself.
BTW the source code for clkset (and other P2 builtin functions) is in the spin2cpp repository in the file sys/p2_code.spin:
Thanks guys,
I should have probably figured out the 3rd optional parameter (xsel) on my own. I'm back to my 3rd post in this thread, trying to understand the inline asm. It seems that if I set the optimization level -O2 my test program hangs after the serial output but before toggling sd pins. I'm not sure if this is a bug in fastspin or if there's something in my code that's not friendly to being optimized.
So here comes the stupid question for the day: what would be the best way to figure out what's going wrong? I tried using an older version of fastspin and that seemed to break on -O1. I've been going through the assembly for the working and non-working versions and I'm not seeing anything jump out at me. I'm at a loss!
It's very possible that the fastspin optimizer will not work properly on "rep" without labels (in inline assembly). If so, that's a bug. Could you try changing it to use an end label instead of a count of instructions?
Also, I think you used "8-1" for your repeat count; why wasn't it "8" (one for each bit)?
I'm slowly working to confirm that I'm not doing something wrong but I really can't rule that out yet.
@ersmith, I'm trying to take the "pure spin" driver and compile as suggested with -O2 but something's not quite right.
Regarding inline asm, in the test file where I am using inline asm I've taken your suggestion and am now using rep with labels. I also corrected the rep count to 8. I had it as 8-1 because I thought I saw it somewhere. Either I was remembering old info or I was just completely wrong.
I'm still trying to figure out what I'm doing wrong...
*edit-
It seems that when compiled with -O2, the SD pins never toggle. I'm thinking about trying a simple rep loop to toggle pins and move it through the file until the pin stops toggling.
*edit-
Even more interesting, it seems that after running the file compiled with -O2, the SD card does not respond to the previously working -O1 or -O0 builds. I've tried running -O1 back to back and that seems okay. I really don't understand what's going on, since -O2 never seems to toggle the pins so it shouldn't affect the working code. I'm at a loss right now to explain what I'm seeing...
It probably is a lag vs speed issue. The Prop2 has many more stages in the I/O circuits than the Prop1 had. This introduces more measurable lag with respect to software reading the inputs.
To make it work, at first, you might end up having to wait for signals to appear where you didn't have to on the Prop1. Later the code can be reordered to accommodate the extra lag built into the Prop2, thereby eliminating any slowdown.
Well, I certainly can't rule out a bug in fastspin (the -O2 setting isn't the default because it enables somewhat "riskier" optimization); but it could also be a timing issue. Does the hardware really not require any delays around toggling the pins? Have you tried running the P2 at a lower clock rate (like 80 MHz, or even 40 MHz) to slow it down to P1 speeds, and then trying to rev it back up?
Regarding getting faster code out of the compiler in general, the key thing is to keep as much as possible in local variables. That's probably more important than -O2. In your "send" routine don't keep reading global variables like "clk" and "di" inside the loop, read them once into local variables and use those. It'll be much faster (if the hardware can handle the additional speed) since it saves doing repeated rdlongs. This is true on P1 as well as P2.
This probably has something to do with my issues. It seems strange that the working code fails to run after the code that doesn't work, even though it seems like the broken code doesn't toggle any SD pins. I'm doing some experiments to narrow down what's going on here.
@ersmith, Thanks for all your help and suggestions. This could possibly be a timing issue, perhaps the setup or hold time from di -clk is causing this issue but it seems somewhat unlikely. I've tried P2 clock from 20mhz to 320mhz and working code remains working and broken code still does nothing.
I'm going to try to refactor things to use locals as much as possible. Hopefully this will get the speed up. I really need to get this test under 3 seconds before I can start using this object. At 320mhz it's about 13 seconds so I've got some work to do!
This routine is a good example for problematic timing differences between Prop1 and Prop2
pri read | r
  r := 0
  repeat 8
    outa[clk] := 0
    outa[clk] := 1
    r += r + ina[do]
  return r
The assumption in that code is that the clock output is transitioning down and back up during the preceding two commands before INA. For the Prop1 this always holds true in the hardware.
Without optimising, it likely also holds true for the Prop2 because it could take a while to actually execute through to the INA. But with optimising, it can shrink down to as few as three instructions. Which has big implications for lag effects.
In the Prop2 manual, just before the section on Events there is a one page write up of I/O Pin Timing. It states 3 clocks of lag for OUT and another 3 clocks for IN. 6 clocks is 3 instructions for the Prop2. And that's without giving an external chip time to respond to a clock change.
The quick fix is to add a wait between OUTA and INA, which kind of defeats the purpose of your optimising effort.
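To put numbers on that lag, here is a back-of-envelope calculation in Python. It is only a sketch: it uses the figures quoted above (3 clocks of lag for OUT, 3 for IN, 2 clocks per instruction) and ignores however long the external chip takes to respond, so treat the result as a lower bound. The function name round_trip_lag is made up for illustration.

```python
OUT_LAG_CLOCKS = 3            # output lag quoted from the I/O Pin Timing page
IN_LAG_CLOCKS = 3             # input lag quoted from the same page
CLOCKS_PER_INSTRUCTION = 2    # typical P2 instruction time assumed in the post


def round_trip_lag(clock_hz):
    """Return (clocks, instructions, nanoseconds) between writing a pin
    and being able to read back an effect of that write."""
    clocks = OUT_LAG_CLOCKS + IN_LAG_CLOCKS
    instructions = clocks // CLOCKS_PER_INSTRUCTION
    nanoseconds = clocks * 1_000_000_000 / clock_hz
    return clocks, instructions, nanoseconds


if __name__ == "__main__":
    for hz in (80_000_000, 160_000_000, 320_000_000):
        print(hz, round_trip_lag(hz))
```

The lag is fixed in clocks, so it always costs the same three instruction slots no matter how fast the chip runs; only its duration in nanoseconds shrinks as the clock goes up.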
I know the way Peter has handled this in Tachyon is by having an early SPI clock edge before the READ routine is entered. It results in an extraneous SPI clock upon exit from READ, which presumably can carry over to the next byte read.
So, his code still pulses the SPI clock before IN, but that first IN is reading data from the clock pulse issued prior to READ routine entry. The REP loop was 4 instructions long so it could pull a sysclock/8 bit rate, e.g. at 80 MHz he could do 10 Mbps for 7 bits.
Another way to handle this is to convert the low level routines to use smartpins. They can manage the timings for you, but also have a whole new set of details of their own. One of these is that initial setup is required by a separate piece of code.
Smartpins are a good approach. In true Propeller form, all pins are the same. Once you know one smartpin, you know them all.
That is correct code, I think -- it's what your program has asked for (the andn sets the clock pin low, the muxc sets the data pin to 0 or 1 based on the previous low bit of outv, and the or sets clock high again). But it's all happening very fast, and doesn't allow for any delays at all.
I guess a crucial question is whether when we read "outb" it's returning a buffered version, or whether it's actually returning the actual current state of the pins (which lags the setting by 3 cycles). If it's returning the actual state of the pins then we definitely need a delay between muxc and or, and probably between the or and the andn.
That's not even considering what's happening on the other side, where your device only has a 4 cycle window between clock going low and then high again in which to pick up the data change; and for half of that window (during the shr instruction) it's still showing the old value.
I think you really need a waitx_(N) after the outb[di] := outv. That'll introduce a 2+N cycle delay. Probably waitx_(2) would be enough, but you might experiment with this.
When you had di as a global variable a bunch of optimizations were inhibited, so the inner loop wasn't as compact (and the rdlong to get the value of di introduced a delay as we waited for HUB memory to respond).
You're going to have a similar problem with the READ routine, where you'll have to have a delay before you try to read data from inb.
Thanks @ersmith, I do realize there will be timing issues with the way things are written and these timings will change as I try to refactor things. The issue I'm having trouble understanding is why changing:
outb[di] := outv
to:
d := di
...
outb[d] := outv
Causes the di pin to not respond. I'm watching on the LA and the clock is toggling properly but di remains high. I've sprinkled in waits to make sure I'm able to capture on the LA. I'm almost sure I'm doing something very wrong here since bringing the clock pin into a local variable works but does not with di.
I think I'm going to take evanh's recommendation and just use inline asm since I've already had it working using the local variables. Well I at least had the send working... I tried to simplify things by hard-coding the pins and some other "bad practices" just to try to get things working. I'm stuck at 13 seconds running @ 320mhz, not quite good enough.
OK, sorry, I was so focused on the timing issue (the code change above moves the rdlong from inside the loop to outside) that I missed that the pin change wasn't showing up on your logic analyzer at all. Sure enough, you've hit a bug in -O2; it's moving the initialization of the mask for outb[] to the wrong place, before d has been set from di, so the outb[d] is writing to the wrong or no pin. That bug won't happen with -O1.
For P2 you could use something like:
VAR
  long clk, di

pub send(outv) | c, d
  outv ><= 8
  c := clk
  d := di
  repeat 8
    asm
      test outv, #1 wc
      drvl c
      drvc d
      drvh c
    endasm
    outv >>= 1
  drvh_(d)
I'd still be a little concerned about whether the receiver can keep up with this, but I don't know what the timing specs are for your devices, so maybe it isn't an issue.
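In case the purpose of the `outv ><= 8` line isn't obvious: it reverses the low 8 bits so that the loop, which tests bit 0 and shifts right, ends up sending the original byte most significant bit first. Here is a rough Python model of just that bit ordering (no pins or timing involved). The helper names reverse_bits and send_order are made up for illustration, and the model assumes the reverse operator clears everything above the reversed bits.

```python
def reverse_bits(value, count):
    """Reverse the lowest `count` bits of value; higher bits are dropped."""
    result = 0
    for _ in range(count):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def send_order(byte):
    """Return the bits in the order the loop would put them on the data pin."""
    outv = reverse_bits(byte, 8)      # outv ><= 8
    bits = []
    for _ in range(8):                # repeat 8
        bits.append(outv & 1)         # test outv, #1 wc / drvc d
        outv >>= 1                    # outv >>= 1
    return bits


if __name__ == "__main__":
    print(send_order(0xC1))
```

For $C1 (%11000001) the model produces the bits 1, 1, 0, 0, 0, 0, 0, 1, which is the byte read from the top bit down.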
I've been trying to fight timing, and some kinda funkiness... (no, not compiler issues... PEBMAK)… SD state persistent over reboots but I think I'm almost there. Thanks for pointing out the drv instructions. There's so many new instructions to try out. I really like this because it can reach all 64 pins easily.
Sorry about not being able to articulate the bug better, that may be part of the reason you were stuck on the timing issue.
I've done some testing, moving timing around and it seems pretty solid on at least 2 of my sd cards. I have this idea of an "all in one SPI cog" that can handle SD card access as well as generic SPI support. I just have to get over the learning curve of the P2, with the new instructions and default clock that beats my poor logic analyzer into the dust.
*edit-
Thinking about it, I'm not sure if this is actually working or not because I don't think it should!! If my basic maths are right, the inner loop is 4 instructions * 2 clks per for 8 clocks per rep which would be 40mhz!! at 320 P2 clock? I thought SD spec is 20-25mhz. Cluso's sd seems to use 12 clocks per rep? I'm not sure about TaqOZ..
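The arithmetic in that edit can be checked with a few lines of Python. This is only a sketch under the same assumptions as the post: every instruction in the loop takes 2 clocks, one loop pass sends one bit, and the 25 MHz ceiling is the figure mentioned above rather than something looked up in the SD spec. The function names are made up for illustration.

```python
import math

CLOCKS_PER_INSTRUCTION = 2    # assumption carried over from the post


def spi_bit_rate(sysclk_hz, instructions_per_bit):
    """Bit rate in Hz when each bit costs the given number of instructions."""
    return sysclk_hz / (instructions_per_bit * CLOCKS_PER_INSTRUCTION)


def min_clocks_per_bit(sysclk_hz, max_bit_rate_hz):
    """Fewest system clocks per bit that keeps the rate at or below the limit."""
    return math.ceil(sysclk_hz / max_bit_rate_hz)


if __name__ == "__main__":
    print(spi_bit_rate(320_000_000, 4))                 # 4-instruction loop at 320 MHz
    print(spi_bit_rate(80_000_000, 4))                  # same loop at 80 MHz
    print(min_clocks_per_bit(320_000_000, 25_000_000))  # clocks needed to stay under 25 MHz
```

So the 4-instruction loop at 320 MHz does work out to 40 MHz, and staying at or below 25 MHz at that clock would need at least 13 clocks per bit, which is in the same region as the 12 clocks per rep mentioned for Cluso's driver.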
A couple of you told me that ptra is the equivalent of par on the P1. I have a simple thing that I want to do, which is just to take a value, transfer it to another variable and print it. It's not working; here is the simple Spin2 code. I am not an engineer. Please tell me what I am missing. This is my jump start.
I figured it out in P1 PASM but here I am lacking. I need to get this so I can move on. My other programming in PASM is OK, it's just this. Sounds dumb. Oh well.
Tom
Here's the asm output of CLKSET:
Which gets passed to __system_clkset
https://forums.parallax.com/discussion/comment/1465100/#Comment_1465100
I've played around with this, trying different combos of optimizer levels as well as different names for d.
I really need to study this closer, and try to run through it in my head.
That way you can concentrate on hand crafting the most critical parts and leave the optimiser to do fancy tricks on non timing critical code.
Thanks for everyone's input!
nice!
Mike
thanks