My stupid questions thread

twm47099 · 2019-02-21 03:29

Thanks to both. That helps a lot.
Tom

evanh · 2019-02-21 03:32

cheezus wrote: »

AHHH, this was one detail I missed. Is there any disadvantage to the way I'm doing things?

At that speed, it's fine. 100 MHz is the output of the PLL. You can hit a limit on how fast the PLL can go. I think it's good for about 400 MHz on the P2ES. And Chip has raised that to maybe 500 MHz for the finished Prop2.

evanh · 2019-02-21 03:44

Oops, I got that wrong. _XDIVP is to be used normally there. It's to keep the PLL in the 100-400 MHz range when you only want a slower system clock, like 25 MHz for example. So you've done exactly the right settings.

evanh · 2019-02-21 03:55

I remember now, Chip did some simulation of the PLL and decided it had far too much leeway at the bottom end, right down to 1 MHz. So he tweaked the respin to sacrifice some of that to make it go faster at the top end.

twm47099 · 2019-02-21 04:43

cheezus wrote: »

It took me a few but I was able to figure out... if you change CLKSET as follows, you can use cluso's code very easily.

  _XTALFREQ     = 20_000_000                                    ' crystal frequency
  _XDIV         = 1                                             ' crystal divider to give 20MHz
  _XMUL         = 5                                          ' crystal / div * mul (100mhz)
  _XDIVP        = 4                                             ' crystal / div * mul /divp to give _CLKFREQ (1,2,4..30)
  _XOSC         = %10                                  'OSC    ' %00=OFF, %01=OSC, %10=15pF, %11=30pF 
  _XPPPP        = ((_XDIVP>>1) + 15) & $F                       ' 1->15, 2->0, 4->1, 6->2...30->14
  _CLOCKFREQ    = _XTALFREQ / _XDIV * _XMUL / _XDIVP            ' internal clock frequency                
  _SETFREQ      = 1<<24 + (_XDIV-1)<<18 + (_XMUL-1)<<8 + _XPPPP<<4 + _XOSC<<2  ' %0000_000e_dddddd_mmmmmmmmmm_pppp_cc_00  ' setup  oscillator '' this is used by clkset

...
   clkset(_SETFREQ, _CLOCKFREQ)

From what I understand it's best to keep xdiv low, while keeping 100 MHz <= (xtal / xdiv * xmul) <= 200 MHz. In my example I'm setting for 25 MHz (20 / 1 * 5 / 4). I've been using crystal / 2 * 16 / 2 to match my P1 80 MHz clock. Hope this helps a little!

In cluso's listing there is a constant

_XSEL     = %11

that isn't in your listing (above). Does that make a difference?

cheezus · 2019-02-21 04:59

twm47099 wrote: »
In cluso's listing there is a constant
_XSEL     = %11
that isn't in your listing (above). Does that make a difference?

I removed that to make it (more) obvious that the clock is set using clkset. _XSEL is the OSC enable, which is actually passed as the third argument (arg03 in the asm below) and is initialized to 3 before the call (thus enabling the osc). It really doesn't make a difference because it's supplied automatically by clkset. That's why it took me so long to understand the difference between the two methods.

Here's the asm output of CLKSET:

'    clkset(_SETFREQ, _CLOCKFREQ)
	mov	arg01, ##17043448
	mov	arg02, ##160000000
	mov	arg03, #3
	calla	#@__system__clkset

Which gets passed to __system__clkset:

__system__clkset
'       $f == free
	wrlong	arg02, #20
'     $10 = reserved   (never free automatically)
	wrlong	arg01, #24
'     $20 = inuse      (block was observed to be in use during GC)
	hubset	#0
	hubset	arg01
	waitx	##200000
	add	arg01, arg03
	hubset	arg01
__system__clkset_ret
	reta
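
For anyone who wants to check the numbers: the ##17043448 and ##160000000 in the listing above can be reproduced from the constant formulas quoted earlier in the thread. Here is a minimal Python sketch, assuming _XDIV = 2, _XMUL = 16, _XDIVP = 1 and _XOSC = %10 with the 20 MHz crystal. Those four values are not stated in this post; they are simply a set that reproduces both numbers. The function name `clock_setup` is made up for illustration.

```python
def clock_setup(xtal_hz, xdiv, xmul, xdivp, xosc):
    """Mirror the _XPPPP / _CLOCKFREQ / _SETFREQ constant formulas."""
    xpppp = ((xdivp >> 1) + 15) & 0xF            # 1->15, 2->0, 4->1, 6->2 ...
    clockfreq = xtal_hz // xdiv * xmul // xdivp  # same evaluation order as the CON line
    # Python's + binds tighter than <<, so each field is parenthesised here.
    setfreq = ((1 << 24) + ((xdiv - 1) << 18) + ((xmul - 1) << 8)
               + (xpppp << 4) + (xosc << 2))
    return setfreq, clockfreq


# Assumed settings (not given in the post): 20 MHz / 2 * 16 / 1
setfreq, clockfreq = clock_setup(20_000_000, 2, 16, 1, 0b10)
print(setfreq, clockfreq)   # 17043448 160000000
print(setfreq + 0b11)       # value of the last hubset once arg03 = 3 is added
```

The last line is the `add arg01, arg03` step: the same mode word with the two select bits set.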

Cluso99 · 2019-02-21 07:02

I did update my setup code. Here is the latest...
Just select the section for your board and comment out the other.

Note the divide works better with _XDIV being smaller. There is jitter if the divided crystal frequency comes out at 0.5MHz or 1MHz.
Choose your values _XDIV, _XMUL and _XDIVP and the calculations are done for you.

_1us is also defined so you can use
WAITX ##(_1us * n)-1
to give a precise delay where n is the number of 1 microseconds to wait.

CON
'+-------[ Select for P2-EVAL ]------------------------------------------------+ 
  _XTALFREQ     = 20_000_000                                    ' crystal frequency
  _XDIV         = 2             '\                              '\ crystal divider                      to give 10.0MHz
  _XMUL         = 15            '| 150 MHz                      '| crystal / div * mul                  to give 150MHz
  _XDIVP        = 1             '/                              '/ crystal / div * mul /divp            to give 150MHz
  _XOSC         = %10                                   '15pF   ' %00=OFF, %01=OSC, %10=15pF, %11=30pF
'+-------[ Select for P2D2 ]---------------------------------------------------+ 
' _XTALFREQ     = 12_000_000                                    ' crystal frequency
' _XDIV         = 4             '\                              '\ crystal divider                      to give   3.0MHz
' _XMUL         = 99            '| 148.5MHz                     '| crystal / div * mul                  to give 297.0MHz
' _XDIVP        = 2             '/                              '/ crystal / div * mul /divp            to give 148.5MHz
' _XOSC         = %01                                   'OSC    ' %00=OFF, %01=OSC, %10=15pF, %11=30pF
'+-----------------------------------------------------------------------------+
  _XSEL         = %11                                   'XI+PLL ' %00=rcfast(20+MHz), %01=rcslow(~20KHz), %10=XI(5ms), %11=XI+PLL(10ms)
  _XPPPP        = ((_XDIVP>>1) + 15) & $F                       ' 1->15, 2->0, 4->1, 6->2...30->14
  _CLOCKFREQ    = _XTALFREQ / _XDIV * _XMUL / _XDIVP            ' internal clock frequency                
  _SETFREQ      = 1<<24 + (_XDIV-1)<<18 + (_XMUL-1)<<8 + _XPPPP<<4 + _XOSC<<2  ' %0000_000e_dddddd_mmmmmmmmmm_pppp_cc_00  ' setup  oscillator
  _ENAFREQ      = _SETFREQ + _XSEL                                             ' %0000_000e_dddddd_mmmmmmmmmm_pppp_cc_ss  ' enable oscillator

  _1us          = _clockfreq/1_000_000                          ' 1us

DAT             org     0
'+-------[ Set Xtal ]----------------------------------------------------------+ 
entry           hubset  #0                              ' set 20MHz+ mode
                hubset  ##_SETFREQ                      ' setup oscillator
                waitx   ##20_000_000/100                ' ~10ms
                hubset  ##_ENAFREQ                      ' enable oscillator
'+-----------------------------------------------------------------------------+
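
As a rough cross-check of the two board settings above, here is a small Python sketch (the helper name `summarize` is hypothetical) that works out the divided crystal frequency, the PLL frequency before _XDIVP, the system clock and _1us. The 100-400 MHz window it tests against is only the P2ES figure evanh mentioned earlier in the thread, not a datasheet value.

```python
def summarize(name, xtal_hz, xdiv, xmul, xdivp):
    """Work through the CON calculations for one board configuration."""
    divided_hz = xtal_hz // xdiv        # crystal after _XDIV
    pll_hz = divided_hz * xmul          # PLL output before _XDIVP
    clk_hz = pll_hz // xdivp            # _CLOCKFREQ
    one_us = clk_hz // 1_000_000        # _1us (integer division, as in CON)
    return {
        "board": name,
        "divided_hz": divided_hz,
        "pll_hz": pll_hz,
        "clk_hz": clk_hz,
        "_1us": one_us,
        # Assumption: 100-400 MHz PLL range as quoted in the thread for P2ES
        "pll_in_range": 100_000_000 <= pll_hz <= 400_000_000,
    }


for board in (summarize("P2-EVAL", 20_000_000, 2, 15, 1),
              summarize("P2D2", 12_000_000, 4, 99, 2)):
    print(board)
```

One thing this makes visible: on the P2D2 settings the clock is 148.5 MHz, so the integer _1us comes out as 148 and a delay built from it runs slightly short of a true microsecond count.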

ersmith · 2019-02-21 12:32

twm47099 wrote: »
cheezus wrote: »
It took me a few but I was able to figure out... if you change CLKSET as follows, you can use cluso's code very easily.

...
   clkset(_SETFREQ, _CLOCKFREQ)
In cluso's listing there is a constant
_XSEL     = %11
that isn't in your listing (above). Does that make a difference?

_XSEL is an optional 3rd parameter to clkset, so the full version of clkset is:

clkset(_SETFREQ, _CLOCKFREQ, _XSEL)

If you don't pass in an explicit _XSEL then it defaults to %11, which is what you will usually want. But for some combinations (for setting rcfast or rcslow) you'll have to give an _XSEL yourself.

BTW the source code for clkset (and other P2 builtin functions) is in the spin2cpp repository in the file sys/p2_code.spin:

pri clkset(mode, freq, xsel = 3)
  CLKFREQ := freq
  CLKMODE := mode
  asm
    hubset #0
    hubset mode
    waitx ##20_000_000/100
    add  mode, xsel
    hubset mode
  endasm

cheezus · 2019-02-21 21:54

Thanks guys,
I should have probably figured out the 3rd optional parameter (xsel) on my own. I'm back to my 3rd post in this thread, trying to understand the inline asm. It seems that if I set the optimization level to -O2, my test program hangs after the serial output but before toggling the SD pins. I'm not sure if this is a bug in fastspin or if there's something in my code that's not friendly to being optimized.

So here comes the stupid question for the day.. What would be the best way to figure out what's going wrong? I tried using an older version of fastspin and that seemed to break on -O1. I've been going through the assembly for the working and non-working versions and I'm not seeing anything jump out at me. I'm at a loss!

evanh · 2019-02-21 23:30

I probably can't help much with non-hardware issues, but we probably need your whole source for this one.

ersmith · 2019-02-22 00:11

It's very possible that the fastspin optimizer will not work properly on "rep" without labels (in inline assembly). If so, that's a bug. Could you try changing it to use an end label instead of a count of instructions?

Also, I think you used "8-1" for your repeat count; why wasn't it "8" (one for each bit)?

cheezus · 2019-02-22 00:51

Full source is posted as a zip here:
https://forums.parallax.com/discussion/comment/1465100/#Comment_1465100

I'm slowly working to confirm that I'm not doing something wrong but I really can't rule that out yet.

@ersmith, I'm trying to take the "pure spin" driver and compile as suggested with -O2 but something's not quite right.

Regarding inline asm, in the test file where I am using inline asm I've taken your suggestion and am now using rep with labels. I also corrected the rep count to 8. I had it as 8-1 because I thought I saw it somewhere. Either I was remembering old info or I was just completely wrong.

I'm still trying to figure out what I'm doing wrong...

*edit-
It seems that when compiled with -O2, the SD pins never toggle. I'm thinking about trying a simple rep loop to toggle pins and move it through the file until the pin stops toggling.

*edit-
Even more interesting, it seems that after running the file compiled with -O2, the SD card does not respond to the previously working -O1 or -O0 builds. I've tried running -O1 back to back and those runs seem okay. I really don't understand what's going on, since -O2 never seems to toggle the pins so it shouldn't affect the working code. I'm at a loss right now to explain what I'm seeing...

evanh · 2019-02-22 02:05

It probably is a lag vs speed issue. The Prop2 has many more stages in the I/O circuits than the Prop1 had. This introduces more measurable lag with respect to software reading the inputs.

To make it work, at first brush, you might end up waiting for signals to appear where you didn't have to on the Prop1. Later the code can be reordered to accommodate the extra lag built into the Prop2, thereby eliminating any slowdown.

ersmith · 2019-02-22 02:45

cheezus wrote: »

@ersmith, I'm trying to take the "pure spin" driver and compile as suggested with -O2 but something's not quite right.

Well, I certainly can't rule out a bug in fastspin (the -O2 setting isn't the default because it enables somewhat "riskier" optimization); but it could also be a timing issue. Does the hardware really not require any delays around toggling the pins? Have you tried running the P2 at a lower clock rate (like 80 MHz, or even 40 MHz) to slow it down to P1 speeds, and then trying to rev it back up?

Regarding getting faster code out of the compiler in general, the key thing is to keep as much as possible in local variables. That's probably more important than -O2. In your "send" routine don't keep reading global variables like "clk" and "di" inside the loop, read them once into local variables and use those. It'll be much faster (if the hardware can handle the additional speed) since it saves doing repeated rdlongs. This is true on P1 as well as P2.

cheezus · 2019-02-22 02:52

evanh wrote: »

It probably is a lag vs speed issue. The Prop2 has many more stages in the I/O circuits than the Prop1 had. This introduces more measurable lag with respect to software reading the inputs.

To make it work, at first brush, you might end up waiting for signals to appear where you didn't have to on the Prop1. Later the code can be reordered to accommodate the extra lag built into the Prop2, thereby eliminating any slowdown.

This probably has something to do with my issues. It seems strange that the working code fails to run after the code that doesn't work, even though it seems like the broken code doesn't toggle any SD pins. I'm doing some experiments to narrow down what's going on here.

cheezus · 2019-02-22 03:02

@ersmith, Thanks for all your help and suggestions. This could possibly be a timing issue, perhaps the setup or hold time from di -clk is causing this issue but it seems somewhat unlikely. I've tried P2 clock from 20mhz to 320mhz and working code remains working and broken code still does nothing.

I'm going to try to refactor things to use locals as much as possible. Hopefully this will get the speed up. I really need to get this test under 3 seconds before I can start using this object. At 320mhz it's about 13 seconds so I've got some work to do!

evanh · 2019-02-22 03:12

This routine is a good example for problematic timing differences between Prop1 and Prop2

pri read | r
   r := 0
   repeat 8
      outa[clk] := 0
      outa[clk] := 1
      r += r + ina[do]
   return r

The assumption in that code is that the clock output is transitioning down and back up during the preceding two commands before INA. For the Prop1 this always holds true in the hardware.

Without optimising, it likely also holds true for the Prop2 because it could take a while to actually execute through to the INA. But with optimising, it can shrink down to as few as three instructions. Which has big implications for lag effects.

In the Prop2 manual, just before the section on Events there is a one page write up of I/O Pin Timing. It states 3 clocks of lag for OUT and another 3 clocks for IN. 6 clocks is 3 instructions for the Prop2. And that's without giving an external chip time to respond to a clock change.

The quick fix is to add a wait between OUTA and INA, which kind of defeats the purpose of your optimising effort.
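
To put rough numbers on that lag, here is a short Python sketch. It only uses the figures quoted above (3 clocks of lag for OUT, 3 for IN, 2 clocks per instruction); the function name is made up, and the external chip's own response time is not included.

```python
def pin_round_trip(sysclk_hz, out_lag_clocks=3, in_lag_clocks=3):
    """Return (clocks, nanoseconds) between an OUT change and INA seeing it."""
    clocks = out_lag_clocks + in_lag_clocks
    return clocks, clocks * 1e9 / sysclk_hz


for mhz in (80, 160, 320):
    clocks, ns = pin_round_trip(mhz * 1_000_000)
    # Assumption from the post: 2 clocks per instruction
    print(f"{mhz} MHz: {clocks} clocks = {ns:.2f} ns = {clocks // 2} instructions")
```

The clock count is the same at any system clock, so a loop that is only three instructions long reads the pin before its own clock edge has had any chance to come back, whatever speed the chip runs at.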

evanh · 2019-02-22 03:38

I know the way Peter has handled this in Tachyon is by having an early SPI clock edge before the READ routine is entered. It results in an extraneous SPI clock upon exit from READ, which presumably can carry over to the next byte read.

So, his code still pulses the SPI clock before IN, but that first IN is reading data from the clock pulse prior to READ routine entry. The REP loop was 4 instructions long so it could pull a sysclock/8 bit rate, e.g. at 80 MHz he could do 10 Mbps for 7 bits.

evanh · 2019-02-22 03:42

Another way to handle this is to convert the low level routines to using smartpins. They can manage the timings for you, but also have a whole new set of details of their own. One of which is that there is initial setup required by a separate piece of code.

Smartpins are a good approach. In true Propeller form, all pins are the same. Once you know one smartpin, you know them all.

cheezus · 2019-02-22 05:12

I've finally got an interesting result. I'm not sure if this has anything to do with my problem but it certainly isn't helping!

pri send(outv) | c, d
'
'   Send eight bits, then raise di.
'
   outv ><= 8

   c := clk
   d := di

   repeat 8
      outb[c] := 0
      outb[di] := outv  ' if change to d, nothing outputs
      outv >>= 1
      outb[c] := 1
'     waitcnt(cnt+wait_time)
   outb[di] := 1    ' if change to d, nothing outputs

I've played around with this, trying different combos of optimizer levels as well as different names for d.

' pri send(outv) | c, d
_sdspi_def_send
	mov	_var01, #1
	shl	_var01, _var02
	rev	arg01
	shr	arg01, #24
' '
' '   Send eight bits, then raise di.
' '
'     
'    outv ><= 8
'    
'    c := clk
	add	objptr, #8
	rdlong	_var03, objptr
	sub	objptr, #8
'    d := di
	rdlong	_var02, objptr
	mov	_var04, #1
	shl	_var04, _var03
	mov	_var05, #8
	rep	@LR__0132, #8
LR__0131
'         outb[c] := 0
	andn	outb, _var04
'         outb[d] := outv
	shr	arg01, #1 wc
	muxc	outb, _var01
'         outv >>= 1
' '        waitcnt(cnt+wait_time)     '' c code needs wait here 
'         outb[c] := 1
	or	outb, _var04
LR__0132
	mov	_var06, #1
	shl	_var06, _var02
'     outb[d] := 1
	or	outb, _var06
_sdspi_def_send_ret
	reta

I really need to study this closer, and try to run through it in my head.

evanh · 2019-02-22 09:50

You could write what you want in inline pasm. You seem to be familiar with assembly and Fastspin does make inlining an easy job.

That way you can concentrate on hand crafting the most critical parts and leave the optimiser to do fancy tricks on non timing critical code.

ersmith · 2019-02-22 12:10

If you look at the inner loop it's translated to:

   andn outb, _var04
   shr arg01, #1 wc
   muxc outb, _var01
   or   outb, _var04

That is correct code, I think -- it's what your program has asked for (the andn sets the clock pin low, the muxc sets the data pin to 0 or 1 based on the previous low bit of outv, and the or sets clock high again). But it's all happening very fast, and doesn't allow for any delays at all.
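
To see which bit ends up on the data pin at each clock pulse, here is a small Python model of that loop. It is only a sketch of the logic (reverse the low 8 bits, then shift out from the bottom), with made-up function names; it says nothing about the timing.

```python
def reverse_low8(value):
    """Like `outv ><= 8`: reverse the low 8 bits and clear the rest."""
    result = 0
    for i in range(8):
        if value & (1 << i):
            result |= 1 << (7 - i)
    return result


def bits_on_di(outv):
    """List the data-pin value for each of the 8 clock pulses, in order."""
    outv = reverse_low8(outv)
    bits = []
    for _ in range(8):
        bits.append(outv & 1)   # shr ... wc / muxc: low bit goes to the pin
        outv >>= 1
    return bits


print(bits_on_di(0xC1))   # [1, 1, 0, 0, 0, 0, 0, 1]
```

So the byte goes out most significant bit first, which is what an SPI device would normally expect.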

I guess a crucial question is whether when we read "outb" it's returning a buffered version, or whether it's actually returning the actual current state of the pins (which lags the setting by 3 cycles). If it's returning the actual state of the pins then we definitely need a delay between muxc and or, and probably between the or and the andn.

That's not even considering what's happening on the other side, where your device only has a 4 cycle window between clock going low and then high again in which to pick up the data change; and for half of that window (during the shr instruction) it's still showing the old value.

I think you really need a waitx_(N) after the outb[di] := outv. That'll introduce a 2+N cycle delay. Probably waitx_(2) would be enough, but you might experiment with this.

When you had di as a global variable a bunch of optimizations were inhibited, so the inner loop wasn't as compact (and the rdlong to get the value of di introduced a delay as we waited for HUB memory to respond).

You're going to have a similar problem with the READ routine, where you'll have to have a delay before you try to read data from inb.

cheezus · 2019-02-22 13:38

Thanks @ersmith, I do realize there will be timing issues with the way things are written and these timings will change as I try to refactor things. The issue I'm having trouble understanding is why changing:

outb[di] := outv

to:

   d := di
...
outb[d] := outv

causes the di pin to not respond. I'm watching on the LA and the clock is toggling properly but di remains high. I've sprinkled in waits to make sure I'm able to capture on the LA. I'm almost sure I'm doing something very wrong here, since bringing the clock pin into a local variable works but doing the same with di does not.

I think I'm going to take evanh's recommendation and just use inline asm since I've already had it working using the local variables. Well I at least had the send working... I tried to simplify things by hard-coding the pins and some other "bad practices" just to try to get things working. I'm stuck at 13 seconds running @ 320mhz, not quite good enough.

Thanks for everyone's input!

ersmith · 2019-02-22 17:38

cheezus wrote: »
Thanks @ersmith, I do realize there will be timing issues with the way things are written and these timings will change as I try to refactor things. The issue I'm having trouble understanding is why changing:
outb[di] := outv 
to:
   d := di
...
outb[d] := outv 
Causes the di pin to not respond.

OK, sorry, I was so focused on the timing issue (the code change above moves the rdlong from inside the loop to outside) that I missed that the pin change wasn't showing up on your logic analyzer at all. Sure enough, you've hit a bug in -O2; it's moving the initialization of the mask for outb[] to the wrong place, before d has been set from di, so the outb[d] is writing to the wrong or no pin. That bug won't happen with -O1.
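
The ordering problem can be seen in the listing posted earlier, where `shl _var01, _var02` builds the mask before `rdlong _var02, objptr` has loaded the pin number. A toy Python model of the two orderings follows. The stale value of 0 is purely an assumption for illustration (whatever was really in the register is unknown), and taking the shift count from the low 5 bits is my understanding of SHL rather than something stated in this thread.

```python
def mask_for(pin):
    # Assumption: the shift count comes from the low 5 bits of the register
    return 1 << (pin & 31)


def mask_buggy_order(di_pin, stale=0):
    """Mask is built first, from whatever the register held before d := di."""
    d = stale               # hypothetical leftover register content
    mask = mask_for(d)      # mask computed too early
    d = di_pin              # d := di only happens afterwards
    return mask


def mask_correct_order(di_pin):
    """d := di first, then the mask is built from it."""
    d = di_pin
    return mask_for(d)


print(bin(mask_correct_order(5)), bin(mask_buggy_order(5)))
```

With the early mask, the loop's `muxc` keeps writing to whichever pin the stale value happened to select, which would fit seeing the clock toggle while di stays put.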

For P2 you could use something like:

VAR
  long clk, di
  
pub send(outv) | c, d
  outv ><= 8

  c := clk
  d := di
  repeat 8
    asm
      test outv, #1 wc
      drvl c
      drvc d
      drvh c
    endasm
    outv >>= 1
  drvh_(d)

I'd still be a little concerned about whether the receiver can keep up with this, but I don't know what the timing specs are for your devices, so maybe it isn't an issue.

ersmith · 2019-02-22 18:37

(The -O2 bug you found will be fixed in the next release of fastspin, I found the root problem. Thanks for pointing it out.)

msrobots · 2019-02-22 19:47

    asm
      test outv, #1 wc
      drvl c
      drvc d
      drvh c
    endasm

nice!

Mike

cheezus · 2019-02-22 23:27

I've been trying to fight timing, and some kinda funkiness... (no, not compiler issues... PEBMAK)… SD state persistent over reboots but I think I'm almost there. Thanks for pointing out the drv instructions. There's so many new instructions to try out. I really like this because it can reach all 64 pins easily.

Sorry about not being able to articulate the bug better, that may be part of the reason you were stuck on the timing issue.

I've done some testing, moving timing around and it seems pretty solid on at least 2 of my sd cards. I have this idea of an "all in one SPI cog" that can handle SD card access as well as generic SPI support. I just have to get over the learning curve of the P2, with the new instructions and default clock that beats my poor logic analyzer into the dust.

*edit-
Thinking about it, I'm not sure if this is actually working or not, because I don't think it should!! If my basic maths are right, the inner loop is 4 instructions * 2 clocks each, for 8 clocks per rep, which would be 40 MHz at a 320 MHz P2 clock!? I thought the SD spec is 20-25 MHz. Cluso's SD code seems to use 12 clocks per rep? I'm not sure about TaqOZ..
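
A quick Python sketch of that arithmetic, in case it helps anyone check their own loop. It assumes 2 clocks per instruction as above and nothing else stalling the loop; the function name and the `extra_clocks` parameter are made up, the latter standing in for any added wait (ersmith's figure earlier was 2+N clocks for a wait of N).

```python
def spi_clock_hz(sysclk_hz, instructions_per_bit, clocks_per_instruction=2,
                 extra_clocks=0):
    """SPI clock rate for a bit-banged loop that emits one bit per pass."""
    clocks_per_bit = instructions_per_bit * clocks_per_instruction + extra_clocks
    return sysclk_hz / clocks_per_bit


print(spi_clock_hz(320_000_000, 4))                   # 4-instruction loop at 320 MHz
print(spi_clock_hz(80_000_000, 4))                    # same loop at 80 MHz
print(spi_clock_hz(320_000_000, 4, extra_clocks=4))   # with a wait of N = 2 added
```

By the same sum, the 8-clock loop would need the system clock at or below 200 MHz to stay within a 25 MHz SPI clock.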

pilot0315 · 2019-02-23 04:43

A couple of you told me that ptra is the equivalent of par in P1. I have a simple thing that I want to do, which is just to take a value, transfer it to another variable and print it. It's not working; here is the simple code, in Spin2. I am not an engineer. Please tell me what I am missing. This is my jump start.

thanks

pilot0315 · 2019-02-23 04:45

I figured it out in P1 pasm but here I am lacking. I need to get this so I can move on. Other programming in pasm is o.k. Just this. Sounds dumb. Oh well.

pilot0315 · 2019-02-23 04:46

I have looked at a lot of code that looks like this but it does not work for me.
