P2 Hosted USB Keyboard/Mouse

ersmith · 2019-12-30 17:45

Ah, thank you for your last post Roger -- I think that's the clue I needed to find the problem. I think this is a fastspin bug, and it comes up only in a very specific circumstance: when a Spin function which uses no member variables (VAR) calls coginit/cognew to run a Spin function which itself does access member variables (as mouseinit does in this case). By adding the "@usbA" you made the "start" method access a VAR, which fixed the problem. A similar result would have come from moving the usbstack array from DAT to VAR (which is a more common place for it, and the reason this bug hasn't shown up before -- most users of coginit either use a stack defined in VAR or set a cog variable in VAR).

Here's a version of fastspin which has that particular issue fixed. I hope it'll fix your problem.

Thanks,
Eric

garryj · 2019-12-30 20:05

That looks to have resolved the issue, Eric. @rogloh

rogloh · 2019-12-30 23:11

Thanks @ersmith and @garryj. I will take your word for it for now. That zip file only contained a windows binary of fastspin that I couldn't try out on my Mac. I can wait for the next release and use a workaround for now.

Rayman · 2020-02-10 15:22

Barely worth mentioning, but I was surprised that this cheap old mouse didn't work.
Same part number as this one: https://www.amazon.com/HP-MOFXUO-2-Button-Optical-537749-001/dp/B00I2YNG8E

This is one that came with an HP computer bought new back in 2015.
I think this might be the first one I tried that didn't work...

This even older: Logitech MouseMan Dual Optical USB Mouse M-BL63B
Seems to work just fine...

evanh · 2020-02-10 21:42

Roger,
When Eric drops a binary like that, you can be pretty sure the master branch on Github is the same code. Just do a "git pull" then "make".

rogloh · 2020-02-10 22:51

LOL. Is that a subtle hint for me to retest...evanh ?

When I get a chance I can try to resurrect that USB gfx+mouse demo issue I had. I have a much newer fastspin anyway now since I decided to update the tools and resolve the loadp2 issue etc.

evanh · 2020-02-10 23:08

Ah, just misunderstood.

rogloh · 2020-04-28 04:43

@garryj
I'm already finding it may be a problem in some video applications of the P2 where the frequency is determined at runtime based on screen resolution etc to be able to make use your USB driver due to having dependencies to the P2 frequency embedded in various timeout constants throughout the USB driver code. I think this might have been mentioned before.

I took a deeper look and found all the P2 frequency dependencies in your USB driver. Attached is a text file showing them all, with each indented level in the hierarchy representing usage of a symbolic constant that is originally based on the P2 frequency. Each deepest "leaf" node terminates in final P2 code if the symbol is actually used in the driver file. Note: I had also patched it for rev-B so there are two extra lines that may not be present in your original codebase. Some constants are never used and don't have code associated with them.

From what I can tell it looks like there are 38 final instances in your driver where these timeout constants get used in actual P2 instructions. Is it feasible for you to have some initial code that starts this COG to first patch the runtime to apply the operating/nominated P2 frequency so it can be used dynamically? I already do this type of thing in my video driver and HyperRAM driver to dynamically customize/patch it for the chosen output type/frequency etc when it launches so I think it should be possible. You could possibly have an initial stub in LUT RAM that does this then reads in the actual COG that uses these patched values, its space can be overwritten later and reused fully. Apart from the timeout computations you just need to put labels in the code on these 38 lines to know where to patch in memory. That's how I manage it in my video driver and it works well.

Many of these instructions just use simple multiples of common timeout values which are easy but some of these dependencies currently use floats and rounding etc which may first need some integer math conversion to run in PASM (or SPIN even if it is easier). The CORDIC can be used to do some 32-> 64 bit math perhaps if it helps you get better precision.

What are your thoughts on possibly doing this? I think it would help improve it a lot.

Roger.

ps. If you add this patching capability you might also be able to dynamically patch the code for Rev-A vs Rev-B operation as well. I think there are already known ways to detect the chip revision differences dynamically. It's only a handful of lines that differ and they don't change the size of the code. Eg. To run on rev-B you just do this type of thing:

{
            ' was originally this code for rev A (rogloh)
                wrpin   ##USB_V1HMODE_LS, #DP           ' Low-speed signalling is always used
                wrpin   ##USB_V1HMODE_LS, #DM
        if_z    wxpin   ##_1_5Mbps, #DM                 ' Set 1.5Mbs baudrate if no downstream hub
}
                wrpin   ##USB_V2MODE, #DP               ' Low-speed signalling is always used
                wrpin   ##USB_V2MODE, #DM
        if_z    wxpin   ##USB_H_LS_NCO, #DM             ' Host mode and 1.5Mbs baudrate if no downstream hub

evanh · 2020-04-28 04:56

Yes, use the cordic. QMUL/QDIV, as a pair, is beautiful.

rogloh · 2020-04-28 05:05

Also here were all the cases where floating point stuff actually gets used in final code generation. Any symbols that weren't used in the code are excluded (like _100ns). If these items can be converted to integer equivalents at run time without loss of precision the rest of the patching work should be easy in comparison...

_FCLKFREQ = 297000000.0 ' 160_000_000.0  
LSBTns = 667.0                          ' Low-Speed bit period, in nanoseconds
FSBTns = 83.54                          ' Full-Speed bit period, in nanoseconds
    _CLKFREQ  = round(_FCLKFREQ)
    _12Mbps        = round((12_000_000.0 / _FCLKFREQ) * 65536.0) ' = 4369 NCO @180MHz, 9830 @80MHz
    _1_5Mbps       = round((1_500_000.0 / _FCLKFREQ) * 65536.0)  ' = 546 NCO @180MHz, 1229 @80Mhz
    _1ms   = round(_FCLKFREQ / 1_000.0)
        _1us   = round(float(_1ms) / 1_000.0)
            IP_DELAY_LS  = round(4.0 * LSBTns * float(_1us) / 1000.0) ' Range @180MHz: 240 to 901 clocks (1.334us to 5.003us)
            IP_DELAY_FS  = round(4.0 * FSBTns * float(_1us) / 1000.0) ' Range @180MHz: 30 to 112  clocks (0.166us to 0.623us)
            TAT_WAIT_LS  = round(22.0 * LSBTns * float(_1us) / 1000.0)
            TAT_WAIT_FS  = round(28.0 * FSBTns * float(_1us) / 1000.0)

As well as a 32x32 bit to 64 bit multiplier there is a 64/32 bit divider in the CORDIC. So I think this can be used to get the precision needed, the remainder is also available and can be used for the rounding.

rogloh · 2020-04-28 05:52

I'm slightly confused by the choice of 667.0 and 83.54 above for the LSBTns and FSBTns bit periods for USB.

Were these accounting for some worst case clock tolerance or for rounding or something? I'd have thought they would typically be 666.7 and 83.33, for 1.5Mbps and 12Mbps operation respectively but I'm sure there must have been a valid reason for this difference...

I'm hoping these constants will also remain static and weren't something needed to fudge things for the default P2 operating frequency.

Update: the differences from their nominal values turn out to be 500ppm for 1.5Mbps and ~2500ppm for 12Mbps. If this difference is crystal tolerance related, it seems backwards, as I though the low speed USB operation needed less frequency precision compared to high speed operation. Is this a bug? It's now bugging me

Update2: Ok, so I found this in the USB spec which probably explains the tolerance for full speed and lines up with the above number at least.
"The full-speed data rate is nominally 12.000 Mb/s. For full-speed only functions, the required data-rate when transmitting (TFDRATE) is 12.000 Mb/s ±0.25% (2,500 ppm)."

The other low speed one appears to be less precise as I remembered. It's 15000ppm not 500ppm however, so that one is a still a bit of an unknown.
"The low-speed data rate is nominally 1.50 Mb/s. For low-speed functions, the required data-rate when transmitting (TLDRATE) is 1.50 Mb/s ±1.5% (15,000 ppm). This allows the use of resonators in low cost, low- speed devices."

rogloh · 2020-04-28 07:16

Another reason to patch the code is to support builds with different USB COGs on different pins because right now it seems you need two static copies of the driver code in the build to support both a USB keyboard port and a USB mouse port. If the code supports patching this could also patch in the pin numbers and a common single implementation could support both devices assuming shared hub code does not use the constant pin numbers and also requires patching. I didn't check that fully yet so if it does it needs a little more work, but if it runs in hub exec presumably is it less time critical and could either read from a register or hub location to get the pin to use - there's more space available in hubRAM anyway if the code grows slightly. Any other per instance state should be put into its own VAR section instead of shared in a single DAT region.

Update: Just checked pin usage. It appears that two cases of the pin constants are used in HUB RAM which would need either a register or another lookup to resolve. Here's the full use of these Pin constants (where * represents the pin use in HUB exec RAM or COG/LUT RAM). If you share the active LED pin with the REPO (which is a great way to save a pin) it would only require one COGRAM location to store this pin, and the rest of the pins can easily be patched in the executable.

HUB COG
 *   *  HOST_ACTIVE_LED = USB_BASEPIN + 4
     *  USB_PROTECT_ON = USB_BASEPIN + 5
     *  DM = USB_BASEPIN + 6   ' DM is "The Brain"
     *  DP = USB_BASEPIN + 7   ' DP is passive
 *   *  USB_EVENT_REPO = HOST_ACTIVE_LED ' was USB_BASEPIN + 8 (rogloh change)
     *  HOST_ERROR_LED = LED56

To share the LED pin with the USB REPO you need this change (Thanks to ozpropdev for this one!)

        SP_REPO1_MODE   = %01_00001_0    ' %MMMMM_0, P[12:10] != %101
     'was  SP_REPO1_MODE   = %00001_0 | 1 << 16    ' %MMMMM_0, P[12:10] != %101 (rogloh)

garryj · 2020-04-28 22:41

@rogloh
Thanks for compiling that list of constants/registers that are dependent on the P2 frequency. Dynamic re-calculations of USB timings is still on my radar but, as of late, time for P2 coding has been pretty hard to come by. I have already implemented getting the USB pin assignments from the object that starts the USB host, and the USB REPO pin is shared with the activity LED. P2 rev A/B+ detection is an easy add, but due to the before-mentioned rationed coding time, it has been sitting on a back burner.

I'm hoping to get the above out ASAP and after that will turn my attention to dynamic frequency changes.

AJL · 2020-04-29 03:04

garryj wrote: »

@rogloh
Thanks for compiling that list of constants/registers that are dependent on the P2 frequency. Dynamic re-calculations of USB timings is still on my radar but, as of late, time for P2 coding has been pretty hard to come by. I have already implemented getting the USB pin assignments from the object that starts the USB host, and the USB REPO pin is shared with the activity LED. P2 rev A/B+ detection is an easy add, but due to the before-mentioned rationed coding time, it has been sitting on a back burner.

I'm hoping to get the above out ASAP and after that will turn my attention to dynamic frequency changes.

I'm working on some code that might be of help there.

rogloh · 2020-04-29 05:00

garryj wrote: »

@rogloh
Thanks for compiling that list of constants/registers that are dependent on the P2 frequency. Dynamic re-calculations of USB timings is still on my radar but, as of late, time for P2 coding has been pretty hard to come by.

No problem garryj. I'm sort of glad this is still something that you may hope to add in time.

AJL wrote: »

I'm working on some code that might be of help there.

Great - any more details? This morning I looked at the math sequences that would be needed to handle the floating point stuff and found it won't really affect precision significantly or overflow when moving to integer math with the 64 bit Cordic. The two computations using FSBTns will just require scaling by 100 and two divisions because they are limited to a 32 bit divisor. Other than that it looks fairly straightforward.

evanh · 2020-04-29 05:07

The Cordic will be more precise since Chip uses single precision floats in his compilers and Eric has followed suit with Spin and Pasm at least.

AJL · 2020-04-29 06:31

rogloh wrote: »

garryj wrote: »

@rogloh
Thanks for compiling that list of constants/registers that are dependent on the P2 frequency. Dynamic re-calculations of USB timings is still on my radar but, as of late, time for P2 coding has been pretty hard to come by.

No problem garryj. I'm sort of glad this is still something that you may hope to add in time.

AJL wrote: »

I'm working on some code that might be of help there.

Great - any more details? This morning I looked at the math sequences that would be needed to handle the floating point stuff and found it won't really affect precision significantly or overflow when moving to integer math with the 64 bit Cordic. The two computations using FSBTns will just require scaling by 100 and two divisions because they are limited to a 32 bit divisor. Other than that it looks fairly straightforward.

I see two options:
Use the CORDIC for everything which will save space but cost time, or a small number of CORDIC ops to produce _1LSBT, _1FSBT, _12Mbps, _1ms, and _100ns, with some good old-fashioned shift and add to produce the rest of the values.
You have many locations where you are loading constants into registers, so if those registers remain unmodified during driver operation the amount of space taken for the integer maths could be offset by base code savings.

So, I'm investigating the later approach given that the former approach is fairly straightforward.

In the end it will be interesting to see the comparison of the approaches.

rogloh · 2020-04-29 07:29

Yes AJL I was also thinking the computation of just a handful of key values with the CORDIC first makes sense. Then just do the shifting, adding and integer multiplies needed from these 32 bit starting integer values (using regular MUL etc where possible) which will all derive the needed 32 bit result that is then written into the correct addresses.

All this patching code could probably itself run in hub exec mode to avoid a lot of extra COG/LUTRAM instruction overhead in the driver code, or the additional code chaining needed when the driver COG initializes. Patching data into LUTRAM addresses is a little more tedious unfortunately as I already found from my own work but is still doable with RDLUT/WRLUT.

Alternatively if this timing configuration adjustment is only done at USB driver COG startup time, you can also patch into the source HUB address before you even start the COG which may become easier to do but removes the option of dynamic re-config during USB operation down the track if that is also attempted one day.

AJL · 2020-05-01 09:06

Ok, This is the untested, unreviewed code, so probably has bugs or other issues, and the cycle counting I've done might be slightly off.

I've focused on the calculation rather than the patching, but I think using a block of registers to hold the results rather than the ##immediates of the original code should help all around.

Interrupts are stalled to protect the overlapping CORDIC operations. At around 150 cycles long this shouldn't be a problem.

To enable runtime configuration, the values of the form '_1ms * 2' have been given their own labels, e.g. '_2ms'

As @rogloh has pointed out, there are a number of values calculated that are never used; Most of these are the time-out values (TO_NODATA, TO_SETADDR, CONNECT_WAIT, TO_STANDARD).
I've left them in, in case they are needed in the future, and some are for intermediate steps to calculate newly labelled values.
Some values will need multiple labels rather than keeping multiple copies, ( _2ms and TO_CHGADDR),(_8ms, KBD_POLL_INTERVAL and MOUSE_POLL_INTERVAL), (TO_NODATA and TO_SETADDR).

If I've worked it out correctly, total runtime comes in at less than 250 cycles, and it fits in 87 code longs plus 7 longs to hold input values, 26 longs for the calculated values, and 1 temp long for rounding.
6 of the input longs could probably be folded into the code as ##immediates, but I didn't at this point so that I wouldn't need to recalculate the cycle counts and potentially re-sequence some of the code for CORDIC timing window purposes. If I've goofed the cycle counts that could be considered.

The calculated values are used in 33 locations throughout @garryj 's code, so with this code separate and the results table in the cog, the cog driver might shrink slightly and no patching routine would be needed.

I've used ALTR heavily to reduce the need for dedicated copy instructions, but that can make it harder to trace where results end up.

I've started with the assumption of an integer clock frequency as the values being generated are not particularly sensitive to differences of 100kHz, let alone fractions of Hertz.

So, here it is for your consideration and general improvement.

Anthony.

rogloh · 2020-05-01 11:06

This looks to be a very good contribution to assist things moving forward AJL. Though I can't deal with those upper case PASM2 mnemonics shouting at me. LOL.

I've focused on the calculation rather than the patching, but I think using a block of registers to hold the results rather than the ##immediates of the original code should help all around.

Ideally that would be the case. Retaining this block can possibly burn COGRAM register space which garryj may wish to be reserved for other purposes (not sure there). Depending on the outcome of that, patching some opcodes may still be required, hopefully not though. Using a block of timing registers will save lots of labels and address computation for all those patch addresses.

One thing that you also found is that each ## burns an additional long anyway which can be avoided with a register value instead of a ## constant (at least in COGRAM). This is helpful way to pay for the cost of the extra timing register if the ## was in COGRAM, but not if in LUTRAM. Further savings can be had if the same register constant gets used multiple times. It's one of those typical tradeoffs we have to make.

With any luck this timing customisation code could be run in hub-exec mode (eg. called at setup time as part of the usb_host_start routine before it fires the bus up) and would not need to burden the COG with any real additional execution space in COGRAM. It could be first loaded into LUTRAM too if needed, perhaps with some chain loading if it doesn't need to patch the LUTRAM code.

garryj · 2020-05-01 18:02

@AJL, thanks for the code! Yeah, that handful of unused constants were there as calculations that I thought may be of use somewhere down the road...

@rogloh, my current plan is to do these sysclock based calculations, if necessary, when the USB.start method is called. A USB.stop method will be added to allow the object overlord to shut down the USB prior to changing sysclock, as not doing so would not be a good thing 😢.

rogloh · 2020-05-01 23:12

garryj wrote: »

A USB.stop method will be added to allow the object overlord to shut down the USB prior to changing sysclock, as not doing so would not be a good thing 😢.

That makes sense initially. Even though it is only a secondary future desire of mine I was also wondering about how a dynamic clock change while your driver is already running could potentially work and how the USB devices on the bus would react if the PLL change takes more than 1ms to stabilise. First completing the current transfer then going into a USB suspend state might be a good way to go before the PLL changes and it requires some COG sequencing before changing the PLL but ultimately perhaps there may be some controlled way found to change the timing on the fly. If so that would be very cool indeed.

I'll have to also start thinking about if/how it could even be done on my HyperRAM driver though... I already know my video driver COG will need a restart so there's going to be a brief video glitch there of at least a lost frame.

In general thinking about this capability of changing the P2 PLL is useful for cases where you might want to put the P2 into a sleep state for consuming less power with slower clock speeds but not shutting down fully. That was sort of the purpose of that USB suspend state too I guess.

evanh · 2020-05-01 23:17

Recalibrate the time base without reinitialising ... That's a tough call, especially for cog code space.

rogloh · 2020-06-27 05:02

Discussed continued here from a different video thread...

garryj wrote: »

@rogloh, I've been moving at a snail's pace on the USB kbd/mouse improvements, but I can finally see a light at the end of the tunnel. Dynamic sysclock changes and USB on P2 can get along!
All sysclock related calculations are in registers and USB suspend, resume and reset are now functional. There is a method that takes the new frequency and clockmode values and if there is a USB device connected, the bus will be suspended while the sysclock is getting changed and when that completes, the USB will be resumed and it will detect the new frequency and recalculate all NCO and timing values. If no device is connected, there is no need for suspend/resume as in the "connect wait" state it looks for a sysclock change and put the usb in the idle state while it recalculates.

The code is still a mess, but I'm hoping to get it cleaned up and the demo out in the next few days. It runs on P2 revs A/B, though revA doesn't work as smoothly as revB when it comes to changing the NCO and sometimes the bus will stall. If that happens you need to follow up with a "hard" bus reset to get things going again.

That sounds really good garryj. It will be very useful for situations where you want to slow down the P2 for sleep modes, or change the frequency if video requirements change at runtime. At some point it would also be nice to be able to spawn a common USB driver COG from HUB RAM using pin numbers provided at runtime rather than needing multiple instances statically hardcoded to each port too (but that is somewhat independent of the timing change).

garryj · 2020-06-27 18:08

rogloh wrote: »

That sounds really good garryj. It will be very useful for situations where you want to slow down the P2 for sleep modes, or change the frequency if video requirements change at runtime. At some point it would also be nice to be able to spawn a common USB driver COG from HUB RAM using pin numbers provided at runtime rather than needing multiple instances statically hardcoded to each port too (but that is somewhat independent of the timing change).

Already done. The object that starts the USB object dictates the basepin. The only restriction is that basepin must be an even pin# and the (basepin addpins 3) pinfield order is assumed to be the same as that used by the Parallax Serial Host add-on board.

evanh · 2020-06-27 22:13

Garry,
How low can sysclock go with your USB code working?

rogloh · 2020-06-27 23:49

garryj wrote: »

Already done. The object that starts the USB object dictates the basepin. The only restriction is that basepin must be an even pin# and the (basepin addpins 3) pinfield order is assumed to be the same as that used by the Parallax Serial Host add-on board.

Great. Many thanks for continuing to improve your driver by the way. It's extremely useful to have some USB devices working on P2 already, and your further changes will only extend that usability.

garryj · 2020-06-28 00:10

evanh wrote: »

Garry,
How low can sysclock go with your USB code working?

80Mhz is a minimum for full-speed and it looks like low-speed starts flaking out when below 50Mhz.

evanh · 2020-06-28 02:49

Thanks.

garryj · 2020-09-17 22:16

The KbMObjDemos_v0_1_4.zip package has been added to the first post of this discussion. It includes the latest USB host + "boot protocol" keyboard/mouse driver P2 object. The zip package includes three clients that offer differing binary footprints. The README.md file in the zip describes the features and changes.
@rogloh, my apologies for taking so long to get this out.

P2 Hosted USB Keyboard/Mouse

Comments