cogserial - fullduplex smart serial using interrupt
msrobots
This is the first running version, not really optimized.
It is a replacement for smartserial in spin2gui. It runs on its own cog and uses the LUT as send and receive buffer.
Currently I am saving one byte per long, giving a 255-byte send buffer and a 255-byte receive buffer. This will be optimized soon and will give either 4 times the buffer size for one pair, or 2 times the buffer size for each of two send/receive pairs. Not sure yet.
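To make the buffer trade-off concrete, here is a small host-side Python model of storing four bytes per 32-bit long instead of one. It is only a sketch, not the driver's PASM: the names `pack_bytes` and `unpack_bytes` are made up for illustration, and the lowest-byte-first order is an assumption.

```python
def pack_bytes(data: bytes) -> list[int]:
    """Pack bytes four per 32-bit long, lowest byte first (assumed order)."""
    longs = []
    for i in range(0, len(data), 4):
        value = 0
        for shift, b in enumerate(data[i:i + 4]):
            value |= b << (8 * shift)
        longs.append(value)
    return longs


def unpack_bytes(longs: list[int], count: int) -> bytes:
    """Recover the first `count` bytes from a list of packed longs."""
    out = bytearray()
    for value in longs:
        for shift in range(4):
            out.append((value >> (8 * shift)) & 0xFF)
    return bytes(out[:count])


# 1024 bytes fit into 256 longs when packed, versus 1024 longs unpacked.
packed = pack_bytes(bytes(range(256)) * 4)
```

The cost of the packed layout is the extra shift and mask work on every read and write, which matters inside a receive interrupt.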
I am using an interrupt for receiving, but have not yet been able to use an interrupt for sending, so sends are done by the cog itself.
It has an extended start method, StartEx, where one can also provide the smart pin selector mode and baud rate for each send and receive channel, so it is possible to send at, say, 115_200 and receive at 230_400.
Besides the Spin stub, the cogSerial driver needs a 4-long mailbox to communicate. Like JDserial it is easily usable from assembler. Or it will be.
Because for some #@$@#%$@ reason I cannot get fastspin to start my cog with a mailbox address as parameter, and I have to manually patch the needed parameters into the DAT block in my start method before loading the COG.
There is still some work to do to make it nicer, but as is it works.
I have no clue how fast it will work; I am just testing it from fastspin's terminal right now.
Enjoy!
Mike
Comments
I am not sure why parameter passing through ptra does not work. Is there maybe a problem with using @varname in sub-objects?
I sort of got stuck there and just patched the values to get it running; I will examine it further today.
At least it is working.
Enjoy!
Mike
If you're still having trouble with mailboxes, post the mailbox version of your code and we can take a look at it (many eyes make bugs easier to find).
Is there any documentation for fastspin2?
In the fastspin docs/ folder, which gets copied to the spin2gui doc/ folder. For Spin specifically there's a spin.md. Mainly it covers differences between "standard" Spin and fastspin.
Writing 0 back to mbox[0] could happen at any time after the assembly has read everything it needs out of the mailbox. It's just a device to make sure that the Spin code doesn't get ahead of the assembly and overwrite the mailbox while the PASM is still using it. In P1 Spin nobody worried much about that because Spin was so much slower than PASM. But with fastspin it can matter, because the Spin COG is itself running PASM "under the hood", so it's just about as fast as the PASM COG.
I think you'll have to look a bit closer. The pasmtest.spin2 code I posted above is in fact starting assembler (the PASM code it starts is in the DAT section and starts with the label "asmfunc"). I have started Spin code in other demos. But actually in fastspin there's really not much difference -- it's all assembly code once the compiler is done with converting the Spin into PASM.
YES, this was exactly part of my problem.
But mostly I stumbled over a difference between P1 and P2.
I wasn't interested in the value behind ptra, but in its address, so I did not do a rdlong cogptr, ptra but tried a mov cogptr, ptra.
And that did not work at all. And I still don't know why.
But your version got me to a very clean approach. I now create a local array in my start function and populate it with the parameters needed by my cog. Then I set a sync value, start the cog, and wait until it has read the parameters and cleared my sync value, so that the start function can return and destroy the local array.
The nice thing is I currently need just 4 longs in the HUB for my mailbox, but have 6 parameters to feed at start. Using a local array for the start parameters solves this problem also.
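The start handshake described here can be modelled on a PC with two threads. This is a sketch only: the Python thread stands in for the PASM cog, the list stands in for the local array in HUB, and the six parameter names are hypothetical since the thread does not list them.

```python
import threading
import time


def driver(params: list[int], received: list[int]) -> None:
    """Stand-in for the PASM cog: copy the parameters, then clear the sync slot."""
    received.extend(params[1:])  # read everything needed first
    params[0] = 0                # only then release the caller


def start(rx_pin: int, tx_pin: int, rx_mode: int, tx_mode: int,
          rx_baud: int, tx_baud: int) -> list[int]:
    """Stand-in for the Spin start method using a temporary parameter array."""
    # Slot 0 is the sync flag, slots 1..6 hold the six start parameters.
    params = [1, rx_pin, tx_pin, rx_mode, tx_mode, rx_baud, tx_baud]
    received: list[int] = []
    worker = threading.Thread(target=driver, args=(params, received))
    worker.start()
    while params[0] != 0:        # wait until the driver has taken its copy
        time.sleep(0)
    worker.join()
    return received              # params may now go out of scope safely


copied = start(1, 0, 0, 0, 230_400, 115_200)
```

The important ordering is in `driver`: the sync slot is cleared only after every parameter has been copied, so the caller can never free the array too early.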
So there is no @%#@@#@% problem with fastspin; the problem was sitting 2 feet away from the monitor.
I need to clean up the code a bit and will post a nicer version.
Thanks,
Mike
I'll have a fix for it in the next release.
Everybody here is complaining about the missing tool development; I, on the other hand, basically have to reload all my tools every week or so because you, @"Dave Hein" and @Rayman are pushing out changes faster than I can update my tools.
I think my problem was that I assumed I could read ptra at the start of my program just like I used to on the P1.
On the P1 I can do
But as stated before, your approach with a local array looks a lot cleaner and works.
Attached now is a slightly better version.
Enjoy!
Mike
okay := cog := coginit(1,@loop,@command) + 1
' okay := cog := cognew(@loop, @command) + 1
I don't know why that would make any difference.
Thoughts?
Do you know if there is a similar issue with cognew?
Fighting with the interrupts, I decided to switch strategy. The driver now supports two pairs of full duplex serial channels, but you can also use it as a single driver with just one pair. If 2 pairs of full duplex serial channels are used, the buffer size for each channel is halved: currently 128 bytes per channel if 2 ports are used, or 256 bytes per channel if just one port is used.
The driver uses int1 for serial receive on rx1, int2 for serial receive on rx2, and int3 for checking transmit status and transmitting both output buffers, running on a timed base.
Right now I just run int3 every 500 sysclocks. This is just a test; I need to calculate the interval from the bit rate of the faster transmit channel (which I basically have), say so that the interrupt triggers twice within the time needed to catch the fastest tx. But right now it is 500 sysclocks. I have the numbers but haven't done the math yet.
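For what it is worth, the math could look like the following Python sketch. It rests on two assumptions that are not confirmed in the thread: that a character is 10 bit times (8N1), and that "twice" means the timer should fire two times per character time of the faster tx channel. The function name `tx_poll_interval` is made up.

```python
def tx_poll_interval(sysclk_hz: int, fastest_tx_baud: int,
                     bits_per_char: int = 10) -> int:
    """Sysclocks between timer interrupts so that the timer fires
    twice per character time of the fastest transmit channel."""
    clocks_per_char = sysclk_hz * bits_per_char // fastest_tx_baud
    return max(1, clocks_per_char // 2)


# Example values only: a 180 MHz sysclock and 115_200 baud.
interval = tx_poll_interval(180_000_000, 115_200)
```

With those example values a character takes 15625 sysclocks, so the interval comes out at 7812, far more relaxed than the fixed 500 used for testing.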
This is work in progress, but a lot of fun for me. I need to think about a test harness running on other cogs to really stress this thing.
But currently I have a 2 port full duplex serial buffered driver running in a cog, just needing 8 longs in HUB for communicating. This is fun...
Mike
Is "loop" a Spin function or PASM code? If it's a Spin function then you could be running into the fastspin 3.9.15 bug I mentioned earlier; it's a memory corruption kind of thing that affects both cognew and coginit, so small code changes that seem unrelated can cause it to trigger. If "loop" is PASM code then that's not the problem.
Otherwise coginit and cognew are pretty much the same (cognew is translated to coginit with a special first parameter that says "allocate a COG" instead of requiring a specific COG).
Thanks for doing that code and with very good documentation. I can read it better than the original FDSR.
I am trying to get waitcnt(clkfreq+cnt) to work.
I assume that there is a slightly different way in this version of Spin.
I am using spin2gui. The C version uses this: waitcnt(getcnt() + CLKFREQ/2);
I will try the waitx. Will it work in a spin2 file?
waitcnt(clkfreq+cnt) will work in fastspin and spin2gui for both P1 and P2 processors. The compiler will automatically translate it to whatever you need (a waitcnt instruction on P1, and waitct1 on P2).
(This is for Spin code, of course. If you are writing PASM or PASM2 then you have to do the translation to waitct1 or waitx yourself).
Thanks
Yes, loop is the mailbox monitor in PASM. Very strange. I have no idea why coginit works and cognew does not.
I am now using fastspin's feature of providing default values for parameters. That reduced the needed Spin code a lot. Wonderful, thanks @ersmith.
I also did a lot of commenting to keep track of what it is supposed to do, and as far as I can see it does.
The concept here is the full duplex driver, running in its own cog, is buffering 1 or 2 serial full duplex connections using interrupts and smart pins.
Rx for both channels is bound to int 1 and 2, Tx for both channels uses int 3 and the cog itself just takes care of the mailbox to serve the calling program.
The driver supports async access to both pairs of rx/tx and actually reads and writes the result itself to hub so the calling cog just needs to send off commands.
You do not need to use two ports, if not enabled the second pair will not be used.
A couple more days and I can slap an MIT license on it and put my name at the top. Right now it needs more documentation...
Enjoy!
Mike
I have one main cog using one serial driver to talk to the terminal. (2 COGS)
Via mailbox I start a testrunner COG running tests on a second serial driver COG. (2 COGS)
I also have an echo COG running a third serial driver COG. This one reads its RX and writes it to its TX.
The testrunner clears a RAM buffer, transfers 16K of ROM over serial and back into the RAM buffer, and then compares the buffer with the ROM to see if it has done its job correctly.
If I run the buffered driver talking to itself with 2 SPs (RX1 listening to TX1) it runs up to and fails at 90085400 baud. Seems OK.
If I run the buffered driver talking to itself with 4 SPs (RX1 listening to TX1 and RX2 listening to TX2) it runs up to and fails at 90085400 baud. Seems OK.
Now I activate the echo server in between. So the testrunner sends on TX1, the echo server receives on its RX1 and sends out on its TX1, and the testrunner receives on RX1.
And now it already fails at 921600 baud with one channel, and when using both channels it already fails at 460800.
All tests are running at 180 MHz and using SPs 0-7.
What I do not understand is the drastic reduction of transfer speed when using the echo server in between.
The main file to run is testserial.spin2, which will use/include the other files.
Maybe someone can look at it to see if I made some stupid mistake. Or test the driver with some other tool; cogserial.spin2 uses cogserialpasm.spin2.
I simply do not understand why echo is so slow.
Help needed,
Mike
A RX up to SysCLK /2 is not going to be practical between two P2's that are not phase locked.
How does it fail? Are early bytes OK, and later ones fail?
Can you add a char counter to each stage, and check those after a run?
I've found char counters a great cross check for serial stress testing.
I also send blocks of "U" and check the MHz with a frequency/edge counter - then compare that with the expected baud rates.
This finds (usually undocumented) creepage issues in the links. eg Sometimes, extra stop bits are added at high baud rates.
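The reason "U" works for this check is that its 8N1 frame is a perfect alternating bit pattern, so back-to-back "U" characters look like a square wave at half the baud rate. A small Python sketch, assuming 8N1 framing with LSB first (the helper names are made up):

```python
def frame_8n1(byte: int) -> list[int]:
    """Line levels for one 8N1 character: start bit, 8 data bits LSB first, stop bit."""
    data = [(byte >> i) & 1 for i in range(8)]
    return [0] + data + [1]


def expected_square_wave_hz(baud: int) -> float:
    """Back-to-back 'U' frames toggle every bit time, so one full
    cycle takes two bit times."""
    return baud / 2


bits = frame_8n1(ord("U"))
```

If the measured frequency is lower than `expected_square_wave_hz(baud)`, something in the link is adding gaps or extra stop bits between characters.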
Yes, I increment by 115200 and it fails at 90085400, so 90 could work.
And it does that full duplex with two pairs of RX and TX.
- A RX up to SysCLK /2 is not going to be practical between two P2's that are not phase locked.
Phase locked, that might be part of the problem. The delay shows up when two different COGs are reading the same smart pins, or to be clear, the RX smart pins are always reading the pin next to them, which is driven as a TX smart pin. TX and RX each have their own smart pin, but RX reads the pin next to it so that I do not have to put resistors between the pins.
- How does it fail?
Good question. I clear a 16K buffer, then send the ROM content async, receive it async, wait for completion of write and read, then compare the buffer with the ROM; if not equal, fail.
I also have time-outs on RX if nothing is there, but I am not sure yet if they even hit.
- Can you add a char counter to each stage, and check those after a run ?
I think I can try that to see when it fails.
- I also send blocks of "U" and check the MHz with a frequency/edge counter - then compare that with the expected baud rates.
- This finds (usually undocumented) creepage issues in the links. eg Sometimes, extra stop bits are added at high baud rates.
Could you maybe try that? I do not have the equipment to do so. That might be the case here.
But my primary guess is that I have a stupid typo somewhere, checking rxWhatever instead of txWhatever.
EDIT: ahh - I forgot
When transmitting is successful I 'lose' about 15-90 sysclocks per byte when comparing the set baud rate with the sysclocks used. But that is the call overhead and seems to be quite constant.
When failing this goes up to 300.
Enjoy!
Mike
i.e. if you go from 90 MBd to under 1 MBd, that's a massive drop.
IIRC Chip has reported samples-per-bit in the order of 3-4-5 are needed for true ASYNC, (ie between two separate clocked P2's ) and many MCUs have x8 sample UART modes.
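To put numbers on that, here is a quick Python calculation of sysclocks (that is, possible sample points) per bit for the baud rates mentioned in this thread, using the 180 MHz sysclock from the test description. The function name is made up and this is plain arithmetic, not a statement about how the smart pin samples internally.

```python
SYSCLK_HZ = 180_000_000  # sysclock used for the tests in this thread


def sysclocks_per_bit(baud: int, sysclk_hz: int = SYSCLK_HZ) -> float:
    """Number of sysclocks that elapse during one bit time."""
    return sysclk_hz / baud


for baud in (90_085_400, 921_600, 460_800):
    print(f"{baud:>10} baud: {sysclocks_per_bit(baud):8.2f} sysclocks per bit")
```

At the self-loopback limit there are only about 2 sysclocks per bit, well below the 3 to 5 samples quoted above, while the failing echo rates still leave roughly 195 and 390 sysclocks per bit. So bit sampling alone does not look like the limit in the echo case.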
Yes, I was considering skipping the SSP reading next to it and jumpering the pins, but I am afraid to just jumper them. I need some resistors; I don't want to fry pins. And all my electronics stuff is still in boxes, since I moved recently.
But since reading the next pin works with a single COG, it should work with two COGs, except that it doesn't.
And you are exactly right, going from 90 down to 1 does not really make sense. It never hangs on TX but hangs on RX as far as I could see.
IIRC Chip has reported samples-per-bit in the order of 3-4-5 are needed for true ASYNC, (ie between two separate clocked P2's ) and many MCUs have x8 sample UART modes.
My RXcheck times out after 100_000 cycles, so about 800_000 sysclocks. TX1 and TX2 are using int3, RX1 int1, RX2 int2.
Maybe putting the TXes on int1?
I am grasping at straws right now.
Enjoy!
Mike
One thing that does/can change, with 2 COGs vs one, is the relative opcode phase, since opcodes are 2 sysclks.
ie Talking-to-self would always be opcode-phase-locked, but talking to another might be off by half an opcode time. Maybe that matters ?
Can you add an extra stop bit to Tx ? That can give more tolerance to creep, and it may change the failure frequency.
because it is different. The one pair (RX1/TX1) version fails with a timeout on RX1, but the two pair version (RX1/TX1 and RX2/TX2) fails with buffer check wrong.
Not sure why, but at least some hint.
As for 2 stop bits, it might be worth a try; I just don't know how to do that with smart pins and must read a bit about that.
As for being off by 1 or 2 clocks, I do not think that this would explain 90 Mbit vs 10 Mbit.
The current version does this when using just one rx/tx pair and the echo server, and this when using two pairs (rx1/tx1 and rx2/tx2) with the echo server also using two pairs:
The first number is the sysclocks taken for the test, negative on errors.
The number after PASS is the effective baud rate including code overhead, and the third number is the deviation in sysclocks per byte caused by that overhead.
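As a rough cross-check of those three numbers, they can be related with the following Python sketch. It assumes 10 bit times per byte (8N1) and that the sysclock count covers exactly the bytes transferred; the actual test code may count differently, and both function names are hypothetical.

```python
def effective_baud(num_bytes: int, sysclocks_taken: int, sysclk_hz: int,
                   bits_per_char: int = 10) -> float:
    """Baud rate actually achieved, including all code overhead."""
    return num_bytes * bits_per_char * sysclk_hz / sysclocks_taken


def overhead_per_byte(num_bytes: int, sysclocks_taken: int, sysclk_hz: int,
                      set_baud: int, bits_per_char: int = 10) -> float:
    """Extra sysclocks per byte compared with the ideal time at the set baud rate."""
    ideal_clocks_per_byte = bits_per_char * sysclk_hz / set_baud
    return sysclocks_taken / num_bytes - ideal_clocks_per_byte


# Made-up example: 16K bytes at 115_200 baud, 180 MHz, 50 sysclocks lost per byte.
taken = 16_384 * (15_625 + 50)
baud = effective_baud(16_384, taken, 180_000_000)
lost = overhead_per_byte(16_384, taken, 180_000_000, 115_200)
```

In this made-up example the effective rate drops from 115200 to roughly 114833 baud, which shows how small a constant per-byte overhead looks at low baud rates and how much it dominates near the top end.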
Leaving the echo COG out and just running the smart pins in one COG:
At the top end I seem not to outrun the SPs, but the processing code.
First number: sysclocks taken; second: effective baud rate; third: deviation, i.e. the gaps between chars in sysclocks.
The same goes for the two-port code.
I am still digging here...
Enjoy!
Mike