cogserial - fullduplex smart serial using interrupt

jmg · 2019-02-09 22:10

msrobots wrote: »

Losing bytes at the front of the buffer is more fatal, because you would have an inconsistent stream missing data somewhere in between.

I like the idea of varying values of 'fatal'
Losing data is usually fatal, no matter where in the packet it occurs. If that byte did not matter, you did not need to send it.

I like to flag such cases (usually to a pin during tests), so I can confirm the error trap did work, and so any stress test can find when such overflows occur.

With P2's high possible baud rates, there will come some point where interrupts cannot keep up, especially with 2 channels of TX and RX, as I think you have here.

As well as software over-runs, there are also HW over-runs to worry about, and I think P2 has no HW means to flag 'RX byte failed to read, before next byte replaced it'?
To avoid that, the RX int code needs to be very compact indeed, and the highest bauds would only manage a single RX in one COG.

msrobots wrote: »

... I am not just thinking about quadrupling the size but faster access between HUB and LUT thru reading and writing longs instead of bytes.

Clumping bytes into longs is faster, but that also has fish-hooks: what happens if your system sends an odd number of bytes, then pauses?
Some means to flush partially loaded longs is needed, and then that needs a means to indicate how many bytes are valid...
FIFO UARTs use timers to manage such sub-threshold remainder cases.
In some cases, you could impose a rule that TX side always sends 4N bytes, but noise can false start a byte, which could give a permanent offset of 1, unless managed.
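
The flush problem described above can be modelled on a PC. The following is only a sketch in Python, not P2 code: it packs bytes little-endian into 32-bit longs, keeps a valid-byte count next to each long, and flushes a partial long after an idle timeout, much as a FIFO UART timer would. The class name, the tick-based timeout and the `(long, count)` output format are all assumptions made for illustration.

```python
# Host-side model only: pack received bytes into 32-bit longs, and flush a
# partially filled long after an idle timeout. All names are hypothetical.

class LongPacker:
    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # ticks of silence before a flush
        self.acc = 0                      # long being assembled
        self.count = 0                    # valid bytes in acc (0..3)
        self.idle = 0                     # ticks since the last byte
        self.out = []                     # (long, valid_byte_count) pairs

    def push(self, byte):
        # Little-endian packing: the first byte lands in bits 7..0.
        self.acc |= (byte & 0xFF) << (8 * self.count)
        self.count += 1
        self.idle = 0
        if self.count == 4:
            self._emit()

    def tick(self):
        # Called once per timer tick in which no byte arrived.
        if self.count:
            self.idle += 1
            if self.idle >= self.idle_timeout:
                self._emit()

    def _emit(self):
        # Hand over the long together with the number of valid bytes in it.
        self.out.append((self.acc, self.count))
        self.acc = 0
        self.count = 0
        self.idle = 0


packer = LongPacker(idle_timeout=3)
for b in b"HELLO":        # 5 bytes: one full long plus one leftover byte
    packer.push(b)
for _ in range(3):        # line goes quiet; the timer flushes the remainder
    packer.tick()
print([(hex(value), count) for value, count in packer.out])
```

The valid-byte count is what lets the consumer tell a flushed remainder from a full long. A noise-induced false start would still shift the packing by one byte, so this does not remove the need for some resynchronisation rule.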

Highest raw serial speeds may need 2 COGS using the shared LUT feature ?

The highest serial speed I can find out in the real world is ~50MBd on the Fast Serial (OPTO) mode of FTDI (that's a 13-14-15 bit frame, so it is not quite a 10b UART byte), but of course there are P2-P2 links possible.

I get these test results on FT232H, PC to remote, using the FT232H clkout ability, which goes to 30MHz.

  Summary of FT232H Fast Serial PC bulk sending tests (use CLK out of FT232H to feed FSCLK)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  FT232H Fast Serial speed equations model as:

  30MHz_30MBaud
    Calc       : (30.00054M*5/14)*(256*14)/(255*14+15) = 10711489
    CTR        : 1-ans/10.71147M = -1.855ppm, 1-ans/10.71149M = 11.72ppb
    Deviations : mostly /14, appx 1:256 is /15
    Bytes/sec  : 2.14229797 MB/s
  15MHz_15MBaud
    Calc       : (15.000271M*5/13)*(512*13)/(511*13+14) = 5768468.343
    CTR        : 1-ans/5.768466M = 0.406ppm
    Deviations : mostly /13, appx 1:512 is /14
    Bytes/sec  : 1.153693 MB/s
  7.5MHz_7.5MBaud
    Calc       : (7.500135M*5/13) = 2884667.307
    CTR        : 2.884667M
    Deviations : looks like 7.5MHz can stream at /13, gapless
    Bytes/sec  : 576.933 kB/s

  Hard to predict the 50MHz equation, ie more /15 ? or do some /16 appear ?
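
The arithmetic in the table can be checked with a few lines of Python. This sketch only reproduces the equations as posted. The factor of 5, and the model of one frame in every N being one bit longer, are taken from the post and not from FTDI documentation. The function name is made up for illustration.

```python
# Reproduces the arithmetic from the table as written. The factor 5 and the
# "one frame in N is one bit longer" model come from the post itself.

def modelled_count(fsclk_hz, short_bits, long_every=None):
    """Modelled counter value for frames of short_bits, with one frame in
    long_every stretched by one extra bit (None = no stretched frames)."""
    base = fsclk_hz * 5 / short_bits
    if long_every is None:
        return base
    n = long_every
    return base * (n * short_bits) / ((n - 1) * short_bits + (short_bits + 1))

rows = {
    "30MHz":  modelled_count(30.00054e6, 14, 256),
    "15MHz":  modelled_count(15.000271e6, 13, 512),
    "7.5MHz": modelled_count(7.500135e6, 13),
}
for name, count in rows.items():
    # The bytes/sec figures in the table equal the modelled count divided by 5.
    print(f"{name}: count = {count:.3f}, bytes/s = {count / 5:.2f}")
```

The same function could be called with 15 or 16 bit frames to try candidate models for the 50MHz case, but that would need measured counter values to compare against.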

evanh · 2019-02-09 23:05

msrobots wrote: »

I post the current code before I destroy it again, still need to shorten it, I really want some sort of string input supported by PASM in the driver.

There is a block copy feature between hubRAM and cog/lutRAM. SETQ + WRLONG will write contiguous longwords to hubRAM from cogRAM in a burst-like transfer. One clock per long! This is documented just after the fifo section in the main doc.

SETQ2 + WRLONG copies lutRAM to hubRAM. And equivalents for RDLONG to copy back from hubRAM. The tricky part is interrupts will be blocked for the burst duration. But they are already jittery from any hubRAM read/write.

You could make use of the fifo instead. Using this could reduce interrupt jitter down to only a few clocks. The fifo won't give you max copy speed but it will make a big difference to a tight copy loop. Particularly for reading hubRAM. RDFAST/WRFAST are first documented instructions in the main doc.

ersmith · 2019-02-09 23:07

@msrobots: do you really need to specify the rx1/tx1 parameters and rx2/tx2 parameters at startup time? Wouldn't it be nicer to be able to start one port now and another port later? But maybe that's too complicated.

For doing just startup code it doesn't matter too much how you set it up; performance won't be too different either way. With the function call method you're hiding the assignments by turning them into parameter passing, but I think the code generated will be pretty similar (instead of wrlongs into the message box they will end up being wrlongs that push onto the stack). On P2 maybe the push method will be a little smaller and faster because we have autoincrement for the stack, so perhaps your original way is better. I wouldn't use @parameter in anything that's called very often, but if it's just once at initialization it shouldn't hurt.

AJL · 2019-02-10 02:29

msrobots wrote: »
AJL wrote: »
msrobots wrote: »
Thanks for all your input, but I am still stuck

I tried this
rx1_isr		rdpin	rx1_char,	rx1_pin			'get received chr
		shr	rx1_char,	#32-8			'shift to lsb justify
		mov	rx1_byte_index, rx1_head
		and	rx1_byte_index, #%11			'now 0 to 3
		mov	rx1_address,	rx1_head		'adjust to buffer start
		shr	rx1_address,	#2
		add	rx1_address,	rx1_lut_buff 		'by adding rx1_lut_buff
		rdlut	rx1_lut_value,	rx1_address
		
		neg	rx1_byte_index				' now 0 to -3
		add	rx1_byte_index,	#3			' now 3 to 0
		altsb 	rx1_byte_index,	#rx1_lut_value
  		setbyte rx1_char

'		cmp	rx1_byte_index,	#0		wz
'	if_z	setbyte rx1_lut_value,	rx1_char, #3
'		cmp	rx1_byte_index,	#1		wz
'	if_z	setbyte rx1_lut_value,	rx1_char, #2
'		cmp	rx1_byte_index,	#2		wz
'	if_z	setbyte rx1_lut_value,	rx1_char, #1
'		cmp	rx1_byte_index,	#3		wz
'	if_z	setbyte rx1_lut_value,	rx1_char, #0

		wrlut	rx1_lut_value,	rx1_address		'write byte to circular buffer in lut
		incmod	rx1_head, 	rx1_lut_btop		'increment buffer head
		cmp	rx1_head, 	rx1_tail 	wz	'hitting tail is bad
	if_z	incmod	rx1_tail, 	rx1_lut_btop		'increment tail  - I am losing received chars at the end of the buffer because the buffer is full
		reti1						'exit
but it does not work. And I really need this 4 longs each per pair...

I am losing faith in being a worthy programmer,

Mike
It's been suggested before, but I'll mention it again: Have you tried moving your code to LUT RAM and placing your buffers in COG RAM?

It seems when this has been mentioned previously you have stated that you can't because LUT RAM is full: of buffers.

But if the buffers are moved to COG RAM, you have that space for code, and you'll be able to pack your bytes into longs with ALTSB in COG RAM buffers.

Please correct me if I'm off base here.
Well yes @AJL, I think @evanh mentioned that, but I am not sure why this would help. Maybe you can elaborate. My point of view here is that I have 512 longs of LUT RAM that would nicely fit four 512-byte buffers for RX1/TX1/RX2/TX2.

Cog ram is not 512 longs in my understanding, because of special registers at the end of COG ram or is that different on the P2 vs the P1? I did ask that question before and found no answer yet.

Since I am considering rewriting this completely if I can't find the stupid mistake I made, I am really interested in why two people now recommend using LUT RAM for code and COG RAM as LUT/buffer.

I am sometimes quite slow to understand things, so please bear with me and explain further. I seem to miss some point of the argument why I should try this.

Sure I can copy my code from COG to LUT and run it there, and reuse the COG space as buffer, but why should I?

I currently reuse all initialization code space for register variables. To speed things up I precalculate pointers and have them ready to use. That is about 150 registers, ready to use because they are in COG RAM.

If I have the code in LUT and my buffers in COG, how do I handle the variables I need for rdbyte/rdlong, buffer positions, sizes, and so on?

Keeping them in COG ram would reduce the available buffer size, having them in LUT ram and accessing with rdlut wrlut seems impossible to me.

confused,

Mike

@msrobots,
In COG you have 496 longs of RAM, and in LUT you have 512. With only one byte stored per long in LUT you can have 128 meaningful bytes stored per buffer in LUT.
By swapping your buffers to COG and packing 4 bytes per long, the same capacity (512 bytes of buffer) will take only 128 longs total (32 longs per buffer).
This allows you 880 longs total available for variables and code, versus your current 496.
If you double the buffer capacity, you still have 752 longs available for code and variables.
Quadrupling the buffer size isn't possible in this scenario as you run out of COG RAM.

The code for packing the buffers will also be smaller and faster. Do you now see the benefit of swapping?
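
The budget arithmetic above can be laid out as a small Python sketch. It assumes, as the post does, 496 usable COG longs, 512 LUT longs, four buffers, and 4 bytes packed per long. The function name is hypothetical. With those numbers the doubled-buffer case works out to 752 free longs.

```python
COG_LONGS = 496   # usable COG RAM longs, per the post
LUT_LONGS = 512   # LUT RAM longs
CHANNELS = 4      # RX1, TX1, RX2, TX2

def free_longs(bytes_per_buffer):
    """Longs left for code and variables when packed buffers live in COG RAM.
    Returns None when the buffers no longer fit in COG RAM."""
    buffer_longs = CHANNELS * bytes_per_buffer // 4   # 4 bytes per long
    if buffer_longs > COG_LONGS:
        return None
    return (COG_LONGS - buffer_longs) + LUT_LONGS

for size in (128, 256, 512):
    print(size, free_longs(size))
```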

Another, less advantageous approach, would be to shrink the buffer footprint in LUT, move some subroutines to the now available space in LUT, and use 2 longs in COG for packing the current incoming bytes into longs. As you fill each long you then copy it into the LUT buffer at the appropriate location. I predict that the code for this would be slower overall, and more convoluted, maintaining pointers for both byte position in the COG buffer and long position in the LUT buffer.

Of course, none of this addresses the situation where the data sent is not a multiple of 4 bytes. How do you determine when to transfer a long without all bytes populated, and how you flag this? I guess this comes down to your mailbox structure, and whether you include a byte count for reception.

evanh · 2019-02-10 03:51

Of course, none of this addresses the situation where the data sent is not a multiple of 4 bytes. How do you determine when to transfer a long without all bytes populated, and how you flag this?

Mike does have that one solved by doing a read-modify-write of lutRAM. That's what his struggle with SETBYTE was dealing with.

AJL · 2019-02-10 10:16

evanh wrote: »

Of course, none of this addresses the situation where the data sent is not a multiple of 4 bytes. How do you determine when to transfer a long without all bytes populated, and how you flag this?

Mike does have that one solved by doing a read-modify-write of lutRAM. That's what his struggle with SETBYTE was dealing with.

Does that solve the situation for transfer to HUB, and the eventual consumer of the received data?

evanh · 2019-02-10 10:30

I've made a couple of suggestions for speeding up block copies to/from hubRAM.

msrobots · 2019-02-10 11:25

AJL wrote: »

evanh wrote: »

Of course, none of this addresses the situation where the data sent is not a multiple of 4 bytes. How do you determine when to transfer a long without all bytes populated, and how you flag this?

Mike does have that one solved by doing a read-modify-write of lutRAM. That's what his struggle with SETBYTE was dealing with.

Does that solve the situation for transfer to HUB, and the eventual consumer of the received data?

Well, halfway, currently.

The latest published test suite has the driver already using bytes in the LUT instead of longs, but the faster HUB/LUT transfer is not presentable. It is working but - hmm - ugly.

As @evanh guessed, the smart pins run up to sysclock baud. It is kind of scary: I can't get the data into HUB as fast as it comes in. But as far as I have tested it, the smart pins do serial very nicely.

I use interrupts for RX. I could not really figure out how to use a TX interrupt, so I dropped down to a simple cnt event every x sysclocks and poll the pins to see if I can send.

And this thing is rocking quite nicely.

In my current working version I had to remove HEX and DEC from the COG, it is crowded inside there.

But, yeah, we now have a working 2-port full-duplex serial driver running in just one COG, 8 longs in HUB, usable with fastspin in any language you - well, ask @ersmith.

Enjoy!

Mike

pilot0315 · 2019-02-11 04:46

Will this receive asynchronous serial? I am trying to receive GPS data as a test.

PUB rx_read(hubaddress, size)
  rx_read_async(hubaddress, size)
  repeat until rx1_cmd == -1

' receive a block from serial to memory of given size in bytes (does not wait for completion - you may need to check rx1_cmd if done or not later in your code)
PUB rx_read_async(hubaddress, size)
  repeat until rx1_cmd == -1
  rx1_param := hubaddress
  rx1_cmd := size

msrobots · 2019-02-11 19:08

yes, that does it.

Mike

msrobots · 2019-02-17 02:43

Slowly I am getting traction with my refactoring.

I have now byte access to the lut for all channels running and byte + long access to the hub for RX to fill the buffers faster. Now I need to get long access running for buffer transfers on the TX side and I am at 479 longs.

Might be possible right now.

When this is working I might be able to either use rep or even switch to seq +rdfast = at that point I am at a perfect long lut position.

working on it, stay tuned

Mike

msrobots · 2019-02-17 18:18

So I am chasing a rabbit around the block, literally.

jmg wrote: »

msrobots wrote: »

Losing bytes at the front of the buffer is more fatal, because you would have an inconsistent stream missing data somewhere in between.

I like the idea of varying values of 'fatal'
Losing data is usually fatal, no matter where in the packet it occurs. If that byte did not matter, you did not need to send it.
I like to flag such cases (usually to a pin during tests), so I can confirm the error trap did work, and so any stress test can find when such overflows occur.
...

And I had to find out that @jmg is right, again, because my idea that losing bytes at the end is less fatal than losing bytes at the front is completely wrong.

In a buffer-full situation, my RX interrupt now pushes the tail pointer around, giving the read routine a hard time reading the buffer.

At least that is what I think is going wrong right now.
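
One alternative policy, sketched here as a Python model rather than as driver code, is for the interrupt side never to touch the tail pointer. When the buffer is full the new byte is dropped and a sticky overflow flag is set, which could drive a test pin as @jmg suggests. This is not what the posted ISR does. The class and method names are hypothetical.

```python
class RxRing:
    """Single-producer ring buffer that drops the newest byte when full."""

    def __init__(self, size):
        self.buf = [0] * size
        self.size = size
        self.head = 0          # next write position (interrupt side only)
        self.tail = 0          # next read position (reader side only)
        self.overflow = False  # sticky flag; could be mirrored to a pin

    def isr_put(self, byte):
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:       # full: one slot is always kept empty
            self.overflow = True   # record the loss, leave tail alone
            return False
        self.buf[self.head] = byte
        self.head = nxt
        return True

    def get(self):
        if self.tail == self.head:
            return None            # empty
        byte = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        return byte


ring = RxRing(size=4)              # holds at most 3 bytes
accepted = [ring.isr_put(b) for b in (1, 2, 3, 4, 5)]
drained = [ring.get() for _ in range(3)]
print(accepted, drained, ring.overflow)
```

Because each pointer has exactly one writer, the reader always sees a consistent run of bytes up to the point of loss, and the flag says that a loss happened.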

So yes there are varying values of 'fatal'

Enjoy!

Mike

msrobots · 2019-02-19 01:26

Here is the next stable version.

I was able to squeeze in long transfer between HUB and LUT at the receive buffer side. For doing the same on the send side I still need to recover more longs of code space.

But it did make a major difference: the speed has quadrupled already.

This version passes all tests, and does contain a lot of fixes with the buffer handling.

Enjoy!

Mike

evanh · 2019-02-19 04:18

Just a hunch, but I'm not sure interrupts provide a significant benefit in the context of using a whole cog. And their pre-emptive nature brings with it an associated reduction in determinism. This likely tips them into a net negative contribution.

msrobots · 2019-02-19 04:46

evanh wrote: »

Just a hunch, but I'm not sure interrupts provide a significant benefit in the context of using a whole cog. And their pre-emptive nature brings with it an associated reduction in determinism. This likely tips them into a net negative contribution.

Well @evanh, I do not have much knowledge about using interrupts. I just gave it a shot to find out how to use the smart pins. And indeed they work quite perfectly. I have RX1 fire event1 for async receive on int1, and RX2 uses event2 to fire on receive on int2.

I had no luck getting a TX smartpin interrupt running, and I have just one interrupt left anyway, for two transmit channels.

If my experiments are not completely wrong, those events fire the interrupts perfectly up to sysclock (currently 180) baud without problem. I was pretty astonished about that.

So both RX channels feed their respective buffer when they receive a byte. I do not see any problem with determinism there, more the opposite, they catch every byte.

That leaves TX and buffer handling from LUT to HUB (the mailbox to the calling application) to the COG interrupted by RX.

Just for giggles I decided to use int3 based on the counter every 100 clk (adjustable) to serve the TX needs and check if something is there to send and if the pins are ready to take a new byte (some problems there, still). Here we have some issue with determinism, I guess.

So now the rest of my COG takes care of the user-interface or say the mailbox. Pulling Data from HUB to LUT buffer or pushing Data from LUT buffer to HUB while being interrupted by the TX interrupt sending DATA being interrupted by both RX interrupts receiving DATA.

Seems to work fine right now. Just download all the files and run testserial in Spin2gui.

And - yeah - I actually do use a whole COG, 3 interrupts, the complete LUT as buffer and 487 longs used for code.

But this 2-port full-duplex buffered driver needs just 8 longs in HUB.

Enjoy!

Mike

jmg · 2019-02-19 05:48

msrobots wrote: »

Well @evanh, I do not have much knowledge about using interrupts. I just gave it a shot to find out how to use the smart pins. And indeed they work quite perfectly. I have RX1 fire event1 for async receive on int1, and RX2 uses event2 to fire on receive on int2.

I had no luck getting a TX smartpin interrupt running, and I have just one interrupt left anyway, for two transmit channels.

If my experiments are not completely wrong, those events fire the interrupts perfectly up to sysclock (currently 180) baud without problem. I was pretty astonished about that.

Interrupts on RX make sense, as RX has to respond in hard real time. The RX code should be very compact indeed, and build a buffer only as large as it needs to be for the next step to keep up.
TX interrupts make less sense, as the P2 determines when it is ready for another byte.
If you had a HW handshake controlling TX flows, then maybe a TX/Handshake interrupt would make more sense.

The real test of RX handlers, is when you ask both channels to receive continual data at the same time, and there I doubt you will sustain SysCLK baud rates, as that is a new byte every 5 opcodes, per channel, for an average of 2.5 opcodes per byte incoming (!).

If your remote unit can be set to do longer-formats, like up to 32b, you buy more time, but that is quite special.

The practical limit will be some division of SysCLK.

A useful target to aim for there, would be Fast Serial mode of FTDI, which specs 50MHz CLK and frames each byte in ~ 14~16 bit times.
If we take a 200MHz sysclk, that gives 28~32 opcodes per byte, which might be doable with care, on one channel at least?
You need some spare cycles for the COG to handle the next step, unless you want to use 2 COGS sharing a LUT.
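
The opcode budgets quoted above follow from simple arithmetic, sketched here in Python. It assumes 2 clocks per opcode, which is what the "5 opcodes per byte" figure implies for a 10-bit frame at one clock per bit. The function name is made up for illustration.

```python
CLOCKS_PER_OPCODE = 2   # typical P2 instruction time, as assumed in the post

def opcodes_per_byte(sysclk_hz, bit_rate_hz, bits_per_frame):
    """Instruction slots available between consecutive received bytes."""
    clocks_per_byte = sysclk_hz / bit_rate_hz * bits_per_frame
    return clocks_per_byte / CLOCKS_PER_OPCODE

# 8N1 at sysclk baud: 10 bit times per byte, 1 clock per bit.
one_channel = opcodes_per_byte(180e6, 180e6, 10)
two_channels = one_channel / 2          # two RX streams sharing one cog
print(one_channel, two_channels)

# FTDI Fast Serial at 50 MHz with a 200 MHz sysclk, 14 to 16 bit frames.
fast_low = opcodes_per_byte(200e6, 50e6, 14)
fast_high = opcodes_per_byte(200e6, 50e6, 16)
print(fast_low, fast_high)
```

These are upper bounds: interrupt entry and exit, LUT access and any hub access all come out of the same budget.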

The FTDI part does have a simple handshake scheme, so you would probably need to use that on TX side.

msrobots · 2019-02-19 06:42

Yes, as usual you are right.

That sysclock baud is the max transfer speed between two smartpins on the same COG talking to each other. My code to run from HUB to HUB petered out at about a quarter to an eighth of that speed; the average handling time per byte is about 56 sysclocks.

So yes, the pins can go up to sysclock baud, the processing does not.

But 4 independent lines RX1/TX1/RX2/TX2 running full blast now reach 1036800 and fail at 1152000 when running at 180MHz against the echo server,
and 2 independent lines RX1/TX1 running full blast now reach 1958400 and fail at 2073600 when running at 180MHz against the echo server.

That's not too bad.

Enjoy!

Mike

pilot0315 · 2019-02-22 04:17

@msrobots

I get this error when trying to run the code in pnut, and nothing in spin2gui.

pilot0315 · 2019-02-22 04:18

I tried each of the programs and got the same thing. "Expecting and "or"

evanh · 2019-02-22 04:27

Pnut doesn't know spin, only pasm.

pilot0315 · 2019-02-22 17:00

spin2gui did not budge. Is there something I am missing?

msrobots · 2019-02-22 19:17

Well, testecho is a subprogram of the test suite for the driver.

To start the test suite you need to compile testserial.spin2 with fastspin as the main program; it contains all the others.

To use the driver in your program you need to include cogserial.spin2 as an object, and it will include cogserialpasm.spin2.

I separated the Spin routines from the PASM COG into two files.

The current testserial.spin2 uses pins 63/62 for the connection with the terminal and pins 1-6 as internally connected smartpins to test the 2 drivers against each other.

All pin connections the serial driver needs for the tests are made via the smartpins' ability to read/write the pins next to them, so no wiring is needed; the pins just must be free.

Enjoy!

Mike

pilot0315 · 2019-03-05 18:23

ok thanks

pilot0315 · 2019-03-06 18:49

@msrobots

I am looking to possibly use the asm version on the pnut. Do you have an easy example? I read the instructions but it would help me get going faster.
Thanks

msrobots · 2019-03-06 19:28

Yes it sounds complicated, but basically it is not.

You need a mailbox in HUB, 8 longs long: two longs per channel (RX1/TX1/RX2/TX2).

You wait for the command long to be -1; that means the last command finished and a new one can be issued.

To do so you write the parameter long first (if needed) then write the command long to issue a new command.

Then wait for or check periodically if command long is -1 again. Thus the command has finished.

If the command returns a result you find the result in the parameter long.

This works for each of the 4 channels independently of the others, each using two longs.
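
The handshake described above can be modelled in a few lines of Python. This is a sketch of the protocol only, not of the driver. The channel offsets, the order of the two longs within a channel, and the dummy "result" computed by the stand-in driver are all assumptions for illustration.

```python
IDLE = -1                          # command long value meaning "ready"
CMD, PARAM = 0, 1                  # assumed offsets within a channel's two longs
RX1, TX1, RX2, TX2 = 0, 2, 4, 6    # hypothetical channel base offsets

mailbox = [IDLE, 0] * 4            # 8 longs standing in for hub RAM

def issue(channel, command, param=0):
    """Client side: parameter long first, then the command long.
    A real client would spin until the command long reads -1; this model
    simply asserts that it already does."""
    assert mailbox[channel + CMD] == IDLE, "previous command still running"
    mailbox[channel + PARAM] = param      # parameter first
    mailbox[channel + CMD] = command      # writing the command starts it

def fake_driver_step(channel):
    """Stand-in for the driver cog: 'executes' a pending command by
    returning param + command as a dummy result, then signals completion."""
    cmd = mailbox[channel + CMD]
    if cmd != IDLE:
        mailbox[channel + PARAM] = mailbox[channel + PARAM] + cmd
        mailbox[channel + CMD] = IDLE     # -1 tells the client it is done

def done(channel):
    return mailbox[channel + CMD] == IDLE

issue(RX1, command=16, param=0x1000)
busy_before = not done(RX1)
fake_driver_step(RX1)
result = mailbox[RX1 + PARAM]
print(busy_before, done(RX1), hex(result))
```

Writing the parameter before the command matters because the command write is what the driver reacts to. The other three channels stay idle throughout, which is the independence the post describes.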

Starting the whole shebang is another procedure, because you need an array of 22 longs to pass all needed parameters to the starting COG.

Those 22 longs are only needed at start time; for that reason I misuse/reuse the Spin stack for them, but you can use any 22 longs in HUB you want. You need to populate those 22 longs with the start parameter values listed, load PTRA with the address of the first long of the start parameter block, and then start the COG.

You might consider using fastspin to just compile cogserialpasm.spin2 and then look through the generated listing file cogserialpasm.lst or at the generated PASM source cogserialpasm.pasm2.

I broke my current driver now at least 5 times trying to implement long reads and writes between LUT and HUB and am frustrated.

Sadly real work is hammering me right now, so I am not even sure if I have time this weekend for another try.

Mike

pilot0315 · 2019-03-13 04:27

I am studying it. I am going to attempt to skip what I think is the interface from Spin to the PASM and call the routine by assigning the rx, tx etc. values internally.
But I would like an example in P2 PASM that accesses the TX to the serial terminal and displays a number.

Thanks

Martin

RossH · 2019-04-09 11:05

I get the following error when compiling the latest version (2019-02-19) ...

/cogserialpasm.spin2(320) error: Changing hub value for symbol cmdparam

However, the previous version (2019-02-10) compiles and runs ok.

msrobots · 2019-04-09 21:44

yeah, I stumbled across that also, it is a work in progress.

I am currently swamped with other work, but will update as soon as I can.

Enjoy!

Mike

pilot0315 · 2019-06-09 17:56

I am using cogserial to attempt to get the data from a GPS. The GPS is at 9600 baud. I get gibberish from the feed. Just for fun I tried hex and dec; the hex looks right, and I get results once a second that are much longer. Does cogserial not work at 9600 baud? I am using spin2gui.
Printing a string directly "test" works great. The feed does not work in between "test" when printed. The gibberish only works without printing "test".

I have this working on the P1 without problems in SPIN and in Prop C.

Thanks

pilot0315 · 2019-06-09 18:51

Here is the code
