cogserial - fullduplex smart serial using interrupt

jmg · 2019-02-03 05:10

msrobots wrote: »

as for 2 stop bits, might be worth a try, I just don't know how to do that with smart pins, must read a bit about that.

IIRC, you just define TX as 9 bits and align the data so the final bit sent is 1. With the smart pins you can define any number of stop bits this way, up to the 32-bit field width.
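
A minimal sketch of that idea, not tested on hardware. It assumes the TX pin is already configured for async serial transmit, and that tx_x, bitperiod, tx_char and tx_pin (all hypothetical names) are cog registers, with bitperiod holding the baud setting already positioned in the upper bits of X and zeros in X[4:0]:

		dirl	tx_pin					'hold the smart pin in reset while X changes
		mov	tx_x,		bitperiod		'baud setting, upper bits of X (assumed precomputed)
		or	tx_x,		#9-1			'X[4:0] = word size minus 1, so 9 bits
		wxpin	tx_x,		tx_pin
		dirh	tx_pin					'release reset
'
		and	tx_char,	#$FF			'8 data bits
		or	tx_char,	#$100			'bit 8 = 1, shifted out last, so it reads as a stop bit
		wypin	tx_char,	tx_pin			'start bit, 8 data bits, then two bit times high

Each extra high bit added above the data this way should lengthen the stop period by one bit time.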

jmg · 2019-02-03 08:19

msrobots wrote: »
...
the current version does this using just one rx/tx pair and the echo server
running at baud 691200
  45061683 - PASS - 639204 - 146
  45061619 - PASS - 639204 - 146
  45061723 - PASS - 639204 - 146
the first number is the sysclocks taken for the test (negative on errors)
the number after PASS is the effective baud rate including code overhead, and the third number is the deviation in sysclocks per byte caused by that overhead.

Interesting effect - that seems quite a few SysCLKs of overhead for a modest baud rate for P2?

Your times :  
 45061683/180M  = 0.25034268333333333333
 45061619/180M  = 0.25034232777777777778
 45061723/180M  = 0.25034290555555555556

Possible TX times (following usual UART granularity, with 16k taken as 16000 bytes)
 16k*10/691200  = 0.23148148148148148148
 16k*11/691200  = 0.25462962962962962963 - hmm, you get somewhere in the middle

Equivalent Stop Bit time

 16k*10.814813/691200  = 0.25034289351851851852

Expressing that as SysCLKs 
  (180M/691200)*0.814813 = 212.190  (not quite your 146?)

Notice that elapsed time, is not a whole bit time. Most uarts derive a BAUD clock, and all TX's align to that.
That means sending "UUUUUUU" gives an exact baud/2 (5 pulses in 10 bit times) on most UARTS I've tested.

At 691200, you should have about a whole bit time (~260 SysCLKs, or ~130 two-clock opcodes) from char-done to load-next-char for the interrupt, and more if the P2 interrupts on Tx buffer emptied.
What is the exact timing of the TX interrupt ?

Does P2 reset the TX timing on every byte ? (or is it jittering between 10 & 11 bits/char)

I would have expected P2 to be able to pack bytes continually in Tx and Rx. (certainly at 691200)
It certainly needs to be able to receive bytes continually (no gaps)

Any same-COG test is going to somewhat naturally self-pace. but a 2 COG test might have skews in paths in echo ?

cgracey · 2019-02-03 08:24

The smart pin serial modes can absolutely send and receive gapless data.

jmg · 2019-02-03 08:38

cgracey wrote: »

The smart pin serial modes can absolutely send and receive gapless data.

I thought they could/should.
Is the UART TX baud-aligned between bytes ? (ie are fractional bit times between bytes impossible ?)

How much time margin is there on the TX side, and RX side, for interrupts ?
That would be useful in the DOCs, to see exactly when the TX and RX interrupts fire, and the best way to manage normal data, and RS485 data (which needs to wait for end of stop bit, before change of direction)

msrobots · 2019-02-03 08:40

yeah I think my measured times are not correct: it is the time needed to send 16k and receive 16k async, plus the time to read and write to and from the HUB with wrbyte/rdbyte.

I think the pins are transmitting gapless. The 1-COG talking-to-itself driver astonishingly runs at 90Mbaud with 180MHz. It only breaks down when I put my echo server on another COG in between.

That might be a problem of my echo server. I just threw it together in Spin; maybe inline PASM can do better than fastspin, but I think the problem is still a foot away from the screen I am looking at.

EDIT: I should be more precise here: the pins are transmitting at 90Mbaud with a 180MHz clock, but my driver is not fast enough to feed them constantly, so the driver maxes out at around 70Mbaud or so.

I am still working on it,

Enjoy!

Mike

evanh · 2019-02-03 08:53

JMG,
You don't seem to be using your board. How about I take it off your hands.

msrobots · 2019-02-03 09:04

I use INT 1 for RX1 and INT2 for RX2.

I was not able to even envision how to use an interrupt for sending, because when it fires at the time it can send and I have nothing to send, everything stops.

So I used INT 3 in mode #1, just firing every x clocks (currently 100) and checking whether it has something to output in its buffer and can actually output on the smartpin, for both TX1 and TX2.

The rest of the COG just takes care of the mailbox and transferring data from/to buffers and HUB.

Sadly I am running out of space and have to rethink, because I currently use the LUT as a buffer for bytes but store each byte as a long, thus wasting a lot of buffer space. Currently I have four 128-byte buffers for RX1, TX1, RX2, TX2, but if I could address the LUT byte-wise I could have four 512-byte buffers.

I just need to figure out some small way to replace wrlut x,y/rdlut x,y with some call to something addressing bytes in the lut. And I am at 480 longs right now …

I do have an index from 0 to buffer-size for each buffer (currently in longs) and would like to access the LUT byte-wise. I have very little code space left, even though I have already reused the init code space for variables.

' I want to replace all wrlut's and rdlut's used right now
'current code something like this

.rx1block	cmp	rx1cmd,		#0 		wz	'need more bytes?
	if_z 	jmp	#.done					'no - done
'				
		cmp	rx1_head, 	rx1_tail	wz	'byte received?
	if_z	ret						'no - try again don't block the rest
'
		mov	rx_address, 	rx1_tail		'adjust to buffer start
		add	rx_address, 	rx1_lut_buff		'by adding rx1_lut_buff
		rdlut	rx_char, 	rx_address		'get byte from circular buffer in lut
		incmod	rx1_tail, 	rx1_lut_btop		'increment buffer tail
		wrbyte  rx_char, 	rx1param		'write byte to Block
		add	rx1param, 	#1			'adjust Block address
	_ret_	sub	rx1cmd,		#1			'adjust count - try again don't block the rest
'
' now I want to use rx1_head rx1_tail,  rx1_lut_btop as bytes not longs as they are now
'
.rx1block	cmp	rx1cmd,		#0 		wz	'need more bytes?
	if_z 	jmp	#.done					'no - done
'				
		cmp	rx1_head, 	rx1_tail	wz	'byte received?
	if_z	ret						'no - try again don't block the rest
'
'new
*		mov byte_index,       rx1_tail
*		and byte_index,        #%11
*		shl  byte_index,       #3			'byte number * 8 = bit offset within the long
		mov	rx_address, 	rx1_tail		'adjust to buffer start
*		shr	rx_address, 	#2
		add	rx_address, 	rx1_lut_buff	'and adding rx1_lut_buff
		rdlut	rx_char, 	        rx_address	'get long from circular buffer in lut
*		shr  	rx_char, 	        byte_index
*		and  	rx_char, 	        #$FF
'new
		incmod	rx1_tail, 	rx1_lut_btop	'increment buffer tail
		wrbyte  rx_char, 	rx1param		'write byte to Block
		add	rx1param, 	#1			'adjust Block address
	_ret_	sub	rx1cmd,		#1			'adjust count - try again don't block the rest

This adds 6 instructions. Can I do this shorter?

Enjoy

Mike

evanh · 2019-02-03 10:05

You can free up a bunch of cogRAM by putting the code in lutRAM. Here's an example wrapper around your above code:

'-------- Copy lut code into position --------
		setq2	#(LUT_CODE_END - LUT_CODE_START - 1)	'copy length, in longwords
		rdlong	0, ##@LUT_CODE_START			'the "0" is lutRAM zero, or $200 in memory map
		jmp	#\LUT_CODE_START			'jump into the lutRAM copy


ORG   $200                                    'longword addressing
LUT_CODE_START

' I want to replace all wrlut's and rdlut's used right now
'current code something like this

.rx1block	cmp	rx1cmd,		#0 		wz	'need more bytes?
	if_z 	jmp	#.done					'no - done
'				
		cmp	rx1_head, 	rx1_tail	wz	'byte received?
	if_z	ret						'no - try again don't block the rest
'
		mov	rx_address, 	rx1_tail		'adjust to buffer start
		add	rx_address, 	rx1_lut_buff		'by adding rx1_lut_buff
		rdlut	rx_char, 	rx_address		'get byte from circular buffer in lut
		incmod	rx1_tail, 	rx1_lut_btop		'increment buffer tail
		wrbyte  rx_char, 	rx1param		'write byte to Block
		add	rx1param, 	#1			'adjust Block address
	_ret_	sub	rx1cmd,		#1			'adjust count - try again don't block the rest
'
' now I want to use rx1_head rx1_tail,  rx1_lut_btop as bytes not longs as they are now
'
.rx1block	cmp	rx1cmd,		#0 		wz	'need more bytes?
	if_z 	jmp	#.done					'no - done
'				
		cmp	rx1_head, 	rx1_tail	wz	'byte received?
	if_z	ret						'no - try again don't block the rest
'
'new
*		mov byte_index,       rx1_tail
*		and byte_index,        #%11
*		shl  byte_index,       #3			'byte number * 8 = bit offset within the long
		mov	rx_address, 	rx1_tail		'adjust to buffer start
*		shr	rx_address, 	#2
		add	rx_address, 	rx1_lut_buff	'and adding rx1_lut_buff
		rdlut	rx_char, 	        rx_address	'get long from circular buffer in lut
*		shr  	rx_char, 	        byte_index
*		and  	rx_char, 	        #$FF
'new
		incmod	rx1_tail, 	rx1_lut_btop	'increment buffer tail
		wrbyte  rx_char, 	rx1param		'write byte to Block
		add	rx1param, 	#1			'adjust Block address
	_ret_	sub	rx1cmd,		#1			'adjust count - try again don't block the rest

LUT_CODE_END
FIT   $400

EDIT: Added the absolute addressing to the jump. Avoids a bug in Pnut.

evanh · 2019-02-03 10:06

After that you can then move the buffers into cogRAM and use the more powerful ALTxx + GETBYTE/SETBYTE combos.

msrobots · 2019-02-03 14:47

hmm, I already use the complete LUT as buffer for my four serial ports, so I cannot put my code there.

My question was more whether there is a faster way to access bytes out of a long in the LUT, something like wrlut_byte(lutadrss, byte0-3)

But GETBYTE/SETBYTE only work on cogRAM. Still maybe faster than my current attempt; will test.

Thanks,

Mike

evanh · 2019-02-03 21:05

What I'm saying is you'll get better performance if you swap those over. Put the buffers in cogRAM and code in lutRAM.

msrobots · 2019-02-04 02:35

hmm - sounds wrong.

Code execution from LUT is slower than code execution from cogRAM, and if I use alts/d + getbyte I can also use rdlut + getbyte, so no code space savings but slower execution?

confused

Mike

evanh · 2019-02-04 03:37

Code execution in lutRAM is full speed with no penalties. Same as cogRAM. Only limitation is self-modifying doesn't have the flexibility of cogRAM.

RDLUT is the one that's slower. Although the biggest factor is GETBYTE can only be used upon cogRAM so any such use on data from lutRAM needs load and store operations around it.

The ALTxx prefixing instructions provide cogRAM table/buffer indexing in a very convenient package. The extra two clocks are easily made up for by their abilities.

msrobots · 2019-02-04 03:51

hmm - I need to think about this.

I do know that rdlut needs 3 clocks instead of two, but I currently use all 512 LUT longs as - guess - a Look Up Table, and I am on the way to reworking my code. I am down to 438 longs with long buffer addressing.

I am reworking the code to find any differences between 1 pair of RX/TX and 2 pairs of RX/TX. Found some typos, but the main issue of 1 and 2 ports failing with different errors has not lifted its head to greet me.

I am slowly coming to think that the serial driver is OK but the echo server is too slow. But with 4 times the buffer size in the driver it should go further, and that would prove that the issue is in the echo server.

But when I pack 4 bytes per long in my buffer instead of 1 byte per long, I will be able to move data faster between HUB and LUT; that will make a huge difference.

At least this is my current working plan.

Enjoy!

Mike

jmg · 2019-02-04 04:03

msrobots wrote: »

I am slowly coming to think that the serial driver is OK but the echo server is too slow. But with 4 times the buffer size in the driver it should go further, and that would prove that the issue is in the echo server.

Chip has said the Smart Pins can manage gapless send and receive, (at least up to some high baud speeds).
It may be that echo needs asm coding, to copy incoming Rx byte to echo-Tx ?

evanh · 2019-02-04 04:15

Gapless UART transmission requires monitoring the smartpin IN status, which is intended for event/IRQ generation. Using RDPIN can only tell when transmission has ceased.

jmg · 2019-02-04 04:37

evanh wrote: »

Gapless UART transmission requires monitoring the smartpin IN status, which is intended for event/IRQ generation. Using RDPIN can only tell when transmission has ceased.

Most of the P2 Smart Pin docs are rather cryptic, but they do say this:

"X[5] selects the update mode:

X[5] = 0 sets continuous mode, where a first word is written via WYPIN during reset (DIR=0) to prime the shifter. Then, after reset (DIR=1), the second word is buffered via WYPIN and continuous clocking is started. Upon shifting each word, the buffered data written via WYPIN is advanced into the shifter and IN is raised, indicating that a new output word can be buffered via WYPIN. This mode allows steady data transmission with a continuous clock, as long as the WYPIN’s after each IN-rise occur before the current word transmission is complete.

X[5] = 1 sets start-stop mode, where the current output word can always be updated via WYPIN before the first clock, flowing right through the buffer into the shifter. Any WYPIN issued after the first clock will be buffered and loaded into the shifter after the last clock of the current output word, at which time it could be changed again via WYPIN. This mode is useful for setting up the output word before a stream of clocks are issued to shift it out.

X[4:0] sets the number of bits, minus 1. For example, a value of 7 will set the word size to 8 bits.

WYPIN is used to load the output words. The words first go into a single-stage buffer before being advanced to the shifter for output. Each time the buffer is advanced into the shifter, IN is raised, indicating that a new output word can be written via WYPIN. During reset, the buffer flows straight into the shifter.
"

That does mention a separate buffer and shifter, so there should be a queue of about 1 char time, and update jitter within that window should still give gapless transmit.
Hence this statement: "This mode allows steady data transmission with a continuous clock, as long as the WYPIN’s after each IN-rise occur before the current word transmission is complete."
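
As a rough sketch of what that means for a polled or timer-driven sender (untested; tx1_pin, tx1_char, tx_address and the tx1_* buffer registers follow the naming used elsewhere in this thread but are assumptions here), the IN flag can be checked with TESTP and a new word written only when the buffer has room:

tx1_poll	testp	tx1_pin				wc	'C = IN, raised when the buffer can take a new word
	if_nc	ret						'buffer still occupied - try again on the next poll
		cmp	tx1_head,	tx1_tail	wz	'anything queued?
	if_z	ret						'no - leave IN pending until there is
		mov	tx_address,	tx1_tail
		add	tx_address,	tx1_lut_buff		'one char per long, as in the current driver
		rdlut	tx1_char,	tx_address		'fetch next char from the circular buffer
		incmod	tx1_tail,	tx1_lut_btop		'advance tail with wrap
	_ret_	wypin	tx1_char,	tx1_pin			'buffer the next word, which also acknowledges IN

Whether IN is already high before the very first WYPIN should be checked against the docs or on hardware; if it is not, the first word has to be written unconditionally.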

msrobots · 2019-02-04 05:22

jmg wrote: »

msrobots wrote: »

I am slowly coming to think that the serial driver is OK but the echo server is too slow. But with 4 times the buffer size in the driver it should go further, and that would prove that the issue is in the echo server.

Chip has said the Smart Pins can manage gapless send and receive, (at least up to some high baud speeds).
It may be that echo needs asm coding, to copy incoming Rx byte to echo-Tx ?

Since fastspin produces PASM I did not think so, but I can use the serial driver directly from PASM, so that is one of the next options.

evanh wrote: »

Gapless UART transmission requires monitoring the smartpin IN status, which is intended for event/IRQ generation. Using RDPIN can only tell when transmission has ceased.

Yes, I do use events/interrupts for reading the serial RX pins, int1 for RX1 and int2 for RX2, and that seems to work flawlessly and gapless (as long as I can keep up reading my buffer).

but a big setback is this

error: Third operand to setbyte must be an immediate

same with getbyte. That is bad.

because now I need 4 cmp and 4 getbyte/setbyte instructions

so

rx1_isr		rdpin	rx1_char,	rx1_pin			'get received chr
		shr	rx1_char,	#32-8			'shift to lsb justify
		mov	rx1_byte_index, rx1_head
		and	rx1_byte_index, #%11
		mov	rx1_address,	rx1_head		'adjust to buffer start
		shr	rx1_address,	#2
		add	rx1_address,	rx1_lut_buff 		'by adding rx1_lut_buff
		rdlut	rx1_lut_value,	rx1_address
		setbyte rx1_lut_value,	rx1_char, rx1_byte_index
		wrlut	rx1_lut_value,	rx1_address		'write byte to circular buffer in lut
		incmod	rx1_head, 	rx1_lut_btop		'increment buffer head
		cmp	rx1_head, 	rx1_tail 	wz	'hitting tail is bad
	if_z	incmod	rx1_tail, 	rx1_lut_btop		'increment tail  - I am losing received chars at the end of the buffer because the buffer is full
		reti1						'exit

does not compile. Will need to do

rx1_isr		rdpin	rx1_char,	rx1_pin			'get received chr
		shr	rx1_char,	#32-8			'shift to lsb justify
		mov	rx1_byte_index, rx1_head
		and	rx1_byte_index, #%11
		mov	rx1_address,	rx1_head		'adjust to buffer start
		shr	rx1_address,	#2
		add	rx1_address,	rx1_lut_buff 		'by adding rx1_lut_buff
		rdlut	rx1_lut_value,	rx1_address
		cmp	rx1_byte_index,	#0		wz
	if_z	setbyte rx1_lut_value,	rx1_char, #0
		cmp	rx1_byte_index,	#1		wz
	if_z	setbyte rx1_lut_value,	rx1_char, #1
		cmp	rx1_byte_index,	#2		wz
	if_z	setbyte rx1_lut_value,	rx1_char, #2
		cmp	rx1_byte_index,	#3		wz
	if_z	setbyte rx1_lut_value,	rx1_char, #3
		wrlut	rx1_lut_value,	rx1_address		'write byte to circular buffer in lut
		incmod	rx1_head, 	rx1_lut_btop		'increment buffer head
		cmp	rx1_head, 	rx1_tail 	wz	'hitting tail is bad
	if_z	incmod	rx1_tail, 	rx1_lut_btop		'increment tail  - I am losing received chars at the end of the buffer because the buffer is full
		reti1						'exit

instead?

well I am getting there, just running out of longs...

maybe I can use altd/s/I to shorten that up

Mike

evanh · 2019-02-04 05:49

ALTGB/ALTSB solves all.

msrobots · 2019-02-04 19:59

evanh wrote: »

ALTGB/ALTSB solves all.

I am not following you, wtf is ALTGB/ALTSB?

Mike

Electrodude · 2019-02-04 20:14

See the Parallax Propeller 2 Instructions v32 spreadsheet starting at row 103.

The ALTGB and ALTSB instructions allow you to override the fixed third argument of GETBYTE and SETBYTE instructions. There are other similar instructions to override fixed fields of other instructions too.

EDIT: Those instructions override both the D and N fields, allowing you to access all of cogram as a word, byte, or nibble array with only two instructions per access.

msrobots · 2019-02-05 02:50

ohh, good, I missed that link and the google doc I know of does not describe the instructions, so I am flying blind, mostly.

Thank you @Electrodude,

Mike


msrobots · 2019-02-06 13:45

OK,
I read the docs but I am doing something wrong

rx2_isr		rdpin	rx2_char,	rx2_pin			'get received chr
		shr	rx2_char,	#32-8			'shift to lsb justify
		mov	rx2_byte_index, rx2_head
		and	rx2_byte_index, #%11
		mov	rx2_address,	rx2_head		'adjust to buffer start
		shr	rx2_address,	#2
		add	rx2_address,	rx2_lut_buff 		'by adding rx2_lut_buff
		rdlut	rx2_lut_value,	rx2_address

'		neg	rx2_byte_index
'		add	rx2_byte_index,	#4
'		add	rx2_byte_index,	#rx2_lut_value<<2
'		altsb	rx2_byte_index
'		setbyte 0-0,		rx2_char, #0-0

		cmp	rx2_byte_index,	#0		wz
	if_z	setbyte rx2_lut_value,	rx2_char, #3
		cmp	rx2_byte_index,	#1		wz
	if_z	setbyte rx2_lut_value,	rx2_char, #2
		cmp	rx2_byte_index,	#2		wz
	if_z	setbyte rx2_lut_value,	rx2_char, #1
		cmp	rx2_byte_index,	#3		wz
	if_z	setbyte rx2_lut_value,	rx2_char, #0
'
		wrlut	rx2_lut_value,	rx2_address		'write byte to circular buffer in lut
		incmod	rx2_head, 	rx2_lut_btop		'increment buffer head
		cmp	rx2_head, 	rx2_tail 	wz	'hitting tail is bad
	if_z	incmod	rx2_tail, 	rx2_lut_btop		'increment tail  - I am losing received chars at the end of the buffer because the buffer is full
		reti2						'exit

I want to replace the 8 lines following the commented-out altsb block to save 3 longs, but it does not work. What am I doing wrong with altsb and setbyte?

unsure,

Mike
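
One possible reading of why the commented-out block fails, offered as an untested guess: NEG followed by ADD #4 turns an index of 0..3 into 4..1 rather than 3..0, so an index of 0 carries into the register address field and ALTSB selects the next register. The #rx2_lut_value<<2 term also only fits a 9-bit immediate while the register sits below address $80. A sketch that keeps the reversed byte order of the CMP/SETBYTE lines and uses only the registers already in the ISR:

' hypothetical replacement for the eight CMP/SETBYTE lines
		xor	rx2_byte_index,	#%11			'3 - index, so 0..3 becomes 3..0
		altsb	rx2_byte_index,	#rx2_lut_value		'next D field = rx2_lut_value, N field = index[1:0]
		setbyte	rx2_char				'write rx2_char[7:0] into the selected byte

With plain little-endian byte order the XOR can be dropped as well, as long as the code reading the buffer uses the same order.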

Cluso99 · 2019-02-06 14:33

Didn’t look properly, but you require an ALTSB before each SETBYTE instruction. The ALT and AUG only apply to the following instruction.

jmg · 2019-02-06 18:59

msrobots wrote: »

I want to replace the 8 lines following the commented-out altsb block to save 3 longs, but it does not work. What am I doing wrong with altsb and setbyte?

There is code in the ROM_Booter source that shuffles bytes into long, for the checksum, so you could check that ?

You could also look at
RCZR D {WC/WZ/WCZ} Rotate C,Z right through D. D = {C, Z, D[31:2]}. C = D[1], Z = D[0].
Not sure if there is any non-destructive version of that ?
Which gets 2 bits into CZ, you can test for 4 packed statements.

Or, maybe this pair can be even faster ?
DECOD D,{#}S Decode S[4:0] into D. D = 1 << S[4:0].
and
SKIPF {#}D Skip cog/LUT instructions fast per D. Like SKIP, but instead of cancelling instructions, the PC leaps over them.

Cluso99 · 2019-02-06 19:54

Looking closer, I am unsure what is not working.

Are those 8 lines working, and you need to find a 5-instruction replacement for them?

evanh · 2019-02-06 22:33

Mike,
I'm not sure why the RDLUT code is there but here's all I think you need in there:

rx2_isr
		rdpin	rx2_char,	rx2_pin			'get received chr
		shr	rx2_char,	#32-8			'shift to lsb justify
		altsb	rx2_head, #rx2_buffer
		setbyte	rx2_char
		incmod	rx2_head, 	rx2_lut_btop		'increment buffer head
		cmp	rx2_head, 	rx2_tail 	wz	'hitting tail is bad
	if_z	incmod	rx2_tail, 	rx2_lut_btop		'increment tail  - I am losing received chars at the end of the buffer because the buffer is full
		reti2						'exit
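
A matching read-side sketch, untested and under the same assumption that the buffer lives in cogRAM at rx2_buffer. The names rx2cmd and rx2param are hypothetical mirrors of the rx1 block earlier in the thread, and rx2_lut_btop is assumed to hold the buffer size in bytes minus 1:

.rx2block	cmp	rx2cmd,		#0 		wz	'need more bytes?
	if_z 	jmp	#.done					'no - done
		cmp	rx2_head, 	rx2_tail	wz	'byte received?
	if_z	ret						'no - try again don't block the rest
		altgb	rx2_tail,	#rx2_buffer		'select long rx2_buffer + (tail >> 2), byte tail & 3
		getbyte	rx2_char				'fetch that byte into rx2_char
		incmod	rx2_tail, 	rx2_lut_btop		'increment buffer tail
		wrbyte  rx2_char, 	rx2param		'write byte to Block
		add	rx2param, 	#1			'adjust Block address
	_ret_	sub	rx2cmd,		#1			'adjust count - try again don't block the rest

Head and tail then count bytes directly, so no separate shift or mask instructions are needed on either side.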

ozpropdev · 2019-02-06 22:40

Be aware that ALTxx is broken on P2-ES.
IIRC sign extension (negative deltas)?

rogloh · 2019-02-06 23:15

Hi ozpropdev. Do we know which particular ALTxx instructions are broken? I think we might have been using some of them for HDMI bitbang, though there are a few variants.

evanh · 2019-02-06 23:23

Only going to affect negative indexing from the base. That's not a very common action.
