Here is the interrupt-related code from the new ROM booter. I got everything to work in one cog using interrupts. At first, I was using one cog for the booter and another cog just for auto baud detection, with LUT sharing as conduit.
I figured this could be done in one cog, but I needed to make MORE edge events, and add STATE events, so that you don't get chicken-and-egg problems with smart pin event detection and AKPIN response.
Here is the related part of the booter code. Note that interrupt 1 responds to state changes on the RX pin (via smart pin 0), looking for a space ($20), while interrupt 2 handles RX data reception (via smart pin 63). Interrupt 1 actually forces interrupt 2, in case interrupt 2 didn't fire in time, when a space ($20) is detected via interrupt 1. This was a problem at higher baud rates. Now it's rock solid:
CON
        rx_pin   = 63           'pins
        tx_pin   = 62
        spi_cs   = 61
        spi_ck   = 60
        spi_di   = 59
        spi_do   = 58
        rx_msr   = 0

        lut_buff = $000         'serial receive buffer
        lut_btop = $07F         'serial receive buffer top

        chr_to   = 0            'mode bits
        did_spi  = 1
        key_on   = 2
DAT             org
'
'
' Enable autobaud and serial receive interrupts
'
                wrpin   msta,#rx_msr            'measure states on rx_pin via rx_msr
                setse1  #%110<<6+rx_msr         'event on rx_msr high
                dirh    #rx_msr                 'enable measurement
                mov     ijmp1,#autobaud         'set interrupt vector
                setint1 #4                      'enable interrupt

                wrpin   #%00_11111_0,#rx_pin    'set rx pin for asynchronous receive
                setse2  #%110<<6+rx_pin         'set se2 to trigger on rx_pin high
                mov     ijmp2,#receive          'set int2 jump vector
                setint2 #5                      'set receiver ISR to trigger on se2 (rx_pin high)
(main program here)
'
'
' Get rx byte
'
get_rx          pollct1                 wz      'if timeout, error
        if_nz   jmp     #command_err

                cmp     head,tail       wz      'loop until byte is received
        if_z    jmp     #get_rx

                testb   mode,#chr_to    wz      'clear timeout?
        if_nz   call    #clear_timeout

                rdlut   x,tail                  'get byte from lut
                incmod  tail,#lut_btop          'update tail
                ret
'
'
' Clear timeout
'
clear_timeout   getct   x
                addct1  x,timeout_per
                ret
'
'
' Send string
'
tx_string       waitx   ##30_000_000/100        'wait 10ms
                wrpin   #%01_11110_0,#tx_pin    'configure tx pin
                wxpin   baud,#tx_pin            'set baud
                dirh    #tx_pin                 'enable tx pin
                mov     x,#3                    'initialize byte counter

tx_loop         incmod  x,#3            wc      'if initial or 4th byte,
tx_ptr  if_c    mov     y,0                     '..get 4 bytes (start address set by caller)
        if_c    add     tx_ptr,#1               '..point to next 4 bytes

                test    y,#$FF          wz      'if not end of string,
        if_nz   wypin   y,#tx_pin               '..send byte
.wait   if_nz   testin  #tx_pin         wc      '..wait for buffer empty
if_nc_and_nz    jmp     #.wait
        if_nz   akpin   #tx_pin                 '..acknowledge pin
        if_nz   shr     y,#8                    '..ready next byte
        if_nz   jmp     #tx_loop                '..loop for next byte

.busy           rdpin   x,#tx_pin       wc      'end of string,
        if_c    jmp     #.busy                  '..wait for tx to finish
                dirl    #tx_pin                 '..disable tx pin
                wrpin   #0,#tx_pin              '..unconfigure tx pin
                ret
'
'
' Autobaud ISR
'
autobaud        akpin   #rx_msr                 'acknowledge rx state change
                rdpin   buf2,#rx_msr    wc      'get sample, measure ($20 -> 10000001001 -> ..1, 6x 0, 1x 1, 2x 0, 1..)
                clrb    buf2,#31                'clear msb in case 1 sample
        if_c    jmp     #.scroll                'if 1 sample, just scroll

                mov     limh,buf0               '0 sample,
                shr     limh,#4                 '..make window from 1st 0 (6x if $20)
                neg     liml,limh
                add     limh,buf0
                add     liml,buf0

                mov     comp,buf1               '0 sample,
                mul     comp,#6                 '..normalize last 1 (1x if $20) to 6x
                cmpr    comp,limh       wc      '..check if last 1 within window
        if_nc   cmp     comp,liml       wc

        if_nc   mov     comp,buf2               '0 sample and last 1 within window,
        if_nc   mul     comp,#3                 '..normalize last 0 (2x if $20) to 6x
        if_nc   cmpr    comp,limh       wc      '..check if last 0 within window
        if_nc   cmp     comp,liml       wc

        if_c    jmp     #.scroll                'if not $20, just scroll

                add     buf0,buf2               '$20 (space),
                shl     buf0,#16-3              '..compute bit period from 6x 0 and 2x 0
                or      buf0,#7                 '..set 8 bits
                wxpin   buf0,#rx_pin            '..set rx pin baud
                dirl    #rx_pin                 '..reset rx pin
                dirh    #rx_pin                 '..(re)enable rx pin to (re)register frame
                mov     baud,buf0               '..save baud for transmit
                mov     rxbyte,#$120            '..signal receiver ISR to ignore pin, enter space
                trgint2                         '..trigger serial receiver ISR in case it wasn't, already (<50k baud)

.scroll         mov     buf0,buf1               'scroll sample buffer
                mov     buf1,buf2

                reti1                           'if $20 (space), serial receiver ISR executes next
'
'
' Serial receiver ISR
'
receive         clrb    rxbyte,#8       wc      'triggered by autobaud? if so, rxbyte = $20 (space)
        if_nc   akpin   #rx_pin                 'triggered by receive, acknowledge rx byte
        if_nc   rdpin   rxbyte,#rx_pin          'triggered by receive, get rx byte
                wrlut   rxbyte,head             'write byte to circular buffer in lut
                incmod  head,#lut_btop          'increment buffer head
                reti2
'
'
' Constants / initialized variables
'
timeout_per     long    30_000_000/10           'initial 100ms timeout
msta            long    %0111<<28+%00_10000_0   'read states on lower pin (pin 63 in case of pin 0)
mode            long    0                       'serial mode
head            long    0                       'serial buffer head
tail            long    0                       'serial buffer tail
'
'
' Uninitialized variables
'
i               res     1                       'universal
x               res     1
y               res     1
z               res     1

rxbyte          res     1                       'ISR serial receive

buf0            res     1                       'ISR autobaud
buf1            res     1
buf2            res     1
limh            res     1
liml            res     1
comp            res     1
baud            res     1
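For anyone following the autobaud ISR above, here is a small Python model of its window arithmetic (a sketch of the math only, not the flag-level PASM; the function name and plain-integer interface are mine):

```python
def detect_space(buf0, buf1, buf2):
    """Model of the autobaud window check in the ISR above.

    buf0, buf1, buf2 are the last three state durations (in clocks) from
    the smart pin in state-measurement mode. For a $20 frame they are
    6, 1 and 2 bit periods (low, high, low). Returns the bit period if
    the pattern matches a space, else None.
    """
    limh = buf0 + (buf0 >> 4)               # window = buf0 +/- buf0/16
    liml = buf0 - (buf0 >> 4)
    if not (liml <= buf1 * 6 <= limh):      # normalize the 1x high run to 6x
        return None
    if not (liml <= buf2 * 3 <= limh):      # normalize the 2x low run to 6x
        return None
    return (buf0 + buf2) >> 3               # 6x + 2x = 8 bit periods, /8

# A $20 at a bit period of 1000 clocks: runs of 6000, 1000, 2000 clocks.
assert detect_space(6000, 1000, 2000) == 1000
# A mismatched run pattern is rejected.
assert detect_space(3000, 1000, 2000) is None
```

The ISR then packs this for WXPIN in one step: `shl buf0,#16-3` turns the sum of 8 bit periods into (period << 16), and `or buf0,#7` sets the bit count to 8 (bits minus 1).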
The early "maybe interrupts make sense" discussion boiled down to what we experienced using the "hot" edition. Dedicating one cog to just polling was inefficient, just like tying up all but one cog on hub service was.
It was either a tasker or interrupts.
The hot chip's tasker solidified the idea of the COG as the atomic unit, not the task, or in our case, the ISR.
As long as hub access and events do not impact other cogs, we've still got what we like in P1. People can grab objects and run with them pretty easily.
As I recall, that discussion ended quietly, Chip saying, "let us not talk about this." It was the right call.
I may be slow and out of date and I may have missed a point. But...
On the P2 can my code running on its cog(s) modulate the execution rate of your code running on its cog(s) as we both hammer on HUB access through the "egg beater" ?
On the P2, cogs have dedicated hub access slots, but the slot time also depends on the 4 LSB of the long address. Hub accesses from a cog will not interfere with the timing of other cogs.
So, if I want maximum random read/write access speed to HUB I would arrange for all my accesses to have the same 4 LSB of the LONG address. Those 4 bits being dependent on my COG ID.
Right?
Not really. The bottom 4 bits (EDIT: actually bits 5..2) that a cog has access to go up once per clock, so that a cog can read the next long of hub RAM each clock - this is the whole beauty of the egg beater: every cog at once can read sequential longs, one per clock. There's a FIFO to smooth out the accesses, since a cog can't use a long every clock (except for the streamer, which can). The FIFO is big enough so that, once it synchronizes, it can never underflow when reading or overflow when writing. The FIFO can be used for manual access, hubexec, or the streamer, but only for one of these things at a time.
However, if you want random rather than sequential access, or if you're already using the FIFO for something else (e.g. hubexec), you might be better off aligning your data based on cog ID. But you don't need to actually check your cog ID to do this - the first access's timing may be off, but it will stall the cog so that the rest are timed perfectly (supposing you wrote your code properly). Put a dummy access before any time-critical accesses if you can't afford for the first one to be off.
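That self-aligning behavior can be sketched with a toy Python model of the rotating slot (my own sketch; the exact (t - cog) % 16 convention is an assumption, but the conclusion holds for any fixed rotation):

```python
def wait_for_slice(cog, slice_, t):
    """Clocks a cog waits from clock t until its rotating hub window
    lands on the given RAM slice. Egg-beater model: at clock t,
    cog c can access slice (t - c) % 16."""
    return (slice_ + cog - t) % 16

# A first access to slice 5 from cog 3 may wait anywhere from 0..15 clocks:
first_wait = wait_for_slice(3, 5, 100)
assert 0 <= first_wait <= 15

# But once one access (real or dummy) has synchronized the loop, every
# later access to the same slice waits zero clocks, provided the loop
# period is a multiple of 16 - no need to know the cog id at all:
t = 100 + first_wait                    # first access granted here
for _ in range(8):
    t += 16                             # loop body takes exactly 16 clocks
    assert wait_for_slice(3, 5, t) == 0
```

The cog ID only shifts where the first stall lands; after that, the loop stays locked to the slot rotation by construction.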
So, if I want maximum random read/write access speed to HUB I would arrange for all my accesses to have the same 4 LSB of the LONG address. Those 4 bits being dependent on my COG ID.
That sure seems like a high price to pay for maximum throughput. It divides up the hub memory in a weird (though regular) way, potentially wasting 15/16ths of it (unless you also used more code at other times or in other cogs to use those skipped over portions). I know programmers go to extremes at times for maximum throughput, but the boost from such an access scheme would seem to come at the expense of programming sanity, for lack of a better term. Of course, the P2 shines best when it's doing sequential access. But when it comes to random access, I'd guess that it's generally best to just live with the lower throughput rate and not divvy up memory in such a tricky way. But yeah, for the "maximum random r/w access" that you mentioned, I believe that such random access usage would be the fastest, but others can comment more confidently. Update: Okay, another just did comment with regards to the synching up part and not needing to worry about the exact cog number.
On the P2 can my code running on its cog(s) modulate the execution rate of your code running on its cog(s) as we both hammer on HUB access through the "egg beater" ?
Such influence among cogs is *electrically impossible* due to the chip's design, wherein each cog can only access one particular slice of memory at any one time. So, as Dave Hein said, no interference can occur. Now if two or more cogs were exchanging messages or otherwise using results calculated in another cog, then, of course, they could affect each other through the expected ways, but that's not what you were considering. For what you mentioned, it's refreshing to know that a cog will be totally done with reading or writing a particular long/word/byte in a slice of memory before the "trailing" cog gets access to that slice (and the same long/word/byte).
So, if I want maximum random read/write access speed to HUB I would arrange for all my accesses to have the same 4 LSB of the LONG address. Those 4 bits being dependent on my COG ID.
Right?
I think this is a yes and no / it depends type case.
There are burst HUB operations, but if you really want 'random' that implies no control at all over the address.
If you can accept some LSB control (no longer quite random), then yes, careful sync of the LSBs to the slot index can avoid a wait for the next go-around.
I don't think the COG ID is relevant after the first access, so if you carefully interleave opcodes (N cycles) and hub accesses (Adr+N), you could craft higher bandwidths.
Given the high bandwidth already there, and the burst ops, actual need for this case would be quite rare, but it can be constructed.
I still can't work out if your code can modulate the speed of my code though....
Not via Hub-Slots.
The HUB already is allocating 1/16 time to every COG, so it is hard-coded jitter free. (from a COG-COG interaction viewpoint, if they want, every other COG might use the slot available to it)
Only if it somehow could allocate N/16, could there be a jitter effect.
Addit:
The HUB-Slot rotate effect does mean there is a preferred INC or DEC direction.
(I forget which way Chip has the interaction working.)
Even in a HLL, you might get slightly faster data flows in small buffers, with a sparse-array design.
The hub RAM is divided into 16 banks of memory. Bits 2 through 4 of the hub RAM address are used to select the bank. The hub slots for bank 0 are allocated as 0, 1, 2, ... ,14, 15. The hub slots for bank 1 are allocated as 1, 2, 3, ... , 15, 0. The hub slots for the rest of the banks are shifted in the same manner. This allows for a max transfer rate of 1 long/cycle. So if all of the cogs were using their FIFOs at the same time you could get 16*4*160MHz = 10.24 Giga-bytes/second transferred to/from the hub RAM.
Reading sequential longs in a tight loop is a bit different. Since instructions take 2 cycles you would not be able to read longs at full speed; instead the speed would be 1 long/17 cycles. Also, it will be difficult to design a loop that reads hub RAM with deterministic timing like the P1 unless the data addresses are deterministic as well. Hopefully, the higher speed of the P2 will help to compensate for the lack of determinism.
EDIT: I meant bits 2 through 5 instead of bits 2 through 4. Four bits from the hub address are used to select the RAM bank.
As someone pointed out, if two cogs were to communicate through addresses whose bits 5..2 were static, timing would become deterministic, since the hub slice (physical RAM instance) would remain constant, coming around on every 16th clock to each cog.
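A toy model of this allocation (my own Python sketch, using the bits 5..2 slice select from the EDIT above) reproduces the 1 long/17 clocks figure for a plain sequential-read loop:

```python
def slice_of(addr):
    """Hub RAM slice holding a byte address: bits 5..2."""
    return (addr >> 2) & 15

def grant(cog, addr, not_before):
    """Earliest clock >= not_before at which the cog's rotating window
    reaches the slice holding addr (cog c sees slice (t - c) % 16)."""
    t = not_before
    while (t - cog) % 16 != slice_of(addr):
        t += 1
    return t

# A plain rdlong loop over sequential longs: each request can be issued
# at the earliest 2 clocks after the previous grant (instructions take
# at least 2 clocks), so the window has just rotated past the next slice
# and every long lands 17 clocks after the last.
cog  = 0
addr = 0x400
t    = grant(cog, addr, 0)
deltas = []
for _ in range(8):
    addr += 4
    t2 = grant(cog, addr, t + 2)
    deltas.append(t2 - t)
    t = t2
assert deltas == [17] * 8
```

The same model shows the streamer/FIFO peak: with a grant available every clock, 16 cogs moving 4 bytes per clock at 160 MHz gives the 10.24 GB/s aggregate mentioned above.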
It all kind of, sort of makes sense. Sometimes. I have to study the egg beater "magic roundabout" diagram some more.
I still can't work out if your code can modulate the speed of my code though....
That does not happen.
There are 16 cogs. There are 16 banks of hub RAM, addressed by the lower nibble of the long address.
Each cog gets exclusive access to one bank. Every clock, that bank increments, modulo style, which ensures a given COG will get access to a given bank within a given time.
All COGS get HUB access all the time, and it's uniform.
No, heater. One cog cannot interfere with any other cog's hub access!
There are actually 16 possible cog accesses to hub in every clock. Each cog's access is skewed by one long for the same clock.
This permits a cog to transfer a long on every clock pulse.
But when using normal instructions to access sequential longs, each successive long will be 17 clocks apart!!! If you are reading successive bytes, beginning on a long boundary, you will get byte 0, byte 1 will be 16 clocks later, byte 2 another 16 clocks, byte 3 another 16 clocks, and then byte 4 will actually be 16+1=17 clocks, followed by the next bytes 5, 6 & 7 each 16 clocks, then byte 8 at 17 clocks with the next 3 bytes at 16 clocks, and so on.
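The 16,16,16,17 byte pattern falls straight out of a toy model of the rotating slot (my own Python sketch; the minimum 2-clock re-issue gap is an assumption, and any gap from 2 to 16 clocks gives the same pattern):

```python
def slice_of(addr):
    return (addr >> 2) & 15             # hub slice = address bits 5..2

def grant(cog, addr, not_before):
    # Egg-beater model: at clock t, cog c can access slice (t - c) % 16.
    t = not_before
    while (t - cog) % 16 != slice_of(addr):
        t += 1
    return t

# Read 16 successive bytes from a long boundary. Bytes within one long
# live in the same slice, so they come 16 clocks apart; stepping into
# the next long means the window rotated past that slice one clock ago,
# so it costs 17 - exactly the pattern described above.
cog = 2
t = grant(cog, 0, 0)
deltas = []
for a in range(1, 17):
    t2 = grant(cog, a, t + 2)           # next rdbyte issues >= 2 clocks later
    deltas.append(t2 - t)
    t = t2
assert deltas == [16, 16, 16, 17] * 4
```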
No, heater. One cog cannot interfere with any other cog's hub access!
There are actually 16 possible cog accesses to hub in every clock. Each cog's access is skewed by one long for the same clock.
This permits a cog to transfer a long on every clock pulse.
But when using normal instructions to access sequential longs, each successive long will be 17 clocks apart!!! If you are reading successive bytes, beginning on a long boundary, you will get byte 0, byte 1 will be 16 clocks later, byte 2 another 16 clocks, byte 3 another 16 clocks, and then byte 4 will actually be 16+1=17 clocks, followed by the next bytes 5, 6 & 7 each 16 clocks, then byte 8 at 17 clocks with the next 3 bytes at 16 clocks, and so on.
So when you want it (the reading of BYTEs) fast, you use the FIFO,
or read longs at least and do the shift/mask manually, which is still faster ...
Code snippets for doing this (FIFO / streamer ...) could go into a document to help beginners.
But when using normal instructions to access sequential longs, each successive long will be 17 clocks apart!!!
I think that is 17 or 15, depending on the INC/DEC relative to Slot-Spin.
Given INC is the more common code style, should the Slot-Spin be tuned to give the better access number for INC ?
( I think that means Slot decrements) -
Has that been done on P2 ?
It would be every 17 clocks. The slot visible to a cog increments every cycle, so that the FIFO can do forward sequential access; if it decremented instead, the FIFO wouldn't be able to provide the one long per clock sequential access that it does provide.
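The increment-vs-decrement point can be checked with a toy model (mine; the %16 conventions are assumptions, the direction of rotation is what matters):

```python
# Why the slot must INCREMENT for forward FIFO streaming: model the slice
# a cog can access at clock t under both rotation directions.
def stream_clocks(n_longs, visible):
    """Clocks to read n sequential longs (each in the next slice up),
    given visible(t) -> slice accessible at clock t."""
    t, got = 0, 0
    while got < n_longs:
        if visible(t) == got % 16:      # long #g lives in slice g % 16
            got += 1
        t += 1
    return t

cog = 0
inc = lambda t: (t - cog) % 16          # actual P2: visible slot increments
dec = lambda t: (cog - t) % 16          # hypothetical decrementing slot

# Incrementing: the window chases the data forward - one long per clock.
assert stream_clocks(32, inc) == 32
# Decrementing: each forward-sequential long waits ~15 clocks.
assert stream_clocks(32, dec) == 466
```

With a decrementing slot the situation simply reverses: backward-sequential reads would stream at one long per clock instead, which is Chip's point below.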
It would be every 17 clocks. The slot visible to a cog increments every cycle, so that the FIFO can do forward sequential access; if it decremented instead, the FIFO wouldn't be able to provide the one long per clock sequential access that it does provide.
Right, FIFO access would have to run backward through hub RAM, instead of forward.
It would be every 17 clocks. The slot visible to a cog increments every cycle, so that the FIFO can do forward sequential access; if it decremented instead, the FIFO wouldn't be able to provide the one long per clock sequential access that it does provide.
Yes, I forgot about the need to also support the FIFO.
Now where is my DE0-Nano ...