Is LUT sharing between adjacent cogs very important?

jmg · 2016-05-12 22:28

Tubular wrote: »

...
Once you arrive at needing multiple clock cycles to kick off new cogs, things start to free up, and indeed it might be possible to have a bit mask of arbitrary cog start requests, which rotates against a bit mask of running cogs, rotating one per cycle, until a match is made.

There is certainly that choice too.
Now that there is inter-COG signaling, you can start with more freedom, and then sync-up later, when they are all ready.
A few clocks delay on load-up, is not going to bother 99.9% of uses.

If any multiple-start request fails, then the master 'fetches a bigger hammer' ....

jmg · 2016-05-12 22:42

cgracey wrote: »

If objects aren't written to accommodate worst-case hub timing, they will not work.

This brings up the question again, of is it possible to choose fixed hub timing (eg always 16c) ?

ie, without fixed timing, how does someone know/prove they have written and tested for worst-case hub timing ?

cgracey wrote: »

I would not write cog code that relied on cogs being any particular number apart. It's just asking for trouble.

Certainly best avoided, but you can also be sure someone will do it...

potatohead · 2016-05-12 22:45

What is worst case HUB timing?

cgracey · 2016-05-12 23:05

potatohead wrote: »

What is worst case HUB timing?

Let's see...

If you want to read long $xxx0 and you just missed it:

123456789ABCDEF0ddddd

Those 'd' delays are for getting the return value back out of the memory system. The first two d's are to get to the specific memory and the last three are for the result to come all the way out. Writes don't have to wait for the d's.

So, reads are 6..21 clocks, while writes are 2..16 clocks.

Tubular · 2016-05-12 23:12

Well I guess a few of us will be 'asking for trouble', then : )

Crafted cog-cog spacing is going to be an artifact of the egg-beater design, exactly the same way crafted instruction spacing will be. I don't see any of this as a problem, if anything its a fun challenge.

Yes many low/mid performance objects can decouple their cog dependency for compatibility/simplicity, but for higher performance you're going to want things exact. There have been similar things found with P1 where cog placement for audio matters, or cog to cog distance matters (can't remember the details but Kuroneko highlighted it)

potatohead · 2016-05-12 23:12

So we have basically 10Mhz inter cog comms, assuming 200mhz production clock speed.

That's kind of slow.*

I've stayed outta this discussion for being busy, but I caught a break today. Time for a little P2 thinking...

Is is plausible, practical to stream the same data to more than one COG, so that each can do its part? Use them in parallel fashion?

*bursts are much faster, but this seems to be the basic, random data comms rate. Am I wrong about that somehow?

ozpropdev · 2016-05-12 23:15

Surely if an object is going to require 2 adjacent cogs then a warning can be included with its docs pointing out it needs to start first before other objects.
Curiosity got the better of me so I threw this together, and it works OK

{
Find two free adjacent cogs

Free cogs 1,3,4,5 & 7
Result = Starts target code in cogs 3 & 4
Target = Parallax P123-A9 FPGA board
}
dat		org

'make assorted cogs already busy

		mov	bx,##%1111_1111_0100_0010	'cog mask
		rep	@.fill,#16
		rcr	bx,#1 wc
	if_c	coginit	ax,##@freeze
		add	ax,#1
.fill

'collect and start all free cogs

loop		mov	ax,#16
		coginit	ax,##@freeze wc		'try to start a free cog
	if_c	jmp	#part2			'success? 
		rolnib	cogs_h,cogs_l,#7	'save cogid
		rolnib	cogs_l,ax,#0
		add	free,#1			'free cog count
		jmp	#loop

'find two adjacent cogs in free list

part2		cmp	free,#2 wz,wc		'enough free cogs?
	if_b	jmp	#no_free_cogs
		mov	cx,free			'left justify cog list
		subr	cx,#16
		rep	@lj_loop,cx
		rolnib	cogs_h,cogs_l,#7
		shl	cogs_l,#4
lj_loop
		sub	free,#1
check		getnib	ax,cogs_h,#6
		getnib	bx,cogs_h,#7
		sub	ax,bx
		cmp	ax,#1 wz,wc		'adjacent cogs?
	if_e	jmp	#found_pair
		cogstop	bx			'release unused cog
		rolnib	cogs_h,cogs_l,#7
		shl	cogs_l,#4
		djnz	free,#check
		jmp	#no_free_cogs

'start cog pair with target code and release remaining cogs in list

found_pair	getnib	ax,cogs_h,#6		'start cog pair
		coginit	ax,##@target_code1
		getnib	ax,cogs_h,#7
		coginit	ax,##@target_code2
		sub	free,#1 wz		'release remining cogs in list
	if_z	jmp	#done
		rolnib	cogs_h,cogs_l,#7
		shl	cogs_l,#4
release		rolnib	cogs_h,cogs_l,#7
		shl	cogs_l,#4
		getnib	ax,cogs_h,#7
		cogstop	ax
		djnz	free,#release

done		jmp	#done	
no_free_cogs	jmp	#no_free_cogs

	
cogs_h		long	0
cogs_l		long	0
free		long	0
ax		long	0
bx		long	0
cx		long	0

'******************************************************
		org
freeze		jmp	#freeze

'******************************************************
		org
target_code1
target_code2	cogid	adra
		decod	adra
		or	dirb,adra
d_loop		xor	outb,adra
		waitx	##20_000_000
		jmp	#d_loop

Not a lot of P2 code to get two adjacent cogs.

When we write code we already have to be aware of our IO/smartpin allocation/usage, we SHOULD also be aware of our cog allocation/usage.
Just my 2cents..

jmg · 2016-05-12 23:16

So, reads are 6..21 clocks, while writes are 2..16 clocks.

Tubular wrote: »

Cog-cog spacing is an artifact of the egg-beater exactly the same way crafted instruction spacing will be. I don't see any of this as a problem, if anything its a fun challenge. Yes many low/mid performance objects can decouple their cog dependency, but for higher performance you're going to want things exact

Yes, and I see two classes/meanings of 'exact'.

Some may just want things jitter free and predictable, for their 'exact'.

Other designs may need precise, and shortest possible delays, for their 'exact'.

Ideally, P2 can support both ?

potatohead · 2016-05-12 23:17

Maybe code like this would be a great addition to our boot ROM, should we proceed with this feature.

Having one, tested, robust routine to use would help a lot to marginalize potential trouble.

Edit: to expand on my 10Mhz comment above, that rate seems to be insufficient for a lot of modern bit bang type tasks. This is a concern to me as one big goal we all should have is the general applicability of the P2.

We have specialized hardware and events. Those are good. And the hardware covers a lot of "standard" comms and a ton of test and measure too.

Now we are missing out on respond capability, in the general sense. This will place a strong emphasis on single cogs, and optimizing that code may not yield a good result. If we can't fall back to good parallelism, we lose out on a lot of the benefit of the COGS. We may fill in with strong use of the "events" but that code and it's potential glitches seems much worse to manage and write than obtaining two cogs does.

Am I wrong about that? I could be

Where we know what is going to need to happen, we have those features, and they are fast too. We have throughput, basically.

Honestly, this gap is worth resolving. It's a big hole in our feature / price / performance map. If the thing is going to cost a premium and feature these ease of development, easy PASM features, we should be able to use them on those odd, one off, thorny problems.

I see that as a big market for us.*

*I write us, we, etc... as this very clearly is a shared effort made better by our common interests. We are all in it together in many real ways. If we enjoy success, everyone here will benefit, some more directly than others, but enough to warrant we and us, in this context.

Cluso99 · 2016-05-12 23:21

cgracey wrote: »

If objects aren't written to accommodate worst-case hub timing, they will not work. I would not write cog code that relied on cogs being any particular number apart. It's just asking for trouble.

Under a fully controlled implementation, I see nothing wrong with this. That is what determinism is about...you can guarantee the outcome every time.

We do somewhat similar coding with sweet spot timing with the hub, especially with video updating, overlay loading. Granted, most of this is within a single cog. But the same should apply to running multiple cogs.

Perhaps a way around the issue it to have an enabler instruction for a cog to share its LUT?
i.e. Have a write enable instruction to enable that cogs LUT to be written by the next cog. Or read/write enable.

jmg · 2016-05-12 23:21

potatohead wrote: »

So we have basically 10Mhz inter cog comms, assuming 200mhz production clock speed.

That's kind of slow.*
*bursts are much faster, but this seems to be the basic, random data comms rate. Am I wrong about that somehow?

I think that's correct - burst can tack-onto the next SysCLK, once alignment is made, so you can get higher average speeds, but for single transaction / random access, those reads are 6..21 clocks, while writes are 2..16 clocks apply.

If your code is looping at 16*N clocks, you do gain some time, after that first sync.
Conversely, if your code was capable of looping faster, it will slow down to N*16

Rayman · 2016-05-12 23:24

If the cog signaling feature gets added, that should reduce latency in cog-cog coms I think

cgracey · 2016-05-12 23:28

Remember that you can read or write MORE than 1 long, at a cost of only 1 clock per additional long. So, there is latency, but if you want to move a block quickly, the latency amortizes to near zero.

	SETQ	#15			'read 16 longs
	RDLONG	startreg,hubaddr	'6..21 +15 clocks

	SETQ	#99			'write 100 longs
	WRLONG	startreg,hubaddr	'2..16 +99 clocks

	SETQ	##$3FFFF		'clear all of hub
	WRLONG	#0,#0			'2..16 +256K clocks

	SETQ2	#9			'read 10 longs into LUT
	RDLONG	startlut,hubaddr	'6..21 +9 clocks

	SETQ2	#511			'write entire LUT to hub
	WRLONG	0,address		'2..16 +511 clocks

These constructs do not use the cog's hub FIFOs, so they will work with hub exec, RDFAST/RFBYTE/RFWORD/RFLONG, or WRFAST/WFBYTE/WFWORD/WFLONG.

jmg · 2016-05-12 23:30

Rayman wrote: »

If the cog signaling feature gets added, that should reduce latency in cog-cog coms I think

Cog Signaling is in there now, and it can help some latencies, but not all.
If signaling used 32 bits, there could be some scope for using upper bits as data ( in a very limited way, but it is almost free).

Rayman · 2016-05-12 23:31

I think somebody mentioned signaling cog-cog through smartpin.

Maybe that's relatively slow. But, maybe you could send data from one cog to another cog's LUT via smartpin and that second port?

jmg · 2016-05-12 23:31

cgracey wrote: »

Let's see...

If you want to read long $xxx0 and you just missed it:

123456789ABCDEF0ddddd

Those 'd' delays are for getting the return value back out of the memory system. The first two d's are to get to the specific memory and the last three are for the result to come all the way out. Writes don't have to wait for the d's.

So, reads are 6..21 clocks, while writes are 2..16 clocks.

What are the current equivalent times for Writes / Reads of 8b/16b/32b to a Pin Cell ?

potatohead · 2016-05-12 23:35

Yes Chip, throughput is high.

But, basic, read, test, respond is slow. 10mhz does not seem sufficient to cover test / respond cases of any complexity fast enough. As cluso points out below, the full handshake could drop to as little as 2mhz!

Simple ones happen in COG, and are fast. It's that "just a bit more than one can do in a COG" case that could be a worry.

I'm framing it in general 10Mhz terms in some attempt to balance general applicability against the cog race potential and how that gets managed.

It may be a worthy tradeoff.

Let me put it another way:

Balance dynamic COG allocation type tasks against test and respond coordinated COG tasks. In the former, each cog is atomic, and the task somehow is easy to perform in parallel. Each cog does the same thing, most of the time too.

We have this done cold. Works, is fast, etc...

In the latter, where we need COGS to work on a common task and play different roles, cogs doing coordinated, different things, we have a gap. Making things parallel in this way will often be at 10mhz.

This is no real improvement over P1, which runs at 100mhz.

For this bit bang, test / respond case, P1 users have no upgrade path, where more than one COG is or needs to be involved.

Rayman · 2016-05-12 23:43

Ok, what about the new DAC scheme. I guess that's so any cog can write to set of four pins. Can that bus be made bidirectional so that a cog can either read or write from it?

potatohead · 2016-05-12 23:48

And that is all I have to contribute right now. I'm open either way. Just wanted to place this bottleneck into some context I thought might help.

Cluso99 · 2016-05-12 23:49

cgracey wrote: »

potatohead wrote: »

What is worst case HUB timing?

Let's see...

If you want to read long $xxx0 and you just missed it:

123456789ABCDEF0ddddd

Those 'd' delays are for getting the return value back out of the memory system. The first two d's are to get to the specific memory and the last three are for the result to come all the way out. Writes don't have to wait for the d's.

So, reads are 6..21 clocks, while writes are 2..16 clocks.

So for a full write to another cooperating cog via hub will be...
Wrlong (2-16)
Just missed slot for read (so read zero & jz): (6-1+2)
Rdlong Wz (6-21)
Jnz (2)
Some processing
Wrlong (2-16) for write back reply
Just missed slot for read (so read zero & jz): (6-1+2)
Rdlong wz (6-21)
Jnz (2)

So the minimum transaction allowing for worst case is 92 clocks without processing the data!

With shared LUT, worst case is...
Wrlut (2)
Just missed read (so rdlut2 zero & jz): (2-1+2)
Rdlut2 wz (2)
Jz(2)
Process
Wrlut2 (2)
Just missed read (so rdlut zero & jz): (2-1+2)
Rdlut wz (2)
Jz (2)

So total 18 clocks vs 92 clocks, worst case without processing.

jmg · 2016-05-12 23:56

Rayman wrote: »

I think somebody mentioned signaling cog-cog through smartpin.

Maybe that's relatively slow. But, maybe you could send data from one cog to another cog's LUT via smartpin and that second port?

Chip improved the Pin-Cell link speed, and made it run-length coded, so bytes are faster than words.
I think it will be faster than the worst-case HUB numbers, but slower than best-case.
See my question for current figures on pin-cell links above.

Rayman wrote: »

Ok, what about the new DAC scheme. I guess that's so any cog can write to set of four pins. Can that bus be made bidirectional so that a cog can either read or write from it?

Yes, any existing data pathways should be looked at. Tacking-on some virtual nodes may be simple ?

potatohead · 2016-05-13 00:08

I have one more thing to say:

That shared LUT could do a lot of video, sprite, etc... tricks. Could also be used to boost code, do overlays and a whole pile of fun things. I would use it, and I would either make a cog allocator, or use one that is standard, or just start the shared ones at boot time and limit dynamic cog pools by number accordingly.

All of that seems easier to deal with and less frequent than the inability to tightly couple cogs will be. It may mean just not being able to do stuff as opposed to having to be a bit careful about doing stuff.

We added events on this same reasoning, and that was the right thing to do.

Honestly, running multiple COGS in the usual way can deliver same as or comparable performance in a lot of cases I can think of. Or, the performance possible seems good and generally applicable.

I don't think we lose out on too many potential cases if we do not LUT share.

However, if we do anything, before we call it done, improving on that read, test, respond cog to cog case is most important. It's one area where a P1 user looking to improve won't be able to do that on P2, despite a higher clock and spiffy features, IMHO.

We will lose out on too many potential use cases without a better solution, also IMHO.

David Betz · 2016-05-13 01:20

cgracey wrote: »

David Betz wrote: »

cgracey wrote: »

I've been mulling over how to provide LUT sharing without setting the stage for a future software crisis, and I just can't see how to do it, shy a ton of logic.

Well, this is probably really ugly but what if you add a 16 bit mask register that COGNEW uses to decide which COGs are candidates for allocation. If a bit is set then COGNEW can allocate that COG if it isn't already running. If it's not set, COGNEW skips over that COG and goes on to the next one. If you set that mask so that every other bit is set and then do COGNEW

cgracey wrote: »

Cluso99 wrote: »

Heater,
It is the same way I map which cogs are used and which are free in my PropOS.

I have a dummy "jmp $" and start free cogs using this routine. Each cog new returns the cog# started until no free cogs are found. Now I know which cogs are used and which are free, so I build a table. Then I stop all the cogs I started.

The difference with the P2 and shared LUT, or for that matter using specific cogs to minimise latency with the egg beater, is that each time I perform a cognew I would check if I now have cogs meeting my requirements. If so, I return the unwanted cogs by stopping them, and either restart the required cogs or repoint the software dummy routine (easy with hub exec).

If you start up cogs on an experimental basis to find two in a row, you might choke off a COGNEW from some unrelated code written by someone else. That could be problematic, as now your testing is causing other, unrelated programs to fail when they wouldn't have, otherwise. All those errors have to be handled, somehow. There's no way out of this, except to allocate cogs at the start of an app, and that undermines the concept of random reallocation of cog assets. It's kind of like reserving seats in a bus station.

What if you add a new form of COGNEW that does this looking for two in a row? If it finds them, it starts both at the same time with the same code. It's up to one of them to use COGINIT to start its neighbor with the code it actually wants it to run if necessary. If two in a row aren't found, COGNEW fails.

We could do that multi-cog COGNEW thing, but it will lead to the fragmented malloc problem, where an eventual clean-up would be needed to get the active cogs contiguous again, so that chunks of cogs could be allocated once more, and that is not possible to do with cogs.

I'm quite sure when Windows freezes up now and then for a split second, it's making RAM allocations contiguous again so that it can meet a new malloc request that fragmentation was preventing.

Yes, that is true. However, how likely is that scenario? It assumes that COGs are being started and stopped continuously. My experience (granted, very limited) is that most programs start all of their COGs during initialization and then run with a relatively static COG-map for the duration of the program. Even if this isn't true all the time, how often will drivers that require consecutive COGs be randomly started and stopped? I guess what I'm asking is whether this is a real concern.

T Chap · 2016-05-13 01:32

No concern for my projects. I start all cogs at the top in a specific order. If there is an advantage for LCD graphics speed then I want it in.

cgracey · 2016-05-13 01:51

David Betz wrote: »

cgracey wrote: »

David Betz wrote: »

cgracey wrote: »

I've been mulling over how to provide LUT sharing without setting the stage for a future software crisis, and I just can't see how to do it, shy a ton of logic.

Well, this is probably really ugly but what if you add a 16 bit mask register that COGNEW uses to decide which COGs are candidates for allocation. If a bit is set then COGNEW can allocate that COG if it isn't already running. If it's not set, COGNEW skips over that COG and goes on to the next one. If you set that mask so that every other bit is set and then do COGNEW

cgracey wrote: »

Cluso99 wrote: »

Heater,
It is the same way I map which cogs are used and which are free in my PropOS.

I have a dummy "jmp $" and start free cogs using this routine. Each cog new returns the cog# started until no free cogs are found. Now I know which cogs are used and which are free, so I build a table. Then I stop all the cogs I started.

The difference with the P2 and shared LUT, or for that matter using specific cogs to minimise latency with the egg beater, is that each time I perform a cognew I would check if I now have cogs meeting my requirements. If so, I return the unwanted cogs by stopping them, and either restart the required cogs or repoint the software dummy routine (easy with hub exec).

If you start up cogs on an experimental basis to find two in a row, you might choke off a COGNEW from some unrelated code written by someone else. That could be problematic, as now your testing is causing other, unrelated programs to fail when they wouldn't have, otherwise. All those errors have to be handled, somehow. There's no way out of this, except to allocate cogs at the start of an app, and that undermines the concept of random reallocation of cog assets. It's kind of like reserving seats in a bus station.

What if you add a new form of COGNEW that does this looking for two in a row? If it finds them, it starts both at the same time with the same code. It's up to one of them to use COGINIT to start its neighbor with the code it actually wants it to run if necessary. If two in a row aren't found, COGNEW fails.

We could do that multi-cog COGNEW thing, but it will lead to the fragmented malloc problem, where an eventual clean-up would be needed to get the active cogs contiguous again, so that chunks of cogs could be allocated once more, and that is not possible to do with cogs.

I'm quite sure when Windows freezes up now and then for a split second, it's making RAM allocations contiguous again so that it can meet a new malloc request that fragmentation was preventing.

Yes, that is true. However, how likely is that scenario? It assumes that COGs are being started and stopped continuously. My experience (granted, very limited) is that most programs start all of their COGs during initialization and then run with a relatively static COG-map for the duration of the program. Even if this isn't true all the time, how often will drivers that require consecutive COGs be randomly started and stopped? I guess what I'm asking is whether this is a real concern.

It's not a concern until it becomes a problem and we don't have a good solution. From doing tech support in the past, I think it's better to not have some feature if it's going to cause more trouble than it's worth. I could see LUT sharing providing some good performance in some cases, but it will almost certainly get over-used, just because it's there. The more it's used, the more problems are going to come up with cog allocation. I don't relish that thought, at all. I think it's better to encourage cog-independent programming. It's amazing how you can find ways to do things when you figure out how to work with what you've got, whatever it is, or isn't.

Nothing will compare to LUT sharing for quick feedback, but what about these smart-pin comm times:


For the PINSETx instructions, different numbers of clock cycles are required for bytes, words, and longs, according to the D/# value:


    $000000xx = 4 clocks

    $0000xxxx = 6 clocks

    $xxxxxxxx = 10 clocks


For the PINGETZ instruction, either a byte, word, or long is returned to D, according to the mode. If WC is used with PINGETZ, the MSB of the byte, word, or long goes into the C flag. If WZ is used, the Z flag will be set if the data was zero, or cleared if the data was not zero. Different minimal numbers of cycles are required for each data size, with additional cycles usually needed to wait for the start of the looping message coming from the smart pin:


    $000000xx = 4..6 clocks

    $0000xxxx = 6..10 clocks

    $xxxxxxxx = 10..18 clocks

What if we had a smart-pin mode that had byte/word/long sub modes, so that a smart pin could just act as a data transponder. Anyone could write it (one at a time), but all could read it concurrently. For bytes, especially, this would be fast.

Tubular · 2016-05-13 02:07

I think this is via-pin idea is definitely worth trying, as it would break the bottleneck wide open at little logic cost. The ability to send bytes/words/longs efficiently is good, too.

Special bonus points if you can clock in on one cog as you are clocking out of the other (ie 10..18 clocks for the whole cog-cog transfer, rather than latching midway 10..18 clocks+10..18 clocks)

Being optionally able to expose the data, via the physical pin as it passes through, could be good for debugging, though I think you have a few data lines in action at the same time but only one physical pin

T Chap · 2016-05-13 02:25

Chip, I am curious why you can't put an OR bus connected to all cogs? One cog could load a byte/word/long onto the 32 bit bus, and any cog can read it immediately? Wouldn't that be faster than using a smartpin mode?

jmg · 2016-05-13 02:35

cgracey wrote: »

Nothing will compare to LUT sharing for quick feedback, but what about these smart-pin comm times:
...

What if we had a smart-pin mode that had byte/word/long sub modes, so that a smart pin could just act as a data transponder. Anyone could write it (one at a time), but all could read it concurrently. For bytes, especially, this would be fast.

This is worth exploring, because it allows any cog freedoms.
I would not remove LUT sharing just yet, depending on the final speed outcomes of altermatives.

Can this use the Smart-Pin links, but not consume an actual pin (ie be a sort of virtual pin path, with minimal logic) ?

Typical use would be something like

Host:
* Generates Signal to slave(s)
* PINSETx of 4,6,10 cy

Slave(s)
* WAIT or branch on Signal
* PINGETZ of 4,6,10,??

Could multiple slaves attach to the same Node ? Could be tricky in out of phase reading cases ?

Can the PINGET start in parallel, so if it does have to sync, there is minimal added time.
in the above, the total delay Cog-Cog would be less than 10+10

jmg · 2016-05-13 02:40

David Betz wrote: »

Yes, that is true. However, how likely is that scenario? It assumes that COGs are being started and stopped continuously. My experience (granted, very limited) is that most programs start all of their COGs during initialization and then run with a relatively static COG-map for the duration of the program. Even if this isn't true all the time, how often will drivers that require consecutive COGs be randomly started and stopped? I guess what I'm asking is whether this is a real concern.

I agree, most Microcontroller applications are like that, and if there are dynamic reloads, it is very few COGs.
(Things like Board Test and Calibrate might borrow a run-time COG, if one was not spare, but they are one-off usages.)

I think Chip is looking ahead somewhat, to a more general almost Microprocessor case, where users (re) launch some COGs, more like apps.

Not really a Microcontroller use case, and I'd expect the Master/OS COG to manage any COG distribution along the lines of ozpropdev's example above.

potatohead · 2016-05-13 03:21

all could read it concurrently

Nice! That plugs a big hole. We got a cog broadcast option again!

I agree with the idea of figuring out how to use what we have, and we have a lot.

Is LUT sharing between adjacent cogs very important?

Comments