There is no reason more than two cogs could not be made to communicate "upstream," as long as every cog can write to the next higher cog's LUT. Cog(n) can process some of the data and send some to cog(n+1), which can process some and pass some on to cog(n+2), and so on. Daisy-chain as many cogs as needed.
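A minimal sketch of what one stage of such a chain might look like, in P2 assembly. This assumes a hypothetical wrlutn instruction that writes into the next-higher cog's LUT (no such mechanism was settled at this point), and FLAG/DATA are LUT addresses picked just for the example:

' One stage of the chain: wait for work from cog(n-1), process it,
' then hand the result up to cog(n+1). wrlutn is hypothetical.
stage   rdlut   x, #FLAG            ' poll our own LUT for "data ready"
        tjz     x, #stage           '   (cog n-1 writes this flag)
        rdlut   y, #DATA            ' fetch the long that came with it
        wrlut   zero, #FLAG         ' clear the flag for the next handoff
        ' ... process y ...
        wrlutn  y, #DATA            ' pass the result into cog(n+1)'s LUT
        wrlutn  one, #FLAG          ' and raise its flag
        jmp     #stage

zero    long    0
one     long    1
x       res     1
y       res     1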
Correct, provided such a path actually exists. Some of the proposals do not have that path.
But its existence adds a layer of possibilities that makes the chip much more desirable to a subset of potential users that have certain odd requirements and are happy to work within the constraints that LUT-sharing imposes.
Perceived requirements, I suspect. Cluso has proven this: he keeps saying the direct link is important when really all he wanted was to avoid instruction stalls. Instruction counting is often important.
This is not what I have been saying recently, and not what I meant previously (if that was indeed the impression I gave). I gave you the benefit of the doubt on the other thread and apologised for saying you got it wrong. But here you are clearly misquoting me!
There is no perceived requirement. I actually quoted a real example, not a contrived one.
Latency is the problem, and I even went so far as to define a specific example. Any instruction stall results in latency. A direct link is the only way to completely remove the latency of going via the hub; other methods can only reduce it.
BTW I am also referring to byte/word/long transfers, not blocks of data.
It's a specific tool for specific situations. It's not the de facto means of inter-cog communications. But its existence adds a layer of possibilities that makes the chip much more desirable to a subset of potential users that have certain odd requirements and are happy to work within the constraints that LUT-sharing imposes.
Yes, the key here is low latency.
Sometimes that simply matters, and there is no workaround, or band-aid.
I am also keen to see the more general Pin-Cell pathway tuned, to lower the latency there too.
Chip has already shaved one more cycle off it. Great to see.
On second thought, I may have missed the point of the bottom-up/top-down scheme. My impression was that cognew was used for both single and dual allocation. But now it appears that coginit might have been implied for the top-down part, just so it didn't collide with the single allocations coming up from the bottom. I don't like that. For one, if the bottom-up allocations become full, coginit could easily clobber a running cog without knowing it.
Better would be an atomic cognew2 that allocates even/odd pairs simultaneously and loads the same code into both. The odd-numbered cog would then, via cogid, identify and coginit itself with the correct code. The even-numbered cog, after verifying such with cogid, would just continue running the code that was loaded.
-Phil
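In P2 assembly, the resolution step might look something like this sketch (the exact COGINIT operand form was still in flux at this point, so treat that line as pseudocode):

' Both cogs of the pair wake up running the same image.
entry   cogid   me                  ' learn our own cog number
        test    me, #1  wz          ' Z = 1 for the even-numbered cog
 if_z   jmp     #even_task          ' even cog just keeps running this image
        coginit me, #@odd_image     ' odd cog restarts itself with its own code

me      res     1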
I was thinking that COGNEW would search top-down for dual cogs and bottom-up for single cogs. Now we know that bottom-up, only, is fine.
Your concept of two cogs being allocated by COGNEW is what we need. And searching bottom-up for single or even/odd pairs will be fine. Also, how they would resolve between themselves, having started running the same code, is right on.
I guess I see that the 2nd LUT port has a read and write interface, with the read being used by the streamer. So I guess this write-only mode makes it easier to implement.
Does it add much to write to both cog+1 and cog-1?
About this matter of rolling-overlap LUT sharing...
Is it really that practical to program a string of cogs to propagate results upward to other LUTs? It seems to me kind of a pipe dream. I can definitely see two cogs working together, but a string of them seems unlikely to be useful.
In the case of odd/even pairs, the cogs would determine from cogid whether they are odd or even, and jump accordingly.
In the case of arbitrary pairs, the cogs would likewise determine whether they are odd or even, and each would store its cogid in hub, the even cog using, say, the lower hub byte. Then both would compare their own cogid with the other's byte to determine the lowest cog, and jump accordingly.
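A rough sketch of that resolution for arbitrary pairs, assuming a two-byte hub mailbox at an agreed address (MAILBOX and the settle delay are illustrative only):

' Both cogs run this after being started with the same code.
        cogid   me
        test    me, #1  wz          ' Z = 1 for the even cog
 if_z   wrbyte  me, ##MAILBOX       ' even cog posts its id to the low byte
if_nz   wrbyte  me, ##MAILBOX+1     ' odd cog posts its id to the high byte
        waitx   ##1000              ' crude settle delay so both posts land
 if_z   rdbyte  him, ##MAILBOX+1    ' then each reads the other's byte
if_nz   rdbyte  him, ##MAILBOX
        cmp     me, him  wc         ' C = 1 if our id is the lower one
 if_c   jmp     #lower_cog_task
        jmp     #upper_cog_task

me      res     1
him     res     1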
I tend to agree.
I can see a case for 2 or 3 or 4 as practical, but more on a minion basis, than a chain or string.
Once you start constructing a chain, the housekeeping costs would make the pin-cell pathway better.
Unless each COG does some significant processing before passing the data along; then the traffic becomes packets of data, which the HUB manages fine.
It's the smaller, fine-grained stuff where latency tends to be most critical.
I can definitely see 2 cogs working together, perhaps 3. In 3, the middle cog would share tasks between the lower and the upper cog.
But it depends on implementation. If you implemented it as you originally conceived it (half shared up, half shared down, with extra LUTX instructions, i.e. separate rather than dual-write), I could see some applications preferring to go via LUT to an adjoining cog rather than via the hub.
I had thought you would just map the whole LUT from each cog to the next, meaning that cog n would map its LUT to cog n+1, and cog n+1 would map its LUT to cog n+2. Thus cog n+1 would effectively be able to see cog n's LUT and share its own LUT with cog n+2. Therefore every cog would be equal.
If LUT addressing were 10 bits instead of 9, then perhaps the whole 4KB of LUT could be used by one cog for larger stacks and/or LUT exec. This may be too much work, though?
I have stayed out of this whole discussion and have no special wants for anything in particular, but I thought of one possible application for something like this.
A fast pipeline notification pathway could be useful for doing something like real-time packet processing, where you can break up the work into fixed portions that have to run fast. Of course there are other devices that would do it better, but this example is just to show a reasonable use case.
For example if you had a sequence of worker COGs doing specific operations like
1) reception
2) packet classification, access control
3) header manipulations
4) switching
5) scheduling and queuing
6) transmission
you could then have the bulk data path stay in the hub once received, until transmitted, and just pass a small information block or opcode down the line via the LUT pathway rapidly. That way you could keep up with the wire rate of a 100 Mbps stream, for example, and not be wasting time polling the hub on every packet at every step when your cycle budget per packet is tightly restricted.
You could of course try to parallelize by replicating a bunch of COGs doing the same thing for each packet, but only if the totality of all possible operations fits in all these COGs' instruction space. This may not be possible with the 2 KB instruction limit in some applications, though hub exec could alleviate that in some cases (with some loss of determinism, perhaps). A pipeline like this also inherently maintains sequence and doesn't need other means to synchronize the final output order.
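As a sketch of what one stage's inner loop might look like, assuming some form of write into the next cog's LUT (which is exactly the mechanism under debate here); DESC_SLOT is an example LUT address and wrlutn is hypothetical:

' One pipeline stage: bulk packet data stays in hub; only a small
' descriptor (hub address + opcode) hops cog-to-cog through the LUT.
stage   rdlut   desc, #DESC_SLOT    ' poll our LUT for a descriptor
        tjz     desc, #stage
        wrlut   null, #DESC_SLOT    ' consume it
        ' ... do this stage's work on the hub data desc points at ...
        wrlutn  desc, #DESC_SLOT    ' hand the packet on to the next stage
        jmp     #stage

null    long    0
desc    res     1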
It's the other way round. The instruction stall was a side effect of waiting on the Hub write latency. A buffered write doesn't stall even though there is still a background write latency.
With the FIFO, read latencies can be prefetched away so as not to directly affect instructions.
Cluso is thinking along those lines too. However, those are not timing-sensitive to each other, so they can merrily use HubRAM.
It'd be awfully tricky to use HubRAM when running HubExec, though. Some of those layers would benefit from more code space.
I think HubExec will be quite common on the P2. This is a good reason to have some other mechanism for data exchange (LUT, pins, whatever).
The block moves (SETQ/SETQ2+RDLONG/WRLONG) can work concurrently with hub exec, so if you are into processing blocks of data, it's still very efficient, even under hub exec.
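For example, something along these lines (buf and hubptr are names chosen for the illustration):

' Burst-read 16 longs from hub into cog registers, then write them back.
        setq    #16-1               ' transfer count minus one
        rdlong  buf, hubptr         ' fills buf..buf+15 in one hub burst

        setq    #16-1
        wrlong  buf, hubptr         ' drains buf..buf+15 back to hub

buf     res     16
hubptr  res     1                   ' holds a hub byte address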
I think something like Rogloh's fast layered comms is one of the more likely use cases. Clearly USB is pretty well looked after already. It's kinda hard imagining exactly what timing latencies are required until we actually do it.
This was one of the applications I had in mind when I suggested the idea. The other was high speed small burst data acquisition where the hub is too slow but a single LUT is not large enough. It may even be used for high speed transfer of larger amounts of data to hub.
Agreed.
Another use case is going to be SDRAM, where one COG acts as the SDRAM interface/driver/state engine and handles refresh.
Rogloh has some great work on P1V SDRAM, where he has packed address and refresh into one P1V hub cycle, so the SDRAM 'looks' like it is simple SRAM.
That will need quite close co-operation between two COGs to use it to full potential on P2.
It might be possible to use 2 streamers to get close to HW speeds.
Similar to SDRAM will be HyperFlash, HyperRAM, etc., where one COG needs to focus on the finer timings but still have the ability to read/write quickly too.
Addit: I expect HyperFlash and HyperRAM will be in full production long before P2 hits the fab.
Checking, I see Digikey already has stock of HyperRAM, and HyperFlash is coming.
These come in a compact, low-pin-count 24-TBGA (6x8) package.
Thanks. Obviously, from my other posts, I'm really torn between seeing this through to fruition and just scrapping it altogether in the name of forward progress -- still leaning toward the latter.
I can't help but remember Bill Clinton's mea culpa from 2004: "I did something for the worst possible reason -- just because I could." Spare gates/real estate may be the "worst possible reason" to add new features to an already great design. Further delay, as a consequence, will negate any marketing advantages an already-late product might confer. Chip, it's time to doff your engineering hat, don that of a businessman, and jest git 'er done!
Of course, the longer you prolong this process, the longer you can put off that horrid task of writing documentation! (Did I hit a nerve? )
Sounds like Chip has figured COGINIT will have to do for dual allocation ...
I didn't get that impression when he mentioned the bottom-up/top-down allocation scheme.
As it turns out, that scheme offers no advantages over allocating both single and double cogs from the bottom. I suspected that to be the case and wrote a Spin program to simulate 1000 allocations, randomly mixing single and double cognew requests until a request can't be filled. A tally of the average number of successful requests was kept and displayed at the end of the simulation. It was the same for both allocation schemes.
Anyway, requiring coginit simply will not do. There has to be a way to do it with some sort of hybrid cognew.
-Phil
My cog allocator, which is for next-LUT sharing, is 50% better than a simple allocator, and probably 2-3 times as much combinatorial logic (which I'm guessing isn't really significant for the cog allocator). My single cog allocator prefers idle cogs sandwiched between running cogs, and after that prefers cogs N for which cog N-1 is running but cog N+1 is idle. My double cog allocator prefers pairs of idle cogs N and N+1 for which cog N-1 is running.
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

  N_TRIALS = 1000
  UP_DOWN  = 0                      ' (unused in this version)
  BIAS     = 8

OBJ
  sio : "Parallax Serial Terminal"
  rnd : "RealRandom"

VAR
  long sum, seed
  long occupied

PUB start | cog, i, n
  sio.start(9600)
  cog := rnd.start - 1
  seed := rnd.random
  cogstop(cog)
  repeat i from 1 to N_TRIALS
    n~
    occupied~
    repeat
      if (?seed & 15 => BIAS)       ' roughly half the requests are doubles
        if (get_double_up_prefer)
          n++
        else
          quit
      else
        if (get_single_prefer)
          n++
        else
          quit
    sum += n
  sio.str(string("Average successes * 100: "))
  sio.dec(sum * 100 / N_TRIALS)
  sio.str(string(13, 10))

PUB get_double_up | mask, val
  mask := 3
  repeat 8
    ifnot (occupied & mask)
      occupied |= mask | (mask <- 16)
      return true
    mask <-= 2
  return false

PUB get_double_up_nopair | mask, val
  mask := 3
  repeat 16
    ifnot (occupied & mask)
      occupied |= mask | (mask <- 16)
      return true
    mask <-= 1
  return false

PUB get_double_down | mask
  mask := $c000
  repeat 8
    ifnot (occupied & mask)
      occupied |= mask | (mask <- 16)
      return true
    mask ->= 2
  return false

PUB get_double_down_nopair | mask
  mask := $18000
  repeat 16
    ifnot (occupied & mask)
      occupied |= mask | (mask <- 16)
      return true
    mask ->= 1
  return false

PUB get_single | mask
  mask := 1
  repeat 16
    ifnot (occupied & mask)
      occupied |= mask | (mask <- 16)
      return true
    mask <-= 1
  return false

PUB get_double_up_prefer | curr, mask, val
  curr := %110                      ' what to start
  mask := %111                      ' what to consider
  val := %001                       ' what to look for
  repeat 16
    if (occupied & mask) == val
      occupied |= curr | (curr <- 16)
      return true
    curr <-= 1                      ' slide the whole window together
    mask <-= 1
    val <-= 1
  return get_double_up_nopair

PUB get_single_prefer | curr, mask, val
  curr := %010                      ' what to start
  mask := %111                      ' what to consider
  val := %101                       ' what to look for
  repeat 16
    if (occupied & mask) == val
      occupied |= curr | (curr <- 16)
      return true
    curr <-= 1
    mask <-= 1
    val <-= 1
  curr := %010
  mask := %111
  val := %001
  repeat 16
    if (occupied & mask) == val
      occupied |= curr | (curr <- 16)
      return true
    curr <-= 1
    mask <-= 1
    val <-= 1
  return get_single

PUB get_single_prefer_pair | curr, mask, val
  curr := %01                       ' what to start
  mask := %11                       ' what to consider
  val := %10                        ' what to look for
  repeat 16
    if (occupied & mask) == val
      occupied |= curr | (curr <- 16)
      return true
    curr <-= 1
    mask <-= 1
    val <-= 1
  curr := %10                       ' what to start
  mask := %11                       ' what to consider
  val := %01                        ' what to look for
  repeat 16
    if (occupied & mask) == val
      occupied |= curr | (curr <- 16)
      return true
    curr <-= 1
    mask <-= 1
    val <-= 1
  return get_single
Note that my occupied mask is now 32 bits. The top and bottom 16 bits should always be the same. I also replaced all shifts with rotates. This lets stuff wrap around.
Also included are get_double_up_nopair and get_double_down_nopair. They are just like get_double_{up,down}, but they will take any two adjacent cogs, not just even/odd pairs. They weren't any better than Phil's original tests.
EDIT: My first tests were with next-LUT sharing. I just did tests for even/odd sharing, and got similar results (50% better than Phil's). To see this, use get_single_prefer_pair, which prefers cogs from half-started pairs over cogs from completely stopped pairs.
EDIT: It doesn't seem to matter which double cog allocator I use. It seems that using any single cog allocator that prefers idle cogs that are neighboring running cogs (e.g. get_single_prefer or get_single_prefer_pair) usually makes it around 50% better, regardless of which double cog allocator I use.
Are the buses (DATA, ADDRESS) connected in a wired-OR configuration during LUT sharing?
If so, is there any chance a READ operation directed to one of the paired LUTs could be signaled as a WRITE to the other one?
Kind of a clocked transfer between LUTs, perhaps linking some of the streamer's operation to the transfer mechanism?
I know there are concerns about stable DATA settling onto the bus and ADDRESS hold time before writing to the destination LUT, but is it in some way a viable data transfer mechanism?
This is a bit complex at first read...
So, if cog A wants to send a message to cog B and get a reply:
Cog A writes to both LUTs at the same time.
Cog B is notified somehow and reads his LUT to get the message.
Then:
Cog B writes the response to both LUTs at the same time.
Cog A is notified and reads the response in his LUT.
Ok, I think I have it...
It's very simple. The two cogs' LUTs become more the same, with each new write from either cog.
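A minimal mailbox sketch of that exchange, assuming sharing is enabled so each cog's wrlut lands in both LUTs; MSG, REQ, and ACK are LUT addresses picked for the example:

' Cog A: send a message, then wait for cog B's acknowledge.
send    wrlut   msg, #MSG           ' payload appears in both LUTs
        wrlut   one, #REQ           ' raise the request flag
wait    rdlut   t, #ACK             ' poll for the acknowledge flag
        tjz     t, #wait
        rdlut   reply, #MSG         ' B overwrote MSG with its response

' Cog B does the mirror image: poll REQ, read MSG, write the
' response back to MSG, then raise ACK.

one     long    1
msg     res     1
t       res     1
reply   res     1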
Talking and thinking about what to do takes all the time. Implementing this LUT idea will take maybe 30 minutes. Maybe an hour for dual-cog COGNEW. Documentation was really hard, at first, but I don't dread it, anymore.
Good thinking. What I got out of that is that when a single cog is requested, an unused cog within an even/odd pair in which the other cog is in use is preferred over an unused cog whose even/odd neighbor is also unused. This will tend to leave unused even/odd pairs available for double-cog requests. Neat! Phil, would it really work?
Comments
+1G
Oops. Just reread and Phil already worked it out.
That was my thought as well.
I wasn't even trying.
It just hasn't felt complete, until very recently.