Jmg,
Thanks. I wonder if it would've been simpler if implemented as I thought. Perhaps there is a little less decoding my way. The LUT is still built as 2KB but doesn't need to be split into halves, which requires decoding and more read & write strobes.
Physically, yes, sure you can do either.
However, if you dual port 100%, you then place the 2 LUT's between 2 COGS, and there is then no 3 or 4 COG cluster choice.
ie Layout is thus : COGa.LUTa.LUTb.COGb.COGc.LUTc.LUTd.COGd
IIRC, when I asked Chip about splits and resize choices of overlay portions, the logic costs were not hugely different.
Makes sense, as the data bus is 32b and the address only 9b. Halving the overlay size saves one address bit.
(Exact numbers of the two approaches have not been given.)
The half-overlaid system Chip chose, isolates COGs less, and to me is more flexible.
My main concern would be if the split in any way impacted speed, but Chip seemed comfortable there.
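As a sanity check on the one-address-bit claim above, here is the arithmetic as a short sketch (assuming the LUT is 2KB of 32-bit longs, as stated earlier in the thread):

```python
import math

LUT_BYTES = 2048          # 2KB LUT per cog
LONGS = LUT_BYTES // 4    # 512 x 32-bit entries

full_addr_bits = int(math.log2(LONGS))        # 512 longs -> 9 address bits
half_addr_bits = int(math.log2(LONGS // 2))   # 256 longs -> 8 address bits

print(full_addr_bits, half_addr_bits)  # 9 8
```

So halving the overlay window saves exactly one address bit, while the 32b data path is unchanged either way.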
I beg to differ. If you just share 100% of your LUT with the lower cog, you also get to access 100% of the higher cog's LUT shared with you.
So you can indeed chain cogs.
Show me the physical placement, using dual port memory, that delivers that.
Except the cogs and LUTs would actually be arranged in a circle on the actual silicon.
Each cog can access its own LUT to its right, as well as the one on its left belonging to the previous cog. However, it can do more things with its own LUT to the right, like streaming and LUTEXEC and such.
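The ring arrangement described above can be sketched as a toy model (a 16-cog ring is assumed; the function name is made up for illustration):

```python
NUM_COGS = 16  # assumed cog count for the ring

def accessible_luts(cog):
    """LUTs reachable from a cog in the ring model: its own LUT
    (full features: streaming, LUTEXEC) plus the previous cog's,
    wrapping around at cog 0."""
    return {"own": cog, "left_neighbor": (cog - 1) % NUM_COGS}

# Cog 0 wraps around to cog 15's LUT, so every LUT ends up
# shared by exactly two cogs and the chain closes into a circle.
print(accessible_luts(0))  # {'own': 0, 'left_neighbor': 15}
```

The wrap-around is what makes chaining all the cogs possible: there is no "end" cog that lacks a left neighbor.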
I don't see the problem with wiring up the whole LUT if it's dual-ported. The streamer is not going to be used very often, so the second port sits idle in most cases.
That's it. Each cog will be able to r/w the next cog's LUT via its second port that the streamer uses. The other cog will have priority over the local cog's streamer.
I think you're missing what I am saying...
The other choice is the adjacent lower cog in all cases.
Effectively, each cog can access its own LUT (with the second port connected to the lower cog). Now the next higher cog can access its own LUT (with the second port connected to the lower cog).
So each cog only has two ports.
In order to LUT share, you have to have something in the LUT to share... you have to write it and then index your LUT address: 4 clocks. Then you need some method to sync with the adjacent cog... this seems like it would take at least 1 clock... then you need to read it from the next cog... 3 clocks... a total of 8 clocks, which (if I understand it) is where we currently sit for byte data using Chip's new scheme.
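The clock budget above can be written out as a sketch; the per-stage counts are taken from the post itself, not from measured silicon:

```python
# Clock budget for passing one value through a shared LUT,
# using the stage counts claimed in the post (not measured figures).
WRITE_AND_INDEX    = 4  # write the value, update the LUT index
SYNC_TO_NEIGHBOR   = 1  # some flag/event so the adjacent cog knows
READ_FROM_NEIGHBOR = 3  # adjacent cog reads the shared location

total = WRITE_AND_INDEX + SYNC_TO_NEIGHBOR + READ_FROM_NEIGHBOR
print(total)  # 8 clocks, matching the existing byte-transfer path
```

If those stage counts hold, LUT sharing buys no latency over the existing scheme for single values; its win would be in bulk transfers, where the sync cost amortizes.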
I think I still prefer what Chip called 'overlaid halves', as that permits left and right communication without consuming your own LUT memory area. (which may be used for large arrays. LUTEXEC etc )
The other approach runs a 32b port across the COG (so saves no routing) but is asymmetric in that 'talking right' is outside your own LUT, but 'talking left' must be done inside it.
No?
Yes, this is why we eagerly await the actual timings of end-to-end DAC-Data paths.
LUT sharing would be abused. That I'm sure of, because I'd abuse it too. I know, the argument that it's our right to do so has its validity too.
That means that, in the end, you are also admitting that it is needed and that it will simplify many tasks for you. Otherwise, you can refrain from using it even if present.
Second, we don't have enough logic in the A9 chip to test it with everything else...
You can reduce smart pins, you can reduce cogs. You do not need 64 identical smart pins and 16 identical cogs. You can have an FPGA build with e.g. 32 (or 16) smart pins and/or 12 or 8 cogs.
This is a limit of the FPGA resources, not silicon die area.
... it seems you may not understand at all - You are requiring everyone to change with you.
I'm requiring no one to change with me! How can you possibly say that the existence of a feature compels everyone to use it?
But what you are admitting is that it would be useful. And you are also saying that NO ONE should have this useful thing. This is extremely interesting even if it borders on crazy.
Yes. That is what is being said. Sometimes hard choices get made. They were made in P1 too. We just didn't get to see them.
I'm fine with the decision. Decisions like this get made all the time too. Sometimes, the potential hassles, support, etc... outweigh feature utility.
You need to read the many earlier arguments. Heater has a habit of being right.
We have a choice of maintaining dynamic Cog mapping with COGNEW and forgoing shared LUTs. Or gaining the shared LUTs but forcing COGNEW to oblivion.
Chip started the whole thing off by telling us he'd dual-ported the LUTRAM because of the increased space. JMG immediately thought of using it for sharing between Cogs - http://forums.parallax.com/discussion/comment/1371876/#Comment_1371876 - which may have been a side-effect of Cluso's crusade for a Cog-to-Cog comms.
Err, nope, there are a great many more choices than those 2.
Raising the IQ of COGNEW, is one obvious one.
Note that 'dynamic Cog mapping' is mostly an illusion, as you do still need to ensure you do not try to ask for COG #9 on P1 or COG #17 on P2.
Such is the real world, it always intrudes, with real numbers, and physical limits.
Making sure you don't try to use more cogs than you have is a different and easier problem than actually allocating cogs. Making sure you don't run out of cogs can only be done in software, while making atomic cog allocation painless can be, and is, done in hardware. With COGNEW, you don't need to worry about which cogs stuff is running in, just how many.
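The atomicity point can be made concrete with a sketch: a software stand-in for COGNEW needs a lock around the find-and-claim step, which the hardware instruction gives for free (the class and method names here are made up for illustration):

```python
import threading

class CogAllocator:
    """Software stand-in for COGNEW: atomically find a free cog
    and claim it. In hardware this find-and-claim is one operation;
    in software it needs a lock to stay race-free."""
    def __init__(self, num_cogs=16):
        self.free = [True] * num_cogs
        self.lock = threading.Lock()

    def cognew(self):
        with self.lock:  # the atomicity hardware COGNEW provides
            for i, is_free in enumerate(self.free):
                if is_free:
                    self.free[i] = False
                    return i
            return -1    # out of cogs: still the caller's problem

    def cogstop(self, cog_id):
        with self.lock:
            self.free[cog_id] = True

alloc = CogAllocator()
ids = [alloc.cognew() for _ in range(3)]
print(ids)  # [0, 1, 2] -- callers never care WHICH cogs, only how many
```

Note the "out of cogs" case returns -1 regardless: no amount of hardware help removes the need to budget total cog count, which is the "illusion" point made above.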
That raises it to OS/Library. The instruction will still be gone and therefore so will its usefulness.
For sure, most deployments are static by nature. Most of OBEX uses COGNEW simply to make the mappings vanish from the source code. This could be changed, but it would relegate any fully dynamic code to far more complex methods.
Is it because the Cogs don't have the processing power of a MIPS32 or ARM M4 core, and they figure if they tag-team a pair they can do more, or have it try to punch above its weight in video applications?
I think you get it.
A P2 COG will never have the processing power of a common MIPS or ARM core.
A P2 COG will never handle USB, Ethernet, etc. like the dedicated hardware you find on such MIPS and ARM SoCs.
But a P2 has 16 cores, so the dream is to be able to stitch them together to achieve similar performance in some cases.
Perhaps.
Personally I think it's like trying to enhance a 555 timer chip into an ATMEL tinyAVR. Sure you could do that but then you don't have the simplicity of the 555 for those situations where you need it.
Heater
Reading the comments, that seems to be the gist of what the supporters of LUT sharing are implying: not enough horsepower in the Cogs for the complex apps they envision.
They do, however, have enough power if you're aiming for I/O work as opposed to data crunching or video.
Yes, it can do video, but so do a RPi, BeagleBone, EVE FT800 or a Gameduino, and they're cheaper and more capable.
My guess is that it is about "risk reduction".
If you're going to spend, I don't know, say $100k on a production run, you want to have high confidence in it.
Sounds like if it works in the FPGA, it will work on silicon too, using Treehouse.
Except for the analog stuff, that Chip is testing separately.
But, if you can't test the full design in FPGA, you're exposing yourself to some risk...
How expensive is the logic for switching the connections to LUTs for cog order agnosticism? If you do a simplistic grid of muxes to switch the buses, then you would need them switched at 256 points for 16 LUTs (you could probably put the muxes in trees to half that). However, if the LUTs are more limited (e.g. you have 4 of them that can be allocated), then you will have significantly fewer points where the buses have to be switched. So while you couldn't make a chain of 16 cogs working together, you could set up a few of them to have higher speed connections in an order agnostic way.
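The crosspoint counting in that question can be written out as a sketch (the halving factor for a mux tree is the post's own rough estimate, not a synthesized result):

```python
# Back-of-envelope switch-point counts for cog-order-agnostic LUT sharing.
def crosspoints(num_cogs, num_shared_luts, tree=False):
    """Full crossbar: every cog can reach every shareable LUT.
    A mux tree roughly halves the switch points, per the post's guess."""
    points = num_cogs * num_shared_luts
    return points // 2 if tree else points

print(crosspoints(16, 16))             # 256 switch points, full grid
print(crosspoints(16, 16, tree=True))  # ~128 with a mux tree
print(crosspoints(16, 4))              # 64 if only 4 LUTs are allocatable
```

The scaling is the point: limiting how many LUTs are shareable cuts the switching fabric linearly, at the cost of no longer being able to chain all 16 cogs.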
How expensive? It would be on the order of the hub egg-beater. That takes about 5,500 ALMs.
There is another option for half the price: make LUT sharing write-only, but with 16-bit LUT mask selection. That way, you could write any or all of the other cogs' LUTs at once, from any cog. Each cog's LUT's 2nd port would use an AND-OR mux to bring in writes/address/data from other cogs. For you to receive feedback from any other cog, the other cog would write your LUT in a complementary configuration. This would allow cog-number agnosticism AND writes to multiple cogs' LUTs simultaneously. This would take only 2 clocks, because there is no read feedback. Also, it would obviate the cog attention mechanism, so we could get rid of it. Actually, this would be the same as the attention mechanism, but with 9 bits of address and 32 bits of data added.
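The semantics of that write-only, mask-selected scheme can be modelled with a toy sketch (function name and sizes are illustrative; the 16-cog, 512-long figures come from earlier in the thread):

```python
# Toy model of write-only, mask-selected LUT sharing: a cog presents
# (mask, address, data); every cog whose mask bit is set latches the
# write on its LUT's second port. No read path, so no feedback and
# only a 2-clock write.
NUM_COGS = 16
luts = [[0] * 512 for _ in range(NUM_COGS)]  # 512 longs per cog LUT

def masked_lut_write(mask, addr, data):
    for cog in range(NUM_COGS):
        if mask & (1 << cog):      # the AND-OR mux: bit selects target
            luts[cog][addr] = data

masked_lut_write(0xFFFF, 0x10, 0xDEADBEEF)  # broadcast to all 16 cogs
masked_lut_write(0b0110, 0x11, 42)          # just cogs 1 and 2
print(luts[5][0x10] == 0xDEADBEEF, luts[1][0x11], luts[3][0x11])
```

Two-way communication falls out of the complementary arrangement: cog A writes into cog B's LUT and B replies by writing into A's, with no shared read port ever needed.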
So you can indeed chain cogs.
Show me the physical placement, using dual port memory, that delivers that.
COG0 LUT0 COG1 LUT1 COG2 LUT2 ... COG7 LUT7 (COG0)
Except they would actually be in a circle on the actual silicon.
Each cog can access its own LUT to its right, as well as the one on its left belonging to the previous cog. However, it can do more things with its own LUT to the right, like streaming and LUTEXEC and such.
I think you just hit 3 ports there...
One port belongs to the owner COG, the 2nd port can MUX to either Streamer, or one other choice.
Chip said that the port that otherwise belonged to the local streamer is now shared with the previous cog's RD/WRAUX:
The other choice is the adjacent lower cog in all cases.
Effectively, each cog can access its own LUT (with the second port connected to the lower cog). Now the next higher cog can access its own LUT (with the second port connected to the lower cog).
So each cog only has two ports.
... civilised, restrained, I'm always polite, considerate, ... or just happy either way. My first comment on it - http://forums.parallax.com/discussion/comment/1372001/#Comment_1372001
then http://forums.parallax.com/discussion/comment/1374329/#Comment_1374329 where I was guessing and already looking forward to it.
I note Cluso is posing this question constructively - http://forums.parallax.com/discussion/164274/do-you-dynamically-start-and-stop-cogs/p1
There is another option for half the price: Make LUT sharing write-only, but with 16-bit LUT mask selection. That way, you could write any or all of the other cogs' LUT's at once, from any cog. Each cog's LUT's 2nd port would use an AND-OR mux to bring in writes/address/data from other cogs. For you to receive feedback from any other cog, the other cog would write your LUT in a complimentary configuration. This would allow cog-number agnosticism AND writes to multiple cogs' LUT's simultaneously. This would take only 2 clocks, because there is no read feedback. Also, it would obviate the cog attention mechanism, so we could get rid of it. Actually, this would be the same as the attention mechanism, but with 9 bits of address and 32 bits of data added.
It'd be good for getting data split out to multiple cogs to act on 'in a jiffy'
Direct 32-bit write to ALL cogs' LUT's from any cog.
EDIT: Ah, I see, any mask of the 16, including your own LUT.
Yes. You'd do something like this: