To finish out RDLUTN/WRLUTN, I should add support for egg-beater loading (1 clock per long) using a new SETQ3+RDLONG/WRLONG/WMLONG arrangement. This doesn't take much at all. What I DON'T want to do is get into extra LUT-exec issues, as that blows up the execution map again, and I don't see any real point in it. Do you guys?
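A sketch of what that arrangement might look like, assuming SETQ3 takes a transfer count the way SETQ2 does (the instruction name and operand convention here follow this thread's proposal, not final silicon; hub_buf is a placeholder address):

```pasm
' load 256 longs from hub into the NEXT cog's LUT at 1 clock per long
        setq3   #$100-1             'queue a 256-long block transfer
        rdlong  0, ##hub_buf        'hub -> next cog's LUT, starting at LUT $000

' and the reverse direction
        setq3   #$100-1
        wrlong  0, ##hub_buf        'next cog's LUT -> hub
```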
LUTExec was done quite a while back. This is expected to be used quite a lot as it has no speed penalties.
SETQ3 will add Streamer cross-transferring to the next Cog. This could be fun to play with; for one, it provides a convoluted way, via HubRAM, to get streamed data from pins to LUT. (EDIT: Although it would probably be easier to use the other Streamer just to fetch to HubRAM, then set up your own FIFO with a RDFAST and get the data into CogRAM instead.)
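The EDIT's alternative might look something like this, split across two cogs (the streamer mode bits and buffer address are placeholders; only the instruction names are real):

```pasm
' producer cog: stream pin data into a hub buffer
        wrfast  #0, ##buf           'point the FIFO at the hub buffer for streamer writes
        xinit   ##X_PINS_TO_HUB, #0 'start streamer, pins -> hub (mode bits are a placeholder)

' consumer cog: pull the same buffer into cog RAM through its own FIFO
        rdfast  #0, ##buf
.loop   rflong  data                'one long per read into cog RAM
        ' ...process data...
        jmp     #.loop
```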
Cog data access to LUT is simple load and store. 4-clock access time I think. Dunno what sort of indexing can be done there.
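For reference, the plain load/store access would look like this (the PTRx-style indexing is an assumption, borrowed from the hub instructions; timing per the post above):

```pasm
' simple cog <-> LUT load/store
        wrlut   data, ptra++        'store a long to LUT, post-increment the pointer
        rdlut   data, #$1F0         'load a long from a fixed LUT address
```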
Ah, how about this: Cog0 Streamer from pins to HubRAM, then Cog1 Streamer from HubRAM to Cog2 LUTRAM, then processing in Cog2 and a fast return back to HubRAM via FIFO with a WRFAST config. Eek, that's all three FIFOs consumed!
I can see some usage examples will be needed to show just what this can do...
So this extra 'next COG's LUT' memory that a COG can now see has what limitations?
You are saying code exec from that is not supported ? What about exec from your own LUT ?
Streamer ?
Memory arrays ?
To allow LUT-exec from the next cog's LUT, we'd have to redefine the memory map for code execution. I don't see any utility in that. Fast loading and storing the next cog's LUT from/to hub does seem like something useful, though.
The streamer can move hub data to pins and DACs, or pin data to hub. It doesn't read/write the LUT at all. There is a mechanism, though, to move a long per clock between hub and LUT. This mechanism can be extended to include the next cog's LUT.
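The existing one-long-per-clock hub-to-LUT mechanism referred to here is the SETQ2 block transfer; extending it to the next cog's LUT is what the proposed SETQ3 prefix would do. A minimal sketch (table is a placeholder hub address):

```pasm
' fill this cog's entire LUT from hub at 1 clock per long
        setq2   #$200-1             'queue all 512 longs
        rdlong  0, ##table          'hub -> own LUT

' write it back
        setq2   #$200-1
        wrlong  0, ##table          'own LUT -> hub
```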
It is very unlikely that any feature not implemented now will hinder the success of the P2. Just as the P1 has lots of resources that go unused, we should not try to squeeze out the last quantum. A solution is good enough if there is still room for improvement ahead!
Chip, just try to make demos to find hidden misbehavior. All of us together (sorry, I can't help) should now prepare to get into silicon what we have in an FPGA, like Peter Jakacki does. He's already bringing the P2 to application!
I realise this is for FPGA image generation; I hope you haven't forgotten the equivalent mask for real silicon. I'd hate to think you might actually generate the wrong Verilog for the silicon. Or perhaps you use a different top file for the real silicon?
The utility of next-cog LUT-exec is for when you require more high-speed deterministic code, without branch address penalties, than cog+LUT by themselves can satisfy. It would give access to a lot of fast cog registers for data storage, and 2x the LUT size for code.
Would it mess up the code execution memory map? How many programs are likely to run cog code to $1FF, then through LUT to $3FF, and then carry on at $400 (long address $400 == byte address $1000 in hub RAM)? I would think nearly all, except those who wish to test the execute-through-linear-memory feature, will execute a jump.
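The execution map being discussed, and the explicit jump that nearly all programs would use at a region boundary (addresses as given in the post; the absolute-branch syntax is an assumption):

```pasm
' $000-$1FF  cog RAM exec
' $200-$3FF  LUT RAM exec
' $400+      hub RAM exec
'
' crossing regions with an explicit jump rather than falling through:
        jmp     #\lut_routine       'absolute jump from cog code into LUT code
```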
Chip,
Just briefly catching up before I go off grid again
If a cog could LUT-exec from the adjacent cog's LUT, that would permit one cog to have 1536 longs instead of 1024 of LUT/RAM space. It also means a cooperating (adjacent) cog could perform overlay loading for the larger cog.
The only real cost is an extra address bit for cog/LUT, and a larger hole in hub RAM. However, if hub wraps at 512KB as previously indicated, then in reality the 512KB block is just shifted up by the cog/LUT size.
There is some real benefit to being able to run LUT exec from the additional LUT/AUX memory, if it is not too much trouble. It would also permit larger LUT stack(s).
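Under this thread's proposal, the overlay loading could be sketched like this in the helper cog (SETQ3 and its semantics are the proposal, not final silicon; OVL_LONGS and ovl_image are placeholders):

```pasm
' helper cog: load an overlay from hub into the neighbor's LUT
' while the neighbor keeps executing elsewhere
        setq3   #OVL_LONGS-1        'queue the overlay block
        rdlong  0, ##ovl_image      'hub -> next cog's LUT, 1 clock per long
        ' then signal the neighbor (e.g. via a shared hub flag)
        ' that the overlay is in place
```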
I'm still hoping to see some info on what the features/limits/costs of the LUT extension actually are.
* What the Memory Map looks like ?
* What is excluded by the hardware ?
* How much logic this actually costs ?
* What opcodes can make use of the extension ?