To finish out RDLUTN/WRLUTN, I should add support for egg-beater loading (1 clock per long) using a new SETQ3+RDLONG/WRLONG/WMLONG arrangement. This doesn't take much at all. What I DON'T want to do is get into extra LUT-exec issues, as that blows up the execution map again, and I don't see any real point in it. Do you guys?
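A sketch of what that arrangement might look like, assuming SETQ3 takes a transfer count the way SETQ2 does (the instruction name and operand convention here follow this thread's proposal, not final silicon; hub_buf is a placeholder address):

```pasm
' load 256 longs from hub into the NEXT cog's LUT at 1 clock per long
        setq3   #$100-1             'queue a 256-long block transfer
        rdlong  0, ##hub_buf        'hub -> next cog's LUT, starting at LUT $000

' and the reverse direction
        setq3   #$100-1
        wrlong  0, ##hub_buf        'next cog's LUT -> hub
```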
LUTExec was done quite a while back. This is expected to be used quite a lot as it has no speed penalties.
SETQ3 will add Streamer cross-transferring to the next Cog. This could be fun to play with; for one, it provides a convoluted way, via HubRAM, to get streamed data from pins to LUT. (EDIT: Although it would probably be easier to use the other Streamer just to fetch to HubRAM, then set up your own FIFO with a RDFAST and get the data into CogRAM instead.)
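The EDIT's alternative might look something like this, split across two cogs (the streamer mode bits and buffer address are placeholders; only the instruction names are real):

```pasm
' producer cog: stream pin data into a hub buffer
        wrfast  #0, ##buf           'point the FIFO at the hub buffer for streamer writes
        xinit   ##X_PINS_TO_HUB, #0 'start streamer, pins -> hub (mode bits are a placeholder)

' consumer cog: pull the same buffer into cog RAM through its own FIFO
        rdfast  #0, ##buf
.loop   rflong  data                'one long per read into cog RAM
        ' ...process data...
        jmp     #.loop
```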
Cog data access to LUT is simple load and store. 4-clock access time I think. Dunno what sort of indexing can be done there.
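For reference, the plain load/store access would look like this (the PTRx-style indexing is an assumption, borrowed from the hub instructions; timing per the post above):

```pasm
' simple cog <-> LUT load/store
        wrlut   data, ptra++        'store a long to LUT, post-increment the pointer
        rdlut   data, #$1F0         'load a long from a fixed LUT address
```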
Ah, how about this: Cog0 Streamer from pins to HubRAM, then Cog1 Streamer from HubRAM to Cog2 LUTRAM, then processing in Cog2 and a fast return back to HubRAM via FIFO with a WRFAST config. Eek, that's all three FIFOs consumed!
I can see some usage examples will be needed to show just what this can do...
So this extra 'next COG's LUT' memory that a COG can now see has what limitations?
You are saying code exec from that is not supported ? What about exec from your own LUT ?
Streamer ?
Memory arrays ?
To allow LUT-exec from the next cog's LUT, we'd have to redefine the memory map for code execution. I don't see any utility in that. Fast loading and storing the next cog's LUT from/to hub does seem like something useful, though.
The streamer can move hub data to pins and DACs, or pin data to hub. It doesn't read/write the LUT at all. There is a mechanism, though, to move a long per clock between hub and LUT. This mechanism can be extended to include the next cog's LUT.
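The existing one-long-per-clock hub-to-LUT mechanism referred to here is the SETQ2 block transfer; extending it to the next cog's LUT is what the proposed SETQ3 prefix would do. A minimal sketch (table is a placeholder hub address):

```pasm
' fill this cog's entire LUT from hub at 1 clock per long
        setq2   #$200-1             'queue all 512 longs
        rdlong  0, ##table          'hub -> own LUT

' write it back
        setq2   #$200-1
        wrlong  0, ##table          'own LUT -> hub
```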
It is very unlikely that any feature not implemented now will hinder the success of the P2. Just as the P1 has lots of resources that go unused, we should not try to squeeze out the last quantum. A solution is good enough if there is still room for improvement ahead!
Chip, just try to make demos to find hidden misbehavior. All of us together (sorry, I can't help) should now prepare to get into silicon what we have in an FPGA, like Peter Jakacki does. He's already bringing the P2 to application!
I realise this is for FPGA image generation; I hope you haven't forgotten the equivalent mask for real silicon. I'd hate to think you might actually generate the wrong Verilog for the silicon. Or perhaps you use a different top file for the real silicon?
The utility of next-cog LUT-exec is for when you require more high-speed deterministic code, without branch address penalties, than cog+LUT by themselves can satisfy. It would give access to a lot of fast cog registers for data storage, and 2x the LUT size for code.
Would it mess up the code execution memory map? How many programs are likely to run cog code to $1FF, then through LUT to $3FF, and then carry on at $400 (long address $400 == byte address $1000 in hub RAM)? I would think nearly all, except those who wish to test the execute-through-linear-memory feature, will execute a jump.
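The execution map being discussed, and the explicit jump that nearly all programs would use at a region boundary (addresses as given in the post; the absolute-branch syntax is an assumption):

```pasm
' $000-$1FF  cog RAM exec
' $200-$3FF  LUT RAM exec
' $400+      hub RAM exec
'
' crossing regions with an explicit jump rather than falling through:
        jmp     #\lut_routine       'absolute jump from cog code into LUT code
```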
Chip,
Just briefly catching up before I go off grid again
If a cog could LUT-exec from the adjacent cog's LUT, that would permit one cog to have 1536 longs instead of 1024 of LUT/RAM space. It also means a cooperating (adjacent) cog could perform overlay loading for the larger cog.
The only real cost is an extra address bit for cog/LUT, and a larger hole in hub RAM. However, if hub wraps at 512KB as previously indicated, then in reality the 512KB block is just shifted up by the cog/LUT size.
There is some real benefit to being able to run LUT exec from the additional LUT/AUX memory, if it is not too much trouble. It would also permit larger LUT stack(s).
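Under this thread's proposal, the overlay loading could be sketched like this in the helper cog (SETQ3 and its semantics are the proposal, not final silicon; OVL_LONGS and ovl_image are placeholders):

```pasm
' helper cog: load an overlay from hub into the neighbor's LUT
' while the neighbor keeps executing elsewhere
        setq3   #OVL_LONGS-1        'queue the overlay block
        rdlong  0, ##ovl_image      'hub -> next cog's LUT, 1 clock per long
        ' then signal the neighbor (e.g. via a shared hub flag)
        ' that the overlay is in place
```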
I'm still hoping to see some info on what the features/limits/costs of the LUT extension actually are.
* What the Memory Map looks like ?
* What is excluded by the hardware ?
* How much logic this actually costs ?
* What opcodes can make use of the extension ?