P2 vs modern process limits - Page 5 — Parallax Forums

Comments

  • cgraceycgracey Posts: 14,209
    edited 2016-05-06 01:21
    Instead of using a max pin number to implement smart pins, I made a mask arrangement:
    	parameter altera_hub_mask = 14'h3FFF, cogs = 16, smartpins = 64'hFC00_0000_FFFF_FFFF;	// Prop123-A9	1024K	16 cogs  38 smart pins
    //	parameter altera_hub_mask = 14'h1FFF, cogs =  8, smartpins = 64'hFC00_0000_0000_FFFF;	// Prop123-A7	512K	 8 cogs  22 smart pins
    //	parameter altera_hub_mask = 14'h0FFF, cogs =  8, smartpins = 64'h0000_0000_0000_00FF;	// DE2-115	256K	 8 cogs   8 smart pins
    //	parameter altera_hub_mask = 14'h01FF, cogs =  1, smartpins = 64'h0000_0000_0000_00FF;	// DE0-Nano	32K	 1 cog    8 smart pins, no cordic
    

    To finish out RDLUTN/WRLUTN, I should add support for egg-beater loading (1 clock per long) using a new SETQ3+RDLONG/WRLONG/WMLONG arrangement. This doesn't take much, at all. What I DON'T want to do is get into extra LUT-exec issues, as that blows up the execution map again and I don't see any real point in it. Do you guys?
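As a rough cross-check of the mask arrangement above, the sketch below decodes the two parameters in Python. The helper names are hypothetical, and the 64-byte granularity of `altera_hub_mask` is only inferred from the size comments on the four parameter lines (for example `14'h3FFF` next to 1024K); it is not stated in the post.

```python
# Hypothetical helpers for decoding the parameter values quoted above.
# Assumption: each step of altera_hub_mask covers 64 bytes of hub RAM,
# inferred from the comments (14'h3FFF -> 1024K, 14'h01FF -> 32K).

HUB_UNIT_BYTES = 64  # assumed granularity, not confirmed by the source


def count_smart_pins(smartpins: int) -> int:
    """Number of pins enabled in the 64-bit smartpins mask."""
    return bin(smartpins & (2**64 - 1)).count("1")


def smart_pin_numbers(smartpins: int) -> list:
    """Pin numbers (0..63) whose bit is set in the mask."""
    return [pin for pin in range(64) if (smartpins >> pin) & 1]


def hub_kbytes(hub_mask: int) -> int:
    """Hub RAM size in KB implied by the mask, under the 64-byte assumption."""
    return (hub_mask + 1) * HUB_UNIT_BYTES // 1024


# The four FPGA configurations from the post: (hub mask, smart pin mask).
CONFIGS = {
    "Prop123-A9": (0x3FFF, 0xFC00_0000_FFFF_FFFF),
    "Prop123-A7": (0x1FFF, 0xFC00_0000_0000_FFFF),
    "DE2-115": (0x0FFF, 0x0000_0000_0000_00FF),
    "DE0-Nano": (0x01FF, 0x0000_0000_0000_00FF),
}

if __name__ == "__main__":
    for board, (hub_mask, pins) in CONFIGS.items():
        print(board, hub_kbytes(hub_mask), "KB,", count_smart_pins(pins), "smart pins")
```

Compared with a single max pin number, a mask lets a board enable non-contiguous pins, such as pins 0..31 plus 58..63 on the Prop123-A9.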
  • jmgjmg Posts: 15,175
    So this extra 'next COG's LUT' memory that a COG can now see, has what limitations ?

    You are saying code exec from that is not supported ? What about exec from your own LUT ?
    Streamer ?
    Memory arrays ?

  • evanhevanh Posts: 16,042
    edited 2016-05-06 01:54
    LUTExec was done quite a while back. This is expected to be used quite a lot as it has no speed penalties.

    SETQ3 will add Streamer cross transferring to next Cog. This could be fun to play with; for one, it provides a convoluted way, via HubRAM, to have streamed data from pins to LUT. (EDIT: Although, it would probably be easier to use the other Streamer to only fetch to HubRAM and then set up your own FIFO with a RDFAST and get the data into CogRAM instead.)

    Cog data access to LUT is simple load and store. 4-clock access time I think. Dunno what sort of indexing can be done there.
  • evanhevanh Posts: 16,042
    edited 2016-05-06 02:00
    Ah, how about this: Cog0 Streamer from pins to HubRAM, then Cog1 Streamer from HubRAM to Cog2 LUTRAM, then processing in Cog2 and fast return back to HubRAM via FIFO with a WRFAST config. :D Eek, that's all three FIFOs consumed!
  • jmgjmg Posts: 15,175
    evanh wrote: »
    Ah, how about this: Cog0 Streamer from pins to HubRAM, then Cog1 Streamer from HubRAM to Cog2 LUTRAM, then processing in Cog2 and fast return back to HubRAM via FIFO with a WRFAST config. :D Eek, that's all three FIFOs consumed!

    :)
    I can see some use examples will be needed to show just what this can do....


  • cgraceycgracey Posts: 14,209
    jmg wrote: »
    So this extra 'next COG's LUT' memory that a COG can now see, has what limitations ?

    You are saying code exec from that is not supported ? What about exec from your own LUT ?
    Streamer ?
    Memory arrays ?

    To allow LUT-exec from the next cog's LUT, we'd have to redefine the memory map for code execution. I don't see any utility in that. Fast loading and storing the next cog's LUT from/to hub does seem like something useful, though.
  • cgraceycgracey Posts: 14,209
    The streamer can move hub data to pins and DACs, or pin data to hub. It doesn't read/write the LUT, at all. There is a mechanism, though, to move a long per clock between hub and LUT. This mechanism can be extended to include the next cog's LUT.
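To put the one-long-per-clock figure in perspective, here is a back-of-the-envelope sketch in Python. The 512-long LUT size follows from the $200..$3ff range mentioned elsewhere in the thread; the 80 MHz clock and the `setup_clocks` parameter are purely illustrative assumptions, not figures from this discussion.

```python
# Rough timing estimate for a hub <-> LUT block transfer at 1 clock per long.
# Assumptions (hypothetical, for illustration only): an 80 MHz system clock
# and a caller-supplied setup overhead that defaults to zero.

LUT_LONGS = 512  # LUT spans $200..$3ff, i.e. 512 longs


def block_transfer_clocks(longs: int, setup_clocks: int = 0) -> int:
    """Clocks needed to move `longs` longs at one long per clock."""
    if longs < 0 or setup_clocks < 0:
        raise ValueError("longs and setup_clocks must be non-negative")
    return setup_clocks + longs


def block_transfer_us(longs: int, clock_hz: float, setup_clocks: int = 0) -> float:
    """Same estimate expressed in microseconds for a given clock rate."""
    return block_transfer_clocks(longs, setup_clocks) / clock_hz * 1e6


if __name__ == "__main__":
    clocks = block_transfer_clocks(LUT_LONGS)
    print(clocks, "clocks to fill a whole LUT")
    print(block_transfer_us(LUT_LONGS, 80e6), "us at an assumed 80 MHz")
```

The same arithmetic would apply to the next cog's LUT if the mechanism is extended as described, since the transfer rate is unchanged.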
  • evanhevanh Posts: 16,042
    edited 2016-05-06 03:43
    Oh, cool, that makes sense. I had seen those SETQx instructions in the past but forgotten how they worked - blindly assumed they set up the Streamer.
  • ErNaErNa Posts: 1,752
    It is very unlikely that any feature not implemented now will hinder the success of the P2. Just as the P1 leaves a lot of resources unused, we should not try to squeeze out the last quantum. A solution is perfect if there is a way for improvement ahead!!
    Chip, just try to make demos to find hidden misbehavior. All of us together (sorry, I can't help) should now prepare to have in silicon what we have in an FPGA, like Peter Jakacki does. He's already bringing the P2 to application!
  • cgracey wrote: »
    Instead of using a max pin number to implement smart pins, I made a mask arrangement:
    	parameter altera_hub_mask = 14'h3FFF, cogs = 16, smartpins = 64'hFC00_0000_FFFF_FFFF;	// Prop123-A9	1024K	16 cogs  38 smart pins
    //	parameter altera_hub_mask = 14'h1FFF, cogs =  8, smartpins = 64'hFC00_0000_0000_FFFF;	// Prop123-A7	512K	 8 cogs  22 smart pins
    //	parameter altera_hub_mask = 14'h0FFF, cogs =  8, smartpins = 64'h0000_0000_0000_00FF;	// DE2-115	256K	 8 cogs   8 smart pins
    //	parameter altera_hub_mask = 14'h01FF, cogs =  1, smartpins = 64'h0000_0000_0000_00FF;	// DE0-Nano	32K	 1 cog    8 smart pins, no cordic
    

    To finish out RDLUTN/WRLUTN, I should add support for egg-beater loading (1 clock per long) using a new SETQ3+RDLONG/WRLONG/WMLONG arrangement. This doesn't take much, at all. What I DON'T want to do is get into extra LUT-exec issues, as that blows up the execution map again and I don't see any real point in it. Do you guys?

    I realise this is for FPGA image generation, but I hope you haven't forgotten the equivalent mask for real silicon. I'd hate to think you might actually generate the wrong Verilog for the silicon.
        	parameter altera_hub_mask = 14'h1FFF, cogs = 16, smartpins = 64'hFFFF_FFFF_FFFF_FFFF;	// Prop silicon	512K	16 cogs  64 smart pins
    

    Or perhaps you use a different top file for the real silicon?
  • cgracey wrote: »
    jmg wrote: »
    So this extra 'next COG's LUT' memory that a COG can now see, has what limitations ?

    You are saying code exec from that is not supported ? What about exec from your own LUT ?
    Streamer ?
    Memory arrays ?

    To allow LUT-exec from the next cog's LUT, we'd have to redefine the memory map for code execution. I don't see any utility in that. Fast loading and storing the next cog's LUT from/to hub does seem like something useful, though.

    The utility of next-cog LUT-exec is for when you require more high-speed deterministic code, without branch address penalties, than cog+LUT by themselves can provide. It would give access to a lot of fast cog registers for data storage and 2 x LUT size for code.

    Would it mess up the code execution memory map? How many programs are likely to run cog code to $1ff, then through LUT to $3ff, and then carry on at $400 (long address $400 == byte address $1000 in hub RAM)? I would think nearly all, except those who wish to test the execute-through-linear-memory feature, will execute a jump.
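The linear execution map described here can be sketched as a simple address classifier. This is only an illustration in Python of the ranges named in the post (cog to $1ff, LUT to $3ff, hub from $400 as a long address); the function names are hypothetical and nothing here reflects the actual Verilog.

```python
# Illustrative model of the execution map as described in the post:
#   $000..$1ff  cog registers
#   $200..$3ff  LUT
#   $400..      hub RAM (long address $400 is byte address $1000)

COG_LAST = 0x1FF
LUT_LAST = 0x3FF


def exec_region(long_addr: int) -> str:
    """Return which memory an instruction at this long address is fetched from."""
    if long_addr < 0:
        raise ValueError("address must be non-negative")
    if long_addr <= COG_LAST:
        return "cog"
    if long_addr <= LUT_LAST:
        return "lut"
    return "hub"


def long_to_byte_addr(long_addr: int) -> int:
    """Convert a long (32-bit word) address to a byte address."""
    return long_addr * 4


if __name__ == "__main__":
    for addr in (0x1FF, 0x200, 0x3FF, 0x400):
        print(hex(addr), exec_region(addr))
    print(hex(long_to_byte_addr(0x400)))
```

Code that runs straight through $1ff into $200, or through $3ff into $400, changes fetch source without a jump, which is the case the post argues is rare in practice.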
  • Cluso99Cluso99 Posts: 18,069
    Chip,
    Just briefly catching up before I go off grid again :(

    If a cog could LUT-exec from the adjacent cog's LUT, that would permit one cog to have 1536 instead of 1024 longs of LUT/RAM space. It also means a cooperating (adjacent) cog could perform overlay loading for the larger cog.
    The only real cost is an extra address bit for cog/LUT, and a larger hole in hub ram. However, if hub wraps at 512KB as previously indicated, then in reality the 512KB block is just shifted up by the cog/LUT size.
    There is some real benefit being able to run LUT exec from the additional lut/aux memory if it is not too much trouble. It would also permit larger LUT stack(s).
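To make the proposal concrete, the sketch below models one possible extended map in Python. Placing the adjacent cog's LUT at $400..$5ff and starting hub execution at $600 is an assumption made for illustration; the post only says that an extra address bit and a larger hole in hub RAM would be needed, and does not give a layout.

```python
# Hypothetical extended execution map for the proposal above.
# Assumed layout (not from the source):
#   $000..$1ff  cog registers
#   $200..$3ff  own LUT
#   $400..$5ff  adjacent cog's LUT
#   $600..      hub RAM

REGIONS = [
    ("cog", 0x000, 0x1FF),
    ("lut", 0x200, 0x3FF),
    ("next_lut", 0x400, 0x5FF),  # assumed placement of the adjacent LUT
]


def exec_region_extended(long_addr: int) -> str:
    """Classify a long address under the assumed extended map."""
    if long_addr < 0:
        raise ValueError("address must be non-negative")
    for name, first, last in REGIONS:
        if first <= long_addr <= last:
            return name
    return "hub"


def local_exec_longs() -> int:
    """Total longs executable without going to hub RAM."""
    return sum(last - first + 1 for _, first, last in REGIONS)


def address_bits_needed() -> int:
    """Bits required to address the highest non-hub location."""
    return max(last for _, _, last in REGIONS).bit_length()


if __name__ == "__main__":
    print(local_exec_longs(), "longs of cog/LUT space")
    print(address_bits_needed(), "address bits")
```

Under this layout the non-hub space grows from 1024 to 1536 longs, and the highest non-hub address needs 11 bits rather than 10, which matches the extra address bit mentioned above.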
  • jmgjmg Posts: 15,175
    Cluso99 wrote: »
    If a cog could LUT-exec from the adjacent cog's LUT, that would permit one cog to have 1536 instead of 1024 longs of LUT/RAM space. It also means a cooperating (adjacent) cog could perform overlay loading for the larger cog.
    The only real cost is an extra address bit for cog/LUT, and a larger hole in hub ram. However, if hub wraps at 512KB as previously indicated, then in reality the 512KB block is just shifted up by the cog/LUT size.
    There is some real benefit being able to run LUT exec from the additional lut/aux memory if it is not too much trouble. It would also permit larger LUT stack(s).

    I'm still hoping to see some info on what the features/limits/costs of the LUT extend actually are.
    * What the Memory Map looks like ?
    * What is excluded by the hardware ?
    * How much logic this actually costs ?
    * What opcodes can make use of the extension ?
