Assigning unused cog's hub slots to a "supercog" - Page 2 — Parallax Forums

Assigning unused cog's hub slots to a "supercog"

Comments

  • Bill HenningBill Henning Posts: 6,445
    edited 2014-08-17 17:08
    Sounds like a great start!

    LMM and Spin performance will be getting a huge boost from that.
  • jmgjmg Posts: 15,183
    edited 2014-08-17 17:32
    .. I chose instead the simpler method of gating the output of the bus select and the cog enables back to cog 0 so it's a completely automatic method and if we increased the number of cogs beyond eight for instance then it still works the same. So if the bus select sequence goes to select cog 4 and it's not enabled then it just selects cog 0 instead, simple! ...

    Sounds good. Anyone wanting to check or control bandwidth ceilings can simply enable the COGs they know they will be using later, and if you enable only the ODD COGs, then C0 gets 50% of the HUB bandwidth.

    Maybe COGSTOP could be extended a little to include a Pause/Unpause, which could even be RUN and HUB directing, i.e. a master COG can briefly remove another COG from the HUB slots, then restore it.
    This is invisible to the slaved COG code, except it gets more HUB jitter. Many apps could tolerate that.
    Would mostly be used for tightly co-operating COGs.
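
The gating rule quoted above is easy to sanity-check in software. The following is a minimal Python sketch, not the Verilog itself; the names `gated_slot_owner` and `bandwidth_share` are made up for illustration, and it assumes eight slots with one slot per cog number.

```python
from collections import Counter

def gated_slot_owner(slot, enabled, fallback=0):
    """Return the cog that gets this hub slot: the slot's own cog if it is
    enabled, otherwise the fallback cog (cog 0 in the scheme quoted above)."""
    return slot if enabled[slot] else fallback

def bandwidth_share(enabled, fallback=0):
    """Fraction of hub slots each cog receives over one full scan."""
    n = len(enabled)
    owners = Counter(gated_slot_owner(s, enabled, fallback) for s in range(n))
    return {cog: count / n for cog, count in sorted(owners.items())}

# Only the odd cogs are marked enabled; every even slot falls through to cog 0.
odd_only = [False, True, False, True, False, True, False, True]
print(bandwidth_share(odd_only))
```

With only the odd cogs enabled, slots 0, 2, 4 and 6 all fall through to cog 0, which is where the 50% figure comes from.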
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-08-18 05:18
    1000 slots? Hmmm in many ways that would be really nice, but it would be a fair number of gates. 1024 would be nicer :)

    Actually if the number of gates was there, the more slots the merrier. My back of the envelope calculations suggested there should be a minimum of four slots per cog, ideally 8 or more. This was based on expected bandwidth usage of low-to-medium bandwidth peripherals, vs. bandwidth hungry video and byte code / vm / lmm / hubexec cogs.

    The best argument I remember that was made against many slots was the lower fMax due to the additional lookup.

    What I really like about the Verilog release is that we can all try many different ideas!

    Heck, I think it is only a matter of time before someone grafts on a Wishbone bus, and an FPU.

    Shouldn't affect fMax at all should it? 1) We're talking about the hub cycle time. The important fMax is about the COG cycle time. 2) The vector indirection could be pipelined so indirection would essentially take zero time. The COG number (address) would be available when needed.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-08-18 07:50
    I was concerned that it might bump the hub cycle time enough so 16 clock cycles would not be enough for a hubop (+ two regular ops) before next hub op. If the additional lookup fits in the current scheme nicely with room to spare, great.

    My concern was based on an estimate of an additional ~5 ns to do the slot lookup. With an fMax of 160 MHz, we may or may not have the 5 ns to spare ... and I'd hate to see the fMax drop to 100 MHz.

    If we can hide the delay in the pipeline, without affecting fMax, that would be nice indeed. Hope we can!

    Hmm.. quick calculation:

    1024 entry 6 bit mapping table = 6144 bits = 36864 transistors (plus some more decoding logic)... does not seem like a huge number relative to total transistors in a P1V, so it may be feasible.
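
Both back-of-the-envelope figures in this post can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: six transistors per stored bit (which is what 36864 / 6144 implies), and the pessimistic case where the whole 5 ns lookup lands unpipelined on the critical path. The function names are illustrative.

```python
def table_cost(entries, bits_per_entry, transistors_per_bit=6):
    """Return (total bits, estimated transistors) for a mapping table.
    transistors_per_bit=6 is an assumption (a 6T SRAM cell)."""
    bits = entries * bits_per_entry
    return bits, bits * transistors_per_bit

def fmax_after_extra_delay(fmax_mhz, extra_ns):
    """New fMax in MHz if extra_ns is added to the critical path (assumed
    worst case: no pipelining hides the extra delay)."""
    period_ns = 1000.0 / fmax_mhz
    return 1000.0 / (period_ns + extra_ns)

bits, transistors = table_cost(1024, 6)
print(bits, transistors)                           # 6144 36864
print(round(fmax_after_extra_delay(160, 5), 1))    # 88.9
```

So an unhidden 5 ns would cost far more than the drop feared above, which is why the pipelining question matters so much.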
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-08-18 10:43
    Bill Henning: My concern was based on an estimate of an additional ~5 ns to do the slot lookup. With an fMax of 160 MHz, we may or may not have the 5 ns to spare ... and I'd hate to see the fMax drop to 100 MHz.

    Remember, we're talking electronics here, not software. All this cog indirect addressing can go on in parallel with everything else and when we need the cog address, it will be there ... at no delay cost at all.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-08-18 12:17
    The way I saw the timing diagrams, the single ported cog memory will have to be read three times to execute an instruction - once for the opcode fetch, once for the src fetch, and once for the dst fetch, then a fourth cycle to write the result.

    Adding indirection for reads, using any arbitrary cog memory location, will add two fetches to (the single ported) cog memory. Indirection for writes would also add a cycle.

    Now if we had special-purpose registers - like INDA & INDB in the previous P2 design - it should be possible to use them to index the cog memory instead of the src and dst fields, however that will add an extra level of multiplexers.

    I simply don't know how many ns that multiplexing will add. I do note that Chip added the INDS instruction precisely due to INDA/INDB with pre/post increment impacting fMax.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-08-18 14:06
    Bill Henning wrote: »
    1024 entry 6 bit mapping table = 6144 bits = 36864 transistors (plus some more decoding logic)... does not seem like a huge number relative to total transistors in a P1V, so it may be feasible.
    OT: IIRC that's only about 5 * Z80 ;)
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-08-18 14:14
    Bill Henning wrote: »
    The way I saw the timing diagrams, the single ported cog memory will have to be read three times to execute an instruction - once for the opcode fetch, once for the src fetch, and once for the dst fetch, then a fourth cycle to write the result.

    Adding indirection for reads, using any arbitrary cog memory location, will add two fetches to (the single ported) cog memory. Indirection for writes would also add a cycle.

    Now if we had special-purpose registers - like INDA & INDB in the previous P2 design - it should be possible to use them to index the cog memory instead of the src and dst fields, however that will add an extra level of multiplexers.

    I simply don't know how many ns that multiplexing will add. I do note that Chip added the INDS instruction precisely due to INDA/INDB with pre/post increment impacting fMax.

    Can you translate this code from hub.v for me? It refers to cog[n-1] in the comment. If the indirect slot vector is "sv", couldn't we replace cog[n-1] with cog[sv[n-1]] (pseudo code) and expect all the busses to hook up just as they do now with the hard-wired sequence of slots? What am I missing?

    // connect hub memory to signals from cog[n-1]

    wire        mem_w  = ec && ~&sc && wc;

    wire  [3:0] mem_wb = sc[1] ? 4'b1111                      // wrlong
                       : sc[0] ? ac[1] ? 4'b1100 : 4'b0011    // wrword
                       :         4'b0001 << ac[1:0];          // wrbyte

    wire [31:0] mem_d  = sc[1] ? dc                           // wrlong
                       : sc[0] ? {2{dc[15:0]}}                // wrword
                       :         {4{dc[7:0]}};                // wrbyte

    wire [31:0] mem_q;

    hub_mem hub_mem_ ( .clk_cog (clk_cog),
                       .ena_bus (ena_bus),
                       .w       (mem_w),
                       .wb      (mem_wb),
                       .a       (ac[15:2]),
                       .d       (mem_d),
                       .q       (mem_q) );
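
As a rough answer to the "translate this code" question, here is a behavioural Python model of the three expressions in the excerpt. It is a reading of the quoted lines only, not of the rest of hub.v; the meanings given to `ec`, `sc`, `wc`, `ac` and `dc` in the comments are inferred from how the excerpt uses them, and `sc` is assumed to be two bits wide. The function name is hypothetical.

```python
def hub_write_signals(ec, sc, wc, ac, dc):
    """Model mem_w, mem_wb, mem_d and the long address from the excerpt.

    Inferred meanings (assumptions): ec = hub request enable, sc = 2-bit
    size code, wc = write flag, ac = 16-bit byte address, dc = 32-bit data.
    """
    # mem_w = ec && ~&sc && wc : write only when sc is not 2'b11
    mem_w = bool(ec) and (sc & 3) != 3 and bool(wc)

    if sc & 2:                      # wrlong: all four byte lanes
        mem_wb = 0b1111
        mem_d = dc & 0xFFFFFFFF
    elif sc & 1:                    # wrword: upper or lower half, by ac[1]
        mem_wb = 0b1100 if ac & 2 else 0b0011
        half = dc & 0xFFFF
        mem_d = (half << 16) | half
    else:                           # wrbyte: one lane picked by ac[1:0]
        mem_wb = (0b0001 << (ac & 3)) & 0xF
        mem_d = (dc & 0xFF) * 0x01010101

    long_addr = (ac >> 2) & 0x3FFF  # ac[15:2]
    return mem_w, mem_wb, mem_d, long_addr

# wrbyte of 0xAB to byte address 3: lane 3 enabled, byte copied to all lanes
print(hub_write_signals(1, 0, 1, 0x0003, 0xAB))
```

Nothing in these lines names a particular cog: they only consume whichever `ec`/`sc`/`wc`/`ac`/`dc` set has already been selected, so the selection step appears to be the only part a slot vector would need to change.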

  • jmgjmg Posts: 15,183
    edited 2014-08-18 14:30
    Bill Henning wrote: »
    My concern was based on an estimate of an additional ~5 ns to do the slot lookup. With an fMax of 160 MHz, we may or may not have the 5 ns to spare ... and I'd hate to see the fMax drop to 100 MHz.

    If we can hide the delay in the pipeline, without affecting fMax, that would be nice indeed. Hope we can!
    A 1024 table may be excessive, but if you pipeline this and have the table as the only COG selector (no mode mux) then it should avoid speed impact.
    The table can INIT from a default file for 'normal scan', so from power up, it would look like a standard P1.
    Custom builds could INIT their own scan, to save code.

    A modulus wrap bit (also pipelined) would allow any jitter-free scan base.

    A single Cyclone V M10K block can do
    1K as x10 or x8
    2K as x5 or x4

    so there is silicon support there for quite comprehensive tables, if you can streamline filling them in!

    Addit: On the 'comprehensive' side of things, you could fit a Dual Table, and split the HUB.
    Each table now allocates a COG to (half?) the HUB map.
    Example: if one table is a simple 0..7 repeat, and the other is all 2, then all COGs get equally paced access to low HUB, and COG 2 has 100% access to upper HUB. It also has the 1/8 access to low HUB, for command packets.
    For full access, the same number goes in both columns, so the default loading would be Full access, 1/8 ratio.
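
The dual-table example can be tallied with a short Python sketch. It only counts slot ownership per half of hub memory; how the hub would actually be split in hardware is not modelled, and the function names are invented for this illustration.

```python
from collections import Counter

def shares(table):
    """Fraction of the slots in one table owned by each cog."""
    counts = Counter(table)
    return {cog: counts[cog] / len(table) for cog in sorted(counts)}

def dual_table_shares(low_table, high_table):
    """Slot shares for the low and high halves of hub memory."""
    if len(low_table) != len(high_table):
        raise ValueError("both tables must have the same number of slots")
    return shares(low_table), shares(high_table)

low = list(range(8))      # simple 0..7 repeat
high = [2] * 8            # every slot of the upper half goes to cog 2
low_share, high_share = dual_table_shares(low, high)
print(low_share[2], high_share[2])   # 0.125 1.0
```

Loading the same 0..7 pattern into both tables gives every cog 1/8 of both halves, which matches the default described above.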
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-08-18 14:39
    "couldn't we replace cog[n-1] with cog[sv[n-1]] (pseudo code)"

    Yes, we can replace that (as long as table initialization is also added) - however after compilation, that will add additional logic, and a lookup to implement the indirection, which translates to a delay of <unknown>ns due to the lookup in the mapping table.

    How much this would impact fMax is the (pardon the pun) $64,000 question.

    My guess would be that the fastest way to do this would be to (wastefully) use a bram block, in which case, the access time for looking up the mapping would be whatever the bram's access time is.

    If I remember other posts correctly, people are currently running the P1V @ 140 MHz, and the compiles report an fMax of 160 MHz.

    I am actually quite curious what the fMax would be with slot mapping, and how that would impact the practical upper limit for running P1V.

    If the slot mapping table was small, building it out of LE's may make it faster, however that would chew up a LOT of LE's for a large mapping table.

  • jmgjmg Posts: 15,183
    edited 2014-08-18 15:28
    Bill Henning wrote: »
    I am actually quite curious what the fMax would be with slot mapping, and how that would impact the practical upper limit for running P1V.

    If the slot mapping table was small, building it out of LE's may make it faster, however that would chew up a LOT of LE's for a large mapping table.
    The present scanner will be a set of D-FF, so if you follow the RAM table read, with D-FF, you should get identical downstream timing. That makes the RAM read one clock earlier, but as it is a scanned table, that does not matter.
    The A2 has 176 M10K BRAMs
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-08-18 15:55
    "couldn't we replace cog[n-1] with cog[sv[n-1]] (pseudo code)"

    Yes, we can replace that (as long as table initialization is also added) - however after compilation, that will add additional logic, and a lookup to implement the indirection, which translates to a delay of <unknown>ns due to the lookup in the mapping table.

    How much this would impact fMax is the (pardon the pun) $64,000 question.

    My guess would be that the fastest way to do this would be to (wastefully) use a bram block, in which case, the access time for looking up the mapping would be whatever the bram's access time is.

    If I remember other posts correctly, people are currently running the P1V @ 140 MHz, and the compiles report an fMax of 160 MHz.

    I am actually quite curious what the fMax would be with slot mapping, and how that would impact the practical upper limit for running P1V.

    If the slot mapping table was small, building it out of LE's may make it faster, however that would chew up a LOT of LE's for a large mapping table.

    Why not just use memory? If the HUB can do a RDWORD or WRWORD at a COG stop, a separate process could certainly do a read from the slot vector to determine the next COG in the slot sequence. If it doesn't have time to do this, the process could be pipelined. Assuming is never good in these cases, but experimentation can easily verify whether using FPGA memory makes more sense than using ALMs ... unless of course using memory just means using ALMs.

    My compile for my Cyclone V shows I've used 27% of my ALMs (out of 29,080) and only 14% of my memory (out of 4.5+ Mbits). If I'm not mistaken, you can specify the number of bits in each memory slot, so 4 bits would make sense for a 16 COG implementation, and the length of the vector could be arbitrarily large (limited by the number of bits in the indexer and modulus). All this goes on in parallel with everything else that is going on, so there is no performance hit whatever.

    Obviously I'm leaving out how this memory gets populated, but that shouldn't be rocket science. In the end it comes down to controlling a 16-input multiplexer for a 16 COG implementation. Again, what am I missing?
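
To make the memory-based slot vector concrete, here is a small behavioural model in Python. It is a sketch of the idea as described in this post, not a hardware design: the class name `SlotVector`, its methods, and the choice to power up with a round-robin pattern are all assumptions made for illustration.

```python
class SlotVector:
    """Behavioural model of an indexed slot table with a settable modulus."""

    def __init__(self, n_cogs=16, length=1024):
        self.n_cogs = n_cogs
        # bits needed to store one cog number, e.g. 4 bits for 16 cogs
        self.entry_bits = max(1, (n_cogs - 1).bit_length())
        # assumed power-up contents: plain round robin, like a stock hub
        self.slots = [i % n_cogs for i in range(length)]
        self.modulus = n_cogs
        self.index = 0

    def memory_bits(self):
        """Total bits of table memory."""
        return len(self.slots) * self.entry_bits

    def load(self, pattern):
        """Write a new scan pattern and set the modulus to its length."""
        if not 0 < len(pattern) <= len(self.slots):
            raise ValueError("pattern does not fit in the slot memory")
        if any(not 0 <= cog < self.n_cogs for cog in pattern):
            raise ValueError("cog number out of range")
        self.slots[:len(pattern)] = pattern
        self.modulus = len(pattern)
        self.index = 0

    def next_cog(self):
        """Return the cog for the current slot and advance the index."""
        cog = self.slots[self.index]
        self.index = (self.index + 1) % self.modulus
        return cog

sv = SlotVector(n_cogs=16, length=1024)
print(sv.memory_bits())                      # 4096
print([sv.next_cog() for _ in range(16)])    # 0..15, then it repeats
sv.load([0, 1, 0, 2, 0, 3])                  # cog 0 gets every other slot
print([sv.next_cog() for _ in range(8)])     # [0, 1, 0, 2, 0, 3, 0, 1]
```

The model says nothing about timing; whether the lookup can be hidden in the pipeline is exactly the open question in the posts above.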
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2014-08-18 22:25
    As you know I have been running the P1V @ 120 MHz, and with Tachyon loaded and running its network servers I have run some tests. Downloading a 1.44 MB file takes 29 seconds on an 80 MHz P1 with a W5200 vs 17 seconds on the P1V. The P1V is really only running everything from cog 0, with cog 1 handling serial receive, cog 2 running background timing functions, and cog 3 in an idle loop. So that leaves 4 extra hub slots for cog 0.

    The 120 MHz speed should only run at 150% of the 80 MHz speed, but the supercog boosts this to 170%. The actual SPI transfers are only done by bit-bashing; they could obviously be done using the counters, although I think this would not have that much of an impact in itself and would require a dedicated cog for that too. Most files under 100K transfer in less than a second or so.


    EDIT: Actually I just realized that this P1V config is only giving cog0 one extra slot, I will retest this with the correct file later.

    P1 @80MHz (W5200)
    Command: RETR P8X32A.PDF
    Response: 150 Accepted data connection for P8X32A.PDF
    Response: 226 File successfully transferred
    Status: File transfer successful, transferred 1,442,886 bytes in 29 seconds

    vs P1V @120MHz with 4 cogs disabled (W5500)
    Command: RETR P8X32A.PDF
    Response: 150 Accepted data connection for P8X32A.PDF
    Response: 226 File successfully transferred
    Status: File transfer successful, transferred 1,442,886 bytes in 17 seconds
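
The percentages can be checked directly from the two FTP logs. The Python below is plain arithmetic on the reported numbers; the variable names are illustrative, and the whole-second timings mean the results are only approximate.

```python
def throughput_kb_per_s(n_bytes, seconds):
    """Average transfer rate in kB/s (1 kB = 1000 bytes)."""
    return n_bytes / seconds / 1000.0

FILE_BYTES = 1_442_886
p1_s, p1v_s = 29, 17              # seconds reported by the FTP client
p1_mhz, p1v_mhz = 80, 120

speedup = p1_s / p1v_s                    # overall speed-up
clock_ratio = p1v_mhz / p1_mhz            # part explained by the clock alone
beyond_clock = speedup / clock_ratio      # what is left over

print(round(throughput_kb_per_s(FILE_BYTES, p1_s), 1))    # 49.8
print(round(throughput_kb_per_s(FILE_BYTES, p1v_s), 1))   # 84.9
print(round(speedup, 2), clock_ratio, round(beyond_clock, 2))  # 1.71 1.5 1.14
```

So roughly a 1.14x factor remains after the clock increase is accounted for. The two runs also used different Ethernet chips (W5200 vs W5500), so that leftover factor cannot be attributed to the extra hub slots alone.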

    I intend to leave this up and running when I'm not modifying it so it's accessible via TELNET/FTP/HTTP.

    Here's a shot of my rough n ready setup:
    [Attached image: 2014 - 1.jpg, 1024 x 728, 91K]
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-08-19 07:57
    msrobots wrote: »
    I think the performance boost is not as big as anybody thinks.

    By now you have 2 instructions between Hub instructions. Freeing up more slots for cog0 will not really gain much. Maybe the first one. But after more than one free slot?

    What do you want to do with those slots? One rdlong after the other, without processing of the data?

    Maybe I do have some gross misunderstanding here on my side, but I do not really see any speed improvement gained thru more slots. Maybe with one more slot. OK. but more?

    Please enlighten me!

    Mike
    A COG doing a RD/UPDATE/WRITE has to wait for 3 cycles around the HUB, minimum. If the UPDATE takes more than 16 clocks, it's more cycles. With a long indirected vector of slots, the COG could be seeded in the vector so it would never have to wait: when it needed HUB access, it could be the next COG in the sequence. With the P2 there were 16 COGs, not 8. That meant even more latency when communicating with the HUB ... and thus with other COGs. Indirecting the COGs in a vector of slots with a settable modulus for the indexing costs almost nothing in hardware and absolutely nothing in performance. Out of the box, with a modulus of 8 or 16, it performs identically to the P1 and the proposed P2. Seeding all slots with COG0 and then reseeding for COGs needing service yields the Supercog model. And with a long slot vector (say 1024 slots), virtually any kind of precise timing could be obtained for RD/Update/WR processing.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-08-19 12:41
    The P1 slots take 2 clocks vs 1 clock in the P2.
    So an immediate improvement would be to get 1 clock hub cycles working.
    Just a suggestion.
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-08-19 13:22
    Cluso99 wrote: »
    The P1 slots take 2 clocks vs 1 clock in the P2.
    So an immediate improvement would be to get 1 clock hub cycles working.
    Just a suggestion.
    And the RD and WR don't take 8 cycles either. Just a suggestion.
  • jmgjmg Posts: 15,183
    edited 2014-08-19 15:18
    ... And with a long slot vector (say 1024 slots), virtually any kind of precise timing could be obtained for RD/Update/WR processing.
    Yes, but if precise includes jitter free, then some modulus control is also needed in the Slot table.
    Very easy to add, and comes almost for free - a ScanCtr Clear bit per slot is one simple way.
    With P1V, a simple use case of this, is to have the FPGA-config pre-load the table, and leave run-time loading for later.
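
A quick model shows why a per-slot clear bit matters for jitter. This Python sketch assumes a 1024-entry table of (cog, clear) pairs and a scan counter that returns to entry 0 after an entry with the clear bit set; the names `scan` and `access_gaps` are hypothetical.

```python
def scan(table, n_slots):
    """Return the cog selected in each of n_slots consecutive hub slots.

    table is a list of (cog, clear) pairs. The scan counter goes back to
    entry 0 after an entry whose clear bit is set, or after the last entry.
    """
    sequence, idx = [], 0
    for _ in range(n_slots):
        cog, clear = table[idx]
        sequence.append(cog)
        idx = 0 if (clear or idx == len(table) - 1) else idx + 1
    return sequence

def access_gaps(sequence, cog):
    """Distinct distances, in slots, between successive accesses by cog."""
    hits = [i for i, c in enumerate(sequence) if c == cog]
    return sorted({b - a for a, b in zip(hits, hits[1:])})

LENGTH = 1024
# A 3-slot rhythm (cogs 0, 1, 2) tiled over the whole table, no clear bits.
tiled = [(i % 3, False) for i in range(LENGTH)]
# Same contents, but entry 2 carries the clear bit: the scan is 0,1,2,0,1,2,...
cleared = list(tiled)
cleared[2] = (2, True)

print(access_gaps(scan(tiled, 3000), 0))    # [1, 3]: jitter at the table wrap
print(access_gaps(scan(cleared, 3000), 0))  # [3]: constant spacing
```

Because 1024 is not a multiple of 3, the tiled pattern hiccups every time the counter wraps; the clear bit makes the effective table length 3 and removes the hiccup.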
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-08-19 15:28
    jmg wrote: »
    Yes, but if precise includes jitter free, then some modulus control is also needed in the Slot table.
    Very easy to add, and comes almost for free - a ScanCtr Clear bit per slot is one simple way.
    With P1V, a simple use case of this, is to have the FPGA-config pre-load the table, and leave run-time loading for later.
    Agreed.
  • msrobotsmsrobots Posts: 3,709
    edited 2014-08-20 19:26
    @Peter,

    so you gained 20% with one more hub access? That is quite impressive. I do like the simplicity of your way to attack Verilog: if a slot is not used, it goes back to cog0. Simple and easy to use for any software. It does not even need to know about it.

    Where I am still confused is what the cog0 can really do with a variable number of hub cycles so I hope you can get this running!
    The speed up may be language dependent but if running it's just testing and checking against real P1.

    I can't play along with you guys yet because I am waiting for the Parallax FPGA. So just lurking here...

    Enjoy!

    Mike
  • msrobotsmsrobots Posts: 3,709
    edited 2014-08-20 19:59
    @Todd,

    I think I do understand your proposed hub access system. I hope so.

    Your plan is a finer gradation between 8 (or 16) cogs. So instead of 1 of 8 (16) you propose something like 1 of 1024, or 512 of 1024, evenly spread of course, or spread in bunches of slots? What a nightmare.

    If you talk about read/modify/write in one predictable circle, you need to capture ALL hub events between the read by your cog and the write of the final result, thus barring any other cog from executing any hub op? Why?

    You could use LOCKS on the propeller. The read/modify/write problem is exactly why locks exist on the current P1. So the rest of the multi processor chip is not blocked by your read/modify/write problem.

    All my applications on the P1 are using mailboxes for communication between parallel running processes. That is basically how P1 programming works in any language available for it.

    That is the CORE of the Propeller. Either you like it or not. Prioritizing Hub access by some external table might get you some gains in reaction time (up to 16 clocks?) for a single hub access, but what have you gained?

    Your code in the cog still has to execute. By now you have 2 PASM instructions between hub accesses. Let's say you reduce it to one. Now you can transfer a byte/word/long 1/3 faster than before. That might work for some applications. Now save the next PASM instruction by giving one more hub slot. You end up with RDLONGs (bytes, whatever) without processing the data at all. GREAT!

    So even in PASM a cog can not do anything useful with more than two slots.

    So why try to granulate into 1024 levels when 3 already break the ability of the chip?

    confused!

    Mike
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-08-21 00:53
    Mike,
    You are missing the opportunity of hubexec mode. Here, without internal cache, the cog can and will run faster with more hub cycles (ie closer together).
    With normal code, more hub cycles will likely result in lower latency, which will improve the program's performance.
    However, I am not advocating for any complex hub slot allocation, just a simple allocation.
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-08-25 15:21
    @msrobots:

    We'll see. With the ability to implement the technique on the FPGA for a P1, I can prove or disprove it for myself. My initial concern was a 16 COG P2 (not knowing it took 1 cycle per COG rather than 2 as in the P1). That told me I was stuck with 32 cycles between a RD and a WR. And worse ... 64 cycles if I was doing an update. RDs and WRs don't take 32 cycles. But the built-in latency makes it look like they take 32 cycles (or more properly 16 on the P2 with 16 COGs).

    It seemed to me I could write code that did a RD, did some analysis, and did a WR, all in one round (maybe 2 or 3 in 1 RR), if I could just put myself in the right "time slots". I could dial in my own latency or achieve zero latency. And what did I give up for the privilege? Nothing! I have a plain vanilla P1 if I set up 8 slots and seed them 0.1.2.3.4.5.6.7. I have the so-called super COG0 if I seed them 0.0.0.0.0.0.0.0 and then place COGIDs as I start up cogs (and set slots back to 0 when I shut them down).

    And since I visualized an indirectly addressed memory implementation (rather than the "shifting select" Chip uses), the slot vector can be arbitrarily long (I said nominally 1000 ... binary wired people like powers of 2 even if not doing Boolean things, so someone bumped it to 1024), with a modulus which sets the virtual length of the memory. That would allow some pretty difficult lowest common denominator spacing to be implemented.

    As I said, we'll see.
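
The "dial in my own latency" idea can be illustrated by counting slots between a cog's read and its write. The Python sketch below measures time in whole hub slots only and ignores instruction timing inside the cog; the `tuned` table and both function names are hypothetical examples, not something proposed verbatim in the thread.

```python
def next_slot(table, cog, start):
    """First slot number >= start that belongs to cog (the table repeats)."""
    n = len(table)
    for t in range(start, start + n):
        if table[t % n] == cog:
            return t
    raise ValueError("cog has no slot in this table")

def read_to_write_slots(table, cog, work_slots):
    """Slots from the cog's read to its write, with work_slots of processing
    in between. The read is taken at the cog's first slot in the table."""
    rd = next_slot(table, cog, 0)
    wr = next_slot(table, cog, rd + max(1, work_slots))
    return wr - rd

round_robin = [0, 1, 2, 3, 4, 5, 6, 7]
# Hypothetical table: cog 6 is unused, so its slot is handed to cog 3.
tuned = [0, 1, 2, 3, 4, 5, 3, 7]

print(read_to_write_slots(round_robin, 3, 3))   # 8
print(read_to_write_slots(tuned, 3, 3))         # 3
```

With plain round robin, cog 3 waits a full turn of 8 slots before it can write; with a second slot placed 3 positions after the first, the write lands as soon as the 3 slots of work are done.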
  • msrobotsmsrobots Posts: 3,709
    edited 2014-08-25 22:43
    @Cluso99,
    Cluso99 wrote: »
    Mike,
    You are missing the opportunity of hubexec mode. Here, without internal cache, the cog can and will run faster with more hub cycles (ie closer together).
    With normal code, more hub cycles will likely result in lower latency, which will improve the program's performance.
    However, I am not advocating for any complex hub slot allocation, just a simple allocation.

    Everybody is throwing ideas around on increasing hub access, but I do not really see any reasonable way to use more hub cycles.

    I have not seen any hubexec Verilog/P1. This is by now about more/faster(?) hub access for a cog. So we are talking about PASM rd/wr/long/word/byte.

    In the current P1 the cog can execute 2 instructions between hub accesses, once synced. More Hub access points can just help for faster sync. As said before, one more might make sense, but even two more will reduce your PASM code to useless rd/wr whatever without the ability to process the data. You have maxed out the instructions of the cog.

    Why do people think that a cog could use more hub slots at all? For what exactly? How?

    @todd,

    Sorry to say that, but I think you have no clue at all how the current P1 works, and you should maybe write some PASM and look at the result on your scope. I am reading your posts carefully - having English as a second language - so I do not want to offend you at all. But to me it looks like you are proposing/hoping for things that are not possible at all.

    In a multi-core environment like the Prop, or a multi-threaded environment in a PC, or a multi-server environment like Google's server network, you do not have atomic read/modify/write scenarios. It can not happen without either locking the resource or STOPPING all other participants. Just not possible.

    If you want to read, inspect, modify and rewrite a resource, you need either to LOCK the resource or stop all other processes from doing anything. Otherwise your result may be hit by some other process. No way around this.

    This is not just true for a microcontroller like the P1, it is true even for databases or file systems. If you need to share a resource, you need to SHARE it. What can I say. It is obvious.

    You are still thinking in a single program flow. But we do not have one. It is a multi-core. The processes do not run sequentially but in parallel.

    Enjoy!

    Mike
  • jmgjmg Posts: 15,183
    edited 2014-08-25 23:43
    msrobots wrote: »
    Why do people think that a cog could use more hub slots at all? For what exactly? How?

    @todd,

    Sorry to say that, but I think you have no clue at all how the current P1 works, and you should maybe write some PASM and look at the result on your scope. I am reading your posts carefully - having English as a second language - so I do not want to offend you at all. But to me it looks like you are proposing/hoping for things that are not possible at all.
    These discussions are not about a P1 chip, but are about a Slot allocate idea that was talked about for P2.

    The present P1 silicon simply scans one COG after another, for HUB access, but that does not have to be so.

    It is easy enough to use another table method to allocate HUB slots - of course, there is no 'free lunch', but there are many combinations that can be more technically useful than a fixed 1:8 for everyone.
    A 1024 table is a slight overkill, but it fits within a single BRAM in an FPGA, and BRAM can be preloaded, so this can be easily trialed with no binary changes to P1 code (the revised COG allocation is fixed at power up, but an FPGA re-build can remap to suit the final application).
    If this all gives useful outcomes, then a more run-time change of table can be looked at.
  • msrobotsmsrobots Posts: 3,709
    edited 2014-08-26 13:04
    @Jmg,

    Yes I do understand that this is about the Verilog P1.

    But still - how can a cog USE more slots? For what? One RDLONG after the other would use a maximum of 3 slots, but can not do anything else with the Hub Data.

    So please give me one 'technically useful combination' of hub slots and a small piece of PASM benefitting from it.

    More slots do not clock the cog faster. And Todd's example of read, inspect, modify and write will need some instructions between RDxxx and WRxxx.

    I still can not see how PASM can gain anything from more Hub Slots.

    Enjoy!

    Mike
  • jmgjmg Posts: 15,183
    edited 2014-08-26 20:20
    msrobots wrote: »
    More slots do not clock the cog faster. And Todd's example of read, inspect, modify and write will need some instructions between RDxxx and WRxxx.

    The Prop manual says this
    ["To get the most efficiency out of Propeller Assembly routines that have to frequently access mutually-exclusive resources, it can be beneficial to interleave non-hub instructions with hub instructions to lessen the number of cycles waiting for the next Hub Access Window. Since most Propeller Assembly instructions take 4 clock cycles, two such instructions can be executed in between otherwise contiguous hub instructions."]
    A slot table gives you another control variable in this interleave process.
  • AribaAriba Posts: 2,690
    edited 2014-08-26 20:42
    The speed gain comes not so much from more hub accesses in the same time but from a better granularity than 16 clock cycles. A good example is the classical LMM loop, where you have 3 instructions between hub reads, so you always miss the hub slot by one instruction. This makes the loop half as fast as it would be with 2 instructions.
    A hub scheme that can access the hub every 4th cycle in the best case will run much faster (second column in the following code example):
                            cycles
    LMM   rdlong ins,pc     20    8
          add    pc,#4      4     4
    ins   nop               4     4
          jmp    #LMM       4     4
                           ----------
                            32    20
    

    Well-optimized PASM code that mostly uses the 2 free instructions between hub accesses (like the Spin Interpreter) will see only a minimal gain in speed.
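
The two column totals follow from a simple rounding rule, sketched here in Python. The model assumes a hub read costs 8 clocks when it hits its window exactly (the figure in the second column), that ordinary instructions cost 4 clocks, and that each loop pass must start on a hub window; `loop_clocks` is an illustrative name.

```python
def loop_clocks(n_plain_instr, hub_window, hub_min=8, instr_clocks=4):
    """Steady-state clocks per loop pass: one hub read plus n ordinary
    instructions, with the hub read only able to start every hub_window
    clocks. hub_min=8 is an assumption taken from the table's second column."""
    work = hub_min + n_plain_instr * instr_clocks
    windows = -(-work // hub_window)        # ceiling division
    return windows * hub_window

print(loop_clocks(3, 16))   # 32, first column of the table
print(loop_clocks(3, 4))    # 20, second column
print(loop_clocks(2, 16))   # 16, the two-instruction case
```

The 20 clocks of work in the LMM loop just overshoot one 16-clock window and get rounded up to two, which is the granularity effect described above.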

    Andy
  • roglohrogloh Posts: 5,852
    edited 2014-08-26 22:40
    Now we have access to the Verilog code of P1 maybe we can add an incrementer feature to bring the LMM loop down to one hub cycle (see example below). Given RDLONG takes at least 7 cycles, is there any time in there to increment the PC and write it back? I know regular 4 cycle instructions access COG RAM on every cycle, but maybe RDLONG has a few extra cycles where another D result register can be written back maybe while waiting for the RDLONG to return its result from the hub access...?

    eg.
    instr   NOP
            RDLONG instr, pc  WC  ' wc also indicates to increment PC by 4
            JMP #instr
    
  • AribaAriba Posts: 2,690
    edited 2014-08-26 22:56
    An Increment D by 4 and jump to (#)S instruction may be easier to implement. It's very similar to DJNZ only that it must add 4 to D and the condition test is not necessary.
    The PC increment then just happens at the JMP instead of the RDLONG in your LMM loop example.

    Andy
  • roglohrogloh Posts: 5,852
    edited 2014-08-26 23:45
    Yeah I was wondering if it would be easier to do the increment in the jump as well, given there is no writeback involved for that instruction. It would be nice to get LMM truly running at full hub frequency, using a minor Verilog mod and without jumping through hoops like bundling instructions, backwards PCs etc.