Very Simple HUBEXEC for New 16-Cog, 512KB, 64 analog I/O — Parallax Forums


Cluso99 Posts: 17,973
edited 2014-04-07 21:48 in Propeller 2
I think this thread is "Dead in the Water" - thanks Chip
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257855&viewfull=1#post1257855


Today Chip posted a new definition of the P16X32B. This post has therefore been renamed.
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip

HUBEXEC discussion for the new chip now continues from Post #10 below.


Here is what I proposed elsewhere. It is a simple, no frills method to implement HUBEXEC on a P1 2clock Cog.
It could be used on the proposed P16X32B / P32X32B chip, with 16-32 x P1 2clock C cogs.

Ignoring hub data accesses which will impact any scenario similarly...

An LMM loop uses 4 P1 instructions to execute the loop
:LMM    RDLONG  :INSTR, PC   ' fetch the instruction from hub
        ADD     PC, #4
:INSTR  nop                  'execute the fetched instruction
        JMP     #:LMM
With 2clock instructions, and hub access every 8 clocks, this loop could execute in 8 clocks.
Therefore, it would execute 1 hub instruction per hub access (8 clocks), and would use 100% power.

In my method, the instruction would be fetched from hub during the next hub cycle and executed directly. Any waits for the hub cycle would effectively be "idle" clocks resulting in virtually no power.

With 2clock instructions, and hub access every 8 clocks, this loop could execute in 2 clocks, so would "idle" for 6 clocks (3 instruction periods)
Therefore, it would execute 1 hub instruction per hub access (8 clocks), and would use ~25% of the LMM power, and same speed.

Now, if we increase the hub access rate to once every 4 clocks,
LMM would still use 8 clocks, so 1 access every 2 hub cycles, and would use 100% power (the reference).
Hubexec would now execute 2 instructions per 2 hub cycles, and would use 50% of the LMM power and run twice as fast.
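The figures above can be reproduced with a rough model. This is only a sketch: the function names are made up for illustration, and "power" is approximated as the fraction of clocks the cog spends executing rather than idling, which is an assumption and not measured data.

```python
# Toy model of the LMM vs hubexec comparison. Names are illustrative only.
CLOCKS_PER_INSTR = 2   # 2-clock cog, as assumed in the post
LMM_LOOP_INSTRS = 4    # RDLONG, ADD, the fetched instruction, JMP

def lmm(hub_period):
    """Return (hub instructions per clock, busy fraction) for the LMM loop."""
    loop_clocks = LMM_LOOP_INSTRS * CLOCKS_PER_INSTR
    # The RDLONG has to line up with a hub window, so the loop period is
    # rounded up to a whole number of hub windows (ceiling division).
    windows = -(-loop_clocks // hub_period)
    period = windows * hub_period
    return 1 / period, loop_clocks / period

def hubexec(hub_period):
    """Return (hub instructions per clock, busy fraction): one instruction per window."""
    return 1 / hub_period, CLOCKS_PER_INSTR / hub_period

for hub_period in (8, 4):
    lmm_rate, lmm_busy = lmm(hub_period)
    hx_rate, hx_busy = hubexec(hub_period)
    print(f"hub every {hub_period} clocks: speed x{hx_rate / lmm_rate:.0f}, "
          f"power {hx_busy / lmm_busy:.0%} of LMM")
```

With a hub window every 8 clocks the model gives the same speed at 25% of the LMM power; with a window every 4 clocks it gives twice the speed at 50%.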

By changing the jump/call/ret instructions to use relative mode if executed from hub, hubexec becomes simpler than LMM, and saves instructions (like pseudo FCALL etc) as well as increases speed. Bill would have the best info on these savings.

Adding in a jump/call/ret instruction to use absolute addressing would also substantially benefit Hubexec over LMM. 17 bit addresses are required for 512KB hub (128K longs). The call could place the return address in a fixed location together with Z & C flags. A single P1 opcode could be used for this.

The P2 utilised addresses <$200 ($800 in byte addressing) as being for cog mode, and equal/above for hub mode. The same would apply here.

Comments

  • Bill Henning Posts: 6,445
    edited 2014-04-06 21:02
    Ray,

    I suspect you missed my post of almost three hours ago, my friend.

    http://forums.parallax.com/showthread.php/155083-Consensus-on-the-P16X32B?p=1257101&viewfull=1#post1257101

    I posted a more general, tunable performance version of hubexec for P1+ there.

    Mooch would also help :)
  • Electrodude Posts: 1,440
    edited 2014-04-06 21:40
    Instead of hubexec for the P1, which will be an endless and impossible headache and, as Chip said, will completely overcomplicate everything, how about just a simple hubram->cogram block transfer mechanism? It could be implemented in a few extra hubop instruction S slots. CP_HYRDRAMAN (bomberman) from the hydra demo library is an excellent example of this block transfer but in software. It gets 60fps, which should be more than plenty for anything that needs LMM (no hardware driver should need LMM).

    sample P1E code:
    hubop  #$080,   #$010  'prepare to copy to cog starting at $080
    hubop  hubptr,  #$011  'prepare to copy from hubram starting at address in hubptr
    hubop  _64,     #$012  'actually copy 64 longs
    
    _64    long     64
    
    One long would be transferred every hub cycle using the same mechanism used for coginit. Unlike one-instruction-at-a-time LMM, this would have no overhead, and it would cache far more code than hubexec ever would if the compiler/human programmer is smart about pages.

    electrodude

    EDIT: thought about it some more
    You would only have to set up the cog ptr and length once, after that you would only need one hubop to specify the hub pointer and trigger the transfer. A stack could be done pretty easily in hardware but it would probably be best to do that in software.
  • Cluso99 Posts: 17,973
    edited 2014-04-06 21:40
    Bill,
    Perhaps you have the wrong link? There are no performance or power references there.
    However, I now think we are on the same page. Perhaps a few more might join us?
    I think we are starting to get closer to what the recommended PccX32B should look like.
  • Bill Henning Posts: 6,445
    edited 2014-04-07 06:01
    Hi Ray,

    On re-reading my post and yours, you are right, we are on the same page! We were attacking the same issue, from different ends in a way.

    Your post concentrated on cycle counts; mine concentrated on hub slot mapping and the minimum hubexec instructions needed to eliminate LMM. I did not do cycle calculations because, with the slot mapping, the cycle counts would be configurable (within limits) by the developer, and Chip has not yet posted about the exact cycle behavior of hub instructions on a P1E.

    With a "P1E" running at 200MHz, and 32 cogs, the map would initially give each cog a hub cycle every 32 clock cycles. Due to the 2 cycles per instruction, this would appear to the cog like one hub slot in every 16 instruction cycles (32 clock cycles) - just like a P1. That is unless the slot map was changed, which it would be, so that cogs that needed could get hub slots.

    Chip has not yet stated how many clock cycles hub instructions would take on a P1E, so I did not delve into further calculations yet.

    Assuming that the hubexec engine could use the hub long as soon as the window came around, executing the instruction straight from the long just read out of the hub into the internal "instruction register", a cog configured for 64/128 hub cycles would run instructions at the same rate as cog-only code! Obviously a hubexec at 32/128 hub slots would run at 1/2 cog native speed, 16/128 at 1/4 cog native speed, and 32/128 at a little over LMM speed.

    I did not include performance figures as in my scheme the performance is configurable, and Chip has not released any data on power utilization on such a cog that would allow me to make power calculations.
  • Brian Fairchild Posts: 537
    edited 2014-04-07 06:46
    With a "P1E" running at 200MHz...the 2 cycles per instruction...16/128 at 1/4 cog native speed...

    So, just to get this straight in my head...

    At 200MHz main clock, a P1E COG would run at 100MIPS from COG RAM and eight of them could run at 25MIPS each from HUBRAM using your slot allocator mechanism.
  • Bill Henning Posts: 6,445
    edited 2014-04-07 06:56
    Yes!

    Chip's proposed P1E's use a 200MHz clock, with 2 clock cycles per instruction, and as posted, does not have an instruction cache and data cache like the P2.

    (because he is reducing cog memory from four ports on the P2 to two ports on the P1E to save transistors)

    So, by using the slot allocation mechanism:

    - setting a cog's access pattern to 64/128, one cog could run at 100MIPS from the hub ram, leaving 64 slots for the other cogs
    - setting a cog's access pattern to 32/128, three cogs could run at 50MIPS from the hub ram, leaving 32 slots for the other cogs
    - setting a cog's access pattern to 16/128, six cogs could run at 25MIPS from the hub ram, leaving 32 slots for the other cogs
    - setting a cog's access pattern to 16/128, eight cogs could run at 25MIPS from the hub ram, leaving no slots for the other cogs (the case you were interested in)

    Adding the "Mooch" mode would allow cogs that don't need to be strictly deterministic to get more slots (the ones assigned to cogs that won't use them)
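The slot arithmetic in the list above can be checked with a few lines. This is a sketch under stated assumptions: each of the 128 table slots is taken to be one clock wide (which is what makes 64/128 equal native speed at 2 clocks per instruction), and the function names are invented for illustration.

```python
CLOCK_HZ = 200_000_000   # proposed P1E clock from the thread
TABLE_SLOTS = 128        # slot table length discussed above
CLOCKS_PER_INSTR = 2

def hubexec_mips(slots):
    """MIPS for a cog that executes one hub instruction per assigned slot."""
    native_mips = CLOCK_HZ / CLOCKS_PER_INSTR / 1e6
    # Assumed: one slot per clock; a cog can never beat its native rate.
    return min(native_mips, CLOCK_HZ * slots / TABLE_SLOTS / 1e6)

def slots_left(cogs, slots_each):
    """Slots remaining for other cogs; raises if the table is over-committed."""
    used = cogs * slots_each
    if used > TABLE_SLOTS:
        raise ValueError("slot table over-committed")
    return TABLE_SLOTS - used

for cogs, slots in ((1, 64), (3, 32), (6, 16), (8, 16)):
    print(f"{cogs} cog(s) at {slots}/128: {hubexec_mips(slots)} MIPS each, "
          f"{slots_left(cogs, slots)} slots left")
```

The same function gives 23.4375 MIPS for a cog holding 15/128 of the slots.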
  • Brian Fairchild Posts: 537
    edited 2014-04-07 07:12
    Yes!

    Good, glad I've got it.

    So if a 16 COG P1+ appears I can have 8 COGs acting as very capable intelligent peripherals accessing HUB RAM at 1/128 to exchange data and 8 COGs running out of HUB at 15/128 or around 23.5 MIPS each! So that's a chip with a pile of intelligent peripherals and the equivalent of 8 typical 8-bit processors (except they're 32-bit) all in one package.

    Or if a 32 COG P1+ appears I'll think I've died and gone to heaven.


    Just a thought but with your 128-slot allocator is there any merit in being able to set a top value of <128 to allow COGs to run in sync if you aren't running a nice 'binary' pattern of access? Or indeed, set the length as a prime number to make them deliberately run out of sync?

    Also, and I haven't really thought this through but is there any case where a COG wouldn't want access to HUB memory once it's running so you need a way of flagging this in the allocator?
  • Bill Henning Posts: 6,445
    edited 2014-04-07 07:25
    Good, glad I've got it.

    So if a 16 COG P1+ appears I can have 8 COGs acting as very capable intelligent peripherals accessing HUB RAM at 1/128 to exchange data and 8 COGs running out of HUB at 15/128 or around 23.5 MIPS each! So that's a chip with a pile of intelligent peripherals and the equivalent of 8 typical 8-bit processors (except they're 32-bit) all in one package.

    Or if a 32 COG P1+ appears I'll think I've died and gone to heaven.

    Exactly! It also gives a reason to go for 32 cogs (that Chip has said he has space for) as it really does simplify I/O to run a serial / kb / mouse in separate cogs at 1/128... and we'd have room for a TON of such peripherals. In another thread I showed a breakdown with two high bandwidth cogs (one hubexec, one video), 12 serial ports, 3 SPI, 2 I2C, etc and had some cogs and hub time left over...
    Just a thought but with your 128-slot allocator is there any merit in being able to set a top value of <128 to allow COGs to run in sync if you aren't running a nice 'binary' pattern of access? Or indeed, set the length as a prime number to make them deliberately run out of sync?

    The index to the table is NOT the cog number, but cnt&$1F... to allow many different deterministic patterns. Setting it to <128 would not change the pattern, due to cnt&$1F. Using a separate counter with a limit could allow other patterns, but then it would disturb binary access. The way I proposed it minimizes transistors, and I think we all like binary patterns :)

    You can already run them out of sync, no one says you have to use a nice binary allocation in the 128 slot table :)
    Also, and I haven't really thought this through but is there any case where a COG wouldn't want access to HUB memory once it's running so you need a way of flagging this in the allocator?

    The allocator allocates hub cycles to cogs. Simply don't allocate any slots to a cog, it won't get them.

    Unless of course "Mooch" is added, then cogs who don't have specific slots could use slots that were not needed/used by the cogs they were assigned to.

    I gave an example in another thread of using a cog that has no slots assigned to it as a garbage collector for Java/JS/Python like languages that need gc, as it does not need to have a deterministic access pattern, allowing more deterministic slots for cogs that DO need determinism.
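A slot table with optional "Mooch" could behave roughly like the sketch below. Everything here is an assumption made for illustration: the function name, deriving the index mask from the table length, and the rule that the first listed moocher wins an unused slot. None of it has been confirmed as the actual hardware design.

```python
def grant_slot(table, cnt, wants_hub, moochers=()):
    """Return the cog that gets the hub at clock `cnt`, or None if the slot goes unused.

    table     -- slot table; its length must be a power of two so `cnt & mask` indexes it
    wants_hub -- set of cogs requesting the hub on this clock
    moochers  -- cogs allowed to take slots their owner is not using (hypothetical policy)
    """
    mask = len(table) - 1
    assert len(table) & mask == 0, "table length must be a power of two"
    owner = table[cnt & mask]
    if owner is not None and owner in wants_hub:
        return owner                 # deterministic case: the owner uses its slot
    for cog in moochers:             # assumed priority: first listed moocher wins
        if cog in wants_hub:
            return cog
    return None

# Small 8-entry example: cog 0 owns 4/8, cog 1 owns 2/8, cog 2 owns 1/8, one slot unassigned.
table = [0, 1, 0, 2, 0, 1, 0, None]
print([grant_slot(table, cnt, {0, 3}, moochers=(3,)) for cnt in range(8)])
```

In the example, cog 3 owns no slots (like the garbage collector cog mentioned above) and only runs in slots that cogs 1 and 2 leave idle, plus the unassigned one.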
  • Cluso99 Posts: 17,973
    edited 2014-04-07 19:37
    Just managed to read the new 16-Cog thread.
    What a difference a day makes!

    I am really hoping Bill will repost his new hubexec proposals here
    http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257520&viewfull=1#post1257520

    Hubexec is nice for many reasons, but the biggest ones are...
    1. Gets over the 2K cog ram limitation much more simply than LMM
    2. Hubexec uses only ~25% of the power of LMM (LMM executes 4 instructions for every 1 hub instruction)
      • With helper instructions reduces hub footprint and increases speed too.
    Now that the New 16-Cog P1+ chip is being designed, we know it will have QUAD (128 bit = 4 long) hub access. This means there could be a simple and a slightly more complex method of implementing HUBEXEC.

    Basic HUBEXEC
    1. Requires a new JMPRETX (JMPX/CALLX/RETX) instruction with 17bit for direct addressing.
      • CALLX will store return address (17bit) and flags in the fixed register $1EF
      • JMPX will not store the return address nor flags
      • RETX will JMPX to the return address stored in register $1EF, restoring the flags via WC & WZ
      • If the goto address is <$200 (ie hub byte address <$800) then the resulting jump will be to cog, else it will be hub.
      • The 17bit address (the hub long address, ie byte address >>2) can be held in D+S for immediate mode.
    2. The cog's program counter will be increased from 9bits to 17bits (17bits hub long address)
    3. When an instruction needs to be fetched from hub, it will wait for the hub cycle and the hub instruction will be read from the hub long.
      • ie the instruction can come from either cog ram or a fetched hub long.
    4. No hub caching for instructions will be performed.
      • This keeps the design simplest.
      • Hubexec will execute 1 instruction for each available hub slot, except when delayed due to a hub data access.
      • Due to 1 per hub slot, the remaining clocks will cause the cog to idle in low power mode, reducing power consumption considerably.
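The Basic HUBEXEC rules listed above can be walked through with a small behavioral model. This is only a sketch: the class, the bit positions used to pack Z and C beside the return address, and the choice of PC+1 as the return address are assumptions for illustration, since the proposal only says the address and flags share register $1EF.

```python
COG_LONGS = 0x200          # long addresses below this execute from cog RAM
ADDR_MASK = (1 << 17) - 1  # 17-bit long address covers 128K longs (512KB)
LR = 0x1EF                 # fixed link register named in the proposal

class Cog:
    def __init__(self):
        self.pc = 0
        self.c = self.z = False
        self.regs = [0] * COG_LONGS

    def in_hub(self):
        """True when the next instruction must be fetched from hub."""
        return self.pc >= COG_LONGS

    def jmpx(self, addr):
        """Jump without saving the return address or flags."""
        self.pc = addr & ADDR_MASK

    def callx(self, addr):
        # Assumed packing: return address in bits 16..0, Z in bit 17, C in bit 18.
        self.regs[LR] = ((self.pc + 1) & ADDR_MASK) | (self.z << 17) | (self.c << 18)
        self.jmpx(addr)

    def retx(self):
        """Return to the saved address and restore both flags (as if WC & WZ)."""
        saved = self.regs[LR]
        self.z = bool(saved >> 17 & 1)
        self.c = bool(saved >> 18 & 1)
        self.jmpx(saved)

cog = Cog()
cog.pc, cog.z = 0x100, True
cog.callx(0x12345)           # call from cog code into hub code
print(hex(cog.pc), cog.in_hub())
cog.retx()                   # back in cog RAM with the flags restored
print(hex(cog.pc), cog.in_hub(), cog.z, cog.c)
```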
    Improved HUBEXEC
    By increasing the complexity slightly, the following improvements are possible...
    1. Adding a single QUAD register to hold the fetched quad (4 longs) would permit up to 4 instructions to execute in HUBEXEC mode per hub slot (presuming 1:16 clocks = 1:8 instructions).
      • This would give HUBEXEC up to 4x improvement in speed over the Basic Hubexec mode.
      • I will leave the details for Chip to work this out simply if he decides it has merit.
      • There seems no point in cache tags although saving hub fetching if it is the same address might be simple and save power.
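The effect of the single QUAD register can be estimated by counting hub fetches for a given instruction stream. This is a sketch with a made-up function name; it assumes the held quad is identified simply by the long address shifted right by 2, and it ignores hub data accesses.

```python
def hub_fetches(pcs, quad_latch):
    """Count hub windows needed to execute the long addresses in `pcs`.

    Without the latch every instruction costs one fetch (Basic HUBEXEC).
    With it, a fetch is skipped when the instruction lies in the quad already held.
    """
    fetches = 0
    held = None                      # quad-aligned address currently latched
    for pc in pcs:
        quad = pc >> 2               # four longs per 128-bit hub read
        if not quad_latch or quad != held:
            fetches += 1
            held = quad
    return fetches

straight = list(range(0x400, 0x410))        # 16 sequential hub instructions
loop = [0x402, 0x403, 0x404] * 3            # short loop straddling a quad boundary
for name, pcs in (("straight", straight), ("loop", loop)):
    print(name, hub_fetches(pcs, False), hub_fetches(pcs, True))
```

Straight-line code reaches the 4x improvement, while a short loop that crosses a quad boundary reloads the register on every pass and gains much less.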
    Other improvements could be made by
    • LOADX #<17 or 18 bits> to a fixed location ($1EE or $1EF ???)
    • AUGS and/or AUGD
    • TARG
  • rjo__ Posts: 2,115
    edited 2014-04-07 20:04
    Ray and Bill,

    You guys are true experts in every sense of the word. Most of the people trying to follow your conversation are not. My understanding about Hubexec is that one of its virtues is to facilitate
    efficient use of other languages. In fact, one of the current limits of the P1 is the speed of SPIN code. There are lots of things that simply can't be done in Spin because it isn't fast enough.
    The new chip will be a lot faster, so SPIN should be faster too, but the faster it can be made, the better.
    Do you guys see HUBEXEC boosting SPIN performance beyond what Chip already has planned?

    My other understanding (from long ago) is that HUBEXEC is "elected" by the user so to speak… so if a user doesn't want to use it, it will only marginally impact other modes. True?
  • jmg Posts: 14,821
    edited 2014-04-07 20:22
    rjo__ wrote: »
    My other understanding (from long ago) is that HUBEXEC is "elected" by the user so to speak… so if a user doesn't want to use it, it will only marginally impact other modes. True?

    The impacts are die area, and perhaps a (very?) small power adder when not used.

    I would guess Chip will do an OnSemi sim run before adding any HubExec, to confirm Power and speed points.
    I think this step also gives more accurate die-area figures

    (unless he can 'see' a clear winner simple low impact solution.)

    Given Hubexec is not 'new' in the sense it works on the more complex P2, it may not be that great a task.
  • Cluso99 Posts: 17,973
    edited 2014-04-07 20:26
    rjo,
    I am not sure how much improvement, if any, it would make to spin.

    HUBEXEC has the following advantages:
    * Simplicity to expand over the 2K cog RAM limit. This is the BIGGEST problem this solves because you don't have to resort to the complex LMM.
    * It reduces power by 75% over LMM for the same speed.
    * It usually improves speed over LMM, but this depends on the simplicity of the implementation. QLMM may beat the basic hubexec, but is a lot more complex.
    * Enables better use of HLL (high level languages) such as GCC and Catalina C.

    If the user doesn't use hubexec, there is no impact to the user (other than perhaps a negligible power and silicon waste in the basic hubexec implementation). There was quite a bit more power and silicon waste in the P2 if it was not used, due to the 4-deep, 8-long instruction cache with LRU (least recently used algorithm) and auto loading of the cache by a state machine.

    I gained a 25% average improvement with my P1 Spin Interpreter. But because spin was in ROM, it was not really used. Once it's soft (as it will be now), those sorts of improvements can be made, as well as unwrapping some, and putting the lesser used into hubexec mode. So actually hubexec could improve spin, but not a lot IMHO. There are far better ways to improve it for more speed.
  • rjo__ Posts: 2,115
    edited 2014-04-07 20:44
    Cluso99

    Thanks.
  • Bill Henning Posts: 6,445
    edited 2014-04-07 21:07
    Hi Ray,

    Happy to help!

    Sorry for the timeline / quote format, I will clean it up once Chip posts what fits :)

    (from http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257520&viewfull=1#post1257520 roughly 10 hours ago)

    LMM comparisons made to base rate ie without fcache, as hubexec is also base rate (no reason hubexec cannot use an fcache)

    Instead of a physical separate LRU capable cache, could you execute from the four longs from the last hub cycle?

    That can't pre-fetch a following cache line, but it would still be MUCH faster than LMM (without fcache, which the compiler guys are not using anywhere near its potential, probably due to GCC code generator limitations)

    If you can execute from an internal latched "this is what I got last hub cycle", it would still be minimum 2x-4x faster than any possible LMM.

    Still avoids cache lines, etc.

    HUNGRY would help the simplified hubexec very significantly

    All we really need is 3 new instructions, and executing from the QUAD latch as above

    JMP d/#hub17longaddr
    CALL d/#hub17longaddr ' writes PC to LR, I suggest $1EF, or whatever the last register is before special registers
    LOAD #hub17longaddr ' could write the #hub17longaddr + two low zero bits to LR, or one reg below it - for fast loading of pointers (cheap LOCPTRA replacement)

    This would be roughly 4x faster than possible with LMM, and save parallax $$$$ over having to do a QLMM gcc.

    The slot assignment table idea, would make it MUCH faster, almost twice as fast

    Mooch would also help greatly

    Even a single 4 long line (simple latch) and the RDxxxxC instructions would speed up VM's and data access for LMM & hubexec by roughly 4x.

    - What is the area/transistor budget effect of a single 4 long cache for data?

    - It sounds like you determined that 32 cogs are out (I assume too much die area).

    Pity, with slot mapping it would add a lot, but what does not fit, does not fit.

    Slot mapping (16 entries with 8 cogs minimum), would allow 2x the simple hubexec performance, twice the video bandwidth

    at the cost of some slots from low speed drivers that don't need them

    Later, I was asked for more clarification... I may have tried to stuff too much info into too short a post!

    (from http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257529&viewfull=1#post1257529 about 9 hours ago)
    100% correct

    Hubexec was not really worth it without the wide bus, due to minimal incremental performance (ok, some memory savings too, which is less critical with 512KB)

    When I digested his new design, I saw how to cheaply improve it.

    It is very likely he already has a 128-bit latch for each hub to latch the value before writing it to 128 bits of hub registers.

    If not, maybe he can execute out of the four registers it was written to.

    In either case, we are talking about a 128:4 multiplexer, PC changed to 17 bits, and the 3 instructions I proposed. I think the gate count for that would be very small compared to the size of even the tiny P1 cog.

    David Betz wrote: »
    I thought that too but I think he's nearly there with increasing the hub width to 128 bits. He may have already paid a large part of the price in complexity. If that's not true then maybe it would be better to leave it out. That still leaves Ken's comment about improved C efficiency being a customer request though.

    (from http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257537&viewfull=1#post1257537 about 9 hours ago)
    Dave,

    It is actually several potential performance boosts. Maybe I make my technical proposals too dense!

    hubexec engine:

    Step 1: no prefetch (absolute minimum gates required)

    may not need a cache line if the hub bus is latched on read (can execute from there) or if Chip can execute from the destination four registers (don't know). If the above are not possible, needs a minimum of one 4 long icache line (1x4L to save future typing)

    Performance is roughly 1/2 what it could be, as there is no possibility of prefetch.

    Step 2: prefetch requires 2x4L icache, if there is gate/power budget for two cache lines (uses more gates). Obviously 4x4L would be better.

    Step 3: mooch would help a lot (annoys some people on "principle")

    minimum instructions needed for good improvement over LMM

    The three instructions I pointed out are the bare minimum to minimize required verilog and gates.

    JMP d/#hub17longaddr
    CALL d/#hub17longaddr ' writes PC to LR, I suggest $1EF, or whatever the last register is before special registers
    LOAD #hub17longaddr ' could write the #hub17longaddr + two low zero bits to LR, or one reg below it - for fast loading of pointers (cheap LOCPTRA replacement)

    More would allow for more improvement, but I got it that people want minimal changes (whether I agree or not). I even used a LR as the gcc group prefers

    possible further performance improvements; it is up to Chip which are small enough to fit, or whether to put them in at all

    Option 1: A single 4 long data cache so we can get RDxxxxC back. That hugely improves VM's, would even help LMM

    Option 2: Simplified slot mapping, with 2x as many slots as cogs

    Allows deterministic bandwidth assignment - less for serial/ps2/etc cogs, more for hubexec and video cogs. Should be very cheap in silicon (32x5 bits), but only Chip knows what % of gates increase it adds

    Serial port? only one slot out of 32 (or 64 etc)

    Video engine? Give it two slots! One cog can now do 16bpp 1080p60 !!

    Obvious extensions are say 64 slots if cheap enough for finer grain deterministic timing control

    Dave Hein wrote: »
    Bill, I find your proposal a bit confusing. Aren't you just suggesting hubex with a single cache line without prefetch, or did I miss something?

    (from http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257591&viewfull=1#post1257591 about 8 hours ago)
    Dave,

    Dave Hein wrote: »
    The thing that makes it confusing is that you are suggesting several things.

    I was building up from minimum, to better... better... trying to save myself time.

    My intent was for each stage to be read, analyzed, internalized before moving to the next.

    This way I was hoping to save everyone time, and trying to present a "roadmap" from minimum gates with some performance improvement, to maximum performance, with as few gates as I could see using.

    Dave Hein wrote: »
    It seems like the minimal implementation would just be a RDLONGC. Latching the hub bus is equivalent to implementing a single cache line. The cache line could be used for data, instructions or both. I.E., the latched hub bus would be a shared instruction/data cache.

    A shared single line 4 long cache cannot improve LMM or hubexec, as it would be reloaded on every hub reference, and on the first instruction after a hub reference. Performance would be terrible, almost zero benefit.

    Shared I/D caches can work very well when there are a LOT of cache lines, and use an LRU algorithm.

    Two lines of I with prefetch and one line of D cache is the minimum for decent performance. (Diminishing returns hits after 8 lines of I and 4 of D)
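The thrashing argument can be illustrated by counting line reloads for an access stream that alternates instruction fetches with data reads. This is a sketch only: the function name and the access stream are invented for illustration, each line holds one aligned quad of 4 longs, and prefetch is not modelled.

```python
def misses(accesses, shared):
    """Count line reloads for a stream of ('I' | 'D', long_address) accesses.

    shared=True  -- one 4-long line serves both instructions and data
    shared=False -- one line for instructions and one for data
    """
    lines = {}                       # line key -> quad address currently held
    count = 0
    for kind, addr in accesses:
        key = "shared" if shared else kind
        quad = addr >> 2
        if lines.get(key) != quad:
            lines[key] = quad        # reload the line from hub
            count += 1
    return count

# Eight sequential instructions, each one reading the next element of an array.
stream = []
for i in range(8):
    stream.append(("I", 0x400 + i))
    stream.append(("D", 0x1000 + i))

print("shared line reloads:", misses(stream, shared=True))
print("split I/D reloads:  ", misses(stream, shared=False))
```

With a shared line every one of the 16 accesses forces a reload, while separate instruction and data lines need only 4.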

    Dave Hein wrote: »
    I think all you need is the long JMP and CALL instructions. What's the purpose of the LOAD instruction? Isn't that the same as RDLONGC?

    No. LOAD is equivalent to LOCPTRA on the P2, but going to a fixed location to avoid needing bits for D.

    A (gate) poor man's replacement for the P2 LOC* instructions, without needing PTRA. Not as good, but a good boost for compiled code. To wit:

    Code:

    ' LMM

    CALL #MVI_R4
    long hub_addr_of_array
    RDLONG R3, R4 ' get first element, can incr R4 to walk array

    ' HUBEXEC

    LOADK #hubaddr
    RDLONG R3, $1EE ' (or whatever fixed address)

    HUGE performance win, reduces memory use too.

    As per my discussion with David, I'd be delighted if Chip instead could add AUGS:

    RDLONG R3,##hubaddr

    and that would also cover reading 32 bit constants.

    I did not have it in my minimized proposal... as it was the minimum

    If I find the time, I'll edit it down tomorrow, now that it is all in one place!
  • Cluso99 Posts: 17,973
    edited 2014-04-07 21:29
    Thanks Bill.
    That's fine as it will change as we get to understand more. I thought it should be here in case anyone wants to input to hubexec mode.

    What are your thoughts about the Basic Hubexec? It's way better than nothing, and makes life simpler.
    Then we can see what adding the 4 long instruction register (simple cache) would do - it would then need to compare the required address against the one held, and the held block would have to be on a quad long boundary.

    The Basic implementation would be deterministic whereas the 4 long cache would not. Probably cannot have both.