Extremely Simple HUBEXEC in HUB or Extended COG RAM
Cluso99
All that is needed to execute from HUB RAM or EXTENDED COG RAM is:
1. Implement a flat memory model where:
Cog RAM $00000-$001FF
Extended Cog RAM $00200-$003FF (if the 2KB extension is implemented)
Hub RAM $00400-$xFFFF
2. Increase the PC (program counter) to, say, 18 bits (i.e. 18 bits + "00" = 20 bits, up to 1MB addressable).
3. Permit instruction fetches from anywhere in the flat map. If the address is in Hub RAM, wait for your hub slot.
4. Change the JMPRET (JMP/CALL/RET) and TJ/DJ instructions to relative addressing (i.e. +/- 256 instructions/longs).
5. Optionally, add an LJMP instruction that uses both the D and S fields (9 + 9 = 18 bits), giving an 18-bit address + "00" (1MB max).
Now there is no need for an LMM execution unit or the LJMP/LCALL mechanisms currently used. Code executes straight out of hub at full hub speed.
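The address map and PC width above can be modelled as a small decode step. This is a minimal sketch, assuming the PC counts longs (instructions), that the optional extended cog range is present unless disabled, and that without it any fetch above $1FF goes to hub (as suggested later in the thread). The names `fetch_source` and `byte_address` are hypothetical.

```python
# Minimal model of the proposed flat memory map (long addresses).
PC_BITS = 18                      # 18-bit PC counts longs
PC_MASK = (1 << PC_BITS) - 1
COG_LAST = 0x1FF                  # Cog RAM $00000-$001FF
EXT_COG_LAST = 0x3FF              # Extended Cog RAM $00200-$003FF (optional)


def fetch_source(pc, extended_cog=True):
    """Say where an instruction fetch at long address `pc` is served from."""
    pc &= PC_MASK
    if pc <= COG_LAST:
        return "cog"
    if extended_cog and pc <= EXT_COG_LAST:
        return "ext_cog"
    # Everything else is hub RAM: the cog must wait for its hub slot.
    return "hub"


def byte_address(pc):
    """Append "00" to the 18-bit long address, giving a 20-bit byte address."""
    return (pc & PC_MASK) << 2
```

For example, `fetch_source(0x400)` returns "hub", and `byte_address(0x3FFFF)` is $FFFFC, the last long in a 1MB space.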
Comments
1) "If the address is from Hub RAM, then wait for your hub-slot." extends to become:
If the address is in Hub RAM, wait for your hub slot; OR, if it is in off-chip memory (XIP serial), fetch the opcode(s) after waiting for the address to be issued.
The small HW state engine that supports this needs to be skip-aware, so that if an address a short distance ahead is requested, it does not generate a new address frame but instead uses the faster skip.
The HW should also allow block-streamed copies with no new address issues, which supports fonts and LMM-style code streaming.
That encourages code to be skip-friendly for best performance, which compilers and programmers can often achieve with a little extra care.
Of course, once this is done, the 20b address looks a tad small, requiring a larger call.
I think Chip was looking at the smallest jumps being relative, to allow position-independent code, with 9b and 20b sizes, and the full 32b address being absolute.
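A skip-aware XIP fetch engine can be modelled as follows. This is a sketch under stated assumptions: the clock costs of an address frame and of clocking out one long are illustrative placeholders, not figures for any real flash or PSRAM part, and `XipFetcher` is a hypothetical name. A request landing a short way ahead of the current stream position is served by discarding the intervening longs, while a backward or distant target pays for a new address frame. A sequential block copy pays the frame cost only once.

```python
class XipFetcher:
    """Hypothetical skip-aware fetch engine for serial (XIP) memory.

    Addresses are long addresses. Clock costs are illustrative assumptions.
    """

    def __init__(self, addr_frame_clocks=20, word_clocks=8):
        self.addr_frame_clocks = addr_frame_clocks  # cost of a new address frame
        self.word_clocks = word_clocks              # cost to clock out one long
        # Skipping ahead is worthwhile while it costs no more than a new frame.
        self.skip_window = addr_frame_clocks // word_clocks
        self.next_addr = None                       # address the stream delivers next
        self.new_frames = 0

    def fetch(self, addr):
        """Return the clocks spent fetching the long at `addr`."""
        ahead = None if self.next_addr is None else addr - self.next_addr
        if ahead is not None and 0 <= ahead <= self.skip_window:
            # Sequential or close-ahead: discard `ahead` longs, then read one.
            clocks = (ahead + 1) * self.word_clocks
        else:
            # First access, backward jump or far jump: issue a new address frame.
            self.new_frames += 1
            clocks = self.addr_frame_clocks + self.word_clocks
        self.next_addr = addr + 1
        return clocks
```

With the default costs, fetching addresses 100, 101, 102, 104 and then 50 costs [28, 8, 8, 16, 28] clocks: the jump to 104 is a cheap skip, and only the first access and the backward branch to 50 open new address frames.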
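Relative branches with 9-bit and 20-bit offsets decode with a sign extension, while an absolute target is used as-is. The sketch below assumes the offset is added to the branch's own address (real hardware might use PC+1 instead); the function names are hypothetical. Because only the distance is encoded, the same offset field reaches the same relative target wherever the code is loaded, which is what makes the code position-independent.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def branch_target(pc, offset_field, bits):
    """Relative branch (assumed relative to the branch's own address)."""
    return pc + sign_extend(offset_field, bits)


def absolute_target(addr_field):
    """Full 32-bit absolute jump target."""
    return addr_field & 0xFFFFFFFF
```

A 9-bit field spans -256 to +255 longs; for example, `branch_target(0x1000, 0x1F0, 9)` gives $0FF0, 16 longs back.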
2) Some parts have a register frame pointer that re-maps the working registers to other parts of memory.
In a P1V context, this would map COG memory to different HUB locations. This works best with opcodes that take several clocks (as the P1V's do now) and less well with single-cycle cores.
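A register frame pointer can be modelled by routing every register access through a base address in hub. In this hypothetical sketch, `frame_base` is a long-aligned byte address and each cog register number selects one little-endian long above it; `FramedRegisters` and its methods are illustrative names, not an existing P1V feature.

```python
class FramedRegisters:
    """Hypothetical frame-pointer model: cog registers live in hub RAM."""

    def __init__(self, hub_size=64 * 1024):
        self.hub = bytearray(hub_size)
        self.frame_base = 0  # assumed frame-pointer register (byte address)

    def _addr(self, reg):
        if not 0 <= reg <= 0x1FF:
            raise ValueError("cog register out of range")
        return self.frame_base + reg * 4  # one long per register

    def read(self, reg):
        a = self._addr(reg)
        return int.from_bytes(self.hub[a:a + 4], "little")

    def write(self, reg, value):
        a = self._addr(reg)
        self.hub[a:a + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")
```

Writing register 5 with `frame_base` at $1000 and again at $2000 leaves two independent copies; switching the base back to $1000 makes register 5 read the first value, which is how a context switch would swap register sets without copying them.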
At any rate, part of the plan was to continue with the idea of mapping boot ROM to low Hub space and have the Cogs' addressing overlap the lower 2k of ROM. So, 100% linear across Cog and Hub.
The cool part with the P1V is that this can be every 8 clocks, just by making the Hub cycle every 8 clocks as well.
1. Just extend the PC and allow hub fetches for instructions when PC > $200 (or >$x00 with extended cog)
2. Add a new LJMP
3. For now, treat JMPRET & TJ/DJ instructions as only changing the lower 9bits of PC.
Now, just compile a block of code that will "live" within a 2KB block. The registers will still all be in normal cog space.
Start a block of cog code with a LJMP to a 2KB block of hub code.
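The interim rule above (JMPRET/TJ/DJ replace only the lower 9 bits of the PC, LJMP loads the full PC) can be sketched as follows. It assumes an 18-bit long-address PC and an LJMP that takes the upper 9 bits from D and the lower 9 bits from S; both function names are hypothetical.

```python
PC_MASK = (1 << 18) - 1   # assumed 18-bit long-address PC
LOW9 = 0x1FF              # 512 longs = one 2KB block


def ljmp(d_field, s_field):
    """Hypothetical LJMP: D supplies PC[17:9], S supplies PC[8:0]."""
    return ((d_field & LOW9) << 9) | (s_field & LOW9)


def jmp_in_block(pc, s_field):
    """Interim JMPRET/TJ/DJ rule: replace only PC[8:0], keep the block bits."""
    return (pc & PC_MASK & ~LOW9) | (s_field & LOW9)
```

An LJMP with D=$008 and S=$000 lands at $1000; a plain JMP to $020 from there stays inside that 2KB block at $1020, while the same JMP executed from cog RAM still goes to $020, exactly as today.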
evanh,
Agreed, we need an 8-clock hub. Then hubexec will run at only half cog speed.
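The "half cog speed" figure follows from simple arithmetic. This sketch assumes one instruction fetch per hub window, execution overlapped with the wait for the next window, and no other hub accesses competing for the slot; most P1 instructions take 4 clocks. The function names are hypothetical.

```python
COG_INSTR_CLOCKS = 4  # most P1 instructions take 4 clocks


def hubexec_relative_speed(hub_period_clocks):
    """Fraction of cog-RAM execution speed when every fetch comes from hub.

    Assumes one instruction fetch per hub window, execution overlapped with
    the wait for the next window, and no other hub traffic on the slot.
    """
    clocks_per_instr = max(hub_period_clocks, COG_INSTR_CLOCKS)
    return COG_INSTR_CLOCKS / clocks_per_instr


def hubexec_mips(clock_hz, hub_period_clocks):
    """Millions of instructions per second under the same assumptions."""
    return clock_hz / max(hub_period_clocks, COG_INSTR_CLOCKS) / 1e6
```

An 8-clock hub gives 0.5 (10 MIPS instead of 20 MIPS at 80 MHz), while the current 16-clock hub window would give only 0.25.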
Also keep in mind that most FPGAs have 36-bit-wide memory (40b in the MAX 10, but that may be too unique), which allows bigger opcodes for some items (longer JMPs),
such as LJMPs. I have also suggested it can be used for auto-increment modes on indirect opcodes.
The Prop has modes to load 9-bit sub-fields, so a 4th one of those can nicely reach all 36 bits without needing a fully explicit 36b-wide load, if that is not desired.
HW would usually default those upper 4 bits to a safe value.
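With 36-bit memory, four 9-bit sub-field loads cover a whole word, since four 9-bit fields make 36 bits. The sketch below assumes a simplified contiguous layout where field k occupies bits 9k to 9k+8, which is not necessarily how the P1 instruction fields are arranged, and models an ordinary 32-bit store that forces the extra 4 bits to a safe default of zero. Both function names are hypothetical.

```python
FIELD_BITS = 9
WORD_BITS = 36
WORD_MASK = (1 << WORD_BITS) - 1


def set_subfield(word, index, value):
    """Replace 9-bit sub-field `index` (0-3) of a 36-bit word.

    Assumed layout: field k occupies bits [9k+8 : 9k].
    """
    if not 0 <= index < WORD_BITS // FIELD_BITS:
        raise ValueError("sub-field index must be 0-3")
    shift = index * FIELD_BITS
    mask = ((1 << FIELD_BITS) - 1) << shift
    return (word & ~mask & WORD_MASK) | ((value << shift) & mask)


def store_32(value, safe_upper=0):
    """Plain 32-bit write; the extra 4 bits get a safe default (0 here)."""
    return ((safe_upper & 0xF) << 32) | (value & 0xFFFFFFFF)
```

Setting all four sub-fields to $1FF fills the full 36-bit word, while a plain 32-bit store leaves bits 35..32 at the safe default.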