Very Simple HUBEXEC for New 16-Cog, 512KB, 64 analog I/O
Cluso99
Posts: 18,069
I think this thread is "Dead in the Water" - thanks Chip
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257855&viewfull=1#post1257855
Today Chip posted a new definition of the P16X32B. This post has therefore been renamed.
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip
HUBEXEC discussion for the new chip now continues from Post #10 below.
Here is what I proposed elsewhere. It is a simple, no frills method to implement HUBEXEC on a P1 2clock Cog.
It could be used on the proposed P16X32B / P32X32B chip, with 16-32 P1-style 2-clock cogs.
Ignoring hub data accesses which will impact any scenario similarly...
An LMM loop uses 4 P1 instructions to execute the loop:

:LMM    RDLONG  :INSTR, PC      ' fetch the instruction from hub
        ADD     PC, #4
:INSTR  nop                     ' execute the fetched instruction
        JMP     #:LMM

With 2-clock instructions, and hub access every 8 clocks, this loop executes in 8 clocks.
Therefore, it would execute 1 hub instruction per hub access (8 clocks), and would use 100% power.
In my method, the instruction would be fetched from hub during the next hub cycle and executed directly. Any waits for the hub cycle would effectively be "idle" clocks resulting in virtually no power.
With 2-clock instructions, and hub access every 8 clocks, the fetched instruction would execute in 2 clocks, so the cog would "idle" for 6 clocks (3 instruction periods).
Therefore, it would execute 1 hub instruction per hub access (8 clocks), and would use ~25% of the LMM power, and same speed.
Now, if we increase the hub access rate to once every 4 clocks,
LMM would still need 8 clocks per instruction, so 1 access every 2 hub cycles, and would use 100% power (the reference).
Hubexec would now execute 2 instructions per 2 hub cycles, and would use 50% power of LMM and run 200% faster.
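The comparison above can be restated as a back-of-envelope model. This is only a sketch of the reasoning in the post, not silicon data: "power" here is simply the fraction of clocks the cog spends actively executing, on the assumption that idle (hub-wait) clocks cost almost nothing.

```python
# Back-of-envelope model of the LMM vs. hubexec comparison above.
# Assumptions: 2-clock instructions; "power" = fraction of clocks spent
# actively executing (idle clocks waiting for the hub are ~free).

def lmm(hub_period_clocks):
    """LMM: a 4-instruction fetch/execute loop = 8 clocks per hub instruction.
    The cog is busy every clock, so its relative power is 1.0."""
    loop_clocks = 4 * 2                          # RDLONG, ADD, instr, JMP
    clocks_per_instr = max(loop_clocks, hub_period_clocks)
    return clocks_per_instr, 1.0

def hubexec(hub_period_clocks):
    """Hubexec: one instruction executes (2 clocks) per hub slot;
    the rest of the slot period is idle."""
    active = 2
    clocks_per_instr = max(active, hub_period_clocks)
    return clocks_per_instr, active / clocks_per_instr

# Hub slot every 8 clocks (P1-style):
print(lmm(8))      # (8, 1.0)  -> 1 instr per 8 clocks, full power
print(hubexec(8))  # (8, 0.25) -> same speed, ~25% of LMM's active clocks

# Hub slot every 4 clocks:
print(lmm(4))      # (8, 1.0)  -> LMM still needs 8 clocks per instruction
print(hubexec(4))  # (4, 0.5)  -> twice the speed, half the active fraction
```

This reproduces the claims in the post: at an 8-clock hub period, hubexec matches LMM's speed at roughly a quarter of the power; at a 4-clock period it doubles the speed at half the power.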
By changing the jump/call/ret instructions to use relative mode if executed from hub, hubexec becomes simpler than LMM, and saves instructions (like pseudo FCALL etc) as well as increases speed. Bill would have the best info on these savings.
Adding in a jump/call/ret instruction to use absolute addressing would also substantially benefit Hubexec over LMM. 17-bit addresses are required for the 512KB hub (128K longs). The call could place the return address in a fixed location together with the Z & C flags. A single P1 opcode could be used for this.
The P2 treated addresses <$200 ($800 in byte addressing) as cog mode, and addresses equal or above as hub mode. The same would apply here.
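The cog/hub address split described above can be illustrated with a tiny helper. The threshold values come from the post ($200 in long addressing, $800 in bytes); the function name is just illustrative.

```python
# Illustration of the P2-style address split described above: long
# addresses below $200 (byte address $800) select cog RAM, anything equal
# or above selects hub. 17 bits cover 128K longs = 512KB of hub.

COG_LIMIT_LONGS = 0x200          # $200 longs = $800 bytes

def target_space(long_addr):
    """Classify a 17-bit long address as 'cog' or 'hub'."""
    assert 0 <= long_addr < (1 << 17), "17-bit long address"
    return "cog" if long_addr < COG_LIMIT_LONGS else "hub"

print(target_space(0x1FF))    # cog (last cog register)
print(target_space(0x200))    # hub (first hub long)
print(target_space(0x1FFFF))  # hub (top of the 512KB hub)
```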
Comments
I suspect you missed my post of almost three hours ago, my friend.
http://forums.parallax.com/showthread.php/155083-Consensus-on-the-P16X32B?p=1257101&viewfull=1#post1257101
I posted a more general, tunable performance version of hubexec for P1+ there.
Mooch would also help
sample P1E code: One long would be transferred every hub cycle using the same mechanism used for coginit. This would have no overhead, unlike one-instruction-at-a-time LMM, and would cache way more code than hubexec ever would if the compiler/human programmer is smart about pages.
electrodude
EDIT: thought about it some more
You would only have to set up the cog ptr and length once, after that you would only need one hubop to specify the hub pointer and trigger the transfer. A stack could be done pretty easily in hardware but it would probably be best to do that in software.
Perhaps you have the wrong link? There are no performance or power references there.
However, I now think we are on the same page. Perhaps a few more might join us?
I think we are starting to get closer to what the recommended PccX32B should look like.
On re-reading my post and yours, you are right, we are on the same page! We were attacking the same issue, from different ends in a way.
Your post concentrated on cycle counts, mine concentrated on hub slot mapping and the minimum hubexec instructions needed to eliminate LMM. I did not do cycle calculations because, with the slot mapping, the cycle counts would be configurable (within limits) by the developer, and Chip has not yet posted the exact cycle behavior of hub instructions on a P1E.
With a "P1E" running at 200MHz, and 32 cogs, the map would initially give each cog a hub cycle every 32 clock cycles. Due to the 2 clocks per instruction, this would appear to the cog like one hub slot every 16 instruction cycles (32 clock cycles) - just like a P1. That is, unless the slot map was changed, which it would be, so that cogs that needed more could get extra hub slots.
Chip has not yet stated how many clock cycles hub instructions would take on a P1E, so I did not delve into further calculations yet.
Assuming the hubexec engine could use the hub long as soon as the window came around, by loading the long just read from the hub directly into the internal "instruction register", a cog configured for 64/128 hub cycles would run instructions at the same rate as cog-only code! Obviously, hubexec at 32/128 hub slots would run at 1/2 cog native speed, 16/128 at 1/4 native speed, and smaller allocations at around LMM speed or better.
I did not include performance figures as in my scheme the performance is configurable, and Chip has not released any data on power utilization on such a cog that would allow me to make power calculations.
So, just to get this straight in my head...
At 200MHz main clock, a P1E COG would run at 100MIPS from COG RAM and eight of them could run at 25MIPS each from HUBRAM using your slot allocator mechanism.
Chip's proposed P1E's use a 200MHz clock, with 2 clock cycles per instruction, and as posted, does not have an instruction cache and data cache like the P2.
(because he is reducing cog memory from four ports on the P2 to two ports on the P1E to save transistors)
So, by using the slot allocation mechanism:
- setting a cog's access pattern to 64/128, one cog could run at 100MIPS from the hub ram, leaving 64 slots for the other cogs
- setting a cog's access pattern to 32/128, three cogs could run at 50MIPS from the hub ram, leaving 32 slots for the other cogs
- setting a cog's access pattern to 16/128, six cogs could run at 25MIPS from the hub ram, leaving 32 slots for the other cogs
- setting a cog's access pattern to 16/128, eight cogs could run at 25MIPS from the hub ram, leaving no slots for the other cogs (the case you were interested in)
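The MIPS figures in the list above (and Heater's "around 23.5 MIPS" for 15/128 later in the thread) all follow from one piece of arithmetic. This sketch assumes a 200MHz clock, a 128-entry slot table scanned one entry per clock, and one hub instruction retired per granted slot; those assumptions are mine, but they are consistent with every figure quoted.

```python
# Restating the arithmetic behind the slot-allocation figures above.
# Assumptions: 200MHz clock, 128-entry slot table scanned one entry per
# clock, and a hubexec cog retiring one instruction per slot it is granted.

CLOCK_MHZ = 200

def hubexec_mips(slots_of_128):
    """MIPS for a cog granted slots_of_128 entries of the 128-slot table."""
    return CLOCK_MHZ * slots_of_128 / 128

print(hubexec_mips(64))  # 100.0   -> one cog at full native speed
print(hubexec_mips(32))  # 50.0
print(hubexec_mips(16))  # 25.0    -> eight such cogs use all 128 slots
print(hubexec_mips(15))  # 23.4375 -> "around 23.5 MIPS"
```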
Adding the "Mooch" mode would allow cogs that don't need to be strictly deterministic to get more slots (the ones assigned to cogs that won't use them)
Good, glad I've got it.
So if a 16 COG P1+ appears I can have 8 COGs acting as very capable intelligent peripherals accessing HUB RAM at 1/128 to exchange data and 8 COGs running out of HUB at 15/128 or around 23.5 MIPS each! So that's a chip with a pile of intelligent peripherals and the equivalent of 8 typical 8-bit processors (except they're 32-bit) all in one package.
Or if a 32 COG P1+ appears I'll think I've died and gone to heaven.
Just a thought but with your 128-slot allocator is there any merit in being able to set a top value of <128 to allow COGs to run in sync if you aren't running a nice 'binary' pattern of access? Or indeed, set the length as a prime number to make them deliberately run out of sync?
Also, and I haven't really thought this through but is there any case where a COG wouldn't want access to HUB memory once it's running so you need a way of flagging this in the allocator?
Exactly! It also gives a reason to go for 32 cogs (that Chip has said he has space for) as it really does simplify I/O to run a serial / kb / mouse in separate cogs at 1/128... and we'd have room for a TON of such peripherals. In another thread I showed a breakdown with two high bandwidth cogs (one hubexec, one video), 12 serial ports, 3 SPI, 2 I2C, etc and had some cogs and hub time left over...
The index to the table is NOT the cog number, but cnt&$1F... to allow many different deterministic patterns. Setting it to <128 would not change the pattern, due to cnt&$1F. Using a separate counter with a limit could allow other patterns, but then it would disturb binary access. The way I proposed it minimizes transistors, and I think we all like binary patterns.
You can already run them out of sync, no one says you have to use a nice binary allocation in the 128 slot table
The allocator allocates hub cycles to cogs. Simply don't allocate any slots to a cog, it won't get them.
Unless of course "Mooch" is added, then cogs who don't have specific slots could use slots that were not needed/used by the cogs they were assigned to.
I gave an example in another thread of using a cog that has no slots assigned to it as a garbage collector for Java/JS/Python like languages that need gc, as it does not need to have a deterministic access pattern, allowing more deterministic slots for cogs that DO need determinism.
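Bill's allocator plus "Mooch" can be modeled as a toy arbiter. This is purely illustrative: the post indexes the table with the low bits of cnt, so the sketch does the same, but it uses a 16-entry table for brevity (the proposal uses 128), and the mooch arbitration rule here is a guess since none was specified.

```python
# Toy model of the slot allocator plus "Mooch" discussed above. Each hub
# cycle, the low bits of the system counter index a table of cog IDs; the
# listed cog gets the slot. If that cog has nothing pending (or the entry
# is empty), a mooching cog may take the slot instead.
# 16 entries for brevity; the actual proposal uses 128.

TABLE_SIZE = 16

def grant_slot(cnt, table, pending, moochers):
    """Return the cog granted this hub cycle, or None if the slot goes unused.
    table:    list of cog IDs (or None), length TABLE_SIZE
    pending:  set of cogs with a hub request waiting
    moochers: ordered list of non-deterministic cogs allowed to mooch"""
    owner = table[cnt & (TABLE_SIZE - 1)]
    if owner is not None and owner in pending:
        return owner                      # deterministic grant to the owner
    for cog in moochers:                  # slot unused: let a moocher have it
        if cog in pending:
            return cog
    return None

# Cogs 0 and 1 alternate slots; cog 7 (e.g. a garbage collector) has no
# slots of its own and mooches whatever goes unused.
table = [0, 1] * (TABLE_SIZE // 2)
print(grant_slot(0, table, pending={0, 7}, moochers=[7]))  # 0 (owns the slot)
print(grant_slot(1, table, pending={0, 7}, moochers=[7]))  # 7 (cog 1 is idle)
```

Note how the owners' timing stays fully deterministic: a moocher only ever sees slots its owners would have wasted.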
What a difference a day makes!
I am really hoping Bill will repost his new hubexec proposals here
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257520&viewfull=1#post1257520
Hubexec is nice for many reasons, but the biggest ones are...
- Gets over the 2KB cog ram limitation much more simply than LMM
- Hubexec uses only ~25% of LMM's power (LMM executes 4 instructions for every 1 user instruction)
- With helper instructions reduces hub footprint and increases speed too.
Now that the New 16-Cog P1+ chip is being designed, we know it will have QUAD (128-bit = 4-long) hub access. This means there could be a simple and a slightly more complex method of implementing HUBEXEC.

Basic HUBEXEC
- Requires a new JMPRETX (JMPX/CALLX/RETX) instruction with a 17-bit address for direct addressing.
- CALLX will store return address (17bit) and flags in the fixed register $1EF
- JMPX will not store the return address nor flags
- RETX will JMPX to the return address stored in register $1EF, restoring the flags via WC & WZ
- If the goto address is <$200 (ie hub byte address <$800) then the resulting jump will be to cog, else it will be hub.
- The 17-bit address (hub byte address >>2) can be held in D+S for immediate mode.
- The cog's program counter will be increased from 9bits to 17bits (17bits hub long address)
- When an instruction needs to be fetched from hub, it will wait for the hub cycle and the hub instruction will be read from the hub long.
- ie the instruction can come from either cog ram or a fetched hub long.
- No hub caching for instructions will be performed.
- This keeps the design simplest.
- Hubexec will execute 1 instruction for each available hub slot, except when delayed due to a hub data access.
- Due to 1 per hub slot, the remaining clocks will cause the cog to idle in low power mode, reducing power consumption considerably.
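The fixed-register return mechanism in the Basic HUBEXEC list can be illustrated by packing and unpacking the saved long. The 17-bit address plus Z and C flags in one register ($1EF) is from the post; the exact bit positions below are my assumption for illustration, since no layout was specified.

```python
# Illustration of the CALLX/RETX idea above: the return address (17 bits)
# and the Z and C flags are saved together in one 32-bit long (the fixed
# register $1EF). Bit positions here are assumed, not specified.

def pack_return(addr17, z, c):
    """What CALLX would write to $1EF."""
    assert 0 <= addr17 < (1 << 17)
    return (c << 18) | (z << 17) | addr17

def unpack_return(long_val):
    """What RETX would recover: (address, Z, C), restoring flags via WC & WZ."""
    return long_val & 0x1FFFF, (long_val >> 17) & 1, (long_val >> 18) & 1

saved = pack_return(0x12345, z=1, c=0)   # CALLX saves return state
addr, z, c = unpack_return(saved)        # RETX jumps back, restoring flags
print(hex(addr), z, c)                   # 0x12345 1 0
```

One consequence of a single fixed register, as with the P2 approach this mirrors, is that nested calls must save $1EF themselves before calling again.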
Improved HUBEXEC

By increasing the complexity slightly, the following improvements are possible...
- Adding a single QUAD register to hold the fetched quad would permit up to 4 instructions to execute in HUBEXEC mode per hub slot (presuming 1:16 clocks = 1:8 instructions).
- This would give HUBEXEC up to a 4x improvement in speed over the Basic Hubexec mode.
- I will leave the details for Chip to work out simply, if he decides it has merit.
- There seems no point in cache tags, although skipping the hub fetch when it is the same address might be simple and save power.
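The single-QUAD register can be sketched as a one-line, tag-checked buffer. This is a model of the idea, not Chip's design: a fetch hits when the PC falls in the same quad-long-aligned group as the last hub read, so straight-line code costs one hub slot per four instructions.

```python
# Sketch of the "Improved HUBEXEC" quad register above: a single 4-long
# buffer holding the last fetched quad. A fetch hits if the PC lies in the
# same quad-long-aligned group; only a miss costs a hub slot.

class QuadBuffer:
    def __init__(self):
        self.tag = None          # quad-aligned long address of buffered group
        self.quad = None

    def fetch(self, pc, read_quad):
        """Return (instruction, hit). read_quad(base) models one hub-slot
        read of 4 longs starting at the quad boundary."""
        base = pc & ~3                        # quad-long alignment
        if self.tag != base:                  # miss: costs a hub slot
            self.quad = read_quad(base)
            self.tag = base
            return self.quad[pc & 3], False
        return self.quad[pc & 3], True        # hit: no hub access needed

hub = list(range(1000))                       # fake hub: "instr" i at addr i
buf = QuadBuffer()
reads = lambda base: hub[base:base + 4]
print(buf.fetch(8, reads))    # (8, False)  miss loads longs 8..11
print(buf.fetch(9, reads))    # (9, True)   straight-line code hits
print(buf.fetch(12, reads))   # (12, False) next quad: miss again
```

This also shows why, as noted below, the quad buffer trades away determinism: a jump's cost depends on whether its target happens to sit in the buffered quad.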
Other improvements could be made by...

You guys are true experts in every sense of the word. Most of the people trying to follow your conversation are not. My understanding about Hubexec is that one of its virtues is to facilitate
efficient use of other languages. In fact, one of the current limits of the P1 is the speed of SPIN code. There are lots of things that simply can't be done in Spin because it isn't fast enough.
The new chip will be a lot faster, so SPIN should be faster too, but the faster it can be made, the better.
Do you guys see HUBEXEC boosting SPIN performance beyond what Chip already has planned?
My other understanding (from long ago) is that HUBEXEC is "elected" by the user so to speak… so if a user doesn't want to use it, it will only marginally impact other modes. True?
The impacts are die area, and perhaps a (very?) small power adder when not used.
I would guess Chip will do an OnSemi sim run before adding any HubExec, to confirm power and speed points.
I think this step also gives more accurate die-area figures
(unless he can 'see' a clear winner simple low impact solution.)
Given Hubexec is not 'new' in the sense it works on the more complex P2, it may not be that great a task.
I am not sure how much improvement, if any, it would make to spin.
HUBEXEC has the following advantages:
* Simplicity in expanding beyond the 2KB cog RAM limit. This is the BIGGEST problem it solves, because you don't have to resort to the complex LMM.
* It reduces power by 75% over LMM for the same speed.
* It usually improves speed over LMM, but this depends on the simplicity of the implementation. QLMM may beat the basic hubexec, but is a lot more complex.
* Enables better use of HLL (high level languages) such as GCC and Catalina C.
If the user doesn't use hubexec, there is no impact to the user (other than perhaps a minute (negligible) power and silicon cost in the basic hubexec implementation). There was quite a bit more power and silicon cost in P2 if it was not used, due to the 4-deep, 8-long instruction cache with LRU (least recently used) replacement and auto loading of the cache by a state m/c.
I gained a 25% average improvement with my P1 Spin Interpreter. But because Spin was in ROM, it was not really used. Once it's soft (it will be now), those sorts of improvements can be made, as well as unrolling some of it, and putting the lesser-used parts into hubexec mode. So actually hubexec could improve Spin, but not a lot IMHO. There are far better ways to improve it for more speed.
Thanks.
Hi Ray,
Happy to help!
Sorry for the timeline / quote format, I will clean it up once Chip posts what fits
(from http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257520&viewfull=1#post1257520 roughly 10 hours ago)
LMM comparisons made to base rate ie without fcache, as hubexec is also base rate (no reason hubexec cannot use an fcache)
Later, I was asked for more clarification... I may have tried to stuff too much info into too short a post!
(from http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257529&viewfull=1#post1257529 about 9 hours ago)
(from http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257537&viewfull=1#post1257537 about 9 hours ago)
(from http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257591&viewfull=1#post1257591 about 8 hours ago)
If I find the time, I'll edit it down tomorrow, now that it is all in one place!
That's fine as it will change as we get to understand more. I thought it should be here in case anyone wants to input to hubexec mode.
What are your thoughts about the Basic Hubexec? It's way better than nothing, and makes life simpler.
Then we can see what adding the 4-long instruction register (a simple cache) would do - it would then need to compare the requested address against the held quad, which sits on a quad-long boundary.
The Basic implementation would be deterministic whereas the 4 long cache would not. Probably cannot have both.