P1V Hub 4x faster 1:4 (4 cogs), 2x 1:8 (8 cogs) + new compile options
Cluso99
I have taken the latest base GitHub code (as at 5Mar2015) and added options to compile the code as follows...
NOT_SCRAMBLED
By default, the ROM is scrambled. By defining NOT_SCRAMBLED, the unscrambled ROM file will be used.
The scrambled ROM identifies as Propeller Version 1, the unscrambled ROM identifies as Propeller Version 2.
CLUSO_ROMHI
When CLUSO_ROMHI is defined in conjunction with NOT_SCRAMBLED, the unscrambled Cluso ROM will be used.
The Cluso ROM has my Faster Spin Interpreter with the $F000..FFFF section of the ROM reorganised so that the vector table used by my interpreter can be positioned at $FBF8..FFF7. The booter/runner/interpreter all still start at their original ROM addresses although the runner is now split into 3 sections $FFF8..FFFF, $F7A4..F7FF and $FB94..FB97 and starts at $FFF9 as before. The log/antilog/sine tables all remain in their original form/location. The Parallax Copyright message contained in the ROM at $FF00..FF5F is now changed and located at $FBC8..FBF7.
The ROM identifies as Propeller Version 3.
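To summarise the three ROM identifiers, here is a small Python sketch (a model only, not the Verilog). The function name `rom_version` is made up for illustration, and the behaviour when CLUSO_ROMHI is defined without NOT_SCRAMBLED is an assumption, since the post only describes the two defines used together.

```python
def rom_version(defines=frozenset()):
    """Return the Propeller version the selected ROM identifies as.

    `defines` is the set of option names enabled in config.v.
    """
    not_scrambled = "NOT_SCRAMBLED" in defines
    cluso_romhi = "CLUSO_ROMHI" in defines

    if not_scrambled and cluso_romhi:
        return 3  # unscrambled Cluso ROM (faster Spin interpreter)
    if not_scrambled:
        return 2  # unscrambled original ROM
    # ASSUMPTION: CLUSO_ROMHI on its own falls back to the scrambled ROM.
    return 1      # default scrambled ROM


print(rom_version())                                   # default build
print(rom_version({"NOT_SCRAMBLED"}))                  # unscrambled ROM
print(rom_version({"NOT_SCRAMBLED", "CLUSO_ROMHI"}))   # Cluso ROM
```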
DISABLE_ROM_FONT
The ROM Font may be disabled by defining DISABLE_ROM_FONT. The ROM font is located at hub $8000..BFFF. There is insufficient RAM space in the DE0-Nano so the FPGA must be built with this option defined.
ROM_FONT_WRITABLE
The ROM Font is built in the FPGA as RAM at $8000..BFFF. This permits this space to be used by programs as HUB RAM. Enable this by defining ROM_FONT_WRITABLE. The ROM will still be pre-loaded with the Font but the user programs may overwrite this.
ROM_HIGH_WRITABLE
The ROM containing the LOG/ALOG/SIN and INTERPRETER/BOOTER/RUNNER code is built in the FPGA as RAM. This permits this space to be used by programs as HUB RAM. Enable the section of HUB at $C000..EFFF as RAM by defining ROM_HIGH_WRITABLE. Note the HUB at $F000..$FFFF (interpreter/booter/runner) remains as protected ROM.
The ROM will still be pre-loaded with the log/alog/sin and interpreter/booter/runner code.
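The effect of the three ROM options on the hub map can be sketched in Python as below. This is only a model of the address ranges quoted above; `is_writable` is a hypothetical helper, $0000..7FFF is taken to be ordinary hub RAM, and treating a disabled font area as not writable (with DISABLE_ROM_FONT taking precedence over ROM_FONT_WRITABLE) is an assumption.

```python
def is_writable(addr, defines=frozenset()):
    """Return True if a user program can overwrite hub byte address `addr`."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("hub address must be in $0000..FFFF")

    if addr < 0x8000:
        return True  # normal hub RAM
    if addr < 0xC000:
        # Font area $8000..BFFF.
        # ASSUMPTION: a disabled font area is simply not usable.
        if "DISABLE_ROM_FONT" in defines:
            return False
        return "ROM_FONT_WRITABLE" in defines
    if addr < 0xF000:
        # $C000..EFFF becomes RAM (still pre-loaded) with ROM_HIGH_WRITABLE.
        return "ROM_HIGH_WRITABLE" in defines
    return False     # $F000..FFFF always remains protected ROM


for a in (0x7FFF, 0x8000, 0xC000, 0xF000):
    print(hex(a), is_writable(a, {"ROM_FONT_WRITABLE", "ROM_HIGH_WRITABLE"}))
```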
HUB_SINGLE_CLOCK
The P1V may be built with the HUB ACCESS as 1:8 clocks instead of the usual 1:16 clocks. To enable this, define HUB_SINGLE_CLOCK.
If 4 Cogs are selected, with HUB_SINGLE_CLOCK the hub access will be 1:4 clocks, otherwise it will be 1:8.
COGS_4
The P1V may be built with only 4 Cogs instead of 8 Cogs. When the P1V is built with 4 Cogs, the HUB ACCESS will be built with 1:4 clocks with HUB_SINGLE_CLOCK set, else it will be built with 1:8 clocks. Note that RDxxxx/WRxxxx still take 8 clocks, but will synchronise with the hub in 4 clocks when 1:4, resulting in faster hub accesses.
COGS_4 defines NCOGS=4. Otherwise NCOGS=8.
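The four combinations of COGS_4 and HUB_SINGLE_CLOCK reduce to a one-line rule: the hub slot interval is the number of cogs times the clocks spent per cog slot. A minimal Python sketch of that arithmetic (the function name `hub_interval` is hypothetical, and this is not the Verilog itself):

```python
def hub_interval(defines=frozenset()):
    """Clocks between successive hub slots for one cog (the N in 1:N)."""
    ncogs = 4 if "COGS_4" in defines else 8                 # NCOGS
    clocks_per_slot = 1 if "HUB_SINGLE_CLOCK" in defines else 2
    return ncogs * clocks_per_slot


for opts in ([], ["HUB_SINGLE_CLOCK"], ["COGS_4"], ["COGS_4", "HUB_SINGLE_CLOCK"]):
    print(opts, "-> 1:%d" % hub_interval(frozenset(opts)))
```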
NO_VIDEO
The P1V may be built with the Video disabled. This results in a faster build and of course saves logic/power.
Currently, this disables video in all Cogs.
INVERT_COG_LEDS
The P1V may be built inverting the 8x LED outputs. This is required for the BeMicroCV build - this is done automatically when selecting the BeMicroCV project.
Hub ROM as RAM
In order to be able to use the HUB ROM as HUB RAM, I needed to rebuild the Intel Hex ROM files as byte files. I have modified hub_mem.v accordingly. The filenames carry the suffixes _b0, _b1, _b2 and _b3 for the hex files holding bytes 0..3 respectively.
As of 15May2015, the initialisation files are now in xxxx.mif format (built by Quartus), which prevents 4 warnings.
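For anyone wanting to see what the byte split amounts to, here is a rough Python sketch. It is not the P1V Toolbox SAVEBYTE command, just the same idea under the assumption that lane 0 (_b0) holds the byte at the lowest address of each long; the function names are made up for illustration.

```python
def split_byte_lanes(image):
    """Split a ROM image of 32-bit longs into four byte lanes (b0..b3)."""
    if len(image) % 4:
        raise ValueError("image length must be a multiple of 4 bytes")
    # Lane k collects byte k of every long.
    return [image[lane::4] for lane in range(4)]


def join_byte_lanes(lanes):
    """Inverse of split_byte_lanes: interleave the four lanes again."""
    return bytes(b for group in zip(*lanes) for b in group)


image = bytes(range(8))            # two longs: 00 01 02 03 | 04 05 06 07
lanes = split_byte_lanes(image)
print([lane.hex() for lane in lanes])
print(join_byte_lanes(lanes) == image)
```

Each lane would then be saved to its own initialisation file so the four byte-wide memories can be written independently.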
config.v
The file "config.v" contains the above definitions, which are common to the DE0-Nano, DE2-115 and BeMicroCV builds.
Here are all files. Just unzip into a new folder and compile with Quartus II. Note I used Quartus 15.0.0.145 and the DE0-Nano project. Quartus gives 14 or 15 warnings when building with COGS_4.
P1V_4cog_20150515g.zip
Special thanks to Brian (ozpropdev) for his P1V Toolbox. I used this to create the Intel Hex format files, including the new command SAVEBYTE to save the files in byte form for the new hub rom byte format so that the rom can be used as ram.
Then I used Quartus to open these *.hex files and save as *.mif files.
Previous posts
PostEdit 12May2015 ... see post #24
I now have hub running 4x Faster with 4 cogs
rdxxxx/wrxxxx still take 8 clocks, but there is no further delay waiting for the next hub slot because one comes along every 4 clocks. So there is no requirement to count instructions between rdxxxx/wrxxxx. Of course, there are only 4 cogs for this to work.
Postedit... see post #8
http://forums.parallax.com/showthread.php/160690-Looking-to-improve-hub-to-1-8?p=1326979&viewfull=1#post1326979
Success!!! Hub now runs as 1:8 (twice as fast)
Modifying dig.v to use the same clock as cog_clk results in hub 1:8 instead of the previous 1:16.
Yesterday I was looking at the hub round-robin mechanism. It uses a 3 stage pipe. But I couldn't see where 2 clocks were used for each cog. Am I missing something, or has anyone confirmed that it really is 2 clocks per cog, i.e. 1:16?
Comments
Notice that ena_bus (in dig.v) is toggled every other clock cycle. This means that bus_sel is updated every 2 clocks. Since all of the cogs share the hub buses, I'm guessing the extra clock cycle is/was needed to let the buses settle (when bus_sel changed) before the hub attempted to read any of the bus signals.
I'm curious how Chip got around this with the P2. I'll take a guess, though! I wonder if you could buffer those signals by setting them on the prior clock cycle. In other words, each cog would receive bus_sel[cog#] and bus_sel[cog# - 1]. The cog would set its bus signals on bus_sel[cog# - 1], then wait on bus_sel[cog#]. Or something like that...
On the other hand, since the P1V is intended to target only FPGAs, maybe the settle time isn't required anymore. Maybe just remove the ena_bus toggle and see what happens?
I haven't looked any further yet. In the hub code section Chip is doing what you described in what I referred to as a 3 stage pipe.
The main clock is divided by 2 before it becomes the Cog clock.
The hub appears to cycle every clock too, so I am missing something.
I am just getting back into the P1V code again, so this will be on my todo list.
No, the hub (bus_sel) cycles every other clk_cog because ena_bus is a divide-by-2 of clk_cog, and it is ena_bus that gates the update of bus_sel. See dig.v lines 58-73.
That's my guess too. Quartus doesn't appear to detect the divide-by-2 nature of ena_bus (i.e. report it as a clock), so I'm guessing all of its timing assumes that bus_sel could be updated every clk_cog. At which point, Quartus already thinks it's fast enough. Of course, you might not be able to run it at the higher clock rates, but it's still worth trying...
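A quick behavioural model in Python shows why the divide-by-2 gives 1:16. This is only a sketch of the description above, not a transcription of dig.v: the reset state and the use of a plain counter in place of the one-hot bus_sel register are assumptions, and `slot_period` is a made-up name.

```python
def slot_period(ncogs=8, toggle_ena_bus=True, clocks=64):
    """Count clk_cog cycles between successive hub grants to cog 0."""
    ena_bus = False      # ASSUMPTION: cleared at reset
    sel = 0              # index of the cog currently selected by bus_sel
    grants = []          # clock numbers at which cog 0 becomes selected

    for clk in range(clocks):
        # bus_sel only advances when ena_bus is high; with the toggle
        # removed it advances on every clk_cog.
        advance = ena_bus if toggle_ena_bus else True
        if advance:
            sel = (sel + 1) % ncogs
            if sel == 0:
                grants.append(clk)
        ena_bus = not ena_bus    # divide-by-2 of clk_cog

    return grants[1] - grants[0]


print(slot_period(8, True))    # original behaviour
print(slot_period(8, False))   # ena_bus toggle removed
print(slot_period(4, False))   # 4 cogs, toggle removed
```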
Only a single line in dig.v was required for the change.
dig_hub1in8_20150419f.zip
Here are the benchmarks. Note that the rightmost column is the % faster for my Spin Interpreter with normal 1:16 hub access versus 1:8 hub access.
Is the first column of raw numbers 'Standard' P1 Spin, followed by two improvement steps, one for Cluso Spin and another for 1:8? So the % numbers need to 'sum'?
Any impact on the peak MHz values ?
If not all cogs are being used, tying that in with reduced hub slots would "boost" things further again. (i.e. 1:7 1:6 etc)
Next you will be suggesting a Slot-Yield scheme, or a Slot Priority Encoder
Effectively a rd/wr long/word/byte uses 8 clocks. So consecutive rdxxxx/wrxxxx can operate successfully. Or you can have two 4 clock instructions to catch every second hub cycle (just as before). But the advantage is you only have to wait a maximum of 8 clocks for your turn instead of 16.
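A simplified timing model in Python makes the instruction spacing concrete. It assumes a hub instruction costs a fixed 8 clocks once its slot arrives, that ordinary instructions cost 4 clocks, and that the first hub instruction happens to be aligned with a slot; `hub_loop_clocks` is a hypothetical helper and not measured data.

```python
def hub_loop_clocks(interval, gap_instructions, iterations=4):
    """Steady-state clocks per loop of one hub op plus N 4-clock instructions.

    `interval` is the hub slot spacing for this cog (16, 8 or 4).
    """
    t = 0
    starts = []
    for _ in range(iterations):
        wait = (-t) % interval       # clocks until this cog's next hub slot
        starts.append(t + wait)
        t += wait + 8                # the hub instruction itself: 8 clocks
        t += 4 * gap_instructions    # ordinary instructions in between
    return starts[-1] - starts[-2]


for interval in (16, 8, 4):
    print("1:%d" % interval, [hub_loop_clocks(interval, n) for n in range(4)])
```

In this model the 1:16 hub needs two instructions between hub ops to avoid wasting clocks, the 1:8 hub allows back-to-back hub ops, and with 1:4 the loop time is just 8 clocks plus 4 per extra instruction.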
jmg: it's a forum code posting issue. Can't guess where the columns are
Here are some results...
This line is related to bus_ack and its relationship with bus_sel.
BTW. 1:4 configurations will give the same results as 1:8.
Related to 4 clock instruction cycle always missing next hub cycle I believe.
Post edit: Depends on hub instruction spacing to hub slot alignment.
This makes for much more efficient hub access.
So I have been thinking about what could be done to improve it more...
What would be nice is a Block RD/WR instruction. In the P2 Chip covered this with a repeat and auto increment instructions.
I thought maybe writing to a repeat register and having this do the incrementing. The CNT register lends itself to this.
Any thoughts?
Postedit: Changed PAR to CNT register (because it would be a count value)
How about merging that with indirect COG memory location(s), where 1, maybe 2*, registers have extra hardware.
With an address not needing to be 32b, you could tag using the upper bits things like
INC/DEC and Size: 00 Off, 01 Byte (+/-1), 0x2 u16 (+/-2), 0x3 u32 (+/-4)
COG Memory maps somewhere spare in HUB address space.
2* registers allows Source-Dest pairing. - if they were both used in one opcode, would an equivalent
MOV @Rd++,@Rs++ be possible in one line ?
On an earlier post I suggested using the 4 unused bits on every register as AutoINC Size/Dirn tags, but that is maybe an over-kill. They would be cleared by default and accessed by a new 4th 9-bit field opcode ( 4 x 9 = 36 nicely)
(This is more FPGA focused than ASIC.)
To read 256 longs you would do this...
To write 256 longs you would do this...
I thought that PAR could be used as a parameter holder for some special effects like an AUGDS instruction.
This way, we have not really introduced any extra instructions, just some effects to be used by previously unused shadow registers (well some of us have used them to advantage but that is quite limited).
While I like the idea of mapping the cog space into hub space to permit addressing extended cog ram, I am not of the opinion that it should be multiport to permit other cogs access.
I want to keep the P1V to single port ram, at least for now anyway.
Personally, I am not interested in this idea. There are just too many changes required to software and tools. And the P2 will still be 32 bits.
That was not quite what I meant, - the memory mapping is virtual & only COG local.
In hardware terms, it is simply a MUX on the MSBs that selects the HUB path or the COG path
- so one opcode can access either memory.
A repeated block move opcode is nice, but it should also have a more general opcode that can do
MOV @Rd++,Source
MOV @Rd++,@Rs++
where the @ target can be same-COG or HUB memory, u8,u16,u32
I also prefer a repeat opcode that is not limited to memory copy, but has count and reach fields, so it says
repeat the next X lines of Code Y times.
That form of opcode would naturally pair with the @Auto-Inc one I have, to give a 2 line block move - the same code size as yours, but much more general in use.
As you say, by using a pair of registers for the addresses with incrementing capability, it's possible all instructions could work on cog, extended cog and hub addresses. This was one of the things I thought about when Chip came up with the AUG instruction on the P2.
I think this is a bit too complex for now at least. A single instruction repeat should be fairly simple to do.
Flat address space also allows simple execution from hub. But the gotcha is that the hub access becomes more complex because it can be accessed by any of the I-SD-R clocks. Currently only the rd/wr instructions can access hub, so it's quite a bit simpler. However, the benefits far outweigh this bit of complexity.
It is just a re-loadable counter.
I have just successfully tested (well, not quite as cogs 2 & 3 don't work correctly due to an error on my part with the hub slot pipeline mechanism).
I have tested the rdxxxx/wrxxxx and found that they always take 8 clocks. But any number of instructions from 0 to n between successive rdxxxx/wrxxxx will hit the hub slot immediately. Of course, I only have 4 cogs in this solution.
P1V_4cog_20150514c_4cogswork.zip
Here is a table of speed comparisons...
Nico Hattink
Glad it's working under Q15.
As for the identifier, yes I am using "1" as the original scrambled ROM, "2" as the unscrambled ROM, and "3" as my Faster Cluso Interpreter (spin) ROM.
Added NO_VIDEO and uses *.mif files for the ROM files.