P1V Hub 4x faster 1:4 (4 cogs), 2x 1:8 (8 cogs) + new compile options

Cluso99 · 2015-04-05 16:50

I have taken the latest base GitHub code (as at 5Mar2015) and added options to compile the code as follows...

NOT_SCRAMBLED
By default, the ROM is scrambled. By defining NOT_SCRAMBLED, the unscrambled ROM file will be used.
The scrambled ROM identifies as Propeller Version 1, the unscrambled ROM identifies as Propeller Version 2.

CLUSO_ROMHI
When CLUSO_ROMHI is defined in conjunction with NOT_SCRAMBLED, the unscrambled Cluso ROM will be used.
The Cluso ROM has my Faster Spin Interpreter with the $F000..FFFF section of the ROM reorganised so that the vector table used by my interpreter can be positioned at $FBF8..FFF7. The booter/runner/interpreter all still start at their original ROM addresses although the runner is now split into 3 sections $FFF8..FFFF, $F7A4..F7FF and $FB94..FB97 and starts at $FFF9 as before. The log/antilog/sine tables all remain in their original form/location. The Parallax Copyright message contained in the ROM at $FF00..FF5F is now changed and located at $FBC8..FBF7.
The ROM identifies as Propeller Version 3.

DISABLE_ROM_FONT
The ROM Font may be disabled by defining DISABLE_ROM_FONT. The ROM font is located at hub $8000..BFFF. There is insufficient RAM space in the DE0-Nano so the FPGA must be built with this option defined.

ROM_FONT_WRITABLE
The ROM Font is built in the FPGA as RAM at $8000..BFFF. This permits this space to be used by programs as HUB RAM. Enable this by defining ROM_FONT_WRITABLE. The ROM will still be pre-loaded with the Font but the user programs may overwrite this.

ROM_HIGH_WRITABLE
The ROM containing the LOG/ALOG/SIN and INTERPRETER/BOOTER/RUNNER code is built in the FPGA as RAM. This permits this space to be used by programs as HUB RAM. Enable the section of HUB at $C000..EFFF as RAM by defining ROM_HIGH_WRITABLE. Note the HUB at $F000..$FFFF (interpreter/booter/runner) remains as protected ROM.
The ROM will still be pre-loaded with the log/alog/sin and interpreter/booter/runner code.

HUB_SINGLE_CLOCK
The P1V may be built with the HUB ACCESS as 1:8 clocks instead of the usual 1:16 clocks. To enable this, define HUB_SINGLE_CLOCK.
If 4 Cogs are selected, with HUB_SINGLE_CLOCK the hub access will be 1:4 clocks, otherwise it will be 1:8.

COGS_4
The P1V may be built with only 4 Cogs instead of 8 Cogs. When the P1V is built with 4 Cogs, the HUB ACCESS will be built with 1:4 clocks with HUB_SINGLE_CLOCK set, else it will be built with 1:8 clocks. Note that RDxxxx/WRxxxx still take 8 clocks, but will synchronise with the hub in 4 clocks when 1:4, resulting in faster hub accesses.
COGS_4 defines NCOGS=4. Otherwise NCOGS=8.
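
As a rough illustration of how these two options interact, here is a minimal sketch of a round-robin slot pointer. It is not the P1V source: the module name and the NCOGS parameter are hypothetical, the signal names ena_bus and bus_sel are borrowed from the dig.v discussion further down the thread, and the actual one-line change in dig.v may well be done differently.

```verilog
// Sketch only: not the P1V source. Module name and NCOGS parameter are
// hypothetical; ena_bus and bus_sel follow the names used in the thread.
module hub_slot_sketch #(
    parameter NCOGS = 8               // 8 cogs, or 4 when COGS_4 is defined
) (
    input  wire             clk_cog,  // cog clock
    input  wire             nres,     // active-low reset
    output reg  [NCOGS-1:0] bus_sel   // one-hot: which cog owns the hub slot
);

    // Slot enable. Standard build: toggles every clock, so the slot pointer
    // advances every 2 clocks. With HUB_SINGLE_CLOCK: held high, so the
    // pointer advances every clock.
    reg ena_bus;
    always @(posedge clk_cog or negedge nres)
        if (!nres)
            ena_bus <= 1'b0;
        else
`ifdef HUB_SINGLE_CLOCK
            ena_bus <= 1'b1;
`else
            ena_bus <= ~ena_bus;
`endif

    // Rotate the one-hot slot pointer whenever the enable is high.
    always @(posedge clk_cog or negedge nres)
        if (!nres)
            bus_sel <= {{(NCOGS-1){1'b0}}, 1'b1};
        else if (ena_bus)
            bus_sel <= {bus_sel[NCOGS-2:0], bus_sel[NCOGS-1]};
endmodule
```

In this model a given cog's slot comes round every 2 x NCOGS clocks in the standard build (1:16 for 8 cogs), every NCOGS clocks with HUB_SINGLE_CLOCK (1:8), and every 4 clocks when NCOGS is also 4 (1:4).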

NO_VIDEO
The P1V may be built with the Video disabled. This results in a faster build and of course saves logic/power.
Currently, this disables video in all Cogs.

INVERT_COG_LEDS
The P1V may be built inverting the 8x LED outputs. This is required for the BeMicroCV build - this is done automatically when selecting the BeMicroCV project.

Hub ROM as RAM
In order to be able to use the HUB ROM as HUB RAM, I needed to rebuild the Intel Hex Rom files as Byte files. I have modified hub_mem.v accordingly. Files contain the filename appendices _b0, _b1, _b2 and _b3 for the respective bytes 0..3 hex files.
As of 15May2015, the initialisation files are now in xxxx.mif format (built by Quartus), which prevents 4 warnings.
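
To show why per-byte files are a natural fit for writable hub memory, here is a minimal sketch of a byte-writable memory built from four byte lanes. It is an assumption about the general shape only: the module, port and array names are hypothetical and are not taken from hub_mem.v, and the preload from the _b0.._b3 files is omitted because it is tool-specific.

```verilog
// Sketch only: names and sizes are hypothetical, not taken from hub_mem.v.
module hub_bytes_sketch #(
    parameter AW = 12                  // long-address width (4K longs = 16KB)
) (
    input  wire          clk,
    input  wire          w,            // 1 = write, 0 = read
    input  wire [3:0]    wb,           // per-byte write enables
    input  wire [AW-1:0] a,            // long address
    input  wire [31:0]   d,            // write data
    output reg  [31:0]   q             // read data, available one clock later
);
    // One byte-wide array per byte lane. Each lane would be preloaded from
    // its own _b0.._b3 file (loading is tool-specific and omitted here).
    reg [7:0] lane0 [0:(1<<AW)-1];
    reg [7:0] lane1 [0:(1<<AW)-1];
    reg [7:0] lane2 [0:(1<<AW)-1];
    reg [7:0] lane3 [0:(1<<AW)-1];

    always @(posedge clk) begin
        // Each lane is written only when its own byte enable is set.
        if (w && wb[0]) lane0[a] <= d[7:0];
        if (w && wb[1]) lane1[a] <= d[15:8];
        if (w && wb[2]) lane2[a] <= d[23:16];
        if (w && wb[3]) lane3[a] <= d[31:24];
        // Reads always return the full long.
        q <= {lane3[a], lane2[a], lane1[a], lane0[a]};
    end
endmodule
```

With this layout a byte or word write only touches the lanes it needs, which is why each lane wants its own initialisation file.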

config.v
The file "config.v" contains the above definitions and is common to all DE0-Nano, DE2-115 and BeMicroCV builds.
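
For orientation, a config.v along these lines might look like the fragment below. This is a hypothetical excerpt assembled from the option names listed above, not a copy of the file in the zip. In particular, expressing NCOGS as a `define is an assumption, and the selection shown is just one example for a DE0-Nano build.

```verilog
// Hypothetical excerpt in the style of config.v; the real file may differ.
// Comment or uncomment a `define to select an option.

//`define NOT_SCRAMBLED       // use the unscrambled ROM (identifies as version 2)
//`define CLUSO_ROMHI         // with NOT_SCRAMBLED: use the Cluso ROM (version 3)
`define DISABLE_ROM_FONT      // needed on the DE0-Nano (insufficient RAM)
//`define ROM_FONT_WRITABLE   // $8000..BFFF preloaded with the font, but writable
//`define ROM_HIGH_WRITABLE   // $C000..EFFF writable; $F000..FFFF stays protected
`define HUB_SINGLE_CLOCK      // hub 1:8 (or 1:4 together with COGS_4)
`define COGS_4                // build 4 cogs instead of 8
//`define NO_VIDEO            // disable video in all cogs
//`define INVERT_COG_LEDS     // invert the 8 LED outputs (BeMicroCV)

// COGS_4 defines NCOGS=4, otherwise NCOGS=8 (shown as a macro by assumption).
`ifdef COGS_4
  `define NCOGS 4
`else
  `define NCOGS 8
`endif
```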

Here are all files. Just unzip into a new folder and compile with Quartus II. Note I used Quartus 15.0.0.145 and the DE0-Nano project. Quartus gives 14 or 15 warnings when building with COGS_4.

P1V_4cog_20150515g.zip

Special thanks to Brian (ozpropdev) for his P1V Toolbox. I used this to create the Intel Hex format files, including the new command SAVEBYTE to save the files in byte form for the new hub rom byte format so that the rom can be used as ram.
Then I used Quartus to open these *.hex files and save as *.mif files.

Previous posts

PostEdit 12May2015 ... see post #24

I now have hub running 4x Faster with 4 cogs
rdxxxx/wrxxxx still take 8 clocks but there is no further delay to wait for the next hub slot as it comes along every 4 clocks. So there is no requirement to count instructions between rdxxxx/wrxxxx. Of course, there are only 4 cogs for this to work.

Postedit... see post #8
http://forums.parallax.com/showthread.php/160690-Looking-to-improve-hub-to-1-8?p=1326979&viewfull=1#post1326979
Success!!! Hub now runs as 1:8 (twice as fast)

Modifying dig.v to use the same clock as cog_clk results in hub 1:8 instead of the previous 1:16.

Yesterday I was looking at the hub round-robin mechanism. It uses a 3 stage pipe. But I couldn't see where 2 clocks were used for each cog. Am I missing something, or has anyone confirmed that it really is 2 clocks per cog, i.e. 1:16?

evanh · 2015-04-05 20:35

me was babbling ... please ignore

Seairth · 2015-04-07 06:34

Cluso99 wrote: »

Yesterday I was looking at the hub round-robin mechanism. It uses a 3 stage pipe. But I couldn't see where 2 clocks were used for each cog. Am I missing something, or has anyone confirmed that it really is 2 clocks per cog, i.e. 1:16?

Notice that ena_bus (in dig.v) is toggled every other clock cycle. This means that bus_sel is updated every 2 clocks. Since all of the cogs share the hub buses, I'm guessing the extra clock cycle is/was needed to let the buses settle (when bus_sel changed) before the hub attempted to read any of the bus signals.

I'm curious how Chip got around this with the P2. I'll take a guess, though! I wonder if you could buffer those signals by setting them on the prior clock cycle. In other words, each cog would receive bus_sel[cog#] and bus_sel[cog# - 1]. The cog would set its bus signals on bus_sel[cog# - 1], then wait on bus_sel[cog#]. Or something like that...

On the other hand, since the P1V is intended to target only FPGAs, maybe the settle time isn't required anymore. Maybe just remove the ena_bus toggle and see what happens?

Cluso99 · 2015-04-08 13:29

Thanks Seairth,
I haven't looked any further yet. In the hub code section Chip is doing what you described in what I referred to as a 3 stage pipe.
The main clock is divided by 2 before it becomes the Cog clock.
The hub appears to cycle every clock too, so I am missing something.

I am just getting back into the P1V code again, so this will be on my todo list.

Seairth · 2015-04-08 15:56

Cluso99 wrote: »

Thanks Seairth,
I haven't looked any further yet. In the hub code section Chip is doing what you described in what I referred to as a 3 stage pipe.
The main clock is divided by 2 before it becomes the Cog clock.
The hub appears to cycle every clock too, so I am missing something.

I am just getting back into the P1V code again, so this will be on my todo list.

No, the hub (bus_sel) cycles every other clk_cog because ena_bus is a divide-by-2 of clk_cog, and it is ena_bus that gates the update of bus_sel. See dig.v lines 58-73.

Cluso99 · 2015-04-08 16:22

Seairth wrote: »

No, the hub (bus_sel) cycles every other clk_cog because ena_bus is a divide-by-2 of clk_cog, and it is ena_bus that gates the update of bus_sel. See dig.v lines 58-73.

Thanks Seairth, I will take a closer look. The hub should be able to work at full cog speed. Perhaps this was a critical area in the real P1 at its 360nm? process.

Seairth · 2015-04-08 21:26

Cluso99 wrote: »

Thanks Seairth, I will take a closer look. The hub should be able to work at full cog speed. Perhaps this was a critical area in the real P1 at its 360nm? process.

That's my guess too. Quartus doesn't appear to detect the divide-by-2 nature of ena_bus (i.e. report it as a clock), so I'm guessing all of its timing assumes that bus_sel could be updated every clk_cog. At which point, Quartus already thinks it's fast enough. Of course, you might not be able to run it at the higher clock rates, but it's still worth trying...

Cluso99 · 2015-04-18 21:31

I have hub 1:8 working successfully

Only a single line in dig.v was required for the change.

dig_hub1in8_20150419f.zip

Here are the benchmarks. Note that the rightmost column is the % faster for my Spin Interpreter with normal 1:16 hub access versus 1:8 hub access.

19-Apr-15   P1V 80MHz   P1V 80MHz              hub 1:8  cluso/cluso
Benchmark     P1 Spin  Cluso Spin  Faster%  Cluso Spin     1:16/1:8
toggle         977664      833616    14.7%      753416         9.6%
fibo 1           2192        2144     2.2%        1888        11.9%
fibo 2           6288        5888     6.4%        5136        12.8%
fibo 3          10384        9632     7.2%        8384        13.0%
fibo 4          18576       17120     7.8%       14880        13.1%
fibo 5          30864       28352     8.1%       24624        13.1%
fibo 6          51344       47072     8.3%       40864        13.2%
fibo 7          84112       77024     8.4%       66848        13.2%
fibo 8         137360      125696     8.5%      109072        13.2%
fibo 9         223376      204320     8.5%      177280        13.2%
fibo 10        362640      331616     8.6%      287712        13.2%
fibo 11        587920      537536     8.6%      466352        13.2%
fibo 12        952464      870752     8.6%      755424        13.2%
fibo 13       1542288     1409888     8.6%     1223136        13.2%
fibo 14       2496656     2282240     8.6%     1979920        13.2%
fibo 15       4040848     3693728     8.6%     3204416        13.2%
fft v1.0    117225712   116332496     0.8%   104507648        10.2%
fft v2.0    145591760   139424544     4.2%   126872152         9.0%

Electrodude · 2015-04-18 22:13

I'm assuming you didn't optimize your spin interpreter for a 1:8 hub. How much faster do you think it would go if you did?

Cluso99 · 2015-04-18 22:35

It's not really possible to optimise the spin interpreter for this. It just doesn't work that way.

jmg · 2015-04-18 22:36

Cluso99 wrote: »

Here are the benchmarks. Note that the rightmost column is the % faster for my Spin Interpreter with normal 1:16 hub access versus 1:8 hub access.

The columns seem to not line up.
Is the first Column of raw numbers, 'Standard' P1 Spin, and then two improvement steps occur, one for Cluso Spin, and another for 1:8 ? so the % numbers need to 'sum' ?

Any impact on the peak MHz values ?

ozpropdev · 2015-04-18 22:40

Nice!
If not all cogs are being used, tying that in with reduced hub slots would "boost" things further again. (i.e. 1:7 1:6 etc)

jmg · 2015-04-18 22:48

ozpropdev wrote: »

Nice!
If not all cogs are being used, tying that in with reduced hub slots would "boost" things further again. (i.e. 1:7 1:6 etc)

Next you will be suggesting a Slot-Yield scheme, or a Slot Priority Encoder

Cluso99 · 2015-04-18 23:24

It works as expected.
Effectively a rd/wr long/word/byte uses 8 clocks. So consecutive rdxxxx/wrxxxx can operate successfully. Or you can have two 4 clock instructions to catch every second hub cycle (just as before). But the advantage is you only have to wait a maximum of 8 clocks for your turn instead of 16.

jmg: it's a forum code posting issue. Can't guess where the columns are

Cluso99 · 2015-04-19 02:40

Just tested the DE0-Nano at 120MHz. It reports two timing errors so probably this is not really doable, but it works for the testing I have done.
Here are some results...

TogglePin
  P1               977,664 (12.22ms @  80MHz) 
+ ClusoInterpreter 833,616 (10.42ms @  80MHz)
+ Hub 1:8          753,416 ( 9.42ms @  80MHz)
+ 120MHz           753,416 ( 6.28ms @ 120MHz)

fibo
  P1               4,040,848 (50.51ms @  80MHz) 
+ ClusoInterpreter 3,693,728 (46.17ms @  80MHz)
+ Hub 1:8          3,204,416 (40.06ms @  80MHz)
+ 120MHz           3,204,416 (26.70ms @ 120MHz)

fft 1.0
  P1               117,225,712 (1465ms @  80MHz) 
+ ClusoInterpreter 116,332,496 (1454ms @  80MHz)
+ Hub 1:8          104,507,648 (1306ms @  80MHz)
+ 120MHz           104,507,648 ( 871ms @ 120MHz)

fft 2.0
  P1               145,591,760 (1820ms @  80MHz) 
+ ClusoInterpreter 139,424,544 (1743ms @  80MHz)
+ Hub 1:8          126,872,152 (1586ms @  80MHz)
+ 120MHz           126,872,152 (1057ms @ 120MHz)

ozpropdev · 2015-04-21 01:55

Another line needs to be changed in hub.v to avoid issues with lock-up (1:4 configurations in particular)
This line is related to bus_ack and its relationship with bus_sel.

// generate bus acknowledge for cog[n-2]

assign bus_ack      = ed ? {bus_sel[1:0], bus_sel[7:2]} : 8'b0;
for 1:4 change to
assign bus_ack      = ed ? {bus_sel[1:0], bus_sel[3:2]} : 8'b0;

BTW. 1:4 configurations will give the same results as 1:8.
Related to 4 clock instruction cycle always missing next hub cycle I believe.

Post edit: Depends on hub instruction spacing to hub slot alignment.
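
The two bus_ack lines quoted above differ only in bus width, so they could in principle be written once in a width-generic form. The following is a sketch under that assumption: NCOGS as a module parameter and the module wrapper are hypothetical, while ed, bus_sel and bus_ack are the names from the quoted hub.v lines.

```verilog
// Sketch: a width-generic form of the bus_ack rotation quoted above.
// The module wrapper and the NCOGS parameter are hypothetical.
module bus_ack_sketch #(
    parameter NCOGS = 8               // 8 for the standard build, 4 for 1:4
) (
    input  wire             ed,       // qualifier taken from the quoted line
    input  wire [NCOGS-1:0] bus_sel,  // one-hot hub slot select
    output wire [NCOGS-1:0] bus_ack   // acknowledge for cog[n-2]
);
    // Rotate bus_sel right by two positions, so the bit selecting cog n
    // acknowledges cog n-2 (wrapping around at cog 0).
    assign bus_ack = ed ? {bus_sel[1:0], bus_sel[NCOGS-1:2]} : {NCOGS{1'b0}};
endmodule
```

With NCOGS = 8 the concatenation is {bus_sel[1:0], bus_sel[7:2]} and with NCOGS = 4 it is {bus_sel[1:0], bus_sel[3:2]}, matching the two quoted lines.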

Cluso99 · 2015-04-23 21:50

Now with 1:8 hub slots, and each instruction takes 4 clocks, the max delay is just 1 instruction.
This makes for much more efficient hub access.

So I have been thinking about what could be done to improve it more...

What would be nice is a Block RD/WR instruction. In the P2, Chip covered this with repeat and auto-increment instructions.
I thought maybe writing to a repeat register and have this do the incrementing. The CNT register lends itself to this.

Any thoughts?

Postedit: Changed PAR to CNT register (because it would be a count value)

jmg · 2015-04-23 22:22

Cluso99 wrote: »

So I have been thinking about what could be done to improve it more...

What would be nice is a Block RD/WR instruction. In the P2, Chip covered this with repeat and auto-increment instructions.
I thought maybe writing to a repeat register and have this do the incrementing. The par Register lends itself to this.

Any thoughts?

How about merging that with indirect COG memory location(s), where 1, maybe 2*, registers have extra hardware.
With an address not needing to be 32b, you could tag using the upper bits things like
INC/DEC and Size 00: Off, 01 Byte (+/-1) 0x2 u16 (+/-2) 0x3 u32 (+/-4)
COG Memory maps somewhere spare in HUB address space.
2* registers allows Source-Dest pairing. - if they were both used in one opcode, would an equivalent
MOV @Rd++,@Rs++ be possible in one line ?

On an earlier post I suggested using the 4 unused bits on every register as AutoINC Size/Dirn tags, but that is maybe an over-kill. They would be cleared by default and accessed by a new 4th 9-bit field opcode ( 4 x 9 = 36 nicely)
(This is more FPGA focused than ASIC.)
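
One possible reading of the size/direction tags described above, written as a small address stepper. Everything here is illustrative: the module and port names are hypothetical, the 16-bit address width is an assumption, and the post does not say where the tag bits would sit in the register.

```verilog
// Sketch only: one reading of the auto-inc tags; all names are hypothetical.
module autoinc_sketch (
    input  wire [15:0] addr,      // current pointer value (width assumed)
    input  wire [1:0]  size,      // 00 = off, 01 = byte, 10 = u16, 11 = u32
    input  wire        dec,       // 0 = increment, 1 = decrement
    output wire [15:0] addr_next  // pointer value after the access
);
    // Step is 0, 1, 2 or 4 bytes depending on the size tag.
    wire [15:0] step = (size == 2'b00) ? 16'd0 : (16'd1 << (size - 2'd1));

    assign addr_next = dec ? addr - step : addr + step;
endmodule
```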

Cluso99 · 2015-04-24 00:59

Following on from my previous post...

To read 256 longs you would do this...

   MOV      CNT, #256     'set repeat count to 256
   RDLONG   block, ptr    'reads 256 longs into cog 'block+n' from hub ptr+n

To write 256 longs you would do this...

   MOV      CNT, #256     'set repeat count to 256
   WRLONG   block, ptr    'writes 256 longs from cog 'block+n' to hub ptr+n

I thought that PAR could be used as a parameter holder for some special effects like an AUGDS instruction.

This way, we have not really introduced any extra instructions, just some effects to be used by previously unused shadow registers (well some of us have used them to advantage but that is quite limited).

Cluso99 · 2015-04-24 01:10

jmg wrote: »

How about merging that with indirect COG memory location(s), where 1, maybe 2*, registers have extra hardware.
With an address not needing to be 32b, you could tag using the upper bits things like
INC/DEC and Size 00: Off, 01 Byte (+/-1) 0x2 u16 (+/-2) 0x3 u32 (+/-4)
COG Memory maps somewhere spare in HUB address space.
2* registers allows Source-Dest pairing. - if they were both used in one opcode, would an equivalent
MOV @Rd++,@Rs++ be possible in one line ?

This could be an interesting effect. We would need this to be extra registers $1EE and $1EF. There are some implications for some programs, particularly the spin interpreter(s).

COG Memory maps somewhere spare in HUB space.

While I like the idea of mapping the cog space into hub space to permit addressing extended cog ram, I am not of the opinion that it should be multiport to permit other cogs access.
I want to keep the P1V to single port ram, at least for now anyway.

On an earlier post I suggested using the 4 unused bits on every register as AutoINC Size/Dirn tags, but that is maybe an over-kill. They would be cleared by default and accessed by a new 4th 9-bit field opcode ( 4 x 9 = 36 nicely)
(This is more FPGA focused than ASIC.)

Personally, I am not interested in this idea. There are just too many changes required to software and tools. And the P2 will still be 32 bits.

jmg · 2015-04-24 13:41

Cluso99 wrote: »

While I like the idea of mapping the cog space into hub space to permit addressing extended cog ram, I am not of the opinion that it should be multiport to permit other cogs access.
I want to keep the P1V to single port ram, at least for now anyway.

That was not quite what I meant, - the memory mapping is virtual & only COG local.
In hardware terms, it is simply a MUX on the MSBs that selects the HUB path or the COG path
- so one opcode can access either memory.

A repeated block move opcode is nice, but it should also have a more general opcode that can do
MOV @Rd++,Source
MOV @Rd++,@Rs++
where the @ target can be same-COG or HUB memory, u8,u16,u32

I also prefer a repeat opcode that is not limited to memory copy, but has count and reach fields, so it says
repeat the next X lines of Code Y times.
That form of opcode would naturally pair with the @Auto-Inc one I have, to give a 2 line block move - the same code size as yours, but much more general in use.

Cluso99 · 2015-04-24 15:17

jmg wrote: »

That was not quite what I meant, - the memory mapping is virtual & only COG local.
In hardware terms, it is simply a MUX on the MSBs that selects the HUB path or the COG path
- so one opcode can access either memory.

A repeated block move opcode is nice, but it should also have a more general opcode that can do
MOV @Rd++,Source
MOV @Rd++,@Rs++
where the @ target can be same-COG or HUB memory, u8,u16,u32.

Yes, makes perfect sense. I did this for extended cog addresses when I was adding the AUGDS instruction.

As you say, by using a pair of registers for the addresses with incrementing capability, it's possible all instructions could work on cog, extended cog and hub addresses. This was one of the things I thought about when Chip came up with the AUG instruction on the P2.

I also prefer a repeat opcode that is not limited to memory copy, but has count and reach fields, so it says
repeat the next X lines of Code Y times.
That form of opcode would naturally pair with the @Auto-Inc one I have, to give a 2 line block move - the same code size as yours, but much more general in use.

I think this is a bit too complex for now at least. A single instruction repeat should be fairly simple to do.

Flat address space also allows simple execution from hub. But the gotcha is that the hub access becomes more complex because it can be accessed by any of the I-SD-R clocks. Currently only the rd/wr instructions can access hub so it's quite a bit simpler. However, the benefits far outweigh this bit of complexity.

jmg · 2015-04-24 21:14

Cluso99 wrote: »

I think this is a bit too complex for now at least. A single instruction repeat should be fairly simple to do.

Single instruction repeat is not that much simpler, in both cases you need to create the repeat logic, all that changes is instead of a fixed reach of 1 you make that reach an opcode field.
It is just a re-loadable counter.
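
To make the "re-loadable counter" point concrete, here is a minimal sketch of repeat logic with a reach field (X instructions) and a count field (Y passes). All names and the 9-bit widths are hypothetical; how this would hook into the cog's program counter is not shown and is not taken from the P1V source.

```verilog
// Sketch of a re-loadable repeat counter; all names are hypothetical.
module repeat_counter_sketch (
    input  wire       clk,
    input  wire       nres,        // active-low reset
    input  wire       load,        // pulse: latch a new reach and count
    input  wire [8:0] reach_in,    // X: instructions in the repeated block (>= 1)
    input  wire [8:0] count_in,    // Y: number of passes over the block (>= 1)
    input  wire       step,        // pulse: one instruction has completed
    output wire       loop_back,   // this step ends a pass and another follows
    output wire       active       // a repeat is in progress
);
    reg [8:0] reach;      // latched block length
    reg [8:0] pos;        // instructions left in the current pass
    reg [8:0] passes;     // passes left, including the current one

    assign active    = (passes != 9'd0);
    assign loop_back = active && step && (pos == 9'd1) && (passes != 9'd1);

    always @(posedge clk or negedge nres)
        if (!nres) begin
            reach  <= 9'd0;
            pos    <= 9'd0;
            passes <= 9'd0;
        end else if (load) begin
            reach  <= reach_in;
            pos    <= reach_in;
            passes <= count_in;
        end else if (active && step) begin
            if (pos == 9'd1) begin
                pos    <= reach;          // reload for the next pass
                passes <= passes - 9'd1;  // one pass finished
            end else
                pos <= pos - 9'd1;
        end
endmodule
```

Setting reach_in to 1 gives the single-instruction repeat; any other value repeats a block, and the only extra hardware is the reach register and its reload.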

Cluso99 · 2015-05-11 22:59

4 Cogs and 1:4 hub clocks

I have just successfully tested this (well, not quite, as cogs 2 & 3 don't work correctly due to an error on my part with the hub slot pipeline mechanism).

I have tested the rdxxxx/wrxxxx and found that they always take 8 clocks. But any number of instructions from 0 to n between successive rdxxxx/wrxxxx will hit the hub slot immediately. Of course, I only have 4 cogs in this solution.

Cluso99 · 2015-05-14 04:55

Now all 4 cogs work with hub 1:4

P1V_4cog_20150514c_4cogswork.zip

Here is a table of speed comparisons...
attachment.php?attachmentid=114178&d=1431610229

nutson · 2015-05-14 06:49

Thanks for the effort, Cluso99. Very useful; in many cases a 4 Cog prop is sufficient. However, the project fails to open with Quartus 14.0: "Can't open project - Quartus II Settings File contains one or more errors". Was this done with the new Quartus 15? Do you still have 14.0 around? If so, can you give it a try to check whether you get the same error? If you do, I will have to upgrade to the newer version.

Nico Hattink

nutson · 2015-05-14 08:07

To answer my own question: I have upgraded to Quartus 15 and the project compiles fine now, although the first time I was asked whether I wanted to overwrite the database created with Quartus 14. The DE0-Nano identifies itself as a Propeller 2. Now I need to devise some tests to see the 1:4 hub access working.

Cluso99 · 2015-05-14 18:27

Nico,
Glad it's working under Q15.
As for the identifier, yes I am using "1" as the original scrambled ROM, "2" as the unscrambled ROM, and "3" as my Faster Cluso Interpreter (spin) ROM.

Cluso99 · 2015-05-15 01:49

I have updated the first post to include the latest code and build options.
Added NO_VIDEO and uses *.mif files for the ROM files.
