NEW Hubexec at 20MIPs+ on P1V with a superCOG
rogloh
So here's the next P1V enhancement I wanted to add. With it we should now be able to do hub execution with a single "superCOG" at the full COG rate of 20MIPs+. The superCOG, currently defined as COGID 7, sees all of the 32kB hub RAM as its COG RAM and can execute instructions by reading from it directly at full COG speed, while the remaining seven COGs (COGIDs 0-6) continue to work just as before, all sharing the hub and without any hub exec.
This enhancement was achieved by turning hub RAM into dual ported RAM and connecting its secondary memory interface over to the superCOG's COGRAM interface. See the attached illustration.
The idea here is that you get a single COG that could be running a larger application such as some high level C code and can now have a decent amount of stack/data space for that, and at the same time you still get to use the remaining 7 COGs for other SPIN code/drivers/real time I/O stuff. I'm also thinking after you initialize the system and spawn your driver COGs you could clear/reclaim the space they originally took up in the hub and reuse it for zeroed out BSS data or stack space for the superCOG. You can always spawn PASM driver COGs from the superCOG too if required. That still works too.
When enabled and running, this superCOG can access 32bit instructions or data directly with no hub cycle penalty, and at the same time it can still perform standard hub ops and access hub RAM using regular RDxxxx/WRxxxx instructions if required, using standard byte addresses. When it executes from hub RAM using long addresses this should ultimately allow PASM program sizes up to 32kB (8192 longs) for the superCOG, which is 16 times larger than we have today on the P1. Actually when testing this I found you really need to keep the different addressing in mind at all times, otherwise it quickly gets very confusing distinguishing between hub byte addresses and instruction long addresses.
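To keep the two address spaces straight, here is a minimal Python sketch of the conversion between hub byte addresses (what RDxxxx/WRxxxx use) and superCOG long addresses (what instruction fetch and register access use). The function names are made up for illustration and are not part of any tool.

```python
LONG_BYTES = 4                 # one long is 32 bits
HUB_RAM_BYTES = 32 * 1024      # 32kB hub RAM shared with the superCOG
HUB_RAM_LONGS = HUB_RAM_BYTES // LONG_BYTES   # 8192 longs


def byte_to_long_addr(byte_addr):
    """Hub byte address -> superCOG long address (must be long aligned)."""
    if not 0 <= byte_addr < HUB_RAM_BYTES:
        raise ValueError("byte address outside hub RAM")
    if byte_addr % LONG_BYTES != 0:
        raise ValueError("byte address is not long aligned")
    return byte_addr // LONG_BYTES


def long_to_byte_addr(long_addr):
    """superCOG long address -> hub byte address of its first byte."""
    if not 0 <= long_addr < HUB_RAM_LONGS:
        raise ValueError("long address outside hub RAM")
    return long_addr * LONG_BYTES


if __name__ == "__main__":
    # 8192 longs is 16 times the 512 longs of a regular COG
    print(HUB_RAM_LONGS, HUB_RAM_LONGS // 512)
    print(hex(byte_to_long_addr(0x7C0)), hex(long_to_byte_addr(0x1F0)))
```

The same location therefore has two different numbers depending on whether you reach it with a hub op or as a COG register.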
In this first version, indirect branches to high memory already work, but directly accessing data in the high address space (> 512 longs) or branching to addresses above $1ff using immediate jumps is not supported yet. However, this is all designed to eventually fit properly with the AUGDS stuff Cluso is doing and with my other earlier COG instruction enhancements that allow a COG stack, indirect memory accesses etc. That should all be coming soon and once it does it will really make this stuff start to fly. Some will want that to all come together before experimenting a lot with this mod.
To spawn the superCOG you just use coginit(7, @startaddr, @param) from SPIN and the superCOG will start running at the hub address defined by your startaddr in a DAT section. This start address should obviously be long aligned. When starting up, no COG memory needs to be loaded/rewritten, so it skips the COGRAM writes apart from the shadow RAM at long locations [$1f0-$1ff], corresponding to hub addresses [$7C0-$7FF]. Those shadow RAM locations will therefore get cleared in the hub memory. This clearing needs to be taken into consideration in the first DAT block of the top SPIN file so as not to clobber something important. I am still considering not having it update the shadow RAM (just writing the internal registers only) for some more safety there, but I think we may want to keep it for increased compatibility if some PASM instructions ever need to read from these registers after bootup and expect them to be zeroed initially. Alternatively we could make this an option somewhere using a flag (maybe in bit15 of the start address?). Note that a side effect of the superCOG architecture is that it can't run the SPIN interpreter in this COG anymore, only PASM that has been customized for its new addressing. But we still have the other 7 COGs that can run all our regular SPIN/PASM code so I don't think too much has been lost there. :-)
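As a rough aid for laying out that first DAT block, this Python sketch works out which hub bytes the shadow register clearing touches and checks whether a given data block would be hit. The helper names are hypothetical and the check only models the address arithmetic described above, not the Verilog.

```python
SHADOW_FIRST_LONG = 0x1F0   # first shadow register, as a COG long address
SHADOW_LAST_LONG = 0x1FF    # last shadow register
LONG_BYTES = 4


def shadow_hub_byte_range():
    """Return (first_byte, last_byte) of hub RAM cleared when the superCOG starts."""
    first = SHADOW_FIRST_LONG * LONG_BYTES
    last = SHADOW_LAST_LONG * LONG_BYTES + (LONG_BYTES - 1)
    return first, last


def overlaps_shadow(start_byte, size_bytes):
    """True if a hub block [start_byte, start_byte + size_bytes) would be clobbered."""
    if size_bytes <= 0:
        return False
    first, last = shadow_hub_byte_range()
    end = start_byte + size_bytes - 1
    return start_byte <= last and end >= first


if __name__ == "__main__":
    first, last = shadow_hub_byte_range()
    print(hex(first), hex(last))
    print(overlaps_shadow(0x7BC, 8))   # straddles the start of the cleared area
```

Anything you care about in the top file should either end before $7C0 or start at $800 or later.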
Also as an optimization for the DE0 nano I have now made the spin interpreter and booter fit within a smaller 4kB block of memory at locations $F000-$FFFF. In this particular version, no sine/cosine/log tables are present in the upper ROM area. I sacrificed those for the DE0 nano implementation so I can now have some more video font/buffer memory to play with above 32kB for my text mode VGA controller I want to add. Personally I've not used the math table data in any of my applications to date, but I imagine some people may do and they would prefer to keep these tables and lose video ROM instead. Either arrangement is certainly possible.
The memory map I'm using is below...
00k-32k - hub RAM, and superCOG's RAM (COG 7) sharing this common region of dual port memory
32k-48k - font table ROM, future shared use as video RAM read by external video controller (TBD/coming soon).
48k-64k - 4kB hub boot ROM & Spin Interpreter (4kB folding over 4x in the upper 16kB).
The hub space allocated uses up 52kB of the 66kB of memory blocks available on the DE0 nano. The other 14kB is used for the COGRAM (7 x 2kB) needed by the regular COGs. So all memory is put to full use now.
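The map and the block budget above can be sanity checked with a few lines of Python. This is only a sketch of my reading of the layout: the region names are shortened, and the ROM mirroring assumes the 4kB image simply repeats every 4kB through the upper 16kB.

```python
KB = 1024

# (start, end, name) for each hub region, end exclusive
REGIONS = [
    (0 * KB, 32 * KB, "hub RAM / superCOG RAM"),
    (32 * KB, 48 * KB, "font table ROM"),
    (48 * KB, 64 * KB, "boot ROM + Spin interpreter"),
]
ROM_START = 48 * KB
ROM_IMAGE_BYTES = 4 * KB   # physical ROM size, mirrored 4x in its 16kB window


def decode(addr):
    """Return (region name, offset into the physical memory for that region)."""
    for start, end, name in REGIONS:
        if start <= addr < end:
            offset = addr - start
            if start == ROM_START:
                offset %= ROM_IMAGE_BYTES   # assumed mirroring of the 4kB image
            return name, offset
    raise ValueError("address outside the 64kB hub map")


# Block budget in kB: hub RAM + font ROM + boot ROM, then 7 regular COG RAMs
hub_kb = 32 + 16 + 4
cog_kb = 7 * 2

if __name__ == "__main__":
    print(hub_kb, cog_kb, hub_kb + cog_kb)
    print(decode(0xF000), decode(0xC000))
```

With that mirroring, $C000, $D000, $E000 and $F000 all land on the first long of the same 4kB image.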
I've only just gotten it working today and have only tested the very basics, so there could well be some side effects until any bugs or quirks get ironed out. It also may or may not compile correctly on Xilinx FPGAs, depending on how the Xilinx tool infers the dual port memory I've defined in the Verilog; only Altera Quartus II was used for testing here. I know there are a couple of extra warnings about the size of the P bus related to the ALU. That will all get tidied up when I code up a dedicated ALU for the superCOG with all my extra stack instructions etc.
As far as software tools go I have found that BST under Linux will let you compile PASM code > 512 locations and won't complain if its output exceeds this address, but it won't let you put in ORGs greater than $1F0 which is a pity. So you need to be careful in how you craft the PASM to make it suit the memory layout. We will probably want to sort out some updated tools to make best use of these P1V improvements we are adding here.
Just add these files into a clean P1V design folder and build. I think they are all there. It's a work in progress so the code is not all cleaned up yet. By the way, I also have a "dummycog" module defined in the dig.v file if you want to reduce the number of COGs built in the generator loop to speed compilation up. Just change the generator's for loop indexes accordingly.
Enjoy!
Roger.
Oct 1 UPDATE : I've updated the supercog.v file in the attached hubexecmodsv2.zip, which now contains Cluso's AUGS and AUGDS functionality. The encodings are shown in post #4 below. This will let you access high memory > 512 longs, create 32 bit constants etc. I've done limited testing but it appears to work. Still to come later is all the COG stack stuff I did recently, but this will help get any early adopters going.
To jump to a high memory address you would do this in your code...
        LONG    AUGS + (target >> 9)
        JMP     #(target & $1ff)
To move between two high registers you could do this...
        LONG    AUGDS + (dest & $2fe00) + ((source >> 9) & $1ff)
        MOV     (dest & $1ff), (source & $1ff)
Here I am assuming AUGS/AUGDS are defined using something like this:
CON
  AUGS  = (6<<26) + (1<<24)
  AUGDS = (6<<26) + (15<<18) + (1<<22)   ' for unconditional execution independent of Z/C.
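Until the tools catch up, the prefix longs can be worked out by hand or with a small script. The Python sketch below just evaluates the expressions above for given long addresses. The function names are hypothetical, and the masks are copied from the post as written; for addresses inside the 8192-long hub, `dest & $2fe00` keeps the same bits as a full 9-bit mask at that position would.

```python
# Constants as defined in the CON block above
AUGS = (6 << 26) + (1 << 24)
AUGDS = (6 << 26) + (15 << 18) + (1 << 22)


def augs_jump(target):
    """Return (prefix_long, jmp_immediate) for a jump to long address `target`."""
    prefix = AUGS + (target >> 9)        # upper address bits go in the prefix
    return prefix, target & 0x1FF        # low 9 bits stay in the JMP itself


def augds_move(dest, source):
    """Return (prefix_long, mov_d_field, mov_s_field) for MOV between high registers."""
    prefix = AUGDS + (dest & 0x2FE00) + ((source >> 9) & 0x1FF)
    return prefix, dest & 0x1FF, source & 0x1FF


if __name__ == "__main__":
    prefix, low = augs_jump(0x1234)
    print(hex(prefix), hex(low))
    prefix, d, s = augds_move(0x1234, 0x0A05)
    print(hex(prefix), hex(d), hex(s))
```

The printed prefix value is what goes in the LONG line, and the low fields go into the JMP or MOV that follows it.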
Comments
Just from a silicon perspective, dual and quad port RAM takes a lot of space. What could be nice though for an ultimate silicon version might be to split this 32KB into 2 x 16KB blocks, and permit each to be assigned as hub or as COG 7 extended COG RAM. But in the P1V FPGA, this is really nice.
Great that bst permits cog > 512 longs. This will permit some nice testing.
Cluso, tonight I have browsed through your AUGDS changes and am incorporating some of it in my files to be tested (your files had a few other extras which I didn't want so I have just taken the pieces relating to AUGDS #D, #S and AUGS #S and modified them to suit). I am keeping the instruction encoding presented in the earlier emails which fits in with all my earlier changes.
I also have a recommended change for AUGDS #D, #S which I am going to be trying out in my version. I think it is worth making the #S field also perform sign extension of the upper bits automatically. This will then let us create all the positive and negative constants in the range from -2^17 to (2^17-1), and at the same time also enable read/write access of the high COG memory range using #D, all without needing to add an extra AUGS in the sequence. We will still be able to use this augmented, potentially negative S value as an address too, as we don't even use the upper bits in the address (plus we can treat them as zero if the COGRAM ever exceeds 1MB).
We would use AUGS to create larger 32 bit constants, but that won't always be required, as lots of numeric constants for example will fit the smaller range, especially if it can now be negative.
Consider this..
This will move a value of -20000 into (high) COG memory address (DH<<9) + DL in just two instructions. Otherwise you'd need to do this below, which is burning another instruction and code space unnecessarily.
Obviously for constants outside the range of -131072 to 131071 we need to use the AUGS version instead.
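To make the proposed range concrete, here is a small Python model of the sign extension idea. It is only a sketch of the proposal as described in this post (18-bit two's complement, upper 9 bits in the AUGDS S field, lower 9 bits in the following instruction), not of any tested Verilog, and the function names are invented for illustration.

```python
AUG_S_BITS = 18                      # 9 bits from AUGDS + 9 bits from the next instruction
S_MASK = (1 << AUG_S_BITS) - 1       # $3FFFF


def fits_augds_signed(value):
    """True if the constant fits the proposed sign-extended 18-bit range."""
    return -(1 << 17) <= value <= (1 << 17) - 1


def split_signed(value):
    """Split a constant into (high 9 bits for AUGDS, low 9 bits for the instruction)."""
    if not fits_augds_signed(value):
        raise ValueError("constant needs the AUGS version instead")
    raw = value & S_MASK             # 18-bit two's complement encoding
    return raw >> 9, raw & 0x1FF


def sign_extend18(raw):
    """What the proposed hardware would produce from the combined 18-bit S value."""
    raw &= S_MASK
    return raw - (1 << AUG_S_BITS) if raw & (1 << 17) else raw


if __name__ == "__main__":
    high, low = split_signed(-20000)
    print(hex(high), hex(low), sign_extend18((high << 9) | low))
```

Anything that fails `fits_augds_signed` would still need the longer AUGS sequence.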
The one bug I had with my later AUGxx tests was precisely where the m[n] was being set relative to the cog_clk. I needed to know when the following instruction was being executed so that the D & S values of the AUGxx instruction could be appended/added, and when to reset these values. I either had my flag being set too early or reset too late. Then I got sidetracked with trying to do a debug setup and finally visitors.
If you like, I can email you my test cases. Otherwise, it will be mid next week till I get back to it to fix.
Other COGs could potentially render lines into hub RAM and the superCOG could output directly with back to back WAITVIDs taking its pixels/color input data directly from hub RAM. This may allow very hires modes we couldn't achieve before due to the limited COG RAM space for both data and WAITVID instructions, and just the extra time it normally took to read it all in from hub.
Any good ideas here...?
Roger.