A simpler COG
Ale
Posts: 2,363
Having loads of time to think, and less time to sit down at the keyboard, brought me the possibility of doing some FPGA work. While the ultimate goal may be the 1048576-bit parallel processor implemented in the Tornado5000 series to be developed in the next century, I have meanwhile decided that the constraints imposed by a Lattice MachXO2-1200 allow for interesting possibilities.
One of them is a COG. Yes, our beloved Propeller series I cog. (I know that some have done this already but I haven't seen any code yet).
With time to think, and pencil and paper, I have been exploring multiple architectures, and finally settled, for the time being, on a COG. But mine is a bit different, so there is no code compatibility. What matters are the concepts that lead to its simplicity: the sequencer has its work cut in half thanks to all those opcode bits for flags and result handling, and dual-ported memory for reading S and D concurrently avoids extra muxes.
I also increased the memory to 1024 locations. Some conditions had to go to make room for the extra address bits; those conditions are not used that often, so I decided the extra space was worth more. The FPGA is some 85% full.
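To illustrate the dual-ported memory point, here is a minimal Verilog sketch of how a 1024 x 32 cog RAM with concurrent S and D reads might look. This is not the actual source; the module and port names (cog_ram, addr_a, addr_b, we_a) are just illustrative. Port A reads S, or writes the result back during the write-back phase, while port B reads D, so both operands arrive in one cycle without an extra mux:

module cog_ram (
    input  wire        clk,
    input  wire        we_a,       // write-back strobe (WB phase only)
    input  wire [9:0]  addr_a,     // S field, or D field during write-back
    input  wire [31:0] wdata_a,    // ALU result
    output reg  [31:0] rdata_a,    // S operand
    input  wire [9:0]  addr_b,     // D field
    output reg  [31:0] rdata_b     // D operand
);
    reg [31:0] mem [0:1023];       // 1024 locations, maps to a block RAM

    always @(posedge clk) begin
        if (we_a) mem[addr_a] <= wdata_a;
        else      rdata_a    <= mem[addr_a];
        rdata_b <= mem[addr_b];    // second port, read every cycle
    end
endmodule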
Comments
That makes it a whole lot more flexible and matters even more on a single COG design.
What speed does your COG run at?
Here some simulation waveforms.
Regarding the frequency, the compiler said something about 10 MHz; that is in the slowest of the MachXO2 series, with an 8x8 multiplier.
Chip gave some details in the Prop 2 threads. IIRC he managed a late change to add threading, without a clock penalty.
To time-slice, you create multiple Opcode Fetch Pointers (Program Counters) and Flags, and then flip those to the CPU engine in a time-sliced manner.
You need to decide how many PC/FLAG copies (or threads) are to be supported, and give some way to weight them.
Simple equal time is less flexible than something like a 4-way copy, which needs 2 bits to map, and one can fit 16 x 1:4 allocation slots into a 32-bit register.
You might choose to allocate every second one to the most critical thread, giving that 50% CPU throughput (still jitter free, but at half the clock speed), and the remaining 8 slots could map as 3/16:3/16:2/16, or 4/16, 3/16 and 1/16, or 7/16, 1/16, 0/16 etc.
Or, you may decide to allow some modulus bits, to cover cases like a 1/3:1/3:1/3 mapping.
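Not from any actual design, but a minimal Verilog sketch of that slot-map idea, assuming a 32-bit register holding 16 two-bit thread IDs (module and signal names are made up for illustration):

module slot_scheduler (
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] slot_map,    // 16 x 2-bit entries, each names a thread
    output wire [1:0]  thread_sel   // thread that owns the current clock
);
    reg [3:0] slot;                 // current slot, wraps 0..15

    always @(posedge clk or posedge rst)
        if (rst) slot <= 4'd0;
        else     slot <= slot + 4'd1;

    // pick the 2-bit field belonging to the current slot
    assign thread_sel = slot_map[{slot, 1'b0} +: 2];
endmodule

Loading slot_map with alternating thread 0 / thread 1 entries gives the jitter-free 50% case; any other 8-of-16 split gives the weighted mappings described above.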
But I am tinkering a bit more with an external memory interface. RDBYTE/WRBYTE should access external memory directly... let's see what I can come up with.
Thanks!
I wonder how difficult it would be to port your design to a Cyclone IV? I guess that partly depends on whether you used an HDL or building blocks from a Lattice library. Anyway, if your COG is sufficiently lean, perhaps several would fit on a DE0-Nano. More cogs plus higher speed would make a most interesting design to play with!
Or how about an engine that executes Spin bytecodes directly?
Yes, you have an interesting project indeed.
Flexible time slices and mapping multiple thread execution register sets (PC, C, Z, etc.) as well as bank switching would be very fine enhancements to the current architecture. But more important, at least in my opinion, would be the provision for indirect addressing. The current cog can already quite effectively do task switching, but is crippled somewhat by the lack of indirect operations.
Solving this latter issue would really make a cog shine !
Cheers,
Peter (pjv)
Something like a Spin bytecode interpreter went through my mind; that would be cool, maybe microcoded. The PropGCC CMM bytecodes are new territory for me, I didn't know they existed... I'll research it!
Indirect addressing would need an extra field in the opcode map... and there is no space for it. Maybe autoincrement/decrement... Maybe another encoding for the flags/effects should be implemented... 7 bits are quite a... bit!
Microcoded instruction sets for various languages (Pascal, Modula-2, Prolog, LISP, Java) have been done, but there isn't much of a performance gain. Once the Intel 32-bit microprocessors started taking off, it killed the market for these microcoded beasts.
But if one wants to study how to implement it, take a look at the Symbolics workstations. It's fascinating stuff.
Since this is custom, and you are doing external memory, I would suggest to include SPI RAM and SPI FLASH HW options. (QSPI where possible, with options for Dual chips, to do 8 wide SPI )
Really fast stuff is managed on-chip, but once you go to LMM byte-code style, the speed impact of going serial can be greatly reduced with proper HW support.
The only reason this works is that the P2 has a 4 stage pipeline. If he copies the P1, there isn't any chance of it having this feature, because the 4 stage pipeline imposes so many restrictions on code execution that the P1 doesn't have.
They're as different as apples and oranges.
I'm not following the logic here. This has been done in the P2, in spite of the 4 stage pipeline imposing so many restrictions on code execution
- which has to make it easier to do in a CPLD version than in the P2, rather than harder?
The OP has already said in #5 that he thinks a 2:1 slicer is (relatively) easy to do.
The P2 and P1 *are* similar in that the P1 does the 4 stages of execution in 4 consecutive clocks, while the P2 interleaves 4 instructions across 4 stages and 4 clocks. When you combine this with the multi-threading, it actually takes 16 clocks for 4 instructions to enter and exit the pipeline, because although the P2 executes 4 instructions in 4 clocks, the pipeline is 4 stages deep and the currently finishing instruction was loaded 4 clocks ago.
I think if you used a DE0-Nano it might gain some traction as those of us that already have one might find the time to assist.
If you (and others) come up with something interesting, once the P2 is out in production, Chip/Parallax might be interested in adding it into a P1 or P2 variant.
I also have a board with a Spartan2-100, and one with a Spartan3-200; the DE0-Nano will be the birthday gift that I'll get, as long as I order it. There, I'm sure one could pack loads of such cogs, maybe 15 of them.
I am writing down some documentation and diagrams. I want to see if I can speed it up a bit; it is rather slow. With the combinatorial multiplier it should do some 6.5 MHz.
Multiply is a rare opcode, and systems often pipeline it, which saves a lot of logic, and you may be able to make it larger / size-variable.
Adding an opcode like WAITMATH would work like WAITxx, allowing you to pack in other instructions that did not use the result, and wait for the result if you need to, with minimal looping cost.
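As a rough sketch of that idea (not anyone's actual implementation; the latency, module and signal names are assumptions), a pipelined multiply with a busy flag would let a WAITMATH-style opcode stall only while the result is still in flight:

module mul_pipe #(parameter MUL_LAT = 3) (
    input  wire        clk,
    input  wire        start,          // MUL issued this cycle
    input  wire [31:0] a, b,
    output wire [63:0] product,        // valid once busy drops
    output wire        busy            // WAITMATH would stall while high
);
    reg [63:0]        stage [0:MUL_LAT-1];
    reg [MUL_LAT-1:0] inflight;
    integer i;

    always @(posedge clk) begin
        stage[0] <= a * b;             // let the tools retime / absorb into DSP blocks
        for (i = 1; i < MUL_LAT; i = i + 1)
            stage[i] <= stage[i-1];    // delay line for the remaining stages
        inflight <= {inflight[MUL_LAT-2:0], start};
    end

    assign product = stage[MUL_LAT-1];
    assign busy    = |inflight;        // a result has not yet reached the end of the pipe
endmodule

While busy is high the core could keep issuing instructions that do not touch the product, and only a WAITMATH (or a read of the result register) would stall.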
Note another opcode in Prop 2 is REPEAT, which does a 'zero overhead loop': you specify a loop count and block size, and a parallel hardware circuit does the DJNZ.
Prop 2 also allowed a timeout on some Waits...
On the Spartan2 it can do 18 MHz.
I have been thinking about threading vs. pipelining, and 2-way threading can be done sort of easily, interlocking the fetch, register load and writeback. At least on paper it is possible; I'll have to check how slow it becomes due to the extra address muxes and related logic.
I have to write some docs and publish them, with diagrams! Those I have to do too; maybe we'll get some more ideas going.
The picture shows a bit of the simulation of the sequencer and the memory. The two threads are labeled t1 & t2. The marker shows where the first thread starts execution; one clock before, memory is read to get the opcode. The slots or states are Fetch S&D T1, Fetch Opcode & WB T2, Fetch S&D T2 and Fetch Opcode & WB T1. I have to add the ALU and see if it fits the MachXO2. A Spartan2-100 would probably be 50% full.
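For reference, a minimal Verilog sketch of that four-slot sequence, assuming a plain free-running state counter; the state names follow the waveform labels, everything else is illustrative:

module thread_sequencer (
    input  wire       clk,
    input  wire       rst,
    output reg  [1:0] state,
    output wire       thread,      // 0 = T1, 1 = T2
    output wire       fetch_sd,    // read S and D operands this cycle
    output wire       fetch_op_wb  // fetch next opcode and write back the result
);
    localparam FETCH_SD_T1 = 2'd0,
               FETCH_OP_T2 = 2'd1,
               FETCH_SD_T2 = 2'd2,
               FETCH_OP_T1 = 2'd3;

    always @(posedge clk or posedge rst)
        if (rst) state <= FETCH_SD_T1;
        else     state <= state + 2'd1;      // fixed round-robin order

    assign thread      = (state == FETCH_OP_T2) || (state == FETCH_SD_T2);
    assign fetch_sd    = (state == FETCH_SD_T1) || (state == FETCH_SD_T2);
    assign fetch_op_wb = ~fetch_sd;
endmodule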
The file tb_v8.v can be compiled with Icarus Verilog:
$ iverilog tb_v8.v
and run with vvp
$ vvp a.out
That will create a waveform file, dump.vcd (also provided), which can be displayed with GTKWave.
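If GTKWave is installed, the dump can be opened straight from the command line:
$ gtkwave dump.vcd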
What it does...
Thread one starts at address 0 and executes a call to $004 with the ret at $022. Thread 2 starts, after one instruction of t1, at address $20; at address $22 it finds the jmp written by thread 1. For the time being only the memory and the sequencer are included. Feel free to try it out!
That can pipeline, so should have no speed-cost. Default could be 1:1 for 50% CPU.
It would nudge the HW a little (a tiny %), but I could see that two wide-ish load values could be quite useful?
That would allow Video threading for example. 0 is off, and 1..N is that many cycles on that thread.
Examples would be:
A:B 1:1 run high speed 50%/50%
A:B 0:1 B has 100%, A is off
A:B 1:0 A has 100%, B is off
A:B 480:110 A has 100% for 480 clocks, then B has 110 clocks
A:B 9:1 A has 90%, B has 10%
Either thread could change A:B, and it would apply on the next load.
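A minimal Verilog sketch of that A:B scheme, assuming 9-bit counts so the 480:110 case fits, with the new values sampled only when a turn ends (the module and signal names are illustrative):

module ab_slicer (
    input  wire       clk,
    input  wire       rst,
    input  wire [8:0] cnt_a,      // clocks for thread A per turn (0 = off)
    input  wire [8:0] cnt_b,      // clocks for thread B per turn (0 = off)
    output reg        thread      // 0 = A, 1 = B owns the current clock
);
    reg [8:0] left;               // clocks left in the current turn

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            thread <= 1'b0;
            left   <= (cnt_a != 0) ? cnt_a : cnt_b;
        end else if (left <= 1) begin
            // turn over: switch thread unless the other count is zero
            if (thread == 1'b0 && cnt_b != 0) begin
                thread <= 1'b1; left <= cnt_b;
            end else if (thread == 1'b1 && cnt_a != 0) begin
                thread <= 1'b0; left <= cnt_a;
            end else
                left <= (thread == 1'b0) ? cnt_a : cnt_b;  // other thread is off, keep going
        end else
            left <= left - 9'd1;
    end
endmodule

With cnt_a = cnt_b = 1 this degenerates to the 1:1 case, and a zero count simply leaves the other thread running at 100%.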
Here is the report for the attached version.
It has two threads, but I haven't tested it, only the one-thread version. It should run at 24 MHz, with the Booth signed multiplier and all. No idea what of it all works as of yet, but the multiplier and the pointers should work; the pointers are separate for both threads. The C and Z flags are not saved yet, so a little rework is still needed.
I'll post more code soon