Cog CPU Component Description
Bob Drury
in Propeller 2
Are there any details or a block diagram of the Cog CPU? I believe the following components exist; are there any more?
Each Cog CPU has a 20-bit PC (program counter).
Each Cog CPU has an ALU (arithmetic logic unit).
The ALU has a result register and the C (carry) and Z (zero) flags.
Each Cog CPU has a “Q” register with 2 associated flags that can be set with SETQ and SETQ2.
The Q register is used with RDLONG/WRLONG to identify burst reads/writes to CogRam or LutRam.
Each Cog CPU has an AUGSx 32-bit register for augmenting the source field S (9-bit field + 23 bits).
Each Cog CPU has an AUGDx 32-bit register for augmenting the destination field D (9-bit field + 23 bits).
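To illustrate the "9-bit field + 23 bits" arithmetic in the last two items, here is a minimal Python sketch of how a 32-bit constant would divide into a 23-bit augment part and a 9-bit instruction field. The function names `split_for_aug` and `combine_aug` are hypothetical and only model the bit arithmetic; they say nothing about how the hardware actually stores the value.

```python
MASK32 = 0xFFFFFFFF

def split_for_aug(value):
    """Split a 32-bit constant into (upper 23 bits, lower 9 bits).

    Assumption for illustration: the upper 23 bits are what an AUGS/AUGD
    prefix would carry, and the lower 9 bits are what fits in the S or D field.
    """
    value &= MASK32
    upper23 = value >> 9      # top 23 bits
    lower9 = value & 0x1FF    # bottom 9 bits
    return upper23, lower9

def combine_aug(upper23, lower9):
    """Rebuild the 32-bit constant from its two parts."""
    return ((upper23 << 9) | (lower9 & 0x1FF)) & MASK32

upper, lower = split_for_aug(0x12345678)
print(hex(upper), hex(lower))            # 0x91a2b 0x78
print(hex(combine_aug(upper, lower)))    # 0x12345678
```

The two parts always recombine to the original value, because 23 + 9 = 32 bits.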
Regards
Bob (WRD)
Comments
Sounds like you want a zoomed in block diagram of just the ALU with special registers and the instruction pipeline feeding it.
Beyond that it'll be a big drawing: dual porting of both lutRAM and cogRAM. HubRAM and the FIFO for hubexec. The streamer, coupled to lutRAM, and its myriad of hardware functions. The egg-beater of course. Cordic and hub-ops paths. Buses to/from smartpins as well. It'll be an amazing block diagram when complete.
There probably does need to be a P2 cpu block diagram, like this one for P1:
Oh, I forgot the DAC paths. They're quite prolific, which is why Chip limited them to groups of four consecutive pins with a fixed order.
What I was thinking was more of a block diagram of the components the programmer should be made aware of. Maybe the path ways wouldn't be required.
Regards
Bob (WRD)
Ok, so a programming model rather than a generic block diagram
Example?
Sometimes the P2 reminds me of a house that was considered too small by generations of owners, so they added something every year....
The overall concept seems to be to use the slower Spin2 interpreter in combination with the fastest PASM where needed. So a great many customers are forced to use assembler, in contrast to Arduino users, who can use a fast compiler for everything.
So the P2 will need a very good description of every assembler instruction, and the special hardware it uses, in one manual. Pictures and examples will be helpful. It is no good to let each customer guess, search and research on his own.
(I don't understand the priorities at Parallax. I think such documentation is far more important for making the P2 usable than the graphical debugger or Spin floating point.)
Jeff has said he wanted to get onto the documentation earlier but other things were keeping him tied up.
I am not sure that is always true. Spin can be compiled to machine code with FlexProp, and is also compiled to byte code in Proptool.
Interpreted Spin on the P2 runs roughly as fast as assembly does on a P1. I do not have the exact clock rate where that is true, but it is in the neighborhood of 200 MHz.
PASM is needed to use special features on chip, at least until libraries make more of that available.
And with FlexSpin and Catalina you have native compilers producing pure assembler, like on other computers. There is even a version of GCC supporting P1 and P2 code generation, alas an older one than current.
As for the fast compiler on A... - hmm - the turnaround time between changing a line of code and having the project running again for testing is quite long compared with all the compilers offered for the P2.
The reason a lot of people use assembler on P1 and P2 is that PASM is a very nice language, and sometimes you need/want to go down to the level of machine code, even in TAQOZ/Tachyon, as the thread about inline assembly here shows.
Like FORTH, Chip's bytecode interpreter is a stack-based RPN (reverse Polish notation) interpreter, so actually Spin is not as slow as you think.
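For readers unfamiliar with the term, a stack-based RPN interpreter can be sketched in a few lines of Python. This is only a generic illustration of the technique; `eval_rpn` and its token format are hypothetical and do not reflect the real Spin2 or TAQOZ bytecode.

```python
def eval_rpn(tokens):
    """Evaluate a list of RPN tokens using an explicit operand stack."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }
    stack = []
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # right operand is on top of the stack
            a = stack.pop()          # left operand is below it
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))   # anything else is an integer literal
    return stack.pop()

print(eval_rpn("2 3 + 4 *".split()))  # 20, i.e. (2 + 3) * 4
```

Every operator implies stack pops and a push, which is the overhead a register-based PASM routine avoids.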
It will be quite interesting to compare TAQOZ byte code, Spin2 byte code, Wuerfel byte code and the upcoming NU byte code from ersmith with some benchmarks, but benchmarks on a Micro Controller are not as easy as running a FFT.
Enjoy!
Mike
Well, I meant the concept, that Parallax follows. They do not really support the compilers.
A P2 at 200 MHz would be about 5 times faster than a P1 at its usual 80 MHz, as the P2 needs 2 clock cycles per instruction versus 4 for the P1.
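The arithmetic behind that factor of 5 can be checked with a short Python sketch. The 80 MHz P1 clock is an assumption (the common P1 setup, not stated in the post), and `mips` is a hypothetical helper that ignores hub waits and multi-cycle instructions.

```python
def mips(clock_mhz, clocks_per_instruction):
    """Peak instruction rate in millions of instructions per second."""
    return clock_mhz / clocks_per_instruction

p1_rate = mips(80, 4)    # assumed 80 MHz P1, 4 clocks per instruction
p2_rate = mips(200, 2)   # 200 MHz P2, 2 clocks per instruction

print(p1_rate)             # 20.0
print(p2_rate)             # 100.0
print(p2_rate / p1_rate)   # 5.0
```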
I find it interesting, that for ESP-32 there is no assembler, "only C", and complaints about this are extremely rare. Perhaps this is really mainly a question of the completeness of the libraries. P2's focus is of course more on the controlling side.
The last part of your last sentence is a bit of a riddle to me. :-)
Benchmarks are quite easy; the problem is how far the results will be relevant for your case. For example, if there is a multiply in the code, it will be very significant whether you can use mul16. Peter's simple Fibonacci benchmark gives a good idea for a speed comparison. https://forums.parallax.com/discussion/173684/simple-assembler-for-p1-tachyon-forth-5-7-or-can-you-beat-tachyon-at-fibonacci#latest
In my findings using Taqoz vs PASM, the handling of the stack takes a lot of time; it is often more direct if you can access several registers. So if you do a benchmark here, you will have that influence included.
(By the way, Taqoz V2.8 has evolved into 16bit word code. Taqoz's strength is that it is interactive and still rather fast.)
Well, the idea of this thread was to have some better documentation for PASM with a diagram. So sorry for this side-track.
Actually, an FFT is one of the "standard" Spin benchmarks.
If you're wondering, the results for that on P1 are like this:
I think the cogs look a bit like this on the inside. Not sure which bus instruction fetching happens on, but I think it's the D bus? (all the FIFO ops use D and I assume it isn't needlessly hooked up to the other bus, too.) @cgracey would have to confirm. Also missing whatever mysterious force makes WRPIN and friends work.
Sounds reasonable. I suspect it's a lot more cluttered than that in reality. It's very easy to drop in extra muxes in HDL. A hint is that the ALU write-back, which normally coincides with instruction fetch, must be able to write to lutRAM while instructions are executing from there as well.
I'm guessing there is a staging register just for WRLUT.
It seems to me Bob is more looking for something like a programming model.
An example snipped from page 28 of 'Programming the 65816' by Eyes and Lichty:
Granted, it doesn't show how the boxes connect, but it shows which elements are available to the programmer, and can even be colour coded to show what is write only, what is read only, and what is read/write.
AJL
Yes. The block diagram could then be referenced to what registers have to be set and the instruction description could reference the order.
The mode registers could then be individually blocked with the settings tables. This in turn could then also be used to reference the Boolean pin Diagrams.
Regards
Bob (WRD)
With hub-ops the cogs are time-sliced for accessing common resources like Cordic, HUBSET and hubRAM. The mode registers there are not in the cogs. The instructions for setting them take extra clocks to complete because of this arbitration and distance.
The pins/smartpins mode registers are way off in each smartpin. The mechanism is different to even the hub. With pins, every cog can talk over each other to any arbitrary pin. Chip has described how it works in the past but I don't remember the details myself. The routing resources required to make that work as fast as it does may have been the reason why we didn't end up getting the full 16 cogs.
As for an equivalent programming model, much of the special-purpose internals of a cog is not accessible. The Q register wasn't even meant to be readable. Chip was surprised when we discovered we could read it. The program counter is one that is not readable but obviously does exist ... well, except by a CALL+POP combo.
There is a bunch of hidden status flags. Only C and Z are visible for general use. The event system will be chocker with flags.
504 accumulators (general registers).
PTRA and PTRB can be used for stack pointers but there is also the hidden hardware stack.
Much of the Prop2 instruction set is specific instructions to manage all these hidden bits. That is why I've stated in the past that the architecture is strongly load/store oriented.
At one point I suggested maybe a clearer focus on load/store might help. I didn't make a clear case myself though, so it wasn't received well.
Nice picture! I think, something like this is good to get an overview. Perhaps the paths for addresses could be included? You could then see, that LUT is always addressed indirectly. As far as I understand, the streamer can access HUB Ram too? ( I have not yet used it. ) Perhaps it would make sense to include the cordic into the picture as well.
For more details, I think that text describing the hardware functions together with the PASM instructions, including examples, is necessary. Parallax has done excellent documents in the past: https://nagasm.org/ASL/Propeller/printedPDF/AN001Counters10.pdf
That FIFO at the bottom is a critical part of streamer access to hubRAM. Sometimes I've noticed Chip even call it the streamer, maybe accidentally. The FIFO is shared between the cog and its streamer. Streamers are slightly oddball in that they are built into each cog, using both the FIFO and lutRAM, but otherwise kind of live their own lives detached from the associated cog.
Of course the program counter is readable! LOC with zero offset will do the trick. Not sure if you can actually assemble that though...
loc pa,#$
maybe?
The event flags can also be tested non-destructively, I think, using the conditional branch instructions.
The CORDIC lives in the hub, so I guess it just uses the same connections as memory access. It gets 3x32 bit inputs though...
And yes, the connection between streamer and FIFO is missing.
I thought about drawing separate address busses, but it's harder to guess where the addresses are coming from...
The point, I guess, is that they're not intended to be readable. The fact that many are derivable is incidental.
An easy way to resolve that might be to just mark them as internal, or special. People benefit by knowing they're there, and then how to actually get information to and from them can be a special section as well.