How to use a second Cog as a Coprocessor?
Hi, this is a somewhat vague idea, but perhaps it might be an interesting "intellectual pastime" for others too?
So the general idea is to use two cogs in parallel to process some code together. (I am thinking along the lines of a virtual Forth processor, but there might be other ideas for application.)
1. Perhaps one of the cogs could be used for the threading, jumps and loops, and the other cog would do the actual data processing work?
2. Perhaps one cog could handle the data stack; many words in TAQOZ end with jmp #DROP.
(Probably not a good idea, because TAQOZ holds the first 4 stack items in the fastest COG registers.)
3. Perhaps the second cog could be used to prefill a code cache in the shared LUT?
4. A LUT cache for data?
5. ???
Some practical questions:
Is there a simple example of LUT sharing, and one of using a smart pin as a fast shared LONG?
Thoughts?
Have fun, Christof
Comments
My first thought is a floating point coprocessor. There is a double-precision arithmetic library somewhere in this forum.
Arguments can be in shared LUT, HUB, or smart pins in repository mode, but for a floating point coprocessor, sharing LUT would be overkill.
The ATN mechanism can be used to trigger the coprocessor to do its job and to notify the main cog when it has finished the work.
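A minimal Spin2 sketch of that ATN handshake - the hub-RAM mailbox layout, the cog setup, and the multiply standing in for the real floating-point work are all just placeholders:

```
CON
  _clkfreq = 160_000_000            ' any clock setting will do

VAR
  long  arg1, arg2, result          ' hub-RAM mailbox (hypothetical layout)
  long  stack[64]

PUB main() | cop
  cop := cogspin(NEWCOG, coproc(cogid()), @stack)
  arg1 := 6
  arg2 := 7
  cogatn(1 << cop)                  ' trigger the coprocessor
  waitatn()                         ' block until it strobes us back
  debug(udec(result))               ' 42

PRI coproc(maincog)
  repeat
    waitatn()                       ' sleep until the main cog strobes ATN
    result := arg1 * arg2           ' stand-in for the real FP operation
    cogatn(1 << maincog)            ' notify the main cog we're done
```

Because ATN strobes are latched as events, neither side can miss a notification even if it arrives before the WAITATN.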
One use of symmetrical multiprocessing I've always imagined is in PLC setups, where the I/O processing can be done concurrently with the logic processing. The logic works from an I/O snapshot in RAM; only at the end of logic processing is the physical I/O updated. Therefore, just by flipping between two copies of the I/O snapshot, the I/O processor can always be updating the alternate copy concurrently with the logic's copy - the same double-buffering technique that's used in 3D games.
And the two should grow together quite well: as the logic gets busier, so does the amount of I/O that requires updating.
Another one: PLC function blocks could also be palmed off to further processors. There is no requirement for function blocks to be executed sequentially with the main logic loop. Again, they work with the I/O snapshot unless specifically required to sample independently.
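A minimal Spin2 sketch of the snapshot flip described above, assuming a hypothetical 16-long I/O image and an ATN handshake to make the buffer swap safe:

```
CON
  IOSIZE = 16                        ' longs per I/O image (hypothetical size)

VAR
  long  img[2 * IOSIZE]              ' two copies of the I/O snapshot
  long  active                       ' index of the copy the logic reads
  long  stack[32]

PUB logic() | iocog
  iocog := cogspin(NEWCOG, ioscan(cogid()), @stack)
  repeat
    ' ... evaluate the ladder logic against @img + active * IOSIZE * 4 ...
    waitatn()                        ' I/O cog has finished the alternate copy
    active ^= 1                      ' flip buffers between scans
    cogatn(1 << iocog)               ' release the I/O cog onto the old copy

PRI ioscan(logiccog)
  repeat
    ' ... update physical I/O through @img + (active ^ 1) * IOSIZE * 4 ...
    cogatn(1 << logiccog)            ' this copy is complete
    waitatn()                        ' wait for the flip before rereading 'active'
```

Only the logic cog ever writes 'active', and the I/O cog only rereads it after the flip, so neither cog ever touches the copy the other is using.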
@pik33 and @evanh,
thank you for your thoughts here! While I had thought of very tightly coupled coprocessing, your ideas go in directions where the coupling is less demanding. The programmer has to do the brainwork of deciding what can be done in parallel. But perhaps it is just a matter of habit whether you call a subroutine or put a task onto a list of open jobs. Some means to synchronise back is needed.
For my original idea of very tight coupling, I have doubts whether any substantial speed improvement is possible at all.
For the moment I will revisit https://forums.parallax.com/discussion/175422/p2-taqoz-v2-8-simple-automatic-multiprocessing#latest This was implemented for repeating tasks, but it should be rather simple to modify it for non-repeating tasks.
Christof
Think in buffers: have one COG prepare an array of data (whatever it is) and put it in HUB RAM, so that another COG can process that chunk further. If you visualize it in a flow diagram, it is like a daisy chain where each block runs in parallel and continuously (like a pipeline).
On the topic of buffers (it's been noted in recent times, but I don't think Chip was aware of this at design time): SETQ + RDLONG/WRLONG is effectively an atomic block-copy operation at the hardware level. It can provide important certainty when spreading the functions around like this. That's really pretty cool.
There is an exception to this, though. When the local cog's FIFO is used for Streamer ops it has priority access to hubRAM, and can therefore pause the above block copy until the FIFO is satisfied. Not sure if hubexec would also be a concern here; given that hubexec is synchronous to instruction execution, it shouldn't be making any FIFO demands during a block-copy instruction.
PS: It's not truly atomic, but as long as all cogs read and write the shared area of hubRAM with block copies, it behaves that way.
PPS: And there's no size limit.
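A minimal PASM2 sketch of such a block transfer - a worker cog pulling a 16-long packet out of a hub buffer and writing results back. The packet size is arbitrary, and PTRA/PTRB are assumed to have been set up by the launching cog:

```
DAT             org
' Wait for the producer, then block-copy a 16-long work packet
' from hub RAM into cog registers. With SETQ the whole transfer
' completes in one pass and, as noted above, behaves atomically
' as long as every cog uses block copies on the shared area.
worker          waitatn                     ' producer signals new data via ATN
                setq    #16-1               ' 16 longs
                rdlong  packet, ptra        ' block-read the packet
                ' ... process packet+0 .. packet+15 here ...
                setq    #16-1
                wrlong  packet, ptrb        ' block-write the results back
                jmp     #worker

packet          res     16
```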
Yes, working and communicating with buffers is a way to couple cogs. A secondary cog could perform an FFT on a buffer of samples, or write it to SD.
Another thought.
Several cogs could have different bytecode interpreters, so that the same bytecode has a different meaning for each of them, yet they read and execute the same bytecode stream in a synchronised manner - doing different things in parallel that complement one another somehow.
The fastest method of passing data between COGs should be a shared pin in repository mode... you might want to look into that. It can be a point-to-point connection. My wild guess is that the latency is about 5 clock cycles... so you might work out a bytecode that runs on all 8 in parallel and is synchronised via repositories.
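A minimal Spin2 sketch of a repository pin - the pin number is just an arbitrary free pin, and P_REPOSITORY is the built-in smart-pin mode constant:

```
CON
  RPIN = 32                          ' any free smart pin (hypothetical choice)

PUB demo() | v
  pinstart(RPIN, P_REPOSITORY, 0, 0) ' configure the smart pin as a 32-bit repository
  wxpin(RPIN, 42)                    ' writer cog: WXPIN stores a long in the repository
  v := rdpin(RPIN)                   ' reader cog: RDPIN fetches it
                                     ' (RQPIN reads without acknowledging)
```

If I remember the smart-pin documentation right, the pin's IN flag rises on each write, so a reading cog can poll it (e.g. TESTP in PASM2) to know when fresh data has arrived.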
Another idea: since access to hub memory is interleaved (each COG has 1/8 of the memory bandwidth), you might be able to work out a scheme where the COGs share one HUB RAM location, just in different time slots... but that is just an unconfirmed idea off the top of my head.
I have done some experiments with Catalina. My first one was to use a cog pair for the XMM Kernel and the cache it typically needs to access XMM RAM. It works, and the code will be included in the next release of Catalina - but the results are mixed.
There were two parts to this experiment:
1. Using pins in repository mode to communicate between the cogs gives a speed improvement of about 10%, both in benchmarks and in real-world cases. This was to be expected, since the cache only needs to be consulted on read operations that leave the current page, or on write operations.
2. Using the LUT as the cache gives a larger improvement in benchmarks - of the order of 25%. But sadly not nearly so much in real-world cases. The reason is that the LUT is just too small to make a decent cache, so the improvement of having it in the shared LUT is not large enough to offset what real-world programs gain from a much larger cache in Hub RAM. Catalina supports Hub RAM cache sizes up to 64k on the P2, whereas the LUT can only be up to 2k.
My next experiment will be with Catalina's floating point co-processor - again, I expect decent improvements using pin sharing, but LUT sharing probably won't offer much. My experience is that the LUT offers much more significant improvements as additional code and data space than by sharing it. But I am sure someone will come up with an application where LUT sharing really shines.
The next release of Catalina will probably use pin communications by default, but not LUT sharing.
Ross.
LUT sharing was requested for special Cog pairing where there might be a desire to have two Cogs dedicated to one I/O driver. Something like a USB manager.
Something to note about LUT sharing: you can turn it on and off at will, and only writes made from the paired cog while it's enabled will come through. Also, take note of the events/interrupts associated with it. You could use those "magic" locations as a mailbox while keeping the rest of both LUTs stuffed with code.
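A minimal PASM2 sketch of that mailbox idea, assuming an even/odd cog pair and LUT location $1FF as the (arbitrarily chosen) mailbox. Polling is shown here; the LUT-write event mentioned above could replace it:

```
DAT             org
' --- receiving cog (one half of an even/odd pair) ---
rx              setluts #1              ' allow LUT writes from the paired cog
                wrlut   #0, #$1FF       ' clear the mailbox
.wait           rdlut   pa, #$1FF       ' poll the mailbox...
                tjz     pa, #.wait      ' ...until the partner writes non-zero
                ' ... pa now holds the request ...

' --- sending cog (the other half of the pair) ---
tx              wrlut   request, #$1FF  ' writes its own LUT, and the write is
                                        ' mirrored into the partner's LUT

request         long    $12345678
```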
Yes, that could be handy in the case of an emulator COG that wanted its own exclusive fast and/or cached access to PSRAM without having to host all the code that requires in that same emulator COG - thereby keeping most of its own LUT RAM free for its own use. It could perform its memory accesses via this mailbox, and the dynamically LUT-paired driver COG could manage a cache on its behalf as well as accessing the memory whenever required. This is useful if that driver COG also uses most of its LUT RAM for cache tags or whatever other code or state it maintains. Also, in theory the LUT mailbox should be a faster way to request service and return results than going via HUB RAM, at least for single accesses. Block transfers would still be best done via the HUB, as you can read/write one long per clock that way (once you are already going).
I recall there was something to do with the streamer or FIFO that didn't function when LUT sharing was enabled, but if sharing is only turned on temporarily while the system is idle, I would expect they could still coexist.
Found it...
"Lookup-RAM writes from the adjacent cog are implemented on the 2nd port of the lookup RAM. The 2nd port is also shared by the streamer in DDS/LUT modes. If an external write occurs on the same clock as a streamer read, the external write gets priority. It is not intended that external writes would be enabled at the same time the streamer is in DDS/LUT mode."
I suspect streaming in LUT mode is avoidable, or could at least be avoided during the time the writes from the requesting COG are being made, especially if you block while the request is in progress and wait for the result. A data cache would potentially need to be write-through in that case, so that the memory driver COG isn't streaming anything out of the LUT in the background idle time whenever a new request could come in via the LUT mailbox.
Has anyone actually used it as such? I am convinced there is significant benefit to be had here, and I'd love to see a working example that showed what was possible.
Another small co-processor test with Catalina has also produced mixed results. This one used repository pins to speed up communications between the kernel and the floating point co-processor (instead of a Hub RAM mailbox). What I really love about the Propeller is that you can make such deep and fundamental design changes in just a few instructions - it shows how well integrated the Propeller design really is. And again, the initial results seemed promising - my simple floating point benchmark program showed a significant improvement (> 20% speed increase). But sadly, the gains disappear almost entirely in real-world programs. This is because even programs that apparently make heavy use of floating point actually do not - the non-floating-point operations typically outnumber the floating point ones even in tight loops, so the overall benefit ends up being relatively insignificant.
However, I believe the co-processor idea is basically sound and I will continue to try and find the best place to use it!
Ross.
I use LUT sharing in my 1080p 2-bit tile driver. One cog dynamically changes the LUT contents in order to have 24-bit colors for each of the four colors in every tile...
Thanks. Actually, now that you mention it, I think I knew that!
I started work a while ago on adding your driver as a Catalina plugin, but at the time I couldn't do it, probably because I didn't really understand how it worked, and I had a VGA driver that worked well enough. But I'll have another look at your driver. Is this the latest version ... ?
https://forums.parallax.com/discussion/171133/1080p-tile-driver-p1-style/p1
Ross.
This driver is ancient and from a time when I was probably better at PASM2, because it's crazy complex…
However, I think I did recently update it enough to work with the new Prop Tool. I think it's under a mixed-signal scope thread…
Of course I am following this thread with great interest!
I am working on my editor and was wondering how I could split up that work for more than one cog. Somehow an editor can be seen as a virtual processor: input keys are its instructions. Up to now, though, I have not come to a solution which really brings a benefit. File operations, the text buffer and also the screen are resources that must be handled carefully for consistency, which is much easier in one program. Inserting a line into a large buffer is time consuming, but you definitely have to wait until it is done. An editor spends lots of time idling. My latest performance improvement is to postpone the refresh of the text display (syntax highlighting needs time) and the update of the auto-completion word proposals to a point in time when no new key is pending in the serial buffer - obviously the writer did not need the information yet. As there is already a flag indicating that the refresh has to be done, this work could be handled by a second cog, but there is no benefit.
Somehow, on the one hand, the chunks of work given to a second cog must be big enough for efficiency; on the other hand, the main cog must be able to make use of the time in parallel.
One interesting application of a coprocessor would be the MMU of the coco3 emulator. Perhaps the MMU coprocessor should swap whole memory pages?