Quick-n-dirty float32 hack for sharing it between multiple objects

M. K. Borri · 2006-10-20 16:27

This is a quick and dirty hack that lets multiple objects share a "math coprocessor" cog. It's the best I could think of; please tell me if it's been done before, as it's probably been done better :)

Very simple -- basically it starts a cog (waiting if there's none available), does the operation, then stops it.

Post Edited (M. K. Borri) : 10/20/2006 4:35:35 PM GMT

Mike Green · 2006-10-20 16:43

Starting and stopping a cog is pretty slow (100's of microseconds) because the Propeller has to load the 512-long program into the cog. If you want to share a math cog between two other cogs, you can just use the LOCKxxx calls to "reserve" the floating point, do the operation, then "release" the floating point semaphore (lock). That's much, much faster.
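
The reserve/operate/release pattern described above can be sketched on a desktop machine. This is only a model in Python, not Propeller code: `fpu_lock` stands in for a hub lock, `shared_fadd` is a hypothetical wrapper name, and the addition is a placeholder for the call into the already-running math cog.

```python
import threading

# Stands in for a hub lock that would be checked out once at startup.
fpu_lock = threading.Lock()

def shared_fadd(a, b):
    """Reserve the shared math unit, do one operation, then release it."""
    with fpu_lock:        # "reserve": wait here until no other client holds the lock
        return a + b      # placeholder for the request to the math cog
    # leaving the with-block is the "release" step

def client(values, out, index):
    """One client cog: sums its values using the shared math unit."""
    total = 0.0
    for v in values:
        total = shared_fadd(total, v)
    out[index] = total

results = [None, None]
threads = [
    threading.Thread(target=client, args=([1.0, 2.0, 3.0], results, 0)),
    threading.Thread(target=client, args=([0.5, 0.25], results, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

The math cog stays loaded the whole time, so each call pays only for taking and releasing the lock rather than for a cog start.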

M. K. Borri · 2006-10-20 16:50

Thanks! I'm going to try to do that then :)

Post Edited (M. K. Borri) : 10/20/2006 4:55:47 PM GMT

cgracey · 2006-10-20 18:55

At 80MHz it takes 102.4us to load a COG, immediately after which it begins executing code:

512 longs to be read * 16 clock cycles per hub access * 12.5ns per clock = 102.4us.
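
The same arithmetic can be written as a small Python helper, which makes it easy to see how the load time scales at other clock rates. The function and constant names are made up for this illustration; the numbers are the ones quoted above.

```python
LONGS_PER_COG = 512          # longs copied from hub RAM into the cog
CLOCKS_PER_HUB_ACCESS = 16   # clock cycles per hub access

def cog_load_time_us(clock_hz=80_000_000):
    """Time to load a cog, in microseconds, at the given system clock."""
    clock_period_ns = 1e9 / clock_hz   # 12.5 ns at 80 MHz
    return LONGS_PER_COG * CLOCKS_PER_HUB_ACCESS * clock_period_ns / 1000.0

print(cog_load_time_us())              # load time at 80 MHz
print(cog_load_time_us(40_000_000))    # half the clock, twice the load time
```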

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

M. K. Borri · 2006-10-20 20:22

Here's the float32full... note that float32a is unchanged.

Since it's just 1/10th of a millisecond more, I'm keeping this downloadable because it's still faster than doing everything in Spin, some people may have a use for it, and I'm going to use it until I do something better optimized :)

op        floatmath   float32   this
FAdd      371         39        104+39  = 143
FSub      404         39        104+39  = 143
FMul      341         46        104+46  = 150
FDiv      1413        45        104+45  = 149
FFloat    192         35        104+35  = 139
FTrunc    163         36        104+36  = 140
FRound    163         36        104+36  = 140
FSqr      1522        247       104+247 = 351
FNeg      21          21        21
FAbs      21          21        21

still mostly worth it if you're not using the cog anyway...
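
As a rough check of the "still mostly worth it" claim, the figures can be compared in a few lines of Python. This is a sketch that assumes the table is read as: float32 time plus a fixed cog start cost of 104 gives the time for this version. FNeg and FAbs are left out because they show 21 in every column.

```python
COG_START = 104  # per-call cog start cost used in the table (same units as the table)

# op: (floatmath, float32), as read from the table
TIMES = {
    "FAdd": (371, 39), "FSub": (404, 39), "FMul": (341, 46), "FDiv": (1413, 45),
    "FFloat": (192, 35), "FTrunc": (163, 36), "FRound": (163, 36), "FSqr": (1522, 247),
}

def start_stop_time(float32_time):
    """Time for one operation when the cog is started for every call."""
    return COG_START + float32_time

for op, (floatmath, float32) in TIMES.items():
    this = start_stop_time(float32)
    print(f"{op:7s} floatmath={floatmath:5d} this={this:4d} ratio={floatmath / this:.1f}")

# Operations where starting a cog per call still beats floatmath
still_faster = [op for op, (fm, f32) in TIMES.items() if start_stop_time(f32) < fm]
```

Under that reading every listed operation still beats floatmath, but the margin is small for FTrunc and FRound and large for FDiv.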

What I'm thinking is, instead of messing with locks since the rest of the program may be using them, just upload the particular function I need to run into a cog -- would that be cheaper or does the hub have to send over the full 512 longs anyway?

Post Edited (M. K. Borri) : 10/20/2006 8:40:47 PM GMT

Cliff L. Biffle · 2006-10-20 23:40

Alternatively, you can use a non-blocking asynchronous algorithm, similar to what the FullDuplex example does.

Reserve an area of shared RAM as a message queue. Also reserve a 'head' and 'tail' pointer.

Any Cog that needs floating-point will:
- Reserve an additional chunk of shared RAM (say one long) for the result. (Note that modifying the queue from more than one cog may require a lock, unless you're very careful.)
- Wait until there is room in the queue.
- Write a message (format of your choosing) into the queue indicating the operation to perform, operands (immediate or address), and a pointer to that cog's result long.
- Wait until the FPU Cog notifies you that the operation is done (either by setting yet another long, incrementing the 'tail' pointer, or by simply changing the result long)

The FPU Cog would:
- Wait until there are messages in the queue.
- Process.
- Repeat.

This may be overly complicated, depending on how much floating point you're doing. The main advantage is that floating point operations can be made asynchronous -- rather than waiting for the lock, you can issue the operation when you have the operands, and block on the result only when you really need it. It also scales better: if throughput becomes an issue, you can dedicate additional Cogs to floating point, reading from the same message queue.
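
A minimal sketch of this queue scheme, modeled in Python with threads in place of cogs. All names here (`Request`, `fpu_worker`, `submit`, the `OPS` table) are hypothetical, and `queue.Queue` hides the head/tail pointers and locking that would have to be managed by hand in hub RAM.

```python
import queue
import threading

class Request:
    """One queue message: operation, operands, and a slot for the result."""
    def __init__(self, op, a, b):
        self.op, self.a, self.b = op, a, b
        self.result = None
        self.done = threading.Event()   # set by the FPU worker when finished

    def wait(self):
        """Block until the result is available, then return it."""
        self.done.wait()
        return self.result

OPS = {"FAdd": lambda a, b: a + b, "FMul": lambda a, b: a * b}

def fpu_worker(requests):
    """FPU cog: wait for messages, process, repeat. None means shut down."""
    while True:
        req = requests.get()
        if req is None:
            break
        req.result = OPS[req.op](req.a, req.b)
        req.done.set()

def submit(requests, op, a, b):
    """Client side: enqueue a request (blocks while the queue is full)."""
    req = Request(op, a, b)
    requests.put(req)
    return req

requests = queue.Queue(maxsize=8)
workers = [threading.Thread(target=fpu_worker, args=(requests,)) for _ in range(2)]
for w in workers:
    w.start()

# Issue operations as soon as the operands are known...
pending_sum = submit(requests, "FAdd", 1.5, 2.25)
pending_product = submit(requests, "FMul", 3.0, 4.0)
# ...do other work here, and block only when the results are needed.
print(pending_sum.wait(), pending_product.wait())

for _ in workers:
    requests.put(None)   # one shutdown marker per worker
for w in workers:
    w.join()
```

Two workers read from the same queue here, which corresponds to dedicating a second cog to floating point when throughput becomes an issue.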

M. K. Borri · 2006-10-21 02:06

Thanks for the suggestion! I want to see if I can make it so that people can just turn this on instead of float32. I'm going to go back to playing with it :) I believe that this is something that might be useful to a lot of people after all.

Phil Pilgrim (PhiPi) · 2006-10-21 17:59

For what it's worth, here's an archive I posted back in March on the beta testers forum, along with a copy of the original post. It might be useful here. In this scenario, the float object could be set up as a server and the programs that need it as clients. It works somewhat like what Cliff suggested above, in that each client has its own buffer and request table entry. The ipc object intermediates requests between clients and the available servers. Clients, when making requests, can either wait for a response or go about their business and check for a response later. The downside, of course, is that it uses up an extra cog.

Having now reread something I hadn't looked at for more than six months, it's apparent that I could have explained things better. I'll try to answer any questions that come up as best I can.

PhiPi said...

Attached is an archived Spin object I've been working on for a week or so. It enables interprocess communications, so Spin cogs can exchange data with each other in a transparent fashion. It uses a client/server model, whereby a server will listen for requests coming from a client. Each request will arrive at a server as the address of a client-owned buffer, which contains the data making up the request and into which the server can write any response it may have. The data in the buffer can be anything: there's no particular formatting required. How it's interpreted is entirely up to the user's program. Some possibilities might include I/O data, logging info, or even remote procedure calls. It's also possible for a local server to be the front-end for a server on an entirely different, remote system, with which it can communicate. The possibilities are endless.

This module requires its own cog and one lock. The lock is used only when creating or destroying server and client objects. Outside of that, the protocol uses flags only, and even these are transparent to the user. The cog contains an assembly language router which mediates the requests from client to server. This is a connectionless design. Once a request has been serviced, there is no further connection between client and server until another request is made.

Each server advertises its availability by name (actually a LONG value derived from a string, although any LONG value can be used). Requests are made by name only, so the client doesn't have to know which cog the target server exists in. If two servers advertise the same name, both will be available to service requests made to that name; and the router will simply pick the first one that's not busy. That way time-intensive requests can have more than one server available to service them, without the client being aware of the process.
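
The name-based routing described above can be illustrated with a small Python model. This is not the ipc object's code: the way a string becomes a LONG (packing up to four ASCII characters) is an assumption, and `Router`, `advertise`, `route`, and `release` are hypothetical names.

```python
def name_to_long(name):
    """Pack up to four ASCII characters into a 32-bit value (assumed encoding)."""
    data = name.encode("ascii")[:4].ljust(4, b"\0")
    return int.from_bytes(data, "little")

class Router:
    """Picks the first idle server advertising the requested name."""
    def __init__(self):
        self.servers = []

    def advertise(self, name, server_id):
        self.servers.append({"name": name_to_long(name), "id": server_id, "busy": False})

    def route(self, name):
        """Return the id of an idle server for this name, or None if all are busy."""
        key = name_to_long(name)
        for server in self.servers:
            if server["name"] == key and not server["busy"]:
                server["busy"] = True
                return server["id"]
        return None

    def release(self, server_id):
        """Mark a server idle again once its request has been serviced."""
        for server in self.servers:
            if server["id"] == server_id:
                server["busy"] = False

router = Router()
router.advertise("FLT", 0)   # two servers under the same name
router.advertise("FLT", 1)
first = router.route("FLT")
second = router.route("FLT")
third = router.route("FLT")  # both busy now
print(first, second, third)
```

The client only ever supplies the name, so adding a second server for a slow service needs no change on the client side.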

Any Spin cog can operate as many servers and clients as desired. It is the programmer's responsibility to ensure that any servers it advertises are actually available when needed. This is done by polling each of its servers for requests and performing the requested services when they come in.

This module is "Phase I" of the project. It works with Spin programs only and should be considered very alpha. In the second phase, I will provide a way for assembly-language routines to communicate with each other and with Spin programs. ('Never got around to "Phase II" -Phil)

I've decided to license this under the GPL. I think that's better than just "throwing it out there"; yet it still retains the benefits of being free and open. As always, comments, questions, bug reports, and criticism are welcome.

A question came up in that thread about whether this object could be used for inter-Propeller communication. I believed it might be helpful and answered thus:

PhiPi said...

I think so, and probably without modification. Here's how I'd start:

1. Assign a server to be the off-chip gateway. Give it a name like "GATE". In a half-duplex, point-to-point world, it could also handle the bit-banging to the other chip.

2. A client wishing service from the other chip would send a request to the gateway server. Its buffer would include a header with the name of the server on the other chip and a message length.

3. The gateway server, when it got such a message, would send the header, and a message from the buffer of the correct length to the other chip and wait for a reply.

4. The gateway server on the other chip would fill its own buffer with the message, which would then become a client buffer for the server on its chip having the name passed in the header.

5. When it got a response, it would send its buffer contents back to the originating gateway.

6. Meanwhile, back at the ranch, when the reply comes back, the gateway server would copy the responding server's message into the original client buffer and call "respond", thus completing the transaction.
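
The header from step 2 (server name plus message length) could be framed as in the Python sketch below. The layout is purely hypothetical: a 4-byte name followed by a 16-bit length, both chosen for illustration, and `pack_request` and `unpack_request` are made-up names.

```python
import struct

# Hypothetical header layout: 4-byte server name, then 16-bit payload length.
HEADER = struct.Struct("<4sH")

def pack_request(server_name, payload):
    """Build the frame a gateway would send to the other chip."""
    name = server_name.encode("ascii")[:4].ljust(4, b"\0")
    return HEADER.pack(name, len(payload)) + payload

def unpack_request(frame):
    """Recover the target server name and payload on the receiving gateway."""
    name, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    return name.rstrip(b"\0").decode("ascii"), payload

frame = pack_request("FLT", b"\x01\x02\x03")
print(unpack_request(frame))
```

The receiving gateway would use the recovered name to make an ordinary local request, and the reply would travel back the same way.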

The neat thing about the Propeller, with its multiple processors, is that you don't need two of them to try it! (In fact you don't even need to tie two pins together.)

But, step outside the half-duplex point-to-point world, and things get way more complicated. Plus, I think Martin raises an interesting point as well: Is the cost of my handler, in overhead and resources, justified by the layer of abstraction that it provides; or might the same thing be accomplished just as easily with a more direct approach? If having it emboldens someone to try something they might not have attempted otherwise, then the answer would have to be yes. If not, well, it was still a fun project!

Another question came up regarding the advantages/disadvantages of this method, as opposed to just using global variables to manage the calling and to pass the data. Here, in part, is my answer:

PhiPi said...

... Yes, granted, global variables can provide many of the same benefits without some of the overhead. However, the user is left with managing the process and dealing with any contention issues that might come up. What I was hoping to accomplish was to make this process more uniform and transparent to the programmer -- especially where multiple cogs are available to perform as redundant servers and/or multiple clients need access to a single server.

Another benefit is that Spin objects that use this module can be written independently, without knowledge of each others' global variables. The only things that tie them together are the names that the servers advertise and that the clients use to access them....

Cheers!
Phil
