
The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • Electrodude Posts: 1,614
    edited 2015-08-27 02:28
    jmg wrote: »
    ozpropdev wrote: »
    Nice!
    Now 28-bit constants can be buried within PASM code which also function as NOPs in fine-tuned timing loops.
    Well spotted - a smarter assembler can use a NOPc or similar, and then check the constants area for values that can be re-purposed.
    That also means constants might need floating placement, and would be placed last.

    Keeps the code readable, but allows more use of COG memory.

    Same with any variables the compiler can prove will always have their top four bits cleared.

    Speaking of which, can PASM2 have a constant pool, a designated area to put literal constants that can't actually be literal constants because they're more than nine bits long? Any NOPs would automatically be contributed to the pool, with the restriction that the top four bits must be all zeros. Plasma, the assembler Linus Åkesson wrote for his Propeller demo Turbulence, supports this. Here's an example of how Plasma does it:
            sub     t1, =$8000
    { ... }
    number  long    123
            pool    ' $8000 ends up here
    x       res     1
    
  • I was thinking about how hubexec will interact with rdxxxx and wrxxxx. My understanding is that hubexec will fetch instructions using a FIFO, which will be filled at a rate of one long per cycle. When a program jumps from a cog address to a hub address, the cog will be stalled until the eggbeater memory cycle for the first instruction occurs. After that, the cog will execute straight-line code at a rate of one instruction every two cycles. I assume the FIFO continues to fill when the eggbeater cycle for the next instruction comes around.

    What happens when a rdxxxx/wrxxxx access lands in the same cycle as a FIFO fetch? I'm assuming the rdxxxx/wrxxxx has priority, correct? Otherwise, the cog would be stalled for an additional 16 cycles waiting for the eggbeater to come around again.
  • evanh Posts: 15,126
    I can't see it working any other way: the FIFO will finish up, then the RDxxxx/WRxxxx gets its turn. That's just another little penalty of HubExec.

    Obviously the FIFO isn't bursting on every loop of the Hub so that's another break in determinism for HubExec.
  • Dave Hein Posts: 6,347
    edited 2015-08-28 13:20
    Well, either the FIFO or the RDXXXX/WRXXXX will have to wait when there is a conflict. It would be good to know which one has priority.

    The FIFO wouldn't be bursting on every eggbeater loop, but it could burst on every other loop. I'm still wondering how tricky it's going to be to write efficient code that works well with the eggbeater memory cycles. It seems like random hub RAM accesses will be very inefficient. Data in hub RAM will have to be carefully aligned in memory and tuned to a specific cog to make it work efficiently.
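
    To make the alignment/tuning question concrete, here is a toy C model of the rotating hub-slot idea as it has been described in this thread. The slice mapping and rotation direction here are my own assumptions for illustration, not the actual P2 design:

        #include <stdio.h>
        #include <stdint.h>

        #define NCOGS 16

        /* Assumed model: on clock t, cog c may access hub slice ((t + c) % 16),
           and long address a lives in slice (a % 16). Returns how many clocks
           cog c must wait from clock t before its slot for address a arrives. */
        static int wait_clocks(int c, uint32_t long_addr, uint32_t t)
        {
            int slice = long_addr % NCOGS;          /* slice holding this long  */
            int now   = (t + c) % NCOGS;            /* slice cog c can use now  */
            return (slice - now + NCOGS) % NCOGS;   /* 0..15 clocks of waiting  */
        }

        int main(void)
        {
            uint32_t t = 0;
            /* Cog 3 reading four consecutive longs starting at long address 100. */
            for (uint32_t a = 100; a < 104; a++) {
                int w = wait_clocks(3, a, t);
                printf("long %u: wait %d clocks\n", (unsigned)a, w);
                t += w + 1;                         /* the access itself takes a clock */
            }
            return 0;
        }

    Under this model, sequential longs cost one clock each once you are in step with the rotation, while a random address costs anywhere from 0 to 15 extra clocks, which is exactly the alignment/tuning issue above.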
  • Here's my guess:

    Chip has referred to the instruction caching mechanism as a streamer, which leads me to believe that it's opportunistic. Assuming that the instruction cache is still 16 longs wide, a single memory hubop that's aligned to the streamer would cause a streamer "miss" and result in 8 cached instructions executing before the streamer could again attempt to read the hub. If a second memory hubop within 7 instructions of the first accessed the same hub memory bank, it would cause a second streamer miss, at which point the instruction cache would be depleted by the time the hub comes back around for the streamer. To recover from this, I'd think each streamer miss causes an implicit JMP PC instruction to be inserted instead. That way, if the cache is fully depleted, all that's left is the JMP instruction, which forces the instruction cache to get reloaded.
    evanh wrote: »
    Obviously the FIFO isn't bursting on every loop of the Hub so that's another break in determinism for HubExec.

    I don't think there was an expectation of deterministic timing for HubExec, as I don't think that's the primary use case. However, if my guess above is correct, then you should be able to avoid spurious instruction cache stalls by judiciously planning your memory hubops.

    However, this could be a bit difficult. The most common hubop is going to be "RDxxxx/modify/WRxxxx". And, if the RDxxxx just happens to cause the streamer to stall, then the WRxxxx will most definitely cause a 16-clock (or more) penalty while the streamer recovers.

    In the end, this is definitely a use case we should test when the FPGA image is released.
  • You all are amazing! Thanks for your previous answers!

    I have no experience with Propeller programming, so please forgive me if I ask obvious things (or say obviously stupid things).

    (1) One interaction between cogs that will be supported (efficiently) is message-passing:
    SETQ right before a RDLONG/WRLONG should be optimal for message transfer (from/to hub).

    (1.1) Synchronization: what kind will be required for message-passing?
    After a few days of browsing I could not identify in this forum a conclusion on P2 locks or other synchronization primitives.

    (1.2) Variable-length messages: how can they be handled efficiently?
    The obvious idea is one LONG for the message header, with the length in the least significant byte (or two bytes at most) and the message type (command, etc.) plus flags grouped toward the MSB.
    More than 32 bits (one LONG) of overhead per message seems wrong.
    Maybe another header layout could be processed faster? (A possible packing is sketched below.)
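
    As a purely hypothetical illustration of that one-LONG header, here is a C sketch with the length in the low byte and the type plus flags in the upper bytes; the exact field widths are my own choice, not a proposed standard:

        #include <stdint.h>

        /* Hypothetical one-LONG message header:
           bits  7..0  : payload length in LONGs (0..255)
           bits 23..8  : flags
           bits 31..24 : message type (command, reply, ...) */
        static inline uint32_t hdr_pack(uint8_t type, uint16_t flags, uint8_t len)
        {
            return ((uint32_t)type << 24) | ((uint32_t)flags << 8) | len;
        }

        static inline uint8_t  hdr_len(uint32_t h)   { return (uint8_t)(h & 0xFFu); }
        static inline uint16_t hdr_flags(uint32_t h) { return (uint16_t)((h >> 8) & 0xFFFFu); }
        static inline uint8_t  hdr_type(uint32_t h)  { return (uint8_t)(h >> 24); }

    On the receiving cog each field is just a shift and a mask, so the per-message overhead stays at one LONG regardless of payload size.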


    (2) Regarding interaction between FIFO and RDxxxx/WRxxxx, and their priority, I was thinking about the context of their use, i.e. the code being executed.

    If I understand correctly,
    hub code should be able to call code stored in cog, and
    code stored in cog memory should be able to call hub code
    (which, in turn, should be able to call code stored in cog ...
    recursive use of / need for the per-cog code FIFO).

    (2.1) If hub code calls code stored in cog memory, the FIFO should have lower priority than RDxxxx/WRxxxx.
    Otherwise, as Dave Hein said above (August 27), the cog would be stalled for an additional 16 cycles waiting for the eggbeater to come around again --- I think unnecessarily.

    (2.2) Let's say hub code calls code stored in cog memory, which calls hub code.
    The inner call to hub code will need the FIFO, so after two levels of return the outer call to hub code will have a "cache miss".
    I see the FIFO as a very limited cache for contiguous LONGs, which must be flushed entirely when the access pattern deviates from sequential.


    Can you please correct my misunderstandings and update me on the relevant parts of the P2 design?
    Links to messages that contain latest conclusions or explanations would be more than enough.
    I don't think I'm the only one who would benefit from a snapshot of latest thinking on P2.
  • Heater. Posts: 21,230
    edited 2015-08-28 15:22
    Conga,

    I can answer one of your questions, the one about synchronization between COGs.

    As you probably know, the Propeller has a mechanism for acquiring locks. Locks are the Propeller's way of providing the atomic operations required to enable multiple COGs to access the same data safely.

    The Propeller II will also have a lock mechanism. I have no idea if it will be exactly the same or not.

    However, most Propeller code does not use locks. Chip even asked once if it would be OK to remove locks from the P2 design as no one seemed to use them. The answer was "no"; we had better keep them for when they are needed.
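
    For intuition, what a lock buys you is an indivisible test-and-set. Here is a rough C model of the idea, using C11 atomics; the Propeller's LOCKSET/LOCKCLR hardware plays roughly the same role, though this sketch is not Propeller code:

        #include <stdatomic.h>

        /* A spinlock modeled with a C11 atomic flag: atomic_flag_test_and_set()
           is the indivisible "read the old value and set it" that a hardware
           lock provides, so only one contender can win at a time. */
        static atomic_flag lock = ATOMIC_FLAG_INIT;

        static void acquire(void)
        {
            while (atomic_flag_test_and_set(&lock))
                ;   /* someone else holds it; spin until it is released */
        }

        static void release(void)
        {
            atomic_flag_clear(&lock);
        }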

    If you have one producer process and one consumer process then locks are not required to ensure data access is contention free.

    If you go to OBEX, the Object Exchange, you will find the Full Duplex Serial object (FDS). In that code you will see that the application and the serial driver COGs exchange data through FIFOs. There are no locks used or needed.
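
    For anyone who hasn't looked at FDS yet, the same single-producer/single-consumer idea looks roughly like this as a C sketch (a conceptual model, not Propeller code): the producer only ever writes the head index and the consumer only ever writes the tail index, so with one writer per index no lock is needed.

        #include <stdint.h>
        #include <stdbool.h>

        #define RING_SIZE 16u                         /* must be a power of two */

        typedef struct {
            volatile uint32_t head;                   /* written only by the producer */
            volatile uint32_t tail;                   /* written only by the consumer */
            volatile uint8_t  buf[RING_SIZE];
        } spsc_ring_t;

        /* Producer: returns false if the ring is full. */
        static bool ring_put(spsc_ring_t *r, uint8_t b)
        {
            uint32_t h = r->head;
            if (((h + 1) & (RING_SIZE - 1)) == r->tail)
                return false;                         /* full */
            r->buf[h] = b;                            /* write the data first ...   */
            r->head = (h + 1) & (RING_SIZE - 1);      /* ... then publish the index */
            return true;
        }

        /* Consumer: returns false if the ring is empty. */
        static bool ring_get(spsc_ring_t *r, uint8_t *b)
        {
            uint32_t t = r->tail;
            if (t == r->head)
                return false;                         /* empty */
            *b = r->buf[t];
            r->tail = (t + 1) & (RING_SIZE - 1);
            return true;
        }

    The volatile here just stands in for whatever ordering the real hardware gives you; on the Propeller the hub serializes the accesses, which is why FDS gets away without any locks.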

  • evanh Posts: 15,126
    edited 2015-08-28 15:49
    Conga wrote: »
    (2.1) If hub code calls code stored in cog memory, the FIFO should have lower priority than RDxxxx/WRxxxx.
    Otherwise, as Dave Hein said above (August 27), the cog would be stalled for an additional 16 cycles waiting for the eggbeater to come around again --- I think unnecessarily.

    My speculation is: The FIFO/streamer is not likely to be interrupted once it starts a loop of the hub. If it started before a RDxxxx/WRxxxx is executed then the RDxxxx/WRxxxx will just have to wait until the next hub loop.

    (2.2) Let's say hub code calls code stored in cog memory, which calls hub code.
    The inner call to hub code will need the FIFO, so after two levels of return the outer call to hub code will have a "cache miss".
    I see the FIFO as a very limited cache for contiguous LONGs, which must be flushed entirely when the access pattern deviates from sequential.

    Yeah, I wouldn't be wanting to make a big habit of calling Hub code from Cog code. Keep Cog code for those tight inner loops.

    Chip hasn't posted a huge amount of detail, so we're mostly guessing. To be advised after FPGA image drop I'd say.
  • Heater. wrote: »

    If you go to OBEX, the Object Exchange, you will find the Full Duplex Serial object (FDS). In that code you will see that the application and the serial driver COGs exchange data through FIFOs. There are no locks used or needed.

    Thanks Heater!

    Looked at the Full Duplex Serial object:
    the strange thing is that the only obvious case for the new block transfer is 'entry' from the "Assembly language serial driver", while the rest (the repeating code) seems optimized for shorter transfers.

    To me it looks like P2 with the new hub memory optimized for sequential access will both support and encourage using longer messages between cogs.
    I might be biased in favor of message-passing, I confess :-)
  • evanh wrote: »
    Conga wrote: »
    (2.1) If hub code calls code stored in cog memory, the FIFO should have lower priority than RDxxxx/WRxxxx.
    Otherwise, as Dave Hein said above (August 27), the cog would be stalled for an additional 16 cycles waiting for the eggbeater to come around again --- I think unnecessarily.

    My speculation is: The FIFO/streamer is not likely to be interrupted once it starts a loop of the hub. If it started before a RDxxxx/WRxxxx is executed then the RDxxxx/WRxxxx will just have to wait until the next hub loop.

    I understand why it could be that way.
    There's no reason to expect that FIFO/streamer filling *must* be more interruptible than explicit block transfer (SETQ right before a RDLONG/WRLONG).

    But this does not mean that FIFO/streamer filling cannot be interruptible --- even if repeated-RDLONG would not be.

    Since their destinations are of different types (in hardware: FIFO vs. cog memory), the hardware logic would very likely be different, so it could clearly handle interruptibility differently.
  • Previous hints/mentions from Chip have been that regular RDxxxx/WRxxxx instructions would sit "alongside" the streamer/FIFO, i.e. not use it.
    The impression I got from him was that doing a RDxxxx/WRxxxx instruction would not interrupt the streamer/FIFO.

    My understanding is that the streamer/FIFO fills one long on every clock of the hub, and the normal RDxxxx/WRxxxx instructions are satisfied once every 16 clocks (similar to the existing P1 hub/cog interaction). So in the time it takes to "go around" once, the streamer/FIFO is filled, and the RDxxxx/WRxxxx wouldn't be delayed beyond its normal wait for its slot.
  • So, just to clarify, here's how I understand how hubexec works.

    - When a cog jumps to a long address greater than 511 it goes into hubexec mode.
    - The streamer FIFO is filled with 16 longs, and cog execution continues as soon as the first long is available.
    - The streamer FIFO takes precedence over RDxxxx/WRxxxx accesses.
    - As longs are consumed from the streamer FIFO subsequent hub RAM reads are performed to continue to fill the FIFO.
    - Any jumps to another hub address will cause the streamer FIFO to be refilled, even if the FIFO already contains the data from the target address.
    - Filling of the streamer FIFO is terminated immediately when a jump is made to an address less than 512.
  • Roy Eltham wrote: »
    My understanding is that the streamer/FIFO fills one long on every clock of the hub, and the normal RDxxxx/WRxxxx instructions are satisfied once every 16 clocks (similar to the existing P1 hub/cog interaction).
    So in the time it takes to "go around" once, the streamer/FIFO is filled, and the RDxxxx/WRxxxx wouldn't be delayed beyond its normal wait for its slot.

    I thought/hoped that this would apply only to random access to hub memory,
    not to explicit block transfers (SETQ right before a RDLONG/WRLONG), i.e. sequential access.

    Lack of fast block data transfers would make the new hub memory layout less useful;
    optimizing only for HubExec would be too biased --- it's not clear that instruction reading would always be the bottleneck.
  • Conga,
    My understanding was that there were separate instructions for doing block transfers equivalent to the RDxxxx/WRxxxx ones but meant for working with the streamer/FIFO. If you used those instructions then you would have contention between them and hubexec.

  • Roy Eltham wrote: »
    My understanding was that there were separate instructions for doing block transfers equivalent to the RDxxxx/WRxxxx ones but meant for working with the streamer/FIFO.
    If you used those instructions then you would have contention between them and hubexec.

    I'm not sure what separate instructions you have in mind.
    Are they RDLONGS/WRLONGS? (Retired recently, replaced by SETQ before a RDLONG/WRLONG.)
  • Heater. Posts: 21,230
    Conga,

    It does not matter if you are moving byte by byte from COG to COG or blocks of bytes in one shot.

    If there is only one producer of data and only one consumer it can be done without locks or other atomic operations.

    The Full Duplex Serial object is a fine example of how to do it, byte by byte, in a FIFO. What if, instead of bytes, that FIFO used LONGs that were actually addresses of bigger data structures? It would work just as well. No locks required.
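
    As a small C sketch of that variation (same single-writer-per-index rule as the byte ring sketched earlier, only the slots now carry hub addresses; the buffer-ownership convention is my own illustration):

        #include <stdint.h>
        #include <stdbool.h>

        #define PTR_RING_SIZE 8u                      /* power of two */

        typedef struct {
            volatile uint32_t head;                   /* producer-owned index */
            volatile uint32_t tail;                   /* consumer-owned index */
            volatile uint32_t slot[PTR_RING_SIZE];    /* hub addresses of message buffers */
        } ptr_ring_t;

        /* Producer: hand over the hub address of a finished message buffer. */
        static bool send_msg(ptr_ring_t *r, uint32_t msg_addr)
        {
            uint32_t h = r->head;
            if (((h + 1) & (PTR_RING_SIZE - 1)) == r->tail)
                return false;                         /* ring full, try again later */
            r->slot[h] = msg_addr;
            r->head = (h + 1) & (PTR_RING_SIZE - 1);
            return true;
        }

    The consumer side mirrors the byte version; the only extra rule is that the sender must not touch a buffer again until the consumer hands it back (for example through a second ring going the other way).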


  • Heater. wrote: »
    It does not matter if you are moving byte by byte from COG to COG or blocks of bytes in one shot.

    If there is only one producer of data and only one consumer it can be done without locks or other atomic operations.

    The Full Duplex Serial object is a fine example of how to do it, byte by byte, in a FIFO. What if instead of bytes that FIFO used LONG [...]?
    It would work just as well. No locks required.

    Thanks Heater, I appreciate the time you took to explain.

    I understand what you say and I did not claim that locks are required in this case.

    (1) I was wondering what kind of synchronization primitives will be required for message-passing (if any).

    (2) I asked about the status / conclusion on P2 locks or other synchronization primitives designed for P2.

    These were just 1/4 of a larger set of questions.

    The underlying theme was that P2 with the new hub memory optimized for sequential access will both support and encourage (I hope) using longer messages between cogs.
    This is just a design approach I am interested in exploring (that seems to fit the new P2).
  • Heater. Posts: 21,230
    Conga,

    I'm not sure I understand what you are asking:
    (1) I was wondering what kind of synchronization primitives will be required for message-passing (if any).
    Same as ever, you need locks or atomic operations or you don't. Depends on what data is being shared with whom.
    (2) I asked about the status / conclusion on P2 locks or other synchronization primitives designed for P2.
    As far as I know a similar lock mechanism will be included in the P2. It would be crazy not to have some atomic operations on shared memory in the P2.


  • Heater. wrote: »
    Conga,

    I'm not sure I understand what you are asking:

    I just explained my intention and the meaning of my questions:
    definitely not to challenge anyone or insist on the need for locks.

    I think we might be violently agreeing.

    Thanks!
  • I'm confused as to what you mean by, "a longer message between COGS"

  • Cluso99 Posts: 18,066
    From what Chip said a few days ago, my understanding is that up to a full cog's worth (2KB) can be transferred using SETQ followed by RD/WRLONG. The transfer will run at full clock speed.
  • potatohead wrote: »
    I'm confused as to what you mean by, "a longer message between COGS"

    I mean promoting a message-passing programming style and experimenting to see how it works.
    Other uses of hub memory would be mostly execute or read-only access.
    Besides cog-to-cog (Unicast) communication, there should be group/collective operations: Broadcast, Multicast.

    Anyway, "longer messages" would mean an average message size around 10 LONGs, with a minimum of zero (excluding the header) and a maximum of a few hundred LONGs (the big ones would be unusual).

    The new P2 seems suitable for this; whether it's a good idea, that remains to be seen.

    I know there's nothing original in this, neither in the bigger computing world nor in the Propeller community ---
    just a few ideas that I might be able to implement when P2 comes out.
  • Sounds like you are asking/proposing a new message passing scheme.

    Not sure it's really necessary, or that it couldn't be done another way with less resource use.
    Dump it to the Hub, and then put a pointer in a mailbox and set a flag that there is a new update?
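
    A minimal C sketch of that mailbox-plus-flag idea (the layout and names are just illustration): the producer writes the message data, then the pointer, then raises the flag; the receiving cog polls the flag when it has time and clears it when done.

        #include <stdint.h>
        #include <stdbool.h>

        typedef struct {
            volatile uint32_t flag;                   /* 0 = empty, 1 = new message waiting */
            volatile uint32_t msg_addr;               /* hub address of the message data    */
        } mailbox_t;

        /* Producer: returns false if the previous message hasn't been taken yet. */
        static bool mailbox_post(mailbox_t *mb, uint32_t addr)
        {
            if (mb->flag)
                return false;
            mb->msg_addr = addr;                      /* publish the pointer first ... */
            mb->flag = 1;                             /* ... then raise the flag       */
            return true;
        }

        /* Consumer: poll when convenient; returns false if nothing is waiting. */
        static bool mailbox_poll(mailbox_t *mb, uint32_t *addr)
        {
            if (!mb->flag)
                return false;
            *addr = mb->msg_addr;
            mb->flag = 0;                             /* give the slot back to the producer */
            return true;
        }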

    UDP/Unicast/Multicast would all seem to depend upon each core spending time listening for a message like a network port does, instead of being focused on work and polling a mailbox when it has the time.

    Or at least, that's how it appears at this hour of the morning.
  • koehler wrote: »
    Sounds like you are asking/proposing a new message passing scheme.

    Yes, but not because I think it would be the only way to program the P2, or the best way in all circumstances.

    It's for education (in a general sense, not necessarily formal):
    I want to demonstrate / illustrate communication and coordination patterns, on all layers of abstraction in computing.

    The P2 would be the lowest layer I plan to do.
    (I could go lower, but I'm not sure I have the time to learn everything required, much less that I can explain it to others well.)

    Going from bottom to top, we would have:

    (1) P2, with under-microsecond message passing (and possibly processing); for a suitable audience this can be explained/documented with the hardware description language (if the P2 is open-sourced too).

    (2) RTOS communication primitives, with source code available (some of the implementations may even be readable).

    (3) IPC on Unix-like OS; kernel source will be available but "readable" is not how I'd describe it (for my purpose --- a modern desktop or server OS kernel is anything but simple).

    (4) What is usually called "lightweight" messaging: ZeroMQ, MPI --- I'm sure people in this forum would laugh at the "lightweight" designation.

    (5) Message-Oriented Middleware (MOM): here are the comparatively "heavyweight" MQ systems.

    (6) Workflow and batch processing: business process automation, scientific workflows. Processing could take hours.

    So this could go from microseconds to hours of latency.
    I think this would show both the generality of communication and coordination patterns and their limitations.

    koehler wrote: »
    Not sure its really necessary, or it couldn't be done another way with less resource use.
    Dump it to the Hub, and then put a pointer in a mailbox and set a flag that there is a new update?

    You're probably right.

    This would be a fascinating subtopic to explore (in general, not mainly about P2);
    would be somewhat advanced for the purpose of the education program proposed above.

    The relative cost of sharing vs. communicating with copying has interesting switch points (reversals) when going through the above levels.
    It has had even more interesting changes over the history of computing.
    On modern "big" CPUs, reference counting and copy-on-write are not always the best way to share and access data.

    What I like about P2 (in this context --- there's more that I like in general) is:
    message passing could be a realistic way to program the microprocessor.
    I don't like giving examples that require me to concede at the end, "Yeah, I would not recommend actually using this, it does not make sense at all to do it".
  • Heater. Posts: 21,230
    Conga,
    message passing could be a realistic way to program the microprocessor.
    I'm not sure what you are getting at. Message passing is perhaps the main way by which people work with the Propeller. It's not often I see programs that have multiple COGs hacking on shared data structures other than in a message-passing style. Have I seen any? Only my FFT, which can be parallelized, spread over multiple COGs, using OMG.

    The problem is this message passing is not formalized; people just knock up whatever they want: FIFOs, mailboxes, whatever. Of course most of this is hidden under the Spin access functions of the objects.

    I always imagined it would be nice if Spin had some formal message passing semantics, like the channels of Occam, XC, or Google's Go language.
  • potatohead Posts: 10,253
    edited 2015-08-29 16:34
    nevermind

  • It seems like message passing is useful when there are multiple processes/processors that can't share each other's memory. The only way they can share information is to pass messages to each other. This is not the case on the Prop. All of the hub memory is accessible by all cogs. Messages may be useful for sharing information that is contained in cog RAM, which is not accessible from other cogs. However, this is normally handled by just writing the data out to some shared location in hub RAM. I suppose that can be thought of as a message being sent from one cog to another, but there is no formal standard on how this is performed on the Prop.
  • evanh Posts: 15,126
    edited 2015-08-29 23:39
    True, that's the feature of the Hub. It's got super-fast, low-latency shared main memory, 100% common to all cores. The trade-off is that it's not expandable.

    I guess a second trade-off is that latency also goes up linearly with core count.
  • potatohead wrote: »
    nevermind

    The shortest @potatohead post ever!

    Finger broken? Keyboard defect?

    I am worried!

    Everything OK?

    Mike

  • potatohead Posts: 10,253
    edited 2015-08-30 08:03
    Everything is just fine. I typed something, thought better of it, hit the wrong button, and so there it is!

    A quick edit later, and it's... nevermind!

    I'm actually jamming on some enterprise software doing a demo-proof of concept sales support contract right now. Painful, but it very easily paid for my FPGA board.

    A few more days... then it's over. :)

    Sometimes that proof requires quite a few concepts...

    Fun toy for the day: https://prezi.com It's a spiffy GL based presentation tool. Rather than one ugly PowerPoint hell full of slides, and slides, and slides... You can put the whole works on just one canvas! Then, organize it in some common sense, visual way, and at that point, it can be done. Zoom in to add info at various levels of detail, then sequence it, or some of it, whatever.

    The final presentation has a sequence, or path, but it can also contain free form things, optional bits. If they ask, or it makes sense, do it, and if not, move through quick! Having done WAY more PPT than I ever should, I always hated the stack of slides bit. Nobody wants 'em, but then again, if you share the presentation materials with them, they might actually read through some good stuff, if it were possible to really put detail in a PPT without it all just being painful.

    Secondly, the PPT is a sequential thing that gets really ugly when the dialog bounces around. One can go through and make a huge investment in PPT and get a little freedom. Then you never do that ever again. :(

    This thing does that easily and visually! Love it. Once I get done with the show 'n tell, I think I'm going to experiment a bit with Prezi, Git Book, etc... and see if it isn't possible to knock out some really fun stuff that is informative. I'm kind of wondering if it might be possible to pack a sort of, "here's how PASM works", or build your own display driver kind of thing into Prezi. It's just one big sheet of paper that has hot spots on it and some visual guidance. Intriguing... to me at least.

    Fun part is the old school enterprise sales team is shaking in their boots. I've got editorial control of this one, and there will not be a PPT. Not sure what they are going to do, but it won't be barking at some people I know well and would never torture in that way. :)

    For this one, I finally bit the bullet and embraced the Google tools. Frankly, Drive & Docs rocks when you've got a few people needing to collaborate at a distance. Very highly recommended. Should have done that sooner. I've got Chrome set to open up to the workspaces and information all organized in a set of tabs. Open it, and it's all there. Since I use Firefox for my daily browsing, Chrome becomes a quick project view tool. I've been building a few nice docs and other goodies with a friend and we will often just be editing the same thing while yakking on the cell phones. The need to screen share and hand off is mostly gone, and we've accomplished close to what we would being in the same room.

    And now, back to the P2 under discussion.

    For once, I can wait a week. Take your time Chip. :)

    (kidding)

