P16X32B SuperCog

kwinn · 2014-05-10 15:54

That makes at least 3 of us.

RossH wrote: »

I want the new chip now!

Ross.

RossH · 2014-05-10 16:03

Seairth wrote: »

If Chip had taken that attitude when first considering the Propeller, he probably wouldn't have made it. It's always easy to dismiss "edge cases" when there's no precedent for such things on existing hardware. One of the things that makes the Propeller so great is that it breaks the rules for the "mainstream" mould, allowing us to redefine what's "mainstream" and "edge cases". Don't go pigeonholing the Propeller's applicability until we have actual silicon.

I'm not pigeonholing, I'm pointing out the reality that no-one seems to have been able to identify any use cases where a slot-sharing scheme is the only solution, other than faster execution of high-level languages. And for that particular use case SuperCog is the simplest possible solution, and possibly the only one that brings no "baggage" that has to be borne by the other cogs.

I'm happy for people to post other use cases. It makes much more sense to try and figure out the problem we are all trying to solve before we determine the solution (if any) to be adopted. That is something that is strangely missing from all these hub-sharing discussions.

Ross.

Seairth · 2014-05-10 18:31

RossH wrote: »

I'm not pigeonholing, I'm pointing out the reality that no-one seems to have been able to identify any use cases where a slot-sharing scheme is the only solution, other than faster execution of high-level languages. And for that particular use case SuperCog is the simplest possible solution, and possibly the only one that brings no "baggage" that has to be borne by the other cogs.

I'm happy for people to post other use cases. It makes much more sense to try and figure out the problem we are all trying to solve before we determine the solution (if any) to be adopted. That is something that is strangely missing from all these hub-sharing discussions.

You want examples? How about:

High-bandwidth I/O drivers, particularly those that transfer parallel data. With the limited pin count on P8X32A, parallel I/O was severely limited, so the use case never materialized (despite the wealth of parallel I/O protocols). This is new territory for the P16X64A.
Multi-port I/O drivers can maintain multiple queues in HUB memory. This means that such drivers can require multiple non-optimal hub accesses. At the very least, faster hub access can reduce latency. It can also help reduce the impact of the coupling between ports due to use of a single cog.
Any two cogs that need fast communication between one another, like a driver that uses two cogs. With the P8X32A, 8 cogs put pressure on designers to fit drivers into a single cog. With 16 cogs on the P16X64A, it seems likely that more ambitious multi-cog drivers will pop up. With the lack of Port D (from the P8X96A), the hub will again be the primary conduit between multiple cogs.
DSP-based applications can require large memory spaces to work in. While it may be possible to tune the DSP code to use 1:16 timing, it may not always be feasible. In which case, simply increasing the access rate won't help with determinism, but it will reduce latency. Also, many DSP applications work at low bit-depths, meaning that byte or word transfers are common. In which case, faster timing increases bandwidth (regardless of whether the timing is deterministic).
I believe that "video" has been regularly provided as an application that could make use of this. As this is not my forte, I will not elaborate on this topic.
Multiple cogs running high-level languages (or interpreters). Considering that the actual usage patterns of the P16X64A are unknown, it seems premature to state that only one cog is going to be running a high-level language. And, while I know that you didn't explicitly state that, you are implicitly arguing it by saying that SuperCog is enough for running high-level languages.
If Hubexec doesn't make it into P16X64A, LMM can make use of faster hub access. Like the prior bullet, it's not unreasonable to expect multiple cogs using LMM at the same time.
And in a general sense, it's not always practical to take the time to write precisely-timed code. An easy solution, of course, is to give a cog 1:2 timing, which effectively gives that cog immediate access to the hub every time (ignoring the occasional instructions that stall for an odd number of clock cycles). Of course, this also might not be very practical, because it could starve other cogs. But giving that same code 1:4 timing (or possibly even 1:8) might still be "good enough". We are all, I'm sure, familiar with those times in a project where practicality wins out over idealism. This is particularly true in the commercial world, where "ship it!" (and the more colorful versions) means that practicality often trumps all. Is this the right way to develop on the Propeller? Absolutely, if it's what's necessary to get a product out the door. I'm certainly not going to tell someone that they can only get faster speeds by writing perfectly-timed code.
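
To put rough numbers on those ratios, here is a minimal sketch. It assumes the hub serves one slot per clock and that a cog with 1:N timing owns exactly one slot in every N clocks; the function name and the model itself are illustrative assumptions, not anything taken from the actual chip design.

```python
def hub_wait_clocks(ratio):
    """Return (worst, average) wait in clocks for a cog with 1:ratio hub timing.

    Assumption: the cog's request can arrive at any clock phase with equal
    probability, and the wait is the number of clocks until its next slot.
    """
    # Wait for each possible arrival phase relative to the cog's own slot.
    waits = [(-phase) % ratio for phase in range(ratio)]
    return max(waits), sum(waits) / len(waits)


for ratio in (2, 4, 8, 16):
    worst, average = hub_wait_clocks(ratio)
    print(f"1:{ratio:<2} worst wait {worst:2d} clocks, average {average:.1f} clocks")
```

Under these assumptions the worst-case wait is N-1 clocks, so moving from 1:16 to 1:4 cuts it from 15 clocks to 3 without the code having to be timed to the hub window.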

These are only the things that immediately come to mind. In truth, though, these are all guesses. Just as "SuperCog will help high-level programs" is a guess. So, shooting any or all of these examples down now would be premature. We won't actually know whether any of these guesses are right until we are using actual hardware (or possibly the FPGA).

Personally, I'd rather have the opportunity to answer those questions with real-world use of the P16X64A. And that is what I was getting at in the prior post. Existing "mainstream" MCUs precluded us from doing things that the P8X32A allowed. Had Chip dismissed the Propeller idea because the current MCUs cover all of the important use cases, the Propeller would never have existed. And, while we all acknowledge that Chip is a smart guy, he couldn't possibly have known all the ways that the P8X32A was going to be used beforehand.

(note: I am a proponent of the "ship it" sentiment. If any of these hub-optimizing ideas prevents timely shipping of the P16X64A, I'd rather not have them at all. But, if one of these ideas is likely to be implemented, I think slot-sharing is the better way to go, as I feel it gives us more opportunities than the SuperCog approach to see what is possible. And, if SuperCog gets implemented instead, then I'm sure we'll still see what is possible. And I'll be okay with that, because we will have an actual P16X64A to do it with. Obviously, this is just my opinion. And I know you disagree with it. I just hope that, whatever Chip settles on, you are able to embrace it. I know I will. Even if that means we get none of this at all.)

RossH · 2014-05-10 19:45

Hi Seairth ...

Seairth wrote: »

You want examples? How about: ...

Well, I understand what you are saying, but what I had in mind was actual examples of things that could only (or at least best) be accomplished with hub slot sharing. For instance, a USB driver. I don't know enough about USB to know whether USB is going to be possible on the P16X32B (but we do know it is not possible on the P8X32A). So if it were not possible on the P16X32B but became possible with hub slot sharing then you would have a good example. That's the kind of example that I think you need to be able to demonstrate before hub slot sharing is going to be seriously considered.

Another point is that (at least as far as I can see) none of your examples actually depend on hub-slot sharing. Some of them we already do on the P1, or could do if we just had more cogs and a higher clock speed (which we will have on the P16X32B). Some of them could be done with just one SuperCog (or with two SuperCogs, which I agree is feasible but which I personally would rather not have). But the really interesting thing is that many of them could be done - better - if we instead just used multiple cogs and had some kind of fast inter-cog communication that does not rely on the hub. This would be much more in keeping with the Propeller architecture - I'd rather have cog-to-cog comms than any hub sharing scheme - even the SuperCog!

Seairth wrote: »

These are only the things that immediately come to mind. In truth, though, these are all guesses. Just as "SuperCog will help high-level programs" is a guess.

No, that one is not a guess.

Seairth wrote: »

So, shooting any or all of these examples down now would be premature. We won't actually know whether any of these guesses are right until we are using actual hardware (or possibly the FPGA).

Personally, I'd rather have the opportunity to answer those questions with real-world use of the P16X64A. And that is what I was getting at in the prior post. Existing "mainstream" MCUs precluded us from doing things that the P8X32A allowed. Had Chip dismissed the Propeller idea because the current MCUs cover all of the important use cases, the Propeller would never have existed. And, while we all acknowledge that Chip is a smart guy, he couldn't possibly have known all the ways that the P8X32A was going to be used beforehand.

(note: I am a proponent of the "ship it" sentiment. If any of these hub-optimizing ideas prevents timely shipping of the P16X64A, I'd rather not have them at all. But, if one of these ideas is likely to be implemented, I think slot-sharing is the better way to go, as I feel it gives us more opportunities than the SuperCog approach to see what is possible. And, if SuperCog gets implemented instead, then I'm sure we'll still see what is possible. And I'll be okay with that, because we will have an actual P16X64A to do it with. Obviously, this is just my opinion. And I know you disagree with it. I just hope that, whatever Chip settles on, you are able to embrace it. I know I will. Even if that means we get none of this at all.)

I agree with all this. Personally, I can live without any hub slot sharing scheme - my point is that if we do have one, it should be the simplest scheme possible that adds value ... so that we can all get on with using the rest of the chip the way it was originally intended - i.e. as a deterministic symmetric multiprocessor. If you lose that, then the Propeller has very little to offer that other chips can't match or exceed.

Ross.

kwinn · 2014-05-10 19:50

I agree with the uses Seairth posted and would add signal acquisition/generation for applications like logic analyzers, protocol analyzers, ATE systems, etc. Those applications will require multiple cogs with high speed deterministic hub access to work.

If supercog is the only access method implemented because that is the only use case we see now, there is very little chance we will ever see any use cases that require more flexible hub access, because they cannot be implemented.

I would like to have a monitor on my workbench along with a propeller based board that provides me with:

a 2/4 channel oscilloscope
32 channel (minimum) logic analyzer
serial protocol analyzer
multiple (at least 2 of each, preferably 4) voltage, current, and frequency measurements
the capability of logging all this information to sd or hd

the P1 is a bit too slow and memory limited to do this
the P2 (P16X64A) is fast enough
the P2 has enough cogs to do all this in parallel
the P2 has enough hub memory to store the data (although more would have been better)
the question is: will multiple cogs be able to access that memory fast enough?????

I don't even care if the video goes off or glitches during the burst acquisition needed for the scope and logic analyzer.

Cluso99 · 2014-05-11 04:11

Ross:
I have already quoted USB FS as an example that can benefit from both slot-sharing (co-operating cogs). I have learnt enough with P1 to never say it cannot be done. However, with the 1 clock instruction P2 FPGA image of Feb/Mar 2014, I could not read a full USB byte without new instruction help. The current 2 clock instruction makes this situation much worse!

What I am fairly certain about, is that without some form of help (instruction/s, slot-sharing, cog-cog comms in parallel, and/or hw support) USB FS will be extremely unlikely without way too many caveats.

Note: On P1, USB FS was achieved, IIRC, by overclocking to 96 MHz, using 5 cogs, and ignoring every second transmission. There were many other USB non-compliances. Exhaustive testing was not done. Two cogs were used to read the bit-stream. IIRC, while that message was being decoded, the message was retransmitted from the host and ignored by the P1; at the end of transmission it was assumed to be a repeat, and the reply was then sent. I don't think this is an acceptable solution?
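
As a rough illustration of why the timing is so tight, here is a small sketch of the per-bit budget. It uses the USB full-speed signalling rate of 12 Mbit/s and assumes 4 clocks per P1 instruction, which is typical for most (not all) P1 instructions; the function names are made up for this example.

```python
USB_FS_BIT_RATE = 12_000_000  # USB full-speed signalling rate, bits per second


def clocks_per_bit(sysclk_hz):
    """System clocks available during one USB full-speed bit time."""
    return sysclk_hz / USB_FS_BIT_RATE


def instructions_per_bit(sysclk_hz, clocks_per_instruction):
    """Instructions a single cog can execute in one bit time."""
    return clocks_per_bit(sysclk_hz) / clocks_per_instruction


# P1 overclocked to 96 MHz, assuming 4 clocks per instruction.
print(clocks_per_bit(96_000_000))           # clocks per USB bit
print(instructions_per_bit(96_000_000, 4))  # instructions per USB bit, per cog
```

On these assumptions a single cog has only about two instructions per bit at 96 MHz, which is consistent with needing two cogs just to read the bit-stream.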

RossH · 2014-05-11 04:28

Cluso99 wrote: »

What I am fairly certain about, is that without some form of help (instruction/s, slot-sharing, cog-cog comms in parallel, and/or hw support) USB FS will be extremely unlikely without way too many caveats.

I think you're probably right that it could be done if you had all that support - but do you think hub sharing alone would be sufficient?

Cluso99 · 2014-05-11 06:08

RossH wrote: »

I think you're probably right that it could be done if you had all that support - but do you think hub sharing alone would be sufficient?

There are no guarantees, but without it I believe there is no chance without other additions which are now unlikely.

Todd Marshall · 2014-05-11 08:19

1) What is the slot sharing concept?
2) How does it compare/contrast with the slot assignment concept?
3) Is an algorithm running in a single COG preferable to one implemented in multiple COGs (e.g. HS USB)?

kwinn · 2014-05-11 12:50

1) What is the slot sharing concept?

Not 100% sure but it sounds like they may be referring to the slot assignment table.
Possibly with some unused slot grabbing scheme.

2) How does it compare/contrast with the slot assignment concept? See 1.

3) Is an algorithm running in a single COG preferable to one implemented in multiple COGs

If a single cog can process the I/O or data fast enough then definitely yes.

Seairth · 2014-05-11 14:49

RossH wrote: »

Well, I understand what you are saying, but what I had in mind was actual examples of things that could only (or at least best) be accomplished with hub slot sharing. For instance, a USB driver. I don't know enough about USB to know whether USB is going to be possible on the P16X32B (but we do know it is not possible on the P8X32A). So if it were not possible on the P16X32B but became possible with hub slot sharing then you would have a good example. That's the kind of example that I think you need to be able to demonstrate before hub slot sharing is going to be seriously considered.

You keep moving the target! Frankly, if I wrote an actual driver that actually used the 32-slot table (before the chip/FPGA even supported it), you'd then say "well, I understand your example, but you still haven't proven that the driver can't be written without the 32-slot table."

RossH wrote: »

No, that one is not a guess.

It is. It may be a well-educated guess, but it's a guess nonetheless. Have you implemented a version of Catalina that uses SuperCog and shown that it is actually the right solution? Has anyone implemented any code that uses SuperCog and shown that it is actually the right solution? If so, that's great! I'd like to know more! But, if not, then SuperCog is, just like the 32-slot table approach, a guess.

As an aside, you keep saying "P16X32B". If you are talking about the chip which we previously referred to as P1+, then it's P16X64A (see [post=1266206]this post[/post] by Ken). If not, then what is P16X32B?

Seairth · 2014-05-11 15:09

kwinn wrote: »

1) What is the slot sharing concept?

Not 100% sure but it sounds like they may be referring to the slot assignment table.
Possibly with some unused slot grabbing scheme.

2) How does it compare/contrast with the slot assignment concept? See 1.

3) Is an algorithm running in a single COG preferable to one implemented in multiple COGs

If a single cog can process the I/O or data fast enough then definitely yes.

1) and 2) In my case, I am indeed referring to the 32-slot table. I don't know about the others.

3) This is a bit more nuanced, I think. With the P8X32A, I don't think you'd get much argument to the "definitely yes". With 16 cogs, the pressure to consolidate drivers in a single cog is lessened. If you take something like asynchronous serial, you could use FullDuplexSerial for a single cog, but you could also run the send and receive independently in two cogs and get a higher baud rate. Further, the two-cog version would have simpler code, which can be important for maintenance, verification, etc. Personally, I see the push to group code into a single cog to fall under the following broad categories:

Not enough cogs to do everything that needs to be done.
The code naturally belongs in a single cog (i.e. it is the most obvious way to implement the code)
The desire to put multiple, low-bandwidth, low-priority tasks in a single cog to improve resource utilization. (this was more the case for the P2/P8X96A, which has hardware tasking support and 8 cogs.)

RossH · 2014-05-11 16:06

Todd Marshall wrote: »

1) What is the slot sharing concept?

This thread is about one "mooching" supercog in cog 0. See the first post in the thread for more detail.

Todd Marshall wrote: »

2) How does it compare/contrast with the slot assignment concept?

It is much simpler, requires no additional tables or instructions, and affects only one cog - all other cogs operate identically to what they would without the supercog.

Todd Marshall wrote: »

3) Is an algorithm running in a single COG preferable to one implemented in multiple COGs (e.g. HS USB)?

If I understand this question correctly, then the answer is "yes" for programs running in the supercog, but "no" for other (normal) cogs.

Ross.

Todd Marshall · 2014-05-11 16:09

Seairth wrote: »

1) and 2) In my case, I am indeed referring to the 32-slot table. I don't know about the others.
Can you describe your table and how it works?
1) Is it static or dynamic?
2) Is there one-to-one correspondence COG to SLOT (with 1:16, 2:17, 3:18, etc?)
3) Can a COG be represented more than once in your table?
4) Does the HUB idle through or skip unoccupied slots ... or is that optional?
5) Is the modulo of the table fixed (32) or can it be less?

RossH · 2014-05-11 16:16

Seairth wrote: »

You keep moving the target! Frankly, if I wrote an actual driver that actually used the 32-slot table (before the chip/FPGA even supported it), you'd then say "well, I understand your example, but you still haven't proven that the driver can't be written without the 32-slot table."

I'm not trying to move the target, I'm trying to identify whether a hub slot table is necessary or if the same problems could be better solved with other techniques (such as normal cog cooperation techniques, or additional cog-to-cog communications). That's why actual examples are better than "classes" of examples.

Seairth wrote: »

It is. It may be a well-educated guess, but it's a guess nonetheless. Have you implemented a version of Catalina that uses SuperCog and shown that it is actually the right solution? Has anyone implemented any code that uses SuperCog and shown that it is actually the right solution? If so, that's great! I'd like to know more! But, if not, then SuperCog is, just like the 32-slot table approach, a guess.

Since I know that the various Catalina kernels are hub-access limited, it is not a guess that a SuperCog would improve it. But I would admit that if I claimed exactly how much it would improve it, then that would be a guess.

Seairth wrote: »

As an aside, you keep saying "P16X32B". If you are talking about the chip which we previously referred to as P1+, then it's P16X64A (see [post=1266206]this post[/post] by Ken). If not, then what is P16X32B?

The P16X32B was Chip's original name for the new chip. I was not aware that Ken had settled on a new name. I'll use that from now on.

Ross.

RossH · 2014-05-11 16:19

Todd Marshall wrote: »

Seairth wrote: »

1) and 2) In my case, I am indeed referring to the 32-slot table. I don't know about the others.

Can you describe your table and how it works?

Seairth's model is discussed in his thread. This thread is for the "SuperCog" model proposed by Heater.

Ross.

Seairth · 2014-05-11 16:20

Todd Marshall wrote: »

Can you describe your table and how it works?
1) Is it static or dynamic?
2) Is there one-to-one correspondence COG to SLOT (with 1:16, 2:17, 3:18, etc?)
3) Can a COG be represented more than once in your table?
4) Does the HUB idle through or skip unoccupied slots ... or is that optional?
5) Is the modulo of the table fixed (32) or can it be less?

See the other thread ([post=1265725]A 32-slot Approach[/post]). In truth, I should have responded to Ross's "use case" challenge in that thread, as my response was in the context of that approach. This thread is, indeed, primarily about the SuperCog approach. My apologies for the confusion.

(edit: As Ross also just pointed out.)

Todd Marshall · 2014-05-11 16:29

RossH wrote: »

This thread is about one "mooching" supercog in cog 0. See the first post in the thread for more detail
Ross.

Thanks for that and for the gentle scolding. I have reread the first post. I understand it. And I can see where its (the so-called SuperCog's) behavior could be obtained with some dialect of a shared slot/assigned slot model which makes my question valid here.

Tubular · 2014-05-11 16:29

RossH wrote: »

I'm not trying to move the target, I'm trying to identify whether a hub slot table is necessary or if the same problems could be better solved with other techniques (such as normal cog cooperation techniques, or additional cog-to-cog communications).

Yes, there is certainly a class of problems where low latency cog to cog comms would be preferable to going via the hub. The ultimate goal is to enable cogs to co-operate effectively.

There have been several times when we wished port B on the P8X32A had been hooked up to enable inter cog comms. Having a 'Port C' internally connected on the P16X64A might be useful for inter cog communications, and may also help smooth the transition to P3.

Todd Marshall · 2014-05-11 16:49

Seairth wrote: »

See the other thread ([post=1265725]A 32-slot Approach[/post]). In truth, I should have responded to Ross's "use case" challenge in that thread, as my response was in the context of that approach. This thread is, indeed, primarily about the SuperCog approach. My apologies for the confusion.

(edit: As Ross also just pointed out.)

Ok. I've looked there. #45 sort of summed it up for me (I hope) and all the other stuff is implementation details. Seems to answer all my questions except (4) which evidently SuperCOG0 answers.

Is this "32 slot approach" slot sharing? slot assignment? or something else? These are terms I keep seeing in my reading and they are ambiguous (COIK - Clear only if known ... but a reflection of my lack of standing).

jmg · 2014-05-11 17:24

Reply moved to
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1266703&viewfull=1#post1266703

Seairth · 2014-05-11 17:26

jmg wrote: »

Correct, ...

I don't suppose you could move this to the other thread. I don't want to accidentally start talking about other-than-SuperCog approaches here.

jmg · 2014-05-11 17:29

Seairth wrote: »

I don't suppose you could move this to the other thread. I don't want to accidentally start talking about other-than-SuperCog approaches here.

Good idea - done.

Seairth · 2014-05-11 17:37

jmg wrote: »

Correct, there is a Table Design that can be easily configured to give SuperCogN, and when N=0, that covers this thread's narrow use case.

(But I am responding to this bit, as it is about SuperCog.)

That's not strictly true. The table approach (at least the single-table approach, anyhow) is static. If you give SuperCogN slots, it always has them (whether it's using them or not) and other cogs never have them (whether they need them or not). My understanding of SuperCog is that non-SuperCog cogs still get their slots when they need them, but SuperCog gets them the rest of the time (and always gets its own slot). The static table approach cannot provide this exact functionality.
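
To make the difference concrete, here is a toy model of the two arbitration rules as I understand them. Everything in it (the function names, the 16-entry table, one slot per clock) is an assumption made for illustration, not a description of any actual design.

```python
NUM_COGS = 16


def supercog_grant(clock, requesting, supercog=0):
    """SuperCog rule: the slot owner wins if it asks, else the supercog may take it."""
    owner = clock % NUM_COGS
    if owner in requesting:
        return owner
    return supercog if supercog in requesting else None


def static_table_grant(clock, requesting, table):
    """Static table rule: only the cog listed for this slot can ever use it."""
    owner = table[clock % len(table)]
    return owner if owner in requesting else None


def count_grants(grant, requesting, clocks=NUM_COGS, **kwargs):
    """Count hub grants per cog over one full rotation."""
    counts = {}
    for clock in range(clocks):
        winner = grant(clock, requesting, **kwargs)
        if winner is not None:
            counts[winner] = counts.get(winner, 0) + 1
    return counts


# A static table set up in advance for cog 0 plus one slot for cog 5.
table = [0] * NUM_COGS
table[5] = 5

# Cog 7 unexpectedly needs the hub while cog 0 is busy with it.
print(count_grants(supercog_grant, {0, 7}))                   # {0: 15, 7: 1}
print(count_grants(static_table_grant, {0, 7}, table=table))  # {0: 15}
```

In this model cog 7 still gets its own slot under the SuperCog rule, but is locked out by the static table until someone rewrites the table.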

jmg · 2014-05-11 17:44

Seairth wrote: »

(But I am responding to this bit, as it is about SuperCog.)

The static table approach cannot provide this exact functionality.

If by static table you mean single table, True, but note I said there is a Table Design -- and that (dual) Table design of Primary/Secondary COGids can be configured to fully give SuperCog0 (or SuperCog1, or SuperCog2, or ... SuperCog15)

RossH · 2014-05-11 17:47

jmg wrote: »

If by static table you mean single table, True, but note I said there is a Table Design -- and that (dual) Table design of Primary/Secondary COGids can be configured to fully give SuperCog0 (or SuperCog1, or SuperCog2, or ... SuperCog15)

Gosh it's hard to keep up with this! I just made the same point over on the other thread, which is that to have a table design that can mimic what the SuperCog does requires an even more complex design than the original table schemes!

Ross.

jmg · 2014-05-11 17:54

RossH wrote: »

Gosh it's hard to keep up with this! I just made the same point over on the other thread, which is that to have a table design that can mimic what the SuperCog does requires an even more complex design than the original table schemes!

Only for some

That table variant design has actually been in place for quite a while, perhaps you somehow missed the details?

RossH · 2014-05-11 18:02

jmg wrote: »

Only for some
That table variant design has actually been in place for quite a while, perhaps you somehow missed the details?

Quite likely - there are too many complex schemes floating about to keep up with, and the details of each one change every time you look.

That's why I prefer the simplicity of the SuperCog!

Ross.

Todd Marshall · 2014-05-11 18:14

RossH wrote: »

Gosh it's hard to keep up with this! I just made the same point over on the other thread, which is that to have a table design that can mimic what the SuperCog does requires an even more complex design than the original table schemes!

Ross.

Sounds like the "shared slot" definition I was trying to find. Under SuperCog? design, if the HUB shows up at a slot (whether static or not), if that slot doesn't need the HUB, SLOT0/COG? (a SuperSlot design) or SLOT0/COG0 (a true SuperCOG0/SuperSLOT0 design) gets HUB services (whether it needs them or not). Variant 2 table scheme allows a COG other than COG0 to be fallback COG. This "sharing" is different than simply "idling through" or "skipping" the HUB access (skipping really screwing up determinism which I assume no one wants?).

Slot sharing doesn't require primary Slot assignment (the static case) and in the case of SuperCOG0 doesn't require secondary Slot assignment (a super static case).

Do I sort of have it right?

RossH · 2014-05-11 19:32

Todd Marshall wrote: »

Do I sort of have it right?

Yes, I think so ... sort of!
