The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 61 — Parallax Forums

The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • Cluso99 Posts: 18,069
    edited 2014-05-05 11:46
    Excellent summary Mike.
    Some of us see the advantages of being able to utilise otherwise wasted resources (if they are not being used). This would enhance the hub accesses and therefore improve hub bandwidth and/or reduce hub latency. With some implementations, this could be done in a completely deterministic manner.
    Some fear that this upsets the previous notion that all cogs are equal, because now we can make some cogs run faster at the expense of other unused cogs, or cogs that have explicitly been demoted by these methods.

    To placate those against, various solutions have been offered, including
    (a) the default is always no enhancements
    (b) obex restrictions to not include objects requiring any of these enhancements
    (c) simplification of these enhancements to cog pairs

    Those against have severely restricted the technical discussions of the best way to implement these features.

    I have adopted the path of least resistance in proposing slot sharing between two cogs, even though there are better methods by which this can still be achieved.
    Mike Green wrote: »
    As I see it, we have two camps representing two overlapping views of how the P1+ would be used. One camp is concerned that putting hub access under program control (assuming it is indeed easily implementable) will lead to overuse and a breaking of the easy determinism and independence of library objects. The other camp is concerned that the absence of this feature will lead to applications for the P1+ being out of reach due to insufficient hub data throughput and/or insufficient cog throughput due to the bottleneck of hub access.

    In one case, we're mostly looking at objects individually, somewhat in isolation. In the other case, we're mostly looking at overall programs and the global assignment of cogs to functions and allocating hub access globally. Both are legitimate goals, but won't overlap very much.

    I wonder whether we can add some support to compiled programs to facilitate this. The simplest thing would be to mark all objects as to whether they: 1) make use of dynamic allocation of hub access slots or not; 2) require a fixed 1:16 cog to hub access ratio. We may be able to come up with some standards for objects to specify the conditions under which they will run properly when hub access slots are allocated dynamically. We may end up with one or more objects that do hub management.

    I like the idea of having a 4-bit cog number register for each hub access slot with these being initialized to the corresponding cog numbers on a reset so we have the fixed 1:16 relationship as a default. Once you allow dynamic hub access allocation, I don't see a strong need to enforce good behavior in the hardware. I would be happy with an instruction that takes a cog number and a bit mask and sets the hub access slots corresponding to the bit mask so the specified cog uses them (and leaves the others alone). That way, a cog could dynamically change its own slot usage or change another cog's usage. There's risk that a program could hang or not function properly, but the default would be what's expected and we could create library routines to manage this correctly.
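    The register-plus-bitmask instruction Mike describes can be modeled in a few lines. This is only an illustrative Python sketch; the `set_slots` name and exact semantics are my assumptions, not a real or proposed P2 instruction:

```python
# A sketch (Python, illustration only) of Mike Green's proposal: a 16-entry
# table of 4-bit cog numbers, one per hub slot, reset to the identity
# mapping so the default stays the fixed 1:16 relationship. set_slots()
# models the proposed instruction: it takes a cog number and a bit mask
# and reassigns only the masked slots, leaving the others alone.

def set_slots(table, cog, mask):
    """Assign every slot whose bit is set in mask to cog; leave the rest alone."""
    for slot in range(16):
        if mask & (1 << slot):
            table[slot] = cog
    return table

table = list(range(16))                     # reset default: slot i -> cog i
set_slots(table, 3, 0b0000_0001_0000_0001)  # cog 3 claims slots 0 and 8
assert table[0] == 3 and table[8] == 3
assert table[1] == 1                        # unmasked slots untouched
```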
  • Cluso99 Posts: 18,069
    edited 2014-05-05 12:04
    dMajo,
    I think you have not fully understood the discussion.
    By utilising cog slots in a better fashion, we can get faster cogs and still retain fully deterministic behaviour in both the faster cog(s) and also in the cog(s) donating their slots (if they are even used).
    None of this prevents using specific pins, because we can determine which cogs use these additional features, and of course the default behaviour is that these added features are disabled. So it is the user's choice.

    The mooching seems to be a much more complex (hardware) implementation, and of course there are no guarantees of additional slots, and determinism is lost for the mooching cog(s). Then there is the problem of more than one moocher, and which moocher gets priority.
    Therefore I see that mooching should be limited to any 1 cog, or any 2 cogs where one gets additional even slots and the other odd slots. This would simplify the logic significantly.

    Given the choice, I would opt for slot configuration / slot pairing over mooching, because it can be done deterministically now that we have 16 cogs.
  • Heater. Posts: 21,230
    edited 2014-05-05 12:05
    Cluso's solution of sharing HUB resources between two, or more, cooperating COGs within a software component is the only way I can see to maintain the current happy situation that "my component cannot influence the timing of your component". And vice-versa.

    Is this practical, is it doable? I have no idea.

    There is much mention here of excluding demanding objects from OBEX.

    In my mind OBEX is irrelevant to this discussion. Currently there is no P2 and there is no OBEX for it. Besides if the P2 were a wild success then most of the code people use with it will probably never see OBEX.
  • mark Posts: 252
    edited 2014-05-05 12:11
    If the hub slot sharing becomes a thing, it would seem even more important that there be a fast access cache in the hub, otherwise you have no option but to waste those cogs you're using the bandwidth from (unless they're tasked with nothing more than controlling some pins).
  • potatohead Posts: 10,261
    edited 2014-05-05 12:16
    I really like mooch, but I wonder about the decision path and its impact on clock speed.

    Ray, what is the best method? Those in favor would do everyone a service by centering on what that is.

    Honestly, of all the options, mooch is my personal favorite because it is a single bit of control, maximizing it requires "cooperating objects", taking advantage of it requires just setting a bit, and since it is passive, the strong default is the usual, expected behavior. But that is me. Chip seemed open to mooch as well, though we never did discuss how to mooch very well. Dave suggested a simple weighting per cogid, shifted round robin, or something along those lines.

    Those in favor still vary considerably. Jmg wants a very robust level of control. Ray wants a strong compromise, as do I best case. Others range around in similar ways. This means Heater, Ray and I just might agree! (amazing)

    I suggest a discussion about the best possible method, engineered to preserve round robin behavior, offer an opportunity to make effective use of unused HUB access windows, be as simple and robust as possible, and ideally not impose COG specifics, due to how we need to handle the fixed WAITVID driven DAC pins per COG.

    (operating on them in software, sans the automatic WAITVID type function is not at issue as any COG can interact with any DAC otherwise.)

    Too many choices will serve to continue to dilute the discussion. We've reached a point of clarity summed up nicely by Mike.

    I further suggest a compelling use case or two be compared between round robin and not round robin, so that "what is worth what?" can be resolved in terms of using COGS together and focusing performance on specific COGS.

    Maybe some bit of progress is possible? :)

    My goal and interest is seeing this resolved. Just because it is painful...
  • Roy Eltham Posts: 3,000
    edited 2014-05-05 12:18
    Cluso,
    I don't think the issue is that some of us don't see the advantages of <insert latest slot sharing scheme here>, as you imply by your comment "Some of us see the advantages of being able to utilise otherwise wasted resources (if they are not being used) to advantage."

    It's that some of us see disadvantages that some of you are dismissing as unimportant, and at least in my case I don't feel that the advantages are as big or worthwhile as some of you seem to think.

    Others:
    I see that wheels are still spinning in the mud on this topic a few pages later... I guess we'll be stuck on this until Chip does another drive by post or two that grabs everyone's attention onto something else...
  • Heater. Posts: 21,230
    edited 2014-05-05 12:23
    markaeric,
    (unless they're tasked with nothing more than controlling some pins)
    Which is to belittle the whole point of the Propeller.

    Unless, of course, one wants it to grow up and become an ARM or MIPS or whatever. Which is totally pointless; those other guys out there already do that much better.
  • ctwardell Posts: 1,716
    edited 2014-05-05 12:26
    Roy Eltham wrote: »
    It's that some of us see disadvantages that some of you are dismissing as unimportant

    Could you list those disadvantages, being fairly specific?

    The ones I recall hearing consistently are:

    1) It makes the cogs 'unequal'.

    2) It somehow diminishes the utility of the OBEX.

    Thanks,

    C.W.
  • mark Posts: 252
    edited 2014-05-05 12:28
    Heater. wrote: »
    markaeric,

    Which is to belittle the whole point of the Propeller.

    Unless, of course, one wants it to grow up and become an ARM or MIPS or whatever. Which is totally pointless; those other guys out there already do that much better.

    How common is it for cog objects to not use any hub bandwidth? I assume that it's actually uncommon (but I take no issue with being proven wrong), which is my entire point.
  • potatohead Posts: 10,261
    edited 2014-05-05 12:35
    Making COGS unequal means increased difficulty combining COG code where multiple needy COG code objects are involved. The more needy and numerous, the higher the potential for this.

    We won't know what COG code will do, because its performance and timing would depend on what other COGS need from the HUB.

    Some schemes may require COG specifics, complicating the DAC pin use. This may be engineered away.

    The potential exists to discourage parallel programming in favor of easier serial programming. Using COGS together vs emphasizing one or more COGS.

    Increased potential for non-deterministic timing interactions hard to debug and test for.

    Depending on scheme, added complexity / things to manage when employing reuse and or authoring new code.

    Those are the most common ones I've read.
  • Heater. Posts: 21,230
    edited 2014-05-05 12:44
    @ctwardell,
    The ones I recall hearing consistently are:

    1) It makes the cogs 'unequal'.

    2) It somehow diminishes the utility of the OBEX.

    1) Yes. When I put a transistor or 74XXX logic chip or some FPGA logic block in my design I expect it to behave the same no matter what other such devices I put into my design. Otherwise things get very difficult to manage.

    2) I don't care about OBEX. Currently there is no P2 and there is no OBEX for it. As I said elsewhere, if the P2 is a success most code will not come from an OBEX.

    Now, if you can engineer the thing such that the "component" is two COGs paired (or more) then we satisfy 1) above. Is that practical? I don't know.

    @mark
  • jmg Posts: 15,175
    edited 2014-05-05 12:45
    dMajo wrote: »
    This is one reason more to discard the TopUsedCog or Power2 idea.

    That's cool, my Verilog test already has a single Boolean config bit that disables TopUsedCog. Easy Peasy.

    The reason it is there is that there are some design cases where a 16-slot loop cannot give deterministic (no jitter) operation, which enabling TopUsedCog can solve. Thus an optional TopUsedCog gives a more deterministic chip solution.

    The Logic cost of TopUsedCog is tiny - a priority encoder and a control boolean.

    Anyone opposing the delivery of simple choice, has to also demand only a single PLL Divider choice - but Chip has expanded the Divider choices from P1, for good reasons.

    There is another State Engine I am exploring, which I will call a Stepping-stone, or hot-brick scanner, which is an extension of TopUsedCog. It needs more logic, but looks to still (easily) meet the Timing, as it remains on the D-FF side of the scanner state engine. Yes, it will also have a control bit to give 1:16 operation.
  • ctwardell Posts: 1,716
    edited 2014-05-05 12:52
    potatohead wrote: »
    Making COGS unequal means increased difficulty combining COG code where multiple needy COG code objects are involved. The more needy and numerous, the higher the potential for this.

    We won't know what COG code will do, because its performance and timing would depend on what other COGS need from the HUB.

    Some schemes may require COG specifics, complicating the DAC pin use. This may be engineered away.

    The potential exists to discourage parallel programming in favor of easier serial programming. Using COGS together vs emphasizing one or more COGS.

    Increased potential for non-deterministic timing interactions hard to debug and test for.

    Depending on scheme, added complexity / things to manage when employing reuse and or authoring new code.

    Those are the most common ones I've read.

    With the exception of parallel vs. serial programming, all of these can be managed by the various ideas; finding the one that covers the most and is still useful is of course the hard part.

    As far as the parallel vs. serial, I don't think forcing someone into one or the other of those solutions is our decision to make.

    Some problems work out nicely when spread over multiple cogs, like the video example you mention earlier.
    The tasks that work well in parallel tend to be very 'mechanical' or like an assembly line where a few very specific operations need to happen very quickly and usually in large quantity.
    Tasks like business logic tend to have far fewer items that can simply be parted out for parallel processing and tend to be easier to create in a more serial manner.

    C.W.
  • Heater. Posts: 21,230
    edited 2014-05-05 12:52
    jmg,

    Maybe you missed my recent question: What does "TopUsedCog" mean?

    If I have some COGs running at some moment, say 0, 9 and 15. Then what is "TopUsedCog" and how does it influence HUB access?
  • Heater. Posts: 21,230
    edited 2014-05-05 12:59
    ctwardell,
    Tasks like business logic tend to have far fewer items that can simply be parted out for parallel and tend to be easier to create in a more serial manner.
    I have no idea what "business logic" is supposed to mean in the context of a micro-controller. It's a term that did not exist until the recent MBA generation.

    Anyway, never mind. If the thing you are trying to do is not amenable to parallelization then why are you using a 16 core MCU and not an ARM or some such that will do it a lot better no matter how we mess with HUB bandwidth allocation?
  • mark Posts: 252
    edited 2014-05-05 13:03
    Heater. wrote: »

    @mark
  • ctwardell Posts: 1,716
    edited 2014-05-05 13:14
    Heater. wrote: »
    ctwardell,

    I have no idea what "business logic" is supposed to mean in the context of a micro-controller. It's a term that did not exist until the recent MBA generation.

    Anyway, never mind. If the thing you are trying to do is not amenable to parallelization then why are you using a 16 core MCU and not an ARM or some such that will do it a lot better no matter how we mess with HUB bandwidth allocation?

    Come on Heater, you know what it means.

    The main control program, the marionette that is controlling all the puppets.

    Something that is controlling all those nifty little objects that get loaded from the OBEX.

    The part that is left to the author...that which cannot be grabbed from the obex...that which makes my application unique...

    C.W.
  • Cluso99 Posts: 18,069
    edited 2014-05-05 13:52
    I am on my Xoom and it's impossible to quote partial posts, so I will answer some things later.
    In the meantime, here is another possible solution (it's currently not my choice but it may work out that it solves my preferences anyway).

    jmg has proposed the TopUsedCog scheme. Basically this has been done to reduce the hub loop to less than 16 slots. It works in conjunction with a 4-bit table (for cog#) per hub time slot. The advantage is that a cog can be given say 3 slots equidistant apart, eg 1 slot in every 5 with a total of 15 slots gives 3:15 = 1:5. Otherwise 3:16 would give jitter.
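    The jitter arithmetic above checks out, and is easy to verify. A quick Python sketch (`gaps` is just an illustrative helper, not anything from the proposal):

```python
# Three equidistant slots in a 15-slot loop repeat every 5 slots (no
# jitter), while 3 slots in a 16-slot loop cannot be equidistant: the
# wrap-around gap differs, so the access interval jitters.

def gaps(slots, loop_len):
    """Intervals between consecutive slot grants around a circular loop."""
    s = sorted(slots)
    return [(s[(i + 1) % len(s)] - s[i]) % loop_len for i in range(len(s))]

assert gaps([0, 5, 10], 15) == [5, 5, 5]          # 3:15 = 1:5, deterministic
assert sorted(gaps([0, 5, 10], 16)) == [5, 5, 6]  # 3:16 -> gaps of 5,5,6, jitter
```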

    Now the problem I see is that existing objects (the new objects) expect 1:16 and so anything other than 1:16 may break its determinism - ie the object may fail. I accept the answer that you do not need to use this feature.

    The other part is that the number of slots equals the highest cog used.

    Now, if we had the jmg proposed 4-bit table, one per slot, and 16 max slots, and then added a mode that set (changed the default) the total number of slots to 16 or less, we would address the issues jmg is trying to solve.
    This would require a cog to set the number of slots, then setup the slot/cog table to the desired values. Most likely this would be done by cog 0 and it could be so limited if desired for security. This gives total control back to the user if he/she so desires.
    I think it's quite simple to implement, and it permits the user to use any cogs desired. The cost for that user is that objects may not work as designed - but that's specifically his choice.

    For the rest of us, we can choose to ignore the total slots (ie use the default 16) and the default table of each slot is set to the corresponding cog#.

    The advanced user can decide to set the slot#s with any cog#s. We don't care because the default is still slot0=cog0, etc.

    Now, we can change the table to share slots. So, we could set the slot table for slot1=cog1 and slot9=cog1. This gives cog1 2x slots equally spaced, with cog 9 not getting any slots. This type of use would not compromise any standard objects.
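    A Python sketch of that default table and sharing example (illustrative model only, not the proposed hardware): reset maps slot i to cog i; writing cog 1 into slot 9's entry gives cog 1 two equally spaced slots per rotation while cog 9, the donor, gets none.

```python
# Model of the per-slot cog# table: reset default is the identity mapping
# (slot0=cog0, slot1=cog1, ...), so the fixed 1:16 behaviour is preserved
# unless the user explicitly reassigns a slot.

table = list(range(16))    # reset default: slot i -> cog i
table[9] = 1               # cog 9 donates its slot to cog 1

def slots_for(cog, table):
    """All hub slots currently granted to the given cog."""
    return [s for s, c in enumerate(table) if c == cog]

assert slots_for(1, table) == [1, 9]   # cog 1: 2x slots, equally spaced
assert slots_for(9, table) == []       # cog 9: no hub access
```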

    The only addition to this that I would like to see is a mooching method.

    So what if there were a second table that were used if the slot in the first table were not used. This would permit one level of mooching and permit total user control of the mooching. The silicon for this would not be that complex, and would let the user decide which, if any, cog(s) could mooch. This simpler 2nd mooch table can act like a lower priority in a cog paired scheme, or as a mooch (but not both in respect to a specific slot)

    Now if we want, we can have more than 1 mooching cog, but those mooching cogs only get access to the unused slots they have been set for. ie each slot can only be assigned to 1 cog, and if that cog does not require it, it can be given to another specific cog.
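    The two-table arrangement can be modeled simply (Python, names illustrative): each slot has a primary owner, and the second-table cog gets the slot only when the owner is not requesting the hub.

```python
# Model of the second, lower-priority "mooch" table: grant() decides who
# gets a given slot on a given rotation - the primary owner if it wants
# the hub, otherwise the designated moocher for that slot.

def grant(slot, primary, mooch, wants_hub):
    owner = primary[slot]
    return owner if wants_hub[owner] else mooch[slot]

primary = list(range(16))      # default 1:16 mapping
mooch = list(range(16))
mooch[3] = 7                   # cog 7 may mooch cog 3's unused slot

wants = [True] * 16
assert grant(3, primary, mooch, wants) == 3   # owner uses its own slot
wants[3] = False
assert grant(3, primary, mooch, wants) == 7   # cog 7 mooches slot 3
```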

    Does this make sense, and would this satisfy most?

    Sorry about typos :( I did buy an iPad to replace my Xoom but it seems I misplaced it into my wife's handbag!
  • jmg Posts: 15,175
    edited 2014-05-05 14:05
    Heater. wrote: »
    jmg,

    Maybe you missed my recent question: What does "TopUsedCog" mean?

    If I have some COGs running at some moment, say 0, 9 and 15. Then what is "TopUsedCog" and how does it influence HUB access?

    It is a simple priority encoder, which gives 15 in your example.
    (when disabled, it always loads 15)

    If COG 15 then disables its UsesHub boolean, the TopUsedCog changes to 9.
    (COG 15 can still run a timer etc, it just agrees to not use the HUB )
    If COG 9 disables its UsesHub boolean, TopUsedCog is still 15.

    A common usage would be to have a trigger COG as the upper one - it can get out of the way during the fastest burst work, and can re-appear to setup/result process.
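    A minimal Python model of the behaviour described above (illustrative only; the function name is mine, and what the encoder returns when no cog is flagged is not specified in the posts, so the model assumes 15):

```python
# TopUsedCog as a priority encoder over each cog's UsesHub flag: it yields
# the highest cog number asserting I-am-Using-HUB, and the hub scan counter
# reloads from it, so the scan runs 0..TopUsedCog. With the control flag
# disabled, the reload value is always 15 (fixed 1:16 behaviour).

def top_used_cog(uses_hub, enabled=True):
    if not enabled:
        return 15
    return max((cog for cog, used in enumerate(uses_hub) if used), default=15)

flags = [False] * 16
for cog in (0, 9, 15):        # Heater's example: cogs 0, 9 and 15 running
    flags[cog] = True
assert top_used_cog(flags) == 15
flags[15] = False             # COG 15 clears its UsesHub boolean
assert top_used_cog(flags) == 9   # the scan loop shrinks to 0..9
```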
  • jmg Posts: 15,175
    edited 2014-05-05 14:29
    Cluso99 wrote: »
    Now, if we had the jmg proposed 4-bit table, one per slot, and 16 max slots, and then added a mode that set (changed the default) the total number of slots to 16 or less, we would address the issues jmg is trying to solve.
    This would require a cog to set the number of slots, then setup the slot/cog table to the desired values. Most likely this would be done by cog 0 and it could be so limited if desired for security. This gives total control back to the user if he/she so desires.
    I think it's quite simple to implement, and it permits the user to use any cogs desired.

    Sure, either works - the state engine does not really care if the 4 bit reload field comes from a Single Config register, or a Priority encoder.
    A priority encoder follows the UsesHub flag in a live manner, while a common config field has shared access questions.
    A priority encoder also cannot be mis-set; when UsesHub is on, a COG gets a slot to use itself or remap.

    The mapping can be either global or locally set - I favoured a local 5-bit field as that will fit in the single COG setup field, and gives control to the COG; but again, the state control scanner does not care how the write is done, so long as it can read the values.

    It could even be dual-ported, so COG config allows local, atomic access(nibble), and global setup is also allowed.

    One down side I saw with global (my initial approach), is two COGS might try to change the field at the same time, and it really needs to be atomic and ideally granular, as it may be started on a trigger condition.
    Local Load and Priority Encoder keeps this all SysCLK granular and responsive.

    Of course, RESET would load OwnCogID into the mapping, and disable TopUsedCog by default.
  • jmg Posts: 15,175
    edited 2014-05-05 14:50
    Cluso99 wrote: »
    The only addition to this that I would like to see is a mooching method.

    So what if there were a second table that were used if the slot in the first table were not used. This would permit one level of mooching and permit total user control of the mooching. The silicon for this would not be that complex, and would let the user decide which, if any, cog(s) could mooch.

    With local access a COG can change the CogID anytime it wants, so it can selectively donate to many Cogs under SW.
    Anything that tries to be 'smarter' than that, with fetch-time decisions, is not so much a silicon issue, as a System-Speed impact issue.

    A variant on 2 tables, could use sub-scan handling, so the two values you enter alternate every scan cycle, 50% each.
    (both the same would be 100%, and you might sensibly want Self + 1 other).
    That approach would have no fSys timing impact.

    Addit: one usage case of Dual tables sub-scan, would see a 16 COG design, where 15 COGS are set for alternate share, giving themselves 6.25MHz each and the hog-cog gets ~100MHz.
    That has a nice set-and-forget appeal.
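    That dual-table sub-scan can be modeled in a few lines of Python (names illustrative): each slot holds two entries that alternate on successive scan cycles, so each entry gets 50% of the slot, or 100% when both name the same cog.

```python
# Dual-table sub-scan model: on even rotations table A's entry owns the
# slot, on odd rotations table B's entry does. Setting 15 cogs' alternate
# entries to one "hog" cog halves their bandwidth and concentrates the
# other half (plus the hog's own slot) on that cog.

def slot_owner(slot, rotation, table_a, table_b):
    return table_a[slot] if rotation % 2 == 0 else table_b[slot]

a = list(range(16))     # table A: the default identity mapping
b = [0] * 16            # table B: every slot's alternate cycle goes to cog 0
owners = [slot_owner(5, r, a, b) for r in range(4)]
assert owners == [5, 0, 5, 0]   # cog 5 keeps half its slots; cog 0 gets the rest
```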
  • mark Posts: 252
    edited 2014-05-05 15:15
    Is there a reason that all these methods are based on the hub cycling every 8 instructions (16 clocks)? I see no reason why hub window allocations can't be selectively increased for certain cogs, at the expense of being reduced for others, but still at least guaranteeing them all some access. As I mentioned in an earlier post, if one hub window mode were to put 8 cogs in one group, and 8 in another, you could, for example, give the first group access to the hub every 4 instructions (8 clocks), while the other group would get it every 16 instructions. Of course, they don't have to be split in two groups like that. And while there are a large number of possible combinations, I think a good balance could be found with only a handful of these 'modes'.
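    Whether a given grouping fits depends on the slot granularity. Under one reading - the hub serves at most one slot per system clock (16 cogs in 16 clocks), and each cog in a group individually gets a window at the stated period - the per-cog rates must sum to at most 1, which a hypothetical Python check makes concrete:

```python
# Bandwidth-budget sanity check for grouped-window modes, assuming one hub
# slot per system clock and per-cog windows at the stated period. Both the
# helper and the assumptions are illustrative, not part of any proposal.

def budget(alloc):
    """alloc: list of (n_cogs, clocks_between_windows) pairs -> slots/clock."""
    return sum(n / period for n, period in alloc)

# 8 cogs every 8 clocks plus 8 cogs every 32 clocks oversubscribes the hub:
assert budget([(8, 8), (8, 32)]) > 1.0
# 4 cogs every 8 clocks plus 12 cogs every 32 clocks fits within budget:
assert budget([(4, 8), (12, 32)]) <= 1.0
```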
  • T Chap Posts: 4,223
    edited 2014-05-05 15:44
    The concept is flawed. There should be direct access for cogs that need high speed, then no worries about this concept of how many clocks to allocate.
  • mark Posts: 252
    edited 2014-05-05 15:50
    T Chap wrote: »
    The concept is flawed. There should be direct access for cogs that need high speed, then no worries about this concept of how many clocks to allocate.

    But hub bandwidth is a limited resource, so there's no choice but to "worry" about allocation.
  • JRetSapDoog Posts: 954
    edited 2014-05-05 15:54
    Hey, I'm just throwing this oddball hub-on-steroids scheme out there to add some variety while we wait for Chip to resurface for air and shed some further light on various things (but the longer he's down there at depth, the better for all from a progress standpoint). In a nutshell, the scheme involves dividing the hub into 16 banks, wherein each cog accesses one bank at a time on a rotating (hub) basis, with all of the cogs doing so simultaneously. It also involves or necessitates accessing memory in a spaced-apart manner to make access flow. That's the gist. However, on further reflection, there might not be much advantage, if any, to this oddball scheme in practice compared with what Chip has already planned (as mentioned in the last paragraph of this post). Moreover, even if there were some advantage, the bus-arbitration/switching/steering logic would likely be quite large (perhaps making it virtually impossible or impractical). So, definitely feel free to skip this post, as I was just thinking out loud (and it's not thought out very well at all). I'm guessing there's nothing to see here; move along folks (it likely won't be your first time skipping or skimming a hub-slot allocation post). That is to say, I'm guessing it's either unfeasible/unworkable or doesn't lead to meaningful benefits for the costs involved. But I could be wrong, and it might inspire some other idea or teaching (but please be gentle if responding). Anyway, read on if you're just killing time waiting for Chip, curious, or bored.

    But first, a look back: In another thread (I believe), I mentioned the obvious possibility of each cog having its own separate bank of memory, with banks being assignable, such that individual cogs could get maximum bandwidth since they wouldn't need to share. In other words, that would be a non-hub approach. That possibility limits each cog's access to only a portion of the entire hub space (unless bank assignments could be dynamically reassigned), effectively making them separate MCU's, at least from a memory standpoint. Dynamic reassignments might provide the desired flexibility, though, depending on how fast buses could be reconfigured. Anyway, such a bank possibility never gained any traction (probably for a multitude of reasons). Perhaps the logic needed to do the assignments (connect buses to cogs, whether on-the-fly or not) would be too large/complex/cumbersome or slow, and there'd be a need to prevent multiple cogs from claiming or being assigned the same banks at the same time.

    But going in the opposite direction, how about a hub-sharing mechanism on steroids involving memory banks? The idea is that the hub arbitrator would consist of 16 sub-arbitrators (one for each current bank-cog pairing), such that each cog got access to one 32KB bank at a time (512KB/16), followed immediately by access privilege to the next sequential 32KB bank and so on. All cogs would have potential access to a bank at the same time, but no two cogs would have access to the same bank at the same time. All 16 cogs would be circularly following each other lock-step in terms of bank access (each cog with simultaneous access to a separate 32KB bank per 1/16th hub cycle). Maybe each cog would have a 16-bit circular shift register with only 1 bit set to feed the bus assignment logic of the 16 sub-arbitrators, the shift registers of adjacent cogs being offset by one bit in each direction from each other in terms of the set bit.

    When making a memory request, a first thought was that perhaps one would specify the bank number and the cog would block until that bank became available to that cog. But preferably the bank number would not have to be specified, as the hardware could automatically span consecutive addresses (not sure if the addresses point to longs or quads, a critical detail) across the 16 32KB banks, letting the memory be treated as one continuous strip, even though the memory was broken up into longs (or quads) that were actually separated by 32KB (though they would seem to be adjacent from the user's perspective). In such usage, a cog's data would be spread across the banks of memory (perhaps kind of like data in a RAID hard drive system). But in this way, every cog could access "chunks" of data at full speed, with each cog accessing a separate bank at the same time. Such a scheme could fly if "sequential" access to data were needed, sequential in the sense of from bank-to-bank, that is. But if needing to access the hub/banks in a random access way, then things would slow down to 1/8th speed (not counting other overhead), on average, as a cog would block until it got its shot at the desired bank. And sustained access within the same bank (if needed for some reason, though I'm not sure what that would be) would slow to 1/16th speed (but such access would only be possible if the programmer specified data addresses spaced apart by 16 to overcome (perhaps "defeat" is more correct) the way the logic would automatically spread access across hub banks).
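    The address interleaving described there can be sketched in Python (illustrative model only; the function names are mine, and whether addresses index longs or quads is, as noted, an open detail):

```python
# Bank-interleaved addressing: consecutive long addresses map to successive
# 32KB banks, so a cog sweeping linearly through memory visits each of the
# 16 banks exactly once per rotation, letting all 16 cogs stream at once.

N_BANKS = 16

def bank_of(long_addr):
    return long_addr % N_BANKS      # low 4 address bits select the bank

def offset_in_bank(long_addr):
    return long_addr // N_BANKS     # remaining bits index within the bank

# 16 consecutive longs land in 16 distinct banks:
assert sorted(bank_of(a) for a in range(100, 116)) == list(range(16))
```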

    Take the case of multiple cogs executing code directly from the hub (if that does get implemented): each cog doing so could read in the code at full hub speed (presuming, of course, that the data were spread across the hub banks). In a sane world, that would require the cooperation of the compiler to automatically spread instructions (and data) across memory just right (a key requirement of this oddball scheme). And for coding, machine instructions would be implemented in such a way as to automate bank spreading. For example, if we do have indexing, then the index could automatically point to the same long (or quad) of the next bank (with wrap-around-plus-one) instead of the next actual long (or quad).

    Gee! That sure sounds like a lot of effort to go to when Chip has said that it should be possible to get hub exec at half speed (50 MIPs, iirc) without such complexity, due in large part to more data coming in at a time (quads) than can be executed at a time (longs). Also, I recall Chip saying that, for example, a cog doing video could have access at 200MB/second (that assumes sequential access, of course, which is often the case), which is another way of saying the same thing as the 50 MIPS thing. And a move to WIDES would double that. One wonders why higher throughput would be needed if the data can't be consumed or produced that fast by the rest of the chip (and, for example, targeting the new chip at very high-resolution displays seems like a less-than-ideal application of the chip). Anyway, I've mentioned the foregoing oddball scheme just in case there's anything in it that could be combined with the existing quad (or wide) access plan. Chances are low, I suppose, and even if not, complexity could be high (most notably, the cross-bus switcher, for lack of a better term, would involve a lot of logic). But it may be that all of this is seeking to find a solution to a problem that doesn't exist, as, in many ways, the "bottleneck" (if there is one) is how fast we can produce/consume data, rather than access it.
  • RossH Posts: 5,477
    edited 2014-05-05 16:07
    Heater. wrote: »
    In my mind OBEX is irrelevant to this discussion. Currently there is no P2 and there is no OBEX for it. Besides if the P2 were a wild success then most of the code people use with it will probably never see OBEX.

    Yay, I disagree with Heater! The world has returned to normal! :smile:

    There has to be an OBEX - or some equivalent - for the P2.

    In my view, the wealth of the OBEX is one of the key reasons the P1 achieved any success at all. The ability to "plug and play" by throwing objects together taken from the OBEX, without having to worry about any compatibility or timing issues, is a key reason the P1 was not dismissed as just an interesting toy chip with a slow, obscure and difficult to use programming language that required you to write most of the difficult stuff in assembly language (and the main property of the P1 that made that possible was the orthogonality and simplicity of its instruction set!)

    The P2 will face many of the same problems - it cannot compete on speed, and it cannot compete on price. But increasingly, it is looking like losing some of the key features the P1 did have - i.e. orthogonality and simplicity.

    Without an OBEX, seeded by Chip and others with objects that demonstrate it to advantage (as the P1 OBEX was) the P2 will struggle badly.

    Ross.
  • RossH Posts: 5,477
    edited 2014-05-05 16:11
    jmg wrote: »
    It is a simple priority encoder, which gives 15 in your example.
    (when disabled, it always loads 15)

    If COG 15 then disables its UsesHub boolean, the TopUsedCog changes to 9.
    (COG 15 can still run a timer etc, it just agrees to not use the HUB )
    If COG 9 disables its UsesHub boolean, TopUsedCog is still 15.

    A common usage would be to have a trigger COG as the upper one - it can get out of the way during the fastest burst work, and can re-appear to setup/result process.

    I'm sorry, jmg - I've really tried to understand this TopUsedCog concept, but (like others) I'm struggling. Perhaps an example?

    Ross.
  • jmg Posts: 15,175
    edited 2014-05-05 16:24
    RossH wrote: »
    I'm sorry, jmg - I've really tried to understand this TopUsedCog concept, but (like others) I'm struggling. Perhaps an example?

    That was an example, but I think some are over-thinking this.

    The core to this is nothing more than a simple Priority encoder
    http://en.wikipedia.org/wiki/Priority_encoder

    each COG feeds one line into that encoder, signaling I-am-Using-HUB
    The Binary 4 bit output, reloads the scan counter. It thus scans from 0..TopUsedCog.
    When disabled with the control flag, the reload value is always 15

    That's all there is to it, and that produces a live TopUsedCog value, 1 SysClk granular.
  • RossH Posts: 5,477
    edited 2014-05-05 16:26
    [QUOTE=mark
  • RossH Posts: 5,477
    edited 2014-05-05 16:34
    jmg wrote: »
    That was an example, but I think some are over-thinking this.

    The core to this is nothing more than a simple Priority encoder
    http://en.wikipedia.org/wiki/Priority_encoder

    each COG feeds one line into that encoder, signaling I-am-Using-HUB
    The Binary 4 bit output, reloads the scan counter. It thus scans from 0..TopUsedCog.

    That's all there is to it, and that produces a live TopUsedCog value, 1 SysClk granular.

    Ok. I think I understand ... but now that I do, I see that your scheme also seems to have the property that the performance of any cog is unpredictable unless you know (in advance) what will be running in every other cog.

    I think this is bad for a chip like the Propeller. Just for a start, how would you describe such behavior in each OBEX (or equivalent) object? Figuring out whether a set of OBEX objects could run together successfully would become very difficult. And every time you added a new object to your project, you would have to calculate it all over again.

    Ross.