Propeller II - Page 41 — Parallax Forums

Propeller II


Comments

  • Dave Hein Posts: 6,347
    edited 2012-08-23 12:39
    On the other hand, P2 contains a lot of specialized hardware and instructions. Even P1 has counters and video generators. A UART is nothing more than a shift register plus a little control logic. It's trivial compared to video generators and some of the specialized hardware in P2. So Chip may vow never to put a UART in any of his processors, but it pales in comparison to the other stuff that is in P1 and P2.
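
    To put a number on "trivial": a transmit-only soft UART on the P1 is only about a dozen instructions. A rough sketch in P1 PASM, along the lines of FullDuplexSerial's transmit loop (char, bits, time, txmask and bittime are placeholder registers assumed to be set up elsewhere):

        tx_byte       or      char, #$100          ' append the stop bit (bit 8 = 1)
                      shl     char, #1             ' append the start bit (bit 0 = 0)
                      mov     bits, #10            ' start + 8 data + stop
                      mov     time, cnt            ' schedule the first bit edge
                      add     time, bittime
        :bit          shr     char, #1 wc          ' shift the next bit into carry
                      muxc    outa, txmask         ' drive the TX pin from carry
                      waitcnt time, bittime        ' hold the pin for one bit period
                      djnz    bits, #:bit          ' ten bits, then done
        tx_byte_ret   ret                          ' return (invoked with call #tx_byte)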
  • jmg Posts: 15,148
    edited 2012-08-23 13:11
    Rayman wrote: »
    But, Prop2 will allow multiple threads to run in a single core.
    That should mean that UART, SPI, and I2C can all be done in one core.

    Is that multiple threads a confirmed silicon feature ?

    On peripherals, it is a matter of balance.
    The COGs can certainly do most I2C and UART apps, but there is a good case for simple silicon to do stuff software cannot.

    So a fast shift I/O interface, for example, would allow a faster SPI link than software bit-banging would.

    The Prop 2 specs & comments so far suggest there is a fast I/O shifter?
  • Cluso99 Posts: 18,069
    edited 2012-08-23 13:13
    The P2 hardware is done. No use even discussing it.

    Only the ROM is open for discussion.

    Most of you have missed the point, even though it gets repeated often enough... The SD boot in ROM will not be OS-aware... nothing about FAT or anything else is known/assumed by the ROM boot code. We just want to be able to boot from SD if there is no (or blank?) Flash.

    It is easy enough to write the MBR pointer using a P1 or P2. I have done it on a P1 already.

    Having a tiny loader (being referred to as a monitor) is nice, but KISS should be applied. Once again, use it to boot a better monitor, or to download a larger boot program or debugger. Keep it short and simple, and thereby minimise any risk of failure.
  • Cluso99 Posts: 18,069
    edited 2012-08-23 13:25
    I foresee a lot of commercial uses of the P2 where SD will be used. For these designs a Flash is not required, and engineers will think it stupid to require one - remember, we have no internal flash/EPROM in the P2. Part of the beauty of the P1 and P2 is that the prop is RAM based, so code can be loaded/unloaded. This should make for some interesting designs.

    As for hobbyist use, the cost of requiring a flash chip is almost irrelevant, so one will likely be present anyway.

    If you build a design that, say, sells 1 million units, and flash is required even though you have SD, this costs $500,000 in lost profits (roughly $0.50 per unit). It is not just the Flash chip, but PCB cost, assembly cost and inventory cost. The flash also has to be programmed.
  • Rayman Posts: 13,897
    edited 2012-08-23 13:32
    jmg wrote: »
    Is that multiple threads a confirmed silicon feature ?

    I thought it was... But, now that you mention it, maybe I'm not 100% on that...
    But, even if it weren't I still think you can do many things at once in a Prop2 cog...
  • jazzed Posts: 11,803
    edited 2012-08-23 13:38
    Cluso99 wrote: »
    If you build a design that, say, sells 1 million units, and flash is required even though you have SD, this costs $500,000 in lost profits (roughly $0.50 per unit). It is not just the Flash chip, but PCB cost, assembly cost and inventory cost. The flash also has to be programmed.

    You'll be able to blow a fuse to use it.
  • Clock Loop Posts: 2,069
    edited 2012-08-23 14:02
    jazzed wrote: »
    You'll be able to blow a fuse to use it.

    When you blow the fuse, and use it, you will "blow a fuse" because of how awesome it is, to use it.
  • Heater. Posts: 21,230
    edited 2012-08-23 14:17
    Rayman,
    I thought it [multiple threads] was... But, now that you mention it, maybe I'm not 100% on that...

    Back in the deep dark history of this thread the "tasksw" instruction for in-COG threads makes its debut:

    http://forums.parallax.com/showthread.php?141706-Propeller-II/page14
  • Mike Green Posts: 23,101
    edited 2012-08-23 14:21
    Like a lot of features in the P2, there is some hardware to make it easier to do multiple threads, but multithreading isn't built in.
  • Heater. Posts: 21,230
    edited 2012-08-23 14:29
    Except in post #361 where Chip has multithreading all figured out as a simple change to the tasksw mechanism:

    http://forums.parallax.com/showthread.php?141706-Propeller-II/page19

    Ken stepped in and demanded "no more hw changes". Not wanting to start a family feud I went all quiet on the subject, secretly hoping Chip had sneaked it in there.
  • Dave Hein Posts: 6,347
    edited 2012-08-23 14:32
    Heater. wrote: »
    Except in post #361 where Chip has multithreading all figured out as a simple change to the tasksw mechanism:

    http://forums.parallax.com/showthread.php?141706-Propeller-II/page19

    Ken stepped in and demanded "no more hw changes". Not wanting to start a family feud I went all quiet on the subject, secretly hoping Chip had sneaked it in there.
    I think he had second thoughts about it when I pointed out he was a few gates short of implementing an interrupt. :)
  • pjv Posts: 1,903
    edited 2012-08-23 15:21
    Hi All;

    I thought that some number of posts back Chip said the task switch instruction was simply a JMPRET.
    The (co-operative) multi-tasking can already be done very effectively on the Prop 1 using its JMPRET instruction.
    Perhaps he has added some state-save features to it for P2.
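
    As a reminder of how that looks on the P1, here is the bare bones of the JMPRET ping-pong (the FullDuplexSerial pattern); the labels and thread bodies are just placeholders:

        entry         mov     txcode, #transmit    ' seed the other thread's resume address

        receive       jmpret  rxcode, txcode       ' save our PC in rxcode, resume the other thread
                      ' ... a few instructions of receive work (poll the RX pin, shift bits in) ...
                      jmp     #receive

        transmit      jmpret  txcode, rxcode       ' save our PC in txcode, resume the receive thread
                      ' ... a few instructions of transmit work (fetch a byte, drive the TX pin) ...
                      jmp     #transmit

        rxcode        res     1
        txcode        res     1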

    Cheers,

    Peter (pjv)
  • Heater. Posts: 21,230
    edited 2012-08-23 15:45
    Simply a jmpret, yes, but using the new IND registers as operands along with auto-increment gives you a way to jump between eight threads easily.
    See page 14 of this thread for Chip's example threaded code with tasksw.
    There are 8 program counters and 8 sets of Z and C flags.

    Damn cool really, shame it's lost in the noise here.
  • pjv Posts: 1,903
    edited 2012-08-23 16:16
    Heater;

    Well, it looks like a lot of extra functionality has been added. I'm looking forward to a definitive description of all the instructions... so much new stuff to absorb!

    Cheers,

    Peter (pjv)
  • evanh Posts: 15,192
    edited 2012-08-23 18:44
    Dave Hein wrote: »
    I think he had second thoughts about it when I pointed out he was a few gates short of implementing an interrupt. :)

    Not at all. Chip later posted, just before crawling into bed for a week, that time slicing was a no-brainer change but had one caveat: a stall in one hardware thread stalls them all. - http://forums.parallax.com/showthread.php?141706-Propeller-II&p=1117398&viewfull=1#post1117398 This poses a difficulty for hub reads in particular, because even the best case still has a one-cycle stall on occasion.

    By default it is not visible to the code, ie: only a single thread per Cog until configured. I think slicing is worth having just to try out.
  • Dave Hein Posts: 6,347
    edited 2012-08-23 19:49
    The comment about having second thoughts was a joke. I did put a :) after it. However, after I suggested the interrupt Chip did respond as follows:
    cgracey wrote: »
    Too much! Head going to explode!
    Seriously though, it wouldn't take much more effort to implement interrupts if the cog is able to save states between task swaps. The ISR would just be another task. A cog with interrupt capability would be much more powerful than it is without that capability. If you don't need interrupts then don't use them. I can think of several applications that could take advantage of an interrupt capability. Of course, I'm not suggesting this for P2, but it might be worth considering for P2+.
  • potatohead Posts: 10,254
    edited 2012-08-23 21:40
    And there is the discussion right there. There is "powerful" on a few axes.

    One is raw execute speed, regardless of consistency. That's the Intel game. Just get it done quick, and let speed be the equalizer. Add a fan, if you need to.

    Another is parallelism. Concurrent execution happens in a lot of ways. Cores, sub-systems, like the math functions in some chips, and don't we have that in P2, where some ops are dispatched to be computed? CORDIC? Lots of features, and I've been slammed, and things are a blur... Anyway, that's another axis.

    Then there is code. Is the chip difficult to write for, or not, and is there significant reuse or not?

    On that note, "difficult to write for" is highly arbitrary too. People will have different skills, goals, etc... I find Props "easy" to write for in many cases, more difficult for others. This varies widely depending on experience and needs / preferences. IMHO, the reuse potential is there for nearly everyone though, and it's there because it's been maximized.

    IMHO, the trade-off we see in the Propeller design is all about parallelism and re-use. We've tossed about "deterministic", and I think it just means having a piece of code just work, and keeping potential conflicts to a minimum so that the re-use, "it just runs" use case is maximized. That's "powerful", just not in the raw MIPS way normally associated with the term.

    I believe that use case is why moving the ROM to lower addresses and extending the offsets was done, for example.

    "Zero Page" is a limited resource. (and isn't that kind of funny how such an old term just makes more sense than it maybe should?)

    Greedy programs might not just work with other greedy programs, and the product of that is more "powerful" as in faster programs, but those won't always play well with one another, because they cannot be written with the assumption of consistency when they run.

    Taking that resource and allocating it differently in terms of the extended offsets means increasing the reuse case, and the "keep it simple" case, and leveraging the parallelism case, with the cost of peak speed. Those programs can be written with the assumption of consistency when they run.

    It's not possible to have all cases at one time. Well, I've never seen it. Maybe it is, but it's not on the current Propeller design path. Other devices have taken the approach of maximizing speed at the expense of consistency, and they are faster, but more difficult to author and reuse.

    As it is right now, the JMPRET with the ability to task assist as implemented on P2 brings more power to the COG, in that task switching can happen in a leaner way now, but it doesn't do that at the expense of reuse in the way that the non-deterministic time-slice does. There is the matter of programs requiring time, which will prohibit some from being used in tandem with others, but aside from that they will operate this way easily. Powerful, just not in the raw speed / clock sense.

    This has one other benefit and that is code gets targeted for the processor in general, not fixating on one particular feature. More of the code will just work with more of the code, as opposed to having to sort out bodies of code that all assume best case on a variable case execution unit. That's the difference between deterministic and not, as I see it reflected in the overall design of the Propeller.

    Seems to me, that's not an uninformed choice, but a deliberate one, and a perfectly valid and worthy one. After all, there are plenty of offerings that maximize the other cases. And they've got their strong points. Why not have one that maximizes the deterministic case of reuse like we've seen so far? Sure is nice to go and grab code and mix it with other code, like we regularly do. And it's nice enough that the barrier for entry on doing it is pretty low too.

    Anyway, that's just another take on "much more powerful" --it's arbitrary, depending on use cases and goals, which aren't the same for everybody.
  • FredBlais Posts: 370
    edited 2012-08-23 22:18
    cgracey wrote: »
    We've got over a month before the synthesis guys will deliver the final GDSII block containing all the guts, so there may be time to add some extra ROM functionality towards the end.

    What happens after this step? Can a chip or shuttle run be made?

    ps: sorry to interrupt the discussion
  • evanh Posts: 15,192
    edited 2012-08-24 01:15
    Dave Hein wrote: »
    Seriously though, it wouldn't take much more effort to implement interrupts if the cog is able to save states between task swaps. The ISR would just be another task. A cog with interrupt capability would be much more powerful than it is without that capability

    Only a single IRQ (per Cog) would be useful because only the highest priority is deterministic. Below that can be done with tasks. Thinking about it, an interrupt is equivalent, except for space, to what I was originally asking for but cheaper to implement.

    It's a bit of a mind twister deciding on efficient use of processor cycles vs accurate and precise I/O. In hindsight, I think the allotted time slicing, ie: one that isn't dependent on an external trigger, is the better option for the Prop. All threads are deterministic this way.
  • evanh Posts: 15,192
    edited 2012-08-24 01:27
    potatohead:
    If I'm reading this right, just to summarize, you are okay with the Prop targeting determinism over throughput, right?


    EDIT: On that note, the Prop2 is without a doubt a lot better at throughput than the Prop1, and will run hotter as a result. But the Cogs are still just eight. This is why I've been so keen to allow some of that raw performance to be usable in a wider spread of soft devices and/or simultaneously using the mips for number crunching.

    Currently, the Prop1 does not use a Cog for both at the same time.


    EDIT2: There is a very good reason why the Prop architecture is determinism oriented - To make those soft devices fly!
  • Heater. Posts: 21,230
    edited 2012-08-24 02:40
    Timing determinism in auto-threaded code within a COG is NOT required.

    There are two issues that are raised here as negatives regarding automatic thread slicing in COGS which I think are irrelevant:

    1) Some instructions, eg HUB access, take longer than usual and stall the COG and all threads a while. This is said to be bad because it ruins timing determinism. I.E. cycle counting in one thread cannot be relied on because another thread can jitter it.

    2) Because of the above it is not possible to grab code (threads) from various places and combine them into a single COG and be sure that one will not affect the other. This is said to be bad because it won't just work as it does when we throw objects at COGs currently. "Greedy programs might not just work with other greedy programs" as Potatohead put it.

    I think both of these arguments are missing a point and the rejection of automatic thread slicing because of them is wrong. Basically the point is that you cannot do this on the current deterministic Prop, and thread slicing does not change anything in that respect.

    Let's take point 1). Generally a thread will do something like the following (sketched in code after the list):

    a) Sit in a loop polling for some event or condition (can't use WAITxx as it stalls everything).
    b) That loop does a task switch to allow other threads to run as long as the event does not happen.
    c) When the event occurs, exit the polling loop.
    d) Run through some instructions handling the event.
    e) Jump back to a) or enter some other polling loop.
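
    A sketch of that shape in P1-style PASM, with a JMPRET pair standing in for the task switch (thispc, otherpc and pinmask are placeholder names):

        poll          jmpret  thispc, otherpc      ' b) yield so the other thread gets a turn
                      test    pinmask, ina  wz     ' a) poll: has our event happened yet?
              if_z    jmp     #poll                ' not yet: keep polling and yielding
                                                   ' c) event occurred: fall through
                      ' d) handle the event: a run of straight-line instructions,
                      '    during which the other thread gets no time at all
                      jmp     #poll                ' e) back to polling for the next event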

    Now consider this. Without automatic thread slicing, as now, when a thread is in stage d) it totally blocks all other threads for a long time. Those other threads will: 1) have a very long latency in responding to their events because they are blocked by a working thread; 2) have totally non-deterministic behaviour because they have no idea when other threads will be triggered and block them.

    Ergo there is no determinism in threaded cog code as done with JMPRET now or in the Prop II with TASKSW.

    So what does auto thread slicing buy you then?

    Well, to reduce those latencies mentioned above one can sprinkle more JMPRETs around the code; ultimately every other instruction would be a JMPRET to get the lowest latency. In Prop II every other instruction would be a TASKSW.

    BUT that has now halved the available execution speed of the threads and doubled the size of the code !!!
    And still you have not fixed the determinism problem.

    Also note that any idea of "mixing and matching" code for different threads and having it "just work" is out as you have to check how the threads are structured and how the latencies add up.

    Bottom line is I would like to see auto thread scheduling, as Chip has in mind, as it:

    1) Minimizes latency in response to events.
    2) Reduces the code size, very important when you are combining multiple functionality into one COG.
    3) Increases performance.
    4) Has no effect on determinism in threaded code.
    5) As we have more raw speed available it makes more sense to combine small work tasks into a COG.
    6) Seems to be a very small hw change according to Chip.

    P.S. The sacred "determinism" is still there when you need it. Just dedicate a COG to the task. This is no different than what we have now with the supposedly fully deterministic Prop I.
    There is no issue with "greedy code" upsetting things because this is all self-contained in your object and your object's COG. Your object will work fine with other people's objects using different COGs. No timing interactions. Just as it does now.



  • potatohead Posts: 10,254
    edited 2012-08-24 02:44
    Yes.

    --> Just saw the other response and have no desire at all to enter into pages on threading. What's in there is there; any changes to HW are not something I would advocate at this point, for reasons already stated.

    No offense intended Heater. I'm perfectly OK with what is currently implemented, and articulated why. However it has gone, or will go, we will see it play out, and I much prefer that discussion in the context of P3.
  • Heater. Posts: 21,230
    edited 2012-08-24 02:52
    Potatohead,
    ...hinged on knowing what a given chunk of code would do when operating with other code.
    ...runs in conflict with the idea of doing it in software, simply because doing it in software is most effective when one can mix 'n match software!

    This issue of code interoperating with other code or mix'n'match software does not apply when you are talking about threads within a COG. Everything running in a COG is part of your object. Your object will coexist with other people's objects just as they do now, with no issue about hogging time or upsetting their determinism. This is true of the P1 now and would be true of the P2 with auto thread slicing.

    Hope you did not miss my post above where I introduce that idea.


  • potatohead Posts: 10,254
    edited 2012-08-24 03:05
    (I didn't -- one object per COG, thinking about it, makes a fair amount of sense, given none of that actually disturbs another COG in any way.)

    Smile... So OK, I've gotta get to bed, but are you suggesting then that we draw the line at the COG? One can run irregularly, so long as it doesn't disturb the others? I'm not sure that was ever discussed in the early P2 threading discussions, which I believe fixated on the HUB access, not so much how one COG might run as opposed to another one. That could have been why it piqued Chip's interest. :)

    (still don't want to advocate hardware changes though, kind of done on that front.)
  • Clock Loop Posts: 2,069
    edited 2012-08-24 04:08
    cgracey wrote: »
    Too much! Head going to explode!



    Chip is trying to fill us a P2 raft, then all of us propeller kids find out, run over, and jump on, then his head explodes.
  • Heater. Posts: 21,230
    edited 2012-08-24 04:16
    Potatohead,
    ...are you suggesting then that we draw the line at the COG? One can run irregularly, so long as it doesn't disturb the others?

    Yep. We might call what a COG does a "process". Processes are started by COGNEW from Spin objects. (A Spin object may have zero or many processes, written in PASM or Spin.) Such objects can be mixed and matched with others, safe in the knowledge that the timing is always preserved. That is what we do now.

    Within a COG, though, we sometimes have these lightweight threads which will jostle with each other for time. Currently we have coroutines done with JMPRET, as in FullDuplexSerial. In Prop II we will have the possibility of more than two threads switched around with TASKSW. The same jostling applies, no harm done.

    With Chip's little hardware tweak we would have auto thread slicing, and still the same jostling, no harm done, but greater performance, less latency and smaller code. A win all round.
  • Rayman Posts: 13,897
    edited 2012-08-24 15:10
    Ok, great, I remembered right and there is better multitasking in Prop 2..
    Don't want Batang using any Lincoln quotes on me...
  • Batang Posts: 234
    edited 2012-08-24 23:17
    @ Rayman :)
  • potatohead Posts: 10,254
    edited 2012-08-25 09:18
    @Heater

    OK, given that it can be triggered, I think I am close to agreeing with you Heater. There is a really slippery slope there though.

    On a COG that isn't slicing, PASM works like it does for us now. One can write it and know something specific is going to happen, minus the few edge cases we've found with things like waitvid. And those are just an artifact of the parallel operation of them and chip internals. The core idea for the majority case is that PASM does what it does consistently, regardless of what other COGs are doing. All bets are off, of course, when the programmer introduces dependencies. No avoiding that.

    On a COG that is slicing, that PASM could vary depending on what other PASM is doing; the whole would still be predictable, but with more complexity in that prediction.

    There then would be COG code, and COG slice code. Maybe "slice safe" code is another way to put that. Mix 'n match does get less granular, but then again, bigger clumps of things working together are possible, freeing COGS. I like this, and I suspect Chip does too.

    With JMPRET throughput is less, and the programmer is forced to sort out the dynamics. With the slicing, throughput is higher, and the programmer can ignore some dynamics, but is forced to account for some others, namely the stall.

    Personally, I do still agree with what I wrote up-thread about not including this thing half-baked. Had it been in the discussion earlier, it very likely would have gotten sorted to a state that doesn't include the stalling of all threads and such being discussed right now.

    Interesting in any case! The purist in me kind of revolts on that whole thing, but the realist recognizes the point as a valid one. We will soon find out where Chip got to on the whole thing. Good times lie ahead!
  • Heater. Posts: 21,230
    edited 2012-08-26 03:04
    Potatohead,

    I really like the automatic thread slicing idea and, as Chip said, he wished he had thought of it earlier.
    BUT I have to agree with you, Ken and others. It's a bit late to be making such hardware changes. Apart from pushing out the Prop II end date and cranking up costs, there is another issue that worries me.

    Would such a mechanism, thrown in at this late hour, be a solution one would be happy to live with long term and perhaps have to perpetuate into Prop III and up?
    Does it really mesh with the existing Prop architecture and philosophy of simplicity and regularity?
    Or would it become a millstone when we might realize there is an even better, cleaner way to do it?

    Think about all the excess baggage in the Intel x86 architecture/instruction set that has had to be carried through the generations. It's almost never used but has to be there just because someone thought it was a good idea at the time.