Propeller II update - BLOG - Page 196 — Parallax Forums



Comments

  • ctwardell Posts: 1,716
    edited 2014-03-05 07:31
    Seairth wrote: »
    So this got me thinking about alternative approaches. In particular, I started wondering what it would take to move some of these parallel capabilities off-P2. For instance, Parallax currently sells the uM-FPU, which provides floating-point and fixed-point math routines. Now, what if you were to do the same sort of thing to the P2, moving all of the CORDIC, big-multiplier, etc to a separate chip that's accessible via SerDes? Suppose a dedicated math chip were developed that contained, say, 4 of each function. And further, suppose that the internal clock could run upwards of 4 times (or more?) as fast as the P2 itself. I could see the following pros/cons:
    • PRO: Frees up (significant?) space for other features.
    • PRO: Possibly allow for faster clock speed in P2?
    • PRO: External chip could be revised (e.g. to add FP math, FFT, etc.) without having to release a new version of P2.
    • PRO: External chip could actually be sold for use with other MCUs (e.g. add advanced maths to arduino, for instance)
    • CON: Off-chip (even if running at higher clock speed) would be slower than on-chip.
    • CON: Increases complexity and code
    • CON: Increases overall price (I'm assuming that the P2 wouldn't be any cheaper) for those that need the functionality.

    NOTE: I am *NOT* suggesting that this should be done. Unless it makes a lot of sense. In which case, I am. :)

    Let me second the *NOT* suggesting this...

    C.W.
  • mindrobots Posts: 6,506
    edited 2014-03-05 07:36
    Kerry S wrote: »
    Perhaps we should call it "Self Supervised Tasking" and end up with an SST Propeller Cog

    Non-Interrupt Driven Pre-Emptive Multi-Threading

    NIDPEMT - now there's an instruction mnemonic!!

    Remember, multi-tasking has a 5-instruction scheduler:
                    JMP #task0
                    JMP #task1
                    JMP #task2
                    JMP #task3
    task0
                    SETTASK #%............
    

    Multi-Threading takes a village! :smile:
  • Seairth Posts: 2,474
    edited 2014-03-05 09:37
    moving math off chip is a bad idea

    - even with 66 MHz SPI, setting up a 32x32 MUL and reading back the result will take more than 128 * 3 = 384 clock cycles, vs. 16 on-chip.

    VERY BAD IDEA.

    I did point out that it was a CON. :)
    ctwardell wrote: »
    Let me second the *NOT* suggesting this...

    C.W.

    I agree, actually. Really, that post was more of a "thinking out loud" sort of thing. It does make me wonder, however, if you could move those into the hub and have just 2 of each. I realize that there will always be use cases where you could have all cogs using the same function (32x32 multiplier, for instance), but I'm guessing that the more typical usage will be limited to only a few cogs at a time. This would certainly free up space in the cogs. I still don't know if it would improve clock speeds at all.

    Again, just thinking out loud. I'm not expecting (or necessarily encouraging) such changes...
  • potatohead Posts: 10,260
    edited 2014-03-05 09:43
    Fair.

    Personally, I think having them per COG, given we can keep it all in the area we have, is a killer differentiator! Having the nice math is great for a lot of cases where tasking really isn't used too much. Think of the big program calling a co-processor, etc... A math COG will rock it just fine.

    On the other hand, somebody is doing some interactive things like Chip has mentioned. In that scenario, lots of math on board means having something nobody else can touch, and it can be done "the Propeller way"
  • pedward Posts: 1,642
    edited 2014-03-05 10:10
    <RANT>

    I'm personally against this crusade to make the P2 everything to everyone.

    I told Chip that I didn't think Hubex was a good idea to pursue because it would take 4 months to code and test and get right, and there were a lot of complexities to making it work. Along the way he had to rip out stuff to make room and those changes were probably good changes to have made, totally independent of the need that drove them.

    I said that going for pre-emptive tasking was going a bit too far; there is a lot to do to make it work right, such as swapping all of the internal registers to a shadow register set (like the TSS on the i386).

    Now Heater thinks we should remove the heavy-math chunk to satisfy the never-ending feeping creaturism that is going on. I say forget that!

    The P2 has grown from something elegant to something that is trying to satisfy 15 different masters. Chip is happy to oblige many of these requests because he's all about making the neatest things. The problem I see is that you are quickly approaching features that can never be used, except by the 5 people that requested them.

    4 tasks per COG and 8 cogs is 32 tasks, this is a microcontroller for Pete's sake, not an ARM v10 chip designed to compete with every other ARM chip.

    The *whole* point of the original design of the Propeller was deterministic execution. With 32 tasks it's possible, but difficult to get deterministic execution. With HUBEX and 32 tasks, that quickly becomes infeasible and strips much of the inherent performance of the P2.

    Then you throw PREEMPTIVE MULTITASKING onto the heap and you've just made it completely non-deterministic. This thing is trying to look more like an XMOS every day.

    The leap between P1 and P2 is growing from an evolutionary improvement to a complete massive re-learning.

    Can we stop adding features that make the chip so damn hard to use, and start adding things that REALLY need to be added, like SERDES and USB instructions?

    The C API already has POSIX pthreads, why do you need a hardware analogue that has the same level of determinism?

    </RANT>
  • Dave Hein Posts: 6,347
    edited 2014-03-05 10:33
    pedward, I agree with most of the things you said. However, I think that hubex is an excellent feature, and it makes P2 a much more flexible device. Not all applications require deterministic timing, and those that do just need to execute within the cog's memory. I'm also in favor of hub-slot sharing, but only if it doesn't require ripping out other features.

    I agree with you on the preemptive multitasking thing. I think it is taking up WAAAAY too many development resources that should be applied to other things. I don't think preemptive multitasking is needed for P2, and if we're going to do it we should do it the right way and implement interrupts. The current approach almost implements the functionality of an interrupt, but in a crippled way that is not as good as an interrupt.
  • Bill Henning Posts: 6,445
    edited 2014-03-05 10:41
    pedward.

    Your assertion that determinism has been lost is demonstrably false.

    NONE of the determinism has been lost in cog mode.


    1) Hubexec gives significant capabilities that were not there before. It replaces LMM, which was at best 1/4 the speed of hubexec. It makes large code incredibly easier. It is not as deterministic as cog-only mode, but it is NOT replacing cog-only.

    2) Pre-emptive threading is useful, but not necessary. It maps nicely onto pthreads, without having to be dependent on cooperative multitasking. It is meant for large user level threads, and not meant to be fine-grained deterministic.

    3) Threading does not need an incredible amount of support, as long as purists don't insist on everything being auto-magic for new users. The current posix threads are cooperative.

    4) Ripping out MUL/DIV/CORDIC was not a good suggestion, and I doubt it will happen.

    Basically, nothing has been lost, and a lot has been gained.

    Due precisely to the addition of hardware tasks and hubexec it will be possible to efficiently address many more problem domains, where otherwise the P2 could not compete.

    Basically, as it stands now, P2 will have a hierarchy of possible implementations, depending on the nature and deterministic requirement of the problem being addressed.

    1) COG-ONLY: usage model - extremely high speed drivers (HD, MHz signal generation etc)

    a) without tasking - like P1 but far more powerful, fully deterministic (think 5ns grain scale)
    b) cog-only mode with 2-4 tasks - can be considered to be the same as 2-4 "baby cogs", timing can still be fully deterministic

    2) HUB-EXEC: usage model - HMI/business logic that fits in the hub

    a) without tasking - can be deterministic but not to as fine a grain (think 100ns grain), 4x+ faster than LMM could be, smaller code than LMM, allows easily writing and running very large programs - cog code size limits are gone
    b) with tasking - can be deterministic to a rougher grain (think microsecond resolution), 4x+ faster than LMM could be, smaller code than LMM, allows easily writing and running very large programs - cog code size limits are gone

    3) THREADS: usage model - HMI/business logic that can use a lot of threads for readability, ease of writing, and NOT use up the hard real time task/cog resources

    a) multiple low determinism requirement (think in milliseconds) threads for user level processes
    b) allows writing easy to read multi-threaded code without using up all potential hardware tasks for code that needs to respond to slow events

    This basically allows P2 to address a pyramid from ns-scale timing requirements to loose 10's-of-ms scale problems - without wasting a whole cog on a slow problem.

    Please note:

    - there is no requirement anywhere to use threads, hubexec, or even tasking if one does not want to, or if someone has a philosophical objection
    - hardware tasks save cogs due to the ability of easily combining up to four drivers in one cog
    - hubexec allows large code without LMM or byte code interpreters at a far greater speed
    - even if threads are used, it is extremely likely that only one cog would use them - leaving seven to be fully deterministic at any required level

    I fully understand the frustration at how long this is taking, but we are all ending up with a far more capable chip - that can address many more problem domains, and as such, improve the odds of it succeeding.

    Note a very large chunk of the delay was due to two failed shuttle runs, and testing in Dec.2012 prevented another failed shuttle run. Chip has found a couple of bugs during the recent expansion that would likely have cost an extra run.

    What I would like to see stop is the opposition to simple solutions for raised issues merely because the proposed simple solution is not academically/philosophically perfect.

    Technical arguments are good, they point out potential issues.

    Arguments that amount to "I won't use it, so don't put it in", or "it does not match p1 philosophy" are a waste of everyone's time.
  • mindrobots Posts: 6,506
    edited 2014-03-05 11:06
    I don't see it ending up being a hard chip to use if you want to use it as a super P1 - 8 deterministic COGs with a lot more memory and a lot more I/O and a lot more speed. That's what everyone started out basically wanting. Add to that the nifty extra I/O features, the AUXRAM, the built in math, super video, etc.....all things that are implemented in the Propeller style as evolution of the original design. This stands as an amazing chip just on those merits.

    Now the fun starts, depending on what you want to do.....and never forget, there ARE 8 cogs, just because you use HUBEXEC and/or multi-tasking and/or multi-threading on one or two cogs, you still have the remaining cogs as fully deterministic P1 style cogs! That's incredible!

    For using it like a P1, it will just take learning some new PASM mnemonics or some new Spin keywords....OUTA, DIRA, INA were gone from early on, don't blame that on revolution.

    When you are ready to try some new features, they can actually be learned and played with in small chunks...at no point do you have to swallow the elephant whole! Documentation should probably be structured this way: P1 to P2 fundamentals and then start with the features.

    You can add HUBEXEC if you need a big program without worrying about much else. Basically, it's a few PASM directives and a new way to specify addresses and maybe a handful of new rules.

    Want to try a multi-tasking cog, ok, it can stand by itself. Read the section on the SETTASK instruction and you're pretty much good to go.

    I was against a lot of this at first, now as I've thought about it and tried to play with it, I'm thinking it is all good if done simply and elegantly and that's where I trust Chip in the final decisions.

    What you now see is a microcontroller that is a hybrid and I think a workable hybrid between what we love in the Propeller and what we wish we had from other microcontrollers like a large, flat memory space if needed....yes, if needed!! And all of this across 8 cogs in some mix to fit your needs.

    Mix and match at its best.......it's like 8 Arduinos in one chip! An Octduino!!!!! :lol: (ok, I'm getting carried away.)
  • pedward Posts: 1,642
    edited 2014-03-05 11:14
    I agree that HUBEX was a net gain, but I felt that multiplying it to 4 tasks was a loss, because it takes such a memory bandwidth hit due to cache misses, etc. I agree that HUBEX 1 thread is a great solution and I discussed at short length with Chip on how a 2nd stage bootloader could implement encryption with this.

    I think that PREEMPT is a BAD IDEA(tm) and should remain theoretical or FPGA-only, and shouldn't make it to silicon. That's my opinion.

    Here's a little detail on the 2nd stage bootloader idea:
    • 1st stage 512 long bootloader is loaded by ROM and authenticated
    • 1st stage then reads all of system memory from SPI flash, performs HMAC on data, so it's both signed and checksummed to protect against attacks and corruption
    • 1st stage stores the 128bit key at 3FFE7 then jumps via HUBEXEC to location stored at 3FFFB, this is the address to the fragment of bootloader code that implements AES-128 decrypt in memory
    • 2nd stage AES-128 decrypts all of HUB memory up to the address specified at 3FFFB, which means it doesn't decrypt itself.
    • 2nd stage bootloader erases the AES-128 key then jumps to address stored at location 3FFF7, which will be the start of the user code.

    This process allows you to have a binary BLOB that is stored in the HEAP space of the user program, which contains the decrypting bootloader. If you choose not to use decryption, the authentication chain is maintained with an HMAC, so you simply need a stub to erase the key and jump to the start of your program. If your chip is unlocked, the key is a bunch of zeroes, so for unlocked chips you just need to put the start address of your code at 3FFFB. The bootloader is thrown away at user code start and just becomes part of the HEAP memory of the program.

    The SHA-256 HASH HMAC should be at EB0 in your binary, the 1st stage bootloader would just assume to read a whole 256KB from memory.

    So, for data structure layout:

    0xE80 - SHA-256 HMAC of memory contents from EA0-3FFFF
    0x3FFD7 - AES-128 Initialization Vector
    0x3FFE7 - AES-128 key
    0x3FFF7 - Location of user code start address
    0x3FFFB - Location of 2nd stage bootloader start address
  • mindrobots Posts: 6,506
    edited 2014-03-05 11:24
    pedward wrote: »

    I think that PREEMPT is a BAD IDEA(tm) and should remain theoretical or FPGA-only, and shouldn't make it to silicon. That's my opinion.

    I'm just curious why is PREEMPT a BAD IDEA(tm)? What does it take away if you don't use it?

    I've never really done anything with the video on the P1 but it being there doesn't really detract from my usage.

    Is it going to be hard to test? Probably. Was it maybe a "bolt on" more than standard equipment? Can't answer that; only Chip has a feel for how well its features blended into the sausage. Did it take features away? I don't think so. Did it delay a shuttle launch? I don't think so. Is it taking any user experience away from people that won't use it? Again, I don't think so.

    I'm not being contentious, I'm just wondering if there are things that haven't been considered about PREEMPT.
  • Bill Henning Posts: 6,445
    edited 2014-03-05 11:32
    pedward,

    Would you believe we agree on a lot of things?

    I plan to pack 4 drivers into cogs using hardware tasking - I think that is a huge win.

    Personally, I only plan to use one hubexec task in a cog precisely due to caching issues (we'd need at least four lines of dcache to get similar performance using four hubexec tasks) unless the code only needs very very loose determinism.

    Now I am for software threads precisely because of the caching limitations of using more than one hubexec in a cog.

    P2 is more than powerful enough to run TCP/IP and multiple USB endpoints.

    I don't want to chew up a cog (running one hubexec thread) waiting on a socket or an endpoint.

    I want to be able to handle many sockets at once, and I don't want to have to sprinkle yield()'s in my code for the far less deterministic cooperative multi-tasking.

    I have real-world uses for handling many sockets at once, and for many USB endpoints. Threads will make writing the code far easier.

    I can see having:

    cog 0 - hdmi 1080p display
    cog 1 - sdram driver/handler
    cog 2 - one hubexec fast code task (graphics handling, processing ADC/DAC data etc)
    cog 3 - usb low level code
    cog 4 - user cog, running multiple hubexec threads to handle sockets and endpoints (scheduler at 1/16 cycles, hubexec task at 15/16)
    cog 5 -
    cog 6 -
    cog 7 -

    Leaving three cogs available for custom needs.

    Note without threads, this usage case would likely not even fit.

    From a quick glance, I think I like your second stage loader. Will mull on it more.
  • pedward Posts: 1,642
    edited 2014-03-05 11:48
    mindrobots wrote: »
    I'm just curious why is PREEMPT a BAD IDEA(tm)? What does it take away if you don't use it?

    I've never really done anything with the video on the P1 but it being there doesn't really detract from my usage.

    Is it going to be hard to test? Probably. Was it maybe a "bolt on" more than standard equipment? Can't answer that; only Chip has a feel for how well its features blended into the sausage. Did it take features away? I don't think so. Did it delay a shuttle launch? I don't think so. Is it taking any user experience away from people that won't use it? Again, I don't think so.

    I'm not being contentious, I'm just wondering if there are things that haven't been considered about PREEMPT.

    The PREEMPT isn't multi-threading, your code isn't running simultaneously. It is a task switching approach.

    The i386 was designed from the start to have task switching, hence the TSS and instructions for switching tasks. It too was just doing task switching, but it was elegant and designed in from go.

    Chip is his own task master, but he does things because they "Can be done" sometimes, not because it was part of an elegant design.

    I'd much rather see effort put towards implementing verilog to do most of the SDRAM babysitting, instead of requiring code to implement it. In this way I'd rather see SDRAM implemented as an extension to the HUB address space.
  • Bill Henning Posts: 6,445
    edited 2014-03-05 11:54
    pedward wrote: »
    The PREEMPT isn't multi-threading, your code isn't running simultaneously. It is a task switching approach.

    I know, but the proposed threading is perfect for waiting on a lot of sockets/endpoints for my applications - there is no need to simultaneously wait on tens of milliseconds events :)
    pedward wrote: »
    The i386 was designed from the start to have task switching, hence the TSS and instructions for switching tasks. It too was just doing task switching, but it was elegant and designed in from go.

    I like elegance as much as the next guy... but I am of the opinion that one less-than-elegant solution in hand is worth far more than several elegant solutions in the bush...

    i.e.

    The threading being discussed will allow me to do what I need to do, without tying up many cogs, without the issues with cooperative threads. I can live with less elegance - until we are talking P3.

    But I don't want to have to bolt an ARM chip to the P2 to do these kinds of apps, therefore I am for threading (at a minimal time/gate cost for P2, I will worry about elegance for P3).
    pedward wrote: »
    Chip is his own task master, but he does things because they "Can be done" sometimes, not because it was part of an elegant design.

    I'd much rather see effort put towards implementing verilog to do most of the SDRAM babysitting, instead of requiring code to implement it. In this way I'd rather see SDRAM implemented as an extension to the HUB address space.

    Now that is a P3 issue.

    Off the top of my head, that is a very difficult problem. Due to the sdram timing/setup requirements, I don't believe 8 cycles could be guaranteed even for a single WORD read/write, and an 8 long WIDE in 8 cycles is flat out impossible.

    For a P3, I'd like to see RDRAM / WRRAM that worked just like the hub instructions in {BYTE|WORD|LONG|QUAD|WIDE} wide sizes, but that is a discussion for another day
  • rjo__ Posts: 2,114
    edited 2014-03-05 11:57
    It hasn't been that long since Chip asked everyone: "do you want more cogs or more RAM?" But it does seem like ages. The nice thing about this course is that the logic of the design is being thoroughly discussed as it is being implemented. When we get to testing, the flood of questions will further the discussion. I think the complexity of the design will actually encourage talented programmers to become interested. Designing for the P1 is often so simple that it would make a talented guy ask: "geez anyone could do that… so why should I?" There is enough here already to separate the wheat from the chaff… so to speak.

    What I like most about the P1 is that I can look at just about anyone's code and (with a modicum of effort) understand what it is doing.
    Without a really good "help" system, that probably isn't going to happen with the P2. I really hope that with the volunteer documentation system, which is now on hold (until the P2 is stable), a real effort will be made to integrate that program with the new Propeller IDE project.

    Thanks

    Rich
  • User Name Posts: 1,451
    edited 2014-03-05 12:01
    Arguments that amount to "I won't use it, so don't put it in", or "it does not match p1 philosophy" are a waste of everyone's time.

    +1

    As always, I'm delighted that Chip is the ultimate arbiter. I trust and admire his judgement.
  • jmg Posts: 15,155
    edited 2014-03-05 12:11
    Cluso99 wrote: »
    I meant in software! Perhaps I did not word it well but I think Chip understood what I meant.

    In SW? - maybe as an interim patch, that's all you can do.

    However, that costs a memory location, needs more management, and is not atomic.

    Because there is a window between the loading of SETTASK, and the loading of I_hope_this_is_a_Copy_Of_setTask - in that window, another thread can change settask, update the agreed-on-common-location-for-copy, and oops, that value is replaced on Thread restore with a value that is NOT the real SETTASK.

    Being able to simply read a Setup register value is an operational model common across a wide range of microcontrollers.
  • mindrobots Posts: 6,506
    edited 2014-03-05 12:29
    pedward wrote: »
    The PREEMPT isn't multi-threading, your code isn't running simultaneously. It is a task switching approach.

    Technically, multi-threading never gives you simultaneous run time - that's only achieved through multiple cores or processors which we've had since day one of the Propeller.

    Multi-threading (or multi-tasking) gives you the ability to maximize resources on one or more physical processors (or cores).

    Earlier in this thread, the distinction was made between tasks (splitting a core into two, three or four separate execution units with the time slice allocation controlled by the task mask) and threads (the software construct that runs on top of a task and can be pre-empted by a scheduler task)

    Tasking is basically a hardware construct and, as I mentioned earlier, 5 instructions and you are multi-tasking on a cog. This can give you close to deterministic operations of those four mini-cogs. You can play with the timeslices all you want but you can never go beyond 4 execution units. Unless a task knocks itself out of the mix by issuing a new SETTASK or another task knocks a neighbor task out via a SETTASK, the whole cog just runs along according to its task mask. This will be a great way to implement HMI features in a single cog, since those humans can be pretty slow.

    Threading is built on top of tasking, mainly because you need a master task, the scheduler, and one, two, or three worker tasks - probably best done with only one worker task. Except for a little support from the hardware, threading is a software concept and has all different directions it can go in. If you want to use it, you will need a scheduler to manage the threads. There will need to be agreements (just like with tasks) about what can and can't be done within a thread and how certain things should be done. The scheduler becomes a small operating system which, along with being really fun stuff to play with, will probably have some practical applications as folks figure out the P2.

    Underneath it all, you still have 8 cogs of multi-processing - true simultaneous execution of code.

    I'm seeing it more as a win-win-win.
  • mindrobots Posts: 6,506
    edited 2014-03-05 12:42
    jmg wrote: »
    ...another thread can change settask, update the agreed-on-common-location-for-copy, and oops, that value is replaced on Thread restore with a value that is NOT the real SETTASK.

    Because there is no distinction between privileged and non-privileged instructions and because there is no memory protection and because the thread code will not be outside of the control of the application developers, there needs to be gentlemen's agreements in place as to what a thread can do. Threads can't issue a SETTASK is the biggest rule - to do so usurps the power of the scheduler. Threads can't just go away, they need to terminate with the scheduler. Threads may not just be able to grab resources (now we're getting into BIG OS territory). There are probably other things that need to be in the agreement.

    With all this said, the scheduler task is in control of the task mask and therefore, always knows what it is and generally knows where it is either in the user task or the scheduler task (if you complicate the model with multiple users tasks, this becomes harder to do but that's probably a bad model). If you have a scheduler task, a user task and two dedicated worker tasks, it is also more complicated but not impossible in software. I don't think it really needs any hardware support for reading the active task register.
  • jmg Posts: 15,155
    edited 2014-03-05 12:59
    mindrobots wrote: »
    Technically, multi-threading never gives you simultaneous run time - that's only achieved through multiple cores or processors which we've had since day one of the Propeller.

    Multi-threading (or multi-tasking) gives you the ability to maximize resources on one or more physical processors (or cores).

    I think a terminology cleanup is needed, now that even another layer of Full Software swap of a running task is being discussed.
    TASK is getting too confusing, as it is used in too many places.

    8 COGS can run truly in parallel, and within a COG, 4 way time slices can be allocated, currently with an N:16 granularity.
    Unless the Time-Mapping register is changed (or trumped), that time-slot pattern is highly deterministic.

    The lack of across-slice impact of code is a very significant P2 feature.
    Once you allocate a time-map, and debug a Thread, it should not change as other Threads use free time-slots.

    As most other systems use TASK as a software swapped entity, I think that should be reserved for the top layer of SW.
    That means the hardware 4-way multiplex of time needs another name.

    Threads or Slices or Slots or vCore for Virtual Core(s) or ??
    TimeSlices/Timeslots is how it is done, but a Virtual Core is what results, and what the SW 'runs on'.

    The register that maps the 4:1 time slots would be called vCoreMap.
  • Bill Henning Posts: 6,445
    edited 2014-03-05 13:05
    (funny mode: on)

    I know!

    We rename cogs as "tiles", tasks as "cores", and leave threads be...

    (funny mode: off)

    (serious mode: on)


    I really don't think anyone finds the following confusing, except when someone mixes up task/thread:

    1. cog (totally deterministic)
    2. task (one of four hardware tasks, very deterministic)
    3. thread (one of many threads of execution in a task; what is determinism?)

    The cog/task/thread terminology works, it is simple, clear, and no need to muck it up.

    Threads also correspond to software threads.
  • jmg Posts: 15,155
    edited 2014-03-05 13:06
    mindrobots wrote: »
    . I don't think it really needs any hardware support for reading the active task register.

    Really? Not even for debug?

    Debug is exactly when you are trying to find who/where is breaking that "gentlemen's agreement" - that extra layer of paper rules you have imposed on developers...
  • jmg Posts: 15,155
    edited 2014-03-05 13:09

    I really don't think anyone finds the following confusing, except when someone mixes up task/thread:

    1. cog (totally deterministic)
    2. task (one of four hardware tasks, very deterministic)
    3. thread (one of many threads of execution in a task; what is determinism?)

    The cog/task/thread terminology works, it is simple, clear, and no need to muck it up.

    Threads also correspond to software threads.

    That's fine, but those coming from other areas of training will be used to using TASK differently.

    Local morphs of what is common industry usage need to be done with care.

    (notice the many postings already around preemptive multi-tasking )
  • mindrobots Posts: 6,506
    edited 2014-03-05 13:13
    It would need to be:

    COGS beget VCOGS (via a SETTASK which should be renamed SETVCOG to use the VCOGMAP) beget threads which are total software constructs and only use the PREEMPT instruction to talk back to the scheduler.

    giving:

    Totally Deterministic, Mostly Deterministic and Non-Deterministic

    Now, let's talk about moving executing HUBEXEC tasks between COGS and COG and VCOG affinity ............ JUST KIDDING!!!!
  • mindrobots Posts: 6,506
    edited 2014-03-05 13:18
    jmg wrote: »
    Really? Not even for debug?

    Debug is exactly when you are trying to find who/where is breaking that "gentlemen's agreement" - that extra layer of paper rules you have imposed on developers...

    Does Bill's new instruction here handle your debug concerns? (you have to excuse me, I'm an old school debugger, we had blinking lights and maybe a terminal and we liked it!!)

    I'm sorry about the paper rules for developers and engineers, I thought that was just part of the job. Your protocol looks like this. The processor does this when you do this. If you do this, others can't do that. I've never seen it all automatic.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-05 13:18
    LOL... I know, I've made the multi-tasking mistake in posting, even the task/thread confusion.

    I'd be OK with cog/vcog/thread, but I am strongly against calling what now are tasks threads.

    Frankly, due to the history of the development of the P2, I prefer cog/task/thread, and I really don't think anyone new to the Prop would have trouble with the convention.

    I strongly suspect that tasks will mostly be used for packing multiple drivers into a cog, with user level threads providing many threads for HMI/business logic.
    jmg wrote: »
    That's fine, but those coming from other areas of training will be used to using TASK differently.

    Local morphs of common industry usage need to be done with care.

    (notice the many postings already around preemptive multi-tasking)
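    For readers following the task/thread distinction here: the hardware tasks being discussed share the cog's pipeline via a time-slot map loaded by SETTASK — per forum-era P2 documentation, a 32-bit register of 16 two-bit slots cycled round-robin, each slot naming which of the four tasks owns that pipeline slot. A minimal C model of that slot arithmetic (the 16x2-bit layout is an assumption from that documentation, not a datasheet fact):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Model of a SETTASK-style time-slot map: 16 slots of 2 bits each,
     * cycled round-robin; slot k names which of the four hardware tasks
     * owns pipeline slot k. Layout per forum-era P2 docs -- treat it as
     * an assumption. */
    static int task_for_cycle(uint32_t map, unsigned cycle)
    {
        return (int)((map >> ((cycle % 16u) * 2u)) & 3u);
    }

    int main(void)
    {
        /* 0x22222222 alternates task 2 and task 0 in successive slots */
        for (unsigned c = 0; c < 4; c++)
            printf("cycle %u -> task %d\n", c, task_for_cycle(0x22222222u, c));
        return 0;
    }
    ```

    This is what makes hardware tasks "very deterministic": a task's share of the pipeline is fixed by the map, not by what the other tasks happen to be doing.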
  • pedwardpedward Posts: 1,642
    edited 2014-03-05 13:43
    Now let's talk a bit about NUMA zones and memory pools...
  • jmgjmg Posts: 15,155
    edited 2014-03-05 13:50
    LOL... I know, I've made the multi-tasking mistake in posting, even the task/thread confusion.

    I'd be OK with cog/vcog/thread, but I am strongly against calling what now are tasks threads.

    Problem is, Intel has already chosen the other terminology, some years back...

    They use threads for the sharing-pipeline, interleaved-in-HW-in-time stuff, and they leave TASKS to the operating system.

    see
    http://en.wikipedia.org/wiki/Hyper-threading
    and
    http://en.wikipedia.org/wiki/Intel_Atom_%28CPU%29

    So that's a lot of industry training and nomenclature to flip on its head...
    P2 time-sliced threads are not exactly the same thing, but they are sharing the pipeline, and done at the lowest hardware level, so there will be no complete inversion of training involved here.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-05 13:54
    If you want to follow intel...

    cog
    HT <-- hyper thread
    thread

    And Intel's nomenclature is a johnny-come-lately

    threads have been used as software threads since forever (Unix, I think also Multics)

    Also, hyper-threading à la Intel is a microprocessor term, coined by Intel.

    AMD, with Bulldozer, went a bit further implementation-wise, and calls them cores - when in reality, they pack two integer "cores" that share some aspects of dispatch etc. and one "FPU", and call them two cores.
    jmg wrote: »
    Problem is, Intel has already chosen the other terminology, some years back...

    They use threads for the sharing-pipeline, interleaved-in-HW-in-time stuff, and they leave TASKS to the operating system.

    see
    http://en.wikipedia.org/wiki/Hyper-threading
    and
    http://en.wikipedia.org/wiki/Intel_Atom_%28CPU%29

    So that's a lot of industry training and nomenclature to flip on its head...
    P2 time-sliced threads are not exactly the same thing, but they are sharing the pipeline, and done at the lowest hardware level, so there will be no complete inversion of training involved here.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-05 13:57
    pedward,
    This thing is trying to look more like an XMOS every day.
    Oh what! Is that a bad thing? The XMOS is starting to look very clean by comparison.


    Pandering to the needs of every obsessive assembler language programmer out there is not the road to success.

    The chip needs to be useful.
  • jmgjmg Posts: 15,155
    edited 2014-03-05 13:57
    mindrobots wrote: »
    Does Bill's new instruction here handle your debug concerns?

    Not sure that link is right, but I think Bill? did mention a coarse encoded readback of >= 1/16 map.
    I then suggested that, if bothering to do an encoded readback, a full 4-bits-per-thread read would not lose information.

    Extra encoding is fine, but it first starts with a readable copy of the static task-map - an icing-on-the-top sort of option.

    Normally in debug, just reading what you (or another thread) wrote is enough, but having that full working view is important.
    Blind spots are best avoided.

    You want to be able to catch any unexpected change to TaskMap, between break points, for example.
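    The debug case above — catching an unexpected TaskMap change between breakpoints — reduces to comparing a hardware readback against a shadow copy saved at the previous breakpoint. A hedged C sketch (`read_taskmap` stands in for whatever readback instruction the silicon ends up providing; all names are invented for illustration):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Hedged sketch: detect an unexpected TaskMap change between two
     * breakpoints. read_taskmap() is a stand-in for a real readback
     * instruction; hw_taskmap simulates the register. */

    static uint32_t hw_taskmap;                 /* simulated register   */
    static uint32_t read_taskmap(void) { return hw_taskmap; }

    static uint32_t shadow;                     /* saved at last break  */

    /* Call at each breakpoint: returns the XOR of the old and new maps
     * (i.e. exactly which bits moved), zero if nothing changed, and
     * re-arms the shadow for the next breakpoint. */
    static uint32_t taskmap_changed(void)
    {
        uint32_t now  = read_taskmap();
        uint32_t diff = now ^ shadow;
        shadow = now;
        return diff;
    }

    int main(void)
    {
        hw_taskmap = 0x00000000u;
        taskmap_changed();                      /* arm the shadow        */
        hw_taskmap = 0x00000002u;               /* someone rewrote slot 0 */
        printf("diff = 0x%08x\n", taskmap_changed());
        return 0;
    }
    ```

    Returning the XOR rather than a flag gives the debugger the blind-spot-free view argued for above: not just *that* the map changed, but *which* slots changed.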