Propeller II update - BLOG

Heater. · 2014-02-28 16:27

pjv,

Just for fun, this is what 4 individual co-operative LED flashers look like....
....A whole load of gibberish...

Sapieha,

...the same gibberish again but formatted nicely...

What are we trying to demonstrate here? That the P2 is impossible for "normal people" to program?

Sapieha · 2014-02-28 16:42

Hi Heater.

I have only corrected its formatting.
I did not study it to see if it is gibberish.
Don't have time for that.

Heater. wrote: »

pjv,

Sapieha,

What are we trying to demonstrate here? That the P2 is impossible for "normal people" to program?

Heater. · 2014-02-28 16:54

Sapieha,

Don't have time for that.

Exactly my point!

pjv · 2014-02-28 17:07

Heater;

Sorry if I have offended you; that was not my intent. I was only trying to show how simple co-operative multi-threading can be on a P1.

I'll try to keep quiet from now on.

Peter (pjv)

cgracey · 2014-02-28 17:44

I have a question for you guys:

Is it worth getting rid of FIXINDA/FIXINDB (which wrap within a limited area) and only supporting SETINDA/SETINDB (which always span the whole $000..$1FF), in order to give each task its own INDA/INDB?

The only obvious case that would suffer from not being able to set up wrapping INDA/INDB would be FIR filters, where you want the coefficients to slide against the samples. FIRs would be complicated somewhat by losing this feature, but by getting rid of it we would reduce required state storage to 1/3 and get rid of lots of logic, which would allow time for mux'ing of tasks' INDA/INDB's.

What do you think?

rogloh · 2014-02-28 18:09

I don't have an opinion one way or the other on the FIXINDA/FIXINDB wrapping, but if you included these INDA/INDB registers per task wouldn't it then take you outside a WIDE boundary for the entire task state? Does that then mess up your task state switching a little?

UPDATE: I guess INDA, INDB are pretty small (9 bits each), so perhaps that does fit in what you had left.

ozpropdev · 2014-02-28 18:11

cgracey wrote: »

One thing I see we need is a way to set INDA/INDB using a variable. This is especially important for hub exec code, as it can't self-modify. These new SETINDA D and SETINDB D instructions will need a two-instruction gap before they take effect. Those are going to be very simple to implement.

As we start coding in hub exec mode, it may be apparent that some new instructions are needed to make things flow smoothly. I think we've got it mostly covered, already, but there may be a few things.

cgracey wrote: »

I have a question for you guys:

Is it worth getting rid of FIXINDA/FIXINDB (which wrap within a limited area) and only supporting SETINDA/SETINDB (which always span the whole $000..$1FF), in order to give each task its own INDA/INDB?

The only obvious case that would suffer from not being able to set up wrapping INDA/INDB would be FIR filters, where you want the coefficients to slide against the samples. FIRs would be complicated somewhat by losing this feature, but by getting rid of it we would reduce required state storage to 1/3 and get rid of lots of logic, which would allow time for mux'ing of tasks' INDA/INDB's.

What do you think?

The idea of SETINDx D makes a lot of sense, and assists in the absence of a GETINDx instruction.
I personally can live without FIXINDx.

cgracey · 2014-02-28 18:16

rogloh wrote: »

I don't have an opinion one way or the other on the FIXINDA/FIXINDB wrapping, but if you included these INDA/INDB registers per task wouldn't it then take you outside a WIDE boundary for the entire task state? Does that then mess up your task state switching a little?

They would have to be saved and restored, anyway, perhaps in a secondary RDTASK2/WRTASK2 instruction.

I've been through all the register declarations in the cog's Verilog and have come up with a list of stuff that needs to be saved to provide comprehensive context switches. One thing we forgot about is PORA/PORB/PORC/PORD, which are only two bits each and help address pins in various ports without self-modifying code, which is critical for hub exec. I just gave each task a set of those, but they will need to be saved and restored, as well. It might be, to keep things simple, that all the task state data are in one RDTASK/WRTASK, while the LIFO stack is handled in another. This whole thing is very simple, but it requires identifying and organizing all the elements involved.

Bill Henning · 2014-02-28 18:20

Sounds good to me.

cgracey wrote: »

I have a question for you guys:

Is it worth getting rid of FIXINDA/FIXINDB (which wrap within a limited area) and only supporting SETINDA/SETINDB (which always span the whole $000..$1FF), in order to give each task its own INDA/INDB?

The only obvious case that would suffer from not being able to set up wrapping INDA/INDB would be FIR filters, where you want the coefficients to slide against the samples. FIRs would be complicated somewhat by losing this feature, but by getting rid of it we would reduce required state storage to 1/3 and get rid of lots of logic, which would allow time for mux'ing of tasks' INDA/INDB's.

What do you think?

Bill Henning · 2014-02-28 18:22

Saving/loading the LIFO state in a separate WIDE makes sense to me.

cgracey wrote: »

They would have to be saved and restored, anyway, perhaps in a secondary RDTASK2/WRTASK2 instruction.

I've been through all the register declarations in the cog's Verilog and have come up with a list of stuff that needs to be saved to provide comprehensive context switches. One thing we forgot about is PORA/PORB/PORC/PORD, which are only two bits each and help address pins in various ports without self-modifying code, which is critical for hub exec. I just gave each task a set of those, but they will need to be saved and restored, as well. It might be, to keep things simple, that all the task state data are in one RDTASK/WRTASK, while the LIFO stack is handled in another. This whole thing is very simple, but it requires identifying and organizing all the elements involved.

rogloh · 2014-02-28 18:34

Bill Henning wrote: »

Saving/loading the LIFO state in a separate WIDE makes sense to me.

Yes, it makes sense like that because there can potentially be cases where the LIFO state of user threads may not get used (if everything always simply used a hub stack, for example). In those cases we may not have to do the extra work of saving it.

By the way, depending on the VM design we may still have to copy out some COG register values to hub upon task state switching. So if GCC uses R0-R15 for example, we would have to save/restore these 16 registers whenever we switch threads. That is all under software control in the scheduler task you would write. So there will be plenty of other hub accesses required upon thread switching within the user task. Needing the extra RDTASK2 now is not going to make a huge difference to performance in the end.

Tubular · 2014-02-28 18:35

Seems reasonable and removes another task based gotcha.

I think spin2 will give us operators capable of hard limits (correct?). Pasm programmers are used to such problems. And those following FIR code templates designed for other (lesser) micros won't miss them either.

cgracey · 2014-02-28 18:41

rogloh wrote: »

Yes, it makes sense like that because there can potentially be cases where the LIFO state of user threads may not get used (if everything always simply used a hub stack, for example). In those cases we may not have to do the extra work of saving it.

By the way, depending on the VM design we may still have to copy out some COG register values to hub upon task state switching. So if GCC uses R0-R15 for example, we would have to save/restore these 16 registers whenever we switch threads. That is all under software control in the scheduler task. So there will be plenty of other hub accesses required upon thread switching within the user task. Needing the extra RDTASK2 now is not going to make a huge difference to performance in the end.

Actually, I'm adding a mode to the register remapper so that instead of just being driven by task ID or INDB, you'll be able to set a static state. This means that you'll be able to execute an instruction to say, remap $010..$01F into $000..$00F until further notice. Next thread, change that to $020..$02F into $000..$00F, and so on. So, in cases where you just want unique sets of registers in a common location range, and there's enough cog RAM to support it all, there's no need to do any more moving of data into and out of the hub.
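A rough Python model of the static remap mode may make the idea concrete. It is only a sketch under stated assumptions: it takes "remap $010..$01F into $000..$00F" to mean that accesses to $000..$00F are redirected to the selected bank, and the class and method names (`RemappedCogRam`, `set_remap`) are invented for illustration. Nothing here reflects the actual Verilog.

```python
class RemappedCogRam:
    """Toy model of cog RAM with one statically remapped 16-register window."""

    WINDOW = 0x10  # assumed size of the shared window at $000..$00F

    def __init__(self, size=0x200):
        self.ram = [0] * size
        self.base = 0x000  # bank currently visible through the window

    def set_remap(self, base):
        # Hypothetical stand-in for the "remap until further notice" instruction.
        self.base = base

    def _physical(self, addr):
        # Only addresses inside the window are redirected; the rest pass through.
        return self.base + addr if addr < self.WINDOW else addr

    def read(self, addr):
        return self.ram[self._physical(addr)]

    def write(self, addr, value):
        self.ram[self._physical(addr)] = value


cog = RemappedCogRam()
cog.set_remap(0x010)    # thread 1: $000..$00F now lands in $010..$01F
cog.write(0x000, 111)
cog.set_remap(0x020)    # thread 2: same addresses, different physical registers
cog.write(0x000, 222)
cog.set_remap(0x010)    # back to thread 1
thread1_r0 = cog.read(0x000)
print(thread1_r0)
```

Each thread sees its own register 0 at the same address, and switching threads costs one remap rather than a copy to and from hub memory.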

rogloh · 2014-02-28 18:43

Nice one. That will allow quite a few internal threads before any hub memory has to be used for saving register state. Makes good use of COG RAM for that.

cgracey · 2014-02-28 18:54

I've just thought of a way around the FIR limitation.

Since you can't efficiently "scroll" your samples in a FIFO, because they're all in different registers, just keep writing them into what amounts to a circular buffer. This only comes up when you input a new sample. So, you have a circular buffer that only needs one write into it to update its state.

Meanwhile, arrange your tap coefficients into a stretch of registers so that they are repeated once, minus the last tap. So, coefficients A,B,C,D,E,F,G,H get arranged as A,B,C,D,E,F,G,H,A,B,C,D,E,F,G.

To compute the FIR, point INDA to the start of the circular buffer and point INDB to the next offset (0..7) from the start of the coefficient table. Then, do the usual:

	reps	#8,#1
	clracca
	maca	inda++,indb++

I can feel okay about getting rid of INDA/INDB wrapping now!
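The doubled-coefficient trick can be checked with a small Python model. This is a sketch, not the PASM itself: lists stand in for cog registers, the inner loop stands in for the REPS/MACA sequence, and it assumes the sample write pointer moves downward so that older samples sit at higher addresses (with an upward-moving pointer the coefficient table would need to be stored in reverse). The names `DoubledTableFir` and `fir_direct` are made up for the example.

```python
def fir_direct(coeffs, history):
    """Reference FIR: history[0] is the newest sample, history[j] is j samples old."""
    return sum(c * x for c, x in zip(coeffs, history))


class DoubledTableFir:
    def __init__(self, coeffs):
        self.n = len(coeffs)
        # A..H followed by A..G: the table repeated once, minus the last tap.
        self.table = list(coeffs) + list(coeffs[:-1])
        self.buf = [0] * self.n   # circular sample buffer
        self.newest = 0           # index of the most recent sample

    def push(self, sample):
        # Assumption: the write pointer steps downward, one write per new sample.
        self.newest = (self.newest - 1) % self.n
        self.buf[self.newest] = sample

    def output(self):
        offset = (-self.newest) % self.n   # models the INDB start offset (0..n-1)
        acc = 0
        for i in range(self.n):            # models maca inda++,indb++ repeated n times
            acc += self.buf[i] * self.table[offset + i]
        return acc


coeffs = [1, 2, 3, 4, 5, 6, 7, 8]          # taps A..H
fir = DoubledTableFir(coeffs)
history = [0] * len(coeffs)
matches = []
for sample in [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, 5]:
    fir.push(sample)
    history = [sample] + history[:-1]
    matches.append(fir.output() == fir_direct(coeffs, history))
print(all(matches))
```

Neither pointer ever needs to wrap during the multiply-accumulate pass: the sample pointer always starts at the bottom of the buffer, and the coefficient pointer never runs past the 15-entry table.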

Sapieha · 2014-02-28 19:00

Hi Chip

Nice solution

cgracey wrote: »
I've just thought of a way around the FIR limitation.

Since you can't efficiently "scroll" your samples in a FIFO, because they're all in different registers, just keep writing them into what amounts to a circular buffer. This only comes up when you input a new sample. So, you have a circular buffer that only needs one write into it to update its state.

Meanwhile, arrange your tap coefficients into a stretch of registers so that they are repeated once, minus the last tap. So, coefficients A,B,C,D,E,F,G,H get arranged as A,B,C,D,E,F,G,H,A,B,C,D,E,F,G.

To compute the FIR, point INDA to the start of the circular buffer and point INDB to the next offset (0..7) from the start of the coefficient table. Then, do the usual:

    reps    #8,#1
    clracca
    maca    inda++,indb++

I can feel okay about getting rid of INDA/INDB wrapping now!

Ariba · 2014-02-28 23:20

cgracey wrote: »

I have a question for you guys:

Is it worth getting rid of FIXINDA/FIXINDB (which wrap within a limited area) and only supporting SETINDA/SETINDB (which always span the whole $000..$1FF), in order to give each task its own INDA/INDB?

The only obvious case that would suffer from not being able to set up wrapping INDA/INDB would be FIR filters, where you want the coefficients to slide against the samples. FIRs would be complicated somewhat by losing this feature, but by getting rid of it we would reduce required state storage to 1/3 and get rid of lots of logic, which would allow time for mux'ing of tasks' INDA/INDB's.

What do you think?

No, No, No

If you remove that, you immediately lose one of the most important features that make the Prop2 a DSP. Every dedicated DSP has this automatic modulo/wrap of index registers. This is essential for many, many DSP algorithms, not only FIR filters.

Chip, please stop cannibalizing old instructions for this unneeded thread switching. We have already lost some useful features / instructions in the last 3 months (IJNZ and executability of the mapped WIDEs, for example).

IMO the current development is going in a totally wrong direction, which not only bloats the design more and more but also eliminates a lot of the fun of working with the Prop2.

Andy

cgracey · 2014-02-28 23:23

I have another question for you all:

Would it be wise to change INDA/INDB usage options from [ INDA / INDA++ / INDA-- / ++INDA ] to [ INDA++ / INDA-- / ++INDA / --INDA ]? In other words, in order to accommodate the currently-missing --INDA case, we would get rid of plain INDA. In some places I've used plain INDA, but those instances could be recoded using alternating INDA++ and --INDA, at some expense of visual simplicity. What --INDA would get us is bottom-upwards stacks, whereas now we have only top-downwards. I'm not overly passionate about this matter, but I'd like to know what you think, as I'm into this section of Verilog now.
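To see what the proposed set of modes buys, here is a small Python sketch of the four access modes. It is illustrative only: `IndexPointer` is a made-up name, each method returns the register address that the access would use, and the 9-bit wrap over $000..$1FF is an assumption that ignores any FIXINDA window.

```python
class IndexPointer:
    """Toy model of INDA with the four proposed access modes."""

    def __init__(self, addr):
        self.addr = addr

    def _step(self, delta):
        self.addr = (self.addr + delta) & 0x1FF  # assumed 9-bit wrap

    def post_inc(self):      # INDA++ : use the current address, then increment
        used = self.addr
        self._step(+1)
        return used

    def post_dec(self):      # INDA-- : use the current address, then decrement
        used = self.addr
        self._step(-1)
        return used

    def pre_inc(self):       # ++INDA : increment, then use the new address
        self._step(+1)
        return self.addr

    def pre_dec(self):       # --INDA : decrement, then use the new address
        self._step(-1)
        return self.addr


ram = [0] * 0x200

# Plain INDA recoded as alternating INDA++ and --INDA: every access hits $100.
inda = IndexPointer(0x100)
same_register = [inda.post_inc(), inda.pre_dec(), inda.post_inc(), inda.pre_dec()]

# Bottom-upwards stack: push with INDA++, pop with --INDA.
up = IndexPointer(0x080)
for value in (10, 20, 30):
    ram[up.post_inc()] = value
popped_up = [ram[up.pre_dec()] for _ in range(3)]

# Top-downwards stack (available today): push with INDA--, pop with ++INDA.
down = IndexPointer(0x0FF)
for value in (1, 2, 3):
    ram[down.post_dec()] = value
popped_down = [ram[down.pre_inc()] for _ in range(3)]
```

The upward-growing stack is the only case that needs --INDA. The price is that a repeated access to one register must be written as an INDA++ / --INDA pair.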

cgracey · 2014-02-28 23:25

Ariba wrote: »

No, No, No

If you remove that, you immediately lose one of the most important features that make the Prop2 a DSP. Every dedicated DSP has this automatic modulo/wrap of index registers. This is essential for many, many DSP algorithms, not only FIR filters.

Chip, please stop cannibalizing old instructions for this unneeded thread switching. We have already lost some useful features / instructions in the last 3 months (IJNZ and executability of the mapped WIDEs, for example).

IMO the current development is going in a totally wrong direction, which not only bloats the design more and more but also eliminates a lot of the fun of working with the Prop2.

Andy

Thanks for speaking up on this, Andy! I see what you are saying. I'm thinking about this...

cgracey · 2014-02-28 23:32

Ariba wrote: »

No, No, No

If you remove that, you immediately lose one of the most important features that make the Prop2 a DSP. Every dedicated DSP has this automatic modulo/wrap of index registers. This is essential for many, many DSP algorithms, not only FIR filters.

Chip, please stop cannibalizing old instructions for this unneeded thread switching. We have already lost some useful features / instructions in the last 3 months (IJNZ and executability of the mapped WIDEs, for example).

IMO the current development is going in a totally wrong direction, which not only bloats the design more and more but also eliminates a lot of the fun of working with the Prop2.

Andy

I don't want to mess up the fun! I see why this is important.

Simplifying the INDA/INDB circuits would have allowed each task to have its own set, which would have been kind of useful. The wrapping circuitry takes a lot of logic and time. I agree we need to keep wrapping now. For this context-save effort, it doesn't matter whether each task has a set, because they can be saved and restored via the WIDEs, so big-model preemptive threads can still use them. Tasks would have to precede their own usage with a TLOCK and follow it with a TFREE.

cgracey · 2014-02-28 23:44

Ariba wrote: »

IMO the current development is going in a totally wrong direction, which not only bloats the design more and more but also eliminates a lot of the fun of working with the Prop2. Andy

There is definitely bloat involved in hub exec, but I'm looking forward to being able to write large-model PASM programs without needing an interpreter. I also like this task save/restore business because it opens lots of doors to things that are otherwise out-of-reach and interesting, and FUN to me, though others have a foreboding sense about it. I see it as another neat thing to be explored and developed in software.

EDIT: It was providence that you spoke up just when you did, because the ax was already out. Good thing Quartus' text editor has a reliable UNDO function.

jmg · 2014-02-28 23:58

cgracey wrote: »

I also like this task save/restore business because it opens lots of doors to things that are otherwise out-of-reach and interesting, and FUN to me, though others have a foreboding sense about it. I see it as another neat thing to be explored and developed in software.

Good debug is much more than just "another neat thing", and the more unusual a chip is, the more good debug matters.

cgracey wrote: »

EDIT: It was providence that you spoke up just when you did, because the ax was already out. Good thing Quartus' text editor has a reliable UNDO function.

hehe, editors with a deep undo are quite useful things

potatohead · 2014-03-01 00:10

I want to make my note above clear.

Doing things like diluting the DSP capability is a net loss. Good news is that won't happen. This tasking business is fun Chip! I get that. And I think it's fun to see everybody contributing to good ideas too.

HUBEXEC is awesome. It was the right move. Tasks are a very good move, and something I think is the right move too. Few of us would disagree at this point, and authoring larger programs in PASM is a very significant thing, which removes a lot of complications.

At some point, it starts to become a what is worth what discussion, and given the instructions, it's at that point. Opening a new door means potentially closing, or jamming another one opened prior. Sometimes that means a kludge to sort of split the middle, or maybe just open both doors a little, etc... and that's where I was writing to above.

Can't open 'em all fully. At least not on this design. So what is worth what? To me, that speaks to the basic design goals and how the P2 will be differentiated out there in the market. Either we nail that, or we don't. I find it difficult now to articulate what the focus really is, and that's the core of the worry expressed above.

And it's hard too, because it's a couple of instructions, seemingly minor, just like a lot of other things were seemingly minor, but they all had core reasons that seemed to resonate at the time. Moving forward, those reasons might not be factored into the current change, and we end up with something like the DSP capability being diluted down, almost missed. Not for bad intent of any kind. It's not that. It's just hard to keep it all balanced, because there is a LOT now. Many doors open, in other words.

Really, having broken through some very fundamental things, like HUBEXEC, we get into niches, each taking a little bit, until it starts to become trades, not adds, and that is perhaps another way to express the dilution worry I did above.

Ever make something, like out of clay, or metal, or paper? That first pass is special, kind of neat, and a cut here, add there, and it's excellent? Then continuing to improve it, sees a kind of indistinct thing instead of the great idea first realized. The P2 is starting to be like that, put another way still.

Finally, as I mentioned above, I'm just contributing like everybody else. The more macro things aren't often discussed, and I think they should be, right along with the more micro, niche things, or context is lost, and things get mushy.

Maybe that helps clarify some of the intent above.

potatohead · 2014-03-01 00:17

Re: Good debug.

Again, with what we had before this change, good debug was entirely possible.** The "what is worth what?" discussion comes down to whether or not making it better still, or packing more of it into hardware as opposed to software is worth trading off something else we thought worth it at the time, which links back to the posts I made recently.

Most anything at this point is going to be a small change, but it might cost a lot and we may well find it more compelling to engineer the change rather than seriously evaluate what the overall implications would be, that's all.

Well, not all. None of the ideas are bad ideas. I'm not saying good debug is a bad idea. I am saying that further drilling down on that might not deliver a return equal to what it costs to get it, and things cost now, because we've stuffed a lot of great ideas in already.

**And other chips do require significant debug because of things they lack that P2 has! It's hard to set up an interactive environment on those without doing a lot of work. We can have interactive debug and trace capabilities that run on some of the chip, while we have displays and outputs on other parts of the chip, while having the target code run on still another part, and running real time on another part still, and that's not hard to do. And the beauty of it is we get to say "on chip" all the way through development, if we want to.

So then, I am very wary of claims that things are needed because they are needed elsewhere, without comparing and validating those claims against the capabilities we have now, and that's just not often done. It should be.

ozpropdev · 2014-03-01 00:34

Chip
Could a substitute instruction be made to replace FIXINDx operations?
A variant of INCMOD, but for INDx indexing:

         INCINDA #$23 WZ    'fix inda from $1E to $23
    IF_Z SETINDA #$1E

This would replace the wrap circuitry.

Roy Eltham · 2014-03-01 00:35

jmg,
I think the "another neat thing" is in regard to preemptive multitasking, not debug stuff. It turns out there is overlap between what's needed for enhancing debugging and what's needed to do the preemptive multitasking. I'm all for having better debug stuff built in; in fact, I would have preferred a full JTAG interface way back when if it was feasible (which it's not, so don't even think about it now, Chip!). I don't like the preemptive multitasking, but whatever.

What I especially don't like is that Chip is resorting to taking out existing features to make new ones. Luckily, Ariba caught it in time for this one, but what things have we lost in favor of some new feature that might not even be used by anyone (or by very few)? It's scary.

There is so much power already there, most anything added now is just making something that is ALREADY possible a little simpler or a little more efficient. Is it really worth it? Let's get it done, tested, and get started on the docs and compilers/etc.

potatohead · 2014-03-01 00:37

If I'm not mistaken, having all of that set up for the pointer to wrap means cycle and logic savings, both of which seriously improve DSP performance. This is a classic example where it may be worth revisiting that, as Andy has made very compelling arguments for the structure we have now.

cgracey · 2014-03-01 00:37

Doug,

In your opinion, if some aspects of the chip were not documented, so certain features effectively didn't exist, would you say we are still afield from where we ought to be?

potatohead · 2014-03-01 00:38

There is so much power already there, most anything added now is just making something that is ALREADY possible a little simpler or a little more efficient. Is it really worth it? Let's get it done, tested, and get started on the docs and compilers/etc.

Precisely.

cgracey · 2014-03-01 00:40

ozpropdev wrote: »
Chip
Could a substitute instruction be made to replace FIXINDx operations?
A variant of INCMOD, but for INDx indexing:
         INCINDA #$23 WZ    'fix inda from $1E to $23
    IF_Z SETINDA #$1E
This would replace the wrap circuitry.

The magic of INDx wrapping is that it works in the background with single-cycle MAC instructions. Even one more clock cycle would cut that DSP speed in half.
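The "cut in half" figure follows from simple cycle counting. The sketch below is back-of-the-envelope arithmetic, not a timing model of the real pipeline: it assumes one clock per MAC, ignores loop setup, and the function and parameter names are invented for illustration.

```python
def mac_loop_cycles(taps, cycles_per_mac=1, extra_cycles_per_tap=0):
    """Clocks spent in the multiply-accumulate loop, ignoring setup overhead."""
    return taps * (cycles_per_mac + extra_cycles_per_tap)


hardware_wrap = mac_loop_cycles(8)                           # wrapping happens in the background
software_wrap = mac_loop_cycles(8, extra_cycles_per_tap=1)   # one extra clock per tap to wrap
slowdown = software_wrap / hardware_wrap
print(hardware_wrap, software_wrap, slowdown)
```

With a single-cycle MAC, any per-tap instruction that handles the wrap doubles the loop time, which is why a software substitute does not replace the wrap circuitry for DSP work.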
