Propeller II update - BLOG

ctwardell · 2014-03-09 17:44

Ahle2,

A problem with the solution you have suggested is that it makes things much more difficult on the scheduler side because it no longer has a simple absolute test for a non-zero value or a specific value in a flag bit.

With Bill's approach the scheduler needs to watch for the register loaded by TPAUSE to be non-zero.
Bill puts any return value to the client in a separate register from the register loaded by TPAUSE and then clears the register loaded by TPAUSE to signal the TPAUSE to resume.
The scheduler can immediately resume polling.

With jmg's approach the scheduler needs to watch for bit31 to be set in the register loaded by TPAUSE. A value is returned to the client using the same register used by TPAUSE, but bit31 must be cleared to tell the TPAUSE to resume.
The scheduler can immediately resume polling.

With your approach I don't see a simple way for the scheduler to operate.

If we want to do simple non-zero testing in the scheduler the value in the register to be loaded by TPAUSE must be zero until TPAUSE causes it to change.
Whenever the scheduler writes a non-zero value back to a client the register is not in the proper state for polling, so the scheduler would need to wait some amount of time to allow the client to grab the return value and then clear the register so it could start polling again.
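
For readers following along, the two handshakes compared above can be sketched as a small Python model. This is only an illustration under assumptions: the register names task3req and task3result are borrowed from jmg's snippet quoted later in the thread, and the helper function names are made up for this sketch. It is not Chip's implementation.

```python
BIT31 = 1 << 31
MASK32 = 0xFFFFFFFF

# Bill's approach: TPAUSE loads a non-zero request into task3req.
def bill_pending(regs):
    """Scheduler poll: a simple non-zero test on the TPAUSE register."""
    return regs["task3req"] != 0

def bill_release(regs, result):
    """Return value goes to a separate register, then the request is cleared."""
    regs["task3result"] = result & MASK32
    regs["task3req"] = 0          # zero tells the TPAUSE to resume

# jmg's approach: bit31 of the same register is the 'pending' flag.
def jmg_pending(regs):
    """Scheduler poll: test bit31 of the TPAUSE register."""
    return bool(regs["task3req"] & BIT31)

def jmg_release(regs, result):
    """Return value shares the register; bit31 must end up cleared."""
    regs["task3req"] = result & ~BIT31 & MASK32

regs = {"task3req": BIT31 | 7, "task3result": 0}
print(bill_pending(regs), jmg_pending(regs))   # both see a pending request
jmg_release(regs, 0x1234)
print(jmg_pending(regs), hex(regs["task3req"]))
```

In both models the scheduler can go straight back to polling after the release, which is the property the post is pointing at. Note that in the jmg variant a non-zero return value leaves the register non-zero, so a plain non-zero poll would no longer work there.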

C.W.

Cluso99 · 2014-03-09 18:07

ctwardell wrote: »

Ahle2,

A problem with the solution you have suggested is that it makes things much more difficult on the scheduler side because it no longer has a simple absolute test for a non-zero value or a specific value in a flag bit.

With Bill's approach the scheduler needs to watch for the register loaded by TPAUSE to be non-zero.
Bill puts any return value to the client in a separate register from the register loaded by TPAUSE and then clears the register loaded by TPAUSE to signal the TPAUSE to resume.
The scheduler can immediately resume polling.

With jmg's approach the scheduler needs to watch for bit31 to be set in the register loaded by TPAUSE. A value is returned to the client using the same register used by TPAUSE, but bit31 must be cleared to tell the TPAUSE to resume.
The scheduler can immediately resume polling.

With your approach I don't see a simple way for the scheduler to operate.

If we want to do simple non-zero testing in the scheduler the value in the register to be loaded by TPAUSE must be zero until TPAUSE causes it to change.
Whenever the scheduler writes a non-zero value back to a client the register is not in the proper state for polling, so the scheduler would need to wait some amount of time to allow the client to grab the return value and then clear the register so it could start polling again.

C.W.

Am I missing something really basic with Ahle2's suggestion???
His would work with both Bill's suggestion and jmg's suggestion, as well as perhaps a lot of others.

For Bill,
TCHECK D, #somevalue
...
MOV D, #0 'release the TCHECK because 0 <> #somevalue

For jmg,
AUGS #bit31
TCHECK D, #somevalue 'or S=b31 | somevalue
...
CLRB D,#31 'clear bit31: release the TCHECK because 0 | #somevalue <> 1 | somevalue

Apart from the extra silicon to hold a copy of the value set by TCHECK, what could be simpler and more powerful ??? What am I missing???
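
A rough Python model of the 'current == original' idea may help here. It assumes, as the snippets above suggest, that TCHECK writes the value into D and keeps the task paused while D still equals that value. The class and method names are hypothetical, and the `original` field stands for the extra per-task copy mentioned above.

```python
BIT31 = 1 << 31
MASK32 = 0xFFFFFFFF

class TCheck:
    """Hypothetical model of TCHECK D,#value (not a real P2 definition)."""

    def __init__(self, value):
        self.original = value & MASK32  # extra per-task copy held in silicon
        self.d = self.original          # register D, visible to the scheduler

    def paused(self):
        # The task stays parked for as long as D is unchanged.
        return self.d == self.original

# Bill's style: the scheduler releases by writing zero.
bill = TCheck(5)
bill.d = 0
print(bill.paused())          # released, because 0 != 5

# jmg's style: the scheduler clears bit31 and leaves a value behind.
jmg = TCheck(BIT31 | 5)
jmg.d &= ~BIT31 & MASK32
print(jmg.paused(), jmg.d)    # released, D still carries 5
```

One consequence of this model: any write that changes D releases the task, so the request value in Bill's style must itself be non-zero for "write zero" to count as a change.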

ctwardell · 2014-03-09 18:29

Cluso99 wrote: »

Am I missing something really basic with Ahle2's suggestion???
His would work with both Bill's suggestion and jmg's suggestion, as well as perhaps a lot of others.

For Bill,
TCHECK D, #somevalue
...
MOV D, #0 'release the TCHECK because 0 <> #somevalue

For jmg,
AUGS #bit31
TCHECK D, #somevalue 'or S=b31 | somevalue
...
CLRB D,#31 'clear bit31: release the TCHECK because 0 | #somevalue <> 1 | somevalue

Apart from the extra silicon to hold a copy of the value set by TCHECK, what could be simpler and more powerful ??? What am I missing???

Got it now! I was mixing Bill's and jmg's when working through Ahle2's.

Looks like that is a good solution for TPAUSE (TCHECK?).

While it does leave a more flexible TPAUSE, there is still the need for the additional jump negative instruction to fully implement jmg's approach.

C.W.

ozpropdev · 2014-03-09 18:47

jmg wrote: »

You can just use positive and negative, but I think that is also a new opcode.

Hi All
Having finished coffee #7 (or was it #8?) I just realised something.
Mindrobots was on the right track. We don't need new instructions.

We already have JP (Jump positive) and JNP (jump not positive) instructions and JPD,JNPD (delayed).
The JNP instruction is actually the "JB31" instruction jmg suggested.

Cheers
Brian

cgracey · 2014-03-09 18:54

The problem with implementing the 'current == original' test is that it requires another 32 bits of state storage per task (128 more flops per cog, or 1024 per chip), of which 32 bits must also be stored in the WIDEs for T3LOAD/T3SAVE. There are only 12 more bits available there, though. So, this could work if the value was only tracked to 12 bits. For what extra benefit this would give, it seems way too expensive, even if we limited the value checking to 9 bits, all around. You can always use another register to convey a feedback value.
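
The arithmetic behind those figures can be checked in a few lines of Python. The task and cog counts below are not stated in the post; they are inferred from 128/32 and 1024/128 and should be read as assumptions.

```python
BITS_PER_TASK = 32      # one stored copy of the 'original' value per task
TASKS_PER_COG = 4       # inferred from 128 flops / 32 bits
COGS_PER_CHIP = 8       # inferred from 1024 flops / 128 flops
FREE_WIDE_BITS = 12     # spare bits quoted for the T3SAVE/T3LOAD image

flops_per_cog = BITS_PER_TASK * TASKS_PER_COG
flops_per_chip = flops_per_cog * COGS_PER_CHIP
shortfall = BITS_PER_TASK - FREE_WIDE_BITS   # bits that would not fit

print(flops_per_cog, flops_per_chip, shortfall)
```

The shortfall of 20 bits is why the post suggests tracking only 12 (or 9) bits of the value if the feature were kept at all.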

ctwardell · 2014-03-09 18:56

ozpropdev wrote: »

Hi All
Having finished coffee #7 (or was it #8?) I just realised something.
Mindrobots was on the right track. We don't need new instructions.

We already have JP (Jump positive) and JNP (jump not positive) instructions and JPD,JNPD (delayed).
The JNP instruction is actually the "JB31" instruction jmg suggested.

Cheers
Brian

Those test pin states, not bit31.

From Prop2_Docs.txt:

To avoid excessively stalling the pipeline during multi-tasking, the WAITCNT/WAITPEQ/WAITPNE
instructions can be substituted with non-stalling alternatives:

  PASSCNT D/#        jumps to itself if some amount of time has not passed, use instead of WAITCNT
  JP/JNP  D/#,S/#    jumps based on pin states, use instead of WAITPEQ/WAITPNE

C.W.

rogloh · 2014-03-09 18:59

I really like how Ahle2's idea can be worked both ways and is method agnostic. The ability to automatically get a return value (aka the jmg method) could be quite useful in some cases. The extra scheduler task overhead of needing multiple cycles to poll a set bit (e.g. bit31) in the value written by the TPAUSE instruction (or whatever it is called nowadays), instead of using a quick and atomic JZ (or TJNZ), may not necessarily be a big problem. After all, if we are giving 1/16 of the CPU to the scheduler, and it is more than likely scheduling threads at a rate much slower than this, there will be plenty of time for extra, more complex instructions to do checks like this. And we can still do the zero-based approach too, so we give up nothing from a software point of view.

I guess we will see if Chip likes it too vs what he already has now.

UPDATE: Oops, too late. Chip has already responded. I gotta remember to check before submitting so I am not out of date. :frown:

cgracey · 2014-03-09 19:00

ctwardell wrote: »

Those test pin states, not bit31.

C.W.

When we had lots of 'D,S/#' opcode space, I added Jump-positive/negative instructions, but they've since been removed to make room for other things. They were freebies at one time, because they take almost zero logic to implement, given the opcode space is available.

ozpropdev · 2014-03-09 19:30

ctwardell wrote: »

Those test pin states, not bit31.

C.W.

Yikes!

Clearly I need to change my "brand" of coffee!

Where I went wrong on that one was a quick test of JP,JNP gave me the correct results.
Looking at it further I just happened to meet the "right states" in all of my tests.
1,2,3,4 ok positive
-1,-2,-3 ok negative
garbage in = garbage out!

Cheers
Brian

rogloh · 2014-03-09 19:32

cgracey wrote: »

The problem with implementing the 'current == original' test is that it requires another 32 bits of state storage per task (128 more flops per cog, or 1024 per chip), of which 32 bits must also be stored in the WIDEs for T3LOAD/T3SAVE. There are only 12 more bits available there, though. So, this could work if the value was only tracked to 12 bits. For what extra benefit this would give, it seems way too expensive, even if we limited the value checking to 9 bits, all around. You can always use another register to convey a feedback value.

I really hope we don't find too many more state resources per task to add from here on in. Those 12 free bits you still have left in the WIDE are going to get pretty precious now.

We will have to be selective. Any new hardware we request, like CRC etc., might want to get targeted per COG, not per task.

cgracey · 2014-03-09 19:36

rogloh wrote: »

I really hope we don't find too many more state resources per task to add from here on in. Those 12 free bits you still have left in the WIDE are going to get pretty precious now. We will have to be selective. Any new hardware we request, like CRC etc., might want to get targeted per COG, not per task.

If we can keep any new instructions ATOMIC, so that they work on an addressable register, it will be smooth sailing.

ctwardell · 2014-03-09 19:40

ozpropdev wrote: »

C.W.

Yikes!

Clearly I need to change my "brand" of coffee!

Where I went wrong on that one was a quick test of JP,JNP gave me the correct results.
Looking at it further I just happened to meet the "right states" in all of my tests.
1,2,3,4 ok positive
-1,-2,-3 ok negative
garbage in = garbage out!

Cheers
Brian

lol, no worries! I think I must have been drinking that same coffee the past few weeks, some would say a lot longer...

C.W.

mindrobots · 2014-03-09 20:39

ozpropdev wrote: »

Hi All
Having finished coffee #7 (or was it #8?) I just realised something.
Mindrobots was on the right track. We don't need new instructions.

We already have JP (Jump positive) and JNP (jump not positive) instructions and JPD,JNPD (delayed).
The JNP instruction is actually the "JB31" instruction jmg suggested.

Cheers
Brian

I knew you had the wrong brand of coffee when you said I was on the right track.

I still go back to my earlier statement that Chip's latest proposal was simple, elegant and propeller-like and gives enough hardware support to let the software play.

Cluso99 · 2014-03-09 21:38

cgracey wrote: »

The problem with implementing the 'current == original' test is that it requires another 32 bits of state storage per task (128 more flops per cog, or 1024 per chip), of which 32 bits must also be stored in the WIDEs for T3LOAD/T3SAVE. There are only 12 more bits available there, though. So, this could work if the value was only tracked to 12 bits. For what extra benefit this would give, it seems way too expensive, even if we limited the value checking to 9 bits, all around. You can always use another register to convey a feedback value.

I wasn't even thinking that they would require saving.

Agreed, way too expensive. The current TCHECK (or whatever) waiting for =0 is a fine solution.

How close is a release?

cgracey · 2014-03-09 21:51

Cluso99 wrote: »

How close is a release?

I've just got to test the T3SAVE/T3LOAD instructions, then update the documentation. So, hopefully early this week.

Ahle2 · 2014-03-10 00:06

cgracey wrote: »

The problem with implementing the 'current == original' test is that it requires another 32 bits of state storage per task (128 more flops per cog, or 1024 per chip), of which 32 bits must also be stored in the WIDEs for T3LOAD/T3SAVE. There are only 12 more bits available there, though. So, this could work if the value was only tracked to 12 bits. For what extra benefit this would give, it seems way too expensive, even if we limited the value checking to 9 bits, all around. You can always use another register to convey a feedback value.

Ouch!

It seems so easy when the sausage maker adds the ingredients. It is hard being a naive sausage trainee with dirty little fingers.

/Johannes

Baggers · 2014-03-10 02:26

cgracey wrote: »

I've just got to test the T3SAVE/T3LOAD instructions, then update the documentation. So, hopefully early this week.

Awesome news Chip, can't wait to play

Heater. · 2014-03-10 02:50

Ahle2,

It is hard being a naive sausage trainee with dirty little fingers.

Yep. It's hard to not get your fingers caught in the sausage machine!

MJB · 2014-03-10 05:53

jmg wrote: »

I still do not like the inefficiency of using 32 bits as a boolean.., and the asymmetry of message passing.

mov task3result,result ' optionally pass back result
mov task3req,#0 ' release task if PC not incremented past TPAUSE

- but I also do not see an opcode that neatly allows compact mixing of flags and params

I think we don't have any
JBITZ D, addr, #0..31
JBITNZ D, addr, #0..31

jmg · 2014-03-10 13:02

MJB wrote: »

I think we don't have any
JBITZ D, addr, #0..31
JBITNZ D, addr, #0..31

Those would be nice.

In #6159 Chip said there were sign-bit-testing jump opcodes, but they were removed to make room for others, and there are a couple of USB opcodes still to fit in somewhere...

cgracey · 2014-03-10 19:48

Today I got the preemptive tasking proven and added a few new instructions:

TPOP D,S/# 'pop task S/#'s LIFO stack into D
TPUSH D/#,S/# 'push D/# into task S/#'s LIFO stack
TJMP D/#,S/# 'set task S/#'s PC to D/# and reset all task-related states for a clean task restart

TPOP and TPUSH were needed to facilitate storing task 3's LIFO stack, in conjunction with T3SAVE/T3LOAD.

TJMP replaces the old JMPTASK instruction, but with reversed operands, so that it agrees with TPOP/TPUSH.

Right now, I'm rearranging the bit order of data that get stored into the WIDEs for T3SAVE/T3LOAD, so that it will be very simple to start up a thread by pre-setting the WIDE values: PC goes into the lower word of WIDE[0] and all other bits, through WIDE[7], can be cleared to 0's to give a sensible startup configuration for a thread.
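
As a sketch of what pre-setting the WIDE values could look like, here is a small Python helper. It assumes that each WIDE element is a 32-bit long and that "lower word" means the low 16 bits; the function name is made up, and the final bit layout was still being rearranged at the time of the post.

```python
WIDE_LONGS = 8          # WIDE[0] through WIDE[7]
WORD_MASK = 0xFFFF      # assumption: a "word" is 16 bits

def initial_wide_image(pc):
    """Build a startup image: PC in the lower word of WIDE[0], all else 0."""
    if not 0 <= pc <= WORD_MASK:
        raise ValueError("PC must fit in the lower word of WIDE[0]")
    wide = [0] * WIDE_LONGS
    wide[0] = pc & WORD_MASK
    return wide

print([hex(v) for v in initial_wide_image(0x0123)])
```

The appeal of the layout is visible here: starting a thread needs only one meaningful value, and everything else is a block of zeros.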

rogloh · 2014-03-10 20:03

Sounds good. I guess for TJMP you didn't want a new task to start up thinking it was in the middle of a REP loop or something crazy like that, so you had to clear things.

What are the known defaults at startup as far as task specific registers? For example are the PTRA/B/X/Y going to be cleared here as well?

cgracey · 2014-03-10 20:13

rogloh wrote: »

Sounds good. I guess for TJMP you didn't want a new task to start up thinking it was in the middle of a REP loop or something crazy like that, so you had to clear things.

What are the known defaults at startup as far as task specific registers? For example are the PTRA/B/X/Y going to be cleared here as well?

When you do a TJMP, none of the pointers or flags are affected, though everything indicating that some special state was in progress is reset, like REPS/REPD, delayed jump pending, TLOCK pending, THALT post first iteration, and a few other things I can't remember at the moment. It just makes sure, like you said, that the task starts off cleanly, without its shoe laces tied together or some hitch-up in its get-along.
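
The behaviour described above can be pictured with a toy Python model. The field names are invented for illustration and cover only the states named in the post (the "few other things" are not modelled), so treat this as a sketch rather than a description of the hardware.

```python
from dataclasses import dataclass

@dataclass
class TaskState:
    pc: int = 0
    ptra: int = 0                       # pointers: untouched by TJMP
    ptrb: int = 0
    z: bool = False                     # flags: untouched by TJMP
    c: bool = False
    rep_active: bool = False            # REPS/REPD in progress
    delayed_jump_pending: bool = False
    tlock_pending: bool = False

def tjmp(task, address):
    """Model of TJMP: set the PC and drop any 'in progress' state."""
    task.pc = address
    task.rep_active = False
    task.delayed_jump_pending = False
    task.tlock_pending = False
    return task

t = TaskState(pc=0x100, ptra=0x40, z=True, rep_active=True)
tjmp(t, 0x200)
print(t)
```

After the call the task has a new PC, the same pointers and flags, and none of the in-progress markers set.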

Bill Henning · 2014-03-10 20:23

Sounds good!

cgracey wrote: »

Today I got the preemptive tasking proven and added a few new instructions:

TPOP D,S/# 'pop task S/#'s LIFO stack into D
TPUSH D/#,S/# 'push D/# into task S/#'s LIFO stack
TJMP D/#,S/# 'set task S/#'s PC to D/# and reset all task-related states for a clean task restart

TPOP and TPUSH were needed to facilitate storing task 3's LIFO stack, in conjunction with T3SAVE/T3LOAD.

TJMP replaces the old JMPTASK instruction, but with reversed operands, so that it agrees with TPOP/TPUSH.

Right now, I'm rearranging the bit order of data that get stored into the WIDEs for T3SAVE/T3LOAD, so that it will be very simple to start up a thread by pre-setting the WIDE values: PC goes into the lower word of WIDE[0] and all other bits, through WIDE[7], can be cleared to 0's to give a sensible startup configuration for a thread.

Cluso99 · 2014-03-11 01:38

Chip,
Would you mind posting the Verilog for one of the simple P2 instruction sections? Perhaps an ADD would be nice.

That way I can try and keep the USB instruction Verilog using some of the same variable names, and also perhaps understand Verilog a little more.

cgracey · 2014-03-11 02:06

Cluso99 wrote: »

Chip,
Would you mind posting the Verilog for one of the simple P2 instruction sections? Perhaps an ADD would be nice.

That way I can try and keep the USB instruction Verilog using some of the same variable names, and also perhaps understand Verilog a little more.

All those sections are in the context of the whole design, with special flop declarations to expedite decoding, and output aimed at the result mux, so I think it might only turn you off. Whatever you come up with I'll have to implement, anyway, to make it fit into the overall design.

For what it's worth, I'll post a submodule that stands alone, without any greater context, so that you can see the whole picture of a functional section. The main cog Verilog code makes all kinds of global references everywhere, so any piece of it won't make much sense. Here is the 32x32 multiplier, though. This can be compiled all by itself:

// MLT

module		mlt
(
input		clk,
input		ena,
input		set,
input		sign,
input	[31:0]	d,
input	[31:0]	s,

output		done,
output	[63:0]	p
);


reg  [5:0]	n;
reg [33:0]	t;
reg [33:0]	m;
reg [69:0]	a;


// multiply

wire busy	= n[5];

always @(posedge clk or negedge ena)
if (!ena)
	n <= 6'b0;
else if (set || busy)
	n <= set ? 6'b101111
		 : n + 1'b1;

always @(posedge clk)
if (set || busy)
	t <= set ? {sign && d[31], d, 1'b0}
		 : {{2{t[33]}}, t[33:2]};

always @(posedge clk)
if (set)
	m <= {{2{sign && s[31]}}, s};


// booth functions

wire onex	= t[1] ^ t[0];
wire twox	= t[2] ^ t[1] && !onex;
wire notx	= t[2];

wire [35:0] b	= {36{notx}} ^ {{2{(twox || onex) && m[33]}} ^ 2'b10, {34{twox}} & {m[32:0], 1'b0} | {34{onex}} & m[33:0]};

wire [37:0] sum	= {2'b00, a[69:34]} + {2'b01, b};

always @(posedge clk)
if (set || busy)
	a <= set ? {3'b100, {{2{sign && d[31]}}, d} & 34'h2AAAAAAAA, 33'b0}
		 : {sum[37:0], a[33:2]};


// result

assign done	= !busy;

assign p	= a[63:0];

endmodule
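
For anyone wanting a software reference to compare against, here is a behavioural Python sketch of radix-4 Booth multiplication, the technique the "booth functions" section points to. It is not a register-by-register transcription of the module above. It assumes that `sign` selects signed treatment of both operands, and it uses 17 digit steps, which appears to match the counter running from 6'b101111 until it wraps.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_value(x, signed):
    """Interpret a 32-bit pattern as a signed or unsigned integer."""
    x &= MASK32
    if signed and (x & 0x80000000):
        return x - (1 << 32)
    return x

def booth_radix4_multiply(d, s, signed=False):
    """Return the 64-bit product pattern of two 32-bit operands."""
    d_val = to_value(d, signed)
    s_val = to_value(s, signed)
    # 34-bit two's-complement multiplier with an implicit 0 appended below bit 0.
    t = (d_val & ((1 << 34) - 1)) << 1
    acc = 0
    for i in range(17):                  # 17 digits cover 34 multiplier bits
        triple = (t >> (2 * i)) & 0b111  # bits 2i+1, 2i, 2i-1 of the multiplier
        # Booth digit in {-2, -1, 0, +1, +2}
        digit = (triple & 1) + ((triple >> 1) & 1) - 2 * ((triple >> 2) & 1)
        acc += digit * s_val * (4 ** i)
    return acc & MASK64

print(hex(booth_radix4_multiply(0xFFFFFFFF, 0xFFFFFFFF)))
print(hex(booth_radix4_multiply(0xFFFFFFFF, 2, signed=True)))
```

Each step consumes two multiplier bits and adds 0, ±1 or ±2 times the multiplicand, which is what the onex/twox/notx wires select in the Verilog.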

Cluso99 · 2014-03-11 03:03

Thanks Chip.

I was trying to keep the naming common, and perhaps some other things. Of course you will need to verify what I write anyway as I am such a novice here.
Do you just call the Z & C flags z and c?

I have posted a possible USB instruction over on the USB thread. It still has some bugs and I need to add a bit counter to count up an incoming byte.

cgracey · 2014-03-11 03:26

Cluso99 wrote: »

Thanks Chip.

I was trying to keep the naming common, and perhaps some other things. Of course you will need to verify what I write anyway as I am such a novice here.
Do you just call the Z & C flags z and c?

I have posted a possible USB instruction over on the USB thread. It still has some bugs and I need to add a bit counter to count up an incoming byte.

Yep, they're just called 'z' and 'c'.

That's great that you are coming up with an instruction!!! I've glanced through that thread and I can see that you guys are working out the parameters of what must be done. I think that is the hard part of anything - to qualify it with its parameters. The coding is always the easy part. Once you know WHAT to do, you're almost done.

Cluso99 · 2014-03-11 03:36

cgracey wrote: »

Yep, they're just called 'z' and 'c'.

That's great that you are coming up with an instruction!!! I've glanced through that thread and I can see that you guys are working out the parameters of what must be done. I think that is the hard part of anything - to qualify it with its parameters. The coding is always the easy part. Once you know WHAT to do, you're almost done.

Thanks Chip.

I am pretty sure about the CRC now. Looks like I can get the whole bit receive, including CRC and unstuffing, to work in 1 instruction. Just need to verify the unstuffing and then finish the byte counter. That way, after calling the instruction 8+ times I will have a byte assembled, together with the CRC accumulation. I am also using the z and c flags to return a completed byte flag, and the SE0/SE1 condition. This should really help shorten the USB receive routine.
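
To make the per-bit work concrete, here is a Python sketch of one receive step: bit unstuffing, LSB-first byte assembly and CRC16 accumulation. It is not the instruction posted on the USB thread. The class and helper names are hypothetical, the input is assumed to be already NRZI-decoded, and the SE0/SE1 detection and flag outputs are left out. The CRC uses the USB CRC16 polynomial 0x8005 with the register seeded to 0xFFFF.

```python
class UsbBitReceiver:
    """Hypothetical software model of a one-bit-per-call USB receive step."""

    def __init__(self):
        self.ones = 0        # consecutive 1 bits seen so far
        self.shift = 0       # byte being assembled, LSB first
        self.count = 0       # data bits collected in the current byte
        self.crc = 0xFFFF    # CRC16 register

    def step(self, bit):
        """Consume one bit; return a byte when 8 data bits are in, else None."""
        if self.ones == 6:
            # The bit after six 1s is a stuffed bit: drop it.
            # (A 1 here would be a bit-stuffing error; not modelled.)
            self.ones = 0
            return None
        self.ones = self.ones + 1 if bit else 0
        feedback = ((self.crc >> 15) & 1) ^ bit
        self.crc = (self.crc << 1) & 0xFFFF
        if feedback:
            self.crc ^= 0x8005
        self.shift |= bit << self.count
        self.count += 1
        if self.count == 8:
            byte, self.shift, self.count = self.shift, 0, 0
            return byte
        return None

def stuff(bits):
    """Sender-side helper for trying it out: insert a 0 after six 1s."""
    out, ones = [], 0
    for b in bits:
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            out.append(0)
            ones = 0
    return out

def lsb_first(data):
    return [(byte >> i) & 1 for byte in data for i in range(8)]

rx = UsbBitReceiver()
received = [b for b in (rx.step(bit) for bit in stuff(lsb_first([0xFF, 0x7E])))
            if b is not None]
print([hex(b) for b in received])
```

Calling `step` once per incoming bit mirrors the "8+ times per byte" usage described above: the extra calls are the ones that swallow stuffed bits.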
Hopefully the instruction can be used for transmit as well (after outputting the pins).

cgracey · 2014-03-11 05:07

Cluso99 wrote: »

Thanks Chip.

I am pretty sure about the CRC now. Looks like I can get the whole bit receive including crc and unstuffing to work in 1 instruction. Just need to verify the unstuffing and then finish the byte counter. That way, after calling the instruction 8+ times I will have a byte assembled, together with the crc accumulation. I am also using the z and c flags to return a completed byte flag, and the SE0/SE1 condition. This should really help shorten the USB receive routine.
Hopefully the instruction can be used for transmit as well (after outputting the pins).

Wow! It sounds like you've really nailed what needs to be done. Good job! This is going to be something really valuable.
