I like consistent in all modes with spacers the best. It's going to flow pretty naturally after a few tries. Pretty soon, one just thinks, "OK, I've got a rep coming, what can I stuff in there for free?"
Edit: Didn't see the next post. I'm OK with it just consuming the cycles too. Easier that way. Fine by me.
I like the simplicity of clearing the pipeline and having the repeat block immediately follow the REPS/REPD, but it IS wasteful, and it disallows an instruction or three that are sometimes very handy to have right before the repeating block. I'm going to try to implement it so that it codes up like single-task instances currently do, no matter the task mode. This will be maximally efficient.
Sapieha and Potatohead, thank you for your timely input on this matter.
It would only waste three clocks initially. The loops would be zero-overhead.
Any costs within the loop are the worst, by far. That zero overhead is more important.
Sure, it is nice to have slots you can optionally fill, but if they are outside the REPS, that time is less important.
With the larger PASM code capability, the chance of users not knowing what is in all their modules increases.
So operational safety, to me, becomes more important.
Especially given the partial way this fails: in a large project it could be a nightmare to track down, and there could be some delay between the user's edit and noticing a problem somewhere else, so they may not 'connect the dots' of cause and effect.
I would also make the mnemonic address based (as Blackfin does), but that is not a binary change, just a code maintenance / clarity one.
I'm talking about making coding consistent in all task modes.
That is a big plus. It allows compact libraries that will not break, no matter how users scramble their tasks.
If there is room, maybe opcodes to allow both, but I would rank 'coding consistent' above 'coding compact' in highly special cases. (And of course, both together would be magic.)
Your solution of some cycle-swallow still allows REPS on GETPIX, right?
It turned out to be very simple to make all task mixes code up just like single-task for REPS/REPD.
All I had to do was add 1 to the initial instructions-before-looping count for REPS if the lower-stage pipeline's task ID mismatched the REPS' task ID. For REPD, I add 0..3 based on how many task mismatches there are in the pipe, comparing the executing task ID to three lower-stage pipeline task IDs. This way, it accommodates those spacer instructions consistently, whether they would actually be needed, or not, based on the task mix. So, everything will code up with the current single-task rules.
It's compiling right now. Once it works, I'll add a REP block for every task. Then, things will be simple.
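For illustration, a rough C model of the adjustment described above; the names, and treating the lower pipeline stages as a three-entry task-ID array, are guesses taken from the post, not the actual Verilog:

    /* Initial-count adjustment: pad the first pass so the spacer
       instructions are executed consistently in any task mix. */

    typedef struct {
        int task_id;         /* task that issued the REPS/REPD     */
        int insts_per_loop;  /* #instructions from the operand     */
    } rep_setup_t;

    int pipe_task[3];        /* task IDs in the three lower stages */

    /* REPS resolves early, so only one lower stage matters:
       add 1 if that stage holds a different task's instruction. */
    int reps_initial_count(const rep_setup_t *r)
    {
        return r->insts_per_loop + (pipe_task[0] != r->task_id ? 1 : 0);
    }

    /* REPD resolves later, so all three lower stages matter:
       add one per task-ID mismatch, giving 0..3 extra instructions. */
    int repd_initial_count(const rep_setup_t *r)
    {
        int adjust = 0;
        for (int s = 0; s < 3; s++)
            if (pipe_task[s] != r->task_id)
                adjust++;
        return r->insts_per_loop + adjust;
    }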
Magic.
It did seem to me the fix would be to get it opcode-sync'd rather than time-based, and only the HW knows what the pipeline is up to.
One detail - if there is a WAIT #3 in the REPS loop, does the state engine count opcodes or cycles?
'WAIT #3' will stall the pipeline for three clocks, always. The REPS circuit counts opcodes, not cycles.
So it counts as one instruction to the REPS? Which means the state engine now effectively counts real opcodes, and it could easily be address-based at the PASM level (makes for clearer code; the same binary opcode is used)?
Yes.
It uses relative addressing. It loads an initial instruction count, which is the #instructions from the REPS/REPD operand (plus pipeline mismatches, for spacer-instruction coding consistency); then, after that many instructions from the REPS/REPD task execute, it subtracts the #instructions value from the PC and repeats this for the specified number of loops. So, the first loop may have 1..3 more instructions than subsequent loops, in order to accommodate the spacers. The programmer doesn't realize this, though. He just follows the rules: 1 spacer for REPS, 3 spacers for REPD.
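A behavioral C sketch of that state engine as described: it counts retired opcodes from the owning task (so a stalled WAIT #3 still counts once) and subtracts the #instructions value from the PC for each remaining loop. All names are hypothetical, and addresses are in instruction units:

    typedef struct {
        int active;       /* a repeat block is armed                  */
        int task_id;      /* task that owns this repeat block         */
        int insts;        /* #instructions per loop (from operand)    */
        int first_count;  /* insts + spacer adjustment for first pass */
        int remaining;    /* loop iterations left                     */
        int executed;     /* opcodes of this task executed this pass  */
    } rep_state_t;

    /* Called once per retired instruction, not once per clock, so a
       multi-cycle stall such as WAIT #3 still counts exactly once. */
    void rep_on_instruction(rep_state_t *st, int task_id, int *pc)
    {
        if (!st->active || task_id != st->task_id)
            return;                       /* other tasks are ignored   */

        if (++st->executed < st->first_count)
            return;                       /* still inside this pass    */

        if (--st->remaining > 0) {
            *pc -= st->insts;             /* relative jump back        */
            st->executed = 0;
            st->first_count = st->insts;  /* later passes: no spacers  */
        } else {
            st->active = 0;               /* done repeating            */
        }
    }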
Cool. Buried some way back is the Blackfin REPS equivalent, and they use a mnemonic form that has 3 params (essentially the same binary engine):
Count, LoopStart and LoopEnd
This makes code clearer, avoids line-counting, and the labels auto-adjust to any edits.
This form also allows PASM to easily check that the spacers actually match up with what the opcode will run over once (i.e. the user gets what they hoped for).
Thus, if someone changes REPS to REPD, and did nothing else at all, a warning would result.
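Something like that check is easy to sketch; here addresses are assumed to be in longs (one instruction per long), the names are made up, and the 1-spacer/3-spacer rule comes straight from the posts above:

    #include <stdio.h>

    enum rep_kind { REP_S, REP_D };

    /* Label-based form: REPx count, LoopStart, LoopEnd.  Returns the
       #instructions operand and warns if the spacer count between the
       REPx and LoopStart doesn't match the rule (1 for REPS, 3 for REPD),
       e.g. when a REPS is changed to REPD with nothing else edited. */
    int rep_resolve(enum rep_kind kind, int repx_addr,
                    int loop_start, int loop_end)
    {
        int required = (kind == REP_S) ? 1 : 3;
        int spacers  = loop_start - (repx_addr + 1);

        if (spacers != required)
            fprintf(stderr, "warning: %s wants %d spacer(s), found %d\n",
                    kind == REP_S ? "REPS" : "REPD", required, spacers);

        return loop_end - loop_start + 1;   /* instructions in the loop */
    }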
Now that we have consistent rules, we could do something like that in the assembler. Thanks for all your help, jmg.
Brilliant Chip! Way to go! That's a big win. Both consistency and performance achieved.
Does this have further applicability? How many other instructions have spacer behaviour?
None, if I'm remembering correctly.
Wait. The delayed branches have this behavior. I've got to think about what can be done there. That's a little more complex, since branches don't have states connected to them like REPS/REPD.
It would be neat to be able to write super fast single-task code with 3 delay slots after branches, and have it still run in multi-tasking. This needs some thinking.
Oh dear, there goes any sleep...
The FPGA compile finished and REPS/REPD now program consistently for any task mix. I just need to make four of those REPS/REPD blocks next, so each task can have one.
I'm really intrigued by making things so that single-task code always runs in any multi-task situation, being optimal for single-task, but not broken or impaired during multi-task.
I would stay up all night, but I've got to meet David Betz tomorrow morning, so I must sleep.
Thanks for all your input, Guys. These new changes will simplify programming, as well as the documentation. The less there is to document, the better things are.
So if I understand correctly, the REPx problem will be resolved by hardware. It sounds like there is a similar issue for JMPD. Will that also be resolved by hardware?
1) Whatever the programmer writes using REPx will work unchanged no matter what tasks are going on.
2) REPx will be usable in all threads at the same time.
This removes all the surprises jmg was worried about. Great.
Gaining a REPx circuit for each task will be very helpful.
C.W.
He is thinking about it. I hope he finds an equally good solution as for the REPx problem; that would make multitasking a lot easier again.
Andy
Agreed. Thanks Chip. Sometimes I find it difficult to know when it makes sense to ask as opposed to document. Nice to know you see these discussions and can think on them for us.
And of course, I and probably others are reluctant to ask just because of the timeline.
I thought about this before falling asleep, and I realized how to make all delayed branches execute three trailing instructions, no matter the task mix. Instead of branches that execute on their own clock cycle, it will be necessary to create flops for the branch addresses, along with counters which count up to three. If there are not three same-task instructions in the pipeline at the time of the delayed-jump instruction, the flop circuit engages and, once enough future same-task instructions have executed to cover the deficit of three, it then does the branch. This way, all delayed jumps (JMPD/CALLD/RETD/etc.) execute three trailing instructions from the same task. This will make high-speed single-task code also work in multitasking and eliminate all the complex considerations of what's in the pipeline when doing delayed jumps.
So, we've standardized REPS/REPD and delayed-jump behavior, making them adhere to optimal single-task coding style, no matter the task mix.
This is going to be great, because it will allow optimally-timed single-task code to be written, which will still work with any task mix.
Are there any other pipeline-conditional phenomena that you guys can think of that could be standardized?
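A rough behavioral model of that delayed-branch latch, ignoring pipeline details: the branch address sits in a flop and a small counter waits for three same-task trailing instructions before the branch is finally taken. All names are guesses, not Chip's actual circuit:

    typedef struct {
        int pending;    /* a delayed branch is latched              */
        int task_id;    /* task that issued the JMPD/CALLD/RETD/... */
        int target;     /* latched branch address                   */
        int remaining;  /* same-task delay slots still to execute   */
    } delayed_branch_t;

    /* A delayed jump latches its target instead of branching at once. */
    void db_issue(delayed_branch_t *b, int task_id, int target)
    {
        b->pending   = 1;
        b->task_id   = task_id;
        b->target    = target;
        b->remaining = 3;            /* three trailing instructions  */
    }

    /* Called per retired instruction; once three trailing instructions
       from the same task have executed, the branch is taken. */
    void db_on_instruction(delayed_branch_t *b, int task_id, int *pc)
    {
        if (!b->pending || task_id != b->task_id)
            return;
        if (--b->remaining == 0) {
            *pc = b->target;         /* third slot done: branch now  */
            b->pending = 0;
        }
    }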
Now go get some good ZZZzzz's
Will think on more pipeline phenomena, as I am sure others will as well.
Yes, let's all go think on those.
I have been wondering if REPD is really required. This instruction has 2 things that REPS doesn't - conditional execution and the use of D/#.
I don't think I would use conditional execution because I think I would be more likely to perform a conditional jump around a REPD loop.
So maybe the REPS could add a D style variant. Would this require 2 nops or could it be done with 1?
This could simplify the REP to a single REPS. If so, then perhaps iiiiii could be held in cccc + zc, and nnnnnnnnnnnnnnnnnn in D+S. Then the I bit would indicate an immediate value; otherwise S would hold a reg# where the repeat count is held. The immediate count could be set by movword. Or could we preset it using AUGS?
This might simplify the REPD/REPS logic?
I agree, Ray.
I don't think I would use REPD, particularly the conditional variant. As you suggested, a JMP past the loop makes more sense.
Having to apply conditions to all the instructions within the loop is not practical.
I suspect that a lot of REPx loops will more than likely contain instructions using INDx, which CANNOT be conditional.
A variant of REPS that uses D as the count would be nice.
Brian
REPS executes in pipeline stage 2, before D and S register contents become available in stage 4. For this reason, REPS' repeat value is a constant and it requires only one spacer instruction. REPD exists to offer register-content-based repeating, but must execute in stage 4, and therefore requires three spacer instructions.
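Those two data points suggest a simple relationship - an instruction that resolves in pipeline stage N already has N-1 later fetches behind it, hence the spacer counts - though this generalisation is an inference from the two cases above, not something stated here:

    /* Spacers implied by the resolve stage, per the two cases above:
       REPS resolves in stage 2 -> 1 spacer, REPD in stage 4 -> 3. */
    static inline int spacers_needed(int resolve_stage)
    {
        return resolve_stage - 1;
    }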
Thanks Brian and Chip.
Might it then be simpler to have REPS with a variant that can be preceded by a MOV xxx,S that operates in 1 cycle and works similarly to AUGS/D to preload the count?
The advantage would be 1+2 clocks vs 4 (REPD), but it would use an extra instruction. Would this be simpler in gates? If not, there is no point to change.
Comments?
REPS shows 17 bits for count when the instruction encoding shows 16.
Brian
The 16-bit field has no 0 mapping, so the number is 1..2^16 instead of 0..2^16-1?
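If that guess is right - the hardware adds one, so a 16-bit field spans 1..2^16 - the mapping would just be the following (purely hypothetical; it isn't confirmed anywhere above):

    #include <stdint.h>

    /* Hypothetical count-minus-one encoding for a 16-bit field. */
    static inline uint16_t rep_count_encode(uint32_t count)   /* 1..65536 */
    {
        return (uint16_t)(count - 1u);
    }

    static inline uint32_t rep_count_decode(uint16_t field)
    {
        return (uint32_t)field + 1u;       /* 0..65535 -> 1..65536 */
    }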