LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Bill Henning · 2012-12-11 13:06

#179 results suggest that it works OK - it looks like it just waits for the appropriate hub cycles for each multi-cycle instruction, and executes the other single cycle ones as well.

If it did not execute all four instructions in the quad, the results would be wrong. Unless I am missing something again.

I have not tried to nail down where the best place is for hub access instructions, but the organization above at least seems to work.

Multi-cycle instructions will have to be scheduled very carefully, but they should work (famous last words!) - I hope.

I'll have to try some other multi-cycle instructions.

Thanks again for the sample and explanation.

cgracey wrote: »

The only way overlapped QUADs can be executed is if the instructions within them are all single-cycle. Otherwise, the overlapping gets mixed up, affecting the 4th and maybe even 3rd instruction.

Sapieha · 2012-12-11 13:08

Hi Chip

If this answer was to me -- You don't understand me.

Can I?
SETQUAD XXXX0 Set first 4 longs that fill in COG's regs that have the same address as QUAD
RDQUAD XXXX
SETQUAD XXXX4 Set next 4 longs that fill in COG's regs that have the same address as QUAD
RDQUAD XXXX without moving data from QUAD

cgracey wrote: »

The only way overlapped QUADs can be executed is if the instructions within them are all single-cycle. Otherwise, the overlapping gets mixed up, affecting the 4th and maybe even 3rd instruction.

Bill Henning · 2012-12-11 13:13

ARGH.

I took another look at the last result, and I may still have a problem.

Based on the example Chip posted, it looks like the first instruction in a quad is not executed the first time as it is not ready yet.

I'm putting my thinking cap back on.

Bill Henning wrote: »

#179 results suggest that it works OK - it looks like it just waits for the appropriate hub cycles for each multi-cycle instruction, and executes the other single cycle ones as well.

If it did not execute all four instructions in the quad, the results would be wrong. Unless I am missing something again.

I have not tried to nail down where the best place is for hub access instructions, but the organization above at least seems to work.

Multi-cycle instructions will have to be scheduled very carefully, but they should work (famous last words!) - I hope.

I'll have to try some other multi-cycle instructions.

Thanks again for the sample and explanation.

Cluso99 · 2012-12-11 13:17

This is brilliant work by you all !!!
I am missing all the fun.

jmg · 2012-12-11 13:22

Bill Henning wrote: »

jmg,David:

I think Chip wants something *REALLY* simple in hardware, I don't believe he has the time / logic budget for anything complicated. If he did, I think all of us would inundate him with additional suggestions

My suggestion was simple, really just a Mux-unpack.
I'm not clear if this compression is intended as an Opcode-space/runtime option (which sounds like a high impact change as you affect the opcode decode pathways), or if it is intended as a Memory move (COG load) (de)compression, where I imagine there is more time headroom, and it does not impact opcode decode pathways. A single boolean flag, or a special opcode, is all that is needed to enable the Mux-unpack, plus the storage of the 16 bits of more-stable merge information

cgracey · 2012-12-11 13:28

Sapieha wrote: »

Hi Chip

If this answer was to me -- You don't understand me.

Can I?
SETQUAD XXXX0 Set first 4 longs that fill in COG's regs that have the same address as QUAD
RDQUAD XXXX
SETQUAD XXXX4 Set next 4 longs that fill in COG's regs that have the same address as QUAD
RDQUAD XXXX without moving data from QUAD

Oh, I understand now. This won't work because the floating QUAD window does not copy its contents to the underlying registers.

Sapieha · 2012-12-11 13:31

Hi Chip.

Thanks

Good to know.

P.S. I need to add this to my version of the PDF.

cgracey wrote: »

Oh, I understand now. This won't work because the floating QUAD window does not copy its contents to the underlying registers.

cgracey · 2012-12-11 13:41

Bill, Sapieha, and Ariba,

Are there still any lingering uncertainties about how execution from QUADs via RDQUAD works? Do you suspect there are yet problems in the Verilog code? I'm thinking we probably fixed what was the bug and now have software implementation details to sort out. Do you guys have an opinion on this?

Bill Henning · 2012-12-11 13:45

Chip,

I think I twisted my brain around it enough to see what is going on. At least I hope I did.

Your code sample:

        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

        NOP                     'do something for at least 3 clocks to allow QUADs to update
        NOP
        NOP

        NOP                     'do at least 1 instruction to get QUADs into pipeline

quad0   NOP                     'QUAD0..QUAD3 are now executable
quad1   NOP
quad2   NOP
quad3   NOP

This effectively means that to execute the four fetched instructions will take 9 clock cycles, throwing the next RDQUAD out of hub sync.
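A rough way to check that count is sketched below in Python. It assumes, as the count above does, that every instruction in the sample is charged exactly one clock and that a cog's hub window is 8 clocks; the function name and list encoding are made up for illustration and say nothing about real RDQUAD timing.

```python
HUB_WINDOW_CLOCKS = 8  # clocks between one cog's hub slots, as used in this thread

def clocks_used(sequence):
    """Count clocks, charging one clock per instruction (a simplification)."""
    return len(sequence)

# Chip's sample: RDQUAD, three filler instructions, one pipeline-fill
# instruction, then the four instructions executed from the QUAD window.
sample = ["RDQUAD"] + ["NOP"] * 3 + ["NOP"] + ["NOP"] * 4

used = clocks_used(sample)
print(used, used - HUB_WINDOW_CLOCKS)  # 9 1
```

Under those assumptions the sequence overruns the hub window by one clock, which is why the next RDQUAD would miss its slot.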

If the hub cache was copied to the cog registers instead of being mapped in, the following should work: (it does not now due to mapping)

	setquad	#quad1

	reps	#128,#16
	getcnt	start

        RDQUAD  pc

        ADD	pc,#61
        NOP
        NOP

        SETQUAD #quad0          'map QUADs to quad0..quad3

quad0   NOP                     'QUAD0..QUAD3 are now executable
	NOP
	NOP
	NOP

        RDQUAD  pc

        ADD	pc,#61
        NOP
        NOP

        SETQUAD #quad1          'map QUADs to quad0..quad3

quad1   NOP                     'QUAD0..QUAD3 are now executable
	NOP
	NOP
	NOP

One potential solution I see - which I suspect is far too late in the design cycle to implement would be:

EXECQUAD D

which just shoves the four values into the pipeline instead of mapping them in.

I do have another idea I am working on, but need a few minutes to chew on it before posting (ie try it out)

Sapieha · 2012-12-11 13:46

Hi Chip.

I can only answer for me

I think it now works --- Need only learning time to use it efficiently.

cgracey wrote: »

Bill, Sapieha, and Ariba,

Are there still any lingering uncertainties about how execution from QUADs via RDQUAD works? Do you suspect there are yet problems in the Verilog code? I'm thinking we probably fixed what was the bug and now have software implementation details to sort out. Do you guys have an opinion on this?

Bill Henning · 2012-12-11 13:48

I think it is probably OK but not capable of executing 4 instructions in 8 cycles as I was hoping.

I did post another idea, but suspect it's too big a change - can you look at it and comment?

cgracey wrote: »

Bill, Sapieha, and Ariba,

Are there still any lingering uncertainties about how execution from QUADs via RDQUAD works? Do you suspect there are yet problems in the Verilog code? I'm thinking we probably fixed what was the bug and now have software implementation details to sort out. Do you guys have an opinion on this?

David Betz · 2012-12-11 13:58

Bill Henning wrote: »

One potential solution I see - which I suspect is far too late in the design cycle to implement would be:

EXECQUAD D

which just shoves the four values into the pipeline instead of mapping them in.

I do have another idea I am working on, but need a few minutes to chew on it before posting (ie try it out)

I don't see how this would work. The whole reason we have a pipeline is that different things happen at each stage. I don't think you can just load all of the stages at once and expect it to work.

Bill Henning · 2012-12-11 14:00

Ok,

It is *impossible* to execute four instructions within the 8-cycle hub window of RDQUAD.

(there, someone will figure it out now)

I did come up with an alternate use, however I have to write a new test case for it.

What if a RDQUAD based LMM2 did not try to execute four instructions? What if we reserved one slot for a 32 bit constant or address?

This would actually make compiler writers lives MUCH easier, and it would use the fourth slot usefully the vast majority of the time.

Revised kernel loop:

	reps	#256,#8
	getcnt	start

	rdquad  pc
	add	pc,#16
	nop	' delay slot
	nop	' delay slot
arg	nop	' NOT executable - use for constant, address etc
	nop	' will execute, "arg" is a valid register reference for it as it is fixed in the cog memory map
	nop	' will execute, "arg" is a valid register reference for it as it is fixed in the cog memory map
	nop	' will execute, "arg" is a valid register reference for it as it is fixed in the cog memory map

This obeys all the rules Chip posted, and is very useful for LMM2.

Examples:

        mov   Rn,arg   .... load large immediate value

        mov   pc,arg    .... jump to arg

        add   D, arg   .... add a large const

        rdlong  Rn, arg   .... read any location

        wrlong   Rn, arg   ..... write to any location

        and so on.

It may actually be more useful than executing four instructions, as it makes addressing the whole hub, and large constants, trivial.
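To make the "three instructions plus one data slot" layout concrete, here is a toy Python model of it. This is only a sketch: the tuple encoding, the register names `r0`/`r1`, the `halt` pseudo-op and the function `run_lmm2` are all hypothetical, and it models nothing about pipeline timing. It only shows slot 0 being used as data by the three executed slots, with `pc` advanced by 16 before they run, as in the kernel loop above.

```python
# Toy model of the quad layout: slot 0 is data ("arg"), slots 1..3 execute.
# All names and the tuple encoding are hypothetical, for illustration only.

def run_lmm2(hub, pc=0, max_quads=100):
    """Run quads from `hub` (dict: byte address -> 4-tuple) until 'halt'."""
    regs = {"pc": pc, "r0": 0, "r1": 0}
    for _ in range(max_quads):
        arg, *slots = hub[regs["pc"]]
        regs["pc"] += 16               # kernel advances pc before the slots run
        for op in slots:
            name = op[0]
            if name == "nop":
                continue
            if name == "halt":
                return regs
            dst, src_name = op[1], op[2]
            src = arg if src_name == "arg" else regs[src_name]
            if name == "mov":
                value = src            # "mov pc,arg" therefore acts as a jump
            elif name == "add":
                value = regs[dst] + src
            else:
                raise ValueError("unknown op: " + name)
            regs[dst] = value & 0xFFFFFFFF
    raise RuntimeError("no halt reached")

NOP = ("nop",)
program = {
    0:  (0x12345678, ("mov", "r0", "arg"), NOP, NOP),   # load large immediate
    16: (48,         ("mov", "pc", "arg"), NOP, NOP),   # jump to address 48
    32: (0xDEAD,     ("mov", "r1", "arg"), NOP, NOP),   # skipped by the jump
    48: (1,          ("add", "r0", "arg"), ("halt",), NOP),
}

result = run_lmm2(program)
print(hex(result["r0"]), result["r1"])
```

Note that in this model the slots following a `mov pc,arg` in the same quad still run, since they have already been fetched; whether that matches the hardware is not something this sketch can show.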

I think RDQUAD can be declared working, and now I will verify that my new LMM2 kernel idea works.

Roy Eltham · 2012-12-11 14:01

Bill,
Have you considered the idea of using RDLONGC to make a LMM loop instead? Where the first one takes the long hit, but the next 3 are 1 cycle? They don't rely on the mapping thing, and so might have less problems with timing causing aliasing.
It might not be as simple and small, but could work more reliably with less restrictions on the code being fed through it.

Bill Henning · 2012-12-11 14:03

I don't know how it works internally - if there is an instruction queue, with a stage count, it would; if there is a queue of micro-ops, it won't

Chip knows

Besides, I have a useful work-around - I think.

David Betz wrote: »

I don't see how this would work. The whole reason we have a pipeline is that different things happen at each stage. I don't think you can just load all of the stages at once and expect it to work.

Roy Eltham · 2012-12-11 14:05

My understanding is that it doesn't work either way you describe at all. It's much simpler.

Bill Henning wrote: »

I don't know how it works internally - if there is an instruction queue, with a stage count, it would; if there is a queue of micro-ops, it won't

Chip knows

Besides, I have a useful work-around - I think.

Bill Henning · 2012-12-11 14:05

Hi Roy,

Yes, I did - see post#1 in this thread for alternates, however the best I was able to do with those variations is 3 LMM single cycle instructions per 8 clock cycles; RDQUAD would have allowed for four.

Once I accepted I could not get it working at 4/8 I did come up with another great use for RDQUAD, which should actually make a lot of the compiler work (32 bit constants and addresses) much easier, see a post above.

Roy Eltham wrote: »

Bill,
Have you considered the idea of using RDLONGC to make a LMM loop instead? Where the first one takes the long hit, but the next 3 are 1 cycle? They don't rely on the mapping thing, and so might have less problems with timing causing aliasing.
It might not be as simple and small, but could work more reliably with less restrictions on the code being fed through it.

Bill Henning · 2012-12-11 14:07

Frankly, even if possible, I think EXECQUAD is more trouble (delay) than it would be worth; I was doing a "what if"

Roy Eltham wrote: »

My understanding is that it doesn't work either way you describe at all. It's much simpler.

Roy Eltham · 2012-12-11 14:10

The problem with your last idea/use is that it's not really the same thing anymore. It requires a compiler (and one with special stuff in it to use this), while the traditional LMM could be used with a simple assembler that translated branches.

So, sure you can maybe convince the Prop2GCC effort to have a kernel using this "special" LMM setup, but I think it needs a different name from LMM, and that a more traditional LMM should be what we call LMM.

cgracey · 2012-12-11 14:12

I think you CAN have 4 instructions executing every 8 clocks, as long as they are single-cycle, so that RDQUADs can be overlapped. Any agreement?

Cluso99 · 2012-12-11 14:15

cgracey wrote: »

Oh, I understand now. This won't work because the floating QUAD window does not copy its contents to the underlying registers.

Chip:
I guess it might be a little late now, but just in case...

If we were not using any I/O pins, as is often the case in the main processing routine, could the QUAD window be mapped to $1F8-$1FB (the PINx registers)?

This would have the benefit of not taking any normal usable cog space providing the pins were not being used.
I know you use $1F8-$1FF to clear the mapping, but $1FA-$1FF (or just $1FF) could still be used for this.

re SEUSSF/SEUSSR instructions
I know there is some form of byte swapping instruction yet to be documented. Is this capable of performing an endian swap? If not, could these instructions be used for a word swap and a long swap ???

It may be too complex, but could they be used to copy to/from the quad cache to a register set (cog ram). This would allow a quick copy from/to hub cog using quads and leave some instructions free as well ???

User Name · 2012-12-11 14:15

Bill Henning wrote: »

Releasing the Terasic loadables was a brilliant move - it is letting us find bugs before the shuttle run.

Agreed! But it has surprised me how much mileage you and others have gotten out of the DE0 single cog emulation. What a remarkable thing!

I've been working night and day on an unrelated project. Now that it's nearly finished, I'm looking forward to dusting off my DE0 and torturing my own small part of the P2 instruction set.

Roy Eltham · 2012-12-11 14:17

Chip,
The main problem with that is that limiting LMM code to only 1 cycle instructions would be very undesirable. You really need to be able to do hub instructions and WAITxxx in LMM code. I think the focus should switch to using things that don't try to do overlapping of the mapped registers.

cgracey wrote: »

I think you CAN have 4 instructions executing every 8 clocks, as long as they are single-cycle, so that RDQUADs can be overlapped. Any agreement?

Bill Henning · 2012-12-11 14:17

Roy,

My previous four instruction attempt - which did not work as expected - would have required just as much special work.

Short of sticking with the simplest 1 LMM instruction per 8 cycle hub window PropGCC will require significant re-work.

Ignoring FCACHE/FLIB for now:

- Without significant gcc re-work LMM code will run at most 1/8 speed of native code, which is not competitive with ARM.

- using RDLONGC instead will improve things somewhat, reaching at most 1/6 clock speed of native code (see experiment#1)

More effective use of RDLONGC or RDQUAD (four or three instruction version) will require significant GCC rework, but provide better speed.
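The speed ratios being compared can be worked out with a few lines of Python. This sketch assumes the 160MHz clock from the thread title, an 8-clock hub window, and native code running one instruction per clock (which is what the "1/8 speed" figure above implies); the function names are made up for illustration.

```python
from fractions import Fraction

CLOCK_MHZ = 160          # from the thread title
HUB_WINDOW_CLOCKS = 8    # clocks per hub window, as used in this thread

def lmm_mips(instructions_per_window):
    """LMM instructions per microsecond for a given count per hub window."""
    return Fraction(CLOCK_MHZ * instructions_per_window, HUB_WINDOW_CLOCKS)

def fraction_of_native(instructions_per_window):
    """Speed relative to native code, assuming one native instruction per clock."""
    return Fraction(instructions_per_window, HUB_WINDOW_CLOCKS)

for n in (1, 3, 4):
    print(n, lmm_mips(n), fraction_of_native(n))
```

With these assumptions, one instruction per window gives 20 MIPS, three give 60 MIPS and four give 80 MIPS.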

Roy Eltham wrote: »

The problem with your last idea/use is that it's not really the same thing anymore. It requires a compiler (and one with special stuff in it to use this), while the traditional LMM could be used with a simple assembler that translated branches.

So, sure you can maybe convince the Prop2GCC effort to have a kernel using this "special" LMM setup, but I think it needs a different name from LMM, and that a more traditional LMM should be what we call LMM.

Bill Henning · 2012-12-11 14:20

Yes, but that would not be very useful.

Please take a look at post #194 - I think that will work, and give pretty good results.

While it would only execute 3 simple instructions every eight clock cycles, those instructions would be made much more useful with the fourth slot being used for a constant or address.

Take a look at the illustrations I provided; all the MVI macros, FJMP primitives become unnecessary, saving many cycles.

cgracey wrote: »

I think you CAN have 4 instructions executing every 8 clocks, as long as they are single-cycle, so that RDQUADs can be overlapped. Any agreement?

David Betz · 2012-12-11 14:25

Okay, here is a bizarre idea. How about making the PC 30 bits long instead of just 9 and have it fetch instructions using the equivalent of RDLONGC if any of the high address bits are set? Then you can run code directly out of hub memory with no LMM loop. I'm sure there is some fatal flaw in this idea....

cgracey · 2012-12-11 14:27

Cluso99 wrote: »

Chip:
I guess it might be a little late now, but just in case...

If we were not using any I/O pins, as is often the case in the main processing routine, could the QUAD window be mapped to $1F8-$1FB (the PINx registers)?

This would have the benefit of not taking any normal usable cog space providing the pins were not being used.
I know you use $1F8-$1FF to clear the mapping, but $1FA-$1FF (or just $1FF) could still be used for this.

re SEUSSF/SEUSSR instructions
I know there is some form of byte swapping instruction yet to be documented. Is this capable of performing an endian swap? If not, could these instructions be used for a word swap and a long swap ???

It may be too complex, but could they be used to copy to/from the quad cache to a register set (cog ram). This would allow a quick copy from/to hub cog using quads and leave some instructions free as well ???

I suppose you could map the QUADs up there, but you usually need to surround them with other instructions to facilitate RDQUAD, etc.

cgracey · 2012-12-11 14:30

David Betz wrote: »

Okay, here is a bizarre idea. How about making the PC 30 bits long instead of just 9 and have it fetch instructions using the equivalent of RDLONGC if any of the high address bits are set? Then you can run code directly out of hub memory with no LMM loop. I'm sure there is some fatal flaw in this idea....

Sounds really neat but it strains my mind to consider how this would work with the pipeline, which is normally used to read hub memory.

David Betz · 2012-12-11 14:32

cgracey wrote: »

Sounds really neat but it strains my mind to consider how this would work with the pipeline, which is normally used to read hub memory.

I guess the pipeline would have to stall every time a new RDQUAD had to be done since it couldn't access the hub at the same time as the RDQUAD. I told you it was a bizarre idea! :-)

Cluso99 · 2012-12-11 14:38

Bill, without my morning coffee (I don't often drink coffee), here are a few ideas (without much thought)...

1. I wonder if you could use the task switcher to readquads and execute them as a second task ??? This would at least make the flags available for the LMM load routine.

2. I wonder about going back to using the REPS instruction and/or flags to perform the loop.

3. As long as the LMM instructions in the quad are single cycle, there are no problems. The compiler will need to take special care for multi-cycle instructions - you already do that for calls/jumps so maybe a similar mechanism will be required here too.

Maybe these suggestions may at least provoke some further thoughts.
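As a sketch of what the "special care" in point 3 might look like on the tool side, the Python below flags quads that contain anything other than single-cycle instructions. The classification is a crude assumption: the thread only says that hub instructions and WAITxxx are not single-cycle, so this treats any mnemonic starting with RD, WR or WAIT as multi-cycle. The names `MULTI_CYCLE_PREFIXES`, `is_single_cycle` and `quads_safe_to_overlap` are hypothetical.

```python
# Crude, assumed classification: hub reads/writes and WAITxxx are multi-cycle.
MULTI_CYCLE_PREFIXES = ("RD", "WR", "WAIT")

def is_single_cycle(mnemonic):
    """Return True unless the mnemonic matches an assumed multi-cycle prefix."""
    return not mnemonic.upper().startswith(MULTI_CYCLE_PREFIXES)

def quads_safe_to_overlap(program):
    """Split a list of mnemonics into groups of four and report, per quad,
    whether every instruction in it is single-cycle."""
    quads = [program[i:i + 4] for i in range(0, len(program), 4)]
    return [all(is_single_cycle(m) for m in quad) for quad in quads]

code = ["ADD", "MOV", "NOP", "SUB",
        "RDLONG", "ADD", "NOP", "NOP"]
print(quads_safe_to_overlap(code))  # [True, False]
```

A compiler or assembler could use a check like this to decide which quads need a slower, non-overlapped path, much as calls and jumps already get special handling.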
