LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Bill Henning · 2012-12-10 11:08

cgracey wrote: »

Not if you wrap it in a loop. Just make sure the RDQUAD is followed by the QUAD registers with a max of one instruction between the RDQUAD and the mapped QUADs. That way, the QUADs will execute the *previous* RDQUAD longs without one or more of the *recent* RDQUAD longs slipping in and causing woes.

Here is my latest experiment:

	reps	#256,#8
	getcnt	start

'-- must be on a XXXXXXX00 address in the cog
ins1	nop
ins2	nop
ins3	nop
ins4	nop
ins5	setquad	#ins1
	rdquad	pc
ins6	add	pc,#16
ins7	nop

And the result looks correct:

>n
>2000.201f
02000- 000000FE 0000007F 000000FE 00000080   '................'
02010- 00000BFD 0000007E 00000000 00000000   '....~...........'

The difference here is that I don't SETQUAD the cache in until after the NOPs are executed on the first pass, which avoids executing unknown code.

I'll try some more code mixes that I can verify a bit later.

cgracey · 2012-12-10 11:23

I just looked at the Verilog code and did some tests to confirm mapped QUAD behavior:

After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:


        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

	NOP			'do something for at least 3 clocks to allow the QUADs to update
	NOP
	NOP

	CMP     quad0,quad1     'mapped QUADs are now accessible via D and S

After a RDQUAD, mapped QUAD registers can be executed after three clocks and two instructions:


        SETQUAD #quad0          'quad0 must be a quad-aligned address
        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

        NOP                     'do something for at least 3 clocks to allow the QUADs to update
        NOP
        NOP

        NOP                     'do at least 2 instructions to allow QUADs to enter pipeline
        NOP

quad0   NOP                     'QUAD0..QUAD3 are now executable
quad1   NOP
quad2   NOP
quad3   NOP

This is different than what I initially documented. Sorry for the confusion.
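As a rough way to keep the two rules straight, here is a small Python sketch (a model only, not PASM and not derived from the Verilog). It assumes the reading suggested by the two listings above: three clocks after the RDQUAD before D/S access sees the new data, and two further instructions after those three clocks before the window can be executed. The function and constant names are made up for illustration.

```python
# Hypothetical model of the timing rules described in this post.
DS_ACCESS_CLOCKS = 3         # clocks after RDQUAD before D/S access sees new data
EXEC_EXTRA_INSTRUCTIONS = 2  # further instructions before the window is executable


def can_access_via_ds(clocks_since_rdquad: int) -> bool:
    """True once the mapped QUADs should be readable through D and S."""
    return clocks_since_rdquad >= DS_ACCESS_CLOCKS


def can_execute(clocks_since_rdquad: int, instructions_after_wait: int) -> bool:
    """True once the mapped QUADs should be safe to execute.

    instructions_after_wait counts instructions issued after the
    three-clock wait (assumed reading of the second listing).
    """
    return (can_access_via_ds(clocks_since_rdquad)
            and instructions_after_wait >= EXEC_EXTRA_INSTRUCTIONS)
```

With this reading, `can_execute(3, 2)` holds while `can_execute(3, 1)` does not.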

Bill Henning · 2012-12-10 11:33

Thank you for checking.

I guess I will need to experiment more to see if it can somehow be squeezed into an 8 or 16 instruction loop.

What worries me is that the loop I have in post #62 may execute a mix of fresh/stale cache data.

I'll think about it for a while, and come up with some test cases I can run on my nano.

cgracey · 2012-12-10 12:36

I'm finding that if the RDQUAD is the first instruction in the REPS block, it doesn't get executed. Have you guys noticed this?

Ariba · 2012-12-10 12:38

Thank you Chip and Bill

With the new timing information, I was able to get it to work:

' Quad LMM
DAT
        org 0
lmm     reps #511,#7
        setquad #ins1
         rdquad pc
         nop
ins1     nop            'quad aligned
         nop
         nop
         nop
         add pc,#16
        jmp #lmm        'after 511 repeats

pc      long @lmmcode+$0E80
t1      long 0

        long 0[4-($-($ & $1FC))]  'quad align

lmmcode setp #2         'toggle pin2 25%        
        notp #4
        clrp #2
        sub pc,#32      'relative jump (delayed by 1 quad)

        nop
        notp #4         'toggle pin4 50%
        mul t1,t1       '2 cycle instr
        nop
                        '< jump happens here

The loop is 7 instructions but takes 8 cycles, so the sync to the hub slot also works now.
The unused cycle in the loop allows for one two-cycle instruction inside an LMM-Quad packet without slowing down the instruction rate (shown with the MUL instruction).
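For reference, the instruction rate implied by these numbers can be worked out with a few lines of Python. This is only arithmetic on figures from this thread (four hub instructions per RDQUAD, 8 clocks per loop, the 160MHz clock from the thread title); the function name is made up.

```python
def lmm_mips(clock_mhz: float, instructions_per_loop: int, clocks_per_loop: int) -> float:
    """Effective LMM instruction rate in MIPS for one loop shape."""
    return clock_mhz * instructions_per_loop / clocks_per_loop


# 4 quad instructions every 8 clocks at 160 MHz
print(lmm_mips(160, 4, 8))  # 80.0
```

That is 2 clocks per LMM instruction, as long as each packet holds at most one two-cycle instruction.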

The problem with Quad-LMM will be the jumps, which are delayed by up to 7 instructions and can only go to quad-aligned locations.
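To illustrate the delayed jump, here is a minimal Python model of the loop (a sketch, not PASM). It assumes that the window executes the quad fetched on the previous pass, and that a jump inside a quad is just a change to pc that takes effect before the `add pc,#16`. The start address `0x100` is hypothetical, as are the names.

```python
QUAD_BYTES = 16  # one quad = 4 longs = 16 bytes of hub memory


def trace_quads(start_pc, pc_effects, iterations):
    """Return the hub addresses of the quads executed, pass by pass.

    pc_effects maps a quad's hub address to the signed change its
    instructions make to pc (-32 for 'sub pc,#32'; missing means 0).
    """
    pc = start_pc
    previous = None   # quad fetched on the previous pass, not yet executed
    executed = []
    for _ in range(iterations):
        fetched = pc                          # rdquad pc
        if previous is not None:
            executed.append(previous)         # window runs the previous quad
            pc += pc_effects.get(previous, 0)
        pc += QUAD_BYTES                      # add pc,#16
        previous = fetched
    return executed


# First quad holds 'sub pc,#32', second quad is the delay-slot quad.
print([hex(a) for a in trace_quads(0x100, {0x100: -32}, 5)])
# ['0x100', '0x110', '0x100', '0x110']
```

In this model the quad after the one holding the `sub pc,#32` still runs before the jump lands, which matches the "delayed by 1 quad" comment in the listing.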

Andy

Ariba · 2012-12-10 12:45

cgracey wrote: »

I'm finding that if the RDQUAD is the first instruction in the REPS block, it doesn't get executed. Have you guys noticed this?

Yes and No
My conclusion before my previous test was that the RDQUAD gets executed only the first time through the loop but not afterwards.
But in my code above it now always gets executed. So it must depend on the mix of read-quad timing and repeat logic.

Andy

Bill Henning · 2012-12-10 13:01

I think I saw that on occasion.

cgracey wrote: »

I'm finding that if the RDQUAD is the first instruction in the REPS block, it doesn't get executed. Have you guys noticed this?

Bill Henning · 2012-12-10 13:06

Thanks Ariba,

Do you have the problem Chip mentioned of the cache being random code on the first pass?

I tried to get around that by "priming" the cache.

I'll try to torture your flavor.

Regarding delayed jumps... PropGCC2 will have to schedule instructions very carefully; using RDQUAD essentially turns LMM2 into a VLIW with a fairly deep pipe.

The pain will be worth it - it will be able to *very closely* approach 50% of native mips even without FCACHE etc, and almost 100% with all the tricks :-)

I'll try the loop you posted shortly.

Bill Henning · 2012-12-10 13:13

Andy,

Unless I goofed in the attached file, your latest version did not work (for me)

Ariba · 2012-12-10 13:33

Bill
I don't know what the results should look like; I get:

=== Propeller II Monitor ===

>n2000.2010
02000- 00000080 00000080 000000FF 000000FF   '................'
02010- 00000BFF   '....'
>

.. if I change it so that ins1 is quad aligned and not ins7

Andy

Bill Henning · 2012-12-10 13:36

ARGH!

Thanks - I don't believe I quad-aligned the delay slot... duh. I've been staring at variations of this code too long.

Thanks Andy.

I now get the same result you do.

Bill Henning · 2012-12-10 13:44

Uh-Oh

I just realized something. It is not executing the hub instructions in i4 and i6.

$2014 should have $0000007F in it (result6)

cgracey · 2012-12-10 13:46

This is weird. It seems to me that the QUAD window is appearing 1 location earlier than it should. If I set it at $014, it begins at $013. I can't understand how this is happening, or if I've just got something wrong in my perception.

Are you guys seeing this?

Bill Henning · 2012-12-10 13:51

Frankly I am seeing all sorts of unusual behaviour, depending on where the RDQUAD is in relation to REPS, the delay slots etc.

I'll need to make some more complex test cases, because strangely enough, the only code that seems to pass my "torture" test is the one where the quad buffer is aligned and comes *before* the RDQUAD.

What was really weird is Ariba's last code passed 5 of the 6 tests, only failing the readlong/add#1/writelong embedded LMM code...

It might be a good idea to hold off on the shuttle run until we figure this out.

cgracey wrote: »

This is weird. It seems to me that the QUAD window is appearing 1 location earlier than it should. If I set it at $014, it begins at $013. I can't understand how this is happening, or if I've just got something wrong in my perception.

Are you guys seeing this?

Ariba · 2012-12-10 13:52

Chip

To me it looks like the mapping of the quad registers into the register space affects the register right before the first mapped register address, so that it is no longer executable.
In my earlier test this was the RDQUAD, which was not executed right. Now I have a NOP at that position and it works, but if I try to place the ADD PC,#16 there, it seems not to execute correctly.

Andy

Sorry, I read this post of yours after writing the above:

cgracey wrote: »

This is weird. It seems to me that the QUAD window is appearing 1 location earlier than it should. If I set it at $014, it begins at $013. I can't understand how this is happening, or if I've just got something wrong in my perception.

Are you guys seeing this?

Hmm, but the 4 instructions get executed, including the last one, so the QUAD window must then be 5 instructions. But yes, as I wrote above, the instruction before the QUAD window does not execute right.

I now think it has nothing to do with the REPS; I see the same if I make the loop with jumps.

cgracey · 2012-12-10 13:55

Ariba wrote: »

Chip

To me it looks like the mapping of the quad registers into the register space affects the register right before the first mapped register address, so that it is no longer executable.
In my earlier test this was the RDQUAD, which was not executed right. Now I have a NOP at that position and it works, but if I try to place the ADD PC,#16 there, it seems not to execute correctly.

Andy

That's what I found, too, but the register before the QUADs seems to be QUAD0, like everything is shifted up one location.

Ariba · 2012-12-10 14:22

Yes, you are right, the window is shifted up 1 address.
If I place a pin toggle instruction as the last of the 4 quad instructions, it gets executed in addition to the 4 instructions loaded per RDQUAD:

' Quad LMM
DAT
        org 0
lmm     reps #511,#7
        setquad #ins1
         rdquad pc
         nop
ins1     nop            'quad aligned
         nop
         nop
         notp #2
         add pc,#16
        jmp #lmm        'after 511 repeats

pc      long @lmmcode+$0E80
t1      long 0

        long 0[4-($-($ & $1FC))]  'quad align

lmmcode setp #2         'toggle pin2 25%        
        notp #4
        clrp #2
        sub pc,#32      'relative jump (delayed by 1 quad)

        nop
        notp #4         'toggle pin4 50%
        mul t1,t1       '2 cycle instr
        nop
                        '< jump happens here

Bill Henning · 2012-12-10 14:35

This test also fails for me, it does not execute the WRLONG in i6

cgracey · 2012-12-10 14:39

I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.

I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.

Thanks for discovering this, Bill, Ariba, and any others who've worked on this!

Bill Henning · 2012-12-10 14:42

You are most welcome!

I REALLY enjoyed this (even with the excessive head scratching involved) - plus I am very glad it was caught before the shuttle run.

cgracey wrote: »

I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.

I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.

Thanks for discovering this, Bill, Ariba, and any others who've worked on this!

FredBlais · 2012-12-10 14:49

This is really great, the propeller 2 emulation is worth the effort finally!

Sapieha · 2012-12-10 14:50

Hi Chip.

While you fix that, would it be possible to add one more display mode to the Monitor?

You have Byte, Word, Long. One more, I think, would be Binary, which shows a long the way we see it in an instruction listing.

cgracey wrote: »

I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.

I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.

Thanks for discovering this, Bill, Ariba, and any others who've worked on this!

Ariba · 2012-12-10 14:50

Bill Henning wrote: »

This test also fails for me, it does not execute the WRLONG in i6

Bill
This was not meant as a new improved test. It shows only how I have tested the shift up of the 4 quad longs in cog memory.
I have replaced the 4th instruction with a pin toggle, so I see something on the scope. The 4 instructions that get executed are:
ins1-1,ins1, ins1+1, ins1+2, while ins1 must be quad aligned.
But that will change soon, when Chip posts the new FPGA files.
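The observed shift can be summed up in a short Python sketch. It models only what is reported in this thread (a window set at $014 begins at $013, and the executed longs are ins1-1 through ins1+2); the function names are hypothetical and it says nothing about the fixed FPGA files.

```python
QUAD_SIZE = 4  # longs in one QUAD window


def intended_window(base):
    """Cog addresses the window should cover for a quad-aligned base."""
    if base % QUAD_SIZE != 0:
        raise ValueError("QUAD window base must be quad-aligned")
    return list(range(base, base + QUAD_SIZE))


def observed_window(base):
    """Cog addresses reported in this thread: shifted one location earlier."""
    return [addr - 1 for addr in intended_window(base)]


print([hex(a) for a in observed_window(0x14)])
# ['0x13', '0x14', '0x15', '0x16']
```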

Andy

cgracey · 2012-12-10 14:52

Bill Henning wrote: »

You are most welcome!

I REALLY enjoyed this (even with the excessive head scratching involved) - plus I am very glad it was caught before the shuttle run.

Can any of you think of any reason why QUAD registers should be assignable to 4 separate addresses, without concern for alignment? Just curious.

cgracey · 2012-12-10 14:54

Sapieha wrote: »

Hi Chip.

While you fix that, would it be possible to add one more display mode to the Monitor?

You have Byte, Word, Long. One more, I think, would be Binary, which shows a long the way we see it in an instruction listing.

I've thought about this, but there is no more room in the Monitor code. I mean there is not room for even one more instruction. Binary would have been really nice for configuring the I/O pins, too.

Bill Henning · 2012-12-10 14:57

My only concern is that there is some way to have a loop that takes 8 cycles and reads and executes the four quad registers. I really want the 2-clock-cycle effective LMM2 instructions (for normally single-cycle instructions), as that makes a properly optimized PropGCC2 generate code comparable in speed with an 80-120MHz ARM (without FPU).

cgracey wrote: »

Can any of you think of any reason why QUAD registers should be assignable to 4 separate addresses, without concern for alignment? Just curious.

Ariba · 2012-12-10 14:58

cgracey wrote: »

I see the problem in the Verilog code now. I'm using an address to compare that is not aligned with the proper pipeline stage.

I'm going to fix this and post new files for the Terasic boards. It's going to take three days of synthesis/layout/verification to fix this. Better now than later.

Thanks for discovering this, Bill, Ariba, and any others who've worked on this!

Great that you found the reason :thumb:

Andy

Sapieha · 2012-12-10 14:58

Hi Chip.

From my point of view, the best case would be:

the QUAD aligns to the first free long address and the next 3 after it.

cgracey wrote: »

Can any of you think of any reason why QUAD registers should be assignable to 4 separate addresses, without concern for alignment? Just curious.

Bill Henning · 2012-12-10 15:02

Hi Andy,

I was not criticizing... just pointing out that it did not work for some instructions. Frankly, I am glad the issue came up before the test chips - far cheaper to fix it now!

Ariba wrote: »

Bill
This was not meant as a new improved test. It shows only how I have tested the shift up of the 4 quad longs in cog memory.
I have replaced the 4th instruction with a pin toggle, so I see something on the scope. The 4 instructions that get executed are:
ins1-1,ins1, ins1+1, ins1+2, while ins1 must be quad aligned.
But that will change soon, when Chip posts the new FPGA files.

Andy

cgracey · 2012-12-10 15:04

Sapieha wrote: »

Hi Chip.

From my point of view, the best case would be:

the QUAD aligns to the first free long address and the next 3 after it.

I think having to set the QUAD window at a QUAD-aligned address is a big pain. Do you guys agree? It keeps the circuitry simple but causes pain for the programmer.
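For context, the alignment burden today looks like the padding line in Ariba's listings, `long 0[4-($-($ & $1FC))]`. The Python sketch below evaluates that expression for a given cog address, assuming `$` is the current cog long address. It also shows a hypothetical minimal variant that pads nothing when the address is already aligned.

```python
def pad_longs(cog_addr):
    """Longs of padding produced by 4-($-($ & $1FC)) at cog address $."""
    return 4 - (cog_addr - (cog_addr & 0x1FC))


def pad_longs_minimal(cog_addr):
    """Hypothetical variant: smallest padding that reaches quad alignment."""
    return -cog_addr % 4


for addr in (12, 13, 14, 15):
    print(addr, pad_longs(addr), pad_longs_minimal(addr))
# 12 4 0
# 13 3 3
# 14 2 2
# 15 1 1
```

Both forms end on a quad-aligned address; under this reading the listing's expression spends four extra longs when the address was already aligned.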
