LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Sapieha · 2012-12-11 05:58

Hi Chip.

NANO reprogrammed

> Function correctly

Thanks

Ariba · 2012-12-11 06:25

This version of the Quad-LMM loop works now also with no problems - well done Chip!

' Quad LMM
DAT
        org 0
        mov pc,start
lmm     setquaz #ins1
        rdquad pc
ins1     nop
         nop
         nop
ins4     nop
         jmpd #ins1
         add pc,#16
         nop
         rdquad pc

start   long @lmmcode+$0E80
pc      long 0
t1      long 0

        long 0[4-($-($ & $1FC))]  'quad align following code

lmmcode setp #2         'toggle pin2 25%        
        notp #4
        clrp #2
        sub pc,#32      'rel jump, delayed by 1 quad

        nop
        notp #4         'toggle pin4 50%
        nop
        nop

I will try now some other features of the Prop2...

Andy

Bill Henning · 2012-12-11 06:35

WOW!

I go to dinner, and to sleep, and look how much happens....

I will head down to the lab shortly, update my Nano, run tests, and catch up on the posts :-)

Bill Henning · 2012-12-11 07:54

Hi guys,

Here is my massive "Catch Up" post:

jmg,David:

I think Chip wants something *REALLY* simple in hardware, I don't believe he has the time / logic budget for anything complicated. If he did, I think all of us would inundate him with additional suggestions.

Chip:

First, thanks for SETQUAZ and everything!

Right after this message I'll program your latest Nano image & update to the latest PNut; then I will torture-test RDQUADLONG.

Cluso99:

I like the movx (condition code)

For byte packing/unpacking Chip has a MOVF that he will document (can't wait)

potatohead:

Yep, testing RDQUAD was "interesting", and non-cog aligned RDQUAD will be nice.

re/ compressed: take a look at what I proposed, I was wary of any more complexity due to implementation delay, logic budget, Chip's time.

Sapieha:

I think movf will do this... (I hope!)

Cluso99:

Of course I don't mind you posting! An efficient loader will be needed for FCACHE

I will try your code... looks good at a quick glance.

Baggers:

You *REALLY* should get a Nano...

Chip:

Downloading new .zip ... thanks!

Andy:

Looks good!

I will torture your version too

Bill Henning · 2012-12-11 08:48

Latest Prop2 configuration / PNut torture tests:

Chip's sample:

>n
>2000.201f
02000- 00000080 00000080 000000FF 000000FF   '................'
02010- 00000FF3 00000000 00000000 00000000   '................'

not what we would like to see :-(

Andy's main loop, modified for my test framework:

>n
>2000.201f
02000- 00000080 00000080 000000FF 000000FF   '................'
02010- 00000FF3 0000007F 00000000 00000000   '................'

not what we would like to see :-(

My latest main loop, quad aligned:

>n
>2000.201f
02000- 000000FE 00000080 00000080 00000080   '................'
02010- 00000FF6 0000007F 00000000 00000000   '................'

IT WORKS!!!

My latest main loop, NOT quad aligned:

>n
>2000.201f
02000- 000000FE 00000080 00000080 00000080   '................'
02010- 00000FF7 0000007F 00000000 00000000   '................'

THANKS CHIP - WORKS PERFECTLY!!!

No more worries about alignment...

I am attaching the torture test to this message, it has all three different LMM2 loops in it - just uncomment the one you want to run.

THE TORTURE TEST EXPLAINED

Currently, the torture test consists of the following eight instructions, repeated 256 times in the hub (2k instructions, 8k hub used)

i1	add	count, #1
i2	add	count2,#1
i3	add	count3,#1
i4	add	count4,#1
i5	rdlong	lc,result6
i6	add	lc,#1
i7	wrlong	lc,result6
i8	add	count,#1

Due to how RDQUAD works, the LMM2 loop will alternately execute i1-i4, then i5-i8, ensuring that RDQUAD reads a different group of four instructions on every loop iteration.

i1-i4 are simple adds to counters, which are later deposited into result1-result4

i5-i7 increment result6 in the hub

i8 does an extra increment of result1

The LMM2 loop's RDQUAD is executed 256 times (so we get nice easy to read cycle counts in hex), however the last four instructions fetched are not executed, and the first four are executed as NOP's as they would not have been fetched yet.

The first iteration executes four NOP's on the first pass (thanks to SETQUAZ)

This means that i1-i4 gets executed 128 times, and i5-i8 gets executed 127 times

Therefore the mathematically correct expected results (in hex) are:

result 1: $0000FE ($7F+$7F)
result 2: $000080
result 3: $000080
result 4: $000080

result 5: $000FF7 *** approximate
result 6: $00007F

256*8 RDQUAD's ($800 cycles), 127*8 RDLONG's ($3F8 cycles), 127*8 WRLONG's ($3F8 cycles), since after the first hub access all of them will hit the hub sync sweet spots.

$800+$3F8+$3F8 = $FF0 cycles
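
As a quick check of the arithmetic, the three hub-access terms can be evaluated directly: 256*8 is $800 and 127*8 is $3F8, so the sum comes to $FF0. This is only a sketch; `HUB_WINDOW` and the variable names are illustrative, and it assumes (as the post does) that every hub access costs exactly one 8-clock hub window. Loop setup overhead is not modeled, which may account for the few extra cycles in the measured $FF6/$FF7.

```python
HUB_WINDOW = 8  # clocks per hub access slot (assumption taken from the post)

rdquads = 256 * HUB_WINDOW   # one RDQUAD per loop pass
rdlongs = 127 * HUB_WINDOW   # i5 runs 127 times
wrlongs = 127 * HUB_WINDOW   # i7 runs 127 times
total = rdquads + rdlongs + wrlongs

print(f"RDQUAD ${rdquads:X}  RDLONG ${rdlongs:X}  WRLONG ${wrlongs:X}")
print(f"total  ${total:X}")
```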

Sorry about the incorrect calculation in an earlier edit of this message - the timing works as expected!

p.s.

Other instructions can be tried in i1-i8, but then predicted results should be calculated for them.

The attached file is basically a framework for testing LMM2 and native instructions.

Bill Henning · 2012-12-11 09:13

I wonder if the torture test is faster because there is only a single cog implemented on my nano, and thus it actually ends up using "other cogs" hub cycles?

I fixed my initial incorrect calculation, the cycle count was correct as well.

Could someone try the torture test using my loops on a DE2-115 and post the results?

potatohead · 2012-12-11 09:14

Setting up now... Programming. Bill, I've got time for a quick run. I'll do the loops and post up results here in a moment.

Bill Henning · 2012-12-11 09:35

Thanks!

Meanwhile, I found an error in my initial calculation for expected cycle count - and the result I am getting is correct.

potatohead · 2012-12-11 09:41

Bill, I uncommented them in sequence, from top to bottom.

#1 --> "my out of order LMM2 inner loop"

=== Propeller II Monitor ===

>n2000.201f
02000- 000000FE 00000080 00000080 00000080 '................'
02010- 00000FF6 0000007F 00000000 00000000 '................'
>

#2 --> "Chip's easy to read LMM2 loop"

=== Propeller II Monitor ===

>n2000.201f
02000- 00000080 00000080 000000FF 000000FF '................'
02010- 00000FF3 00000000 00000000 00000000 '................'
>

#3 --> "Andy's non-pta LMM2 loop with jumpd, modified for 256 iterations"

=== Propeller II Monitor ===

>N2000.201f
02000- 00000080 00000080 000000FF 000000FF '................'
02010- 00000FF3 00000000 00000000 00000000 '................'
>

potatohead · 2012-12-11 09:44

These don't make sense to me yet. Did I get the right test file Bill?

Bill Henning · 2012-12-11 09:47

#1 makes sense - you might want to read the (long) explanation I put in the post#156 with the code that calculates the expected results for the torture test.

Unfortunately Chip's and Andy's loops have pipeline issues, whereas my out of order loop gets the right result.

I am planning to come up with other torture tests over time - we really need to get LMM2 right.

potatohead · 2012-12-11 09:50

Got it. I see that explanation now.

Well, we've now got a nice test framework, and a template to calculate expected results. Time to try different sets of instructions, among other things. If needed, I can run things this evening. The pipeline is going to be interesting to work with...

Bill Henning · 2012-12-11 09:57

Thanks that would be great!

The more verification we can get done, the more likely the test shuttle Chip's (pun intended) will just work!

I am going to concentrate on making sure LMM2 is bullet proof, and appreciate additional torture testing on that.

It would be very useful if we could get every instruction verified in both LMM2 and native cog environment... but we'd need many volunteers for that. I would prefer that LMM2 test results be reported in this thread.

Ideally people would post a list of instructions they are working on, and the verified/not status.

(I am certain that Chip already has great test vectors for his Verilog files, but extra verification never hurts)

Bill Henning · 2012-12-11 10:04

I added the Torture Test documentation and latest version to post#2 - this version adds a test for my out of order LMM2 loop that does not use PTRA, but uses a pc register.

Code snippet:

	'nop	' needed for aligned, comment out for un-aligned test
	setquaz	#ins1
	reps	#256,#8
	getcnt	start
ins1	nop			'four LMM instructions from RDQUAD before last execute here
ins2	nop
ins3	nop
ins4	nop
ins0	rdquad	pc		'LMM loop
ins5	add	pc,#16		'(this is where the mapped QUADs actually become executable after RDQUAD)
ins6	nop
ins7	nop

Test results:

>n
>2000.201f
02000- 000000FE 00000080 00000080 00000080   '................'
02010- 00000FF7 0000007F 00000000 00000000   '................'

The results are correct

potatohead · 2012-12-11 10:05

I'm wondering about the WAITVID, and friends... Need docs for those though. When Chip gets to it, that's another round of tests.

Bill Henning · 2012-12-11 10:08

DEFINITELY!

potatohead wrote: »

I'm wondering about the WAITVID, and friends... Need docs for those though. When Chip gets to it, that's another round of tests.

I really hope the non-DAC parallel mode output for VGA is still possible. I need it for LCDs; otherwise I'll have to bit (byte? word? long?) bang it.

potatohead · 2012-12-11 10:10

Me too, I've been wondering. I'm half tempted to just attempt it anyway, tinkering with the modes until I see that. Hope it's there, because it's useful on a few levels, not just video. Bit banging is better now, but still... (fingers crossed)

waitvid classic mode

Bill Henning · 2012-12-11 10:11

Very true... on P1 I've also used it for fast clocked serial out and 8 bit DDS experiments.

Ariba · 2012-12-11 11:25

Bill
There seem to be issues with hub access instructions inside LMM code. Unfortunately your code does not work right either; you get the (nearly) right results only by chance. Try the following:
1) initialize lc with #$180, for example, before the LMM loop - you will get $1FF as result6, so lc is just incremented and written to hub, but never loaded by rdlong.
2) make i5 a NOP instead of rdlong lc,result6 - and all 3 versions show the same results.

What happens is that your "out of order loop" is so out of order that the first instruction is not ready to execute after the previous rdquad.
I've tried for half an hour to find the right instruction slots that always work, but it looks like the rdlong destroys something in the pipeline, and the working slots change for every code variant I tried.
With normal 1-cycle instructions I don't see these problems.
I think Chip needs to look again at his Verilog code....

Andy
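
Andy's first experiment can be pictured with a small functional model of i5-i7. This is a sketch only: `result6_after_run` and its `rdlong_works` flag are hypothetical names, and the model ignores timing entirely. It shows why a zero-initialized lc hides a dead RDLONG, while a non-zero starting value exposes it.

```python
LOOP_PASSES = 127  # i5-i8 execute 127 times in the torture test

def result6_after_run(lc_init, hub_init, rdlong_works):
    """Model i5-i7: rdlong lc,result6 / add lc,#1 / wrlong lc,result6."""
    lc, result6 = lc_init, hub_init
    for _ in range(LOOP_PASSES):
        if rdlong_works:
            lc = result6   # i5 loads the hub value
        lc += 1            # i6
        result6 = lc       # i7 writes it back
    return result6

# With lc starting at 0 the two cases are indistinguishable.
print(hex(result6_after_run(0, 0, True)), hex(result6_after_run(0, 0, False)))
# With lc preset to $180 a dead RDLONG shows up in result6.
print(hex(result6_after_run(0x180, 0, True)), hex(result6_after_run(0x180, 0, False)))
```

In this model, presetting the hub location instead of lc separates the two cases just as well: a working RDLONG ends at $1FF, a dead one at $7F.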

Baggers · 2012-12-11 11:29

cgracey wrote: »

Baggers, that would be maybe too much at this late stage. I wish I would have known about this earlier. You told me, but I didn't picture it this clearly back then. That would be pretty simple.

No worries, you can save it for P3

Bill Henning · 2012-12-11 11:37

Thanks for the thorough look Andy!

I will try your modifications, and no doubt get the same results, and I am sure Chip will look over the Verilog code again - all of us want this to work.

Releasing the Terasic loadables was a brilliant move - it is letting us find bugs before the shuttle run.

Bill Henning · 2012-12-11 11:51

Good catch Andy!

I added:

testvalue long $180

at the bottom of the file, and

wrlong testvalue,result6

before the loop, and I verified that at the end of the run, result6 was $7F

Thinking about it, I think that only RDLONG has an issue, otherwise we would not see $7F... and the single cycle adds are all executing correctly.

Update:

Thinking some more about it, it pretty much has to be a difference in how the Verilog handles executing RDLONG from a "real cog register" and a "RDQUAD cache mapped to registers"

Reason I think this is the case is that two days ago the RDLONG tests worked fine - but I am going to go and re-test that right now...

Bill Henning · 2012-12-11 12:33

Ok, looks like my hypothesis is confirmed.

The problem pretty much has to be the difference between executing RDLONG from regular registers, or from the RDQUAD cache buffer.

I initialized result6 to $180, then I executed the following code:

	reps	#127,#8
	getcnt	start
	add	count, #1
	add	count2,#1
	add	count3,#1
	add	count4,#1
	rdlong	lc,result6
	add	lc,#1
	wrlong	lc,result6
	add	count,#1

in the torture test framework, and got the following result:

>n
>2000.201f
02000- 000000FE 0000007F 0000007F 0000007F   '................'
02010- 000007F6 000001FF 00000000 00000000   '................'

Note that result six is $180+$7F, and all the other results are correct as well.

cgracey · 2012-12-11 12:43

Ariba wrote: »

Bill
There seem to be issues with hub access instructions inside LMM code. Unfortunately your code does not work right either; you get the (nearly) right results only by chance. Try the following:
1) initialize lc with #$180, for example, before the LMM loop - you will get $1FF as result6, so lc is just incremented and written to hub, but never loaded by rdlong.
2) make i5 a NOP instead of rdlong lc,result6 - and all 3 versions show the same results.

What happens is that your "out of order loop" is so out of order that the first instruction is not ready to execute after the previous rdquad.
I've tried for half an hour to find the right instruction slots that always work, but it looks like the rdlong destroys something in the pipeline, and the working slots change for every code variant I tried.
With normal 1-cycle instructions I don't see these problems.
I think Chip needs to look again at his Verilog code....

Andy

The problem is that RDLONG's take more than one clock and therefore make subsequent instructions in QUADs available for execution earlier. I don't know what to do about this, other than suggest timing provisions are made to execute the just-read QUADs and not execute overlapped QUADs, as the overlapped execution is where timing gets advanced by multi-clock instructions and causes new instructions to execute in place of old.

Read this carefully and it will make sense:

After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:


        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

	NOP			'do something for at least 3 clocks to allow QUADs to update
	NOP
	NOP

	CMP     quad0,quad1     'mapped QUADs are now accessible via D and S


After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:


        SETQUAD #quad0          'map QUADs to quad0..quad3

        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

        NOP                     'do something for at least 3 clocks to allow QUADs to update
        NOP
        NOP

        NOP                     'do at least 1 instruction to get QUADs into pipeline

quad0   NOP                     'QUAD0..QUAD3 are now executable
quad1   NOP
quad2   NOP
quad3   NOP


After a SETQUAD, mapped QUAD registers are writable immediately, but original contents are
readable via D and S after 2 instructions:


        SETQUAD #quad0          'map QUADs to quad0..quad3 (new address)

        NOP			'do at least two instructions to queue up QUADs
        NOP

        CMP     quad0,quad1     'mapped QUADS are now accessible via D and S


On cog startup, the QUAD registers are cleared to 0's.

Bill Henning · 2012-12-11 12:49

Ok that makes sense.

Question:

RDQUAD fully exposes the 7 cycles after it (in an 8 cycle hub window)

Is it possible for the RDBYTE/WORD/LONG (non cache version) to do the same? it would make it simpler to schedule.

I'll go try to make a new torture test to show RDLONG does work if scheduled right.

The current thread expected RDLONG to pause the pipe until synced to the hub like in non-quad code, which makes writing code for it easier.

cgracey · 2012-12-11 12:54

Bill Henning wrote: »

Ok that makes sense.

Question:

RDQUAD fully exposes the 7 cycles after it (in an 8 cycle hub window)

Is it possible for the RDBYTE/WORD/LONG (non cache version) to do the same? it would make it simpler to schedule.

I can't change the timing of RDxxxx. That's kind of set, at this point.

I think one thing you could do would be to put the RDLONG at the 4th (last) location, and have three single-cycle instructions in front of it. This would inhibit the problem.
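
The placement rule suggested here could be turned into a simple check over LMM2 code before it is run. The sketch below is not part of any Parallax tool: `check_quads` is a hypothetical helper, and the `MULTI_CLOCK` set is an assumption about which mnemonics count as multi-clock hub instructions. It flags any such instruction that is not in the 4th slot of its quad.

```python
# Assumption: these mnemonics are treated as multi-clock hub instructions.
MULTI_CLOCK = {"rdbyte", "rdword", "rdlong", "wrbyte", "wrword", "wrlong"}

def check_quads(mnemonics):
    """Return (quad, slot, mnemonic) for each hub op outside slot 3."""
    problems = []
    for i, m in enumerate(mnemonics):
        quad, slot = divmod(i, 4)   # four instructions per RDQUAD
        if m.lower() in MULTI_CLOCK and slot != 3:
            problems.append((quad, slot, m))
    return problems

# i1-i8 of the original torture test
original = ["add", "add", "add", "add", "rdlong", "add", "wrlong", "add"]
print(check_quads(original))

# three single-cycle instructions, then the RDLONG in the last slot
suggested = ["add", "add", "add", "rdlong"]
print(check_quads(suggested))
```

The check is deliberately conservative: a schedule it flags may still run correctly, so a hit is a prompt to test that code rather than proof of a bug.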

Sapieha · 2012-12-11 12:55

Hi Chip.

Fills registers that are hidden by QUAD with same data as QUAD ?

Bill Henning · 2012-12-11 12:57

THANK YOU CHIP! IT WORKS!!!!

Your explanation let me schedule the code properly for the kernel.

Re-written, scheduled LMM2 code:

i1	rdlong	lc,result6
i2	add	count, #1
i3	add	count2,#1
i4	add	count3,#1
i5	add	count4,#1
i6	add	lc,#1
i7	wrlong	lc,result6
i8	add	count,#1

Test results - which make sense:

>n
>2000.201f
02000- 000000FF 00000080 00000080 0000007F   '................'
02010- 00000FF7 000001FF 00000000 00000000   '................'

Basically, it was NOT a Verilog issue, just improper scheduling of code to be run through the LMM2 kernel.
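
The reported results can be reproduced with a purely functional model of the rescheduled test. This is a sketch under stated assumptions: `run` and `SCHEDULED` are illustrative names, result6 is assumed to start at $180 as in the earlier native test, and pipeline timing is not modeled at all. The model only counts how often each instruction executes, given that 256 quads are fetched, each quad runs one pass after it is fetched, and the last one never runs.

```python
QUADS_FETCHED = 256  # RDQUAD executes 256 times

# (mnemonic, register) for i1..i8 of the rescheduled test
SCHEDULED = [
    ("rdlong", "lc"), ("add", "count"), ("add", "count2"), ("add", "count3"),
    ("add", "count4"), ("add", "lc"), ("wrlong", "lc"), ("add", "count"),
]

def run(program, result6_init):
    regs = {"count": 0, "count2": 0, "count3": 0, "count4": 0, "lc": 0}
    result6 = result6_init
    # Quad k is fetched on pass k and executed on pass k+1,
    # so only quads 0..254 ever execute.
    for k in range(QUADS_FETCHED - 1):
        half = (k % 2) * 4  # even quads hold i1-i4, odd quads hold i5-i8
        for op, reg in program[half:half + 4]:
            if op == "add":
                regs[reg] += 1
            elif op == "rdlong":
                regs[reg] = result6
            elif op == "wrlong":
                result6 = regs[reg]
    return regs, result6

regs, result6 = run(SCHEDULED, 0x180)
print({name: hex(value) for name, value in regs.items()}, hex(result6))
```

Under these assumptions the model gives count=$FF, count2=$80, count3=$80, count4=$7F and result6=$1FF, which agrees with the monitor dump above; the cycle count in result5 is outside what it can predict.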

cgracey · 2012-12-11 13:00

The only way overlapped QUADs can be executed is if the instructions within them are all single-cycle. Otherwise, the overlapping gets mixed up, affecting the 4th and maybe even 3rd instruction.

cgracey · 2012-12-11 13:02

Sapieha wrote: »

Hi Chip.

Fills registers that are hidden by QUAD with same data as QUAD ?

No, it has to do with when new QUAD values enter the pipeline, based on how many clocks have elapsed since the last RDQUAD.
