LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Bill Henning · 2012-12-09 06:57

Greetings!

As Dave and Ariba have already started posting about their experiments, I think it is time for me to post about the LMM ideas I am experimenting with for Propeller 2.

Please note the following thought experiments have not been tried on my DE0-Nano yet.

One of the problems Parallax (and the GCC team) will face is that Propeller 2 is a very different beast from Propeller 1. To get the most out of the Propeller 2 with LMM, the compiler will need some fairly significant changes.

LMM on Propeller 1, not counting FCACHE etc., executes a native 4 cycle instruction in 20 clock cycles (4 way unrolled LMM loop) or 18 cycles (eight way unrolled LMM loop). In order to save space, I will refer to Propeller-1 style LMM as "LMM1")

Without the pipelining, Propeller 1 can natively execute instructions at up to 20MIPS @ 80Mhz.

With its pipelining, Propeller 2 can natively execute instructions at up to 160MIPS @ 160MHz.

I've been scratching my head trying to make LMM2 far more efficient.

I've also been trying to figure out a "stepping stone" version that will require very few changes to GCC in order to get GCC on Propeller 2 running as quickly as possible - mind you, that will run much slower than a LMM2 that is optimized for the Propeller 2 architecture.

1) LMM1 GCC compatibility mode - LMM2 experiment #1

next    rdlong    ins,pc   ' could be rdlongc
        jmpd      #next
        add       pc,#4
ins     nop

The above simple kernel will execute LMM1 code at 8 cycles per instruction (ignoring hub etc instructions greater execution times) - and as Propeller 2 is pipelined, this means up to 20MIPS @ 160Mhz - basically, LMM1 code will run at the same speed on Propeller 2 as native cog code does on Propeller 1 (assuming 160MHz and 80MHz clock speeds respectively) for a native "efficiency" factor of 1/8. (ok, JMP, CALL will take 3 cycles inside, but still fit in 8 cycle hub window)

A simple change may boost this, but I have to check that the pipeline will allow the extra speed:

again   reps       #511,4
        nop        ' delay slot

next    rdlongc    ins,pc
        nop        ' delay slot useful for profiling - ie "ADD LMMOPS,#1"
        add       pc,#4
ins     nop

        jmp       #again   ' we will break the REPS with any JMP or CALL

(I think the above is almost exactly the same as Dave's and Ariba's version)

Due to the pipeline delays, this new version is not much faster - but every little bit helps. This version can theoretically execute four simple P2 instructions in 8+5+5+5 = 23 cycles once the hub is in sync, for a 4/23 maximum "efficiency" - however due to hub windows, it would actually take 4/24 cycles to execute the four instructions.

This version should still be mostly GCC Prop1 compatible.

2) LMM2 experiment #2

This version tries to improve on the efficiency of (1) by discarding PropGCC1 (shorthand for Propeller 1 GCC) compatibility..

Actually, for best effect it requires significant changes to GCC to support the "VLIW"-like mode it makes possible (more on this later)

loopy  reps      #511,#6
       nop       ' delay slot
        
next   rdlongc   ins1,[ptra++]
       rdlongc   ins2,[ptra++]
       nop       ' delay slot
       nop       ' delay slot - or use a JMPD here if you don't want to use REPS
ins1   nop
ins2   nop

       jmp       #loopy               ' so we go back to REPS when a JMP/CALL busts the REPS loop

The changes here are:

- using PTRA as the program counter gives free increments
- using REPS gives us free looping
- the delay slots can be used to implement interrupts

As long as the pipelining works out the way I expect, this would execute four native simple instructions in 8+6 = 14 cycles, for a 4/14

Hub syncing will reduce this to an actual best rate of 4/16, ie 40MIPS max at 160MHz.

VLIW mode:

With appropriate changes, PropGCC could support a VLIW mode where ins1/ins2 were treated as either a single 64 bit instruction, or two 32 bit instructions; note all code would have to be generated on eight byte boundaries for this to work well.

This would bring a major advantage to GCC code generation - here are just a couple of examples:

MVI REGx, ##longvalue ---> MOV REGx,ins2

JMP #FJMP ---> SETPTRA ins2
LONG hubaddr

etc

3) LMM2 experiment #3 - "Off-stride"

This is the most I can pack into an 8 cycle hub window.

loopy   reps      #511,#6
        nop

next    rdlongc   ins1,[ptra++]
        rdlongc   ins2,[ptra++]
        rdlongc   ins3,[ptra++]
ins1    nop
ins2    nop
ins3    nop

        jmp       #loopy   ' used when REPS loop is broken

The good news:

It has a 3/8 efficiency, giving up to 60MIPS!

The bad news:

You lose the nice fast "MVI" and "FJMP" capability (without a LOT of compiler headaches)

Now I am not certain that the ins2 and ins3 will be ready to execute in time, in which case priming and inverting should work:

loopy   reps      #511,#6
        nop

ins1    nop
ins2    nop
ins3    nop
        rdlongc   ins1,[ptra++]
        rdlongc   ins2,[ptra++]
        rdlongc   ins3,[ptra++]

        jmp       #loopy   ' used when REPS loop is broken

4) LMM2 experiment #4 - Things get weird...

(Baggers asked if anyone thought of using READQUAD...)

top         reps       #511,#12
            setquad   #block1

next2       rdquad    [ptr++]
            setquad  #block2
            nop    ' delay slot
block1      nop
            nop
            nop
            nop
next1       rdquad    [ptr++]
            setquad  #block1
            nop    'delay slot
block2      nop
            nop
            nop
            nop

            jmp   #top

I think this would not be very nice to generate code for, but it does have an 8/16 (50%) potential efficiency - 80MIPS

For best results, it would have to be treated as a 128 bit VLIW machine, and block1 should be primed before entering the loop.

5) LMM2 experiment #5 - Even weirder...

My attempt at shrinking #4 - but I need to make sure it will still execute in 8 cycles (when synced to the hub)

top         reps       #511,#6
            setquad   #block

block       nop
            nop
            nop
            nop
            jmpd   #block   
            rdquad    [ptr++]

            jmp   #top

Block must be primed, not sure the pipelining would work (ie keep it 8 cycles) but if it works, it will give us 4/8 ie 50% efficiency for 80MIPS as well.

6) LMM2 experiment #7+ And now, for something completely different...

All kidding aside, the other ideas I've been playing with get really weird, and do not exceed 50% potential efficiency.

For all experiments, FCACHE/FLIB can add a huge boost.

Apologies if I get the syntax wrong on any of the instructions above, I will fix errors as I find out about them

UPDATE

Releasing the Prop2 Terasic boards was an extremely useful thing to do.

Testing the RDQUAD based LMM2 ideas above Chip was able to find and fix a small error in the Verilog sources, potentially saving the cost of an additional shuttle run.

Andy noticed the first sign of a problem, I and others were able to verify it.

After Chip's change, single cycle instructions would work, but unexpected issues cropped up with multi-cycle instructions.

There is a potential useful work-around, however it reduces the maximum MIPS from 80 to 60 as it only executes three instructions, but provides a 32 bit instruction or address for them - which removes the need for MVI, FJMP and many other constructs.

I'll add that code here shortly, and I plan to verify that it works for both single and multi-cycle instructions.

I am very happy to see how people jumped in with both feet to test P2!

Update:

I may have gotten it running using RDQUAD with 4 instructions executed per RDQUAD!

---

Let me know what you think! Your input is most appreciated...

---

Notes:

All methods show above will do much better with FCACHE and FLIB added into the equation, the efficiency above is for strainght in-line code.

(I've been playing with LMM2 on paper since Chip started posting the instructions - but I did not want to post until I had enough information on the pipelining)

Bill Henning · 2012-12-09 06:57

LMM2 NAMING CONVENTION

To save typing, and make it easier for all of us to be specific about exactly which variant we are talking about:

[b]LMM2Q[/b]  - Propeller 2 LMM, RDQUAD 4/8 efficiency
[b]LMM2C[/b]  - Propeller 2 LMM, one RDLONGC per hub window, 1/8 efficiency
[b]LMM2C2[/b] - two RDLONGC's per hub window, 2/8 efficiency
[b]LMM2C3[/b] - three RDLONGS's per hub window, 3/8 efficiency
[b]LMM2P[/b]  - paging variant I did not publish, roughly 3/8 efficiency

In case any of you are wondering, I did not publish LMM2P as it required significantly more compiler work than LMM2Q and performed more poorly than LMM2Q due to additional overhead and thrashing.

THE TORTURE TEST EXPLAINED

Currently, the torture test consists of the following eight instructions, repeated 256 times in the hub (2k instructions, 8k hub used)

i1	add	count, #1
i2	add	count2,#1
i3	add	count3,#1
i4	add	count4,#1
i5	rdlong	lc,result6
i6	add	lc,#1
i7	wrlong	lc,result6
i8	add	count,#1

Due to how READQUAD works, the LMM2 loop will alternately execute i1-i4, then i5-i8 ensuring that RDQUAD reads a different group of four instructions on every loop iteration

i1-i4 are simple adds to counters, which are later deposited into result1-result4

i5-i7 increment result6 in the hub

i8 does an extra increment of result1

The LMM2 loops RDQUAD is executed 256 times (so we get nice easy to read cycle counts in hex), however the last four instructions fetched are not executed, and the first four are executed as NOP's as they would not have been fetched yet.

The first iteration executes four NOP's on the first pass (thanks to SETQUAZ)

This means that i1-i4 gets executed 128 times, and i5-i8 gets executed 127 times

Therefore the mathematically correct expected results (in hex) are:

result 1: $0000FE ($7F+$7F)
result 2: $000080
result 3: $000080
result 4: $000080

result 5: $000FF7 *** approximate
result 6: $00007F

256*8 RDQUAD's ($800 cycles), 127*8 RDLONG's ($3FF cycles), 127*8 WRLONGS ($3FF) cycles as after first hub access all will hit hub sync sweet spots

$800+$3F0+$3F0 = $FE0 cycles

NOTE:

Other instructions can be tried in i1-i8, but then predicted results should be calculated for them.

This post reserved for source code, links and FAQ's

- added LMM2 using RDQUAD test
- added LMM2 using RDQUAD test 2
- added LMM2 using RDQUAD test 3 to verify Andy's results
- added what looks like verified working RDQUAD LMM code

- added torture test for latest Nano loadable that matches the instructions above

Baggers · 2012-12-09 07:43

Has anyone thought of using rdquads?

Bill Henning · 2012-12-09 07:46

Yep,

In theory, should be better.

In practice, does not look like it, as it is very awkward...

I'll add it to the top thread.

Baggers · 2012-12-09 07:59

btw, I'm not sure, as I've not looked at the instructions yet, as I don't have an emulator setup, but wouldn't it be setquad #block1 and setquad #block2? I may be wrong, but this is the usual way to point at an address and not the contents of it.

Edit: No a quick read up on it

and you are correct

Bill Henning · 2012-12-09 08:11

Thanks Baggers!

I did mention it was a bit weird... had to be, due to how the pipelining works.

I'll be trying these experiments on my DE0-Nano, but its a bit painful as I can't use tasks to do serial out if I want accurate cycle counts, I have to keep dumping out to the monitor and examining the hub.

But it is still a lot of fun :-) :-) :-)

Cluso99 · 2012-12-09 11:34

Bill: I have not looked too deeply into your code examples.
I am unsure if this works or not, but Chip said that we can set the quad cache into a window of cog ram. If those 4 quads were windowed into the LMM loop as instr1-4 I wonder if they would be able to be executed from there? Of course this makes the LMM compilers more difficult but if looking for every last bit of speed it could be worth looking into.

Bill Henning · 2012-12-09 13:05

Hi Cluso99,

(4) and (5) in post #1 use RDQUAD (cache mapped into the cog), and assuming they work, will be able to execute code from the hub at 50% of native cog speed (much more with FCACHE etc).. I think I may have to remove the delay slots from (4) to make it fit within the hub window; it is not needed anyway due to how I use two rdquads.

and you are absolutely correct, it does require more compiler work.

Cluso99 wrote: »

Bill: I have not looked too deeply into your code examples.
I am unsure if this works or not, but Chip said that we can set the quad cache into a window of cog ram. If those 4 quads were windowed into the LMM loop as instr1-4 I wonder if they would be able to be executed from there? Of course this makes the LMM compilers more difficult but if looking for every last bit of speed it could be worth looking into.

cgracey · 2012-12-09 15:42

After a RDQUAD, the 4-register QUAD window becomes executable on the 5th instruction.

I think this is what can be done:

	repd	$1FF,@extra3	'$1FF causes infinte REPD
	nop
	nop
	nop

	rdquad	ptra++	'1 clock rdquad, QUAD executable at +5 clocks
inst1	nop		'inst1..4 are the QUAD window
inst2	nop		'because of pipeline delays, inst1..4 execute
inst3	nop		'...from RDQUAD *before* last
inst4	nop
extra1	nop		'QUAD becomes executable here (5th instruction)
extra2	nop
extra3	nop		'whole loop takes 8 clocks and executes 4 LLM instructions

Note that there are 8 permutations possible of the repeated block by putting the last instruction where the first one is, and so on.

Bill, I'm not totally sure about the timing on this, as I haven't done the experiment in a while, but this could be verified. If it is, indeed, the 5th instruction after the RDQUAD when the mapped QUADs become executable, you would want to put the QUAD window right after the RDQUAD instruction. The funny thing is that it would always be executing the QUAD from the RDQUAD before the RDQUAD just above the QUAD window.

Bill Henning · 2012-12-09 15:57

Thanks Chip!

I will try it - as long as I can get 4 LMM instructions running in 8 clocks I'll be happy - regardless of how I need to arrange them :-)

cgracey wrote: »
After a RDQUAD, the 4-register QUAD window becomes executable on the 5th instruction.

I think this is what can be done:
	repd	$1FF,@extra3	'$1FF causes infinte REPD
	nop
	nop
	nop

	rdquad	ptra++	'1 clock rdquad, QUAD executable at +5 clocks
inst1	nop		'inst1..4 are the QUAD window
inst2	nop		'because of pipeline delays, inst1..4 execute
inst3	nop		'...from RDQUAD *before* last
inst4	nop
extra1	nop		'QUAD becomes executable here (5th instruction)
extra2	nop
extra3	nop		'whole loop takes 8 clocks and executes 4 LLM instructions
Note that there are 8 permutations possible of the repeated block by putting the last instruction where the first one is, and so on.

Bill, I'm not totally sure about the timing on this, as I haven't done the experiment in a while, but this could be verified. If it is, indeed, the 5th instruction after the RDQUAD when the mapped QUADs become executable, you would want to put the QUAD window right after the RDQUAD instruction. The funny thing is that it would always be executing the QUAD from the RDQUAD before the RDQUAD just above the QUAD window.

If you squint at it just the right way... in a way it becomes pipelined LMM code.

Kye · 2012-12-09 17:07

Last 3 instructions could be used for interrupt polling.

Bill Henning · 2012-12-09 19:05

I just verified the pipeling for Experiment #1, you can have five instructions between back to back or repeated RDLONG's.

here is the code fragment:

	reps	#256,#6
	getcnt	start

	rdlong	ins3,what
ins1	nop		' verified, does not execute
ins2	nop		' verified, does not execute
ins3	nop		' verified, EXECUTES
ins4	nop		' verified, EXECUTES
ins5	mov	ins3,#0	' 2049 cycles measured for 256 loop iterations

	getcnt	stop

	mov	cycles,stop
	sub	cycles,start

	wrlong	cycles,result1
	wrlong	count,result2

Adding "ins6" makes the 256 loops take 4096+ cycles

Kye - yes, the delay slots can definitely be used to simulate interrupts, as previously posted.

Bill Henning · 2012-12-09 19:27

A modified Experiment #2 is now also verified... 512 LMM instructions executed in 2049 cycles - that is one per four clock cycles, as predicted.

	org	0

	' make 256 LMM instructions of addins

	reps	#511,#1
	setptra	what
	wrlong	addins,ptra++	' save the add instruction into the hub

	wrlong	addins,ptra++	' save the add instruction into the hub

	' point at start of lmm code

	setptra	what	

	' execute 256 LMM instructions

	reps	#256,#5
	getcnt	start

	rdlongc	ins3,ptra++
ins1	rdlongc	ins4,ptra++	
ins2	nop		' verified, does not execute
ins3	nop		' verified, EXECUTES
ins4	nop		' verified, EXECUTES

	getcnt	stop

	mov	cycles,stop
	sub	cycles,start

	wrlong	cycles,result1
	wrlong	count,result2

Bill Henning · 2012-12-09 19:33

I am on a roll...

Experiment #3 is also verified!!!

768 LMM instructions executed in 2049 cycles!

	org	0

	' make 256 LMM instructions of addins

	reps	#511,#1
	setptra	what
	wrlong	addins,ptra++	' save the add instruction into the hub

	wrlong	addins,ptra++	' save the add instruction into the hub

	reps	#511,#1
	nop
	wrlong	addins,ptra++	' save the add instruction into the hub

	wrlong	addins,ptra++	' save the add instruction into the hub

	' point at start of lmm code

	setptra	what	

	' execute 256 LMM instructions

	reps	#256,#6
	getcnt	start

	rdlongc	ins3,ptra++
ins1	rdlongc	ins4,ptra++	
ins2	rdlongc	ins5,ptra++	
ins3	nop		' verified, does not execute
ins4	nop		' verified, EXECUTES
ins5	nop		' verified, EXECUTES

	getcnt	stop

	mov	cycles,stop
	sub	cycles,start

	wrlong	cycles,result1
	wrlong	count,result2

Bill Henning · 2012-12-09 20:14

I really like rdquad - I just verified that

	org	0

	' make 4096 LMM instructions of "add count,#1"

	setptra	what		' point to 16k buffer of "add count,#1" lmm code
	mov	times,#16
loop	reps	#256,#1
	nop
	wrlong	addins,ptra++	' save the add instruction into the hub
	djnz	times,#loop

	' point at start of lmm code

	setptra	what	

	' execute 256 LMM instructions

	setquad	#ins4

	nop
	nop

	reps	#256,#8
	getcnt	start

	rdquad	ptra++
ins1	nop
ins2	nop
ins3	nop		' verified, does not execute
ins4	nop		' verified, EXECUTES
ins5	nop		' verified, EXECUTES
ins6	nop		' verified, EXECUTES
ins7	nop		' verified, EXECUTES

	getcnt	stop

	mov	cycles,stop
	sub	cycles,start

	wrlong	cycles,result1
	wrlong	count,result2

Executed the RDQUAD loop in 801 cycles, so it really is possible to put 7 instructions between hub-synced RDQUAD's!

I also verified that on the first go-around it only starts executing fetched code at ins6, so it needs 5 delay slots before the first executable slot.

cgracey · 2012-12-09 20:34

Bill, there could be a latency issue with RDQUAD and the pipeline. Instead of putting a bunch of identical instructions in the LMM memory, try a pattern of instructions that will indicate if they are actually executing in proper sequence.

Bill Henning · 2012-12-09 20:43

Ok, this took a bit of experimentation - but I now have a *working* LMM2 RDQUAD engine that executes at almost exactly 50% efficiency!

Note - any JMP or CALL out of the loop will have to "prime" ins1-ins4 with NOP's or do a RDQUAD ptr++ at least 5 cycles ahead.

Basically, this is a pipelined LMM engine.

This runs as-is on DE0-Nano, and should run on the DE2-115

- use pnut.exe to download and run it
- when the test is finished, it drops back into the monitor
- use the 'n' command to switch to longs for output
- type '2000.2007' to see the two counters
- the first long is how many cycles 256 iterations of the quad LMM loop took, it is normally between 801-803
- the second long is how many LMM instructions were executed, it will be 3Fc (first iteration of the loop executes NOP's)

THIS WAS FUN!

'
' rdquad lmm2 test - now working LMM2 loop!
'
' William Henning
' http://Mikronauts.com
'

CON

	CLOCK_FREQ = 60_000_000
	BAUD = 115_200

DAT
	org	0

	' make 4096 LMM instructions of "add count,#1"

	setptra	what		' point to 16k buffer of "add count,#1" lmm code
	mov	times,#16
loop	reps	#256,#1
	nop
	wrlong	addins,ptra++	' save the add instruction into the hub
	djnz	times,#loop

	' point at start of lmm code

	setptra	what	

	' execute 256 LMM instructions

	nop
'--
	setquad #ins1

	reps	#256,#8
	getcnt	start


	rdquad   ptra++
'--
ins1    nop	' must be quad-long aligned!
ins2    nop
ins3    nop
ins4    nop
        nop	' delay slot
	nop	' delay slot
	nop	' delay slot

	getcnt	stop

	mov	cycles,stop
	sub	cycles,start

	wrlong	cycles,result1
	wrlong	count,result2

	coginit	monitor_pgm,monitor_ptr	'relaunch cog0 with monitor - thanks Chip!

monitor_pgm	long	$70C			'monitor program address
monitor_ptr	long	90<<9 + 91		'monitor parameter (conveys tx/rx pins)

addins	add	count,#1

start	long	0
stop	long	0
cycles	long	0

count	long	0
times	long	0

what		long	16384
result1		long	8192
result2		long	8196

Bill Henning · 2012-12-09 20:49

At 80MIPS, LMM2 on Prop2 is quite competitive with similarly priced ARM chips... and theoretically LMM2 could run in all eight cogs!

cgracey · 2012-12-09 20:53

Really neat!

One minor thing: after doing a 'GETCNT reg' and some instructions to time, you can end with a 'SUBCNT reg' and you'll have the difference in reg.

SRLM · 2012-12-09 20:59

Bill Henning wrote: »

At 80MIPS, LMM2 on Prop2 is quite competitive with similarly priced ARM chips... and theoretically LMM2 could run in all eight cogs!

I don't think it's directly comparable:

The MIPS figures which ARM (and most of the industry) quotes are "Dhrystone VAX MIPs". The idea behind this measure is to compare the performance of a machine (in our case, an ARM system) against the performance of a reference machine. The industry adopted the VAX 11/780 as the reference 1 MIP machine.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3885.html

potatohead · 2012-12-09 21:02

Bill, this is a great piece of code!

I got,

>2000.2007
02000- 02 08 00 00 FC 03 00 00

Now it needs a variety of instructions running..

jmg · 2012-12-09 21:11

Bill Henning wrote: »

At 80MIPS, LMM2 on Prop2 is quite competitive with similarly priced ARM chips.

Do you have a price ? - or what was your guess ?

Bill Henning · 2012-12-09 21:24

That was an educated guess.

At UPEW, a rough estimate of no more than 1.5x Prop1 cost was given, so my guess is $12 or less quantity one for Prop2

http://www.digikey.com/product-detail/en/LPC1785FBD208,551/568-7573-ND/2677567

120Mhz ARM for $12.00 at Digikey

I suspect that is competitive with a single cog if we also use FCACHE... but we will have 7 extra cogs as well

jmg wrote: »

Do you have a price ? - or what was your guess ?

Bill Henning · 2012-12-09 21:26

Thanks!

Chip's suggested replacement for 4 and 5 was simpler, and while it is possible to get 4&5 running with the correct alignment, simple is good.

All those 'add count,#1' did prove the code ran, now we can try other stuff.

Mind you, adding FCACHE, FLIB etc will increase the performance even more; I expect 99% native speed for FCACHE-friendly code.

potatohead wrote: »

Bill, this is a great piece of code!

I got,

>2000.2007
02000- 02 08 00 00 FC 03 00 00

Now it needs a variety of instructions running..

cgracey · 2012-12-09 21:30

Doug,

Type 'n' into the monitor to switch to long mode. Then do '2000.2007'. Or, all at once you can type 'n2000.2007'.

potatohead wrote: »

Bill, this is a great piece of code!

I got,

>2000.2007
02000- 02 08 00 00 FC 03 00 00

Now it needs a variety of instructions running..

Bill Henning · 2012-12-09 21:31

I just noticed this message... I will test that tomorrow, thanks for the heads-up.

(I used identical instructions so I could generate the LMM code easily)

Do you think something like the following would be enough to test the potential latency issue:

  add   count1,#1
  add   count2,#2
  add   count3,#3
  add   count4,#4
  add   count4,#4
  add   count3,#3
  add   count2,#2
  add   count1,#1

I'd repeat it 512 times, and execute all 4k instructions

cgracey wrote: »

Bill, there could be a latency issue with RDQUAD and the pipeline. Instead of putting a bunch of identical instructions in the LMM memory, try a pattern of instructions that will indicate if they are actually executing in proper sequence.

cgracey · 2012-12-09 21:38

Bill Henning wrote: »

I just noticed this message... I will test that tomorrow, thanks for the heads-up.

(I used identical instructions so I could generate the LMM code easily)

It will work out well to have a RDQUAD followed immediately by the QUAD window, but the instructions executing in the QUAD window won't be from the RDQUAD just executed, but from the RDQUAD executed in the prior iteration.

Bill Henning · 2012-12-09 21:41

Thanks, that is excellent news!

I noticed the pipelining effect, $800 loops only executed "add count,#1" $3FC times, due to executing the nop's the first time around.

This has implications for FJMP, FCALL, FRET etc, but it is managable.

cgracey wrote: »

It will work out well to have a RDQUAD followed immediately by the QUAD window, but the instructions executing in the QUAD window won't be from the RDQUAD just executed, but from the RDQUAD executed in the prior iteration.

cgracey · 2012-12-09 22:10

Bill Henning wrote: »
I just noticed this message... I will test that tomorrow, thanks for the heads-up.

(I used identical instructions so I could generate the LMM code easily)

Do you think something like the following would be enough to test the potential latency issue:
  add   count1,#1
  add   count2,#2
  add   count3,#3
  add   count4,#4
  add   count4,#4
  add   count3,#3
  add   count2,#2
  add   count1,#1
I'd repeat it 512 times, and execute all 4k instructions

That might do it. I'm thinking if you did something like...

WRBYTE h00,PTRB++
WRBYTE h01,PTRB++
WRBYTE h02,PTRB++
WRBYTE h03,PTRB++
WRBYTE h04,PTRB++
WRBYTE h05,PTRB++
WRBYTE h06,PTRB++
WRBYTE h07,PTRB++
<repeat>

...you could see very easily if you were executing part of one RDQUAD and part of another.

cgracey · 2012-12-09 22:13

Bill Henning wrote: »

Thanks, that is excellent news!

I noticed the pipelining effect, $800 loops only executed "add count,#1" $3FC times, due to executing the nop's the first time around.

This has implications for FJMP, FCALL, FRET etc, but it is managable.

I think you'd just throw away the last RDQUAD's four instructions, kind of like the pipeline throws away three instructions on a branch.

potatohead · 2012-12-09 22:21

Funny! I'm so used to bytes, I didn't even think about long mode. Nice feature Chip! No more mapping bytes in my head!

LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Comments