
LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

Bill Henning Posts: 6,445
edited 2012-12-14 18:12 in Propeller 1
Greetings!

As Dave and Ariba have already started posting about their experiments, I think it is time for me to post about the LMM ideas I am experimenting with for Propeller 2.

Please note the following thought experiments have not been tried on my DE0-Nano yet.

One of the problems Parallax (and the GCC team) will face is that Propeller 2 is a very different beast from Propeller 1. To get the most out of the Propeller 2 with LMM, the compiler will need some fairly significant changes.

LMM on Propeller 1, not counting FCACHE etc., executes a native 4-cycle instruction in 20 clock cycles (4-way unrolled LMM loop) or 18 cycles (8-way unrolled LMM loop). (To save space, I will refer to Propeller 1-style LMM as "LMM1".)

Without pipelining, Propeller 1 can natively execute instructions at up to 20MIPS @ 80MHz.

With its pipelining, Propeller 2 can natively execute instructions at up to 160MIPS @ 160MHz.

I've been scratching my head trying to make LMM2 far more efficient.

I've also been trying to figure out a "stepping stone" version that will require very few changes to GCC, in order to get GCC on Propeller 2 up and running as quickly as possible - mind you, it will run much slower than an LMM2 that is optimized for the Propeller 2 architecture.

1) LMM1 GCC compatibility mode - LMM2 experiment #1
next    rdlong    ins,pc   ' fetch the next LMM instruction (could be rdlongc)
        jmpd      #next    ' delayed jump back to the fetch
        add       pc,#4    ' advance the LMM program counter (delay slot)
ins     nop                ' fetched instruction executes here (delay slot)


The above simple kernel will execute LMM1 code at 8 cycles per instruction (ignoring the longer execution times of hub instructions and the like) - and as Propeller 2 is pipelined, this means up to 20MIPS @ 160MHz. Basically, LMM1 code will run at the same speed on Propeller 2 as native cog code does on Propeller 1 (assuming 160MHz and 80MHz clock speeds respectively), for a native "efficiency" factor of 1/8. (OK, JMP and CALL will take 3 cycles inside the loop, but they still fit in the 8-cycle hub window.)
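As a quick sanity check on those numbers, here is the arithmetic as a small Python sketch (the cycle counts are the ones quoted above; lmm_mips is just a made-up helper name):

# Throughput ceilings implied by the cycle counts quoted above.
# All figures are theoretical maximums for straight-line code (no FCACHE).

def lmm_mips(clock_hz, insns_per_window, cycles_per_window):
    """Effective MIPS when insns_per_window instructions complete every
    cycles_per_window clocks."""
    return clock_hz * insns_per_window / cycles_per_window / 1e6

print(lmm_mips(80_000_000, 1, 4))    # P1 native:              20.0 MIPS
print(lmm_mips(80_000_000, 1, 20))   # P1 LMM1, 4-way loop:     4.0 MIPS
print(lmm_mips(80_000_000, 1, 18))   # P1 LMM1, 8-way loop:    ~4.4 MIPS
print(lmm_mips(160_000_000, 1, 8))   # P2 LMM2 experiment #1:  20.0 MIPS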

A simple change may boost this, but I have to check that the pipeline will allow the extra speed:
again   reps      #511,#4
        nop       ' delay slot

next    rdlongc   ins,pc
        nop       ' delay slot, useful for profiling - ie "ADD LMMOPS,#1"
        add       pc,#4
ins     nop

        jmp       #again   ' we will break the REPS with any JMP or CALL

(I think the above is almost exactly the same as Dave's and Ariba's version)

Due to the pipeline delays, this new version is not much faster - but every little bit helps. This version can theoretically execute four simple P2 instructions in 8+5+5+5 = 23 cycles once the hub is in sync, for a maximum "efficiency" of 4/23 - however, due to hub windows, it would actually take 24 cycles to execute the four instructions, for 4/24.
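Same check for this variant (a minimal sketch; the 23- and 24-cycle figures come straight from the paragraph above):

clock_hz = 160_000_000
print(clock_hz * 4 / 23 / 1e6)   # ~27.8 MIPS, ignoring hub windows
print(clock_hz * 4 / 24 / 1e6)   # ~26.7 MIPS once locked to the 8-cycle hub window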

This version should still be mostly GCC Prop1 compatible.

2) LMM2 experiment #2

This version tries to improve on the efficiency of (1) by discarding PropGCC1 (shorthand for Propeller 1 GCC) compatibility.

Actually, for best effect it requires significant changes to GCC to support the "VLIW"-like mode it makes possible (more on this later)
loopy  reps      #511,#6
       nop       ' delay slot
        
next   rdlongc   ins1,[ptra++]
       rdlongc   ins2,[ptra++]
       nop       ' delay slot
       nop       ' delay slot - or use a JMPD here if you don't want to use REPS
ins1   nop
ins2   nop

       jmp       #loopy               ' so we go back to REPS when a JMP/CALL busts the REPS loop


The changes here are:

- using PTRA as the program counter gives free increments
- using REPS gives us free looping
- the delay slots can be used to implement interrupts

As long as the pipelining works out the way I expect, this would execute four native simple instructions in 8+6 = 14 cycles, for a 4/14 efficiency.

Hub syncing will reduce this to an actual best rate of 4/16, ie 40MIPS max at 160MHz.

VLIW mode:

With appropriate changes, PropGCC could support a VLIW mode where ins1/ins2 are treated as either a single 64-bit instruction or two 32-bit instructions; note that all code would have to be generated on eight-byte boundaries for this to work well.

This would bring a major advantage to GCC code generation - here are just a couple of examples:

MVI REGx,##longvalue   --->   MOV REGx,ins2     ' the 32-bit literal rides in the ins2 slot

JMP #FJMP              --->   SETPTRA ins2      ' the hub target address rides in the ins2 slot
LONG hubaddr

etc
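As a rough illustration of what the eight-byte-boundary requirement means for the code generator, here is a minimal sketch in Python (all names are hypothetical, and treating an all-zero long as a NOP is an assumption of mine, not a confirmed P2 encoding):

NOP = 0x0000_0000   # assumed NOP padding value, purely illustrative

def pack_vliw_pairs(longs):
    """Group a stream of 32-bit longs into two-long (eight-byte) ins1/ins2
    pairs, padding with a NOP so the stream length is a multiple of two."""
    if len(longs) % 2:
        longs = longs + [NOP]
    return [(longs[i], longs[i + 1]) for i in range(0, len(longs), 2)]

# three longs -> two aligned pairs, the last one padded
print([(hex(a), hex(b)) for a, b in pack_vliw_pairs([0x1, 0x2, 0x3])])
# [('0x1', '0x2'), ('0x3', '0x0')]

A real code generator would also have to insert a padding NOP whenever an MVI or FJMP would otherwise start in the ins2 slot, so that its 32-bit literal always lands in ins2.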

3) LMM2 experiment #3 - "Off-stride"

This is the most I can pack into an 8 cycle hub window.
loopy   reps      #511,#6
        nop

next    rdlongc   ins1,[ptra++]
        rdlongc   ins2,[ptra++]
        rdlongc   ins3,[ptra++]
ins1    nop
ins2    nop
ins3    nop

        jmp       #loopy   ' used when REPS loop is broken

The good news:

It has a 3/8 efficiency, giving up to 60MIPS!

The bad news:

You lose the nice fast "MVI" and "FJMP" capability (at least without a LOT of compiler headaches).

Now, I am not certain that ins2 and ins3 will be ready to execute in time; if they are not, priming and inverting the order should work:
loopy   reps      #511,#6
        nop

ins1    nop
ins2    nop
ins3    nop
        rdlongc   ins1,[ptra++]
        rdlongc   ins2,[ptra++]
        rdlongc   ins3,[ptra++]

        jmp       #loopy   ' used when REPS loop is broken


4) LMM2 experiment #4 - Things get weird...

(Baggers asked if anyone thought of using RDQUAD...)
top         reps       #511,#12
            setquad   #block1

next2       rdquad    [ptr++]
            setquad  #block2
            nop    ' delay slot
block1      nop
            nop
            nop
            nop
next1       rdquad    [ptr++]
            setquad  #block1
            nop    'delay slot
block2      nop
            nop
            nop
            nop

            jmp   #top

I think this would not be very nice to generate code for, but it does have an 8/16 (50%) potential efficiency - 80MIPS

For best results, it would have to be treated as a 128 bit VLIW machine, and block1 should be primed before entering the loop.

5) LMM2 experiment #5 - Even weirder...

My attempt at shrinking #4 - but I need to make sure it will still execute in 8 cycles (when synced to the hub)
top         reps       #511,#6
            setquad   #block

block       nop
            nop
            nop
            nop
            jmpd   #block   
            rdquad    [ptr++]

            jmp   #top

Block must be primed, and I am not sure the pipelining would work (ie keep it at 8 cycles), but if it does, it will give us 4/8, ie 50% efficiency, for 80MIPS as well.

6) LMM2 experiment #7+ And now, for something completely different...

All kidding aside, the other ideas I've been playing with get really weird, and do not exceed 50% potential efficiency.

For all experiments, FCACHE/FLIB can add a huge boost.
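For reference, here are the theoretical straight-line ceilings of the variants above at 160MHz, tallied in a small Python sketch (the instruction/cycle fractions are the ones quoted in each section; the labels are mine):

clock_mhz = 160
variants = {
    "#1  LMM1-compatible, 1 per 8 clocks":  (1, 8),
    "#1b unrolled REPS,   4 per 24 clocks": (4, 24),
    "#2  two RDLONGC,     4 per 16 clocks": (4, 16),
    "#3  three RDLONGC,   3 per 8 clocks":  (3, 8),
    "#4/#5 RDQUAD,        4 per 8 clocks":  (4, 8),
}
for name, (insns, cycles) in variants.items():
    print(f"{name}: {clock_mhz * insns / cycles:5.1f} MIPS")
# 20.0, 26.7, 40.0, 60.0 and 80.0 MIPS respectively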

Apologies if I got the syntax wrong on any of the instructions above; I will fix errors as I find out about them :)

UPDATE

Releasing the Prop2 Terasic boards was an extremely useful thing to do.

While testing the RDQUAD-based LMM2 ideas above, Chip was able to find and fix a small error in the Verilog sources, potentially saving the cost of an additional shuttle run.

Andy noticed the first sign of a problem, and I and others were able to verify it.

After Chip's change, single-cycle instructions would work, but unexpected issues cropped up with multi-cycle instructions.

There is a potentially useful workaround; however, it reduces the maximum MIPS from 80 to 60, as it only executes three instructions but provides a 32-bit literal or address for them - which removes the need for MVI, FJMP and many other constructs.

I'll add that code here shortly, and I plan to verify that it works for both single and multi-cycle instructions.

I am very happy to see how people jumped in with both feet to test P2!

Update:

I may have gotten it running using RDQUAD with 4 instructions executed per RDQUAD!


---

Let me know what you think! Your input is most appreciated...

---

Notes:

All methods shown above will do much better with FCACHE and FLIB added into the equation; the efficiency figures above are for straight in-line code.

(I've been playing with LMM2 on paper since Chip started posting the instructions - but I did not want to post until I had enough information on the pipelining)

Comments

  • Bill Henning Posts: 6,445
    edited 2012-12-09 06:57
    LMM2 NAMING CONVENTION

    To save typing, and make it easier for all of us to be specific about exactly which variant we are talking about:
    LMM2Q  - Propeller 2 LMM, RDQUAD, 4/8 efficiency
    LMM2C  - Propeller 2 LMM, one RDLONGC per hub window, 1/8 efficiency
    LMM2C2 - two RDLONGC's per hub window, 2/8 efficiency
    LMM2C3 - three RDLONGC's per hub window, 3/8 efficiency
    LMM2P  - paging variant I did not publish, roughly 3/8 efficiency
    


    In case any of you are wondering, I did not publish LMM2P as it required significantly more compiler work than LMM2Q and performed more poorly than LMM2Q due to additional overhead and thrashing.

    THE TORTURE TEST EXPLAINED

    Currently, the torture test consists of the following eight instructions, repeated 256 times in the hub (2k instructions, 8k hub used)
    i1	add	count, #1
    i2	add	count2,#1
    i3	add	count3,#1
    i4	add	count4,#1
    i5	rdlong	lc,result6
    i6	add	lc,#1
    i7	wrlong	lc,result6
    i8	add	count,#1
    


    Due to how RDQUAD works, the LMM2 loop will alternately execute i1-i4, then i5-i8, ensuring that RDQUAD reads a different group of four instructions on every loop iteration.

    i1-i4 are simple adds to counters, which are later deposited into result1-result4

    i5-i7 increment result6 in the hub

    i8 does an extra increment of result1

    The LMM2 loop's RDQUAD is executed 256 times (so we get nice, easy-to-read cycle counts in hex); however, the last four instructions fetched are not executed, and the first four are executed as NOP's, as they would not have been fetched yet.

    The first iteration executes four NOP's on the first pass (thanks to SETQUAZ)

    This means that i1-i4 gets executed 128 times, and i5-i8 gets executed 127 times

    Therefore the mathematically correct expected results (in hex) are:

    result 1: $0000FF ($80+$7F)
    result 2: $000080
    result 3: $000080
    result 4: $000080

    result 5: $000FF7 *** approximate
    result 6: $00007F
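    Here are those predictions worked out from the counts above, as a quick Python check (result 5, which is approximate, is left out; the variable names are mine):

    n14 = 128   # executions of the i1-i4 quad
    n58 = 127   # executions of the i5-i8 quad

    result1 = n14 + n58        # i1 and i8 both do "add count,#1"
    result2 = result3 = result4 = n14
    result6 = n58              # i5-i7 read, increment and write result6 once per pass

    for name, val in [("result1", result1), ("result2", result2),
                      ("result3", result3), ("result4", result4),
                      ("result6", result6)]:
        print(f"{name}: ${val:06X}")
    # result1: $0000FF, result2-4: $000080, result6: $00007F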

    256*8 RDQUAD's ($800 cycles), 127*8 RDLONG's ($3FF cycles), 127*8 WRLONGS ($3FF) cycles as after first hub access all will hit hub sync sweet spots

    $800+$3F0+$3F0 = $FE0 cycles

    NOTE:

    Other instructions can be tried in i1-i8, but then predicted results should be calculated for them.

    This post is reserved for source code, links and FAQ's

    - added LMM2 using RDQUAD test
    - added LMM2 using RDQUAD test 2
    - added LMM2 using RDQUAD test 3 to verify Andy's results
    - added what looks like verified working RDQUAD LMM code :)
    - added torture test for latest Nano loadable that matches the instructions above
  • Baggers Posts: 3,019
    edited 2012-12-09 07:43
    Has anyone thought of using rdquads?
  • Bill Henning Posts: 6,445
    edited 2012-12-09 07:46
    Yep,

    In theory, should be better.

    In practice, does not look like it, as it is very awkward...

    I'll add it to the top thread.
  • Baggers Posts: 3,019
    edited 2012-12-09 07:59
    btw, I'm not sure, as I've not looked at the instructions yet, as I don't have an emulator setup, but wouldn't it be setquad #block1 and setquad #block2? I may be wrong, but this is the usual way to point at an address and not the contents of it.

    Edit: No a quick read up on it :) and you are correct :)
  • Bill Henning Posts: 6,445
    edited 2012-12-09 08:11
    Thanks Baggers!

    I did mention it was a bit weird... had to be, due to how the pipelining works.

    I'll be trying these experiments on my DE0-Nano, but it's a bit painful; as I can't use tasks to do serial out if I want accurate cycle counts, I have to keep dumping out to the monitor and examining the hub.

    But it is still a lot of fun :-) :-) :-)
  • Cluso99 Posts: 18,069
    edited 2012-12-09 11:34
    Bill: I have not looked too deeply into your code examples.
    I am unsure if this works or not, but Chip said that we can set the quad cache into a window of cog ram. If those 4 quads were windowed into the LMM loop as instr1-4 I wonder if they would be able to be executed from there? Of course this makes the LMM compilers more difficult but if looking for every last bit of speed it could be worth looking into.
  • Bill Henning Posts: 6,445
    edited 2012-12-09 13:05
    Hi Cluso99,

    (4) and (5) in post #1 use RDQUAD (cache mapped into the cog), and assuming they work, will be able to execute code from the hub at 50% of native cog speed (much more with FCACHE etc.). I think I may have to remove the delay slots from (4) to make it fit within the hub window; they are not needed anyway due to how I use two rdquads.

    and you are absolutely correct, it does require more compiler work.
    Cluso99 wrote: »
    Bill: I have not looked too deeply into your code examples.
    I am unsure if this works or not, but Chip said that we can set the quad cache into a window of cog ram. If those 4 quads were windowed into the LMM loop as instr1-4 I wonder if they would be able to be executed from there? Of course this makes the LMM compilers more difficult but if looking for every last bit of speed it could be worth looking into.
  • cgracey Posts: 14,155
    edited 2012-12-09 15:42
    After a RDQUAD, the 4-register QUAD window becomes executable on the 5th instruction.

    I think this is what can be done:
    	repd	$1FF,@extra3	'$1FF causes infinite REPD
    	nop
    	nop
    	nop
    
    	rdquad	ptra++	'1 clock rdquad, QUAD executable at +5 clocks
    inst1	nop		'inst1..4 are the QUAD window
    inst2	nop		'because of pipeline delays, inst1..4 execute
    inst3	nop		'...from RDQUAD *before* last
    inst4	nop
    extra1	nop		'QUAD becomes executable here (5th instruction)
    extra2	nop
    extra3	nop		'whole loop takes 8 clocks and executes 4 LMM instructions
    

    Note that there are 8 permutations possible of the repeated block by putting the last instruction where the first one is, and so on.

    Bill, I'm not totally sure about the timing on this, as I haven't done the experiment in a while, but this could be verified. If it is, indeed, the 5th instruction after the RDQUAD when the mapped QUADs become executable, you would want to put the QUAD window right after the RDQUAD instruction. The funny thing is that it would always be executing the QUAD from the RDQUAD before the RDQUAD just above the QUAD window.
  • Bill Henning Posts: 6,445
    edited 2012-12-09 15:57
    Thanks Chip!

    I will try it - as long as I can get 4 LMM instructions running in 8 clocks I'll be happy - regardless of how I need to arrange them :-)
    cgracey wrote: »
    After a RDQUAD, the 4-register QUAD window becomes executable on the 5th instruction.

    I think this is what can be done:
    	repd	$1FF,@extra3	'$1FF causes infinite REPD
    	nop
    	nop
    	nop
    
    	rdquad	ptra++	'1 clock rdquad, QUAD executable at +5 clocks
    inst1	nop		'inst1..4 are the QUAD window
    inst2	nop		'because of pipeline delays, inst1..4 execute
    inst3	nop		'...from RDQUAD *before* last
    inst4	nop
    extra1	nop		'QUAD becomes executable here (5th instruction)
    extra2	nop
    extra3	nop		'whole loop takes 8 clocks and executes 4 LMM instructions
    

    Note that there are 8 permutations possible of the repeated block by putting the last instruction where the first one is, and so on.

    Bill, I'm not totally sure about the timing on this, as I haven't done the experiment in a while, but this could be verified. If it is, indeed, the 5th instruction after the RDQUAD when the mapped QUADs become executable, you would want to put the QUAD window right after the RDQUAD instruction. The funny thing is that it would always be executing the QUAD from the RDQUAD before the RDQUAD just above the QUAD window.

    If you squint at it just the right way... in a way it becomes pipelined LMM code.
  • Kye Posts: 2,200
    edited 2012-12-09 17:07
    Last 3 instructions could be used for interrupt polling.
  • Bill Henning Posts: 6,445
    edited 2012-12-09 19:05
    I just verified the pipelining for Experiment #1: you can have five instructions between back-to-back or repeated RDLONG's.

    here is the code fragment:
    	reps	#256,#6
    	getcnt	start
    
    	rdlong	ins3,what
    ins1	nop		' verified, does not execute
    ins2	nop		' verified, does not execute
    ins3	nop		' verified, EXECUTES
    ins4	nop		' verified, EXECUTES
    ins5	mov	ins3,#0	' 2049 cycles measured for 256 loop iterations
    
    	getcnt	stop
    
    	mov	cycles,stop
    	sub	cycles,start
    
    	wrlong	cycles,result1
    	wrlong	count,result2
    


    Adding "ins6" makes the 256 loops take 4096+ cycles

    Kye - yes, the delay slots can definitely be used to simulate interrupts, as previously posted.
  • Bill Henning Posts: 6,445
    edited 2012-12-09 19:27
    A modified Experiment #2 is now also verified... 512 LMM instructions executed in 2049 cycles - that is one per four clock cycles, as predicted.
    	org	0
    
    	' make 256 LMM instructions of addins
    
    	reps	#511,#1
    	setptra	what
    	wrlong	addins,ptra++	' save the add instruction into the hub
    
    	wrlong	addins,ptra++	' save the add instruction into the hub
    
    	' point at start of lmm code
    
    	setptra	what	
    
    	' execute 256 LMM instructions
    
    	reps	#256,#5
    	getcnt	start
    
    	rdlongc	ins3,ptra++
    ins1	rdlongc	ins4,ptra++	
    ins2	nop		' verified, does not execute
    ins3	nop		' verified, EXECUTES
    ins4	nop		' verified, EXECUTES
    
    	getcnt	stop
    
    	mov	cycles,stop
    	sub	cycles,start
    
    	wrlong	cycles,result1
    	wrlong	count,result2
    
  • Bill Henning Posts: 6,445
    edited 2012-12-09 19:33
    I am on a roll...

    Experiment #3 is also verified!!!

    768 LMM instructions executed in 2049 cycles!
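    Cross-checking the three verified runs so far in Python (2,049 cycles is the measured figure in each case; a 160MHz production clock is assumed):

    clock_mhz = 160
    runs = {
        "experiment #1 (one RDLONG per loop)":    (256, 2049),
        "experiment #2 (two RDLONGC per loop)":   (512, 2049),
        "experiment #3 (three RDLONGC per loop)": (768, 2049),
    }
    for name, (insns, cycles) in runs.items():
        per = cycles / insns
        print(f"{name}: {per:.2f} clocks/insn, ~{clock_mhz / per:.0f} MIPS at {clock_mhz}MHz")
    # ~8, ~4 and ~2.67 clocks per LMM instruction - ie ~20, ~40 and ~60 MIPS

    The PASM test follows: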
    	org	0
    
    	' make 256 LMM instructions of addins
    
    	reps	#511,#1
    	setptra	what
    	wrlong	addins,ptra++	' save the add instruction into the hub
    
    	wrlong	addins,ptra++	' save the add instruction into the hub
    
    	reps	#511,#1
    	nop
    	wrlong	addins,ptra++	' save the add instruction into the hub
    
    	wrlong	addins,ptra++	' save the add instruction into the hub
    
    	' point at start of lmm code
    
    	setptra	what	
    
    	' execute 256 LMM instructions
    
    	reps	#256,#6
    	getcnt	start
    
    	rdlongc	ins3,ptra++
    ins1	rdlongc	ins4,ptra++	
    ins2	rdlongc	ins5,ptra++	
    ins3	nop		' verified, does not execute
    ins4	nop		' verified, EXECUTES
    ins5	nop		' verified, EXECUTES
    
    	getcnt	stop
    
    	mov	cycles,stop
    	sub	cycles,start
    
    	wrlong	cycles,result1
    	wrlong	count,result2
    
  • Bill Henning Posts: 6,445
    edited 2012-12-09 20:14
    I really like rdquad - I just verified that
    	org	0
    
    	' make 4096 LMM instructions of "add count,#1"
    
    	setptra	what		' point to 16k buffer of "add count,#1" lmm code
    	mov	times,#16
    loop	reps	#256,#1
    	nop
    	wrlong	addins,ptra++	' save the add instruction into the hub
    	djnz	times,#loop
    
    	' point at start of lmm code
    
    	setptra	what	
    
    	' execute 256 LMM instructions
    
    	setquad	#ins4
    
    	nop
    	nop
    
    	reps	#256,#8
    	getcnt	start
    
    	rdquad	ptra++
    ins1	nop
    ins2	nop
    ins3	nop		' verified, does not execute
    ins4	nop		' verified, EXECUTES
    ins5	nop		' verified, EXECUTES
    ins6	nop		' verified, EXECUTES
    ins7	nop		' verified, EXECUTES
    
    	getcnt	stop
    
    	mov	cycles,stop
    	sub	cycles,start
    
    	wrlong	cycles,result1
    	wrlong	count,result2
    

    Executed the RDQUAD loop in $801 cycles, so it really is possible to put 7 instructions between hub-synced RDQUAD's!

    I also verified that on the first go-around it only starts executing fetched code at ins6, so it needs 5 delay slots before the first executable slot.
  • cgracey Posts: 14,155
    edited 2012-12-09 20:34
    Bill, there could be a latency issue with RDQUAD and the pipeline. Instead of putting a bunch of identical instructions in the LMM memory, try a pattern of instructions that will indicate if they are actually executing in proper sequence.
  • Bill Henning Posts: 6,445
    edited 2012-12-09 20:43
    Ok, this took a bit of experimentation - but I now have a *working* LMM2 RDQUAD engine that executes at almost exactly 50% efficiency!

    Note - any JMP or CALL out of the loop will have to "prime" ins1-ins4 with NOP's or do a RDQUAD ptr++ at least 5 cycles ahead.

    Basically, this is a pipelined LMM engine.

    This runs as-is on DE0-Nano, and should run on the DE2-115

    - use pnut.exe to download and run it
    - when the test is finished, it drops back into the monitor
    - use the 'n' command to switch to longs for output
    - type '2000.2007' to see the two counters
    - the first long is how many cycles 256 iterations of the quad LMM loop took; it is normally between $801-$803
    - the second long is how many LMM instructions were executed; it will be $3FC (the first iteration of the loop executes NOP's) - see the quick check below
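    Reading those two longs back (a small Python check; the 160MHz figure is the assumed production clock - the FPGA build below runs at the 60MHz set in its CON block):

    cycles    = 0x802   # typical cycle count reported for 256 iterations
    lmm_insns = 0x3FC   # LMM instructions executed (the first pass runs NOP's)

    per_insn = cycles / lmm_insns
    print(round(per_insn, 2))             # ~2.01 clocks per LMM instruction
    print(round(160e6 / per_insn / 1e6))  # ~80 MIPS at a 160MHz production clock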

    THIS WAS FUN!
    '
    ' rdquad lmm2 test - now working LMM2 loop!
    '
    ' William Henning
    ' http://Mikronauts.com
    '
    
    CON
    
    	CLOCK_FREQ = 60_000_000
    	BAUD = 115_200
    
    DAT
    	org	0
    
    	' make 4096 LMM instructions of "add count,#1"
    
    	setptra	what		' point to 16k buffer of "add count,#1" lmm code
    	mov	times,#16
    loop	reps	#256,#1
    	nop
    	wrlong	addins,ptra++	' save the add instruction into the hub
    	djnz	times,#loop
    
    	' point at start of lmm code
    
    	setptra	what	
    
    	' execute 256 LMM instructions
    
    	nop
    '--
    	setquad #ins1
    
    	reps	#256,#8
    	getcnt	start
    
    
    	rdquad   ptra++
    '--
    ins1    nop	' must be quad-long aligned!
    ins2    nop
    ins3    nop
    ins4    nop
            nop	' delay slot
    	nop	' delay slot
    	nop	' delay slot
    
    	getcnt	stop
    
    	mov	cycles,stop
    	sub	cycles,start
    
    	wrlong	cycles,result1
    	wrlong	count,result2
    
    	coginit	monitor_pgm,monitor_ptr	'relaunch cog0 with monitor - thanks Chip!
    
    monitor_pgm	long	$70C			'monitor program address
    monitor_ptr	long	90<<9 + 91		'monitor parameter (conveys tx/rx pins)
    
    addins	add	count,#1
    
    start	long	0
    stop	long	0
    cycles	long	0
    
    count	long	0
    times	long	0
    
    what		long	16384
    result1		long	8192
    result2		long	8196
    
  • Bill Henning Posts: 6,445
    edited 2012-12-09 20:49
    At 80MIPS, LMM2 on Prop2 is quite competitive with similarly priced ARM chips... and theoretically LMM2 could run in all eight cogs!
  • cgracey Posts: 14,155
    edited 2012-12-09 20:53
    Really neat!

    One minor thing: after doing a 'GETCNT reg' and some instructions to time, you can end with a 'SUBCNT reg' and you'll have the difference in reg.
  • SRLM Posts: 5,045
    edited 2012-12-09 20:59
    At 80MIPS, LMM2 on Prop2 is quite competitive with similarly priced ARM chips... and theoretically LMM2 could run in all eight cogs!

    I don't think it's directly comparable:
    The MIPS figures which ARM (and most of the industry) quotes are "Dhrystone VAX MIPs". The idea behind this measure is to compare the performance of a machine (in our case, an ARM system) against the performance of a reference machine. The industry adopted the VAX 11/780 as the reference 1 MIP machine.
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3885.html
  • potatohead Posts: 10,261
    edited 2012-12-09 21:02
    Bill, this is a great piece of code!

    I got,

    >2000.2007
    02000- 02 08 00 00 FC 03 00 00

    Now it needs a variety of instructions running..
  • jmg Posts: 15,173
    edited 2012-12-09 21:11
    At 80MIPS, LMM2 on Prop2 is quite competitive with similarly priced ARM chips.

    Do you have a price ? - or what was your guess ?
  • Bill Henning Posts: 6,445
    edited 2012-12-09 21:24
    That was an educated guess.

    At UPEW, a rough estimate of no more than 1.5x Prop1 cost was given, so my guess is $12 or less quantity one for Prop2

    http://www.digikey.com/product-detail/en/LPC1785FBD208,551/568-7573-ND/2677567

    120MHz ARM for $12.00 at Digikey

    I suspect that is competitive with a single cog if we also use FCACHE... but we will have 7 extra cogs as well :)
    jmg wrote: »
    Do you have a price ? - or what was your guess ?
  • Bill Henning Posts: 6,445
    edited 2012-12-09 21:26
    Thanks!

    Chip's suggested replacement for 4 and 5 was simpler, and while it is possible to get 4&5 running with the correct alignment, simple is good.

    All those 'add count,#1' did prove the code ran, now we can try other stuff.

    Mind you, adding FCACHE, FLIB etc will increase the performance even more; I expect 99% native speed for FCACHE-friendly code.
    potatohead wrote: »
    Bill, this is a great piece of code!

    I got,

    >2000.2007
    02000- 02 08 00 00 FC 03 00 00

    Now it needs a variety of instructions running..
  • cgracey Posts: 14,155
    edited 2012-12-09 21:30
    Doug,

    Type 'n' into the monitor to switch to long mode. Then do '2000.2007'. Or, all at once you can type 'n2000.2007'.
    potatohead wrote: »
    Bill, this is a great piece of code!

    I got,

    >2000.2007
    02000- 02 08 00 00 FC 03 00 00

    Now it needs a variety of instructions running..
  • Bill Henning Posts: 6,445
    edited 2012-12-09 21:31
    I just noticed this message... I will test that tomorrow, thanks for the heads-up.

    (I used identical instructions so I could generate the LMM code easily)

    Do you think something like the following would be enough to test the potential latency issue:
      add   count1,#1
      add   count2,#2
      add   count3,#3
      add   count4,#4
      add   count4,#4
      add   count3,#3
      add   count2,#2
      add   count1,#1
    

    I'd repeat it 512 times, and execute all 4k instructions
    cgracey wrote: »
    Bill, there could be a latency issue with RDQUAD and the pipeline. Instead of putting a bunch of identical instructions in the LMM memory, try a pattern of instructions that will indicate if they are actually executing in proper sequence.
  • cgracey Posts: 14,155
    edited 2012-12-09 21:38
    I just noticed this message... I will test that tomorrow, thanks for the heads-up.

    (I used identical instructions so I could generate the LMM code easily)

    It will work out well to have a RDQUAD followed immediately by the QUAD window, but the instructions executing in the QUAD window won't be from the RDQUAD just executed, but from the RDQUAD executed in the prior iteration.
  • Bill Henning Posts: 6,445
    edited 2012-12-09 21:41
    Thanks, that is excellent news!

    I noticed the pipelining effect: $800 loops only executed "add count,#1" $3FC times, due to executing the NOP's the first time around.

    This has implications for FJMP, FCALL, FRET etc, but it is manageable.
    cgracey wrote: »
    It will work out well to have a RDQUAD followed immediately by the QUAD window, but the instructions executing in the QUAD window won't be from the RDQUAD just executed, but from the RDQUAD executed in the prior iteration.
  • cgracey Posts: 14,155
    edited 2012-12-09 22:10
    I just noticed this message... I will test that tomorrow, thanks for the heads-up.

    (I used identical instructions so I could generate the LMM code easily)

    Do you think something like the following would be enough to test the potential latency issue:
      add   count1,#1
      add   count2,#2
      add   count3,#3
      add   count4,#4
      add   count4,#4
      add   count3,#3
      add   count2,#2
      add   count1,#1
    

    I'd repeat it 512 times, and execute all 4k instructions

    That might do it. I'm thinking if you did something like...

    WRBYTE h00,PTRB++
    WRBYTE h01,PTRB++
    WRBYTE h02,PTRB++
    WRBYTE h03,PTRB++
    WRBYTE h04,PTRB++
    WRBYTE h05,PTRB++
    WRBYTE h06,PTRB++
    WRBYTE h07,PTRB++
    <repeat>

    ...you could see very easily if you were executing part of one RDQUAD and part of another.
  • cgracey Posts: 14,155
    edited 2012-12-09 22:13
    Thanks, that is excellent news!

    I noticed the pipelining effect: $800 loops only executed "add count,#1" $3FC times, due to executing the NOP's the first time around.

    This has implications for FJMP, FCALL, FRET etc, but it is manageable.

    I think you'd just throw away the last RDQUAD's four instructions, kind of like the pipeline throws away three instructions on a branch.
  • potatohead Posts: 10,261
    edited 2012-12-09 22:21
    Funny! I'm so used to bytes, I didn't even think about long mode. Nice feature Chip! No more mapping bytes in my head!