LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)

cgracey · 2012-12-10 17:46

Bill Henning wrote: »

Hi Chip,

I just tried to re-program my Nano, and I got:

"Error: File Z:/2012/prop2/Prop2_DEO_Nano_v2/DE0_Nano_Prop2.jic is corrupted"

I downloaded the zip twice, and tried it about four times.

Help!

I just downloaded it from the link I put up and I programmed it into the board, no problem. Someone was saying something about needing to get the whole Quartus setup onto your machine to get it to work. Anyone remember this?

cgracey · 2012-12-10 17:49

I made an additional SETQUAZ instruction, aside from keeping SETQUAD. They both set the QUAD address, while the first clears the QUAD registers.

Sapieha · 2012-12-10 17:50

Hi Chip.

I run it by pointing at its icon and choosing to run it with Quartus.

That works without problems.

cgracey wrote: »

I just downloaded it from the link I put up and I programmed it into the board, no problem. Someone was saying something about needing to get the whole Quartus setup onto your machine to get it to work. Anyone remember this?

Bill Henning · 2012-12-10 17:52

My bad... I ran the old 10.1 install, not the new 12.1

12.1's programmer worked fine.

cgracey wrote: »

I just downloaded it from the link I put up and I programmed it into the board, no problem. Someone was saying something about needing to get the whole Quartus setup onto your machine to get it to work. Anyone remember this?

Bill Henning · 2012-12-10 17:52

Thank you

cgracey wrote: »

I made an additional SETQUAZ instruction, aside from keeping SETQUAD. They both set the QUAD address, while the first clears the QUAD registers.

Tubular · 2012-12-10 17:59

Worked Ok for me too, Bill. Quartus Programmer v12.1 build 177
Edit: oops took a phone call in between posting and see you've sorted. Onwards and upwards...

Bill Henning · 2012-12-10 18:13

YES!!!!!!!!!!

Other than the cache clearing, I have Prop2 Nano-v2 passing my torture test.

Here is the inner loop:

	' execute 256 LMM instructions

ins5	setquad	#ins1
	nop
	reps	#256,#8
	getcnt	start
'-- must be on a XXXXXXX00 address in the cog
ins1	nop
ins2	nop
ins3	nop
ins4	nop
	rdquad	pc
	add pc,#16
	nop
	nop

Here are the results:

>n
>2000.201f
02000- 000000FE 00000080 00000080 00000080   '................'
02010- 00000FF6 0000007F 00000000 00000000   '................'

THANKS CHIP!

cgracey · 2012-12-10 18:13

Here's a new .zip with updated PNUT.EXE (SETQUAD and SETQUAZ), Prop2_Docs, and the .jic file for the DE0_Nano:

Terasic_Prop2.zip

I'll be back in a few hours.

Bill Henning · 2012-12-10 18:16

More goodies :-)

Thanks!

Did the new Nano_v2 have SETQUAZ?

cgracey wrote: »

Here's a new .zip with updated PNUT.EXE (SETQUAD and SETQUAZ), Prop2_Docs, and the .jic file for the DE0_Nano:

Terasic_Prop2.zip

I'll be back in a few hours.

jmg · 2012-12-10 18:22

cgracey wrote: »

I made an additional SETQUAZ instruction, aside from keeping SETQUAD. They both set the QUAD address, while the first clears the QUAD registers.

Does this mean the P2 is not yet taped out, and sitting in a shuttle queue somewhere ?

cgracey · 2012-12-10 18:25

Bill Henning wrote: »

More goodies :-)

Thanks!

Did the new Nano_v2 have SETQUAZ?

No. It's only inside the .zip I just posted.

cgracey · 2012-12-10 18:26

jmg wrote: »

Does this mean the P2 is not yet taped out, and sitting in a shuttle queue somewhere ?

We were hoping to catch the December shuttle, but we might have to wait until January. We'll see.

cgracey · 2012-12-10 18:31

Quick question to think about:

Can you guys propose a simple word-to-long-instruction scheme that I could quickly place into the Verilog? I've thought about this a little, but haven't come to any conclusions. I'm thinking I could swap either the SEUSSF or SEUSSR instruction with this bit-remapping operation to provide some half-size LMM functionality.

Bill Henning · 2012-12-10 18:38

- Removing conditional execution saves 4 bits
- Remove "nr" and the "direct" bit saves 2 more bits
- chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits

This would be pretty easy for the compiler to support

A simple "RDWORDX" could read a word, and expand it back to a full instruction, so the LMM kernel could work on it

It would not be needed for the fast RDQUAD version as compressed code is about large code size, not highest speed possible

I'd map R0-R15 to $1E0-$1EF as it leaves eight more registers after it for future use (or Prop3 adding more special registers)
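
As a rough illustration of the expansion step described above, here is a minimal Python sketch of what a "RDWORDX"-style word-to-long expansion might compute. The field order inside the 16-bit word (6-bit opcode, wz, wc, 4-bit D, 4-bit S) is an assumption made for this example, as are the fixed values filled in for the dropped fields (always execute, write result, register source). The 32-bit layout (cccc in bits 21:18, z/c/r/i in bits 25:22) follows the bit positions mentioned elsewhere in this thread. `expand_word` is a hypothetical name, not a real instruction or tool.

```python
# Sketch only: the 16-bit packing order below is an ASSUMPTION for illustration.
MASK32 = 0xFFFFFFFF
REG_BASE = 0x1E0  # R0-R15 mapped to $1E0-$1EF, as proposed in the post


def expand_word(word: int) -> int:
    """Expand a 16-bit compressed word into a 32-bit instruction long."""
    opcode = (word >> 10) & 0x3F  # 6-bit opcode (assumed to sit in bits 15:10)
    z = (word >> 9) & 1           # wz flag
    c = (word >> 8) & 1           # wc flag
    d4 = (word >> 4) & 0xF        # 4-bit D selects R0..R15
    s4 = word & 0xF               # 4-bit S selects R0..R15

    r, i = 1, 0                   # ASSUMED defaults: write result, S is a register
    cond = 0b1111                 # ASSUMED default: always execute

    long_ = (opcode << 26) | (z << 25) | (c << 24) | (r << 23) | (i << 22)
    long_ |= cond << 18
    long_ |= (REG_BASE + d4) << 9   # 9-bit D field
    long_ |= REG_BASE + s4          # 9-bit S field
    return long_ & MASK32


if __name__ == "__main__":
    # Arbitrary opcode value, D = R3, S = R5
    print(hex(expand_word((0b100000 << 10) | (3 << 4) | 5)))
```

The 16 bits kept (6 + 1 + 1 + 4 + 4) plus the 16 bits dropped (4 + 2 + 10) account for the full 32-bit long.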

cgracey wrote: »

Quick question to think about:

Can you guys propose a simple word-to-long-instruction scheme that I could quickly place into the Verilog? I've thought about this a little, but haven't come to any conclusions. I'm thinking I could swap either the SEUSSF or SEUSSR instruction with this bit-remapping operation to provide some half-size LMM functionality.

Bill Henning · 2012-12-10 18:42

A new SJMP opcode could provide a direct 8 bit address (or 9 if you get cute)

Also needed are "ENTER COMPRESSED" and "LEAVE COMPRESSED" mode instructions.

This would SERIOUSLY speed up CMM2

(I keep sticking 2 on Prop2 versions to clearly identify which version I am talking about)

jmg · 2012-12-10 18:54

Bill Henning wrote: »

- Removing conditional execution saves 4 bits
- Remove "nr" and the "direct" bit saves 2 more bits
- chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits

This would be pretty easy for the compiler to support

A simple "RDWORDX" could read a word, and expand it back to a full instruction, so the LMM kernel could work on it

So these are not opcodes in the true sense, but more a compression/map technique that a new opcode needs to fetch, to then place in (real) Opcode memory.
Since this will be a streaming operation, rather than limit the opcodes to a tiny subset, why not allow a sticky mapping system, which would allow users to compress functions without forcing everything through a bottleneck?

i.e. this would take the idea of a 16:32 map
- Removing conditional execution saves 4 bits
- Remove "nr" and the "direct" bit saves 2 more bits
- chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits

but when expanding, the 16 new bits come from a register. Less 'chop' and more 'tiny dictionary'

Now, a user can define a 5 bit register group, which simply swaps-in with the 4 variable bits, to give a 'function local' type approach.
- same for the other 4+2 bits but they would less commonly change, and could be patched if needed.

A compiler can organize reg usage to be Function local, and use 4 bit fields, (merged with the 5 bit group, that changes far less often) and then if it finds it needs (say) 2 condition codes in 50 lines of code, it can patch those last, before the jump to execute step, and still have a memory size/bandwidth gain.
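
To make the 'tiny dictionary' idea concrete, here is a minimal Python sketch in which the 16 bits missing from a compressed word are taken from a template register rather than hard-wired. Everything about the packing is an assumption for illustration: the word carries a 6-bit opcode, wz, wc and two 4-bit register fields, while the template supplies the condition bits, the r/i bits, and a 5-bit group for each of D and S. `make_template` and `expand_with_template` are hypothetical names.

```python
# Sketch only: field packing is an ASSUMPTION, not a documented encoding.
MASK32 = 0xFFFFFFFF

# Bits supplied by the template register: r/i (23:22), cccc (21:18),
# upper 5 bits of D (17:13), upper 5 bits of S (8:4).
FIXED_MASK = (0b11 << 22) | (0xF << 18) | (0x1F << 13) | (0x1F << 4)


def make_template(cond: int, r: int, i: int, d_group: int, s_group: int) -> int:
    """Build the 'sticky' part of the instruction, already in final bit positions."""
    return ((r & 1) << 23) | ((i & 1) << 22) | ((cond & 0xF) << 18) \
        | ((d_group & 0x1F) << 13) | ((s_group & 0x1F) << 4)


def expand_with_template(word: int, template: int) -> int:
    """Merge a 16-bit compressed word with the current template register."""
    opcode = (word >> 10) & 0x3F
    z = (word >> 9) & 1
    c = (word >> 8) & 1
    d4 = (word >> 4) & 0xF
    s4 = word & 0xF
    variable = (opcode << 26) | (z << 25) | (c << 24) | (d4 << 9) | s4
    return ((template & FIXED_MASK) | variable) & MASK32


if __name__ == "__main__":
    # Group $1E for both D and S gives registers $1E0-$1EF
    tmpl = make_template(cond=0b1111, r=1, i=0, d_group=0x1E, s_group=0x1E)
    print(hex(expand_with_template((3 << 4) | 5, tmpl)))
```

With the group set to $1E the result matches a fixed R0-R15 at $1E0-$1EF mapping. Changing the template between functions is what would give the 'function local' register window.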

David Betz · 2012-12-10 20:28

cgracey wrote: »

Quick question to think about:

Can you guys propose a simple word-to-long-instruction scheme that I could quickly place into the Verilog? I've thought about this a little, but haven't come to any conclusions. I'm thinking I could swap either the SEUSSF or SEUSSR instruction with this bit-remapping operation to provide some half-size LMM functionality.

Chip,

Have you looked at the CMM document I sent you a while back? It is how PropGCC is currently encoding "compressed" instructions. You might want to check with Eric Smith to see if it makes sense to implement some of the translation in hardware.

Cluso99 · 2012-12-10 21:05

My favoured instructions...

Does not clear:

SETQUAD D
SETQUAD #n

Does clear:

SETQUAD D, wz
SETQUAD #n, wz

Postedit: Just realised it has been decided to use SETQUAZ to zero the quads.

cgracey · 2012-12-10 21:35

Has anyone tried the new .zip stuff on the DE0_Nano? Any LMM/SETQUAD/SETQUAZ news?

Cluso99 · 2012-12-10 21:42

Chip: I am unsure if this is what you are asking regarding bit manipulation...

We already have movs, movd, movi. I have seen where instructions are sometimes modified to be rdxxxx or wrxxxx. I have also seen where they are enabled/disabled.

I am wondering if a movc (moves the cccc bits) or a movzcri (moves the z,c,r & i bits) might be useful ??? Perhaps this could even be a combined instruction that used the wz and wc bits like....

movx D,[#]S wz,wc
where wc copies source bits 3:0 into cccc (bits 21:18)
and wz copies source bits 7:4 into z/c/r/i (bits 25:22)
maybe even the nr/wr bit could signal if the "i" bit 22 was copied???
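
The bit movement suggested for movx can be modelled in a few lines of Python. This is only a sketch of the two copies described above (wc: source bits 3:0 into cccc at 21:18; wz: source bits 7:4 into z/c/r/i at 25:22). The optional nr/wr behaviour for the "i" bit is left out, and `movx` here is a hypothetical Python function, not an existing instruction.

```python
# Sketch only: models the proposed movx field copies on 32-bit values.
MASK32 = 0xFFFFFFFF


def movx(d: int, s: int, wc: bool = False, wz: bool = False) -> int:
    """Return d with the cccc and/or zcri fields replaced from s."""
    if wc:
        # source bits 3:0 -> cccc (bits 21:18)
        d = (d & ~(0xF << 18)) | ((s & 0xF) << 18)
    if wz:
        # source bits 7:4 -> z/c/r/i (bits 25:22)
        d = (d & ~(0xF << 22)) | (((s >> 4) & 0xF) << 22)
    return d & MASK32


if __name__ == "__main__":
    print(hex(movx(0x00000000, 0xA5, wc=True, wz=True)))
```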

Perhaps this may also have more uses with all the new instructions to change things like SETINDx and FIXINDx etc.

I am not sure if this would be really useful or not at this time, so it's just a suggestion - maybe it will trigger something more useful.

Another possibility...

I read a group of 8 bits in from pins displaced from P0. I use a SHR #n and an AND #n to isolate them and put them in the lowest bits. So perhaps an instruction that shifted right n bits and zeroed m upper bits. However we have a problem to use 32 and 32 in an immediate instruction.
Solution:
SHRZ D,#mmmm_nnnnn ' mmmm=0-15, nnnnn=0-31
SHRZ D,label ' where label is a register of mmmmm_nnnnn allowing mmmmm=0-31, nnnnn=0-31
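
For reference, the SHR-then-AND sequence that SHRZ would replace can be sketched in Python as below. It assumes "zeroed m upper bits" means the top m bits of the 32-bit result are cleared after the shift. `shrz` is a hypothetical helper name.

```python
# Sketch only: assumes the mask is applied to the result of the shift.
MASK32 = 0xFFFFFFFF


def shrz(d: int, m: int, n: int) -> int:
    """Shift d right by n bits, then zero the top m bits of the 32-bit result."""
    if not (0 <= n <= 31 and 0 <= m <= 32):
        raise ValueError("shift or mask count out of range")
    return ((d & MASK32) >> n) & (MASK32 >> m)


if __name__ == "__main__":
    # Isolate 8 pins starting at P12: shift right 12, zero the upper 24 bits
    print(hex(shrz(0xFFFAB123, 24, 12)))
```

Isolating an 8-bit field needs m = 24, which only fits the register form, since the immediate form leaves just four bits for m.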

cgracey · 2012-12-10 21:55

This would be a simple change to a unary instruction that works on D. There is no S available. I may not make any changes here, after all. I've been thinking about it and it seems like kind of a wash.

potatohead · 2012-12-10 22:01

Sheesh! I had worked with Bill's code last night running add / sub instructions and had trouble getting them all executed, and/or hit bugs in the program loop, due to the things you guys highlighted today. I thought it was me and didn't post, figuring I would give it all another go this evening.

Glad that got sorted out.

I favor the simpler non-aligned option for RDQUAD mapping. Complexity is high with the other options, and there are already lots of rules to program by. Adding to them really should have a big return, IMHO, otherwise try and consolidate them without impacting things too much. I see that was done too.

Re: Compressed instructions

I'll just throw this out there: Seems to me a subset of instructions might make sense. For a COG running LMM code, it's not going to be doing much else, besides servicing that code. How about a simple look up? Use some of the 16 bits to index the instruction and the remaining for arguments, etc...

A programmer / compiler could choose a subset of instructions that way, and they could be changed too, depending. Target something like 32 instructions. Those get put into the cog as "template instructions", which then get modified by the half size LMM operand when they are fetched, in effect storing the deltas and limiting how many instructions are available at any one time. For a lot of tasks, this subset might really pay off, and it could change on the fly too.
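
A minimal Python sketch of the lookup idea follows. The split of the 16 bits is an assumption made for this example: the top 5 bits index a table of 32 template instructions, the low 9 bits replace the template's S field, and the 2 bits in between are left unused. `expand_lookup` and the table contents are hypothetical.

```python
# Sketch only: the 5/2/9 bit split is an ASSUMPTION for illustration.
MASK32 = 0xFFFFFFFF


def expand_lookup(word: int, templates: list) -> int:
    """Pick one of 32 template longs and patch its S field from the word."""
    index = (word >> 11) & 0x1F   # top 5 bits select the template
    operand = word & 0x1FF        # low 9 bits become the S field
    return ((templates[index] & ~0x1FF) | operand) & MASK32


if __name__ == "__main__":
    table = [0] * 32
    table[3] = 0xA0BC1234         # arbitrary template value for the demo
    print(hex(expand_lookup((3 << 11) | 0x15, table)))
```

Swapping table entries at run time corresponds to changing the available instruction subset on the fly.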

Sapieha · 2012-12-10 22:13

Hi Chip.

What I'm still missing are simple instructions to move one BYTE (0, 1, 2 or 3) from one COG long/register into any of the 4 byte positions of another (a cross MOVE between 2 LONGs), without needing to shift, move, and then shift again in the other LONG.
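
The requested operation amounts to the following, shown as a small Python sketch on 32-bit values. Byte 0 is taken to be the least significant byte, which is an assumption. `mov_byte` is a hypothetical helper, not an existing instruction.

```python
# Sketch only: byte 0 is assumed to be the least significant byte.
MASK32 = 0xFFFFFFFF


def mov_byte(dst: int, dst_pos: int, src: int, src_pos: int) -> int:
    """Copy byte src_pos of src into byte dst_pos of dst; other bytes are kept."""
    if dst_pos not in range(4) or src_pos not in range(4):
        raise ValueError("byte position must be 0..3")
    byte = (src >> (8 * src_pos)) & 0xFF
    shift = 8 * dst_pos
    return ((dst & ~(0xFF << shift)) | (byte << shift)) & MASK32


if __name__ == "__main__":
    # Move byte 0 of the source into byte 3 of the destination
    print(hex(mov_byte(0x11223344, 3, 0xAABBCCDD, 0)))
```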

Cluso99 · 2012-12-11 00:12

cgracey wrote: »

This would be a simple change to a unary instruction that works on D. There is no S available. I may not make any changes here, after all. I've been thinking about it and it seems like kind of a wash.

I cannot think of any instruction within these parameters that would be of major use.

Cluso99 · 2012-12-11 02:15

Bill, I hope you don't mind me posting here.

This is possible (untested) code to load an overlay (or a number of quad longs) catching the hub window...

        org             0
entry        
        setquad         #Instr0                 ' set quad map
        setptrb         ptr_olay1               ' set to point to hub location of start of overlay1
        setindb         #Overlay                ' set to cog overlay area
        nop                                     ' are any of these required???
        nop
        nop

        rdquad          ptrb++                  ' prime the cache (takes +5clocks)
                                                '  but the next rdquad will cause a stall anyway
        reps            #10-1,#5                ' repeat 9 loops of 5 instructions
        nop                                     ' is this required???
         rdquad          ptrb++
         mov             indb++,Instr0           ' copy into cog (no #: copy register contents)
         mov             indb++,Instr1
         mov             indb++,Instr2
         mov             indb++,Instr3

here    jmp             #here                   ' just loop here indefinitely!!!  normally goes to execute the overlay

ptr_olay1     long      $1000                   ' just a fixed hub location for now!!!        

'ensure a quad boundary for now
Instr0  nop
Instr1  nop
Instr2  nop
Instr3  nop        

overlay
        res             4*10                    'instructions get copied here onwards...

Sapieha · 2012-12-11 02:20

Hi Cluso.

Use ---> SETQUAZ instead in this place ---

Cluso99 wrote: »

Bill, I hope you don't mind me posting here.

This is possible (untested) code to load an overlay (or a number of quad longs) catching the hub window...

        org             0
entry        
        setquad         #Instr0                 ' set quad map  <--- use SETQUAZ here
        setptrb         ptr_olay1               ' set to point to hub location of start of overlay1
        setindb         #Overlay                ' set to cog overlay area
        nop                                     ' are any of these required???
        nop
        nop

        rdquad          ptrb++                  ' prime the cache (takes +5clocks)
                                                '  but the next rdquad will cause a stall anyway
        reps            #10-1,#5                ' repeat 9 loops of 5 instructions
        nop                                     ' is this required???
         rdquad          ptrb++
         mov             indb++,Instr0           ' copy into cog (no #: copy register contents)
         mov             indb++,Instr1
         mov             indb++,Instr2
         mov             indb++,Instr3

here    jmp             #here                   ' just loop here indefinitely!!!  normally goes to execute the overlay

ptr_olay1     long      $1000                   ' just a fixed hub location for now!!!        

'ensure a quad boundary for now
Instr0  nop
Instr1  nop
Instr2  nop
Instr3  nop        

overlay
        res             4*10                    'instructions get copied here onwards...

Baggers · 2012-12-11 02:56

Wow, I really should get one of those Nano boards, Great debugging guys, and nice find!

PS. Chip, if you're looking for any last-minute instructions to add, what about a RDQUADB, which reads a long and puts each of its bytes into a separate long (4 longs)? That would be helpful. And a WRQUADB to put them back into one long in hub RAM.
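
In Python terms, the suggested pair would behave roughly like the sketch below. The byte order (least significant byte into the first long) is an assumption, and `rdquadb` / `wrquadb` here are plain functions standing in for the proposed instructions.

```python
# Sketch only: byte order (LSB first) is an ASSUMPTION.
def rdquadb(long_: int) -> list:
    """Split one 32-bit long into 4 longs, one byte in each."""
    return [(long_ >> (8 * i)) & 0xFF for i in range(4)]


def wrquadb(quads: list) -> int:
    """Pack the low byte of each of 4 longs back into one 32-bit long."""
    out = 0
    for i, q in enumerate(quads[:4]):
        out |= (q & 0xFF) << (8 * i)
    return out


if __name__ == "__main__":
    parts = rdquadb(0x11223344)
    print([hex(p) for p in parts], hex(wrquadb(parts)))
```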

cgracey · 2012-12-11 05:22

Baggers wrote: »

Wow, I really should get one of those Nano boards, Great debugging guys, and nice find!

PS. Chip, if you're looking for any last-minute instructions to add, what about a RDQUADB, which reads a long and puts each of its bytes into a separate long (4 longs)? That would be helpful. And a WRQUADB to put them back into one long in hub RAM.

Baggers, that would be maybe too much at this late stage. I wish I would have known about this earlier. You told me, but I didn't picture it this clearly back then. That would be pretty simple.

cgracey · 2012-12-11 05:35

I just posted a new .zip here:

http://forums.parallax.com/showthread.php?144199-Propeller-II-Emulation-of-the-P2-on-DE0-NANO-amp-DE2-115-FPGA-boards&p=1145603&viewfull=1#post1145603

It's got everything in it, including new DE0_Nano and DE2_115 files, all tested. New Pnut.exe and Prop2 Docs, as well.

I think we've got all the QUAD problems whipped. It runs LMM code like you'd expect. You can start the QUADs at any register now and clear them at the same time, if you want.

I got the REPS/REPD working with multitasking now. Any task can use it, but only one task at a time.

Here are the updated docs for SETQUAD and RDQUAD timing:

After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:


        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

        NOP                     'do something for at least 3 clocks to allow QUADs to update
        NOP
        NOP

        CMP     quad0,quad1     'mapped QUADs are now accessible via D and S


After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:


        SETQUAD #quad0          'map QUADs to quad0..quad3

        RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3

        NOP                     'do something for at least 3 clocks to allow QUADs to update
        NOP
        NOP

        NOP                     'do at least 1 instruction to get QUADs into pipeline

quad0   NOP                     'QUAD0..QUAD3 are now executable
quad1   NOP
quad2   NOP
quad3   NOP


After a SETQUAD, mapped QUAD registers are writable immediately, but original contents are
readable via D and S after 2 instructions:


        SETQUAD #quad0          'map QUADs to quad0..quad3 (new address)

        NOP                     'do at least two instructions to queue up QUADs
        NOP

        CMP     quad0,quad1     'mapped QUADs are now accessible via D and S


On cog startup, the QUAD registers are cleared to 0's.

Here's Bill's code that I worked with all day in order to get the QUAD issues straightened out. It's finally behaving and doing LMM nicely:

'
' rdquad lmm2 test - now working LMM2 loop!
'
' William Henning
' http://Mikronauts.com
'

CON

	CLOCK_FREQ = 60_000_000
	BAUD = 115_200

DAT
	org	0

	setptra	what		'write 8k buffer of "add count,#1" lmm code
	mov	x,#$100
:loop	reps	#8,#1
	setinda	#addins0
	wrlong	inda++,ptra++
	djnz	x,#:loop

	setptra	what		'point at start of lmm code

	setquaz	#ins1		'point quads inside LMM loop (cleared to NOP's)

	reps	#257,#8		'execute 256 LMM instructions (257th loop reads garbage, but executes last ins)
	getcnt	cycles

ins0	rdquad	ptra++		'LMM loop
ins1	nop			'four LMM instructions from RDQUAD before last execute here
ins2	nop
ins3	nop
ins4	nop
ins5	nop			'(this is where the mapped QUADs actually become executable after RDQUAD)
ins6	nop
ins7	nop

	subcnt	cycles		'get elapsed time
	sub	cycles,#1

done	setptra	results		'write results
	reps	#9,#1
	setinda	#count0
	wrlong	inda++,ptra++

	coginit	monitor_pgm,monitor_ptr	'relaunch cog0 with monitor - thanks Chip!

monitor_pgm	long	$70C			'monitor program address
monitor_ptr	long	90<<9 + 91		'monitor parameter (conveys tx/rx pins)

addins0	add	count0,#1<<1
addins1	add	count1,#2<<1
addins2	add	count2,#3<<1
addins3	add	count3,#4<<1
subins0	sub	count0,#1
subins1 sub	count1,#2
subins2 sub	count2,#3
subins3 sub	count3,#4

count0	long	0
count1	long	0
count2	long	0
count3	long	0
count4	long	0
count5	long	0
count6	long	0
count7	long	0

cycles	long	0

x		long	0
what		long	$4000
results		long	$2000
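
As a back-of-envelope check on what this loop can deliver, the arithmetic is sketched below in Python. It assumes each pass of the 8-instruction REPS loop takes 8 clocks (one clock per instruction slot, with RDQUAD catching its hub window without stalling) and executes the 4 LMM instructions fetched by the previous RDQUAD. Both figures are assumptions for this estimate, not measurements from the test run. `lmm_mips` is a hypothetical helper.

```python
# Sketch only: 4 LMM instructions per 8-clock pass is an ASSUMPTION.
def lmm_mips(clock_hz: float, lmm_per_pass: int = 4, clocks_per_pass: int = 8) -> float:
    """Estimated LMM instructions per second, in millions."""
    return clock_hz * lmm_per_pass / clocks_per_pass / 1e6


if __name__ == "__main__":
    print(lmm_mips(160_000_000))  # target clock from the thread title
    print(lmm_mips(60_000_000))   # CLOCK_FREQ used in the test program
```

Under those assumptions the estimate is 80 MIPS at 160 MHz, which lines up with the upper figure in the thread title, and 30 MIPS at the 60 MHz CLOCK_FREQ in the listing.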

cgracey · 2012-12-11 05:56

Sapieha wrote: »

Hi Chip.

What I'm still missing are simple instructions to move one BYTE (0, 1, 2 or 3) from one COG long/register into any of the 4 byte positions of another (a cross MOVE between 2 LONGs), without needing to shift, move, and then shift again in the other LONG.

I know. I want to document those soon. I'm sorry it's going slow lately.
