LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz) - Page 5 — Parallax Forums

Comments

  • cgracey Posts: 14,155
    edited 2012-12-10 17:46
    Bill Henning wrote: »
    Hi Chip,

    I just tried to re-program my Nano, and I got:

    "Error: File Z:/2012/prop2/Prop2_DEO_Nano_v2/DE0_Nano_Prop2.jic is corrupted"

    I downloaded the zip twice, and tried it about four times.

    Help!

    I just downloaded it from the link I put up and I programmed it into the board, no problem. Someone was saying something about needing to get the whole Quartus setup onto your machine to get it to work. Anyone remember this?
  • cgracey Posts: 14,155
    edited 2012-12-10 17:49
    I made an additional SETQUAZ instruction, aside from keeping SETQUAD. They both set the QUAD address, while the first clears the QUAD registers.
  • Sapieha Posts: 2,964
    edited 2012-12-10 17:50
    Hi Chip.

    I run it by pointing at its icon and choosing "run with Quartus".

    That works without problems.


    cgracey wrote: »
    I just downloaded it from the link I put up and I programmed it into the board, no problem. Someone was saying something about needing to get the whole Quartus setup onto your machine to get it to work. Anyone remember this?
  • Bill Henning Posts: 6,445
    edited 2012-12-10 17:52
    My bad... I ran the old 10.1 install, not the new 12.1

    12.1's programmer worked fine.
    cgracey wrote: »
    I just downloaded it from the link I put up and I programmed it into the board, no problem. Someone was saying something about needing to get the whole Quartus setup onto your machine to get it to work. Anyone remember this?
  • Bill Henning Posts: 6,445
    edited 2012-12-10 17:52
    Thank you
    cgracey wrote: »
    I made an additional SETQUAZ instruction, aside from keeping SETQUAD. They both set the QUAD address, while the first clears the QUAD registers.
  • Tubular Posts: 4,703
    edited 2012-12-10 17:59
    Worked Ok for me too, Bill. Quartus Programmer v12.1 build 177
    Edit: oops took a phone call in between posting and see you've sorted. Onwards and upwards...
  • Bill Henning Posts: 6,445
    edited 2012-12-10 18:13
    YES!!!!!!!!!!

    Other than the cache clearing, I have Prop2 Nano-v2 passing my torture test.

    Here is the inner loop:
    	' execute 256 LMM instructions
    
    ins5	setquad	#ins1
    	nop
    	reps	#256,#8
    	getcnt	start
    '-- must be on a XXXXXXX00 address in the cog
    ins1	nop
    ins2	nop
    ins3	nop
    ins4	nop
    	rdquad	pc
    	add	pc,#16
    	nop
    	nop
    

    Here are the results:
    >n
    >2000.201f
    02000- 000000FE 00000080 00000080 00000080   '................'
    02010- 00000FF6 0000007F 00000000 00000000   '................'
    

    THANKS CHIP!
  • cgracey Posts: 14,155
    edited 2012-12-10 18:13
    Here's a new .zip with updated PNUT.EXE (SETQUAD and SETQUAZ), Prop2_Docs, and the .jic file for the DE0_Nano:

    Terasic_Prop2.zip

    I'll be back in a few hours.
  • Bill Henning Posts: 6,445
    edited 2012-12-10 18:16
    More goodies :-)

    Thanks!

    Did the new Nano_v2 have SETQUAZ?
    cgracey wrote: »
    Here's a new .zip with updated PNUT.EXE (SETQUAD and SETQUAZ), Prop2_Docs, and the .jic file for the DE0_Nano:

    Terasic_Prop2.zip

    I'll be back in a few hours.
  • jmg Posts: 15,173
    edited 2012-12-10 18:22
    cgracey wrote: »
    I made an additional SETQUAZ instruction, aside from keeping SETQUAD. They both set the QUAD address, while the first clears the QUAD registers.

    Does this mean the P2 is not yet taped out, and sitting in a shuttle queue somewhere ?
  • cgracey Posts: 14,155
    edited 2012-12-10 18:25
    Bill Henning wrote: »
    More goodies :-)

    Thanks!

    Did the new Nano_v2 have SETQUAZ?

    No. It's only inside the .zip I just posted.
  • cgracey Posts: 14,155
    edited 2012-12-10 18:26
    jmg wrote: »
    Does this mean the P2 is not yet taped out, and sitting in a shuttle queue somewhere ?

    We were hoping to catch the December shuttle, but we might have to wait until January. We'll see.
  • cgracey Posts: 14,155
    edited 2012-12-10 18:31
    Quick question to think about:

    Can you guys propose a simple word-to-long-instruction scheme that I could quickly place into the Verilog? I've thought about this a little, but haven't come to any conclusions. I'm thinking I could swap either the SEUSSF or SEUSSR instruction with this bit-remapping operation to provide some half-size LMM functionality.
  • Bill Henning Posts: 6,445
    edited 2012-12-10 18:38
    - Removing conditional execution saves 4 bits
    - Remove "nr" and the "direct" bit saves 2 more bits
    - chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits

    This would be pretty easy for the compiler to support

    A simple "RDWORDX" could read a word, and expand it back to a full instruction, so the LMM kernel could work on it

    It would not be needed for the fast RDQUAD version, as compressed code is about fitting large programs, not about the highest possible speed

    I'd map R0-R15 to $1E0-$1EF as it leaves eight more registers after it for future use (or Prop3 adding more special registers)
    cgracey wrote: »
    Quick question to think about:

    Can you guys propose a simple word-to-long-instruction scheme that I could quickly place into the Verilog? I've thought about this a little, but haven't come to any conclusions. I'm thinking I could swap either the SEUSSF or SEUSSR instruction with this bit-remapping operation to provide some half-size LMM functionality.
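
Bill's bit budget (4 + 2 + 10 bits removed) leaves exactly 16 bits: the 6-bit opcode, the wz and wc flags, and two 4-bit register fields. A minimal Python sketch of what a hardware "RDWORDX" expansion might compute, assuming the 32-bit field positions Cluso99 quotes later in this thread (z/c/r/i in bits 25:22, cccc in bits 21:18) with D in bits 17:9 and S in bits 8:0. The packed word layout, the name `expand_word`, and the fixed values filled in for the removed fields are assumptions for illustration, not anything taken from the Verilog:

```python
R0_BASE = 0x1E0        # Bill's proposed home for R0-R15 ($1E0-$1EF)
COND_ALWAYS = 0b1111   # assumed: compressed code always executes
WRITE_RESULT = 1       # assumed: with "nr" removed, the result is always written
IMMEDIATE = 0          # assumed: with the immediate bit removed, S is a register


def expand_word(word):
    """Expand a hypothetical 16-bit packed instruction to a 32-bit long.

    Assumed packed layout, bit 15 down to bit 0: oooooo z c dddd ssss
    """
    opcode = (word >> 10) & 0x3F
    wz = (word >> 9) & 1
    wc = (word >> 8) & 1
    d = R0_BASE + ((word >> 4) & 0xF)   # 4-bit field selects R0-R15
    s = R0_BASE + (word & 0xF)
    return (opcode << 26 | wz << 25 | wc << 24 | WRITE_RESULT << 23
            | IMMEDIATE << 22 | COND_ALWAYS << 18 | d << 9 | s)


# Opcode %100000 with D = R1 and S = R2, no flags.
print(hex(expand_word(0x8012)))
```

The whole expansion is shifts, constants and one add per register field, which is the property that would make it cheap to wire into a single instruction.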
  • Bill Henning Posts: 6,445
    edited 2012-12-10 18:42
    A new SJMP opcode could provide a direct 8 bit address (or 9 if you get cute)

    Also needed are "ENTER COMPRESSED" and "LEAVE COMPRESSED" mode instructions.

    This would SERIOUSLY speed up CMM2

    (I keep sticking 2 on Prop2 versions to clearly identify which version I am talking about)
  • jmg Posts: 15,173
    edited 2012-12-10 18:54
    Bill Henning wrote: »
    - Removing conditional execution saves 4 bits
    - Remove "nr" and the "direct" bit saves 2 more bits
    - chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits

    This would be pretty easy for the compiler to support

    A simple "RDWORDX" could read a word, and expand it back to a full instruction, so the LMM kernel could work on it

    So these are not opcodes in the true sense, but more a compression/map technique: a new opcode fetches the word and expands it, and the result is then placed in (real) opcode memory.
    Since this will be a streaming operation, rather than limit the opcodes to a tiny subset, why not allow a sticky mapping system? That would let users compress functions without forcing everything through a bottleneck.

    ie this would take the idea of a 16:32 map
    - Removing conditional execution saves 4 bits
    - Remove "nr" and the "direct" bit saves 2 more bits
    - chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits

    but when expanding, the 16 new bits come from a register. Less 'chop' and more 'tiny dictionary'

    Now, a user can define a 5 bit register group, which simply swaps-in with the 4 variable bits, to give a 'function local' type approach.
    - the same goes for the other 4+2 bits, but they would change less often and could be patched if needed.

    A compiler can organize reg usage to be Function local, and use 4 bit fields, (merged with the 5 bit group, that changes far less often) and then if it finds it needs (say) 2 condition codes in 50 lines of code, it can patch those last, before the jump to execute step, and still have a memory size/bandwidth gain.
  • David Betz Posts: 14,516
    edited 2012-12-10 20:28
    cgracey wrote: »
    Quick question to think about:

    Can you guys propose a simple word-to-long-instruction scheme that I could quickly place into the Verilog? I've thought about this a little, but haven't come to any conclusions. I'm thinking I could swap either the SEUSSF or SEUSSR instruction with this bit-remapping operation to provide some half-size LMM functionality.

    Chip,

    Have you looked at the CMM document I sent you a while back? It is how PropGCC currently encodes "compressed" instructions. You might want to check with Eric Smith to see if it makes sense to implement some of the translation in hardware.
  • Cluso99 Posts: 18,069
    edited 2012-12-10 21:05
    My favoured instructions...

    Does not clear:

    SETQUAD D
    SETQUAD #n

    Does clear:

    SETQUAD D, wz
    SETQUAD #n, wz

    Postedit: Just realised it has been decided to use SETQUAZ to zero the quads.
  • cgracey Posts: 14,155
    edited 2012-12-10 21:35
    Has anyone tried the new .zip stuff on the DE0_Nano? Any LMM/SETQUAD/SETQUAZ news?
  • Cluso99 Posts: 18,069
    edited 2012-12-10 21:42
    Chip: I am unsure if this is what you are asking regarding bit manipulation...

    We already have movs, movd, movi. I have seen where instructions are sometimes modified to be rdxxxx or wrxxxx. I have also seen where they are enabled/disabled.

    I am wondering if a movc (moves the cccc bits) or a movzcri (moves the z,c,r & i bits) might be useful ??? Perhaps this could even be a combined instruction that used the wz and wc bits like....

    movx D,[#]S wz,wc
    where wc copies source bits 3:0 into cccc (bits 21:18)
    and wz copies source bits 7:4 into z/c/r/i (bits 25:22)
    maybe even the nr/wr bit could signal if the "i" bit 22 was copied???

    Perhaps this may also have more uses with all the new instructions to change things like SETINDx and FIXINDx etc.

    I am not sure if this would be really useful or not at this time, so it's just a suggestion - maybe it will trigger something more useful.
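
The field copies Cluso99 describes for the proposed movx can be modelled in a few lines. A minimal Python sketch, where `movx` and its keyword flags are illustrative stand-ins for the suggested instruction and its wc/wz effects, not an existing opcode:

```python
def movx(d, s, wc=False, wz=False):
    """Model of the proposed movx acting on a 32-bit instruction long d."""
    if wc:
        # copy source bits 3:0 into cccc (bits 21:18)
        d = (d & ~(0xF << 18)) | ((s & 0xF) << 18)
    if wz:
        # copy source bits 7:4 into z/c/r/i (bits 25:22)
        d = (d & ~(0xF << 22)) | (((s >> 4) & 0xF) << 22)
    return d & 0xFFFFFFFF


# Change only the condition field of an instruction long to %1010.
print(hex(movx(0x80BFC3E2, 0x0A, wc=True)))
```

Each flag touches one 4-bit field and leaves the opcode, D and S untouched, which is what would distinguish it from the existing movs, movd and movi.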

    Another possibility...

    Say I read a group of 8 bits in from pins that are offset from P0. I use a SHR #n and an AND #n to isolate them and put them in the lowest bits. So perhaps an instruction that shifts right by n bits and zeroes the m upper bits. However, two counts of up to 32 each do not fit in one immediate field.
    Solution:
    SHRZ D,#mmmm_nnnnn ' mmmm=0-15, nnnnn=0-31
    SHRZ D,label ' where label is a register of mmmmm_nnnnn allowing mmmmm=0-31, nnnnn=0-31
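
The suggested SHRZ is the usual SHR followed by AND folded into one step. A small Python model of the intended result, assuming 32-bit registers; `shrz` is a hypothetical name taken from the proposal, and passing m and n as separate arguments is a simplification of the packed mmmm_nnnnn operand:

```python
MASK32 = 0xFFFFFFFF


def shrz(d, m, n):
    """Shift the 32-bit value d right by n bits, then zero the upper m bits."""
    return ((d & MASK32) >> n) & (MASK32 >> m)


# Isolate the byte sitting in bits 15:8 (for example pins P8..P15).
print(hex(shrz(0x12345678, 24, 8)))
```

Isolating a whole byte needs m = 24, which is beyond the 0-15 range of the immediate form, so that case would have to use the register form.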
  • cgracey Posts: 14,155
    edited 2012-12-10 21:55
    This would be a simple change to a unary instruction that works on D. There is no S available. I may not make any changes here, after all. I've been thinking about it and it seems like kind of a wash.
  • potatohead Posts: 10,261
    edited 2012-12-10 22:01
    Sheesh! I worked with Bill's code last night running add / sub instructions and had trouble getting them all executed, and/or hit bugs in the program loop, due to the things you guys highlighted today. I thought it was me and didn't post, figuring I would give it all another go this evening.

    Glad that got sorted out.

    I favor the simpler non-aligned option for RDQUAD mapping. Complexity is high with the other options, and there are already lots of rules to program by. Adding to them really should have a big return, IMHO, otherwise try and consolidate them without impacting things too much. I see that was done too.

    Re: Compressed instructions

    I'll just throw this out there: Seems to me a subset of instructions might make sense. For a COG running LMM code, it's not going to be doing much else, besides servicing that code. How about a simple look up? Use some of the 16 bits to index the instruction and the remaining for arguments, etc...

    A programmer / compiler could choose a subset of instructions that way, and they could be changed too, depending. Target something like 32 instructions. Those get put into the cog as "template instructions", which then get modified by the half size LMM operand when they are fetched, in effect storing the deltas and limiting how many instructions are available at any one time. For a lot of tasks, this subset might really pay off, and it could change on the fly too.
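
potatohead's template idea can be illustrated with a table lookup. The post does not say how the remaining bits modify the template, so this Python sketch makes that part up: it assumes a 5-bit index (32 templates) and an 11-bit delta that is XORed into the low bits of the template. The name `expand_from_template` and the whole encoding are assumptions:

```python
def expand_from_template(word, templates):
    """Expand a 16-bit word using a table of 32 full-size template instructions."""
    index = (word >> 11) & 0x1F       # assumed: top 5 bits pick the template
    delta = word & 0x7FF              # assumed: low 11 bits are the delta
    return templates[index] ^ delta   # assumed: delta is XORed into the low bits


# One template loaded in slot 3; the half-size word supplies the rest.
templates = [0] * 32
templates[3] = 0x80BFC000
print(hex(expand_from_template(3 << 11 | 0x3E2, templates)))
```

Because the table is ordinary data, the instruction subset could be swapped at run time by rewriting the 32 template longs.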
  • Sapieha Posts: 2,964
    edited 2012-12-10 22:13
    Hi Chip.

    What I am still missing are simple instructions to move one byte (0, 1, 2 or 3) from one cog long/register into any of the 4 byte positions of another (a cross move between 2 longs), without having to shift, move, and then shift again in the other long.
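
The operation Sapieha is asking for amounts to an extract, a mask and an insert. A minimal Python model of the result such an instruction would produce; `move_byte` and its argument order are illustrative, not a proposed encoding:

```python
def move_byte(dst, dst_pos, src, src_pos):
    """Copy byte src_pos (0-3) of src into byte dst_pos (0-3) of dst."""
    byte = (src >> (8 * src_pos)) & 0xFF
    cleared = dst & ~(0xFF << (8 * dst_pos)) & 0xFFFFFFFF
    return cleared | (byte << (8 * dst_pos))


# Byte 0 of the source lands in byte 3 of the destination.
print(hex(move_byte(0xAABBCCDD, 3, 0x11223344, 0)))
```

Done with shifts and masks in cog code this takes several instructions per byte, which is the overhead the request is trying to avoid.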
  • Cluso99 Posts: 18,069
    edited 2012-12-11 00:12
    cgracey wrote: »
    This would be a simple change to a unary instruction that works on D. There is no S available. I may not make any changes here, after all. I've been thinking about it and it seems like kind of a wash.
    I cannot think of any instruction within these parameters that would be of major use.
  • Cluso99 Posts: 18,069
    edited 2012-12-11 02:15
    Bill, I hope you don't mind me posting here.

    This is possible (untested) code to load an overlay (or a number of quad longs) catching the hub window...
            org             0
    entry        
            setquad         #Instr0                 ' set quad map
            setptrb         ptr_olay1               ' set to point to hub location of start of overlay1
            setindb         #Overlay                ' set to cog overlay area
            nop                                     ' are any of these required???
            nop
            nop
    
            rdquad          ptrb++                  ' prime the cache (takes +5clocks)
                                                    '  but the next rdquad will cause a stall anyway
            reps            #10-1,#5                ' repeat 9 loops of 5 instructions
            nop                                     ' is this required???
             rdquad          ptrb++
             mov             indb++,#instr1          ' copy into cog   
             mov             indb++,#instr2
             mov             indb++,#instr3
             mov             indb++,#instr4
    
    here    jmp             #here                   ' just loop here indefinitely!!!  normally goes to execute the overlay
    
    ptr_olay1     long      $1000                   ' just a fixed hub location for now!!!        
    
    'ensure a quad boundary for now
    Instr0  nop
    Instr1  nop
    Instr2  nop
    Instr3  nop        
    
    overlay
            res             4*10                    'instructions get copied here onwards...
    
  • Sapieha Posts: 2,964
    edited 2012-12-11 02:20
    Hi Cluso.

    Use SETQUAZ instead of SETQUAD in this place:
    Cluso99 wrote: »
    Bill, I hope you don't mind me posting here.

    This is possible (untested) code to load an overlay (or a number of quad longs) catching the hub window...
            org             0
    entry        
            setquad         #Instr0                 ' set quad map  <-- use SETQUAZ here
            setptrb         ptr_olay1               ' set to point to hub location of start of overlay1
            setindb         #Overlay                ' set to cog overlay area
            nop                                     ' are any of these required???
            nop
            nop
    
            rdquad          ptrb++                  ' prime the cache (takes +5clocks)
                                                    '  but the next rdquad will cause a stall anyway
            reps            #10-1,#5                ' repeat 9 loops of 5 instructions
            nop                                     ' is this required???
             rdquad          ptrb++
             mov             indb++,#instr1          ' copy into cog   
             mov             indb++,#instr2
             mov             indb++,#instr3
             mov             indb++,#instr4
    
    here    jmp             #here                   ' just loop here indefinitely!!!  normally goes to execute the overlay
    
    ptr_olay1     long      $1000                   ' just a fixed hub location for now!!!        
    
    'ensure a quad boundary for now
    Instr0  nop
    Instr1  nop
    Instr2  nop
    Instr3  nop        
    
    overlay
            res             4*10                    'instructions get copied here onwards...
    
  • Baggers Posts: 3,019
    edited 2012-12-11 02:56
    Wow, I really should get one of those Nano boards, Great debugging guys, and nice find!

    PS. Chip, if you're looking for any last-minute instructions to add, what about a RDQUADB, which reads a long and puts each of its bytes into one of 4 longs? That would be helpful. And a WRQUADB to put them back into one long in hub RAM.
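
The byte unpack and repack Baggers describes can be modelled as follows. This is a Python sketch of the data movement only; the function names mirror the suggested mnemonics, and the byte order (byte 0 into the first quad register) is an assumption:

```python
def rdquadb(long_value):
    """Split one 32-bit long into four longs holding one byte each (byte 0 first)."""
    return [(long_value >> (8 * i)) & 0xFF for i in range(4)]


def wrquadb(quads):
    """Pack the low byte of each of four longs back into one 32-bit long."""
    result = 0
    for i, q in enumerate(quads):
        result |= (q & 0xFF) << (8 * i)
    return result


quads = rdquadb(0x11223344)
print([hex(q) for q in quads], hex(wrquadb(quads)))
```

The two operations are inverses as long as each quad register stays within 0-255; anything above the low byte is dropped when packing.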
  • cgracey Posts: 14,155
    edited 2012-12-11 05:22
    Baggers wrote: »
    Wow, I really should get one of those Nano boards, Great debugging guys, and nice find!

    PS. Chip, if you're looking for any last-minute instructions to add, what about a RDQUADB, which reads a long and puts each of its bytes into one of 4 longs? That would be helpful. And a WRQUADB to put them back into one long in hub RAM.

    Baggers, that would maybe be too much at this late stage. I wish I had known about this earlier. You told me, but I didn't picture it this clearly back then. It would have been pretty simple.
  • cgracey Posts: 14,155
    edited 2012-12-11 05:35
    I just posted a new .zip here:

    http://forums.parallax.com/showthread.php?144199-Propeller-II-Emulation-of-the-P2-on-DE0-NANO-amp-DE2-115-FPGA-boards&p=1145603&viewfull=1#post1145603

    It's got everything in it, including new DE0_Nano and DE2_115 files, all tested. New Pnut.exe and Prop2 docs, as well.

    I think we've got all the QUAD problems whipped. It runs LMM code like you'd expect. You can start the QUADs at any register now and clear them at the same time, if you want.

    I got the REPS/REPD working with multitasking now. Any task can use it, but only one task at a time.

    Here are the updated docs for SETQUAD and RDQUAD timing:
    After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:
    
    
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow QUADs to update
            NOP
            NOP
    
    	CMP     quad0,quad1     'mapped QUADs are now accessible via D and S
    
    
    After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:
    
    
            SETQUAD #quad0          'map QUADs to quad0..quad3
    
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow QUADs to update
            NOP
            NOP
    
            NOP                     'do at least 1 instruction to get QUADs into pipeline
    
    quad0   NOP                     'QUAD0..QUAD3 are now executable
    quad1   NOP
    quad2   NOP
    quad3   NOP
    
    
    After a SETQUAD, mapped QUAD registers are writable immediately, but original contents are
    readable via D and S after 2 instructions:
    
    
            SETQUAD #quad0          'map QUADs to quad0..quad3 (new address)
    
            NOP                     'do at least two instructions to queue up QUADs
            NOP
    
            CMP     quad0,quad1     'mapped QUADS are now accessible via D and S
    
    
    On cog startup, the QUAD registers are cleared to 0's.
    

    Here's Bill's code that I worked with all day in order to get the QUAD issues straightened out. It's finally behaving and doing LMM nicely:
    '
    ' rdquad lmm2 test - now working LMM2 loop!
    '
    ' William Henning
    ' http://Mikronauts.com
    '
    
    CON
    
    	CLOCK_FREQ = 60_000_000
    	BAUD = 115_200
    
    DAT
    	org	0
    
    	setptra	what		'write 8k buffer of "add count,#1" lmm code
    	mov	x,#$100
    :loop	reps	#8,#1
    	setinda	#addins0
    	wrlong	inda++,ptra++
    	djnz	x,#:loop
    
    	setptra	what		'point at start of lmm code
    
    	setquaz	#ins1		'point quads inside LMM loop (cleared to NOP's)
    
    	reps	#257,#8		'execute 256 LMM instructions (257th loop reads garbage, but executes last ins)
    	getcnt	cycles
    
    ins0	rdquad	ptra++		'LMM loop
    ins1	nop			'four LMM instructions from RDQUAD before last execute here
    ins2	nop
    ins3	nop
    ins4	nop
    ins5	nop			'(this is where the mapped QUADs actually become executable after RDQUAD)
    ins6	nop
    ins7	nop
    
    	subcnt	cycles		'get elapsed time
    	sub	cycles,#1
    
    done	setptra	results		'write results
    	reps	#9,#1
    	setinda	#count0
    	wrlong	inda++,ptra++
    
    	coginit	monitor_pgm,monitor_ptr	'relaunch cog0 with monitor - thanks Chip!
    
    monitor_pgm	long	$70C			'monitor program address
    monitor_ptr	long	90<<9 + 91		'monitor parameter (conveys tx/rx pins)
    
    addins0	add	count0,#1<<1
    addins1	add	count1,#2<<1
    addins2	add	count2,#3<<1
    addins3	add	count3,#4<<1
    subins0	sub	count0,#1
    subins1	sub	count1,#2
    subins2	sub	count2,#3
    subins3	sub	count3,#4
    
    count0	long	0
    count1	long	0
    count2	long	0
    count3	long	0
    count4	long	0
    count5	long	0
    count6	long	0
    count7	long	0
    
    cycles	long	0
    
    x		long	0
    what		long	$4000
    results		long	$2000
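
As a sanity check on the test program above, the counter values it should leave behind can be worked out by hand. A minimal Python model, assuming each of the 1024 LMM instructions fetched by the first 256 RDQUADs executes exactly once and ignoring the extra 257th pass. `PATTERN` and `expected_counts` are illustrative names, and the numbers are arithmetic on the listing, not measured results:

```python
# The eight-long pattern the setup loop writes to hub RAM, as (counter, delta).
PATTERN = [(0, +2), (1, +4), (2, +6), (3, +8),   # add countN,#(N+1)<<1
           (0, -1), (1, -2), (2, -3), (3, -4)]   # sub countN,#(N+1)


def expected_counts(quads=256):
    """Counters after `quads` RDQUAD fetches in an idealised model."""
    counts = [0, 0, 0, 0]
    for i in range(quads * 4):          # four LMM instructions per quad
        reg, delta = PATTERN[i % 8]
        counts[reg] += delta
    return counts


print([hex(c) for c in expected_counts()])
```

Under those assumptions count0..count3 come out as 128, 256, 384 and 512 ($80, $100, $180 and $200), so a dropped or doubled instruction in the LMM loop would show up as a value that is off from these.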
    
  • cgracey Posts: 14,155
    edited 2012-12-11 05:56
    Sapieha wrote: »
    Hi Chip.

    What I am still missing are simple instructions to move one byte (0, 1, 2 or 3) from one cog long/register into any of the 4 byte positions of another (a cross move between 2 longs), without having to shift, move, and then shift again in the other long.

    I know. I want to document those soon. I'm sorry it's going slow lately.