"Error: File Z:/2012/prop2/Prop2_DEO_Nano_v2/DE0_Nano_Prop2.jic is corrupted"
I downloaded the zip twice, and tried it about four times.
Help!
I just downloaded it from the link I put up and I programmed it into the board, no problem. Someone was saying something about needing to get the whole Quartus setup onto your machine to get it to work. Anyone remember this?
Worked Ok for me too, Bill. Quartus Programmer v12.1 build 177
Edit: oops took a phone call in between posting and see you've sorted. Onwards and upwards...
Can you guys propose a simple word-to-long-instruction scheme that I could quickly place into the Verilog? I've thought about this a little, but haven't come to any conclusions. I'm thinking I could swap either the SEUSSF or SEUSSR instruction with this bit-remapping operation to provide some half-size LMM functionality.
- Removing conditional execution saves 4 bits
- Remove "nr" and the "direct" bit saves 2 more bits
- chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits
This would be pretty easy for the compiler to support
A simple "RDWORDX" could read a word, and expand it back to a full instruction, so the LMM kernel could work on it
It would not be needed for the fast RDQUAD version as compressed code is about large code size, not highest speed possible
I'd map R0-R15 to $1E0-$1EF as it leaves eight more registers after it for future use (or Prop3 adding more special registers)
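To make the bit budget above concrete, here is a minimal Python sketch of what a "RDWORDX"-style expansion might compute. It assumes the Prop1-style 32-bit field layout (opcode in bits 31:26, Z/C/R/I in 25:22, condition in 21:18, D in 17:9, S in 8:0). The 16-bit layout, the fixed R=1/I=0 defaults and the name expand_word are assumptions for illustration only, not anything decided in this thread.

```python
# Hypothetical 16-bit layout (an assumption, not from the thread):
#   [15:10] opcode   [9] Z   [8] C   [7:4] D index (R0-R15)   [3:0] S index (R0-R15)
REG_BASE = 0x1E0       # R0-R15 mapped to $1E0-$1EF, as suggested above
COND_ALWAYS = 0b1111   # conditional execution removed, so always execute


def expand_word(word: int) -> int:
    """Expand a 16-bit compressed word into a 32-bit instruction long."""
    opcode = (word >> 10) & 0x3F
    z = (word >> 9) & 1
    c = (word >> 8) & 1
    d = REG_BASE + ((word >> 4) & 0xF)
    s = REG_BASE + (word & 0xF)
    r = 1  # assumption: result is always written (no "nr")
    i = 0  # assumption: S is always a register (no immediate/"direct")
    return ((opcode << 26) | (z << 25) | (c << 24) | (r << 23) | (i << 22)
            | (COND_ALWAYS << 18) | (d << 9) | s)


if __name__ == "__main__":
    # opcode %100000 with D=R1, S=R2
    print(hex(expand_word(0x8012)))
```

The 6 + 1 + 1 + 4 + 4 bits add up to exactly 16, which matches the 4 + 2 + 10 bits saved in the list above.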
So these are not opcodes in the true sense, but more a compression/map technique that a new opcode needs to fetch, to then place in (real) Opcode memory.
Since this will be a streaming operation, rather than limit the opcodes to a tiny subset, why not allow a sticky mapping system, which would allow users to compress functions, but not force everything into a bottleneck.
ie this would take the idea of a 16:32 map
- Removing conditional execution saves 4 bits
- Remove "nr" and the "direct" bit saves 2 more bits
- chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits
but when expanding, the 16 new bits come from a register. Less 'chop' and more 'tiny dictionary'
Now, a user can define a 5 bit register group, which simply swaps-in with the 4 variable bits, to give a 'function local' type approach.
- same for the other 4+2 bits but they would less commonly change, and could be patched if needed.
A compiler can organize reg usage to be Function local, and use 4 bit fields, (merged with the 5 bit group, that changes far less often) and then if it finds it needs (say) 2 condition codes in 50 lines of code, it can patch those last, before the jump to execute step, and still have a memory size/bandwidth gain.
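One way to picture the "tiny dictionary" variant is the Python sketch below. The removed bits are not hard-wired but come from a small piece of sticky state that changes rarely. The StickyMap class, its field names, the separate D and S groups and the 16-bit word layout are all assumptions made for illustration; the 32-bit field positions follow the Prop1-style layout.

```python
from dataclasses import dataclass


@dataclass
class StickyMap:
    """Hypothetical sticky state supplying the 16 bits missing from each word."""
    cond: int = 0b1111      # 4 condition bits
    r: int = 1              # write-result bit
    i: int = 0              # immediate ("direct") bit
    d_group: int = 0b11110  # upper 5 bits of D (here $1E0 >> 4)
    s_group: int = 0b11110  # upper 5 bits of S


def expand_sticky(word: int, m: StickyMap) -> int:
    """Merge a 16-bit word ([15:10] opcode, [9] Z, [8] C, [7:4] d, [3:0] s) with the map."""
    opcode = (word >> 10) & 0x3F
    z = (word >> 9) & 1
    c = (word >> 8) & 1
    d = ((m.d_group & 0x1F) << 4) | ((word >> 4) & 0xF)
    s = ((m.s_group & 0x1F) << 4) | (word & 0xF)
    return ((opcode << 26) | (z << 25) | (c << 24) | ((m.r & 1) << 23)
            | ((m.i & 1) << 22) | ((m.cond & 0xF) << 18) | (d << 9) | s)


if __name__ == "__main__":
    func_local = StickyMap(d_group=0b00010, s_group=0b00010)  # registers $020-$02F
    print(hex(expand_sticky(0x8012, func_local)))
```

With the default map this reproduces a fixed R0-R15 = $1E0-$1EF scheme; changing d_group or cond before a function body gives the "function local" behaviour described above.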
Chip,
Have you looked at the CMM document I sent you a while back? It is how PropGCC is currently encoding "compressed" instructions. You might want to check with Eric Smith to see if it makes sense to implement some of the translation in hardware.
Chip: I am unsure if this is what you are asking regarding bit manipulation...
We already have movs, movd, movi. I have seen where instructions are sometimes modified to be rdxxxx or wrxxxx. I have also seen where they are enabled/disabled.
I am wondering if a movc (moves the cccc bits) or a movzcri (moves the z,c,r & i bits) might be useful ??? Perhaps this could even be a combined instruction that used the wz and wc bits like....
movx D,[#]S wz,wc
where wc copies source bits 3:0 into cccc (bits 21:18)
and wz copies source bits 7:4 into z/c/r/i (bits 25:22)
maybe even the nr/wr bit could signal if the "i" bit 22 was copied???
Perhaps this may also have more uses with all the new instructions to change things like SETINDx and FIXINDx etc.
I am not sure if this would be really useful or not at this time, so it's just a suggestion - maybe it will trigger something more useful.
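The proposed movx behaviour can be modeled in a few lines of Python. This is only a sketch of the bit copies described above (wc: source bits 3:0 into bits 21:18; wz: source bits 7:4 into bits 25:22); the function name and keyword arguments are illustrative, and the nr/wr variation is not modeled.

```python
MASK32 = 0xFFFFFFFF


def movx(d: int, s: int, wc: bool = False, wz: bool = False) -> int:
    """Model of the suggested movx: patch the cccc and z/c/r/i fields of D from S."""
    if wc:
        # source bits 3:0 -> condition field, bits 21:18
        d = (d & ~(0xF << 18)) | ((s & 0xF) << 18)
    if wz:
        # source bits 7:4 -> z/c/r/i field, bits 25:22
        d = (d & ~(0xF << 22)) | (((s >> 4) & 0xF) << 22)
    return d & MASK32


if __name__ == "__main__":
    print(hex(movx(0x00000000, 0xA5, wc=True, wz=True)))
```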
Another possibility...
I read a group of 8 bits in from pins displaced from P0. I use a SHR #n and an AND #n to isolate them and put them in the lowest bits. So perhaps an instruction that shifted right n bits and zeroed m upper bits. However we have a problem to use 32 and 32 in an immediate instruction.
Solution:
SHRZ D,#mmmm_nnnnn ' mmmm=0-15, nnnnn=0-31
SHRZ D,label ' where label is a register of mmmmm_nnnnn allowing mmmmm=0-32, n=0-31
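As a rough model of the SHRZ idea, the Python sketch below shifts right by n and then zeroes the m upper bits, replacing the SHR plus AND pair. It assumes both m and n are 5-bit values (0-31), as in the register form; the function name and argument order are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def shrz(d: int, m: int, n: int) -> int:
    """Shift D right by n bits, then clear the m most significant bits."""
    if not (0 <= m <= 31 and 0 <= n <= 31):
        raise ValueError("m and n are modeled as 5-bit fields (0-31)")
    keep = MASK32 >> m  # ones in the (32 - m) low bits
    return ((d & MASK32) >> n) & keep


if __name__ == "__main__":
    ina = 0x1234AB56  # pretend pins P8..P15 carry $AB
    print(hex(shrz(ina, 24, 8)))
```

Isolating an 8-bit group that starts at P8 uses n=8 and m=24, which does not fit the 4-bit immediate m field and so would need the register form.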
This would be a simple change to a unary instruction that works on D. There is no S available. I may not make any changes here, after all. I've been thinking about it and it seems like kind of a wash.
Sheesh! I had worked with Bill's code last night running add / sub instructions and had trouble getting them all executed and or bugs in the program loop, due to the things you guys highlighted today. Thought it was me and didn't post, figuring I would give it all another go this evening.
Glad that got sorted out.
I favor the simpler non-aligned option for RDQUAD mapping. Complexity is high with the other options, and there are already lots of rules to program by. Adding to them really should have a big return, IMHO, otherwise try and consolidate them without impacting things too much. I see that was done too.
Re: Compressed instructions
I'll just throw this out there: Seems to me a subset of instructions might make sense. For a COG running LMM code, it's not going to be doing much else, besides servicing that code. How about a simple look up? Use some of the 16 bits to index the instruction and the remaining for arguments, etc...
A programmer / compiler could choose a subset of instructions that way, and they could be changed too, depending. Target something like 32 instructions. Those get put into the cog as "template instructions", which then get modified by the half size LMM operand when they are fetched, in effect storing the deltas and limiting how many instructions are available at any one time. For a lot of tasks, this subset might really pay off, and it could change on the fly too.
What I'm still missing are simple instructions to move one byte (0, 1, 2 or 3) from one COG long/register into any of the 4 byte positions of another (a cross move between two longs), without having to shift, move, and then shift again in the other long.
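The requested cross move can be described with the following Python sketch. It only models the data movement (pick byte src_pos of one long, drop it into byte dst_pos of another, leave the other three bytes alone); the name mov_byte and its argument order are made up for illustration and do not correspond to any existing instruction.

```python
MASK32 = 0xFFFFFFFF


def mov_byte(dst: int, dst_pos: int, src: int, src_pos: int) -> int:
    """Copy byte src_pos (0-3) of src into byte dst_pos (0-3) of dst."""
    if dst_pos not in range(4) or src_pos not in range(4):
        raise ValueError("byte positions must be 0..3")
    byte = (src >> (8 * src_pos)) & 0xFF
    cleared = dst & ~(0xFF << (8 * dst_pos))
    return (cleared | (byte << (8 * dst_pos))) & MASK32


if __name__ == "__main__":
    print(hex(mov_byte(0x11223344, 3, 0xAABBCCDD, 0)))
```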
This would be a simple change to a unary instruction that works on D. There is no S available. I may not make any changes here, after all. I've been thinking about it and it seems like kind of a wash.
I cannot think of any instruction within these parameters that would be of major use.
This is possible (untested) code to load an overlay (or a number of quad longs) catching the hub window...
org 0
entry
setquad #Instr0 ' set quad map
setptrb ptr_olay1 ' set to point to hub location of start of overlay1
setindb #Overlay ' set to cog overlay area
nop ' are any of these required???
nop
nop
rdquad ptrb++ ' prime the cache (takes +5clocks)
' but the next rdquad will cause a stall anyway
reps #10-1,#5 ' repeat 9 loops of 5 instructions
nop ' is this required???
rdquad ptrb++
mov indb++,Instr0 ' copy quad contents into cog (register source, not immediate)
mov indb++,Instr1
mov indb++,Instr2
mov indb++,Instr3
here jmp #here ' just loop here indefinitely!!! normally goes to execute the overlay
ptr_olay1 long $1000 ' just a fixed hub location for now!!!
'ensure a quad boundary for now
Instr0 nop
Instr1 nop
Instr2 nop
Instr3 nop
overlay
res 4*10 'instructions get copied here onwards...
Wow, I really should get one of those Nano boards, Great debugging guys, and nice find!
PS. Chip, if you're looking for any last-minute instructions to add, what about a RDQUADB, which reads a long and puts each byte into 4 longs? That would be helpful. And a WRQUADB to put them back into one long in hub RAM.
Baggers, that would be maybe too much at this late stage. I wish I would have known about this earlier. You told me, but I didn't picture it this clearly back then. That would be pretty simple.
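For reference, the RDQUADB/WRQUADB idea amounts to unpacking one hub long into four byte-sized values and packing them back. The Python sketch below models that, assuming byte 0 (the least significant byte) lands in the first QUAD register; that ordering, like the function names, is an assumption and not something settled in the thread.

```python
def rdquadb(hub_long: int) -> list[int]:
    """Split one 32-bit long into four longs, one byte each (byte 0 first)."""
    return [(hub_long >> (8 * i)) & 0xFF for i in range(4)]


def wrquadb(quads: list[int]) -> int:
    """Pack the low byte of each of four longs back into a single long."""
    if len(quads) != 4:
        raise ValueError("expected exactly four quad registers")
    value = 0
    for i, q in enumerate(quads):
        value |= (q & 0xFF) << (8 * i)
    return value


if __name__ == "__main__":
    quads = rdquadb(0x44332211)
    print([hex(q) for q in quads], hex(wrquadb(quads)))
```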
It's got everything in it, including new DE0_Nano and DE2_115 files, all tested. New Pnut.exe and Prop2 docs, as well.
I think we've got all the QUAD problems whipped. It runs LMM code like you'd expect. You can start the QUADs at any register now and clear them at the same time, if you want.
I got the REPS/REPD working with multitasking now. Any task can use it, but only one task at a time.
Here are the updated docs for SETQUAD and RDQUAD timing:
After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:
RDQUAD hubaddress 'read a quad into the QUAD registers mapped at quad0..quad3
NOP 'do something for at least 3 clocks to allow QUADs to update
NOP
NOP
CMP quad0,quad1 'mapped QUADs are now accessible via D and S
After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:
SETQUAD #quad0 'map QUADs to quad0..quad3
RDQUAD hubaddress 'read a quad into the QUAD registers mapped at quad0..quad3
NOP 'do something for at least 3 clocks to allow QUADs to update
NOP
NOP
NOP 'do at least 1 instruction to get QUADs into pipeline
quad0 NOP 'QUAD0..QUAD3 are now executable
quad1 NOP
quad2 NOP
quad3 NOP
After a SETQUAD, mapped QUAD registers are writable immediately, but original contents are readable via D and S after 2 instructions:
SETQUAD #quad0 'map QUADs to quad0..quad3 (new address)
NOP 'do at least two instructions to queue up QUADs
NOP
CMP quad0,quad1 'mapped QUADS are now accessible via D and S
On cog startup, the QUAD registers are cleared to 0's.
Here's Bill's code that I worked with all day in order to get the QUAD issues straightened out. It's finally behaving and doing LMM nicely:
'
' rdquad lmm2 test - now working LMM2 loop!
'
' William Henning
' http://Mikronauts.com
'
CON
CLOCK_FREQ = 60_000_000
BAUD = 115_200
DAT
org 0
setptra what 'write 8k buffer of "add count,#1" lmm code
mov x,#$100
:loop reps #8,#1
setinda #addins0
wrlong inda++,ptra++
djnz x,#:loop
setptra what 'point at start of lmm code
setquaz #ins1 'point quads inside LMM loop (cleared to NOP's)
reps #257,#8 'execute 256 LMM instructions (257th loop reads garbage, but executes last ins)
getcnt cycles
ins0 rdquad ptra++ 'LMM loop
ins1 nop 'four LMM instructions from RDQUAD before last execute here
ins2 nop
ins3 nop
ins4 nop
ins5 nop '(this is where the mapped QUADs actually become executable after RDQUAD)
ins6 nop
ins7 nop
subcnt cycles 'get elapsed time
sub cycles,#1
done setptra results 'write results
reps #9,#1
setinda #count0
wrlong inda++,ptra++
coginit monitor_pgm,monitor_ptr 'relaunch cog0 with monitor - thanks Chip!
monitor_pgm long $70C 'monitor program address
monitor_ptr long 90<<9 + 91 'monitor parameter (conveys tx/rx pins)
addins0 add count0,#1<<1
addins1 add count1,#2<<1
addins2 add count2,#3<<1
addins3 add count3,#4<<1
subins0 sub count0,#1
subins1 sub count1,#2
subins2 sub count2,#3
subins3 sub count3,#4
count0 long 0
count1 long 0
count2 long 0
count3 long 0
count4 long 0
count5 long 0
count6 long 0
count7 long 0
cycles long 0
x long 0
what long $4000
results long $2000
What I'm still missing are simple instructions to move one byte (0, 1, 2 or 3) from one COG long/register into any of the 4 byte positions of another (a cross move between two longs), without having to shift, move, and then shift again in the other long.
I know. I want to document those soon. I'm sorry it's going slow lately.
Comments
I just downloaded it from the link I put up and I programmed it into the board, no problem. Someone was saying something about needing to get the whole Quartus setup onto your machine to get it to work. Anyone remember this?
I run it by pointing at its icon and choosing "run with Quartus".
That works without problems.
12.1's programmer worked fine.
Edit: oops took a phone call in between posting and see you've sorted. Onwards and upwards...
Other than the cache clearing, I have Prop2 Nano-v2 passing my torture test.
Here is the inner loop:
Here are the results:
THANKS CHIP!
Terasic_Prop2.zip
I'll be back in a few hours.
Thanks!
Did the new Nano_v2 have SETQUAZ?
Does this mean the P2 is not yet taped out, and sitting in a shuttle queue somewhere ?
No. It's only inside the .zip I just posted.
We were hoping to catch the December shuttle, but we might have to wait until January. We'll see.
Can you guys propose a simple word-to-long-instruction scheme that I could quickly place into the Verilog? I've thought about this a little, but haven't come to any conclusions. I'm thinking I could swap either the SEUSSF or SEUSSR instruction with this bit-remapping operation to provide some half-size LMM functionality.
- Remove "nr" and the "direct" bit saves 2 more bits
- chop D and S to only four bits each, allowing only R0-R15, saving the other 10 bits
This would be pretty easy for the compiler to support
A simple "RDWORDX" could read a word, and expand it back to a full instruction, so the LMM kernel could work on it
It would not be needed for the fast RDQUAD version as compressed code is about large code size, not highest speed possible
I'd map R0-R15 to $1E0-$1EF as it leaves eight more registers after it for future use (or Prop3 adding more special registers)
Also needed are "ENTER COMPRESSED" and "LEAVE COMPRESSED" mode instructions.
This would SERIOUSLY speed up CMM2
(I keep sticking 2 on Prop2 versions to clearly identify which version I am talking about)
Does not clear:
SETQUAD D
SETQUAD #n
Does clear:
SETQUAD D, wz
SETQUAD #n, wz
Postedit: Just realised it has been decided to use SETQUAZ to zero the quads.
This is possible (untested) code to load an overlay (or a number of quad longs) catching the hub window...
Use SETQUAZ instead of SETQUAD at the "setquad #Instr0" line.
http://forums.parallax.com/showthread.php?144199-Propeller-II-Emulation-of-the-P2-on-DE0-NANO-amp-DE2-115-FPGA-boards&p=1145603&viewfull=1#post1145603