Chip,
Did you manage to fit the USB read bit (GETXP) and CRC (CRCBIT) instructions in this release?
I didn't put them in yet, but there's room for them. Could you please direct me to a post that spells out what is needed, or just repost it, here? What you wanted to do is not a big deal, at all. I just need to know what it is, again. Thanks!
Or start a new thread for the USB opcodes? That might be easier to track.
I just realized that since all instructions during hub execution come from the hub, the cog RAM instruction fetching is still going on, but it's being ignored. We could stuff some other address in the instruction-read-address of the cog RAM and get any long out of cog RAM we want. I wonder if there is something useful that can be done by repurposing the cog's internal instruction fetch. It's a free cog RAM read on every hub exec instruction.
Chip
If there is some room in the opcode map to craft a tight follower to the instruction block being fetched from hub memory, perhaps it could accommodate a suggestion made by Cluso99, at post #4217 of this thread, about creating some real-time method to "teach" the cache LRU algorithm how to deal with the following fetches from hub memory.
Unfortunately, it will not yet deal with conditional branch preview, unless you could devise a method of automatically doing some "Y" branching at the cache level, advancing a single eight-long fetch as an alternate, and faster, way to "predict" where it will start to gather new data in case the branch really is taken.
Yanomani
I think it would be nice if we could just "force" the loading of the instruction cache with an instruction like I proposed. I think it would be worth playing with this concept on the FPGA to see what we could determine. I don't think the autoload of the instruction cache can ever be as good as what a useful instruction could do. This way, we could control the cache contents by sw.
As long as there is auto-load in the absence of forced-loading.
Auto load is needed to reduce the work compilers need to do.
Auto load is the prime requirement.
I just think it could be extremely useful to override this default. Our code would have to be more complex to take advantage of it, but it could be a real boost to some specialised code.
It would be nice to play with this in the FPGA while other things are being done.
It would also be nice for the instruction cache autoloader to be able to take advantage of the next available slot to load the wide
As Cluso said, we can't help much with PNut, but:
Since instructions now use absolute and relative addresses, that needs, in my opinion, two directives: ABSCOD and RELCOD.
Then, since code can execute from either hub or cog: HUBCOD and COGCOD.
What you were seeking was a PREFETCHI #hubaddr16bit ... to prefetch. Assembly coders (and optimizing compilers) could make good use of that.
hubexec would really benefit from slurping up unused slots...
Actually, I was hoping for a little more
PREFETCHI #hubaddr16bit,#1..3
to prefetch 1, 2 or 3 wides into the instruction cache because the program would know that the routine about to be called was <8, <16, <24 longs.
It would be particularly beneficial if the prefetching (auto or manual) could grab the next unused slot! What a performance boost this would be, and its slot would then be free for another cog's prefetching!
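As a rough illustration of how a program might pick the count operand for the proposed PREFETCHI, the sketch below counts how many wides a routine touches. It assumes a wide is 8 longs (32 bytes) aligned on a 32-byte hub boundary; `wides_touched` and its parameters are hypothetical names, and PREFETCHI itself is only a proposal in this thread.

```python
BYTES_PER_LONG = 4
LONGS_PER_WIDE = 8
# Assumption: wides are fetched on 32-byte aligned hub boundaries.
BYTES_PER_WIDE = BYTES_PER_LONG * LONGS_PER_WIDE


def wides_touched(hub_byte_addr: int, length_longs: int) -> int:
    """Return how many aligned wides a routine of `length_longs` longs spans."""
    if length_longs <= 0:
        return 0
    first = hub_byte_addr // BYTES_PER_WIDE
    last = (hub_byte_addr + length_longs * BYTES_PER_LONG - 1) // BYTES_PER_WIDE
    return last - first + 1
```

Under these assumptions, a routine of 8 longs that does not start on a wide boundary still spans two wides, which is one case where a count operand could help.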
Currently I expect that the auto icache loader waits until a hub instruction (that's not in the cache) is required. It then stalls until that wide is fetched. The wide will replace the LRU (least recently used) line.
What would be nice is if the loader then automatically presumed the next block would be required, and fetched that wide into the next LRU line. And, once that first block is used and the second wide's use has begun, the next block is fetched.
However, all this is a little complex for now. It will be nice to start playing to see what performance we can get, and where the stalls are.
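To get a feel for where the stalls would be, here is a minimal simulation sketch of the behaviour described above: a 4-line cache of wides with LRU replacement, with and without fetching the next wide ahead of time. It assumes a prefetch always completes before the wide is needed and ignores hub slot timing entirely; `count_stalls` is a hypothetical helper, not anything in the actual design.

```python
from collections import OrderedDict

LONGS_PER_WIDE = 8


def count_stalls(long_addrs, lines=4, prefetch_next=False):
    """Count demand misses for a sequence of hub long addresses."""
    cache = OrderedDict()  # keys are wide indices, least recently used first
    stalls = 0

    def load(wide):
        # Bring a wide into the cache, replacing the LRU line when full.
        if wide in cache:
            return
        if len(cache) >= lines:
            cache.popitem(last=False)
        cache[wide] = True

    for addr in long_addrs:
        wide = addr // LONGS_PER_WIDE
        if wide in cache:
            cache.move_to_end(wide)  # mark as most recently used
        else:
            stalls += 1              # demand miss: execution waits for the wide
            load(wide)
        if prefetch_next:
            load(wide + 1)           # assumed to finish before it is needed
    return stalls
```

For straight-line code covering 64 longs, this model gives 8 stalls without the next-wide prefetch and 1 with it. Real numbers would depend on hub timing, which the model leaves out.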
Nice list, Cluso99. You could do the ff-bit compaction for those initial instructions, too, and shrink it down some more.
I was at Parallax yesterday for some meetings, but I'm back on the Prop2 work today. I've got to finish a little work on PNut.exe and then do some checks on it. Then, I can make changes to the ROM code (branches are different now) and see if it all works together. After that, I'll be able to work on the cache line mechanism that feeds instructions from the hub. It's taken two weeks to get things in order to be able to make that addition. When I get it working, I'll post an update after I modify Prop2_Docs.txt.
Can you include a list of all directives used in PNut in Prop2_Docs.txt?
I will in the next release. The ORG-type directives select between cog and hub assembler modes. In cog mode, the cog address is used. In hub mode, the hub address >> 2 is used.
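A tiny sketch of the address rule just described, purely for illustration. The mode strings "cog" and "hub" and the function name are hypothetical; the actual directive names are not spelled out here.

```python
def label_value(mode: str, address: int) -> int:
    """Value the assembler would use for a label, per the rule above."""
    if mode == "cog":
        return address        # cog (long) address is used as-is
    if mode == "hub":
        return address >> 2   # hub byte address shifted down to a long index
    raise ValueError(f"unknown assembler mode: {mode!r}")
```

For example, a hub byte address of $1000 would become $400 in hub mode.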
After that, I'll be able to work on the cache line mechanism that feeds instructions from the hub.
Isn't this going to be kind of tricky since you could possibly have all four tasks trying to fill a cache line at the same time? Somehow you have to make sure they don't step on each other. Although, maybe with LRU that falls out of the design. Anyway, it sounds a little mind bending to me! :-)
Yes, they could all want new caches at once. It's hard to think about. I think that's why I'm getting everything else ready before I get into the implementation of the cache lines. I'm hoping that the LRU technique will perform equitably in all cases. I'm thinking the LRU "timer" used to measure usage must be the instruction clock. I'll probably get things running, at first, with no cache lines, just a single long. This will let me prove that hub execution is working properly. Then, I'll get into the caching. This has been a lot of work, but it's not going to be complicated to use.
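One way to picture an LRU "timer" driven by the instruction clock is a per-line timestamp, sketched below. This is only a software model under assumptions: 4 lines, a counter that advances once per fetched instruction, and the line with the oldest stamp being replaced. `TimestampLRU` is a hypothetical name and does not describe the actual logic.

```python
class TimestampLRU:
    """4-line cache model where each line remembers when it was last used."""

    def __init__(self, lines=4):
        self.tags = [None] * lines   # which wide each line holds
        self.stamps = [0] * lines    # instruction-clock value at last use
        self.clock = 0               # assumed to tick once per instruction

    def fetch(self, wide):
        """Return True on a hit, False on a miss (a line is refilled)."""
        self.clock += 1
        if wide in self.tags:
            self.stamps[self.tags.index(wide)] = self.clock
            return True
        victim = self.stamps.index(min(self.stamps))  # oldest stamp loses
        self.tags[victim] = wide
        self.stamps[victim] = self.clock
        return False
```

Fetching wides 0, 1, 2, 3, 0, 4 in that order replaces the line holding wide 1, since wide 0 was refreshed by the hit.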
Thanks Chip. I will do the initial ones too.
Much of the work is done with formulae in Excel. This makes it easier when the instruction set changes.
I need to break the instructions into common format groups for my disassembler.
When you get the cache line mechanism working, could you consider adding an instruction to force the loading of a line(s) please?
Chip, why don't you just use an 8-bit byte for the LRU?
When you load a cache line, shift the byte right, put the 2 bit line ID in the upper 2 bits. When you load the next line, you just pull the 2 bit ID off the bottom of the byte and use that to determine what the next cache line is.
Init LRU to 11100100 at startup
First load grabs 00, shifts right, pushes 00 to the top: 00111001
Next load grabs 01, and so on.
When the cogs start running instructions through the cache, the efficiency of the code will mix up the order, creating LRU instead of round-robin, but at start it will simply be round-robin.
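The byte scheme above can be sketched as follows. `evict` follows the description directly; `touch` is an assumption about what happens on a cache hit (the used line's ID moves to the top), since the post only says that code execution "will mix up the order". All names are hypothetical.

```python
INITIAL_ORDER = 0b11100100  # line IDs 3,2,1,0 from top to bottom; bottom goes first


def evict(order):
    """Pick the line to reload: take the bottom 2 bits, push them to the top."""
    victim = order & 0b11
    new_order = (order >> 2) | (victim << 6)
    return victim, new_order


def touch(order, line):
    """Assumed hit handling: move `line` to the top (most recently used)."""
    ids = [(order >> shift) & 0b11 for shift in (0, 2, 4, 6)]  # oldest first
    ids.remove(line)
    ids.append(line)
    return sum(line_id << (2 * pos) for pos, line_id in enumerate(ids))
```

With no hits in between, four calls to `evict` hand out lines 0, 1, 2, 3 and return the byte to its initial value, which is the round-robin startup behaviour described.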
Also, I don't see how all tasks could want data at once, since they are staggered through the 4 stages of the pipeline. Sure, it takes 3-8 clocks to load, but the tasks will stall anyway. So, yes you can have a pile-up, and RDLONG should have priority over HUBEX, which will cause further stalls.
This is all a better reason not to support HUBEX for more than task 0. You will have totally non-deterministic execution with 4 HUBEX tasks, because of issue order of the HUBEX loads and overriding priority for RDXXXX instructions. That means if the first instruction of a HUBEX is RDXXXX, it will stall the next HUBEX load, then if it's multiple RDXXXX instructions in a row, those other HUBEX tasks will stall indefinitely.
If you only make 1 task HUBEX and have a 4 line cache, that will give best possible performance.
Pre-caching can make hub execution fast, but you must allow explicit memory ops to interrupt pending pre-cache requests.
Comments
It seems too choppy, in time. It might have to functionally correlate with instruction execution to be useful.
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1224661&viewfull=1#post1224661
And here were some fixes/updates/etc (summaries) that I did a short time ago (most were resolved or are now irrelevant)
http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1224734&viewfull=1#post1224734
Thanks for these links. I should be able to get to these in a few days. I'm still modifying PNut.exe to get the new instruction set integrated.
I don't suppose there is any way we can help you with PNut?
This time they are in a pdf.
P2_Instruction_Set_20131217b.pdf
I'm not sure that is still correct. See Chip's post here:
http://forums.parallax.com/showthread.php/152079-Hub-Execution-Model-Thread-%28split-from-blog%29?p=1228729&viewfull=1#post1228729
I looked at the instruction list, and it shows both absolute and relative possibilities.
I am pretty sure this is what Chip did...
Indeed...
ZCxS Opcode  ZC I Cond Destinatn Source     Instr/00 01       10       11       Operand(s)             Flags
--------------------------------------------------------------------------------------------------------------
ZCWS 0000000 ZC I CCCC DDDDDDDDD SSSSSSSSS  RDBYTE                              D,S/PTRA/PTRB          [WZ],[WC]
ZCWS 0000001 ZC I CCCC DDDDDDDDD SSSSSSSSS  RDBYTEC                             D,S/PTRA/PTRB          [WZ],[WC]
ZCWS 0000010 ZC I CCCC DDDDDDDDD SSSSSSSSS  RDWORD                              D,S/PTRA/PTRB          [WZ],[WC]
ZCWS 0000011 ZC I CCCC DDDDDDDDD SSSSSSSSS  RDWORDC                             D,S/PTRA/PTRB          [WZ],[WC]
ZCWS 0000100 ZC I CCCC DDDDDDDDD SSSSSSSSS  RDLONG                              D,S/PTRA/PTRB          [WZ],[WC]
ZCWS 0000101 ZC I CCCC DDDDDDDDD SSSSSSSSS  RDLONGC                             D,S/PTRA/PTRB          [WZ],[WC]
ZCWS 0000110 ZC I CCCC DDDDDDDDD SSSSSSSSS  RDAUX                               D,S/#0..$FF/PTRX/PTRY  [WZ],[WC]
ZCWS 0000111 ZC I CCCC DDDDDDDDD SSSSSSSSS  RDAUXR                              D,S/#0..$FF/PTRX/PTRY  [WZ],[WC]
ZCMS 0001000 ZC I CCCC DDDDDDDDD SSSSSSSSS  ISOB                                D,S/#                  [WZ],[WC]
ZCMS 0001001 ZC I CCCC DDDDDDDDD SSSSSSSSS  NOTB                                D,S/#                  [WZ],[WC]
ZCMS 0001010 ZC I CCCC DDDDDDDDD SSSSSSSSS  CLRB                                D,S/#                  [WZ],[WC]
ZCMS 0001011 ZC I CCCC DDDDDDDDD SSSSSSSSS  SETB                                D,S/#                  [WZ],[WC]
ZCMS 0001100 ZC I CCCC DDDDDDDDD SSSSSSSSS  SETBC                               D,S/#                  [WZ],[WC]
ZCMS 0001101 ZC I CCCC DDDDDDDDD SSSSSSSSS  SETBNC                              D,S/#                  [WZ],[WC]
ZCMS 0001110 ZC I CCCC DDDDDDDDD SSSSSSSSS  SETBZ                               D,S/#                  [WZ],[WC]
ZCMS 0001111 ZC I CCCC DDDDDDDDD SSSSSSSSS  SETBNZ                              D,S/#                  [WZ],[WC]
ZCMS 0010000 ZC I CCCC DDDDDDDDD SSSSSSSSS  ANDN                                D,S/#                  [WZ],[WC]
ZCMS 0010001 ZC I CCCC DDDDDDDDD SSSSSSSSS  AND                                 D,S/#                  [WZ],[WC]
ZCMS 0010010 ZC I CCCC DDDDDDDDD SSSSSSSSS  OR                                  D,S/#                  [WZ],[WC]
ZCMS 0010011 ZC I CCCC DDDDDDDDD SSSSSSSSS  XOR                                 D,S/#                  [WZ],[WC]
ZCMS 0010100 ZC I CCCC DDDDDDDDD SSSSSSSSS  MUXC                                D,S/#                  [WZ],[WC]
ZCMS 0010101 ZC I CCCC DDDDDDDDD SSSSSSSSS  MUXNC                               D,S/#                  [WZ],[WC]
ZCMS 0010110 ZC I CCCC DDDDDDDDD SSSSSSSSS  MUXZ                                D,S/#                  [WZ],[WC]
ZCMS 0010111 ZC I CCCC DDDDDDDDD SSSSSSSSS  MUXNZ                               D,S/#                  [WZ],[WC]
ZCMS 0011000 ZC I CCCC DDDDDDDDD SSSSSSSSS  ROR                                 D,S/#                  [WZ],[WC]
ZCMS 0011001 ZC I CCCC DDDDDDDDD SSSSSSSSS  ROL                                 D,S/#                  [WZ],[WC]
ZCMS 0011010 ZC I CCCC DDDDDDDDD SSSSSSSSS  SHR                                 D,S/#                  [WZ],[WC]
ZCMS 0011011 ZC I CCCC DDDDDDDDD SSSSSSSSS  SHL                                 D,S/#                  [WZ],[WC]
ZCMS 0011100 ZC I CCCC DDDDDDDDD SSSSSSSSS  RCR                                 D,S/#                  [WZ],[WC]
ZCMS 0011101 ZC I CCCC DDDDDDDDD SSSSSSSSS  RCL                                 D,S/#                  [WZ],[WC]
ZCMS 0011110 ZC I CCCC DDDDDDDDD SSSSSSSSS  SAR                                 D,S/#                  [WZ],[WC]
ZCMS 0011111 ZC I CCCC DDDDDDDDD SSSSSSSSS  REV                                 D,S/#                  [WZ],[WC]
ZCWS 0100000 ZC I CCCC DDDDDDDDD SSSSSSSSS  MOV                                 D,S/#                  [WZ],[WC]
ZCWS 0100001 ZC I CCCC DDDDDDDDD SSSSSSSSS  NOT                                 D,S/#                  [WZ],[WC]
ZCWS 0100010 ZC I CCCC DDDDDDDDD SSSSSSSSS  ABS                                 D,S/#                  [WZ],[WC]
ZCWS 0100011 ZC I CCCC DDDDDDDDD SSSSSSSSS  NEG                                 D,S/#                  [WZ],[WC]
ZCWS 0100100 ZC I CCCC DDDDDDDDD SSSSSSSSS  NEGC                                D,S/#                  [WZ],[WC]
ZCWS 0100101 ZC I CCCC DDDDDDDDD SSSSSSSSS  NEGNC                               D,S/#                  [WZ],[WC]
ZCWS 0100110 ZC I CCCC DDDDDDDDD SSSSSSSSS  NEGZ                                D,S/#                  [WZ],[WC]
ZCWS 0100111 ZC I CCCC DDDDDDDDD SSSSSSSSS  NEGNZ                               D,S/#                  [WZ],[WC]
ZCMS 0101000 ZC I CCCC DDDDDDDDD SSSSSSSSS  ADD                                 D,S/#                  [WZ],[WC]
ZCMS 0101001 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUB                                 D,S/#                  [WZ],[WC]
ZCMS 0101010 ZC I CCCC DDDDDDDDD SSSSSSSSS  ADDX                                D,S/#                  [WZ],[WC]
ZCMS 0101011 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUBX                                D,S/#                  [WZ],[WC]
ZCMS 0101100 ZC I CCCC DDDDDDDDD SSSSSSSSS  ADDS                                D,S/#                  [WZ],[WC]
ZCMS 0101101 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUBS                                D,S/#                  [WZ],[WC]
ZCMS 0101110 ZC I CCCC DDDDDDDDD SSSSSSSSS  ADDSX                               D,S/#                  [WZ],[WC]
ZCMS 0101111 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUBSX                               D,S/#                  [WZ],[WC]
ZCMS 0110000 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUMC                                D,S/#                  [WZ],[WC]
ZCMS 0110001 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUMNC                               D,S/#                  [WZ],[WC]
ZCMS 0110010 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUMZ                                D,S/#                  [WZ],[WC]
ZCMS 0110011 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUMNZ                               D,S/#                  [WZ],[WC]
ZCMS 0110100 ZC I CCCC DDDDDDDDD SSSSSSSSS  MIN                                 D,S/#                  [WZ],[WC]
ZCMS 0110101 ZC I CCCC DDDDDDDDD SSSSSSSSS  MAX                                 D,S/#                  [WZ],[WC]
ZCMS 0110110 ZC I CCCC DDDDDDDDD SSSSSSSSS  MINS                                D,S/#                  [WZ],[WC]
ZCMS 0110111 ZC I CCCC DDDDDDDDD SSSSSSSSS  MAXS                                D,S/#                  [WZ],[WC]
ZCMS 0111000 ZC I CCCC DDDDDDDDD SSSSSSSSS  ADDABS                              D,S/#                  [WZ],[WC]
ZCMS 0111001 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUBABS                              D,S/#                  [WZ],[WC]
ZCMS 0111010 ZC I CCCC DDDDDDDDD SSSSSSSSS  INCMOD                              D,S/#                  [WZ],[WC]
ZCMS 0111011 ZC I CCCC DDDDDDDDD SSSSSSSSS  DECMOD                              D,S/#                  [WZ],[WC]
ZCMS 0111100 ZC I CCCC DDDDDDDDD SSSSSSSSS  CMPSUB                              D,S/#                  [WZ],[WC]
ZCMS 0111101 ZC I CCCC DDDDDDDDD SSSSSSSSS  SUBR                                D,S/#                  [WZ],[WC]
ZCMS 0111110 ZC I CCCC DDDDDDDDD SSSSSSSSS  MUL                                 D,S/#                  [WZ],[WC]
ZCMS 0111111 ZC I CCCC DDDDDDDDD SSSSSSSSS  SCL                                 D,S/#                  [WZ],[WC]
--------------------------------------------------------------------------------------------------------------
ZCWS 1000000 ZC I CCCC DDDDDDDDD SSSSSSSSS  DECOD2                              D,S/#                  [WZ],[WC]
ZCWS 1000001 ZC I CCCC DDDDDDDDD SSSSSSSSS  DECOD3                              D,S/#                  [WZ],[WC]
ZCWS 1000010 ZC I CCCC DDDDDDDDD SSSSSSSSS  DECOD4                              D,S/#                  [WZ],[WC]
ZCWS 1000011 ZC I CCCC DDDDDDDDD SSSSSSSSS  DECOD5                              D,S/#                  [WZ],[WC]
Z-WS 1000100 Zf I CCCC DDDDDDDDD SSSSSSSSS  ENCOD    BLMASK                     D,S/#                  [WZ]
Z-WS 1000101 Zf I CCCC DDDDDDDDD SSSSSSSSS  ONECNT   ZERCNT                     D,S/#                  [WZ]
-CWS 1000110 fC I CCCC DDDDDDDDD SSSSSSSSS  INCPAT   DECPAT                     D,S/#                  [WC]
--WS 1000111 ff I CCCC DDDDDDDDD SSSSSSSSS  SPLITB   MERGEB   SPLITW   MERGEW   D,S/#
--MS 10010nn nf I CCCC DDDDDDDDD SSSSSSSSS  GETNIB   SETNIB                     D,S/#,#0..7
--MS 1001100 nf I CCCC DDDDDDDDD SSSSSSSSS  GETWORD  SETWORD                    D,S/#,#0..1
--MS 1001101 ff I CCCC DDDDDDDDD SSSSSSSSS  STWORDS  ROLNIB   ROLBYTE  ROLWORD  D,S/#
--MS 1001110 ff I CCCC DDDDDDDDD SSSSSSSSS  SETS     SETD     SETX     SETI     D,S/#
-CMS 1001111 fC I CCCC DDDDDDDDD SSSSSSSSS  COGNEW   WAITCNT                    D,S/#                  [WC]
--MS 101000n nf I CCCC DDDDDDDDD SSSSSSSSS  GETBYTE  SETBYTE                    D,S/#,#0..3
--WS 1010010 ff I CCCC DDDDDDDDD SSSSSSSSS  STBYTES  SWBYTES  PACKRGB  UNPKRGB  D,S/#
--MS 1010011 ff I CCCC DDDDDDDDD SSSSSSSSS  ADDPIX   MULPIX   BLNPIX   MIXPIX   D,S/#
ZCMS 1010100 ZC I CCCC DDDDDDDDD SSSSSSSSS  JMPSW                               D,S/#                  [WZ],[WC]
ZCMS 1010101 ZC I CCCC DDDDDDDDD SSSSSSSSS  JMPSWD                              D,S/#                  [WZ],[WC]
--MS 1010110 ff I CCCC DDDDDDDDD SSSSSSSSS  IJZ      IJZD     IJNZ     IJNZD    D,S/#
--MS 1010111 ff I CCCC DDDDDDDDD SSSSSSSSS  DJZ      DJZD     DJNZ     DJNZD    D,S/#
ZCRS 1011000 ZC I CCCC DDDDDDDDD SSSSSSSSS  TESTB                               D,S/#                  [WZ],[WC]
ZCRS 1011001 ZC I CCCC DDDDDDDDD SSSSSSSSS  TESTN                               D,S/#                  [WZ],[WC]
ZCRS 1011010 ZC I CCCC DDDDDDDDD SSSSSSSSS  TEST                                D,S/#                  [WZ],[WC]
ZCRS 1011011 ZC I CCCC DDDDDDDDD SSSSSSSSS  CMP                                 D,S/#                  [WZ],[WC]
ZCRS 1011100 ZC I CCCC DDDDDDDDD SSSSSSSSS  CMPX                                D,S/#                  [WZ],[WC]
ZCRS 1011101 ZC I CCCC DDDDDDDDD SSSSSSSSS  CMPS                                D,S/#                  [WZ],[WC]
ZCRS 1011110 ZC I CCCC DDDDDDDDD SSSSSSSSS  CMPSX                               D,S/#                  [WZ],[WC]
ZCRS 1011111 ZC I CCCC DDDDDDDDD SSSSSSSSS  CMPR                                D,S/#                  [WZ],[WC]
--RS 11000nn nf I CCCC DDDDDDDDD SSSSSSSSS  COGINIT  WAITVID                    D,S/#,#0..7 | #0..$DFF,S/#
-CRS 110010n nC I CCCC DDDDDDDDD SSSSSSSSS  WAITPEQ                             D,S/#,#0..3            [WC]
-CRS 110011n nC I CCCC DDDDDDDDD SSSSSSSSS  WAITPNE                             D,S/#,#0..3            [WC]
--LS 1101000 fL I CCCC DDDDDDDDD SSSSSSSSS  WRBYTE   WRWORD                     D/#,S/PTRA/PTRB | D/#,S/PTRA/PTRB
--LS 1101001 fL I CCCC DDDDDDDDD SSSSSSSSS  WRLONG   FRAC                       D/#,S/PTRA/PTRB | D/#,S/#
--LS 1101010 fL I CCCC DDDDDDDDD SSSSSSSSS  WRAUX    WRAUXR                     D/#,S/#0..$FF/PTRX/PTRY | D/#,S/#0..$FF/PTRX/PTRY
--LS 1101011 fL I CCCC DDDDDDDDD SSSSSSSSS  SETACCA  SETACCB                    D/#,S/#
--LS 1101100 fL I CCCC DDDDDDDDD SSSSSSSSS  MACA     MACB                       D/#,S/#
--LS 1101101 fL I CCCC DDDDDDDDD SSSSSSSSS  MUL32    MUL32U                     D/#,S/#
--LS 1101110 fL I CCCC DDDDDDDDD SSSSSSSSS  DIV32    DIV32U                     D/#,S/#
--LS 1101111 fL I CCCC DDDDDDDDD SSSSSSSSS  DIV64    DIV64U                     D/#,S/#
--LS 1110000 fL I CCCC DDDDDDDDD SSSSSSSSS  SQRT64   QSINCOS                    D/#,S/#
--LS 1110001 fL I CCCC DDDDDDDDD SSSSSSSSS  QARCTAN  QROTATE                    D/#,S/#
--LS 1110010 fL I CCCC DDDDDDDDD SSSSSSSSS  SETSERA  SETSERB                    D/#,S/#
--LS 1110011 fL I CCCC DDDDDDDDD SSSSSSSSS  SETCTRS  SETWAVS                    D/#,S/#
--LS 1110100 fL I CCCC DDDDDDDDD SSSSSSSSS  SETFRQS  SETPHSS                    D/#,S/#
--LS 1110101 fL I CCCC DDDDDDDDD SSSSSSSSS  ADDPHSS  SUBPHSS                    D/#,S/#
--LS 1110110 fL I CCCC DDDDDDDDD SSSSSSSSS  JP       JPD                        D/#,S/#
--LS 1110111 fL I CCCC DDDDDDDDD SSSSSSSSS  JNP      JNPD                       D/#,S/#
--LS 111100n nL I CCCC DDDDDDDDD SSSSSSSSS  CFGPINS  JMPTASK                    D/#,S/#,#0..2 | D/#,S/#
--LS 1111010 fL I CCCC DDDDDDDDD SSSSSSSSS  SETXFR   SETMIX                     D/#,S/#
--LS 1111011 fL I CCCC DDDDDDDDD SSSSSSSSS  <empty>  <empty>                    D/#,S/#
--RS 1111100 ff I CCCC DDDDDDDDD SSSSSSSSS  JZ       JZD      JNZ      JNZD     D,S/#
--------------------------------------------------------------------------------------------------------------
---- 1111101 ff n nnnn nnnnnnnnn nnnnnnnnn  AUGI                                #23bits
---- 1111101 01 0 nnnn nnnnnnnnn nnniiiiii  REPS                                #1..$10000,#1..64
---- 1111101 01 1 BBAA ddddddddd sssssssss  xxxINDx  FIXINDx                    #d,#s | #d,#s | #d,#s | SETINDx #s | #d | #d,#s
---- 1111101 10 0 CCCC ffnnnnnnn nnnnnnnnn  JMP      JMP_     JMP      JMP_     #abs | #abs | @rel | @rel
---- 1111101 10 1 CCCC ffnnnnnnn nnnnnnnnn  JMPD     JMPD_    JMPD     JMPD_    #abs | #abs | @rel | @rel
---- 1111101 11 0 CCCC ffnnnnnnn nnnnnnnnn  CALL     CALL_    CALL     CALL_    #abs | #abs | @rel | @rel
---- 1111101 11 1 CCCC ffnnnnnnn nnnnnnnnn  CALLD    CALLD_   CALLD    CALLD_   #abs | #abs | @rel | @rel
---- 1111110 00 0 CCCC ffnnnnnnn nnnnnnnnn  CALLA    CALLA_   CALLA    CALLA_   #abs | #abs | @rel | @rel
---- 1111110 00 1 CCCC ffnnnnnnn nnnnnnnnn  CALLAD   CALLAD_  CALLAD   CALLAD_  #abs | #abs | @rel | @rel
---- 1111110 01 0 CCCC ffnnnnnnn nnnnnnnnn  CALLB    CALLB_   CALLB    CALLB_   #abs | #abs | @rel | @rel
---- 1111110 01 1 CCCC ffnnnnnnn nnnnnnnnn  CALLBD   CALLBD_  CALLBD   CALLBD_  #abs | #abs | @rel | @rel
---- 1111110 10 0 CCCC ffnnnnnnn nnnnnnnnn  CALLX    CALLX_   CALLX    CALLX_   #abs | #abs | @rel | @rel
---- 1111110 10 1 CCCC ffnnnnnnn nnnnnnnnn  CALLXD   CALLXD_  CALLXD   CALLXD_  #abs | #abs | @rel | @rel
---- 1111110 11 0 CCCC ffnnnnnnn nnnnnnnnn  CALLY    CALLY_   CALLY    CALLY_   #abs | #abs | @rel | @rel
---- 1111110 11 1 CCCC ffnnnnnnn nnnnnnnnn  CALLYD   CALLYD_  CALLYD   CALLYD_  #abs | #abs | @rel | @rel
--------------------------------------------------------------------------------------------------------------
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0000000ff  COGID    LOCKNEW  GETPC    GETLFSR  D                      [WZ],[WC]
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0000001ff  GETCNT   GETCNTX  GETACAL  GETACAH  D                      [WZ],[WC]
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0000010ff  GETACBL  GETACBH  GETPTRA  GETPTRB  D                      [WZ],[WC]
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0000011ff  GETPTRX  GETPTRY  SERINA   SERINB   D                      [WZ],[WC]
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0000100ff  GETMULL  GETMULH  GETDIVQ  GETDIVR  D                      [WZ],[WC]
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0000101ff  GETSQRT  GETQX    GETQY    GETQZ    D                      [WZ],[WC]
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0000110ff  GETPHSA  GETPHZA  GETCOSA  GETSINA  D                      [WZ],[WC]
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0000111ff  GETPHSB  GETPHZB  GETCOSB  GETSINB  D                      [WZ],[WC]
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 0001000ff  PUSHZC   POPZC    SUBCNT   GETPIX   D                      [WZ],[WC]
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 0001001ff  BINBCD   BCDBIN   BINGRY   GRYBIN   D                      [WZ],[WC]
ZCM- 1111111 ZC 0 CCCC DDDDDDDDD 0001010ff  ESWAP4   ESWAP8   SEUSSF   SEUSSR   D                      [WZ],[WC]
Z-M- 1111111 ZC 0 CCCC DDDDDDDDD 0001011ff  INCD     DECD     INCDS    DECDS    D                      [WZ],[WC]
ZCW- 1111111 ZC 0 CCCC DDDDDDDDD 0001100ff  POP      <empty>  <empty>  <empty>  D                      [WZ],[WC]
--------------------------------------------------------------------------------------------------------------
--L- 1111111 00 L CCCC DDDDDDDDD 001iiiiii  REPD                                D/#1..512,#1..64
--------------------------------------------------------------------------------------------------------------
--L- 1111111 00 L CCCC DDDDDDDDD 0100000ff  CLKSET   COGSTOP  LOCKSET  LOCKCLR  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0100001ff  LOCKRET  RDWIDEC  RDWIDE   WRWIDE   D/# | D/PTRA/PTRB | D/PTRA/PTRB | D/PTRA/PTRB
ZCL- 1111111 ZC L CCCC DDDDDDDDD 0100010ff  GETP     GETNP    SEROUTA  SEROUTB  D/#                    [WZ],[WC]
-CL- 1111111 0C L CCCC DDDDDDDDD 0100011ff  CMPCNT   WAITPX   WAITPR   WAITPF   D/#                    [WC]
ZCL- 1111111 ZC L CCCC DDDDDDDDD 0100100ff  SETZC    SETMAP   SETXCH   SETTASK  D/#                    [WZ],[WC]
--L- 1111111 00 L CCCC DDDDDDDDD 0100101ff  SETRACE  SARACCA  SARACCB  SARACCS  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0100110ff  SETPTRA  SETPTRB  ADDPTRA  ADDPTRB  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0100111ff  SUBPTRA  SUBPTRB  SETWIDE  SETWIDZ  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0101000ff  SETPTRX  SETPTRY  ADDPTRX  ADDPTRY  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0101001ff  SUBPTRX  SUBPTRY  PASSCNT  WAIT     D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0101010ff  OFFP     NOTP     CLRP     SETP     D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0101011ff  SETPC    SETPNC   SETPZ    SETPNZ   D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0101100ff  DIV64D   SQRT32   QLOG     QEXP     D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0101101ff  SETQI    SETQZ    CFGDACS  SETDACS  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0101110ff  CFGDAC0  CFGDAC1  CFGDAC2  CFGDAC3  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0101111ff  SETDAC0  SETDAC1  SETDAC2  SETDAC3  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0110000ff  SETCTRA  SETWAVA  SETFRQA  SETPHSA  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0110001ff  ADDPHSA  SUBPHSA  SETVID   SETVIDY  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0110010ff  SETCTRB  SETWAVB  SETFRQB  SETPHSB  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0110011ff  ADDPHSB  SUBPHSB  SETVIDI  SETVIDQ  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0110100ff  SETPIX   SETPIXZ  SETPIXU  SETPIXV  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0110101ff  SETPIXA  SETPIXR  SETPIXG  SETPIXB  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0110110ff  SETPORA  SETPORB  SETPORC  SETPORD  D/#
--L- 1111111 00 L CCCC DDDDDDDDD 0110111ff  PUSH     <empty>  JMPREL   JMPRELD  D/# | D/# | D | D
--R- 1111111 00 0 CCCC DDDDDDDDD 0111010ff  JMP      JMP_     JMPD     JMPD_    D
--R- 1111111 00 0 CCCC DDDDDDDDD 0111011ff  CALL     CALL_    CALLD    CALLD_   D
--R- 1111111 00 0 CCCC DDDDDDDDD 0111100ff  CALLA    CALLA_   CALLAD   CALLAD_  D
--R- 1111111 00 0 CCCC DDDDDDDDD 0111101ff  CALLB    CALLB_   CALLBD   CALLBD_  D
--R- 1111111 00 0 CCCC DDDDDDDDD 0111110ff  CALLX    CALLX_   CALLXD   CALLXD_  D
--R- 1111111 00 0 CCCC DDDDDDDDD 0111111ff  CALLY    CALLY_   CALLYD   CALLYD_  D
ZC-- 1111111 ZC x CCCC xxxxxxxxx 1000000ff  RETA     RETAD    RETB     RETBD                           [WZ],[WC]
ZC-- 1111111 ZC x CCCC xxxxxxxxx 1000001ff  RETX     RETXD    RETY     RETYD                           [WZ],[WC]
ZC-- 1111111 ZC x CCCC xxxxxxxxx 1000010ff  RET      RETD     POLCTRA  POLCTRB                         [WZ],[WC]
ZC-- 1111111 ZC x CCCC xxxxxxxxx 1000011ff  POLVID   CAPCTRA  CAPCTRB  CAPCTRS                         [WZ],[WC]
---- 1111111 00 x CCCC xxxxxxxxx 1000100ff  CACHEX   CLRACCA  CLRACCB  CLRACCS
ZC-- 1111111 ZC x CCCC xxxxxxxxx 1000101ff  CHKPTRX  CHKPTRY  SYNCTRA  SYNCTRB                         [WZ],[WC]
---- 1111111 00 x CCCC xxxxxxxxx 1000110ff  SETPIXW  <empty>  <empty>  <empty>
--------------------------------------------------------------------------------------------------------------
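For anyone writing a disassembler against the listing above, here is a minimal sketch of splitting a 32-bit instruction long into the fields named in the header row. It assumes the columns are printed most significant bit first, giving opcode (7 bits), ZC (2), I (1), condition (4), destination (9), and source (9). `Fields`, `decode`, and `encode` are hypothetical names, and the sketch only covers the common D,S format, not the special encodings near the end of the list.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    opcode: int  # 7 bits
    zc: int      # 2 bits
    i: int       # 1 bit
    cond: int    # 4 bits
    dest: int    # 9 bits
    src: int     # 9 bits


def decode(word: int) -> Fields:
    """Split a 32-bit long, assuming the listing's columns are MSB first."""
    return Fields(
        opcode=(word >> 25) & 0x7F,
        zc=(word >> 23) & 0x3,
        i=(word >> 22) & 0x1,
        cond=(word >> 18) & 0xF,
        dest=(word >> 9) & 0x1FF,
        src=word & 0x1FF,
    )


def encode(f: Fields) -> int:
    """Inverse of decode, under the same bit-order assumption."""
    return (
        (f.opcode & 0x7F) << 25
        | (f.zc & 0x3) << 23
        | (f.i & 0x1) << 22
        | (f.cond & 0xF) << 18
        | (f.dest & 0x1FF) << 9
        | (f.src & 0x1FF)
    )
```

Grouping instructions by shared field layout, as mentioned for the disassembler, would then mostly be a matter of choosing which decoder applies for a given opcode range.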
InstructionSet_20131217.spin
- single cog mode, LRU 4 line cache
- multi-task mode, each task gets 1 line (bypass LRU mechanism)
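The two modes listed above could be sketched as a single line-selection rule. The mapping of task n to cache line n is an assumption made for illustration, and `pick_line` is a hypothetical name.

```python
def pick_line(multitask: bool, task: int, lru_victim: int) -> int:
    """Choose which of the 4 cache lines to refill on a miss."""
    if multitask:
        return task & 0b11  # assumed: task n always owns line n, LRU bypassed
    return lru_victim       # single-task mode: refill the LRU line
```

In multi-task mode, one task's misses could then never evict another task's line, at the cost of each task having only 8 longs cached.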