Chip,
Did you manage to fit the USB read bit (GETXP) and CRC (CRCBIT) instructions in this release?
I didn't put them in yet, but there's room for them. Could you please direct me to a post that spells out what is needed, or just repost it, here? What you wanted to do is not a big deal, at all. I just need to know what it is, again. Thanks!
Or would starting a new thread for the USB opcodes be easier to track?
I just realized that since all instructions during hub execution come from the hub, the cog RAM instruction fetching is still going on, but it's being ignored. We could stuff some other address in the instruction-read-address of the cog RAM and get any long out of cog RAM we want. I wonder if there is something useful that can be done by repurposing the cog's internal instruction fetch. It's a free cog RAM read on every hub exec instruction.
Chip
If there is some room in the opcode map to craft a tight follower to the instruction block being fetched from Hub memory, perhaps it could open the door to the suggestion Cluso99 made at post #4217 of this thread: creating some real-time method to "teach" the cache LRU algorithm how to deal with the following fetches from Hub memory.
Unfortunately, it still would not handle conditional-branch preview, unless you could devise a method to automatically do some "Y" branching at the cache level, advancing a single eight-long fetch as an alternate, faster way to "predict" where it will start gathering new data, in case the branch really is taken.
Yanomani
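The "Y" branching idea above can be sketched roughly as follows. This is purely illustrative (no such hardware exists, and the function name and sizes are my assumptions): when a conditional branch is seen in the stream, the loader would also fetch a single wide at the branch target, so either outcome finds its first instructions already cached.

```python
# Rough sketch of the "Y" branch-preview idea (illustrative only).
WIDE = 8  # longs per wide (8-long cache line)

def wides_to_fetch(pc, branch_target=None):
    """Wides the cache loader would request for the long at pc."""
    wanted = [pc // WIDE + 1]                 # usual next-block prefetch
    if branch_target is not None:
        wanted.append(branch_target // WIDE)  # one wide down the other arm of the "Y"
    return wanted

print(wides_to_fetch(10))        # fall-through path only
print(wides_to_fetch(10, 100))   # fall-through plus branch-target wide
```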
I think it would be nice if we could just "force" the loading of the instruction cache with an instruction like I proposed. It would be worth playing with this concept on the FPGA to see what we could determine. I don't think the autoload of the instruction cache can ever be as good as what a well-placed instruction could do. This way, we could control the cache contents in software.
As long as there is auto-load in the absence of forced-loading.
Auto load is needed to reduce the work compilers need to do.
Auto load is the prime requirement.
I just think it could be extremely useful to override this default. Our code would have to be more complex to take advantage of it, but it could be a real boost to some specialised code.
It would be nice to play with this in the fpga while other things are being done.
It would also be nice for the instruction cache autoloader to be able to take advantage of the next available slot to load the wide.
As Cluso said, we can't help much on PNut, but:
Since instructions now use absolute and relative addresses, in my opinion that needs two directives, ABSCOD and RELCOD.
Then, as code can execute from HUB or COG: HUBCOD and COGCOD.
What you were seeking was a PREFETCHI #hubaddr16bit ... to prefetch. Assembly coders (and optimizing compilers) could make good use of that.
hubexec would really benefit from slurping up unused slots...
Actually, I was hoping for a little more
PREFETCHI #hubaddr16bit,#1..3
to prefetch 1, 2 or 3 wides into the instruction cache because the program would know that the routine about to be called was <8, <16, <24 longs.
It would be particularly beneficial if the prefetching (auto or manual) could grab the next unused slot! What a performance boost this would be, and its slot would then be free for another cog's prefetching!
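The proposed PREFETCHI could be modeled in a few lines. This is a hypothetical instruction, so everything here (the function, `TinyCache`, the 32-byte wide size) is just an illustration of the proposal: force-load 1 to 3 consecutive wides starting at a hub address before calling a routine known to fit in them.

```python
# Hypothetical model of PREFETCHI #hubaddr16bit,#1..3 (illustrative only).
WIDE_BYTES = 32    # 8 longs of 4 bytes per wide

class TinyCache:
    """Stand-in for the instruction cache; just records loaded wides."""
    def __init__(self):
        self.wides = set()
    def add(self, wide):
        self.wides.add(wide)

def prefetchi(cache, hub_addr, n):
    """Force-load n wides (1..3) starting at hub byte address hub_addr."""
    assert 1 <= n <= 3
    first = hub_addr // WIDE_BYTES
    for wide in range(first, first + n):
        cache.add(wide)

cache = TinyCache()
prefetchi(cache, 0x1000, 2)    # routine about to be called is < 16 longs
print(sorted(cache.wides))     # wides 128 and 129 are now cached
```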
Currently I expect that the auto icache loader waits until a hub instruction (that's not in the cache) is required. It then stalls until that wide is fetched. The wide will replace the LRU (least recently used) line.
What would be nice is if the loader then automatically presumed the next block would be required, and fetched that wide into the next LRU line. And, once that first block is used and the second wide's use has begun, the next block is fetched.
However, all this is a little complex for now. It will be nice to start playing to see what performance we can get, and where the stalls are.
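The auto-load-plus-next-block behaviour described above can be played with in a toy model before touching the FPGA. Everything here is an assumption for illustration (4 lines of 8 longs, a simple LRU list); it just shows the miss-fill and next-wide prefetch that the posts propose:

```python
# Toy model of an auto-loading instruction cache with next-block prefetch.
WIDE = 8  # longs per cache line ("wide")

class ICache:
    def __init__(self, lines=4):
        self.tags = [None] * lines       # which wide each line holds
        self.order = list(range(lines))  # LRU order, least recent first

    def fetch(self, long_addr, prefetch_next=False):
        """Return True on a hit; on a miss, load the wide into the LRU line."""
        wide = long_addr // WIDE
        hit = wide in self.tags
        if hit:
            line = self.tags.index(wide)
            self.order.remove(line)
            self.order.append(line)      # mark most recently used
        else:
            self._load(wide)
        if prefetch_next and (wide + 1) not in self.tags:
            self._load(wide + 1)         # presume the next block is needed
        return hit

    def _load(self, wide):
        line = self.order.pop(0)         # evict the least recently used line
        self.tags[line] = wide
        self.order.append(line)

cache = ICache()
print(cache.fetch(0))                      # cold miss
print(cache.fetch(8, prefetch_next=True))  # miss, but also pulls in wide 2
print(cache.fetch(16))                     # hit, thanks to the prefetch
```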
Nice list, Cluso99. You could do the ff-bit compaction for those initial instructions, too, and shrink it down some more.
I was at Parallax yesterday for some meetings, but I'm back on the Prop2 work today. I've got to finish a little work on PNut.exe and then do some checks on it. Then, I can make changes to the ROM code (branches are different now) and see if it all works together. After that, I'll be able to work on the cache line mechanism that feeds instructions from the hub. It's taken two weeks to get things in order to be able to make that addition. When I get it working, I'll post an update after I modify Prop2_Docs.txt.
Can you include in Prop2_Docs.txt a list of all directives used in PNut?
I will in the next release. The ORG-type directives select between cog and hub assembler modes. In cog mode, the cog address is used. In hub mode, the hub address >> 2 is used.
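The cog/hub addressing rule above is small enough to show concretely. This is a sketch with my own naming (PNut's internals aren't public here): in hub mode the assembler uses the byte address shifted right by 2, since instructions are longs, while cog addresses already count longs.

```python
# Sketch of the address field the assembler would emit (illustrative).
def asm_address(addr, hub_mode):
    """In hub mode, convert a hub byte address to a long index (>> 2);
    in cog mode, the cog address is used as-is."""
    if hub_mode:
        return addr >> 2   # hub byte address -> long index
    return addr            # cog addresses already index longs

print(hex(asm_address(0x1000, hub_mode=True)))   # hub byte 0x1000 -> long 0x400
print(hex(asm_address(0x1F0, hub_mode=False)))   # cog address unchanged
```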
After that, I'll be able to work on the cache line mechanism that feeds instructions from the hub.
Isn't this going to be kind of tricky since you could possibly have all four tasks trying to fill a cache line at the same time? Somehow you have to make sure they don't step on each other. Although, maybe with LRU that falls out of the design. Anyway, it sounds a little mind bending to me! :-)
Isn't this going to be kind of tricky since you could possibly have all four tasks trying to fill a cache line at the same time? Somehow you have to make sure they don't step on each other. Although, maybe with LRU that falls out of the design. Anyway, it sounds a little mind bending to me! :-)
Yes, they could all want new caches at once. It's hard to think about. I think that's why I'm getting everything else ready before I get into the implementation of the cache lines. I'm hoping that the LRU technique will perform equitably in all cases. I'm thinking the LRU "timer" used to measure usage must be the instruction clock. I'll probably get things running, at first, with no cache lines, just a single long. This will let me prove that hub execution is working properly. Then, I'll get into the caching. This has been a lot of work, but it's not going to be complicated to use.
Thanks Chip. I will do the initial ones too.
Much of the work is done with formulas in Excel. This makes it easier when the instruction set changes.
I need to break the instructions into common format groups for my disassembler.
When you get the cache line mechanism working, could you consider adding an instruction to force the loading of a line(s) please?
Chip, why don't you just use an 8-bit byte for the LRU?
When you load a cache line, shift the byte right, put the 2 bit line ID in the upper 2 bits. When you load the next line, you just pull the 2 bit ID off the bottom of the byte and use that to determine what the next cache line is.
Init LRU to 11100100 at startup
First load grabs 00, shifts right, pushes 00 to the top: 00111001
Next load, and so on.
When the cogs start running instructions through the cache, the efficiency of the code will mix up the order, creating LRU instead of round-robin, but at start it will simply be round-robin.
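The shift-register LRU byte described above is simple enough to verify in a few lines. This is a sketch with my own class and method names; it packs four 2-bit line IDs into one byte, pulls the victim off the bottom on a load, and, on a cache hit, moves a line's ID to the most-recent position (which is what turns the initial round-robin into true LRU):

```python
# Sketch of the 8-bit LRU byte: four 2-bit line IDs, most recent at the top.
class LruByte:
    def __init__(self):
        self.byte = 0b11100100           # lines 3,2,1,0; line 0 is next victim

    def victim(self):
        """Pick the least recently used line and mark it most recent."""
        line = self.byte & 0b11                              # pull ID off the bottom
        self.byte = ((self.byte >> 2) | (line << 6)) & 0xFF  # shift right, push to top
        return line

    def touch(self, line):
        """On a cache hit, move `line` to the most-recent position."""
        ids = [(self.byte >> s) & 0b11 for s in (6, 4, 2, 0)]
        ids.remove(line)
        ids.insert(0, line)              # most recent first
        self.byte = ids[0] << 6 | ids[1] << 4 | ids[2] << 2 | ids[3]

lru = LruByte()
print(lru.victim())     # first load grabs line 0
print(bin(lru.byte))    # 0b111001 with the 00 pushed to the top: 00111001
```

Note how the first `victim()` call reproduces the worked example in the post: 11100100 becomes 00111001.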
Also, I don't see how all tasks could want data at once, since they are staggered through the 4 stages of the pipeline. Sure, it takes 3-8 clocks to load, but the tasks will stall anyway. So, yes you can have a pile-up, and RDLONG should have priority over HUBEX, which will cause further stalls.
This is all a better reason not to support HUBEX for more than task 0. You will have totally non-deterministic execution with 4 HUBEX tasks, because of issue order of the HUBEX loads and overriding priority for RDXXXX instructions. That means if the first instruction of a HUBEX is RDXXXX, it will stall the next HUBEX load, then if it's multiple RDXXXX instructions in a row, those other HUBEX tasks will stall indefinitely.
If you only make 1 task HUBEX and have a 4 line cache, that will give best possible performance.
pre-caching can make hub execution fast, but you must allow explicit memory ops to interrupt pending pre-cache requests.
Comments?
It seems too choppy in time. It might have to correlate functionally with instruction execution to be useful.
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1224661&viewfull=1#post1224661
And here are some fixes/updates/etc. (summaries) that I posted a short time ago (most were resolved or are now irrelevant):
http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1224734&viewfull=1#post1224734
Thanks for these links. I should be able to get to these in a few days. I'm still modifying PNut.exe to get the new instruction set integrated.
I don't suppose there is any way we can help you with PNut?
This time they are in a pdf.
P2_Instruction_Set_20131217b.pdf
I'm not sure that is still correct. See Chip's post here:
http://forums.parallax.com/showthread.php/152079-Hub-Execution-Model-Thread-%28split-from-blog%29?p=1228729&viewfull=1#post1228729
I looked at the instruction list and it shows both absolute and relative possibilities.
I am pretty sure this is what Chip did...
Indeed...
- single-cog mode: LRU 4-line cache
- multi-task mode: each task gets 1 line (bypasses the LRU mechanism)