Once you kick a COG into HUBEXEC mode, what is the intended/best use of the COG register space? Except for the COG H/W registers at the top, does it just become a fast data space?
You could use it as a fast, deterministic code space and/or as a fast data space.
I suppose newbies coming into the Prop2 will probably start out with hub execution because that's the paradigm they're familiar with. They'll be able to explore a lot of new things, though.
I'm mired deep in cache stuff now. For some reason, this is a set of problems that don't chart out easily. It's hard for me to think about. I had everything running yesterday, but I wanted to accommodate the case of immediate need being sated by a prefetch already in progress. The whole thing blew up when I tried to put it all together. It's taking 45 minutes per compile now, which is a real drag.
There should be a warning on the documentation page against hungry mode, and objects on the OBEX that use hungry mode should be indicated as such on their download page.
P2 effectively has interrupts, via several mechanisms:
...
4) "interrupt task", given percentage (1/16 ... 15/16) of the cycles in a cog for an interrupt task behaving like 1/2/3 above, configurable latency
Can this be used for preemptive/more than 4 task multitasking? That would need a GETTASK pc, taskid instruction to get the current PC of the task being temporarily disabled for the next task, and I'm too lost by this thread to find the latest instruction listing to find out if such an instruction exists or if there's some other instruction that can do something similar.
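For anyone trying to picture the 1/16 ... 15/16 cycle share quoted above, here is a rough Python sketch. It is only a toy model, not the P2's actual task scheduler: the 16-entry slot table, the even spreading of slots, and the names `build_slot_table` and `worst_case_latency` are all assumptions made for illustration.

```python
def build_slot_table(interrupt_sixteenths, slots=16):
    """Return a slot table: 0 = main task, 1 = hypothetical interrupt task.

    Assumption: the interrupt task's slots are spread as evenly as possible
    over the table, using a simple accumulator.
    """
    if not 1 <= interrupt_sixteenths <= slots - 1:
        raise ValueError("share must be between 1 and slots - 1")
    table = [0] * slots
    acc = 0
    for i in range(slots):
        acc += interrupt_sixteenths
        if acc >= slots:          # accumulator overflow: give this slot away
            acc -= slots
            table[i] = 1
    return table


def worst_case_latency(table, task=1):
    """Largest distance, in slots, between two consecutive slots of `task`."""
    n = len(table)
    positions = [i for i, t in enumerate(table) if t == task]
    if not positions:
        raise ValueError("task is never scheduled")
    gaps = []
    for k, cur in enumerate(positions):
        nxt = positions[(k + 1) % len(positions)]
        gaps.append((nxt - cur) % n or n)   # a single slot wraps a full table
    return max(gaps)


if __name__ == "__main__":
    for share in (1, 4, 15):
        table = build_slot_table(share)
        print(share, table, worst_case_latency(table))
```

In this model a larger share buys a shorter worst-case wait before the interrupt task next runs, which is one way to read "configurable latency". It says nothing about whether more than four tasks can be juggled.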
Maybe you need to sleep on it - bet you haven't had a lot of sleep lately!
Can you add an SSD simply to improve the compile times?
I have advocated the use of "unused" slots for ages. But with the realisation that we could not do HD, I suggested a method where a cog could "donate" its hub cycle to another cog. While it is no longer required for HD, the "donate" concept is both simple and deliberate. A cog can set an option (not on by default) that allows another cog to use its slot, either when it does not require its own access, or as a priority over its own use.
The second part of this, is any cog can use the next available slot (which includes "donated" slots), again by setting an option (not on by default)
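To make the two options concrete, here is a small Python simulation of the "donate" idea as described above. It is a sketch under stated assumptions, not anything Chip has implemented: the round-robin slot order, the eight-cog count in the example, and the `Cog` fields `wants_hub`, `donates` and `takes_spare` are all hypothetical. Only the "donate when I don't need it" variant is modelled, not the "priority over own use" variant.

```python
from dataclasses import dataclass


@dataclass
class Cog:
    wants_hub: bool = False    # simplification: cog requests the hub on every slot
    donates: bool = False      # option 1: let others use my slot when I don't need it
    takes_spare: bool = False  # option 2: use the next available spare slot
    grants: int = 0            # hub accesses actually obtained


def simulate(cogs, slots):
    """Run `slots` hub slots in round-robin order and count grants per cog."""
    n = len(cogs)
    for s in range(slots):
        owner = cogs[s % n]
        if owner.wants_hub:
            owner.grants += 1          # owner always keeps its own slot
            continue
        if not owner.donates:
            continue                   # slot goes unused, as on the P1
        # Donated slot: offer it to the next cog in rotation that opted in.
        for k in range(1, n):
            cand = cogs[(s + k) % n]
            if cand.takes_spare and cand.wants_hub:
                cand.grants += 1
                break
    return [c.grants for c in cogs]


if __name__ == "__main__":
    cogs = [Cog(wants_hub=True, takes_spare=True)]   # one cog taking spares
    cogs += [Cog(donates=True) for _ in range(3)]    # three idle donors
    cogs += [Cog(wants_hub=True) for _ in range(4)]  # four ordinary cogs
    print(simulate(cogs, 16))
```

In this example the four ordinary cogs get exactly the same number of accesses whether or not anyone donates. Only the cog that opted in sees its throughput change, and that throughput now depends on what the donors are doing.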
This is simpler than Chip's solution (referenced a few posts ago).
None of these options need to be used. But it does permit the dedicated designer (the professional user who wants to run the P2 to its max) to utilise these features to gain a performance from the P2 that would otherwise be unavailable to him/her. There is absolutely nothing wrong with this !!!
Now can we get over this hub slot discussion and move on to something more interesting?
I'm not against slot sharing but couldn't you use this same argument to say that interrupts should be supported on the Propeller because they are useful in the hands of "power designers"? Why should power designers be constrained to follow a no-interrupts model?
Not really - interrupts are not an inherent capability of the existing silicon, so were never even on the table.
- besides that red herring, the threads make not having interrupts less important.
I only got about 3 hours of sleep last night, which is part of the problem. I do have an SSD and a fast machine. It just takes forever.
Compiles used to take hours back in the '70s, so some bright spark developed a patch program that could patch the object. Fast PCs solved the delays, taking minutes to compile more complex programs. But that speed resulted in yet more complex programs, and of course the GUI, and up went the compile times. Seems it always will be this way.
Slot sharing is not an inherent capability of the existing silicon either as far as I know. Anyway, I'm not promoting the argument that interrupts should be allowed. I'm just saying we may be being somewhat inconsistent in our reasoning.
I've been wondering about SSD speed. I have a 500 GB one in my MacBook Pro and it certainly boots faster, but I don't think it builds propgcc noticeably faster. Is it possible that SSD write speed is actually slower than hard disk write speed?
Write speed isn't slower on SSD, unless it's a low end controller on the SSD.
I have a 120GB SSD and it's over 200MB/s read/write. SSD performance gets lackluster after it's been used a bit, the fragmentation of the block map adds quite a bit of overhead.
It sounds like you can't run Verilog RTL simulations for verification? Or is the FPGA compile and running on the bench simply faster?
Slot sharing is not an inherent capability of the existing silicon either as far as I know.
It depends on where you look and on the semantics.
Certainly, the silicon already allocates slots, and the Multi-tasker even allows users to allocate time-slot resource within a COG.
So there is a lot of slot/time management going on already, and even some precedent for giving user control of resource spread.
The point is not to imagine some misguided 'protect the user', but to instead design to give the user control.
So then no. It's not an inherent feature, but you would like it to be a feature, right?
Was that in reply to my post - it does not seem to fit ?
User control of some areas is already an inherent feature, and like many others, I push to have user control expanded, not restricted, where possible/practical.
This is not to be feared. The Prop 2 is a programmable device.
Okie Dokie. We simply disagree, but I will say it's not fear driven. Marginalizing other opinion that way is poor form, which is why you got the Okie Dokie.
How much Propeller programming do you do JMG? Just curious.
I've never done that. I just burn and learn.
To set up a Verilog test bench at this point seems like a lot of extra work. I just run the FPGA and employ the Prop2 trace output, if necessary (assuming it's all working enough to even do that).
I've tried the simulator in Quartus before, it was so much monkey motion to configure and get running that I just decided I'd stick with watching what the FPGA does. If I had gone to school for this, I'm sure I would have had a more standard approach, but this way works fine, although sometimes I break something that is difficult to analyze. The good thing about running the FPGA is that it's close to full-speed, so you get a feel of the ergonomics of the chip, which a simulator would never give you.
The FPGA has also allowed thorough testing of the video block too.
Okay. Hub exec caching is working like it's supposed to now.
There are two cache modes that can be set via special operand-less instructions:
ICACHEP - icache prefetch, default, exploits unused hub cycles to prefetch the next cache line, allows single-task straight-line code to run at full speed (when no hub instructions)
ICACHEN - icache no prefetch, loads a cache line only when a cache miss occurs, slower, but more cache-line efficient, actually faster for 4-way hub multitasking, potentially lower power
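Here is a toy Python model of why prefetch helps single-task straight-line code. It is a sketch only: the 8-instruction cache line, the 8-clock line load, the one-instruction-per-clock execution, and the function name `run_straight_line` are assumptions, not figures from the actual icache logic. It does not model hub instructions or the 4-way multitasking case where ICACHEN is said to win.

```python
def run_straight_line(num_lines, prefetch, insns_per_line=8, load_clocks=8):
    """Return total clocks to execute `num_lines` consecutive cache lines.

    prefetch=True  loosely mimics ICACHEP: the next line is requested while
                   the current one executes.
    prefetch=False loosely mimics ICACHEN: a line is requested only on a miss.
    """
    clock = 0
    ready_at = {}                       # line index -> clock when it is cached
    for line in range(num_lines):
        if line not in ready_at:
            ready_at[line] = clock + load_clocks    # miss: start the load now
        clock = max(clock, ready_at[line])          # stall until the line arrives
        if prefetch and line + 1 < num_lines:
            ready_at[line + 1] = clock + load_clocks  # fetch ahead in background
        clock += insns_per_line                     # execute the whole line
    return clock


if __name__ == "__main__":
    print("no prefetch:", run_straight_line(4, prefetch=False))
    print("prefetch:   ", run_straight_line(4, prefetch=True))
```

With these numbers the prefetching run stalls only on the very first line, while the non-prefetching run stalls on every line. That matches the "full speed when no hub instructions" description in spirit, but the real timing depends on details not given here.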
I'm tidying up the code now and then I'll do a final mental-synthesis reality check. This was all very tricky to think about because there are some subtleties to the icache logic that cause the Verilog to be smaller than you'd expect, with some operation cases working because of layered contingencies which are not obvious. It's been fun, but quite fatiguing to get this worked out. It's taken three days.
Hub execution really opens up new ways of thinking about cogs and code. It basically buys your application (if you're executing from hub) a whole new RAM resource, being the old cog RAM, now free for data or fast code. This is a mental paradigm shift for me. It's going to take a while to figure out the new balance of things. Like the ROM Monitor program - all that code can be executed from hub ROM, and I'll be able to use the cog RAM for much bigger data-entry buffers. This is kind of a mind-bender, from where I'm coming from.
I am perfectly happy with a warning in case "HUNGRY" is used in ANY module/object,
A warning is a very good idea. Note, however, that in general the warning cannot be 100% reliable. Even if a module contains instructions that set hungry mode, that does not mean they actually get executed. We don't have a way to do the static code analysis required to know if hungry mode will actually be enabled at run time.
In most use cases I see, there will only likely be one cog in HUNGRY mode -
I would not bank on this.
I seriously don't see what code would reasonably require additional bandwidth and break if it was not available
Perhaps you are right. As long as "hungry code" is that sloppy (timing-wise) stuff that wants to run as fast as possible overall but does not care about little hiccups on the way and is ultimately not worried if it ends up running at half speed.
This is rather like processes on Unix. Unix tries to manage resources (disk, I/O, memory, etc.) so as to get the best aggregate performance for all processes, but makes no guarantees about timing. This makes the overall throughput higher.
Of course everyone complains "You can't do real-time work on Linux":)
As the hungry cog could not take cycles forcibly from other cogs, just use otherwise unused cycles, your argument is flawed.
That is true.
Bottom line is that the speed of a hungry COG is now dependent on the activity going on in other COGs. This is a BIG new feature of the Propeller. Your claim is that it does not matter and the speed gains are worth that coupling. We worry that such coupling will break things in unexpected ways sometimes. As I said it's for Chip to decide the pros and cons here.
Jazzed,
This sucker needs to lock down fast...
Yes.
And as "greedy mode" seems to be almost cast into silicon, well Chip nearly has it working. I think us naysayers have to accept that it's in.
I'm starting to warm to the idea of "greedy mode" so that if we ever get that JavaScript engine running on the Prop it will run as fast as possible. Who expects timing determinism in the JavaScript world?
Chip,
Burn and learn.
I love it. That phrase is going to stick with me. It's how most of my development proceeds.
GREEDY mode is NOT in there yet. I'm still reeling from this icache business. The cache prefetch only uses hub cycles that the executing program doesn't use.
'Burn and learn' came from development using EPROMs and EPROM-based microcontrollers. The BASIC Stamps were developed by erasing the PIC chip's EPROM, reprogramming it, and then trying it out.
Didn't your old signature read, "For me, the past is not yet over."? That's the most memorable signature I've ever seen, and I always tell people about it. You need to be a little older, I think, or maybe obsessive to relate to that.
Brilliant Chip!
Looking forward to writing some new stuff in HUB EXEC mode.
Very Cool
Sorry, I was jumping the gun there a bit re: greedy mode.
I too have done a lot of "burning and learning" with PROMs, EPROMs, PALs, etc., back in the early 1980s and sometimes even in the late '90s. Nowadays the term applies just as much to programming in Python or JavaScript, where you have to run the code and test it to see what errors you have. There is not even a compiler to help you at compile time. What is old is new again.
"For me, the past is not over yet.". Yep, that was me. Perhaps I should reinstate it.
That phrase came from a friend of mine who was always playing records from his '60s collection at any gathering at his home after a few glasses of wine. His wife would complain about it. "Why do we always have to listen to thirty-year-old hits?" and request something more modern. "For me,.......the past.....is.....not over yet," he slowly slurred out in reply one time.
He had a way with words like that.
Did he actually say, "Minulle, ...ohi... ei ole... loppu, viel
"Burn and learn" here is a rather hard kind of trial and error.
When we build software or chip designs or aircraft or bridges etc. nowadays, there is a lot of trial and error going on. But it's done with mathematical models, simulations, or building real but scale models and so on.
With "burn and learn" you have no way to analyse or simulate; you have to go straight from the design to the hardware. From code to PROM, from VHDL to chip.
Of course one could say the FPGA Chip is using is the "simulator" as he is not burning a real Prop II chip every time. But it's plenty close enough for me.
Now, go get some decent sleep
LOL
Nice news.
Good You managed it
"I enjoy the freedom to be alone" is a line I've rolled out a few times in recent years.