I've added a small boot ROM inside the cog that gets synthesized. Rather than the hardware forcing certain instructions at certain addresses to execute the load-and-run procedure (due to COGNEW/COGINIT), we'll now run a short program to do the loading. Since it's a program, it can do a lot more than just load from $000.
What would you guys think if every PASM program began with a prefix long that described how to load the ensuing program longs? For example, the lower word of the first long could tell how many longs to load, while the upper word could tell what cog address to start loading at. The MSB within that long could instruct the loader to clear the cog RAM before loading the image in. This way, your program not only starts off where you want it (maybe above remapped register space), but all your variables start off with $00000000's. Or, you could skip register clearing to get very fast short loads. Any other ideas for what that program could do? Would this additional prefix long be too much of a departure from where we are?
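To make the layout above concrete, here is a rough host-side sketch in Python of how a tool might pack such a prefix long. The names `make_prefix` and `read_prefix` are made up for illustration, and the exact field widths (15 bits of start address under the MSB flag) are an assumption, not anything settled in this thread.

```python
CLEAR_FLAG = 1 << 31  # MSB: ask the loader to zero cog RAM before loading


def make_prefix(count, start=0, clear=False):
    """Pack a prefix long: count in the lower word, start address in the
    upper word, clear flag in the MSB (field widths are assumed)."""
    if not 0 <= count <= 0xFFFF:
        raise ValueError("count must fit in the lower word")
    if not 0 <= start <= 0x7FFF:
        raise ValueError("start must fit below the MSB flag")
    return (CLEAR_FLAG if clear else 0) | (start << 16) | count


def read_prefix(prefix):
    """Unpack the same fields, as a loader would before copying longs."""
    return {
        "count": prefix & 0xFFFF,
        "start": (prefix >> 16) & 0x7FFF,
        "clear": bool(prefix & CLEAR_FLAG),
    }
```

For example, `make_prefix(0x20, start=0x100, clear=True)` describes a 32-long image loaded at cog address $100 after a clear.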
I think this makes a lot of sense. We've got a lot of ways a COG can be organized and what I see happening right now, sans the ROM, is we will be burning longs just getting data and code all sorted out, much like we sometimes do with P1 with locating data tables at 0, for one simple example.
Being able to define whether or not the COG is cleared, as well as where a program gets loaded seems a no brainer. Smaller COG images, prep for HUBEX code perhaps, are something all of us have thought about.
Having the COG just do the work to start in HUBEX @address would make sense, optionally clearing or not does too.
What do we think the most common multi-task mapping is? Perhaps a smaller COG image, coupled with this program doing the prep to just drop it into a multi-tasking COG right at the get-go makes sense too.
Those are the ideas I have at the moment. Just putting them out there for discussion.
The cog could just start executing the code in hub exec mode. The hub code could be used to load cog memory, and then jump to it, or the cog could just run in hub mode forever and use the cog memory for registers.
This sounds very useful. I like Potatohead's additions - being able to go directly into HUB execution or a default (0-1-2-3) multitask. This loader and control long would be used with every cog load whether it is from initial reset or from a COGNEW/COGINIT?
At this point I don't know if anything is too big a departure from what we have unless you create an interrupt mechanism!
The COG can be forced to default to HUBEXEC, but it can be started before you tell it what place to start from.
At boot up it would start executing the code at location $0000, or maybe $0E00 depending on how it's configured. From a COGNEW/COGINIT it would start executing the code at the address given by D.
Yes, this has been suggested before. It would make it very fast to get a COG started.
What do we think the most common multi-task mapping is? Perhaps a smaller COG image, coupled with this program doing the prep to just drop it into a multi-tasking COG right at the get-go makes sense too.
I like the idea, but you would also have to implicitly perform JMPTASKs and SETMAP.
Chip,
I really like the idea of being able to control how a cog starts up. Either in hubexec mode at a given address, or like you said with a load to a given address in the cog of a given size.
Question: if I load a specific cog once and have it clear itself, then load in a small piece of code that runs and then stops the cog, and I then start that same cog but tell it to load 0 longs, not clear itself, and start at the same address as before, will its memory still retain the code I had loaded before? I guess the short version is, does the cog memory stay valid when stopped? This could be very handy if it works, because I could load up a cog with several routines that all end by stopping the cog. Then I could call them quickly, since they would not need to load anything, just start the cog going at the address of the routine. I could also reuse the memory they occupied in hub RAM as buffers or stack once they were loaded into a cog.
Optional clear should address that Roy. Was thinking along similar lines.
@Searith: Yes, which is why I phrased it as "most common", so those two would get done by the program for a case or maybe two, depending on how Chip wants to use the bits. Anything else would fall under just starting the COG with some basic address for the program start, etc... and the program itself would take care of things.
With the small ROM, there is basically the potential for a bit of baked in COG init code. Nice!
What if this were implemented as a pseudo-instruction:
cb1: LOADCFG #0, @cb1end WZ // load the following block at address zero, clear registers (WZ)
// actual instructions here
cb1end: //last instruction here
LOADCFG would generate the prefix long you speak of. If LOADCFG isn't found at the start of the PASM, the default behavior (address 0, $1F0 instructions are loaded) is implicitly encoded.
This approach would allow several loadable blocks of code to be kept in a single file. From there, it would be possible for a COG to call COGINIT (or whatever) on itself to swap out a section of itself then execute it with a single instruction.
With the small ROM, there is basically the potential for a bit of baked in COG init code.
ROM code would save a few cycles and hub RAM locations over starting in the hub exec mode. However, it would be "baked in" and less flexible than the hub exec mode.
If a flexible start is needed, simply start the COG with a start address, all other options off. In that way, it's no different from the COG start we have now. One of the options really should be, start in HUBEX @ address too. Simple, fast, consistent, if desired.
...The MSB within that long could instruct the loader to clear the cog RAM before loading the image in. This way, your program not only starts off where you want it (maybe above remapped register space), but all your variables start off with $00000000's. Or, you could skip register clearing to get very fast short loads.
Does that flag clear ALL memory, (which would clobber other pre-loaded CODE blocks) ?
That will take some cycles to do, right ? - as a looping clear ?
So how much time does it save over sending a block with clears inline ? (it does save some total-code size)
(or do you use inline clears for partial-clear cases, and ROM-clears for fresh whole COG reloads ?)
Can such a ROM switch in and out, without impacting the peak MHz speed ?
With hubexec being added, perhaps we can have a reserved hub address for cog start (like we have for the clock frequency), and every cog can start in hubexec mode from that address. That address would hold a jump to the next instruction. COGNEW, given an address, can write the jump to this location and then start the next cog. This way, the first jump executed can go either somewhere in the hub, to carry on executing there, or into the cog's register space, thus switching to cogexec.
If the cog needs to be loaded, the first jump will go somewhere in hub space where the cog loader routine lives. This routine can end with a jump to cog register space.
One good thing in this scenario is that the cogstop op effectively stops (resets) the cog's PC and all in/out/dir... but preserves RAM contents, thus allowing the cog to be restarted with a direct jump into cog space without reloading the code. This can be good in energy-saving applications, where events can be answered immediately and then the energy-saving state is restored.
If you go along with your idea, which is not bad, I would prefer the first long to be like this:
LL byte: longs*4 to be loaded
LH byte: cog address*4 to start fill
HL byte: cog address*4 to start execution (set of PC)
HH byte: mode flags (register cleanup, in/dir/out reset, counter reset, ... and whatever else comes in handy)
If you align the longs by 4 (who needs to load one or two) this allows for up to 1024 (256*4) thus eventually allowing for cog register space expansion in P3 (where with ram redesign the 1024 longs can be used also for instruction/data caches when running in hubexec mode thus saving the dedicated caches)
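A quick Python sketch of how a loader might read that four-byte layout, just to pin down the scaling. The function name `decode_byte_prefix` is hypothetical, and the flag byte is passed through raw because no bit assignments were given above.

```python
def decode_byte_prefix(prefix):
    """Split a prefix long into the four proposed bytes.

    The three low bytes are stored in units of 4 longs, so each is
    multiplied by 4 to recover the real count or cog address.
    """
    ll = prefix & 0xFF          # longs to load, in units of 4
    lh = (prefix >> 8) & 0xFF   # cog address to start filling, in units of 4
    hl = (prefix >> 16) & 0xFF  # cog address to start executing, in units of 4
    hh = (prefix >> 24) & 0xFF  # mode flags (bit meanings not defined here)
    return {
        "longs": ll * 4,
        "fill_addr": lh * 4,
        "exec_addr": hl * 4,
        "flags": hh,
    }
```

One thing this makes visible: a plain byte tops out at 255, so the largest count is 1020 longs unless a stored 0 is taken to mean 256 units.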
@Chip, when you have a cog running in HUB mode, how does it affect the cog when you read from HUB with a RDWIDE etc.? Does it delay and have to reload the rest of the instructions that were cached, or do you have a wide for the instructions cached in HUB mode? P.S. Sorry if this has been explained before; I must have missed it.
Just out of curiosity, how would a cog transfer hub data at full speed without encountering hub stalls? A "rdlongc inda++, ptra++" in a REPx loop would hit a hub stall every 8 loops. Is it possible to fill cog memory from the hub without hub stalls?
I think you would use RDWIDEx instead. This would start a sustained hub read that you would then transfer with a REPS/MOV combination. Based on recent posts from Chip, I think this approach would be full speed (no stalls). See [POST=1243817]this post[/POST] for an example of the code.
These ideas about possibly NOT loading any data, but executing what is already there, are intriguing. And, yes, data would survive in the cog registers perfectly, if they are not cleared or overwritten.
There are some strong reasons to start in hub mode, as code from hub ROM can orchestrate the loading. The headache is that it introduces a bottom-line of complexity that interrupts casual fun.
Also, this in-cog ROM could be a small (and banked?) program that hides behind the I/O registers. That means it could be always-callable within the cog. Instructions in cog memory are always fetched from the cog RAM, as there are no mux'ing circuits like D and S have to read the I/O registers, so it would entail only one mux to redirect instruction fetches from $1F4..$1FF to a bit of synthesized ROM made from logic gates (not an actual 'memory', per se).
There's a lot of possibilities here - too many, perhaps.
Personally, I like the optional clear COG RAM or not, start in HUBEX, or not, and load small + start at address. It does complicate understanding what a COG is doing, and it does complicate COG images too. But, those complications come with benefits. Code size, and the ability to make more rapid transit into and out of a COG being used.
Each of those seems well defined. Going farther seems redundant, given the small amounts of code needed, and the many potential options. I don't see a clear return over just writing a few instructions.
That said, it would be really awesome to just have load at address and number to load, and a clearing of registers be optional. If I were to pick the biggest bang for the complexity investment, those two options would be it for me.
I just started drawing out what kinds of things ought to be done and I came to the same conclusion!
How about the prefix long being like this:
%c0_xxx_jjjjjjjjj_sssssssss_nnnnnnnnn = Load n longs starting at s, then jump to j. If c is 1 then pre-clear cog RAM before loading.
For simple programs that start at $000, the prefix long would just be the number of longs to load, with the optional MSB set for pre-clearing registers. A prefix long of $00000000 would just mean, "Don't load anything, jump to $000."
%c1_xxxxxxxxxxxxxx_jjjjjjjjjjjjjjjj = Jump to j (probably a hub address). If c is 1 then pre-clear cog RAM before jumping.
This scheme avoids the issues of multi-tasking setup, register remapping, etc., that are best handled by application code. This just gets things started.
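Here is a small Python sketch of the two prefix formats above, mainly to check that the fields add up to 32 bits. It assumes bit 31 is c, bit 30 selects load versus jump, and the x bits are left at zero; the function names are invented for illustration.

```python
def load_prefix(n, s=0, j=0, clear=False):
    """%c0_xxx_jjjjjjjjj_sssssssss_nnnnnnnnn: load n longs at s, jump to j."""
    for name, value in (("n", n), ("s", s), ("j", j)):
        if not 0 <= value <= 0x1FF:
            raise ValueError(f"{name} must fit in 9 bits")
    return (int(clear) << 31) | (j << 18) | (s << 9) | n


def jump_prefix(j, clear=False):
    """%c1_xxxxxxxxxxxxxx_jjjjjjjjjjjjjjjj: jump to j without loading."""
    if not 0 <= j <= 0xFFFF:
        raise ValueError("j must fit in 16 bits")
    return (int(clear) << 31) | (1 << 30) | j


def decode(prefix):
    """Return ('load', clear, j, s, n) or ('jump', clear, j)."""
    clear = bool((prefix >> 31) & 1)
    if (prefix >> 30) & 1:
        return ("jump", clear, prefix & 0xFFFF)
    return ("load", clear,
            (prefix >> 18) & 0x1FF, (prefix >> 9) & 0x1FF, prefix & 0x1FF)
```

With this packing, a simple program at $000 really is just its long count, `load_prefix(16) == 16`, and `decode(0)` comes back as a load of zero longs followed by a jump to $000.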
You mentioned it introduces a bottom line of complexity that interrupts casual fun. I agree, you never want to lose that aspect of the Propeller. Can the loader long be designed so that if it is all zeroes (worst case, just a long count), then it is a classic cog? Anything non-zero would start triggering the complexity. Make the features available if you want them and know to ask for them; otherwise, it's just a cog.
I thought about the same thing. As long as there has to be a prefix long, at all, having it indicate a size isn't too bad, I think. We could make $00000000 a special case of 'load the whole cog and jump to $000' and force the use of %c1_xxxxxxxxxxxxxx_jjjjjjjjjjjjjjjj to jump to $000 without loading. I kind of think it's better to avoid this caveat, though, and just have the first long specify the long count - you get the benefit of minimal load time that way, and if you don't know about the size, just use $1F2.
I realized that to make this flexible, where loading can start at any cog address for any number of longs, we won't be able to use the full-speed RDWIDEx instructions, since they only work on wide (8-long) boundaries. To force wide boundaries on loads would be ugly. So, we'll have to settle for RDLONGC's.
... Make feature available if you want them and know to ask for them otherwise, it's just a cog.
Makes me wonder: will there be any undocumented and un-mentioned Easter eggs in the instruction set? Maybe an op code to make the PC decrement instead of increment?
That's a crazy idea! We'd have to change the return address computations for CALLs, too. Can you see any advantage to this?
I realized that to make this flexible, where loading can start at any cog address for any number of longs, we won't be able to use the full-speed RDWIDEx instructions, since they only work on wide (8-long) boundaries. To force wide boundaries on loads would be ugly. So, we'll have to settle for RDLONGC's.
Doesn't that depend on the COG ROM ? - it could sense, (or be told?), of a boundary load, to allow use of the full-speed RDWIDEx instructions, but if not boundary-snapped, it could use RDLONGC's ?
You're right. It could do that, but it would probably double the ROM size. I don't know if it's worth it.
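The "sense the boundary" decision could be as simple as the check below, sketched in Python rather than as ROM code. It assumes a load qualifies for the wide path only when both the cog start address and the long count are multiples of 8, and it ignores hub-side alignment entirely; `pick_transfer` is a made-up name.

```python
WIDE = 8  # longs per wide, per the 8-long boundary mentioned above


def pick_transfer(start, count):
    """Return 'RDWIDE' for a wide-aligned load, otherwise 'RDLONGC'."""
    aligned = start % WIDE == 0 and count % WIDE == 0
    return "RDWIDE" if aligned and count > 0 else "RDLONGC"
```

So `pick_transfer(0, 64)` would take the fast path, while `pick_transfer(3, 64)` or `pick_transfer(8, 20)` would fall back to long-by-long reads. The cost being weighed above is that the ROM then has to carry both copy loops.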
Where is this COG ROM located in the 512 long address space and how much space does it take?
Not really. It just seems like a relatively minor feature that could be "hidden" for discovery later. Just thinking about writing code that can reverse (like UNO!) is presenting all sorts of fun little pieces of code.
It is a very nice idea.