I am curious about the difference between the MOV and the WRAUX/RDAUX although I know you could not use the same MOV instruction.
Does this mean $1F1 always gets clobbered, or only when we reference $1F1 in a MOV or RD/WRAUX instruction?
Pity the COG RAM wasn't built 256 bits wide (i.e. in blocks of 8 longs). Isn't hindsight wonderful?
I think $1F1 will just become the conduit for the duration of the RDWIDEx/WRWIDEx instruction. Yes, if we had built the cog RAM to be 256 bits wide, we'd be able to plop a whole wide into it at once, or read one out. This isn't so bad the way it is, though, because instructions, themselves, can only process one long at a time.
P.S. I added onto the original post with some other use examples.
One thing that's bugged me about the WIDEs is that they cannot do sustained reads and writes to and from the hub. If you do nine RDLONGC's in a row, you'll exhaust the data cache at some point and need another RDWIDE, but only after the hub window has just passed, forcing you to wait for the next one. This cuts the effective data rate in half. Some speculative prefetching could be done, but it's kind of messy and doesn't address the write issue. XFR gets around all this by reading or writing the WIDEs on every clock, but then you need to issue the RDWIDE/WRWIDEs to transact with the hub, which is fine for SDRAM. Getting data between cog and hub RAM is more complicated because of pipeline requirements - each read or write to cog RAM requires an instruction.
Since last night, I've been trying to come up with some way to achieve sustained transfers between the hub and cog RAM and the hub and AUX.
This will cut the cog loading time in half, too. Now we can breathe data between all memories at the rate of one long per clock.
You can do other things, too, like for WRWIDEA, instead of 'MOV $1F1,INDA++', you could do 'MOV $1F1,INA' to capture pins. You just need an instruction to write $1F1. RDWIDEA just needs an instruction to read $1F1, like 'MOV OUTA,$1F1' or 'SETDACS $1F1' to write four 8-bit DAC values per clock!
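To put rough numbers on the halved data rate described above, here is a minimal Python model. It is only a sketch, not a cycle-accurate simulation. It assumes a hub window every 8 clocks, 8 longs per WIDE, one clock per long read from the cache, and one clock to issue the RDWIDE; the constants and both function names are made up for illustration.

```python
HUB_WINDOW = 8       # assumed clocks between hub windows for one cog
LONGS_PER_WIDE = 8   # a 256-bit WIDE holds eight 32-bit longs


def cached_rate(wides):
    """Longs per clock when each WIDE is drained one long at a time
    before the next RDWIDE can be issued."""
    clock = 0
    for _ in range(wides):
        clock += (-clock) % HUB_WINDOW  # stall until the next hub window
        clock += 1                      # the RDWIDE itself (assumed 1 clock)
        clock += LONGS_PER_WIDE         # eight single-long reads from the cache
    return wides * LONGS_PER_WIDE / clock


def streamed_rate(wides):
    """Longs per clock when one long moves on every clock after a
    single (assumed) issue clock."""
    longs = wides * LONGS_PER_WIDE
    return longs / (1 + longs)


if __name__ == "__main__":
    print("cached  :", cached_rate(1000))
    print("streamed:", streamed_rate(1000))
```

Under these assumptions the cached case tends toward half a long per clock, because every second hub window is missed, while the streamed case approaches one long per clock.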
Aha. I was thinking about aux-cog wide transfers, not aux-hub transfers.
I like the MOV without using INDA++ (INA) to capture pins. I am sure we can find a mix of instructions to perform interesting tricks.
Would this work?
' Setup <hubaddr> with long values #0, 10, 20 ....
MOV <cogaddr>,#0
SETPTRA <hubaddr>
REPS #<wides*8>,#1
WRWIDEA #<wides>
ADD $1F1,#10
There is a phase disparity between read and write windows, so that wouldn't work. WRWIDEx needs an instruction to just write. I'm thinking that it doesn't even need to write $1F1, just the ALU result would suffice. RDWIDEx, though, does need a location ($1F1) to be the data emitter, which is available for read.
For some reason I thought REPS/REPD did not work in hubexec...
REPS/REPD work in hubexec mode. For a case like the one Cluso99 showed, you would certainly want them to be cached to avoid preempting the RD/WRWIDEx operation.
Chip,
Will the rd/wrWIDEa/b work with multitasking? Presume it will just slow the repeated (eg MOV) instruction? Obviously cannot do more than one at a time though.
Bill,
I wasn't aware, or had forgotten. Hopefully REPS/D works in hubexec, but if not I can live with it.
Nice,
Yes there might be a caveat that the group of instructions be within the cached block.
You need to be single-tasking to use these instructions, because once you execute the RD/WRWIDEA/B, you need to supply or read data on every clock cycle for the duration of the operation. If you tried to do that in multitasking, you'd wind up with data salad.
We are almost there for starting a cog in hubexec mode.
Might it be possible to generate a cog reset to clear down whatever settings are made at cogstop/coginit?
I was thinking that now we can load a cog quickly, maybe it would be nice to reset the other cog features.
What's to stop us from launching a cog, pointing to the start of the hubexec code? The first instruction could be JMP @0, which should be the next instruction in the hub... or JMP #hubaddr16... so one instruction in, it can switch to hubexec.
The only loss will be the ~$1F1 ~= 498 cycles ~= 2.5us needed to load the cog image.... THAT'S A FAST COG LOAD!
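The load-time figure above can be checked with a few lines of Python. This is just arithmetic under stated assumptions: a 200 MHz system clock (not stated in the thread, but implied by 498 cycles coming out near 2.5us) and one long loaded per clock. The names `COG_LONGS`, `CLOCK_HZ` and `load_time_us` are made up for illustration.

```python
COG_LONGS = 0x1F1 + 1     # longs $000..$1F1 inclusive
CLOCK_HZ = 200_000_000    # assumed system clock, not given in the thread


def load_time_us(longs=COG_LONGS, clock_hz=CLOCK_HZ, clocks_per_long=1):
    """Microseconds needed to load `longs` longs into cog RAM."""
    return longs * clocks_per_long / clock_hz * 1e6


if __name__ == "__main__":
    print(COG_LONGS, "longs")
    print(load_time_us(), "us at one long per clock")
    print(load_time_us(clocks_per_long=2), "us at one long per two clocks")
```

The second call models a loader that needs two clocks per long, which is the case the "cut the cog loading time in half" remark earlier in the thread compares against.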
re "We are almost there for starting a cog in hubexec mode.
Might it be possible to generate a cog reset to clear down whatever settings are made at cogstop/coginit?
I was thinking that now we can load a cog quickly, maybe it would be nice to reset the other cog features. "
COGINIT clears DIRA, CTRA, etc., then loads cog $000-$1F1 (well, $1F3?) and then does a JMP $0.
COGSTOP performs the clears and stops execution.
I was wondering if we could have an instruction that performed the cleardown (only if it is simple) ? How many clocks would this take (approx.) ?
Now the only other thing is the ability to start the cog from a hub address (in hubexec mode).
This gives us a few possibilities...
Now we can already reload the cog ram, fully or partially, using the RDWIDEx instruction etc.
We need to know what was previously running if we do not reset the cog, so we can disable counters, video, and reset the registers such as PTRa/b, DIRa/b/c/d, OUTa/b/c/d, and some of the other things that get reset at coginit/cogstop time.
If COGINIT could start in hub, then we would have a much faster cog start because we would not need to load the cog. The cog reset would still be required. The hub code could decide if any cog ram needs to be loaded, or cleared, and do so as required. Now we have a fast boot mode.
I don't know how you currently start the prop, but the first coginit, or however you do it, may be able to start in hub RAM (ROM) too.
@Chip, at boot up, does it load all 256KB ( minus ROM section obviously ) into HUBRAM? or does it check the size of the program and only load that?
As for speeding up boot time, you could just have it load the program size, which could just fire up the ports to what's necessary, then have a second program after the boot loader, which then loads what it needs into HUB and starts new cogs if needed? Be it in COG mode or Cluso99's suggested HUBEXEC mode startup.
That's all a function of the 2nd stage loader. The user can configure it to do whatever is needed. The only certain thing is that a 1st stage loader is loaded and executed from ROM, then it attempts to load a 2nd stage loader from either serial or flash. That 2nd stage loader can do anything, once its signature is verified.
I don't really understand the multitasking - but reading this I was wondering if,
when you run multiple tasks, you can hold all but 1 task (e.g. the master running from HUB) for some time,
like here, to do a RD/WRWIDEA/B and then continue the other tasks.
Do tasks keep their state, when they are stopped/halted/hibernated, and can they be resumed where they left off?
So in this case the master thread would pause the other threads, wait some clocks until the pipeline is clear,
do its xWIDEx stuff (which only takes a few cycles and might be OK in many cases) and then resume the other tasks.
Memory instructions like RDBYTE/etc pause the pipeline for up to 11 clocks. Other instructions which wait for some event, like GETSQRT, will loop in place during multitasking, in order to keep the pipeline moving. Memory reads don't loop, because there's a strong likelihood that some other task's instruction in the pipeline is waiting to access the hub, also.
It's not too hard for a certain task to commandeer all the thread slots for a specified part of its execution to perform exclusive operations. Returning the cog to its former config, if it wasn't already known, might be more problematic though. Is the slicing order register readable?
Good point. How do you restore the SETTASK config? So far it seems you would probably have to track it with a soft copy.
Also, I know you can easily go force a particular task's PC value with JMPTASK, but it doesn't appear you can go read another task's PC back from what I read in the Prop2 docs file. That PC readback could be useful if you wanted to develop a debugger thread that could run in the background, stop another task dynamically with SETTASK, and then examine its state. Not sure if there is another way to read another task's PC ... Chip?
When a task starts, it can set a register indicating what is running on that task.
A task can get its own PC simply by performing a LINK and MOV <xxx>,$0
However, interrupting another task is more problematic if you then want it to resume.
I guess such a debugger can always inject a "breakpoint" into a task's code by putting in a call to some hubexec code, which will have access to the current PC on one of the stack(s). Some further debug code will then be run in the interrupted task's context to examine the state, and it has the luxury of being able to use hubexec, which can be almost zero overhead to the COG memory (just one entry in the internal 4 long stack, for example). Also, that way a clean return can be performed. We can also put the original instruction back at the breakpoint that was replaced by a call and "return" to the PC-1 address if required when resuming. If we can find a good way to single step in addition to this breakpoint capability, that will be nice too. I think Ariba had good ideas earlier on how to do that.
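The bookkeeping behind that breakpoint idea can be sketched in Python. This is a toy model only: code memory is a plain list of instruction strings, `CALL debug_stub` is a hypothetical placeholder rather than a real encoding, and the class and method names are made up for illustration. It shows just the save, patch, restore and "return to PC-1" steps.

```python
BREAK_INSTR = "CALL debug_stub"   # hypothetical placeholder instruction


class ToyDebugger:
    def __init__(self, code):
        self.code = code     # list of instructions, index = address
        self.saved = {}      # address -> original instruction

    def set_breakpoint(self, addr):
        """Replace the instruction at addr with a call to the debug stub."""
        if addr in self.saved:        # already patched, keep the original
            return
        self.saved[addr] = self.code[addr]
        self.code[addr] = BREAK_INSTR

    def clear_breakpoint(self, addr):
        """Put the original instruction back."""
        self.code[addr] = self.saved.pop(addr)

    @staticmethod
    def resume_address(return_pc):
        """The call pushed the address after the breakpoint, so the
        replaced instruction lives one long earlier."""
        return return_pc - 1


if __name__ == "__main__":
    code = ["MOV a,b", "ADD a,#1", "JMP #0"]
    dbg = ToyDebugger(code)
    dbg.set_breakpoint(1)
    print(code)
    pc = dbg.resume_address(2)    # stub was entered with return PC = 2
    dbg.clear_breakpoint(pc)
    print(code, "resume at", pc)
```

Single stepping would need something more on top of this, since the restored instruction has to run once before the breakpoint can be re-armed.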
I have single step code working in P1 for both pasm and spin bytecode.
I was working on getting my P2 debugger single stepping, but with all the changes I stopped because there were going to be so many simplifications to what I had working. But I think multitasking will be something again.
Comments
How does rd/wrWIDEa/b know whether it will be a hub or aux transfer? It can't tell until the following instruction?
Postedit: Or is it meant to be rd/wrWIDEa/b/x/y ?
BTW I am quite happy to lose both $1F0 & $1F1 (for hub and aux) if that helps. This is another great boost.
It's the instruction that REPS repeats that actually reads or writes the WIDEs, in sequence. It can be anything that reads or writes in one clock.
Brilliant work Chip!
Nice
That's right. You just want to be sure that RDWIDEA and the MOV are in the same cache line to avoid instruction-fetch interference.
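A quick way to check the "same cache line" condition is sketched below in Python. It assumes a cache line is one WIDE (8 longs, so 32 bytes), that lines are aligned on 32-byte boundaries, and that hub addresses are byte addresses; none of that is spelled out in this thread. `LINE_BYTES`, `same_cache_line` and `pad_to_fit` are made-up names.

```python
LINE_BYTES = 32   # assumed: one cache line = one WIDE = 8 longs of 4 bytes


def same_cache_line(addr_a, addr_b, line_bytes=LINE_BYTES):
    """True if two hub byte addresses fall in the same aligned line."""
    return addr_a // line_bytes == addr_b // line_bytes


def pad_to_fit(addr, n_instr, line_bytes=LINE_BYTES):
    """Bytes of padding needed before addr so that n_instr consecutive
    longs all land in a single cache line."""
    size = 4 * n_instr
    if size > line_bytes:
        raise ValueError("sequence is longer than one cache line")
    offset = addr % line_bytes
    if offset + size <= line_bytes:
        return 0
    return line_bytes - offset


if __name__ == "__main__":
    print(same_cache_line(0x1000, 0x101C))   # first and last long of a line
    print(same_cache_line(0x101C, 0x1020))   # straddles a line boundary
    print(pad_to_fit(0x1018, 3))             # three instructions near the end
```

An assembler or build script could use something like `pad_to_fit` to decide how many NOPs or alignment bytes to emit ahead of the RD/WRWIDEx sequence.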
So in single task mode, hubexec REPx loop could be up to 32 instructions (assuming you were careful about alignment) - VERY VERY COOL!
But the feature is great, caveats and all.
Could you elaborate a little bit?
Awesome