#179 results suggest that it works OK - it looks like it just waits for the appropriate hub cycles for each multi-cycle instruction, and executes the other single cycle ones as well.
If it did not execute all four instructions in the quad, the results would be wrong. Unless I am missing something again.
I have not tried to nail down where the best place is for hub access instructions, but the organization above at least seems to work.
Multi-cycle instructions will have to be scheduled very carefully, but they should work (famous last words!) - I hope.
I'll have to try some other multi-cycle instructions.
The only way overlapped QUADs can be executed is if the instructions within them are all single-cycle. Otherwise, the overlapping gets mixed up, affecting the 4th and maybe even 3rd instruction.
If this answer was meant for me: you have misunderstood me.
Can I do this?
SETQUAD XXXX0 'set first 4 longs, which fill the COG regs that have the same address as the QUAD
RDQUAD XXXX
SETQUAD XXXX4 'set next 4 longs, which fill the COG regs that have the same address as the QUAD
RDQUAD XXXX 'without moving data from the QUAD
I think Chip wants something *REALLY* simple in hardware; I don't believe he has the time / logic budget for anything complicated. If he did, I think all of us would inundate him with additional suggestions.
My suggestion was simple, really just a Mux-unpack.
I'm not clear whether this compression is intended as an Opcode-space/runtime option (which sounds like a high-impact change, as it affects the opcode decode pathways), or as a Memory move (COG load) (de)compression, where I imagine there is more time headroom and the opcode decode pathways are not affected. A single boolean flag, or a special opcode, is all that is needed to enable the Mux-unpack, plus storage for the 16 bits of more-stable merge information.
Oh, I understand now. This won't work because the floating QUAD window does not copy its contents to the underlying registers.
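To make the distinction concrete, here is a minimal Python model of a mapped window versus a copy. It is only an illustration under stated assumptions: the class name `CogModel`, its methods, the register addresses and the hub data are all hypothetical stand-ins, not a description of the actual Verilog.

```python
class CogModel:
    """Toy model: four QUAD longs overlay cog registers at a movable base."""

    def __init__(self):
        self.regs = [0] * 512      # underlying cog registers (assumed size)
        self.quad = [0] * 4        # the four QUAD longs
        self.base = None           # where the window is mapped (None = unmapped)

    def setquad(self, base):
        # Moving the window copies nothing into the underlying registers.
        self.base = base

    def rdquad(self, hub, addr):
        # Load four longs from the (modelled) hub into the QUAD longs.
        self.quad = list(hub[addr:addr + 4])

    def read(self, reg):
        # Reads inside the window see the QUAD longs, not the registers.
        if self.base is not None and self.base <= reg < self.base + 4:
            return self.quad[reg - self.base]
        return self.regs[reg]


hub = list(range(100, 108))        # eight longs of made-up hub data
cog = CogModel()

cog.setquad(0x10)
cog.rdquad(hub, 0)
first_window = [cog.read(r) for r in range(0x10, 0x14)]

cog.setquad(0x14)                  # move the window to the next four registers
cog.rdquad(hub, 4)
old_location = [cog.read(r) for r in range(0x10, 0x14)]
new_location = [cog.read(r) for r in range(0x14, 0x18)]
```

In this model the first four longs are no longer visible at the old address once the window moves, because nothing was ever written to the registers underneath it.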
Are there still any lingering uncertainties about how execution from QUADs via RDQUAD works? Do you suspect there are still problems in the Verilog code? I think we have probably fixed the bug and now have software implementation details to sort out. Do you guys have an opinion on this?
I think I twisted my brain around it enough to see what is going on. At least I hope I did.
Your code sample:
RDQUAD hubaddress 'read a quad into the QUAD registers mapped at quad0..quad3
NOP 'do something for at least 3 clocks to allow QUADs to update
NOP
NOP
NOP 'do at least 1 instruction to get QUADs into pipeline
quad0 NOP 'QUAD0..QUAD3 are now executable
quad1 NOP
quad2 NOP
quad3 NOP
This effectively means that executing the four fetched instructions takes 9 clock cycles, throwing the next RDQUAD out of hub sync.
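The cycle count can be checked with simple arithmetic. The sketch below assumes every instruction in the sample takes one clock, that the hub window recurs every 8 clocks (the figure used elsewhere in this thread), and that a missed window simply means waiting for the next one. The variable names are made up for illustration.

```python
import math

HUB_WINDOW = 8  # clocks between hub windows (assumed, per the thread)

# The sequence from the sample above, one clock each (assumed).
sequence = ["RDQUAD", "NOP", "NOP", "NOP", "NOP",
            "quad0", "quad1", "quad2", "quad3"]

clocks_used = len(sequence)
# The next RDQUAD has to wait for the next hub window boundary.
clocks_per_loop = math.ceil(clocks_used / HUB_WINDOW) * HUB_WINDOW
fetched_per_loop = 4

print(clocks_used, clocks_per_loop, fetched_per_loop / clocks_per_loop)
```

Under those assumptions, overshooting the window by a single clock roughly halves the achievable rate compared with a loop that fits in 8 clocks.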
If the hub cache were copied to the cog registers instead of being mapped in, the following should work (it does not at present, because of the mapping):
setquad #quad1
reps #128,#16
getcnt start
RDQUAD pc
ADD pc,#61
NOP
NOP
SETQUAD #quad0 'map QUADs to quad0..quad3
quad0 NOP 'QUAD0..QUAD3 are now executable
NOP
NOP
NOP
RDQUAD pc
ADD pc,#61
NOP
NOP
SETQUAD #quad1 'map QUADs to quad0..quad3
quad1 NOP 'QUAD0..QUAD3 are now executable
NOP
NOP
NOP
One potential solution I see, which I suspect is far too late in the design cycle to implement, would be:
EXECQUAD D
which just shoves the four values into the pipeline instead of mapping them in.
I do have another idea I am working on, but I need a few minutes to chew on it before posting (i.e. try it out).
I don't see how this would work. The whole reason we have a pipeline is that different things happen at each stage. I don't think you can just load all of the stages at once and expect it to work.
It is *impossible* to execute four instructions within the 8-cycle hub window of RDQUAD.
(there, someone will figure it out now)
I did come up with an alternate use, however I have to write a new test case for it.
What if an RDQUAD-based LMM2 did not try to execute four instructions? What if we reserved one slot for a 32-bit constant or address?
This would actually make compiler writers' lives MUCH easier, and it would use the fourth slot usefully the vast majority of the time.
Revised kernel loop:
reps #256,#8
getcnt start
rdquad pc
add pc,#16
nop ' delay slot
nop ' delay slot
arg nop ' NOT executable - use for constant, address etc
nop ' will execute, "arg" is a valid register reference for it as it is fixed in the cog memory map
nop ' will execute "arg" is a valid register reference for it as it is fixed in the cog memory map
nop ' will execute, "arg" is a valid register reference for it as it is fixed in the cog memory map
This obeys all the rules Chip posted, and is very useful for LMM2.
Examples:
mov Rn,arg .... load large immediate value
mov pc,arg .... jump to arg
add D, arg .... add a large const
rdlong Rn, arg .... read any location
wrlong Rn, arg ..... write to any location
and so on.
It may actually be more useful than executing four instructions, as it makes addressing the whole hub, and large constants, trivial.
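A rough Python sketch of the idea: each fetched quad carries one data long (`arg`) plus three instruction slots. The tuple-based instruction format, the register names and the `run_quads` helper are invented purely for illustration; real LMM2 code would be PASM, and the real kernel is the loop shown above.

```python
def run_quads(hub, regs):
    """Toy LMM2 loop: slot 0 of each quad is data, slots 1..3 execute."""
    pc = 0
    while pc < len(hub):
        arg, *slots = hub[pc:pc + 4]   # like RDQUAD: fetch four longs
        pc += 4                        # toy list index, stands in for "add pc,#16"
        for op, dest in slots:
            if op == "mov":            # mov dest,arg : load a large constant
                regs[dest] = arg
            elif op == "add":          # add dest,arg : add a large constant
                regs[dest] += arg
            elif op == "jmp":          # mov pc,arg   : jump to arg
                pc = arg
            # "nop" falls through and does nothing
    return regs


# Two made-up quads: a data long followed by three instruction slots each.
program = [
    0x12345678, ("mov", "r0"), ("nop", None), ("nop", None),
    0x00010000, ("add", "r0"), ("mov", "r1"), ("nop", None),
]
regs = run_quads(program, {"r0": 0, "r1": 0})
```

The point of the layout is that all three executable slots in a quad can refer to the same full 32-bit value without any extra fetch or macro.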
I think RDQUAD can be declared working, and now I will verify that my new LMM2 kernel idea works.
Bill,
Have you considered using RDLONGC to make an LMM loop instead, where the first one takes the long hit but the next 3 are 1 cycle? They don't rely on the mapping thing, and so might have fewer problems with timing causing aliasing.
It might not be as simple and small, but could work more reliably with fewer restrictions on the code being fed through it.
Yes, I did - see post #1 in this thread for alternatives. However, the best I was able to do with those variations was 3 single-cycle LMM instructions per 8 clock cycles; RDQUAD would have allowed for four.
Once I accepted I could not get it working at 4/8, I did come up with another great use for RDQUAD, which should actually make a lot of the compiler work (32-bit constants and addresses) much easier; see a post above.
The problem with your last idea/use is that it's not really the same thing anymore. It requires a compiler (and one with special stuff in it to use this), while the traditional LMM could be used with a simple assembler that translated branches.
So, sure you can maybe convince the Prop2GCC effort to have a kernel using this "special" LMM setup, but I think it needs a different name from LMM, and that a more traditional LMM should be what we call LMM.
Chip:
I guess it might be a little late now, but just in case...
If we were not using any I/O pins, as is often the case in the main processing routine, could the QUAD window be mapped to $1F8-$1FB (the PINx registers)?
This would have the benefit of not taking any normal usable cog space providing the pins were not being used.
I know you use $1F8-$1FF to clear the mapping, but $1FA-$1FF (or just $1FF) could still be used for this.
re SEUSSF/SEUSSR instructions
I know there is some form of byte swapping instruction yet to be documented. Is this capable of performing an endian swap? If not, could these instructions be used for a word swap and a long swap ???
It may be too complex, but could they be used to copy between the quad cache and a register set (cog RAM)? This would allow a quick copy between hub and cog using quads and leave some instructions free as well.
Releasing the Terasic loadables was a brilliant move - it is letting us find bugs before the shuttle run.
Agreed! But it has surprised me how much mileage you and others have gotten out of the DE0 single cog emulation. What a remarkable thing!
I've been working night and day on an unrelated project. Now that it's nearly finished, I'm looking forward to dusting off my DE0 and torturing my own small part of the P2 instruction set.
Chip,
The main problem with that is that limiting LMM code to only 1 cycle instructions would be very undesirable. You really need to be able to do hub instructions and WAITxxx in LMM code. I think the focus should switch to using things that don't try to do overlapping of the mapped registers.
Please take a look at post #194 - I think that will work, and give pretty good results.
While it would only execute 3 simple instructions every eight clock cycles, those instructions would be made much more useful with the fourth slot being used for a constant or address.
Take a look at the illustrations I provided; all the MVI macros and FJMP primitives become unnecessary, saving many cycles.
Okay, here is a bizarre idea. How about making the PC 30 bits long instead of just 9 and have it fetch instructions using the equivalent of RDLONGC if any of the high address bits are set? Then you can run code directly out of COG memory with no LMM loop. I'm sure there is some fatal flaw in this idea....
I suppose you could map the QUADs up there, but you usually need to surround them with other instructions to facilitate RDQUAD, etc.
Sounds really neat but it strains my mind to consider how this would work with the pipeline, which is normally used to read hub memory.
I guess the pipeline would have to stall every time a new RDQUAD had to be done since it couldn't access the hub at the same time as the RDQUAD. I told you it was a bizarre idea! :-)
Bill, without my morning coffee (I don't often drink coffee), here are a few ideas (without much thought)...
1. I wonder if you could use the task switcher to read quads and execute them as a second task ??? This would at least make the flags available for the LMM load routine.
2. I wonder about going back to using the REPS instruction and/or flags to perform the loop.
3. As long as the LMM instructions in the quad are single cycle, there are no problems. The compiler will need to take special care with multi-cycle instructions - you already do that for calls/jumps, so maybe a similar mechanism will be required here too.
Perhaps these suggestions will at least provoke some further thoughts.
Comments
Thanks again for the sample and explanation.
I took another look at the last result, and I may still have a problem.
Based on the example Chip posted, it looks like the first instruction in a quad is not executed the first time as it is not ready yet.
I'm putting my thinking cap back on.
I am missing all the fun.
Thanks
Good to know.
P.S. I need to add this to my version of the PDF.
I can only answer for myself.
I think it works now --- it only needs some learning time to use it efficiently.
I did post another idea, but suspect it's too big a change - can you look at it and comment?
Chip knows
Besides, I have a useful work-around - I think.
My previous four instruction attempt - which did not work as expected - would have required just as much special work.
Short of sticking with the simplest scheme of 1 LMM instruction per 8-cycle hub window, PropGCC will require significant re-work.
Ignoring FCACHE/FLIB for now:
- Without significant GCC re-work, LMM code will run at most at 1/8 the speed of native code, which is not competitive with ARM.
- Using RDLONGC instead will improve things somewhat, reaching at most 1/6 the speed of native code (see experiment #1).
More effective use of RDLONGC or RDQUAD (four- or three-instruction version) will require significant GCC re-work, but provide better speed.
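For a quick side-by-side, the rates quoted in this thread can be tabulated as fractions of native single-cycle speed. This is just the arithmetic from the posts, not a measurement; the scheme labels are informal and the dictionary layout is only for illustration.

```python
from fractions import Fraction

# LMM instructions executed per clock, relative to native single-cycle code.
# Figures are the ones quoted in this thread; labels are informal.
schemes = {
    "simple LMM, 1 per hub window": Fraction(1, 8),
    "RDLONGC loop (experiment #1)": Fraction(1, 6),
    "RDQUAD, 3 instructions + arg": Fraction(3, 8),
    "RDQUAD, 4 instructions (not achieved)": Fraction(4, 8),
}

baseline = schemes["simple LMM, 1 per hub window"]
for name, rate in schemes.items():
    print(f"{name}: {rate} of native, {rate / baseline}x the simple loop")
```

These are upper bounds for single-cycle instructions only; hub accesses and other multi-cycle instructions would lower every figure.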