LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz) - Page 7 — Parallax Forums

LMM2 - Propeller 2 LMM experiments (50-80 LMM2 MIPS @ 160MHz)


Comments

  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 13:06
    #179 results suggest that it works OK - it looks like it just waits for the appropriate hub cycles for each multi-cycle instruction, and executes the other single cycle ones as well.

    If it did not execute all four instructions in the quad, the results would be wrong. Unless I am missing something again.

    I have not tried to nail down where the best place is for hub access instructions, but the organization above at least seems to work.

    Multi-cycle instructions will have to be scheduled very carefully, but they should work (famous last words!) - I hope.

    I'll have to try some other multi-cycle instructions.

    Thanks again for the sample and explanation.
    cgracey wrote: »
    The only way overlapped QUADs can be executed is if the instructions within them are all single-cycle. Otherwise, the overlapping gets mixed up, affecting the 4th and maybe even 3rd instruction.
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-11 13:08
    Hi Chip

    If this answer was to me -- You don't understand me.

    Can I?
    SETQUAD XXXX0 Set first 4 longs that fill in the COG's regs that have the same address as QUAD
    RDQUAD XXXX
    SETQUAD XXXX4 Set next 4 longs that fill in the COG's regs that have the same address as QUAD
    RDQUAD XXXX without moving data from QUAD

    cgracey wrote: »
    The only way overlapped QUADs can be executed is if the instructions within them are all single-cycle. Otherwise, the overlapping gets mixed up, affecting the 4th and maybe even 3rd instruction.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 13:13
    ARGH.

    I took another look at the last result, and I may still have a problem.

    Based on the example Chip posted, it looks like the first instruction in a quad is not executed the first time as it is not ready yet.

    I'm putting my thinking cap back on.
    Bill Henning wrote: »
    #179 results suggest that it works OK - it looks like it just waits for the appropriate hub cycles for each multi-cycle instruction, and executes the other single cycle ones as well.

    If it did not execute all four instructions in the quad, the results would be wrong. Unless I am missing something again.

    I have not tried to nail down where the best place is for hub access instructions, but the organization above at least seems to work.

    Multi-cycle instructions will have to be scheduled very carefully, but they should work (famous last words!) - I hope.

    I'll have to try some other multi-cycle instructions.

    Thanks again for the sample and explanation.
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-11 13:17
    This is brilliant work by you all !!!
    I am missing all the fun.
  • jmgjmg Posts: 15,183
    edited 2012-12-11 13:22
    jmg,David:

    I think Chip wants something *REALLY* simple in hardware, I don't believe he has the time / logic budget for anything complicated. If he did, I think all of us would inundate him with additional suggestions :)

    My suggestion was simple, really just a Mux-unpack.
    I'm not clear if this compression is intended as an Opcode-space/runtime option (which sounds like a high impact change as you affect the opcode decode pathways), or if it is intended as a Memory move (COG load) (de)compression, where I imagine there is more time headroom, and it does not impact opcode decode pathways. A single boolean flag, or a special opcode, is all that is needed to enable the Mux-unpack, plus the storage of the 16 bits of more-stable merge information
  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 13:28
    Sapieha wrote: »
    Hi Chip

    If this answer was to me -- You don't understand me.

    Can I?
    SETQUAD XXXX0 Set first 4 longs that fill in the COG's regs that have the same address as QUAD
    RDQUAD XXXX
    SETQUAD XXXX4 Set next 4 longs that fill in the COG's regs that have the same address as QUAD
    RDQUAD XXXX without moving data from QUAD

    Oh, I understand now. This won't work because the floating QUAD window does not copy its contents to the underlying registers.
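
    A minimal sketch of why the SETQUAD/RDQUAD sequence above cannot be used as a block copy, written in Python rather than PASM so it can be run anywhere. The class and method names (`Cog`, `setquad`, `rdquad`, `read`) are illustrative only, and the hub is modelled as a plain list indexed by long; the only behaviour taken from the thread is that the QUAD window overlays cog addresses without writing to the registers underneath.

```python
# Toy model of the "floating QUAD window". Names are hypothetical, not a real API.

class Cog:
    def __init__(self):
        self.regs = [0] * 512      # underlying cog registers
        self.quad = [0] * 4        # the four QUAD longs filled by RDQUAD
        self.window = None         # base address of the mapped window, or None

    def setquad(self, base):
        """Move the window; nothing is copied into self.regs."""
        self.window = base

    def rdquad(self, hub, index):
        """Load four longs from hub (a list of longs) into the QUAD longs."""
        self.quad = list(hub[index:index + 4])

    def read(self, addr):
        """Reads inside the window see the QUAD longs, not the registers."""
        if self.window is not None and self.window <= addr < self.window + 4:
            return self.quad[addr - self.window]
        return self.regs[addr]


hub = [11, 22, 33, 44, 55, 66, 77, 88]
cog = Cog()

cog.setquad(0x100)
cog.rdquad(hub, 0)
first = [cog.read(0x100 + i) for i in range(4)]

cog.setquad(0x104)                 # move the window to the next four registers
cog.rdquad(hub, 4)
old_slot = [cog.read(0x100 + i) for i in range(4)]
new_slot = [cog.read(0x104 + i) for i in range(4)]
```

    Once the window moves to the next four addresses, the first four addresses show whatever the underlying registers held before, because the first quad was never stored there.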
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-11 13:31
    Hi Chip.

    Thanks

    Good to know.

    P.S. I need to add this to my version of the PDF.
    cgracey wrote: »
    Oh, I understand now. This won't work because the floating QUAD window does not copy its contents to the underlying registers.
  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 13:41
    Bill, Sapieha, and Ariba,

    Are there still any lingering uncertainties about how execution from QUADs via RDQUAD works? Do you suspect there are yet problems in the Verilog code? I'm thinking we probably fixed what was the bug and now have software implementation details to sort out. Do you guys have an opinion on this?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 13:45
    Chip,

    I think I twisted my brain around it enough to see what is going on. At least I hope I did.

    Your code sample:
            RDQUAD  hubaddress      'read a quad into the QUAD registers mapped at quad0..quad3
    
            NOP                     'do something for at least 3 clocks to allow QUADs to update
            NOP
            NOP
    
            NOP                     'do at least 1 instruction to get QUADs into pipeline
    
    quad0   NOP                     'QUAD0..QUAD3 are now executable
    quad1   NOP
    quad2   NOP
    quad3   NOP
    

    This effectively means that to execute the four fetched instructions will take 9 clock cycles, throwing the next RDQUAD out of hub sync.
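
    The 9-clock figure can be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not stated in the thread: RDQUAD is counted as 1 clock once it is aligned to its hub slot, and an RDQUAD that misses its slot waits for the next one.

```python
HUB_WINDOW = 8  # clocks between hub slots for one cog, as used in this thread

# Clock cost of each part of Chip's sample, in order.
sequence = [
    ("RDQUAD", 1),                  # assumed 1 clock when already hub-aligned
    ("wait for QUADs to update", 3),
    ("get QUADs into pipeline", 1),
    ("quad0..quad3", 4),
]

total = sum(clocks for _, clocks in sequence)
overrun = total - HUB_WINDOW                 # clocks past the hub window
windows = -(-total // HUB_WINDOW)            # ceiling division
effective = windows * HUB_WINDOW             # clocks per pass if the slot is missed
```

    Under those assumptions one pass is a single clock too long, so each group of four instructions ends up costing two hub windows instead of one.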

    If the hub cache was copied to the cog registers instead of being mapped in, the following should work: (it does not now due to mapping)
    	setquad	#quad1
    
    	reps	#128,#16
    	getcnt	start
    
            RDQUAD  pc
    
            ADD	pc,#61
            NOP
            NOP
    
            SETQUAD #quad0          'map QUADs to quad0..quad3
    
    quad0   NOP                     'QUAD0..QUAD3 are now executable
    	NOP
    	NOP
    	NOP
    
            RDQUAD  pc
    
            ADD	pc,#61
            NOP
            NOP
    
            SETQUAD #quad1          'map QUADs to quad0..quad3
    
    quad1   NOP                     'QUAD0..QUAD3 are now executable
    	NOP
    	NOP
    	NOP
    

    One potential solution I see - which I suspect is far too late in the design cycle to implement would be:

    EXECQUAD D

    which just shoves the four values into the pipeline instead of mapping them in.

    I do have another idea I am working on, but need a few minutes to chew on it before posting (ie try it out)
  • SapiehaSapieha Posts: 2,964
    edited 2012-12-11 13:46
    Hi Chip.

    I can only answer for me
    I think it now works --- I only need time to learn to use it efficiently.


    cgracey wrote: »
    Bill, Sapieha, and Ariba,

    Are there still any lingering uncertainties about how execution from QUADs via RDQUAD works? Do you suspect there are yet problems in the Verilog code? I'm thinking we probably fixed what was the bug and now have software implementation details to sort out. Do you guys have an opinion on this?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 13:48
    I think it is probably OK but not capable of executing 4 instructions in 8 cycles as I was hoping.

    I did post another idea, but suspect it's too big a change - can you look at it and comment?
    cgracey wrote: »
    Bill, Sapieha, and Ariba,

    Are there still any lingering uncertainties about how execution from QUADs via RDQUAD works? Do you suspect there are yet problems in the Verilog code? I'm thinking we probably fixed what was the bug and now have software implementation details to sort out. Do you guys have an opinion on this?
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-11 13:58
    Bill Henning wrote: »
    One potential solution I see - which I suspect is far too late in the design cycle to implement would be:

    EXECQUAD D

    which just shoves the four values into the pipeline instead of mapping them in.

    I do have another idea I am working on, but need a few minutes to chew on it before posting (ie try it out)
    I don't see how this would work. The whole reason we have a pipeline is that different things happen at each stage. I don't think you can just load all of the stages at once and expect it to work.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 14:00
    Ok,

    It is *impossible* to execute four instructions within the 8-cycle hub window of RDQUAD.

    (there, someone will figure it out now)

    I did come up with an alternate use, however I have to write a new test case for it.

    What if a RDQUAD based LMM2 did not try to execute four instructions? What if we reserved one slot for a 32 bit constant or address?

    This would actually make compiler writers lives MUCH easier, and it would use the fourth slot usefully the vast majority of the time.

    Revised kernel loop:
    	reps	#256,#8
    	getcnt	start
    
    	rdquad  pc
    	add	pc,#16
    	nop	' delay slot
    	nop	' delay slot
    arg	nop	' NOT executable - use for constant, address etc
    	nop	' will execute; "arg" is a valid register reference for it, as it is fixed in the cog memory map
    	nop	' will execute; "arg" is a valid register reference for it, as it is fixed in the cog memory map
    	nop	' will execute; "arg" is a valid register reference for it, as it is fixed in the cog memory map
    

    This obeys all the rules Chip posted, and is very useful for LMM2.

    Examples:
            mov   Rn,arg   .... load large immediate value
    
            mov   pc,arg    .... jump to arg
    
            add   D, arg   .... add a large const
    
            rdlong  Rn, arg   .... read any location
    
            wrlong   Rn, arg   ..... write to any location
    
            and so on. 
    

    It may actually be more useful than executing four instructions, as it makes addressing the whole hub, and large constants, trivial.

    I think RDQUAD can be declared working, and now I will verify that my new LMM2 kernel idea works.
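
    A rough behavioural model of the revised kernel loop above, in Python, for anyone who wants to play with the encoding before writing PASM. Everything here is an assumption made for illustration: instructions are tuples instead of 32-bit opcodes, the hub is a list indexed by long (so `add pc,#16` becomes `pc + 4`), and only `mov`, `add` and `nop` are modelled. It says nothing about timing.

```python
def operand(regs, arg, src):
    """'arg' names the constant slot of the current quad; anything else is a register."""
    return arg if src == "arg" else regs[src]


def run_lmm2(hub, max_quads=100):
    regs = {"pc": 0, "r0": 0, "r1": 0}
    for _ in range(max_quads):
        pc = regs["pc"]
        if pc + 4 > len(hub):
            break
        arg, *slots = hub[pc:pc + 4]     # rdquad pc: slot 0 is data, not code
        regs["pc"] = pc + 4              # add pc,#16 (four longs)
        for op in slots:                 # the three executable slots
            if op[0] == "mov":
                regs[op[1]] = operand(regs, arg, op[2])
            elif op[0] == "add":
                regs[op[1]] += operand(regs, arg, op[2])
            # "nop" does nothing
    return regs


NOP = ("nop",)
program = [
    0x12345678, ("mov", "r0", "arg"), ("add", "r1", "r0"), NOP,   # load large immediate
    12,         ("mov", "pc", "arg"), NOP, NOP,                   # jump to long index 12
    1,          ("add", "r1", "arg"), NOP, NOP,                   # skipped by the jump
    0x1000,     ("add", "r0", "arg"), NOP, NOP,                   # add a large constant
]
result = run_lmm2(program)
```

    Note that in this model the slots following `mov pc,arg` in the same quad still run, since they were fetched together with it; a compiler targeting this layout would have to account for that.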
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-11 14:01
    Bill,
    Have you considered the idea of using RDLONGC to make a LMM loop instead? Where the first one takes the long hit, but the next 3 are 1 cycle? They don't rely on the mapping thing, and so might have less problems with timing causing aliasing.
    It might not be as simple and small, but could work more reliably with less restrictions on the code being fed through it.
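
    To make the "first one takes the long hit" behaviour concrete, here is a small Python sketch of a one-line, four-long read cache. It is an assumption-laden model: the class name `QuadCache`, the long-indexed hub, and the alignment of a line to four longs are all illustrative, and no clock counts are attached to a hit or a miss.

```python
class QuadCache:
    """One cached line of four longs, in the spirit of RDLONGC (hypothetical model)."""

    def __init__(self, hub):
        self.hub = hub
        self.base = None      # long index of the cached line, or None when empty
        self.hits = 0
        self.misses = 0

    def rdlongc(self, index):
        base = index & ~3     # line containing this long
        if base == self.base:
            self.hits += 1    # already cached: the cheap case
        else:
            self.base = base  # reload the line: the expensive case
            self.misses += 1
        return self.hub[index]


hub = list(range(100, 108))
cache = QuadCache(hub)
fetched = [cache.rdlongc(i) for i in range(8)]   # straight-line code, 8 longs
```

    Sequential fetches miss once per four longs, which is where the three cheap reads per line come from.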
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 14:03
    I don't know how it works internally - if there is an instruction queue, with a stage count, it would; if there is a queue of micro-ops, it won't

    Chip knows

    Besides, I have a useful work-around - I think.
    David Betz wrote: »
    I don't see how this would work. The whole reason we have a pipeline is that different things happen at each stage. I don't think you can just load all of the stages at once and expect it to work.
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-11 14:05
    My understanding is that it doesn't work either way you describe at all. It's much simpler.

    Bill Henning wrote: »
    I don't know how it works internally - if there is an instruction queue, with a stage count, it would; if there is a queue of micro-ops, it won't

    Chip knows

    Besides, I have a useful work-around - I think.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 14:05
    Hi Roy,

    Yes, I did - see post#1 in this thread for alternates, however the best I was able to do with those variations is 3 LMM single cycle instructions per 8 clock cycles; RDQUAD would have allowed for four.

    Once I accepted I could not get it working at 4/8 I did come up with another great use for RDQUAD, which should actually make a lot of the compiler work (32 bit constants and addresses) much easier, see a post above.
    Roy Eltham wrote: »
    Bill,
    Have you considered the idea of using RDLONGC to make a LMM loop instead? Where the first one takes the long hit, but the next 3 are 1 cycle? They don't rely on the mapping thing, and so might have less problems with timing causing aliasing.
    It might not be as simple and small, but could work more reliably with less restrictions on the code being fed through it.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 14:07
    Frankly, even if possible, I think EXECQUAD is more trouble (delay) than it would be worth; I was doing a "what if"
    Roy Eltham wrote: »
    My understanding is that it doesn't work either way you describe at all. It's much simpler.
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-11 14:10
    The problem with your last idea/use is that it's not really the same thing anymore. It requires a compiler (and one with special stuff in it to use this), while the traditional LMM could be used with a simple assembler that translated branches.

    So, sure you can maybe convince the Prop2GCC effort to have a kernel using this "special" LMM setup, but I think it needs a different name from LMM, and that a more traditional LMM should be what we call LMM. :)
  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 14:12
    I think you CAN have 4 instructions executing every 8 clocks, as long as they are single-cycle, so that RDQUADs can be overlapped. Any agreement?
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-11 14:15
    cgracey wrote: »
    Oh, I understand now. This won't work because the floating QUAD window does not copy its contents to the underlying registers.

    Chip:
    I guess it might be a little late now, but just in case...

    If we were not using any I/O pins, as is often the case in the main processing routine, could the QUAD window be mapped to $1F8-$1FB (the PINx registers)?

    This would have the benefit of not taking any normal usable cog space providing the pins were not being used.
    I know you use $1F8-$1FF to clear the mapping, but $1FA-$1FF (or just $1FF) could still be used for this.

    re SEUSSF/SEUSSR instructions
    I know there is some form of byte swapping instruction yet to be documented. Is this capable of performing an endian swap? If not, could these instructions be used for a word swap and a long swap ???

    It may be too complex, but could they be used to copy to/from the quad cache to a register set (cog ram). This would allow a quick copy from/to hub cog using quads and leave some instructions free as well ???
  • User NameUser Name Posts: 1,451
    edited 2012-12-11 14:15
    Releasing the Terasic loadables was a brilliant move - it is letting us find bugs before the shuttle run.

    Agreed! But it has surprised me how much mileage you and others have gotten out of the DE0 single cog emulation. What a remarkable thing!

    I've been working night and day on an unrelated project. Now that it's nearly finished, I'm looking forward to dusting off my DE0 and torturing my own small part of the P2 instruction set.
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2012-12-11 14:17
    Chip,
    The main problem with that is that limiting LMM code to only 1 cycle instructions would be very undesirable. You really need to be able to do hub instructions and WAITxxx in LMM code. I think the focus should switch to using things that don't try to do overlapping of the mapped registers.
    cgracey wrote: »
    I think you CAN have 4 instructions executing every 8 clocks, as long as they are single-cycle, so that RDQUADs can be overlapped. Any agreement?
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 14:17
    Roy,

    My previous four instruction attempt - which did not work as expected - would have required just as much special work.

    Short of sticking with the simplest 1 LMM instruction per 8 cycle hub window PropGCC will require significant re-work.

    Ignoring FCACHE/FLIB for now:

    - Without significant gcc re-work LMM code will run at most 1/8 speed of native code, which is not competitive with ARM.

    - using RDLONGC instead will improve things somewhat, reaching at most 1/6 clock speed of native code (see experiment#1)

    More effective use of RDLONGC or RDQUAD (four or three instruction version) will require significant GCC rework, but provide better speed.
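
    For reference, the ratios above translate into rough MIPS figures as follows. This is back-of-envelope arithmetic, assuming the 160MHz clock from the thread title and one native single-cycle instruction per clock; the scheme labels are just descriptions, not official names.

```python
CLOCK_MHZ = 160   # from the thread title

# LMM instructions per clock for each scheme discussed in the thread.
schemes = {
    "1 instruction per 8-clock hub window": 1 / 8,
    "RDLONGC, simple loop (experiment #1)": 1 / 6,
    "RDQUAD, 3 instructions + constant slot": 3 / 8,
    "RDQUAD, 4 single-cycle instructions": 4 / 8,
}

mips = {name: CLOCK_MHZ * ratio for name, ratio in schemes.items()}

for name, value in mips.items():
    print(f"{name}: {value:.1f} MIPS")
```

    The last two rows are the best cases and only apply when every executed slot is a single-cycle instruction.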
    Roy Eltham wrote: »
    The problem with your last idea/use is that it's not really the same thing anymore. It requires a compiler (and one with special stuff in it to use this), while the traditional LMM could be used with a simple assembler that translated branches.

    So, sure you can maybe convince the Prop2GCC effort to have a kernel using this "special" LMM setup, but I think it needs a different name from LMM, and that a more traditional LMM should be what we call LMM. :)
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-12-11 14:20
    Yes, but that would not be very useful.

    Please take a look at post #194 - I think that will work, and give pretty good results.

    While it would only execute 3 simple instructions every eight clock cycles, those instructions would be made much more useful with the fourth slot being used for a constant or address.

    Take a look at the illustrations I provided; all the MVI macros, FJMP primitives become unnecessary, saving many cycles.
    cgracey wrote: »
    I think you CAN have 4 instructions executing every 8 clocks, as long as they are single-cycle, so that RDQUADs can be overlapped. Any agreement?
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-11 14:25
    Okay, here is a bizarre idea. How about making the PC 30 bits long instead of just 9 and have it fetch instructions using the equivalent of RDLONGC if any of the high address bits are set? Then you can run code directly out of COG memory with no LMM loop. I'm sure there is some fatal flaw in this idea....
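
    The fetch rule being proposed can be sketched in a few lines of Python. This only illustrates the address decode; the function name `fetch`, the long-indexed hub list, and the way a wide PC is mapped onto a hub index are all assumptions, and the pipeline question is ignored entirely.

```python
COG_LONGS = 512   # 9-bit cog address space


def fetch(pc, cog, hub):
    """Return (source, long) for a hypothetical wide program counter."""
    if pc >> 9 == 0:                       # no high address bits set: cog fetch
        return "cog", cog[pc]
    # Any high bit set: fetch from hub, as a cached RDLONGC-style read would.
    # The "pc - COG_LONGS" mapping is an arbitrary choice for this sketch.
    return "hub", hub[pc - COG_LONGS]


cog = list(range(COG_LONGS))
hub = [1000 + i for i in range(16)]

in_cog = fetch(0x1FF, cog, hub)   # last cog address
in_hub = fetch(0x200, cog, hub)   # first address with a high bit set
```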
  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 14:27
    Cluso99 wrote: »
    Chip:
    I guess it might be a little late now, but just in case...

    If we were not using any I/O pins, as is often the case in the main processing routine, could the QUAD window be mapped to $1F8-$1FB (the PINx registers)?

    This would have the benefit of not taking any normal usable cog space providing the pins were not being used.
    I know you use $1F8-$1FF to clear the mapping, but $1FA-$1FF (or just $1FF) could still be used for this.

    re SEUSSF/SEUSSR instructions
    I know there is some form of byte swapping instruction yet to be documented. Is this capable of performing an endian swap? If not, could these instructions be used for a word swap and a long swap ???

    It may be too complex, but could they be used to copy to/from the quad cache to a register set (cog ram). This would allow a quick copy from/to hub cog using quads and leave some instructions free as well ???

    I suppose you could map the QUADs up there, but you usually need to surround them with other instructions to facilitate RDQUAD, etc.
  • cgraceycgracey Posts: 14,253
    edited 2012-12-11 14:30
    David Betz wrote: »
    Okay, here is a bizarre idea. How about making the PC 30 bits long instead of just 9 and have it fetch instructions using the equivalent of RDLONGC if any of the high address bits are set? Then you can run code directly out of COG memory with no LMM loop. I'm sure there is some fatal flaw in this idea....

    Sounds really neat but it strains my mind to consider how this would work with the pipeline, which is normally used to read hub memory.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-12-11 14:32
    cgracey wrote: »
    Sounds really neat but it strains my mind to consider how this would work with the pipeline, which is normally used to read hub memory.
    I guess the pipeline would have to stall every time a new RDQUAD had to be done since it couldn't access the hub at the same time as the RDQUAD. I told you it was a bizarre idea! :-)
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-12-11 14:38
    Bill, without my morning coffee (I don't often drink coffee ;) ), here are a few ideas (without much thought)...

    1. I wonder if you could use the task switcher to readquads and execute them as a second task ??? This would at least make the flags available for the LMM load routine.

    2. I wonder about going back to using the REPS instruction and/or flags to perform the loop.

    3. As long as the LMM instructions in the quad are single cycle, there are no problems. The compiler will need to take special care for multi-cycle instructions - you already do that for calls/jumps so maybe a similar mechanism will be required here too.

    Maybe these suggestions may at least provoke some further thoughts.