Propeller II: Emulation of the P2 on FPGA boards (Prop123-A7/A9, DE0-NANO, DE2-115, etc) - Page 11 — Parallax Forums

Comments

  • Cluso99 Posts: 18,069
    edited 2012-12-06 11:55
    Nice job Chip - Task switching looks amazingly simple to implement. For sure there are some traps for the unwary.
  • cgracey Posts: 14,232
    edited 2012-12-06 11:57
    Did anyone read the new doc section on multitasking yet? Did it make complete sense?
  • Cluso99 Posts: 18,069
    edited 2012-12-06 11:58
    Sapieha: Your P2 board looks nice. I see you have gone for a bank of V,G,0-7 sockets allowing the use of many P1 modules from Bill and others.
  • Cluso99 Posts: 18,069
    edited 2012-12-06 12:00
    cgracey wrote: »
    Did anyone read the new doc section on multitasking yet? Did it make complete sense?
    Yes. Our posts crossed.

    The P2 is certainly going to keep Koroneko busy ;)
  • David Betz Posts: 14,516
    edited 2012-12-06 12:00
    cgracey wrote: »
    There was no @ before 'reserves'.
    Ah, that must be my problem! Thanks for the explanation!
  • Sapieha Posts: 2,964
    edited 2012-12-06 12:01
    Hi Cluso

    V,G,0-7,5V

    And yes, it is meant to support all standard 10-pin modules plus Bill's 11-pin modules.


    Cluso99 wrote: »
    Sapieha: Your P2 board looks nice. I see you have gone for a bank of V,G,0-7 sockets allowing the use of many P1 modules from Bill and others.
  • David Betz Posts: 14,516
    edited 2012-12-06 12:07
    cgracey wrote: »
    Did anyone read the new doc section on multitasking yet? Did it make complete sense?
    It made sense to me. I think you had posted a shorter description of tasking earlier as well. Even that was pretty clear!
  • mindrobots Posts: 6,506
    edited 2012-12-06 12:51
    cgracey wrote: »
    Did anyone read the new doc section on multitasking yet? Did it make complete sense?

    Chip,

    It made sense to me, which may or may not cause concern! :lol:

    I like how you can play with the task register to control the amount of execution each task can get.

    Now we'll see if I understand.

    At least two ways to start multi-tasking:

    1) Your code after COGINIT would start with instructions for task0 at $000. It would run through some initialization/housekeeping code until it was ready to start task1. It would then execute a JMPTASK with the mask set for task1 and D set to the first instruction of task1's code, and use SETTASK to give some portion of the execution slots to task1. The same steps can be repeated for up to 3 additional tasks.

    2) As in your 4-task example, where task 0 just does the JMP, then the SETTASK, and then goes about its business.

    Wicked cool!!

    I also like that the stack area is non-volatile across COGINITs. That seems ripe for adventure and exploitation (in a good sense)!
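To illustrate the point above about using the task register to control how much execution each task gets, here is a minimal Python sketch. It assumes the time-slot pattern can be modelled as a repeating list of task numbers, one per slot. The function name `task_shares` and the example patterns are hypothetical and are not taken from the P2 documentation.

```python
from collections import Counter
from fractions import Fraction

def task_shares(slot_pattern):
    """Return the fraction of time slots that each task (0..3) receives.

    slot_pattern is an assumed model: a list of task numbers that the
    cog cycles through, one slot per clock, repeating forever.
    """
    counts = Counter(slot_pattern)
    total = len(slot_pattern)
    return {task: Fraction(counts.get(task, 0), total) for task in range(4)}

# Hypothetical patterns: an even split, and one that favours task 0.
even_split = [0, 1, 2, 3]
mostly_task0 = [0, 0, 0, 1]

for name, pattern in (("even", even_split), ("mostly task 0", mostly_task0)):
    print(name, {task: str(share) for task, share in task_shares(pattern).items()})
```

In this model, giving a task more entries in the pattern is all it takes to give it a larger share of the cog.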
  • Bill Henning Posts: 6,445
    edited 2012-12-06 12:59
    Made sense to me.
    cgracey wrote: »
    Did anyone read the new doc section on multitasking yet? Did it make complete sense?
  • Electrodude Posts: 1,661
    edited 2012-12-06 13:36
    The non-volatility of the stack makes LMM really easy:
    load variables from stack into cogram
    run part of code
    save variables into stack
    coginit this cog with next chunk
    repeat
    
  • Tubular Posts: 4,706
    edited 2012-12-06 14:32
    cgracey wrote: »
    Did anyone read the new doc section on multitasking yet? Did it make complete sense?

    Yep! Looks clear enough despite the caveats. I am thinking of re-writing my SVGA generator in task form, perhaps a VGA version (dot clock 25MHz), just as an exercise - ie
    task00 - supervisor, vsync & idle thread
    task01 - output active VGA
    task02 - fetch / modulate VGA contents
    task03 - porch/hsync/porch

    The advantage I see is that you don't have to cycle count to get the timing spot on (you can base it all off the system CNT, for instance).

    As I said just an exercise, we will of course be using the video generator down the track.
  • cgracey Posts: 14,232
    edited 2012-12-06 16:14
    Okay!

    Thanks for the feedback, Guys.

    Sounds like you've got an accurate handle on the multitasking.
  • cgracey Posts: 14,232
    edited 2012-12-06 16:18
    The non-volatility of the stack makes LMM really easy:
    load variables from stack into cogram
    run part of code
    save variables into stack
    coginit this cog with next chunk
    repeat
    

    When a cog loads, it is actually executing RDLONGC's. So, you could do the same in software and load at the same rate, without having to commit the whole cog memory and having the I/O's cancelled each time.
  • Peter Jakacki Posts: 10,193
    edited 2012-12-06 17:25
    Since I've been waylaid and am only now getting back into the P2 seat, I have copied the latest (I hope) documentation from Chip's post and put it into a Google document. Having to go back to the post, scroll, select the text and muck about is a nuisance for a number of reasons, one of them being that all the formatting previously done on the document is lost. Why can't we just have the Google document updated, so that we keep the formatting and can introduce bookmarks and a table of contents, making this a live document?
    Link to the editable version of the document
    Link to the webpage version of the document which is automatically updated when the master has been changed.
  • Seairth Posts: 2,474
    edited 2012-12-06 18:00
    cgracey wrote: »
    Sounds like you've got an accurate handle on the multitasking.

    *sigh* I feel like I'm beating this to death, but could you add a bit more about how SETTASK affects the pipeline? In an earlier comment, you said:
    cgracey wrote: »
    It just writes a new value into the TASK register which immediately affects which task is going to execute next.

    Since I am assuming the new value isn't actually written until the last stage of the pipeline, does that mean that the other instructions in the pipeline (regardless of their associated task) will still be executed? Put another way, is the TASK register controlling the next instruction to load into the pipeline, rather than the next instruction to execute?
  • cgracey Posts: 14,232
    edited 2012-12-06 19:05
    Seairth wrote: »
    *sigh* I feel like I'm beating this to death, but could you add a bit more about how SETTASK affects the pipeline? In an earlier comment, you said:



    Since I am assuming the new value isn't actually written until the last stage of the pipeline, does that mean that the other instructions in the pipeline (regardless of their associated task) will still be executed? Put another way, is the TASK register controlling the next instruction to load into the pipeline, rather than the next instruction to execute?

    You're right. I need to do some pipeline explanation.

    When I say an instruction executes, I mean it is in the last stage of the pipeline, where the action occurs. Prior stages, going backwards, read the operands, handle indirection and other things, and read the instruction:

    0: read the instruction
    1: handle indirection
    2: read the operands
    3: execute the instruction (compute result, affect Z, C, write result)

    So, when SETTASK issues a new time slot pattern, there are already three instructions in the pipeline, so the 4th instruction after SETTASK will be from the task specified in the two LSB's of the SETTASK operand.

    Anyone want to verify this? You could have the SETTASK's 2 LSB's give a time slot to a task which just sets pin 1 using 'SETP #1', then the instruction after the SETTASK could do a 'SETP #0'. See how many clock periods are between the two.
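The pipeline description above can be played through with a small Python model. This is only a sketch under stated assumptions: one instruction is fetched per clock, nothing stalls, and a SETTASK in the execute stage changes which task is fetched from the next clock onward. The program contents and the names `PROGRAMS` and `run` are hypothetical. It is not a substitute for the pin measurement suggested above.

```python
from collections import deque

PIPELINE_DEPTH = 4  # read instruction, indirection, read operands, execute

# Hypothetical instruction streams, just enough to watch the hand-over.
PROGRAMS = {
    0: ["SETTASK", "SETP #0", "NOP", "NOP"],  # task 0 hands every slot to task 1
    1: ["SETP #1", "NOP", "NOP", "NOP"],
}

def run(clocks):
    """Return (clock, task, instruction) for everything that reaches execute."""
    slot_owner = 0                      # task fetched from on each clock
    position = {0: 0, 1: 0}             # per-task program position
    pipeline = deque([None] * PIPELINE_DEPTH, maxlen=PIPELINE_DEPTH)
    executed = []
    for clock in range(clocks):
        program = PROGRAMS[slot_owner]
        index = position[slot_owner]
        fetched = (slot_owner, program[index]) if index < len(program) else None
        position[slot_owner] = index + 1
        pipeline.appendleft(fetched)    # the oldest entry falls off the right end
        in_execute = pipeline[-1]       # instruction fetched three clocks ago
        if in_execute is not None:
            task, name = in_execute
            executed.append((clock, task, name))
            if name == "SETTASK":
                slot_owner = 1          # assumed: affects the next fetch onward
    return executed

for clock, task, name in run(8):
    print(f"clock {clock}: task {task} executes {name}")
```

In this model the three task 0 instructions that were already in the pipeline still execute, and the first task 1 instruction is the 4th one after SETTASK.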
  • ctwardell Posts: 1,716
    edited 2012-12-07 05:53
    This is getting too fun, went ahead and ordered a DE0-Nano from Digi-Key this morning.

    I want to continue with the COSMACog project using the P1, but also want to have a P2 version when the P2 hits the streets.

    Depending on how things look around Jan/Feb I may spring for a DE2-115.

    C.W.
  • David Betz Posts: 14,516
    edited 2012-12-07 07:42
    Cluso99,

    I guess you own this topic so I'd like to request that you add a link to the p2load thread so people can find the loader when they need it. The thread itself is not active enough to stay close to the top of the posts.

    Here is a link to it: http://forums.parallax.com/showthread.php?144384-p2load-A-Loader-for-the-Propeller-II

    Thanks!
    David
  • Leon Posts: 7,620
    edited 2012-12-07 07:55
    ctwardell wrote: »
    This is getting too fun, went ahead and ordered a DE0-Nano from Digi-Key this morning.

    I want to continue with the COSMACog project using the P1, but also want to have a P2 version when the P2 hits the streets.

    Depending on how things look around Jan/Feb I may spring for a DE2-115.

    C.W.

    It might make more sense to implement the COSMAC Elf in an FPGA:

    http://whats.all.this.brouhaha.com/category/computing/hardware/fpga/
  • Seairth Posts: 2,474
    edited 2012-12-07 11:29
    cgracey wrote: »
    Initially, $1F6 and $1F7 point only to themselves, so they are more or less regular RAM registers and do get loaded on cog start. When the hardware sees $1F6 or $1F7 for D or S, it substitutes the current pointer value for $1F6/$1F7. The only way you can address the shadow registers is by pointing INDA or INDB at them.

    Could you clarify this a bit (so I can update the google doc)? When a cog is initialized, $1F6 and $1F7 are written to. But your comment indicates that the INDA and INDB pointers actually point to those addresses. What is the final state of INDA and INDB when the cog starts to run?

    Also, it occurs to me that it would be possible to generate instructions like:

    MOV ++INDA, INDA++
    MOV INDA++, INDA--
    etc.

    Are these allowed or undefined?
  • cgracey Posts: 14,232
    edited 2012-12-07 11:50
    Seairth wrote: »
    Could you clarify this a bit (so I can update the google doc)? When a cog is initialized, $1F6 and $1F7 are written to. But your comment indicates that the INDA and INDB pointers actually point to those addresses. What is the final state of INDA and INDB when the cog starts to run?

    Also, it occurs to me that it would be possible to generate instructions like:

    MOV ++INDA, INDA++
    MOV INDA++, INDA--
    etc.

    Are these allowed or undefined?

    Those crazy examples are all allowed. Just OR the 2-bit fields together to get the 2-bit post-effect.

    At cog startup, INDA and INDB are configured as if these instructions had been executed:

    FIXINDA $1F6,$1F6
    FIXINDB $1F7,$1F7

    So, reading or writing $1F6 or $1F7 has the intended effect. You just won't be able to have any conditional execution.
  • ctwardell Posts: 1,716
    edited 2012-12-07 12:00
    Leon wrote: »
    It might make more sense to implement the COSMAC Elf in an FPGA:

    http://whats.all.this.brouhaha.com/category/computing/hardware/fpga/

    I may do that as well someday for giggles, but for now the prop1 and then prop2 are fine.

    The goal is to have a very low cost emulator with a low barrier to entry.

    C.W.
  • Seairth Posts: 2,474
    edited 2012-12-07 12:03
    cgracey wrote: »
    At cog startup, INDA and INDB are configured as if these instructions had been executed:

    FIXINDA $1F6,$1F6
    FIXINDB $1F7,$1F7

    So this means that COGINIT only loads $1F6 instructions (effectively). I'm assuming that, internally, the cog load is performing 126 iterations of 4 RDLONGCs with the last two longs just being throw-away operations.
  • cgracey Posts: 14,232
    edited 2012-12-07 13:55
    Seairth wrote: »
    So this means that COGINIT only loads $1F6 instructions (effectively). I'm assuming that, internally, the cog load is performing 126 iterations of 4 RDLONGCs with the last two longs just being throw-away operations.

    No. It loads $1F8 instructions. If you ever actually execute $1F6 or $1F7, it will get the data from those absolute registers. Only D and S have indirection. The instruction doesn't.
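As a quick arithmetic check on the numbers in this exchange: 126 groups of 4 longs come to exactly $1F8 longs, so under that grouping nothing needs to be thrown away and registers $1F6 and $1F7 fall inside the loaded range. The grouping into 126 iterations of 4 is the assumption from the quoted post, not a confirmed description of the loader. The variable names below are hypothetical.

```python
LONGS_PER_GROUP = 4      # assumed: four RDLONGCs per iteration, as in the quoted post
ITERATIONS = 126         # figure from the quoted post

loaded_longs = ITERATIONS * LONGS_PER_GROUP
print(f"longs loaded: {loaded_longs} = ${loaded_longs:X}")

# Registers $1F6 and $1F7 are addresses 502 and 503, inside range(504).
for address in (0x1F6, 0x1F7):
    print(f"${address:X} loaded: {address < loaded_longs}")
```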
  • Cluso99 Posts: 18,069
    edited 2012-12-07 14:52
    Oooh! another 2 buried instructions ;)

    Now, provided we don't actually use any I/O, we can also put instructions into PINA, PINB, PINC & PIND (by doing mov instructions of course, since they are all 0's at launch) and get an extra 6 instructions in total ;)

    That is enough for my zero footprint debugger.
  • Ariba Posts: 2,690
    edited 2012-12-07 20:38
    Chip

    I experimented a bit with LMM on the Prop2 and have a few questions:

    1) does the single cog DE0-Nano version simulate the Hub timing for 8 cogs, or can the single cog access the hub on every cycle in this version?
    2) what is the minimal number of cycles between modifying an instruction and executing it? My observation so far is that 2 instructions in between are enough.

    Here is my first attempt at LMM; it executes at 1/5 the clock rate (12MHz for a 60MHz clock):
    lmm   rdlongc instr,pc
          jmpd #lmm
          add pc,#4
    instr nop
          nop
          jmp #lmm   '(if jmpd gets cancelled by LMM code)
    
    Theoretically, on every fourth rdlongc a new quad is read and the timing must then fall on a multiple of 8 clocks, but I don't see this behavior.

    Andy
  • cgracey Posts: 14,232
    edited 2012-12-07 23:13
    Ariba,

    1) Regardless of how few cogs an FPGA implementation has, it always cycles the hub as if there were eight cogs. So, the DE0-Nano board gives its single cog every 8th hub cycle, just as a single cog would get in a complete chip.

    2) You are right about two instructions needing to be between an instruction modifier and the instruction getting modified. I was just writing some pipeline explanation about this:
    PIPELINE
    --------
    
    Each cog has a 4-stage pipeline which all instructions progress through, in order to execute:
    
    
      1st stage    - Read instruction
      2nd stage    - Determine indirect/remapped D and S addresses, update INDA/INDB
      3rd stage    - Read D and S
      4th stage    - Execute instruction, writing D, Z/C/PC, and any other results
    
    
    On every clock cycle, the instruction in each stage advances to the next stage, unless the instruction
    in the 4th stage is stalling the pipeline because it's waiting for something (i.e. WRBYTE waits for
    the hub).
    
    To keep D and S data current within the pipeline, the resultant D from the 4th stage is passed back to
    the 3rd stage to substitute for any obsoleted data being read from the cog register RAM. The same is
    done for instruction data in the 1st stage, but there is still a two-stage gap between when a register
    is modified and it can be executed, at the earliest:
    
    
            MOVD    :inst,top9         'modify instruction
            NOP                        '1...
            NOP                        '2... at least two instructions in-between
    :inst   ADD     A,B                'modified instruction executes
    
    
    Tasks that execute in at least every 3rd time slot don't need to observe this 2-instruction rule because
    their instructions will always be sufficiently spread apart in the pipeline.
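The two-instruction rule can be checked with a small Python model. This is a sketch under stated assumptions: the modifying instruction is fetched at clock 0 and executes at clock 3, the written value is forwarded to an instruction being read in that same clock, and a task that owns every Nth slot fetches once every N clocks. The names `sees_new_value` and `min_gap` are hypothetical.

```python
PIPELINE_DEPTH = 4  # read instruction, indirection, read operands, execute

def sees_new_value(gap, slot_spacing=1):
    """True if the modified instruction is read after (or while) it is written.

    gap          - instructions of the same task between modifier and target
    slot_spacing - the task owns every Nth time slot (1 = every slot)
    """
    modifier_execute_clock = PIPELINE_DEPTH - 1   # modifier fetched at clock 0
    target_fetch_clock = (gap + 1) * slot_spacing
    # Assumed forwarding: a read in the same clock as the write sees new data.
    return target_fetch_clock >= modifier_execute_clock

def min_gap(slot_spacing=1):
    """Smallest number of in-between instructions that is safe in this model."""
    gap = 0
    while not sees_new_value(gap, slot_spacing):
        gap += 1
    return gap

for spacing in (1, 2, 3, 4):
    print(f"task in every slot #{spacing}: needs {min_gap(spacing)} instruction(s) in between")
```

In this model a task that runs in every slot needs two instructions in between, and a task that runs no more often than every 3rd slot needs none.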
    

    Your LMM code looks sensible to me. I would think that every 20 clocks you should suffer a 4-clock delay to get up to a multiple of 8 clocks (24). When RDLONGC needs to do a RDQUAD, it will take 3-10 clocks. What kind of 4-loop period are you seeing?
  • Ariba Posts: 2,690
    edited 2012-12-08 01:32
    Thank you Chip

    Good to know that my findings on the FPGA version are also applicable on the real Prop2.
    cgracey wrote: »
    ....
    Your LMM code looks sensible to me. I would think that every 20 clocks you should suffer a 4-clock delay to get up to a multiple of 8 clocks (24). When RDLONGC needs to do a RDQUAD, it will take 3-10 clocks. What kind of 4-loop period are you seeing?

    I also expected a 24 cycle loop for 4 instructions. My LMM loop is 2 instructions:
    notp #2
        sub pc,#8
    
    so I should get a toggle frequency at Pin 2 of 2.5MHz (60MHz / 24), with a small asymmetry, because 3 instructions need 5 cycles each and one instruction needs 9 cycles. But yesterday my scope showed a higher frequency (more like 3MHz) and a symmetrical signal. I need to verify this again. At the moment I also have a second task running with a 25% timeslot together with the LMM loop from the last post. With this second task I see some jitter in the LMM-generated frequency, but it still works. If I give the second task 50%, then I need to execute the jmpd one instruction later, which makes sense to me.

    Andy
  • Ariba Posts: 2,690
    edited 2012-12-08 02:25
    OK, I verified it and it is really 3MHz.

    But after a while I also found the reason:
    This 2-instruction LMM code just stays in the quad cache all the time, so there is no need to reload it! So rdlongc always takes only 1 cycle.
    This changes if I make the LMM code longer or if the two-instruction loop crosses a quad boundary.
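The expected and measured frequencies can be reproduced with a little arithmetic. The sketch below uses the figures quoted in this thread (60MHz clock, 5 clocks per LMM pass on a cache hit, 9 clocks for a pass that waits on a quad reload) and assumes the pin toggles once every two LMM passes, so four passes make one full period. The names are hypothetical.

```python
CLOCK_HZ = 60_000_000
HIT_CLOCKS = 5     # LMM pass when RDLONGC is served from the quad cache
MISS_CLOCKS = 9    # LMM pass that has to wait for a quad reload

def pin_frequency(pass_clocks):
    """Square-wave frequency for one full period (two toggles, four passes)."""
    period_clocks = sum(pass_clocks)
    return CLOCK_HZ / period_clocks

# Expected: every fourth pass reloads the quad (3 * 5 + 9 = 24 clocks).
expected = pin_frequency([HIT_CLOCKS] * 3 + [MISS_CLOCKS])
# Observed: the two-instruction loop never leaves the cached quad (4 * 5 = 20 clocks).
observed = pin_frequency([HIT_CLOCKS] * 4)

print(f"expected: {expected / 1e6} MHz")
print(f"observed: {observed / 1e6} MHz")
```

With every pass taking the same 5 clocks, the output is also symmetrical, which matches the scope reading.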

    Andy
  • cgracey Posts: 14,232
    edited 2012-12-08 02:30
    Here's a little program that kicks off four tasks running the same code, but with different variable sets.

    Register remapping is set up to remap 4 sets of 4 registers, according to the task executing. For tasks 0..3, hard addresses 0..3 remap to 0..3, 4..7, 8..11, or 12..15.
    dat
            org			'longs are like nop's, get skipped
    
    pin	long	0		'task 0 data
    count	long	1
    delay	long	0
    extra	long	0
    
    	long	1		'task 1 data
    	long	5
    	long	0
    	long	0
    
    	long	2		'task 2 data
    	long	13
    	long	0
    	long	0
    
    	long	3		'task 3 data
    	long	29
    	long	0
    	long	0
    
    	setmap	#%1_010_010	'remap registers by task, 4 sets, 4 registers each
    	settask	#%%3210		'enable all tasks
    	jmptask	#loop,#%1111	'before any newly-started tasks get to execute stage, jump all tasks to loop
    
    loop	notp	pin		'toggle task x pin
    	mov	delay,count	'get task x delay
    	djnz	delay,#$	'count down delay
    	jmp	#loop		'loop (count + 3 clocks)
    

    Task 0 toggles pin 0 every 16 clocks.
    Task 1 toggles pin 1 every 32 clocks.
    Task 2 toggles pin 2 every 64 clocks.
    Task 3 toggles pin 3 every 128 clocks.
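The toggle periods above follow from the data tables and the loop length. Here is a minimal Python sketch of that calculation. It assumes each task gets every 4th clock, each instruction costs one of the task's slots, and only hard addresses 0..3 are remapped. The names `remap` and `toggle_period` are hypothetical.

```python
REGS_PER_SET = 4   # registers remapped per task
TASKS = 4          # all four tasks share the slots equally

# Cog registers 0..15 as initialised by the program above.
cog = [0, 1, 0, 0,    # task 0: pin, count, delay, extra
       1, 5, 0, 0,    # task 1
       2, 13, 0, 0,   # task 2
       3, 29, 0, 0]   # task 3

PIN, COUNT = 0, 1     # hard addresses used by the shared code

def remap(task, address):
    """Map a hard address to the task's own copy (assumed: only 0..3 remap)."""
    if address < REGS_PER_SET:
        return task * REGS_PER_SET + address
    return address

def toggle_period(task):
    """Clocks between toggles: notp + mov + count passes of djnz + jmp."""
    count = cog[remap(task, COUNT)]
    loop_instructions = count + 3
    return loop_instructions * TASKS   # the task only runs in every 4th clock

for task in range(TASKS):
    pin = cog[remap(task, PIN)]
    print(f"task {task} toggles pin {pin} every {toggle_period(task)} clocks")
```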


    I'm going to document register remapping next.