Pipeline/data forwarding issue with PTRx registers? — Parallax Forums

Hi Chip

I stumbled onto a little issue today that had me scratching my head for a while.
It appears to be a pipeline/data forwarding issue with the flags and PTRx register.
I went back to V26 and it has the same symptoms as V28 so it's not a new thing.

In the following code the testb instruction tests the bit ok and sets the C flag as expected.

But this seems to zero the original PTRx value.
The subsequent shr fails.

If I then comment out the testb instruction the shr works fine as expected.

On the other hand, if I place a nop before the testb instruction, it fails too.

I'm guessing it could also be a shadow RAM conflict, as the PTRx registers have
an indexing mechanism as well.

If I substitute PTRx with any other register it all works OK.

Am I breaking the rules again? Sorry to be a pain.
dat	org

	bmask	dirb,#15		'enable leds
	mov	ptra,##$f100_0000
'	nop
	testb	ptra,#24 wc
	shr	ptra,#24 wz
	outnz	#40			'show c and z results on leds
	outc	#41
	waitx	##80_000_000		'wait a while then show result
	mov	outb,ptra
	jmp	#$

Comments

  • cgracey Posts: 14,222
    edited 2017-12-12 06:41
    I think this is okay.

    PTRA and PTRB are only 20-bit registers, so their top 12 bits always read 0.
  • evanh Posts: 16,056
    Oh, so Oz is kind of breaking a rule. Something's odd with the SHR being able to access the shadow RAM data when it immediately follows the MOV, but not beyond that.
  • evanh Posts: 16,056
    And likewise, the TESTB can see the data when it immediately follows the MOV.
  • Cluso99 Posts: 18,069
    Sounds like the data forwarding in the pipeline for TESTB and SHR is not being limited to the 20 bits when referencing PTRx.
  • So it's 32 bits in the pipe and 20 bits outside the pipe. :)
  • cgracey Posts: 14,222
    edited 2017-12-12 08:57
    Ah, I didn't understand, at first, what you were meaning.

    While PTRA and PTRB are only 20 bits, they can receive ALU results which are 32 bits. When ALU results are being written to a register and the next instruction in the pipeline is referencing that same register, the ALU data is forwarded to the next instruction's S and/or D value(s). There is no time to mask the ALU data down to 20 bits for the cases of PTRA and PTRB, as the ALU paths are the longest within the cog, already, and cannot stand to be made any longer.

    So, this is an anomaly, but a rather harmless one. This can be documented, in case anyone runs into this and supposes something is wrong.
  • cgracey Posts: 14,222
    Cluso99 wrote: »
    Sounds like the data forwarding in the pipeline for TESTB and SHR is not being limited to the 20 bits when referencing PTRx.

    That's right.
  • cgracey Posts: 14,222
    edited 2017-12-12 09:33
    I just looked into expanding PTRA and PTRB to 32 bits, in order to get around this issue. The bottom 20 bits would act the same, while the top 12 bits would just be data that could be written.

    Do you guys see much value in doing this?

    It would make PTRA and PTRB more like other registers, in that they'd be 32 bits. The bottom 20 bits would automatically update in some situations, though.
  • Cluso99 Posts: 18,069
    edited 2017-12-12 09:44
    cgracey wrote: »
    I just looked into expanding PTRA and PTRB to 32 bits, in order to get around this issue. The bottom 20 bits would act the same, while the top 12 bits would just be data that could be written.

    Do you guys see much value in doing this?

    It would make PTRA and PTRB more like other registers, in that they'd be 32 bits. The bottom 20 bits would automatically update in some situations, though.
    Is there much silicon cost?

    Seems like a more expected behaviour. While only 20 bits of PTRx would be used for addressing, what would happen if PTRx increments causing an overflow? Would b20 (21st bit) increment, or would it be lost? I guess either way is acceptable anyway.

    Might even be some tricks here :)
  • cgracey Posts: 14,222
    It's done!

    What FPGA image would you like this in? I figure either Prop123_A9 or BeMicro_A9. Which one first?
  • Cluso99 Posts: 18,069
    edited 2017-12-12 09:50
    I have been meaning to ask. I have forgotten the depth of the internal cog stack - was it 8?

    Might there be any point in
    1. Increasing the depth?
    2. Increasing the width from 22? to 32 bits?

    Just wondering since the reduction to 8 cogs might have freed some spare silicon space. Presuming of course it's only a parameter change.
  • cgracey Posts: 14,222
    Cluso99 wrote: »
    cgracey wrote: »
    I just looked into expanding PTRA and PTRB to 32 bits, in order to get around this issue. The bottom 20 bits would act the same, while the top 12 bits would just be data that could be written.

    Do you guys see much value in doing this?

    It would make PTRA and PTRB more like other registers, in that they'd be 32 bits. The bottom 20 bits would automatically update in some situations, though.
    Is there much silicon cost?

    Seems like a more expected behaviour. While only 20 bits of PTRx would be used for addressing, what would happen if PTRx increments causing an overflow? Would b20 (21st bit) increment, or would it be lost? I guess either way is acceptable anyway.

    Might even be some tricks here :)

    The incrementing would be limited to the bottom 20 bits. The PTRx registers already flirt with the critical path. There's no time to increment a full 32 bits. We are at the limit there, already. Those top 12 bits will just be normal RAM. They will be affected by ALU writing, but not by auto-inc/dec behavior.
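The split between the auto-updating bottom 20 bits and the plain-data top 12 bits can be illustrated with a short Python sketch. The function name `auto_inc` and its `step` parameter are hypothetical; the only behaviour taken from the post is that increments and decrements wrap within the bottom 20 bits and never carry into the top 12.

```python
MASK20 = 0xFFFFF        # address bits that auto-increment/decrement
MASK32 = 0xFFFFFFFF     # full width of the widened PTRx register


def auto_inc(ptr, step=1):
    """Return ptr after an auto-inc/dec that is confined to the low 20 bits."""
    top = ptr & MASK32 & ~MASK20          # top 12 bits: untouched by auto-inc/dec
    low = (ptr + step) & MASK20           # low 20 bits wrap, no carry into bit 20
    return top | low


if __name__ == "__main__":
    print(hex(auto_inc(0xABCFFFFF)))      # low bits wrap to zero, top stays $ABC
```

An ALU write such as MOV would still replace all 32 bits; only the automatic pointer updates are confined to the low field in this model.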
  • cgracey Posts: 14,222
    edited 2017-12-12 09:57
    Cluso99 wrote: »
    I have been meaning to ask. I have forgotten the depth of the internal cog stack - was it 8?

    Might there be any point in
    1. Increasing the depth?
    2. Increasing the width from 22? to 32 bits?

    Just wondering since the reduction to 8 cogs might have freed some spare silicon space. Presuming of course it's only a parameter change.

    It's 8 levels deep.

    We could increase it to 32 bits, instead of 22. It will take 10 bits * 8 levels * 8 cogs = 640 more flipflops.
  • cgracey Posts: 14,222
    The hardware stack is now 32 bits wide.

    This means there are no more undersized registers anywhere.
  • cgracey Posts: 14,222
    edited 2017-12-12 10:16
    I have an 8-cog/64-smart-pin BeMicro_A9 image compiling now. It has all 32-bit PTRA, PTRB, and hardware stack registers.

    When I get up in the morning, it will be done and I'll post it.
  • Wow!
    A 32 bit wide stack will make a big difference.
    Keeping everything symmetrical makes life easier.
    Thanks Chip! :)

  • cgracey Posts: 14,222
    You bet. Is CV-A9 what you want?
  • Cluso99 Posts: 18,069
    edited 2017-12-12 10:38
    Thanks Chip. The stack being 32 bits wide will give better use for the PUSH and POP instructions.

    Isn't the PAR register also short?
  • P123-A9 is my primary development platform, but I'm working on the debugger currently, so I'm not that fussy at the moment. :)

  • cgracey Posts: 14,222
    edited 2017-12-12 10:46
    Cluso99 wrote: »
    Thanks Chip. The stack being 32 bits wide will give better use for the PUSH and POP instructions.

    Isn't the PAR register also short?

    We have PTRA and PTRB on Prop2, and no PAR, right?

    Wait, you mean the width of the parameters from COGINIT, right? Those can be widened, too.
  • cgracey Posts: 14,222
    ozpropdev wrote: »
    P123-A9 is my primary development platform, but I'm working on the debugger currently, so I'm not that fussy at the moment. :)

    I've got CV-A9 in the oven and I'm going to bed. I'll recompile everything else later.
  • cgracey wrote: »
    I've got CV-A9 in the oven and I'm going to bed. I'll recompile everything else later.

    I will have me a slice of that CVA9 pie when it's out of the oven and cooling on the Google Drive rack.
  • No worries Chip.
    The wider PTRx will enhance SETQ/COGINIT nicely too.
  • Cluso99 Posts: 18,069
    edited 2017-12-12 12:41
    cgracey wrote: »
    Cluso99 wrote: »
    Thanks Chip. The stack being 32 bits wide will give better use for the PUSH and POP instructions.

    Isn't the PAR register also short?

    We have PTRA and PTRB on Prop2, and no PAR, right?

    Wait, you mean the width of the parameters from COGINIT, right? Those can be widened, too.

    Yes. Thanks.

    The widening to 32 bits will simplify things for the users. Fewer restrictions and lurking caveats is excellent news.

    BTW I am on CVA9 too. Unless someone is testing another board, maybe you can save compiling the others for now.
  • cgracey Posts: 14,222
    edited 2017-12-12 18:10
    Here is the BeMicro_A9 8-cog version with 32-bit wide PTRA, PTRB, and hardware-stack registers:

    https://drive.google.com/file/d/1l7yWVpljQN8OTV7d4Mok3ukAdjTwmoDX/view?usp=sharing

    Could you please see if this works as expected? Tonight I will get to widening the COGINIT conduit.

    Thanks.
  • jmg Posts: 15,175
    cgracey wrote: »
    I just looked into expanding PTRA and PTRB to 32 bits, in order to get around this issue. The bottom 20 bits would act the same, while the top 12 bits would just be data that could be written.

    Do you guys see much value in doing this?

    It would make PTRA and PTRB more like other registers, in that they'd be 32 bits. The bottom 20 bits would automatically update in some situations, though.

    Sure, one immediate benefit of holding 32 bits is better memory management.
    A simple routine can be called to check PTRx: if the address is on-chip, a single opcode is used; if off-chip, more code is employed to get that R/W done.
    The user does not need to care where the data is.
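The dispatch idea in this post might look something like the following Python sketch. Everything here is an assumption made for illustration: the thread does not define how an off-chip address would be encoded, so the sketch simply treats any pointer with non-zero top 12 bits as off-chip, and `hub_ram`, `external_ram`, `read_long` and `write_long` are hypothetical stand-ins rather than real P2 facilities.

```python
MASK20 = 0xFFFFF

hub_ram = {}        # stand-in for on-chip hub RAM, keyed by 20-bit address
external_ram = {}   # stand-in for some off-chip memory, keyed by 32-bit address


def is_on_chip(ptr):
    # Assumption: top 12 bits all zero means the address is in hub RAM.
    return (ptr >> 20) == 0


def read_long(ptr):
    """Read a value without the caller needing to know where it lives."""
    if is_on_chip(ptr):
        return hub_ram.get(ptr & MASK20, 0)      # the "single opcode" path
    return external_ram.get(ptr, 0)              # the "more code" path


def write_long(ptr, value):
    """Write a value, routing on the top 12 bits of the pointer."""
    if is_on_chip(ptr):
        hub_ram[ptr & MASK20] = value
    else:
        external_ram[ptr] = value


if __name__ == "__main__":
    write_long(0x00001000, 0x11)
    write_long(0x40001000, 0x22)
    print(hex(read_long(0x00001000)), hex(read_long(0x40001000)))
```

The point of the sketch is only that a full 32-bit pointer value leaves room to encode the routing decision in the pointer itself.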
  • cgracey Posts: 14,222
    jmg wrote: »
    cgracey wrote: »
    I just looked into expanding PTRA and PTRB to 32 bits, in order to get around this issue. The bottom 20 bits would act the same, while the top 12 bits would just be data that could be written.

    Do you guys see much value in doing this?

    It would make PTRA and PTRB more like other registers, in that they'd be 32 bits. The bottom 20 bits would automatically update in some situations, though.

    Sure, one immediate benefit of holding 32 bits is better memory management.
    A simple routine can be called to check PTRx: if the address is on-chip, a single opcode is used; if off-chip, more code is employed to get that R/W done.
    The user does not need to care where the data is.

    Good point.
  • Looks good Chip!
    Stack and PTRx widening working as expected.
    			push	##$f0000000
    			push	##$ff000000
    			push	##$fff00000
    			push	##$ffff0000
    			push	##$fffff000
    			push	##$ffffff00
    			push	##$fffffff0
    			push	##$ffffffff
    
    			pop	$100
    			pop	$101
    			pop	$102
    			pop	$103
    			pop	$104
    			pop	$105
    			pop	$106
    			pop	$107
    

    Resulted in
    -----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    --,00000008 CZ,00300270 --,00000006 --,00000000 --,00000000 --,00000000 --,00000000 --,00000000
    00004: FFF80000              AUGD    #$780000
    00005: FD64002A              PUSH    #$000 {$F0000000}
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    --,F0000000 --,00000008 CZ,00300270 --,00000006 --,00000000 --,00000000 --,00000000 --,00000000
    00006: FFFF8000              AUGD    #$7F8000
    00007: FD64002A              PUSH    #$000 {$FF000000}
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    --,FF000000 --,F0000000 --,00000008 CZ,00300270 --,00000006 --,00000000 --,00000000 --,00000000
    00008: FFFFF800              AUGD    #$7FF800
    00009: FD64002A              PUSH    #$000 {$FFF00000}
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFF00000 --,FF000000 --,F0000000 --,00000008 CZ,00300270 --,00000006 --,00000000 --,00000000
    0000A: FFFFFF80              AUGD    #$7FFF80
    0000B: FD64002A              PUSH    #$000 {$FFFF0000}
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000 --,00000008 CZ,00300270 --,00000006 --,00000000
    0000C: FFFFFFF8              AUGD    #$7FFFF8
    0000D: FD64002A              PUSH    #$000 {$FFFFF000}
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFFF000 CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000 --,00000008 CZ,00300270 --,00000006
    0000E: FFFFFFFF              AUGD    #$7FFFFF
    0000F: FD66002A              PUSH    #$100 {$FFFFFF00}
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFFFF00 CZ,FFFFF000 CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000 --,00000008 CZ,00300270
    00010: FFFFFFFF              AUGD    #$7FFFFF
    00011: FD67E02A              PUSH    #$1F0 {$FFFFFFF0}
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFFFFF0 CZ,FFFFFF00 CZ,FFFFF000 CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000 --,00000008
    00012: FFFFFFFF              AUGD    #$7FFFFF
    00013: FD67FE2A              PUSH    #$1FF {$FFFFFFFF}
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFFFFFF CZ,FFFFFFF0 CZ,FFFFFF00 CZ,FFFFF000 CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000
    00014: FD62002B              POP     $100
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFFFFF0 CZ,FFFFFF00 CZ,FFFFF000 CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000 --,F0000000
    00015: FD62022B              POP     $101
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFFFF00 CZ,FFFFF000 CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000 --,F0000000 --,F0000000
    00016: FD62042B              POP     $102
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFFF000 CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000
    00017: FD62062B              POP     $103
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFFF0000 CZ,FFF00000 --,FF000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000
    00018: FD62082B              POP     $104
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    CZ,FFF00000 --,FF000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000
    00019: FD620A2B              POP     $105
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    --,FF000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000
    0001A: FD620C2B              POP     $106
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000
    0001B: FD620E2B              POP     $107
    (? for help) >-----------------------------------------------------------------------------------------(P2 Debugger)
    STACK :
    --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000 --,F0000000
    0001C: FD9FFFFC              JMP     #$FFFFC (ABS addr = $1C)
    (? for help) >REG 100 107
    100: $FFFFFFFF %11111111_11111111_11111111_11111111 #4294967295 #-1
    101: $FFFFFFF0 %11111111_11111111_11111111_11110000 #4294967280 #-16
    102: $FFFFFF00 %11111111_11111111_11111111_00000000 #4294967040 #-256
    103: $FFFFF000 %11111111_11111111_11110000_00000000 #4294963200 #-4096
    104: $FFFF0000 %11111111_11111111_00000000_00000000 #4294901760 #-65536
    105: $FFF00000 %11111111_11110000_00000000_00000000 #4293918720 #-1048576
    106: $FF000000 %11111111_00000000_00000000_00000000 #4278190080 #-16777216
    107: $F0000000 %11110000_00000000_00000000_00000000 #4026531840 #-268435456
    (? for help) >
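The stack behaviour visible in the debugger listing above can be summarised in a small Python model. This is a sketch inferred only from that listing, not from P2 documentation: the `HwStack` class is hypothetical, and it assumes that a push shifts every level down (dropping the bottom one) and that a pop shifts every level up while the bottom level keeps its value.

```python
class HwStack:
    """Toy model of the 8-level, 32-bit hardware stack seen in the listing."""

    DEPTH = 8

    def __init__(self):
        self.slots = [0] * self.DEPTH          # slots[0] is the top of stack

    def push(self, value):
        # New value enters at the top; the bottom level falls off.
        self.slots = [value & 0xFFFFFFFF] + self.slots[:-1]

    def pop(self):
        # Top value leaves; everything moves up and the bottom level is repeated.
        top = self.slots[0]
        self.slots = self.slots[1:] + [self.slots[-1]]
        return top


def run_test():
    values = [0xF0000000, 0xFF000000, 0xFFF00000, 0xFFFF0000,
              0xFFFFF000, 0xFFFFFF00, 0xFFFFFFF0, 0xFFFFFFFF]
    stack = HwStack()
    for v in values:
        stack.push(v)
    popped = [stack.pop() for _ in values]     # corresponds to registers $100..$107
    return popped, stack.slots


if __name__ == "__main__":
    popped, remaining = run_test()
    print([hex(v) for v in popped])
    print([hex(v) for v in remaining])
```

In this model the eight popped values come back in reverse push order with all 32 bits intact, and the stack is left filled with $F0000000, which is what the REG dump and the final STACK line show.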
    
  • cgracey Posts: 14,222
    Super! Thanks.