Chip
First test of "jump on event bug" looks good on V33i on the P123-A9 FPGA.
I went back to Rev.A silicon to verify, and the bug appeared there, so it looks OK so far.
I had one version of my code that never branched (0%) on an edge event, with only the sporadic fall-through breaking the REP. I've double-checked that it still does that on v32i.
With the exact same code, just recompiled with Pnut33g and run on the v33i FPGA, it works flawlessly, 100% of the time. Not a single fall-through either. Perfect fix for me.
Ozpropdev and Evanh, thank you for checking these things. I'm really glad these problems got addressed, thanks mainly to your persistence. To have LUT sharing working right, along with event jumps, is a big improvement.
Thanks, Evanh. I looked all that over. I also looked at the Verilog code. I don't feel like this would be worth doing, at this point. Thanks for bringing it up, again, though.
Although I didn't think so at first, I believe Evan's idea is a very good one indeed. The rest of us can't see the Verilog but if all other things are equal in terms of the logic then I'm convinced that D is much better than S.
D is ideal for direct arithmetic and writing to LUT or hub RAM or pins. It also gives us in effect three operands in one instruction, with benefits that are yet to be fully appreciated.
If S is chosen there will be extra instructions that could have been avoided and other users in the future will be asking themselves "why wasn't D chosen instead?" but by then it will be too late to change.
* * * * * * * * * *
The Spin2 interpreter has SCA and SCAS but they use the Cordic. Are there any real-world PASM examples for the SCA instruction? Would it make any meaningful difference if the high word is not zero? Specifically, if the 16-bit right shift were replaced by a 16-bit rotate, then a 32x32 multiply with 64-bit result would be faster. Please see http://forums.parallax.com/discussion/169698/substantially-faster-shorter-multiply#latest
The great thing about SCA is that it doesn't change the operands being multiplied, so a MUL plus a shift/rotation is not equivalent.
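To make the shift-versus-rotate question concrete, here is a rough behavioural model in Python, not PASM. It assumes MUL produces the unsigned 16x16 product of the low words and that SCA hands the next instruction that product shifted right by 16. `sca_rotate` is the hypothetical variant being proposed, and all the function names are made up for illustration. It only shows what each form keeps and how four 16x16 partial products combine into a 64-bit result; it says nothing about cycle counts.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul16(d, s):
    """Unsigned 16x16 -> 32-bit product of the low words (assumed MUL behaviour)."""
    return (d & MASK16) * (s & MASK16)

def sca_shift(d, s):
    """Product shifted right by 16: the high word of the result is always zero."""
    return mul16(d, s) >> 16

def sca_rotate(d, s):
    """Hypothetical variant: rotate by 16, so the low word of the product
    lands in the high word instead of being discarded."""
    p = mul16(d, s)
    return ((p >> 16) | (p << 16)) & MASK32

def mul32x32(a, b):
    """32x32 -> 64-bit unsigned multiply built from four 16x16 partial products.
    Returns (low 32 bits, high 32 bits)."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    ll = mul16(a_lo, b_lo)
    lh = mul16(a_lo, b_hi)
    hl = mul16(a_hi, b_lo)
    hh = mul16(a_hi, b_hi)
    result = ll + ((lh + hl) << 16) + (hh << 32)
    return result & MASK32, (result >> 32) & MASK32

lo, hi = mul32x32(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(hi), hex(lo))
print(hex(sca_shift(0xFFFF, 0xFFFF)), hex(sca_rotate(0xFFFF, 0xFFFF)))
```

The rotate form keeps all 32 bits of the partial product in one value, which is what would let a following instruction use both halves without a separate MUL.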
The thing is, ON is now compiling what I last gave them and after looking at the areas in the Verilog that I would need to change, I'd need the room to take a deep breath and approach this. I don't have that right now and I don't want to restart the process of ON getting the layout going. If the opportunity arises, maybe we can change this, but I don't want to do it right now.
Chip, I appreciate that the last train has probably left the station and unless something important needs fixing the design is done. I mentioned the SCA rotation only as a consequence of a post today about faster multiplies. The interesting point is this change to SCA saves code if the result goes to the next instruction's D, but I can find no saving if it goes to S. This confirms to me that D is intrinsically superior even though S seems the more obvious choice. These are my final thoughts on this subject and thanks for considering the various ideas that have been suggested by everyone.
TonyB_, it probably IS better. If we have the opportunity, we can do this. Let's see what happens with how things are going.
You are missing out on something there. The P1 is a funny little beast and not that much different from the P2, at least from the point of view of programming a multicore chip.
I am just stepping up and have a lot of fun with the P2.
Note on most modern architectures, stacks grow down, heaps grow up. I don't think I've seen a
compiler that does it otherwise. If you only have positive offset addressing it pretty much forces
the choice as local variables are addressed from the SP, rather than maintain a separate FP.
I just tried out V33 with my 90's 3D code...
Didn't work at first, but after some WinDiff on the VGA example, I found I needed to change these lines to match and now it works:
m_bs long $7F010000+16 'before sync
m_sn long $7F010000+96 'sync
m_bv long $7F010000+48 'before visible
m_vi long $7F010000+640 'visible
m_rf long $7F080000+640 'visible rlong 8bpp lut
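As a quick sanity check on those constants, here is a small Python sketch that splits each long into its upper and lower 16 bits and adds up the per-line counts. It assumes the lower 16 bits are the pixel count for that part of the scan line and the upper 16 bits select the streamer mode; the dictionary and the `split` helper are just for illustration.

```python
# Values copied from the DAT lines above ($ hex in Spin2 becomes 0x here).
timing = {
    "m_bs": 0x7F010000 + 16,   # before sync
    "m_sn": 0x7F010000 + 96,   # sync
    "m_bv": 0x7F010000 + 48,   # before visible
    "m_vi": 0x7F010000 + 640,  # visible
    "m_rf": 0x7F080000 + 640,  # visible rlong 8bpp lut
}

def split(cmd):
    """Return (upper 16 bits, lower 16 bits) of a 32-bit command long."""
    return (cmd >> 16) & 0xFFFF, cmd & 0xFFFF

# m_rf covers the same visible span as m_vi, so it is left out of the line total.
line_total = sum(split(timing[name])[1] for name in ("m_bs", "m_sn", "m_bv", "m_vi"))

for name, cmd in timing.items():
    mode, count = split(cmd)
    print(f"{name}: mode=${mode:04X} count={count}")
print("pixels per line:", line_total)
```

The counts add up to 800 pixel clocks per line, which matches the usual 640x480 horizontal timing (16 + 96 + 48 + 640).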
I posted the final logic version at the top of this thread, which includes, lastly, the extra register on each IN signal from the pins. This is to guard against metastability in the final silicon. Note that this adds one clock period to the IN signals.
Peter and Ray, we need to use this version to finish developing the ROM code.
I just updated the main file at the top of the thread.
There is a new PNut_v33h.exe which allows a full 1MB download on the -A9 boards to permit ROM updating. This is necessary for PeterJakacki and Cluso99. Others don't need to care about this.
Comments
How about the jump-on-event bug?
Confirming all good with the lut-sharing. Tested all eight cogs, four patterns each.
I will run some more tests soon.
You bet. Thanks for trying it out.
If there are no more logic changes, we'll have one more update with the final ROM installed.
What was the rationale behind CALLA/CALLB stacking upwards instead of downwards in hubRAM?
PS: I note WRLONG can stack up or down, so PUSHA, for example, could be aliased either way around. But not so for CALLx and RETx.
CALL = PUSH = add on top = grow upwards
and https://forums.parallax.com/discussion/169697/rom-changes-for-next-silicon/p1
I guess
Mike
Enjoy!
Mike
Z80, 8080, 8086, 68000 all pre-decrement SP on push. I think one notable exception in the microprocessor world is the 8051.
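For anyone trying to picture the difference, here is a toy Python model of the two conventions. It assumes the upward stack writes and then post-increments the pointer (as CALLA/CALLB appear to do with PTRA/PTRB) and the downward stack pre-decrements and then writes (as on the Z80 or 68000). The `ToyStack` class and its method names are hypothetical, and memory is modelled as a list of slots rather than byte addresses.

```python
class ToyStack:
    """Toy stack in a flat slot-addressed memory, growing up or down."""

    def __init__(self, size, grows_up):
        self.mem = [0] * size
        self.grows_up = grows_up
        # Upward stacks start at the bottom, downward stacks at the top.
        self.sp = 0 if grows_up else size

    def push(self, value):
        if self.grows_up:
            self.mem[self.sp] = value  # write, then post-increment
            self.sp += 1
        else:
            self.sp -= 1               # pre-decrement, then write
            self.mem[self.sp] = value

    def pop(self):
        if self.grows_up:
            self.sp -= 1               # pre-decrement, then read
            return self.mem[self.sp]
        value = self.mem[self.sp]      # read, then post-increment
        self.sp += 1
        return value

    def offset_from_sp(self, n):
        """Signed offset from SP to the n-th most recent item (0 = newest)."""
        return -(n + 1) if self.grows_up else n

up, down = ToyStack(8, grows_up=True), ToyStack(8, grows_up=False)
for s in (up, down):
    s.push(10)
    s.push(20)
print("up:   sp =", up.sp, "newest at offset", up.offset_from_sp(0))
print("down: sp =", down.sp, "newest at offset", down.offset_from_sp(0))
```

With the downward stack every live item sits at a zero or positive offset from SP, which is the point about positive-offset addressing. With the upward stack the same items sit at negative offsets.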
The FPGA image files are all numbered v32j, not v33j. Is that just a typo or worse?
Whoops! That should be v33j. I will fix that this morning. Thanks for noticing.
It was just mislabeled. I fixed it. Sorry about that.