New FPGA files for next silicon version - 5th/final release - contains new ROM!!

cgracey · 2019-02-01 05:54

Ok. Great!

How about the jump-on-event bug?

ozpropdev · 2019-02-01 06:09

cgracey wrote: »

Ok. Great!

How about the jump-on-event bug?

About to start testing that now….

evanh · 2019-02-01 06:44

cgracey wrote: »

If you can, please verify that the LUT-sharing bug is fixed, as well as the JMP-event-within-REP bug.

Confirming all good with the LUT sharing. Tested all eight cogs, four patterns each.

ozpropdev · 2019-02-01 06:45

Chip
First test of "jump on event bug" looks good on V33i on the P123-A9 FPGA.
I went back to Rev.A silicon to verify, and the bug appeared there, so it looks OK so far.

I will run some more tests soon.

evanh · 2019-02-01 08:41

I had one version of my code that branched 0% of the time on an edge event, with only the sporadic fall-through breaking the REP. I've double-checked that it still does that on v32i.

With the exact same code, just recompiled with Pnut33g and run on the v33i FPGA image, it works flawlessly, 100% of the time. Not a single fall-through either. Perfect fix for me.

cgracey · 2019-02-01 11:58

Ozpropdev and Evanh, thank you for checking these things. I'm really glad these problems got addressed, thanks mainly to your persistence. To have LUT sharing working right, along with event jumps, is a big improvement.

Publison · 2019-02-01 14:41

V33i now reports 8 COGs, 512K, 63 Smart Pins. Thanks Chip!

cgracey · 2019-02-01 16:39

Publison wrote: »

V33i now reports 8 COGs, 512K, 63 Smart Pins. Thanks Chip!

You bet. Thanks for trying it out.

If there are no more logic changes, we'll have one more update with the final ROM installed.

TonyB_ · 2019-02-01 20:42

cgracey wrote: »

evanh wrote: »

evanh wrote: »

There was an idea or two I had but they weren't of much significance, or too big.

One idea that would be nice to have is changing XORO32 and SCA results to feeding next D input instead of next S input.

Yes. I looked into this. It's doable, but I wasn't convinced of its benefit. Could you please refresh me on this? A link would do. Thanks, Evanh.

cgracey wrote: »

evanh wrote: »

Here's the topic - https://forums.parallax.com/discussion/169585/xoro32-scrambler-output/p1

Tony felt it was a good idea - https://forums.parallax.com/discussion/comment/1461517/#Comment_1461517

Thanks, Evanh. I looked all that over. I also looked at the Verilog code. I don't feel like this would be worth doing, at this point. Thanks for bringing it up, again, though.

Although I didn't think so at first, I believe Evan's idea is a very good one indeed. The rest of us can't see the Verilog but if all other things are equal in terms of the logic then I'm convinced that D is much better than S.

D is ideal for direct arithmetic and writing to LUT or hub RAM or pins. It also gives us in effect three operands in one instruction, with benefits that are yet to be fully appreciated.

If S is chosen there will be extra instructions that could have been avoided and other users in the future will be asking themselves "why wasn't D chosen instead?" but by then it will be too late to change.

* * * * * * * * * *

The Spin2 interpreter has SCA and SCAS but they use the Cordic. Are there any real-world PASM examples for the SCA instruction? Would it make any meaningful difference if the high word is not zero? Specifically, if the 16-bit right shift were replaced by a 16-bit rotate, then a 32x32 multiply with 64-bit result would be faster. Please see http://forums.parallax.com/discussion/169698/substantially-faster-shorter-multiply#latest

The great thing about SCA is that it doesn't change the operands being multiplied, so a MUL plus a shift/rotation is not equivalent.
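
To make the shift-versus-rotate question concrete, here is a minimal Python model. It assumes SCA takes the unsigned product of the low 16 bits of D and S and passes that product, shifted right by 16, to the next instruction. The rotate variant is only the proposal under discussion, not existing hardware behaviour, and all function names are made up for illustration. The 64-bit multiply at the end is a plain schoolbook version, not the faster routine from the linked thread; it is only there to show why keeping all 32 bits of each partial product matters.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def rotate16(x):
    """Rotate a 32-bit value by 16 bits (swap its two 16-bit halves)."""
    return ((x >> 16) | (x << 16)) & MASK32

def sca_shift(d, s):
    """Assumed SCA behaviour: (D[15:0] * S[15:0]) >> 16, high word of result is zero."""
    product = (d & MASK16) * (s & MASK16)   # unsigned 32-bit product
    return product >> 16                    # low 16 bits of the product are lost

def sca_rotate(d, s):
    """Hypothetical variant: rotate the 32-bit product right by 16 instead of shifting."""
    product = (d & MASK16) * (s & MASK16)
    return rotate16(product)                # low word of the product moves to the top

def mul_32x32(a, b):
    """Schoolbook 32x32 -> 64-bit multiply from four full 16x16 partial products."""
    a &= MASK32
    b &= MASK32
    al, ah = a & MASK16, a >> 16
    bl, bh = b & MASK16, b >> 16
    # Rotating the hypothetical result back recovers the full 32-bit product.
    full = lambda x, y: rotate16(sca_rotate(x, y))
    return (full(al, bl)
            + ((full(al, bh) + full(ah, bl)) << 16)
            + (full(ah, bh) << 32))
```

With the shift form, only the upper half of each 16x16 product survives, so a full-width multiply has to get the lower half some other way. With the rotate form, both halves are available in one result.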

cgracey · 2019-02-01 21:43


The thing is, ON is now compiling what I last gave them and after looking at the areas in the Verilog that I would need to change, I'd need the room to take a deep breath and approach this. I don't have that right now and I don't want to restart the process of ON getting the layout going. If the opportunity arises, maybe we can change this, but I don't want to do it right now.

TonyB_ · 2019-02-02 02:22


Chip, I appreciate that the last train has probably left the station and unless something important needs fixing the design is done. I mentioned the SCA rotation only as a consequence of a post today about faster multiplies. The interesting point is this change to SCA saves code if the result goes to the next instruction's D, but I can find no saving if it goes to S. This confirms to me that D is intrinsically superior even though S seems the more obvious choice. These are my final thoughts on this subject and thanks for considering the various ideas that have been suggested by everyone.

cgracey · 2019-02-02 03:17


TonyB_, it probably IS better. If we have the opportunity, we can do this. Let's see what happens with how things are going.

evanh · 2019-02-02 05:54

Chip,
What was the rationale behind CALLA/CALLB stacking upwards instead of downwards in hubRAM?

PS: I note WRLONG can stack up or down, so PUSHA, for example, could be aliased either way around. But not so for CALLx and RETx.

cgracey · 2019-02-02 06:17

evanh wrote: »

Chip,
What was the rationale behind CALLA/CALLB stacking upwards instead of downwards in hubRAM?

PS: I note WRLONG can stack up or down, so PUSHA, for example, could be aliased either way around. But not so for CALLx and RETx.

CALL = PUSH = add on top = grow upwards

evanh · 2019-02-02 06:19

Why up instead of down? Stacks aren't usually placed at the start of the memory space.

evanh · 2019-02-02 06:28

Hmm, maybe the hubRAM stack should be de-facto right at the beginning. That way the system parameters being discussed in the other topic can be the first items on it - https://forums.parallax.com/discussion/169714/p2-mailbox-and-parameters-where-to-place-and-what-is-needed/p1
and https://forums.parallax.com/discussion/169697/rom-changes-for-next-silicon/p1

msrobots · 2019-02-02 07:23

It is because SPIN did this on the P1, having the stack behind the program and pushing upwards.

I guess

Mike

evanh · 2019-02-02 07:42

Ah, I see. I never did get round to doing anything with the prop1.

msrobots · 2019-02-02 07:49

You are missing out on something there. The P1 is a funny little beast and not that much different from the P2, at least from the viewpoint of programming a multi-core chip.

I am just stepping up and having a lot of fun with the P2.

Enjoy!

Mike

Mark_T · 2019-02-02 12:22

cgracey wrote: »

CALL = PUSH = add on top = grow upwards

Note that on most modern architectures, stacks grow down and heaps grow up. I don't think I've seen a compiler that does it otherwise. If you only have positive-offset addressing, it pretty much forces the choice, as local variables are addressed from the SP rather than maintaining a separate FP.
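
The addressing point can be illustrated with a small Python sketch. Memory is modelled as a list of longs, and both classes and their `local` helper are hypothetical names for this example only. The upward stack uses the write-then-increment order described for CALLA/CALLB in this thread; the downward stack uses the pre-decrement order of the other CPUs mentioned.

```python
class UpStack:
    """Push writes at SP, then SP moves up (post-increment)."""
    def __init__(self, mem, base):
        self.mem, self.sp = mem, base

    def push(self, value):
        self.mem[self.sp] = value
        self.sp += 1

    def pop(self):
        self.sp -= 1
        return self.mem[self.sp]

    def local(self, n):
        # n-th most recent item lives BELOW sp: needs a negative offset
        return self.mem[self.sp - 1 - n]


class DownStack:
    """Push moves SP down first, then writes (pre-decrement)."""
    def __init__(self, mem, top):
        self.mem, self.sp = mem, top

    def push(self, value):
        self.sp -= 1
        self.mem[self.sp] = value

    def pop(self):
        value = self.mem[self.sp]
        self.sp += 1
        return value

    def local(self, n):
        # n-th most recent item lives AT or ABOVE sp: positive offset only
        return self.mem[self.sp + n]
```

In the downward case every live item is reachable as `sp + n` with `n >= 0`, which is why hardware that only offers positive offsets pairs naturally with stacks that grow down.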

Rayman · 2019-02-02 12:49

I think I remember that the 80x86 has a push instruction that adds to the stack pointer register...

Mark_T · 2019-02-02 17:19

Rayman wrote: »

I think I remember that the 80x86 has a push instruction that adds to the stack pointer register...

Z80, 8080, 8086, 68000 all pre-decrement SP on push. I think one notable exception in the microprocessor world is the 8051.

Rayman · 2019-02-02 23:46

You're right... Guess I remembered wrong, been a while..

Rayman · 2019-02-03 01:16

I just tried out V33 with my 90's 3D code...
Didn't work at first, but after some WinDiff on the VGA example, I found I needed to change these lines to match and now it works:

m_bs        long    $7F010000+16        'before sync
m_sn        long    $7F010000+96        'sync
m_bv        long    $7F010000+48        'before visible
m_vi        long    $7F010000+640       'visible

m_rf        long    $7F080000+640       'visible rlong 8bpp lut
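
As a quick sanity check on those values, the Python sketch below splits each long into its upper and lower 16 bits. Treating the lower half as a count is an assumption drawn only from the `+N` notation and the comments in the post; the meaning of the upper half is not interpreted here. The names `COMMANDS` and `split_command` are illustrative.

```python
# Values copied from the post above; the field split is an assumption.
COMMANDS = {
    "m_bs": 0x7F010000 + 16,    # before sync
    "m_sn": 0x7F010000 + 96,    # sync
    "m_bv": 0x7F010000 + 48,    # before visible
    "m_vi": 0x7F010000 + 640,   # visible
    "m_rf": 0x7F080000 + 640,   # visible rlong 8bpp lut
}

def split_command(value):
    """Split a 32-bit word into (upper 16 bits, lower 16 bits)."""
    return value >> 16, value & 0xFFFF

# Sum the assumed counts for the four segments that make up one scan line.
line_total = sum(split_command(COMMANDS[name])[1]
                 for name in ("m_bs", "m_sn", "m_bv", "m_vi"))
print(line_total)
```

The four per-line counts add up to 800, which matches the standard 640x480 VGA horizontal total (16 front porch, 96 sync, 48 back porch, 640 visible).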

cgracey · 2019-02-13 07:22

I posted the final logic version at the top of this thread, which includes, lastly, the extra register on each IN signal from the pins. This is to guard against metastability in the final silicon. Note that this adds one clock period of latency to the IN signals.

Peter and Ray, we need to use this version to finish developing the ROM code.

evanh · 2019-02-13 11:10

Chip,
The FPGA image files are all numbered v32j, not v33j. Is that just a typo or worse?

cgracey · 2019-02-13 17:12

evanh wrote: »

Chip,
The FPGA image files are all numbered v32j, not v33j. Is that just a typo or worse?

Whoops! That should be v33j. I will fix that this morning. Thanks for noticing.

Publison · 2019-02-13 17:27

Is the v32j a valid v33j image? Is it just mislabeled? It loads fine.

cgracey · 2019-02-13 19:33

Publison wrote: »

Is the v32j a valid v33j image? Is it just mislabeled? It loads fine.

It was just mislabeled. I fixed it. Sorry about that.

cgracey · 2019-02-19 02:16

I just updated the main file at the top of the thread.

There is a new PNut_v33h.exe which allows a full 1MB download on the -A9 boards to permit ROM updating. This is necessary for PeterJakacki and Cluso99. Others don't need to care about this.
