Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i

evanh · 2019-01-15 04:44

Yep, two unrelated problems that I'd bumped into with the one code loop. Actually, there was also the problem of a branch on the last line of a REP block, but Chip has clarified that that is considered an illegal operation.

The corruption problem, #2, is as I've said. It continuously occurs when synchronously reading at a particular phase to the writes. That problem doesn't occur otherwise. I don't have other examples.

evanh · 2019-01-20 00:21

Chip,
Got another tiny improvement request: For async receive, smartpin mode %11111, I'd like RDPIN reg,pin WC to make C same as IN state. Currently, C is always zero I think.

cgracey · 2019-01-20 05:25

evanh wrote: »

Chip,
Got another tiny improvement request: For async receive, smartpin mode %11111, I'd like RDPIN reg,pin WC to make C same as IN state. Currently, C is always zero I think.

Ok. I will look into that tonight.

evanh · 2019-01-26 13:32

Chip,
I've started trying to duplicate the lut sharing issue with a single-pass test by aligning the read and write on the same CT number.

I've found something else: There's a one clock difference between the v32i FPGA and P2ES silicon when reading a written lutRAM location. Both tested at 20 MHz.

'----- test lut sharing dual-port glitch -----
		getct   ticks
		add     ticks, sec

		lutson
		setq    ticks
		coginit #1, ##@start_lut_test

		addct1  ticks, #3     '<-------------- HERE --- #3 is needed for P2ES, #2 is needed for FPGA
		waitct1

		getct   ticks
		rdlut   pa, #$1ff
		rdlut   i, #$1ff
		rdlut   j, #$1ff
		rdlut   k, #$1ff
		getct   temp
...



'============================
ORG
start_lut_test
		wrlut   #0, #$1ff
		mov     pa, ##$deadbeef
		addct1  ptra, #4
		waitct1
		wrlut   pa, #$1ff

		cogid   pa
		cogstop pa
'============================

evanh · 2019-01-26 13:52

Oh, that's a revelation. It depends on setting vs resetting of the bits. When resetting bits, both timings are the same at #2, so not really a logic difference at all.

Okay, resetting bits is when the glitching shows its face the most. It's actually got a worse case of bit-mashing on the P2ES silicon than the FPGA.

EDIT: Here's output I get from the P2ES board:

lut $1ff = ffffffff   4a41febf   00000000   00000000    14

Attached is full source (need to uncomment the HUBSET's in Set Xtal for P2ES silicon):

jmg · 2019-01-26 18:39

Trying to follow this - I think you are saying
The shared access is ok when Old-> New is =\_ (same on FPGA and P2ES)
When Old -> New is _/= there is corruption, and an extra clock delay is needed in P2ES ?

If P2ES reads again, is the data correct (ie is it read-side corruption, or write-side corruption ?)
Is any data not being read affected (cross address corruption) ?
Does this behave the same for all COG pairs ?
20MHz should be well inside timing, even on FPGA, but you could try 10MHz & 40MHz on P2ES to see if they change.
Does heat affect this ?

evanh · 2019-01-26 21:32

JMG,
Give that code a run.

ozpropdev · 2019-01-27 03:49

@evanh
Confirming your findings, here's what I found in my own testing.
The glitch occurs when the RDLUT is +2 clocks after the WRLUT.

Offset  Original New
-4      FFFFFFFF FFFFFFFF
-3      FFFFFFFF FFFFFFFF
-2      FFFFFFFF FFFFFFFF
-1      FFFFFFFF FFFFFFFF
+0      FFFFFFFF FFFFFFFF
+1      FFFFFFFF FFFFFFFF
+2      FFFFFFFF 09009DFF	'glitch
+3      FFFFFFFF 00000000
+4      FFFFFFFF 00000000

Offset  Original New
-4      00000000 00000000
-3      00000000 00000000
-2      00000000 00000000
-1      00000000 00000000
+0      00000000 00000000
+1      00000000 00000000
+2      00000000 00000000
+3      00000000 FFFFFFFF
+4      00000000 FFFFFFFF

Offset  Original New
-4      55555555 55555555
-3      55555555 55555555
-2      55555555 55555555
-1      55555555 55555555
+0      55555555 55555555
+1      55555555 55555555
+2      55555555 01005555	'glitch
+3      55555555 AAAAAAAA
+4      55555555 AAAAAAAA

Offset  Original New
-4      AAAAAAAA AAAAAAAA
-3      AAAAAAAA AAAAAAAA
-2      AAAAAAAA AAAAAAAA
-1      AAAAAAAA AAAAAAAA
+0      AAAAAAAA AAAAAAAA
+1      AAAAAAAA AAAAAAAA
+2      AAAAAAAA 88288AAA	'glitch
+3      AAAAAAAA 55555555
+4      AAAAAAAA 55555555
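
The +2 rows in the four tables above can be cross-checked with a little bit arithmetic. The Python sketch below is not part of the original test code; it only takes the posted values and counts, for each pattern, how many bits had already been cleared and how many had already been set at the +2 offset. The names `ROWS`, `classify` and `RESULTS` are made up for this illustration.

```python
# Rows copied from the tables above: (original value, value read at offset +2).
ROWS = [
    (0xFFFFFFFF, 0x09009DFF),
    (0x00000000, 0x00000000),
    (0x55555555, 0x01005555),
    (0xAAAAAAAA, 0x88288AAA),
]

def classify(original, observed):
    """Return (bits already cleared, bits already set) relative to the original."""
    fell = original & ~observed & 0xFFFFFFFF   # was 1, now reads 0
    rose = observed & ~original & 0xFFFFFFFF   # was 0, now reads 1
    return bin(fell).count("1"), bin(rose).count("1")

RESULTS = [classify(original, observed) for original, observed in ROWS]

for (original, observed), (fell, rose) in zip(ROWS, RESULTS):
    print(f"{original:08X} -> {observed:08X}: "
          f"{fell} bits already cleared, {rose} bits already set")
```

On these four samples, no bit reads as 1 at +2 unless it was already 1 in the original value, which lines up with the earlier observation that resetting bits is when the glitch shows most. Four samples are not enough to say anything stronger than that.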

evanh · 2019-01-27 04:05

Thanks Brian. I see the $55/$AA combos show it even more.

ozpropdev · 2019-01-27 04:07

Here's the code I used for the above tests.
I used COGATN/WAITATN to sync the shared lut activity.

evanh · 2019-01-27 04:17

I had no idea CALLPA could do that! I'm thieving it!

ozpropdev · 2019-01-27 04:23

evanh wrote: »

I had no idea CALLPA could do that! I'm thieving it!

Send royalty payment to cgracey @ Red Bluff!

evanh · 2019-01-27 04:32

Bugger, I just realised I'm always using long calls. CALLPA can't do those directly.

EDIT: And fastspin has no error/warning for it either.

jmg · 2019-01-27 04:36

ozpropdev wrote: »

@evanh
Confirming your findings, here's what I found in my own testing.
The glitch occurs when the RDLUT is +2 clocks after the WRLUT.

Because this looks more like an aperture effect, it's less a 'glitch' and more a transitional ambiguity.
Does that transitional ambiguity occur independent (largely?) of SysCLK speed ?

ozpropdev · 2019-01-27 09:13

jmg wrote: »

ozpropdev wrote: »

@evanh
Confirming your findings, here's what I found in my own testing.
The glitch occurs when the RDLUT is +2 clocks after the WRLUT.

Because this looks more like an aperture effect, it's less a 'glitch' and more a transitional ambiguity.
Does that transitional ambiguity occur independent (largely?) of SysCLK speed ?

Ran from 20 MHz up to 350 MHz, issue remains the same at +2 clock offset.

jmg · 2019-01-27 09:21

ozpropdev wrote: »

Ran from 20 MHz up to 350 MHz, issue remains the same at +2 clock offset.

Cool, thanks. That doesn't sound like a hard-to-nail-down delay-difference effect, just an aperture effect where the change occurs on the same edge/window as the sample.
It becomes the same as a group of D-FFs that fail tsu/th, so some capture new data and some capture old data.

I'm not sure how much Chip can vary here, and whether there is enough margin to shift the RD/WR to opposite clock edges, or whether that might be a latch that holds over a change.
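
The "some capture new data and some capture old data" picture can be written down as a tiny model. This is a minimal Python sketch of that idea only, assuming each of the 32 bits independently latches either the old or the new value; it is not derived from the P2 logic, and `capture` and `took_new_mask` are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF

def capture(old, new, took_new):
    """Bits set in took_new latched the new value; all others kept the old one."""
    return ((old & ~took_new) | (new & took_new)) & MASK32

def took_new_mask(old, new, observed):
    """Recover which bits latched the new value, or None if the observed word
    cannot be explained as a per-bit mix of old and new."""
    differs = old ^ new                   # only these bits can tell old from new
    mask = (observed ^ old) & differs
    if capture(old, new, mask) != observed:
        return None
    return mask

# The FFFFFFFF -> 00000000 case, which read back 09009DFF at the +2 offset.
mask = took_new_mask(0xFFFFFFFF, 0x00000000, 0x09009DFF)
print(f"bits that latched new data: {mask:08X}")
```

Because every posted test pattern writes the exact complement of the old value, any readback can be expressed as such a mix, so this model describes the data but cannot confirm the aperture explanation on its own.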

ozpropdev · 2019-01-27 09:22

@cgracey
I know you've explained it before somewhere on the forum but what was the reason behind WRLUT taking 2 clocks and RDLUT taking 3 clocks?
Is this somehow related to this LUT share issue?

ersmith · 2019-01-27 12:36

evanh wrote: »

Bugger, I just realised I'm always using long calls. CALLPA can't do those directly.

EDIT: And fastspin has no error/warning for it either.

Could you give me an example? When I tried to reproduce this with:

dat
	org 0
	callpa #1, #faraway
	callpa #2, #\faraway
	orgh $400
	callpa #20, #faraway
	long 0[512]
faraway
	jmp	#faraway

fastspin gave me:

foo.spin2(3) error: Source out of range for relative branch callpa
foo.spin2(4) error: Absolute address not valid for callpa
foo.spin2(6) error: Source out of range for relative branch callpa

evanh · 2019-01-27 16:41

I seem to have found the one hole of hubexec calling lutexec:

dat
org 0
	callpa #1, #faraway
	callpa #2, #\faraway
	callpa #3, #lutaway

orgh $400
	callpa #20, #faraway
	callpa #21, #lutaway
	long 0[512]
faraway
	jmp	#$

org $200
lutaway
	jmp	#$

The final CALLPA is illegal but gives no error:

Version 3.9.15 Compiled on: Jan 21 2019
callpa.spin2
callpa.spin2(3) error: Source out of range for relative branch callpa
callpa.spin2(4) error: Absolute address not valid for callpa
callpa.spin2(5) error: Source out of range for relative branch callpa
callpa.spin2(8) error: Source out of range for relative branch callpa

evanh · 2019-01-27 16:57

Correction, it seems to be all down to a relative calculation thing:

dat
org 0
	callpa #1, #faraway
	callpa #2, #\faraway
cogaway
	callpa #3, #lutaway

orgh $400
	callpa #20, #faraway
	callpa #21, #cogaway
	callpa #22, #lutaway
	long 0[512]
faraway
	callpa #30, #cogaway
	callpa #31, #lutaway

org $200
lutaway
	callpa #40, #faraway
	callpa #41, #cogaway

Here, only lines 10 and 11 don't create an error:

callpa.spin2(3) error: Source out of range for relative branch callpa
callpa.spin2(4) error: Absolute address not valid for callpa
callpa.spin2(6) error: Source out of range for relative branch callpa
callpa.spin2(9) error: Source out of range for relative branch callpa
callpa.spin2(14) error: Source out of range for relative branch callpa
callpa.spin2(15) error: Source out of range for relative branch callpa
callpa.spin2(19) error: Source out of range for relative branch callpa
callpa.spin2(20) error: Source out of range for relative branch callpa

evanh · 2019-01-27 17:27

Of note, relative addressing is illegal when crossing domains. Therefore all the out-of-range errors are kind of wrong too.

evanh · 2019-01-27 18:15

ozpropdev wrote: »

@cgracey
I know you've explained it before somewhere on the forum but what was the reason behind WRLUT taking 2 clocks and RDLUT taking 3 clocks?
Is this somehow related to this LUT share issue?

No would be the short answer.

The SRAM dual-porting function should be handled independently of processor operations. The two cogs are accessing the lutRAM on separate buses. Or at least are supposed to be afaik.

ersmith · 2019-01-27 20:17

evanh wrote: »

Here, only lines 10 and 11 don't create an error:

Wow, that is very weird. Well, it's definitely a bug -- thanks for finding it and sending the reproducer. I'll try to figure out what's going on.

cgracey · 2019-01-28 05:29

ozpropdev wrote: »

@cgracey
I know you've explained it before somewhere on the forum but what was the reason behind WRLUT taking 2 clocks and RDLUT taking 3 clocks?
Is this somehow related to this LUT share issue?

The RDLUT takes three clocks because it must do the read command, the data latch, and the result mux, which each take a clock.

The WRLUT takes only two clocks because it must do the write command and then take one more clock to finish the minimum instruction cycle.

This doesn't have anything to do with LUT sharing.

cgracey · 2019-01-28 05:31

So, is there a problem with LUT sharing, where writing on one port will always cause a read corruption when reading on the other port?

Sorry I'm behind the curve here.

jmg · 2019-01-28 05:38

cgracey wrote: »

So, is there a problem with LUT sharing, where writing on one port will always cause a read corruption when reading on the other port?

Sorry I'm behind the curve here.

Yes, see the post with test results tabulated http://forums.parallax.com/discussion/comment/1462738/#Comment_1462738
There is a single clock cycle critical timing alignment where this occurs, largely independent of SysCLK.
Present on both FPGA and P2 silicon.
Sounds like data is being changed on the same clk edge it is being read, without enough tsu/th margin.

evanh · 2019-01-28 07:01

Chip,
I remember bringing this up some months back. There is a setting in the Altera megafunction ALTSYNCRAM that looks like it needs to be set up correctly to sort this issue. I'm guessing ALTSYNCRAM is the building block you've used in the FPGA design.

Parameter name is: READ_DURING_WRITE_MODE_MIXED_PORTS
Description is: Whats the expected output when reading and writing at the same address through different ports ?. Values are "OLD_DATA" or "DONT_CARE"(default)

And as you can see, the default is DONT_CARE. I'm guessing it still needs to be changed to OLD_DATA for all your dual-port RAMs.

And presumably this parameter is also used by OnSemi.
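
To make the difference between the two settings concrete, here is a small behavioural model in Python. It is a sketch based only on the parameter description quoted above, not on the P2 Verilog or on the ALTSYNCRAM implementation itself; the `DualPortRam` class and its `clock` method are invented for illustration, and the "unknown" readback in DONT_CARE mode is modelled as `None`.

```python
class DualPortRam:
    """Toy dual-port RAM: one write port and one read port sharing a clock."""

    def __init__(self, size, mixed_port_mode="DONT_CARE"):
        if mixed_port_mode not in ("OLD_DATA", "DONT_CARE"):
            raise ValueError("mode must be OLD_DATA or DONT_CARE")
        self.mode = mixed_port_mode
        self.mem = [0] * size

    def clock(self, write=None, read_addr=None):
        """One clock edge. write is (addr, data) or None. Returns the read
        data, or None when nothing is read or the result is undefined."""
        result = None
        if read_addr is not None:
            collision = write is not None and write[0] == read_addr
            if collision and self.mode == "DONT_CARE":
                result = None                    # undefined: could be any mix
            else:
                result = self.mem[read_addr]     # contents before this write
        if write is not None:
            addr, data = write
            self.mem[addr] = data
        return result

ram = DualPortRam(512, mixed_port_mode="OLD_DATA")
ram.clock(write=(0x1FF, 0xFFFFFFFF))
same_edge = ram.clock(write=(0x1FF, 0x00000000), read_addr=0x1FF)
next_edge = ram.clock(read_addr=0x1FF)
print(f"{same_edge:08X} {next_edge:08X}")
```

With OLD_DATA the colliding read returns the previous contents and the new value shows up one clock later; with DONT_CARE the model deliberately returns nothing usable for the colliding read. Whether the silicon memories have an equivalent option is not something this sketch can answer.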

evanh · 2019-01-28 07:17

Actually, someone else might have suggested SRAM configuration first but I do remember looking it up and pondering on here about it and as to whether it would affect OnSemi.

evanh · 2019-02-04 17:14

Chip,
I've got a setup here with sync serial output responding within 3 clocks of an external clock source. I've been using another smartpin to produce the clock source so that on the scope the timing is about 3 clocks. That's for 20 - 60 MHz on the FPGA. However ...

The reaffirming issue here is that with the latest v33i FPGA at 80 MHz it still leaps another clock the same way as my much older measurements using software and GETCT.

I've carefully looked at it on the scope. The output timing of a non-registering pin stays within a nanosecond of the phase of a pad-ring registered pin. To me this says the output circuit is not any issue.

So the problem has to be early in the input path, I'm guessing between the pad-ring and first verilog register stage.

PS: Turning on pad-ring registering doesn't help at all. It just adds two sysclocks to the lag.

evanh · 2019-02-04 17:27

Here are two screenshots of the scope with registered and unregistered output on the blue trace with the FPGA operating at 80 MHz.

Green trace is transition smartpin mode without registering. It is the clock input for the sync serial smartpin.
Orange trace is sync serial smartpin mode also without registering.
Blue trace is OTHER inverted output paired to the pin of the green trace. This probe is missing its ground clip so the trace wobbles way more than the other two.

Unregistered lag from green rising to blue falling is about 3 ns.
Registered lag from green rising to blue falling is about 14.5 ns.
The difference is 11.5 ns, which is 1.0 ns short of the ideal 12.5 ns (one clock period at 80 MHz). No issue there.
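
The arithmetic behind those figures, as a quick Python check. It assumes that "ideal" means exactly one system clock period at 80 MHz, which is an inference from the numbers rather than something stated in the post; the variable names are made up here.

```python
SYSCLK_MHZ = 80.0

unregistered_lag_ns = 3.0    # green rising to blue falling, unregistered
registered_lag_ns = 14.5     # green rising to blue falling, registered

clock_period_ns = 1000.0 / SYSCLK_MHZ                    # one sysclock at 80 MHz
added_lag_ns = registered_lag_ns - unregistered_lag_ns   # cost of registering
shortfall_ns = clock_period_ns - added_lag_ns            # distance from one clock

print(f"period {clock_period_ns} ns, added lag {added_lag_ns} ns, "
      f"short by {shortfall_ns} ns")
```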
