RAM checksum error

ManAtWork · 2015-08-10 10:57

In the latest batch of my stepper controller boards there were two where I had problems programming them. The propeller tool tells me "RAM checksum error" when I hit F10 or F11. What could be the possible cause of this?

There are two propellers on the board which share one EEPROM and the P28..31 signals. But I verified that the second is held in reset while the first is being programmed. Both props also share the same clock but I think that doesn't matter because they run in RCFAST mode while being programmed. Supply is stable at 3.29V. It's a two layer SMT board with the bottom layer as almost solid ground plane. I have four 0.1µF caps near the supply pins at each prop. I don't see any obvious soldering error. If the EEPROM was broken, I think, it should say "EEPROM error" instead of "RAM". I use the original prog-plug from Parallax.

Heater. · 2015-08-10 11:09

I don't think it's the EEPROM. Hitting F10 only loads to RAM.

If this works on all your other boards and only these two are faulty I would suspect that there is somehow some serial data corruption happening at load time.

ManAtWork · 2017-10-23 12:56

Today, I had two boards with checksum error, again. I checked the supply voltage, 3.28 and 3.32V. I tried a different ProgPlug without success. Finally, desoldering and replacing the propellers helped.

As I expect that Parallax tests the chips before shipping them, they must have been damaged during assembly, soldering or testing. But how could that happen? If I zapped a pin by electrostatic discharge or the like, it should not affect the RAM (alone).

Or does "RAM checksum error" mean that the chip doesn't respond to serial programming at all?

JonnyMac · 2017-10-23 13:09

The Propeller downloads the compiled image (just the code) which contains a checksum for that packet (it's in byte 5 of the image). Once downloaded, the Propeller verifies the checksum (sums all bytes in image, low byte should be 0). If this isn't happening, there is something interfering with the process.
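As a rough illustration of the check described above, here is a minimal Python sketch that runs on a PC, not on the Propeller. It assumes, as the post says, that the checksum byte sits at offset 5 and that the sum of every byte in the image must be 0 modulo 256. The function names are made up for this example, and the real download tools may account for extra bytes (such as an initial stack frame) that this sketch ignores.

```python
CHECKSUM_OFFSET = 5  # assumption taken from the post: checksum lives in byte 5


def checksum_ok(image: bytes) -> bool:
    """Return True when the low byte of the sum of all image bytes is zero."""
    return sum(image) & 0xFF == 0


def patch_checksum(image: bytes) -> bytes:
    """Return a copy of the image whose byte 5 makes the byte sum 0 mod 256."""
    data = bytearray(image)
    data[CHECKSUM_OFFSET] = 0                      # ignore the old value
    data[CHECKSUM_OFFSET] = (-sum(data)) & 0xFF    # value that cancels the sum
    return bytes(data)


def flip_bit(image: bytes, address: int, bit: int) -> bytes:
    """Return a copy of the image with one bit inverted (models corruption)."""
    data = bytearray(image)
    data[address] ^= 1 << bit
    return bytes(data)


if __name__ == "__main__":
    good = patch_checksum(bytes(range(64)))
    print("intact image passes:", checksum_ok(good))
    print("one flipped bit passes:", checksum_ok(flip_bit(good, 0x20, 6)))
```

A single flipped bit always changes the sum by a power of two below 256, so it is always caught. Several errors that happen to cancel each other would slip through a simple sum like this.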

Do you have a pull-up on the shared P31 (Propeller RX) line? If not, the Propeller that is in reset might be putting a load on that pin that is creating a problem for the other. Yes, it's a SWAG -- but should be easy to check by adding a 10K pull-up to that line.

Cluso99 · 2017-10-23 22:01

Isn't the checksum error caused by the eeprom failing the verify?

Just read the first post again. Two props share 1 eeprom. How are you sharing the clock?

Is your problem with programming the eeprom (F11)?

As for sharing the eeprom, I have doubts this will always work reliably since both props will be supplying the SCLK pin (ie a conflict). Also, SDA when both props send to the eeprom is another conflict. The RC Clocks in the props will not necessarily be anywhere near the same frequency.

So I think you need to explain how the props and eeprom are connected together - circuit diagram.

Peter Jakacki · 2017-10-23 23:05

Seeing it is a RAM checksum error then there are three possible causes. First, a RAM checksum error only occurs once the Prop has finished accepting the binary image and is either ready to run it or program the EEPROM.

1 - Corrupted data stream
2 - RAM corruption during receive
3 - Bad RAM

Of these causes, the first is the most plausible, although I did once have a new Prop that had bad RAM, but once only ever. A RAM checksum error caused by a bad data stream may be due to a bad ground or ground loop problem, or even a chip that is not soldered properly. If in doubt I will apply flux to the pins (I always use QFP) and run the iron along the pins to reflow the joints, then retest. I never replace the chip before I do some basic checks such as this, and most of the time the cause is a connection, either open or shorted across by a minute whisker. The flux and reflow method is actually my second option, as the first option is always to do a visual inspection using a good loupe and even shine a light from behind the PCB both ways. Checking voltages is always wise, but set your meter to AC as well, which will give an indication of ripple/instability problems, probably caused by a bad bulk cap. This is especially so if the regulated DC voltage is unusual (e.g. 4.8V instead of 5V), since the DC reading only shows the average. A 5V square wave would read as 2.5V on a DC meter, for example.

By just replacing the chips and "fixing" it you still haven't found out what the problem was. It may have been very simple. In the old days of vacuum tube TVs the local serviceman (these sets needed a lot of maintenance) would just swap out tubes/valves with ones in his kit until it worked. Getting it to work is not the same thing as finding the problem, and valves certainly were not cheap in those days (or now). The mere act of unplugging a valve and putting the same one back in might cause it to work. The customer was none the wiser though, although poorer. These servicemen were referred to as valve/tube jockeys, much as PC "technicians" are referred to as board jockeys. Don't be a chip jockey.

Cluso99 · 2017-10-24 02:36

I didn't think there was a ram check in the boot code. Currently out so cannot check.

I do the same as Peter... check first, replace last. I used to desolder 40 pin ceramic ICs and be able to reuse them if not faulty. Those ceramics (white) were so fragile back in the 70's.

Tracy Allen · 2017-10-24 06:04

I've recently seen that issue. In a batch of 240 boards assembled by a CM (top quality equipment and QC), three kicked back that fatal error upon testing. They were put aside and out of mind to deal with "later". I pulled them off the shelf this afternoon, same deal, still no joy. So, 3 out of 240. In principle, I agree with everything Peter said about likelihoods and things to troubleshoot, in particular the power supply issue. Bad connections usually manifest as a total failure to communicate, "PNF", propeller not found. This board has only one prop, and uses an on-board FT231X. I may have time to give it another look later this week.

Peter Jakacki · 2017-10-24 06:25

@Cluso99 - this is the code from the booter that does a checksum after it receives the image:

			mov	bits,#0			'compute ram checksum
:checksum		rdbyte	rxdata,count
			add	bits,rxdata
			add	count,#1
			djnz	address,#:checksum
			test	bits,#$FF	wz	'z=1 if checksum okay
			call	#tx_bit_align		'send checksum okay/error
	if_nz		jmp	#shutdown		'if checksum error, shutdown
			djnz	command,#program	'if command 2-3, program eeprom
			jmp	#launch			'else command 1, launch
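For readers who don't read PASM, the decision flow of the booter fragment above can be sketched in Python. This is only a model for following the logic on a PC: the function name and the returned strings are invented here, and it assumes the loop covers the whole RAM buffer passed in.

```python
def booter_after_download(ram: bytes, command: int) -> str:
    """Mirror the PASM fragment: sum RAM, then shut down, program, or launch."""
    bits = 0
    for rxdata in ram:          # rdbyte / add / djnz loop over every byte
        bits += rxdata
    if bits & 0xFF != 0:        # test bits,#$FF wz  ->  if_nz jmp #shutdown
        return "shutdown"
    command -= 1                # djnz command,#program
    if command != 0:
        return "program"        # commands 2-3: go on to program the EEPROM
    return "launch"             # command 1: launch from RAM


if __name__ == "__main__":
    ram = bytearray(32768)      # all zeros sums to 0, so the check passes
    print(booter_after_download(bytes(ram), 1))
    ram[0x1190] = 0x40          # one bit that reads back high
    print(booter_after_download(bytes(ram), 1))
```

Because the sum is taken over what is read back from RAM, a cell that does not hold its value fails the check before the command is even looked at, which fits the error showing up for both RAM-only and EEPROM downloads.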

Cluso99 · 2017-10-24 08:24

Thanks Peter. Obviously my memory (ram) is failing me. Thought it was calculated on the fly during read from eeprom

ManAtWork · 2017-10-25 11:16

Today, I had 11 boards with RAM checksum error again. Two of them have double propellers, 9 have only one. So I think we can exclude the EEPROM sharing as possible source of the fault. The 9 boards with single propeller also have no switch mode power supply on board but are powered directly from USB with a linear regulator. So AC ripple and ground loops can also be eliminated.

I have replaced two propellers, one on the double and one on the single prop board. Both boards worked afterwards. (Call me a chip jockey, but I have to fight bad yield as I have to deliver 100 working boards next week.)

I inspected the solder joints of a third (bad) board with the microscope. The joints look nearly perfect (well, if there's a "perfect" at all for lead free solder...).

I'll try the flux-reflow method Peter suggested to test the whisker or bad joint theory.

I bought the chips from Mouser. Date code is 1629.

ManAtWork · 2017-10-25 11:27

BTW, this is the schematic for the EEPROM sharing. The first propeller (IC2) holds the reset line of the second (IC1) low with T1 until it has finished booting. There is no issue with different clock frequencies because the two props don't boot at the same time.

I sold ~2500 of these boards and if they survive testing they work very reliably. I had some problems with connectors and the power circuits but not a single propeller or EEPROM failure in the last 5 years.

ManAtWork · 2017-10-25 11:37

Re-soldering all pins of a bad prop with flux applied does not make any difference. And yes, the checksum error also occurs when I only load to RAM (F10).

Peter Jakacki · 2017-10-25 12:37

Can you keep one of those faulty boards intact so the real fault can be tracked down? I mentioned I had one chip with faulty RAM from new, so maybe there have been some short cuts at the foundry with testing perhaps. I'm guessing too that you are using an LDO from USB power and that the caps are correct for that particular regulator too.

Cluso99 · 2017-10-25 14:09

Would be interesting to see if a ram bit is stuck.
Try something like

PUB dummy

DAT
data byte [32000]$FF

Then try with $00.

Tracy Allen · 2017-10-25 23:32

Cluso99, Interesting approach. I tried your DAT fill on the three I have here, ones that throw the RAM checksum error like ManAtWork described.

Summary...mixed results...

#238 accepts $ffffffff, $00000000, $55555555, $aaaaaaaa, and also it accepts a short program that just blinks an LED. However, it errors out on a long program.

#239 accepts $ffffffff, but not $00000000, $55555555 or $aaaaaaaa. It does always accept $ffff0000. For a while it was accepting $000c0000. I thought I had the issue pinned on one bit $00080000, but then it started failing on that too. Inconsistent. Huh? Back to $ffff0000 which it accepts consistently. It also accepts the LED blink program so long as it also has the DAT to fill the rest of memory.

#240 accepts $ffffffff, $00000000, $55555555, but NOT $aaaaaaaa. Those $5 and $a patterns force bits to alternate states in case they are shorted together or something like that. This also accepts the LED blink program so long as it has the permissive DAT fill.
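To illustrate why the $55555555 and $aaaaaaaa fills are useful, here is a small Python sketch of a purely hypothetical fault: two adjacent bit lines shorted together, modelled as a wired-AND. Neither the fault model nor the function names come from the thread, and the boards above clearly show something less tidy than this.

```python
def read_with_short(value: int, bit_a: int, bit_b: int) -> int:
    """Model two shorted bits as a wired-AND: both read 1 only if both are 1."""
    both_high = (value >> bit_a) & (value >> bit_b) & 1
    mask = (1 << bit_a) | (1 << bit_b)
    cleared = value & ~mask
    return (cleared | mask) if both_high else cleared


def patterns_that_detect(bit_a: int, bit_b: int, patterns) -> list:
    """Return the fill patterns that read back differently from what was written."""
    return [p for p in patterns if read_with_short(p, bit_a, bit_b) != p]


if __name__ == "__main__":
    fills = [0xFFFFFFFF, 0x00000000, 0x55555555, 0xAAAAAAAA]
    for p in patterns_that_detect(18, 19, fills):
        print(f"${p:08X} exposes a short between bits 18 and 19")
```

All-ones and all-zeros drive both shorted bits to the same level, so only the alternating patterns expose this kind of fault.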

At some point I'll replace a chip gently and move the old one onto a minimal PCB for a separate test. The date code on these is 1709.

Cluso99 · 2017-10-26 02:40

Interesting results Tracy.

I use a rolling pattern when I test the SRAM on my RamBlades. I use the address as the data, filling the whole SRAM, then I read it back and verify the address = data. It's a reasonable check for shorted/open address and data pins.
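The address-as-data idea can be sketched in Python against a simulated memory. The `FaultyRam` class and its stuck address line are assumptions made up for this example, and each cell is assumed wide enough to hold its own address, which a byte-wide SRAM would not be.

```python
class FaultyRam:
    """Hypothetical RAM whose address line `stuck_line` always reads as 0."""

    def __init__(self, size: int, stuck_line=None):
        self.cells = [0] * size
        self.stuck_line = stuck_line

    def _decode(self, address: int) -> int:
        if self.stuck_line is None:
            return address
        return address & ~(1 << self.stuck_line)   # the stuck line is forced low

    def write(self, address: int, value: int) -> None:
        self.cells[self._decode(address)] = value

    def read(self, address: int) -> int:
        return self.cells[self._decode(address)]


def address_as_data_test(ram: FaultyRam, size: int) -> list:
    """Fill every cell with its own address, then return addresses that mismatch."""
    for address in range(size):
        ram.write(address, address)
    return [a for a in range(size) if ram.read(a) != a]


if __name__ == "__main__":
    print(len(address_as_data_test(FaultyRam(256), 256)))
    print(len(address_as_data_test(FaultyRam(256, stuck_line=3), 256)))
```

With a stuck address line, two addresses land on the same cell, so the later write overwrites the earlier one and the earlier address reads back the wrong value.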

Perhaps you could try a small program to run a hub ram check, followed by the $ffffffff data.

I will write a test program and try on my prop boards.

jstjohnz · 2017-10-26 06:10

I have seen ram checksum errors on a small percentage of DIP package props. Maybe a half dozen such failures out of more than 10,000 chips.

I saw most of those in a cluster during the time a couple of years ago when the chips were in very short supply.

At the time Parallax support told me they had never had a DOA failure like that since the chips are 100% tested prior to shipping. I offered to send the parts to them for testing but I was told they couldn't put those chips in their test fixture because they had been soldered rather than socketed and could damage the contacts on the test fixture.

Cluso99 · 2017-10-26 06:33

Here is a HUB RAM TEST program.

This is only a mini stub spin program to start a pasm program that will write $00 to all hub ram, then read it back while verifying. The value being written is displayed in hex, and when it is completely read back a "p" (passed) is displayed. Then the data value is incremented and the whole hub ram is retested, etc, until $FF has been tested, when an "e" is displayed.

You will need to use PST (serial program) or similar at 115,200 baud. I am only transmitting on P30 with an inbuilt pasm soft uart. (ie I do not use hub ram after the spin stub boots).

If a failure is detected, the address xxxx:yy!zz is displayed where xxxx is the hub address (hex), yy is the value written, and zz is the value read back.
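If you capture the serial output to a file, a few lines of Python on the PC can turn the xxxx:yy!zz records into numbers. This is a sketch only: the names are invented here, and tolerating an optional `yy*` prefix in front of a record is an assumption about the output layout.

```python
import re
from typing import NamedTuple, Optional


class RamError(NamedTuple):
    address: int     # hub address (xxxx)
    written: int     # value written (yy)
    read_back: int   # value read back (zz)

    @property
    def flipped_bits(self) -> int:
        """Bits that differ between what was written and what was read."""
        return self.written ^ self.read_back


# Optional "yy*" prefix, then xxxx:yy!zz, all in hex.
_RECORD = re.compile(
    r"^(?:[0-9A-Fa-f]{2}\*)?([0-9A-Fa-f]{4}):([0-9A-Fa-f]{2})!([0-9A-Fa-f]{2})$"
)


def parse_error_line(line: str) -> Optional[RamError]:
    """Return a RamError for an error record, or None for any other line."""
    match = _RECORD.match(line.strip())
    if match is None:
        return None
    return RamError(*(int(group, 16) for group in match.groups()))


if __name__ == "__main__":
    error = parse_error_line("1190:00!40")
    print(hex(error.address), hex(error.flipped_bits))
```

Lines that are not error records, such as the "p" printed after a passing run, simply return None.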

For those with suspected faulty hub ram, the first DAT section fills the first 16KB (from $0018) with $FF. You can also try this with $00. If either of these will download correctly, then my program will run and test the whole hub ram.

If you cannot get the above to work, you can move the first DAT section to the end and try that. If not, try filling with $00. If either of these download correctly then my program will then test the whole hub ram.

FWIW I found that any code slightly larger than ~$4000 fails download with a COMMS error. I wasn't aware of this. I am using a CP2102 and have W10 in my PC, with PropTool 1.3.2. I will try an FT232 later.

Tracy Allen · 2017-10-28 00:04

Cluso99,
I did run your program on two of my boards. The third board #240 wouldn't load the program so I put it aside for the moment.

On board #239 it printed out as follows,
00*1190:00!40
So, in attempting to write $00 to address $1190, it read back $40 instead of $00. Consistently.
With board #238, the result was,
00*1005:00!08
Error at address $1005, read back $08 instead of $00.
I messed around with that a bit and then modified your program to kick out a file of all the errors in the 32k hub ram. The files for #238 and #239 are attached. Both have lots of errors, but #238 far more than #239. The list is consistent when re-running the program. Also for reference there is a file from a normal board #226 that threw no errors, as an indicator that the program is operating correctly.

Looking at the result from #239, the errors are all in a 32 byte block of addresses from $1190 to $11AF. Here is the attempt to write $00 to that block.
00*
1190:00!40
1192:00!04
1193:00!20
1194:00!40
1196:00!04
1197:00!20
1198:00!40
119A:00!04
119B:00!20
119C:00!40
119E:00!04
119F:00!20
11A0:00!40
11A2:00!04
11A3:00!20
11A4:00!40
11A6:00!04
11A7:00!20
11A8:00!40
11AA:00!04
11AB:00!20
11AC:00!40
11AE:00!04
11AF:00!20

What do you think? It looks like crossover from the address to the data. Weird. For #238, the errors all fall in a larger 4085 byte block of addresses, from $1001 to $1FF5.

Tracy Allen · 2017-10-28 00:18

Here is the modified program. Mainly a change of a jmp #error to a call #error so that the test continues after detecting the first error.

Cluso99 · 2017-10-28 01:55

Here is the pattern for #239 (list displayed above)

1190:00!40  1194:00!40  1198:00!40  119C:00!40  11A0:00!40  11A4:00!40  11A8:00!40  11AC:00!40
                                                                                              
1192:00!04  1196:00!04  119A:00!04  119E:00!04  11A2:00!04  11A6:00!04  11AA:00!04  11AE:00!04
1193:00!20  1197:00!20  119B:00!20  119F:00!20  11A3:00!20  11A7:00!20  11AB:00!20  11AF:00!20

Interesting hey!

While we don't know how the hub is arranged, we do know that it would at least be in longs.
These columns show a long where
+0 has b6=1
+1 ok
+2 has b2=1
+3 has b5=1
and two longs are affected.
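The grouping by byte offset can also be done mechanically. The Python sketch below ORs together the bits that read back high in each byte position of a long during the write-$00 pass and packs them into one 32-bit value. It assumes little-endian byte order within a long (byte +0 is the least significant), and the function name is made up for this example.

```python
def stuck_high_long(errors) -> int:
    """Combine (address, written, read_back) records into one stuck-high mask.

    Each record contributes the bits that read back 1 although 0 was written,
    placed in the byte lane given by the address offset within its long.
    """
    lanes = [0, 0, 0, 0]
    for address, written, read_back in errors:
        lanes[address & 3] |= read_back & ~written & 0xFF
    return int.from_bytes(bytes(lanes), "little")   # lane +0 = low byte


def errors_239() -> list:
    """Rebuild the #239 write-$00 listing: $1190..$11AF, offsets +0, +2, +3."""
    records = []
    for base in range(0x1190, 0x11B0, 4):
        records += [(base, 0x00, 0x40), (base + 2, 0x00, 0x04), (base + 3, 0x00, 0x20)]
    return records


if __name__ == "__main__":
    print(f"${stuck_high_long(errors_239()):08X}")
```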

Haven't had time to check the others.

Tracy Allen · 2017-10-28 16:53

Yes, I checked the whole file for #239, consistent with your observation.

It is as if the 32 byte address range from $1190 to $11AF is organized as 8 longs, each with the following bits stuck high. (Written in conventional long order, msbyte, highest address, on the left -- how is it physically?).
%00100000_00000100_00000000_01000000
$20040040

It stands to reason that it won't report an error if you write a byte that has all three of those bits set to all locations. For example, $64, $65, $66, or $67 do indeed report no errors, but $63 and $68 do, because the $3 is converted to $7 and the $8 is converted to $C.

p
63*
1192:63!67
1196:63!67
119A:63!67
119E:63!67
11A2:63!67
11A6:63!67
11AA:63!67
11AE:63!67
p
64*
p
65*
p
66*
p
67*
p
68*
1192:68!6C
1196:68!6C
119A:68!6C
119E:68!6C
11A2:68!6C
11A6:68!6C
11AA:68!6C
11AE:68!6C
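The reasoning above can be checked with a short Python model of the stuck-high bits in each byte position of a long. The mask table is taken from the #239 results in this thread; the function name and the idea of a simple OR are assumptions for illustration.

```python
# Stuck-high bits per byte offset within a long, as observed on board #239.
LANE_MASKS = {0: 0x40, 1: 0x00, 2: 0x04, 3: 0x20}


def failing_lanes(written: int) -> dict:
    """Return {byte offset: value read back} for lanes that would misread."""
    result = {}
    for lane, mask in LANE_MASKS.items():
        read_back = (written | mask) & 0xFF   # stuck-high bits are forced to 1
        if read_back != written:
            result[lane] = read_back
    return result


if __name__ == "__main__":
    for value in range(0x63, 0x69):
        lanes = failing_lanes(value)
        report = ", ".join(f"+{k}: ${v:02X}" for k, v in lanes.items()) or "p"
        print(f"${value:02X} -> {report}")
```

In this model $63 and $68 fail only at offset +2, because bits 6 and 5 are already set in both values, which agrees with the listing showing errors only at $1192, $1196 and so on.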

Tracy Allen · 2017-10-28 19:22

The error pattern in #238 is quite different. In summary,
All the errors reported are in a 4096 byte block from $1000 to $1FFF.

Bit 3 %xxxx1xxx is stuck high at addresses of the form $1hh5
e.g.

00*
1005:00!08
1015:00!08
1025:00!08
...         so on 256 total errors in writing $00 to the block
1FE5:00!08
1FF5:00!08
p

Furthermore, bit 3 %xxxx0xxx is stuck low at addresses of the form $1hh1
e.g.

08*
1001:08!00
1011:08!00
1021:08!00
...         so on 256 total errors in writing $08 to the block
1FE1:08!00
1FF1:08!00
p

Unlike #239, where the repeating pattern was 4 bytes, in #238 the pattern repeats at 16 bytes. It involves one bit, stuck high for some addresses and stuck low for others.
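The address forms $1hh5 and $1hh1 amount to a mask-and-compare on the address. The Python sketch below enumerates them and models what board #238 would read back. The function names are invented for this example, and the model is only a restatement of the observations above, not an explanation of the fault.

```python
BIT3 = 0x08


def matches_form(address: int, low_nibble: int) -> bool:
    """True for addresses of the form $1hhN: $1000..$1FFF with low nibble N."""
    return (address & 0xF00F) == (0x1000 | low_nibble)


def read_back_238(address: int, written: int) -> int:
    """Model board #238: bit 3 stuck high at $1hh5 and stuck low at $1hh1."""
    if matches_form(address, 0x5):
        return written | BIT3
    if matches_form(address, 0x1):
        return written & ~BIT3 & 0xFF
    return written


if __name__ == "__main__":
    stuck_high = [a for a in range(0x8000) if matches_form(a, 0x5)]
    print(len(stuck_high), hex(stuck_high[0]), hex(stuck_high[-1]))
```

Counting the matches gives 256 addresses per form, 16 bytes apart, which agrees with the 256 errors per pass reported above.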

Peter Jakacki · 2017-10-28 23:20

I've had this problem before, and perhaps before that too.

KeithE · 2017-10-29 00:12

I think that you all should get RMAs and return the parts to Parallax for failure analysis. Otherwise it's not obvious how they can proceed.

>I was told they couldn't put those chips in their test fixture because they had
>been soldered rather than socketed and could damage the contacts on the test fixture.

I'm sure that they can find a way. Semiconductor companies often need to recall parts etcetera as part of failure analysis, so finding a DIP socket should not be a show stopper.

Tracy Allen · 2017-10-29 03:37

Looking back at your referenced thread, Peter, do your results in the second test mean that there was a failure (always low?) in bit 2 for all hub byte addresses of the form $5hh9 and $5hhD?

Keith, it is definitely worth pursuing with Parallax. I'll ask Jeff. I didn't want to bother them with something that could be traced to cold solder joints, but evidence is mounting that it is something more than that. Maybe it is a board handling or assembly issue, but I know my CM has the right equipment and skills. Yet Parallax uses the Prop in far more builds and would surely have run across this issue if it is attributable to the chips.

Jeff Martin · 2017-10-31 22:31

Thanks everyone for all that you've put into diagnosing this!

If this ever happens, we really appreciate knowing the Lot Codes and Date Codes from each bad chip (some lot codes span multiple date codes because of packaging logistics) and also knowing where you purchased them from. Lot codes indicate the batch of wafers the die came from and date codes indicate the time of packaging said die. We've been keeping yield data on every batch of Propellers, and a sample set from each batch from the last few years, so we can better look for any oddities should a customer later report problems like this.

We also really appreciate receiving the bad chips back for further diagnosis if they are in a good enough condition.

jstjohnz wrote:

>I have seen ram checksum errors on a small percentage of DIP package props.... At the time Parallax support told me they had never had a DOA failure like that since the chips are 100% tested prior to shipping. I offered to send the parts to them for testing but I was told they couldn't put those chips in their test fixture because they had been soldered rather than socketed and could damage the contacts on the test fixture.

Sorry about that, it's too bad that was said to you. It sounds like we could have done something with those; our Manufacturing staff is excellent with rework and probably could have cleaned them up enough for us to safely test them. I'll make sure our staff knows to at least quiz me about it before refusing the chips - having them here is one important piece of the puzzle I'd like to have, otherwise we're left with more questions.

The results of your tests do indeed indicate that the RAM is bad. We've never known a customer manufacturing process to cause this particular type of problem, and it's hard to imagine one, so I'm thinking that somehow the chips were shipped to you in that condition. If anyone in this thread still has said chips, please send them back for a replacement and make sure to mark them to me, Jeff Martin, so I can at least verify that they fail in our tester.

There was a time (can't remember how long ago) when our tester passed chips experiencing a particular kind of RAM failure. When the customer sent his bad Propeller back to me, I confirmed this, determined the problem, and fixed the tester so that it properly failed that chip. Since then, we've yet to find a case where a verified bad Propeller tested as good, but I keep looking for them just in case.

Around June/July we heard from a customer who was having trouble and we replaced his Propellers and received the bad ones back. The bad chips were severely bad; pulled too much current under test. They may have been damaged after shipping, but RAM failures like what you've experienced are not likely to have been caused that way. We've recently reviewed our internal processes and made some changes to improve them and want to continue making sure that our fab is producing within the agreed process window and that we're always shipping good packaged parts.

By the way, I think Peter clarified it, but in case anyone is still wondering: the boot loader does indeed verify the checksum of the received image after it has programmed it into RAM, and before it programs it into EEPROM (if requested) followed by a checksum verification from EEPROM as well. The RAM Checksum failure could actually be due to a communication problem, rather than a true RAM problem, as Peter indicated, but communication-induced problems are more likely to exhibit Propeller Not Found errors, and others too, each time you try to download. A consistent RAM Checksum failure message is very likely a real problem with the RAM.

Though it doesn't change the good/bad status of the Propellers in question, there's something else I should share here - a Propeller Main RAM failure could appear to be bad Main RAM locations, but could actually be good Main RAM locations but with one or more Cogs that can't read it properly. For this reason, we test each Main RAM location from the perspective of each individual Cog. I haven't looked at the test code in this thread to see if it's doing that or not, just thought I should point it out for good measure.

Tracy Allen · 2017-11-01 16:02

Jeff, Thanks very much for chiming in. I'll send over the ones I have here. That's interesting about doing the test from each individual cog. I'll need to tweak the test program in order to cover all 8 bases.

-- Tracy

Jeff Martin · 2017-11-01 17:10

You're welcome, Tracy. Looking forward to checking out the Propellers.

ManAtWork · 2017-11-02 12:50

>We also really appreciate receiving the bad chips back for further diagnosis if they are in a good enough condition.

Thanks, Jeff, for the explanations. I'll collect all bad chips and return them to you as soon as I have time.

>Propeller Main RAM failure could appear to be bad Main RAM locations, but could actually be good Main RAM locations but with one or more Cogs that can't read it properly.

Errrr, that means that there is a small chance that bad chips do not show as bad while programming. This could happen if the bootloader (cog0?) can read/write with no problems from/to RAM but a different cog couldn't. Bad chips showing up while programming are annoying but this can be easily corrected. Bad chips failing at the customer can get really expensive, though. This is very unlikely, I know, as we run an extra test that executes the actual software under real-world conditions (with multiple cogs).
