RAM checksum error
ManAtWork
in Propeller 1
In the latest batch of my stepper controller boards there were two where I had problems programming them. The propeller tool tells me "RAM checksum error" when I hit F10 or F11. What could be the possible cause of this?
There are two propellers on the board which share one EEPROM and the P28..31 signals. But I verified that the second is held in reset while the first is being programmed. Both props also share the same clock but I think that doesn't matter because they run in RCFAST mode while being programmed. Supply is stable at 3.29V. It's a two layer SMT board with the bottom layer as almost solid ground plane. I have four 0.1µF caps near the supply pins at each prop. I don't see any obvious soldering error. If the EEPROM was broken, I think, it should say "EEPROM error" instead of "RAM". I use the original prog-plug from Parallax.
Comments
If this works on all your other boards and only these two are faulty, I would suspect that there is somehow some serial data corruption happening at load time.
As I expect that Parallax tests the chips before shipping them, they would have to have been damaged during assembly, soldering or testing. But how could that happen? If I zapped a pin or so by electrostatic discharge or whatever, it should not affect the RAM (alone).
Or does "RAM checksum error" mean that the chip doesn't respond to serial programming at all?
Do you have a pull-up on the shared P31 (Propeller RX) line? If not, the Propeller that is in reset might be putting a load on that pin that is creating a problem for the other. Yes, it's a SWAG -- but should be easy to check by adding a 10K pull-up to that line.
Just read the first post again. Two props share 1 eeprom. How are you sharing the clock?
Is your problem with programming the eeprom (F11)?
As for sharing the eeprom, I have doubts this will always work reliably since both props will be supplying the SCLK pin (ie a conflict). Also, SDA when both props send to the eeprom is another conflict. The RC Clocks in the props will not necessarily be anywhere near the same frequency.
So I think you need to explain how the props and eeprom are connected together - circuit diagram.
1 - Corrupted data stream
2 - RAM corruption during receive
3 - Bad RAM
Of these causes, the first is the most plausible, although I did once have a new Prop with bad RAM, but only that once. A RAM checksum error caused by a bad data stream may be due to a bad ground or a ground loop, or even to the chip not being soldered properly. If in doubt I apply flux to the pins (I always use QFP) and run the iron along them to reflow the joints, then retest. I never replace the chip before doing some basic checks such as this, and most of the time the problem is a connection, either open or shorted across by a minute whisker. The flux and reflow method is actually my second option; the first is always a visual inspection using a good loupe, even shining a light from behind the PCB both ways. Checking voltages is always wise, but set your meter to AC as well, which will give an indication of ripple or instability, probably caused by a bad bulk cap. This is especially so if the regulated DC voltage is unusual (e.g. 4.8V instead of 5V), since the DC reading only shows the average. A 5V square wave would read as 2.5V on a DC meter, for example.
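The point about a DC reading hiding ripple can be put into numbers. This is only a sketch, assuming an ideal 0V/5V square wave at 50% duty cycle and idealized meters (the DC range reports the mean, the AC range reports the RMS of what is left after removing the mean). The names `dc_reading` and `ac_reading` are made up for the illustration, and real meters may respond differently.

```python
import math

def dc_reading(samples):
    """Idealized DC meter: reports the average of the waveform."""
    return sum(samples) / len(samples)

def ac_reading(samples):
    """Idealized AC-coupled true-RMS meter: RMS after removing the average."""
    mean = dc_reading(samples)
    return math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))

# One period of an ideal 0 V / 5 V square wave at 50% duty cycle (assumed).
square = [5.0] * 500 + [0.0] * 500
# A clean, ripple-free 5 V rail for comparison.
clean = [5.0] * 1000

print(dc_reading(square), ac_reading(square))  # 2.5 2.5
print(dc_reading(clean), ac_reading(clean))    # 5.0 0.0
```

On the DC range alone the square wave just looks like a low rail; it is the nonzero AC reading that gives the instability away.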
By just replacing the chips and "fixing" it you still haven't found out what the problem was. It may have been very simple. In the old days of vacuum tube TVs the local serviceman (these sets needed a lot of maintenance) would just swap out tubes/valves with ones in his kit until it worked. Getting it to work is not the same thing as finding the problem, and valves certainly were not cheap in those days (or now). The mere act of unplugging a valve and putting the same one back in might cause it to work. The customer was none the wiser though, although poorer. These servicemen were referred to as valve/tube jockeys, much as PC "technicians" are referred to as board jockeys. Don't be a chip jockey.
I do the same as Peter... check first, replace last. I used to desolder 40 pin ceramic ICs and be able to reuse them if not faulty. Those ceramics (white) were so fragile back in the 70's.
I have replaced two propellers, one on the double and one on the single prop board. Both boards work afterwards. (Call me chip jockey but I have to fight bad yield as I have to deliver 100 working boards next week)
I inspected the solder joints of a third (bad) board with the microscope. The joints look nearly perfect (well, if there's a "perfect" at all for lead free solder...).
I'll try the flux-reflow method Peter suggested to test the whisker or bad joint theory.
I bought the chips from Mouser. Date code is 1629.
I sold ~2500 of these boards, and if they survive testing they work very reliably. I had some problems with connectors and the power circuits but not a single propeller or EEPROM failure in the last 5 years.
Try something like
Then try with $00.
Summary...mixed results...
#238 accepts $ffffffff, $00000000, $55555555, $aaaaaaaa, and also it accepts a short program that just blinks an LED. However, it errors out on a long program.
#239 accepts $ffffffff, but not $00000000, $55555555 or $aaaaaaaa. It does always accept $ffff0000. For a while it was accepting $000c0000. I thought I had the issue pinned on one bit $00080000, but then it started failing on that too. Inconsistent. Huh? Back to $ffff0000 which it accepts consistently. It also accepts the LED blink program so long as it also has the DAT to fill the rest of memory.
#240 accepts $ffffffff, $00000000, $55555555, but NOT $aaaaaaaa. Those $5 and $a patterns force bits to alternate states in case they are shorted together or something like that. This also accepts the LED blink program so long as it has the permissive DAT fill.
At some point I'll replace a chip gently and move the old one onto a minimal PCB for a separate test. The date code on these is 1709.
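Why the $55555555 and $aaaaaaaa fills are worth trying in addition to the solid ones can be shown with a small model. This is a sketch only: the fault (two adjacent bits shorted together and reading back as the AND of the two written bits) and the bit positions 18 and 19 are assumptions for illustration, not a diagnosis of any of the boards above.

```python
MASK32 = 0xFFFFFFFF

def read_with_short(written, bit_a, bit_b):
    """Hypothetical fault: bits bit_a and bit_b are shorted and both read
    back as the AND of the two written bits (wired-AND is an assumption)."""
    a = (written >> bit_a) & 1
    b = (written >> bit_b) & 1
    both = a & b
    cleared = written & ~((1 << bit_a) | (1 << bit_b)) & MASK32
    return cleared | (both << bit_a) | (both << bit_b)

def fill_passes(pattern, bit_a=18, bit_b=19):
    """True if the pattern reads back unchanged despite the short."""
    return read_with_short(pattern, bit_a, bit_b) == pattern

for pattern in (0xFFFFFFFF, 0x00000000, 0x55555555, 0xAAAAAAAA):
    print(f"${pattern:08X}", "ok" if fill_passes(pattern) else "FAIL")
```

In this model the solid fills read back fine because both shorted bits always agree, while the alternating fills put the two bits in opposite states and expose the short.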
I use a rolling pattern when I test the SRAM on my RamBlades. I use the address as the data, filling the whole SRAM, then I read it back and verify the address = data. It's a reasonable check for shorted/open address and data pins.
Perhaps you could try a small program to run a hub ram check, followed by the $ffffffff data.
I will write a test program and try on my prop boards.
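The address-as-data idea mentioned above can be sketched in a few lines. This is a simulation in Python rather than Spin/PASM, and the `FaultyRam` class with its single address line stuck low is a hypothetical fault model invented for the example.

```python
def address_as_data_test(write, read, size, step=4):
    """Fill memory with each long's own address, then verify.
    Returns a list of (address, expected, actual) mismatches."""
    for addr in range(0, size, step):
        write(addr, addr)
    errors = []
    for addr in range(0, size, step):
        actual = read(addr)
        if actual != addr:
            errors.append((addr, addr, actual))
    return errors

class FaultyRam:
    """Hypothetical RAM model, optionally with one address line stuck low."""
    def __init__(self, stuck_low_address_bit=None):
        self.cells = {}
        self.stuck = stuck_low_address_bit

    def _decode(self, addr):
        # A stuck-low address line makes two addresses select the same cell.
        if self.stuck is not None:
            addr &= ~(1 << self.stuck)
        return addr

    def write(self, addr, value):
        self.cells[self._decode(addr)] = value

    def read(self, addr):
        return self.cells.get(self._decode(addr), 0)

good = FaultyRam()
bad = FaultyRam(stuck_low_address_bit=4)
print(len(address_as_data_test(good.write, good.read, 0x100)))
print(address_as_data_test(bad.write, bad.read, 0x100)[:2])
```

A fill with one constant value would not notice this kind of fault, because the aliased cells all hold the same data anyway; using the address as the data makes every cell unique.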
I saw most of those in a cluster during the time a couple of years ago when the chips were in very short supply.
At the time Parallax support told me they had never had a DOA failure like that since the chips are 100% tested prior to shipping. I offered to send the parts to them for testing but I was told they couldn't put those chips in their test fixture because they had been soldered rather than socketed and could damage the contacts on the test fixture.
This is only a mini stub spin program to start a pasm program that will write $00 to all hub ram, then read it back while verifying. The value being written is displayed in hex, and when it is completely read back a "p" (passed) is displayed. Then the data value is incremented and the whole hub ram is retested, etc, until $FF has been tested, when an "e" is displayed.
You will need to use PST (serial program) or similar at 115,200 baud. I am only transmitting on P30 with an inbuilt pasm soft uart. (ie I do not use hub ram after the spin stub boots).
If a failure is detected, the address xxxx:yy!zz is displayed where xxxx is the hub address (hex), yy is the value written, and zz is the value read back.
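For anyone who wants to see the shape of the test without reading the PASM, here is a rough simulation of the write/read-back sweep and the xxxx:yy!zz report format described above. It is not the actual program: `StuckBitRam` is a made-up fault model, and the fault location and mask are chosen purely for illustration.

```python
class StuckBitRam:
    """Hypothetical byte-wide RAM with stuck-high bits at some addresses."""
    def __init__(self, stuck_high):
        self.stuck_high = stuck_high      # {address: mask of bits stuck at 1}
        self.cells = {}

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return self.cells.get(addr, 0) | self.stuck_high.get(addr, 0)

def sweep(ram, start, end, values):
    """Write each value to [start, end), read it back, and collect
    mismatches formatted as xxxx:yy!zz (address, written, read back)."""
    report = []
    for value in values:
        for addr in range(start, end):
            ram.write(addr, value)
        for addr in range(start, end):
            got = ram.read(addr)
            if got != value:
                report.append(f"{addr:04X}:{value:02X}!{got:02X}")
    return report

ram = StuckBitRam({0x1190: 0x40})         # fault chosen for illustration
print(sweep(ram, 0x1000, 0x1200, [0x00, 0xFF]))
```

Note that in this model the $FF pass reports nothing, since a stuck-high bit is invisible whenever the written value already has that bit set.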
For those with suspected faulty hub ram, the first DAT section fills the first 16KB (from $0018) with $FF. You can also try this with $00. If either of these will download correctly, then my program will run and test the whole hub ram.
If you cannot get the above to work, you can move the first DAT section to the end and try that. If not, try filling with $00. If either of these download correctly then my program will then test the whole hub ram.
FWIW I found that any code slightly larger than ~$4000 fails download with a COMMS error. I wasn't aware of this. I am using a CP2102 and have W10 in my PC, with PropTool 1.3.2. I will try an FT232 later.
I did run your program on two of my boards. The third board, #240, wouldn't load the program, so I put it aside for the moment.
On board #239 it printed out as follows,
00*1190:00!40
So, in attempting to write $00 to address $1190, it read back $40 instead of $00. Consistently.
With board #238, the result was,
00*1005:00!08
Error at address $1005, read back $08 instead of $00.
I messed around with that a bit and then modified your program to kick out a file of all the errors in the 32k hub ram. The files for #238 and #239 are attached. Both have lots of errors, but #238 far more than #239. The list is consistent when re-running the program. Also for reference there is a file from a normal board #226 that threw no errors, as an indicator that the program is operating correctly.
Looking at the result from #239, the errors are all in a 32 byte block of addresses from $1190 to $11AF. Here is the attempt to write $00 to that block.
00*
1190:00!40
1192:00!04
1193:00!20
1194:00!40
1196:00!04
1197:00!20
1198:00!40
119A:00!04
119B:00!20
119C:00!40
119E:00!04
119F:00!20
11A0:00!40
11A2:00!04
11A3:00!20
11A4:00!40
11A6:00!04
11A7:00!20
11A8:00!40
11AA:00!04
11AB:00!20
11AC:00!40
11AE:00!04
11AF:00!20
What do you think? It looks like crossover from the address to the data. Weird. For #238, the errors all fall in a larger 4085 byte block of addresses, from $1001 to $1FF5.
While we don't know how the hub is arranged, we do know that it would at least be in longs.
These columns show a long where
+0 has b6=1
+1 ok
+2 has b2=1
+3 has b5=1
and two longs are affected.
Haven't had time to check the others.
It is as if the 32 byte address range from $1190 to $11AF is organized as 8 longs, each with the following bits stuck high. (Written in conventional long order, with the most significant byte, at the highest address, on the left -- how is it physically?)
%00100000_00000100_00000000_01000000
$20040040
It stands to reason that it won't report an error if you write a byte to all locations that has all three of those bits set. For example, $64, $65, $66, or $67 do indeed report no errors, but $63 and $68 do, because the $3 is converted to $7 and the $8 is converted to $C.
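The arithmetic behind that can be checked with a short sketch. It assumes the stuck-high mask $20040040 applies to every long in $1190..$11AF and that byte +0 is the least significant byte of the long, which is what the error listing suggests. The function names are made up for the example.

```python
STUCK_HIGH = 0x20040040   # stuck-high bits per long, taken from the post above

def read_back(addr, written):
    """Byte read back at addr, assuming little-endian byte lanes in the long."""
    lane = addr & 3
    return written | ((STUCK_HIGH >> (8 * lane)) & 0xFF)

def errors_for(value, start=0x1190, end=0x11B0):
    """Report lines (xxxx:yy!zz) for every byte that reads back wrong."""
    return [f"{a:04X}:{value:02X}!{read_back(a, value):02X}"
            for a in range(start, end) if read_back(a, value) != value]

for value in (0x00, 0x63, 0x64, 0x67, 0x68):
    errs = errors_for(value)
    print(f"${value:02X}: {len(errs)} errors", errs[:3])
```

Writing $00 reproduces the start of the listing (1190:00!40, 1192:00!04, 1193:00!20), while $63 and $68 only go wrong in the +2 byte, where bit 2 is forced high.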
All the errors reported are in a 4096 byte block from $1000 to $1FFF.
Bit 3 %xxxx1xxx is stuck high at addresses of the form $1hh5.
Furthermore, bit 3 %xxxx0xxx is stuck low at addresses of the form $1hh1.
Unlike #239, where the repeating pattern was 4 bytes, in #238 the pattern repeats at 16 bytes. It involves one bit, stuck high for some addresses and stuck low for others.
>I was told they couldn't put those chips in their test fixture because they had
>been soldered rather than socketed and could damage the contacts on the test fixture.
I'm sure that they can find a way. Semiconductor companies often need to recall parts etcetera as part of failure analysis, so finding a DIP socket should not be a show stopper.
Keith, it is definitely worth pursuing with Parallax. I'll ask Jeff. I didn't want to bother them with something that could be traced to cold solder joints, but evidence is mounting that it is something more than that. Maybe it is a board handling or assembly issue, but I know my CM has the right equipment and skills. Yet Parallax uses the Prop in far more builds and would surely have run across this issue if it is attributable to the chips.
If this ever happens, we really appreciate knowing the Lot Codes and Date Codes from each bad chip (some lot codes span multiple date codes because of packaging logistics) and also knowing where you purchased them from. Lot codes indicate the batch of wafers the die came from and date codes indicate the time of packaging said die. We've been keeping yield data on every batch of Propellers, and a sample set from each batch from the last few years, so we can better look for any oddities should a customer later report problems like this.
We also really appreciate receiving the bad chips back for further diagnosis if they are in a good enough condition. Sorry about that, it's too bad that was said to you. It sounds like we could have done something with those; our Manufacturing staff is excellent with rework and probably could have cleaned them up enough for us to safely test them. I'll make sure our staff knows to at least quiz me about it before refusing the chips - having them here is one important piece of the puzzle I'd like to have, otherwise we're left with more questions.
The results of your tests do indeed indicate that the RAM is bad. We've never known a customer manufacturing process to cause this particular type of problem, and it's hard to imagine one, so I'm thinking that somehow the chips were shipped to you in that condition. If anyone in this thread still has said chips, please send them back for a replacement and make sure to mark them to me, Jeff Martin, so I can at least verify that they fail in our tester.
Some time ago (I can't remember how long), our tester passed chips experiencing a particular kind of RAM failure. When the customer sent his bad Propeller back to me, I confirmed this, determined the problem, and fixed it so the tester properly failed that chip. Since then, we've yet to find a case where a verified bad Propeller tested as good, but I keep looking for them just in case.
Around June/July we heard from a customer who was having trouble and we replaced his Propellers and received the bad ones back. The bad chips were severely bad; pulled too much current under test. They may have been damaged after shipping, but RAM failures like what you've experienced are not likely to have been caused that way. We've recently reviewed our internal processes and made some changes to improve them and want to continue making sure that our fab is producing within the agreed process window and that we're always shipping good packaged parts.
By the way, I think Peter clarified it, but in case anyone is still wondering: the boot loader does indeed verify the checksum of the received image after it has programmed it into RAM, and before it programs it into EEPROM (if requested) followed by a checksum verification from EEPROM as well. The RAM Checksum failure could actually be due to a communication problem, rather than a true RAM problem, as Peter indicated, but communication-induced problems are more likely to exhibit Propeller Not Found errors, and others too, each time you try to download. A consistent RAM Checksum failure message is very likely a real problem with the RAM.
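To illustrate why a stuck RAM bit shows up as a checksum failure for some images and not others, here is a minimal sketch. It assumes a simple additive checksum (all bytes summing to zero modulo 256); the boot loader's actual algorithm is not described in this thread, so treat the scheme, the 64-byte image size and the function names as assumptions.

```python
def additive_checksum(image):
    """Sum of all bytes modulo 256 (an assumed, simplified checksum)."""
    return sum(image) & 0xFF

def with_checksum(payload):
    """Append one byte so the whole image sums to zero modulo 256."""
    fix = (-sum(payload)) & 0xFF
    return bytes(payload) + bytes([fix])

def through_faulty_ram(image, addr, stuck_high_mask):
    """Image as read back when one byte has stuck-high bits (hypothetical)."""
    stored = bytearray(image)
    stored[addr] |= stuck_high_mask
    return bytes(stored)

zeros = with_checksum(bytes(64))          # all-$00 fill
ones = with_checksum(bytes([0xFF]) * 64)  # all-$FF fill

print(additive_checksum(through_faulty_ram(zeros, 10, 0x40)))  # nonzero: fails
print(additive_checksum(through_faulty_ram(ones, 10, 0x40)))   # 0: passes
```

Under these assumptions an all-$FF image still verifies on a chip with stuck-high bits, while an all-$00 image does not.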
Though it doesn't change the good/bad status of the Propellers in question, there's something else I should share here - a Propeller Main RAM failure could appear to be bad Main RAM locations, but could actually be good Main RAM locations with one or more Cogs that can't read them properly. For this reason, we test each Main RAM location from the perspective of each individual Cog. I haven't looked at the test code in this thread to see if it's doing that or not, just thought I should point it out for good measure.
-- Tracy
Thanks, Jeff, for the explanations. I'll collect all bad chips and return them to you as soon I have time.
Errrr, that means that there is a small chance that bad chips do not show as bad while programming. This could happen if the bootloader (cog0?) can read/write with no problems from/to RAM but a different cog couldn't. Bad chips showing up while programming are annoying but this can be easily corrected. Bad chips failing at the customer can get really expensive, though. This is very unlikely, I know, as we run an extra test that executes the actual software under real-world conditions (with multiple cogs).