Serious security issue with ENC28J60 driver

ManAtWork · 2019-08-01 07:41

I recently found out that the propeller can crash really awfully when used together with a ENC28J60 ethernet interface and a "D-Link" network switch (currently sold at "Media Markt", a german electronics discounter). This switch sometimes emits a garbage packet with invalid header but (seemingly) correct CRC.

The function get_frame() in driver_enc28j60.spin contains the following code

  ' protect from oversized packet
  if pktptr AND (rxlen =< MAXFRAME)
    rd_block(pktptr, rxlen-4)

This reliably protects from a buffer overrun if the packet size was too large. But it does not protect against undersized packets! If rxlen is between 0 and 3 it literally opens the gate to hell. rd_block() interprets negative sizes as infinity and overwrites all the hub memory with the contents of the PHY buffer. This does not crash (freeze) the propeller which was an annoying but somehow acceptable reaction to "invalid data". But instead the still running spin interpreters in some of the cogs eat the garbage and run amok in a really uncontrolled way. They randomly toggle output pins, send random characters over the serial interface and other disastrous stuff.

You really don't want to see what happens if this propeller does not only control a toy robot but a full size industrial robot or a cnc machine with hydraulics and powerful motors in it. Well, Smile happens... and although this took me a month to debug it is relatively easy to fix.

What I worry more about is that the propeller is somehow prone to such runaway crashes because
1) all outputs from all cogs are OR-wired and all DIR registers are also OR-wired. A single cog crashing can take over all pins and there's no way stopping it.
2) there is no (hardware) watchdog or memory protection
3) there is no invalid instruction exception or something that could stop a cog from executing garbage code

Mickster · 2019-08-01 11:16

full size industrial robot or a cnc machine with hydraulics and powerful motors in it. Well, Smile happens... and although this took me a month to debug it is relatively easy to fix.

Oh, man, can I relate to this.....and it usually happens when you have just handed-off the equipment to the operators and your client starts to look at you sideways....been there myself Nicolas.

When I come across this pontificating about pre-written libraries, I shake my head...no thanks.

ManAtWork · 2019-08-01 12:59

The "funny" part of the story was, when we first encountered the problem there was no hint at all that this was caused by the network switch. It only happened when the machine was powered off for at least two days. When it was switched on again (and only for the first time on that day) all relays and solenoids were ratteling like an old Zuse computer and the pneumatic toolchanger was running wild. If switched off and on again - silence... we couldn't reproduce it unless we waited for two more days.

So we thought it was a thermal problem. We cooled down the boards with ice spray or heated them with a heat gun - no success. We changed boards, power supplies and everything else (except the switch!). And even more frustrating, every single board worked perfectly well on the workbench. The spooky effects came back as soon as we put the board back into that "cursed" machine.

Then we thought it was an EMI problem, some power glitch that corrupted the code while the propeller was loading it from the EEPROM. We even suspected the mains switch (a really big one) to cause glitches due to contact bounce. We tested all possible variations of voltage ramp up slopes and reset pulses. We removed all possible sources of noise in the machine (lube pump, transformers...) and cut all connections to loads. No effect. Once the controller worked it continued to work reliably no matter how hard it was tortured and how many times it was power-cycled. But leaving it untouched and powered down for several days ... bang, gone berserk on first power-on.

The enlighment came when I (rather accidentally) power cycled the network switch while desperately plugging out and in some connectors, again. It came out that a special power on delay between the controller board and the power supply of the network switch was required to provoke the fault. The exact delay was only met when both power supplies where completely discharged and cold.

ManAtWork · 2019-08-01 14:24

BTW, I've fixed the problem with the following code

  ' protect from over/undersized packet
  if pktptr AND (rxlen =< MAXFRAME) AND (rxlen => 18)

Does anybody know anything about the whereabouts of the original author, Harrison Pham? The link to his web site in OBEX doesn't work.

Mickster · 2019-08-01 14:47

As I type, I am in a production facility, just finished a 3-axis CNC tube bending machine...everything, except the HMI is a ground-up development.
The client has seen this thing running for more than 2-weeks but I have refused to hand it over until I am convinced that it's absolutely bullet proof.

A flaky app on a desktop is one thing but potentially lethal machinery is a whole different issue. One can't assume that an operator is going to operate one's system the way it's supposed to be.

Every bit of my code is authored by me and I spend 90%, of my time testing, testing, testing.

You and I could share many experiences, I'm sure.

Rayman · 2019-08-02 20:49

I guess it would be nice to be able to put limits on what a driver cog can do...
So, a bad driver wouldn't be able to take down the whole system...

msrobots · 2019-08-02 21:56

That is some nice 'side-effect' the RDFAST/WRFAST stuff offers on the P2. You can limit the buffer size to some /16 byte size and have build in buffer overflow protection because it automatically goes from buffer end to buffer start.

Mike

Rayman · 2019-08-02 22:12

That is an interesting idea for P2...

SaucySoliton · 2019-08-02 22:22

ManAtWork wrote: »

What I worry more about is that the propeller is somehow prone to such runaway crashes because
1) all outputs from all cogs are OR-wired and all DIR registers are also OR-wired. A single cog crashing can take over all pins and there's no way stopping it.
2) there is no (hardware) watchdog or memory protection
3) there is no invalid instruction exception or something that could stop a cog from executing garbage code

1. If pins are active when output low, then then a misbehaving cog can only force outputs to inactive.
2. Cog ram is protected. So the spin interpreter can keep running. Data and code are in the hub ram, vulnerable.
2, 3. Would have to be done in software.

This is unfortunate as having multiple independent cogs could provide isolation between code. To be fair, most microcontrollers don't have any memory or IO pin protection.

I'm using the prop to pre-charge an electric vehicle drive system. My solution was to write the critical sequencing code in PASM. The code lives in a separate file so it's less likely to be modified accidentally. It doesn't solve every issue.

Mickster wrote: »

A flaky app on a desktop is one thing but potentially lethal machinery is a whole different issue. One can't assume that an operator is going to operate one's system the way it's supposed to be.

A classic example of this: Therac 25. Although industrial and automotive malfunctions can be equally dangerous.

Tracy Allen · 2019-08-06 22:41

Directed back to this from your watchdog thread. The story would be a good match for EDN's Tales from the Cube. For those who don't know, that's a great series, a reality check for engineers who always suspect the infamous Murphy lurking around unsuspected nodes or loops of circuitspace.

ke4pjw · 2019-08-06 22:54

ManAtWork wrote: »
BTW, I've fixed the problem with the following code
  ' protect from over/undersized packet
  if pktptr AND (rxlen =< MAXFRAME) AND (rxlen => 18)
Does anybody know anything about the whereabouts of the original author, Harrison Pham? The link to his web site in OBEX doesn't work.

Why not

if pktptr AND (rxlen =< MAXFRAME) AND (rxlen => 4)

SaucySoliton · 2019-08-07 18:07

ke4pjw wrote: »
ManAtWork wrote: »
BTW, I've fixed the problem with the following code
  ' protect from over/undersized packet
  if pktptr AND (rxlen =< MAXFRAME) AND (rxlen => 18)
Does anybody know anything about the whereabouts of the original author, Harrison Pham? The link to his web site in OBEX doesn't work.
Why not
if pktptr AND (rxlen =< MAXFRAME) AND (rxlen => 4)

2 MAC addresses are 12 bytes, 2 bytes for type, then 4 for the FCS. So packets less than 18 bytes can't really be processed anyway.

The version in the obex does not check pktptr. Nor does it do the -4. Is there another version somewhere else?

It looks to me that this bug was from an optimization to avoid transferring the 4 byte FCS at the end of the packet.

Publison · 2019-08-07 19:13

ManAtWork wrote: »

BTW, I've fixed the problem with the following code

Does anybody know anything about the whereabouts of the original author, Harrison Pham? The link to his web site in OBEX doesn't work.

His website looks blank now. May he still has email running:

Harrison Pham <harrison@harrisonpham.com>

localroger · 2019-08-08 02:43

Harrison has been off-grid for awhile now, like many other early P1 users. I recently ported this ENC28J60 code to the P2 via FastSpin and I will look into fixing this bug tomorrow. In any case the code is MIT licensed so it is OK to apply fixes like this even if you can't find him.

localroger · 2019-08-08 02:46

Also, now that I'm thinking about it in Harrison's defense, I think there is actually a minimum size specification for an ethernet packet; I recall a line in the code padding the packet out if it doesn't meet this spec. So if you're receiving a packet with length 2 then there is something seriously broken routing it to you.

Serious security issue with ENC28J60 driver

Comments