Last summer, I wrote about how my aging ThinkPad laptop had become unbearably slow after reinstalling Windows 10. I eventually solved that problem, but the laptop became slow again a few months later. This time, I’d given up and concluded that the problem was beyond my ability to diagnose. Luckily, I stumbled upon the underlying hardware problem when I disassembled it to test something else entirely.
I’ll cut straight to the chase: the problem was a link-layer problem with the M.2 (formerly known as Next Generation Form Factor, or NGFF) connector between the laptop and its solid-state drive (SSD). Reseating the M.2 drive and adapter fixed the problem. So, why couldn’t I diagnose the link-layer problem without physically disconnecting the drive?
The problem was difficult to diagnose under Windows 10. Neither the Windows Event Log nor the drive’s Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) report indicated any problem. The drive was only half full, and it wasn’t overheating. It even performed well in synthetic performance benchmarks.
However, the entire operating system would stall and become unresponsive for whole seconds at a time. It took several seconds to open the Start menu or any program window. The performance degradation was intermittent but persistent. I suspected a compatibility problem between a Windows 10 feature or driver and the aging hardware in my laptop.
As you may have understood from the feature photo at the top of the article, the problem was caused by connector fretting. My microscopy game isn’t strong, but I managed to snap a photo to confirm abrasions on all the M.2 connector pins. The abrasions go all the way into the nickel connectors, and the black spots are corrosion. This shouldn’t be a huge surprise, though. The M.2 specification only requires that M.2 connectors last 60 mating cycles (connects/disconnects).
M.2 devices are usually screwed into the mainboard to keep them in place. However, the first-generation Lenovo ThinkPad X1 Carbon didn’t use a standard M.2 connector. Lenovo released this laptop model before the connector was finalized. To upgrade it with a larger standard M.2 drive, I had to use a third-party adapter. I had forgotten to fasten the adapter to the mainboard, and the adapter itself has no obvious way to secure the drive. To give it some stability, I taped the drive down and hoped the chassis itself would hold it in place.
To make things worse, the drive is positioned right under the keyboard. Lenovo designed the laptop with a small air gap between the M.2 drive and the keyboard. However, the adapter pushed the drive right up against the keyboard, so the drive wobbled a little every time I struck a key.
What’s more interesting is that I couldn’t detect the link problems when running performance benchmarks. I would set the laptop on a table to run benchmarks, unlike how I would usually balance it on my lap and type on the keyboard. I believe that the vibrations from typing on the keyboard caused intermittent link drops.
In hindsight, I can see that this setup was a recipe for disaster. I’d never have done this given what I now know about how weak the M.2 connector is! At any rate, the connector was damaged, and this caused intermittent disconnects in the drive’s Serial Advanced Technology Attachment (SATA) link.
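A throughput benchmark averages over many operations, so a handful of multi-second stalls can disappear into the result. A probe that records the latency of every small write makes those stalls visible instead. The following is only a minimal sketch of that idea, not a tool I used at the time: the function names, the 4 KiB payload, and the half-second stall threshold are all assumptions picked for illustration. It also relies on `os.fsync` pushing each write to the drive, and caching in the drive or controller can blur the numbers.

```python
import os
import tempfile
import time


def probe_write_latency(directory, samples=100, payload_size=4096, pause_s=0.05):
    """Time small synced writes in `directory`; return each latency in seconds."""
    payload = os.urandom(payload_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.lseek(fd, 0, os.SEEK_SET)  # overwrite the same small region
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush the write to the drive
            latencies.append(time.perf_counter() - start)
            time.sleep(pause_s)  # spread the samples out over time
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies, stall_threshold_s=0.5):
    """Report the median, the worst case, and how many samples count as stalls."""
    ordered = sorted(latencies)
    return {
        "median_s": ordered[len(ordered) // 2],
        "worst_s": ordered[-1],
        "stalls": sum(1 for value in latencies if value >= stall_threshold_s),
    }
```

Running `summarize(probe_write_latency(some_directory))` once with the laptop resting on a table and once while typing on it would, in principle, show whether the worst-case latency depends on how the machine is being handled.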
I’ve always believed that either a SATA drive would work or it wouldn’t. I found it interesting that Windows 10 wasn’t able to detect the problem. However, it turns out that legacy SATA, M.2 SATA, and even M.2 PCIe drives are very tolerant of link failures.
The SATA Advanced Host Controller Interface (SATA-AHCI) is designed to continuously retransmit data during a link failure until receipt is confirmed or a software driver intervenes. Microsoft’s Standard SATA AHCI Controller driver seems to be very tolerant of errors, and it also doesn’t log the errors. The behavior is both good and bad: it’s good because my computer kept working at a significantly reduced speed despite the problem, and it’s bad because it didn’t notify me of any problems with the drive.
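To illustrate why retrying until success hides the fault, here’s a toy simulation. It isn’t a model of the actual SATA protocol or of Microsoft’s driver. The frame time, the retry delay, and the drop probability are made-up numbers, and `transfer` is a hypothetical helper. It only demonstrates the general pattern: every frame eventually arrives, so nothing ever looks like an error, while the total transfer time balloons.

```python
import random


def transfer(frames, drop_probability, retry_delay_ms=10.0, frame_time_ms=0.1, seed=0):
    """Send frames over a lossy link, retrying each one until it gets through.

    Returns (delivered_frames, total_retries, elapsed_ms).
    """
    if not 0.0 <= drop_probability < 1.0:
        raise ValueError("drop_probability must be in [0, 1)")
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    delivered = 0
    retries = 0
    elapsed_ms = 0.0
    for _ in range(frames):
        while True:
            elapsed_ms += frame_time_ms  # time spent on this attempt
            if rng.random() >= drop_probability:
                delivered += 1  # acknowledged, move on to the next frame
                break
            retries += 1
            elapsed_ms += retry_delay_ms  # assumed wait before retrying


    return delivered, retries, elapsed_ms


if __name__ == "__main__":
    for probability in (0.0, 0.05, 0.2):
        print(probability, transfer(1000, probability))
```

In this sketch, the caller only ever sees the delivered count, which always equals the number of frames sent. The retry count and the elapsed time are the only traces of the bad link, and neither would surface if nobody logged them.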
One layer below the SATA-AHCI bus, we find the Peripheral Component Interconnect Express (PCIe) interface. PCIe has a similar feature: it retransmits transaction layer packets (TLPs) that weren’t received correctly. This can affect links from an M.2 SATA drive (or a legacy SATA connector) via the SATA-AHCI bus, or from an M.2 PCIe drive connected directly to PCIe.
Both failure modes would create similar problems. I first thought that I was experiencing a PCIe TLP retransmission problem. After studying the relevant standards, I now believe the problem is with SATA-AHCI retransmission since the physical connector damage was to the M.2 connector between the M.2 SATA drive and the SATA-AHCI bus.
Steve Gibson, the developer of the drive diagnostic utility SpinRite, put me on the right track after briefly discussing a similar problem on a recent episode of his Security Now podcast. In the same podcast episode, he claimed that future versions of SpinRite will be able to identify link-layer problems.
So, can this type of damage to the M.2 connector pins be repaired? Theoretically, yes; practically, no. An M.2 pin is no more than 0.33 mm wide, and the damaged area is only 0.15 mm. I couldn’t even see the damage without a microscope. Repairing the pins at this minuscule scale is unfeasible.
In my case, I was able to “fix” the problem by remating the connector. This strengthened the connection by aligning it to a different, undamaged part of each connector pin. In the future, I’ll make sure my M.2 devices are screwed down and securely fastened.
In conclusion: SATA links are fault-tolerant to the extreme, and M.2 connectors are brittle. It’s difficult to troubleshoot issues later, so make sure you properly secure M.2 connections.