Saturday, October 5, 2013

SATA errors on Linux with Samsung SSD 840 Series with Asus M2N-E


This problem had been driving me crazy for weeks.  I run a home Linux server (currently Fedora 18) as a NAS and as a host for various other services.  Ever since I upgraded the OS disks to SSDs, I have been getting SATA errors in my kernel logs:

Jul 31 00:02:26 kernel: [20859.208310] ata3:EH in SWNCQ mode,QC:qc_active 0x7E sactive 0x7E
Jul 31 00:02:26 kernel: [20859.208371] ata3: SWNCQ:qc_active 0x3E defer_bits 0x40 last_issue_tag 0x5
Jul 31 00:02:26 kernel: [20859.208482] ata3: ATA_REG 0x41 ERR_REG 0x84
Jul 31 00:02:26 kernel: [20859.208533] ata3: tag : dhfis dmafis sdbfis sactive
Jul 31 00:02:26 kernel: [20859.208585] ata3: tag 0x1: 1 0 0 1
Jul 31 00:02:26 kernel: [20859.208636] ata3: tag 0x2: 1 0 0 1
Jul 31 00:02:26 kernel: [20859.208697] ata3: tag 0x3: 1 0 0 1
Jul 31 00:02:26 kernel: [20859.208748] ata3: tag 0x4: 1 0 0 1
Jul 31 00:02:26 kernel: [20859.208800] ata3: tag 0x5: 0 0 0 1
Jul 31 00:02:26 kernel: [20859.208860] ata3.00: exception Emask 0x1 SAct 0x7e SErr 0x0 action 0x6 frozen
Jul 31 00:02:26 kernel: [20859.208914] ata3.00: Ata error. fis:0x21
Jul 31 00:02:26 kernel: [20859.208967] ata3.00: failed command: READ FPDMA QUEUED
Jul 31 00:02:26 kernel: [20859.209049] ata3.00: cmd 60/08:08:e8:23:bb/00:00:04:00:00/40 tag 1 ncq 4096 in
Jul 31 00:02:26 kernel: [20859.209251] ata3.00: status: { DRDY ERR }
Jul 31 00:02:26 kernel: [20859.209303] ata3.00: error: { ICRC ABRT }
Jul 31 00:02:26 kernel: [20859.209354] ata3.00: failed command: READ FPDMA QUEUED
Jul 31 00:02:26 kernel: [20859.209410] ata3.00: cmd 60/08:10:d8:26:bb/00:00:04:00:00/40 tag 2 ncq 4096 in
Jul 31 00:02:26 kernel: [20859.209605] ata3.00: status: { DRDY ERR }
Jul 31 00:02:26 kernel: [20859.209675] ata3.00: error: { ICRC ABRT }
Jul 31 00:02:26 kernel: [20859.209734] ata3.00: failed command: READ FPDMA QUEUED
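
The hex masks in those messages are easier to read once they are decoded into tag numbers.  Here is a minimal Python sketch that does this.  It was not part of my original debugging, the function name decode_tags is made up for illustration, and it assumes that bit N in a mask stands for NCQ tag N:

```python
from typing import List


def decode_tags(mask: int) -> List[int]:
    """Return the tag numbers whose bits are set in a bitmask.

    Assumption: bit N of the mask corresponds to NCQ tag N (tags 0..31).
    """
    return [tag for tag in range(32) if mask & (1 << tag)]


# Values copied from the log excerpt above.
sactive = 0x7E     # "SAct 0x7e" on the exception line
qc_active = 0x3E   # "SWNCQ:qc_active 0x3E"
defer_bits = 0x40  # "defer_bits 0x40"

print("SAct       ->", decode_tags(sactive))
print("qc_active  ->", decode_tags(qc_active))
print("defer_bits ->", decode_tags(defer_bits))
```

For the values above, SAct 0x7e decodes to tags 1 through 6, qc_active 0x3E to tags 1 through 5, and defer_bits 0x40 to tag 6.  That lines up with the five tag rows (0x1 to 0x5) printed in the log.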


The spinning SATA disks I replaced had also given me errors (I had assumed they were dying...), so seeing the same errors on the new SSDs was a bit of a mystery.  Was the SATA controller dying?  The errors did seem to correlate with heavier disk utilization.

I swapped the cables.  I swapped the port.  For a day I convinced myself the problem was associated with the port.  I forced the kernel to throttle the SATA link to 1.5 Gb/s (libata.force=1.5G).  I tried libata.noacpi=1.

Finally I found the answer: libata.noncq

It's even a bit obvious now in the log messages: SWNCQ

Apparently newer drives let the kernel offload the ordering of queued commands to the drive itself (Native Command Queuing, or NCQ), for reads as well as writes.  After all, the drive knows its physical properties better than the kernel:  http://en.wikipedia.org/wiki/NCQ
I did file a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1013229

Now that I've added libata.noncq to the kernel command line, my errors have gone away.  I'll try removing it someday when I upgrade the motherboard or OS.  I suspect the culprit is the SATA controller on the motherboard.
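
To confirm that a change like this actually took effect after a reboot, one option is to look at the queue depth the kernel reports for the disk in sysfs: a depth of 1 means commands are no longer being queued on the drive.  The sketch below is only an illustration under a few assumptions.  The helper names are made up, the device name "sda" is a placeholder, and the paths are the usual /sys/block/<device>/device/queue_depth and /proc/cmdline locations.  The kernel's own parameter documentation lists the NCQ switch under libata.force (as libata.force=noncq), so the command-line check takes the parameter string as an argument rather than hard-coding one spelling:

```python
from pathlib import Path


def ncq_queue_depth(device: str, sysfs_root: str = "/sys") -> int:
    """Read the queue depth the kernel reports for a disk such as 'sda'."""
    path = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    return int(path.read_text().strip())


def ncq_in_use(device: str, sysfs_root: str = "/sys") -> bool:
    """Treat a queue depth above 1 as 'commands are being queued on the drive'."""
    return ncq_queue_depth(device, sysfs_root) > 1


def cmdline_has(param: str, cmdline_path: str = "/proc/cmdline") -> bool:
    """Check whether an exact parameter string is on the kernel command line."""
    return param in Path(cmdline_path).read_text().split()


# Example calls on a real system ("sda" is a placeholder device name):
#   ncq_in_use("sda")
#   cmdline_has("libata.noncq")
```

If ncq_in_use() still returns True after rebooting with the parameter, the option was probably not picked up, and the spelling on the kernel command line is the first thing to check.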