Saturday, October 5, 2013

Western Digital Green Drives and Linux - They're dying!

The last time I swapped the drives out in my home built NAS, I bought Western Digital Green drives.  They sounded great!  Low power, low heat, low noise...

Unfortunately, they do not appear to have been built for this particular application.  They automatically park their heads after a short idle period, and that idle timer is set too low for Linux filesystems.  (I believe the default may be 8 seconds! http://techreport.com/forums/viewtopic.php?f=5&t=78891)

My drives began to give me trouble (thankfully) before the warranty ran out.  I found the load cycle count on my drives to be extremely high: tens or hundreds of thousands.  This is way beyond the expected count.

Here's how to query for it:

# smartctl -A /dev/sdX | grep ^193
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       XXXXXXX
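
If you have several drives, it may be handier to pull out just the raw count.  Here is a minimal sketch, assuming smartctl prints the usual 10-column attribute table shown above, with the raw value in the last column.  The function name load_cycle_count is my own invention, not part of smartmontools.

```sh
#!/bin/sh
# Print the raw Load_Cycle_Count (SMART attribute 193) from
# `smartctl -A` output supplied on stdin.
# Assumption: the raw value is the 10th whitespace-separated column.
load_cycle_count() {
    awk '$1 == 193 { print $10 }'
}

# Hypothetical usage (needs root and smartmontools installed):
#   for dev in /dev/sd[a-z]; do
#       printf '%s %s\n' "$dev" "$(smartctl -A "$dev" | load_cycle_count)"
#   done
```

Running it again a day later and comparing the two numbers should show how fast a drive is racking up load cycles.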


Luckily there is a solution -- run this tool to either turn off the feature, or set the timeout much higher:

    http://idle3-tools.sourceforge.net/

Make sure you follow the instructions closely -- you must power cycle the drives!

Now I run this script at every boot, in case I install a new WD Green and forget to change its timer.

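#!/bin/sh
# Run at boot: disable the WD idle3 head-park timer and set a long standby timeout.
for dev in /dev/sd[a-z]
do
    # Skip the unexpanded glob pattern if no drives match.
    [ -b "$dev" ] || continue
    if ! hdparm -i "$dev" | grep -q 'Model=.*SSD'
    then
        if hdparm -i "$dev" | grep -q 'Model=WDC'
        then
            echo "Disabling Western Digital 8 sec park on $dev"
            /opt/idle3-tools/sbin/idle3ctl -d "$dev"
        fi
        echo "Setting standby timeout for non-SSD drive $dev"
        hdparm -S 251 "$dev"
    fi
done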
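
The 251 passed to hdparm -S is not a number of seconds.  As I read the hdparm man page, values 1 to 240 are multiples of 5 seconds, 241 to 251 are multiples of 30 minutes, 252 means 21 minutes, and 255 means 21 minutes 15 seconds.  So 251 should be 5.5 hours.  Here is a small sketch that decodes a value; the function name spindown_seconds is made up for illustration.

```sh
#!/bin/sh
# Decode an `hdparm -S` value into seconds, following the encoding
# described in the hdparm man page. Prints 0 when the timeout is disabled.
spindown_seconds() {
    v=$1
    if [ "$v" -eq 0 ]; then
        echo 0                          # 0 = standby timeout disabled
    elif [ "$v" -le 240 ]; then
        echo $((v * 5))                 # 1..240: units of 5 seconds
    elif [ "$v" -le 251 ]; then
        echo $(((v - 240) * 1800))      # 241..251: units of 30 minutes
    elif [ "$v" -eq 252 ]; then
        echo 1260                       # 21 minutes
    elif [ "$v" -eq 255 ]; then
        echo 1275                       # 21 minutes 15 seconds
    else
        echo "vendor-defined or reserved value: $v" >&2
        return 1
    fi
}
```

For example, spindown_seconds 251 prints 19800, which is 5.5 hours.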

(WD has mostly been sending me WD Red drives as warranty replacements, which is great, as I believe they are marketed for NAS use.)




SATA errors on Linux with Samsung SSD 840 Series with Asus M2N-E


This problem had been driving me crazy for weeks.  I run a home Linux server (currently Fedora 18) as a NAS and for various other services.  Ever since I upgraded the OS disks to SSDs, I have been getting SATA errors in my kernel logs:

Jul 31 00:02:26 kernel: [20859.208310] ata3:EH in SWNCQ mode,QC:qc_active 0x7E sactive 0x7E
Jul 31 00:02:26 kernel: [20859.208371] ata3: SWNCQ:qc_active 0x3E defer_bits 0x40 last_issue_tag 0x5
Jul 31 00:02:26 kernel: [20859.208482] ata3: ATA_REG 0x41 ERR_REG 0x84
Jul 31 00:02:26 kernel: [20859.208533] ata3: tag : dhfis dmafis sdbfis sactive
Jul 31 00:02:26 kernel: [20859.208585] ata3: tag 0x1: 1 0 0 1
Jul 31 00:02:26 kernel: [20859.208636] ata3: tag 0x2: 1 0 0 1
Jul 31 00:02:26 kernel: [20859.208697] ata3: tag 0x3: 1 0 0 1
Jul 31 00:02:26 kernel: [20859.208748] ata3: tag 0x4: 1 0 0 1
Jul 31 00:02:26 kernel: [20859.208800] ata3: tag 0x5: 0 0 0 1
Jul 31 00:02:26 kernel: [20859.208860] ata3.00: exception Emask 0x1 SAct 0x7e SErr 0x0 action 0x6 frozen
Jul 31 00:02:26 kernel: [20859.208914] ata3.00: Ata error. fis:0x21
Jul 31 00:02:26 kernel: [20859.208967] ata3.00: failed command: READ FPDMA QUEUED
Jul 31 00:02:26 kernel: [20859.209049] ata3.00: cmd 60/08:08:e8:23:bb/00:00:04:00:00/40 tag 1 ncq 4096 in
Jul 31 00:02:26 kernel: [20859.209251] ata3.00: status: { DRDY ERR }
Jul 31 00:02:26 kernel: [20859.209303] ata3.00: error: { ICRC ABRT }
Jul 31 00:02:26 kernel: [20859.209354] ata3.00: failed command: READ FPDMA QUEUED
Jul 31 00:02:26 kernel: [20859.209410] ata3.00: cmd 60/08:10:d8:26:bb/00:00:04:00:00/40 tag 2 ncq 4096 in
Jul 31 00:02:26 kernel: [20859.209605] ata3.00: status: { DRDY ERR }
Jul 31 00:02:26 kernel: [20859.209675] ata3.00: error: { ICRC ABRT }
Jul 31 00:02:26 kernel: [20859.209734] ata3.00: failed command: READ FPDMA QUEUED


Since the spinning SATA disks I replaced also gave me errors (I assumed they were dying...), it was a bit of a mystery.  Is the SATA controller dying?  The errors did seem to correlate with greater disk utilization.
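
To check whether the errors really follow one port, it can help to count them per device.  Here is a rough sketch that tallies the "failed command" lines from a kernel log like the one above.  The function name count_ata_failures is my own, and it assumes the device tag looks like "ata3.00:" as it does in my logs.

```sh
#!/bin/sh
# Count "failed command" kernel log lines per ATA device.
# Reads log lines on stdin, prints "<device> <count>" sorted by device.
count_ata_failures() {
    awk '/failed command:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
                sub(/:$/, "", $i)    # strip the trailing colon
                n[$i]++
            }
    }
    END { for (p in n) print p, n[p] }' | sort
}

# Hypothetical usage (the log location varies by distribution):
#   dmesg | count_ata_failures
#   count_ata_failures < /var/log/messages
```

If the counts move to a different ataN after swapping ports, the port is probably not the culprit.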

I swapped the cables.  I swapped the port.  For a day I convinced myself the problem was associated with the port.  I forced the kernel to throttle the SATA to 1.5Gb/s (libata.force=1.5G).  I tried libata.noacpi=1.

Finally I found the answer: libata.noncq

It's even a bit obvious now in the log messages: SWNCQ

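Apparently newer drives support Native Command Queuing (NCQ), which lets the kernel hand several read and write commands to the drive at once and leave their ordering to the drive.  After all, the drive knows its physical properties better than the kernel:  http://en.wikipedia.org/wiki/NCQ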
I did file a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1013229

Now that I've added libata.noncq to the kernel command line, my errors have gone away.  I'll try removing it someday when I upgrade the motherboard or OS.  I suspect it is the SATA controller on the motherboard.
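
One way to confirm the option took effect is to look at the drive's queue depth in sysfs.  As far as I know, a depth of 1 means NCQ is effectively off, and something like 31 means it is on.  Here is a minimal sketch; the function name ncq_status and its optional second argument (an alternate sysfs root, handy for testing) are my own.  Also note that the kernel documentation spells the option libata.force=noncq, so check which spelling your kernel accepts.

```sh
#!/bin/sh
# Report whether NCQ appears to be active for a disk, based on
# /sys/block/<disk>/device/queue_depth.
# $1 = disk name (e.g. sda), $2 = sysfs root (defaults to /sys).
ncq_status() {
    sysroot=${2:-/sys}
    f="$sysroot/block/$1/device/queue_depth"
    if [ ! -r "$f" ]; then
        echo "$1: queue depth unknown"
        return 1
    fi
    depth=$(cat "$f")
    if [ "$depth" -gt 1 ]; then
        echo "$1: queue depth $depth (NCQ on)"
    else
        echo "$1: queue depth $depth (NCQ off)"
    fi
}

# Hypothetical usage:
#   ncq_status sda
#   cat /proc/cmdline     # shows the options the kernel actually booted with
```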



When to ask for help when stuck on a technical problem

I thought this blog post by Matthew Ringel was a concise and useful summary of how (and when) to ask for help when stuck on a technical problem:
https://blogs.akamai.com/2013/10/you-must-try-and-then-you-must-ask.html
I often find coworkers skipping step #1: working at the problem a little longer and documenting/recording/reviewing what you've already tried.

When I do step #1 I often solve the problem myself.  I might be in the middle of writing an email explaining what the problem is.  I owe it to the person I'm asking to explain what I've tried, what results I got, etc.  Most of the time the solution then presents itself!  I then just delete the email draft and keep plugging away.

But then sometimes I wait too long for step #3 - going and asking for help.

And then sometimes, I just need a rubber duck: http://www.rubberduckdebugging.com/



Removing Embedded JPGs from Nikon NEF Files with Exiftool

I've been migrating away from Capture NX2 to Lightroom for editing my raw NEFs.  But I'm not quite ready to convert completely from NEF to DNG (Digital Negative).  I still might want to edit a photo in Capture - the control points are just too useful.  One tempting advantage of DNGs is that they are reportedly a little smaller than NEFs.

Why are the DNGs smaller?  I believe it is due to the NEF's embedded jpgs.  But Lightroom doesn't really need the jpeg rendering that is stored in the NEF and I could always recreate them later.  So how can I drop them?

Exiftool!

Jeffrey Friedl's blog post at http://regex.info/blog/2006-12-08/303 put me on the right track, but I think his information is out of date now.  I found that my NEF had three jpgs:
  • JpgFromRaw (Full Size!)
  • OtherImage (Fairly large!)
  • PreviewImage (Thumbnail-ish)
Let's see how big they are in an NEF that is 44133771 bytes:

$ exiftool -list D8H_2754_20131001_183719.NEF  | egrep Binary\ data
[...]
Jpg From Raw                    : (Binary data 3492514 bytes, use -b option to extract)
Other Image                     : (Binary data 858341 bytes, use -b option to extract)
Preview Image                   : (Binary data 99052 bytes, use -b option to extract)
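
To see how much of the file the embedded jpgs take up in total, the sizes can be added up straight from that output.  This is a minimal sketch, assuming the "(Binary data N bytes, ...)" wording shown above; the function name embedded_jpg_bytes is my own.

```sh
#!/bin/sh
# Sum the sizes reported in "(Binary data N bytes, ...)" lines
# of exiftool output supplied on stdin.
embedded_jpg_bytes() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "data" && $(i + 2) ~ /^bytes/)
                total += $(i + 1)
    }
    END { print total + 0 }'
}

# Hypothetical usage (file name is a placeholder):
#   exiftool some_photo.NEF | embedded_jpg_bytes
```

For the three lines above, that comes to 4449907 bytes, or roughly 10% of the 44133771-byte NEF.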

I first attempted to just replace the JpgFromRaw with the PreviewImage.  That worked, but it only stores another copy of the preview jpg.  Still, here is how you do it:

$ exiftool -v -JpgFromRaw\<PreviewImage 2754_20131001_183719.NEF
$ exiftool -v -OtherImage\<PreviewImage 2754_20131001_183719.NEF

So how do I just delete them?  Just delete the tags:

$ exiftool -JpgFromRaw= -OtherImage= -overwrite_original_in_place -P 2754_20131001_183719.NEF

(I'm leaving the smallest one, the ~100KB PreviewImage.)

I ran this on a folder of about 11GB of NEFs and when done, the folder was 9.1GB.  That's 17% smaller!
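
Here is the arithmetic behind that percentage as a tiny helper, in case you want to measure your own folders.  The function name percent_saved is made up, and the commented usage is only a sketch: it assumes exiftool's -ext option to limit processing to NEF files in a directory, and the directory path is a placeholder.

```sh
#!/bin/sh
# Print the percentage saved given a "before" and an "after" size
# (any unit, as long as both use the same one).
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Hypothetical usage on a whole folder (back it up first!):
#   dir=/path/to/nefs
#   before=$(du -sk "$dir" | cut -f1)
#   exiftool -JpgFromRaw= -OtherImage= -overwrite_original_in_place -P -ext nef "$dir"
#   after=$(du -sk "$dir" | cut -f1)
#   percent_saved "$before" "$after"
```

With my numbers, percent_saved 11 9.1 prints 17.3.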

I'm going to limit it to this folder for now, but will expand this to other parts of my archives as I gain more confidence that I really don't need the embedded jpgs.

Update:  This requires exiftool 9.03 or later.  See the topic "Otherimage" in NEFs of D800.