Fixing accents on the new NAS
(grr... I wrote this for Sunday's post but it didn't post. This isn't today's post, think weekend-y thoughts!)

As you know, I have a new Synology 1812+ NAS spinning away in the closet.

One thing I noticed yesterday morning is that a couple of files, mostly songs, behaved oddly when read from the Mac.

So, I'll use the filename "05 - Su-su bolondság" as the example. (In case you were wondering, this is off the soundtrack to Csinibaba, a 1997 Hungarian musical comedy film.)

So, the process I went through to get things set up is part of the problem. All I did was rsync from the Firewire-mounted external drives. This turned out to be vastly faster than the drag-and-drop method of copying. Additionally, rsync can easily be restarted if something causes it to quit for some reason.

So it went: four drives rsynced over.

Then, from the Mac, I did a du just for fun.

Some files, including Su-su, reported that they could not be found. When I browsed to them from the Finder (actually Path Finder), sure enough, they were there. When I tried to play one, however, VLC and iTunes returned the same error.

Ssh-ing into the NAS, I was certainly able to see them, and I could cat them to my heart's content. The bits were there.

This led me down the rabbit hole of UTF-8 encoding of characters, specifically Unicode normalization. The Mac prefers NFD, while *nix prefers NFC. NFC, for instance, codes "á" (accented a) as "accented a." It is the composed representation of the "accented a" character: a single code point. The Mac, on the other hand, wants a decomposed form. "á" is stored as "a" followed by a separate combining accent.

Why prefer one over the other? Searching. If, for instance, you search for "bolondsag" instead of "bolondság", it is helpful to find the file regardless. When you store "á" as "a + accent" you can get rid of the "+ accent" pretty easily without much fuss or bother; it's a lot harder when you have "accented a" since you have to decompose it first.

Both are correct.
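To make the two forms concrete, here is a small sketch in Python 3. It is my own illustration, not something that ran on the NAS. The helper name strip_accents is made up for this example; unicodedata.normalize and unicodedata.combining are from the standard library.

```python
import unicodedata

# Build both forms from an escaped literal so the source file's own
# normalization cannot affect the result (\u00e1 is the precomposed "á").
name_nfc = unicodedata.normalize("NFC", "bolonds\u00e1g")
name_nfd = unicodedata.normalize("NFD", "bolonds\u00e1g")

print(name_nfc == name_nfd)          # False: same look, different code points
print(len(name_nfc), len(name_nfd))  # 9 10: NFD carries an extra combining accent
print(name_nfc.encode("utf-8"))      # b'bolonds\xc3\xa1g'
print(name_nfd.encode("utf-8"))      # b'bolondsa\xcc\x81g'


def strip_accents(text):
    """Hypothetical helper: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(name_nfc))       # bolondsag
```

The last function is the searching argument in miniature: once the text is decomposed, dropping the accents is just a filter over the characters.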
They are not equal (a simple byte compare would return false), but they look the same when displayed or printed. Like I said though, they are both proper representations.

Rsync just copied the filenames verbatim: the NFD filename was stored as-is in the Linux filesystem.

The problem, it seems, is that the AFPD service on the NAS was doing an extra translation which was garbling things.

The solution is to put all the filenames into the NFC format.

Some googling and a bit of hacking led to this Python script:
#!/usr/bin/python
"""
Recursively fix filename encoding problems
http://www.opensourcehacker.com, MIT licensed
"""
# Note: this is a Python 2 script (print statement, unicode()).
import os
import unicodedata

# Target normalization form: the copied names are NFD, the NAS wants NFC.
# Must be upper case; unicodedata.normalize() rejects "nfc" or "nfd".
FORM = "NFC"

# Target filename encoding
TO = "utf-8"

# Current working directory
PATH = os.getcwd()

for root, dirs, files in os.walk(PATH):
    # Assume files are 8-bit strings in the native encoding
    # of the system which we cannot know
    for f in files:
        try:
            uf = unicode(f, "UTF-8")
            normalized = unicodedata.normalize(FORM, uf)
        except UnicodeDecodeError:
            print "Cannot decode:" + f
            continue
        fullpath = os.path.join(root, f)
        newpath = os.path.join(root, normalized.encode(TO, 'ignore'))
        # Only files are renamed; directory names are left alone
        if newpath != fullpath:
            print "Renaming:" + fullpath + " to:" + newpath
            os.rename(fullpath, newpath)
After I ran this everything was all hunky-dory again! I could play Su-su again! :-D

Basically all this does is convert the NFD filenames that rsync copied over to the equivalent, but different, NFC representation. Once this was done, AFPD could properly map the filenames back and forth without any problem.
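If you would rather look before you leap, a dry run is easy to sketch. The following is a Python 3 illustration under my own assumptions, not the script I actually ran: the function names nfc_name and find_non_nfc are hypothetical, and it only reports what would change without renaming anything.

```python
import os
import unicodedata


def nfc_name(name):
    """Return the NFC form of name, or None if it is already NFC."""
    normalized = unicodedata.normalize("NFC", name)
    return normalized if normalized != name else None


def find_non_nfc(top):
    """Yield (directory, old_name, new_name) for entries not in NFC."""
    # topdown=False visits subdirectories before their parent, so a real
    # rename of a directory at this point would not disturb the walk.
    for root, dirs, files in os.walk(top, topdown=False):
        for name in files + dirs:
            fixed = nfc_name(name)
            if fixed is not None:
                yield root, name, fixed
```

Looping over find_non_nfc(os.getcwd()) and printing each tuple gives the list of candidates. Unlike the script above, this sketch also reports directory names, since those can carry accents too.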