From: Dan Oscarsson (Dan.Oscarsson@kiconsulting.se)
Date: 01/28/03-10:54:37 AM Z
Message-Id: <200301281651.h0SGpqn28528@malmo.trab.se>
Date: Tue, 28 Jan 2003 17:54:37 +0100 (CET)
From: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]

>Normalization form C ALWAYS starts with decomposition, that is,
>normalization form D.
>
>Any software which can perform normalization form C can also perform
>normalization form D.

That depends on what you normalise from. If you start with any of the
ISO 8859-1 character sets, then converting and normalising into UCS
form C is easily done without any knowledge of form D.
As a very large share of the character sets in use today are, in Unicode
terms, "precomposed", translation to UCS form C can be done without
any form D normalisation.

If you start from the beginning by defining that the final result is
UCS form C, conversion/input/normalisation can very often be done
directly. There is no need to be able to handle normalisation form D.

>
>> But we cannot require it in storage, only on the wire.
>
>In storage we must leave the fileserver free to do as it pleases, but
>for one restriction: it must use a reasonably canonical form in storage,
>otherwise equal filenames with unequal encodings could be allowed.
>
>I believe that this is what the draft specifies. An on-the-wire
>normalization form specification would be an optimization, but is not
>absolutely necessary.

OK.
What will happen if we do not require it? I assume the same as what
happens now with NFSv3.
I have a user working on a Mac who has mounted a file system through
NFSv3 from our Unix machine. When a file is created on the Mac, its
name is written as UTF-8 encoded text in some form of Unicode. When I
look at the same file from my Unix machine, the name is a mess. All my
software on Unix uses ISO 8859-1.
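To make the breakage concrete, here is a minimal Python sketch. It assumes the Mac client sends decomposed Unicode encoded as UTF-8 (the message only says "some form of Unicode"), and the function names are made up for illustration; none of this comes from an NFS implementation.

```python
import unicodedata

def mac_style_wire_name(name: str) -> bytes:
    # Assumption: the client decomposes the name, then encodes it as UTF-8.
    return unicodedata.normalize("NFD", name).encode("utf-8")

def seen_as_latin1(raw: bytes) -> str:
    # An ISO 8859-1 system treats every byte as one character.
    return raw.decode("iso-8859-1")

def translate_for_latin1_host(name: str) -> bytes:
    # What a translating client could send instead: compose first, then
    # encode as ISO 8859-1. Raises UnicodeEncodeError for names that
    # ISO 8859-1 cannot represent.
    return unicodedata.normalize("NFC", name).encode("iso-8859-1")

name = "Malm\u00f6.txt"                  # o with diaeresis, precomposed
raw = mac_style_wire_name(name)          # 'o' followed by bytes CC 88
garbled = seen_as_latin1(raw)            # two junk characters replace one
clean = translate_for_latin1_host(name)  # single byte F6 for the letter
```

The name stored by the first path is two characters longer when viewed on the ISO 8859-1 side, which is the "mess" described above; the translated name round-trips unchanged.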
If instead the Mac NFS client knew that ISO 8859-1 was the standard
to be used, it could have translated the names when sending them to the
Unix system and translated them back when reading from the Unix host.

>Normalization form C is limited to composed characters defined in
>Unicode 3.0 (we're past 3.0); as such it really means "composed for
>these characters, decomposed for everything else" - so why not just
>decomposed then?

Normalisation form C is for the current version of Unicode (3.2).
Form C has most of the characters that were already "precomposed"
in legacy character sets. People are used to working with
precomposed characters.

>
>I don't think encoding length is that big a deal, but cycles spent in
>normalization, space dedicated to normalization data structures, *that*
>is a big deal.
>
>This is why I'm for form D (on the wire as an optimization).

Why would form D result in smaller tables? Most text today is already
in precomposed form. All programs I have (and all open source I have
fetched) handle text using precomposed characters.

>
>> Yes, it may result in additional code in servers, but many system can
>> create very efficient code to convert between legacy character set
>> and UCS normalising form C. So I think it will not be that expensive.
>
>I have no figures close at hand, but I don't think that Unicode
>normalization data structures and code are small (remember, we're
>talking about kernel constraints here, complete with small stacks).

I cannot say how small they can be, but I think they need not be that
big.
But if we do as is intended for the web and DNS, and have the
normalisation done by the sender, the server can assume that incoming
data on the wire is in the correct format and need not do any
normalisation or checking on it. You can assume it is correct; if it
is not, the server routines that compare filenames (case sensitive or
case insensitive) will simply fail to match.
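The two points above (cheap conversion from a precomposed legacy set, and a server that only compares) can be sketched in Python. This is an illustration under assumptions: UTF-8 is taken as the wire encoding of UCS, and `latin1_to_wire` and `server_lookup` are hypothetical names, not part of any NFS code.

```python
import unicodedata

def latin1_to_wire(raw: bytes) -> bytes:
    # ISO 8859-1 bytes map one-to-one onto U+0000..U+00FF, so decoding
    # needs no normalisation table at all.
    return raw.decode("iso-8859-1").encode("utf-8")

# Check the claim for the whole character set: every ISO 8859-1
# character is already unchanged by form C normalisation.
all_latin1 = bytes(range(256)).decode("iso-8859-1")
already_nfc = unicodedata.normalize("NFC", all_latin1) == all_latin1

def server_lookup(stored: set, wire_name: bytes) -> bool:
    # The server trusts the sender: plain byte comparison, no checking.
    return wire_name in stored

stored = {latin1_to_wire(b"Malm\xf6.txt")}

# A sender that ignores the rule and sends form D instead.
nfd_name = unicodedata.normalize("NFD", "Malm\u00f6.txt").encode("utf-8")

found_nfc = server_lookup(stored, latin1_to_wire(b"Malm\xf6.txt"))
found_nfd = server_lookup(stored, nfd_name)
```

Here `found_nfc` is true and `found_nfd` is false: the wrongly normalised name does no harm on the server, it just fails to match.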
But you still always have to handle the encoding used in the file
system on the server. If the file server uses EBCDIC in the file system,
the NFS server has to convert between EBCDIC and UCS form C. I can see
no way to avoid that (except the bad solution used in DNS, where
everything is encoded using ASCII characters, forcing every application
that handles file names on the system to be rewritten).
And as you have to do that, the simplest approach is to have only one
standard character set and form to convert to and from.

   Dan