From: Dan Oscarsson (Dan.Oscarsson@kiconsulting.se)
Date: 01/29/03-04:03:05 AM Z
Message-Id: <200301291003.h0TA35AT016339@valinor.malmo.trab.se>
Date: Wed, 29 Jan 2003 11:03:05 +0100 (CET)
From: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]

>> That depends on what you normalise from. If you start with any of the
>> ISO 8859-1 character sets then converting and normalising into
>> UCS form C is easily done without any knowledge of form D.
>
>That's codeset conversion. In the NFSv4 case we're talking about the
>client sending unnormalized UTF-8 (therefore Unicode) filename strings.
>The server has to then normalize to a canonical form (to prevent equal
>name / unequal encoding conflicts). This process ALWAYS starts with
>normalization to form D; end of story.

Even so, when normalising to form C I expect the form D step can be
folded into the form C code at little or no extra cost. I doubt that
normalisation form C needs to take more data than form D.
Form D can take a lot more data space in the kernel, so it is not
suitable for kernels with little available memory to work in.

>> >I believe that this is what the draft specifies. An on-the-wire
>> >normalization form specification would be an optimization, but is not
>> >absolutely necessary.
>>
>> OK. What will happen if we do not require it?
>
>The server has to normalize the client's filenames to avoid equal name /
>unequal encoding conflicts.
>
>If it were required the server would only have to check that clients
>send normalized filenames and return some error if they don't.
>
>In other words: nothing. I.e., NFSv4 is NOT broken wrt i18n by not
>specifying an on-the-wire normalization form for filenames. I've said
>this more than once now - do you take issue with this statement?

It is not broken, but it will be a lot more difficult to implement
and will increase the risk of failures.
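
To make the "equal name / unequal encoding" conflict concrete, here is a
minimal Python sketch. It is only an illustration under stated
assumptions, not code from the draft or from any NFS implementation:
`server_key` is a hypothetical name for the normalisation step a server
might apply, and the filename is made up. The last lines also check the
point about ISO 8859-1: every character decoded from that set is already
in form C, so no form D knowledge is needed for that conversion.

```python
import unicodedata

# Two spellings of the same visible filename (made-up example).
precomposed = "r\u00e9sum\u00e9.txt"    # e-acute as one code point (form C)
decomposed = "re\u0301sume\u0301.txt"   # e + combining acute accent (form D)

wire_c = precomposed.encode("utf-8")
wire_d = decomposed.encode("utf-8")


def server_key(wire_name: bytes) -> bytes:
    """Hypothetical server-side step: normalise to form C before lookup."""
    text = wire_name.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


# Same name to a user, different bytes on the wire.
names_differ_on_wire = wire_c != wire_d

# After normalisation both spellings map to one directory entry key.
names_match_after_nfc = server_key(wire_c) == server_key(wire_d)

# Every ISO 8859-1 byte decodes to a code point that form C leaves unchanged.
latin1_is_form_c = all(
    unicodedata.normalize("NFC", bytes([b]).decode("latin-1"))
    == bytes([b]).decode("latin-1")
    for b in range(256)
)
```

Without a step like `server_key`, the two byte strings would be stored
as two different files whose names look identical to the user.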
It will result in big normalisation tables everywhere, as the wire
format will never match a system's internal needs (except for systems
that use unnormalised data, which must be very rare). Optimising will
be difficult. The end that mounts a file system from a server will need
code for both normalisation and conversion to the local character set.

>> I assume the same as that which happens now with NFSv3.
>
>Oh no, not at all. NFSv4 uses UTF-8, and therefore Unicode, on the
>wire for filenames - not so for NFSv3. Big difference.

I would not call it big. I get problems with NFSv3 because it has no
standardised character set and encoding. With NFSv4, if the mounted
file system does not normalise and convert into my legacy character
set, it will be just as bad. Even if I switched to UTF-8 as my local
character set it would fail if the UTF-8 encoded text is not in
normalisation form C. No other form is acceptable, due to things like
invalid semantics, too much data space, and complex, CPU-consuming
handling of that format.

You cannot expect systems to switch to unnormalised UTF-8 in their
file systems to help NFSv4. It will break most applications.

Just like all other protocols that communicate between systems, NFS
needs to convert between the local format and the on-the-wire format at
the end points of the communication link. And looking at history and
common sense, allowing more than one possible format on the wire
results in failed communication. I can create compact and fast
conversion and handling for one format. I do not have time to write
code to handle everything.

(I tried to find out what CIFS has. From what I could find out,
Microsoft uses UCS-2/UTF-16 with precomposed characters - that is
closest to form C.)

>> Normalisation form C is for the current version of Unicode (3.2).
>> Form C has most characters that have been "precomposed" before in
>> that form (in legacy character sets). People are used to working with
>> precomposed characters.
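
The link between form C and legacy character sets can be sketched in a
few lines of Python. This is an illustration only: the filename is made
up, `to_legacy` is a hypothetical name for the conversion a client
end point would do, and ISO 8859-1 stands in for whatever local
character set is in use. It shows that a form C name converts directly,
that the same name in form D does not, and that form D takes more bytes
as UTF-8.

```python
import unicodedata

name = "Malm\u00f6 \u00e5rsm\u00f6te.txt"   # made-up name, precomposed

form_c = unicodedata.normalize("NFC", name)
form_d = unicodedata.normalize("NFD", name)

# Form D spends a base letter plus a combining mark on each accented letter.
size_c = len(form_c.encode("utf-8"))
size_d = len(form_d.encode("utf-8"))


def to_legacy(text: str) -> bytes:
    """Hypothetical end-point step: wire text to a local ISO 8859-1 name."""
    return text.encode("iso-8859-1")


# Form C maps one-to-one onto the legacy character set.
legacy_name = to_legacy(form_c)

# Form D contains combining marks that ISO 8859-1 cannot represent, so the
# end point would have to compose first (that is, normalise to form C).
try:
    to_legacy(form_d)
    form_d_converts = True
except UnicodeEncodeError:
    form_d_converts = False
```

An end point that receives only form C can use a plain table lookup for
the conversion; one that may receive anything needs the composition
tables as well.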
>
>No, normalization form C is limited to using composed characters defined
>in Unicode 3.0. I'll search for a reference tonight and post it
>tomorrow, but I'm quite sure of this.

Normalisation form C is driven by the tables that Unicode defines for
each version. So it automatically follows each version.

So I still think NFSv4 will be much better and easier to implement if
it defines that all UCS data should be in form C. I am sure I will get
interoperability problems otherwise.

   Dan