From: Nicolas Williams (Nicolas.Williams@sun.com)
Date: 01/27/03-11:10:17 AM Z
Date: Mon, 27 Jan 2003 11:10:17 -0600
From: Nicolas Williams <Nicolas.Williams@sun.com>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Message-ID: <20030127111016.W16765@binky.central.sun.com>

On Mon, Jan 27, 2003 at 08:55:37AM +0100, Dan Oscarsson wrote:
>
> >I think NFSv4 must require at the very least that filenames be stored
> >normalized to some form (we should probably specify if it can be a K
> >form or not, but D vs. C is not so important) and let clients and
> >servers deal with that. This is pretty much what the draft says or
> >implies.
>
> The Open Source Unix and Linux community have for internationalisation
> selected UCS normalised using form C and encoded using UTF-8 as
> the standard to be used on Unix and Linux.
> The same form and encoding have been selected by W3C for the web.
>
> So there is a lot of software that does or will handle form C encoded
> text. From what I have seen there will be very little software
> that will handle or use normalisation form D, KD or KC.
>
> So NFSv4 should use the same as most do (or will do): normalising form C.

Normalization form C ALWAYS starts with decomposition, that is,
normalization form D. Any software which can perform normalization
form C can also perform normalization form D.

> But we cannot require it in storage, only on the wire.

In storage we must leave the fileserver free to do as it pleases, but
for one restriction: it must use a reasonably canonical form in storage,
otherwise equal filenames with unequal encodings could be allowed. I
believe that this is what the draft specifies.

An on-the-wire normalization form specification would be an
optimization, but is not absolutely necessary.

> It is on the wire, that is between systems, that it must be standardised
> to one simple format. Systems can use any format they want.

I remain unconvinced.

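To make the two points above concrete (form C is built on top of form D, and un-normalized storage would admit equal filenames with unequal encodings), here is a minimal sketch. It assumes Python's standard unicodedata module purely for illustration; nothing in the draft or in this thread depends on it, and the filename "café" is a made-up example.

```python
import unicodedata

# Two encodings of the same visible filename "café" (hypothetical example).
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def nfc(s):
    """Canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", s)

def nfd(s):
    """Canonical decomposition only."""
    return unicodedata.normalize("NFD", s)

# Without normalization the two names are different byte strings, so a
# server comparing raw bytes could hold both in one directory.
raw_bytes_differ = precomposed.encode("utf-8") != decomposed.encode("utf-8")

# Either canonical form collapses them to a single name.
same_after_nfc = nfc(precomposed) == nfc(decomposed)
same_after_nfd = nfd(precomposed) == nfd(decomposed)

# Composing the already-decomposed string gives the same result as
# normalizing to form C directly: decomposition is the first step of form C.
nfc_via_nfd = nfc(nfd(precomposed)) == nfc(precomposed)

print(raw_bytes_differ, same_after_nfc, same_after_nfd, nfc_via_nfd)
```

Which form a server picks matters less here than picking one and applying it consistently before comparing names.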
> A system which uses normalising form C as its local format for storage
> will have a simpler implementation than others, and that will help
> push system vendors to move to the most common format used.
> UCS normalising form C is compact and does not destroy any information,
> so it is best. The K forms destroy data and the D form takes more space and
> breaks the semantic concept of letter on some letters.

Normalization form C is limited to composed characters defined in
Unicode 3.0 (we're past 3.0); as such it really means "composed for
these characters, decomposed for everything else" - so why not just
decomposed then?

I don't think encoding length is that big a deal, but cycles spent in
normalization, space dedicated to normalization data structures, *that*
is a big deal. This is why I'm for form D (on the wire as an
optimization).

> Yes, it may result in additional code in servers, but many systems can
> create very efficient code to convert between legacy character set
> and UCS normalising form C. So I think it will not be that expensive.

I have no figures close at hand, but I don't think that Unicode
normalization data structures and code are small (remember, we're
talking about kernel constraints here, complete with small stacks).

Nico
--
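As a rough illustration of the size and "decomposed for everything else" remarks above, the sketch below compares UTF-8 lengths in forms C and D and shows one character that form C leaves decomposed. It again assumes Python's unicodedata module, which is not part of this thread. U+0958 is used as the example because it is on the Unicode composition exclusion list; it stands in for the general behaviour and is not itself a post-3.0 character.

```python
import unicodedata

def utf8_len(s):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

# Encoding length: one accented letter in each canonical form.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
nfc_len = utf8_len(unicodedata.normalize("NFC", e_acute))  # U+00E9 alone
nfd_len = utf8_len(unicodedata.normalize("NFD", e_acute))  # 'e' + U+0301

# A precomposed character that form C does not recompose:
# U+0958 DEVANAGARI LETTER QA decomposes to U+0915 U+093C and is
# excluded from composition, so its form C equals its form D.
qa = "\u0958"
qa_nfc = unicodedata.normalize("NFC", qa)
qa_nfd = unicodedata.normalize("NFD", qa)

print(nfc_len, nfd_len, qa_nfc == qa_nfd)
```

The per-character size penalty of form D is a byte or so for Latin text; the cost argued about above is the normalization tables and code, which this sketch does not measure.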