Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]


From: Nicolas Williams (Nicolas.Williams@sun.com)
Date: 01/27/03-11:10:17 AM Z


Date: Mon, 27 Jan 2003 11:10:17 -0600
From: Nicolas Williams <Nicolas.Williams@sun.com>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Message-ID: <20030127111016.W16765@binky.central.sun.com>

On Mon, Jan 27, 2003 at 08:55:37AM +0100, Dan Oscarsson wrote:
> 
> >I think NFSv4 must require at the very least that filenames be stored
> >normalized to some form (we should probably specify if it can be a K
> >form or not, but D vs. C is not so important) and let clients and
> >servers deal with that.  This is pretty much what the draft says or
> >implies.
> 
> The Open Source Unix and Linux community have for internationalisation
> selected UCS normalised using form C and encoded using UTF-8 as
> the standard to be used on Unix and Linux.
> The same form and encoding have been selected by W3C for the web.
> 
> So there is a lot of software that does or will handle form C encoded
> text. From what I have seen there will be very little software
> that will handle or use normalisation form D, KD or KC.
> 
> So NFSv4 should use the same as most do (or will do):  normalising form C.

Normalization form C ALWAYS starts with decomposition, that is,
normalization form D.

Any software which can perform normalization form C can also perform
normalization form D.
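
To illustrate the point above, here is a minimal sketch using Python's standard `unicodedata` module (chosen only for convenience; nothing in this thread prescribes it). The helper name `nfc_via_nfd` is made up for this example. It shows that composing after an explicit decomposition gives the same result as asking for form C directly:

```python
import unicodedata

def nfc_via_nfd(s):
    """Hypothetical helper: decompose to form D first, then compose to form C."""
    return unicodedata.normalize("NFC", unicodedata.normalize("NFD", s))

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Form D of the composed string is the decomposed sequence, and vice versa.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Going through form D explicitly never changes the form C result.
samples = [composed, decomposed, "A\u030a", "\u212b"]
same = all(unicodedata.normalize("NFC", s) == nfc_via_nfd(s) for s in samples)
```

Here `nfd` equals `decomposed`, `nfc` equals `composed`, and `same` is true for these samples.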

> But we cannot require it in storage, only on the wire.

In storage we must leave the fileserver free to do as it pleases, but
for one restriction: it must use a reasonably canonical form in storage,
otherwise equal filenames with unequal encodings could be allowed.
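
A rough sketch of that restriction, assuming a toy in-memory directory (the class `ToyDirectory` and its methods are invented for this example and are not any NFSv4 interface). Names are compared after normalization, so two encodings of the same name cannot coexist:

```python
import unicodedata

class ToyDirectory:
    """Hypothetical directory that keys entries by a normalized name."""

    def __init__(self, form="NFD"):
        self.form = form      # which canonical form to use is an assumption
        self.entries = {}     # normalized name -> name as first supplied

    def _key(self, name):
        return unicodedata.normalize(self.form, name)

    def create(self, name):
        key = self._key(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = name

    def lookup(self, name):
        return self._key(name) in self.entries

# Without normalization, the two spellings are simply different strings.
raw_names = {"caf\u00e9", "cafe\u0301"}

d = ToyDirectory()
d.create("caf\u00e9")              # stored under its normalized key
found = d.lookup("cafe\u0301")     # the other encoding finds the same entry
```

`raw_names` ends up with two members, while a second `d.create("cafe\u0301")` would raise `FileExistsError`.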

I believe that this is what the draft specifies.  An on-the-wire
normalization form specification would be an optimization, but is not
absolutely necessary.

> It is on the wire, that is between systems, that it must be standardised
> to one simple format. Systems can use any format they want.

I remain unconvinced.

> A system which uses normalising form C as its local format for storage
> will have a simpler implementation than others, and that will help
> push system vendors to move to the most common format used.
> UCS normalising form C is compact and does not destroy any information,
> so it is best. The K forms destroy data and the D form takes more space and
> breaks the semantic concept of letter on some letters.
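
The two claims quoted above (K forms lose information, form D takes more space) can be checked with a small sketch, again assuming Python's standard `unicodedata` module; the sample strings are arbitrary choices for illustration:

```python
import unicodedata

# A compatibility character: LATIN SMALL LIGATURE FI.
ligature = "\ufb01"
kept = unicodedata.normalize("NFC", ligature)      # canonical form keeps it
folded = unicodedata.normalize("NFKC", ligature)   # K form replaces it with "fi"

# After NFKC the ligature and the plain letters are indistinguishable.
lossy = folded == unicodedata.normalize("NFKC", "fi")

# Encoded size of the same name in form C and form D, as UTF-8.
name = "\u00e5ngstr\u00f6m"
nfc_bytes = len(unicodedata.normalize("NFC", name).encode("utf-8"))
nfd_bytes = len(unicodedata.normalize("NFD", name).encode("utf-8"))
```

For this name, form C encodes to 10 bytes and form D to 12, since each accented letter becomes a base letter plus a two-byte combining mark.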

Normalization form C is limited to composed characters defined in
Unicode 3.0 (we're past 3.0); as such it really means "composed for
these characters, decomposed for everything else" - so why not just
decomposed then?
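
As a hedged illustration of "composed for these characters, decomposed for everything else": Unicode keeps a list of composition exclusions, and a precomposed character on that list stays decomposed even in form C. The sketch below uses U+0958 (DEVANAGARI LETTER QA), which is a script-specific exclusion rather than one of the post-cutoff characters, so it shows the mechanism only; the exact cutoff version is defined by the Unicode standard, not by this example.

```python
import unicodedata

# U+00E9 is a precomposed character that form C does produce.
composable = unicodedata.normalize("NFC", "e\u0301")

# U+0958 has a canonical decomposition but is excluded from composition,
# so form C leaves it as the two-code-point sequence U+0915 U+093C.
excluded = "\u0958"
excluded_nfc = unicodedata.normalize("NFC", excluded)
excluded_nfd = unicodedata.normalize("NFD", excluded)

# For an excluded character, form C and form D give the same result.
same_as_decomposed = excluded_nfc == excluded_nfd
```

So a form C implementation still needs the full decomposition data, plus the composition tables and the exclusion list on top of it.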

I don't think encoding length is that big a deal, but cycles spent in
normalization, space dedicated to normalization data structures, *that*
is a big deal.

This is why I'm for form D (on the wire as an optimization).

> Yes, it may result in additional code in servers, but many systems can
> create very efficient code to convert between legacy character set
> and UCS normalising form C. So I think it will not be that expensive.

I have no figures close at hand, but I don't think that Unicode
normalization data structures and code are small (remember, we're
talking about kernel constraints here, complete with small stacks).

Nico
-- 



This archive was generated by hypermail 2.1.2 : 03/04/05-01:50:50 AM Z CST