From: Nicolas Williams (Nicolas.Williams@sun.com)
Date: 01/24/03-11:59:24 AM Z
Date: Fri, 24 Jan 2003 11:59:24 -0600
From: Nicolas Williams <Nicolas.Williams@sun.com>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Message-ID: <20030124115924.U16765@binky.central.sun.com>
On Fri, Jan 24, 2003 at 09:39:39AM -0800, Noveck, Dave wrote:
>
> > Oh boy. I've been thinking about the Unicode issues as well.
> >
> > My thoughts:
> >
> > - By specifying _a_ Unicode normalization form for filename components
> > on the wire we can avoid the need to normalize strings in the kernel.
> >
> > Checking that inputs are normalized is a lot simpler and less
> > resource consuming than actually performing normalization.
> >
> > Obviously, changing utf8str_cs to specify a normalization form may
> > require a new error code and may be an incompatible change.
>
> As far as the error code goes, I would think you could use NFS4ERR_BADNAME.
*nod*
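To make the check-versus-normalize point concrete, here is a minimal
sketch in Python. It assumes NFC as the required form purely for
illustration (nothing in this thread has settled on one), and
BadNameError and check_component are hypothetical names standing in for
however a server would actually return NFS4ERR_BADNAME.
unicodedata.is_normalized needs Python 3.8 or later.

```python
import unicodedata

REQUIRED_FORM = "NFC"  # assumption: the thread has not settled on a form


class BadNameError(ValueError):
    """Hypothetical stand-in for returning NFS4ERR_BADNAME to the client."""


def check_component(raw: bytes) -> str:
    """Validate a filename component from the wire; never rewrite it."""
    try:
        name = raw.decode("utf-8")  # strict decode rejects invalid UTF-8
    except UnicodeDecodeError as exc:
        raise BadNameError("not valid UTF-8") from exc
    # Only a check: a non-normalized name is rejected, not silently fixed.
    if not unicodedata.is_normalized(REQUIRED_FORM, name):
        raise BadNameError("not in " + REQUIRED_FORM)
    return name
```

With this sketch a precomposed name passes through untouched, while the
same name sent with a combining accent is refused and it is up to the
client to normalize before retrying.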
> > If the WG cannot reach consensus about such a change then a
> > recommendation that a given normalization form be used should help
> > prevent unnecessary normalization operations by NFSv4 clients/servers
> > that heed the recommendation (a norm. form could even be
> > "negotiated", but let's not go there).
>
> My worry about this is the whole issue of existing file systems and
> their contents. I'm sure that there are file systems out there that
> do not normalize and thus there may be directories which contain multiple
> files that would map to the same string under any given normalization
> form. If you impose a normalization form, then you have the issue of
> dealing with that legacy data.
Er, no, utf8str_cs requires that a normalization form be used, just not
on the wire. So the problem of legacy filesystems remains even if we do
not act to recommend or require a specific normalization form on the
wire.
As for the legacy filesystems, the problem is far worse than you
suggest, since some filesystems allow arbitrary 8-bit data in filenames,
including arbitrary non-UTF-8 encodings, without any codeset tagging
whatsoever. But this is not relevant to the matter of whether
utf8str_cs should mandate a specific normalization form for filenames on
the wire.
> At least I feel the need for a "Normalization Forms for Dummies" document.
> Maybe other working group members do as well. Any pointers to something
> that will explain this stuff to those who have not already immersed
> themselves in this area?
There are several books with "Unicode" in the title (none in my office
right now). And the Unicode home page is a good place to start:
http://www.unicode.org/
> > On the other hand, the native norm. form produced by input mechanisms
> > at the clients may be preferred by this WG, whatever it may be, as
> > long as all/most clients are consistent.
>
> Dream on.
Heh. I didn't want to dismiss that out of hand, see? I don't profess
to know what y'all do :)
[...]
> I think my point is the earliest possible point was when the files were
> created, i.e. it may be long gone.
"Normalization form" is a concept specific to Unicode (well, to any
codeset which has multiple equivalent encodings for the same character,
which mostly means Unicode).
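A small illustration of "multiple equivalent encodings", as a sketch in
Python using the standard unicodedata module. The helper name same_name
is made up for this example, and NFC is just one possible choice of
form.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # False: different codepoint sequences
print(same_name(composed, decomposed))  # True: canonically equivalent
print(composed.encode("utf-8"))         # b'\xc3\xa9'
print(decomposed.encode("utf-8"))       # b'e\xcc\x81'
```

Both strings render as the same accented letter, yet they are different
byte sequences on the wire unless one normalization form is agreed on.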
Legacy filesystems don't even use a single codeset consistently, much
less a single Unicode normalization form.
Log in to some *nix system with one locale using Latin-1 as the codeset
and create some file with a name including an accented character. Then
log out and log in again using a locale with a UTF-8 codeset and try it
again. You should end up with two files with the same name, each of
which does not display correctly or at all unless you're using the
correct locale. Fun, fun, fun!
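The experiment can be mimicked without touching a filesystem. The
following Python sketch assumes a filesystem that stores whatever bytes
the client sends, with no codeset tag; the name "café" and the helper
show are illustrative only.

```python
name = "caf\u00e9"  # "café", as typed by the user in either locale

latin1_bytes = name.encode("latin-1")  # what a Latin-1 locale stores
utf8_bytes = name.encode("utf-8")      # what a UTF-8 locale stores


def show(raw: bytes, codeset: str) -> str:
    """Roughly what a listing displays when bytes are read in a given codeset."""
    return raw.decode(codeset, errors="replace")


print(latin1_bytes)                        # b'caf\xe9'
print(utf8_bytes)                          # b'caf\xc3\xa9'
print(latin1_bytes == utf8_bytes)          # False: two distinct directory entries
print(ascii(show(latin1_bytes, "utf-8")))  # 'caf\ufffd' (replacement character)
print(ascii(show(utf8_bytes, "latin-1")))  # 'caf\xc3\xa9' (mojibake)
```

Each entry only reads back as "café" under the codeset it was created
with, which is the display problem described above.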
But I insist: the legacy issue has nothing to do with selecting a single
required normalization form for filenames on the wire. The legacy
filesystem issues should be addressed in a separate thread. See above.
> > - I'm not sure that I care about avoiding the K forms (as suggested by
> > Dan Oscarsson) as I see the K forms as merely slightly reducing the
> > available namespace in exchange for reducing the scope for confusion.
> >
> > The typical example of a K normalization is conversion of ligatures
> > such as "fi" or "ae" (one codepoint each) to the visually related
> > codepoints that make them up ('f' and 'i' for "fi" and 'a' and 'e'
> > for "ae"); there may be other form K substitutions that may actually
> > be offensive to speakers of live languages, so further study of the
> > matter may be warranted. But then, this is a network filesystem
> > protocol, some compromises have to be made.
>
> Although I have not studied this area and so I may be missing the
> justification for some of this, this strikes me as strange. If
> someone names a file with the ligature fi, then maybe he did it for a
> reason, such as the fact that it is about the ligature fi. He could
> have named it ordinary fi, but chose not to for his own reasons. It
> seems weird to change it to ordinary fi and that file may also exist.
> Visual similarity seems a strange basis to map characters. Are you
> going to map Cyrillic Veh to Roman B, because they look the same?
If that's what Unicode specifies for the K normalization forms, yes.
Think of a group of users using a shared directory: one user may know
how to enter a ligature, many may not even be able to tell that some
filename uses a ligature - they may not know what a ligature is, how to
recognize one, much less how to type one in. Doesn't it then make sense
to use the K compatibility mappings for ligatures? Obviously, for the
single-user case the answer is "no", but for the multi-user case the
answer is not clear.
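What the K forms actually map can be checked against the Unicode data
that ships with Python. This is only a sketch: the three sample
characters are picked from the examples in this thread, nfkc is a
made-up helper name, and the result reflects whatever Unicode version
the interpreter was built with.

```python
import unicodedata

samples = {
    "fi ligature": "\ufb01",  # LATIN SMALL LIGATURE FI
    "ae letter": "\u00e6",    # LATIN SMALL LETTER AE
    "Cyrillic Ve": "\u0412",  # CYRILLIC CAPITAL LETTER VE
}


def nfkc(s: str) -> str:
    """Apply the NFKC (compatibility composition) normalization form."""
    return unicodedata.normalize("NFKC", s)


def codepoints(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX codepoints."""
    return " ".join("U+%04X" % ord(c) for c in s)


for label, ch in samples.items():
    out = nfkc(ch)
    status = "changed" if out != ch else "unchanged"
    print(label + ":", codepoints(ch), "->", codepoints(out), status)
```

In the Unicode data I have at hand, only the fi ligature has a
compatibility mapping (to 'f' followed by 'i'). U+00E6 and Cyrillic Ve
come back unchanged, so the K forms are narrower than "looks similar":
they fold only characters that Unicode itself marks as compatibility
variants.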
Cheers,
Nico
--