From: Noveck, Dave (Dave.Noveck@netapp.com)
Date: 01/24/03-12:41:25 PM Z
Message-ID: <C8CF60CFC4D8A74E9945E32CF096548A072A40@SILVER.nane.netapp.com>
Subject: RE: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Date: Fri, 24 Jan 2003 10:41:25 -0800
> > > If the WG cannot reach consensus about such a change then a
> > > recommendation that a given normalization form be used should help
> > > prevent unnecessary normalization operations by NFSv4 clients/servers
> > > that heed the recommendation (a norm. form could even be
> > > "negotiated", but let's not go there).
> >
> > My worry about this is the whole issue of existing file systems and
> > their contents. I'm sure that there are file systems out there that
> > do not normalize and thus there may be directories which contain multiple
> > files that would map to the same string under any given normalization
> > form. If you impose a normalization form, then you have the issue of
> > dealing with that legacy data.
>
> Er, no, utf8str_cs requires that a normalization form be used, just not
> on the wire. So the problem of legacy filesystems remains even if we do
> not act to recommend or require a specific normalization form on the
> wire.
In that case, there doesn't seem to be any reason not to pick one
form, assuming we can agree on what that is.
So it appears that we are supposed to map utf8str_cs "equivalent" strings
to the same filename, but what is the form of equivalence for which this
is required, canonical equivalence or compatibility equivalence? This
affects the normalization forms that we may choose among.
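To make the two candidate equivalences concrete, here is a minimal sketch
using Python's standard unicodedata module. Python is not part of the
original thread, and the helper names are hypothetical. It assumes that two
strings are canonically equivalent when their NFD forms match, and
compatibility equivalent when their NFKD forms match.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


precomposed = "\u00e9"   # e with acute accent as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# The two spellings of the accented letter are canonically equivalent.
print(canonically_equivalent(precomposed, decomposed))
# The ligature matches plain "fi" only under compatibility equivalence.
print(canonically_equivalent(ligature, "fi"))
print(compatibility_equivalent(ligature, "fi"))
```

Requiring only canonical equivalence leaves NFC and NFD as candidates;
requiring compatibility equivalence points to NFKC or NFKD instead.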
> As for the legacy filesystems, the problem is far worse than you
> suggest, since some filesystems allow arbitrary 8-bit data in filenames,
> including arbitrary non-UTF-8 encodings, without any codeset tagging
> whatsoever. But this is not relevant to the matter of whether
> utf8str_cs should mandate a specific normalization form for filenames on
> the wire.
OK. Sigh.
> > At least I feel the need for a "Normalization Forms for Dummies" document.
> > Maybe other working group members do as well. Any pointers to something
> > that will explain this stuff to those who have not already immersed
> > themselves in this area?
> There are several books with "Unicode" in the title (none in my office
> right now). And the Unicode home page is a good place to start:
> http://www.unicode.org/
I used http://www.unicode.org/reports/tr15/ which was helpful.
> > > On the other hand, the native norm. form produced by input mechanisms
> > > at the clients may be preferred by this WG, whatever it may be, as
> > > long as all/most clients are consistent.
> >
> > Dream on.
>
> Heh. I didn't want to dismiss that out of hand, see? I don't profess
> to know what y'all do :)
>
> [...]
> I think my point is that the earliest possible point was when the files were
> created, i.e. it may be long gone.
> "Normalization form" is a concept specific to Unicode (well, to any
> codeset which has multiple equivalent encodings for the same character,
> which mostly means Unicode).
It hurts when you do that, so my advice would have been "don't do that".
> Legacy filesystems don't even use a single codeset consistently, much
> less a single Unicode normalization form.
> Log in to some *nix system with one locale using Latin-1 as the codeset
> and create some file with a name including an accented character. Then
> log out and log in again using a locale with a UTF-8 codeset and try it
> again. You should end up with two files with the same name, each of
> which does not display correctly or at all unless you're using the
> correct locale. Fun, fun, fun!
>
> But I insist: the legacy issue has nothing to do with selecting a single
> required normalization form for filenames on the wire. The legacy
> filesystem issues should be addressed in a separate thread. See above.
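The Latin-1 versus UTF-8 experiment described above can be reproduced
without touching a filesystem. This is a minimal sketch, assuming a
filesystem that stores whatever bytes the client's locale produced; the
example name and the helper decodes_as_utf8 are hypothetical.

```python
name = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# The same visible name becomes two different byte strings on disk.
latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is a well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A UTF-8 locale cannot interpret the Latin-1 name at all.
print(decodes_as_utf8(latin1_bytes))
# A Latin-1 locale shows the UTF-8 name as mojibake (two characters for one).
print(utf8_bytes.decode("latin-1"))
```

Because the byte strings differ, a byte-oriented filesystem keeps both
entries in the same directory, and no Unicode normalization form helps
with the entry that is not UTF-8 in the first place.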
At this point I'm thinking a suicide counseling hotline would be the
appropriate forum, but it'll pass.
> > > - I'm not sure that I care about avoiding the K forms (as suggested by
> > > Dan Oscarsson) as I see the K forms as merely slightly reducing the
> > > available namespace in exchange for reducing the scope for confusion.
> > >
> > > The typical example of a K normalization is conversion of ligatures
> > > such as "fi" or "ae" (one codepoint each) to the visually related
> > > codepoints that make them up ('f' and 'i' for "fi" and 'a' and 'e'
> > > for "ae"); there may be other form K substitutions that may actually
> > > be offensive to speakers of live languages, so further study of the
> > > matter may be warranted. But then, this is a network filesystem
> > > protocol, some compromises have to be made.
> >
> > Although I have not studied this area and so I may be missing the
> > justification for some of this, this strikes me as strange. If
> > someone names a file with the ligature fi, then maybe he did it for a
> > reason, such as the fact that it is about the ligature fi. He could
> > have named it ordinary fi, but chose not to for his own reasons. It
> > seems weird to change it to ordinary fi and that file may also exist.
> > Visual similarity seems a strange basis to map characters. Are you
> > going to map Cyrillic Veh to Roman B, because they look the same?
> If that's what Unicode specifies for the K normalization forms, yes.
But it doesn't.
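What the K forms do and do not fold can be checked directly. This is a
minimal sketch, assuming the Unicode tables shipped with Python's standard
unicodedata module; the helper name nfkc is hypothetical.

```python
import unicodedata


def nfkc(s: str) -> str:
    """Apply compatibility normalization followed by canonical composition."""
    return unicodedata.normalize("NFKC", s)


fi_ligature = "\ufb01"   # LATIN SMALL LIGATURE FI
ae_letter = "\u00e6"     # LATIN SMALL LETTER AE
cyrillic_ve = "\u0412"   # CYRILLIC CAPITAL LETTER VE, which looks like Latin B

# The fi ligature has a compatibility decomposition to 'f' + 'i'.
print(unicodedata.decomposition(fi_ligature))
print(nfkc(fi_ligature) == "fi")

# The ae letter has no decomposition at all, so NFKC leaves it alone.
print(unicodedata.decomposition(ae_letter) == "")
print(nfkc(ae_letter) == ae_letter)

# Visual similarity is not a compatibility mapping: Ve is not folded to B.
print(nfkc(cyrillic_ve) == "B")
```

So the K forms fold the fi ligature, but they treat U+00E6 as a letter in
its own right and never map look-alike characters across scripts.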
> Think of a group of users using a shared directory: one user may know
> how to enter a ligature, many may not even be able to tell that some
> filename uses a ligature - they may not know what a ligature is, how to
> recognize one, much less how to type one in. Doesn't it then make sense
> to use the K compatibility mappings for ligatures? Obviously, for the
> single-user case the answer is "no", but for the multi-user case the
> answer is not clear.
But I may not know how to enter a Cyrillic Veh, either, or recognize the
difference between that and a B, because there isn't one. If there are
other Cyrillic characters I might have a clue, but I kind of think that
asking the file system to deal with that problem would not be appropriate,
as it would also not be appropriate to ask it to deal with my confusion
about ligatures.
Anyway, if I have a choice, I would go for something that doesn't, on
dubious grounds, add additional equivalences, to what we are forced to
have. So I would prefer fundamental equivalence to canonical equivalence
to compatibility equivalence, consistent with what the existing
spec requires. And on the grounds you offered, I would prefer D to C
and thus KD to KC. Given the choice between D and KD (do I have that?),
I would go for D and leave compatibility equivalence out of the picture.
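The cost of adding compatibility equivalence can be illustrated by counting
which names in a directory would collide under each form. This is a minimal
sketch, assuming Python's standard unicodedata module; the directory listing
and the helper collisions are hypothetical.

```python
import unicodedata
from collections import defaultdict


def collisions(names, form):
    """Group names by their normalized form; return groups with 2+ members."""
    groups = defaultdict(list)
    for n in names:
        groups[unicodedata.normalize(form, n)].append(n)
    return [g for g in groups.values() if len(g) > 1]


# Hypothetical legacy directory: two spellings of an accented name, plus a
# name written with the fi ligature next to its plain-ASCII counterpart.
names = ["caf\u00e9", "cafe\u0301", "\ufb01le", "file"]

d_groups = collisions(names, "NFD")    # canonical equivalence only
kd_groups = collisions(names, "NFKD")  # adds compatibility equivalence

print(len(d_groups), len(kd_groups))
```

Under D only the two accented spellings merge. Under KD the ligature name
also merges with "file", which is the kind of additional equivalence
argued against above.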