Question about ASCII and nl_langinfo (locale work)

Discussion:

Baptiste Daroussin

2015-11-10 22:26:37 UTC

Hi all,

When the new collation work was merged, the locales were reworked.

ache@ raised a good point about the C and POSIX locales and, by extension, the
US-ASCII locales: should we take the opportunity to change what they report?

First, a description of the situation: the output of nl_langinfo is not
standardised, so each OS can return whatever encoding name it wants. While it is
pretty obvious what should be returned for regular encodings (iso-8859* or
UTF-8), for the C and POSIX locales FreeBSD used to return US-ASCII (and does so
again as of today).

Lots of third-party applications (python, perl, tcl, etc.) try to figure out the
encoding by matching against a table of "known" outputs of nl_langinfo().

The thing is, not all of them are aware that FreeBSD uses US-ASCII; tcl, for
example, is not, which means tcl is not able to determine what encoding is needed
for the C and POSIX locales.

Linux returns ANSI_X3.4-1968 (also known as US-ASCII), and most
applications know what Linux returns.

That means we need to teach all upstream about US-ASCII all the time.
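
The table-matching problem described above can be illustrated with a minimal
Python sketch. It is not taken from tcl or any other real application: the table
contents and the names KNOWN_CODESETS and encoding_for are hypothetical, and the
"iso8859-1" fallback is simply the behaviour this thread attributes to tcl.

```python
# Hypothetical lookup table: the application knows the Linux spellings
# but has never been taught FreeBSD's "US-ASCII".
KNOWN_CODESETS = {
    "ANSI_X3.4-1968": "ascii",
    "ASCII": "ascii",
    "UTF-8": "utf-8",
    "ISO-8859-1": "iso8859-1",
}

# Assumed fallback, mirroring what the thread reports for tcl.
FALLBACK = "iso8859-1"


def encoding_for(codeset, table=KNOWN_CODESETS):
    """Map an nl_langinfo(CODESET) string to an internal encoding name.

    Unknown strings silently get the fallback, which is how the
    hard-to-find bugs arise.
    """
    return table.get(codeset, FALLBACK)


def encoding_for_current_locale():
    """Same lookup for the running process (Unix only, not called here)."""
    import locale
    return encoding_for(locale.nl_langinfo(locale.CODESET))
```

With this table, encoding_for("ANSI_X3.4-1968") gives "ascii", while
encoding_for("US-ASCII") quietly gives "iso8859-1" until upstream adds a
"US-ASCII" entry.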

The proposals are:
- Do not change what we have always done.
- Change it to something that makes sense, such as "C" (we tried "POSIX", which
was a very bad idea, but "C" seems to be commonly recognised by applications as
ASCII)
- Let's report the same as Linux, that will simplify portability
- Let's be obvious and report ASCII (also commonly recognised by applications)

The next question is: if we change the above, would it make sense to also report
ASCII for the ASCII locales:
- en_AU.US-ASCII
- en_CA.US-ASCII
- en_GB.US-ASCII
- en_NZ.US-ASCII
- en_US.US-ASCII
- en_ZA.US-ASCII

That would require some work; or should we make them return ASCII or even
ANSI_X3.4-1968?

Please share your opinion here

Best regards,
Bapt

John Marino

2015-11-14 12:19:04 UTC

Post by Baptiste Daroussin
The thing is, not all of them are aware that FreeBSD uses US-ASCII; tcl, for
example, is not, which means tcl is not able to determine what encoding is needed
for the C and POSIX locales.
Linux returns ANSI_X3.4-1968 (also known as US-ASCII), and most
applications know what Linux returns.
That means we need to teach all upstream about US-ASCII all the time.
- Do not change what we have always done.
- Change it to something that makes sense, such as "C" (we tried "POSIX", which
was a very bad idea, but "C" seems to be commonly recognised by applications as
ASCII)
- Let's report the same as Linux, that will simplify portability
- Let's be obvious and report ASCII (also commonly recognised by applications)

Just repeating my opinion in this new thread.
Since POSIX doesn't say anything definite, we should be Linux compatible:
1) Return "ANSI_X3.4-1968" for C/POSIX locale (was "US-ASCII").
2) Return "ASCII" for *.US-ASCII locales (was "US-ASCII").
A typical Linux program knows nothing about our "US-ASCII", and ports
rarely handle it.
Not doing that leads to hidden, hard-to-find bugs like the one still present
right now in our tcl ports. For all these years tcl has not understood the
FreeBSD-native nl_langinfo() "US-ASCII" and falls back to "iso8859-1"
(it understands the Linux "ANSI_X3.4-1968" and "ASCII", of course).

As a DragonFly representative (and probably the person that would
implement it), I can accept Andrey's proposal.

What it would mean:
1) "ANSI_X3.4-1968" would be the one return value of
nl_langinfo(CODESET) that is not in the output of "locale -m"

2) This would require an alteration to usr.bin/locale to add this
"ANSI_X3.4-1968" if not found (similar to how it's done for US-ASCII)

3) At the same time usr.bin/locale would be modified to change the check
from "US-ASCII" to "ASCII"

4) The locale tools would have to be modified to change all source and
map references from "US-ASCII" to "ASCII" and the six LC* generating
makefiles regenerated

5) nl_langinfo would be changed to return "ANSI_X3.4-1968" instead of
"US-ASCII" if the encoding equals "NONE"

6) the "make upgrade" utility would need to remove *.US-ASCII locales

7) Do we really need 6 ".ASCII" locales? They have very limited use; I'd
suggest just having "en_US.ASCII" and that's it. Dump en_AU, en_ZA,
en_GB, etc. We can keep all 6 if we want, but if we are removing
US-ASCII anyway, we should limit the locales to what makes sense.
Alternatively FreeBSD could link US-ASCII => ASCII and have both
variations but I think DragonFly will just drop US-ASCII in this case.
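
Items 1, 2 and 5 above can be sketched roughly as follows. This is only an
illustration in Python, not the libc or usr.bin/locale code: the function names
codeset_for and charmap_listing are hypothetical, and the "NONE" encoding string
is taken from item 5.

```python
# Value proposed in this thread for the C/POSIX locale.
C_LOCALE_CODESET = "ANSI_X3.4-1968"


def codeset_for(encoding):
    """Item 5: report the Linux-compatible name when the encoding is "NONE"."""
    return C_LOCALE_CODESET if encoding == "NONE" else encoding


def charmap_listing(installed):
    """Items 1 and 2: a "locale -m"-style listing that adds the C/POSIX
    codeset when no installed charmap carries that name."""
    names = set(installed)
    names.add(C_LOCALE_CODESET)
    return sorted(names)
```

Every other encoding name passes through codeset_for unchanged, so only the
C/POSIX case is affected.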

What nl_langinfo(CODESET) returns has to be reflected in the locale name
(with the exception of "ANSI_X3.4-1968") so there has to be e.g.
en_US.ASCII as a valid locale if US-ASCII is changed.

There might be other changes necessary if "US-ASCII" is changed; I'd
have to do a thorough review.

To get started, I think this needs to be decided:
A) Confirm we want locale -m and nl_langinfo(CODESET) to return
"ANSI_X3.4-1968" for C/POSIX locales
B) Confirm renaming US-ASCII locales to ASCII
C) (FreeBSD only) Decide if you want to keep US-ASCII locales as
symlinks. nl_langinfo(CODESET) will return "ASCII" for these symlinked
locales
D) Decide which "ASCII" locales are really needed. (I suggest one,
en_US.ASCII)

Thanks,
John

Ed Schouten

2015-11-16 17:35:03 UTC

Hi Baptiste,

I personally think it would be a shame if we were to deviate from returning
"US-ASCII", for the reason that "US-ASCII" also happens to be the
preferred MIME name for the character set:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

"ASCII" doesn't even seem to be an alias for this character set.
Though "ANSI_X3.4-1968" is an alias for ASCII, I wouldn't even know
that this is ASCII without doing a Google search.

In my opinion a decent implementation of newlocale() should support
any of the character set names and aliases provided on the IANA page,
but let nl_langinfo(CODESET) return the preferred MIME name.
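
A minimal Python sketch of that idea: accept any known alias on input, always
report the preferred MIME name on output. The alias set below is only a subset
written from memory of the IANA page and should be checked against it; the names
ASCII_ALIASES and resolve_codeset are hypothetical. Matching is done without
regard to case, as the IANA registry makes no case distinction.

```python
PREFERRED_NAME = "US-ASCII"

# Subset of the IANA aliases for US-ASCII (assumption: verify against the
# registry). Note that plain "ASCII" is deliberately absent.
ASCII_ALIASES = {
    "US-ASCII",
    "ANSI_X3.4-1968",
    "ISO646-US",
    "IBM367",
    "cp367",
    "csASCII",
}

_FOLDED = {alias.casefold() for alias in ASCII_ALIASES}


def resolve_codeset(name):
    """Return the preferred MIME name for a known alias, else None."""
    if name.casefold() in _FOLDED:
        return PREFERRED_NAME
    return None
```

Under this scheme a locale requested with "ANSI_X3.4-1968" would be accepted,
but the reported codeset would still be "US-ASCII".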

Post by Baptiste Daroussin
That means we need to teach all upstream about US-ASCII all the time.

Could you come up with a concrete list of pieces of software that need
to be changed? Is it just those three pieces of software that you
mentioned above? If so, then I think it would be a shame to make the
concession.

--
Ed Schouten <***@nuxi.nl>
Nuxi, 's-Hertogenbosch, the Netherlands
KvK-nr.: 62051717

Andrey Chernov

2015-11-16 19:00:29 UTC

Post by Ed Schouten
I personally think it's a shame if we were to deviate from returning
"US-ASCII", for the reason that "US-ASCII" also happens to be the
preferred MIME name for the character set:
http://www.iana.org/assignments/character-sets/character-sets.xhtml
"ASCII" doesn't even seem to be an alias for this character set.

Yes, I overlooked it somehow. ASCII is not in the IANA list, while both
ANSI_X3.4-1968 and US-ASCII are.

So, I am reconsidering the proposal. We can return ANSI_X3.4-1968 for POSIX/C
(for Linux compatibility reasons) and leave plain US-ASCII as it was
(since it is used rarely).

Post by Ed Schouten
In my opinion a decent implementation of newlocale() should support
any of the character set names and aliases provided on the IANA page,
but let nl_langinfo(CODESET) return the preferred MIME name.

BTW, we already have and return non-IANA codesets historically (inspired
by X11). I.e. we have ISO8859-* instead of the preferred names ISO-8859-*;
moreover, ISO8859-* are not even aliases (!) and IANA knows nothing
about them. Linux has the IANA preferred names here, i.e. ISO-8859-*.

So the question is: should we rename ISO8859-* to ISO-8859-* to be IANA
and Linux compatible?

We can strip the first (or all) "_" and "-" from the environment names (as
Linux does), so as not to violate POLA.

Post by Ed Schouten

Post by Baptiste Daroussin
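That means we need to teach all upstream about US-ASCII all the time.
Could you come up with a concrete list of pieces of software that need
to be changed? Is it just those three pieces of software that you
mentioned above?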

No, I have seen such checks many times in other programs too; tcl is just one
that can be found quickly. The proper way to examine the situation
would be to unpack _all_ ports and search through the code, but my
machine can't handle it.

--
http://ache.vniz.net/

Baptiste Daroussin

2015-11-16 21:06:59 UTC

Post by Andrey Chernov

Yes, I overlooked it somehow. ASCII is not in the IANA list, while both
ANSI_X3.4-1968 and US-ASCII are.
So, I am reconsidering the proposal. We can return ANSI_X3.4-1968 for POSIX/C
(for Linux compatibility reasons) and leave plain US-ASCII as it was
(since it is used rarely).

To tell the truth, the locale changes I made were painful enough (mostly my
fault) and I (for now) won't do any further work besides fixing the fallout, if
any is left. But I do support this proposal!

Post by Andrey Chernov

BTW, we already have and return non-IANA codesets historically (inspired
by X11). I.e. we have ISO8859-* instead of the preferred names ISO-8859-*;
moreover, ISO8859-* are not even aliases (!) and IANA knows nothing
about them. Linux has the IANA preferred names here, i.e. ISO-8859-*.
So the question is: should we rename ISO8859-* to ISO-8859-* to be IANA
and Linux compatible?
We can strip the first (or all) "_" and "-" from the environment names (as
Linux does), so as not to violate POLA.

I would like to see that as well; lots of newcomers I have seen set up their
locales the IANA way and are unhappy because that does not work. The first plan
in the collation branch was to introduce the IANA syntax via an alias, but in the
end I removed it because there were already too many changes.

If anyone wants to go further with locale changes like the above proposal, please
proceed.

Best regards,
Bapt

Andrey Chernov

2015-11-16 21:51:37 UTC

Post by Baptiste Daroussin
I would like to see that as well; lots of newcomers I have seen set up their
locales the IANA way and are unhappy because that does not work. The first plan
in the collation branch was to introduce the IANA syntax via an alias, but in the
end I removed it because there were already too many changes.

For the ISO case we don't need aliases and can keep our internal name
hierarchy, honoring POLA. All we need is:
1) Convert "ISO-" and "ISO_" to "ISO" for setlocale(3) input.
2) Convert from "ISO" to "ISO-" for setlocale(3), nl_langinfo(3) and
locale(1) output.
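
The two conversions can be sketched in Python as below. This is an illustration
of the idea only, not libc code: to_internal and to_external are hypothetical
names, and the sketch assumes the rewrite should apply only where "ISO" is
directly followed by the digits of the standard number.

```python
import re


def to_internal(name):
    """Step 1 (input side): "ISO-8859-1" or "ISO_8859-1" -> "ISO8859-1"."""
    return re.sub(r"ISO[-_](?=\d)", "ISO", name)


def to_external(name):
    """Step 2 (output side): "ISO8859-1" -> "ISO-8859-1"."""
    return re.sub(r"ISO(?=\d)", "ISO-", name)
```

For example, to_internal("en_US.ISO-8859-1") yields "en_US.ISO8859-1", and
to_external applied to that result yields "en_US.ISO-8859-1" again. Names with
no ISO codeset, such as "en_US.UTF-8", pass through both functions unchanged.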

--
http://ache.vniz.net/

John Marino (FreeBSD)

2015-11-17 08:22:57 UTC

Post by Andrey Chernov

For the ISO case we don't need aliases and can keep our internal name
hierarchy, honoring POLA. All we need is:
1) Convert "ISO-" and "ISO_" to "ISO" for setlocale(3) input.
2) Convert from "ISO" to "ISO-" for setlocale(3), nl_langinfo(3) and
locale(1) output.

A huge patch just went into GCC libstdc++ testsuite to change all the
locale names to "ISO8859-" because it works for both Linux and *BSD.

This is change for change's sake.

locale -m lists the encodings.
locale -a lists the available locales.

This is true on Linux as well.
Nobody is getting POLA'D here.

Moreover, there is significant work to implement this. We brought up
the possibility of hyphen- and case-sensitivity on DragonFly and the
idea was shot down. The reasons were solid enough.

There is no standard for encoding, period. Using one source is as valid
as another. I say leave it alone.

John

Garrett Cooper

2015-11-17 16:46:26 UTC

Post by John Marino (FreeBSD)

Post by Andrey Chernov

A huge patch just went into GCC libstdc++ testsuite to change all the
locale names to "ISO8859-" because it works for both Linux and *BSD.
This is change for change's sake.
locale -m lists the encodings.
locale -a lists the available locales.
This is true on Linux as well.
Nobody is getting POLA'D here.
Moreover, there is significant work to implement this. We brought up
the possibility of hyphen- and case-sensitivity on DragonFly and the
idea was shot down. The reasons were solid enough.
There is no standard for encoding, period. Using one source is as valid
as another. I say leave it alone.

Windows is probably the closest thing to a standard here. What does it use -- dashes or underscores?
Thanks,
-NGie