Discussion:
Case changing operations
Joseph Wright
2014-06-30 08:22:39 UTC
Hello all,

To support case-changing operations in expl3, the team some time ago
added an experimental pair
\tl_expandable_uppercase:n/\tl_expandable_lowercase:n as alternatives to
\tl_to_uppercase:n/\tl_to_lowercase:n. While the expandable operations
are useful, there are issues both in terms of naming (solvable) and
functionality (more complex). In particular, they cover only the ASCII
range and do not offer some of the context-sensitive case changing that
is required for languages other than English.

In order to address this, we have now added a new set of experimental
functions to l3candidates:

- \tl_upper_case:n(n)
- \tl_lower_case:n(n)
- \tl_mixed_case:n(n)

These are x-type expandable, so can be used inside for example
\tl_set:Nx, and when used with XeTeX or LuaTeX offer full UTF-8
character coverage. What we are hoping for is some feedback on the
interfaces, naming, etc.: we believe that the ideas are useful, and hope
in the longer term to use these to replace
\tl_expandable_(upper|lower)case:n and \tl_to_(upper|lower)case:n for
case changing. (The latter will still be required for generating
non-standard catcodes: we will provide a better interface for that
process at a later date.)

(We note that while expandability is not absolutely required in this
area, there are advantages to being able to simply set a tl to the
case-changed version of text. We therefore feel that expandability is
desirable here and the approach we have taken to some technical issues
reflects this.)

The versions with one argument do a relatively simple
language-insensitive mapping:

\tl_lower_case:n { HELLO } => "hello"
\tl_upper_case:n { hello } => "HELLO"
\tl_mixed_case:n { HELLO } => "Hello"

while the two-argument versions can do language-dependent changes, such
as dotted/dotless-i/I handling in Turkish:

\tl_upper_case:nn { tr } { i } => "İ"
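
To show why a language argument is needed at all, here is a minimal Python sketch of the Turkish dotted/dotless rule. It is not the expl3 implementation: the function names and the two small mapping tables are made up for illustration, and only the i-like letters are special-cased.

```python
# Illustrative sketch only: not the expl3 code discussed in this thread.
# Turkish pairs i/İ (dotted) and ı/I (dotless), unlike the default mapping.
TR_UPPER = str.maketrans({"i": "\u0130"})                 # i -> İ
TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})  # I -> ı, İ -> i


def upper_case(text, lang=None):
    """Upper-case text, applying the Turkish i rule when lang is 'tr'."""
    if lang == "tr":
        # ı -> I is already the default mapping, so only i needs help.
        text = text.translate(TR_UPPER)
    return text.upper()


def lower_case(text, lang=None):
    """Lower-case text, applying the Turkish I rule when lang is 'tr'."""
    if lang == "tr":
        text = text.translate(TR_LOWER)
    return text.lower()
```

The language-specific letters are remapped first and the default mapping then handles everything else, which loosely mirrors the split between the one-argument and two-argument functions.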

The 'mixed' case variant is the low-level command needed to implement
'sentence' or 'title' case (the Unicode Consortium refer to both the
low-level and higher-level mapping as title casing): here, there is no
attempt to pick up on 'words' in a 'sentence'. (Once discussion on these
lower level functions is complete, we will look to see how best to
provide higher-level code for title/sentence casing: these operations
clearly apply to 'text' not 'token lists'.)
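
As a rough illustration of the 'mixed' idea (upper-case the first letter, lower-case the rest, no word detection), here is a short Python sketch. The name mixed_case and the choice to skip every leading non-letter are assumptions for the example; which characters to skip is still an open question for the real functions.

```python
# Illustrative sketch only: not the expl3 code discussed in this thread.
def mixed_case(text):
    """Upper-case the first letter of text and lower-case what follows.

    Assumption: any leading character that is not a letter is skipped
    unchanged. No attempt is made to find 'words'.
    """
    for i, ch in enumerate(text):
        if ch.isalpha():
            return text[:i] + ch.upper() + text[i + 1:].lower()
    return text  # no letter found, nothing to change
```

For example, mixed_case("HELLO") gives "Hello", and leading punctuation such as an opening quote is passed through before the first letter is upper-cased.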

Some of what is required here is clear from the Unicode docs.
Implementing some of the requirements in TeX, particularly in an
expandable form, requires some modification of the described algorithms.
Thus areas where feedback is particularly welcome include:

- Brace groups/escaping: the current version takes an approach similar
to BibTeX, treating all brace groups as 'preserved'. This is a
clear rule but leaves open questions on how (if at all) to handle
commands in 'text'. Notably, these functions are intended for 'text
like' input, so this may not be an issue. Notice that math mode
is given no special treatment but can be protected from case
changing by bracing.

- Category code treatment: should case operations apply to chars on
a string-like basis (current approach) or only to 'letters'? Again,
as these functions seem to target 'text', category codes may not be
as important here as in some other context.

- Chars to skip at the start of 'text' when doing 'mixed' casing
(what counts as the first 'letter').

- The 'final sigma' rule in Greek: trying to handle all cases here is
challenging in an expandable TeX system, and so we have implemented
a more limited approach which counts a sigma as 'final' if followed
by a small set of chars (currently a space or one of "!'),.:;?]}")

- The 'dot above' rule in Lithuanian: we have again implemented this
using a more restricted approach than the Unicode docs describe,
focussing only on chars/accents which are (we understand) used in
Lithuanian

- Whether 'mixed' case is a clear description of the idea of
(informally) upper casing the first letter in 'text' and then
lower casing the remainder.
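
The restricted 'final sigma' rule from the list above can be sketched in Python as follows. This is only an illustration of the rule as described, not the expl3 code. The function name is made up, and the end_is_final flag is an assumption: the description above does not say how a sigma at the very end of the input is treated.

```python
# Illustrative sketch only: not the expl3 code discussed in this thread.
# A capital sigma is lower-cased to the final form only when the next
# character is a space or one of the listed punctuation characters.
FINAL_SIGMA_FOLLOWERS = set(" !'),.:;?]}")

CAPITAL_SIGMA = "\u03a3"  # Σ
SMALL_SIGMA = "\u03c3"    # σ
FINAL_SIGMA = "\u03c2"    # ς


def lower_case_greek(text, end_is_final=True):
    """Lower-case text using the restricted look-ahead rule for sigma.

    end_is_final (an assumption of this sketch) decides what happens to
    a sigma that is the last character of the input.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == CAPITAL_SIGMA:
            if i + 1 < len(text):
                is_final = text[i + 1] in FINAL_SIGMA_FOLLOWERS
            else:
                is_final = end_is_final
            out.append(FINAL_SIGMA if is_final else SMALL_SIGMA)
        else:
            # Other characters are handled one at a time so that the
            # sigma decision above is the only context-sensitive step.
            out.append(ch.lower())
    return "".join(out)
```

Only one character of look-ahead is needed, which is what makes a rule of this shape practical in an expandable setting. The cost is that it looks only forwards, so an isolated capital sigma followed by a space would also become a final sigma.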

The code has not gone to CTAN yet but is available on the GitHub mirror.
See in particular
https://github.com/latex3/svn-mirror/blob/master/l3kernel/l3candidates.dtx
and
https://github.com/latex3/svn-mirror/blob/master/l3kernel/l3unicode-data.def:
the latter is needed as it contains the data used for the transformations.

Feedback on all of this is very welcome: we hope to provide a
high-quality interface for case changing such that it can readily be
applied to a range of situations.
--
Joseph Wright
Joseph Wright
2014-06-30 08:28:57 UTC
Post by Joseph Wright
- Brace groups/escaping: the current version takes an approach similar
to BibTeX, treating all brace groups as 'preserved'. This is a
clear rule but leaves open questions on how (if at all) to handle
commands in 'text'. Notably, these functions are intended for 'text
like' input, so this may not be an issue. Notice that math mode
is given no special treatment but can be protected from case
changing by bracing.
To be clear, the operations do not case-change command sequence tokens:
I was referring to commands with arguments, as the brace groups will
prevent any case change:

\tl_upper_case:n { Some~\emph{text} } => "SOME~\emph{text}"
--
Joseph Wright
Joseph Wright
2014-06-30 11:30:59 UTC
Post by Joseph Wright
Post by Joseph Wright
- Brace groups/escaping: the current version takes an approach similar
to BibTeX, treating all brace groups as 'preserved'. This is a
clear rule but leaves open questions on how (if at all) to handle
commands in 'text'. Notably, these functions are intended for 'text
like' input, so this may not be an issue. Notice that math mode
is given no special treatment but can be protected from case
changing by bracing.
I was referring to commands with arguments, as the brace groups will
\tl_upper_case:n { Some~\emph{text} } => "SOME~\emph{text}"
A particular question here is of course accents: something like

\tl_upper_case:n { \'{e} }

will fail to case change but both

\tl_upper_case:n { \'e }

and

\tl_upper_case:n { é }

will work. There is a balance to be struck between convenient input
('all braces skipped' is easy to follow) and sufficient flexibility.
Thoughts most welcome.
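
To make the 'all braces skipped' rule concrete, here is a small Python sketch that reproduces the \emph and accent examples from this and the previous message. It is a simplification, not the expl3 code: the function name is made up, a control word is taken to be a backslash followed by letters, and category codes are ignored entirely.

```python
# Illustrative sketch only: not the expl3 code discussed in this thread.
def upper_case_tex(text):
    """Upper-case text, leaving commands and brace groups untouched."""
    out = []
    depth = 0  # current brace nesting level
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Copy a control word (\emph) or control symbol (\') as is.
            j = i + 1
            if j < len(text) and text[j].isalpha():
                while j < len(text) and text[j].isalpha():
                    j += 1
            else:
                j = min(j + 1, len(text))
            out.append(text[i:j])
            i = j
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        # Only characters outside every brace group are case changed.
        out.append(ch.upper() if depth == 0 else ch)
        i += 1
    return "".join(out)
```

With this rule \'{e} is returned unchanged because the e sits inside a group, while \'e becomes \'E and é becomes É, which is the trade-off described above.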
--
Joseph Wright
Joel C. Salomon
2014-06-30 13:05:59 UTC
Post by Joseph Wright
The versions with one argument do a relatively simple
\tl_lower_case:n { HELLO } => "hello"
\tl_upper_case:n { hello } => "HELLO"
\tl_mixed_case:n { HELLO } => "Hello"
while the two-argument versions can do language-dependent changes, such
\tl_upper_case:nn { tr } { i } => "İ"
There’s an important use-case that seems not to have been addressed,
but perhaps this is better handled in a different layer:
mixed-language strings.

For example, consider a document with the title, “The Interesting Life
of Ragıp Hulûsi Özdem” (to choose the first Turkish name I could find
with both dotted i and dotless ı). Somehow, within the \title{}
declaration, the change of language must be indicated so that (e.g.)
at the top of the page this will be transformed to “THE INTERESTING
LIFE OF RAGIP HULÛSİ ÖZDEM” and not “THE INTERESTİNG LİFE …” nor “…
LIFE OF RAGIP HULÛSI …”.

A similar situation arises in German where within geographical names
‘ß’ should capitalize to the recently-defined ‘ẞ’, not ‘SS’.
(According to <http://en.wikipedia.org/wiki/Capital_ẞ>, this rule was
adopted in 2010.)

As I said, this is probably best handled in a separate layer: Code
that capitalizes user-provided text would need to defer to the LaTeX3
equivalent of Babel, which would scan the text for user-level
language-change commands, and (among other things) call
\tl_upper_case:nn with the appropriate language argument. But I think
it’s important that the interface to the casing functions being
defined now be aware of the way they will likely be used.

—Joel
Joseph Wright
2014-06-30 16:59:07 UTC
Post by Joel C. Salomon
There’s an important use-case that seems not to have been addressed,
mixed-language strings.
For example, consider a document with the title, “The Interesting Life
of Ragıp Hulûsi Özdem” (to choose the first Turkish name I could find
with both dotted i and dotless ı). Somehow, within the \title{}
declaration, the change of language must be indicated so that (e.g.)
at the top of the page this will be transformed to “THE INTERESTING
LIFE OF RAGIP HULÛSİ ÖZDEM” and not “THE INTERESTİNG LİFE …” nor “…
LIFE OF RAGIP HULÛSI …”.
A similar situation arises in German where within geographical names
‘ß’ should capitalize to the recently-defined ‘ẞ’, not ‘SS’.
(According to <http://en.wikipedia.org/wiki/Capital_ẞ>, this rule was
adopted in 2010.)
As I said, this is probably best handled in a separate layer: Code
that capitalizes user-provided text would need to defer to the LaTeX3
equivalent of Babel, which would scan the text for user-level
language-change commands, and (among other things) call
\tl_upper_case:nn with the appropriate language argument. But I think
it’s important that the interface to the casing functions being
defined now be aware of the way they will likely be used.
As you say, this looks much more like a 'high level' requirement: it's
tricky to see how nesting can work without at the same time being tied
to particular design decisions. For example, we can mark up a language
in the input easily enough

The Interesting Life of \SomeLangCommand{tr}{Ragıp Hulûsi Özdem}

but the problem is then making sure that the command does case changing
at point of use. One might imagine that such a command might have a
flexible definition:

\TitleCase#1 =>
  \cs_set_eq:NN \SomeLangCommand \text_title_case:nn
  ... % Other similar stuff
  \text_title_case:Vn \l_language_current_tl {#1}

That might still leave a question about x- versus f-type expansion: if
the outcome is meant to be 'just text' then you need to expand
\SomeLangCommand, which at the moment is deliberately avoided. Of
course, such an issue might be avoided by doing a pre-parse, as you suggest:


\TitleCase#1 =>
  \cs_set_eq:NN \SomeLangCommand \text_title_case_and_brace:nn
  ... % Other similar stuff
  \tl_set:Nx \l_some_tmpa_tl {#1} %
  % Now "The Interesting Life of {RAGIP HULÛSİ ÖZDEM}"
  \tl_set:Nx \l_some_tmpa_tl
    { \text_title_case:VV \l_language_current_tl \l_some_tmpa_tl }
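
The general shape of the 'split by language, then case change each piece' idea can be sketched in Python. This is a loose, single-pass analogue of the outline above, not a proposal for the expl3 interface: it upper-cases rather than title-cases, the function names are made up, the regular expression only understands un-nested \SomeLangCommand{<lang>}{<text>} markup, and only the Turkish i rule is language-specific.

```python
# Illustrative sketch only: not the expl3 code discussed in this thread.
import re

TR_UPPER = str.maketrans({"i": "\u0130"})  # i -> İ (ı -> I is the default)

# Matches \SomeLangCommand{<lang>}{<text>} with no nested braces.
LANG_SPAN = re.compile(r"\\SomeLangCommand\{([^{}]*)\}\{([^{}]*)\}")


def upper_case(text, lang):
    """Upper-case text, applying the Turkish i rule when lang is 'tr'."""
    if lang == "tr":
        text = text.translate(TR_UPPER)
    return text.upper()


def upper_case_mixed(text, default_lang="en"):
    """Upper-case text, honouring per-span language markup."""
    out = []
    pos = 0
    for match in LANG_SPAN.finditer(text):
        # Text before the marked span uses the surrounding language.
        out.append(upper_case(text[pos:match.start()], default_lang))
        # The marked span uses its own language; the markup is dropped.
        out.append(upper_case(match.group(2), match.group(1)))
        pos = match.end()
    out.append(upper_case(text[pos:], default_lang))
    return "".join(out)
```

Applied to the marked-up title above, this gives "THE INTERESTING LIFE OF RAGIP HULÛSİ ÖZDEM": the English part keeps a plain I and only the name receives the dotted capital.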

On the German business, I'd already wondered how best to add the
'capital Eszett' business for the simple case. I'm not quite sure how
one is meant to do that (not de-DE or whatever, but de-<something>!).
I'm also not sure whether people who do use it will use it for all
Eszetts (might otherwise lead to some odd decisions in upper casing). As
you say, this could of course occur in input where such a decision
applies only to some cases.
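
For the Eszett question, the two possible outcomes can be sketched in Python. The function name and the capital_eszett flag are made up for the example; how such a choice should actually be selected (a language tag variant or something else) is exactly the open point above.

```python
# Illustrative sketch only: not the expl3 code discussed in this thread.
CAPITAL_ESZETT = str.maketrans({"\u00df": "\u1e9e"})  # ß -> ẞ


def upper_case_de(text, capital_eszett=False):
    """Upper-case German text.

    By default ß becomes SS (the standard mapping); with
    capital_eszett=True every ß becomes the capital letter ẞ instead.
    """
    if capital_eszett:
        text = text.translate(CAPITAL_ESZETT)
    return text.upper()
```

Note that the flag applies to every ß in the input, so a text needing both treatments would still have to be split up first.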

What is worth noting is that while the commands I've added take general
text as input, we are seeing them as building blocks for e.g. a
hypothetical \text_title_case:nn. That and related operations need to
do things like worry about 'words', and in that context splitting up the
input is needed anyway. (Will has an expandable approach to do that. We
might imagine seeking to add to that some form of 'recursion' for nested
languages.)

BTW, nice Turkish name: that one is going into the test suite for this area!
--
Joseph Wright
Joel C. Salomon
2014-06-30 21:07:14 UTC
On Mon, Jun 30, 2014 at 12:59 PM, Joseph Wright
Post by Joseph Wright
BTW, nice Turkish name: that one is going into the test suite for this area!
Correction: The name is that of a linguist mentioned in
<http://en.wikipedia.org/wiki/Turkish_alphabet>. (I have no idea
whether his life was interesting.)

—Joel
Joel C. Salomon
2014-06-30 21:05:21 UTC
On Mon, Jun 30, 2014 at 12:59 PM, Joseph Wright
Post by Joseph Wright
BTW, nice Turkish name: that one is going into the test suite for this area!
The name is that of a linguist mentioned in
<http://en.wikipedia.org/wiki/Dotted_and_dotless_I>. (I have no idea
whether his life was interesting.)

—Joel
Joseph Wright
2014-06-30 18:28:18 UTC
Post by Joseph Wright
To support case-changing operations in expl3, the team some time ago
added an experimental pair
\tl_expandable_uppercase:n/\tl_expandable_lowercase:n as alternatives to
\tl_to_uppercase:n/\tl_to_lowercase:n. While the expandable operations
are useful, there are issues both in terms of naming (solvable) and
functionality (more complex). In particular, they cover only the ASCII
range and do not offer some of the context-sensitive case changing that
is required for languages other than English.
In order to address this, we have now added a new set of experimental
- \tl_upper_case:n(n)
- \tl_lower_case:n(n)
- \tl_mixed_case:n(n)
A question raised elsewhere
(http://chat.stackexchange.com/transcript/message/16351207#16351207) is
of course whether "tl" is the right place for such functions at all.
It's arguable that they can be regarded as "text" functions, so perhaps a
"text manipulation" module would be a better location. That does not of
course preclude discussing the detail of how they should also work, but
may be worth consideration. Feedback here also welcome!
--
Joseph Wright
Joseph Wright
2014-07-01 07:29:02 UTC
Post by Joseph Wright
Post by Joseph Wright
In order to address this, we have now added a new set of experimental
- \tl_upper_case:n(n)
- \tl_lower_case:n(n)
- \tl_mixed_case:n(n)
A question raised elsewhere
(http://chat.stackexchange.com/transcript/message/16351207#16351207) is
of course whether "tl" is the right place for such functions at all.
It's arguable that they can be regarded as "text" functions, so perhaps a
"text manipulation" module would be a better location. That does not of
course preclude discussing the detail of how they should also work, but
may be worth consideration. Feedback here also welcome!
One argument here is that *at present* it's not clear what might be a
'better' location for case changing, while the need for the
functionality is apparent and an implementation is doable 'now'. Thus we
might argue that adding to tl with the possibility of a (well-defined)
move to another module could occur at some stage in the future. This
approach avoids adding new modules which turn out to be poorly defined.

There is a tension there of course with 'stability': we are aiming not
to make changes without good reason, but at the same time are trying to
have mechanisms which do allow for some change where this makes sense.
--
Joseph Wright
Robin Fairbairns
2014-07-01 09:48:30 UTC
Post by Joseph Wright
Post by Joseph Wright
A question raised elsewhere
(http://chat.stackexchange.com/transcript/message/16351207#16351207) is
of course whether "tl" is the right place for such functions at all.
It's arguable that they can be regarded as "text" functions, so perhaps a
"text manipulation" module would be a better location. That does not of
course preclude discussing the detail of how they should also work, but
may be worth consideration. Feedback here also welcome!
One argument here is that *at present* it's not clear what might be a
'better' location for case changing, while the need for the
functionality is apparent and an implementation is doable 'now'. Thus we
might argue that adding to tl with the possibility of a (well-defined)
move to another module could occur at some stage in the future. This
approach avoids adding new modules which turn out to be poorly defined.
There is a tension there of course with 'stability': we are aiming not
to make changes without good reason, but at the same time are trying to
have mechanisms which do allow for some change where this makes sense.
the other concern that sticks in my mind, is frank suggesting
case-diddling isn't really stuff for the latex kernel.

does anyone know what context do? is there scope for joint work on a
common module?

robin
standing on the edge and watching...
Joseph Wright
2014-07-01 13:00:48 UTC
Post by Robin Fairbairns
the other concern that sticks in my mind, is frank suggesting
case-diddling isn't really stuff for the latex kernel.
I hope we are safe here: case changing is provided by many programming
languages (so is likely 'right' for expl3), and we do need such
functionality for e.g. running headers (all caps is common in such places).
Post by Robin Fairbairns
does anyone know what context do? is there scope for joint work on a
common module?
Reading e.g. http://wiki.contextgarden.net/Titles, the ConTeXt approach
(as I'd guess) has more than one part. The primitives tend to be more
'encouraged' in ConTeXt than is likely to be the case for any
stand-alone LaTeX3, and indeed than is often the case for LaTeX2e. Thus
they have some examples using \uppercase, which is (in the version of
ConTeXt I have installed) the primitive unaltered (I did not check for
any callback here). There is also e.g. \WORD, which in MkIV uses
LuaTeX's \attribute system to effect the change at the font level. The
latter is therefore able to e.g. convert "ß" to "SS". (In my tests I did
not find another example of a 1 -> many situation which is covered by
the standard fonts.) Thus they have somewhat different
requirements/aims. (Testing here suggests that e.g. \Word will pick up
"." but does not handle for example quotations.)

(Note: That page says \WORD will mess up \em, which may be correct for
MkII but is wrong for MkIV. I've not checked what MkII does here.)

(See also http://wiki.contextgarden.net/Command/setcharactercasing for
the more general case, which works on a per-paragraph basis, again using
attributes to achieve this.)
--
Joseph Wright
Joseph Wright
2014-07-01 13:16:02 UTC
Post by Joseph Wright
Post by Robin Fairbairns
does anyone know what context do? is there scope for joint work on a
common module?
Reading e.g. http://wiki.contextgarden.net/Titles, the ConTeXt approach
(as I'd guess) has more than one part. The primitives tend to be more
'encouraged' in ConTeXt than is likely to be the case for any
stand-alone LaTeX3, and indeed than is often the case for LaTeX2e. Thus
they have some examples using \uppercase, which is (in the version of
ConTeXt I have installed) the primitive unaltered (I did not check for
any callback here). There is also e.g. \WORD, which in MkIV uses
LuaTeX's \attribute system to effect the change at the font level. The
latter is therefore able to e.g. convert "ß" to "SS". (In my tests I did
not find another example of a 1 -> many situation which is covered by
the standard fonts.) Thus they have somewhat different
requirements/aims. (Testing here suggests that e.g. \Word will pick up
"." but does not handle for example quotations.)
(Note: That page says \WORD will mess up \em, which may be correct for
MkII but is wrong for MkIV. I've not checked what MkII does here.)
(See also http://wiki.contextgarden.net/Command/setcharactercasing for
the more general case, which works on a per-paragraph basis, again using
attributes to achieve this.)
What of course is notable here is that they've not felt the need to
actually convert stored input (easily) into different cases: neither the
primitive nor the attribute approach is expandable.
--
Joseph Wright