XSLT and XPath function reference in alphabetical order

(Excerpt from “XSLT 2.0 & XPath 2.0” by Frank Bongers, chapter 5, translated from German)

fn:normalize-unicode

Category:

String functions – analysis and manipulation

Origin:

XPath 2.0

Return value:

An xs:string: the input string in normalised form, either according to Unicode NFC (the default) or according to the explicitly requested Unicode normalisation form.

Call/Arguments:

fn:normalize-unicode($inputString, $normalizationForm?)

$inputString:

Required. An xs:string that is to be normalised according to one of the Unicode normalisation forms. The empty sequence may be passed; in that case the function returns the zero-length string.

$normalizationForm:

Optional. An xs:string that must correspond to the identifier of a normalisation form supported by the implementation; otherwise an error is reported. To determine the effective identifier, leading and trailing whitespace is removed from $normalizationForm and the result is converted to upper case. Passing the empty string as second argument disables normalisation.

Purpose of use:

The fn:normalize-unicode() function does not perform whitespace normalisation; it normalises the input string according to one of the four Unicode normalisation forms. Such a normalisation unifies the representation of character strings containing Unicode composite characters. This is useful before comparing Unicode character strings. Pure ASCII text does not need to be normalised in this way because no composite characters can occur in it.

Possibilities of application:

Normalisation is useful for any string containing compound Unicode characters (composite characters). These include, for example, ligatures and characters with accents or umlauts. For instance, the lower case letter ä can be represented by the single character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or by the sequence of the two characters U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS. Because a direct string comparison works on code points, two strings that should in principle be considered equal can compare as unequal depending on which variant each of them uses.
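The effect on a comparison can be illustrated outside XPath. The following is a minimal sketch in Python, whose standard-library function unicodedata.normalize implements the same Unicode normalisation forms. It merely stands in for fn:normalize-unicode() and is not part of XSLT or XPath; the variable names are chosen for illustration only.

```python
import unicodedata

# Two encodings of the lower case letter ä.
precomposed = "\u00e4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
combined = "a\u0308"     # U+0061 followed by U+0308 COMBINING DIAERESIS

# A direct comparison works on code points and therefore fails.
equal_without_normalisation = precomposed == combined

# After normalisation to NFC both strings consist of the single character U+00E4.
equal_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", combined)
)

print(equal_without_normalisation)  # False
print(equal_after_nfc)              # True
```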

The Unicode normalisation forms ensure that all precomposed characters and combining character sequences are converted to a uniformly defined representation before a possible comparison.

In the course of normalisation according to NFC, composite characters are first decomposed. The resulting sequences are then replaced by a canonical composite representation (according to a uniform rule). So-called singleton characters, which are single code points whose canonical equivalent is exactly one other character (for example U+212B ANGSTROM SIGN, Å, the symbol for »Ångström«), are replaced by that canonical equivalent (here U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE).
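The two steps can be made visible with the Ångström example. This is again only a sketch in Python, using unicodedata.normalize in place of an XPath processor: NFD shows the decomposition, and applying NFC to its result shows the subsequent composition.

```python
import unicodedata

angstrom_sign = "\u212b"  # U+212B ANGSTROM SIGN, a singleton

# Step 1: canonical decomposition (the result that NFD yields).
decomposed = unicodedata.normalize("NFD", angstrom_sign)

# Step 2: canonical composition of the decomposed sequence (NFC).
recomposed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0041', 'U+030A']
print(f"U+{ord(recomposed):04X}")               # U+00C5
```

The composed result is U+00C5 and not the original U+212B: a singleton never reappears after normalisation.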

Pure ASCII text, and consequently any program source code written in ASCII, remains unchanged by a Unicode normalisation (ASCII does not contain composite characters). A normalisation is therefore not harmful here, merely redundant.
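A quick check of this claim, sketched in Python with unicodedata.normalize standing in for fn:normalize-unicode(). The sample string is arbitrary and deliberately contains runs of whitespace, which are left untouched as well, since no whitespace normalisation takes place.

```python
import unicodedata

# ASCII-only text, including a double space and a tab.
source = "if  (x < 10)\t{ return x + 1; }"

results = {
    form: unicodedata.normalize(form, source)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# Every normalisation form returns the ASCII input unchanged.
unchanged = all(result == source for result in results.values())
print(unchanged)  # True
```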

Characters affected by Unicode normalisation:
There are comparison charts containing the characters of all character sets affected by Unicode normalisation.

Normalisation forms:

The standard requires an implementation to support only the (most common) normalisation form NFC (Unicode Normalization Form C, canonical composition), which is also the form favoured by the W3C. This form is used by default if no second argument is passed, that is, if no normalisation form is explicitly requested:

NFC (default setting) – Unicode Normalization Form C, canonical composition

In addition to NFC, there are further forms which an implementation may (but need not) support:

NFD – Unicode Normalization Form D, canonical decomposition
NFKC – Unicode Normalization Form KC, compatibility decomposition followed by canonical composition
NFKD – Unicode Normalization Form KD, compatibility decomposition
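The difference between the four forms becomes visible with a string that contains both an accented letter and a ligature. The following Python sketch uses unicodedata.normalize as a stand-in for an XPath processor that supports all four forms; the sample string is an arbitrary choice.

```python
import unicodedata

sample = "\u00e9\ufb01"  # é (U+00E9) followed by the ligature ﬁ (U+FB01)

# Print the code points that each normalisation form produces.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalised = unicodedata.normalize(form, sample)
    code_points = " ".join(f"U+{ord(c):04X}" for c in normalised)
    print(form, code_points)
```

The canonical forms (NFC, NFD) keep the ligature U+FB01 and differ only in whether é is composed or decomposed. The compatibility forms (NFKC, NFKD) additionally replace the ligature by the two letters f and i.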

Figure: excerpt from the Unicode normalization chart

There is a further form not belonging to the Unicode standard and possibly supported by some applications:

FULLY-NORMALIZED – the normalisation regulation of the W3C; described in »Character Model for the World Wide Web 1.0«.
This form essentially corresponds to NFC, but allows only so-called base characters (Unicode Combining Class 0) as the first character of the string. If the string begins with a combining character, for example an accent, a space character is placed in front of it during normalisation to serve as the base character carrying the combining character.
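The rule as described here can be sketched as follows. This Python fragment only follows the simplified description above (NFC plus a leading space before an initial combining character); it does not implement the full definition in the »Character Model for the World Wide Web«, and the function name fully_normalized_sketch is hypothetical.

```python
import unicodedata

def fully_normalized_sketch(text: str) -> str:
    """Hypothetical helper: NFC plus the leading-space rule described above."""
    result = unicodedata.normalize("NFC", text)
    # A non-zero canonical combining class marks a combining character.
    if result and unicodedata.combining(result[0]) != 0:
        result = " " + result
    return result

leading_accent = fully_normalized_sketch("\u0301a")  # starts with U+0301
ordinary = fully_normalized_sketch("a\u0301")        # plain NFC case

print([f"U+{ord(c):04X}" for c in leading_accent])  # ['U+0020', 'U+0301', 'U+0061']
print([f"U+{ord(c):04X}" for c in ordinary])        # ['U+00E1']
```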

Notice: notation of the identifiers of the normalisation form
Because leading and trailing whitespace is removed from the value of the normalisation form argument and the result is then converted to upper case, identifiers can in principle be passed in any capitalisation, with or without surrounding whitespace.
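How the effective identifier is derived can be sketched in a few lines of Python. The helper effective_form is hypothetical and only models the two steps named above; note that Python's str.strip() removes all kinds of whitespace, which is an assumption about what an XPath processor treats as leading and trailing blanks.

```python
def effective_form(identifier: str) -> str:
    """Hypothetical helper: strip surrounding whitespace, then upper-case."""
    return identifier.strip().upper()

variants = ["NFC", "nfc", " Nfc ", "\tnfC\n"]

# All notations collapse to the same effective identifier.
print({effective_form(v) for v in variants})  # {'NFC'}
```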

Beyond the abovementioned normalisation forms, the application is free to support any number of further normalisation forms.

If an identifier is passed that the implementation does not recognise, or whose normalisation form it does not support, the error »Unsupported normalization form« (err:FOCH0003) is reported.

If a second argument is passed to the function but is the empty string, no normalisation takes place (not even the default normalisation according to NFC); the input string is returned unchanged.
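The rules for the second argument can be summarised in a small model. This is a sketch in Python and not an XPath implementation: the function normalize_unicode_model, the exception class and the set of supported forms are assumptions made for illustration, with unicodedata.normalize doing the actual normalisation.

```python
import unicodedata

# Assumption: a processor that supports the four Unicode forms only.
SUPPORTED_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}

class UnsupportedNormalizationForm(ValueError):
    """Stands in for the XPath error err:FOCH0003."""

def normalize_unicode_model(arg: str, normalization_form: str = "NFC") -> str:
    effective = normalization_form.strip().upper()
    if effective == "":
        # Empty second argument: no normalisation, input returned unchanged.
        return arg
    if effective not in SUPPORTED_FORMS:
        raise UnsupportedNormalizationForm(f"FOCH0003: {normalization_form!r}")
    return unicodedata.normalize(effective, arg)

print(normalize_unicode_model("a\u0308") == "\u00e4")       # True (default NFC)
print(normalize_unicode_model("a\u0308", "") == "a\u0308")  # True (unchanged)
```

Calling normalize_unicode_model("x", "XYZ") raises UnsupportedNormalizationForm in this model, mirroring the error case described above.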

Function definition:

XPath 1.0:

The function is not available.

XPath 2.0:

fn:normalize-unicode($arg as xs:string?) as xs:string

fn:normalize-unicode($arg as xs:string?,
                     $normalizationForm as xs:string) as xs:string

Copyright © Galileo Press, Bonn 2008
Printing of the online version is permitted exclusively for private use. Otherwise this chapter from the book "XSLT 2.0 & XPath 2.0" is subject to the same provisions as those applicable for the hardcover edition: The work including all its components is protected by copyright. All rights reserved, including reproduction, translation, microfilming as well as storage and processing in electronic systems.

Galileo Press, Rheinwerkallee 4, 53227 Bonn, Germany