UCD: Unicode Property Database and Compilers

The Unicode Consortium defines a database of character properties for Unicode characters, as documented in UAX #44: Unicode Character Database. Parabix has built-in support for these properties.

Property Names and Aliases

The enumeration UCD::property_t has one enumerator for each property of the Unicode Character Database.
PropertyAliases.h defines this enumeration together with the standard names and aliases of the properties.

Property Values

For any given property, every Unicode character has a value of that property. Many properties are enumerated properties, for which the value of each character is one of a fixed set of named values. The names of all values for enumerated properties are found in PropertyValueAliases.h, together with aliases for those names when they exist. For example, the Script property has enumeration codes such as Latn, Grek and Cyrl, each identifying a specific script to which a character may be assigned.

Binary properties are a special case of enumerated properties having the values Y and N standing for yes or no (true or false).

Unicode Sets

Parabix describes many of the properties of Unicode characters in terms of UnicodeSet data objects. A UnicodeSet is a compact bitset representation of all the characters having some property-value combination. For example, a UnicodeSet can represent the set of Greek characters as the set of all characters whose Script property has the value Grek.

UnicodeSet objects provide many of the standard set operations, such as membership tests, union, intersection, set difference and so on. See unicode/core/unicode_set.h.
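To make the set operations concrete, here is a minimal sketch over a much simpler representation: a sorted list of disjoint, inclusive codepoint ranges. The names RangeSet, rangeContains, unite and intersect are hypothetical and are not part of the UnicodeSet interface; consult unicode/core/unicode_set.h for the actual operations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical stand-in for a set of codepoints (NOT the Parabix UnicodeSet):
// inclusive [lo, hi] ranges, kept sorted and non-overlapping.
using Range = std::pair<uint32_t, uint32_t>;
using RangeSet = std::vector<Range>;

// Membership test: binary search for the last range starting at or before cp.
bool rangeContains(const RangeSet & s, uint32_t cp) {
    auto it = std::upper_bound(s.begin(), s.end(), cp,
        [](uint32_t v, const Range & r) { return v < r.first; });
    return it != s.begin() && cp <= std::prev(it)->second;
}

// Union: sort all ranges together, then merge overlapping or adjacent ones.
RangeSet unite(const RangeSet & a, const RangeSet & b) {
    RangeSet all(a);
    all.insert(all.end(), b.begin(), b.end());
    std::sort(all.begin(), all.end());
    RangeSet out;
    for (const Range & r : all) {
        assert(r.first <= r.second && r.second <= 0x10FFFF);
        if (!out.empty() && r.first <= out.back().second + 1) {
            out.back().second = std::max(out.back().second, r.second);
        } else {
            out.push_back(r);
        }
    }
    return out;
}

// Intersection: advance two cursors, emitting the overlap of each pair.
RangeSet intersect(const RangeSet & a, const RangeSet & b) {
    RangeSet out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        const uint32_t lo = std::max(a[i].first, b[j].first);
        const uint32_t hi = std::min(a[i].second, b[j].second);
        if (lo <= hi) out.emplace_back(lo, hi);
        // Drop whichever range ends first; it cannot overlap anything later.
        if (a[i].second < b[j].second) ++i; else ++j;
    }
    return out;
}
```

For instance, uniting {[0370, 0373], [0375, 0377]} with {[0372, 0376]} under this sketch yields the single range [0370, 0377], while intersecting them yields [0372, 0373] and [0375, 0376]. A range list keeps the sketch short; it does not reflect the run-and-quad layout that UnicodeSet actually uses.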

Most Unicode properties have predefined UnicodeSet values built in. For example, the UnicodeSet for Greek characters is defined in Scripts.h as follows.

    /** Code Point Ranges for Grek
    [0370, 0373], [0375, 0377], [037a, 037d], [037f, 037f], [0384, 0384],
    [0386, 0386], [0388, 038a], [038c, 038c], [038e, 03a1], [03a3, 03e1],
    [03f0, 03ff], [1d26, 1d2a], [1d5d, 1d61], [1d66, 1d6a], [1dbf, 1dbf],
    [1f00, 1f15], [1f18, 1f1d], [1f20, 1f45], [1f48, 1f4d], [1f50, 1f57],
    [1f59, 1f59], [1f5b, 1f5b], [1f5d, 1f5d], [1f5f, 1f7d], [1f80, 1fb4],
    [1fb6, 1fc4], [1fc6, 1fd3], [1fd6, 1fdb], [1fdd, 1fef], [1ff2, 1ff4],
    [1ff6, 1ffe], [2126, 2126], [ab65, ab65], [10140, 1018e],
    [101a0, 101a0], [1d200, 1d245]**/


    namespace {
    const static UnicodeSet::run_t __grek_Set_runs[] = {
    {Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}, {Empty, 201},
    {Mixed, 3}, {Empty, 1}, {Mixed, 1}, {Empty, 10}, {Mixed, 1}, {Full, 1},
    {Mixed, 2}, {Full, 1}, {Mixed, 3}, {Empty, 9}, {Mixed, 1},
    {Empty, 1105}, {Mixed, 1}, {Empty, 686}, {Full, 2}, {Mixed, 2},
    {Empty, 1666}, {Full, 2}, {Mixed, 1}, {Empty, 31085}};
    const static UnicodeSet::bitquad_t  __grek_Set_quads[] = {
    0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003, 0x000007c0, 0xe0000000,
    0x000007c3, 0x80000000, 0x3f3fffff, 0xaaff3f3f, 0x3fffffff, 0xffdfffff,
    0xefcfffdf, 0x7fdcffff, 0x00000040, 0x00000020, 0x00007fff, 0x00000001,
    0x0000003f};
    }

    const static UnicodeSet grek_Set{const_cast<UnicodeSet::run_t *>(__grek_Set_runs), 25, 0,
                                     const_cast<UnicodeSet::bitquad_t *>(__grek_Set_quads), 19, 0};

In general, these predefined UnicodeSet values are generated automatically by scripts written in Python, based on the text files of the Unicode Character Database (UCD). Scripts.h is generated based on the UCD file Scripts.txt for example.
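The run and quad arrays in the Grek listing can be read as a run-length encoding over 32-bit quads: an Empty or Full run covers that many quads with no stored data, while each quad of a Mixed run consumes one entry of the quad array. The following sketch shows how a membership test could walk such a structure. The types RunSet and Run and the functions contains and grekPrefix are hypothetical stand-ins, not the UnicodeSet implementation, and the bit order (bit i of quad q is codepoint 32*q + i) is an assumption, although it agrees with the first quads of the listing when checked against the range comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the run/quad layout in the listing (not Parabix code).
enum RunType { Empty, Mixed, Full };

struct Run {
    RunType type;
    uint32_t length;   // number of consecutive 32-bit quads in this run
};

struct RunSet {
    std::vector<Run> runs;
    std::vector<uint32_t> quads;   // one entry per quad covered by a Mixed run
};

// Walk the runs until reaching the quad that holds cp, then decide by run type.
bool contains(const RunSet & s, uint32_t cp) {
    assert(cp <= 0x10FFFF);
    const uint32_t target = cp / 32;   // index of the quad holding cp
    uint32_t quadBase = 0;             // index of the first quad of the current run
    std::size_t mixedIdx = 0;          // quads consumed by earlier Mixed runs
    for (const Run & r : s.runs) {
        if (target < quadBase + r.length) {
            if (r.type == Empty) return false;
            if (r.type == Full) return true;
            const uint32_t q = s.quads[mixedIdx + (target - quadBase)];
            return ((q >> (cp % 32)) & 1u) != 0;   // assumed bit order
        }
        if (r.type == Mixed) mixedIdx += r.length;
        quadBase += r.length;
    }
    return false;   // beyond the last run
}

// The first four runs and quads of the Grek listing, covering U+0000..U+03FF only.
RunSet grekPrefix() {
    return RunSet{
        {{Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}},
        {0xbcef0000u, 0xffffd750u, 0xfffffffbu, 0xffff0003u}};
}
```

With this truncated prefix, contains(grekPrefix(), 0x03B1) is true (Greek small letter alpha falls in the third Mixed quad) and contains(grekPrefix(), 0x0374) is false, matching the gap between [0370, 0373] and [0375, 0377] in the range comment. A linear walk is the simplest choice for a sketch; a production structure might index the runs for faster lookup.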

Unicode Property Objects

The principal means of accessing information associated with a UCD property is the corresponding PropertyObject. There are several subclasses of PropertyObject depending on the type of property: binary, enumerated, numeric or string. See PropertyObjects.h for the definitions of property object types. See PropertyObjectTable.h for the actual definition of property objects for each UCD property.

For example, to determine the UnicodeSet of Greek characters, one can use the operation:

    llvm::cast<UCD::EnumeratedPropertyObject>(UCD::property_object_table[UCD::sc])->GetCodepointSet("Greek");

Unicode Property Resolution and Compilation

When a regular expression Name object refers to a Unicode character property, resolving that name into its corresponding UnicodeSet may be performed using operations defined in re/unicode/resolve_properties.h.

Direct compilation of Unicode properties can be achieved with the UnicodePropertyKernelBuilder in include/kernel/unicode/UCD_property_kernel.h. It is useful for constructing property streams for any of the defined Unicode properties. An example of its use for counting occurrences of a property within a file may be found in the ucount utility.

Grapheme Cluster Boundaries

Grapheme clusters are sequences of Unicode codepoints that are generally considered together to represent one logical character. For example, a base character such as the letter a may be followed by a combining mark such as the combining acute accent (U+0301) to produce the accented character á. The task of separating a stream of characters into grapheme clusters is a text segmentation problem known as the grapheme cluster boundary problem. The full Unicode rules for this are documented in UAX #29: Unicode Text Segmentation.

The logic for computing grapheme cluster boundaries with Parabix methods is illustrated by the gcount utility.
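As a rough illustration of the boundary problem only, the sketch below counts clusters under one drastically simplified rule: a codepoint in the Combining Diacritical Marks block (U+0300 to U+036F) attaches to whatever precedes it, and every other codepoint starts a new cluster. This is neither the UAX #29 algorithm nor the gcount logic; it ignores Hangul syllables, emoji sequences, regional indicators, CR LF and many other cases. The names isCombiningMark and countClusters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F are treated as extending marks.
bool isCombiningMark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count clusters in a UTF-32 string: a cluster starts at the first codepoint
// and at every later codepoint that is not a combining mark.
std::size_t countClusters(const std::u32string & text) {
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i == 0 || !isCombiningMark(text[i])) {
            ++clusters;
        }
    }
    return clusters;
}
```

Under this rule the three-codepoint string a, U+0301, b counts as two clusters, because the combining accent joins the preceding a.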