The Unicode Consortium defines a database of character properties for Unicode characters, as documented in UAX #44: Unicode Character Database. Parabix has built-in support for these properties.
Property Names and Aliases
The enumeration UCD::property_t has one member for each property of the Unicode Character Database.
PropertyAliases.h defines this enumeration, together with the standard name and aliases of each property.
Property Values
For any given property, there is a value of that property for each character in Unicode.
Many of the properties are enumerated properties, meaning that the value of each character
according to the property is one of a fixed set of named values. The names of all
values for enumerated properties are found in PropertyValueAliases.h, together with aliases for those names when they exist.
For example, the Script property has enumeration codes Latn, Grek, Cyrl and so on, each identifying
a specific script for a character.
Binary properties are a special case of enumerated properties having the values Y and N, standing for yes and no (true and false).
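The relationship between enumeration codes and their aliases can be sketched as a small lookup table. This is only an illustration: the names Script, ValueNames, kScriptNames and lookupScript are hypothetical and are not taken from PropertyValueAliases.h. The long aliases Latin, Greek and Cyrillic are the ones listed in the UCD for these three codes.

```cpp
#include <cstring>

// Hypothetical stand-in for a generated enumeration of Script values.
enum class Script { Latn, Grek, Cyrl, Unknown };

// One table row per value: the enumeration code plus its long alias.
struct ValueNames {
    Script value;
    const char* code;   // short enumeration code, e.g. "Grek"
    const char* alias;  // long alias, e.g. "Greek"
};

static const ValueNames kScriptNames[] = {
    {Script::Latn, "Latn", "Latin"},
    {Script::Grek, "Grek", "Greek"},
    {Script::Cyrl, "Cyrl", "Cyrillic"},
};

// Map either the code or the alias to the enumeration value.
// Matching here is exact and case-sensitive to keep the sketch short.
inline Script lookupScript(const char* name) {
    for (const ValueNames& entry : kScriptNames) {
        if (std::strcmp(name, entry.code) == 0 ||
            std::strcmp(name, entry.alias) == 0) {
            return entry.value;
        }
    }
    return Script::Unknown;
}
```

With this table, lookupScript("Grek") and lookupScript("Greek") both yield Script::Grek, which mirrors how a property value can be named by its code or by an alias.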
Unicode Sets
Parabix describes many of the properties of Unicode characters in terms of UnicodeSet data objects.
A UnicodeSet is a compact bitset representation defining all the characters having some
property-value combination. For example, the set of Greek characters can be represented as a
UnicodeSet containing all characters whose Script property has the value Grek.
UnicodeSet objects provide many of the standard set operations, such as membership tests, union, intersection,
set difference and so on. See unicode/core/unicode_set.h.
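To make the set operations concrete, here is a minimal sketch of a code point set stored as sorted, disjoint, inclusive ranges. It is not the Parabix UnicodeSet and does not use its API or its bitset layout; the names RangeSet, Range, unite and intersect are assumptions made for this illustration, and set difference is omitted for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using codepoint_t = std::uint32_t;

struct Range { codepoint_t lo, hi; };  // inclusive on both ends

// Hypothetical illustration only; not the Parabix UnicodeSet class.
class RangeSet {
public:
    RangeSet() = default;
    explicit RangeSet(std::vector<Range> rs) : ranges_(normalize(std::move(rs))) {}

    // Membership test: binary search for the first range ending at or after cp.
    bool contains(codepoint_t cp) const {
        auto it = std::lower_bound(ranges_.begin(), ranges_.end(), cp,
            [](const Range& r, codepoint_t c) { return r.hi < c; });
        return it != ranges_.end() && it->lo <= cp;
    }

    // Union: concatenate both range lists, then sort and merge.
    RangeSet unite(const RangeSet& other) const {
        std::vector<Range> all = ranges_;
        all.insert(all.end(), other.ranges_.begin(), other.ranges_.end());
        return RangeSet(std::move(all));
    }

    // Intersection: walk both sorted lists in step, keeping the overlaps.
    RangeSet intersect(const RangeSet& other) const {
        std::vector<Range> out;
        std::size_t i = 0, j = 0;
        while (i < ranges_.size() && j < other.ranges_.size()) {
            codepoint_t lo = std::max(ranges_[i].lo, other.ranges_[j].lo);
            codepoint_t hi = std::min(ranges_[i].hi, other.ranges_[j].hi);
            if (lo <= hi) out.push_back({lo, hi});
            if (ranges_[i].hi < other.ranges_[j].hi) ++i; else ++j;
        }
        return RangeSet(std::move(out));
    }

    std::size_t size() const { return ranges_.size(); }  // number of ranges

private:
    // Sort by lower bound and merge overlapping or adjacent ranges.
    static std::vector<Range> normalize(std::vector<Range> rs) {
        std::sort(rs.begin(), rs.end(),
                  [](const Range& a, const Range& b) { return a.lo < b.lo; });
        std::vector<Range> out;
        for (const Range& r : rs) {
            if (!out.empty() && r.lo <= out.back().hi + 1) {
                out.back().hi = std::max(out.back().hi, r.hi);
            } else {
                out.push_back(r);
            }
        }
        return out;
    }

    std::vector<Range> ranges_;
};
```

A range list is easy to read but slower to combine than a bitset, which is one reason a compact bitset representation is attractive for large property sets.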
Most Unicode properties have predefined UnicodeSet values built in. For example, the UnicodeSet for
Greek characters is defined in Scripts.h as follows.
/** Code Point Ranges for Grek
[0370, 0373], [0375, 0377], [037a, 037d], [037f, 037f], [0384, 0384],
[0386, 0386], [0388, 038a], [038c, 038c], [038e, 03a1], [03a3, 03e1],
[03f0, 03ff], [1d26, 1d2a], [1d5d, 1d61], [1d66, 1d6a], [1dbf, 1dbf],
[1f00, 1f15], [1f18, 1f1d], [1f20, 1f45], [1f48, 1f4d], [1f50, 1f57],
[1f59, 1f59], [1f5b, 1f5b], [1f5d, 1f5d], [1f5f, 1f7d], [1f80, 1fb4],
[1fb6, 1fc4], [1fc6, 1fd3], [1fd6, 1fdb], [1fdd, 1fef], [1ff2, 1ff4],
[1ff6, 1ffe], [2126, 2126], [ab65, ab65], [10140, 1018e],
[101a0, 101a0], [1d200, 1d245]**/
namespace {
const static UnicodeSet::run_t __grek_Set_runs[] = {
{Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}, {Empty, 201},
{Mixed, 3}, {Empty, 1}, {Mixed, 1}, {Empty, 10}, {Mixed, 1}, {Full, 1},
{Mixed, 2}, {Full, 1}, {Mixed, 3}, {Empty, 9}, {Mixed, 1},
{Empty, 1105}, {Mixed, 1}, {Empty, 686}, {Full, 2}, {Mixed, 2},
{Empty, 1666}, {Full, 2}, {Mixed, 1}, {Empty, 31085}};
const static UnicodeSet::bitquad_t __grek_Set_quads[] = {
0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003, 0x000007c0, 0xe0000000,
0x000007c3, 0x80000000, 0x3f3fffff, 0xaaff3f3f, 0x3fffffff, 0xffdfffff,
0xefcfffdf, 0x7fdcffff, 0x00000040, 0x00000020, 0x00007fff, 0x00000001,
0x0000003f};
}
const static UnicodeSet grek_Set{const_cast<UnicodeSet::run_t *>(__grek_Set_runs), 25, 0,
const_cast<UnicodeSet::bitquad_t *>(__grek_Set_quads), 19, 0};
In general, these predefined UnicodeSet values are generated automatically by
scripts written in Python, based on the text files of the Unicode Character Database (UCD).
For example, Scripts.h is generated from the UCD file Scripts.txt.
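The run and quad arrays in the Grek listing can be read as follows: each run appears to cover a number of consecutive 32-bit quads, where an Empty run has no bits set, a Full run has all bits set, and each quad of a Mixed run takes its bits from the next entry of the quads array. This interpretation is inferred from the listed data (the 25 runs total 34816 quads, or 0x110000 code points, and the Mixed runs account for exactly 19 quads); it is not the Parabix implementation. The following standalone sketch tests membership under that reading. The names RunType, Run, containsCodepoint and isGrek are hypothetical, and the data is copied from the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum RunType { Empty, Mixed, Full };
struct Run { RunType type; std::uint32_t length; };  // length in 32-bit quads

// Data copied from the Grek listing.
static const Run kGrekRuns[] = {
    {Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}, {Empty, 201},
    {Mixed, 3}, {Empty, 1}, {Mixed, 1}, {Empty, 10}, {Mixed, 1}, {Full, 1},
    {Mixed, 2}, {Full, 1}, {Mixed, 3}, {Empty, 9}, {Mixed, 1},
    {Empty, 1105}, {Mixed, 1}, {Empty, 686}, {Full, 2}, {Mixed, 2},
    {Empty, 1666}, {Full, 2}, {Mixed, 1}, {Empty, 31085}};
static const std::uint32_t kGrekQuads[] = {
    0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003, 0x000007c0, 0xe0000000,
    0x000007c3, 0x80000000, 0x3f3fffff, 0xaaff3f3f, 0x3fffffff, 0xffdfffff,
    0xefcfffdf, 0x7fdcffff, 0x00000040, 0x00000020, 0x00007fff, 0x00000001,
    0x0000003f};

// Walk the runs until reaching the one that covers quad number cp / 32.
inline bool containsCodepoint(const Run* runs, std::size_t runCount,
                              const std::uint32_t* quads, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    const std::uint32_t target = cp / 32;
    std::uint32_t firstQuad = 0;    // index of the first quad in the current run
    std::size_t mixedBefore = 0;    // Mixed quads consumed by earlier runs
    for (std::size_t i = 0; i < runCount; ++i) {
        const Run& r = runs[i];
        if (target < firstQuad + r.length) {
            if (r.type == Empty) return false;
            if (r.type == Full) return true;
            const std::uint32_t quad = quads[mixedBefore + (target - firstQuad)];
            return ((quad >> (cp % 32)) & 1u) != 0;
        }
        if (r.type == Mixed) mixedBefore += r.length;
        firstQuad += r.length;
    }
    return false;  // beyond the last run
}

inline bool isGrek(std::uint32_t cp) {
    return containsCodepoint(kGrekRuns, 25, kGrekQuads, cp);
}
```

Under this reading, U+0370 falls in quad 27, the first Mixed quad 0xbcef0000, whose bit 16 is set, which agrees with the first listed range [0370, 0373].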
Unicode Property Objects
The principal means of accessing information associated with a UCD property is through
its PropertyObject. There are several subclasses of PropertyObject, depending
on the type of property: binary, enumerated, numeric or string.
See PropertyObjects.h for the definitions of property object types.
See PropertyObjectTable.h for the actual definition of property objects for each UCD property.
For example, to determine the UnicodeSet of Greek characters, one can use the operation:
llvm::cast<UCD::EnumeratedPropertyObject>(UCD::property_object_table[UCD::sc])->GetCodepointSet("Greek");
Unicode Property Resolution and Compilation
When a regular expression Name object refers to a Unicode character property, that name can be resolved
into its corresponding UnicodeSet using operations defined in re/unicode/resolve_properties.h.
Direct compilation of Unicode properties can be achieved with the
UnicodePropertyKernelBuilder in include/kernel/unicode/UCD_property_kernel.h.
It is useful for constructing property streams
for any of the defined Unicode properties. An example of its use for counting
occurrences of a property within a file may be found in the ucount utility.
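As a point of comparison for what counting occurrences of a property means, here is a sequential sketch that decodes UTF-8 one code point at a time and counts those satisfying a predicate. It does not use UnicodePropertyKernelBuilder or any Parabix kernel and is not how ucount works internally. The names countMatching and isGrekExcerpt are hypothetical, the decoder assumes well-formed UTF-8, and the predicate covers only three ranges taken from the Grek listing.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Excerpt of the Grek ranges from Scripts.h: [0388, 038a], [038e, 03a1], [03a3, 03e1].
inline bool isGrekExcerpt(std::uint32_t cp) {
    return (cp >= 0x0388 && cp <= 0x038A) ||
           (cp >= 0x038E && cp <= 0x03A1) ||
           (cp >= 0x03A3 && cp <= 0x03E1);
}

// Count the code points in a UTF-8 string for which hasProperty returns true.
// Assumes the input is well-formed UTF-8; no validation is performed.
template <typename Pred>
std::size_t countMatching(const std::string& utf8, Pred hasProperty) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else                          { cp = lead & 0x07; len = 4; }
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        if (hasProperty(cp)) ++count;
        i += len;
    }
    return count;
}
```

A property stream can be thought of as the result of evaluating such a predicate at every character position at once, rather than one code point at a time as in this loop.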
Grapheme Cluster Boundaries
Grapheme clusters are sequences of Unicode codepoints that are generally considered together to represent one logical character. For example, a base character such as the letter a may be followed by an accent character such as ´ to produce the accented character á. The task of separating a stream of characters into grapheme clusters is a text segmentation problem known as the grapheme cluster boundary problem. The full Unicode rules for this are documented in UAX #29: Unicode Text Segmentation.
The logic for computing grapheme cluster boundaries with Parabix methods is illustrated by the gcount utility.
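To illustrate the idea of a cluster boundary, the sketch below counts clusters under a deliberately simplified rule: no boundary is placed before a code point in the Combining Diacritical Marks block (U+0300 to U+036F), and a boundary is placed everywhere else. This covers only a small part of the UAX #29 rules and is not the logic used by gcount; the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F are treated as extending marks.
inline bool isCombiningDiacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count clusters in a sequence of code points: every code point starts a new
// cluster unless it is a combining mark that follows some earlier code point.
inline std::size_t countClustersSimplified(const std::u32string& text) {
    std::size_t clusters = 0;
    for (char32_t cp : text) {
        if (clusters == 0 || !isCombiningDiacritic(cp)) ++clusters;
    }
    return clusters;
}
```

For the two code points a followed by U+0301 (combining acute accent), this counts a single cluster, matching the accented character example above.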