# UCD: The Unicode Database

The Unicode Consortium defines a database of character properties for Unicode characters,
as documented in [UAX #44: Unicode Character Database](https://unicode.org/reports/tr44/).
Parabix has built-in support for these properties.

## Property Names and Aliases

The enumeration `UCD::property_t` has one member for each property of the Unicode Character Database.
`PropertyAliases.h` defines this enumeration together with the standard names and aliases of the properties.
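
As an illustration of how names and aliases can map to a single enumeration member, here is a minimal sketch. It does not use the Parabix headers: the `Property` enum, `canonicalize` and `lookupProperty` are hypothetical names introduced only for this example, and the table holds just three properties with their short aliases from the UCD (`gc`, `sc`, `scx`). The matching is a simplified form of the loose matching described in UAX #44 (case, spaces, underscores and hyphens are ignored).

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a few members of an enumeration like UCD::property_t.
enum class Property { GeneralCategory, Script, ScriptExtensions };

// Build a loose-matching key: drop spaces, underscores and hyphens, fold ASCII case.
std::string canonicalize(const std::string & name) {
    std::string key;
    for (unsigned char c : name) {
        if (c == ' ' || c == '_' || c == '-') continue;
        key.push_back(static_cast<char>(std::tolower(c)));
    }
    return key;
}

// Resolve a full name or alias to its property, or std::nullopt if unknown.
std::optional<Property> lookupProperty(const std::string & name) {
    static const std::unordered_map<std::string, Property> table = {
        {"gc", Property::GeneralCategory},
        {"generalcategory", Property::GeneralCategory},
        {"sc", Property::Script},
        {"script", Property::Script},
        {"scx", Property::ScriptExtensions},
        {"scriptextensions", Property::ScriptExtensions},
    };
    auto it = table.find(canonicalize(name));
    if (it == table.end()) return std::nullopt;
    return it->second;
}
```

With this sketch, `lookupProperty("sc")`, `lookupProperty("Script")` and `lookupProperty("SCRIPT")` all resolve to the same member, while an unknown name yields an empty result.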

## Property Values

For any given property, there is a value of that property for each character in Unicode.
Many properties are enumerated properties: the value of each character
for the property is one of a fixed set of named values. The names of all
values for enumerated properties are found in `PropertyValueAliases.h`, together with aliases for those names when they exist.
For example, the `Script` property has values such as `Latn`, `Grek` and `Cyrl`,
each identifying the script to which a character belongs.

Binary properties are a special case of enumerated
properties, having the values `Y` and `N`, standing for yes and no (true and false).

## Unicode Sets

Parabix describes many of the properties of Unicode characters in terms of UnicodeSet data objects.
A UnicodeSet is a compact bitset representation of the set of all characters having some
property-value combination. For example, a UnicodeSet can represent the set of
Greek characters as the set of all characters whose Script property has the value `Grek`.

UnicodeSet objects provide many of the standard set operations, such as membership tests, union, intersection
and set difference. See `unicode/core/unicode_set.h`.
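
The set operations named above can be sketched with a plain `std::bitset`. This is only an illustration of the operations themselves, assuming one bit per code point: the alias `CodepointSet` and the helper functions are hypothetical and are not the interface declared in `unicode/core/unicode_set.h`, and an uncompressed bitset is far less compact than a UnicodeSet.

```cpp
#include <bitset>
#include <cstdint>

// One bit per code point, U+0000..U+10FFFF. Illustration only (uncompressed).
using CodepointSet = std::bitset<0x110000>;

// Build the set of all code points in the inclusive range [lo, hi].
CodepointSet makeRange(std::uint32_t lo, std::uint32_t hi) {
    CodepointSet s;
    for (std::uint32_t cp = lo; cp <= hi; ++cp) s.set(cp);
    return s;
}

// Membership test.
bool member(const CodepointSet & s, std::uint32_t cp) { return s.test(cp); }

// Union, intersection and set difference expressed as bitwise operations.
CodepointSet setUnion(const CodepointSet & a, const CodepointSet & b) { return a | b; }
CodepointSet setIntersection(const CodepointSet & a, const CodepointSet & b) { return a & b; }
CodepointSet setDifference(const CodepointSet & a, const CodepointSet & b) { return a & ~b; }
```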

Most Unicode properties have predefined UnicodeSet values built in. For example, the UnicodeSet for
Greek characters is defined in `Scripts.h` as follows.

```
/** Code Point Ranges for Grek
[0370, 0373], [0375, 0377], [037a, 037d], [037f, 037f], [0384, 0384],
[0386, 0386], [0388, 038a], [038c, 038c], [038e, 03a1], [03a3, 03e1],
[03f0, 03ff], [1d26, 1d2a], [1d5d, 1d61], [1d66, 1d6a], [1dbf, 1dbf],
[1f00, 1f15], [1f18, 1f1d], [1f20, 1f45], [1f48, 1f4d], [1f50, 1f57],
[1f59, 1f59], [1f5b, 1f5b], [1f5d, 1f5d], [1f5f, 1f7d], [1f80, 1fb4],
[1fb6, 1fc4], [1fc6, 1fd3], [1fd6, 1fdb], [1fdd, 1fef], [1ff2, 1ff4],
[1ff6, 1ffe], [2126, 2126], [ab65, ab65], [10140, 1018e],
[101a0, 101a0], [1d200, 1d245]**/

namespace {
const static UnicodeSet::run_t __grek_Set_runs[] = {
{Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}, {Empty, 201},
{Mixed, 3}, {Empty, 1}, {Mixed, 1}, {Empty, 10}, {Mixed, 1}, {Full, 1},
{Mixed, 2}, {Full, 1}, {Mixed, 3}, {Empty, 9}, {Mixed, 1},
{Empty, 1105}, {Mixed, 1}, {Empty, 686}, {Full, 2}, {Mixed, 2},
{Empty, 1666}, {Full, 2}, {Mixed, 1}, {Empty, 31085}};
const static UnicodeSet::bitquad_t __grek_Set_quads[] = {
0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003, 0x000007c0, 0xe0000000,
0x000007c3, 0x80000000, 0x3f3fffff, 0xaaff3f3f, 0x3fffffff, 0xffdfffff,
0xefcfffdf, 0x7fdcffff, 0x00000040, 0x00000020, 0x00007fff, 0x00000001,
0x0000003f};
}

const static UnicodeSet grek_Set{const_cast<UnicodeSet::run_t *>(__grek_Set_runs), 25, 0,
const_cast<UnicodeSet::bitquad_t *>(__grek_Set_quads), 19, 0};
```

In general, these predefined UnicodeSet values are generated automatically by
scripts written in Python, based on the text files of the Unicode Character Database (UCD).
For example, `Scripts.h` is generated from the UCD file `Scripts.txt`.
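
To make the generated tables easier to read, the following sketch shows one way a membership test could walk them. It rests on an interpretation of the listing rather than on the Parabix sources: each run is assumed to cover a number of consecutive 32-code-point blocks, `Empty` and `Full` blocks are assumed to need no stored bits, and each block of a `Mixed` run is assumed to take the next 32-bit quad, with bit 0 standing for the lowest code point of the block. This reading agrees with the code point ranges in the comment for the quads used here. The types `RunKind` and `Run` and the function `contains` are hypothetical, and only the first four runs and quads of the `Grek` listing are copied, covering U+0000 to U+03FF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical local types mirroring the shape of the generated tables;
// these are not the definitions from unicode_set.h.
enum RunKind { Empty, Mixed, Full };
struct Run { RunKind kind; std::uint32_t length; };

// First four runs and quads of the Grek listing (code points U+0000..U+03FF).
const Run grekRuns[] = {{Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}};
const std::uint32_t grekQuads[] = {0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003};

// Membership test under the assumed layout described in the lead-in.
bool contains(const Run * runs, std::size_t runCount,
              const std::uint32_t * quads, std::uint32_t cp) {
    const std::uint32_t target = cp / 32;  // index of the 32-code-point block
    std::uint32_t block = 0;               // first block covered by the current run
    std::size_t quadIndex = 0;             // quads consumed by earlier Mixed runs
    for (std::size_t i = 0; i < runCount; ++i) {
        const Run & r = runs[i];
        if (target < block + r.length) {
            if (r.kind == Empty) return false;
            if (r.kind == Full) return true;
            // Mixed: pick the quad for this block and test the code point's bit.
            const std::uint32_t quad = quads[quadIndex + (target - block)];
            return ((quad >> (cp % 32)) & 1u) != 0;
        }
        if (r.kind == Mixed) quadIndex += r.length;
        block += r.length;
    }
    return false;  // code point lies beyond the listed runs
}
```

Under these assumptions, `contains(grekRuns, 4, grekQuads, 0x3B1)` is true for GREEK SMALL LETTER ALPHA, while U+0374 and U+03A2, which fall in the gaps between the listed ranges, test false.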

## Unicode Property Objects

The principal means of accessing information associated with a UCD property is through
its `PropertyObject`. There are several subclasses of `PropertyObject`, depending
on the type of property: binary, enumerated, numeric or string.
See `PropertyObjects.h` for the definitions of property object types.
See `PropertyObjectTable.h` for the actual definition of property objects for each UCD property.

For example, the UnicodeSet of Greek characters can be obtained with the
following operation:

```
llvm::cast<UCD::EnumeratedPropertyObject>(UCD::property_object_table[UCD::sc])->GetCodepointSet("Greek");
```

## Unicode Property Resolution

When a regular expression `Name` object refers to a Unicode character property, resolving that name
into its corresponding UnicodeSet may be performed using operations defined in `re/unicode/resolve_properties.h`.

The `UnicodePropertyKernelBuilder` in `include/kernel/unicode/UCD_property_kernel.h` is useful for constructing property streams
for any of the defined Unicode properties. An example of its use for counting
occurrences of a property within a file may be found in the `ucount` utility;
see `tools/wc/ucount.cpp`.

## Grapheme Cluster Boundaries

Grapheme clusters are sequences of Unicode codepoints that are generally considered together to represent one logical character. For example, a base character such as the letter `a` may be followed by a combining accent such as U+0301 (combining acute accent) to produce the accented character `á`. The task of separating a stream of characters into grapheme clusters is a text segmentation problem known as the grapheme cluster boundary problem. The full Unicode rules for this are documented in
[UAX #29: Unicode Text Segmentation](https://unicode.org/reports/tr29/).

The logic for computing grapheme cluster boundaries with Parabix methods is illustrated by the `gcount` utility; see `tools/wc/gcount.cpp`.
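
As a rough illustration of the base-plus-accent case described above, the following sketch counts clusters in a sequence of code points using a single simplified rule: no boundary is placed before an extending character. It is not the Parabix method used by `gcount` and covers only a small fragment of UAX #29. The function names are hypothetical, and treating only U+0300 to U+036F (the Combining Diacritical Marks block) as extending characters is an assumption made to keep the example short; Hangul syllables, emoji sequences, CR LF and the other rules are ignored.

```cpp
#include <cstddef>
#include <string>

// Assumption: only U+0300..U+036F count as extending characters here.
// The extending class used by UAX #29 is much larger.
bool isExtend(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Count clusters: a new cluster starts at the beginning of the text and
// before every code point that is not an extending character.
std::size_t countClusters(const std::u32string & text) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i == 0 || !isExtend(text[i])) ++count;
    }
    return count;
}
```

For instance, the three code points of `U"a\u0301b"` form two clusters under this rule, because the combining acute accent attaches to the preceding `a`.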