|
|
# UCD: The Unicode Database
|
|
|
|
|
|
The Unicode Consortium defines a database of character properties for Unicode characters, as documented in [UAX #44: Unicode Character Database](https://unicode.org/reports/tr44/).
|
... | ... | @@ -96,4 +95,4 @@ see `tools/wc/ucount.cpp`. |
|
|
Grapheme clusters are sequences of Unicode codepoints that are generally considered together to represent one logical character. For example, a base character such as the letter `a` may be followed by a combining accent such as U+0301 (combining acute accent) to produce the accented character `á`. The task of separating a stream of characters into grapheme clusters is a text segmentation problem known as the grapheme cluster boundary problem. The full Unicode rules for this are documented in [UAX #29: Unicode Text Segmentation](https://unicode.org/reports/tr29/).
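To make the difference between codepoints and grapheme clusters concrete, here is a minimal sketch in plain C++. It is not the Parabix method and not the full UAX #29 algorithm: it assumes well-formed UTF-8 input and treats only the Combining Diacritical Marks block (U+0300 to U+036F) as cluster-extending. The function names `decode_utf8`, `is_extend_simplified`, and `count_clusters_simplified` are hypothetical and chosen for illustration only.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 into codepoints. Assumption: the input is well-formed;
// no validation of continuation bytes or overlong forms is performed.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        if (b < 0x80)      { len = 1; cp = b; }         // 0xxxxxxx
        else if (b < 0xE0) { len = 2; cp = b & 0x1F; }  // 110xxxxx
        else if (b < 0xF0) { len = 3; cp = b & 0x0F; }  // 1110xxxx
        else               { len = 4; cp = b & 0x07; }  // 11110xxx
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Simplifying assumption: only U+0300..U+036F extends a cluster.
// The real Grapheme_Cluster_Break property covers far more codepoints.
bool is_extend_simplified(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count clusters: every codepoint starts a new cluster unless it is an
// extending mark that follows another codepoint.
std::size_t count_clusters_simplified(const std::vector<char32_t>& cps) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        if (i == 0 || !is_extend_simplified(cps[i])) ++count;
    }
    return count;
}
```

With this sketch, the byte sequence `"a\xCC\x81"` (the letter `a` followed by U+0301) decodes to two codepoints but counts as a single cluster. The full rules in UAX #29 additionally handle Hangul syllables, emoji sequences, regional indicators, and other cases that this simplification ignores.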
|
|
|
|
|
|
The logic for computing grapheme cluster boundaries with Parabix methods is illustrated by the `gcount` utility, see `tools/wc/gcount.cpp`. |
|
|
\ No newline at end of file |
|
|
The logic for computing grapheme cluster boundaries with Parabix methods is illustrated by the [gcount](tools/wc/gcount.cpp) utility. |
|
|
\ No newline at end of file |