The bracket matching problem deals with parentheses (), square brackets [], braces {} and other syntactic elements that provide for nested syntactic structures with balanced sets of delimiters.
To tackle bracket matching with Parabix methods, a key concept is that of a Nesting Depth BixNum: a representation of the nesting depth of syntactic elements at each position in an input stream. The following example shows an input stream with square brackets and braces, the corresponding nesting depth of each element in the data stream, and the nesting depth BixNum (ND) as a set of 3 parallel bit streams (giving a 3-bit number at each character position).
Bracket Stream ...[..{.}..{...[.[]..]..}...]...
Nesting Depth 00011122211222233443332221111000
ND BixNum[2] .................11.............
ND BixNum[1] ......111..111111..111111.......
ND BixNum[0] ...111...11....11..111...1111...
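As a concrete illustration of the concept, the following Python sketch (ours, not Parabix code) computes the nesting depth of the example stream character-at-a-time and decomposes it into the three bit planes shown above. The real Parabix kernel computes these streams in parallel; this sequential model only pins down the intended semantics.

```python
# Sequential reference model (illustration only): compute the nesting
# depth at each position and split it into BixNum bit planes.

def nesting_depth(stream, opens="[{", closes="]}"):
    depths, d = [], 0
    for ch in stream:
        if ch in opens:
            d += 1             # depth increases at an opening delimiter
        depths.append(d)
        if ch in closes:
            d -= 1             # depth decreases after a closing delimiter
    return depths

def bixnum_planes(depths, bits):
    # plane k has a 1 wherever bit k of the depth value is set
    return ["".join("1" if (d >> k) & 1 else "." for d in depths)
            for k in range(bits)]

stream = "...[..{.}..{...[.[]..]..}...]..."
depths = nesting_depth(stream)
planes = bixnum_planes(depths, 3)
print("Nesting Depth", "".join(map(str, depths)))
for k in reversed(range(3)):
    print(f"ND BixNum[{k}]", planes[k])
```

Running this reproduces exactly the Nesting Depth and ND BixNum rows of the example above.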
NestingDepth Kernel
The NestingDepth kernel is a Parabix kernel (to be implemented) that can compute the nesting depth of characters in an input stream, given two character class bit streams: L, identifying all the left or opening delimiters, and R, identifying all the right or closing delimiters. For our example bracket stream above, L and R are shown below.
Bracket Stream ...[..{.}..{...[.[]..]..}...]...
L ...1..1....1...1.1..............
R ........1.........1..1..1...1...
The NestingDepth kernel also takes a parameter MaxDepth identifying the maximum nesting depth expected for an input source. The number of bit streams in the nesting depth BixNum must be equal to ceil(log2(MaxDepth + 1)).
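A sequential sketch of what this kernel computes, assuming L and R are given as marker strings ('1' at a delimiter position, '.' elsewhere): the depth rises at each L position, falls just after each R position, and MaxDepth fixes the BixNum width.

```python
import math

def nesting_depth_from_LR(L, R):
    # L/R are marker strings: '1' at opening/closing delimiter positions.
    depths, d = [], 0
    for l, r in zip(L, R):
        if l == "1":
            d += 1            # an opening delimiter raises the depth
        depths.append(d)
        if r == "1":
            d -= 1            # depth drops after a closing delimiter
    return depths

def nd_bixnum_width(max_depth):
    # number of parallel bit streams needed for depths 0..max_depth
    return max(1, math.ceil(math.log2(max_depth + 1)))

L = "...1..1....1...1.1.............."
R = "........1.........1..1..1...1..."
print("".join(map(str, nesting_depth_from_LR(L, R))))
print(nd_bixnum_width(4))   # the example reaches depth 4, so 3 streams
```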
Syntax Validation Using Nesting Depth
Consider the problem of validating JSON array syntax (where whitespace may be freely included between tokens):
array ::= '[' [value {',' value}] ']'
object ::= '{' [string ':' value {',' string ':' value}] '}'
value ::= object | array | numeral | string | 'true' | 'false' | 'null'
Validating non-nested JSON
A JSON grammar is given by JSON ::= value, meaning that in some instances (numeral | string | 'true' | 'false' | 'null') we do not have nesting. See an example below:
// string.json
"This is valid JSON"
Thus, when we apply the NestingDepth kernel to this stream, we get the output:
Bracket Stream ....................
Nesting Depth 00000000000000000000
ND BixNum[2] ....................
ND BixNum[1] ....................
ND BixNum[0] ....................
Assume that valueToken is a bit stream that marks the end position of any legal JSON numeral, string or keyword; then the example above would look like:
valueToken ...................1
Then we can validate whether we have valid non-nesting values by checking:
otherND = bnc.UGT(ND, 0)
begin = ~Advance(<1>, 1)
firstValue = ScanTo(begin, valueToken)
nonNestingValue = ScanTo(Advance(firstValue, 1), valueToken)
errSimpleValue = nonNestingValue & ~otherND
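To make these stream operations concrete, here is a small Python model of our own (not Parabix code) using arbitrary-precision integers as bit streams, with bit i standing for position i. Advance is a left shift, and ScanThru/ScanTo follow the usual pablo definitions. The first example is the single-value stream above; the second has two top-level values and is flagged as an error.

```python
def Advance(s, n=1):
    return s << n

def ScanThru(c, t):
    # move each cursor in c through the run of positions marked in t
    return (c + t) & ~t

def ScanTo(c, t):
    # move each cursor in c to the next position marked in t
    return ScanThru(c, ~t)

def marks(s):
    # '1'/'.' marker string -> integer bit stream (bit i = position i)
    return sum(1 << i for i, ch in enumerate(s) if ch == "1")

def err_simple_value(valueToken, otherND):
    begin = 1                      # marker at position 0, i.e. ~Advance(<1>, 1)
    firstValue = ScanTo(begin, valueToken)
    nonNestingValue = ScanTo(Advance(firstValue), valueToken)
    return nonNestingValue & ~otherND

# "This is valid JSON": one value token at the final position, depth 0 everywhere.
ok = err_simple_value(marks("...................1"), otherND=0)
# Two top-level values, e.g. `"a" "b"`: a second token at depth 0 is an error.
bad = err_simple_value(marks("..1...1"), otherND=0)
print(ok, bad)
```

The error stream is zero for the valid input and non-zero (a marker at the second value) for the invalid one.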
Validating arrays
Assume we have two bit streams for tokens: valueToken (explained above) and anyToken, a bit stream marking the end position of any legal or illegal token. Assume also that LBrak, RBrak, and Comma are streams marking the positions of JSON '[', ']', and ',' tokens respectively.
Suppose that a BixNum ND has been computed as the nesting depth involving the left and right delimiter sets consisting of square brackets and braces. Then the validation of array syntax at each nesting depth d can be determined as follows.
Each array will be a span from an opening '[' to its closing ']'. Between these brackets will be other tokens as well as nested arrays and objects. However, nested arrays and objects can be easily computed as those spans of elements having a nesting depth greater than d. Using a BixNum compiler bnc, the spans may be computed as bnc.UGT(ND, d). Tokens at depth d can be identified as those at positions identified by the stream bnc.EQ(ND, d).
The following code can be used to validate all arrays at depth d, determining whether there are any errors after the opening LBrak or after any value or Comma.
atDepth = bnc.EQ(ND, d)
nested = bnc.UGT(ND, d)
arrayStart = atDepth & LBrak
arrayEnd = ScanThru(arrayStart, nested | (atDepth & ~ (RBrak | RBrace)))
errorAtEnd = arrayEnd & RBrace
arraySpan = ExclusiveSpan(arrayStart, arrayEnd)
// Now validate that every value or nested item is followed
// either by a Comma or by the closing RBrak.
afterNested = Advance(nested & arraySpan, 1) & atDepth
afterToken = Advance(atDepth & valueToken & arraySpan, 1)
tokenNext = ScanThru(afterNested | afterToken, whitespace)
errAfterValue = tokenNext &~(Comma | RBrak)
//
// Every Comma must be followed by a value
advCommaArrSpan = Advance(Comma & atDepth & arraySpan, 1)
errAfterComma = ScanTo(advCommaArrSpan, anyToken) & ~ (nested | (valueToken & atDepth))
//
// After the LBrak we must have either a value or an RBrak.
scanArrAnyToken = ScanTo(Advance(arrayStart, 1), anyToken)
errAfterLBrak = scanArrAnyToken & ~ (nested | valueToken | RBrak)
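The array-validation sequence can be exercised end to end with an integer bit stream model (a sequential sketch of ours, bit i standing for position i). The stream values below are hand-built for the input `[1, 2]` at depth d = 1, and ExclusiveSpan is assumed to mark the positions strictly between the start and end markers.

```python
def Advance(s, n=1): return s << n
def ScanThru(c, t): return (c + t) & ~t
def ScanTo(c, t): return ScanThru(c, ~t)

def ExclusiveSpan(start, end):
    # assumed semantics: positions strictly between the two markers
    return (end - start) & ~start

def marks(s):
    return sum(1 << i for i, ch in enumerate(s) if ch == "1")

# Hand-built streams for the input `[1, 2]` (positions 0..5) at d = 1.
atDepth    = marks("111111")   # bnc.EQ(ND, 1): everything is at depth 1
nested     = 0                 # bnc.UGT(ND, 1): no deeper nesting
LBrak      = marks("1.....")
RBrak      = marks(".....1")
RBrace     = 0
Comma      = marks("..1...")
whitespace = marks("...1..")
valueToken = marks(".1..1.")   # token-end positions of the numerals 1 and 2
anyToken   = marks("111.11")   # end position of every token

arrayStart = atDepth & LBrak
arrayEnd = ScanThru(arrayStart, nested | (atDepth & ~(RBrak | RBrace)))
errorAtEnd = arrayEnd & RBrace
arraySpan = ExclusiveSpan(arrayStart, arrayEnd)
afterNested = Advance(nested & arraySpan) & atDepth
afterToken = Advance(atDepth & valueToken & arraySpan)
tokenNext = ScanThru(afterNested | afterToken, whitespace)
errAfterValue = tokenNext & ~(Comma | RBrak)
advCommaArrSpan = Advance(Comma & atDepth & arraySpan)
errAfterComma = ScanTo(advCommaArrSpan, anyToken) & ~(nested | (valueToken & atDepth))
scanArrAnyToken = ScanTo(Advance(arrayStart), anyToken)
errAfterLBrak = scanArrAnyToken & ~(nested | valueToken | RBrak)
print(errorAtEnd, errAfterValue, errAfterComma, errAfterLBrak)
```

All four error streams come out zero for this well-formed array.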
Validating objects
Assume we have two bit streams for tokens: valueToken (explained above) and anyToken, a bit stream marking the end position of any legal or illegal token. Assume also that LBrace, RBrace, DQuote, Colon and Comma are streams marking the positions of JSON '{', '}', '"', ':' and ',' tokens respectively.
str = valueToken & DQuote
valueTokenMinusStr = valueToken ^ str
atDepth = bnc.EQ(ND, d)
nested = bnc.UGT(ND, d)
objStart = atDepth & LBrace
objEnd = ScanThru(objStart, nested | (atDepth & ~ (RBrak | RBrace)))
errorAtEnd = objEnd & RBrak
objSpan = ExclusiveSpan(objStart, objEnd)
// Now validate that every value or nested item is followed
// either by a Comma or by the closing RBrace.
afterNested = Advance(nested & objSpan, 1) & atDepth
//
// process all values that are not strings
afterTokenMinusStr = Advance(atDepth & valueTokenMinusStr & objSpan, 1)
tokenNextMinusStr = ScanThru(afterNested | afterTokenMinusStr, whitespace)
errAfterValueMinusStr = tokenNextMinusStr &~(Comma | RBrace)
//
// process strings as both key and value
strAtDepth = atDepth & str
afterTokenStr = Advance(strAtDepth & objSpan, 1)
tokenNextStr = ScanThru(afterNested | afterTokenStr, whitespace)
errAfterValueStr = tokenNextStr &~(Comma | RBrace | Colon)
//
// join errors
errAfterValue = errAfterValueMinusStr | errAfterValueStr
//
// Every Colon must be followed by a value
advColon = Advance(Colon & atDepth & objSpan, 1)
errAfterColon = ScanTo(advColon, anyToken) & ~ (nested | (valueToken & atDepth))
//
// Every Comma must be followed by a key string
advComma = Advance(Comma & atDepth & objSpan, 1)
errAfterComma = ScanTo(advComma, anyToken) & ~ strAtDepth
//
// After the LBrace we must have either a value or an RBrace.
scanObjStartAnyToken = ScanTo(Advance(objStart, 1), anyToken)
errAfterLBrace = scanObjStartAnyToken & ~ (nested | valueToken | RBrace)
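The object-validation sequence can be exercised the same way with an integer bit stream model (again our own sequential sketch, bit i standing for position i). The streams are hand-built for the input `{"a":1}` at depth d = 1; ExclusiveSpan is assumed to mark the positions strictly between the two markers, str is renamed str_ (str is a Python builtin), and the objEnd & RBrak check is named errorAtEnd.

```python
def Advance(s, n=1): return s << n
def ScanThru(c, t): return (c + t) & ~t
def ScanTo(c, t): return ScanThru(c, ~t)

def ExclusiveSpan(start, end):
    # assumed semantics: positions strictly between the two markers
    return (end - start) & ~start

def marks(s):
    return sum(1 << i for i, ch in enumerate(s) if ch == "1")

# Hand-built streams for the input `{"a":1}` (positions 0..6) at d = 1.
atDepth    = marks("1111111")   # bnc.EQ(ND, 1)
nested     = 0                  # bnc.UGT(ND, 1)
LBrace     = marks("1......")
RBrace     = marks("......1")
RBrak      = 0
DQuote     = marks(".1.1...")
Colon      = marks("....1..")
Comma      = 0
whitespace = 0
valueToken = marks("...1.1.")   # token ends: the string "a" and the numeral 1
anyToken   = marks("1..1111")   # end position of every token

str_ = valueToken & DQuote
valueTokenMinusStr = valueToken ^ str_
objStart = atDepth & LBrace
objEnd = ScanThru(objStart, nested | (atDepth & ~(RBrak | RBrace)))
errorAtEnd = objEnd & RBrak
objSpan = ExclusiveSpan(objStart, objEnd)
afterNested = Advance(nested & objSpan) & atDepth
afterTokenMinusStr = Advance(atDepth & valueTokenMinusStr & objSpan)
tokenNextMinusStr = ScanThru(afterNested | afterTokenMinusStr, whitespace)
errAfterValueMinusStr = tokenNextMinusStr & ~(Comma | RBrace)
strAtDepth = atDepth & str_
afterTokenStr = Advance(strAtDepth & objSpan)
tokenNextStr = ScanThru(afterNested | afterTokenStr, whitespace)
errAfterValueStr = tokenNextStr & ~(Comma | RBrace | Colon)
errAfterValue = errAfterValueMinusStr | errAfterValueStr
advColon = Advance(Colon & atDepth & objSpan)
errAfterColon = ScanTo(advColon, anyToken) & ~(nested | (valueToken & atDepth))
advComma = Advance(Comma & atDepth & objSpan)
errAfterComma = ScanTo(advComma, anyToken) & ~strAtDepth
scanObjStartAnyToken = ScanTo(Advance(objStart), anyToken)
errAfterLBrace = scanObjStartAnyToken & ~(nested | valueToken | RBrace)
print(errorAtEnd, errAfterValue, errAfterColon, errAfterComma, errAfterLBrace)
```

All five error streams come out zero for this well-formed object.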
CFG Kernel (note: this doc isn't ready)
The CFG kernel is a Parabix kernel (to be implemented) that can process a context-free grammar given a set of rules with its own syntax (close to EBNF), a BixNum representing the nesting depth of its tokens (see the NestingDepth kernel), and a StreamSet representing all valid markers.
enum Token {
lCurly,
rCurly,
lBracket,
rBracket,
colon,
str,
value
};
// TODO: This is just an initial idea and it wouldn't work. Need to improve that.
CFGRule * valueRule = CFG::match(Token::value);
CFGRule * arrRule = CFG::matchBracket(
Token::lBracket,
{ Token::value },
Token::rBracket
);
CFGRule * objRule = CFG::matchBracket(
Token::lCurly,
{ Token::str, Token::colon, Token::value },
Token::rCurly
);