The bracket matching problem deals with parentheses (), square brackets [], braces {} and other syntactic elements that provide for nested syntactic structures with balanced sets of delimiters.
To tackle bracket matching with Parabix methods, a key concept is that of a Nesting Depth BixNum: a representation of the nesting depth of syntactic elements at each position in an input stream. The following example shows an input stream with square brackets and braces, the corresponding nesting depth of each element in the data stream, and the nesting depth BixNum (ND) as a set of 3 parallel bit streams (giving a 3-bit number at each character position).
Bracket Stream ...[..{.}..{...[.[]..]..}...]...
Nesting Depth 00011122211222233443332221111000
ND BixNum[2] .................11.............
ND BixNum[1] ......111..111111..111111.......
ND BixNum[0] ...111...11....11..111...1111...
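As a concrete illustration of the concept, the following Python sketch (ours, not Parabix code) computes the nesting depth of the example stream character-at-a-time and decomposes it into the three bit planes shown above. The real Parabix kernel computes these streams in parallel; this sequential model only pins down the intended semantics.

```python
# Sequential reference model (illustration only): compute the nesting
# depth at each position and split it into BixNum bit planes.

def nesting_depth(stream, opens="[{", closes="]}"):
    depths, d = [], 0
    for ch in stream:
        if ch in opens:
            d += 1             # depth increases at an opening delimiter
        depths.append(d)
        if ch in closes:
            d -= 1             # depth decreases after a closing delimiter
    return depths

def bixnum_planes(depths, bits):
    # plane k has a 1 wherever bit k of the depth value is set
    return ["".join("1" if (d >> k) & 1 else "." for d in depths)
            for k in range(bits)]

stream = "...[..{.}..{...[.[]..]..}...]..."
depths = nesting_depth(stream)
planes = bixnum_planes(depths, 3)
print("Nesting Depth", "".join(map(str, depths)))
for k in reversed(range(3)):
    print(f"ND BixNum[{k}]", planes[k])
```

Running this reproduces exactly the Nesting Depth and ND BixNum rows of the example above.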
NestingDepth Kernel
The NestingDepth kernel is a Parabix kernel (to be implemented) that can compute the nesting depth of characters in an input stream, given two character class bit streams: L, identifying all the left or opening delimiters, and R, identifying all the right or closing delimiters. For our example bracket stream above, L and R are shown below.
Bracket Stream ...[..{.}..{...[.[]..]..}...]...
L ...1..1....1...1.1..............
R ........1.........1..1..1...1...
The NestingDepth kernel also takes a parameter MaxDepth identifying the maximum nesting depth expected for an input source. The number of bit streams in the nesting depth BixNum must be equal to ceil(log2(MaxDepth + 1)).
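A sequential sketch of what this kernel computes, assuming L and R are given as marker strings ('1' at a delimiter position, '.' elsewhere): the depth rises at each L position, falls just after each R position, and MaxDepth fixes the BixNum width.

```python
import math

def nesting_depth_from_LR(L, R):
    # L/R are marker strings: '1' at opening/closing delimiter positions.
    depths, d = [], 0
    for l, r in zip(L, R):
        if l == "1":
            d += 1            # an opening delimiter raises the depth
        depths.append(d)
        if r == "1":
            d -= 1            # depth drops after a closing delimiter
    return depths

def nd_bixnum_width(max_depth):
    # number of parallel bit streams needed for depths 0..max_depth
    return max(1, math.ceil(math.log2(max_depth + 1)))

L = "...1..1....1...1.1.............."
R = "........1.........1..1..1...1..."
print("".join(map(str, nesting_depth_from_LR(L, R))))
print(nd_bixnum_width(4))   # the example reaches depth 4, so 3 streams
```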
Syntax Validation Using Nesting Depth
Consider the problem of validating JSON array syntax (where whitespace may be freely included between tokens):
array ::= '[' [value {',' value}] ']'
object ::= '{' [string ':' value {',' string ':' value}] '}'
value ::= object | array | numeral | string | 'true' | 'false' | 'null'
Validating non-nested JSON
A JSON grammar is given by JSON ::= value, meaning that in some instances (numeral | string | 'true' | 'false' | 'null') we do not have nesting. See an example below:
// string.json
"This is valid JSON"
Thus, when we apply the NestingDepth kernel to this stream, we get the output:
Bracket Stream ....................
Nesting Depth 00000000000000000000
ND BixNum[2] ....................
ND BixNum[1] ....................
ND BixNum[0] ....................
Assume that valueToken is a bit stream that marks the end position of any legal JSON numeral, string or keyword; then the example above would look like:
valueToken ...................1
Then we can validate whether we have valid non-nesting values by checking:
otherND = bnc.UGT(ND, 0)
begin = ~Advance(<1>, 1)
firstValue = ScanTo(begin, valueToken)
nonNestingValue = ScanTo(Advance(firstValue, 1), valueToken)
errSimpleValue = nonNestingValue & ~otherND
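To make these stream operations concrete, here is a small Python model of our own (not Parabix code) using arbitrary-precision integers as bit streams, with bit i standing for position i. Advance is a left shift, and ScanThru/ScanTo follow the usual pablo definitions. The first example is the single-value stream above; the second has two top-level values and is flagged as an error.

```python
def Advance(s, n=1):
    return s << n

def ScanThru(c, t):
    # move each cursor in c through the run of positions marked in t
    return (c + t) & ~t

def ScanTo(c, t):
    # move each cursor in c to the next position marked in t
    return ScanThru(c, ~t)

def marks(s):
    # '1'/'.' marker string -> integer bit stream (bit i = position i)
    return sum(1 << i for i, ch in enumerate(s) if ch == "1")

def err_simple_value(valueToken, otherND):
    begin = 1                      # marker at position 0, i.e. ~Advance(<1>, 1)
    firstValue = ScanTo(begin, valueToken)
    nonNestingValue = ScanTo(Advance(firstValue), valueToken)
    return nonNestingValue & ~otherND

# "This is valid JSON": one value token at the final position, depth 0 everywhere.
ok = err_simple_value(marks("...................1"), otherND=0)
# Two top-level values, e.g. `"a" "b"`: a second token at depth 0 is an error.
bad = err_simple_value(marks("..1...1"), otherND=0)
print(ok, bad)
```

The error stream is zero for the valid input and non-zero (a marker at the second value) for the invalid one.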
Validating arrays
Assume we have two bit streams for tokens: valueToken (explained above) and anyToken, a bit stream marking the end position of any legal or illegal token. Assume also that LBrak, RBrak, and Comma are streams marking the positions of JSON '[', ']', and ',' tokens respectively.
Suppose that a BixNum ND has been computed as the nesting depth involving the left and right delimiter sets consisting of square brackets and braces. Then the validation of array syntax at each nesting depth d can be determined as follows.
Each array will be a span from an opening '[' to its closing ']'. Between these brackets will be other tokens as well as nested arrays and objects. However, nested arrays and objects can be easily computed as those spans of elements having a nesting depth greater than d. Using a BixNum compiler bnc, the spans may be computed as bnc.UGT(ND, d). Tokens at depth d can be identified as those at positions identified by the stream bnc.EQ(ND, d).
The following code can be used to validate all arrays at depth d, determining whether there are any errors after the opening LBrak or after any value or Comma.
atDepth = bnc.EQ(ND, d)
nested = bnc.UGT(ND, d)
arrayStart = atDepth & LBrak
arrayEnd = ScanThru(arrayStart, nested | (atDepth & ~ (RBrak | RBrace)))
errorAtEnd = arrayEnd & RBrace
arraySpan = ExclusiveSpan(arrayStart, arrayEnd)
// Now validate that every value or nested item is followed
// either by a Comma or by the closing RBrak.
afterNested = Advance(nested & arraySpan, 1) & atDepth
afterToken = Advance(atDepth & valueToken & arraySpan, 1)
tokenNext = ScanThru(afterNested | afterToken, whitespace)
errAfterValue = tokenNext &~(Comma | RBrak)
//
// Every Comma must be followed by a value
advCommaArrSpan = Advance(Comma & atDepth & arraySpan, 1)
errAfterComma = ScanTo(advCommaArrSpan, anyToken) & ~ (nested | (valueToken & atDepth))
//
// After the LBrak we must have either a value or an RBrak.
scanArrAnyToken = ScanTo(Advance(arrayStart, 1), anyToken)
errAfterLBrak = scanArrAnyToken & ~ (nested | valueToken | RBrak)
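The array-validation sequence can be exercised end to end with an integer bit stream model (a sequential sketch of ours, bit i standing for position i). The stream values below are hand-built for the input `[1, 2]` at depth d = 1, and ExclusiveSpan is assumed to mark the positions strictly between the start and end markers.

```python
def Advance(s, n=1): return s << n
def ScanThru(c, t): return (c + t) & ~t
def ScanTo(c, t): return ScanThru(c, ~t)

def ExclusiveSpan(start, end):
    # assumed semantics: positions strictly between the two markers
    return (end - start) & ~start

def marks(s):
    return sum(1 << i for i, ch in enumerate(s) if ch == "1")

# Hand-built streams for the input `[1, 2]` (positions 0..5) at d = 1.
atDepth    = marks("111111")   # bnc.EQ(ND, 1): everything is at depth 1
nested     = 0                 # bnc.UGT(ND, 1): no deeper nesting
LBrak      = marks("1.....")
RBrak      = marks(".....1")
RBrace     = 0
Comma      = marks("..1...")
whitespace = marks("...1..")
valueToken = marks(".1..1.")   # token-end positions of the numerals 1 and 2
anyToken   = marks("111.11")   # end position of every token

arrayStart = atDepth & LBrak
arrayEnd = ScanThru(arrayStart, nested | (atDepth & ~(RBrak | RBrace)))
errorAtEnd = arrayEnd & RBrace
arraySpan = ExclusiveSpan(arrayStart, arrayEnd)
afterNested = Advance(nested & arraySpan) & atDepth
afterToken = Advance(atDepth & valueToken & arraySpan)
tokenNext = ScanThru(afterNested | afterToken, whitespace)
errAfterValue = tokenNext & ~(Comma | RBrak)
advCommaArrSpan = Advance(Comma & atDepth & arraySpan)
errAfterComma = ScanTo(advCommaArrSpan, anyToken) & ~(nested | (valueToken & atDepth))
scanArrAnyToken = ScanTo(Advance(arrayStart), anyToken)
errAfterLBrak = scanArrAnyToken & ~(nested | valueToken | RBrak)
print(errorAtEnd, errAfterValue, errAfterComma, errAfterLBrak)
```

All four error streams come out zero for this well-formed array.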
Validating objects
Assume we have two bit streams for tokens: valueToken (explained above) and anyToken, a bit stream marking the end position of any legal or illegal token. Assume also that LBrace, RBrace, DQuote, Colon and Comma are streams marking the positions of JSON '{', '}', '"', ':' and ',' tokens respectively.
str = valueToken & DQuote
valueTokenMinusStr = valueToken ^ str
atDepth = bnc.EQ(ND, d)
nested = bnc.UGT(ND, d)
objStart = atDepth & LBrace
objEnd = ScanThru(objStart, nested | (atDepth & ~ (RBrak | RBrace)))
errorAtEnd = objEnd & RBrak
objSpan = ExclusiveSpan(objStart, objEnd)
// Now validate that every value or nested item is followed
// either by a Comma or by the closing RBrace.
afterNested = Advance(nested & objSpan, 1) & atDepth
//
// process all values that are not strings
afterTokenMinusStr = Advance(atDepth & valueTokenMinusStr & objSpan, 1)
tokenNextMinusStr = ScanThru(afterNested | afterTokenMinusStr, whitespace)
errAfterValueMinusStr = tokenNextMinusStr &~(Comma | RBrace)
//
// process strings as both key and value
strAtDepth = atDepth & str
afterTokenStr = Advance(strAtDepth & objSpan, 1)
tokenNextStr = ScanThru(afterNested | afterTokenStr, whitespace)
errAfterValueStr = tokenNextStr &~(Comma | RBrace | Colon)
//
// join errors
errAfterValue = errAfterValueMinusStr | errAfterValueStr
//
// Every Colon must be followed by a value
advColon = Advance(Colon & atDepth & objSpan, 1)
errAfterColon = ScanTo(advColon, anyToken) & ~ (nested | (valueToken & atDepth))
//
// Every Comma must be followed by a key string
advComma = Advance(Comma & atDepth & objSpan, 1)
errAfterComma = ScanTo(advComma, anyToken) & ~ strAtDepth
//
// After the LBrace we must have either a value or an RBrace.
scanObjStartAnyToken = ScanTo(Advance(objStart, 1), anyToken)
errAfterLBrace = scanObjStartAnyToken & ~ (nested | valueToken | RBrace)
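The object-validation sequence can be exercised the same way with an integer bit stream model (again our own sequential sketch, bit i standing for position i). The streams are hand-built for the input `{"a":1}` at depth d = 1; ExclusiveSpan is assumed to mark the positions strictly between the two markers, str is renamed str_ (str is a Python builtin), and the objEnd & RBrak check is named errorAtEnd.

```python
def Advance(s, n=1): return s << n
def ScanThru(c, t): return (c + t) & ~t
def ScanTo(c, t): return ScanThru(c, ~t)

def ExclusiveSpan(start, end):
    # assumed semantics: positions strictly between the two markers
    return (end - start) & ~start

def marks(s):
    return sum(1 << i for i, ch in enumerate(s) if ch == "1")

# Hand-built streams for the input `{"a":1}` (positions 0..6) at d = 1.
atDepth    = marks("1111111")   # bnc.EQ(ND, 1)
nested     = 0                  # bnc.UGT(ND, 1)
LBrace     = marks("1......")
RBrace     = marks("......1")
RBrak      = 0
DQuote     = marks(".1.1...")
Colon      = marks("....1..")
Comma      = 0
whitespace = 0
valueToken = marks("...1.1.")   # token ends: the string "a" and the numeral 1
anyToken   = marks("1..1111")   # end position of every token

str_ = valueToken & DQuote
valueTokenMinusStr = valueToken ^ str_
objStart = atDepth & LBrace
objEnd = ScanThru(objStart, nested | (atDepth & ~(RBrak | RBrace)))
errorAtEnd = objEnd & RBrak
objSpan = ExclusiveSpan(objStart, objEnd)
afterNested = Advance(nested & objSpan) & atDepth
afterTokenMinusStr = Advance(atDepth & valueTokenMinusStr & objSpan)
tokenNextMinusStr = ScanThru(afterNested | afterTokenMinusStr, whitespace)
errAfterValueMinusStr = tokenNextMinusStr & ~(Comma | RBrace)
strAtDepth = atDepth & str_
afterTokenStr = Advance(strAtDepth & objSpan)
tokenNextStr = ScanThru(afterNested | afterTokenStr, whitespace)
errAfterValueStr = tokenNextStr & ~(Comma | RBrace | Colon)
errAfterValue = errAfterValueMinusStr | errAfterValueStr
advColon = Advance(Colon & atDepth & objSpan)
errAfterColon = ScanTo(advColon, anyToken) & ~(nested | (valueToken & atDepth))
advComma = Advance(Comma & atDepth & objSpan)
errAfterComma = ScanTo(advComma, anyToken) & ~strAtDepth
scanObjStartAnyToken = ScanTo(Advance(objStart), anyToken)
errAfterLBrace = scanObjStartAnyToken & ~(nested | valueToken | RBrace)
print(errorAtEnd, errAfterValue, errAfterColon, errAfterComma, errAfterLBrace)
```

All five error streams come out zero for this well-formed object.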
CFG Kernel (note: this doc isn't ready)
The CFG kernel is a Parabix kernel (to be implemented) that can process a context-free grammar given a set of rules with its own syntax (close to EBNF), a BixNum representing the nesting depth of its tokens (see the NestingDepth kernel), and a StreamSet representing all valid markers.
enum Token {
lCurly,
rCurly,
lBracket,
rBracket,
colon,
str,
value
};
// TODO: This is just an initial idea and it wouldn't work. Need to improve that.
CFGRule * valueRule = CFG::match(Token::value);
CFGRule * arrRule = CFG::matchBracket(
Token::lBracket,
{ Token::value },
Token::rBracket
);
CFGRule * objRule = CFG::matchBracket(
Token::lCurly,
{ Token::str, Token::colon, Token::value },
Token::rCurly
);