tags

Writing a Tag Algebra compiler with Coco/R

This week, i was digging back into the Coco/R implemented parser of DekiScript, tracking down a bug, which turned out to not be in the parsing bits at all. It did, however, get me familiarized with Coco/R again. So I thought i’d give myself an hour to implement the parser for my Tag related boolean algebra with Coco/R. If i could pull it off, forget about the regex/state-machine approach i was considering.

Took me about 15 minutes to set up the classes to represent the intermediate AST and another 30 minutes for the grammar in Coco/R ATG format. After that I wrote a couple of unit tests to check that the parsing was right, only to realize that while AND and OR are left-to-right associative, NOT is right-to-left associative. Figuring out how to adjust the grammar for that actually took me another 10-15 minutes. But overall, I hit the one hour goal.

The Syntax Tree

Before tackling the grammar to parse, I needed to define data structures to represent the parsed syntax tree, which I’d later convert to executable code. The syntax is fairly simple:

(foo+bar)|(^foo+baz)

This can be represented by just 4 tree node types (with common parent TagExpression:

  • AndExpression
  • OrExpression
  • NotExpression
  • Tag

The parentheses are just tree structure artifacts, so are not represented in the AST.

The Coco/R grammar

I’ve broken the grammar up to discuss the various parts, but the below sections represent a single ATG file, the Coco/R grammar format.

COMPILER TagAlgebra

  public TagExpression Result = TagExpression.Empty;

The COMPILER defines the entrypoint PRODUCTIONfor the parser. The following lines. until the next grammar definition, are inserted into the generated Parser, and can be used to inject class fields, extra methods, etc. into the Parser source. The only thing I inserted was a field to hold the root of the AST and initialize it to empty.

IGNORECASE

This tells Coco/R that our grammar is case insensitive.

CHARACTERS
  tab = '\t'.
  eol = '\n'.
  cr = '\r'.
  nbsp = '\u00a0'. // 0xA0 = 'unbreakable space'
  shy = '\u00ad'.  // 0xAD = 'soft-hyphen'
  letter = 'a'..'z'.
  digit = '0'..'9'.

CHARACTERS defines characters that the parser should recognize for matches.

TOKENS
  tag =
    letter { letter | digit | "_" | ":" }
    .

The only token in the grammar are tag, which is composed from the characters defined above and extra quoted characters.

IGNORE eol + cr + tab + nbsp + shy

IGNORE tells Coco/R what characters have no meaning in parsing the input.

Next come the PRODUCTIONS, i.e. the meat of the grammar. These are the rules for matching input and converting it into code. Coco/R is an LL(1) parser generator, i.e. the grammar must be parsable from Left to Right with Left-canonical derivations and one look-ahead symbol. We also cannot have a loop in our grammar, i.e all possible branches have to lead to a terminal via a unique set of production matches.

PRODUCTIONS

  TagAlgebra                      (. Result = TagExpression.Empty; .)
    =
    [ BinaryExpr<out Result> ]
    .

The first production, is the entry point which, again, sets the result to an empty AST, since the same instance of the parser can parse multiple expressions. It then specifies 0 or 1 BinaryExpr productions.

  BinaryExpr<out TagExpression expr>  (.  expr = null; TagExpression right = null; .)
    =
    NotExpr<out expr> {               (. bool and = false; .)
      (
        "|"
        | "+"                         (. and = true; .)
      )
      NotExpr<out right>              (. expr = and ? (TagExpression)new AndExpression(expr,right) : new OrExpression(expr,right); .)
    }
    .

BinaryExpr is used for both AND and OR expressions, since both take two sub-expressions and combine them with a single operator. The production specifies a left side of a NotExpr followed by an optional operator and another NotExpr. I.e. should our first match happen to not be a BinaryExpr, the parser can fall through to NotExpr return its result instead, without matching the optional production body.

  NotExpr<out TagExpression expr> (. expr = null; int notCount = 0; .)
    =
    {
      "^"                         (. notCount++; .)
    }
    (
      "(" BinaryExpr<out expr> ")"
      | tag                       (. expr = new Tag(t.val); .)
    )                             (. for(var i=0;i<notCount;i++) expr = new NotExpression(expr); .)
    .

END TagAlgebra.

NotExpr, just like BinaryExpr optionally matches on its operator, requiring only that the end of the operation it matches either a BinaryExpr enclosed by parentheses (i.e. not a circular match back into BinaryExpr, since it requires additional surrounding matches), or it matches a tag, the ultimate terminal in the grammar.

There is one tricky bit in this production, i.e. the NOT operator can match multiple times, which means, we need to accumulate the number of operator matches and build the chain of NOT expressions wrapping the current expression once we know how many, if any, matched.

What to do with the AST

The nice thing with Coco/R is that it adds no runtime dependency at all, building a fully self-contained Scanner and Parser. With these built, it is now possible to take Tag Algebra expressions and turn them into an executable tree of Func calls, as described in “Boolean Algebra for Tag queries“.

The grammar could have been written to accumulate the unique tags and construct the Func tree right away, but the two benefits of going to an AST first, is that a)the AST can easily be rendered back into text form (even placing parentheses properly for expressions that previously had not parentheses), and b) the AST can easily be programatically composed with other expressions, or decomposed into sub-expressions, which can be used for caching and other efficiency operations.

I’ll probably play with Irony next, but the “no runtime dependency” and existing familiarity made Coco/R the winner this time.

By arne on | .net, geek | A comment?
Tags: , , , , , ,

Boolean Algebra for Tag queries

Currently working on a tag query engine and couldn’t find anything all that useful in the published approaches. I want to do arbitrary boolean algebras against a set of tags in a database, which seems to be out of scope of SQL approaches. All the various tagging schemas out there reduce to either just AND or just OR queries, but not complex logic. However, I want to be able to do something like:

(foo+bar)|(foo+baz)|(bar+^baz)

If there is a way to do this with SQL, i’d love to know. But the way i look at it, i really have to fetch all tags for each item and then do apply that formula to the list of tags on the item.

But let’s break down the matching problem itself into something i can execute. Let’s assume I’ve got a simple parser that can turn the above into an AST. Really, i can decompose any variation into three operations, AND(a,b), OR(a,b) and NOT(a). And I can represent those with some simple Func<> definitions:

Func<bool, bool, bool> AND = (a, b) => a && b;
Func<bool, bool, bool> OR = (a, b) => a || b;
Func<bool, bool> NOT = (a) => !a;

Assuming that i have boolean tokens for foo, bar and baz, the expressions becomes:

OR(AND(foo, bar), OR(AND(foo, baz), AND(bar,NOT(baz))))

Now, the above expression can be expressed as a function that takes three booleans describing the presence of the mentioned tags, ignoring any other tags that the item has, returning a boolean indicating a successful match. In C# that expression would look like this:

Func<bool[], bool> f = x => OR(AND(x[0], x[1]), OR(AND(x[0], x[2]), AND(x[1],NOT(x[2]))));

Next we need to generate this boolean map from the list of tags on the item. Assuming that the tag list is a list of strings, we can define an extension methods on IEnumerable<string> to generate the boolean map like this:

public static bool[] GetBinaryMap(this IEnumerable<string> tags, string[] mask) {
    var map = new bool[mask.Length];
    foreach(var x in tags) {
        for(var i = 0; i < mask.Length; i++) {
            if(x == mask[i]) {
                map[i] = true;
            }
        }
    }
    return map;
}

And with this we can define a linq query that will return us all matching items:

var mask = new[] { "foo", "bar", "baz"};
Func<bool[], bool> f = x => OR(AND(x[0], x[1]), OR(AND(x[0], x[2]), AND(x[1],NOT(x[2]))));
var match = from item in items
            where f(item.Tags.GetBinaryMap(mask))
            select item;

Clearly this is isn’t the fastest executing query, since we first had to create our items, each item in which has a collection of tags. But there is a lot of optimizations left on the table here, such as using our tag mask to pre-qualify items, breaking down the AST into sub-matches that could be used against a cache to find items, etc.

But at least we have a fairly simple way to take complex boolean algebra on tags and convert them into something that we can evaluate generically

By arne on | .net, geek | A comment?
Tags: , , , ,