Sunday, October 11, 2015

Filtering using Scala Parser Combinators

Filtering using Scala Parser Combinators


So I got interesting task to enable record filtering according to some rules that we get from configuration represented as text.
well so far we know two things :
1. we need a filter that accepts a record as an input and might return a record as its output :
def filter(record:Record ):Option[Record]
2. we need to get a text with some kind of rule e.g "foo = 'FOO'" , "bar = 8", "buz IS NOT NULL " and AND / OR combinations . where foo/bar/buz are record's attributes
*well I know that "null" is a dirty word in FP but that was the requirement since it was a mix-in with java project.
Following the rule of three so is our todo list :
  1. parse the text
  2. evaluate it's Boolean expression in a form of Abstract Syntax Tree.
  3. filter according the Boolean evaluation.
The complete Source Code can be found here (you are more than welcome to contribute ) Parser Combinators to the rescue
Inspired by this excellent post an introduction to scala parser combinators and by this stackoverflow answer .
let's start writing the preserved words that we will use :
object FilterConstants{
  val And = "AND"
  val OR = "OR"
  val UnaryAnd = "&&"
  val UnaryOr = "||"
  val Equals = "="
  val NotEquals = "!="
  val Greater = ">"
  val Smaller = "<"
  val GreaterOrEqual = ">="
  val SmallerOrEqual = "<="
  val IsNull = "IS NULL"
  val IsNotNull = "IS NOT NULL"
}
and add our AST objects
next let's write the parser it self for that we will need extend JavaTokenParsers and PackratParsers . since JavaTokenParsers enables us to capture text using regex we need to do a little override to inherited regex of the string literal in order to capture single quote text .
object ASTObjects {
  sealed abstract class AST

  sealed abstract class BooleanExpression extends AST

  sealed abstract class Constant extends AST

  case class BooleanOperation(op: String, lhs: BooleanExpression, rhs: BooleanExpression) extends BooleanExpression

  case class Comparison[T <: Constant](lhs: Constant,op: String) extends BooleanExpression

  case class LRComparison (lhs: Variable, op: String, rhs: Constant) extends BooleanExpression

  case class NumericConstant(value: Double) extends Constant

  case object NullConstant extends Constant

  case class DateConstant(value: Date) extends Constant

  case class TimestampConstant(value: Timestamp) extends Constant

  case class StringConstant(value: String) extends Constant

  case object EmptyConstant extends Constant

  case class Variable(name: String) extends Constant {
    def eval(env: Map[String, Constant]) = env(name)

    override def toString = name
  }
}

The Parser

trait ConditionParser extends JavaTokenParsers with PackratParsers {
  import FilterConstants._
  override def stringLiteral: Parser[String] =  """(["'])[^"']*\1""".r
  var  simpleDateFormat:SimpleDateFormat = new SimpleDateFormat("yyyy-mm-dd")

  val d = """'+\d{4}-\d{2}-\d{2}+'""".r // "'2015-03-25'".matches("""'+\d{4}-\d{2}-\d{2}+'""")
  def dateLiteral: Parser[String] = d
  val ts = """'+\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}(\.\d)?+'""".r
  def dateTimeLiteral: Parser[String] = ts

  val booleanOperator : PackratParser[String] = literal(UnaryOr) | literal(UnaryAnd) | literal(And) | literal(OR)
  val comparisonOperator : PackratParser[String] = literal(SmallerOrEqual) | literal(GreaterOrEqual) | literal(Equals) | literal(NotEquals) | literal(Smaller) | literal(Greater)
  val nullCompare : PackratParser[String] = literal(IsNotNull) | literal(IsNull)
  val constant : PackratParser[Constant] = floatingPointNumber ^^ { x => NumericConstant(x.toDouble) } |
    stringLiteral ^^ {s => StringConstant(s.replaceAll("'","")) } |
    dateLiteral ^^ {d => DateConstant(new Date(simpleDateFormat.parse(d).getTime))}  |
    dateTimeLiteral ^^ {d => DateConstant(new Date(simpleDateFormat.parse(d).getTime))}  |
    stringLiteral ^^ {s => StringConstant(s.replaceAll("'","")) }
  val variable = ident ^^ {s=>Variable(s)}

  val comparison : PackratParser[BooleanExpression] = (variable ~ comparisonOperator ~ constant) ^^ { case lhs ~ op ~ rhs => LRComparison(lhs, op, rhs) } |
    (variable ~ nullCompare) ^^ {case lhs ~ nc => Comparison(lhs,nc)}

  lazy val p1 : PackratParser[BooleanExpression] = booleanOperation | comparison
  val booleanOperation : PackratParser[BooleanExpression]= (p1 ~ booleanOperator ~ p1) ^^ { case lhs ~ op ~ rhs => BooleanOperation(op, lhs, rhs) }|
    "(" ~> booleanOperation <~ ")"

  def parse(text:String) :Option[BooleanExpression] = {
    val res = parseAll(p1, text)
    res match {
      case Success(r, n) => Some(r)
      case Failure(msg, n) =>
        println("Failure parsing "+ text+"\nMessage: " +msg)
        None
      case Error(msg, n) =>
        println("Error"+msg)
        None
    }
  }
}
I do not want to get into those funny symbols but just a short intro to symbols I used above:
~ match a successful parser following other successful parser
| match a parser OR other parser
^^ that if parser succeeds it's content will be used in the following function
<~ if the parser on the left succeeds it is ignored and continues with right side
we can write some tests :
class ConditionParserTest extends FlatSpec with ShouldMatchers with ConditionParser{
  "Parser" should "Parse Null comparison" in {

    parse("foo IS NULL") should equal (Some(Comparison(Variable("foo"),IsNull)))
    parse("foo IS NOT NULL") should equal (Some(Comparison(Variable("foo"),IsNotNull)))
  }
  it should " parse left right comparison numeric values" in {
    parse("foo = 8") shouldEqual Some(LRComparison(Variable("foo"),Equals,NumericConstant(8D)))
    parse("foo = 8.5") shouldEqual Some(LRComparison(Variable("foo"),Equals,NumericConstant(8.5)))
    parse("foo = -8")shouldEqual Some(LRComparison(Variable("foo"),Equals,NumericConstant(-8D)))
  }
  it should " parse left right comparison String values" in {
    parse("foo = 'BAR'") shouldEqual Some(LRComparison(Variable("foo"),Equals,StringConstant("BAR")))
    parse("foo != 'BAR'") shouldEqual Some(LRComparison(Variable("foo"),NotEquals,StringConstant("BAR")))
  }
}
so this concludes the Parsing part

Evaluating the Boolean expressions

trait Evaluator{
  import FilterConstants._
  def evaluate(expression:BooleanExpression)(implicit env: Map[String, Constant]) : Boolean = expression match {

    case LRComparison(v:Variable, Equals, c) =>  (v.eval(env),c) match{
      case (StringConstant(l),StringConstant(r)) => l == r
      case (NumericConstant(l),NumericConstant(r)) => l == r
      case (StringConstant(l),NumericConstant(r)) => r == l.toDouble
      case (a,b) =>
        println(s"FAILED to evaluate '==' got $a not equal to $b" )
        false
    }
    case LRComparison(v:Variable, NotEquals, c) =>  (v.eval(env),c) match{
      case (StringConstant(l),StringConstant(r)) => l != r
      case (NumericConstant(l),NumericConstant(r)) => l != r
      case (StringConstant(l),NumericConstant(r)) =>  r != l.toDouble
      case (a,b) =>
        println(s"FAILED to evaluate != got $a not equal to $b" )
        false
    }
    case LRComparison(v:Variable, Greater, c) =>  (v.eval(env),c) match{
      case (NumericConstant(l),NumericConstant(r)) => l > r
      case (StringConstant(l),NumericConstant(r)) => l.toDouble > r
      case (a,b) =>
        println(s"FAILED to evaluate $Greater got $a not equal to $b" )
        false
    }
    case LRComparison(v:Variable, Smaller, c) =>  (v.eval(env),c) match{
      case (NumericConstant(l),NumericConstant(r)) => l < r
      case (StringConstant(l),NumericConstant(r)) => l.toDouble < r
      case (a,b) =>
        println(s"FAILED to evaluate $Smaller got $a not equal to $b" )
        false
    }
    case LRComparison(v:Variable, SmallerOrEqual, c) =>  (v.eval(env),c) match{
      case (NumericConstant(l),NumericConstant(r)) => l <= r
      case (StringConstant(l),NumericConstant(r)) => l.toDouble <= r
      case (a,b) =>
        println(s"FAILED to evaluate $SmallerOrEqual got $a not equal to $b" )
        false
    }
    case LRComparison(v:Variable, GreaterOrEqual, c) =>  (v.eval(env),c) match{
      case (NumericConstant(l),NumericConstant(r)) =>
        l >= r
      case (StringConstant(l),NumericConstant(r)) =>
        l.toDouble >= r
      case (a,b) =>
        println(s"FAILED to evaluate $GreaterOrEqual got $a not equal to $b" )
        false
    }
    case Comparison(v:Variable,IsNotNull) => v.eval(env) match{
      case EmptyConstant => false
      case _ => true
    }
    case Comparison(v:Variable,IsNull) => v.eval(env) match{
      case EmptyConstant => true
      case _ => false
    }

    case BooleanOperation(UnaryOr, a, b) => evaluate(a) || evaluate(b)
    case BooleanOperation(OR, a, b) => evaluate(a) || evaluate(b)
    case BooleanOperation(UnaryAnd, a, b) => evaluate(a) && evaluate(b)
    case BooleanOperation(And, a, b) => evaluate(a) && evaluate(b)
  }
}

Finally our Filter

class ConditionalFilter(ruleText:String) extends ConditionParser with Evaluator{

  val rule = parse(ruleText)
  val fields = rule match {
    case Some(r) => extractFields(r)
    case _ => Nil
  }

  def extractFields(b:BooleanExpression):List[String] = b match {
    case BooleanOperation(_,l,r) => extractFields(l):::extractFields(r)
    case LRComparison(l,_,_) => List(l.toString)
    case Comparison(l,_) => List(l.toString)
    case _ => Nil
  }

  def getMap(record:Record,fields:List[String]):Map[String, Constant] = {
    fields.foldLeft(Map.empty[String, Constant]){(m,f) =>
      m + (f -> {
        Option(record.getFieldValue(f)) match {
          case Some(v) =>
            StringConstant (v.toString)
          case _ => EmptyConstant
        }
      })
    }
  }

  def filter(record:Record ):Option[Record] = {
    implicit val m = getMap(record,fields)
    if (rule.exists(evaluate)){
      Some(record)
    }
    else
      None
  }

}
Now it is all complete and we can

Test it

class ConditionalFilterTest extends  FlatSpec  with ShouldMatchers{

  def getRecord(fields:Map[String,String]) = Record(fields)
  def getFilter(expr:String ) = new ConditionalFilter(expr)

  "Filter by col_name with rule " should "IS NULL include records with null values " in{
    val f = getFilter("foo IS NULL" )
    val f2 = getFilter("foo IS NOT NULL" )
    val r = getRecord(Map("foo"-> null,"bar"-> "5.2"))
    f.filter(r) should equal( Some(r))
    f2.filter(r) should equal( None)

  }
  it should "IS NOT NULL remove records with null values " in{
    val f = getFilter("foo IS NOT NULL" )
    val f2 = getFilter("foo IS NULL" )
    val r = getRecord(Map("foo"-> "foo","bar"-> "5.2"))
    f.filter(r) should equal( Some(r))
    f2.filter(r) should equal( None)

  }
  it should "= output records that equal the value" in {
    val f = getFilter("foo = 'FOO'" )
    val f2 = getFilter("foo = 'FOOx'" )
    val r = getRecord(Map("foo"-> "FOO","bar"-> "5.2"))
    f.filter(r) should equal( Some(r))
    f2.filter(r) should equal( None)
  }
}
more tests and updates can be found in the source code here .
In my next post we I will extend the Expression ADT using Type classes and Implicits .

contributes/feedbacks/issues are welcome 
Cheers 
Avi

No comments:

Post a Comment