What I Learnt Last Week

Saturday, February 11, 2006

Unit Testing with Excel and Jython

Something I've done recently that has worked out very well is using Excel and Jython to do some unit testing.

What I Did
The work I've been doing lately is for a data warehouse. If you've never worked on a data warehouse, take it from me that they are very hard. If you aren't doing sufficient testing of your ETLs (Extraction, Transform, Load) and making sure you get them right, there will be a lot of problems showing up later down the path that will be hard to diagnose, and take a lot of effort to solve.

Excel
The business rules for the ETLs were documented in Excel files. There are over 3000 rules for the system. The documentation is actually very good in that it is accompanied by an example source flat-file and gives:

  • the full business rule

  • which part of the rule is being tested

  • source field

  • target field

  • expected value of target field

  • which row of the input file has the data to be tested


A snippet of the file looks a little like:






Full RuleTestSource FieldTarget FieldExpected ResultRow
If len ProdCode > 10
  If ProdCode starts with '1234'
    ProdGrp=ProdCode(5,6)
  Else If ProdCode starts with '678'
    ProdGrp=ProdCode(4,5)
Else ProdGrp=ProdCode(1,2)
len > 10 & '1234'...ProdCodeProdGrpProdCode(5,6)1
...len > 10 & '678'...ProdCodeProdGroupProdCode(4,5)2
...length ? 10ProdCodeProdGrpProdCode(1,2)3


Source File
The source file was a plain ASCII text file that looks like:
2005-03-17 14:35:22|Some Field||Other Field|12341234567890|More Fields
2005-03-17 14:37:17|Some Field||Other Field|6780987654321|More Fields
2005-03-17 14:40:02|Some Field||Other Field|7678|More Fields
Testing
Ideally, it would be great to run tests straight off the excel file. How good would that be? I write one script that parses the excel file and executes tests to compare the DB with the input file; if the rules change then I don't have to anything! That sounds good to me.

I probably could have come close to doing this, but some of the tests had quite complicated rules and I think it would have taken a lot of effort and it would all be very brittle. I don't want to waste effort writing some human language parser, and I need to have confidence that I know what my tests are testing.

I wrote a script that parses the excel file and spits out a Python test script that I can then ammend and add extra logic to cases that require it. Python's inbuilt unit-testing framework and great text file manipulation syntax make it a great choice for this task.

Working backwards a little, (which is not a bad way to work unless you trip on things that people have left on the floor) I thought that I would like my tests to all look something like this:
    assertDBColEquals(dbColName, expected, row)

DBs and Excel
I tried to find a simple library that allowed me to connect to Oracle from Python, but they all seemed to require me to have the full Oracle Client already installed. Last time I installed the Oracle client, it was over 400Mb! The Oracle JDBC classes12.jar file on the other hand is about 1.5Mb. Jakarta POI makes it easy to read Excel files - so hello Jython.

My first Jython file parses the Excel file and spits out text that is then appended to the Jython test file. The parser looks a little like this:
from org.apache.poi.poifs.filesystem import POIFSFileSystem
from org.apache.poi.hssf.usermodel import HSSFWorkbook
from org.apache.poi.hssf.usermodel import HSSFSheet
from org.apache.poi.hssf.usermodel import HSSFCell
from java.io import FileInputStream
import sys

srcColNumMap = {'ProdCode': 5}

def getCellValue(cell):
if HSSFCell.CELL_TYPE_NUMERIC == cell.getCellType():
return '%d' % cell.getNumericCellValue()
else:
return cell.getStringCellValue().strip()

fs = POIFSFileSystem(FileInputStream(sys.argv[1]))
wb = HSSFWorkbook(fs)sheet = wb.getSheetAt(0)
for i in range(1, row.getLastCellNum()):
xlRow = sheet.getRow(i-1);
src = getCellValue(xlRow.getCell(2))
target = getCellValue(xlRow.getCell(3))
expected = getCellValue(xlRow.getCell(4))
row = getCellValue(xlRow.getCell(5))
print """
def test_%03d%s__%sRow%s(self):
row=%s
self.assertColEquals('%s', TestClass.infile[row][%d], row)
""" % (i, src, target, row, row, target, srcColNumMap[src])


I actually made it more complicated than this. I allocated some special meanings to columns 6 and 7 in the Excel file and did whatever simple things I could to allow me to keep as much logic in the Excel sheet as I could and make the parser do all the mundane work. I actually put Python code directly into the Excel file which is then cut and paste verbatim into the test script too.

The output from the parser looked like:
    def test_001_ProdCode__ProdGrpRow1(self):
row=1
self.assertColEquals('ProdGrp', TestClass.infile[row][5][4:6], row)

def test_002_ProdCode__ProdGrpRow2(self):
row=2
self.assertColEquals('ProdGrp', TestClass.infile[row][5][3:5], row)

def test_003_ProdCode__ProdGrpRow3(self):
row=3
self.assertColEquals('ProdGrp', TestClass.infile[row][5][:2], row)


Now, I only needed to write a class named 'TestClass', define a variable 'infile' that contains each row and column of the input file and define a method 'assertColEquals' that does what you think it does.

That class looks like:
import unittest
from com.ziclix.python.sql import zxJDBC

class TestClass(unittest.TestCase):
d, u, p, v = "jdbc:oracle:thin:@192.168.1.1:1521:db", "user", "pwd", "oracle.jdbc.driver.OracleDriver"
db = zxJDBC.connect(d, u, p, v)
c = db.cursor()
infile = ['dummy']
for l in open('source.txt').readlines():
infile.append(['dummy'] + l.strip().split('|'))

def getSqlResult(self, sql):
TestClass.c.execute(sql)
return TestClass.c.fetchall()

def sqlWhereRow(self, row):
return "EVENT_DATE=to_date('%s', 'YYYY-MM-DD HH24:MI:SS" % TestClass.infile[row][1]

def assertColEquals(self, col, expected, row):
sql = "select %s from TEST_TABLE \nwhere %s" % (col, self.sqlWhereRow(row))
rs = getSqlResults(sql)
if rs == None or len(rs) != 1:
print 'Bad sql for row %d:\n%s\nGot RS: %s' % (row, sql, rs)
self.failUnlessEqual(1, len(rs))
self.failUnlessEqual(expected, rs[0][0])


Now, I run my parser over the 3000 lines of Excel tests and append the output to my TestClass file and I have written 3000 unit tests pretty fast.

What I Learnt
This section is coming soon. Stuff about how JDBC is pretty good. Jython is really good. Dynamic languages for testing are excellent. Oh yeah - Excel for data entry is the best. You get the business people to define everything you need in Excel and then you turn it into code. Even if I had to define all my test cases myself, I would still use Excel. I've seen people use XML to define heaps of tests like these and then write a parser for the Excel that executes each of the tests - Arrgggh. I hate typing XML. I hate reading XML. I hate how everyone wants to use XML for heaps of dumb things.