What I Learnt Last Week

Monday, October 26, 2009

New Westpac Logon Userscript

I have updated my westpac logon greasemonkey userscript. It allows you to use normal password entry on the westpac (Australian bank) website. https://online.westpac.com.au/esis/Login/SrvPage

The script can be downloaded at http://userscripts.org/scripts/show/42301

Westpac have changed their logon page in the last few days. This time, the script is actually simpler since the westpac people have created a key padDef variable that holds a 'keys' array and a 'malgm' string.

I don't know why they put customers through so much pain. When they first introduced this, I sent them an email with my original script and suggested that they may want to make it available to fix the problems with their webpage. Noone sent me a response. I'll try again this time.

Saturday, February 11, 2006

Unit Testing with Excel and Jython

Something I've done recently that has worked out very well is using Excel and Jython to do some unit testing.

What I Did
The work I've been doing lately is for a data warehouse. If you've never worked on a data warehouse, take it from me that they are very hard. If you aren't doing sufficient testing of your ETLs (Extraction, Transform, Load) and making sure you get them right, there will be a lot of problems showing up later down the path that will be hard to diagnose, and take a lot of effort to solve.

Excel
The business rules for the ETLs were documented in Excel files. There are over 3000 rules for the system. The documentation is actually very good in that it is accompanied by an example source flat-file and gives:

  • the full business rule

  • which part of the rule is being tested

  • source field

  • target field

  • expected value of target field

  • which row of the input file has the data to be tested


A snippet of the file looks a little like:






Full RuleTestSource FieldTarget FieldExpected ResultRow
If len ProdCode > 10
  If ProdCode starts with '1234'
    ProdGrp=ProdCode(5,6)
  Else If ProdCode starts with '678'
    ProdGrp=ProdCode(4,5)
Else ProdGrp=ProdCode(1,2)
len > 10 & '1234'...ProdCodeProdGrpProdCode(5,6)1
...len > 10 & '678'...ProdCodeProdGroupProdCode(4,5)2
...length ? 10ProdCodeProdGrpProdCode(1,2)3


Source File
The source file was a plain ASCII text file that looks like:
2005-03-17 14:35:22|Some Field||Other Field|12341234567890|More Fields
2005-03-17 14:37:17|Some Field||Other Field|6780987654321|More Fields
2005-03-17 14:40:02|Some Field||Other Field|7678|More Fields
Testing
Ideally, it would be great to run tests straight off the excel file. How good would that be? I write one script that parses the excel file and executes tests to compare the DB with the input file; if the rules change then I don't have to anything! That sounds good to me.

I probably could have come close to doing this, but some of the tests had quite complicated rules and I think it would have taken a lot of effort and it would all be very brittle. I don't want to waste effort writing some human language parser, and I need to have confidence that I know what my tests are testing.

I wrote a script that parses the excel file and spits out a Python test script that I can then ammend and add extra logic to cases that require it. Python's inbuilt unit-testing framework and great text file manipulation syntax make it a great choice for this task.

Working backwards a little, (which is not a bad way to work unless you trip on things that people have left on the floor) I thought that I would like my tests to all look something like this:
    assertDBColEquals(dbColName, expected, row)

DBs and Excel
I tried to find a simple library that allowed me to connect to Oracle from Python, but they all seemed to require me to have the full Oracle Client already installed. Last time I installed the Oracle client, it was over 400Mb! The Oracle JDBC classes12.jar file on the other hand is about 1.5Mb. Jakarta POI makes it easy to read Excel files - so hello Jython.

My first Jython file parses the Excel file and spits out text that is then appended to the Jython test file. The parser looks a little like this:
from org.apache.poi.poifs.filesystem import POIFSFileSystem
from org.apache.poi.hssf.usermodel import HSSFWorkbook
from org.apache.poi.hssf.usermodel import HSSFSheet
from org.apache.poi.hssf.usermodel import HSSFCell
from java.io import FileInputStream
import sys

srcColNumMap = {'ProdCode': 5}

def getCellValue(cell):
if HSSFCell.CELL_TYPE_NUMERIC == cell.getCellType():
return '%d' % cell.getNumericCellValue()
else:
return cell.getStringCellValue().strip()

fs = POIFSFileSystem(FileInputStream(sys.argv[1]))
wb = HSSFWorkbook(fs)sheet = wb.getSheetAt(0)
for i in range(1, row.getLastCellNum()):
xlRow = sheet.getRow(i-1);
src = getCellValue(xlRow.getCell(2))
target = getCellValue(xlRow.getCell(3))
expected = getCellValue(xlRow.getCell(4))
row = getCellValue(xlRow.getCell(5))
print """
def test_%03d%s__%sRow%s(self):
row=%s
self.assertColEquals('%s', TestClass.infile[row][%d], row)
""" % (i, src, target, row, row, target, srcColNumMap[src])


I actually made it more complicated than this. I allocated some special meanings to columns 6 and 7 in the Excel file and did whatever simple things I could to allow me to keep as much logic in the Excel sheet as I could and make the parser do all the mundane work. I actually put Python code directly into the Excel file which is then cut and paste verbatim into the test script too.

The output from the parser looked like:
    def test_001_ProdCode__ProdGrpRow1(self):
row=1
self.assertColEquals('ProdGrp', TestClass.infile[row][5][4:6], row)

def test_002_ProdCode__ProdGrpRow2(self):
row=2
self.assertColEquals('ProdGrp', TestClass.infile[row][5][3:5], row)

def test_003_ProdCode__ProdGrpRow3(self):
row=3
self.assertColEquals('ProdGrp', TestClass.infile[row][5][:2], row)


Now, I only needed to write a class named 'TestClass', define a variable 'infile' that contains each row and column of the input file and define a method 'assertColEquals' that does what you think it does.

That class looks like:
import unittest
from com.ziclix.python.sql import zxJDBC

class TestClass(unittest.TestCase):
d, u, p, v = "jdbc:oracle:thin:@192.168.1.1:1521:db", "user", "pwd", "oracle.jdbc.driver.OracleDriver"
db = zxJDBC.connect(d, u, p, v)
c = db.cursor()
infile = ['dummy']
for l in open('source.txt').readlines():
infile.append(['dummy'] + l.strip().split('|'))

def getSqlResult(self, sql):
TestClass.c.execute(sql)
return TestClass.c.fetchall()

def sqlWhereRow(self, row):
return "EVENT_DATE=to_date('%s', 'YYYY-MM-DD HH24:MI:SS" % TestClass.infile[row][1]

def assertColEquals(self, col, expected, row):
sql = "select %s from TEST_TABLE \nwhere %s" % (col, self.sqlWhereRow(row))
rs = getSqlResults(sql)
if rs == None or len(rs) != 1:
print 'Bad sql for row %d:\n%s\nGot RS: %s' % (row, sql, rs)
self.failUnlessEqual(1, len(rs))
self.failUnlessEqual(expected, rs[0][0])


Now, I run my parser over the 3000 lines of Excel tests and append the output to my TestClass file and I have written 3000 unit tests pretty fast.

What I Learnt
This section is coming soon. Stuff about how JDBC is pretty good. Jython is really good. Dynamic languages for testing are excellent. Oh yeah - Excel for data entry is the best. You get the business people to define everything you need in Excel and then you turn it into code. Even if I had to define all my test cases myself, I would still use Excel. I've seen people use XML to define heaps of tests like these and then write a parser for the Excel that executes each of the tests - Arrgggh. I hate typing XML. I hate reading XML. I hate how everyone wants to use XML for heaps of dumb things.

Saturday, January 28, 2006

Small Python Simple Sudoku Solution

After a comment from Kevin Greer pointing to his JavaScript solution for Sudoku, I've been looking at Sudokus a little more.

What I Did
I realised that my solution only works for simple sudokus and after seeing Kevin's very small solution, I was challenged to come up with a smaller program that works for all sudokus, but doesn't use a brute-force approach. For now, I've found a much smaller solution - the main solution loop is 6 lines - but it still only works for simple sudokus. It is still using the basic approach of using squares that are already solved to reduce the potential solution set of squares in the same row, column and box. It repeats until all squares are solved.

Small Python Solution for Simple Sudokus.
#!/bin/python
import sys

f=file(sys.argv[1])
sudoku = []

def display():
for row in range(9):
if (row in (3,6)): print('-' * 31 + '+' + '-' * 32 + '+' + '-' * 32)
for column in range(9):
if (column in (3,6)): sys.stdout.write(' | ')
sys.stdout.write('%10s' % ''.join(filter(lambda x: x['r'] == row and x['c'] == column, sudoku)[0]['s']))
sys.stdout.write('\n')
sys.stdout.write('\n' * 3)

for row, line in enumerate(f.readlines()):
for column, c in enumerate(line.strip()):
solutionSet = '123456789'
if c in solutionSet: solutionSet = c
# r=row, c=column, b=box, s=solutionSet
sudoku += [{'r': row, 'c': column, 'b': (row / 3) * 3 + (column / 3), 's': solutionSet}]


# while some squares are not yet solved - i.e. len(solnSet) > 1
while (len(filter(lambda x: len(x['s']) > 1, sudoku))):
for grouptype in ['r', 'c', 'b']:
for r in range(9):
for solvedSquare in filter(lambda x: x[grouptype] == r and len(x['s']) == 1, sudoku):
for unsolvedSquare in filter(lambda x: x[grouptype] == r and len(x['s']) > 1, sudoku):
unsolvedSquare['s'] = unsolvedSquare['s'].replace(solvedSquare['s'], '')

display()


Kevin's solution seems to use a brute-force approach. To be honest, I can't actually figure out how his program works exactly - it is quite cryptic, or I'm just too dumb. It does work for difficult Sudokus, but I gave up after 3 minutes waiting for it to solve what is claimed to be the hardest Sudoku (below) that I found via Kevin's link to lambda-the-ultimate. In fairness, I think my solution above is quite cryptic and I don't know if I would follow it if I hadn't written it myself.

Hardest Sudoku:
 1 0 0 | 0 0 2 | 0 0 0
0 0 0 | 7 4 0 | 5 0 0
0 0 0 | 0 0 0 | 0 0 4
-------+-------+-------
0 3 0 | 7 5 0 | 0 0 0
0 0 0 | 0 0 0 | 0 0 9
0 0 0 | 0 0 0 | 6 0 0
-------+-------+-------
0 4 0 | 0 0 0 | 0 0 0
0 0 6 | 0 0 0 | 0 0 1
0 0 0 | 0 7 1 | 0 3 0


What I Learnt
filter(lamdba x: ..., ...) Using filter() with a lamdba expression allows you to do a lot in one line. I find it fun to write code like this, but it can be difficult to understand. I do think that less code is generally a good idea, but I would always put code maintainability before brevity. If you can get both, then hooray. If I wanted to make my program more maintainable, I'd maybe add a few more comments, but I would definitely split some of the lines into 2 or more lines and assign some intermediate variables with meaningful names. E.g. rather than:
for solvedSquare in filter(lambda x: x[grouptype] == r and len(x['s']) == 1, sudoku):
for unsolvedSquare in filter(lambda x: x[grouptype] == r and len(x['s']) > 1, sudoku):
unsolvedSquare['s'] = unsolvedSquare['s'].replace(solvedSquare['s'], '')


I would do:
# get all squares in the current row, column, or box that are already solved
# and remove those values from all other squares in the same row, column, box
solvedSquares = filter(lambda x: x[grouptype] == r and len(x['s']) == 1, sudoku)
for solved in solvedSquares:
unsolvedSquares = filter(lambda x: x[grouptype] == r and len(x['s']) > 1, sudoku)
for unsolved in unsolvedSquares:
unsolved['s'] = unsolved['s'].replace(solved['s'], '')
I don't know. Maybe that makes it worse?

It is all about data structures. It is always interesting to see how data structures influence the way you think about a problem and dictate the algorithms that you use. I was actually just reading through the ACM Classics last week. As I recall, I saw Simon Harris write about C. J. Date. While searching to find out more about C. J. Date (who seems to be carrying around some serious issues with how relational databases and SQL have turned out), I came across the ACM Classics. I couldn't find any official list on the ACM website of all the classics, but the ones I found are:
Let me know if you know of any others. I had to go fishing for most of those URLs.

Niklaus Wirth writes about the 8 Queens problem, which funnily enough is equivalent to sudoku and he says:
The success of an algorithm, however, depends almost always on a suitable choice of its data representation in the light of the ease in which this representation allows the necessary operations to be expressed.
In my new solution, I have changed the data structure from a 2-dimensional array, where each element of the array contains a vector of possible solutions for that square to a single list of dictionaries. Each dictionary represents a square on the board. The dictionary contains its row (key 'r'), column (key 'c') and box (key 'b') along with its solution set (key 's') represented as a string. The resulting data structure makes it easy to make the code concise by taking advantage of the previously mentioned filter() and lambda expressions.

It struck me how much my code felt like SQL. E.g. for solvedSquare in filter(lambda x: x[grouptype] == r and len(x['s']) == 1, sudoku) feels very similar to select * from sudoku where grouptype=:r and length(s) = 1. It makes me curious to think of what a sudoku solution in PL/SQL would look like. I think it would probably be smaller than any other language.

Anyway, they're just a few of my thoughts.

Tuesday, January 17, 2006

SMH in single page

A little while ago smh.com.au (Sydney Morning Herald) changed their web format (as did Fairfax's other newspaper, theage.com.au. Now some articles are spread over multiple pages in a cynical move to increase their advertising revenue. Even pressing the 'print' button doesn't display the whole article - grr.

What I Did
It was time to learn just how Greasemonkey works and how I can fix smh to display the whole article on a single page.

Mark Pilgrim has made an excellent site at diveintogreasemonkey.org where I learnt everything I needed to know. I originally thought that I would have to parse a page, look for the link to the next page and use the GM_xmlhttprequest() method to fetch the page and then insert its contents into page 1. When I looked at the html of the page, I noticed that smh actually loads all pages in the first page, but makes them invisible by setting style.display="none" on the containing div element. This is done so that print will work. Print uses a different style sheet that sets style.display="inline". I thought for a second about just writing a new style sheet. I'm sure the Web Developer extension or many others would make it simple to switch a style sheet, but I want to write a greasemonkey script darn it!

The script is really simple. I created a file called fairfaxfix.user.js (all greasemonkey scripts have to have the .user.js extension). I then open the script with Firefox and select Tools > Install This User Script.

// ==UserScript==
// @name Fairfax Fix
// @namespace joelhockey.com
// @description Convert all Fairfax (Sydney Morning Herald, and The Age) articles to be single page and remove some ads
// @include *smh.com.au/*
// @include *theage.com.au/*
// ==/UserScript==

// Al pages exist, they are inside divs with style.display="none".
// This makes it really easy to make everything visible.

pages = 5
if (typeof totalpagespagination != "undefined") {
pages = totalpagespagination
}

for (i = 2; i <= pages; i++) {
e = document.getElementById('contentSwap' + i)
if(e) {
e.style.display = "inline"
}
}

// while we're doing stuff, get rid of some annoying ads
ads = ['adSpotIsland', 'AdPlaceholder-popunder']
for (i in ads) {
e = document.getElementById(ads[i])
if(e) {
e.parentNode.removeChild(e)
}
}


// get rid of pagecount as well, just to use some XPath
es = document.evaluate('//*[@class="pagecount"]',
document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null)

for (var i = 0; i < es.snapshotLength; i++) {
e = es.snapshotItem(i);
e.parentNode.removeChild(e)
}


What I Learnt
Greasemonkey. I guess I've now had an introduction to greasemonkey, and I like it. I'm sure I'll be using it to scratch any web UI itches I get in the future. I have added my SMH single page script to the main greasemonkey script repository at userscripts.org.

XPath. I've seen a bit of XPath, but I've never actually written any before. It was simple for the simple thing that I wanted to do which is a good sign. I should say that I'm not the world's greatest fan of XML. I works for hierarchical data that has to contain unicode, but I see it used in too many places where there are simpler solutions.

One More Thing: I want to get rid of the annoying pop-up that smh puts up on the first page you visit every time you start a new session. I haven't taken much time to look at the code, but if anyone knows how I could change to script to get rid of that, please let me know. I thought Firefox is meant to block pop-ups - how does smh get around this?

Update: I tracked down an iframe with id AdPlaceholder-popunder that loads a javascript page which is doing the pop-up. It is simple to remove this element. I also checked Firefox 1.5 and it is detecting the pop-up and stopping it. My old version isn't quite as smart. I guess I'll have to check if all the extensions I use work OK with version 1.5 and upgrade.

Monday, January 02, 2006

Sudoku solutions in Python and Ruby

Carl and Jane (my younger brother and his wife) gave me a sudoku book for Christmas. Yeah, I know - you wish that your family gave good gifts like mine. Puzzles are always great for holidays, and I've seen Japan's official national sport in the newspaper and most magazines and even in Cedric's weblog for a while but I've never actually tried to do a sudoku. It was fun to do a few, so I thought I could either spend the rest of Christmas finishing the book, or I could write a program to solve them and then never have to worry about sudoku again in my life.

What I did:
Much to my wife's disgust, I pulled out my laptop and took about 2 hours to do a Python version of the solution with Python24.chm as my guiding companion. I've used Python a little bit for the last 2 years, so it was the first choice of language.

The data structure and algorithm that I devised is quite simple.
  1. The data structure (named 's' for sudoku) is a 9x9 array (s = [][]).
  2. The sudoku puzzle is read in from a text file (9 lines, each with 9 chars - either '1'-'9' or '.' - see below for an example)
  3. The puzzle is read in and each element of s is populated to either be a vector of size 1 containing the given value (e.g. v = ['3']), or is set to a vector containing chars '1' - '9' (v = '1,2,3,4,5,6,7,8,9'.split(',')).
  4. The solution is found by successively reducing all vectors down to a single value. The program iterates through all elements of 's'; if the given element is already solved (len(v) == 1), then remove that value (v[0]) from all elements in the same row, column, and 3x3 grid as that given element.


sudoku.txt:
..32.....
..6.7.982
..5.86...
29.8.3.75
1.8...4.9
53.4.1.26
...61.5..
649.3.2..
.....47..


sudoku.py:
#!/bin/python
import sys

f=file('sudoku.txt')
s=[]

def display(p=10, q=10):
for i in range(9):
if (i % 3 == 0): print('-' * 100)
for j in range(9):
if (j % 3 == 0): sys.stdout.write(' | ')
sys.stdout.write('%10s' % ''.join(s[i][j]))
if i==p and j==q: sys.stdout.write('_')
sys.stdout.write('\n')
sys.stdout.write('\n' * 3)

for l in f.readlines():
row = []
s.append(row)
for i in l.strip():
v = '1,2,3,4,5,6,7,8,9'.split(',')
if i in '123456789': v = [i]
row.append(v)

finished = False
while (not finished):
change = False
for i in range(9):
for j in range(9):
if len(s[i][j]) == 1:

# reduce rows and columns
for k in range(9):
if len(s[k][j]) > 1 and s[i][j][0] in s[k][j]:
s[k][j].remove(s[i][j][0])
change = True
for k in range(9):
if len(s[i][k]) > 1 and s[i][j][0] in s[i][k]:
s[i][k].remove(s[i][j][0])
change = True

#reduce small square
for p in range((i/3)*3, (i/3)*3+3):
for q in range((j/3)*3, (j/3)*3+3):
if len(s[p][q]) > 1 and s[i][j][0] in s[p][q]:
s[p][q].remove(s[i][j][0])
change = True

#randomly reduce any v if no change
p,q,v = 10,10,None
if not change:
for i in range(9):
for j in range(9):
if len(s[i][j]) > 1: p,q,v = i,j,[s[i][j][0]]
s[p][q] = v

finished = True
for i in range(9):
for j in range(9):
if len(s[i][j]) > 1:
finished = False

display()


That was kind of fun, but then a few days later I was watching the cricket and it was a bit dull and I thought that I really should try to learn Ruby a bit better so I can fit in with all the cool kids. I've written a few scripts in Ruby, but I don't know it as well as Python. So now with ProgrammingRuby.chm as my guide and the already completed sudoku.py, I wrote a Ruby program to solve sudoku.

sudoku.rb:
#!/bin/ruby

def display(s, p=10, q=10)
for i in 0..8
if (i % 3 == 0) then puts '-' * 100 end
for j in 0..8
if (j % 3 == 0) then print ' | ' end
printf('%10s', s[i][j].join(''))
if i==p and j==q then puts('_') end
end
print "\n"
end
print "\n" * 3
end

s = []
File.open('sudoku.txt').each do |line|
row = []
s.push(row)
line.strip().scan(/./).each do |c|
v = ('1'..'9').to_a
if v.include?(c) then v = [c] end
row.push(v)
end
end

finished = false
while (not finished)
change = false
for i in 0..8
for j in 0..8
if s[i][j].length == 1 then
# reduce rows and columns
for k in 0..8
if s[k][j].length > 1 and s[k][j].include?(s[i][j][0])
s[k][j].delete(s[i][j][0])
change = true
end
if s[i][k].length > 1 and s[i][k].include?(s[i][j][0])
s[i][k].delete(s[i][j][0])
change = true
end
end

#reduce small square
for p in ((i/3)*3)..((i/3)*3+2)
for q in ((j/3)*3)..((j/3)*3+2)
if s[p][q].length > 1 and s[p][q].include?(s[i][j][0])
s[p][q].delete(s[i][j][0])
change = true
end
end
end
end
end
end

#randomly reduce any v if no change
p,q,v = 10,10,nil
if not change
for i in 0..8
for j in 0..8
if s[i][j].length > 1 then p,q,v = i,j,[s[i][j][0]] end
end
end
s[p][q] = v
end

finished = true
for i in 0..8
for j in 0..8
if s[i][j].length > 1 then finished = false end
end
end
end

display(s)

What I learnt:
It didn't take much effort to change the Python solution to Ruby. For this program, I think both languages are an excellent fit. I'm very interested to hear from anyone who has ideas on a better solution algorithm, or a better way to implement the solution in Python or Ruby.

Some specific things that I learnt:

1. Python strings are immutable

I think I already knew this, but I learnt it again. I originally set my solution vector to v = '123456789' since I knew that Python stores strings as an array, and it is less typing than v = ['1', '2', '3', '4', '5', '6', '7', '8', '9']. I expected that I could do something like '123456789'.remove('3'), but alas, I can't. I've seen that there is a MutableString type, but I used v = '1,2,3,4,5,6,7,8,9'.split(',') to create my array instead. I first tried doing '123456789'.split(''), but that didn't do what I wanted. Does anyone know a better way to create an array of chars from a string? Surely there has to be a simple way to do this.

2. Both Python and Ruby do everything they can to stop you using indexes in your loops, and that's a good thing

I never really noticed how annoying it is to use for (int i = 0; i < a.length; i++) or even Java Iterators until I started using Python. Even the approach that I took of creating a Range object and doing for i in range(9) or for i in 0..8 feels wrong. I feel like I should be doing something like for row in s: - I only used 'i' and 'j' because I couldn't think of a better way to associate the sudoku rows and columns without them.

3. Ruby's Range object is great!

Every language should have one of these. I even like how Ruby's Range object is inclusive of each bound and I think especially for beginner programmers, it would be easier to learn and would help avoid the pesky off by 1 errors that are so common when indexing and iterating. I definitely noticed this as a difference between the two languages because when changing for i in range(9) to for i in 0..8, I wasn't just changing syntax, I was changing the actual value '9', to '8'. I considered creating my array in Ruby to be 10x10 and to index it from 1 to 9 as you would in Visual Basic, but some strong force in me would not allow it. Maybe it is the thought that I'm wasting slots in each array, or that Donald Knuth would get angry at me (maybe he wouldn't). I have actually taken the approach of indexing arrays 1-based before and I think it is a great approach in the right place. I guess in this case I wanted to keep the Python and Ruby programs as similar as possible. Going back to the previous point, Python and Ruby both do such a great job of iterating that it practically removes any 0-based or 1-based argument since you rarely need to use your 'i', 'j' or 'k' friends.

4. Python's list.remove(x) method raises a ValueError if x is not in the list.

This is different to Ruby which returns nil if you try to remove an item from a list that doesn't exist (but doesn't throw any Exceptions). Initially this meant I had python code like if len(s[k][j]) > 1 and s[i][j][0] in s[k][j]: s[k][j].remove(s[i][j][0]) and Ruby code that was slightly simpler like if s[k][j].length > 1 then s[k][j].delete(s[i][j][0]) end, but I realised after trying my program with a 'difficult' level sudoku that sometimes the constraints in a sudoku puzzle allow for multiple solutions and my program needed to detect this case and arbitrarily reduce one of the solution vectors to force out a solution. To detect this case, I needed to do the check for inclusion, so the ruby code ended up semantically identical.

Fifteen Twenty What?

When you're working in a network environment with many servers and a maze of firewalls, you quickly learn how to use SSH tunnelling. SSH tunnelling allows you to forward TCP traffic from a specified local port, through the server you are connecting to, to some remote host reachable by the server you are SSHing to. For example, you may have a web server on a remote machine, but your firewall will only allow SSH (port 22) and not HTTP (port 80). To get around this, you could open port 8080 on your local machine and tunnel it through SSH to the remote web server: ssh -L 8080:localhost:80 username@remoteweb.

You can also use the -R option to open a port on the remote server and forward traffic to any specified host. So if you want to open port 80 on some public server and forward that traffic via SSH to an internal web server (that you are currently on), you could do: ssh -R 80:localhost:80 username@external.

This works great for most cases, but I was recently trying to connect to an Oracle database server via SSH remote tunnelling. It didn't work. My connection kept timing out. I forget the exact TNS error that I got, but it was obvious that the network connections were not working.

What I Did:
With the help of ethereal, I quickly learnt a little about Oracle's Transparent Network Substrate (TNS) protocol by looking at the network traffic. Oracle can be configured to use what I think is called Direct Connection Handoff. In this mode, a client creates an initial connection on a specified port (usually 1521) and then during the connection setup, the Oracle server opens a new port (like say, 3064), specifically for the session, and tells the client to connect on it. I saw some info saying that this parameter is set by an entry in listener.ora such as "DIRECT_HANDOFF_TTC_<listener> OFF". I also found out that you can get special proxies to handle Oracle connections using Direct Handoff, but I felt confident that with a little effort, I could manage to get such a proxy going myself.

The network security guys were doing quite a good job of making life difficult for me, and I had to do no less than 3 hops to get from my laptop to the Oracle server that I wanted to connect to. I had an SSH tunnel listening on local port 1521 that forwarded traffic through to intermediate 1 (I1). I then had intermediate 2 tunnelling from port 2521 on I1 and connecting through to the DB server on port 1521. I think I better dance now. I mean, I think I better draw a picture.


So, I jigged the tnsnames.ora file on my laptop to tell it to connect to localhost:1521 rather than DB:1521. Then I opened a second SSH tunnel from my laptop to I1 using SSH -L 1522:localhost:2522 user@I1. The first tunnel would be used for the original connection to port 1521 on the server, and the second tunnel would be used for the session hand-off. I then had to write a proxy. It had to:
  1. Listen in on any connections to DB:1521
  2. Parse the response to get the hand-off port
  3. Open a new SSH session from I1:2522 to DB:
  4. Rewrite the Oracle response to tell the client to connect to localhost:1522 for the session connection
I didn't want to have to write a full-blown program to manage sockets and threads because I knew that all I needed to do is get into the existing streams and put in a simple filter. The unix programming philosophy is based entirely on simple filters and using pipes to join them all together. I knew there had to be a simple tool that allows you to include TCP streams in all of this. With a little searching, I found out about netcat. I checked the cygwin packages, and sure enough it is there, so within about 20 seconds, I had netcat installed and running. I added a few lines of python code, added some named pipes, and I had a solution!

Standard unix pipes are great for chaining processing where the data flows in a single direction, but when you are dealing with TCP streams, they can't handle the 2-way flow of data. This is where named pipes are useful. You can create separate process chains for filtering the streams in each direction (using standard pipes to create each process chain) and use named pipes as the endpoints to link the 2 chains together.

For what I was doing, I created 3 named pipes:
> mkfifo c2s # this is the data from the client to the server
> mkfifo s2p # this is the data from the server to the proxy
> mkfifo p2c # this is the data from the proxy, back to the client

The python proxy script sits between s2p and p2c and rewrites the data, as well as dynamically spawning a new SSH process.

The diagram below attempts to show what is happening. It's a bit crazy, but I can't think of a clearer way to show what is happening (maybe a sequence diagram would work better).



And here is the python script:
#!/bin/python
import sys
import os
import time
import re

m = re.search(r'(.*?)\(HOST=(.*?)\)\(PORT=(.*?)\)', sys.stdin.readline())
os.spawnl(os.P_NOWAIT, '/bin/ssh', 'ssh', '-N', '-R', '2522:%s:%s' % (m.group(2), m.group(3)), 'user@I1')
time.sleep(5)
sys.stdout.write('%s(HOST=localhost)(PORT=1522)' % m.group(1))


What I Learnt:

Oracle TNS has a hand-off mode . This is useful to remember if you ever have network firewall problems connecting to Oracle.

Netcat is cool. I have used a Java utility called TCPReflector for quite a while, and I have often found it useful. I've even modified the code quite a bit to act as a HTTPS tunnelling proxy and a number of other things, but using netcat, named pipes, and a little bit of scripting is better by far. By taking advantage of the unix piping philosophy, it is simple to create a network filter pipeline and use small existing tools to construct exactly what you want very quickly.

Cygwin (specifically Unix pipes) is cool. Cygwin is always the first thing I install on a windows machine. I have always used a windows machine in my day to day work, but when I found out about and started using grep, sed, awk, find and all the others, it was easy to become a unix believer.

Python is a bit tricky to use to spawn other processes. It took a bit of fiddling to get python to spawn the ssh process that I wanted. I think that you just need to add a dummy first parameter when you call os.spawnl(). I'm guessing that it has to do with how programs (such as ssh) read in arguments starting with argv[1] (since argv[0] is normally 'ssh' or whatever the user typed to kick off ssh). Python's spawnl(mode, path, ...) must start the process that you specify and then pass in the arguments you give, with the first argument at argv[0]. If you make your first argument the same as , then things should work OK. Once I got python spawning ssh, I also had to add the -N option to ssh to make it run as a background process. If I didn't add -N, I got message "Pseudo-terminal will not be allocated because stdin is not a terminal.".

Wednesday, March 12, 2003

How big is that image?

This morning, I saw an email on the Australia Java Users Group mailing list asking if anyone knew of any tools to determine the size of a GIF/JPG image.

This sparked my curiosity a little. It turns out that it doesn't take too much to get this info directly from the file.

I'm not sure who actually devised these file formats, but I can say that it was much easier to understand (with the specs that I found) the GIF format, than it was the JPG format. I do know that the EG in JPEG stands for Expert Group. It is no suprise to me that the spec is difficult to understand. I often find that groups do a worse job of developing/documenting technology than a single company or person.

I wonder if the www would have ever got off the ground if Tim Berners-Lee had to work with a committee of others.

Wednesday, March 05, 2003

I love mathematics

Last night I was talking to my father-in-law about the hot water in our shower. Natalie's (my wife) parents are staying with us for a few nights and I told Dad that he should help himself to use the shower. I followed that by saying 'It takes a little while for the hot water to come through though'. Then I told him what I like to do while I'm waiting for the hot water to come through.

I like to set up a few estimates on either:

  • water flow rate

  • length of pipe from shower head to hot water system

  • diameter of piping

  • time until water becomes hot, etc


Then, I either deduce the final variable from my estimates of all the rest, or just see if all my estimates are consistent.

My estimates were that the water flow rate would be about 5 litres per minute. It usually takes about 30 secs for the water to get hot. The pipes would be about 20mm in diameter. The length of pipe from the hot water system to the shower head would be about 30m. When I did the sums, I found that I was fairly close.

This reminds me of one of Edsgar Dijkstra's papers that I read the other month about doing some mental arithmetic. I tried his challenge, but found it too difficult. I guess the water in my shower always gets hot before I'm able to solve the challenge.