Thursday, December 8, 2011

Python Regexes: Named Groups. Cool Bananas

I'm currently writing a Python parser for GenBank files. I know BioPython has one, and it doesn't even suck, but BioPython requires a bunch of C extensions, so I can't go and just ship it with my Python application. So I'm creating a BioPython-compatible API for the classes that I need for antiSMASH, without the dependency tail BioPython forces on me. Having contributed to BioPerl before, I do like to use regular expressions for token-based parsers, especially as I'm not too fond of lexers in Python. Now, a GenBank header is a peculiar thing that stems from the punch card ages, with a fixed-width format. Unfortunately, at some point that format changed, and things moved around. And there's a ton of programs out there that produce GenBank files that are slightly off. So parsing the header using a token-based approach seems like a good thing. Now, let's look at the first line. That has a bunch of interesting information.
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
The 'LOCUS' tag explains what this line is about, then there's the accession number for the sequence, the length of the sequence, the type of the molecule, some classification where the sequence comes from, and the date the sequence last saw a major update. The regex to parse this is pretty straightforward (yeah, right):
LOCUS\s+([\w.]+)\s+(\d+)\sbp\s(ss-|ds-|ms-|\s{3,3})(\S{2,4})\s+(linear\s\s|circular|\s{7,7})\s+(\w{3,3})\s+(\d\d-\w{3,3}-\d{4,4})
Incidentally, this is a great example of why people dislike regular expressions. Now, in both Perl and Python, there's a way to define verbose regular expressions, so you can restate the regular expression as:
LOCUS\s+        # Header line starts with LOCUS tag followed by multiple spaces
(               # accession number regex:
  [\w.]+        # any alphanumeric character or '.'
)
\s+             # skip over whitespace
(               # sequence length
  \d+           # digits only
)\sbp\s         # skip ' bp ' string
(               # single, double or mixed stranded or nothing
  ss-|ds-|ms-|\s\s\s # can be all spaces
)
(               # Molecule type DNA, RNA, rRNA, mRNA, uRNA
  \S+
)
\s+
(               # linear, circular or seven spaces
  linear|circular|\s{7,7}
)\s+
(               # division code, three characters
 \w{3,3}
)
\s+
(               # date, in dd-MMM-yyyy
  \d\d-\w{3,3}-\d{4,4}
)
This is already pretty decent, but Python can do one better. With the current version of the regex, I need to remember that the molecule type is the 4th group, so match.group(4) will be what I'm looking for. Python decided to extend the Perl extension syntax a bit more, to add named groups. With ?P<name> you can name groups, and then call match.group('name') to access them later. So the final version of the parser regex turns into
LOCUS\s+        # Header line starts with LOCUS tag followed by multiple spaces
(?P<accession>  # accession number regex:
  [\w.]+        # any alphanumeric character or '.'
)
\s+             # skip over whitespace
(?P<length>     # sequence length
  \d+           # digits only
)\sbp\s         # skip ' bp ' string
(?P<stranded>   # single, double or mixed stranded or nothing
  ss-|ds-|ms-|\s\s\s # can be all spaces
)
(?P<molecule>   # Molecule type DNA, RNA, rRNA, mRNA, uRNA
  \S+
)
\s+
(?P<formation>  # linear, circular or seven spaces
  linear|circular|\s{7,7}
)\s+
(?P<division>   # division code, three characters
 \w{3,3}
)
\s+
(?P<date>       # date, in dd-MMM-yyyy
  \d\d-\w{3,3}-\d{4,4}
)
and you can use descriptive names to access the group's contents later. Great for making the code more readable. Combined with a bunch of tests, that should stay maintainable.
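To see the named groups in action, here is a minimal sketch that compiles the pattern with re.VERBOSE and runs it against the sample LOCUS line from above. The names LOCUS_RE and parse_locus are made up for this example; they are not part of BioPython or of my parser's actual API.

```python
import re

# Hypothetical names for illustration; the pattern itself is the one above.
LOCUS_RE = re.compile(r"""
    LOCUS\s+                                  # LOCUS tag, then spaces
    (?P<accession> [\w.]+ )                   # accession number
    \s+
    (?P<length> \d+ )\sbp\s                   # sequence length, skip ' bp '
    (?P<stranded> ss-|ds-|ms-|\s\s\s )        # strandedness, may be blank
    (?P<molecule> \S+ )                       # DNA, RNA, rRNA, mRNA, uRNA
    \s+
    (?P<formation> linear|circular|\s{7,7} )  # topology, may be blank
    \s+
    (?P<division> \w{3,3} )                   # three-character division code
    \s+
    (?P<date> \d\d-\w{3,3}-\d{4,4} )          # dd-MMM-yyyy
""", re.VERBOSE)


def parse_locus(line):
    """Return the LOCUS fields as a dict, or None if the line does not match."""
    match = LOCUS_RE.match(line)
    if match is None:
        return None
    # Blank fixed-width fields come back as runs of spaces, so strip them.
    return {name: value.strip() for name, value in match.groupdict().items()}


fields = parse_locus(
    "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
)
print(fields["accession"], fields["length"], fields["molecule"])
```

Using groupdict() means the calling code never has to know the position of a group, so adding or reordering groups in the pattern does not break it.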

Friday, August 12, 2011

From the frontline, day 5

Another day, another piece of testing mayhem. I've completed the 0.1 version of my Flask-Downloader helper class. With this, I could complete my web app. Now, the downloader itself has a bunch of tests to make sure it's working as expected, but I was also going to test the corresponding code paths in the web app's tests.
The user can provide the input either by uploading a file or by giving an accession number. Testing the file uploads was easy, as the Flask test client accepts file-like objects as data input for POST requests. So testing that the app does the right thing is as easy as:
def test_upload(self):
    file_handle = open(tmp_filename)
    data = dict(file=file_handle)
    rv = self.client.post('/upload', data=data)
    assert "upload succeeded" in rv.data
Assuming your upload function listens on '/upload' and returns a page that contains "upload succeeded", of course.
Testing file downloads is a bit more elaborate, because I don't actually want my downloader to connect to the internet during a test run. Minimock to the rescue! I can fake the download helper and create the same kind of output to fool the application code.
from minimock import Mock
from werkzeug.datastructures import FileStorage
def test_download(self):
    data = dict(id="FAKE")
    # now create the fake downloader
    tmp_file = open(tmp_file_path)
    dl.download = Mock('dl.download')
    # the fake returns the same kind of object the real downloader would
    dl.download.mock_returns = FileStorage(stream=tmp_file)
    rv = self.client.post('/download', data=data)
    assert "download succeeded" in rv.data
The same assumptions as in the previous example apply, plus a pre-existing file at tmp_file_path. A StringIO file-like object should do the trick as well.
With all the tests in place and a test coverage of 100%, I declare this campaign a success. I still need to deploy the new web app on my test server instead of the old one, but I'm going to do that next week. I will also continue my war on legacy code, now tackling the pieces that do the actual work. No war is over as quickly as you'd initially hope, after all. Also, I'm pretty sure the 100% code coverage doesn't mean there aren't plenty of places for bugs to hide in, just that at least all of the code is looked at by the interpreter once. Still, it's a good conclusion to a busy week. Testing rocks.

Thursday, August 11, 2011

From the frontline, day 4

Today, I decided to go for the downloader component that can download files on behalf of the users. While looking at how to test this, I actually noticed that the mm_unit functionality has been merged into the minimock package. Sweet.
I wanted to keep this modular, as a downloader sounds like a tool I could use in a couple of projects. So I created a Flask extension. There's a nice wizard script that automates the creation of the boilerplate files. Using the wizard, I created Flask-Downloader. It's pretty straightforward to use. There's a download(url) function that will return a werkzeug.FileStorage instance, just like the Flask upload handler. I'll also add a save(url) function that'll save the URL's contents to a file without returning a file-like object.
Not too much to write about, spent a lot of time researching stuff today. Hope to get done with my changes tomorrow. Let's see how that'll work out.

Wednesday, August 10, 2011

From the frontline, day 3

Today, I decided to go and restructure my webapp into a package as recommended by the Flask "Larger Applications" pattern. Thanks to my existing test suite, the move was quick and pain-free. I had to fix imports in one of the tests, but apart from that, the only thing I had to do was to split up the webapp.py file correctly into webapp/__init__.py, webapp/views.py and webapp/models.py.
I then started playing with implementing the actual functionality for uploading files and creating database entries for the jobs submitted. Took a while to get this right, never done database stuff with Flask before. But again, pretty easy to set up tests for all this. Also, I discovered Flask-Testing, making flask unit testing even more comfy. Just had to fix up the Twill module Flask-Testing comes with to not use the md5 and sha modules, triggering deprecation warnings. Will continue to write the last tests for the job submission form tomorrow, and then see how to deal with making the web app download data from elsewhere for the user.

Tuesday, August 9, 2011

From the frontline, day 2

Looks like my system strikes back. A fitting thing to happen for an episode 2, I guess. Turns out that nosetests and virtualenv need some extra care and feeding when kept together. Installing another copy of nosetests into my virtualenv fixed the test failures I was seeing. Thanks to the folks on #python and #pocoo for pointing me the right way.
Of course this broke the code coverage. Nothing a pip install --upgrade coverage wouldn't fix, though. As an added bonus, the coverage html output now looks much nicer. I guess it was redesigned between whatever my system got and the version pip grabbed.
Of course, after spending quite some time writing tests for my email sending module, the #pocoo folks point me at the already existing Flask-Mail extension, that integrates into the Flask test harness (as in, if you're testing, it won't send email) already. Oh well. Switched, ditched quite some code and corresponding tests. Even less stuff I have to maintain myself.
Unfortunately, Flask-Mail doesn't seem to like it when you switch on app.config['TESTING'] = True after initialization. Fortunately, you can still fiddle with the value used so it doesn't try sending emails, like so:
def setUp(self):
    webapp.app.config['TESTING'] = True
    webapp.mail.suppress = True
The key here is the mail.suppress = True setting. Once that's done, all the testing options work as expected. You can even have a look at the msg objects that would have been sent using the following snippet:
def test_sent_mail(self):
    """Test if emails were generated and sent correctly"""
    with webapp.mail.record_messages() as outbox:
        rv = self.app.post('/send-email',
                 data=dict(message="hello world"))
        assert len(outbox) == 1
        msg = outbox[0]
        assert "hello world" in msg.body
I like it, this really gets all the stuff I do under test in a very straightforward manner.

Monday, August 8, 2011

From the frontline, day 1

I've decided to start the campaign by ditching the existing PHP web app. I lost all confidence in it last Friday, when I found that it only worked by accident, due to a well-placed typo.
As I'm rewriting the web app anyway, I thought I could also ditch PHP altogether. Not that it's necessarily a bad language for web apps, but the rest of the code is Perl or Python, so getting rid of PHP means one less language to get confused by.
Because this is the War on Legacy Code, I'm not going to write untested code in this campaign. So first I need to brush up on my Python unit testing skills. I do have parts of a Python version of the web app already (untested, that won't work later), but it's missing a user feedback form.
I don't want to send an email for every run of the test suite, so I need to mock up smtplib.SMTP. After some web research, I'll be using Ian Bicking's minimock to provide my mock objects. As I don't just run doctests (even though they're pretty cool), I decided to also throw in MiniMockUnit, which makes minimock print the output to a StringIO buffer instead of stdout. That way, you can easily put it in a normal unit test.
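On a current Python, the same idea can be sketched with unittest.mock from the standard library, which I'm swapping in here for minimock and MiniMockUnit. The send_feedback function, the addresses and the host name below are placeholders for illustration, not the web app's real code.

```python
import smtplib
from email.message import EmailMessage
from unittest import mock


def send_feedback(text):
    """Hypothetical feedback mailer; addresses and host are placeholders."""
    msg = EmailMessage()
    msg["From"] = "webapp@example.invalid"
    msg["To"] = "admin@example.invalid"
    msg["Subject"] = "User feedback"
    msg.set_content(text)
    server = smtplib.SMTP("localhost")
    server.send_message(msg)
    server.quit()


def test_send_feedback():
    # Swap the SMTP class for a mock, so no connection is ever opened.
    with mock.patch("smtplib.SMTP") as fake_smtp:
        send_feedback("hello world")
    fake_smtp.assert_called_once_with("localhost")
    server = fake_smtp.return_value
    # The first positional argument of send_message() is the message itself.
    sent = server.send_message.call_args[0][0]
    assert "hello world" in sent.get_content()
    server.quit.assert_called_once_with()


test_send_feedback()
```

The mock records every call made on it, so the test can check both that a message would have been sent and what it contained, without a mail server anywhere in sight.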
I usually run my tests using nosetests. Turns out, nosetests allows me to run both vanilla unit tests and doctests, and it also has a code coverage plugin. Thus,
nosetests -v --with-doctest --with-coverage --cover-html --cover-package="testmodule"
will get the module "testmodule" tested using available unit tests, doctests and the test coverage will be reported in html in the cover/ directory. The --cover-package part seems to be needed to stop the coverage code from trying (and failing) to create coverage information files in the standard lib paths.
To sum up, I didn't actually see much battle.. er.. code today but my arsenal is filled with testing tools, and I'm well prepared to jump into the fray tomorrow. Also, thanks to some help from the folks on #gsoc IRC on freenode, I now have decent syntax highlighting and formatting on my blog, so I might be able to post code samples for real now. Life is good so far.

War on Legacy Code

Following the hallowed US American tradition of declaring war on whatever things you don't like, I've decided to declare war on legacy code this week.
By legacy code, I mostly mean untested code, following the definition in Michael Feathers' book Working Effectively with Legacy Code. That's a great read, by the way. If you're working with old code, you should go read it. I found that I've been doing most of the things mentioned already, but it's nice to see a systematic write-up about it.
My chosen battlefield in this war is the code at my day job, mostly because it's in a much worse shape than the code I deal with in the various Open Source projects I'm involved in. Seeing how my day job code is a Frankenstein's monster of Perl, PHP and Python parts, some of the work will be to get some of the tests done twice. In particular, I really want to get rid of the PHP parts.
I won't delve into the particulars of the code too much, it's published under the GPLv3 if anyone is interested. I will however try to post some daily news from the front lines, with things that I have thought about during that particular day of the battle.

Sunday, June 19, 2011

Geeky stage props made easy

Currently I'm involved in building the stage and props for an amateur theater, the Brechtbau Theater at my university. We're currently preparing for an Agatha Christie play, "And Then There Were None". As a great opportunity, we are able to perform this piece in the city's professional theater, the Landestheater Tübingen (LTT).
Needless to say, we're really excited about this. Of course having to build a stage for a 300 seat theater is a bit different to building a stage for the 80 seat theater we've got at university. Also, we only have about four hours to set up the stage, and after the last night, we have to clear out immediately. While the way to build a modular stage design probably is worth a blog post on its own, I want to talk about the electronics behind stage a bit today.
Without wanting to spoil some of the surprises we have in store for our audience, we're working on a stage design with lots of big cogwheels and other moving parts. We will power some of these with the cheapest and most readily available power source we have available: actors. But some of the stage has parts that are just out of reach, or need to be positioned more precisely. For this, I'm currently planning to use a combination of stepper motors and hobby servos, run by an Arduino Uno.
I'm still heavy in the prototyping stage, but I just wanted to share my discovery on how easy it is to do stuff like this with the Arduino. My current test setup looks a bit like

Using an L293D (left) and an L293NE (right) IC, I'm running two Trinamic bipolar steppers, and I'm also controlling four servo motors. For making all of these move forwards and backwards, I had to write about 50 lines of code, including whitespace and some comments. Arguably, moving a couple of motors forward and backward in a loop isn't that interesting, but the amount of work the Arduino default libraries already take care of is just great.
Next, I'll have to figure out how to build a lamellar transport belt and move it with one of the steppers, while converting the circular movement of the other stepper to a linear movement. Never played with elaborate hardware stuff before, this is fun.

Thursday, February 17, 2011

Packaging python modules, the really easy way

I just had to package up a Python package to make installation of some software we use at work easier. The systems we run here are Suse- and Ubuntu-based, so I got to package RPMs and debs. I did package the odd Perl package and some bioinformatics tool before, but I haven't looked at it for a while, so I was pretty rusty. I use the OpenSuse Build Service to build RPMs, and the nice folks in FreeNode's #opensuse-buildservice pointed me to py2pack, a truly amazing piece of software that creates a .spec file for you from package information in PyPI. Now, I had to package pysvn, which happens to not link to the downloadable file from PyPI. As it turns out, this just stopped me from using py2pack for downloading the file; py2pack generate works fine. Filling in the missing License and Description information was easy, and my package was building on OBS in under five minutes. For Ubuntu, the Python packaging guide was a bit less comfortable, but a short while afterwards my Ubuntu package was built on Launchpad as well. Life is good. At least so far, now I get to make all of that work for system #3 in the department. This happens to be an OS sold by a company from the northern west coast of the USA, and it doesn't have an execute bit and can't deal with hashbang lines. Oh well, two out of three isn't too bad.

Friday, January 21, 2011

Defensive programming to the rescue

I've just fixed a bug in a web app that I'm working on with a couple of colleagues, a bug that is a great example of a hard to find, easy to fix bug that could have been avoided by some defensive programming. I thought I'd share this titbit to illustrate why defensive programming is a good idea.
The application we're working on consists of three parts:
  • a web front-end that accepts jobs (php based)
  • a database keeping the job queue (SQL)
  • the back-end that runs the jobs (python)
The job is actually an external script, which is important for this bug as well.
On the web front-end, the user gets to choose a couple of options for the job using check-boxes.
The bug we were seeing was that if the user just clicked one checkbox (and that wasn't the "all" checkbox), the external script would die with an "invalid options" error.
The options are passed into the SQL database as a comma-separated string, and our first suspicion was that for single options, there was a trailing comma left. A quick look at the database dump showed that this was not the case.
The next idea was that the back-end was not constructing the command line for the external script correctly. The code looked sane, but just to be sure I decided to do some printf debugging. In the printout of the command line, I finally found the bug. It was in the web front-end after all. Let's look at the code (changed a bit for brevity).
// 1 is "all"
if($_POST["1"] == "on" ){
  $options = "1";
} else {
  for ($i = 2; $i <= 10; $i++) {
      if($_POST["$i"] == "on") {
          $options .= " " . $i . ",";
      }
  }
  $suffix = strripos($options, ",");
  $options = substr($options, 0, $suffix);
}
Can you spot the problem?
$options .= " " . $i . ",";
is the culprit. It adds a leading white-space to the $options string. Looking at the database dump, that's easy to miss.
Now, why is this a problem? Let's look at the back-end code (changed for clarity again):
job = get_next_job_from_work_queue()
args = ['./do_stuff.py', job.filename]
args += ['--options', job.options != None and \
                      job.options or '1']
subprocess.call(args)
The back-end uses an array for its command line arguments to avoid having to call out to a shell first. This has the side effect that all the arguments are passed to the called script verbatim. Thus, a leading space character is kept and passed to the called application. This is defensive programming fail #1.
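A more defensive back-end could clean up the options string before building the argument list. This is only a sketch of that idea, not the code we ended up with; build_args is a hypothetical helper, and the fallback to '1' mirrors the snippet above.

```python
def build_args(filename, options):
    """Build the argument list for do_stuff.py, cleaning up the options."""
    # Split on commas and strip each piece, so " 7" becomes "7".
    cleaned = [part.strip() for part in (options or "").split(",")]
    # Drop empty pieces, e.g. the one left behind by a trailing comma.
    cleaned = [part for part in cleaned if part]
    # Fall back to "1" ("all") when nothing usable is left.
    return ["./do_stuff.py", filename, "--options", ",".join(cleaned) or "1"]


print(build_args("input.gbk", " 7"))
```

The list returned here can go straight into subprocess.call(), just like the args list above.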
Still, this doesn't explain why the white-space is a problem. For that, we need to look at do_stuff.py (Changed for clarity again).
value = options[options.index(i) + 1]
if i == "--options":
    if "," not in value and value not in ["1","2","3","4","5",\
                                          "6","7","8","9","10"]:
        invalidoptions(i)
Assuming we clicked on the #7 check-box in the front-end, " 7" is passed to do_stuff.py. " 7" is not in ["1","2","3","4","5","6","7","8","9","10"], so we fail. Defensive programming fail #2 is in the value = options[options.index(i) + 1] line. Adding a strip() here would have avoided the bug. This was not found in manual testing, as the shell takes care of stripping the white-space characters for us. Still, a bit of defensive programming would have helped to avoid the issue.
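For illustration, here is roughly what a whitespace-tolerant check could look like. It is a sketch rather than the actual do_stuff.py code: parse_options and VALID_OPTIONS are made-up names, and unlike the original it validates every item of a comma-separated list instead of only single values.

```python
# Hypothetical names; the valid option numbers are "1" to "10" as above.
VALID_OPTIONS = {str(number) for number in range(1, 11)}


def parse_options(value):
    """Return the list of option numbers, or raise ValueError."""
    # strip() each piece so that " 7" and "7" are treated alike.
    parts = [part.strip() for part in value.split(",")]
    invalid = [part for part in parts if part not in VALID_OPTIONS]
    if invalid:
        raise ValueError("invalid options: %r" % invalid)
    return parts


print(parse_options(" 7"))
```

With this in place, the " 7" produced by the front-end is accepted, while something like "11" or an empty string is still rejected.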
If I ever teach a programming course, one of the assignments will be finding and fixing a bug like this.
Update: 2011-08-08 Pretty-print code.

Monday, January 17, 2011

My Samba status 11/1

Hi folks, you will have noticed that I failed to post any of my "On the way to Samba4" reports in December last year. That was because I failed to do any Samba work in December, spending all of my time on work-related things. A few co-workers and I had to rush to get a web server up and running that allows biologists to figure out what secondary metabolites like antibiotics might be produced by their bacterium/fungus. After pulling a 90-hour week to get finished between Christmas and New Year's Eve, I had to take some time off not staring at a computer screen. Batteries recharged now, I'm ready to get into action. Over the weekend, I've been getting the skeleton for some DNS torture tests set up, and I'm hoping to flesh this out a bit more during the week. Cheers, Kai