Blog spam update

Well, two comment spams have made it past the spamquestion plugin. This makes me wonder whether the submissions were done manually or whether the software the spammers use is at least human-assisted. I guess it's also possible that the spam software is so good that it can automatically work out my simpler arithmetic questions.

The web server logs give some clues. There are literally hundreds of obviously automated POST attempts to various pages on my blog. The requests related to the two comments that made it through, however, seem far more human. Here's one example:

68.187.226.250 freshfoo.com - [03/Nov/2007:01:41:57 +0000] "GET /blog/Holland_photos_online.1024px HTTP/1.1" 200 11367 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
68.187.226.250 freshfoo.com - [03/Nov/2007:01:42:06 +0000] "POST /blog/Holland_photos_online#comment_anchor HTTP/1.1" 200 14928 "http://freshfoo.com/blog/Holland_photos_online.1024px" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

These are the only two HTTP requests made for the first spam that made it through; no dumb, repeated automatic requests like some of the other attempts in the logs. Notice how the parent page was visited first and then 9 seconds later the POST was made. That's pretty quick for someone to fill out the form manually but it's possible, especially if the spam body was ready in the clipboard. If their system is partially automated then the short delay is even more plausible.

To test whether some spambots are actually capable of doing simple arithmetic by themselves, I've removed all the addition and subtraction questions from my spamquestion configuration and have added more questions that are harder to answer programmatically. If the spam continues, then I'm going to conclude that there's definitely some human assistance going on. If it stops, then it's more likely that the spambot software was actually able to solve some of my arithmetic questions itself.

Something else I need to look at is short-term blocking of spamming IPs. When examining my logs I found there had been almost 500 comment spam attempts today alone! I'd rather my server wasn't wasting bandwidth on that. Dropping all packets from a spammer's IP for a few hours would slow them right down.
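Something quick and dirty along these lines would probably do the job (a sketch only: the block period is arbitrary, the iptables commands need root, and the log watching that would trigger it is left out):

    import subprocess
    import threading

    BLOCK_SECONDS = 4 * 60 * 60   # drop a spammer's packets for a few hours

    def block_ip(ip):
        """Temporarily drop all packets from the given IP.

        e.g. block_ip("68.187.226.250")
        """
        subprocess.check_call(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"])
        # Remove the rule again once the block period has passed.
        threading.Timer(BLOCK_SECONDS, unblock_ip, args=[ip]).start()

    def unblock_ip(ip):
        subprocess.check_call(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"])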

Fun fun fun...

Comments re-enabled, announcing spamquestion

Due to my recent comment spam issues I've created a new PyBlosxom plugin called spamquestion. It is similar to the existing Magic Word plugin, but instead of using just one question for every comment form on the blog, it randomly selects a question from a larger set of configured questions. This makes it much harder for spammers to get past the comment form using automated software. Unlike CAPTCHA systems, this scheme doesn't disadvantage visually impaired people or those using text-based browsers.
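Configuration is just a set of question and answer pairs in PyBlosxom's config.py, roughly along these lines (the key name and questions here are purely illustrative; the plugin's documentation has the real details):

    # In config.py (key name illustrative only)
    py["spamquestion_questions"] = {
        "What colour is the sky on a clear day?": "blue",
        "What is the fourth word of this question?": "fourth",
        "What is the domain name of this site (without the .com)?": "freshfoo",
    }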

The spamquestion plugin can be downloaded from my Code page.

Comments are now re-enabled on my site, with spamquestion in place. It'll be interesting to see how the scheme holds up. I also plan to install the Akismet plugin as a second line of defense.

Spam: good news and bad news

The bad news...

My blog was hit by a comment spammer last week. Hundreds of entries were made, interestingly focussing on only a few articles (perhaps those with a higher Google ranking?). Running without a CAPTCHA system or similar was good while it lasted. Comments are now disabled until I get around to installing a CAPTCHA-style plugin.

Lazy web: what anti comment-spam technologies do you find work well for you? Is CAPTCHA the best option we have?

The good news...

I started using SpamAssassin for my personal email over a month ago. Having seen the complete ineffectiveness of some anti-spam systems I was fairly pessimistic about how effective it would be. Boy was I wrong. Without any tweaks to the default filtering config (except for ensuring that the latest rules are being used) it stops virtually all spam from reaching my inbox, with zero false positives so far. I get 20-40 spams a day and only 1 or 2 a month make it through to my inbox.

My mail volume is comparatively low so I just set Procmail to invoke SpamAssassin for each inbound message. For higher volume situations something like SA's spamd should probably be used. Using Procmail has the nice benefit of being able to direct spam to a separate folder for later perusal and deletion.

A cron job runs sa-update every night to ensure the latest default checks are being used. This is important; spammers develop new tricks to bypass anti-spam systems all the time.

Currently I have all suspected spam going to a spam folder. However, SA has been so successful that I'm thinking of getting Procmail to automatically delete higher scoring spam and send only the lower scoring spam to the spam folder. Depending on their attitude towards false positives, some people might just delete everything that SA flags as spam. Personally, I'd rather be a bit cautious. Losing real email scares me.
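For the record, the Procmail recipes involved are tiny. Something like the following would cover both approaches (a sketch rather than my exact setup; the folder name and the 10-point threshold are illustrative, and X-Spam-Level carries one asterisk per point of SA's score):

    # ~/.procmailrc (sketch)

    # Pipe every inbound message through SpamAssassin.
    :0fw
    | spamassassin

    # The auto-delete idea: anything scoring 10 or more goes straight to /dev/null.
    # (Commented out for now -- losing real email scares me.)
    #:0
    #* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*
    #/dev/null

    # Everything else SA flags goes to a separate folder for later perusal.
    :0:
    * ^X-Spam-Status: Yes
    spam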

It's so nice when something works beyond expectation.

Announcing IMAPClient 0.3

I've just made a new release of IMAPClient (0.3). Changes are:

  • The distribution has been renamed from imapclient to IMAPClient so as to follow the conventions used by modern Python packages. The Python package name remains imapclient however.
  • Fixed a bug reported by Brian Jackson which meant more complex fetch part selections (e.g. "BODY[HEADER.FIELDS (FROM)]") weren't being handled correctly; see the example below. Thanks Brian!
  • IMAPClient is now distributed using setuptools.

IMAPClient can be installed from PyPI using easy_install IMAPClient or downloaded from my Code page. As always, feedback and patches are most welcome.
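To illustrate, the kind of fetch call affected by Brian's bug looks something like this (the server details are made up and the calls shown are indicative of the API rather than lifted from the docs):

    from imapclient import IMAPClient

    server = IMAPClient('imap.example.com')
    server.login('user', 'password')
    server.select_folder('INBOX')

    # Fetch just the From header of each unseen message using a
    # "complex" part selection.
    uids = server.search(['UNSEEN'])
    response = server.fetch(uids, ['BODY[HEADER.FIELDS (FROM)]'])
    for uid, parts in response.items():
        print uid, parts['BODY[HEADER.FIELDS (FROM)]']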

FuzzyFinder

FuzzyFinder is a useful Vim extension that I've discovered recently (nothing to do with Fuzzyman). It has proven to be a great productivity enhancer, especially when dealing with large codebases with many files.

FuzzyFinder provides a mechanism to search through files on disk or Vim buffers using fuzzy filename matching. When activated it interactively searches the current directory for files matching the name you entered. Matching is very loose, so if for example you enter "abc", you'll get a list of all files matching *a*b*c*. It sounds strange at first but is very effective in practice.
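The matching rule itself is trivial to reproduce outside of Vim; here's a toy Python equivalent of the idea (not FuzzyFinder's actual implementation):

    import fnmatch
    import os

    def fuzzy_match(pattern, names):
        """Match 'abc' as though it were the glob '*a*b*c*', ignoring case."""
        glob = '*' + '*'.join(pattern) + '*'
        return [name for name in names
                if fnmatch.fnmatch(name.lower(), glob.lower())]

    # e.g. 'abc' matches "abc.txt" but also "yet_another_big_class.py"
    print fuzzy_match('abc', os.listdir('.'))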

Here's a screenshot of FuzzyFinder when first activated. A list of all files in the current directory is displayed. The arrow keys can be used to make a selection from the list (useful if you can see what you want). If the list is long, start filtering!

This screenshot shows what happens after a few characters have been entered. The list of available choices is filtered to match. Very powerful.

FuzzyFinder can also do recursive matching using the ** wildcard. This is great for large source code trees.

Converted to PyBlosxom

I've converted the blog part of this site to use PyBlosxom. It used to be powered by Blogger using their FTP-upload-to-your-own-host feature but I found various parts of the Blogger system to be inflexible and buggy.

PyBlosxom is an old-school CGI application that requires a certain amount of expertise to get set up, but it's solid and has a very powerful plugin system. Most of the features on this page, from the archive list to the tag list, are provided via plugins. Because PyBlosxom is written in Python it's easy to extend, tweak or write new plugins.
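To give an idea of the flavour: a plugin is just a Python module dropped into the plugins directory that defines one or more callback functions. A hypothetical (and useless) example, with the callback name and entry keys written from memory rather than the docs:

    # plugins/shouty_titles.py -- hypothetical example plugin
    def cb_story(args):
        """Called as each entry is rendered; upper-case its title."""
        entry = args["entry"]
        entry["title"] = entry["title"].upper()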

I'm not completely happy with the styling of the pages but it will do for now. I also need to import the comments from the old blog. Please let me know if you see any problems.

Introducing imapclient

Today I released the first versions of imapclient, an IMAP4 client library I've been working on. From the README:

imapclient aims to be an easy-to-use, Pythonic and complete IMAP client library with no dependencies outside the Python standard library. Features:
  • Arguments and return values are natural Python types.
  • IMAP server responses are fully parsed and readily useable.
  • IMAP unique message IDs (UIDs) are handled transparently. There is no need to call different methods to use UIDs.
  • Convenience methods are provided for commonly used functionality.
  • Exceptions are raised when errors occur.

imapclient makes IMAP useable from Python with little effort. If you've used the imaplib module from the standard library, you'll know that a lot of supporting code is needed just to get simple things done. I'd wager that everyone who has used imaplib has their own fragile regexes and management code to go with it. I hope that imapclient will put an end to this.
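For example, a session to report on unread mail can be as short as this (the server details are made up and the calls shown are indicative rather than exhaustive; note that there's no response parsing in sight):

    from imapclient import IMAPClient

    server = IMAPClient('imap.example.com')
    server.login('user', 'password')

    # select_folder returns the server's untagged responses as a parsed dict.
    info = server.select_folder('INBOX')
    print '%d messages in INBOX' % info['EXISTS']

    # search returns a plain list of UIDs -- no regexes required.
    for uid in server.search(['UNSEEN']):
        print 'unread message, UID', uid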

imapclient can be downloaded from my Code page. Feedback and patches most welcome!

Time saviour

Virtualisation is so damn useful. I'm currently trying to figure out why the new kernel update at work doesn't boot[1]. Thanks to VMWare, after each failed boot I can quickly revert to the previous state before the update. No messing around trying to restore the old kernel or reinstall the system [2].

  • [1] Python work is much more fun
  • [2] Before anyone says "why don't you just install the old and new kernels side-by-side?": the update is fairly major and involves a bunch of userspace libraries and binaries as well.

Data on the rocks

I was asked to help a friend recently whose hard disk was dying. The system wouldn't boot anymore, with the BIOS reporting disk read errors. Ironically, this started happening the moment after my friend mentioned to his girlfriend that they should "really start doing backups".

Said friend was instructed to buy a new hard disk and an external drive enclosure, and I went around armed with various Linux-based rescue disks and a Windows XP install disk.

Things didn't start out well. I booted using one of the rescue disks and tried mounting the failing hard drive. The mount process hung and dmesg showed the kernel spewing out IDE-related error messages at a great number of Hz. I ended up having to forcibly kill the mount process. Several more attempts failed in a similar way.

I had heard that failing hard disks can sometimes be made to work if cooled down. The theories about why this works seem flimsy (something about contracting the metal inside the drive so that components go back into alignment), but since options were limited at this stage I figured it was worth a try.

The new hard disk went into the computer and the faulty one was installed into the enclosure. I then wrapped the enclosure in several plastic bags and put it into the freezer with the cables hanging out of the door. With fingers crossed I connected the enclosure to the computer and turned everything on. This time I was able to mount the disk from the rescue disk without even a single kernel error. I couldn't believe it! I hurriedly partitioned the new drive and began copying before something went wrong.

There was a lot of data to copy (56GB). This gave plenty of time for worrying thoughts like "what if condensation occurs inside the cold drive and the electronics short out?". Fortunately all went well and every byte was recovered from the failing drive. After a bit more fiddling with boot.ini, NTFS conversion and incorrect file attributes the system was working normally again. Great!

After all this, I did some research to see if other people use this trick with success (one might question why I didn't do this research before I tried it...). It turns out that in most cases it does work. I'd love to hear a solid explanation of why...

cert2rss.py retired

A long time ago, I wrote a little script called cert2rss.py that took the URLs from the US-CERT Summary Bulletin Feed, parsed the HTML summary pages and generated a feed containing each item individually. I needed something like this for work as I need to keep track of security vulnerabilities for all things Linux. The script has been broken for some time because the HTML of the summary pages has changed.

Today I finally got around to looking at the problem and found that the HTML of the summary pages is now so bad that it is very difficult to extract useful data from it. The HTML looks fine when rendered but it is full of incorrectly escaped text, missing tags and bizarre formatting. Whoever is responsible for generating that HTML shouldn't have their job.

So, I'm retiring cert2rss.py. I've updated my Code page to reflect this.

As an alternative, there are some reasonable RSS feeds from the National Vulnerability Database that provide similar functionality.