Monday, October 27, 2008

Mailman Scraper

I wrote a script to scrape messages from mailing-list archives and store them in a MySQL database. It retrieves the message text, author, and date for everything sent to the list.

Code here. It's released under the GPL and requires PHP 5 and MySQL. You'll have to read README.txt for instructions.

Intended for use with a Mailman list, one that looks something like:

Because this just uses regular expressions to scrape the pages, it's kind of fragile. It makes some assumptions about the format of the page and might not work with other versions of Mailman.
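To give a sense of what regex scraping looks like and why it breaks so easily, here is a minimal sketch in Python. It is not the PHP script itself. The sample page is made up to resemble a Mailman archive message page, and the patterns, the `scrape_message` name, and the returned field names are all assumptions for illustration.

```python
import re
from html import unescape

# Made-up markup that resembles a Mailman (Pipermail) message page.
# A real archive may differ, which is exactly the fragility described above.
SAMPLE_PAGE = """<HTML><HEAD><TITLE>[Example] Hello</TITLE></HEAD>
<BODY>
<H1>[Example] Hello</H1>
<B>Jane Doe</B>
<A HREF="mailto:jane%40example.com">jane at example.com</A><BR>
<I>Mon Oct 27 10:15:00 CDT 2008</I>
<!--beginarticle-->
<PRE>Hi all,
is 1 &lt; 2?
</PRE>
<!--endarticle-->
</BODY></HTML>"""

# Each pattern hard-codes an assumption about the page layout:
# the first <B> is the author, the first <I> is the date, and the
# body sits in a <PRE> between the two article marker comments.
AUTHOR_RE = re.compile(r"<B>(.*?)</B>", re.S)
DATE_RE = re.compile(r"<I>(.*?)</I>", re.S)
BODY_RE = re.compile(
    r"<!--beginarticle-->\s*<PRE>(.*?)</PRE>\s*<!--endarticle-->", re.S
)


def scrape_message(page):
    """Return author, date, and body from one message page.

    Raises ValueError when any pattern fails to match, so a layout
    change is reported instead of silently storing empty fields.
    """
    fields = {}
    for name, pattern in (
        ("author", AUTHOR_RE),
        ("date", DATE_RE),
        ("body", BODY_RE),
    ):
        match = pattern.search(page)
        if match is None:
            raise ValueError(f"page layout not recognized: no {name} found")
        # Decode HTML entities such as &lt; and trim surrounding whitespace.
        fields[name] = unescape(match.group(1)).strip()
    return fields


message = scrape_message(SAMPLE_PAGE)
```

Failing loudly on a missed match is a deliberate choice here: if a different Mailman version changes the markup, the scrape stops rather than filling the database with blanks.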

I've found this useful for a few projects. One result is a script that tracks list activity (and shows a flame-meter when the list gets intensely active; the current state is ranked from serene to mayhem). I'm also creating a site that allows searching the list archives, as well as voting on the quality of information and adding tags. Of course, one can also use this to create a Markov model of a list... but I'm not going there.
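A flame-meter like the one mentioned above could be as simple as counting recent messages and mapping the count to a label. The sketch below is in Python and is only one way to do it: the 24-hour window, the thresholds, and the intermediate labels "lively" and "heated" are invented for illustration. Only "serene" and "mayhem" come from the description above.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds: minimum number of messages in the window
# needed to reach each level, in ascending order.
LEVELS = [
    (0, "serene"),
    (5, "lively"),
    (15, "heated"),
    (30, "mayhem"),
]


def flame_level(timestamps, now, window=timedelta(hours=24)):
    """Rank list activity by how many messages fall inside the window."""
    recent = sum(1 for t in timestamps if now - window <= t <= now)
    label = LEVELS[0][1]
    # Keep the label of the highest threshold the count reaches.
    for threshold, name in LEVELS:
        if recent >= threshold:
            label = name
    return label


# Usage: forty messages spaced ten minutes apart, all within the last day.
now = datetime(2008, 10, 27, 12, 0)
busy_day = [now - timedelta(minutes=10 * i) for i in range(40)]
level = flame_level(busy_day, now)
```

The timestamps would come from the date column the scraper stores, parsed into `datetime` values first.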