cElementTree ftw!

Python    2010-01-07

My mission was simple - to crawl all of our artist databases, extract names and email addresses from a couple of different tables, and spin out some custom XML.

The tables - one for complete user profiles and another for users who have signed up to receive emails only - may or may not exist, depending on the artist.

And the XML is to be consumed by an outside marketing vendor, for use in generating email campaigns. (We do some of our own email marketing, but most of our bulk mailing is handled by this third party.) So I got no spec - didn't need one, really - just some sample output.

Writing up a quick XML generator with minidom was a piece of cake.

But then I realized that some of these artists have more than a million records between their combined mailing list and profile tables. We have at least two artists with more than a million user records in each one. (In the end, some of the XML files I'm generating are as large as 40Mb.)

So that's obviously dealing with waaay too much data to keep in memory and throw to a prettily-formatted XML file all at once. It hummed along nicely for some of our artists with, ahem, smaller followings, but once I got to some of the heavy-hitters the whole process began to bog down.

(For the record - morbid curiosity and all - I did try writing each record to the file one at a time. And yes, it was as stupid and dog-slow as you might imagine it to be.)

What I needed was to be able to stream results to an XML output file. So I looked at some of the other XML packages out there (against the recommendations of a boyfriend who, I might add, is very opinionated when it comes to which XML libraries suck and which don't).

From all the benchmarking results I could find, lxml and libxml2 seem to perform pretty well when it comes to parsing.

http://codespeak.net/lxml/performance.html
http://effbot.org/zone/celementtree.htm#benchmarks

But parsing is not my problem - I'll let someone on the receiving end worry about that. My only concern is creating it as simply and quickly as possible, and I didn't find a lot in the way of results for generating XML.

So I decided to try cElementTree. It wasn't an entirely arbitrary decision - not needing to install anything new definitely swayed me (it's been part of Python since 2.5).

I didn't do any formal benchmarking of my own, just got some approximate numbers - it looks like I'm covering my total 3.5 million records at 0.13 ms per. So, considering what I was getting out of minidom, cElementTree is lightning fast.

A few other notes on odd things you might see in the script below:

  • I played around with a few different types of encoding to deal with all the crazy characters coming out of our various databases (I deal with mountains of legacy data every day, some of it as old as the internet itself, I think). ISO-8859-1 worked for most cases, and where it didn't, I'm just blowing away the value - with that many records, I'm not wasting my time tweaking to accommodate smaller fringe cases.
  • I ran into a few cases where either the email or name field came back empty, hence the AttributeError check. Such records should not exist, but again, legacy data - so you might not need it, but if you ever do come up against such a circumstance, that's how I dealt with it - just sayin'.
  • ...............................

    #!/usr/bin/python
    
    # usage: ./mylittlexmlscript.py
    
    import shutil, sys, MySQLdb as Database
    from xml.etree import cElementTree as ET
    
    # the completed files should be moved out of the way as they're generated
    # well, sure, they don't *have* to be - I just like keeping things neat and tidy
    processed = "/someplace/toput/allthat/xml"
    
    db = Database.connect(host="myhostname", port=0000, user="myusername", passwd="mypassword", db="")
    cursor = db.cursor()
    
    def walktheserver():
        """ walk the server, generate a list of dbs """
        sql = """SELECT table_schema FROM information_schema.tables GROUP BY table_schema"""
        cursor.execute(sql)
        dbnames = cursor.fetchall()
        return dbnames
    
    def gettables(dbname):
        """ return all the table names for a specific db """
        tables = []
    
        sql = """SHOW TABLES FROM %s"""%(dbname)
        cursor.execute(sql)
        rows = cursor.fetchall()
    
        for table in rows:
            ltable = list(table)
            tables.append(ltable[0])
    
        return tables
    
    def generate_xml(dbname):
        """ check to see if the db has either mailing list or user table 
            and generate the xml accordingly """
    
        tables = gettables(dbname)
    
        if "mailinglist" in tables:
    	create_xml(dbname, 'ml') # ml = mailing list
    
        if "users" in tables:
        	create_xml(dbname, 'up') # up = user profiles
    
    def create_xml(dbname, type):
        """ here's where we're actually going to put cElementTree to work """
    
        outputfile = dbname+"_"+type+".xml"    # name the output file
    
        BSWrapper = ET.Element("BSWrapper")    # create the root element of the XML tree
        Header = ET.SubElement(BSWrapper, "Header")    # create the first subelement, and give it a few attributes
        Header.set("AccountID", "996")
        Header.set("ListID", "885")
        Header.set("UserID", "774")
    
        sql = """USE %s"""%(dbname)
        cursor.execute(sql)
        if type is 'fcm':
    	sql = """SELECT fcm.email AS Email, fcm.first_name AS FirstName 
    	    FROM mailinglist AS fcm GROUP BY fcm.email ORDER BY Email"""
        if type is 'w3t':
            sql = """SELECT w3t.U_Email AS Email, w3t.U_First_Name AS FirstName
    	    FROM users AS w3t GROUP BY Email ORDER BY Email"""
        cursor.execute(sql)
        rows = cursor.fetchall()
    
        for record in rows:
            data = ET.SubElement(BSFImport, "data")
            Email = ET.SubElement(data, "Email")
            Firstname = ET.SubElement(data, "Firstname")
    	
            try:
    	    email = record[0].encode('iso-8859-1')
    	except AttributeError: 
    	    email = ''
    	except UnicodeDecodeError:
    	    email = ''
    
            try:
                name = record[1].encode('iso-8859-1')
            except AttributeError:
                name = ''
            except UnicodeDecodeError:
                name = ''
    
    	Email.text = email
            Firstname.text = name
    
        tree = ET.ElementTree(BSFImport)
        tree.write(outputfile)
        shutil.move(outputfile, processed)
    
    def main():
        dblist = walktheserver()
        # they're not "bad", per se - I would never pass that kind of judgement on a database
        # these are just the dbs to be excluded from consideration
        badlist = ['information_schema', 'artist1', 'artist2', 'artist3']
        for dbname in dblist:
            if dbname[0] not in badlist:
                # print dbname[0]
                newxml = generate_xml(dbname[0])
    
        cursor.close()
        db.close()
    
    if __name__ == "__main__":
        main()
    

    For the record, the XML output looks like this - no pretty formatting, alas, but at least all the nodes are where they ought to be, every element accounted for. If formatting is a problem for the vendor, I'll deal with it in a few days and let you know what I came up with:

    <BSWrapper><Header AccountID="996" ListID="885" UserID="774" /><data><Email>12ewoldt@yahoo.com</Email>
    <Firstname>christian</Firstname></data><data><Email>1995fxstc@sbcglobal.net</Email><Firstname>Daniel</Firstname>
    </data></BSWrapper>