Python XML Parsing

Series: dev July 12, 2017

Below is a copy of a gist I wrote last year. A friend was learning Python as his first programming language, and he was having some trouble getting data out of an XML file he was downloading. I’m not too familiar with Python myself, so I decided to write out my stream of consciousness as I tried to do the same thing.


Here’s a little more help feeling it out, from someone who knows almost no Python but has some dev experience. Forgive me if this is too simplistic, but since you’re starting out, it might be helpful.

My thought process going into a project like this is this:

What is the smallest chunk of this I can do in one logical step? Well, I have a URL I can visit to get an XML list of games, so I’m going to need to get that into my program. That’s step one, and it is all I care about right now. So how do I do it?

There are really two steps here:

  1. Read the data from the URL Kane provided.
  2. Parse the XML, turning it into something you can use in the rest of your code.

People probably do this all the time, so I’m going to check Stack Overflow: https://stackoverflow.com/search?q=python+parse+xml+from+url

Hey, there we go. The second question has some good upvotes, and seems directly on topic. The answer is really helpful too: https://stackoverflow.com/questions/24124643/parse-xml-from-url-into-python-object

But for now, we don’t want to put anything in a function. We want as simple a program as we can get. So I’m not even going to write this in a file; I just opened up my Python command-line interpreter.

It looks like I need to import some modules:

>>> import urllib2
>>> import xmltodict
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named xmltodict

Uh oh. I don’t have xmltodict installed. It’s not something that comes as part of the standard library.
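As an aside, if installing a third-party module isn’t an option, the standard library’s xml.etree.ElementTree can also parse XML. It hands back element objects rather than dicts, so it’s a different technique from the one in the Stack Overflow answer. A minimal sketch, assuming Python 3 and a made-up sample document (the tag names here are placeholders, not the real feed):

```python
import xml.etree.ElementTree as ET

# Made-up sample XML, standing in for whatever the URL returns.
SAMPLE_XML = """
<library>
  <item><title>First</title></item>
  <item><title>Second</title></item>
</library>
"""

# fromstring() parses a string and returns the root element.
root = ET.fromstring(SAMPLE_XML)

# iter("item") walks every <item> element; findtext() returns a child's text.
titles = [item.findtext("title") for item in root.iter("item")]
print(titles)
```

I’m sticking with xmltodict for the rest of this, since that’s what the answer uses.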

We can install it, but let’s take a look at it first. Google xmltodict and find the GitHub repo: https://github.com/martinblech/xmltodict

It’s been updated in the past few months, so I know it’s at least getting minimal maintenance, which is a good sign. It has a little ‘build passing’ badge, which is also nice: it probably has some automated tests I can take a look at, which gives me a little more confidence. Best of all, if you scroll down a bit, there are some basic examples of how it’s used and how to install it.

As with most Python modules, you can install it with pip:

pip install xmltodict

So I exit the python interpreter, install that by running that line from the command line, and go back into python:

>>> import urllib2
>>> import xmltodict

Awesome. Seems to work.

Now let’s go back to that Stack Overflow answer. The first line appears to use urllib2.urlopen. I bet that opens a URL. Let’s try it with the link Kane provided:

>>> file = urllib2.urlopen('http://steamcommunity.com/id/kane_t/games/?tab=all&xml=1')

Cool. No errors. Let’s use file.read() like they do in the next line of the answer. I bet that’s how we get the information out.

>>> data = file.read()
>>> data

By typing data, we can see what’s stored in that variable. Look at that: it appears to be a string with a ton of XML markup about Kane’s game history.

>>> file.close()

Ok, so this just closes out the connection we opened. We probably don’t even need to worry about it for now, but let’s make a note to read up on urllib2 once we’re done with this, and figure out exactly what we’re doing.
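On the closing question: in Python 3, urllib2 was folded into urllib.request, and the object urlopen returns can be used in a with block, which closes it for you. A rough sketch, assuming Python 3; fetch_xml is a helper name I made up for illustration:

```python
from urllib.request import urlopen


def fetch_xml(url):
    """Return the raw bytes served at url (hypothetical helper)."""
    # The with block closes the connection when the block exits,
    # even if read() raises an error.
    with urlopen(url) as response:
        return response.read()
```

Calling fetch_xml with the Steam URL would take the place of the urlopen, read, and close lines above. One difference to expect: in Python 3, read() gives you bytes rather than a str.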

>>> data_parsed = xmltodict.parse(data)
>>> data_parsed

Ok. So I changed this line up a bit. I don’t want to overwrite my existing data variable with a new one; I want to get a sense of what’s changing. When we type data_parsed, it looks pretty much like data, but with some different formatting. What’s up with that? Let’s just visually compare by viewing one then the other quickly:

>>> data
>>> data_parsed
>>> data
>>> data_parsed
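Rather than eyeballing the two printouts, you can also ask Python what kind of thing each variable holds with the built-in type(). A small sketch with stand-in values (the real data and data_parsed are much bigger, and the parsed one may report a dict subclass such as OrderedDict):

```python
# Stand-ins for the real variables: raw text, and a parsed version of it.
data = "<gamesList><steamID>someone</steamID></gamesList>"
data_parsed = {"gamesList": {"steamID": "someone"}}

print(type(data))         # the raw download is text
print(type(data_parsed))  # the parsed result is a mapping

# isinstance() is the usual way to test for a type in code.
print(isinstance(data, str), isinstance(data_parsed, dict))
```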

So this data_parsed must be what a dict is in Python, whatever that means. Let’s look it up. The second or third Google result for ‘dict python’ is https://docs.python.org/2/library/stdtypes.html#dictionary-view-objects. Search the page for ‘dict’ and you’ll see that the table of contents on the left-hand side has an entry for 5.8. Mapping Types — dict. Click that, and we can see some documentation. Scroll down to 5.8.1 - Dictionary view objects. Scanning through the examples, it looks like you can access data within a dict with dict['key']. That’s pretty common, but if you’re uncomfortable with it, there was a Python tutorial on dicts at the top of the Google search that might be worth checking out. It’s good to get comfortable with the language’s documentation, but for now we’ll be fine using tutorials until we have some way of understanding the terms used in the documentation.

It also looks like dicts have a function you can call, .keys(), that will give you a list of their keys.
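If dicts are new to you, here’s a tiny sketch of those two ideas: looking a value up by key, and listing the keys. The dict itself is made up for illustration:

```python
# A made-up dict: each key maps to a value.
game = {"name": "Example Game", "hoursOnRecord": "12"}

# Look up a value by its key.
print(game["name"])

# keys() lists the keys; list() makes it print as a plain list in Python 3.
print(list(game.keys()))

# "in" checks whether a key exists before you reach for it.
print("appID" in game)
```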

So let’s try and figure this out.

>>> data_parsed.keys()
[u'gamesList']
>>> data_parsed['gamesList']

You get a bunch of stuff. But that’s alright. There was a lot in the XML.

>>> data_parsed['gamesList'].keys()
[u'steamID64', u'steamID', u'games']

Now we’re on to something. We just want a list of games. So it appears there was a dict inside of the dict. That’s pretty common for these kinds of interfaces. We’re turning the XML into a hierarchy of dicts. So to get the games dict, inside the gamesList dict:

>>> data_parsed['gamesList']['games']

More stuff, but less.

>>> data_parsed['gamesList']['games'].keys()
[u'game']
>>> data_parsed['gamesList']['games']['game']

Still a ton of stuff.

>>> data_parsed['gamesList']['games']['game'].keys()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'keys'

Oh! Here we go. So this is a ‘list’, not a ‘dict’. What does that mean? Let’s Google it. I’m going to leave you to do that yourself, and use some intuition from here on out, because I’m already writing way too much.

>>> data_parsed['gamesList']['games']['game'][0]
OrderedDict([(u'appID', u'255710'), (u'name', u'Cities: Skylines'), (u'logo', u'http://cdn.edgecast.steamstatic.com/steamcommunity/public/images/apps/255710/1b90f32be112870a4fa22c819a358d047b38d97f.jpg'), (u'storeLink', u'http://steamcommunity.com/app/255710'), (u'hoursLast2Weeks', u'9.4'), (u'hoursOnRecord', u'70'), (u'statsLink', u'http://steamcommunity.com/id/kane_t/stats/255710'), (u'globalStatsLink', u'http://steamcommunity.com/stats/255710/achievements/')])

Awesome! We got a game from the first element in the list.
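A list holds items in order, and you pull one out by position (starting at 0) rather than by key. One thing worth knowing, as far as I understand xmltodict’s behavior: a tag that repeats comes back as a list, but if there were only one game element you’d get a single dict instead. A hedged sketch with hand-written stand-ins for both shapes; as_list is a helper I made up:

```python
def as_list(value):
    """Wrap a lone item in a list so callers can always loop (made-up helper)."""
    return value if isinstance(value, list) else [value]


# Hand-written stand-ins for parsed output, not real data.
many = {"games": {"game": [{"name": "A"}, {"name": "B"}]}}
one = {"games": {"game": {"name": "A"}}}

for parsed in (many, one):
    games = as_list(parsed["games"]["game"])
    # Index 0 is the first item; len() says how many there are.
    print(len(games), games[0]["name"])
```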

Let’s try and get the title of the game:

>>> data_parsed['gamesList']['games']['game'][0]['name']
u'Cities: Skylines'

And we got it! Take a break. Celebrate your accomplishment. This is cool.
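One detail visible in that output: every value is a string, even the ones that look like numbers (u'70'). So before doing any math, you’d convert them. A small sketch using a few fields hand-copied from the entry above; treating a missing hoursLast2Weeks as zero is my assumption, not something the feed documents:

```python
# A few fields copied from the entry printed above.
game = {"name": "Cities: Skylines", "hoursOnRecord": "70", "hoursLast2Weeks": "9.4"}

# float() turns the text into a number you can do arithmetic with.
total_hours = float(game["hoursOnRecord"])

# get() returns a default instead of raising KeyError when a key is absent.
recent_hours = float(game.get("hoursLast2Weeks", "0"))

print(game["name"], total_hours - recent_hours)
```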

Let’s try cycling through all the games. I Googled ‘python each in list’ and found https://stackoverflow.com/questions/7423118/python-list-for-each-access-find-replace-in-built-in-list

>>> for game in data_parsed['gamesList']['games']['game']:
...     print(game['name'])
...

And we just printed out all the games.
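Printing is fine for a quick look, but if you want to keep the names around, a list comprehension builds a list in one line. A sketch with a made-up stand-in for the games list:

```python
# Stand-in for data_parsed['gamesList']['games']['game']; the entries are made up.
games = [{"name": "Zeta"}, {"name": "Alpha"}, {"name": "Mid"}]

# Build a new list holding just the name of each game.
names = [game["name"] for game in games]

# sorted() returns a new alphabetized list and leaves names untouched.
for position, name in enumerate(sorted(names), start=1):
    print(position, name)
```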

Let’s store the list in a variable so it’s a little less of a pain to type (probably could have done this a while ago):

>>> game_list = data_parsed['gamesList']['games']['game']

And let’s make sure our code still works:

>>> for game in game_list:
...     print(game['name'])
...

And it does! Cool. So now that we’ve got a list of games we can deal with, how can we pick a random one?

Good luck! Also, remember to come back to that documentation we noted earlier. We got something running, but at some point, we’ll want to gain a bit of understanding of why. Read through the docs. Take in what you can, and if you’re still not processing something, ask someone else or come back again the next time you’re stuck. Using the documentation can be hard, but don’t let that put you off trying. It truly becomes invaluable.

