This guide assumes you have basic knowledge of Python and have done at least some work with HTML, XHTML, and/or XML.
Background
DOM stands for Document Object Model. It is a convention used in HTML, XHTML, and XML for representing and interacting with objects. As the name suggests, a document such as an HTML page is made up of many elements, each related to other elements. For example, you may have a <span> element inside your <body> element. The <span> element's parent is the <body>. The <span> may in turn have child elements and/or sibling elements. It works much like a family tree. The elements in an HTML document may carry identifiers, specified by attributes like id='something', class='something', and/or name='something'. You can use these identifiers to keep track of and find a specific element or a list of elements. Once you have found the element(s) you are looking for, you can change them dynamically or read out the information you need.
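To make the family analogy concrete, here is a small sketch using xml.dom.minidom from Python's standard library (chosen only for illustration; it is not the module the rest of this guide uses). The tiny document and its id value are made up for the example.

```python
from xml.dom.minidom import parseString

# A tiny, well-formed document: <body> holds a <span> and a <p>.
tree = parseString('<body><span id="something">hello</span><p>world</p></body>')

body = tree.documentElement   # the <body> element
span = body.firstChild        # the first child of <body>

print(span.parentNode.tagName)                        # parent of <span>
print(span.nextSibling.tagName)                       # sibling that follows <span>
print([child.tagName for child in body.childNodes])   # all children of <body>
print(span.getAttribute("id"))                        # the identifying attribute
```

Here the <span> reports <body> as its parent and <p> as its next sibling, which is exactly the parent/child/sibling relationship described above.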
Let's Try Some Beautiful Soup
When I needed to parse HTML documents a little while ago, I went in search of a module that would do the job. I could have written my own class to handle it (DOM parsing really isn't that hard), but I don't have nearly the time such a project would take. Instead I found a module called 'BeautifulSoup'. When I looked into it, it seemed well written and fully featured, and after working with it I found it quite easy to use.
Onto The Code
OK, let's start out with a simple HTML document:
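A document along these lines could look like the following, stored in a Python string named doc so it is ready for parsing. Only the id 'someid' and the two children of <body> come from this guide; the tag names, text, title, and class value are illustrative assumptions.

```python
# Illustrative stand-in document: <body> has exactly two child elements,
# and one of them carries the id 'someid'.
doc = """<html>
<head><title>Example page</title></head>
<body>
<div id="someid">Hello</div>
<p class="someclass">World</p>
</body>
</html>"""

print(doc)
```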
In this document we have two child elements under the <body> element. Now, let's say we want to access the element with the id 'someid'. First we need to parse the document like so (we will assume the variable 'doc' contains the HTML):
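A minimal sketch of the parsing step, assuming the current bs4 package (BeautifulSoup 4) and Python's built-in "html.parser" backend. With the older BeautifulSoup 3 module the import is `from BeautifulSoup import BeautifulSoup` instead. The contents of doc are an illustrative stand-in.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Stand-in HTML; any HTML string would do here.
doc = """<html>
<head><title>Example page</title></head>
<body>
<div id="someid">Hello</div>
<p class="someclass">World</p>
</body>
</html>"""

# Parse the string into a tree of Python objects.
dom = BeautifulSoup(doc, "html.parser")

print(dom.title.string)  # text of the <title> element
```

Naming the parser explicitly keeps the result the same from one machine to the next, since bs4 otherwise picks whichever parser happens to be installed.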
Now we need to get the element. The Python object 'dom' contains the parsed document, and it provides several methods for searching the document tree. Here are some examples:
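The searches might look roughly like this, again assuming the bs4 package and an illustrative stand-in document. In the older BeautifulSoup 3 module, find_all is spelled findAll.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

doc = """<html>
<head><title>Example page</title></head>
<body>
<div id="someid">Hello</div>
<p class="someclass">World</p>
</body>
</html>"""
dom = BeautifulSoup(doc, "html.parser")

# find() returns the first match (or None); find_all() returns a list.
by_id = dom.find(id="someid")                       # search by id attribute
by_class = dom.find_all(class_="someclass")         # search by class attribute
by_tag = dom.find_all("p")                          # search by tag name
by_both = dom.find("div", attrs={"id": "someid"})   # tag name plus attributes

print(by_id.get_text())                  # text inside the element
print(by_id.parent.name)                 # tag name of the parent element
print(by_id.find_next_sibling().name)    # tag name of the next sibling element
print(len(by_class), len(by_tag))
```

Note the trailing underscore in class_: class is a reserved word in Python, so bs4 uses class_ as the keyword argument.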
Of course, you can do more, such as manually inspecting the attributes of an element, but this gives a basic idea of how to use the module for your needs. To parse XML instead of HTML or XHTML, you will want to parse the document with `BeautifulSoup.BeautifulStoneSoup("...")`.
I hope you found this helpful. Good luck in your own DOM parsing.
BeautifulSoup documentation - http://www.crummy.com/software/BeautifulSoup/documentation.html