Repeating the title: Two Year Intermission.
Yes, I will be taking a bit of a break from my blogging (though I haven't been the most persistent anyway). As some may know, I have chosen to serve a two-year proselyting mission for my church, The Church of Jesus Christ of Latter-day Saints. If you care to look into this church, you can start here.
I was called to the Virginia Richmond Mission. I leave on March 3rd, 2010 and return on or around March 3rd, 2012. After my mission I would like to continue writing these blog posts (more regularly), which I really hope are beneficial to anyone who reads them.
Anyway, I am devout in my beliefs and love my religion. While I am gone, enjoy the posts I have written so far. If you would like to follow my mission experience, I have set up a system using a couple of services from the oh-so-amazing Google. Via Blogger (with Mail2Blogger) and FeedBurner, anyone can subscribe to receive emails of my posts, which themselves are emailed to the blog. You can view posts and sign up here.
Best of luck in your learning endeavors.
Monday, March 1, 2010
Saturday, February 20, 2010
DOM Parsing With Python
This is nearly verbatim from where I originally posted it.
This guide assumes you have basic knowledge of Python and have done at least some work with HTML, XHTML, and/or XML.
Background
DOM stands for Document Object Model. It is a convention used in HTML, XHTML, and XML for representing and interacting with objects. As the name suggests, a document in a format like HTML is made up of many elements that have relationships to other elements. For example, you may have a <span> element in your <body> element. The <span> element's parent is the <body>. The <span> may have child elements and/or sibling elements. It works much like a family tree. The elements in an HTML document may have identifiers, specified by attributes like id='something', class='something', and/or name='something'. You can use these identifiers to keep track of and find a specific element or list of elements. Once you have found the element(s) you are looking for, you can change things in a dynamic manner or get desired information.
Let's Try Some Beautiful Soup
When I needed to parse HTML documents a little while ago, I went in search of a module to accommodate my needs. I could have written my own class to handle it (as DOM parsing really isn't that hard), but I don't have nearly the time I would need to take on such a project. Instead I found a module called 'BeautifulSoup'. As I looked into this module, it seemed to be well written and fully functional. Through experience I found that this module is quite easy to use.
On To The Code
OK, let's start out with a simple HTML document:
<html>
<head>
<title>Test</title>
</head>
<body>
<span id="someid">This is some text.</span>
<span class="someclass">This is some other text.</span>
</body>
</html>
In this document we have two child elements under the <body> element. Now, let's say we want to access the element with the id 'someid'. First we need to parse the document like so (we will assume the variable 'doc' contains the HTML):
import BeautifulSoup
dom = BeautifulSoup.BeautifulSoup(doc)
Now we need to get the element. The Python object 'dom' contains the parsed document. BeautifulSoup provides several methods for searching the document tree. Here are some examples:
# Find the first element with the id 'someid' (all have the same result)
elm1 = dom.find(None, {"id":"someid"})
elm1 = dom.find(None, id="someid")
elm1 = dom.find("span", {"id":"someid"}) # Only searches 'span' tags
elm1 = dom.find("span", id="someid") # Same as above
# Find all elements with the id 'someid'
elms1 = dom.findAll(None, id="someid")
# Find the first element with the class 'someclass'
elm2 = dom.find(None, {"class":"someclass"})
# Find all elements with the class 'someclass'
elms2 = dom.findAll(None, {"class":"someclass"})
# You cannot pass 'class' as a keyword argument, since it is a reserved word in Python.
# That is why the find methods also accept a dictionary that specifies what to look for.
# You may specify any combination of attributes, such as 'class', 'id', and/or 'name'.
elm1.nextSibling # A reference to the next sibling node
elm2.previousSibling # A reference to the previous sibling node
# If the parser keeps the whitespace between the two tags as a text node, that node is the sibling;
# otherwise the above two lines reference each other's elements.
# Since it is a document _tree_ (each element references others), you can daisy-chain lookups.
# These will just lead back to the same element that elm1 referenced to begin with:
elm1.nextSibling.parent.find(None, id="someid")
elm1.parent.first()
# Now, of course you can do more than just walk the tree.
# Print all text contained in the element and all child elements:
print elm1.text
# Print all raw HTML contained in the element:
print elm1.renderContents()
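The tree-walking ideas above (parent, sibling, lookup by attribute) are not specific to BeautifulSoup. As a minimal sketch, assuming Python 3 and a well-formed XHTML document, here is roughly the same walk using the standard library's xml.dom.minidom instead. The sample document is repeated with both <span> elements closed, and the helper names find_by_attr and next_element are made up for this illustration; they are not part of either library.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# The sample document from this post, with both <span> elements closed.
DOC = """<html>
<head>
<title>Test</title>
</head>
<body>
<span id="someid">This is some text.</span>
<span class="someclass">This is some other text.</span>
</body>
</html>"""


def find_by_attr(node, name, value):
    """Hypothetical helper: all elements under `node` whose attribute `name` equals `value`."""
    return [
        elem
        for elem in node.getElementsByTagName("*")  # "*" matches every tag name
        if elem.getAttribute(name) == value
    ]


def next_element(node):
    """Hypothetical helper: the next sibling that is an element, skipping text nodes."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != Node.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling


dom = minidom.parseString(DOC)
elm1 = find_by_attr(dom, "id", "someid")[0]
elm2 = find_by_attr(dom, "class", "someclass")[0]

print(elm1.parentNode.tagName)     # body
print(next_element(elm1) is elm2)  # True
print(elm1.firstChild.data)        # This is some text.
print(elm1.toxml())                # the element serialized back to markup
```

Note that minidom keeps the newline between the two <span> tags as a text node, which is why next_element skips non-element siblings rather than using nextSibling directly.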
Of course you can do more, such as manually looking through the attributes of an element, but this gives a basic idea of how to use the module for your needs. To parse XML, instead of HTML or XHTML, you will want to parse the document with `BeautifulSoup.BeautifulStoneSoup("...")`. I hope you found this helpful. Good luck in your own DOM parsing.
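To give a feel for what "looking through the attributes of an element" can look like on an XML document, here is a minimal sketch under stated assumptions: it uses the standard library's xml.dom.minidom on Python 3 as a stand-in (it is not BeautifulStoneSoup), and both the XML document and the attributes_of helper are invented purely for illustration.

```python
from xml.dom import minidom

# Made-up XML document, used only for this illustration.
XML_DOC = """<library>
  <book id="b1" lang="en">DOM Basics</book>
  <book id="b2" lang="fr" edition="2">Le DOM</book>
</library>"""


def attributes_of(elem):
    """Hypothetical helper: an element's attributes as a plain dict."""
    return dict(elem.attributes.items())


dom = minidom.parseString(XML_DOC)
books = dom.getElementsByTagName("book")

for book in books:
    # firstChild is the text node holding the book title
    print(book.firstChild.data, attributes_of(book))
```

Turning the attributes into a dict keeps the rest of the code independent of the parser's own attribute container, which is one reasonable design choice among several.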
BeautifulSoup documentation - http://www.crummy.com/software/BeautifulSoup/documentation.html