Monday, March 1, 2010

Two Year Intermission

Repeating the title: Two Year Intermission.

Yes, I will be taking a bit of a break from my blogging (though I haven't been the most persistent anyway). Some may know, but I have chosen to serve a two year proselyting mission for my church, which is The Church of Jesus Christ of Latter-day Saints. If you care to look into this church, you can start here.

I was called to the Virginia Richmond Mission. I leave on March 3rd, 2010 and return on or around March 3rd, 2012. After my mission I would like to continue writing these blog posts (more regularly), which I really hope are beneficial to anyone that reads them.

Anyway, I am devout in my beliefs and love my religion. While I am gone, enjoy the posts I have written so far. If you would like to follow my mission experience, I have set up a system using a couple of services from the oh-so-amazing Google. Via Blogger (with Mail2Blogger) and FeedBurner, anyone can subscribe to receive emails of my posts, which are themselves emailed to the blog. You can view posts and sign up here.

Best of luck in your learning endeavors.

Saturday, February 20, 2010

DOM Parsing With Python

This is nearly verbatim from where I originally posted it.

This guide assumes you have basic knowledge of Python and have done at least some work with HTML, XHTML, and/or XML.

Background

DOM stands for Document Object Model. It is a convention used in HTML, XHTML, and XML for representing and interacting with objects. As the name fairly well describes, a document like an HTML page is made of many elements, each related to other elements. For example, you may have a <span> element in your <body> element. The <span> element's parent is the <body>. The <span> may have child elements and/or sibling elements. It works much like a family relationship. The elements in an HTML document may have identifiers, specified by attributes like id='something', class='something', and/or name='something'. You can use these identifiers to keep track of and find a specific element or list of elements. Once you have found the element(s) you are looking for, you can change things dynamically or get the information you want.
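
To make the family relationships concrete, here is a minimal sketch that uses Python's standard-library xml.dom.minidom rather than the module discussed below. It assumes a small, well-formed document with no whitespace between the two <span> elements, so that each one is the other's direct sibling. The variable names are only for illustration.

```python
from xml.dom import minidom

# A tiny well-formed document. There is deliberately no whitespace between
# the elements, so no whitespace text nodes sit between the two spans.
doc = (
    '<html><body>'
    '<span id="someid">This is some text.</span>'
    '<span class="someclass">This is some other text.</span>'
    '</body></html>'
)

dom = minidom.parseString(doc)
body = dom.getElementsByTagName("body")[0]
first, second = body.getElementsByTagName("span")

# Parent: the <span> lives inside <body>.
parent_name = first.parentNode.tagName

# Siblings: each span points at the other.
sibling_class = first.nextSibling.getAttribute("class")
siblings_linked = second.previousSibling is first

# Child: the text inside the first span is a child text node.
first_text = first.firstChild.data

print(parent_name, sibling_class, siblings_linked, first_text)
```

With real-world HTML the indentation between tags usually shows up as extra text nodes, so a sibling is not always the next element.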

Let's Try Some Beautiful Soup

As I found the need to parse HTML documents a little while ago, I went in search of a module to accommodate my needs. I could have made my own class to handle it (as DOM parsing really isn't that hard), but I don't have nearly the time I would need to take on such a project. Instead I found a module called 'BeautifulSoup'. As I looked into this module, it seemed to be well-written and have full functionality. Through experience I found that this module is quite easy to use.

Onto The Code

Ok, let's start out with a simple HTML document:
<html>
<head>
    <title>Test</title>
</head>
<body>
    <span id="someid">This is some text.</span>
    <span class="someclass">This is some other text.</span>
</body>
</html>

In this document we have two child elements under the <body> element. Now, let's say we want to access the element with the id 'someid'. First we need to parse the document like so (we will assume the variable 'doc' contains the HTML):
import BeautifulSoup
dom = BeautifulSoup.BeautifulSoup(doc)

Now we need to get the element. The python object 'dom' contains the parsed document. There are some provided methods for searching the document tree. Here are some examples:
# Find the first element with the id 'someid' (all have the same result)
elm1 = dom.find(None, {"id":"someid"})
elm1 = dom.find(None, id="someid")
elm1 = dom.find("span", {"id":"someid"}) # Only searches 'span' tags
elm1 = dom.find("span", id="someid") # Same as above

# Find all elements with the id 'someid'
elms1 = dom.findAll(None, id="someid")

# Find the first element with the class 'someclass'
elm2 = dom.find(None, {"class":"someclass"})

# Find all elements with the class 'someclass'
elms2 = dom.findAll(None, {"class":"someclass"})

# You cannot specify 'class' as a keyword argument, since it is reserved in python.
# That is why the find methods allow a dictionary that specifies what to look for.
# Also, you may specify any of a 'class', 'id', and/or 'name' to look for.

elm1.nextSibling # A reference to the next sibling element
elm2.previousSibling # A reference to the previous sibling element

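# If there is no whitespace between the two <span> tags, the above two lines are
# references to each other's elements (elm1.nextSibling is elm2, and vice versa).
# Otherwise the sibling may be the whitespace text node that sits between them.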

# Now, as it is a document _tree_ (each element references others), you can daisy-chain lookups.
# These will just lead back to the same element that elm1 referenced to begin with:
elm1.nextSibling.parent.find(None, id="someid")
elm1.parent.first()

# Now, of course you can do more than just walk the tree.

# Print all text contained in the element and all child elements:
print elm1.text

# Print all raw HTML contained in the element:
print elm1.renderContents()
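
The note above about 'class' being reserved can be illustrated without the module itself. The following is only a rough sketch of the idea, not how BeautifulSoup is implemented: a hypothetical helper named find_all walks a standard-library xml.dom.minidom tree and accepts the attributes to match either as a dictionary or as keyword arguments.

```python
from xml.dom import minidom

def find_all(node, name=None, attrs=None, **kwargs):
    """Hypothetical helper: collect descendant elements, in document order,
    whose tag name and attributes match what was asked for."""
    wanted = dict(attrs or {}, **kwargs)  # merge the dictionary and keyword forms
    matches = []
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip text nodes, comments, and so on
        name_ok = name is None or child.tagName == name
        attrs_ok = all(child.getAttribute(k) == v for k, v in wanted.items())
        if name_ok and attrs_ok:
            matches.append(child)
        matches.extend(find_all(child, name, attrs, **kwargs))
    return matches

doc = (
    '<html><body>'
    '<span id="someid">This is some text.</span>'
    '<span class="someclass">This is some other text.</span>'
    '</body></html>'
)
dom = minidom.parseString(doc)

# 'id' is an ordinary name, so the keyword form works.
by_id = find_all(dom, "span", id="someid")

# 'class' is a reserved word: find_all(dom, None, class="someclass") would be
# a SyntaxError, so the dictionary form is the way to ask for it.
by_class = find_all(dom, None, {"class": "someclass"})

print(by_id[0].firstChild.data)
print(by_class[0].firstChild.data)
```

Accepting both forms keeps the common case short (id="someid") while still allowing attribute names that are not valid Python keyword names.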

Now, of course you can do more, like manually looking through the attributes of an element or something, but this gives a basic idea of how to use the module for your needs. To parse XML, instead of HTML or XHTML, you will want to parse the document with `BeautifulSoup.BeautifulStoneSoup("...")`. I hope you found this helpful. Good luck in your own DOM parsing.

BeautifulSoup documentation - http://www.crummy.com/software/BeautifulSoup/documentation.html

Monday, December 28, 2009

Open Source

When someone makes a program, the public availability of source code is sometimes a weighty matter. Now, if someone goes to all the effort of making a huge, extensive program, why would they want to just give everyone the code for it? Let's take a look at reasons for and against just giving away the source code.

Reasons against:
  • Perhaps the biggest reasons to keep source code private are controlling what's public and retaining monetary flow. If you have some unapparent bugs in your program or maybe some algorithms you don't want everyone to have, you might think that keeping things from the public protects your interests. If you want to make money with your program, just giving away the code doesn't make that easy.
  • Maybe you have something to hide. You might be sending usage statistics back to company servers and you don't want the user to be aware of this. Obviously, a user might get a bit freaked out if they find out you are collecting information from them, so what better than to not let them know in the first place.
  • You may want to prevent competition. If you let your code into the open, others are destined to see it and try to make something better. If your company depends on keeping things private to remain prosperous, the last thing on your mind is feeding competition the code you have so far, and thus driving business away.
Reasons for:
  • One of the biggest reasons open source programs do so well is because they have a community backing and developing them. The program doesn't just rely on a small set of somewhat knowledgeable people, but many, many people, collectively far more knowledgeable than a subset of employees.
  • Another big reason is the speedy development. If someone finds a bug or gets an idea for improvement, they can fix or add code to their heart's content (obviously moderated by those in charge of accepting changes). This gives the community a say in the direction a program goes and how secure it is. If a program has a vulnerability, it can be resolved soon after discovery, and possibly even by the discoverer. This makes a program much more secure and just cooler in general.
  • Everyone can see the code, so they can conform their program to work with yours or take full advantage of your program's capabilities. End-programmers can also learn from your program. Rather than spending hours hassling with something, just trying to get a simple piece of their program to work, they can see how you got yours to work. It's a great way to learn something.
  • Have you ever just made some quick, poorly thought-out piece of code, which would embarrass you to have everybody see? Well, the whole publicity factor can help you to make higher-quality, clean code that you can be proud to put your name to.
  • Open source programs can bring in revenue as well. If you make a high-quality product, your following community is much more likely to donate money than if you make a low-quality product. This encourages you to make an awesome, well-liked product. Plenty of profit can be made from simply asking your users to donate a little bit of money. If you make the users happy, they will gladly repay.
  • Lastly, it's just cool and helps everyone. Sharing is better than selfishly hoarding, so why not apply that to your code?!
Ok, so in this comparison we can see that the reasons for open source outweigh the reasons against (at least I hope you can see that). Now, let's take a look at examples of proprietary vs. open source:

Microsoft has long kept its code private. This has brought a plethora of problems. First off, its operating system has (as I have seen) utterly failed. They control everything with it and it buggily progresses (if not regresses) at a very slow pace. Windows Vista is basically a slower, prettier, much buggier version of XP. Windows 7, though an improvement over Vista, fails in many ways. Next is their browser, Internet Explorer. As seen by the majority of web designers, it has been one of the largest stumbling blocks of internet progress. It renders things terribly, slowly, and very buggily. It doesn't support many standards, but rather impedes them. The community has had little to no say in the development of Microsoft products; it's all corporately managed.

Some of the best known open source projects are Linux and Mozilla Firefox. Linux has been slowly, but surely, crushing Microsoft's monopoly. Anyone can make their own distribution of Linux and can contribute to existing distributions. It's entirely community based. Since its beginnings, Mozilla Firefox has been thriving. The browser quickly gained popularity and is currently one of the biggest forces crushing Microsoft's sad attempt at a browser. Recently, Google released a friendly competitor to Firefox. Google Chrome boasts an amazingly fast JavaScript engine, which makes a notable speed difference. Rather than attempting a monopoly, these two browsers encourage each other to one-up the other. This just makes both of them better. The open source factor has brought Linux, Firefox, and Chrome to be far better than any corporate-controlled product could ever become.

So, these are just a few examples of why open source is far superior to closed source. Next time you start a large project, consider open sourcing it. Help others and yourself!

Thursday, December 24, 2009

Learning How To Learn

A large part of the knowledge I have gained has come from my own endeavors to seek out truth. Since a very young age I have been a very curious person. I always had questions and rarely had answers. That is probably one of the biggest reasons I love computers, and especially the internet, so very much. Once I started looking for information in this massive network, filled with a large part of the world's collective knowledge, I finally began to discover answers to my many questions. If I were without this amazing resource, I would be quite clueless in many respects.

The things I have learned range from how many, many parts of human anatomy work (nervous system, immune system, adaptability, bone structure and composition, tissue composition, etc.), to how computers work on the lower levels, and even how operating systems and programming languages work. This isn't all. I have found many answers to random questions that come up. Some of my biggest resources have been the Google search engine and Wikipedia. It's amazing how you can just type a question or well-phrased query and find all sorts of answers.

Maybe I am just weird, but I often found school to get in the way of my education. Sure, there are things that I wouldn't have learned without school, and I am grateful I did learn them, but I found school to be less productive overall than the time that I spent researching things myself. I can't say I am an expert in everything, but I have gained some good underlying knowledge of a multiplicity of subjects. I have often found the things that I learn to be extremely helpful and good to know. I prefer to know about things myself, rather than just blindly trusting whatever other people say. If I am going to get an MRI, like I just did days ago, I will research the subject. If I am in some way injured or ill, I am going to look into it. That way I know about it and how to best treat myself. If I am going to gain mad security penetration skills, I am going to find good information resources (like HTS). For some, a teacher could even be a good, knowledgeable resource (but sometimes, in my experience, teachers are quite clueless in the things they are supposed to be expert in).

Sometimes even my family makes fun of me for my obsession with knowledge. I just laugh at myself right alongside them, rather than taking any offense whatsoever. I realize I am kind of a knowledge nerd, truth seeker, or dare I even say hacker, and I know that's just who I am. If someone doesn't like it, that's their problem.

Now, have you ever had an unanswered question? Have you ever had doubts about the truthfulness of information you have received? Have you ever just wanted to know more about something? Try researching it yourself. Here is a great tutorial for learning to learn. In fact at the top you can find a link to Wikipedia, which has an article about autodidactism (self-learning).

Ok, now that you know your resources, use them. Good luck!

Tuesday, December 22, 2009

PHP

As I wrote about my favorite language, Python, I figure I might as well write about other languages I like. This time it's PHP.

Ever wanted to write an interactive, dynamically generated web page or site? This language is built mainly for that exact purpose, plus it's very powerful. That is why it's one of the most popular server-side languages used for web programming.

PHP is pretty easy to learn and use. Its main site, php.net, has some awesome, pretty thorough documentation to guide the learning process. Unlike Python, the functions are all global. You don't need to import anything to use them. As with Python, variable types don't have to be explicitly specified. In PHP this is called 'type juggling'.

Ok, now for some of the basics. All PHP code must be surrounded by the start (<?php, or the short form <? on servers where short tags are enabled) and end (?>) tags, in order to separate PHP from HTML. Variable names begin with the dollar sign ($) and can contain letters, numbers (except for the first character), underscores (_), and some others (ascii 127+). Instructions end with a semicolon (;). Here's a very basic example:

<?php
$some_variable = "Hello World!";
echo $some_variable;
?>

This will simply store the text "Hello World!" in a variable then output it. You can do a lot more with php, but these are just some of the basics. You can find more help with learning php from w3schools. This language can also be used for command-line programs, and even graphical programs (via gtk).

Best of luck!