1. Finally Getting Class-y
This is the third in a series of tutorials in which we build an RSS Reader with Python. If you have not read the earlier ones, it is best to start at the beginning, as this tutorial contains only part of the application.
This tutorial addresses the creation of a class for the feed and then finishes the program by creating a main() function and calling it. If you are at all unsure about Python classes, do yourself a favor and read about them in my tutorial and then come back. Go ahead now. I'll wait.
Other tutorials in this series:
Part 1 | 2 | 3 | 4 | 5
Get the Code!
2. Making a Model RSS Feed With Python
From my tutorial on classes, you know that a class is a blueprint for software objects. Object-oriented programming is a major boon for programmers, and Python's object-orientation makes development very easy. For this tutorial we will develop the blueprint for what RSS feed objects look like. It will not be as comprehensive as it could be, but then we do not need it to be.
The name of the class for our model feed will be, appropriately, ModelFeed. While defining an __init__ method is not always necessary in one's class, it is a good practice to have. So let's begin our class definition and its first method.
class ModelFeed:
    def __init__(self):
        self.data = []
Again, if you have any questions or do not understand what I have done here, see my tutorial on classes. Otherwise, let us move on to the finer parts of the class.
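As a quick check of what __init__ buys us, here is a minimal sketch. It repeats the class definition from above and creates two objects from it; the names first and second are mine and are only for illustration. I have written it with print() as a function so that it runs under Python 3.

```python
class ModelFeed:
    def __init__(self):
        # Runs once for each new object, giving it its own empty list.
        self.data = []


first = ModelFeed()
second = ModelFeed()

# Changing one object's list leaves the other object's list untouched.
first.data.append("a headline")
print(first.data)
print(second.data)
```

Each object gets its own data list because the list is created inside __init__ rather than shared at class level.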
3. Getting the Feed's URL for Python to Retrieve
One of the basic attributes of any feed is its address. For that, we will use a function that calls the address out of feedinfo. The code looks like this:
    def feeddata(self, feedname):
        feedaddress = feedinfo[feedname]
        return feedaddress
Here the function feeddata receives the variable feedname. It then accesses the dictionary feedinfo and asks for the value of that key. The value is then assigned to feedaddress and passed back to the calling function.
If you need a refresher on functions, refer back to my tutorial.
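To see the lookup in isolation, here is a small sketch. The feedinfo dictionary below is a hypothetical stand-in for the one built earlier in the series; its names and addresses are placeholders that I made up, not real feeds.

```python
# Hypothetical stand-in for the feedinfo dictionary from earlier in the
# series; the feed names and addresses are placeholders.
feedinfo = {
    "example": "http://www.example.com/rss.xml",
    "news": "http://www.example.org/news/feed.xml",
}


class ModelFeed:
    def __init__(self):
        self.data = []

    def feeddata(self, feedname):
        # Look up the address stored under the key feedname.
        feedaddress = feedinfo[feedname]
        return feedaddress


feed = ModelFeed()
print(feed.feeddata("example"))
```

Note that asking for a name that is not in the dictionary raises a KeyError, which is one reason the program still needs error checking.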
4. Python's urllib2 Module
Now we get to the function you have been waiting to see. How do we use the urllib2 library? Consider the code and follow my commentary below.
    def getlinks(self, address):
        file_request = urllib2.Request(address)
        file_opener = urllib2.build_opener()
        file_object = file_opener.open(file_request)
        file_feed = file_object.read()
First, we define getlinks as receiving an address, a URL. Second, we use that URL to create an instance of urllib2's Request object; let's call it file_request. It is essentially a web-based form of a file, and we have simply made Python aware of its existence.
Once the request object has been created, we can build an opener for it. For this we use urllib2's build_opener function. Just as accessing a local file requires the built-in open command, so we need to build an opener for the web-based file, establishing a connection to it. build_opener creates an object (of type OpenerDirector, for those who would like to know). That object naturally has its own attributes and methods, of which open is one. However, we have not yet opened the file itself.
When the connection has been created, we can then open the web page with the open method of the opener. We then assign that open file object the name file_object. It may then be read like any other file object. This we do here, assigning the file data, which is the RSS feed, to file_feed.
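If you are working in Python 3, the urllib2 module no longer exists under that name; its Request class and build_opener function live in urllib.request. The sketch below follows the same three steps as the code above under that assumption. The function name fetch_feed is a hypothetical helper of my own, not part of the application.

```python
import urllib.request


def fetch_feed(address):
    """Return the raw bytes found at address (a hypothetical helper)."""
    # Describe the resource we want; nothing has been opened yet.
    file_request = urllib.request.Request(address)
    # Build an OpenerDirector with the default set of handlers.
    file_opener = urllib.request.build_opener()
    # Only now is the resource actually opened.
    with file_opener.open(file_request) as file_object:
        # In Python 3, read() returns bytes rather than a text string.
        return file_object.read()
```

You would call it with the feed's address, for example fetch_feed("http://www.example.com/rss.xml"), where the address is again only a placeholder.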
5. XML, Minidom and Parsing the String of Data
Having read the feed data into memory, we still need to do something with it. RSS feeds come in XML format. Therefore, we must parse the XML to get to the RSS data. Consider the next two lines of code.
        file_xml = minidom.parseString(file_feed)
        item_node = file_xml.getElementsByTagName("item")
Using the module xml.dom.minidom, we can use a simple XML parser to create a DOM tree. The minidom module makes two parsing functions available to the programmer: parse and parseString. The former takes a filename or an open file object and parses its contents into memory. The latter, as we have used here, is for strings that are already in memory.
Next, we need to get the items from the feed. Every RSS feed item is enclosed by a pair of <item> and </item> tags. Using minidom's method getElementsByTagName, we can access just those nodes of type item, leaving the rest of the document behind. getElementsByTagName returns a list of the nodes; these are here assigned to item_node. The list is actually returned as an object of type NodeList, thus allowing us to use NodeList methods to access its contents.
If you want to see what the feed looks like, put this line after file_xml is assigned the parsed string:
        print file_xml.toxml()
But be sure to remove it after you have seen enough; otherwise, the entire feed will be printed on the web page.
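The parsing step described in this section can be tried on its own without fetching anything. The sketch below is written for Python 3 and feeds parseString a tiny feed that I made up for illustration; the variable names match the ones used in getlinks.

```python
from xml.dom import minidom

# A tiny made-up feed, used only for illustration.
file_feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/2</link>
    </item>
  </channel>
</rss>"""

file_xml = minidom.parseString(file_feed)
item_node = file_xml.getElementsByTagName("item")

# The returned NodeList supports len() and indexing, much like a list.
print(len(item_node))
print(item_node[0].nodeName)
```

Notice that the channel's own title element is left behind; only the two item nodes are collected.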
6. Getting the Titles and Links and Forming the Output
Now, to get the text out of the XML structure. First, here is the code:
        linkdata = ""
        for item in item_node:
            title = item.childNodes[1]
            link = item.childNodes[3]
            ftitle = title.firstChild.data
            flink = link.firstChild.data
            linkdata = linkdata + "<a href=\"" + flink + "\" target=\"target\">" + ftitle + "</a>\n"
        return linkdata
The first line simply initializes a character string to which the loop will eventually add.
As stated earlier, getElementsByTagName returns a list of nodes. Therefore, we can apply the same iteration techniques to this list as to others. For every item in the object item_node, we can make several assignments. Before we discuss them, if this loop business looks a bit fuzzy to you, review loops in my tutorial.
The next two assignments employ the attribute childNodes, an attribute of Node objects. Since item_node is really just a list of Node objects, we can iterate over the list and treat each item as a Node object. childNodes gives a list of the current node's children. If we know the format of the XML, we can call out and assign specific parts of the node at will. In this instance, we want the elements at indices 1 and 3, that is, the second and fourth in the list. These correspond to the title and link elements. In a feed laid out with line breaks between its tags, the positions in between are text nodes that hold only that whitespace.
However, each of these assignments also includes the XML markup tags, <title> and <link>. Obviously, this would feature poorly in a web page. To access just the text, we use the firstChild attribute, which gives us the text node inside the element, and then that node's data attribute. data returns just the text, which we assign to a variable.
We can then use the two variables and some markup to build a string variable that holds the link data in HTML format. We then return linkdata.
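To see why indices 1 and 3 are the right ones, it helps to list the children of a single item. The following sketch, written for Python 3, uses one made-up item laid out with line breaks between its tags; the sample data is my assumption about how a typical feed is formatted.

```python
from xml.dom import minidom

# One made-up item, laid out with line breaks between the tags.
sample = """<item>
<title>First headline</title>
<link>http://www.example.com/1</link>
</item>"""

item = minidom.parseString(sample).documentElement

# List every child: the line breaks between tags show up as text nodes.
for index, child in enumerate(item.childNodes):
    print(index, child.nodeName)

# Pull out the text of the title and link, then build the HTML string.
ftitle = item.childNodes[1].firstChild.data
flink = item.childNodes[3].firstChild.data
linkdata = "<a href=\"" + flink + "\" target=\"target\">" + ftitle + "</a>\n"
print(linkdata)
```

If a feed orders its elements differently, or has no whitespace between tags, these positions shift. Looking the elements up by name, for instance with item.getElementsByTagName("title")[0], is one possible way around that, though it is not what our program does.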
7. A Class Without An Object is Useless
Having finished the class ModelFeed, we need to write a function that creates an object of this class. This will enable us to retrieve data by feed name. As the links will form the body of the HTML document, let's call this function formBody. The code looks like this:
def formBody(feedname):
    feed = ModelFeed()
    feedurl = feed.feeddata(feedname)
    body = feed.getlinks(feedurl)
    return body
If you have read my tutorials on functions and classes, this is straightforward. After declaring that formBody requires the variable feedname, we create an object, feed, of class ModelFeed.
We then pass feedname as an argument to the method feeddata of the new object. This gives us feedurl, the address of the feed.
The address is then passed to the method getlinks. This method returns the headlines and links in HTML format and may thus be assigned to body and returned.
You may ask why we do not simply put all of this into main(). After all, we are only using it one time. The more that one can abstract a problem in software development, the more modular the design is. Modular design makes it easier to troubleshoot the program, to maintain the program, and to extend the program. If you ever want to do something else with the returned data from getlinks, you do not need to gut main() and rewrite the function. What's more, because it is Python, you can call all the classes and functions defined here from other Python programs. If you are unsure how to do that, see my tutorials on importing modules.
8. Defining and Calling the main() Thing
Now we can define main(). Because of the high level of abstraction involved in this program, main() is quite small. The code is as follows:
def main():
    output = formBody(feedname)
    print output
That is it for now. We define output in terms of the function formBody and then print the results. We could simply pass the function call to print, but I have kept them separate for simplicity's sake.
Finally, the program is not finished until we call the main() function. To do this, we use the following code:
if __name__ == "__main__":
    main()
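The test on __name__ is what lets the same file serve both as a program and as a module. Python sets __name__ to "__main__" when the file is run directly and to the module's own name when it is imported. Here is a minimal sketch for Python 3; this main() is a stand-in of mine that only reports that it ran, not the real one from our program.

```python
def main():
    # Stand-in for the real main(); it only reports that it was called.
    message = "main() was called"
    print(message)
    return message


# True when the file is run directly, False when it is imported.
if __name__ == "__main__":
    main()
```

Run directly, the file prints its message. Imported from another program, it prints nothing until that program calls main() itself.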
Now, you can point your browser to this file on your web server and enjoy your own, personalised and Python-powered RSS Reader. As you do, keep in mind that all is not over for this project. In the future, we will be adding error checking, Unicode handling, and a few other essentials as well as features like images and computerised feed addition. So stay tuned.
