XML: Structured Data Storage¶
XML stands for eXtensible Markup Language, and is a way to represent hierarchical (tree like) data in a text file. XML is commonly used to store and transfer data on the Internet. Python 3 has several library modules that allow a programmer to read and write XML. We will be using the xml.etree.ElementTree module.
XML Fundamentals¶
This section will introduce the idea of hierarchical data storage, trees, and XML terminology. If you already know all about XML and just want to learn how to read and write it in Python, skip to the next section.
A hierarchy is a tree-like structure where elements are arranged by parent/child or containment relationships. Think of a traditional organizational chart for a company, with a president at the top and vice-presidents underneath. Each vice-president has department managers who report up to them. You can say that the vice-presidents are “children” of the president, who is their “parent”, or that the president’s managerial responsibilities “contain” all of the vice-presidents.
XML is a way to represent tree-like collections of data with very strict parent/child or contains/contained-by relationships. Each XML document has a “root” which “contains” all of the other elements. In our XML example below, the “Orgchart” itself is the root node, which contains the President and the CEO nodes.
Here is a fictionally simplistic organization chart represented graphically.
Here is the same org chart represented textually using XML:
<orgchart>
  <president name="Herbert T. Walker, III" salary="234,000">
    Make Money!
    <vp name="Sally" vptype="Marketing">Get Customers!</vp>
    <vp name="Tom" vptype="Sales">Make Sales!</vp>
    <vp name="Cindy" vptype="Production">Build Widgets</vp>
  </president>
  <ceo name="Vicky" />
</orgchart>
In this example we have a single president (Herbert T. Walker, III) who is in charge of three VPs (Sally, Tom, and Cindy), who are in charge of Marketing, Sales, and Production respectively. Each VP has a job type (vptype), and some employees have a motto. For example, Herbert’s motto is “Make Money!”, while Sally’s is “Get Customers!”. The organizational chart also includes a chief executive officer, Vicky, who does not have a motto.
- Things to notice:
- Structurally, the beginning of an element is marked with a < character, followed by the name of the element. This is the beginning of the “start tag” of the element.
- The “start tag” can also contain named attributes for the element, such as name="Sally", and is ended with a > character.
- Our CEO and President (Vicky and Herbert) are at the same level, and are children of (or contained by) the orgchart element.
- All employees are contained by the root, or top level element named orgchart.
- The orgchart element is the root and contains all of the other elements. In this example it is the only element that does not contain named attributes, but nothing says that the root element may not also have attributes.
- If the element contains any text (mottos in this example) or other elements, those elements and text come between the “start tag” and the “end tag”.
- End tags have the same name as the start tag that they close, but are identified as “end tags” by the use of a / sign. (e.g. </vp> )
- If an element does not contain other elements or text, it does not need a separate “end tag” and the “start tag” can “self-end” by finishing with a /> instead of a singular > character. See our CEO Vicky for an example.
- A few rules about properly formatted XML documents that you need to know:
- Every well-formed XML file must have a single “root” or top-level element that contains all of the other elements.
- Elements can have any number of attributes, represented as name="value" pairs, but the value must always be quoted. (So, for example, you must use quotes even when representing numbers, such as salary="234000".)
- Elements can contain text <element>Text</element> and have no attributes.
- Elements can even be completely empty <element></element>, in which case you can also represent them with a single self-closing tag <element/>. (A completely empty element may sound useless, but its existence can act as a placeholder or a single “yes”/“no” bit of information.)
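As a rough illustration of these rules, the sketch below hands a few small XML strings to Python's parser and reports which ones it accepts. It relies on fromstring and ParseError from xml.etree.ElementTree; the helper name is_well_formed and the sample strings are made up for this example.

```python
import xml.etree.ElementTree as etree

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise.

    (Hypothetical helper, written only for this illustration.)
    """
    try:
        etree.fromstring(xml_text)
        return True
    except etree.ParseError:
        return False

samples = {
    "single root": "<orgchart><ceo name='Vicky' /></orgchart>",
    "two roots": "<ceo name='Vicky' /><president name='Herbert' />",
    "unquoted attribute value": "<president salary=234000 />",
    "empty element": "<placeholder></placeholder>",
}

# Print each label with the parser's verdict.
for label, xml_text in samples.items():
    print(label, "->", is_well_formed(xml_text))
```

The strings with two top-level elements and with an unquoted attribute value are rejected, while the other two parse without error.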
This section demonstrated how we can store data in a text file using a special format (XML). The benefit of using the XML format is that it makes it easy to read the data using Python or other programming languages, while still being reasonably easy for a human to read.
Using XML in Python¶
The xml.etree.ElementTree module uses objects to represent each element of an XML tree in your computer’s memory.
To write data to an XML file, you first create a hierarchical data structure made out of elements organized into parent/child relationships and then use a “Tree” object to write the entire tree out to a textual XML file on disk.
When you load a textual XML file from disk, the xml.etree.ElementTree module will give you a tree object that contains a hierarchical structure of Element objects. By walking through the hierarchy of the Element objects, you can navigate the XML tree and extract the data from it.
Writing a Tree¶
The following code will import the xml.etree.ElementTree module (naming it etree for easier access) and create the root of the Orgchart. Note that the first argument to the Element constructor is the element name.
import xml.etree.ElementTree as etree
rootVar = etree.Element("orgchart")
We can create another element for Vicky the CEO and then link it as a child of the orgchart node as follows. Note that subsequent keyword arguments to the constructor are attribute name/value pairs for the Element.
ceoNode = etree.Element("ceo", name="Vicky")
rootVar.append(ceoNode)
We added an attribute (name="Vicky") to the ceo element when we constructed it. You can also add attributes directly to a dictionary of attributes after the node is created. Here we add the President element and add data to the node after it is created. We add the name to the attribute dictionary (abbreviated attrib), and we add the motto to the node’s “text” variable:
presNode = etree.Element("president")
presNode.attrib["name"] = "Herbet T. Walker, III"
presNode.attrib["salary"] = "$234,000"
presNode.text = "Make Money!"
rootVar.append(presNode)
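The sketch below compares the ways of attaching attributes side by side: keyword arguments to the constructor, direct assignment into the attrib dictionary, and the set() method, which Element objects also provide. The attribute values used here are simplified placeholders chosen for this illustration.

```python
import xml.etree.ElementTree as etree

# 1. Keyword arguments passed to the constructor.
a = etree.Element("president", name="Herbert", salary="234000")

# 2. Direct assignment into the attrib dictionary after creation.
b = etree.Element("president")
b.attrib["name"] = "Herbert"
b.attrib["salary"] = "234000"

# 3. The set() method, one attribute at a time.
c = etree.Element("president")
c.set("name", "Herbert")
c.set("salary", "234000")

# All three elements end up with the same attribute dictionary.
print(a.attrib == b.attrib == c.attrib)

# get() reads an attribute, with an optional default if it is missing.
print(a.get("salary"), a.get("bonus", "N/A"))
```

Whichever style you choose, attribute values should be strings, which is why the salary is written as "234000" rather than as a number.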
Next we will add the three VPs that are child nodes of the president. Instead of creating each node and then appending it to the president node, we will create SubElements, which have their parent element specified when they are created. In this case, the first argument is the parent node and the second argument is the name of the node, while subsequent keyword arguments are name/value pairs for attributes. These SubElements are automatically linked as children of the presNode object when they are instantiated. Note that we are also using a for loop to generate as many VP nodes as we need from a list of data:
data = [ ("Marketing", "Sally", "Get Customers!"),
         ("Sales", "Tom", "Make Sales!"),
         ("Production", "Cindy", "Build Widgets!") ]
#Dynamically generate one VP element for each tuple in the list!
for VP in data:
    typeStr = VP[0]
    nameStr = VP[1]
    mottoStr = VP[2]
    #SubElement creates the node and links it to presNode in one step.
    vpNode = etree.SubElement(presNode, "vp", name=nameStr, vptype=typeStr)
    vpNode.text = mottoStr
Now we have six Element (or SubElement) nodes created in memory, and they are linked together (via calls to append or by creating SubElements already linked to a parent) into the same hierarchical structure as in the Orgchart above.
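One way to check the in-memory structure before writing anything to disk is sketched below. It builds a reduced version of the org chart (fewer nodes and shortened names, chosen only for this illustration) and inspects it with len(), iter(), and tostring(), all of which are part of xml.etree.ElementTree.

```python
import xml.etree.ElementTree as etree

# A reduced org chart, built the same way as in the text above.
rootVar = etree.Element("orgchart")
presNode = etree.SubElement(rootVar, "president", name="Herbert")
etree.SubElement(presNode, "vp", name="Sally").text = "Get Customers!"
etree.SubElement(rootVar, "ceo", name="Vicky")

# len() of an Element is the number of its direct children.
print(len(rootVar), len(presNode))

# iter() visits the element itself and then all of its descendants.
for node in rootVar.iter():
    print(node.tag, node.attrib)

# tostring() with encoding="unicode" returns the XML as a Python string.
xml_text = etree.tostring(rootVar, encoding="unicode")
print(xml_text)
```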
To write this data out to an XML file on disk, we simply need to put it into a Tree object, and then ask the Tree to write itself.
ourTree = etree.ElementTree(rootVar)
ourTree.write("orgchart.xml", "UTF-8")
This writes the entire tree to an XML file called orgchart.xml. The first argument is the name of the file to write, and the second argument is the encoding to use. Unless you are required to use an alternate encoding scheme for some reason, UTF-8 is the recommended encoding.
Note that the XML written to disk does not exactly match the textual example in the previous section. First, it has a header, <?xml version='1.0' encoding='UTF-8'?>, at the beginning of the file. This header was automatically added by the xml.etree.ElementTree module and reports the version of XML and the character encoding used. Second, because we added the CEO Vicky at the beginning of our code, she comes before the president and his child elements, although she is still contained by the orgchart root node. Finally, the file does not have line breaks to make it human readable, as XML is primarily designed for machine-to-machine data transfer. One way you could have line breaks appear in a computer-generated XML file is if the text of a node had a line break.
<?xml version='1.0' encoding='UTF-8'?>
<orgchart><ceo name="Vicky" /><president name="Herbet T. Walker, III" salary="$234,000"> ...
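If human-readable output matters, one option is sketched below. It assumes Python 3.9 or newer, where xml.etree.ElementTree provides an indent() function that adds line breaks and indentation to a tree in place. The tree here is a reduced org chart made up for this illustration, and tostring() is used instead of a file so the result can be printed directly.

```python
import xml.etree.ElementTree as etree

# A reduced org chart (illustrative data only).
rootVar = etree.Element("orgchart")
etree.SubElement(rootVar, "ceo", name="Vicky")
presNode = etree.SubElement(rootVar, "president", name="Herbert")
etree.SubElement(presNode, "vp", name="Sally").text = "Get Customers!"

# indent() (Python 3.9+) inserts whitespace between elements, in place.
etree.indent(rootVar, space="  ")

# Serialize to a string; ElementTree(rootVar).write(...) would give
# the same layout in a file.
pretty_text = etree.tostring(rootVar, encoding="unicode")
print(pretty_text)
```

Keep in mind that the added whitespace becomes part of the text stored in the elements, so a program that reads the file back may need to strip it.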
You can download the raw xml file and the complete source code here.
(Download writeOrgchart.py )
(Download orgchart.xml )
Reading an XML File¶
To read an XML file, we use the parse function from the xml.etree.ElementTree module, which returns an ElementTree object. The ElementTree object has a pointer to the root Element of the XML file in memory, and it can be retrieved using the getroot() method:
import xml.etree.ElementTree as etree
#Load the xml file into memory.
tree = etree.parse("orgchart.xml")
#Get a pointer to the root node
root = tree.getroot()
Once we have a variable (root) that points at the root node, we can look up the tag name (the .tag object variable or attribute), or look at any attributes (in the .attrib object variable) or text it contains (in a .text object variable). We can also iterate through all of the root’s child nodes (the nodes that it contains) by using a for loop (line 5).
Inside the for loop, we print the name attribute value from the .attrib dictionary, as well as the salary attribute value (if it exists, printing N/A instead if the attrib dictionary does not have a salary key). (lines 7 and 9)
#Print the tag name of the root node.
print(root.tag)

#Iterate through all the root's children (CEO/President)
for child in root:
    #Print the name attribute from the attrib dictionary.
    print( " " + child.attrib['name'])
    #Print salary attribute (if it exists!)
    print( " -" + child.attrib.get('salary',"N/A") )

    #Get a list of nodes with the name "vp" that are contained by
    #this child node.
    vps = child.findall("vp")

    #Note that vps will be an empty list for the CEO node,
    #but will contain 3 VP objects for the president node.
    #Iterate through the VP child nodes if the list isn't empty...
    for vp in vps:
        print(" " + vp.attrib['name'])
        print(" -" + vp.text)
Instead of iterating through all child nodes using a for loop, we can also use the .find('nodeName') or .findall('nodeName') methods on any Element object. The first retrieves the first child node that is named 'nodeName', and the second retrieves a list of all child nodes named 'nodeName'. In this example we use .findall("vp") to get a list of any/all vice president nodes contained by the President node (line 13). Note that because the CEO node has no vp children, the vps list will be empty in that case. Throughout this example, we print one or more blank spaces and/or hyphens before some entries to indicate the indentation level of the data (lines 7, 9, 19, and 20).
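The difference between these lookups can be sketched as follows. The XML string is a shortened org chart written for this illustration, parsed with fromstring() so the example does not depend on a file; the tag name "treasurer" is deliberately one that does not exist in the data.

```python
import xml.etree.ElementTree as etree

xml_text = """
<orgchart>
  <president name="Herbert">
    <vp name="Sally">Get Customers!</vp>
    <vp name="Tom">Make Sales!</vp>
  </president>
  <ceo name="Vicky" />
</orgchart>
""".strip()
root = etree.fromstring(xml_text)

president = root.find("president")     # first direct child named "president"
first_vp = president.find("vp")        # first "vp" child only
all_vps = president.findall("vp")      # list of every "vp" child
missing = root.find("treasurer")       # no such child: find() returns None
ceo_vps = root.find("ceo").findall("vp")   # no such children: empty list

# find() and findall() with a plain tag name look at direct children only,
# while iter() searches all descendants.
direct = root.findall("vp")
nested = list(root.iter("vp"))

print(first_vp.get("name"), len(all_vps), missing, ceo_vps)
print(len(direct), len(nested))
```

Because find() returns None when nothing matches, it is worth checking its result before reading .attrib or .text from it.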
(Download readOrgchart.py )
(Download orgchart.xml )
Exercises¶
Which 3 lines of code are needed to give you a root element from an XML file called “sample.xml”? (Remember to import the correct module.)
Draw a tree representing the following XML file as represented in memory by ElementTree objects after it is parsed:
<purchases>
  <store type="accessories">Jill's Place
    <item number="4">purses</item>
    <item number="7">scarves</item>
  </store>
  <store type="clothes">Kohls
    <item number="10">shirts</item>
    <item number="18">jeans</item>
  </store>
</purchases>
- For the two code listings below, write out the textual content of the XML file that they produce when run. You can run the code to check your result. Also draw a graphical representation of the XML data objects in memory, showing their connections, tag names, text, and attributes.
-Code listing 1:
import xml.etree.ElementTree as ET
root = ET.Element("Mom and Dad")
Sister = ET.SubElement(root, "Sister", age = '23')
PetAnimal = ET.SubElement(root, "Dog", age = '5')
You = ET.SubElement(root, "Student", age = '20')
grandparents = ET.Element("grandparents")
grandparents.append(root)
tree = ET.ElementTree(grandparents)
tree.write("test.xml","UTF-8")
-Code listing 2:
import xml.etree.ElementTree as et
def xmlFile():
    root = et.Element("Schools")
    school = et.SubElement(root, "Georgia Tech", name="Yellow Jackets")
    school.text = "This school is awesome"
    school = et.Element("UGA")
    school.set('name' ,'Bulldogs')
    school.text = "The hell with Georgia"
    root.append(school)
    tree = et.ElementTree(root)
    tree.write("schools.xml", "UTF-8")
Here is an XML file that contains telephone numbers and names.
<phonebook>
  <section>
    <person name="Alex">4048765432</person>
    <person name="Amber">5674325678</person>
  </section>
  <section>
    <person name="Bob">7656789</person>
    <person name="Bryan">8655678</person>
  </section>
  <section>
    <person name="Robert">5678765432</person>
  </section>
  <section>
    <person name="Sam">27654321</person>
    <person name="Sandra">4567890876</person>
    <person name="Sandy">7654321234</person>
    <person name="Shelly">8584738</person>
  </section>
</phonebook>
Write a function called phoneBook, which takes in a string containing a filename as its only parameter. The function should use the ElementTree module to retrieve the name of each person with the corresponding phone number and store these values in a dictionary. The person’s name is the key and phone number is the value. The function should return this dictionary.