info extraction [meta-metadata]

This tutorial will show you how to create a meta_metadata object specific to UrbanSpoon, complete with the necessary information extraction rules.

We will start with a new meta_metadata object named urban_spoon_restaurant. The object will be of the restaurant type we created in the last tutorial, and can be placed in the same file after restaurant.

The object will contain all of the data fields we defined for the restaurant object; for now, the XPaths will be left blank. The selector specifies how this meta-metadata is selected when parsing documents. In this tutorial, we use the url_path_tree attribute of selector to define a base URL path for all documents that will use this meta-metadata.

<meta_metadata_repository name="urban_spoon" package="ecologylab.semantics.generated.library.tutorial.urbanspoon" >

<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath" comment="UrbanSpoon restaurant description page" >

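<selector url_path_tree="http://www.urbanspoon.com/r/" />

…

<collection name="genres" xpath="">
<scalar name="title" xpath="" />
<scalar name="location" xpath="" />
</collection>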

</meta_metadata>

</meta_metadata_repository>

The following key changes were made from the previous tutorial:

we use type instead of extends, which means the meta_metadata is used directly (with its attributes and subfields) without defining any new fields;
for meta_metadata, parser="xpath" means this object will use the XPath parser to extract data from the raw HTML. When the data is presented as XML, such as RSS or a web service API, direct binding (parser="direct") can be used to map the data directly to an XML tree;
we added a selector with the attribute url_path_tree="http://www.urbanspoon.com/r/". This is the URL path that all UrbanSpoon restaurant pages start with, so all pages with a matching URL path will be mapped to this meta_metadata object;
there are other ways of specifying a URL pattern in a selector. url_stripped specifies the non-query part of a URL (the part before '?'). url_regex allows a regular expression to be used as the URL pattern; for performance, a domain attribute is needed together with url_regex, e.g. domain="urbanspoon.com";
multiple selectors are allowed if there is more than one way to select the same wrapper.
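
As a rough sketch of how the selector variants above might sit side by side, consider the fragment below. Only the url_path_tree value and domain="urbanspoon.com" come from this tutorial; the url_stripped URL and the url_regex pattern are hypothetical values chosen for illustration, and in practice you would normally keep only the selectors you need.

```xml
<!-- Sketch only: the url_stripped and url_regex values are hypothetical. -->
<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath">

  <!-- Selects every URL that starts with this path (value from this tutorial) -->
  <selector url_path_tree="http://www.urbanspoon.com/r/" />

  <!-- Selects on the part of the URL before '?' (hypothetical URL) -->
  <selector url_stripped="http://www.urbanspoon.com/restaurant" />

  <!-- Selects by regular expression, paired with domain as described above
       (hypothetical pattern) -->
  <selector url_regex="http://www\.urbanspoon\.com/r/[0-9]+/.*" domain="urbanspoon.com" />

</meta_metadata>
```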

Note that when a meta_metadata reuses another one directly through type, one cannot define new fields, or use nested meta-metadata or generic fields in it. In those cases one will need to use extends to create a new type. We will talk about these two advanced topics later.

Also note that there are now two new fields contained within genres. Remember that genres was defined with child_type = document. These are the fields from the document type that we will be using: title - String and location - ParsedURL, to store the name of the food genre and a link to the respective UrbanSpoon search page.

XPath Expressions

If you are not familiar with XPath expressions please visit this tutorial.

We will look closely at the XPath for three distinct elements: title, pic, and genres.

Using Firebug, we can see the restaurant's title in the HTML:

<h1 class='page_title fn org'>Christopher's World Grill</h1>

…

Therefore a correct XPath expression for extracting the restaurant title would be:

//div[@id='directory']/div/h1

Now we will look at extracting an image source URL.

</a>

There are many images on the page, and none of the div ids or class names are very helpful. Instead of forming a path like the previous one, I used xPather to help me deduce the correct expression:

//div[@id='aside']/div[@class='list photos']/ul[1]/li[1]/div[@class='photo']/div[@class='image']/a/img/@src

When setting the XPath expressions for images, be sure to specify the src attribute.

The last XPath we will explore will be for genres. Because the genres field is a collection, the XPath will need to select a list of nodes.

Looking through the HTML and using xPather, I have formulated a correct expression:

//div[@id='secondary']/div[@class='cuisines']/fieldset/a

It returns three nodes, one for each of the listed genres. For each genre we want the name as a String and the link to its respective search page as a ParsedURL.

The XPaths for title and location inside genres can be applied to the nodes selected by the expression for genres. In other words, we can use relative XPaths for fields nested in a composite or collection field, though global XPaths are still allowed. Here these XPaths will be:

./text()
and
./@href

The XPath expressions for title and phone begin with the same sequence. We define a variable to store the common component of the XPath expressions to reduce repetition and assist maintenance. We use the def_var tag to define a variable of type node with the common XPath component.

<def_var name="main_block" type="node"
xpath="./div/h1" />

<def_var name="primary" type="node"
xpath="//div[@id='primary']" />

We can then reference this variable in the context_node attribute for title and phone, and shorten the XPath expressions for these fields to contain only the unique ending.
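
To illustrate the idea, here is a minimal sketch of a variable and a field that references it. The variable name directory_block and its XPath are hypothetical: they are derived by splitting the title expression //div[@id='directory']/div/h1 shown earlier, and the scalar element name is an assumption about how a field is written. The phone field would follow the same pattern with its own ending.

```xml
<!-- Sketch only: directory_block, its XPath, and the scalar element name are assumptions. -->
<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath">

  <!-- The common beginning of the expressions, stored as a node variable -->
  <def_var name="directory_block" type="node" xpath="//div[@id='directory']/div" />

  <!-- Only the unique ending remains; it is written relative to the variable -->
  <scalar name="title" xpath="./h1" context_node="directory_block" />

</meta_metadata>
```

With this split, a layout change that affects the shared beginning of the path only needs to be corrected in one place.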

The urban_spoon_restaurant meta_metadata object complete with XPath expressions:

<meta_metadata_repository name="urban_spoon" package="ecologylab.semantics.generated.library.tutorial.urbanspoon" >

<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath" comment="UrbanSpoon restaurant description page">

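<selector url_path_tree="http://www.urbanspoon.com/r/" />

…

<collection name="genres" xpath="//div[@id='secondary']/div[@class='cuisines']/fieldset/a">
<scalar name="title" xpath="./text()" />
<scalar name="location" xpath="./@href" />
</collection>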

</meta_metadata>
</meta_metadata_repository>

Note that XPaths may need to change over time, because websites may update their layout.

Now you've learned how to create data definitions (types, schemas, data models, or whatever you would like to call them) and attach extraction rules for specific information sources using meta-metadata. If you are a developer, you can now take a look at code samples, especially the NewMmTest program to see if your wrappers work correctly.

The next part of the tutorial will explain how to use semantic actions to implement control flows on extracted metadata, and connect them to your own application.