s.im.pl meta-metadata:
tutorial [information extraction]
This tutorial will show you how to create a meta_metadata object specific to UrbanSpoon, complete with the necessary information extraction rules.

We will start with a new meta_metadata object named urban_spoon_restaurant. The object will be of the restaurant type we created in the last tutorial, and can be placed in the same file after restaurant.

The object will contain all of the data fields we defined for the restaurant object; for now, the XPaths will be left blank. The selector specifies how this meta-metadata is selected when parsing documents. In this tutorial, we use the url_path_tree attribute of selector to define a base URL path for all documents that will use this meta-metadata.
<meta_metadata_repository name="urban_spoon" package="ecologylab.semantics.generated.library.tutorial.urbanspoon" >

<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath" comment="UrbanSpoon restaurant description page" >

<selector url_path_tree="http://www.urbanspoon.com/r/" />

<scalar name="title" xpath="" />

<scalar name="phone" xpath="" />

<scalar name="pic" xpath="" />

<scalar name="link" xpath="" />

<scalar name="rating" xpath="" />

<scalar name="price_range" xpath="" />

<scalar name="map" xpath="" />

<collection name="genres" xpath="" >

<scalar name="title" xpath="" />

<scalar name="location" xpath="" />

</collection>

</meta_metadata>
</meta_metadata_repository>
The following key changes were made from the previous tutorial:
Note that when a meta_metadata reuses another one directly through type, it cannot define new fields or use nested meta-metadata or generic fields. In those cases you will need to use extends to create a new type. We will cover these two advanced topics later.
Also note that there are now two new fields contained within genres. Remember that genres was defined with child_type = document. These are the fields from the document type that we will be using: title (String) and location (ParsedURL), which store the name of the food genre and a link to its respective UrbanSpoon search page.
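As a rough mental model of the selector, url_path_tree can be read as a URL prefix: a document is handled by urban_spoon_restaurant when its URL starts with the base path. The Java sketch below illustrates only that reading. The class name, the method name, and the sample URLs are hypothetical, and the framework's actual matching logic may differ.

```java
// Illustrative sketch only: models url_path_tree as a plain prefix test.
// This is an assumption about the selector's behaviour, not framework code.
final class SelectorSketch {
    // Base path taken from the selector in the listing above.
    static final String URL_PATH_TREE = "http://www.urbanspoon.com/r/";

    // Returns true when the document URL falls under the base path.
    static boolean matches(String documentUrl) {
        return documentUrl != null && documentUrl.startsWith(URL_PATH_TREE);
    }

    public static void main(String[] args) {
        // Both URLs are made up for the example.
        System.out.println(matches("http://www.urbanspoon.com/r/114/example-restaurant"));
        System.out.println(matches("http://www.urbanspoon.com/c/114/example-city"));
    }
}
```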
XPath Expressions
If you are not familiar with XPath expressions please visit this tutorial.

We will look closely at the XPath for three distinct elements: title, pic, and genres.

Using Firebug, we can see the restaurant's title in the HTML:
<div class='hreview-aggregate' id='directory'>

<div class='item'>

<h1 class='page_title fn org'>Christopher's World Grill</h1>
Therefore a correct XPath expression for extracting the restaurant title would be:
//div[@id='directory']/div/h1
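To check an expression like this outside the framework, it can be evaluated with the JDK's built-in javax.xml.xpath API. The sketch below is not part of the meta-metadata library. It uses the HTML fragment shown above, with the two closing div tags added so that it parses as XML, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TitleXPathDemo {
    // Fragment from the tutorial; the closing </div> tags are added here
    // so that the XML parser accepts it.
    static final String HTML =
        "<div class='hreview-aggregate' id='directory'>"
      + "<div class='item'>"
      + "<h1 class='page_title fn org'>Christopher's World Grill</h1>"
      + "</div></div>";

    // Evaluates the title XPath against the fragment and returns its text.
    static String extractTitle() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(HTML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//div[@id='directory']/div/h1", doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractTitle());
    }
}
```

Real pages are rarely well-formed XML, so a standalone check like this generally needs an HTML-tolerant parser in front of it; the fragment here is small enough to tidy by hand.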

Now we will look at extracting an image source URL.
<a href="/rph/114/875031/27276/college-station-christopher-s-world-grill-christopher-s-world-grill-photo">

<img alt="Christopher's World Grill" height="130" src='/images/1/blank.gif' srcd="http://a1.urbancdn.com/w/s/ht/n1Pc0RahTuNJQK-130.jpg" width="130" />

</a>
There are many images on the page, and none of the div id or class names is very helpful. Instead of forming a path by hand like the previous one, I used xPather to help deduce the correct expression:
//div[@id='aside']/div[@class='list photos']/ul[1]/li[1]/div[@class='photo']/div[@class='image']/a/img/@srcd
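When writing XPath expressions for images, be sure to select the attribute that holds the image URL. This is usually src; on this page src points to a placeholder (blank.gif), so the expression selects srcd instead.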
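The difference between the two attributes can be seen by evaluating both against the img element shown above. This is a sketch using the JDK's javax.xml.xpath API, not the framework's extraction code. It uses a shortened expression (//a/img/@...) because only this part of the page is reproduced in the tutorial, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ImageXPathDemo {
    // The anchor and image elements as shown in the tutorial.
    static final String HTML =
        "<a href='/rph/114/875031/27276/college-station-christopher-s-world-grill-christopher-s-world-grill-photo'>"
      + "<img alt=\"Christopher's World Grill\" height='130' src='/images/1/blank.gif' "
      + "srcd='http://a1.urbancdn.com/w/s/ht/n1Pc0RahTuNJQK-130.jpg' width='130' />"
      + "</a>";

    // Returns the value of the named attribute on the img element.
    static String imageAttribute(String attributeName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(HTML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//a/img/@" + attributeName, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(imageAttribute("src"));   // the placeholder image
        System.out.println(imageAttribute("srcd"));  // the actual photo URL
    }
}
```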

The last XPath we will explore is for genres. Because the genres field is a collection, its XPath must select a list of nodes.

Looking through the HTML and using xPather, I formulated the following expression:
//div[@id='secondary']/div[@class='cuisines']/fieldset/a
It returns three nodes, one for each of the listed genres. For each genre we want the name as a String and the link to its respective search page as a ParsedURL.

The XPaths for title and location inside genres can be applied to the nodes selected by the expression for genres. In other words, fields nested in a composite or collection field can use relative XPaths, though global XPaths are still allowed. Here these XPaths will be:
./text()
and
./@href
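The two-step evaluation described above (select the list of nodes, then apply the relative XPaths to each node) can be sketched with the JDK's javax.xml.xpath API. This is an illustration, not the framework's implementation. The HTML fragment, the genre names, and the href values are invented to match the shape of the expression, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class GenresXPathDemo {
    // Hypothetical fragment shaped to match the genres XPath.
    static final String HTML =
        "<div id='secondary'><div class='cuisines'><fieldset>"
      + "<a href='/example/american'>American</a>"
      + "<a href='/example/seafood'>Seafood</a>"
      + "<a href='/example/steakhouse'>Steakhouse</a>"
      + "</fieldset></div></div>";

    // Returns one {title, location} pair per genre link.
    static List<String[]> extractGenres() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(HTML)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: the collection XPath selects a list of nodes.
            NodeList links = (NodeList) xpath.evaluate(
                "//div[@id='secondary']/div[@class='cuisines']/fieldset/a",
                doc, XPathConstants.NODESET);

            // Step 2: the relative XPaths are evaluated against each node.
            List<String[]> genres = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                Node link = links.item(i);
                String title = xpath.evaluate("./text()", link);
                String location = xpath.evaluate("./@href", link);
                genres.add(new String[] {title, location});
            }
            return genres;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        for (String[] genre : extractGenres()) {
            System.out.println(genre[0] + " -> " + genre[1]);
        }
    }
}
```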
Several XPath expressions begin with the same sequence; phone, link, and map, for example, all start from //div[@id='primary']. To reduce repetition and ease maintenance, we store each common component in a variable, using the def_var tag to define a variable of type node.
<def_var name="main_block" type="node"
xpath="//div[@id='directory']" />

<def_var name="primary" type="node"
xpath="//div[@id='primary']" />
We can then reference these variables in the context_node attribute of title and phone, and shorten the XPath expressions for these fields so that they contain only the unique ending.
<scalar name="title" context_node="main_block"
xpath="./div/h1" />

<scalar name="phone" context_node="primary" xpath="./h3" />
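Conceptually, a node variable used as context_node behaves like evaluating one XPath to obtain a node and then evaluating the field's relative XPath from that node. The sketch below emulates this with the JDK's javax.xml.xpath API. It is an assumption about the behaviour, not the framework's code. The HTML fragment, the phone number, and the website URL are made up, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ContextNodeDemo {
    // Hypothetical fragment shaped to match the "primary" expressions.
    static final String HTML =
        "<div id='primary'>"
      + "<h3>555-0100</h3>"
      + "<p class='website'><a class='ext' href='http://www.example.com/'>Website</a></p>"
      + "</div>";

    // Resolves the variable's XPath to a node, then evaluates the
    // field's relative XPath with that node as the context.
    static String extract(String variableXPath, String relativeXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(HTML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node context = (Node) xpath.evaluate(variableXPath, doc, XPathConstants.NODE);
            return xpath.evaluate(relativeXPath, context);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String primary = "//div[@id='primary']";
        System.out.println(extract(primary, "./h3"));
        System.out.println(extract(primary, "./p[@class='website']/a[@class='ext']/@href"));
    }
}
```

Keeping the shared prefix in one place means that a layout change to the primary block only requires updating the variable, not every field that depends on it.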
The urban_spoon_restaurant meta_metadata object, complete with XPath expressions:
<meta_metadata_repository name="urban_spoon" package="ecologylab.semantics.generated.library.tutorial.urbanspoon" >

<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath" comment="UrbanSpoon restaurant description page">

<selector url_path_tree="http://www.urbanspoon.com/r/" />

<def_var name="main_block" type="node"
xpath="//div[@id='directory']" />

<def_var name="primary" type="node"
xpath="//div[@id='primary']" />

<scalar name="title" context_node="main_block"
xpath="./div/h1" />

<scalar name="phone" context_node="primary" xpath="./h3" />

<scalar name="pic" xpath="//div[@id='aside']/
div[@class='list photos']/ul[1]/li[1]/
div[@class='photo']/div[@class='image']/a/img/@srcd" />

<scalar name="link" context_node="primary" xpath="./p[@class='website']/a[@class='ext']/@href" />

<scalar name="rating" xpath="//div[@id='vote_block']/
div[@class='score up']/div[@class='number']/
span[@class='digits percent-text rating average']" />

<scalar name="price_range" xpath="//div[@id='secondary']/div[@class='menu']/fieldset/
div[@class='price']/span[@class='pricerange']" />

<scalar name="map" context_node="primary" xpath="./div[@class='address adr']/a[1]/@href" />

<collection name="genres" xpath="//div[@id='secondary']/div[@class='cuisines']/fieldset/a" >

<scalar name="title" xpath="./text()" />

<scalar name="location" xpath="./@href" />

</collection>


</meta_metadata>
</meta_metadata_repository>
Note that these XPaths may need to change when a website updates its layout.

Now you've learned how to create data definitions (types, schemas, data models, or whatever you would like to call them) and attach extraction rules for specific information sources using meta-metadata. If you are a developer, you can now take a look at the code samples, especially the NewMmTest program, to see if your wrappers work correctly.

The next part of the tutorial will explain how to use semantic actions to implement control flows on extracted metadata, and connect them to your own application.