s.im.pl meta-metadata:
tutorial [advanced topics]

This tutorial covers advanced language features provided by meta-metadata, including filtering and extraction based on regular expressions, nested meta-metadata, and generic fields.

filtering extracted information
with regular expression

If you are not familiar with regular expressions, please refer to this tutorial.

Extracted information is often not in an immediately usable shape, and some post-processing is needed to normalize it into a uniform representation that the application can use. Regular expressions are a powerful tool for this task.

Meta-metadata supports filtering extracted information with a regular expression through an extra element <filter> inside a <scalar> field:


<scalar name="author_name" xpath="...">

  <filter regex="[A-Z][a-z]+, [A-Z]\." />

</scalar>

In the above example, the regular expression will be used to extract a name in the form of "Jobs, S." from the extracted information.
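
To see what this pattern selects, the following Python sketch applies the same regular expression to a made-up input string. It only illustrates the regex itself. The helper name apply_filter, the sample string, and the choice to return the first match (or None when nothing matches) are assumptions, not a description of how <filter> is implemented.

```python
import re

# Same pattern as in the <filter> element above.
AUTHOR_PATTERN = re.compile(r"[A-Z][a-z]+, [A-Z]\.")

def apply_filter(extracted):
    """Return the first substring matching the pattern, or None (assumed behavior)."""
    match = AUTHOR_PATTERN.search(extracted)
    return match.group(0) if match else None

# Hypothetical raw value, as an XPath expression might return it.
print(apply_filter("By Jobs, S. (2007)"))  # prints: Jobs, S.
```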


It is also possible to replace the matched part with another string. In the following example, the leading
"ISBN: " of the extracted information will be replaced by an empty string (and thus removed):


<scalar name="isbn" xpath="...">

  <filter regex="ISBN:\s+" replace="" />

</scalar>
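
The effect of the regex and replace pair can be reproduced with an ordinary regex substitution. The sketch below uses Python's re.sub on a sample value. The function name and the sample ISBN are assumptions. Note that re.sub replaces every match; whether <filter> replaces all matches or only the first is not stated here.

```python
import re

def strip_isbn_label(extracted):
    """Replace each match of ISBN:\\s+ with the empty string."""
    return re.sub(r"ISBN:\s+", "", extracted)

# Sample input for illustration only.
print(strip_isbn_label("ISBN: 978-0-13-110362-7"))  # prints: 978-0-13-110362-7
```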

information extraction with regular expression

Meta-metadata allows extracting information from a flat string into a <composite> or a <collection> element, using regular expressions or other means, through the <field_parser> element.

The two basic <field_parser>s are regex_find and regex_split. The former applies to a <composite>: it takes the string for that <composite> (evaluated from the XPath or the direct binding tag name) as input, matches it against the specified regular expression, and outputs the indexed capture groups. Subfields nested inside the <composite> can set field_parser_key to $1, $2, and so on, to assign a capture group to that field. For example:


<composite name="citation_info" xpath="//h1">

  <field_parser name="regex_find" regex="(\d+) citations .* (\d+) self" />

    <scalar name="total_citation" field_parser_key="$1" />

    <scalar name="self_citation" field_parser_key="$2" />

</composite>

In this example, the field parser uses XPath "//h1" to retrieve the citation information string (e.g. "40 citations -- 2 self") from CiteSeerX, matches it against the regular expression to capture the two numbers, and assigns them to the two nested scalar fields.
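
The capture-group assignment described above can be sketched in Python. This mirrors only the regex matching and the mapping from $1 and $2 to field names. The function name regex_find and the FIELD_KEYS dictionary are illustrative assumptions, not the S.IM.PL implementation.

```python
import re

# Same pattern as in the <field_parser> element above.
CITATION_PATTERN = re.compile(r"(\d+) citations .* (\d+) self")

# Maps each nested field to its capture group index ($1 -> 1, $2 -> 2).
FIELD_KEYS = {"total_citation": 1, "self_citation": 2}

def regex_find(text):
    """Return a dict of field name -> captured string, or {} if there is no match."""
    match = CITATION_PATTERN.search(text)
    if match is None:
        return {}
    return {field: match.group(index) for field, index in FIELD_KEYS.items()}

print(regex_find("40 citations -- 2 self"))
# prints: {'total_citation': '40', 'self_citation': '2'}
```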

regex_split uses the regular expression as a delimiter to split the input string into a set of values, and assigns each value to an element in a collection (if there is more than one nested field in the collection, the field with field_parser_key set to $0 receives the value). For instance, the following example splits a list of author names, separated by commas with optional surrounding spaces, into a collection:


<collection name="authors" xpath="...">

  <field_parser name="regex_split" regex="\s*,\s*" />

    <scalar name="author_name" field_parser_key="$0" />

</collection>
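
The splitting step can be illustrated with Python's re.split and the same delimiter pattern. The names in the sample string are placeholders, and dropping empty pieces is an assumption made for this sketch rather than documented regex_split behavior.

```python
import re

def regex_split(text):
    """Split on commas with optional surrounding whitespace, skipping empty pieces."""
    return [part for part in re.split(r"\s*,\s*", text) if part]

# Each list element would become one author_name in the collection.
print(regex_split("Ada Lovelace , Alan Turing,Grace Hopper"))
# prints: ['Ada Lovelace', 'Alan Turing', 'Grace Hopper']
```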

The <field_parser> mechanism is extensible: derive a subclass from FieldParser and register it under a name with FieldParserFactory. For example, bibtex handles input strings in BibTeX format, and regex_split_and_find combines the functionality of regex_split and regex_find. We encourage you to experiment with them, or to write your own parser when needed.
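
The idea of registering parsers by name can be sketched generically. The Python code below is not the FieldParser or FieldParserFactory API (whose signatures are not shown in this tutorial). It is a hypothetical registry illustrating the general pattern, with an invented key_value parser as the custom example.

```python
import re

# Hypothetical name -> parser registry; NOT the actual FieldParserFactory.
PARSERS = {}

def register_parser(name):
    """Decorator that registers a parser function under the given name."""
    def decorator(func):
        PARSERS[name] = func
        return func
    return decorator

@register_parser("regex_split")
def regex_split(text, regex):
    return re.split(regex, text)

@register_parser("key_value")  # invented custom parser, for illustration only
def key_value(text, regex):
    """Split text on the delimiter regex, then split each item at the first '='."""
    result = {}
    for item in re.split(regex, text):
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result

def parse(name, text, regex):
    """Look up a parser by name and apply it."""
    return PARSERS[name](text, regex)

print(parse("key_value", "year=2010; pages=12", r";\s*"))
# prints: {'year': '2010', 'pages': '12'}
```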

nested meta-metadata

As with nested class definitions in Java or other OOP languages, you can extend a new meta-metadata type from an existing one right at the place you need it, typically on a <composite> or <collection> field. The nested meta-metadata type will be visible only in the encompassing meta-metadata and its subtypes, or inside the field itself. For example:

<meta_metadata name="search" extends="compound_document">

  <collection name="search_results" child_type="search_result" />

</meta_metadata>


<meta_metadata name="delicious_search" extends="search">

  <collection name="search_results" child_type="delicious_search_result" child_extends="search_result" >

    <scalar name="author" scalar_type="String" />

    <collection name="tags" child_scalar_type="String" />

  </collection>

</meta_metadata>

search_result is a basic type for search results, with a title, a description (the snippet), and a location (the target document URL). For Delicious search, however, we would also like to get the author and tag names. One could create a new meta-metadata type extending search_result with the new field definitions as an independent meta_metadata in the repository, but it is also possible to define it inside delicious_search, where the type is actually used, as shown in the above example. Note the use of child_extends to implicitly specify that this is a new type and to explicitly specify its supertype.

In the above example, the new type delicious_search_result will only be visible to delicious_search and its subtypes, or inside the field search_results itself.

generic fields

When a composite (or collection) field is inherited in a sub meta-metadata type, it is possible to change its type (or child_type) to a more specific type. In other words, one can define the base field using a generic type and specify a concrete sub-type for that field in the derived meta-metadata. Let's look at an example:

<meta_metadata name="search" extends="compound_document">

  <collection name="search_results" child_type="search_result" />

</meta_metadata>


<meta_metadata name="social_search" extends="search">


  <collection name="search_results" child_type="social_search_result" >

  </collection>

</meta_metadata>

In the above example we assume that there is an independent type social_search_result, which extends the ordinary search_result with fields specific to social network services. social_search, which extends search, can simply use social_search_result instead of search_result for the same field, to specify the concrete type used for this field in this context.
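
The rule that an inherited field may only be narrowed to a more specific type can be expressed as a subtype check along the extends chain. The Python sketch below is a hypothetical model of that rule. The EXTENDS table and both function names are assumptions, and whether S.IM.PL enforces the check in this way is not stated here.

```python
# Hypothetical "extends" relationships between meta-metadata types.
EXTENDS = {
    "social_search_result": "search_result",
    "delicious_search_result": "search_result",
}

def is_subtype(candidate, base):
    """Return True if candidate equals base or reaches it by following extends."""
    current = candidate
    while current is not None:
        if current == base:
            return True
        current = EXTENDS.get(current)
    return False

def override_child_type(inherited, overriding):
    """Accept the overriding child_type only if it is more specific than the inherited one."""
    if not is_subtype(overriding, inherited):
        raise ValueError(f"{overriding} does not extend {inherited}")
    return overriding

print(override_child_type("search_result", "social_search_result"))
# prints: social_search_result
```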

Note that more fundamental support for generic fields is under active development. The syntax may change in the near future.

This is not a complete list of the advanced features meta-metadata supports for modeling, extracting, and using complex metadata semantics. Building a system that supports real-world problems has involved many unexpected challenges and subtleties. For more details, please see our publications about meta-metadata and the S.IM.PL system.