This tutorial covers advanced language features provided by meta-metadata, including regular-expression-based filtering and extraction, nested meta-metadata, and generic fields.
filtering extracted information with regular expressions
If you are not familiar with regular expressions, please refer to this tutorial.
Extracted information is often not in an immediately usable shape, and some post-processing is needed to normalize it into a uniform representation that the application can use. Regular expressions are a powerful tool for this task.
Meta-metadata supports filtering extracted information with a regular expression through an extra <filter> element inside a <scalar> field:
<scalar name="author_name" xpath="...">
  <filter regex="[A-Z][a-z]+, [A-Z]\." />
</scalar>
In the above example, the regular expression will be used to extract a name in the form of "Jobs, S." from the extracted information.
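The effect of a regex-only <filter> can be approximated outside meta-metadata with a few lines of Python. This is a minimal sketch, not the meta-metadata implementation: apply_filter is a hypothetical helper, the input string is made up, and returning the value unchanged when nothing matches is an assumption.

```python
import re

def apply_filter(value, regex):
    """Keep only the first substring of `value` that matches `regex`.

    Assumption: when nothing matches, the value is returned unchanged.
    """
    match = re.search(regex, value)
    return match.group(0) if match else value

# Same pattern as the <filter> above, applied to a made-up input string.
print(apply_filter("Written by Jobs, S. in 2010", r"[A-Z][a-z]+, [A-Z]\."))
```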
It is also possible to replace the matched part with another string. In the following example, the leading "ISBN: " of the extracted information is replaced by an empty string (and thus removed):
<scalar name="isbn" xpath="...">
  <filter regex="ISBN:\s+" replace="" />
</scalar>
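For comparison, the replace form of <filter> behaves much like a regular-expression substitution. The sketch below uses Python's re.sub. apply_replace_filter is a hypothetical helper, the ISBN value is made up, and replacing every match (rather than only the first) is an assumption.

```python
import re

def apply_replace_filter(value, regex, replace):
    """Replace every match of `regex` in `value` with `replace`.

    Assumption: all matches are replaced, not only the first one.
    """
    return re.sub(regex, replace, value)

# Same pattern and replacement as the <filter> above.
print(apply_replace_filter("ISBN: 978-0-13-110362-7", r"ISBN:\s+", ""))
```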
information extraction with regular expressions
Meta-metadata allows extracting information from a flat string into a <composite> or a <collection> element, using regular expressions or other means, through a <field_parser> element.
The two basic <field_parser>s are regex_find and regex_split. The former applies to a <composite>: it takes the string for that <composite> (evaluated from its XPath or direct binding tag name) as input, matches it against the specified regular expression, and outputs indexed capture groups. Subfields nested inside the <composite> can set field_parser_key to $1, $2, and so on to receive the corresponding capture group. For example:
<composite name="citation_info" xpath="//h1">
  <field_parser name="regex_find" regex="(\d+) citations .* (\d+) self" />
  <scalar name="total_citation" field_parser_key="$1" />
  <scalar name="self_citation" field_parser_key="$2" />
</composite>
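The capture-group-to-field mapping performed by regex_find can be sketched in Python as follows. This is only an illustration of the idea: the regex_find function here is a hypothetical stand-in for the parser, the input string is made up, and returning an empty result when nothing matches is an assumption.

```python
import re

def regex_find(value, regex, field_keys):
    """Map capture groups of the first match onto named fields.

    `field_keys` maps a field name to a key such as "$1" or "$2".
    Assumption: an empty dict is returned when nothing matches.
    """
    match = re.search(regex, value)
    if match is None:
        return {}
    return {name: match.group(int(key.lstrip("$")))
            for name, key in field_keys.items()}

# Same pattern and keys as the <composite> above, on a made-up heading.
citation_info = regex_find(
    "120 citations - 15 self",
    r"(\d+) citations .* (\d+) self",
    {"total_citation": "$1", "self_citation": "$2"},
)
print(citation_info)
```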
regex_split uses the regular expression as a delimiter to split the input string into a set of values and assigns each value to an element of a collection (if the collection has more than one nested field, the field whose field_parser_key is set to $0 receives the value). For instance, the following example splits a list of author names separated by commas (with optional surrounding spaces) into a collection:
<collection name="authors" xpath="...">
  <field_parser name="regex_split" regex="\s*,\s*" />
  <scalar name="author_name" field_parser_key="$0" />
</collection>
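The splitting step can likewise be sketched in Python. Again this is an illustration under stated assumptions rather than the actual parser: this regex_split is a hypothetical helper, the author names are made up, and dropping empty pieces is an assumption.

```python
import re

def regex_split(value, regex):
    """Split `value` on `regex`.

    Assumption: empty pieces (e.g. from a trailing delimiter) are dropped.
    """
    return [piece for piece in re.split(regex, value) if piece]

# Each piece becomes one collection element whose $0 field holds the value.
authors = [{"author_name": name}
           for name in regex_split("Alice Smith , Bob Jones,  Carol Wu", r"\s*,\s*")]
print(authors)
```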
The <field_parser> mechanism can be extended by deriving subclasses from FieldParser and registering them by name with FieldParserFactory. For example, bibtex handles input strings in BibTeX format, and regex_split_and_find combines the functionality of regex_split and regex_find. We encourage you to experiment with them or write your own parser when needed.
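The subclass-and-register pattern described above can be illustrated with a small Python analogue. The real FieldParser and FieldParserFactory belong to the meta-metadata code base and their signatures are not shown in this tutorial, so every class, method, and parser name below is hypothetical and only mirrors the general idea.

```python
import re

class FieldParser:
    """Hypothetical base class: turns one input string into a dict of values."""
    def parse(self, value):
        raise NotImplementedError

class KeyValueParser(FieldParser):
    """Hypothetical custom parser for strings like 'year=2010; pages=12'."""
    def parse(self, value):
        result = {}
        for pair in re.split(r"\s*;\s*", value.strip()):
            if "=" in pair:
                key, _, val = pair.partition("=")
                result[key.strip()] = val.strip()
        return result

class FieldParserFactory:
    """Hypothetical registry that maps parser names to parser instances."""
    def __init__(self):
        self._parsers = {}

    def register(self, name, parser):
        self._parsers[name] = parser

    def get(self, name):
        return self._parsers[name]

# Register the custom parser under a name, then look it up by that name.
factory = FieldParserFactory()
factory.register("key_value", KeyValueParser())
parsed = factory.get("key_value").parse("year=2010; pages=12")
print(parsed)
```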
nested meta-metadata
As with nested class definitions in Java or other OOP languages, a new meta-metadata type can be extended from an existing one right where it is needed, typically on a <composite> or <collection> field. The nested meta-metadata type is visible only in the enclosing meta-metadata and its subtypes, or inside the field itself. For example:
<scalar name="author" scalar_type="String" />
<collection name="tags" child_scalar_type="String" />
In the above example, the new type delicious_search_result will only be visible to subclasses of delicious_search (including itself), or inside the field search_result itself.
generic fields
When a composite (or collection) field is inherited in a derived meta-metadata type, its type (or child_type) can be changed to a more specific type. In other words, one can define the base field using a generic type and specify a concrete subtype for that field in the derived meta-metadata. Let's look at an example:
<collection name="search_results" child_type="social_search_result" >
Note that more fundamental support for generic fields is under active development. The syntax may change in the near future.
This is not a complete list of the advanced features meta-metadata supports for modeling, extracting, and using complex metadata semantics. Building a system that addresses real-world problems has raised many unexpected and unusual challenges and subtleties. For more details, please see our publications about the meta-metadata and S.IM.PL system.