RDFa in HTML 5

Totally Unofficial Unfinished Experimental Draft — 18 May 2009 22:23:24Z

Abstract

This document is a very rough initial attempt to define the process of extracting RDF triples from an HTML or XHTML document. The goal is for all RDFa implementations to extract the same collection of triples from a given document, by defining precisely the required behaviour (the Well-defined Behavior principle), including for invalid documents (the Handle Errors design principle) and dynamically modified documents.

Furthermore, the same processing rules ought to apply to text/html and application/xhtml+xml documents, and (as far as possible) the same sequences of characters ought to result in the same triples in both syntaxes, vaguely related to the DOM Consistency principle. (As such, this would override the processing rules defined in [RDFa in XHTML] for XHTML documents (as well as HTML documents) in conforming user agents.)

The [RDFa in HTML] document is limited to HTML 4, which does not define a processing model in adequate detail to support a strict definition of the processing requirements for RDFa. This document instead builds on HTML 5, which defines a processing model that is sufficient to achieve the goal and closely matches current implementations.

Feedback should probably go to public-html or public-rdf-in-xhtml-tf or #whatwg IRC or blogs or Twitter or wherever else you fancy.

Table of contents

  1 Document conformance
    1.1 New concepts
    1.2 New data types
    1.3 New attributes
      1.3.1 The about attribute
      1.3.2 The content attribute
      1.3.3 The datatype attribute
      1.3.4 The property attribute
      1.3.5 The resource attribute
      1.3.6 The rel and rev attributes
      1.3.7 The typeof attribute
      1.3.8 Pseudo-namespace attributes
  2 Processing model
  3 CURIE and URI processing

1 Document conformance

Document conformance for an "RDFa in HTML 5" document is equivalent to document conformance in HTML 5, with the exceptions defined in this section.

1.1 New concepts

A string prefix is in pseudo-namespace scope if:

Something about having either a "xmlns:prefix" attribute, or "prefix" in the xmlns namespace, on an ancestor

1.2 New data types

A string is a valid CURIE if it contains a ":" character; and the substring before the first ":" character either is the empty string, or else matches the NCName production defined in XML Namespaces and is in pseudo-namespace scope; and the substring after the first ":" character matches the irelative-ref production of RFC 3987 [IRIs].

A string is a valid URI or Safe CURIE if it is a valid URL; or if its first character is "[", its last character is "]", and the substring between those characters is a valid CURIE.

A string is a valid reserved word or CURIE if it is a link type; or if it is a valid CURIE.

1.3 New attributes

Add the following global attributes, which may be specified on all HTML elements:

Also, pseudo-namespace attributes may be specified on all HTML elements.

The rev attribute may be specified on any element which allows the rel attribute.

1.3.1 The about attribute

The about attribute states what the data is about. Its value must be a valid URI or Safe CURIE.

1.3.2 The content attribute

The content attribute supplies machine-readable content for a literal.

1.3.3 The datatype attribute

The datatype attribute states the type of a literal. Its value must be a valid CURIE.

1.3.4 The property attribute

The property attribute expresses the relationships between a subject and some literal text. Its value is a set of space-separated tokens. It must contain at least one token. Each token must be a valid CURIE.

1.3.5 The resource attribute

The resource attribute expresses the partner resource of a relationship that is not intended to be 'clickable'. Its value must be a valid URI or Safe CURIE.

1.3.6 The rel and rev attributes

The rel attribute expresses relationships between two resources. The rev attribute expresses reverse relationships between two resources.

Both attributes are sets of space-separated tokens. Each token must be a valid reserved word or CURIE.

It's vague how this ties in to HTML 5's definition of rel.

1.3.7 The typeof attribute

The typeof attribute states the RDF type (or types) of the subject. Its value is a set of space-separated tokens. It must contain at least one token. Each token must be a valid CURIE.

1.3.8 Pseudo-namespace attributes

A pseudo-namespace attribute is an attribute whose name starts with the string "xmlns:", followed by a string that matches the NCName production defined in XML Namespaces.

These are not real XML Namespace attributes. In an XML parser, xmlns:prefix attributes have local name "prefix" and namespace "http://www.w3.org/2000/xmlns/". In a text/html parser, the same markup results in attributes with local name "xmlns:prefix" and in no namespace. This specification introduces the pseudo-namespace concept so that xmlns:prefix in text/html will work kind of (but not quite) like how it works in XML in terms of resolving CURIEs.
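The difference between the two parsers can be observed with a small Python sketch. It is only an approximation: Python's html.parser is not an HTML 5 parser and has no notion of namespaces, so it stands in here for the text/html case; the names MARKUP, AttributeCollector, html_view and xml_view are made up for this illustration.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Hypothetical sample markup, used for both parsers.
MARKUP = '<div xmlns:foaf="http://xmlns.com/foaf/0.1/"></div>'


class AttributeCollector(HTMLParser):
    """Records the (name, value) attribute pairs of every start tag."""

    def __init__(self):
        super().__init__()
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        self.attributes.extend(attrs)


def html_view(markup):
    """Attributes as seen by an HTML-style parser: plain names, no namespace."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.attributes


def xml_view(markup):
    """First attribute as seen by a namespace-aware XML parser."""
    element = minidom.parseString(markup).documentElement
    attr = element.attributes.item(0)
    return (attr.namespaceURI, attr.localName, attr.value)


print(html_view(MARKUP))
print(xml_view(MARKUP))
```

The HTML-style view keeps "xmlns:foaf" as the whole attribute name, while the XML view reports local name "foaf" in the xmlns namespace, which is the mismatch the pseudo-namespace concept is meant to paper over.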

2 Processing model

Originally copied from [RDFaSYNTAX]. This is blatant copyright infringement. Additions are marked like this, deletions are marked like this.

Processing would normally begin after the document to be parsed has been completely loaded. However, there is no requirement for this to be the case, and it is certainly possible to use a stream-based approach, such as SAX [http://en.wikipedia.org/wiki/SAX] to extract the RDFa information. However, if some approach other than the DOM traversal technique defined here is used, it is important to ensure that any meta or link elements processed in the head of the document honor any occurrences of base which may appear after those elements. (In other words, XHTML processing rules must still be applied, even if document processing takes place in a non-HTML environment such as a search indexer.)

Rather than stating this explicitly, we just rely on the general concept used throughout HTML 5 that any implementation is conforming as long as its output is identical to the specified algorithm.

At the beginning of processing, an initial [evaluation context] is created, as follows:

Processing begins by applying the processing rules below to the document object, in the context of this initial [evaluation context]. All elements in the tree are also processed according to the rules described below, depth-first, although the [evaluation context] used for each set of rules will be based on previous rules that may have been applied.

The processing rules are:

  1. First, the local values are initialized, as follows:
    Note that some of the local variables are temporary containers for values that will be passed to descendant elements via an [evaluation context]. In some cases the containers will have the same name, so to make it clear which is being acted upon in the following steps, the local version of an item will generally be referred to as such.
  2. Next the [current element] is parsed for [URI mapping]s and these are added to the [local list of URI mappings]. Note that a [URI mapping] will simply overwrite any current mapping in the list that has the same name;
    Mappings are provided by @xmlns. The value to be mapped is set by the XML namespace prefix, and the value to map is the value of the attribute—a URI. Note that the URI is not processed in any way; in particular if it is a relative path it is not resolved against the current [base]. Authors are advised to follow best practice for using namespaces, which includes not using relative paths.
  3. Next the current element is parsed for URI mappings as follows:
    1. For each attribute in the current element that is in no namespace, and whose local name consists of the case-sensitive string "xmlns:" followed by a non-zero-length string of characters p: Update the local list of URI mappings so that p is mapped onto the value of the attribute. (This might replace an existing map entry.)
    2. For each attribute in the current element that is in the namespace http://www.w3.org/2000/xmlns/: Update the local list of URI mappings so that the attribute's local name is mapped onto the value of the attribute. (This might replace an existing map entry.)
  4. The [current element] is also parsed for any language information, and if present, [current language] is set accordingly;
    Language information can be provided using the general-purpose XML attribute @xml:lang.
  5. If the [current element] contains no @rel or @rev attribute, then the next step is to establish a value for [new subject]. Any of the attributes that can carry a resource can set [new subject];
    [new subject] is set to the URI obtained from the first match from the following rules:
    If the current element has an about attribute

    Parse the about attribute as a URI or Safe CURIE. If this returns a URL (not an error), set new subject to this URL.

    Otherwise, if the current element has a src attribute

    Resolve the src attribute as a URL. If this returns a URL (not an error), set new subject to this URL.

    Otherwise, if the current element has a resource attribute

    Parse the resource attribute as a URI or Safe CURIE. If this returns a URL (not an error), set new subject to this URL.

    Otherwise, if the current element has an href attribute

    Resolve the href attribute as a URL. If this returns a URL (not an error), set new subject to this URL.

    Shouldn't these attributes be restricted to elements where they're meant to appear? (<a href>, <img src>, etc; not <b src> etc)

    If this did not cause new subject to be set, then the first match from the following rules will apply:

  6. If the [current element] does contain a @rel or @rev attribute, then the next step is to establish both a value for [new subject] and a value for [current object resource]:
    [new subject] is set to the URI obtained from the first match from the following rules:
    If the current element has an about attribute

    Parse the about attribute as a URI or Safe CURIE. If this returns a URL (not an error), set new subject to this URL.

    Otherwise, if the current element has a src attribute

    Resolve the src attribute as a URL. If this returns a URL (not an error), set new subject to this URL.

    If no URI is provided then the first match from the following rules will apply:

    Then the [current object resource] is set to the URI obtained from the first match from the following rules:

    If the current element has a resource attribute

    Parse the resource attribute as a URI or Safe CURIE. If this returns a URL (not an error), set the current object resource to this URL.

    Otherwise, if the current element has an href attribute

    Resolve the href attribute as a URL. If this returns a URL (not an error), set the current object resource to this URL.

    Note that the final value of the [current object resource] will either be null (from initialization) or a full URI.

  7. If in any of the previous steps a [new subject] was set to a non-null value, it is now used to provide a subject for type values;
    One or more 'types' for the [new subject] can be set by using @typeof. If present, the attribute must contain one or more URIs, obtained according to the section on URI and CURIE Processing, each of which is used to generate a triple as follows:
    If the current element has a typeof attribute: Parse the typeof attribute as a list of CURIEs; for each url in the returned list, generate the triple:
    subject
    [new subject]
    predicate
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    object
    url
    Note that none of this block is executed if there is no [new subject] value, i.e., [new subject] remains null.
  8. If in any of the previous steps a [current object resource] was set to a non-null value, it is now used to generate triples:
    Predicates for the [current object resource] can be set by using one or both of the @rel and @rev attributes:
  9. If however [current object resource] was set to null, but there are predicates present, then they must be stored as [incomplete triple]s, pending the discovery of a subject that can be used as the object. Also, [current object resource] should be set to a newly created [bnode];
    Predicates for [incomplete triple]s can be set by using one or both of the @rel and @rev attributes:
  10. The next step of the iteration is to establish any [current object literal];
    Predicates for the [current object literal] can be set by using @property. If present, one or more URIs are obtained according to the section on CURIE and URI Processing, and then the actual literal value is obtained as follows:
    • as a [typed literal] if:
      • @datatype is present, and does not have an empty value, and is not set to rdf:XMLLiteral.

      The actual literal is either the value of @content (if present) or the textContent of the current element, that is, a string created by concatenating the values of all descendant text nodes of the [current element] in document order. The final string includes the datatype URI, as described in [RDF-CONCEPTS], which will have been obtained according to the section on CURIE and URI Processing. Parse the datatype attribute as a CURIE. If this returns a URL (not an error), the literal's type must be set to this returned URL.

    • as a [plain literal] if:
      • @content is present;
      • or all children of the [current element] are text nodes;
      • or there are no child nodes (in which case the literal value is the empty string);
      • or the body of the [current element] does have non-text child nodes but @datatype is present, with an empty value.

      Additionally, if there is a value for [current language] then the value of the [plain literal] should include this language information, as described in [RDF-CONCEPTS]. If the primary language of the current element is not unknown, then the value of the [plain literal] must include this language information. The actual literal is either the value of @content (if present) or the textContent of the current element, that is, a string created by concatenating the values of all descendant text nodes of the [current element] in document order.

    • as an [XML literal] if:
      • the [current element] has any child nodes that are not simply text nodes, and @datatype is not present, or is present, but is set to rdf:XMLLiteral.

      The value of the [XML literal] is a string created by running the XML fragment serialization algorithm on the current element (which serializes all nodes that are descendants of the [current element], i.e., not including the element itself), and giving it a datatype of rdf:XMLLiteral. If the XML fragment serialization algorithm raises an exception, then abort this step without generating a triple, and continue to the next step.

      HTML 5 requires HTML elements to be put in the http://www.w3.org/1999/xhtml namespace when parsed from text/html. The serialization will have to add any xmlns attributes necessary to preserve this namespace information.

      If the text/html input contains any xmlns:* attributes in the fragment, then it will be impossible to serialise as XML (since the attribute's local name contains ":"). That's quite nasty. (It'd be nice to not rely on xmlns:* attributes at all.)

    The [current object literal] is then used with each predicate to generate a triple as follows:

    If the current element has a property attribute: Parse the property attribute as a list of CURIEs; for each url in the returned list, generate the triple:

    subject
    [new subject]
    predicate
    url
    object
    [current object literal]

    Once the triple has been created, if the [datatype] of the [current object literal] is rdf:XMLLiteral, then the [recurse] flag is set to false.

  11. If the [skip element] flag is 'false', and [new subject] was set to a non-null value, then any [incomplete triple]s within the current context should be completed:
    The [list of incomplete triples] from the current [evaluation context] (not the [local list of incomplete triples]) will contain zero or more predicate URIs. This list is iterated, and each of the predicates is used with [parent subject] and [new subject] to generate a triple. Note that at each level there are two lists of [incomplete triple]s: one for the current processing level (which is passed to each child element in the previous step), and one that was received as part of the [evaluation context]. It is the latter that is used in processing during this step.
    Note that each [incomplete triple] has a [direction] value that is used to determine what will become the subject, and what will become the object, of each generated triple:
  12. If the [recurse] flag is 'true', all elements that are children of the [current element] are processed using the rules described here, using a new [evaluation context], initialized as follows:
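Steps 3 and 10 of the processing rules above can be sketched in Python as follows. This is a non-normative illustration under several assumptions: attributes are supplied as hypothetical (namespace, local name, value) tuples rather than DOM nodes, the datatype argument is the already-resolved URL (None if the attribute is absent, the empty string if it is empty), and where the plain-literal and XML-literal conditions overlap the plain-literal rules are assumed to win because they are listed first. The function names are invented for this sketch.

```python
XMLNS_NAMESPACE = "http://www.w3.org/2000/xmlns/"
RDF_XMLLITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


def update_uri_mappings(mappings, attributes):
    """Step 3: return a copy of mappings updated from one element's attributes.

    attributes is an iterable of (namespace, local_name, value) tuples, where
    namespace is None for attributes in no namespace.
    """
    attributes = list(attributes)
    updated = dict(mappings)  # copy, so the inherited mappings stay untouched
    # Step 3.1: no-namespace attributes named "xmlns:" + non-empty prefix.
    for namespace, local_name, value in attributes:
        if (namespace is None and local_name.startswith("xmlns:")
                and len(local_name) > len("xmlns:")):
            updated[local_name[len("xmlns:"):]] = value
    # Step 3.2: attributes in the xmlns namespace, keyed by local name.
    for namespace, local_name, value in attributes:
        if namespace == XMLNS_NAMESPACE:
            updated[local_name] = value
    return updated


def literal_kind(has_content, datatype, has_non_text_children):
    """Step 10: decide which kind of literal the current element produces."""
    if datatype and datatype != RDF_XMLLITERAL:
        return "typed"
    # Assumption: plain-literal conditions are checked before XML literal.
    if has_content or not has_non_text_children or datatype == "":
        return "plain"
    return "xml"
```

For example, an element with child elements and no datatype attribute yields literal_kind(False, None, True) == "xml", while the same element with datatype="" yields "plain".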

3 CURIE and URI processing

When instructed to parse a string as a CURIE, the UA must do the following:

If the string contains a ":" (colon) character

Let prefix be the part of the string before the first ":" character. Let reference be the part after the first ":" character.

If prefix is the empty string

Let uri be http://www.w3.org/1999/xhtml/vocab# concatenated with reference. Resolve uri as a URL relative to the current element. If this results in an error, return the error; otherwise, return the resulting URL.

Otherwise, if prefix is in the local list of URI mappings

Let uri be the result of looking up prefix in the local list of URI mappings, concatenated with reference. Resolve uri as a URL relative to the current element. If this results in an error, return the error; otherwise, return the resulting URL.

Otherwise

Return an error.

Otherwise, the string contains no ":" character

Return an error.
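The CURIE parsing rules above can be sketched in Python roughly as follows. This is an illustration, not a conforming implementation: "resolve as a URL relative to the current element" is approximated with urllib.parse.urljoin against a base URL passed in by the caller, an error is represented by None, and the names parse_curie, uri_mappings and base_url are assumptions of this sketch.

```python
from urllib.parse import urljoin

XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"


def parse_curie(value, uri_mappings, base_url):
    """Return the absolute URL for a CURIE, or None to signal an error."""
    if ":" not in value:
        return None  # no ":" character: error
    prefix, reference = value.split(":", 1)  # split at the first ":"
    if prefix == "":
        uri = XHTML_VOCAB + reference
    elif prefix in uri_mappings:
        uri = uri_mappings[prefix] + reference
    else:
        return None  # unknown prefix: error
    try:
        # Approximation of resolving relative to the current element.
        return urljoin(base_url, uri)
    except ValueError:
        return None


mappings = {"foaf": "http://xmlns.com/foaf/0.1/"}
print(parse_curie("foaf:name", mappings, "http://example.org/doc"))
print(parse_curie(":next", mappings, "http://example.org/doc"))
```

Because the mapped URI is not resolved when the mapping is recorded, a relative mapping value such as "ns/" only becomes absolute at this point, after the reference has been appended.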


When instructed to parse a string as a URI or Safe CURIE, the UA must do the following:

If the string is not empty, and its first character is "[" and its last character is "]"

Remove the first and last characters from the string, then parse the resulting string as a CURIE.

Otherwise

Resolve the string as a URL relative to the current element. If this results in an error, return the error; otherwise, return the resulting URL.
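A minimal Python sketch of the URI or Safe CURIE rules, under the same kind of assumptions: URL resolution is approximated with urllib.parse.urljoin against a caller-supplied base URL, errors are represented by None, and the helper _parse_curie is a condensed restatement of the CURIE rules in this section. All function and parameter names are invented for the illustration.

```python
from urllib.parse import urljoin

XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"


def _resolve(uri, base_url):
    """Approximate 'resolve as a URL relative to the current element'."""
    try:
        return urljoin(base_url, uri)
    except ValueError:
        return None


def _parse_curie(value, uri_mappings, base_url):
    """Condensed form of the CURIE rules; None signals an error."""
    prefix, colon, reference = value.partition(":")
    if not colon:
        return None
    if prefix == "":
        return _resolve(XHTML_VOCAB + reference, base_url)
    if prefix in uri_mappings:
        return _resolve(uri_mappings[prefix] + reference, base_url)
    return None


def parse_uri_or_safe_curie(value, uri_mappings, base_url):
    """Bracketed values are CURIEs; everything else is treated as a URL."""
    if value and value[0] == "[" and value[-1] == "]":
        return _parse_curie(value[1:-1], uri_mappings, base_url)
    return _resolve(value, base_url)


mappings = {"foaf": "http://xmlns.com/foaf/0.1/"}
print(parse_uri_or_safe_curie("[foaf:me]", mappings, "http://example.org/doc"))
print(parse_uri_or_safe_curie("#me", mappings, "http://example.org/doc"))
```

Note that an unbracketed "foaf:me" is not expanded through the mappings; it is treated as a URL whose scheme happens to be "foaf".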


When instructed to parse a string as a list of CURIEs, the UA must do the following:

  1. Let urls be a list of absolute URLs, initially empty.

  2. Split the string on spaces. For each token in the returned list:

    Let uri be http://www.w3.org/1999/xhtml/vocab# concatenated with the token. Resolve uri as a URL relative to the current element. If this results in an error, ignore the token and continue; otherwise, add the resulting URL to urls.

    Should the concatenated string get resolved?

  3. Return urls.


When instructed to parse a string as a list of reserved words and CURIEs, the UA must do the following:

  1. Let urls be a list of absolute URLs, initially empty.

  2. Split the string on spaces. For each token in the returned list:

    If the token contains a ":" (colon) character

    Parse the token as a CURIE. If this returns an error, ignore the token and continue; otherwise, add the resulting URL to urls.

    Otherwise

    Let uri be http://www.w3.org/1999/xhtml/vocab# concatenated with the token. Resolve uri as a URL relative to the current element. If this results in an error, ignore the token and continue; otherwise, add the resulting URL to urls.

    Should the concatenated string get resolved?

  3. Return urls.
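The algorithm for a list of reserved words and CURIEs might look like this in Python. It is a sketch under stated assumptions: "split on spaces" is taken to mean splitting on the HTML 5 space characters (space, tab, line feed, form feed, carriage return), URL resolution is approximated with urllib.parse.urljoin against a caller-supplied base URL, and the function and parameter names are hypothetical.

```python
import re
from urllib.parse import urljoin

XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"


def split_on_spaces(value):
    """Split on runs of space characters, dropping empty tokens."""
    return [token for token in re.split(r"[ \t\n\f\r]+", value) if token]


def parse_reserved_words_and_curies(value, uri_mappings, base_url):
    """Return the list of absolute URLs; tokens that fail are ignored."""
    urls = []
    for token in split_on_spaces(value):
        if ":" in token:
            # Token is a CURIE: expand the prefix, or skip it on error.
            prefix, reference = token.split(":", 1)
            if prefix == "":
                uri = XHTML_VOCAB + reference
            elif prefix in uri_mappings:
                uri = uri_mappings[prefix] + reference
            else:
                continue
        else:
            # Token is treated as a reserved word in the XHTML vocabulary.
            uri = XHTML_VOCAB + token
        try:
            urls.append(urljoin(base_url, uri))
        except ValueError:
            continue
    return urls


print(parse_reserved_words_and_curies(
    "next foaf:knows bogus:x",
    {"foaf": "http://xmlns.com/foaf/0.1/"},
    "http://example.org/",
))
```

The sketch does not check that a colon-free token is actually a known link type; like the algorithm above, it simply appends the token to the vocabulary URI.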