This document is a very rough initial attempt to define the process of extracting RDF triples from an HTML or XHTML document. The goal is for all RDFa implementations to extract the same collection of triples from a given document, by defining precisely the required behaviour (the Well-defined Behavior principle), including for invalid documents (the Handle Errors design principle) and dynamically modified documents.
Furthermore, the same processing rules ought to apply to text/html and application/xhtml+xml documents, and (as far as possible) the same sequences of characters ought to result in the same triples in both syntaxes, vaguely related to the DOM Consistency principle. (As such, this would override the processing rules defined in [RDFa in XHTML] for XHTML documents (as well as HTML documents) in conforming user agents.)
The [RDFa in HTML] document is limited to HTML 4, which does not define a processing model in adequate detail to support a strict definition of the processing requirements for RDFa. This document instead builds on HTML 5, which defines a processing model that is sufficient to achieve the goal and closely matches current implementations.
Feedback should probably go to public-html or public-rdf-in-xhtml-tf or #whatwg IRC or blogs or Twitter or wherever else you fancy.
Something about having either a "xmlns:prefix" attribute, or "prefix" in the xmlns namespace, on an ancestor
A string is a valid CURIE if it contains a ":" character; and the substring before the first ":" character either is the empty string, or else matches the NCName production defined in XML Namespaces and is in pseudo-namespace scope; and the substring after the first ":" character matches the irelative-ref production of RFC 3987 [IRIs].
A string is a valid URI or Safe CURIE if it is a valid URL; or if its first character is "[", its last character is "]", and the substring between those characters is a valid CURIE.
A string is a valid reserved word or CURIE if it is a link type; or if it is a valid CURIE.
Add the following global attributes, which may be specified on all HTML elements:
Also, pseudo-namespace attributes may be specified on all HTML elements.
The rev attribute may be specified on any element which allows the rel attribute.
about attribute
The about attribute states what the data is about.
Its value must be a valid URI or Safe CURIE.
content attribute
The content attribute supplies machine-readable content for a literal.
datatype attribute
The datatype attribute states the type of a literal.
Its value must be a valid CURIE.
property attribute
The property attribute expresses the relationships between a subject and some literal text.
Its value is a set of space-separated tokens. It must contain at least one token. Each token must be a valid CURIE.
resource attribute
The resource attribute expresses the partner resource of a relationship that is not intended to be 'clickable'.
Its value must be a valid URI or Safe CURIE.
rel and rev attributes
The rel attribute expresses relationships between two resources.
The rev attribute expresses reverse relationships between two resources.
Both attributes are sets of space-separated tokens. Each token must be a valid reserved word or CURIE.
It's vague how this ties in to HTML 5's definition of rel.
typeof attribute
The typeof attribute states the RDF type (or types) of the subject.
Its value is a set of space-separated tokens. It must contain at least one token. Each token must be a valid CURIE.
A pseudo-namespace attribute is an attribute whose name is "xmlns:", followed by a string that matches the NCName production defined in XML Namespaces.
These are not real XML Namespace attributes. In an XML parser, xmlns:prefix attributes have local name "prefix" and namespace "http://www.w3.org/2000/xmlns/". In a text/html parser, the same markup results in attributes with local name "xmlns:prefix" and in no namespace. This specification introduces the pseudo-namespace concept so that xmlns:prefix in text/html will work kind of (but not quite) like how it works in XML in terms of resolving CURIEs.
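The difference between the two parsers can be observed with Python's standard library. This is only an illustrative sketch: `xml.dom.minidom` stands in for a namespace-aware XML parser, and `html.parser` stands in for a text/html parser (it is not an HTML 5 parser and has no notion of namespaces, so `None` is used here to mean "no namespace"). The helper names `xml_view` and `html_view` are made up for this example.

```python
from html.parser import HTMLParser
from xml.dom import minidom

MARKUP = '<div xmlns:foaf="http://xmlns.com/foaf/0.1/"></div>'


def xml_view(markup):
    """List (namespace, local name, value) as a namespace-aware XML parser sees them."""
    element = minidom.parseString(markup).documentElement
    return [
        (attr.namespaceURI, attr.localName, attr.value)
        for attr in element.attributes.values()
    ]


class _AttributeCollector(HTMLParser):
    """Collects start-tag attributes; names arrive as plain, unsplit strings."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Assumption: None models "in no namespace" for this illustration.
        self.seen.extend((None, name, value) for name, value in attrs)


def html_view(markup):
    """List (namespace, local name, value) as a non-namespace-aware parser sees them."""
    collector = _AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen
```

With the same markup, `xml_view` reports the local name `foaf` in the xmlns namespace, while `html_view` reports the whole string `xmlns:foaf` as the attribute name.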
Originally copied from [RDFaSYNTAX]. This is blatant copyright infringement. Additions are marked like this, deletions are marked like this.
Processing would normally begin after the document to be parsed has been completely loaded. However, there is no requirement for this to be the case, and it is certainly possible to use a stream-based approach, such as SAX [http://en.wikipedia.org/wiki/SAX], to extract the RDFa information. However, if some approach other than the DOM traversal technique defined here is used, it is important to ensure that any meta or link elements processed in the head of the document honor any occurrences of base which may appear after those elements. (In other words, XHTML processing rules must still be applied, even if document processing takes place in a non-HTML environment such as a search indexer.)
Rather than stating this explicitly, we just rely on the general concept used throughout HTML 5 that any implementation is conforming as long as its output is identical to the specified algorithm.
At the beginning of processing, an initial [evaluation context] is created, as follows:
base element, if present;
base element. If some other XML dialect that supports @xml:base eventually implements RDFa, a conforming RDFa parser for that host language will likely process @xml:base and use its value to set [base].
The HTML 5 concept of 'base' is used instead of this.
Document;
The HTML 5 concept of 'language' is used instead of this.
Processing begins by applying the processing rules below to the document object, in the context of this initial [evaluation context]. All elements in the tree are also processed according to the rules described below, depth-first, although the [evaluation context] used for each set of rules will be based on previous rules that may have been applied.
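The depth-first traversal with an inherited [evaluation context] might be sketched as follows. This is a simplification under stated assumptions: `apply_rules` is a hypothetical callable standing in for the processing rules, the context is modelled as an immutable tuple purely for illustration, and only element nodes are passed to the rules.

```python
from xml.dom import minidom


def walk(node, context, apply_rules, visited):
    """Visit nodes depth-first; each element starts from its parent's resulting context."""
    if node.nodeType == node.ELEMENT_NODE:
        # apply_rules returns the context that this element's children start from.
        # It must not mutate the context it was given, so siblings are unaffected.
        context = apply_rules(node, context)
        visited.append((node.tagName, context))
    for child in node.childNodes:
        walk(child, context, apply_rules, visited)
    return visited


def record_path(element, context):
    """Toy rule set (hypothetical): the context is the path of ancestor tag names."""
    return context + (element.tagName,)


document = minidom.parseString("<a><b><c/></b><d/></a>")
trace = walk(document, (), record_path, [])
```

In this toy run, `d` receives the context produced by `a` alone; changes made while processing `b` and `c` do not leak into their sibling.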
The processing rules are:
For each attribute of the current element whose name is "xmlns:" followed by a non-zero-length string of characters p: Update the local list of URI mappings so that p is mapped onto the value of the attribute. (This might replace an existing map entry.)
For each attribute of the current element in the namespace http://www.w3.org/2000/xmlns/: Update the local list of URI mappings so that the attribute's local name is mapped onto the value of the attribute. (This might replace an existing map entry.)
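The two mapping rules above could be sketched like this. The representation is an assumption made for the example: attributes are given as `(namespace, local_name, value)` tuples with `None` meaning "no namespace", and the function copies the inherited mappings rather than updating them in place.

```python
XMLNS_NAMESPACE = "http://www.w3.org/2000/xmlns/"
PSEUDO_PREFIX = "xmlns:"


def update_uri_mappings(inherited, attributes):
    """Return the local list of URI mappings for one element.

    `attributes` is an iterable of (namespace, local_name, value) tuples;
    this tuple shape is hypothetical, chosen only for the sketch.
    """
    local = dict(inherited)  # copy, so the ancestors' mappings are left untouched
    for namespace, local_name, value in attributes:
        if namespace is None and local_name.startswith(PSEUDO_PREFIX):
            prefix = local_name[len(PSEUDO_PREFIX):]
            if prefix:  # the rule requires a non-zero-length string p
                local[prefix] = value  # may replace an existing entry
        elif namespace == XMLNS_NAMESPACE:
            local[local_name] = value  # may replace an existing entry
    return local
```

The first branch corresponds to what a text/html parser produces, the second to what an XML parser produces; both end up with the same prefix-to-URI entry.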
about attribute
Parse the about attribute as a URI or Safe CURIE. If this returns a URL (not an error), set new subject to this URL.
src attribute
Resolve the src attribute as a URL. If this returns a URL (not an error), set new subject to this URL.
resource attribute
Parse the resource attribute as a URI or Safe CURIE. If this returns a URL (not an error), set new subject to this URL.
href attribute
Resolve the href attribute as a URL. If this returns a URL (not an error), set new subject to this URL.
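The attribute order above can be sketched as a first-match lookup. Two readings are assumed here and are not confirmed by the text: only the first attribute that is present is consulted, and an error from that attribute leaves new subject unset rather than falling through to the next attribute. The callables `parse_uri_or_safe_curie` and `resolve_url` are hypothetical stand-ins that return `None` to signal an error.

```python
from urllib.parse import urljoin


def pick_new_subject(attributes, parse_uri_or_safe_curie, resolve_url):
    """Return the new subject from the first attribute present, or None if unset."""
    steps = (
        ("about", parse_uri_or_safe_curie),
        ("src", resolve_url),
        ("resource", parse_uri_or_safe_curie),
        ("href", resolve_url),
    )
    for name, handler in steps:
        if name in attributes:
            # Assumption: only the first matching attribute is used;
            # a None result models "an error" and leaves the subject unset.
            return handler(attributes[name])
    return None


BASE = "http://example.org/dir/doc"  # hypothetical document URL


def demo_resolve(value):
    """Stand-in for 'resolve as a URL relative to the current element'."""
    return urljoin(BASE, value)


def demo_parse(value):
    """Stand-in parser: treats every safe CURIE as an error, for demonstration only."""
    if value.startswith("["):
        return None
    return urljoin(BASE, value)
```

For example, an element carrying both src and href takes its subject from src, because src comes earlier in the list.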
Shouldn't these attributes be restricted to elements where they're meant to appear? (<a href>, <img src>, etc; not <b src> etc)
If this did not cause new subject to be set (previously: if no URI is provided by a resource attribute), then the first match from the following rules will apply:
If the current element is the head or body element then act as if there is an empty @about present, and process it according to the rule for @about, above;
if the current element has a typeof attribute, then [new subject] is set to be a newly created [bnode].
about attribute
Parse the about attribute as a URI or Safe CURIE. If this returns a URL (not an error), set new subject to this URL.
src attribute
Resolve the src attribute as a URL. If this returns a URL (not an error), set new subject to this URL.
If no URI is provided then the first match from the following rules will apply:
If the current element is the head or body element then act as if there is an empty @about present, and process it according to the rule for @about, above;
if the current element has a typeof attribute, then [new subject] is set to be a newly created [bnode];
Then the [current object resource] is set to the URI obtained from the first match from the following rules:
resource attribute
Parse the resource attribute as a URI or Safe CURIE. If this returns a URL (not an error), set the current object resource to this URL.
href attribute
Resolve the href attribute as a URL. If this returns a URL (not an error), set the current object resource to this URL.
Note that the final value of the [current object resource] will either be null (from initialization) or a full URI.
typeof attribute:
Parse the typeof attribute as a list of CURIEs; for each url in the returned list, generate the triple:
rel attribute:
Parse the rel attribute as a list of reserved words and CURIEs; for each url in the returned list, generate the triple:
rev attribute:
Parse the rev attribute as a list of reserved words and CURIEs; for each url in the returned list, generate the triple:
rel attribute:
Parse the rel attribute as a list of reserved words and CURIEs; for each url in the returned list, add the following to the list of incomplete triples:
rev attribute:
Parse the rev attribute as a list of reserved words and CURIEs; for each url in the returned list, add the following to the list of incomplete triples:
rdf:XMLLiteral.
The actual literal is either the value of @content (if present) or the textContent of the current element (previously: a string created by concatenating the value of all descendant text nodes, of the [current element] in turn).
The final string includes the datatype URI, as described in [RDF-CONCEPTS], which will have been obtained according to the section on CURIE and URI Processing.
Parse the datatype attribute as a CURIE. If this returns a URL (not an error), the literal's type must be set to this returned URL.
Additionally, if there is a value for [current language] then the value of the [plain literal] should include this language information, as described in [RDF-CONCEPTS].
If the primary language of the current element is not unknown, then the value of the [plain literal] must include this language information.
The actual literal is either the value of @content (if present) or the textContent of the current element (previously: a string created by concatenating the text content of each of the descendant elements of the [current element] in document order).
rdf:XMLLiteral.
The value of the [XML literal] is a string created by running the XML fragment serialization algorithm on the current element (previously: serializing to text, all nodes that are descendants of the [current element], i.e., not including the element itself), and giving it a datatype of rdf:XMLLiteral.
If the XML fragment serialization algorithm raises an exception, then abort this step without generating a triple, and continue to the next step.
HTML 5 requires HTML elements to be put in the http://www.w3.org/1999/xhtml namespace when parsed from text/html. The serialization will have to add any xmlns attributes necessary to preserve this namespace information.
If the text/html input contains any xmlns:* attributes in the fragment, then it will be impossible to serialise as XML (since the attribute's local name contains ":"). That's quite nasty. (It'd be nice to not rely on xmlns:* attributes at all.)
The [current object literal] is then used with each predicate to generate a triple as follows:
If the current element has a property attribute:
Parse the property attribute as a list of CURIEs; for each url in the returned list, generate the triple:
Once the triple has been created, if the [datatype] of the [current object literal] is rdf:XMLLiteral, then the [recurse] flag is set to false.
When instructed to parse a string as a CURIE, the UA must do the following:
If the string contains a ":" (colon) character
Let prefix be the part of the string before the first ":" character. Let reference be the part after the first ":" character.
Let uri be http://www.w3.org/1999/xhtml/vocab# concatenated with reference. Resolve uri as a URL relative to the current element. If this results in an error, return the error; otherwise, return the resulting URL.
Let uri be the result of looking up prefix in the local list of URI mappings, concatenated with reference. Resolve uri as a URL relative to the current element. If this results in an error, return the error; otherwise, return the resulting URL.
Return an error.
If the string does not contain a ":" character
Return an error.
When instructed to parse a string as a URI or Safe CURIE, the UA must do the following:
If the string's first character is "[" and its last character is "]"
Remove the first and last characters from the string, then parse the resulting string as a CURIE.
Resolve the string as a URL relative to the current element. If this results in an error, return the error; otherwise, return the resulting URL.
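A minimal sketch of the two parsing algorithms above, under several assumptions that are not stated in the text: the three branches after splitting on ":" are taken to be "prefix is empty", "prefix is in the local list of URI mappings", and "anything else"; `urllib.parse.urljoin` stands in for "resolve as a URL relative to the current element" (it does not report errors the way the HTML 5 algorithm can); and "return an error" is modelled with a made-up exception type `CurieError`.

```python
from urllib.parse import urljoin

XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"


class CurieError(ValueError):
    """Hypothetical error type standing in for 'return an error'."""


def parse_curie(text, uri_mappings, base):
    """Parse a string as a CURIE and return an absolute URL."""
    if ":" not in text:
        raise CurieError("no ':' character in %r" % text)
    prefix, reference = text.split(":", 1)  # split at the first ':' only
    if prefix == "":
        uri = XHTML_VOCAB + reference  # assumed branch: empty prefix
    elif prefix in uri_mappings:
        uri = uri_mappings[prefix] + reference  # assumed branch: known prefix
    else:
        raise CurieError("unknown prefix %r" % prefix)  # assumed branch: otherwise
    return urljoin(base, uri)


def parse_uri_or_safe_curie(text, uri_mappings, base):
    """Parse a string as a URI or Safe CURIE and return an absolute URL."""
    if text.startswith("[") and text.endswith("]"):
        return parse_curie(text[1:-1], uri_mappings, base)
    return urljoin(base, text)
```

With a mapping of `foaf` to `http://xmlns.com/foaf/0.1/`, the safe CURIE `[foaf:Person]` expands through the mapping, while an unbracketed value such as `#me` is simply resolved against the base.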
When instructed to parse a string as a list of CURIEs, the UA must do the following:
Let urls be a list of absolute URLs, initially empty.
Split the string on spaces. For each token in the returned list:
Let uri be http://www.w3.org/1999/xhtml/vocab# concatenated with the token. Resolve uri as a URL relative to the current element. If this results in an error, ignore the token and continue; otherwise, add the resulting URL to urls.
Should the concatenated string get resolved?
Return urls.
When instructed to parse a string as a list of reserved words and CURIEs, the UA must do the following:
Let urls be a list of absolute URLs, initially empty.
Split the string on spaces. For each token in the returned list:
If the token contains a ":" (colon) character
Parse the token as a CURIE. If this returns an error, ignore the token and continue; otherwise, add the resulting URL to urls.
Let uri be http://www.w3.org/1999/xhtml/vocab# concatenated with the token. Resolve uri as a URL relative to the current element. If this results in an error, ignore the token and continue; otherwise, add the resulting URL to urls.
Should the concatenated string get resolved?
Return urls.
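The list-parsing steps for reserved words and CURIEs might look like the sketch below. Assumptions: tokens without a colon take the vocab-prefix branch (the condition for that branch is not visible in the text above); `parse_curie` is a caller-supplied, hypothetical callable that returns an absolute URL or `None` on error; `urljoin` stands in for URL resolution; and "split on spaces" is taken to mean the HTML 5 space characters (space, tab, LF, FF, CR).

```python
import re
from urllib.parse import urljoin

XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"
SPACE_RUN = re.compile("[ \t\n\f\r]+")  # HTML 5 space characters


def parse_reserved_words_and_curies(text, parse_curie, base):
    """Return the list of absolute URLs for a rel/rev style attribute value."""
    urls = []
    for token in SPACE_RUN.split(text):
        if not token:
            continue  # leading or trailing space yields empty strings
        if ":" in token:
            url = parse_curie(token)
            if url is not None:  # on error, ignore the token and continue
                urls.append(url)
        else:
            # Assumed branch: no colon, so prepend the XHTML vocabulary URI.
            urls.append(urljoin(base, XHTML_VOCAB + token))
    return urls
```

Tokens that fail CURIE parsing are dropped silently, so one bad token does not prevent triples from being generated for the rest of the attribute value.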