The HTML vocabulary establishes a draft standard to semantically represent HTML documents in RDF. The vocabulary is based on the HTML Living Standard. It includes classes for HTML elements, datatype properties for attributes, and SHACL shapes for code serialization. HTML documents can thus be represented, queried, generated, validated, analysed, transformed and reused as semantic objects themselves. As HTML documents are used in a myriad of use cases, from websites, dashboards and applications to social media and documents, this vocabulary helps organisations and individuals get a better grasp of their information products.
Introduction
In today's fast-paced business landscape, organizations grapple with the formidable challenge of information management. The demand for robust information products and agile information processing is relentless, constantly evolving with each passing moment, transcending industry boundaries. In this dynamic environment, the creation, validation, and utilization of information products are paramount, serving as the bedrock for informed decision-making and organizational effectiveness. The ability to swiftly respond to unforeseen crises or seize emerging opportunities hinges on having the right answers to previously unasked questions.
So, how can organizations effectively navigate this information-intensive terrain? A key part of the solution lies in the capability to exercise complete control over the generation, modification, validation, and reuse of information products. This entails the profound ability to construct and deconstruct any arbitrary information product, dissecting it from its foundational elements to its broader, strategic significance.
To address this need, we have pioneered the development of the RDF-based HTML vocabulary - a transformative framework designed to facilitate the management of an array of information artifacts built upon HTML.
Background
In an era where information resembles a constantly flowing stream, organizations must grapple with pressing challenges in five areas: velocity, variety, insightfulness, adaptability and validity. Velocity: the speed at which information is generated, transformed, and consumed is unparalleled, and keeping pace with it is essential for timely decision-making. Variety: information comes in diverse formats, from structured data to unstructured text, images, and multimedia, and effective management requires handling this variety seamlessly. Insightfulness: information is not merely data; it embodies insights, functionality, and knowledge, and extracting these is vital for organizational success. Adaptability: organizations must be agile in adapting information products to address new crises and opportunities swiftly. Validity: ensuring the correctness, compliance and reliability of information is fundamental to trustworthy decision-making and operations.
Objective
In response to these challenges, we introduce the RDF-based HTML vocabulary. With this vocabulary, organizations gain the capacity to (1) construct and deconstruct, (2) generate and validate and (3) adapt and reuse. Constructing and deconstructing means that one can craft and dismantle information products at will, from their fundamental building blocks to their overarching significance. Generating and validating comes down to creating new information products and rigorously validating their accuracy and completeness. Finally, adapting and reusing is about adapting existing information products to swiftly respond to evolving scenarios and reusing valuable components across a spectrum of contexts. This innovative vocabulary forms the cornerstone of an information management ecosystem that harmonizes the power of RDF with the ubiquity of HTML. It revolutionizes how organizations handle information artifacts, enabling them to seamlessly transition from raw data to strategic insights and practical solutions.
Audience
This document is intended for a diverse audience of web developers, content managers, semantic web enthusiasts, and anyone seeking to enhance the sustainability of information management, information processing and information technology.
Overview
Description
The HTML vocabulary establishes a draft standard that enables the semantic representation of any HTML document in RDF. The vocabulary is based on the Living Standard of HTML and offers classes to represent HTML elements, datatype properties to represent HTML attributes and a small number of SHACL shapes and SPARQL functions for the serialisation of HTML code.
#### Basic components of the ontology
##### HTML document
The concept of an HTML document is central to the HTML Living Standard and, accordingly, to the HTML vocabulary. An HTML document is a document written in the HyperText Markup Language (HTML), the standard markup language for documents designed to be displayed in a web browser. The HTML of that document defines its content and structure. An HTML document contains nodes such as HTML elements, text, comments, processing instructions, CDATA sections and a document type declaration. Hence, the ontology provides the following main classes to model instances of these HTML components:
***Main classes***
```
html:Document
html:DocumentType
html:Element
html:Text
html:Comment
html:ProcessingInstruction
html:CDATASection
```
Let us take a look at an example in order to get a better understanding of the vocabulary.
*Example:*
```
prefix doc:
prefix html: <https://www.w3.org/html/model/def/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# An example of a simple HTML document. Note how a specific HTML document is an instance of the class html:Document.
# This HTML document has two child nodes, a docType declaration and the root node 'html' modeled using the rdf:_1 and rdf:_2 statements.
# Note how this HTML document has not yet been serialized to HTML, as the doc:example node does not yet have an html:fragment statement.
doc:example a html:Document ;
rdf:_1 doc:doctype ;
rdf:_2 doc:html .
# A doctype declaration stating this is document type html
doc:doctype a html:DocumentType ;
html:documentTypeName "html" .
# The root node with two child nodes, a head node and a body node
doc:html a html:Html ;
rdf:_1 doc:head ;
rdf:_2 doc:body .
# The head node contains in this example two child nodes, a title and a stylesheet node
doc:head a html:Head ;
rdf:_1 doc:title ;
rdf:_2 doc:style .
# The title node contains a text node
doc:title a html:Title ;
rdf:_1 doc:titleText .
# Note that text nodes always should have a html:fragment statement
doc:titleText a html:Text ;
html:fragment "Tutorial Document Example"^^rdf:HTML .
# The style node contains a text node
doc:style a html:StyleSheet ;
rdf:_1 doc:styleText .
# Note that text nodes always should have a html:fragment statement
doc:styleText a html:Text ;
html:fragment """
table {
width: 70%;
margin: 0 auto;
border-collapse: collapse;
}
caption {
text-align: left;
font-weight: bold;
padding: 10px;
background-color: #f2f2f2; /* Light gray */
}
th, td {
padding: 12px;
text-align: center;
border: 1px solid #ddd; /* Light gray border */
}
th {
background-color: #4CAF50; /* Green */
color: white;
}
"""^^rdf:HTML .
# The body node contains a text node
doc:body a html:Body ;
rdf:_1 doc:bodyText .
# Note that text nodes always should have a html:fragment statement
doc:bodyText a html:Text;
html:fragment 'Hello world!'^^rdf:HTML .
```
##### Document Object Model (DOM)
***HTML and DOM alignment***
The components of an HTML document together form a tree-like structure of nodes known as the Document Object Model (DOM), a programmatic representation of the document that allows it to be manipulated dynamically. For this reason a separate DOM vocabulary was created alongside the HTML vocabulary. Classes like html:Document, html:Text, html:Comment, html:ProcessingInstruction and html:DocumentType are defined as subclasses of dom:Document, dom:Text, dom:Comment, dom:ProcessingInstruction and dom:DocumentType, respectively. Through its RDF-based representation of the Document Object Model, the DOM vocabulary thus provides the necessary context for the HTML vocabulary.
***Relation between HTML and DOM classes***
```
html:Document is a subclass of dom:Document.
html:DocumentType is a subclass of dom:DocumentType.
html:Element is a subclass of dom:Element.
html:Text is a subclass of dom:Text.
html:Comment is a subclass of dom:Comment.
html:ProcessingInstruction is a subclass of dom:ProcessingInstruction.
html:CDATASection is a subclass of dom:CDATASection.
```
***Node index***
The index of a DOM node within an HTML document, indicating its position relative to any preceding siblings, is modeled using properties that are instances of the class rdfs:ContainerMembershipProperty, such as rdf:_1, rdf:_2, rdf:_3, and so on.
*Example:*
```
prefix doc:
prefix html: <https://www.w3.org/html/model/def/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# This HTML document has two child nodes, a docType declaration and the root node 'html' modeled using the rdf:_1 and rdf:_2 statements.
doc:example a html:Document ;
rdf:_1 doc:doctype ;
rdf:_2 doc:html .
```
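To illustrate how the node index can be read back from such statements, here is a minimal Python sketch using only the standard library. It is not part of the vocabulary: the helper names `member_index` and `ordered_children` are hypothetical, and triples are represented as plain tuples of strings rather than through an RDF library.

```python
import re

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# rdf:_n is a container membership property when n is a positive
# integer written without leading zeros (rdf:_1, rdf:_2, ...).
_MEMBER_RE = re.compile(re.escape(RDF_NS) + r"_([1-9][0-9]*)")

def member_index(prop):
    """Return n for the IRI of rdf:_n, or None for any other property."""
    match = _MEMBER_RE.fullmatch(prop)
    return int(match.group(1)) if match else None

def ordered_children(triples, parent):
    """Return the child nodes of `parent`, sorted by membership index."""
    indexed = [
        (member_index(p), o)
        for s, p, o in triples
        if s == parent and member_index(p) is not None
    ]
    return [child for _, child in sorted(indexed)]

# Hypothetical input: the document from the example, in arbitrary order.
triples = [
    ("doc:example", RDF_NS + "_2", "doc:html"),
    ("doc:example", RDF_NS + "type", "html:Document"),
    ("doc:example", RDF_NS + "_1", "doc:doctype"),
]
print(ordered_children(triples, "doc:example"))  # ['doc:doctype', 'doc:html']
```

Because the order is carried by the property itself, the statements can be written in any order without changing the document structure.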
##### HTML element
An HTML element is the building block of an HTML document. For each HTML element, its name and its definition as given in the corresponding section of the HTML Living Standard are modeled in the ontology using the SKOS properties skos:prefLabel and skos:definition. In addition, the tag of the HTML element (like 'body' for the body element) is represented through the property html:tag. Each element is linked to the section of the HTML Living Standard in which it is described, using the dct:conformsTo property.
*Example:*
```
html:Body
a owl:Class;
dct:conformsTo section:4.3.1;
html:tag 'body';
rdfs:subClassOf html:NormalElement;
skos:definition 'The body HTML element is the second element in a root html element. It contains the contents of the document.'@en;
skos:prefLabel 'the body element'@en;
rdfs:isDefinedBy html:.
```
A specific instance of the body element can be seen in the example below.
*Example:*
```
# An instance of a body element, with a connected text node
doc:body a html:Body ;
rdf:_1 doc:bodyText .
# Note that text nodes always should have a html:fragment statement
doc:bodyText a html:Text;
html:fragment 'Hello world!'^^rdf:HTML .
```
***Kinds of elements***
According to the Living Standard of HTML there are six different kinds of elements: void elements, the template element, raw text elements, escapable raw text elements, foreign elements and normal elements. In addition, the Living Standard also distinguishes the custom element, making for a total of seven different kinds of HTML elements.
These kinds of elements are represented in the HTML vocabulary as follows:
```
html:VoidElement
    html:Area, html:Base, html:Br, html:Col, html:Embed, html:Hr, html:Img, html:Input, html:Link, html:Meta, html:Source, html:Track, html:Wbr
html:Template
html:RawTextElement
    html:Script, html:Style
html:EscapableRawTextElement
    html:Textarea, html:Title
html:ForeignElement
    html:MathML, html:SVG
html:NormalElement
    All other subclasses of html:Element that are not custom elements
html:CustomElement
```
There exists a subclass relation between a specific element class (for instance: html:Area) and the kind of element (for instance html:VoidElement).
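The classification above can be sketched as a simple lookup from tag name to kind of element. The following Python snippet is only an illustration under stated assumptions: the function `element_kind` is hypothetical, the foreign elements are identified by the tags `math` and `svg`, and any tag containing a hyphen is treated as a custom element, which simplifies the rule for valid custom element names in the Living Standard.

```python
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
RAW_TEXT = {"script", "style"}
ESCAPABLE_RAW_TEXT = {"textarea", "title"}
FOREIGN = {"math", "svg"}  # assumption: root tags of MathML and SVG content

def element_kind(tag):
    """Return the vocabulary class naming the kind of element for a tag."""
    tag = tag.lower()
    if tag in VOID:
        return "html:VoidElement"
    if tag == "template":
        return "html:Template"
    if tag in RAW_TEXT:
        return "html:RawTextElement"
    if tag in ESCAPABLE_RAW_TEXT:
        return "html:EscapableRawTextElement"
    if tag in FOREIGN:
        return "html:ForeignElement"
    if "-" in tag:  # simplification, see the assumptions above
        return "html:CustomElement"
    return "html:NormalElement"

for tag in ("br", "title", "flag-icon", "body"):
    print(tag, element_kind(tag))
```

The kind matters for serialization: a void element, for example, has a start tag only and no content or end tag.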
##### Content category
The Living Standard of HTML states that each HTML element falls into zero or more content categories that group elements with similar characteristics together. The following classes are used in the vocabulary to model these content categories:
***Content category classes***
```
html:MetadataContent is a subclass of html:ContentCategory.
html:FlowContent is a subclass of html:ContentCategory.
html:SectioningContent is a subclass of html:ContentCategory.
html:HeadingContent is a subclass of html:ContentCategory.
html:PhrasingContent is a subclass of html:ContentCategory.
html:EmbeddedContent is a subclass of html:ContentCategory.
html:InteractiveContent is a subclass of html:ContentCategory.
html:PalpableContent is a subclass of html:ContentCategory.
html:ScriptSupportingElement is a subclass of html:ContentCategory.
html:FormAssociatedElement is a subclass of html:ContentCategory.
html:LabelableElement is a subclass of html:ContentCategory.
```
These content categories are all subclasses of html:ContentCategory. Note that the HTML Living Standard defines its content categories across multiple sections of the specification. As the content categories are defined in the vocabulary as complex OWL classes, one can use an OWL inference engine to establish which element belongs to which category.
##### HTML attribute
The Living Standard of HTML defines an HTML attribute as a key-value pair that is associated with an HTML element to control its behavior, written within the start tag of an element. Examples are 'style' and 'colspan'. The ontology provides the 'html:Attribute' class to represent the set of all these HTML attributes, like the attributes 'html:style' and 'html:colspan'. Note that 'html:Attribute' itself is a subclass of dom:Attribute, a node in the Document Object Model (DOM). The key of the HTML attribute (like 'colspan' in 'html:colspan') is represented through the property html:key. In addition, every HTML attribute in the HTML vocabulary is also defined as a subproperty of 'html:attribute'. For each HTML attribute, its name and its definition as given in the corresponding section of the HTML Living Standard are modeled in the ontology using the SKOS properties skos:prefLabel and skos:definition.
*Example:*
```
html:colspan
a owl:DatatypeProperty;
rdfs:subPropertyOf html:attribute;
html:key 'colspan';
rdf:type html:Attribute;
rdfs:domain html:Cell;
rdfs:range xsd:nonNegativeInteger;
skos:definition "Specifies the number of columns a table cell should span."@en;
skos:prefLabel 'the colspan attribute'@en;
rdfs:isDefinedBy html:.
```
Next to the already defined HTML attributes in the HTML vocabulary (conforming to the Living Standard of HTML), custom defined attributes can also be used in RDF-based HTML documents. These attributes need to be defined in some arbitrary vocabulary using the same design pattern as is shown in the example. An example of an existing vocabulary with a custom defined attribute is the RDFa vocabulary, based on the formal RDFa specification of W3C.
*Example:*
```
rdfa:Attribute
rdf:type owl:Class;
rdfs:subClassOf dom:Attribute;
dct:conformsTo section:AttributesAndSyntax;
skos:definition """An RDFa attribute is used within web documents (HTML, XHTML, or XML) to embed structured metadata in the form of RDF (Resource Description Framework) triples. These attributes either define RDF concepts (like subjects, predicates, and objects) or modify existing HTML attributes to include RDF semantics. RDFa attributes allow semantic data to be expressed directly in web content without altering its visual presentation for human readers."""@en;
skos:prefLabel 'attribute'@en;
rdfs:isDefinedBy rdfa:.
rdfa:property
a owl:DatatypeProperty;
rdf:type rdfa:Attribute;
dct:conformsTo section:AttributesAndSyntax;
rdfs:domain dom:Element;
rdfs:range xsd:string;
rdfa:key 'property';
skos:prefLabel 'the property attribute'@en;
skos:definition "Specifies a property for a DOM element."@en;
rdfs:isDefinedBy rdfa:.
```
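To show why each attribute carries its key as a separate value, the following Python sketch renders a start tag from a tag name and a mapping of attribute keys to values. It is a simplified illustration, not the vocabulary's own function: the names `escape_attribute_value` and `start_tag` are hypothetical, and only `&` and `"` are escaped, which is less than the attribute-mode escaping the Living Standard prescribes.

```python
def escape_attribute_value(value):
    """Escape characters that would break a double-quoted attribute value."""
    # The ampersand is replaced first so that the entities added
    # for quotes are not escaped a second time.
    return value.replace("&", "&amp;").replace('"', "&quot;")

def start_tag(tag, attributes):
    """Render a start tag; `attributes` maps attribute keys to values."""
    parts = [tag]
    for key, value in attributes.items():
        parts.append(key + '="' + escape_attribute_value(str(value)) + '"')
    return "<" + " ".join(parts) + ">"

print(start_tag("td", {"colspan": 2, "title": 'a "quoted" & b'}))
# <td colspan="2" title="a &quot;quoted&quot; &amp; b">
```

The same routine works for custom-defined attributes such as the RDFa 'property' attribute, since it relies only on the key and the value.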
##### HTML serialisation
An RDF-based HTML document, HTML element, text node, document type declaration, HTML comment or CDATA section can have an associated HTML fragment through the html:fragment property, representing the HTML code of the node itself, its child nodes (if any) and its HTML attributes (if any). If an RDF-based node of an HTML document has an HTML fragment through this property, it is said to be serialized to HTML. To serialize an HTML document to actual HTML code based on its RDF representation, the ontology provides the SHACL-based node shape shp:HTMLFragmentSerializationAlgorithm, with its associated target target:HTMLFragmentSerializationAlgorithm and rule rule:HTMLFragmentSerializationAlgorithm. In doing so, we make use of the SHACL Advanced Features (see https://www.w3.org/TR/shacl-af/).
Let us have a closer look at how this works.
***shp:HTMLFragmentSerializationAlgorithm***
This node shape uses its target to find nodes that do not yet have an HTML fragment and then applies its rule to give those nodes actual HTML code. The logic behind the shape is that an HTML document can only be serialized bottom-up, from the leaves of the DOM tree to its root: an arbitrary element can only be serialized to HTML code once all of its child nodes have been serialized, including the HTML code of any attributes. The leaves of the tree are text nodes and other nodes without child nodes, such as void elements, comments and processing instructions. Please note that instances of html:Text must already have an html:fragment statement (holding the text itself, without markup) before the serialization is run; otherwise the serialization will not result in a fully rendered HTML document. In the first iteration, the nodes without child nodes can be transformed into HTML code immediately, without traversing the tree in depth. Each subsequent iteration of the SHACL engine then works its way further up the tree, using the previously obtained fragments, until all nodes in the document have received an HTML fragment. Processing (by calling the SHACL engine iteratively) halts the moment the document itself has an html:fragment statement.
```
shp:HTMLFragmentSerializationAlgorithm
a sh:NodeShape;
sh:rule rule:HTMLFragmentSerializationAlgorithm;
sh:target target:HTMLFragmentSerializationAlgorithm;
skos:prefLabel 'Node shape for HTML fragment serialization algorithm'@en;
skos:definition 'A node shape with an algorithm to serialize an HTML fragment for a node in an HTML document.'@en;
rdfs:isDefinedBy html:.
```
***target:HTMLFragmentSerializationAlgorithm***
This target looks for nodes that do not have an HTML fragment yet, but whose child nodes (if any) all have an HTML fragment.
```
target:HTMLFragmentSerializationAlgorithm
a sh:SPARQLTarget;
skos:prefLabel 'SPARQL target for HTML fragment serialization algorithm'@en;
skos:definition 'A SPARQL Target to select all nodes in an HTML document that do not have an HTML fragment yet, and whose child nodes all have an HTML fragment already.'@en;
sh:prefixes html:;
sh:select """
select $this {
# Select all DOM nodes...
$this a/rdfs:subClassOf* dom:DocumentTreeNode.
# ...that do not yet have an HTML fragment.
filter not exists { $this html:fragment []. }
# ...but whose child nodes (if any) all have an HTML fragment
filter not exists {
$this ?member ?child.
filter(function:isMembershipProperty(?member))
filter not exists { ?child html:fragment []. }
?child a/rdfs:subClassOf* dom:DocumentTreeNode.
}
}""";
rdfs:isDefinedBy html:.
```
***rule:HTMLFragmentSerializationAlgorithm***
This rule establishes and asserts the new HTML fragment for the node that was found by the target. It does so by calling a SPARQL function, depending on the kind of node within the HTML document.
```
rule:HTMLFragmentSerializationAlgorithm
a sh:SPARQLRule;
skos:prefLabel 'SPARQL rule for HTML fragment serialization algorithm'@en;
skos:definition 'A SPARQL rule to serialize an HTML fragment for a node in an HTML document, analogue to the HTML fragment serialisation algorithm as described in the living standard of HTML.'@en;
sh:prefixes html:;
sh:construct """
construct {
# Assert the new HTML fragment for this node in the HTML document
$this html:fragment ?fragment.
} where {
# Establish the class of the node in the HTML document
$this a/rdfs:subClassOf* ?htmlClass.
?htmlClass rdfs:isDefinedBy html:.
# Build the HTML fragment for the node in the HTML document depending on its class
bind(if(?htmlClass = html:Element, function:getElementFragment($this),
if(?htmlClass = html:Text, function:getTextFragment($this),
if(?htmlClass = html:Comment, function:getCommentFragment($this),
if(?htmlClass = html:ProcessingInstruction, function:getProcessingInstructionFragment($this),
if(?htmlClass = html:DocumentType, function:getDocumentTypeFragment($this),
if(?htmlClass = html:Document, function:getDocumentFragment($this), ?unboundDummy))))))
as ?fragmentString)
# Convert result from string to rdf:HTML if fragment exists
bind(if(bound(?fragmentString), strdt(?fragmentString, rdf:HTML), ?unboundDummy) as ?fragment)
}""";
rdfs:isDefinedBy html:.
```
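The interplay of target and rule can be imitated outside a SHACL engine. The Python sketch below is a rough analogy under stated assumptions, not the vocabulary's implementation: nodes are plain dictionaries, attributes, comments and processing instructions are left out, and the names `select_targets`, `build_fragment` and `serialize` are hypothetical. Each loop iteration stands for one run of the SHACL engine.

```python
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

def select_targets(nodes):
    """Mimic the target: nodes without a fragment whose children all have one."""
    return [
        name for name, node in nodes.items()
        if "fragment" not in node
        and all("fragment" in nodes[child] for child in node.get("children", []))
    ]

def build_fragment(nodes, name):
    """Mimic the rule: build one node's fragment from its children's fragments."""
    node = nodes[name]
    inner = "".join(nodes[child]["fragment"] for child in node.get("children", []))
    kind = node["kind"]
    if kind == "document":
        return inner
    if kind == "doctype":
        return "<!DOCTYPE " + node["name"] + ">"
    if kind == "element":
        tag = node["tag"]
        if tag in VOID_TAGS:
            return "<" + tag + ">"
        return "<" + tag + ">" + inner + "</" + tag + ">"
    # Assumption of this sketch: text nodes must arrive with a fragment.
    raise ValueError(name + " has no fragment and cannot be serialized")

def serialize(nodes, root):
    """Repeat target and rule until the root has a fragment."""
    passes = 0
    while "fragment" not in nodes[root]:
        targets = select_targets(nodes)
        if not targets:
            raise ValueError("no serializable node left")
        # Compute all new fragments before asserting any of them,
        # so that one pass only uses results of earlier passes.
        new_fragments = {name: build_fragment(nodes, name) for name in targets}
        for name, fragment in new_fragments.items():
            nodes[name]["fragment"] = fragment
        passes += 1
    return nodes[root]["fragment"], passes

nodes = {
    "doc:example": {"kind": "document", "children": ["doc:doctype", "doc:html"]},
    "doc:doctype": {"kind": "doctype", "name": "html"},
    "doc:html": {"kind": "element", "tag": "html", "children": ["doc:head", "doc:body"]},
    "doc:head": {"kind": "element", "tag": "head", "children": ["doc:title"]},
    "doc:title": {"kind": "element", "tag": "title", "children": ["doc:titleText"]},
    "doc:titleText": {"kind": "text", "fragment": "Tutorial Document Example"},
    "doc:body": {"kind": "element", "tag": "body", "children": ["doc:bodyText"]},
    "doc:bodyText": {"kind": "text", "fragment": "Hello world!"},
}
fragment, passes = serialize(nodes, "doc:example")
print(passes)
print(fragment)
```

In this simplified document the doctype, title and body are serialized in the first pass, followed by the head, the html element and finally the document, so the number of passes grows with the depth of the tree rather than with the number of nodes.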
***Main functions***
The rule:HTMLFragmentSerializationAlgorithm can call six SPARQL functions as defined in the HTML vocabulary, depending on the kind of node within the HTML document. These functions are:
```
function:getElementFragment
function:getTextFragment
function:getCommentFragment
function:getProcessingInstructionFragment
function:getDocumentTypeFragment
function:getDocumentFragment
```
These functions establish and return the HTML fragment for a node in an HTML document. As an example, consider function:getDocumentFragment.
*Example:*
```
function:getDocumentFragment
a sh:SPARQLFunction;
skos:prefLabel "the getDocumentFragment() function"@en;
skos:definition "A SPARQL function that returns an HTML fragment for an HTML document."@en;
sh:parameter [
sh:path function:document;
sh:datatype xsd:anyURI;
sh:description "An HTML document.";
];
sh:prefixes html:;
sh:returnType xsd:string;
sh:select """
select ?result {
optional {
# Establish the HTML fragment of the HTML document by retrieving the HTML fragments of all child nodes.
bind(function:getChildNodeFragment($document) as ?fragment)
}
bind(coalesce(?fragment, '') as ?result)
}""";
rdfs:isDefinedBy html:.
```
This function has a name and a definition, captured through the SKOS properties skos:prefLabel and skos:definition. It defines one parameter (for an arbitrary document) and a return type that details the kind of output the function gives back (a string). The select query contains the actual algorithm, which retrieves the HTML code for the HTML document by calling function:getChildNodeFragment. As one can see, the HTML vocabulary has a modular structure, not only for classes but also for rules, targets, functions and the like.
***Supporting functions***
The HTML vocabulary offers some additional supporting functions for serializing an HTML document to HTML from its RDF representation.
```
function:getChildNodeFragment
# returns the HTML fragment for the child nodes of an HTML node (if any).
function:getElementAttribute
# returns the attributes for an HTML node (if any).
function:getMemberIndex
# returns the relative position of an HTML node towards possible previous siblings.
function:isMembershipProperty
# returns true/false whether a property is or is not an instance of the ContainerMembershipProperty class (rdf:_1, rdf:_2, rdf:_3,...).
```
***Parameters***
For the latter two functions, the HTML vocabulary also models two parameters that are used in these functions:
```
parameter:getMemberIndex_property
# a property
parameter:isMembershipProperty_term
# a term
```
Take for example the HTML document as modeled above. Running this through a SHACL engine will lead to the following triples:
*Example:*
```
prefix doc:
prefix html: <https://www.w3.org/html/model/def/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# An example of an HTML document that has been serialized to HTML code.
doc:example a html:Document ;
rdf:_1 doc:doctype ;
rdf:_2 doc:html ;
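html:fragment """<!DOCTYPE html><html><head><title>Tutorial Document Example</title><style>
table {
width: 70%;
margin: 0 auto;
border-collapse: collapse;
}
caption {
text-align: left;
font-weight: bold;
padding: 10px;
background-color: #f2f2f2; /* Light gray */
}
th, td {
padding: 12px;
text-align: center;
border: 1px solid #ddd; /* Light gray border */
}
th {
background-color: #4CAF50; /* Green */
color: white;
}
</style></head><body>Hello world!</body></html>"""^^rdf:HTML .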
# A doctype declaration
doc:doctype a html:DocumentType ;
html:documentTypeName "html" ;
html:fragment "<!DOCTYPE html>"^^rdf:HTML .
# The root node with two child nodes
doc:html a html:Html ;
rdf:_1 doc:head ;
rdf:_2 doc:body ;
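html:fragment """<html><head><title>Tutorial Document Example</title><style>
table {
width: 70%;
margin: 0 auto;
border-collapse: collapse;
}
caption {
text-align: left;
font-weight: bold;
padding: 10px;
background-color: #f2f2f2; /* Light gray */
}
th, td {
padding: 12px;
text-align: center;
border: 1px solid #ddd; /* Light gray border */
}
th {
background-color: #4CAF50; /* Green */
color: white;
}
</style></head><body>Hello world!</body></html>"""^^rdf:HTML .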
doc:head a html:Head ;
rdf:_1 doc:title ;
rdf:_2 doc:style ;
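html:fragment """<head><title>Tutorial Document Example</title><style>
table {
width: 70%;
margin: 0 auto;
border-collapse: collapse;
}
caption {
text-align: left;
font-weight: bold;
padding: 10px;
background-color: #f2f2f2; /* Light gray */
}
th, td {
padding: 12px;
text-align: center;
border: 1px solid #ddd; /* Light gray border */
}
th {
background-color: #4CAF50; /* Green */
color: white;
}
</style></head>"""^^rdf:HTML .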
doc:title a html:Title ;
rdf:_1 doc:titleText ;
html:fragment "<title>Tutorial Document Example</title>"^^rdf:HTML .
# Note that text nodes always should have a html:fragment statement
doc:titleText a html:Text ;
html:fragment "Tutorial Document Example"^^rdf:HTML .
doc:style a html:StyleSheet ;
rdf:_1 doc:styleText ;
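html:fragment """<style>
table {
width: 70%;
margin: 0 auto;
border-collapse: collapse;
}
caption {
text-align: left;
font-weight: bold;
padding: 10px;
background-color: #f2f2f2; /* Light gray */
}
th, td {
padding: 12px;
text-align: center;
border: 1px solid #ddd; /* Light gray border */
}
th {
background-color: #4CAF50; /* Green */
color: white;
}
</style>"""^^rdf:HTML .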
doc:styleText a html:Text ;
html:fragment """
table {
width: 70%;
margin: 0 auto;
border-collapse: collapse;
}
caption {
text-align: left;
font-weight: bold;
padding: 10px;
background-color: #f2f2f2; /* Light gray */
}
th, td {
padding: 12px;
text-align: center;
border: 1px solid #ddd; /* Light gray border */
}
th {
background-color: #4CAF50; /* Green */
color: white;
}
"""^^rdf:HTML .
doc:body a html:Body ;
rdf:_1 doc:bodyText ;
html:fragment "<body>Hello world!</body>"^^rdf:HTML .
doc:bodyText a html:Text;
html:fragment 'Hello world!'^^rdf:HTML .
```
Note how each node in the HTML document has an html:fragment statement that holds the resulting HTML code of that node.
##### Custom element
As the Living Standard of HTML provides the possibility of custom defined HTML elements, so should the HTML vocabulary. Hence, the class html:CustomElement and the node shape shp:CustomElement and associated rule rule:CustomElement are defined in the vocabulary. A custom element can then be defined in some arbitrary vocabulary and applied in some HTML document.
*Example:*
```
# Some snippet of an arbitrary HTML document, showcasing the use of a custom element.
ex:someFlagIconElement rdf:type ex:FlagIcon.
# A definition of the custom element that should be added to the dataset containing the HTML vocabulary, for instance through a separate vocabulary.
ex:FlagIcon rdf:type html:CustomElement;
html:tag 'flag-icon';
html:extends html:Element;
rdfs:isDefinedBy ex:.
```
The node shape establishes the custom defined element not only as an instance of html:CustomElement but also as a subclass of the class html:CustomElement. This leverages the main node shape shp:HTMLFragmentSerializationAlgorithm through its target target:HTMLFragmentSerializationAlgorithm and rule:HTMLFragmentSerializationAlgorithm to serialize the custom element just like any other HTML element in the HTML document.
***Types of custom element***
Two distinct types of custom elements are defined in the Living Standard of HTML: an autonomous custom element, which is defined with no extends option, and a customized built-in element, which is defined with an extends option. The vocabulary provides the following complex OWL classes:
```
html:AutonomousCustomElement is a subclass of html:CustomElement.
html:CustomizedBuiltInElement is a subclass of html:CustomElement.
```
##### Design patterns
Here follow the most important design patterns that were applied during the creation of the HTML vocabulary that shape the content, structure and behaviour of the HTML vocabulary.
1. Every HTML attribute in the HTML vocabulary is defined as both an instance of the class html:Attribute as well as a subproperty of 'html:attribute'. This resonates with the design pattern in RDF Schema 1.1 (RDFS) where a container membership (like rdf:_1, rdf:_2, ...) is represented through both the membership of the class rdfs:ContainerMembershipProperty and through the assertion that the membership property (like rdf:_1, rdf:_2, ...) is a subproperty of 'rdfs:member'.
2. The index of a node within an HTML document, indicating its position relative to any preceding siblings, is modeled using properties that are instances of the class rdfs:ContainerMembershipProperty, such as rdf:_1, rdf:_2, rdf:_3, and so on. This design choice facilitates the readability, writability and maintainability of HTML documents. An alternative approach that was considered within the W3C community group involved using the list concept of RDF. That approach, however, would quickly render complex HTML documents unreadable, difficult to write manually and hard to maintain. Hence, the community group chose the container membership approach. The downside is that the exhaustiveness of a sequence of HTML child nodes has to be ensured through other means (for instance through SHACL shapes). In addition, the DOM Living Standard defines the index of a node as the number of its preceding siblings, or 0 if it has none. To retrieve that index, one would need a specific SPARQL function, similar to function:getMemberIndex but subtracting 1 from the result. As this is easy to do and there is currently no need for such a function, it was not seen as a hindrance.
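The two consequences of the container membership pattern mentioned above, the offset to the DOM index and the need to check exhaustiveness separately, can be sketched in a few lines of Python. Both helpers, `dom_index` and `missing_positions`, are hypothetical and work on plain integers; in the vocabulary itself these checks would be expressed as a SPARQL function and SHACL shapes.

```python
def dom_index(member_index):
    """Convert a 1-based membership index (rdf:_n) to a 0-based DOM index."""
    if member_index < 1:
        raise ValueError("membership indices start at 1")
    return member_index - 1

def missing_positions(member_indices):
    """Return the positions missing from a child sequence that should run 1..max."""
    present = set(member_indices)
    highest = max(present, default=0)
    return [n for n in range(1, highest + 1) if n not in present]

print(dom_index(1))                  # 0
print(missing_positions([1, 2, 4]))  # [3]
```

A node whose children use rdf:_1, rdf:_2 and rdf:_4 would thus be reported as having a gap at position 3.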
Example
Visualisation
A visualisation of the ontology
Namespace
Prefixes and namespaces used in this specification
| Prefix | Namespace |
|---|---|
| aria | http://www.w3.org/ns/wai-aria/ |
| dom | http://www.w3.org/DOM/model/def/ |
| function | https://www.w3.org/html/model/function/ |
| html | https://www.w3.org/html/model/def/ |
| parameter | https://www.w3.org/html/model/parameter/ |
| rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
| rdfs | http://www.w3.org/2000/01/rdf-schema# |
| rule | https://www.w3.org/html/model/rule/ |
| section | https://www.w3.org/html/id/section/ |
| shp | https://www.w3.org/html/model/shp/ |
| standard | https://www.w3.org/html/id/standard/ |
| target | https://www.w3.org/html/model/target/ |
| xsd | http://www.w3.org/2001/XMLSchema# |
Serialisation
A serialisation of the ontology in Turtle-format (*.ttl) can be found here.
The <a> HTML element (or anchor element), with its href attribute, creates a hyperlink to web pages, files, email addresses, locations in the same page, or anything else a URL can address.
The <abbr> HTML element represents an abbreviation or acronym; the optional title attribute can provide an expansion or description for the abbreviation. If present, title must contain this full description and nothing else.
The <area> HTML element defines an area inside an image map that has predefined clickable areas. An image map allows geometric areas on an image to be associated with a hypertext link.
The <article> HTML element represents a self-contained composition in a document, page, application, or site, which is intended to be independently distributable or reusable (e.g., in syndication). Examples include a forum post, a magazine or newspaper article, a blog entry, a product card, a user-submitted comment, an interactive widget or gadget, or any other independent item of content.
The <aside> HTML element represents a portion of a document whose content is only indirectly related to the document's main content. Asides are frequently presented as sidebars or call-out boxes.
An attribute is a name-value pair that is associated with an HTML element. Attributes provide additional information about an element and are specified within the start tag of an element. Attributes can modify the behavior or appearance of an element, define relationships between elements, or provide other metadata. The name of the attribute is followed by an equal sign (=) and the attribute's value, which is enclosed in double or single quotes. Some attributes affect the element simply by their presence in the start tag of the element, with the value implicitly being an empty string.
The <audio> HTML element is used to embed sound content in documents. It may contain one or more audio sources, represented using the src attribute or the <source> element: the browser will choose the most suitable one. It can also be the destination for streamed media, using a MediaStream.
An autonomous custom element is a custom HTML element that is defined with no extends option. These types of custom elements have a local name equal to their defined name.
The <b> HTML element is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance. This was formerly known as the Boldface element, and most browsers still draw the text in boldface. However, you should not use <b> for styling text; instead, you should use the CSS font-weight property to create boldface text, or the <strong> element to indicate that text is of special importance.
The <bdi> HTML element tells the browser's bidirectional algorithm to treat the text it contains in isolation from its surrounding text. It's particularly useful when a website dynamically inserts some text and doesn't know the directionality of the text being inserted.
The <blockquote> HTML element indicates that the enclosed text is an extended quotation. Usually, this is rendered visually by indentation (see Notes for how to change it). A URL for the source of the quotation may be given using the cite attribute, while a text representation of the source can be given using the <cite> element.
The <br> HTML element produces a line break in text (carriage-return). It is useful for writing a poem or an address, where the division of lines is significant.
A DOM node with textual content that contains characters that would otherwise be treated as markup. A CDATA section is typically used to include code snippets, scripts, or other data within an HTML document without having to worry about escaping special characters. In HTML, a CDATA section is denoted by enclosing the block of text within <![CDATA[ and ]]> tags. Anything contained within these tags is treated as raw character data and is not parsed as markup.
The <cite> HTML element is used to describe a reference to a cited creative work, and must include the title of that work. The reference may be in an abbreviated form according to context-appropriate conventions related to citation metadata.
The <code> HTML element displays its contents styled in a fashion intended to indicate that the text is a short fragment of computer code. By default, the content text is displayed using the user agent's default monospace font.
The <col> HTML element defines a column within a table and is used for defining common semantics on all common cells. It is generally found within a <colgroup> element.
A comment is a markup construct used to insert comments within the HTML code. Comments are not displayed in the web browser, but they can be viewed in the HTML source code. Comments are typically used to add notes, descriptions, or explanations to the HTML code for the benefit of developers, without affecting the rendered output in the browser. Comments are denoted by enclosing the comment text within <!-- and --> tags. Anything contained within these tags is treated as a comment and is ignored by the web browser during rendering.
A custom data attribute is an attribute in no namespace whose name starts with the string "data-", has at least one character after the hyphen, is XML-compatible, and contains no ASCII upper alphas.
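The naming conditions above can be sketched as a small Python check. The function name is_custom_data_attribute_name is hypothetical, and the sketch covers only the prefix, the minimum length and the absence of ASCII upper alphas. XML compatibility and the namespace condition are deliberately not checked here.

```python
import re


def is_custom_data_attribute_name(name: str) -> bool:
    """Check the prefix, length and lowercase conditions (XML compatibility is NOT checked)."""
    prefix = "data-"
    if not name.startswith(prefix):
        return False
    if len(name) == len(prefix):
        # Nothing follows the hyphen.
        return False
    # ASCII upper alphas are the code points A to Z.
    return re.search(r"[A-Z]", name) is None


for candidate in ("data-user-id", "data-", "data-userId", "id"):
    print(candidate, is_custom_data_attribute_name(candidate))
```

Only the first candidate, data-user-id, satisfies all three conditions checked here.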
A custom element is an element that is custom. Informally, this means that its constructor and prototype are defined by the author, instead of by the user agent. This author-supplied constructor function is called the custom element constructor. Two distinct types of custom elements can be defined: An autonomous custom element, which is defined with no extends option. These types of custom elements have a local name equal to their defined name. A customized built-in element, which is defined with an extends option. These types of custom elements have a local name equal to the value passed in their extends option, and their defined name is used as the value of the is attribute, which therefore must be a valid custom element name.
A customized built-in element is a custom HTML element that is defined with an extends option. These types of custom elements have a local name equal to the value passed in their extends option, and their defined name is used as the value of the is attribute, which therefore must be a valid custom element name.
The <data> HTML element links a given piece of content with a machine-readable translation. If the content is time- or date-related, the <time> element must be used.
The <datalist> HTML element contains a set of <option> elements that represent the permissible or recommended options available to choose from within other controls.
The <del> HTML element represents a range of text that has been deleted from a document. This can be used when rendering "track changes" or source code diff information, for example. The <ins> element can be used for the opposite purpose: to indicate text that has been added to the document.
The <details> HTML element creates a disclosure widget in which information is visible only when the widget is toggled into an "open" state. A summary or label must be provided using the <summary> element.
The <dfn> HTML element is used to indicate the term being defined within the context of a definition phrase or sentence. The <p> element, the <dt>/<dd> pairing, or the <section> element which is the nearest ancestor of the <dfn> is considered to be the definition of the term.
The <div> HTML element is the generic container for flow content. It has no effect on the content or layout until styled in some way using CSS (e.g. styling is directly applied to it, or some kind of layout model like Flexbox is applied to its parent element).
The <dl> HTML element represents a description list. The element encloses a list of groups of terms (specified using the <dt> element) and descriptions (provided by <dd> elements). Common uses for this element are to implement a glossary or to display metadata (a list of key-value pairs).
An HTML document consists of a tree of elements and text. Each element is denoted in the source by a start tag, such as ‘<body>’, and an end tag, such as ‘</body>’. Tags have to be nested such that elements are all completely within each other, without overlapping. Elements can have attributes, which control how the elements work. The HTML vocabulary defines a set of elements that can be used in an HTML document, along with rules about the ways in which the elements can be nested. HTML user agents (e.g., web browsers) parse an HTML document, turning it into a DOM (Document Object Model) tree. A DOM tree is an in-memory representation of a document. An HTML document represents a media-independent description of interactive content. An HTML document might be rendered to a screen, or through a speech synthesizer, or on a braille display. To influence exactly how such rendering takes place, authors can use a styling language such as CSS.
A DOCTYPE is a required preamble. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.
The <dt> HTML element specifies a term in a description or definition list, and as such must be used inside a <dl> element. It is usually followed by a <dd> element; however, multiple <dt> elements in a row indicate several terms that are all defined by the immediate next <dd> element.
An HTML element in the Document Object Model (DOM) represents a thing; that is, it has intrinsic meaning, also known as semantics. An element consists of an HTML start tag and an HTML end tag and has value content. An HTML start tag consists of a "smaller than" character ("<") and a tag name, followed by a "greater than" character (">"). An HTML end tag consists of a "smaller than" character ("<"), a slash ("/") and a tag name, followed by a "greater than" character (">"). The value content of an element can be arbitrarily complex.
The <em> HTML element marks text that has stress emphasis. The <em> element can be nested, with each level of nesting indicating a greater degree of emphasis.
The <embed> HTML element embeds external content at the specified point in the document. This content is provided by an external application or other source of interactive content such as a browser plug-in.
An HTML element whose text content is treated as raw text rather than parsed as HTML markup, although character references within that text are still recognized.
An event handler content attribute is a content attribute for a specific event handler. The name of the content attribute is the same as the name of the event handler.
The <figure> HTML element represents self-contained content, potentially with an optional caption, which is specified using the <figcaption> element. The figure, its caption, and its contents are referenced as a single unit.
The <footer> HTML element represents a footer for its nearest sectioning content or sectioning root element. A <footer> typically contains information about the author of the section, copyright data or links to related documents.
The <header> HTML element represents introductory content, typically a group of introductory or navigational aids. It may contain some heading elements but also a logo, a search form, an author name, and other elements.
Heading content defines the heading of a section (whether explicitly marked up using sectioning content elements, or implied by the heading content itself).
The <hr> HTML element represents a thematic break between paragraph-level elements: for example, a change of scene in a story, or a shift of topic within a section.
The <html> HTML element represents the root (top-level element) of an HTML document, so it is also referred to as the root element or document element. All other elements must be descendants of this element.
The <i> HTML element represents a range of text that is set off from the normal text for some reason, such as idiomatic text, technical terms, taxonomical designations, among others. Historically, these have been presented using italicized type, which is the original source of the <i> naming of this element.
The <input> HTML element is used to create interactive controls for web-based forms in order to accept data from the user; a wide variety of types of input data and control widgets are available, depending on the device and user agent. The <input> element is one of the most powerful and complex in all of HTML due to the sheer number of combinations of input types and attributes.
The <ins> HTML element represents a range of text that has been added to a document. You can use the <del> element to similarly represent a range of text that has been deleted from the document.
The <kbd> HTML element represents a span of inline text denoting textual user input from a keyboard, voice input, or any other text entry device. By convention, the user agent defaults to rendering the contents of a <kbd> element using its default monospace font, although this is not mandated by the HTML standard.
The <li> HTML element is used to represent an item in a list. It must be contained in a parent element: an ordered list (<ol>), an unordered list (<ul>), or a menu (<menu>). In menus and unordered lists, list items are usually displayed using bullet points. In ordered lists, they are usually displayed with an ascending counter on the left, such as a number or letter.
Denotes elements that are listed in the form.elements and fieldset.elements APIs. These elements also have a form content attribute, and a matching form IDL attribute, that allow authors to specify an explicit form owner.
The <main> HTML element represents the dominant content of the <body> of a document. The main content area consists of content that is directly related to or expands upon the central topic of a document, or the central functionality of an application.
The <mark> HTML element represents text which is marked or highlighted for reference or notation purposes, due to the marked passage's relevance or importance in the enclosing context.
The <menu> HTML element is a semantic alternative to <ul>. It represents an unordered list of items (represented by <li> elements), each of which represents a link or other command that the user can activate.
The <meta> HTML element can represent document-level metadata with the name attribute, pragma directives with the http-equiv attribute, and the file's character encoding declaration when an HTML document is serialized to string form (e.g. for transmission over the network or for disk storage) with the charset attribute.
Metadata content is content that sets up the presentation or behavior of the rest of the content, or that sets up the relationship of the document with other documents, or that conveys other "out of band" information.
The <nav> HTML element represents a section of a page whose purpose is to provide navigation links, either within the current document or to other documents. Common examples of navigation sections are menus, tables of contents, and indexes.
Exclusionary definition: the elements that are neither (1) void elements, (2) the template element, (3) raw text elements, (4) escapable raw text elements, nor (5) foreign elements.
The <noscript> HTML element defines a section of HTML to be inserted if a script type on the page is unsupported or if scripting is currently turned off in the browser.
The <object> HTML element represents an external resource, which can be treated as an image, a nested browsing context, or a resource to be handled by a plugin.
The <option> HTML element is used to define an item contained in a <select>, an <optgroup>, or a <datalist> element. As such, <option> can represent menu items in popups and other lists of items in an HTML document.
The <p> HTML element represents a paragraph. Paragraphs are usually represented in visual media as blocks of text separated from adjacent blocks by blank lines and/or first-line indentation, but HTML paragraphs can be any structural grouping of related content, such as images or form fields.
As a general rule, elements whose content model allows any flow content or phrasing content should have at least one node in its contents that is palpable content and that does not have the hidden attribute specified.
Phrasing content is the text of the document, as well as elements that mark up that text at the intra-paragraph level. Runs of phrasing content form paragraphs.
The <picture> HTML element contains zero or more <source> elements and one <img> element to offer alternative versions of an image for different display/device scenarios.
The <pre> HTML element represents preformatted text which is to be presented exactly as written in the HTML file. The text is typically rendered using a non-proportional, or monospaced, font. Whitespace inside this element is displayed as written.
The <q> HTML element indicates that the enclosed text is a short inline quotation. Most modern browsers implement this by surrounding the text in quotation marks. This element is intended for short quotations that don't require paragraph breaks; for long quotations use the <blockquote> element.
The <rp> HTML element is used to provide fall-back parentheses for browsers that do not support display of ruby annotations using the <ruby> element. One <rp> element should enclose each of the opening and closing parentheses that wrap the <rt> element that contains the annotation's text.
The <rt> HTML element specifies the ruby text component of a ruby annotation, which is used to provide pronunciation, translation, or transliteration information for East Asian typography. The <rt> element must always be contained within a <ruby> element.
The <ruby> HTML element represents small annotations that are rendered above, below, or next to base text, usually used for showing the pronunciation of East Asian characters. It can also be used for annotating other kinds of text, but this usage is less common.
The <s> HTML element renders text with a strikethrough, or a line through it. Use the <s> element to represent things that are no longer relevant or no longer accurate. However, <s> is not appropriate when indicating document edits; for that, use the <del> and <ins> elements, as appropriate.
The <svg> HTML element is a container for SVG graphics. SVG allows for three types of graphics: vector graphic shapes (e.g., paths consisting of straight lines and curves), images, and text.
The <samp> HTML element is used to enclose inline text which represents sample (or quoted) output from a computer program. Its contents are typically rendered using the browser's default monospaced font (such as Courier or Lucida Console).
The <script> HTML element is used to embed executable code or data; this is typically used to embed or refer to JavaScript code. The <script> element can also be used with other languages, such as WebGL's GLSL shader programming language and JSON.
Script-supporting elements are those that do not represent anything themselves (i.e. they are not rendered), but are used to support scripts, e.g. to provide functionality for the user.
The <search> element represents a part of a document or application that contains a set of form controls or other content related to performing a search or filtering operation. This could be a search of the web site or application; a way of searching or filtering search results on the current web page; or a global or Internet-wide search function.
The <section> HTML element represents a generic standalone section of a document, which doesn't have a more specific semantic element to represent it. Sections should always have a heading, with very few exceptions.
The <slot> HTML element - part of the Web Components technology suite - is a placeholder inside a web component that you can fill with your own markup, which lets you create separate DOM trees and present them together.
The <small> HTML element represents side-comments and small print, like copyright and legal text, independent of its styled presentation. By default, it renders text within it one font-size smaller, such as from small to x-small.
The <source> HTML element specifies multiple media resources for the <picture>, <audio>, or <video> element. It is an empty element, meaning that it has no content and does not have a closing tag. It is commonly used to offer the same media content in multiple file formats in order to provide compatibility with a broad range of browsers given their differing support for image file formats and media file formats.
The <span> HTML element is a generic inline container for phrasing content, which does not inherently represent anything. It can be used to group elements for styling purposes (using the class or id attributes), or because they share attribute values, such as lang. It should be used only when no other semantic element is appropriate. <span> is very much like a <div> element, but <div> is a block-level element whereas a <span> is an inline element.
The <strong> HTML element indicates that its contents have strong importance, seriousness, or urgency. Browsers typically render the contents in bold type.
The <style> HTML element contains style information for a document, or part of a document. It embeds a CSS style sheet, which is applied to the contents of the document containing the <style> element.
The <sub> HTML element specifies inline text which should be displayed as subscript for solely typographical reasons. Subscripts are typically rendered with a lowered baseline using smaller text.
The <summary> HTML element specifies a summary, caption, or legend for a <details> element's disclosure box. Clicking the <summary> element toggles the state of the parent <details> element open and closed.
The <sup> HTML element specifies inline text which should be displayed as superscript for solely typographical reasons. Superscripts are typically rendered with half a character above the normal line, and are sometimes rendered in a smaller font.
A table (<table> element) represents data with more than one dimension. Tables have rows, columns, and cells given by their descendants. The rows and columns form a grid; a table's cells must completely cover that grid without overlap.
The <template> HTML element is a mechanism for holding HTML that is not to be rendered immediately when a page is loaded but may be instantiated subsequently during runtime using JavaScript.
The <textarea> HTML element represents a multi-line plain-text editing control, useful when you want to allow users to enter a sizeable amount of free-form text, for example a comment on a review or feedback form.
The <time> HTML element represents a specific period in time. It may include the datetime attribute to translate dates into machine-readable format, allowing for better search engine results or custom features such as reminders.
The Title (<title> element) defines the document's title, which is shown in a browser's title bar or a page's tab. It only contains text; tags within the element are ignored.
The <track> HTML element is used as a child of the media elements, <audio> and <video>. It lets you specify timed text tracks (or time-based data), for example to automatically handle subtitles. The tracks are formatted in WebVTT format (.vtt files) - Web Video Text Tracks.
The <u> HTML element represents a span of inline text which should be rendered in a way that indicates that it has a non-textual annotation. This is rendered by default as a simple solid underline, but may be altered using CSS.
The <var> HTML element represents the name of a variable in a mathematical expression or a programming context. It's typically presented using an italicized version of the current typeface, although that behavior is browser-dependent.
The <video> HTML element embeds a media player which supports video playback into the document. You can use <video> for audio content as well, but the <audio> element may provide a more appropriate user experience.
The <wbr> HTML element represents a word break opportunity - a position within text where the browser may optionally break a line, though its line-breaking rules would not otherwise create a break at that location.
Property that is the parent property of all existing HTML attributes. An attribute is a name-value pair that is associated with an HTML element. Attributes provide additional information about an element and are specified within the start tag of an element. Attributes can modify the behavior or appearance of an element, define relationships between elements, or provide other metadata. The name of the attribute is followed by an equal sign (=) and the attribute's value, which is enclosed in double or single quotes. Some attributes affect the element simply by their presence in the start tag of the element, with the value implicitly being an empty string.
Property that links the HTML fragment to a node in a document, where the node represents the HTML document itself or a document type, HTML element, text, CDATA section or comment within that HTML document.
Specifies a space-separated list of URLs to which POST requests should be sent, as part of the element's activation behavior, when the resource is activated.
A SPARQL Target to select all nodes in an HTML document that do not have an HTML fragment yet, and whose child nodes all have an HTML fragment already.
SPARQL query
select $this {
    # Select all DOM nodes...
    $this a/rdfs:subClassOf* dom:DocumentTreeNode.
    # ...that do not yet have an HTML fragment...
    filter not exists { $this html:fragment []. }
    # ...but whose child nodes (if any) all have an HTML fragment.
    filter not exists {
        $this ?member ?child.
        filter(function:isMembershipProperty(?member))
        filter not exists { ?child html:fragment []. }
        ?child a/rdfs:subClassOf* dom:DocumentTreeNode.
    }
}
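The effect of this target is that fragments are built bottom-up: leaf nodes qualify first, and a parent qualifies only once all of its children have a fragment. The Python sketch below mimics that selection on a small in-memory tree. The names ready_nodes, children and fragments are hypothetical, and the placeholder strings stand in for real serialized fragments.

```python
def ready_nodes(children: dict[str, list[str]], fragments: dict[str, str]) -> list[str]:
    """Return nodes without a fragment whose children (if any) all have one."""
    return [
        node
        for node, kids in children.items()
        if node not in fragments and all(kid in fragments for kid in kids)
    ]


# Hypothetical document tree: html -> body -> (p, br)
children = {"html": ["body"], "body": ["p", "br"], "p": [], "br": []}
fragments: dict[str, str] = {}
rounds = []

# Repeat the selection until every node has a fragment.
while len(fragments) < len(children):
    batch = ready_nodes(children, fragments)
    rounds.append(batch)
    for node in batch:
        fragments[node] = f"<{node}>"  # placeholder, not a real serialization

print(rounds)
# prints: [['p', 'br'], ['body'], ['html']]
```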
A SPARQL rule to serialize an HTML fragment for a node in an HTML document, analogous to the HTML fragment serialisation algorithm as described in the living standard of HTML.
SPARQL query
construct {
    # Assert the new HTML fragment for this node in the HTML document
    $this html:fragment ?fragment.
} where {
    # Establish the class of the node in the HTML document
    $this a/rdfs:subClassOf* ?htmlClass.
    ?htmlClass rdfs:isDefinedBy html:.
    # Build the HTML fragment for the node in the HTML document depending on its class
    bind(if(?htmlClass = html:Element, function:getElementFragment($this),
         if(?htmlClass = html:Text, function:getTextFragment($this),
         if(?htmlClass = html:Comment, function:getCommentFragment($this),
         if(?htmlClass = html:ProcessingInstruction, function:getProcessingInstructionFragment($this),
         if(?htmlClass = html:DocumentType, function:getDocumentTypeFragment($this),
         if(?htmlClass = html:Document, function:getDocumentFragment($this), ?unboundDummy))))))
         as ?fragmentString)
    # Convert result from string to rdf:HTML if fragment exists
    bind(if(bound(?fragmentString), strdt(?fragmentString, rdf:HTML), ?unboundDummy) as ?fragment)
}
A SPARQL function that returns an HTML fragment of child nodes for a node in an HTML document.
SPARQL query
select ?result {
    optional {
        # Get the HTML fragments of child nodes, if there are any.
        select $parentNode (group_concat(str(?childFragment);separator='') as ?childFragments) {
            {
                select $parentNode ?member ?childFragment {
                    $parentNode ?member ?childNode.
                    filter(function:isMembershipProperty(?member))
                    ?childNode html:fragment ?childFragment.
                }
                order by function:getMemberIndex(?member)
            }
        }
        group by $parentNode
    }
    bind(coalesce(?childFragments, '') as ?result)
}
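The ordering step matters because member indices have to be compared as numbers rather than as strings. The Python sketch below illustrates this under the assumption that membership properties take the form rdf:_n; the definitions of function:isMembershipProperty and function:getMemberIndex are not shown in this section, so that form is a guess. The names member_index and child_node_fragment are hypothetical.

```python
import re


def member_index(member: str) -> int:
    """Extract n from a membership property such as 'rdf:_10' (assumed naming)."""
    match = re.fullmatch(r"rdf:_(\d+)", member)
    if match is None:
        raise ValueError(f"not a membership property: {member}")
    return int(match.group(1))


def child_node_fragment(children: dict[str, str]) -> str:
    """Concatenate child fragments in member order; an empty mapping yields ''."""
    ordered = sorted(children.items(), key=lambda item: member_index(item[0]))
    return "".join(fragment for _, fragment in ordered)


print(child_node_fragment({
    "rdf:_10": "<li>j</li>",
    "rdf:_2": "<li>b</li>",
    "rdf:_1": "<li>a</li>",
}))
# prints: <li>a</li><li>b</li><li>j</li>
```

A plain string sort would place rdf:_10 before rdf:_2, which is why the index is converted to an integer first.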
A SPARQL function that returns an HTML fragment for a comment node in an HTML document.
SPARQL query
select ?result {
    optional {
        # Establish the HTML fragment for this HTML comment
        bind(concat('<!--',function:getChildNodeFragment($comment),'-->') as ?fragment)
    }
    bind(coalesce(?fragment, '') as ?result)
}
A SPARQL function that returns an HTML fragment for an HTML document.
SPARQL query
select ?result {
    optional {
        # Establish the HTML fragment of the HTML document by retrieving the HTML fragments of all child nodes.
        bind(function:getChildNodeFragment($document) as ?fragment)
    }
    bind(coalesce(?fragment, '') as ?result)
}
A SPARQL function that returns an HTML fragment for a document type in an HTML document.
SPARQL query
select ?result {
    optional {
        # Establish the doctype name for this Document Type.
        $doctype html:documentTypeName ?name.
        bind(concat('<!DOCTYPE ',str(?name),'>') as ?fragment)
    }
    bind(coalesce(?fragment, '') as ?result)
}
A SPARQL function that returns an HTML fragment for the attributes of an HTML element.
SPARQL query
select ?result {
    optional {
        # Get the HTML attributes for this element, if there are any.
        select $element (group_concat(distinct ?attributeFragment) as ?attributeFragments) {
            $element ?attribute ?value.
            ?attribute
                a/rdfs:subClassOf* dom:Attribute;
                ?localName ?key.
            ?localName rdfs:subPropertyOf dom:localName.
            bind(concat(?key,'="',str(?value),'"') as ?attributeFragment)
        }
        group by $element
    }
    bind(coalesce(?attributeFragments, '') as ?result)
}
A SPARQL function that returns an HTML fragment for an element in an HTML document.
SPARQL query
select ?result {
    optional {
        # Retrieve the tag name of the element.
        $element a ?class.
        ?class
            html:tag ?tag;
            rdfs:subClassOf ?elementType.
        # Get the HTML attributes for the element, if there are any.
        bind(function:getElementAttribute($element) as ?attributes)
        # Get the HTML fragments of child nodes for the element, if there are any.
        bind(function:getChildNodeFragment($element) as ?childFragments)
        # Build the HTML fragment for this HTML element, by combining everything retrieved above.
        bind(concat(
            '<',?tag,if(?attributes='','',concat(' ',?attributes)),'>',
            # Void elements have neither content nor a closing tag.
            if(?elementType=html:VoidElement,'',concat(?childFragments,'</',?tag,'>'))) as ?fragment)
    }
    bind(coalesce(?fragment, '') as ?result)
}
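The string assembly in this function can be mirrored in a few lines of Python, which may help when checking expected output by hand. This is a sketch of the concatenation logic only: the function name element_fragment and its parameters are hypothetical, and the attribute and child fragments are assumed to have been serialized already.

```python
def element_fragment(tag: str, attributes: str, child_fragments: str, is_void: bool) -> str:
    """Combine tag name, attribute text and child fragments into one element fragment."""
    # A leading space is added only when there is attribute text to write.
    attribute_part = f" {attributes}" if attributes else ""
    start_tag = f"<{tag}{attribute_part}>"
    if is_void:
        # Void elements have neither content nor a closing tag.
        return start_tag
    return f"{start_tag}{child_fragments}</{tag}>"


print(element_fragment("p", 'class="intro"', "Hello", is_void=False))
# prints: <p class="intro">Hello</p>
print(element_fragment("br", "", "", is_void=True))
# prints: <br>
```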
A SPARQL function that returns an HTML fragment for a processing instruction in an HTML document.
SPARQL query
select ?result {
    optional {
        # Establish the HTML fragment for this HTML processingInstruction
        bind(concat('<?',function:getChildNodeFragment($processingInstruction),'>') as ?fragment)
    }
    bind(coalesce(?fragment, '') as ?result)
}
A SPARQL function that returns an HTML fragment for a text node in an HTML document.
SPARQL query
select ?result {
    # Text is stored in the data attribute of DOM text nodes
    $text dom:data ?data.
    optional {
        # Establish the HTML fragment for this HTML text node
        bind(strdt(?data,xsd:string) as ?fragment)
    }
    bind(coalesce(?fragment, '') as ?result)
}