The HTML vocabulary establishes a draft standard to semantically represent HTML documents in RDF. The vocabulary is based on the HTML Living Standard. It includes classes for HTML elements, datatype properties for attributes, and SHACL shapes for code serialization. HTML documents can thus be represented, queried, generated, validated, analysed, transformed and reused as semantic objects themselves. As HTML documents are used in a myriad of use cases, from websites, dashboards and applications to social media and documents, this vocabulary helps organisations and individuals to get a better grasp of their information products.
Introduction
In today's fast-paced business landscape, organizations grapple with the formidable challenge of information management. The demand for robust information products and agile information processing is relentless, constantly evolving with each passing moment, transcending industry boundaries. In this dynamic environment, the creation, validation, and utilization of information products are paramount, serving as the bedrock for informed decision-making and organizational effectiveness. The ability to swiftly respond to unforeseen crises or seize emerging opportunities hinges on having the right answers to previously unasked questions.
So, how can organizations effectively navigate this information-intensive terrain? A key part of the solution lies in the capability to exercise complete control over the generation, modification, validation, and reuse of information products. This entails the profound ability to construct and deconstruct any arbitrary information product, dissecting it from its foundational elements to its broader, strategic significance.
To address this need, we have pioneered the development of the RDF-based HTML vocabulary - a transformative framework designed to facilitate the management of an array of information artifacts built upon HTML.
Background
In an era where information is akin to a constantly flowing stream, organizations must grapple with pressing challenges in five aspects: velocity, variety, insightfulness, adaptability and validity. The speed at which information is generated, transformed, and consumed is unparalleled, and keeping pace with this velocity is essential for timely decision-making. Information comes in diverse formats, from structured data to unstructured text, images, and multimedia, so effective management requires the ability to handle this variety seamlessly. Information is not merely data; it embodies insights, functionality, and knowledge, and extracting these is vital for organizational success. Organizations must be agile in adapting information products to address new crises and opportunities swiftly. Finally, ensuring the correctness, compliance and reliability of information is fundamental to trustworthy decision-making and operations.
Objective
In response to these challenges, we introduce the RDF-based HTML vocabulary. With this vocabulary, organizations gain the capacity to (1) construct and deconstruct, (2) generate and validate and (3) adapt and reuse. Constructing and deconstructing means that one can craft and dismantle information products at will, from their fundamental building blocks to their overarching significance. Generating and validating comes down to creating new information products and rigorously validating their accuracy and completeness. Finally, adapting and reusing is about adapting existing information products to swiftly respond to evolving scenarios and reusing valuable components across a spectrum of contexts. This innovative vocabulary forms the cornerstone of an information management ecosystem that harmonizes the power of RDF with the ubiquity of HTML. It revolutionizes how organizations handle information artifacts, enabling them to seamlessly transition from raw data to strategic insights and practical solutions.
Audience
This document is intended for a diverse audience of web developers, content managers, semantic web enthusiasts, and anyone seeking to enhance the sustainability of information management, information processing and information technology.
Overview
The HTML vocabulary establishes a draft standard that enables the semantic representation of any HTML document in RDF. The vocabulary is based on the Living Standard of HTML and offers classes to represent HTML elements, datatype properties to represent HTML attributes and a small number of SHACL shapes for the serialisation of HTML code.
Central to this ontology is the class 'html:Element', which represents an HTML element and is a subclass of 'dom:Element', an element as specified in the Document Object Model specification. Any specific HTML element, like 'html:A' (anchor element) or 'html:Body' (body element), is a subclass of 'html:Element'. Another fundamental building block of this ontology is the 'html:Attribute' class of all HTML attributes. Correspondingly, any attribute in HTML, like 'html:class' or 'html:style', is a subproperty of 'html:attribute'. An HTML attribute can be an attribute as defined in the Living Standard of HTML, an RDFa attribute or a custom-defined attribute. In addition, there is the class 'html:Text', whose instances each have an html:fragment property holding the actual text string, as for example in the triples 'doc:1.0 a html:Text; html:fragment "Hello, World!".'.
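The class hierarchy described above can be illustrated with a minimal Python sketch. It is not part of the vocabulary: the in-memory set of triples, the helper names `superclasses` and `is_instance_of`, and the node `doc:1` are assumptions made for illustration only. The subclass statements and the `doc:1.0` text node are taken from the paragraph above.

```python
# Hypothetical in-memory triple store: (subject, predicate, object) tuples.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    # Schema statements taken from the overview.
    ("html:Element", SUBCLASS_OF, "dom:Element"),
    ("html:A", SUBCLASS_OF, "html:Element"),
    ("html:Body", SUBCLASS_OF, "html:Element"),
    # Instance data: the text node from the example triples.
    ("doc:1.0", RDF_TYPE, "html:Text"),
    ("doc:1.0", "html:fragment", "Hello, World!"),
    # An assumed anchor element, added only for illustration.
    ("doc:1", RDF_TYPE, "html:A"),
}


def superclasses(cls, triples):
    """Return cls plus every class reachable via rdfs:subClassOf."""
    seen, todo = set(), [cls]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        todo.extend(o for s, p, o in triples if s == current and p == SUBCLASS_OF)
    return seen


def is_instance_of(node, cls, triples):
    """Mimic the SPARQL property path a/rdfs:subClassOf* for a single node."""
    return any(
        cls in superclasses(o, triples)
        for s, p, o in triples
        if s == node and p == RDF_TYPE
    )
```

With these triples, `doc:1` counts as an instance of both `html:Element` and `dom:Element`, while the text node `doc:1.0` does not.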
To serialize an HTML document represented in this way to actual HTML code, the SHACL node shape shp:HTMLFragmentSerializationAlgorithm processes the RDF structure and transforms it into HTML code. The rule in the node shape calls one of six functions for each node, depending on the kind of node being serialized. The logic behind the shape is that the HTML code is serialized from the leaves of the tree upwards to the top of the tree. This means that an arbitrary element in the DOM tree of an HTML document can only be serialized to HTML code once all child elements of that element have been serialized to HTML already, including the HTML code of possible attributes and their values. The outer edges of the tree are text nodes and elements that do not contain any child elements. These can be transformed into HTML code immediately, without any context. From the outer edges, the node shape shp:HTMLFragmentSerializationAlgorithm works its way up the tree until all the HTML elements have received an HTML fragment and the HTML document can be created based on the html:fragment contained in the html:html root element of that document. The processing halts the moment the document itself has an html:fragment.
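The leaves-upwards order can be sketched in plain Python. This is an illustrative sketch, not the SHACL implementation: the `Node` class, its field names and the helper functions are hypothetical, only element and text nodes are covered, and void elements, comments and doctypes are ignored.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Optional


@dataclass
class Node:
    kind: str                       # "element" or "text" (simplified)
    name: str = ""                  # tag name, for elements
    text: str = ""                  # character data, for text nodes
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    fragment: Optional[str] = None  # plays the role of html:fragment


def all_nodes(root):
    """Yield every node in the tree, root first."""
    yield root
    for child in root.children:
        yield from all_nodes(child)


def ready(node):
    """A node is ready when it has no fragment yet but all its children do."""
    return node.fragment is None and all(c.fragment is not None for c in node.children)


def build_fragment(node):
    """Build the HTML code of one node from its already serialized children."""
    if node.kind == "text":
        return escape(node.text, quote=False)
    attrs = "".join(f' {k}="{escape(v)}"' for k, v in node.attributes.items())
    inner = "".join(c.fragment for c in node.children)
    return f"<{node.name}{attrs}>{inner}</{node.name}>"


def serialize(root):
    """Repeat select-and-serialize passes until the root has a fragment."""
    passes = 0
    while root.fragment is None:
        targets = [n for n in all_nodes(root) if ready(n)]
        for n in targets:
            n.fragment = build_fragment(n)
        passes += 1
    return root.fragment, passes


doc = Node("element", "html", children=[
    Node("element", "body", children=[
        Node("element", "p", attributes={"class": "greeting"},
             children=[Node("text", text="Hello, World!")]),
    ]),
])
```

Calling `serialize(doc)` needs one pass per level of the tree: the text node first, then `p`, `body` and finally `html`. The loop stops as soon as the root carries a fragment, which mirrors the halting condition described above.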
Datamodel
A visualisation of the ontology
Namespace
Prefixes and namespaces used in this specification
Prefix      Namespace
aria        http://www.w3.org/ns/wai-aria/
dom         http://www.w3.org/DOM/model/def/
function    https://www.w3.org/html/model/function/
html        https://www.w3.org/html/model/def/
parameter   https://www.w3.org/html/model/parameter/
rdf         http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs        http://www.w3.org/2000/01/rdf-schema#
rule        https://www.w3.org/html/model/rule/
section     https://www.w3.org/html/id/section/
shp         https://www.w3.org/html/model/shp/
standard    https://www.w3.org/html/id/standard/
target      https://www.w3.org/html/model/target/
xsd         http://www.w3.org/2001/XMLSchema#
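As a small illustration of how these prefixes are used, the following Python sketch expands a prefixed name such as html:Element into a full IRI. The function name `expand` and the choice to raise `KeyError` for unknown prefixes are assumptions made for this example. Only a subset of the prefixes listed above is included.

```python
# Subset of the prefix table above; extend as needed.
PREFIXES = {
    "dom": "http://www.w3.org/DOM/model/def/",
    "html": "https://www.w3.org/html/model/def/",
    "shp": "https://www.w3.org/html/model/shp/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand 'prefix:local' to a full IRI using the given prefix map."""
    prefix, separator, local = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"unknown or missing prefix in {prefixed_name!r}")
    return prefixes[prefix] + local
```

For example, `expand("html:Element")` yields `https://www.w3.org/html/model/def/Element`.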
Serialisation
A serialisation of the ontology in Turtle-format (*.ttl) can be found here.
The <a> HTML element (or anchor element), with its href attribute, creates a hyperlink to web pages, files, email addresses, locations in the same page, or anything else a URL can address.
The <abbr> HTML element represents an abbreviation or acronym; the optional title attribute can provide an expansion or description for the abbreviation. If present, title must contain this full description and nothing else.
The <area> HTML element defines an area inside an image map that has predefined clickable areas. An image map allows geometric areas on an image to be associated with hypertext links.
The <article> HTML element represents a self-contained composition in a document, page, application, or site, which is intended to be independently distributable or reusable (e.g., in syndication). Examples include: a forum post, a magazine or newspaper article, or a blog entry, a product card, a user-submitted comment, an interactive widget or gadget, or any other independent item of content.
The <aside> HTML element represents a portion of a document whose content is only indirectly related to the document's main content. Asides are frequently presented as sidebars or call-out boxes.
An attribute is a name-value pair that is associated with an HTML element. Attributes provide additional information about an element and are specified within the start tag of an element. Attributes can modify the behavior or appearance of an element, define relationships between elements, or provide other metadata. The name of the attribute is followed by an equal sign (=) and the attribute's value, which is enclosed in double or single quotes. Some attributes affect the element simply by their presence in the start tag of the element, with the value implicitly being an empty string.
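The attribute syntax described above can be sketched as follows. This is an illustrative Python sketch and not one of the vocabulary's serialization functions: the names `serialize_attribute` and `serialize_start_tag` are hypothetical, and the sketch always uses double quotes for non-empty values.

```python
from html import escape


def serialize_attribute(name, value=""):
    """Serialize one attribute; an empty value is written as the bare name."""
    if value == "":
        return name
    # Escape &, <, > and quotes so the value is safe inside double quotes.
    return f'{name}="{escape(value, quote=True)}"'


def serialize_start_tag(tag, attributes):
    """Serialize a start tag with its attributes in insertion order."""
    parts = [tag] + [serialize_attribute(n, v) for n, v in attributes.items()]
    return "<" + " ".join(parts) + ">"
```

An attribute such as `checked`, which takes effect by its presence alone, is passed with an empty string and is written without a value.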
The <audio> HTML element is used to embed sound content in documents. It may contain one or more audio sources, represented using the src attribute or the <source> element: the browser will choose the most suitable one. It can also be the destination for streamed media, using a MediaStream.
An autonomous custom element is a custom HTML element that is defined with no extends option. These types of custom elements have a local name equal to their defined name.
The <b> HTML element is used to draw the reader's attention to the element's contents, which are not otherwise granted special importance. This was formerly known as the Boldface element, and most browsers still draw the text in boldface. However, you should not use <b> for styling text; instead, you should use the CSS font-weight property to create boldface text, or the <strong> element to indicate that text is of special importance.
The <bdi> HTML element tells the browser's bidirectional algorithm to treat the text it contains in isolation from its surrounding text. It's particularly useful when a website dynamically inserts some text and doesn't know the directionality of the text being inserted.
The <blockquote> HTML element indicates that the enclosed text is an extended quotation. Usually, this is rendered visually by indentation. A URL for the source of the quotation may be given using the cite attribute, while a text representation of the source can be given using the <cite> element.
The <br> HTML element produces a line break in text (carriage-return). It is useful for writing a poem or an address, where the division of lines is significant.
A DOM element with textual content that contains characters that would otherwise be treated as markup. A CDATA section is typically used to include code snippets, scripts, or other data within an HTML document without having to worry about escaping special characters. In HTML, a CDATA section is denoted by enclosing the block of text within <![CDATA[ and ]]> tags. Anything contained within these tags is treated as raw character data and is not parsed as markup.
The <cite> HTML element is used to describe a reference to a cited creative work, and must include the title of that work. The reference may be in an abbreviated form according to context-appropriate conventions related to citation metadata.
The <code> HTML element displays its contents styled in a fashion intended to indicate that the text is a short fragment of computer code. By default, the content text is displayed using the user agent's default monospace font.
The <col> HTML element defines a column within a table and is used for defining common semantics on all common cells. It is generally found within a <colgroup> element.
A comment is a markup construct used to insert comments within the HTML code. Comments are not displayed in the web browser, but they can be viewed in the HTML source code. Comments are typically used to add notes, descriptions, or explanations to the HTML code for the benefit of developers, without affecting the rendered output in the browser. Comments are denoted by enclosing the comment text within <!-- and --> tags. Anything contained within these tags is treated as a comment and is ignored by the web browser during rendering.
A custom data attribute is an attribute in no namespace whose name starts with the string "data-", has at least one character after the hyphen, is XML-compatible, and contains no ASCII upper alphas.
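The naming rules above can be checked with a short Python sketch. The function name is hypothetical, and the XML-compatibility test is a simplified, ASCII-only approximation of the XML Name production (without colons), which is an assumption of this sketch. Names containing non-ASCII characters are therefore rejected here even where the full rule would allow them.

```python
import re

# Simplified, ASCII-only stand-in for "XML-compatible" (assumption):
# starts with a letter or underscore, then letters, digits, '_', '.', '-'.
_XML_NAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")


def is_custom_data_attribute(name):
    """Check the 'data-' prefix, the minimum length and the casing rule."""
    if not name.startswith("data-") or len(name) <= len("data-"):
        return False
    if any("A" <= ch <= "Z" for ch in name):  # no ASCII upper alphas
        return False
    return _XML_NAME_ASCII.match(name) is not None
```

Under these rules `data-user-id` is accepted, while `data-` (nothing after the hyphen) and `data-userId` (contains an ASCII upper alpha) are rejected.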
A custom element is an element that is custom. Informally, this means that its constructor and prototype are defined by the author, instead of by the user agent. This author-supplied constructor function is called the custom element constructor. Two distinct types of custom elements can be defined: An autonomous custom element, which is defined with no extends option. These types of custom elements have a local name equal to their defined name. A customized built-in element, which is defined with an extends option. These types of custom elements have a local name equal to the value passed in their extends option, and their defined name is used as the value of the is attribute, which therefore must be a valid custom element name.
A customized built-in element is a custom HTML element that is defined with an extends option. These types of custom elements have a local name equal to the value passed in their extends option, and their defined name is used as the value of the is attribute, which therefore must be a valid custom element name.
The <data> HTML element links a given piece of content with a machine-readable translation. If the content is time- or date-related, the <time> element must be used.
The <datalist> HTML element contains a set of <option> elements that represent the permissible or recommended options available to choose from within other controls.
The <del> HTML element represents a range of text that has been deleted from a document. This can be used when rendering "track changes" or source code diff information, for example. The <ins> element can be used for the opposite purpose: to indicate text that has been added to the document.
The <details> HTML element creates a disclosure widget in which information is visible only when the widget is toggled into an "open" state. A summary or label must be provided using the <summary> element.
The <dfn> HTML element is used to indicate the term being defined within the context of a definition phrase or sentence. The <p> element, the <dt>/<dd> pairing, or the <section> element which is the nearest ancestor of the <dfn> is considered to be the definition of the term.
The <div> HTML element is the generic container for flow content. It has no effect on the content or layout until styled in some way using CSS (e.g. styling is directly applied to it, or some kind of layout model like Flexbox is applied to its parent element).
The <dl> HTML element represents a description list. The element encloses a list of groups of terms (specified using the <dt> element) and descriptions (provided by <dd> elements). Common uses for this element are to implement a glossary or to display metadata (a list of key-value pairs).
An HTML document consists of a tree of elements and text. Each element is denoted in the source by a start tag, such as ‘<body>’, and an end tag, such as ‘</body>’. Tags have to be nested such that elements are all completely within each other, without overlapping. Elements can have attributes, which control how the elements work. The HTML vocabulary defines a set of elements that can be used in an HTML document, along with rules about the ways in which the elements can be nested. HTML user agents (e.g., web browsers) parse an HTML document, turning it into a DOM (Document Object Model) tree. A DOM tree is an in-memory representation of a document. An HTML document represents a media-independent description of interactive content. An HTML document might be rendered to a screen, or through a speech synthesizer, or on a braille display. To influence exactly how such rendering takes place, authors can use a styling language such as CSS.
A DOCTYPE is a required preamble. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.
The <dt> HTML element specifies a term in a description or definition list, and as such must be used inside a <dl> element. It is usually followed by a <dd> element; however, multiple <dt> elements in a row indicate several terms that are all defined by the immediate next <dd> element.
An HTML element in the Document Object Model (DOM) represents a thing; that is, it has intrinsic meaning, also known as semantics. An element consists of an HTML start tag and an HTML end tag and has value content. An HTML start tag consists of a less-than sign ("<") and a tag name, followed by a greater-than sign (">"). An HTML end tag consists of a less-than sign ("<"), a slash ("/") and a tag name, followed by a greater-than sign (">"). The value content of an element can be arbitrarily complex.
The <em> HTML element marks text that has stress emphasis. The <em> element can be nested, with each level of nesting indicating a greater degree of emphasis.
The <embed> HTML element embeds external content at the specified point in the document. This content is provided by an external application or other source of interactive content such as a browser plug-in.
An HTML element whose text content is treated as raw text and not parsed as HTML, although character references within that text content are still recognized.
An event handler content attribute is a content attribute for a specific event handler. The name of the content attribute is the same as the name of the event handler.
The <figure> HTML element represents self-contained content, potentially with an optional caption, which is specified using the <figcaption> element. The figure, its caption, and its contents are referenced as a single unit.
The <footer> HTML element represents a footer for its nearest sectioning content or sectioning root element. A <footer> typically contains information about the author of the section, copyright data or links to related documents.
The <header> HTML element represents introductory content, typically a group of introductory or navigational aids. It may contain some heading elements but also a logo, a search form, an author name, and other elements.
Heading content defines the heading of a section (whether explicitly marked up using sectioning content elements, or implied by the heading content itself).
The <hr> HTML element represents a thematic break between paragraph-level elements: for example, a change of scene in a story, or a shift of topic within a section.
The <html> HTML element represents the root (top-level element) of an HTML document, so it is also referred to as the root element or document element. All other elements must be descendants of this element.
The <i> HTML element represents a range of text that is set off from the normal text for some reason, such as idiomatic text, technical terms, taxonomical designations, among others. Historically, these have been presented using italicized type, which is the original source of the <i> naming of this element.
The <input> HTML element is used to create interactive controls for web-based forms in order to accept data from the user; a wide variety of types of input data and control widgets are available, depending on the device and user agent. The <input> element is one of the most powerful and complex in all of HTML due to the sheer number of combinations of input types and attributes.
The <ins> HTML element represents a range of text that has been added to a document. You can use the <del> element to similarly represent a range of text that has been deleted from the document.
The <kbd> HTML element represents a span of inline text denoting textual user input from a keyboard, voice input, or any other text entry device. By convention, the user agent defaults to rendering the contents of a <kbd> element using its default monospace font, although this is not mandated by the HTML standard.
The <li> HTML element is used to represent an item in a list. It must be contained in a parent element: an ordered list (<ol>), an unordered list (<ul>), or a menu (<menu>). In menus and unordered lists, list items are usually displayed using bullet points. In ordered lists, they are usually displayed with an ascending counter on the left, such as a number or letter.
Denotes elements that are listed in the form.elements and fieldset.elements APIs. These elements also have a form content attribute, and a matching form IDL attribute, that allow authors to specify an explicit form owner.
The <main> HTML element represents the dominant content of the <body> of a document. The main content area consists of content that is directly related to or expands upon the central topic of a document, or the central functionality of an application.
The <mark> HTML element represents text which is marked or highlighted for reference or notation purposes, due to the marked passage's relevance or importance in the enclosing context.
The <menu> HTML element is a semantic alternative to <ul>. It represents an unordered list of items (represented by <li> elements), each of which represents a link or other command that the user can activate.
The <meta> HTML element can represent document-level metadata with the name attribute, pragma directives with the http-equiv attribute, and the file's character encoding declaration when an HTML document is serialized to string form (e.g. for transmission over the network or for disk storage) with the charset attribute.
Metadata content is content that sets up the presentation or behavior of the rest of the content, or that sets up the relationship of the document with other documents, or that conveys other "out of band" information.
The <nav> HTML element represents a section of a page whose purpose is to provide navigation links, either within the current document or to other documents. Common examples of navigation sections are menus, tables of contents, and indexes.
Exclusionary definition: the elements that are neither (1) void elements, (2) the template element, (3) raw text elements, (4) escapable raw text elements, nor (5) foreign elements.
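The exclusionary definition above can be made concrete with a Python sketch that classifies an element by local name and namespace. The function name and the returned labels are hypothetical. The name sets reflect the element kinds of the HTML Living Standard as understood at the time of writing and should be checked against the current standard before being relied on.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

VOID = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})
RAW_TEXT = frozenset({"script", "style"})
ESCAPABLE_RAW_TEXT = frozenset({"textarea", "title"})


def element_kind(local_name, namespace=HTML_NS):
    """Return the kind of an element; 'normal' is whatever remains."""
    if namespace != HTML_NS:
        return "foreign"  # e.g. elements in the SVG or MathML namespace
    if local_name in VOID:
        return "void"
    if local_name == "template":
        return "template"
    if local_name in RAW_TEXT:
        return "raw text"
    if local_name in ESCAPABLE_RAW_TEXT:
        return "escapable raw text"
    return "normal"
```

A serializer can use such a classification to decide, for instance, that a void element gets no end tag.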
The <noscript> HTML element defines a section of HTML to be inserted if a script type on the page is unsupported or if scripting is currently turned off in the browser.
The <object> HTML element represents an external resource, which can be treated as an image, a nested browsing context, or a resource to be handled by a plugin.
The <option> HTML element is used to define an item contained in a <select>, an <optgroup>, or a <datalist> element. As such, <option> can represent menu items in popups and other lists of items in an HTML document.
The <p> HTML element represents a paragraph. Paragraphs are usually represented in visual media as blocks of text separated from adjacent blocks by blank lines and/or first-line indentation, but HTML paragraphs can be any structural grouping of related content, such as images or form fields.
As a general rule, elements whose content model allows any flow content or phrasing content should have at least one node in its contents that is palpable content and that does not have the hidden attribute specified.
Phrasing content is the text of the document, as well as elements that mark up that text at the intra-paragraph level. Runs of phrasing content form paragraphs.
The <picture> HTML element contains zero or more <source> elements and one <img> element to offer alternative versions of an image for different display/device scenarios.
The <pre> HTML element represents preformatted text which is to be presented exactly as written in the HTML file. The text is typically rendered using a non-proportional, or monospaced, font. Whitespace inside this element is displayed as written.
The <q> HTML element indicates that the enclosed text is a short inline quotation. Most modern browsers implement this by surrounding the text in quotation marks. This element is intended for short quotations that don't require paragraph breaks; for long quotations use the <blockquote> element.
The <rp> HTML element is used to provide fall-back parentheses for browsers that do not support display of ruby annotations using the <ruby> element. One <rp> element should enclose each of the opening and closing parentheses that wrap the <rt> element that contains the annotation's text.
The <rt> HTML element specifies the ruby text component of a ruby annotation, which is used to provide pronunciation, translation, or transliteration information for East Asian typography. The <rt> element must always be contained within a <ruby> element.
The <ruby> HTML element represents small annotations that are rendered above, below, or next to base text, usually used for showing the pronunciation of East Asian characters. It can also be used for annotating other kinds of text, but this usage is less common.
The <s> HTML element renders text with a strikethrough, or a line through it. Use the <s> element to represent things that are no longer relevant or no longer accurate. However, <s> is not appropriate when indicating document edits; for that, use the <del> and <ins> elements, as appropriate.
The <svg> HTML element is a container for SVG graphics. SVG allows for three types of graphics: vector graphic shapes (e.g., paths consisting of straight lines and curves), images, and text.
The <samp> HTML element is used to enclose inline text which represents sample (or quoted) output from a computer program. Its contents are typically rendered using the browser's default monospaced font (such as Courier or Lucida Console).
The <script> HTML element is used to embed executable code or data; this is typically used to embed or refer to JavaScript code. The <script> element can also be used with other languages, such as WebGL's GLSL shader programming language and JSON.
Script-supporting elements are those that do not represent anything themselves (i.e. they are not rendered), but are used to support scripts, e.g. to provide functionality for the user.
The <search> element represents a part of a document or application that contains a set of form controls or other content related to performing a search or filtering operation. This could be a search of the web site or application; a way of searching or filtering search results on the current web page; or a global or Internet-wide search function.
The <section> HTML element represents a generic standalone section of a document, which doesn't have a more specific semantic element to represent it. Sections should always have a heading, with very few exceptions.
The <slot> HTML element - part of the Web Components technology suite - is a placeholder inside a web component that you can fill with your own markup, which lets you create separate DOM trees and present them together.
The <small> HTML element represents side-comments and small print, like copyright and legal text, independent of its styled presentation. By default, it renders text within it one font-size smaller, such as from small to x-small.
The <source> HTML element specifies multiple media resources for the <picture>, the <audio> element, or the <video> element. It is an empty element, meaning that it has no content and does not have a closing tag. It is commonly used to offer the same media content in multiple file formats in order to provide compatibility with a broad range of browsers given their differing support for image file formats and media file formats.
The <span> HTML element is a generic inline container for phrasing content, which does not inherently represent anything. It can be used to group elements for styling purposes (using the class or id attributes), or because they share attribute values, such as lang. It should be used only when no other semantic element is appropriate. <span> is very much like a <div> element, but <div> is a block-level element whereas a <span> is an inline element.
The <strong> HTML element indicates that its contents have strong importance, seriousness, or urgency. Browsers typically render the contents in bold type.
The <style> HTML element contains style information for a document, or part of a document. It embeds a CSS style sheet, which is applied to the contents of the document containing the <style> element.
The <sub> HTML element specifies inline text which should be displayed as subscript for solely typographical reasons. Subscripts are typically rendered with a lowered baseline using smaller text.
The <summary> HTML element specifies a summary, caption, or legend for a <details> element's disclosure box. Clicking the <summary> element toggles the state of the parent <details> element open and closed.
The <sup> HTML element specifies inline text which should be displayed as superscript for solely typographical reasons. Superscripts are typically rendered with half a character above the normal line, and are sometimes rendered in a smaller font.
A table (<table> element) represents data with more than one dimension. Tables have rows, columns, and cells given by their descendants. The rows and columns form a grid; a table's cells must completely cover that grid without overlap.
The <template> HTML element is a mechanism for holding HTML that is not to be rendered immediately when a page is loaded but may be instantiated subsequently during runtime using JavaScript.
The <textarea> HTML element represents a multi-line plain-text editing control, useful when you want to allow users to enter a sizeable amount of free-form text, for example a comment on a review or feedback form.
The <time> HTML element represents a specific period in time. It may include the datetime attribute to translate dates into machine-readable format, allowing for better search engine results or custom features such as reminders.
The title (<title> element) defines the document's title, which is shown in a browser's title bar or a page's tab. It only contains text; tags within the element are ignored.
The <track> HTML element is used as a child of the media elements, <audio> and <video>. It lets you specify timed text tracks (or time-based data), for example to automatically handle subtitles. The tracks are formatted in WebVTT format (.vtt files) - Web Video Text Tracks.
The <u> HTML element represents a span of inline text which should be rendered in a way that indicates that it has a non-textual annotation. This is rendered by default as a simple solid underline, but may be altered using CSS.
The <var> HTML element represents the name of a variable in a mathematical expression or a programming context. It's typically presented using an italicized version of the current typeface, although that behavior is browser-dependent.
The <video> HTML element embeds a media player which supports video playback into the document. You can use <video> for audio content as well, but the <audio> element may provide a more appropriate user experience.
The <wbr> HTML element represents a word break opportunity - a position within text where the browser may optionally break a line, though its line-breaking rules would not otherwise create a break at that location.
Property that is the parent property of all existing HTML attributes. An attribute is a name-value pair that is associated with an HTML element. Attributes provide additional information about an element and are specified within the start tag of an element. Attributes can modify the behavior or appearance of an element, define relationships between elements, or provide other metadata. The name of the attribute is followed by an equal sign (=) and the attribute's value, which is enclosed in double or single quotes. Some attributes affect the element simply by their presence in the start tag of the element, with the value implicitly being an empty string.
Property that links the HTML fragment to a node in a document, representing the HTML document itself or the document type, HTML element, text, CDATA section or comment within that HTML document.
Specifies a space-separated list of URLs to which, when the resource is activated, POST requests with the element's activation behavior should be sent.
A SPARQL Target to select all nodes in an HTML document that do not have an HTML fragment yet, and whose child nodes all have an HTML fragment already.
SPARQL query
select $this {
  # Select all DOM nodes...
  $this a/rdfs:subClassOf* dom:DocumentTreeNode.
  # ...that do not yet have an HTML fragment...
  filter not exists { $this html:fragment []. }
  # ...but whose child nodes (if any) all have an HTML fragment.
  filter not exists {
    $this ?member ?child.
    ?child a/rdfs:subClassOf* dom:DocumentTreeNode.
    filter(function:isMembershipProperty(?member))
    filter not exists { ?child html:fragment []. }
  }
}
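To illustrate the selection logic outside SPARQL, the following is a minimal Python sketch, not the vocabulary's implementation. It assumes a hypothetical in-memory tree in which each node is a dict with a `children` list and an optional `fragment` key; the name `ready_nodes` and the node layout are illustrative only.

```python
def ready_nodes(nodes):
    """Return the nodes that lack a fragment while all their children have one."""
    return [
        node for node in nodes
        if "fragment" not in node
        and all("fragment" in child for child in node["children"])
    ]


# Hypothetical three-node tree: a document containing a paragraph containing text.
text = {"name": "text", "children": []}
paragraph = {"name": "p", "children": [text]}
document = {"name": "document", "children": [paragraph]}
all_nodes = [document, paragraph, text]

# Applying the selection repeatedly visits the tree bottom-up, leaves first.
order = []
while True:
    batch = ready_nodes(all_nodes)
    if not batch:
        break
    for node in batch:
        node["fragment"] = ""  # placeholder; a real rule would serialise the node here
        order.append(node["name"])

print(order)  # ['text', 'p', 'document']
```

Under these assumptions each pass only picks up nodes whose children were handled in an earlier pass, which is why a node's fragment can always be built from fragments that already exist.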
A SPARQL rule to serialize an HTML fragment for a node in an HTML document, analogous to the HTML fragment serialisation algorithm described in the HTML Living Standard.
SPARQL query
construct {
  # Assert the new HTML fragment for this node in the HTML document
  $this html:fragment ?fragment.
} where {
  # Establish the class of the node in the HTML document
  $this a/rdfs:subClassOf* ?htmlClass.
  ?htmlClass rdfs:isDefinedBy html:.
  # Build the HTML fragment for the node in the HTML document depending on its class.
  # ?unboundDummy is never bound, so ?fragmentString stays unbound when no class matches.
  bind(if(?htmlClass = html:Element, function:getElementFragment($this),
       if(?htmlClass = html:Text, function:getTextFragment($this),
       if(?htmlClass = html:Comment, function:getCommentFragment($this),
       if(?htmlClass = html:ProcessingInstruction, function:getProcessingInstructionFragment($this),
       if(?htmlClass = html:DocumentType, function:getDocumentTypeFragment($this),
       if(?htmlClass = html:Document, function:getDocumentFragment($this), ?unboundDummy))))))
       as ?fragmentString)
  # Convert result from string to rdf:HTML if fragment exists
  bind(if(bound(?fragmentString), strdt(?fragmentString, rdf:HTML), ?unboundDummy) as ?fragment)
}
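The class-based dispatch in this rule can be pictured with a small Python sketch. It is an illustration under stated assumptions rather than the rule itself: nodes are hypothetical dicts with a `class` and a `data` key, only two node classes are covered, and `SERIALISERS` and `fragment_for` are made-up names.

```python
# Hypothetical serialisers keyed by node class; only two classes are sketched here.
SERIALISERS = {
    "Text": lambda node: node["data"],
    "Comment": lambda node: "<!--" + node["data"] + "-->",
}


def fragment_for(node):
    """Return the fragment for a node, or None when no serialiser matches its class."""
    serialiser = SERIALISERS.get(node["class"])
    return serialiser(node) if serialiser else None


print(fragment_for({"class": "Text", "data": "Hello"}))    # Hello
print(fragment_for({"class": "Comment", "data": "note"}))  # <!--note-->
print(fragment_for({"class": "Unknown", "data": ""}))      # None
```

Returning `None` plays the role of the unbound `?fragment` in the rule: when no class matches, nothing is asserted for the node.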
A SPARQL function that returns an HTML fragment of child nodes for a node in an HTML document.
SPARQL query
prefix function: <https://www.w3.org/html/model/function/>
prefix html: <https://www.w3.org/html/model/def/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select ?result where {
  OPTIONAL {
    # Get the HTML fragments of child nodes, if there are any.
    select $parentNode (group_concat(str(?childFragment);separator='') as ?childFragments) {
      {
        select $parentNode ?member ?childFragment {
          $parentNode ?member ?childNode.
          filter(function:isMembershipProperty(?member))
          ?childNode html:fragment ?childFragment.
        }
        order by function:getMemberIndex(?member)
      }
    } group by $parentNode
  }
  bind(if(bound(?childFragments),?childFragments,'') as ?result)
}
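As a rough illustration of the ordering and concatenation step, here is a minimal Python sketch. It assumes that the member index is a number and that the child fragments are already available in a plain dict keyed by that index; `child_node_fragment` is a hypothetical name, not part of the vocabulary.

```python
def child_node_fragment(children):
    """Concatenate child fragments in ascending member-index order.

    `children` maps a numeric member index to that child's fragment.
    """
    return "".join(children[index] for index in sorted(children))


print(child_node_fragment({2: "<p>b</p>", 1: "<p>a</p>"}))  # <p>a</p><p>b</p>
print(child_node_fragment({10: "c", 2: "b", 1: "a"}))       # abc
print(repr(child_node_fragment({})))                        # ''
```

Sorting on a numeric index rather than on a property name matters once a node has ten or more children: compared as strings, an index of 10 would sort before 2.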
A SPARQL function that returns an HTML fragment for a comment node in an HTML document.
SPARQL query
prefix function: <https://www.w3.org/html/model/function/>
select ?result where {
  OPTIONAL {
    # Establish the HTML fragment for this HTML comment
    bind(concat('<!--',function:getChildNodeFragment($comment),'-->') as ?fragment)
  }
  bind(if(bound(?fragment),?fragment,'') as ?result)
}
A SPARQL function that returns an HTML fragment for an HTML document.
SPARQL query
prefix function: <https://www.w3.org/html/model/function/>
select ?result where {
  OPTIONAL {
    # Establish the HTML fragment of the HTML document by retrieving the HTML fragments of all child nodes.
    bind(function:getChildNodeFragment($document) as ?fragment)
  }
  bind(if(bound(?fragment),?fragment,'') as ?result)
}
A SPARQL function that returns an HTML fragment for a document type in an HTML document.
SPARQL query
prefix html: <https://www.w3.org/html/model/def/>
select ?result where {
  OPTIONAL {
    # Establish the doctype name for this Document Type.
    $doctype html:documentTypeName ?name.
    bind(concat('<!DOCTYPE ',str(?name),'>') as ?fragment)
  }
  bind(if(bound(?fragment),?fragment,'') as ?result)
}
A SPARQL function that returns an HTML fragment for the attributes of an HTML element.
SPARQL query
prefix dom: <http://www.w3.org/DOM/model/def/>
prefix html: <https://www.w3.org/html/model/def/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?result where {
  OPTIONAL {
    # Get the HTML attributes for this element, if there are any.
    select $element (group_concat(distinct ?attributeFragment; separator=' ') as ?attributeFragments) {
      $element ?attribute ?value.
      ?attribute a/rdfs:subClassOf* dom:Attribute;
                 ?localName ?key.
      ?localName rdfs:subPropertyOf dom:localName.
      bind(concat(?key,'="',str(?value),'"') as ?attributeFragment)
    } group by $element
  }
  bind(if(bound(?attributeFragments),?attributeFragments,'') as ?result)
}
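The string that this function builds can be sketched in a few lines of Python. The sketch assumes the attributes are already available as a plain dict of names to values, which is not how the vocabulary stores them; `element_attribute` is an illustrative name.

```python
def element_attribute(attributes):
    """Join attributes as name="value" pairs separated by single spaces."""
    return " ".join(f'{name}="{value}"' for name, value in attributes.items())


print(element_attribute({"href": "/home", "class": "nav"}))  # href="/home" class="nav"
print(element_attribute({"disabled": ""}))                   # disabled=""
print(repr(element_attribute({})))                           # ''
```

A Python dict keeps insertion order, whereas the order in which `group_concat` combines values is not fixed by SPARQL, so the attribute order of the real function may differ between engines. As in the query, values are inserted as they are; a value that itself contains a double quote would need escaping before use.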
A SPARQL function that returns an HTML fragment for an element in an HTML document.
SPARQL query
prefix function: <https://www.w3.org/html/model/function/>
prefix html: <https://www.w3.org/html/model/def/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?result where {
  OPTIONAL {
    # Retrieve the tag name of the element.
    $element a ?class.
    ?class html:tag ?tag;
           rdfs:subClassOf ?elementType.
    # Get the HTML attributes for the element, if there are any.
    bind(function:getElementAttribute($element) as ?attributes)
    # Get the HTML fragments of child nodes for the element, if there are any.
    bind(function:getChildNodeFragment($element) as ?childFragments)
    # Build the HTML fragment for this HTML element, by combining everything retrieved above.
    bind(
      concat(
        '<',?tag,if(?attributes='','',concat(' ',?attributes)),'>',
        # Void elements have neither content nor a closing tag.
        if(?elementType=html:VoidElement,'',concat(?childFragments,'</',?tag,'>'))) as ?fragment)
  }
  bind(if(bound(?fragment),?fragment,'') as ?result)
}
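The way the start tag, content and end tag are combined, including the special case for void elements, can be sketched in Python as follows. The sketch makes two assumptions that differ from the query: void elements are recognised through a hard-coded set of tag names (the void elements listed in the HTML Living Standard) instead of the `html:VoidElement` class, and the attribute and child fragments are passed in as ready-made strings. `VOID_TAGS` and `element_fragment` are hypothetical names.

```python
# Void elements as listed in the HTML Living Standard (assumed lookup, see lead-in).
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def element_fragment(tag, attributes="", child_fragments=""):
    """Build the fragment for one element from its already serialised parts."""
    start_tag = "<" + tag + (" " + attributes if attributes else "") + ">"
    if tag in VOID_TAGS:
        # Void elements have neither content nor a closing tag.
        return start_tag
    return start_tag + child_fragments + "</" + tag + ">"


print(element_fragment("p", 'class="lead"', "Hello"))  # <p class="lead">Hello</p>
print(element_fragment("br"))                          # <br>
print(element_fragment("div"))                         # <div></div>
```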
A SPARQL function that returns an HTML fragment for a processing instruction in an HTML document.
SPARQL query
prefix function: <https://www.w3.org/html/model/function/>
select ?result where {
  OPTIONAL {
    # Establish the HTML fragment for this HTML processingInstruction
    bind(concat('<?',function:getChildNodeFragment($processingInstruction),'>') as ?fragment)
  }
  bind(if(bound(?fragment),?fragment,'') as ?result)
}
A SPARQL function that returns an HTML fragment for a text node in an HTML document.
SPARQL query
prefix dom: <http://www.w3.org/DOM/model/def/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select ?result where {
  OPTIONAL {
    # Text is stored in the data attribute of DOM text nodes.
    # The pattern sits inside the OPTIONAL so that ?data is in scope for the bind below.
    $text dom:data ?data.
    # Establish the HTML fragment for this HTML text node
    bind(strdt(?data,xsd:string) as ?fragment)
  }
  bind(if(bound(?fragment),?fragment,'') as ?result)
}
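The function above returns the text data as it is stored. For comparison, the fragment serialisation algorithm in the HTML Living Standard escapes a few characters in ordinary text nodes (text inside raw-text elements such as script and style is left as is). Whether this vocabulary applies that step elsewhere is not stated here, so the following Python sketch is only an illustration of the escaping step itself, with `escape_text` as a hypothetical name.

```python
def escape_text(text):
    """Escape a text node's data for use as HTML content (not for attribute values)."""
    return (
        text.replace("&", "&amp;")        # ampersands first, so later entities stay intact
            .replace("\u00a0", "&nbsp;")  # no-break space
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )


print(escape_text("a < b & c"))  # a &lt; b &amp; c
print(escape_text("plain text"))  # plain text
```

Replacing the ampersand before the other characters avoids escaping the entities that the later replacements introduce.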