Pattern: Citing and linking into reference documents - REVIEW
What is deep-linking?
Deep linking means linking (referring) to a part within a linked resource. In RDF, a resource such as a document is referred to by an IRI or, if located on the web, by a URL. The standard for IRIs (and previously URIs) defines fragments as a way to identify a secondary resource within the resource referred to by the IRI. Fragments are separated from the document IRI by the '#' character.
See RFC 3986, section 3.5: https://datatracker.ietf.org/doc/html/rfc3986#section-3.5
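As a minimal sketch of this split, Python's standard library can separate a document IRI from its fragment. The example.org URL and the variable names are illustrative only.

```python
from urllib.parse import urldefrag

# Split an IRI into the document part and the fragment after '#'.
link = "https://example.org/sample/document.html#section1"
document_iri, fragment = urldefrag(link)

print(document_iri)  # https://example.org/sample/document.html
print(fragment)      # section1
```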
Whether, and how easily, a deep link can be expressed as an IRI fragment depends on the media type of the document. Some formats, especially markup languages such as HTML and XML, have built-in means for deep linking; others do not.
Another perspective is to view a deep link as an index into a document.
The General Fragment Model (GFM) defines such a model. It is composed of three main concepts: information artifact, indexer and anchor. In brief, an indexer defines a specific way of indexing (i.e. identifying) parts (or fragments) of a specific information artifact.
It can be thought of as a function that maps arbitrary tokens to parts of an information artifact. Each token identifies a part of an information artifact. An anchor is a particular token applied to an indexer targeted at an instantiated information artifact.
retrieved from https://arxiv.org/pdf/1909.04117.pdf
External documents in IDMP-O
In IDMP-O we can refer to a document as an instance of a CMNS document. The media type can be stated using the Dublin Core format property, and would usually be text/html if the URL points to a web site. A PDF document available on the web would use application/pdf as its media type.
Media types are registered by the Internet Assigned Numbers Authority (IANA, https://www.iana.org), and the registration procedure is specified by RFC 6838 https://www.rfc-editor.org/rfc/rfc6838.html. We can reference the media type instance by appending the textual form of the media type, such as text/html, to the prefix https://www.iana.org/assignments/mediatypes/.
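The construction described above is plain string concatenation. The following sketch assumes the prefix exactly as given in the text; the function name `media_type_iri` is made up for illustration.

```python
# Prefix for media type instances, copied from the text above.
IANA_PREFIX = "https://www.iana.org/assignments/mediatypes/"


def media_type_iri(media_type):
    """Append the textual media type (e.g. 'text/html') to the prefix."""
    return IANA_PREFIX + media_type


print(media_type_iri("text/html"))
print(media_type_iri("application/pdf"))
```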
Use cases
In some cases in IDMP, an RDF model using IDMP-O must link to a part of an external document. An example is the approved text for a therapeutic indication, a contraindication or an unintended effect, which is mentioned in the clinical particulars of the product information sheet or in another document approved by the authorities. The IDMP standard uses a plain text representation for these, but often, especially for unintended effects, the effects are listed in formatted text, usually tables, that cannot be represented directly as plain text. Instead of an imprecise plain text quote, a reference into the content of the document should be used, one that can be understood and checked.
Deep linking is also used for citation, commenting and change control of documents. In these cases, referencing the whole document is often too unspecific to be useful.
Simple reference by character or byte regions
Like a URL, a deep link should be stable.
A simple form of deep link gives a character offset from the beginning of the document and the length of the range. This works for text-based documents; for binary ones, a byte offset and length can be used instead. However, this is a very brittle way of deep linking, and the problem is well known. Small technical changes such as byte order, character set choice and newline encoding all affect the byte and character lengths of the document. Revision control systems like git or cvs struggle with this problem, and transfer protocols such as FTP have different modes for text and binary. Different operating systems and programming languages use different conventions even today.
RDF (like XML by default) standardizes on UTF-8 for string representations, but counting characters is still not an easy task.
Is 'à' the same character as the combination of '`' + 'a'? Does it count as one or two? The Unicode standard defines several normalization forms to answer this question. https://unicode.org/reports/tr15/
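The question can be made concrete with Python's `unicodedata` module. This sketch uses the combining grave accent (U+0300) as the accent character, which is an assumption about the intended combination.

```python
import unicodedata

precomposed = "\u00e0"   # 'à' as a single code point
decomposed = "a\u0300"   # 'a' followed by COMBINING GRAVE ACCENT

print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# After normalization both spellings compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, len(nfc))        # True 1
print(nfd == decomposed, len(nfd))         # True 2
```

A character offset is therefore only meaningful together with a stated normalization form.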
For formatted text, counting characters becomes more ambiguous:
Given a simple list:
- item 1
- item 2
Does the enumeration dot count as a character? What is its correct Unicode character, or should we use '-' or '*' from ASCII as an equivalent? Is the indent significant? Is it a single '<tab>' character or multiple spaces?
Given a table:
Column 1 | Column 2 |
---|---|
a | b |
Are the lines of the table characters? How is the table represented as characters? CSV? Markdown?
It is clear by now that deep linking using a character offset and length is not as trivial as it looks, and that a simple FTP download or git checkout can break it.
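The brittleness is easy to reproduce. The sketch below uses made-up sample strings and shows the offset of the same word shifting when only the newline convention or the character encoding changes.

```python
# Same text, two newline conventions (as after a text-mode transfer).
text_lf = "line one\nline two\nfatigue\n"
text_crlf = text_lf.replace("\n", "\r\n")

target = "fatigue"
offset_lf = text_lf.index(target)
offset_crlf = text_crlf.index(target)
print(offset_lf, offset_crlf)    # 18 20

# Same text, two encodings: the byte offset differs from the character offset.
sample = "naïve fatigue"
char_offset = sample.index(target)
utf8_offset = sample.encode("utf-8").index(target.encode("utf-8"))
utf16_offset = sample.encode("utf-16-le").index(target.encode("utf-16-le"))
print(char_offset, utf8_offset, utf16_offset)   # 6 7 12
```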
If we want to deep link into an element of a picture or diagram, offset and range break down completely. A JPEG representation of a diagram has no single byte range that can be used to refer to a text box in that image. It also does not contain any characters directly; optical character recognition is needed to obtain them.
Most documents relevant in the IDMP context of regulatory affairs will probably be in PDF format. PDF sits somewhere between a text representation and a graphical representation. Most of the text in a PDF is typically available for extraction, but not always: the text could be hidden in an embedded image, for example. Structural elements such as sections and tables are sometimes listed in a table of contents, and pages can be counted, but what is present depends on the author and the tools used. In the worst case the PDF is a scan of a printout, so more or less a list of page images; in a more ideal case it uses the PDF/EA standard for marking and structuring the PDF document.
If the format of the document is plain text (media type text/plain), then RFC 5147 https://datatracker.ietf.org/doc/html/rfc5147 specifies fragment identifiers for such documents.
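Following the examples in RFC 5147:
- the IRI https://example.com/text.txt#line=,1 refers to the first line of the document https://example.com/text.txt.
- the IRI ftp://example.com/text.txt#line=10,20;length=9876,UTF-8 refers to lines 11 to 20 of the document available via the FTP URL ftp://example.com/text.txt. Line positions are counted from zero and lie between lines, so the range 10,20 starts after the tenth line. The part length=9876,UTF-8 is an integrity check stating that the whole document is 9876 characters long in UTF-8 encoding.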
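A rough sketch of how such a fragment could be resolved follows. RFC 5147 counts line positions from zero, between lines, so a range maps directly onto a Python slice. The function `select_lines` is hypothetical, handles only the `line=` range form, and ignores the integrity check instead of verifying it.

```python
from urllib.parse import urldefrag


def select_lines(text, iri):
    """Return the lines selected by a '#line=start,end' fragment.

    Sketch only: the 'char=' scheme is not handled and any integrity
    check after ';' is ignored rather than verified.
    """
    _, fragment = urldefrag(iri)
    scheme_part = fragment.split(";", 1)[0]   # drop ';length=...' if present
    if not scheme_part.startswith("line="):
        raise ValueError("only 'line=' fragments are handled here")
    spec = scheme_part[len("line="):]
    lines = text.splitlines()
    if "," not in spec:
        return []                             # a single position selects no lines
    start_s, end_s = spec.split(",", 1)
    start = int(start_s) if start_s else 0
    end = int(end_s) if end_s else len(lines)
    # Positions sit between lines, so they map onto slice boundaries.
    return lines[start:end]


# Made-up 30-line document standing in for text.txt.
sample = "\n".join(f"row {n}" for n in range(1, 31))
first = select_lines(sample, "https://example.com/text.txt#line=,1")
block = select_lines(sample, "ftp://example.com/text.txt#line=10,20;length=9876,UTF-8")
print(first)                 # ['row 1']
print(block[0], block[-1])   # row 11 row 20
```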
Deep-linking into markup text
HTML and XML are markup languages that give text structure and semantics by using special markers in and around the text. Both languages have built-in support for deep linking, or indexing. An ID attribute can be attached to any markup element and defines an anchor in the text that can be referred to directly in the fragment of the document URL.
The URL https://example.org/sample/document.html#section1 would refer to the HTML element with the ID "section1". Values of attributes of type ID must be unique within the document.
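To illustrate, the sketch below resolves such an ID fragment with Python's `xml.etree.ElementTree`. The markup is a hypothetical, well-formed stand-in for the document behind the URL, and `resolve_id_fragment` is an illustrative helper, not part of any standard.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

# Hypothetical stand-in for the content of document.html (well-formed XHTML).
SAMPLE = """<html><body>
<div id="section1"><h1>First section</h1><p>Text A</p></div>
<div id="section2"><h1>Second section</h1><p>Text B</p></div>
</body></html>"""


def resolve_id_fragment(markup, url):
    """Return the element whose id attribute equals the URL fragment."""
    _, fragment = urldefrag(url)
    root = ET.fromstring(markup)
    return root.find(f".//*[@id='{fragment}']")


element = resolve_id_fragment(
    SAMPLE, "https://example.org/sample/document.html#section1"
)
print(element.tag, element.find("h1").text)   # div First section
```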
If the author of the document used HTML or XML and has indexed all relevant document elements with unique IDs, then linking using simple IDREF fragments is the easiest way.
The problem with this approach is that it relies on the author having done the indexing. The author, however, can only guess which parts of the document will be referenced, so even an author who wants to support deep linking will give ID anchors to only a subset of elements.
To address this problem, the W3C, the standardization body for HTML and XML, has defined a special kind of fragment called XPointer, which uses the XPath expression language to target any element within an HTML or XML document without needing an explicit ID.
A fragment that uses the XPointer standard has the form #xpointer(), with the XPath expression inside the parentheses.
XPath allows expressions that refer to specific HTML/XML elements such as <div> by order of occurrence and by nesting within other elements, to attributes of these elements, to the content of the elements, and to many combinations of these.
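For example, the URL https://example.org/sample/document.html#xpointer((//h1)[1]) with an XPointer fragment would refer to the first <h1> element in the text, which could be the heading, by the XPath expression (//h1)[1]. The leading // is needed because <h1> is not the root element of an HTML document, so the path /h1[1] would match nothing.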
An alternative language for referring to elements is used in the Cascading Style Sheets (CSS) specification, which separates formatting (such as color) from the structure of the markup text. The expressions that address elements are called selectors and work similarly to XPath. The CSS selector #section1 selects the element with the ID section1, as in the simple reference. The selector [title~="Amlodipine"] selects elements whose title attribute contains the word Amlodipine. CSS selectors are often used in JavaScript to target HTML elements for dynamic HTML, and have become very popular through frameworks like jQuery, but there is no official standard for using CSS selectors as fragments in a URL.
Some documents use JSON as their media type. JSON is hierarchical like XML, and there is a standard, JSONPath, that allows expressions similar to those XPath provides for XML documents, so in principle JSON-based documents can be supported as well. As with CSS selectors, there is no universally accepted standard for JSONPath fragments.
Given the document https://example.org/sample/document.json, which is the equivalent of the HTML example, the JSONPath expression "$.body.div1" refers to the first section, which could then be referenced by a URI such as https://example.org/sample/document.json#jsonpath("$.body.div1").
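The idea can be sketched without a JSONPath library by walking member names written in JSONPath dot notation. The JSON content and its key names are assumptions standing in for the example document, and `resolve_dotted_path` covers only the simplest path form.

```python
import json

# Hypothetical JSON counterpart of the HTML example; key names are assumed.
DOCUMENT = json.loads(
    '{"body": {"div1": {"title": "First section"},'
    ' "div2": {"title": "Second section"}}}'
)


def resolve_dotted_path(data, path):
    """Resolve a path such as '$.body.div1' by walking object keys."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    node = data
    for key in filter(None, path[1:].split(".")):
        node = node[key]
    return node


first_section = resolve_dotted_path(DOCUMENT, "$.body.div1")
print(first_section)   # {'title': 'First section'}
```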
An RDF-based implementation of XPointer was proposed as part of the W3C Evaluation and Report Language (EARL) project. The EARL project was intended to support the reporting of test results and needed deep links to refer to test cases in a test specification. It also had to be possible to express differences from expected results.
https://www.w3.org/TR/Pointers-in-RDF10/
This specification defines a small RDFS ontology of pointers: offset-based ones, expression-based ones (using XPath, XPointer or CSS selectors) and combinations thereof.
For example, given the example HTML document above, the deep link to the first section can be defined using the XPath expression /html/body/div[1]. Note that XPath is one-based, so the first element has the index 1, not 0.
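The one-based indexing can be checked with the XPath subset in Python's `xml.etree.ElementTree`. The markup is a hypothetical stand-in for the example document. ElementTree does not accept absolute paths on an element, so /html/body/div[1] is written relative to the <html> root.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the example HTML document (well-formed XHTML).
SAMPLE = """<html><body>
<div id="section1"><h1>First section</h1></div>
<div id="section2"><h1>Second section</h1></div>
</body></html>"""

root = ET.fromstring(SAMPLE)                  # root is the <html> element
first_div = root.find("./body/div[1]")        # same target as /html/body/div[1]
second_div = root.find("./body/div[2]")

print(first_div.get("id"))    # section1
print(second_div.get("id"))   # section2
```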
The following RDF defines an XPath pointer http://example.org/sample/document.html#xpathToSection1 on the example document with this XPath expression.
This pointer can then be used instead of the document IRI for deep linking. XPath expressions are normally expressed using CURIEs, which use namespace prefixes instead of full IRIs. The ontology allows such a namespace prefix declaration to be referenced.
If the document is available in a media type that uses markup text, then most use cases for deep linking can be addressed with XPointer fragments.
A notable problem is content that is expressed in the form of images. If the image itself uses XML, as SVG does, then graphical elements can be targeted using XPointer. If the image is pixel-based, then it may be possible to specify regions of the image, e.g. by the pixel positions of the opposing corners of a rectangle superimposed on the image. This would be a complex pointer: an outer pointer referencing the image and a specific "rectangle pointer" marking the region within the image.
Deep-linking by asserting the structural and/or semantic content
Most approved documents that will be referred to in the IDMP domain will be PDF documents. PDF documents do not use markup text, so the parts (fragments) of the document must be defined outside of the document, by the model that refers to it.
The parts can be purely structural (like heading, chapter, section, paragraph, text element, tabular element, list), based on rhetorical function (introduction, summary), or a mix of both.
If we assert that a document contains three sections, then we can refer to the first, second or third section in that document. We do not state how each section can be identified, but that simple assertion allows us to reference the sections. The deep links would break if another section were introduced at the beginning of the document, or if we did not know how many sections the document has. To guard against this, we can additionally assert that the title of the section we want to refer to contains the text "second section". So what we need is an ontology that defines the parts of documents and their possibly nested components.
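The difference between a positional and a content-based reference can be sketched with a small in-memory model. The classes `Document` and `Section`, the IRIs and the titles are all hypothetical and do not represent the IDMP-O or SPAR vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    iri: str     # anchor IRI assigned by the external model (hypothetical)
    title: str


@dataclass
class Document:
    iri: str
    sections: list = field(default_factory=list)

    def section_at(self, position):
        """Positional reference, one-based."""
        return self.sections[position - 1]

    def section_titled(self, text):
        """Content-based reference: first section whose title contains text."""
        wanted = text.lower()
        return next(s for s in self.sections if wanted in s.title.lower())


base = "https://example.org/sample/document"
doc = Document(base, [
    Section(base + "#section1", "The first section"),
    Section(base + "#section2", "The second section"),
    Section(base + "#section3", "The third section"),
])

before = doc.section_at(2).iri
# A new section at the start silently shifts every positional reference.
doc.sections.insert(0, Section(base + "#preface", "Preface"))
after = doc.section_at(2).iri
by_title = doc.section_titled("second section").iri

print(before)     # https://example.org/sample/document#section2
print(after)      # https://example.org/sample/document#section1
print(by_title)   # https://example.org/sample/document#section2
```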
The Semantic Publishing and Referencing (SPAR) suite of ontologies http://www.sparontologies.net/, developed by Silvio Peroni et al., defines such parts of a document:
- the Document Components Ontology (DoCO) http://www.sparontologies.net/ontologies/doco: an ontology that provides a general-purpose structured vocabulary of document elements. DoCO has been designed as a general unifying ontological framework for describing different aspects related to the content of scientific and other scholarly texts. Its primary goal has been to improve the interoperability and shareability of academic documents (and related services) when multiple formats are actually used for their storage.
- the Discourse Elements Ontology (DEO) http://www.sparontologies.net/ontologies/deo: The pure rhetorical characterisation of document components is not necessarily linked to the structural organisation that a scholarly article may have. For example, some scientific journals require their articles to follow a particular rhetorical segmentation, in order to identify explicitly what the meaningful parts are from a scientific point of view, e.g. Introduction, Background, Evaluation, Materials, Methods and Conclusion. These parts usually, but not necessarily, correspond to the coarse structural parts of the article, its sections. Whilst the background is usually woven together with the introduction, it may also be presented as a separate section, or indeed may substitute for the introduction entirely.
- the Pattern Ontology (PO) http://purl.org/spar/po, previously available at http://www.essepuntato.it/2008/12/pattern: an ontology formally defining patterns for segmenting a document into atomic components, so that they can be manipulated independently and re-flowed in different contexts.
- the Citation Typing Ontology (CiTO) http://purl.org/spar/cito/2018-02-16: an ontology to enable characterization of the nature or type of citations, both factually and rhetorically, and to permit these descriptions to be published on the Web.
- the Functional Requirements for Bibliographic Records (FRBR) ontology http://www.sparontologies.net/ontologies/frbr: an ontology of the basic concepts and relations described in the IFLA report on the Functional Requirements for Bibliographic Records (FRBR), also described in Ian Davis's RDF vocabulary. FRBR is a general model, proposed by the International Federation of Library Associations (IFLA), for describing documents and their evolution. It works for both physical and digital resources and it has proved to be very flexible and powerful. One of the most important aspects of FRBR is the fact that it is not associated with a particular metadata schema or implementation. FRBR describes all documents from four different and correlated points of view: Work, Expression, Manifestation and Item, each of which is a FRBR Endeavour.
- the Citation Counting and Context Characterisation Ontology (C4O) http://www.sparontologies.net/ontologies/c4o: an ontology that permits the number of in-text citations of a cited source to be recorded, together with their textual citation contexts, along with the number of citations a cited entity has received globally on a particular date.
The Document Components Ontology supports all major components of a document, such as chapter, paragraph, section, table or sentence. However, it does not include the sub-components of tables, such as rows and columns. These components will be added by an IDMP-O extension. The Pattern Ontology defines more basic and abstract components such as text content, containers and fields.
The structure of the document is specified using the po:contains relation from the Pattern Ontology. It can be aligned with CMNS by making po:contains a sub-property of cmns-col:includes. The content of an element, such as a sentence, is stated using the c4o:hasContent data property, which can be aligned with CMNS as a sub-property of cmns-txt:hasTextValue.
Many components and sub-components form an ordered collection (list). Lists can be expressed in RDF, but not in an OWL-compliant way. A separate ontology for collections is needed, and the SPAR ontologies use the Collections Ontology (CO) https://github.com/collections-ontology/collections-ontology for that. IDMP-O also needs ordered collections for describing macromolecule sequences, and has extended the basic CMNS collection ontology with sequences. CO defines a list as a linked list. IDMP-O provides an integer index for array-like direct reference to a list element. Either ontology should fit.
https://semantic-web-journal.net/system/files/swj506.pdf
The list must be kept separate from the document or document component. A document has lists of different components, for example a list of sections and a list of tables. It is not itself a list. A list can be a document component itself, such as the table of contents.
With this assertion about the example document, we have anchors defined for the first two sections and first two tables of the document, which can be referred to as deep links. The example uses simple fragment URLs, but the document is of an unknown media type. The fragments do not imply that it is an HTML or XML document annotated with XML IDs. It can be a hardcopy of a printout, or completely binary. By externalizing the markup, we are no longer dependent on the author or on the media type and tooling the author used. The disadvantage is that the externalized document structure is harder to generate automatically. It would be fairly easy to generate it from a markup text source, but for PDF, more complex tools with optical character recognition would be needed. Much work is currently being done with generative AI models, which can help, although this is out of scope for IDMP-O.
Note that it is not necessary to describe the full document. It is sufficient to describe only the structure that is necessary to identify the component that the deep link refers to.
Example for deep-linking into the Pluvicto product information document
Pluvicto is a therapeutic agent for the treatment of cancer that is approved by the FDA for the U.S. jurisdiction. The FDA has published the product information document under the public URL https://www.accessdata.fda.gov/drugsatfda_docs/label/2022/215833s000lbl.pdf.
The document can be asserted in RDF using this URL as
For the purpose of the example we want to deep-link into the first page and into a table on the second page.
page 1
snippet from page 2
The first page contains a section about indications with the section title "-----INDICATIONS AND USAGE-----" and another section that summarizes adverse reactions with the section title "-----ADVERSE REACTIONS------". The text in these sections is relevant for the IDMP-O Therapeutic Indication and Undesirable Effect entities.
We can extract the textual content of both sections as plain text. This is the easy case that works for many authorized medicinal products and is the basis of the IDMP standard model, which uses plain text for the description of the therapeutic indication. For the first section this results in the following IDMP-O model fragment:
In this model there is no reference to the source. Although this example works with the IDMP standard model for therapeutic indication using plain text, it would be more FAIR if the text were linked to its source. For this, a deep link into the PDF document is needed.
The second section can be used in a similar way to assert the undesirable effects. The section is a summary, and the different adverse reactions are described in more detail in the table on the next page. The IDMP standard relates an undesirable effect to its severity grade. Because different severity grades are mentioned, the resulting IDMP model will have to treat each adverse reaction as a separate undesirable effect, e.g. one for fatigue, one for dry mouth, etc. Each undesirable effect would have the same undesirable effect text, but only part of that text is explicitly relevant to the specific adverse reaction.
Again, to be FAIR, it would be more appropriate to state that the two undesirable effects are part of the text in the adverse reaction summary section, not that they are equivalent to that text.
To make an external structural model for the product information document, we can declare both sections and identify them by their title:
To simplify the example, we use c4o:hasContent and ignore the dashes in the title. We could instead state that the section title contains a text region (compare with HTML <span>) with the exact words, or use another property to indicate that it is a substring. Ignoring the dashes in the titles makes the reference more robust.
We assert that the document contains sections that each contain a section title with the textual value "INDICATIONS AND USAGE" or "ADVERSE REACTIONS", respectively.
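Matching on the title text while ignoring the decorative dashes could look like the following sketch. The function `normalize_title` and the fragment names are hypothetical; only the PDF URL and the two raw titles come from the text above.

```python
import re

PDF_URL = "https://www.accessdata.fda.gov/drugsatfda_docs/label/2022/215833s000lbl.pdf"


def normalize_title(raw):
    """Strip surrounding dashes and collapse whitespace or underscores."""
    trimmed = raw.strip().strip("-_ ")
    return re.sub(r"[\s_]+", " ", trimmed).upper()


# Hypothetical anchors, keyed by normalized section title.
anchors = {
    "INDICATIONS AND USAGE": PDF_URL + "#indications-and-usage",
    "ADVERSE REACTIONS": PDF_URL + "#adverse-reactions",
}

raw_title = "-----ADVERSE REACTIONS------"   # as printed on page 1
print(normalize_title(raw_title))            # ADVERSE REACTIONS
print(anchors[normalize_title(raw_title)])
```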
Both sections now have a defined IRI that can be used to deep link into the product information document PDF.
This can be used to state the provenance. Fatigue and dry mouth can now have separate unintended effect texts, which assert that they are quoted from the "ADVERSE REACTIONS" section of the product information document.
The unintended effect text for fatigue would simply be "fatigue", and we can assert that it is included in the "ADVERSE REACTIONS" section.
Instead of the generic cmns-col:is included in relation we can use a more specific one, e.g. one from the provenance ontology (PROV-O). In this way we cite the product document more properly. We no longer (wrongly) claim that the fatigue unintended effect is described by the full summary of all adverse reactions.
So even in this relatively simple example, allowing deep linking into a document provides value. The challenge is what can be done if no easy plain text extraction is possible.
For this we have to look at page two of the product information document. This page contains a table that lists information about the adverse reactions in more detail, detail that is important for the unintended effect entity, such as the grade of the adverse reaction.
We can identify an unintended effect "Myelosuppression" in that table, which exists in grade 2 or worse. It is mentioned in the first row of table 1, which has the table header "Recommended Dosage Modifications of PLUVICTO for Adverse Reactions".
To quote the textual elements with the adverse reaction text and its grade, we can use an external description of the table in the PDF. It is more complex than the simple section reference by section title, but it is possible to deep link into the specific table cells, either by row and column index or by column header.
Row 1, column 1 contains the unintended effect text, and the cell in row 1, column 2 contains the grade(s). Now the unintended effect text for Myelosuppression can be split into its elements, or the elements can simply all be listed from the MedDRA or SNOMED vocabulary.
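Both ways of addressing a cell can be sketched with a small table model. The class `Table`, the example.org IRI, the header names and the cell naming scheme are assumptions for illustration. The cell texts are placeholders paraphrased from the description above, not literal quotes from the label.

```python
from dataclasses import dataclass


@dataclass
class Table:
    iri: str        # hypothetical anchor IRI for the table
    headers: list   # column headers, left to right
    rows: list      # body rows, each a list of cell texts

    def cell(self, row, column):
        """Cell text by one-based row and column index."""
        return self.rows[row - 1][column - 1]

    def cell_by_header(self, row, header):
        """Cell text by one-based row index and column header."""
        return self.rows[row - 1][self.headers.index(header)]

    def cell_iri(self, row, column):
        """Hypothetical naming scheme for a cell anchor."""
        return f"{self.iri}-row{row}-col{column}"


table1 = Table(
    iri="https://example.org/pluvicto-label#table1",
    headers=["Adverse reaction", "Severity"],          # assumed header names
    rows=[["Myelosuppression", "Grade 2 or worse"]],   # placeholder cell texts
)

print(table1.cell(1, 1))                      # Myelosuppression
print(table1.cell_by_header(1, "Severity"))   # Grade 2 or worse
print(table1.cell_iri(1, 2))
```

Addressing by column header survives a reordering of columns, whereas addressing by index does not.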
The example rephrases the first row of the table. The text value of the unintended effect text is not a 100% literal quote but a derivation from the original table. This is done because the text value is mandatory. If an updated model allows reference by deep link alone, then the text value can be omitted.
In the original Pluvicto product information document there are more sections referring to therapeutic indications and adverse reactions, but referencing table 1 is the most complicated case.