Automation of SHACL generation - current estimate

 

 

Implementing one part of https://dil-edmcouncil.atlassian.net/wiki/spaces/IDMP/pages/7245582 revealed the following problem.

Problem

It is a known feature of the IDMP Ontology that some OWL restrictions are conceptual in nature and therefore do not lend themselves to being transformed into SHACL shapes.

Consider the following example: https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11238-Substances/Molecule

<owl:Class rdf:about="https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11238-Substances/Molecule">
    ...
    <skos:definition>electrically neutral entity consisting of more than one atom (n&gt;1)</skos:definition>
    <rdfs:subClassOf rdf:resource="https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11238-Substances/MolecularEntity"/>
    ...
    <rdfs:subClassOf>
        <owl:Restriction>
            <owl:onProperty rdf:resource="https://www.omg.org/spec/Commons/Collections/comprises"/>
            <owl:onClass rdf:resource="https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11238-Substances/Atom"/>
            <owl:minQualifiedCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">2</owl:minQualifiedCardinality>
        </owl:Restriction>
    </rdfs:subClassOf>
    ...
    <rdfs:label>molecule</rdfs:label>
</owl:Class>

The restriction in question should not be transformed into a SHACL shape because one cannot expect a dataset about molecules to contain information about the atoms that make up those molecules.
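To make the problem concrete, here is a minimal sketch of what a naive, purely mechanical translation of the restriction above could produce. It assumes that a qualified minimum cardinality is mapped to sh:qualifiedValueShape with sh:qualifiedMinCount and that the sh: prefix is declared elsewhere; the function name is hypothetical and this is not the actual generator.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The restriction from the Molecule class, isolated for the example.
RESTRICTION_XML = f"""
<owl:Restriction xmlns:owl="{OWL}" xmlns:rdf="{RDF}">
  <owl:onProperty rdf:resource="https://www.omg.org/spec/Commons/Collections/comprises"/>
  <owl:onClass rdf:resource="https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11238-Substances/Atom"/>
  <owl:minQualifiedCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">2</owl:minQualifiedCardinality>
</owl:Restriction>
"""


def restriction_to_property_shape(xml_text: str) -> str:
    """Hypothetical helper: render one qualified min-cardinality restriction as Turtle."""
    node = ET.fromstring(xml_text)
    resource = f"{{{RDF}}}resource"
    path = node.find(f"{{{OWL}}}onProperty").get(resource)
    on_class = node.find(f"{{{OWL}}}onClass").get(resource)
    min_count = int(node.find(f"{{{OWL}}}minQualifiedCardinality").text)
    return (
        "[ a sh:PropertyShape ;\n"
        f"  sh:path <{path}> ;\n"
        f"  sh:qualifiedValueShape [ sh:class <{on_class}> ] ;\n"
        f"  sh:qualifiedMinCount {min_count} ;\n"
        "] ."
    )


print(restriction_to_property_shape(RESTRICTION_XML))
```

A shape like this would report a violation for every molecule in the data that is not linked to at least two atoms, which is exactly the situation described above.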

Initially, @Pawel Garbacz believed that we could mark those conceptual restrictions so that the automation process would know what to ignore.
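The marking idea could be sketched roughly as follows. The `conceptual` flag stands in for whatever annotation would be placed on a restriction in the ontology; no such annotation property has been agreed, so the flag, the `Restriction` record and the example.org IRIs are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Restriction:
    owner: str                # class that carries the restriction
    on_property: str          # restricted property
    conceptual: bool = False  # hypothetical marker set by the ontologist


def restrictions_to_shaclise(restrictions):
    """Keep only restrictions that are not marked as conceptual."""
    return [r for r in restrictions if not r.conceptual]


restrictions = [
    Restriction(
        owner="https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11238-Substances/Molecule",
        on_property="https://www.omg.org/spec/Commons/Collections/comprises",
        conceptual=True,
    ),
    # Hypothetical restriction used only to show the non-conceptual case.
    Restriction(
        owner="https://example.org/SomeClass",
        on_property="https://example.org/someProperty",
    ),
]

kept = restrictions_to_shaclise(restrictions)
```

The filter itself is trivial; the cost lies in reviewing and marking every restriction by hand.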

However, the analysis by @Thomas Weber of a small part of IDMPO (https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11615-MedicinalProducts/) showed that the proportion of such conceptual restrictions may be as high as 70%.

Review (random selection of node shapes):

 

iso11615-mp-m-shacl:AcidNodeShape

OK

 

iso11615-mp-m-shacl:ActiveIngredientActiveMoietyBasisOfStrengthNodeShape

OK

 

iso11615-mp-m-shacl:ActiveIngredientNodeShape

OK

 

iso11615-mp-m-shacl:BatchNodeShape

 

Exclude properties

This refers to a physical batch. The restrictions refer to a batch manufacturing process, which is a runtime instance and not applicable from the IDMP regulatory perspective. References to runtime instances should be marked so that they are ignored during SHACL generation.

iso11615-mp-m-shacl:CodeElementNodeShape

Exclude properties

Must every code element refer to its code set? If the code element has an IRI, then this should not be needed. If the code element gets its identity from the code set and a local identifier in that code set, then both properties together are necessary.

iso11615-mp-m-shacl:CodingRationaleNodeShape

OK

 

iso11615-mp-m-shacl:ConversionBasedUnitNodeShape

Exclude properties

Is it assumed that the whole unit ontology is included in the shape dataset? If not, then this is much too restrictive. A unit IRI should normally be enough to express a quantity. There is no need to specify the quantity kind, except in a few cases, and certainly not the base unit it is derived from. Typically, instance data will refer only to the unit IRI or, via a property path, to the UoM identifier.

iso11615-mp-m-shacl:LotNodeShape

Exclude properties

See BatchNodeShape above.

iso11615-mp-m-shacl:MedicinalProductIdentifierNodeShape

Exclude properties

The restrictions describe the segments of the MPID. This cannot be translated into a shape, as it describes the semantic pattern of the MPID. No instance data will segment the identifier in this way in RDF.

iso11615-mp-m-shacl:NuclideIdentifierNodeShape

Exclude properties

The restriction ‘classifies some Nuclide’ must be excluded. Nuclide is a subclass of Atom and will never appear in the instance data.

iso11615-mp-m-shacl:OfficialNameNodeShape

Exclude properties

See SubstanceNameNodeShape below.

iso11615-mp-m-shacl:SubstanceNameNodeShape

Exclude properties

‘is applicable in jurisdiction’ should not be mandatory for substance names, esp. the full geographic region identifier. What would be the jurisdiction geographic region identifier for “Water”?

This should probably be fixed in the ontology. In the IDMP standard it is optional.

iso11615-mp-m-shacl:TherapeuticIndicationNodeShape

Exclude properties

Whether hasTargetPopulation is mandatory was a subject of discussion among the SMEs. There seems to be a view that it should be optional, in which case its absence would mean that no restriction exists, which could imply that all people are targeted. Semantically, this could include animals, and there was an objection that pediatric use is another special case. From the ontology point of view, a target population always exists, even if it is not specified.

iso11615-mp-m-shacl:WrittenLanguageNodeShape

Exclude

This should not be part of the SHACL for IDMP. ‘uses orthography’ will probably also not be defined in any lcc-lr written language instance.

 

We probably need to filter out classes that are not referenced from the IDMP domain ontologies.
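One way such a filter might work is sketched below. It assumes that "IDMP domain ontologies" means everything under the namespace https://spec.pistoiaalliance.org/idmp/ontology/, that triples are available as plain string tuples, and that blank nodes are written with a `_:` prefix; the example.org IRIs are hypothetical. A class is kept if it is an IDMP class itself or if it is reachable from an IDMP subject, looking through blank nodes such as restrictions.

```python
from collections import defaultdict

IDMP_NS = "https://spec.pistoiaalliance.org/idmp/ontology/"  # assumed namespace


def referenced_from_idmp(triples):
    """Collect IRIs reachable from IDMP subjects, traversing blank nodes only."""
    outgoing = defaultdict(set)
    for s, _p, o in triples:
        outgoing[s].add(o)
    stack = [s for s in outgoing if s.startswith(IDMP_NS)]
    seen = set(stack)
    referenced = set()
    while stack:
        node = stack.pop()
        for target in outgoing.get(node, ()):
            if target.startswith("_:"):
                # Restriction or other blank node: keep walking through it.
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
            else:
                referenced.add(target)
    return referenced


def filter_shape_targets(candidate_classes, triples):
    """Drop candidate classes that IDMP neither defines nor references."""
    referenced = referenced_from_idmp(triples)
    return [c for c in candidate_classes
            if c.startswith(IDMP_NS) or c in referenced]


MOLECULE = IDMP_NS + "ISO/ISO11238-Substances/Molecule"
USED = "https://example.org/external/UsedClass"      # hypothetical
UNUSED = "https://example.org/external/UnusedClass"  # hypothetical

triples = [
    (MOLECULE, "rdfs:subClassOf", "_:r1"),
    ("_:r1", "owl:onClass", USED),
    (UNUSED, "rdfs:subClassOf", "https://example.org/external/Thing"),
]

targets = filter_shape_targets([MOLECULE, USED, UNUSED], triples)
```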

 

Possible solutions

Semi-automation

We start with a set of manually crafted SHACL shapes (or with a manually pruned set of automatically generated ones). The automation infrastructure would support the review process: during ontology development, the SHACL engineer would be notified which shapes need to be reviewed to stay aligned with recent ontology changes.
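The notification step could, for instance, compare a fingerprint of each class's axioms against the fingerprint recorded when its shape was last reviewed. The sketch below assumes that the axioms of a class can be serialised as strings and that a class-to-shape mapping is maintained somewhere; all names and the `ex:` identifiers are hypothetical.

```python
import hashlib


def fingerprint(axioms):
    """Order-independent digest of the axioms asserted for one class."""
    joined = "\n".join(sorted(axioms))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def shapes_needing_review(current_axioms, reviewed_fingerprints, shape_for_class):
    """Return shapes whose target class changed since the last review."""
    stale = []
    for cls, axioms in current_axioms.items():
        shape = shape_for_class.get(cls)
        if shape is None:
            continue  # no hand-maintained shape for this class
        if reviewed_fingerprints.get(cls) != fingerprint(axioms):
            stale.append(shape)
    return sorted(stale)


# Hypothetical state: ex:A is unchanged, ex:B gained a new axiom.
current_axioms = {"ex:A": ["a1", "a2"], "ex:B": ["b1", "b2"]}
reviewed_fingerprints = {
    "ex:A": fingerprint(["a2", "a1"]),
    "ex:B": fingerprint(["b1"]),
}
shape_for_class = {"ex:A": "ex:ANodeShape", "ex:B": "ex:BNodeShape"}

to_review = shapes_needing_review(current_axioms, reviewed_fingerprints, shape_for_class)
```

Such a check could run on every ontology change and list the shapes to revisit, leaving the decision about what to change in each shape to the SHACL engineer.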

Partial automation

We can automate shaclisation for only some manually selected classes (e.g., https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11615-MedicinalProducts/MedicinalProduct) that do not involve any conceptual restrictions.
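In its simplest form, the manual selection could be an allowlist that the generator consults before producing any shape. The following sketch assumes that the allowlist is a plain set of class IRIs; the function name and the example.org IRIs are hypothetical. It also reports allowlisted classes that no longer exist in the ontology, so that the list does not silently go stale.

```python
MEDICINAL_PRODUCT = (
    "https://spec.pistoiaalliance.org/idmp/ontology/"
    "ISO/ISO11615-MedicinalProducts/MedicinalProduct"
)


def select_classes(ontology_classes, allowlist):
    """Return (classes to shaclise, allowlisted classes missing from the ontology)."""
    present = set(ontology_classes)
    wanted = set(allowlist)
    return sorted(present & wanted), sorted(wanted - present)


ontology_classes = [
    MEDICINAL_PRODUCT,
    "https://spec.pistoiaalliance.org/idmp/ontology/ISO/ISO11238-Substances/Molecule",
]
allowlist = {MEDICINAL_PRODUCT, "https://example.org/RemovedClass"}  # second entry is hypothetical

selected, missing = select_classes(ontology_classes, allowlist)
```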