Saturday, 27 October 2018

Revisit: Reuse of Structured Data: Semantics, Linkage, and Realization (2)




(continued from Part I) / Library and Information Science, 43.1 (2017): 7-46. / [[Chinese version]]


RESEARCH HIGHLIGHTS: 

# An old record is no longer just a piece of data; it is now defined as a new semantic dataset. 
  i.e. its triples, graphs, links, file formats ...
  i.e. its revised, vocabulary-encoded versions ...
  ex. data:d2148340 a dcat:Dataset .  # files: JSON-LD, Turtle, XML

# A new method to curate, publish & visualize LOD graphs via a CKAN portal. 
  i.e. two models for one dataset, published in two views.
  ex. data:d2148340 a dcat:Dataset .   # Dublin Core @schema1
  ex. data:d2148340 a data:Refined .   # more semantics @schema2

# Validation & Reproducibility: provenance and contexts are described in detail. 

Practices

Figure: Example record data:d2148340
We make use of structured records (XML files) from a digital archive catalogue and convert them into semantically rich, interlinked resources on the Web. This is realized as a unified Linked Data catalogue for several digital archive collections. Our work results in a LOD catalogue, publicly available at data.odw.tw. The following five parts are involved in realizing this website. 


A catalogue record about the species Pleione formosana (data:d2148340) is used throughout the paper as an example to demonstrate how we model, convert, and represent the semantics of a structured record.

Figure: R4R Ontology
Part 1: Exploring data reuse relations in a shared context -- We review our previous research on the Relation for Reuse Ontology (R4R). In particular, we provide mechanisms for reusing articles, data, and code, with some flexibility in encoding provenance and license information.

Part 2: Comparing two different data conversion approaches to providing LOD for an archive catalogue -- We show two different scenarios: (1) the LOD catalogue is converted directly from a relational database, and (2) the LOD catalogue is generated through a series of format conversions, from XML to CSV and then to RDF. 
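The second scenario (XML to CSV, then CSV to RDF) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the conversion code used for the catalogue: the element names (`record`, `title`), the sample record content, and the base URI `http://example.org/record/` are all placeholders, and the output is plain N-Triples built by hand rather than with an RDF library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input; the real catalogue schema is not reproduced here.
SAMPLE_XML = """<records>
  <record id="d2148340">
    <title>Pleione formosana</title>
  </record>
</records>"""

BASE = "http://example.org/record/"   # placeholder base URI (assumption)
DCT = "http://purl.org/dc/terms/"     # DCMI Metadata Terms namespace


def xml_to_csv(xml_text: str) -> str:
    """Step 1: flatten each <record> element into one CSV row."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "title"])
    for rec in root.iter("record"):
        writer.writerow([rec.get("id", ""), rec.findtext("title", default="")])
    return out.getvalue()


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes, and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def csv_to_ntriples(csv_text: str) -> list[str]:
    """Step 2: turn each CSV row into N-Triples statements."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['id']}>"
        identifier = escape_literal(row["id"])
        triples.append(f'{subject} <{DCT}identifier> "{identifier}" .')
        if row["title"]:
            title = escape_literal(row["title"])
            triples.append(f'{subject} <{DCT}title> "{title}" .')
    return triples


if __name__ == "__main__":
    for line in csv_to_ntriples(xml_to_csv(SAMPLE_XML)):
        print(line)
```

Keeping the CSV as an explicit intermediate step makes it easy to inspect and clean the rows before any triples are generated, which is the main practical difference from converting directly out of a database.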

Figure: KB links example
Part 3: Data profiling, cleaning and mapping -- We demonstrate the format conversion processes and discuss the pros and cons of various ways of handling broken links in source datasets. In addition, we map and link catalogue records to three external knowledge bases: GeoNames, Wikidata, and Encyclopedia of Life.  

Part 4: Using CKAN as a Linked Data platform -- We briefly introduce CKAN, an open-source, web-based data portal software package for curating and publishing datasets. CKAN provides data preview, search, and discovery, especially with regard to geospatial datasets. We built several extensions to CKAN in order to deposit, publish, browse, and search Linked Data. Various Linked Data representations of a catalogue record (Turtle, RDF/XML, and JSON-LD) can all be downloaded and reused.
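To make the idea of "one record, several representations" concrete, the following sketch writes the same single statement about the example record as Turtle and as JSON-LD. It is an illustration only and does not reproduce the CKAN extensions: the `data:` namespace used here (`http://example.org/data/`) is a placeholder, and the record is typed with the class `dcat:Dataset` from the W3C DCAT vocabulary.

```python
import json

DCAT = "http://www.w3.org/ns/dcat#"   # W3C DCAT namespace
DATA = "http://example.org/data/"     # placeholder for the data: prefix (assumption)


def as_turtle(local_id: str) -> str:
    """Serialize '<record> a dcat:Dataset' as Turtle with prefix declarations."""
    return (
        f"@prefix dcat: <{DCAT}> .\n"
        f"@prefix data: <{DATA}> .\n"
        "\n"
        f"data:{local_id} a dcat:Dataset .\n"
    )


def as_jsonld(local_id: str) -> str:
    """Serialize the same statement as a compact JSON-LD document."""
    doc = {
        "@context": {"dcat": DCAT, "data": DATA},
        "@id": f"data:{local_id}",
        "@type": "dcat:Dataset",
    }
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    print(as_turtle("d2148340"))
    print(as_jsonld("d2148340"))
```

Both outputs describe the same graph; only the syntax differs, which is why a portal can offer them side by side as alternative downloads of one dataset.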

Part 5: Designing an ontology for data representation and reuse -- We design an ontology, voc4odw, which includes the following three modules:

(1) The Core Model. It comprises a data model and a conceptual model. 




The data model represents the key data structures and relations. It is a framework for describing data source, derivation, and provenance.

Figure: The voc4odw Data Model
The conceptual model incorporates Simple Knowledge Organization System (SKOS); it also connects to key event concepts. The conceptual model allows for data contextualization using common and domain knowledge vocabularies.



(2) The Curation Model. It describes how structured records are identified, classified, and published on a curation platform: the classification of themes, the assignment of data identifiers, and the publication of datasets.

(3) A vocabulary, voaf:Vocabulary. It is defined as "A vocabulary used in the Linked Data cloud" in the Vocabulary of a Friend (VOAF). This module relates the Core Model to external common vocabularies. Some hierarchical relations between different external vocabularies can be traced with this vocabulary.


voc4odw ontology

Prefix    Namespace                                       Description

Common Knowledge
cc        http://creativecommons.org/ns#
csvw      http://www.w3.org/ns/csvw#
dc
dcat
dct                                                       DCMI Metadata Terms
dctype    http://purl.org/dc/dcmitype/                    DCMI Type Vocabulary
event     http://purl.org/NET/c4dm/event.owl#             Event Ontology
foaf
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
gn                                                        GeoNames Ontology
gns                                                       GeoNames Entity
lcsh      http://id.loc.gov/authorities/subjects
org
prov
r4r
schema                                                    Schema.org
skos
time      http://www.w3.org/2006/time#                    W3C Time Ontology
voaf      http://purl.org/vocommons/voaf#
wde       http://www.wikidata.org/entity/                 Wikidata Entity

Domain Knowledge
aat       http://vocab.getty.edu/aat/
dwc                                                       Darwin Core Terms
dwciri                                                    Darwin Core terms
eol                                                       The Encyclopaedia of Life (EOL)
txn       http://lod.taxonconcept.org/ontology/txn.owl#

Local Namespace
voc       http://voc.odw.tw/ontology#
agent
article
code
data                                                      Linked Data for ODWeb
evt84                                                     Event Entity in ODW
project                                                   Project Entity in ODW
r1 (n)    http://data.odw.tw/r1/   (r2, r3…)
refined   http://data.odw.tw/refined/
catdat    http://catalog.digitalarchives.tw/
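
A prefix table like this one is what turns a compact name such as voaf:Vocabulary into a full URI. The sketch below shows that expansion step. It includes only prefixes whose namespace is listed in the table above; the function name `expand` is our own for illustration, and prefixes without a listed namespace (for example `data`) are deliberately left out instead of guessed.

```python
# Prefix-to-namespace pairs copied from the table; incomplete by design.
PREFIXES = {
    "cc": "http://creativecommons.org/ns#",
    "csvw": "http://www.w3.org/ns/csvw#",
    "dctype": "http://purl.org/dc/dcmitype/",
    "event": "http://purl.org/NET/c4dm/event.owl#",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "time": "http://www.w3.org/2006/time#",
    "voaf": "http://purl.org/vocommons/voaf#",
    "wde": "http://www.wikidata.org/entity/",
    "aat": "http://vocab.getty.edu/aat/",
    "txn": "http://lod.taxonconcept.org/ontology/txn.owl#",
    "voc": "http://voc.odw.tw/ontology#",
    "refined": "http://data.odw.tw/refined/",
    "catdat": "http://catalog.digitalarchives.tw/",
}


def expand(curie: str) -> str:
    """Expand 'prefix:local' to a full URI using the PREFIXES table."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"no namespace known for {curie!r}")
    return PREFIXES[prefix] + local


if __name__ == "__main__":
    print(expand("voaf:Vocabulary"))
    print(expand("dctype:Dataset"))
```

Raising an error for an unknown prefix, instead of passing the string through unchanged, keeps a missing table entry from silently producing an invalid URI.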