eSciDoc - Research Data Management
Remotely controllable laboratories are the main prerequisite for experiments conducted from distant locations. Equally important is a holistic and consistent view of the research data generated by such experiments. An important feature of BW-eLabs is therefore the acquisition of all data objects and documents related to an experiment, including processing and refinement steps such as error correction, analysis, and aggregation. BW-eLabs captures the relations between objects and stores them in a semantic network. It enables local and remote access to the data, controlled by fine-grained policies. At the same time, BW-eLabs aims at improving the reproducibility of experiments, for which access to research data is pivotal. In short, BW-eLabs establishes research data management in the laboratory.
Data Acquisition in the Laboratory
Within the BW-eLabs project, we regard all data objects created in the course of a virtual or remotely conducted experiment as “research data”. In the case of FMF (Freiburg Materials Research Center), the data objects are mainly absorption and photoluminescence spectra, whereas the experiments conducted by ITO (Institut für technische Optik at Stuttgart University) mainly produce digital holograms. Both laboratories additionally produce calibration and configuration information, which is important for correctly understanding and interpreting the data captured from the instruments. BW-eLabs acquires all data objects created in the laboratory the very moment they come into existence and stores them in the eSciDoc data infrastructure. eSciDoc is an e-research environment jointly developed by the Max Planck Digital Library and FIZ Karlsruhe; the software is available free of charge as open source. Based on a requirements analysis at FMF, we developed a first, prototypical concept for the acquisition and management of data objects created throughout the synthesis of nanoparticles. By monitoring the folders on hard disks used by the instruments to store measured data, we are able to replicate data objects into eSciDoc in near real time. Later on, we refined the model by factoring in the requirements of the digital holography lab at ITO.
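A minimal sketch of such a folder watcher, assuming the Python watchdog library; the watched path and the replication step are placeholders, not the actual eSync Daemon implementation:

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCH_DIR = "/data/spectrometer"  # example instrument output folder


class InstrumentDataHandler(FileSystemEventHandler):
    """Reacts to new files written by the instrument software."""

    def on_created(self, event):
        if event.is_directory:
            return
        # Placeholder: the real daemon would replicate the new file
        # into the eSciDoc data infrastructure at this point.
        print(f"Replicating {event.src_path} into the repository ...")


observer = Observer()
observer.schedule(InstrumentDataHandler(), WATCH_DIR, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)  # keep the watcher process alive
except KeyboardInterrupt:
    observer.stop()
observer.join()
```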
Fig. 1: Data acquisition workflow in the laboratory
The currently implemented acquisition workflow consists of nine steps. First, a researcher invokes the browser-based eLab Solution on their desktop computer and creates a new experiment within the context of a project or an investigation series (1). The experiment is represented by a folder in eSciDoc, which will later contain the captured data objects (2). By means of the folder identifier encoded in a QR code, the experimental data in eSciDoc can later be referenced from a traditional paper-bound laboratory journal (3). The researcher then picks a predefined group of instruments (a so-called rig) that fits the needs of the experiment. The system automatically generates a configuration object and sends it to the eSync Daemon and the Deposit Service (4). As soon as the researcher starts the experiment by pushing a button in the eLab Solution, the eSync Daemon starts monitoring one or more directories in which the instruments store their measurements on the laboratory computer (5). The daemon process replicates these files to the Deposit Service (6). The Deposit Service invokes the Metadata Extractor, which creates a new metadata record by analyzing the measured data and from automatically captured contextual information such as rig, instruments, users logged on to the system, investigation series, project, and timestamp (7). The Deposit Service combines the metadata record and the replicated data object into an eSciDoc Item and stores it in the eSciDoc folder created previously for the experiment (8). Back in the office, the researcher (or a colleague, over the internet) may retrieve the data objects either by navigating through the data repository via projects and investigation series or by scanning the QR code in the laboratory journal (9).
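To make selected steps concrete, here is a hedged Python sketch of the experiment-creation and deposit steps. The endpoint paths, response shapes, and metadata fields are all assumptions for illustration, not the actual eSciDoc or Deposit Service API:

```python
import datetime
import json

import qrcode    # third-party library for QR code generation
import requests  # third-party HTTP client

# Invented endpoint standing in for the repository API.
REPO = "http://elabs.example.org/api"


def create_experiment(project, title):
    """Steps (1)-(3): create the experiment folder in the repository
    and encode its identifier as a QR code for the paper lab journal."""
    r = requests.post(f"{REPO}/projects/{project}/experiments",
                      json={"title": title})
    r.raise_for_status()
    folder_id = r.json()["folder_id"]  # assumed response shape
    qrcode.make(f"{REPO}/folders/{folder_id}").save(f"{folder_id}.png")
    return folder_id


def deposit(path, folder_id, context):
    """Steps (6)-(8): extract contextual metadata, combine it with the
    replicated file into an item, and store it in the experiment folder."""
    metadata = {
        "rig": context["rig"],
        "instruments": context["instruments"],
        "users": context["users"],
        "captured": datetime.datetime.utcnow().isoformat() + "Z",
    }
    with open(path, "rb") as f:
        r = requests.post(f"{REPO}/folders/{folder_id}/items",
                          files={"content": f},
                          data={"metadata": json.dumps(metadata)})
    r.raise_for_status()
    return r.json()["item_id"]  # assumed response shape
```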
Semantic Relations of Data Objects
An important aspect of research data management is expressing the relations between data objects, e.g., between raw data and the calibration data of the instrument used for the measurement. Only by combining both data objects can researchers correctly interpret a measurement. Furthermore, the perception and representation of research data are not static over time. As the research process proceeds, more and more derived objects are created: calibration data may allow for corrections of measurements, visualizations and aggregations may help to better understand the results, and combining the data from several instruments or experiments may offer new insights. All these steps either create new versions of existing objects or derive new ones. All of these objects should be linked to the original data, and the semantics of these links should be captured as well.
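As a minimal sketch of such typed links, the following snippet uses the W3C PROV vocabulary via rdflib; PROV is our choice for illustration (the source does not prescribe a vocabulary), and all resource URIs are invented:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import PROV

BASE = Namespace("http://elabs.example.org/resource/")  # invented URIs

g = Graph()
g.bind("prov", PROV)

raw = BASE["measurement/raw-001"]
calibration = BASE["calibration/spectrometer-1/2011-03"]
corrected = BASE["measurement/corrected-001"]

# Typed links capture the semantics: the corrected spectrum is
# derived from the raw measurement using the calibration data.
g.add((corrected, PROV.wasDerivedFrom, raw))
g.add((corrected, PROV.wasDerivedFrom, calibration))

print(g.serialize(format="turtle"))
```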
Fig. 2: Concept for Contextual Metadata
Metadata is pivotal for the archiving, publication, and reuse of research data. Only sufficiently described data can be understood and evaluated by researchers. Additionally, only proper metadata allows for meaningful search and browsing in research data. Defining appropriate metadata profiles is therefore important. There is a major difference between the discipline-specific and the general view on research data objects. The discipline-specific view has to focus on the requirements of the researchers. Standardization is desirable, but hard to achieve, for several reasons:
• Differences in approaches and methodologies across disciplines, and even within a single discipline
• The danger of confining research processes by too strict or inadequate models, thus decreasing or losing the acceptance of the researchers
• The use of proprietary tools for the creation or further processing of (meta-)data by researchers
In contrast, librarians need to standardize metadata in order to provide cross-disciplinary cataloging and retrieval. This inevitably leads to simplifications and information loss that may not be acceptable for a discipline-specific search, but allows for the discovery and reuse of data outside of a particular research community. However, we believe that the description of the context in which an experiment took place is a good candidate for standardization, even across scientific communities. Therefore, we focused our metadata model on this contextual information. Entities with their attributes are the core building blocks of all metadata models. They represent objects, services, and actors found in the real world. For BW-eLabs, we have chosen the Core Scientific Metadata Model (CSMD) of the Science and Technology Facilities Council (STFC) as the basis of our metadata model and reused many of its concepts and entities. An extensive description of the BW-eLabs metadata model for research data can be found in the report Metadatenkonzept für dynamische Daten (in German).
Fig. 3: Overview of all relevant entities in the BW-eLabs metadata model
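As a small illustration of this entity-attribute approach, the following sketch models a few of the entities as Python data classes. The entity names loosely follow CSMD concepts; the attribute selection is our simplification, not the actual BW-eLabs profile:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instrument:
    name: str
    calibration_id: Optional[str] = None  # link to current calibration data


@dataclass
class Experiment:
    identifier: str  # also encoded as a QR code in the lab journal
    rig: List[Instrument] = field(default_factory=list)
    performed_by: List[str] = field(default_factory=list)  # logged-on users


@dataclass
class Investigation:
    """An investigation series grouping related experiments (cf. CSMD)."""
    title: str
    project: str
    experiments: List[Experiment] = field(default_factory=list)
```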
Providing Research Data as Linked Data
Interoperability is an important requirement when it comes to the reuse of data. It is not sufficient to just provide the data and related metadata, because proprietary metadata profiles are a barrier to understanding and evaluating data produced by others. Additionally, metadata often relies on implicit relations derived from the scientific and experimental context. Reuse of research data therefore requires both: interoperability of the description and preservation of the contextual information by making relations explicit. Once again, an appropriate metadata model is the prerequisite for achieving these two goals. But relations are not restricted to the data produced within a single project or institution. In order to avoid isolated “data silos”, semantic technologies can help by expressing the meaning of entities, attributes, and relations on a formal layer based on (standardized) ontologies, so that computers can read and interpret the information. This establishes a new way of knowledge representation in research data management and, at the same time, embodies the idea of the Linked Data initiative. At its core, Linked Data expresses and relates structured data with HTTP Uniform Resource Identifiers (URIs) and RDF and exposes both via appropriate programmatic interfaces. In 2006, Tim Berners-Lee published the fundamental Linked Data concept and coined four rules known as the Linked Data Principles (illustrated in the sketch after the list):
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more things
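A minimal illustration of these four rules in Python with rdflib; all URIs, the elab namespace, and the Measurement class below are hypothetical placeholders, not the actual BW-eLabs identifiers:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

# Rules 1 and 2: HTTP URIs as names, so people can look them up.
ELAB = Namespace("http://elabs.example.org/resource/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("elab", ELAB)

measurement = ELAB["measurement/4711"]

# Rule 3: provide useful information in RDF when the URI is dereferenced.
g.add((measurement, RDF.type, ELAB.Measurement))
g.add((measurement, DCTERMS.title, Literal("Absorption spectrum, sample A")))

# Rule 4: link to other URIs so that clients can discover more things.
g.add((measurement, DCTERMS.creator, URIRef("http://elabs.example.org/person/jdoe")))
g.add((measurement, DCTERMS.source, ELAB["instrument/spectrometer-1"]))

print(g.serialize(format="turtle"))
```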
Within the scope of a bachelor thesis, we developed an ontology for the contextual information covered by the BW-eLabs metadata model. It is based on well-known vocabularies like DCMI, SKOS, FOAF, and VoID and allows for the publication of research data, and especially its contextual relations to projects, persons, institutions, instruments, and rigs, as Linked Data.
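A sketch of what such a description might look like, reusing the vocabularies named above; the resource URIs and the elab predicates are invented for illustration and may differ from the actual BW-eLabs ontology terms:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, FOAF, RDF, VOID

# Invented namespace standing in for the BW-eLabs ontology.
ELAB = Namespace("http://elabs.example.org/ontology#")
BASE = Namespace("http://elabs.example.org/resource/")

g = Graph()
for prefix, ns in [("dcterms", DCTERMS), ("foaf", FOAF),
                   ("void", VOID), ("elab", ELAB)]:
    g.bind(prefix, ns)

dataset = BASE["experiment/42/data"]
researcher = BASE["person/jdoe"]

# Contextual relations: dataset -> project, person, instrument, rig.
g.add((dataset, RDF.type, VOID.Dataset))
g.add((dataset, DCTERMS.isPartOf, BASE["project/nanoparticles"]))
g.add((dataset, DCTERMS.creator, researcher))
g.add((dataset, ELAB.measuredWith, BASE["instrument/spectrometer-1"]))  # hypothetical predicate
g.add((dataset, ELAB.usedRig, BASE["rig/fmf-rig-1"]))                   # hypothetical predicate

g.add((researcher, RDF.type, FOAF.Person))
g.add((researcher, FOAF.name, Literal("J. Doe")))

print(g.serialize(format="turtle"))
```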
Contact: Matthias Razum, FIZ Karlsruhe [Link]