Making data products FAIR with BDF

An integral part of the Baltic Data Flows (BDF) project is to ensure that project outputs are FAIR: Findable, Accessible, Interoperable, and Reusable. The FAIR data principles, as defined by Wilkinson et al. (2016), are “characteristics that data resources, tools, vocabularies, and infrastructures should exhibit to assist discovery and reuse by third parties.” If these principles are applied correctly, they can significantly strengthen data flows and standards for data systems within marine monitoring and assessment programmes. Here we highlight progress and best-practice application for each FAIR principle, with examples from the BDF project.

FAIR guiding principles for data resources, S Pundir CC-BY-SA-4.0

Making data FINDABLE

Data is findable when it is described by metadata and registered or indexed in a searchable resource that is known and accessible to potential users. A unique and persistent identifier should be assigned so that permanent links can be formed between the data, metadata, and other related materials, which assists data discovery and reuse. For example, Wilkinson et al. state that “a Digital Object Identifier (DOI), is often used to permanently identify digital objects, providing a standard mechanism for retrieval of metadata about the object, and a means to access the data object itself”.

The HELCOM Metadata Catalogue was recently upgraded under the BDF project. The catalogue uses dataset IDs to link metadata records with the datasets available in the HELCOM MADS, which allows each dataset to be referenced by a URL. The Metadata Catalogue runs on GeoNetwork 3.12, the latest stable version at the time of writing, which is designed to manage spatially referenced resources and provides metadata editing and search functions. Search keywords can be assigned to data, such as the subject area (e.g., environment, biota, boundaries) and INSPIRE theme (e.g., land use, protected sites, species distribution). The Metadata Catalogue uses the General Multilingual Environmental Thesaurus (GEMET) for keywords and for categorizing datasets. GEMET is a reference indexing and retrieval tool used for European environmental datasets and databases.

The INSPIRE Directive (2007) highlights that “the loss of time and resources in searching for existing spatial data or establishing whether they may be used for a particular purpose is a key obstacle to the full exploitation of the data available. Metadata should therefore be used to provide descriptions of available spatial data sets and services”. Metadata, as defined in Article 3 of the INSPIRE Directive, is “information describing spatial data sets and spatial data services making it possible to discover, inventory, and use them”. In other words, metadata is information that describes and gives meaning to data. Examples of metadata include the location, time, origin, creator(s), terms of use, and access conditions of a dataset. Metadata can also record how data has been handled, such as any manipulation or changes applied to it.

Making data ACCESSIBLE

HELCOM monitoring data is accessible online via web services or download functionality. Several tools and services enable open access to data and data products. The HELCOM MADS is used for viewing data products and spatial datasets resulting from HELCOM assessments. The HELCOM Metadata Catalogue is linked to the MADS and is used for searching and downloading data products and spatial datasets. In addition, HELCOM provides direct access to all geospatial datasets through either an ArcGIS REST interface or the OGC WMS standard. Furthermore, several thematic databases are made available by HELCOM or the data host. For example, the HELCOM Biodiversity database contains macro species observation data made available by HELCOM Contracting Parties.

Data products and spatial datasets made available in the HELCOM MADS and the HELCOM Metadata Catalogue adhere to several data quality standards. Metadata records are INSPIRE (ISO 19115) compliant, which enables persistent URL referencing to a metadata record. The download service offers a variety of outputs. For example, selected sets of output spatial data products (maps) are accessible as a service via standard spatial data service interfaces, e.g., WMS for viewing and INSPIRE-compliant CSW for metadata descriptions. Output datasets are downloadable from each metadata record in a standard and widely used spatial data format (e.g., Raster, TIFF, Shapefile). The viewing service in the HELCOM MADS and the ArcGIS REST interface are INSPIRE (OGC WMS) compliant.
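To illustrate what access through the OGC WMS standard looks like in practice, the following minimal sketch builds a WMS 1.3.0 GetMap request URL. The query parameters are those defined by the WMS standard; the endpoint address, the layer name, and the bounding box values are placeholders chosen for illustration and are not actual HELCOM service details.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration (not a real HELCOM URL).
WMS_ENDPOINT = "https://example.org/geoserver/wms"


def build_getmap_url(endpoint, layer, bbox, width=800, height=600, crs="EPSG:4326"):
    """Return a WMS 1.3.0 GetMap URL that renders one layer as a PNG image."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "CRS": crs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return endpoint + "?" + urlencode(params)


# Rough, assumed bounding box around the Baltic Sea.
# In WMS 1.3.0 with EPSG:4326 the axis order is latitude first:
# (min_lat, min_lon, max_lat, max_lon).
baltic_bbox = (53.0, 9.0, 66.0, 31.0)
url = build_getmap_url(WMS_ENDPOINT, "example_layer", baltic_bbox)
print(url)
```

Any WMS-capable client, such as a desktop GIS, builds requests of this form behind the scenes, which is why publishing through the standard makes the maps usable without HELCOM-specific software.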

In addition, metadata records can be connected to national, European, or international metadata harvesting systems. The European Environment Agency (2019) states that “access to spatial data in Europe is governed by the INSPIRE Directive, adopted in 2007 it establishes the infrastructure for spatial information. INSPIRE provides the possibility to directly access spatial datasets, according to 34 INSPIRE spatial data themes via standard web services”. Adhering to relevant international standards and practices such as the INSPIRE Directive is promoted under the HELCOM Data and Information Strategy (2019).

Data harvesting, as defined by the European Environment Agency (2019), is “a technological solution for institutions to access data at a local or national level without the need for data requests, enabling institutions to have better and more flexible access to data”. The BDF project channels resources into the development of data harvesting systems based on Application Programming Interfaces (APIs), with the aim of automatically integrating national datasets into harmonised regional datasets and of harvesting datasets to the European Data Portal. An API offers a connection between computers or between computer programs. It is a software interface through which one piece of software offers a service to others, which makes it well suited to data harvesting. For example, the BDF project is supporting new data export formats for the ‘SHARKdata’ API, which is used for accessing marine monitoring data from the sub-basins surrounding Sweden. SHARKdata is maintained by BDF project partner the Swedish Meteorological and Hydrological Institute (SMHI).
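The idea of API-based harvesting can be sketched in a few lines: a harvester visits each national endpoint, reads the records it returns, and merges them into one regional list while noting where each record came from. This is a simplified illustration only. The endpoint URLs, the field names, and the JSON layout are all hypothetical, and the sketch does not reflect the actual SHARKdata API or the BDF harvesting systems.

```python
import json


def harvest(sources, fetch):
    """Collect records from every national source into one regional list.

    `sources` maps a country code to an API URL. `fetch` is any callable
    that takes a URL and returns the response body as JSON text.
    """
    regional = []
    for country, url in sources.items():
        for record in json.loads(fetch(url)):
            # Tag each record with its origin so provenance is not lost.
            regional.append({**record, "source_country": country})
    return regional


# Stand-in for real HTTP responses, so the sketch runs offline.
# URLs and fields are invented for illustration.
FAKE_RESPONSES = {
    "https://example.org/se/api/datasets": '[{"station": "S1", "chl_a": 2.1}]',
    "https://example.org/fi/api/datasets": '[{"station": "F7", "chl_a": 3.4}]',
}

sources = {
    "SE": "https://example.org/se/api/datasets",
    "FI": "https://example.org/fi/api/datasets",
}
regional = harvest(sources, FAKE_RESPONSES.__getitem__)
print(regional)
```

In a real harvester the `fetch` argument would wrap an HTTP client call instead of a dictionary lookup. Passing it in as a parameter keeps the merging logic testable without network access.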

Making data INTEROPERABLE

The European Commission Expert Group on FAIR Data (2018) states that interoperable data and metadata follow a “formal, accessible, shared, and broadly applicable language for knowledge representation. They use vocabularies which themselves follow the FAIR principles, and they include qualified references to other data or metadata”. Strict standardization of data outputs is the key to interoperability. To ensure consistency in metadata standards, the HELCOM Metadata Catalogue uses the Data Catalogue Vocabulary Application Profile (DCAT-AP). The DCAT-AP, as defined by its developers, is “a specification for metadata records to meet specific application needs of data portals in Europe while providing interoperability with other applications”. HELCOM’s Metadata Catalogue is compliant with the DCAT-AP, which enables datasets and data services to be described using a standard model and vocabulary. This gives data re-users an overview of which datasets exist and are maintained, and allows data providers to make datasets searchable and accessible across multiple data portals. The metadata template applied to the HELCOM Metadata Catalogue contains fields described in the ISO 19115 metadata standard. These are checked using the INSPIRE Reference Validator, a tool used to ensure that metadata meets the requirements defined in the INSPIRE Implementing Rules and related Technical Guidelines.
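As a rough illustration of what a dataset description based on a standard model and vocabulary looks like, the sketch below assembles a small DCAT-style record as JSON-LD. The property names (`dct:title`, `dct:description`, `dcat:keyword`, `dcat:distribution`, `dcat:accessURL`, `dct:license`) come from the DCAT and Dublin Core vocabularies on which DCAT-AP builds. The helper function, the identifier, the URLs, and the licence version are assumptions made for the example, and the record is not claimed to be a complete or validated DCAT-AP record.

```python
import json


def dcat_dataset(identifier, title, description, keywords, access_url, licence_url):
    """Build a minimal DCAT-style dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:identifier": identifier,
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": list(keywords),
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:accessURL": {"@id": access_url},
                "dct:license": {"@id": licence_url},
            }
        ],
    }


# All values below are invented for illustration.
record = dcat_dataset(
    identifier="example-dataset-001",
    title="Example eutrophication data product",
    description="Illustrative record, not an actual HELCOM dataset.",
    keywords=["environment", "eutrophication"],
    access_url="https://example.org/download/example-dataset-001",
    # CC BY licence URL; version 4.0 is assumed here.
    licence_url="https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(record, indent=2))
```

Because every portal that understands these vocabularies reads the same property names, a record like this can be harvested and indexed elsewhere without a custom mapping for each catalogue.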

Data harmonisation is another method used to enable the development of spatial data infrastructures. It is the process of integrating data of various levels, types, and sources into formats that are compatible and comparable. The European Commission learning resources on the INSPIRE Directive state that “data harmonisation aims to transform datasets to ensure that they fit together in relation to geometry and semantics. The goal is that a user, who is using data from different authorities, will have a unified view, where conflicts and tensions in the datasets have been removed”. The BDF project is facilitating this aim for the HELCOM eutrophication indicators. For example, harmonising monitoring station visit-IDs will make it possible to link different environmental datatypes collected during the same visit. The harmonised visit-IDs will allow ICES to integrate chlorophyll a (Chl a) tube sample data into the ICES Oceanographic Data Portal, which facilitates eutrophication indicator assessments and enables better use of monitoring data. In addition, the BDF project will standardize vocabularies and formats across systems for the content to be harvested, including eutrophication data.
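The benefit of a harmonised visit-ID can be shown with a small sketch: once records of different datatypes carry the same identifier for the same station visit, they can be joined directly. The ID scheme used here (normalised station code plus date), the station code, and the field names are hypothetical and are not the scheme used by HELCOM or ICES.

```python
from collections import defaultdict


def visit_id(station, date):
    """Build a harmonised visit ID from a station code and an ISO date.

    Hypothetical scheme: normalise the station code, then append the date.
    """
    return f"{station.strip().upper()}_{date}"


def link_by_visit(*datasets):
    """Group values from several datatypes under their shared visit ID.

    Each dataset is a (datatype_name, records) pair.
    """
    visits = defaultdict(dict)
    for datatype, records in datasets:
        for rec in records:
            key = visit_id(rec["station"], rec["date"])
            visits[key][datatype] = rec["value"]
    return dict(visits)


# Two datatypes reported with inconsistent station spellings (invented data).
chlorophyll = [{"station": "st01 ", "date": "2021-06-01", "value": 2.3}]
nutrients = [{"station": "ST01", "date": "2021-06-01", "value": 0.4}]

linked = link_by_visit(("chl_a", chlorophyll), ("nutrients", nutrients))
print(linked)
```

Without the normalisation step the two records would appear to belong to different visits, which is the kind of mismatch that harmonisation is meant to remove.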

Making data REUSABLE

The European Commission Expert Group on FAIR Data (2018) further states that for data to be reusable there is a need for “metadata and documentation to meet relevant community standards and provide information about provenance”. This covers reporting how data was created, along with information about any data reduction or transformation processes, to make data more usable and understandable. Data use policies are one important tool to ensure that data can be reused, with or without certain restrictions. According to the data use policy outlined in the HELCOM Data and Information Management Strategy (2019), all monitoring data reported to a HELCOM database will be made publicly available as open data and is referred to as a HELCOM dataset. For each spatial dataset or data product originating from HELCOM monitoring and assessment activities, the data use conditions are linked to a Creative Commons Attribution (CC-BY) license, which allows the data to be distributed, displayed, reproduced, and built upon, provided that the original HELCOM dataset is referenced as the source. Data policies for marine monitoring programmes should promote the use of CC-BY licensing, provided that no personal or sensitive data is collected or shared, in adherence to the EU General Data Protection Regulation (2016).

All software products developed by HELCOM are licensed under the GNU General Public License v3.0, which grants the ability to change and share versions of the developed products and ensures that they remain free to use. This facilitates data reuse because it removes barriers to accessing software source code. The source code for all software products developed by HELCOM, such as data processing scripts, is made available through open access platforms and repositories using the HELCOM GitHub account. For example, the tool for the integrated assessment of hazardous substances (CHASE tool), developed under the BDF project, is available there.

The BDF project continues to make progress to ensure that project outputs are FAIR. There are still a number of tasks within the project that will further enhance the FAIRness of data products, including the harvesting of the HELCOM Metadata Catalogue by the European Data Portal.