My open issues with Frictionless Data

As I wrote in the (German) blog post Wie läuft es? Reibungslos!, I am a big fan of describing the metadata of data files in a machine-readable way using the Frictionless Data specifications, especially Table Schema. In my practical work with data, however, I have run into a few things that I feel are still missing from Frictionless Data. I describe these open issues below and suggest possible solutions:

Units of measurement

A crucial piece of information for the correct interpretation of data is the unit of measurement. Is the value in the column named “speed” in km/h or in m/s? Since only a factor of 3.6 separates the two units, even a human often cannot guess, and a machine certainly cannot.

Therefore, I specify the unit of measurement using the property unit:

{
    "type": "number",
    "name": "flow",
    "description": "flow in m³/s",
    "unit": "m3/s"
}

The value of the property follows the Unified Code for Units of Measure (UCUM). The website https://ucum.nlm.nih.gov/ucum-lhc/demo.html helps you check whether you have used correct UCUM syntax.
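To illustrate why a machine-readable unit matters, here is a minimal sketch of how a consumer could use the proposed unit property to normalise speed values. The function name and the conversion table are my own assumptions for this example; they are not part of Frictionless Data, and the table covers only the two units mentioned above.

```python
# Conversion factors to m/s, keyed by UCUM code.
# Illustrative only: covers just the two units discussed in the text.
TO_METRES_PER_SECOND = {
    "m/s": 1.0,
    "km/h": 1000.0 / 3600.0,  # 1 km/h = 1/3.6 m/s
}


def to_metres_per_second(value, field):
    """Convert a value to m/s based on the field's proposed `unit` property."""
    unit = field.get("unit")
    if unit is None:
        # Without a declared unit the value cannot be interpreted safely.
        raise ValueError(f"field {field['name']!r} declares no unit")
    if unit not in TO_METRES_PER_SECOND:
        raise ValueError(f"unsupported unit {unit!r}")
    return value * TO_METRES_PER_SECOND[unit]


# Hypothetical field descriptor using the proposed `unit` property.
speed_field = {"name": "speed", "type": "number", "unit": "km/h"}
print(to_metres_per_second(72.0, speed_field))
```

Without the unit property, the same code would have to guess, which is exactly the problem described above.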

→ GitHub Issue

CRS

Geographic coordinates can be specified using different coordinate reference systems (CRS). WGS84 is a common one.

A CRS can apply to a whole file or only to individual columns. The latter is the case when a single file contains coordinates in several CRSs.

A simple option is to specify the CRS as a property of a column. This covers both cases.

{
  "name" : "geogrLongitude",
  "type" : "number",
  "description": "location of the station (longitude)",
  "crs": "EPSG:4326"
}
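As a rough sketch of how a tool could consume this, the following groups the columns of a schema by their proposed crs property. If all coordinate columns end up in one group, the file effectively has a single CRS; otherwise several are in use. The function name and all column names except geogrLongitude are invented for this example.

```python
from collections import defaultdict


def columns_by_crs(schema):
    """Group field names by the value of their (proposed) `crs` property."""
    groups = defaultdict(list)
    for field in schema["fields"]:
        crs = field.get("crs")
        if crs is not None:  # fields without a CRS are not coordinates
            groups[crs].append(field["name"])
    return dict(groups)


# Hypothetical schema with coordinates in two different CRSs.
schema = {
    "fields": [
        {"name": "station", "type": "string"},
        {"name": "geogrLongitude", "type": "number", "crs": "EPSG:4326"},
        {"name": "geogrLatitude", "type": "number", "crs": "EPSG:4326"},
        {"name": "easting", "type": "number", "crs": "EPSG:25832"},
        {"name": "northing", "type": "number", "crs": "EPSG:25832"},
    ]
}

for crs, names in columns_by_crs(schema).items():
    print(crs, names)
```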

→ GitHub Issue

Semantic Web Properties

Frictionless Data already contains approaches for making the described data available in the Semantic Web. In Table Schema one can use the property rdfType to indicate which class a value belongs to. For a complete description, however, it would also be necessary to specify which property a column represents.

My suggested solution looks like this:

{
  "name" : "geogrLongitude",
  "type" : "number",
  "description": "location of the station (longitude)",
  "rdfProperty": "https://schema.org/longitude"
}
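To show what such a property would enable, here is a simplified sketch that turns one data row into N-Triples statements, using the proposed rdfProperty as the predicate. The subject URI, the function name, and the latitude column are assumptions for illustration. Literal escaping is deliberately left out, so this is not a complete RDF serialiser.

```python
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def row_to_triples(subject_uri, row, fields):
    """Build N-Triples lines for every field that declares an `rdfProperty`."""
    triples = []
    for field in fields:
        prop = field.get("rdfProperty")
        value = row.get(field["name"])
        if prop is None or value is None:
            continue  # nothing to say about this column
        if field.get("type") == "number":
            literal = f'"{value}"^^<{XSD_DOUBLE}>'
        else:
            literal = f'"{value}"'  # simplified: no escaping of special characters
        triples.append(f"<{subject_uri}> <{prop}> {literal} .")
    return triples


fields = [
    {"name": "name", "type": "string"},  # no rdfProperty, so it is skipped
    {"name": "geogrLongitude", "type": "number",
     "rdfProperty": "https://schema.org/longitude"},
    {"name": "geogrLatitude", "type": "number",
     "rdfProperty": "https://schema.org/latitude"},
]
row = {"name": "Station 1", "geogrLongitude": 13.405, "geogrLatitude": 52.52}

# The subject URI is a made-up example identifier.
for line in row_to_triples("https://example.org/station/1", row, fields):
    print(line)
```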

→ GitHub Issue

WKT

Well-known text (WKT) is a common way to write down geographic data. A Frictionless Table Schema should make it possible to specify that a column contains WKT. The data should then be checked for syntactic correctness during validation. To be as backwards compatible as possible, the marking could be done with the format property (as is already done for URIs and email addresses).

{
  "name": "geometry",
  "type": "string",
  "format": "wkt"
}
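What a syntactic check could look like is sketched below. It is a strongly simplified assumption of mine: it accepts only two-dimensional POINT, LINESTRING, and POLYGON values and ignores the rest of WKT (EMPTY, Z/M coordinates, MULTI* types, geometry collections). It also does not check that polygon rings are closed.

```python
import re

# Building blocks for a small subset of WKT (2D coordinates only).
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
COORD = rf"{NUMBER}\s+{NUMBER}"
COORD_LIST = rf"\(\s*{COORD}(?:\s*,\s*{COORD})*\s*\)"

WKT_SUBSET = re.compile(
    rf"^\s*(?:"
    rf"POINT\s*\(\s*{COORD}\s*\)"
    rf"|LINESTRING\s*{COORD_LIST}"
    rf"|POLYGON\s*\(\s*{COORD_LIST}(?:\s*,\s*{COORD_LIST})*\s*\)"
    rf")\s*$",
    re.IGNORECASE,
)


def is_wkt(value):
    """Return True if value is syntactically valid within the supported subset."""
    return WKT_SUBSET.match(value) is not None


for candidate in [
    "POINT (30 10)",
    "LINESTRING (30 10, 10 30, 40 40)",
    "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
    "POINT (30, 10)",  # invalid: comma between the two ordinates
]:
    print(candidate, is_wkt(candidate))
```

A real validator would more likely rely on an existing geometry library than on a regular expression.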

→ GitHub Issue

Ignored lines at the end of a file

Unfortunately, some publishers in public administration produce CSV files that end with lines that do not contain any data at all. Typical contents of these lines are references to the publisher, information on the legal basis of the data collection, or license information.

Lines can already be skipped at the beginning of a CSV file. It would be useful to also be able to specify that a certain number of rows at the end of the CSV file should not be included.
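The desired behaviour can be sketched with the Python standard library. The parameter names skip_header_rows and skip_footer_rows are my own invention for this example and are not existing Frictionless options; the publisher in the sample data is made up as well.

```python
import csv
import io


def read_csv_body(text, skip_header_rows=0, skip_footer_rows=0):
    """Parse CSV text and drop a fixed number of rows at the start and the end."""
    rows = list(csv.reader(io.StringIO(text)))
    # Never let the end index fall before the start index.
    end = max(len(rows) - skip_footer_rows, skip_header_rows)
    return rows[skip_header_rows:end]


sample = (
    "station,flow\n"
    "A,1.5\n"
    "B,2.25\n"
    "Source: Example Water Authority\n"
    "License: CC BY 4.0\n"
)

# Drop the two trailing non-data lines, keep header and data rows.
for row in read_csv_body(sample, skip_footer_rows=2):
    print(row)
```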

→ GitHub Issue

Different types of data in one CSV file

It might be better not to provide a solution for the following case at all. Otherwise, more publishers might be encouraged to write their data in such a strange way:

Some public sector data publishers put multiple tables into a single CSV file. The CSV file roughly looks like this:

header line of table #1
1st row of table #1
...
last row of table #1

header line of table #2
1st row of table #2
...
last row of table #2

header line of table #3
1st row of table #3
...
last row of table #3

In such a case it would be useful to be able to specify (maybe using a regular expression) at which line of a CSV file to start reading and where to stop. In the example, the first table would run from the beginning of the file to the first “\n\n”. The second table would run from the unique first characters of the header line of table #2 to the next “\n\n”. The third table would run from the unique first characters of the header line of table #3 to the end of the file.
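The idea can be sketched as follows. This is only an illustration under my own assumptions: the function name, the start_pattern parameter, and the sample data are invented, and the sketch expects Unix line endings (a file with “\r\n” line endings would need a different end marker).

```python
import csv
import io
import re


def extract_table(text, start_pattern=None):
    """Return the rows from the first match of start_pattern to the next blank line.

    Without a start_pattern, reading starts at the beginning of the text.
    """
    start = 0
    if start_pattern is not None:
        match = re.search(start_pattern, text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"no line matches {start_pattern!r}")
        start = match.start()
    end = text.find("\n\n", start)  # a blank line ends the table
    block = text[start:] if end == -1 else text[start:end]
    return list(csv.reader(io.StringIO(block)))


# Made-up file content with three tables separated by blank lines.
text = (
    "id,name\n1,Alpha\n2,Beta\n"
    "\n"
    "year,total\n2020,10\n2021,12\n"
    "\n"
    "code,label\nX,unknown\n"
)

print(extract_table(text))             # first table
print(extract_table(text, r"^year,"))  # second table
print(extract_table(text, r"^code,"))  # third table
```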

