Juan Benet / data / Data Management Vocabulary

2014-02-21

Last updated: 2014-02-28.

This is a vocabulary for the purpose of precisely defining propblems and solutions in my data management efforts. Many are standard version control concepts.

bandwidth

for our purposes, capacity to distribute files across a network, mainly the internet.

data

information stored in digital files

dataset

a meaningful collection of related data, usually packaged as a set of files and identified with a name.

Decoding

the process of converting encoded information into information, according to a format. The inverse of Encoding.

For example, using TMF defined below:

func TMFDecode(s String) Devices {
  d := json.Decode(devs)
  if d.IsValid() {
    return d
  }
  throw TMFDecodingError(d)
}

Distributed Version Control System (DVCS)

a version control system that operates in a distributed, decentralized fashion. Users do not need to interact with, or obtain permission from, a central authority for normal operation.

Encoding

the process of converting information into encoded information, according to a format. The inverse of Decoding.

For example, using TMF defined below:

func TMFEncode(d Devices) String {
  if d.IsValid() {
    return json.Encode(d)
  }
  throw TMFEncodingError(d)
}

fork

a distinct project which originated by branching -- copying and subsequently modifying -- another. If project B was built by copying and modifying project A, project B is said to be a fork of project A. Forks are often created to fix bugs, alter or add functionality, or take over maintenance.

forking

copying the files from one project and modifying them to create another, distinct project. The new project is said to be a fork of the original.

forking friction

infrastructural or cultural resistance to forking projects.

In software, developers used to be reluctant to forking because it carried the connotation of an organizational split or schism, and the weight of organizational maintenance. Github greatly reduced forking friction in software by:

  • terming original projects forks as well (as opposed to only their off-shoots)
  • making forking a common part of contributing to projects
  • namespacing projects under usernames (e.g. userA/project could be a fork of userB/project)
  • effectively combining all open-source communities under one network

In data management, users are often reluctant to forking a dataset because:

  • they do not know whether they are able to (unclear or no licensing)
  • they do not wish to cause an organizational split or schism
  • they face publishing friction

Format

"the way in which something is arranged"; a specification for how to encode and decode a message.

For example, consider the following schema-laden format spec:

Temperature Measurement Format (TMF)

  • TMF extends, or is on top of, JSON.
  • A TMFFile has a .json extension, and contains TMFData
  • A TMFData is a sequence of TMFDevice objects.
  • A TMFDevice object must have two keys:
    • name, mapped to a string.
    • temps, mapped to a TMFReadings object.
  • A TMFReadings object maps ISO Date to a TMFTemp
  • A TMFTemp is a string with a float followed by one scale letter (C, F, or K).
[
  {
    "name": "device0",
    "temps": {
      "2014-03-07 05:10:01": "292.0K",
      "2014-03-07 05:10:02": "291.7K",
      "2014-03-07 05:10:03": "5000.0K",
      "2014-03-07 05:10:04": "291.3K"
    }
  },
]

Format Compatibility

a directed measure of whether a format is able to represent the same schemas as another. For example, JSON and XML are formats perfectly compatible with each other: all schemas represented in JSON can be represented in XML and viceversa. However, GeoJSON is compatible with XML (it follows from JSON being compatible with XML), but XML is not compatible with GeoJSON (not all XML files can be transformed into valid GeoJSON).

The idea of format-compatibility is useful to better understand how format conversion relationships work.

git

a popular distributed version control system. See http://git-scm.com.

publishing friction

individual resistance to publishing projects due to costs or perceived costs in doing so. For example, costs include:

  • monetary costs for reliable distribution.
  • obligation to maintain a published project indefinitely.
  • loss of reputation if the project is deemed unsatisfactory.

Publishing friction prevents the publishing of valuable projects. Building tools which decrease it boost the possibilities of progress.

Schema

the structure, or specification of how information represents meaning.

For example:

  • Timestamp specifies the time at which a measurement was taken.
  • Temperature specifies the temperature measured.
  • TempMeasurements specifies a sequence relating a Timestamp with a Temperature
  • Device specifies a particular machine, with a Name and TempMeasurements

Schema-Format Compatibility

a measure of whether a particular format is able to represent a particular schema.

Schema-laden Format

a format designed to represent a particular schema. Schema-less or universal formats, such as JSON, can represent any schema. Schema-laden formats, such as GeoJSON, are tuned to represent a particular set of schemas. Schema-laden formats tend to implement schema specifications on top of a general format.

Schema Compatibility

a measure of whether a particular schema is able to express another. It is useful to consider whether or not data can be transformed from one schema to another.

storage

for our purposes, digital storage of files in computers.

Universal Format

For our purposes, a format capable of storing any schema. Examples: JSON, XML. Contrast to GeoJSON, a format specific to a set of schemas, or a schema-laden format.

Version

a snapshot of files at a particular point in time. Versions are useful to trace histories of changes.

Versioning

storing multiple versions of a given file to enable tracing the history of changes.

Versioning Scheme

a scheme or protocol to identify different versions of file.

Version Control

techniques to store, manage, and retrieve numerous digital files, using versioning.

Version Control System (VCS)

a system (usually a software tool) used to enact, support, and simplify a particular form of version control.