

Data Discussion @ScienceExchange

2014-03-11

The awesome folks over at ScienceExchange had me over to talk about data management in science, and the tools I'm working on. Slides below.

Note: data, datadex, and datadex.io are live and working, though I haven't written them up yet and they are still early in development.

The seemingly unclickable links in the deck are, in order of appearance:


2014-03-11

Warning: This is a Living Document.

It is important to catalog related work. I will expand on these tools and efforts over time. This list is in its infancy. Please contact me if I should be aware of something else.

AcademicTorrents.com

  • BitTorrent Tracker for scientific datasets
  • like BioTorrents

dat Version Control for Data

datahub CKAN public index

  • closest thing to a package manager
  • website with lots of features

OKFN Open Data

  • OKFN gets it.
  • Lots of hard work in standards + evangelizing + tools.
  • behind datahub.

and many more...

Please contact me to add to this list.


Format Conversion

2014-03-07

Warning: Work In Progress

File conversion is cumbersome and more complex than it needs to be. There is no good reason why we need thousands of different programs that convert from one format to another. I claim it is possible to create a program which converts to and from all formats without sacrificing performance or flexibility.1

Semantics

Before describing such a program, it is important to define our semantic framework. It strikes me that people use the words encoding, format, and schema too loosely. This is perhaps because the words and their semantic boundaries are not precisely defined. Here are ours:

  • schema: the structure, or specification of how information represents meaning.

  • format: "the way in which something is arranged"; a specification for how to encode and decode a message.

  • encoding: the process of converting information into encoded information, according to a format. The inverse of decoding.

In general, we can say that data is structured according to a schema, and encoded into a format compatible with the schema.

Problem

Given these definitions, the goal is a program that can convert data in any format into any other compatible format.

Conversion tools between two general formats are not complicated to build. The hard part about building a conversion tool for all formats is that most formats are not general. They are schema-laden, and are thus not perfectly compatible. Parsing and generating byte sequences is tedious and error-prone, but not hard. The real problem, I claim, is converting between schemas.

The Graph of Formats

Today, most format conversions happen via format-pair-specific tools. Examples: "json to xml", "png to jpg", "xls to csv". Actually, these examples are not quite right, because most "format conversions" are not between universal formats, but rather between schema-laden formats. This means tools that convert from "this custom format expressing geometries" to "GeoJSON", or between CSVs with different column sets or value ranges. For example:

% cat area-codes.csv
CITY,STATE,AREACODE
SAN DIEGO,CA,619
SAN FRANCISCO,CA,415
SAN JOSE,CA,408

% ./us-city-codes.py area-codes.csv
City,Dial Code
San Diego,+1-619
San Francisco,+1-415
San Jose,+1-408

This transformation is trivial: (a) the STATE column was dropped, (b) the casing changed, (c) the AREACODE column was renamed to Dial Code and prefixed with +1- for international dialing. And yet, a custom program had to be written to parse the source file and generate the target file. Chances are that every time a conversion like this was needed, someone wrote a one-off program for it. Now consider more complex conversions: files with dozens of columns, or deeply nested JSON or XML trees, with numerous transformations here and there. So much valuable time is wasted creating these conversion programs. More is wasted by end users -- particularly non-programmers -- struggling to wrangle data to make it "look right". This is silly.
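Much of this drudgery could be captured once by a small, declarative mapping instead of a new script each time. Below is a minimal sketch in Python; the mapping structure and helper are invented for illustration (this is not an existing tool), and it merely reproduces the area-code example above by declaring which columns to keep, what to rename them to, and how to transform their values.

import csv
import sys

# Declarative description of the conversion (a hypothetical mapping format):
# keep these source columns, rename them, and transform their values.
MAPPING = {
    "CITY":     {"rename": "City",      "transform": lambda v: v.title()},
    "AREACODE": {"rename": "Dial Code", "transform": lambda v: "+1-" + v},
}

def convert(src, dst, mapping):
    """Apply a column mapping to CSV rows read from src, writing CSV to dst."""
    out = csv.DictWriter(dst, fieldnames=[m["rename"] for m in mapping.values()])
    out.writeheader()
    for row in csv.DictReader(src):
        out.writerow({m["rename"]: m["transform"](row[col])
                      for col, m in mapping.items()})

if __name__ == "__main__":
    convert(sys.stdin, sys.stdout, MAPPING)

Only the MAPPING is specific to this particular conversion; the parsing and serializing are generic, which is the point the rest of this post builds on.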

Consider a graph of formats where directed edges represent conversion programs. Every time one format needs to be converted into another, a program is written and an edge is added. Sometimes people publish those programs and they can be found on the internet. And if we're lucky, a conversion through an intermediate format may already exist. This is a brittle system, though. Often, programs do not exist, work only one way, are out of date, or are written again and again but never published anywhere.

Hub with Universal Data Representation

Instead of using a graph with converter edges between formats, consider a hub graph, where the center node is a "master" format that can be translated to and from any other format. While it is ambitious to build one format that can represent all other formats losslessly, the benefits are significant. Rather than building <num formats>^2 pairwise conversion tools,2 we would only need to build <num formats> format modules.

(Figure of hub here)


Hub is Well Established

This is not new. Compilers have long worked with a similar model. Compilers like LLVM take high-level languages, parse them into an Intermediate Representation (IR), and then emit machine code for the relevant platform. This three-phase design is a great way to address the multiplicity of target architectures and source languages.3 Rather than building <langs> * <archs> compilers, build one compiler with <langs> + <archs> modules.

(Figure: LLVM's three-phase design)

This is not even new to conversion. Conversion tools out there already employ this design. Notably, Pandoc converts text documents between a range of markup languages. Though the Pandoc homepage displays a graph with Source * Target edges, it perhaps should display a hub graph; its design uses parsers, an internal representation, and emitters:

Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document, and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer. From the Pandoc Documentation.

These widely used and well-established software systems use the hub model successfully.

Hub Design

Concretely, the Hub Design calls for:

  1. a Universal Data Representation capable of handling any data format (ambitious).
  2. a module per format, which converts to and from the UDR.
  3. a tool that loads and runs format modules, to facilitate conversion.

Building such a Universal Data Representation (UDR) is hard; we need a format that can express every other format well and efficiently (in storage and in conversion computation). Building the modules themselves -- which do the heavy lifting -- should be an easy and straightforward process. Modules could either ship with the base tool (better for end users, but centralized) or be shipped independently (falling back to maintaining an index). Of course, the UDR could be piped in and out of format converters manually, but this would do little to shift the status quo. The adoption of this tool, and the benefits of this system, depend on:

  • a modular architecture facilitating the building of format modules.
  • a wide set of interfaces to simplify usage (CLI, static lib, language bindings, web API)
  • an open-source project and community to build the modules.4

Though the UDR is a non-trivial question, from a software engineering perspective, I claim this design is much easier to realize than others.
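To make the three pieces concrete, here is a minimal sketch in Python of what the core of such a tool might look like. Everything here is hypothetical: the UDR is stubbed out as plain Python lists and dicts (building a real UDR is the hard part noted above), and only two toy format modules are registered, just to show how every conversion routes through the hub.

import csv, io, json

FORMATS = {}   # registry of format modules: name -> {decode, encode}

def register(name, decode, encode):
    """(2) A format module: decode (format -> UDR) and encode (UDR -> format)."""
    FORMATS[name] = {"decode": decode, "encode": encode}

def convert(data, src, dst):
    """(3) The tool: convert between any two registered formats via the UDR."""
    udr = FORMATS[src]["decode"](data)   # (1) the UDR, stubbed as Python objects
    return FORMATS[dst]["encode"](udr)

# Two toy modules. Real modules would also declare which schemas they can carry.
register("json", json.loads, json.dumps)

def csv_decode(text):
    return list(csv.DictReader(io.StringIO(text)))

def csv_encode(records):
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

register("csv", csv_decode, csv_encode)

print(convert("CITY,STATE\nSAN JOSE,CA\n", "csv", "json"))
# [{"CITY": "SAN JOSE", "STATE": "CA"}]

Adding a new format means writing one decode/encode pair, not one converter per existing format; the hub only ever sees the UDR.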


  1. At present, I offer a framework to think about the problem. In the future, I hope to offer an existence proof. 

  2. Maximum, of course. Many converters would never be built as most format pairs are incompatible. 

  3. This compiler design is brilliant for many reasons. It addresses many challenges elegantly, including applying optimization at the IR level, and even distributing IR code as "portable" code that is compiled on-demand at the target machines. 

  4. It is important to emphasize these points because they are so often overlooked. Many elegant software projects have failed due to the lack of an interested community, a hostile codebase, or a difficult usage pattern. 


The Case for Data Package Managers

2014-03-04

Numerous problems plague data sharing today. This post proposes Package Management as the foundation to address them. Note: to make sure we're on the same page semantically, check my data vocabulary.

Package Managers

In software, a package management system, also called package manager, is a collection of software tools to automate the process of installing, upgrading, configuring, and removing software packages for a computer's operating system in a consistent manner. It typically maintains a database of software dependencies and version information to prevent software mismatches and missing prerequisites. From Package Management System (Wikipedia).

Package managers have been hailed as among the most important innovations Linux brought to the computing industry. They greatly simplified and improved software distribution. Through versioning, careful bookkeeping, and centralized repositories, the various Linux communities built robust and usable systems.

For end users, Package Managers automated software installation (adding, upgrading, and removing programs). Before, a typical installation meant having to manually (1) find and download archives, (2) compile binaries (which often failed), (3) move files to their destination, and -- worst of all -- (4) repeat the process for all dependencies, and for the dependencies of those dependencies. Package managers reduced the process to a single command: pkgmgr install package-name. Uninstalling software (reverting an installation, often undocumented) became similarly simple: pkgmgr uninstall package-name. Users no longer had to search the web for compatible versions of programs, walk trees of dependencies, wrestle with compilers, and hope that everything interoperated just right. Instead, users could rely on common programs being versioned, listed, and properly packaged for their systems.

For software authors, Package Managers automated software distribution (packaging, hosting, and disseminating). Before, programmers had to create bundles of their source code with (often complicated) installation instructions. Though conventions existed (include a Makefile, README, and LICENSE), installation and its documentation were confusing hurdles that harmed the adoption of programs. Additionally, authors had to find a reliable way to transfer the files to end users, which usually meant creating and maintaining a website. If the website failed, the package (and its dependents) would not be installable. Package managers instead host all files in central repositories, mirrored for redundancy, which significantly increases the availability of packages and reduces authors' costs in both time and money. While packaging varies from simple to complex across platforms, it is standardized and well worth the effort for software authors.

Solving Data Problems

The activities of both publishers and users of datasets resemble those of authors and users of software packages. Moreover, the problems I presented in Data Management Problems are precisely the kind of problems package managers solved for software. Below I outline how package managers can also solve them for data, though each issue is large and worth addressing at length in the future. The takeaway is this:

(1) A Package Manager can solve our Data Management problems.

Distribution

Package Managers generally use centralized repositories to store and distribute their packages. Individual publishers simply upload their files to the repository, offloading the costs of distribution to the network's infrastructure. This substantially reduces publishing friction, as individuals no longer need to worry about setting up their own websites, paying for distribution themselves, or sourcing funding. It also reduces overall costs, by pooling network resources.

Amortizing Distribution Costs

Monetary costs are certainly still an issue, but an organization set up with the explicit goal of managing and distributing large datasets is better equipped to handle them than individual dataset authors. By pooling the resources and efforts of interested individuals (engineers, admins, etc.), such an organization can better:

  1. Attract and secure funding, such as grants or crowdfunding.
  2. Design and implement complex technical solutions to reduce costs.
  3. Simplify the publishing process with tools addressing common pain points.

(2) A Data Package Manager would amortize distribution costs.

Indexing

Indexing is the raison d'être of package managers. They are designed precisely to provide central access points where all packages can be found. By collecting domain-specific meta-data, package managers allow efficient and programmable search for packages. In our dataset case, consider collecting dataset meta-data such as the (a) title, (b) research fields, (c) file formats, (d) abstract, (e) relevant keywords, (f) relevant publications, (g) license, (h) sources, (i) authors, and (j) publication date. A central index of this information would provide powerful and precise search of available datasets. It would also standardize the meta-data requirements, ensure publishers provide it, and allow users to view it.

(3) A Data Package Manager would index and search datasets.
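As an illustration of the kind of programmable search this enables, here is a small sketch in Python. The records and the query helper are invented for the example (using a few of the fields listed above); they are not a proposed metadata standard.

# A toy in-memory index of dataset meta-data records.
INDEX = [
    {
        "title": "US City Area Codes",
        "fields": ["telephony", "geography"],
        "formats": ["csv"],
        "keywords": ["area code", "dial code", "usa"],
        "license": "ODC-BY-1.0",
        "authors": ["example-author"],
        "published": "2014-03-07",
    },
    # ... more records ...
]

def search(index, **criteria):
    """Return records whose meta-data contains every requested value."""
    def matches(record, key, value):
        field = record.get(key)
        return value in field if isinstance(field, list) else value == field
    return [r for r in index
            if all(matches(r, k, v) for k, v in criteria.items())]

print(search(INDEX, formats="csv", fields="geography"))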

Permanence

As discussed previously, once published, datasets must not disappear. All publications -- paper or data -- must be available indefinitely. When can we know that something is completely and utterly unnecessary, even to historians of a field? Never. Thus, it is important to store all our data, just as we store papers and code. Package managers in software already do this, as it is never clear when an old package will be required. Our Data Package Manager is no different: all datasets would be retrievable at any point in the future, as long as the package manager itself survives. Additionally, as the amount of valuable data stored by the package manager increases, so do the incentives to maintain it. In a sense, by storing all the important datasets together, we increase the survival chances of the whole repository.1

(4) A Data Package Manager would provide datasets permanently.

Versioning

The need for versioning data is clear. Datasets change over time, often after being shared with others or published broadly. It is critical to track and provide access to these changes, or at least to the different published versions. Regardless of correctness or currency, all versions must be kept available; future work will often seek to understand or compare previous work done with previous dataset versions. To stress the point, for the sake of the scientific enterprise and data work in general, we must track and distribute all versions of datasets.

(5) A Data Package Manager would provide all versions of datasets.

Version Control Systems for Data

But exactly how to version data is a complex question. To date, there is no single Version Control System capable of handling datasets as well as git handles source code. Unlike code, which is plain text, datasets come in a tremendous variety of formats. Versioning efforts that assume datasets have certain properties (e.g. small size, plaintext format) are bound to be limited in scope. Perhaps this is acceptable, as format-specific VCSes can provide great domain-specific functionality. Or perhaps one VCS could be built to easily accommodate domain-specific extensions.2 This will be the subject of a future post. For now, let us separate concerns as three different but related needs:

  1. Domain-specific data versioning techniques and tools.
  2. Version Control Systems that leverage (1.) and track all separate versions.
  3. Indexing services that (a) understand (2.) and provide access to the tracked versions, but (b) are agnostic to (1.).

Separating these needs out provides clear tool scope. Existing or new Version Control Systems can handle (2.), independent of tools built to handle (1.). A Package Manager can handle (3.).
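For instance, the index entry for one published version might look like the record below (a hypothetical sketch in Python, not a schema proposal): it names the VCS and revision that pin the version (need 2) while staying agnostic to whatever domain-specific versioning happens inside the dataset (need 1).

# Hypothetical index record (need 3) for one published dataset version.
VERSION_RECORD = {
    "dataset": "example-user/temperature-readings",
    "version": "2.1.0",
    "vcs": "git",              # any supported VCS could appear here
    "revision": "a1b2c3d",     # opaque identifier, meaningful only to that VCS
    "predecessor": "2.0.0",
    "published": "2014-03-04",
}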

While git has achieved tremendous success and is currently the most popular version control system, history has shown that preferences shift as better tools emerge (git replaced svn, svn replaced cvs, etc.). Unless it becomes abundantly clear that no new VCS will supplant the current ones, it would be wise to build Package Managers independent of, and compatible with, various VCSes.

(6) A Data Package Manager would be independent of VCS.

Formatting

The question of formats and conversions is formidable and worth lengthier discussion. For the purposes of this post, package managers would simplify three aspects: re-publishing, tooling, and automation.

1. Reformat and Republish

As discussed above, all versions of a dataset would be available. This includes versions that do not alter the data itself, only the format of the files. Data formatting is a tedious process fraught with problems; valuable data-worker time is spent parsing and converting data from one format to another. With a properly namespaced3 package manager, end users themselves could reformat and republish a new version of a dataset. This lets users leverage each other's efforts and saves the broader community time.

(7) A Data Package Manager would provide reformatted datasets.

2. Reformatting Tools

There exist thousands of tools to clean, convert, reformat, or otherwise modify datasets. A package manager could provide (a) programmable access to datasets, (b) dataset format information, and (c) standardized file layouts for particular formats. This would significantly simplify building these and other tools, and broaden their reach to datasets their authors never knew about. The package manager is an infrastructure tool, a platform for other tools to leverage.

(8) A Data Package Manager would enhance data processing tools.
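As a hypothetical example, a cleaning tool could be written against a package-manager client rather than against ad-hoc download links. The class, method names, and dataset below are all invented to illustrate points (a) through (c); no such library exists yet.

import csv, io

class DataPM:
    """Toy stand-in for a data package manager client."""
    def __init__(self, store):
        self.store = store                     # {dataset name: {filename: text}}

    def get(self, name):
        """(a) Programmable access: fetch a dataset's files by name."""
        return self.store[name]

    def format_of(self, filename):
        """(b) Format information: here naively derived from the extension."""
        return filename.rsplit(".", 1)[-1]

    def open_table(self, name, filename):
        """(c) Standardized layouts let generic tools parse without guesswork."""
        return list(csv.DictReader(io.StringIO(self.store[name][filename])))

pm = DataPM({"example/us-area-codes":
             {"area-codes.csv": "CITY,STATE,AREACODE\nSAN JOSE,CA,408\n"}})

for fname in pm.get("example/us-area-codes"):
    print(fname, pm.format_of(fname))          # area-codes.csv csv
rows = pm.open_table("example/us-area-codes", "area-codes.csv")
print(rows[0]["AREACODE"])                     # 408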

3. Automating Processes

On the other hand, these provisions would also enable the package manager itself to automate particular processes for the sake of end users. For example, consider a set of interchangeable formats (e.g. encodings like {JSON, XML}, or {matlab matrix, numpy array}). Suppose these formats and their relationships are registered with the package manager, along with bi-directional conversion tools. The package manager would then be able to convert any dataset from one of these formats to another. This could be done either on the user's own machine or remotely (e.g. publishing one format automatically produces projections into compatible formats4). End users need not be concerned with such simple reformats. Of course, other tasks beyond reformatting could be automated similarly.

(9) A Data Package Manager would automate data processing tasks.
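A rough sketch in Python of how such registration might work. The format names and converters are placeholders; the point is that once bi-directional converters are registered, the package manager can route between any pair of formats connected in the graph (a real implementation would find the path first, rather than converting as it searches).

import json
from collections import deque

CONVERTERS = {}   # (source format, target format) -> conversion function

def register(src, dst, forward, backward):
    """Register a pair of interchangeable formats with both conversion directions."""
    CONVERTERS[(src, dst)] = forward
    CONVERTERS[(dst, src)] = backward

def convert(data, src, dst):
    """Convert data from src to dst, chaining registered converters if needed."""
    queue, seen = deque([(src, data)]), {src}      # breadth-first search
    while queue:
        fmt, value = queue.popleft()
        if fmt == dst:
            return value
        for (a, b), fn in CONVERTERS.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, fn(value)))
    raise ValueError("no conversion path from %s to %s" % (src, dst))

# Placeholder formats registered as interchangeable pairs.
register("json", "python-objects", json.loads, json.dumps)
register("python-objects", "pretty-json",
         lambda d: json.dumps(d, indent=2), json.loads)

print(convert('{"device": "device0"}', "json", "pretty-json"))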

Licensing

By requiring a license, a package manager would ensure dataset authors consider, learn about, and formalize the rights of their end users. Additionally, by encouraging particular licenses, it could guide authors towards more open and modification-friendly licensing. This would reduce ambiguity and forking friction.

(10) A Data Package Manager would encourage better licensing.

Open Access

First, a package manager would improve access to open-access datasets. Currently, it is common to find open-access repositories with many hurdles to retrieving their hosted files. A simple package manager interface would improve the experience and save time. A package manager would also cover the costs and assuage the distribution concerns of authors and publishers.

Second, by reducing forking friction, a package manager would greatly support the modifying and republishing (forking) of datasets. This is precisely what open access to data should enable.

An open Data Package Manager would perfectly complement pre-print publication servers like arXiv and bioRxiv.

(11) A Data Package Manager would improve open access.

Summary

  1. A Package Manager can solve our Data Management problems.
  2. It would amortize distribution costs.
  3. It would index and search datasets.
  4. It would provide datasets permanently.
  5. It would provide all versions of datasets.
  6. It would be independent of VCS.
  7. It would provide reformatted datasets.
  8. It would enhance data processing tools.
  9. It would automate data processing tasks.
  10. It would encourage better licensing.
  11. It would improve open access.

It is time to build one.


  1. This is clearly the case in software: a package distributed on its own website is significantly less likely to survive than one distributed through a mainstream software package manager (e.g. Debian's aptitude). 

  2. Why git is not it (yet) is argued in a future post. 

  3. Namespacing is important to reduce publishing friction. This is similar to namespacing and forking on GitHub. 

  4. It's worth mentioning that such automatic reformatting at central repositories should occur lazily. 


Data Management Problems

2014-02-21

Numerous problems plague data sharing today. This post identifies some of them. Note: to make sure we're on the same page semantically, check my data vocabulary.

Distribution

Moving files from one computer to another is generally a solved problem. However, when dealing with datasets, there are complexities.

First, data files can be large (100MB+), which makes distribution costly in terms of storage and bandwidth. Usually, only organizations (academic or commercial) commit to fronting the financial costs of storing, transmitting, and maintaining large datasets. For example, academic institutions maintain the most popular machine learning datasets (MNIST, CIFAR, ImageNet, etc.). However, thanks to new cloud storage solutions, the costs (both money and time) are dropping.

Second, unlike with source code, few of the people who deal with data files are proficient in setting up reliable and robust distribution of their files over the internet. Today, many scientists outside of computer science use services like Dropbox, a personal file storage solution, to distribute their data files. The most cost-effective and reliable systems remain obscure to most average users. For example, BitTorrent has been used to distribute datasets by only a few computer science and bioinformatics research communities.

Versioning

Many people do not realize the importance of versioning at all. Some overwrite files with new versions without considering the repercussions. Others fail to keep old versions available, only distributing the latest -- which is still damaging.

There are problems even in how versioning is implemented. While version control systems (VCS) have proliferated in the software engineering world, most data files continue to be managed by ad-hoc and brittle filename versioning schemes. We have all seen the terrible ones (see the cartoon below).

(Cartoon: PhD Comics on ad-hoc filename versioning)

Better ones exist (like timestamping the filenames), but still fail to support useful VCS features like branching histories, version description messages, version merging, etc. Sure, the world would be better if everybody used git. But they will not. We need to bake VCS into existing workflows.

Permanence

Sometimes published datasets are no longer available in their original location, so one must find them again. Sometimes, datasets disappear completely. This paper found that, in biology alone, "the odds of a data set being extant fall by 17% per year."

This is absurd. The point of research is to gather and reason about data. We ought to be able to keep the data around, in precisely the form it was originally published. No single dataset should ever go missing. No single dataset version should ever go missing either. Getting data published is hard enough. We cannot afford to then lose it.

We need a permanent repository, where all data survives.

Indexing

Datasets are strewn across the web. Generally, dataset authors need to arrange for the distribution and upkeep of the files, which means they set up whatever solution happens to be most convenient for them. This leads to distribution over a myriad of different methods, with widely varying user experiences. And there is no global system to track published datasets, or their publication meta-data (date, authors, license, version, etc.). Some aggregated collections exist, but they are often field-specific and wildly varying in quality, usability, and completeness. This all makes searching for datasets tedious at best, and often fruitless.

Imagine a world where every dataset published can be found in one single, persistent global index -- across fields. Imagine this index also tracks the version histories, publication meta-data, related datasets, people, figures, etc. Imagine all this information standardized and easily searchable. Imagine all URLs to this index are guaranteed not to change, and to support a programmable API.

Formatting

Researchers often have to clean, reformat, filter, or otherwise modify the data files they wish to work with. This can consume significant time and effort. Sometimes programs have to be written. Sometimes modifications must be done manually -- an error-prone waste of valuable time. Often these modifications are the same ones that others need, and re-publishing the dataset after the format changes would save the community valuable time.

Licensing

Often, it is unclear what license published datasets carry. People tend to distribute raw data files, neglecting to specify what end users can do with the files. Though this feels "open," it is unclear whether publishing a modified version of the data is allowed. This ambiguity, with potential legal implications, deters forking and re-publishing of data. Most would rather not risk getting in legal trouble.

Open Source code solves this problem with clear licenses and licensing guidelines. It is common practice to always include a license with open source code -- this should be the same for data. While the data licenses are not yet as sophisticated as source code licenses, there are already many licenses covering common use cases (see opendatacommons.org and How to License Research Data). We simply need licensing to become common practice. Requiring (or defaulting to) licenses in data publishing repositories may nudge people in the right direction, reduce ambiguity, and foster data forking.

Open Access

The Open Access discussion often focuses on journals and access to papers. But, open access data is just as critical to scientific progress. Others have discussed this problem much better than I can, so I will link you to them directly:

The takeaway is that data should be open source, both for distribution and for modification. What is often left out of these discussions is the friction people experience when attempting to do either. From undergoing a lengthy peer-review process, to applying for accounts, to asking for permission to modify others' data... the hurdles are numerous, even in open access repositories. We need to complement open access with easy access, to both read and write, or download and publish.

Imagine a world where researchers can fork and re-publish fixed, cleaned, reformatted datasets as easily as programmers fork code.

Friends, the promised land is in sight.


Data Management Vocabulary

2014-02-21

Last updated: 2014-02-28.

This is a vocabulary for the purpose of precisely defining problems and solutions in my data management efforts. Many entries are standard version control concepts.

bandwidth

for our purposes, capacity to distribute files across a network, mainly the internet.

data

information stored in digital files.

dataset

a meaningful collection of related data, usually packaged as a set of files and identified with a name.

Decoding

the process of converting encoded information into information, according to a format. The inverse of Encoding.

For example, using TMF defined below:

// TMFDecode parses a TMF-encoded JSON string into Devices.
// Sketch: assumes a Devices type, IsValid, and TMFDecodingError are defined elsewhere.
func TMFDecode(s string) (Devices, error) {
  d := Devices{}
  if err := json.Unmarshal([]byte(s), &d); err != nil {
    return d, err
  }
  if !d.IsValid() {
    return d, TMFDecodingError(d)
  }
  return d, nil
}

Distributed Version Control System (DVCS)

a version control system that operates in a distributed, decentralized fashion. Users do not need to interact with, or obtain permission from, a central authority for normal operation.

Encoding

the process of converting information into encoded information, according to a format. The inverse of Decoding.

For example, using TMF defined below:

// TMFEncode serializes Devices into a TMF-encoded JSON string.
// Sketch: the inverse of TMFDecode above.
func TMFEncode(d Devices) (string, error) {
  if !d.IsValid() {
    return "", TMFEncodingError(d)
  }
  b, err := json.Marshal(d)
  return string(b), err
}

fork

a distinct project which originated by branching -- copying and subsequently modifying -- another. If project B was built by copying and modifying project A, project B is said to be a fork of project A. Forks are often created to fix bugs, alter or add functionality, or take over maintenance.

forking

copying the files from one project and modifying them to create another, distinct project. The new project is said to be a fork of the original.

forking friction

infrastructural or cultural resistance to forking projects.

In software, developers used to be reluctant to fork because it carried the connotation of an organizational split or schism, and the weight of organizational maintenance. GitHub greatly reduced forking friction in software by:

  • terming original projects forks as well (as opposed to only their off-shoots)
  • making forking a common part of contributing to projects
  • namespacing projects under usernames (e.g. userA/project could be a fork of userB/project)
  • effectively combining all open-source communities under one network

In data management, users are often reluctant to fork a dataset because:

  • they do not know whether they are able to (unclear or no licensing)
  • they do not wish to cause an organizational split or schism
  • they face publishing friction

Format

"the way in which something is arranged"; a specification for how to encode and decode a message.

For example, consider the following schema-laden format spec:

Temperature Measurement Format (TMF)

  • TMF extends, or is on top of, JSON.
  • A TMFFile has a .json extension and contains TMFData.
  • A TMFData is a sequence of TMFDevice objects.
  • A TMFDevice object must have two keys:
    • name, mapped to a string.
    • temps, mapped to a TMFReadings object.
  • A TMFReadings object maps an ISO Date to a TMFTemp.
  • A TMFTemp is a string with a float followed by one scale letter (C, F, or K).

For example:

[
  {
    "name": "device0",
    "temps": {
      "2014-03-07 05:10:01": "292.0K",
      "2014-03-07 05:10:02": "291.7K",
      "2014-03-07 05:10:03": "5000.0K",
      "2014-03-07 05:10:04": "291.3K"
    }
  }
]
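To show how a schema-laden spec like this can be checked mechanically, here is a minimal validator sketch in Python (my own illustration, not part of any TMF tooling; it checks the rules above against the date and temperature shapes used in the example).

import json, re
from datetime import datetime

TEMP_RE = re.compile(r"^-?\d+(\.\d+)?[CFK]$")      # a float followed by C, F, or K

def is_valid_tmf(text):
    """Check that text is TMFData: a JSON sequence of TMFDevice objects."""
    try:
        data = json.loads(text)
    except ValueError:
        return False
    if not isinstance(data, list):
        return False
    for dev in data:
        # A TMFDevice must have exactly the keys "name" and "temps".
        if not isinstance(dev, dict) or set(dev) != {"name", "temps"}:
            return False
        if not isinstance(dev["name"], str) or not isinstance(dev["temps"], dict):
            return False
        for stamp, temp in dev["temps"].items():
            try:
                datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")   # date form used above
            except ValueError:
                return False
            if not isinstance(temp, str) or not TEMP_RE.match(temp):
                return False
    return True

Applied to the example above, it returns True; change "292.0K" to "292.0X" and it returns False.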

Format Compatibility

a directed measure of whether a format is able to represent the same schemas as another. For example, JSON and XML are perfectly compatible with each other: all schemas represented in JSON can be represented in XML and vice versa. However, GeoJSON is compatible with XML (this follows from JSON being compatible with XML), but XML is not compatible with GeoJSON (not all XML files can be transformed into valid GeoJSON).

The idea of format-compatibility is useful to better understand how format conversion relationships work.

git

a popular distributed version control system. See http://git-scm.com.

publishing friction

individual resistance to publishing projects due to costs or perceived costs in doing so. For example, costs include:

  • monetary costs for reliable distribution.
  • obligation to maintain a published project indefinitely.
  • loss of reputation if the project is deemed unsatisfactory.

Publishing friction prevents the publishing of valuable projects. Building tools that decrease it boosts the possibility of progress.

Schema

the structure, or specification of how information represents meaning.

For example:

  • Timestamp specifies the time at which a measurement was taken.
  • Temperature specifies the temperature measured.
  • TempMeasurements specifies a sequence relating a Timestamp with a Temperature.
  • Device specifies a particular machine, with a Name and TempMeasurements.

Schema-Format Compatibility

a measure of whether a particular format is able to represent a particular schema.

Schema-laden Format

a format designed to represent a particular schema. Schema-less or universal formats, such as JSON, can represent any schema. Schema-laden formats, such as GeoJSON, are tuned to represent a particular set of schemas. Schema-laden formats tend to implement schema specifications on top of a general format.

Schema Compatibility

a measure of whether a particular schema is able to express another. It is useful to consider whether or not data can be transformed from one schema to another.

storage

for our purposes, digital storage of files in computers.

Universal Format

For our purposes, a format capable of storing any schema. Examples: JSON, XML. Contrast to GeoJSON, a format specific to a set of schemas, or a schema-laden format.

Version

a snapshot of files at a particular point in time. Versions are useful to trace histories of changes.

Versioning

storing multiple versions of a given file to enable tracing the history of changes.

Versioning Scheme

a scheme or protocol to identify different versions of a file.

Version Control

techniques to store, manage, and retrieve numerous digital files, using versioning.

Version Control System (VCS)

a system (usually a software tool) used to enact, support, and simplify a particular form of version control.


Let's Solve Data Management

2014-02-21

Numerous problems plague data sharing today. Scientists, engineers, and analysts work with exabytes of data daily. So you would think our data management techniques and technologies are top-notch. You would think it trivial to move data files from one computer to another, to share this or that dataset with a collaborator, to keep all versions accounted for, to trace the history of particular data, to convert between this and that format, to create and re-create figures from the data... In short, you would think sophisticated and well-designed tools have solved these straightforward problems. After all, these same problems plagued software development decades ago. And we have solved those elegantly.

The sad truth is that data management -- particularly among scientists and analysts -- remains almost oblivious to the general solutions software engineering developed to manage code.

The good news is: we can change that in 2014.

The foundations of a better future have been laid. Existing source code tools and entirely new solutions are solving the tough problems. What's left is to connect the dots with easy-to-use tools, and to drive their adoption.

Over a series of posts, I will summarize a number of problems I have noticed, discuss potential solutions, and feature current efforts to solve them. I invite you to be part of the discussion, and to help the efforts. My hope is to rally others interested in these problems, and to build the open-source tools to solve them.

Join me.