Numerous problems plague data sharing today. Scientists, engineers, and analysts work with exabytes of data daily. So you would think our data management techniques and technologies are top-notch. You would think it trivial to move data files from one computer to another, to share this or that dataset with a collaborator, to keep all versions accounted for, to trace the history of particular data, to convert between this and that format, to create and re-create figures from the data... In short, you would think sophisticated and well-designed tools have solved these straightforward problems. After all, these same problems plagued software development decades ago, and we have solved them elegantly there.
The sad truth is that data management -- particularly among scientists and analysts -- remains almost oblivious to the general solutions software engineering developed to manage code.
The good news is: we can change that in 2014.
The foundations of a better future have been laid. Existing source code tools and entirely new solutions are solving the tough problems. What's left is to connect the dots with easy-to-use tools, and to drive their adoption.
Over a series of posts, I will summarize a number of problems I have noticed, discuss potential solutions, and feature current efforts to solve them. I invite you to join the discussion and to contribute to these efforts. My hope is to rally others interested in these problems, and to build the open-source tools to solve them.