Open data baby step: sample data & data abstracts concurrent with traditional papers.

I have a 9-month-old learning to walk. Sometimes I hold her hands. Sometimes she falls. Furniture and structures nearby to lean on are a must. I want open data, pubs, code, & meta-data online – all the time. Now. However, perhaps there are baby steps that do not interfere with this natural and logical development. Steps or simple sharing options that do not supplant this growth and even facilitate/enhance the final step. The interview with Carly Strasser about DataUp led me to the following ideas, and I bet they are already out there.

When you get your paper accepted, archive the data. However, let’s ask/offer to journals the following as a bonus (or if you are not ready to share the full dataset yet for some reason).

1. Data samples. Sample data or a teaser like a movie trailer to accompany the paper right on the journal website. When your paper appears in the table of contents or online early, ask the publisher to load up a small file with a sample of the data. Sometimes when I read a paper, I don’t want to see the whole dataset. I just want to see a bit of it to get a feel for the data structure and variables. Just looking at the data matrix/flat sheet structure with column headings is really useful. My students and collaborators often do this. When planning an experiment, doing a pilot, or early in the season, my students build the data file and input what they have, even a little bit. Then, we take a look at it and try to imagine what we have missed. This really accelerates discussion and stops me from emailing/posting back and forth… did you think of this? This could also really accelerate discovery when reading a paper. You are thinking about the data the authors collected and you just want to take a peek at it to try and get your head around the primary patterns – akin to when you are in the midst of your own workflow/analytical pipeline. You have your data up on one screen, looking at it and popping back and forth to the stats. This also avoids a bit of the messy data-bucket phenomenon that Mark Schildhauer described to me, i.e. that if folks just dump datasets into a repository without proper meta-data, the datasets are not machine readable or indexable, and others don’t really want to stick their hands in there (maybe like a bucket of used diapers, :)).

The data sample could be as simple as a txt file or a cut-and-pasted Excel file to skim through. Then, I can decide if I want to get the full deal or contact the authors. The final advantage is that the sample is provided in parallel with the paper. Whilst looking at the paper, I can just open the sample quickly and inspect it without going to a data archive or separate site.

2. Data abstracts. When your paper is accepted, ask the journal/publisher to also publish the meta-data in parallel with the paper. Maybe you are not ready to publish the data too, ok. At least pop up the meta-data so that we can index it and readers could quickly click on it to get a picture, so to speak, of what the dataset included (this could be great if you did not use all the variables etc. and a reader was wondering about that or wanted to consider designing a similar study). Does the ‘data abstract’ have to be formal meta-data? Preferably yes. However, baby-stepping it, I would still love to just see a quick paragraph describing the structure of the data, a list of all the variables/factors etc., and any insights into the workflow that the authors choose to provide. It would be so interesting. So, let’s call them data abstracts and have them right alongside the current narrative ones that we publish.

3. Supplements. In parallel, anything the authors want to share. Georeferenced photos of the site, a photo of a representative plot, egg, etc. I assume server space is mostly infinite for journals and publishers too, so why not give us that content with every paper.
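
To make the first two ideas concrete, here is a rough sketch of how little work a data sample and an informal data abstract could take. This is only an illustration under my own assumptions: the data is a flat CSV with a header row, the function names `make_data_sample` and `data_abstract` are made up for this post, and the column names in the example are hypothetical. It is not formal meta-data and not tied to any journal or repository.

```python
import csv
import io


def make_data_sample(csv_text: str, n_rows: int = 5) -> str:
    """Return the header plus the first n_rows data rows of a CSV, as text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(reader):
        if i > n_rows:  # row 0 is the header, so keep rows 0..n_rows
            break
        writer.writerow(row)
    return out.getvalue()


def _is_number(value: str) -> bool:
    """True if the cell text parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def data_abstract(csv_text: str) -> str:
    """Build an informal 'data abstract': row count and a list of variables."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return "Empty dataset."
    lines = [f"{len(rows)} rows, {len(rows[0])} variables:"]
    for name in rows[0]:
        # Treat empty cells as missing values.
        values = [r[name] for r in rows if r[name] not in ("", None)]
        kind = "numeric" if values and all(_is_number(v) for v in values) else "text"
        lines.append(f"- {name} ({kind}, {len(rows) - len(values)} missing)")
    return "\n".join(lines)


# Hypothetical field data, invented purely for illustration.
full_dataset = "plot,species,count\n1,a,3\n2,b,\n3,c,7\n"
sample_text = make_data_sample(full_dataset, n_rows=2)
abstract_text = data_abstract(full_dataset)
print(sample_text)
print(abstract_text)
```

The sample keeps the column headings on purpose, since the headings are often the most useful part when you just want a feel for the structure. The abstract here is only the bare list of variables; the quick paragraph about workflow would still have to come from the authors.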


In no way do I want this to interfere with proper data sharing initiatives. It just seems like a nice quick way to get everyone in the game and maybe even build demand for the whole data to be explored by others. Movie trailers work for me – both for the ones I want to skip and the ones I must see. In summary, I want baby to crawl then walk, but I am happy for her to be able to do both as her exploration of the environment demands.