As researchers we often have to manage the entire data lifecycle from generation to the final report. Raw data are major parts of any analysis pipeline and it is important to properly store them. Many guidelines on data management focus on raw data but also stop there. However, raw data without analysis are pretty much worthless and there are major data management decisions to be made during analysis. Which intermediate results to save? What format? Where? Cloud storage? Do I build a database? Every decision will have effects in the actual analysis code and can determine how easy errors can be found and fixed. Here I want to give a detailed overview of the different options we have to organize our data during analysis. Python is my main programming language and this will be reflected here in that I prefer open source tools and I am not experienced with some of the commercial solutions that might exist. However, I will also cover the advantages/disadvantages of human-readable data formats and tools such as Microsoft Excel. We get started with raw data.
Raw data are like a good hypothesis. Just like a hypothesis is perfect until you ruin it with data, raw data are perfect until you ruin them with a thorough analysis. Raw data are truth. This is how it happened. Therefore, the raw data is never changed and we always keep multiple copies of it. Even after we published the results of our analysis, it is a good idea to keep the raw data for at least 10 years. Depending on your funding and where you publish you might be legally and/or ethically required to keep raw data for a defined time period.
Anything we do with our raw data during the analysis we do in read-only mode. The only thing we might want to change is the file name. However, we should check whether there is a way to automate the file naming in the program that generates them. Automating file names is preferred and avoids human error. If we receive data from collaborators we should also make sure that we are not deleting or changing data that they placed in the file name or the directory hierarchy.
From an analysis point of view, we might not need any information in the file name at all. We could randomly generate it. However, we almost always want to store important metadata in the file name because it has two major advantages: 1. The file name is human-readable. You need neither programming knowledge nor knowledge about the file format to read it. 2. The file name is readable by the operating system. This means, our programs can read it even if they don’t know the file format. For example, you can use file name information to decide whether or not to load a file during analysis and save loading time.
So what do we put in the file name? The first rule is that it is better to save too much information as opposed to too little. But most file systems have limits on the file name size (I hit it many times in Windows 10). When you choose your file name format you can ask yourself: what would I want a human to know about the files before they even open them and what would be good to have easily accessible during analysis? You will read in a lot of guides that you must save the date and time. That is useful because it makes it easier to relate a file to your lab book. Otherwise I only recommend you to give each file a unique identifier. This can be a number or a string of characters. I prefer numbers because they can have the added bonus of giving the recording order at a glance. Other information in the file names is up to you.
Once you have decided which other information to put in your file names, choose one delimiter to separate items and another one for readability. I use underscore to separate items and minus for readability. That allows me to do things like: “001_YYYY-MM-DD_version-2.fmt”. Avoid empty spaces in file names. Whatever decision you make, you must remain consistent.
When it comes to the directories, I recommend having all raw files in the same directory. I used to save some directory hierarchy structure but nowadays I would put all of that into the file name. If you have a lot of files, some directory structure can help humans navigate it but during analysis I almost always prefer one directory for simplicity. The file name is certainly a good way to save metadata but there are other ways to save them. In the next chapter we will discuss metadata more generally.
At the raw data stage, the definition of metadata is clear. It is any data that describes the context of the raw data but does not fit any of the raw data formats. Images are a good example. The raw data of an image are the pixel values that make up the digital image. Metadata of the image can be anything that relates to the context where the image was taken. The time it was taken, the place it was taken, the exposure time, the model of the lens used, the temperature that day, the person who took the image, the width, the height and much, much more. As the analysis progresses, metadata may become data. Let’s say your analysis pipeline counts the number of birds in each image and you want to find out whether there are more early birds than late birds. In that case the time of day the picture was taken becomes an independent variable.
So what kind of metadata should you save? That is up to you, your research question and your field but it is usually better to save too much metadata than too little. Metadata you did not save is almost impossible to reconstruct later, so think carefully when you decide that something is not worth saving.
How to store metadata? If you are lucky, all metadata you need might already be inside your raw data files. Most of the time however, you will have to add some metadata manually. In that case I prefer to create a single .csv file where each row describes a single raw data file. One of the advantages of .csv files is that they can be read by humans and script with a variety of programs.
The id uniquely identifies the row and thereby the raw data file it refers to. A drawback of a .csv file is that it only works well for a single spreadsheet. If you require multiple related sheets – for example because you have multiple raw data sources – you might require a more complex format such as JSON or even build a database. More later on the reasons for and against building a database. Next, we will assume that we found a way to process our raw data and need to decide how to store it for analysis.
Analysis and storage of transformed raw data
Analysis of raw data most often means to iterate through all raw data files and do some computation on each. In our fictional example, we were taking images and then extracting birds from those images. Once we have extracted the birds, our unit of analysis changes. This has consequences for the metadata, which describes the raw data files. We need to relate the metadata from the raw data files to the birds and there are two major options to do this. First, we generate a spreadsheet where each row is a bird. In the image_id column we save the unique id of the image the bird was extracted from.
This is a relational data structure. The image_id in our birds table identifies a unique id in our metadata which allows us to figure out where, when and with what exposure a given bird was photographed. This is a great way to structure data. The only downside is that we have to load two files and then merge them during the analysis. Personally I prefer to create a single file where both tables are already merged. The downside of this data structure is that is is larger in storage size (because the metadata columns are repeated multiple times). Whether or not this is an issue for you depends on the total size of your data, the amount of rows extracted per image and your available disk space. If you decide on the relational data structure you might also want to consider building a database, because they are particularly well suited for relational storage. Next, we will discuss some advantages and disadvantages of databases.
You probably don’t want to build a database. Unless you do.
A database is any structured collection of information. This means that a collection of spreadsheets as we saw above could technically be considered a database. However, a database usually implies more rigorous rules of storage and access than a spreadsheet can guarantee. These rules give databases some of their major advantages. For example, multiple people and programs can interface with a database without messing things up. If two people open the same spreadsheet in excel and work on it they are almost guaranteed to end up with two conflicting versions. There are many other advantages of databases. When you decide for or against a database you should anticipate whether you will be able to make use of those advantages. Here is my list.
You should consider building a database if:
- You will have to integrate multiple raw data sources.
- Multiple raw data sources usually mean more relationships. Multiple relationships become harder to manage with spreadsheet but are the perfect use-case for a relational database.
- You expect to store a massive amount of data.
- A database usually stores data more efficiently, which can save you storage space.
- A database can also make access more efficient, because you don’t load the entire database but request smaller packages of information at a time. This avoids running into RAM issues.
- You will be adding data continuously to the project over a long period of time.
- Having a fixed logic for adding information to the database is a massive advantage over long time periods and can avoid a lot of errors.
- During very long projects you might want to train other people to interface with the database or even hand the project over to someone else. Having the relational rules that avoid errors is very helpful, as opposed to a spreadsheet where anyone can do whatever.
- You want to make you data available through a web interface.
- A web server interfacing with a spreadsheet would be very inefficient.
- Multiple people need read/write access simultaneously.
- As mentioned above, multi-user access works much better for databases.
- You have security concerns.
- This can be a security concern regarding the physical integrity of the data or malicious/unauthorized access. Databases are not perfect but they provide better safeguards than a spreadsheet
Those advantages sound amazing. So why would you ever not build a database? Well, many research projects are structured in a way that you cannot make use of any of these advantages. If a single person generates all their data in a week, processes the raw data the other week and prepares a final report in the third week, a database is overkill. The rules that database architectures follow provide major advantages but they also make them less flexible. Building and maintaining a database comes with some extra work that you only want to take on when you can make use of some of the advantages. Generally speaking, the larger a project is (regarding everything from number of people involved over data sources to time frame), the more worthwhile a database becomes.
Before we move on to data analysis, I want to briefly mention cloud storage. Cloud storage has some advantages but most of them can also be achieved in other ways. In my opinion, the most important reason to consider cloud storage is if you regularly need access to the data from a large variety of different locations that are well connected to the internet. However, if you have too much money, a commercial cloud storage can also provide services for you that might be annoying to maintain yourself. On the other hand, as a researcher you may also not be allowed (because of funding or institutional policies) to use certain cloud services. Other details of cloud storage are beyond this guide.
Visualization and statistical analysis
So far we extracted samples (birds) from our images and quantified features of those birds (species and size). Some additional features are given by the image metadata. There might be additional intermediate processing steps whose results you want to save. That’s up to you. For example, maybe you had to apply some filters to the images when you extracted the birds. Do you want to save these filtered images? If the filtering takes very long and you want to visually inspect the result, you probably want to save them. Otherwise, having the script that performs the filtering is sufficient.
The plots you draw are something you definitely want to save. You also want to save the results of statistical analysis. At this stage it is important to have a clear relationship between analysis scripts and their outputs. I hate to look through multiple scripts, trying to find the code that generated a specific figure I want to change. You could just try to put everything into one script. I have done this before but depending on the size of the project this becomes unwieldy very quickly and you will find yourselves scrolling through thousands of lines of code. There is no perfect recipe here so I think a practical example is in order.
A practical example for a project structure
Here I will give some examples, how the bird projects could look as a directory hierarchy.
In the project folder itself are only Python scripts and a README.txt, where we can give information that anyone looking at our project should be aware of. Having everything else in specific folders makes it easier to locate a Python script. There is one more advantage: if you use git to version control or share your projects, you can exclude data and figures from tracking. Data is often too large to push to remote or we don’t want to make unpublished data public. Therefore we can simply add the data folders we want to keep local to a .gitignore file. However, managing your project with git is not a must and for smaller projects it does not always pay off.
Now for the scripts themselves. Each script should have a very well defined input and output.
extract_birds.py for example loops through all raw files in
./raw and creates the file
./data/birds.csv. The other two scripts both use that same
birds.csv file but they answer different questions and have different outputs.
analyze_location.py creates the figure
02_figure_location.png and the statistical output
location_anova.txt. On the other hand
In this example the same script visualizes and performs the statistical test. You could split this up further and have a script for visualization and another for stats. I prefer to combine them because both tasks require loading of the same data. On another note, I prefer to keep things like reports, papers and presentation in a separate folder. In my experience they clutter the project space.
Planning a project in detail is important but there are also diminishing returns. At some point the only way forward is to actually start the project. I hope you learned some useful tricks to manage your own data more effectively.