Challenges for Data Management
I've done a few presentations on data management to student groups, and I have never been happy with the content. I've also attended a few data management sessions that were conducted by others and was not happy with their content, either. I've come to realize that the reason that I am never happy with the content of "data management" presentations is that what I want is simply not possible. I want a session that tells trainees step by step how to set up a system to manage their data. However, with the broad array of data types and challenges to data management, I don't think that there can be a blueprint for handling data. Therefore, presentations tend to be a list of warnings on what not to do, rather than what to do.
As stated above, data comes in many forms: large collections of numbers, images, piles of references, tissue samples, and specimens from field research. All of that data needs to be processed and organized. Some data, such as numbers, images, and references, can be organized and stored electronically. Other data, such as tissue samples, gems, and museum specimens, must be physically stored and cataloged. In addition to the type of data, the type of research also presents unique problems with data management. Field researchers must be able to take notes in the field and somehow protect those notes from damage or loss due to weather or the countless other hazards that exist outside the lab. In collaborative research, some of which is international, data must be reliably collected and shared between researchers, all the while being protected. Furthermore, if the data contains confidential information, there must be a system to protect privacy.
Difficulty managing data is not a new problem. Most people do not pursue their area of research or scholarship because they want to manage data. We go into research because designing and conducting experiments is fun. Asking new questions and creating a way to answer those questions is stimulating. Data management is just the "housekeeping" of research. And although we all want a nice house, some, like me, are challenged housekeepers. It is hard to get enthusiastic about organization of data. Yet, it is critical to keep data organized because a lot of problems can arise from bad data management. And although evolving technology has served data collection well and may address some of our data management problems, it has also created some new challenges to data management.
So, how has the evolution of technology challenged our housekeeping? A long time ago (before I started in research), an experiment had a lab notebook and maybe a data notebook. An experiment was conducted, the information was recorded in the lab notebook, the data was added to the data notebook (if this was separate), and these two books were stored safely in the lab. Together, they were the only resource for experimental data and information. Changes could not be made to the resource without evidence of such a change, because procedures required personnel to cross out a notebook entry (written in pen) and then add amended information. It was a solid, unchangeable resource. In the next generation (when I was an undergrad), labs had only one or two computers. So, all lab information was stored on the computer hard drive, and all lab personnel had to work on the lab computers, so their latest work was always saved in the lab. Two floppy disk backups were maintained for each computer. One was stored in the lab and the other was stored offsite, in case of a lab disaster. Everyone knew that the latest version of any file was stored on the lab computer.
So, where do we stand today? Labs have multiple computers, and all lab personnel also have one or two (or more) electronic devices of their own that store data so that they can work at home, in the library, or in the doctor's office. Portable drives are inexpensive and have loads of storage, so we all have drawers full of jump drives, thumb drives, external hard drives, etc. And each drive has a copy of some version of the research data. So, it is easy to lose track of which file is the most current. Multiple students sometimes take home the same file to process data or conduct analyses, which can cause problems in merging the work done by two people on the same data set. Furthermore, with all these copies of research data, it is difficult to maintain the security of a dataset that should be confidential. There may be generational differences among lab members, so that some members are more "tech savvy" than others. This can create problems of communication and/or access to all data and analyses. It can also be worrisome for mentors who realize that their students are more knowledgeable about computers and/or software. When I was in graduate school, a classmate had a mentor who would not allow her to graph her data on computers for publication. Rather, she was required to draw her graphs (with drafting tools), even though she knew how to graph them electronically. Her mentor did not understand the software, so she was not allowed to utilize it. Forcing her to handle data in a manner that he understood was his way of systematizing the management of his data. However, I don't think it was the best way to mentor his student. Finally, unless you are utilizing an efficient electronic notebook or survey software that "tracks changes," data can easily be modified or even deleted, without there being a record of the change. Labs that utilize Excel for managing data are vulnerable to this issue.
But as an Excel user, I find it difficult to move away from the software because it is familiar, convenient, and flexible.
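One low-tech way to notice untracked changes, without giving up familiar software, is to record a checksum for every file in a folder and compare the record later. The following is only a minimal sketch of that idea, not a description of any particular lab's practice. The function names (`file_digest`, `snapshot`, `compare`) are hypothetical and chosen for illustration. A checksum can show that a file was modified, deleted, or added, but it cannot say who made the change or which copy is the "right" one.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(folder):
    """Map each file's path (relative to `folder`) to its digest."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): file_digest(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def compare(old, new):
    """Report which files differ between two snapshots."""
    return {
        "modified": sorted(k for k in old if k in new and old[k] != new[k]),
        "deleted": sorted(k for k in old if k not in new),
        "added": sorted(k for k in new if k not in old),
    }
```

In use, a snapshot would be taken at a known-good moment (for example, after a lab meeting) and compared against a fresh snapshot before the next one.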
So, how do you systematize the convenience that has been gifted to us by technology? This is where data management presentations fall apart. Generally, there is a discussion on backing up data. Yet, as the discussion above shows, having backups is not the problem; most people have too many. The challenge is making sure that the most complete version of all critical information is backed up, and ensuring that all personnel associated with a project have access to the data and know where the latest versions have been stored. Finally, the data should be protected from personnel who should not have access to the work.
I will begin to address the solutions by presenting the system that our lab utilized but will also present the problems that we had, which included redundant and obsolete backups. Our lab utilized a lab server that contained a hierarchical folder system with differing levels of security. In the server, the highest level of security was for information that could only be accessed by the PI. The next level gave access to the PI and senior postdocs/lab managers. The last level was a "Shared Lab" folder, to which everyone had access and everyone could share data. In addition, each member of the lab had a personal folder. The personal folders could be accessed by the named student, the PI, and the student's immediate supervisor. This system allowed everyone to have a "protected" area as well as "shared" areas for data. Supervisors were able to access a subordinate's work but could keep protected data backups, as well. If you followed my description, you should already realize that this system created redundant copies of data by nature of the hierarchy.
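The access rules described above can be written down as a small model, which makes it easier to check who can see what. This is a minimal sketch under stated assumptions: the role names, folder labels, and the `can_access` function are hypothetical and only mirror the description in the text. A real server would enforce these rules through its own permission settings.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical role labels, used only for illustration.
PI, SENIOR, MEMBER = "pi", "senior", "member"

# Which roles may open each level of the hierarchy.
LEVEL_ACCESS = {
    "pi_only": {PI},
    "senior": {PI, SENIOR},
    "shared_lab": {PI, SENIOR, MEMBER},
}


@dataclass(frozen=True)
class Person:
    name: str
    role: str
    supervisor: Optional[str] = None  # name of the immediate supervisor


def can_access(person, folder, owner=None):
    """Return True if `person` may open `folder`.

    For a personal folder, pass folder="personal" and the folder's
    owner as `owner`.
    """
    if folder == "personal":
        return (
            person.role == PI
            or person.name == owner.name
            or person.name == owner.supervisor
        )
    return person.role in LEVEL_ACCESS[folder]
```

For example, with a student `Person("Sam", MEMBER, supervisor="Pat")`, the call `can_access(Person("Pat", SENIOR), "personal", owner=...)` returns `True`, while the same call for another student returns `False`.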
In theory, this is a pretty good system. It allows you to keep data "protected" in the higher levels of the hierarchy, gives junior students a folder to protect their own work and allows an area for sharing information. It is also a good system to share data in collaborative work because you can give limited access to collaborators, which would only allow them to access necessary information. However, in practice, it still has problems. First, as mentioned above, the system did not protect against two students working on a dataset at the same time and bringing a merge issue back to the lab. We addressed this issue by creating a project "folder" and telling students to make all changes only to the files in the project folder. When students adhered to the practice, the primary data set could only be accessed for "edit" by one person at a time. If a second person tried to access the file at the same time, they were permitted "read only" access. Yet, problems still arose because:
students would still save the file and make changes away from the server;
generally, if there was more than one student on a project, they would all try to access the file the day before lab meetings/presentations; and,
occasionally personnel would access the file and leave it open on their computer while they left for class, dinner, or a meeting, which "locked out" everyone else. This last practice then prompted personnel to work off the server.
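The "one editor at a time, everyone else read only" behavior can be illustrated with a simple lock file. This is a sketch of the general technique, not the mechanism our server or any particular software used. The function names and the `.lock` suffix are hypothetical. It relies on `os.open` with `O_CREAT | O_EXCL`, which refuses to create a file that already exists; that guarantee may be weaker on some network file systems, so treat this as an illustration only.

```python
import os


def try_open_for_edit(data_path, user):
    """Return "edit" if `user` obtained the lock, else "read-only"."""
    lock_path = data_path + ".lock"
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists,
        # so only one person can hold the edit lock at a time.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "read-only"
    with os.fdopen(fd, "w") as lock:
        lock.write(user)  # record who holds the lock
    return "edit"


def lock_holder(data_path):
    """Return the name stored in the lock file, or None if unlocked."""
    try:
        with open(data_path + ".lock") as lock:
            return lock.read()
    except FileNotFoundError:
        return None


def release(data_path, user):
    """Remove the lock, but only if `user` is the one holding it."""
    if lock_holder(data_path) == user:
        os.remove(data_path + ".lock")
```

Recording the holder's name in the lock does not prevent someone from leaving a file open during dinner, but it does tell the locked-out student whom to ask.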
In addition, ideally in this type of hierarchical system of data management, a postdoc or senior lab member needs to make sure that junior personnel are backing up their work to the server and that senior members of the lab are moving that work to the protected levels of the hierarchy. However, when selecting the person to manage the server and backups, keep in mind that senior members of a lab often face career milestones such as dissertation defenses, job searches, and grant writing, which could distract them from regularly making these backups. If you have a large number of people in the lab, keeping track of everyone's work and backups can be time consuming. It may be a good thing to address every week at lab meetings, and even to have a computer at the meeting so that personnel can copy their files then and there. In most cases, even good students forget to submit their most recent work to the server. It's just a mistake, but if left unmanaged, it could become a big mess in time. You never want to discourage students from working, especially if they are excited about their project and willing to work at home. However, regularly reminding them to keep their data on the lab server is a healthy way to keep that data organized.
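Part of that weekly check can be automated by listing the personal folders in which nothing has changed recently. The sketch below assumes, purely for illustration, that each lab member has one folder directly under a server root directory; the function names and the seven-day default are hypothetical. A quiet folder does not prove that someone forgot a backup, but it gives the person running the lab meeting a short list of people to ask.

```python
import time
from pathlib import Path


def latest_change(folder):
    """Most recent modification time (seconds since the epoch) of any
    file under `folder`, or None if the folder holds no files."""
    times = [p.stat().st_mtime for p in Path(folder).rglob("*") if p.is_file()]
    return max(times) if times else None


def stale_folders(server_root, max_age_days=7, now=None):
    """Names of folders with no file changed in the last `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for folder in sorted(Path(server_root).iterdir()):
        if not folder.is_dir():
            continue
        newest = latest_change(folder)
        if newest is None or newest < cutoff:
            stale.append(folder.name)
    return stale
```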
This is one method that may be useful for keeping data organized in a pretty large lab. I will continue to address tips and tricks for managing data in the May issue of Scholarly Messenger.
Marianne Evola is senior administrator in the Responsible Research area of the Office of the Vice President for Research. She is a monthly contributor to Scholarly Messenger.
Alice Young, associate vice president for research/research integrity, is a contributing author/editor.