Data Archiving and Sharing
By Marianne Evola
Years ago, when I was a first-year graduate student, I learned a valuable lesson about the importance of retaining collections of raw data. I was a teaching assistant (TA) for a lab course. Over the semester, as part of the course, the undergraduate students conducted a series of experiments and collected a nice chunk of raw data. At that time, the raw data was printed out on large piles of paper that accumulated in the classroom. Once data collection was complete, it was the TAs' responsibility to compile, analyze and graph the data according to a defined course protocol for an end-of-semester summary to the students. As would be expected from an undergraduate lab course, the results were pretty predictable and there were no surprises. Hence, once our work was done, we submitted the spreadsheet and results to the course professor, cleaned out the lab and set about preparing for our own final exams. However, when the professor looked at the summarized results, he had an interesting insight and contacted us to request that we rework the raw data. Dismayed, we had to admit that after working all semester to keep the large piles of data organized and finishing our task of compiling and analyzing the results, we had discarded the raw data. We had not meant to do anything wrong, but the course protocol had not told us what to do with the data and we thought that it was our responsibility to clear out any mess from the classroom at the end of the semester. What followed was an hour of growling by the professor, but what I clearly remember, other than being very embarrassed, was him repeating, "Never discard raw data."
Later, I began conducting my own graduate research. As in the undergraduate course, every experimental session produced a small pile of raw data printed out on paper, about 10 to 20 pages each day. As a brand new graduate student, I was proud to have that small but growing pile of paper on my otherwise empty desk. I would rapidly enter the small amount of summarized data into a paper notebook and then pile the raw data on my desk. After a few weeks, the pile of raw data grew and I decided to store it in the deep file drawer of my desk. There was plenty of room down there. After a few months, the pile of raw data outgrew the desk drawer. Other than the transcribed summary data, I seldom looked at the raw data, so I started placing the data into expandable data binders and storing them in a box under my desk. But then the number of binders and boxes grew. Luckily, technology was rapidly evolving and soon the raw data was being stored on the computer with backups on floppy disks, then zip disks, CDs and collections of external drives and servers. At this point I was a senior member of the team and had undergraduate students working with me, and the collection of raw data continued to grow in a manner reminiscent of the Star Trek episode "The Trouble with Tribbles." Yes, I love to get those classic Star Trek references into my Scholarly Messenger articles.
Actually, I was lucky. A fellow graduate student was working on behavioral studies that required her to videotape rat behavior. The primary experimenter coded rat behavior live, but she needed the videotapes for secondary coding to check the reliability of her data. Her videotapes were her primary raw data. Her rapidly growing collection of data consisted of piles of paper from both primary and secondary coders, as well as an enormous collection of videotapes. And although both of us did our best to keep our raw data organized, the piles of paper and collections of videotapes grew rapidly and took a great deal of space to store. And all raw data needed to be preserved. "Never discard raw data."
Now you would think that with the evolving technology, all of our raw data would be stored electronically, and you would be right. All of our research gradually progressed so that all raw data, including videos, were stored electronically. However, even after that technology evolved, it took a while for us to trust that electronic storage was sufficiently safe and reliable for long-term data storage. Thus, for a long period of time, we continued to print (and videotape) our experimental data as well as store it electronically. Gradually, as the team began to trust electronic data storage and everyone became adept at perusing their data electronically, one by one we stopped printing hard copies of our data. I remember the lab meeting where this practice came to the attention of our mentor. That meeting consisted of an intense discussion regarding the vulnerability associated with this practice and whether we were putting the data at risk. Electronic storage alone seemed an unnecessary vulnerability to someone who had built a career on enormous collections of raw data stored on paper, even if all that paper took up a lot of space. Furthermore, she pointed out, all of the lab's historical data had been collected prior to the availability of electronic storage. So much of the raw data stored in cabinets and storage rooms was the only copy and historical record of the lab.
Eventually, the need for space and an enormous move across the country overshadowed the need to store large volumes of printed raw data, especially since most of the data was stored electronically. Thus, it was time to discard cabinets and storage rooms full of raw data. And even though we knew that we had most of that data stored electronically and thus had access to the raw data files, throwing out that paper felt like a mortal sin against research. Throughout the purge of paper I could not tune out those words of one of my first mentors, "Never discard raw data." Regardless, just as technology had evolved, so had our trust in the reliability of that technology. Electronic storage was sufficient and even more reliable because we could store multiple copies of the raw data in multiple locations. Our electronic raw data was, in fact, less vulnerable than when we had only one paper copy.
Now again, we face an evolution of trust as research progresses toward archiving data with the purpose of data sharing. Technology is forcing us to assess the vulnerability of our data. And similarly, our reluctance and skepticism about the security of technology will rightfully make the transition a gradual one. Regardless, the utilization and requirements for data archiving will continue to progress. Luckily, I was recently reminded that there are members of the research community, as well as the Texas Tech community, who are a few steps ahead of the rest of us with regard to data archiving. Resources for secure archiving are being assessed and/or created and our experts are working to provide instruction to the rest of us so that ultimately, we can safely archive and even share our data, where appropriate. Specifically, the Texas Tech library has a team that is working to provide guidance and resources as federal agencies increasingly implement directives that mandate data archiving as a duty associated with federal funding.
A few weeks ago, I participated in a webinar about data management and archiving hosted by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. This webinar inspired me to think again about the challenges that many researchers face as they work to manage complex and rapidly growing data collections. I also was reminded of a seminar that I attended last year at the University Library that was led by Brian Quinn. During his seminar, Mr. Quinn provided information about data repositories that are available to the Texas Tech community and he mentioned that Texas Tech is a member of ICPSR.
So, what was it about the ICPSR webinar that refocused my attention on the resources that are available for archiving data? Specifically, it was a discussion about their systems for archiving sensitive research data. Now, my research did not deal with sensitive data. Although I loved my data as all researchers do, no one was interested in breaking down our laboratory doors to gain access to my data. If someone broke into our lab in urban Detroit, they would have most likely been interested in gaining access to narcotics. However, since I began my role in responsible research, my awareness of sensitive data and its vulnerability has been raised considerably. There are labs that are working on marketable technology and thus their discoveries may progress to valuable products. As such, these labs are very concerned about the security of their data and its vulnerability to theft. Because these researchers are productive, their work often involves federally funded projects as well as marketable technology. Thus, they may be faced with federal directives for archiving data while also focusing on protecting their marketable technology from theft.
Similarly, there are labs that work with human research data that is associated with extreme privacy issues. Although researchers who utilize human subjects work hard to de-identify data and protect the privacy of human subjects, there are data collections that cannot be readily de-identified. Laboratories that collect blood, tissues and genetic information retain identifiable information by the very nature of their research samples. Likewise, data consisting of photographs or video images of human research subjects could be traced back to the human subject, which could violate privacy. Raw data of this type requires a different level of security, especially when one considers that the goal of data archiving is ultimately to share or "mine" data for future research questions. The ICPSR webinar discussed its history of handling sensitive data and the systems that it has created for customizing who will have access to the data and in what manner. The webinar specifically mentioned that some data is restricted to the extent that the only way to access it is for the interested party to get permission from the creator and then fly to their facility, where they would be given "view only" access to the data for analysis. Thus, during the webinar, I rapidly recognized how this data archive could be very useful to our Texas Tech researchers who utilize highly sensitive data.
There are many opinions about data management and archiving even as federal directives on data sharing evolve. There are many challenges associated with mandatory archiving. I have written a bit about the challenges and opportunities that would accompany data sharing. However, when I think back about learning how to keep a massive data collection stored and somewhat organized, I am reminded that there are individuals who pursue a career in curation and they are valuable resources for providing the rest of us with guidance on how to keep our archives secure and accessible. Historically, a research career requires us to learn a wide variety of skills. We are teachers, personnel managers, writers, and most importantly, we are experimenters. We compile data, analyze data and we write up our results so that we can build a career. In my field, if you wanted to incorporate creativity into your research, you had to be able to construct unique equipment, wire the electronics and write computer code to conduct experiments and collect data. And, as I have recently realized, I had to develop some organization and curation skills. Now, I did an acceptable job of it. I could always find what I needed with a bit of time and insight. However, I'm betting that had I sought out some instruction, I would have been much better at organizing my data so that I could easily access it at a later time. Furthermore, I'm betting that if I looked at some of the data that I collected as a brand new graduate student, even I might have difficulty deciphering my own abbreviations and labels.
I never thought to seek out the assistance of the library for developing skills associated with curation. Nor do I believe that our library provided guidance of that type when I was a young graduate student. However, in retrospect, I think that as students and young researchers, most of us likely could have used some guidance on how to better manage our data. It is exciting that our library is stepping up to provide opportunities and instruction for data management, archiving and sharing. Data archives will provide exciting opportunities to access historical data and/or mine that data to ask additional research questions. Regardless of our concerns about archiving and sharing data, the future will bring growing requirements for data archiving and sharing, at least for federally sponsored research. As our librarians work to assess resources and provide us with guidance on data curation, it is the responsibility of researchers and students to pay attention and seek out that instruction. Specifically, the library team that is working to ready the university for data archiving and sharing has already provided its first article on data management (link above) and is planning a series of articles that will be published in the Scholarly Messenger. In addition, per some email discussions, we will be working with the library to coordinate a series of seminars to discuss challenges to data archiving and the resources that are available to address our challenges. Please watch the Scholarly Messenger for articles and seminar announcements. Furthermore, if you would like to be included on a distribution list to receive announcements regarding seminars on Responsible Research, please email me at Marianne.firstname.lastname@example.org and request that I include you on my distribution list.
Marianne Evola is a senior administrator in the Responsible Research area of the Office of the Vice President for Research. She is a monthly contributor to Scholarly Messenger.