Karthik Ram is a senior research data scientist at UC Berkeley’s Berkeley Institute for Data Science. (Photo/ Chris Michel)

For years, researchers have advocated for scientific results to be more publicly accessible. Amid decreased public trust, open science practices make scientific studies more transparent and reproducible. They can also help accelerate discovery by offering access to the data, code, software and hardware that underpin findings.

In a win for open science advocates, the White House Office of Science and Technology Policy (OSTP) recently announced new policy guidance that would make federally funded research publicly accessible more quickly. We spoke with Karthik Ram, a senior research data scientist at UC Berkeley’s Berkeley Institute for Data Science (BIDS) and co-founder and director of the nonprofit rOpenSci. His career offers a lens into the role computing and data science play in open science work.

In this Q&A, Ram shares how ecology research led him to a passion for open science, how his organization built tools, practices and community to make this kind of work attainable and what their efforts and the new federal guidance mean for young scientists. He also discusses his latest open science project, one of several recently praised by U.S. Senators Dianne Feinstein and Alex Padilla of California for advancing scientific innovation and equity.

This interview has been edited for length and clarity.

 

Q: Why do open science and related platforms matter?

A: People are impacted by research outcomes every day – healthcare outcomes, outcomes related to climate or anything else. These findings can inform policy and laws. They can impact your life in many ways. It's also taxpayer money that funds a lot of research.

If not for open science, all you're seeing is the outcome at the end of a very long black box. You don't know what went into it. [Open science ensures] research is being done with transparent approaches – methods that have been vetted by trusted peers – and also all the artifacts that have been generated as part of a funded effort are made available for everyone else to use.

In the early days of the open science movement, there was this moral obligation to do transparent research, benefit the community and the public and be accountable to our funders and our peers. But it was really hard to do open science. The technical aspects were beyond a lot of day-to-day researchers. It didn't seem fair to say, “You need to do open science, but it's really hard to tell you how to do it.”

Open science platforms enable people who are bought into the idea to engage in open science. An open science platform is a tool, scaffolding or an environment where people can actually engage in open science practices if they would like to. It can take the form of an executable notebook or notebook system, open-source software or an open data resource.

Q: How did you become involved in this kind of work?

A: About a decade ago I was a postdoctoral fellow on a project. It was my first time doing a data science project. Data science was just being coined as a term back then. My job was to take existing data, synthesize that, ask a few questions and then write papers about them. 

I was handed a large collection of datasets that had been pre-processed, and someone created a whole bunch of derived datasets that were ready for me to work on. But when I tried to understand what people actually did to that data, it was really hard for me. [Their work] was done with a hodgepodge of proprietary tools that I couldn't get a hold of, and a lot of the methods were not really documented. It was a really frustrating experience for me to find out how we arrived at a certain point without actually seeing the whole process. 

During that time, I implemented a full open-source pipeline to get from the source data to the final data. When I did that, I realized many other people had worked with similar data before and had developed their own bespoke pipeline, which they never shared. It was never made public and was never done with open tools. 

I could have stood on the shoulders of other giants. But instead, I ended up building the same thing for the nth time. I realized more people should care about open science because it benefits everyone. That's sort of how I fell into the space.

Q: How did this one project become an open science career?

A: I started implementing these practices in my own work. Any time I wrote code for a paper, if the code was general enough for someone else to use, I would turn that into a software package, publish it, and then mention it [in my paper], so there was something immediately available off-the-shelf for someone to take and reuse. 

When I wrote other code that was not generalizable beyond my paper, I would make sure I published that in a code repository, ensure there was always a way to access the code and share that [alongside] my paper and data. I made sure data was deposited in a proper archive.

I also started giving meta-talks to my peers. I’d say, “These are practices that not only benefit everyone else, but they also benefit you down the line, if you think of yourself as a future collaborator. Everything is written down. Everything is documented. You can always go back in time and see if you made a mistake. Someone else can examine everything that you did.”

At some point, I pivoted from primarily doing basic research to doing more applied research around open science and started thinking more broadly about tooling practices and teaching people these practices. I also started thinking at a higher level about policy changes and incentives that would help people transition from doing science a certain way that was not really beneficial to anyone to a way that would benefit society and research as a whole.

Q: What did that look like?

A: One of the first efforts in this space was a project we conceived called rOpenSci. Our idea was very general. If one person wrote a series of steps to retrieve data from some location, to process that data, to visualize it and to transform it in a way that's useful in a different context, that needs to be a reusable source of software. We would turn those into packages. 

Over time, the community started contributing directly. We pivoted from building tools and handing them out to the community of researchers to instead having them co-create with us. We ended up creating this whole peer review system. The end result was this highly collaborative process. It's their product and their tool. We just help them build it better. They maintain it and share it with the community. That enabled us to scale quite a bit. 

The impact was people started doing science differently. People realized that a lot of tedious work they were doing could be automated. They were making fewer mistakes. We started seeing our code showing up in places where we never interacted. That was like a true measure of how this had taken hold in the community. We also started holding trainings at a bunch of conferences. 

Q: How has this affected the field of open science?

A: Ten years in, rOpenSci is used in hundreds of research projects at any given time. People contribute software to us all the time as part of our peer review process. We're hoping this creates a lot of researchers who are up to speed on best practices, who then train their labs and their students. We're lifting overall open science capabilities without directly interacting with every researcher.

The outcomes of the work by my team and others are quite evident in all the policies that you see – for example, the policy from OSTP that came out recently around open access data publishing and code requirements for journals. These are areas that our work has impacted in indirect ways.

Now, almost any researcher who is starting a Ph.D. program will somehow end up learning how to code. They'll end up thinking about where to deposit that code, how to deposit their data and how to deposit preprints. And so if you think of open science as these different pieces that all fit together, we've been involved in a couple of those little pieces. But then many other organizations have filled the rest of these pieces out. 

Compared to a decade ago, this is a really excellent time to be doing open science because it's not just an idea. It’s an idea with a lot of tooling, support, training and peers who you can talk to about how to do these things.

Q: How does your most recent grant from the U.S. National Science Foundation (NSF) build on this body of work?

A: It’s a $2 million grant under a program called Pathways to Enable Open-Source Ecosystems. We support NSF grantees who are developing open-source software, hardware and data, and help them become sustainable enterprises within their ecosystems.

Before this program, someone might get funded and build something, and once the funding ended, their obligation to work on that project would end. Some of these projects may exist for a long period of time after that. Others may stop functioning right away.

Our goal is to stop that and make sure the investment from federal funding into all these different efforts continues to have an impact beyond their usual timeline. These tools could then exist for a longer period of time, and other people could build upon them and use them.

We're going to develop a set of training modules as part of a five- to six-week training course. We will launch the course in March and run it for the next two years. Hopefully, this program will continue to persist after that, too.

Q: What do you hope this kind of work will make possible in the future? 

A: One thing I hope will be possible is that we'll be able to make much bigger leaps and bounds in [research more quickly], because we’ll have so much scaffolding to build upon. Funding will always be a scarcity, so anything we can do to stop people from developing something over and over again will be a huge service to the community.

We need to not just ensure that these products exist, but that they're findable, discoverable and secure enough to reuse. And we need to make sure that there's a community of people interested in collaborating on and developing them. 

One more thing to add to that is I hope we do more [kinds of scientific work] that we have not previously conceived. A big outcome of open science is if you put something out in the open so it can be reproduced and verified as reliable research, it usually sparks new ideas, too. It usually helps transition methods from one scientific domain into another much, much faster.