Reflections on data science & related processes


For the last several months, we have been working on the data science component of this project. We have invited people from the diabetes community who are interested in partnering with data scientists (our research team) to answer questions of interest to the community. Some projects have been completed: we even submitted and presented a poster with our results (learn more about that here) at a scientific conference. Other projects are still ongoing. I have two buckets of key learnings and reflections that I want to share in this post: one is about working with our partners from the community; the other is about the data science team itself.

@DanaMLewis reflections on data science work from Opening Pathways

First up: reflections on the people from the community who expressed interest in doing data science. I'll start with a reflection on language.

One thing you’ll note is that I’m trying to be careful not to say “participants”. These are not study participants (who are sometimes referred to as ‘subjects’ even in traditional research settings). These individuals are fellow research partners interested in researching a topic of interest. I sometimes slip into calling them participants in order to differentiate them from the data science team on this project, but they are true partners, and indeed researchers themselves. Even though I know that we don’t want to refer to them as participants, it still happens! The culture around the idea of building resources to support people independently doing research is still new and growing, and I think it’s important for us to be thoughtful about the language and the terminology we are creating to help describe this work & the related ecosystem.

One other reason this language matters is our IRB. As I mentioned in my lessons learned as PI post, our IRB was great and didn’t have a problem with our study application (which actually encompasses any possible study we might do on the retrospective data already collected). However, we had lumped Sayali’s dissertation research into the same IRB, and she does have traditional “participants” who will be interviewed and thus need to be consented before participating. Our research partners, by contrast, aren’t being studied. We decided that we wanted to make expectations clear about what we would do, could do, and would not do in these research partnerships, but that is not the same as a legal consent.

Now, let’s talk about the process for working with our community research partners, which was this:

  • Interested partners would fill out the Google Form we created to express interest
  • We asked for people to self-identify into one of three groups:
    • those with specific, pre-defined research questions (who mostly want help doing the research or learning some data science skills to do this work themselves)
    • those with interest in doing research but want help in defining a more specific research question
    • those who are generally interested in doing research.
  • After receiving a form entry, I would reach out to the individual to follow up. If they had a specific question (group 1), I’d start by scheduling a first call with a member of the data science team. If they did not have a specific question (group 2 or 3), I’d schedule a call for us to talk more about the project & to help them work on redefining (if relevant) a topic into a research question.

Interestingly, I was expecting to get 50% in Group 1; 40% in Group 2; and very few in Group 3. At this point in time, I think we’ve had 40% in Group 1; 30% in Group 2; and 30% in Group 3. Also, and this was disappointing to me, several individuals who filled out the form and were in Group 3 (interested, but no specific question in mind) did not respond after I followed up by email to schedule a call to discuss their interests.

Much of this grant is centered around Group 1 and Group 2, and the toolkit we aim to develop in the next few months will also support these individuals. However, I don’t want to let the needs of Group 3 go undocumented: I heard and saw very clearly (both from the lack of responses, and from the discussions with the few individuals from Group 3 that I did speak to) that there is a need for support and resources for them, too. One individual had a good suggestion of creating a closed Facebook group for everyone to join, so that individuals who wanted to support research efforts – but not necessarily do their own projects – could chime in and support others by reviewing or commenting on protocols, etc. I decided that was beyond the scope of what we could do with our limited time and resources in this grant, but I like that idea for anyone’s future projects looking at engaging the broader community of interested patient researchers.

Next, I’d like to highlight some observations and learnings from working with our data science team.

First, some assumptions that I carried into this project. I knew that our team did not have any prior experience with diabetes data, and of course not with some of the complex closed loop data that’s been generated by the diabetes community that would be involved in many of the community research projects. So I knew there would be a lot of onboarding, training, and guiding regarding that. We actually spent a good portion of our first in-person team meeting doing whiteboard sessions to talk through the data structures and diabetes data to help them understand how those would be used in some of the research projects. My goal wasn’t to make them experts in diabetes or our data samples; but to give them a framework of understanding that they could apply to the projects, and also allow them to leverage their expertise and skills in data science in ways that the community has not (for the most part) been able to access.

One idea I had with this project was the hope that, if this model was successful, an output would be a recommendation for patient communities to ask organizations or companies for data science resources in the future, not just for financial or other contributions. However, based on what I’ve learned in this process, I don’t think that this is going to be a scalable ask (in general; there may be exceptions).

Here’s why:

There are multiple DIY tools created in the diabetes community. They all work for a lot of people, who see success and continue to use them. However, for the most part, these tools are designed for real-time usage. The underpinnings of the system are not necessarily designed for ease of data science analysis. And thus, just like in many other projects, there is a larger than expected amount of work that is data preparation and data cleansing, before one can get to the stage of data analysis.

A simple example: much of the anonymized, donated diabetes data is in json format. That is great for some tools like R, because data scientists and researchers can pipe the data in that same format and not have any issues. But a large proportion of the researchers and data scientists who are looking to understand and prepare the data for analysis prefer it in csv format (so they can use tools like Excel to open it and look at it).

That’s why, even before this project began, I was working on creating a series of open source tools to help me and other researchers work with this complex data. First, I created a tool to convert the json to csv. (Which is more complicated than it sounds due to the unknown and infinite number of possibilities across the data sets – more details about that here if you are interested.) I’ve added a few other scripts as I’ve worked on and with other groups of researchers outside of this project. But in the scope of this project, every project with members of the community has also necessitated looking at how we review the data, separate some elements out, and then analyze it.
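To illustrate why the conversion is harder than it sounds, here is a minimal sketch in Python. It is not the tool described above, just one possible approach under stated assumptions: the input is a json array of records, nested fields differ from record to record, and the field names in the sample (`sgv`, `device.name`, `bolus.amount`) are hypothetical placeholders rather than the real data schema.

```python
import csv
import io
import json


def flatten(value, parent_key="", sep="."):
    """Flatten nested dicts/lists into one level, joining keys with `sep`."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # List positions become part of the column name, e.g. "treatments.0"
        items = ((str(i), v) for i, v in enumerate(value))
    else:
        return {parent_key: value}

    flat = {}
    for key, child in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(child, (dict, list)) and child:
            flat.update(flatten(child, new_key, sep))
        else:
            flat[new_key] = child
    return flat


def records_to_csv(records, out):
    """Write records to `out` as csv; returns the column names used."""
    rows = [flatten(record) for record in records]

    # The columns are not known up front: collect the union of all keys,
    # in the order they are first seen.
    fieldnames = []
    seen = set()
    for row in rows:
        for key in row:
            if key not in seen:
                seen.add(key)
                fieldnames.append(key)

    # restval fills in blanks for records that lack a given column
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


# Hypothetical sample: two records with different nested fields
sample = json.loads(
    '[{"date": 1, "sgv": 120, "device": {"name": "x"}},'
    ' {"date": 2, "bolus": {"amount": 1.5}}]'
)
buffer = io.StringIO()
columns = records_to_csv(sample, buffer)
print(buffer.getvalue())
```

The sketch makes two passes over the data because the full set of columns is only known once every record has been seen. That is manageable for small files, but for very large donated data sets it would mean either holding everything in memory or reading the input twice.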

It’s a lot of heavy lifting on the cleansing and prepping, and I don’t think we did a good enough job mentally preparing for that amount of this type of work.

The other challenge is that a requirement (that both RWJF and I asked for/mutually agreed upon) is for everything we do – all tools, all output – to be open source. That also means all of the scripts and things we do will be online and available for anyone to use. I am framing this as a challenge because, although everyone involved in the project is interested and excited about doing things in an open source manner, it takes work to actually do things open source.

Now, it’s not necessarily a lot more work – but it is different, and just like any new habit, it requires practice and repetition to establish. This is one thing that I’ve noticed multiple people on the data science team have needed. They have not had prior experience with open source of any kind, so uploading code or documentation, tracking changes through GitHub, etc. are not established habits. So in addition to training our teams on diabetes data, we’ve also had to do training on things like GitHub (which I expected), and to give repeated reminders and asks (more than I expected) to upload things and log them so we have an open record of the tools that are helping us do the work.

One approach that has worked well for me in the OpenAPS community over the past year is showing anyone who’s building an OpenAPS how to make a pull request (aka PR, aka requesting an edit or change) to the documentation. It is a simple one, adding a name or initial to a list, but it gives them practice making a PR and helps them realize that it’s different but not necessarily too hard to learn to do. I did a similar thing with our core research team during the first in-person meeting, and we practiced navigating around GitHub and making simple content PRs (tweaking biographies, changing titles, etc.) to get them comfortable with the ideas and practices. I think this was a good start, but I plan to think more about how I can do a better job of introducing open source collaboration methods to future research partners. (And I’d love help/feedback/suggestions for ways to do this – please share your ideas!)

I think some of this also comes down – again – to a difference in perspective and background. Many of us in open source are not necessarily classically trained in computer science, and our urgency/need (real-world diabetes, for example, and real problems that we need to solve) necessitates being nimble and finding ways to get things done. We see real and immediate short-term feedback and benefits in working in a nimble and transparent way through open source. I don’t have the other perspective – but I wonder if the key to helping traditionally trained data scientists or academic research partners is helping them find ways to see the same near-term benefit in working openly and sharing their work as they go. Perhaps that might more naturally occur when there is a larger network of data scientists or researchers working in this space (and talking openly about it), rather than solely relying on the feedback loop within our team, but there may be more we can do as small communities spinning up this work to create a stronger feedback loop.

In summary: like everything else in this project, we have been learning incredibly valuable lessons both from the output of the data science work itself and from the meta processes of how we’re doing the work and how we’re all collaborating. I’ll be spending more time thinking about the open source onboarding of new researchers who join our team (or that I work with, beyond this project), and also thinking about what tools and resources are most needed and highest priority to support individual research partners in all groups (1, 2, and 3). And in the meantime, I think we still need to ask and incentivize traditional researchers to do more of their work in public and openly share in real-time, so that everyone can benefit in the near-term and not only in the typical longer-term time frames.


John Harlow

I think there is probably a middle road, with solving real time stuff balanced with ease of export and analysis of data. I think scale facilitates this, but negotiation of a DIY solver and a data scientist researcher takes work, patience, strong communication. I am just reprising your second-to-last paragraph, but I just love this post, learning so much about the data science side and putting it in my pocket for future projects.

Dave deBronkart

As always, watching from a distance, admiring the trailblazing work you’re doing!

For future reference, don’t even let anyone THINK about using a Facebook group. This summer it’s become very apparent that FB’s security is disastrously bad, whether they leak for commercial benefit or they just don’t know how to keep a group “closed.” Here’s the SPM post about two major screw-ups - there’s reason to think that Cambridge Analytica was (a) not a rare screw-up and (b) not sufficient to get FB’s geniuses to fix the problem (if they can).

Note that one leak may have enabled marketers to scrape group members’ names using just a browser extension, and in another case trolls hacked into a group and behaved so severely that FB immediately destroyed the group.

If someone wants to create a private group, much better to do it somewhere like or SmartPatients.

Meanwhile keep blazing the trail! Your message about actual research partners comes through loud & clear here. Thank you!

Dana Lewis

Thanks, Dave. There’s definitely an entire topic’s worth of pros/cons related to Facebook groups. I think it depends on the membership, and the purpose of the group, and making sure everyone is aware of the privacy implications of the choice(s). For some groups - patients talking about personal experiences - they might not want to use Facebook groups. For other groups - not talking about personal experiences; would otherwise be talking about the same content publicly, but want the convenience of Facebook features/login - it may not matter as much to that group. It’s definitely something worth discussing for any group selecting any channel.
