The FAIR principles in practice for publishing data and other scholarly objects – A conversation with Donny Winston
Bridging Academic landscapes.
At Access 2 Perspectives, we provide novel insights into the communication and management of Research. Our goal is to equip researchers with the skills and enthusiasm they need to pursue a successful and joyful career.
This podcast brings to you insights and conversations around the topics of Scholarly Reading, Writing and Publishing, Career Development inside and outside Academia, Research Project Management, Research Integrity, and Open Science.
Learn more about our work at https://access2perspectives.org
Donny Winston is a web systems engineer who is on a mission to rid the world of narrative-centric science. He is the owner and principal of Polyneme LLC, a cofounder of FAIRPoints, the host of Machine-Centric Science, and the author of an email newsletter on FAIRification for research directors and data stewards.
Topics we discuss in this episode:
- FAIRPoints
- Expectation management for machine reading
- What are the limitations of AI?
- The semantic web and linked open data
- www > spreading knowledge
- Jupyter notebooks
- Open-able science
- Stop Drawing Dead Fish, by Bret Victor
- Copyleft licenses
- Scite.ai
- Data camp
More details at access2perspectives.org/2022/04/a-conversation-with-donny-winston
Host: Dr Jo Havemann, ORCID iD 0000-0002-6157-1494
Editing: Ebuka Ezeike
Music: Alex Lustig, produced by Kitty Kat
License: Attribution 4.0 International (CC BY 4.0)
Guest: Donny Winston, ORCID iD 0000-0002-8424-0604
LinkedIn: /donnywinston
Twitter: @donnywinston
Podcast website: machinecentric.transistor.fm/
Personal website: donnywinston.com/about
Consulting website: polyneme.xyz/about
Which researcher – dead or alive – do you find inspiring? Bret Victor
What is your favorite animal and why? The koala. I just love that it sleeps so much.
Name your (current) favorite song and artist/group. Sliver by Nirvana.
What is your favorite dish/meal? The No Name at Grasshopper in Allston (Boston)
TRANSCRIPT
Jo: It is a great pleasure today to have Donny Winston with us, who joins us from New York City in the United States. Donny is a web systems engineer who is on a mission to rid the world of narrative-centric science. He’s going to tell us more about what that means in a minute. He’s also the owner and principal of Polyneme LLC, a cofounder of FAIRPoints, the host of Machine-Centric Science, and he writes an email newsletter on FAIRification for research directors and data stewards. Welcome to the show, Donny Winston.
Donny: Thank you for having me.
Jo: That’s great. So we met not too long ago. I think we’ve crossed paths on Twitter here and there. Your name and face ring more than a bell. And then we crossed paths again in the initiative and project called FAIRPoints. Many of the researchers among our listeners have probably heard about FAIR data, and we’ll talk a little bit more about that during this episode. But starting with that: what is FAIRPoints?
Donny: Sure. Yeah. So FAIRPoints is a community initiative. I got roped in by Sarah and Chris and others. Essentially, I found it through my interest and experience in Software Carpentry and Data Carpentry, which I’ve done for many years. I really believe in upskilling and training for scientists who aren’t necessarily going to be coders, but who want to be at least at a carpentry level for that kind of stuff, to be able to do things and to learn basic data science skills and programming. That’s great. But we didn’t see something similar that was focused on FAIR data, really getting into the weeds of making digital objects findable, accessible, interoperable, and reusable. On top of that, we shared an experience, and I certainly had it, of a certain brittleness in the Carpentries’ instructional material. Maybe brittleness isn’t the right word. It’s very nice and prepackaged, but there isn’t an obvious way to take little bits and pieces of it and combine little lesson nuggets to form a complete workshop that you trust will be coherent. You can do a lot of that on your own as an instructor, but there aren’t lots of variants of the same sort of lesson. You’re kind of meant to take it as is rather than pick and choose. We thought it would be wonderful to not only teach people about implementing FAIR and making their data findable, but to do it in a FAIR way: making these little nuggets, these little FAIRPoints, that you can assemble, so that you can find and interoperate all this lesson material that helps people deliver instruction on how to go FAIR. So we got really excited about that idea and wanted to start off by piloting it as a series of lunchtime talks, where lunchtime is wherever the speaker’s lunchtime is.
It’s very international. After each talk we have a little hack together on the notes from that session and turn those into what we term FAIRPoints, so that the outcome is shared in a FAIR way and feeds into instructional material. So that’s what FAIRPoints is all about. We’ve had a couple of talks so far and, yeah, I’m just really happy to be involved with that. I think it’s a great idea.
Jo: Yeah, I think it’s a brilliant idea. For the listeners, it might not have been too easy to follow what this discussion was about so far, so let’s take a few steps back. It’s about research data. Within the scope of open science, there’s quite a buzz around making research data FAIR, which means, as you said, findable, accessible, interoperable, and reusable. That’s much easier said than done. Most researchers I’ve met, and I personally, have struggled with how to actually implement that. And this is what FAIRPoints is about: making it approachable, finding and showcasing best practices, showing how implementation actually happens and how to make your data FAIR as a researcher. It also shares the message that it’s probably impossible to make your data fully FAIR from one moment to the next. The goal is a process, and getting as FAIR as possible. Maybe now is also a good time to explain a difference that is a common misconception in my experience and in the courses that I give, and I assume for you as well: FAIR and open data are often used interchangeably as terms. So what’s the difference between open data and FAIR data?
Donny: Sure, absolutely. Openness connotes that anyone can get access to it; you’ll never be turned away. Maybe you have to fill out a form or something. Maybe it’s not easy, but really anyone can have access. FAIR is more about machine actionability. It’s about it being really straightforward, not just for humans, but for the really dumb agents of humans, programs, to be able to get all this stuff. You can have FAIR data infrastructure and be a private company and do everything internally. You can benefit from all of your historical data, your proprietary data, being FAIR, so that you can go and fetch what’s relevant for a new problem, featurize it for machine learning, run those things, and get a lot of benefit from your data without any of it going beyond your firewall. Whereas open, I see it as something of a philosophical benefit, but also a practical benefit for science. It’s similar to something that the creator of Linux and Git said: more eyes make all bugs shallow. I think I’m paraphrasing. In terms of programming, that’s the idea of open source code: just by having it openly accessible, you’ll get contributions from people and systems that you weren’t expecting or soliciting, but who still find themselves interested and consider themselves stakeholders in your success and the project’s success. And so you just get better software that way. That’s the idea of open data for open science: you’re just going to get better science worldwide. The more you can make things open, the better it’s going to be. FAIR is also going to make things better, but you don’t necessarily have to be open to do that. You can be protective about it.
I think that for effective open science, at least some degree of FAIR is a prerequisite, because if you’re really opening things up to everyone, unless things are FAIR, it’s going to be really hard to coordinate contributions and for people to find what’s interesting to them and what they can contribute their expertise to, and how. Open science assumes FAIR, I would say, in my opinion.
Jo: I totally agree with that. As much as open sounds nice and friendly, if you literally only publish your data set without any context or explicit description, there’s really no use in having it online, because nobody but the person who generated the data will be able to contextualize it. And that goes even for our own data. We might forget within a week what the context was for some specifics and then not be able to recall it a week or a month or a year or three years later, towards the end of our PhD or whatever. So it’s really also in our own favor to have our data archived and documented in a FAIR manner. And I think one of the key components of FAIR research data is the metadata, which is basically a description that is as rich and informative as it can be without exhausting ourselves, but that also anticipates details that might be interesting to have recorded in the future, instead of skipping some because we assume, oh, this is not important to me now. It might be later. Just keeping that in mind, speaking from my own experience here.
Donny: Yeah, absolutely. Another thing I just thought of that I think is a really great articulation of this difference between open data and FAIR. It doesn’t have FAIR in it, but it’s a design note on linked data that Tim Berners-Lee wrote many years ago. It’s on Linked Open Data, and it has five stars. The first star is: is your data available on the web, in whatever format, with an open license? So it can be a scanned PDF with an open license, and that’s already got a star, because that’s open data. The second star is for it being available as machine-readable structured data, like an Excel spreadsheet instead of an image scan of a table, right? The third is, okay, it’s machine readable but also in a non-proprietary format, like CSV instead of Excel, not needing any special program. And then the stars above that are what you alluded to: using more standards so that people can identify things, and more metadata, and linking out to other things so that people understand context. So all this stuff is gradual FAIRification, but the first part of that is, well, it’s open. I put a scanned image of a table up there and it’s CC0 or CC BY. So it’s open, and that’s true and that’s great, but it’s maybe not as helpful for mass consumption by people’s machine agents, through programs and that sort of thing.
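Editor's sketch: the five-star ladder Donny describes is cumulative, so it can be written down as a small function. The following is a minimal illustration in Python, not part of the conversation. The class and field names are hypothetical, and stars four and five are paraphrased from the design note (use URIs to identify things; link to other data for context), which Donny only alludes to.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset."""
    open_license_on_web: bool     # star 1: on the web, any format, open license
    machine_readable: bool        # star 2: structured data (e.g. a spreadsheet)
    non_proprietary_format: bool  # star 3: open format (e.g. CSV, not Excel)
    uses_uris: bool               # star 4: things are identified with URIs
    links_to_other_data: bool     # star 5: links out to other data for context


def star_rating(ds: Dataset) -> int:
    """Count stars cumulatively: a star only counts if all lower ones are met."""
    criteria = [
        ds.open_license_on_web,
        ds.machine_readable,
        ds.non_proprietary_format,
        ds.uses_uris,
        ds.links_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# The three examples from the conversation.
scanned_pdf = Dataset(True, False, False, False, False)
excel_sheet = Dataset(True, True, False, False, False)
csv_file = Dataset(True, True, True, False, False)

for name, ds in [("scanned PDF", scanned_pdf),
                 ("Excel sheet", excel_sheet),
                 ("CSV file", csv_file)]:
    print(name, star_rating(ds))
```

The loop stops at the first unmet criterion, which mirrors the point that a scanned table with an open license is open data (one star) without yet being convenient for programs.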
Jo: Sure, yeah. So what’s your personal interest in these kinds of technical but also research-related details, coming from a background in data science and also systems engineering? Because your bio states that you’re on a mission to rid the world of narrative-centric science. You know what I mean.
How do you try to save the world? What’s the issue and how are you solving it?
Donny: Okay, well I didn’t mention it in the background, but I’ll speak more about it. I really believe in the power of science and the accumulation of knowledge to help us solve humanity’s problems and beyond humanity, all of our species.
I studied science and engineering. I was an electrical engineering and computer science student. I studied nanofabrication and materials science, and I was an experimentalist. I did theory and I did a lot of stuff in the lab to help advance the kinds of computing systems that we could make and the limits of materials, how we can just do better stuff. I was also really interested in software and data, and I realized that there was a limit to the contribution that I could make by essentially putting another PDF out there. I would notice that when I was trying to find prior art to benchmark my work, or to see how things are put in context, it was really hard for me to do, even as an expert. I could read papers, I could understand them. I had a way of doing things very efficiently, like skimming abstracts and skimming figures and seeing if something’s worth reading further. But there’s still a limit, as a human being, to how much knowledge I can consume while also being confident that I’m interpreting it correctly based on my background. I came across the idea of the semantic web. I thought that was a really great idea, along with the idea of machine actionability and FAIR. I feel like the document web, the World Wide Web, did a lot of great stuff for spreading knowledge, in a way that some of the AI expert systems in the 80s and other things tried to do with programming: just making all of these documents available in a standard way allowed search engines like Google to use PageRank and all these algorithms to connect all of these different human-produced documents and surface what’s interesting. And now people can do all sorts of stuff. They can search through natural language, and they can generally find what they want for a given thing.
Where it isn’t as strong is in terms of granular data, especially in science, where things can get very domain specific and you care about units and qualifications and provenance and how the instrument was calibrated, all this stuff where you still, as an expert, have to delve into the methods and materials section of the paper, a narrative in English text or whatever, and there can be ambiguities in sentence structure, and it’s just really weird. So I wanted to help with this sort of thing. For many years before I struck out on my own, I was working for a materials science project called the Materials Project out of Berkeley National Lab, and I was kind of a spider on the web there, so to speak. I was doing data ingest and making website features, and essentially helping the materials science community get more out of the data that was being produced cleanly by computational materials scientists. So this was all clean data, not necessarily experimental data, but I saw a lot of value in that: making APIs, getting all that available, and having the research community really find value in this open data store and ways of accessing it in kind of FAIR ways. However, I found there were still limitations to the approach that I took there. It wasn’t quite truly FAIR. A human still needed to read the API documentation and understand what each thing meant, and there’s just a long way to go, essentially. I realized this was even broader than what I was doing there, and I wanted to be able to do it more broadly with different communities. That’s what led me to the path that I’m on now. In terms of narrative-centric science specifically, it’s kind of the flip side of machine-centric science, which I feel is a basic expression of FAIR. I feel like a lot of people, when they read the FAIR principles, might think that human actionability is kind of fine.
As long as humans can find things and access things, then that’s fine. But I don’t read it that way. I’ve heard from Barend Mons, the lead author on that, that he also struggles with convincing people that it really is about the machines. Does the machine know what you mean? And then people will follow, because when things are machine-centric, you can use programming to generate user interfaces that people can understand. Another illustration of this: we had a talk at FAIRPoints the other day from one of the founders of ResearchEquals, where they’re trying to do little modules of research. The idea there is that you might publish a bunch of things over time that aren’t necessarily a paper. You might publish your plan, you might publish the data set, you might publish all this stuff. And from that, you might form a narrative. That narrative might have multiple authors over time who didn’t pre-coordinate who’s going to do the analysis or whatever. And from that, you might be able to string together and synthesize many different narratives. Like, here’s an introduction, here’s our plan, here’s the data we gathered, here are the conclusions. You could have many different narratives around that. Just like you have many different branching histories in a Git source repository, you might have many narratives around little connected points of science. So having science be centered on a bunch of PDFs that we then synthesize knowledge from through natural language processing and all this stuff, that seems silly. It seems silly to bury our data so that we can mine it. We’re hiding it by putting it in all this natural language reporting. So that’s what I want. I would like to see what we couldn’t really do 200 years ago, or even really 30 or 40 years ago. Things weren’t quite there.
Now we have the infrastructure, the opportunity to make machine-centered science and not have narratives be the primary quantum unit of scientific progress. That’s what I want to go for. And I feel like the FAIR principles are a nice articulation of that goal. That’s my take on machine-centric versus narrative-centric science, and it would just make things more flexible. I feel like people could get more out of scientific progress. Things that aren’t necessarily valued initially, like quote-unquote bad results, might end up being central to a different narrative. And the fact that they’re published lets them be part of multiple different narratives, and they could be cited, et cetera. So I feel like there’s that part of it, too.
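Editor's sketch: Donny's point about units, provenance, and calibration being buried in methods text can be made concrete by putting a narrative sentence next to a structured record of the same observation. This is only an illustration under stated assumptions: the sentence, the field names, the instrument name, and all values are made up, and no particular metadata standard is implied.

```python
import json

# A narrative sentence, as it might appear in a methods section (made up).
NARRATIVE = "The film was roughly 50 nm thick, measured after calibrating the tool."

# The same observation as a structured record. All field names and values
# are hypothetical; they only illustrate units, provenance, and calibration.
RECORD = {
    "id": "obs-001",
    "property": "film_thickness",
    "value": 50.0,
    "unit": "nm",
    "provenance": {
        "instrument": "example-ellipsometer",
        "calibrated_on": "2022-03-01",
        "derived_from": None,
    },
}

# Explicit units let a program compare values without guessing.
UNIT_TO_METERS = {"nm": 1e-9, "um": 1e-6, "mm": 1e-3, "m": 1.0}


def value_in_meters(record: dict) -> float:
    """Convert a record's value to meters using its declared unit."""
    return record["value"] * UNIT_TO_METERS[record["unit"]]


def thinner_than(records: list, limit_m: float) -> list:
    """Return the ids of records whose value is below limit_m (in meters)."""
    return [r["id"] for r in records if value_in_meters(r) < limit_m]


# The record round-trips through JSON, so it can be published and fetched.
SERIALIZED = json.dumps(RECORD, sort_keys=True)

print(thinner_than([RECORD], limit_m=1e-7))
```

A program can answer "which films are thinner than 100 nm?" from the record directly, whereas answering it from the sentence would need a natural language processing step and a guess about what "roughly" means.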
Jo: Yeah. That rings several bells and also resonates deeply with how I see it. I believe that as humans, we need narratives. That’s what our brains can process as information. But scientific writing hasn’t really changed at all in the past 200 or 300 years, ever since it’s been documented in journals. Now we have PDF files, which are, for the most part, not machine readable. As someone who works with a team of African scholars running a preprint repository, AfricArXiv, I only learned three years into our work that it’s better practice to share deposits not only as PDF but as an XML file, or possibly both, to make them machine readable and searchable, so that there is a machine access point or AI access point to screen for keywords in a reasonable amount of time. Also, there’s never been so much data and research sharing as we have today. There have never been so many PhD students producing results and packaging them into PDFs. Who’s going to read all that? I also give courses in scientific writing, and I tend to tell people to make sure to put their work in a format that’s machine readable, because honestly, you can guess that maybe three people will actually read your paper unless it has access points for the machine, so that it can be screened for keywords and thereby be discovered. So I totally agree, given the age we live in, the masses of research output being produced around the world, and the huge challenges we are experiencing as humankind. There are political circumstances: hopefully the one we are currently experiencing in Eastern Europe will be peaceful by the time most of you are listening to this. Well, okay, it’s already beyond that point, but hopefully it won’t escalate from here. Let’s put it that way. We have climate change. We are in the middle of a pandemic, which hopefully becomes endemic at some point.
It’s about time that we put our heads together as scholars from around the world, listen more to each other, and make the information accessible and searchable from each corner of this planet to really be able to solve many, if not all, of these challenges.
I’m also a bit skeptical, because I don’t come from a data science background. I trust machines and robotics, and also data science approaches, only so much, because at some point I can’t follow them conceptually. So do you see drawbacks? With algorithms, there’s also this debate that there are still biases because they’re still human made, so they can only be as smart and as functional as the brains of the people who program or design them.
Donny: Yes.
Jo: Also, there’s probably never going to be a perfect system. Nature hasn’t found one either, which is why we continue to go through evolution. So maybe the skepticism shouldn’t be too large. What are some of the challenges that you see in a data-centric or machine-centric approach to research information? Was that a sentence? I’m sorry.
Donny: Yeah, that’s great. Just before that, I realized there’s an analogy I wanted to make that could help clarify narrative-centric versus machine-centric. I do a lot of work in Jupyter notebooks, which I love. Those are ways to gather narratives. The current publication-centric, narrative-centric system is almost as if everything that the notebook needed was in the notebook, and that’s it. Whereas part of the niceness of a notebook is that you want to present the notebook to someone else, but a lot of the stuff, the Python libraries that you import and all these things that you run in the cells, is actually published elsewhere. With Python, you install all this stuff, so you bring all these dependencies into the environment of the notebook and create your narrative. But not everything is locked away there. That’s where I definitely agree with you. Humans like narrative, and narrative is important. But with Jupyter notebooks, the narratives are put in front of you without being central to the whole enterprise. In terms of machine biases and stuff, I definitely agree with that, more or less. I think Steve Jobs once said that he envisioned the computer as a bicycle for the mind. I feel like a lot of this AI stuff is like a race car for the mind, which is to say that you’re still limited by the mind. Whatever someone can do, computers can do it a lot faster. But if there are people with a lot of biases, then the computer can execute those biases really fast, at scale. So, yeah, I’m with you on that. There’s nothing inherently better about them. I’m reading this book now by the Dreyfus brothers called Mind over Machine, which talks about a model of skill acquisition, going from novice to advanced beginner to competent to proficient to expert, and about the limitations of machines: they can get up to a few of those levels, but it’s hard for them to do other things.
There are limits to what you can do with symbolic logic. And there’s something to be said for the connectionist deep learning things you have now that are subsymbolic. But still, as you mentioned, you’re kind of going to reproduce the experience that you’re given. If you put a limit on what kind of experience, training, and modeling you give a machine, it’s garbage in, garbage out: biases in, biases out. So I guess what I can say is that in the best case, these machines would be agents for us, but you’d still want human authority and trust to be involved in deploying and harnessing these things. You mentioned the question of whether we are going to be able to read all these papers. Probably not. But I imagine someone having a machine agent that can read all the papers for them in the way that they would have read them. You can get into trouble with that, because what if a paper reader is trained on your advisor, and you’re a PhD student? Then it will read all the papers as your advisor reads them. But that’s not actually what you want. You might have a different perspective than your advisor on certain things. So again, it’s really hard. You do run into that issue where your speed of performance is very fast and it outpaces your speed of learning. When you’re reading papers as a human, you can read a paper in a few minutes, let’s say, if you read really quickly and skim it. You can also learn what things you want to rethink in those few minutes, and you can iterate on that. Whereas with a computer, you can read thousands of papers in minutes, but you don’t recover all of that reflection time. So, yeah, that’s not something I really know how to solve. I don’t really have a great idea for it. All I can say is I imagine there’s still a lot of low-hanging fruit where things probably seem kind of boring for someone.
They don’t really expect to have their perspective changed much. They just want to read a bunch of stuff in the same way and see what the result of that is. And I think machines are excellent for doing that. Part of that is, whatever you call it, data literacy or algorithm literacy: understanding what kinds of things work. What’s nice is that, in practice, I’ve seen a lot of research showing that a lot of the simpler learning algorithms work very well if you have sufficient data and sufficiently clean data, so you don’t need very complicated algorithms. Even for someone who is not an expert in machine learning or a computer scientist, I think a lot of these things could ultimately be explained pretty well to a practitioner. They already understand their domains very well, in science or social science and that sort of thing. So understanding how computers execute a particular thing can just be part of research. Another thing I heard the other day, and this is definitely more of a cultural thing: in experimental sciences and engineering, you have environmental health and safety departments that require training. If you’re going to work with a laser, you need to take training to make sure you can work safely with a laser. If you’re going to work around radiation, again, you need to take a training class. You have personal protective equipment. There’s all this training because there are hazards associated with the work that you’re doing. It’s not quite recognized yet, but I think it could be more firmly recognized that there are hazards in working with data. There could be training: this is how you work with data to avoid biases; there are hazards in working with it as a researcher that we want you to be aware of, be certified in, and renew every year, just like for any experimental apparatus with which data is collected. So I think there’s an opportunity for that as well in the sciences.
Jo: I think that’s an excellent point. I hadn’t actually thought about that. I took notes as we spoke, and I felt, before you mentioned that last thing, that there’s probably a need for expectation management: what can you expect to learn from any data point or data item? What you highlighted now is how to avoid misinterpretation of whatever is being presented as a data set, and also to have algorithms that are trained not to misinterpret, to just present results according to keywords or whatever you teach them to look out for. And I totally agree. I think there should be a training or a checklist on what you can draw from whatever the algorithm presents as an output, and where you should draw the line: what cannot be concluded from an analysis because it’s full of biases, or because there’s not enough metadata associated with it to support an interpretation in one direction or the other. I think Scite.ai is heading in that direction.
We can also share this in the show notes; I’m just sharing it with you now in the chat. They use keyword search and weighting, or artificial intelligence analysis of the text. It’s complex but simple: they look for words to weigh whether a citation of a particular article is more pro or more con. Has it been cited positively or negatively? From that you can draw conclusions: is this highly debated, or is there common agreement and a solid approach? I think that’s also critical. And Scite themselves say that we need way more data in order to get better outputs from this, and that we should be cautious in interpreting the results. So you still have to read the paper if you want to have your own opinion weighing in. But yeah, it’s an exciting learning curve that we are on in the digital age with this machine reading approach. And there’s a lot in between, which is really exciting for us as science service providers: how we can sensitize researchers to what they can do with their data, and how they can prepare the data so that it can be reused and analyzed by others.
Donny: Yeah. One of the things I was just thinking of was the benefit of machine actionability, having something presented in a way that a machine agent can act on it. As you mentioned, there may be certain conclusions that someone draws from presenting data in a certain way, and an expert looking at that might say you can’t draw that conclusion because of this. If data is presented and published in a machine-accessible way, you could scale that kind of expertise. Someone could write down that rule logically, and then you could send out a crawler that runs on all this data and highlights these rule violations. Programmers get this kind of thing all the time in integrated development environments. I’ll be writing some code, and there are programs running asynchronously to my typing that will say things like: you should probably refactor this, or you’ve just used a keyword that wasn’t defined elsewhere, or you’re adding these two variables, but one of them is an integer and one of them is a string, so maybe you want to take a look at that. It’s kind of like a word processor that uses a dictionary to highlight spelling errors or grammar errors. But it’s a lot more granular because it has knowledge of the structure, because programming language syntax is data structures, abstract syntax trees. Similarly, imagine researchers preparing data in a machine-actionable format. They could get this kind of feedback really early on about the accuracy of their analysis and whether they can draw certain conclusions, and they wouldn’t necessarily need to be experts in statistics or something. So this is a grand vision. But this is the kind of thing that, at its core, is enabled by a machine being able to sniff out and look at the stuff that you create, rather than it just being paragraphs of text that a machine has to put through a natural language understanding layer in order to do anything with it.
That’s a possible incentive. It might sort of help people develop better work in the same way that a lot of programmers will like to use Linters and IDEs rather than just use a plain text editor and throw it up on GitHub and then have the integration test say like, hey, this is wrong, they kind of can have something more local.
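A minimal sketch of the idea Donny describes, where an expert writes a rule down once and a program runs it over many machine-readable records, much like a linter. Everything here is an assumption made for illustration: the record fields (id, sample_size, p_value, measurements), the two rules, and the sample-size threshold of 10 are hypothetical and do not come from any real standard or tool mentioned in the conversation.

```python
from typing import Callable, Iterable, Optional

# Hypothetical record layout: each dataset description is a plain dict.
Record = dict
# A rule inspects one record and returns a warning message, or None if it passes.
Rule = Callable[[Record], Optional[str]]


def small_sample_rule(record: Record) -> Optional[str]:
    """Flag a reported p-value when the sample size is below an illustrative threshold."""
    n = record.get("sample_size")
    if "p_value" in record and isinstance(n, int) and n < 10:
        return f"p-value reported with only {n} samples"
    return None


def mixed_type_rule(record: Record) -> Optional[str]:
    """Flag measurement lists that mix types, similar to an IDE type warning."""
    values = record.get("measurements", [])
    kinds = {type(v).__name__ for v in values}
    if len(kinds) > 1:
        return f"measurements mix types: {sorted(kinds)}"
    return None


def check(records: Iterable[Record], rules: Iterable[Rule]) -> list:
    """Run every rule over every record and collect (record id, message) pairs."""
    rules = list(rules)
    findings = []
    for record in records:
        for rule in rules:
            message = rule(record)
            if message is not None:
                findings.append((record.get("id"), message))
    return findings


# Made-up example records, standing in for data a crawler might collect.
records = [
    {"id": "exp-1", "sample_size": 4, "p_value": 0.03, "measurements": [1.2, 1.4]},
    {"id": "exp-2", "sample_size": 50, "p_value": 0.2, "measurements": [3, "4"]},
]

for record_id, message in check(records, [small_sample_rule, mixed_type_rule]):
    print(record_id, message)
```

The rules only work because the records are structured; the same checks against paragraphs of narrative text would first need a natural language processing step.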
Jo: Yeah, I already mentioned that and I’ll appeal to natural language processing. From my experience in workshops and also being a non-native English speaker myself, it tends to be hard for people whose first language is not English to translate the research into a foreign language, really. But also when I had first-language English speakers in the course, they said they have the same issues because technical writing is not the same as natural language in English.
What’s your take? On the other hand, this was something we discussed in another episode with Maureen Archer. Technical writing in some disciplines nowadays is way beyond what’s humanly conceptualizable in some regards. As a researcher, try and keep it simple in wording so that it stays comprehensible for most of your readers. That’s just a side note, but when it comes to machine accessibility, machines, I would guess, prefer technical writing. And then the question is, is there a threshold or a point where the machines and humans can easily misunderstand each other, like irrespective of your cultural background or language background, but also with technical language versus natural language? Does that make sense?
Donny: Yeah, I think that’s very astute because we generally just have narrative. We try to cram human understandability and machine understandability into the same channel. You might have people who want things to be readable to humans, but they might also think like, oh, how can I make my paper NLP ready? How can I make sure that a machine can also understand this? And it’s like you’re conflating your readers. It’s hard to write to two different audiences at the same time. One of the cardinal rules of presentation is to know your audience. It’s like, well, I have some people who want this and some people who want this. It’s going to be a really silly presentation. It’s not going to be really helpful to either party. And so I definitely agree. I know that there’s this book called The Pyramid Principle. It’s about business writing and it advocates a style of writing that’s essentially like Prolog logic programming. You kind of have your goal statement at the top and you should start with that and then whatever you need to support it as sub goals, then kind of do that. It’s basically like what you really want to do is program, but you have to write English sentences. So do it like this and it’ll be logically clear. The other thing is some fields have it, I wouldn’t say easier, but like, in a lot of math papers, you’ll have these, what I’ve heard termed in a podcast by someone, a portal abstraction, where you kind of have this word. And because no one else is really using that word in this way, you really get to go on it. The word was like monoids, certain words that are very technical, that aren’t used in everyday speech. It’s great to use those in natural language because you’re almost certain what they’re going to refer to. And you can look up that word and find other papers on it and be sure that they’re talking about the same thing versus something that’s more ambiguous.
When I was a graduate student, I was working in lithography, and there’s this idea of resolution, optical resolution. People use different definitions of resolution even within that subfield. Is it equal? What does ten nanometer resolution mean? Does it mean the half pitch between two features that you can write? Is it the smallest feature dimension of a thing? And people were very adamant and passionate about what the definition of resolution was. In another context, resolution has nothing to do with that: conflict resolution, whatever. Whereas a math term like monoid or functor, it just sort of rarely occurs outside of that context. You’re more certain what that’s going to mean. And so I don’t think there’s really a solution to this other than to have the separation of like, here’s my machine-consumable thing, my sort of base thing is encoded in whatever, you mentioned XML for your archive, or just something that is not necessarily for humans as an audience, but then there’s a companion document or multiple companions derived from that that are for humans as an audience. And those can be a lot more flexible and not have to be NLP-able, because they’re for humans. And if a human wants to delve deeper in a programmatic way, then there’s a program representation of that data, of the provenance of the experiments and the methods and the breakdown of how an experiment was done. So, yeah, I think a separation of narrative and machine-centric digital assets would really be helpful, instead of people having to cram both of these audiences into the same channel.
Jo: Okay, so some work to be done. Can I just ask you, I think this might take us back to what could have been another beginning of this episode. What is your understanding of open science now? Also with the information that we already shared with the listeners, with your focus on data analysis and research data being FAIR in a perfect world, or opting towards that? I’m asking this because now as a research community around the world, everybody is using open science thinking it’s a new thing. In my world, it’s nothing new. It’s just good research practice. As most of us start and assume we will put ourselves up, we’re just not sure how to do it. And open science now has more than a hundred approaches to how to do that, by discipline, by research topic, by experimental setup. And there’s no one answer for how to practice open science. And also coming back to this: open does not necessarily mean it needs to be 100% open. Just as open as feasible is what I tend to say. But in my training, even if it’s actually about open science or what we now understand as open science, I tend to forget about the term. Let’s just talk about good research practice and what challenges you observe in your own research and how we can find a workaround. That’s, to me, open science. Just to set the stage: what is your take on open science? And I believe I also saw several definitions. So there’s not one definition of what it should be or is, but there are several common themes that are part of open science and various understandings of what it should entail. And it’s not open science.
Donny: Okay. So to paraphrase again, or to rescue…is the question ughh…
Jo: It’s a long question.
Donny: What should be open science?
Jo: What are your personal takes with your approach with data analysis and also as a professional trainer, consultant, machine centric? Okay, your approach is not so much generally about open science, but more about research data and how to make that FAIR and machine readable. If you think about the term open science, what comes to mind for you in your context?
Donny: Sure. What comes to mind to me is that open science for me feels a bit like open standards, open protocols, like that kind of thing, in the sense that I see open science as good science in the sense that it’s science that you can plug into. If you have an open plug in the wall, you can plug something into it. And it’s this ability to be built upon because you don’t have to go in and open it up and see what’s in there. I’ll give an example in terms of originally non open research projects, maybe in a private company, even a company might want to collaborate, in a sense, and they often do like a company might send a job out to what’s called a contract research organization to do some part of what they need to do. And the information is like a diode. The company won’t really share that much. They only show what the CRO needs to do, and then they can send information back. But oftentimes it’ll be really horrible when the CRO returns a scanned PDF of what they did, or they’ll misinterpret the experimental protocol that the company wants to be done. So a private company might have different people that are sharing in its data output, ecosystem and data lifecycle. These contract research organizations, a partner organization. Even if you’re doing protected research, you might reach out to a collaborator. There might be a collaborator across time. You may have the thing said and done, but then you publish it, and then someone reaches out to you and wants to do a follow up or something. And at that point you sort of want it to be open. I guess open to me means more like openable, the idea that you can facilitate collaboration among people over space and time, that you may not have originally intended in your original plan. You may have seen over time that this project could actually benefit from adding more people to it with different expertise. And so the quote unquote openness can be managed with access control and licenses. 
Even with open source, open depends on the license. There are going to be open licenses for data, like Creative Commons with Attribution, or like CC zero, like no attribution necessary. Or with open source, there could be so-called permissive licenses like MIT and BSD, or so-called copyleft licenses like GPL. They’re all open. They don’t say, like, you cannot participate. They say, here are the terms by which you can participate if you want to participate. And so I feel like even stuff that’s sort of closed with research, you want it to be openable. We want to be able to say, like, oh, we could use this person’s opinion and get this expert on the project and not have it take us six months to figure out how to get them the data. And it should be less than that. So even though you’re not sharing it right away, it’s an open system. Yeah. So that’s what I think of as open science. In that sense, I agree with you. It just seems like that’s the way it should be done, because otherwise you have all this friction around opening things up when you decide they need to be open. But you can make them openable and FAIR without making them freely accessible to anyone, to the public, right away. You’re still practicing open looking science.
Jo: Thank you so much for sharing. This is really a new perspective for me to look at that, especially with the conversation we had just before this. I also like the term open research. Openable science. That’s probably a better approach as compared to open science flat out because it’s too often misinterpreted and scares people away. Like, what? I have to be fully transparent? That’s not what I came here for.
Donny: I guess it’s gotten a lot of traction recently because of open source, and that’s why it’s taking that turn and riding off the popularity of that. For the same reason, a lot of people think open source just means sort of it’s free. But really, I see a lot of repositories on GitHub that don’t have a license. They say they’re open, but if I need to use it, I’ll contact them and say, like, hey, there’s no license. I don’t know if I can use this or not.
Jo: Sure, because by default, and that’s also what I didn’t know before I came across this license debate, like, by default, everything is protected by copyright.
Donny: By default, all rights reserved by the author. A lot of companies, I’ll be on a contract and they’ll say you cannot use any GPL software because it’s open, but it means that if we build on that, then we have to make our software open, which we don’t want to do. So there are all these different meanings of open that aren’t necessarily what people would think of, which probably means I’m putting it on a table in the middle of the park and anyone can get it.
Jo: Oh, okay. Thank you so much. Maybe closing off, I’d like to go back to our icebreaker questions that I shared with you before we met here for this episode. So when I asked you about a researcher you find inspiring, you mentioned Bret Victor. Who’s that?
Donny: Oh, yeah.
Bret Victor, he has an electrical engineering background like me. He actually went to Cal as well, for his master’s, and he has done a lot with helping me, and a lot of other people, imagine what the Internet can sort of help us do in terms of understanding things and explaining things. There’s this one called Magic Ink, where he kind of explains, like, the Internet is like magic ink. Why are we doing stuff like we don’t have computers? And there’s one presentation called Stop Drawing Dead Fish, where he’s very empathetic with scientists, where he sort of doesn’t want scientists and researchers to use these tools that are reminiscent of a day when we didn’t have computers that can sort of help us do things. And so this is part of why I like machine accessibility.
He’s published a lot of inspiring single-page interactive essays that prototype lots of ideas about how you can have interactive systems. Like, imagine a news article where it mentions some numbers in a conclusion and you can just click and drag on one of the numbers to change all the other numbers and the conclusions. So the author’s model is actionable for you quantitatively in real time. You don’t have to, like, go and download the software and do this. Again, the narrative is important, but you can kind of interact with the data on your own terms. Modify assumptions. Recently, I think he started a space in Oakland, California called Dynamicland that is again in this realm of like, how can we use machine technology to kind of help people better create and collaborate in ways that are really imaginative and aren’t just sort of like the equivalent of like, we have folders on our computers, on our desktop, and that’s very much like we have a desk, a physical desk and a folder, and then we’re just taking that on the screen now. And he’s very much about, like, we can do more than just what we did before, only now it’s on the computer. We can do so much more. And I love that. The vision, his articulation of the vision coupled with his technical execution of these demos so that you can really see what it is. So that’s another thing that inspires me, sort of being able to do implementations of these things even just as proofs of concept.
Jo: That sounds a lot like what we spoke about earlier, like not having openable software in this case, openable Internet, like actionable without having to contact the originator of the information. But you can still do that as well. You don’t have to. That also helps with language barriers and all kinds of other obstacles that might be in the way. Great. When I asked you about your favorite animal, you mentioned the koala. Is that because it looks fluffy?
Donny: Yeah, it’s just fluffy too. I remember when I learned just how much it sleeps. I was like, that’s amazing. Maybe I was particularly tired that day, but I just love the extremity of that. I love it. It’s cute. I love that. But I think I just like that being representative of such a dispersion of characteristics among species. It’s not just like, oh yeah, pretty much all animals sleep like six to eight hours a day. There’s a lot of different life on the planet, but also it’s just very cute. So, yeah, it sort of stuck with me as my favorite animal.
Jo: Yeah. And there I agree. And then Nirvana’s Sliver as your favorite song. But there’s also this evergreen, and Kurt Cobain is certainly one of the producers of such songs that will forever resonate with people. But why Sliver?
Donny: Yeah. I don’t know. I didn’t think too hard about this one. I listened to it a lot and I just love just the repetition of like, Grandma, take me home, grandma take me home. And I just love the energy of the song and I don’t know, it’s sort of raw and primal and it’s simple. Like a lot of great songs are. And I don’t know, it’s just something that will get in my head and I like it. I don’t know. Don’t take me too much on my favorite, favorite, but that’s just sort of something that came up and I wrote down.
Jo: Favorite songs sometimes change over time, and some songs just decide to stay with us for longer or for life. And then since you have a favorite restaurant in Boston, is there a particular dish that they do really well?
Donny: Oh, yeah. No, it’s just called the No Name, at this place called Grasshopper in the Allston neighborhood of Boston. I love that so much. So I’m vegan now. I’ve been vegan for a while. I was vegetarian at the time. I’ve been vegetarian for many years. And just one of the things that I enjoy, for whatever reason, is like mock meat. And there’s a long tradition in East Asia of doing that very well. And yeah, it’s sort of almost like a sesame chicken fast food Chinese dish, but it’s wheat gluten, and it’s just absolutely fantastic. And it was far away enough from where I was living when I was there that it would be special enough to go on special occasions. And whenever I’m back in the area, I try to make a special trip out there. And yeah, honestly, I haven’t checked on it since Covid-19. I hope they’re still around. But yeah, that’s my favorite dish. It’s just absolutely wonderful. Sweet wheat gluten with sesame seeds, like, on the broccoli. …Anyway
Jo: Next time I’m in the States I will make sure to have a stopover around Boston. It’s on my bucket list of places to visit. I’ve been to New York, but not Boston yet. I really want to go. Can I just ask you, because I’ve also been vegetarian for the longest time, and when I became vegetarian, people thought, oh, maybe she doesn’t like meat, but it was actually for animal welfare reasons, to put it positively. But also, I really like meat. Since I decided not to eat meat anymore, I can totally relate to having mock meat, how, for some people, it’s irrational. Like, why do you not eat meat? But then you take all the soy stuff, which is like meat, but it’s not the real thing. But damn, it’s close. And yes, I did like the taste of meat very much. And also my body kind of craves it. So I’m really happy that we have this option now. Is it the same for you? What made you a vegetarian in the first place?
Donny: Yeah, absolutely. I think the principal reason for me is environmental health. I just sort of saw the amount of resources and water and grain that goes into a pound of beef. And I’m just like, wow, the environmental economics of animal technology are terrible. I’ve been reading this book recently called After Meat, and it introduces this term that I hadn’t heard before, animal technology. And it just emphasizes that, yeah, we have food, and animals are a technology that we’ve been using for food, for protein, and they’re really inefficient. And we can do so much better. So much better. It’s strictly from the environmental economics part. But also, yeah, I definitely agree. I have two dogs, and this exceptionalism for canines versus other… Yeah. I think animal life is just all very valuable. On that note, I think taste and texture is like, I love that kind of thing. I love the Beyond Burgers and Impossible Burgers and just all this stuff. I like the texture of all that stuff. More like that on that. Yeah.
Jo: Let’s go for vegan burgers sometime when we meet physically.
Donny: Absolutely. And Boston is great. I heartily recommend that as well.
Jo: I heard good things about the city. It’s still going to happen, me being there sometime. Thank you so much for joining this episode and of course, welcome back to another future episode. At some point whenever we have some… we spoke about a lot of things. There’s a lot of potential to go deeper into one or the other aspect and probably also more that we will come up with as we move forward.
Any last comments or anything you want to add before we close?
Donny: No. I don’t know, if you heard anything I said and it resonated, or like you hate it or whatever, either way, strong reactions, just feel free to reach out to me. I love to talk to people about this kind of stuff, about FAIR data and machine-centric science or really anything that I talk about. Feel free to just reach out to me. My personal site is donnywinston.com. You can get my email address from there. Thank you for having me on.
Jo: It’s a real pleasure, and we will put all your digital profiles on the affiliated blog post, so contact details will be discoverable and accessible. So feel free to get in touch with Donny or reach out to him through us here in this circus, in this show, and we’ll come back with another episode some time soon. See you around.
References
- Books
- Mind over Machine by Dreyfus and Dreyfus
- After Meat by Karthik Sekar
- Pyramid Principle by Barbara Minto
- …