DataCafé

Data Science for Good

May 31, 2021 DataCafé Season 1 Episode 16

What's the difference between a commercial data science project and a Data Science project for social benefit? Often so-called Data Science for Good projects involve a throwing together of many people from different backgrounds under a common motivation to have a positive effect.

We talk to a Data Science team that was formed to tackle the unemployment crisis that is coming out of the pandemic and help people to find excellent jobs in different industries for which they have a good skills match.

We interview Erika Gravina, Rajwinder Bhatoe and Dehaja Senanayake about their story helping to create the Job Finder Machine with the Emergent Alliance, DataSparQ, Reed and Google.

Further Information

Some links above may require payment or login. We are not endorsing them or receiving any payment for mentioning them. They are provided as is. Often free versions of papers are available and we would encourage you to investigate.

Interview date: 25 March 2021
Recording date: 13 May 2021

Intro audio Music 4 Video Library (Patreon supporter)

Thanks for joining us in the DataCafé. You can follow us on twitter @DataCafePodcast and feel free to contact us about anything you've heard here or think would be an interesting topic in the future.

Jason:

Welcome to the DataCafe. I'm Jason.

Jeremy:

And I'm Jeremy. And today we're talking about data science for good.

Jason:

Okay, cool. So what is data science for good, Jeremy?

Jeremy:

So this really comes from, you know, where we find ourselves today, I think, and it's become quite a movement in recent years, but especially around the pandemic, of course, yeah. And we've seen lots of initiatives, lots of events spun up by people trying to use data, trying to get value from data, and then trying to use data science, of course, to generate some level of socially beneficial outcome. Some good in society.

Jason:

A real noble cause then where we've got so much data, and we've got experts out there in a lot of the data science that we tend to do or talk about, and the applications usually have an industrial application or commercial application. But we're talking here about a more charitable cause or a more noble cause where you're giving back somehow.

Jeremy:

Yeah, a moral cause, I think. Absolutely. I think one where you're not doing it necessarily for your job, although that can, of course, have a moral dimension as well. No reason why not. But for this one, maybe you're coming together, possibly with people you haven't met before, or who, you know, aren't in your normal social or work team. And you're developing a product, using some publicly available data sources maybe, to generate something interesting, and hopefully impactful.

Jason:

Yeah, impactful actually is the real word, right? It's got some impact and some use for somebody out there who maybe can't afford to buy it, but it's of impact, of use. Yeah, it is great. And there's loads of cases out there.

Jeremy:

So one of the ones that I became aware of quite early on in the pandemic last year was a couple of projects that were spun up by the Royal Society in the UK, a scientific organisation, and they started two projects, the RAMP project and the DELVE project, which are great acronyms, so hat tip to them.

Jason:

They're two different projects.

Jeremy:

Yes, yeah, related. So RAMP stands for Rapid Assistance in Modelling the Pandemic and DELVE was Data Evaluation and Learning for Viral Epidemics. So one was more data focused, the other was more modelling focused, but I mean, both, arguably, are data science projects at some level, and they brought together totally disparate sets of academics, organisations, companies, charities, and, you know, tried to get some operating insight from data that they were able to get hold of, you know, sometimes from government, sometimes from publicly available sources. Yeah. You know, I mean, RAMP had a mission to try to get out of lockdown more quickly. Right. What is our exit strategy?

Jason:

Yeah. And we can see why that's such a strong impact. And

Jeremy:

Yeah, two lockdowns later.

Jason:

Yeah. And it's amazing, even when you said it there of using publicly available data, how much publicly available data there is, and how many opportunities there are for any of us with skills or interests or curiosity, to take some of the publicly available data and use it and go in with that question. What impact can I make?

Jeremy:

I think so. You see datasets that come out of government and sort of open data initiatives that are supported by local government, central government, as well. You've got news organisations then sourcing their own data and then putting it on the website, and often they're in eminently scrapable form, and can then be picked up and people can start using them to spot patterns and make deductions around these areas. Exactly. So you were involved with a charity hackathon, weren't you, Jason, a couple of years ago. Do you want to say something about that?

Jason:

Well, we had an initiative to bring together the team in a hackathon for the social good aspect, and it was with the Missing People charity, who came in and worked with us. You know, they had cleansed, anonymised data, and we looked at how we could apply some nice statistical methods to segment certain risk factors for where they might put their efforts in targeting risk groups who, you know, may go missing. You know, 'missing' is kind of the umbrella term for any reason that somebody disappears, yeah, for their own reasons or because they've been forced to go into hiding or whatever it might be.

Jeremy:

Did you feel it was a different sort of team working experience, then, working on this project for Missing People, compared with your other experiences of doing data science in a company setting?

Jason:

There is an element of, yeah, collaboration and camaraderie that just gets kind of emphasised when you're working on something that everybody is inspired a little bit by, and has some passion or drive to make sure that it's as impactful as it can be. Yeah, because you know that there's a feelgood factor as well as a product to deliver that you've kind of promised. You said you want to do this, so you really put your heart and hard work into it.

Jeremy:

Yeah, we're lucky enough today to be talking to some people who participated in one such project for an organisation called the Emergent Alliance, which was set up by Rolls-Royce and Google and other organisations. I think there's 50 organisations involved in this alliance now. And this was an organisation that was set up to try and address challenges surrounding the pandemic, not just in the teeth of it, but also how on earth you recover from it. And I think that's where they really, really put their efforts in, in trying to drive use of data to help people recover. So we were lucky enough to speak to Erika, Rajwinder and Dehaja, who were contributors to one of these projects. I'm joined in the DataCafe today by three people. I'm very excited to be talking to Erika Gravina, Rajwinder Bhatoe and Dehaja Senanayake. Welcome. Welcome to the DataCafe. Thank you. Hello, good. Thank you so much for joining us. So I'm particularly excited to be talking to you today about a project you're just coming to the end of, and one that I think will chime with the listeners quite a lot, which is a project to do with sort of social good and around the pandemic. So Erika, could you just give us an overview of the project and what you're hoping to achieve with that?

Erika:

Sure. So we got onboarded onto the project around September 2020. And the project for us started off as being about thinking of ways that we can help people think about the job market in a way that isn't as fixed and stationary as maybe the way that they think about it now. And what I mean by that is that due to the pandemic, of course, there has been a lot of unemployment. And what we were trying to achieve was to try and think about the industries within the job market in a way that was more flexible, and to then allow job applicants to move across industries more fluidly. And in order to do so we focused on a representation of jobs that wasn't just to do with kind of the structure of a job description, but to think in terms of skill sets. And I suppose that's kind of the overarching story of the whole project.

Jeremy:

Brilliant. So let's just take each of you in turn. So Raj, how did you get involved in this in the first place? What was your pathway into the team, then, before the project?

Rajwinder:

So our team came about when we took part in the Code First Girls data hackathon back in September. It was a week long, and we were analysing some data based on the economic and environmental impact of the pandemic. And thankfully, we ended up winning and our prize was to join this project. But before that, I had taken a Code First Girls Python course and when it was recommended, I applied and here we are.

Jeremy:

Brilliant. Dehaja, was this a team that you got together before that initial project, or was the team put together on day one to tackle the first challenge?

Dehaja:

No, the team was put together on day one, basically. So I didn't know Erika or Raj. Yeah, we didn't know each other beforehand. Which, it doesn't feel like we didn't, because I think we've become actually quite good friends over the course of working on that challenge but also doing like the Reed project. Yeah, I think Erika messaged us and just said, I'll be your team leader for the challenge. Welcome. Would you like to share any of your like hobbies or anything? And I think we all found that we all quite like to cook and eat and things. So we bonded over that, and then eventually over coding, but it's been really good.

Jeremy:

Raj talked about coding experience. How much coding experience did you have, Dehaja, before you started this? Was this new to you? Or was this something you've done a little bit of before, or a lot of?

Dehaja:

Yeah, it was really new to me. So like Raj, I've done one of the Code First Girls courses in Python. But that was my only experience in Python beforehand. So kind of thrown in a little bit at the deep end, but I've had pretty supportive people around to help me when needed, so that's been good.

Jeremy:

Erika, what about yourself?

Erika:

Yeah, it was a real roller coaster for me as well. It was definitely kind of a new environment. And I had been selected to be team leader. And I did put myself forward, not thinking that I was actually going to get the position. So I think it was definitely kind of an interesting start to the project, like focusing on getting the group to get to know each other, and actually, like, enjoying having many calls and discussing both the problem and our personal lives. So I think, you know, looking back, I'm really happy that we focused on the social aspect of it as well, because it made the whole experience a lot more fun. And in terms of coding experience, I had some experience through university, but I never did an official course of any kind, and not through CFG. That was my first experience being involved with CFG.

Jeremy:

And then in this project, you joined a sort of loose collection of teams that formed part of what's called the Emergent Alliance. And you were asked to tackle this really super relevant, interesting problem around jobs, finding jobs in a pandemic. So, Raj, how did you sort of go about doing this?

Rajwinder:

I think, right at the start, we had a couple of meetings with the wider team, kind of thinking about what the aim was of the project, because, of course, lots of people had lost their jobs, unfortunately. And there was a lot of push to get people to kind of use their skills to apply to jobs that were currently in demand. And that's what they were kind of focusing on from the company's perspective, that they wanted people to be applying to things that they had the skills for, but maybe they hadn't thought of applying to. But those were currently in demand. So they would have the opportunity to go to those.

Jeremy:

So it wasn't jobs that they might have done previously, necessarily. It could have been roles that match their skill set, but weren't ones they might have considered before. Is that right?

Rajwinder:

Yeah. So I think there was a lot of discussion about how the current applying to jobs process doesn't really take into account your skills, necessarily, you're kind of just looking for a job title and applying to those. However, we kind of want something that focuses on people picking out their skills, their best skills, and kind of searching for those. And I think that was what kind of came out of those design thinking workshops we had at the start.

Jeremy:

So Dehaja, in tackling the problem, what sort of avenues did you go down? You've got this really interesting approach of not doing a sort of traditional CV scan, I guess, on an individual applicant. What was the pipeline on this, for how you thought you might tackle this initially?

Dehaja:

So we first kind of looked down the avenue of having kind of sets of skills, and those were related to certain job titles. And then we looked at how we can match skills that a user enters to the skills related to each of those job titles. So you could have skills that vary across many industries. And that's kind of like the whole aim of the project. So, as Raj has mentioned, it's to try and see where maybe the gaps are that you haven't thought of. And, yeah, so in order to do that, we went down kind of the matching route. I don't know how technical to go, but we kind of created matrices of the skills and job titles, and we matched those together. And it was really interesting for me. I've never worked with natural language processing, NLP, before. So that was new and, you know, tokenizing, and stemming, and kind of the differences between lemmatization and stemming. And we had quite a few interesting conversations about the benefits of both of those, and then we ended up looking towards kind of creating a network or clusters of job titles. And yeah, that's kind of where we got up to, but I think I'll hand over to Erika to talk about the network of job titles and clustering because she focused on that.
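
The matching route described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual pipeline: the job titles and skill strings in `JOB_SKILLS` are invented, `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer or lemmatizer, and the score (the share of a job's skill terms that the user covers) is one plausible choice among many.

```python
import re

# Hypothetical toy data: job title -> short free-text skill strings.
JOB_SKILLS = {
    "data analyst": ["analysing data", "python programming", "communication"],
    "care assistant": ["communication", "teamwork", "patient care"],
    "delivery driver": ["driving", "time management", "customer service"],
}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalise(text):
    """Tokenise a skill string into lowercase words and stem each one."""
    return {crude_stem(token) for token in re.findall(r"[a-z]+", text.lower())}


def build_matrix(job_skills):
    """Return the vocabulary and a binary job-by-term matrix (dict of rows)."""
    job_stems = {
        job: set().union(*(normalise(skill) for skill in skills))
        for job, skills in job_skills.items()
    }
    vocab = sorted(set().union(*job_stems.values()))
    matrix = {
        job: [1 if term in stems else 0 for term in vocab]
        for job, stems in job_stems.items()
    }
    return vocab, matrix


def rank_jobs(user_skills, vocab, matrix):
    """Score each job by the fraction of its skill terms the user covers."""
    user_stems = set().union(*(normalise(skill) for skill in user_skills))
    user_vec = [1 if term in user_stems else 0 for term in vocab]
    scores = []
    for job, row in matrix.items():
        overlap = sum(u * j for u, j in zip(user_vec, row))
        total = sum(row)
        scores.append((job, overlap / total if total else 0.0))
    # Highest score first; ties broken alphabetically by job title.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


vocab, matrix = build_matrix(JOB_SKILLS)
for job, score in rank_jobs(["Analysed data", "Python"], vocab, matrix):
    print(f"{job}: {score:.2f}")
```

Stemming is what lets "analysed" entered by a user line up with "analysing" in a job's skill string. A lemmatizer would map both to a dictionary form instead, which is the trade-off the team mentions discussing.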

Erika:

Yeah, I think everything that Dehaja said is essentially kind of the groundwork that went into it. And I think that without everything that happened in the beginning, in the first part of the MVP, the network approach wouldn't have come out of it. I think there was definitely a lot of interesting work and thought, like thinking out loud, like during discussions with the team about what parts of the process were working and what wasn't working. And I think a lot of it came down to the data that we had available. I think we had an idea of what data we were gonna get. And then the more we were looking into the data we actually had on hand, we realised that what we initially set ourselves out to do wasn't gonna work as well as we thought it was. So I think, yeah, it's definitely interesting, you know, for whomever might be interested: instead of having a corpus of like a job description and a job title, we had very specific, almost like, you know, single string inputs with just a few words within them, that were representative of the skill sets, the skill strings, I suppose. And those were connected to job titles, but it was a very different format from the way that one could think about the kind of job search and the job data around. So when we actually ended up looking into the network approach, that was a way of trying to extract as much information as possible from the data that we had. So it was a very, very data specific approach. And it was very interesting, it kind of led to this idea of being able to think about job titles with respect to their sets of skills, and use a set of skills to create connections between the job titles, and these percentage matches across job titles were created by looking at the sets of skills. So it was a very back and forth approach between the two elements, I suppose. And this is kind of where the network came out.
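
A minimal sketch of that network idea follows: job titles are nodes, and two titles are linked when their skill sets overlap enough. The interview does not say how the percentage match was computed, so the Jaccard overlap used here is an assumption, as are the example titles, the skills and the 15% threshold.

```python
from itertools import combinations

# Hypothetical toy data: job title -> set of skill strings.
TITLE_SKILLS = {
    "barista": {"customer service", "cash handling", "teamwork"},
    "retail assistant": {"customer service", "cash handling", "stock control"},
    "warehouse operative": {"stock control", "teamwork", "manual handling"},
    "software tester": {"attention to detail", "python"},
}


def percentage_match(skills_a, skills_b):
    """Shared skills as a percentage of all skills in either set (Jaccard)."""
    union = skills_a | skills_b
    return 100.0 * len(skills_a & skills_b) / len(union) if union else 0.0


def build_network(title_skills, threshold=15.0):
    """Return an adjacency dict: title -> {neighbouring title: % match}."""
    graph = {title: {} for title in title_skills}
    for a, b in combinations(sorted(title_skills), 2):
        match = percentage_match(title_skills[a], title_skills[b])
        if match >= threshold:
            graph[a][b] = match
            graph[b][a] = match
    return graph


network = build_network(TITLE_SKILLS)
for title, neighbours in network.items():
    print(title, "->", neighbours)
```

Walking the neighbours of a title someone has held before is one way such a graph could surface roles in other industries that need a similar set of skills.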

Jeremy:

I think it's amazing. The idea that you're not doing something fairly traditional around this, you're not doing a sort of search for a particular title, because that's exactly what people would have done previously. You're trying to encode the sort of serendipitous nature of discovering a job that you are qualified for, but you didn't know you were qualified for before you started searching. So how did that play out? And what does the user journey sort of look like for that?

Rajwinder:

So we got to the approach where a user would input the skills that they think they have, but we also wanted to be able to include skills that maybe they hadn't directly mentioned, but were related to the skills they'd input, because I think people can undersell themselves, and maybe not be completely specific about all the skills they have. So let's say, yeah, let's say they're collaborative. But they're also good at working in groups and such. So we wanted to be able to kind of have as many skills as possible to search for. So I think one of the more useful ways to do that was we've kind of used job titles that were exactly the same, and grouped together their skills. So you would assume that if the job title is the same, the kind of skills you'd need for them were also the same. But if they were written in, say, a different manner, or used different wordings, you'd be able to capture that by aggregating those together. And I think we wanted to be able to match as many people to the correct jobs as we could.
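
The aggregation step described here, pooling the skills of postings that share a job title, can be sketched as below. The postings are invented, and the only title clean-up (lowercasing and collapsing whitespace) is an assumption about what counted as "exactly the same" title.

```python
from collections import defaultdict

# Hypothetical postings: (job title as written, skills listed for that posting).
POSTINGS = [
    ("Project Manager", ["team working", "planning"]),
    ("project manager ", ["collaboration", "planning"]),
    ("Nurse", ["patient care"]),
]


def aggregate_skills(postings):
    """Union the skills of every posting that shares the same job title."""
    grouped = defaultdict(set)
    for title, skills in postings:
        key = " ".join(title.lower().split())  # lowercase, collapse whitespace
        grouped[key].update(skill.strip().lower() for skill in skills)
    return dict(grouped)


print(aggregate_skills(POSTINGS))
```

Because "team working" and "collaboration" now sit under the same title, a user who enters only one of the two wordings can still be matched to that job.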

Jeremy:

I see. So the process was one of learning from the sort of collection of jobs and their associated skills as to what those likely patterns would be, not for one job, but for many jobs of hopefully a similar type. But, interestingly, you mentioned that there was a concern over how people might describe themselves when they're entering the skills. And I understand you were working on some of the bias elements in there. Was there a concern that there might be a bias even at the very start of that process, in how someone might describe their own skill set?

Dehaja:

Yeah, so you've kind of hit the nail on the head there. So that's where we ended up seeing where, I guess, the majority of bias could come from. So if I described myself as, say, self-assured, but maybe not confident, those two skills are quite similar, but maybe one would be captured more than the other. And so we did a bit of investigation into bias in the data set. And interestingly, there was a significant amount of bias, in the sense that we used a list of masculine words and a list of feminine words, and we explored the data set for the occurrences of each of those words. And the occurrences of those words were reasonably high. So it was something that we were happy that we investigated, just to be aware of the levels of bias in the data set. But also, we've kind of thought about what we can do to reduce the effect of the bias, even on the user. And one of the main things that's good about the tool is that it kind of captures these additional skills in the process, which could reduce the level of bias associated with someone inputting certain skills that may not be written in the job descriptions themselves.
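
A check along these lines can be sketched as a simple word count. The two word lists below are short, invented placeholders, not the lists the team used, and the sample texts are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists; real lists would be much longer.
MASCULINE_CODED = {"confident", "competitive", "assertive"}
FEMININE_CODED = {"collaborative", "supportive", "understanding"}


def count_coded_words(texts):
    """Count how often words from each list occur across the given texts."""
    counts = Counter({"masculine": 0, "feminine": 0})
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in MASCULINE_CODED:
                counts["masculine"] += 1
            elif token in FEMININE_CODED:
                counts["feminine"] += 1
    return counts


sample = [
    "Confident and competitive self-starter",
    "Supportive, collaborative team member",
    "Confident communicator",
]
print(count_coded_words(sample))
```

Counts like these only flag how often such wording appears in the data. Deciding what, if anything, to do about it is a separate judgement.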

Jeremy:

So Erika, what would you say then, for the team or yourself personally, maybe, was the biggest challenge in the project?

Erika:

I think there were a lot of ideas at all times, because there were so many discussions with wider teams and people having a lot of input. And it was really hard, I think, to start to try and narrow down what was feasible and what wasn't feasible, at least for the sake of this first sprint. And I think that was definitely challenging from my perspective, because I was always used to maybe working at university or with much smaller teams, so the influx of ideas was definitely more manageable. But it was a real challenge. And I think we did a great job by keeping having lots of calls and discussions and thinking out loud about what we thought would have been the best approach forward in the limited amount of time that we had. And that was a real, new type of difficulty. And I think we did a really good job with it.

Jeremy:

That's nice. So I'd like to finish with the same question to each of you. And it was, if you could give advice to somebody who was maybe in your shoes six months ago, wanting to get into coding, wanting to get into data science particularly, what advice would you give them?

Rajwinder:

Oh, I think the best way is to kind of get stuck in. You can't wait for stuff like this, you've just got to try something. There's lots of resources online. I know Kaggle is a really good way to get started in data science. And with data science it's always good to start with asking the question, then figure out how to answer that question, where you could get data for that question, whether there's stuff already out there. And I think, also, I'd like to plug Code First Girls, which is a great opportunity for girls to get involved in learning Python and SQL. I know me and Dehaja took part, and it's a really good opportunity to do so.

Jeremy:

Great. Dehaja, what about yourself? What advice would you give?

Dehaja:

I think I completely echo what Raj has just said. Definitely, if you're looking to get started, Code First Girls is a great opportunity to do so. I think also, what I would add is trying to build a little bit of a network with maybe other people who are learning, because for me, it's been really helpful to have people to ask questions to but also talk through like problems. And I think you just learn so much from talking to other people as well. So yeah, that'd be my one bit of advice for people getting into coding.

Jeremy:

Brilliant. Erika?

Erika:

Aside from the brilliant answers that already came through, I did have one more thing to add, I think, personally, which is to not be too afraid of not knowing. Don't be afraid to ask questions, any questions about anything that you might be stuck on. And just don't be afraid to, you know, think outside the box. Because I think data science especially is such a great tool to try things out in a very, like, nice and, you know, simple way. Just keep the problem simple in your head and play around with it.

Jeremy:

That's great advice. I love that. That's been fantastic. I believe the tool's gone live. So we will put a link to the live job search tool in the show notes and people can have a play with that and see what you've had a go at. So I think that's been a really nice discussion. I'd like to thank all of you, Dehaja, Erika, Raj. Thank you very much for joining us today in the DataCafe, it's been great. Thank you for having us. Thank you.

Jason:

The product that they've built really inspires me, because they've straightaway got a skills-based job application process for searching for jobs, in a way I've never heard of before, which is really cool, like, really innovative. Like, what is my skill set? And now I will use that to go and find a job. Whereas every approach, certainly, that I've done is: I want to be an [insert blank here], right, now I go and read what that is by other people's definitions, and try and almost sandwich what I think my skill set is into what has been presented in a job description.

Jeremy:

Yeah, I should declare an interest myself, because I was involved with this project. And working with Erika and Dehaja and Raj was really exciting, and was enormous fun. And I think they've done a lovely job with this. And the fact that they were thinking about it in a, you know, I think Erika said, in a flexible way, to allow people to, you know, move between jobs that they might genuinely be totally skilled for. They may need no extra training to do them, or maybe a tiny bit of extra training or education, you know, as everyone does online courses these days. So that's certainly well within people's capability. And I think thinking of the problem in a different way gives you so much more flexibility to really sort of start again with a tool like that and to go, right, well, if we're going to think of it in those terms, if we're going to start with skills, and we're going to look at how jobs define themselves by those skills, then we can make some really interesting assertions and suggestions and give people hopefully some really useful ideas for what they might otherwise have not considered and not thought about.

Jason:

Exactly. Yeah. And how you lay out those skills is, it's not so much about the job title, that's always a starting point. And another point they made that was really interesting to me was this difficulty in self-evaluating. So yes, how do I know what skills I have? Because I'm very biassed, you know, purely living my life in whatever skill set I have, and what is my norm means I'm not necessarily calling out what is a skill, because it's just a normal behaviour to me, learned or otherwise. And what is something that I want to apply in a job? You know, so a very clear like distinction would be having confidence in public speaking, which can be learned or some people are naturally good at it, versus having the ability to code, and code in a certain language and code in a certain framework. And that has to be learned. I don't know many people who are just, you know, prodigies straightaway at coding.

Jeremy:

Yeah, they set themselves a genuinely challenging task here, because they first came to realise that jobs aren't described, even in the title, in the same way. And skills aren't described by everybody in the same way. Yeah. So someone might say it's communication skills, someone might call it presentation skills, you know, so there's lots of ways of talking about these skills, and jobs as well. And then, as you said, absolutely true, you've got to somehow marry that up to the way that an individual applicant might describe themselves, and that bias in there. Right, right, and then you've got that difficulty of your own sort of preconceptions about your own skills, be that ambitious or unambitious, really. And I think that, you know, all of these are great, considerable challenges for a project like that, and there's lots of hurdles at which you could fall early on, trying to tackle that problem and trying to make it, you know, all connect together when you've got so many disparities there. There is no common lexicon. Yeah, there is no taxonomy that's officially used. I think they've done themselves proud with this kind of project.

Jason:

Yeah. And even as you said there, that is such a challenge in the lexicon, and then they were applying natural language processing and programming techniques to try and turn what's so difficult, you know, with the human biases involved and the various datasets out there, into logic that can be modelled and trained and applied in certain ways. Such a challenge. Yeah, really. Yeah.

Jeremy:

I mean, we'll probably do another episode on bias because it's such a hot topic and one that deserves really close sort of investigation. But then, I mean, there's just a couple of instances in their project, I think, that they had to tackle. One was the bias, as we said, from the individual who was describing their own skills. And another would be the bias of the person writing the job specifications, saying, well, in my mind's eye, I'm seeing someone who has these skills, when, of course, in their mind's eye, they might be seeing someone who was maybe a man, maybe in their mid 30s, maybe, you know, all of that kind of thing going on. They wouldn't put that down. But

Jason:

Yeah, exactly. Put 'scientist' in the title and they're wearing a white coat. Yes.

Jeremy:

So I think that there's a lot to get your teeth into if you're engaging in this kind of project. I think it shows that AI for good and data science for good doesn't mean it's a straightforward problem. In fact, actually quite the opposite here: the data is not necessarily there, or if it is, it can be in a very poor state sometimes. And it can mean that you, you know, you don't have recourse to go back to the person who gathered it and say, well, can you do it better? Or can you do it differently? Or can you give us another dataset which shows us this? Because that just may not exist. You literally have to work with what you have, and try and get something from it. So you have to be quite careful and quite sort of disciplined with yourself. But I think, you know, these are really quite tough data science challenges.

Jason:

And we can see how it will get more attention as the, you know, prototype is out there, and people start using it and start thinking in this way, you know, that entrepreneurial shift in mindset of what is it that we can build and use. And it's built on people's experience. You know, they talked about the pandemic driving a need for something like this, because I need to understand what my skill set is to match with what the needs are in whoever is looking to hire. You know, whether they know it or not, they may not have classified it right. So a really, really great initiative.

Jeremy:

And also it does throw together people who would otherwise never have, have worked together and never have met under any other circumstances. So it is, it does create these nice serendipitous cross network matches that you would never, you would never otherwise have entertained or made happen.

Jason:

This is great, because I love hearing about the bringing together of different initiatives like this. So, you know, I'm wondering, if somebody hears about this for the first time, how can they get involved?

Jeremy:

I mean, lots of ways. There's hackathons going on as part of data science drives all over the world. So I'd encourage you to, you know, have a look online, and, you know, don't be afraid to have a go, really. You don't have to be a massively experienced data scientist, in fact, quite the opposite. You just need to have some good ideas and some willingness to contribute and give your time. And, you know, a bit of technical skill maybe, to just help bring a team of people together. And when I talk to people about this, if it hasn't worked, it's always because there wasn't quite enough time to do stuff. And I think, you know, it's just having a set of people with a common goal, who are prepared to give it, whether it's a day, or half a day, or a week, or whatever, however long it lasts for, but just be prepared to give that time. So I mean, if you're a woman and you're looking to get into this, and you've never coded before, like, you know, they mentioned it in the interview, then Code First Girls in the UK is a fantastic organisation that helps women get into coding, and data science now, I see, is very much on their radar. So I wholly encourage people to look at that. But there's loads of boot camps and organisations who run sort of quickstart courses on this sort of thing, which would enable you to get into this.

Jason:

Yeah, really trying to reach out and build that network was part of the advice that they gave, which is really key, just to reiterate. It seems like nowadays as well, there's a lot of access online. So you don't even need to be situated in a tech hub, you know, like being in the centre of London, for example. You can be anywhere and get involved across virtual events, which is another kind of one of the side effects of the pandemic, because we've seen this increasing accessibility to things like this.

Jeremy:

Yeah, yeah, that's absolutely true. I don't think Erika, Raj and Dehaja ever actually met in person. Well, yeah, I think they've only ever met over Zoom or whatever. So

Jason:

It kind of makes it a little more challenging as well when you think about it

Jeremy:

As Dehaja said, you learn so much from talking with people, maybe from different backgrounds, that you haven't met before, who are not in your network. And then suddenly you're thrown together. And I thought that was a really nice way of sort of demonstrating the power of these, you know, social events and social hackathons and data science for good in general. So I think that was a really nice part of their story. But the thing that stood out for me, and I think it applies just as much in data science in industry as it does in this setting, is that Erika, at the end, said her piece of advice was that you shouldn't be afraid of not knowing and you shouldn't be afraid of asking questions. I think that's terrific advice for any aspiring data scientist.

Jason:

Thanks for joining us today at the DataCafe. You can like and review this on iTunes or your preferred podcast provider. Or if you'd like to get in touch, you can email us, Jason at datacafe.uk or Jeremy at datacafe.uk, or on Twitter @DataCafePodcast. We'd love to hear your suggestions for future episodes.