In this episode of Data + Curiosity, I had the good fortune of chatting with Caitlin Hudon about exploratory data analysis as a means of improving model quality, delegation as a superpower, and being a data science generalist. As with every conversation I have with Caitlin, I learned so much, and am excited to share the conversation with you here:
You can also read a lightly edited transcript of our conversation below:
CAITLIN HUDON: I've been thinking about this a lot lately. I haven't talked about it yet, but design patterns for analysis is an area that's very interesting to me because I feel like we do the same things. Like I'll do an aggregation on a population, and then I'll pull in some top-level numbers. And so it accordions out at different points based on what information you're joining at what level of aggregation.
JESSE MOSTIPAK: OK.
CAITLIN: Do you know what I mean?
JESSE: Wait, I want you to say more because when someone says design, I immediately just go to typography and branding and graphic design, which I think is related, but I think you're talking more like systems design?
CAITLIN: Sort of. So design patterns will be the way that you build software or something like-- when you start to really get into it, there are patterns that people follow in the way that they build things. Almost like the way that certain buildings are similar and you can see like, oh, that's a skyscraper and it must have these internal components that hold it up together. And so it's like architecture.
JESSE: Yeah. So, like-- you're so sweet for being like, kind of. No. I was like over here and you're like-- no. So almost like a blueprint for building something. Like--
CAITLIN: Yeah, but typography is a system. Like in typography, you have your uppercase and your lowercase and your punctuation and it all looks the same way, in the same way that-- if you've ever seen like a design library, you're using the same components. So that idea is like-- you're on to something with that idea.
And so I am very curious about basically components of an analysis. Like in the sense that ggplot has this language where you layer in all of these things. I think there might be something there with analysis. And just-- people do the same analyses over and over again. Like you always see people doing the same dplyr commands. And so it's like, OK, that could be abstracted somehow.
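As a rough illustration of the pattern Caitlin describes (aggregate a population by group, then pull in a top-level number), here is a minimal Python sketch. It is not dplyr code and not something from the conversation; the function name `summarize_with_totals` and the sample `enrollments` data are hypothetical, chosen only to show how such a repeated step might be wrapped into a reusable component.

```python
from collections import defaultdict


def summarize_with_totals(rows, group_key, value_key):
    """Sum value_key per group, then attach the overall total and each group's share."""
    group_sums = defaultdict(float)
    for row in rows:
        group_sums[row[group_key]] += row[value_key]

    # The "top-level number" that gets joined back onto every group row.
    overall = sum(group_sums.values())

    return [
        {
            group_key: group,
            "group_total": total,
            "overall_total": overall,
            "share": total / overall if overall else 0.0,
        }
        for group, total in sorted(group_sums.items())
    ]


# Hypothetical example data, for illustration only.
enrollments = [
    {"campus": "north", "students": 120},
    {"campus": "north", "students": 80},
    {"campus": "south", "students": 200},
]
summary = summarize_with_totals(enrollments, "campus", "students")
```

The same function could be pointed at a different grouping column or measure, which is roughly the kind of abstraction being discussed.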
JESSE: Yes! OK, so it's like, yes, I could see that. So I think we tend to currently think of it as like this cycle. Like you collect your data, you wrangle your data, and then you go through the explore and model cycle and that iteration, and then you come out in communication.
But you're talking more about that granular level of when I'm doing EDA or when I'm doing wrangling. Here, it's almost like looking at the outcome. I'm trying to do X, and then these are the functions or the code that pulls into that.
CAITLIN: Yeah. Yeah, like I think to build a data set functionally, you're doing the same things over and over and over again. And so I'm just thinking about how sometimes to build different kinds of data sets, like if you want to build a data set that has time slices, you're doing the same series of commands over and over again, but just slightly differently.
And then feature engineering is kind of similar. You're taking these same pieces and you're just doing the same transformations of them.
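To make the time-slice idea concrete, here is a minimal sketch, assuming a "time slice" means one row per entity per cutoff date with a feature computed from everything observed up to that cutoff. The function `build_time_slices`, the field names, and the sample events are all hypothetical and only meant to show how the same series of steps repeats for each slice.

```python
from datetime import date


def build_time_slices(events, cutoffs):
    """Return one row per (entity, cutoff) with the count of events on or before the cutoff."""
    entities = sorted({e["entity"] for e in events})
    rows = []
    for entity in entities:
        for cutoff in cutoffs:
            # Same transformation every time; only the cutoff changes.
            n = sum(
                1 for e in events
                if e["entity"] == entity and e["when"] <= cutoff
            )
            rows.append({"entity": entity, "as_of": cutoff, "events_to_date": n})
    return rows


# Hypothetical example data, for illustration only.
events = [
    {"entity": "a", "when": date(2023, 1, 5)},
    {"entity": "a", "when": date(2023, 2, 10)},
    {"entity": "b", "when": date(2023, 1, 20)},
]
cutoffs = [date(2023, 1, 31), date(2023, 2, 28)]
slices = build_time_slices(events, cutoffs)
```

Swapping the counting step for a different aggregate would give a different engineered feature while keeping the same overall shape.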
JESSE: Yeah. Where I could see like, yes, enthusiastically 100% on board, I think you're right. There's something there. I almost see it as this core-- you've got these design systems. And then where it becomes individualized is almost when you're in a domain-- level of domain expertise or specialty.
I'm thinking about education. When I was doing education data, there was a consistent, OK, I have a new data set, I know that I have to do these three things every single time no matter what, but we don't articulate that. We usually are like, oh, I just know. It's like making that stuff that you know you have to do explicit for other people.
CAITLIN: I've been thinking about a question I haven't posted on Twitter yet, but just like, what's the white whale of your career?
JESSE: Ooh! So like the elusive thing that you keep chasing? Do you have a white whale of your career?
CAITLIN: Like, I'm honing in on it, and it's around EDA. Because I think that EDA is the best thing that you can do to improve modeling. And that's the hill I'll die on-- I'm very willing to die on that hill. So yeah, it's thinking about what parts of EDA could we automate? How could we expand it? How can we get better at it? I'm super fascinated by that area.
JESSE: Yeah. Well, that was a conversation I had-- I was talking to Randy Au and we were talking about Census data. And he had made a comment about really just-- oh gosh, now I'm going to mess it all up. But really just taking a week to understand a couple of columns of data.
Something like that. I'll figure out the correct quote. But there is a sense like, when I was learning data science-- and this is-- like EDA was like, get it done. Get through it, figure some stuff out, and get it done. And as I progressed in my career, it really is this fun place. Like it is unconstrained idea generation.
CAITLIN: Yeah. And nonlinear, often.
JESSE: I was just going to say, you could FigJam-- like there's something in there about FigJam and EDA and being able to really pull on that and explore-- I'm just thinking about how fun it would be to have a data set that you never take to modeling and you just-- like the only thing you do is EDA. And that's the goal, is like what can you come up with and what can you learn?
CAITLIN: Yeah. I think there's-- my consulting time taught me that so much of the value of the modeling process is in communicating the relationships in the data to people who don't know them yet.
And so yes, the model is super valuable. Great. We have predictions, we can make business decisions, but also, the thinking around the model and just the thinking around your data is like almost more valuable for helping people to think about the way like their customers interact or whatever it is that they're trying to measure.
JESSE: Yeah. I was-- who was I talking to? We were talking about the hard problems in data science, and I maintain that the hardest problem in data science is people. And-- for like a thousand reasons. But I think about my time as a data scientist and it was never-- I thought it was hard like, oh, I would do technically challenging things and I would be so proud of myself and like, I did it! But people would come to me and be like, I don't know how to read this graph.
CAITLIN: Mmm. Mm-hmm.
JESSE: Oooh. Like that. That's where it falls down. Like I would make-- very early in my career, I remember pulling together this 100-page report for the Girl Scouts and I was like, it's so obvious that what we need to do is invest in ABC, and people were like, I'm not reading any of this.
CAITLIN: Yeah. It's like one deck, 10 slides, please.
JESSE: Yeah. But explaining that relationship is so important because I think that is a place where, as data scientists, it really-- it takes me a lot of work to realize like oh, just because I'm immersed in this world doesn't mean everybody else is immersed in this world and knows what I'm talking about.
Even like-- it is worth explaining when you show a bar chart, your statement like, oh, blank is why because of this bar being bigger and that's how we see it.
CAITLIN: Yeah. I think some of the hardest things in data science communication are exactly what you're saying with people. You want to meet them where they are, but that also means you have to figure out where they are. And that can be really difficult because everyone has different contexts.
And so I think when you're writing, transitioning is the hardest part of writing, like making sure that you're getting clearly from one idea to the next. And that also happens in data science, because you have all these insights or this output that you want to tie together and present to people, and so you have to bring them with you at each stage.
And that's the part like when I'm reading my own communication around data, it's like, I'm constantly going back and looking at the transitions and like, am I telling you why I'm looking at this next thing after I just presented this first thing and before I present the thing after? And I think it's really hard. Like even having done it forever, it's--
JESSE: But you have some cool projects. So you have so many projects. So I was going through your website and revisiting some of my favorite blog posts from you. And the one that I think of the most and that I've actually shared with quite a few people is your N equals 1 on motherhood and tech.
And you-- first of all, I saw your data that you have this beautiful data visualization where you look at your time before your baby, your first baby, and then you look at how your time was allocated for the months after. And that was the moment where I was like, I don't think I could be a parent. Like that is data-driven decision-making. Like that is not the life for me. But how-- so you are-- you're a mom of two now. And you have two dogs. Like, how has that graph changed?
CAITLIN: A lot. So I think there are too many axes to plot it in a clean way now because the time is also like time I'm spending with my son. So I have a baby who is eight months, and then my daughter's three.
And so yeah, things have changed even going from N equals 1 to N equals 2. Yeah, I'm not sure that I could plot it anymore without including my husband and the way that he's spending his time and the way that we have childcare and the childcare is covering time in the middle. And so yeah, it's a lot.
But it was really cool to make that visualization because I think there was so much-- the whole emphasis of that post is there's so much I didn't know before having a kid. And I asked a lot of questions, but there was still stuff that I just like wish I could have internalized a bit better.
And the time, like what do you do when you're on maternity leave? Like I really didn't know what those days looked like. I hadn't had anyone super close to me have a baby since I was an adult.
So yeah, I wanted to communicate that and I had a friend at work, and I was having this conversation with him about how-- there's a whole math around feeding babies that is just like-- it takes so much mental space, like the cognitive load of keeping track of all of that stuff is just a whole thing that I also didn't know was coming.
But I was describing to him how dropping a feed, and the calculus involved there, meant that I had like an extra, I think, an hour or two of free time. And he was just like, what? I'm trying to explain this in conversation and was like, let me show you. And so that is the reason for that particular bar chart, which shows my time before baby, and there are colors for like free time, going to work, taking care-- or not taking care of kids because I didn't have a baby yet, but all the things I was doing, and then what it looked like at different months afterwards.
JESSE: Yeah. So I do-- I want to take a little bit of a detour because in going through your blog, I found your delegation post-- that whole post is brilliant, but you have this thing about narrowing down your list, and I have it written down here in terms of delegation.
So when you delegate, you have these five questions. Does the work really need to be done? Which I was like, whoo! That's a whole thing to unpack. Could this work be automated? Which, of course. Have I learned all that I can from my time owning this work? Is there someone else who would benefit from the experience of owning this work? And do I want to be the subject matter expert in this area?
And that is just life advice, I feel, in so many ways. But I really want to talk about, have I learned all that I can from my time owning this work? What suggestions or how do you evaluate that?
CAITLIN: Yeah. I think those last few questions really go hand-in-hand. So a really specific example of something that I delegated around the time I wrote that post was we had an internal office hour.
So once a week we would set up a Zoom call, and for the first 20 to 30 minutes, it would be a presentation from someone on the data team, and then an open space for Q&A. If people are learning SQL or having questions about our data models, or wanting a quick data pull or someone to check their work, all of those things were fair game.
And so I set that up, got the program running, would coordinate the week-to-week who's talking, what topics are they doing. And this is something I've done before as an organizer of R-Ladies and all the ladies in tech. So I had a lot of experience putting together groups of talks and curating a schedule.
And so it was something where I felt like I'd reached the end of my learning journey with it. Like, I'm not getting a lot out of organizing this. I'm certainly not getting as much out of organizing this as someone would who hadn't had that experience yet.
And so I think thinking about, what are the things that you want to go deep in? Are there areas where you want to be known as an expert? And if so, make sure that you're trying to angle your responsibilities towards that goal or towards that area that you're interested in learning more about.
But if there are things that you're doing that are in areas that you aren't interested in or aren't serving you or you feel like you've learned really well, I think it's a good idea to consider letting them go if you have that opportunity.
Obviously we all have to do things sometimes that we don't want to do or love to do, but sometimes I think there's more opportunity to delegate responsibility than you might see at first glance.
Like if you really take a look at everything on your plate and say, do I need to do this? Am I the only one who has the personalized skills necessary to do this thing, and I couldn't train someone else to do it?
I think most often the answer is no, even if there are things where we want to maybe tell ourselves that we're the only one who knows how to do this. It's like, you can train someone, you can work with other people. And so I think thinking more about that was really eye-opening and helpful for me.
JESSE: Yeah. I'm just thinking about-- I think one of the hard questions that I still struggle with is, do I want to be the subject matter expert in this area? And that, I think, gets back to your white whale question. And is that something-- like do you think about that often? Like how often do you reflect about these kinds of things?
CAITLIN: I would say quite often.
And something that's tricky for me is I'm a generalist. And so within data science, I have experience-- across the board. So I've done a lot of analysis, I've done A/B testing, I've done machine learning engineering, and I love it all. And so that is a problem in the sense of I haven't picked one thing that I want to commit to forever.
So I think thinking about which parts of working in those different areas I maybe could automate and maybe building the automation is something that I'm interested in. Maybe that's something I haven't done yet. But I would say I think a lot about it.
And if there are things that I think-- like parts of those processes that I've done or that other people would like to own or would be better suited to own, then I'm happy to delegate those or work with other people to transition responsibility.
JESSE: Yeah, yeah. There's definitely a lot of self-reflection involved in that. And I think you are such a great example of a generalist-- a successful generalist in data science. Did you intend to become a generalist? Or were you like-- how did that come about?
CAITLIN: I think-- so my first four years were spent at a predictive analytics company. And so we built predictive analytics software and then we applied that software. And so I did-- my role was such a mix. It was a really small company, and so I wore a lot of hats, including writing blogs. Like it was that many hats.
JESSE: Oh my gosh.
CAITLIN: One of the things that I did in that role was consulting. And so taking problems from end to end, like hearing the stakeholders talk about what they wish they knew about-- we worked a lot with colleges and universities. So what they wish they knew about their students or what they wished they could predict.
And figuring out how do we formulate a data set to solve that problem, how do we build a model to solve that problem, and then how do we make sure that you could run the model and you understand the outputs? And we're getting buy-in in all of the steps.
So I think my first experience was really going end-to-end, and that is my happy place. So taking a problem and working to shepherd it through all of the data science steps and making sure that we're checking in and we're getting as much value as possible out of each step is something that's really important to me.
JESSE: Yeah, yeah. And it's a-- I think it's a harder skill to know that you want to do going into data science, and it's harder to train for it, I think especially now when we look at data science. You go through schools, you go through programs, and it's very-- it tends to be more technique-focused. But yeah, I think being a generalist is a lot of fun. I think it gives you a lot of exposure to so many different pieces.
CAITLIN: I love it. So that was one of the things about consulting that I really loved, is we got to play in everyone's backyard and really get to learn like, what data do you have? And oh, that's a really interesting thing. Like we could build some cool features out of that, let's try it.
As far as the education-- so I started doing data science in 2011. And things looked very different then. And when I joined this first company, it was started by a guy who was consulting and he wanted to automate some of the parts of the consulting process.
And so it's funny. He started with modeling and building something to automate building models. And then he was like, oh, we need to focus on the data cleaning and preparation because that's really the hard part that people are struggling with. And so I learned how to do that.
And the entire four years that I worked there, I worked under him. And so when I was ready to share my first model and my first analysis with clients, I'm like, OK, I have the meeting scheduled, I'm good to go. And he was like, can we review it? I'm like, yeah, of course. And he looked at it and just like-- he found all of these things that I had missed.
They were just subject matter stuff, but it was really great to learn from someone who had a ton of experience and I was learning in a way that was very applied and very hands-on. I got to be very deep in the data. And I wouldn't trade those four years for anything.
Especially now, I think something like code review is common on the analysis and modeling side, but also people don't have the time to understand your domain as well as you do. And so getting to work really closely in a couple of domains with someone who knew them really well just instilled a lot of really good best practices that I've taken everywhere else I've gone.
JESSE: Yeah. That's an incredible experience, yeah. So I have no way to bring you along on this mind map journey. This is just out of the blue. I would like to talk about the Cubs because I do live in the Chicago suburbs. I'm not going to get Chicagoans mad by saying I live in Chicago. I live outside Chicago.
But I had a call with Nick Wan and was like, how does someone get into sports? And he was like, do baseball. Like, baseball is a great sport. And then you are-- you're a Cubs fan.
CAITLIN: I'm a lifelong Cubs fan.
JESSE: Lifelong Cubs fan! OK. So you live in Texas now. Why the Cubs?
CAITLIN: So I, until I was 10, lived in the southwest suburbs of Chicago. So I grew up in Oak Lawn, and then also New Hampshire. But I grew up going to baseball games. And my dad actually would take us to both the White Sox games and Cubs games.
But tickets were really cheap for the Cubs, and so you could just buy tickets on like a random school day and go to a game and that was easy enough to do, and so we did it. And when I was a kid, that was when the Mark McGwire, Sammy Sosa era was happening. So we got to see at least one of those games.
And it was just a really fun time to be a fan even though they weren't doing well. That's OK, it's going to happen forever. But yeah, 2016 was like pretty amazing. That month, November 2016, the Cubs won the World Series and I got married. I was just like, this is amazing.
JESSE: Like two big things. Yeah, as a lifelong-- so I grew up in Buffalo, so I have yet to secure something like the Super Bowl or Stanley Cup in my lifetime. But I can imagine what that must be like if you have a team that gets close or is just never close. And yeah, and a wedding all in the same month, that's incredible.
CAITLIN: I cried. It was really amazing.
JESSE: Yeah. So, I mean, like this is-- Buffalo fans are huge. Like, we are diehard fans. Even if you don't follow football, like you're a Bills fan. It's just in-- it's in the air you breathe, it's in the water.
And I remember, I was in Dallas, Texas watching a Nick Offerman show, and he had the blue W hanging up behind him. And he had to explain to the audience that the Cubs were in the playoffs, and this was 2016, and that he was getting a live feed of game updates. And he would actually stop the show to like cheer for what was happening in the game. I was like--
CAITLIN: It was really-- it was such a fun time. And after so many years, like decades, lifetimes of people not seeing them do well, yeah, I impulse bought plane tickets to one of the away games because I was like, maybe we should-- this was like, I don't know, two weeks before my wedding, so we're like, we can't be doing this. But yeah, it was such an exciting cool thing to see.
JESSE: I love it. Yeah, when I was looking at apartments out here, Rogers Park was actually one of the places that I was looking to live because I was like, I could just like pop down to a baseball game and that could be fun.
CAITLIN: So I went to Loyola. And so I lived in Rogers Park for three years, yeah.
JESSE: I love Chicago, love the area, yes. Huge fan. So to close things out, though, where-- I mean, I could talk about the Cubs all day even though I don't follow the Cubs yet. But where on the internets is a good place to find you?
CAITLIN: Twitter. For as long as it's around. I started a Mastodon account, too, but I am committed to Twitter. So I will be there for as long as other people are there. And I hope that's forever. And then I have a blog at caitlinhudon.com. And I think those are the best places to find me.
Thanks for tuning in to our latest episode of Data + Curiosity! I’d love to hear your thoughts on this interview, your thoughts on exploratory data analysis and modeling, and what you’re curious about! Let me know in the video comments – I can’t wait to hear from you!
🥇 N=1 Motherhood in Tech - https://www.caitlinhudon.com/posts/motherhood-in-tech
🦸♀️ Delegation is a superpower - https://www.caitlinhudon.com/posts/delegation-is-a-superpower
🎓 Loyola University - https://www.luc.edu/
⚾️ Chicago Cubs - https://www.mlb.com/cubs
Follow Caitlin online
🦜 on Twitter - https://twitter.com/beeonaposy
✍️ on her Blog - https://www.caitlinhudon.com/