Data + Curiosity: Move over, Ames – it’s Palmer Penguin time!

What is the Palmer Penguins data set?

The Palmer Penguins dataset comprises 344 rows, with each row representing a penguin belonging to one of three species: Adelie, Gentoo, or Chinstrap. Hidden within this dataset are features that make it an absolute delight to use when teaching introductory data science and machine learning courses, regardless of programming language. In this episode of Data + Curiosity, we talk with two of the three package creators, Alison Presmanes-Hill and Allison Horst, about how this dataset came to be, why they think it's become so popular, and their advice on collaboration.

This post is an excerpt from the full interview and has been edited for length and clarity - you can watch the full video below!

Why penguins?

ALLISON HORST: I'm going to quote Kristen Gorman here, who is the original collector and first author on that paper of the Penguins data, who says, “Everybody loves penguins.” They're a pretty non-controversial subject for a dataset. And I also think that they are charismatic and have pretty intuitive pieces, like flippers. And I think it kind of lent itself to a pretty friendly dataset.

ALISON HILL: There's a fun fact, though, about that. We have a paper published in the R Journal and one of the reviewers did comment that we needed to define what a flipper was. And so I think if you read our paper, there's like this hilarious clarification which was in response to reviewer number two, which was like, flippers are the appendages that penguins use.

Did you expect the Palmer Penguins package to be so popular?

ALLISON HORST: I have been really surprised by the adoption of the penguins. And I mean, it's wild. Like I see penguins everywhere, everywhere! Including in my dreams. But every year in the data science community it’s in R and Python and JavaScript and it's really taking on a life of its own, which I think is really cool. I have sometimes thought, “is the thing that we’re going to be known for years from now this, you know, 300-row dataset about penguins?”

ALISON HILL: We joked about that because Allison and I both had like scientific research careers, and I think Palmer Penguins might be our most cited paper or research product ever, and possibly the most impactful too in a funny way. Like if you really think about the number of people who are learning now with Palmer Penguins – and I mean we prepared this talk for the useR! conference this year, and we were kind of reflecting on it, and I got kind of gooey.

I was like, you know, it's kind of weird to think that if my daughter goes to school and starts learning data science, it would probably be with the Palmer Penguins data package. Like everywhere I go, every tutorial I see uses it, and I'm like, “this is crazy!”

What do you look for when choosing a dataset for teaching?

ALLISON HORST: Real-world data is pretty valuable for teaching applied data science, and it has relevance or interest for learners. It's something they're kind of intuitively like “oh, okay. I can understand why this would be interesting to explore these things,” with pretty intuitive features. Like the variables don't take a whole lecture to explain what those mean and why you would care about them, especially for data science courses where you want to get into the wrangling.

Like you can do data wrangling with it, you can do a lot of data visualization, but you can also do clustering and regression with it. And there's a Simpson's paradox example, actually a few of them, which we love so much. So I think this one in particular is kind of unique in how many different– especially intro level data science methods you would want to teach with it, so it makes it really wildly relevant across a range of intro data science courses.
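
For readers who haven't met Simpson's paradox before, here is a minimal sketch of the idea in Python. The numbers are invented for illustration and are not measurements from the Palmer Penguins data; the group labels "A" and "B" and the helper names are likewise hypothetical. The sketch shows a trend that points one way inside each group and the opposite way once the groups are pooled.

```python
from collections import defaultdict


def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var


# Invented toy rows (group, x, y); NOT real penguin measurements.
rows = [
    ("A", 1.0, 10.0), ("A", 2.0, 11.0), ("A", 3.0, 12.0),
    ("B", 6.0, 3.0), ("B", 7.0, 4.0), ("B", 8.0, 5.0),
]


def slopes_by_group(rows):
    """Fit a separate slope for each group label."""
    groups = defaultdict(list)
    for group, x, y in rows:
        groups[group].append((x, y))
    return {group: slope(pts) for group, pts in groups.items()}


# Pooled fit ignores the group label entirely.
overall = slope([(x, y) for _, x, y in rows])
within = slopes_by_group(rows)

print("overall slope is negative:", overall < 0)
for group, s in sorted(within.items()):
    print(f"slope within group {group} is positive:", s > 0)
```

With real data, the same comparison would be made by fitting one trend across all penguins and then one trend per species.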

ALISON HILL: The two things that stick out to me are like meaningful rows. Each row is a penguin, you can talk about it in plain language, and it's easy, and that's kind of nice. Like sometimes you get genetic datasets or something like that and it's kind of hard to sit there and talk through with somebody who's trying to learn.

So something where there's meaningful entities in each row, like a person, a penguin or something that you can talk about and kind of latch on to because then it's easier to use words to describe the things that you're seeing. Like if you're doing a scatterplot, every dot is a penguin.

And then I come at it from teaching databases for many years and honestly being able to see a dataset. It's actually really hard in the wild to find a dataset that has three or more, but not too many, factor levels to be able to teach like coloring or shapes or things like that for the geoms that you're plotting.

In closing

We’d love to hear your thoughts on this interview, how you’ve used the Palmer Penguins dataset, and what you’re curious about! Let us know in the video comments – we can’t wait to hear from you!