Mining the IMDB Database - summary statistics

This past spring I was in a data mining class, and I spent some time working with the IMDB database. It is freely available, and there are some nifty tools for working with it. The goal of my data mining project was to develop a decision tree or a rule-based classifier that could predict a movie’s average user rating from its other attributes. Ideally, we would end up with a decision tree that said something like “if a movie is made by Quentin Tarantino and stars Brad Pitt, then it will have an average user rating of 9/10.” The decision tree turned out to be quite a bit more complicated and less insightful than we were hoping, but in the process of building it we came across some interesting findings, which this post summarizes.
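
For a rough sense of the setup, here is a minimal sketch of how such a tree could be fit with scikit-learn. The movies.csv file and the column names (genre, director, mpaa_rating, user_rating) are hypothetical stand-ins for whatever you extract from the IMDB data, not the exact pipeline we used:

    # Sketch only: assumes a flat CSV export with one row per movie.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    movies = pd.read_csv("movies.csv")  # hypothetical export of the IMDB data

    # One-hot encode the categorical attributes we want the tree to split on.
    features = pd.get_dummies(movies[["genre", "director", "mpaa_rating"]])
    target = movies["user_rating"]

    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=0
    )

    # A shallow tree keeps the resulting rules human-readable.
    tree = DecisionTreeRegressor(max_depth=4)
    tree.fit(X_train, y_train)
    print("R^2 on held-out movies:", tree.score(X_test, y_test))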

Note: the data set was filtered to only include movies that meet the following constraints:

Genres

When building a rule-based classifier, rules similar to this kept showing up:

if genre = 'Drama' then user_rating > 7
if genre = 'Horror' then user_rating < 5

This got us thinking that perhaps movies of certain genres just tend to be better. The following table seems to support that idea.

Genre          Average User Rating
Documentary    6.3
Animation      6.0
Western        5.9
Romance        5.9
Drama          5.7
Crime          5.7
Mystery        5.7
Adventure      5.6
Family         5.5
Comedy         5.5
Fantasy        5.4
Thriller       5.1
Action         5.1
Sci-Fi         4.9
Horror         4.4

So it looks like either there are a lot of crappy horror movies out there, or people just don’t like horror movies as much as they like documentaries. It also happens that documentaries are on average the cheapest genre of movie to produce, as the chart below shows.
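
For what it’s worth, the per-genre averages above boil down to a single aggregation. Here is a sketch, again assuming a hypothetical movies DataFrame with genre and user_rating columns (a movie with several genres contributes one row per genre):

    # Sketch: average user rating per genre, best to worst.
    genre_ratings = (
        movies.groupby("genre")["user_rating"]
        .mean()
        .round(1)
        .sort_values(ascending=False)
    )
    print(genre_ratings)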

Animated movies also rate very well on average, but their high cost is prohibitive. They make up only 1% of the US movies in the IMDB database:

MPAA ratings

One of the data attributes that lent itself really well to data mining was the MPAA rating attribute. This attribute looked something like:

Rated R for strong violence, language and some sexual content/nudity.

The reason these ratings worked so well for data mining is that the MPAA likes to use certain keywords in its ratings: language, drugs, sex, violence, and nudity. So the rating above can be broken down into a series of booleans indicating that the movie contains violence, language, sex, and nudity, but no drugs. I was kinda expecting to find some dark statistic like “movies with violence and drugs tend to be rated higher than those without,” but it turned out that MPAA ratings weren’t a very good predictor of user ratings. In fact, the presence of any of those keywords had a slight negative correlation with the user rating attribute! Maybe there’s hope yet for humanity.
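
As a rough illustration of that preprocessing step, here is how the reason text could be turned into boolean flags. The keyword list and function name are my own assumptions, not anything official from the MPAA:

    # Sketch: turn an MPAA reason string into keyword flags.
    KEYWORDS = ["language", "drugs", "sex", "violence", "nudity"]  # assumed keyword list

    def mpaa_flags(reason):
        """Return {keyword: True/False} for each keyword found in the rating text."""
        reason = reason.lower()
        return {kw: kw in reason for kw in KEYWORDS}

    print(mpaa_flags("Rated R for strong violence, language and some sexual content/nudity."))
    # {'language': True, 'drugs': False, 'sex': True, 'violence': True, 'nudity': True}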

So the MPAA ratings weren’t very useful for predicting user ratings, but they are still useful for looking at what kind of obscenities are filling our movies:

We can also look at rating distributions:

And how rating distributions have changed over the past 20 years:

I was surprised to see that the percentage of ‘R’-rated movies being produced has declined over the past 20 years. I wonder if the MPAA is just getting less sensitive, or if Hollywood really is becoming more PG, as the chart above suggests.
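
The year-by-year breakdown behind that chart is just a normalized cross-tabulation. A sketch, assuming the same hypothetical movies DataFrame also carries year and mpaa_rating columns:

    # Sketch: share of each MPAA rating among movies released each year.
    import pandas as pd

    rating_by_year = pd.crosstab(
        movies["year"], movies["mpaa_rating"], normalize="index"
    )
    # Last 20 years of data, share of 'R' rated movies per year.
    print(rating_by_year.tail(20)["R"])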

Top Directors, Actors, and Actresses

While MPAA ratings and genres are interesting to look at, the best predictors of how good a movie will be ought to be the people who made it. Here are some people who consistently make great movies:

Note: only includes directors with at least 3 movies, and actors or actresses with at least 20 movies. (A sketch of how these lists are computed follows the tables.)

Director                Average user rating    Number of movies
Nolan, Christopher      8.41                   7
Unkrich, Lee            8.13                   4
Darabont, Frank         7.98                   4
Jackson, Peter          7.94                   7
Kubrick, Stanley        7.88                   13
Aronofsky, Darren       7.86                   5
Tarantino, Quentin      7.84                   13
Bird, Brad              7.83                   4
Wright, Edgar           7.78                   4
Vaughn, Matthew         7.77                   3
Fincher, David          7.76                   9
Affleck, Ben            7.73                   3
Ritchie, Guy            7.65                   4
Sharpsteen, Ben         7.65                   4

Actor                   Average user rating    Number of movies
Hitchcock, Alfred       7.61                   26
Chaplin, Charles        7.33                   30
Serkis, Andy            7.27                   20
Lloyd, Harold           7.18                   21
Bale, Christian         7.11                   28
DiCaprio, Leonardo      7.10                   27
Parker, Trey            7.07                   35
Eckhardt, Oliver        7.00                   25
Rickman, Alan           6.98                   32
Norton, Edward          6.97                   28
Fishman, Duke           6.97                   20
Howard, Art             6.97                   26
Guinness, Alec          6.95                   22
Goelz, Dave             6.95                   44
Stone, Matt             6.95                   23

Actress                 Average user rating    Number of movies
Bonham Carter, Helena   7.17                   26
Swanson, Gloria         7.08                   29
McGowan, Mickie         7.06                   29
Blanchett, Cate         7.06                   29
Bergman, Mary Kay       7.06                   20
Lynn, Sherry            7.03                   32
Cooper, Gladys          6.98                   30
Harlow, Jean            6.96                   29
Davies, Marion          6.95                   41
Dietrich, Marlene       6.95                   33
Hepburn, Audrey         6.94                   21
Plowright, Hilda        6.91                   46
Gale, Norah             6.90                   20
Winslet, Kate           6.89                   21
Michelson, Esther       6.88                   23
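
These leaderboards come from the same kind of aggregation as the genre table, plus a minimum-count filter. A sketch, assuming a hypothetical credits DataFrame with one row per (person, movie) pair and person, role, and user_rating columns:

    # Sketch: average rating per person, keeping only people with enough movies.
    def top_people(credits, role, min_movies):
        ratings = credits[credits["role"] == role].groupby("person")["user_rating"]
        stats = ratings.agg(["mean", "count"])
        stats = stats[stats["count"] >= min_movies]
        return stats.sort_values("mean", ascending=False)

    print(top_people(credits, "director", min_movies=3).head(15))
    print(top_people(credits, "actor", min_movies=20).head(15))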

Well, that gives a quick snapshot of some of our findings. It would be cool to be able to pinpoint the exact ingredients that make a great movie, but if that were possible, then making movies would be more of a science than an art.