Welcome to our project analyzing the efficiency of production companies in the movie industry. Movies are a form of entertainment that has been around for over a century and have become an integral part of popular culture. The movie industry is a highly competitive market where producing highly rated and profitable movies is a challenging task. People go to the movies to escape reality, experience new worlds, and immerse themselves in stories that may be inspiring or bring out pure joy.
Creating a movie that is both easy to watch and successful is very difficult. It takes an extremely commited team of producers to be both creative and financially savvy. Production companies are under a great pressure to deliver high-quality content while also maximizing profits.
In this project, we will explore how different production companies use their budgets to create successful movies. We will look at the relationship between movie budgets, ratings, and profitability, with the aim of identifying trends and patterns in the budget of the top production companies. The datasets available on Kaggle provide an overview of various production companies' movie budgets, ratings, and profits, making it possible to conduct a thorough analysis of the movie industry's efficiency.
Our analysis will involve various data manipulation techniques and machine learning models to find insights into how different production companies can create a successful movie with a certain restricted budget.
Overall, this project aims to provide an overview of the movie industry's efficiency We hope that this analysis will be useful for anyone interested in the movie industry, including aspiring filmmakers, investors, and industry analysts.
import pandas as pd
import json
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
The main library that we will be using for a majority of the project is Pandas. Pandas is super useful for manipulating datasets in structures called dataframes. This allows us to analyze individual features of a dataset and explore set wide trends. Pandas is python compatable, open-source tool that can be found at the link below:
The two main data sources that we will be using in this project are both found on the website Kaggle. The links to these datasets can be found here:
Top 250 IMDB Movies: https://www.kaggle.com/datasets/karkavelrajaj/imdb-top-250-movies
TMDB 5000 Movie Dataset: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv
We start by using pandas to to turn both CSV files into seperate dataframes.
# Read CSV files into dataframes.
top250_df = pd.read_csv('/content/movies.csv')
allMovies_df = pd.read_csv('/content/tmdb_5000_movies.csv')
We now have 2 dataframes named 'top250_df' and 'allMovies_df' to store our data.
The 'top250_df' holds the data from the file movies.csv which includes the top 250 ranked movies on IMDB. We can see the categories of data stored here:
top250_df.columns
Index(['rank', 'movie_id', 'title', 'year', 'link', 'imbd_votes', 'imbd_rating', 'certificate', 'duration', 'genre', 'cast_id', 'cast_name', 'director_id', 'director_name', 'writer_id', 'writer_name', 'storyline', 'user_id', 'user_name', 'review_id', 'review_title', 'review_content'], dtype='object')
The 'allMovies_df' holds the data from the file tmdb_5000_movies.csv. We can see the categories of that stored data here:
allMovies_df.columns
Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count'], dtype='object')
As you can see, not all of the columns are useful to us or might not even have meaning at all. Lets reform the dataframes to only have the data we want:
top250_df = top250_df[["rank", "title", "year", "imbd_rating"]]
allMovies_df = allMovies_df[["budget", "production_companies", "revenue", "title", "release_date", "genres"]]
Let's check for missing data in the dataframes:
print(len(top250_df[top250_df.isnull().any(axis=1)]), len(allMovies_df[allMovies_df.isnull().any(axis=1)]))
0 1
Lucky for us, there is only 1 row between the two dataframes with missing data. Let's remove that row to eliminate any unwanted errors:
print(allMovies_df[allMovies_df.isnull().any(axis=1)])
allMovies_df = allMovies_df.drop(4553)
budget production_companies revenue title \ 4553 0 [] 0 America Is Still the Place release_date genres 4553 NaN []
We want to create a merged dataframe that allows us to see both a movies rank as well as it's budget and revenue:
# Merge on the title column of the two dataframes, only keeping a movie if it is in both
merged_df = pd.merge(top250_df, allMovies_df, on='title', how='inner')
# Make a company names column to hold names instead of production_companies objects
merged_df['company_names'] = merged_df['production_companies'].apply(lambda x: [i['name'] for i in json.loads(x)])
allMovies_df['company_names'] = allMovies_df['production_companies'].apply(lambda x: [i['name'] for i in json.loads(x)])
# Extract year from datetime objects to create a year column
allMovies_df['year'] = pd.to_datetime(allMovies_df['release_date']).dt.year
# Remove unwanted rows
del merged_df["production_companies"]
del allMovies_df["production_companies"]
merged_df
rank | title | year | imbd_rating | budget | revenue | release_date | genres | company_names | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | The Shawshank Redemption | 1994 | 9.3 | 25000000 | 28341469 | 1994-09-23 | [{"id": 18, "name": "Drama"}, {"id": 80, "name... | [Castle Rock Entertainment] |
1 | 2 | The Godfather | 1972 | 9.2 | 6000000 | 245066411 | 1972-03-14 | [{"id": 18, "name": "Drama"}, {"id": 80, "name... | [Paramount Pictures, Alfran Productions] |
2 | 3 | The Dark Knight | 2008 | 9.0 | 185000000 | 1004558444 | 2008-07-16 | [{"id": 18, "name": "Drama"}, {"id": 28, "name... | [DC Comics, Legendary Pictures, Warner Bros., ... |
3 | 5 | 12 Angry Men | 1957 | 9.0 | 350000 | 1000000 | 1957-03-25 | [{"id": 18, "name": "Drama"}] | [United Artists, Orion-Nova Productions] |
4 | 6 | Schindler's List | 1993 | 9.0 | 22000000 | 321365567 | 1993-11-29 | [{"id": 18, "name": "Drama"}, {"id": 36, "name... | [Universal Pictures, Amblin Entertainment] |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
156 | 245 | The Iron Giant | 1999 | 8.1 | 70000000 | 23159305 | 1999-08-06 | [{"id": 12, "name": "Adventure"}, {"id": 16, "... | [Warner Bros. Animation] |
157 | 247 | The Help | 2011 | 8.1 | 25000000 | 124272124 | 2011-08-09 | [{"id": 18, "name": "Drama"}] | [DreamWorks SKG, 1492 Pictures, Participant Pr... |
158 | 248 | Aladdin | 1992 | 8.0 | 28000000 | 504050219 | 1992-11-25 | [{"id": 16, "name": "Animation"}, {"id": 10751... | [Walt Disney Pictures] |
159 | 249 | Gandhi | 1982 | 8.0 | 22000000 | 77737889 | 1982-11-30 | [{"id": 18, "name": "Drama"}, {"id": 36, "name... | [Columbia Pictures, Goldcrest Films Internatio... |
160 | 250 | Dances with Wolves | 1990 | 8.0 | 22000000 | 424208848 | 1990-11-09 | [{"id": 12, "name": "Adventure"}, {"id": 18, "... | [Tig Productions, Majestic Films International] |
161 rows × 9 columns
In order to figure out what we would like to investigate or inquire about, we must first look at the data we have to see if there are any trends that might warrant a deep dive into. This can be done with various graphs.
# Create a datafram of the mean budget grouped by year
mean_budget_df = allMovies_df.groupby('year')['budget'].mean().reset_index()
# Modify look of the graph
sns.set_style('darkgrid')
sns.set_palette('viridis')
fig, ax = plt.subplots(figsize=(12, 6))
# Plot line plot with the x-axis as year and y-axis as budget
sns.lineplot(x='year', y='budget', data=mean_budget_df, ax=ax)
ax.set_xlim(1930, 2016)
ax.set_ylim(0, 50000000)
(0.0, 50000000.0)
As you can see, the funding of movies over time has increased a considerable amount.
# Create mean revenue dataframe to store the mean revenue by year
mean_revenue_df = allMovies_df.groupby('year')['revenue'].mean().reset_index()
# Visual tweaks
sns.set_style('darkgrid')
sns.set_palette('viridis')
fig, ax = plt.subplots(figsize=(12, 6))
# Create lineplot with the x-axis as year and y-axis as revenue
sns.lineplot(x='year', y='revenue', data=mean_revenue_df, ax=ax)
ax.set_xlim(1930, 2016)
ax.set_ylim(0, 200000000)
(0.0, 200000000.0)
Similar to budget, the mean revenue has increased over time. However, movies from all time periods that are extremely successful may produce spikes in the data.
# Wrangle data to create a dataframe of genres
genre_names = allMovies_df['genres'].apply(lambda x: [i['name'] for i in json.loads(x)])
# Create a list of genres
genre_list = [genre for genres in genre_names for genre in genres]
# Count amount of each genre in list
genre_counts = pd.Series(genre_list).value_counts()
# Keep the top 15 genres
top_genres = genre_counts[:15]
# Plot a pie chart
fig, ax = plt.subplots(figsize=(12, 6))
top_genres.plot.pie(ax=ax)
ax.set_title('Movies by Genre')
Text(0.5, 1.0, 'Movies by Genre')
The top 15 genres are on par with what you might expect, including Comedy, Drama, Action, and Thriller occupying over 50% of all movies.
merged_df['genres'] = merged_df['genres'].apply(lambda x: [i['name'] for i in json.loads(x)])
# Get top 15 genres in a list
genre_list = merged_df['genres'].sum()
genre_counts = pd.Series(genre_list).value_counts()
top_genres = genre_counts[:15].index.tolist()
# Group the movies by genre and calculate the average budget
genre_budgets = merged_df.explode('genres').groupby('genres')['budget'].mean()
# Filter only the budgets of the top genres
top_genre_budgets = genre_budgets.loc[top_genres]
# Create a bar graph of the average budget for each genre
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x=top_genre_budgets.index, y=top_genre_budgets.values, ax=ax)
ax.set_title('Average Budget of Top 15 Genres')
ax.set_xlabel('Genre')
ax.set_ylabel('Budget (in millions)')
plt.xticks(rotation=90)
plt.show()
There doesn't seem to be any correlation between the most popular movie type and there budget. Animation, suprisingly, on average has the highest budget.
# Store all genre revenues
genre_revenues = merged_df.explode('genres').groupby('genres')['revenue'].mean()
# Store the average revenue of the top 15 genres
top_genre_revenues = genre_revenues.loc[top_genres]
# Create bar plot of top 15 genres
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x=top_genre_revenues.index, y=top_genre_revenues.values, ax=ax)
ax.set_title('Average Revenue of Top 15 Genres')
ax.set_xlabel('Genre')
ax.set_ylabel('Revenue (in millions)')
plt.xticks(rotation=90)
plt.show()
There does seem to be a correlation between genre budget and genre revenue, as seen above.
# Average Revenue
company_stats = {}
for companies, revenue in zip(allMovies_df['company_names'], allMovies_df['revenue']):
for company in companies:
if company in company_stats:
company_stats[company]['total_revenue'] += revenue
company_stats[company]['num_movies'] += 1
else:
company_stats[company] = {'total_revenue': revenue, 'num_movies': 1}
# Calculate the average revenue for each company
company_revenues = {}
for company in company_stats:
company_revenues[company] = company_stats[company]['total_revenue'] / company_stats[company]['num_movies']
# Sort the companies based on their average revenue and select the top 20
top20_companies = sorted(company_revenues, key=company_revenues.get, reverse=True)[:20]
# Create a bar graph with the company names on the x-axis and the average revenue on the y-axis
company_revenues_top20 = {company: company_revenues[company] for company in top20_companies}
sns.barplot(x=list(company_revenues_top20.keys()), y=list(company_revenues_top20.values()))
plt.xticks(rotation=90)
plt.xlabel('Company')
plt.ylabel('Average Revenue')
plt.title('Top 20 Companies by Average Revenue')
plt.show()
A surprising amount of foreign companies round out the top 20 companies, based on revenue.
# Average Budget
company_stats = {}
for companies, budget in zip(allMovies_df['company_names'], allMovies_df['budget']):
for company in companies:
if company in company_stats:
company_stats[company]['total_budget'] += budget
company_stats[company]['num_movies'] += 1
else:
company_stats[company] = {'total_budget': budget, 'num_movies': 1}
# Calculate the average budget for each company
company_budgets = {}
for company in company_stats:
company_budgets[company] = company_stats[company]['total_budget'] / company_stats[company]['num_movies']
# Sort the companies based on their average budget and select the top 20
top20_companies = sorted(company_budgets, key=company_budgets.get, reverse=True)[:20]
# Create a bar graph with the company names on the x-axis and the average budget on the y-axis
company_budgets_top20 = {company: company_budgets[company] for company in top20_companies}
sns.barplot(x=list(company_budgets_top20.keys()), y=list(company_budgets_top20.values()))
plt.xticks(rotation=90)
plt.xlabel('Company')
plt.ylabel('Average Budget')
plt.title('Top 20 Companies by Average Budget')
plt.show()
From this plot we can see that having the highest revenue does not always mean that the highest budget was given. Definitely an interesting correlation to look into.
First, lets refresh the raw data from the experiment:
import pandas as pd
import json
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('tmdb_5000_movies.csv')
#name the columns you want to use
df = df[["budget", "production_companies", "revenue", "title", "release_date", "genres"]]
#use apply to extract the company names from the list of JSON
df['company_names'] = df['production_companies'].apply(lambda x: [i['name'] for i in json.loads(x)])
df['genres'] = df['genres'].apply(lambda x: [i['name'] for i in json.loads(x)])
#change release date to just the year in a new column
#coerce invalid dates to change to NaT which gets wiped up by dropna
df['year'] = pd.to_datetime(df['release_date'], errors='coerce').dt.year.astype('Int64')
#you can now remove the production companies and release date columns since
#we've cleaned them up into better cleaner columns
del df["production_companies"]
del df["release_date"]
#the use of drop and dropna allows us to get rid of any rows that have a 0
#for a value that clearly shouldn't be 0, or any row that has otherwise
#invalid data
df = df.drop(index=df[df.eq(0)].dropna(how='all').index)
df
budget | revenue | title | genres | company_names | year | |
---|---|---|---|---|---|---|
0 | 237000000 | 2787965087 | Avatar | [Action, Adventure, Fantasy, Science Fiction] | [Ingenious Film Partners, Twentieth Century Fo... | 2009 |
1 | 300000000 | 961000000 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | [Walt Disney Pictures, Jerry Bruckheimer Films... | 2007 |
2 | 245000000 | 880674609 | Spectre | [Action, Adventure, Crime] | [Columbia Pictures, Danjaq, B24] | 2015 |
3 | 250000000 | 1084939099 | The Dark Knight Rises | [Action, Crime, Drama, Thriller] | [Legendary Pictures, Warner Bros., DC Entertai... | 2012 |
4 | 260000000 | 284139100 | John Carter | [Action, Adventure, Science Fiction] | [Walt Disney Pictures] | 2012 |
... | ... | ... | ... | ... | ... | ... |
4773 | 27000 | 3151130 | Clerks | [Comedy] | [Miramax Films, View Askew Productions] | 1994 |
4788 | 12000 | 6000000 | Pink Flamingos | [Horror, Comedy, Crime] | [Dreamland Productions] | 1972 |
4792 | 20000 | 99000 | Cure | [Crime, Horror, Mystery, Thriller] | [Daiei Studios] | 1997 |
4796 | 7000 | 424760 | Primer | [Science Fiction, Drama, Thriller] | [Thinkfilm] | 2004 |
4798 | 220000 | 2040920 | El Mariachi | [Action, Crime, Thriller] | [Columbia Pictures] | 1992 |
3229 rows × 6 columns
#create a dataframe with one row for every company name present in the df
unique_companies = df.explode('company_names')['company_names'].drop_duplicates()
companies_df = pd.DataFrame(index=unique_companies)
companies_df.index.name = 'company'
#build the other columns necessary for company-based analysis
companies_df['movie_count'] = companies_df.index.map(lambda x: sum(df['company_names'].apply(lambda y: x in y)))
companies_df['gross_budget'] = companies_df.index.map(lambda company: df[df['company_names'].apply(lambda x: company in x)]['budget'].sum())
companies_df['gross_revenue'] = companies_df.index.to_series().apply(lambda x: df.loc[df['company_names'].apply(lambda y: x in y), 'revenue'].sum())
companies_df['avg_budget'] = companies_df['gross_budget'] / companies_df['movie_count']
companies_df
movie_count | gross_budget | gross_revenue | avg_budget | |
---|---|---|---|---|
company | ||||
Ingenious Film Partners | 24 | 2340000000 | 7907230983 | 9.750000e+07 |
Twentieth Century Fox Film Corporation | 201 | 10493790000 | 39136501776 | 5.220791e+07 |
Dune Entertainment | 54 | 3906000003 | 13634281207 | 7.233333e+07 |
Lightstorm Entertainment | 5 | 722000000 | 5621881784 | 1.444000e+08 |
Walt Disney Pictures | 96 | 9248780000 | 28515762158 | 9.634146e+07 |
... | ... | ... | ... | ... |
Tiny Ponies | 1 | 65000 | 416498 | 6.500000e+04 |
Low Spark Films | 1 | 4000000 | 600896 | 4.000000e+06 |
Signal Film Group | 1 | 4000000 | 600896 | 4.000000e+06 |
Dreamland Productions | 1 | 12000 | 6000000 | 1.200000e+04 |
Daiei Studios | 1 | 20000 | 99000 | 2.000000e+04 |
3565 rows × 4 columns
Now that our data is structured to be able to see company-specific statistics, lets take a look at each company's gross spending.
ax = companies_df.plot.bar(y='gross_budget', legend=False, figsize=(10, 6))
ax.set_title('Gross Budget by Company')
ax.set_xlabel('Company')
ax.set_ylabel('Gross Budget')
plt.xticks(rotation=90)
plt.show()
Not a great way to visualize the data. Lets try just looking at the top 25 performers instead.
top_25 = companies_df.nlargest(25, 'gross_budget')
#lets make the plot and color it red for losses. I'm not so much
#of an accountant as I am a data scientist
top_25.plot(kind='bar', y='gross_budget', figsize=(12,6), legend=False, color='red')
plt.title('Top 25 Companies by Gross Budget')
plt.xlabel('Company')
plt.ylabel('Gross Budget (in millions)')
plt.show()
Interesting. We can see that Warner Bros., Universal, Paramount, Columbia, 20th Century Fox, and Disney are some of the biggest spenders in movie history.
Next lets check out who the biggest earners of all time are. That's what matters to these studios at the end of the day right? We are going to use the same code only slightly tweaked. Make a prediction now; I expect similar results.
top_25 = companies_df.nlargest(25, 'gross_revenue')
#and lets go ahead and color it green this time for money
top_25.plot(kind='bar', y='gross_revenue', figsize=(12,6), legend=False, color='green')
plt.title('Top 25 Companies by Gross Revenue')
plt.xlabel('Company')
plt.ylabel('Gross Revenue (in millions)')
plt.show()
print(companies_df.columns)
Index(['movie_count', 'gross_budget', 'gross_revenue', 'avg_budget'], dtype='object')
Looks a first glance to be very similar data, although scaled up since movies generally make more money than they cost. Cough, cough Cats 2019. For other funny box office flops heres a good read.
Let's only focus on these 25 companies. We want to see who spends their money well and who is just throwing money at horrible projects. Ah, the art of data science!
#discard the small players. We're dealing with the big ballers
top_25_companies = companies_df.nlargest(25, 'gross_budget')
top_25_companies.index.name = 'company'
top_25_companies = top_25_companies.reset_index()
Add a column here to the dataframe called 'efficiency' which is the measure of revenue/budget. It will show us the ROI from these companies on average for their productions. Thats return on investment, for those who haven't seen The Wolf of Wall Street hundreds of times.
top_25_companies['efficiency'] = top_25_companies['gross_revenue'] / top_25_companies['gross_budget']
And now lets check out everyone's spending efficiencies now that the playing field among the unstoppable entertainment mega-corportaions.
top_25_companies_sorted = top_25_companies.sort_values(by='efficiency', ascending=False)
plt.bar(top_25_companies_sorted['company'], top_25_companies_sorted['efficiency'])
plt.xticks(rotation=90)
plt.xlabel('Company')
plt.ylabel('Efficiency (Revenue / Budget)')
plt.title('Top 25 Movie Production Companies by Efficiency')
plt.show()
Quite a different set of victors than our last few analyses. Here at the most efficient spenders we have Amblin Entertainment, the folks who made Back to the Future, Jurassic Park, The Goonies, and many more. Amblin is showing a ~4.5x return on their investment into movies on average; this is a very impressive return. Next,we have Marvel Studios best known for the legendary MCU movies. A close third is Pixar Animation studios of the Toy Story movies, Finding Nemo, and more. Some close runner-ups are 20th Century Fox, Revolution Sun Studios, and Dune Entertainment.
While this data is novel and interesting, we need to find out which studio is going to be the biggest/best in the future.
I'm not a part of Amblin Entertainment so I can't cash in on their incredible money-multiplying machine, but the best we can do is study the winners and predict who the next winners will be. Out of curiousity, let's see what their ROI is...
(also demonstrate usage of the loc and iloc functions)...
#use the loc and iloc functions
top_25_companies.loc[top_25_companies['company'] == 'Amblin Entertainment', 'efficiency'].iloc[0]
4.616109687015893
Let's create a linear regression model showing the trend of each movie studio's efficiency over the years. Surely most have only refined their production process to be more efficient? This is only a prediction; lets check out the data:
#get the original 5000-movie dataframe, and were going to make
#a new dataframe where each movie has a separate entry for each
#of it's studio's who worked on it.
expl_compdf = df.explode('company_names')
expl_compdf = expl_compdf.rename(columns={'company_names' : 'company'})
expl_compdf = expl_compdf.reset_index(drop=True)
#and we only want the top 25 most efficient companies because
#everyone else is clearly not going to be a big winner for efficiency
expl_compdf = expl_compdf[expl_compdf['company'].isin(top_25_companies['company'])]
grouped = expl_compdf.groupby(['company', 'year'])
# Compute the revenue and budget totals for each group
totals = grouped.agg({'revenue': 'sum', 'budget': 'sum'})
# Compute the efficiency for each group
totals['efficiency'] = totals['revenue'] / totals['budget']
# Reset the index to make the company and year columns into regular columns
totals = totals.reset_index()
#since we're focusing on very sporatic data, and the future is our only
#concern, we can just look more recent than 1990
totals = totals[totals['year'] >= 1990]
efficiency_by_company = totals.pivot(index='year', columns='company', values='efficiency')
# Create the line plot using pandas
ax = efficiency_by_company.plot(figsize=(10,15))
# Change some labels.
ax.set_xlabel('Year')
ax.set_ylabel('Efficiency (Revenue/Budget)')
ax.set_title('Efficiency of Top Movie Companies Over Time')
plt.show()
We can see that this data is really all over the place. Studio's go up and down seemingly randomly. Some have huge peaks far above the rest, but they always come down into the mix. It may be able to be explained that those peaks are coming from classic movies. That largest peak from Summit Entertainment is coming from American Pie (1999), which budgeted \$11million but had a box office revenue of \$235.5million. Quite the turnaround.
It is important for us still to find the studio that is most likely to multiply their money going forward. Lets check out the efficiency over time to see what is a fair comparison to compare other studios to.
#we will use the scikit learn LinearRegression
from sklearn.linear_model import LinearRegression
#initialize the model
model = LinearRegression()
X = totals[['year']]
y = totals['efficiency']
#fit the model to our X and y
model.fit(X, y)
#this is how you can access the real equation for the line
#these pieces of data are oftentimes important for analyzing
#a smaller piece of data in comparison to the whole
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
#predict the future!
y_pred = model.predict(X)
plt.plot(X, y_pred)
plt.xlabel('Year')
plt.ylabel('Efficiency')
plt.title('Average Studio Efficiency over Time')
plt.show()
Coefficients: [-0.03117026] Intercept: 65.73283668086948
Thats upsetting. Maybe this isnt a good time to enter the movie business in general. We're still above 2.9 efficiency, but this isn't entirely helpful.
Lets try to see the specific studio with the best regression model for the future.
#Initialize some things to keep track of the best
max_coefficient = 0
best_model = None
best_company = ""
best_company_data = None
top_25_companies = companies_df.nlargest(25, 'gross_budget')
#go through each company and compare their regression model
for company in top_25_companies.index:
company_data = totals[totals['company'] == company]
x = company_data[['year']]
y = company_data['efficiency']
model = LinearRegression().fit(x, y)
if model.coef_ > max_coefficient:
max_coefficient = model.coef_
best_model = model
best_company = company
best_company_data = company_data
#Plot the best one
if best_model is not None:
years = pd.DataFrame({'year': range(2005, 2020)})
predicted_efficiency = best_model.predict(years[['year']])
plt.plot(best_company_data['year'], best_company_data['efficiency'], label='Actual')
plt.plot(years['year'], predicted_efficiency, label='Predicted')
plt.xlabel('Year')
plt.ylabel('Efficiency')
plt.title("Efficiency of over time of Best Company")
plt.legend()
plt.show()
#Print this for us to see
print(f"The best model has a coefficient of {max_coefficient[0]} and is: {best_company}")
The best model has a coefficient of 0.2830128009815079 and is: Marvel Studios
We can see that Marvel Studios has made the biggest efficiency increase and is targeted to have a much higher efficiency than others very soon. Although their data spans a very short time, and ends before the majority of extremely high grossing Marvel movies like Avengers Infinity War and Avengers End Game, we can still notice the extreme productivity that Marvel is making with their money.
By now you will have seen many examples demonstrating how to perform a data analysis in python using pandas, matplotlib, numpy, and scikit learn. These skills are fundamental to data science and will be very important as you go forward analyzing more data.
The first thing we did was a general overview of the data at hand. We looked at all sorts of possible trends and interesting things shown in the data. What was most interesting to us was movie studio efficiency over time. The second most interesting thing that we might have done was the genre efficiency; we could have looked at which genres gross the most and return the most on your investment. My prediction is Action, as the extremely high grossing Marvel movies fall under Action.
Next, we showed what poor plot design looks like and how to fix it. We made bar plots showing the top 25 largest companies of all time by budget and revenue. Then, we did a rudamentary analysis of the top performing companies and found that the highest all time efficiency was Amblin Entertainment, formerly known as Stephen Spielberg Productions. Amblin had the highest revenue:budget ratio of any of the top performing studios dealing with big money. There may have been someone who spent $4 on a movie budget and ended making 100dollars from sales, but these cases are so irrelevant to the grand scheme of wealth we were trying to analyze.
After that, we took a look at the efficiencies of the top performing studios over time. And for a little more clarity, we excluded anything before the year 1990. As an old friend of mine used to say "we're not looking out the rear-view mirror, we're looking out the front-view mirror". This forward-looking approach allows us to take the modern context into consideration.
Our winner for the data analysis was Marvel Studios. Their biggest movies are Avengers Endgame, Avengers Infinity War, Spider-Man: No Way Home, The Avengers, Avengers: Age of Ultron, Black Panther, Iron Man 3, Captain America: Civil War, Spider-Man: Far From Home, and Captain Marvel. All of these movies had unbelievable returns on investment, with some nearing 10x efficiency, and they all came after 2015. So, the prediction of which studio will become the most efficient was right. Estimates say Avengers: Endgame cost Marvel about \$356million to produce, and returned $2.799billion worldwide. That movie alone had an efficiency of 7.862359551, a very high efficiency especially for this magnitude of money. Marvel is a top performer economically and will likely continue for at least a decade more.