Data Visualization Using Tableau

Case Analysis Summary

The present case analysis aims at discovering the common features of TV shows and movies available on Netflix. In particular, the case analysis focuses on the date a title was added, its rating, its type, and its year of release. The paper also aims at providing a historical analysis of changes in the number of available TV shows and movies, their ratings, and categories. The analysis seeks to answer whether TV shows have become more popular than movies on Netflix. Additionally, the analysis is expected to discover which categories of TV shows and movies are the most popular on Netflix.

Background

Netflix is a streaming service that provides its users with an opportunity to watch a wide variety of movies, TV shows, documentaries, and anime (Netflix, 2020b). The platform can be used to watch any content without limitations on almost any device for a monthly payment (Netflix, 2020b). It is one of the most popular streaming services in the world, with around 167 million paid subscriptions in over 190 countries (Netflix, 2020a). Additionally, more than 2 million US users subscribe to receive DVDs by mail (Netflix, 2020a). The company’s net income grew by more than 50%, from $1.21 billion in 2018 to $1.87 billion in 2019 (Netflix, 2020a). In 2020, the platform’s popularity has kept growing because of lockdowns, which limit the entertainment options of people around the globe.

Since the number of Netflix users is growing rapidly, the company has presumably been monitoring the demand and interests of its users to provide an adequate supply of movies and TV shows. Thus, an analysis of the types and formats of movies and TV shows available on Netflix can show how the taste of its audience changes. I have always been interested in understanding what people prefer to watch today. Additionally, I am a keen user of Netflix, and I always wanted to know how much content is available on the website. Thus, the analysis of a dataset that contains information about the shows available on the platform was expected to help achieve the desired goal and satisfy my personal curiosity.

Dataset

The dataset under analysis is called Netflix Movies and TV Shows. Shivam Bansal (2020) made the dataset publicly available on kaggle.com in January 2020. This dataset consists of TV shows and movies available on Netflix at the end of 2019. The dataset was initially collected using Flixable, which is a third-party search engine for Netflix. The dataset includes 12 variables: show ID, type (movie or TV show), title, director, cast, country, date added, release year, rating, duration, category (listed in), and description. In total, the dataset includes 6236 titles, which makes it rather large and difficult to comprehend in raw form. However, data visualization using Tableau is expected to help with understanding what the data can tell.

Seven Stages

Stage 1: Acquire

First, the dataset was acquired from the web using Google. Since I wanted to study information on Netflix, I entered “Netflix dataset” in the search bar and selected the second link provided by Google. After that, I downloaded the dataset in the .csv format and opened it with Microsoft Excel 2016. The opened file was unusable for analysis, as all the data was clustered into one column (see Figure 1 below).

Figure 1. Acquiring

Stage 2: Parse

As seen in Figure 1, the raw dataset file could not be used for conducting the analysis in Tableau. Thus, according to Fry (2007), the data needs to be parsed before commencing analysis. Parsing usually includes structuring, grouping, sorting, and cleaning the data (Fry, 2007). Parsing also often includes editing the titles of the columns to clarify what the variables measure. During the second stage, I separated the data from one column into 12, changed the names of the columns, and replaced all the blanks in the “Director” column with N/A, as recommended by Fry (2007). However, this last action proved unnecessary, as the variable was later filtered out. The column was separated using the “Text to Columns” feature in Excel. I selected the comma as the delimiter, and Excel automatically made the dataset usable. The results of the second stage are provided in Figure 2 below.

Figure 2. Parsing
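The parsing step described above was done in Excel, but the same idea can be sketched in code. The following is a minimal illustration, not the method used in this paper: the column names are assumed to follow the downloaded file, and the two sample rows are invented. It splits comma-delimited text into columns and fills blank director values with N/A.

```python
import csv
import io

# Two made-up rows in the assumed layout of the downloaded file
# (hypothetical sample, not taken from the real dataset).
RAW = '''show_id,type,title,director,date_added,release_year,rating,listed_in
1,Movie,Example Film,,"September 9, 2019",2019,TV-MA,"Dramas, International Movies"
2,TV Show,Example Series,Jane Doe,"January 1, 2018",2017,TV-14,Docuseries
'''

def parse_rows(text):
    """Split comma-delimited text into dicts and fill blank directors with N/A."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        if not row["director"].strip():
            row["director"] = "N/A"
    return rows

rows = parse_rows(RAW)
for row in rows:
    print(row["title"], "|", row["director"], "|", row["date_added"])
```

The csv reader treats a quoted value such as "September 9, 2019" as a single field, so commas inside dates and category lists do not create extra columns.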

After Stage 4 began, it became clear that the “date added” column was difficult to use, as it was stored as text. It was decided to use the “Text to Columns” function to extract only the year when the show was added. This information was then used for analysis.
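The year extraction can also be expressed as a short function. This is only a sketch under the assumption that the dates look like “September 9, 2019”; the function name year_added is hypothetical, and the paper itself used Excel for this step.

```python
import re

def year_added(date_text):
    """Return the four-digit year found in a date string, or None if there is none."""
    match = re.search(r"\b(19|20)\d{2}\b", date_text or "")
    return int(match.group(0)) if match else None

# Invented sample values, including one that is not a date at all.
for value in ["September 9, 2019", " January 1, 2018", "Jane Doe", ""]:
    print(repr(value), "->", year_added(value))
```

Returning None for values without a year makes it easy to spot rows where the column holds something other than a date.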

Stage 3: Filter

The third stage, filtering, is deleting all the data that is not expected to be used for analysis. This step can also be done in Excel, as it does not include complicated manipulations with the data. The aim of the case analysis was to understand what categories of movies and TV shows are most popular and how popularity was changing during the past four years. Thus, it was clear that the information about show ID, cast, duration, directors, show titles, and descriptions was not needed. It was less clear whether the country column should be used for data analysis. It was decided to keep this variable, as analyzing the trends by country may be beneficial. Therefore, six variables were left after the filtering stage: show type, country, date added, release year, rating, and category (see Figure 3 below).

Figure 3. Filtering

After Stage 4 began, it became clear that some of the data were incorrect. For instance, the “date added” column included some textual data that was not a date, which was clearly a mistake. Thus, I was forced to delete the rows that were unusable. Additionally, it was decided to avoid using the information in the column “Country,” as the dataset failed to describe what this column meant: countries where the show was available or countries where the show was created. Since the results could not be interpreted, the variable was deleted.
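The filtering decisions described in this stage can be summarized in a small sketch. It is an illustration only: the column names, the date format “Month day, year,” and the sample rows are assumptions, and the actual filtering was done by hand in Excel.

```python
from datetime import datetime

# Columns kept after the final filtering decisions (assumed column names).
KEEP = ("type", "date_added", "release_year", "rating", "listed_in")

def is_valid_date(text):
    """True if the text parses as a date such as 'September 9, 2019'."""
    try:
        datetime.strptime(text.strip(), "%B %d, %Y")
        return True
    except ValueError:
        return False

def filter_rows(rows):
    """Drop rows with an unusable date and keep only the columns in KEEP."""
    return [
        {name: row[name] for name in KEEP}
        for row in rows
        if is_valid_date(row.get("date_added", ""))
    ]

# Hypothetical rows; the second has text where a date should be.
sample = [
    {"show_id": "1", "type": "Movie", "country": "X",
     "date_added": "September 9, 2019", "release_year": "2019",
     "rating": "TV-MA", "listed_in": "Dramas"},
    {"show_id": "2", "type": "TV Show", "country": "Y",
     "date_added": "Jane Doe", "release_year": "2017",
     "rating": "TV-14", "listed_in": "Docuseries"},
]

filtered = filter_rows(sample)
print(filtered)
```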

Stages 4 and 5: Mine and Represent

The fourth step in data visualization is data mining, which is transforming raw data into useful information (Twin, 2020). Representing is choosing the visual model that can describe the data in the most appropriate way (Fry, 2007). These two stages were combined, as Tableau aggregates and represents the data simultaneously.

It was decided to provide an overall and historical analysis of the dataset. The variables under analysis were the type of the show, its category, and its rating. First, the distribution of shows by type was examined. Since there were only two types (TV shows and movies), it was decided to use the pie chart as the representation model. Tableau automatically calculated the number of records in every category and organized the data into a pie chart that clearly demonstrated that there were more movies on Netflix than there were TV shows (see Figure 4 below).

Figure 4. Distribution by type

Second, it was decided to represent the distribution of shows by category. A bar chart was created and sorted to show the most popular categories of shows (see Figure 5). The analysis demonstrated that documentaries, stand-up comedies, and dramas were the most common categories of shows on Netflix.

Figure 5. Distribution by category

Third, the distribution of shows by rating was visualized using bar charts (see Figure 6). The analysis demonstrated that the majority of shows on Netflix are rated TV-MA and TV-14, which implies that Netflix does not target children under 14 as its primary audience.

Figure 6. Distribution by rating

All these views were combined into a dashboard, as they all helped to understand what kinds of titles were available on Netflix in January 2020. Figure 7 demonstrates the initial view of the first dashboard, which was named “Distribution.”

Figure 7. First dashboard

For the second set of visualizations, it was decided to focus on the historical analysis of the data. In particular, the year of addition to Netflix was included in the analysis of the types and ratings of the shows. Since the categories of shows were numerous, it appeared rational to leave this variable out of the historical analysis so as not to overload the viewer. First, the number of available shows by year was represented using a simple graph (see Figure 8). The analysis demonstrated that, of the shows available at the beginning of 2020, the majority had been added in 2019. This can lead to two possible conclusions. On the one hand, this may mean that Netflix increases the number of added shows every year. On the other hand, this may mean that Netflix deletes old shows when they are no longer popular to control the size of the servers it uses. In both cases, it appears that Netflix believes that recent shows are more likely to be popular than old ones.

Figure 8. Number of shows by year added

Second, the number of shows by type was represented to understand how the preferences of people were changing in the past (see Figure 9). The analysis revealed that the number of added TV shows decreased in 2019 in comparison with 2018, which may mean a drop in the popularity of TV shows on the streaming platform.

Figure 9. Types of shows by year added
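The aggregation behind this view is a count of titles for each combination of year added and type. A minimal sketch of that count follows; the sample pairs are invented for illustration and do not reflect the real numbers in the dataset, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical (year added, type) pairs invented for illustration.
SAMPLE = [
    (2018, "TV Show"), (2018, "TV Show"), (2018, "Movie"),
    (2019, "TV Show"), (2019, "Movie"), (2019, "Movie"),
]

def counts_by_year_and_type(pairs):
    """Count titles for each (year added, type) combination."""
    return Counter(pairs)

def change_between(counts, show_type, earlier, later):
    """Difference in the number of titles of one type between two years."""
    return counts[(later, show_type)] - counts[(earlier, show_type)]

counts = counts_by_year_and_type(SAMPLE)
for (year, show_type), n in sorted(counts.items()):
    print(year, show_type, n)
```

A negative value from change_between for a given type would correspond to the kind of year-over-year drop discussed above.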

Finally, the number of shows by rating was analyzed against the year added (see Figure 10). The analysis revealed no significant changes in the number of shows by rating during the past six years.

Figure 10. Ratings of shows by year added

The visualizations shown in Figures 8-10 were combined into a dashboard that helped the viewer to understand how the preferences of viewers were changing. A screenshot of the dashboard is provided in Figure 11 below. The dashboard was called “Historical Analysis.”

Figure 11. Second dashboard

Stage 6: Refine

Two major issues were identified after creating the initial visualization. First, the distribution of shows by category demonstrated in Figure 5 showed only the top 20 categories available on Netflix. Additionally, the visualization models of data presented in Figure 5 and Figure 6 were similar, which made the first dashboard visually monotonous. Therefore, it was decided to change the type of the model to “packed bubbles,” which made it possible to show all the categories on one screen and diversified the first dashboard.

Second, the variable called “release year” was not used in the analysis. Therefore, it was decided to add a filter for all the visualizations in the first dashboard. This was meant to help the viewer see how the characteristics of shows changed depending on the release year. Figure 12 below demonstrates the results of the changes. The filter helped to reveal that the popularity of TV shows started growing in the 1990s, as their number started to grow faster in comparison with the number of movies.

Figure 12. Refining

Stage 7: Interact

For the last stage, it was crucial to add interactions to help the data tell the story. However, due to the nature of the selected data, there is little that could be added in terms of interaction. As a result, there were two types of interactions added to the Tableau file. First, a filter for release year was added to every visualization model included in the first dashboard. Second, navigation buttons were included in both dashboards to help the viewer see more information on the data by pressing the buttons. The final view of the dashboards is provided in Figures 13-14.

Figure 13. Final version of the first dashboard

Figure 14. Final version of the second dashboard

Problems and Solutions

In the process of working with Tableau visualizations, I ran into several problems. The first problem I encountered was a technical issue with representing data in Tableau 2020.1.3. As it turned out, the version I had was unstable, and it failed to visualize the data from Excel (the screen appeared white after selecting the rows and columns). After I spent several hours with the support team, it was recommended to upgrade to the latest version of the product (Tableau 2020.3).

The second problem was associated with preparing data for Tableau in Excel. As mentioned in the previous section of the present paper, the acquired data was unusable after downloading. As shown in Figure 1, all the data were clustered into one column, and I feared that I would have to separate it manually. However, after brief research, I discovered the function called “Text to Columns,” which helped to convert the dataset into an acceptable format.

The third issue with the dataset was that most of the shows belonged to multiple categories. This implied that Excel and Tableau treated every unique combination of categories as a separate category. As a result, the data presented in Figure 12 was not accurate. I considered using the “Text to Columns” feature to separate the categories from one another. However, the results were confusing, as the number of categories for each show was inconsistent. The dataset was unusable after the manipulations, so it was decided to use raw data instead. Thus, the problem was not resolved.
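One possible way to approach this unresolved problem outside Excel is to split each category list and count every category on its own, so that a show listed under several categories contributes to each of them. The sketch below was not part of the analysis in this paper; it assumes the categories are separated by commas, and the sample values are invented.

```python
from collections import Counter

def category_counts(listed_in_values):
    """Count individual categories when each value may list several of them."""
    counts = Counter()
    for value in listed_in_values:
        for category in value.split(","):
            name = category.strip()
            if name:
                counts[name] += 1
    return counts

# Hypothetical category lists in the assumed comma-separated form.
sample = [
    "Dramas, International Movies",
    "Dramas",
    "Documentaries, International Movies",
    "Stand-Up Comedy",
]

counts = category_counts(sample)
for name, n in counts.most_common():
    print(name, n)
```

Because the counting happens per category rather than per column, the varying number of categories per show no longer matters. The totals then exceed the number of shows, which should be kept in mind when interpreting them.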

Finally, as was mentioned above, during the refining stage, the original representation of show distribution by category was changed from a bar chart to packed bubbles. This helped to show all the categories of shows available on Netflix on one screen. Such a substitution was also a half-measure for addressing the third problem described in the present section, as it allowed the viewer to investigate the issue.

Conclusions

Visualization is an outstanding method for analyzing data and making inferences. Tableau makes visualization easy, as even an unskilled user can make basic inferences using simple manipulations. At the same time, Tableau can be helpful for advanced users, as it includes a variety of features that can be difficult to grasp, such as interactions, parameters, functions, and calculated fields. Even though the case analysis helped to improve data visualization skills, there are still areas of growth.

The analysis of the dataset helped to arrive at several conclusions. First, there are many more movies on Netflix than TV shows, which may seem obvious, as TV shows often include several seasons of episodes, which makes them difficult to make. The popularity of TV shows seems to have been growing steadily since the 1990s; however, there was a drop in the number of TV shows added in 2019. Second, according to the analysis, the most popular categories of shows on Netflix are dramas and documentaries. However, it should be noted that there were significant problems with analyzing the number of TV shows by category, which implies that the results may be biased. Third, even though Netflix provides content for children, its target audience is adults, as the majority of content is restricted to age 14 and above. Finally, Netflix appears to believe that recent shows are more likely to be successful.

The dataset can be used for further analysis to acquire a deeper understanding of the information. First, future research should focus on the most popular categories of shows by resolving the problem described in the previous section. Second, the analysis may use data that was deleted during the filtering stage. For instance, the data can be used to understand which directors like to work with which actors. Finally, the dataset can be integrated with other databases, such as IMDb ratings and Rotten Tomatoes, to understand if the rating affected the likelihood of Netflix buying a show.

References

Bansal, S. (2020). Netflix movies and TV shows [Dataset].

Fry, B. (2007). Visualizing data (2nd ed.). O’Reilly Media.

Netflix. (2020a). Annual report.

Netflix. (2020b). Unlimited movies, TV shows, and more.

Twin, A. (2020). Data mining. Investopedia.