Bilibili Content Creator Data Crawling and Analysis

Data Source

  1. User Subscription List (https://space.bilibili.com/(user’s id)/fans/fans)

    https://api.bilibili.com/x/relation/followings?vmid=(user’s id)&pn=((user’s subscription list page number))&ps=20&order=desc&jsonp=jsonp

    This API returns basic information, namely the id and user name, of the users that a given user subscribes to, in JSON format.

  2. User Homepage (https://space.bilibili.com/(user’s id)/video)

    This HTML page contains the basic information of a user (content creator), e.g., the number of followers, the total number of videos uploaded, and the number of videos uploaded in each category.
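A minimal sketch of how one page of data source 1 could be requested and parsed. The function names are hypothetical, and the response layout (entries under data -> list with the keys "mid" and "uname") is an assumption that should be checked against a real response. The real API may also require request headers or a login cookie, which is not covered here.

```python
from urllib.parse import urlencode

API_URL = "https://api.bilibili.com/x/relation/followings"


def build_followings_url(user_id, page, page_size=20):
    """Build the request URL for one page of a user's subscription list."""
    query = urlencode({
        "vmid": user_id,
        "pn": page,
        "ps": page_size,
        "order": "desc",
        "jsonp": "jsonp",
    })
    return f"{API_URL}?{query}"


def parse_followings(payload):
    """Extract (id, user name) pairs from a decoded JSON response.

    Assumption: entries sit under data -> list with keys "mid" and "uname".
    An error response without a list yields an empty result.
    """
    entries = (payload.get("data") or {}).get("list") or []
    return [(entry["mid"], entry["uname"]) for entry in entries]


def fetch_followings(user_id, page):
    """Request and parse one page; needs network access and the requests package."""
    import requests  # imported lazily so the helpers above work without it

    response = requests.get(build_followings_url(user_id, page), timeout=10)
    response.raise_for_status()
    return parse_followings(response.json())
```

Keeping URL building and parsing separate from the network call makes both easy to check offline against a saved response.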

Data Crawling

Python Packages Used: requests, selenium, bs4, json, sqlite3

Data Source 1

The requests module is used to get the id and username of every user in a user’s subscription list.

As Bilibili has around 400 million users, a large share of whom are not content creators, I took a Breadth First Search (BFS) approach to traverse the Subscription List (the users that a user subscribes to), starting with a famous content creator.

Since a user usually subscribes to active content creators, this approach reduces the chance of collecting irrelevant information about users who are not content creators. At the same time, it also captures the subscription relationships between content creators.
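The BFS traversal described above can be sketched as follows. This is an illustration, not the original crawler: crawl_bfs, get_followings and max_users are hypothetical names, and get_followings stands in for whatever function returns the (id, user name) pairs of one user's subscription list.

```python
from collections import deque


def crawl_bfs(seed_id, get_followings, max_users=1000):
    """Breadth-first traversal of subscription lists starting from seed_id.

    get_followings(user_id) must return an iterable of (id, name) pairs.
    Returns (names, edges): names maps id -> user name for every user seen
    in a subscription list, edges holds (follower id, followed id) pairs.
    """
    names = {}
    edges = []
    visited = {seed_id}
    queue = deque([seed_id])
    while queue:
        current = queue.popleft()
        for followed_id, name in get_followings(current):
            edges.append((current, followed_id))
            names.setdefault(followed_id, name)
            # Queue each user only once, and stop expanding at the cap.
            if followed_id not in visited and len(visited) < max_users:
                visited.add(followed_id)
                queue.append(followed_id)
    return names, edges
```

The queue guarantees that users closer to the starting creator are expanded first, and the visited set prevents the same subscription list from being requested twice.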

Data Source 2

Since data source 1 only provides the id and username, further information needs to be obtained from data source 2.

Implementation

Bilibili limits the frequency of API requests to once every 5 seconds, so I rented a CentOS cloud server and configured a Python environment on it to run the crawler program. The cloud server was controlled remotely with the PuTTY client, data was transferred between the server and the local machine with WinSCP, and login was authenticated with an SSH key generated by PuttyKey.
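One way to respect the limit of one request every 5 seconds is a small throttle that is called before each request. This is a sketch under that assumption; the Throttle class is hypothetical and not part of the original crawler. The clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a crawler loop, calling wait() on a single shared Throttle immediately before every API request keeps the request rate at or below the limit.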

As certain information (e.g., the number of videos uploaded in each category) cannot be requested through APIs and is not directly present in the HTML source code, the selenium module is used to crawl data source 2.

Due to the limited speed of selenium, only the 3730 uploaders that appear more than 20 times in the relationship data of data source 1 are selected, which ensures that only well-known content creators are analyzed.
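The frequency filter can be sketched with a counter over the relationship data. The sketch assumes that "frequency" means the number of times an id appears as the followed user in a (follower id, followed id) pair; the function name is hypothetical.

```python
from collections import Counter


def select_frequent_creators(edges, min_count=20):
    """Return ids that appear as the followed user more than min_count times.

    edges is an iterable of (follower id, followed id) pairs.
    """
    followed_counts = Counter(followed for _, followed in edges)
    return {uid for uid, count in followed_counts.items() if count > min_count}
```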

Python’s built-in sqlite3 module is used to store the data.
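A possible sqlite3 layout for the data from data source 1 is sketched below. The table and column names are assumptions chosen for illustration, not the schema of the original project.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id   INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE IF NOT EXISTS relations (
    follower_id INTEGER NOT NULL,
    followed_id INTEGER NOT NULL,
    PRIMARY KEY (follower_id, followed_id)
);
"""


def open_database(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_page(conn, follower_id, followings):
    """Store a list of (id, name) pairs followed by follower_id.

    Rows that already exist are ignored, so a page can be saved twice safely.
    """
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO users (id, name) VALUES (?, ?)",
            followings,
        )
        conn.executemany(
            "INSERT OR IGNORE INTO relations (follower_id, followed_id) VALUES (?, ?)",
            [(follower_id, uid) for uid, _ in followings],
        )
```

The primary keys make repeated inserts harmless, which helps when a long-running crawl is interrupted and restarted.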

Result

A total of 155120 records of basic user data (user id, user name) and 558345 records of relationship data (user id, user id) were obtained from data source 1.

A total of 3730 records of detailed user data (number of followers, number of videos uploaded, and the video categories with the highest and second-highest number of videos uploaded) were obtained from data source 2.

Data Analysis

Python Packages Used: pandas, matplotlib

Content Creator Distribution

Most content creators have fewer than 500,000 followers, and the number of content creators drops sharply beyond 500,000 followers. Almost no content creators have more than 2 million followers, yet the maximum number of followers reaches 12 million. Together, these figures reflect the extreme stratification of content creators on Bilibili.

Figure: Histogram of the Number of Content Creators by Number of Followers

The number of videos uploaded by each content creator has a mean of 337, a median of 108, and a maximum of 110293. Most content creators have uploaded fewer than 500 videos. The extreme values might come from bot accounts.

Figure: Histogram of the Number of Content Creators by Number of Videos Uploaded

Relationship between the Number of Followers and the Number of Videos Uploaded

Overall, no significant positive relationship is found between the number of followers and the number of videos uploaded. Within individual video categories, however, significant relationships appear: positive in animation, cinema, and dance; negative in fashion, game, and life.

Figure: The Number of Followers and the Number of Videos Uploaded
Figure: Heatmap of the Number of Followers and the Number of Videos Uploaded Based on Video Categories
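A per-category relationship of this kind could be measured as sketched below. The analysis above does not state which statistic was used, so the Pearson correlation coefficient here is an assumption, as are the function names and the (category, followers, videos) row layout. The sketch reports only the direction and strength of the relationship and does not test significance.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def correlation_by_category(rows):
    """Group (category, followers, videos) rows and correlate within each group."""
    groups = defaultdict(lambda: ([], []))
    for category, followers, videos in rows:
        groups[category][0].append(followers)
        groups[category][1].append(videos)
    return {
        category: pearson(followers, videos)
        for category, (followers, videos) in groups.items()
    }
```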

Social Network of Content Creator

The social network of content creators is visualized using networkx. Only nodes with a degree greater than 8 are drawn, and the 9 nodes with the highest degree are identified.

Figure: Social Network of Content Creators
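The node selection can be illustrated without networkx using only the standard library. The sketch assumes the relationship data is treated as an undirected graph without self-loops, so a node's degree is its number of distinct neighbours; whether the original analysis used a directed or undirected graph is not stated. The function name is hypothetical.

```python
from collections import defaultdict


def degree_summary(edges, min_degree=8, top_n=9):
    """Return (nodes with degree > min_degree, top_n nodes by degree).

    edges is an iterable of (user id, user id) pairs, treated as undirected.
    Ties in degree are broken by the smaller id.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    degree = {node: len(adjacent) for node, adjacent in neighbours.items()}
    drawn = {node for node, d in degree.items() if d > min_degree}
    top = sorted(degree, key=lambda node: (-degree[node], node))[:top_n]
    return drawn, top
```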