Python mini-exercise: preprocessing the TMDB movie dataset


Load the TMDB dataset and preprocess it

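TMDb movie database. The dataset is described as containing basic information on nearly 11,000 films released between 1960 and 2016, including genre, budget, box office revenue, cast and crew, runtime, and ratings. It is used here to practise data analysis. Note that the two tmdb_5000 files loaded below are smaller: credits.info() reports 4,803 rows.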

Reference article: https://blog.csdn.net/moyue1002/article/details/80332186
python 3.7
pandas 0.23
numpy 1.18
matplotlib 2.2

import pandas as pd

credits = pd.read_csv("./tmdb_5000_credits.csv")

movies = pd.read_csv("./tmdb_5000_movies.csv")

Inspect the general information of each DataFrame

# Information about the movies table

movies.head(1)

Out[3]:

budget genres homepage id ... tagline title vote_average vote_count

0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 ... Enter the World of Pandora. Avatar 7.2 11800

 

Information about the credits table

print(credits.info())

credits.head(1)

Out[4]:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4803 entries, 0 to 4802

Data columns (total 4 columns):

movie_id 4803 non-null int64

title 4803 non-null object

cast 4803 non-null object

crew 4803 non-null object

dtypes: int64(1), object(3)

memory usage: 150.2+ KB

None

movie_id ... crew

0 19995 ... [{"credit_id": "52fe48009251416c750aca23", "de...

 

The cast column of the credits table looks odd: each cell holds a large amount of data.
Take a closer look.

# Look at the value at index 0 of the cast column in credits: it is one long string

print("cast type:", type(credits["cast"][0])) # the type is `str`, which cannot be processed directly

Out[5]:

cast type: <class 'str'>

 

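Handling JSON-formatted data: as the table shows, the cast column actually holds JSON-formatted text, so it should be processed with the json package.
Here the JSON has the form [{}, {}], that is, an array of objects.
json.loads() converts a JSON string into a Python object.
json.load() works on files instead: it reads JSON from a file object.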
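To make the difference between the two functions concrete, here is a minimal sketch. The sample string is made up in the shape of the genres column, and io.StringIO stands in for an open file, so nothing here touches the real CSV files.

```python
import io
import json

# Made-up JSON text in the same [{}, {}] shape as the dataset columns
text = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# json.loads parses a JSON document that is already held in a str
from_string = json.loads(text)

# json.load parses from a file-like object; StringIO stands in for an open file
from_file = json.load(io.StringIO(text))

print(type(from_string))
print(from_string[0]["name"])
print(from_string == from_file)
```

Both calls yield the same list of dicts; only the input differs (a string versus a file object).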

import json

type(json.loads(credits["cast"][0]))

Out[6]:

list

 

As shown above, json.loads() turned the JSON string into a list, and that list in turn contains multiple dicts.
Next, process the columns in bulk.

import json

json_col = ["cast", "crew"]

# Parse every JSON column of credits from str into Python lists
for i in json_col:
    credits[i] = credits[i].apply(json.loads)

credits["cast"][0][:3]

Out[7]:

[{"cast_id": 242,

"character": "Jake Sully",

"credit_id": "5602a8a7c3a3685532001c9a",

"gender": 2,

"id": 65731,

"name": "Sam Worthington",

"order": 0},

{"cast_id": 3,

"character": "Neytiri",

"credit_id": "52fe48009251416c750ac9cb",

"gender": 1,

"id": 8691,

"name": "Zoe Saldana",

"order": 1},

{"cast_id": 25,

"character": "Dr. Grace Augustine",

"credit_id": "52fe48009251416c750aca39",

"gender": 1,

"id": 10205,

"name": "Sigourney Weaver",

"order": 2}]

print("cast type after parsing:", type(credits["cast"][0]))

# The data type is now list, so it can be processed in a loop

Out[8]:

cast type after parsing: <class 'list'>
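As an alternative to converting the columns after loading, pandas can apply a function to a column while reading the CSV through the converters argument of read_csv. The sketch below is not what this walkthrough does; it uses a made-up two-row frame written to an in-memory CSV as a stand-in for tmdb_5000_credits.csv.

```python
import io
import json

import pandas as pd

# Hypothetical two-row stand-in for tmdb_5000_credits.csv
sample = pd.DataFrame({
    "movie_id": [1, 2],
    "cast": [
        json.dumps([{"name": "Actor A"}, {"name": "Actor B"}]),
        json.dumps([]),
    ],
})
csv_text = sample.to_csv(index=False)

# converters maps a column name to a function applied to each raw cell string
parsed = pd.read_csv(io.StringIO(csv_text), converters={"cast": json.loads})

print(type(parsed["cast"][0]))
print(parsed["cast"][0][1]["name"])
```

With this approach the cast cells arrive as lists straight away, at the cost of failing at load time if any cell is not valid JSON.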

Extract the names

credits["cast"][0][:3]

# The cast value in the first row of credits is a list

Out[9]:

[{"cast_id": 242,

"character": "Jake Sully",

"credit_id": "5602a8a7c3a3685532001c9a",

"gender": 2,

"id": 65731,

"name": "Sam Worthington",

"order": 0},

{"cast_id": 3,

"character": "Neytiri",

"credit_id": "52fe48009251416c750ac9cb",

"gender": 1,

"id": 8691,

"name": "Zoe Saldana",

"order": 1},

{"cast_id": 25,

"character": "Dr. Grace Augustine",

"credit_id": "52fe48009251416c750aca39",

"gender": 1,

"id": 10205,

"name": "Sigourney Weaver",

"order": 2}]

credits["cast"][0][0]["name"] # get the name from the first dict in the first row

Out[10]:

"Sam Worthington"

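Common dict methods: dict.get() returns the value for the given key, or a default value if the key is not in the dict (None unless a default is supplied).
dict.items() returns an iterable view of the dict's (key, value) tuples.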

# A quick test:

i = credits["cast"][0][0]

for x in i.items():
    print(x)

Out[11]:

("cast_id", 242)

("character", "Jake Sully")

("credit_id", "5602a8a7c3a3685532001c9a")

("gender", 2)

("id", 65731)

("name", "Sam Worthington")

("order", 0)
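The test above covers items(); the behaviour of get() with and without a default can be sketched the same way. The dict below is a shortened, hand-written copy of the cast entry shown earlier, and "job" is deliberately a key it does not contain.

```python
# Shortened hand-written copy of one cast entry
member = {"name": "Sam Worthington", "order": 0}

print(member.get("name"))            # key present: its value is returned
print(member.get("job"))             # key absent, no default: None is returned
print(member.get("job", "unknown"))  # key absent: the supplied default is returned

# items() yields (key, value) pairs, which can be unpacked in the loop header
for key, value in member.items():
    print(key, value)
```

Unlike member["job"], which raises KeyError, get() never fails on a missing key.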

 

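Create a get_names() function that reduces each cast entry to a comma-separated string of names

def get_names(x):
    # x is a list of dicts; join the "name" value of each one
    return ",".join(i["name"] for i in x)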

credits["cast"] = credits["cast"].apply(get_names)

credits["cast"][:3]

Out[12]:

0 Sam Worthington,Zoe Saldana,Sigourney Weaver,S...

1 Johnny Depp,Orlando Bloom,Keira Knightley,Stel...

2 Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...

Name: cast, dtype: object
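get_names() assumes every dict in the list has a "name" key, which holds for the rows shown here. If that assumption might not hold, a more tolerant variant could look like the sketch below; get_names_safe is a hypothetical name, not part of the original code.

```python
def get_names_safe(entries):
    """Join the "name" of each dict in entries, skipping anything without one."""
    return ",".join(
        entry["name"]
        for entry in entries
        if isinstance(entry, dict) and entry.get("name")
    )


# The middle entry has no "name" key and is skipped instead of raising KeyError
print(get_names_safe([{"name": "Sam Worthington"}, {"id": 8691}, {"name": "Zoe Saldana"}]))

# An empty list gives an empty string
print(repr(get_names_safe([])))
```

Note that joining with "," is lossy in either version: a name that itself contains a comma cannot be split back out reliably.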

 

Extract the director from crew

credits["crew"][0][0]

Out[13]:

{"credit_id": "52fe48009251416c750aca23",

"department": "Editing",

"gender": 0,

"id": 1721,

"job": "Editor",

"name": "Stephen E. Rivkin"}

# Loop over the crew list, find the entry whose job is "Director", and return its name

def director(x):
    for i in x:
        if i["job"] == "Director":
            return i["name"]

credits["crew"] = credits["crew"].apply(director)

print(credits[["crew"]][:3])

credits.rename(columns={"crew": "director"}, inplace=True) # rename the column

credits[["director"]][:3]

Out[14]:

crew

0 James Cameron

1 Gore Verbinski

2 Sam Mendes
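One detail worth knowing: director() returns None implicitly when a crew list has no entry with job "Director", so such rows end up as missing values. A sketch that makes the fallback explicit is shown below; find_director and its default parameter are hypothetical names, and the crew lists are trimmed, hand-written samples.

```python
def find_director(crew, default=None):
    """Return the name of the first crew member whose job is "Director"."""
    return next(
        (member["name"] for member in crew if member.get("job") == "Director"),
        default,
    )


# Trimmed, hand-written crew lists
crew_with_director = [
    {"job": "Editor", "name": "Stephen E. Rivkin"},
    {"job": "Director", "name": "James Cameron"},
]
crew_without_director = [{"job": "Editor", "name": "Stephen E. Rivkin"}]

print(find_director(crew_with_director))
print(find_director(crew_without_director))
print(find_director(crew_without_director, default="unknown"))
```

Like the original function, this keeps only the first director when a film lists several.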

Parse the JSON columns of the movies table

>>> movies.head(1)

Out[15]:

budget genres homepage id ... tagline title vote_average vote_count

0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 ... Enter the World of Pandora. Avatar 7.2 11800

 

The columns genres, keywords, spoken_languages, production_countries, and production_companies need JSON parsing.

# Same approach as for the credits table

json_col = ["genres", "keywords", "spoken_languages", "production_countries", "production_companies"]

for i in json_col:
    movies[i] = movies[i].apply(json.loads)  # str -> list of dicts
    movies[i] = movies[i].apply(get_names)   # list of dicts -> comma-separated names

>>> movies.head(1)

Out[16]:

budget genres homepage id ... tagline title vote_average vote_count

0 237000000 Action,Adventure,Fantasy,Science Fiction http://www.avatarmovie.com/ 19995 ... Enter the World of Pandora. Avatar 7.2 11800
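The columns to parse were picked by eye from the head() output. As a rough sketch, they could also be found programmatically on a freshly loaded frame by testing whether every cell parses as a JSON list. looks_like_json_list is a hypothetical helper, and the one-row frame is a made-up stand-in for the raw movies table.

```python
import json

import pandas as pd


def looks_like_json_list(value):
    """Return True if value is a str that parses as a JSON array."""
    if not isinstance(value, str) or not value.startswith("["):
        return False
    try:
        return isinstance(json.loads(value), list)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


# Made-up one-row stand-in for the raw (unparsed) movies table
sample = pd.DataFrame({
    "title": ["Avatar"],
    "budget": [237000000],
    "genres": ['[{"id": 28, "name": "Action"}]'],
    "keywords": ["[]"],
})

# Keep the columns in which every cell looks like a JSON list
json_like = [c for c in sample.columns if sample[c].map(looks_like_json_list).all()]
print(json_like)
```

This check requires every cell in a column to parse, so a column containing missing values would be skipped and would need separate handling.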

 

That completes the data preprocessing.
