Python Exercise: Preprocessing the TMDB Movie Dataset
Load the TMDB dataset and preprocess it.
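The TMDb movie database: the dataset contains basic information on nearly 11,000 movies released between 1960 and 2016, mainly genre, budget, box-office revenue, cast and crew, runtime, and ratings. It is used here to practice data analysis. The credits file loaded in this exercise has 4,803 rows.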
Reference article: https://blog.csdn.net/moyue1002/article/details/80332186
Python 3.7
pandas 0.23
numpy 1.18
matplotlib 2.2
import pandas as pd
credits = pd.read_csv("./tmdb_5000_credits.csv")
movies = pd.read_csv("./tmdb_5000_movies.csv")
View general information about each DataFrame
# This is the movies table
movies.head(1)
Out[3]:
budget genres homepage id ... tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 ... Enter the World of Pandora. Avatar 7.2 11800
This is the credits table:
print(credits.info())
credits.head(1)
Out[4]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
movie_id    4803 non-null int64
title       4803 non-null object
cast        4803 non-null object
crew        4803 non-null object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
None
movie_id ... crew
0 19995 ... [{"credit_id": "52fe48009251416c750aca23", "de...
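The cast column of the credits table looks odd and holds a lot of data.
Take a closer look:
# credits["cast"][0] turns out to be one very long string
print("cast type:", type(credits["cast"][0]))  # the type is str, so it cannot be processed directly
Out[5]:
cast type: <class 'str'>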
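Processing the JSON-formatted data
The table shows that the cast column is actually JSON-formatted data, so it should be handled with the json package.
The JSON here has the form [{},{}], that is, an array of objects.
json.loads() converts a JSON string into a Python object.
json.load() works on files: it reads JSON from a file object.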
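To make the difference between the two functions concrete, here is a minimal sketch. The string raw is a shortened, hand-written stand-in for one cell of the cast column, and io.StringIO is used only to imitate an open file; neither comes from the original article.

```python
import io
import json

# Shortened stand-in for one cell of the cast column (assumption, not real data).
raw = '[{"name": "Sam Worthington", "order": 0}, {"name": "Zoe Saldana", "order": 1}]'

# json.loads parses a string that is already in memory.
from_string = json.loads(raw)

# json.load reads from a file-like object; io.StringIO imitates an open file here.
from_file = json.load(io.StringIO(raw))

print(type(from_string).__name__)   # list
print(from_string[0]["name"])       # Sam Worthington
print(from_string == from_file)     # True
```

Because the CSV cells are already strings in memory, json.loads() is the one that fits this exercise.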
import json
type(json.loads(credits["cast"][0]))
Out[6]:
list
As shown above, json.loads() turned the JSON string into a list, and that list in turn wraps multiple dicts.
Next, process the columns in bulk.
import json
json_col = ["cast", "crew"]
for i in json_col:
    credits[i] = credits[i].apply(json.loads)
>>> credits["cast"][0][:3]
Out[7]:
[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0},
{"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1},
{"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}]
print("cast type is now:", type(credits["cast"][0]))  # the data type is now list, so it can be looped over
Out[8]:
cast type is now: <class 'list'>
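The bulk conversion can be tried without the CSV files. The sketch below uses a two-row DataFrame called sample whose contents are invented for illustration. It only shows that apply(json.loads) turns every cell of the listed columns into a list, including an empty JSON array.

```python
import json

import pandas as pd

# Hypothetical two-row stand-in for the credits table (values are made up).
sample = pd.DataFrame({
    "movie_id": [1, 2],
    "cast": ['[{"name": "A", "order": 0}]', "[]"],
    "crew": ['[{"job": "Director", "name": "B"}]',
             '[{"job": "Editor", "name": "C"}]'],
})

json_col = ["cast", "crew"]
for col in json_col:
    # Each cell is a JSON string; json.loads turns it into a list of dicts.
    sample[col] = sample[col].apply(json.loads)

# Check that every parsed cell really is a list.
all_lists = all(isinstance(v, list) for col in json_col for v in sample[col])
print(all_lists)                            # True
print(sample["cast"].apply(len).tolist())   # [1, 0]
```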
Extract the names
credits["cast"][0][:3]  # the cast of the first row of credits; it is a list
Out[9]:
[{"cast_id": 242,
"character": "Jake Sully",
"credit_id": "5602a8a7c3a3685532001c9a",
"gender": 2,
"id": 65731,
"name": "Sam Worthington",
"order": 0},
{"cast_id": 3,
"character": "Neytiri",
"credit_id": "52fe48009251416c750ac9cb",
"gender": 1,
"id": 8691,
"name": "Zoe Saldana",
"order": 1},
{"cast_id": 25,
"character": "Dr. Grace Augustine",
"credit_id": "52fe48009251416c750aca39",
"gender": 1,
"id": 10205,
"name": "Sigourney Weaver",
"order": 2}]
credits["cast"][0][0]["name"]  # get the name from the first dict of the first row
Out[10]:
"Sam Worthington"
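Common dict methods
dict.get() returns the value for the given key, or a default value if the key is not in the dict.
dict.items() returns a view of the (key, value) pairs that can be iterated over.
# Quick test:
i = credits["cast"][0][0]
for x in i.items():
    print(x)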
Out[11]:
("cast_id", 242)
("character", "Jake Sully")
("credit_id", "5602a8a7c3a3685532001c9a")
("gender", 2)
("id", 65731)
("name", "Sam Worthington")
("order", 0)
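The test above only exercises items(). As a complement, here is a small sketch of get() with and without a default. The dict member is a trimmed copy of the first cast entry shown earlier, and the "job" key is deliberately absent to show the fallback.

```python
# Trimmed copy of the first cast entry; it has no "job" key.
member = {
    "cast_id": 242,
    "character": "Jake Sully",
    "name": "Sam Worthington",
    "order": 0,
}

name = member.get("name")                       # key exists: its value is returned
job = member.get("job")                         # key missing, no default: None
job_or_default = member.get("job", "unknown")   # key missing: the default is returned

pairs = list(member.items())                    # materialise the (key, value) view

print(name)             # Sam Worthington
print(job)              # None
print(job_or_default)   # unknown
print(pairs[0])         # ('cast_id', 242)
```

Unlike member["job"], which would raise KeyError, get() lets a loop continue over entries that lack a field.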
Create a get_names() function to reduce each cast list to its names
def get_names(x):
    return ",".join(i["name"] for i in x)
credits["cast"] = credits["cast"].apply(get_names)
credits["cast"][:3]
Out[12]:
0 Sam Worthington,Zoe Saldana,Sigourney Weaver,S...
1 Johnny Depp,Orlando Bloom,Keira Knightley,Stel...
2 Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...
Name: cast, dtype: object
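It can help to see how get_names() behaves on a plain list before applying it to a whole column. The sketch below repeats the function and adds get_top_names(), a hypothetical variant that is not in the original article and keeps only the first n names. It assumes the list is already in billing order, as in the sample above.

```python
def get_names(x):
    # Join the "name" field of every dict in the list with commas.
    return ",".join(i["name"] for i in x)


def get_top_names(x, n=3):
    # Hypothetical variant: keep only the first n entries.
    # Assumes the list is already sorted by billing order.
    return ",".join(i["name"] for i in x[:n])


cast = [
    {"name": "Sam Worthington", "order": 0},
    {"name": "Zoe Saldana", "order": 1},
    {"name": "Sigourney Weaver", "order": 2},
]

print(get_names(cast))         # Sam Worthington,Zoe Saldana,Sigourney Weaver
print(repr(get_names([])))     # '' (an empty list gives an empty string)
print(get_top_names(cast, 2))  # Sam Worthington,Zoe Saldana
```

An empty list yields an empty string rather than an error, so movies with no recorded cast end up with "" instead of a missing value.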
Extract the director from crew
credits["crew"][0][0]
Out[13]:
{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}
# Loop over the crew list, find the entry whose job is "Director", and return that name
def director(x):
    for i in x:
        if i["job"] == "Director":
            return i["name"]
credits["crew"] = credits["crew"].apply(director)
print(credits[["crew"]][:3])
credits.rename(columns={"crew": "director"}, inplace=True)  # rename the column
credits[["director"]][:3]
Out[14]:
crew
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
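Note that director() returns the first match and falls through to an implicit None when no crew member has the job "Director". The sketch below shows that behaviour on two hand-written crew lists. It also shows director_or_default(), a hypothetical alternative that is not from the original article and uses next() with an explicit default.

```python
def director(x):
    # Return the name of the first crew member whose job is "Director".
    for i in x:
        if i["job"] == "Director":
            return i["name"]
    # No match: the function ends here and returns None implicitly.


def director_or_default(x, default=""):
    # Hypothetical variant: same search, but with an explicit fallback value.
    return next((i["name"] for i in x if i.get("job") == "Director"), default)


crew_with = [
    {"job": "Editor", "name": "Stephen E. Rivkin"},
    {"job": "Director", "name": "James Cameron"},
]
crew_without = [{"job": "Editor", "name": "Stephen E. Rivkin"}]

print(director(crew_with))                        # James Cameron
print(director(crew_without))                     # None
print(repr(director_or_default(crew_without)))    # ''
```

Whether a missing director should be None or an empty string depends on the later analysis; None is picked up by pandas checks such as isnull().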
Parse the JSON columns of the movies table
>>> movies.head(1)
Out[15]:
budget genres homepage id ... tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 ... Enter the World of Pandora. Avatar 7.2 11800
The genres, keywords, spoken_languages, production_countries, and production_companies columns all hold JSON strings and need to be parsed.
# Same approach as for the credits table
json_col = ["genres", "keywords", "spoken_languages", "production_countries", "production_companies"]
for i in json_col:
    movies[i] = movies[i].apply(json.loads)
    movies[i] = movies[i].apply(get_names)
>>> movies.head(1)
Out[16]:
budget genres homepage id ... tagline title vote_average vote_count
0 237000000 Action,Adventure,Fantasy,Science Fiction http://www.avatarmovie.com/ 19995 ... Enter the World of Pandora. Avatar 7.2 11800
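The two apply() calls per column can also be folded into one helper. The sketch below is one possible way to do that, not the article's method: json_to_names() is a hypothetical name, and the two-row sample frame is invented, with the first genres cell abbreviated from the Avatar row.

```python
import json

import pandas as pd


def json_to_names(cell):
    # Parse a JSON array string and join the "name" fields with commas.
    return ",".join(item["name"] for item in json.loads(cell))


# Hypothetical stand-in for the movies table (second row is made up).
sample = pd.DataFrame({
    "title": ["Avatar", "Untitled"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
               "[]"],
    "keywords": ['[{"id": 1, "name": "placeholder"}]', "[]"],
})

for col in ["genres", "keywords"]:
    sample[col] = sample[col].apply(json_to_names)

print(sample["genres"].tolist())     # ['Action,Adventure', '']
print(sample["keywords"].tolist())   # ['placeholder', '']
```

A single pass avoids storing the intermediate lists, at the cost of no longer being able to inspect them between the two steps.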
With that, the data preprocessing is complete.