pandas.io.json.json_normalize with very nested JSON

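I have been trying to normalize a very nested JSON file that I will analyze later. What I am struggling with is how to go more than one level deep when normalizing.

I went through the pandas.io.json.json_normalize documentation, since it does exactly what I want.

I have been able to normalize part of it and now understand how dictionaries work, but I am still not there.

With the code below, I can only get the first level.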

import json
import pandas as pd
from pandas.io.json import json_normalize

with open('authors_sample.json') as f:
    d = json.load(f)

raw = json_normalize(d['hits']['hits'])

authors = json_normalize(data = d['hits']['hits'],
                         record_path = '_source',
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'],
                                 ['_source', 'normalized_venue_name']
                                 ])

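I am trying to "dig" into the 'authors' dictionary with the code below, but record_path = ['_source', 'authors'] throws TypeError: string indices must be integers. As far as I understand json_normalize, the logic should be right, but I still don't quite understand how to dive into JSON with a dict vs. a list.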

I even went through this simple example.

authors = json_normalize(data = d['hits']['hits'],
                         record_path = ['_source', 'authors'],
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'],
                                 ['_source', 'normalized_venue_name']
                                 ])
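For reference, a TypeError with this message is what plain Python raises when a string is indexed with another string. The sketch below shows one way that can happen: iterating over a dict yields its string keys, not records. The `hit` value is a trimmed stand-in for one element of d['hits']['hits'], and `first_key_lookup` is a hypothetical helper; whether a given pandas version takes this exact path internally is an assumption.

```python
# A trimmed stand-in for one element of d['hits']['hits'] (illustrative only).
hit = {'_id': '7521A721',
       '_source': {'authors': [{'author_id': '7FF872BC'}],
                   'journal': 'The American Historical Review'}}


def first_key_lookup(container, key):
    """Treat `container` as a list of records and index the first one by `key`."""
    for record in container:
        return record[key]


# hit['_source'] is a dict, not a list of dicts, so iterating over it
# yields string keys such as 'authors', and 'authors'['authors'] fails.
try:
    first_key_lookup(hit['_source'], 'authors')
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError

# Wrapping the dict in a list gives the loop a real record to index.
authors = first_key_lookup([hit['_source']], 'authors')
print(authors)  # [{'author_id': '7FF872BC'}]
```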

Below is a chunk of the JSON file (5 records).

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},

u'hits': {u'hits': [{u'_id': u'7CB3F2AD',

u'_index': u'scibase_listings',

u'_score': 1.0,

u'_source': {u'authors': None,

u'deleted': 0,

u'description': None,

u'doi': u'',

u'is_valid': 1,

u'issue': None,

u'journal': u'Physical Review Letters',

u'link': None,

u'meta_description': None,

u'meta_keywords': None,

u'normalized_venue_name': u'phys rev lett',

u'pages': None,

u'parent_keywords': [u'Chromatography',

u'Quantum mechanics',

u'Particle physics',

u'Quantum field theory',

u'Analytical chemistry',

u'Quantum chromodynamics',

u'Physics',

u'Mass spectrometry',

u'Chemistry'],

u'pub_date': u'1987-03-02 00:00:00',

u'pubtype': None,

u'rating_avg_weighted': 0,

u'rating_clarity': 0.0,

u'rating_clarity_weighted': 0.0,

u'rating_innovation': 0.0,

u'rating_innovation_weighted': 0.0,

u'rating_num_weighted': 0,

u'rating_reproducability': 0,

u'rating_reproducibility_weighted': 0.0,

u'rating_versatility': 0.0,

u'rating_versatility_weighted': 0.0,

u'review_count': 0,

u'tag': [u'mass spectra', u'elementary particles', u'bound states'],

u'title': u'Evidence for a new meson: A quasinuclear NN-bar bound state',

u'userAvg': 0.0,

u'user_id': None,

u'venue_name': u'Physical Review Letters',

u'views_count': 0,

u'volume': None},

u'_type': u'listing'},

{u'_id': u'7AF8EBC3',

u'_index': u'scibase_listings',

u'_score': 1.0,

u'_source': {u'authors': [{u'affiliations': [u'Punjabi University'],

u'author_id': u'780E3459',

u'author_name': u'munish puri'},

{u'affiliations': [u'Punjabi University'],

u'author_id': u'48D92C79',

u'author_name': u'rajesh dhaliwal'},

{u'affiliations': [u'Punjabi University'],

u'author_id': u'7D9BD37C',

u'author_name': u'r s singh'}],

u'deleted': 0,

u'description': None,

u'doi': u'',

u'is_valid': 1,

u'issue': None,

u'journal': u'Journal of Industrial Microbiology & Biotechnology',

u'link': None,

u'meta_description': None,

u'meta_keywords': None,

u'normalized_venue_name': u'j ind microbiol biotechnol',

u'pages': None,

u'parent_keywords': [u'Nuclear medicine',

u'Psychology',

u'Hydrology',

u'Chromatography',

u'X-ray crystallography',

u'Nuclear fusion',

u'Medicine',

u'Fluid dynamics',

u'Thermodynamics',

u'Physics',

u'Gas chromatography',

u'Radiobiology',

u'Engineering',

u'Organic chemistry',

u'High-performance liquid chromatography',

u'Chemistry',

u'Organic synthesis',

u'Psychotherapist'],

u'pub_date': u'2008-04-04 00:00:00',

u'pubtype': None,

u'rating_avg_weighted': 0,

u'rating_clarity': 0.0,

u'rating_clarity_weighted': 0.0,

u'rating_innovation': 0.0,

u'rating_innovation_weighted': 0.0,

u'rating_num_weighted': 0,

u'rating_reproducability': 0,

u'rating_reproducibility_weighted': 0.0,

u'rating_versatility': 0.0,

u'rating_versatility_weighted': 0.0,

u'review_count': 0,

u'tag': [u'flow rate',

u'operant conditioning',

u'packed bed reactor',

u'immobilized enzyme',

u'specific activity'],

u'title': u'Development of a stable continuous flow immobilized enzyme reactor for the hydrolysis of inulin',

u'userAvg': 0.0,

u'user_id': None,

u'venue_name': u'Journal of Industrial Microbiology & Biotechnology',

u'views_count': 0,

u'volume': None},

u'_type': u'listing'},

{u'_id': u'7521A721',

u'_index': u'scibase_listings',

u'_score': 1.0,

u'_source': {u'authors': [{u'author_id': u'7FF872BC',

u'author_name': u'barbara eileen ryan'}],

u'deleted': 0,

u'description': None,

u'doi': u'',

u'is_valid': 1,

u'issue': None,

u'journal': u'The American Historical Review',

u'link': None,

u'meta_description': None,

u'meta_keywords': None,

u'normalized_venue_name': u'american historical review',

u'pages': None,

u'parent_keywords': [u'Social science',

u'Politics',

u'Sociology',

u'Law'],

u'pub_date': u'1992-01-01 00:00:00',

u'pubtype': None,

u'rating_avg_weighted': 0,

u'rating_clarity': 0.0,

u'rating_clarity_weighted': 0.0,

u'rating_innovation': 0.0,

u'rating_innovation_weighted': 0.0,

u'rating_num_weighted': 0,

u'rating_reproducability': 0,

u'rating_reproducibility_weighted': 0.0,

u'rating_versatility': 0.0,

u'rating_versatility_weighted': 0.0,

u'review_count': 0,

u'tag': [u'social movements'],

u'title': u"Feminism and the women's movement : dynamics of change in social movement ideology, and activism",

u'userAvg': 0.0,

u'user_id': None,

u'venue_name': u'The American Historical Review',

u'views_count': 0,

u'volume': None},

u'_type': u'listing'},

{u'_id': u'7DAEB9A4',

u'_index': u'scibase_listings',

u'_score': 1.0,

u'_source': {u'authors': [{u'author_id': u'0299B8E9',

u'author_name': u'fraser j harbutt'}],

u'deleted': 0,

u'description': None,

u'doi': u'',

u'is_valid': 1,

u'issue': None,

u'journal': u'The American Historical Review',

u'link': None,

u'meta_description': None,

u'meta_keywords': None,

u'normalized_venue_name': u'american historical review',

u'pages': None,

u'parent_keywords': [u'Superconductivity',

u'Nuclear fusion',

u'Geology',

u'Chemistry',

u'Metallurgy'],

u'pub_date': u'1988-01-01 00:00:00',

u'pubtype': None,

u'rating_avg_weighted': 0,

u'rating_clarity': 0.0,

u'rating_clarity_weighted': 0.0,

u'rating_innovation': 0.0,

u'rating_innovation_weighted': 0.0,

u'rating_num_weighted': 0,

u'rating_reproducability': 0,

u'rating_reproducibility_weighted': 0.0,

u'rating_versatility': 0.0,

u'rating_versatility_weighted': 0.0,

u'review_count': 0,

u'tag': [u'iron'],

u'title': u'The iron curtain : Churchill, America, and the origins of the Cold War',

u'userAvg': 0.0,

u'user_id': None,

u'venue_name': u'The American Historical Review',

u'views_count': 0,

u'volume': None},

u'_type': u'listing'},

{u'_id': u'7B3236C5',

u'_index': u'scibase_listings',

u'_score': 1.0,

u'_source': {u'authors': [{u'author_id': u'7DAB7B72',

u'author_name': u'richard m freeland'}],

u'deleted': 0,

u'description': None,

u'doi': u'',

u'is_valid': 1,

u'issue': None,

u'journal': u'The American Historical Review',

u'link': None,

u'meta_description': None,

u'meta_keywords': None,

u'normalized_venue_name': u'american historical review',

u'pages': None,

u'parent_keywords': [u'Political Science', u'Economics'],

u'pub_date': u'1985-01-01 00:00:00',

u'pubtype': None,

u'rating_avg_weighted': 0,

u'rating_clarity': 0.0,

u'rating_clarity_weighted': 0.0,

u'rating_innovation': 0.0,

u'rating_innovation_weighted': 0.0,

u'rating_num_weighted': 0,

u'rating_reproducability': 0,

u'rating_reproducibility_weighted': 0.0,

u'rating_versatility': 0.0,

u'rating_versatility_weighted': 0.0,

u'review_count': 0,

u'tag': [u'foreign policy'],

u'title': u'The Truman Doctrine and the origins of McCarthyism : foreign policy, domestic politics, and internal security, 1946-1948',

u'userAvg': 0.0,

u'user_id': None,

u'venue_name': u'The American Historical Review',

u'views_count': 0,

u'volume': None},

u'_type': u'listing'}],

u'max_score': 1.0,

u'total': 36429433},

u'timed_out': False,

u'took': 170}

Answer:

In the pandas example (below), what do the brackets mean? Is there a logic to follow to go deeper with the []? […]

result = json_normalize(data, 'counties', ['state', 'shortname',
                                           ['info', 'governor']])

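Each string or list of strings in the ['state', 'shortname', ['info', 'governor']] value is a path to an element to include in addition to the selected rows. The second json_normalize() argument (record_path, set to 'counties' in the documentation example) tells the function how to select the elements from the input data structure that make up the rows of the output, and the meta paths add further metadata that will be included with each of those rows. Think of these as table joins in a database, if you will.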
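As a rough illustration of what "a path to an element" means, here is a minimal plain-Python sketch that resolves each meta path against one state record from the documentation example. `pull_field` is a hypothetical helper written for this sketch, not a pandas function.

```python
def pull_field(obj, path):
    """Follow `path` (a single key or a list of keys) down into nested dicts."""
    if isinstance(path, str):
        path = [path]
    for key in path:
        obj = obj[key]
    return obj


# One top-level record from the documentation example.
state = {'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'}}

# A plain string is a one-step path; a list of strings is a nested path.
meta = ['state', 'shortname', ['info', 'governor']]
values = [pull_field(state, path) for path in meta]
print(values)  # ['Florida', 'FL', 'Rick Scott']
```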

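The input for the US states documentation example has two dictionaries in a list, and both of these dictionaries have a counties key that references another list of dicts: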

>>> from pprint import pprint
>>> data = [{'state': 'Florida',
...          'shortname': 'FL',
...          'info': {'governor': 'Rick Scott'},
...          'counties': [{'name': 'Dade', 'population': 12345},
...                       {'name': 'Broward', 'population': 40000},
...                       {'name': 'Palm Beach', 'population': 60000}]},
...         {'state': 'Ohio',
...          'shortname': 'OH',
...          'info': {'governor': 'John Kasich'},
...          'counties': [{'name': 'Summit', 'population': 1234},
...                       {'name': 'Cuyahoga', 'population': 1337}]}]
>>> pprint(data[0]['counties'])
[{'name': 'Dade', 'population': 12345},
 {'name': 'Broward', 'population': 40000},
 {'name': 'Palm Beach', 'population': 60000}]
>>> pprint(data[1]['counties'])
[{'name': 'Summit', 'population': 1234},
 {'name': 'Cuyahoga', 'population': 1337}]

Between them, there are 5 rows of data to use in the output:

>>> json_normalize(data, 'counties')
         name  population
0        Dade       12345
1     Broward       40000
2  Palm Beach       60000
3      Summit        1234
4    Cuyahoga        1337
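The row selection can be mimicked in plain Python, which may make the count of 5 easier to see. This is only a sketch of the idea, using a trimmed copy of the documentation data; it is not how pandas implements it.

```python
# Trimmed copy of the documentation example (only the keys needed here).
data = [{'state': 'Florida',
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

# One output row per element of every 'counties' list, in input order.
rows = [county for state in data for county in state['counties']]

# How many rows each top-level dictionary contributed.
lengths = [len(state['counties']) for state in data]

print(len(rows), lengths)  # 5 [3, 2]
```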

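The meta argument then names some elements that sit next to those counties lists, and those are merged in separately. The values of those meta elements from the first dictionary, data[0], are ('Florida', 'FL', 'Rick Scott'), and from data[1] they are ('Ohio', 'OH', 'John Kasich'). You see those values attached to the counties rows that came from the same top-level dictionary, repeated 3 and 2 times respectively: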

>>> data[0]['state'], data[0]['shortname'], data[0]['info']['governor']
('Florida', 'FL', 'Rick Scott')
>>> data[1]['state'], data[1]['shortname'], data[1]['info']['governor']
('Ohio', 'OH', 'John Kasich')
>>> json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

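So, if you pass in a list for the meta argument, each element in the list is a separate path, and each of those paths identifies data to add to the rows in the output.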
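The join-like behaviour described above can be sketched without pandas. `normalize_sketch` below is a hypothetical helper that copies each parent's meta values onto every record taken from that parent; the dotted column name mirrors the info.governor column in the output above. It is a simplified illustration, not the pandas implementation.

```python
def normalize_sketch(data, record_key, meta):
    """Return one dict per record, with the parent's meta values copied in."""
    rows = []
    for parent in data:
        meta_values = {}
        for path in meta:
            # A plain string is a one-step path; a list is a nested path.
            keys = [path] if isinstance(path, str) else path
            value = parent
            for key in keys:
                value = value[key]
            meta_values['.'.join(keys)] = value
        # Every record from this parent gets the same meta values.
        for record in parent[record_key]:
            rows.append({**record, **meta_values})
    return rows


data = [{'state': 'Florida', 'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio', 'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

rows = normalize_sketch(data, 'counties',
                        ['state', 'shortname', ['info', 'governor']])
for row in rows:
    print(row)
```

Each of the three separate meta paths ends up as its own key on every row, which is the same shape as the five-row table above.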

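In your example JSON, there are only a few nested lists to elevate with the first argument, as 'counties' was in the example. The only one in that data structure is the nested 'authors' key. You have to extract each ['_source', 'authors'] path, after which you can add other keys from the parent object to augment those rows.

The meta argument then pulls in the _id key from the outermost objects, followed by the nested ['_source', 'title'] and ['_source', 'journal'] paths.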

The record_path argument takes the authors lists as the starting point. These look like:

>>> d['hits']['hits'][0]['_source']['authors']   # this value is None, and is skipped
>>> d['hits']['hits'][1]['_source']['authors']
[{'affiliations': ['Punjabi University'],
  'author_id': '780E3459',
  'author_name': 'munish puri'},
 {'affiliations': ['Punjabi University'],
  'author_id': '48D92C79',
  'author_name': 'rajesh dhaliwal'},
 {'affiliations': ['Punjabi University'],
  'author_id': '7D9BD37C',
  'author_name': 'r s singh'}]
>>> d['hits']['hits'][2]['_source']['authors']
[{'author_id': '7FF872BC',
  'author_name': 'barbara eileen ryan'}]
>>> # etc.

which gives you these rows:

>>> json_normalize(d['hits']['hits'], ['_source', 'authors'])
           affiliations author_id          author_name
0  [Punjabi University]  780E3459          munish puri
1  [Punjabi University]  48D92C79      rajesh dhaliwal
2  [Punjabi University]  7D9BD37C            r s singh
3                   NaN  7FF872BC  barbara eileen ryan
4                   NaN  0299B8E9     fraser j harbutt
5                   NaN  7DAB7B72   richard m freeland
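To make the walk along a nested record path concrete, here is a small plain-Python sketch. `iter_records` is a hypothetical helper, and `hits` is a heavily trimmed copy of the sample data (the second hit has three authors in the original). Skipping a None value is written out explicitly here to match the behaviour described above.

```python
def iter_records(items, record_path):
    """Yield (parent, record) pairs for the list found at `record_path`."""
    for item in items:
        value = item
        for key in record_path:
            value = value[key]
        if value is None:
            # e.g. 'authors': None, so this hit contributes no rows.
            continue
        for record in value:
            yield item, record


# Trimmed stand-in for d['hits']['hits'] (illustrative only).
hits = [
    {'_id': '7CB3F2AD', '_source': {'authors': None}},
    {'_id': '7AF8EBC3', '_source': {'authors': [
        {'author_id': '780E3459', 'author_name': 'munish puri'},
        {'author_id': '48D92C79', 'author_name': 'rajesh dhaliwal'}]}},
    {'_id': '7521A721', '_source': {'authors': [
        {'author_id': '7FF872BC', 'author_name': 'barbara eileen ryan'}]}},
]

pairs = [(parent['_id'], record['author_name'])
         for parent, record in iter_records(hits, ['_source', 'authors'])]
for pair in pairs:
    print(pair)
```

Keeping the parent alongside each record is what later lets keys such as _id be attached to every author row.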

and then we can use the third argument, meta, to add more columns such as _id, _source.title and _source.journal, using ['_id', ['_source', 'journal'], ['_source', 'title']]:

>>> json_normalize(
...     d['hits']['hits'],
...     ['_source', 'authors'],
...     ['_id', ['_source', 'journal'], ['_source', 'title']]
... )
           affiliations author_id          author_name       _id  \
0  [Punjabi University]  780E3459          munish puri  7AF8EBC3
1  [Punjabi University]  48D92C79      rajesh dhaliwal  7AF8EBC3
2  [Punjabi University]  7D9BD37C            r s singh  7AF8EBC3
3                   NaN  7FF872BC  barbara eileen ryan  7521A721
4                   NaN  0299B8E9     fraser j harbutt  7DAEB9A4
5                   NaN  7DAB7B72   richard m freeland  7B3236C5

                                     _source.journal  \
0  Journal of Industrial Microbiology & Biotechno...
1  Journal of Industrial Microbiology & Biotechno...
2  Journal of Industrial Microbiology & Biotechno...
3                     The American Historical Review
4                     The American Historical Review
5                     The American Historical Review

                                       _source.title
0  Development of a stable continuous flow immobi...
1  Development of a stable continuous flow immobi...
2  Development of a stable continuous flow immobi...
3  Feminism and the women's movement : dynamics o...
4  The iron curtain : Churchill, America, and the...
5  The Truman Doctrine and the origins of McCarth...
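To my knowledge, pandas 1.0 and later also expose this function as pandas.json_normalize, and the pandas.io.json import path is deprecated there. The sketch below repeats the call on a trimmed, inline copy of the sample data so that it runs without authors_sample.json. Filtering out hits whose authors value is None before the call is a defensive assumption of this sketch, so that it does not depend on how a particular pandas version treats a null record list.

```python
import pandas as pd

# Trimmed stand-in for d['hits']['hits'] (illustrative only).
hits = [
    {'_id': '7CB3F2AD',
     '_source': {'authors': None,
                 'journal': 'Physical Review Letters',
                 'title': 'Evidence for a new meson: '
                          'A quasinuclear NN-bar bound state'}},
    {'_id': '7AF8EBC3',
     '_source': {'authors': [{'author_id': '780E3459',
                              'author_name': 'munish puri'},
                             {'author_id': '48D92C79',
                              'author_name': 'rajesh dhaliwal'}],
                 'journal': 'Journal of Industrial Microbiology & Biotechnology',
                 'title': 'Development of a stable continuous flow immobilized '
                          'enzyme reactor for the hydrolysis of inulin'}},
    {'_id': '7521A721',
     '_source': {'authors': [{'author_id': '7FF872BC',
                              'author_name': 'barbara eileen ryan'}],
                 'journal': 'The American Historical Review',
                 'title': "Feminism and the women's movement"}},
]

# Defensive step (an assumption of this sketch): drop hits without authors.
hits_with_authors = [h for h in hits if h['_source']['authors']]

authors = pd.json_normalize(
    hits_with_authors,
    record_path=['_source', 'authors'],
    meta=['_id', ['_source', 'journal'], ['_source', 'title']],
)
print(authors[['author_name', '_id', '_source.journal']])
```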

Source: utcz.com/qa/416216.html
