pandas.io.json.json_normalize with very nested JSON

Z时代
2024-01-10
Category: Q&A

I have been trying to normalize a very nested JSON file that I will analyze later. What I am struggling with is how to go more than one level deep when normalizing.

I went through the pandas.io.json.json_normalize documentation carefully, since it does exactly what I want.

I have been able to normalize part of it and now understand how dictionaries work, but I am still not there.

With the code below I can only get the first level.

import json
import pandas as pd
from pandas.io.json import json_normalize
with open('authors_sample.json') as f:
    d = json.load(f)
raw = json_normalize(d['hits']['hits'])
authors = json_normalize(data = d['hits']['hits'], 
                         record_path = '_source', 
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'], 
                                 ['_source', 'normalized_venue_name']
                                 ])

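I am trying to "dig" into the 'authors' dictionary with the code below, but record_path = ['_source', 'authors'] throws TypeError: string indices must be integers. As far as I understand json_normalize the logic should be sound, but I still do not quite understand how to descend into JSON that mixes dict and list.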

I even went through this simple example.

authors = json_normalize(data = d['hits']['hits'], 
                         record_path = ['_source', 'authors'], 
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'], 
                                 ['_source', 'normalized_venue_name']
                                 ])

Below is a chunk of the JSON file (5 records).

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, u'hits': {u'hits': [{u'_id': u'7CB3F2AD', u'_index': u'scibase_listings', u'_score': 1.0, u'_source': {u'authors': None, u'deleted': 0, u'description': None, u'doi': u'', u'is_valid': 1, u'issue': None, u'journal': u'Physical Review Letters', u'link': None, u'meta_description': None, u'meta_keywords': None, u'normalized_venue_name': u'phys rev lett', u'pages': None, u'parent_keywords': [u'Chromatography', u'Quantum mechanics', u'Particle physics', u'Quantum field theory', u'Analytical chemistry', u'Quantum chromodynamics', u'Physics', u'Mass spectrometry', u'Chemistry'], u'pub_date': u'1987-03-02 00:00:00', u'pubtype': None, u'rating_avg_weighted': 0, u'rating_clarity': 0.0, u'rating_clarity_weighted': 0.0, u'rating_innovation': 0.0, u'rating_innovation_weighted': 0.0, u'rating_num_weighted': 0, u'rating_reproducability': 0, u'rating_reproducibility_weighted': 0.0, u'rating_versatility': 0.0, u'rating_versatility_weighted': 0.0, u'review_count': 0, u'tag': [u'mass spectra', u'elementary particles', u'bound states'], u'title': u'Evidence for a new meson: A quasinuclear NN-bar bound state', u'userAvg': 0.0, u'user_id': None, u'venue_name': u'Physical Review Letters', u'views_count': 0, u'volume': None}, u'_type': u'listing'}, {u'_id': u'7AF8EBC3', u'_index': u'scibase_listings', u'_score': 1.0, u'_source': {u'authors': [{u'affiliations': [u'Punjabi University'], u'author_id': u'780E3459', u'author_name': u'munish puri'}, {u'affiliations': [u'Punjabi University'], u'author_id': u'48D92C79', u'author_name': u'rajesh dhaliwal'}, {u'affiliations': [u'Punjabi University'], u'author_id': u'7D9BD37C', u'author_name': u'r s singh'}], u'deleted': 0, u'description': None, u'doi': u'', u'is_valid': 1, u'issue': None, u'journal': u'Journal of Industrial Microbiology & Biotechnology', u'link': None, u'meta_description': None, u'meta_keywords': None, u'normalized_venue_name': u'j ind microbiol biotechnol', u'pages': None, 
u'parent_keywords': [u'Nuclear medicine', u'Psychology', u'Hydrology', u'Chromatography', u'X-ray crystallography', u'Nuclear fusion', u'Medicine', u'Fluid dynamics', u'Thermodynamics', u'Physics', u'Gas chromatography', u'Radiobiology', u'Engineering', u'Organic chemistry', u'High-performance liquid chromatography', u'Chemistry', u'Organic synthesis', u'Psychotherapist'], u'pub_date': u'2008-04-04 00:00:00', u'pubtype': None, u'rating_avg_weighted': 0, u'rating_clarity': 0.0, u'rating_clarity_weighted': 0.0, u'rating_innovation': 0.0, u'rating_innovation_weighted': 0.0, u'rating_num_weighted': 0, u'rating_reproducability': 0, u'rating_reproducibility_weighted': 0.0, u'rating_versatility': 0.0, u'rating_versatility_weighted': 0.0, u'review_count': 0, u'tag': [u'flow rate', u'operant conditioning', u'packed bed reactor', u'immobilized enzyme', u'specific activity'], u'title': u'Development of a stable continuous flow immobilized enzyme reactor for the hydrolysis of inulin', u'userAvg': 0.0, u'user_id': None, u'venue_name': u'Journal of Industrial Microbiology & Biotechnology', u'views_count': 0, u'volume': None}, u'_type': u'listing'}, {u'_id': u'7521A721', u'_index': u'scibase_listings', u'_score': 1.0, u'_source': {u'authors': [{u'author_id': u'7FF872BC', u'author_name': u'barbara eileen ryan'}], u'deleted': 0, u'description': None, u'doi': u'', u'is_valid': 1, u'issue': None, u'journal': u'The American Historical Review', u'link': None, u'meta_description': None, u'meta_keywords': None, u'normalized_venue_name': u'american historical review', u'pages': None, u'parent_keywords': [u'Social science', u'Politics', u'Sociology', u'Law'], u'pub_date': u'1992-01-01 00:00:00', u'pubtype': None, u'rating_avg_weighted': 0, u'rating_clarity': 0.0, u'rating_clarity_weighted': 0.0, u'rating_innovation': 0.0, u'rating_innovation_weighted': 0.0, u'rating_num_weighted': 0, u'rating_reproducability': 0, u'rating_reproducibility_weighted': 0.0, u'rating_versatility': 0.0, 
u'rating_versatility_weighted': 0.0, u'review_count': 0, u'tag': [u'social movements'], u'title': u"Feminism and the women's movement : dynamics of change in social movement ideology, and activism", u'userAvg': 0.0, u'user_id': None, u'venue_name': u'The American Historical Review', u'views_count': 0, u'volume': None}, u'_type': u'listing'}, {u'_id': u'7DAEB9A4', u'_index': u'scibase_listings', u'_score': 1.0, u'_source': {u'authors': [{u'author_id': u'0299B8E9', u'author_name': u'fraser j harbutt'}], u'deleted': 0, u'description': None, u'doi': u'', u'is_valid': 1, u'issue': None, u'journal': u'The American Historical Review', u'link': None, u'meta_description': None, u'meta_keywords': None, u'normalized_venue_name': u'american historical review', u'pages': None, u'parent_keywords': [u'Superconductivity', u'Nuclear fusion', u'Geology', u'Chemistry', u'Metallurgy'], u'pub_date': u'1988-01-01 00:00:00', u'pubtype': None, u'rating_avg_weighted': 0, u'rating_clarity': 0.0, u'rating_clarity_weighted': 0.0, u'rating_innovation': 0.0, u'rating_innovation_weighted': 0.0, u'rating_num_weighted': 0, u'rating_reproducability': 0, u'rating_reproducibility_weighted': 0.0, u'rating_versatility': 0.0, u'rating_versatility_weighted': 0.0, u'review_count': 0, u'tag': [u'iron'], u'title': u'The iron curtain : Churchill, America, and the origins of the Cold War', u'userAvg': 0.0, u'user_id': None, u'venue_name': u'The American Historical Review', u'views_count': 0, u'volume': None}, u'_type': u'listing'}, {u'_id': u'7B3236C5', u'_index': u'scibase_listings', u'_score': 1.0, u'_source': {u'authors': [{u'author_id': u'7DAB7B72', u'author_name': u'richard m freeland'}], u'deleted': 0, u'description': None, u'doi': u'', u'is_valid': 1, u'issue': None, u'journal': u'The American Historical Review', u'link': None, u'meta_description': None, u'meta_keywords': None, u'normalized_venue_name': u'american historical review', u'pages': None, u'parent_keywords': [u'Political Science', 
u'Economics'], u'pub_date': u'1985-01-01 00:00:00', u'pubtype': None, u'rating_avg_weighted': 0, u'rating_clarity': 0.0, u'rating_clarity_weighted': 0.0, u'rating_innovation': 0.0, u'rating_innovation_weighted': 0.0, u'rating_num_weighted': 0, u'rating_reproducability': 0, u'rating_reproducibility_weighted': 0.0, u'rating_versatility': 0.0, u'rating_versatility_weighted': 0.0, u'review_count': 0, u'tag': [u'foreign policy'], u'title': u'The Truman Doctrine and the origins of McCarthyism : foreign policy, domestic politics, and internal security, 1946-1948', u'userAvg': 0.0, u'user_id': None, u'venue_name': u'The American Historical Review', u'views_count': 0, u'volume': None}, u'_type': u'listing'}], u'max_score': 1.0, u'total': 36429433}, u'timed_out': False, u'took': 170}

Answer:

In the pandas example (below) what do the brackets mean? Is there a logic to be followed to go deeper with the []. […]
result = json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

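Each string or list of strings in the ['state', 'shortname', ['info', 'governor']] value is a path to an element to include, in addition to the selected rows. The second argument to json_normalize() (record_path, set to 'counties' in the documentation example) tells the function how to select the elements from the input data structure that make up the rows in the output, and the meta paths add further metadata that is included with each of those rows. Think of these as table joins in a database, if you will.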

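The input for the US states documentation example is a list of two dictionaries, and both of these dictionaries have a counties key that references another list of dicts: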

>>> from pprint import pprint
>>> data = [{'state': 'Florida',
...          'shortname': 'FL',
...          'info': {'governor': 'Rick Scott'},
...          'counties': [{'name': 'Dade', 'population': 12345},
...                       {'name': 'Broward', 'population': 40000},
...                       {'name': 'Palm Beach', 'population': 60000}]},
...         {'state': 'Ohio',
...          'shortname': 'OH',
...          'info': {'governor': 'John Kasich'},
...          'counties': [{'name': 'Summit', 'population': 1234},
...                       {'name': 'Cuyahoga', 'population': 1337}]}]
>>> pprint(data[0]['counties'])
[{'name': 'Dade', 'population': 12345},
 {'name': 'Broward', 'population': 40000},
 {'name': 'Palm Beach', 'population': 60000}]
>>> pprint(data[1]['counties'])
[{'name': 'Summit', 'population': 1234},
 {'name': 'Cuyahoga', 'population': 1337}]

Between them, these two lists give you 5 rows of data to use in the output:

>>> json_normalize(data, 'counties')
         name  population
0        Dade       12345
1     Broward       40000
2  Palm Beach       60000
3      Summit        1234
4    Cuyahoga        1337

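The meta argument then names some elements that live next to those counties lists, and those are merged in separately. The values for those meta elements from the first dictionary, data[0], are ('Florida', 'FL', 'Rick Scott'), and the values from data[1] are ('Ohio', 'OH', 'John Kasich'). You see those values attached to the counties rows that came from the same top-level dictionary, repeated 3 and 2 times respectively: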

>>> data[0]['state'], data[0]['shortname'], data[0]['info']['governor']
('Florida', 'FL', 'Rick Scott')
>>> data[1]['state'], data[1]['shortname'], data[1]['info']['governor']
('Ohio', 'OH', 'John Kasich')
>>> json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

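So, if you pass in a list for the meta argument, each element in the list is a separate path, and each of those paths identifies data to add to the rows in the output.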
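To make the path idea concrete, here is a minimal pure-Python sketch of what record_path and meta do conceptually. It is not how pandas implements json_normalize; the helper names resolve, column_name and normalize_sketch are made up for this illustration, and the sketch only handles a record path that is a single key.

```python
from functools import reduce


def resolve(obj, path):
    """Follow a meta path: a string is one key, a list is a chain of keys."""
    keys = [path] if isinstance(path, str) else path
    return reduce(lambda current, key: current[key], keys, obj)


def column_name(path, sep="."):
    """Build the output column name, e.g. ['info', 'governor'] -> 'info.governor'."""
    return path if isinstance(path, str) else sep.join(path)


def normalize_sketch(data, record_key, meta):
    """Hypothetical helper: one output row per record, plus parent metadata."""
    rows = []
    for parent in data:
        for record in parent[record_key]:
            row = dict(record)  # the columns of the selected record
            for path in meta:
                # Each meta path is resolved against the parent, not the record.
                row[column_name(path)] = resolve(parent, path)
            rows.append(row)
    return rows


data = [
    {"state": "Florida", "shortname": "FL", "info": {"governor": "Rick Scott"},
     "counties": [{"name": "Dade", "population": 12345},
                  {"name": "Broward", "population": 40000},
                  {"name": "Palm Beach", "population": 60000}]},
    {"state": "Ohio", "shortname": "OH", "info": {"governor": "John Kasich"},
     "counties": [{"name": "Summit", "population": 1234},
                  {"name": "Cuyahoga", "population": 1337}]},
]

rows = normalize_sketch(data, "counties", ["state", "shortname", ["info", "governor"]])
for row in rows:
    print(row)
```

Each parent's metadata is copied onto every one of its county rows, which is the "join" behaviour described above.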

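In your example JSON, there are only a few nested lists that could be elevated with the first argument, the way 'counties' was in the example. The only one in that data structure is the nested 'authors' key. You would have to extract each ['_source', 'authors'] path, after which you can add other keys from the parent object to augment those rows.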

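The meta argument (the third positional argument) then pulls in the _id key from the outermost objects, followed by the nested ['_source', 'title'] and ['_source', 'journal'] paths.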

The record_path argument takes the authors lists as its starting point. These look like:

>>> d['hits']['hits'][0]['_source']['authors']   # this value is None, and is skipped
>>> d['hits']['hits'][1]['_source']['authors']
[{'affiliations': ['Punjabi University'],
  'author_id': '780E3459',
  'author_name': 'munish puri'},
 {'affiliations': ['Punjabi University'],
  'author_id': '48D92C79',
  'author_name': 'rajesh dhaliwal'},
 {'affiliations': ['Punjabi University'],
  'author_id': '7D9BD37C',
  'author_name': 'r s singh'}]
>>> d['hits']['hits'][2]['_source']['authors']
[{'author_id': '7FF872BC',
  'author_name': 'barbara eileen ryan'}]
>>> # etc.

This gives you the following rows:

>>> json_normalize(d['hits']['hits'], ['_source', 'authors'])
           affiliations author_id          author_name
0  [Punjabi University]  780E3459          munish puri
1  [Punjabi University]  48D92C79      rajesh dhaliwal
2  [Punjabi University]  7D9BD37C            r s singh
3                   NaN  7FF872BC  barbara eileen ryan
4                   NaN  0299B8E9     fraser j harbutt
5                   NaN  7DAB7B72   richard m freeland
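The same walk can be written by hand, which may help with the dict-versus-list question: hits is a list, so you iterate over it; _source is a dict, so you step into it by key; authors is again a list, so you iterate once more. The sketch below is only an illustration under stated assumptions: the function name author_rows and the trimmed hits sample are hypothetical and not part of pandas, and a None or missing authors value is treated as "no rows".

```python
def author_rows(hits):
    """Return one flat dict per author found under hit['_source']['authors']."""
    rows = []
    for hit in hits:                                 # list: iterate
        source = hit["_source"]                      # dict: index by key
        for author in source.get("authors") or []:   # None or missing: no rows
            rows.append(dict(author))
    return rows


# Trimmed stand-in for d['hits']['hits'], keeping only the keys used here.
hits = [
    {"_id": "7CB3F2AD", "_source": {"authors": None}},
    {"_id": "7AF8EBC3", "_source": {"authors": [
        {"affiliations": ["Punjabi University"],
         "author_id": "780E3459", "author_name": "munish puri"},
        {"affiliations": ["Punjabi University"],
         "author_id": "48D92C79", "author_name": "rajesh dhaliwal"},
    ]}},
    {"_id": "7521A721", "_source": {"authors": [
        {"author_id": "7FF872BC", "author_name": "barbara eileen ryan"},
    ]}},
]

rows = author_rows(hits)
for row in rows:
    # Authors without an 'affiliations' key print None here; in a DataFrame
    # the corresponding cell would be a missing value.
    print(row.get("affiliations"), row["author_id"], row["author_name"])
```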

We can then use the third argument, meta, to add more columns such as _id, _source.title and _source.journal, using ['_id', ['_source', 'journal'], ['_source', 'title']]:

>>> json_normalize(
...     d['hits']['hits'],
...     ['_source', 'authors'],
...     ['_id', ['_source', 'journal'], ['_source', 'title']]
... )
           affiliations author_id          author_name       _id  \
0  [Punjabi University]  780E3459          munish puri  7AF8EBC3
1  [Punjabi University]  48D92C79      rajesh dhaliwal  7AF8EBC3
2  [Punjabi University]  7D9BD37C            r s singh  7AF8EBC3
3                   NaN  7FF872BC  barbara eileen ryan  7521A721
4                   NaN  0299B8E9     fraser j harbutt  7DAEB9A4
5                   NaN  7DAB7B72   richard m freeland  7B3236C5

                                     _source.journal
0  Journal of Industrial Microbiology & Biotechno...
1  Journal of Industrial Microbiology & Biotechno...
2  Journal of Industrial Microbiology & Biotechno...
3                     The American Historical Review
4                     The American Historical Review
5                     The American Historical Review

                                       _source.title  \
0  Development of a stable continuous flow immobi...
1  Development of a stable continuous flow immobi...
2  Development of a stable continuous flow immobi...
3  Feminism and the women's movement : dynamics o...
4  The iron curtain : Churchill, America, and the...
5  The Truman Doctrine and the origins of McCarth...
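For readers on a recent pandas, here is a hedged, self-contained variant of that call. It uses pd.json_normalize, the top-level name available since pandas 1.0 (the pandas.io.json import path used in the question was deprecated at that point). The hits list is a trimmed, hypothetical stand-in for d['hits']['hits'], and dropping records whose authors value is None before the call is a defensive assumption of this sketch, since how a None at the record path is handled may depend on the pandas version.

```python
import pandas as pd

# Trimmed, hypothetical stand-in for d['hits']['hits'].
hits = [
    {"_id": "7CB3F2AD",
     "_source": {"authors": None,
                 "journal": "Physical Review Letters",
                 "title": "Evidence for a new meson"}},
    {"_id": "7AF8EBC3",
     "_source": {"authors": [
                     {"author_id": "780E3459", "author_name": "munish puri"},
                     {"author_id": "48D92C79", "author_name": "rajesh dhaliwal"}],
                 "journal": "Journal of Industrial Microbiology & Biotechnology",
                 "title": "Development of a stable continuous flow immobilized enzyme reactor"}},
]

# Defensive step (an assumption of this sketch): keep only hits that
# actually carry a non-empty authors list.
hits_with_authors = [h for h in hits if h["_source"].get("authors")]

authors = pd.json_normalize(
    hits_with_authors,
    record_path=["_source", "authors"],   # dict key, then the list to expand
    meta=["_id", ["_source", "journal"], ["_source", "title"]],
)
print(authors.columns.tolist())
print(authors)
```

Passing record_path and meta by keyword rather than by position makes it harder to mix up which argument selects the rows and which one adds the parent columns.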

Source: utcz.com/qa/416216.html