Efficient way to merge multiple large DataFrames

Suppose I have 4 small DataFrames: df1, df2, df3 and df4.

import pandas as pd
from functools import reduce
import numpy as np

df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])

df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']
df4.columns = ['name', 'id', 'price']

df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})

The code above creates the 4 DataFrames; the code below produces the result I want.

# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')

# Fill na values with 'missing'
df = df.fillna('missing')
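A side note on the fillna('missing') step: the sketch below uses two made-up frames (`left` and `right` are illustrative names, not from the question) to show what happens to the column types. An outer merge turns incomplete integer columns into float64 because of the NaN values, and filling those NaNs with a string should then turn the column into object dtype, which typically costs more memory per value.

```python
import pandas as pd

# Two tiny illustrative frames that share only one (name, id) key
left = pd.DataFrame({'name': ['a', 'b'], 'id': [1, 1], 'pricepart1': [10, 4]})
right = pd.DataFrame({'name': ['a', 'c'], 'id': [1, 1], 'pricepart2': [15, 2]})

merged = pd.merge(left, right, on=['name', 'id'], how='outer')
# The outer merge introduces NaN, so the integer price columns become float64
print(merged.dtypes)

filled = merged.fillna('missing')
# Mixing numbers with the string 'missing' forces the price columns to object dtype
print(filled.dtypes)
```

If memory is the concern, leaving the gaps as NaN and only substituting 'missing' when displaying or exporting may be worth considering.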

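So I achieved this for 4 DataFrames that don't have many rows or columns.

To extend the outer merge above to many more DataFrames (48 of size 62245 x 3), I built the following solution on another StackOverflow answer that uses reduce with a lambda: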

from functools import reduce
import pandas as pd
import numpy as np

dfList = []

# Create the 48 DataFrames of size 62245 x 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name', 'id', 'pricepart' + str(i + 1)]))

# The solution I came up with to extend the merge to more than 3 DataFrames
df_merged = reduce(lambda left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')

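This throws a MemoryError.

I don't know what to do to stop the kernel from dying. I have been stuck on this for two days. Some code for the EXACT merge operation I performed that does not cause a MemoryError, or anything that gives the same result, would be really appreciated.

Also, the 3 columns in my main DataFrames (not the reproducible 48 DataFrames in the example) are of type int64, int64 and float64, and I would prefer them to stay that way because of the integers and floats they represent.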
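One possible contributor to the MemoryError in the reproducible example, assuming repeated (name, id) pairs are what drives it: the random frames draw name and id from only 100 values each, so every key occurs several times per frame, and a merge emits one row for every left/right combination of a key. The sketch below (the names `sample`, `other` and `rng` are illustrative) counts rows per key in two such frames and checks the predicted size of their outer merge against the real one.

```python
import numpy as np
import pandas as pd

keys = ['name', 'id']
n_rows = 62245
rng = np.random.default_rng(0)

# Two key-only frames built the same way as the frames in dfList
sample = pd.DataFrame(rng.integers(0, 100, size=(n_rows, 2)), columns=keys)
other = pd.DataFrame(rng.integers(0, 100, size=(n_rows, 2)), columns=keys)

# How many rows share each (name, id) key in each frame
sample_counts = sample.groupby(keys).size()
other_counts = other.groupby(keys).size()
print('average rows per key:', sample_counts.mean())

# Each key contributes count_left * count_right rows to the merge;
# a key present on only one side keeps its own rows (hence fill_value=1)
expected_rows = int(sample_counts.mul(other_counts, fill_value=1).sum())

actual_rows = len(pd.merge(sample, other, on=keys, how='outer'))
print(expected_rows, actual_rows)
```

Because the row count is multiplied again by every further merge, chaining 48 such frames cannot fit in memory no matter how the merges are grouped. If the real data has unique (name, id) pairs per frame, this effect does not apply.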

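EDIT:

Instead of running the merge iteratively or with the reduce lambda function, I now do it in groups of 2! I also changed the datatype of some columns: some did not need to be float64, so I brought them down to float16. It gets a lot farther, but still ends up throwing a MemoryError.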

intermediatedfList = dfList

# Merge the frames two at a time until only 2 are left
while len(intermediatedfList) != 2:
    # If there is an even number of DataFrames
    if len(intermediatedfList) % 2 == 0:
        # Start from an empty auxiliary list on every pass
        tempdfList = []
        # Go in steps of two
        for i in range(0, len(intermediatedfList), 2):
            # Merge the DataFrames at index i and i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            df1.info(memory_usage='deep')
            # Append it to the auxiliary list
            tempdfList.append(df1)
        # Continue the while loop with the merged frames
        intermediatedfList = tempdfList
    else:
        # If there is an odd number of DataFrames, keep the first DataFrame out
        tempdfList = [intermediatedfList[0]]
        # Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):
            # Merge the DataFrames at index i and i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            df1.info(memory_usage='deep')
            tempdfList.append(df1)
        # Continue the while loop with the merged frames
        intermediatedfList = tempdfList
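On the float16 downcast mentioned in the edit: the sketch below measures what a narrower float type saves and what it costs, using a made-up price column (`prices` is an illustrative name, and the values are random rather than real data). Each step down halves the bytes per value, but float16 keeps only about three significant decimal digits.

```python
import numpy as np
import pandas as pd

# Illustrative price column with the same length as the frames in the question
prices = pd.Series(np.random.default_rng(0).random(62245) * 1000, name='pricepart1')

bytes_f64 = prices.memory_usage(index=False)
bytes_f32 = prices.astype('float32').memory_usage(index=False)
bytes_f16 = prices.astype('float16').memory_usage(index=False)
print(bytes_f64, bytes_f32, bytes_f16)

# The price of the saving: float16 cannot represent this value exactly
original = 1234.56
as_f16 = float(np.float16(original))
print(original, as_f16)
```

Downcasting shrinks each column by a constant factor only, so it may delay a MemoryError without changing how the number of rows grows during the merges.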

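Is there any way I can optimize my code to avoid the MemoryError? I have even used an AWS instance with 192GB of RAM (I now owe them $7 that I could have given to one of you). It got much farther than I had gotten before, but it still threw a MemoryError after reducing a list of 28 DataFrames to 4.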

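Answer:

You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge.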

df_list = [df1, df2, ...]

for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

df = pd.concat(df_list, axis=1) # join='inner'
df.reset_index(inplace=True)
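As a quick check of the concat approach, here is a minimal sketch on three small made-up frames (the data and the names `frames`, `via_concat` and `via_merge` are illustrative). It assumes every frame has unique (name, id) pairs; as far as I know, pd.concat along axis=1 cannot align frames whose index contains duplicates, so the sketch tests for that first.

```python
from functools import reduce

import pandas as pd

keys = ['name', 'id']
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4]], columns=keys + ['pricepart1'])
df2 = pd.DataFrame([['a', 1, 15], ['c', 1, 2]], columns=keys + ['pricepart2'])
df3 = pd.DataFrame([['d', 1, 10], ['a', 2, 1]], columns=keys + ['pricepart3'])
frames = [df1, df2, df3]

# Index-aligned concat relies on each key appearing at most once per frame
for frame in frames:
    if frame.duplicated(keys).any():
        raise ValueError('duplicate (name, id) pairs: concat cannot align this frame')

# Align on the (name, id) index and place the price columns side by side
via_concat = pd.concat([frame.set_index(keys) for frame in frames], axis=1).reset_index()

# The chained outer merge from the question, for comparison
via_merge = reduce(lambda l, r: pd.merge(l, r, on=keys, how='outer'), frames)

# Row order may differ between the two, so sort before comparing
a = via_concat.sort_values(keys).reset_index(drop=True)
b = via_merge.sort_values(keys).reset_index(drop=True)
print(a)
print(b)
```

Using set_index without inplace=True leaves the original frames untouched, which makes the snippet safe to re-run.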

Alternatively, you can replace the concat (second step) with an iterative join:

from functools import reduce

df = reduce(lambda x, y: x.join(y), df_list)
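One thing to watch with the join variant: DataFrame.join defaults to how='left', so keys that are missing from the first frame are dropped. The sketch below, on small made-up frames (`indexed`, `left_joined` and `outer_joined` are illustrative names), compares the default with how='outer', which is what matches the outer merge in the question.

```python
from functools import reduce

import pandas as pd

keys = ['name', 'id']
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4]], columns=keys + ['pricepart1'])
df2 = pd.DataFrame([['a', 1, 15], ['c', 1, 2]], columns=keys + ['pricepart2'])
df3 = pd.DataFrame([['d', 1, 10], ['a', 2, 1]], columns=keys + ['pricepart3'])

# join aligns on the index, so move the key columns into it first
indexed = [frame.set_index(keys) for frame in (df1, df2, df3)]

# Default how='left': only the keys of the first frame survive
left_joined = reduce(lambda x, y: x.join(y), indexed)

# how='outer': keeps the union of all keys, like the outer merge
outer_joined = reduce(lambda x, y: x.join(y, how='outer'), indexed)

print(len(left_joined), len(outer_joined))
print(outer_joined.reset_index())
```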

This may or may not be better than merge.

