Python:如何通过保留第一个数据框的信息来合并列上的两个数据框?

我有两个数据框df1和df2。df1包含人的年龄信息,而df2包含人的性别信息。并非所有人都在里面df1或里面df2

df1

Name Age

0 Tom 34

1 Sara 18

2 Eva 44

3 Jack 27

4 Laura 30

df2

Name Sex

0 Tom M

1 Paul M

2 Eva F

3 Jack M

4 Michelle F

我想有人民的性别的信息df1和设置NaN,如果我没有在这个信息df2。我尝试这样做,df1 = pd.merge(df1, df2, on = 'Name', how = 'outer')但是我保留了一些df2我不想要的信息。

df1

Name Age Sex

0 Tom 34 M

1 Sara 18 NaN

2 Eva 44 F

3 Jack 27 M

4 Laura 30 NaN

回答:

Sample:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],

'Age': [34, 18, 44, 27, 30]})

#print (df1)

df3 = df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],

'Sex': ['M', 'M', 'F', 'M', 'F']})

#print (df2)

使用map由Series创建人set_index

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])

print (df1)

Name Age Sex

0 Tom 34 M

1 Sara 18 NaN

2 Eva 44 F

3 Jack 27 M

4 Laura 30 NaN

merge左连接的替代解决方案:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left')

print (df)

Name Age Sex

0 Tom 34 M

1 Sara 18 NaN

2 Eva 44 F

3 Jack 27 M

4 Laura 30 NaN

如果需要通过多列映射(例如Year和Code),则需要merge左连接:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 

'Year':[2000,2003,2003,2004,2007],

'Code':[1,2,3,4,4],

'Age': [34, 18, 44, 27, 30]})

print (df1)

Name Year Code Age

0 Tom 2000 1 34

1 Sara 2003 2 18

2 Eva 2003 3 44

3 Jack 2004 4 27

4 Laura 2007 4 30

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],

'Sex': ['M', 'M', 'F', 'M', 'F'],

'Year':[2001,2003,2003,2004,2007],

'Code':[1,2,3,5,3],

'Val':[21,34,23,44,67]})

print (df2)

Name Sex Year Code Val

0 Tom M 2001 1 21

1 Paul M 2003 2 34

2 Eva F 2003 3 23

3 Jack M 2004 5 44

4 Michelle F 2007 3 67

#merge by all columns

df = df1.merge(df2, on=['Year','Code'], how='left')

print (df)

Name_x Year Code Age Name_y Sex Val

0 Tom 2000 1 34 NaN NaN NaN

1 Sara 2003 2 18 Paul M 34.0

2 Eva 2003 3 44 Eva F 23.0

3 Jack 2004 4 27 NaN NaN NaN

4 Laura 2007 4 30 NaN NaN NaN

#specified columns - columns for join (Year, Code) need always + appended columns (Val)

df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')

print (df)

Name Year Code Age Val

0 Tom 2000 1 34 NaN

1 Sara 2003 2 18 34.0

2 Eva 2003 3 44 23.0

3 Jack 2004 4 27 NaN

4 Laura 2007 4 30 NaN

如果获取错误map意味着按连接列重复,则在这里Name:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 

'Age': [34, 18, 44, 27, 30]})

print (df1)

Name Age

0 Tom 34

1 Sara 18

2 Eva 44

3 Jack 27

4 Laura 30

df3, df4 = df1.copy(), df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],

'Val': [1,2,3,4,5]})

print (df2)

Name Val

0 Tom 1 <-duplicated name Tom

1 Tom 2 <-duplicated name Tom

2 Eva 3

3 Jack 4

4 Michelle 5

s = df2.set_index('Name')['Val']

df1['New'] = df1['Name'].map(s)

print (df1)

InvalidIndexError:重新索引仅对唯一值的Index对象有效

解决方案通过删除重复项DataFrame.drop_duplicates,或dict在最后一次重复匹配中使用map by :

#default keep first value

s = df2.drop_duplicates('Name').set_index('Name')['Val']

print (s)

Name

Tom 1

Eva 3

Jack 4

Michelle 5

Name: Val, dtype: int64

df1['New'] = df1['Name'].map(s)

print (df1)

Name Age New

0 Tom 34 1.0

1 Sara 18 NaN

2 Eva 44 3.0

3 Jack 27 4.0

4 Laura 30 NaN

#add parameter for keep last value 

s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']

print (s)

Name

Tom 2

Eva 3

Jack 4

Michelle 5

Name: Val, dtype: int64

df3['New'] = df3['Name'].map(s)

print (df3)

Name Age New

0 Tom 34 2.0

1 Sara 18 NaN

2 Eva 44 3.0

3 Jack 27 4.0

4 Laura 30 NaN

#map by dictionary

d = dict(zip(df2['Name'], df2['Val']))

print (d)

{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}

df4['New'] = df4['Name'].map(d)

print (df4)

Name Age New

0 Tom 34 2.0

1 Sara 18 NaN

2 Eva 44 3.0

3 Jack 27 4.0

4 Laura 30 NaN

以上是 Python:如何通过保留第一个数据框的信息来合并列上的两个数据框? 的全部内容, 来源链接: utcz.com/qa/426084.html

回到顶部