Splitting a datetime64 column into date and time columns in a pandas DataFrame

Z时代
2024-01-10
Category: Q&A

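I have a DataFrame whose first column is a datetime64 column. How can I split this column into two new columns, a date column and a time column? Here are my data and code so far: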

DateTime,Actual,Consensus,Previous
20140110 13:30:00,74000,196000,241000
20131206 13:30:00,241000,180000,200000
20131108 13:30:00,200000,125000,163000
20131022 12:30:00,163000,180000,193000
20130906 12:30:00,193000,180000,104000
20130802 12:30:00,104000,184000,188000
20130705 12:30:00,188000,165000,176000
20130607 12:30:00,176000,170000,165000
20130503 12:30:00,165000,145000,138000
20130405 12:30:00,138000,200000,268000
...
import pandas as pd
nfp = pd.read_csv("NFP.csv", parse_dates=[0])
nfp

which gives:

Out[10]: <class 'pandas.core.frame.DataFrame'>
         Int64Index: 83 entries, 0 to 82
         Data columns (total 4 columns):
         DateTime     82  non-null values
         Actual       82  non-null values
         Consensus    82  non-null values
         Previous     82  non-null values
         dtypes: datetime64[ns](1), float64(3)

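All good, but I am not sure what to do from here.

I am unsure about two points:

1. Is it possible to do this when I read the CSV file in the first place? If so, how?

2. Can anyone show me how to do the split once I have called read_csv?

Also, is there anywhere I can look up this kind of information?

It is hard to find a detailed reference for the class libraries!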

Answer:

To parse the CSV directly into the desired DataFrame, pass a dict of functions to the converters keyword argument of pandas.read_csv:

import pandas as pd
import datetime as DT
nfp = pd.read_csv("NFP.csv", 
                  sep=r'[\s,]',              # 1
                  engine='python',           # regex separators are handled by the Python engine
                  header=None, skiprows=1,
                  converters={               # 2
                      0: lambda x: DT.datetime.strptime(x, '%Y%m%d'),  
                      1: lambda x: DT.time(*map(int, x.split(':')))},
                  names=['Date', 'Time', 'Actual', 'Consensus', 'Previous'])
print(nfp)

yields

        Date      Time  Actual  Consensus  Previous
0 2014-01-10  13:30:00   74000     196000    241000
1 2013-12-06  13:30:00  241000     180000    200000
2 2013-11-08  13:30:00  200000     125000    163000
3 2013-10-22  12:30:00  163000     180000    193000
4 2013-09-06  12:30:00  193000     180000    104000
5 2013-08-02  12:30:00  104000     184000    188000
6 2013-07-05  12:30:00  188000     165000    176000
7 2013-06-07  12:30:00  176000     170000    165000
8 2013-05-03  12:30:00  165000     145000    138000
9 2013-04-05  12:30:00  138000     200000    268000

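1. sep=r'[\s,]' tells read_csv to split each line of the CSV on the regex pattern r'[\s,]', that is, on a whitespace character or a comma. This is what separates the date from the time inside the first field.

2. The converters parameter tells read_csv to apply the given functions to certain columns. The keys (e.g. 0 and 1) refer to column indices, and the values are the functions to apply.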
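
The two points above can be tried without a file on disk. The following is a minimal sketch, assuming an in-memory copy of two rows from the question's data (the csv_text variable and the use of io.StringIO are illustrative assumptions, not part of the original answer). It also converts the first column to plain datetime.date objects, which is a small variation on the code above.

```python
import datetime as dt
import io

import pandas as pd

# Two rows copied from the question's sample data (illustrative stand-in for NFP.csv).
csv_text = """DateTime,Actual,Consensus,Previous
20140110 13:30:00,74000,196000,241000
20131206 13:30:00,241000,180000,200000
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=r"[\s,]",       # regex separator: split on whitespace or a comma
    engine="python",    # regex separators are handled by the Python parsing engine
    header=None,
    skiprows=1,         # skip the original header, which names only four columns
    names=["Date", "Time", "Actual", "Consensus", "Previous"],
    converters={
        # Keys are column positions; values are functions applied to each raw string.
        0: lambda s: dt.datetime.strptime(s, "%Y%m%d").date(),
        1: lambda s: dt.datetime.strptime(s, "%H:%M:%S").time(),
    },
)
print(df)
```

Because the converters return Python date and time objects, the Date and Time columns here end up with object dtype rather than datetime64.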

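To split the column after read_csv has already parsed it, wrap the column in a DatetimeIndex and read its date and time attributes:

import pandas as pd
# infer_datetime_format is deprecated since pandas 2.0 and can be omitted there
nfp = pd.read_csv("NFP.csv", parse_dates=[0], infer_datetime_format=True)
temp = pd.DatetimeIndex(nfp['DateTime'])
nfp['Date'] = temp.date   # array of datetime.date objects
nfp['Time'] = temp.time   # array of datetime.time objects
del nfp['DateTime']       # drop the original combined column
print(nfp)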
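
As an alternative that is not part of the original answer, the same split can be done with the Series .dt accessor instead of building a DatetimeIndex. This is a minimal sketch, assuming an in-memory sample (csv_text is a hypothetical stand-in for NFP.csv) and an explicit format string so that the parse does not depend on format inference.

```python
import datetime as dt
import io

import pandas as pd

# Two rows copied from the question's sample data (illustrative stand-in for NFP.csv).
csv_text = """DateTime,Actual,Consensus,Previous
20140110 13:30:00,74000,196000,241000
20131206 13:30:00,241000,180000,200000
"""

nfp = pd.read_csv(io.StringIO(csv_text))

# Parse the combined column with an explicit format.
stamps = pd.to_datetime(nfp["DateTime"], format="%Y%m%d %H:%M:%S")

# The .dt accessor exposes the date and time parts of a datetime64 Series.
nfp["Date"] = stamps.dt.date
nfp["Time"] = stamps.dt.time
nfp = nfp.drop(columns="DateTime")
print(nfp)
```

Both .dt.date and .dt.time return Python objects, so the new columns have object dtype. If a datetime64 column is preferred for the date part, stamps.dt.normalize() keeps the dtype and sets the time to midnight.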

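Which method is faster depends on the size of the CSV. (Thanks to Jeff for pointing this out.)

For tiny CSVs, parsing the CSV directly into the desired format with converters is faster than parsing with parse_dates=[0] and then using a DatetimeIndex: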

def using_converter():
    nfp = pd.read_csv("NFP.csv", sep=r'[\s,]', engine='python', header=None, skiprows=1,
                      converters={
                          0: lambda x: DT.datetime.strptime(x, '%Y%m%d'),
                          1: lambda x: DT.time(*map(int, x.split(':')))},
                      names=['Date', 'Time', 'Actual', 'Consensus', 'Previous'])
    return nfp
def using_index():
    nfp = pd.read_csv("NFP.csv", parse_dates=[0], infer_datetime_format=True)
    temp = pd.DatetimeIndex(nfp['DateTime'])
    nfp['Date'] = temp.date
    nfp['Time'] = temp.time
    del nfp['DateTime']
    return nfp
In [114]: %timeit using_index()
100 loops, best of 3: 1.71 ms per loop
In [115]: %timeit using_converter()
1000 loops, best of 3: 914 µs per loop

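However, for CSVs with a few hundred lines or more, using a DatetimeIndex is faster. With the ten sample rows repeated 50 times (500 data rows), the two timings are already roughly equal, with using_index slightly ahead: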

N = 20
filename = '/tmp/data'
content = '''\
DateTime,Actual,Consensus,Previous
20140110 13:30:00,74000,196000,241000
20131206 13:30:00,241000,180000,200000
20131108 13:30:00,200000,125000,163000
20131022 12:30:00,163000,180000,193000
20130906 12:30:00,193000,180000,104000
20130802 12:30:00,104000,184000,188000
20130705 12:30:00,188000,165000,176000
20130607 12:30:00,176000,170000,165000
20130503 12:30:00,165000,145000,138000
20130405 12:30:00,138000,200000,268000'''
def setup(n):
    header, remainder = content.split('\n', 1)
    with open(filename, 'w') as f:
        f.write('\n'.join([header]+[remainder]*n))
In [304]: setup(50)
In [305]: %timeit using_converter()
100 loops, best of 3: 9.78 ms per loop
In [306]: %timeit using_index()
100 loops, best of 3: 9.3 ms per loop
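
The comparison above can be re-run outside IPython with the standard timeit module. The following is a rough sketch, not the original benchmark: the helper names write_csv and compare, the temporary file, and the explicit format string in using_index are all assumptions made for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import datetime as dt
import os
import tempfile
import timeit

import pandas as pd

HEADER = "DateTime,Actual,Consensus,Previous"
ROWS = [
    "20140110 13:30:00,74000,196000,241000",
    "20131206 13:30:00,241000,180000,200000",
]


def write_csv(path, n):
    """Write the header followed by ROWS repeated n times."""
    with open(path, "w") as f:
        f.write("\n".join([HEADER] + ROWS * n))


def using_converter(path):
    """Split date and time while reading, via a regex separator and converters."""
    return pd.read_csv(
        path, sep=r"[\s,]", engine="python", header=None, skiprows=1,
        converters={
            0: lambda s: dt.datetime.strptime(s, "%Y%m%d"),
            1: lambda s: dt.time(*map(int, s.split(":"))),
        },
        names=["Date", "Time", "Actual", "Consensus", "Previous"],
    )


def using_index(path):
    """Read first, then split the parsed column through a DatetimeIndex."""
    nfp = pd.read_csv(path)
    idx = pd.DatetimeIndex(
        pd.to_datetime(nfp["DateTime"], format="%Y%m%d %H:%M:%S"))
    nfp["Date"] = idx.date
    nfp["Time"] = idx.time
    return nfp.drop(columns="DateTime")


def compare(n, repeats=5):
    """Return (row count, converter seconds, index seconds) for n copies of ROWS."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nfp.csv")
        write_csv(path, n)
        rows = len(using_index(path))
        t_conv = timeit.timeit(lambda: using_converter(path), number=repeats)
        t_idx = timeit.timeit(lambda: using_index(path), number=repeats)
    return rows, t_conv, t_idx


if __name__ == "__main__":
    for n in (1, 50, 500):
        rows, t_conv, t_idx = compare(n)
        print(f"{rows:5d} rows: converter {t_conv:.4f}s, index {t_idx:.4f}s")
```

Passing the file path as an argument keeps both functions reading the same generated file, so the row count can be varied without touching the functions themselves.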