Python: sliding-window mean, ignoring missing data

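I am currently trying to process an experimental time-series dataset that has missing values. I would like to compute the sliding-window mean of this dataset over time while handling the NaN values. The correct approach, for me, is to compute the sum of the finite elements within each window and divide it by their count. This non-linearity forces me to use non-convolutional methods to tackle the problem, so I have a severe time bottleneck in this part of the process. As a code example of what I am trying to accomplish, I present the following: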

import numpy as np

# Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
# Indices may repeat, so at most n_miss entries become NaN
data[np.random.randint(0, n - 1, n_miss)] = np.nan

# Compute mean
result = np.zeros(data.size)
for count in range(data.size):
    # Integer division keeps the slice bounds valid under Python 3
    part_data = data[max(count - (win_size - 1) // 2, 0):
                     min(count + (win_size + 1) // 2, data.size)]
    mask = np.isfinite(part_data)
    if np.sum(mask) != 0:
        result[count] = np.sum(part_data[mask]) / np.sum(mask)
    else:
        result[count] = np.nan

print('Input:\t', data)
print('Output:\t', result)

Output:

Input:  [ 0.47431791  0.17620835  0.78495647  0.79894688  0.58334064  0.38068788

0.87829696 nan 0.71589171 nan 0.70359557 0.76113969

0.13694387 0.32126573 0.22730891 nan 0.35057169 nan

0.89251851 0.56226354 0.040117 nan 0.37249799 0.77625334

nan nan nan nan 0.63227417 0.92781944

0.99416471 0.81850753 0.35004997 nan 0.80743783 0.60828597

nan 0.01410721 nan nan 0.6976317 nan

0.03875394 0.60924066 0.22998065 nan 0.34476729 0.38090961

nan 0.2021964 ]

Output: [ 0.32526313 0.47849424 0.5867039 0.72241466 0.58765847 0.61410849

0.62949242 0.79709433 0.71589171 0.70974364 0.73236763 0.53389305

0.40644977 0.22850617 0.27428732 0.2889403 0.35057169 0.6215451

0.72739103 0.49829968 0.30119027 0.20630749 0.57437567 0.57437567

0.77625334 nan nan 0.63227417 0.7800468 0.85141944

0.91349722 0.7209074 0.58427875 0.5787439 0.7078619 0.7078619

0.31119659 0.01410721 0.01410721 0.6976317 0.6976317 0.36819282

0.3239973 0.29265842 0.41961066 0.28737397 0.36283845 0.36283845

0.29155301 0.2021964 ]

Can this result be produced with NumPy operations, without using a for loop?

Answer:

Here is a convolution-based approach using np.convolve -

mask = np.isnan(data)

K = np.ones(win_size,dtype=int)

out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)

Note that, with np.convolve's default full mode and win_size = 3, the output has one extra element on each side.
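For other window sizes the number of extra elements changes, because a full convolution returns n + win_size - 1 values. The following is a minimal sketch, assuming an odd win_size, that asks np.convolve for mode="same" so the output already has the input's length. The function name rolling_nanmean is hypothetical and not part of the original answer.

```python
import numpy as np

def rolling_nanmean(data, win_size):
    """Centered sliding mean that ignores NaNs (hypothetical helper, odd win_size assumed)."""
    data = np.asarray(data, dtype=float)
    mask = np.isnan(data)
    kernel = np.ones(win_size)
    # Sum of the valid values in each window (NaNs contribute 0)
    sums = np.convolve(np.where(mask, 0.0, data), kernel, mode="same")
    # Number of valid values in each window
    counts = np.convolve((~mask).astype(float), kernel, mode="same")
    # Windows with no valid values give 0/0, which becomes NaN; silence that warning
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts

sample = np.array([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
print(rolling_nanmean(sample, 3))
```

Suppressing the warning with np.errstate is a design choice here: the division by zero is expected for all-NaN windows, and NaN is the desired result for them.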

If you are working with 2D data, we can use SciPy's 2D convolution.
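As a rough illustration of the 2D case, the same sum-over-count idea can be written with scipy.signal.convolve2d. This is a sketch under assumptions: the helper name rolling_nanmean_2d and the square win_size x win_size window are illustrative choices, not taken from the original answer.

```python
import numpy as np
from scipy.signal import convolve2d

def rolling_nanmean_2d(data, win_size):
    """Mean over a win_size x win_size neighbourhood, ignoring NaNs (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    mask = np.isnan(data)
    kernel = np.ones((win_size, win_size))
    # mode="same" keeps the input shape; cells outside the array are filled with 0
    sums = convolve2d(np.where(mask, 0.0, data), kernel, mode="same")
    counts = convolve2d((~mask).astype(float), kernel, mode="same")
    # Neighbourhoods with no valid values give 0/0 -> NaN
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts

grid = np.array([[1.0, np.nan], [3.0, 5.0]])
print(rolling_nanmean_2d(grid, 3))
```

In this small example every 3x3 window covers the whole 2x2 grid, so each output cell is the mean of 1, 3 and 5.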

Approaches -

def original_app(data, win_size):
    # Compute the mean with an explicit loop over the windows
    result = np.zeros(data.size)
    for count in range(data.size):
        # Integer division keeps the slice bounds valid under Python 3
        part_data = data[max(count - (win_size - 1) // 2, 0):
                         min(count + (win_size + 1) // 2, data.size)]
        mask = np.isfinite(part_data)
        if np.sum(mask) != 0:
            result[count] = np.sum(part_data[mask]) / np.sum(mask)
        else:
            result[count] = np.nan
    return result

def numpy_app(data, win_size):
    mask = np.isnan(data)
    K = np.ones(win_size, dtype=int)
    # Windowed sum of valid values divided by windowed count of valid values
    out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
    return out[1:-1]  # Slice out the one-extra elems on sides (correct for win_size = 3)

Sample run -

In [118]: #Construct sample data

...: n = 50

...: n_miss = 20

...: win_size = 3

...: data= np.random.random(50)

...: data[np.random.randint(0,n-1, n_miss)] = np.nan

...:

In [119]: original_app(data, win_size = 3)

Out[119]:

array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,

nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,

0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,

0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,

0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,

0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,

0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,

0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,

0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,

0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])

In [120]: numpy_app(data, win_size = 3)

__main__:36: RuntimeWarning: invalid value encountered in divide

Out[120]:

array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,

nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,

0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,

0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,

0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,

0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,

0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,

0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,

0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,

0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])

Runtime test -

In [122]: #Construct sample data

...: n = 50000

...: n_miss = 20000

...: win_size = 3

...: data= np.random.random(n)

...: data[np.random.randint(0,n-1, n_miss)] = np.nan

...:

In [123]: %timeit original_app(data, win_size = 3)

1 loops, best of 3: 1.51 s per loop

In [124]: %timeit numpy_app(data, win_size = 3)

1000 loops, best of 3: 1.09 ms per loop

In [125]: import pandas as pd

# @jdehesa's pandas solution

In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()

100 loops, best of 3: 3.34 ms per loop
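One caveat on the pandas timing: rolling() without center=True uses a trailing window, so its values are aligned differently from the centered window of the loop version. A minimal sketch of a centered variant, assuming pandas is available (the function name pandas_centered_mean is hypothetical):

```python
import numpy as np
import pandas as pd

def pandas_centered_mean(data, win_size):
    """Centered rolling mean that skips NaNs (hypothetical helper)."""
    return (pd.Series(data)
            # min_periods=1: one valid value in the window is enough for a result
            .rolling(window=win_size, min_periods=1, center=True)
            .mean()
            .to_numpy())

sample = np.array([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
print(pandas_centered_mean(sample, 3))
```

Windows that contain no valid values still come out as NaN, matching the behaviour of the loop-based version.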

