Investigating a MongoDB Instance Restart Failure (Caused by Redoing a Large Transaction)


1. Background of the instance restart

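The monitoring team reported that an application connected to one of our MongoDB instances was seeing abnormal latency and timeouts. The database monitoring platform showed that I/O on this instance's server had spiked, and the replica set status (rs.status()) showed that replication was broken: the secondary was unreachable.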
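The rs.status() check above can be automated. The following is a minimal sketch, assuming the status document has already been fetched (for example through a driver) and has the usual members array with name, health and stateStr fields. The function name, the set name, the qqorderdb01 host and the ports are hypothetical and only there for illustration.

```python
from typing import Any


def unreachable_members(status: dict[str, Any]) -> list[str]:
    """Return the names of members whose health is not 1 in a replSetGetStatus document."""
    return [
        member["name"]
        for member in status.get("members", [])
        if member.get("health") != 1
    ]


# Hypothetical, heavily trimmed status document for illustration only.
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "qqorderdb01:27017", "health": 1, "stateStr": "PRIMARY"},
        {"name": "qqorderdb02:27017", "health": 0, "stateStr": "(not reachable/healthy)"},
    ],
}

print(unreachable_members(sample_status))
```

A non-empty result is the signal to log in to the listed node and inspect the service, which is what the next step does by hand.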

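Logging in to the secondary and checking the MongoDB service status showed that it was stopped.

The server's system log showed that an OOM had occurred and mongod had been killed, so the service had to be restarted manually.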

Jan 17 12:02:48 qqorderdb02 kernel: Out of memory: Kill process 83717 (mongod) score 919 or sacrifice child

Jan 17 12:02:48 qqorderdb02 kernel: Killed process 83717 (mongod), UID 1001, total-vm:21256876kB, anon-rss:15529572kB, file-rss:0kB, shmem-rss:0kB

Jan 17 12:42:42 qqorderdb02 systemd[1]: mongodbqq.service: main process exited, code=killed, status=9/KILL

Jan 17 12:42:42 qqorderdb02 systemd[1]: Unit mongodbqq.service entered failed state.

Jan 17 12:42:42 qqorderdb02 systemd[1]: mongodbqq.service failed.

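Post-incident analysis: the primary has more memory than the secondary. An index was being created; the primary completed the build normally, while the secondary ran out of memory (12:02). After mongod was killed, the automatic restart of the service also failed (12:42).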
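To put a number on the memory involved, the kernel's "Killed process" line can be parsed. This is a small sketch rather than a general log parser: the regular expression only assumes the field layout seen in the log above, and the function name is hypothetical.

```python
import re

# Matches the kernel's "Killed process" line in the layout shown in the log above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\).*?"
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line: str):
    """Return (pid, process name, anon-rss in GiB), or None if the line does not match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    anon_rss_gib = int(match["anon_rss"]) / (1024 * 1024)  # kB -> GiB
    return int(match["pid"]), match["name"], round(anon_rss_gib, 2)


line = (
    "Jan 17 12:02:48 qqorderdb02 kernel: Killed process 83717 (mongod), UID 1001, "
    "total-vm:21256876kB, anon-rss:15529572kB, file-rss:0kB, shmem-rss:0kB"
)
print(parse_oom_kill(line))
```

For the line above this gives roughly 14.8 GiB of resident anonymous memory held by mongod when it was killed.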

2. Restarting the service

After restarting the service, the MongoDB log shows mongod redoing the index build that had not finished:

2019-01-17T19:38:11.529+0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is "always".

2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** We suggest setting it to "never"

2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten]

2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is "always".

2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** We suggest setting it to "never"

2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten]

2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** WARNING: Running wiredTiger with the --nojournal option in a replica set

2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** is deprecated and subject to be removed in a future version.

2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten]

2019-01-17T19:38:11.592+0800 I INDEX [initandlisten] found 1 index(es) that wasn't finished before shutdown

2019-01-17T19:38:11.595+0800 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory "/var/mongodbqq/db/diagnostic.data"

2019-01-17T19:38:11.596+0800 I INDEX [initandlisten] found 1 interrupted index build(s) on qqorderdb.weixinordersn

2019-01-17T19:38:11.596+0800 I INDEX [initandlisten] note: restart the server with --noIndexBuildRetry to skip index rebuilds

After running for a while, however, the restart failed and the process exited. The server log showed the following errors:

Jan 17 19:41:10 qqorderdb02 systemd[1]: mongodbqq.service stop-final-sigterm timed out. Killing.

Jan 17 19:41:10 qqorderdb02 systemd[1]: Failed to start mongodbqq_service.

Jan 17 19:41:10 qqorderdb02 systemd[1]: Unit mongodbqq.service entered failed state.

Jan 17 19:41:10 qqorderdb02 systemd[1]: mongodbqq.service failed.

The most recent entries in the MongoDB log were:

2019-01-17T19:41:00.001+0800 I - [initandlisten] Index Build: 55387600/192576426 28%

2019-01-17T19:41:03.002+0800 I - [initandlisten] Index Build: 57463100/192576426 29%

2019-01-17T19:41:06.002+0800 I - [initandlisten] Index Build: 59385700/192576426 30%

2019-01-17T19:41:09.001+0800 I - [initandlisten] Index Build: 61549000/192576426 31%

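From the server log and the MongoDB log we can conclude the following: on startup, mongod has to rebuild the index that was unfinished at shutdown. Rebuilding an index on this large collection (weixinordersn in this case, about 500 million documents, 102 GB in size) takes a long time and exceeds the time systemd allows for the service to start. Once that timeout expires, the service is killed.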
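The "Index Build" progress lines also allow a rough estimate of how much longer the rebuild would have needed, which helps when choosing a start timeout. The sketch below assumes the progress rate stays constant and covers only the progress counter shown in these log lines, so the result is best read as a rough lower bound. The function names are hypothetical.

```python
import re
from datetime import datetime

# Timestamp at the start of the line, then "Index Build: done/total pct%".
PROGRESS_RE = re.compile(
    r"^(?P<ts>\S+) .*Index Build: (?P<done>\d+)/(?P<total>\d+)\s+(?P<pct>\d+)%"
)


def parse_progress(line: str):
    """Return (timestamp, documents done, documents total) or None if not a progress line."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%f%z")
    return timestamp, int(match["done"]), int(match["total"])


def estimate_remaining_seconds(first: str, last: str) -> float:
    """Extrapolate the remaining time from two progress lines, assuming a constant rate."""
    t0, done0, _ = parse_progress(first)
    t1, done1, total = parse_progress(last)
    rate = (done1 - done0) / (t1 - t0).total_seconds()  # documents per second
    return (total - done1) / rate


first_line = "2019-01-17T19:41:00.001+0800 I - [initandlisten] Index Build: 55387600/192576426 28%"
last_line = "2019-01-17T19:41:09.001+0800 I - [initandlisten] Index Build: 61549000/192576426 31%"

print(f"about {estimate_remaining_seconds(first_line, last_line):.0f} s remaining")
```

With the four log lines above, the counter advanced by about 6.2 million documents in 9 seconds, so at the moment systemd killed the process the remaining 69% would still have taken a little over three minutes at that rate.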

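3. Solution

Set a timeout for the systemd service by specifying the TimeoutSec parameter in the MongoDB unit file.

TimeoutSec: the number of seconds systemd waits for the service to start or stop before it treats the operation as failed and kills the service. It is shorthand for setting both TimeoutStartSec and TimeoutStopSec. The unit is seconds; a value of 0 disables the limit.

For example, in the mongodbtest.service unit file, add TimeoutSec=1800: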

[Unit]

Description=mongodbtest

After=network.target remote-fs.target nss-lookup.target

[Service]

User=mongouser

Group=mongouser

# (open files)

LimitNOFILE=64000

Type=forking

ExecStart=/data/mongodb/mongobin404/bin/mongod --config /data/mongodb/mongobin404/bin/mongodbtest.conf

ExecReload=/bin/kill -s HUP $MAINPID

ExecStop=/data/mongodb/mongobin404/bin/mongod --shutdown --config /data/mongodb/mongobin404/bin/mongodbtest.conf

PrivateTmp=true

TimeoutSec=1800

[Install]

WantedBy=multi-user.target
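Before restarting, it can be worth confirming that the unit file really carries the new setting. The sketch below reads a unit file's text with Python's configparser. This is an approximation: systemd's own parser has additional rules (drop-in files, backslash line continuations), so `systemctl show` remains the authoritative view of the effective values. The helper name and the trimmed unit text are hypothetical.

```python
import configparser


def get_service_option(unit_text: str, key: str):
    """Return the value of `key` in the [Service] section, or None if it is absent."""
    parser = configparser.ConfigParser(
        strict=False,        # unit files may repeat keys such as ExecStart
        interpolation=None,  # do not treat '%' specifiers as interpolation
        delimiters=("=",),
    )
    parser.optionxform = str  # keep keys case-sensitive, as systemd does
    parser.read_string(unit_text)
    return parser.get("Service", key, fallback=None)


# Trimmed copy of the unit file shown above, for illustration only.
unit_text = """\
[Unit]
Description=mongodbtest

[Service]
User=mongouser
Type=forking
ExecReload=/bin/kill -s HUP $MAINPID
PrivateTmp=true
TimeoutSec=1800

[Install]
WantedBy=multi-user.target
"""

print(get_service_option(unit_text, "TimeoutSec"))
```

Note that systemd only picks up an edited unit file after `systemctl daemon-reload`, so the check is most useful together with a reload before the next start attempt.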

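4. Analysis of the performance degradation

Correlating the times of the application timeouts with the I/O spike seen in database monitoring, the performance degradation had two main causes: the index build itself, and the broken replication, which slowed down inserts, queries and updates on oplog.rs.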


 

 

Copyright of this article belongs to the author. Please do not repost it without the author's permission.
