更新(2015年5月24日): 3年後、RAID 1アレイが劣化する真の原因を調査しました。
tl; dr: ドライブの1つが不良でしたが、正常なドライブで全面的なテストのみを実行したため、これに気付きませんでした。
3年前、I / Oの問題に関するログをチェックすることは考えていませんでした。私がチェックし/var/log/syslog
たいと思っていたらmdadm
、アレイの再構築をあきらめたときにこのようなものを見たでしょう:
May 24 14:08:32 node51 kernel: [51887.853786] sd 8:0:0:0: [sdi] Unhandled sense code
May 24 14:08:32 node51 kernel: [51887.853794] sd 8:0:0:0: [sdi]
May 24 14:08:32 node51 kernel: [51887.853798] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 24 14:08:32 node51 kernel: [51887.853802] sd 8:0:0:0: [sdi]
May 24 14:08:32 node51 kernel: [51887.853805] Sense Key : Medium Error [current]
May 24 14:08:32 node51 kernel: [51887.853812] sd 8:0:0:0: [sdi]
May 24 14:08:32 node51 kernel: [51887.853815] Add. Sense: Unrecovered read error
May 24 14:08:32 node51 kernel: [51887.853819] sd 8:0:0:0: [sdi] CDB:
May 24 14:08:32 node51 kernel: [51887.853822] Read(10): 28 00 00 1b 6e 00 00 00 01 00
May 24 14:08:32 node51 kernel: [51887.853836] end_request: critical medium error, dev sdi, sector 14381056
May 24 14:08:32 node51 kernel: [51887.853849] Buffer I/O error on device sdi, logical block 1797632
ログでその出力を取得するには、次のコマンドを使用して、最初の問題のあるLBA(私の場合は14381058)を探しました。
root@node51 [~]# dd if=/dev/sdi of=/dev/zero bs=512 count=1 skip=14381058
dd: error reading ‘/dev/sdi’: Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 7.49287 s, 0.0 kB/s
wonderめたのも不思議でmd
はありません!不良ドライブからアレイを再構築することはできません。
新しい技術(smartmontools
ハードウェアの互換性が向上しましたか?)により、最後の5つのエラー(これまでの1393個のエラー)を含むSMART情報をドライブから取得できました。
root@node51 [~]# smartctl -a /dev/sdi
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-43-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Hitachi Deskstar 5K3000
Device Model: Hitachi HDS5C3020ALA632
Serial Number: ML2220FA040K9E
LU WWN Device Id: 5 000cca 36ac1d394
Firmware Version: ML6OA800
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5940 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun May 24 14:13:35 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (21438) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 358) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 93
3 Spin_Up_Time 0x0007 172 172 024 Pre-fail Always - 277 (Average 362)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 174
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 8
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 146 146 020 Pre-fail Offline - 29
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 22419
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 161
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 900
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 900
194 Temperature_Celsius 0x0002 127 127 000 Old_age Always - 47 (Min/Max 19/60)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 8
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 30
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 2
SMART Error Log Version: 1
ATA Error Count: 1393 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1393 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 06 02 70 db 00 Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 00 70 db 40 00 1d+03:59:34.096 READ DMA EXT
25 00 08 00 70 db 40 00 1d+03:59:30.334 READ DMA EXT
b0 d5 01 09 4f c2 00 00 1d+03:57:59.057 SMART READ LOG
b0 d5 01 06 4f c2 00 00 1d+03:57:58.766 SMART READ LOG
b0 d5 01 01 4f c2 00 00 1d+03:57:58.476 SMART READ LOG
Error 1392 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 06 02 70 db 00 Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 00 70 db 40 00 1d+03:59:30.334 READ DMA EXT
b0 d5 01 09 4f c2 00 00 1d+03:57:59.057 SMART READ LOG
b0 d5 01 06 4f c2 00 00 1d+03:57:58.766 SMART READ LOG
b0 d5 01 01 4f c2 00 00 1d+03:57:58.476 SMART READ LOG
b0 d5 01 00 4f c2 00 00 1d+03:57:58.475 SMART READ LOG
Error 1391 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 06 02 70 db 00 Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 00 70 db 40 00 1d+03:56:28.228 READ DMA EXT
25 00 08 00 70 db 40 00 1d+03:56:24.549 READ DMA EXT
25 00 08 00 70 db 40 00 1d+03:56:06.711 READ DMA EXT
25 00 10 f0 71 db 40 00 1d+03:56:06.711 READ DMA EXT
25 00 f0 00 71 db 40 00 1d+03:56:06.710 READ DMA EXT
Error 1390 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 06 02 70 db 00 Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 00 70 db 40 00 1d+03:56:24.549 READ DMA EXT
25 00 08 00 70 db 40 00 1d+03:56:06.711 READ DMA EXT
25 00 10 f0 71 db 40 00 1d+03:56:06.711 READ DMA EXT
25 00 f0 00 71 db 40 00 1d+03:56:06.710 READ DMA EXT
25 00 10 f0 70 db 40 00 1d+03:56:06.687 READ DMA EXT
Error 1389 occurred at disk power-on lifetime: 22419 hours (934 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 06 02 70 db 00 Error: UNC 6 sectors at LBA = 0x00db7002 = 14381058
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 00 70 db 40 00 1d+03:56:06.711 READ DMA EXT
25 00 10 f0 71 db 40 00 1d+03:56:06.711 READ DMA EXT
25 00 f0 00 71 db 40 00 1d+03:56:06.710 READ DMA EXT
25 00 10 f0 70 db 40 00 1d+03:56:06.687 READ DMA EXT
25 00 f0 00 70 db 40 00 1d+03:56:03.026 READ DMA EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 21249 14381058
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
ああ…それでいい。
さて、この質問を3つの簡単な手順で解決しました。
- 3年でシステム管理者になります。
- ログを確認してください。
- スーパーユーザーに戻って3年前からの私のアプローチを笑ってください。
更新(2015年7月19日):好奇心 '盛な人にとって、ドライブは最終的にリマップするセクターを使い果たしました:
root@node51 [~]# smartctl -a /dev/sdg
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-43-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Hitachi Deskstar 5K3000
Device Model: Hitachi HDS5C3020ALA632
Serial Number: ML2220FA040K9E
LU WWN Device Id: 5 000cca 36ac1d394
Firmware Version: ML6OA800
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5940 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Jul 19 14:00:33 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 117) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (21438) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 358) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 099 099 016 Pre-fail Always - 2
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 93
3 Spin_Up_Time 0x0007 163 163 024 Pre-fail Always - 318 (Average 355)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 181
5 Reallocated_Sector_Ct 0x0033 001 001 005 Pre-fail Always FAILING_NOW 1978
7 Seek_Error_Rate 0x000b 086 086 067 Pre-fail Always - 1245192
8 Seek_Time_Performance 0x0005 146 146 020 Pre-fail Offline - 29
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 23763
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 167
192 Power-Off_Retract_Count 0x0032 092 092 000 Old_age Always - 10251
193 Load_Cycle_Count 0x0012 092 092 000 Old_age Always - 10251
194 Temperature_Celsius 0x0002 111 111 000 Old_age Always - 54 (Min/Max 19/63)
196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 2927
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 33
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 2
SMART Error Log Version: 1
ATA Error Count: 2240 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2240 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 f0 18 0f 2f 00 Error: IDNF 240 sectors at LBA = 0x002f0f18 = 3084056
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 f0 18 0f 2f 40 00 00:25:01.942 WRITE DMA EXT
35 00 f0 28 0e 2f 40 00 00:25:01.168 WRITE DMA EXT
35 00 f0 38 0d 2f 40 00 00:25:01.157 WRITE DMA EXT
35 00 f0 48 0c 2f 40 00 00:25:01.147 WRITE DMA EXT
35 00 f0 58 0b 2f 40 00 00:25:01.136 WRITE DMA EXT
Error 2239 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 5a 4e f7 2e 00 Error: IDNF 90 sectors at LBA = 0x002ef74e = 3077966
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 f0 b8 f6 2e 40 00 00:24:57.967 WRITE DMA EXT
35 00 f0 c8 f5 2e 40 00 00:24:57.956 WRITE DMA EXT
35 00 f0 d8 f4 2e 40 00 00:24:57.945 WRITE DMA EXT
35 00 f0 e8 f3 2e 40 00 00:24:57.934 WRITE DMA EXT
35 00 f0 f8 f2 2e 40 00 00:24:57.924 WRITE DMA EXT
Error 2238 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 40 a8 c6 2e 00 Error: IDNF 64 sectors at LBA = 0x002ec6a8 = 3065512
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 f0 f8 c5 2e 40 00 00:24:49.444 WRITE DMA EXT
35 00 f0 08 c5 2e 40 00 00:24:49.433 WRITE DMA EXT
35 00 f0 18 c4 2e 40 00 00:24:49.422 WRITE DMA EXT
35 00 f0 28 c3 2e 40 00 00:24:49.412 WRITE DMA EXT
35 00 f0 38 c2 2e 40 00 00:24:49.401 WRITE DMA EXT
Error 2237 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 ea be ba 2e 00 Error: IDNF 234 sectors at LBA = 0x002ebabe = 3062462
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 f0 b8 ba 2e 40 00 00:24:39.263 WRITE DMA EXT
35 00 f0 c8 b9 2e 40 00 00:24:38.885 WRITE DMA EXT
35 00 f0 d8 b8 2e 40 00 00:24:38.874 WRITE DMA EXT
35 00 f0 e8 b7 2e 40 00 00:24:38.862 WRITE DMA EXT
35 00 f0 f8 b6 2e 40 00 00:24:38.852 WRITE DMA EXT
Error 2236 occurred at disk power-on lifetime: 23763 hours (990 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 86 c2 2a 2e 00 Error: IDNF 134 sectors at LBA = 0x002e2ac2 = 3025602
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 f0 58 2a 2e 40 00 00:24:25.605 WRITE DMA EXT
35 00 f0 68 29 2e 40 00 00:24:25.594 WRITE DMA EXT
35 00 f0 78 28 2e 40 00 00:24:25.583 WRITE DMA EXT
35 00 f0 88 27 2e 40 00 00:24:25.572 WRITE DMA EXT
35 00 f0 98 26 2e 40 00 00:24:25.561 WRITE DMA EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short captive Completed: read failure 50% 23763 869280
# 2 Extended offline Completed without error 00% 22451 -
# 3 Short offline Completed without error 00% 22439 -
# 4 Extended offline Completed: read failure 90% 21249 14381058
1 of 2 failed self-tests are outdated by newer successful extended offline self-test # 2
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.