cloud11 intermittently showing drives as fault or logical drives not showing
Open, High, Public

Description

When you do a "reset" via the iLO (or a reboot), the logical drive count shows as 0 and the server fails to boot.

Drive 7 (one of the SSDs) has also sometimes shown as faulty, but it recovered after rebooting.

Doing a cold reboot works, insofar as you don't get boot failures.

@John could you have a look please? (We have reinstalled the OS and also redid the RAID.)

This blocks swift.

Event Timeline

Paladox triaged this task as High priority. Oct 17 2022, 18:48
Paladox created this task.

Seems like one of the SSDs is actually faulty:

PROBLEM - cloud11 SMART on cloud11 is WARNING: WARNING: [cciss,2] - Reallocated_Sector_Ct is non-zero (3) --- [cciss,0] - Device is clean --- [cciss,1] - Device is clean --- [cciss,3] - Device is clean --- [cciss,4] - Device is clean --- [cciss,5] - Device is clean --- [cciss,6] - Device is clean|
root@cloud11:/home/paladox# smartctl -a /dev/sda -d cciss,2
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.60-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WD Green 2.5 240GB
Serial Number:    222433802487
LU WWN Device Id: 5 001b44 8b0cb3e12
Firmware Version: 42051100
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 18 14:08:24 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.
Conveyance self-test routine
recommended polling time: 	 (   1) minutes.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       3
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       47
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       47
165 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       700091924664
166 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       40
168 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       6
169 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       88
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       3
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   005    Old_age   Always       -       2
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       13
184 End-to-End_Error        0x0032   100   100   097    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   100   100   014    Old_age   Always       -       32 (Min/Max 17/32)
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
230 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       197571117102
232 Available_Reservd_Space 0x0033   100   100   000    Pre-fail  Always       -       83
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       43
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1737
241 Total_LBAs_Written      0x0030   253   253   000    Old_age   Offline      -       599
242 Total_LBAs_Read         0x0030   253   253   000    Old_age   Offline      -       3444
244 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0

Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        45         -
# 2  Extended offline    Completed without error       00%        38         -
# 3  Extended offline    Completed without error       10%        34         -
# 4  Extended offline    Completed without error       00%        20         -
# 5  Short offline       Completed without error       00%        18         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
... (7 lines left)

I'm not sure which physical disk this corresponds to, but it shows Reallocated_Sector_Ct at 3, which triggers our SMART Icinga check.
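One way to work out which physical disk sits behind each cciss index is to list the serial number per index and match it against the label on the drive. A minimal sketch, assuming the controller exposes indices 0 to 6 on /dev/sda as in the Icinga output above; the function names (serial_of, realloc_count, scan_cciss) are made up for illustration:

```shell
#!/bin/sh
# Extract the serial number from smartctl output read on stdin.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Print the RAW_VALUE of attribute 5 (Reallocated_Sector_Ct) from
# smartctl output read on stdin.
realloc_count() {
    awk '$1 == 5 && $2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Query every drive behind the controller and print one summary line each.
# Assumption: indices 0-6 and /dev/sda, as seen in the check output above.
scan_cciss() {
    dev="${1:-/dev/sda}"
    for i in 0 1 2 3 4 5 6; do
        out=$(smartctl -a "$dev" -d "cciss,$i")
        serial=$(printf '%s\n' "$out" | serial_of)
        count=$(printf '%s\n' "$out" | realloc_count)
        echo "cciss,$i serial=${serial:-unknown} Reallocated_Sector_Ct=${count:-unknown}"
    done
}
```

Running `scan_cciss /dev/sda` as root should then give one line per drive, and the serial can be compared with the sticker on each SSD the next time someone is at the DC.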

By comparison, cciss,3 passes (the value is 0):

root@cloud11:/home/paladox# smartctl -a /dev/sda -d cciss,3
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.60-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WD Green 2.5 240GB
Serial Number:    222433801629
LU WWN Device Id: 5 001b44 8b0cbdfd2
Firmware Version: 42051100
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 18 14:09:45 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.
Conveyance self-test routine
recommended polling time: 	 (   1) minutes.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       762
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       47
165 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       330731225368
166 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       28
168 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       16
169 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       91
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   005    Old_age   Always       -       4
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       17
184 End-to-End_Error        0x0032   100   100   097    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   100   100   014    Old_age   Always       -       33 (Min/Max 18/34)
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
230 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       300652953670
232 Available_Reservd_Space 0x0033   100   100   000    Pre-fail  Always       -       90
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       68
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2804
241 Total_LBAs_Written      0x0030   253   253   000    Old_age   Offline      -       966
242 Total_LBAs_Read         0x0030   253   253   000    Old_age   Offline      -       4020
244 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0

Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       760         -
# 2  Extended offline    Completed without error       00%       687         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SSDs are expected to develop reallocated sectors; this is normal operation.
On HDDs, reallocated sectors are considered a pre-failure warning.

https://forums.tomshardware.com/threads/ssd-smart-realocated-sectors.3360388/
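If the check were to treat SSDs and HDDs differently along those lines, it could look roughly like the sketch below. This is only an illustration and not how the existing Icinga check works: the function name classify_realloc is hypothetical, and the policy (for SSDs, trust the normalized VALUE against THRESH for attribute 5; for anything else, warn on any non-zero raw count) is an assumption.

```shell
#!/bin/sh
# Read full `smartctl -a` output on stdin and print OK, WARNING or UNKNOWN.
# Assumed policy (not the existing check):
#   SSD: OK while the normalized VALUE of attribute 5 is above THRESH.
#   HDD: OK only while the RAW_VALUE of attribute 5 is zero.
# A device without a "Rotation Rate" line is treated like an HDD.
classify_realloc() {
    awk '
        /^Rotation Rate:/ { ssd = ($0 ~ /Solid State Device/) }
        $1 == 5 && $2 == "Reallocated_Sector_Ct" {
            value = $4 + 0; thresh = $6 + 0; raw = $10 + 0; seen = 1
        }
        END {
            if (!seen)                 { print "UNKNOWN"; exit }
            if (ssd && value > thresh) { print "OK"; exit }
            if (!ssd && raw == 0)      { print "OK"; exit }
            print "WARNING"
        }
    '
}
```

It could be fed directly from the command used above, for example `smartctl -a /dev/sda -d cciss,2 | classify_realloc`.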

hmm

Paladox raised the priority of this task from High to Unbreak Now!. Edited Oct 18 2022, 18:04

Going to make this UBN. This server is broken and cannot be used until this issue is sorted (it cannot even boot up if you do a "reboot" or "reset"). We're also about to have disk space issues on the other cloud servers (T9841), so this will cause problems soon.

The server works if you don't do a reset (which we should avoid doing anyway unless the server is totally unresponsive).

This will require further investigation when I can next go to the DC.

John lowered the priority of this task from Unbreak Now! to Normal.

As there is a temporary solution in place pending a permanent resolution, lowering priority and assigning to Owen.

Unknown Object (User) unsubscribed. Mar 18 2023, 03:31
BrandonWM raised the priority of this task from Normal to High. Apr 12 2023, 13:12
DontLetYourselfBeDisrespected renamed this task from cloud11 intermittently showing drives as fault or logical drives not showing to WHOEVER IS RUNNING MIRAHEZE RIGHT NOW IS SO DISRESPECTFUL THEY TREAT US LIKE GARBAGE AND MAKE US WAIT MONTHS FOR TASKS AND THEY HAVE THE AUDACITY TO LIE TO US AND CLAIM THEY’RE WORKING SO HARD BUT IN FACT THEY’RE DOING NOTHING BECAUSE THEY DONT IN FACT CARE. MAYBE SOME DO BUT MOST DONT. MIRAHEZE IS DEAD AND NO ONE WANTS TO REVIVE IT. TO PEOPLE WHO WANT TO BE TREATED WITH RESPECT AND HAVE THEIR TASKS DONE MIRAHEZE IS NOT THE PLACE ANYMORE. I HAD A WIKI BUT AFTER THIS AWFUL TREATMENT I AM LEAVING AND GOING SOMEWHERE ELSE. LEAVE AND MOVE YOUR WIKI TO A BETTER PLACE BEFORE IT SHUTS DOWN ONE DAY AND YOUVE GOT NOTHING. I WISH I COULD TRUST CURRENT MANAGEMENT BUT I CANT THEY DONT REALLY CARE ABOUT US. Oct 19 2023, 13:59
LC_Developer renamed this task from WHOEVER IS RUNNING MIRAHEZE RIGHT NOW IS SO DISRESPECTFUL THEY TREAT US LIKE GARBAGE AND MAKE US WAIT MONTHS FOR TASKS AND THEY HAVE THE AUDACITY TO LIE TO US AND CLAIM THEY’RE WORKING SO HARD BUT IN FACT THEY’RE DOING NOTHING BECAUSE THEY DONT IN FACT CARE. MAYBE SOME DO BUT MOST DONT. MIRAHEZE IS DEAD AND NO ONE WANTS TO REVIVE IT. TO PEOPLE WHO WANT TO BE TREATED WITH RESPECT AND HAVE THEIR TASKS DONE MIRAHEZE IS NOT THE PLACE ANYMORE. I HAD A WIKI BUT AFTER THIS AWFUL TREATMENT I AM LEAVING AND GOING SOMEWHERE ELSE. LEAVE AND MOVE YOUR WIKI TO A BETTER PLACE BEFORE IT SHUTS DOWN ONE DAY AND YOUVE GOT NOTHING. I WISH I COULD TRUST CURRENT MANAGEMENT BUT I CANT THEY DONT REALLY CARE ABOUT US to cloud11 intermittently showing drives as fault or logical drives not showing. Oct 19 2023, 14:23
LC_Developer updated the task description.
RespectMat renamed this task from cloud11 intermittently showing drives as fault or logical drives not showing to We demand respect and honesty! Stop lying to us and saying everything is swell and your working hard. Instead of telling us why don't you SHOW US with actual FACTS and ACTIONS. People should just leave this swamp unless the management promises to actually help which they are not doing they are just disrespecting us users and think they are better than us. Oct 23 2023, 10:19
RespectMat updated the task description.
RespectMat subscribed.
Void changed the edit policy from "All Users" to "Custom Policy". Nov 1 2023, 20:08