Diagnose and replace the defective disk

You can check the condition of a disk using its SMART (Self-Monitoring, Analysis and Reporting Technology) attributes. If the attributes show that the disk is faulty, you can replace it.

Check disk condition

  1. Get SMART attributes.
  2. Evaluate the values of the SMART attributes.

1. Get SMART attributes

The method of obtaining SMART attributes depends on the operating system installed on the server and the way the disk is connected to the server:

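  • without RAID controller: the disk is connected directly to the motherboard or through an HBA controller;
  • via RAID controller: the disk is connected via an Adaptec or MegaRAID controller installed on the server.

The steps below install packages with apt-get and address the disk directly, so they apply to a Debian-based Linux server whose disk is connected without a RAID controller.

  1. Connect to the server via SSH or via KVM console.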

  2. Install the smartmontools package, which is a set of utilities for monitoring the health of SMART-enabled HDDs and SSDs.

    apt-get install smartmontools

  3. Display information about the disks connected to the server:

    lsblk

    Disk information will appear in the response. Memorize or copy the disk IDs. For example:

    NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
    sda           8:0    0   1.8T  0 disk
    └─sda1        8:1    0   1.8T  0 part /mnt/data
    sdb           8:16   0 931.5G  0 disk
    └─sdb1        8:17   0 931.5G  0 part /mnt/backup
    nvme0n1     259:0    0 465.8G  0 disk
    ├─nvme0n1p1 259:1    0   512M  0 part /boot/efi
    ├─nvme0n1p2 259:2    0    16G  0 part [SWAP]
    └─nvme0n1p3 259:3    0 449.3G  0 part /

    Here sda, sdb, nvme0n1 are disk identifiers.

  4. Read the SMART attributes. The command depends on the disk interface:

    • for SATA:
    smartctl -iA /dev/<disk_id>

    Replace <disk_id> with the disk ID that you copied in step 3.

    • for NVMe:
    nvme smart-log /dev/<disk_id>

    Replace <disk_id> with the disk ID that you copied in step 3. The nvme utility is part of the nvme-cli package, not smartmontools, so it may need to be installed separately.
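
The choice between the two commands can be made from the disk ID itself. The following is a minimal sketch rather than part of the documented procedure: the function name smart_command is hypothetical, and it assumes that NVMe disks appear in lsblk with names starting with nvme, as in the example output, and that every other disk is SATA.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the SMART command for a disk ID
# taken from lsblk.
smart_command() {
    disk_id=$1
    case "$disk_id" in
        nvme*) echo "nvme smart-log /dev/$disk_id" ;;  # NVMe disk, e.g. nvme0n1
        *)     echo "smartctl -iA /dev/$disk_id" ;;    # assumed SATA, e.g. sda
    esac
}
```

For example, `smart_command sda` prints the smartctl command for /dev/sda. The helper only prints the command so that you can review it before running it.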

2. Assess SMART attributes

A disk is considered faulty if at least one of its SMART attributes meets the condition listed in the table below.

Attribute                   Description                                                         Field      Attribute value
5 Reallocated_Sector_Ct     Number of sectors reassigned due to errors                          RAW_VALUE  > 0
7 Seek_Error_Rate           Error rate for positioning the head assembly                        VALUE      < 45
9 Power_On_Hours            Total hours the disk has been powered on                            RAW_VALUE  > 43800
10 Spin_Retry_Count         Number of repeated spin-up attempts after the first attempt failed  RAW_VALUE  > 10
197 Current_Pending_Sector  Number of sectors in the reassignment queue                         RAW_VALUE  > 0
198 Offline_Uncorrectable   Number of sectors that the disk controller tried to fix on its own  RAW_VALUE  > 0
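
The conditions in the table can be checked mechanically against the attribute list that smartctl prints. The sketch below is one possible way to do this, not an official tool. The function name check_smart is hypothetical, and the code assumes the usual ten-column layout of `smartctl -A` output, with VALUE in column 4 and RAW_VALUE starting in column 10.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print every
# attribute that meets a "faulty" condition from the table.
# Exit status: 0 if no condition is met, 1 if at least one is.
check_smart() {
    awk '
        # Report the attribute on the current line and remember the failure.
        function flag(field, value) { print "FAULTY " $2 " " field "=" value; bad = 1 }
        NF < 10 { next }                              # skip info and blank lines
        $1 == 5   && $10 + 0 > 0     { flag("RAW_VALUE", $10) }
        $1 == 7   && $4  + 0 < 45    { flag("VALUE", $4) }
        $1 == 9   && $10 + 0 > 43800 { flag("RAW_VALUE", $10) }
        $1 == 10  && $10 + 0 > 10    { flag("RAW_VALUE", $10) }
        $1 == 197 && $10 + 0 > 0     { flag("RAW_VALUE", $10) }
        $1 == 198 && $10 + 0 > 0     { flag("RAW_VALUE", $10) }
        END { exit bad + 0 }
    '
}
```

A possible invocation is `smartctl -A /dev/sda | check_smart`. Treat the result as a first filter and read the full smartctl output before requesting a replacement, because some vendors encode raw values in their own formats.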

Replace a defective disk

If the SMART attribute assessment shows that the disk is faulty, you can initiate a replacement. To do so:

  1. Get the serial number of the faulty disk.
  2. Coordinate the replacement of the disk.
  3. Remove the disk from the RAID array.
  4. Illuminate the disk.
  5. Check the disk in the system.
  6. Add a disk to a RAID array.

1. Get the serial number of the defective disk

  1. Connect to the server via SSH or via KVM console.

  2. Display the disk information to find the serial number of the faulty disk:

    lsblk -o name,serial,model

    Disk information will appear in the response. Copy the serial number of the failed disk. For example:

    NAME    SERIAL          MODEL
    sdb     S0H0N0XYZ123456 Samsung SSD 970 EVO Plus 500GB
    nvme0n1 S0D0NX0M001234  Samsung SSD 980 PRO 1TB

    Here SERIAL is the serial number of the disk.
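
If you prefer to extract the serial number of a single disk instead of reading the table, a small filter is enough. This is a sketch under stated assumptions: the function name serial_of is hypothetical, and it expects lines with the disk name in the first column and the serial number in the second, such as those printed by `lsblk -dno name,serial` (`-d` omits partitions, `-n` omits the header).

```shell
#!/bin/sh
# Hypothetical helper: read "NAME SERIAL" lines on stdin and print the serial
# number of the disk passed as the first argument.
# Exit status: 0 if the disk was found, 1 otherwise.
serial_of() {
    awk -v disk="$1" '
        $1 == disk { print $2; found = 1 }
        END { exit !found }
    '
}
```

A possible invocation is `lsblk -dno name,serial | serial_of sdb`. The non-zero exit status for an unknown disk ID helps catch typos before the serial number is pasted into a ticket.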

2. Coordinate disk replacement

  1. Create a ticket. In the ticket specify:

  2. If the disk replacement is agreed, a Selectel employee will confirm a time that is convenient for you and the expected duration of the work. You need the duration to determine when to illuminate the disk.

3. Remove the disk from the RAID array

If the disk is in a RAID array, remove the disk from the array.
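
The procedure does not say how the array is managed. As a minimal sketch, assuming Linux software RAID managed by mdadm (hardware RAID controllers use their own utilities instead), the helper below prints the commands that mark an array member as failed and then remove it. The function name md_remove_commands and the example names /dev/md0 and /dev/sdb1 are hypothetical; /proc/mdstat shows the real array and member names.

```shell
#!/bin/sh
# Hypothetical helper, assuming mdadm software RAID: print (do not run) the
# commands that take a member out of an array.
#   $1 = array device, e.g. /dev/md0
#   $2 = member device, e.g. /dev/sdb1
md_remove_commands() {
    array=$1
    member=$2
    echo "mdadm --manage $array --fail $member"    # mark the member as failed
    echo "mdadm --manage $array --remove $member"  # detach it from the array
}
```

For example, `md_remove_commands /dev/md0 /dev/sdb1` prints the two commands for review. If the disk has partitions in several arrays, each member has to be removed from its own array.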

4. Illuminate the disk

At the time scheduled for the work, we will notify you in a ticket that we are ready to proceed with the disk replacement.

If the disk fails to illuminate and the engineers cannot identify it by serial number, the server must be shut down to replace the disk. In this case, we will report the identification problem in the ticket and agree on a time to shut down the server.

To illuminate a disk, put a load on it, for example by running a read or write operation, so that its activity indicator blinks. If the disk is ejected while the operation is in progress, the command reports read errors. This is normal, because the command is trying to access data on a disk that has already been ejected.

  1. Connect to the server via SSH or via KVM console.

  2. Display information about the disks connected to the server:

    lsblk

    Disk information will appear in the response. Memorize or copy the disk ID. For example:

    NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
    sda           8:0    0   1.8T  0 disk
    └─sda1        8:1    0   1.8T  0 part /mnt/data
    sdb           8:16   0 931.5G  0 disk
    └─sdb1        8:17   0 931.5G  0 part /mnt/backup
    nvme0n1     259:0    0 465.8G  0 disk
    ├─nvme0n1p1 259:1    0   512M  0 part /boot/efi
    ├─nvme0n1p2 259:2    0    16G  0 part [SWAP]
    └─nvme0n1p3 259:3    0 449.3G  0 part /

    Here sda, sdb, nvme0n1 are disk identifiers.

  3. Illuminate the disk by reading from it:

    dd if=/dev/<disk_id> of=/dev/null

    Replace <disk_id> with the disk ID that you copied in step 2.
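
The dd command above reads the disk once from start to end, so the load stops when the read finishes. If you want the load to stop after a fixed period instead, the read can be wrapped in the coreutils timeout utility. The sketch below is an illustration, not part of the documented procedure: the function name illuminate_disk is hypothetical, and the duration is whatever was agreed in the ticket.

```shell
#!/bin/sh
# Hypothetical helper: read from a disk for at most the given number of
# seconds to keep it busy.
#   $1 = device path, e.g. /dev/sdb
#   $2 = maximum duration in seconds
illuminate_disk() {
    dev=$1
    seconds=$2
    rc=0
    # bs=1M reads in larger blocks than the dd default of 512 bytes.
    timeout "$seconds" dd if="$dev" of=/dev/null bs=1M || rc=$?
    # 124 means timeout stopped dd after the requested duration, which is
    # the expected outcome here; any other non-zero status is a real error.
    [ "$rc" -eq 0 ] || [ "$rc" -eq 124 ]
}
```

For example, `illuminate_disk /dev/sdb 1800` reads from /dev/sdb for up to 30 minutes. A read error after the disk has been ejected makes the function return a non-zero status, which matches the behavior described above.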

5. Check the disk in the system

  1. Wait for a message in the ticket that the disk has been replaced.

  2. Connect to the server via SSH or via KVM console.

  3. Verify that the disk has been initialized in the system:

    lsblk

  4. If the disk is not in the list, reboot the server. If the disk is still not initialized after the reboot, report it in the ticket.
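
The check in step 3 can also be scripted, which is convenient when you repeat it after a reboot. This is a minimal sketch: the function name disk_present is hypothetical, and it expects one disk name per line on standard input, as printed by `lsblk -dno name`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the disk name passed as the first argument
# appears in the list of names read from stdin.
disk_present() {
    awk -v disk="$1" '
        $1 == disk { found = 1 }
        END { exit !found }
    '
}
```

A possible invocation is `lsblk -dno name | disk_present sdb && echo "disk is visible"`. The replacement disk may be assigned a different name than the old one, so compare the full lsblk output if the expected name is missing.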

6. Add a disk to a RAID array

If the disk was in a RAID array, add the replaced disk to the array.
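
As with removal, the exact commands depend on how the array is managed. The sketch below assumes Linux software RAID managed by mdadm and assumes that the replacement disk has already been partitioned to match the other members. The function name md_add_commands and the example names /dev/md0 and /dev/sdb1 are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, assuming mdadm software RAID: print (do not run) the
# commands that add a member to an array and show the rebuild progress.
#   $1 = array device, e.g. /dev/md0
#   $2 = new member device, e.g. /dev/sdb1
md_add_commands() {
    array=$1
    member=$2
    echo "mdadm --manage $array --add $member"  # start rebuilding onto the member
    echo "cat /proc/mdstat"                     # shows array state and resync progress
}
```

For example, `md_add_commands /dev/md0 /dev/sdb1` prints both commands for review before you run them.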