Tools: Leaf Node Failures (Continued)

Replace the Leaf Node in the Same Host

Another recovery option is to introduce a new node on the same host. This method applies to registered hosts that have only lost data, whether through corruption or a disk replacement.

To recover the same host for your SingleStore cluster, ensure that any hardware issue with the host is resolved and that the machine is running normally.

For hosts missing the SingleStore data directory, the sdb-admin list-nodes command still shows the node as part of the cluster, but with an "Unknown" role, a "Stopped" process state, and no MemSQL ID.

+------------------------------------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
|                MemSQL ID                 |  Role   |                   Host                    | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------------------------------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 5FA133A599027155BC53E172B273734FF303494E | Master  | ip-10-3-12-244.us-east-2.compute.internal | 3306 | Running       | True         | 8.1.11  | Online         |                    | 0.0.0.0      |
| 6E7D4378559F38A48A039FB90294030768C04451 | Leaf    | ip-10-3-13-199.us-east-2.compute.internal | 3306 | Running       | True         | 8.1.11  | Online         | 1                  | 0.0.0.0      |
|                                          | Unknown | ip-10-3-9-161.us-east-2.compute.internal  | 3306 | Stopped       | False        |         | Unknown        |                    | 0.0.0.0      |
+------------------------------------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
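As an illustrative convenience (not part of the official tooling), such stuck nodes can be spotted by filtering the list-nodes output for rows whose MemSQL ID column is empty. The sketch below parses a captured sample of the table above; in practice you would pipe the output of sdb-admin list-nodes into the same awk filter:

```shell
# Sketch: find hosts that list-nodes reports without a MemSQL ID.
# The sample rows mimic the table above; in practice replace the
# printf with: sdb-admin list-nodes
sample='| 5FA133A599 | Master  | ip-10-3-12-244.us-east-2.compute.internal | 3306 | Running | True  |
|            | Unknown | ip-10-3-9-161.us-east-2.compute.internal  | 3306 | Stopped | False |'
orphans=$(printf '%s\n' "$sample" |
  awk -F'|' '$2 ~ /^ *$/ && NF > 4 {gsub(/ /, "", $4); print $4}')
echo "$orphans"
```

With -F'|', field 2 is the MemSQL ID column and field 4 is the host, so the filter prints the host of every row whose ID cell is blank.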

Without the MemSQL ID, you cannot run sdb-admin remove-leaf or sdb-admin delete-node.

To remove the leaf, log in to the SingleStore client and execute REMOVE LEAF 'host':port;

singlestore> REMOVE LEAF 'ip-10-3-9-161.us-east-2.compute.internal':3306;
Query OK, 1 row affected, 2 warnings (1.47 sec)
singlestore> SHOW LEAVES;
+-------------------------------------------+------+--------------------+-----------+-----------+--------+--------------------+------------------------------+--------+-------------------------+------------------------+---------------+
| Host                                      | Port | Availability_Group | Pair_Host | Pair_Port | State  | Opened_Connections | Average_Roundtrip_Latency_ms | NodeId | Grace_Period_In_seconds | Minimum_Pre_Prepare_Ts | Runtime_State |
+-------------------------------------------+------+--------------------+-----------+-----------+--------+--------------------+------------------------------+--------+-------------------------+------------------------+---------------+
| ip-10-3-13-199.us-east-2.compute.internal | 3306 |                  1 | NULL      |      NULL | online |                 16 |                        0.266 |      2 |                    NULL |                2097153 | online        |
+-------------------------------------------+------+--------------------+-----------+-----------+--------+--------------------+------------------------------+--------+-------------------------+------------------------+---------------+

sdb-admin list-nodes will still show the node as "Unknown".

+------------------------------------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
|                MemSQL ID                 |  Role   |                   Host                    | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------------------------------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 5FA133A599027155BC53E172B273734FF303494E | Master  | ip-10-3-12-244.us-east-2.compute.internal | 3306 | Running       | True         | 8.1.11  | Online         |                    | 0.0.0.0      |
| 6E7D4378559F38A48A039FB90294030768C04451 | Leaf    | ip-10-3-13-199.us-east-2.compute.internal | 3306 | Running       | True         | 8.1.11  | Online         | 1                  | 0.0.0.0      |
|                                          | Unknown | ip-10-3-9-161.us-east-2.compute.internal  | 3306 | Stopped       | False        |         | Unknown        |                    | 0.0.0.0      |
+------------------------------------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+

Running sdb-deploy uninstall against the affected host would fail because the node is still registered with the cluster. In this scenario, uninstall SingleStore using the rpm or dpkg commands instead.

vsr@ip-10-3-9-161:/$ sudo dpkg --remove singlestoredb-server8.1.11-d94707d722
(Reading database ... 121824 files and directories currently installed.)
Removing singlestoredb-server8.1.11-d94707d722 (8.1.11) ...
update-alternatives: warning: alternative /opt/singlestoredb-server-8.1.11-d94707d722/memsqlctl (part of link group memsqlctl) doesn't exist; removing from list of alternatives
update-alternatives: warning: /etc/alternatives/memsqlctl is dangling; it will be updated with best choice
update-alternatives: warning: alternative /opt/singlestoredb-server-8.1.11-d94707d722/memsql_exporter/memsql_exporter (part of link group memsql_exporter) doesn't exist; removing from list of alternatives
update-alternatives: warning: /etc/alternatives/memsql_exporter is dangling; it will be updated with best choice
update-alternatives: warning: alternative /opt/singlestoredb-server-8.1.11-d94707d722/memsql_exporter/memsql_pusher (part of link group memsql_pusher) doesn't exist; removing from list of alternatives
update-alternatives: warning: /etc/alternatives/memsql_pusher is dangling; it will be updated with best choice

Clear out any remaining SingleStore files and directories, for example /etc/memsql and /var/lib/memsql. Then check for any process still running on port 3306 (or whichever port is used for intra-cluster communication) and kill it.
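The port check can be scripted. This sketch extracts the PID from a captured netstat line (the sample mirrors the session below, including the PID 1709) so it can be handed to kill; the parsing is an illustration, not part of the official tooling:

```shell
# Sketch: extract the PID of whatever is listening on the cluster port.
# The sample line mirrors the netstat output shown below; in practice
# replace the printf with: sudo netstat -nlp
line='tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN      1709/memsqld'
pid=$(printf '%s\n' "$line" |
  awk '/:3306 .*LISTEN/ {split($NF, a, "/"); print a[1]}')
echo "$pid"   # then run: sudo kill "$pid"
```

netstat's last field has the form PID/program, so splitting on "/" yields the PID to terminate.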

vsr@ip-10-3-9-161:~$ sudo netstat -nlp | grep :3306
tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN      1709/memsqld
vsr@ip-10-3-9-161:~$ ps -ef | grep 1709
memsql      1709    1702  4 Oct12 ?        01:00:00 /opt/singlestoredb-server-8.1.11-d94707d722/memsqld --defaults-file /var/lib/memsql/6aa92299-0027-4e95-958f-89a45a449881/memsql.cnf --user 116
memsql      1710    1709  0 Oct12 ?        00:00:00 /opt/singlestoredb-server-8.1.11-d94707d722/memsqld --defaults-file /var/lib/memsql/6aa92299-0027-4e95-958f-89a45a449881/memsql.cnf --user 116
vsr+   11123   10954  0 08:42 pts/0    00:00:00 grep --color=auto 1709

Now, do a fresh SingleStore install on the affected host:

vsr@ip-10-3-12-244:/tmp$ sdb-deploy install --host ip-10-3-9-161.us-east-2.compute.internal --version 8.1.11
Toolbox will perform the following actions:
  · Download singlestoredb-server 8.1.11
  · Install singlestoredb-server 8.1.11 on ip-10-3-9-161.us-east-2.compute.internal
Would you like to continue? [y/N]: y
✓ Downloaded singlestoredb-server production:8.1.11
✓ Installed singlestoredb-server8.1.11-d94707d722 on host ip-10-3-9-161.us-east-2.compute.internal (1/1)
✓ Successfully installed on 1 host
Operation completed successfully

Create a node and add it as a leaf:

vsr@ip-10-3-12-244:~$ sdb-admin create-node --host ip-10-3-9-161.us-east-2.compute.internal --password '5eWiWLoaErUn5Y8Z' --port 3307 --base-install-dir /data/memsql/instance/
Toolbox is about to perform the following actions on host ip-10-3-9-161.us-east-2.compute.internal:
  · Run 'memsqlctl create-node --port 3307 --base-install-dir /data/memsql/instance/ --password ●●●●●●'
Would you like to continue? [y/N]: y
✓ Created node
+------------------------------------------+------------------------------------------+
|                MemSQL ID                 |                   Host                   |
+------------------------------------------+------------------------------------------+
| 6580AA1C250791658EEF8F38591E0C28DBF614E6 | ip-10-3-9-161.us-east-2.compute.internal |
+------------------------------------------+------------------------------------------+
vsr@ip-10-3-12-244:~$ sdb-admin list-nodes;
+------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  |  Role   |                   Host                    | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 5FA133A599 | Master  | ip-10-3-12-244.us-east-2.compute.internal | 3306 | Running       | True         | 8.1.11  | Unknown        |                    | 0.0.0.0      |
| 6E7D437855 | Leaf    | ip-10-3-13-199.us-east-2.compute.internal | 3306 | Running       | True         | 8.1.11  | Online         | 1                  | 0.0.0.0      |
| 6580AA1C25 | Unknown | ip-10-3-9-161.us-east-2.compute.internal  | 3307 | Running       | True         | 8.1.11  | Online         |                    | 0.0.0.0      |
+------------+---------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
vsr@ip-10-3-12-244:~$ sdb-admin add-leaf --memsql-id 6580AA1C25 --password '5eWiWLoaErUn5Y8Z' --availability-group 2
✓ Collected report for host ip-10-3-9-161.us-east-2.compute.internal
✓ Passed environment validation
Toolbox will perform the following actions on host ip-10-3-12-244.us-east-2.compute.internal:
  · Run 'memsqlctl add-leaf --host ip-10-3-9-161.us-east-2.compute.internal --port 3307 --user root --password ●●●●●● --availability-group 2'
Would you like to continue? [y/N]: y
✓ Successfully ran 'memsqlctl add-leaf'
Operation completed successfully
vsr@ip-10-3-12-244:~$ sdb-admin list-nodes;
+------------+--------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  |  Role  |                   Host                    | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+--------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 5FA133A599 | Master | ip-10-3-12-244.us-east-2.compute.internal | 3306 | Running       | True         | 8.1.11  | Online         |                    | 0.0.0.0      |
| 6E7D437855 | Leaf   | ip-10-3-13-199.us-east-2.compute.internal | 3306 | Running       | True         | 8.1.11  | Online         | 1                  | 0.0.0.0      |
| 6580AA1C25 | Leaf   | ip-10-3-9-161.us-east-2.compute.internal  | 3307 | Running       | True         | 8.1.11  | Online         | 2                  | 0.0.0.0      |
+------------+--------+-------------------------------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+

Rebalance the partitions:

REBALANCE ALL PARTITIONS;

A Pair of Leaf Nodes Fail

When a pair of leaf nodes fail, their partitions will no longer have any remaining instances, which effectively takes these partitions offline for both reads and writes.

If either of the leaf nodes’ hosts, or a leaf node’s data on a host, is recoverable, a failed leaf node can be reintroduced and reattached to its partitions by following the steps in the Replace a Failed Leaf Node in a Redundancy-1 Cluster section. After following those steps, the partitions will be back online for both reads and writes.

If neither leaf node’s host is recoverable, then data loss has occurred. You must now add replacement leaf nodes and run REBALANCE PARTITIONS ... FORCE to create new (empty) replacement partitions. This can be done by following the steps in the Replace a Leaf Node in a Redundancy-2 Cluster section.

Many Unpaired Leaf Nodes Fail

So long as two paired leaf nodes have not failed, all partitions are still available for reads and writes.

In certain circumstances, all of the leaf nodes in one availability group can fail, but no data loss will be incurred so long as redundancy is restored before another leaf node fails in the remaining availability group.

Many Leaf Nodes Fail, Some of Them Paired

When both leaf nodes in a pair fail, every partition that is hosted by these two leaf nodes will be offline for reads and writes.

When one leaf node of a pair fails, the partitions of its pair will remain online for reads and writes.

Offline partitions should be handled using the method detailed in the A Pair of Leaf Nodes Fail section. However, when both leaf nodes in a pair are unrecoverable, RESTORE REDUNDANCY or the Cluster Downsizing Steps should only be run after all partitions have either been recovered or abandoned as lost data.

Last modified: November 24, 2023
