homelab-02 NIC Diagnostics
Node: homelab-02 (192.168.4.54)
Interface: nic0 bridged into vmbr0
Updated diagnostics on 2026-03-20 03:32 UTC.
New findings:
- SSH access path from svc-ai: user svc-admin with key ~/.ssh/id_ed25519_homelab.
- nic0 remained in prior diagnostic state when rechecked:
- speed 100Mb/s
- duplex full
- autoneg off
- EEE disabled
- offloads largely disabled (tso/gso/gro/rx/tx off)
- carrier_changes at recheck: 11864; rose to 11868 before the watchdog was disabled.
- Kernel logs from this boot showed many periodic link drops. After the forced 100Mb setting, link-down/up events continued at consistent ~2-minute intervals.
- nic-watchdog.timer was active and nic-watchdog.service failed on every run.
- /usr/local/sbin/nic-watchdog.sh is misconfigured for this host: it runs `ping -I nic0 192.168.4.1`, but nic0 is a bridge slave without an IP. The host IP/gateway live on vmbr0.
- Live verification:
- `ping -I nic0 192.168.4.1` lost 100%.
- `ping -I vmbr0 192.168.4.1` succeeded.
- /var/log/nic-watchdog.log showed every 2-minute run declaring the gateway unreachable on nic0 and then resetting the interface.
- journal timestamps lined up precisely: watchdog start -> ~3s later nic0 Link Down -> ~5s later Link Up -> watchdog failure.
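The failure mode behind the live verification can be sketched as follows; the helper function and the sample `ip` output lines are illustrative, not captured from the host:

```shell
# A bridge slave like nic0 carries no IP address, so `ping -I nic0` has no
# usable source address and the probe can never succeed; the host address
# lives on the bridge, vmbr0.
# Sketch: classify an interface from the output of `ip -o -4 addr show dev <if>`.

has_ipv4() {  # has_ipv4 "<ip -o -4 addr show output>" -> yes|no
  if [ -n "$1" ]; then echo yes; else echo no; fi
}

# Illustrative outputs (on the host: nic0_out=$(ip -o -4 addr show dev nic0)):
nic0_out=""   # bridge slave: `ip -o -4 addr show dev nic0` prints nothing
vmbr0_out="5: vmbr0    inet 192.168.4.54/24 scope global vmbr0"

echo "nic0 has IPv4:  $(has_ipv4 "$nic0_out")"   # no  -> ping -I nic0 cannot work
echo "vmbr0 has IPv4: $(has_ipv4 "$vmbr0_out")"  # yes -> probe vmbr0 instead
```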
- Action taken during diagnostics:
- `systemctl disable --now nic-watchdog.timer`
- stopped further scheduled watchdog resets.
- Observation after disabling watchdog:
- monitored nic0 for ~4 minutes from 2026-03-20 03:28 UTC to 03:32 UTC
- carrier_changes stayed flat at 11868 the entire window
- no new kernel link flap events occurred
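The observation window above can be reproduced with a minimal carrier-change monitor; the sysfs path is standard, while the sampling commands shown in comments are an illustrative sketch rather than the exact procedure used:

```shell
# carrier_changes in sysfs increments on every link up/down transition, so a
# flat value across a window means no flaps occurred during that window.
STATE="/sys/class/net/nic0/carrier_changes"

flap_delta() {  # flap_delta START END -> transitions during the window
  echo $(( $2 - $1 ))
}

# On the host the window would be sampled as:
#   start=$(cat "$STATE"); sleep 240; end=$(cat "$STATE")
# Using the recorded values: 11864 -> 11868 before the watchdog was disabled,
# then flat at 11868 for the ~4-minute window afterwards.
echo "pre-disable window:  $(flap_delta 11864 11868) transitions"
echo "post-disable window: $(flap_delta 11868 11868) transitions"
```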
Revised conclusion:
- Earlier in the investigation there were genuine e1000e hardware-hang and link-flap symptoms, so a physical/NIC issue may still have existed historically.
- However, the ongoing periodic flaps observed during the later phase of diagnostics were being actively caused by the broken nic-watchdog service, not by spontaneous link loss.
- The immediate driver of cluster instability on homelab-02 was the watchdog resetting nic0 every 2 minutes because it probed the wrong interface.
Recommended next actions:
- Leave nic-watchdog.timer disabled.
- If watchdog behavior is still desired, rewrite it to test vmbr0 (or routing reachability) rather than nic0.
- Observe cluster stability with watchdog disabled before making further hardware conclusions.
- After the observation window, consider reverting the temporary diagnostic settings (forced 100Mb/autoneg off, disabled offloads) in a controlled step if no spontaneous flaps recur.
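If watchdog behavior is kept, the recommended rewrite could look like the sketch below; the ping flags and the reset-policy helper are assumptions for illustration, not the contents of the existing /usr/local/sbin/nic-watchdog.sh:

```shell
# Probe the interface that actually holds the host IP (vmbr0), not the
# bridge slave. Keeping the reset decision in a small pure function makes
# the policy easy to audit and test.
PROBE_IF=vmbr0
GATEWAY=192.168.4.1

needs_reset() {  # needs_reset PING_EXIT_STATUS -> yes|no
  if [ "$1" -eq 0 ]; then echo no; else echo yes; fi
}

# On the host the check would run as:
#   ping -c 3 -W 2 -I "$PROBE_IF" "$GATEWAY" >/dev/null 2>&1; rc=$?
#   [ "$(needs_reset "$rc")" = yes ] && echo "reset nic0 here (and log it)"
echo "after a successful probe: reset=$(needs_reset 0)"
echo "after a failed probe:     reset=$(needs_reset 1)"
```

Only resetting after a multi-packet probe failure (rather than a single lost ping) keeps one dropped reply from triggering a link bounce.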
State after this update:
- nic-watchdog.timer disabled/inactive
- nic0 stable for at least 4 minutes post-disable
- corosync quorate at time of recheck
---
**2026-03-20 03:32:48 UTC | AI Update via MCP**