homelab-02 NIC Diagnostics
Node: homelab-02 (192.168.4.54)
Interface: nic0 bridged into vmbr0
Updated diagnostics on 2026-03-20 03:32 UTC.
New findings:
- SSH access path from svc-ai: user svc-admin with key ~/.ssh/id_ed25519_homelab.
- nic0 remained in prior diagnostic state when rechecked:
- speed 100Mb/s
- duplex full
- autoneg off
- EEE disabled
- offloads largely disabled (tso/gso/gro/rx/tx off)
- carrier_changes at recheck: 11864; rose to 11868 before the watchdog was disabled.
- Kernel logs from this boot showed many periodic link drops. After the forced 100Mb setting, link-down/up events continued at consistent ~2-minute intervals.
- nic-watchdog.timer was active and nic-watchdog.service failed on every run.
- /usr/local/sbin/nic-watchdog.sh is misconfigured for this host: it runs `ping -I nic0 192.168.4.1`, but nic0 is a bridge slave without an IP. The host IP/gateway live on vmbr0.
- Live verification:
- `ping -I nic0 192.168.4.1` lost 100%.
- `ping -I vmbr0 192.168.4.1` succeeded.
- /var/log/nic-watchdog.log showed every 2-minute run declaring the gateway unreachable on nic0 and then resetting the interface.
- journal timestamps lined up precisely: watchdog start -> ~3s later nic0 Link Down -> ~5s later Link Up -> watchdog failure.
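The failure mode behind the live verification can be sketched as follows; the helper function and the sample `ip` output lines are illustrative, not captured from the host:

```shell
# A bridge slave like nic0 carries no IP address, so `ping -I nic0` has no
# usable source address and the probe can never succeed; the host address
# lives on the bridge, vmbr0.
# Sketch: classify an interface from the output of `ip -o -4 addr show dev <if>`.

has_ipv4() {  # has_ipv4 "<ip -o -4 addr show output>" -> yes|no
  if [ -n "$1" ]; then echo yes; else echo no; fi
}

# Illustrative outputs (on the host: nic0_out=$(ip -o -4 addr show dev nic0)):
nic0_out=""   # bridge slave: `ip -o -4 addr show dev nic0` prints nothing
vmbr0_out="5: vmbr0    inet 192.168.4.54/24 scope global vmbr0"

echo "nic0 has IPv4:  $(has_ipv4 "$nic0_out")"   # no  -> ping -I nic0 cannot work
echo "vmbr0 has IPv4: $(has_ipv4 "$vmbr0_out")"  # yes -> probe vmbr0 instead
```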
- Action taken during diagnostics:
- `systemctl disable --now nic-watchdog.timer`
- stopped further scheduled watchdog resets.
- Observation after disabling watchdog:
- monitored nic0 for ~4 minutes from 2026-03-20 03:28 UTC to 03:32 UTC
- carrier_changes stayed flat at 11868 the entire window
- no new kernel link flap events occurred
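The observation window above can be reproduced with a minimal carrier-change monitor; the sysfs path is standard, while the sampling commands shown in comments are an illustrative sketch rather than the exact procedure used:

```shell
# carrier_changes in sysfs increments on every link up/down transition, so a
# flat value across a window means no flaps occurred during that window.
STATE="/sys/class/net/nic0/carrier_changes"

flap_delta() {  # flap_delta START END -> transitions during the window
  echo $(( $2 - $1 ))
}

# On the host the window would be sampled as:
#   start=$(cat "$STATE"); sleep 240; end=$(cat "$STATE")
# Using the recorded values: 11864 -> 11868 before the watchdog was disabled,
# then flat at 11868 for the ~4-minute window afterwards.
echo "pre-disable window:  $(flap_delta 11864 11868) transitions"
echo "post-disable window: $(flap_delta 11868 11868) transitions"
```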
Revised conclusion:
- Earlier in the investigation there were genuine e1000e hardware-hang and link-flap symptoms, so a physical/NIC issue may still have existed historically.
- However, the ongoing periodic flaps observed during the later phase of diagnostics were being actively caused by the broken nic-watchdog service, not by spontaneous link loss.
- The immediate driver of cluster instability on homelab-02 was the watchdog resetting nic0 every 2 minutes because it probed the wrong interface.
Recommended next actions:
- Leave nic-watchdog.timer disabled.
- If watchdog behavior is still desired, rewrite it to test vmbr0 (or routing reachability) rather than nic0.
- Observe cluster stability with watchdog disabled before making further hardware conclusions.
- After the observation window, consider reverting the temporary diagnostic settings (forced 100Mb/autoneg off, disabled offloads) in a controlled step if no spontaneous flaps recur.
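If watchdog behavior is kept, the recommended rewrite could look like the sketch below; the ping flags and the reset-policy helper are assumptions for illustration, not the contents of the existing /usr/local/sbin/nic-watchdog.sh:

```shell
# Probe the interface that actually holds the host IP (vmbr0), not the
# bridge slave. Keeping the reset decision in a small pure function makes
# the policy easy to audit and test.
PROBE_IF=vmbr0
GATEWAY=192.168.4.1

needs_reset() {  # needs_reset PING_EXIT_STATUS -> yes|no
  if [ "$1" -eq 0 ]; then echo no; else echo yes; fi
}

# On the host the check would run as:
#   ping -c 3 -W 2 -I "$PROBE_IF" "$GATEWAY" >/dev/null 2>&1; rc=$?
#   [ "$(needs_reset "$rc")" = yes ] && echo "reset nic0 here (and log it)"
echo "after a successful probe: reset=$(needs_reset 0)"
echo "after a failed probe:     reset=$(needs_reset 1)"
```

Only resetting after a multi-packet probe failure (rather than a single lost ping) keeps one dropped reply from triggering a link bounce.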
State after this update:
- nic-watchdog.timer disabled/inactive
- nic0 stable for at least 4 minutes post-disable
- corosync quorate at time of recheck
---
**2026-03-20 03:32:48 UTC | AI Update via MCP**