{"key":"homelab_02_nic_diagnostics_2026_03_20","title":"homelab-02 NIC Diagnostics","content":"Node: homelab-02 (192.168.4.54)\nInterface: nic0 bridged into vmbr0\nUpdated diagnostics on 2026-03-20 03:32 UTC.\n\nNew findings:\n- SSH access path from svc-ai: user svc-admin with key ~/.ssh/id_ed25519_homelab.\n- nic0 remained in prior diagnostic state when rechecked:\n  - speed 100Mb/s\n  - duplex full\n  - autoneg off\n  - EEE disabled\n  - offloads largely disabled (tso/gso/gro/rx/tx off)\n- carrier_changes at recheck: 11864, later 11868 before watchdog disable.\n- Kernel logs from this boot showed many periodic link drops. After forced 100Mb, link-down/up events continued at regular ~2-minute intervals.\n- nic-watchdog.timer was active and nic-watchdog.service failed every run.\n- /usr/local/sbin/nic-watchdog.sh is misconfigured for this host: it runs `ping -I nic0 192.168.4.1`, but nic0 is a bridge slave without an IP. The host IP/gateway live on vmbr0.\n- Live verification:\n  - `ping -I nic0 192.168.4.1` lost 100%.\n  - `ping -I vmbr0 192.168.4.1` succeeded.\n  - /var/log/nic-watchdog.log showed every 2-minute run declaring gateway unreachable on nic0 and resetting the interface.\n  - journal timestamps matched exactly: watchdog start -> ~3s later nic0 Link Down -> ~5s later Link Up -> watchdog failure.\n- Action taken during diagnostics:\n  - `systemctl disable --now nic-watchdog.timer`\n  - stopped further scheduled watchdog resets.\n- Observation after disabling watchdog:\n  - monitored nic0 for ~4 minutes from 2026-03-20 03:28 UTC to 03:32 UTC\n  - carrier_changes stayed flat at 11868 the entire window\n  - no new kernel link flap events occurred\n\nRevised conclusion:\n- Earlier in the investigation there were genuine e1000e hardware-hang and link-flap symptoms, so a physical/NIC issue may still have existed historically.\n- However, the ongoing periodic flaps observed during the later phase of diagnostics were being actively caused by the broken 
nic-watchdog service, not by spontaneous link loss.\n- Current immediate cluster instability driver on homelab-02 was the watchdog resetting nic0 every 2 minutes because it probed the wrong interface.\n\nRecommended next actions:\n- Leave nic-watchdog.timer disabled.\n- If watchdog behavior is still desired, rewrite it to test vmbr0 (or routing reachability) rather than nic0.\n- Observe cluster stability with watchdog disabled before making further hardware conclusions.\n- After observation, consider reverting temporary diagnostics (autoneg/offloads) in a controlled step if no spontaneous flaps recur.\n\nState after this update:\n- nic-watchdog.timer disabled/inactive\n- nic0 stable for at least 4 minutes post-disable\n- corosync quorate at time of recheck\n\n---\n**2026-03-20 03:32:48 UTC | AI Update via MCP**","summary":"Node: homelab-02 (192.168.4.54)\nInterface: nic0 bridged into vmbr0\nUpdated diagnostics on 2026-03-20 03:32 UTC.\n\nNew findings:\n- SSH access path from svc-ai: user svc-admin with key ~/.ssh/id_ed25519_homelab.\n- nic0 remained in prior diagnostic state when rechecked:\n  - speed 100Mb/s\n  - duplex full\n  - autoneg off\n  - EEE disabled\n  - offloads largely disabled (tso/gso/gro/rx/tx off)\n- carrier_changes at recheck: 11864, later 11868 before watchdog disable.\n- Kernel logs from this boot showed many periodic link drops. After forced 100Mb, link-down/up events continued at regular ~2-minute intervals.\n- nic-watchdog.timer was active and nic-watchdog.service failed every run.\n- /usr/local/sbin/nic-watchdog.sh is misconfigured for this host: it runs `ping -I nic0 192.168.4.1`, but nic0 is a bridge slave without an IP. 
The host IP/gateway live on vmbr0.\n- Live verification:\n  - `ping -I nic0 192.168.4.1` lost 100%.\n  - `ping -I vmbr0 192.168.4.1` succeeded.\n  - /var/log/nic-watchdog.log showed every 2-minute run declaring gateway unreachable on nic0 and resetting the interface.\n  - journal timestamps matched exactly: watchdog start -> ~3s later nic0 Link Down -> ~5s later Link Up -> watchdog failure.\n- Action taken during diagnostics:\n  - `systemctl disable --now nic-watchdog.timer`\n  - stopped further scheduled watchdog resets.\n- Observation after disabling watchdog:\n  - monitored nic0 for ~4 minutes from 2026-03-20 03:28 UTC to 03:32 UTC\n  - carrier_changes stayed flat at 11868 the entire window\n  - no new kernel link flap events occurred\n\nRevised conclusion:\n- Earlier in the investigation there were genuine e1000e hardware-hang and link-flap symptoms, so a physical/NIC issue may still have existed historically.\n- However, the ongoing periodic flaps observed during the later phase of diagnostics were being actively caused by the broken nic-watchdog service, not by spontaneous link loss.\n- Current immediate cluster instability driver on homelab-02 was the watchdog resetting nic0 every 2 minutes because it probed the wrong interface.\n\nRecommended next actions:\n- Leave nic-watchdog.timer disabled.\n- If watchdog behavior is still desired, rewrite it to test vmbr0 (or routing reachability) rather than nic0.\n- Observe cluster stability with watchdog disabled before making further hardware conclusions.\n- After observation, consider reverting temporary diagnostics (autoneg/offloads) in a controlled step if no spontaneous flaps recur.\n\nState after this update:\n- nic-watchdog.timer disabled/inactive\n- nic0 stable for at least 4 minutes post-disable\n- corosync quorate at time of recheck\n\n---\n**2026-03-20 03:32:48 UTC | AI Update via MCP**","status":"active","namespace":"general","namespace_name":"general","namespace_tier":"shared","tags":[]}