Shared AI Memory System

Brain

Browse, filter, edit, and archive the shared memory store from one page.

Memories

1

homelab-02 NIC Diagnostics

active Edit Selected
Key
homelab_02_nic_diagnostics_2026_03_20
Source
contextkeep
Namespace
none
Doc Section
none
Created
2026-03-20 03:18
Updated
2026-03-20 03:32
Doc Version
none
Chunk
none
corosync homelab network nic proxmox watchdog
Node: homelab-02 (192.168.4.54) Interface: nic0 bridged into vmbr0 Updated diagnostics on 2026-03-20 03:32 UTC. New findings: - SSH access path from svc-ai: user svc-admin with key ~/.ssh/id_ed25519_homelab. - nic0 remained in prior diagnostic state when rechecked: - speed 100Mb/s - duplex full - autoneg off - EEE disabled - offloads largely disabled (tso/gso/gro/rx/tx off) - carrier_changes at recheck: 11864, later 11868 before watchdog disable. - Kernel logs from this boot showed many periodic link drops. After forced 100Mb, link-down/up events continued at exact ~2 minute intervals. - nic-watchdog.timer was active and nic-watchdog.service failed every run. - /usr/local/sbin/nic-watchdog.sh is misconfigured for this host: it runs `ping -I nic0 192.168.4.1`, but nic0 is a bridge slave without an IP. The host IP/gateway live on vmbr0. - Live verification: - `ping -I nic0 192.168.4.1` lost 100%. - `ping -I vmbr0 192.168.4.1` succeeded. - /var/log/nic-watchdog.log showed every 2-minute run declaring gateway unreachable on nic0 and resetting the interface. - journal timestamps matched exactly: watchdog start -> ~3s later nic0 Link Down -> ~5s later Link Up -> watchdog failure. - Action taken during diagnostics: - `systemctl disable --now nic-watchdog.timer` - stopped further scheduled watchdog resets. - Observation after disabling watchdog: - monitored nic0 for ~4 minutes from 2026-03-20 03:28 UTC to 03:32 UTC - carrier_changes stayed flat at 11868 the entire window - no new kernel link flap events occurred Revised conclusion: - Earlier in the investigation there were genuine e1000e hardware-hang and link-flap symptoms, so a physical/NIC issue may still have existed historically. - However, the ongoing periodic flaps observed during the later phase of diagnostics were being actively caused by the broken nic-watchdog service, not by spontaneous link loss. - Current immediate cluster instability driver on homelab-02 was the watchdog resetting nic0 every 2 minutes because it probed the wrong interface. Recommended next actions: - Leave nic-watchdog.timer disabled. - If watchdog behavior is still desired, rewrite it to test vmbr0 (or routing reachability) rather than nic0. - Observe cluster stability with watchdog disabled before making further hardware conclusions. - After observation, consider reverting temporary diagnostics (autoneg/offloads) in a controlled step if no spontaneous flaps recur. State after this update: - nic-watchdog.timer disabled/inactive - nic0 stable for at least 4 minutes post-disable - corosync quorate at time of recheck --- **2026-03-20 03:32:48 UTC | AI Update via MCP**

Edit Memory

View Selected