Solaris SVM: stuck in pre-maintenance mode: resolution · 20 days ago
Well, the Sun techs just recommended we restore from backups or upgrade our distribution .. great, thanks for the in-depth technical insight … so one system we did restore from backups.
A coworker of mine who is quite brilliant with bare metal troubleshooting was able to get the first system back online by doing the following:
- Move the disk that was still bootable but stuck in the pre-maintenance loop to another system configured like the system that failed
- Mount the disk under /mnt
- Remove all of the files and directories under /mnt/dev
rm -rf /mnt/dev/*
- Copy all of the device tree and /etc/path_to_inst from the like system to the mounted drive
cp -Rp /dev/* /mnt/dev/
cp /etc/path_to_inst /mnt/etc/path_to_inst
He then unmounted the drive, moved it back to the original system, and voila, we could get into single user (maintenance) mode with
boot -m milestone=none
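The copy steps above can be sketched as a small shell function. This is only an illustration of what was done by hand: `refresh_device_tree` is a hypothetical helper name, not a Solaris tool, and the two arguments stand for the root of the healthy "like" system and the mount point of the failed system's disk. As the rest of this post shows, the procedure only got us as far as single user mode, so treat it as a last resort.

```shell
#!/bin/sh
# Sketch only: refresh_device_tree is a hypothetical helper, not a Solaris tool.
# $1 = root of the healthy "like" system (normally /)
# $2 = mount point of the failed system's disk (normally /mnt)
refresh_device_tree() {
    src_root=$1
    dst_root=$2

    # Refuse to run without both arguments, or against a missing dev directory.
    [ -n "$src_root" ] && [ -n "$dst_root" ] || return 2
    [ -d "$src_root/dev" ] && [ -d "$dst_root/dev" ] || return 1

    # Remove the stale device entries on the mounted disk.
    rm -rf "${dst_root:?}/dev/"*

    # Copy the device tree and path_to_inst, preserving modes and ownership.
    cp -Rp "$src_root/dev/"* "$dst_root/dev/" || return 1
    cp -p "$src_root/etc/path_to_inst" "$dst_root/etc/path_to_inst"
}
```

With the disk mounted under /mnt on the like system, the call would be `refresh_device_tree / /mnt`.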
Turned out that the HBA card coincidentally (no joke) went bad during or after being removed. Replacing the HBA card fixed that issue .. and let us boot single user, great!
So, I then re-initialized the meta device database, restored all mirrors and submirrors and rebooted .. and … whoops, kernel starts complaining about /etc/system being full of junk and the system doesn’t boot.
A boot from CD-ROM showed that now both root partitions on both disks were full of what appeared to be random garbage (2 MB worth!)
The Sun tech wrote back a day or two after this failed and proceeded to, in essence, ‘scold’ us for trying to copy devices from one system to another .. well, at least that got the system to boot single user! She then asked again about restoring, at which point I told her to just close the ticket, as we were making more progress on our own than we had with her (she is also responsible for the ‘9MB zip file’ quote under the Humor section of my blog).
End result – we had to restore both systems and have no idea why breaking the mirror on these systems wasn’t something we could recover from the way we are supposed to be able to do with SVM.
Disappointing, especially since our tier 4 (Sun) was not able to help us get through this without restoring. In fact, they started suggesting a restore after one call to their tier 2 people. So much for paying for support contracts and expecting expertise :(.
— Max Schubert
How come I get 'no server suitable for synchronization found' from my NTP client when the server is returning a valid NTP response to the client? · 8 days ago
This is one I hadn’t seen before last week. On our Solaris 9 clients, running
ntpdate -d -u ip.of.server
kept on returning
no server suitable for synchronization found
even though the debug mode showed UDP responses coming back from the server. The server in question runs using UDP / unicast mode.
We used snoop to look at the NTP response
snoop -v -v port 123
(use -v -v to get the protocol decode output), and saw these suspicious field/value pairs:
NTP: Leap: 0x03 - clock unsynchronized
NTP: Reference clock: INIT
NTP: Reference time: 0x00000000.00000000
There were other headers, but they did not indicate problems. The 0x03 in the Leap field, the INIT state, and a reference time of 0x00 indicated that the NTP server was not properly initialized or configured. Further investigation revealed that this was indeed the case: the Sidewinder / G2 NTP server was not properly configured.
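For reference, the Leap value that snoop prints comes from the top two bits of the first byte of the NTP packet; the same byte also carries the protocol version and the mode. A minimal sketch of that decoding, where `ntp_leap` is a hypothetical helper name and the bit layout comes from the standard NTP packet format rather than from anything specific to our setup:

```shell
#!/bin/sh
# Sketch only: ntp_leap is a hypothetical helper that decodes the first byte
# of an NTP packet. Bits 7-6 are the leap indicator, bits 5-3 the version,
# bits 2-0 the mode.
ntp_leap() {
    byte=$(( $1 ))                  # accepts decimal or 0x-prefixed hex
    leap=$(( (byte >> 6) & 3 ))
    version=$(( (byte >> 3) & 7 ))
    mode=$(( byte & 7 ))
    case $leap in
        0) meaning="no warning" ;;
        1) meaning="last minute of the day has 61 seconds" ;;
        2) meaning="last minute of the day has 59 seconds" ;;
        3) meaning="clock unsynchronized" ;;
    esac
    echo "leap=$leap version=$version mode=$mode ($meaning)"
}
```

For example, `ntp_leap 0xDC` reports leap=3, which is the state our server was advertising, while a healthy version 3 server reply would typically start with 0x1C (leap=0).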
— Max Schubert
How do I know what group Oracle expects to run as? · 8 days ago
Learned the hard way about how picky Oracle is about the group it runs as … do not change the group ownership of files owned by Oracle unless you have the .c source for the distribution of Oracle you are using and a lot of time to fix :p ..
The group Oracle runs as is found in the files
You can also do a ‘strings’ command on config.o and you will see the group name.
Unfortunately, in my case the version of Oracle came with a commercial product (eHealth), and the only thing we found we could do was reinstall … it took our in-house database staff to figure out what the problem was. CA left our ticket open for over a month without a resolution (RMAN backups were failing, not a trivial problem).
Really getting tired of the lack of support that commercial support contracts provide. It seems like support used to be so much better back in the old 90s :p.
— Max Schubert