Thursday, January 23, 2014

I've reached the 20,000 page view mark! Please, hold your applause...

Whoa!  For a neglected blog I sure have had quite a few page views!  Based on Blogger's analytics, it looks like my technical articles are the most popular ones by far.  So what does this mean?  Besides the internet being full of knowledge-hungry geeks like myself?  Well, it's time for a change.  A lot has happened in my life since I started all this.  I am going to work on making this blog more of a personal site.  The primary focus will be on being a single dad, dealing with adult ADD, and a lot of other randomness (probably mostly randomness knowing me).  The technical articles will be moved to my Yarbisoft blog (www.yarbisoft.com).  I've also got a couple of other projects in the hopper, but those will be announced at a later date because not all the details are hashed out yet.  Stay tuned and thanks for all the support over the years!

Thursday, March 14, 2013

Domain Controller rebooting into DSRM and no password?

I recently ran into a client whose Domain Controller (Windows SBS 2011) somehow got stuck in a state where, after a reboot, it would only go into Directory Services Restore Mode (DSRM).  Normally, I would just log in with the DSRM password, run MSConfig, uncheck the safe boot option responsible, and reboot.  However, in this case, we did not have the correct DSRM password documented, so I could not get to MSConfig or a command prompt.

I vaguely recall being able to use a boot ISO to reset the password for DSRM (which is essentially just a hidden local Administrator on a Domain Controller) but for whatever reason I could not get that to work this time.  Maybe I just didn't try hard enough, or maybe my memory isn't what it used to be and it is just not possible, but I gave up on this and tried something else.

My next thought was, since this was a VM, to open the VMDK with another virtual machine and just edit the BOOT.INI file.  Showing my age here, but BOOT.INI files are old school.  I had totally forgotten that the Boot Configuration Data (BCD) store replaced BOOT.INI in Windows Server 2008 and up.  I wasn't even sure this would have worked to begin with, but it was now clear it wasn't even an option.

Realizing BCD was in play, however, led me to bcdedit, which I remembered was available from the Windows 2008 R2 installation media's repair console.  In the past I had used it to repair servers that would not boot at all for various reasons, but this time I would use it to clear the safeboot parameter.  So I loaded the ISO into the virtual DVD drive, booted into the repair console, and ran:

bcdedit /deletevalue {default} safeboot

Made sure the command was successful and rebooted.  Voila!  The server came up in normal mode, and we immediately reset the DSRM password to match what we had documented.
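A couple of related commands that may save a step, written here from memory, so verify them before leaning on them.  To confirm the safeboot flag is actually gone (this assumes the DC boots from the {default} entry; substitute the right identifier if yours differs):

bcdedit /enum {default}

A safeboot value of DsRepair is what forces a box into DSRM, so it should no longer be listed.  And once you are back in Windows, the DSRM password itself can be reset with ntdsutil:

ntdsutil
set dsrm password
reset password on server null
quit
quit

The "null" tells ntdsutil to target the local server; it will prompt for the new password twice.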

Hopefully this saves someone somewhere some time.

UPDATE:  I just had the same server reboot into the DSRM loop again.  This time, we actually had the password, so fixing it with MSConfig was possible.  While in that utility I noticed an option to make the changes permanent, so I checked that, applied the setting, and confirmed I wanted to make the change.  Hopefully this will keep the loop from happening again.

Tuesday, November 27, 2012

EMC ~management storage group problems

I was recently working with a client that had an EMC CX4-120, and we were trying to set up a Cisco UCS blade cluster to boot from iSCSI on it.  Another vendor was handling the UCS portion, so I was tasked with getting the EMC side set up.  Through Unisphere I had to go in and manually create the hosts and register their IQNs.  Next I created the LUNs and the storage groups.  When assigning the LUNs to a storage group, make sure the "Host LUN ID" field gets changed if the initiator will eventually be in multiple groups where LUN ID 0 is already taken.  So in the end, each blade had a storage group mapped to a LUN with a unique Host LUN ID to be used for iSCSI booting ESXi.
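For anyone who prefers the CLI over Unisphere, the same LUN-to-storage-group mapping can be done with naviseccli.  Treat this as a sketch (the SP address, group name, and LUN IDs are just placeholders, and I did the actual work through the GUI, so double-check against your FLARE release):

naviseccli -h <sp_ip_address> storagegroup -addhlu -gname <storage_group_name> -hlu 1 -alu <array_lun_id>

The -hlu value is the Host LUN ID the blade will see and the -alu value is the LUN's actual ID on the array, so this is exactly where you avoid the duplicate ID 0 problem mentioned above.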

Great, but the hosts would not connect. 

We kept getting a warning "EV_TargetMapEntry::UpdateTargetMapEntry - different ports" and an informational message indicating an iSCSI login failure.  I went ahead and opened a case with EMC support because, from what I could tell, the configuration looked correct at this point.  We weren't really getting anywhere until, after some clicking around, I noticed that the ~management storage group showed all the blades in it, and so did the individual storage groups created for the blades.  I recalled a similar previous experience where we had to get the hosts out of the ~management group.  Simply removing them from that group doesn't work.  I ended up having to remove them from their individual groups, apply the settings, then re-add them to the individual groups to get them to disappear from the ~management group.  Once that was done, the blades were able to see their respective LUNs and life was good.
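If you would rather script the remove/re-add dance than click through Unisphere, naviseccli has connect/disconnect operations for storage groups.  Again, a sketch only (host and group names are placeholders, and I did this through the GUI, so verify the syntax on your array first):

naviseccli -h <sp_ip_address> storagegroup -disconnecthost -host <host_name> -gname <storage_group_name>
naviseccli -h <sp_ip_address> storagegroup -connecthost -host <host_name> -gname <storage_group_name>

Either way, the key is the same as above: disconnect the host from its individual group, apply, then reconnect it so it drops out of ~management.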

A couple of notes:

* Re-registering the hosts manually seems to have caused them to go back to the ~management group.
* I'm not positive if the Host LUN ID is able to stay 0 or not, but I did not want to risk a LUN ID conflict since production data was on a LUN that already had ID 0 in another storage group (one that the blades will eventually belong to).
* If you are using Active/Failover when registering a host connection, the UCS blades will show one success and one failure if set up correctly (at least what I think is correctly!).
* Deregistering certain things required being in Engineer mode.  To get into that mode from Unisphere, press Ctrl+Shift+F12 and type in the password.  It appears to be either "SIR" or "messner" depending on how old your firmware is.  Don't use this mode unless you are on the phone with support and absolutely sure you know what you are doing.

Sorry if I am missing some details, just wanted to do a brain dump while it was still semi-fresh.

Monday, August 27, 2012

NetApp disk reallocation - not all that scary

I recently had the experience of restructuring aggregates on a production NetApp FAS2050 cluster due to some incorrect initial tray structuring. Any time I work with production data there is always an uneasy feeling, but after some research I felt pretty confident.  Here is an overview of why I had to do this and what steps are involved.

The original design had (essentially) split a few trays down the middle and assigned half the disks to each head to attempt to balance a semi-random VMware load. We quickly realized that the bottleneck would not be head resource (CPU/memory/HBA) contention but rather contention from the limited number of disk spindles. Six active spindles (actually 4 after RAID-DP takes its parity disks) doesn't allow for much load, especially in a SATA tray.

To remedy this specific case, we decided it would make more sense to assign a tray per head instead of half a tray per head.  Controller A would get the SATA tray and controller B would get the 15K Fibre Channel tray, and any future trays would be matched up the same way so that we could build larger aggregates.  The goal was to take the 6-disk SATA aggregate plus the hot spare from controller B and reassign those disks to controller A, bring them into controller A's SATA aggregate, and be left with a 12-disk aggregate and 2 hot spares.  All without losing any data, of course.  Then perform the same steps to assign the FC disk tray to controller B.

So, there is no magic way that I know of to combine the aggregates without first evacuating the contents of one.  Luckily in this case, we had enough extra storage that we were able to perform Storage vMotions and easily get the aggregate empty.  If you do not have the extra space just lying around, or you do not have Storage vMotion, then you may not be able to proceed.  Depending on the capacity in question and the I/O load, there are some pretty cheap ways to get a decently solid device like a ReadyNAS that could be used temporarily as an NFS datastore.  Maybe be resourceful and get a 30-day trial ReadyNAS, use a trial license for VMware so you get the Enterprise Plus feature set which includes Storage vMotion... or set up Veeam Backup and Replication in trial mode so you can replicate, then failover and failback when you are done.  Just thinking out loud here :)

Anyway, once the LUN/export/volume/aggregate is completely evacuated, you are ready to start destroying things!  Actually, I recommend doing this in phases if you have time.  First and foremost, ensure that your backups are rock solid.  Next, if you have a LUN, take it offline.  If you have an export or a directly accessed volume, take it offline.  This helps you make sure that a) you have the right object and aren't going to ruin your life because something was labeled wrong and b) nothing breaks unexpectedly.  It is very easy to bring it back online and fix the problem.  Not so easy to "undestroy" an aggregate, although it looks like it can be done.
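For reference, the 7-Mode console equivalents of the offline steps above look roughly like this (the volume and LUN names are placeholders; System Manager will do the same thing if you prefer clicking):

lun offline /vol/<volume_name>/<lun_name>
vol offline <volume_name>

Bringing them back is just lun online / vol online with the same arguments, which is why this is such a low-risk sanity check.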

Before you proceed, I recommend taking note of which disks you actually want to migrate so that when you start reassigning disk ownership you get the correct ones.  Do this by typing:

disk show

and looking for any disks owned by the original controller and in the aggregate.  Also make note of any current spares that you want to reassign.  Ensure that you record the whole disk identifier, such as 1d.29, since there may be multiple disk 29s.

Once you are confident the aggregate is offline, no data is missing, and nothing is broken, now you can proceed with aggregate destruction.  If you have the NetApp System Manager, right click the offline aggregate and click Delete. Otherwise from the console of the controller that owns the target aggregate, type:

aggr status

Confirm you see it offline, then take a deep breath and type:

aggr destroy <your_aggr_name>

You will be prompted and forced to confirm that you are about to destroy all the data on the aggregate and that you want to proceed. If you are comfortable with this and confident that you are deleting the correct, empty aggregate, proceed by either clicking the appropriate buttons or typing yes.  If you are not, then stop, call someone for a second opinion, then repeat.  If you somehow delete the wrong one, get NetApp on the phone immediately and hope that your backups restore if needed.  I take NO responsibility for your choices, sorry :)

So at this point, your aggregate should be gone and the disks that were assigned to the aggregate should now be marked as spare drives.  Confirm this from the console by typing:

disk show

You should see all the drives noted earlier marked as spare and still owned by the original controller.  At this point I recommend waiting again if you have time.  Once you proceed from this point, the chances of recovering data from your destroyed aggregate are pretty much gone.  Reassigning disk ownership shouldn't wipe anything, but once you add the disks to the new aggregate they will be zeroed and undestroy will no longer work.  Paranoid, yes.  Employed, yes :)
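Side note on the undestroy I keep mentioning: in 7-Mode it is an advanced-privilege command, so the rough shape is below.  I have not had to lean on it myself, so consider this a pointer to look up rather than a tested recipe:

priv set advanced
aggr undestroy <your_aggr_name>
priv set admin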

To reassign the disks to the other controller, log in to the console of the controller that still owns the "new" spare disks and do the following:

Reference:  https://communities.netapp.com/docs/DOC-5030

Turn off disk auto assign: 

options disk.auto_assign off

Remove ownership from the disks you want to move by issuing the following command:

disk assign -s unowned disk.id1 [disk.id2 disk.id3 ...]

Now, go to the other controller (the one you want to claim ownership of the disks with) and make sure it sees the disks as not owned:

disk show -n

The disks should be listed as "Not Owned".  You can now assign the disks to the destination by typing one of the following commands (from the head you want to grant ownership to):

If you want to assign all unowned/unassigned disks to this controller:

disk assign all

If you only want to assign the ones we are working with:

disk assign disk.id1 [disk.id2 disk.id3 ...]

If you have made it this far and not run into anything unexpected, then great.  However, here is the first step where data will actually become unrecoverable. If I'm not mistaken, all the previous steps left the bits intact.  This step will actually zero the newly added disks.  Ready?  Let's go!

Side track for a moment:  In this case, we were going to a 12-disk aggregate with 2 hot spares.  If you are going larger than that, there are specific guidelines set forth by NetApp about RAID group sizes and structure.  Please reference the following article before allocating your disks if you are dealing with more than 12 disks, or think you will eventually add a tray that would give you fewer than 12 more and would want to add them to the same aggregate:  https://communities.netapp.com/thread/1587  There are a lot of considerations, such as firmware version, SATA vs SAS vs FC, and future plans, so think this through.  I wish I could go into detail, but the previously mentioned thread covers most of it.  Specifically note the mini articles posted by the NetApp employees fairly far down into it.  Also, here is a great write-up from the NetApp Community forum that deals with the aggregate structure and the next steps: https://communities.netapp.com/people/dgshuenetapp/blog/2011/12/19/my-first-experience-adding-disks-and-reallocation
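If the guidelines above lead you to change the RAID group size before adding disks, that is set per aggregate.  The value here is purely illustrative; pick yours based on the threads referenced above:

aggr options <your_aggr_name> raidsize 16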

Anyway, back to building our little 12-disk aggregate:

aggr add aggr_name -g raid_group -d disk.id1 [disk.id2 disk.id3 ...]

Note that the raid_group mentioned is the RAID group you will add these disks to.  For small aggregates (fewer than 14 disks or so) there is typically a single RAID group called rg0.  To find out which RAID group yours uses, type:

aggr status -v

This should display something that shows "/aggr_name/plex#/rg_name".  Make note of the RAID group name.

Now...... you wait.  If you look at the status of the aggregate it will say growing.  If you look at the status of the disks they will say zeroing.  If you look at the CPU on the filer it will be higher than normal.  If you look at your watch, it will say wait :)  Do a couple of sanity checks to make sure things look good still and then go get some coffee.  Looking back at our case, the disks took about 4 hours to initialize according to the syslog.  Once they are done, they will show up in the aggregate as available space.
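If you want something more satisfying to stare at than your watch, you can check zeroing progress from the console; both of these show a percent-complete next to the disks being zeroed, at least on the 7-Mode releases I have used:

sysconfig -r
aggr status -r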

Now the fun part: actually reclaiming the space and increasing your volume size.  Since our case was backend storage for a VMware environment (VMFS, not NFS), we needed to increase the volume size, increase the LUN size, then grow the VMFS volume.  Since vSphere 5.0 and VMFS 5 now support up to 64TB datastores on a single LUN, we could have created one large volume and one large LUN.  We opted to keep things under 2TB per volume, though, due to some point-of-no-return limitations with deduplication outlined here:  https://communities.netapp.com/thread/4360.  (Update:  I actually tried to create a >2TB LUN and it wouldn't let me.  I guess our FAS2050 doesn't know about VMFS 5.)

To increase the sizes, I've found that NetApp System Manager makes life much simpler.  However, for command line reference, to increase the size of the volume:

vol size /vol/name +###GB

As I said, this was a lot easier through System Manager so I used that.

For the LUN commands reference http://www.wafl.co.uk/lun/. To increase the size of the LUN via command line:

lun resize /vol/name/lun_name absolute_size

Again, System Manager made this much easier.

Go back into the command line and re-enable the disk auto_assign:

options disk.auto_assign on

Before you put additional load on the larger aggregate, I recommend running a reallocate so that the blocks will be optimized across the new disks.  See the previously mentioned article:  https://communities.netapp.com/people/dgshuenetapp/blog/2011/12/19/my-first-experience-adding-disks-and-reallocation.  If you do not perform this, your disks may gradually start to balance out, but you are not going to see the full benefit of the new spindles without it.  A couple quick notes:  it does require free space in the volume to run (10-20% I believe), it does take a while (ours took approximately 26 hours), and it does cause high CPU and disk activity.  The performance increase was pretty significant though, so I highly recommend learning more about reallocate and how/when to use it.  I will try to write a follow-up article that talks a little more about this process and what to expect while it runs.
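For reference, the commands I mean are roughly the following.  The volume path is a placeholder, and the options are worth reading up on before running this against production (the -p flag for physical reallocation, in particular, is covered in the article above):

reallocate on
reallocate start -f /vol/<volume_name>
reallocate status -v

The reallocate on enables the feature, start -f kicks off a one-time forced pass against the volume, and status -v is how you watch it crawl along.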

So you now have a larger aggregate, larger volume, and larger LUN.  Now, in VMware, grow the datastore.  Go to the datastore that is backed by this aggregate/volume/LUN and open its properties.  You should be able to increase the size there.  Here is a VMware KB article covering the steps involved:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1017662.  This should let you grow the VMFS volume without having to create any extents or splice things together.

That should do it!  You should now have more spindles and more capacity.  Win win!  Let me know if you find any problems with the outlined info above or if you had any success with it.

Friday, August 24, 2012

Cisco irritates me

Why must everything be so difficult with Cisco?

Specifically in this case, 802.1Q tagging on a Catalyst 2960.  The lack of a "vlan dot1q tag native" command is causing an issue for me that is actually solved in my case by using a Cisco (aka Linksys) SG300-28 switch which costs hundreds (thousands maybe?) less.

Can I vent for a minute?  OK, thanks. 

Don't get me wrong, Cisco makes solid products and they are pretty much the gold standard.  However, as a close friend of mine said when talking about a PIX, "It's easy to be rock solid when you are a rock."  The fact that you need to buy separate products for separate functions for just about everything makes it unrealistic for the types of environments I deal with on a regular basis.

Take for example a scenario where we need a firewall, WAN load balancer, WAN optimization, web filtering, inline antivirus, wireless LAN controller, and IDS/IPS.  To do this would require several Cisco devices with advanced licensing, consulting services with several white collar project managers, potentially a couple of servers, and even third party software from a vendor like WebSense.  Or... we could just buy a FortiGate UTM appliance for the same cost as a single Cisco device and it includes all of these things and more and has a very nice user interface.

Even with Cisco's purchase of Linksys and various other strategic acquisitions, all they seem to have now is a disjointed selection of product lines with new sets of limitations.  Maybe Cisco is still the premier solution for enterprises with unlimited budgets, but for the other 95% of the world, I think it is time to raise the bar.

Now, Cisco does still hold the spot for T1 routers in my book (and a few niche products like the Catalyst 3750-X stack).  But for general switches, price drives most conversations, and I can get all the features I need for most implementations from HP, Dell, Netgear, and yes, even the SMB Cisco/Linksys lines.  For firewalls/VPN/UTM/WAN OPT/LB/wireless, Fortinet holds my heart now.  For phone systems, the jury is still out, but I am pretty sure Cisco doesn't sit anywhere on my list because of complexity and cost.  If I had to pick a phone system, I would say ShoreTel at this point.  And for servers, the Cisco UCS does look pretty nice, and a sales/marketing pitch paints it as so much better than HP or anything else, but at the end of the day in my virtual reality of a life, hardware is hardware and HP/Dell (or even SuperMicro) works perfectly fine for me and my client base at a lower cost.

I'm feeling bad for being so down on Cisco right now, and I'm worried I will find a horse's head in my bed tonight when big brother sees this, but I needed to get it off my chest.  I guess they do still have good certification tracks if nothing else :)