Where To Start With Infrastructure Monitoring

Recently I spent time revisiting our monitoring system. It needed a little bit of TLC and some of the staff wasn’t clear on exactly how it works and does its magic. As a follow-up, I thought it might be useful to write a little about monitoring. I mean, so what’s the point anyway?

Monitoring has evolved over the years, especially with cloud computing and more resilient infrastructures. The tools have also progressed, and I think it’s pretty clear that anyone deploying a serious monitoring system has long since abandoned the old days of MRTG (http://oss.oetiker.ch/mrtg/) and mon (https://mon.wiki.kernel.org/index.php/Main_Page).  Even the infamous and all-powerful Nagios is falling by the wayside. Finally, on the other end of the spectrum is software like SMARTS (http://www.emc.com/it-management/smarts/index.htm), formerly System Management Arts.

A good monitoring tool starts with:

  • Collects data and trends it over time (collection history)
  • Applies thresholds to the data
  • Sends notifications based on those thresholds
  • Displays information in a meaningful way

That’s really the nuts and bolts of it. After that, things get much more in-depth. For example, how the data is collected and what escalation rules can be applied when sending notifications, etc. In addition, what about correlating the data and setting dependencies? The feature list goes on and on.
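The four basics above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular monitoring product works: the `Check` class, the `cpu_percent` metric, the 90% threshold, and the 288-sample history are all made-up assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Check:
    """One monitored metric with a rolling history and a single threshold (hypothetical)."""
    name: str
    threshold: float
    # Keep the most recent 288 samples (an arbitrary choice for this sketch).
    history: Deque[float] = field(default_factory=lambda: deque(maxlen=288))

    def record(self, value: float) -> bool:
        """Store the sample (trend data) and report whether it breaches the threshold."""
        self.history.append(value)
        return value > self.threshold

    def summary(self) -> str:
        """Display step: a one-line text summary of the collected history."""
        if not self.history:
            return f"{self.name}: no data"
        avg = sum(self.history) / len(self.history)
        return (f"{self.name}: last={self.history[-1]:.1f} "
                f"avg={avg:.1f} max={max(self.history):.1f}")


def run_check(check: Check, value: float, notify: Callable[[str], None]) -> None:
    """Record a sample and send a notification if the threshold is exceeded."""
    if check.record(value):
        notify(f"ALERT {check.name}: {value:.1f} exceeds {check.threshold:.1f}")


# Notifications are collected in a list here; a real system would email or page someone.
sent: List[str] = []
cpu = Check(name="cpu_percent", threshold=90.0)
for sample in (35.0, 60.0, 95.0):
    run_check(cpu, sample, sent.append)

print(sent)
print(cpu.summary())
```

Escalation rules, dependencies, and correlation would all hang off the `notify` step, which is where real tools start to differ.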

Once the data is collected and made useful (graphs, Excel, whatever), it opens the door to things outside of monitoring such as planning, troubleshooting, faster SLAs, etc.

So if you’re planning on doing a Performance Monitoring Project, think about what you want and a little bit about how you might get there. What makes a tool do performance and monitoring in one package? Explore what others have and how they have leveraged it to improve their SLAs, planning, troubleshooting etc. Finally, it would also be worth considering what software has been used in conjunction with Monitoring Software to leverage it even further.

I noticed our system works well and is now a mature deployment. Our challenges now revolve around making sure people really know how to leverage the data and continuously document and improve the system.

Remote Desktop Services

With businesses attaining more WAN bandwidth and their trust in hosted services increasing, Microsoft is investing heavily in Remote Desktop Services.  Renamed from Terminal Services, Remote Desktop Services encompasses multiple ways to deliver application access from any location.  Below, you will find information on some of the features and requirements in an RDS deployment.

MS RDS Blog
WAN Optimization
RDP Client / Server features cross reference

Why RDS?

  1. Local LAN connectivity when using applications (e.g. QuickBooks) and when accessing LAN resources (e.g. loading large files)
  2. Improved security for remote users
    1. Data is stored on the servers, not on laptops. This also means data is backed up consistently.
  3. New user setup is quickly done and without the need to “reimage” existing computers
  4. Portability for remote work
  5. Thin Client support
  6. Business Continuity and Disaster Recovery
  7. Green computing (more effective use of resources)
  8. Non-compliant PCs can connect with minimal security compromises
  9. Encrypted connectivity and application-level access limitation for compliance purposes or restricted access for external partners
  10. Centralize application management (updates, configuration is done in one place)

Functions

On the surface, RDS can be broken down into 2 Functions: Session Hosts and Virtual Desktop Infrastructure (VDI).  When breaking down the session hosts function further, we can include features such as RemoteApps and Remote Session Host (Terminal Services).  Similarly, VDI provides us with Personal Virtual Desktops and Pooled Virtual Desktops.

Virtual Desktop Infrastructure

Personal Desktops
This is geared for full desktop replacement deployments. The user will treat this as their own personal computer in a VM.

Pooled Desktops
Pooled desktops are similar to deploying VMs in an academic environment. This usually means the VMs are preinstalled with generic applications and users have full administrative access to install their custom applications.  Of course, after they log off, the VM is reverted to its original state for the next user. An example usage would be to provide a pool of 10 Windows XP VMs for users to use intermittently due to legacy software incompatibilities.

Remote Session Host (aka Terminal Services)

Web Access – Single sign-on web portal showing RemoteApps

RemoteApp  – A more seamless integration between remote applications and local desktop

    1. Does not require Windows 7 computer to be joined to domain
    2. Updates automatically when the feeds are updated by administrators
    3. Users have to log on only once to create the connection
    4. XML – so can be used in other ways

Capacity Planning

Servers
It’s better to purchase 2 servers than it is to purchase 1 loaded with more memory. The reason is that you can load balance between 2 RDS servers, and smaller memory modules cost a lot less than larger ones. Scaling OUT instead of UP is more cost effective, increases disk IO paths, and creates redundancy.

Processor
Unfortunately, adding processors isn’t a 1:1 improvement. Usually, going from 1 to 2 processors will achieve a 1.8:1 gain, while going from 2 to 4 processors will achieve a 1.65:1 improvement.
If you have each user session taking up 10% of CPU, then the server’s CPU can handle up to 10 users at full load. If you added more CPUs to get a total of 4 CPUs, it would be 10*1.8 (1 => 2 cpu)*1.65 (2 => 4 cpu) = 30 users total. As you can see, it’s not 40 users.
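The arithmetic above can be written as a small Python helper. It is a rough sketch that only uses the two scaling factors quoted in this post (1.8 and 1.65); the function name and the dictionary layout are my own, and real gains will vary with the workload.

```python
# Scaling factors quoted above: going from 1 to 2 CPUs gives about 1.8x,
# and going from 2 to 4 CPUs gives a further 1.65x.
SCALING_STEPS = {2: 1.8, 4: 1.65}


def estimated_users(cpu_count: int, cpu_percent_per_session: float) -> float:
    """Estimate how many sessions fit, given per-session CPU use on a single CPU."""
    users = 100.0 / cpu_percent_per_session  # capacity of one CPU at full load
    for step_cpus, gain in sorted(SCALING_STEPS.items()):
        if cpu_count >= step_cpus:
            users *= gain
    return users


for cpus in (1, 2, 4):
    print(cpus, "CPU(s):", round(estimated_users(cpus, 10.0), 1), "users")
```

For 4 CPUs this gives 29.7, which is where the "30 users, not 40" figure comes from.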

  • Use a processor with SLAT support

Memory
Usually, allocate about 500MB per session for a 64-bit OS. Of course, the best thing to do is to find the working set of a user’s session.
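As a rough illustration, the 500MB rule of thumb turns into a quick memory-bound session estimate. The 16GB server and the 2GB set aside for the operating system are assumptions picked for the example, not recommendations; measuring the real working set is still the better approach.

```python
def sessions_by_memory(total_ram_mb: int, os_reserved_mb: int, per_session_mb: int = 500) -> int:
    """Estimate how many sessions fit in RAM after reserving memory for the OS."""
    usable_mb = total_ram_mb - os_reserved_mb
    return max(usable_mb // per_session_mb, 0)


# Hypothetical server: 16 GB of RAM with 2 GB held back for the OS.
print(sessions_by_memory(total_ram_mb=16384, os_reserved_mb=2048))
```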

Hardware Integration

Enlightenments

This feature in Windows Server 2008 and Vista or later coordinates actions with the hypervisor to make sure that the guest is interacting with the hardware as efficiently as possible.  The kernel basically only asks for instructions to be carried out within the confines of its child partition instead of all the partitions.  It reduces wasted CPU usage.

VM integration components

These components accelerate VM access to devices.  Without them, the VM will configure hardware device drivers with the emulated devices that the hypervisor presents to it.

SLAT

AMD-V Rapid Virtualization Indexing (RVI) and Intel VT Extended Page Tables (EPT)

Although running RDS in a VM isn’t a problem, it does take up additional CPU cycles to maintain a “shadow” page table.  When the page table is updated in the VM, the hypervisor has to update its “shadow” page table also.  This can take away precious CPU cycles and slow down your server.  SLAT-enabled processors mitigate this issue by maintaining the address mappings in hardware, not software.  Just as hardware RAID manages disks in hardware, SLAT manages memory addresses in hardware.  In the end, both memory usage and processor overhead will decrease.  This enables you to host more VM sessions by a factor of 1.6-2.5 times.  It’s highly recommended to have this for memory intensive workloads like RDS, SQL, IIS, Exchange, etc.

Improved Application Compatibility

  1. MSI package installation – Prevents simultaneous first-time uses of MSI-installed applications from blocking each other
  2. Dynamic Fair Share Scheduling – A better way of preventing a single session from starving other sessions for processor cycles
  3. IP Virtualization – Allows a session or an application within a session to have a unique IP.  Applications that require a discrete IP address can be used.

High-Fidelity User Experience

  1. True multi-monitor support, including varying layouts and landscape/portrait orientations
  2. Aero remoting for single-monitor sessions on Windows 7
  3. Client-side rendering of multimedia and audio Windows Media Player files
  4. Improved display of video from Silverlight and Windows Media Foundation
  5. Bi-directional audio remoting, including sound recording to a remote session

Exploring Malware Types

Malware is the term given to a set of software with one specific function: Malicious activity. Most users know of this danger as a “Computer Virus”, but the term virus these days has a very specific meaning. When we break down the dozens of terms given to Malware, we can build an understanding of the level of infection we face during the removal process.

Here are a few of the major types of Malware users should be aware of:

Trojan

  • Malware that disguises itself as a normal file or program to trick users into downloading and installing malware. Does not self replicate or spread.

Virus

  • Malware that replicates and spreads based on user interaction. Opening infected files or running an infected executable usually triggers the virus.

Worm

  • The most common type of malware. They spread over networks by exploiting operating system vulnerabilities. Worms can contain “payloads” that perform certain actions (such as deleting or stealing data). Worms differ from Viruses in that they are able to self-replicate and spread independently. Ex. Polymorphic or Metamorphic.

Rootkit

  • Malware that enables continued privileged access to a computer. As a result, it can subvert software that is designed to detect or remove it.  Typically deployed through Trojans or security vulnerabilities. Can reside in the kernel of the OS, or even in the firmware of devices.

Spyware

  • Focuses on data harvesting or modifying security/permissions settings. Typically deployed through trojans.

Ransomware

  • Malware that essentially holds a system captive while demanding ransom. The most damage will come from users with Admin/root access running  a trojan.

Adware

  • Automatically delivers advertisements. Not always malware. When bundled with Spyware, can create elaborate phishing attempts.

Bot

  • Software that performs specific operations using a host computer. This can include cheating at video games, but more dangerously used in botnets to perform DDoS attacks.

Zero Day Attack

  • Not a type of Malware, but a description of the threat. A Zero-day attack is a threat that exploits a previously unknown application vulnerability. It is named as such because developers have had no time to address and patch the issue.

With an understanding of the different types of Malware, we can hope to prevent further infection and reinfection, as well as build a background to understand the newest threats.

Server Rack Configuration

Proper server rack configuration is key for every business as it provides the technological backbone. There are many options for racks, rack components, and the way they are configured. In this blog post, I will discuss the various options and best practices.

Server rack options?  There are a few options to choose from: 2 post racks, 4 post racks, and rack enclosures.  2 post server racks are ideal for light equipment (e.g. patch panels, switches, and firewalls).  They may also be used for heavier equipment when optional accessories such as trays or conversion kits are added.  Keep in mind that most 2 post rack systems can only support up to 1000 lbs.  2 post racks are also cheaper than 4 post rack systems.  4 post racks cost more money but can support more equipment; the average 4 post rack system can support up to 3000 lbs.  You have the option of getting a bare-bones server rack, which comes with no options or built-in cable management, or a 4 post rack enclosure, which generally comes with features such as secure access and built-in cable management.  2 post and 4 post racks also come in a variety of sizes, from 6U up to 55U.  The most common rack size in small and medium-sized businesses is 42U (6 ½ ft.) and 3.5 ft. deep (4 post).

What kind of rack should my business use? This depends on several items: business size and amount of equipment; future expansion (you always want to plan for future growth); available real estate (server room size may not allow for certain racks); environment (do you have a secure server room, or do you need a rack enclosure with a lock because your business does not have a server room? Remember, unauthorized access can cause damage to any business); and money (yes, in the end it comes down to how much money you have available).  So why all this need for server racks? Two simple reasons: organization and equipment security.

What is a U? A U is a rack unit: the height of rack-mounted equipment is described as a number of U, where 1U is 1.75 inches. Most server racks have 1U markings along the posts to make mounting hardware easier and more efficient.
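To put numbers on rack units, here is a small Python sketch. The 1.75 inch figure is the standard height of one rack unit; the helper names are my own. Note that the result is mounting space only, and the rack frame itself adds to the overall height.

```python
RACK_UNIT_INCHES = 1.75  # standard height of 1U


def mounting_height_inches(units: int) -> float:
    """Vertical mounting space provided by a rack (or taken by a device) of the given size."""
    return units * RACK_UNIT_INCHES


def mounting_height_feet(units: int) -> float:
    """Same as above, converted to feet."""
    return mounting_height_inches(units) / 12.0


for size in (6, 42, 55):
    print(f"{size}U = {mounting_height_inches(size)} in "
          f"({mounting_height_feet(size):.2f} ft) of mounting space")
```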

How should the server rack be installed? You should always examine the environment where the server rack will be placed.  Find the cold/hot spots in the room and place the front of the rack facing the cold area to provide maximum cooling for your hardware. Ensure you also have enough space around the rack to conduct any service, and don’t forget about doors/access panels that swing open. All server racks should be secured in some way. 2 post rack systems should be bolted to the ground with a top ladder support heading out to the rear wall.  4 post rack systems can also be bolted to the ground, but they also come with screw-out feet. Lastly, remember to ground your rack to an electrical panel or busbar.  This task should be handled by an electrician.

How should I install my rack mounted hardware?  This task can sometimes be confusing as there can be many devices to mount. The easiest solution is planning!  Inventory your equipment and determine the space needed.  I also recommend using Visio’s rack diagram, as it gives you a virtual view of your rack. Before you begin mounting big devices such as servers, you’ll want to mount any cable management options and power distribution units. When the time comes to mount the main devices, I follow one rule: heaviest items on the bottom.  No one wants to pick up a 50 lb UPS and mount it at the top or even the middle. Example of mounted devices from the bottom up: UPS, servers, video/input, switches, patch panels.
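The planning step can be roughed out in code before opening Visio. This is a minimal sketch of the "inventory, check the space, heaviest on the bottom" idea; the device list, heights, and weights are made-up examples, and sorting purely by weight is a simplification of the bottom-up order I described.

```python
from typing import List, NamedTuple, Tuple


class Device(NamedTuple):
    name: str
    height_u: int
    weight_lb: float


def plan_rack(devices: List[Device], rack_units: int = 42) -> List[Tuple[int, Device]]:
    """Return (starting U position, device) pairs with the heaviest device at U1 (bottom)."""
    needed = sum(d.height_u for d in devices)
    if needed > rack_units:
        raise ValueError(f"Need {needed}U but the rack only has {rack_units}U")
    layout = []
    next_u = 1
    for device in sorted(devices, key=lambda d: d.weight_lb, reverse=True):
        layout.append((next_u, device))
        next_u += device.height_u
    return layout


# Hypothetical inventory; replace with your own equipment list.
inventory = [
    Device("patch panel", 1, 5),
    Device("UPS", 2, 50),
    Device("server", 2, 40),
    Device("switch", 1, 10),
]
for start_u, device in plan_rack(inventory):
    print(f"U{start_u}: {device.name} ({device.height_u}U, {device.weight_lb} lb)")
```

A real plan would also leave room for cable management and PDUs, which this sketch ignores.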

What management options can I get with a server rack?  Some basic options include server rails, which allow you to pull out servers without having to completely remove them. Server rack trays/shelves can also be used for non-rack-mountable devices such as server towers. Cable management ducts are a must-have in all server racks. These can be installed on the side of racks or in between switches and patch panels. They provide a clean look and make management easier. 2 post server racks can also be fitted with 2 post rack adapters that allow equipment built to full rack-mount or 4 post specs to be mounted.

That’s all I have for now, hope this has helped those reading.

DNSChanger Malware on Monday, July 9th, 2012

If you’ve browsed Facebook or Google lately, you may have come across a few articles with the warning that “millions of Americans will lose their internet connections” on Monday, July 9th. Some articles claim this so-called ‘DNSChanger’ malware is set to go off like a time bomb; others claim the FBI is forcefully causing the shutdown. Regardless of the reason, there has been much concern about a possible internet outage this Monday, and whether or not it affects you both at work and at home. All of us here at NetCal would like to save you the headache and separate the facts from the fiction.

Q: Is this issue real?

A: Yes, but the facts are greatly distorted.

The ‘DNSChanger’ malware is not lying dormant on your computer until Monday, and the FBI is not cutting off your internet access forcefully. The malware was real, however, and may have infected your computer 4-5 years ago.

Computers use something called DNS (Domain Name System) in order to translate ‘internet names’ into ‘internet numbers’. When websites like ‘www.google.com’ are typed into your browser, a request goes to a server which translates the name into the proper IP address (74.125.224.65). Your computer is normally set up to acquire the DNS server automatically from your ISP (Internet Service Provider), or from a DNS server set up in your business.

The ‘DNSChanger’ malware, widely released in 2007, changed the settings on the computers it infected and redirected the DNS address to private servers run by scam artists and identity thieves. Instead of www.google.com translating to 74.125.224.65, it would translate to their private IP addresses instead!

The scam was so widespread (half a million computers infected in the US), the FBI was forced to get involved to shut the criminals down. The criminals were caught, their equipment confiscated, and computers were rid of the infection in record time. There was just one catch: Getting rid of the DNSChanger infection did not change the computer’s DNS settings back to normal!

The FBI decided to set up real DNS servers using the IP addresses that the criminals used. In the end, even if you were infected by the malware, your internet access was no longer compromised. Fast forward 5 years to 2012, and the FBI is now retiring these servers. As a result, the previously infected computers will be without DNS services.

Q: How can I find out if I was infected?

A: You can visit ‘dcwg.org’ and have your computer tested online.

Click on “Detect” towards the top and see if you are using the FBI’s DNS servers.
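If you would rather check a machine's DNS settings by hand, the idea behind the online test can be sketched in Python: compare the configured DNS server addresses against the address ranges the rogue servers used. The ranges below are the ones I recall from the FBI's public advisory, so treat them as an assumption and confirm them at dcwg.org; the function names and sample addresses are hypothetical.

```python
import ipaddress
from typing import Iterable, List

# Rogue DNS ranges associated with DNSChanger, as recalled from the FBI advisory.
# ASSUMPTION: verify this list against dcwg.org before relying on it.
ROGUE_DNS_NETWORKS = [ipaddress.ip_network(cidr) for cidr in (
    "85.255.112.0/20",
    "67.210.0.0/20",
    "93.188.160.0/21",
    "77.67.83.0/24",
    "213.109.64.0/20",
    "64.28.176.0/20",
)]


def is_rogue_dns(server: str) -> bool:
    """True if the DNS server address falls inside one of the listed ranges."""
    address = ipaddress.ip_address(server)
    return any(address in network for network in ROGUE_DNS_NETWORKS)


def flag_rogue_servers(servers: Iterable[str]) -> List[str]:
    """Return only the configured DNS servers that look like DNSChanger servers."""
    return [server for server in servers if is_rogue_dns(server)]


# Sample input: the DNS servers you see in your network adapter settings.
print(flag_rogue_servers(["8.8.8.8", "85.255.113.7", "192.168.1.1"]))
```

Any address that comes back flagged means the DNS settings on that computer (or on the router handing them out) should be reset.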

Q: How severe is this infection? Can it be fixed?

A: It is very quick to fix, and does not permanently harm any systems.

 

For more information please visit the following:

http://www.slashgear.com/dnschanger-malware-for-dummies-sophos-video-explains-it-all-06237487/

Exchange 2007-2010: Brief Overview of Changes

 

Exchange 2007

– Routing groups are tied with Active Directory sites and services

– Replication is done using Active Directory replication

– Bridgehead server role was eliminated and replaced with the Hub Transport server

– Outlook Web Access (OWA) was dramatically improved to be similar to the 32-bit version of Outlook

– Direct file access (Access shares on servers through OWA)

– OWA provides access to mailbox rules, out-of-office rules, provisioning of Mobile devices, access to digital rights managed content

– LCR – two databases replicated on separate drives on the same server

– CCR – user mailbox replication across servers and sites (fail-over and fail-back capabilities)

 

Exchange 2007 SP1

– Public folders available in OWA

– Standby Continuous Replication (SCR) allowed for offsite, over-the-wan replication of databases with 20 minute replication delays.

– Geo-cluster is possible for remote CCR

 

Exchange 2010

– Server Licensing

– Standard supports 5 database stores

– Enterprise supports up to 100 stores

– User Licensing (independent of server licensing)

– Enterprise license provides unified messaging, per-user journaling for compliance support, and use of Exchange Server hosted services for message filtering

– No more Recovery Storage Groups (RSG)

– No more STM databases

– OWA enhanced features available to other browsers

– Database Availability Group (DAG): basically an evolution of CCR that replaces LCR, CCR, and SCR

– Remote execution of EMS commands

Record-breaking uptime is over – 1003 days

Please, a moment of silence for one of the longest uptimes for an actively used server.

When we started many years ago and moved into an office, our first server was a white-box desktop. We scrambled to build it out of components we had… some memory from here, a motherboard from over there, and hard drives (software RAID) from who knows what. It was by no means anything comparable to our current arsenal made out of stacks of PowerEdge servers running vSphere. Anyway, we have moved a few times and it has faithfully followed us. It has occupied our current location for about 3 years.

The other day, it got jealous. Well actually, I think there was a sharp voltage drop when we plugged a 4U PowerEdge server into the UPS it was sharing. The high-quality components it’s made out of apparently showed their true colors this time causing …wait for it…. a reboot!

So now we’re back to 0… it’ll be a long journey. No one has committed to upgrading the critical software it holds, so it won’t be decommissioned anytime soon.

See you again in 2.747945205479452 years.

Before and After the 4U server was plugged into the UPS. Ouch!

BEFORE 4U PowerEdge
LINEV    : 117.0 Volts
LOADPCT  :  23.9 Percent Load Capacity
BCHARGE  : 100.0 Percent
TIMELEFT :  85.0 Minutes
LASTXFER : Automatic or explicit self test

AFTER 4U PowerEdge
LINEV    : 113.7 Volts
LOADPCT  :  50.4 Percent Load Capacity
BCHARGE  : 100.0 Percent
TIMELEFT :  39.0 Minutes
LASTXFER : Unacceptable line voltage changes

Troubleshooting/Debugging BSOD errors

What happens when you get a Blue Screen of Death (BSOD)?  I’m sure almost everyone just says something like “____ Microsoft!”  Unfortunately, most of the time, you would just be using Microsoft as a scapegoat.  Why?  According to Microsoft and other gurus, about 70-80% of crashes are caused by 3rd party drivers.  Yep, all those great toys you have hooked up to your computer and the software that controls them are most likely responsible.

I have probably just blown your mind, or you are probably full of skepticism.  Hopefully these debugging techniques can make you a believer….

Step 1:  Disable auto-reboot on a crash

Step 2:  Create a full memory dump rather than a mini crash dump.  This will allow you to get more information from the dumps.

Step 3:  Install Windows Debugger tools

Step 4:  Set the symbol path to automatically download symbols from the Microsoft symbol servers (WinDbg->File->Symbol File Path->”srv*C:\symbols*http://msdl.microsoft.com/download/symbols”)

Step 5:  Open the crash dump file located in C:\Windows or C:\Windows\minidump

Step 6:  Run “!analyze -v” to get the list of drivers in the stack text.  If the driver points to one of the Windows core system files (ntoskrnl.exe, win32k.sys, etc), then you probably have to dig a little deeper.

Step 7:  Additional helpful debug commands to run to find the culprit

kv – Looks at stack of current thread.  This is used for misdiagnosed analysis.  Look for suspicious drivers

lm kv – Shows version information (dates, etc) of currently loaded drivers to find updates for.

!vm – Check pool usage (if close to maximum, then it’s a leaky driver)

!thread – looks at currently running threads

!process 0 0 – summary level display of processes during crash

!irp <irp from IRP List from !thread> – Associates drivers thread (it’s a hint to investigate)

!poolused (pool tagging needs to be enabled on XP and earlier) – Use with the Strings utility to match pool tags to drivers

!deadlock
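Steps 1, 2, and 4 above can also be scripted rather than clicked through. The sketch below only builds the strings: `reg add` commands for the CrashControl registry key (AutoReboot and CrashDumpEnabled are the standard value names) and the symbol path in the srv*cache*server form. The helper names and the C:\symbols cache folder are my own choices; run the printed commands from an elevated prompt only after reviewing them.

```python
from typing import List

CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"


def crash_control_commands() -> List[str]:
    """Build reg.exe commands for Step 1 (no auto-reboot) and Step 2 (full memory dump)."""
    settings = {
        "AutoReboot": 0,        # 0 = stay on the blue screen instead of rebooting
        "CrashDumpEnabled": 1,  # 1 = complete memory dump (3 would be a small/mini dump)
    }
    return [
        f'reg add "{CRASH_CONTROL_KEY}" /v {name} /t REG_DWORD /d {value} /f'
        for name, value in settings.items()
    ]


def symbol_path(cache_dir: str = r"C:\symbols") -> str:
    """Build the Step 4 symbol path: local cache folder plus the Microsoft symbol server."""
    return f"srv*{cache_dir}*http://msdl.microsoft.com/download/symbols"


for command in crash_control_commands():
    print(command)
# The same string can be pasted into WinDbg or set as the _NT_SYMBOL_PATH environment variable.
print("_NT_SYMBOL_PATH=" + symbol_path())
```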

 

 

Debugging mode (F8) – Use when no crash dump is created.  It requires a USB (modify boot.ini) or serial connection from another system running WinDbg

Windbg – File->Kernel Debug

Debug -> Break to connect to crashed system

.dump (saves dump information)

 

Hung system troubleshooting (computer freeze)

– Use crash on Ctrl+Scroll Lock+Scroll Lock (CrashOnCtrlScroll registry setting)

– Check the other processors on multiprocessor systems

lm kv <driver name from stack>

Help for Asterisk AA50 including issues, how to rebuild compact flash filesystem, and workarounds

First, I would like to say that the AA50 is not a recommended product.  Actually, I think it's the opposite of it.  I would recommend an analog phone with a voicemail recorder before I would recommend one of these things.  Why do I have such harsh feelings towards it?  Well, the support personnel were unable to realize that a PBX has major issues if it reboots randomly and prevents you from leaving voicemails or getting voice prompts.  I even tried to make them understand by explaining to them that the problem is not an advanced or unsupported feature, but one that's critical to the basic intended functionality of the device itself.  The response was "It's not meant to be used as a full PBX".  Secondly, they told me the issues are being worked on, but they haven't figured it out yet.  Uhh… my support ticket was created about a year ago!  Response: "Do you know how hard it is to rewrite a firmware?"  I'm a very patient and understanding person, but if you fail to recognize a critical issue with a product at such a simple level, I feel my point will never be accepted.  Just imagine if Toyota took a year to fix their brake problems or said the cars weren't supposed to be fully used that way….

I'm proud to do Digium's job for everyone by providing the public community a work-around and documenting what I've learned.  I hope this helps others.  As for the AA50, I will never buy anything solely and directly made by Digium again.  Buy Sangoma and use open-source Asterisk.

Background: http://www.keycruncher.com/blog/2009/11/02/digium-confirms-major-issues-with-aa50-voip-appliance-spotaneous-reboots-and-memory-card-write-lock-a-review/

Symptoms:

  1. The system reboots randomly and frequently
  2. The system loses access to the compact flash filesystem frequently, thus no voicemails or voicemenu prompts or even backups.
  3. The system prevents you from deleting voicemails due to the issue with Symptom 2.

Detailed Description:

Basically, the reasons are:  Memory leak(s) (Symptom 1) and memory card write-locks (Symptoms 2 and 3)

Work-around:

Create an automated job to reboot the system every 24 hours.

  1. Create a script (reboot-24hrs.sh) in /etc/config (use this directory because it's backed up to the local storage; not flash storage)
    #!/bin/sh
    # Wait 24 hours (86400 seconds) after startup
    sleep 86400
    # asterisk -rx expects a CLI command string as its argument
    /bin/asterisk -rx
    # Reboot the appliance
    reboot

  2. Edit /etc/config/rc.local and add /etc/config/reboot-24hrs.sh &

What if you wanted to rebuild your compact flash card?  The answer is simple:

  • The appliance on startup (/etc/rc) mounts the compact flash using this command:  "mount -t ext3 /dev/hda1 /var/lib/asterisk/sounds"
  1. /sbin/create_sounds (Formats the compact flash memory card and creates the proper sounds directory.  It also downloads the files from the Internet)
  2. /sbin/update_tz (Downloads time zone files from the Internet)
  3. /sbin/update_phoneprov (Downloads phone provisioning files from the Internet)

A useful print server configuration tool

Have you ever wanted to make a backup of all your printers, their shares, the permissions for them, and the drivers on your print server?  Well, Microsoft has a very useful tool that does this.  Furthermore, it also does restores!  I couldn’t believe my eyes either!  It’s great for when you need to set up redundant print server configurations or when you are migrating print servers!

Here it is:

http://www.microsoft.com/WindowsServer2003/techinfo/overview/printmigrator3.1.mspx