December 25, 2008

Day 25 - dotfiles and power users

Dotfiles are precious. They help you maintain your desired environment. My dotfiles have been built very slowly over time as I find features I like or change the way I use a tool. Additionally, I learn by reading other people's rc files. Ignoring your ability to make very useful changes in behavior and operation of your favorite tools will leave you at a very minimal level of productivity.

The more time you spend using a tool, the more time you should put into configuring it and learning about it. Using a program with only its defaults is, over time, a massive productivity killer. For example, the speed at which I was able to do things in a unix shell skyrocketed when I learned that I could have vi keybindings in my shell (with 'set -o vi' or 'bindkey -v' in most shells).

As mentioned above, part of learning how to configure a tool is simply reading documentation or searching online for how to do something. Another important part is learning from others by reading their dotfiles. For others to read your dotfiles, they must be available somewhere, so I heavily encourage you to post your rc files online. Don't just publish them; add comments describing what each configuration decision does. Knowledge grows faster when there's a community contributing to it, so post documented snippets online!

Rather than covering another best practice or tool, today's article tries to fill your mind with some useful things you may want to try in your own tools. Covered below are some of my configurations for various tools. Each configuration has a link to respective documentation (if available online) about that option. Further, it is not my hope that you agree with my configurations, but that you find options here that you didn't know about that might help you.

In the process of reading peer rc files and gathering data for this article, I found a neat website where people can publish their own dotfiles, dotfiles.org. This site lets you view everyone's uploaded dotfiles.

There are far too many tools and options to cover, so I'll cover the three tools closest to my heart: zsh, vim, and screen. However, before I get into it, I want to make a few important points.

  1. Vi mode in your shell is one of the best features available if you are familiar with vi. Bash, ksh, zsh, and tcsh all support 'vi mode' in varying degrees of compatibility. In your shell, type 'set -o vi' (bash, ksh) or 'bindkey -v' (zsh, tcsh), and be happy with your increase in productivity.
  2. Set your terminal (screen or terminal) title! There are lots of existing rc files that show you how to do this. For zsh, try searching zsh screen title.
  3. Don't ignore configuration of tools you use every day. Being a power user doesn't mean you automatically do lame things like recompiling vim with -O9999; it means understanding the tools you use and how to configure them to best fit your pattern of work and your style preferences.

To repeat one more time, publish and document your dotfile configurations. OK, on to some options for zsh, vim, and screen.

December 24, 2008

Day 24 - Message Brokers

You know that cron job you have that runs every 5 minutes, checks if there is work to do, and does work only if there is work to do? Stop that. The old ways of primitive, distributed processing should be put behind you. Cease the old habit of having your pipeline members check "Is there input?" periodically; whether it's asking mysql for signal data, looking for an empty file as a signal, or whatever. There's a better way: use a message broker.

Let's take a small example. You have a machine database which includes useful data about your hardware such as hardware type, mac addresses, services, etc. You were smart and decided that your dhcp configs would be autogenerated from this database, so now your dhcp server has a cron job that runs every 5 minutes and regenerates the dhcp config and restarts dhcp. Bonus points if you only restart dhcp if the config is actually different.
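The "bonus points" variant of that cron job can be sketched in a few lines of Ruby. This is only an illustration under assumptions: the function names, the config path, and the restart command in the comment are placeholders, not part of any real tool.

```ruby
# Returns true when the candidate config differs from what is on disk
# (or when there is no config on disk yet).
def config_changed?(current_path, new_contents)
  return true unless File.exist?(current_path)
  File.read(current_path) != new_contents
end

# Writes the new config only when it differs, and reports whether the
# caller needs to restart the service.
def install_config(current_path, new_contents)
  return false unless config_changed?(current_path, new_contents)
  File.write(current_path, new_contents)
  true
end

# Example use (the path and the restart command are placeholders):
#   if install_config("/etc/dhcpd.conf", generated_text)
#     system("/etc/init.d/dhcpd", "restart")
#   end
```

Even with this check, the job still wakes up every 5 minutes to find out whether anything changed.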

What you should have done is hook whatever interface changes (or permits change to) your machine database into sending a message that tells your dhcp server to regenerate its config. How can we easily do that?

Message brokers act as a channel for processes to communicate with each other easily, and they facilitate reliable, cross-platform, cross-language, cross-network messaging. AMQP has support for a variety of messaging models (see further reading). The messaging model we want here is the 'store and forward' kind, since we only have one writer (machine database) and one reader (dhcp config updater), but if we had more readers on a single channel, we would want a 'publish and subscribe' (pubsub) model. Message brokers support multiple independent channels for your processes to communicate with. For convenience, you choose the name of that channel.

What are our options? AMQP, Advanced Message Queuing Protocol, is a fancy standard that is supported by software called message brokers. Popular message brokers include ActiveMQ, RabbitMQ, and OpenAMQ. In addition to AMQP, there are other protocols designed for messaging, such as JMS and STOMP. STOMP is simple and can work on just about any message broker. With STOMP on ActiveMQ, you can do both queue and topic message models.
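To give a feel for how simple STOMP is, here is a rough sketch of what a frame looks like on the wire: a command line, header lines, a blank line, the body, and a terminating NUL byte. The helper name stomp_frame is made up for illustration; in practice a client library builds these frames for you.

```ruby
# Build a STOMP frame: command, "name:value" headers, blank line, body, NUL.
def stomp_frame(command, headers = {}, body = "")
  header_lines = headers.map { |name, value| "#{name}:#{value}\n" }.join
  "#{command}\n#{header_lines}\n#{body}\0"
end

# A SEND frame carrying the text "UPDATED" to the destination /topic/dhcp.
frame = stomp_frame("SEND", { "destination" => "/topic/dhcp" }, "UPDATED")
```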

I'm assuming you have a message broker that supports STOMP already configured. If you don't, try out ActiveMQ. There are other brokers that support STOMP. Alternately, you can use StompConnect to add STOMP functionality to anything supporting JMS.

First, we'll want to write the code that sends a message. Since STOMP's message contents are just text, let's send the 'UPDATED' message to notify that the machines database has been modified. Here's an example in ruby:

require "rubygems"
require "stomp"   # install with 'gem install stomp'

# Connect to the stomp server
client = Stomp::Client.open "stomp://mystompserver:5906"

# Publish "UPDATED" to the destination '/topic/dhcp'.
# (Older releases of the stomp gem named this method 'send'.)
client.publish("/topic/dhcp", "UPDATED")
client.close

Assuming your machines database (or the interface to it) can be told to run this script when a modification occurs, you are halfway to completing this project.

The other part is the receiver. You need a script that will listen for notifications and regenerate the dhcp config as necessary.

require "rubygems"
require "stomp"

# Reconnect forever: when the client stops, wait a bit and start over.
while true do
  client = Stomp::Client.open "stomp://mystompserver:5906"

  # Use client acknowledgement so the broker knows each message was handled.
  client.subscribe("/topic/dhcp", :ack => :client) do |msg|
    if msg.body == "UPDATED"
      system("/usr/local/bin/gendhcpconf")
    end
    client.acknowledge(msg)
  end

  # Block until the stomp client's listener thread exits.
  client.join
  client.close
  sleep(10)
end

The receiver code is a bit longer. We subscribe to the same topic as the sender and send the server an acknowledgement that we received the message. The 'client.join' at the end uses a thread join to wait until the stomp client disconnects or dies. We wrap the whole bit in an infinite loop so we will reconnect in the event that the stomp server dies.

With this configuration, any changes made to your machines database will cause a dhcp config regeneration. The benefits here are twofold: first, you no longer wake up every 5 minutes trying to do work, and second, any changes are propagated to your dhcp server immediately (for some small definition of immediately).

Message brokers are extremely useful in helping automate your production systems, especially across platforms and languages. There are STOMP, AMQP, and JMS libraries for just about every language you might use. Lastly, with the availability of free and open source messaging tools and libraries, the cost of deploying a broker is pretty low while the gains in automation and reliability can be high.

Further reading:

December 23, 2008

Day 23 - Change Management

This post was contributed by Matt Simmons. Thanks, Matt! :)

It's been said that change is the only thing that ever stays the same, and whoever said that probably worked in IT. Transitions are a part of life, but we administrators are burdened by what I would judge to be more than our fair share.

Too frequently, we find ourselves picking up the pieces from the last major system change we made, while at the same time designing the next iteration of the infrastructure that we'll be putting in place. How many times have you chosen an implementation that wasn't ideal now, because a bigger change was just around the corner, and you wanted to "future proof" your design? Bonus points for having to make that decision due to a previous change that was still being implemented. It doesn't seem to matter how precisely you've planned a major upgrade; snags and snafus are expected to rear their ugly heads.

Is this something that we just have to deal with? Are we at the mercy of Murphy, or are there ways we can induce these issues to work to our benefit? Sure, it would be easy if we had a crystal ball, but too often we don't even have a rough guess as to where our plans will encounter problems.

Change itself isn't the enemy. Change promotes progress, and from the 10,000ft view, our long-term goals should work towards this progress. Dealing with change is a natural and positive endeavor.

Instead of being thrown about by the winds of chance, let's put some sails on our boat, and see if we can make headway by trying to manage the change on our terms. If we know that problems are going to be encountered, and we face those facts before we edit the first configuration, then we've taken the first step towards real change management.

The enemies of successful change (and the resulting progress) are imprecise requirements and lack of project leadership. Unless you plan around these pitfalls, your project may very well go into ventricular fibrillation, flip-flopping back and forth, unable to decide between two unforeseen evils midway through the work flow. While it's possible to recover from this with an injection of leadership, it's much easier to inoculate against the problem in the beginning.

If you're going to be planning a big project, you will probably want to follow a methodology. There are just about as many methods of managing a change as there are people who want you to pay them to do it, but with IT projects, I've found what I consider to be the most efficient for me. Your mileage may vary, of course.

  1. Team and goal formation

    Assuming your change is moderate to large scale, you've (hopefully) got a team of people involved, and one of them has been appointed leader. This is the point where you want to decide on your goals. Determine what success will be defined as at the end of the project, and how best to get there.

    Many times we don't yet know what or how success will be defined, or even what the target should be. Because of this, it's natural to perform step 2 before your goals have been decided upon. In fact, I'd recommend it.

  2. Analysis (Research) & Information Organization

    Too often (or not often enough, depending on your view point) we're asked to do too much with too little. Frequently, we don't even know how to do it. This Analysis step is here to allow you to make informed decisions, and to acquire the skills and resources necessary to succeed in your task. Sometimes the resources are people, in the form of new employees or contractors, or both.

  3. Design

    By this time, you know what the task entails, but you don't have a road map of how you're going to get there. This step makes you the cartographer, planning the route from where you are to the implementation of your project and beyond. Some details of the design may change during development, but it's important to have the major framework laid out in this step as you proceed.

  4. Development

    In a perfect world, you would take the design produced in step three and translate it straight into something usable. We all know that this rarely, if ever, happens. Instead, you encounter the first set of really difficult problems in this stage. Issues spring up with the technology that you're using, or with kinks in the design that you thought were smoothed over, but weren't. Development appears to follow Hofstadter's Law: 'It always takes longer than you expect, even when you take into account Hofstadter's Law'. Thorough testing at the end of the development stage will prevent misery in the next step.

  5. Implementation

    Here we find the second repository of unforeseen bugs and strange glitches that counteract your carefully planned designs. The good thing about issues at this point is that, provided you've tested thoroughly enough in development, you won't find many show stoppers. On the other hand, sometimes these bugs can appear as niggling details and intermittent issues, hard to reproduce.

  6. Support

    If you're designing, developing, and implementing a product, support is just another part of the game. This is where you pay for how carefully you performed the preceding steps. Garbage In, Garbage Out, they say, but because you've designed and built a solid system, your support tasks will be light, possibly just educating the users and performing routine maintenance.

  7. Evaluation

    Remember that part in step 1, where you decided what success would be defined as? Dust it off and evaluate your project according to those requirements. Discuss with your team what you could have improved on, and don't forget to give credit where it is due. Hard work deserves appreciation.

This method is really a modified ADDIE design, so named because it consists of Analysis, Design, Development, Implementation, and Evaluation. We've added a couple of steps to help it flow better in the IT world we live in. There are certainly other methods to look at. The Instructional Systems Design (ISD) is another one which is well known.

However you decide to manage change, it's important to stay with your plan and follow through. Remember to work and communicate with your teammates, and don't stress because the project is too big. Just take it one step at a time, follow your plan, and you'll get the job done.

December 22, 2008

Day 22 - What's the problem?

From Wikipedia's software development process article, "the most important task in creating a software product is extracting the requirements or requirements analysis." Requirements analysis is an early part of the engineering process. This means learning what problem needs to be solved and the parameters and constraints within which you must solve it.

The Wikipedia article goes on, "customers typically have an abstract idea of what they want as an end result, but not what software should do." My experience in systems administration is that customers may have an idea of what they want as an end result, but they may not know what problem they are solving and often present their idea in the form of a solution.

To describe this with an example, let's say you have a small team of sysadmins who are familiar with mysql and your group supports a few mysql deployments in your company.

A customer says, "I need postgresql installed." This is a request for action, not a description of the problem. You are only given the solution that the customer believes will bring about their desired end result. Do you simply install postgresql for them, or do you ask why they need it? If you already have a well-supported mysql deployment, you should be asking why you need to support another database.

Ask about the problem they need solved. Get details. Why does he or she need postgres? Can the existing mysql deployment and knowledge be used instead? Customers who simply ask for actions ("please implement this solution") are often unaware of existing, similar options already available. It's also possible that this customer is trying to solve a problem that doesn't exist, doesn't affect your company, or isn't feasible to solve completely.

If you get requirements, you might find they are simply "I need a database that speaks SQL." Alternately, you might find that the requirements include "I need to run this 3rd party tool which requires postgres." Dig deeper. What does this tool do? Can its features be provided by another tool that doesn't require burdening your team with additional products to support? Is the problem the customer wants to solve even in the scope of your team?

In addition to getting the necessary information about the problem, you should also make sure you are given other constraints and parameters. Is there a deadline? What is the scope of the problem, who is affected, etc? What is the priority?

Let's examine another common situation. Another customer says, "I need apache on serverfoo restarted." Again, you should ask for a description of the problem. What are the symptoms the customer is observing? Restarting apache is an action that could bring about a solution, but what are you solving? What is broken? What if a customer reports "mail is down?" What does "mail is down" mean? What are the symptoms being observed?

When digging for a description of the problem from your customer(s), be careful not to offend the customer. It's easy to dismiss the customer as an idiot if the information you are given doesn't make sense or doesn't help you fix a problem. This issue can easily occur when a non-domain-expert interacts with a domain expert. Remember that perspective is reality, and that "mail is down" makes total sense to your customer but is confusing to you. Make sure your fellow sysadmins follow the advice in this paragraph, too.

Asking for requirements can be a tool to help push back on bad ideas. Sometimes a management hammer comes down from above and says you must implement something that you disagree with or don't understand. Being a domain expert, you might be disagreeing or lacking understanding because the request doesn't make sense. Ask for requirements! Sometimes ideas manifest themselves into requests (or mandates) without the idea being actually thought out.

Lastly, always remember you can say, "no." Not every idea is a good one. Bad ideas can come with urgency. Be understanding of any urgency from your customers, but remember that you have the most information about what makes a change bad. Otherwise, why would they be asking you or your team to do it? Be aware of things people will say to convince you to do something even though you can show it is incorrect, such as "the CEO said we have to do this." Facts are your ally, so use facts to show why a proposal is wrong or why the requirements are impossible to fulfill.

December 21, 2008

Day 21 - Out-of-Band Management

Knowing I don't have to drive to the datacenter to reboot a machine gives me a warm fuzzy feeling. Do you have remote out-of-band management on your servers? Do you need it? If you need it, read on.

Remote management features include power management, KVM, serial console, and other things. Remote management is most critical when the host is not fully booted: when it's off, or you need to configure bios settings, or debug a kernel crash.

When deciding on what vendor and model to buy, knowing which of these features will save you time and money in the long term will help you decide what features to buy. More features often mean more money. For example, KVM-over-LAN is probably going to increase the server cost by quite a bit, so be sure to only buy features you need.

Remote management systems offer many different ways to interface: serial console, web browser, IPMI, OPMA, SSH, telnet, and others. Serial (i.e., RS-232) costs the most to own because the rest work over the network (which you already provide). Serial port control probably requires a separate device to provide you remote access to that serial port.

The most basic remote management feature, I think, is power management: Power on, off, and reset. Power management alone comes in two main forms, smart power strips and remote access controllers (RAC). Smart power strips offer an interface for controlling the state of each power port. RACs live closer to your system and connect to the power, reset, and other controls on your system motherboard (via wires or a separate interface).

Smart power strips are an easy way to provide remote power control to systems that don't have RACs, but there are drawbacks. A smart power strip toggling power won't do anything if your server is plugged into a UPS that's plugged into your power strip (for obvious reasons). Further, if your servers have redundant power supplies, you'll need one managed power port per power supply, and rebooting the server requires turning off all power ports before turning any back on, for a given server.
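The ordering constraint for redundant power supplies is easy to get wrong in a script, so here is a small sketch of it. The strip object and its off/on methods are hypothetical; a real smart power strip will have its own interface (SNMP, telnet, web, and so on) that you would wrap behind something like this.

```ruby
# Power-cycle a server with redundant power supplies: every port feeding the
# server must be off before any port is switched back on.
# `strip` is any object responding to off(port) and on(port) (hypothetical API).
def power_cycle(strip, ports, settle_seconds: 5)
  ports.each { |port| strip.off(port) }
  sleep(settle_seconds)
  ports.each { |port| strip.on(port) }
end

# A stand-in strip that only records calls, handy for a dry run.
class RecordingStrip
  attr_reader :log

  def initialize
    @log = []
  end

  def off(port)
    @log << [:off, port]
  end

  def on(port)
    @log << [:on, port]
  end
end
```

Toggling the ports one at a time (off then on, per port) would never actually remove power from the server, which is why all the "off" calls come first.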

RAC modules come in varying forms. Like smart power strips, they offer interfacing over serial or network, depending on the model. Avoid serial if you can for reasons already stated. There are some standardized RAC network interfaces, such as IPMI and OPMA. Exact vendor support varies. Many Dell and HP server models come with IPMI. SuperMicro offers 'Supermicro Intelligent Management' which supports IPMI. Rackable's RAC goes by the 'Roamer' name, some of which support IPMI. Recent Intel chipsets support AMT (branded with the 'vPro' name).

IPMI RACs live on the same server and share power and often share layer 1 connectivity with an onboard network device. IPMI can be configured while the server is online, which lends itself to easy automation. In Linux, for example, you'll want the IPMI kernel drivers (from OpenIPMI) and the ipmitool tool. Ipmitool will let you talk to the local system's IPMI (via OpenIPMI kernel drivers) or to remote hosts using the IPMI protocol.
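As a rough sketch of the automation this enables, the helper below builds an ipmitool command line for remote power control. The helper name, hostname, and username are made up for illustration, and it assumes your hardware speaks IPMI v2.0 over LAN (the lanplus interface); some older controllers may need '-I lan' instead. The -E flag tells ipmitool to read the password from the IPMI_PASSWORD environment variable rather than the command line.

```ruby
# Chassis power subcommands this sketch allows.
POWER_ACTIONS = %w[status on off cycle reset].freeze

# Build the argument list for a remote chassis power action.
def ipmi_power_command(host, user, action)
  raise ArgumentError, "unknown action: #{action}" unless POWER_ACTIONS.include?(action)
  ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", "chassis", "power", action]
end

# Example use (hostname and user are placeholders):
#   system(*ipmi_power_command("web01-ipmi", "admin", "status"))
```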

Simple power management isn't the only feature provided by the RACs mentioned here. IPMI, the protocol, supports serial-over-lan, sensor information, event logging, etc, but the features supported will vary by hardware. I don't have experience with OPMA or Intel AMT, but from their respective descriptions, they sound similar to IPMI in features.

Be sure to include out-of-band management (power, serial, etc) when considering your future purchases. I don't want to define your own server requirements, but for a point of note, even Dell's cheapest 1U rack server appears to come with IPMI support, so there may not be any reason for you not to buy hardware that supports remote, out-of-band management.

Further reading:

December 20, 2008

Day 20 - Ganglia

Ganglia is a monitoring tool designed originally to help scalably monitor computing grids and clusters. How can it help you, even if you don't run traditional computing grids or clusters?

Ganglia is an RRDtool-based (like Cacti) monitoring and graphing system. Ganglia differs from Cacti in that configuration is much more automatic. Ganglia's design centers around two programs: gmond and gmetad. The gmond program listens for metric reports from other gmond programs (or tools that emit the same messages). The gmetad program periodically polls a single gmond for data on an entire cluster. The trick, here, is that every time gmond gets data, it sends that data via multicast to other gmonds, so every gmond has state for the whole cluster. I presume that the actual gmond used by gmetad is chosen at random, and if the chosen gmond host fails, another gmond host is chosen.

In addition to clusters (one gmetad for N gmonds), Ganglia supports a higher level collection they call a grid. A grid is automatically learned when you have one gmetad polling from another gmetad. I am unaware if you can have more than these levels (host, cluster, grid).

A note on multicast: your network gear will need multicast routing enabled if you hope to span broadcast domains with this monitoring. Alternately, gmond can be configured to send updates to a unicast address, which can avoid needing multicast routing and other potentially difficult network features.

Both gmond and gmetad have reasonably easy-to-use configuration files and come with very reasonable default values. Simply running gmond and gmetad from the default configurations will result in data you can access easily. The primary Ganglia human interface is through a webserver.

Getting data out of Ganglia is easy. The historical data is stored in RRD files in a known location organized by cluster and hostname, so you can use your favorite rrdtool interface to query data. The current data is stored on any gmond, which is queryable by connecting to ganglia's xml port (default 8649). The service listening on that port will dump the metric data by cluster and host in XML. XML might make you groan, but its use will help you write tools (like nagios checks) to use the current data.

My first question after playing with Ganglia for a few minutes was, "How do I monitor my network gear?" Typical network gear won't allow you to run arbitrary binaries on them. Luckily, Ganglia comes with a tool for broadcasting metric messages, gmetric. With gmetric, you can spoof the source of a piece of data and easily claim that it came from your switch or router. This tool is also the easiest way to extend the metrics ganglia monitors for you. For example:

% gmetric -S "192.168.0.254:myrouter" -n uptime -v 3644 -t uint32 -u seconds
  spoofName: myrouter    spoofIP: 192.168.0.254

And the metric reported on the xml port is:

<HOST NAME="myrouter" IP="192.168.0.254" REPORTED="1229761211" TN="28" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="0">
<METRIC NAME="uptime" VAL="3644" TYPE="uint32" UNITS="seconds" TN="28" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
</HOST>

You can also specify the lifetime of the value, which will cause it to automatically be dropped from your gmond processes.
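If you want to consume that XML from a script (a nagios check, for instance), something along these lines might be a starting point. It is a sketch only: the function names are made up, it assumes gmond writes the whole document and then closes the connection, and it relies only on the HOST and METRIC elements and the NAME and VAL attributes shown above.

```ruby
require "rexml/document"
require "socket"

# Read the XML dump from a gmond's xml port.
def fetch_ganglia_xml(host, port = 8649)
  TCPSocket.open(host, port) { |sock| sock.read }
end

# Turn the XML into { "hostname" => { "metric name" => "value" } }.
# Values are left as strings; convert them based on TYPE if you need numbers.
def metrics_by_host(xml)
  doc = REXML::Document.new(xml)
  result = {}
  REXML::XPath.each(doc, "//HOST") do |host|
    metrics = {}
    host.each_element("METRIC") do |metric|
      metrics[metric.attributes["NAME"]] = metric.attributes["VAL"]
    end
    result[host.attributes["NAME"]] = metrics
  end
  result
end

# Example use (the hostname is a placeholder):
#   uptime = metrics_by_host(fetch_ganglia_xml("gmond01"))["myrouter"]["uptime"]
```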

In addition to the gmetric command, there are C and Python interfaces to gmetric. Additionally, if you want your own software to emit gmetric messages, there's an embedded gmetric library you can use.

My second question was, "What if a host is retired, is renamed, or is reconfigured?" If a host never again reports data about itself, the last state is still kept on every gmond and gmetad. To expire hosts that haven't spoken in a while, you should set the host_dmax value in gmond.conf to some value of seconds after which the host's state will be dropped. However, I am not sure the RRDs for this host will be cleaned up automatically. Host renaming probably requires that you rename the rrd directory holding the host's data if you wish to maintain historical data across the rename. If you reconfigure a host, deleting no-longer-used metric rrds is probably prudent. All of the above changes will likely require walking around in the ganglia rrd storage directory in your gmetad.
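For the "walking around in the rrd storage directory" part, a small script can at least find candidates for cleanup. The sketch below lists RRD files that have not been written to recently; the function name is made up, the directory in the example is an assumption (check where your gmetad actually writes), and it deliberately only reports files rather than deleting anything.

```ruby
# List RRD files under `root` whose last modification is older than
# max_age seconds, sorted by path.
def stale_rrds(root, max_age, now = Time.now)
  Dir.glob(File.join(root, "**", "*.rrd")).select do |path|
    now - File.mtime(path) > max_age
  end.sort
end

# Example use (the directory is an assumption about your install):
#   stale_rrds("/var/lib/ganglia/rrds", 7 * 24 * 3600).each { |path| puts path }
```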

My last concern with ganglia was that the monitoring unit was a host. I most often think in terms of services, not hosts. Thankfully, there are workarounds. Since you can spoof hostnames with gmetric, I decided to try something not-hostname-like to identify a service+host combo. I tried to spoof "service/hostname," but Ganglia uses this as the directory name and fails mkdir for not doing it recursively. Choosing another delimiter, comma, works fine:

Note: gmetric -S "foo/bar" will succeed, but gmetad will crash trying to write to that file path (its mkdir is not recursive). If you try this, you'll have to stop all gmond and gmetad instances, then start them all again to clear the knowledge of the host named "foo/bar".

% gmetric -S "192.168.0.1:apache,aaa123" -n "apache errors" -v 5 -t uint32 -u "errors"
 spoofName: apache,aaa123    spoofIP: 192.168.0.1

However, it appears that ganglia keys on IP as unique, so trying to add another entry of "192.168.0.1:mysql,aaa123" will appear in the web interface. Further testing revealed that the IP portion of the spoof is not validated. If we use the unique combination of "service,host" as both the IP and hostname, everything is peachy:

% gmetric -S "apache,aaa123:apache,aaa123" ...
% gmetric -S "mysql,aaa123:mysql,aaa123" ...

Pushing only apache-related data to the 'apache' host prefixes will help you organize your monitoring by service. This means you could easily view only apache data for a host, etc.
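If you feed gmetric from a script, the "service,host" trick can be wrapped in a tiny helper. This is a sketch with a made-up function name; it uses only the gmetric flags shown above (-S, -n, -v, -t, -u) and builds the argument list without running anything.

```ruby
# Build a gmetric command that reports a metric under a "service,host" name,
# using that same string for both the IP and hostname parts of the spoof.
def gmetric_command(service, host, name, value, type: "uint32", units: "")
  spoof = "#{service},#{host}"
  ["gmetric", "-S", "#{spoof}:#{spoof}", "-n", name, "-v", value.to_s, "-t", type, "-u", units]
end

# Example use, reporting once a minute (count_apache_errors is hypothetical):
#   loop do
#     system(*gmetric_command("apache", "aaa123", "apache errors", count_apache_errors, units: "errors"))
#     sleep 60
#   end
```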

Ganglia makes me happy because I can worry only about giving it data. Adding a new data source only requires knowing how to get the data and feed it periodically to gmetric. After that, the data is automatically available in Ganglia's web interface. I fed (with gmetric) some fake data about mysql connections to see what happens, and this was the result.

The wish-list of Ganglia features includes talk of making it easy to provide custom graphs and views. Adding your own custom graphs currently requires hacking on the PHP code that presents the web interface. Glancing at the PHP powering the web interface doesn't make me cringe (easy to read!), so extending ganglia by adding your own views is probably reasonable.

Considering what I've found so far with Ganglia, the shortcomings mentioned here are easily worked around, and the potential benefits of less-painfully-configured data trending and monitoring seem quite good.

Oh, by the way, gmond works in Windows. The project has a cygwin binary that works by itself and can report data about a windows host. You can also use gmetric from windows hosts to report info about themselves.

Further reading:

December 19, 2008

Day 19 - Visibility and Communication

A friend said to me tonight, "Once again, at a company function, the CEO has forgotten operations exists." This is a visibility problem that often hits operations and support teams.

Good systems administration is about hidden work and effort nobody ever sees. You are probably accustomed to only being highly visible when there is a problem - interrupts asking if something is broken, when it will be fixed, etc. If this is the only visibility you receive, how can you expect to be loved and adored by the world? You need to help yourself and your teammates with visibility. Your manager should help you, too.

Improving local visibility, that of your teammates and manager, is just as important as external visibility to your customers (employees or otherwise).

Keep track of your work: code commits, ticket resolutions, etc. A habit I developed while working at Google was to maintain a weekly report of things done. I've since modified this habit to include not only finished tasks, but things not done yet, progress blockers, and future todos. At the end of each week, send this data to your manager. Have the rest of your team do the same. For bonus points, send it to your team, so your coworkers will know what you did last week.

This weekly tracking will help you do two things: first, to maintain a high quality stream of communication to your manager and your coworkers, and second, to help you better track things that need to get done. If you're lucky (and you probably are), a fellow coworker will see that you have something on your todo list that he or she would like to do and offer to relieve you of this burden.

Tracking this data will also help you show how you can or can't take on that new project for time management reasons. Further, it's a huge help to have a document of accomplishments for career advancement.

To enhance visibility and communication with your manager, have periodic (weekly, etc) one-on-one meetings. Email is good for status reports, like above, but face to face contact is best for discussion. It's a two-way street, so use this time to make sure your visibility and perceived performance are what you expect them to be. Additionally, make requests of your manager if you have any. If you submit status reports that include things that are blocking you, your manager should ask how he or she can help remove these blocks.

Visibility to your customers and to your management should be handled differently. Your manager should be the funnel of information up (and down!) the management stack. Make sure he or she is performing this task. A good time to ask about this is in your one-on-one meetings.

Your customers are very important. Your work will directly, positively or negatively, affect their work. This is power and responsibility that can lead to resentment and anger if not handled properly. Creating interaction policies and informed expectations is critical to customer visibility and happiness.

Your interaction policy should explain how to contact your team, and be sure it's accessibly documented. I find bug systems to be great for tracking customer requests or problem reports, so require usage of this system for such things. Define escalation criteria, such as "if a critical problem is not responded to in X minutes, please email this pager address." You need to define "critical" in the previous statement, too. Don't use email alone for problem reporting, as it doesn't easily lend itself to historical tracking.

Set expectations! Planned changes that will cause outages should be announced ahead of time, at the start of the work, and at the end of the work. Any changes to a planned change should be announced in a clear way. Announce known issues to anyone affected as soon as you are aware of the problem and include a contact (if not you), a time estimate on repair, and a description of the scope of the outage. Define an SLA for any service you support. An SLA is a common form of expectation declaration.

Additionally, don't waste someone's time. If you send an email about an upgrade to a specific component that only a subset of your customers use, then put a very clear header at the top, such as:

This is regarding an upgrade to the internal mysql servers. If you don't know what this is or don't use these systems, you can stop reading now.

Lastly, visibility and related communication do not have to be manually generated. Automation is sexy, and automatically informing customers about information important to them is a great way to avoid getting 15 tickets filed for the same problem. Have a web-based dashboard that includes a list of known problems and links to related trouble tickets, a list of upcoming planned changes, perhaps a "tip of the day," and any other useful information you see fit. There are plenty of content management systems available for free to help you get this dashboard site up and running in a very short time.

Healthy visibility is about good communication. Systems can go down and customers can still be happy because you've involved them in the process by telling them and setting appropriate expectations. Your manager will be happy knowing your team is working effectively by knowing what everyone is working on without having to ask. Happy customers and happy managers mean happy and appreciated sysadmins, even when things are on fire.