Friday, March 2, 2012

Highly volatile postgresql

I've had a few clients running into PostgreSQL performance problems that build up over time because their databases are highly transactional. Mind you, these customers are really dealing with a single database that has grown large, as opposed to several databases on one server. It had gotten to the point where my colleague Rich Keggans (also from 3X Systems) was scheduling a monthly CLUSTER just because of how bad the performance would become.

So in response we tried a little experiment involving tweaking the auto-vacuum settings and creating an automated task. I felt the need to share it with you, as we've seen some rather promising results: performance degradation appears to have ceased in two instances so far.

For the auto-vacuum we really just made the settings a bit more aggressive. In the case where we tested this, the customer's database had a very high volume of transactions, so the chance for bloat to build up was pretty high. We set the following values to fight it:

autovacuum = on                         # enable autovacuum subprocess?
                                        # 'on' requires stats_start_collector
                                        # and stats_row_level to also be on
autovacuum_naptime = 1min               # time between autovacuum runs
autovacuum_vacuum_threshold = 250       # min # of tuple updates before
                                        # vacuum
autovacuum_analyze_threshold = 150      # min # of tuple updates before
                                        # analyze
autovacuum_vacuum_scale_factor = 0.2    # fraction of rel size before
                                        # vacuum
autovacuum_analyze_scale_factor = 0.1   # fraction of rel size before
                                        # analyze
autovacuum_freeze_max_age = 200000000   # maximum XID age before forced vacuum
                                        # (change requires restart)
autovacuum_vacuum_cost_delay = -1       # default vacuum cost delay for
                                        # autovacuum, -1 means use
                                        # vacuum_cost_delay
autovacuum_vacuum_cost_limit = -1       # default vacuum cost limit for
                                        # autovacuum, -1 means use
                                        # vacuum_cost_limit

So as you can see we changed a couple of values by a small amount in order to make autovacuum more aggressive. Paired with this I wrote a quick one-line script that runs ANALYZE. The ANALYZE updates the statistics for each table, which lets PostgreSQL's rather effective query planner work at its best. The following one-liner is the script, which I put into /sbin/:

 psql -c "analyze" -q -S -d boxicom -U boxicom > /usr/vault/log/pganalyzetask.log

This was paired with an entry in the crontab using crontab -e:
0   4,20   *   *   * /sbin/pganalyzetask

In this case I picked 4am and 8pm because the customer has constant activity during normal business hours and again during a certain period at night. Ideally this updates the statistics after each of those periods, allowing the scripts to perform well in each subsequent window.

Further tweaking is going to be required to get these servers running just right, but this is a start anyway. I'll likely be diving back into the postgresql.conf file to see what other changes we can make (based on the hardware available of course). Hopefully we'll continue to see performance improvements on these postgresql databases.
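
For what it's worth, the knobs I have in mind for that next round look something like the following. The values here are placeholders to show the shape of the change rather than recommendations from anything we've deployed; the right numbers depend entirely on the RAM and I/O available on the server:

shared_buffers = 512MB           # often sized to roughly a quarter of system RAM
effective_cache_size = 1536MB    # planner hint for OS + database caching, not an allocation
work_mem = 8MB                   # per sort/hash operation, so multiply by expected concurrency
checkpoint_segments = 16         # more WAL segments between checkpoints smooths heavy write bursts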

Thursday, February 23, 2012

Postgresql corruption

I've had a couple of PostgreSQL corruption problems I've been working through lately, so I figured I'd share my method so that you don't have to go through the trial and error process I did in getting to the bottom of it. The first thing you have to do is figure out exactly where the corruption is, and what sort of corruption you have. Most of the time corruption is going to be due to an errant action by the admin, a hardware problem, or a corrupt index on your table.

So first, locate where in your table the corruption is. You can do this with a select query, using the ctid value to see where it begins. You will want to set the following two options when doing this:

=>\set FETCH_COUNT 1
=>\pset pager off

Next we will want to run the query itself; a sample would be as follows:

=>select ctid, * from table;

This will scroll and scroll until you hit the nasty portion, where it will crash the instance. Once this occurs you can start to narrow down the affected area. You may wish to attempt a reindex before you continue on, to make sure the corruption isn't just a bad index:

=>reindex table tablename;

If the reindex doesn't work, the next step is likely to 'cut out the bruise'. The first thing you can try is a query that deletes the rows based on the ctid, with something like the following:

=> delete from table where ctid >= '(12312,0)' and ctid < '(12313,0)';

This may fail with an error message similar to:

ERROR:  could not open segment [n] of relation [x]/[y]/[z] (target block [b]): No such file or directory
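
As a side note, if you want to confirm which table the numbers in that message actually belong to, the last value is the relfilenode and can be mapped back to a name through pg_class. A quick hedged sketch, with 16384 standing in for the z value from your own error:

 =>select relname from pg_class where relfilenode = 16384;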

If that error occurs, we are going to have to zero out the block ourselves. To do this we can use the dd utility.
First let's take a look at the file in question, which we can do using a command similar to:

dd if=data/postgresql/base/y/z bs=8192 skip=17897 count=1 | hexdump -C | less

The value in 'if=<value>' is going to match up as base/y/z (with the y and z coming from the error message above). The value in 'skip=<number>' should be the page number, which is the first component of the corrupt ctid we discovered earlier (or the target block named in the error message). You will likely see some junk data in here, such as a file name or a time stamp written in where row data should be. To fix this we are merely going to write a zeroed 8K block over that page, as it is rather unlikely that we could recover the rows at this point. To do this we'll use the following:

dd if=/dev/zero of=data/postgresql/base/y/z bs=8192 seek=17897 count=1 conv=notrunc

And now if you run a query against that page (like the one below) you'll find a count of zero rows, and you should be able to do a successful pg_dump.
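
For completeness, this is the sort of ctid-bounded count I mean, using the same example page number as the dd commands above:

 =>select count(*) from table where ctid >= '(17897,0)' and ctid < '(17898,0)';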

Ryan Koch
3X Systems
ryan.koch@3x.com

Tuesday, February 14, 2012

Own your data

Through my relatively short IT career I've seen the rise of this thing referred to as the 'cloud'. It's spoken about as if it were some monolithic being that provides the wonderful content you consume on your PCs, or use to conduct business. The reality is significantly more complex, and drastically more piecemeal and important. We've entered a new time where we can centralize our infrastructures in such a way that the user experience is seamless, and our workload is reduced dramatically. With that have come a large and varied number of technologies claiming to be the best at backing up your infrastructure and your clients. Some have started to migrate their data to the 'public cloud' in order to protect themselves from disasters and the like. This is a respectable aim, as having data off site really is the best way to go, but they are also handing over their mission-essential data to a third party who may or may not be reliable, much less trustworthy. Having your own private cloud doesn't have to be difficult, and it doesn't have to be expensive.

It's important to own your data. I cannot stress this enough. The documents your users store, the records that keep your business going, and the PowerPoints your executives display are all vital to the survivability of your business. You really shouldn't trust data of that importance to some distant data center that isn't truly accountable to you. If you wanted to pull your data, or the virtual infrastructure you've placed there, back down the road, how cooperative are they going to be? Even worse, what if the public infrastructure goes down? I don't know if you recall, but Amazon's cloud service got hit with 11 hours of downtime or limited usage. While the data was itself safe, access to it was cut off.

There are a few solutions out there that can give you a sufficient backup strategy without relying on the public cloud. A lot of where to go depends on what sort of infrastructure you have, and what you are looking to back up. Do you have a Hyper-V or ESX(i) setup? Do you have a lot of Windows clients, and some Windows servers? How big is your infrastructure?

If you are using primarily Windows-based computers and servers, and/or you are using Hyper-V, I might humbly suggest the product I work on for a living. The 3X Backup Appliance is your private cloud right out of the box. You can do file and system level backups for Windows machines, and even do VM level backups of Hyper-V. Other than touching the product every day, why do I think it works so well for the private cloud? It's easy to use, requires little maintenance, and features some very effective deduplication.

Whatever solution you decide to go with, please do take heed of the plea to 'own your data'. 43% of businesses lose important data due to not having a proper backup strategy. Don't be one of them. At the end of the day your clients will appreciate it, you will be protected, and your data will be safe and accessible. If you have any questions at all about our solution or any others, feel free to comment here, email me at ryan.koch@3x.com, use #3XSystems on Twitter, or check out our new Reddit community.

Monday, February 13, 2012

Tasker and an announcement

Before we get into the meat of this post (Tasker) I'd like to share some 3X Systems related news. From here on you can get answers to questions and quick support via the hashtag #3xsystems on Twitter. You'll get a response from either @3XSystems or myself, @Ryan_Koch. And if that isn't enough, we have also created a subreddit you can use to discuss any issues or questions you have, which can be found here: http://www.reddit.com/r/3xsystems.

So I imagine if you are an Android user who likes to tinker around a bit, you may be aware of an application called Tasker. If you aren't, Tasker is an application that allows you to create automated tasks based on profiles you create. The application is limited only by your imagination and skill set. While it's been around for a while, I wanted to spread a bit more awareness of it (because it's simply amazing) and share a couple of the recipes I'm using.

Recipe #1 : Arriving at work

When I arrive at work, Tasker picks up via network location that I'm within a kilometer of the building. I've set the radius this large because network location is rather bad with accuracy and I don't want to destroy my battery by leaving GPS turned on all the time. The simple view of the profile shows the location context, which is 'Work', and a task called 'Wifi On'. The wifi task does two different things for my benefit. The first is that it turns on wifi (one less thing to do once at the desk), and the second is that it launches AirDroid, a pretty cool application that lets you manage your phone from a web browser (assuming you are on the same wifi network). When I leave the acceptable radius the profile exits and runs the exit task called 'Wifi Off', which in this case turns wifi off and kills the AirDroid application.

Recipe #2: Meeting Mode

This recipe puts the phone into a 'Meeting Mode' when you press a custom shortcut, and then brings the phone out of said mode when a separate shortcut is pressed. Meeting Mode automatically rejects phone calls, sends a notification SMS to the caller, and silences the ringer and notification sounds. In the screen capture the coffee cup is the meeting starting, and the space ship is exiting the meeting. When you press the meeting-start shortcut, Tasker sets a variable called %Meeting to 1, which leads to a profile that is activated when %Meeting matches '1'. There is then a corresponding task called 'Meeting Mode'. All this task does is set the ringer and notification volumes to level 0, which is muted more or less.

That profile is then paired with another that automatically sends all calls to voicemail and sends an SMS message to the caller proclaiming that I am in a meeting. The profile uses similar logic to know when to kick off, but it differs with the addition of the incoming call event. The SMS happens by setting a task to make use of the SendSMS function in Tasker and using the %CNUM variable to pull the 'last called number (in)'.

Recipe #3: Custom Alarm

This one is a favorite of mine, as it wakes me up in a way I don't detest too much. The custom alarm uses both a time and a day context on the profile to define when it goes off. In my case I have it set up for weekday mornings. You can set it to only run for a specified amount of time, and since I'm somewhat kind to the half-asleep version of me, I set it to run for 3 minutes. The profile runs a task called 'morning' which does the following:

1. Sets the system volume all the way up
2. Sets the media volume all the way up
3. Uses the text to speech engine to say "Good Morning Ryan, I.T. Like a champion today"
4. Use the music application to start playing the song of my choice
5. Load the music application so I can stop the song if I so choose

I definitely suggest giving this application a look, as it only costs $6.49 and is more than worth it. Between the amount of playing around one can do and the convenience of automating some tasks, it is simply hard to justify passing up.

Wednesday, January 25, 2012

IT Purchasing and career ambitions

Lately I've been pondering exactly how those of us in IT decide what to buy for our departments. I imagine there are the usual biases, including brand reputation and past platform use, but there may also be a more career-oriented bias that is sometimes overlooked: the resume.

Let's imagine that you are tasked with a project that involves building up an entirely new infrastructure, without worrying about any current platforms in use. How are you going to make your purchasing decisions? Obviously you'll have a budget, so a bang-for-your-buck calculation will have to be made. But if all platforms are equally affordable, what is the chance that you'll pick something that will look good for your future endeavors? For example, will you use VMware ESXi and vSphere knowing that further experience in designing and maintaining such a platform will benefit your professional development? I am finding lately that a significant number of the administrators I maintain contact with are actually taking this factor into consideration. There is actually a rather nice and short article on TechRepublic about what they expect to be the top skills for 2012 (http://www.techrepublic.com/blog/career/top-it-skills-wanted-for-2012/3503) and it is certainly worth a read.

What is interesting about this observation is that the effect starts to diminish the larger the organization in question becomes. As the decision-making process drifts away from one individual and the expertise of a large group is brought into play, you end up with a sort of 'sum of its parts' situation. The career desires and past experiences of the entire group, plus vendor relationships and technical support needs, all begin to pull in different directions until a middle ground is created or found. It would appear to me, however, that careerism doesn't completely disappear from the decision, but instead shows faintly through the ambitions of each individual member of the decision-making process. Each opinion and proposal will reflect not only current opinion, but in some way the future plans of the person in question and where they see the organization traveling.

For management, knowing that this tendency may exist opens a new way to understand their IT staff. By getting recommendations on a regular basis, one can actually tell how the group views the future of the IT department, and get a hint as to what each of them would like to be doing with their career in the long term. For the staff, being aware of the tendency allows them to embrace the concept and attempt to further both business and personal development goals. I suppose the factor truly at play is that even though we're IT workers, engineers, and administrators constantly surrounded by machines, we are still quite considerably human in what influences our decisions. So now I ask you: what drives your decision-making process?

Tuesday, January 17, 2012

NTFS USN Journal and 3X backup

A problem I saw crop up a couple of days ago with some of my customers using 3X Appliances is an issue where the USN journal fills up and causes their backup set to crash and burn. The good news is that there is a relatively easy fix that just requires the use of the fsutil utility from Microsoft.

Generally the USN journal max size is set fairly low, and this can be a problem depending on the number of files that exist in your file system. The journal's purpose is to add resiliency to the file system by providing a persistent log of the changes being made to it. We generally suggest changing your journal max size to something along the lines of 512MB if you aren't tight on space; as a rule of thumb, for every 400,000 files you have you will want to add 128MB to the journal. The following TechNet article discusses the command to change the journal in detail: http://technet.microsoft.com/en-us/library/cc788042%28WS.10%29.aspx. Making this adjustment will avoid problems with our backup solution, as well as any others that make use of the journal.
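
For convenience, here is roughly what that looks like on the command line, assuming the volume in question is C: and that your Windows version uses the m=/a= form described in the linked article (both values are given in bytes; double-check the syntax against the article for your particular OS):

 rem Check the current journal settings on the volume
 fsutil usn queryjournal C:

 rem Set the journal to a 512MB maximum size with a 64MB allocation delta
 fsutil usn createjournal m=536870912 a=67108864 C: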

This particular post is going to remain rather short; however, this week will have a second post to make up for it.

Tuesday, January 10, 2012

Disaster Recovery planning for your small business

As promised, we're going to discuss some options for the small business disaster recovery (DR) situation. For a small business, a significant problem with any sort of IT project is cost. This makes it tempting to start off using home-brew type solutions, but eventually you will find yourself wanting something more polished and reliable. For instance, my client KTDID, LLC has an ESXi infrastructure that I maintain off-site backups of at my residence. To start off I scripted the backups using some cmdlets, but I am in the process of looking for something more professional. But where should we turn?

The first thing we need to consider is what exactly we need to back up. In my case I have to worry about several server VMs and a couple of physical user machines. At this point you'd want to ask yourself whether you want to back up files, the entire system, or perhaps even both. As far as the virtual machines go, one attractive solution would be Veeam Backup and Replication, as they have taken what would have been painstaking to script and maintain and made it possible with mouse clicks. The software isn't cheap, but if you have a sizable virtual environment in your small business it's worth the investment. For file level backups it's preferable to use something like the 3X Backup Appliance or a software product such as Acronis, Backup Exec, or the countless other competitors.

One thing that I consider to be of the greatest import, however, is to keep ownership of your data, especially if you are in a field that deals with sensitive information. In my scenario I have a 3TB NAS that I use to keep backups on site, and I also keep backups off site at my residence as an additional layer of contingency. One of the reasons the 3X Backup Appliance makes the list of considerations is that with a couple of simple port forwards you can place the device anywhere with an internet connection, and the client devices pick up the new location automatically. The device is good for file level and some limited use for system level backups, though it does not quite yet fit the bill for VM level backups on VMware (though if you have Hyper-V it does pretty well with that). So I find myself needing two products, more or less, if I want to maintain ownership and get a total solution. At this point it looks like once the funding is available we'll be looking at an implementation of Veeam, and perhaps a 3X Backup Appliance to complement it.

Looking a bit into the other issues of DR planning, I find that as a small business there are some pretty simple things we can do. For example, since we use my residence as a basic DR site and that location is in a completely different city, I can bring services up at some level without too much downtime or risk of both sites being eliminated at once. Also, the infrastructure is set up in such a way that once I bring everything up again the users in question shouldn't have too much trouble working either from home or from another location until the original site's functionality is restored. When you are making your plan, do not overlook the users in the effort to plan your technological recovery to your heart's content. Without their ability to work, the rest of the infrastructure has little use.

My next project is to research the best method for backing up Zimbra that allows for item level restore without a tremendous amount of difficulty. If I come up with something before next week I will most assuredly make the post.