2010-06-26

Is your computer slow? Fix your own computer.

I have fixed about six computers in the last two weeks, and just like 95% of the other computer help I give people these days, it came down to this one problem:

"Why is my computer so slow?"
"My computer seems to have a virus, how do I fix it?"

One thing I have noticed is that most people just live with slow computers because they don't know where to turn for help, or they don't know that things could be better.  I figured it was time to write this blog post -- slow computers are fixable, and in fact you can fix it yourself.  The following steps should fix 95% of your computer slowness issues, and your computer will probably feel 2x-5x faster when you are done.  Please spread the word, you don't need to suffer in silence anymore :-)

(Note that this blog post applies to the Windows operating system, but you should be aware that there are other alternatives that don't get slower the longer you use them, including Mac OS, Linux and soon Google Chrome OS.)


HOW TO FIX YOUR OWN SLOW WINDOWS COMPUTER, AND ELIMINATE VIRUSES:


Step 1: Fix DNS settings (if necessary)

Some viruses redirect Web traffic to malicious sites, so you can't actually download or update antivirus software, etc. to fix your computer. If antivirus sites are inaccessible, and/or you get suspicious popups or fake sites when trying to visit well-known domains, follow the advice here and/or here.


Step 2: Disable unnecessary startup processes

Explanation: One of the biggest reasons your computer is slow is probably that when you switch your computer on, it loads all sorts of programs into RAM (memory) that are not needed.  (A lot of these programs put icons down in the System Tray at the bottom left of the screen, but not all of them do.)  When you run out of RAM, your computer uses the hard drive as extra storage (it "swaps out to disk"), and the hard drive is on the order of 1000 times slower than RAM.  So the first thing you should do is stop all the unnecessary programs from starting when you boot, and you'll be less likely to run out of RAM.

How to do it: Download, install and run a startup manager.You will then be presented with a list of programs that are run when your computer starts up.  In most cases, you won't break anything really badly if you uncheck all of them -- but try to understand what they each are before you uncheck.  If you don't know, you can always uncheck it and come back and re-check it again if you notice something doesn't work right.  A few cases to get you started:
  • If you disable anything that says "Synaptics" or "SynTPE" then your touchpad might not have the full functionality, e.g. this program detects movement on the right hand side of the touchpad and causes the current window to scroll without you having to drag the scrollbar.  You probably want to leave this checked.  Also anything that talks about hotkeys handles the Fn+F4 key combinations etc., leave that checked if you need those functions.
  • If you sync an iPod with your computer, you probably want to leave the iPod stuff checked.
  • If you see something about HP printing and you have an HP printer, you can uncheck this and you will still be able to print, but you might not get notified on your computer when your ink is running low (not a big deal).
  • If you uncheck something to do with your digital camera, then your photo editing software might not pop up when you plug in your camera, but you will get the standard Windows photo import window instead, and you can still start your photo editing software manually, so again it's not a big deal if you uncheck it.
  • The programs you really want to kill is anything that says "QuickStart" or similar -- Acrobat reader, OpenOffice and other programs have quickstart options.  They may be bringing your computer to its knees by filling up your memory.
  • There are usually programs that start up that put advanced control panels for your graphics card and/or sound into the system tray.  However you can usually disable these too and your sound and graphics will continue to work fine.
  • In general, if you don't know what it is or why you need it, you probably won't miss it -- uncheck it!  Some people would think this is extreme and generally bad advice -- but honestly, your computer won't be in worse shape than it was before when it was unusably slow :-)  And again, you are not likely to break anything serious, but if something doesn't work properly after this step (e.g. your webcam doesn't work right), you can just try switching some of these programs back on again (by re-running CodeStuff Starter) and rebooting until you figure out what needed to be switched on.

Now reboot your computer and hopefully your computer will feel faster already.

Step 3: Get all Microsoft updates

Explanation: Microsoft pushes out critical and non-critical updates that fix bugs on your computer, speed your computer up, and make your computer less vulnerable to viruses and spyware.  (Viruses and spyware are another big reason why computers can slow down.)  In particular you need all Service Pack downloads, and you should update Internet Explorer to version 8, as this fixes some critical parts of Windows -- but you should get all the critical and recommended updates if you don't have them already.

How to do it: Open Internet Explorer and go to http://update.microsoft.com/ .  You may be asked if you want to update from Windows Update to Microsoft Update.  If you have the option, definitely install Microsoft Update, it is better than the older Windows Update because it keeps not just Windows but also Office up to date.  Next follow directions to check for the latest software updates, and download and install all critical and recommended updates.  You can also manually select a few optional updates here but you don't need them.  You probably have to reboot after installing updates.  Once you have rebooted, go back to http://update.microsoft.com/ and make sure there aren't new recommended updates listed that were there before.  If there are new updates, then rinse and repeat.  (Sometimes you can't install all updates at one time, some of them have to be performed in separate steps.)

Step 4: Replace your antivirus software with Microsoft Security Essentials

Explanation: Antivirus software has to wedge itself into all sorts of critical parts of your operating system to stop viruses in their tracks.  As a result it can sometimes cause more problems than it prevents.  Also you have to pay $60-100 per year for antivirus updates, otherwise your computer is still at risk from being infected by the newest viruses.  Microsoft recently realized they need to start cleaning up the messes they created by selling an operating system full of security holes, and they released a product called Microsoft Security Essentials, which seems to be one of the few products they have gotten really right -- it's solid, fast, and best of all free forever: they will keep sending you updates for free and you never need to pay the Symantec / Norton / McAfee yearly tax again.

How to do it:
  1. Go to Start -> Settings -> Control Panel, and find the Control Panel program that lets you uninstall programs.  You might have to click on "Classic View" on the left hand side of the Control Panel to find it easily.  It might be called "Add/remove programs" or "Programs" or something else, they keep changing the name of it in different Windows versions, and I can never remember what it's called on each Windows version (I'm a Linux user) -- but when you find it it will have a list of all the programs you have installed.  Uninstall anything that has "Symantec", "Norton", "McAfee", "ClamAV", "Kapersky" or similar in the name.  Also uninstall anything that contains keywords like "Antivirus", "Spyware", "Spybot", "Internet security" etc. in the name.  It's important to uninstall these before installing Microsoft Security Essentials, and don't worry, you'll only have it uninstalled briefly.  You'll probably need to reboot after installing one or each of these.
  2. After rebooting, go to http://www.microsoft.com/security_essentials/ and download and install Microsoft Security Essentials.  It's pretty easy to install, and when it is done it will ask if you want to download the latest virus definitions and run a scan -- choose Yes.
Step 5: Install the Google Chrome Web browser and never use Internet Explorer to browse the Web again

Explanation: Something most people don't know is that 95% of virus infections today come from using Internet Explorer.  You should only ever use it for Microsoft Update and for nothing else.  Something a lot of people also don't know is that they have a choice when it comes to web browsers.  Internet Explorer is a terrible program, Firefox is better, safer and faster, and Google Chrome is like Fort Knox as far as security and it's faster than greased lightning.  Using Google Chrome will therefore protect you online and it will make your computer feel even faster.

How to do it: Go to http://google.com/chrome and download and install Google Chrome, and then go find any Internet Explorer icons on your desktop or in your Quick Launch tray (bottom left, by Start) and delete them so you don't accidentally use IE again.  Use Chrome for all your browsing, I promise your computer will feel much, much faster.

THAT'S IT!  Enjoy your new turbo-charged computer.

2010-06-20

Clustering large datasets

On the MIT-internal "csail-related" mailing list someone recently asked for software to help him perform matrix multiplications of 10^6 x 10^6-sized matrices.  Ron Rivest quite correctly replied that to multiply matrices this size, even for a single multiplication you would probably need about 4 years of compute time -- because there are a trillion entries in matrices this size.  I posted the following reply, which I am re-posting here partly for my own reference because it contains a lot of the points I have learned in various work clustering huge datasets, and partly in the hope that somebody else will find it useful.

--

Ron's analysis is correct, unless your matrix is very sparse -- in which case sparse matrix methods may make this problem entirely tractable, and any of the linear algebra toolkits implement efficient sparse matrix methods that you can use. The main problem you'll have is fitting it all in memory -- you'll need to look into matrix blocking techniques (dividing the big matrix into submatrices, and figuring out the correct way to multiply the subblocks to get the full result). There's some great discussion about keeping subblocks in CPU cache in the following paper that may help you figure out how to keep your much larger subblocks in main memory as long as possible: http://homepages.cae.wisc.edu/~ece539/software/matrix/mbpv11.ps The difference between swapping subblocks in and out at the right time vs. the wrong time could make several orders of magnitude in difference in how long it takes to compute your result. There are also some parallel solutions to multiplying large matrices that will run on large clusters and trade off time swapping subblocks in and out of memory for data communication overhead between nodes.

A similar problem to matrix multiplication problem you describe is encountered in data clustering, given the "N Choose 2" or O(N^2) scaling of the number of pairwise distances in a dataset. It is intractable to use all-pairs distances with even the simplest clustering algorithms in large datasets, for example hierarchical agglomerative / bottom-up clustering applied to more than tens of thousands of points. Depending on the exact nature of the problem, you may be able to exploit spatial coherence within your problem space -- e.g. for agglomerative clustering, you use only the nearest neighbors when joining clusters, so you can often reduce the complexity of clustering problems using a smart data structure like a kd-tree that gives approximately O(log(N)) time per nearest neighbor lookup. However the kd-tree algorithm quickly degrades to ~O(N) performance per lookup in high-dimensional spaces because of the curse of dimensionality, so you may need to do dimensionality reduction first anyway to make the kd-tree useful. (You'd also have to reframe your matrix multiplication problem into a format where using a kd-tree to find nearest neighbors in your vectorspace helps you compute a fast, close approximation to your desired solution.)

Another approach used to cluster datasets with millions of points (and therefore trillions of pairs of points) is to pick a few exemplar points and cluster these instead to generate a sample approximation of cluster assignments for the full dataset. For example you could randomly choose one point out of every thousand, and cluster these into your K target clusters (= O(N^2 / 1000^2) time to cluster 1/1000th of the points), then go back through your full dataset and find the closest exemplar point to each original data point in order to compute the cluster labeling (= O(N^2/1000), although you can often skip this step entirely if you just use the exemplars). The largish constants in the time complexity can make this approach tractable for larger datasets than you could otherwise work with. The exemplar method I just described is incidentally half an iteration of the k-means/k-medians algorithm applied to a set of exemplar points. You can go further by completing the full first iteration of k-medians by going back and updating your selection of exemplar points using the medians of the newly-labeled clusters, and then depending on how much compute time you can afford, you could run multiple whole iterations of this exemplar-modified k-medians algorithm (or run until convergence) to get better exemplars -- though even the first half-iteration may be sufficient to get you a useful result. As far as how to phrase your matrix multiplication problem in this framework, depending on the problem you may be able to find representative row/column vectors this way and then just multiply the representative vectors together to get a product that in some sense is a low-dimensional approximation of the full matrix product.

Ultimately whether you use (PCA-based) dimensionality reduction or k-means / k-medians, you're doing approximately the same thing, and this is why preprocessing with PCA can often help k-means to converge faster : quoting from http://en.wikipedia.org/wiki/Principal_component_analysis#Relation_between_PCA_and_K-means_clustering :

It has been shown recently (2007) [12] [13] that the relaxed solution of K-means clustering, specified by the cluster indicators, is given by the PCA principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix. Thus PCA automatically projects to the subspace where the global solution of K-means clustering lie, and thus facilitate K-means clustering to find near-optimal solutions.