Why we all need to learn Chinese

I had to write a short essay about an emerging issue in US-China relations and what I think should be done about it. I figured I'd post it here too. Short story: we all need to learn Chinese.


The next 10 years will see the beginning of the end of the US' "free lunch".  As the standard of living continues to improve in China and as economic reform and access to information continues to spur growth, wages and prices will rise, which will cause an increase in the cost of goods manufactured in China, and much of this cost increase will be passed on to the American consumer. The increase in the cost of consumer goods may make continuing to import goods from China unsustainable for some industries. Unfortunately the US has already lost (or never developed) the ability to manufacture certain goods and materials in quantity, and has long relied on cheap manufacturing in Chinese factories. Chinese economic growth is therefore likely to cause tensions between China and the US.

Meanwhile China has started investing heavily in outsourcing cheap manufacturing to Africa and other developing regions, so it is likely that China will emerge as the next super-consumer country, and with an emerging middle-class and much greater purchasing power than the US (and maintaining trillions of dollars of US debt), the rise of China will likely drag the US into economic doldrums.

The traditional business and economic approaches to address this problem will all of course be pursued (investing in emerging Chinese markets, exporting Western brands to China and/or developing multinational business conglomerates).  However I think to truly stay relevant, the US needs to focus on teaching Chinese language and culture to every school student the way that every Chinese school student is taught English language and culture, and the US government needs to focus on setting up an extensive network of student exchange programs with China and other Chinese-speaking countries.  By exposing school children to Chinese language and culture, the next generation of business leaders, political leaders, scientists and engineers will be enabled to work alongside Chinese counterparts rather than simply competing against them while the economics of scale turn in China's favor.


My quick analysis of the leaked Gawker passwords

I just got an email from Gawker Media stating that their login details on Lifehacker, Gizmodo etc. had been compromised and a database of 1.3M usernames and passwords was being distributed via Bittorrent. Naturally I went and found the database and downloaded it. I extracted the subset of passwords from the file that have already been cracked, and uniquified and generated counts. You can download the list at the end of this post.

I have written about the dangers of using one password on multiple accounts before, and when I used to work at a company where I had access to a massive password database, I was shocked to discover how many people use really weak passwords -- like a first name or a number like 123456, or the word "password".

The leaked Gawker data contains the following explanatory text (along with a ton of leaked private chat logs between Gawker executives, and other juicy stuff):

  After gaining access to gawkers MySQL database we stumble upon a huge
  table containing ~1,500,000 users. After a few days of dumping we
  decided that 1.3 million was enough.

  Gawker uses a really outdated hashing algorithm known as DES (Data Encryption Standard).
  Because DES has a maximum of 8chars using a password like "abcdefgh1234" only the
  first 8 characters "abcdefgh" are encrypted and stored in the database. If your
  password is longer than 8 characters you only need to enter the first 8 characters
  to log in! 


  Because of this we were only able to recover the first 8 characters of someones password!
  If the password is 8 characters long there's a good chance that it migt be longer
  than 8 characters! But still, there's 1000's of people using 1 - 8 character passwords
  for us to have some fun with! 

  We managed to crack ~200,000 hashes, if you want the rest of them cracking
  DO IT YOUR ****ING SELF! >:3 (censored)

So ~200,000 hashes were cracked out of 1.3M by de-hashing (actually 188281 hashes were cracked, producing 91688 unique passwords). I assume that the 189k passwords that were cracked are somewhat representative of the rest of the database.

I ran some basic statistics on the password database because I was interested in seeing the distribution of password usage.  Here is a plot of the usage count (out of 189k cracked passwords total) for the top 50 passwords:

Here is the same plot with a log Y axis and with the rank of all cracked passwords shown on the X axis:

Basically the top 5 or so passwords are used by a ridiculously high proportion of users, and the top few thousand passwords are very common and therefore very easy to guess using a dictionary attack.

Here are the top 50 passwords, with their rank and count out of 189k:

Rank    Count    Password
   1     3057    123456
   2     1955    password
   3     1119    12345678
   4      661    lifehack
   5      418    qwerty
   6      333    abc123
   7      311    111111
   8      300    monkey
   9      273    consumer
  10      253    12345
  11      247    letmein
  12      241    trustno1
  13      233    dragon
  14      213    baseball
  15      208    superman
  16      202    iloveyou
  17      202    1234567
  18      199    gizmodo
  19      196    sunshine
  20      194    1234
  21      187    princess
  22      184    starwars
  23      179    whatever
  24      175    shadow
  25      158    000000
  26      157    cheese
  27      156    123123
  28      149    nintendo
  29      149    football
  30      148    computer
  31      141    ****you (censored)
  32      135    654321
  33      134    blahblah
  34      132    passw0rd
  35      132    master
  36      126    soccer
  37      124    michael
  38      120    666666
  39      118    jennifer
  40      115    gawker
  41      114    Password
  42      114    jordan
  43      113    pokemon
  44      113    pepper
  45      113    michelle
  46      113    killer
  47      111    welcome
  48      111    batman
  49      109    kotaku
  50      109    internet

This gives an insight into the password-setting habits (and, if you read through more of the list, the mentality) of a large proportion of the Internet-using population.

A lot of people use numerical passwords -- here are the top 50 numerical passwords. Check out the password at rank 221:

Rank   Count    Password
   1    3057    123456
   3    1119    12345678
   7     311    111111
  10     253    12345
  17     202    1234567
  20     194    1234
  25     158    000000
  27     156    123123
  32     135    654321
  38     120    666666
  74      82    123321
 100      72    123
 101      72    121212
 137      63    159753
 163      56    88888888
 164      56    11235813
 186      53    7777777
 202      50    555555
 221      48    8675309
 236      47    98765432
 237      47    11111111
 243      46    696969
 253      45    112233
 267      43    00000000
 272      42    1111
 286      41    123654
 318      39    222222
 344      37    131313
 430      32    0000
 501      29    987654
 502      29    55555
 537      28    12341234
 608      25    102030
 643      24    147258
 645      24    101010
 684      23    888888
 685      23    159357
 741      22    789456
 742      22    11223344
 743      22    007007
 799      21    12312312
 869      20    99999999
 871      20    147852
 872      20    1212
 873      20    11111
 874      20    09876543
1105      17    0123456
1218      16    151515
1219      16    123789
1221      16    112358

DOWNLOAD LINK: Curious to see the passwords of all 189,000 users? Here's the whole password list with counts for each password.  It's a .tsv file (tab-separated values), you can load it into a spreadsheet or text editor. (This file doesn't contain names or usernames, just the password info.  If you want usernames you'll have to go get them yourself.)


  1. Ranks and accidentally-stripped leading zeroes fixed.
  2. Highlighting one of my replies to a comment: if you have to even ask if your password is on this list, it's probably not secure enough! 


Someone stole your password on facebook or your email account? What to do about it -- and why it's worse than you think

I get emails or facebook messages "from friends" every couple of weeks that are spam, and that the friend clearly did not intend to send.  It is obvious that the password on the friend's email account or facebook account got stolen.  This is potentially a lot more serious than it seems -- this can make old-school identity theft look like childplay:  If somebody steals your email password they have everything about you.  They can not only trawl through all your email to gather information about your life for impersonating you, blackmailing you or other nefarious purposes, but they can also send password reset requests to any other website you have registered with, and steal your passwords on all those sites too.

For this reason, your email password is gold, and you should protect it more than any other password: it should be harder to guess than other passwords, longer, not be based on dictionary words etc., and you shouldn't use your email account password on any other website!  Why?  Because when you register on another site XYZ.com, and XYZ.com asks you to register with your email address and a password, and if you use your email account password as your password on XYZ.com too, then if that website's owners are evil or if somebody hacks their site, they have both your email address and your email account password.

There are three main way people steal your email passwords:

(1) Dictionary attack. I had access to a database of thousands of user login credentials in one job I worked at, and more than 60% of passwords were just a first name!  Scammers/spammers just go through a dictionary and submit every word in the dictionary to a website as passwords with common or publicly-visible usernames until they get in.  This is becoming less common because of security precautions like having to type in a Captcha when you get the password wrong, but still -- if you have a simple name or word as your password, fix it!  Use letters, numbers, upper and lowercase, and punctuation.  Make it at least 8 characters long.  Write down passwords somewhere secure and memorize your main email account password.

(2) Keylogging.  This is becoming the most common method for stealing passwords today: most viruses that infect your computer will install a keylogger that sends every keystroke you type -- login names, passwords, credit card numbers, love letters -- to a computer in China or Russia.  And a substantial number of computers you have probably used in the last few years have had one of these viruses on them.  No I'm not kidding.  See below for how to clean up your computer if it's infected, and general guidelines of how to avoid this issue.

(3) You used your email account password on other sites too, and those sites were either malicious or got hacked. It was recently revealed that around 75% of people reuse their email password on social networking sites like Facebook.  Don't ever, ever use your email account password to register anywhere else.

Q: My password got stolen.  What do I do?

Step 1: If you can log into your email account and the spammers literally just started sending out messages from it, log in immediately and change the password.  You might have to change the password again later, because your computer may still be infected with a keylogger (see point (2) above) and they might be able to watch you change your password.  But change it now to halt them in their tracks.  If you can't log into your email account because they changed your password on you (it does happen), then try sending a password reset request, if your account is connected with another email account that you can access, or contact your email provider (or facebook, if it's your facebook account) to notify them of the problem.

Step 2: Assess the damage.  Look in your Sent folder to see what they sent and to whom.  Send out a "Sorry -- don't click on this link, I didn't send it" message or something -- lots of the links sent out by spammers go to sites that try to infect more people's computers with viruses.

Step 3: Get all the critical updates for your computer and update your antivirus software.  Follow all the steps in my other blog post about speeding up your computer.  Following those steps will not only speed up your computer but should guarantee that any viruses on your computer are killed and that you are safe from viruses in the future.  In the end you should (a) have all the critical Microsoft updates on your computer, (b) have replaced your antivirus software (Norton/Symantec/McAfee/Kapersky/ClamAV/etc.) with Microsoft Security Essentials (it's a lot better at catching viruses than the others, and it's free forever so you don't have to pay to stay up to date so your computer catches the latest viruses), and (c) you should have Google Chrome installed as your browser and you should never use Internet Explorer again, it is one of the biggest reasons computers get infected with viruses, because it is so insecure.

Step 4: With your new bulletproof computer, go to any other site that you used the same password on as the email password that was stolen, and change your passwords there too.  Then probably go back and change your main email account password again, just in case you had a keylogger on your computer that got eliminated in Step 3.  Here's a general strategy for selecting passwords:
  • Password Level 1: Have a throwaway, "don't care" password that you can use for all those sites that ask you to register that you wouldn't mind someone stealing your password for -- like when you have to register on a bulletin board to ask a question about your car, and you never plan to go back there again.  Reuse this password on other sites only when you don't care if somebody who has access to one of those sites with a stolen password has access to all of them -- because if your account gets hacked on that site, it can get hacked on other sites too.
  • Password Level 2: Have a second level of password that you use for sites that you do care about somebody breaking into -- for example any e-commerce site like Amazon.com that saves your credit card details.  Use a different password for each site that stores credit card details!  Remember that sites that save credit card details are especially targeted by hackers/crackers.  Ways of generating unique passwords for these sites include just writing each password down, coming up with a tricky way of taking a base password and combining some letters from the domain name of the site into the password to make a non-predictable unique password for each site, or using something like LastPass to generate and save the passwords for you.
  • Password Level 3: You should have a more secure password for your bank and your bank only.  Each bank should have its own password.  You should also request extra authentication methods from your bank if it is available -- e.g. PayPal has a keyfob that generates a pseudo-random number when you push a button (a new number every minute) that you have to type in right after your password.  This verifies you have the physical device, and protects you from password attacks.
  • Password level 4: Your main email account password on gmail, yahoo, hotmail etc.  This must be secure, unique, and unguessable.  Note that I put the security requirements for this password even higher than the requirements for your bank password.  Don't underestimate what somebody can do with your email account password.

    Step 5: In future, don't log into your main email account on any computer that is *not* running Microsoft Security Essentials, and don't ever use Internet Explorer to do it -- no matter how badly you need the Internet.  I never, under any circumstances, ever log into my email account on a Windows computer that is not my own (but I'm more paranoid than most, I just deal with the fallout all the time when people come to me to fix their virus-ridden computers, and I know that most people have not followed the steps above for guaranteeing their computers are virus-free).  Getting a cellphone that you can read your email on helps with this, it lets you check email and reply even when you're away from the safety haven of your own computer.  In particular, avoid Internet cafes especially while traveling -- a large majority of computers in Internet cafes are infected with some sort of spyware that will send your passwords and credit card numbers who-knows-where.

    Good luck :-)


    Taking OpenCourseWare to North Korea

    A blog post I wrote for ChosonExchange.org, cross-posted from there.  I'm going with Choson Exchange to Pyongyang in September -- my second trip to North Korea -- and I'm heading up their OpenCourseWare strategy.


    Choson Exchange to Share Creative Commons Licensed Materials from the World's Best Universities With North Korea 

    Choson Exchange is committed to providing educational materials from the world's best educational institutions to North Korean students free of charge. This goal is made possible through the OpenCourseWare (OCW) initiative, in which dozens of top universities all around the world have chosen to post a large number of course materials such as lecture videos, lecture notes, handouts and assignments on the Internet under the Creative Commons open access license. This license gives people all over the world the ability to obtain a top-quality education for free, and gives professors the ability to legally reuse these materials and incorporate them into their own teaching.

    Several other sources of top-quality educational materials are also available under Creative Commons licenses, such as lectures on a wide array of topics in mathematics, economics and finance from the Khan Academy, full high-quality textbooks on WikiBooks.org and encyclopedic content on Wikipedia.org. Recently, WikiBooks and Wikipedia added the ability to select sets of articles and have them assembled into a PDF format e-book for downloading, or these books can be easily printed, bound and shipped with a few mouse clicks through a company called Pedia Press. This provides an easy method for creation of high-quality printed textbooks or e-books that meet the content and pedagogical requirements of our North Korean colleagues.

    Choson Exchange has been invited to present Open CourseWare content and e-books at the Pyongyang International Science and Technology Book Fair (PISTBF) in September 2010. The initial content that we will take to North Korea includes both OCW and Wikipedia/WikiBooks-sourced material in the subject areas of business, economics and finance; basic sciences such as physics, chemistry and biology; medicine, including first aid, physiology and gynaecology; computer science and engineering. We plan to bring both electronic copies of lecture videos and lecture notes as well as printed copies of some WikiBooks to use in exhibitions in Pyongyang and training programs.

    The quality of many of the materials available through Creative Commons sources is very high. However no educational program can stand on the strength of the educational materials alone, there is a lot of structure and that has to be put in place for an educational program to succeed. For this reason, Choson Exchange is also committed to helping create and support the teaching infrastructure necessary to effectively kickstart training courses incorporating open content. To accomplish this, foreign advisors who are expert in each teaching area are being recruited to assist in helping their North Korean counterparts get up to speed with teaching the new academic material. We are confident this is the fastest way to improve the quality of education, and that improving education will improve quality of life and the level of wellbeing of the country.

    Finally, North Korea is unprecedented in its culture and rich history. As we work with our North Korean colleagues to bring the highest quality Creative Commons academic materials from the best educational institutions to North Korea and help them to build programs that employ these resources, we would also like to work with them, if they choose, to contribute North Korean literature, cultural and academic course materials back into the body of Open CourseWare, so that the world can learn about the North Korean story directly from North Koreans themselves. This will add to the richness of the cultural tapestry that is the Creative Commons.

    Posted by Luke Hutchison, Director of Educational Technologies for Choson Exchange


    Why Harvard Students are successful; dealing with Harvard Karma Envy

    I just tweeted:

    @MetaLev: The Social Network http://goo.gl/rarh looks semi-interesting, Zuck well-casted, but actors look more like Hollywood than Harvard

    I immediately got the following response:

    @positiveneuro: I think our different reactions to the social network are so interesting! I'm electrified by it! You're so humdrum about it! I wonder if it's because Harvard is not a land of mystique for you--it's your daily. For better or worse, the whole meme of Harvard seduces me.

    (I take classes at Harvard and attend MIT.)  I posted the following reply, which I thought was worth reposting here, because I think I learned something by tweeting it:

    @MetaLev: Harvard seduced me too until I went to take classes there, and realized everyone is pretty normal.  Normal but with an edge that I would best describe as a tendency to choose to thrive. They're a bit more enthusiastic and excited about possibilities and opportunities than the average person. That is all. Although that's a big "all", it has remarkable consequences. I hope this is teachable, I want my kids to take this approach to life.

    and, because karma envy is relevant to any discussion about Harvard, here's some more of the conversation:

    @positiveneuro: That's really insightful re:differences about Harvard kids, Luke. That's GOT to be a conditioned attitude. A friend's comment to me: "I think you reacted so strongly to the preview because you want to be a mark zuckerberg"

    @MetaLev: Yeah -- karma envy :-) Be your own MZ.

    @positiveneuro: I can't remember if we've discussed this principle from the bhagavad gita or not: it is better to live your own destiny imperfectly than to live someone else's destiny perfectly

    @MetaLev: Hmm. You could also say it's better to live your own destiny perfectly than someone else's imperfectly.  You probably won't be the next Mark Zuckerberg, you'd be imperfect at it if you tried. But he will never be the perfect you either.


    Stupid Facebook Javascript Viruses

    "Sarah liked GIRLS ARE UNABLE TO STARE AT THIS FOR 10 SECONDS, BUT GUYS CAN on Facebook and suggested you like it too."

    I get up to two invitations per week on facebook from friends to view some page that promises to show me something amazing/shocking/titillating. These are usually sent by friends who I doubt intended to send me these invitations, and inevitably they are links to facebook pages that tell me to paste some javascript code into the addressbar to view the advertised page. Of course if you do as you're told to do, then all your friends are automatically emailed an invitation to view the page -- without your knowledge.

    A surprising number of people have been falling for this attack -- probably in the millions because facebook has 500M users and a good number of my own fb friends have fallen for this. Someday I'm sure I'll get an invitation from someone that they'll be very embarrassed about -- because it is something they never would have sent, but the fact I got it indicates that they opened the link themselves...

    I'm having a very hard time getting browser vendors to take this combination of cross-site scripting (XSS) and social engineering seriously :( It's rather ridiculous that both the addressbar and the bookmarks bar (via bookmarklets) will happily execute Javascript code without warning the user or enforcing any sort of constraints on security context!

    The WHATWG mailing list thread I started about this: Please disallow "javascript:" URLs in browser address bars

    I filed bug reports for Chromium, but unfortunately the bug reports are security-related so you probably can't see them unless you're a Chromium developer: http://code.google.com/p/chromium/issues/detail?id=44796 ; http://code.google.com/p/chromium/issues/detail?id=49995 ; http://code.google.com/p/chromium/issues/detail?id=33304
    UPDATE: Firefox has a related bug:  https://bugzilla.mozilla.org/show_bug.cgi?id=305692

    All bugs have been closed as WONTFIX, and the WHATWG mailing list (the only list that most of the browser vendors subscribe to, with the exception of MS of course) doesn't really want to fix this.

    Here are my suggestions from the latest bug report for how to fix this:

    When you install a .crx extension, you are warned about the security implications of doing so.  However if you drag a "javascript:" bookmarklet to the bookmarks bar, you are not given a security warning -- however bookmarklets have access to the security context of whatever page is currently open when they are clicked.  For that reason, the bookmarklets system is vulnerable to exploitation via social engineering, and literally millions of facebook friends lists have been hacked this way by self-propagating js viruses.
    Also, having a user paste javascript: URLs into the address bar is already heavily exploited by facebook viruses to spread like wildfire by auto-sending themselves to all your fb friends.
    Proposed solutions:
    (1) The same warning should be given when dragging bookmarklets to the addressbar as is given when installing .crx extensions.
    (2) Chrome's anti-phishing system should be used to check where bookmarklets have originated (if dragged/dropped), and sites like facebook.com should be blacklisted for javascript:* bookmarklets (*not* for javascript:* URLs that are clicked on, just for URLs dragged to the addressbar).
    (3) Javascript that has no known origin (that is typed directly into the URL bar) should either be disabled by default (re-enablable via debug option, for the tiny 0.0001% of users that need this functionality), or at the very least and less preferably, the user should be given a security warning when hitting Enter after entering such a URL.  There is no legitimate reason for the other 99.9999% of users to need to enter javascript URLs into the addressbar.
    Given the success of these exploits so far on facebook, the use and nefariousness of them will likely only increase.

    Here's an example of the sort of javascript employed -- e.g. this one is from a page entitled "World's Hardest Riddle" and has you type Ctrl-C, Alt-D, Ctrl-V and then Enter to reveal the riddle (i.e. to copy all this into the addressbar): anyone care to disentangle what this is doing?

    /^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=funct
    ion(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'
    \\b','g'),k[c]);return p}('J e=["\\n\\g\\j\\g\\F\\g\\i\\g\\h\\A","\\j\\h\\A\\i


    Android vs iPhone 4 signal strength display (FWIW)

    In the light of the iPhone 4 Grip of Death fiasco, AnandTech reverse-engineered the signal-strength-to-bars mapping for the iPhone 4.  Tim Bray tweeted "Interested in signal-bar calculations? Android is open source, check updateSignalStrength() in http://is.gd/dd2Kh".  I used this source, combined with the 3GPP spec referenced in the Android source (thanks to @tweakt for the link) to produce the following graph comparing signal strength indicators on Android and the iPhone 4.

    I want to stress that since the number of bars displayed for a given signal strength is just a subjective way of presenting signal strength info to the user, so this graph is only presented for what it's worth -- don't read too much into it or get hung up on the details :-)  (As pointed out in the comment below, "for all the millions of dollars in lost productivity spent discussing 'bars'...")

    • The iPhone 4 consistently displays a greater percentage signal strength than Android (as defined by the fraction of bars lit).  However the signal-strength-to-bars mapping is not regulated or defined anywhere, other than the fact that Apple said in their open letter that AT&T recently came up with their own recommendation for this mapping on their own network.  Nothing necessarily says Android is more "right" than Apple.
    • Both Android and the iPhone 4 display the maximum number of bars (5/5 on iPhone, 4/4 on Android) for over half the usable signal strength range (as measured on the dBm scale).  The implication of this is probably that it's common industry practice to show full bars whenever the signal is strong enough that there are no real or noticeable connection problems.  So Apple may be inflating their signal strength status slightly for weaker signals in order to make it look like the iPhone 4 has excellent reception, but at least the practice of reporting full bars at -90dBm or greater appears to be the norm (based on these two data points), even though there's still a lot of signal strength headroom above that level.
    • Assuming AnandTech's measurements are accurate, it's possible to come to the conclusion that the Apple signal strength numbers appear manually fudged, accidentally or otherwise: note the short dBm range for 3 bars and the extra-long dBm range for 4 bars.  In other words the iPhone reports 4 bars at a much lower signal strength than it should relative to the other thresholds, the chosen thresholds don't follow a smooth curve.
    • The iPhone 4 is generally overreporting the number of bars relative to Android for lower signal strengths (under -101dBm or so), but is about in line with Android for the highest signal strengths (over -97dBm).  Assuming that a good set of thresholds were chosen in the Android source, and assuming that the radio in an average Android device and the radio in the iPhone 4 have similar characteristics, this supports Apple's point in their letter that weak signals were previously given too many bars.  (Note however that the 3GPP standard only reports signal strength at 2dBm intervals, so selecting thresholds is not exactly a precise science, and the Android numbers are probably chosen somewhat arbitrarily too...)
    • It's hard to say what mapping Apple will have to use to make it look like the Grip of Death isn't an issue on the iPhone, but based on AnandTech's testing the attenuation is high enough that they won't be able to hide it entirely.
    • Note that the Nexus One suffers from a similar problem to the iPhone 4, you can easily lose 3G reception if you grip the phone along the metal strip at the back.
    Update: The comparison is made murkier by the fact that AT&T uses WCDMA for their 3G data connection and T-Mobile and other GSM carriers use standard GSM 3G standards.  The characteristics of different data transmission standards may be dissimilar as signal strength varies and across changes in environmental characteristics such as noise levels and terrain.

    Follow me on Twitter <



    (1) an insightful comment by xtal on Slashdot:

    For all of the millions of dollars being lost on productivity aimlessly discussing 'bars'..

    Can someone please dissect the antenna and then connect it to a calibrated spectrum analyser? This is so mindbogglingly trivial to do it is beginning to hurt my soul. I do similar exercises at work with new, untested antenna designs. I am sure I am not the only one.

    For comparison, do the same to other phones and publish actual measurements of received signal drops and the effect from the disturbance caused from closing your hand around the antenna. This is similar to how touching an old rabbit-ears style antenna effects the picture on a analog TV broadcast, if the effect is as I suspect.

    Voila! An actual, meaningful assessment of what the phone bars mean in real numbers from a calibrated instrument.

    An uncalibrated receiver, such as the iphone, is not a proper tool to do this.


    Post updated, see new link below

    This post has been superceded by a newer post illustrating the Android vs iPhone 4 signal-strength-to-bars mapping, please visit that post instead.

    Signal-strength-to-bars mapping on iPhone 4

    Apple released the statement today that they were "stunned to find" (could they be any more dramatic?) that the formula used to map signal strength to bars on the iPhone 4 is "totally wrong".  They state they will fix the problem by reducing the number of bars displayed for a given signal strength.

    "Our formula, in many instances, mistakenly displays 2 more bars than it should for a given signal strength.  For example, we sometimes display 4 bars when we should be displaying as few as 2 bars. Users observing a drop of several bars when they grip their iPhone in a certain way are most likely in an area with very weak signal strength, but they don’t know it because we are erroneously displaying 4 or 5 bars. Their big drop in bars is because their high bars (sic) were never real in the first place (sic again)."

    HOWEVER I claim that they will increase the number of bars for weak signals, and only decrease the number of bars for medium-strength signals.  Here's the reasoning:

    • The open letter implies several times that when the number of bars is low, the reading doesn't reflect the actual operability of the phone, i.e. that the low end of the bar scale is too low: "iPhone 4 can drop 4 or 5 bars when tightly held...This is a far bigger drop than normal, and as a result some have accused the iPhone 4 of having a faulty antenna design.  At the same time, we continue to read articles and receive hundreds of emails from users saying that iPhone 4 reception is better than the iPhone 3GS. They are delighted. This matches our own experience and testing...The iPhone 4's performance is the best we've ever shipped".
    • They say "To fix this, we are adopting AT&T’s recently recommended formula for calculating how many bars to display for a given signal strength", but it's impossible to believe that they'll just apply AT&T's formula without tweaking it if you also believe their claims that the iPhone 4 really is better at operating on a weak signal than previous phones -- because bars are a subjective measure of the cellphone's ability to connect over the cell network.  These two claims simply don't match up.  They'll jack up the number of bars compared to the AT&T recommendation if it really is true the iPhone 4 can get the same quality of connection on a weaker signal than previous iPhones.
    • There would be a public outcry if they just drop the signal threshold for each of the bars (a drop of 4->2 could now be 2->1 or 2->0 if they are just displaying "two more bars than we should"), or even if they just dropped the signal strength threshold of the high-numbered bars -- "I got the iOS update and now my phone has a weaker signal than ever!".  The whole problem is that people believe bars and don't understand the science or math behind the bars.
    • Apple's magic fix for previous iPhone signal strength problems was supposedly an inflation of the number of displayed bars.  (It keeps people happier.)

    The great irony of all this is that it appears Apple has been caught red-handed trying to inflate their signal strength in the first place...


    Is your computer slow? Fix your own computer.

    I have fixed about six computers in the last two weeks, and just like 95% of the other computer help I give people these days, it came down to this one problem:

    "Why is my computer so slow?"
    "My computer seems to have a virus, how do I fix it?"

    One thing I have noticed is that most people just live with slow computers because they don't know where to turn for help, or they don't know that things could be better.  I figured it was time to write this blog post -- slow computers are fixable, and in fact you can fix it yourself.  The following steps should fix 95% of your computer slowness issues, and your computer will probably feel 2x-5x faster when you are done.  Please spread the word, you don't need to suffer in silence anymore :-)

    (Note that this blog post applies to the Windows operating system, but you should be aware that there are other alternatives that don't get slower the longer you use them, including Mac OS, Linux and soon Google Chrome OS.)


    Step 1: Fix DNS settings (if necessary)

    Some viruses redirect Web traffic to malicious sites, so you can't actually download or update antivirus software, etc. to fix your computer. If antivirus sites are inaccessible, and/or you get suspicious popups or fake sites when trying to visit well-known domains, follow the advice here and/or here.

    Step 2: Disable unnecessary startup processes

    Explanation: One of the biggest reasons your computer is slow is probably that when you switch your computer on, it loads all sorts of programs into RAM (memory) that are not needed.  (A lot of these programs put icons down in the System Tray at the bottom left of the screen, but not all of them do.)  When you run out of RAM, your computer uses the hard drive as extra storage (it "swaps out to disk"), and the hard drive is on the order of 1000 times slower than RAM.  So the first thing you should do is stop all the unnecessary programs from starting when you boot, and you'll be less likely to run out of RAM.

    How to do it: Download, install and run a startup manager.You will then be presented with a list of programs that are run when your computer starts up.  In most cases, you won't break anything really badly if you uncheck all of them -- but try to understand what they each are before you uncheck.  If you don't know, you can always uncheck it and come back and re-check it again if you notice something doesn't work right.  A few cases to get you started:
    • If you disable anything that says "Synaptics" or "SynTPE" then your touchpad might not have the full functionality, e.g. this program detects movement on the right hand side of the touchpad and causes the current window to scroll without you having to drag the scrollbar.  You probably want to leave this checked.  Also anything that talks about hotkeys handles the Fn+F4 key combinations etc., leave that checked if you need those functions.
    • If you sync an iPod with your computer, you probably want to leave the iPod stuff checked.
    • If you see something about HP printing and you have an HP printer, you can uncheck this and you will still be able to print, but you might not get notified on your computer when your ink is running low (not a big deal).
    • If you uncheck something to do with your digital camera, then your photo editing software might not pop up when you plug in your camera, but you will get the standard Windows photo import window instead, and you can still start your photo editing software manually, so again it's not a big deal if you uncheck it.
    • The programs you really want to kill is anything that says "QuickStart" or similar -- Acrobat reader, OpenOffice and other programs have quickstart options.  They may be bringing your computer to its knees by filling up your memory.
    • There are usually programs that start up that put advanced control panels for your graphics card and/or sound into the system tray.  However you can usually disable these too and your sound and graphics will continue to work fine.
    • In general, if you don't know what it is or why you need it, you probably won't miss it -- uncheck it!  Some people would think this is extreme and generally bad advice -- but honestly, your computer won't be in worse shape than it was before when it was unusably slow :-)  And again, you are not likely to break anything serious, but if something doesn't work properly after this step (e.g. your webcam doesn't work right), you can just try switching some of these programs back on again (by re-running CodeStuff Starter) and rebooting until you figure out what needed to be switched on.

    Now reboot your computer and hopefully your computer will feel faster already.

    Step 3: Get all Microsoft updates

    Explanation: Microsoft pushes out critical and non-critical updates that fix bugs on your computer, speed your computer up, and make your computer less vulnerable to viruses and spyware.  (Viruses and spyware are another big reason why computers can slow down.)  In particular you need all Service Pack downloads, and you should update Internet Explorer to version 8, as this fixes some critical parts of Windows -- but you should get all the critical and recommended updates if you don't have them already.

    How to do it: Open Internet Explorer and go to http://update.microsoft.com/ .  You may be asked if you want to update from Windows Update to Microsoft Update.  If you have the option, definitely install Microsoft Update, it is better than the older Windows Update because it keeps not just Windows but also Office up to date.  Next follow directions to check for the latest software updates, and download and install all critical and recommended updates.  You can also manually select a few optional updates here but you don't need them.  You probably have to reboot after installing updates.  Once you have rebooted, go back to http://update.microsoft.com/ and make sure there aren't new recommended updates listed that were there before.  If there are new updates, then rinse and repeat.  (Sometimes you can't install all updates at one time, some of them have to be performed in separate steps.)

    Step 4: Replace your antivirus software with Microsoft Security Essentials

    Explanation: Antivirus software has to wedge itself into all sorts of critical parts of your operating system to stop viruses in their tracks.  As a result it can sometimes cause more problems than it prevents.  Also you have to pay $60-100 per year for antivirus updates, otherwise your computer is still at risk from being infected by the newest viruses.  Microsoft recently realized they need to start cleaning up the messes they created by selling an operating system full of security holes, and they released a product called Microsoft Security Essentials, which seems to be one of the few products they have gotten really right -- it's solid, fast, and best of all free forever: they will keep sending you updates for free and you never need to pay the Symantec / Norton / McAfee yearly tax again.

    How to do it:
    1. Go to Start -> Settings -> Control Panel, and find the Control Panel program that lets you uninstall programs.  You might have to click on "Classic View" on the left hand side of the Control Panel to find it easily.  It might be called "Add/remove programs" or "Programs" or something else, they keep changing the name of it in different Windows versions, and I can never remember what it's called on each Windows version (I'm a Linux user) -- but when you find it it will have a list of all the programs you have installed.  Uninstall anything that has "Symantec", "Norton", "McAfee", "ClamAV", "Kapersky" or similar in the name.  Also uninstall anything that contains keywords like "Antivirus", "Spyware", "Spybot", "Internet security" etc. in the name.  It's important to uninstall these before installing Microsoft Security Essentials, and don't worry, you'll only have it uninstalled briefly.  You'll probably need to reboot after installing one or each of these.
    2. After rebooting, go to http://www.microsoft.com/security_essentials/ and download and install Microsoft Security Essentials.  It's pretty easy to install, and when it is done it will ask if you want to download the latest virus definitions and run a scan -- choose Yes.
    Step 5: Install the Google Chrome Web browser and never use Internet Explorer to browse the Web again

    Explanation: Something most people don't know is that 95% of virus infections today come from using Internet Explorer.  You should only ever use it for Microsoft Update and for nothing else.  Something a lot of people also don't know is that they have a choice when it comes to web browsers.  Internet Explorer is a terrible program, Firefox is better, safer and faster, and Google Chrome is like Fort Knox as far as security and it's faster than greased lightning.  Using Google Chrome will therefore protect you online and it will make your computer feel even faster.

    How to do it: Go to http://google.com/chrome and download and install Google Chrome, and then go find any Internet Explorer icons on your desktop or in your Quick Launch tray (bottom left, by Start) and delete them so you don't accidentally use IE again.  Use Chrome for all your browsing, I promise your computer will feel much, much faster.

    THAT'S IT!  Enjoy your new turbo-charged computer.


    Clustering large datasets

    On the MIT-internal "csail-related" mailing list someone recently asked for software to help him perform matrix multiplications of 10^6 x 10^6-sized matrices.  Ron Rivest quite correctly replied that to multiply matrices this size, even for a single multiplication you would probably need about 4 years of compute time -- because there are a trillion entries in matrices this size.  I posted the following reply, which I am re-posting here partly for my own reference because it contains a lot of the points I have learned in various work clustering huge datasets, and partly in the hope that somebody else will find it useful.


    Ron's analysis is correct, unless your matrix is very sparse -- in which case sparse matrix methods may make this problem entirely tractable, and any of the linear algebra toolkits implement efficient sparse matrix methods that you can use. The main problem you'll have is fitting it all in memory -- you'll need to look into matrix blocking techniques (dividing the big matrix into submatrices, and figuring out the correct way to multiply the subblocks to get the full result). There's some great discussion about keeping subblocks in CPU cache in the following paper that may help you figure out how to keep your much larger subblocks in main memory as long as possible: http://homepages.cae.wisc.edu/~ece539/software/matrix/mbpv11.ps The difference between swapping subblocks in and out at the right time vs. the wrong time could make several orders of magnitude in difference in how long it takes to compute your result. There are also some parallel solutions to multiplying large matrices that will run on large clusters and trade off time swapping subblocks in and out of memory for data communication overhead between nodes.

    A similar problem to matrix multiplication problem you describe is encountered in data clustering, given the "N Choose 2" or O(N^2) scaling of the number of pairwise distances in a dataset. It is intractable to use all-pairs distances with even the simplest clustering algorithms in large datasets, for example hierarchical agglomerative / bottom-up clustering applied to more than tens of thousands of points. Depending on the exact nature of the problem, you may be able to exploit spatial coherence within your problem space -- e.g. for agglomerative clustering, you use only the nearest neighbors when joining clusters, so you can often reduce the complexity of clustering problems using a smart data structure like a kd-tree that gives approximately O(log(N)) time per nearest neighbor lookup. However the kd-tree algorithm quickly degrades to ~O(N) performance per lookup in high-dimensional spaces because of the curse of dimensionality, so you may need to do dimensionality reduction first anyway to make the kd-tree useful. (You'd also have to reframe your matrix multiplication problem into a format where using a kd-tree to find nearest neighbors in your vectorspace helps you compute a fast, close approximation to your desired solution.)

    Another approach used to cluster datasets with millions of points (and therefore trillions of pairs of points) is to pick a few exemplar points and cluster these instead to generate a sample approximation of cluster assignments for the full dataset. For example you could randomly choose one point out of every thousand, and cluster these into your K target clusters (= O(N^2 / 1000^2) time to cluster 1/1000th of the points), then go back through your full dataset and find the closest exemplar point to each original data point in order to compute the cluster labeling (= O(N^2/1000), although you can often skip this step entirely if you just use the exemplars). The largish constants in the time complexity can make this approach tractable for larger datasets than you could otherwise work with. The exemplar method I just described is incidentally half an iteration of the k-means/k-medians algorithm applied to a set of exemplar points. You can go further by completing the full first iteration of k-medians by going back and updating your selection of exemplar points using the medians of the newly-labeled clusters, and then depending on how much compute time you can afford, you could run multiple whole iterations of this exemplar-modified k-medians algorithm (or run until convergence) to get better exemplars -- though even the first half-iteration may be sufficient to get you a useful result. As far as how to phrase your matrix multiplication problem in this framework, depending on the problem you may be able to find representative row/column vectors this way and then just multiply the representative vectors together to get a product that in some sense is a low-dimensional approximation of the full matrix product.

    Ultimately whether you use (PCA-based) dimensionality reduction or k-means / k-medians, you're doing approximately the same thing, and this is why preprocessing with PCA can often help k-means to converge faster : quoting from http://en.wikipedia.org/wiki/Principal_component_analysis#Relation_between_PCA_and_K-means_clustering :

    It has been shown recently (2007) [12] [13] that the relaxed solution of K-means clustering, specified by the cluster indicators, is given by the PCA principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix. Thus PCA automatically projects to the subspace where the global solution of K-means clustering lie, and thus facilitate K-means clustering to find near-optimal solutions.