“Thank you guys for coming to the class at this time…” (the class is at 8:00 AM). The professor seems young: tall, long nose, dressed as if he is heading to work right after class.
“How many of you are graduate students?” Everyone raised their hands.
“How many of you are in your last semester?” 75% raised their hands.
“I hope you enjoy the experience…”
You know, small questions like these show that the professor actually cares about his students, which is cool.
The course is about information retrieval through the analysis of search engines, and you have to build one yourself:
*Web crawler. HW: Write one
*Indexing data. HW: Use Solr
*Query processing, with analysis of well-known ranking algorithms, e.g. PageRank and HITS
*Detailed look at the Google search engine, incl. Google File System and MapReduce.
Assignments will consist of a short explanation of the student’s solution plus the source code. Good ones will be praised in class (demonstrated anonymously???).
…The course, however, won’t mainly focus on Google (one of the greatest innovators and game changers, with its research results published online) but on search engines in general: principles, how they work, component by component, etc.
It has a Webtech feel, in that most of the time will be spent on lectures and the rest on watching some interesting videos (those Lynda videos are boring; I am not used to videos).
2. Search Engine Basics
-History time: Consult this picture
*Some checkpoints:
+Archie (the name is “archive” without the “v”, though it is often associated with the comic character) was developed to archive lists of files shared through FTP servers. It provided regex search.
+Veronica & Jughead were created to search files on Gopher servers; Gopher was a predecessor of the WWW.
-6 Stanford undergrads founded Excite, which based its query processing on statistical analysis of word relationships, making search more efficient. Excite later filed for bankruptcy.
-June 1993, Matthew Gray released the WWW Wanderer, a Perl-based crawler that measured the web’s growth by counting active web servers.
-Late 1993, ALIWEB came out, a meta-information index that let users submit their sites and page descriptions (i.e. searching turns into querying all these page descriptions?)
-Alta Vista came around 1995 and was a game changer because it provided many features: ~unlimited bandwidth, natural language queries, advanced search techniques, the ability for users to add/delete their links within 24 hours (of submitting?) and inbound link checking. Alta Vista was later bought by Overture, which was then bought by Yahoo to incorporate the tech into Yahoo! Search.
-InfoSeek started off in 1994 and was bundled with Netscape (the predecessor of today’s Firefox and a big player in the web browser industry). InfoSeek was the first to sell ads on a cost-per-thousand-impressions (CPM) basis. It was later bought by Disney in 1998.
-1994, David Filo and Jerry Yang founded Yahoo, a directory of web pages. It was not a search engine because all the links were updated manually. As the listings grew in size, the task became unmanageable, upon which Yahoo acquired Alta Vista (through Overture) to try to change that. Nowadays, they just use Google or Bing.
-1998, two Stanford graduate students launched Google, which grew out of their prototype BackRub; the name is a play on the word googol (1 followed by 100 zeros), and the rest is history. The professor said it was a time when you woke up and things felt different; it changed instantly.
How we ended up where we are today stems from one idea: organizing data (in some way) and querying it (smartly). Web search is a multi-billion-dollar business that often runs up against the boundaries of intellectual property law.
* Search Engine internals:
– Crawler (web spider): recursively follows links to build a corpus (a map of the WWW).
– Indexer: stores pages, keywords, etc. in a way that makes them easy to retrieve later.
– Query processor: interprets the query to find matching documents.
==> Challenges: dynamic web sites whose content is generated on the fly can be missed by the search engine. The query processor also requires semantic analysis: language of the query, filtering of stop words, specific queries (company, location, restaurants, etc.), user profile, past queries.
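To make the three components concrete for myself, here is a minimal sketch of how they could fit together. It assumes a tiny in-memory "web" instead of real HTTP fetching: the PAGES dictionary, the example URLs, the STOP_WORDS set and all function names are made up for illustration, and a real crawler and a real index (e.g. Solr) do far more than this.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("search engines crawl the web", ["b.example", "c.example"]),
    "b.example": ("an indexer stores keywords for the web", ["a.example"]),
    "c.example": ("the query processor finds matching documents", []),
    "d.example": ("an unlinked page is never found", []),
}

# Assumed stop-word list; real engines use much larger, per-language lists.
STOP_WORDS = {"a", "an", "the", "for", "is", "of"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def crawl(seed):
    """Crawler: follow links breadth-first from the seed to build a corpus."""
    seen = {seed}
    queue = deque([seed])
    corpus = {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        corpus[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


def build_index(corpus):
    """Indexer: inverted index mapping each term to the URLs containing it."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Query processor: return URLs that contain every non-stop-word term."""
    terms = tokenize(query)
    if not terms:
        return []
    results = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(results)


corpus = crawl("a.example")
index = build_index(corpus)
print(search(index, "the web"))  # ['a.example', 'b.example']
```

Note that d.example never shows up in any result because no page links to it, which is the same reason content a crawler cannot reach gets missed by a search engine.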
– Users & search engines:
+ Users sometimes cannot distinguish between the search bar and the URL field, and rarely use the scroll bar
+ Growing number of mobile users, high-speed connectivity
+ Search & browse. Avg. query length: 2.5 terms. Poor comprehension of query syntax (modern search engines blur this)
+ Needs: informational (learn sth new), navigational (go to that page), transactional (do sth), gray areas (in between/unclassified)
+ Evaluation of search engine :
. Relevance & validity of results
. User interface : simple & not cluttered.
. Trust: objective results.
. Post-processing tools provided: refined/suggested search options
. Deal with idiosyncrasies.