Monday, March 30, 2009

Is There Any Use In Using Keywords In The URL?

Is there any value in using keywords in the URLs of web pages? Would a search engine look at keywords that you might include in the addresses of your pages, and associate those keywords with the content of your pages in the search engine’s index?

If so, how would a search engine go about looking at the URLs of your pages and breaking them down into meaningful parts to identify keywords?

Breaking URLs down into parts may also play a role in how the pages of a web site might be crawled by a search engine.

A newly published Yahoo patent application gives us some ideas on how it might extract keywords from the URLs of pages and rank them, as well as use information uncovered in the process to determine which pages to crawl first from a web site.

Techniques for Tokenizing URLs
Invented by Krishna Leela Poola and Arun Ramanujapuram
Assigned to Yahoo
US Patent Application 20090083266
Published March 26, 2009
Filed November 6, 2007

A search engine will look at many different signals to determine what a page on the Web is about, and attempt to rank pages based upon keywords that might be an indication of the subject matter or content of those pages.

Many of those keywords are extracted from the content of pages themselves, but a search engine can look at other information associated with pages, such as the addresses of the pages.

Keywords may also be extracted from the URLs of pages by using an algorithm that can break a URL into components, understand the structure of those URLs, and extract candidate keywords from the different parts found within the URL.

Parts of URLs

The patent application provides a definition for different parts of URLs:

Scheme - This section of a URL identifies the internet protocol used to access a resource, such as HTTP or FTP.

Authority - The part of a URL that identifies the host server where the documents or resources are located, or the domain name.

Path - This is the information following the slash character after the authority, or domain name, and it identifies the specific page or resource.

Query arguments - A string that may appear in a path that can be broken down into name and value pairs, such as “category=shirts”.

Fragments - A fragment identifies a subsection within a page that might be pointed to in a URL, usually starting with the “#” symbol.

An example of these five different components from the patent filing:

http://www.yahoo.com:80/shopping/search?kw=blaupunkt#desc

In this URL, the scheme is “http”.

The authority is “www.yahoo.com:80”, which shows the domain, and also includes a port number of “80” in this instance.

The path is technically everything after that first single slash: “shopping/search?kw=blaupunkt#desc”.

A query argument shown in this example is “kw=blaupunkt”.

The fragment from this URL is “#desc”.
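
As a rough illustration, here is how those components could be pulled apart with Python’s standard urllib.parse module. Note that urlsplit separates the query and fragment from the path, while the patent filing’s definition of the path takes in everything after that first slash, so the pieces below are the standard library’s equivalents rather than the patent’s exact breakdown.

    from urllib.parse import urlsplit, parse_qsl

    url = "http://www.yahoo.com:80/shopping/search?kw=blaupunkt#desc"
    parts = urlsplit(url)

    print(parts.scheme)                  # "http" - the scheme
    print(parts.netloc)                  # "www.yahoo.com:80" - the authority, including the port
    print(parts.path)                    # "/shopping/search" - the path, with query and fragment split off
    print(dict(parse_qsl(parts.query)))  # {"kw": "blaupunkt"} - the query argument as a name/value pair
    print(parts.fragment)                # "desc" - the fragment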

Tokenizing URLs for Keywords and Web Crawling

The patent application describes a way that it might break down URLs into parts, or components, to extract keywords from URLs. Those keywords could be used to categorize pages for web search, and to understand what pages are about when providing advertisements for those pages.

This breaking down of URLs into components, and into even smaller parts, is referred to as “tokenizing URLs.” In addition to helping a search engine find keywords in URLs, tokenization can have an impact on the indexing of the pages of a web site:

The tokens generated by URL tokenization may also be assigned with features of the web document to improve the efficiency of a web search. Tokenizing URLs is also the first step when clustering URLs of a website. Clustering URLs allows the identification of portions of a web document that hold more relevance. Thus, when a website is crawled by a search engine, some portions of web documents may be white-listed and should be crawled, while other portions may be black-listed and should not be crawled. This leads to more efficient web crawling.
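
To make the idea of tokenization a little more concrete, here is a simplified sketch of how a URL might be broken into candidate keyword tokens by splitting its path, query arguments, and fragment on common delimiters and discarding purely structural pieces. This is an illustration only; the stop-token list and the splitting rules are assumptions, not the patent’s actual algorithm.

    import re
    from urllib.parse import urlsplit, parse_qsl

    # Assumed list of structural tokens that carry little meaning as keywords
    STOP_TOKENS = {"www", "index", "html", "htm", "php", "com"}

    def tokenize_url(url):
        parts = urlsplit(url)
        # Split the path on common delimiters such as "/", "-", "_", "." and "+"
        raw = re.split(r"[/\-_.+]", parts.path)
        # Add the names and values from any query arguments
        for name, value in parse_qsl(parts.query):
            raw.extend([name, value])
        # Add the fragment, if one is present
        if parts.fragment:
            raw.append(parts.fragment)
        # Drop empty pieces and structural tokens, keeping candidate keywords
        return [t.lower() for t in raw if t and t.lower() not in STOP_TOKENS]

    print(tokenize_url("http://www.yahoo.com:80/shopping/search?kw=blaupunkt#desc"))
    # ['shopping', 'search', 'kw', 'blaupunkt', 'desc']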

Conclusion

Yahoo provides a fair amount of detail in the patent filing on how URLs can be broken down into components, how keywords can be extracted from those components, and how those keywords might be given different rankings. If you’re interested in how the URLs of your site might be treated under this process, it’s worth spending some time with the patent filing itself to get a grasp of the technical details. Keep in mind that the processes described in this patent application may not be the ones that Yahoo is presently using.

A cautionary note - changing the URLs of your pages, especially if those URLs have been around for a while and are indexed by search engines, is an undertaking that shouldn’t be started without careful consideration, and without a cautious approach that keeps the risk involved in such a change to a minimum. Such an approach can include setting up proper redirects (permanent 301 redirects) from the old URLs to the new ones, so that external links pointing at pages of the site still resolve, updating internal links on the site itself to point to the new addresses, and other technical steps that might help a site retain its rankings in search engines. How a search engine reacts to changes to the URLs of the pages of a site can vary from one search engine to another, and traffic to the pages of a site may be negatively impacted by such a change for a period of time, regardless of how carefully the change is implemented.
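
For what it’s worth, a permanent redirect is simply an HTTP response with a 301 status code and a Location header pointing at the new address. The sketch below shows the idea using Python’s standard library; the old and new paths are hypothetical examples, and in practice redirects are usually configured in the web server itself rather than in application code.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical mapping from old URLs to their new addresses
    REDIRECTS = {
        "/old-page.html": "/new-page.html",
    }

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            new_location = REDIRECTS.get(self.path)
            if new_location:
                self.send_response(301)            # permanent redirect
                self.send_header("Location", new_location)
                self.end_headers()
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), RedirectHandler).serve_forever()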
