Summary
Short URLs have seen a significant increase in usage over the past years, mostly attributed to length restrictions in popular social networks such as Twitter. This burst in popularity raises questions about their positive and negative impact on the web, whether within private social networks or across the Internet as a whole.
This study analyzes the “web of short URLs” through traces of short URLs collected from two different perspectives, in an attempt to provide a more in-depth description of the distribution, stability, lifespan and overall use of short URLs. Their concentrated use in specialized communities and services, as well as the enormous size of the shortening services in terms of web population, are the two main reasons that drive this research.
Data was collected with two different crawling platforms: one aimed at Twitter messages (community use) and the other at the shortening services (general use). In addition, the bit.ly services were used as an information provider for short URL statistics such as referrers, hit ratio, geographic location and aggregated versions of the metadata. The Twitter crawler was run in two versions (one search every 5 minutes and one every 30 seconds) in an attempt to verify the validity of the statistics for the milder one. The brute-force crawler targeting ow.ly and bit.ly was tweaked to cover the entire keyspace of short URLs of up to 3 characters (all returned data had at least 6 characters, indicating exhaustion of shorter combinations). With the help of heuristics based on the evolution of the keyspace in a specified timeframe, it was possible to determine the creation rate of short URLs (70000 a day).
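As a rough illustration of the keyspace idea, the following minimal sketch counts and enumerates candidate keys and derives a creation rate from two keyspace snapshots. The 62-symbol alphabet (letters and digits), the function names and the simple "difference over time" estimate are all assumptions made for illustration; the study only states that heuristics based on keyspace evolution were used, not which ones.

```python
import string
from itertools import product

# Assumption: keys are drawn from upper/lowercase letters and digits.
ALPHABET = string.ascii_letters + string.digits


def keyspace_size(max_len, alphabet_size=len(ALPHABET)):
    """Number of distinct keys with 1..max_len characters."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


def enumerate_keys(max_len, alphabet=ALPHABET):
    """Yield every key of length 1..max_len, shortest keys first."""
    for n in range(1, max_len + 1):
        for combo in product(alphabet, repeat=n):
            yield "".join(combo)


def creation_rate_per_day(keys_at_start, keys_at_end, days):
    """Hypothetical estimate: growth in allocated keys divided by elapsed days."""
    return (keys_at_end - keys_at_start) / days
```

Under these assumptions, `keyspace_size(3)` gives the number of keys a crawler would have to probe to cover all combinations of up to 3 characters, and a keyspace that grows by 490000 keys in a week corresponds to the 70000-a-day figure quoted above.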
In order to better understand short URLs, it was imperative to break the analysis down into 5 sections/questions:
· Where do they come from? : The majority of users arrive from non-web applications like IM (Instant Messaging), email clients and mobile apps. Since their area of dominance is inside social network communities, this distribution suggests a new browsing model based on a “word of mouth” type of propagation (spreading from user to user).
· Where do they point? : From all traces, it became apparent that most short URLs point to news-related sites. The results confirmed previous studies, even though the number of results was relatively small due to wrapping techniques used by spammers to hide short URLs.
· Location : The results indicated that short URLs spread through different channels, since there were very few users from countries like India and China, both known for their very large numbers of internet users.
· Popularity : The popularity of short URLs is measured from 3 different perspectives.
o Number of hits : According to the Cumulative Distribution Function (CDF) there is a power-law behavior, meaning a small fraction of the short URLs accounts for most of the hits, with the rest considered uninteresting. To eliminate any bias, all short URLs created in the last week of the trace collection were excluded.
o Active or Inactive : Inactive URLs are those with no activity during the last week (threshold). Comparing the distribution of clicks between active and inactive URLs (even when moving the threshold) resulted in similar curves following the 90-10 rule (10% of the URLs account for 90% of the hits).
o The web sites : The overall findings suggest that while the community using short URLs shares some interests with the broader web (YouTube, etc.), it also presents a distinctive focus on web sites of special interest (news, entertainment, etc.). In a number of cases those sites received a steady amount of visits for long periods of time.
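The 90-10 rule mentioned in the popularity analysis can be checked with a few lines of code. This is a minimal sketch, assuming the trace is available as a plain list of per-URL hit counts; the function name `top_share` and the sample numbers are hypothetical and not taken from the study.

```python
def top_share(hits, fraction=0.10):
    """Share of all hits received by the most popular `fraction` of URLs."""
    if not hits:
        return 0.0
    ranked = sorted(hits, reverse=True)
    # Always consider at least one URL, even for very small samples.
    k = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Illustrative (made-up) hit counts for ten short URLs.
sample_hits = [90, 2, 2, 1, 1, 1, 1, 1, 1, 0]
share = top_share(sample_hits)
```

With the made-up sample above, the single most popular URL (10% of the set) collects 90 of the 100 hits, which is the kind of skew the 90-10 rule describes.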
Short URLs are not searchable, which suggests a high probability of a short lifespan (the number of days between the first and the last observed hit). Splitting the traces into active and inactive URLs, the CDF graphs disproved this expectation by showing that in general 15% of short URLs lived for at least a month. Although the lifespan of 51% of the inactive URLs was about a day, the lifespan of 50% of the active ones ranged from 3 to 4 months. As for the hit distribution during this lifetime, as expected in both cases (active or not) the dominant share of hits came during the first day, followed by a continuous drop varying from 40% (from day to day) to 100% and even 200%, depending on how many short URLs were included in the results. Including more URLs meant including more of the less popular ones, which made the hit counts change faster. The lifetime of an active URL did prove to be linear in log-log scale with its hit rate, whereas for the inactive ones no dependency became apparent.
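The lifespan and activity definitions used here translate directly into code. The sketch below assumes each short URL comes with a list of the dates on which hits were observed; the function names, the sample dates and the exact boundary handling of the one-week threshold are assumptions for illustration.

```python
from datetime import date


def lifespan_days(hit_dates):
    """Days between the first and the last observed hit."""
    return (max(hit_dates) - min(hit_dates)).days


def is_active(hit_dates, trace_end, threshold_days=7):
    """True if the URL was hit within the last `threshold_days` of the trace.

    Assumption: a last hit exactly `threshold_days` before the end of the
    trace counts as inactive.
    """
    return (trace_end - max(hit_dates)).days < threshold_days


# Made-up example: first hit on 1 January, last hit on 1 April 2010.
hits = [date(2010, 1, 1), date(2010, 1, 15), date(2010, 4, 1)]
```

For the made-up dates above, `lifespan_days(hits)` is 90 days, and the URL counts as active if the trace ends on 5 April 2010 but as inactive if it ends on 30 April 2010. Moving `threshold_days` corresponds to the threshold shift described in the popularity analysis.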
The publishers of short URLs (mostly on Twitter) were another topic of interest, with questions about their daily posts, content and popularity coming into focus. In 90% of the cases users generated 5 or fewer tweets every day, mostly with original content. If a user increased his posts disproportionately, that did not always correlate with an increase in hits, since other users would start to view him as a spammer and stay away from any links related to him. In the cases where higher posting rates did lead to a higher hit ratio, there was a clear pattern of semi-automated behavior, meaning most of the posts were produced by applications relaying content directly from RSS feeds and not by a “live” user sitting behind a PC.
Last, we analyze the potential performance implications of short URLs with regard to two aspects:
1. Space Reduction: Measuring the amount of space saved means defining the relative ratio of the URLs’ length before and after the shortening. One out of two URLs had a 91% size reduction, with 95% less space reported for 90% of the URLs when compared to their longer versions. The improvement is more substantial when considering that 69% of the gathered tweets could not exist, due to length limitations, if they used the longer original URL instead of the short one.
2. Latency: The additional step of indirection introduced by short URLs results in higher access time, pointing to a degradation of performance. The worst-case scenario (across multiple shortening services) was about 0.46 seconds, which may seem low, but when applied to the 200 top short URLs it results in a 54% additional overhead in 50% of the cases, going up to 100% in 10% of those accesses.
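Both performance metrics are simple ratios, which the following minimal sketch makes explicit. The function names and the example values are hypothetical; the 140-character default reflects Twitter's message limit at the time and is an assumption, since the summary only mentions "length restrictions"; and treating overhead as redirection time divided by the access time of the target page is likewise an assumed reading of the overhead figures above.

```python
def size_reduction(long_url, short_url):
    """Fraction of characters saved by using the short URL."""
    return 1 - len(short_url) / len(long_url)


def fits_limit(text, limit=140):
    """Assumption: messages are capped at `limit` characters (140 by default)."""
    return len(text) <= limit


def latency_overhead(redirect_seconds, page_seconds):
    """Extra access time from redirection, relative to loading the target page."""
    return redirect_seconds / page_seconds


# Made-up example: a 200-character URL shortened to 20 characters.
long_url = "http://example.com/" + "a" * 181
short_url = "http://bit.ly/abcdef"
```

In this made-up example the short URL saves 90% of the space, and a message of 100 characters plus the URL fits the assumed limit only in its shortened form. A redirection of 0.46 seconds in front of a page that itself loads in 0.92 seconds would amount to a 50% overhead under the assumed definition.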
All in all, we learned a great deal about the “web of short URLs”, ranging from their communities, lifespan and access patterns to the users responsible for spreading them. The general understanding is that short URLs are here to stay, since they have become a vital part of services like Twitter. However, it is imperative for future expansion to focus on a new shortening architecture that will be able to handle the latency issue much more efficiently than the one currently in use.