Wikipedia:Modelling Wikipedia's growth

This page analyses the article count data in Wikipedia:size of Wikipedia as of December 2003, and attempts to fit a simple numerical model of past and future growth to the observed article count size and growth data.

Is the growth in article count of Wikipedia exponential?

One common model of Wikipedia growth is that:

  • more content leads to more traffic
  • which leads to more edits
  • which generate more content

Thus, the average rate of growth should be proportional to the size of the Wikipedia.
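
As a short worked form of this argument (the symbols k and y_0 are introduced here for illustration and are not fitted values): if the growth rate is proportional to the article count y, then

<math>\frac {dy} {dt} = k y \quad \Longrightarrow \quad y(t) = y_0 e^{k t}</math>

where y_0 is the article count at t = 0. Under this assumption the article count doubles every <math>\ln 2 / k</math> units of time, whatever its current size.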

However, it is quite difficult to see whether this is the case, given the distorting effects of auto-generated articles, sampling noise and server slowdowns, which would act to hide any such trend even if it were present.

Here is the graph of article count growth data for the English-language Wikipedia alone, based on Erik Zachte's new statistics analysis: see Wikipedia:size of Wikipedia for more discussion.

Image:English-language-wikipedia-.png

Given the sizeable artifacts in the data, it is almost impossible to test whether the current growth is approximately linear, quadratic or exponential by fitting a single curve to the entire data set. However, it is possible to treat the data as a series of shorter intervals, to ask whether the exponential-growth hypothesis holds for a majority of these intervals, and to ask whether valid explanations can be found for the deviations from it.

Graph of the log of page count

Another approach is to simply plot the logarithm of the page count. If a graph of the log of the page count is linear, then we do have exponential growth. The following graph shows this:
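
A minimal sketch of this check in Python, assuming the end-of-month figures from the "Actual Total" column of the table later on this page stand in for the page count. It fits a straight line to the logarithm of the counts and reports how close to linear they are. The helper name linear_fit is introduced here for illustration, and least squares is only one way to judge linearity.

```python
import math

# End-of-month English Wikipedia article counts ("Actual Total" column of the
# table later on this page), December 2003 to March 2006.
ACTUAL_TOTALS = [
    191000, 202000, 219000, 241000, 257000, 276000, 295000, 316000, 338000,
    360000, 383000, 410000, 438000, 463000, 486000, 514000, 546000, 578000,
    615000, 662000, 710000, 751000, 800000, 844000, 893000, 948000, 998000,
    1053000,
]


def linear_fit(xs, ys):
    """Ordinary least squares: return (intercept, slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


months = list(range(len(ACTUAL_TOTALS)))           # 0 = December 2003
log_counts = [math.log(c) for c in ACTUAL_TOTALS]  # natural log of the counts
intercept, slope, r_squared = linear_fit(months, log_counts)

print(f"slope = {slope:.4f} per month, R^2 = {r_squared:.4f}")
print(f"implied doubling time = {math.log(2) / slope:.1f} months")
```

An R^2 close to 1 is consistent with exponential growth, but it does not by itself rule out other curve shapes over a short span of data.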

Image:Log en page count.JPG

The linear nature of this plot does seem to indicate that the English-language page count is in fact growing exponentially.

Graph of total of all languages

This is the graph of the total number of articles in all languages:

Image:Wikipedia size all languages.png

June 2003: is growth approximately proportional to size?

The scatter-plot below attempts to examine the exponential-growth hypothesis by looking at the relationship of incremental short-term changes against absolute size. For each successive pair of data points recorded, the average rate of increase in the period was plotted against the average article count. The plot was then cropped to remove the moderate number of large positive and a few negative outliers that represented the submission of large numbers of auto-generated articles, re-scalings of the article count, and software glitches.
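
The construction of the scatter-plot points can be sketched as follows. This is an illustration under stated assumptions rather than the script actually used: the function names, the cropping thresholds and the sample figures are all hypothetical.

```python
from datetime import date


def rate_vs_size(samples):
    """For each successive pair of (date, article_count) samples, return
    (average_count, articles_per_day) over that interval."""
    points = []
    for (d0, n0), (d1, n1) in zip(samples, samples[1:]):
        days = (d1 - d0).days
        if days <= 0:
            continue  # skip duplicate or out-of-order samples
        points.append(((n0 + n1) / 2, (n1 - n0) / days))
    return points


def crop(points, min_rate, max_rate):
    """Drop points whose growth rate falls outside [min_rate, max_rate]."""
    return [(size, rate) for size, rate in points if min_rate <= rate <= max_rate]


# Hypothetical samples for illustration only; not real Wikipedia data.
samples = [
    (date(2003, 1, 1), 100000),
    (date(2003, 1, 11), 101000),  # 100 articles/day
    (date(2003, 1, 21), 131000),  # 3000 articles/day: a bot-like spike
    (date(2003, 1, 31), 130500),  # negative rate: e.g. a count re-scaling
    (date(2003, 2, 10), 131900),  # 140 articles/day
]

points = rate_vs_size(samples)
print(crop(points, min_rate=0, max_rate=500))
```

Cropping by a fixed rate window is a crude stand-in for the manual cropping described above; it removes both the large positive spikes and the negative points in one pass.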

Some remaining outliers are still visible: by cross-checking with the article count vs. date graph, you can see how they correlate with Rambot activity, so I have chosen to ignore them for the purposes of curve fitting.

In particular, the following features are present:

  • two very low points at around 35,000 articles represent editing during the major server slowdown in June/July 2002
  • between 40,000 and 90,000 articles, the data are dominated by Rambot's auto-generated articles: most of the sample points in these intervals are well off the top of the chart, with thousands of articles per day. Note that although this represents a large interval in the article count axis, it only represents a relatively small period in time.
  • the low outlier at around 120,000 articles is caused by the article counter being locked

Note that the data are really quite noisy. Further analysis is welcomed!

Image:New articles vs. article count Jun 2003 new fit.png

The red line is a visual fit for the trend, ignoring the outliers.

Speculative growth predictions, as of December 2003

Hypothesis: growth rate is a constant number of articles per day, submitted by "hard-core" Wikipedians, with an extra number that is proportional to the article count of Wikipedia. Thus, it should be possible to fit a straight line to the bulk of the "main-line" points in the scatter plot. Note that there are some huge outliers that are above the range of the current plot: these can be attributed to the Rambot data-dump of machine-generated gazetteer articles, and are discounted from this analysis.
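
A straight-line fit of this kind can be sketched with ordinary least squares. Note that the fit on this page was made by eye, so least squares is a swapped-in technique, and the function name fit_line is introduced here for illustration. As a sanity check, the sketch generates points that lie exactly on the by-eye line quoted later on this page and confirms that the fit recovers its two parameters.

```python
def fit_line(points):
    """Least-squares fit of rate = a + b * size; returns (a, b).

    a is the constant daily contribution, b the part proportional to size."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Points generated from the by-eye line dy/dt = 40 + (150/110000) * y.
# These are synthetic check values, not measured growth rates.
A_EYE, B_EYE = 40.0, 150.0 / 110000.0
synthetic = [(y, A_EYE + B_EYE * y) for y in range(100000, 200001, 10000)]

a, b = fit_line(synthetic)
print(f"a = {a:.2f} articles/day, b = {b:.6f} per day")
```

With real, noisy points the outliers would have to be removed first, since least squares is sensitive to them.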

The graph below shows that, apart from outliers, the model of growth of the English-language Wikipedia as being roughly proportional to size still holds as of December 2003.

The data below are based on Erik Zachte's dump analysis (see http://www.wikipedia.org/wikistats/EN/TablesWikipediaEN.htm ), and use the "official article count" criterion for the article count. Because of record-keeping differences, Erik's earlier data points may not exactly correspond to previous analyses, and growth rates are aggregated monthly in Erik's data. However, the overall import of the graph is very similar: there is an even clearer linear relationship between the size and growth rate, and as a result growth can be expected to be dominated by an exponential trend in the short- and medium-term future.

Image:Growth vs article count to dec 2003.png


Here's a by-eye fit of the data without outliers:

<math>\frac {dy} {dt} = 40 + \frac {150} {110000} y</math>


where y is the article count and t the time since January 10, 2001, measured in days. This is a first-order nonhomogeneous linear differential equation.
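
For reference, an equation of this form has a standard closed-form solution. Writing a = 40 and b = 150/110000 (symbols introduced here for brevity), and taking y = y_0 at time t_0:

<math>y(t) = \left( y_0 + \frac {a} {b} \right) e^{b (t - t_0)} - \frac {a} {b}</math>

Differentiating gives <math>\frac {dy} {dt} = b \left( y + \frac {a} {b} \right) = a + b y</math>, as required. Here a/b is about 29333 articles, and the exponential term doubles every <math>\ln 2 / b</math>, which is about 508 days.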

Note: no error bounds are provided: this is just a visual fit, and there is insufficient data for a better technique. However, this is likely to change over the next year, and there should be enough data by mid-2004 to resolve some of the questions posed. For now, it is interesting just to make a "straw man" prediction which can be tested in the future, rather than creating models retroactively.

Setting y = 183375 articles as of December 15, 2003, and integrating the formula above gives some very crude predictions for the en: Wikipedia article count at the end of each month, as follows:
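
The integration can be sketched in Python using the standard closed-form solution of dy/dt = a + b*y. This is an illustration, not necessarily how the published figures were produced: it assumes "end of month" means the last calendar day of the month, and the names predict, A, B, START_DATE and START_COUNT are introduced here.

```python
import math
from datetime import date

A = 40.0              # constant term, articles per day (from the by-eye fit)
B = 150.0 / 110000.0  # proportional term, per day (from the by-eye fit)
START_DATE = date(2003, 12, 15)
START_COUNT = 183375


def predict(on_date):
    """Closed-form solution of dy/dt = A + B*y with y(START_DATE) = START_COUNT."""
    days = (on_date - START_DATE).days
    offset = A / B
    return (START_COUNT + offset) * math.exp(B * days) - offset


for month_end in (date(2003, 12, 31), date(2004, 1, 31), date(2004, 2, 29)):
    print(month_end, round(predict(month_end), -3))
```

Rounded to the nearest thousand, the values for these three dates agree with the first three predicted totals in the table of predictions on this page.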

Image:Historical and predicted en article count dec 2003 model.png


Questions:

  • is this model even remotely valid? (Time will tell).
  • how long can exponential growth go on, or is this just really the early part of a logistic curve?
  • what does this imply for server and traffic scaling?
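
On the logistic question, the standard logistic growth equation is shown below for comparison. The symbols r (intrinsic growth rate) and K (eventual ceiling on the article count) are introduced here for illustration and are not estimated anywhere on this page:

<math>\frac {dy} {dt} = r y \left( 1 - \frac {y} {K} \right)</math>

While y is much smaller than K, the bracketed factor is close to 1 and the equation reduces to <math>\frac {dy} {dt} \approx r y</math>, so the early part of a logistic curve cannot be told apart from exponential growth using the growth data alone.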

Eventually there will probably come a point where the number of articles created each day begins to fall, for lack of things to write about. But the amount of information in each article will probably begin to increase a lot more. More to the point, limitations of the (current) Wikipedia interface will cause a bottleneck of sorts, limiting the type (and by default, the amount) of growth to vertical monolingual growth patterns, as opposed to lateral cross-lingual interlanguage ones.

November 2005 update: A glance at the updated article count graph at the top of the page will show that the graph lags some way behind the actual growth, which accelerated in early 2004 and has now resulted in a total article count of over 800000, 45% more than predicted.

The December 2003 model predictions vs. actual data

Note that the predicted figures are predictions only. All totals refer to the end of the given month, and "Diff" is the actual total minus the predicted total.

           Predicted     Actual         Diff       Actual Monthly
           Total         Total                     Increase    
Dec 2003:  188000        191000         3000       -
Jan 2004:  197000        202000         5000       11000
Feb 2004:  207000        219000        12000       17000
Mar 2004:  217000        241000        24000       22000
Apr 2004:  227000        257000        30000       16000
May 2004:  238000        276000        38000       19000
Jun 2004:  249000        295000        46000       19000
Jul 2004:  261000        316000        56000       21000
Aug 2004:  274000        338000        64000       22000
Sep 2004:  286000        360000        74000       22000
Oct 2004:  300000        383000        83000       23000
Nov 2004:  314000        410000        96000       27000
Dec 2004:  329000        438000       109000       28000
Jan 2005:  344000        463000       119000       25000
Feb 2005:  359000        486000       127000       23000
Mar 2005:  375000        514000       139000       28000
Apr 2005:  392000        546000       154000       32000
May 2005:  410000        578000       168000       32000
Jun 2005:  429000        615000       186000       37000
Jul 2005:  449000        662000       213000       47000
Aug 2005:  469000        710000       241000       48000
Sep 2005:  490000        751000       261000       41000
Oct 2005:  512000        800000       288000       49000
Nov 2005:  535000        844000       309000       44000
Dec 2005:  559000        893000       334000       49000 *1
Jan 2006:  585000        948000       363000       55000
Feb 2006:  609000        998000       389000       50000
Mar 2006:  636000       1053000       417000       55000
Apr 2006:  664000
May 2006:  694000
Jun 2006:  724000
Jul 2006:  757000
Aug 2006:  790000
Sep 2006:  825000
Oct 2006:  861000
Nov 2006:  899000
Dec 2006:  939000
Jan 2007:  980000
Feb 2007: 1020000

Note 1: From the beginning of Dec 05 only registered users can create new pages.

Relationship of Usenet cites to article growth

The relationship of Usenet cites of the word "Wikipedia" to the official article count for the en: Wikipedia appears to show a curve, rather than a linear relationship. (See Wikipedia:Awareness statistics for data). Or does it show a line broken into two parts, one before and one (horizontally shifted) after the Rambot-created articles? If so, this would suggest that the Rambot articles do not stimulate significant comment on Usenet, but that the linear relationship does in fact hold for all other articles. As ever, more data are needed.

Image:Usenet cites vs article count dec 2003.png

Projections using new data

Image:Wikigrowthjul05.jpg

This graph makes projections from data for the English Wikipedia as of July 2005. It covers Wikipedia growth and predictions from December 2003 to January 2012, and is by no means accurate for future dates: even the more aggressive growth predictions given here have fallen short of the most recent growth of Wikipedia. See Original Predictions for predictions and projections.



Modelling growth of Wikipedia page views per million

Using the Alexa page views per million data from Wikipedia:Awareness statistics (see [1] for a graph) in the period 1 January 2003 to 5 September 2005, filtering out all points less than 28 days away from the previous point (to avoid excessive weighting during time periods where points are densely sampled), and performing a linear least-squares fit of the logarithm of the data, gives the following formula:

log_e(page_views_per_million) = -49.8177569301 + 5.02511420201e-08 * unix_epoch_of_date

for n = 21 points fitted
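
The procedure can be sketched as follows. This is a minimal illustration, not the script actually used: the function names are hypothetical, the 28-day filter is interpreted here as "at least 28 days after the previous point that was kept", and the samples are synthetic values generated from a known slope rather than Alexa data.

```python
import math

DAY = 86400  # seconds per day


def thin(samples, min_gap_days=28):
    """Keep a sample only if it is at least min_gap_days after the last kept one.

    samples: list of (unix_time, value), sorted by time."""
    kept = []
    for t, v in samples:
        if not kept or t - kept[-1][0] >= min_gap_days * DAY:
            kept.append((t, v))
    return kept


def fit_log_linear(samples):
    """Least-squares fit of ln(value) = intercept + slope * unix_time."""
    n = len(samples)
    ts = [t for t, _ in samples]
    ys = [math.log(v) for _, v in samples]
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
             / sum((t - mean_t) ** 2 for t in ts))
    return mean_y - slope * mean_t, slope


# Synthetic samples (not Alexa data): exact exponential growth with a known
# slope, sampled every 7 days from 1 January 2003 (Unix time 1041379200).
TRUE_SLOPE = 5e-08
T0 = 1041379200
samples = [(T0 + i * 7 * DAY, math.exp(TRUE_SLOPE * i * 7 * DAY))
           for i in range(100)]

kept = thin(samples)
intercept, slope = fit_log_linear(kept)
print(len(kept), slope)
```

Thinning before fitting stops densely sampled periods from dominating the regression, at the cost of discarding some data.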

This implies a doubling period of (log_e(2) / 5.02511420201e-08) / 86400 days, or approximately 159.6 days, and an annual growth factor in page views per million of exp(5.02511420201e-08*365.25*86400) = 4.88.
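
Because the fitted model has the form log_e(v) = c + s * t, the value v doubles whenever s * t increases by log_e(2). A short sketch of the arithmetic, with variable names introduced here for illustration:

```python
import math

SLOPE = 5.02511420201e-08  # fitted slope, per second of Unix time
SECONDS_PER_DAY = 86400

# Time for the fitted exponential to double, converted from seconds to days.
doubling_days = math.log(2) / SLOPE / SECONDS_PER_DAY

# Growth factor over one year of 365.25 days.
annual_factor = math.exp(SLOPE * 365.25 * SECONDS_PER_DAY)

print(f"doubling time: {doubling_days:.2f} days")
print(f"annual growth factor: {annual_factor:.2f}")
```

Note that the slope itself, not its logarithm, is the divisor in the doubling-time formula.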

Playing around with different time periods and filter times gives a range of results, from which we can reasonably say that the doubling time of Wikipedia's estimated page views per million is somewhere in the range 130 - 160 days. The recent (2005) doubling time of 156 days or so is in line with the longest-term doubling time of about 155 - 159 days, with the 2004 period being the exception to the long-term and short-term trends.

Modelling improvement in Wikipedia's Alexa traffic rank

Applying a similar linear regression fit to the log of Wikipedia's Alexa traffic rank from October 2002 to September 2005 gives a similar result, with a halving period (lower is better for rank) of roughly 134 - 138 days over the long term, and a more recent (2005 data only) halving time of 114 days! The page rank as of September 2005 is roughly 40. Taking the most cautious of the three figures (138 days), rounding it to 4.5 months, and extrapolating to the logical extreme suggests that Wikipedia will reach:

  • page rank 20 in 4.5 months
  • page rank 10 in 9 months
  • page rank 5 in 13.5 months
  • be fighting its way into the top 3 in 18 months, and
  • be fighting its way to the #1 spot in 22.5 months...
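
The projection above is simply repeated halving, which can be sketched as follows (the names are introduced here for illustration, and the starting rank and halving period are the rounded values from the text):

```python
START_RANK = 40       # approximate rank in September 2005 (from the text)
HALVING_MONTHS = 4.5  # rounded halving period (from the text)


def projected_rank(months_ahead):
    """Rank after months_ahead months if it keeps halving every HALVING_MONTHS."""
    return START_RANK * 0.5 ** (months_ahead / HALVING_MONTHS)


for months in (4.5, 9, 13.5, 18, 22.5):
    print(f"{months:>4} months: rank {projected_rank(months):.2f}")
```

The last two values are 2.5 and 1.25, which is why the final two items are phrased as fighting into the top 3 and towards the #1 spot rather than as exact ranks.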

So, clearly:

  • either this exponential growth has got to stop or slow down, or
  • it's going to be a wild ride...

November 2005 update: Well, it's November, and Wikipedia has so far moved up only to 38th place, so it isn't quite keeping up with these predictions. However, the daily page rank is hovering around 34 and reached 31 in October, so it's doing OK...

January 2006 update (Wikipedia's 5th anniversary): The daily page rank has been hovering around 20 for about a week, in line with the original predictions above.
