Document Type

Honors Project


Modern society has become increasingly balkanized, that is, ideologically polarized and socially fragmented. Political parties engage each other on divisive issues with polarizing rhetoric. Internet users create small, opinionated communities like dKosopedia and Conservapedia, Wikipedia-like websites written from left-leaning and right-leaning positions respectively. Even Wikipedia may not be immune to balkanization. Because Wikipedia is a free encyclopedia that its users are meant to write from an unbiased point of view, it is in the interest of the general public to keep it as free of balkanization and polarization as possible. If Wikipedia authors are free to express their conflicting points of view, the quality of information may degrade and the community could become divided. This might lead to less time being invested in improving articles and more time being spent resolving conflict. Moreover, balkanization could lead users to develop discipline-specific editing norms, making the site as a whole less cohesive.

Wikipedia offers a chance to investigate balkanization on a large scale, but that same scale makes it difficult to study. Wikipedia represents terabytes of text, and both the prohibitive size of the dataset and the lack of available software libraries for parsing it make the data hard to work with.

This thesis presents a software library capable of handling Wikipedia's large dataset by using parallel processing techniques. The library parses Wikipedia revision histories into a graph model so that balkanization on Wikipedia can be investigated using established graph theory metrics. I contribute both the Java software library for parsing revision histories and implementations and analyses of three possible balkanization metrics: density, degree centrality, and conditional probability.
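As a rough illustration of two of the metrics named above, the following Java sketch computes the textbook definitions of density and degree centrality on a small undirected graph. It is not taken from the thesis library: the class name `GraphMetricsSketch`, its methods, and the node labels are all hypothetical, and the assumption that the graph is undirected and simple (no self-loops or parallel edges) may not match the graph model the thesis actually builds from revision histories.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the API of the library described in this thesis.
public class GraphMetricsSketch {
    // Adjacency sets for an undirected simple graph keyed by node label.
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    // Registers a node, which may remain isolated.
    public void addNode(String node) {
        adjacency.computeIfAbsent(node, k -> new HashSet<>());
    }

    // Adds an undirected edge; self-loops are ignored (an assumption).
    public void addEdge(String a, String b) {
        if (a.equals(b)) {
            return;
        }
        addNode(a);
        addNode(b);
        adjacency.get(a).add(b);
        adjacency.get(b).add(a);
    }

    public int nodeCount() {
        return adjacency.size();
    }

    // Each undirected edge appears in two adjacency sets, so halve the degree sum.
    public int edgeCount() {
        int degreeSum = 0;
        for (Set<String> neighbors : adjacency.values()) {
            degreeSum += neighbors.size();
        }
        return degreeSum / 2;
    }

    // Density: actual edges divided by possible edges, 2E / (N * (N - 1)).
    public double density() {
        int n = nodeCount();
        if (n < 2) {
            return 0.0;
        }
        return 2.0 * edgeCount() / ((double) n * (n - 1));
    }

    // Normalized degree centrality: deg(v) / (N - 1).
    public double degreeCentrality(String node) {
        int n = nodeCount();
        Set<String> neighbors = adjacency.get(node);
        if (neighbors == null || n < 2) {
            return 0.0;
        }
        return (double) neighbors.size() / (n - 1);
    }

    public static void main(String[] args) {
        GraphMetricsSketch graph = new GraphMetricsSketch();
        graph.addEdge("editorA", "editorB"); // hypothetical labels
        graph.addEdge("editorB", "editorC");
        graph.addNode("editorD");            // isolated node
        System.out.println("density = " + graph.density());
        System.out.println("centrality(editorB) = " + graph.degreeCentrality("editorB"));
    }
}
```

In this toy graph, two of the six possible edges are present, so the density is one third, and editorB is connected to two of the three other nodes. Under these definitions, a low overall density combined with a few highly central nodes would be one possible sign of fragmentation, though how the thesis interprets the metrics is not stated here.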



© Copyright is owned by author of this document