PLoS One. 2017 Jan 26;12(1):e0170527.
doi: 10.1371/journal.pone.0170527. eCollection 2017.

Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks

Camilo Akimushkin et al. PLoS One. 2017.

Abstract

Automatic identification of authorship in disputed documents has benefited from complex network theory, as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and to understand network growth mechanisms, but only a few studies have probed the suitability of networks for modeling small chunks of text to capture stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remainder were integrated of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e., an author-matching success rate of 88.75% was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, paving the way for a robust description of large texts in terms of small evolving networks.
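The feature-extraction steps described in the abstract (equal-length sections, one metric value per section, distribution moments as attributes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `chunks`, `average_degree`, and `moments` are hypothetical, average degree stands in for the 12 topological metrics, and the choice of mean, variance, skewness, and kurtosis as the four moments is an assumption.

```python
import math


def chunks(tokens, size):
    """Split tokens into consecutive sections of exactly `size` tokens.

    Assumption: a trailing remainder shorter than `size` is discarded.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def average_degree(section):
    """Average degree of the undirected network linking adjacent tokens.

    Assumption: one simple metric used as a stand-in for the paper's 12.
    """
    nodes = set(section)
    edges = {frozenset(pair) for pair in zip(section, section[1:])
             if pair[0] != pair[1]}
    return 2 * len(edges) / len(nodes)


def moments(series):
    """Return (mean, variance, skewness, kurtosis) of a series.

    Assumption: population (biased) estimators, non-excess kurtosis.
    """
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0:
        return mean, 0.0, 0.0, 0.0
    std = math.sqrt(var)
    skew = sum(((x - mean) / std) ** 3 for x in series) / n
    kurt = sum(((x - mean) / std) ** 4 for x in series) / n
    return mean, var, skew, kurt


def features(tokens, size):
    """Time series of one metric over the sections, reduced to its moments."""
    series = [average_degree(section) for section in chunks(tokens, size)]
    return moments(series)
```

Calling `features(tokens, size)` on a preprocessed token list yields one four-value attribute vector per metric; in the study, such attributes from all metrics feed the classifier.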

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Example of co-occurrence network.
The network was obtained for the text “It was the best of times; it was the worst of times; it was the age of wisdom; it was the age of foolishness”, an extract from the book “A Tale of Two Cities” by Charles Dickens. Note that, after the removal of stopwords (such as “it” and “was”) and lemmatization (“times” is mapped to “time”), the remaining words are linked if they are adjacent.
Fig 2. Methodology used to characterize a document as a set of time series.
In the first step, the document is split into shorter pieces of equal length. For each subtext, a network is formed. Then, the sequence of networks yields a sequence of complex network measurements. The features extracted from the time series are finally used to identify authorship.
Fig 3. Time series for Moby Dick by Herman Melville.
The horizontal axis denotes the index of realizations, and the vertical axis shows the base-10 logarithm of the metrics identified in the inset.
Fig 4. Autocorrelation and histograms for Moby Dick.
(a) Autocorrelation for the series of the clustering coefficient of Moby Dick by Herman Melville. Dashed lines mark the 5% threshold, which is surpassed only by chance. (b) Histograms for the time series of degree K (connectivity) from all books in the collection, grouped by author. The distributions have characteristic moments for each author.
Fig 5. Success scores and combinations of attributes using the variance threshold and score-based feature selection.
In (a) and (b), the maximum values with the minimum number of attributes are marked with circles. In (c) and (d), for each network metric (represented by a label on the horizontal axis), the first four moments are presented in increasing order from left to right. A black cell indicates that the attribute is present in the combination. For the variance threshold feature selection, there is a unique combination for every threshold denoted by the vertical axis in (c). For instance, the thresholds for the four maximum scores in (a) are marked in (c) by the four dashed horizontal lines. For the score-based feature selection, there can be multiple combinations of attributes with the same number of attributes and the same score. Only the combinations with maximum scores, marked with circles in (b), are presented in (d); for the KNN algorithm there were two combinations with maximum score.
Fig 6. Validation and visualization of complex network measurements.
(a) Validation of the classification without dimensionality reduction (red), and with feature extraction using PCA (green) and Isomap (blue). (b) Reduction to two-dimensional attribute space using Isomap. Each point represents a book and each color represents an author.
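
The construction described for Fig 1 can be illustrated with a short sketch. The stopword set and lemma map below are toy assumptions made only for this sentence (the text names “it” and “was” as stopwords and the mapping of “times” to “time”; treating “the” and “of” as stopwords is an assumption). The function name `cooccurrence_edges` is hypothetical, edges are treated as undirected, and adjacency is allowed to cross punctuation, both for simplicity.

```python
import re

# Toy resources, assumed for illustration only.
STOPWORDS = {"it", "was", "the", "of"}
LEMMAS = {"times": "time"}


def cooccurrence_edges(text):
    """Return the set of undirected edges between adjacent content words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stopwords first, then map each remaining word to its lemma.
    words = [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]
    edges = set()
    for a, b in zip(words, words[1:]):
        if a != b:  # skip self-loops
            edges.add(frozenset((a, b)))
    return edges


text = ("It was the best of times; it was the worst of times; "
        "it was the age of wisdom; it was the age of foolishness")
edges = cooccurrence_edges(text)
nodes = set().union(*edges)
```

Under these assumptions the sentence yields six nodes (best, time, worst, age, wisdom, foolishness) and five distinct edges, since repeated adjacent pairs such as time/worst and age/wisdom collapse into a single link.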
