PLoS One. 2018 Mar 22;13(3):e0193703.
doi: 10.1371/journal.pone.0193703. eCollection 2018.

Authorship attribution based on Life-Like Network Automata

Jeaneth Machicao et al. PLoS One. 2018.

Abstract

Authorship attribution is a problem of considerable practical and technical interest. Several methods have been designed to infer the authorship of disputed documents in multiple contexts. Traditional statistical methods based solely on word counts and related measurements provide a simple yet effective solution in particular cases, but they are prone to manipulation. Recently, texts have been successfully modeled as networks, where words are represented by nodes linked according to textual similarity measurements. Such models are useful for identifying informative topological patterns for the authorship recognition task. However, there is no consensus on which measurements should be used. We therefore propose a novel method to characterize text networks that considers both topological and dynamical aspects of networks. Using concepts and methods from cellular automata theory, we devised a strategy to capture informative spatio-temporal patterns from this model. In our experiments, the method outperformed structural analysis relying only on topological measurements, such as clustering coefficient, betweenness and shortest paths. The optimized results obtained here pave the way for a better characterization of textual networks.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. Authorship attribution framework based on the LLNA method.
The following steps are applied: (1) a written text is pre-processed; (2) a network is generated from the keywords extracted during pre-processing; (3) a selected LLNA rule evolves over the textual network topology; (4) spatio-temporal features are extracted from the LLNA and then used for the authorship attribution task.
Fig 2. Example of the network modeling.
The short text “Complex networks model several properties of texts. A complex text displays a complex organization” was used. In this example, all words were lemmatized to construct the network.
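As a rough illustration of the modeling step in Fig 2, the sketch below builds a small word network from the example text. It assumes that consecutive words within a sentence are linked (a word-adjacency model, which the caption does not specify), keeps every word including function words, and replaces a real lemmatizer with a tiny hand-written table; `LEMMAS`, `TEXT` and `build_network` are hypothetical names introduced only for this example.

```python
import re
from collections import defaultdict

TEXT = ("Complex networks model several properties of texts. "
        "A complex text displays a complex organization")

# Hypothetical lemma table: covers only the inflected words of the example text.
LEMMAS = {
    "networks": "network",
    "properties": "property",
    "texts": "text",
    "displays": "display",
}

def build_network(text):
    """Return an undirected adjacency map {lemma: set of neighbouring lemmas}.

    Assumption: two lemmas are linked when they appear next to each other
    in the same sentence.
    """
    graph = defaultdict(set)
    for sentence in re.split(r"[.!?]", text.lower()):
        words = [LEMMAS.get(w, w) for w in re.findall(r"[a-z]+", sentence)]
        for left, right in zip(words, words[1:]):
            if left != right:          # ignore self-loops
                graph[left].add(right)
                graph[right].add(left)
    return dict(graph)

network = build_network(TEXT)
edge_count = sum(len(neighbours) for neighbours in network.values()) // 2
```

Because lemmatization merges "texts" and "text", the node `text` collects neighbours from both sentences, which is the effect the caption refers to.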
Fig 3. Spatio-temporal diagrams.
The diagrams were obtained with the LLNA rule B2478-S25 from books written by eight authors, using the partial-dataset. The LLNA dynamics were run until t = 500, and the initial states s0 were drawn from a uniform random distribution. Each spatio-temporal diagram shows the nodes’ states: dead, in black; and alive, in white. The horizontal axis represents the nodes (sorted by increasing degree k, for illustration purposes only), and the vertical axis represents time.
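The dynamics behind Fig 3 can be sketched as follows. This is a minimal sketch under a stated assumption, not the authors' implementation: a rule such as B2478-S25 is read as a density-based life-like rule, where the fraction of alive neighbours of a node is mapped to one of nine equal intervals (0 to 8), a dead node is born if its interval index appears after `B`, and an alive node survives if its index appears after `S`. The caption does not state this mapping. The function names are hypothetical.

```python
import random

def parse_rule(rule):
    """Split a rule string such as 'B2478-S25' into ({2, 4, 7, 8}, {2, 5})."""
    born, survive = rule.split("-")
    return {int(d) for d in born[1:]}, {int(d) for d in survive[1:]}

def llna_step(adj, state, born, survive):
    """Apply one synchronous update to every node of the network."""
    new_state = {}
    for node, neighbours in adj.items():
        alive = sum(state[n] for n in neighbours)
        degree = len(neighbours)
        # Assumed mapping: density alive/degree falls in interval
        # [i/9, (i+1)/9); density 1 is placed in the last interval (8).
        bucket = min(9 * alive // degree, 8) if degree else 0
        allowed = survive if state[node] else born
        new_state[node] = 1 if bucket in allowed else 0
    return new_state

def evolve(adj, rule, steps, seed=0):
    """Return the spatio-temporal diagram as a list of rows (one per time step).

    Initial states are drawn uniformly at random; columns are ordered by
    increasing node degree, as in the figure.
    """
    born, survive = parse_rule(rule)
    rng = random.Random(seed)
    state = {node: rng.randint(0, 1) for node in adj}
    order = sorted(adj, key=lambda n: len(adj[n]))
    diagram = [[state[n] for n in order]]
    for _ in range(steps):
        state = llna_step(adj, state, born, survive)
        diagram.append([state[n] for n in order])
    return diagram
```

Calling `evolve(adj, "B2478-S25", 500)` on an adjacency map would give 501 rows, one for the initial state and one per time step.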
Fig 4. Histogram of the distribution of accuracy.
Histograms of the accuracy distribution for all 262,144 evaluated LLNA rules on the rule-selection-dataset, which comprises 12 authors. From left to right, the histograms correspond to the three datasets: none, partial and full. As an example, the five highlighted rules are those that maximize the classification accuracy on the rule-selection-dataset when partial lemmatization is applied. For this rule selection experiment, both Shannon entropy and Lempel-Ziv complexity were used as feature vectors, together with a kNN classifier.
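The two measurements named in the caption can be illustrated on a single binary time series, such as the sequence of states one node takes over time. The sketch below computes the Shannon entropy of the state frequencies and a Lempel-Ziv complexity defined as the number of phrases in an LZ76-style parsing. How the per-node values are combined into the feature vectors used in the paper is not described here, so that step is left out; the function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(series):
    """Entropy (in bits) of the empirical distribution of states in `series`."""
    counts = Counter(series)
    total = len(series)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def lempel_ziv_complexity(series):
    """Number of phrases in an LZ76-style parsing of a binary sequence.

    Each phrase is the shortest block starting at the current position that
    has not occurred earlier in the sequence (overlap with the current
    position is allowed). A trailing incomplete phrase is counted.
    """
    s = "".join(str(bit) for bit in series)
    n = len(s)
    position, phrases = 0, 0
    while position < n:
        length = 1
        # Extend the candidate phrase while it already occurs in the prefix.
        while (position + length <= n
               and s[position:position + length] in s[:position + length - 1]):
            length += 1
        phrases += 1
        position += length
    return phrases
```

A constant series has entropy 0 and very low complexity, while an irregular series scores higher on both, which is what makes these quantities usable as descriptors of the LLNA patterns.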
Fig 5. Authorship recognition performance using LLNA.
a) Canonical analysis for the authorship recognition task using the five books from the authors of the classification-dataset, with partial lemmatization. This plot uses rule B2478-S25 and the Lempel-Ziv distribution μL as the feature vector. b) Confusion matrix for the kNN method at the best classification rate. Each cell shows the number of correctly predicted instances; nonzero elements are indicated. c) Comparison of the accuracy obtained by the proposed method when authorship verification is treated as a one-class classification problem. The accuracy is reported as the average and standard deviation for the classification of five books by an author A against five books by unknown authors X, using three different classifiers.
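For readers unfamiliar with the classifier in panel b), here is a minimal k-nearest-neighbours sketch with a confusion count. The Euclidean distance, the value of k, and the toy two-dimensional feature vectors are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """Predict a label by majority vote among the k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def confusion_counts(train, test, k=1):
    """Count (true_label, predicted_label) pairs over a labelled test set."""
    return Counter((label, knn_predict(train, features, k))
                   for features, label in test)

# Toy feature vectors for two hypothetical authors "A" and "B".
train = [((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
         ((0.9, 0.8), "B"), ((0.8, 0.9), "B")]
test = [((0.15, 0.15), "A"), ((0.85, 0.85), "B")]
matrix = confusion_counts(train, test, k=3)
```

Entries of `matrix` with equal true and predicted labels correspond to the diagonal of a confusion matrix.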
Fig 6. Performance comparison against network structural measurements, and robustness performance.
a) Comparison of the accuracy obtained by the proposed method (left side) and by classical network measurements (right side). The histograms on the left (mean and standard deviation) show the best accuracies obtained with rules B124-S257, B2478-S25 and B3567-S03468 for the none-, partial- and full-dataset, respectively. Similarly, the histograms on the right show the best accuracies obtained when a combination of the following network measurements is used as the feature vector: mean degree (〈k〉), average hierarchical degree of level 1 (Hk1), average hierarchical degree of level 2 (Hk2), average clustering coefficient (〈C〉), average path length (l) and degree assortativity (Γ). b) Average accuracy obtained on variations of the original dataset. Each variation considers a different number of authors, ranging from 2 to 8. c) Performance evaluation for different text sizes, ranging from 2000 to 22000 words, using rule B2478-S25. The kNN method was used in all of these experiments.
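Three of the classical measurements listed in panel a) can be computed directly from an adjacency map, as in the sketch below: mean degree 〈k〉, average clustering coefficient 〈C〉 and average path length l. It assumes an undirected, connected network with at least two nodes. Hierarchical degree and degree assortativity are not covered, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def mean_degree(adj):
    """Average number of neighbours per node."""
    return sum(len(neighbours) for neighbours in adj.values()) / len(adj)

def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 contribute 0."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count links that exist between pairs of neighbours.
        links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Toy network: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

Measurements of this kind describe the network structure only, whereas the LLNA features in the left-hand histograms also depend on the dynamics run on top of that structure.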
Fig 7. Average network measurements.
Network structural measurements extracted for the eight authors highlighted in the diagrams, for the three datasets: none-, partial- and full-dataset (see the description in the Material and methods section). The following distributions are shown for each author: number of nodes (N), number of edges (E), average connectivity (〈k〉), average clustering coefficient (〈C〉), average path length (〈L〉), diameter (D), density (d), power-law exponent (γ) and degree assortativity (Γ).
