PLoS One. 2018 Mar 22;13(3):e0193703.

doi: 10.1371/journal.pone.0193703. eCollection 2018.

Authorship attribution based on Life-Like Network Automata

Jeaneth Machicao¹, Edilson A Corrêa Jr², Gisele H B Miranda², Diego R Amancio², Odemir M Bruno¹

Affiliations

¹ Scientific Computing Group, São Carlos Institute of Physics, University of São Paulo, PO Box 369, 13560-970, São Carlos, São Paulo, Brazil.
² Institute of Mathematics and Computer Science, University of São Paulo, Avenida Trabalhador são-carlense, 400, 13566-590, São Carlos, São Paulo, Brazil.

PMID: 29566100
PMCID: PMC5863954
DOI: 10.1371/journal.pone.0193703

Jeaneth Machicao et al. PLoS One. 2018.

Abstract

Authorship attribution is a problem of considerable practical and technical interest. Several methods have been designed to infer the authorship of disputed documents in multiple contexts. Although traditional statistical methods based solely on word counts and related measurements provide a simple yet effective solution in particular cases, they are prone to manipulation. Recently, texts have been successfully modeled as networks, where words are represented by nodes linked according to textual similarity measurements. Such models are useful for identifying informative topological patterns for the authorship recognition task. However, there is no consensus on which measurements should be used. We therefore propose a novel method to characterize text networks that considers both topological and dynamical aspects of networks. Using concepts and methods from cellular automata theory, we devised a strategy to capture informative spatio-temporal patterns from this model. In our experiments, the method outperformed structural analysis relying only on topological measurements, such as clustering coefficient, betweenness and shortest paths. The optimized results obtained here pave the way for a better characterization of textual networks.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Fig 1. Authorship attribution framework based on LLNA method.**
The following steps are applied: (1) a written text is pre-processed; (2) a network is generated from the keywords extracted during pre-processing; (3) a selected LLNA rule evolves over the textual network topology; (4) spatio-temporal features are extracted from the LLNA and then used for the authorship attribution task.

**Fig 2. Exemplification of the network modeling.**
The short text *“Complex networks model several properties of texts. A complex text displays a complex organization”* was used here. In this example, all words were lemmatized before constructing the network.
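
The construction in Fig 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes that words adjacent within a sentence are linked (a word-adjacency network), and both the tiny `LEMMAS` table and the `build_network` helper are hypothetical stand-ins for a real lemmatizer and pipeline.

```python
import re

# Hypothetical hand-written lemma table covering only the example text.
LEMMAS = {
    "networks": "network",
    "properties": "property",
    "texts": "text",
    "displays": "display",
}


def build_network(text):
    """Return an undirected adjacency map {lemma: set of neighboring lemmas}.

    Assumption: two lemmas are linked when they appear next to each other
    in the same sentence.
    """
    adjacency = {}
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        lemmas = [LEMMAS.get(word, word) for word in words]
        for lemma in lemmas:
            adjacency.setdefault(lemma, set())
        # Link each pair of consecutive lemmas, skipping self-loops.
        for left, right in zip(lemmas, lemmas[1:]):
            if left != right:
                adjacency[left].add(right)
                adjacency[right].add(left)
    return adjacency


text = ("Complex networks model several properties of texts. "
        "A complex text displays a complex organization.")
network = build_network(text)
for word in sorted(network):
    print(word, sorted(network[word]))
```

Under these assumptions the repeated word "complex" becomes a single node that connects the two sentences, which is the kind of structure the later steps operate on.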

**Fig 3. Spatio-temporal diagrams.**
The diagrams were obtained with the LLNA rule B2478-S25 from books written by eight authors, using the *partial-dataset*. The LLNA dynamics were run until t = 500, and the initial states s₀ were drawn from a uniform random distribution. Each spatio-temporal diagram shows the nodes’ states: dead in black and alive in white. The horizontal axis represents the nodes (sorted by increasing degree k, for illustration purposes only), and the vertical axis represents time.
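
A rough sketch of how a rule such as B2478-S25 might be evolved over a network is given below. It rests on an assumption not stated on this page: the fraction of alive neighbors of a node is mapped to one of nine bins (0 to 8), a dead node is born when its bin is in the B set, and an alive node survives when its bin is in the S set. The helper names (`parse_rule`, `density_bin`, `step`, `evolve`) and the six-node ring are hypothetical.

```python
import random


def parse_rule(rule):
    """Split a rule string such as 'B2478-S25' into birth and survival sets."""
    birth, survival = rule.split("-")
    return {int(c) for c in birth[1:]}, {int(c) for c in survival[1:]}


def density_bin(alive_neighbors, degree, bins=9):
    """Map the fraction of alive neighbors to a bin index in 0..bins-1.

    Assumed binning: bin x covers x/bins <= density < (x+1)/bins, with a
    density of exactly 1 placed in the last bin. Isolated nodes get bin 0.
    """
    if degree == 0:
        return 0
    return min(alive_neighbors * bins // degree, bins - 1)


def step(adjacency, states, birth, survival):
    """Apply one synchronous update to every node."""
    new_states = {}
    for node, neighbors in adjacency.items():
        alive = sum(states[n] for n in neighbors)
        index = density_bin(alive, len(neighbors))
        allowed = survival if states[node] == 1 else birth
        new_states[node] = 1 if index in allowed else 0
    return new_states


def evolve(adjacency, rule, steps, seed=0):
    """Return the list of state maps from t = 0 to t = steps."""
    birth, survival = parse_rule(rule)
    rng = random.Random(seed)
    # Random uniform initial condition: each node is alive (1) or dead (0).
    states = {node: rng.randint(0, 1) for node in adjacency}
    history = [states]
    for _ in range(steps):
        states = step(adjacency, states, birth, survival)
        history.append(states)
    return history


# Hypothetical toy network: a ring of six nodes.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
history = evolve(ring, "B2478-S25", steps=500)
for states in history[:5]:
    print("".join(str(states[i]) for i in range(6)))
```

Stacking the rows of `history` gives a matrix with one column per node and one row per time step, which corresponds to the layout of the diagrams described in the caption.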

**Fig 4. Histogram of the distribution of accuracy.**
This figure shows the histogram of the distribution of accuracy for all 262,144 evaluated LLNA rules in the *rule-selection-dataset*, which comprises 12 authors. From left to right, the histograms correspond to the 3 datasets *none*, *partial* and *full*, respectively. As an example, the five highlighted rules maximize the classification accuracy on the *rule-selection-dataset* when partial lemmatization is applied. For this rule-selection experiment, feature vectors based on both Shannon entropy and Lempel-Ziv complexity were considered, together with a kNN classifier.
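
The two descriptors named in this caption can be illustrated on a single binary state sequence. The sketch below is hedged in two ways: the rule count is reproduced under the assumption that the birth and survival sets are each arbitrary subsets of {0, ..., 8}, and the complexity measure is a simple LZ78-style phrase count, which may differ from the Lempel-Ziv variant used in the paper. The function names are illustrative only.

```python
import math
from collections import Counter

# Assumption: B and S are each subsets of {0, ..., 8}, giving 2^9 * 2^9 rules.
N_RULES = 2 ** 9 * 2 ** 9


def shannon_entropy(series):
    """Shannon entropy (in bits) of the symbol frequencies in a sequence."""
    counts = Counter(series)
    total = len(series)
    return 0.0 - sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )


def lz78_phrase_count(series):
    """Number of phrases in an LZ78-style incremental parsing of a sequence.

    Each phrase is the shortest prefix of the remaining input that has not
    been seen as a phrase before; a trailing incomplete phrase counts as one.
    """
    phrases = set()
    current = ()
    for symbol in series:
        current += (symbol,)
        if current not in phrases:
            phrases.add(current)
            current = ()
    return len(phrases) + (1 if current else 0)


# Hypothetical state sequence of one node over time (1 = alive, 0 = dead).
node_states = [0, 1, 0, 1, 1, 0, 0, 1]
print(N_RULES)
print(shannon_entropy(node_states))
print(lz78_phrase_count(node_states))
```

Computing such values for every node and summarizing their distribution is one plausible way to turn a spatio-temporal diagram into a fixed-length feature vector for a classifier.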

**Fig 5. Authorship recognition performance using LLNA.**
a) Canonical analysis performed for the authorship recognition task using the five books from each author of the *classification-dataset* with partial lemmatization. Rule B2478-S25 and the Lempel-Ziv distribution $\vec{\mu}_{L}$ were used as the feature vector for this plot. b) Confusion matrix obtained with the kNN method for the best classification rate. Each cell shows the number of correctly predicted instances; only nonzero elements are indicated. c) Comparison of the accuracy obtained by the proposed method when authorship verification is treated as a one-class classification problem. The accuracy was calculated as the average and standard deviation for the classification of five books of an author A against five books from unknown authors X, using three different classifiers.

**Fig 6. Comparison performance regarding network structural measurements and robustness performance.**
a) Comparison of the accuracy obtained by the proposed method (left side) and the classical network measurements (right side). The histograms on the left (mean and standard deviation) represent the best accuracies obtained when using rules B124-S257, B2478-S25 and B3567-S03468 for the *none-*, *partial-* and *full-dataset*, respectively. Similarly, the histograms on the right show the best accuracies obtained using a combination of the following network measurements as a feature vector: mean degree ($\langle k \rangle$), average hierarchical degree of level 1 ($\langle H_{k_{1}} \rangle$), average hierarchical degree of level 2 ($\langle H_{k_{2}} \rangle$), average clustering coefficient ($\langle C \rangle$), average path length ($l$) and degree assortativity ($\Gamma$). b) Average accuracy obtained for variations of the original dataset. Each variation considers a different number of authors, ranging from 2 to 8. c) Performance evaluation for different text sizes, ranging from 2000 to 22000 words, using rule B2478-S25. The kNN method was used in all these experiments.
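
For comparison, two of the classical measurements listed in panel a) can be computed directly from an adjacency map. This is only an illustrative sketch using the standard definitions of mean degree and local clustering coefficient; the hierarchical degrees, average path length and assortativity are omitted, and the toy graph and function names are hypothetical.

```python
from itertools import combinations


def mean_degree(adjacency):
    """Average number of neighbors per node."""
    return sum(len(neighbors) for neighbors in adjacency.values()) / len(adjacency)


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(mean_degree(graph))
print(average_clustering(graph))
```

Such measurements summarize only the static topology, whereas the LLNA features also depend on how a rule evolves over that topology.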

**Fig 7. Average network measurements.**
Network structural measurements extracted for the eight authors highlighted in the diagrams and for the three datasets: *none-*, *partial-* and *full-dataset* (see description in the Material and methods section). The following distributions are shown for each author: number of nodes (N), number of edges (E), average connectivity (〈k〉), average clustering coefficient (〈C〉), average path length (〈L〉), diameter (D), density (d), power-law exponent (γ) and degree assortativity (Γ).

Cited by

Classification of Literary Works: Fractality and Complexity of the Narrative, Essay, and Research Article.
Ramirez-Arellano A. Entropy (Basel). 2020 Aug 17;22(8):904. doi: 10.3390/e22080904. PMID: 33286673 Free PMC article.
Using citation networks to evaluate the impact of text length on keyword extraction.
Tohalino JAV, Silva TC, Amancio DR. PLoS One. 2023 Nov 27;18(11):e0294500. doi: 10.1371/journal.pone.0294500. eCollection 2023. PMID: 38011182 Free PMC article.
Identifying the perceived local properties of networks reconstructed from biased random walks.
Guerreiro L, Silva FN, Amancio DR. PLoS One. 2024 Jan 19;19(1):e0296088. doi: 10.1371/journal.pone.0296088. eCollection 2024. PMID: 38241390 Free PMC article.
Comparing random walks in graph embedding and link prediction.
Vital A Jr, Silva FN, Amancio DR. PLoS One. 2024 Nov 6;19(11):e0312863. doi: 10.1371/journal.pone.0312863. eCollection 2024. PMID: 39504339 Free PMC article.
A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts.
Xia T, Chen X, Wang J, Qiu F. Sensors (Basel). 2023 Nov 4;23(21):8975. doi: 10.3390/s23218975. PMID: 37960672 Free PMC article.
