PLoS One. 2016 Sep 9;11(9):e0162360.
doi: 10.1371/journal.pone.0162360. eCollection 2016.

Abundant Topological Outliers in Social Media Data and Their Effect on Spatial Analysis

Rene Westerholt et al. PLoS One.

Erratum in

Abstract

Twitter and related social media feeds have become valuable data sources for many fields of research. Because many social media posts contain explicit geographic locations, numerous researchers have used them for spatial analysis. However, despite their widespread use within applied research, a thorough understanding of the underlying spatial characteristics of these data is still lacking. In this paper, we investigate how topological outliers influence the outcomes of spatial analyses of social media data. These outliers appear when different users contribute heterogeneous information about different phenomena simultaneously from similar locations. As a consequence, various messages representing different spatial phenomena are captured close to each other and are at risk of being falsely related in a spatial analysis. Our results reveal indications of corresponding spurious effects when analyzing Twitter data. Further, we show how the outliers distort the range of outcomes of spatial analysis methods. This has significant influence on the power of spatial inferential techniques and, more generally, on the validity and interpretability of spatial analysis results. We further investigate in detail how the issues caused by topological outliers are composed. We find that multiple disturbing effects act simultaneously and that these are related to the geographic scales of the involved overlapping patterns. Our results show that at some scale configurations, the disturbances added through overlap are more severe than at others. Further, their behavior turns into a volatile and almost chaotic fluctuation when the scales of the involved patterns become too different. Overall, our results highlight the critical importance of thoroughly considering the specific characteristics of social media data when analyzing them spatially.
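
The effect described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation or code: it assumes two hypothetical point patterns that occupy the same area but carry opposite attribute gradients, and it computes a global Moran's I with row-standardized distance-band weights. The function name `morans_i`, the point count, the radius of 0.15, and the gradients are all illustrative assumptions.

```python
import numpy as np


def morans_i(coords, values, radius):
    """Global Moran's I using row-standardized distance-band weights."""
    # Pairwise Euclidean distances between all points.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Binary neighbors within the radius, excluding each point itself.
    w = ((dist > 0) & (dist <= radius)).astype(float)
    # Row-standardize; points without neighbors keep an all-zero row.
    row_sums = w.sum(axis=1, keepdims=True)
    w = np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)
    z = values - values.mean()
    return (len(values) / w.sum()) * (z @ w @ z) / (z @ z)


rng = np.random.default_rng(0)
n = 200  # points per pattern (assumed)

# Two patterns observed in the same unit square (full overlap).
coords_a = rng.uniform(0, 1, size=(n, 2))
coords_b = rng.uniform(0, 1, size=(n, 2))

# Each pattern is spatially smooth on its own, but the two
# attribute surfaces run in opposite directions.
values_a = coords_a[:, 0]        # increases from west to east
values_b = 1.0 - coords_b[:, 0]  # decreases from west to east

i_single = morans_i(coords_a, values_a, radius=0.15)
i_combined = morans_i(
    np.vstack([coords_a, coords_b]),
    np.concatenate([values_a, values_b]),
    radius=0.15,
)

print(f"Moran's I, single pattern:   {i_single:.2f}")
print(f"Moran's I, combined pattern: {i_combined:.2f}")
```

In this toy setting, each pattern is strongly autocorrelated on its own, yet the statistic for the pooled points falls toward zero because neighbors drawn from the other pattern carry unrelated values. The scale-dependent effects reported in the paper are not reproduced by this sketch.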

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Map showing overlapping tweets in central London.
The yellowish tweets represent a semantic “work” topic described in the following section. The greenish tweets, in contrast, were assigned a “home” topic (cf. [20] for details on these topics). The background map is based on OpenStreetMap data.
Fig 2. Overview of the two employed datasets.
a) Simulated pattern; colors indicate the two primal sub-patterns. b) Twitter data from London. The background map of b) is based on OpenStreetMap data.
Fig 3. Heat map of pairwise covariance terms and semivariogram of topic associations.
The dashed semivariogram refers to the right-hand y-axis (same line-style). The left-hand y-axis refers to the color-coded bins. The figure is based on the entire Twitter dataset from London; see Section ‘Datasets’.
Fig 4. Local eigenvalues of a single pattern (top) and a combined pattern (bottom).
Please note the differing value ranges, which are attributable to different distributions of eigenvalues across the maps. Size classification is Jenks natural breaks.
Fig 5. Violin plots (cf. [55]) for a single pattern (left) and an overlapping pattern (right).
The central box illustrates data between the first and third quartiles. The white dot refers to the median.
Fig 6. Typical Moran scatterplot for positively spatially autocorrelated data.
Blue line shows the trend. HL: High-Low, LH: Low-High, LL: Low-Low and HH: High-High interaction.
Fig 7. Moran scatterplot for the combined pattern.
Dashed lines show the trends of the similarly colored points.
Fig 8. Interrelationships between points within the zone of overlap.
a) from a small-scale perspective and b) from a large-scale perspective.
Fig 9. Comparison between the Moran scatterplot and associated local eigenvalues.
Colors are in accordance with Fig 7. The ellipses shown are the respective 95% confidence ellipses. a) Ellipse for a non-overlapping pattern. b) Ellipse for an overlapping pattern. Note that the magnitudes of the axes differ. Similar sizes were chosen for visualization purposes.
Fig 10. Numbers of interactions between two overlapping patterns across a range of scale differences.
a) Overlapping patterns; b) Mutual involvement. Dark gray: small-scale perspective; light gray: large-scale perspective. (1a/b) to (3a/b): fitted decay functions for sub-ranges.
Fig 11. Course of the slope of the red component from the Moran scatterplot.
Dark-red: increasing attribute values from pattern center toward the boundary. Light-red: reversed attribute dispersal. The dashed line indicates the true Moran’s ℐ value of 0.81.
Fig 12. Correlogram of the serial correlation at different lags for the slopes of the red component.
Dashed line indicates the 95% confidence interval.
Fig 13. Course of the slope of the blue component from the Moran scatterplot.
Dark-blue: increasing attribute values from pattern center toward the boundary. Light-blue: reversed attribute dispersal.
Fig 14. Correlograms of the serial correlation at different lags within the slopes of the blue component.
a) Scale differences up to 45 m. b) Scale differences between 45 and 90 m. Dashed line indicates the 95% confidence interval.
Fig 15. Course of the slope of the yellow component from the Moran scatterplot.
Light-yellow: increasing attribute values from pattern center toward the boundary. Dark-yellow: reversed attribute dispersal.
Fig 16. Correlograms of the serial correlation at different lags within the slopes of the yellow component.
a) Scale differences up to 45 m. b) Scale differences between 45 and 90 m. Dashed line indicates the 95% confidence interval.

References

    1. Cranshaw J, Schwartz R, Hong JI, Sadeh N. The livehoods project: Utilizing social media to understand the dynamics of a city. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM-12); 2012, Dublin, Ireland.
    2. Lee R, Wakamiya S, Sumiya K. Urban area characterization based on crowd behavioral lifelogs over Twitter. Pers Ubiquit Comput. 2013;17: 605–620. doi: 10.1007/s00779-012-0510-9
    3. Newsome TH, Walcott WA, Smith PD. Urban activity spaces: Illustrations and application of a conceptual model for integrating the time and space dimensions. Transportation. 1998;25: 357–377. doi: 10.1023/A:1005082827030
    4. Rai R, Balmer M, Rieser M, Vaze V, Schönfelder S, Axhausen K. Capturing human activity spaces: New geometries. Transport Res Rec. 2007;2021: 70–80. doi: 10.3141/2021-09
    5. Gayo-Avello D. No, you cannot predict elections with Twitter. IEEE Internet Comput. 2012;16: 91–94. doi: 10.1109/MIC.2012.137