Empir Softw Eng. 2021;26(4):65.
doi: 10.1007/s10664-021-09961-9. Epub 2021 May 8.

Understanding and improving the quality and reproducibility of Jupyter notebooks

João Felipe Pimentel et al. Empir Softw Eng. 2021.

Abstract

Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and other rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way notebooks are used leads to unexpected behavior, encourages poor coding practices, and makes it hard to reproduce their results. To better understand the good and bad practices used in the development of real notebooks, in prior work we studied 1.4 million notebooks from GitHub. We presented a detailed analysis of the characteristics that impact their reproducibility, proposed best practices that can improve reproducibility, and discussed open challenges that require further research and development. In this paper, we extended the analysis in four different ways to validate the hypotheses uncovered in our original study. First, we separated a group of popular notebooks to check whether notebooks that receive more attention have higher quality and better reproducibility. Second, we sampled notebooks from the full dataset for an in-depth qualitative analysis of what the dataset contains and which features the notebooks have. Third, we conducted a more detailed analysis by isolating library dependencies and testing different execution orders, and we report how these factors impact the reproducibility rates. Finally, we mined association rules from the notebooks and discuss the patterns we discovered, which provide additional insights into notebook reproducibility. Based on our findings and the best practices we proposed, we designed Julynter, a JupyterLab extension that identifies potential issues in notebooks and suggests modifications that improve their reproducibility. We evaluated Julynter in a remote user experiment aimed at assessing its recommendations and usability.
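As a rough illustration of the execution-order analysis described above (a minimal sketch, not the pipeline used in the study; the function name and the skip definition below are assumptions for this example), the execution counters stored in the standard .ipynb JSON format can be scanned for cells whose counters do not follow top-to-bottom order:

    import json

    def count_execution_skips(path):
        """Count "skips": consecutive code cells whose execution counters do not
        increase by exactly one when the notebook is read top to bottom.
        Illustrative sketch only; not the analysis pipeline used in the paper."""
        with open(path, encoding="utf-8") as f:
            nb = json.load(f)
        # Collect execution counters of code cells in document order.
        counters = [
            cell.get("execution_count")
            for cell in nb.get("cells", [])
            if cell.get("cell_type") == "code"
        ]
        # Ignore cells that were never executed (counter is null).
        counters = [c for c in counters if isinstance(c, int)]
        return sum(1 for prev, cur in zip(counters, counters[1:]) if cur != prev + 1)

Under this definition, a notebook executed strictly top to bottom with no cell re-runs yields zero skips.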

Keywords: GitHub; Jupyter notebook; Lint; Quality; Reproducibility.

Figures

Fig. 1. An example of an executed notebook with Markdown, code, and output
Fig. 2. Original notebook and two executions that follow different orders
Fig. 3. Three types of Hidden States: (a) re-execution; (b) edited cell; (c) removed cell
Fig. 4. Top 15 most declared programming languages. Notebooks axis in logarithmic scale
Fig. 5. Distribution of code cells and maximum execution counter for the overall group (a) and popular group (b)
Fig. 6. Notebook corpus and its partitions used in the analyses
Fig. 7. Distribution of cell types among all notebooks with Markdown (a) and popular notebooks (b)
Fig. 8. Snippet of IBM/Science/sklearn_cookbook.ipynb from the GitHub repository WatPro/binder-workspace
Fig. 9. Distributions of filename lengths in the overall group (a) and popular group (b)
Fig. 10. Top 20 most imported modules
Fig. 11. Distribution of cells with imports in valid Python notebooks (a) and popular notebooks (b)
Fig. 12. Distribution of Python constructs in notebooks. This figure groups constructs into categories. The constructs of a category appear to the right of the category bar. A category corresponds to the union of its constructs
Fig. 13. Snippet of train_actions_csv.ipynb from the GitHub repository AdrianHsu/charades-parser
Fig. 14. Snippet of pythoncode/improvedlm.ipynb from the GitHub repository poorbaby/Predict-New-York-Taxi-Demand
Fig. 15. Distribution of code cells in executed notebooks (a) and popular notebooks (b)
Fig. 16. Distribution of skips in notebooks with unambiguous execution order (a) and popular notebooks (b)
Fig. 17. Snippet of pparker-roach/project_7-SANDBOX.ipynb from the GitHub repository mohsseha/DSI-BOS-students
Fig. 18. Failure reasons for the executions in each execution mode. The blue bars represent the Top 10 exceptions. The “Timeout” orange bar represents executions that we stopped after they had run for 5 minutes. The “Other” orange bar groups all other exceptions that are not part of the Top 10
Fig. 19. Julynter in action (left pane). By analyzing the notebook on the right pane, Julynter identified ten issues from four different categories
Fig. 20. Architecture of Julynter. Blue arrows represent input messages that occur before the cell execution. Red arrows represent output messages that occur after the kernel executes the cell
Fig. 21. Participants’ experiment flow
Fig. 22. Participants’ experience
Fig. 23. Solved and unsolved lints
Fig. 24. Satisfaction with the lint groups
Fig. 25. Chosen words in the Microsoft Product Reaction Cards (Benedek and Miner 2002). The colors vary according to the experiment phase in a gradient. Mixed colors indicate that participants of both phases chose the word, and the mixing intensity indicates the proportion

References

    1. Agrawal R, Srikant R, et al. (1994) Fast algorithms for mining association rules. In: VLDB conference, VLDB, vol 1215, pp 487–499
    2. Anaconda (2018) Anaconda software distribution. https://www.anaconda.com. Accessed: 2019-10-01
    3. Arnaoudova V, Di Penta M, Antoniol G. Linguistic antipatterns: what they are and how developers perceive them. Empir Softw Eng. 2016;21(1):104–158. doi: 10.1007/s10664-014-9350-8.
    4. Bangor A, Kortum PT, Miller JT. An empirical evaluation of the system usability scale. Int J Hum–Comput Interact. 2008;24(6):574–594. doi: 10.1080/10447310802205776.
    5. Benedek J, Miner T. Measuring desirability: new methods for evaluating desirability in a usability lab setting. Proc Usabil Prof Assoc. 2002;2003(8–12):57.