Empir Softw Eng. 2021;26(4):65.
doi: 10.1007/s10664-021-09961-9. Epub 2021 May 8.

Understanding and improving the quality and reproducibility of Jupyter notebooks

João Felipe Pimentel et al. Empir Softw Eng. 2021.

Abstract

Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and other rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way notebooks are used leads to unexpected behavior, encourages poor coding practices, and makes it hard to reproduce their results. To better understand the good and bad practices used in the development of real notebooks, in prior work we studied 1.4 million notebooks from GitHub. We presented a detailed analysis of the characteristics that impact their reproducibility, proposed best practices that can improve reproducibility, and discussed open challenges that require further research and development. In this paper, we extended the analysis in four different ways to validate the hypotheses uncovered in our original study. First, we separated a group of popular notebooks to check whether notebooks that receive more attention have higher quality and better reproducibility. Second, we sampled notebooks from the full dataset for an in-depth qualitative analysis of what the dataset contains and which features the notebooks have. Third, we conducted a more detailed analysis by isolating library dependencies and testing different execution orders, and we report how these factors impact the reproducibility rates. Finally, we mined association rules from the notebooks and discuss the patterns we discovered, which provide additional insights into notebook reproducibility. Based on our findings and the best practices we proposed, we designed Julynter, a JupyterLab extension that identifies potential issues in notebooks and suggests modifications that improve their reproducibility. We evaluated Julynter in a remote user experiment aimed at assessing its recommendations and usability.
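As a rough illustration of the execution-order analysis described above (a minimal sketch, not the pipeline used in the study; the function name and the skip definition below are assumptions for this example), the execution counters stored in the standard .ipynb JSON format can be scanned for cells whose counters do not follow top-to-bottom order:

    import json

    def count_execution_skips(path):
        """Count "skips": consecutive code cells whose execution counters do not
        increase by exactly one when the notebook is read top to bottom.
        Illustrative sketch only; not the analysis pipeline used in the paper."""
        with open(path, encoding="utf-8") as f:
            nb = json.load(f)
        # Collect execution counters of code cells in document order.
        counters = [
            cell.get("execution_count")
            for cell in nb.get("cells", [])
            if cell.get("cell_type") == "code"
        ]
        # Ignore cells that were never executed (counter is null).
        counters = [c for c in counters if isinstance(c, int)]
        return sum(1 for prev, cur in zip(counters, counters[1:]) if cur != prev + 1)

Under this definition, a notebook executed strictly top to bottom with no cell re-runs yields zero skips.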

Keywords: GitHub; Jupyter notebook; Lint; Quality; Reproducibility.

Figures

Fig. 1. An example of an executed notebook with Markdown, code, and output
Fig. 2. Original notebook and two executions that follow different orders
Fig. 3. Three types of Hidden States: (a) re-execution; (b) edited cell; (c) removed cell
Fig. 4. Top 15 most declared programming languages. Notebooks axis in logarithmic scale
Fig. 5. Distribution of code cells and maximum execution counter for the overall group (a) and popular group (b)
Fig. 6. Notebook corpus and its partitions used in the analyses
Fig. 7. Distribution of cell types among all notebooks with Markdown (a) and popular notebooks (b)
Fig. 8. Snippet of IBM/Science/sklearn_cookbook.ipynb from the GitHub repository WatPro/binder-workspace
Fig. 9. Distributions of filename lengths in the overall group (a) and popular group (b)
Fig. 10. Top 20 most imported modules
Fig. 11. Distribution of cells with imports in valid Python notebooks (a) and popular notebooks (b)
Fig. 12. Distribution of Python constructs in notebooks. This figure groups constructs into categories. The constructs of a category appear to the right of the category bar. A category corresponds to the union of its constructs
Fig. 13. Snippet of train_actions_csv.ipynb from the GitHub repository AdrianHsu/charades-parser
Fig. 14. Snippet of pythoncode/improvedlm.ipynb from the GitHub repository poorbaby/Predict-New-York-Taxi-Demand
Fig. 15. Distribution of code cells in executed notebooks (a) and popular notebooks (b)
Fig. 16. Distribution of skips in notebooks with unambiguous execution order (a) and popular notebooks (b)
Fig. 17. Snippet of pparker-roach/project_7-SANDBOX.ipynb from the GitHub repository mohsseha/DSI-BOS-students
Fig. 18. Failure reasons for the executions in each execution mode. The blue bars represent the Top 10 exceptions. The “Timeout” orange bar represents executions that we stopped after they had run for 5 minutes. The “Other” orange bar groups all other exceptions that are not part of the Top 10
Fig. 19. Julynter in action (left pane). By analyzing the notebook on the right pane, Julynter identified ten issues from four different categories
Fig. 20. Architecture of Julynter. Blue arrows represent input messages that occur before the cell execution. Red arrows represent output messages that occur after the kernel executes the cell
Fig. 21. Participants’ experiment flow
Fig. 22. Participants’ experience
Fig. 23. Solved and unsolved lints
Fig. 24. Satisfaction with the lint groups
Fig. 25. Chosen words in the Microsoft Product Reaction Cards (Benedek and Miner 2002). The colors vary according to the experiment phase in a gradient. Mixed colors indicate that participants of both phases chose the word, and the mixing intensity indicates the proportion

References

    1. Agrawal R, Srikant R, et al. (1994) Fast algorithms for mining association rules. In: VLDB conference, VLDB, vol 1215, pp 487–499
    2. Anaconda (2018) Anaconda software distribution. https://www.anaconda.com. Accessed: 2019-10-01
    3. Arnaoudova V, Di Penta M, Antoniol G. Linguistic antipatterns: what they are and how developers perceive them. Empir Softw Eng. 2016;21(1):104–158. doi: 10.1007/s10664-014-9350-8.
    4. Bangor A, Kortum PT, Miller JT. An empirical evaluation of the system usability scale. Int J Hum–Comput Interact. 2008;24(6):574–594. doi: 10.1080/10447310802205776.
    5. Benedek J, Miner T. Measuring desirability: new methods for evaluating desirability in a usability lab setting. Proc Usabil Prof Assoc. 2002;2003(8–12):57.