Data

Overview

Our work on OSS aims to analyze open source software in use, and to that end we have collected data on installable libraries/packages for programming languages such as Python, Julia, and R. In addition, we have collected project-level repository information from Code.gov.

Programming Languages

Modern programming languages such as Python, Julia, and R have official package registries (PyPI, JuliaPackages, and CRAN, respectively). The data collection task therefore involved obtaining a list of packages from each registry, either by accessing the registry as a file, by querying REST endpoints, or through web scraping.

Python (PyPI)

The first step in the data collection task for Python was to obtain a list of all packages. This was accomplished by making an XML-RPC request to https://pypi.python.org/pypi using xmlrpc, a module in the Python standard library. At the time of our research, the endpoint returned 441,095 packages. With the package names in hand, the next step was to collect each package's metadata through REST requests to package-specific endpoints. The metadata contains much of the relevant information about a package, such as its versions, dependencies, URLs (source code, homepage), creation date, description summary, and authors. Finally, we flattened the JSON objects into a table and stored the table in the database.
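A minimal sketch of this pipeline, assuming the third-party requests and pandas libraries and the package-specific JSON endpoint at https://pypi.org/pypi/{package}/json; the column selection and the SQLite destination are illustrative, not the exact schema we used:

    import sqlite3
    import xmlrpc.client

    import pandas as pd
    import requests

    # Step 1: list every package name via the XML-RPC endpoint.
    client = xmlrpc.client.ServerProxy("https://pypi.python.org/pypi")
    package_names = client.list_packages()

    # Step 2: fetch per-package metadata from the JSON endpoint and keep
    # a flat subset of the fields (version, dependencies, URLs, etc.).
    rows = []
    for name in package_names:
        resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=30)
        if resp.status_code != 200:
            continue  # deleted package or no metadata available
        info = resp.json()["info"]
        rows.append({
            "name": info["name"],
            "version": info["version"],
            "summary": info["summary"],
            "home_page": info["home_page"],
            "author": info["author"],
            "requires_dist": str(info["requires_dist"]),  # dependency list
        })

    # Step 3: flatten into a table and store it in the database.
    df = pd.DataFrame(rows)
    with sqlite3.connect("oss.db") as con:
        df.to_sql("pypi_packages", con, if_exists="replace", index=False)

Note that PyPI has since deprecated much of its XML-RPC API, so a present-day re-run would likely enumerate packages through the Simple index instead.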

R (CRAN)

The data collection process for R is relatively straightforward, thanks to CRAN's download-logs database, which is exposed through a user-friendly API. We accessed it through an R library called cranlogs, which provides wrapper functions for the API calls. For each package in CRAN, we collected the following information: package, version, dependencies, license, author, description, maintainer, title, source, URL, and reverse dependencies.
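The metadata retrieval can also be reproduced over plain HTTP. The sketch below uses Python for consistency with the other sections and queries METACRAN's crandb service at https://crandb.r-pkg.org/{package}; this endpoint is our assumption of a comparable data source, not necessarily the API behind the R tooling we used:

    import requests

    # CRAN DESCRIPTION metadata for one package, as JSON (assumed endpoint).
    meta = requests.get("https://crandb.r-pkg.org/ggplot2", timeout=30).json()

    record = {
        "package": meta["Package"],
        "version": meta["Version"],
        "dependencies": {k: meta.get(k) for k in ("Depends", "Imports", "Suggests")},
        "license": meta["License"],
        "author": meta.get("Author"),
        "maintainer": meta.get("Maintainer"),
        "title": meta["Title"],
        "description": meta["Description"],
        "url": meta.get("URL"),
    }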

The cranlogs package can also be used to obtain daily download counts for each package in CRAN. We obtained the “overall downloads” by summing the daily downloads, and the “yearly downloads” by aggregating the daily counts by year. The results were joined to form the main table.
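The cranlogs wrappers call an HTTP service at https://cranlogs.r-pkg.org, so the aggregation can be sketched in Python as well; the package name and date range below are illustrative:

    import pandas as pd
    import requests

    # Daily download counts for one package over an illustrative range.
    url = "https://cranlogs.r-pkg.org/downloads/daily/2015-01-01:2021-12-31/ggplot2"
    daily = pd.DataFrame(requests.get(url, timeout=30).json()[0]["downloads"])
    daily["day"] = pd.to_datetime(daily["day"])

    overall_downloads = daily["count"].sum()                               # "overall downloads"
    yearly_downloads = daily.groupby(daily["day"].dt.year)["count"].sum()  # "yearly downloads"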

Julia (JuliaPackages)

The Julia package registry resides on GitHub at https://github.com/JuliaRegistries/General. This repository contains a file named Registry.toml, which holds an exhaustive list of all packages in the JuliaPackages registry in the .toml format, a configuration-file format similar to .ini. Once we obtained the list of packages, we retrieved the source code URL for each package using a Julia package called PackageAnalyzer, which provides functions to look up the GitHub URL associated with a package name. This yielded a dataframe with two columns: package name and GitHub URL. The per-package metadata we need, such as authors, dependencies, and version, lives in each package's Project.toml file.
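We ran the URL lookup through PackageAnalyzer in Julia; since the registry itself also records each package's repository in a per-package Package.toml file, an equivalent step can be sketched in Python (the single language used for these examples). The file layout below reflects the General registry's structure as we understand it:

    import tomllib  # Python 3.11+; the third-party toml package works on older versions

    import requests

    RAW = "https://raw.githubusercontent.com/JuliaRegistries/General/master"

    # Registry.toml maps each package UUID to a name and a path in the registry.
    registry = tomllib.loads(requests.get(f"{RAW}/Registry.toml", timeout=30).text)

    packages = []
    for entry in registry["packages"].values():
        # Each package's directory contains a Package.toml with its repo URL.
        pkg = tomllib.loads(
            requests.get(f"{RAW}/{entry['path']}/Package.toml", timeout=30).text
        )
        packages.append({"name": pkg["name"], "github_url": pkg["repo"]})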

Accessing the Project.toml file for every package is difficult because the files are scattered across different GitHub repositories. To retrieve them, we made HTTP GET requests to https://raw.githubusercontent.com/{user}/{repo_name}/master/Project.toml for each user and repo_name in the GitHub URLs we collected. Finally, we joined the dataframes and flattened out the metadata.
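A sketch of that retrieval, continuing from the packages list above; the branch name master comes from the URL pattern in the text, and repositories whose default branch is named differently would need a fallback:

    import tomllib

    import requests

    def fetch_project_toml(github_url: str) -> dict | None:
        """Fetch and parse a package's Project.toml from its GitHub repository."""
        # github_url looks like https://github.com/{user}/{repo_name}(.git)
        user, repo_name = github_url.removesuffix(".git").split("/")[-2:]
        raw_url = f"https://raw.githubusercontent.com/{user}/{repo_name}/master/Project.toml"
        resp = requests.get(raw_url, timeout=30)
        if resp.status_code != 200:
            return None  # moved repo, renamed default branch, or no Project.toml
        return tomllib.loads(resp.text)

    meta = fetch_project_toml("https://github.com/JuliaData/DataFrames.jl.git")
    # If found, meta holds fields such as name, uuid, authors, version, and [deps].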

GitHub (GraphQL API)

As discussed above, we collected the source GitHub URLs for the Python, Julia, and R packages. These URLs give us access to a wealth of information at both the repository level and the commit level.

The repository-level data provide additional metadata for each package (a query retrieving these fields is sketched below), including:

1. Creation date
2. Project description
3. Readme
4. Fork count
5. Stargazer count
6. Issue count
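A sketch of such a repository-level query against GitHub's GraphQL endpoint; the token is a placeholder and the owner/name pair is illustrative:

    import requests

    REPO_QUERY = """
    query ($owner: String!, $name: String!) {
      repository(owner: $owner, name: $name) {
        createdAt
        description
        forkCount
        stargazerCount
        issues { totalCount }
        # readme text, when the file exists at this path
        object(expression: "HEAD:README.md") { ... on Blob { text } }
      }
    }
    """

    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": REPO_QUERY, "variables": {"owner": "pandas-dev", "name": "pandas"}},
        headers={"Authorization": "Bearer <YOUR_GITHUB_TOKEN>"},  # placeholder token
        timeout=30,
    )
    repo = resp.json()["data"]["repository"]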

The commit-level data give us a more granular look into package development. We can obtain the number of lines of code added and deleted by each developer, which helps us determine the key developers of a package, and we can also build a time-series view of the package's development.
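Per-commit additions, deletions, and author logins are exposed on the commit history connection; below is a sketch of one page of history (pagination through pageInfo is elided), posted to the same endpoint and with the same placeholder token as above:

    import requests

    COMMITS_QUERY = """
    query ($owner: String!, $name: String!) {
      repository(owner: $owner, name: $name) {
        defaultBranchRef {
          target {
            ... on Commit {
              history(first: 100) {
                nodes {
                  committedDate
                  additions
                  deletions
                  author { user { login } }
                }
                pageInfo { hasNextPage endCursor }
              }
            }
          }
        }
      }
    }
    """

    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": COMMITS_QUERY, "variables": {"owner": "pandas-dev", "name": "pandas"}},
        headers={"Authorization": "Bearer <YOUR_GITHUB_TOKEN>"},  # placeholder token
        timeout=30,
    )
    nodes = resp.json()["data"]["repository"]["defaultBranchRef"]["target"]["history"]["nodes"]
    # Summing additions and deletions per author login identifies key developers;
    # grouping by committedDate yields a development time series.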