Data Collection
Overview
For this project we have collected data pertaining to installable libraries/packages for programming languages such as Python, Julia, and R. In addition, we have collected project-level repository information for Code.gov.
Modern programming languages like Python, Julia, and R have dedicated package registries such as PyPI, JuliaPackages, and CRAN. Therefore, the data collection task involves accessing the registry as a file, through REST endpoints, or by using web-scraping techniques to get a list of packages. Most of these registries also contain metadata for the packages, such as:
- Author name
- License
- Source code URL
- Version history
- Downloads
- Date of creation, etc.
Once we relate a package to its source code URL, it’s relatively easy to obtain the repository-level and commit-level information for the project. The data collection for the packages can be broken down into the following steps:
- Obtain the list of packages
- Obtain metadata for the packages, along with their GitHub URLs
- Get repository-level information for the packages through their respective GitHub URLs
- Get commit-level information for the packages through their respective GitHub URLs
Data collection from PyPI for Python
The first step in the data collection task for Python is to get a list of all the packages. This was accomplished by making an XML-RPC request to https://pypi.python.org/pypi using the built-in Python package xmlrpc. At the time of our research, we obtained 441,095 packages from the RPC endpoint.
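A minimal sketch of this step, assuming PyPI's XML-RPC list_packages method (available at the time, since deprecated) is what the xmlrpc call used:

import xmlrpc.client

# Connect to PyPI's XML-RPC endpoint and request the full package list.
client = xmlrpc.client.ServerProxy("https://pypi.python.org/pypi")
packages = client.list_packages()  # assumed method; returns a list of package-name strings
print(len(packages))               # 441,095 at the time of our collection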
Now that we have the package names, the next step is to get the metadata for each package. To do that, we made REST requests to package-specific endpoints of the form https://pypi.org/pypi/{package_name}/json. When you make a request for a package, for example requests, you get a JSON response like the following, which contains the metadata:
{
  "info": {
    "author": "Kenneth Reitz",
    "author_email": "me@kennethreitz.org",
    "bugtrack_url": null,
    "classifiers": [
      "Development Status :: 5 - Production/Stable",
      "Environment :: Web Environment",
      "Intended Audience :: Developers",
      "License :: OSI Approved :: Apache Software License",
      "Natural Language :: English",
      "Operating System :: OS Independent",
      "Programming Language :: Python",
      ........
    ],
    "description": "# Requests\n\n**Requests** is a simple, ...\n",
    "home_page": "https://requests.readthedocs.io",
    "keywords": "",
    "license": "Apache 2.0",
    "maintainer": "",
    "maintainer_email": "",
    "name": "requests",
    "package_url": "https://pypi.org/project/requests/",
    "platform": null,
    "project_url": "https://pypi.org/project/requests/",
    "project_urls": {
      "Documentation": "https://requests.readthedocs.io",
      "Homepage": "https://requests.readthedocs.io",
      "Source": "https://github.com/psf/requests"
    },
    ..........
  },
  ..........
}
The metadata thus obtained has a lot of relevant information about the package, like:
1. Package versions
2. Package dependencies
3. Package URLs (source code, homepage)
4. Package creation date
5. Package description summary
6. Package authors, etc.
This step is followed by flattening the JSON objects to obtain a table.
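The following sketch shows how such a request and the subsequent flattening might look, assuming the requests and pandas libraries; json_normalize is one way to flatten the nested info object into a single-row table:

import requests
import pandas as pd

def fetch_pypi_metadata(package_name):
    # Package-specific JSON endpoint described above.
    url = f"https://pypi.org/pypi/{package_name}/json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

# Flatten the nested "info" object into one row per package.
metadata = fetch_pypi_metadata("requests")
row = pd.json_normalize(metadata["info"])
print(row[["name", "license", "home_page", "project_urls.Source"]])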
Data collection from CRAN for R
The data collection process for R is relatively straightforward, thanks to CRAN's logs database, which can be accessed through a user-friendly API. The logs database can be accessed through an R library called cranlogs, which provides wrapper functions for the database calls. We collected the following information for each package in CRAN:
1. Package
2. Version
3. Dependencies
4. License
5. Author
6. Description
7. Maintainer
8. Title
9. Source URL
10. Reverse Dependencies
This package can also be used to obtain daily download counts for each package in CRAN. We obtained the "overall downloads" by summing over the daily downloads, and we obtained the "yearly downloads" by aggregating the download counts by year.
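As an illustration, here is a sketch in Python (rather than R) of the aggregation step, assuming the public cranlogs web API at cranlogs.r-pkg.org and its daily-downloads response shape; the date range and example package are purely illustrative:

import requests
from collections import defaultdict

def downloads_by_year(package, start="2015-01-01", end="2022-12-31"):
    # cranlogs web API; assumed to return [{"downloads": [{"day": ..., "downloads": ...}, ...], ...}]
    url = f"https://cranlogs.r-pkg.org/downloads/daily/{start}:{end}/{package}"
    daily = requests.get(url, timeout=30).json()[0]["downloads"]
    yearly = defaultdict(int)
    for entry in daily:
        yearly[entry["day"][:4]] += entry["downloads"]  # group by the YYYY prefix of the date
    overall = sum(yearly.values())
    return dict(yearly), overall

yearly, overall = downloads_by_year("ggplot2")  # example package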
The results were joined to form the main table.
Data collection from JuliaPackages for Julia
The Julia package registry resides on GitHub at https://github.com/JuliaRegistries/General. In this repository, we can find a file named Registry.toml. This file contains an exhaustive list of all the packages in the JuliaPackages registry in the .toml format, which is similar to the .ini format for storing configuration files. Once we obtained the list of packages, we obtained the relevant source code URLs for the packages using a Julia package called PackageAnalyzer. This package contains functions to obtain the GitHub URL associated with each package name. Now we have a dataframe with the following columns (a sketch of the package-listing step follows this list):
- Package Name
- GitHub URL
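A sketch of the package-listing step, assuming the General registry keeps its packages under a [packages] table keyed by UUID (its layout at the time of writing) and Python 3.11+ for the tomllib parser:

import tomllib
import urllib.request

REGISTRY_URL = "https://raw.githubusercontent.com/JuliaRegistries/General/master/Registry.toml"

# Download and parse Registry.toml, then pull out every registered package name.
with urllib.request.urlopen(REGISTRY_URL) as response:
    registry = tomllib.loads(response.read().decode("utf-8"))

package_names = [entry["name"] for entry in registry["packages"].values()]
print(len(package_names))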
The source code for a Julia package must be structured to contain the following files and folders:
./src
./test
README.md
LICENSE
Manifest.toml
Project.toml
The file pertinent to our task is Project.toml, which contains metadata like:
1. Authors
2. Dependencies
3. Version
Accessing the Project.toml file for all the packages seems like a hard task, as they are scattered across different GitHub repositories. But we already have the GitHub repositories associated with the packages. Therefore, the task of accessing the Project.toml file boils down to making HTTP GET requests to https://raw.githubusercontent.com/{user}/{repo_name}/master/Project.toml for each user and repo_name in the GitHub URLs we collected.
This was followed by joining the dataframes and flattening out the metadata.
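A minimal sketch of this collection step, assuming the default branch is master as in the URL pattern above (repositories that use main would need a second attempt) and Python 3.11+ for tomllib; the example owner/repo is purely illustrative:

import tomllib
import requests

def fetch_project_toml(user, repo_name):
    # Raw-content URL for the package's Project.toml on the master branch.
    url = f"https://raw.githubusercontent.com/{user}/{repo_name}/master/Project.toml"
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        return None  # no Project.toml found on the master branch
    return tomllib.loads(response.text)

project = fetch_project_toml("JuliaLang", "Example.jl")  # illustrative owner/repo
if project is not None:
    print(project.get("authors"), project.get("version"), project.get("deps"))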
Using GitHub GraphQL API to collect more metadata
As discussed before, for the Python, Julia, and R packages we have collected their respective source GitHub URLs. These can give us a lot of information at both the repository and commit levels.
The repository-level data for the packages aids in getting more metadata, like:
1. Created date
2. Description of the project
3. Readme
4. Fork count
5. Stargazer count
6. Issues count, etc.
The commit-level data lets us take a more granular look into the package development. We can obtain the number of lines of code added and deleted by each developer for the package, which aids us in determining the key developers of the package. We can also get a time-series view of the package development.
Collection of GitHub repository-level information for the packages
One of the main problems with collecting huge amounts of data from the GitHub API is rate limits. At the time of writing, the rate limit associated with the GitHub GraphQL API is 5,000 requests per hour. To handle this issue, we use a list of multiple GitHub API keys to collect data. For every query, we use the least-used key, thereby cycling through all the keys.
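A sketch of the key-rotation idea, assuming each request is sent to the GraphQL endpoint with a personal access token in the Authorization header; usage simply counts how many requests each key has served, so the least-used key is picked every time (the tokens shown are placeholders):

import requests

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"
api_keys = ["ghp_key_one", "ghp_key_two", "ghp_key_three"]  # placeholder tokens
usage = {key: 0 for key in api_keys}

def run_query(query, variables=None):
    key = min(usage, key=usage.get)  # pick the least-used key
    usage[key] += 1
    response = requests.post(
        GITHUB_GRAPHQL_URL,
        json={"query": query, "variables": variables or {}},
        headers={"Authorization": f"bearer {key}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()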
The GraphQL query to check the health of an API key is of the following form:
query {
viewer {
login
}
rateLimit {
limit
cost
remaining
resetAt
}
}
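This query can be executed, for instance, by POSTing it to the GraphQL endpoint with the key under test; a sketch:

import requests

RATE_LIMIT_QUERY = """
query {
  viewer { login }
  rateLimit { limit cost remaining resetAt }
}
"""

def check_key(token):
    # Returns the login behind the token and how many requests it has left this hour.
    response = requests.post(
        "https://api.github.com/graphql",
        json={"query": RATE_LIMIT_QUERY},
        headers={"Authorization": f"bearer {token}"},
        timeout=30,
    )
    data = response.json()["data"]
    return data["viewer"]["login"], data["rateLimit"]["remaining"]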
In addition to this, we also had to handle secondary rate limits, which prevent a client from making "too many" requests in a short duration. This can be circumvented by adding a random delay of 0.5 to 3 seconds before every request.
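A small helper of the following form, called before each request, is one way to do this (the helper name is ours):

import random
import time

def polite_pause():
    # Random 0.5-3 second delay to stay clear of secondary rate limits.
    time.sleep(random.uniform(0.5, 3.0))

The repository-level metadata itself is then collected with a GraphQL query of the following form: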
query {
repository(owner: "%s", name: "%s") {
name
description
shortDescriptionHTML
url
createdAt
updatedAt
pushedAt
forkCount
stargazerCount: stargazers {
totalCount
}
issues(states: OPEN) {
totalCount
}
pullRequests(states: OPEN) {
totalCount
}
owner {
login
}
licenseInfo {
name
spdxId
url
}
object(expression: "HEAD:README.md") {
... on Blob {
text
}
}
}
viewer {
login
}
rateLimit {
limit
cost
remaining
resetAt
}
}
The data collection mechanism to collect the repositories is represented in the block diagram below.
Collection of GitHub commits-level information for the packages
Along with the challenges associated with the primary and secondary rate limits that we discussed previously, we also run into a problem with query limits when collecting commits data.
In the case of the repository-level query, we essentially have one row of data in the response. But querying commits for a repository returns as many rows as there are commits in the repository. If the result of the query turns out to be more than 1,000 rows (the query limit at the time of writing), the GraphQL query only returns the first 1,000 rows and truncates the remaining results.
To solve this problem, we can use pagination. Pagination divides the query results into multiple chunks, called 'pages', each within the limit. This means that we need to run the query multiple times for the same repository until there are no more pages.
The GraphQL query used for this task is of the following form:
query ($cursor: String) {
repository(owner: "%s", name: "%s") {
defaultBranchRef {
target {
... on Commit {
history(first: 100, after: $cursor, since: "%s") {
pageInfo {
hasNextPage
endCursor
}
edges {
node {
oid
messageHeadline
author {
name
email
date
user {
login
location
company
pronouns
bio
websiteUrl
twitterUsername
}
}
additions
deletions
}
}
}
}
}
}
}
}
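A sketch of the pagination loop for the commits query above, assuming a run_query(query, variables) helper like the one sketched earlier and a query string commits_query already formatted with the owner, repository name, and start date; the cursor returned in pageInfo.endCursor is fed back in until hasNextPage is false:

import random
import time

def collect_commits(commits_query):
    commits, cursor = [], None
    while True:
        time.sleep(random.uniform(0.5, 3.0))  # secondary-rate-limit delay before every request
        result = run_query(commits_query, {"cursor": cursor})
        history = result["data"]["repository"]["defaultBranchRef"]["target"]["history"]
        commits.extend(edge["node"] for edge in history["edges"])
        if not history["pageInfo"]["hasNextPage"]:
            return commits
        cursor = history["pageInfo"]["endCursor"]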
The data collection mechanism to collect the commits-information is represented in the block diagram below.