Code.gov is the Federal Government’s platform for sharing and open sourcing software from Government Agencies. Recent policies of the U.S. Federal Government such as the Federal Source Code Policy (FSCP) promote the sharing of software source code developed by or for the Federal Government. Public funding for creation of software by or for the Federal Government and for research through universities plays an important role but it is not fully accounted. Our aim is to measure how much of OSS is developed in the U.S. Federal Government and estimate its value. As OSS is disseminated online, a wealth of information is available in the code and source code repositories of the software programs themselves. Examples include: contributors, their organizations, lines of code contributed, etc. We use non-survey data from Code.gov which is the Federal Government’s platform for sharing and open sourcing software from Government Agencies.

Objectives

Our aim is to measure how much of OSS is developed in the U.S. Federal Government and estimate its value. This page below provides insights from Exploratory Data Analysis

Data and Methods

The data acquired from Code.gov is in JSON format covering over 17,000 software projects, across 21 different Federal Agencies and includes information on each project, agency name, host links, descriptions and other information. Most of these are developed and hosted on GitHub. We also use GitHub to obtain related information.

Our second source of data is GitHub – a popular version control and hosting service which is the largest platform in the world with over 108 million users and 100 million public repositories. From GitHub we have collected data on over 7.8 million repositories, 2.1 million users and 778 million commits. The information gathered from GitHub includes commits, lines of code added and deleted, OSI licenses, users’ login information, organization, locations, emails, etc.

Exploratory Data Analysis

Add text here

Hosting Platforms

Code
# establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

hosting <- dbReadTable(con, "codegov_agency_hosting")

dbDisconnect(con)
[1] TRUE
Code
hosting <- hosting |>
  mutate(total_hosting = rowSums(across(where(is.numeric)))) |>
  select(domain, total_hosting) |>
  filter(!row_number() %in% c(10))



#theme_set(theme_wes())
  

hosting |>
  mutate(domain = fct_reorder(domain, total_hosting)) |>
  ggplot(aes(x = domain, y = total_hosting)) + 
    geom_bar(stat = "identity", fill = '#014d64', width=.6) +
    coord_flip() +
    theme(legend.title=element_blank()) +
    labs(title = "Hosting Platforms by Projects Hosted",
         subtitle = "Github is the most popular hosting platform",
        x = "Hosting Platform",
        y = "Count of Projects")

The most popular hosting platform for Code.gov projects is GitHub followed by the Code.gov platform itself.

Government Agencies

Code
# establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

agency <- dbReadTable(con, "codegov_gh_agency_count")

dbDisconnect(con)
[1] TRUE
Code
agency |>
  select(agency, projects) |>
  mutate(agency = fct_reorder(agency, projects)) |>
  ggplot(aes(x = agency, y = projects)) +
    geom_bar(stat = "identity", fill = '#014d64', width=.6) +
    coord_flip() +
    theme(legend.title=element_blank()) +
    labs(title = "Number of Projects by Agency",
        x = "",
        y = "Count of Projects")

The Department of Energy, the General Services Administration and the Department of Health & Human Services are the top 3 agencies by number of projects showcased on the Code.gov platform.

Contributing Sectors

Code
# establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

sectors <- dbReadTable(con, "codegov_users_sectored")

dbDisconnect(con)
[1] TRUE
Code
sectors_count <- sectors |>
  summarise(Academic = sum(academic, na.rm = T),
            Business = sum(business, na.rm = T),
            Government = sum(government,  na.rm = T),
            `Non-Profit` = sum(nonprofit, na.rm = T))

sectors_count <- sectors_count |>
  gather(key = "Sectors", value = "Contributors")
  
ggplot(sectors_count, aes(area = Contributors, fill = Sectors, label = Contributors)) +
  geom_treemap() +
  geom_treemap_text(colour = "black",
                    place = "centre",
                    size = 15)+
  ggtitle("Number of Contributors Per Sector")+
  scale_fill_economist()

Contributors belonging to the Government sector were the highest with 6,324 contributors followed by the Academic sector with 2,635 contributors, Business sector with 1,615 contributors and the Non-Profit sector with 232 contributors.

Top 20 Organizations By Number of Contributors

Code
# establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

users <- dbReadTable(con, "codegov_users_sectors")

dbDisconnect(con)
[1] TRUE
Code
orgs_count <- users |>
  count(organization, name = 'Contributors', sort = T) |>
  rename(Organization = organization) |>
  arrange(desc(Contributors)) |>
  top_n(21, Contributors) |>
  slice(2:21)

orgs_count |>
  select(Organization, Contributors) |>
  mutate(Organization = fct_reorder(Organization, Contributors)) |>
  ggplot(aes(x = Organization, y = Contributors)) +
    geom_bar(stat = "identity", fill = '#014d64', width=.6) +
    coord_flip() +
   geom_text(aes(label=Contributors), color='white', position=position_dodge(width = 0.05), hjust=1.2, size = 7/.pt) +
    theme(legend.title=element_blank()) +
    labs(title = "Number of Contributors by Organization",
        x = "",
        y = "Count of Contributors")

The highest number of contributors by organizations were the Lawrence Livermore National Laboratory and Sandia National Laboratory. Private sector organizations with the highest number of contributors were Intel Corporation and Google.

Top 20 Contributors by Lines of Code Added

Code
#establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

contri <- dbReadTable(con, "codegov_gh_commits_by_login")

dbDisconnect(con)
[1] TRUE
Code
cont_count <- contri |>
  select(login, additions) |>
  group_by(login) |>
  summarise(Contribution = sum(additions)) |>
  rename(Contributor = login) |>
  arrange(desc(Contribution)) |>
  top_n(20, Contribution)

cont_count |>
  select(Contributor, Contribution) |>
  mutate(Contributor = fct_reorder(Contributor, Contribution)) |>
  ggplot(aes(x = Contributor, y = Contribution)) +
    geom_bar(stat = "identity", fill = '#014d64', width=.6) +
    scale_y_continuous(labels = label_number(suffix = "M", scale = 1e-6, big.mark = ",")) +
    coord_flip() +
    theme(legend.title=element_blank()) +
    labs(title = "Lines of Code Contributed by Top 20 Contributors",
        x = "",
        y = "Lines of Code")

Detailed Info of Top 20 Contributors (Table)

Code
top_20_users <- cont_count %>%
  left_join(users, by = join_by(Contributor == login)) |>
  group_by(Contributor) |>
  filter(row_number()==1)


top_20_users

Top 20 Academic Institutions By Number of Contributors

Code
# establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

users <- dbReadTable(con, "codegov_users_sectors")

dbDisconnect(con)
[1] TRUE
Code
orgs_count <- users |>
  filter(sector == 'Academic') |>
  count(organization, name = 'Contributors', sort = T) |>
  rename(Organization = organization) |>
  arrange(desc(Contributors)) |>
  top_n(20, Contributors)

orgs_count |>
  select(Organization, Contributors) |>
  mutate(Organization = fct_reorder(Organization, Contributors)) |>
  ggplot(aes(x = Organization, y = Contributors)) +
    geom_bar(stat = "identity", fill = '#014d64', width=.6) +
    coord_flip() +
    geom_text(aes(label=Contributors), color='white', position=position_dodge(width = 0.05), hjust=1.2, size = 7/.pt) +
    theme(legend.title=element_blank()) +
    labs(title = "Top 20 Academic Insitutions by Contributors",
        x = "",
        y = "Count of Contributors")

Stanford University is the academic institution with the highest number of contributors followed by University of California - Berkeley and University of Michigan - Ann Arbor.

Top 20 Contributing Research Labs

Code
# establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

users <- dbReadTable(con, "codegov_users_sectors")

dbDisconnect(con)
[1] TRUE
Code
orgs_count <- users |>
  filter(grepl('Laboratory', organization)) |>
  count(organization, name = 'Contributors', sort = T) |>
  rename(Organization = organization) |>
  arrange(desc(Contributors)) |>
  top_n(20, Contributors)

orgs_count |>
  select(Organization, Contributors) |>
  mutate(Organization = fct_reorder(Organization, Contributors)) |>
  ggplot(aes(x = Organization, y = Contributors)) +
    geom_bar(stat = "identity", fill = '#014d64', width=.6) +
    geom_text(aes(label=Contributors), color='#014d64', position=position_dodge(width=0.5), hjust=-0, size = 9/.pt) +
    coord_flip() +
    theme(legend.title=element_blank()) +
    labs(title = "Top 20 Labs by Contributors",
        x = "",
        y = "Count of Contributors")

The Lawrence Livermore National Laboratory is has the highest number of contributors out of all Laboratories.

Top 20 Contributing Businesses

Code
# establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

users <- dbReadTable(con, "codegov_users_sectors")

dbDisconnect(con)
[1] TRUE
Code
orgs_count <- users |>
  filter(sector == 'Business') |>
  count(organization, name = 'Contributors', sort = T) |>
  rename(Organization = organization) |>
  arrange(desc(Contributors)) |>
  top_n(20, Contributors)

orgs_count |>
  select(Organization, Contributors) |>
  mutate(Organization = fct_reorder(Organization, Contributors)) |>
  ggplot(aes(x = Organization, y = Contributors)) +
    geom_bar(stat = "identity", fill = '#014d64', width=.6) +
    coord_flip() +
    geom_text(aes(label=Contributors), color='white', position=position_dodge(width = 0.05), hjust=1.2, size = 7/.pt) +
    theme(legend.title=element_blank()) +
    labs(title = "Top 20 Businesses by Contributors",
        x = "",
        y = "Count of Contributors")

Intel Corporation, Google and Kitware are the top 3 corporations with the highest number of contributors.


Cost Calculations

Code
# establishing connection and listing all tables in database
con <- dbConnect(MySQL(), user = Sys.getenv("DB_USERNAME"), password = Sys.getenv("DB_PASSWORD"),
    dbname = "codegov", host = "oss-1.cij9gk1eehyr.us-east-1.rds.amazonaws.com")

counts_repo <- dbReadTable(con, "codegov_gh_commits_by_agency")

dbDisconnect(con)
[1] TRUE

Costs by Agency

Code
#Visualizing the Costs

counts_by_agency |>
  select(agency, sum_w_gross) |>
  mutate(agency = fct_reorder(agency, sum_w_gross)) |>
  ggplot(aes(x = agency, y = sum_w_gross)) +
    geom_bar(stat = "identity", fill = '#014d64', width=.6) +
    scale_y_continuous(labels = label_number(prefix = "$", suffix = "M", scale = 1e-6, big.mark = ",")) +
    coord_flip() +
    theme(legend.title=element_blank()) +
    labs(title = "Cost of Projects by Agency",
        x = "",
        y = "Cost of Projects")

Summary Table by Agencies

Code
#Preparing Final Summary Table

counts_by_agency <- counts_by_agency |>
  left_join(counts_contrib, join_by(Agency == agency)) |>
  select(!index)


counts_by_agency <- counts_by_agency |>
  rename(Contributors = contributors)

counts_by_agency