Analyses
Measuring the Cost of Open Source Software Projects Hosted on GitHub
Open source software (OSS) is software that anyone can review, modify, and distribute freely, usually with only minor restrictions such as giving credit to the creator of the work. The use of OSS is growing rapidly because of its value in increasing firm and economy-wide productivity. Despite its widespread use, there is no standardized methodology for measuring the scope and impact of this fundamental intangible asset. This study presents a framework to measure the value of OSS using data collected from GitHub, the world's largest code-hosting platform, with over 100 million developers. The data include over 7.6 million repositories where software is developed, stored, and managed. We collect information about contributors and development activity, such as code changes and license details. By adopting a cost estimation model from software engineering, we develop a methodology to generate estimates of investment in OSS that are consistent with the U.S. national accounting methods used for measuring software investment. We generate annual estimates of current and inflation-adjusted investment as well as the net stock of OSS for the 2009–2019 period. Our estimates show that U.S. investment in OSS in 2019 was $37.8 billion, with a current-cost net stock of $74.3 billion.
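To illustrate how a software engineering cost model can turn development activity into a cost figure, here is a minimal sketch. The summary above does not name the model, so the sketch assumes the basic COCOMO "organic mode" formula (effort in person-months = 2.4 × KLOC^1.05) purely for illustration. The repository names, line counts, and monthly cost per developer are hypothetical and are not taken from the study.

```python
# Assumption: basic COCOMO, organic mode. The study's actual model and
# parameters may differ; this only shows the general shape of the calculation.
COCOMO_A = 2.4
COCOMO_B = 1.05


def effort_person_months(lines_of_code: int) -> float:
    """Estimated development effort, in person-months, for a code base."""
    kloc = lines_of_code / 1000.0  # the model works in thousands of lines
    return COCOMO_A * kloc ** COCOMO_B


def development_cost(lines_of_code: int, monthly_cost: float) -> float:
    """Effort multiplied by an assumed monthly cost per developer."""
    return effort_person_months(lines_of_code) * monthly_cost


# Hypothetical inputs, for illustration only.
HYPOTHETICAL_MONTHLY_COST = 10_000.0
repo_lines = {"repo_a": 10_000, "repo_b": 2_500}

total_cost = sum(
    development_cost(n, HYPOTHETICAL_MONTHLY_COST) for n in repo_lines.values()
)
print(f"Estimated total cost: ${total_cost:,.0f}")
```

Because the exponent is greater than one, estimated effort grows slightly faster than code size, so costs are computed per repository and then summed rather than computed once on the pooled line count.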
Products:
Publications
Korkmaz, G., Calderón, J., Kramer, B., Guci, L., Robbins, C. (2024). From GitHub to GDP: A framework for measuring open source software innovation. Research Policy, 53(3). https://doi.org/10.1016/j.respol.2024.104954.
Contribution of the U.S. Federal Government to Open Source Software
This study analyzes patterns and trends in open-source software (OSS) contributions by U.S. federal government agencies. Prompted by the Federal Source Code Policy, Code.gov was established as a platform to facilitate the sharing of custom-developed software across federal government agencies. We use data from Code.gov, which catalogs OSS projects developed and shared by government agencies, and enhance these data with detailed development and contributor information from GitHub. By adopting the cost estimation methodology proposed in Korkmaz et al. (2024), which is consistent with the U.S. national accounting framework for software investment, we provide annual estimates of investment in OSS by government agencies for the 2009–2021 period. This study not only sheds light on the government’s role in fostering OSS development but also offers a framework for assessing the scope and value of OSS initiatives within the public sector.
Products:
Publications
Shrivastava, R., & Korkmaz, G. (2024). Measuring public open-source software in the federal government: An analysis of Code.gov. Journal of Data Science, 22(3), 356-375. https://doi.org/10.6339/24-JDS1148.
Presentations
Carol Robbins (National Science Foundation), J. Bayoán Santiago-Calderón (Bureau of Economic Analysis), Gizem Korkmaz (Westat), Brandon Kramer (Edge & Node) (September 2023). “Measuring the Cost of Open Source Software Using GitHub.” NYU Stern Center for the Future of Management - Benefits and Challenges of Open Source Conference. New York, NY. Link.
J. Bayoán Santiago-Calderón (Bureau of Economic Analysis), Gizem Korkmaz (Westat), Brandon Kramer (Edge & Node), Ledia Guci (Bureau of Economic Analysis), Carol Robbins (National Science Foundation) (December 2022). “Measuring the Cost of Open Source Software Using GitHub.” NYU - Economics of Open Source Workshop. Virtual.
Gizem Korkmaz (Westat), Nicholas Askew (Westat), Clara Boothby (National Science Foundation) (January 2025, to be held). “Attributing Credit and Measuring Impact of Open-Source Software Using Fractional Counting.” 2025 American Economic Association / Allied Social Science Associations (ASSA) Annual Meeting, San Francisco, California. Link.
Gizem Korkmaz (Westat), Nicholas Askew (Westat), Clara Boothby (National Science Foundation) (November 2024, to be held). “Mapping Open Source Software: A Bibliometric Approach to Attributing Credit and Measuring Impact.” Association for Public Policy Analysis and Management (APPAM) 2024 Fall Research Conference, National Harbor, Maryland. Link.
Nicholas Askew (Westat), Gizem Korkmaz (Westat), Clara Boothby (National Science Foundation) (June 2024). “Attributing Credit and Measuring Impact of Open Source Software Using Fractional Counting.” 2024 Symposium on Data Science and Statistics, Richmond, Virginia. Link.
Carol Moore (University of Virginia), Uyen Nguyen (University of Virginia), Gizem Korkmaz (Westat) (June 2024). “Gender Differences in the Development of R Packages on GitHub.” 2024 Symposium on Data Science and Statistics, Richmond, Virginia. Link.
Gizem Korkmaz (Westat), Rahul Shrivastava (Westat), Anil Battalahalli (Westat), Ekaterina Levitskaya (Coleridge Initiative), J. Bayoán Santiago Calderón (Bureau of Economic Analysis), Ledia Guci (Bureau of Economic Analysis), Carol Robbins (National Science Foundation) (June 2023). “Open Source Software in the Federal Government: An Analysis of Code.Gov.” Government Advances in Statistical Programming (GASP) 2023 Conference, Virtual. Link.
A Bibliometric Approach to Attributing Credit and Measuring Impact of Open-Source Software
Influential contributors to OSS can strongly shape the priorities and practices of scientific research when their work is widely used or built upon by other researchers. In this context, studying the global distribution, collaboration, and impact of contributors is important to understanding the landscape of innovation in scientific research. This study uses data collected from GitHub on Python and R packages. It applies fractional-counting methods to measure each developer's contribution, weighting by the lines of code each developer added, and sums these contributions by country. We also use the dependency relationships between packages to study pairwise connections between countries and measure their respective impact.
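The weighted fractional counting described above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes each package carries one unit of credit, split among its developers in proportion to the lines of code they added, and then summed by country. The function name, the record layout, and the sample data are all hypothetical.

```python
from collections import defaultdict


def fractional_country_credit(contributions):
    """Sum line-weighted fractional credit by country.

    `contributions` is a list of (package, developer, country, lines_added)
    tuples. Each package contributes a total credit of 1.0.
    """
    # Total lines added per package, used as the denominator of each share.
    package_totals = defaultdict(int)
    for package, _developer, _country, lines in contributions:
        package_totals[package] += lines

    credit = defaultdict(float)
    for package, _developer, country, lines in contributions:
        if package_totals[package] > 0:  # skip packages with no added lines
            credit[country] += lines / package_totals[package]
    return dict(credit)


# Hypothetical sample data, for illustration only.
sample = [
    ("pkg_a", "dev1", "US", 300),
    ("pkg_a", "dev2", "DE", 100),
    ("pkg_b", "dev3", "DE", 50),
    ("pkg_b", "dev1", "US", 50),
]
credit = fractional_country_credit(sample)
print(credit)
```

With this scheme the credits across all countries sum to the number of packages, so a country's total can be read as its "package-equivalents" of contribution.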
Products:
Presentations
Gizem Korkmaz (Westat), Nicholas Askew (Westat), Clara Boothby (National Science Foundation) (January 2025, to be held). “Attributing Credit and Measuring Impact of Open-Source Software Using Fractional Counting.” American Economic Association, San Francisco, California. Link.
Gizem Korkmaz (Westat), Nicholas Askew (Westat), Clara Boothby (National Science Foundation) (November 2024, to be held). “Mapping Open Source Software: A Bibliometric Approach to Attributing Credit and Measuring Impact.” Association for Public Policy Analysis and Management Fall Research Conference, National Harbor, Maryland. Link.
Gizem Korkmaz (Westat), Nicholas Askew (Westat), Clara Boothby (National Science Foundation) (June 2024). “Attributing Credit and Measuring Impact of Open Source Software Using Fractional Counting.” Symposium on Data Science and Statistics, Richmond, Virginia. Link.
Gender Differences in Open-Source Software Development
Analyzing gender dynamics in scientific research and its outputs is crucial for ensuring that science policy is inclusive and equitable. Like other research outputs such as publications and patents, open source software (OSS) projects are developed by contributors from universities, government research institutions, and nonprofits, in addition to businesses. Despite the reach and continued rapid growth of OSS, reliable and comprehensive survey data on it do not exist, which limits insights into contributions by gender and policy-makers’ ability to assess trends in gender representation. As in scientific research, the inclusion of diverse perspectives in software development enhances creativity and problem-solving. This exploratory study aims to quantify gender differences in the development and use (impact) of OSS using publicly available information collected from GitHub.
Products:
Presentations
Gizem Korkmaz (Westat), Carol Moore, Uyen Nguyen (University of Virginia) (June 2024). “Gender Differences in the Development of R Packages on GitHub.” Symposium on Data Science and Statistics, Richmond, Virginia. Link.
Classification of Open-Source Software into Categories Using Machine Learning
Understanding the landscape of OSS requires identifying the major contributors (countries, institutions, and individuals), as well as capturing trends in the fields of application. Studying prevailing topics (such as artificial intelligence), and which categories are thriving or lacking, would complement existing science and technology indicators on peer-reviewed publications that are calculated from databases covering scientific articles (Hall and Jaffe, 2012). The classification of research activity into fields is a fundamental aspect of the academic and scientific ecosystem, and it can help with the efficient dissemination, understanding, and advancement of knowledge and technology. In this research, we apply a machine learning-based classification framework to assign OSS projects to their respective fields of application. We collect data from GitHub on packages developed for the programming languages R, Python, and Julia, and use their README files to identify topics. Researchers can use the framework presented here to capture research fields, data types, methods, and prevailing topics in text-based research output.
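As a rough illustration of classifying packages from README text, the sketch below trains a small multinomial naive Bayes classifier using only the Python standard library. The choice of naive Bayes is an assumption made for brevity, since the summary above does not specify the model; the class name, category labels, and README snippets are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesReadmeClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, readmes, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of READMEs
        self.vocab = set()
        for text, label in zip(readmes, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.doc_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training snippets and field labels, for illustration only.
train_readmes = [
    "Neural network training and deep learning models",
    "Gradient boosting and machine learning pipelines",
    "Plot maps and geospatial raster data",
    "Spatial analysis of geographic coordinates and maps",
]
train_labels = ["machine learning", "machine learning", "geospatial", "geospatial"]

clf = NaiveBayesReadmeClassifier().fit(train_readmes, train_labels)
print(clf.predict("A library for deep learning models"))
print(clf.predict("Raster maps of geographic data"))
```

A real pipeline would need far more labeled READMEs and a held-out evaluation set; the point here is only the flow from README text to tokens to a predicted field.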