


Program 13: Github Activity.
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2023



Program Description



Due 10am, Wednesday, 10 May.
Learning Objective: to introduce SQL as a way to query data.
Available Libraries: pandas, pandasql and core Python 3.6+.
Data Source: Github Activity Data
Sample Datasets: github_projects_2017.csv, github_projects_2018.csv


The assignment is broken into the following functions to allow for unit testing:




Let's use the 2017 data for testing.

Finding the total number of repos where the Github URL is not specified:

df = make_df('program13/github_projects_2017.csv')
df_count_null = count_null_repositories(df)
print(df_count_null)
would print:
   num_repos
0        590
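
A count like this can be written in SQL with COUNT(*) and an IS NULL filter. The sketch below is only an illustration, not the required solution: it uses Python's built-in sqlite3 module (pandasql also runs its queries on SQLite by default) with a made-up table, and the column name `repo_url` is an assumption that may not match the course dataset.

```python
import sqlite3

# Toy table standing in for the DataFrame; `repo_url` is an assumed column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (repo_url TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("https://github.com/a/b", "Python"), (None, "Go"), (None, "C")],
)

# COUNT(*) counts rows; the WHERE clause keeps only rows with no URL.
# Note that `= NULL` would never match, so IS NULL is needed here.
query = "SELECT COUNT(*) AS num_repos FROM df WHERE repo_url IS NULL"
null_count = conn.execute(query).fetchone()[0]
print(null_count)
conn.close()
```

With pandasql, the same query string could be passed to sqldf along with the DataFrame, which would return the result as a one-row DataFrame.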

Aggregating and counting number of repos by language:
df_repos_by_language = count_repos_by_language(df)
print(df_repos_by_language)
would print:
language    num_repos
ApacheConf  1
C           11
C#          52
...
TeX         2
TypeScript  76
Vue         7
XSLT        1
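
Aggregation of this kind is usually expressed with GROUP BY. The following is a minimal sketch on a toy SQLite table, assuming the column is called `language` as in the output above; the ORDER BY clause is an assumption based on the alphabetical order of the sample output.

```python
import sqlite3

# Toy table with made-up rows; only the `language` column matters here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (name TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("a", "C"), ("b", "Vue"), ("c", "C"), ("d", "TeX")],
)

# GROUP BY collapses rows that share a language; COUNT(*) counts each group.
query = """
    SELECT language, COUNT(*) AS num_repos
    FROM df
    GROUP BY language
    ORDER BY language
"""
language_counts = conn.execute(query).fetchall()
for language, num_repos in language_counts:
    print(language, num_repos)
conn.close()
```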

Counting the number of repositories tagged with machine learning:
df_ml_repos = count_ml_repos(df)
print(df_ml_repos)
would print:
         num_repos
0        2
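
Matching a tag inside a text column can be done with the LIKE operator and % wildcards. This sketch is hypothetical throughout: the column name `topics` and the tag spelling `machine-learning` are assumptions, so check the dataset for how tags are actually stored.

```python
import sqlite3

# Toy table; `topics` is an assumed column holding comma-separated tags.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (name TEXT, topics TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [
        ("a", "machine-learning,python"),
        ("b", "web,frontend"),
        ("c", "deep-learning,machine-learning"),
        ("d", None),  # rows with no topics never match LIKE
    ],
)

# % matches any run of characters, so the tag may appear anywhere in the text.
query = """
    SELECT COUNT(*) AS num_repos
    FROM df
    WHERE topics LIKE '%machine-learning%'
"""
ml_count = conn.execute(query).fetchone()[0]
print(ml_count)
conn.close()
```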

Finding the most recent created timestamp:
df_most_recent_timestamp = find_most_recent_timestamp(df)
print(df_most_recent_timestamp)
would print:
       most_recent_timestamp
0      2017-12-31 23:18:35 UTC
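
The MAX aggregate covers this case. Timestamps written as fixed-width `YYYY-MM-DD HH:MM:SS UTC` text sort alphabetically in the same order as chronologically, so MAX on the text column gives the latest one. The sketch below assumes a column named `created_timestamp` and uses made-up rows.

```python
import sqlite3

# Toy table; `created_timestamp` is an assumed column name storing text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (name TEXT, created_timestamp TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [
        ("a", "2017-01-05 10:00:00 UTC"),
        ("b", "2017-12-31 23:18:35 UTC"),
        ("c", "2017-06-30 08:15:00 UTC"),
    ],
)

# MAX compares the strings; the fixed-width format makes that chronological.
query = "SELECT MAX(created_timestamp) AS most_recent_timestamp FROM df"
most_recent = conn.execute(query).fetchone()[0]
print(most_recent)
conn.close()
```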

Finding the number of Python repos with a missing license:
df_python_repo_with_missing_license = count_python_repo_with_missing_license(df)
print(df_python_repo_with_missing_license)
would print:
       num_go_repo_with_missing_license
0      4
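
Two conditions can be combined in one WHERE clause with AND. As a rough sketch on a toy table, assuming columns named `language` and `license` and assuming a missing license is stored as NULL:

```python
import sqlite3

# Toy table; `language` and `license` are assumed column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (name TEXT, language TEXT, license TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [
        ("a", "Python", None),
        ("b", "Python", "mit"),
        ("c", "Go", None),
        ("d", "Python", None),
    ],
)

# Both conditions must hold: the repo is Python and it has no license value.
query = """
    SELECT COUNT(*) AS num_repos
    FROM df
    WHERE language = 'Python' AND license IS NULL
"""
python_missing = conn.execute(query).fetchone()[0]
print(python_missing)
conn.close()
```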