Program 13: Github Activity. Due 10am, Wednesday, 10 May.
Let's use the 2017 data for testing:
Finding total number of repos where the Github url is not specified:
Learning Objective: to introduce SQL to access data.
Available Libraries: pandas, pandasql and core Python 3.6+.
Data Source: Github Activity Data
Sample Datasets: github_projects_2017.csv, github_projects_2018.csv
The assignment is broken into the following functions to allow for unit testing:
make_df(file_name)
:
This function takes one input:
The function should return a Pandas DataFrame.
file_name
: the input data filename
count_null_repositories(df)
:
This function takes one input:
Write a query that returns a DataFrame containing one column, df
: the Github Activity dataset loaded as a Pandas DataFrame
num_repos
, which counts the number of rows in the dataframe df
where the repository url is missing
count_repos_by_language(df)
:
This function takes one input:
Write a query that returns a DataFrame containing two columns: df
: the Github Activity dataset loaded as a Pandas DataFrame
language
, num_repos
. The column language
contains the unique languages in the dataframe df
. The column num_repos
counts how many rows in the dataframe for each language. Sort the resulting dataframe alphabetically by the first column, language
.
count_ml_repos(df)
:
This function takes one input:
Write a query that returns a DataFrame containing one column, df
: the Github Activity dataset loaded as a Pandas DataFrame
num_repos
, which counts the number of rows in the dataframe df
that contains the keywords "machine learning"
find_most_recent_timestamp(df)
:
This function takes one input:
Write a query that returns a DataFrame containing one column, df
: the Github Activity dataset loaded as a Pandas DataFrame
most_recent_timestamp
, which gives the latest created timestamp in the dataframe df
count_python_repo_with_missing_license(df)
:
This function takes one input:
Write a query that returns a DataFrame containing one column, df
: the Github Activity dataset loaded as a Pandas DataFrame
num_go_repo_with_missing_license
, which counts the number of rows in the dataframe df
where the licenses column value is missing and the language of the repo is Python.
would print:
df = make_df('program13/github_projects_2017.csv')
df_count_null = count_null_repositories(df)
print(df_count_null)
num_repos
0 590
Aggregating and counting number of repos by language:
would print:
df_repos_by_language = count_repos_by_language(df)
print(df_repos_by_language)
language num_repos
ApacheConf 1
C 11
C# 52
...
TeX 2
TypeScript 76
Vue 7
XSLT 1
Counting the number of repositories tagged with machine learning:
would print:
df_ml_repos = count_ml_repos(df)
print(df_ml_repos)
num_repos
0 2
Finding the most recent created timestamp:
would print:
df_most_recent_timestamp = find_most_recent_timestamp(df)
print(df_most_recent_timestamp)
most_recent_timestamp
0 2017-12-31 23:18:35 UTC
Finding the number of Python repos with a missing license:
would print:
df_python_repo_with_missing_licens = count_python_repo_with_missing_license(df)
print(df_python_repo_with_missing_licens)
num_go_repo_with_missing_license
0 4