My practice exercises in cleaning and summarizing data using R, NumPy, pandas, etc.

pakbungdesu/data-manipulation

Content

  • Data Manipulation using R
  • Data Manipulation using Python (NumPy, pandas, etc.)

Data Manipulation using R

  • Data Cleaning

    • Adjust the data structure (column types) to prevent data type errors

    • Handle missing values using is.na() and complete.cases()

      • Replace with the correct value when it is known from another column

        # Example: fill a missing State from a known City
        
        df[is.na(df$State) & df$City == "New York", "State"] <- "NY"
        df[is.na(df$State) & df$City == "San Francisco", "State"] <- "CA"
        
      • Impute with the median of the same sub-sector

        # Example: fill missing Employees in Retail with the Retail median
        
        med_emp_retail <- median(df[df$Industry == "Retail",]$Employees, na.rm = TRUE)
        df[is.na(df$Employees) & df$Industry == "Retail", "Employees"] <- med_emp_retail
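
The first two Data Cleaning steps (adjusting column types and locating missing values) can be sketched as follows. This is a minimal illustration, not the repository's own code: the sample data frame is hypothetical, and its column names are borrowed from the examples above.

```r
# Hypothetical sample data; column names follow the examples above.
df <- data.frame(
  City      = c("New York", "San Francisco", "Chicago", "New York"),
  State     = c("NY", NA, "IL", NA),
  Industry  = c("Retail", "Retail", "Health", "Retail"),
  Employees = c("120", "45", NA, "80"),  # numbers that were read in as text
  stringsAsFactors = FALSE
)

# Adjust the structure: store counts as integers and categories as factors.
df$Employees <- as.integer(df$Employees)
df$Industry  <- factor(df$Industry)
str(df)

# is.na() marks each missing cell; colSums() counts them per column.
na_counts <- colSums(is.na(df))
print(na_counts)

# complete.cases() is TRUE for rows without any NA,
# so negating it selects the rows that still need attention.
incomplete_rows <- df[!complete.cases(df), ]
print(incomplete_rows)
```

Converting types before imputing matters here: median() would fail on a character column, so the Employees column has to be numeric first.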
        
  • Aggregation and Summarisation

    • Use apply(), lapply(), and sapply() to loop over a matrix
    • Store data frames in a list and export them all at once with export_list()
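
A rough sketch of the two aggregation ideas above, under a few assumptions: the matrix and data frame are hypothetical, and export_list() is assumed to be the function from the rio package, since the README does not name the package. If rio is not installed, the sketch falls back to a base R loop with write.csv().

```r
# Hypothetical numeric matrix: rows are stores, columns are quarters.
sales <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 2, byrow = TRUE,
                dimnames = list(c("store_a", "store_b"), c("q1", "q2", "q3")))

row_totals <- apply(sales, 1, sum)   # MARGIN = 1: one result per row
col_means  <- apply(sales, 2, mean)  # MARGIN = 2: one result per column

# Split a hypothetical data frame into a named list of data frames.
df <- data.frame(Industry  = c("Retail", "Retail", "Health"),
                 Employees = c(120, 45, 300))
by_industry <- split(df, df$Industry)

# lapply() always returns a list; sapply() simplifies to a vector when it can.
row_counts <- lapply(by_industry, nrow)
median_emp <- sapply(by_industry, function(d) median(d$Employees))

# Export every data frame in the list, one CSV file per list element.
out_dir <- tempdir()
if (requireNamespace("rio", quietly = TRUE)) {
  # Assumption: rio::export_list() replaces %s with each list element's name.
  rio::export_list(by_industry, file = file.path(out_dir, "%s.csv"))
} else {
  # Base R fallback when rio is unavailable.
  for (nm in names(by_industry)) {
    write.csv(by_industry[[nm]], file.path(out_dir, paste0(nm, ".csv")),
              row.names = FALSE)
  }
}
```

Naming the list elements (here through split()) is what gives each exported file a meaningful name.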
