Saturday, July 1, 2023

20 Must-Know Python Data Cleaning Commands

1. Importing necessary libraries:

```python

import pandas as pd

import numpy as np

```


2. Reading data from a CSV file:

```python

df = pd.read_csv('filename.csv')

```


3. Checking the first few rows of the DataFrame:

```python

df.head()

```


4. Checking the information about the DataFrame, including data types and missing values:

```python

df.info()

```
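
`df.info()` prints non-null counts, so the number of missing values has to be worked out by subtraction. A minimal sketch of counting them directly, assuming a small made-up DataFrame (the column names `name` and `age` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "age": [34, np.nan, 29, np.nan],
})

missing_per_column = df.isna().sum()   # number of missing values in each column
missing_share = df.isna().mean()       # fraction of missing values in each column

print(missing_per_column)
print(missing_share)
```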


5. Handling missing values by dropping rows or columns:

```python

df.dropna()  # drop rows with any missing value

df.dropna(axis=1)  # drop columns with any missing value

```
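
Note that `dropna()` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back. The sketch below, on hypothetical toy data, also shows the `subset` and `thresh` parameters, which are often less drastic than dropping every incomplete row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [0.5, np.nan, 0.7, np.nan],
    "note": ["a", None, None, "d"],
})

rows_complete = df.dropna()                     # keep rows with no missing value at all
rows_with_score = df.dropna(subset=["score"])   # only require 'score' to be present
rows_mostly_full = df.dropna(thresh=2)          # keep rows with at least 2 non-missing values

print(len(df), len(rows_complete), len(rows_with_score), len(rows_mostly_full))
```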


6. Handling missing values by filling with a specific value:

```python

df.fillna(value)  # fill missing values with a specific value

```
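
A single fill value rarely suits every column. As a sketch, `fillna` also accepts a dictionary that maps column names to fill values; the columns `city` and `visits` here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "city": ["Oslo", None, "Rome"],
    "visits": [3.0, np.nan, 5.0],
})

# Use a different fill value for each column.
filled = df.fillna({"city": "unknown", "visits": 0})

print(filled)
```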


7. Handling missing values by forward filling or backward filling:

```python

df.ffill()  # forward fill missing values

df.bfill()  # backward fill missing values

```
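
To make the difference between the two directions concrete, here is a small sketch on a made-up Series. The `limit` argument caps how many consecutive gaps get filled:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered measurements with gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # each gap takes the last valid value before it
backward = s.bfill()         # each gap takes the next valid value after it
limited = s.ffill(limit=1)   # fill at most one consecutive gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

A trailing gap has no later value to borrow from, so `bfill()` leaves it missing; the same applies to a leading gap with `ffill()`.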


8. Handling duplicates by dropping them:

```python

df.drop_duplicates()  # drop duplicate rows

```
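
Duplicates are often defined by a key column rather than by the whole row. A minimal sketch, assuming a hypothetical `email` key, that counts exact duplicates and then keeps only the last record per key:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "a@x.com"],
    "plan": ["free", "free", "pro", "pro"],
})

n_dupes = df.duplicated().sum()                                   # rows identical to an earlier row
exact = df.drop_duplicates()                                      # drop fully identical rows
by_email_last = df.drop_duplicates(subset="email", keep="last")   # one row per email, the latest

print(n_dupes)
print(by_email_last)
```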


9. Renaming columns:

```python

df.rename(columns={'old_name': 'new_name'}, inplace=True)  # rename columns

```


10. Removing leading/trailing whitespace from column values:

```python

df['column_name'] = df['column_name'].str.strip()

```


11. Removing special characters from column values:

```python

df['column_name'] = df['column_name'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required in pandas 2.0+

```
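
The pattern `[^\w\s]` matches anything that is not a word character or whitespace. A short sketch on made-up strings shows what it removes and what it keeps (the underscore counts as a word character, so it survives):

```python
import pandas as pd

# Hypothetical messy strings for illustration only.
s = pd.Series(["  Hello, World! ", "price: $100", "a_b-c"])

# Remove punctuation and symbols, then trim the leftover outer whitespace.
cleaned = s.str.replace(r"[^\w\s]", "", regex=True).str.strip()

print(cleaned.tolist())
```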


12. Changing data types of columns:

```python

df['column_name'] = df['column_name'].astype('new_type')

```
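
`astype` raises an error when a value cannot be converted, which is common in uncleaned data. One possible workaround, sketched here on made-up values, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into missing values:

```python
import pandas as pd

# Hypothetical raw column read as text, with one unparseable entry.
raw = pd.Series(["10", "20", "n/a", "40"])

numbers = pd.to_numeric(raw, errors="coerce")   # "n/a" becomes NaN, dtype float64
as_int = numbers.astype("Int64")                # nullable integer dtype keeps the gap as <NA>

print(numbers.tolist())
print(as_int.dtype)
```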


13. Handling inconsistent capitalization in column values:

```python

df['column_name'] = df['column_name'].str.lower()  # convert values to lowercase

df['column_name'] = df['column_name'].str.upper()  # convert values to uppercase

```


14. Removing outliers from numerical columns:

```python

df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 3]

```
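
The filter above keeps rows within three standard deviations of the mean. On small samples an extreme value inflates the standard deviation and can slip through. The sketch below uses a different technique, the interquartile-range (IQR) rule with the conventional 1.5 multiplier, on made-up data; it is an alternative, not a replacement for the z-score filter:

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 10, 300]})

# z-score as in the filter above; with only 8 rows the value 300 stays below 3.
z = (df["value"] - df["value"].mean()).abs() / df["value"].std()

# IQR rule: keep values within 1.5 * IQR of the quartiles.
q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = df[df["value"].between(lower, upper)]

print(z.max())
print(filtered["value"].tolist())
```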


15. Extracting information from strings using regular expressions:

```python

df['new_column'] = df['column_name'].str.extract(r'(\d+)', expand=False)  # extract the first run of digits from each string

```
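
`str.extract` returns only the first match in each string, and the result is still text. A small sketch on made-up strings, also showing `str.findall` for cases where every match is needed:

```python
import pandas as pd

# Hypothetical free-text values for illustration only.
s = pd.Series(["order 123 shipped", "no digits here", "ref 45 and 67"])

first_number = s.str.extract(r"(\d+)", expand=False)   # first match per row, NaN if none
all_numbers = s.str.findall(r"\d+")                    # list of every match per row

print(first_number.tolist())
print(all_numbers.tolist())
```

The extracted values are strings; convert them with `pd.to_numeric` if numbers are needed.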


16. Handling inconsistent categories by merging them:

```python

df['column_name'] = df['column_name'].replace({'old_category': 'new_category'})

```
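
Merging categories usually works best after whitespace and capitalization have been normalized, so the mapping needs fewer entries. A minimal sketch with hypothetical country labels:

```python
import pandas as pd

# Hypothetical inconsistent labels for illustration only.
s = pd.Series(["USA", "U.S.", "united states", "Canada", "canada "])

normalized = s.str.strip().str.lower()   # remove outer whitespace, unify case
merged = normalized.replace({"usa": "united states", "u.s.": "united states"})

print(merged.value_counts())
```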


17. Creating dummy variables for categorical columns:

```python

dummy_cols = pd.get_dummies(df['column_name'])

df = pd.concat([df, dummy_cols], axis=1)

```
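
The two-step version above keeps the original column next to the dummies. As a sketch of a shorter variant, passing `columns=` to `pd.get_dummies` encodes the column, prefixes the new names, and drops the original in one call (the `color` and `size` columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Encode 'color' in place of the original column; new columns get a 'color_' prefix.
encoded = pd.get_dummies(df, columns=["color"], prefix="color")

print(encoded.columns.tolist())
```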


18. Applying mathematical functions to columns:

```python

df['new_column'] = df['column_name'].apply(math_function)  # math_function is a placeholder for any callable, e.g. np.sqrt

```
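
For functions that NumPy already provides, calling the function on the whole column is usually simpler and faster than `apply`. A sketch comparing both on made-up values, using the square root as the example function:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"x": [0.0, 1.0, 4.0, 9.0]})

df["sqrt_apply"] = df["x"].apply(math.sqrt)   # element by element, via apply
df["sqrt_vec"] = np.sqrt(df["x"])             # vectorized, on the whole column

print(df)
```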


19. Sorting the DataFrame by column values:

```python

df.sort_values('column_name', ascending=False, inplace=True)  # sort in descending order

```
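
Sorting also works on several columns at once, with a separate direction for each. A minimal sketch with hypothetical `team` and `score` columns; `reset_index(drop=True)` renumbers the rows after sorting:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"team": ["b", "a", "b", "a"], "score": [1, 5, 3, 2]})

# Sort by team ascending, then by score descending within each team.
ordered = (
    df.sort_values(["team", "score"], ascending=[True, False])
      .reset_index(drop=True)
)

print(ordered)
```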


20. Saving the cleaned DataFrame to a CSV file:

```python

df.to_csv('cleaned_data.csv', index=False)

```
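
`index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column. A small round-trip sketch, writing to an in-memory buffer instead of a real file so it runs anywhere:

```python
import io

import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)      # write CSV text without the index column
csv_text = buffer.getvalue()

restored = pd.read_csv(io.StringIO(csv_text))   # read it back

print(csv_text)
```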


These are just some of the commonly used data cleaning commands in Python. Depending on your specific data and cleaning requirements, you may need to explore additional techniques and functions.

Tuesday, January 31, 2023

Handling Missing Values in Python Pandas

Handling missing values in Pandas can be done using several methods:

  1. Drop missing values:
    • df.dropna(axis=0, how='any', inplace=True) - This will remove the rows containing any missing value.
    • df.dropna(axis=1, how='any', inplace=True) - This will remove the columns containing any missing value.
  2. Fill missing values with a constant value:
    • df.fillna(value, inplace=True) - This will replace all missing values with the given constant value.
  3. Fill missing values with the mean/median/mode of the column:
    • df.fillna(df.mean(numeric_only=True), inplace=True) - This will replace missing values in each numeric column with that column's mean.
    • df.fillna(df.median(numeric_only=True), inplace=True) - This will replace missing values in each numeric column with that column's median.
    • df.fillna(df.mode().iloc[0], inplace=True) - This will replace missing values in each column with that column's mode (the first one if there are ties).
  4. Fill missing values with the value of the previous/next row:
    • df.bfill(inplace=True) - This will replace each missing value with the next valid value below it in the column. (fillna(method='bfill') is deprecated since pandas 2.1.)
    • df.ffill(inplace=True) - This will replace each missing value with the last valid value above it in the column. (fillna(method='ffill') is deprecated since pandas 2.1.)

It is important to choose the right method for handling missing values based on the context of the data. 
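
As one illustration of matching the method to the column, the sketch below fills numeric columns with their median and a text column with its mode. The column names are hypothetical, and which statistic is appropriate depends on the data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "income": [100.0, 200.0, np.nan],
    "city": ["Oslo", None, "Oslo"],
})

# Numeric columns: fill each with its own median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Text column: fill with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(df)
```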

Saturday, January 28, 2023

Top 20 free data science datasets in CSV format with links

  1. Iris Flower Data Set: https://archive.ics.uci.edu/ml/datasets/iris
  2. Wine Quality Data Set: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
  3. Adult Income Data Set: https://archive.ics.uci.edu/ml/datasets/Adult
  4. Pima Indians Diabetes Data Set: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
  5. Titanic Data Set: https://www.kaggle.com/c/titanic/data
  6. Student Performance Data Set: https://archive.ics.uci.edu/ml/datasets/Student+Performance
  7. Breast Cancer Wisconsin Data Set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
  8. Housing Data Set: https://archive.ics.uci.edu/ml/datasets/Housing
  9. Mushroom Data Set: https://archive.ics.uci.edu/ml/datasets/Mushroom
  10. Forest Fires Data Set: https://archive.ics.uci.edu/ml/datasets/Forest+Fires
  11. Abalone Data Set: https://archive.ics.uci.edu/ml/datasets/Abalone
  12. Bank Marketing Data Set: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
  13. Car Evaluation Data Set: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
  14. Climate Model Simulation Data Set: https://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes
  15. Facebook Comment Volume Data Set: https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset
  16. Human Activity Recognition Data Set: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
  17. Insurance Company Benchmark Data Set: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29
  18. Kickstarter Projects Data Set: https://www.kaggle.com/kemical/kickstarter-Projects
  19. Online News Popularity Data Set: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
  20. Twitter Sentiment Analysis Data Set: https://www.kaggle.com/kazanova/sentiment140

Please note that these are just a few examples; many other free datasets are available online. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow each dataset's usage license.

Top 20 free data science Alzheimer's data sets with links


  1. Alzheimer's Disease Neuroimaging Initiative (ADNI): https://adni.loni.usc.edu/data-samples/access-data/
  2. National Alzheimer's Coordinating Center (NACC) database: https://www.alz.washington.edu/NACC/NACCdata.html
  3. The European Medical Information Framework for Alzheimer's Disease (EMIF-AD): https://www.emif.eu/data-access/
  4. The National Institute on Aging's Data Archive: https://www.nia.nih.gov/research/data-archive
  5. The National Alzheimer's Project Act (NAPA) Data and Surveillance Center: https://www.alzheimers.gov/research/data
  6. The Adult Changes in Thought (ACT) study: https://www.seattlecca.org/research/study/adult-changes-thought-study
  7. The Rush Memory and Aging Project: https://www.rush.edu/research/discoveries-and-innovations/memory-and-aging-project
  8. The Mayo Clinic Study of Aging: https://www.mayo.edu/research/centers-programs/mayo-clinic-study-aging/about
  9. The Canadian Consortium on Neurodegeneration in Aging (CCNA): https://ccna-ccnv.ca/data-access
  10. The National Alzheimer's Disease Genetics Study: https://www.nia.nih.gov/research/alzheimers-disease-genetics-study-nadgs
  11. The Genetic and Environmental Risk in AD (GERAD) study: https://www.alz.co.uk/research/our-research/gerad
  12. The National Alzheimer's Disease Genetics Repository: https://www.alzforum.org/research-resources/national-alzheimers-disease-genetics-repository
  13. The National Alzheimer's Disease Center (NADC) database: https://www.alzheimers.org.uk/research/data-and-samples
  14. The Genetic and Environmental Risk in AD (GERAD) study: https://www.alz.co.uk/research/our-research/gerad
  15. The Alzheimer's Disease Neuroimaging Initiative 2 (ADNI2): https://adni2.loni.usc.edu/
  16. The Human Connectome Project (HCP): https://www.humanconnectome.org/study/hcp-young-adult/document/data-use-agreement-and-download-instructions
  17. The National Alzheimer's Coordinating Center (NACC) database: https://www.alz.washington.edu/NACC/NACCdata.html
  18. The National Alzheimer's Disease Center (NADC) database: https://www.alzheimers.org.uk/research/data-and-samples
  19. The Alzheimer's Disease Metabolomics Consortium (ADMC) dataset: https://www.alz.washington.edu/research/ADMCdata.html
  20. The Genetic and Environmental Risk in AD (GERAD) study: https://www.alz.co.uk/research/our-research/gerad

Please note that these are just a few examples; many other free datasets are available online. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow each dataset's usage license.

Top 20 free data science healthcare data sets with links

  1. MIMIC-III: https://mimic.physionet.org/
  2. NHANES: https://www.cdc.gov/nchs/nhanes/index.htm
  3. Framingham Heart Study: https://www.framinghamheartstudy.org/data-download/
  4. UK Biobank: https://www.ukbiobank.ac.uk/
  5. Million Veteran Program: https://www.research.va.gov/programs/mvp/index.cfm
  6. The Cancer Genome Atlas: https://cancergenome.nih.gov/
  7. Partners HealthCare: https://www.partners.org/For-Researchers/Research-Resources/Data-Sets/Partners-HealthCare-Research-Patient-Data-Sets.aspx
  8. LHDDS: https://www.healthdata.gov/dataset/large-hospital-dataset-discharge-summary-lhdds
  9. Public Use Microdata Sample (PUMS): https://www.census.gov/programs-surveys/acs/data/pums.html
  10. Data.gov: https://www.healthdata.gov/
  11. The National Cancer Institute: https://www.cancer.gov/data
  12. The National Center for Health Statistics: https://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
  13. The World Health Organization: https://www.who.int/data/gho
  14. The National Library of Medicine: https://www.nlm.nih.gov/databases/download/
  15. The Medical Expenditure Panel Survey: https://meps.ahrq.gov/data_stats/download_data_files.jsp
  16. The Behavioral Risk Factor Surveillance System: https://www.cdc.gov/brfss/data_documentation/index.htm
  17. The National Hospital Discharge Survey: https://www.cdc.gov/nchs/nhds/index.htm
  18. The National Survey of Children’s Health: https://www.cdc.gov/nchs/slaits/nsch.htm
  19. The National Survey of Children with Special Health Care Needs: https://www.cdc.gov/nchs/slaits/nsch.htm
  20. The National Health Interview Survey: https://www.cdc.gov/nchs/nhis/index.htm

Please note that these are just a few examples; many other free datasets are available online. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow the usage license of each dataset before use.

Top 20 free data science data sets with links

  1. Iris Flowers: https://archive.ics.uci.edu/ml/datasets/Iris
  2. Titanic: https://www.kaggle.com/c/titanic/data
  3. Adult Income: https://archive.ics.uci.edu/ml/datasets/Adult
  4. Wine Quality: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
  5. Boston Housing: https://archive.ics.uci.edu/ml/datasets/Housing
  6. Student Performance: https://archive.ics.uci.edu/ml/datasets/Student+Performance
  7. Breast Cancer Wisconsin: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
  8. Black Friday: https://www.kaggle.com/mehdidag/black-friday
  9. Pokemon: https://www.kaggle.com/abcsds/pokemon
  10. FIFA 18: https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset
  11. US Accidents: https://www.kaggle.com/sobhanmoosavi/us-accidents
  12. IMDB Movies: https://www.kaggle.com/tmdb/tmdb-movie-metadata
  13. Human Resources: https://www.kaggle.com/ludobenistant/hr-analytics
  14. Twitter Sentiment Analysis: https://www.kaggle.com/kazanova/sentiment140
  15. Car Evaluation: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
  16. Air Quality: https://archive.ics.uci.edu/ml/datasets/Air+Quality
  17. SMS Spam Collection: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
  18. Heart Disease: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
  19. Chicago Crime: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
  20. NYC Taxi Trip: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Please note that these are just a few examples; many other free datasets are available online. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow the usage license of each dataset before use.