Suraj Joshi's Blog

Sunday, July 2, 2023


Saturday, July 1, 2023

20 Must Know Python Data Cleaning Commands

1. Importing necessary libraries:

```python

import pandas as pd

import numpy as np

```

2. Reading data from a CSV file:

```python

df = pd.read_csv('filename.csv')

```

3. Checking the first few rows of the DataFrame:

```python

df.head()

```

4. Checking the information about the DataFrame, including data types and missing values:

```python

df.info()

```
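`df.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. A minimal sketch of counting them directly, assuming a small made-up DataFrame (the `sample` data and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns.
sample = pd.DataFrame({
    "age": [31, np.nan, 45, np.nan],
    "city": ["Oslo", "Lima", None, "Pune"],
})

missing_count = sample.isna().sum()   # missing values per column
missing_share = sample.isna().mean()  # fraction of rows missing per column

print(missing_count)
print(missing_share)
```

Looking at the share per column first can help decide whether dropping or filling is the more reasonable next step.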

5. Handling missing values by dropping rows or columns:

```python

df.dropna() # drop rows with any missing value

df.dropna(axis=1) # drop columns with any missing value

```
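Note that `dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A small sketch of both directions, using made-up data (the `sample` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one complete row, two rows with gaps.
sample = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [31, np.nan, 45],
    "city": ["Oslo", "Lima", None],
})

rows_kept = sample.dropna()        # keeps only rows with no missing values
cols_kept = sample.dropna(axis=1)  # keeps only columns with no missing values

print(rows_kept)
print(cols_kept)
print(sample.shape)  # the original DataFrame is unchanged
```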

6. Handling missing values by filling with a specific value:

```python

df.fillna(value) # fill missing values with a specific value

```

7. Handling missing values by forward filling or backward filling:

```python

df.ffill() # forward fill missing values

df.bfill() # backward fill missing values

```
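To see what the two directions do, here is a short sketch on a made-up Series (the `readings` values are hypothetical). Forward fill copies the last seen value down, backward fill copies the next value up, and a gap with nothing after it stays missing under backward fill:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps in the middle and at the end.
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = readings.ffill()   # each gap takes the previous known value
backward = readings.bfill()  # each gap takes the next known value

print(forward.tolist())
print(backward.tolist())
```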

8. Handling duplicates by dropping them:

```python

df.drop_duplicates() # drop duplicate rows

```
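By default `drop_duplicates()` compares whole rows. The `subset` and `keep` parameters narrow that down. A rough sketch on made-up data (the `orders` frame is an assumption):

```python
import pandas as pd

# Hypothetical orders: rows 0 and 1 are exact duplicates.
orders = pd.DataFrame({
    "customer": ["a", "a", "b", "a"],
    "item": ["pen", "pen", "ink", "cap"],
})

# Drop rows that repeat an earlier row in every column.
exact = orders.drop_duplicates()

# Keep only the last row seen for each customer.
by_customer = orders.drop_duplicates(subset="customer", keep="last")

print(exact)
print(by_customer)
```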

9. Renaming columns:

```python

df.rename(columns={'old_name': 'new_name'}, inplace=True) # rename columns

```

10. Removing leading/trailing whitespace from column values:

```python

df['column_name'] = df['column_name'].str.strip()

```

11. Removing special characters from column values:

```python

df['column_name'] = df['column_name'].str.replace(r'[^\w\s]', '', regex=True) # regex=True is required in pandas 2.0+

```
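A quick sketch of items 10 and 11 chained together, on made-up strings (the `names` values are hypothetical). It passes `regex=True` explicitly, since recent pandas versions treat the pattern as a literal string otherwise:

```python
import pandas as pd

# Hypothetical messy text values.
names = pd.Series(["  Acme, Inc.  ", "R&D dept.", "data-cleaning!"])

# Strip outer whitespace, then remove anything that is not a
# word character or whitespace.
cleaned = names.str.strip().str.replace(r"[^\w\s]", "", regex=True)

print(cleaned.tolist())
```

Keep in mind that this pattern also removes hyphens and apostrophes, which may or may not be what you want for names.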

12. Changing data types of columns:

```python

df['column_name'] = df['column_name'].astype('new_type')

```
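`astype` raises an error if any value cannot be converted. For numeric columns that contain stray text, `pd.to_numeric` with `errors="coerce"` is one possible alternative. A minimal sketch on made-up values:

```python
import pandas as pd

# Clean strings convert directly.
as_int = pd.Series(["1", "2", "3"]).astype("int64")

# Hypothetical column with a non-numeric placeholder.
raw = pd.Series(["10", "20", "n/a"])

# Values that cannot be parsed become NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

print(as_int.dtype)
print(numeric.tolist())
```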

13. Handling inconsistent capitalization in column values:

```python

df['column_name'] = df['column_name'].str.lower() # convert values to lowercase

df['column_name'] = df['column_name'].str.upper() # convert values to uppercase

```

14. Removing outliers from numerical columns:

```python

df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 3]

```
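The line above keeps rows whose z-score is below 3 in absolute value. Here is the same idea split into steps on made-up numbers (the `values` frame is an assumption). Two things to be aware of: rows with a missing value in the column are dropped as well, because the comparison is False for NaN, and with roughly ten or fewer rows no value can reach a z-score of 3, so nothing gets filtered.

```python
import pandas as pd

# Hypothetical data: the values 1..20 plus one extreme value.
values = pd.DataFrame({"value": list(range(1, 21)) + [500]})

# z-score: distance from the mean in units of standard deviation.
z_scores = (values["value"] - values["value"].mean()) / values["value"].std()

# Keep rows within 3 standard deviations of the mean.
filtered = values[z_scores.abs() < 3]

print(len(values), len(filtered))
```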

15. Extracting information from strings using regular expressions:

```python

df['new_column'] = df['column_name'].str.extract(r'(\d+)', expand=False) # extract the first run of digits from a string column

```
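A small sketch of what this returns, using made-up addresses (the `addresses` values are hypothetical). Only the first match is captured, the result is still text, and rows without a match become missing:

```python
import pandas as pd

# Hypothetical address strings.
addresses = pd.Series(["12 Main St", "Apt 7, 300 Oak Ave", "no number"])

# expand=False returns a Series for a single capture group.
first_number = addresses.str.extract(r"(\d+)", expand=False)

# Convert the extracted text to numbers; missing stays missing.
as_number = pd.to_numeric(first_number)

print(first_number.tolist())
print(as_number.tolist())
```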

16. Handling inconsistent categories by merging them:

```python

df['column_name'] = df['column_name'].replace({'old_category': 'new_category'})

```
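In practice, inconsistent categories often differ in case or spacing as well as spelling. One possible approach, sketched on made-up values, is to normalize the text first and then map the remaining variants (the labels and the mapping are assumptions):

```python
import pandas as pd

# Hypothetical category column with several spellings of one value.
country = pd.Series(["USA", "usa ", "U.S.", "Nepal"])

normalized = (
    country.str.strip()   # remove outer whitespace
    .str.lower()          # unify capitalization
    .replace({"usa": "us", "u.s.": "us"})  # merge remaining variants
)

print(normalized.tolist())
```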

17. Creating dummy variables for categorical columns:

```python

dummy_cols = pd.get_dummies(df['column_name'])

df = pd.concat([df, dummy_cols], axis=1)

```
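The snippet above keeps the original column next to the new dummy columns. If you would rather replace it, a sketch along these lines may help (the `df_demo` frame and the `color` prefix are made up for illustration):

```python
import pandas as pd

# Hypothetical data with one categorical column.
df_demo = pd.DataFrame({"id": [1, 2, 3], "color": ["red", "blue", "red"]})

# One indicator column per category, named color_<category>.
dummies = pd.get_dummies(df_demo["color"], prefix="color")

# Drop the original column and attach the indicators.
encoded = pd.concat([df_demo.drop(columns="color"), dummies], axis=1)

print(encoded)
```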

18. Applying mathematical functions to columns:

```python

df['new_column'] = df['column_name'].apply(math_function) # math_function stands for any one-argument function, e.g. np.sqrt

```
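Here `math_function` is a placeholder. As a minimal sketch, assuming the function is a square root and the data is made up, `apply` calls the function once per value, while NumPy functions can usually take the whole column at once:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical numeric column.
areas = pd.Series([0.0, 9.0, 16.0])

via_apply = areas.apply(math.sqrt)  # called once per element
vectorized = np.sqrt(areas)         # applied to the whole column

print(via_apply.tolist())
print(vectorized.tolist())
```

On large columns the vectorized form is typically faster, so it is worth trying first when a NumPy equivalent exists.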

19. Sorting the DataFrame by column values:

```python

df.sort_values('column_name', ascending=False, inplace=True) # sort in descending order

```

20. Saving the cleaned DataFrame to a CSV file:

```python

df.to_csv('cleaned_data.csv', index=False)

```

These are just some of the commonly used data cleaning commands in Python. Depending on your specific data and cleaning requirements, you may need to explore additional techniques and functions.

Tuesday, January 31, 2023

Handling Missing Values in Python Pandas

Handling missing values in Pandas can be done using several methods:

Drop missing values:

df.dropna(axis=0, how='any', inplace=True) - This will remove the rows containing any missing value.
df.dropna(axis=1, how='any', inplace=True) - This will remove the columns containing any missing value.

Fill missing values with a constant value:

df.fillna(value, inplace=True) - This will replace all missing values with the given constant value.

Fill missing values with the mean/median/mode of the column:

df.fillna(df.mean(numeric_only=True), inplace=True) - This will replace missing values in numeric columns with the mean of each column.
df.fillna(df.median(numeric_only=True), inplace=True) - This will replace missing values in numeric columns with the median of each column.
df.fillna(df.mode().iloc[0], inplace=True) - This will replace all missing values with the mode (most frequent value) of each column.

Fill missing values with the value of the previous/next row:

df.bfill(inplace=True) - This will replace each missing value with the next non-missing value in the column.
df.ffill(inplace=True) - This will replace each missing value with the previous non-missing value in the column.

The older df.fillna(method='bfill') and df.fillna(method='ffill') forms are deprecated in recent pandas releases.

It is important to choose the right method for handling missing values based on the context of the data.
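As one illustration of matching the method to the data, here is a rough sketch that fills a numeric column with its median and a text column with its mode. The `people` frame and its columns are made up, and the choice of median and mode is an assumption, not a rule:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one text column.
people = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": ["Oslo", "Oslo", None, "Lima"],
})

filled = people.copy()  # leave the original data untouched

# Numeric column: fill with the median of the known values.
filled["age"] = filled["age"].fillna(filled["age"].median())

# Text column: fill with the most frequent value.
filled["city"] = filled["city"].fillna(filled["city"].mode().iloc[0])

print(filled)
```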

Saturday, January 28, 2023

Top 20 free data science datasets in csv format with links

Iris Flower Data Set: https://archive.ics.uci.edu/ml/datasets/iris
Wine Quality Data Set: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
Adult Income Data Set: https://archive.ics.uci.edu/ml/datasets/Adult
Pima Indians Diabetes Data Set: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Titanic Data Set: https://www.kaggle.com/c/titanic/data
Student Performance Data Set: https://archive.ics.uci.edu/ml/datasets/Student+Performance
Breast Cancer Wisconsin Data Set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Housing Data Set: https://archive.ics.uci.edu/ml/datasets/Housing
Mushroom Data Set: https://archive.ics.uci.edu/ml/datasets/Mushroom
Forest Fires Data Set: https://archive.ics.uci.edu/ml/datasets/Forest+Fires
Abalone Data Set: https://archive.ics.uci.edu/ml/datasets/Abalone
Bank Marketing Data Set: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Car Evaluation Data Set: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
Climate Model Simulation Data Set: https://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes
Facebook Comment Volume Data Set: https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset
Human Activity Recognition Data Set: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Insurance Company Benchmark Data Set: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29
Kickstarter Projects Data Set: https://www.kaggle.com/kemical/kickstarter-Projects
Online News Popularity Data Set: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
Twitter Sentiment Analysis Data Set: https://www.kaggle.com/kazanova/sentiment140

Please note that these are just a few examples; many other free datasets are available on the internet. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow each dataset's usage license.

Top free data science Alzheimer's data sets with links

Alzheimer's Disease Neuroimaging Initiative (ADNI): https://adni.loni.usc.edu/data-samples/access-data/
National Alzheimer's Coordinating Center (NACC) database: https://www.alz.washington.edu/NACC/NACCdata.html
The European Medical Information Framework for Alzheimer's Disease (EMIF-AD): https://www.emif.eu/data-access/
The National Institute on Aging's Data Archive: https://www.nia.nih.gov/research/data-archive
The National Alzheimer's Project Act (NAPA) Data and Surveillance Center: https://www.alzheimers.gov/research/data
The Adult Changes in Thought (ACT) study: https://www.seattlecca.org/research/study/adult-changes-thought-study
The Rush Memory and Aging Project: https://www.rush.edu/research/discoveries-and-innovations/memory-and-aging-project
The Mayo Clinic Study of Aging: https://www.mayo.edu/research/centers-programs/mayo-clinic-study-aging/about
The Canadian Consortium on Neurodegeneration in Aging (CCNA): https://ccna-ccnv.ca/data-access
The National Alzheimer's Disease Genetics Study: https://www.nia.nih.gov/research/alzheimers-disease-genetics-study-nadgs
The Genetic and Environmental Risk in AD (GERAD) study: https://www.alz.co.uk/research/our-research/gerad
The National Alzheimer's Disease Genetics Repository: https://www.alzforum.org/research-resources/national-alzheimers-disease-genetics-repository
The National Alzheimer's Disease Center (NADC) database: https://www.alzheimers.org.uk/research/data-and-samples
The Alzheimer's Disease Neuroimaging Initiative 2 (ADNI2): https://adni2.loni.usc.edu/
The Human Connectome Project (HCP): https://www.humanconnectome.org/study/hcp-young-adult/document/data-use-agreement-and-download-instructions
The Alzheimer's Disease Metabolomics Consortium (ADMC) dataset: https://www.alz.washington.edu/research/ADMCdata.html

Top 20 free data science healthcare data sets with links

MIMIC-III: https://mimic.physionet.org/
NHANES: https://www.cdc.gov/nchs/nhanes/index.htm
Framingham Heart Study: https://www.framinghamheartstudy.org/data-download/
UK Biobank: https://www.ukbiobank.ac.uk/
Million Veteran Program: https://www.research.va.gov/programs/mvp/index.cfm
The Cancer Genome Atlas: https://cancergenome.nih.gov/
Partners HealthCare: https://www.partners.org/For-Researchers/Research-Resources/Data-Sets/Partners-HealthCare-Research-Patient-Data-Sets.aspx
LHDDS: https://www.healthdata.gov/dataset/large-hospital-dataset-discharge-summary-lhdds
Public Use Microdata Sample (PUMS): https://www.census.gov/programs-surveys/acs/data/pums.html
Data.gov: https://www.healthdata.gov/
The National Cancer Institute: https://www.cancer.gov/data
The National Center for Health Statistics: https://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
The World Health Organization: https://www.who.int/data/gho
The National Library of Medicine: https://www.nlm.nih.gov/databases/download/
The Medical Expenditure Panel Survey: https://meps.ahrq.gov/data_stats/download_data_files.jsp
The Behavioral Risk Factor Surveillance System: https://www.cdc.gov/brfss/data_documentation/index.htm
The National Hospital Discharge Survey: https://www.cdc.gov/nchs/nhds/index.htm
The National Survey of Children’s Health: https://www.cdc.gov/nchs/slaits/nsch.htm
The National Survey of Children with Special Health Care Needs: https://www.cdc.gov/nchs/slaits/nsch.htm
The National Health Interview Survey: https://www.cdc.gov/nchs/nhis/index.htm

Top 20 free data science data sets with links

Iris Flowers: https://archive.ics.uci.edu/ml/datasets/Iris
Titanic: https://www.kaggle.com/c/titanic/data
Adult Income: https://archive.ics.uci.edu/ml/datasets/Adult
Wine Quality: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
Boston Housing: https://archive.ics.uci.edu/ml/datasets/Housing
Student Performance: https://archive.ics.uci.edu/ml/datasets/Student+Performance
Breast Cancer Wisconsin: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Black Friday: https://www.kaggle.com/mehdidag/black-friday
Pokemon: https://www.kaggle.com/abcsds/pokemon
FIFA 18: https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset
US Accidents: https://www.kaggle.com/sobhanmoosavi/us-accidents
IMDB Movies: https://www.kaggle.com/tmdb/tmdb-movie-metadata
Human Resources: https://www.kaggle.com/ludobenistant/hr-analytics
Twitter Sentiment Analysis: https://www.kaggle.com/kazanova/sentiment140
Car Evaluation: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
Air Quality: https://archive.ics.uci.edu/ml/datasets/Air+Quality
SMS Spam Collection: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
Heart Disease: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Chicago Crime: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
NYC Taxi Trip: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page