Suraj Joshi's Blog
Sunday, July 2, 2023
FAVS
Saturday, July 1, 2023
20 Must Know Python Data Cleaning Commands
1. Importing necessary libraries:
```python
import pandas as pd
import numpy as np
```
2. Reading data from a CSV file:
```python
df = pd.read_csv('filename.csv')
```
3. Checking the first few rows of the DataFrame:
```python
df.head()
```
4. Checking the information about the DataFrame, including data types and missing values:
```python
df.info()
```
5. Handling missing values by dropping rows or columns:
```python
df.dropna()        # returns a copy without the rows that contain any missing value
df.dropna(axis=1)  # returns a copy without the columns that contain any missing value
```
6. Handling missing values by filling with a specific value:
```python
df.fillna(value) # fill missing values with a specific value
```
7. Handling missing values by forward filling or backward filling:
```python
df.ffill() # forward fill missing values
df.bfill() # backward fill missing values
```
8. Handling duplicates by dropping them:
```python
df.drop_duplicates() # drop duplicate rows
```
9. Renaming columns:
```python
df.rename(columns={'old_name': 'new_name'}, inplace=True) # rename columns
```
10. Removing leading/trailing whitespaces from column values:
```python
df['column_name'] = df['column_name'].str.strip()
```
11. Removing special characters from column values:
```python
df['column_name'] = df['column_name'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required in pandas 2.0+
```
12. Changing data types of columns:
```python
df['column_name'] = df['column_name'].astype('new_type')
```
13. Handling inconsistent capitalization in column values:
```python
df['column_name'] = df['column_name'].str.lower() # convert values to lowercase
df['column_name'] = df['column_name'].str.upper() # convert values to uppercase
```
14. Removing outliers from numerical columns:
```python
df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 3]
```
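As a quick check of the filter above, here is a minimal sketch on made-up numbers (the column name `value` and the data are hypothetical). One thing to keep in mind: on very small samples a single extreme value inflates the standard deviation so much that its z-score can never reach 3, so the rule only works once there are enough rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 typical readings plus one extreme value
df = pd.DataFrame({'value': [10.0] * 10 + [12.0] * 10 + [500.0]})

# z-score: distance from the mean in units of standard deviation
z_scores = (df['value'] - df['value'].mean()) / df['value'].std()

# Keep only rows within 3 standard deviations of the mean
filtered = df[np.abs(z_scores) < 3]

print(len(df), len(filtered))  # 21 20
```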
15. Extracting information from strings using regular expressions:
```python
df['new_column'] = df['column_name'].str.extract(r'(\d+)') # extract the first run of digits from each value (as text)
```
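The extracted values come back as strings, and rows without a match become missing. A small sketch of turning them into numbers, using made-up sample values:

```python
import pandas as pd

# Hypothetical sample values; only the column names come from the snippet above
df = pd.DataFrame({'column_name': ['order 12', 'no digits', 'item 7 of 30']})

# expand=False returns a Series when the pattern has a single capture group
df['new_column'] = df['column_name'].str.extract(r'(\d+)', expand=False)

# Convert the extracted text to numbers; rows without a match stay missing
df['new_column'] = pd.to_numeric(df['new_column'])

print(df)
```

Only the first match per row is captured, so 'item 7 of 30' yields 7, not 30.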
16. Handling inconsistent categories by merging them:
```python
df['column_name'] = df['column_name'].replace({'old_category': 'new_category'})
```
17. Creating dummy variables for categorical columns:
```python
dummy_cols = pd.get_dummies(df['column_name'])
df = pd.concat([df, dummy_cols], axis=1)
```
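A slightly fuller sketch of the same idea, on hypothetical data: the `prefix` argument keeps the new column names traceable to their source, and the original categorical column is dropped once it has been encoded.

```python
import pandas as pd

# Hypothetical data with one categorical column
df = pd.DataFrame({'id': [1, 2, 3], 'color': ['red', 'blue', 'red']})

# One-hot encode 'color'; the prefix is prepended to each category name
dummy_cols = pd.get_dummies(df['color'], prefix='color')

# Append the dummy columns and drop the original categorical column
encoded = pd.concat([df.drop(columns='color'), dummy_cols], axis=1)

print(list(encoded.columns))  # ['id', 'color_blue', 'color_red']
```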
18. Applying mathematical functions to columns:
```python
df['new_column'] = df['column_name'].apply(math_function)  # math_function is a placeholder, e.g. np.sqrt
```
19. Sorting the DataFrame by column values:
```python
df.sort_values('column_name', ascending=False, inplace=True) # sort in descending order
```
20. Saving the cleaned DataFrame to a CSV file:
```python
df.to_csv('cleaned_data.csv', index=False)
```
These are just some of the commonly used data cleaning commands in Python. Depending on your specific data and cleaning requirements, you may need to explore additional techniques and functions.
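To see how several of the commands fit together, here is a minimal end-to-end sketch. The data, the column names, and the order of the steps are assumptions made for illustration; an in-memory string stands in for 'filename.csv' so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for 'filename.csv'
raw_csv = io.StringIO(
    "Name,City,Age\n"
    " Alice ,new york,30\n"
    "Bob,NEW YORK,\n"
    " Alice ,new york,30\n"
    "Carol,Boston!,40\n"
)

df = pd.read_csv(raw_csv)                                                # command 2
df = df.drop_duplicates()                                                # command 8
df = df.rename(columns={'Name': 'name', 'City': 'city', 'Age': 'age'})   # command 9
df['name'] = df['name'].str.strip()                                      # command 10
df['city'] = df['city'].str.replace(r'[^\w\s]', '', regex=True)          # command 11
df['city'] = df['city'].str.lower()                                      # command 13
df['age'] = df['age'].fillna(df['age'].median())                         # command 6
df['age'] = df['age'].astype(int)                                        # command 12
df = df.sort_values('age', ascending=False)                              # command 19

# Command 20 would be df.to_csv('cleaned_data.csv', index=False);
# printing the CSV text avoids writing a file in this sketch
print(df.to_csv(index=False))
```

Filling the missing age before calling `astype(int)` matters here, because a column that still contains missing values cannot be converted to a plain integer type.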
Tuesday, January 31, 2023
Handling Missing Values in Python Pandas
Handling missing values in Pandas can be done using several methods:
- Drop missing values:
  - df.dropna(axis=0, how='any', inplace=True) - This removes every row that contains a missing value.
  - df.dropna(axis=1, how='any', inplace=True) - This removes every column that contains a missing value.
- Fill missing values with a constant value:
  - df.fillna(value, inplace=True) - This replaces all missing values with the given constant.
- Fill missing values with the mean/median/mode of the column:
  - df.fillna(df.mean(numeric_only=True), inplace=True) - This replaces missing values in each numeric column with that column's mean.
  - df.fillna(df.median(numeric_only=True), inplace=True) - This replaces missing values in each numeric column with that column's median.
  - df.fillna(df.mode().iloc[0], inplace=True) - This replaces missing values in each column with that column's most frequent value (the first one if there is a tie).
- Fill missing values with the value of the previous/next row:
  - df.bfill(inplace=True) - This replaces each missing value with the next non-missing value in the column. Older code writes this as df.fillna(method='bfill'), which recent pandas versions deprecate.
  - df.ffill(inplace=True) - This replaces each missing value with the previous non-missing value in the column. Older code writes this as df.fillna(method='ffill').
It is important to choose the right method for handling missing values based on the context of the data.
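The methods above can be compared side by side on a small example. This is a minimal sketch on made-up data (the column names `score` and `grade` are hypothetical); it uses the non-inplace forms so each result can be inspected separately.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a text column
df = pd.DataFrame({
    'score': [2.0, np.nan, 2.0, 5.0, np.nan],
    'grade': ['a', None, 'b', 'b', 'b'],
})

dropped = df.dropna()                                # keep only complete rows
mean_filled = df.fillna(df.mean(numeric_only=True))  # fills numeric columns only
mode_filled = df.fillna(df.mode().iloc[0])           # most frequent value per column
forward = df.ffill()                                 # copy the previous value down
backward = df.bfill()                                # copy the next value up

print(mean_filled)
print(backward)
```

Two details show up in the output: the mean fill leaves the text column untouched, and the backward fill cannot fill the last `score` because no later value exists.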
Saturday, January 28, 2023
Top 20 free data science datasets in csv format with links
- Iris Flower Data Set: https://archive.ics.uci.edu/ml/datasets/iris
- Wine Quality Data Set: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
- Adult Income Data Set: https://archive.ics.uci.edu/ml/datasets/Adult
- Pima Indians Diabetes Data Set: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
- Titanic Data Set: https://www.kaggle.com/c/titanic/data
- Student Performance Data Set: https://archive.ics.uci.edu/ml/datasets/Student+Performance
- Breast Cancer Wisconsin Data Set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
- Housing Data Set: https://archive.ics.uci.edu/ml/datasets/Housing
- Mushroom Data Set: https://archive.ics.uci.edu/ml/datasets/Mushroom
- Forest Fires Data Set: https://archive.ics.uci.edu/ml/datasets/Forest+Fires
- Abalone Data Set: https://archive.ics.uci.edu/ml/datasets/Abalone
- Bank Marketing Data Set: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
- Car Evaluation Data Set: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
- Climate Model Simulation Data Set: https://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes
- Facebook Comment Volume Data Set: https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset
- Human Activity Recognition Data Set: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
- Insurance Company Benchmark Data Set: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29
- Kickstarter Projects Data Set: https://www.kaggle.com/kemical/kickstarter-Projects
- Online News Popularity Data Set: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
- Twitter Sentiment Analysis Data Set: https://www.kaggle.com/kazanova/sentiment140
Please note that these are just a few examples; many other free datasets are available online. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow each dataset's usage license.
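Most of the links above point to landing pages rather than to the CSV file itself, so the usual workflow is to download the file and then read it with pandas. Some files ship without a header row, in which case the column names have to be supplied. A minimal sketch, assuming a header-less file; the rows and the column names below are illustrative stand-ins, and an in-memory string replaces the downloaded file:

```python
import io
import pandas as pd

# Stand-in for a downloaded, header-less CSV file (illustrative rows only)
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

# Column names are assumptions chosen for this sketch
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# header=None tells pandas that the first line is data, not column names
df = pd.read_csv(sample, header=None, names=columns)

print(df.shape)
print(df['species'].value_counts())
```

With a real download, the path to the saved file takes the place of `sample`.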
Top free data science Alzheimer's data sets with links
- Alzheimer's Disease Neuroimaging Initiative (ADNI): https://adni.loni.usc.edu/data-samples/access-data/
- National Alzheimer's Coordinating Center (NACC) database: https://www.alz.washington.edu/NACC/NACCdata.html
- The European Medical Information Framework for Alzheimer's Disease (EMIF-AD): https://www.emif.eu/data-access/
- The National Institute on Aging's Data Archive: https://www.nia.nih.gov/research/data-archive
- The National Alzheimer's Project Act (NAPA) Data and Surveillance Center: https://www.alzheimers.gov/research/data
- The Adult Changes in Thought (ACT) study: https://www.seattlecca.org/research/study/adult-changes-thought-study
- The Rush Memory and Aging Project: https://www.rush.edu/research/discoveries-and-innovations/memory-and-aging-project
- The Mayo Clinic Study of Aging: https://www.mayo.edu/research/centers-programs/mayo-clinic-study-aging/about
- The Canadian Consortium on Neurodegeneration in Aging (CCNA): https://ccna-ccnv.ca/data-access
- The National Alzheimer's Disease Genetics Study: https://www.nia.nih.gov/research/alzheimers-disease-genetics-study-nadgs
- The Genetic and Environmental Risk in AD (GERAD) study: https://www.alz.co.uk/research/our-research/gerad
- The National Alzheimer's Disease Genetics Repository: https://www.alzforum.org/research-resources/national-alzheimers-disease-genetics-repository
- The National Alzheimer's Disease Center (NADC) database: https://www.alzheimers.org.uk/research/data-and-samples
- The Alzheimer's Disease Neuroimaging Initiative 2 (ADNI2): https://adni2.loni.usc.edu/
- The Human Connectome Project (HCP): https://www.humanconnectome.org/study/hcp-young-adult/document/data-use-agreement-and-download-instructions
- The Alzheimer's Disease Metabolomics Consortium (ADMC) dataset: https://www.alz.washington.edu/research/ADMCdata.html
Please note that these are just a few examples; many other free datasets are available online. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow each dataset's usage license.
Top 20 free data science healthcare data sets with links
- MIMIC-III: https://mimic.physionet.org/
- NHANES: https://www.cdc.gov/nchs/nhanes/index.htm
- Framingham Heart Study: https://www.framinghamheartstudy.org/data-download/
- UK Biobank: https://www.ukbiobank.ac.uk/
- Million Veteran Program: https://www.research.va.gov/programs/mvp/index.cfm
- The Cancer Genome Atlas: https://cancergenome.nih.gov/
- Partners HealthCare: https://www.partners.org/For-Researchers/Research-Resources/Data-Sets/Partners-HealthCare-Research-Patient-Data-Sets.aspx
- LHDDS: https://www.healthdata.gov/dataset/large-hospital-dataset-discharge-summary-lhdds
- Public Use Microdata Sample (PUMS): https://www.census.gov/programs-surveys/acs/data/pums.html
- HealthData.gov: https://www.healthdata.gov/
- The National Cancer Institute: https://www.cancer.gov/data
- The National Center for Health Statistics: https://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
- The World Health Organization: https://www.who.int/data/gho
- The National Library of Medicine: https://www.nlm.nih.gov/databases/download/
- The Medical Expenditure Panel Survey: https://meps.ahrq.gov/data_stats/download_data_files.jsp
- The Behavioral Risk Factor Surveillance System: https://www.cdc.gov/brfss/data_documentation/index.htm
- The National Hospital Discharge Survey: https://www.cdc.gov/nchs/nhds/index.htm
- The National Survey of Children’s Health: https://www.cdc.gov/nchs/slaits/nsch.htm
- The National Survey of Children with Special Health Care Needs: https://www.cdc.gov/nchs/slaits/nsch.htm
- The National Health Interview Survey: https://www.cdc.gov/nchs/nhis/index.htm
Please note that these are just a few examples; many other free datasets are available online. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow the usage license of each dataset before use.
Top 20 free data science data sets with links
- Iris Flowers: https://archive.ics.uci.edu/ml/datasets/Iris
- Titanic: https://www.kaggle.com/c/titanic/data
- Adult Income: https://archive.ics.uci.edu/ml/datasets/Adult
- Wine Quality: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
- Boston Housing: https://archive.ics.uci.edu/ml/datasets/Housing
- Student Performance: https://archive.ics.uci.edu/ml/datasets/Student+Performance
- Breast Cancer Wisconsin: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- Black Friday: https://www.kaggle.com/mehdidag/black-friday
- Pokemon: https://www.kaggle.com/abcsds/pokemon
- FIFA 18: https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset
- US Accidents: https://www.kaggle.com/sobhanmoosavi/us-accidents
- TMDB Movies: https://www.kaggle.com/tmdb/tmdb-movie-metadata
- Human Resources: https://www.kaggle.com/ludobenistant/hr-analytics
- Twitter Sentiment Analysis: https://www.kaggle.com/kazanova/sentiment140
- Car Evaluation: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
- Air Quality: https://archive.ics.uci.edu/ml/datasets/Air+Quality
- SMS Spam Collection: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
- Heart Disease: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
- Chicago Crime: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
- NYC Taxi Trip: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Please note that these are just a few examples; many other free datasets are available online. These datasets may be updated or removed by their authors in the future. You can use them for educational and research purposes, but please read and follow the usage license of each dataset before use.