Saturday, July 1, 2023

20 Must-Know Python Data Cleaning Commands

1. Importing necessary libraries:

```python

import pandas as pd

import numpy as np

```


2. Reading data from a CSV file:

```python

df = pd.read_csv('filename.csv')

```


3. Checking the first few rows of the DataFrame:

```python

df.head()

```


4. Checking the information about the DataFrame, including data types and missing values:

```python

df.info()

```
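
`df.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. As a small sketch, the missing values per column can also be counted directly with `isna()`. The sample DataFrame and its column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    'name': ['Ann', 'Bob', None, 'Dee'],
    'age': [34, np.nan, 29, np.nan],
})

missing_counts = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```

Looking at the share rather than the raw count can help when deciding whether to drop a column or fill it.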


5. Handling missing values by dropping rows or columns:

```python

df.dropna()  # drop rows with any missing value (returns a new DataFrame; assign it to keep the result)

df.dropna(axis=1)  # alternatively, drop columns with any missing value

```


6. Handling missing values by filling with a specific value:

```python

df.fillna(0)  # fill missing values with a specific value (0 here); returns a new DataFrame

```
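
A single constant is not always the best fill value. One common variation, sketched here on made-up data, is to fill a numeric column with its median and a text column with its most frequent value. The column names are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap in each column.
df = pd.DataFrame({
    'price': [10.0, np.nan, 30.0],
    'city': ['Oslo', None, 'Oslo'],
})

# Numeric column: fill with the median of the observed values.
df['price'] = df['price'].fillna(df['price'].median())

# Text column: fill with the most frequent observed value.
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(df)
```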


7. Handling missing values by forward filling or backward filling:

```python

df.ffill()  # forward fill missing values

df.bfill()  # backward fill missing values

```


8. Handling duplicates by dropping them:

```python

df.drop_duplicates()  # drop duplicate rows

```
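
By default `drop_duplicates()` only removes rows that match in every column. The sketch below, on invented data, counts exact duplicates first and then deduplicates on a single key column with the `subset` and `keep` parameters.

```python
import pandas as pd

# Hypothetical sample data; customer_id is assumed to be the key column.
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 2],
    'email': ['a@x.com', 'a@x.com', 'b@x.com', 'b2@x.com'],
})

# Count rows that are exact copies of an earlier row.
n_exact = df.duplicated().sum()

# Keep only the last row seen for each customer_id.
deduped = (
    df.drop_duplicates(subset=['customer_id'], keep='last')
      .reset_index(drop=True)
)

print(n_exact)
print(deduped)
```

Whether to keep the first or the last occurrence depends on how the data was collected, so treat `keep='last'` here as one possible choice.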


9. Renaming columns:

```python

df.rename(columns={'old_name': 'new_name'}, inplace=True)  # rename columns

```


10. Removing leading/trailing whitespace from column values:

```python

df['column_name'] = df['column_name'].str.strip()

```


11. Removing special characters from column values:

```python

df['column_name'] = df['column_name'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required in pandas 2.0+

```
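
To make the effect of the pattern concrete, here is a short sketch on invented strings that chains whitespace stripping and special-character removal. The pattern `[^\w\s]` matches anything that is not a word character or whitespace.

```python
import pandas as pd

# Hypothetical messy text values.
s = pd.Series(['  Hello, World!  ', 'data-cleaning #1', "it's ok"])

# Strip outer whitespace, then drop punctuation and symbols.
cleaned = s.str.strip().str.replace(r'[^\w\s]', '', regex=True)

print(cleaned.tolist())
```

Note that hyphens and apostrophes are removed too, so words such as "data-cleaning" are joined together.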


12. Changing data types of columns:

```python

df['column_name'] = df['column_name'].astype('float64')  # replace 'float64' with the target type, e.g. 'int64' or 'category'

```
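
`astype` raises an error if any value cannot be converted. When a column contains stray text, one option is to convert with `pd.to_numeric` or `pd.to_datetime` and `errors='coerce'`, which turns unparseable values into missing values. The data and the date format below are assumptions for the example.

```python
import pandas as pd

# Hypothetical columns read in as text, each with one bad value.
df = pd.DataFrame({
    'amount': ['10', '20.5', 'n/a'],
    'signup': ['2023-01-15', '2023-02-01', 'not a date'],
})

# Unparseable numbers become NaN instead of raising an error.
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Unparseable dates become NaT; the format string is assumed.
df['signup'] = pd.to_datetime(df['signup'], format='%Y-%m-%d', errors='coerce')

print(df.dtypes)
print(df)
```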


13. Handling inconsistent capitalization in column values:

```python

df['column_name'] = df['column_name'].str.lower()  # convert values to lowercase

df['column_name'] = df['column_name'].str.upper()  # or convert values to uppercase (pick one)

```


14. Removing outliers from numerical columns:

```python

# keep only rows within 3 standard deviations of the column mean (z-score rule)
df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 3]

```
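
The filter above uses z-scores. In small samples, one extreme value inflates the standard deviation and may never exceed the threshold of 3. A different technique, the interquartile range (IQR) rule, is sketched below as an alternative. The data and the conventional 1.5 multiplier are assumptions for the example.

```python
import pandas as pd

# Hypothetical numeric column with one obvious outlier.
df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 11, 500]})

q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

filtered = df[df['value'].between(lower, upper)]

print(filtered)
```

On this sample the z-score of 500 stays below 3, so the z-score filter would keep it, while the IQR fences exclude it.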


15. Extracting information from strings using regular expressions:

```python

df['new_column'] = df['column_name'].str.extract(r'(\d+)', expand=False)  # extract the first run of digits from each string

```


16. Handling inconsistent categories by merging them:

```python

df['column_name'] = df['column_name'].replace({'old_category': 'new_category'})

```
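
Inconsistent categories often differ only by case or stray spaces, so it can help to normalize the text before mapping. The sketch below uses invented city labels and a hypothetical `mapping` dictionary.

```python
import pandas as pd

# Hypothetical category labels with mixed spelling and spacing.
s = pd.Series(['NY', 'New York', 'new york ', 'LA'])

# Hypothetical mapping from short forms to the preferred label.
mapping = {'ny': 'new york', 'la': 'los angeles'}

# Normalize case and whitespace first, then merge the categories.
normalized = s.str.strip().str.lower().replace(mapping)

print(normalized.tolist())
```

Normalizing first keeps the mapping short, since one entry then covers every case variant of a label.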


17. Creating dummy variables for categorical columns:

```python

dummy_cols = pd.get_dummies(df['column_name'])

df = pd.concat([df, dummy_cols], axis=1)

```
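
The two-step version above keeps the original column alongside the dummies. As a sketch of an alternative, passing the whole DataFrame with `columns=` replaces the original column and prefixes the new ones in a single call. The data below is invented for illustration.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'qty': [1, 2, 3]})

# Encode 'color' in place; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=['color'], prefix='color', dtype=int)

print(encoded)
```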


18. Applying mathematical functions to columns:

```python

df['new_column'] = df['column_name'].apply(math_function)  # math_function is a placeholder for any one-argument function, e.g. np.sqrt

```
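
As a concrete sketch, the example below applies a log transform two ways: with `apply` and a hypothetical helper `log10_plus_one`, and with a vectorized NumPy call. Both give the same result here, and the vectorized form is usually faster on large columns.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical numeric column.
df = pd.DataFrame({'income': [0.0, 9.0, 99.0]})


def log10_plus_one(x):
    """Illustrative helper: log base 10 of (x + 1)."""
    return math.log10(x + 1)


# Element-by-element with apply.
df['log_apply'] = df['income'].apply(log10_plus_one)

# Whole column at once with NumPy.
df['log_vectorized'] = np.log10(df['income'] + 1)

print(df)
```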


19. Sorting the DataFrame by column values:

```python

df.sort_values('column_name', ascending=False, inplace=True)  # sort in descending order

```


20. Saving the cleaned DataFrame to a CSV file:

```python

df.to_csv('cleaned_data.csv', index=False)

```
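
To show how several of these commands fit together, here is a minimal end-to-end sketch. It reads a small in-memory CSV (invented for this example) instead of a real file, so the column names, values, and the order of steps are all assumptions.

```python
import io

import pandas as pd

# Hypothetical raw CSV with messy headers, spacing, a gap, and a duplicate.
raw_csv = io.StringIO(
    'Name ,Age,City\n'
    ' alice ,30,NY\n'
    'Bob,,LA\n'
    ' alice ,30,NY\n'
)

df = pd.read_csv(raw_csv)

# Tidy the column names.
df.columns = df.columns.str.strip().str.lower()

# Clean the text values before looking for duplicates.
df['name'] = df['name'].str.strip().str.title()
df = df.drop_duplicates()

# Fill the missing age with the median of the observed ages.
df['age'] = df['age'].fillna(df['age'].median())

df = df.sort_values('name').reset_index(drop=True)

# to_csv without a path returns the CSV text instead of writing a file.
cleaned_csv = df.to_csv(index=False)
print(cleaned_csv)
```

Cleaning the text before dropping duplicates matters, because rows that differ only by stray spaces would otherwise not count as duplicates.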


These are just some of the commonly used data cleaning commands in Python. Depending on your specific data and cleaning requirements, you may need to explore additional techniques and functions.
