1. Importing necessary libraries:
```python
import pandas as pd
import numpy as np
```
2. Reading data from a CSV file:
```python
df = pd.read_csv('filename.csv')
```
3. Checking the first few rows of the DataFrame:
```python
df.head()
```
4. Summarizing the DataFrame, including column data types and non-null counts (which reveal missing values):
```python
df.info()
```
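`df.info()` reports non-null counts, so the number of missing values has to be inferred. As a small sketch, assuming a made-up DataFrame with hypothetical columns `age` and `city`, `isna().sum()` counts the missing values per column directly:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "city": ["Oslo", None, None],
})

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)
```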
5. Handling missing values by dropping rows or columns:
```python
# Both calls return a new DataFrame; assign the result (e.g. df = df.dropna()) to keep it
df.dropna()        # drop rows with any missing value
df.dropna(axis=1)  # drop columns with any missing value
```
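Dropping every row with any missing value can discard a lot of data. The sketch below, which uses made-up data and hypothetical column names, shows two narrower options that `dropna` supports: `subset` (only look at certain columns) and `thresh` (keep rows with at least a given number of non-missing values).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dan"],
    "age": [34, np.nan, np.nan, 29],
    "city": ["Oslo", "Rome", None, None],
})

# Drop a row only when 'name' is missing
named = df.dropna(subset=["name"])

# Keep rows that have at least two non-missing values
mostly_complete = df.dropna(thresh=2)

print(named)
print(mostly_complete)
```

The original `df` is left unchanged in both cases.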
6. Handling missing values by filling with a specific value:
```python
df = df.fillna(0) # fill missing values with a specific value (0 here; pick one that suits your data)
```
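A single fill value rarely suits every column. As an illustrative sketch with made-up data, `fillna` also accepts a dictionary that maps column names to fill values, so a numeric column can get its median while a text column gets a label (the choice of median and `"unknown"` is an assumption, not a recommendation):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Oslo", None, "Rome"],
})

# Fill each column with its own value: median for 'age', a label for 'city'
df = df.fillna({"age": df["age"].median(), "city": "unknown"})
print(df)
```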
7. Handling missing values by forward filling or backward filling:
```python
df.ffill() # forward fill missing values
df.bfill() # backward fill missing values
```
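To make the difference between the two directions concrete, here is a small sketch on a made-up Series. It also uses the `limit` parameter, which caps how many consecutive missing values get filled:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # carry the last seen value forward
backward = s.bfill()         # pull the next seen value backward
limited = s.ffill(limit=1)   # fill at most one consecutive gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```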
8. Handling duplicates by dropping them:
```python
df.drop_duplicates() # drop duplicate rows
```
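Rows are often duplicates only with respect to a key column. A minimal sketch, assuming a hypothetical `id` key and made-up scores, uses `subset` and `keep` to decide which columns define a duplicate and which copy survives:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"id": [1, 1, 2], "score": [10, 12, 7]})

# Count rows whose 'id' already appeared earlier
n_duplicates = df.duplicated(subset="id").sum()

# Keep only the last row for each 'id'
latest = df.drop_duplicates(subset="id", keep="last")

print(n_duplicates)
print(latest)
```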
9. Renaming columns:
```python
df.rename(columns={'old_name': 'new_name'}, inplace=True) # rename columns
```
10. Removing leading/trailing whitespace from column values:
```python
df['column_name'] = df['column_name'].str.strip()
```
11. Removing special characters from column values:
```python
df['column_name'] = df['column_name'].str.replace(r'[^\w\s]', '', regex=True) # regex=True is required in pandas 2.0+
```
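In pandas 2.0 and later, `str.replace` treats the pattern as a literal string unless `regex=True` is passed. A minimal sketch on made-up strings, chaining the whitespace and special-character steps:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series(["  Hello, World!  ", "data@clean#ing"])

# Strip surrounding whitespace, then drop anything that is not
# a word character or whitespace
cleaned = s.str.strip().str.replace(r"[^\w\s]", "", regex=True)

print(cleaned.tolist())
```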
12. Changing data types of columns:
```python
df['column_name'] = df['column_name'].astype('float64') # replace 'float64' with the target dtype, e.g. 'int64' or 'category'
```
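`astype` raises an error if any value cannot be converted. For messy columns, a common alternative is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN`, and `pd.to_datetime` for dates. The column names and values below are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "price": ["10", "12.5", "n/a"],
    "date": ["2024-01-05", "2024-02-10", "2024-03-15"],
})

# Unparseable strings such as "n/a" become NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Parse ISO-formatted date strings into datetime values
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)
```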
13. Handling inconsistent capitalization in column values:
```python
df['column_name'] = df['column_name'].str.lower() # convert values to lowercase
df['column_name'] = df['column_name'].str.upper() # convert values to uppercase
```
14. Removing outliers from numerical columns (here, values three or more standard deviations from the mean):
```python
df = df[(np.abs(df['column_name'] - df['column_name'].mean()) / df['column_name'].std()) < 3]
```
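The filter above keeps rows whose z-score is below 3. In small samples a single extreme value inflates the standard deviation enough that it may never reach that threshold. As an alternative technique (not the one used above), the sketch below applies the interquartile-range (IQR) rule to made-up data and compares the two; the 1.5 multiplier is the conventional choice, assumed here:

```python
import pandas as pd

# Hypothetical sample data with one obvious outlier
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 100]})

# Z-score rule, as in the step above
z = (df["value"] - df["value"].mean()) / df["value"].std()
z_filtered = df[z.abs() < 3]

# IQR rule: keep values within 1.5 * IQR of the quartiles
q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1
iqr_filtered = df[df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(len(z_filtered), len(iqr_filtered))
```

With only seven rows, the z-score rule keeps the value 100, while the IQR rule removes it.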
15. Extracting information from strings using regular expressions:
```python
df['new_column'] = df['column_name'].str.extract(r'(\d+)', expand=False) # extract the first run of digits (as strings) from a string column
```
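Two details are easy to miss with `str.extract`: rows without a match become missing values, and the extracted digits are still strings. A small sketch on made-up values:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series(["order 123", "no digits", "item 45b"])

# expand=False returns a Series when the pattern has one group
digits = s.str.extract(r"(\d+)", expand=False)

# Convert the extracted strings to numbers; missing stays missing
numbers = pd.to_numeric(digits)

print(digits.tolist())
print(numbers.tolist())
```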
16. Handling inconsistent categories by merging them:
```python
df['column_name'] = df['column_name'].replace({'old_category': 'new_category'})
```
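`replace` matches whole values exactly, so it usually helps to normalize whitespace and case first. A minimal sketch with made-up category labels (the mapping is hypothetical):

```python
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series(["NY", "new york", "New York ", "LA"])

# Normalize first so one mapping entry covers all spellings
normalized = s.str.strip().str.lower()

# Map abbreviations onto the canonical category names
merged = normalized.replace({"ny": "new york", "la": "los angeles"})

print(merged.value_counts())
```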
17. Creating dummy variables for categorical columns:
```python
dummy_cols = pd.get_dummies(df['column_name'])
df = pd.concat([df, dummy_cols], axis=1)
```
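The concat approach keeps the original categorical column alongside the dummies. As a sketch of a more compact form, passing the whole DataFrame with `columns=` replaces the original column in one call; the column names and the `dtype=int` choice are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Encode 'color' in place of the original column, as 0/1 integers
encoded = pd.get_dummies(df, columns=["color"], prefix="color", dtype=int)

print(encoded)
```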
18. Applying mathematical functions to columns:
```python
df['new_column'] = df['column_name'].apply(math_function) # math_function is a placeholder for any one-argument function
```
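For common math operations, NumPy functions and arithmetic operators work on a whole column at once and are typically faster than `apply`. A small sketch on a made-up column `x`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"x": [0.0, 1.0, 3.0]})

# Vectorized: operates on the whole column at once
df["log_x"] = np.log1p(df["x"])   # log(1 + x), safe for zeros
df["x_squared"] = df["x"] ** 2

print(df)
```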
19. Sorting the DataFrame by column values:
```python
df.sort_values('column_name', ascending=False, inplace=True) # sort in descending order
```
20. Saving the cleaned DataFrame to a CSV file:
```python
df.to_csv('cleaned_data.csv', index=False)
```
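To show what `index=False` changes, the sketch below writes a made-up DataFrame to an in-memory buffer (used here instead of a file, purely for illustration) and reads it back both ways:

```python
import io
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [1.5, 2.0]})

# With index=False, only the data columns are written
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Without it, the index is written too and comes back as an extra column
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```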
These are some of the most commonly used data cleaning commands in pandas. Depending on your data and cleaning requirements, you may need to explore additional techniques and functions.