snippets - Pandas

Set numer of columns/rows

pd.options.display.max_columns = None
pd.options.display.max_rows = None

‘SettingWithCopyWarning’

import pandas as pd
df = pd.DataFrame({'Number' : [100,200,300,400,500], 'Letter' : ['a','b','c', 'd', 'e']})
df

	Number	Letter
0	100	a
1	200	b
2	300	c
3	400	d
4	500	e

criteria = df['Number']>300
criteria

0    False
1    False
2    False
3     True
4     True
Name: Number, dtype: bool

Boolean filters

df[criteria]

	Number	Letter
3	400	d
4	500	e

#Create a new DataFrame based on filtering criteria
df_2 = df[criteria]

#Assign a new column and print output
df_2['new column'] = 'new value'
df_2

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_2['new column'] = 'new value'

	Number	Letter	new column
3	400	d	new value
4	500	e	new value

Using .loc[]

#Create New DataFrame Based on Filtering Criteria
df_2 = df.loc[criteria, :]

#Add a New Column to the DataFrame
df_2.loc[:, 'new column'] = 'new value'
df_2

/Users/fabian/.venv/fastai/lib/python3.8/site-packages/pandas/core/indexing.py:844: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
/Users/fabian/.venv/fastai/lib/python3.8/site-packages/pandas/core/indexing.py:965: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s

	Number	Letter	new column
3	400	d	new value
4	500	e	new value

Solution 1: using indices

df_2 = df[criteria]

criteria_row_indices = df[criteria].index
criteria_row_indices

Int64Index([3, 4], dtype='int64')

new_df = df.loc[criteria_row_indices, :]
new_df

	Number	Letter
3	400	d
4	500	e

new_df['New Column'] = 'New Value'
new_df

	Number	Letter	New Column
3	400	d	New Value
4	500	e	New Value

Solution 2: using `.copy()`

df[criteria]

	Number	Letter
3	400	d
4	500	e

#Create a new DataFrame based on filtering criteria
df_2 = df[criteria].copy()

#Assign a new column and print output
df_2['new column'] = 'new value'
df_2

	Number	Letter	new column
3	400	d	new value
4	500	e	new value

Conditional assignments

using `.map`

df['Number'].map({100:1, 200:2})

0    1.0
1    2.0
2    NaN
3    NaN
4    NaN
Name: Number, dtype: float64

using `.apply`

Similiar to .map but with a function.

def criteria(num):
    return 0 if num>300 else 1

df['Number'].apply(criteria)

0    1
1    1
2    1
3    0
4    0
Name: Number, dtype: int64

Oversampling

Oversampling is a very useful technique that can be used across many contexts to a great effect https://arxiv.org/abs/1710.05381.

from fastai2.tabular.all import *
titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
titanic.dropna(subset=['embark_town'], inplace=True)

titanic.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

titanic.embark_town.value_counts().plot.bar();

df = pd.DataFrame(1 / titanic.embark_town.value_counts())

weights = titanic.embark_town.apply(lambda x: df.to_dict()['embark_town'][x])

weighted_sample = titanic.sample(n=900, weights=weights, replace=True)

weighted_sample.embark_town.value_counts().plot.bar();

weighted_sample.duplicated().value_counts()

True     485
False    415
dtype: int64

Categorify

cat_vars = ['survived', 'pclass', 'sex', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']
cont_vars = ['age', 'sibsp', 'parch', 'fare']

titanic.dtypes

survived         int64
pclass           int64
sex             object
age            float64
sibsp            int64
parch            int64
fare           float64
embarked        object
class           object
who             object
adult_male        bool
deck            object
embark_town     object
alive           object
alone             bool
dtype: object

criteria = titanic.columns.isin(cat_vars)
titanic.loc[:,criteria] = titanic.loc[:,criteria].astype('category')

titanic.dtypes

survived       category
pclass         category
sex            category
age             float64
sibsp             int64
parch             int64
fare            float64
embarked       category
class          category
who            category
adult_male     category
deck           category
embark_town    category
alive          category
alone          category
dtype: object

Pandas table styling

#hide
from fastai2.vision.all import *
path = untar_data(URLs.MNIST_SAMPLE)
threes = (path/'train'/'3').ls().sorted()
im3_path = threes[1]
im3 = Image.open(im3_path)

im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15,4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	29	150	195	254	255	254	176	193	150	96	0	0
2	0	0	48	166	224	253	253	234	196	253	253	253	253	233	0	0
3	93	244	249	253	187	46	10	8	4	10	194	253	253	233	0	0
4	107	253	253	230	48	0	0	0	0	0	192	253	253	156	0	0
5	3	20	20	15	0	0	0	0	0	43	224	253	245	74	0	0
6	0	0	0	0	0	0	0	0	0	249	253	245	126	0	0	0
7	0	0	0	0	0	0	14	101	223	253	248	124	0	0	0	0
8	0	0	0	0	11	166	239	253	253	253	187	30	0	0	0	0
9	0	0	0	0	16	248	250	253	253	253	253	232	213	111	2	0
10	0	0	0	0	0	0	43	98	98	208	253	253	253	253	187	22

Set numer of columns/rows

‘SettingWithCopyWarning’

Boolean filters

Using .loc[]

Solution 1: using indices

Solution 2: using .copy()

Conditional assignments

using .map

using .apply

Oversampling

Categorify

Pandas table styling

Solution 2: using `.copy()`

using `.map`

using `.apply`