How to convert a pandas DataFrame from float to int

Quite a few people search for "pandas float int conversion", so here is a summary.

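Setup
Converting a single column from float to int
Converting multiple columns from float to int
Converting all columns from float to int
What if there are strings?
What if there are NaN values?
How to keep missing values as NaN with an int type
Why is everyone searching for this (did read_csv turn the column into float?)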

Setup

import pandas as pd
import numpy as np
pd.options.display.notebook_repr_html = False  # Controls the output format in Jupyter Notebook. The code works without it.

# Check the environment
print(pd.__version__)
print(np.__version__)

# --------------------

1.0.5
1.18.1

Converting a single column from float to int

Use the astype method to convert the column's data type.

df = pd.DataFrame({
    'col_A': [1.2 ,3.4, 5.6],
    'col_B': [9.8, 7.6, 5.4],
    'col_C': [11.1, 22.2, 33.3],
    'col_D': [99.9, 88.8, 77.7]
})
df

# --------------------

   col_A  col_B  col_C  col_D
0    1.2    9.8   11.1   99.9
1    3.4    7.6   22.2   88.8
2    5.6    5.4   33.3   77.7

df.dtypes

# --------------------

col_A    float64
col_B    float64
col_C    float64
col_D    float64
dtype: object

df['col_B'].astype('int')

# --------------------

0    9
1    7
2    5
Name: col_B, dtype: int64

df['col_B'] = df['col_B'].astype('int')
df

# --------------------

   col_A  col_B  col_C  col_D
0    1.2      9   11.1   99.9
1    3.4      7   22.2   88.8
2    5.6      5   33.3   77.7

df.dtypes

# --------------------

col_A    float64
col_B      int64
col_C    float64
col_D    float64
dtype: object
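
One thing worth noting about the result above: astype('int') drops the fractional part (truncation toward zero) rather than rounding, which is why 9.8 became 9. Below is a minimal sketch, assuming you would rather round to the nearest integer; the Series and variable names are just for illustration.

```python
import pandas as pd

s = pd.Series([9.8, 7.6, 5.4, -1.7])

# astype('int') simply discards the fractional part (truncates toward zero).
truncated = s.astype('int')

# Calling round() first gives round-to-nearest before the cast.
rounded = s.round().astype('int')

print(truncated.tolist())  # [9, 7, 5, -1]
print(rounded.tolist())    # [10, 8, 5, -2]
```

Which of the two you want depends on the data, so it is worth deciding explicitly instead of relying on the default truncation.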

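Converting multiple columns from float to int

This is almost the same as the single-column case. Just pass a list of column names instead of a single name.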

df = pd.DataFrame({
    'col_A': [1.2 ,3.4, 5.6],
    'col_B': [9.8, 7.6, 5.4],
    'col_C': [11.1, 22.2, 33.3],
    'col_D': [99.9, 88.8, 77.7]
})
df

# --------------------

   col_A  col_B  col_C  col_D
0    1.2    9.8   11.1   99.9
1    3.4    7.6   22.2   88.8
2    5.6    5.4   33.3   77.7

df[['col_B', 'col_D']].astype('int')

# --------------------

   col_B  col_D
0      9     99
1      7     88
2      5     77

df[['col_B', 'col_D']] = df[['col_B', 'col_D']].astype('int')
df

# --------------------

   col_A  col_B  col_C  col_D
0    1.2      9   11.1     99
1    3.4      7   22.2     88
2    5.6      5   33.3     77

df.dtypes

# --------------------

col_A    float64
col_B      int64
col_C    float64
col_D      int64
dtype: object
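
As an alternative to selecting the columns and assigning them back, astype also accepts a dict that maps column names to dtypes. The following is a small sketch of that form using the same sample data; the variable name converted is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col_A': [1.2, 3.4, 5.6],
    'col_B': [9.8, 7.6, 5.4],
    'col_C': [11.1, 22.2, 33.3],
    'col_D': [99.9, 88.8, 77.7]
})

# Only the columns named in the dict are converted; the others keep their dtype.
converted = df.astype({'col_B': 'int', 'col_D': 'int'})

print(converted.dtypes)
```

This returns a new DataFrame and leaves df itself untouched, so assign the result back if you want to keep it.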

Converting all columns from float to int

Just call astype directly on df.

df = pd.DataFrame({
    'col_A': [1.2 ,3.4, 5.6],
    'col_B': [9.8, 7.6, 5.4],
    'col_C': [11.1, 22.2, 33.3],
    'col_D': [99.9, 88.8, 77.7]
})
df

# --------------------

   col_A  col_B  col_C  col_D
0    1.2    9.8   11.1   99.9
1    3.4    7.6   22.2   88.8
2    5.6    5.4   33.3   77.7

df.astype('int')

# --------------------

   col_A  col_B  col_C  col_D
0      1      9     11     99
1      3      7     22     88
2      5      5     33     77

df = df.astype('int')
df

# --------------------

   col_A  col_B  col_C  col_D
0      1      9     11     99
1      3      7     22     88
2      5      5     33     77

df.dtypes

# --------------------

col_A    int64
col_B    int64
col_C    int64
col_D    int64
dtype: object

What if there are strings?

If a column contains strings, those strings cannot be converted to int. In that case, simply applying astype() to the whole DataFrame raises an error.

df = pd.DataFrame({
    'col_A': [1.2 ,3.4, 5.6],
    'col_B': [9.8, 7.6, 5.4],
    'col_C': [11.1, 22.2, 33.3],
    'col_string': ['hello', 'good_morning', 'good_night']
})
df

# --------------------

   col_A  col_B  col_C    col_string
0    1.2    9.8   11.1         hello
1    3.4    7.6   22.2  good_morning
2    5.6    5.4   33.3    good_night

df.astype('int')
# --------------------
→ Error


    ValueError                                Traceback (most recent call last)
    <ipython-input-17-d2a2db5e8de2> in <module>
    ----> 1 df.astype('int')
    
    /usr/local/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
       5696         else:
       5697             # else, only a single dtype is given
    -> 5698             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
       5699             return self._constructor(new_data).__finalize__(self)
       5700 
    /usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
        580 
        581     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
    --> 582         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
        583 
        584     def convert(self, **kwargs):
    /usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, filter, **kwargs)
        440                 applied = b.apply(f, **kwargs)
        441             else:
    --> 442                 applied = getattr(b, f)(**kwargs)
        443             result_blocks = _extend_blocks(applied, result_blocks)
        444 
    /usr/local/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
        623             vals1d = values.ravel()
        624             try:
    --> 625                 values = astype_nansafe(vals1d, dtype, copy=True)
        626             except (ValueError, TypeError):
        627                 # e.g. astype_nansafe can fail on object-dtype of strings
    /usr/local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
        872         # work around NumPy brokenness, #1987
        873         if np.issubdtype(dtype.type, np.integer):
    --> 874             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
        875 
        876         # if we have a datetime/timedelta array of objects
    pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()
    ValueError: invalid literal for int() with base 10: 'hello'

Even in this case, you can convert all the float data to int without raising an error by specifying errors='ignore'.
For details on this option, see my article on pandas dtypes.

linus-mk.hatenablog.com

df.astype('int', errors='ignore')

# --------------------

   col_A  col_B  col_C    col_string
0      1      9     11         hello
1      3      7     22  good_morning
2      5      5     33    good_night

df = df.astype('int', errors='ignore')
df.dtypes

# --------------------

col_A          int64
col_B          int64
col_C          int64
col_string    object
dtype: object
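
If you would rather not rely on errors='ignore', another option is to pick out the float columns explicitly and convert only those. This is a sketch under the assumption that every float64 column should become int; float_cols and converted are names I made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'col_A': [1.2, 3.4, 5.6],
    'col_B': [9.8, 7.6, 5.4],
    'col_C': [11.1, 22.2, 33.3],
    'col_string': ['hello', 'good_morning', 'good_night']
})

# Collect the names of the float64 columns only.
float_cols = df.select_dtypes(include='float64').columns

# Convert just those columns; the string column is never touched.
converted = df.astype({col: 'int' for col in float_cols})

print(converted.dtypes)
```

The advantage is that a genuinely unexpected conversion failure in a float column still raises an error instead of being silently ignored.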

What if there are NaN values?

When the DataFrame contains NaN, simply applying astype() to the whole DataFrame also raises an error.

df = pd.DataFrame({
    'col_A': [1.2 ,3.4, 5.6],
    'col_B': [np.nan, 7.6, 5.4],
    'col_C': [11.1, 22.2, 33.3],
    'col_D': [99.9, np.nan, 77.7]
})
df

# --------------------

   col_A  col_B  col_C  col_D
0    1.2    NaN   11.1   99.9
1    3.4    7.6   22.2    NaN
2    5.6    5.4   33.3   77.7

df.astype('int')
# --------------------
→ Error


    ValueError                                Traceback (most recent call last)
    <ipython-input-21-d2a2db5e8de2> in <module>
    ----> 1 df.astype('int')
    
    /usr/local/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
       5696         else:
       5697             # else, only a single dtype is given
    -> 5698             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
       5699             return self._constructor(new_data).__finalize__(self)
       5700 
    /usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
        580 
        581     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
    --> 582         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
        583 
        584     def convert(self, **kwargs):
    /usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, filter, **kwargs)
        440                 applied = b.apply(f, **kwargs)
        441             else:
    --> 442                 applied = getattr(b, f)(**kwargs)
        443             result_blocks = _extend_blocks(applied, result_blocks)
        444 
    /usr/local/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
        623             vals1d = values.ravel()
        624             try:
    --> 625                 values = astype_nansafe(vals1d, dtype, copy=True)
        626             except (ValueError, TypeError):
        627                 # e.g. astype_nansafe can fail on object-dtype of strings
    /usr/local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
        866 
        867         if not np.isfinite(arr).all():
    --> 868             raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
        869 
        870     elif is_object_dtype(arr):
    ValueError: Cannot convert non-finite values (NA or inf) to integer

Specifying errors='ignore' does not give the expected result here. This is different from the behavior when strings are present.

df.astype('int', errors='ignore')

# --------------------

   col_A  col_B  col_C  col_D
0    1.2    NaN   11.1   99.9
1    3.4    7.6   22.2    NaN
2    5.6    5.4   33.3   77.7

df.astype('int', errors='ignore').dtypes

# --------------------

col_A    float64
col_B    float64
col_C    float64
col_D    float64
dtype: object

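...That is puzzling behavior.
As with the string case, I expected col_A and col_C, which contain no NaN, to become integers while the columns containing NaN stayed float.
That is not what happens: every column is left as it was. A likely explanation, judging from the traceback above, is that pandas converts block by block rather than column by column. All four float64 columns are probably stored in one internal block, so a single NaN makes the conversion of the whole block fail, and errors='ignore' hands that block back unchanged. In the string example, the string column would sit in a separate object block, which is why only that column was skipped.

So with errors='ignore' no error is raised, but the goal is not achieved either.
The right approach is to fill the missing NaN values with the fillna method and then convert the dtype to int.
(The arguments of fillna offer various ways to fill missing values, but that is not the focus of this article, so I will skip it. See the official documentation.)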

pandas.DataFrame.fillna — pandas 1.1.0 documentation

df.fillna(0)

# --------------------

   col_A  col_B  col_C  col_D
0    1.2    0.0   11.1   99.9
1    3.4    7.6   22.2    0.0
2    5.6    5.4   33.3   77.7

df.fillna(0).astype('int', errors='ignore')

# --------------------

   col_A  col_B  col_C  col_D
0      1      0     11     99
1      3      7     22      0
2      5      5     33     77
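
If what you actually want is the behavior I first expected from errors='ignore' (convert the NaN-free columns and leave the others as float), it can be done column by column. The following is a rough sketch; clean_cols and converted are hypothetical names for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_A': [1.2, 3.4, 5.6],
    'col_B': [np.nan, 7.6, 5.4],
    'col_C': [11.1, 22.2, 33.3],
    'col_D': [99.9, np.nan, 77.7]
})

# Columns without any missing value can be cast to int safely.
clean_cols = [col for col in df.columns if df[col].notna().all()]

# Convert only those columns; columns containing NaN stay float64.
converted = df.astype({col: 'int' for col in clean_cols})

print(converted.dtypes)
```

Unlike fillna(0), this keeps the missing values visible, at the cost of ending up with a mix of int and float columns.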

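How to keep missing values as NaN with an int type

Is there a way to have both "I want the data type to be int" and "I want to keep missing values as they are (as NaN, without filling them with another value)"?
None of the older data types satisfy both. Relatively recently, a data type that does was introduced.
It is a fairly new feature, added in pandas 0.24.0, so I do not fully understand it myself. It is marked experimental, with a note that the specification may change in the future, so I think it is still a little early to use it in earnest.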

Nullable integer data type — pandas 1.1.1 documentation
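
For reference, here is a minimal sketch of what using that nullable type looks like, based on my reading of the documentation above. The dtype is written as the string 'Int64' with a capital I. As far as I know, floats with a fractional part are rejected by this cast, so the sketch rounds first; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 7.6, 5.4])

# Round first so every non-missing value is a whole number,
# then cast to the nullable integer dtype 'Int64' (capital I).
nullable = s.round().astype('Int64')

print(nullable)
print(nullable.dtype)
```

The missing entry stays missing while the other values are stored as integers, which is exactly the combination the older int64 dtype cannot represent.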