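I used the query function to filter data in dask, and got a little stuck on its syntax. This post is about that.
The official documentation for the query function on a dask DataFrame is here:
dask.dataframe.DataFrame.query
According to this documentation,
"pandas lets you use variable names with @, but that is not available in dask, so use an f-string or the local_dict keyword instead."
I see, so I just use an f-string in place of @. ...If you think about it that simply, you will trip up a little. That is the story here.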
Preparation

import pandas as pd
import dask.dataframe as dd
import dask

# Written to control the output format in Jupyter Notebook. The code works without it.
pd.options.display.notebook_repr_html = False

# Check the environment
print(pd.__version__)
print(dask.__version__)
# --------------------
1.1.2
2023.1.0
Creating the sample data

# The data has no particular meaning. Taken from https://linus-mk.hatenablog.com/entry/pandas-unique-integer-id and modified as needed.
df = pd.DataFrame({
    'name' : ['Alice', 'Bob', 'Charlie', 'Charlie', 'Alice', 'Bob'],
    'item' : ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'],
    'number' : [3, 2, 4, 3, 2, 1],
    'id_code' : ['012', '123', '234', '123', '012', '345']
})

df
# --------------------
      name  item  number  id_code
0    Alice   aaa       3      012
1      Bob   bbb       2      123
2  Charlie   ccc       4      234
3  Charlie   ddd       3      123
4    Alice   eee       2      012
5      Bob   fff       1      345

df.dtypes
# --------------------
name       object
item       object
number      int64
id_code    object
dtype: object
ddf = dd.from_pandas(df)
# --------------------
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-bcb6b963da62> in <module>
----> 1 ddf = dd.from_pandas(df)

/usr/local/lib/python3.8/site-packages/dask/dataframe/io/io.py in from_pandas(data, npartitions, chunksize, sort, name)
    260
    261     if (npartitions is None) == (chunksize is None):
--> 262         raise ValueError("Exactly one of npartitions and chunksize must be specified.")
    263
    264     nrows = len(data)

ValueError: Exactly one of npartitions and chunksize must be specified.
I do not fully understand the reason, but as the error message says, dask.dataframe.from_pandas requires exactly one of npartitions and chunksize to be specified. Any value will do for now, so I specify npartitions=1.
ddf = dd.from_pandas(df, npartitions=1)
print(ddf)
# --------------------
Dask DataFrame Structure:
                 name    item  number  id_code
npartitions=1
0              object  object   int64   object
5                 ...     ...     ...      ...
Dask Name: from_pandas, 1 graph layer

ddf.compute()
# --------------------
      name  item  number  id_code
0    Alice   aaa       3      012
1      Bob   bbb       2      123
2  Charlie   ccc       4      234
3  Charlie   ddd       3      123
4    Alice   eee       2      012
5      Bob   fff       1      345

Now we are ready.
Numeric columns
Let's start with a numeric example, like the one in the official documentation mentioned above.
First, extract the rows where the number column is 2.
# pandas: literal value
df.query("number==2")
# --------------------
    name  item  number  id_code
1    Bob   bbb       2      123
4  Alice   eee       2      012

# dask: literal value
ddf.query("number==2").compute()
# --------------------
    name  item  number  id_code
1    Bob   bbb       2      123
4  Alice   eee       2      012

# pandas: using a variable with @
# (the f prefix has no effect here, because the string contains no {} placeholder)
num = 2
df.query(f"number==@num")
# --------------------
    name  item  number  id_code
1    Bob   bbb       2      123
4  Alice   eee       2      012

# dask: using a variable, f-string, works
num = 2
ddf.query(f"number=={num}").compute()
# --------------------
    name  item  number  id_code
1    Bob   bbb       2      123
4  Alice   eee       2      012

# pandas: using a variable; the f-string actually works here too
num = 2
df.query(f"number=={num}")
# --------------------
    name  item  number  id_code
1    Bob   bbb       2      123
4  Alice   eee       2      012
At this point it starts to look as if, wherever you would write @variable_name in pandas, you can simply write {variable_name} in dask.
However, there are cases where that does not work.
String columns
Extract the rows where the name column is "Bob".

# pandas: literal value
df.query("name=='Bob'")
# --------------------
   name  item  number  id_code
1   Bob   bbb       2      123
5   Bob   fff       1      345

# dask: literal value
ddf.query("name=='Bob'").compute()
# --------------------
   name  item  number  id_code
1   Bob   bbb       2      123
5   Bob   fff       1      345

So far, no problems.
But things change once a variable is used.
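# pandas: using a variable with @
# (the f prefix has no effect here, because the string contains no {} placeholder)
target = 'Bob'
df.query(f"name==@target")
# --------------------
   name  item  number  id_code
1   Bob   bbb       2      123
5   Bob   fff       1      345

If you write {variable_name} in dask where you would write @variable_name in pandas, it fails.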
# dask: using a variable, f-string, failing example
target = 'Bob'
ddf.query(f"name=={target}").compute()
# --------------------
Error. The traceback is long; it is reproduced in full here.
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
187 if self.has_resolvers:
--> 188 return self.resolvers[key]
189
/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py in __getitem__(self, key)
897 pass
--> 898 return self.__missing__(key) # support subclasses that define __missing__
899
/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py in __missing__(self, key)
889 def __missing__(self, key):
--> 890 raise KeyError(key)
891
KeyError: 'Bob'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
198 # e.g., df[df > 0]
--> 199 return self.temps[key]
200 except KeyError as err:
KeyError: 'Bob'
The above exception was the direct cause of the following exception:
UndefinedVariableError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
194 try:
--> 195 yield
196 except Exception as e:
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _emulate(func, udf, *args, **kwargs)
6570 with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
-> 6571 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
6572
/usr/local/lib/python3.8/site-packages/dask/utils.py in __call__(self, _methodcaller__obj, *args, **kwargs)
1102 def __call__(self, __obj, *args, **kwargs):
-> 1103 return getattr(__obj, self.method)(*args, **kwargs)
1104
/usr/local/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
3339 kwargs["target"] = None
-> 3340 res = self.eval(expr, **kwargs)
3341
/usr/local/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
3469
-> 3470 return _eval(expr, inplace=inplace, **kwargs)
3471
/usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
340
--> 341 parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
342
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in __init__(self, expr, engine, parser, env, level)
786 self._visitor = _parsers[parser](self.env, self.engine, self.parser)
--> 787 self.terms = self.parse()
788
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in parse(self)
805 """
--> 806 return self._visitor.visit(self.expr)
807
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Module(self, node, **kwargs)
403 expr = node.body[0]
--> 404 return self.visit(expr, **kwargs)
405
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Expr(self, node, **kwargs)
406 def visit_Expr(self, node, **kwargs):
--> 407 return self.visit(node.value, **kwargs)
408
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Compare(self, node, **kwargs)
698 binop = ast.BinOp(op=op, left=node.left, right=comps[0])
--> 699 return self.visit(binop)
700
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_BinOp(self, node, **kwargs)
519 def visit_BinOp(self, node, **kwargs):
--> 520 op, op_class, left, right = self._maybe_transform_eq_ne(node)
521 left, right = self._maybe_downcast_constants(left, right)
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in _maybe_transform_eq_ne(self, node, left, right)
440 if right is None:
--> 441 right = self.visit(node.right, side="right")
442 op, op_class, left, right = self._rewrite_membership_op(node, left, right)
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Name(self, node, **kwargs)
532 def visit_Name(self, node, **kwargs):
--> 533 return self.term_type(node.id, self.env, **kwargs)
534
/usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py in __init__(self, name, env, side, encoding)
83 self.is_local = tname.startswith(_LOCAL_TAG) or tname in _DEFAULT_GLOBALS
---> 84 self._value = self._resolve_name()
85 self.encoding = encoding
/usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py in _resolve_name(self)
100 def _resolve_name(self):
--> 101 res = self.env.resolve(self.local_name, is_local=self.is_local)
102 self.update(res)
/usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
203
--> 204 raise UndefinedVariableError(key, is_local) from err
205
UndefinedVariableError: name 'Bob' is not defined
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-17-6b73727f207c> in <module>
1 # dask: using a variable, f-string, failing example
2 target = 'Bob'
----> 3 ddf.query(f"name=={target}").compute()
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in query(self, expr, **kwargs)
5178 2 1 3 2
5179 """
-> 5180 return self.map_partitions(M.query, expr, **kwargs)
5181
5182 @derived_from(pd.DataFrame)
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in map_partitions(self, func, *args, **kwargs)
873 None as the division.
874 """
--> 875 return map_partitions(func, self, *args, **kwargs)
876
877 @insert_meta_param_description(pad=12)
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in map_partitions(func, meta, enforce_metadata, transform_divisions, align_dataframes, *args, **kwargs)
6639 dfs = [df for df in args if isinstance(df, _Frame)]
6640
-> 6641 meta = _get_meta_map_partitions(args, dfs, func, kwargs, meta, parent_meta)
6642 if all(isinstance(arg, Scalar) for arg in args):
6643 layer = {
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _get_meta_map_partitions(args, dfs, func, kwargs, meta, parent_meta)
6750 # Use non-normalized kwargs here, as we want the real values (not
6751 # delayed values)
-> 6752 meta = _emulate(func, *args, udf=True, **kwargs)
6753 meta_is_emulated = True
6754 else:
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _emulate(func, udf, *args, **kwargs)
6569 """
6570 with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
-> 6571 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
6572
6573
/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py in __exit__(self, type, value, traceback)
129 value = type()
130 try:
--> 131 self.gen.throw(type, value, traceback)
132 except StopIteration as exc:
133 # Suppress StopIteration *unless* it's the same exception that
/usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
214 )
215 msg = msg.format(f" in `{funcname}`" if funcname else "", repr(e), tb)
--> 216 raise ValueError(msg) from e
217
218
ValueError: Metadata inference failed in `query`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
UndefinedVariableError("name 'Bob' is not defined")
Traceback:
---------
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
yield
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py", line 6571, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "/usr/local/lib/python3.8/site-packages/dask/utils.py", line 1103, in __call__
return getattr(__obj, self.method)(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3340, in query
res = self.eval(expr, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3470, in eval
return _eval(expr, inplace=inplace, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 341, in eval
parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 787, in __init__
self.terms = self.parse()
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 806, in parse
return self._visitor.visit(self.expr)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
return visitor(node, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 404, in visit_Module
return self.visit(expr, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
return visitor(node, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 407, in visit_Expr
return self.visit(node.value, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
return visitor(node, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 699, in visit_Compare
return self.visit(binop)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
return visitor(node, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 520, in visit_BinOp
op, op_class, left, right = self._maybe_transform_eq_ne(node)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 441, in _maybe_transform_eq_ne
right = self.visit(node.right, side="right")
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
return visitor(node, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 533, in visit_Name
return self.term_type(node.id, self.env, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py", line 84, in __init__
self._value = self._resolve_name()
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py", line 101, in _resolve_name
res = self.env.resolve(self.local_name, is_local=self.is_local)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py", line 204, in resolve
raise UndefinedVariableError(key, is_local) from err
pandas raises a similar error.

# pandas: using a variable, f-string, failing example
target = 'Bob'
df.query(f"name=={target}")
# --------------------
Error. The traceback is long; it is reproduced in full here.
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
187 if self.has_resolvers:
--> 188 return self.resolvers[key]
189
/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py in __getitem__(self, key)
897 pass
--> 898 return self.__missing__(key) # support subclasses that define __missing__
899
/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py in __missing__(self, key)
889 def __missing__(self, key):
--> 890 raise KeyError(key)
891
KeyError: 'Bob'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
198 # e.g., df[df > 0]
--> 199 return self.temps[key]
200 except KeyError as err:
KeyError: 'Bob'
The above exception was the direct cause of the following exception:
UndefinedVariableError Traceback (most recent call last)
<ipython-input-18-52c23030a7f6> in <module>
1 # pandas: using a variable, f-string, failing example
2 target = 'Bob'
----> 3 df.query(f"name=={target}")
/usr/local/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
3338 kwargs["level"] = kwargs.pop("level", 0) + 1
3339 kwargs["target"] = None
-> 3340 res = self.eval(expr, **kwargs)
3341
3342 try:
/usr/local/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
3468 kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers)
3469
-> 3470 return _eval(expr, inplace=inplace, **kwargs)
3471
3472 def select_dtypes(self, include=None, exclude=None) -> "DataFrame":
/usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
339 )
340
--> 341 parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
342
343 # construct the engine and evaluate the parsed expression
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in __init__(self, expr, engine, parser, env, level)
785 self.parser = parser
786 self._visitor = _parsers[parser](self.env, self.engine, self.parser)
--> 787 self.terms = self.parse()
788
789 @property
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in parse(self)
804 Parse an expression.
805 """
--> 806 return self._visitor.visit(self.expr)
807
808 @property
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
396 method = "visit_" + type(node).__name__
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
400 def visit_Module(self, node, **kwargs):
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Module(self, node, **kwargs)
402 raise SyntaxError("only a single expression is allowed")
403 expr = node.body[0]
--> 404 return self.visit(expr, **kwargs)
405
406 def visit_Expr(self, node, **kwargs):
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
396 method = "visit_" + type(node).__name__
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
400 def visit_Module(self, node, **kwargs):
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Expr(self, node, **kwargs)
405
406 def visit_Expr(self, node, **kwargs):
--> 407 return self.visit(node.value, **kwargs)
408
409 def _rewrite_membership_op(self, node, left, right):
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
396 method = "visit_" + type(node).__name__
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
400 def visit_Module(self, node, **kwargs):
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Compare(self, node, **kwargs)
697 op = self.translate_In(ops[0])
698 binop = ast.BinOp(op=op, left=node.left, right=comps[0])
--> 699 return self.visit(binop)
700
701 # recursive case: we have a chained comparison, a CMP b CMP c, etc.
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
396 method = "visit_" + type(node).__name__
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
400 def visit_Module(self, node, **kwargs):
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_BinOp(self, node, **kwargs)
518
519 def visit_BinOp(self, node, **kwargs):
--> 520 op, op_class, left, right = self._maybe_transform_eq_ne(node)
521 left, right = self._maybe_downcast_constants(left, right)
522 return self._maybe_evaluate_binop(op, op_class, left, right)
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in _maybe_transform_eq_ne(self, node, left, right)
439 left = self.visit(node.left, side="left")
440 if right is None:
--> 441 right = self.visit(node.right, side="right")
442 op, op_class, left, right = self._rewrite_membership_op(node, left, right)
443 return op, op_class, left, right
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
396 method = "visit_" + type(node).__name__
397 visitor = getattr(self, method)
--> 398 return visitor(node, **kwargs)
399
400 def visit_Module(self, node, **kwargs):
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Name(self, node, **kwargs)
531
532 def visit_Name(self, node, **kwargs):
--> 533 return self.term_type(node.id, self.env, **kwargs)
534
535 def visit_NameConstant(self, node, **kwargs):
/usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py in __init__(self, name, env, side, encoding)
82 tname = str(name)
83 self.is_local = tname.startswith(_LOCAL_TAG) or tname in _DEFAULT_GLOBALS
---> 84 self._value = self._resolve_name()
85 self.encoding = encoding
86
/usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py in _resolve_name(self)
99
100 def _resolve_name(self):
--> 101 res = self.env.resolve(self.local_name, is_local=self.is_local)
102 self.update(res)
103
/usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
202 from pandas.core.computation.ops import UndefinedVariableError
203
--> 204 raise UndefinedVariableError(key, is_local) from err
205
206 def swapkey(self, old_key: str, new_key: str, new_value=None):
UndefinedVariableError: name 'Bob' is not defined
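The cause becomes clear when you print the string passed to query on its own.

print(f"name=={target}")
# --------------------
name==Bob

Bob needs to be wrapped in quotes, but it is not. Written this way, Bob is treated as a bare name (a column or variable name) rather than a string value, and since nothing called Bob exists, the result is "name 'Bob' is not defined".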
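The traceback above passes through visit_Compare and visit_Name, which suggests pandas is walking a Python syntax tree. As a rough check using only the standard library, the sketch below parses both spellings with ast.parse and reports what the right-hand side turns into. describe_rhs is a hypothetical helper written for this illustration, and pandas preprocesses the string before parsing, so this is an approximation of what pandas sees, not a reproduction. It assumes Python 3.8 or later.

```python
import ast


def describe_rhs(expr):
    """Return (node type, value) for the right-hand side of a single comparison."""
    node = ast.parse(expr, mode="eval").body
    if not isinstance(node, ast.Compare):
        raise ValueError("expected a single comparison such as name=='Bob'")
    rhs = node.comparators[0]
    if isinstance(rhs, ast.Name):
        return ("Name", rhs.id)         # a bare identifier that has to be looked up
    if isinstance(rhs, ast.Constant):
        return ("Constant", rhs.value)  # a literal value
    return (type(rhs).__name__, None)


target = "Bob"
print(describe_rhs(f"name=={target}"))    # ('Name', 'Bob')
print(describe_rhs(f"name=='{target}'"))  # ('Constant', 'Bob')
```

Without quotes the parser sees an identifier, with quotes it sees a string constant.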
To get the correct result, you need to write quotes inside the f-string, around the placeholder, so that they end up around Bob.

# dask: using a variable, f-string, working example
target = 'Bob'
ddf.query(f"name=='{target}'").compute()
# --------------------
   name  item  number  id_code
1   Bob   bbb       2      123
5   Bob   fff       1      345

# pandas: using a variable; the f-string actually works here too
target = 'Bob'
df.query(f"name=='{target}'")
# --------------------
   name  item  number  id_code
1   Bob   bbb       2      123
5   Bob   fff       1      345
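As an alternative to typing the quotes by hand, Python's !r conversion inserts repr() of the value, which adds quotes for a str and leaves an int as it is. The sketch below is not from the dask documentation; build_condition is a hypothetical helper, and it is only meant for plain str and int values. The repr of other types (dates, NumPy scalars and so on) is not necessarily a valid literal inside a query string, and this is not a way to sanitize untrusted input.

```python
def build_condition(column, value):
    """Build 'column==<literal>', letting repr() add quotes for strings."""
    return f"{column}=={value!r}"


print(build_condition("name", "Bob"))     # name=='Bob'
print(build_condition("number", 2))       # number==2
print(build_condition("id_code", "012"))  # id_code=='012'
```

The resulting strings have the same form as the hand-written ones used with query in this post.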
String columns that contain digits
Extract the rows where the id_code column is "123".

# pandas: literal value
df.query("id_code=='123'")
# --------------------
      name  item  number  id_code
1      Bob   bbb       2      123
3  Charlie   ddd       3      123

# dask: literal value
ddf.query("id_code=='123'").compute()
# --------------------
      name  item  number  id_code
1      Bob   bbb       2      123
3  Charlie   ddd       3      123

So far, no problems.
But things change once a variable is used.
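# pandas: using a variable with @
# (the f prefix has no effect here, because the string contains no {} placeholder)
code = '123'
df.query(f"id_code==@code")
# --------------------
      name  item  number  id_code
1      Bob   bbb       2      123
3  Charlie   ddd       3      123

# dask: using a variable, f-string, failing example
code = '123'
ddf.query(f"id_code=={code}").compute()
# --------------------
Empty DataFrame
Columns: [name, item, number, id_code]
Index: []

The result of query is an empty DataFrame.
An error might actually be easier to fix, since it at least shows clearly where the mistake is...
Again, the cause becomes clear when you print the string passed to query on its own.

print(f"id_code=={code}")
# --------------------
id_code==123

When this is passed to query, it looks for rows where id_code equals the number 123. The column holds strings, so no row matches and an empty DataFrame comes back.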
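The same behaviour can be seen in plain Python, without pandas or dask: a str never equals an int, and the comparison quietly returns False instead of raising. The sketch below is only an analogy using a plain list with the id_code values from the sample data; it does not reproduce how pandas evaluates the expression.

```python
id_codes = ['012', '123', '234', '123', '012', '345']

# Comparing each string with the integer 123 never matches ...
rows_int = [i for i, c in enumerate(id_codes) if c == 123]
# ... while comparing with the string '123' matches rows 1 and 3.
rows_str = [i for i, c in enumerate(id_codes) if c == '123']

print(rows_int)  # []
print(rows_str)  # [1, 3]
```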
Note that in Python, a number (more precisely, a decimal integer literal) must not start with 0, so doing the same thing with 012 leads to a different situation.

# In Python, a number must not start with 0
x = 012
# --------------------
  File "<ipython-input-27-d581e4a9bb8c>", line 2
    x = 012
        ^
SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
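Python's parser applies the same rule when the expression arrives as a string. The sketch below checks this with the standard library's ast.parse; parses_as_expression is a hypothetical helper written for this illustration. pandas preprocesses the query string before parsing it, so the exact error it reports can differ from what plain Python would say.

```python
import ast


def parses_as_expression(expr):
    """Return True if Python can parse expr as a single expression."""
    try:
        ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    return True


code = '012'
print(parses_as_expression(f"id_code=={code}"))    # False: 012 is not a valid integer literal
print(parses_as_expression(f"id_code=='{code}'"))  # True: '012' is an ordinary string literal
```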
# dask: using a variable, f-string, failing example 2
code = '012'
ddf.query(f"id_code=={code}").compute()
# --------------------
Error. The traceback is long; it is reproduced in full here.
SyntaxError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
194 try:
--> 195 yield
196 except Exception as e:
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _emulate(func, udf, *args, **kwargs)
6570 with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
-> 6571 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
6572
/usr/local/lib/python3.8/site-packages/dask/utils.py in __call__(self, _methodcaller__obj, *args, **kwargs)
1102 def __call__(self, __obj, *args, **kwargs):
-> 1103 return getattr(__obj, self.method)(*args, **kwargs)
1104
/usr/local/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
3339 kwargs["target"] = None
-> 3340 res = self.eval(expr, **kwargs)
3341
/usr/local/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
3469
-> 3470 return _eval(expr, inplace=inplace, **kwargs)
3471
/usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
340
--> 341 parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
342
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in __init__(self, expr, engine, parser, env, level)
786 self._visitor = _parsers[parser](self.env, self.engine, self.parser)
--> 787 self.terms = self.parse()
788
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in parse(self)
805 """
--> 806 return self._visitor.visit(self.expr)
807
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
393 e.msg = "Python keyword not valid identifier in numexpr query"
--> 394 raise e
395
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
389 try:
--> 390 node = ast.fix_missing_locations(ast.parse(clean))
391 except SyntaxError as e:
/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py in parse(source, filename, mode, type_comments, feature_version)
46 # Else it should be an int giving the minor version for 3.x.
---> 47 return compile(source, filename, mode, flags,
48 _feature_version=feature_version)
SyntaxError: invalid syntax (<unknown>, line 1)
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-28-5004d24aebb2> in <module>
1 # dask: using a variable, f-string, failing example 2
2 code = '012'
----> 3 ddf.query(f"id_code=={code}").compute()
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in query(self, expr, **kwargs)
5178 2 1 3 2
5179 """
-> 5180 return self.map_partitions(M.query, expr, **kwargs)
5181
5182 @derived_from(pd.DataFrame)
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in map_partitions(self, func, *args, **kwargs)
873 None as the division.
874 """
--> 875 return map_partitions(func, self, *args, **kwargs)
876
877 @insert_meta_param_description(pad=12)
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in map_partitions(func, meta, enforce_metadata, transform_divisions, align_dataframes, *args, **kwargs)
6639 dfs = [df for df in args if isinstance(df, _Frame)]
6640
-> 6641 meta = _get_meta_map_partitions(args, dfs, func, kwargs, meta, parent_meta)
6642 if all(isinstance(arg, Scalar) for arg in args):
6643 layer = {
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _get_meta_map_partitions(args, dfs, func, kwargs, meta, parent_meta)
6750 # Use non-normalized kwargs here, as we want the real values (not
6751 # delayed values)
-> 6752 meta = _emulate(func, *args, udf=True, **kwargs)
6753 meta_is_emulated = True
6754 else:
/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _emulate(func, udf, *args, **kwargs)
6569 """
6570 with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
-> 6571 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
6572
6573
/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py in __exit__(self, type, value, traceback)
129 value = type()
130 try:
--> 131 self.gen.throw(type, value, traceback)
132 except StopIteration as exc:
133 # Suppress StopIteration *unless* it's the same exception that
/usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
214 )
215 msg = msg.format(f" in `{funcname}`" if funcname else "", repr(e), tb)
--> 216 raise ValueError(msg) from e
217
218
ValueError: Metadata inference failed in `query`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
SyntaxError('invalid syntax', ('<unknown>', 1, 13, 'id_code ==0 12 \n'))
Traceback:
---------
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
yield
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py", line 6571, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "/usr/local/lib/python3.8/site-packages/dask/utils.py", line 1103, in __call__
return getattr(__obj, self.method)(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3340, in query
res = self.eval(expr, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3470, in eval
return _eval(expr, inplace=inplace, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 341, in eval
parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 787, in __init__
self.terms = self.parse()
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 806, in parse
return self._visitor.visit(self.expr)
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 394, in visit
raise e
File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 390, in visit
node = ast.fix_missing_locations(ast.parse(clean))
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py", line 47, in parse
return compile(source, filename, mode, flags,
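To get the correct result, you need to write quotes inside the f-string, around the placeholder, so that they end up around 123.

# dask: using a variable, f-string, working example
code = '123'
ddf.query(f"id_code=='{code}'").compute()
# The '012' case works the same way, so it is omitted.
# --------------------
      name  item  number  id_code
1      Bob   bbb       2      123
3  Charlie   ddd       3      123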
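Summary
If your mental model is "write {variable_name} in dask where you would write @variable_name in pandas", you will fail.
Instead, it seems better to ask "How would I write the string argument to query without using a variable?" and then "How do I produce that same string with an f-string?" I have gone through a lot of examples, but they all boil down to this.
With dask's query, what you are dealing with is an ordinary f-string, so all that matters is that the string looks as expected once the variables have been expanded.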
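...Actually, having written this far, I realized something. pandas's query also just takes a plain string as its argument, so when I wrote that

num = 2
df.query(f"number=={num}")

"also works", that was nothing more than a different way of writing the same string. The only difference is whether you write the argument string directly or build it through f-string expansion.
With pandas's query, using @ felt like the obvious way, so the fact that an f-string also works looked surprising and special to me.
But it is really the @ notation that is special, because there the content of the argument string itself is different (a pandas-specific syntax). Building the string with an f-string is the general case, and it follows naturally from knowing ordinary Python.
That's all for now.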