How to use variables in dask's query function

I used the query function to filter data in dask, but the syntax tripped me up for a while. This post is about that.

The official documentation for the query function on a dask DataFrame is here:
dask.dataframe.DataFrame.query

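According to this documentation, pandas lets you refer to variables with @, but that is not available in dask, so you should use an f-string or the local_dict keyword instead.
Fine, so I just use an f-string in place of @... or so I thought. Taking it that simply is exactly where I stumbled.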

Setup

import pandas as pd
import dask.dataframe as dd
import dask
pd.options.display.notebook_repr_html = False  # Controls the output format in Jupyter Notebook; the code works without it.
# Check the environment
print(pd.__version__)
print(dask.__version__)

# --------------------

1.1.2
2023.1.0

Creating the sample data

# The data has no particular meaning. Adapted from https://linus-mk.hatenablog.com/entry/pandas-unique-integer-id
df = pd.DataFrame({
    'name'    : ['Alice', 'Bob', 'Charlie', 'Charlie', 'Alice', 'Bob'],
    'item' : ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'],
    'number'    : [3, 2, 4, 3, 2, 1],
    'id_code' : ['012', '123', '234', '123', '012', '345']
})
df

# --------------------

      name item  number id_code
0    Alice  aaa       3     012
1      Bob  bbb       2     123
2  Charlie  ccc       4     234
3  Charlie  ddd       3     123
4    Alice  eee       2     012
5      Bob  fff       1     345

df.dtypes

# --------------------

name       object
item       object
number      int64
id_code    object
dtype: object

ddf = dd.from_pandas(df)

# --------------------

---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-6-bcb6b963da62> in <module>
    ----> 1 ddf = dd.from_pandas(df)
    
    /usr/local/lib/python3.8/site-packages/dask/dataframe/io/io.py in from_pandas(data, npartitions, chunksize, sort, name)
        260 
        261     if (npartitions is None) == (chunksize is None):
    --> 262         raise ValueError("Exactly one of npartitions and chunksize must be specified.")
        263 
        264     nrows = len(data)
    ValueError: Exactly one of npartitions and chunksize must be specified.

I don't fully understand the reason, but in this version dask.dataframe.from_pandas requires exactly one of npartitions and chunksize to be specified. Any value will do for now, so I pass npartitions=1.

ddf = dd.from_pandas(df, npartitions=1)
print(ddf)

# --------------------

Dask DataFrame Structure:
                 name    item number id_code
npartitions=1                               
0              object  object  int64  object
5                 ...     ...    ...     ...
Dask Name: from_pandas, 1 graph layer

ddf.compute()

# --------------------

      name item  number id_code
0    Alice  aaa       3     012
1      Bob  bbb       2     123
2  Charlie  ccc       4     234
3  Charlie  ddd       3     123
4    Alice  eee       2     012
5      Bob  fff       1     345

The setup is now done.

Numeric columns

Let's start with a numeric example, like the one in the official documentation mentioned above.
First, extract the rows whose number column equals 2.

# pandas: literal value
df.query("number==2")

# --------------------

    name item  number id_code
1    Bob  bbb       2     123
4  Alice  eee       2     012

# dask: literal value
ddf.query("number==2").compute()

# --------------------

    name item  number id_code
1    Bob  bbb       2     123
4  Alice  eee       2     012

# pandas: variable via @ (no f-string needed)
num = 2
df.query("number==@num")

# --------------------

    name item  number id_code
1    Bob  bbb       2     123
4  Alice  eee       2     012

# dask: variable via f-string (works)
num = 2
ddf.query(f"number=={num}").compute()

# --------------------

    name item  number id_code
1    Bob  bbb       2     123
4  Alice  eee       2     012

# pandas: variable via f-string (this works too)
num = 2
df.query(f"number=={num}")

# --------------------

    name item  number id_code
1    Bob  bbb       2     123
4  Alice  eee       2     012

It starts to look as though you can simply write {variable_name} in dask wherever you would write @variable_name in pandas. However, there are cases where that does not work.

String columns

Extract the rows whose name column is "Bob".

# pandas: literal value
df.query("name=='Bob'")

# --------------------

  name item  number id_code
1  Bob  bbb       2     123
5  Bob  fff       1     345

# dask: literal value
ddf.query("name=='Bob'").compute()

# --------------------

  name item  number id_code
1  Bob  bbb       2     123
5  Bob  fff       1     345

No problems so far.
Things change once a variable is involved.

# pandas: variable via @ (no f-string needed)
target = 'Bob'
df.query("name==@target")

# --------------------

  name item  number id_code
1  Bob  bbb       2     123
5  Bob  fff       1     345

Here, writing {variable_name} in dask in place of pandas's @variable_name fails.

# dask: variable via f-string (fails)
target = 'Bob'
ddf.query(f"name=={target}").compute()

# --------------------

This raises an error. The traceback is long.

    KeyError                                  Traceback (most recent call last)
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
        187             if self.has_resolvers:
    --> 188                 return self.resolvers[key]
        189 
    /usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py in __getitem__(self, key)
        897                 pass
    --> 898         return self.__missing__(key)            # support subclasses that define __missing__
        899 
    /usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py in __missing__(self, key)
        889     def __missing__(self, key):
    --> 890         raise KeyError(key)
        891 
    KeyError: 'Bob'
    
    During handling of the above exception, another exception occurred:
    KeyError                                  Traceback (most recent call last)
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
        198                 # e.g., df[df > 0]
    --> 199                 return self.temps[key]
        200             except KeyError as err:
    KeyError: 'Bob'
    
    The above exception was the direct cause of the following exception:
    UndefinedVariableError                    Traceback (most recent call last)
    /usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
        194     try:
    --> 195         yield
        196     except Exception as e:
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _emulate(func, udf, *args, **kwargs)
       6570     with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
    -> 6571         return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
       6572 
    /usr/local/lib/python3.8/site-packages/dask/utils.py in __call__(self, _methodcaller__obj, *args, **kwargs)
       1102     def __call__(self, __obj, *args, **kwargs):
    -> 1103         return getattr(__obj, self.method)(*args, **kwargs)
       1104 
    /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
       3339         kwargs["target"] = None
    -> 3340         res = self.eval(expr, **kwargs)
       3341 
    /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
       3469 
    -> 3470         return _eval(expr, inplace=inplace, **kwargs)
       3471 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
        340 
    --> 341         parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
        342 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in __init__(self, expr, engine, parser, env, level)
        786         self._visitor = _parsers[parser](self.env, self.engine, self.parser)
    --> 787         self.terms = self.parse()
        788 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in parse(self)
        805         """
    --> 806         return self._visitor.visit(self.expr)
        807 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Module(self, node, **kwargs)
        403         expr = node.body[0]
    --> 404         return self.visit(expr, **kwargs)
        405 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Expr(self, node, **kwargs)
        406     def visit_Expr(self, node, **kwargs):
    --> 407         return self.visit(node.value, **kwargs)
        408 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Compare(self, node, **kwargs)
        698             binop = ast.BinOp(op=op, left=node.left, right=comps[0])
    --> 699             return self.visit(binop)
        700 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_BinOp(self, node, **kwargs)
        519     def visit_BinOp(self, node, **kwargs):
    --> 520         op, op_class, left, right = self._maybe_transform_eq_ne(node)
        521         left, right = self._maybe_downcast_constants(left, right)
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in _maybe_transform_eq_ne(self, node, left, right)
        440         if right is None:
    --> 441             right = self.visit(node.right, side="right")
        442         op, op_class, left, right = self._rewrite_membership_op(node, left, right)
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Name(self, node, **kwargs)
        532     def visit_Name(self, node, **kwargs):
    --> 533         return self.term_type(node.id, self.env, **kwargs)
        534 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py in __init__(self, name, env, side, encoding)
         83         self.is_local = tname.startswith(_LOCAL_TAG) or tname in _DEFAULT_GLOBALS
    ---> 84         self._value = self._resolve_name()
         85         self.encoding = encoding
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py in _resolve_name(self)
        100     def _resolve_name(self):
    --> 101         res = self.env.resolve(self.local_name, is_local=self.is_local)
        102         self.update(res)
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
        203 
    --> 204                 raise UndefinedVariableError(key, is_local) from err
        205 
    UndefinedVariableError: name 'Bob' is not defined
    
    The above exception was the direct cause of the following exception:
    ValueError                                Traceback (most recent call last)
    <ipython-input-17-6b73727f207c> in <module>
          1 # dask: variable via f-string (fails)
          2 target = 'Bob'
    ----> 3 ddf.query(f"name=={target}").compute()
    
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in query(self, expr, **kwargs)
       5178         2  1  3    2
       5179         """
    -> 5180         return self.map_partitions(M.query, expr, **kwargs)
       5181 
       5182     @derived_from(pd.DataFrame)
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in map_partitions(self, func, *args, **kwargs)
        873         None as the division.
        874         """
    --> 875         return map_partitions(func, self, *args, **kwargs)
        876 
        877     @insert_meta_param_description(pad=12)
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in map_partitions(func, meta, enforce_metadata, transform_divisions, align_dataframes, *args, **kwargs)
       6639     dfs = [df for df in args if isinstance(df, _Frame)]
       6640 
    -> 6641     meta = _get_meta_map_partitions(args, dfs, func, kwargs, meta, parent_meta)
       6642     if all(isinstance(arg, Scalar) for arg in args):
       6643         layer = {
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _get_meta_map_partitions(args, dfs, func, kwargs, meta, parent_meta)
       6750         # Use non-normalized kwargs here, as we want the real values (not
       6751         # delayed values)
    -> 6752         meta = _emulate(func, *args, udf=True, **kwargs)
       6753         meta_is_emulated = True
       6754     else:
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _emulate(func, udf, *args, **kwargs)
       6569     """
       6570     with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
    -> 6571         return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
       6572 
       6573 
    /usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py in __exit__(self, type, value, traceback)
        129                 value = type()
        130             try:
    --> 131                 self.gen.throw(type, value, traceback)
        132             except StopIteration as exc:
        133                 # Suppress StopIteration *unless* it's the same exception that
    /usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
        214         )
        215         msg = msg.format(f" in `{funcname}`" if funcname else "", repr(e), tb)
    --> 216         raise ValueError(msg) from e
        217 
        218 
    ValueError: Metadata inference failed in `query`.
    
    You have supplied a custom function and Dask is unable to 
    determine the type of output that that function returns. 
    
    To resolve this please provide a meta= keyword.
    The docstring of the Dask function you ran should have more information.
    
    Original error is below:
    ------------------------
    UndefinedVariableError("name 'Bob' is not defined")
    
    Traceback:
    ---------
      File "/usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
        yield
      File "/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py", line 6571, in _emulate
        return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
      File "/usr/local/lib/python3.8/site-packages/dask/utils.py", line 1103, in __call__
        return getattr(__obj, self.method)(*args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3340, in query
        res = self.eval(expr, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3470, in eval
        return _eval(expr, inplace=inplace, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 341, in eval
        parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 787, in __init__
        self.terms = self.parse()
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 806, in parse
        return self._visitor.visit(self.expr)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
        return visitor(node, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 404, in visit_Module
        return self.visit(expr, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
        return visitor(node, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 407, in visit_Expr
        return self.visit(node.value, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
        return visitor(node, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 699, in visit_Compare
        return self.visit(binop)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
        return visitor(node, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 520, in visit_BinOp
        op, op_class, left, right = self._maybe_transform_eq_ne(node)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 441, in _maybe_transform_eq_ne
        right = self.visit(node.right, side="right")
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
        return visitor(node, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 533, in visit_Name
        return self.term_type(node.id, self.env, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py", line 84, in __init__
        self._value = self._resolve_name()
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py", line 101, in _resolve_name
        res = self.env.resolve(self.local_name, is_local=self.is_local)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py", line 204, in resolve
        raise UndefinedVariableError(key, is_local) from err

pandas raises a similar error as well.

# pandas: variable via f-string (fails)
target = 'Bob'
df.query(f"name=={target}")

# --------------------

This also raises an error. The traceback is long.

    KeyError                                  Traceback (most recent call last)
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
        187             if self.has_resolvers:
    --> 188                 return self.resolvers[key]
        189 
    /usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py in __getitem__(self, key)
        897                 pass
    --> 898         return self.__missing__(key)            # support subclasses that define __missing__
        899 
    /usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/collections/__init__.py in __missing__(self, key)
        889     def __missing__(self, key):
    --> 890         raise KeyError(key)
        891 
    KeyError: 'Bob'
    
    During handling of the above exception, another exception occurred:
    KeyError                                  Traceback (most recent call last)
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
        198                 # e.g., df[df > 0]
    --> 199                 return self.temps[key]
        200             except KeyError as err:
    KeyError: 'Bob'
    
    The above exception was the direct cause of the following exception:
    UndefinedVariableError                    Traceback (most recent call last)
    <ipython-input-18-52c23030a7f6> in <module>
          1 # pandas: variable via f-string (fails)
          2 target = 'Bob'
    ----> 3 df.query(f"name=={target}")
    
    /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
       3338         kwargs["level"] = kwargs.pop("level", 0) + 1
       3339         kwargs["target"] = None
    -> 3340         res = self.eval(expr, **kwargs)
       3341 
       3342         try:
    /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
       3468         kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers)
       3469 
    -> 3470         return _eval(expr, inplace=inplace, **kwargs)
       3471 
       3472     def select_dtypes(self, include=None, exclude=None) -> "DataFrame":
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
        339         )
        340 
    --> 341         parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
        342 
        343         # construct the engine and evaluate the parsed expression
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in __init__(self, expr, engine, parser, env, level)
        785         self.parser = parser
        786         self._visitor = _parsers[parser](self.env, self.engine, self.parser)
    --> 787         self.terms = self.parse()
        788 
        789     @property
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in parse(self)
        804         Parse an expression.
        805         """
    --> 806         return self._visitor.visit(self.expr)
        807 
        808     @property
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        396         method = "visit_" + type(node).__name__
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
        400     def visit_Module(self, node, **kwargs):
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Module(self, node, **kwargs)
        402             raise SyntaxError("only a single expression is allowed")
        403         expr = node.body[0]
    --> 404         return self.visit(expr, **kwargs)
        405 
        406     def visit_Expr(self, node, **kwargs):
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        396         method = "visit_" + type(node).__name__
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
        400     def visit_Module(self, node, **kwargs):
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Expr(self, node, **kwargs)
        405 
        406     def visit_Expr(self, node, **kwargs):
    --> 407         return self.visit(node.value, **kwargs)
        408 
        409     def _rewrite_membership_op(self, node, left, right):
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        396         method = "visit_" + type(node).__name__
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
        400     def visit_Module(self, node, **kwargs):
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Compare(self, node, **kwargs)
        697             op = self.translate_In(ops[0])
        698             binop = ast.BinOp(op=op, left=node.left, right=comps[0])
    --> 699             return self.visit(binop)
        700 
        701         # recursive case: we have a chained comparison, a CMP b CMP c, etc.
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        396         method = "visit_" + type(node).__name__
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
        400     def visit_Module(self, node, **kwargs):
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_BinOp(self, node, **kwargs)
        518 
        519     def visit_BinOp(self, node, **kwargs):
    --> 520         op, op_class, left, right = self._maybe_transform_eq_ne(node)
        521         left, right = self._maybe_downcast_constants(left, right)
        522         return self._maybe_evaluate_binop(op, op_class, left, right)
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in _maybe_transform_eq_ne(self, node, left, right)
        439             left = self.visit(node.left, side="left")
        440         if right is None:
    --> 441             right = self.visit(node.right, side="right")
        442         op, op_class, left, right = self._rewrite_membership_op(node, left, right)
        443         return op, op_class, left, right
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        396         method = "visit_" + type(node).__name__
        397         visitor = getattr(self, method)
    --> 398         return visitor(node, **kwargs)
        399 
        400     def visit_Module(self, node, **kwargs):
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit_Name(self, node, **kwargs)
        531 
        532     def visit_Name(self, node, **kwargs):
    --> 533         return self.term_type(node.id, self.env, **kwargs)
        534 
        535     def visit_NameConstant(self, node, **kwargs):
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py in __init__(self, name, env, side, encoding)
         82         tname = str(name)
         83         self.is_local = tname.startswith(_LOCAL_TAG) or tname in _DEFAULT_GLOBALS
    ---> 84         self._value = self._resolve_name()
         85         self.encoding = encoding
         86 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/ops.py in _resolve_name(self)
         99 
        100     def _resolve_name(self):
    --> 101         res = self.env.resolve(self.local_name, is_local=self.is_local)
        102         self.update(res)
        103 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
        202                 from pandas.core.computation.ops import UndefinedVariableError
        203 
    --> 204                 raise UndefinedVariableError(key, is_local) from err
        205 
        206     def swapkey(self, old_key: str, new_key: str, new_value=None):
    UndefinedVariableError: name 'Bob' is not defined

Printing the string that is passed to query on its own makes the cause clear.

print(f"name=={target}")

# --------------------

name==Bob

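Bob needs quotation marks, but it has none. As written, Bob is treated as a bare identifier (a column or variable name) rather than a string, which is why the error says name 'Bob' is not defined.
To get the right result, the quotation marks have to be written inside the f-string, around the placeholder.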
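
As a rough check of this explanation, here is a minimal sketch that uses only Python's standard ast module. It is an approximation rather than pandas's real code path (pandas layers its own parser on top of ast), and right_operand is a helper name made up for this illustration.

```python
import ast


def right_operand(expr):
    """Parse a single comparison and return the AST node on its right-hand side."""
    comparison = ast.parse(expr, mode="eval").body
    return comparison.comparators[0]


unquoted = right_operand("name==Bob")    # what the failing f-string produced
quoted = right_operand("name=='Bob'")    # what the query actually needs

# Without quotes the parser sees an identifier; with quotes it sees a string constant.
print(type(unquoted).__name__, unquoted.id)
print(type(quoted).__name__, repr(quoted.value))
```

On Python 3.8 or later the first node is an ast.Name and the second an ast.Constant, which matches the visit_Name frame in the traceback above.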

# dask: variable via f-string (works)
target = 'Bob'
ddf.query(f"name=='{target}'").compute()

# --------------------

  name item  number id_code
1  Bob  bbb       2     123
5  Bob  fff       1     345

# pandas: variable via f-string (this works too)
target = 'Bob'
df.query(f"name=='{target}'")

# --------------------

  name item  number id_code
1  Bob  bbb       2     123
5  Bob  fff       1     345

String columns that contain digits

Extract the rows whose id_code column is "123".

# pandas: literal value
df.query("id_code=='123'")

# --------------------

      name item  number id_code
1      Bob  bbb       2     123
3  Charlie  ddd       3     123

# dask: literal value
ddf.query("id_code=='123'").compute()

# --------------------

      name item  number id_code
1      Bob  bbb       2     123
3  Charlie  ddd       3     123

No problems so far.
Again, things change once a variable is involved.

# pandas: variable via @ (no f-string needed)
code = '123'
df.query("id_code==@code")

# --------------------

      name item  number id_code
1      Bob  bbb       2     123
3  Charlie  ddd       3     123

# dask: variable via f-string (fails)
code = '123'
ddf.query(f"id_code=={code}").compute()

# --------------------

Empty DataFrame
Columns: [name, item, number, id_code]
Index: []

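The query function returns an empty DataFrame.
An outright error might actually be easier to fix, since it at least shows clearly where the mistake is...
Here too, printing the string that is passed to query on its own makes the cause clear.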

print(f"id_code=={code}")

# --------------------

id_code==123

Passed to query, this looks for rows whose id_code equals the integer 123. The column holds strings, so no row matches and an empty DataFrame comes back.
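
One way to avoid both the missing-quote error and this silent empty result is to let Python write the literal itself with the !r conversion, which inserts repr(value) into the f-string. This is a minimal sketch: build_condition is a hypothetical helper, not part of pandas or dask, and it is only meant for plain int and str values.

```python
def build_condition(column, value):
    """Return 'column==<literal>', where the literal is repr(value)."""
    # !r keeps integers bare and wraps strings in quotes (escaping as needed).
    return f"{column}=={value!r}"


print(build_condition("number", 2))       # number==2
print(build_condition("name", "Bob"))     # name=='Bob'
print(build_condition("id_code", "123"))  # id_code=='123'
print(build_condition("id_code", "012"))  # id_code=='012'
```

The resulting strings can be passed to query as they are, for example ddf.query(build_condition("id_code", code)).
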
Note that Python does not allow a leading zero on a number (more precisely, on a decimal integer literal), so doing the same thing with 012 plays out differently.

# Python does not allow a leading zero on a decimal integer literal
x = 012

# --------------------

  File "<ipython-input-27-d581e4a9bb8c>", line 2
    x = 012
          ^
SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers

# dask: variable via f-string (failure case 2)
code = '012'
ddf.query(f"id_code=={code}").compute()

# --------------------

This raises an error. The traceback is long.

    SyntaxError                               Traceback (most recent call last)
    /usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
        194     try:
    --> 195         yield
        196     except Exception as e:
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _emulate(func, udf, *args, **kwargs)
       6570     with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
    -> 6571         return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
       6572 
    /usr/local/lib/python3.8/site-packages/dask/utils.py in __call__(self, _methodcaller__obj, *args, **kwargs)
       1102     def __call__(self, __obj, *args, **kwargs):
    -> 1103         return getattr(__obj, self.method)(*args, **kwargs)
       1104 
    /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
       3339         kwargs["target"] = None
    -> 3340         res = self.eval(expr, **kwargs)
       3341 
    /usr/local/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
       3469 
    -> 3470         return _eval(expr, inplace=inplace, **kwargs)
       3471 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
        340 
    --> 341         parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
        342 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in __init__(self, expr, engine, parser, env, level)
        786         self._visitor = _parsers[parser](self.env, self.engine, self.parser)
    --> 787         self.terms = self.parse()
        788 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in parse(self)
        805         """
    --> 806         return self._visitor.visit(self.expr)
        807 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        393                     e.msg = "Python keyword not valid identifier in numexpr query"
    --> 394                 raise e
        395 
    /usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py in visit(self, node, **kwargs)
        389             try:
    --> 390                 node = ast.fix_missing_locations(ast.parse(clean))
        391             except SyntaxError as e:
    /usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py in parse(source, filename, mode, type_comments, feature_version)
         46     # Else it should be an int giving the minor version for 3.x.
    ---> 47     return compile(source, filename, mode, flags,
         48                    _feature_version=feature_version)
    SyntaxError: invalid syntax (<unknown>, line 1)
    
    The above exception was the direct cause of the following exception:
    ValueError                                Traceback (most recent call last)
    <ipython-input-28-5004d24aebb2> in <module>
          1 # dask: variable via f-string (failure case 2)
          2 code = '012'
    ----> 3 ddf.query(f"id_code=={code}").compute()
    
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in query(self, expr, **kwargs)
       5178         2  1  3    2
       5179         """
    -> 5180         return self.map_partitions(M.query, expr, **kwargs)
       5181 
       5182     @derived_from(pd.DataFrame)
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in map_partitions(self, func, *args, **kwargs)
        873         None as the division.
        874         """
    --> 875         return map_partitions(func, self, *args, **kwargs)
        876 
        877     @insert_meta_param_description(pad=12)
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in map_partitions(func, meta, enforce_metadata, transform_divisions, align_dataframes, *args, **kwargs)
       6639     dfs = [df for df in args if isinstance(df, _Frame)]
       6640 
    -> 6641     meta = _get_meta_map_partitions(args, dfs, func, kwargs, meta, parent_meta)
       6642     if all(isinstance(arg, Scalar) for arg in args):
       6643         layer = {
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _get_meta_map_partitions(args, dfs, func, kwargs, meta, parent_meta)
       6750         # Use non-normalized kwargs here, as we want the real values (not
       6751         # delayed values)
    -> 6752         meta = _emulate(func, *args, udf=True, **kwargs)
       6753         meta_is_emulated = True
       6754     else:
    /usr/local/lib/python3.8/site-packages/dask/dataframe/core.py in _emulate(func, udf, *args, **kwargs)
       6569     """
       6570     with raise_on_meta_error(funcname(func), udf=udf), check_numeric_only_deprecation():
    -> 6571         return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
       6572 
       6573 
    /usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py in __exit__(self, type, value, traceback)
        129                 value = type()
        130             try:
    --> 131                 self.gen.throw(type, value, traceback)
        132             except StopIteration as exc:
        133                 # Suppress StopIteration *unless* it's the same exception that
    /usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
        214         )
        215         msg = msg.format(f" in `{funcname}`" if funcname else "", repr(e), tb)
    --> 216         raise ValueError(msg) from e
        217 
        218 
    ValueError: Metadata inference failed in `query`.
    
    You have supplied a custom function and Dask is unable to 
    determine the type of output that that function returns. 
    
    To resolve this please provide a meta= keyword.
    The docstring of the Dask function you ran should have more information.
    
    Original error is below:
    ------------------------
    SyntaxError('invalid syntax', ('<unknown>', 1, 13, 'id_code ==0 12 \n'))
    
    Traceback:
    ---------
      File "/usr/local/lib/python3.8/site-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
        yield
      File "/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py", line 6571, in _emulate
        return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
      File "/usr/local/lib/python3.8/site-packages/dask/utils.py", line 1103, in __call__
        return getattr(__obj, self.method)(*args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3340, in query
        res = self.eval(expr, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3470, in eval
        return _eval(expr, inplace=inplace, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 341, in eval
        parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 787, in __init__
        self.terms = self.parse()
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 806, in parse
        return self._visitor.visit(self.expr)
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 394, in visit
        raise e
      File "/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 390, in visit
        node = ast.fix_missing_locations(ast.parse(clean))
      File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ast.py", line 47, in parse
        return compile(source, filename, mode, flags,

To get the correct result, you need to write quotation marks inside the f-string, around the value (123 in this example).

# dask: use a variable via an f-string (working example)
code = '123'
ddf.query(f"id_code=='{code}'").compute()
# The '012' case works the same way, so it is omitted.

# --------------------

      name item  number id_code
1      Bob  bbb       2     123
3  Charlie  ddd       3     123

Summary

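If you think of it as "where pandas uses @variable_name, dask uses {variable_name}", you will get it wrong.
It seems better to ask two questions: "How would I write the string argument to query without using a variable?" and then "How do I produce that same string with an f-string?" I have gone through a lot of examples at length, but that is what it all boils down to.
With dask's query, what you are dealing with is an ordinary f-string, so all that matters is that the string looks the way you expect once the variables in it have been expanded.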
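
One quick way to check this is to print the expanded string before handing it to query. The snippet below is only a sketch in plain Python (no dask needed). The `!r` conversion is not something the dask documentation suggests; it is standard f-string formatting that inserts `repr(code)`, which for a simple string like this one already includes the quotes.

```python
# Look at what the query string actually becomes after f-string expansion.
code = '012'

unquoted = f"id_code=={code}"      # the quotes are lost: this is the failing form
quoted = f"id_code=='{code}'"      # quotes written by hand inside the f-string
via_repr = f"id_code=={code!r}"    # !r inserts repr(code), which carries its own quotes

print(unquoted)   # id_code==012
print(quoted)     # id_code=='012'
print(via_repr)   # id_code=='012'
```

If the printed string is something you could have typed by hand as the query argument, it should behave the same way when passed to query.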

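...Actually, having written this far, I realized something. pandas's query also just takes a plain string as its argument, so although I wrote earlier that

num = 2
df.query(f"number=={num}")

works, that is nothing more than a different way of writing a string. The only difference is whether you write the argument string out directly or build it with f-string expansion.
With pandas's query, using @ felt like the norm, so the fact that an f-string also works looked surprising and special. If anything, though, it is the @ notation that is special (a pandas-specific syntax), because the contents of the argument string itself are different. Specifying the value with an f-string is a natural, general thing once you know ordinary Python.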
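
To make the comparison concrete, here is a small pandas-only sketch that puts the two notations side by side. The DataFrame is a cut-down stand-in for the sample data (an assumption for illustration, not the exact frame used above), and the variable names `by_at`, `by_fstring` and so on are made up for this example.

```python
import pandas as pd

# A cut-down stand-in for the sample data used in this post.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'number': [3, 2, 4],
    'id_code': ['012', '123', '234'],
})

num = 2
code = '123'

# Numeric column: pandas resolves @num itself, while the f-string
# simply becomes the literal text "number == 2".
by_at = df.query("number == @num")
by_fstring = df.query(f"number == {num}")

# String column: with @ no quotes are needed, with an f-string
# the quotes have to be part of the expanded text.
by_at_str = df.query("id_code == @code")
by_fstring_str = df.query(f"id_code == '{code}'")

print(by_at.equals(by_fstring))
print(by_at_str.equals(by_fstring_str))
```

Both pairs select the same rows; only the way the argument string is produced differs.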

That's all for now.