# BSD 3-Clause License; see https://github.com/scikit-hep/awkward-1.0/blob/main/LICENSE

from __future__ import absolute_import

import re
import keyword

try:
    from collections.abc import Iterable
    from collections.abc import Sized
except ImportError:
    from collections import Iterable
    from collections import Sized

import awkward as ak

np = ak.nplike.NumpyMetadata.instance()
numpy = ak.nplike.Numpy.instance()

_dir_pattern = re.compile(r"^[a-zA-Z_]\w*$")


def _suffix(array):
    out = ak.operations.convert.kernels(array)
    if out is None or out == "cpu":
        return ""
    else:
        return ":" + out


class Array(
    ak._connect._numpy.NDArrayOperatorsMixin,
    Iterable,
    Sized,
):
    u"""
    Args:
        data (#ak.layout.Content, #ak.partition.PartitionedArray, #ak.Array,
            `np.ndarray`, `cp.ndarray`, `pyarrow.*`, str, dict, or iterable):
            Data to wrap or convert into an array.
               - If a NumPy array, the regularity of its dimensions is
                 preserved and the data are viewed, not copied.
               - CuPy arrays are treated the same way as NumPy arrays, except
                 that they default to `kernels="cuda"` rather than
                 `kernels="cpu"`.
               - If a pyarrow object, calls #ak.from_arrow, preserving as much
                 metadata as possible, usually zero-copy.
               - If a dict of str \u2192 columns, combines the columns into an
                 array of records (like Pandas's DataFrame constructor).
               - If a string, the data are assumed to be JSON.
               - If an iterable, calls #ak.from_iter, which assumes all
                 dimensions have irregular lengths.
        behavior (None or dict): Custom #ak.behavior for this Array only.
        with_name (None or str): Gives tuples and records a name that can be
            used to override their behavior (see below).
        check_valid (bool): If True, verify that the #layout is valid.
        kernels (None, `"cpu"`, or `"cuda"`): If `"cpu"`, the Array will be placed
            in main memory for use with other `"cpu"` Arrays and Records; if
            `"cuda"`, the Array will be placed in GPU global memory using CUDA;
            if None, the `data` are left untouched. For `"cuda"`,
            [awkward-cuda-kernels](https://pypi.org/project/awkward-cuda-kernels)
            must be installed, which can be done with
            `pip install awkward[cuda] --upgrade`.

    High-level array that can contain data of any type.

    For most users, this is the only class in Awkward Array that matters: it
    is the entry point for data analysis with an emphasis on usability. It
    intentionally has a minimum of methods, preferring standalone functions
    like

        ak.num(array1)
        ak.combinations(array1)
        ak.cartesian([array1, array2])
        ak.zip({"x": array1, "y": array2, "z": array3})

    instead of bound methods like

        array1.num()
        array1.combinations()
        array1.cartesian([array2, array3])
        array1.zip(...)   # ?

    because its namespace is valuable for domain-specific parameters and
    functionality. For example, if records contain a field named `"num"`,
    they can be accessed as

        array1.num

    instead of

        array1["num"]

    without any confusion or interference from #ak.num. The same is true for
    domain-specific methods that have been attached to the data. For instance,
    an analysis of mailing addresses might have a function that computes zip
    codes, which can be attached to the data with a method like

        latlon.zip()

    without any confusion or interference from #ak.zip. Custom methods like
    this can be added with #ak.behavior, and so the namespace of Array
    attributes must be kept clear for such applications.

    See also #ak.Record.
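
    As a rough sketch of how such a method might be attached (the `"latlon"`
    record name, its fields, and the `zip` method below are hypothetical, not
    part of Awkward Array), a Record subclass can be registered under the
    record's name:

        class LatLon(ak.Record):
            def zip(self):
                # stand-in for a real lookup from (lat, lon) to a zip code
                return round(self.lat * 100) % 100000

        ak.behavior["latlon"] = LatLon
        latlon = ak.Array([{"lat": 40.7, "lon": -74.0}], with_name="latlon")[0]
        latlon.zip()
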
    Interfaces to other libraries
    =============================

    NumPy
    *****

    When NumPy
    [universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
    (ufuncs) are applied to an ak.Array, they are passed through the Awkward
    data structure, applied to the numerical data at its leaves, and the
    output maintains the original structure. For example,

        >>> array = ak.Array([[1, 4, 9], [], [16, 25]])
        >>> np.sqrt(array)
        <Array [[1, 2, 3], [], [4, 5]] type='3 * var * float64'>

    See also #ak.Array.__array_ufunc__.

    Some NumPy functions other than ufuncs are also handled properly in
    NumPy >= 1.17 (see
    [NEP 18](https://numpy.org/neps/nep-0018-array-function-protocol.html)),
    provided an Awkward override exists. That is, np.concatenate can be used
    on an Awkward Array because ak.concatenate exists. If your NumPy is older
    than 1.17, use `ak.concatenate` directly.

    Pandas
    ******

    Ragged arrays (list type) can be converted into Pandas
    [MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)
    rows and nested records can be converted into MultiIndex columns. If the
    Awkward Array has only one "branch" of nested lists (i.e. different record
    fields do not have different-length lists, but a single chain of
    lists-of-lists is okay), then it can be losslessly converted into a single
    DataFrame. Otherwise, multiple DataFrames are needed, though they can be
    merged (with a loss of information).

    The #ak.to_pandas function performs this conversion; if `how=None`, it
    returns a list of DataFrames; otherwise, `how` is passed to `pd.merge`
    when merging the resultant DataFrames.

    Numba
    *****

    Arrays can be used in [Numba](http://numba.pydata.org/): they can be
    passed as arguments to a Numba-compiled function or returned as return
    values. The only limitation is that Awkward Arrays cannot be *created*
    inside the Numba-compiled function; to make outputs, consider
    #ak.ArrayBuilder.

    Arrow
    *****

    Arrays are convertible to and from
    [Apache Arrow](https://arrow.apache.org/), a standard for representing
    nested data structures in columnar arrays. See #ak.to_arrow and
    #ak.from_arrow.

    NumExpr
    *******

    [NumExpr](https://numexpr.readthedocs.io/en/latest/user_guide.html) can
    calculate expressions on a set of ak.Arrays, but only if the functions in
    `ak.numexpr` are used, not the functions in the `numexpr` library
    directly.

    Like NumPy ufuncs, the expression is evaluated on the numeric leaves of
    the data structure, maintaining structure in the output.

    See #ak.numexpr.evaluate to calculate an expression.

    See #ak.numexpr.re_evaluate to recalculate an expression without
    rebuilding its virtual machine.

    Autograd
    ********

    Derivatives of a calculation on a set of ak.Arrays can be calculated with
    [Autograd](https://github.com/HIPS/autograd#readme), but only if the
    function in `ak.autograd` is used, not the functions in the `autograd`
    library directly.

    Like NumPy ufuncs, the function and its derivatives are evaluated on the
    numeric leaves of the data structure, maintaining structure in the output.

    See #ak.autograd.elementwise_grad to calculate a function and its
    derivatives elementwise on each numeric value in an ak.Array.
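
    For example, an elementwise derivative might be computed like this (a
    minimal sketch that assumes the optional `autograd` dependency is
    installed; the `square` function is only an illustration):

        def square(x):
            return x**2

        array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
        ak.autograd.elementwise_grad(square)(array)   # 2*x at each numeric leaf
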
""" def __init__( self, data, behavior=None, with_name=None, check_valid=False, cache=None, kernels=None, ): if cache is not None: raise TypeError("__init__() got an unexpected keyword argument 'cache'") if isinstance(data, (ak.layout.Content, ak.partition.PartitionedArray)): layout = data elif isinstance(data, Array): layout = data.layout elif isinstance(data, np.ndarray) and data.dtype != np.dtype("O"): layout = ak.operations.convert.from_numpy(data, highlevel=False) elif type(data).__module__.startswith("cupy."): layout = ak.operations.convert.from_cupy(data, highlevel=False) elif type(data).__module__ == "pyarrow" or type(data).__module__.startswith( "pyarrow." ): layout = ak.operations.convert.from_arrow(data, highlevel=False) elif isinstance(data, dict): keys = [] contents = [] length = None for k, v in data.items(): keys.append(k) contents.append(Array(v).layout) if length is None: length = len(contents[-1]) elif length != len(contents[-1]): raise ValueError( "dict of arrays in ak.Array constructor must have arrays " "of equal length ({0} vs {1})".format(length, len(contents[-1])) + ak._util.exception_suffix(__file__) ) parameters = None if with_name is not None: parameters = {"__record__": with_name} layout = ak.layout.RecordArray(contents, keys, parameters=parameters) elif isinstance(data, str): layout = ak.operations.convert.from_json(data, highlevel=False) else: layout = ak.operations.convert.from_iter( data, highlevel=False, allow_record=False ) if not isinstance(layout, (ak.layout.Content, ak.partition.PartitionedArray)): raise TypeError( "could not convert data into an ak.Array" + ak._util.exception_suffix(__file__) ) if with_name is not None: layout = ak.operations.structure.with_name( layout, with_name, highlevel=False ) if self.__class__ is Array: self.__class__ = ak._util.arrayclass(layout, behavior) if kernels is not None and kernels != ak.operations.convert.kernels(layout): layout = ak.operations.convert.to_kernels(layout, kernels, highlevel=False) self.layout = layout self.behavior = behavior docstr = self.layout.purelist_parameter("__doc__") if isinstance(docstr, str): self.__doc__ = docstr if check_valid: ak.operations.describe.validity_error(self, exception=True) self._caches = ak._util.find_caches(self.layout) @property def layout(self): """ The composable #ak.layout.Content elements that determine how this Array is structured. This may be considered a "low-level" view, as it distinguishes between arrays that have the same logical meaning (i.e. same JSON output and high-level #type) but different * node types, such as #ak.layout.ListArray64 and #ak.layout.ListOffsetArray64, * integer type specialization, such as #ak.layout.ListArray64 and #ak.layout.ListArray32, * or specific values, such as gaps in a #ak.layout.ListArray64. The #ak.layout.Content elements are fully composable, whereas an Array is not; the high-level Array is a single-layer "shell" around its layout. Layouts are rendered as XML instead of a nested list. For example, the following `array` ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]) is presented as but `array.layout` is presented as (with truncation for large arrays). 
""" return self._layout @layout.setter def layout(self, layout): if isinstance(layout, (ak.layout.Content, ak.partition.PartitionedArray)): self._layout = layout self._numbaview = None else: raise TypeError( "layout must be a subclass of ak.layout.Content" + ak._util.exception_suffix(__file__) ) @property def behavior(self): """ The `behavior` parameter passed into this Array's constructor. * If a dict, this `behavior` overrides the global #ak.behavior. Any keys in the global #ak.behavior but not this `behavior` are still valid, but any keys in both are overridden by this `behavior`. Keys with a None value are equivalent to missing keys, so this `behavior` can effectively remove keys from the global #ak.behavior. * If None, the Array defaults to the global #ak.behavior. See #ak.behavior for a list of recognized key patterns and their meanings. """ return self._behavior @behavior.setter def behavior(self, behavior): if behavior is None or isinstance(behavior, dict): self._behavior = behavior else: raise TypeError( "behavior must be None or a dict" + ak._util.exception_suffix(__file__) ) @classmethod def _internal_for_jax(cls, layout, jaxtracers, isscalar=False): if isscalar: arr_withtracers = cls(numpy.asarray(layout)) else: arr_withtracers = cls(layout) arr_withtracers._tracers = jaxtracers ( _dataptrs, _map_ptrs_to_tracers, ) = ak._connect._jax.jax_utils._find_dataptrs_and_map( layout, jaxtracers, isscalar ) arr_withtracers._isscalar = isscalar arr_withtracers._dataptrs = _dataptrs arr_withtracers._map_ptrs_to_tracers = _map_ptrs_to_tracers return arr_withtracers @property def caches(self): return self._caches class Mask(object): def __init__(self, array, valid_when): self._array = array self._valid_when = valid_when def __str__(self): return self._str() def __repr__(self): return self._repr() def _str(self, limit_value=85): return self._array._str(limit_value=limit_value) def _repr(self, limit_value=40, limit_total=85): suffix = _suffix(self) limit_value -= len(suffix) value = ak._util.minimally_touching_string( limit_value, self._array.layout, self._array._behavior ) try: name = super(Array, self._array).__getattribute__("__name__") except AttributeError: name = type(self._array).__name__ limit_type = limit_total - (len(value) + len(name) + len("<.mask type=>")) typestr = repr( str( ak._util.highlevel_type( self._array.layout, self._array._behavior, True ) ) ) if len(typestr) > limit_type: typestr = typestr[: (limit_type - 4)] + "..." + typestr[-1] return "<{0}.mask{1} {2} type={3}>".format(name, suffix, value, typestr) def __getitem__(self, where): return ak.operations.structure.mask(self._array, where, self._valid_when) @property def mask(self, valid_when=True): """ Whereas array[array_of_booleans] removes elements from `array` in which `array_of_booleans` is False, array.mask[array_of_booleans] returns data with the same length as the original `array` but False values in `array_of_booleans` are mapped to None. Such an output can be used in mathematical expressions with the original `array` because they are still aligned. See <> and #ak.mask. """ return self.Mask(self, valid_when) def tolist(self): """ Converts this Array into Python objects; same as #ak.to_list (but without the underscore, like NumPy's [tolist](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tolist.html)). """ return ak.operations.convert.to_list(self) def to_list(self): """ Converts this Array into Python objects; same as #ak.to_list. 
""" return ak.operations.convert.to_list(self) def to_numpy(self, allow_missing=True): """ Converts this Array into a NumPy array, if possible; same as #ak.to_numpy. """ return ak.operations.convert.to_numpy(self, allow_missing=allow_missing) @property def nbytes(self): """ The total number of bytes in all the #ak.layout.Index, #ak.layout.Identities, and #ak.layout.NumpyArray buffers in this array tree. Note: this calculation takes overlapping buffers into account, to the extent that overlaps are not double-counted, but overlaps are currently assumed to be complete subsets of one another, and so it is theoretically possible (though unlikely) that this number is an underestimate of the true usage. It also does not count buffers that must be kept in memory because of ownership, but are not directly used in the array. Nor does it count the (small) C++ nodes or Python objects that reference the (large) array buffers. """ return self.layout.nbytes @property def ndim(self): """ Number of dimensions (nested variable-length lists and/or regular arrays) before reaching a numeric type or a record. There may be nested lists within the record, as field values, but this number of dimensions does not count those. (Some fields may have different depths than others, which is why they are not counted.) """ return self.layout.purelist_depth @property def fields(self): """ List of field names or tuple slot numbers (as strings) of the outermost record or tuple in this array. If the array contains nested records, only the fields of the outermost record are shown. If it contains tuples instead of records, its fields are string representations of integers, such as `"0"`, `"1"`, `"2"`, etc. The records or tuples may be within multiple layers of nested lists. If the array contains neither tuples nor records, it is an empty list. See also #ak.fields. """ return ak.operations.describe.fields(self) @property def type(self): """ The high-level type of this Array; same as #ak.type. Note that the outermost element of an Array's type is always an #ak.types.ArrayType, which specifies the number of elements in the array. The type of a #ak.layout.Content (from #ak.Array.layout) is not wrapped by an #ak.types.ArrayType. """ return ak.operations.describe.type(self) def __len__(self): """ The length of this Array, only counting the outermost structure. For example, the length of ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]) is `3`, not `5`. """ return len(self.layout) def __iter__(self): """ Iterates over this Array in Python. Note that this is the *slowest* way to access data (even slower than native Python objects, like lists and dicts). Usually, you should express your problems in array-at-a-time operations. In other words, do this: >>> print(np.sqrt(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))) [[1.05, 1.48, 1.82], [], [2.1, 2.35]] not this: >>> for outer in ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]): ... for inner in outer: ... print(np.sqrt(inner)) ... 1.0488088481701516 1.4832396974191326 1.816590212458495 2.0976176963403033 2.345207879911715 Iteration over Arrays exists so that they can be more easily inspected as Python objects. See also #ak.to_list. """ for x in self.layout: yield ak._util.wrap(x, self._behavior) def __getitem__(self, where): """ Args: where (many types supported; see below): Index of positions to select from this Array. Select items from the Array using an extension of NumPy's (already quite extensive) rules. 
        All methods of selecting items described in
        [NumPy indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html)
        are supported with one exception
        ([combining advanced and basic indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#combining-advanced-and-basic-indexing)
        with basic indexes *between* two advanced indexes: the definition
        NumPy chose for the result does not have a generalization beyond
        rectilinear arrays).

        The `where` parameter can be any of the following or a tuple of the
        following.

           * **An integer** selects one element. Like Python/NumPy, it is
             zero-indexed: `0` is the first item, `1` is the second, etc.
             Negative indexes count from the end of the list: `-1` is the
             last, `-2` is the second-to-last, etc. Indexes beyond the size of
             the array, either because they're too large or because they're
             too negative, raise errors. In particular, some nested lists
             might contain a desired element while others don't; this would
             raise an error.
           * **A slice** (either a Python `slice` object or the
             `start:stop:step` syntax) selects a range of elements. The
             `start` and `stop` values are zero-indexed; `start` is inclusive
             and `stop` is exclusive, like Python/NumPy. Negative `step`
             values are allowed, but a `step` of `0` is an error. Slices
             beyond the size of the array are not errors but are truncated,
             like Python/NumPy.
           * **A string** selects a tuple or record field, even if its
             position in the tuple is to the left of the dimension where the
             tuple/record is defined. (See below.) This is similar to NumPy's
             [field access](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#field-access),
             except that strings are allowed in the same tuple with other
             slice types. While record fields have names, tuple fields are
             integer strings, such as `"0"`, `"1"`, `"2"` (always
             non-negative). Be careful to distinguish these from non-string
             integers.
           * **An iterable of strings** (not the top-level tuple) selects
             multiple tuple/record fields.
           * **An ellipsis** (either the Python `Ellipsis` object or the
             `...` syntax) skips as many dimensions as needed to put the rest
             of the slice items at the innermost dimensions.
           * **np.newaxis** or its equivalent, None, does not select items
             but introduces a new regular dimension in the output with size
             `1`. This is a convenient way to explicitly choose a dimension
             for broadcasting.
           * **A boolean array** with the same length as the current dimension
             (or any iterable, other than the top-level tuple) selects
             elements corresponding to each True value in the array, dropping
             those that correspond to each False. The behavior is similar to
             NumPy's
             [compress](https://docs.scipy.org/doc/numpy/reference/generated/numpy.compress.html)
             function.
           * **An integer array** (or any iterable, other than the top-level
             tuple) selects elements like a single integer, but produces a
             regular dimension of as many elements as are requested. The array
             can have any length, any order, and it can have duplicates and
             incomplete coverage. The behavior is similar to NumPy's
             [take](https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html)
             function.
           * **An integer Array with missing (None) items** selects multiple
             values by index, as above, but None values are passed through to
             the output. This behavior matches pyarrow's
             [Array.take](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.take),
             which also manages arrays with missing values. See <