[DNM]Support dictionary shuffle #8893

Open
wants to merge 1 commit into
base: main

jinchengchenghh
Contributor

Add a new flag to the header to indicate that the column is a dictionary vector.
As in the code below, first create a dictionary map to hold the dictionary values. When a value is newly inserted into the map, append it to distinctValues (BufferPtr), then convert distinctValues to a FlatVector to use as the dictionaryValues of the DictionaryVector.

Each index is the map entry's it->second.
Nulls are split as before.

Only RowVector(DictionaryVector) is supported.

For string types, the key type is StringView, which is also supported as a map key.
Based on this code, I'm not sure whether the input RowVectors are all dictionary-encoded.

template <typename T>
DictionaryVectorPtr<EvalType<T>> VectorMaker::dictionaryVector(
    const std::vector<std::optional<T>>& data) {
  using TEvalType = EvalType<T>;
  // Encodes the data saving distinct values on `distinctValues` and their
  // respective indices on `indices`.
  std::vector<TEvalType> distinctValues;
  std::unordered_map<TEvalType, int32_t> indexMap;
  BufferPtr indices = AlignedBuffer::allocate<int32_t>(data.size(), pool_);
  auto rawIndices = indices->asMutable<int32_t>();
  BufferPtr nulls =
      AlignedBuffer::allocate<bool>(data.size(), pool_, bits::kNotNull);
  auto rawNulls = nulls->asMutable<uint64_t>();
  vector_size_t nullCount = 0;
  for (auto i = 0; i < data.size(); ++i) {
    auto val = data[i];
    if (val == std::nullopt) {
      ++nullCount;
      bits::setNull(rawNulls, i, true);
    } else {
      const auto& [it, inserted] = indexMap.emplace(*val, indexMap.size());
      if (inserted) {
        distinctValues.push_back(*val);
      }
      *rawIndices = it->second;
    }
    ++rawIndices;
  }
  auto values = flatVector(distinctValues);
  auto stats = genVectorMakerStats(data);
  auto dictionaryVector = std::make_unique<DictionaryVector<TEvalType>>(
      pool_,
      nullCount ? nulls : nullptr,
      data.size(),
      std::move(values),
      std::move(indices),
      stats.asSimpleVectorStats(),
      indexMap.size(),
      nullCount,
      stats.isSorted);
 
  return dictionaryVector;
}
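
For readers without a Velox build at hand, the same idea can be sketched with standard-library containers only. This is a minimal illustration under stated assumptions, not the PR's implementation: `DictionaryEncoded`, `encode`, and `decode` are hypothetical names, the nulls are kept in a plain `std::vector<bool>` instead of a bit-packed buffer, and null rows get a placeholder index of 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container for illustration only; not a Velox type.
template <typename T>
struct DictionaryEncoded {
  std::vector<T> distinctValues; // dictionary values, in first-seen order
  std::vector<int32_t> indices;  // one entry per row; 0 placeholder for nulls
  std::vector<bool> isNull;      // one entry per row
};

// Encodes `data` by assigning each distinct non-null value the next free
// index and recording that index for every row that holds the value.
template <typename T>
DictionaryEncoded<T> encode(const std::vector<std::optional<T>>& data) {
  DictionaryEncoded<T> out;
  out.indices.assign(data.size(), 0);
  out.isNull.assign(data.size(), false);
  std::unordered_map<T, int32_t> indexMap;
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (!data[i].has_value()) {
      out.isNull[i] = true;
      continue;
    }
    // emplace only inserts when the key is new; `inserted` reports which.
    auto [it, inserted] =
        indexMap.emplace(*data[i], static_cast<int32_t>(indexMap.size()));
    if (inserted) {
      out.distinctValues.push_back(*data[i]);
    }
    out.indices[i] = it->second;
  }
  return out;
}

// Rebuilds the original rows from the dictionary, indices, and null flags.
template <typename T>
std::vector<std::optional<T>> decode(const DictionaryEncoded<T>& enc) {
  std::vector<std::optional<T>> out;
  out.reserve(enc.indices.size());
  for (std::size_t i = 0; i < enc.indices.size(); ++i) {
    if (enc.isNull[i]) {
      out.push_back(std::nullopt);
      continue;
    }
    const auto idx = static_cast<std::size_t>(enc.indices[i]);
    assert(idx < enc.distinctValues.size());
    out.push_back(enc.distinctValues[idx]);
  }
  return out;
}
```

Encoding the rows `"a", null, "b", "a"` with this sketch yields two distinct values and the indices 0, 0 (placeholder), 1, 0; decoding returns the original rows. The sketch uses `std::string` keys, which own their data; a non-owning key such as StringView would additionally need the underlying bytes to outlive the map.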

@github-actions github-actions bot added the VELOX label Mar 4, 2025

github-actions bot commented Mar 4, 2025

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:
