[RFC] Support Python in OS Scripting Service #17432
Comments
This is a fantastic proposal. We all agree that Python can help OpenSearch break into many new areas. Scripts are meant to be a lightweight interface for our users to make customizations, and Python, as a popular language, should change users' impression that Painless scripting is 'painful'.
I think this is a really interesting idea. I'm glad that you're considering the security implications! I see that as the biggest obstacle that we'll need to overcome to make this a reality. The GraalVM sandboxing could be a promising start, but we'll definitely need to be very careful that we don't introduce a new vector for attackers.
I love the idea of using Python rather than Painless -- as you say, it is better-known and has far more capabilities.
From my perspective, the real value of adding Python is when we go big. My gut feeling is that adding Python just as a bare-bones syntax replacement for Painless, while worthwhile in itself, isn't the big win. We'll just be fielding requests to "support this Python library" and "can I make this call out?" The big win is more meta... It's when we can tell our Data Scientist colleagues that "We Appreciate YOU and Care About YOU", and we're expressing that by bringing your number 1 tool to the table: Python. The next great search engine is the one that the Data Scientist community embraces wholeheartedly, and we want that to be OpenSearch. We want OpenSearch to be a tool they naturally reach for, just like Jupyter Notebooks, Pandas, etc. That means supporting Python as a first-class citizen up and down the stack. Yes, there are engineering challenges and we need to embrace them, not shy away from them. Someone will embrace them, why not us?
@epugh @smacrakis we will definitely raise another RFC on that, after we finish a short demo.
We also want to save the effort of users writing |
Introduction
This RFC proposes adding Python 3 as a supported language in the OpenSearch Scripting Service, especially to complement Painless scripting, which many users now consider ‘painful’.
Python is widely recognized as a simple yet powerful language, especially within the data science community. By integrating Python, OpenSearch aims to broaden its appeal to users who rely on Python for data processing and analytical tasks.
Background & Motivation
OpenSearch currently supports several scripting languages, such as Painless, Mustache, and Expressions. While each has merits, each also comes with a learning curve and a syntax that may be unfamiliar to Python users. Python’s ecosystem offers extensive data processing, machine learning, and analytical libraries. Enabling Python scripts within OpenSearch will reduce adoption barriers and empower a broader segment of the community to write custom logic for tasks such as scoring documents, executing specialized aggregations, and customizing ingestion pipelines.
Proposed Solution
Overview
The proposal is to implement a Python script plugin that integrates with the existing Scripting Service. This plugin will allow users to evaluate Python scripts at runtime under various contexts.
Below is a high-level flowchart illustrating how Python scripting will interact with the existing OpenSearch architecture:
Implementation Approaches
We have identified two primary implementation strategies:
Parsing and Translating Python Code
The Python code could be parsed into an intermediate form (e.g. Calcite’s logical plan) that OpenSearch can convert into its native execution plan.
Direct Execution as Guest Language (GraalVM)
Using GraalVM’s Polyglot API, Python code can run directly within the OpenSearch process, with a standalone Python runtime hosted inside the same JVM. This solution embeds Python as a guest language, enabling the execution of custom Python scripts.
A PoC has demonstrated the feasibility of the second approach with GraalVM.
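For the first approach (parsing and translating), the initial step is turning script text into a syntax tree that a translator could walk. The sketch below shows only that step, using Python's standard `ast` module. The example expression and the `free_variables` helper are hypothetical, and the translation into an intermediate form such as a Calcite logical plan is not shown.

```python
import ast

# Hypothetical user script: an "average rating times factor" expression.
SOURCE = "sum(ratings) / len(ratings) * factor"


def free_variables(source: str) -> list[str]:
    """Parse an expression and list the names it reads, excluding called functions.

    A translator would need these names to bind document fields and
    script parameters before building an execution plan.
    """
    tree = ast.parse(source, mode="eval")
    # Names used as the target of a call, e.g. `sum` and `len`.
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    # Every bare name that appears anywhere in the expression.
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(names - called)


if __name__ == "__main__":
    print(free_variables(SOURCE))
    print(ast.dump(ast.parse(SOURCE, mode="eval"), indent=2))
```

A real translator would also have to reject constructs with no equivalent in the target plan, which is one reason this approach can only cover a subset of Python.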
Demo
Demo1: Custom Scoring with Python
This demo shows how to calculate a score as the average of a document's ratings using a Python script.
Create an index called “books” and insert 3 books into it
Store a Python script called `agg_ratings`. The script takes the average of a book's ratings and multiplies it by a factor, which is passed in through the query parameters.
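The stored script itself is not reproduced in this copy of the RFC. As a minimal sketch of the logic described above, assume the ratings arrive as a list of numbers and the parameters as a dictionary. The function name and signature are illustrative only; how the plugin actually exposes document fields and `params` to a script is not specified here.

```python
from statistics import fmean


def agg_ratings(ratings, params):
    """Average the ratings and scale the result by params["factor"].

    `ratings` and `params` stand in for whatever the scripting context
    provides; both names are assumptions made for this sketch.
    """
    if not ratings:
        # Treat a document without ratings as a zero score instead of failing.
        return 0.0
    return fmean(ratings) * params["factor"]


if __name__ == "__main__":
    # Hypothetical ratings on a 0-5 scale; a factor of 2.0 maps them to 0-10.
    print(agg_ratings([4, 5, 3], {"factor": 2.0}))
```

Returning `0.0` for an empty list is a design choice of this sketch. A real script might instead skip the document or raise an error.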
Execute the script under the score context. The `score` context runs a script as if the script were in a `script_score` function in a `function_score` query. The `params` object specifies the `factor` as `2.0`, which will scale the average ratings to a 0–10 range.
A sample response might look as follows:
Here, `_score` is the average of the document’s `ratings` multiplied by the specified factor of `2.0`. This confirms that the Python script correctly evaluates the provided documents and parameters.
Demo2: Post-processing tensor output in neural search
Neural search applies language models to transform document text into vector embeddings, which improves semantic search performance. It supports using externally hosted models to embed documents. This tutorial explains the process in more detail. However, different language model vendors return tensors wrapped in different formats. Historically, users have had to write Painless scripts to transform the data into a unified format that the document ingestion pipeline can recognize. In this demonstration, we use Python to process responses from the Bedrock Cohere embed-english model. The following steps follow the standard way to connect to externally hosted models and are adapted from this blueprint; we only alter the post-processing part to use a custom Python script. Irrelevant parts are omitted for brevity.
Create a connector for Amazon Bedrock
In the above example:
- `pre_process_function`: Uses a Painless script to prepare the request payload for the model.
- `post_process_function`: Uses a Python script to transform the returned embeddings into JSON objects that include metadata such as `name`, `data_type`, and `shape`.
The Python script is shown below:
This script unpacks the returned list of tensors into JSON objects with their corresponding metadata, which can then be used by downstream components in the ingestion pipeline.
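The original script is not reproduced in this copy of the RFC. The following is a minimal sketch of the transformation described, under two assumptions: the model response carries its vectors in an `embeddings` list of float lists, and the target objects use the `sentence_embedding` / `FLOAT32` naming. The `post_process` function name is also hypothetical.

```python
import json


def post_process(response):
    """Wrap each embedding vector with name, data_type and shape metadata.

    Assumes `response["embeddings"]` is a list of float lists; that key and
    the metadata values below are assumptions, not taken from the RFC.
    """
    return [
        {
            "name": "sentence_embedding",
            "data_type": "FLOAT32",
            "shape": [len(vector)],
            "data": vector,
        }
        for vector in response["embeddings"]
    ]


if __name__ == "__main__":
    # Tiny made-up response with two 3-dimensional vectors.
    sample = {"embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}
    print(json.dumps(post_process(sample), indent=2))
```

Each input vector becomes one JSON object, so the output list has the same length and order as the `embeddings` list.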
Note:
- The `ml-commons` plugin has been modified to support the optional `pre_process_lang` and `post_process_lang` parameters for this proof of concept.
- In `ml-commons`, built-in support for certain Cohere models is available via `connector.pre_process.cohere.embedding` and `connector.post_process.cohere.embedding`. This demonstration uses custom scripts for illustrative purposes and to verify correctness.
Generate embeddings with custom post-processing
The `<MODEL_ID>` is an identifier for the external model generated in the previous steps. The response is as follows:
The embeddings here have been post-processed by the Python script to provide standardized metadata alongside the raw tensor data.
Python packages
Built-in Python libraries are self-contained in GraalVM’s Polyglot Python runtime. Third-party Python packages can be configured with the GraalPy Gradle plugin by specifying package names and versions in `build.gradle`:
GraalPy is compatible with common Python packages such as NumPy and pandas. Please consult the GraalPy package compatibility page for the list of supported Python packages.
Security and Compatibility
Security (varies based on implementation)
Sandboxing: GraalVM offers security mechanisms like sandboxing and host access control out of the box. We will need to scrutinize them to ensure the extended capability aligns with the security guidelines of OpenSearch.
Malicious scripts: Multiple approaches have been discussed to mitigate the risk of malicious scripts.
Resource management: Python scripts should be subject to resource usage limits (e.g., CPU and memory) to ensure they do not disrupt cluster stability.
GraalVM Compatibility
GraalVM’s Polyglot API can run on various Java runtimes, including OpenJDK, GraalVM Community Edition, and Oracle JDK. This should cover most use cases. Runtimes other than GraalVM can be further optimized if the experimental VM options below are enabled: