write_database is Significantly slower than .to_pandas().to_sql() #7852
Yes, you are right! I copied the dependencies from the wrong environment. Just edited it now, and the pandas version I am using is 1.5.3 |
@alexander-beedie interested in doing similar magic (similar to excel) on sqlalchemy one day? ;) |
@ritchie46: Lol.... This is right up my alley actually; I can write to various backends - anything with a postgres flavour, for example - from a Polars frame faster than SQLAlchemy or Pandas can (typically leveraging bulk insert). |
Wow. What do you serialize it into then? CSV with a fixed schema? Or a specific bytes dump per backend?
That's a good question. I think some complexity is worth it here. Dumping to databases is great functionality to have, and it's worth making it performant as well. |
For PostgreSQL-compatible targets I use the "TEXT" format for the bulk copy. |
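Not the poster's actual code, but a minimal sketch of bulk-loading over the COPY protocol with psycopg 3 and its default text format; the connection string and table name are placeholders:

import polars as pl
import psycopg

df = pl.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

with psycopg.connect("postgresql://user:pass@localhost:5432/mydb") as conn:
    with conn.cursor() as cur:
        # COPY ... FROM STDIN uses the text format by default;
        # write_row lets the driver handle escaping and NULLs.
        with cur.copy("COPY mytable (id, name) FROM STDIN") as copy:
            for row in df.iter_rows():
                copy.write_row(row)
    conn.commit()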
Sounds good. Everything will be in cache and probably mostly compute bound. I shall ensure that bound is higher with: #6155 |
This sounds like what I do for MSSQL at my company: Microsoft has a bulk copy utility (bcp). The result is a 300x speedup. |
@mcrumiller Do you have to call the bcp utility as an external process, or do you go through a driver-level interface? |
@alexander-beedie I dump a .dat file and load it with bcp. I hadn't considered driver interfaces, as I'm not too experienced on that front, but a quick google search gives me SQL Server Driver Extensions - Bulk Copy Functions, which looks promising. I was hoping to implement this in polars as well, which is why I've been asking around a bit about whether pl.Series([1], dtype=pl.UInt8).cast(pl.Binary).item() and pl.Series([1], dtype=pl.UInt64).cast(pl.Binary).item() both give the same result. |
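For anyone wanting the raw fixed-width bytes, one option (a sketch, not the poster's approach) is to go through NumPy rather than a Binary cast, since tobytes() preserves the dtype's width:

import polars as pl

s8 = pl.Series([1], dtype=pl.UInt8)
s64 = pl.Series([1], dtype=pl.UInt64)

# 1 byte vs 8 bytes for the same value, little-endian.
print(s8.to_numpy().astype("<u1").tobytes())   # b'\x01'
print(s64.to_numpy().astype("<u8").tobytes())  # b'\x01\x00\x00\x00\x00\x00\x00\x00'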
There is also a Rust/Python project which converts an Arrow DataFrame into the Postgres binary format and then does the bulk insert. This should be faster than TSV bulk insert (I have measured 4x on speed and 1.7x on memory). |
For example, here's a snippet of how to convert a pandas datetime Series into SQL Server's 3-byte date representation:

# See https://bornsql.ca/blog/how-sql-server-stores-data-types-datetime-date-time-and-datetime2/
# days since 0001-01-01, then grab only 3 bytes
null = data.isna()
data = (data.to_numpy().astype("datetime64[D]") - np.datetime64("0001-01-01")) \
    .astype(np.uint32).astype("V3")
# nullable is True if the SQL column we're uploading to allows nulls
# (N_rows and length are defined elsewhere in the full code)
if nullable:
    prefix_type = np.uint8
    prefix = np.full(N_rows, length, dtype=prefix_type)
    null_prefix = prefix_type(np.iinfo(prefix_type).max)
    data = prefix.view(np.void) + data.astype(object)
    data[null] = null_prefix

This encodes each value in the Series into a 3-byte value, plus (optionally) the null prefix that specifies whether the value is null or not. Each column is processed similarly (ints/datetimes/booleans/strings/etc.) and at the end you can perform a horizontal join of the encoded columns:

df = b''.join(df.to_numpy().ravel())

which can be written directly to disk. This then also needs an accompanying XML format file for bcp. |
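A toy end-to-end sketch of the same idea (illustrative names and widths, not the poster's actual code): encode each column to fixed-width bytes, join row-wise, and write the result to a .dat file for bcp to consume alongside a matching format file.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Encode each column as fixed-width little-endian bytes (4 bytes per int here).
encoded = pd.DataFrame({
    col: [v.tobytes() for v in df[col].to_numpy().astype("<i4")]
    for col in df.columns
})

# Horizontal join: ravel() walks row by row, so each row's column bytes are
# concatenated in order, then all rows are joined into one payload.
payload = b"".join(encoded.to_numpy().ravel())

with open("upload.dat", "wb") as f:
    f.write(payload)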
Hi @alexander-beedie, any chance of getting this picked up over the coming weeks/months? I have some further thoughts, but in summary, the performance benefit for anyone writing to postgres would be massive! Happy to create a new feature request for that specifically if you think it is a better place to continue. Thanks in advance. |
Do you have a full example of going from pandas to the dat and xml files that you want to share? 😇 @mcrumiller. And what your bcp invocation looks like? |
I found this open issue after testing polars 'write_database' with the 'adbc' engine. AWS postgres destination, 6K-row dataframe. It took over 5 minutes with polars vs. 2.3 seconds with pandas+sqlalchemy. |
@bobcolner: Got an example of how you called it? That sounds waaaaaay out 😅 |
@henryharbeck: I'll get there in the end 😅 Have been focused on the |
sure, I used this call for polars:
and this for pandas:
FYI I was writing to a table on AWS Aurora Serverless v2 as the postgres destination; see: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.html. Thanks for looking into this! PS - it would be great to add a |
Thank you very much @alexander-beedie. Just wanted to ensure that it was still on the cards for the future. |
@alexander-beedie Here is what I do: dump the DataFrame to a temporary CSV file and push it with BCP, where the target SQL table has to be created before running the bulk insert statement.
Now, this code is a few years old and can be optimized/sanitized. What I found was a drastic improvement in insert speed using BCP. However, it would be even quicker if we could somehow keep the CSV as bytes in memory and push it to the database, as the CSV writing is quite a heavy task by itself. This, however, poses the problem of keeping it all in memory, so writing a temporary CSV file and deleting it after the bulk insert is a safer bet. Nonetheless, it should be way quicker than any other method for MS SQL. Of course, this is pandas to MS SQL, so not at all generalizable. But I found it worth sharing. |
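A minimal sketch of that CSV-plus-bcp flow (server, credentials, and table names are placeholders; the bcp flags shown are the standard character-mode options, and the target table is assumed to exist already):

import os
import subprocess
import tempfile
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write the frame to a temporary CSV file that bcp can read.
fd, csv_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
df.to_csv(csv_path, index=False, header=False)

# -c character mode, -t field terminator, -S server, -d database, -U/-P credentials.
subprocess.run(
    ["bcp", "dbo.my_table", "in", csv_path,
     "-S", "myserver", "-d", "mydb", "-U", "user", "-P", "password",
     "-c", "-t,"],
    check=True,
)

os.remove(csv_path)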
I've since updated my bcp packaging and it's now mainly performed with zero-copy. |
Here is an interesting article on how to upload to an Azure SQL database, with some tips and tricks on the bcp command: https://www.sqlshack.com/how-to-handle-100-million-rows-with-sql-server-bcp/ |
@alexander-beedie I've done the approach you mentioned of converting to a CSV io object and then using COPY, but one wrinkle: if, let's say, you have a counter column that generates values, you can't bulk copy directly into the table. So I do the postgres COPY into a temp table and then insert from the temp table into the real table, so that the df being inserted doesn't have to have the same columns as the destination table. I just mention that to preempt the "I need to insert a 5-column df into a 6-column table in the DB" issue. |
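A sketch of that temp-table pattern with psycopg 3 (table and column names are illustrative): the COPY goes into a temporary staging table, and the final INSERT lets the database fill in any generated columns.

import io
import polars as pl
import psycopg

df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with psycopg.connect("postgresql://user:pass@localhost:5432/mydb") as conn:
    with conn.cursor() as cur:
        # Staging table matches the frame's columns only (no generated id column).
        cur.execute("CREATE TEMP TABLE staging (a integer, b text) ON COMMIT DROP")

        buf = io.BytesIO()
        df.write_csv(buf, include_header=False)
        buf.seek(0)
        with cur.copy("COPY staging (a, b) FROM STDIN WITH (FORMAT csv)") as copy:
            copy.write(buf.read())

        # The real table may have extra columns (e.g. a serial primary key);
        # only the staged columns are listed here.
        cur.execute("INSERT INTO target_table (a, b) SELECT a, b FROM staging")
    conn.commit()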
For clarification, to Polars developers: what is the intended objective of Polars write_database?
a) to be a "fastest-possible loading" tool, using any means available from each RDBMS implementation (PostgreSQL COPY, etc.), or
b) to be an "as general as possible" command that can pump data into any SQL database (like Pandas to_sql)?
After reading this thread, I have a strong feeling that the answer is a). In that case, I cannot help very much. Otherwise: I came here from #13513. My wish was (initially) just to remove the Pandas dependency (a large package currently used just to add a thin wrapper over SQLAlchemy to insert data). In that case, I'm interested in helping with some PRs (not neglecting performance, of course).
Disclaimer: I have a strong interest in working with data from several RDBMS, but not enough to go into the specifics of each one (COPY, BCP, etc.). I understand they can be orders of magnitude faster, but the lack of standardization is a very strong problem for me (e.g., the cited tools need the source data to be accessible from the server's local filesystem, which is not available to the user in many cases). |
Hi, is this issue still ongoing, where polars.write_database() is way slower than pandas.to_sql() with the sqlalchemy engine? Or have improvements already been made? |
We haven't done the big rewrite yet. This means that, in the cases where we were previously slower, we should now be able to get the same speed as pandas. Coming in the next release. |
I am a noob so you may not be satisfied by the code quality, but this is a lot faster than pandas.to_sql or polars write_database: https://gist.github.com/Tapanhaz/9c53e47a95f7759e6f506410cea0b3c3 |
@alexander-beedie Out of curiosity, is this rewrite being tracked in a separate issue? If so, could you please provide the reference? |
I have been using the following code, which gives me a 25x speedup against write_database:

import io
from datetime import datetime, timedelta
from time import time
import numpy as np
import polars as pl
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
db_url: str = "postgresql+psycopg://postgres:postgres@localhost:5432/mydb"
engine = create_engine(db_url)
# The tables are created with:
# create table mytable (int_col integer, string_col text, timestamp_col timestamp, float_col float4);
def generate_polars_dataframe(num_rows=1_000_000):
    int_col = np.random.randint(0, 100, size=num_rows)
    string_col = [f"string_{i}" for i in range(num_rows)]
    start_date = datetime(2020, 1, 1)
    timestamp_col = [start_date + timedelta(days=i) for i in range(num_rows)]
    float_col = np.random.rand(num_rows) * 100.0
    df = pl.DataFrame(
        {
            "int_col": int_col,
            "string_col": string_col,
            "timestamp_col": pl.Series(timestamp_col),
            "float_col": float_col,
        }
    )
    return df

def write_to_pg2(df: pl.DataFrame) -> None:
    session = sessionmaker(bind=engine)()
    cols = ",".join(df.columns)
    cursor = session.connection().connection.cursor()
    io_buffer = io.BytesIO()
    with cursor.copy(f"COPY mytable2 ({cols}) FROM STDIN DELIMITER ','") as copy:
        df.write_csv(io_buffer, include_header=False)
        BLOCK_SIZE = 128 * 1024  # copied from psycopg's MAX_BUFFER_SIZE
        io_buffer.seek(0)
        while data := io_buffer.read(BLOCK_SIZE):
            copy.write(data)
    session.commit()

def write_to_pg1(df: pl.DataFrame) -> None:
    df.write_database("mytable", engine, if_table_exists="append")
df = generate_polars_dataframe()
a = time()
write_to_pg1(df)
b = time()
write_to_pg2(df)
c = time()
print(b - a) # 64.45s
print(c - b) # 2.54s
# swapping write methods changes results by ~1%, so it's not just caching ;) |
The difference in speed between COPY and INSERT in postgres is a long-standing, fairly well-known thing. Here's a gist I did in Rust using COPY. If no one beats me to it I might make a pl_to_pg one day. If I were better at Rust I'd like to do something like sink_csv to a buffer in one thread and, in another thread, read from that buffer into postgres while flushing the buffer to keep memory (or even temp disk space) low. |
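A rough Python sketch of that streaming idea (not the gist's Rust code; table name and connection string are placeholders): a background thread writes the CSV into one end of an OS pipe while the main thread feeds the other end to a psycopg COPY, so the full CSV never has to sit in memory.

import os
import threading
import polars as pl
import psycopg

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

read_fd, write_fd = os.pipe()

def producer() -> None:
    # write_csv streams into the pipe; closing it signals EOF to the reader.
    with os.fdopen(write_fd, "wb") as w:
        df.write_csv(w, include_header=False)

t = threading.Thread(target=producer)
t.start()

with psycopg.connect("postgresql://user:pass@localhost:5432/mydb") as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY mytable (a, b) FROM STDIN WITH (FORMAT csv)") as copy:
            with os.fdopen(read_fd, "rb") as r:
                while chunk := r.read(128 * 1024):
                    copy.write(chunk)
    conn.commit()

t.join()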
@deanm0000 postgres has a faster binary COPY format. |
@mcrumiller take a look at that gist I linked to. It uses binary encoding. |
Ah! Indeed it does, that's very nice. |
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
I have been using .to_pandas().to_sql() for a while now, and with the latest polars version it's now possible to use .write_database(). However, I have found it to be much slower than the previous option. Having looked at the code, it seems that the only difference is the usage of use_pyarrow_extension_array=True. This might be an issue on the pandas side, as I also tested .to_pandas(use_pyarrow_extension_array=True).to_sql() and, for the test cases, it more than doubled the exporting time (on my use cases, around 2.5x slower).

Reproducible example
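Not the author's original snippet, but a minimal comparison of the kind described (connection string, table names, and data are placeholders):

import time
import polars as pl
from sqlalchemy import create_engine

db_uri = "postgresql+psycopg2://user:pass@localhost:5432/mydb"
engine = create_engine(db_uri)
df = pl.DataFrame({"id": range(100_000), "value": ["x"] * 100_000})

t0 = time.perf_counter()
# (older polars releases spelled this keyword if_exists)
df.write_database("test_polars", db_uri, if_table_exists="replace")
t1 = time.perf_counter()
df.to_pandas().to_sql("test_pandas", engine, if_exists="replace", index=False)
t2 = time.perf_counter()

print(f"write_database: {t1 - t0:.2f}s, to_pandas().to_sql(): {t2 - t1:.2f}s")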
Expected behavior
So, while .to_pandas(use_pyarrow_extension_array=True) is indeed faster, it seems .to_pandas().to_sql() is actually faster overall. If this is true from your side, maybe it would be good to change .to_pandas(use_pyarrow_extension_array=True).to_sql() to .to_pandas().to_sql().

In addition to this, it seems that not being able to set fast_execute_many=True when creating the sqlalchemy engine further slows down the exporting process, to around 10x slower than exporting manually with such an engine. Below is a suggestion of how the suggested changes could look:
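A sketch of the kind of call being suggested, assuming a pyodbc-backed SQL Server engine (connection string and table name are placeholders; fast_executemany is the SQLAlchemy/pyodbc spelling of the flag):

import polars as pl
from sqlalchemy import create_engine

# fast_executemany batches the INSERTs at the pyodbc driver level.
engine = create_engine(
    "mssql+pyodbc://user:pass@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Plain to_pandas() (no pyarrow extension arrays) followed by to_sql()
# was the faster path in the tests described above.
df.to_pandas().to_sql("my_table", engine, if_exists="append", index=False)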
Installed versions