improving the speed of to_csv

See original GitHub issue

Hello,

I don't know if that is possible, but it would be great to find a way to speed up the to_csv method in Pandas.

In my admittedly large dataframe with 20 million observations and 50 variables, it takes hours to export the data to a CSV file.

Reading the CSV in Pandas is much faster though. I wonder what the bottleneck is here and what can be done to improve the data transfer.

CSV files are ubiquitous, and a great way to share data (without getting too nerdy with HDF5 and other subtleties). What do you think?
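A few `to_csv` parameters are commonly suggested as mitigations for slow exports. A minimal sketch, assuming a purely numeric frame (row count scaled far down from the 20-million-row case; the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for the large frame; scale the row count up to reproduce the slowdown.
df = pd.DataFrame(np.random.randn(50_000, 5),
                  columns=[f"var{i}" for i in range(5)])

# Commonly suggested mitigations: skip the index, pin the float format
# (avoids full-precision repr work per value), and write in chunks.
df.to_csv("out.csv", index=False,
          float_format="%.6f", chunksize=10_000)
```

Whether these help depends on the data; `float_format` in particular trades precision for formatting speed.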

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 17 (7 by maintainers)

Top GitHub Comments

13 reactions · PCJimmmy commented, Aug 5, 2019

So how about an answer to the initial question, rather than getting off track?

I need CSV format.

Can pandas improve speed of writing that file format?

5 reactions · jreback commented, Apr 13, 2016

duplicate of #3186

Well, using 1/10 the rows, about 800 MB in memory:

In [5]: df = DataFrame(np.random.randn(2000000,50))

In [6]: df.memory_usage().sum()
Out[6]: 800000072

In [7]: df.memory_usage().sum()/1000000
Out[7]: 800

In [8]: %time df.to_csv('test.csv')
CPU times: user 2min 53s, sys: 3.71 s, total: 2min 56s
Wall time: 2min 57s

That's about 8.5 MB/s of raw throughput, well below disk I/O speeds, so there is clearly quite some room to improve. You might be interested in this blog here.

Of course, there IS really no reason at all to use CSV unless you are forced to.

In [11]: %time df.to_hdf('test.hdf','df')
CPU times: user 10.5 ms, sys: 588 ms, total: 599 ms
Wall time: 2.22 s
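The CSV-vs-binary gap above can be roughly reproduced without PyTables installed by using `to_pickle` as the binary writer; this is a stand-in for `to_hdf`, not the original benchmark, and the frame here is much smaller than jreback's:

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50_000, 20))

# Time the text path: every float is formatted to decimal characters.
t0 = time.perf_counter()
df.to_csv("test.csv")
csv_s = time.perf_counter() - t0

# Time a binary path: the float64 block is dumped with no per-value formatting.
t0 = time.perf_counter()
df.to_pickle("test.pkl")  # stand-in for to_hdf in this sketch
bin_s = time.perf_counter() - t0

print(f"to_csv: {csv_s:.2f}s  to_pickle: {bin_s:.2f}s")
```

The binary file is also considerably smaller, since a float64 takes 8 bytes instead of ~20 characters of decimal text.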
