improving the speed of to_csv

See original GitHub issue

Hello,

I don't know if that is possible, but it would be great to find a way to speed up the to_csv method in Pandas.

In my admittedly large dataframe with 20 million observations and 50 variables, it takes hours to export the data to a CSV file.

Reading the CSV in Pandas is much faster though. I wonder what the bottleneck is here and what can be done to improve the data transfer.

CSV files are ubiquitous, and a great way to share data (without getting too nerdy with HDF5 and other subtleties). What do you think?
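A few `to_csv` parameters are commonly suggested as mitigations for slow exports. A minimal sketch, assuming a purely numeric frame (row count scaled far down from the 20-million-row case; the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for the large frame; scale the row count up to reproduce the slowdown.
df = pd.DataFrame(np.random.randn(50_000, 5),
                  columns=[f"var{i}" for i in range(5)])

# Commonly suggested mitigations: skip the index, pin the float format
# (avoids full-precision repr work per value), and write in chunks.
df.to_csv("out.csv", index=False,
          float_format="%.6f", chunksize=10_000)
```

Whether these help depends on the data; `float_format` in particular trades precision for formatting speed.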

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 17 (7 by maintainers)

Top GitHub Comments

13 reactions · PCJimmmy commented, Aug 5, 2019

So how about an answer to the initial question, rather than getting off track?

I need CSV format.

Can pandas improve speed of writing that file format?

5 reactions · jreback commented, Apr 13, 2016

duplicate of #3186

Well, using 1/10 the rows, about 800 MB in memory:

In [5]: df = DataFrame(np.random.randn(2000000,50))

In [6]: df.memory_usage().sum()
Out[6]: 800000072

In [7]: df.memory_usage().sum()/1000000
Out[7]: 800

In [8]: %time df.to_csv('test.csv')
CPU times: user 2min 53s, sys: 3.71 s, total: 2min 56s
Wall time: 2min 57s

That's about 8.5 MB/s of raw throughput, well below disk I/O speeds, so there is clearly quite some room to improve. You might be interested in this blog here.

Of course, there IS really no reason at all to use CSV unless you are forced to.

In [11]: %time df.to_hdf('test.hdf','df')
CPU times: user 10.5 ms, sys: 588 ms, total: 599 ms
Wall time: 2.22 s
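The CSV-vs-binary gap above can be roughly reproduced without PyTables installed by using `to_pickle` as the binary writer; this is a stand-in for `to_hdf`, not the original benchmark, and the frame here is much smaller than jreback's:

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50_000, 20))

# Time the text path: every float is formatted to decimal characters.
t0 = time.perf_counter()
df.to_csv("test.csv")
csv_s = time.perf_counter() - t0

# Time a binary path: the float64 block is dumped with no per-value formatting.
t0 = time.perf_counter()
df.to_pickle("test.pkl")  # stand-in for to_hdf in this sketch
bin_s = time.perf_counter() - t0

print(f"to_csv: {csv_s:.2f}s  to_pickle: {bin_s:.2f}s")
```

The binary file is also considerably smaller, since a float64 takes 8 bytes instead of ~20 characters of decimal text.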
