
Excellent article and tips! I learned a lot from Joel's knowledge on Puppeteer and its intricacies.

I'm running a somewhat different business of which Puppeteer is a pretty big part. All the points made in the post are valid; I can only add the following after having run ~400k Puppeteer sessions.

- Race conditions happen. This issue [0] is causing roughly 3% of all Puppeteer runs to fail in my case. I had to bake in a retry mechanism.

- Memory matters, CPU not so much. Just looking at my Librato and AWS stats, the CPU is mostly idle when running multiple concurrent sessions.

- One way to establish compartmentalisation is to actually run each scraping session in a separate, one-off Docker container. You pass in the code via a disk mount or such. No hassle with too many tabs, or shared context between runs. Each container is destroyed after running.

[0]: https://github.com/GoogleChrome/puppeteer/issues/1325
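For anyone wondering what a retry mechanism like that could look like, here is a minimal sketch. It is not my production code, just the general shape: `withRetry`, `maxAttempts` and `delayMs` are names made up for illustration, and nothing in it is Puppeteer-specific.

```typescript
// Generic retry helper (illustrative names, not part of Puppeteer).
// Runs `task` up to `maxAttempts` times and rethrows the last error
// if every attempt fails.
async function withRetry<T>(
  task: (attempt: number) => Promise<T>,
  maxAttempts = 3,
  delayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Linear backoff: wait a little longer after each failed attempt.
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;
}
```

You would wrap a whole session in it, e.g. `await withRetry(() => runSession(url))`, where `runSession` is a hypothetical function that launches a fresh browser, does the work, and closes the browser in a `finally` block so a failed attempt does not leak a Chrome process.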



Hey Tim! Nice to see you here! I agree with your points overall, especially the disposable containers as they’re super hard to keep up. The CPU and memory notes are accurate as well, except for canvas-intensive sites (which makes sense).

Thanks for posting your thoughts!


I'm amazed that one must go as far as creating different containers - duplicating the browser, dependencies and whatnot - to achieve tab isolation and browser resiliency in Chrome, when Firefox does this with as little as container tabs.


Well, in my specific case it is also a security measure. In the context of a SaaS, you want very strict separation between user sessions. This was my main concern.
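To make the one-off container idea a bit more concrete, here is a rough sketch of how the `docker run` arguments for a single session could be assembled. Treat everything named here as an assumption: `SessionSpec`, `dockerArgs`, the `/job` mount point and the image name in the usage note are illustrative, not the actual setup.

```typescript
// Describes one isolated scraping session (all fields are illustrative).
interface SessionSpec {
  image: string;     // hypothetical image with Chrome and Puppeteer installed
  scriptDir: string; // absolute host directory holding this user's script
  entry: string;     // script file name inside scriptDir
  memory: string;    // container memory limit, e.g. "1g"
}

// Builds the argument list for `docker run`.
function dockerArgs(spec: SessionSpec): string[] {
  return [
    "run",
    "--rm",                            // destroy the container once it exits
    "--memory", spec.memory,           // cap memory, the scarce resource here
    "-v", `${spec.scriptDir}:/job:ro`, // pass the code in via a read-only mount
    spec.image,
    "node", `/job/${spec.entry}`,
  ];
}
```

The resulting array can be handed to Node's `child_process.spawn("docker", args)`, for example with `dockerArgs({ image: "example/puppeteer-runner", scriptDir: "/tmp/job-1", entry: "scrape.js", memory: "1g" })`. Each user session gets its own container and its own browser, so nothing is shared between runs.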



