Excellent article and tips! I learned a lot from Joel's knowledge on Puppeteer and its intricacies.
I'm running a somewhat different business of which Puppeteer is a pretty big part. All points made in the post are valid; I can only add the following after having run ~400k Puppeteer sessions.
- Race conditions happen. This issue [0] is causing roughly 3% of all Puppeteer runs to fail in my case. I had to bake in a retry mechanism.
- Memory matters, CPU not so much. Just looking at my Librato and AWS stats, the CPU is mostly idle when running multiple concurrent sessions.
- One way to establish compartmentalisation is to actually run each scraping session in a separate, one-off Docker container. You pass in the code via a disk mount or such. No hassle with too many tabs, or shared context between runs. Each container is destroyed after running.
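For the retry point, a minimal sketch of what such a wrapper could look like. This is an illustration, not the actual code behind the numbers above: `withRetry`, its `retries` and `delayMs` options, and the example URL are all made-up names.

```javascript
// Hypothetical helper: re-run an async task when it throws.
// `retries` is the number of extra attempts after the first one.
async function withRetry(task, { retries = 2, delayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Wait a bit before the next attempt, but not after the last one.
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the most recent error.
  throw lastError;
}

// Usage sketch (assumes the `puppeteer` package is installed; not run here):
// const puppeteer = require('puppeteer');
// const title = await withRetry(async () => {
//   const browser = await puppeteer.launch();
//   try {
//     const page = await browser.newPage();
//     await page.goto('https://example.com');
//     return await page.title();
//   } finally {
//     await browser.close(); // always release the browser, even on failure
//   }
// });
```

Launching a fresh browser inside each attempt means a retry never inherits whatever state the failed run left behind.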
Hey Tim! Nice to see you here! I agree with your points overall, especially the disposable containers as they’re super hard to keep up. The CPU and memory notes are accurate as well, except for canvas-intensive sites (which makes sense).
I'm amazed that one must go as far as creating different containers - duplicating the browser, dependencies and whatnot - to achieve tab isolation and browser resiliency in Chrome, when Firefox does this with as little as container tabs.
Well, in my specific case it is also a security measure. In the context of a SaaS you want very strict separation between user sessions. This was my main concern.
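Roughly, the one-off container approach could be sketched like this from Node. Everything specific here is an assumption for illustration: the image name, the `/app/script.js` path inside the container, the memory limit, and the helper names `buildDockerArgs` and `runSession`.

```javascript
const { execFile } = require('node:child_process');

// Hypothetical helper: build the `docker run` arguments for one session.
function buildDockerArgs({ image, scriptDir, memory = '1g' }) {
  return [
    'run',
    '--rm',                        // destroy the container once it exits
    '--memory', memory,            // cap memory, the resource that matters most
    '-v', `${scriptDir}:/app:ro`,  // pass the code in via a read-only mount
    image,
    'node', '/app/script.js',
  ];
}

// Hypothetical helper: run one session in its own container.
// Resolves with the container's stdout, rejects on a non-zero exit.
function runSession(options) {
  return new Promise((resolve, reject) => {
    execFile('docker', buildDockerArgs(options), (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}

// Usage sketch (needs Docker and a suitable image; not run here):
// runSession({ image: 'my-puppeteer-image', scriptDir: '/tmp/job-1' })
//   .then((out) => console.log(out));
```

Each user session gets its own mount and its own container, so nothing is shared between runs.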
[0]: https://github.com/GoogleChrome/puppeteer/issues/1325