2021 January 29

Harsh in Scrapy
I have a question about captcha solving while crawling a site with Scrapy.

The site I'm trying to scrape shows a captcha before the target (a PDF link).

We could use 2captcha or another service to solve the captchas. I just don't know how to incorporate that into a Scrapy crawler.

I feel the asynchronous nature of Scrapy won't allow waiting for the captcha solution to come back from the solving service. I could be wrong.

If anyone has experienced the same problem, please share some insight. Thanks.

Max in Scrapy
If the captcha is simple, you can use a middleware to handle it. As far as I remember, it should be called after Httpmiddleware.

Кирилл in Scrapy
Yes, there is no simple solution for this. You can try to scrape synchronously if your task allows you to scrape slowly.
Otherwise, you have to split the requests into separate flows or sessions, pause some requests, and retry them once the captcha has been solved.
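For the slow, synchronous route, a minimal sketch of settings that keep a single request in flight might look like this. The setting names are standard Scrapy settings; the values, the delay in particular, are arbitrary assumptions and not something stated in the chat.

```python
# settings.py fragment (values are illustrative assumptions)
CONCURRENT_REQUESTS = 1             # only one request in flight at a time
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # same limit per target domain
DOWNLOAD_DELAY = 2.0                # seconds between requests; tune to the site
```

The same keys can also go into a spider's `custom_settings` dict if only one spider should run this slowly.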

Harsh in Scrapy
Max
If captcha is simple, you can use middleware to handle it. As I remember, it should be called after Httpmiddleware
Thanks for the middleware hint.
It's Cloudflare bot detection, followed by reCAPTCHA v2. So two different requests, I guess.

Harsh in Scrapy
Кирилл
Yes, there is no simple solution for this. You can try to scrape synchronously if your tasks allow you to scrape slowly.
In the opposite case, you have to split the flow of requests into separate flows, sessions, stop some requests, and retry them to wait for the solution of a captcha
Thanks for the input.
Yeah, we could go with the synchronous approach.

Кирилл in Scrapy
Also, you have to write your own functions to interact with captcha-solving services, because most of their libraries are built on the synchronous requests package.
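As a rough illustration of such hand-written functions, the sketch below only builds the URLs and parses the replies, so the actual HTTP calls can go through Scrapy's own downloader. The endpoint paths, parameter names, and reply shape are assumptions recalled from 2captcha's public HTTP API and should be checked against the service documentation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed endpoints of the solving service; verify against its documentation.
SUBMIT_URL = "https://2captcha.com/in.php"
RESULT_URL = "https://2captcha.com/res.php"


def build_submit_url(api_key, site_key, page_url):
    """Build the URL that submits a reCAPTCHA v2 task (assumed parameters)."""
    params = {
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
        "json": 1,
    }
    return f"{SUBMIT_URL}?{urlencode(params)}"


def build_result_url(api_key, task_id):
    """Build the URL that polls for the result of a submitted task."""
    params = {"key": api_key, "action": "get", "id": task_id, "json": 1}
    return f"{RESULT_URL}?{urlencode(params)}"


def parse_reply(body):
    """Return (ready, value) from a reply assumed to look like
    {"status": 0 or 1, "request": "..."}."""
    data = json.loads(body)
    return data.get("status") == 1, data.get("request")
```

Each URL can be passed to a `scrapy.Request`, with `parse_reply(response.text)` called in the callback, so nothing blocks on the requests package.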

Кирилл in Scrapy
Harsh
Thanks for the inputs.
Yeah. We could go with synchronous approach.
The synchronous way is the easiest.

Harsh in Scrapy
Кирилл
Also you have to write your own functions to interact with captcha solving services because most of them have libs based on the synchronous requests package
Yeah. We'll need to use simple requests and form requests.

I looked for examples on GitHub, but no luck so far. I'll search the Scrapy issues to see if I can find something.

Harsh in Scrapy
Кирилл
synchronous way is the easiest
Yeah :)

Кирилл in Scrapy
Harsh
Thanks for the inputs.
Yeah. We could go with synchronous approach.
Then just put your requests in a chain (issue every request from the callback of the previous one) and check every response for a captcha. With this approach, you can even use requests from within Scrapy without any problem.

Harsh in Scrapy
Кирилл
Then just put your requests in a chain(call every request from the callback of the previous request), check every response for a captcha. With this approach, you can even use requests from  within scrapy without any problem
Currently, I navigate from the home page to the target, and other information is scraped along the way, e.g. car name, car year, make, etc.

The captcha comes up at the last page, where the PDF download is.

So if I could chain the last response with all the information in meta, I could add the PDF link from the solved captcha to that information.

Could you give a hint on how, in general, we make Scrapy requests synchronous?

Кирилл in Scrapy
Кирилл
Then just put your requests in a chain(call every request from the callback of the previous request), check every response for a captcha. With this approach, you can even use requests from  within scrapy without any problem
I've already given it.

Кирилл in Scrapy
A chain of scrapy requests

Harsh in Scrapy
Кирилл
A chain of scrapy requests
e.g.
from scrapy import Request

def get_captcha(self, response):
    # solver_endpoint: URL of the captcha-solving service (2captcha here)
    yield Request(url=solver_endpoint, callback=self.parse_something)

Harsh in Scrapy
Кирилл
A chain of scrapy requests
I found a reference link. Thanks sir :)

Семён Трояновский... in Scrapy
Sounds like you probably don't need scrapy at all

Harsh in Scrapy
Семён Трояновский
Sounds like you probably don't need scrapy at all
Actually, we have crawlers running on Scrapinghub, so it would be good if this could be achieved with Scrapy.

Harsh in Scrapy
Otherwise, it would be a big change to move to Apify for Puppeteer-based automation. It may come to that too, but not in the near future.

Andrii in Scrapy
Has anyone measured whether there is a difference in scraping speed when running it from, say, PyCharm versus WSL?

Andrey Rahmatullin in Scrapy
What does PyCharm have to do with it?