
Making Web Crawling 10x Faster with HTTPX

by 혼밥맨 2023. 1. 14.

 

Time taken with synchronous crawling: 7.8 seconds

Time taken with asynchronous HTTPX crawling: 0.79 seconds

 

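What is HTTPX?

HTTPX is a fully featured HTTP client for Python 3 that provides both sync and async APIs and supports HTTP/1.1 and HTTP/2.

It should not be confused with the Go-based httpx tool from ProjectDiscovery, which shares the name: that one is a fast, multi-purpose HTTP toolkit that runs multiple probes using the retryablehttp library and is designed to keep results reliable as the thread count increases. This post uses only the Python library.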

 

Required library

https://www.python-httpx.org/

 


 

Installing the libraries

pip install requests
pip install httpx

 

Target web page for crawling

https://books.toscrape.com/index.html

 


This is a public practice site for scraping and crawling; it comes up when you search Google for "books to scrape".

 

Sync crawling: full code

# pip install requests

import time

import requests


def fetch():
    # Six pages of the "Nonfiction" category on the practice site.
    urls = ["https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-2.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-3.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-4.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-5.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-6.html",
            ]

    # Each request starts only after the previous one has finished.
    results = [requests.get(url) for url in urls]

    print(results)


# Measure the wall-clock time of the whole crawl.
start = time.perf_counter()
fetch()
end = time.perf_counter()

print(end - start)

Time taken with synchronous crawling: 7.8 seconds
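The six URLs in the listing differ only in the final file name, so the list can also be generated instead of typed out. The following is a small sketch of that idea; the helper name `build_page_urls` and the `BASE` constant are hypothetical, and the `index.html` / `page-N.html` pattern is simply the one visible in the URLs above.

```python
from typing import List

# Common prefix of the category pages used in this post.
BASE = "https://books.toscrape.com/catalogue/category/books/nonfiction_13/"


def build_page_urls(last_page: int) -> List[str]:
    # Page 1 is served as index.html; later pages as page-N.html.
    first = [BASE + "index.html"]
    rest = [f"{BASE}page-{n}.html" for n in range(2, last_page + 1)]
    return first + rest


urls = build_page_urls(6)
for url in urls:
    print(url)
```

The resulting `urls` list can be passed to either the sync or the async version of `fetch()` in place of the hard-coded list.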

 

Async crawling: full code

# pip install httpx

import asyncio
import time

import httpx


async def fetch():
    # Same six pages as in the sync version.
    urls = ["https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-2.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-3.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-4.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-5.html",
            "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-6.html",
            ]

    async with httpx.AsyncClient() as client:
        # client.get() returns a coroutine; nothing is sent yet.
        reqs = [client.get(url) for url in urls]
        # gather() runs all requests concurrently and keeps the input order.
        results = await asyncio.gather(*reqs)

    print(results)


# Measure the wall-clock time of the whole crawl.
start = time.perf_counter()
asyncio.run(fetch())
end = time.perf_counter()

print(end - start)

Time taken with asynchronous HTTPX crawling: 0.79 seconds

 

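Async HTTPX crawling was about 10 times faster (7.8 s / 0.79 s ≈ 9.9), because the six requests are in flight at the same time instead of being sent one after another.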
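To illustrate where the speedup comes from without touching the network, here is a rough simulation that replaces each HTTP request with a fixed wait. The 0.2-second `DELAY` and the function names are assumptions chosen for the sketch, not measurements from this post; only the request count of six matches the code above.

```python
import asyncio
import time

DELAY = 0.2      # assumed per-request latency in seconds (illustrative only)
N_REQUESTS = 6   # same number of pages as in the post


def fake_get_sync() -> None:
    # Blocks the whole program while "waiting for the server".
    time.sleep(DELAY)


async def fake_get_async() -> None:
    # Yields control to the event loop while "waiting for the server".
    await asyncio.sleep(DELAY)


def run_sequential() -> float:
    start = time.perf_counter()
    for _ in range(N_REQUESTS):
        fake_get_sync()
    return time.perf_counter() - start


async def run_concurrent() -> float:
    start = time.perf_counter()
    # All six waits overlap, so the total is close to a single DELAY.
    await asyncio.gather(*(fake_get_async() for _ in range(N_REQUESTS)))
    return time.perf_counter() - start


sequential = run_sequential()
concurrent = asyncio.run(run_concurrent())
print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

In this simplified model the sequential run takes roughly `N_REQUESTS * DELAY`, while the concurrent run takes roughly one `DELAY`. Real crawls will show a different ratio depending on network latency, the server, and the number of pages.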
