Making Web Crawling 10x Faster with HTTPX

Time taken with synchronous crawling: 7.8 seconds

Time taken with asynchronous HTTPX crawling: 0.79 seconds

What is HTTPX?

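Note that the name httpx is shared by an unrelated project: ProjectDiscovery's httpx, a fast, multi-purpose HTTP toolkit that runs multiple probes using the retryablehttp library and is designed to keep results reliable as the thread count increases. That tool is not what this post uses.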

The HTTPX used here is a fully featured HTTP client for Python 3 that provides both sync and async APIs and supports HTTP/1.1 and HTTP/2.

Required library

https://www.python-httpx.org/


Installing the libraries


pip install requests
pip install httpx


Target web page for crawling

https://books.toscrape.com/index.html


This is a public sandbox site for practicing scraping and crawling. Searching Google for "books to scrape" brings it up.

Sync crawling: full code


# pip install requests

import time

import requests


def fetch():
    urls = [
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-2.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-3.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-4.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-5.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-6.html",
    ]

    # Requests run one after another; each call blocks until its response arrives.
    results = [requests.get(url) for url in urls]

    print(results)


start = time.perf_counter()
fetch()
end = time.perf_counter()

# Elapsed wall-clock time in seconds.
print(end - start)
 

Time taken with synchronous crawling: 7.8 seconds
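The six URLs in the listing follow a regular pattern (index.html for the first page, then page-2.html through page-6.html), so the list can also be generated instead of typed out. A small sketch; `BASE` and `build_urls` are names made up here for illustration.

```python
BASE = "https://books.toscrape.com/catalogue/category/books/nonfiction_13"


def build_urls(last_page: int) -> list[str]:
    # Page 1 is served as index.html; later pages are page-N.html.
    urls = [f"{BASE}/index.html"]
    urls += [f"{BASE}/page-{n}.html" for n in range(2, last_page + 1)]
    return urls


urls = build_urls(6)
for url in urls:
    print(url)
```

This yields the same six URLs in the same order as the hard-coded list.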

Async crawling: full code


# pip install httpx

import asyncio
import time

import httpx


async def fetch():
    urls = [
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-2.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-3.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-4.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-5.html",
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/page-6.html",
    ]

    async with httpx.AsyncClient() as client:
        # One coroutine per URL; nothing is sent until they are awaited.
        reqs = [client.get(url) for url in urls]
        # gather() runs them concurrently and returns responses in input order.
        results = await asyncio.gather(*reqs)

    print(results)


start = time.perf_counter()
asyncio.run(fetch())
end = time.perf_counter()

# Elapsed wall-clock time in seconds.
print(end - start)

Time taken with asynchronous HTTPX crawling: 0.79 seconds

Async HTTPX crawling was about 10 times faster (7.8 seconds vs. 0.79 seconds).
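Where the gap comes from can be shown without any network at all. In the sketch below, `asyncio.sleep` stands in for waiting on a server; `DELAY` and `N_REQUESTS` are made-up values, not measurements from the site. Run one after another, the waits add up; run with `gather`, they overlap.

```python
import asyncio
import time

DELAY = 0.05  # hypothetical per-request latency in seconds
N_REQUESTS = 6


async def fake_request(i: int) -> int:
    # Stand-in for a network call: waits without blocking the event loop.
    await asyncio.sleep(DELAY)
    return i


async def sequential() -> float:
    start = time.perf_counter()
    for i in range(N_REQUESTS):
        # Each wait starts only after the previous one has finished.
        await fake_request(i)
    return time.perf_counter() - start


async def concurrent() -> float:
    start = time.perf_counter()
    # All waits are in flight at the same time.
    await asyncio.gather(*(fake_request(i) for i in range(N_REQUESTS)))
    return time.perf_counter() - start


seq_time = asyncio.run(sequential())
con_time = asyncio.run(concurrent())
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With real requests the ratio depends on the server and the network, so the 10x figure above should be read as one measurement rather than a guarantee.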

License: Attribution, NonCommercial, NoDerivatives


혼밥맨
