
Crawling Every Job Posting on Indeed! (feat. Python, Selenium, BeautifulSoup)

by 혼밥맨 2021. 2. 15.


from bs4 import BeautifulSoup
import requests
import csv
import os
from time import sleep
from random import randint
from datetime import datetime
 
 
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47'
}
 
 
def get_url(position, location):
    template = 'https://www.indeed.com/jobs?q={}&l={}'
    url = template.format(position, location)
    return url
 
 
def get_record(card):
    '''Extract job data from a single record.'''
    atag = card.h2.a
    try:
        job_title = atag.get('title')
    except AttributeError:
        job_title = ''
    try:
        company = card.find('span', 'company').text.strip()
    except AttributeError:
        company = ''
    try:
        location = card.find('div', 'recJobLoc').get('data-rc-loc')
    except AttributeError:
        location = ''
    try:
        job_summary = card.find('div', 'summary').text.strip()
    except AttributeError:
        job_summary = ''
    try:
        post_date = card.find('span', 'date').text.strip()
    except AttributeError:
        post_date = ''
    try:
        salary = card.find('span', 'salaryText').text.strip()
    except AttributeError:
        salary = ''

    extract_date = datetime.today().strftime('%Y-%m-%d')
    job_url = 'https://www.indeed.com' + atag.get('href')

    return (job_title, company, location, job_summary, salary, post_date, extract_date, job_url)
 
 
def main(position, location):
    records = []  # creating the record list
    url = get_url(position, location)

    while True:
        print(url)
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('div', 'jobsearch-SerpJobCard')

        for card in cards:
            record = get_record(card)
            records.append(record)

        try:
            # follow the "Next" pagination link; stop when there is none
            url = 'https://www.indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')
            delay = randint(1, 10)  # random delay to reduce the chance of being blocked
            sleep(delay)
        except AttributeError:
            break

    # append so results from every city accumulate in one file;
    # write the header row only when the file does not exist yet
    write_header = not os.path.exists('results.csv')
    with open('results.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if write_header:
            # column order matches the record tuple returned by get_record()
            writer.writerow(['Job Title', 'Company', 'Location', 'Summary', 'Salary',
                             'Posting Date', 'Extract Date', 'Job Url'])
        writer.writerows(records)
 
 
 
CITIES = ["New+York+City", "Los+Angeles", "Chicago", "Houston", "Phoenix", "Philadelphia", "San+Antonio",
          "San+Diego", "Dallas", "San+Jose", "Austin", "Jacksonville", "Fort+Worth", "Columbus", "Charlotte",
          "San+Francisco", "Indianapolis", "Seattle", "Denver", "Washington", "Boston", "El+Paso", "Nashville", "Detroit", "Oklahoma+City",
          "Las+Vegas", "Memphis", "Louisville", "Baltimore", "Milwaukee", "Albuquerque", "Tucson", "Fresno", "Mesa", "Sacramento", "Atlanta", "Woodbridge", "Bend",
          "Kansas+City", "Colorado+Springs", "Omaha", "Raleigh", "Miami", "Long+Beach", "Virginia+Beach", "Oakland", "Minneapolis", "Tulsa", "Tampa", "Arlington", "New+Orleans", "Wichita",
          "Bakersfield", "Cleveland", "Aurora", "Anaheim", "Honolulu", "Santa+Ana", "Riverside", "Corpus+Christi", "Lexington", "Henderson", "Stockton", "Saint+Paul",
          "Cincinnati", "St.+Louis", "Pittsburgh", "Greensboro", "Lincoln", "Anchorage", "Plano", "Orlando", "Irvine", "Newark", "Durham", "Chula+Vista", "Toledo", "Fort+Wayne", "St.+Petersburg",
          "Laredo", "Jersey+City", "Chandler", "Madison", "Lubbock", "Reno", "Buffalo", "Gilbert", "North+Las+Vegas", "Winston–Salem", "Chesapeake", "Norfolk", "Fremont", "Garland",
          "Irving", "Hialeah", "Richmond", "Boise", "Spokane", "Baton+Rouge", "Tacoma", "San+Bernardino", "Modesto", "Fontana", "Des+Moines", "Moreno+Valley", "Santa+Clarita",
          "Fayetteville", "Birmingham", "Oxnard", "Rochester", "Port+St.+Lucie", "Grand+Rapids",
          "Huntsville", "Salt+Lake+City", "Frisco", "Yonkers", "Amarillo", "Glendale", "Huntington+Beach", "McKinney", "Montgomery", "Augusta", "Akron", "Little+Rock", "Tempe", "Columbus", "Overland+Park", "Grand+Prairie", "Tallahassee", "Cape+Coral", "Mobile", "Knoxville", "Shreveport",
          "Worcester", "Ontario", "Vancouver", "Sioux+Falls", "Chattanooga", "Brownsville", "Fort+Lauderdale", "Providence", "Newport+News", "Rancho+Cucamonga", "Santa+Rosa", "Peoria", "Oceanside", "Elk+Grove", "Salem", "Pembroke+Pines", "Eugene", "Garden+Grove", "Cary", "Fort+Collins", "Corona",
          "Springfield", "Jackson", "Alexandria", "Hayward", "Clarksville", "Lakewood", "Lancaster", "Salinas", "Palmdale", "Hollywood", "Macon", "Sunnyvale", "Pomona", "Killeen", "Escondido", "Pasadena", "Naperville", "Bellevue", "Joliet", "Murfreesboro", "Midland",
          "Rockford", "Paterson", "Savannah", "Bridgeport", "Torrance", "McAllen", "Syracuse", "Surprise", "Denton", "Roseville", "Thornton", "Miramar", "Mesquite", "Olathe", "Dayton", "Carrollton", "Waco", "Orange", "Fullerton", "Charleston", "West+Valley+City",
          "Visalia", "Hampton", "Gainesville", "Warren", "Coral+Springs", "Cedar+Rapids", "Round+Rock", "Sterling+Heights", "Kent", "Columbia", "Santa+Clara", "New+Haven", "Stamford", "Concord", "Elizabeth", "Athens", "Thousand+Oaks", "Lafayette", "Simi+Valley", "Topeka", "Norman", "Fargo",
          "Wilmington", "Abilene", "Odessa", "Pearland", "Victorville", "Hartford", "Vallejo", "Allentown", "Berkeley", "Richardson", "Arvada", "Ann+Arbor", "Cambridge", "Sugar+Land", "Lansing", "Evansville", "College+Station", "Fairfield", "Clearwater", "Beaumont", "Independence",
          "Provo", "West+Jordan", "Murrieta", "Palm+Bay", "El+Monte", "Carlsbad", "North+Charleston", "Temecula", "Clovis", "Meridian", "Westminster", "Costa+Mesa", "High+Point", "Manchester", "Pueblo", "Lakeland", "Pompano+Beach", "West+Palm+Beach", "Antioch", "Everett", "Downey", "Lowell", "Centennial",
          "Elgin", "Broken+Arrow", "Miami+Gardens", "Billings", "Jurupa+Valley", "Sandy+Springs", "Gresham", "Lewisville", "Hillsboro", "Ventura", "Greeley", "Inglewood", "Waterbury", "Tyler", "Davie", "Daly+City", "Boulder", "League+City", "Santa+Maria", "Allen", "West+Covina", "Sparks",
          "Wichita+Falls", "Green+Bay", "San+Mateo", "Norwalk", "Rialto", "Las+Cruces", "Chico", "El+Cajon", "Portland", "Burbank", "South+Bend", "Renton", "Vista", "Davenport", "Edinburg", "Tuscaloosa", "Carmel", "Spokane+Valley", "San+Angelo", "Vacaville", "Clinton"
]
 
 
 
for city in CITIES:
    main('developer', city)
 
 

 

- The US city list came from a Google search for "us cities by population".

- Indeed shows a Google CAPTCHA to block abusive crawling.

- It may also block your IP address temporarily.
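
One way to notice this while the crawl runs, instead of silently collecting empty pages, is to check each response before parsing. The helper below is only a sketch under the assumption that a block usually shows up as a non-200 status code or a results page with no job cards; fetch_page is a name introduced here, not part of the original script.

def fetch_page(url):
    '''Fetch a results page and warn when the response looks like a block or CAPTCHA.'''
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    cards = soup.find_all('div', 'jobsearch-SerpJobCard')
    if response.status_code != 200 or not cards:
        # probably throttled, blocked, or served a CAPTCHA page instead of results
        print('possible block/CAPTCHA at', url, '(status', response.status_code, ')')
    return soup, cards

main() above could call this in place of the bare requests.get / BeautifulSoup pair and back off or stop when a block is suspected.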

- I sleep for time.sleep(random.randint(0, 5)) seconds between requests. (This is needed to crawl nearly every job posting in the US; it takes a long time, but there is no way around it.)

- Using Python multiprocessing would make this much faster, but you still have to get past the same Google CAPTCHA.
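
For what that could look like, here is a minimal sketch, not the author's code: it assumes a helper scrape_city() that repeats the request/parse loop of main() but returns the records, fans a Pool of four workers out over CITIES, and lets the parent process write a single results_parallel.csv. The worker count and file name are arbitrary, and the for-loop at the bottom of the script would need to move under the __main__ guard so child processes do not re-run it on import.

from multiprocessing import Pool

def scrape_city(city, position='developer'):
    '''Same request/parse loop as main(), but returns the records instead of writing a file.'''
    records = []
    url = get_url(position, city)
    while True:
        soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
        for card in soup.find_all('div', 'jobsearch-SerpJobCard'):
            records.append(get_record(card))
        try:
            url = 'https://www.indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')
            sleep(randint(1, 10))
        except AttributeError:
            return records

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # 4 workers is an arbitrary choice
        all_records = pool.map(scrape_city, CITIES)
    # the parent process does all the writing, so workers never touch the same file
    with open('results_parallel.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Job Title', 'Company', 'Location', 'Summary', 'Salary',
                         'Posting Date', 'Extract Date', 'Job Url'])
        for records in all_records:
            writer.writerows(records)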

- One way to get around the Google CAPTCHA is to change the IP address on every request by passing a proxy to requests.get(URL, proxies=proxies). However, switching IP addresses also takes time.
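
As a rough illustration of that idea, the sketch below cycles through a list of proxy endpoints on each request. The addresses and the helper name get_with_rotating_proxy are placeholders; nothing here is Indeed-specific beyond the standard proxies= argument of requests.get. In main() above, requests.get(url, headers=headers) would be swapped for this helper.

from itertools import cycle

# Placeholder proxy endpoints; substitute proxies you actually have access to.
PROXY_LIST = [
    'http://111.111.111.111:8080',
    'http://222.222.222.222:3128',
]
proxy_pool = cycle(PROXY_LIST)

def get_with_rotating_proxy(url):
    '''Try each proxy in turn until one of them returns a response.'''
    for _ in range(len(PROXY_LIST)):
        proxy = next(proxy_pool)
        try:
            return requests.get(url, headers=headers,
                                proxies={'http': proxy, 'https': proxy},
                                timeout=10)
        except requests.exceptions.RequestException:
            continue  # dead or blocked proxy: move on to the next one
    raise RuntimeError('no working proxy in PROXY_LIST')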
