Crawling All Job Postings on The Teams! (python, openpyxl, csv, scraping)


from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl import Workbook
from google.colab import drive

# Headless Chrome options so the browser can run inside Google Colab
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Mount Google Drive (Colab only)
drive.mount('/content/drive', force_remount=True)
 
 
BASE_URL = "https://www.theteams.kr/recruit/"
 
 
POSTING_NUM_LIST = []        # posting numbers (the trailing number of each posting URL)
JOB_DESC_LIST = []           # posting body text (col-md-12)
TITLE_LIST = []              # posting title (tm_mgt_title)
COMPANY_NAME_LIST = []       # company name (tm_h2_title_company_info)
CATEGORY_LIST = []           # job category (rc_categories_name)
 
 
def get_desc(URL):
  """Open one posting page and append its title, category, company name and body text to the lists."""
  print("$", end="")  # progress marker
  # chromedriver must be discoverable (for example on PATH)
  driver = webdriver.Chrome(options=chrome_options)
  driver.implicitly_wait(1)

  driver.get(URL)
  html = driver.page_source
  soup = BeautifulSoup(html, 'lxml')  # parse the rendered HTML

  # Title: <h3 class="tm_mgt_title">
  title = soup.findAll("h3", {"class": "tm_mgt_title"})
  title = str(title)
  title = title[title.find(">") + 1 : title.find("</")]
  TITLE_LIST.append(title)

  # Category: <div class="rc_categories_name">
  category = soup.findAll("div", {"class": "rc_categories_name"})
  category = str(category)
  # + 3 also skips the first two characters after the opening tag
  category = category[category.find(">") + 3 : category.find("</")]
  CATEGORY_LIST.append(category)

  # Company name: <h2 class="tm_h2_title_company_info">
  company_name = soup.findAll("h2", {"class": "tm_h2_title_company_info"})
  company_name = str(company_name)
  company_name = company_name[company_name.find(">") + 1 : company_name.find("</")]
  COMPANY_NAME_LIST.append(company_name)

  # Posting body: concatenated text of every <div class="col-md-12">
  job_desc = soup.select('div.col-md-12')
  job_des = "".join(block.get_text() for block in job_desc)
  JOB_DESC_LIST.append(job_des)

  driver.quit()
 
 
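# List pages: https://www.theteams.kr/recruit/<page number>
def get_posting_numbers(URL):
  """Collect the posting numbers linked from one list page into POSTING_NUM_LIST."""
  print("$", end="")  # progress marker
  # chromedriver must be discoverable (for example on PATH)
  driver = webdriver.Chrome(options=chrome_options)
  driver.implicitly_wait(3)

  driver.get(URL)
  html = driver.page_source

  # Parse the rendered HTML with bs4
  soup = BeautifulSoup(html, 'lxml')

  # Extract the recruitment page numbers from the <a> tags of each card
  a_tags = soup.select('div.card_recruit a')

  # Split the serialized tags on "wanted/"; a posting number follows each split point
  want_list = str(a_tags).split("wanted/")

  for i, want in enumerate(want_list):
    # Keep every other piece (the odd-indexed ones), cut at the closing quote of the href
    if i % 2 == 1:
      POSTING_NUM_LIST.append(want[:want.find('"')])

  driver.quit()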
 
 
# M A I N
# Collect posting numbers from list pages 1 to 92
for i in range(1, 93):
  get_posting_numbers(BASE_URL + str(i))

# Remove duplicates from POSTING_NUM_LIST while keeping the original order
POSTING_NUM_LIST = list(dict.fromkeys(POSTING_NUM_LIST))

# Visit each posting and collect the company name, category, title and description
for num in POSTING_NUM_LIST:
  URL = "https://www.theteams.kr/recruit/wanted/" + num
  get_desc(URL)

print("C O M P L E T E D")
print(COMPANY_NAME_LIST)
 
# Write the results to an Excel file
columns = ["Company", "Category", "Posting title", "Description"]

write_wb = Workbook()
write_ws = write_wb.active

# Header row
write_ws.append(columns)

# Append one row per posting
for i in range(len(TITLE_LIST)):
  write_ws.append([
    COMPANY_NAME_LIST[i],
    CATEGORY_LIST[i],
    TITLE_LIST[i],
    JOB_DESC_LIST[i],
  ])

# openpyxl always writes the xlsx format, so the file gets an .xlsx extension
write_wb.save("The_TEAMS_1.xlsx")
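
One note on the output format: openpyxl's Workbook.save writes an Excel (xlsx) workbook whatever extension the file name carries. If a plain CSV file is what you want, the standard csv module can write the same rows. The following is only a minimal sketch: the function name write_postings_csv, the English header names and the sample rows are made up for illustration and are not part of the script above.

```python
import csv
import io


def write_postings_csv(out, companies, categories, titles, descriptions):
    """Write one header row plus one row per posting to the text stream `out`."""
    writer = csv.writer(out)
    writer.writerow(["company", "category", "title", "description"])
    # zip stops at the shortest list, so lists of unequal length cannot raise IndexError
    for row in zip(companies, categories, titles, descriptions):
        writer.writerow(row)


if __name__ == "__main__":
    # Made-up sample row, only to show the call
    buffer = io.StringIO()
    write_postings_csv(
        buffer,
        ["Acme"],
        ["Backend"],
        ["Python developer"],
        ["Line 1\nLine 2"],
    )
    print(buffer.getvalue())
```

To write to disk instead of a buffer, pass a file opened with open("The_TEAMS_1.csv", "w", newline="", encoding="utf-8-sig"); newline="" is what the csv module expects, and the utf-8-sig encoding usually helps Excel display Korean text correctly.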

No matter how fast I crawled, The Teams never blocked my IP address or showed a Google CAPTCHA, so the crawl finished quickly. In fact, using only requests.get and BeautifulSoup instead of selenium would be even faster.
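
A rough sketch of that requests-only idea follows. It assumes the elements the script reads (div.card_recruit links, tm_mgt_title and so on) are already present in the HTML that requests.get returns, without JavaScript rendering, and that posting numbers consist of digits; neither assumption is confirmed here. The function names extract_posting_numbers, parse_posting and fetch_html are hypothetical.

```python
import re

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.theteams.kr/recruit/"


def fetch_html(url):
    """Download a page without starting a browser."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_posting_numbers(html):
    """Return the unique numbers found in 'wanted/<number>' links, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    numbers = []
    for a_tag in soup.select("div.card_recruit a[href]"):
        # Assumption: the posting number is the run of digits after "wanted/"
        match = re.search(r"wanted/(\d+)", a_tag["href"])
        if match and match.group(1) not in numbers:
            numbers.append(match.group(1))
    return numbers


def parse_posting(html):
    """Pull the title, category, company name and body text out of one posting page."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        # Empty string when the element is missing, instead of raising
        tag = soup.select_one(selector)
        return tag.get_text(strip=True) if tag else ""

    return {
        "title": text_of("h3.tm_mgt_title"),
        "category": text_of("div.rc_categories_name"),
        "company": text_of("h2.tm_h2_title_company_info"),
        "description": "".join(
            div.get_text() for div in soup.select("div.col-md-12")
        ),
    }


if __name__ == "__main__":
    # Only the first list page, to keep the example short
    for number in extract_posting_numbers(fetch_html(BASE_URL + "1")):
        print(parse_posting(fetch_html(BASE_URL + "wanted/" + number))["title"])
```

Reading the text with get_text rather than slicing the serialized tag also avoids picking up stray markup when a heading contains nested elements.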

The resulting file came out clean, exactly as I had intended!

혼밥맨
