20. Python - Web Crawling

2021. 2. 25. 00:15

from http.client import HTTPSConnection
from bs4 import BeautifulSoup
 
# Web crawling
#     HTML parsing
 
 
# JSON / XML
#     Formats that present data stored in a DB so that others can read it easily
 
# HTML
#     Markup that makes a website look good (design)
 
# Viewing the page source (view-source:)
# view-source:https://m.daum.net/
 
hc = HTTPSConnection("m.daum.net")
hc.request("GET","/")
resBody = hc.getresponse().read()
 
# HTML parsing
# Java   : jsoup.jar             - maven
# Python : bs4.py(BeautifulSoup) - pip
 
# Installing with pip
#     cmd - pip3 install <package name>
#           pip3 install bs4
 
 
# BeautifulSoup(response body, "name of a built-in HTML parser", from_encoding="utf-8")
daumNews = BeautifulSoup(resBody,"html.parser", from_encoding="utf-8" )
 
# BeautifulSoup syntax
#     xxx.select("CSS selector")
 
# Finding a CSS selector easily in the browser
#     F12 - developer tools - Ctrl + Shift + C
#     Click the part of the page you want to inspect to locate its source
 
news = daumNews.select("li.ta_txt.rubics_single a")
 
# str.strip() : removes leading and trailing whitespace
 
for n in news:
    print(n.text.strip())
    print("----------")
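
The selector and strip() steps can also be tried without a network connection. The sketch below runs the same select() call on a small hand-written HTML string. SAMPLE_HTML and the helper name extract_headlines are made up for illustration, and the real m.daum.net markup may differ (and may have changed since this post was written).

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure targeted by the selector
# "li.ta_txt.rubics_single a"; the real page's HTML may differ.
SAMPLE_HTML = """
<ul>
  <li class="ta_txt rubics_single"><a href="/news/1">  First headline  </a></li>
  <li class="ta_txt rubics_single"><a href="/news/2">
      Second headline
  </a></li>
  <li class="ta_txt other"><a href="/ad/1">Not matched</a></li>
</ul>
"""


def extract_headlines(html, selector="li.ta_txt.rubics_single a"):
    """Return the stripped text of every element matching the CSS selector."""
    # html is already a str here, so from_encoding is not needed.
    soup = BeautifulSoup(html, "html.parser")
    return [a.text.strip() for a in soup.select(selector)]


headlines = extract_headlines(SAMPLE_HTML)
for h in headlines:
    print(h)
    print("----------")
```

Only the first two links are returned, because the third li lacks the rubics_single class. Keeping the parsing in a function that takes a string makes it easy to check a selector against saved HTML before pointing it at a live page.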
