JUST DO IT!

[TIL]KDT_20230419


sunhokimDev 2023. 4. 19. 14:13

📚 KDT WEEK 3 DAY 3 TIL

  • HTML parsing with BeautifulSoup
    • Extracting the elements you want
    • Pagination
    • Dynamic web pages

 


 

🟥 BeautifulSoup4

An HTML parser used to extract the elements you want from HTML.

  • soup = BeautifulSoup(res.text, "html.parser")
    • Parses the text of res, the response fetched with requests.
    • Because we are parsing HTML, "html.parser" is passed as the second argument.
  • h1 = soup.find("h1")
    • Finds a specific tag element in soup.
    • If several tags share the same name, only the first one is returned.
    • If you need every tag with that name, use find_all("h1").
  • h1.name and h1.text return the tag's name and its text content.
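
A minimal sketch of the calls above, run against a small inline HTML string instead of a live response. The markup and its contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for res.text
html = """
<html>
  <body>
    <h1>First heading</h1>
    <p>Some text</p>
    <h1>Second heading</h1>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag
h1 = soup.find("h1")
print(h1.name)  # tag name
print(h1.text)  # text inside the tag

# find_all() returns every matching tag as a list
all_h1 = soup.find_all("h1")
print([tag.text for tag in all_h1])

# find() returns None when nothing matches, so check before using the result
missing = soup.find("h2")
print(missing is None)
```

Since find() gives back None for a missing tag, reading .text on the result without a check is a common source of AttributeError.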

 

Here are some usage examples!


I put together a few examples.

Printing the response from Naver as-is gives raw, unformatted output...

With BeautifulSoup, it is indented and displayed neatly!

 

You can fetch a single tag...

 

You can also find a single element and get its name!
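
As a rough illustration of these three steps, the sketch below uses a made-up one-line HTML string (not the actual Naver response). It assumes the indented output comes from prettify(), which is the usual BeautifulSoup call for that; the original post does not name the call.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup; a real response would be much longer
raw_html = "<html><head><title>Sample</title></head><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(raw_html, "html.parser")

# prettify() returns the document as a string with tags on separate lines, indented by depth
pretty = soup.prettify()
print(pretty)

# Fetch a single tag
title = soup.title
print(title)

# Find one element and read its name
first_h1 = soup.find("h1")
print(first_h1.name)
```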

 


 

Extracting the elements you want from a real site

Site used: http://books.toscrape.com/catalogue/category/books/travel_2/index.html

 


 

First, use Chrome's developer tools to find where the information you need is located on the site.

  • Right-click the element ➡ Inspect
  • Or open the developer tools directly with the F12 shortcut and look for it

 

I confirmed that the titles I needed were inside h3 tags.

 

Next, parse the page with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

# Fetch the page with requests and parse the HTML with BeautifulSoup
res = requests.get("http://books.toscrape.com/catalogue/category/books/travel_2/index.html")
soup = BeautifulSoup(res.text, "html.parser")

# The titles are inside h3 tags on this site, so find every h3 element
h3_result = soup.find_all("h3")

# Each title is in the title attribute of the a tag inside h3, so collect and print them
book_list = [x.a["title"] for x in h3_result]
print(book_list)

 

 

You can see the titles were collected nicely! (For convenience, I printed them with a for loop.)
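
The same extraction can be tried offline. The sketch below uses hand-written markup that imitates the h3 > a structure described above (the titles are placeholders), and reads the attribute with .get("title") so that a link without it does not raise a KeyError.

```python
from bs4 import BeautifulSoup

# Hand-written markup imitating the listing structure; titles are placeholders
html = """
<ol>
  <li><article><h3><a href="book-a.html" title="Book A: The Full Title">Book A: The ...</a></h3></article></li>
  <li><article><h3><a href="book-b.html" title="Book B">Book B</a></h3></article></li>
  <li><article><h3><a href="book-c.html">No title attribute</a></h3></article></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

book_list = []
for h3 in soup.find_all("h3"):
    link = h3.a  # first <a> inside this <h3>, or None
    if link is None:
        continue
    title = link.get("title")  # None instead of KeyError when the attribute is missing
    if title:
        book_list.append(title)

# Print one title per line
for title in book_list:
    print(title)
```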

 


Using HTML locators

  • Finding the part you need by tag name alone is not a good approach.
    • It is very likely that other tags with the same name exist!
  • So it is better to search by a tag's id or class attribute.
  • BeautifulSoup's find method also lets you search by id and class.
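
A small sketch of both lookups, using hand-written markup (the id "featured", the inner class name, and the prices are invented for the example). Passing the class name as the second positional argument is equivalent to using the class_ keyword.

```python
from bs4 import BeautifulSoup

# Hand-written markup; the id "featured" and the prices are made up
html = """
<div id="featured" class="product_price">
  <p class="price_color">£10.00</p>
</div>
<div class="product_price">
  <p class="price_color">£20.00</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Lookup by class: positional form and keyword form find the same tag
by_class_positional = soup.find("div", "product_price")
by_class_keyword = soup.find("div", class_="product_price")

# Lookup by id
by_id = soup.find("div", id="featured")

# Every tag that carries the class
all_prices = [div.p.text for div in soup.find_all("div", class_="product_price")]

print(by_class_positional.p.text)
print(by_class_keyword.p.text)
print(by_id.p.text)
print(all_prices)
```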

First, as before, locate the value I want (the price of a book). (Same site as before.)

 

I confirmed that it sits in a div tag with the product_price class.

 

Now parse it in code!

# Fetch the page and parse the HTML
import requests
from bs4 import BeautifulSoup
res = requests.get("http://books.toscrape.com/catalogue/category/books_1/index.html")
soup = BeautifulSoup(res.text, "html.parser")

# To search by class, pass the class name as the argument after the tag name
find_result = soup.find("div", "product_price")

# To search by id
# find_result = soup.find("tag_name", id="id_name")

# Print the value I was looking for, and done!
print(find_result.p.text)

 

I'm not sure what the unit is, but the value prints fine.
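
The prices on this site are in pounds sterling (£). If the printed value shows a stray character in front of the symbol, such as Â£, one possible cause is that the UTF-8 bytes of the page were decoded as ISO-8859-1, which requests falls back to when the response headers declare a text type without a charset. The sketch below reproduces that effect on a made-up price without any network access; whether this applies to the output above is an assumption.

```python
# A made-up price, encoded the way a UTF-8 page would send it
raw_bytes = "£51.77".encode("utf-8")

# Decoding with the wrong codec produces the stray character
wrong = raw_bytes.decode("iso-8859-1")
print(wrong)

# Decoding with the right codec restores the symbol
right = raw_bytes.decode("utf-8")
print(right)
```

With requests, a possible fix is to set res.encoding = "utf-8" before reading res.text, or to pass res.content to BeautifulSoup and let it detect the encoding.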

 


 

Pagination

 

Site used: https://hashcode.co.kr/

 


 

As always, the first step is to find the content to scrape!

 

The content I wanted was inside an h4 tag within each li tag.

 

This time, I put the User-Agent in a dict and sent it along with the request.

import requests
from bs4 import BeautifulSoup

user_agent = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
# Pass the dict as headers; the second positional argument of requests.get is params
res = requests.get("https://hashcode.co.kr/", headers=user_agent)
soup = BeautifulSoup(res.text, "html.parser")

# Grab every li tag whose class is 'question-list-item'
qList = soup.find_all("li", "question-list-item")

# Pull out just the title of each item and store the titles in a list
qTitleList = [x.find("div","question").find("div","top").h4.text for x in qList]
print(qTitleList)

 

This gives results, but it is only a small part of them.
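
One detail worth checking when sending a User-Agent: the second positional parameter of requests.get is params, so a dictionary passed there is encoded into the query string rather than sent as headers. The sketch below builds both kinds of request without sending them (the shortened User-Agent value is a placeholder) and compares the results.

```python
import requests

# Shortened placeholder value; any User-Agent string behaves the same way
user_agent = {"User-Agent": "Mozilla/5.0"}
url = "https://hashcode.co.kr/"

# Dictionary used as query parameters (what a positional second argument means)
as_params = requests.Request("GET", url, params=user_agent).prepare()

# Dictionary used as request headers
as_headers = requests.Request("GET", url, headers=user_agent).prepare()

print(as_params.url)                        # the header name ends up in the URL
print(as_params.headers.get("User-Agent"))  # no User-Agent header on this request
print(as_headers.url)                       # URL is unchanged
print(as_headers.headers.get("User-Agent")) # header is set as intended
```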

 

How can we get more results and search across more pages?

▶ Use pagination. (a for loop)

 

The code relies on the fact that the site distinguishes pages through a query string.

Query String ➡ https://hashcode.co.kr/?page={i} # the value of i selects the page
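
As a small illustration of the query string, the page URLs can be generated with the standard library. The base URL and the page parameter come from the text above; the range of pages is arbitrary.

```python
from urllib.parse import urlencode

base_url = "https://hashcode.co.kr/"

# Build the URL for each page by encoding the page number as a query string
page_urls = [f"{base_url}?{urlencode({'page': i})}" for i in range(1, 4)]

for page_url in page_urls:
    print(page_url)
```

requests can do the same encoding itself when the value is passed as params={"page": i}.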

 

# Continues from the code above

import time

# Scan pages 1 to 5
for i in range(1,6):
    res = requests.get("https://hashcode.co.kr/?page={}".format(i), headers=user_agent)
    soup = BeautifulSoup(res.text, "html.parser")

    questionList = soup.find_all("li", "question-list-item")

    qTitleList = [x.find("div","question").find("div","top").h4.text for x in questionList]
    print(qTitleList)

    # Wait 0.5 seconds between requests to avoid overloading the site
    time.sleep(0.5)

 

It prints fine!
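
A fixed range works when the number of pages is known. The sketch below is one possible variation that stops at the first page with no matching items. To keep it runnable offline, the fetch step is a stand-in function returning hand-written HTML; in real use it would be replaced by a requests.get call. The helper names (fake_fetch, scrape_titles) and the sample titles are hypothetical.

```python
import time
from bs4 import BeautifulSoup

# Hand-written pages standing in for real responses; page 3 has no items
FAKE_PAGES = {
    1: '<ul><li class="question-list-item"><h4>Question A</h4></li>'
       '<li class="question-list-item"><h4>Question B</h4></li></ul>',
    2: '<ul><li class="question-list-item"><h4>Question C</h4></li></ul>',
    3: '<ul></ul>',
}


def fake_fetch(page):
    """Stand-in for requests.get(...).text; returns canned HTML."""
    return FAKE_PAGES.get(page, "")


def scrape_titles(fetch, max_pages, delay=0.0):
    """Collect h4 titles page by page until a page has no items."""
    titles = []
    for page in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch(page), "html.parser")
        items = soup.find_all("li", "question-list-item")
        if not items:
            break  # no more results, stop early
        titles.extend(item.h4.text.strip() for item in items)
        time.sleep(delay)  # pause between requests
    return titles


all_titles = scrape_titles(fake_fetch, max_pages=5)
print(all_titles)
```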

 


 

🟦 Dynamic web pages

 

Static web site: the HTML content is fixed

  • Sending a request to the same address returns the same response
  • The HTML document is returned in complete form

Dynamic web site: the HTML content changes

  • What is shown can differ each time the page is refreshed
  • There is a delay between the response and the point where the HTML is fully rendered
  • Because processing is asynchronous, the data may be incomplete at the moment the response arrives

 

(Figure: synchronous processing on top, asynchronous processing below)

 

Dynamic web sites use JavaScript and process work asynchronously, so the moment we receive the response may not be the moment all data processing has finished.
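
To make this concrete, the sketch below parses hand-written HTML of the kind a dynamic page might return before any JavaScript has run. The markup, the id, and the script are invented for the example. BeautifulSoup only sees the empty container, because it parses the text and does not execute the script.

```python
from bs4 import BeautifulSoup

# Invented response body: the list is filled in later by JavaScript in a browser
initial_html = """
<html>
  <body>
    <ul id="items"></ul>
    <script>
      // in a browser, this would add <li> elements after the page loads
      loadItemsAsync();
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(initial_html, "html.parser")

container = soup.find("ul", id="items")
items = container.find_all("li")

print(container is not None)  # the container itself is in the HTML
print(len(items))             # but it has no <li> children yet
```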

 

Therefore, we have to deliberately wait for some time before reading the page.

 

Also, some sites have to be operated with the keyboard or mouse,

so the Python Selenium library is used to add delays and to control the page.
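
A fixed sleep either waits too long or not long enough. An alternative is to poll until a condition holds or a timeout expires, which is the idea behind Selenium's explicit waits. The sketch below shows that idea in plain Python so it can run without a browser; wait_until and the simulated element are hypothetical names, not Selenium API.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Call condition() repeatedly until it returns a truthy value or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within the timeout")
        time.sleep(poll_interval)


# Simulate content that becomes available 0.3 seconds after "page load"
ready_at = time.monotonic() + 0.3


def element_is_ready():
    """Return the element text once it is 'rendered', otherwise None."""
    return "rendered content" if time.monotonic() >= ready_at else None


start = time.monotonic()
text = wait_until(element_is_ready, timeout=2.0, poll_interval=0.05)
elapsed = time.monotonic() - start

print(text)
print(elapsed < 2.0)  # returned once the content appeared, not after the full timeout
```

In Selenium itself, this role is played by WebDriverWait(driver, timeout).until(...) together with a condition from expected_conditions.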

 


 

🤔 What I found difficult while studying

This was the session where we started scraping data from web sites in earnest.

I had crawled Google Images before and built an Amazon-related app with Flutter, so the material was not especially hard.

Personally, I'm looking forward to the session on visualizing scraped data, which I have not tried before...!
