
Automating the Collection and Analysis of Credit-Information Sanction Results :: Web Crawling, PDF Download, OCR

by all_zer0 2025. 3. 13.
๋ฐ˜์‘ํ˜•

 

 

📚 Project overview

- (Learning goal) Learn how to use web crawling and PDF data extraction to automatically download the inspection-result sanction notices published by the Financial Supervisory Service (FSS, 금융감독원), and how to extract their text with OCR.

- (Analysis process) Search the FSS website for sanction notices related to "신용정보" (credit information), download the matching PDF files, extract the required fields from each file, and save them to an Excel file. [Web crawling → PDF processing → OCR → Save to Excel]

 

💻 What I learned

1. Libraries to install

Library | Purpose (details) | Install command | Import check
requests | Sends HTTP requests and handles the responses; mainly used to download data from the web | pip install requests | import requests
pdfplumber | Extracts text, tables, and other data from PDFs; makes the text easy to parse | pip install pdfplumber | import pdfplumber
pytesseract | OCR (Optical Character Recognition) library that converts images to text; used when text cannot be read from a PDF directly | pip install pytesseract | import pytesseract
pdf2image | Converts PDF files to images; used to prepare page images for OCR | pip install pdf2image | from pdf2image import convert_from_path
pillow | Image processing library; used to handle the images produced by pdf2image | pip install pillow | from PIL import Image
openpyxl | Reads and writes Excel files; used to save the extracted data to Excel | pip install openpyxl | import openpyxl
beautifulsoup4 | Parses HTML and XML documents for web scraping; used to extract PDF links while crawling | pip install beautifulsoup4 | from bs4 import BeautifulSoup
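
The import checks in the table can also be run in one pass. The following is a minimal sketch, not part of the original project: the helper name `check_installed` is made up for illustration, and the list also includes `selenium` and `pandas` because the final script imports them even though the table does not list them. Note that `pytesseract` and `pdf2image` are wrappers, so the Tesseract engine and Poppler must be installed separately; this check only covers the Python packages.

```python
import importlib.util

# Import names, which differ from the pip package names in two cases:
# pillow is imported as PIL and beautifulsoup4 as bs4.
REQUIRED_MODULES = [
    "requests", "pdfplumber", "pytesseract", "pdf2image",
    "PIL", "openpyxl", "bs4", "selenium", "pandas",
]


def check_installed(module_names):
    """Return {module_name: bool} telling whether each module can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    for name, ok in check_installed(REQUIRED_MODULES).items():
        print(f"{name}: {'OK' if ok else 'missing'}")
```

Any module reported as missing can then be installed with the matching `pip install` command from the table.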

 

 

2. Installing and configuring ChromeDriver

(1) Install Homebrew

#bash

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

 

Verify the Homebrew installation

#bash

brew --version

 

Add Homebrew to PATH (needed when the terminal does not recognize the brew command)

#bash

echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"

 

(2) Install ChromeDriver

#bash

brew install chromedriver

 

 

Check where ChromeDriver was installed

#bash

which chromedriver
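
The same lookup can be done from Python, which avoids hard-coding the driver path in the script. This is a minimal sketch under the assumption that ChromeDriver is on PATH; `resolve_driver_path` is a hypothetical helper name, and the fallback value is simply the Homebrew path used later in this post.

```python
import shutil


def resolve_driver_path(name="chromedriver", fallback=None):
    """Return the executable's full path found on PATH, or `fallback` if absent."""
    return shutil.which(name) or fallback


if __name__ == "__main__":
    path = resolve_driver_path(fallback="/opt/homebrew/bin/chromedriver")
    print(f"ChromeDriver path: {path}")
```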

 

 

Allowing ChromeDriver to run on macOS

1. Open System Settings (System Preferences)
Apple menu → click "System Settings" or "System Preferences"

2. Go to the security and privacy pane
Select "Privacy & Security" (or "Security & Privacy")

3. Check near the bottom for a message saying chromedriver was blocked
The message reads: "chromedriver" cannot be opened because Apple has not verified it. Click the "Allow Anyway" or "Open" button next to the message.

 

 

3. Code review

Error | Cause analysis | Fix

Error 1: The search box cannot be used to search for "신용정보"
- Cause analysis: The search <input> is nested inside <span class="keyword">, so the input cannot be operated directly; the parent element (span.keyword) has to be clicked first.
- Fix: Click the <span class="keyword"> element first to activate the search box.
    EC.element_to_be_clickable((By.CLASS_NAME, "keyword"))
    keyword_span.click()
  Once the search box is active, find input#query and call send_keys().
    EC.element_to_be_clickable((By.ID, "query"))
    search_box.send_keys("신용정보")

Error 2: Selenium selects a different search box (id="searchWrd") instead of the intended one (id="query")
- Cause analysis: The target is the <input id="query" name="query"> inside <form id="searchFrm">, so the <input> with id="query" has to be looked up specifically within searchFrm.
- Fix: Change the code to look for query inside searchFrm.
    search_box = search_form.find_element(By.ID, "query")
  This matches only the query search box inside searchFrm, not the wrong one (searchWrd).
  Click the search box before searching; on some sites the box only becomes active after click().
    search_box.click()
  Type the keyword and press ENTER.
    search_box.send_keys("신용정보")
    search_box.send_keys(Keys.RETURN)

Error 3: Table data inside the PDF is not extracted
- Cause analysis: extract_text() in pdfplumber does not recognize table contents or layout, so the data does not come out properly.
- Fix:
  - Use extract_table() in pdfplumber to extract table data.
  - Use extract_text() for the plain-text fields (금융회사명 / financial company name, 제재조치일 / sanction date, 제재대상사실 / facts subject to sanction).
  - Clean the data with regular expressions (re).
  - Save the extracted data to an Excel file.

Error 4: No data is extracted from the PDF at all
- Cause analysis:
  - Run extract_text() first, and fall back to OCR (character recognition) if it returns nothing.
  - Extract tables with extract_table(), then refine the result by specifying coordinates directly.
  - Clean the text with regular expressions so that only the required fields remain.
- Fix: Install pytesseract and pdf2image.

Error 5: pytesseract cannot find the Korean language file
- Cause analysis:
  - The Korean data file (kor.traineddata) is not installed for Tesseract.
  - Tesseract cannot find the tessdata directory.
  - The TESSDATA_PREFIX environment variable is not set.
- Fix: Install the Tesseract Korean data file.
    wget -P /opt/homebrew/share/tessdata/ https://github.com/tesseract-ocr/tessdata_best/raw/main/kor.traineddata

Error 6: PDF links cannot be extracted after searching for "신용정보"
- Cause analysis: The PDF links are probably loaded dynamically, meaning they may not appear directly in the page's initial HTML.
- Fix:
  - Dynamic content: page elements are loaded by JavaScript, so try an approach that handles dynamic loading.
  - PDF link extraction and click: try downloading each PDF by clicking its link directly.
  - Headless browser: configure the browser so pages are processed automatically and files download without a download dialog.

Error 7: Downloads succeed only for page 1; crawling fails from page 2 onward
- Cause analysis: If pages are loaded by calling a JavaScript function such as javascript:fnSearch(2), that function has to be triggered with execute_script() rather than click().
- Fix:
  - Run JavaScript: when moving to a page number, call driver.execute_script() to run javascript:fnSearch(page_num).
  - Page-number click: when clicking a page number, locate the button by its href="javascript:fnSearch(page_num)" value in the XPath.

Error 8: "제재대상사실" (facts subject to sanction) is not read correctly from the PDF
- Cause analysis: Text is extracted with pdfplumber, but the text under "제재대상사실" needs to be checked and extracted more precisely. The code uses the find method, and because the text may be split across several lines, it needs a more robust way to capture "제재대상사실" and everything after it.
- Fix: Modify the violation_details logic so that it includes all text after "제재대상사실" in the PDF.

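The pagination fix from the code review (calling `fnSearch` through `execute_script()` instead of clicking) might look like the sketch below. The function name `fnSearch` comes from the page's `javascript:fnSearch(2)` links mentioned above; `go_to_page` and its `settle_seconds` parameter are assumptions made for illustration, and the fixed sleep simply mirrors the waits used elsewhere in this post.

```python
import time


def go_to_page(driver, page_num, settle_seconds=3.0):
    """Ask the page's own fnSearch() function to load the given result page."""
    # Selenium exposes extra Python arguments to the script as arguments[0], ...
    driver.execute_script("fnSearch(arguments[0]);", page_num)
    time.sleep(settle_seconds)  # crude fixed wait for the new results to render
```

With a live Selenium `driver`, calling `go_to_page(driver, 2)` would request page 2; replacing the sleep with a `WebDriverWait` on a result element would be more robust.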
 

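For the missing Korean language file, a small check before running OCR can confirm that `kor.traineddata` is where Tesseract will look. This is a sketch under the assumption of Tesseract 4 or later, where `TESSDATA_PREFIX` points at the tessdata directory itself; `configure_tessdata` is a hypothetical helper, and the directory in the usage note is the Homebrew location used in the wget command above.

```python
import os
from pathlib import Path


def configure_tessdata(tessdata_dir, lang="kor"):
    """Set TESSDATA_PREFIX to tessdata_dir if <lang>.traineddata exists there.

    Returns True when the language file was found, False otherwise.
    """
    model = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model.is_file():
        return False
    # pytesseract launches the tesseract binary as a subprocess, which
    # inherits this environment variable.
    os.environ["TESSDATA_PREFIX"] = str(tessdata_dir)
    return True
```

Calling `configure_tessdata("/opt/homebrew/share/tessdata")` before the first `pytesseract.image_to_string(..., lang="kor")` call would surface a missing file early instead of failing in the middle of a run.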
 

4. Final code

import time
import os
import requests
import pdfplumber
import pandas as pd
import re
import pytesseract
from pdf2image import convert_from_path
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

# 📌 Chrome WebDriver setup (headless mode disabled)
chrome_driver_path = "/opt/homebrew/bin/chromedriver"  # path to ChromeDriver
download_dir = "/Users/allzero/Downloads"  # download folder

# Chrome options
options = Options()
# options.add_argument("--headless")  # left commented out so the browser window stays visible
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920,1080")  # Chrome expects width,height separated by a comma
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,  # default download folder
    "download.prompt_for_download": False,  # suppress the download dialog
    "download.directory_upgrade": True,  # directory upgrade
    "savefile.default_directory": download_dir,  # default save directory
    "plugins.always_open_pdf_externally": True  # download PDFs instead of opening them in Chrome's viewer
})

# 📌 Launch the browser
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service, options=options)

# 📌 Open the FSS inspection-result sanctions board
URL = "https://www.fss.or.kr/fss/job/openInfo/searchList.do"
driver.get(URL)

# 📌 Find the search box and enter "신용정보" (credit information)
search_box = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "query")))
search_box.send_keys("신용정보")
search_box.send_keys(Keys.RETURN)

# 📌 Wait 3 seconds after submitting the "신용정보" search
print("✅ Search for '신용정보' submitted!")
time.sleep(3)  # give the search results 3 seconds to load

# 📌 Confirm that page 1 has loaded by waiting for a known element
try:
    # Wait until the result list is present in the DOM
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//ul[@class='list-desc']")))
    print("✅ Page 1 loaded!")
except TimeoutException:
    print("⚠️ Page 1 failed to load!")

# 📌 Extract the PDF links on page 1
pdf_links = []

print("📑 Crawling page 1...")
soup = BeautifulSoup(driver.page_source, "html.parser")
for a_tag in soup.find_all("a", href=True):
    href = a_tag["href"]
    if ".pdf" in href:
        # Turn site-relative links into absolute URLs
        if href.startswith("/"):
            href = "https://www.fss.or.kr" + href
        pdf_links.append(href)

# 📌 Page-1 links are not downloaded here: the download loop after pagination
# iterates over all of pdf_links, so downloading now would fetch page 1 twice.

# 📌 Crawl pages 2 through 8 (collect PDF links here; downloads follow below)
for page_num in range(2, 9):
    print(f"📑 Crawling page {page_num}...")
    
    # Click the pagination button for this page number
    try:
        page_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, f"//a[@data-pageindex='{page_num}']"))
        )
        page_button.click()
        time.sleep(3)  # wait for the page to load
        
        # Collect PDF links once the page has loaded
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for a_tag in soup.find_all("a", href=True):
            href = a_tag["href"]
            if ".pdf" in href:
                if href.startswith("/"):
                    href = "https://www.fss.or.kr" + href
                pdf_links.append(href)
    except TimeoutException:
        print(f"⚠️ Page {page_num} failed to load!")

# 📌 Download every collected PDF link
for index, pdf_url in enumerate(pdf_links):
    print(f"📥 {index+1}/{len(pdf_links)} - Downloading: {pdf_url}")
    
    # Navigate to the PDF URL; with the prefs above Chrome saves the file instead of displaying it
    driver.get(pdf_url)
    time.sleep(2)  # brief pause so the download can start

print("✅ PDF downloads finished! Files were saved to the download folder.")

# 📌 Extract fields from a PDF, falling back to OCR when needed
def extract_info(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # Call extract_text() once per page and skip pages that return nothing
        page_texts = [page.extract_text() for page in pdf.pages]
        text = "\n".join(t for t in page_texts if t)
    
    # 🔹 1. Fall back to OCR when no text could be extracted
    if not text.strip():
        print("⚠️ Plain text extraction failed! Falling back to OCR")
        images = convert_from_path(pdf_path)
        ocr_text = ""
        for image in images:
            ocr_text += pytesseract.image_to_string(image, lang="kor") + "\n"
        text = ocr_text.strip()

    # 🔹 2. Financial company name (label "금융회사명")
    company_pattern = r"금융회사명\s*:\s*(.*)"
    financial_company = re.search(company_pattern, text)
    financial_company = financial_company.group(1).strip() if financial_company else "N/A"

    # 🔹 3. Sanction date (label "제재조치일")
    date_pattern = r"제재조치일\s*:\s*([\d]{4}\.\s*[\d]{1,2}\.\s*[\d]{1,2})"
    action_date = re.search(date_pattern, text)
    action_date = action_date.group(1).strip() if action_date else "N/A"

    # 🔹 4. Sanction details (table)
    action_content = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()  # rows of one table on the page, or None
            if table:
                for row in table:
                    # Replace None cells with empty strings before joining
                    row = [str(cell) if cell is not None else "" for cell in row]
                    action_content.append(" | ".join(row))
    
    action_content = "\n".join(action_content) if action_content else "N/A"

    # 🔹 5. Facts subject to sanction: everything from the "제재대상사실" heading onward
    violation_start = text.find("제재대상사실")
    if violation_start != -1:
        violation_details = text[violation_start:]  # from "제재대상사실" to the end of the text
    else:
        violation_details = "N/A"

    return [financial_company, action_date, action_content, violation_details]

# 📌 Extract the data and save it to Excel
data_list = []

for index, pdf_url in enumerate(pdf_links):
    print(f"📥 {index+1}/{len(pdf_links)} - Extracting: {pdf_url}")

    # Download the PDF to a temporary file
    pdf_response = requests.get(pdf_url, timeout=30)
    pdf_response.raise_for_status()  # fail on an HTTP error instead of parsing an error page
    pdf_path = f"temp_{index}.pdf"

    with open(pdf_path, "wb") as f:
        f.write(pdf_response.content)

    # Extract the fields
    extracted_data = extract_info(pdf_path)
    data_list.append(extracted_data)

    # Delete the temporary file
    os.remove(pdf_path)

print("✅ PDF data extraction finished! Saving to Excel...")

# 📌 Convert to a table and save to Excel
# Columns: company name, sanction date, sanction details, facts subject to sanction
df = pd.DataFrame(data_list, columns=["금융회사명", "제재조치일", "제재조치내용", "제재대상사실"])
df.to_excel("금융감독원_신용정보_제재결과.xlsx", index=False, engine='openpyxl')

print("✅ Excel file saved! Check '금융감독원_신용정보_제재결과.xlsx'.")

# 📌 Close the browser
driver.quit()
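
The script relies on `time.sleep(2)` after each `driver.get(pdf_url)`, which may be too short for large files. One possible alternative, sketched here under the assumption that Chrome keeps an in-progress download as a `*.crdownload` file in the download folder, is to poll that folder before closing the browser. `wait_for_downloads` is a hypothetical helper and not part of the script above.

```python
import time
from pathlib import Path


def wait_for_downloads(download_dir, timeout=60.0, poll_interval=0.5):
    """Return True once no *.crdownload files remain, or False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        # Chrome renames the partial file when the download completes
        if not any(Path(download_dir).glob("*.crdownload")):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Calling `wait_for_downloads(download_dir)` just before `driver.quit()` would hold the browser open until pending files finish. The check also returns True if a download has not started yet, so a short pause after `driver.get()` is still needed.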

 

🔍 Points to improve & further analysis ideas

  • Better text extraction from PDFs:
    • Combining pdfplumber with pytesseract is quite useful, but text extraction can still be difficult for complex PDF files. Analyzing the PDF structure may point to a better processing approach.
  • Web crawling optimization:
    • When a page has many dynamic elements, it is important to use selenium to wait for loading and to check the loading state of page elements accurately.
      -> Running a regular (visible) browser instead of headless mode can make the crawl easier to observe and debug.
  • Structural analysis of PDF content:
    • When the extracted table data is not organized properly, additional data cleansing is needed.
      👉 Look further into ways of converting the extracted data into a more accurate format.
  • Advanced data analysis and visualization:
    • Using the data saved to Excel, analyze the sanction history of a particular financial company, or run a time-series analysis of the periods in which sanctions were most frequent.
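
As a starting point for the time-series idea, the sanction dates extracted into the "제재조치일" column can be counted per month. This is a minimal standard-library sketch with made-up input values; `count_by_month` is a hypothetical helper, and the date pattern matches the `YYYY. M. D` form captured by the regex in the final script.

```python
import re
from collections import Counter

# Same date shape the extraction regex captures, e.g. "2024. 3. 13"
DATE_RE = re.compile(r"(\d{4})\.\s*(\d{1,2})\.\s*(\d{1,2})")


def count_by_month(date_strings):
    """Count sanction dates per "YYYY-MM" bucket, skipping unparseable values."""
    counts = Counter()
    for value in date_strings:
        match = DATE_RE.search(value or "")
        if match:
            year, month, _day = match.groups()
            counts[f"{year}-{int(month):02d}"] += 1
    return counts


# Made-up values for illustration; "N/A" rows are ignored.
monthly = count_by_month(["2024. 3. 13", "2024.03.01", "N/A", "2023. 12. 5"])
print(sorted(monthly.items()))
```

The resulting counts could then be plotted or loaded into pandas for a fuller time-series analysis.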

 

๋ฐ˜์‘ํ˜•