{"id":2484,"date":"2020-03-06T09:48:46","date_gmt":"2020-03-06T00:48:46","guid":{"rendered":"http:\/\/43.203.250.216\/?p=2484"},"modified":"2020-12-30T12:08:17","modified_gmt":"2020-12-30T03:08:17","slug":"%ed%95%9c%ea%b8%80-%ed%85%8d%ec%8a%a4%ed%8a%b8-%ec%b6%94%ec%b6%9c%ec%9d%84-%ec%9c%84%ed%95%9c-python-pdf-module","status":"publish","type":"post","link":"https:\/\/litcoder.com\/?p=2484","title":{"rendered":"Python PDF modules for Korean text extraction"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>This post was written in March 2020, so some details may have changed by the time you read it.<\/p><\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">PyPDF2<\/h2>\n\n\n\n<p>PyPDF2 offers many convenient features, such as reading the metadata of a PDF file and splitting or merging a document page by page. However, it cannot extract Korean text correctly (reportedly this affects all CJK text, not just Korean), so it did not suit my purpose.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">PDFMiner<\/h2>\n\n\n\n<p>It handles Korean text without problems. 
However, it does not separately support page-level access, so to reach a particular page you have to use a trick that walks through the pages sequentially from the beginning until it finds the requested one. Each lookup is O(N) in the number of pages, so fetching every page this way costs O(N^2) overall, and performance degrades badly as soon as the file gets even slightly large. <\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"eclipse\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># PDFMiner random access trick.\n# Assumes fileobject, interpreter, retstr and reqPage (0-based) are set up beforehand.\nfrom pdfminer.pdfpage import PDFPage\n\nfor pageNumber, page in enumerate(PDFPage.get_pages(fileobject)):\n  # Extract the text once the requested page is found.\n  if pageNumber == reqPage:  # compare by value, not identity\n    interpreter.process_page(page)\n    text = retstr.getvalue()\n    break  # no need to scan the remaining pages<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Tika<\/h2>\n\n\n\n<p>It is a well-known project used in many places, and it has also been ported to a Python module (<a href=\"https:\/\/github.com\/chrismattmann\/tika-python\">tika-python<\/a>). 
It extracts Korean text without problems. The module itself does not support page-level text extraction, but there is a clever trick instead: export the PDF as XML, then use BeautifulSoup to find the <strong>&lt;div class=&quot;page&quot;&gt;<\/strong> tags and access the content page by page (<a href=\"https:\/\/stackoverflow.com\/questions\/53093531\/python-apache-tika-single-page-parser\">StackOverflow<\/a>). I used lxml as the BeautifulSoup parser.<\/p>\n\n\n\n<p><strong>Caution.<\/strong> Although it is not stated explicitly, tika-python depends on a JRE (Java Runtime Environment), so if an error occurs at run time, check that a JRE is properly installed and accessible. I confirmed that it works with the Ubuntu 18.04 default-jre package (OpenJDK 11).<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ java --version\nopenjdk 11.0.6 2020-01-14\nOpenJDK Runtime Environment (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1)\nOpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1, mixed mode, sharing)<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>For operations on the PDF document itself, such as merging, splitting, and reading document information, PyPDF2 is very convenient. 
For extracting Korean text, Tika is worth considering, and Tika + BeautifulSoup if you need page-level access. As for PDFMiner, well, I did not find it very good.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post was written in March 2020, so some details may have changed by the time you read it. PyPDF2 PyPDF2 offers many convenient features, such as reading the metadata of a PDF file and splitting or merging a document page by page. However, it cannot extract Korean text correctly (reportedly this affects all CJK text, not just Korean), so it did not suit my purpose. PDFMiner It handles Korean text without problems. 
However, it does not separately support page-level [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[5],"tags":[225,105,227,226],"class_list":["post-2484","post","type-post","status-publish","format-standard","hentry","category-programming","tag-pdf","tag-python","tag-tika","tag-226"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/litcoder.com\/index.php?rest_route=\/wp\/v2\/posts\/2484","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/litcoder.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/litcoder.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/litcoder.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/litcoder.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2484"}],"version-history":[{"count":20,"href":"https:\/\/litcoder.com\/index.php?rest_route=\/wp\/v2\/posts\/2484\/revisions"}],"predecessor-version":[{"id":2709,"href":"https:\/\/litcoder.com\/index.php?rest_route=\/wp\/v2\/posts\/2484\/revisions\/2709"}],"wp:attachment":[{"href":"https:\/\/litcoder.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2484"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/litcoder.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2484"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/litcoder.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2484"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}