PHP crawler: fetching HTML content

# Intro It has been a while since I last wrote a crawler… This time I took over code from the previous developers, which is written in PHP. After looking into it, I found that this can now be done without any third-party package, so I am noting it down here. # Getting HTML content Use curl Use file_get_contents curl is what I usually use; only after reading the existing code did I learn that file_get_contents can also fetch http/https content… Here are brief examples of both approaches ## curl function httpGet($url) { $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_setopt($ch, CURLOPT_HTTPHEADER, [ 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36' ]); $output = curl_exec($ch); curl_close($ch); return $output; } ## file_get_contents function htmlContentGet($url) { $opts = [ "http" => [ "method" => "GET", "header" => "User-Agent: Mozilla/5. ...

2022-04-22

/posts/php_crawler_html/

PHP
Crawler

PHP - Differences between urlencode and rawurlencode, and using http_build_query

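# I recently ran into a problem, so I am taking notes along the way (although the problem I hit has nothing to do with what follows XD) # Preface Basically, every value in a URL query string must be URL-encoded. URL encoding involves the following standards: RFC 1738, RFC 2396, RFC 3986. It mainly uses the % character to encode characters that need escaping, e.g. / -> %2F, + -> %2B. There are a few complications, though, stemming from HTTP GET and POST and application/x-www-form-urlencoded: HTML forms convert spaces to +. None of this is a big problem, because CGI and language implementations run urldecode, so the query string values a program receives are already decoded. The problem lies in the URL the browser sends to the server (that is, the path you see when you open the server's access log) ...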
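The difference between the two encoders can be sketched with built-in PHP functions only. The sample value and the `q` key are made up for illustration and do not come from the original post.

```php
<?php
// Sample value chosen for illustration: it contains a space, a plus sign and a slash.
$value = 'a b+c/d';

// urlencode() follows application/x-www-form-urlencoded: a space becomes "+".
$form = urlencode($value);      // a+b%2Bc%2Fd

// rawurlencode() follows RFC 3986: a space becomes "%20".
$raw = rawurlencode($value);    // a%20b%2Bc%2Fd

// http_build_query() defaults to PHP_QUERY_RFC1738 (urlencode style);
// passing PHP_QUERY_RFC3986 switches it to rawurlencode style.
$queryForm = http_build_query(['q' => $value]);                             // q=a+b%2Bc%2Fd
$queryRaw  = http_build_query(['q' => $value], '', '&', PHP_QUERY_RFC3986); // q=a%20b%2Bc%2Fd
```

Both forms decode back to the same value on the server side; they differ only in how the space is written in the URL itself.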

2021-01-14

/posts/php_urlencode_rawurlencode_and_http_build_query/

PHP

HTML form submit same name in php, nodejs, golang

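For most of my career I have developed in PHP, so I have also handled HTML forms with PHP. I therefore subconsciously assumed that <input type="checkbox" name="game[]" value="FGO">, that is name="game[]", was the standard way to handle multiple selections in an HTML form, and I wondered why most HTML form tutorials, even MDN, never mention it. Then one day, while reviewing a colleague's code, I found that they used JavaScript to collect the values by hand and assemble the string themselves before sending it, and that brought this unsettling question back to me. My colleague specializes in golang; even without much front-end experience, they would hardly resort to collecting values by hand for lack of such a basic concept. So while making the adjustment I also checked with my colleague and finally faced the question head on: how should multi-select form data be handled? People who have written PHP for a long time mostly know to use the method above, name="game[]", that is game + []. But when I searched seriously for this question, I unexpectedly found a Stack Overflow Q&A, Several Checkboxes sharing the same name. In fact, the W3C does not specify how a repeated name="" should be handled at all. Below are the results of testing with the native approach in PHP, nodejs, and golang # PHP The Q&A below links to the PHP doc explaining how PHP handles multiple selections ...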
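A minimal sketch of the PHP side, using `parse_str()`, which parses a string the way PHP parses a URL query string. The second value `PCR` is a made-up example and not taken from the original post.

```php
<?php
// With the [] suffix, PHP collects repeated names into an array.
parse_str('game[]=FGO&game[]=PCR', $withBrackets);
// $withBrackets is ['game' => ['FGO', 'PCR']]

// Without the suffix, a repeated name overwrites the earlier value.
parse_str('game=FGO&game=PCR', $withoutBrackets);
// $withoutBrackets is ['game' => 'PCR']
```

This is why PHP developers reach for `name="game[]"`: without the brackets only the last checked value survives parsing.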

2020-10-24

/posts/form_multi_checked_in_php_nodejs_golang/

php - Rendering loaded text as an image

# Requirement Read a piece of text, choose a font, and finally render it as an image # Tools used php7.2 freetype A note up front: this needs GD (usually enabled by default); freetype usually has to be installed separately. How to check for GD and freetype first: just use phpinfo -> % php -a Interactive shell php > then type phpinfo(); and the information is output. Then search for gd gd GD Support => enabled GD Version => bundled (2.1.0 compatible) GIF Read Support => enabled GIF Create Support => enabled JPEG Support => enabled libJPEG Version => 9 compatible PNG Support => enabled libPNG Version => 1. ...

2020-05-14

/posts/php_load_font_to_image/

PHP

PHP - check HTTP protocol

PHP - check HTTP protocol Use $protocol = (isset($_SERVER['HTTPS']) && ($_SERVER['HTTPS'] == 'on' || $_SERVER['HTTPS'] == 1) || isset($_SERVER['HTTP_X_FORWARDED_PROTO']) && $_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https') ? 'https' : 'http'; ...
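The same check can be wrapped in a small function so it is easier to reuse and test. This is only a sketch: the function name `detectProtocol` and the idea of passing the server array as a parameter are assumptions for illustration, not part of the original post.

```php
<?php
// Hypothetical helper: returns 'https' or 'http' from a $_SERVER-like array.
function detectProtocol(array $server): string
{
    // Direct TLS connection: HTTPS is 'on' or 1 ('1' as a string on most setups).
    $direct = in_array($server['HTTPS'] ?? '', ['on', '1', 1], true);

    // Behind a proxy or load balancer that sets X-Forwarded-Proto.
    $forwarded = ($server['HTTP_X_FORWARDED_PROTO'] ?? '') === 'https';

    return ($direct || $forwarded) ? 'https' : 'http';
}

// Usage: in a web request, pass the real superglobal.
$protocol = detectProtocol($_SERVER);
```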

2019-12-02

/posts/2019-12-02-php-check-http-protocol/

PHP

php - Converting floats out of scientific notation

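# php - Converting floats out of scientific notation php - float The size of a PHP float depends on the system, and small values are automatically displayed in scientific notation, which ordinary users do not read echo 0.0000234; // 2.34E-5 To display the value as a plain decimal again, use the following $s = 0.0000234; trim(rtrim(sprintf("%.10f", $s), '0'), '.'); // 0.0000234 sprintf keeps only 10 decimal places here because, on the systems I have encountered, floats are imprecise beyond 10 places ...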
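The one-liner above can be sketched as a reusable function. The name `floatToDecimalString` and the `$decimals` parameter are assumptions introduced for illustration; the default of 10 places follows the post.

```php
<?php
// Hypothetical helper: format a float as a plain decimal string.
function floatToDecimalString(float $value, int $decimals = 10): string
{
    // Fixed-point formatting avoids scientific notation, e.g. "0.0000234000".
    $fixed = sprintf('%.' . $decimals . 'f', $value);

    // Drop trailing zeros, then a dangling decimal point ("5." becomes "5").
    return rtrim(rtrim($fixed, '0'), '.');
}

echo floatToDecimalString(0.0000234), "\n"; // 0.0000234
echo floatToDecimalString(5.0), "\n";       // 5
```

Values smaller than the chosen number of decimal places round to 0, so the precision limit should match the data being displayed.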

2019-10-29

/posts/2019-10-29-php-float-float-ingress-scientific-marker-conversion/

PHP

Mac - php redis install

# Mac - php redis install ## Mac Env Mac OSX 10.14.5 ## Step git clone https://www.github.com/phpredis/phpredis.git cd phpredis phpize && ./configure && make && sudo make install test php -r "if (new Redis() == true){ echo \"\r\n OK \r\n\"; }" ## Troubleshooting ### phpize 1. $ phpize grep: /usr/include/php/main/php.h: No such file or directory grep: /usr/include/php/Zend/zend_modules.h: No such file or directory grep: /usr/include/php/Zend/zend_extensions.h: No such file or directory Configuring for: PHP Api Version: Zend Module Api No: Zend Extension Api No: Solution ...

2019-06-19

/posts/2019-06-19-mac-php-redis-install/

PHP
Redis

php - loop directory

# php - loop directory Sometimes you need PHP to loop over a directory and list all the files in it $directory = scandir('./js'); foreach($directory as $file) { if ($file === '.' || $file === '..') { continue; } echo $file; echo "\n"; } ...
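A slightly more defensive variant of the loop above, as a sketch. The function name `listDirectory` is made up for illustration; it filters out the `.` and `..` entries with `array_diff()` and returns an empty list when the directory cannot be read.

```php
<?php
// Hypothetical helper: list entries of a directory without "." and "..".
function listDirectory(string $path): array
{
    // scandir() returns false (and emits a warning) if $path cannot be read.
    $entries = is_dir($path) ? scandir($path) : false;
    if ($entries === false) {
        return [];
    }

    // array_values() reindexes the result from 0 after filtering.
    return array_values(array_diff($entries, ['.', '..']));
}

foreach (listDirectory('.') as $file) {
    echo $file, "\n";
}
```

Like `scandir()` itself, this returns subdirectory names as well as file names, sorted alphabetically by default.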

2017-12-13

/posts/2017-12-13-php-loop-directory/

PHP

php - Comparing times

# php - Comparing times A note on how to compare times (using timestamps) time() > strtotime('2017-11-13 23:59:59'); ...
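The comparison above can be sketched as a small helper. The name `isPast` and the optional `$now` parameter are assumptions added for illustration; note that `strtotime()` interprets the string in the script's default time zone and returns false when it cannot parse it.

```php
<?php
// Hypothetical helper: true once the given date-time string is in the past.
function isPast(string $deadline, ?int $now = null): bool
{
    $now = $now ?? time();

    $timestamp = strtotime($deadline);
    if ($timestamp === false) {
        throw new InvalidArgumentException("Unparseable time: {$deadline}");
    }

    return $now > $timestamp;
}

var_dump(isPast('2017-11-13 23:59:59')); // bool(true)
```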

2017-12-04

/posts/2017-12-04-php-compared-to-time/

PHP

Something about XSS(Cross-site scripting)

# Something about XSS(Cross-site scripting) With nothing set up, using code like <?php echo $_GET['name'];?> with the query string name = <script>alert(document.cookie)</script> and no XSS defense: In Firefox In Chrome In Safari ## Result Chrome and Safari handle XSS by default ## Defense Set the header X-XSS-Protection: 1. If you use PHP, you can use htmlspecialchars() // or htmlentities() ## Important! In the end, we must understand that the fix is to encode the output so that JavaScript cannot run on the page ...
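A minimal sketch of the encoding defense with `htmlspecialchars()`. The wrapper name `escapeHtml` is made up for illustration, and a literal string stands in for `$_GET['name']` so the snippet runs on its own.

```php
<?php
// Hypothetical wrapper: encode a value before echoing it into HTML.
function escapeHtml(string $value): string
{
    // ENT_QUOTES also encodes single and double quotes, for attribute contexts.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Stand-in for $_GET['name'] carrying the payload from the post.
$name = '<script>alert(document.cookie)</script>';

echo escapeHtml($name), "\n";
// &lt;script&gt;alert(document.cookie)&lt;/script&gt;
```

The browser then displays the tag as text instead of executing it.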

2015-11-01

/posts/2015-11-01-something-about-xsscross-site-scripting/

Tedshd's Dev note

Category: PHP