2007/02/11
2006/03/14
Intercepting errors in PHP php,error,error handler
function myErrorHandler($errno, $errstr, $errfile, $errline)
{
    switch ($errno) {
        case 2048: // 2048 = E_STRICT; ignore these
            break;
        default:
            echo "[$errno] $errstr in line $errline of file $errfile<br />\n";
    }
}
$old_error_handler = set_error_handler("myErrorHandler");
--
[:p] --飞扬.轻狂 [fallseir.lee]
http://fallseir.livejournal.com
http://feed.feedsky.com/fallseir
2006/03/13
URL rewriting example
#1. Enable the rewrite module
LoadModule rewrite_module modules/mod_rewrite.so
#2. Set the document root
DocumentRoot "D:/htdocs/php"
#3. Set the index pages
DirectoryIndex index.html index.html.var index.php
#4. Add the handler type for PHP pages
AddType application/x-httpd-php .php
#5. Enable per-directory configuration
AccessFileName .htaccess
<Directory "D:/htdocs/php">
#...allow .htaccess override files
AllowOverride All
#...
</Directory>
#6. Configure the .htaccess file in the D:/htdocs/php directory;
#on Windows this file must be created from a cmd command line, e.g. echo create > .htaccess
#index page
DirectoryIndex index.php
#enable the rewrite engine
RewriteEngine on
#set the rewrite rule
RewriteRule ^feed/([a-zA-Z0-9]*)$ /feedreader/renderfeed.php?brn=$1 [L]
#set the default charset
php_value default_charset UTF-8
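The RewriteRule above maps clean feed/NAME paths onto the real script. Its effect can be checked offline; a minimal sketch in Python, using the standard `re` module to stand in for mod_rewrite's regex engine (the function name `rewrite` is just for illustration):

```python
import re

# The .htaccess rule: RewriteRule ^feed/([a-zA-Z0-9]*)$ /feedreader/renderfeed.php?brn=$1 [L]
RULE = re.compile(r"^feed/([a-zA-Z0-9]*)$")

def rewrite(path):
    """Return the internal target for a request path, or None when the rule does not apply."""
    m = RULE.match(path)
    if m is None:
        return None
    return "/feedreader/renderfeed.php?brn=" + m.group(1)

print(rewrite("feed/fallseir"))  # /feedreader/renderfeed.php?brn=fallseir
print(rewrite("feed/a.b"))       # None: '.' is outside [a-zA-Z0-9]
```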
2006/03/10
code: awk worked example: using wget and awk to run a test analysis of 100,000 blogcn records
2.5521386280114333
0.99980000599982
page success rate
0.93493195204143875
0.02820915372538824
>>> 65654/100003.0
0.65652030439086828
<20k
>>> 31732/100003.0
0.31731048068557943
ostrich
wolingwu_2002
weizi
xingxingyu
congolin
inkstone
iui
yuyan
bluemoor
...
-rw-rw-r-- 1 tester tester 496 Mar 8 23:43 sub.wg
-rw-rw-r-- 1 tester tester 997676 Mar 8 23:31 url.ls
$ wget -i sub.wg -o log.txt -x -P wglist &
$ wget https://mail.google.com -nv --spider
200 OK
$ wget https://mail.google.com --spider
--00:02:27-- https://mail.google.com/...
HTTP request sent, awaiting response... 302 Moved Temporarily ...
$ date --date='2002-08-28 23:55:32 +0800' +'hello %Y/%m/%d %H:%M:%S %z'
hello 2002/08/28 23:55:32 +0800
$ date --date='2002-08-28 23:55:32 +0800' +'%x %T'
08/28/2002 23:55:32
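The same conversion is easy to reproduce in Python; a small sketch using only the standard library (`%x`/`%T` in the second `date` call are locale-dependent, so an explicit format is used here instead):

```python
from datetime import datetime, timedelta, timezone

# The timestamp from the `date` examples above, at UTC+8
t = datetime(2002, 8, 28, 23, 55, 32, tzinfo=timezone(timedelta(hours=8)))

print(t.strftime("hello %Y/%m/%d %H:%M:%S %z"))  # hello 2002/08/28 23:55:32 +0800
print(t.strftime("%m/%d/%Y %H:%M:%S"))           # 08/28/2002 23:55:32
```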
real 0m12.760s
user 0m0.003s
sys 0m0.024s
$ time wget -i sub.wg -o log_sp.txt --spider &
user 0m0.002s
sys 0m0.024s
$ time wget -i sub.wg -o log.txt -x -P down &
user 0m0.007s
sys 0m0.038s
#
$ wc url.ls
100000 100000 997676 url.ls
$ time awk '{print "http://www. *** .com/rss2.asp?blog="$1}' url.ls >url.wg
real 0m0.210s
user 0m0.164s
sys 0m0.046s
100000 100000 4597676 url.wg
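The awk one-liner above only prepends a fixed prefix to each blog name. An equivalent sketch in Python; the real host is elided as *** in the notes, so `example.com` below is a placeholder:

```python
# Equivalent of: awk '{print PREFIX $1}' url.ls > url.wg
PREFIX = "http://www.example.com/rss2.asp?blog="  # placeholder host

def make_urls(names):
    # one feed URL per blog name, mirroring the awk print statement
    return [PREFIX + name.strip() for name in names]

print(make_urls(["ostrich", "weizi"])[0])  # http://www.example.com/rss2.asp?blog=ostrich
```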
$ awk '/--[0-9]/{print $1}' log_url.txt >time.txt
$ head -n 5 time.txt
--01:19:34--
--01:19:35--
--01:19:35--
--01:19:35--
--01:19:36--
# get the end times
--01:37:01--
--01:37:02--
--01:37:02--
--01:37:02--
--01:37:03--
# compute the average interval
>>> e=(1*60+37)*60+1
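Finishing the arithmetic started above: with the first timestamp from time.txt and the end time `e`, the average interval is the elapsed span divided by the request count. A sketch (the helper name `to_seconds` is illustrative):

```python
def to_seconds(stamp):
    """Convert a wget log timestamp like '--01:37:01--' to seconds since midnight."""
    h, m, s = map(int, stamp.strip("-").split(":"))
    return (h * 60 + m) * 60 + s

s = to_seconds("--01:19:34--")  # first timestamp in time.txt
e = to_seconds("--01:37:01--")  # last timestamp; same value as e=(1*60+37)*60+1
print(e - s)                    # 1047 elapsed seconds
print((e - s) / 100000.0)       # average interval per request
```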
2006/03/09
learn: working through the wget manual
$ wget http://www.feedsky.com #fetch a page
--19:50:05-- http://www.feedsky.com/ #start time, URL
=> `index.html' # save target
Resolving www.feedsky.com... 211.154.171.184 #resolve the host
Connecting to www.feedsky.com[211.154.171.184]:80... connected. #establish the connection
HTTP request sent, awaiting response... 200 OK #request sent, status
Length: 7,237 [text/html] #file size, content type
100%[====================================>] 7,237 --.--K/s #progress
19:50:05 (183.45 KB/s) - `index.html' saved [7,237/7,237] #end time, status
$ file index.html
index.html: UTF-8 Unicode HTML document text, with CRLF line terminators
$ wget -b http://www.feedsky.com #run in the background
Continuing in background, pid 5230. #background info
Output will be written to `wget-log'. #log file
$ cat wget-log #
--19:58:23-- http://www.feedsky.com/
=> `index.html.1' #if a file with the default name exists, wget automatically uses the next numeric suffix
Resolving www.feedsky.com... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,237 [text/html]
 0K ....... 100% 162.60 KB/s #slightly different here from the on-screen output
19:58:23 (162.60 KB/s) - `index.html.1' saved [7,237/7,237]
$ cat wget.log #view the log file
--20:10:46-- http://www.feedsky.com/
=> `index.html.1'
Resolving www.feedsky.com... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,237 [text/html]
 0K ....... 100% 227.98 KB/s
20:10:46 (227.98 KB/s) - `index.html.1' saved [7,237/7,237]
--20:12:31-- http://www.feedsky.com/
=> `index.html.2'
Resolving www.feedsky.com... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,237 [text/html]
 0K ....... 100% 215.51 KB/s
20:12:36 (215.51 KB/s) - `index.html.2' saved [7,237/7,237]
$ #note the format of each record, for later analysis of the log file
$ wget http://www.feedsky.com -d #debug output
DEBUG output created by Wget 1.9+cvs-stable (Red Hat modified) on linux-gnu.
--20:21:16-- http://www.feedsky.com/
=> `index.html.4'
Resolving www.feedsky.com... 211.154.171.184
Caching www.feedsky.com => 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80 ... connected.
Created socket 3. #socket created
Releasing 0x8642158 (new refcount 1).
---request begin--- #request begins
GET / HTTP/1.0 #request method
User-Agent: Wget/1.9+cvs-stable (Red Hat modified) #user agent
Host: www.feedsky.com #target host
Accept: */* #accepted types
Connection: Keep-Alive #connection type
---request end--- #request ends
HTTP request sent, awaiting response... HTTP/1.1 200 OK #response status
Server: Microsoft-IIS/5.0 #server info
Date: Thu, 09 Mar 2006 03:29:25 GMT #date
X-Powered-By: ASP.NET #other headers
Connection: keep-alive #connection type
X-AspNet-Version: 2.0.50727 #other headers
Set-Cookie: ASP.NET_SessionId=p2cqnf451rulex551k5luv55; path=/; HttpOnly #cookie
Stored cookie www.feedsky.com 80 / nonpermanent 0 <undefined> ASP.NET_SessionId p2cqnf451rulex551k5luv55
Cache-Control: private #caching
Content-Type: text/html; charset=utf-8 #content type
Content-Length: 7237 #content length
Found www.feedsky.com in host_name_addresses_map (0x8642158) #cached host mapping
Registered fd 3 for persistent reuse.
Length: 7,237 [text/html]
100%[====================================>] 7,237 --.--K/s
20:21:21 (249.56 KB/s) - `index.html.4' saved [7,237/7,237]
$ wget http://www.feedsky.com -q
$ wget http://www.feedsky.com -q -o wget.log
$ cat wget.log
$
$ wget http://www.feedsky.com/abc -nv -a wget.log
$ wget http://www.feedsky.com/ -nv -a wget.log
$ wget http://www.feedsky.com/400 -nv -a wget.log
$ cat wget.log
http://www.feedsky.com/abc:
20:36:35 ERROR 404: Not Found.
20:36:39 URL:http://www.feedsky.com/ [7,237/7,237] -> "index.html.6" [1]
http://www.feedsky.com/400 :
20:36:43 ERROR 404: Not Found.
$ wget -i wgetscr/list.wget
--20:40:35-- http://www.feedsky.com/
=> `index.html'
Resolving www.feedsky.com... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,237 [text/html]
100%[====================================>] 7,237 --.--K/s
20:40:35 (262.14 KB/s) - `index.html' saved [7,237/7,237]
--20:40:35-- http://www.feedsky.com/404
=> `404'
Reusing connection to www.feedsky.com:80.
HTTP request sent, awaiting response... 404 Not Found
20:40:35 ERROR 404: Not Found.
--20:40:35-- http://www.china.com/
=> `index.html.1'
Resolving www.china.com... 61.151.243.108 , 61.151.243.109, 61.151.243.197, ...
Connecting to www.china.com[61.151.243.108]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,114 [text/html]
100%[====================================>] 1,114 --.--K/s
20:40:36 (10.62 MB/s) - `index.html.1' saved [1,114/1,114]
--20:40:36-- http://www.china.com/404
=> `404'
Reusing connection to www.china.com:80.
HTTP request sent, awaiting response... 404 Not Found
20:40:36 ERROR 404: Not Found.
FINISHED --20:40:36--
Downloaded: 8,351 bytes in 2 files
$ wget -i wgetscr/list.wget -o wget.log
[tester@localhost wget_test]$ cat wget.log
--20:44:27-- http://www.feedsky.com/
=> `index.html.2'
Resolving www.feedsky.com... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,237 [text/html]
 0K ....... 100% 233.60 KB/s
20:44:32 (233.60 KB/s) - `index.html.2' saved [7,237/7,237]
--20:44:32-- http://www.feedsky.com/404
=> `404'
Reusing connection to www.feedsky.com:80.
HTTP request sent, awaiting response... 404 Not Found
20:44:33 ERROR 404: Not Found.
--20:44:33-- http://www.china.com/
=> `index.html.3'
Resolving www.china.com... 61.151.243.218 , 61.151.243.226, 61.151.243.245, ...
Connecting to www.china.com[61.151.243.218]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,114 [text/html]
 0K . 100% 10.62 MB/s
20:44:38 (10.62 MB/s) - `index.html.3' saved [1,114/1,114]
--20:44:38-- http://www.china.com/404
=> `404'
Reusing connection to www.china.com:80.
HTTP request sent, awaiting response... 404 Not Found
20:44:38 ERROR 404: Not Found.
FINISHED --20:44:38--
Downloaded: 8,351 bytes in 2 files
============================
$ wget -i wgetscr/list.wget -O test
$ ll
-rw-rw-r-- 1 tester tester 7237 Mar 8 20:40 index.html
-rw-rw-r-- 1 tester tester 1114 Mar 8 20:40 index.html.1
-rw-rw-r-- 1 tester tester 8351 Mar 8 20:55 test
$ wget -i wgetscr/list.wget -O test
$ ll
...
-rw-rw-r-- 1 tester tester 8351 Mar 8 20:56 test
#running wget again overwrites the data already in test
$ wget -i wgetscr/list.wget -nc -o wget.log
$ cat wget.log
File `index.html' already there, will not retrieve.
--21:08:24-- http://www.feedsky.com/404
=> `404'
Resolving www.feedsky.com... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80 ... connected.
HTTP request sent, awaiting response... 404 Not Found
21:08:29 ERROR 404: Not Found.
File `index.html' already there, will not retrieve.
--21:08:29-- http://www.china.com/404
=> `404'
Resolving www.china.com ... 61.151.243.197, 61.151.243.207, 61.151.243.218, ...
Connecting to www.china.com[61.151.243.197]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
21:08:29 ERROR 404: Not Found.
#The log shows that with -nc, wget only checks whether a local file with the same name already exists.
$ wget -i wgetscr/list.wget -c
--21:10:50-- http://www.feedsky.com/
=> `index.html'
Resolving www.feedsky.com... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response... 200 OK
The file is already fully retrieved; nothing to do.
--21:10:51-- http://www.feedsky.com/404
=> `404'
Connecting to www.feedsky.com[211.154.171.184]:80 ... connected.
HTTP request sent, awaiting response... 404 Not Found
21:10:51 ERROR 404: Not Found.
--21:10:51-- http://www.china.com/
=> `index.html'
Resolving www.china.com... 61.151.243.108 , 61.151.243.109, 61.151.243.197, ...
Connecting to www.china.com[61.151.243.108]:80... connected.
HTTP request sent, awaiting response... 200 OK
The file is already fully retrieved; nothing to do.
--21:10:57-- http://www.china.com/404
=> `404'
Connecting to www.china.com[61.151.243.108]:80 ... connected.
HTTP request sent, awaiting response... 404 Not Found
21:10:58 ERROR 404: Not Found.
$ wget -i wgetscr/list.wget -N
...
--21:17:10-- http://www.china.com/
=> `index.html'
Resolving www.china.com... 61.151.243.207 , 61.151.243.218, 61.151.243.226, ...
Connecting to www.china.com[61.151.243.207]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,114 [text/html]
Last-modified header missing -- time-stamps turned off.
--21:17:11-- http://www.china.com/
=> `index.html'
Connecting to www.china.com[61.151.243.207]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,114 [text/html]
...
$ wget -i wgetscr/list.wget -N -nv
Last-modified header missing -- time-stamps turned off.
21:20:35 URL:http://www.feedsky.com/ [7,237/7,237] -> "index.html" [1]
http://www.feedsky.com/404:
21:20:35 ERROR 404: Not Found.
Last-modified header missing -- time-stamps turned off.
21:20:40 URL:http://www.china.com/ [1,114/1,114] -> "index.html" [1]
http://www.china.com/404:
21:20:40 ERROR 404: Not Found.
FINISHED --21:20:40--
Downloaded: 8,351 bytes in 2 files
#The log shows that if the server does not support Last-Modified, the timestamp check is disabled
#and wget makes a second connection to fetch the file as though it had been updated
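The -N behaviour seen in the log can be modelled as a small decision function. This is a sketch of the rule, not wget's actual code; timestamps are assumed to be Unix seconds already parsed from Last-Modified and the local file's mtime:

```python
def should_retrieve(remote_mtime, local_mtime):
    """Model of wget -N: fetch when the server sent no Last-Modified,
    when there is no local copy, or when the remote copy is newer."""
    if remote_mtime is None:   # "Last-modified header missing -- time-stamps turned off"
        return True
    if local_mtime is None:    # nothing downloaded yet
        return True
    return remote_mtime > local_mtime

print(should_retrieve(None, 1141000000))        # True: header missing, always re-fetch
print(should_retrieve(1141000000, 1141000000))  # False: local copy is up to date
```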
--------------------------------------------------------
-S print the server response
$ wget -i wgetscr/list.wget -N -S
--21:22:08-- http://www.feedsky.com/
=> `index.html'
Resolving www.feedsky.com... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response...
1 HTTP/1.1 200 OK
2 Server: Microsoft-IIS/5.0
3 Date: Thu, 09 Mar 2006 04:37:36 GMT
4 X-Powered-By: ASP.NET
5 Connection: keep-alive
6 X-AspNet-Version: 2.0.50727
7 Set-Cookie: ASP.NET_SessionId=dcrsn555fatsrlutzarox455 ; path=/; HttpOnly
8 Cache-Control: private
9 Content-Type: text/html; charset=utf-8
10 Content-Length: 7237
Last-modified header missing -- time-stamps turned off.
--21:22:08-- http://www.feedsky.com/
=> `index.html'
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response...
1 HTTP/1.1 200 OK
2 Server: Microsoft-IIS/5.0
3 Date: Thu, 09 Mar 2006 04:37:36 GMT
4 X-Powered-By: ASP.NET
5 Connection: keep-alive
6 X-AspNet-Version: 2.0.50727
7 Cache-Control: private
8 Content-Type: text/html; charset=utf-8
9 Content-Length: 7237
100%[====================================>] 7,237 --.--K/s
21:22:08 (199.91 KB/s) - `index.html' saved [7,237/7,237]
...
$ wget -i wgetscr/list.wget --spider
--21:27:58-- http://www.feedsky.com/
=> `index.html.6'
Resolving www.feedsky.com ... 211.154.171.184
Connecting to www.feedsky.com[211.154.171.184]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7,237 [text/html]
200 OK
--21:27:59-- http://www.feedsky.com/404
=> `404'
Connecting to www.feedsky.com[211.154.171.184]:80 ... connected.
HTTP request sent, awaiting response... 404 Not Found
21:27:59 ERROR 404: Not Found.
--21:27:59-- http://www.china.com/
=> `index.html.6'
Resolving www.china.com... 61.151.243.226, 61.151.243.245, 61.151.243.247, ...
Connecting to www.china.com[61.151.243.226]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1,114 [text/html]
200 OK
--21:27:59-- http://www.china.com/404
=> `404'
=> `404'
Connecting to www.china.com[61.151.243.226]:80 ... connected.
HTTP request sent, awaiting response... 404 Not Found
21:28:00 ERROR 404: Not Found.
--------------------------------------------------------
-T set the response timeout in seconds
-w wait SECONDS between retrievals
--limit-rate=RATE limit the download rate
--------------------------------------------------------
* Directories
============================
-nd, --no-directories don't create a directory hierarchy (the default)
-x, --force-directories force creation of directories
-nH, --no-host-directories don't create host directories
-P, --directory-prefix=PREFIX save files to PREFIX/...
--cut-dirs=NUMBER ignore NUMBER remote directory components
--- testing redirects
$ wget -iwgetscr/list2.wget
...
--22:47:11-- https://mail.google.com/
=> `index.html.2'
Resolving mail.google.com... 66.249.83.19, 66.249.83.83
Connecting to mail.google.com[66.249.83.19]:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: /mail/ [following]
--22:47:12-- https://mail.google.com/mail/
=> `index.html.2'
Connecting to mail.google.com[66.249.83.19]:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&ltmpl=yj_blanco&ltmplcache=2 [following]
--22:47:14-- https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&ltmpl=yj_blanco&ltmplcache=2
=> `ServiceLogin?service=mail&passive=true&rm=false&continue=https:%2F%2Fmail.google.com%2Fmail%2F?ui=html&zy=l&ltmpl=yj_blanco&ltmplcache=2'
Resolving www.google.com... 66.249.89.104, 66.249.89.99
Connecting to www.google.com[66.249.89.104]:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14,345 [text/html]100%[====================================>] 14,345 56.56K/s
22:47:20 (56.51 KB/s) - `ServiceLogin?service=mail&passive=true&rm=false&continue=https:%2F%2Fmail.google.com%2Fmail%2F?ui=html&zy=l&ltmpl=yj_blanco&ltmplcache=2' saved [14,345/14,345]
...
#after a server redirect, the final file's storage path no longer corresponds to the original URL
$ wget -iwgetscr/list2.wget -x
--22:58:44-- https://mail.google.com/
=> `mail.google.com/index.html'
Resolving mail.google.com... 66.249.83.19, 66.249.83.83
Connecting to mail.google.com[66.249.83.19]:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: /mail/ [following]
--22:58:50-- https://mail.google.com/mail/
=> `mail.google.com/mail/index.html'
Connecting to mail.google.com[ 66.249.83.19]:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&ltmpl=yj_blanco&ltmplcache=2 [following]
--22:58:52-- https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&ltmpl=yj_blanco&ltmplcache=2
=> `www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=https:%2F%2Fmail.google.com%2Fmail%2F?ui=html&zy=l&ltmpl=yj_blanco&ltmplcache=2'
Resolving www.google.com... 66.249.89.99, 66.249.89.104
Connecting to www.google.com[66.249.89.99]:443 ... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14,345 [text/html]100%[====================================>] 14,345 45.14K/s
22:58:53 (44.92 KB/s) - `www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=https:%2F%2Fmail.google.com%2Fmail%2F?ui=html&zy=l&ltmpl=yj_blanco&ltmplcache=2' saved [14,345/14,345]
-----------
=> `down/index.html'
=> `down/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=https:%2F%2Fmail.google.com%2Fmail%2F?ui=html&zy=l&ltmpl=yj_blanco&ltmplcache=2'
-----------
#-nH must be used together with -x; files with the same name are overwritten
wget
wget is a free tool for automatically downloading files from the network. It supports the HTTP, HTTPS, and FTP protocols and can work through an HTTP proxy.
"Automatic" means that wget can keep running in the background after the user logs out: you can log in, start a wget download task, log out, and wget will keep running until the task completes. Compared with most browsers, which need constant user attention while downloading large amounts of data, this saves a great deal of trouble.
wget can follow the links on HTML pages and download them in turn to build a local copy of a remote server, fully recreating the original site's directory structure; this is often called "recursive downloading". While downloading recursively, wget honors the Robot Exclusion standard (/robots.txt). wget can also convert links to point at the local files as it downloads, for convenient offline browsing.
wget is very robust; it adapts well to narrow bandwidth and unstable networks. If a download fails for network reasons, wget keeps retrying until the whole file has been fetched. If the server interrupts the transfer, it reconnects and resumes from where the download stopped. This is very useful for fetching large files from servers that limit connection time.
Common uses of wget
Invocation format
Usage: wget [OPTION]... [URL]...
* Mirror a site with wget:
wget -r -p -np -k http://dsec.pku.edu.cn/~usr_name/
# or
wget -m http://dsec.pku.edu.cn/~usr_name/
* Resume a partial download over an unstable network, or download during idle hours
wget -t 0 -w 31 -c http://dsec.pku.edu.cn/BBC.avi -o down.log &
# or read the list of files to download from filelist.txt
wget -t 0 -w 31 -c -B ftp://dsec.pku.edu.cn/linuxsoft -i filelist.txt -o down.log &
The commands above can also be used to download when the network is relatively idle. My routine: in mozilla I copy URLs that are inconvenient to download right away, paste them into filelist.txt, and run the second command above before leaving the system for the night.
* Download through a proxy
wget -Y on -p -k https://sourceforge.net/projects/wvware/
The proxy can be set in an environment variable or in the wgetrc file
# set the proxy in an environment variable
export PROXY=http://211.90.168.94:8080/
# set the proxy in ~/.wgetrc
http_proxy = http://proxy.yoyodyne.com:18023/
ftp_proxy = http://proxy.yoyodyne.com:18023/
wget options by category
* Startup
-V, --version display the version of wget and exit
-h, --help print this help
-b, --background go to background after startup
-e, --execute=COMMAND execute a `.wgetrc'-style command (see /etc/wgetrc or ~/.wgetrc for the format)
* Logging and input files
-o, --output-file=FILE log messages to FILE
-a, --append-output=FILE append messages to FILE
-d, --debug print debug output
-q, --quiet quiet (no output)
-v, --verbose be verbose (the default)
-nv, --non-verbose turn off verboseness, without being quiet
-i, --input-file=FILE download the URLs found in FILE
-F, --force-html treat the input file as HTML
-B, --base=URL prepend URL to the relative links in the file given by -F -i
--sslcertfile=FILE optional client certificate
--sslcertkey=KEYFILE optional keyfile for this certificate
--egd-file=FILE file name of the EGD socket
* Download
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on the local host (for hosts with multiple IPs or names)
-t, --tries=NUMBER set the number of retries to NUMBER (0 means unlimited)
-O, --output-document=FILE write documents to FILE
-nc, --no-clobber don't clobber existing files or use .# suffixes
-c, --continue resume getting a partially-downloaded file
--progress=TYPE select the progress gauge type
-N, --timestamping don't re-retrieve files unless newer than the local copy
-S, --server-response print the server response
--spider don't download anything
-T, --timeout=SECONDS set the read timeout to SECONDS
-w, --wait=SECONDS wait SECONDS between retrievals
--waitretry=SECONDS wait 1...SECONDS between retries of a retrieval
--random-wait wait from 0...2*WAIT seconds between retrievals
-Y, --proxy=on/off turn the proxy on or off
-Q, --quota=NUMBER set the retrieval quota to NUMBER
--limit-rate=RATE limit the download rate to RATE
* Directories
-nd, --no-directories don't create directories
-x, --force-directories force creation of directories
-nH, --no-host-directories don't create host directories
-P, --directory-prefix=PREFIX save files to PREFIX/...
--cut-dirs=NUMBER ignore NUMBER remote directory components
* HTTP options
--http-user=USER set the http user to USER
--http-passwd=PASS set the http password to PASS
-C, --cache=on/off (dis)allow server-cached data (normally allowed)
-E, --html-extension save all text/html documents with the .html extension
--ignore-length ignore the `Content-Length' header field
--header=STRING insert STRING among the headers
--proxy-user=USER set USER as the proxy username
--proxy-passwd=PASS set PASS as the proxy password
--referer=URL include a `Referer: URL' header in the HTTP request
-s, --save-headers save the HTTP headers to the file
-U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION
--no-http-keep-alive disable HTTP keep-alive (persistent connections)
--cookies=off don't use cookies
--load-cookies=FILE load cookies from FILE before the session
--save-cookies=FILE save cookies to FILE after the session
* FTP options
-nr, --dont-remove-listing don't remove `.listing' files
-g, --glob=on/off turn file name globbing on or off
--passive-ftp use the "passive" transfer mode (the default)
--active-ftp use the "active" transfer mode
--retr-symlinks when recursing, retrieve linked-to files (not directories)
* Recursive retrieval
-r, --recursive recursive download -- use with care!
-l, --level=NUMBER maximum recursion depth (inf or 0 for infinite)
--delete-after delete the local files after downloading them
-k, --convert-links convert non-relative links to relative ones
-K, --backup-converted before converting file X, back it up as X.orig
-m, --mirror equivalent to -r -N -l inf -nr
-p, --page-requisites download all images needed to display the HTML page
* Recursive accept/reject
-A, --accept=LIST comma-separated list of accepted extensions
-R, --reject=LIST comma-separated list of rejected extensions
-D, --domains=LIST comma-separated list of accepted domains
--exclude-domains=LIST comma-separated list of rejected domains
--follow-ftp follow FTP links from HTML documents
--follow-tags=LIST comma-separated list of HTML tags to follow
-G, --ignore-tags=LIST comma-separated list of HTML tags to ignore
-H, --span-hosts go to foreign hosts when recursing
-L, --relative follow relative links only
-I, --include-directories=LIST list of allowed directories
-X, --exclude-directories=LIST list of excluded directories
-np, --no-parent don't ascend to the parent directory
Problems
During recursive downloads, when a directory name contains Chinese characters, wget creates the local directory name following URL-encoding rules. "天网防火墙", for example, is stored as "%CC%EC%CD%F8%B7%C0%BB%F0%C7%BD", which makes the result extremely awkward to read.
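Those percent-encoded names can at least be decoded after the fact. A sketch using Python's standard library, assuming the server emitted the names as GB2312/GBK bytes (the usual case for Chinese sites of that era):

```python
from urllib.parse import unquote

encoded = "%CC%EC%CD%F8%B7%C0%BB%F0%C7%BD"
# wget percent-encoded the raw GBK bytes, so decode them with the gbk codec
decoded = unquote(encoded, encoding="gbk")
print(decoded)  # 天网防火墙
```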
--------------------------------------------------------