Stripping the meta information from web pages

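Over the past few days I have been drawing on my private corpus to write a paper on auxiliary verbs in Swahili, and I found that the pages of the weekly Raia Mwema that I scraped a few months ago still contain all of their meta tags.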

meta1

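Looking through the files on hand, I found two difficulties in opening every document. First, the number of articles differs from issue to issue. Second, not every issue is present. For example, issue 59 has 17 articles, there is no archive for issue 60, and issue 61 has 19 articles.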

meta2

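[code language="python"]
# First, collect the issues that have at least one article.
testL = range(450)
finL = []
for L in testL:
    try:
        myfile = open(str(L) + "_1")
    except IOError:
        # No article 1 means this issue was never archived.
        pass
    else:
        myfile.close()
        finL.append(str(L))

# Then try to open every article of each issue.
while finL:
    strT = finL.pop(0)
    numM = 1
    stopM = 1
    while stopM:
        try:
            myfile = open(strT + "_" + str(numM))
        except IOError:
            # The first missing number marks the end of this issue.
            print(strT + "_" + str(numM) + " doesn't exist")
            stopM = 0
        else:
            print(strT + "_" + str(numM) + " opened")
            myfile.close()
            numM += 1
[/code]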
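As a rough alternative to probing 450 issue numbers one by one, the directory can be listed once and the file names grouped by issue. This is only a minimal sketch: it assumes all files sit in one directory and are named `<issue>_<article>`, as the script above implies. The names `index_articles` and `NAME_PATTERN` are hypothetical and not part of the original script.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: scraped pages are named "<issue>_<article>", e.g. "59_17".
# Anything else (such as "59_1_new.txt") is ignored.
NAME_PATTERN = re.compile(r"^(\d+)_(\d+)$")


def index_articles(directory="."):
    """Map each issue number to the sorted list of its article numbers."""
    index = defaultdict(list)
    for path in Path(directory).iterdir():
        match = NAME_PATTERN.match(path.name)
        if match and path.is_file():
            issue, article = int(match.group(1)), int(match.group(2))
            index[issue].append(article)
    # Sort numerically so that article 10 comes after article 2.
    return {issue: sorted(index[issue]) for issue in sorted(index)}


if __name__ == "__main__":
    for issue, articles in index_articles().items():
        print("issue %d: %d articles" % (issue, len(articles)))
```

Missing issues simply never appear in the returned dictionary, and gaps inside an issue do not cut the listing short the way the first failed `open` does.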

Next, extract the title and the body text.

[code language="python"]
import re

# Patterns for the title and for the body paragraphs.
tgt1 = re.compile("<title>.+?</title>")
tgt2 = re.compile("<p>.+?</p>")

# First, collect the issues that have at least one article.
testL = range(2)
finL = []
for L in testL:
    try:
        myfile = open(str(L) + "_1")
    except IOError:
        pass
    else:
        myfile.close()
        finL.append(str(L))

# Then try to open every article of each issue.
while finL:
    strT = finL.pop(0)
    numM = 1
    stopM = 1
    while stopM:
        try:
            myfile = open(strT + "_" + str(numM))
        except IOError:
            print(strT + "_" + str(numM) + " doesn't exist")
            stopM = 0
        else:
            print(strT + "_" + str(numM) + " opened")
            # Clean up first: flatten the page to one line so ".+?" can span line breaks.
            full = myfile.read()
            myfile.close()
            full = full.replace('\n', ' ').replace('\r', '')
            pop = []
            # Extract the title.
            m = tgt1.search(full)
            while m:
                print(m.group(0))
                pop.append(m.group(0) + "\n")
                full = full.replace(m.group(0), "")
                m = tgt1.search(full)
            # Extract the body paragraphs.
            m = tgt2.search(full)
            while m:
                print(m.group(0))
                pop.append(m.group(0) + "\n")
                full = full.replace(m.group(0), "")
                m = tgt2.search(full)
            # Write the kept lines to "<issue>_<article>_new.txt".
            newfile = open(strT + "_" + str(numM) + "_new.txt", "w")
            for line in pop:
                newfile.write(line)
            newfile.close()
            numM += 1
[/code]
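The regular expressions above work on these pages, but they would miss a `<p>` tag that carries attributes and they keep the tags themselves. A hedged alternative is the standard library's `html.parser`. The sketch below is not the method used in this post: the class `TitleAndParagraphs` and the function `extract` are hypothetical names, and the output is the bare text without the surrounding tags.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the text inside <title> and <p> elements, in document order."""

    KEEP = {"title", "p"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []     # finished (tag, text) pairs
        self._tag = None     # element currently being collected, if any
        self._buffer = []    # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.flush()     # an unclosed <p> ends where the next one starts
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.flush()

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def flush(self):
        """Store the current element's text with whitespace collapsed."""
        if self._tag is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.chunks.append((self._tag, text))
        self._tag = None
        self._buffer = []


def extract(html):
    """Return a list of (tag, text) pairs for the title and the paragraphs."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    parser.flush()           # pick up a trailing element that was never closed
    return parser.chunks
```

Everything outside `<title>` and `<p>`, including the meta tags, is dropped because the parser only records text while one of those two elements is open.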

The final result:

[code language="HTML"]
<title>Raia Mwema – Kero lukuki, wajumbe mmh!</title>
<p>KWA tuliofuatilia, kwenye luninga, Mkutano Mkuu wa CCM, uliomalizika Dodoma, mwishoni mwa wiki, ‘poa’ ndiyo neno moja sahihi la kuelezea hali ya mkutano huo wa siku mbili ilivyokuwa.</p>
<p> Mambo yalikuwa ‘poa’ kweli kweli; kana kwamba nchi nzima ni ‘poa’kabisa na hakuna kero wala dhiki yoyote inayowakabili wanachama wa chama hicho tawala; achilia mbali Watanzania wasio wanachama wa chama chochote cha siasa. Katika hali ya kawaida, ripoti ya utendaji ya mkutano mkuu wowote, uwe wa chama cha siasa, cha ushirika au hata NGO, huzua majadiliano makali na hoja zinazokinzana hadi mwafaka unapopatikana. Lakini si Mkutano Mkuu wa CCM; angalau sivyo tulivyoshuhudia, majuzi, mjini Dodoma.</p>
[/code]

 

ynshen