Extracting Swahili example sentences that contain a specific tense marker

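I am writing a paper on the Swahili -ki- tense, so I wanted to search a corpus for every occurrence of -ki- and look at the surrounding context. AntConc can do this, but I felt Python might be more convenient. The approach is to split each passage into words with split, then check whether each word contains a verb stem (see here for how to generate the list of Swahili verb stems). If a word contains a verb stem and the -ki- marker also appears within the first four characters of the word, the line is collected.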

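[code language="python"]
# Load the verb stems (whitespace-separated) from the stem list.
with open("verb_stems.txt") as stem_file:
    stems = stem_file.read().split()

junk = ["hakika", "akili"]  # words to skip even though they would match
matches = []                # lines that contain at least one -ki- verb form

with open("I.txt") as corpus, open("I_new.txt", "w") as out:
    line = corpus.readline()
    out.write(line)  # the first line is always copied to the output
    while line:
        for word in line.split():
            word = word.strip(".,!?")
            if word in junk:
                continue
            for stem in stems:
                pos = word.rfind(stem)
                trailing = len(word) - len(stem) - pos
                # Find words that contain a verb stem: the stem starts at
                # index 3 or later, at most two characters follow it, and
                # the word is at least five characters long.
                if pos > 2 and trailing <= 2 and len(word) >= 5:
                    # "ki" must occur within characters 2 to 4 of the word.
                    if "ki" in word[1:4]:
                        print("found " + stem + " in dictionary")
                        print("for the word: " + word)
                        print("phrase is " + word[pos:])
                        print("the preceding sequence is " + word[:pos] + " which contains 'ki'")
                        print("\n")
                        if line not in matches:
                            matches.append(line)
        line = corpus.readline()
    # Write each matching line once, in the order it was found.
    for match in matches:
        out.write(match)
[/code]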

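This method turned up a rather fun Swahili sentence, with the same immediacy as a Deadpool comic talking directly to the reader.

Walikuwa maskini, ukipenda waite maskini wa mwisho.
"They were poor back then; if you like, you can call them the poorest of the poor."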
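
To see why this sentence gets picked up, the word-level check can be pulled out into a small function and run on it. This is only a sketch of the same heuristic, not the script above: the name `find_ki_stem` and the two-entry stem list are made up for illustration, and a real run would load the stems from `verb_stems.txt`.

```python
def find_ki_stem(word, stems, junk=()):
    """Return the first stem for which `word` passes the -ki- check, else None."""
    word = word.strip(".,!?")
    # Skip listed false positives and words that are too short.
    if word in junk or len(word) < 5:
        return None
    # "ki" must occur within characters 2 to 4 of the word.
    if "ki" not in word[1:4]:
        return None
    for stem in stems:
        pos = word.rfind(stem)
        trailing = len(word) - len(stem) - pos
        # The stem starts at index 3 or later and at most two characters follow it.
        if pos > 2 and trailing <= 2:
            return stem
    return None


# Hypothetical stem list, for illustration only.
stems = ["penda", "ita"]
sentence = "Walikuwa maskini, ukipenda waite maskini wa mwisho."

hits = []
for raw in sentence.split():
    stem = find_ki_stem(raw, stems)
    if stem is not None:
        hits.append((raw.strip(".,!?"), stem))

print(hits)
```

With these two stems only ukipenda is reported. maskini also contains "ki", but too late in the word to fall inside characters 2 to 4, so it is rejected before any stem is tried.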


ynshen