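Python 3 Natural Language Processing (NLTK): Language Big Data
This article gives a brief introduction to natural language processing with Python's NLTK library.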
NLTK
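NLTK is a Python library for processing text. Written knowledge adds up to an enormous volume of data, which is why this post belongs to the big data series.
This article only scratches the surface. I have not yet worked in this area and wrote it out of occasional curiosity.
Loading all items from NLTK's book module
- The book module contains all of the data used below.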
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1
<Text: Moby Dick by Herman Melville 1851>
text2
<Text: Sense and Sensibility by Jane Austen 1811>
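Searching text or topics
- concordance finds every occurrence of a word in a text and prints each one with its surrounding context.
- similar finds words that appear in contexts similar to those of the search word; it could serve as a relatedness feature in a search engine.
- common_contexts finds the contexts shared by two or more words.
- dispersion_plot draws a dispersion plot showing where words occur in a text.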
text1.concordance('monstrous')  # look up the word 'monstrous' in text1
# concordance
# UK [kən'kɔːd(ə)ns] US [kən'kɔrdns]
# n. harmony, agreement; index of terms; index of the important words used in a work or by an author
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
text2.concordance('affection')
Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This
opinion . But by an appeal to her affection for her mother , by representing t
every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if
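To make the idea behind concordance concrete, here is a minimal plain-Python sketch. It is not NLTK's implementation: the helper name `concordance_lines`, the `width` parameter and the sample sentence are all made up for illustration, and context is measured in tokens rather than characters.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on both sides."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively so 'Monstrous' and 'monstrous' both match.
        if token.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines


# Hypothetical sample text, already split into tokens.
sample = "a most monstrous size and a monstrous fable of the sea".split()
for line in concordance_lines(sample, "monstrous"):
    print(line)
```

Each printed line shows the search word together with its neighbours, which is the same keyword-in-context view that the output above gives for text1 and text2.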
text1.similar('monstrous')
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
text2.similar('monstrous')
very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly
text2.common_contexts(['monstrous','very'])
a_pretty am_glad a_lucky is_pretty be_glad
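The output a_pretty, am_glad and so on reads as "left word, underscore, right word". The following sketch shows one way such shared contexts could be computed with plain Python. It is an illustration only, not NLTK's algorithm; `contexts_by_word`, `shared_contexts` and the sample sentence are assumptions made for this example.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    contexts = defaultdict(set)
    lowered = [t.lower() for t in tokens]
    # The first and last tokens have no complete context, so they are skipped.
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return contexts


def shared_contexts(tokens, word_a, word_b):
    """Return the contexts, formatted as 'left_right', in which both words appear."""
    contexts = contexts_by_word(tokens)
    common = contexts[word_a.lower()] & contexts[word_b.lower()]
    return sorted(left + "_" + right for left, right in common)


# Hypothetical sample in which 'very' and 'monstrous' fill the same slots.
sample = ("I am very glad and a very pretty day ; "
          "I am monstrous glad and a monstrous pretty day").split()
print(shared_contexts(sample, "monstrous", "very"))
```

Two words that share many contexts tend to be used in the same way, which is the intuition behind both similar and common_contexts.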
# Show where each word occurs in the text, counted as the number of words from the start.
# Each stripe represents an instance of a word,
# and each row represents the entire text.
text4.dispersion_plot(['citizens','democracy','freedom','duties','America','liberty'])
# dispersion
# UK [dɪ'spɜːʃ(ə)n] US [dɪ'spɝʒn]
# n. scattering; [statistics][math] deviation; dispersal
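A dispersion plot is drawn from the positions at which each word occurs. The sketch below computes those positions for a small hand-made token list. The function name `word_offsets` and the sample sentence are hypothetical, and the plotting step itself is left out.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Hypothetical sample text, already split into tokens.
sample = "citizens love liberty and citizens defend liberty and freedom".split()
print(word_offsets(sample, ["citizens", "liberty", "freedom", "duties"]))
```

Each position would become one stripe in the row for that word. A word that never occurs, such as 'duties' here or a misspelled search term, gets an empty row.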
print(text3.generate('monstrous'))  # returned None in the NLTK version used here; generate() differs between NLTK versions
None
Counting vocabulary
len(text3)
44764
sorted(set(text3))
[\'!\', "\'",
\'(\',
\')\',
\',\',
\',)\',
\'.\',
\'.)\',
\':\',
\';\',
\';)\',
\'?\',
\'?)\',
\'A\',
\'Abel\',
\'Abelmizraim\',
\'Abidah\',
\'Abide\',
\'Abimael\',
\'Abimelech\',
\'Abr\',
\'Abrah\',
\'Abraham\',
\'Abram\',
\'Accad\',
\'Achbor\',
\'Adah\',
\'Adam\',
\'Adbeel\',
\'Admah\',
\'Adullamite\',
\'After\',
\'Aholibamah\',
\'Ahuzzath\',
\'Ajah\',
\'Akan\',
\'All\',
\'Allonbachuth\',
\'Almighty\',
\'Almodad\',
\'Also\',
\'Alvah\',
\'Alvan\',
\'Am\',
\'Amal\',
\'Amalek\',
\'Amalekites\',
\'Ammon\',
\'Amorite\',
\'Amorites\',
\'Amraphel\',
\'An\',
\'Anah\',
\'Anamim\',
\'And\',
\'Aner\',
\'Angel\',
\'Appoint\',
\'Aram\',
\'Aran\',
\'Ararat\',
\'Arbah\',
\'Ard\',
\'Are\',
\'Areli\',
\'Arioch\',
\'Arise\',
\'Arkite\',
\'Arodi\',
\'Arphaxad\',
\'Art\',
\'Arvadite\',
\'As\',
\'Asenath\',
\'Ashbel\',
\'Asher\',
\'Ashkenaz\',
\'Ashteroth\',
\'Ask\',
\'Asshur\',
\'Asshurim\',
\'Assyr\',
\'Assyria\',
\'At\',
\'Atad\',
\'Avith\',
\'Baalhanan\',
\'Babel\',
\'Bashemath\',
\'Be\',
\'Because\',
\'Becher\',
\'Bedad\',
\'Beeri\',
\'Beerlahairoi\',
\'Beersheba\',
\'Behold\',
\'Bela\',
\'Belah\',
\'Benam\',
\'Benjamin\',
\'Beno\',
\'Beor\',
\'Bera\',
\'Bered\',
\'Beriah\',
\'Bethel\',
\'Bethlehem\',
\'Bethuel\',
\'Beware\',
\'Bilhah\',
\'Bilhan\',
\'Binding\',
\'Birsha\',
\'Bless\',
\'Blessed\',
\'Both\',
\'Bow\',
\'Bozrah\',
\'Bring\',
\'But\',
\'Buz\',
\'By\',
\'Cain\',
\'Cainan\',
\'Calah\',
\'Calneh\',
\'Can\',
\'Cana\',
\'Canaan\',
\'Canaanite\',
\'Canaanites\',
\'Canaanitish\',
\'Caphtorim\',
\'Carmi\',
\'Casluhim\',
\'Cast\',
\'Cause\',
\'Chaldees\',
\'Chedorlaomer\',
\'Cheran\',
\'Cherubims\',
\'Chesed\',
\'Chezib\',
\'Come\',
\'Cursed\',
\'Cush\',
\'Damascus\',
\'Dan\',
\'Day\',
\'Deborah\',
\'Dedan\',
\'Deliver\',
\'Diklah\',
\'Din\',
\'Dinah\',
\'Dinhabah\',
\'Discern\',
\'Dishan\',
\'Dishon\',
\'Do\',
\'Dodanim\',
\'Dothan\',
\'Drink\',
\'Duke\',
\'Dumah\',
\'Earth\',
\'Ebal\',
\'Eber\',
\'Edar\',
\'Eden\',
\'Edom\',
\'Edomites\',
\'Egy\',
\'Egypt\',
\'Egyptia\',
\'Egyptian\',
\'Egyptians\',
\'Ehi\',
\'Elah\',
\'Elam\',
\'Elbethel\',
\'Eldaah\',
\'EleloheIsrael\',
\'Eliezer\',
\'Eliphaz\',
\'Elishah\',
\'Ellasar\',
\'Elon\',
\'Elparan\',
\'Emins\',
\'En\',
\'Enmishpat\',
\'Eno\',
\'Enoch\',
\'Enos\',
\'Ephah\',
\'Epher\',
\'Ephra\',
\'Ephraim\',
\'Ephrath\',
\'Ephron\',
\'Er\',
\'Erech\',
\'Eri\',
\'Es\',
\'Esau\',
\'Escape\',
\'Esek\',
\'Eshban\',
\'Eshcol\',
\'Ethiopia\',
\'Euphrat\',
\'Euphrates\',
\'Eve\',
\'Even\',
\'Every\',
\'Except\',
\'Ezbon\',
\'Ezer\',
\'Fear\',
\'Feed\',
\'Fifteen\',
\'Fill\',
\'For\',
\'Forasmuch\',
\'Forgive\',
\'From\',
\'Fulfil\',
\'G\',
\'Gad\',
\'Gaham\',
\'Galeed\',
\'Gatam\',
\'Gather\',
\'Gaza\',
\'Gentiles\',
\'Gera\',
\'Gerar\',
\'Gershon\',
\'Get\',
\'Gether\',
\'Gihon\',
\'Gilead\',
\'Girgashites\',
\'Girgasite\',
\'Give\',
\'Go\',
\'God\',
\'Gomer\',
\'Gomorrah\',
\'Goshen\',
\'Guni\',
\'Hadad\',
\'Hadar\',
\'Hadoram\',
\'Hagar\',
\'Haggi\',
\'Hai\',
\'Ham\',
\'Hamathite\',
\'Hamor\',
\'Hamul\',
\'Hanoch\',
\'Happy\',
\'Haran\',
\'Hast\',
\'Haste\',
\'Have\',
\'Havilah\',
\'Hazarmaveth\',
\'Hazezontamar\',
\'Hazo\',
\'He\',
\'Hear\',
\'Heaven\',
\'Heber\',
\'Hebrew\',
\'Hebrews\',
\'Hebron\',
\'Hemam\',
\'Hemdan\',
\'Here\',
\'Hereby\',
\'Heth\',
\'Hezron\',
\'Hiddekel\',
\'Hinder\',
\'Hirah\',
\'His\',
\'Hitti\',
\'Hittite\',
\'Hittites\',
\'Hivite\',
\'Hobah\',
\'Hori\',
\'Horite\',
\'Horites\',
\'How\',
\'Hul\',
\'Huppim\',
\'Husham\',
\'Hushim\',
\'Huz\',
\'I\',
\'If\',
\'In\',
\'Irad\',
\'Iram\',
\'Is\',
\'Isa\',
\'Isaac\',
\'Iscah\',
\'Ishbak\',
\'Ishmael\',
\'Ishmeelites\',
\'Ishuah\',
\'Isra\',
\'Israel\',
\'Issachar\',
\'Isui\',
\'It\',
\'Ithran\',
\'Jaalam\',
\'Jabal\',
\'Jabbok\',
\'Jac\',
\'Jachin\',
\'Jacob\',
\'Jahleel\',
\'Jahzeel\',
\'Jamin\',
\'Japhe\',
\'Japheth\',
\'Jared\',
\'Javan\',
\'Jebusite\',
\'Jebusites\',
\'Jegarsahadutha\',
\'Jehovahjireh\',
\'Jemuel\',
\'Jerah\',
\'Jetheth\',
\'Jetur\',
\'Jeush\',
\'Jezer\',
\'Jidlaph\',
\'Jimnah\',
\'Job\',
\'Jobab\',
\'Jokshan\',
\'Joktan\',
\'Jordan\',
\'Joseph\',
\'Jubal\',
\'Judah\',
\'Judge\',
\'Judith\',
\'Kadesh\',
\'Kadmonites\',
\'Karnaim\',
\'Kedar\',
\'Kedemah\',
\'Kemuel\',
\'Kenaz\',
\'Kenites\',
\'Kenizzites\',
\'Keturah\',
\'Kiriathaim\',
\'Kirjatharba\',
\'Kittim\',
\'Know\',
\'Kohath\',
\'Kor\',
\'Korah\',
\'LO\',
\'LORD\',
\'Laban\',
\'Lahairoi\',
\'Lamech\',
\'Lasha\',
\'Lay\',
\'Leah\',
\'Lehabim\',
\'Lest\',
\'Let\',
\'Letushim\',
\'Leummim\',
\'Levi\',
\'Lie\',
\'Lift\',
\'Lo\',
\'Look\',
\'Lot\',
\'Lotan\',
\'Lud\',
\'Ludim\',
\'Luz\',
\'Maachah\',
\'Machir\',
\'Machpelah\',
\'Madai\',
\'Magdiel\',
\'Magog\',
\'Mahalaleel\',
\'Mahalath\',
\'Mahanaim\',
\'Make\',
\'Malchiel\',
\'Male\',
\'Mam\',
\'Mamre\',
\'Man\',
\'Manahath\',
\'Manass\',
\'Manasseh\',
\'Mash\',
\'Masrekah\',
\'Massa\',
\'Matred\',
\'Me\',
\'Medan\',
\'Mehetabel\',
\'Mehujael\',
\'Melchizedek\',
\'Merari\',
\'Mesha\',
\'Meshech\',
\'Mesopotamia\',
\'Methusa\',
\'Methusael\',
\'Methuselah\',
\'Mezahab\',
\'Mibsam\',
\'Mibzar\',
\'Midian\',
\'Midianites\',
\'Milcah\',
\'Mishma\',
\'Mizpah\',
\'Mizraim\',
\'Mizz\',
\'Moab\',
\'Moabites\',
\'Moreh\',
\'Moreover\',
\'Moriah\',
\'Muppim\',
\'My\',
\'Naamah\',
\'Naaman\',
\'Nahath\',
\'Nahor\',
\'Naphish\',
\'Naphtali\',
\'Naphtuhim\',
\'Nay\',
\'Nebajoth\',
\'Neither\',
\'Night\',
\'Nimrod\',
\'Nineveh\',
\'Noah\',
\'Nod\',
\'Not\',
\'Now\',
\'O\',
\'Obal\',
\'Of\',
\'Oh\',
\'Ohad\',
\'Omar\',
\'On\',
\'Onam\',
\'Onan\',
\'Only\',
\'Ophir\',
\'Our\',
\'Out\',
\'Padan\',
\'Padanaram\',
\'Paran\',
\'Pass\',
\'Pathrusim\',
\'Pau\',
\'Peace\',
\'Peleg\',
\'Peniel\',
\'Penuel\',
\'Peradventure\',
\'Perizzit\',
\'Perizzite\',
\'Perizzites\',
\'Phallu\',
\'Phara\',
\'Pharaoh\',
\'Pharez\',
\'Phichol\',
\'Philistim\',
\'Philistines\',
\'Phut\',
\'Phuvah\',
\'Pildash\',
\'Pinon\',
\'Pison\',
\'Potiphar\',
\'Potipherah\',
\'Put\',
\'Raamah\',
\'Rachel\',
\'Rameses\',
\'Rebek\',
\'Rebekah\',
\'Rehoboth\',
\'Remain\',
\'Rephaims\',
\'Resen\',
\'Return\',
\'Reu\',
\'Reub\',
\'Reuben\',
\'Reuel\',
\'Reumah\',
\'Riphath\',
\'Rosh\',
\'Sabtah\',
\'Sabtech\',
\'Said\',
\'Salah\',
\'Salem\',
\'Samlah\',
\'Sarah\',
\'Sarai\',
\'Saul\',
\'Save\',
\'Say\',
\'Se\',
\'Seba\',
\'See\',
\'Seeing\',
\'Seir\',
\'Sell\',
\'Send\',
\'Sephar\',
\'Serah\',
\'Sered\',
\'Serug\',
\'Set\',
\'Seth\',
\'Shalem\',
\'Shall\',
\'Shalt\',
\'Shammah\',
\'Shaul\',
\'Shaveh\',
\'She\',
\'Sheba\',
\'Shebah\',
\'Shechem\',
\'Shed\',
\'Shel\',
\'Shelah\',
\'Sheleph\',
\'Shem\',
\'Shemeber\',
\'Shepho\',
\'Shillem\',
\'Shiloh\',
\'Shimron\',
\'Shinab\',
\'Shinar\',
\'Shobal\',
\'Should\',
\'Shuah\',
\'Shuni\',
\'Shur\',
\'Sichem\',
\'Siddim\',
\'Sidon\',
\'Simeon\',
\'Sinite\',
\'Sitnah\',
\'Slay\',
\'So\',
\'Sod\',
\'Sodom\',
\'Sojourn\',
\'Some\',
\'Spake\',
\'Speak\',
\'Spirit\',
\'Stand\',
\'Succoth\',
\'Surely\',
\'Swear\',
\'Syrian\',
\'Take\',
\'Tamar\',
\'Tarshish\',
\'Tebah\',
\'Tell\',
\'Tema\',
\'Teman\',
\'Temani\',
\'Terah\',
\'Thahash\',
\'That\',
\'The\',
\'Then\',
\'There\',
\'Therefore\',
\'These\',
\'They\',
\'Thirty\',
\'This\',
\'Thorns\',
\'Thou\',
\'Thus\',
\'Thy\',
\'Tidal\',
\'Timna\',
\'Timnah\',
\'Timnath\',
\'Tiras\',
\'To\',
\'Togarmah\',
\'Tola\',
\'Tubal\',
\'Tubalcain\',
\'Twelve\',
\'Two\',
\'Unstable\',
\'Until\',
\'Unto\',
\'Up\',
\'Upon\',
\'Ur\',
\'Uz\',
\'Uzal\',
\'We\',
\'What\',
\'When\',
\'Whence\',
\'Where\',
\'Whereas\',
\'Wherefore\',
\'Which\',
\'While\',
\'Who\',
\'Whose\',
\'Whoso\',
\'Why\',
\'Wilt\',
\'With\',
\'Woman\',
\'Ye\',
\'Yea\',
\'Yet\',
\'Zaavan\',
\'Zaphnathpaaneah\',
\'Zar\',
\'Zarah\',
\'Zeboiim\',
\'Zeboim\',
\'Zebul\',
\'Zebulun\',
\'Zemarite\',
\'Zepho\',
\'Zerah\',
\'Zibeon\',
\'Zidon\',
\'Zillah\',
\'Zilpah\',
\'Zimran\',
\'Ziphion\',
\'Zo\',
\'Zoar\',
\'Zohar\',
\'Zuzims\',
\'a\',
\'abated\',
\'abide\',
\'able\',
\'abode\',
\'abomination\',
\'about\',
\'above\',
\'abroad\',
\'absent\',
\'abundantly\',
\'accept\',
\'accepted\',
\'according\',
\'acknowledged\',
\'activity\',
\'add\',
\'adder\',
\'afar\',
\'afflict\',
\'affliction\',
\'afraid\',
\'after\',
\'afterward\',
\'afterwards\',
\'aga\',
\'again\',
\'against\',
\'age\',
\'aileth\',
\'air\',
\'al\',
\'alive\',
\'all\',
\'almon\',
\'alo\',
\'alone\',
\'aloud\',
\'also\',
\'altar\',
\'altogether\',
\'always\',
\'am\',
\'among\',
\'amongst\',
\'an\',
\'and\',
\'angel\',
\'angels\',
\'anger\',
\'angry\',
\'anguish\',
\'anointedst\',
\'anoth\',
\'another\',
\'answer\',
\'answered\',
\'any\',
\'anything\',
\'appe\',
\'appear\',
\'appeared\',
\'appease\',
\'appoint\',
\'appointed\',
\'aprons\',
\'archer\',
\'archers\',
\'are\',
\'arise\',
\'ark\',
\'armed\',
\'arms\',
\'army\',
\'arose\',
\'arrayed\',
\'art\',
\'artificer\',
\'as\',
\'ascending\',
\'ash\',
\'ashamed\',
\'ask\',
\'asked\',
\'asketh\',
\'ass\',
\'assembly\',
\'asses\',
\'assigned\',
\'asswaged\',
\'at\',
\'attained\',
\'audience\',
\'avenged\',
\'aw\',
\'awaked\',
\'away\',
\'awoke\',
\'back\',
\'backward\',
\'bad\',
\'bade\',
\'badest\',
\'badne\',
\'bak\',
\'bake\',
\'bakemeats\',
\'baker\',
\'bakers\',
\'balm\',
\'bands\',
\'bank\',
\'bare\',
\'barr\',
\'barren\',
\'basket\',
\'baskets\',
\'battle\',
\'bdellium\',
\'be\',
\'bear\',
\'beari\',
\'bearing\',
\'beast\',
\'beasts\',
\'beautiful\',
\'became\',
\'because\',
\'become\',
\'bed\',
\'been\',
\'befall\',
\'befell\',
\'before\',
\'began\',
\'begat\',
\'beget\',
\'begettest\',
\'begin\',
\'beginning\',
\'begotten\',
\'beguiled\',
\'beheld\',
\'behind\',
\'behold\',
\'being\',
\'believed\',
\'belly\',
\'belong\',
\'beneath\',
\'bereaved\',
\'beside\',
\'besides\',
\'besought\',
\'best\',
\'betimes\',
\'better\',
\'between\',
\'betwixt\',
\'beyond\',
\'binding\',
\'bird\',
\'birds\',
\'birthday\',
\'birthright\',
\'biteth\',
\'bitter\',
\'blame\',
\'blameless\',
\'blasted\',
\'bless\',
\'blessed\',
\'blesseth\',
\'blessi\',
\'blessing\',
\'blessings\',
\'blindness\',
\'blood\',
\'blossoms\',
\'bodies\',
\'boldly\',
\'bondman\',
\'bondmen\',
\'bondwoman\',
\'bone\',
\'bones\',
\'book\',
\'booths\',
\'border\',
\'borders\',
\'born\',
\'bosom\',
\'both\',
\'bottle\',
\'bou\',
\'boug\',
\'bough\',
\'bought\',
\'bound\',
\'bow\',
\'bowed\',
\'bowels\',
\'bowing\',
\'boys\',
\'bracelets\',
\'branches\',
\'brass\',
\'bre\',
\'breach\',
\'bread\',
\'breadth\',
\'break\',
\'breaketh\',
\'breaking\',
\'breasts\',
\'breath\',
\'breathed\',
\'breed\',
\'brethren\',
\'brick\',
\'brimstone\',
\'bring\',
\'brink\',
\'broken\',
\'brook\',
\'broth\',
\'brother\',
\'brought\',
\'brown\',
\'bruise\',
\'budded\',
\'build\',
\'builded\',
\'built\',
\'bulls\',
\'bundle\',
\'bundles\',
\'burdens\',
\'buried\',
\'burn\',
\'burning\',
\'burnt\',
\'bury\',
\'buryingplace\',
\'business\',
\'but\',
\'butler\',
\'butlers\',
\'butlership\',
\'butter\',
\'buy\',
\'by\',
\'cakes\',
\'calf\',
\'call\',
\'called\',
\'came\',
\'camel\',
\'camels\',
\'camest\',
\'can\',
\'cannot\',
\'canst\',
\'captain\',
\'captive\',
\'captives\',
\'carcases\',
\'carried\',
\'carry\',
\'cast\',
\'castles\',
\'catt\',
\'cattle\',
\'caught\',
\'cause\',
\'caused\',
\'cave\',
\'cease\',
\'ceased\',
\'certain\',
\'certainly\',
\'chain\',
\'chamber\',
\'change\',
\'changed\',
\'changes\',
\'charge\',
\'charged\',
\'chariot\',
\'chariots\',
\'chesnut\',
\'chi\',
\'chief\',
\'child\',
\'childless\',
\'childr\',
\'children\',
\'chode\',
\'choice\',
\'chose\',
\'circumcis\',
\'circumcise\',
\'circumcised\',
\'citi\',
\'cities\',
\'city\',
\'clave\',
\'clean\',
\'clear\',
\'cleave\',
\'clo\',
\'closed\',
\'clothed\',
\'clothes\',
\'cloud\',
\'clusters\',
\'co\',
\'coat\',
\'coats\',
\'coffin\',
\'cold\',
...]
len(set(text3))
2789
len(text3)/len(set(text3))
16.050197203298673
text3.count('smote')
5
100*text4.count('a')/len(text4)
1.4643016433938312
def lexical_diversity(text):
    # lexical: UK ['leksɪk(ə)l] US ['lɛksɪkl]
    # adj. relating to vocabulary; [linguistics] relating to dictionaries or lexicography
    # diversity: UK [daɪ'vɜːsɪtɪ; dɪ-] US [dɪˈvəsɪti]
    # n. variety; difference
    return len(text)/len(set(text))
def percentage(count, total):
    return 100*count/total
print('Lexical diversity of text3: {}'.format(lexical_diversity(text3)))
print("Percentage of text4 made up of the word 'a': {}".format(percentage(text4.count('a'),len(text4))))
Lexical diversity of text3: 16.050197203298673
Percentage of text4 made up of the word 'a': 1.4643016433938312
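The arithmetic is easier to follow on a sentence small enough to count by hand. This sketch repeats the two helper functions on a made-up token list, so it runs without any NLTK corpus; the sample sentence is an assumption for illustration.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used (total / distinct)."""
    return len(tokens) / len(set(tokens))


def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total


# Hypothetical sample: 11 tokens, 8 of them distinct.
tokens = "the cat sat on the mat and the dog sat too".split()
print(len(tokens), len(set(tokens)))
print(lexical_diversity(tokens))                      # 11 / 8
print(percentage(tokens.count("the"), len(tokens)))   # 'the' occurs 3 times out of 11
```

With this definition a higher value means more repetition: on average each distinct word in text3 is used about 16 times.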
Lists
sent1 = ['Call', 'me', 'Ishmael', '.']
print('Contents of sent1: {}'.format(sent1))
print('Length of sent1: {}'.format(len(sent1)))
print('Lexical diversity of sent1: {}'.format(lexical_diversity(sent1)))
Contents of sent1: ['Call', 'me', 'Ishmael', '.']
Length of sent1: 4
Lexical diversity of sent1: 1.0
sent1, sent2, sent3, sent4  # these lists are predefined by nltk.book
(['Call', 'me', 'Ishmael', '.'],
 ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.'],
 ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.'],
 ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':'])
sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':',
 'Call', 'me', 'Ishmael', '.']
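sent1.append('Some')  # append() changes the list in place and returns None
sent1  # 'Some' appears four times because the append line was evidently run four times
['Call', 'me', 'Ishmael', '.', 'Some', 'Some', 'Some', 'Some']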
Indexing lists
type(text4)
nltk.text.Text
text4[173]
'awaken'
text4.index('awaken')
173
text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because',
 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']
text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in',
 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']
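The same indexing and slicing rules apply to any Python list, so they can be tried without loading a corpus. The short token list below is made up for illustration.

```python
# Hypothetical token list; positions are counted from 0.
tokens = ["We", "take", "it", "in", "turns", "to", "act"]

print(tokens[2])               # the item at position 2
print(tokens.index("turns"))   # position of the first occurrence of "turns"
print(tokens[1:4])             # items at positions 1, 2 and 3; position 4 is excluded
print(tokens[-2:])             # the last two items
```

A slice `tokens[m:n]` always contains n - m items when both bounds are inside the list, which is why text5[16715:16735] returns 20 tokens and text6[1600:1625] returns 25.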
Variables
sent1 = ['Call', 'me', 'Ishmael', '.']
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode', 'forth', 'from', 'Camelot', '.']
noun_phrase = my_sent[1:4]
print('Sliced list: noun_phrase -> {}'.format(noun_phrase))
wOrDs = sorted(noun_phrase)
print('Sorted list: wOrDs -> {}'.format(wOrDs))
Sliced list: noun_phrase -> ['bold', 'Sir', 'Robin']
Sorted list: wOrDs -> ['Robin', 'Sir', 'bold']
Strings
name = 'bright'
print('First letter of name: {}'.format(name[0]))
print(name[:4])
print(name*2)
print(name + '!')
First letter of name: b
brig
brightbright
bright!
' '.join(['Monty', 'Python'])
'Monty Python'
'Monty Python'.split()
['Monty', 'Python']
saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']
tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]
['said', 'than']
fdist1 = FreqDist(text1)
vocabulary1 = fdist1.keys()
type(vocabulary1)
dict_keys
fdist1.plot(50, cumulative=True)
# Cumulative frequency plot for the 50 most frequently used words in Moby Dick, which
# account for nearly half of the tokens.
fdist1.hapaxes()  # the words that occur once only
[\'Herman\', \'Melville\',
\']\',
\'ETYMOLOGY\',
\'Late\',
\'Consumptive\',
\'School\',
\'threadbare\',
\'lexicons\',
\'mockingly\',
\'flags\',
\'mortality\',
\'signification\',
\'HACKLUYT\',
\'Sw\',
\'HVAL\',
\'roundness\',
\'Dut\',
\'Ger\',
\'WALLEN\',
\'WALW\',
\'IAN\',
\'RICHARDSON\',
\'KETOS\',
\'GREEK\',
\'CETUS\',
\'LATIN\',
\'WHOEL\',
\'ANGLO\',
\'SAXON\',
\'WAL\',
\'HWAL\',
\'SWEDISH\',
\'ICELANDIC\',
\'BALEINE\',
\'BALLENA\',
\'FEGEE\',
\'ERROMANGOAN\',
\'Librarian\',
\'painstaking\',
\'burrower\',
\'grub\',
\'Vaticans\',
\'stalls\',
\'higgledy\',
\'piggledy\',
\'gospel\',
\'promiscuously\',
\'commentator\',
\'belongest\',
\'sallow\',
\'Pale\',
\'Sherry\',
\'loves\',
\'bluntly\',
\'Subs\',
\'thankless\',
\'Hampton\',
\'Court\',
\'hie\',
\'refugees\',
\'pampered\',
\'Michael\',
\'Raphael\',
\'unsplinterable\',
\'GENESIS\',
\'JOB\',
\'JONAH\',
\'punish\',
\'ISAIAH\',
\'soever\',
\'cometh\',
\'incontinently\',
\'perisheth\',
\'PLUTARCH\',
\'MORALS\',
\'breedeth\',
\'Whirlpooles\',
\'Balaene\',
\'arpens\',
\'PLINY\',
\'Scarcely\',
\'TOOKE\',
\'LUCIAN\',
\'TRUE\',
\'catched\',
\'OCTHER\',
\'VERBAL\',
\'TAKEN\',
\'MOUTH\',
\'ALFRED\',
\'890\',
\'gudgeon\',
\'retires\',
\'MONTAIGNE\',
\'APOLOGY\',
\'RAIMOND\',
\'SEBOND\',
\'Nick\',
\'RABELAIS\',
\'cartloads\',
\'STOWE\',
\'ANNALS\',
\'LORD\',
\'BACON\',
\'Touching\',
\'ork\',
\'DEATH\',
\'sovereignest\',
\'bruise\',
\'HAMLET\',
\'leach\',
\'Mote\',
\'availle\',
\'returne\',
\'againe\',
\'worker\',
\'Dinting\',
\'paine\',
\'thro\',
\'maine\',
\'FAERIE\',
\'Immense\',
\'til\',
\'DAVENANT\',
\'PREFACE\',
\'GONDIBERT\',
\'spermacetti\',
\'Hosmannus\',
\'Nescio\',
\'VIDE\',
\'Spencer\',
\'Talus\',
\'flail\',
\'threatens\',
\'jav\',
\'lins\',
\'WALLER\',
\'SUMMER\',
\'ISLANDS\',
\'Commonwealth\',
\'Civitas\',
\'OPENING\',
\'SENTENCE\',
\'HOBBES\',
\'LEVIATHAN\',
\'Silly\',
\'Mansoul\',
\'chewing\',
\'sprat\',
\'PILGRIM\',
\'PROGRESS\',
\'Created\',
\'PARADISE\',
\'LOST\',
\'---"\',
\'Hugest\',
\'Stretched\',
\'Draws\',
\'FULLLER\',
\'PROFANE\',
\'HOLY\',
\'STATE\',
\'DRYDEN\',
\'ANNUS\',
\'MIRABILIS\',
\'aground\',
\'EDGE\',
\'TEN\',
\'SPITZBERGEN\',
\'PURCHAS\',
\'wantonness\',
\'fuzzing\',
\'vents\',
\'HERBERT\',
\'INTO\',
\'ASIA\',
\'AFRICA\',
\'SCHOUTEN\',
\'SIXTH\',
\'CIRCUMNAVIGATION\',
\'Elbe\',
\'ducat\',
\'herrings\',
\'GREENLAND\',
\'Several\',
\'Fife\',
\'Anno\',
\'1652\',
\'Pitferren\',
\'SIBBALD\',
\'FIFE\',
\'KINROSS\',
\'Myself\',
\'Sperma\',
\'ceti\',
\'fierceness\',
\'RICHARD\',
\'STRAFFORD\',
\'LETTER\',
\'BERMUDAS\',
\'PHIL\',
\'TRANS\',
\'1668\',
\'PRIMER\',
\'COWLEY\',
\'1729\',
\'"...\',
\'frequendy\',
\'insupportable\',
\'disorder\',
\'ULLOA\',
\'SOUTH\',
\'AMERICA\',
\'sylphs\',
\'petticoat\',
\'Oft\',
\'Tho\',
\'RAPE\',
\'LOCK\',
\'NAT\',
\'wales\',
\'JOHNSON\',
\'COOK\',
\'dung\',
\'lime\',
\'juniper\',
\'UNO\',
\'VON\',
\'TROIL\',
\'LETTERS\',
\'BANKS\',
\'SOLANDER\',
\'1772\',
\'Nantuckois\',
\'JEFFERSON\',
\'MEMORIAL\',
\'MINISTER\',
\'REFERENCE\',
\'PARLIAMENT\',
\'SOMEWHERE\',
\'guarding\',
\'protecting\',
\'robbers\',
\'BLACKSTONE\',
\'Rodmond\',
\'suspends\',
\'attends\',
\'FALCONER\',
\'Bright\',
\'roofs\',
\'domes\',
\'rockets\',
\'Around\',
\'unwieldy\',
\'COWPER\',
\'VISIT\',
\'LONDON\',
\'HUNTER\',
\'DISSECTION\',
\'SMALL\',
\'SIZED\',
\'aorta\',
\'gushing\',
\'PALEY\',
\'THEOLOGY\',
\'mammiferous\',
\'hind\',
\'BARON\',
\'CUVIER\',
\'COLNETT\',
\'PURPOSE\',
\'EXTENDING\',
\'SPERMACETI\',
\'Floundered\',
\'chace\',
\'peopling\',
\'Gather\',
\'Led\',
\'instincts\',
\'trackless\',
\'Assaulted\',
\'voracious\',
\'spiral\',
\'MONTGOMERY\',
\'WORLD\',
\'FLOOD\',
\'Paean\',
\'fatter\',
\'Flounders\',
\'CHARLES\',
\'LAMB\',
\'TRIUMPH\',
\'1690\',
\'OBED\',
\'Susan\',
\'HAWTHORNE\',
\'TWICE\',
\'bespeak\',
\'raal\',
\'COOPER\',
\'PILOT\',
\'Berlin\',
\'Gazette\',
\'ECKERMANN\',
\'CONVERSATIONS\',
\'GOETHE\',
\'ESSEX\',
\'WAS\',
\'ATTACKED\',
\'FINALLY\',
\'DESTROYED\',
\'OWEN\',
\'CHACE\',
\'FIRST\',
\'SAID\',
\'VESSEL\',
\'YORK\',
\'1821\',
\'piping\',
\'dimmed\',
\'phospher\',
\'ELIZABETH\',
\'OAKES\',
\'SMITH\',
\'amounted\',
\'440\',
\'SCORESBY\',
\'Mad\',
\'agonies\',
\'endures\',
\'infuriated\',
\'rears\',
\'snaps\',
\'propelled\',
\'observers\',
\'opportunities\',
\'habitudes\',
\'BEALE\',
\'offensively\',
\'artful\',
\'mischievous\',
\'FREDERICK\',
\'DEBELL\',
\'1840\',
\'October\',
\'Raise\',
\'ay\',
\'THAR\',
\'bowes\',
\'os\',
\'ROSS\',
\'ETCHINGS\',
\'CRUIZE\',
\'1846\',
\'Globe\',
\'transactions\',
\'relate\',
\'HUSSEY\',
\'SURVIVORS\',
\'parried\',
\'MISSIONARY\',
\'JOURNAL\',
\'TYERMAN\',
\'boldest\',
\'persevering\',
\'REPORT\',
\'DANIEL\',
\'SPEECH\',
\'SENATE\',
\'APPLICATION\',
\'ERECTION\',
\'BREAKWATER\',
\'CAPTORS\',
\'WHALEMAN\',
\'ADVENTURES\',
\'BIOGRAPHY\',
\'GATHERED\',
\'HOMEWARD\',
\'COMMODORE\',
\'PREBLE\',
\'REV\',
\'CHEEVER\',
\'MUTINEER\',
\'BROTHER\',
\'ANOTHER\',
\'MCCULLOCH\',
\'COMMERCIAL\',
\'reciprocal\',
\'clews\',
\'SOMETHING\',
\'UNPUBLISHED\',
\'CURRENTS\',
\'Pedestrians\',
\'recollect\',
\'gateways\',
\'VOYAGER\',
\'ARCTIC\',
\'NEWSPAPER\',
\'TAKING\',
\'RETAKING\',
\'HOBOMACK\',
\'MIRIAM\',
\'FISHERMAN\',
\'appliance\',
\'RIBS\',
\'TRUCKS\',
\'Terra\',
\'Del\',
\'Fuego\',
\'DARWIN\',
\'NATURALIST\',
";--\'",
\'!\\'"\',
\'WHARTON\',
\'Loomings\',
\'spleen\',
\'regulating\',
\'circulation\',
\'Whenever\',
\'drizzly\',
\'hypos\',
\'philosophical\',
\'Cato\',
\'Manhattoes\',
\'reefs\',
\'downtown\',
\'gazers\',
\'Circumambulate\',
\'Corlears\',
\'Coenties\',
\'Slip\',
\'Whitehall\',
\'Posted\',
\'sentinels\',
\'spiles\',
\'pier\',
\'lath\',
\'counters\',
\'desks\',
\'loitering\',
\'shady\',
\'Inlanders\',
\'lanes\',
\'alleys\',
\'attract\',
\'dale\',
\'dreamiest\',
\'shadiest\',
\'quietest\',
\'enchanting\',
\'Saco\',
\'crucifix\',
\'Deep\',
\'mazy\',
\'Tiger\',
\'Tennessee\',
\'Rockaway\',
\'Persians\',
\'deity\',
\'Narcissus\',
\'ungraspable\',
\'hazy\',
\'quarrelsome\',
\'offices\',
\'abominate\',
\'toils\',
\'trials\',
\'barques\',
\'schooners\',
\'broiling\',
\'buttered\',
\'judgmatically\',
\'peppered\',
\'reverentially\',
\'idolatrous\',
\'dotings\',
\'ibis\',
\'roasted\',
\'bake\',
\'plumb\',
\'Van\',
\'Rensselaers\',
\'Randolphs\',
\'Hardicanutes\',
\'lording\',
\'tallest\',
\'decoction\',
\'Seneca\',
\'Stoics\',
\'Testament\',
\'promptly\',
\'rub\',
\'infliction\',
\'BEING\',
\'PAID\',
\'urbane\',
\'ills\',
\'monied\',
\'consign\',
\'prevalent\',
\'violate\',
\'Pythagorean\',
\'commonalty\',
\'police\',
\'surveillance\',
\'programme\',
\'solo\',
\'CONTESTED\',
\'ELECTION\',
\'PRESIDENCY\',
\'UNITED\',
\'STATES\',
\'ISHMAEL\',
\'BLOODY\',
\'AFFGHANISTAN\',
\'managers\',
\'genteel\',
\'comedies\',
\'farces\',
\'cunningly\',
\'disguises\',
\'cajoling\',
\'unbiased\',
\'freewill\',
\'discriminating\',
\'overwhelming\',
\'undeliverable\',
\'itch\',
\'forbidden\',
\'ignoring\',
\'lodges\',
\'Carpet\',
\'Bag\',
\'Manhatto\',
\'candidates\',
\'penalties\',
\'Tyre\',
\'Carthage\',
\'imported\',
\'cobblestones\',
\'bitingly\',
\'shouldering\',
\'price\',
\'fervent\',
\'asphaltic\',
\'pavement\',
\'flinty\',
\'projections\',
\'soles\',
\'Too\',
\'cheapest\',
\'cheeriest\',
\'invitingly\',
\'particles\',
\'peer\',
\'Angel\',
\'Doom\',
\'wailing\',
\'gnashing\',
\'Wretched\',
\'entertainment\',
\'Moving\',
\'emigrant\',
\'poverty\',
\'creak\',
\'lodgings\',
\'zephyr\',
\'hob\',
\'toasting\',
\'observest\',
\'sashless\',
\'glazier\',
\'reasonest\',
\'chinks\',
\'crannies\',
\'lint\',
\'chattering\',
\'shiverings\',
\'cob\',
\'redder\',
\'Orion\',
\'glitters\',
\'conservatories\',
\'president\',
\'temperance\',
\'blubbering\',
\'straggling\',
\'wainscots\',
\'reminding\',
\'oilpainting\',
\'besmoked\',
\'defaced\',
\'unequal\',
\'crosslights\',
\'hags\',
\'delineate\',
\'bewitched\',
\'ponderings\',
\'boggy\',
\'soggy\',
\'squitchy\',
\'froze\',
\'heath\',
\'icebound\',
\'represents\',
\'Horner\',
\'foundered\',
\'clubs\',
\'harvesting\',
\'hacking\',
\'horrifying\',
\'Mixed\',
\'Nathan\',
\'Swain\',
\'corkscrew\',
\'Blanco\',
\'sojourning\',
\'fireplaces\',
\'duskier\',
\'cockpits\',
\'rarities\',
\'Projecting\',
\'Within\',
\'shelves\',
\'flasks\',
\'bustles\',
\'deliriums\',
\'Abominable\',
\'tumblers\',
\'cylinders\',
\'goggling\',
\'deceitfully\',
\'tapered\',
\'Parallel\',
\'pecked\',
\'footpads\',
\'Fill\',
\'shilling\',
\'examining\',
\'SKRIMSHANDER\',
\'accommodated\',
\'unoccupied\',
\'haint\',
\'pose\',
\'whalin\',
\'decidedly\',
\'objectionable\',
\'wander\',
\'Battery\',
\'ruminating\',
\'adorning\',
\'potatoes\',
\'sartainty\',
\'diabolically\',
\'steaks\',
\'undress\',
\'looker\',
\'rioting\',
\'Grampus\',
\'seed\',
\'Feegees\',
\'tramping\',
\'Enveloped\',
\'bedarned\',
\'eruption\',
\'officiating\',
\'brimmers\',
\'complained\',
\'potion\',
\'colds\',
\'catarrhs\',
\'liquor\',
\'arrantest\',
\'topers\',
\'obstreperously\',
\'aloof\',
\'desirous\',
\'hilarity\',
\'coffer\',
\'Southerner\',
\'mountaineers\',
\'Alleghanian\',
\'missed\',
\'supernaturally\',
\'congratulate\',
\'multiply\',
\'bachelor\',
\'abominated\',
\'tidiest\',
\'bedwards\',
\'shan\',
\'tablecloth\',
\'Skrimshander\',
\'bump\',
\'spraining\',
\'eider\',
\'yoking\',
\'rickety\',
\'whirlwinds\',
\'knockings\',
\'dismissed\',
\'popped\',
\'cherishing\',
\'chuckled\',
\'chuckle\',
\'mightily\',
\'catches\',
\'bamboozingly\',
\'overstocked\',
\'toothpick\',
\'rayther\',
\'BROWN\',
\'slanderin\',
\'farrago\',
\'BROKE\',
\'Sartain\',
\'Mt\',
\'Hecla\',
\'persist\',
\'mystifying\',
\'unsay\',
\'criminal\',
\'Wall\',
\'purty\',
\'sarmon\',
\'rips\',
\'tellin\',
\'bought\',
\'balmed\',
\'curios\',
\'sellin\',
\'inions\',
\'fooling\',
\'idolators\',
\'Depend\',
\'reg\',
\'lar\',
\'spliced\',
\'Johnny\',
\'sprawling\',
\'Arter\',
\'glim\',
\'jiffy\',
\'irresolute\',
\'vum\',
\'WON\',
\'Folding\',
\'scrutiny\',
\'porcupine\',
\'moccasin\',
\'ponchos\',
\'parade\',
\'rainy\',
\'remembering\',
\'commended\',
\'cobs\',
\'Nod\',
\'footfall\',
\'unlacing\',
\'blackish\',
\'plasters\',
\'inkling\',
\'Placing\',
\'crammed\',
\'scalp\',
\'mildewed\',
\'Ignorance\',
\'parent\',
\'nonplussed\',
\'undressing\',
\'checkered\',
\'Thirty\',
\'frogs\',
\'quaked\',
\'wrapall\',
\'dreadnaught\',
\'fumbled\',
\'Remembering\',
\'manikin\',
\'tenpin\',
\'andirons\',
\'jambs\',
\'bricks\',
\'appropriate\',
\'applying\',
\'hastier\',
\'withdrawals\',
\'antics\',
\'devotee\',
\'extinguishing\',
\'unceremoniously\',
\'bagged\',
\'sportsman\',
\'woodcock\',
\'uncomfortableness\',
\'deliberating\',
\'puffed\',
\'sang\',
\'Stammering\',
\'conjured\',
\'responses\',
\'debel\',
\'flourishing\',
\'Angels\',
\'flourishings\',
\'peddlin\',
\'sleepe\',
\'grunted\',
\'gettee\',
\'motioning\',
\'comely\',
\'insured\',
\'Counterpane\',
\'parti\',
\'triangles\',
\'interminable\',
\'caper\',
\'supperless\',
\'21st\',
\'hemisphere\',
\'sigh\',
\'Sixteen\',
\'ached\',
\'coaches\',
\'stockinged\',
\'slippering\',
\'misbehaviour\',
\'unendurable\',
\'stepmothers\',
\'misfortunes\',
\'steeped\',
\'shudderingly\',
\'confounding\',
\'soberly\',
\'recurred\',
\'predicament\',
\'unlock\',
\'bridegroom\',
\'clasp\',
\'hugged\',
\'rouse\',
\'snore\',
\'scratch\',
\'Throwing\',
\'expostulations\',
\'unbecomingness\',
\'matrimonial\',
\'dawning\',
\'overture\',
\'innate\',
\'compliment\',
\'civility\',
\'rudeness\',
\'toilette\',
\'dressing\',
\'donning\',
\'gaspings\',
\'booting\',
\'caterpillar\',
\'outlandishness\',
\'manners\',
\'education\',
\'undergraduate\',
\'dreamt\',
\'cowhide\',
\'pinched\',
\'curtains\',
\'indecorous\',
\'contented\',
\'restricting\',
\'donned\',
\'lathering\',
\'unsheathes\',
\'whets\',
\'Rogers\',
\'cutlery\',
\'Afterwards\',
\'baton\',
\'Breakfast\',
\'pleasantly\',
\'bountifully\',
\'laughable\',
\'bosky\',
\'unshorn\',
\'gowns\',
\'toasted\',
\'lingers\',
\'tarried\',
\'barred\',
\'Grub\',
\'Park\',
\'assurance\',
\'polish\',
\'occasioned\',
\'embarrassed\',
\'bashfulness\',
\'duelled\',
\'winking\',
\'tastes\',
\'sheepishly\',
\'bashful\',
\'icicle\',
\'admirer\',
\'cordially\',
\'grappling\',
\'genteelly\',
\'eschewed\',
\'undivided\',
\'6\',
\'circulating\',
\'nondescripts\',
\'Chestnut\',
\'jostle\',
\'Regent\',
\'Lascars\',
\'Bombay\',
\'Apollo\',
\'Feegeeans\',
\'Tongatobooarrs\',
\'Erromanggoans\',
\'Pannangians\',
\'Brighggians\',
\'weekly\',
\'Vermonters\',
\'stalwart\',
\'frames\',
\'felled\',
\'strutting\',
\'wester\',
\'bombazine\',
\'cloak\',
\'mow\',
\'gloves\',
\'joins\',
\'outfit\',
\'waistcoats\',
\'Hay\',
\'Seed\',
\'tract\',
\'dearest\',
\'pave\',
\'eggs\',
\'patrician\',
\'parks\',
\'scraggy\',
\'scoria\',
\'Herr\',
\'dowers\',
\'nieces\',
\'reservoirs\',
\'maples\',
\'bountiful\',
\'proffer\',
\'passer\',
\'cones\',
\'blossoms\',
\'superinduced\',
\'carnation\',
\'Salem\',
\'sweethearts\',
\'Puritanic\',
\'Whaleman\',
\'Wrapping\',
\'Each\',
\'quote\',
\'TALBOT\',
\'Near\',
\'Desolation\',
\'1st\',
\'SISTER\',
\'ROBERT\',
\'WILLIS\',
\'ELLERY\',
\'NATHAN\',
\'COLEMAN\',
\'WALTER\',
\'CANNY\',
\'SETH\',
\'GLEIG\',
\'Forming\',
\'ELIZA\',
\'31st\',
\'MARBLE\',
\'SHIPMATES\',
\'EZEKIEL\',
\'HARDY\',
\'AUGUST\',
\'3d\',
\'1833\',
\'WIDOW\',
\'Shaking\',
\'glazed\',
\'Affected\',
\'relatives\',
\'unhealing\',
\'sympathetically\',
\'wounds\',
\'bleed\',
\'blanks\',
...]
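The ideas behind the frequency distribution, the cumulative plot and hapaxes can be reproduced on a tiny example with the standard library's `collections.Counter`. This is a sketch rather than NLTK's FreqDist itself; the helper names `hapaxes` and `cumulative_share` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def hapaxes(counts):
    """Return the tokens that occur exactly once, in first-seen order."""
    return [token for token, n in counts.items() if n == 1]


def cumulative_share(counts, top_n):
    """Fraction of all tokens covered by the top_n most frequent tokens."""
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total


# Hypothetical sample: 8 tokens, with 'the' used 3 times and 'and' twice.
tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)
print(counts.most_common(2))         # the two most frequent tokens with their counts
print(hapaxes(counts))               # tokens used only once
print(cumulative_share(counts, 2))   # share of the text covered by the top two tokens
```

The cumulative share is what the cumulative plot displays: for Moby Dick, the 50 most frequent words cover nearly half of all tokens, while the long list above shows how many words appear only once.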
Fine-grained selection of words
- The set of all w such that w is an element of V (the vocabulary) and w has property P:
\(\{w \mid w \in V \text{ and } P(w)\}\)
- The corresponding Python expression is:
[w for w in V if p(w)]
V = set(text1)
long_words = [w for w in V if len(w)>15]
sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
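The property P(w) can be any condition, and conditions can be combined. The sketch below applies the same pattern to a made-up sentence, first with a length test and then with length and frequency together. The sample text and the thresholds are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
tokens = ("the circumnavigation was long and the circumnavigation was "
          "uncomfortable but characteristically brave").split()
vocabulary = set(tokens)
counts = Counter(tokens)

# [w for w in V if P(w)] with P(w) meaning "longer than 12 characters".
long_words = sorted(w for w in vocabulary if len(w) > 12)

# Two conditions combined: long words that also occur more than once.
frequent_long_words = sorted(w for w in vocabulary if len(w) > 12 and counts[w] > 1)

print(long_words)
print(frequent_long_words)
```

Filtering on length alone tends to return rare words, so adding a frequency condition is one way to pick out long words that are also characteristic of a text.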
This article is drawn from the book Natural Language Processing with Python.