Natural Language Processing in Python 3 (NLTK): Language Big Data


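This article gives a brief introduction to natural language processing with Python's NLTK library.

NLTK

NLTK is a Python library for processing text. Written knowledge adds up to an enormous volume of data, which is why this post belongs to the big data series.

This article only scratches the surface. I have not worked in this area yet and wrote it out of occasional curiosity.

Loading everything from NLTK's book module

  • The book module bundles all the sample texts and sentences used below. If the data is not installed yet, nltk.download('book') fetches it.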

from nltk.book import *

*** Introductory Examples for the NLTK Book ***

Loading text1, ..., text9 and sent1, ..., sent9

Type the name of the text or sentence to view it.

Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851

text2: Sense and Sensibility by Jane Austen 1811

text3: The Book of Genesis

text4: Inaugural Address Corpus

text5: Chat Corpus

text6: Monty Python and the Holy Grail

text7: Wall Street Journal

text8: Personals Corpus

text9: The Man Who Was Thursday by G . K . Chesterton 1908

text1

<Text: Moby Dick by Herman Melville 1851>

text2

<Text: Sense and Sensibility by Jane Austen 1811>

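Searching text and topics

  1. concordance finds every occurrence of a word in the text and prints each one with its surrounding context.
  2. similar lists words that occur in contexts similar to those of the search word; the same idea can support relevance matching in a search engine.
  3. common_contexts lists the contexts shared by two or more words.
  4. dispersion_plot draws a lexical dispersion plot showing where words occur across the text.

text1.concordance('monstrous')  # look up the word 'monstrous' in text1

# concordance

# UK [kən'kɔːd(ə)ns] US [kən'kɔrdns]

# n. harmony, agreement; an index of words used; an index of the important words in a work or in an author's collected works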

Displaying 11 of 11 matches:

ong the former , one was of a most monstrous size . ... This came towards us ,

ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r

ll over with a heathenish array of monstrous clubs and spears . Some were thick

d as you gazed , and wondered what monstrous cannibal and savage could ever hav

that has survived the flood ; most monstrous and most mountainous ! That Himmal

they might scout at Moby Dick as a monstrous fable , or still worse and more de

th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l

ing Scenes . In connexion with the monstrous pictures of whales , I am strongly

ere to enter upon those still more monstrous stories of them which are to be fo

ght have been rummaged out of this monstrous cabinet there is no telling . But

of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

text2.concordance('affection')

Displaying 25 of 79 matches:

, however , and , as a mark of his affection for the three girls , he left them

t . It was very well known that no affection was ever supposed to exist between

deration of politeness or maternal affection on the side of the former , the tw

d the suspicion -- the hope of his affection for me may warrant , without impru

hich forbade the indulgence of his affection . She knew that his mother neither

rd she gave one with still greater affection . Though her late conversation wit

can never hope to feel or inspire affection again , and if her home be uncomfo

m of the sense , elegance , mutual affection , and domestic comfort of the fami

, and which recommended him to her affection beyond every thing else . His soci

ween the parties might forward the affection of Mr . Willoughby , an equally st

the most pointed assurance of her affection . Elinor could not be surprised at

he natural consequence of a strong affection in a young and ardent mind . This

opinion . But by an appeal to her affection for her mother , by representing t

every alteration of a place which affection had established as perfect with hi

e will always have one claim of my affection , which no other can possibly shar

f the evening declared at once his affection and happiness . " Shall we see you

ause he took leave of us with less affection than his usual behaviour has shewn

ness ." " I want no proof of their affection ," said Elinor ; " but of their en

onths , without telling her of his affection ;-- that they should part without

ould be the natural result of your affection for her . She used to be all unres

distinguished Elinor by no mark of affection . Marianne saw and listened with i

th no inclination for expense , no affection for strangers , no profession , an

till distinguished her by the same affection which once she had felt no doubt o

al of her confidence in Edward ' s affection , to the remembrance of every mark

was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if
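To make the concordance output above more concrete, here is a minimal pure-Python sketch of the idea: scan the tokens, and for every match keep a few tokens of context on each side. It is not NLTK's implementation; the function name, the `width` parameter, and the sample sentence are assumptions made for illustration. The match is case-insensitive because the output above also lists "Monstrous".

```python
def concordance(tokens, word, width=3):
    """Return (left context, keyword, right context) for each occurrence of `word`."""
    word = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word:
            left = tokens[max(0, i - width):i]      # up to `width` tokens before the match
            right = tokens[i + 1:i + 1 + width]     # up to `width` tokens after the match
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines


# Hypothetical sample text, not taken from the NLTK corpus files.
tokens = "a most monstrous size and the Monstrous Pictures of Whales".split()
hits = concordance(tokens, "monstrous")
for left, keyword, right in hits:
    print(f"{left:>20} {keyword} {right}")
```

NLTK aligns the keyword in a fixed-width column of characters; this sketch counts context in tokens instead, which keeps the code short.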

text1.similar('monstrous')

true contemptible christian abundant few part mean careful puzzled

mystifying passing curious loving wise doleful gamesome singular

delightfully perilous fearless

text2.similar('monstrous')

very so exceedingly heartily a as good great extremely remarkably

sweet vast amazingly

text2.common_contexts(['monstrous', 'very'])

a_pretty am_glad a_lucky is_pretty be_glad
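Each result such as a_pretty stands for a context: the word before and the word after the search word. The sketch below shows one simple way to compute shared contexts in plain Python. It is an illustration under assumptions (the helper names and the sample sentence are hypothetical), not the scoring NLTK actually uses for similar or common_contexts.

```python
from collections import defaultdict


def contexts(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word.lower()].add((prev.lower(), nxt.lower()))
    return ctx


def common_contexts(tokens, w1, w2):
    """Return the sorted contexts in which both w1 and w2 occur."""
    ctx = contexts(tokens)
    return sorted(ctx[w1.lower()] & ctx[w2.lower()])


# Hypothetical sample text built to share two contexts.
tokens = ("a monstrous pretty girl and a very pretty house "
          "I am very glad and am monstrous glad").split()
shared = common_contexts(tokens, "monstrous", "very")
print(["_".join(pair) for pair in shared])
```

Words that share many contexts tend to play a similar role in the text, which is the intuition behind the similar output above.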

# Show where each word occurs in the text, measured as an offset in words from the beginning.

# Each stripe represents an instance of a word,

# and each row represents the entire text.

text4.dispersion_plot(['citizens', 'democracy', 'freedom', 'duties', 'America', 'liberty'])

# dispersion

# UK [dɪ'spɜːʃ(ə)n] US [dɪ'spɝʒn]

# n. scattering; (statistics, mathematics) dispersion, deviation; dispersal
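The data behind a dispersion plot is simply the list of token positions at which each target word occurs. The sketch below computes those offsets without drawing anything. The function name and the sample sentence are assumptions for illustration, and the matching here is case-sensitive.

```python
def dispersion_offsets(tokens, words):
    """For each target word, list the token positions at which it occurs."""
    targets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in targets:
            targets[tok].append(position)
    return targets


# Hypothetical sample text, not the Inaugural Address Corpus.
tokens = "the citizens of America love liberty and the duties of citizens".split()
offsets = dispersion_offsets(tokens, ["citizens", "liberty", "freedom"])
for word, positions in offsets.items():
    print(f"{word:>10}: {positions}")
```

A word that never occurs, such as a misspelled target, yields an empty list, which would show up as an empty row in the plot.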

print(text3.generate('monstrous'))  # returned None here; generate() behaves differently across NLTK versions

None

Counting vocabulary

len(text3)

44764

sorted(set(text3))

['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham',
 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah', 'Adullamite', 'After', 'Aholibamah',
 'Ahuzzath', 'Ajah', 'Akan', 'All', 'Allonbachuth', 'Almighty', 'Almodad', 'Also', 'Alvah', 'Alvan',
 'Am', 'Amal', 'Amalek', 'Amalekites', 'Ammon', 'Amorite', 'Amorites', 'Amraphel', 'An', 'Anah',
 'Anamim', 'And', 'Aner', 'Angel', 'Appoint', 'Aram', 'Aran', 'Ararat', 'Arbah', 'Ard',
 'Are', 'Areli', 'Arioch', 'Arise', 'Arkite', 'Arodi', 'Arphaxad', 'Art', 'Arvadite', 'As',
 'Asenath', 'Ashbel', 'Asher', 'Ashkenaz', 'Ashteroth', 'Ask', 'Asshur', 'Asshurim', 'Assyr', 'Assyria',
 'At', 'Atad', 'Avith', 'Baalhanan', 'Babel', 'Bashemath', 'Be', 'Because', 'Becher', 'Bedad',
 'Beeri', 'Beerlahairoi', 'Beersheba', 'Behold', 'Bela', 'Belah', 'Benam', 'Benjamin', 'Beno', 'Beor',
 'Bera', 'Bered', 'Beriah', 'Bethel', 'Bethlehem', 'Bethuel', 'Beware', 'Bilhah', 'Bilhan', 'Binding',
 'Birsha', 'Bless', 'Blessed', 'Both', 'Bow', 'Bozrah', 'Bring', 'But', 'Buz', 'By',
 'Cain', 'Cainan', 'Calah', 'Calneh', 'Can', 'Cana', 'Canaan', 'Canaanite', 'Canaanites', 'Canaanitish',
 'Caphtorim', 'Carmi', 'Casluhim', 'Cast', 'Cause', 'Chaldees', 'Chedorlaomer', 'Cheran', 'Cherubims', 'Chesed',
 'Chezib', 'Come', 'Cursed', 'Cush', 'Damascus', 'Dan', 'Day', 'Deborah', 'Dedan', 'Deliver',
 'Diklah', 'Din', 'Dinah', 'Dinhabah', 'Discern', 'Dishan', 'Dishon', 'Do', 'Dodanim', 'Dothan',
 'Drink', 'Duke', 'Dumah', 'Earth', 'Ebal', 'Eber', 'Edar', 'Eden', 'Edom', 'Edomites',
 'Egy', 'Egypt', 'Egyptia', 'Egyptian', 'Egyptians', 'Ehi', 'Elah', 'Elam', 'Elbethel', 'Eldaah',
 'EleloheIsrael', 'Eliezer', 'Eliphaz', 'Elishah', 'Ellasar', 'Elon', 'Elparan', 'Emins', 'En', 'Enmishpat',
 'Eno', 'Enoch', 'Enos', 'Ephah', 'Epher', 'Ephra', 'Ephraim', 'Ephrath', 'Ephron', 'Er',
 'Erech', 'Eri', 'Es', 'Esau', 'Escape', 'Esek', 'Eshban', 'Eshcol', 'Ethiopia', 'Euphrat',
 'Euphrates', 'Eve', 'Even', 'Every', 'Except', 'Ezbon', 'Ezer', 'Fear', 'Feed', 'Fifteen',
 'Fill', 'For', 'Forasmuch', 'Forgive', 'From', 'Fulfil', 'G', 'Gad', 'Gaham', 'Galeed',
 'Gatam', 'Gather', 'Gaza', 'Gentiles', 'Gera', 'Gerar', 'Gershon', 'Get', 'Gether', 'Gihon',
 'Gilead', 'Girgashites', 'Girgasite', 'Give', 'Go', 'God', 'Gomer', 'Gomorrah', 'Goshen', 'Guni',
 'Hadad', 'Hadar', 'Hadoram', 'Hagar', 'Haggi', 'Hai', 'Ham', 'Hamathite', 'Hamor', 'Hamul',
 'Hanoch', 'Happy', 'Haran', 'Hast', 'Haste', 'Have', 'Havilah', 'Hazarmaveth', 'Hazezontamar', 'Hazo',
 'He', 'Hear', 'Heaven', 'Heber', 'Hebrew', 'Hebrews', 'Hebron', 'Hemam', 'Hemdan', 'Here',
 'Hereby', 'Heth', 'Hezron', 'Hiddekel', 'Hinder', 'Hirah', 'His', 'Hitti', 'Hittite', 'Hittites',
 'Hivite', 'Hobah', 'Hori', 'Horite', 'Horites', 'How', 'Hul', 'Huppim', 'Husham', 'Hushim',
 'Huz', 'I', 'If', 'In', 'Irad', 'Iram', 'Is', 'Isa', 'Isaac', 'Iscah',
 'Ishbak', 'Ishmael', 'Ishmeelites', 'Ishuah', 'Isra', 'Israel', 'Issachar', 'Isui', 'It', 'Ithran',
 'Jaalam', 'Jabal', 'Jabbok', 'Jac', 'Jachin', 'Jacob', 'Jahleel', 'Jahzeel', 'Jamin', 'Japhe',
 'Japheth', 'Jared', 'Javan', 'Jebusite', 'Jebusites', 'Jegarsahadutha', 'Jehovahjireh', 'Jemuel', 'Jerah', 'Jetheth',
 'Jetur', 'Jeush', 'Jezer', 'Jidlaph', 'Jimnah', 'Job', 'Jobab', 'Jokshan', 'Joktan', 'Jordan',
 'Joseph', 'Jubal', 'Judah', 'Judge', 'Judith', 'Kadesh', 'Kadmonites', 'Karnaim', 'Kedar', 'Kedemah',
 'Kemuel', 'Kenaz', 'Kenites', 'Kenizzites', 'Keturah', 'Kiriathaim', 'Kirjatharba', 'Kittim', 'Know', 'Kohath',
 'Kor', 'Korah', 'LO', 'LORD', 'Laban', 'Lahairoi', 'Lamech', 'Lasha', 'Lay', 'Leah',
 'Lehabim', 'Lest', 'Let', 'Letushim', 'Leummim', 'Levi', 'Lie', 'Lift', 'Lo', 'Look',
 'Lot', 'Lotan', 'Lud', 'Ludim', 'Luz', 'Maachah', 'Machir', 'Machpelah', 'Madai', 'Magdiel',
 'Magog', 'Mahalaleel', 'Mahalath', 'Mahanaim', 'Make', 'Malchiel', 'Male', 'Mam', 'Mamre', 'Man',
 'Manahath', 'Manass', 'Manasseh', 'Mash', 'Masrekah', 'Massa', 'Matred', 'Me', 'Medan', 'Mehetabel',
 'Mehujael', 'Melchizedek', 'Merari', 'Mesha', 'Meshech', 'Mesopotamia', 'Methusa', 'Methusael', 'Methuselah', 'Mezahab',
 'Mibsam', 'Mibzar', 'Midian', 'Midianites', 'Milcah', 'Mishma', 'Mizpah', 'Mizraim', 'Mizz', 'Moab',
 'Moabites', 'Moreh', 'Moreover', 'Moriah', 'Muppim', 'My', 'Naamah', 'Naaman', 'Nahath', 'Nahor',
 'Naphish', 'Naphtali', 'Naphtuhim', 'Nay', 'Nebajoth', 'Neither', 'Night', 'Nimrod', 'Nineveh', 'Noah',
 'Nod', 'Not', 'Now', 'O', 'Obal', 'Of', 'Oh', 'Ohad', 'Omar', 'On',
 'Onam', 'Onan', 'Only', 'Ophir', 'Our', 'Out', 'Padan', 'Padanaram', 'Paran', 'Pass',
 'Pathrusim', 'Pau', 'Peace', 'Peleg', 'Peniel', 'Penuel', 'Peradventure', 'Perizzit', 'Perizzite', 'Perizzites',
 'Phallu', 'Phara', 'Pharaoh', 'Pharez', 'Phichol', 'Philistim', 'Philistines', 'Phut', 'Phuvah', 'Pildash',
 'Pinon', 'Pison', 'Potiphar', 'Potipherah', 'Put', 'Raamah', 'Rachel', 'Rameses', 'Rebek', 'Rebekah',
 'Rehoboth', 'Remain', 'Rephaims', 'Resen', 'Return', 'Reu', 'Reub', 'Reuben', 'Reuel', 'Reumah',
 'Riphath', 'Rosh', 'Sabtah', 'Sabtech', 'Said', 'Salah', 'Salem', 'Samlah', 'Sarah', 'Sarai',
 'Saul', 'Save', 'Say', 'Se', 'Seba', 'See', 'Seeing', 'Seir', 'Sell', 'Send',
 'Sephar', 'Serah', 'Sered', 'Serug', 'Set', 'Seth', 'Shalem', 'Shall', 'Shalt', 'Shammah',
 'Shaul', 'Shaveh', 'She', 'Sheba', 'Shebah', 'Shechem', 'Shed', 'Shel', 'Shelah', 'Sheleph',
 'Shem', 'Shemeber', 'Shepho', 'Shillem', 'Shiloh', 'Shimron', 'Shinab', 'Shinar', 'Shobal', 'Should',
 'Shuah', 'Shuni', 'Shur', 'Sichem', 'Siddim', 'Sidon', 'Simeon', 'Sinite', 'Sitnah', 'Slay',
 'So', 'Sod', 'Sodom', 'Sojourn', 'Some', 'Spake', 'Speak', 'Spirit', 'Stand', 'Succoth',
 'Surely', 'Swear', 'Syrian', 'Take', 'Tamar', 'Tarshish', 'Tebah', 'Tell', 'Tema', 'Teman',
 'Temani', 'Terah', 'Thahash', 'That', 'The', 'Then', 'There', 'Therefore', 'These', 'They',
 'Thirty', 'This', 'Thorns', 'Thou', 'Thus', 'Thy', 'Tidal', 'Timna', 'Timnah', 'Timnath',
 'Tiras', 'To', 'Togarmah', 'Tola', 'Tubal', 'Tubalcain', 'Twelve', 'Two', 'Unstable', 'Until',
 'Unto', 'Up', 'Upon', 'Ur', 'Uz', 'Uzal', 'We', 'What', 'When', 'Whence',
 'Where', 'Whereas', 'Wherefore', 'Which', 'While', 'Who', 'Whose', 'Whoso', 'Why', 'Wilt',
 'With', 'Woman', 'Ye', 'Yea', 'Yet', 'Zaavan', 'Zaphnathpaaneah', 'Zar', 'Zarah', 'Zeboiim',
 'Zeboim', 'Zebul', 'Zebulun', 'Zemarite', 'Zepho', 'Zerah', 'Zibeon', 'Zidon', 'Zillah', 'Zilpah',
 'Zimran', 'Ziphion', 'Zo', 'Zoar', 'Zohar', 'Zuzims', 'a', 'abated', 'abide', 'able',
 'abode', 'abomination', 'about', 'above', 'abroad', 'absent', 'abundantly', 'accept', 'accepted', 'according',
 'acknowledged', 'activity', 'add', 'adder', 'afar', 'afflict', 'affliction', 'afraid', 'after', 'afterward',
 'afterwards', 'aga', 'again', 'against', 'age', 'aileth', 'air', 'al', 'alive', 'all',
 'almon', 'alo', 'alone', 'aloud', 'also', 'altar', 'altogether', 'always', 'am', 'among',
 'amongst', 'an', 'and', 'angel', 'angels', 'anger', 'angry', 'anguish', 'anointedst', 'anoth',
 'another', 'answer', 'answered', 'any', 'anything', 'appe', 'appear', 'appeared', 'appease', 'appoint',
 'appointed', 'aprons', 'archer', 'archers', 'are', 'arise', 'ark', 'armed', 'arms', 'army',
 'arose', 'arrayed', 'art', 'artificer', 'as', 'ascending', 'ash', 'ashamed', 'ask', 'asked',
 'asketh', 'ass', 'assembly', 'asses', 'assigned', 'asswaged', 'at', 'attained', 'audience', 'avenged',
 'aw', 'awaked', 'away', 'awoke', 'back', 'backward', 'bad', 'bade', 'badest', 'badne',
 'bak', 'bake', 'bakemeats', 'baker', 'bakers', 'balm', 'bands', 'bank', 'bare', 'barr',
 'barren', 'basket', 'baskets', 'battle', 'bdellium', 'be', 'bear', 'beari', 'bearing', 'beast',
 'beasts', 'beautiful', 'became', 'because', 'become', 'bed', 'been', 'befall', 'befell', 'before',
 'began', 'begat', 'beget', 'begettest', 'begin', 'beginning', 'begotten', 'beguiled', 'beheld', 'behind',
 'behold', 'being', 'believed', 'belly', 'belong', 'beneath', 'bereaved', 'beside', 'besides', 'besought',
 'best', 'betimes', 'better', 'between', 'betwixt', 'beyond', 'binding', 'bird', 'birds', 'birthday',
 'birthright', 'biteth', 'bitter', 'blame', 'blameless', 'blasted', 'bless', 'blessed', 'blesseth', 'blessi',
 'blessing', 'blessings', 'blindness', 'blood', 'blossoms', 'bodies', 'boldly', 'bondman', 'bondmen', 'bondwoman',
 'bone', 'bones', 'book', 'booths', 'border', 'borders', 'born', 'bosom', 'both', 'bottle',
 'bou', 'boug', 'bough', 'bought', 'bound', 'bow', 'bowed', 'bowels', 'bowing', 'boys',
 'bracelets', 'branches', 'brass', 'bre', 'breach', 'bread', 'breadth', 'break', 'breaketh', 'breaking',
 'breasts', 'breath', 'breathed', 'breed', 'brethren', 'brick', 'brimstone', 'bring', 'brink', 'broken',
 'brook', 'broth', 'brother', 'brought', 'brown', 'bruise', 'budded', 'build', 'builded', 'built',
 'bulls', 'bundle', 'bundles', 'burdens', 'buried', 'burn', 'burning', 'burnt', 'bury', 'buryingplace',
 'business', 'but', 'butler', 'butlers', 'butlership', 'butter', 'buy', 'by', 'cakes', 'calf',
 'call', 'called', 'came', 'camel', 'camels', 'camest', 'can', 'cannot', 'canst', 'captain',
 'captive', 'captives', 'carcases', 'carried', 'carry', 'cast', 'castles', 'catt', 'cattle', 'caught',
 'cause', 'caused', 'cave', 'cease', 'ceased', 'certain', 'certainly', 'chain', 'chamber', 'change',
 'changed', 'changes', 'charge', 'charged', 'chariot', 'chariots', 'chesnut', 'chi', 'chief', 'child',
 'childless', 'childr', 'children', 'chode', 'choice', 'chose', 'circumcis', 'circumcise', 'circumcised', 'citi',
 'cities', 'city', 'clave', 'clean', 'clear', 'cleave', 'clo', 'closed', 'clothed', 'clothes',
 'cloud', 'clusters', 'co', 'coat', 'coats', 'coffin', 'cold',
 ...]

len(set(text3))

2789

len(text3)/len(set(text3))

16.050197203298673

text3.count('smote')

5

100 * text4.count('a') / len(text4)

1.4643016433938312

def lexical_diversity(text):
    # lexical: UK ['leksɪk(ə)l] US ['lɛksɪkl]; adj. relating to vocabulary; of dictionaries or lexicography
    # diversity: UK [daɪ'vɜːsɪtɪ; dɪ-] US [dɪˈvəsɪti]; n. variety; difference
    # Ratio of total tokens to distinct tokens, i.e. the average number of uses per distinct token.
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

print('Lexical diversity of text3: {}'.format(lexical_diversity(text3)))

print('Percentage of text4 taken up by the word a: {}'.format(percentage(text4.count('a'), len(text4))))

Lexical diversity of text3: 16.050197203298673

Percentage of text4 taken up by the word a: 1.4643016433938312
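The figure of about 16 above means that each distinct token in text3 is used roughly 16 times on average, so a larger value indicates more repetition. Its reciprocal, often called the type-token ratio, is sometimes reported instead. The sketch below contrasts the two on a short sample; the function names and the sample sentence are assumptions chosen for illustration.

```python
def tokens_per_type(tokens):
    """Average number of times each distinct token is used (the ratio used above)."""
    return len(tokens) / len(set(tokens))


def type_token_ratio(tokens):
    """Share of tokens that are distinct; the reciprocal of tokens_per_type."""
    return len(set(tokens)) / len(tokens)


# Hypothetical sample: 10 tokens, 8 of them distinct ('the' occurs three times).
sample = "in the beginning God created the heaven and the earth".split()
print(tokens_per_type(sample))    # 10 / 8
print(type_token_ratio(sample))   # 8 / 10
```

Both numbers depend strongly on text length, so they are best compared between texts of similar size.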

Lists

sent1 = ['Call', 'me', 'Ishmael', '.']

print('Contents of sent1: {}'.format(sent1))

print('Length of sent1: {}'.format(len(sent1)))

print('Lexical diversity of sent1: {}'.format(lexical_diversity(sent1)))

Contents of sent1: ['Call', 'me', 'Ishmael', '.']

Length of sent1: 4

Lexical diversity of sent1: 1.0

sent1, sent2, sent3, sent4  # these lists are predefined by nltk.book

(['Call', 'me', 'Ishmael', '.'],
 ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.'],
 ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.'],
 ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':'])

sent4 + sent1

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':',
 'Call', 'me', 'Ishmael', '.']

sent1.append('Some')  # appends in place; the list shown below evidently reflects four runs of this cell

['Call', 'me', 'Ishmael', '.', 'Some', 'Some', 'Some', 'Some']
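The repeated 'Some' entries above follow from how append works: it changes the list in place and returns None, so every extra run of the cell adds one more element. Concatenation with + builds a new list and leaves the original untouched. A small sketch with hypothetical variable names:

```python
sent = ["Call", "me", "Ishmael", "."]

combined = sent + ["Some"]       # new list; sent itself is unchanged
length_after_concat = len(sent)  # still 4

result = sent.append("Some")     # mutates sent in place and returns None
sent.append("Some")              # running the same statement again adds another copy

print(combined)
print(result)
print(sent)
```

Re-creating the list before appending is one way to make a notebook cell safe to run more than once.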

List indexing

type(text4)

nltk.text.Text

text4[173]

'awaken'

text4.index('awaken')

173

text5[16715:16735]

['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because',
 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']

text6[1600:1625]

['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in',
 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']
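The slices above return 20 and 25 tokens because a slice includes its start index and excludes its stop index, and index returns the position of the first match only. The sketch below checks both rules on an ordinary list; the list contents are made up for illustration.

```python
# Hypothetical token list: 'w0', 'w1', ..., 'w29', plus a repeated 'w12' at the end.
words = [f"w{i}" for i in range(30)] + ["w12"]

chunk = words[10:15]                  # positions 10, 11, 12, 13, 14
first_position = words.index("w12")   # first occurrence only
all_positions = [i for i, w in enumerate(words) if w == "w12"]

print(chunk)
print(first_position)
print(all_positions)
```

The length of a slice is stop minus start (when both lie inside the sequence), which matches 16735 - 16715 = 20 and 1625 - 1600 = 25 above.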

Variables

sent1 = ['Call', 'me', 'Ishmael', '.']

my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode', 'forth', 'from', 'Camelot', '.']

noun_phrase = my_sent[1:4]

print('Sliced list: noun_phrase -> {}'.format(noun_phrase))

wOrDs = sorted(noun_phrase)

print('Sorted list: wOrDs -> {}'.format(wOrDs))

Sliced list: noun_phrase -> ['bold', 'Sir', 'Robin']

Sorted list: wOrDs -> ['Robin', 'Sir', 'bold']
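In the sorted output above, 'bold' comes last because Python compares strings by code point, and every uppercase ASCII letter sorts before every lowercase one. If a case-insensitive order is wanted, passing a key function is one option. This is a small sketch, not part of the original walkthrough:

```python
noun_phrase = ["bold", "Sir", "Robin"]

default_order = sorted(noun_phrase)                # uppercase letters sort first
folded_order = sorted(noun_phrase, key=str.lower)  # compare lowercased copies instead

print(default_order)
print(folded_order)
```

The key function only affects the comparison; the strings in the result keep their original capitalization.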

Strings

name = 'bright'

print('First letter of name: {}'.format(name[0]))

print(name[:4])

print(name * 2)

print(name + '!')

First letter of name: b

brig

brightbright

bright!

' '.join(['Monty', 'Python'])

'Monty Python'

'Monty Python'.split()

['Monty', 'Python']

saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']

tokens = set(saying)

tokens = sorted(tokens)

tokens[-2:]

['said', 'than']
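Converting to a set and sorting, as above, removes duplicates but also discards the order in which the words appeared. When the first-seen order matters, one alternative (not used in the original) is dict.fromkeys, since dictionaries keep insertion order in Python 3.7 and later. A short sketch:

```python
saying = ["After", "all", "is", "said", "and", "done", "more", "is", "said", "than", "done"]

alphabetical = sorted(set(saying))         # unique words, sorted by code point
first_seen = list(dict.fromkeys(saying))   # unique words, in order of first appearance

print(alphabetical[-2:])
print(first_seen)
```

Which form is preferable depends on whether the vocabulary will be looked up alphabetically or read in context.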

fdist1 = FreqDist(text1)

vocabulary1 = fdist1.keys()

type(vocabulary1)

dict_keys

fdist1.plot(50, cumulative=True)

# Cumulative frequency plot for the 50 most frequently used words in Moby Dick, which

# account for nearly half of the tokens.
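The quantity shown by a cumulative frequency plot can be computed directly: add up the counts of the n most frequent tokens and divide by the total number of tokens. The sketch below does this with collections.Counter (which FreqDist builds on in NLTK 3) and also collects the words that occur exactly once. The helper names and the sample sentence are assumptions for illustration.

```python
from collections import Counter


def cumulative_coverage(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent distinct tokens."""
    counts = Counter(tokens)
    top = counts.most_common(n)
    return sum(count for _, count in top) / len(tokens)


def once_only(tokens):
    """Distinct tokens that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [word for word, count in counts.items() if count == 1]


# Hypothetical sample text: 11 tokens, with 'the' used four times.
tokens = "the whale and the sea and the ship saw the whale".split()
print(cumulative_coverage(tokens, 1))
print(cumulative_coverage(tokens, 3))
print(once_only(tokens))
```

Here the three most frequent words cover 8 of the 11 tokens, a small-scale version of the claim that 50 words cover nearly half of Moby Dick.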

fdist1.hapaxes()  # the words that occur only once

['Herman', 'Melville', ']', 'ETYMOLOGY', 'Late', 'Consumptive', 'School', 'threadbare', 'lexicons', 'mockingly',
 'flags', 'mortality', 'signification', 'HACKLUYT', 'Sw', 'HVAL', 'roundness', 'Dut', 'Ger', 'WALLEN',
 'WALW', 'IAN', 'RICHARDSON', 'KETOS', 'GREEK', 'CETUS', 'LATIN', 'WHOEL', 'ANGLO', 'SAXON',
 'WAL', 'HWAL', 'SWEDISH', 'ICELANDIC', 'BALEINE', 'BALLENA', 'FEGEE', 'ERROMANGOAN', 'Librarian', 'painstaking',
 'burrower', 'grub', 'Vaticans', 'stalls', 'higgledy', 'piggledy', 'gospel', 'promiscuously', 'commentator', 'belongest',
 'sallow', 'Pale', 'Sherry', 'loves', 'bluntly', 'Subs', 'thankless', 'Hampton', 'Court', 'hie',
 'refugees', 'pampered', 'Michael', 'Raphael', 'unsplinterable', 'GENESIS', 'JOB', 'JONAH', 'punish', 'ISAIAH',
 'soever', 'cometh', 'incontinently', 'perisheth', 'PLUTARCH', 'MORALS', 'breedeth', 'Whirlpooles', 'Balaene', 'arpens',
 'PLINY', 'Scarcely', 'TOOKE', 'LUCIAN', 'TRUE', 'catched', 'OCTHER', 'VERBAL', 'TAKEN', 'MOUTH',
 'ALFRED', '890', 'gudgeon', 'retires', 'MONTAIGNE', 'APOLOGY', 'RAIMOND', 'SEBOND', 'Nick', 'RABELAIS',
 'cartloads', 'STOWE', 'ANNALS', 'LORD', 'BACON', 'Touching', 'ork', 'DEATH', 'sovereignest', 'bruise',
 'HAMLET', 'leach', 'Mote', 'availle', 'returne', 'againe', 'worker', 'Dinting', 'paine', 'thro',
 'maine', 'FAERIE', 'Immense', 'til', 'DAVENANT', 'PREFACE', 'GONDIBERT', 'spermacetti', 'Hosmannus', 'Nescio',
 'VIDE', 'Spencer', 'Talus', 'flail', 'threatens', 'jav', 'lins', 'WALLER', 'SUMMER', 'ISLANDS',
 'Commonwealth', 'Civitas', 'OPENING', 'SENTENCE', 'HOBBES', 'LEVIATHAN', 'Silly', 'Mansoul', 'chewing', 'sprat',
 'PILGRIM', 'PROGRESS', 'Created', 'PARADISE', 'LOST', '---"', 'Hugest', 'Stretched', 'Draws', 'FULLLER',
 'PROFANE', 'HOLY', 'STATE', 'DRYDEN', 'ANNUS', 'MIRABILIS', 'aground', 'EDGE', 'TEN', 'SPITZBERGEN',
 'PURCHAS', 'wantonness', 'fuzzing', 'vents', 'HERBERT', 'INTO', 'ASIA', 'AFRICA', 'SCHOUTEN', 'SIXTH',
 'CIRCUMNAVIGATION', 'Elbe', 'ducat', 'herrings', 'GREENLAND', 'Several', 'Fife', 'Anno', '1652', 'Pitferren',
 'SIBBALD', 'FIFE', 'KINROSS', 'Myself', 'Sperma', 'ceti', 'fierceness', 'RICHARD', 'STRAFFORD', 'LETTER',
 'BERMUDAS', 'PHIL', 'TRANS', '1668', 'PRIMER', 'COWLEY', '1729', '"...', 'frequendy', 'insupportable',
 'disorder', 'ULLOA', 'SOUTH', 'AMERICA', 'sylphs', 'petticoat', 'Oft', 'Tho', 'RAPE', 'LOCK',
 'NAT', 'wales', 'JOHNSON', 'COOK', 'dung', 'lime', 'juniper', 'UNO', 'VON', 'TROIL',
 'LETTERS', 'BANKS', 'SOLANDER', '1772', 'Nantuckois', 'JEFFERSON', 'MEMORIAL', 'MINISTER', 'REFERENCE', 'PARLIAMENT',
 'SOMEWHERE', 'guarding', 'protecting', 'robbers', 'BLACKSTONE', 'Rodmond', 'suspends', 'attends', 'FALCONER', 'Bright',
 'roofs', 'domes', 'rockets', 'Around', 'unwieldy', 'COWPER', 'VISIT', 'LONDON', 'HUNTER', 'DISSECTION',
 'SMALL', 'SIZED', 'aorta', 'gushing', 'PALEY', 'THEOLOGY', 'mammiferous', 'hind', 'BARON', 'CUVIER',
 'COLNETT', 'PURPOSE', 'EXTENDING', 'SPERMACETI', 'Floundered', 'chace', 'peopling', 'Gather', 'Led', 'instincts',
 'trackless', 'Assaulted', 'voracious', 'spiral', 'MONTGOMERY', 'WORLD', 'FLOOD', 'Paean', 'fatter', 'Flounders',
 'CHARLES', 'LAMB', 'TRIUMPH', '1690', 'OBED', 'Susan', 'HAWTHORNE', 'TWICE', 'bespeak', 'raal',
 'COOPER', 'PILOT', 'Berlin', 'Gazette', 'ECKERMANN', 'CONVERSATIONS', 'GOETHE', 'ESSEX', 'WAS', 'ATTACKED',
 'FINALLY', 'DESTROYED', 'OWEN', 'CHACE', 'FIRST', 'SAID', 'VESSEL', 'YORK', '1821', 'piping',
 'dimmed', 'phospher', 'ELIZABETH', 'OAKES', 'SMITH', 'amounted', '440', 'SCORESBY', 'Mad', 'agonies',
 'endures', 'infuriated', 'rears', 'snaps', 'propelled', 'observers', 'opportunities', 'habitudes', 'BEALE', 'offensively',
 'artful', 'mischievous', 'FREDERICK', 'DEBELL', '1840', 'October', 'Raise', 'ay', 'THAR', 'bowes',
 'os', 'ROSS', 'ETCHINGS', 'CRUIZE', '1846', 'Globe', 'transactions', 'relate', 'HUSSEY', 'SURVIVORS',
 'parried', 'MISSIONARY', 'JOURNAL', 'TYERMAN', 'boldest', 'persevering', 'REPORT', 'DANIEL', 'SPEECH', 'SENATE',
 'APPLICATION', 'ERECTION', 'BREAKWATER', 'CAPTORS', 'WHALEMAN', 'ADVENTURES', 'BIOGRAPHY', 'GATHERED', 'HOMEWARD', 'COMMODORE',
 'PREBLE', 'REV', 'CHEEVER', 'MUTINEER', 'BROTHER', 'ANOTHER', 'MCCULLOCH', 'COMMERCIAL', 'reciprocal', 'clews',
 'SOMETHING', 'UNPUBLISHED', 'CURRENTS', 'Pedestrians', 'recollect', 'gateways', 'VOYAGER', 'ARCTIC', 'NEWSPAPER', 'TAKING',
 'RETAKING', 'HOBOMACK', 'MIRIAM', 'FISHERMAN', 'appliance', 'RIBS', 'TRUCKS', 'Terra', 'Del', 'Fuego',
 'DARWIN', 'NATURALIST', ";--'", '!\'"', 'WHARTON', 'Loomings', 'spleen', 'regulating', 'circulation', 'Whenever',
 'drizzly', 'hypos', 'philosophical', 'Cato', 'Manhattoes', 'reefs', 'downtown', 'gazers', 'Circumambulate', 'Corlears',
 'Coenties', 'Slip', 'Whitehall', 'Posted', 'sentinels', 'spiles', 'pier', 'lath', 'counters', 'desks',
 'loitering', 'shady', 'Inlanders', 'lanes', 'alleys', 'attract', 'dale', 'dreamiest', 'shadiest', 'quietest',
 'enchanting', 'Saco', 'crucifix', 'Deep', 'mazy', 'Tiger', 'Tennessee', 'Rockaway', 'Persians', 'deity',
 'Narcissus', 'ungraspable', 'hazy', 'quarrelsome', 'offices', 'abominate', 'toils', 'trials', 'barques', 'schooners',
 'broiling', 'buttered', 'judgmatically', 'peppered', 'reverentially', 'idolatrous', 'dotings', 'ibis', 'roasted', 'bake',
 'plumb', 'Van', 'Rensselaers', 'Randolphs', 'Hardicanutes', 'lording', 'tallest', 'decoction', 'Seneca', 'Stoics',
 'Testament', 'promptly', 'rub', 'infliction', 'BEING', 'PAID', 'urbane', 'ills', 'monied', 'consign',
 'prevalent', 'violate', 'Pythagorean', 'commonalty', 'police', 'surveillance', 'programme', 'solo', 'CONTESTED', 'ELECTION',
 'PRESIDENCY', 'UNITED', 'STATES', 'ISHMAEL', 'BLOODY', 'AFFGHANISTAN', 'managers', 'genteel', 'comedies', 'farces',
 'cunningly', 'disguises', 'cajoling', 'unbiased', 'freewill', 'discriminating', 'overwhelming', 'undeliverable', 'itch', 'forbidden',
 'ignoring', 'lodges', 'Carpet', 'Bag', 'Manhatto', 'candidates', 'penalties', 'Tyre', 'Carthage', 'imported',
 'cobblestones', 'bitingly', 'shouldering', 'price', 'fervent', 'asphaltic', 'pavement', 'flinty', 'projections', 'soles',
 'Too', 'cheapest', 'cheeriest', 'invitingly', 'particles', 'peer', 'Angel', 'Doom', 'wailing', 'gnashing',
 'Wretched', 'entertainment', 'Moving', 'emigrant', 'poverty', 'creak', 'lodgings', 'zephyr', 'hob', 'toasting',
 'observest', 'sashless', 'glazier', 'reasonest', 'chinks', 'crannies', 'lint', 'chattering', 'shiverings', 'cob',
 'redder', 'Orion', 'glitters', 'conservatories', 'president', 'temperance', 'blubbering', 'straggling', 'wainscots', 'reminding',
 'oilpainting', 'besmoked', 'defaced', 'unequal', 'crosslights', 'hags', 'delineate', 'bewitched', 'ponderings', 'boggy',
 'soggy', 'squitchy', 'froze', 'heath', 'icebound', 'represents', 'Horner', 'foundered', 'clubs', 'harvesting',
 'hacking', 'horrifying', 'Mixed', 'Nathan', 'Swain', 'corkscrew', 'Blanco', 'sojourning', 'fireplaces', 'duskier',
 'cockpits', 'rarities', 'Projecting', 'Within', 'shelves', 'flasks', 'bustles', 'deliriums', 'Abominable', 'tumblers',
 'cylinders', 'goggling', 'deceitfully', 'tapered', 'Parallel', 'pecked', 'footpads', 'Fill', 'shilling', 'examining',
 'SKRIMSHANDER', 'accommodated', 'unoccupied', 'haint', 'pose', 'whalin', 'decidedly', 'objectionable', 'wander', 'Battery',
 'ruminating', 'adorning', 'potatoes', 'sartainty', 'diabolically', 'steaks', 'undress', 'looker', 'rioting', 'Grampus',
 'seed', 'Feegees', 'tramping', 'Enveloped', 'bedarned', 'eruption', 'officiating', 'brimmers', 'complained', 'potion',
 'colds', 'catarrhs', 'liquor', 'arrantest', 'topers', 'obstreperously', 'aloof', 'desirous', 'hilarity', 'coffer',
 'Southerner', 'mountaineers', 'Alleghanian', 'missed', 'supernaturally', 'congratulate', 'multiply', 'bachelor', 'abominated', 'tidiest',
 'bedwards', 'shan', 'tablecloth', 'Skrimshander', 'bump', 'spraining', 'eider', 'yoking', 'rickety', 'whirlwinds',
 'knockings', 'dismissed', 'popped', 'cherishing', 'chuckled', 'chuckle', 'mightily', 'catches', 'bamboozingly', 'overstocked',
 'toothpick', 'rayther', 'BROWN', 'slanderin', 'farrago', 'BROKE', 'Sartain', 'Mt', 'Hecla', 'persist',
 'mystifying', 'unsay', 'criminal', 'Wall', 'purty', 'sarmon', 'rips', 'tellin', 'bought', 'balmed',
 'curios', 'sellin', 'inions', 'fooling', 'idolators', 'Depend', 'reg', 'lar', 'spliced', 'Johnny',
 'sprawling', 'Arter', 'glim', 'jiffy', 'irresolute', 'vum', 'WON', 'Folding', 'scrutiny', 'porcupine',
 'moccasin', 'ponchos', 'parade', 'rainy', 'remembering', 'commended', 'cobs', 'Nod', 'footfall', 'unlacing',
 'blackish', 'plasters', 'inkling', 'Placing', 'crammed', 'scalp', 'mildewed', 'Ignorance', 'parent', 'nonplussed',
 'undressing', 'checkered', 'Thirty', 'frogs', 'quaked', 'wrapall', 'dreadnaught', 'fumbled', 'Remembering', 'manikin',
 'tenpin', 'andirons', 'jambs', 'bricks', 'appropriate', 'applying', 'hastier', 'withdrawals', 'antics', 'devotee',
 'extinguishing', 'unceremoniously', 'bagged', 'sportsman', 'woodcock', 'uncomfortableness', 'deliberating', 'puffed', 'sang', 'Stammering',
 'conjured', 'responses', 'debel', 'flourishing', 'Angels', 'flourishings', 'peddlin', 'sleepe', 'grunted', 'gettee',
 'motioning', 'comely', 'insured', 'Counterpane', 'parti', 'triangles', 'interminable', 'caper', 'supperless', '21st',
 'hemisphere', 'sigh', 'Sixteen', 'ached', 'coaches', 'stockinged', 'slippering', 'misbehaviour', 'unendurable', 'stepmothers',
 'misfortunes', 'steeped', 'shudderingly', 'confounding', 'soberly', 'recurred', 'predicament', 'unlock', 'bridegroom', 'clasp',
 'hugged', 'rouse', 'snore', 'scratch', 'Throwing', 'expostulations', 'unbecomingness', 'matrimonial', 'dawning', 'overture',
 'innate', 'compliment', 'civility', 'rudeness', 'toilette', 'dressing', 'donning', 'gaspings', 'booting', 'caterpillar',
 'outlandishness', 'manners', 'education', 'undergraduate', 'dreamt', 'cowhide', 'pinched', 'curtains', 'indecorous', 'contented',
 'restricting', 'donned', 'lathering', 'unsheathes', 'whets', 'Rogers', 'cutlery', 'Afterwards', 'baton', 'Breakfast',
 'pleasantly', 'bountifully', 'laughable', 'bosky', 'unshorn', 'gowns', 'toasted', 'lingers', 'tarried', 'barred',
 'Grub', 'Park', 'assurance', 'polish', 'occasioned', 'embarrassed', 'bashfulness', 'duelled', 'winking', 'tastes',
 'sheepishly', 'bashful', 'icicle', 'admirer', 'cordially', 'grappling', 'genteelly', 'eschewed', 'undivided', '6',
 'circulating',

\'nondescripts\',

\'Chestnut\',

\'jostle\',

\'Regent\',

\'Lascars\',

\'Bombay\',

\'Apollo\',

\'Feegeeans\',

\'Tongatobooarrs\',

\'Erromanggoans\',

\'Pannangians\',

\'Brighggians\',

\'weekly\',

\'Vermonters\',

\'stalwart\',

\'frames\',

\'felled\',

\'strutting\',

\'wester\',

\'bombazine\',

\'cloak\',

\'mow\',

\'gloves\',

\'joins\',

\'outfit\',

\'waistcoats\',

\'Hay\',

\'Seed\',

\'tract\',

\'dearest\',

\'pave\',

\'eggs\',

\'patrician\',

\'parks\',

\'scraggy\',

\'scoria\',

\'Herr\',

\'dowers\',

\'nieces\',

\'reservoirs\',

\'maples\',

\'bountiful\',

\'proffer\',

\'passer\',

\'cones\',

\'blossoms\',

\'superinduced\',

\'carnation\',

\'Salem\',

\'sweethearts\',

\'Puritanic\',

\'Whaleman\',

\'Wrapping\',

\'Each\',

\'quote\',

\'TALBOT\',

\'Near\',

\'Desolation\',

\'1st\',

\'SISTER\',

\'ROBERT\',

\'WILLIS\',

\'ELLERY\',

\'NATHAN\',

\'COLEMAN\',

\'WALTER\',

\'CANNY\',

\'SETH\',

\'GLEIG\',

\'Forming\',

\'ELIZA\',

\'31st\',

\'MARBLE\',

\'SHIPMATES\',

\'EZEKIEL\',

\'HARDY\',

\'AUGUST\',

\'3d\',

\'1833\',

\'WIDOW\',

\'Shaking\',

\'glazed\',

\'Affected\',

\'relatives\',

\'unhealing\',

\'sympathetically\',

\'wounds\',

\'bleed\',

\'blanks\',

...]

Fine-grained selection of words

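  1. In set notation, the set of all w such that w is an element of V (the vocabulary) and w has property P is written:

    \(\{w \mid w \in V \text{ and } P(w)\}\)

  2. The corresponding Python expression is a list comprehension, where p is a function that tests property P:

    [w for w in V if p(w)]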

V = set(text1)  # vocabulary: the distinct tokens of text1 (Moby Dick)

long_words = [w for w in V if len(w) > 15]  # property P: more than 15 characters

sorted(long_words)

['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness',
 'cannibalistically', 'characteristically', 'circumnavigating',
 'circumnavigation', 'circumnavigations', 'comprehensiveness',
 'hermaphroditical', 'indiscriminately', 'indispensableness',
 'irresistibleness', 'physiognomically', 'preternaturalness',
 'responsibilities', 'simultaneousness', 'subterraneousness',
 'supernaturalness', 'superstitiousness', 'uncomfortableness',
 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
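The same pattern can be tried without loading any NLTK corpus. The sketch below is only an illustration: the token list `tokens` and the helper `is_long` are hypothetical names made up here, standing in for `text1` and the property P. It shows the property pulled out into a named function and applied in a comprehension over the vocabulary.

```python
# Hypothetical toy data standing in for text1, so the sketch runs without NLTK corpora.
tokens = [
    "The", "whale", "is", "a", "characteristically", "monstrous", "whale",
    "of", "uncomfortableness", "and", "circumnavigation",
]


def is_long(word, min_len=15):
    """Property P: the word has more than min_len characters."""
    return len(word) > min_len


# Vocabulary V: the distinct tokens (the duplicate "whale" collapses to one entry).
V = set(tokens)

# {w | w in V and P(w)}, sorted so the result is stable across runs.
long_words = sorted(w for w in V if is_long(w))

print(long_words)
```

Building `V` with `set` first means each distinct word is tested once, and sorting afterwards gives a stable order, since a set has no defined iteration order.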

This article is excerpted from "Natural Language Processing with Python".
