Analyzing an EMAIL Archive from gmane and visualizing the data
using the D3 JavaScript library
This is a set of tools that allow you to pull down an archive
of an email repository (formerly called gmane.org) using the
instructions at:
http://mbox.dr-chuck.net/export.php
This server contains a cache of a subset of the gmane.org data,
which is no longer available.
You should install the SQLite browser to view and modify the databases from:
http://sqlitebrowser.org/
The first step is to spider the repository. The base URL
is hard-coded in gmane.py and points to the Sakai
developer list. Make sure to delete the content.sqlite file if you
switch the base URL. The gmane.py file operates as a spider in
that it runs slowly and retrieves one mail message per second so
as to avoid getting throttled by gmane.org. It stores all of
its data in a database and can be interrupted and restarted
as often as needed. It may take many hours to pull all the data
down, so you may need to restart several times.
To give you a head-start, I have put up 600MB of pre-spidered Sakai
email here:
/data_space/content.sqlite.zip
If you download this, you can "catch up with the latest" by
running gmane.py.
Navigate to the folder where you extracted gmane.zip.
Note: Windows has difficulty in displaying UTF-8 characters
in the console so for each console window you open, you may need
to type the following command before running this code:
chcp 65001
http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how
Here is a run of gmane.py retrieving five messages from the
Sakai developer list:
Mac: python3 gmane.py
Win: gmane.py
How many messages:10
http://mbox.dr-chuck.net/sakai.devel/1/2 2662
ggolden@umich.edu 2005-12-08T23:34:30-06:00 call for participation: developers documentation
http://mbox.dr-chuck.net/sakai.devel/2/3 2434
csev@umich.edu 2005-12-09T00:58:01-05:00 report from the austin conference: sakai developers break into song
http://mbox.dr-chuck.net/sakai.devel/3/4 3055
kevin.carpenter@rsmart.com 2005-12-09T09:01:49-07:00 cas and sakai 1.5
http://mbox.dr-chuck.net/sakai.devel/4/5 11721
michael.feldstein@suny.edu 2005-12-09T09:43:12-05:00 re: lms/vle rants/comments
http://mbox.dr-chuck.net/sakai.devel/5/6 9443
john@caret.cam.ac.uk 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments
Does not start with From
The program scans content.sqlite from 1 up to the first message number not
already spidered and starts spidering at that message. It continues spidering
until it has spidered the desired number of messages or it reaches a page
that does not appear to be a properly formatted message.
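The restart logic described above can be sketched in a few lines. This is
only an illustration, not the code in gmane.py: the Messages table and its
columns are assumptions made for the example, and an in-memory database
stands in for content.sqlite.

```python
import sqlite3

def first_unspidered(conn):
    """Count up from 1 and return the first message id not yet stored."""
    cur = conn.cursor()
    start = 1
    while True:
        cur.execute("SELECT id FROM Messages WHERE id = ?", (start,))
        if cur.fetchone() is None:
            return start
        start += 1

# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Messages (id INTEGER UNIQUE, email TEXT, sent_at TEXT, "
    "subject TEXT, headers TEXT, body TEXT)"
)
# Pretend messages 1, 2, 3 and 5 were spidered in an earlier run.
conn.executemany("INSERT INTO Messages (id) VALUES (?)", [(1,), (2,), (3,), (5,)])

print(first_unspidered(conn))  # spidering would resume at this id
```

Because the starting point is recomputed from the database on every run,
interrupting the spider loses no work.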
Sometimes gmane.org is missing a message. Perhaps administrators can delete messages
or perhaps they get lost - I don't know. If your spider stops, and it seems it has hit
a missing message, go into the SQLite browser and add a row with the missing id - leave
all the other fields blank - and then restart gmane.py. This will unstick the
spidering process and allow it to continue. These empty messages will be ignored in the next
phase of the process.
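If you would rather add the placeholder row from Python than from the
SQLite browser, something like the following should work. It is a sketch
under assumptions: the table name Messages and its columns are guesses
for the example, so check the real schema in content.sqlite first.

```python
import sqlite3

def add_placeholder(conn, missing_id):
    """Insert a row holding only the id; every other field stays NULL."""
    # INSERT OR IGNORE leaves an existing row untouched if the id is present.
    conn.execute("INSERT OR IGNORE INTO Messages (id) VALUES (?)", (missing_id,))
    conn.commit()

# In-memory stand-in for content.sqlite, with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Messages (id INTEGER UNIQUE, email TEXT, sent_at TEXT, "
    "subject TEXT, headers TEXT, body TEXT)"
)
conn.execute("INSERT INTO Messages (id, email) VALUES (1, 'someone@example.edu')")

add_placeholder(conn, 2)   # message 2 is the hypothetical missing one
rows = conn.execute("SELECT id, email FROM Messages ORDER BY id").fetchall()
print(rows)
```

Against the real file you would open it with sqlite3.connect('content.sqlite')
instead of the in-memory database.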
One nice thing is that once you have spidered all of the messages and have them in
content.sqlite, you can run gmane.py again to get new messages as they get sent to the
list. gmane.py will quickly scan to the end of the already-spidered pages and check
if there are new messages and then quickly retrieve those messages and add them
to content.sqlite.
The content.sqlite data is pretty raw, with an inefficient data model, and not compressed.
This is intentional as it allows you to look at content.sqlite to debug the process.
It would be a bad idea to run any queries against this database as they would be
slow.
The second process is running the program gmodel.py. gmodel.py reads the rough/raw
data from content.sqlite and produces a cleaned-up and well-modeled version of the
data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X
smaller) than content.sqlite because it also compresses the header and body text.
Each time gmodel.py runs - it completely wipes out and re-builds index.sqlite, allowing
you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the
data cleaning process.
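To get a feel for why compressing header and body text shrinks the file so
much, here is a small sketch using Python's zlib module. Using zlib is an
assumption made for this example, and the helper names pack and unpack are
made up; the text above only says that the text is compressed.

```python
import zlib

def pack(text):
    """Compress a string into bytes suitable for a BLOB column."""
    return zlib.compress(text.encode("utf-8"))

def unpack(blob):
    """Reverse pack(): decompress the bytes and decode back to a string."""
    return zlib.decompress(blob).decode("utf-8")

# Mail headers are very repetitive, so they compress well.
body = "Received: from mail.example.org by list.example.org\n" * 40
packed = pack(body)

print(len(body.encode("utf-8")), "bytes before,", len(packed), "bytes after")
```

The price is that any program reading the compressed columns has to
decompress them before using the text.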
Running gmodel.py works as follows:
Mac: python3 gmodel.py
Win: gmodel.py
Loaded allsenders 1588 and mapping 28 dns mapping 1
1 2005-12-08T23:34:30-06:00 ggolden22@mac.com
251 2005-12-22T10:03:20-08:00 tpamsler@ucdavis.edu
501 2006-01-12T11:17:34-05:00 lance@indiana.edu
751 2006-01-24T11:13:28-08:00 vrajgopalan@ucmerced.edu
...
The gmodel.py program does a number of data cleaning steps.
Domain names are truncated to two levels for .com, .org, .edu, and .net;
other domain names are truncated to three levels. So si.umich.edu becomes
umich.edu and caret.cam.ac.uk becomes cam.ac.uk. Also, mail addresses are
forced to lower case, and some of the @gmane.org addresses like the following
arwhyte-63aXycvo3TyHXe+LvDLADg@public.gmane.org
are converted to the real address whenever there is a matching real email
address elsewhere in the message corpus.
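The domain truncation rule can be written down directly. This is a minimal
sketch of the rule as stated above, not necessarily how gmodel.py does it;
the function name truncate_domain is made up for the example.

```python
def truncate_domain(domain):
    """Keep two levels for .com/.org/.edu/.net, three levels otherwise."""
    pieces = domain.lower().split(".")
    keep = 2 if pieces[-1] in ("com", "org", "edu", "net") else 3
    # Slicing is safe even when the domain has fewer pieces than `keep`.
    return ".".join(pieces[-keep:])

print(truncate_domain("si.umich.edu"))     # umich.edu
print(truncate_domain("caret.cam.ac.uk"))  # cam.ac.uk
```

Domains that are already short enough, such as gmail.com or tfd.co.uk,
pass through unchanged.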
If you look in the content.sqlite database there are two tables that allow
you to map both domain names and individual email addresses that change over
the lifetime of the email list. For example, Steve Githens used the following
email addresses over the life of the Sakai developer list:
s-githens@northwestern.edu
sgithens@cam.ac.uk
swgithen@mtu.edu
We can add two entries to the Mapping table:
s-githens@northwestern.edu -> swgithen@mtu.edu
sgithens@cam.ac.uk -> swgithen@mtu.edu
And so all the mail messages will be collected under one sender even if
they used several email addresses over the lifetime of the mailing list.
You can also make similar entries in the DNSMapping table if there are multiple
DNS names you want mapped to a single DNS name. In the Sakai data I added the
following mapping:
iupui.edu -> indiana.edu
So all the folks from the various Indiana University campuses are tracked together.
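Conceptually, both tables are simple lookups: if an address or domain has an
entry, use the mapped value, otherwise keep it as it is. The sketch below
uses plain dictionaries in place of the Mapping and DNSMapping tables; the
function names are made up for the example.

```python
# Stand-ins for the Mapping and DNSMapping tables, using the entries above.
mapping = {
    "s-githens@northwestern.edu": "swgithen@mtu.edu",
    "sgithens@cam.ac.uk": "swgithen@mtu.edu",
}
dns_mapping = {"iupui.edu": "indiana.edu"}

def canonical_sender(email):
    """Lower-case the address, then apply the mapping if there is one."""
    email = email.strip().lower()
    return mapping.get(email, email)

def canonical_domain(domain):
    """Apply the DNS mapping if there is one."""
    domain = domain.lower()
    return dns_mapping.get(domain, domain)

print(canonical_sender("SGithens@cam.ac.uk"))  # swgithen@mtu.edu
print(canonical_domain("iupui.edu"))           # indiana.edu
```

Anything without an entry falls through untouched, so you can add mappings
one at a time as you spot problems in the data.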
You can re-run the gmodel.py over and over as you look at the data, and add mappings
to make the data cleaner and cleaner. When you are done, you will have a nicely
indexed version of the email in index.sqlite. This is the file to use to do data
analysis. With this file, data analysis will be really quick.
The first, simplest data analysis is to answer "who does the most?" and "which
organization does the most?" This is done using gbasic.py:
Mac: python3 gbasic.py
Win: gbasic.py
How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 5 Email list participants
steve.swinsburg@gmail.com 2657
azeckoski@unicon.net 1742
ieb@tfd.co.uk 1591
csev@umich.edu 1304
david.horwitz@uct.ac.za 1184
Top 5 Email list organizations
gmail.com 7339
umich.edu 6243
uct.ac.za 2451
indiana.edu 2258
unicon.net 2055
You can look at the data in index.sqlite and if you find a problem, you
can update the Mapping table and DNSMapping table in content.sqlite and
re-run gmodel.py.
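The counting behind a "top N" report is short enough to show here. This is
a sketch of the idea only, working on a made-up list of sender addresses
rather than reading index.sqlite, and the function names are hypothetical.

```python
from collections import Counter

def top_senders(senders, how_many):
    """Return the most frequent sender addresses with their counts."""
    return Counter(senders).most_common(how_many)

def top_organizations(senders, how_many):
    """Return the most frequent domains (the part after the @)."""
    orgs = Counter(email.split("@")[-1] for email in senders)
    return orgs.most_common(how_many)

# Made-up sample data: one entry per message sent.
senders = [
    "ann@example.edu", "bob@example.com", "ann@example.edu",
    "cat@example.edu", "bob@example.com", "ann@example.edu",
]

print(top_senders(senders, 2))
print(top_organizations(senders, 2))
```

With the sample list, ann@example.edu leads the senders with 3 messages and
example.edu leads the organizations with 4.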
There is a simple visualization of the word frequency in the subject lines
in the file gword.py:
Mac: python3 gword.py
Win: gword.py
Range of counts: 33229 129
Output written to gword.js
This produces the file gword.js which you can visualize using the file
gword.htm.
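Counting words in subject lines might look something like this. The
cleaning choices here (replacing punctuation and digits with spaces, and
skipping words shorter than four letters) are assumptions made for the
sketch, not a description of what gword.py does.

```python
import string
from collections import Counter

def subject_word_counts(subjects, min_length=4):
    """Count words across subject lines after some light cleanup."""
    junk = string.punctuation + string.digits
    to_spaces = str.maketrans(junk, " " * len(junk))
    counts = Counter()
    for subject in subjects:
        for word in subject.translate(to_spaces).lower().split():
            if len(word) >= min_length:   # skips "re", "and", "cas", ...
                counts[word] += 1
    return counts

# Subject lines taken from the sample gmane.py run shown earlier.
subjects = [
    "cas and sakai 1.5",
    "re: lms/vle rants/comments",
    "re: lms/vle rants/comments",
]
counts = subject_word_counts(subjects)
print(counts.most_common())
```

The highest and lowest counts among the words you keep give the range you
need to scale font sizes in a word cloud.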
A second visualization is in gline.py. It visualizes email participation by
organizations over time.
Mac: python3 gline.py
Win: gline.py
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 10 Oranizations
['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk']
Output written to gline.js
Its output is written to gline.js which is visualized using gline.htm.
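The data behind a participation-over-time chart is a count of messages per
organization per time period. Here is a sketch that groups by month; the
monthly granularity and the function name are assumptions for the example,
and the sample rows come from the gmane.py run shown earlier.

```python
from collections import Counter

def counts_by_month_and_org(messages):
    """Count messages per (YYYY-MM, organization) pair."""
    counts = Counter()
    for sent_at, email in messages:
        month = sent_at[:7]                     # "2005-12-08T..." -> "2005-12"
        org = email.split("@")[-1].lower()      # part after the @
        counts[(month, org)] += 1
    return counts

messages = [
    ("2005-12-08T23:34:30-06:00", "ggolden@umich.edu"),
    ("2005-12-09T00:58:01-05:00", "csev@umich.edu"),
    ("2005-12-09T09:01:49-07:00", "kevin.carpenter@rsmart.com"),
]
by_month = counts_by_month_and_org(messages)
for (month, org), n in sorted(by_month.items()):
    print(month, org, n)
```

Each organization then becomes one line on the chart, with the months along
the horizontal axis.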
Some URLs for visualization ideas:
https://developers.google.com/chart/
https://developers.google.com/chart/interactive/docs/gallery/motionchart
https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart_time_formats
https://developers.google.com/chart/interactive/docs/gallery/annotatedtimeline
http://bost.ocks.org/mike/uberdata/
http://mbostock.github.io/d3/talk/20111018/calendar.html
http://nltk.org/install.html
As always - comments welcome.
-- Dr. Chuck
Sun Sep 29 00:11:01 EDT 2013