# Capstone: Retrieving, Processing, and Visualizing Data with Python
### Capstone Overview
Welcome to the Python for Everybody Capstone. We want the capstone to be a different experience than the rest of the courses. Since this is a much smaller course, I want to make sure that there is lots of opportunity for student-to-student interaction. We understand that some of you will be in a hurry to finish and that others will want to spend time interacting with the instructional staff and other students.
We have designed this course with only one required quiz (Week 1). If you are limited on time, you may complete the quiz and finish quickly.
If you would like a more discovery-oriented experience, we have created an Honors Track so you can complete projects with our community of learners, and earn additional recognition on your certificate. The Honors Track contains three optional, peer-graded assignments (Weeks 2, 4, and 6). Coursera’s guide to Honors assignments can help you decide if this path is for you.
A goal of the Capstone is to set up structures to let you learn from each other instead of just making more lectures and assignments. We have done several things to make that work:
• In this course you can paste into the forums any code that you want to discuss with other students when you are having problems. By now we assume you know how to code.
• We have used a Wiki system provided by Coursera to allow you to edit pages and upload and share your information with other students. These Wiki pages are visible to other Coursera students, so do not post any personal information here.
• In the project side of the class it is perfectly fine for students to approach a problem as a team and use other technologies like Slack or Github to coordinate their work.
You can see that a big theme of this Capstone is to get you to contribute to the course. And of course since this is the first time we are using some of this pedagogy, there will be plenty of room for improvement. We will be watching the forums closely and may adjust the course as it progresses based on your comments, issues, and suggestions.
### Help Us Learn More About You!
As part of getting to know you better, your backgrounds, your interest in this specific course and in digital education in general, we at the University of Michigan have crafted a survey that should only take a few minutes to complete. Our goal is to keep our communication with you focused on learning and staying in touch, but we believe that both this and an end-of-course survey are important to our mutual educational goals.
All of the book materials are available under a Creative Commons Attribution-NonCommercial 3.0 Unported License. The slides, audio, assignments, auto grader and all course materials other than the book are available from http://www.py4e.com/materials under the more flexible Creative Commons Attribution 3.0 Unported License. If you are curious as to why the "NC" variant of Creative Commons was used, see Appendix D of the textbook or search through my blog posts for the string "copyright".
### Academic Innovation Policy on Learner Engagement Conduct
I. Policy
The University of Michigan strives to create and maintain a community that enables each person to reach their full potential. To do so requires an environment of trust, openness, civility and respect. The Center for Academic Innovation (Academic Innovation) at the University is firmly committed to a policy of prohibiting behaviors which adversely impact a person’s ability to participate in the scholarly, research, educational, patient care and service missions of the University enabled by Open Learning Initiatives (OLIs) the University offers through a variety of technological platforms (each an OLI Platform).
Academic Innovation has a compelling interest in ensuring an environment in which productive work and learning may thrive. At the same time, Academic Innovation has an interest in respecting freedom of speech and protecting academic freedom and in preserving the widest possible dialogue within its instructional and research settings. As such, Academic Innovation recognizes and expects there to be open discourse and exchanges that may cause some University personnel and OLI learners (collectively, OLI Community Members) to feel uncomfortable. It is through such exchanges that the flow of ideas and countervailing thoughts and experiences are expressed which can facilitate deeper understanding and learning. However, the University also expects its OLI Community Members to engage in such interactions in a professional manner.
It is the intent of this policy to protect academic freedom and to help preserve the highest standards of academic discourse and scholarship in order to advance the mission of the University. This policy is specific to conduct which is not protected and covered under the principles of freedom of speech and academic freedom but rather conduct that the University community would view as counter to its norms and expectations and which hinders other members of the community in the exercise of their professional responsibilities and academic freedoms. Academic Innovation is prepared to act to prevent or remedy behaviors that interfere with, or adversely affect, an OLI Community Member’s ability to learn or do their job.
In addition to protecting academic freedom, it is the position of the University of Michigan that a clear sense of academic responsibility is fundamental to an honest and collaborative educational environment, and behavior consistent with this principle is expected of all OLI Community Members. As such, the University is committed to ensuring its OLIs are free from academic misconduct while maintaining academic integrity at all times.
While the University seeks to create safe and welcoming OLI communities, please be advised that learners who share any personal information over OLI Platforms, including personal contact information, do so at their own risk. Before volunteering personal information over OLI Platforms, please note that the University does not apply the same data protection processes and safeguards for OLI data as it does for University-enrolled-student data. OLI Community Members are encouraged to use the direct and group communication tools integrated into or offered in connection with OLI Platforms, wherever available. While the University does not maintain, sponsor, or review groups created by non-University parties off of the OLI Platforms, the University may at its own discretion remove posts encouraging learners to share contact information and/or join external groups in its discretion.
Finally, Academic Innovation may share certain OLI learner data obtained from OLI Platforms, including general OLI course data, OLI Platform Activity information and demographic data from surveys, with third parties for scholarly research purposes in compliance with both vendor contractual obligations and applicable laws.
II. Definitions
The following types of behaviors may be subject to sanction, including learner removal from the OLI in accordance with the appropriate procedures.
These behaviors include oral, written, visual or physical actions by an OLI learner that:
a) Have the purpose or effect of unreasonably interfering with an individual’s employment or educational performance; and/or
b) Have the purpose or effect of creating an intimidating, hostile, offensive or abusive climate for an individual’s employment, academic pursuits or participation in the OLI.
Some examples of conduct that may violate this policy include, but are not limited to: threatening behavior, actions or comments; bullying behavior (defined as a persistent pattern of negative behavior based upon a real or perceived power imbalance which belittles another member of a unit); disruption of functions or activities sponsored or authorized by the University; unwelcomed solicitation of personal contact information from a fellow OLI Community Member that does not relate to a valid theme or assessment from the OLI; encouraging learners to join external groups with the intent to solicit payment of any kind, or to facilitate academic integrity violations; promotion of non-University organizations not directly related to the OLI or otherwise validated by the University; solicitation of products or services that are not specifically recommended by University personnel; threats of physical harm to, or harassment of another member of the OLI community; and behavior that results in a hostile working or learning environment. This list is not exhaustive, and OLI Community Members may be subject to sanction and disciplinary action, including removal from a particular OLI, for any type of conduct which, although not specifically enumerated, meets the standard for unacceptable behavior set forth above.
In addition, Academic Innovation considers any of the following behaviors to be academic misconduct for purposes of University of Michigan OLIs:
• Copying from another’s exam or other evaluative assignment
• Submitting work that was previously used for another OLI without the explicit endorsement or instruction of the University of Michigan
• Discussing or sharing information about questions or answers on an exam or other evaluative assignment without explicit endorsement or instruction of the University of Michigan
• Allowing another person other than yourself to take an exam or complete an assignment
• Knowingly presenting another person's ideas, findings, images or written work as one's own by copying or reproducing without acknowledgement of the source
• Using more than one login in a single OLI with malicious or fraudulent intent
III. Alleged Violations of this Policy
Alleged violations of this policy should be reported on a timely basis to Academic Innovation through Academic-Innovation-Abuse@umich.edu. Academic Innovation will ensure that appropriate action is taken to address the situation.
The University will take appropriate steps to ensure that a person who, in good faith, reports or participates in a resolution of a concern brought forward under this policy is not subject to retaliation. In addition, subjecting such a person to retaliation is itself a violation of this policy.
Violation of this policy may result in appropriate sanction or disciplinary action. If removal from a particular OLI is proposed, the matter will be addressed through the appropriate procedure connected with the OLI Platform.
### Coming from Python 2 - Encoding Data in Python 3
If you took the earlier courses in Python 2, you need to gain a brief understanding of how to handle networked data with character sets other than the "Latin" character sets. When data is moved between systems, characters like (次 - Tsugi) or (코스 - Koseu) must be properly encoded as they are passed between different systems as Unicode data. The most common Unicode encoding is UTF-8.
We have included the lecture Unicode Characters and Strings in this course specifically to give you a brief review of data encoding in Python 3 to get you quickly up to speed.
So, we started this entire course printing hello world: I just said "Hello world," and out comes hello world. It'd be nice if it were that simple. In 1970, it was simple, because there was pretty much one character set. Even in 1970, when I started, we didn't even have lowercase characters. We just had uppercase characters, and I'll tell you, we were happy when we just had uppercase characters. You kids these days with your lowercase characters, and numbers, and slashes, and stuff.
So, the problem that computers have is they have to come up with a way to handle this. I mean, computers don't understand letters; what computers actually understand is numbers. So, we had to come up with a mapping between letters and numbers, and there have been many mappings historically. The most common mapping of the 1980s is this mapping called ASCII, the American Standard Code for Information Interchange, and it says basically this number equals this letter. So for example, in Hello World, the number for capital H is 72. Somebody just decided that capital H was going to be 72; lowercase e is 101, and newline is 10.
So if you were really and truly going to look at what's going on inside the computer, it's storing these numbers. But the problem is, there are only 128 of these, which means you can't fit every character into the range 0-127. So, in the early days, we just dealt with whatever characters were possible. Like I said, when I started you could only do uppercase; you couldn't even do lowercase. So, as long as you're dealing with simple values, there is a function that lets you say, "Hey, what is the actual value for the letter H?" It's called ord, which stands for ordinal. What's the ordinal? What is the number corresponding to H? That's 72. What's the number corresponding to lowercase e? It's 101. And what's the number corresponding to newline? That's 10. Remember, newline is a single character.
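The ord examples above can be tried directly in Python. The lecture only mentions ord; chr is its built-in inverse and is included here just to show the mapping runs both ways:

```python
# ord() maps a one-character string to its numeric code point.
print(ord('H'))   # 72
print(ord('e'))   # 101
print(ord('\n'))  # 10 - newline is a single character

# chr() goes the other way, from number back to character.
print(chr(72))    # H
```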
This also explains why the lowercase letters are all greater than the uppercase letters: it comes from their ordinals. There are so many character sets now, but just for the default old-school 128 characters that we could represent with ASCII, the uppercase letters had a lower ordinal than the lowercase letters. So, 'Hi' is less than 'zzz', all lowercase, and that's because all uppercase letters are less than all lowercase letters. Actually, this could be 'aaa'; that's what I should have said there, okay? So, don't worry about that; just know that they are all numbers, and in the early days, life was simple.
We would store every character in a byte of memory, otherwise known as 8 bits of memory. It's the same thing when you say you have a many-gigabyte USB stick: a 16-gigabyte USB stick means there are 16 billion bytes of memory on there, which means we could put 16 billion characters on it in the old days. Okay? So, the point is, in the old days we just had so few characters that we could put one character in a byte.
So, the ord function tells us the numeric value of a simple ASCII character. Like I said, if you take a look at this, the lowercase e is 101, the capital H is 72, and then the newline, which is listed here as line feed, is 10. Now, we can represent these in hexadecimal, which is base 16, or octal, which is base 8, or actual binary, which is what's really going on and has nothing but zeros and ones. This is the binary for 10, 0001010, and these three columns are just alternate versions of the same numbers. The numbers go up to 127, and if you look at the binary, which is actually seven bits, you can see that it starts at all zeros and goes up to all ones.
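Python's built-in conversion functions can show the same code in each of the bases the chart uses; a quick sketch:

```python
# The same number (newline's ordinal) in decimal, hex, octal, and binary.
n = ord('\n')
print(n)                 # 10
print(hex(n))            # 0xa
print(oct(n))            # 0o12
print(format(n, '07b'))  # 0001010 - padded to the seven bits of classic ASCII
```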
Zeros and ones are what computers always do. If you go all the way back to the hardware, the little wires and stuff, the wires are carrying zeros and ones. So, this is what we did in the 60s and 70s: whatever we were capable of squeezing in, we were just totally happy with; we weren't going to have anything tricky. And like I said, early in my undergraduate career I started to see lowercase letters and I was like, "Oh, that's really beautiful, lowercase letters." Now, the real world is nothing like this.
There are all kinds of characters, and we had to come up with a scheme by which we could map those characters too. For a while there were a whole bunch of incompatible ways to represent characters other than these ASCII, also known as Latin, character sets: Arabic character sets, for example. These other character sets just completely invented their own ways of representing characters, and so you had these situations where Japanese computers pretty much couldn't talk to American computers or European computers at all. The Japanese computers had their own way of representing characters, and the American computers had their own way of representing characters, and they just couldn't talk. So they invented this thing called Unicode.
So, Unicode is this universal code for hundreds of millions of different characters and hundreds of different character sets, so that instead of saying, "Oh sorry, your language from some South Sea island doesn't fit," it's okay, we've got space in Unicode for that. So, Unicode has lots and lots of characters, not 128; lots and lots of characters. There was a time, like I said, in the 70s and 80s where everyone had something different, and then in the early 2000s, as the Internet came out, it became an important issue to have a way to exchange data. We had to say, "Oh well, it's not sufficient for Japanese computers to talk to Japanese computers and American computers to talk to American computers, when Japanese and American computers need to exchange data."
So, they built these character encodings. There is Unicode, which is this abstraction of all the different possible characters, and there are different ways of representing them inside of computers. There are a couple of simple things that you might think are good ideas that turn out to be not such good ideas, although they're used. UTF-16, UTF-32, and UTF-8 are basically ways of representing a larger set of characters. The gigantic one is 32 bits, which is four bytes; that's four times as much data for a single character, and so that's quite a lot of data. You're dividing the number of characters by four, so if this is a 16-gigabyte stick, it can only handle four billion characters or so, right? Four bytes per character, and so that's not so efficient. Then there's a compromise, UTF-16, at two bytes, but then you have to pick: UTF-32 can do all the characters, while UTF-16 can do lots of character sets but not all of them. It turns out that even though you might instinctively think that UTF-32 is better than UTF-16 and UTF-8 is the worst, UTF-8 is actually the best. UTF-8 basically says it's going to be either one, two, three, or four bytes per character, and there are little marks that tell it when to go from one to four.
The nice thing about it is that UTF-8 overlaps with ASCII. Right? So, if the only characters you're putting in are from the original ASCII or Latin-1 character set, then UTF-8 and ASCII are literally the same thing. Then it uses special bytes that are not part of ASCII to indicate flipping from one-byte characters to two-byte characters, or three-byte, or four-byte characters. So, it's variable length, and you can automatically detect it: you can just be reading through a string and say, "Whoa, I just saw this weird marker character; I must be in UTF-8." Then, if I'm in UTF-8, I can expand this and represent all those character sets, and all the characters in those character sets.
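You can see the size tradeoff between the three encodings by encoding the same text each way. This sketch uses the 次 character from the course reading; note that the UTF-16 and UTF-32 byte counts below include a byte-order mark that Python prepends:

```python
# How many bytes each encoding needs for the same five Latin characters.
s = 'Hello'
print(len(s.encode('utf-8')))   # 5  - one byte per ASCII character
print(len(s.encode('utf-16')))  # 12 - 2 bytes each, plus a 2-byte byte-order mark
print(len(s.encode('utf-32')))  # 24 - 4 bytes each, plus a 4-byte byte-order mark

# UTF-8 grows only when it has to: a non-Latin character takes more bytes.
print(len('次'.encode('utf-8')))  # 3
```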
So, what happened is they went through all these things, and as you can see from this graph, the graph doesn't really say much other than the fact that UTF-8 is awesome and getting awesomer, and every other way of representing data is becoming less awesome, right? This is 2012, so that's a long time ago. So, this is like, UTF-8 rocks. That's really because, as soon as these ideas came out, it was really clear that UTF-8 is the best practice for encoding data moving between systems, and that's why we're talking about this right now.
Finally, with this networking, we're doing sockets; we're moving data between systems. Your American computer might be talking to a computer in Japan, and you've got to know what character set is coming out, right? You might be getting Japanese characters, even though everything I've shown you is non-Japanese characters, or other Asian characters, or whatever, right? So, UTF-8 turns out to be the best practice. If you're moving a file between systems, or if you're moving network data between two systems, the world recommends UTF-8, okay?
So, if you think about your computer: inside your computer, for the strings that are inside your Python, like x = 'hello world', we don't really care how they're represented. If there is a file, usually the Python running on the computer and the file have the same character set; it might be UTF-8 inside the file, and it might be UTF-8 inside Python. But we don't care: you open a file, and that's why we didn't have to talk about this when we were opening files. Even though you might someday encounter a file that's different from your normal character set, it's rare.
So, files are inside the computer and strings are inside the computer, but network connections are not inside the computer, and when we get to databases, we're going to see they're not inside the computer either. This is also something that changed from Python 2 to Python 3; it was actually a big deal, a big thing. Most people think it's great; I actually think it's great. Some people are grumpy about it, but I think those are just people that fear change.
So, there were two strings in Python 2: there was a normal old string and a Unicode string. You can see that Python 2 would be able to make a string constant, and that's type str, and it would make a Unicode constant by prefixing u before the quote. That's a separate type, and then you had to convert back and forth between Unicode and strings. In Python 3, this is a regular string and this is a Unicode string, but you'll notice they're both strings. So, it means that inside the world of Python, if you're pulling stuff in you might have to convert it, but inside Python everything is Unicode. You don't have to worry about it; every string is the same, and whether it has Asian characters, or Latin characters, or Spanish characters, or French characters, it's just fine. So, this simplifies things.
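A quick check in Python 3 shows that the plain string and the u-prefixed string really are the same type:

```python
# In Python 3, 'abc' and u'abc' are both type str; the u prefix is redundant.
x = 'abc'
y = u'abc'
print(type(x))  # <class 'str'>
print(type(y))  # <class 'str'>
print(x == y)   # True

# Mixing character sets inside one str is fine - it is all Unicode.
print('次 and 코스')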
But then there are certain things that we're going to have to be responsible for. So, the one string type that we haven't used yet, but that becomes important, is present in both Python 2 and Python 3. Remember how I said that in the old days a character and a byte were the same thing? So, there has always been a thing like a byte string, and you denote this by prefixing the quote with b, and that says, "This is a string of bytes," where each byte means one character. If you look at a byte string in Python 2, and then you look at a regular string in Python 2, they're both type str: the bytes are the same as the string, and the Unicode string is different. So, the byte string and the regular string are the same in Python 2, and the regular string and the Unicode string are different (I'm not doing a very good job of drawing that picture). What happened in Python 3 is that the regular string and the Unicode string are now the same, and the byte string and the regular string are different, okay?
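The Python 3 side of that switch can be demonstrated in a few lines:

```python
# In Python 3, a byte string and a regular string are different types.
b = b'abc'
s = 'abc'
print(type(b))  # <class 'bytes'>
print(type(s))  # <class 'str'>
print(b == s)   # False - bytes and str never compare equal in Python 3
```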
So, bytes turn out to be raw: it might be UTF-8, it might be UTF-16, it might be ASCII. We don't know what it is; we don't know what its encoding is. It turns out that this is the thing we have to manage when dealing with data from the outside. In Python 3, all the strings internally are Unicode, not UTF-8, not UTF-16, not UTF-32, and if you just open a file, it pretty much just works; but if you talk to a network, now we have to understand this. The key thing is, we have to decode this stuff; we have to figure out the character set of the stuff we're pulling in. Now, the beauty is, because 99 percent, or maybe 100 percent, of the stuff you're ever going to run across just uses UTF-8, it turns out to be relatively simple.
So, there's this little decode operation. If you look at this code right here: when we talk to an external resource, we get a byte array back; the socket gives us an array of bytes, which are characters, but they need to be decoded. We don't know if we have UTF-8, UTF-16, or ASCII. So, there is this method that's part of byte arrays: data.decode() says, "Figure this thing out." The nice thing is, you can tell it what character set it is, but by default it assumes UTF-8, which also covers ASCII, because ASCII and UTF-8 are upward compatible with one another. So, if it's old data you're probably getting ASCII, and if it's newer data you're probably getting UTF-8, and it's very rare that you get anything other than those two. So, you almost never have to tell it what it is, right? You just say decode it; it might be ASCII, it might be UTF-8, but whatever it is, by the time decode is done with it, it's a string; it's all Unicode inside. So, this is bytes, and this is Unicode: decode goes from bytes to Unicode.
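A minimal sketch of that direction, using literal bytes in place of what a socket's recv() would hand back:

```python
# decode() turns raw bytes from the outside world into a Unicode str.
data = b'Hello world'   # stands in for bytes received from a socket
text = data.decode()    # no argument needed: UTF-8 (and ASCII) is the default
print(type(text))       # <class 'str'>
print(text)             # Hello world

# The three UTF-8 bytes for the character 次 decode back to it.
print(b'\xe6\xac\xa1'.decode('utf-8'))  # 次
```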
You also can see when we're looking at the sending of the data,
we're going to turn it into bytes.
So, encode takes this string,
and makes it into bytes.
So, this is going to be bytes that are properly encoded in UTF-8.
Again, you could have put a thing here UTF-8,
but it just assumes UTF-8,
and this is all ASCII.
So, it actually doesn't do anything.
So, but that's, okay.
Then, we're sending the bytes out as the commands. So, we have to send the stuff out: when we receive it, we decode it, and when we send it, we encode it.
Now, out in this world is where the UTF-8 is; in here, we just have Unicode. So, before we send, we encode, and after we receive, we decode, so that it all works out correctly.
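As a sketch of both directions (the GET line echoes the course's sample socket exercise; the URL is just an example):

```python
# encode() goes from Unicode to bytes before a send();
# decode() goes from bytes back to Unicode after a recv().
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'

data = cmd.encode()            # same as cmd.encode('utf-8')
print(type(data).__name__)    # bytes

# The command is all ASCII, so the round trip is lossless.
print(data.decode() == cmd)   # True
```
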
So, you can look at the documentation for both encode and decode. Decode is a method on the bytes class, and you can see that the encoding parameter lets you say it's something other than UTF-8, but it defaults to UTF-8, which is probably all you're ever going to use. The same way, strings can be encoded using UTF-8 into a byte array, and then we send that byte array out to the outside world.
It sounds more complex than it is.
So, after all that,
think of it this way.
On the way out, we have an internal string before we send it.
We have to encode it, and then we send it.
Getting stuff back, we receive it.
It comes back as bytes,
we happen to know it's UTF-8 or we're letting it automatically detect UTF-8,
and decode it, and now we have a string.
Now, internally inside of Python, we can write files and do all kinds of stuff with these strings, and it all works together. It's just that out there it's UTF-8, or question mark, question mark; that is the outside world.
So, you have to look at your program and say okay,
"When am I talking to the outside world?"
Well, in this case, it's when I'm talking to a socket, right?
I'm talking to a socket, so,
I have to know enough to encode and decode as I go in and out of the socket.
So, it looks weird when you first start seeing these encodes and decodes, but they actually make sense.
They're like this barrier between this outside world,
and our inside world.
So, that inside our data is all completely consistent,
and we can mix strings from various sources
without regard to the character set of those strings.
So, what we're going to do now is rewrite that program. It's a short program, but we're going to make it even shorter.
Capstone Completion Options
We have designed this capstone with several pathways to accommodate learners with varying goals and time constraints.
Certification
If you intend to earn a certificate for this course, we have designed this course with only one required quiz (up next). Your certificate will be available immediately upon quiz completion.
You may choose to either finish quickly, or keep proceeding with optional and honors track assignments.
Honors Track
Even if you have already earned your official certificate, we have created an Honors Track for learners who would like a more discovery-oriented experience. You can complete projects with our community of learners, and earn additional recognition on your certificate.
If you have purchased a certificate, it will automatically update to include a “With Honors” distinction if you proceed and finish honors assignments.
See Coursera’s Honors assignments for additional information.
Audit Learners
Please note that only learners who have paid can submit assignments. If you are auditing this course, you will be able to access most content, but not submit your assignment for a grade. If you wish to have your assignments graded and receive a course certificate, we encourage you to purchase a subscription to this course. Coursera has provided information about purchasing a certificate, and you can also get help from the Coursera Help Center.
Hello, and welcome to yet another installment of our
Internet History Technology Security and Python for Everybody Office Hours.
We're here in the Hague in the Netherlands and
I'd like you to meet some of your fellow students.
So here we go.
So say your name and say hi and if you have any message or whatever.
>> I'm Wakim and I'm interested in Python and a big fan of Dr. Charles.
>> Hi, my name is Weil. I just start learning Python, I'm so
excited to learn more and more about Python.
>> Hi, I'm Baadlein. I'm also excited about Python and thanks.
>> My name is Root. Dr. Chuck is the best.
>> [LAUGH] No, you're the best.
>> Hello, I'm Eva, and I'm happy to meet Dr. Chuck.
>> Hi, I'm Chrisus and
I'm super happy to be here with Dr. Chuck.
>> Hi, I'm Jane, and Dr. Chuck helped me breeze through Python, so it's awesome.
>> Good.
>> Hi I'm Martin, and I'm learning Python from Dr. Chuck and I love it too.
>> Hello my name is Victor.
I'm very excited to be here with Professor Chuck, and I really want to thank him and
Coursera because thanks to them I now am programming.
And I really like that programming stuff.
>> Okay, thank you.
>> I'm Catalina and I really, really, really love this course.
>> So let's give a quick round of applause for Catalina for setting this up and
arranging this.
[APPLAUSE] Okay.
Well, thank you.
>> No, thank you. >> That's the first time that someone on
Twitter has ever like tweeted me back and said I'll set that up for you.
>> [LAUGH] Yeah. >> So appreciate that.
>> Happy to do it.
>> Hello, I'm Giorgio and I follow some of the courses of Dr.
Chuck and you should do the same.
>> Hi, I'm Tim. I took some of the courses of Dr.
Chuck and finally I have the opportunity to meet him.
>> Hi, I'm Stefan and I'm following the Python course from Dr.
Chuck and I'm loving to meet him and he's very nice.
Thank you.
>> Hello, my name is Rob.
I'm following the Python course, and it's good to see Dr. Chuck again.
>> You have an interesting story about how Coursera affected your employment.
>> Yes, I took a lot of courses on Coursera and edX,
and now finally I got a job.
>> Congratulations. They'll be happy to hear that.
>> Hello my name is Araf and thank you for making this happen.
I see other people have the same interest basically.
I appreciate it, thank you.
>> You're welcome. >> Hi I'm Irina and
this course gave me a lot of self-confidence, so thank you.
>> You're very welcome.
So, again, a very large.
Oops, there, I cover my own thing.
A very large group of folks.
It's been kind of chilly here.
It's the first day of spring here in Holland.
I even saw some tulips coming up, but we're all wearing gloves.
>> [LAUGH] >> We're out here because we were too big to fit inside.
And then we sat here for a whole hour, until we realized.
[LAUGH]
that there were heaters, and
we didn't turn the heaters on, so we're working on how to turn the heaters on.
So, the next place I think that I might see you is Estonia.
The next Coursera office hours will be in Estonia.
So, we'll see you there.
Cheers!
[MUSIC]
>> The Khan Academy computer science platform is something that, it came
from discussions that I'd had with Saul Khan and other people at Khan Academy.
At that time, this was 2011, Khan Academy was much, much smaller.
We had conversations like well,
we would really like to have some computer science curriculum.
And they're like, John, are you interested in thinking about this?
And now, I've never explicitly taught computer science in any formal setting.
I mean, I've certainly taught people frameworks and libraries.
I've done lots of speaking on these particular things.
And I've written books on like JavaScript, and stuff like that.
But again, I haven't taught programming to
people who are complete beginners up through.
So it was a new challenge for me.
I had to go back and do a lot of rethinking about,
trying to remember what it was like when I was learning to program, and
talk about it with other people and
figure out what they experienced when they were learning to program.
What worked for them, what didn't work.
And during the initial period where it was just me kind of exploring concepts and
trying different things, the idea that really stuck with me
the most was actually going back to when I learned to program.
When I was, I was a teenager, maybe about 14 or 15 or so,
and a friend of mine came over to my house and he had a floppy disk and
on it was a copy of QBasic with a program or two.
And he's like, you should, you need to try this out.
Check this out.
And he loaded it up and he ran some program that he had written.
And it was just a very basic program, it may have just printed something out,
I don't remember.
And I remember that was the first time that I realized,
I didn't realize that you could actually tell the computer what to do.
I wanted to try and sort of take that initial experience that I had personally,
and the sort of experience of being able to read and learn and try things in
an open environment like you would have in GitHub, and combine those together.
So, what ended up coming out of this was this what we called,
at Khan Academy, computer science, which is a bit of a misnomer, I'd say, in that it's not what most people think of as a computer science curriculum.
We're not,
at least at this point, we're not going to replace a CS101 at a university.
And a lot of what we're doing is encouraging students
to do that exploration for themselves.
To be able to look at code, see programs that other students have written,
that we have written or whomever, and honestly,
I feel like the most important thing that we could do is be able
to create that little spark and create that excitement and
really get them excited about programming.
>> So when I joined we had, the computer programming was a playground and
it was great.
And people were creating,
there'd already been millions of programs created at that point, I think.
But there was not much of a curriculum around it.
And so it meant that I was worried that we might lose some people who weren't able to
figure it all out just by exploring, just by the tinkering.
Who did need to be explicitly told, this is how a for loop works,
this is what a variable is, now you try it.
So what I did was, when I started off I took my
JavaScript 101 curriculum that I'd been giving in traditional classroom settings,
somewhat traditional settings, and then Khanified that.
And that meant creating talk throughs, which are like videos,
except they're way cooler, because they're actually the editor on the left hand
side and the output on the right.
And you can actually pause, and it's the actual live editor, so you can then, you
know, make little changes, see how it happens, and then you can continue playing.
And then there's the coding challenges.
And the coding challenges are step by step, like okay,
we want you to do something like this, okay you're close but
you've actually made this common mistake, here's maybe what you should do instead.
And it's a way of both assessing and giving them a way to practice and
teach them a bit more.
So for every talk through there'll be a coding challenge.
And then every so often there'll be a project, which is a bigger free form
creative project, which gives them a lot of freedom for
what to do while still practicing what they've learned.
So maybe, they're making a fish tank once they've learned functions,
then they have this fish function and they have to parametrize that function so
that fishes can have different colors or sizes, right?
But they can go wild with that. They can add seaweed, they can have bubbles,
whatever they want to do. And sometimes they even make rat tanks, whatever.
And those get peer evaluated so it's coming up that curriculum and
then coming up with the more advanced curriculums as well.
>> I think one of the things that's important is I don't want to create
a generation of programmers or
computer scientists who exclusively program for the sake of programming.
Now I tend to be that and others here at the company tend to be that, but
I feel like we are the exception.
And another experience that was very formative to me is
I remember I was taking a AP Computer Science class in high school.
And I was, I had also been other AP classes with my other friends,
like AP English, AP History.
And they were smart, I knew they were the smartest people, and
they could go to any college they want.
And we got to AP Computer Science and I was just like, I can just do whatever.
I knew exactly how everything worked and they struggled.
And what was interesting for me to see that is
I realized that there's certain concepts here that are challenging.
And, but potentially, if they're taught in the right way, that these people,
who I know are really smart, they should not be struggling,
that they would be able to get it.
And so really, what the Khan Academy CS platform, if we had the ability to
find whatever that thing is, to get that person really excited about programming,
to make them want to keep it and learn it for themselves, but
maybe use it within the context of however else they're going to use it.
If they love science, if they love art, if they love music.
Whatever that thing is, being able to take programming and be able to
mix that together and really just use it as a life skill at that point.
I would love it if we had a generation of people who just, like,
realized that that was a thing that they could have, that they could learn,
that they could use, and not just become a programmer for a programmer's sake.
>> We want to get people programming pretty early.
I mean, we've seen that eight year olds are learning to program on our platform.
They may be particularly smart eight year olds, but
we think that actually eight year olds could be doing some form of programming.
Maybe it's block-based programming, maybe it's HTML, but they could be doing
something that's kind of exercising that type of skill, that part of the brain.
And so I envision that ideally, let's say sixth grade,
maybe sixth grade is when you start learning to program.
So you learn the basics of some language like JavaScript.
And then you start making your own programs.
And then maybe you start making programs for projects in other classes.
And I've seen this with some of our students, is that they use it for
science fair, and they use it for their history assignment.
They use it to make a timeline.
So they start using programming to complement those other classes,
those other topics, because that's one of the big things about programming,
it can be very cross-disciplinary and really work together with other stuff.
And we don't necessarily want everyone to become a computer programmer.
We want everybody to have that as a skill in their toolbox.
And then the other thing is that as they keep going, as they're making programs,
we really want them to be working with other people in making programs.
Because that's one of the big things about software development that they don't even
teach you that much in college, is that it's a huge team effort, right?
And if you're really going to make a good piece of software you're going to have to
work with other people.
And it requires a certain amount of skills and
it's also a really fantastic experience to work with other people.
It's way more a collaboration than a competition.
And we don't do that much collaborating when we're being schooled.
We do more competing.
And so I would imagine, like maybe they get into high school, and
maybe they actually have a project where they work with a local non-profit and
they spec out and they do wire frame.
They learn about user experience.
And then they actually implement it as a team and they do code review.
And they learn about what it means to work on a team.
And then they do some usability testing, and then they actually deliver and
then they have it in their portfolio.
And so there it's not learning about programming and how computers work,
it's learning about how to work with people and learning about how to make things that
work well for people too, and getting an intuition for usability.
>> I don't feel like we've made much of an impact on let's say college
level computer science education.
However, I think we've definitely had an impact on the K through 12 level.
I would say pre-AP computer science teaching of programming.
Now it's interesting because I feel like we're very different
from most programming education.
If you look at programming education in that realm of before college or
before AP Computer Science, that your students are typically not writing code.
Or physically, I want to say physically typing out characters that are code.
You end up with environments like, for example, Scratch out of MIT.
And it's a bit, or like Mindstorms or these other things, and
I feel like we're one of the few environments where we're
getting young kids to actually type real-world code and
learn I think practical pragmatic code.
>> Getting to see classrooms use your stuff is incredibly valuable so
any time I talk to teachers I always come back with feature requests and
we came up with new teacher tools for that.
So teachers now have a much better dashboard to actually monitor the progress
of their students and see where they're at in the curriculum.
And they can actually see roughly who's at what spots in the curriculum so
they can kind of say, oh, these people should help each other or these people should
pair together, and they can see all the programs that people have created.
And it's very interesting because at this high school there's this teacher Ellen,
who's teaching using our platform, and then there's another teacher who's
teaching using traditional processing, which is the desktop Java version.
And when they do their assignments they have to zip them up in a file and
they have to email it to him and he has to go through them and read it that way.
And whereas Ellen just reloads the programs page and
can see exactly what her students are working on.
So it's kind of streamlined that part of it too.
# Building a Search Engine - Introduction
This week we will download and run a simple version of the Google PageRank Algorithm. Here is an early paper by Larry Page and Sergey Brin, the founders of Google, that describes their early thoughts about the algorithm:
http://infolab.stanford.edu/~backrub/google.html
We will provide you with sample code and lectures that walk through the sample code:
https://www.py4e.com/code3/pagerank.zip
There is not a lot of new code to write - it is mostly looking at the code and making the code work. You will be able to spider some simple content that we provide and then play with the program to spider some other content. Part of the fun of this assignment is when things go wrong and you figure out how to solve a problem when the program wanders into some data that breaks its retrieval and parsing. So you will get used to starting over with a fresh database and running your web crawl.
So, now we're going to write a set of applications; the code is in pagerank.zip. That's a simple webpage crawler, and then a simple webpage indexer,
and then we're going to visualize the resulting network
using a visualization tool called d3.js.
So, in a search engine,
there are three basic things that we do.
First, we have a process that's usually done sort of when the computers are bored.
They crawl the web by retrieving a page,
pulling out all the links, having a list,
an input queue of links going through those links one at a time,
marking off the ones we've got,
picking the next one and on and on and on.
So, it says front end processes, spidering or crawling.
Then, once you have the data,
you do what's called index building where you try to look at the links
between the pages to get a sense of what are the most centrally located,
and what are the most respected pages where respect is defined as who points to whom.
Then, we actually look through and search it.
In this case we won't really search it,
we'll visualize the index when we're done.
So, a web crawler is a program that browses the web in some automated manner.
The idea is that Google and
other search engines including the one that you're going to run,
don't actually want the Web.
They want a copy of the web,
and then they can do data mining within their own copy of the web.
It's just so much more efficient than having to go out and look at the web,
you just copy it all.
So, the crawler just slowly but surely crawls and gets as good a copy of the web as it can.
Like I said, its goal is to retrieve a page,
pull out all the links,
add the links to the queue and then just pull the next one off,
and do it again, and again,
and again, and then save all the text of those pages into storage.
In our case, it'll be a database; in Google's case, it's literally thousands or hundreds of thousands of servers, but for us we'll just do this in a database.
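The retrieve, pull-out-links, add-to-queue loop described above can be sketched like this; the four-page "web" is faked with a dict so the sketch runs without a network (this is not the course's spider.py):

```python
# A toy crawl: a fake web where each page maps to its outbound links.
fake_web = {
    'page1': ['page2', 'page3'],
    'page2': ['page1'],
    'page3': ['page2', 'page4'],
    'page4': [],
}

queue = ['page1']     # the input queue of links to visit
retrieved = {}        # our "storage": page -> the links we found on it

while queue:
    url = queue.pop(0)            # pick the next link off the queue
    if url in retrieved:
        continue                  # already marked off; skip it
    links = fake_web[url]         # stands in for fetch + parse
    retrieved[url] = links        # save the page into storage
    for link in links:
        if link not in retrieved:
            queue.append(link)    # add new links to the queue

print(sorted(retrieved))   # ['page1', 'page2', 'page3', 'page4']
```
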
Now, web crawling is a bit of a science. We're going to keep it really simple: we're just going to try to get to the point where we've crawled every page that we can find once.
That's what this application is going to do.
But in the real world, you have to pick and choose
how often which pages are more valuable.
So, in real search engines,
they tend to revisit pages more often if they consider those pages more valuable,
but they also don't want to revisit them too often, because Google could crush your website and make it so that your users can't use your website, because Google is hitting you so hard.
There's also, in the world of web crawling, this file called robots.txt. It's a simple file on a website; when search engines see a domain or a URL for the first time, they download this file, and it informs them where to look and where not to look.
So, you can take a look at py4e.com and look at the robots.txt,
and see what my website is telling
all the spiders where to go look and where the good stuff is at.
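Python's standard library can read these rules too. This sketch feeds in made-up rules rather than fetching py4e.com's real robots.txt, so it runs offline:

```python
from urllib import robotparser

# Parse invented robots.txt rules supplied inline (not a real site's file).
rp = robotparser.RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

# A spider checks can_fetch() before retrieving a URL.
print(rp.can_fetch('*', 'http://example.com/index.html'))   # allowed
print(rp.can_fetch('*', 'http://example.com/private/page')) # disallowed
```
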
So, at some point you build this,
you have your own storage,
and it's time to build an index.
So, the idea is to figure out which pages are better than other pages, and certainly you start by looking at all the words in the pages, with Python word splits, etc.
But the other thing we're going to do is look at the links between them
and use those links as a way to ascribe value.
So, here's the process that we're going to run.
There are going to be a couple of different things; the code for all of this is sitting here in pagerank.zip. The way it works is that it actually just spiders a single website; you can spider dr-chuck.com, or you can actually spider Wikipedia.
It's kind of interesting,
but it takes a little longer before the links start to sort of point back to one another on Wikipedia.
But Wikipedia is not a bad place to start if you want to run something long,
because at least Wikipedia doesn't get mad at you for using it too much.
So, there are all these sort of data mining steps. The crawling grabs basically a list of links, so we end up with a list of URLs.
Some of the URLs have data, some do not,
and it randomly looks for one of the unretrieved URLs.
Goes and grabs that URL, parses it,
and then puts the data in for
that URL but then also reads through to see if there's more links.
So, in this database,
there are a few pages that retrieved and lots of pages yet to retrieve.
Then it goes back and says, oh, let's randomly pick another unretrieved page. Go get that one, pull that in, put the text for that one in, but then look at all the links and add those links to our stored list.
If you watch this, even if you do like
one or two documents at a time, you'll be like "Wow,
that was a lot of links" and then you grab another page and there's 20 links,
or 60 links, or 100 links.
So, you're not Google so you don't have the whole internet,
though what you find is as you touch any part of the internet,
the number of links explodes and you
end up with so many links that you haven't retrieved.
But, if you're Google after a year and you've seen it all once,
then you get your data more dense.
So, that's why in this program we stay with one website.
So eventually, you get some of those links filled
in and have more than one set of pointers.
The other thing in here is that we keep track of which pages point to which pages, right, the little arrows. Each page gets a number inside this database, like a primary key, and we're going to use these inbound and outbound links to compute the Page Rank.
That is the more inbound links you have
from sites that have a good number of inbound links,
the better we like that site. So, that's a better site.
So, the Page Rank algorithm is a thing that
sort of reads through this data and then writes the data,
and it takes a number of times through all of
the data to get this Page Rank values to converge.
So, these are numbers that converge toward the goodness of each page,
and so you can run this as many times as you want.
The ranking runs really quickly; the spidering runs really slowly, because it's got to talk to the network and pull these things back, and that's why we can restart it.
The Page Rank is all just talking to data inside that database and it's super fast,
and then if you want to reset these to the initial value of the Page Rank algorithm,
you can reset that and that just sets them all to the initial value.
They all start with a goodness of one, and then some of these end up with goodnesses of five or 0.01,
and so the more you run this,
the more this data converges.
So, these data items tend to converge after a while.
The first few times they jump around a bunch,
and then later they jump around less and less.
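The old-rank/new-rank iteration described above can be sketched as a toy loop. This is a simplified update without the damping factor the real PageRank paper uses, and the four-page link table is invented, so it only illustrates the converging-values idea, not the code in pagerank.zip:

```python
# page id -> list of pages it links to (invented example data)
links = {1: [2, 3], 2: [3], 3: [1], 4: [3]}

# Every page starts with a "goodness" (rank) of 1.0.
ranks = {page: 1.0 for page in links}

for _ in range(20):                         # run as many passes as you like
    new_ranks = {page: 0.0 for page in links}
    for page, outbound in links.items():
        share = ranks[page] / len(outbound)  # split rank among outbound links
        for dest in outbound:
            new_ranks[dest] += share
    ranks = new_ranks                        # new rank replaces old rank

print({page: round(rank, 2) for page, rank in ranks.items()})
```

Note how page 4, which nobody links to, drains to a rank of zero, while the pages in the cycle keep exchanging rank.
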
Then, at any point in time as you run this ranking application, you can pull the data out and dump it to look at the Page Rank values; this particular page, for example, has a Page Rank value of one.
In this dump, someone has probably just run spreset, because all the pages have the same Page Rank.
After you've run it, when you run spdump you will see that these numbers start to change.
This stuff is all in the README file that's sitting in the zip file when you unzip it.
So, spdump just reads the stuff and prints it out, and then spjson also reads through all the stuff that's in here, takes the best 20 or so links by Page Rank, and dumps them into a JavaScript file.
Then there is some HTML and d3.js, a visualization library, that produces this pretty picture; the bigger dots are the ones with a better Page Rank, and you can grab this and move all this stuff around, and it's nice and fun and exciting.
So, we visualize, right?
So, again, we have a multi-step process: a slow, restartable process, then a sort of fast data analysis cleanup process, and then a final output process that pulls stuff out of there.
So, it's another one of these multi-step data mining processes.
The last thing that we're going to talk about is visualizing mail data.
We're going to go from the Mbox-short to Mbox to Mbox-super gigantomatic.
That's what we're going to do next.
[MUSIC]
Hello, and welcome to Python for everybody.
We're doing a bit of a code walk-through, and if you want to, you can get to the sample
code and download it also so that you can walk through the code yourself.
What we're walking through today is the page rank code.
And so, the page rank code,
let me get the picture of the page rank code up here.
Here's that picture of the page rank code.
And so, the page rank code has four chunks of code that are going to,
five chunks of code that are going to run.
The first one we're going to look at is the spidering code and
then we'll do a separate look at these other guys later.
So the first one we'll look at is spidering, and again it's sort of the same
pattern of we've got some stuff on the web, in this case webpages.
We’re going to have a database that sort of just captures the stuff.
It's not really trying to be particularly intelligent, but it
is going to parse these with BeautifulSoup and add things to the database, okay.
And so, then we'll talk about how we run the page rank algorithm, and
then how we visualize the page rank algorithm in a bit.
Now, the first thing to notice is that I put the BeautifulSoup
code in right here, okay?
So you can get this from the bs4.zip file.
There might even be a README, no, but there's a README somewhere.
But to use BeautifulSoup, you've got to put this bs4.zip in place, or you have to install BeautifulSoup for your setup.
So I provide this bs4 zip as a quick and
dirty way if you can't install something for
all of the Python users on your system.
So that's what it's supposed to look like.
You're supposed to have it unzipped right here in these files.
And I don't know what dammit.py means.
That came from Beautiful Soup.
If you look, it's in their source code.
So I'm not swearing.
It's Beautiful Soup, people are swearing.
I'm sorry, I apologize, okay.
So the code we're going to play with the most is in this first one is
called spider.py.
And, we're going to do databases, we're going to read URLs and
we're going to parse them with Beautiful Soup, okay.
And so, what we're going to do is we're going to make a file.
Again, this will make spider.sqlite, and here we are in pagerank; ls -l.
Spider.sqlite is not there, so this is going to create the database.
We do CREATE TABLE IF NOT EXISTS we're going to have an INTEGER PRIMARY KEY,
because we're going to do foreign keys here.
We're going to have a URL, which is unique; the HTML; and whether we got an error.
And then, for the second half,
when we start doing page rank we're going to have old rank and new rank.
because, the way page rank works is it takes the old rank,
computes the new rank and then replaces the new rank with the old rank and
then does it over and over again.
And then we're going to have a many-to-many table which points pages back to pages, so I call these from_id and to_id. We did this with some of the Twitter stuff.
And then this Webs table is just in case I have more than one web; it does not really make much difference.
Okay, so what we're going to do is SELECT id, url FROM Pages WHERE html is NULL, which is our indicator that a page has not yet been retrieved, and error is NULL, ORDER BY RANDOM().
Not all of this SQL is completely standard, but this ORDER BY RANDOM() is really quite nice in SQLite. LIMIT 1 says: of the records in this database where this condition is true, just randomly pick one.
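A sketch of those tables and the random-row trick, using an in-memory SQLite database; the column names approximate the ones in spider.py, and the URLs are invented:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Pages holds the crawl; Links is the many-to-many page-to-page table.
cur.execute('''CREATE TABLE IF NOT EXISTS Pages
    (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
     error INTEGER, old_rank REAL, new_rank REAL)''')
cur.execute('''CREATE TABLE IF NOT EXISTS Links
    (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')

# Two unretrieved pages (html is NULL) and one already retrieved.
cur.execute('INSERT INTO Pages (url, html) VALUES (?, ?)', ('http://a/', None))
cur.execute('INSERT INTO Pages (url, html) VALUES (?, ?)', ('http://b/', None))
cur.execute('INSERT INTO Pages (url, html) VALUES (?, ?)',
            ('http://c/', '<html>done</html>'))

# ORDER BY RANDOM() LIMIT 1 picks one unretrieved page at random.
cur.execute('''SELECT id, url FROM Pages
    WHERE html is NULL and error is NULL
    ORDER BY RANDOM() LIMIT 1''')
row = cur.fetchone()
print(row[1])   # http://a/ or http://b/, never the retrieved page
```
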
Then we're going to fetch a row. If that row is None, we're going to ask for a new web, a starting URL; this fires things up, and we prime things by inserting this new URL. Otherwise, we have a row to restart with. If you just hit enter, it goes to drchuck.com, which is a fine place to start.
Then, what this does is use this Webs table to limit the links. It only follows links to the sites that you tell it to, and probably the best thing for your Page Rank is to stick with one site; otherwise, if you let this wander the web aimlessly, you'll just never find the same site again. So, I generally run with one web, although the variable probably should be called websites.
And I am pulling all the data; I read this in and just make myself a list of the legit URLs, and you'll see how we use that. The webs list is the legit places we're going to go. Then we're going to go through a loop, ask for how many pages, and we're going to look for a null page.
Again, we're using that ORDER BY RANDOM() LIMIT 1, and then we're going to grab one.
We’re going to get the fromid, which is the page we're linking from and
then the url, otherwise there's no one retrieved.
And so the fromid is when we start adding links to our page links,
we gotta know the page we started with.
And that's the primary key.
We'll see how that primary key is set in a second.
So, otherwise, we have none.
And we're going to print the fromid and the URL that we're working with.
Then, because the page is unretrieved,
we're going to wipe out all of its links.
We're going to wipe them out from Links,
the connection table that connects pages back to pages.
And so we're going to wipe that out.
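Clearing the stale rows out of the connection table before re-spidering a page looks roughly like this; the table and column names are the same assumptions as above.

```python
import sqlite3

# Sketch of wiping out a page's old entries in the Links table before
# it gets retrieved again, so stale links don't linger.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Links (from_id INTEGER, to_id INTEGER)')
cur.executemany('INSERT INTO Links VALUES (?, ?)', [(1, 2), (1, 3), (2, 3)])

from_id = 1  # the page we are about to re-retrieve
cur.execute('DELETE FROM Links WHERE from_id = ?', (from_id,))
conn.commit()
```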
So we're going to go grab this URL.
We're going to read it.
We're not decoding it, because we're using BeautifulSoup,
which compensates for the UTF-8 encoding.
And then we can ask for the HTTP status code; 200 is a good status, and
if we get a bad one, we're going to say there's an error on that page.
We're going to set that error and update Pages.
That way we don't retrieve it ever again.
We basically check to see if the content type is text/html.
Remember, in HTTP you get the content type.
We only want to look for the links on HTML pages, and so
we wipe that page out if we get a JPEG or something like that.
We're not going to retrieve JPEGs, and then we commit and continue.
So those are the pages that we didn't want to mess with.
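Those two checks, the status code and the content type, can be condensed into a small network-free sketch; the function name and exact behavior are illustrative, not the spider's actual code.

```python
# Hypothetical filter capturing the checks described: a non-200 status
# or a non-HTML content type means we record an error and skip the page.
def should_parse(status, content_type):
    if status != 200:
        return False          # bad status: record it, never retry
    if not content_type.lower().startswith('text/html'):
        return False          # skip JPEGs and other non-HTML content
    return True

print(should_parse(200, 'text/html; charset=utf-8'))  # True
print(should_parse(404, 'text/html'))                 # False
print(should_parse(200, 'image/jpeg'))                # False
```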
And then we print out how many characters we got and parse it.
We do this whole thing in a try/except block because a lot of things can go
wrong here.
It's a bit of a long try/except block.
KeyboardInterrupt, that's what happens when I hit Control+C at my keyboard or
Control+Z on Windows.
Some other exception probably means BeautifulSoup blew up or
something else blew up.
We indicate that with error = -1 for that URL so we don't retrieve it again.
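The shape of that try/except can be sketched like this; the function names are hypothetical, and the real code does its database update where the comment stands in here.

```python
# Simplified sketch of the long try/except: KeyboardInterrupt (Control+C)
# is re-raised so the program stops cleanly, while any other exception
# makes the caller record error = -1 so the URL is never retried.
def safe_retrieve(url, fetch):
    try:
        return fetch(url)
    except KeyboardInterrupt:
        print('Program interrupted by user...')
        raise
    except Exception:
        return None  # caller sets error = -1 for this url

def broken_fetch(url):
    raise ValueError('BeautifulSoup blew up')  # simulated failure

print(safe_retrieve('http://example.com/', broken_fetch))  # None
```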
At this point, at line 103, we have got the HTML for that URL.
And so we're going to insert it in, and we're going to set the page rank to 1.
So the way page rank works is it gives all the pages some normal
starting value and then it alters that.
We'll see that in a bit.
So it sets each page in with a rank of one.
We're going to INSERT OR IGNORE,
just in case the page is already there.
And then we're going to do an UPDATE, which is kind of doing the same thing
twice, just doubly making sure: if it's already there,
the INSERT OR IGNORE will cause us to do nothing, and
the UPDATE will put the data in, and then we commit, so
that if we do a SELECT later we get that information.
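The insert-or-ignore-then-update pattern can be demonstrated on its own; the column names are assumptions, and running the pair twice shows why it's safe.

```python
import sqlite3

# Sketch of the pattern: INSERT OR IGNORE creates the row only if the
# url is new, and the UPDATE stores the html either way, so running the
# pair twice still leaves exactly one row with the latest html.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE Pages
    (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT, new_rank REAL)''')

url = 'http://example.com/'
for html in ('<html>v1</html>', '<html>v2</html>'):
    cur.execute('''INSERT OR IGNORE INTO Pages (url, new_rank)
                   VALUES (?, 1.0)''', (url,))
    cur.execute('UPDATE Pages SET html = ? WHERE url = ?', (html, url))
    conn.commit()
```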
Now this code is similar.
Remember, we used BeautifulSoup to pull out all of the anchor tags.
We have a for loop.
We pull out the href.
And you'll see this code's a little [LAUGH] more complex than some of
the earlier stuff,
because it has to deal with the real nastiness and imperfection of the web.
And so, we're going to use urlparse, which is actually part of
the urllib code, and that's going to break the URL into pieces.
We have the scheme, which is http or https.
If it's a relative reference, we resolve it
by taking the current URL and hooking it up with urljoin.
Urljoin knows about slashes and all those other things.
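Both functions are in the standard library's urllib.parse module; here is a quick demonstration with an invented base URL.

```python
from urllib.parse import urljoin, urlparse

# urlparse breaks a URL into pieces (scheme, host, path, ...); urljoin
# resolves a relative href against the page it appeared on, handling
# slashes and ".." correctly.
base = 'http://www.example.com/courses/python/'

print(urlparse(base).scheme)           # 'http'
print(urljoin(base, 'lesson1.html'))   # resolved relative reference
print(urljoin(base, '../about.html'))  # ".." walks up one directory
```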
We check to see if there's an anchor, the pound sign, in the URL,
and we throw everything away from the pound sign on, including the anchor.
If we have a JPEG, or a PNG, or a GIF, we're going to skip it.
We don't want to bother with that.
We're looking through links now, we're looking at all the links.
And if we have a slash at the end, we're going to chop off the slash, by saying -1.
And so this is just kind of nasty choppage, throwing away URLs:
as we're going through a page, we have a bunch that we don't like, or
we have to clean them up or whatever.
And now we've made them absolute, by doing this.
It's an absolute URL.
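The cleanup steps can be gathered into one small function; the name and the exact extension list are illustrative, not the spider's code verbatim.

```python
# Hypothetical cleanup sketch: drop everything from the pound sign on,
# skip image links entirely, and chop a trailing slash with [:-1].
def clean(href):
    pos = href.find('#')
    if pos > -1:
        href = href[:pos]          # throw away the anchor part
    if href.endswith(('.png', '.jpg', '.gif')):
        return None                # skip images, don't retrieve them
    if href.endswith('/'):
        href = href[:-1]           # chop off the trailing slash
    return href

print(clean('http://example.com/page/#top'))
print(clean('http://example.com/logo.png'))
```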
You write this slowly but surely: your code blows up, and
you start it over, and start it over, and start it over.
Then what we do is we check against all the webs.
Remember, those are the URLs that we're willing to stay within, and usually,
it's just one.
If this link would take us off the sites we're interested in, we're going to skip it.
We are not interested in links that leave the site.
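That final filter amounts to a simple prefix check against the webs list; the list contents and function name here are illustrative.

```python
# Sketch of the stay-on-site filter: keep a link only if it starts with
# one of the sites in the webs list, so the spider never wanders off.
webs = ['http://www.example.com']  # usually just the one site

def stays_on_site(href):
    return any(href.startswith(web) for web in webs)

print(stays_on_site('http://www.example.com/about'))  # True
print(stays_on_site('http://other.org/'))             # False
```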