Commit f0619f69 by Jonathan Kelly

Create Python_Exercise_Web_Scraping_Template.ipynb

parent a10fe74f
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Python Web Scraping Tutorial"
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Step 1\n",
"\n",
"### Traversing the website\n",
"\n",
"#### Assuming we have something in mind we want to examine and have the website we want to extract the information from, the first thing we need to do is figure out how the website is structured. When we go to the site I've picked, we see that the draft data is seperated by year and there is no way to filter and download multiple years at the same time. \n",
"\n",
"#### Now that we know how the website is structured, we realize that one way to get the information we need is to go to each individual page and read the content. For that, we need a list of URLs of all the years we want to examine on the website. To do this we just click on a few different years and notice that the only difference in the URL is the year itself. Each draft table is on a page that starts the same but ends with a different year. For example the 2000 NFL draft is displayed at the url \"http://www.drafthistory.com/index.php/years/2000\" while the 2001 NFL draft is displayed here \"http://www.drafthistory.com/index.php/years/2001\". Notice the only difference is the year. This is pretty common with a lot of sites when you are iterating through a certain section of the site (e.g. blog_page_1, blog_page_2 ...., blog_page_42). This makes it really easy for us to extract the information we need. Keep this in mind if you want to scrape other pages.\n",
"\n"
],
"metadata": {}
},
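{
"cell_type": "markdown",
"source": [
"#### As a quick illustration of that pattern, here is one way to generate a numbered list of page URLs like the blog_page example above. This is only a sketch: example.com and blog_page_ are made-up placeholders, not part of the draft site."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# A minimal sketch of the \"only the page number changes\" pattern described above.\r\n",
"# https://example.com/blog_page_N is a made-up URL used purely for illustration.\r\n",
"base_url = \"https://example.com/blog_page_\"\r\n",
"\r\n",
"# Build one URL per page, blog_page_1 through blog_page_42\r\n",
"blog_urls = [base_url + str(n) for n in range(1, 43)]\r\n",
"\r\n",
"print(blog_urls[:3])   # peek at the first few urls\r\n",
"print(len(blog_urls))  # should be 42"
],
"outputs": [],
"metadata": {}
},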
{
"cell_type": "markdown",
"source": [
"## Step 2\n",
"\n",
"### Building a list of urls to scrape\n",
"\n",
"#### Now that we know the structure of the urls, we just need to build a list of all the urls (years) we want. Homepage is the \"base\" url we will use to iterate on."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 6,
"source": [
"homepage = \"http://www.drafthistory.com/index.php/years/\"\r\n",
"main_page_url_list = []\r\n",
"\r\n",
"#This is the base website which iterates through all the pages we want to scrape\r\n",
"homepage = \"http://www.drafthistory.com/index.php/years/\"\r\n",
"main_page_url_list = [] #empty list to hold the urls we want to scrape\r\n",
"\r\n",
"i = 2000 # first year I want to start scraping. This was my choice. The data goes all the way back to 1930's\r\n",
"pages_to_scrape = 22 # This is the total pages we want to scrape. In our case it's nfl draft years\r\n",
"\r\n",
"# This builds a list of urls based on changing the year for each iteration because that's how the website does it\r\n",
"main_page_url_list = [homepage + str(i) for i in range(i,i+pages_to_scrape)]\r\n",
"\r\n",
"print (main_page_url_list)"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"['http://www.drafthistory.com/index.php/years/2000', 'http://www.drafthistory.com/index.php/years/2001', 'http://www.drafthistory.com/index.php/years/2002', 'http://www.drafthistory.com/index.php/years/2003', 'http://www.drafthistory.com/index.php/years/2004', 'http://www.drafthistory.com/index.php/years/2005', 'http://www.drafthistory.com/index.php/years/2006', 'http://www.drafthistory.com/index.php/years/2007', 'http://www.drafthistory.com/index.php/years/2008', 'http://www.drafthistory.com/index.php/years/2009', 'http://www.drafthistory.com/index.php/years/2010', 'http://www.drafthistory.com/index.php/years/2011', 'http://www.drafthistory.com/index.php/years/2012', 'http://www.drafthistory.com/index.php/years/2013', 'http://www.drafthistory.com/index.php/years/2014', 'http://www.drafthistory.com/index.php/years/2015', 'http://www.drafthistory.com/index.php/years/2016', 'http://www.drafthistory.com/index.php/years/2017', 'http://www.drafthistory.com/index.php/years/2018', 'http://www.drafthistory.com/index.php/years/2019', 'http://www.drafthistory.com/index.php/years/2020', 'http://www.drafthistory.com/index.php/years/2021']\n"
]
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Step 3\r\n",
"\r\n",
"### Scraping your selected pages "
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"#### Now that we know this code works, we want to use these urls to retreive the raw data from those websites. So we'll take our code above and add it to our scraping algortihm. The bottom part of this code will iterate over all of the desired pages, copy all the html from those pages and store each page in it's own text file. \r\n",
"\r\n",
"#### This is definitely not the most elegant solution. I tried to use Beautiful Soup and a few other libraries to automatically parse the text into nice clean csv files but I ran into some trouble based on the structure and security of the site, so this is how I had to do it. There are probably much more efficient methods out there but this worked also, so I just went for it."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"source": [
"# Import the required packages\r\n",
"from urllib.request import urlopen as uReq \r\n",
"from urllib.request import Request\r\n",
"import time\r\n",
"import pandas as pd\r\n",
"from html.parser import HTMLParser\r\n",
"\r\n",
"#This is the base website which iterates through all the pages we want to scrape\r\n",
"homepage = \"http://www.drafthistory.com/index.php/years/\"\r\n",
"main_page_url_list = [] #empty list to hold the urls we want to scrape\r\n",
"\r\n",
"i = 2000 # first year I want to start scraping. This was my choice. The data goes all the way back to 1930's\r\n",
"pages_to_scrape = 22 # This is the total pages we want to scrape. In our case it's nfl draft years\r\n",
"\r\n",
"# This builds a list of urls based on changing the year for each iteration because that's how the website does it\r\n",
"main_page_url_list = [homepage + str(i) for i in range(i,i+pages_to_scrape)]\r\n",
"\r\n",
"j=2000 #initialize the first year for which we want to scrape\r\n",
"\r\n",
"# Scraping algorithm\r\n",
"# This is the algorithm that will loop through all our desired pages.\r\n",
"for i in range(0,22):\r\n",
" url = main_page_url_list[i] #Our list of urls\r\n",
" req = Request(url, headers={'User-Agent':'Mozilla/5.0'}) #\"opening\" webiste based on url from list\r\n",
" page = uReq(req).read() # reading text from page and storing it in variable \"page\"\r\n",
" \r\n",
" #Opening text file and naming it based on draft year + \"NFL Draft Picks.txt\" \"w\" means write to file.\r\n",
" f = open(str(j) + \" NFL Draft Picks.txt\", \"w\")\r\n",
" f.write(str(page)) # write text from page into the text file\r\n",
" f.close() # close text file\r\n",
" j+=1 # iterating draft year for next text file title\r\n",
" time.sleep(1) #Building in a 1 second delay to make it look more \"human\" Without this, we might get denied by website"
],
"outputs": [],
"metadata": {}
},
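{
"cell_type": "markdown",
"source": [
"#### For reference, here is roughly the kind of Beautiful Soup approach mentioned above. This is only a sketch of the idea (it assumes the beautifulsoup4 package is installed), and as noted it didn't work cleanly for me on this particular site, so treat it as a starting point rather than a drop-in replacement."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# A rough sketch of the Beautiful Soup route mentioned above (not the approach used in this tutorial).\r\n",
"# Assumes: pip install beautifulsoup4\r\n",
"from urllib.request import urlopen, Request\r\n",
"from bs4 import BeautifulSoup\r\n",
"\r\n",
"url = \"http://www.drafthistory.com/index.php/years/2000\"\r\n",
"req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})\r\n",
"html = urlopen(req).read()\r\n",
"\r\n",
"soup = BeautifulSoup(html, 'html.parser')\r\n",
"\r\n",
"# Pull the text out of every table cell on the page\r\n",
"cell_text = [td.get_text(strip=True) for td in soup.find_all('td')]\r\n",
"print(cell_text[:20]) # peek at the first few values"
],
"outputs": [],
"metadata": {}
},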
{
"cell_type": "markdown",
"source": [
"## Step 4\r\n",
"\r\n",
"### Cleaning the data"
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"#### Now that we have all of our draft data in text form, it's time to clean it up a bit and get it into csv form and ultimately into a Pandas dataframe so we can manipulate and analyze it. To do that I am going to be using an html parser. This will essentially read the html from the webpage and split the text using known html tags.\r\n",
"\r\n",
"#### After the html is parsed, we are going to create a Pandas dataframe and move our data into it. We'll create and name columns, then take each dataframe and save it by year. \r\n",
"\r\n",
"#### You'll notice in our case that we have a few extra rows of data for each file. I just opened them in Excel and deleted those rows. There is a programmatic way to do it, but since each file is slightly different, I couldn't find any similarities in the files that would allow me to do it easily. To me it wasn't worth the extra work since there are only 21 files and it only took me about 3 minutes total. However, if you're working with a lot of files or tons of data, it might be worth it for you to figure out how to do it for your specific use case. "
],
"metadata": {}
},
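{
"cell_type": "markdown",
"source": [
"#### Before the full parser, here is a tiny, self-contained toy example (my own illustration, not part of the scraping code) showing how html.parser fires its callbacks as it walks through tags and text. The TableParser below uses these same three hooks."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Toy demonstration of how HTMLParser callbacks fire (illustration only).\r\n",
"from html.parser import HTMLParser\r\n",
"\r\n",
"class DemoParser(HTMLParser):\r\n",
"    def handle_starttag(self, tag, attrs):\r\n",
"        print('start tag:', tag)\r\n",
"\r\n",
"    def handle_data(self, data):\r\n",
"        print('data     :', repr(data))\r\n",
"\r\n",
"    def handle_endtag(self, tag):\r\n",
"        print('end tag  :', tag)\r\n",
"\r\n",
"# A made-up, two-cell table row similar in shape to the draft tables\r\n",
"DemoParser().feed('<tr><td>1</td><td>Player Name</td></tr>')"
],
"outputs": [],
"metadata": {}
},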
{
"cell_type": "code",
"execution_count": 14,
"source": [
"import pandas as pd\r\n",
"from html.parser import HTMLParser\r\n",
"\r\n",
"# Create table parser and extract table data\r\n",
"class TableParser(HTMLParser):\r\n",
" def __init__(self):\r\n",
" HTMLParser.__init__(self)\r\n",
" self.in_td = False\r\n",
" \r\n",
" def handle_starttag(self, tag, attrs):\r\n",
" if tag == 'td':\r\n",
" self.in_td = True\r\n",
" \r\n",
" def handle_data(self, data):\r\n",
" if self.in_td:\r\n",
" if data == str(year):\r\n",
" pass\r\n",
" elif data == '\\xa0':\r\n",
" list.append('') # This is a placeholder for the draft round will fill in later\r\n",
" else:\r\n",
" list.append(data)\r\n",
" \r\n",
" def handle_endtag(self, tag):\r\n",
" self.in_td = False\r\n",
"\r\n",
"year = 2000\r\n",
"list = []\r\n",
"\r\n",
"# This portion of the code will call the parser above, put the parsed data into a list, \r\n",
"# convert that list into a dataframe, then save that dataframe as a csv file.\r\n",
"# There will be one file for every year.\r\n",
"for i in range(22):\r\n",
" data = open(str(year) +' NFL Draft Picks.txt', 'r')\r\n",
" data = data.read()\r\n",
" p = TableParser()\r\n",
" p.feed(str(data))\r\n",
"\r\n",
" # This code creates a list of lists that seperates each player taken in the draft. The number 7 is equal to the number of columns in the table.\r\n",
" playerDetails = [list[x:x+7] for x in range(0, len(list), 7)]\r\n",
"\r\n",
" # Create column headers and load in data\r\n",
" df = pd.DataFrame(playerDetails, columns=[\"Round\", \"Pick\", \"Overall\", \"Name\", \"Team\",\"Position\", \"School\"])\r\n",
"\r\n",
" # now we need to add the year so we can combine files and the draft year.\r\n",
" # Using DataFrame.insert() to add a column\r\n",
" df.insert(0, \"Year\", year, True)\r\n",
"\r\n",
" # now we'll save this data frame into a csv for later \r\n",
" df.to_csv(str(year) +' draft results.csv', index=False)\r\n",
" \r\n",
" year += 1 #iterate year for next file\r\n",
" playerDetails = [] #reset playdetails list or the lists will be cumlative instead of being seperated by year like we want.\r\n",
" list = []\r\n",
"\r\n",
" "
],
"outputs": [],
"metadata": {}
},
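{
"cell_type": "markdown",
"source": [
"#### As mentioned above, I cleaned the few extra rows out of each csv by hand in Excel. If you'd rather try it programmatically, here is a sketch of one possible heuristic. It assumes the unwanted rows are exactly the ones whose Overall value is not a number, which may or may not hold for every file, so check the output before relying on it."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import pandas as pd\r\n",
"\r\n",
"# A sketch of one possible programmatic cleanup (assumes the unwanted rows\r\n",
"# are the ones whose 'Overall' value is not a number).\r\n",
"df = pd.read_csv('2000 draft results.csv')\r\n",
"\r\n",
"# Coerce 'Overall' to numeric; anything that isn't a number becomes NaN\r\n",
"overall_numeric = pd.to_numeric(df['Overall'], errors='coerce')\r\n",
"\r\n",
"# Keep only rows with a real overall pick number\r\n",
"cleaned = df[overall_numeric.notna()].copy()\r\n",
"\r\n",
"cleaned.to_csv('2000 draft results cleaned.csv', index=False)\r\n",
"print(f\"kept {len(cleaned)} of {len(df)} rows\")"
],
"outputs": [],
"metadata": {}
},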
{
"cell_type": "markdown",
"source": [
"## Step 5\r\n",
"\r\n",
"### Merging the data\r\n",
"\r\n",
"### The last step is to combine this data into. The code below will allow you to cobine all of the csv files into one csv file."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 4,
"source": [
"import os\r\n",
"import glob\r\n",
"import pandas as pd\r\n",
"\r\n",
"# Path where the csv files are saved\r\n",
"os.chdir(r\"C:\\Users\\jdk51\\Google Drive\\Programming\\Data Science\\DSDojo\\Blogs\\Web_Scraping_Tutorial\")\r\n",
"extension = 'csv'\r\n",
"\r\n",
"# Create list of all the file names to be opened and read\r\n",
"all_filenames = [i for i in glob.glob('*.{}'.format(extension))]\r\n",
"\r\n",
"# read one csv at a time and store the data\r\n",
"combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])\r\n",
"\r\n",
"# create a csv and export saved data from variable above\r\n",
"combined_csv.to_csv( \"2000-2021 Draft Picks.csv\", index=False, encoding='utf-8-sig')"
],
"outputs": [],
"metadata": {}
}
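,
{
"cell_type": "markdown",
"source": [
"#### One small caveat with the merge step above (my own note, not part of the original workflow): because the glob pattern matches every csv in the folder, re-running that cell after the combined file exists would also pick up \"2000-2021 Draft Picks.csv\" and duplicate the data. A slightly narrower pattern that only matches the per-year files avoids that:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import glob\r\n",
"import pandas as pd\r\n",
"\r\n",
"# Only match the per-year files saved earlier (e.g. \"2000 draft results.csv\"),\r\n",
"# so the combined output file is never re-read on a second run.\r\n",
"# Assumes the working directory was already set in the cell above.\r\n",
"yearly_files = sorted(glob.glob('* draft results.csv'))\r\n",
"\r\n",
"combined_csv = pd.concat([pd.read_csv(f) for f in yearly_files])\r\n",
"combined_csv.to_csv(\"2000-2021 Draft Picks.csv\", index=False, encoding='utf-8-sig')"
],
"outputs": [],
"metadata": {}
}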
],
"metadata": {
"interpreter": {
"hash": "63fd5069d213b44bf678585dea6b12cceca9941eaf7f819626cde1f2670de90d"
},
"kernelspec": {
"display_name": "Python 3.9.2 64-bit",
"name": "python3"
},
"language_info": {
"name": "python",
"version": ""
}
},
"nbformat": 4,
"nbformat_minor": 4
}
\ No newline at end of file