Daily To-do Assignments

To-do #1
Due 8/31 (Th)

The Internet is full of published linguistic data sets. Let's data-surf! Instructions:
- Go out and find two linguistic data sets you like. One should be a corpus, the other should be some other format. They must be free and downloadable in full.
- You might want to start with various bookmark sites listed in the following Learning Resources sections: Linguistic Data, Open Access, Data Publishing and Corpus Linguistics. But don't be constrained by them.
- Download the data sets and poke around. Open up a file or two to take a peek.
- In a text file (should have .txt extension), make note of:
  - The name of the data resource
  - The author(s)
  - The URL of the download page
  - Its makeup: size, type of language, format, etc.
  - License: whether it comes with one, and if so what kind?
  - Anything else noteworthy about the data. A sentence or two will do.
- If you are comfortable with markdown, make an .md file instead of a text file.

SUBMISSION: Upload your text file to To-do1 submission link, on CourseWeb.

To-do #10
Due 11/7 (Tue)

Let's have you visit classmates' term projects and take a look around. Steps:
- Create your own "visitor's log" file in the todo10 directory of Class-Practice-Repo. It can be found here. Do this ASAP, so you don't keep your visitors waiting!
  Change of plan! Let's have you directly push your visitor's log entries, rather than me playing the gatekeeper. We will change the venue. Remember the delightful "favorite animal" exercise? It was done through this repo where everyone had push access. The repo's name used to be 'Corpus-Resources', which is probably what it is still called on your own laptop. We will use this repo. Inside, I created the todo10_visitors_log directory for logging our visits.
- Since the repo's name changed, there are a few things you need to take care of on your laptop.
  - First, change your directory name:
    mv Corpus-Resources Shared-Repo
  - Your git setting still points to the old GitHub repo's web address. You must update it:
    git remote set-url origin -Science-for-Linguists/Shared-Repo.git
  - Pull from the GitHub repo so you will have up-to-date files:
    git pull
  Now this repo is ready! You have full push access, so no need to fork or create pull requests.
- You will be visiting two people ahead of you on this list. Margaret will be visiting Paige and Robert, Paige will be visiting Robert and Ryan, etc. Skip over folks who do not have a project repo.
- Visit their project repos. You don't need to download their code and run it at this time -- just browse their repo files.
- Then, log your visit on their visitor's log file. Enter two things:
  - Something you learned from their project
  - Something else that came to your mind. A helpful pointer, impressions, anything!

SUBMISSION: Push to your fork and create a pull request for me. You will have to do this twice: first for creating your visitor's log file, and second for pushing your updated visitor's log files. No need for pull requests! Update your visitor's log files directly in this repo.
REMEMBER to PULL OFTEN, so you are working with the latest file copies.

To-do #11
Due 11/9 (Thu)

Let's have you poke at big data. Well, big-ish, from our point of view. The Yelp DataSet Challenge is now on its 10th round, where Yelp has kindly made their huge review dataset available for academic groups that participate in a data mining competition. Some important points before we begin:
- After downloading the data set, you should operate exclusively in a command-line environment, utilizing unix tools.
- I am supplying general instructions below, but you will have to fill in the blanks between steps, such as cd-ing into the right directory, invoking your Anaconda Python and finding the right file argument.
- You will be submitting a write-up as a Markdown file named todo11_yelp_yourname.md.

Step 1: Preparation, exploration
Let's download this beast and poke around.
- Download the JSON portion of the data, disregarding SQL and photos. It's 2.28GB compressed, and it took me about 25 minutes to download. You will need at least 9GB of disk space.
- Move the downloaded archive file into your Documents/Data_Science directory. From this point on, operate exclusively in command line.
- The file is in .tar format. Look it up if you are not familiar. Untar it using tar -xvf. It will create a directory called dataset with JSON files in it.
- How big are the files? Find out through ls -laFh.
- What do the data look like? Find out using head.
- How many reviews are there? Find out using wc -l.
- How many reviews use the word 'horrible'? Find out through grep and wc -l. Take a look at the first few through head | less. Do they seem to have high or low stars?
- How many reviews use the word 'scrumptious'? Do they seem to have high stars this time?

Step 2: A stab at processing
How much processing can our own puny personal computer handle? Let's find out.
- First, take stock of your computer hardware: disk space, memory, processor, and how old your system is.
- Create a Python script file: process_reviews.py. Content below.
  You can use nano, or you could use your favorite editor (atom, notepad++) provided that you launch the application through command line.

    import pandas as pd
    import sys
    from collections import Counter

    # The JSON file name is the first command-line argument
    filename = sys.argv[1]
    # The file has one JSON object per line, hence lines=True
    df = pd.read_json(filename, lines=True, encoding='utf-8')
    print(df.head(5))

    # Join all review texts, split on whitespace, and count the tokens
    wtoks = ' '.join(df['text']).split()
    wfreq = Counter(wtoks)
    print(wfreq.most_common(20))

- We are NOT going to run this on the whole review.json file! Start small by creating a tiny version consisting of the first 10 lines:
    head -10 review.json > FOO.json
- Then, run process_reviews.py on FOO.json. Note that the JSON file should be supplied as a command-line argument to the Python script. Confirm it runs successfully.
- Next step, re-create FOO.json with an incrementally larger total # of lines and re-run the Python script. The point is to find out how much data your system can reasonably handle. Could that be 1,000 lines? 100,000?
- While running this experiment, closely monitor the process on your machine. Windows users should use Task Manager, and Mac users should use Activity Monitor.
- Now that you have some sense of these large data files, write up a short reflection summary. What sorts of resources would it take to successfully process this dataset in its entirety and through more computationally demanding processes? What considerations are needed?

SUBMISSION: You will find the todo11 directory in Shared-Repo. You have push access: push away your file.
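
Food for thought for the reflection: one reason process_reviews.py gets hungry is that it loads every review into a DataFrame and then joins all the text into one giant string. The sketch below is NOT part of the assignment script. It is a rough alternative, assuming the file has one JSON object per line with a 'text' field (as process_reviews.py already assumes), and it reads one review at a time so that only the running counts stay in memory. The function name stream_word_counts and its max_lines parameter are made up for illustration.

```python
import json
from collections import Counter


def stream_word_counts(path, max_lines=None):
    """Count whitespace-separated tokens in each record's 'text' field.

    Hypothetical helper: reads the file line by line instead of loading
    it whole. Assumes one JSON object per line.
    """
    counts = Counter()
    n_reviews = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Optionally stop early, similar to working on a head-ed copy
            if max_lines is not None and n_reviews >= max_lines:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            review = json.loads(line)
            # Records without a 'text' field contribute no tokens
            counts.update(review.get("text", "").split())
            n_reviews += 1
    return n_reviews, counts
```

To try it, call stream_word_counts('FOO.json') from a Python session and print the second return value's most_common(20). Memory use should then grow with the vocabulary size rather than with the number of reviews, though running time will still scale with the size of the file.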