Natural Language Processing Assignment
This assignment, from Australia, uses Linux tools to work with regular expressions, and also covers induction.
How to manage this assignment
- Start work on this assignment early, and do as much as possible of it before your week 4 prac class. There will not be time during the class itself to do the assignment from scratch; there will only be time to get some help and clarification.
Instructions
Please read these instructions carefully before you attempt the assessment:
- To begin working on the assignment, download the workbench asgn1.zip from Moodle. Create a new Ed Workspace and upload this file, letting Ed automatically extract it. Edit the student-id file to contain your name and student ID. Refer to Lab 0 for a reminder on how to do these tasks.
- The workbench provides locations and names for all solution files. These will be empty, needing replacement. Do not add or remove files from the workbench.
- Solutions to written questions must be submitted as PDF documents. You can create a PDF file by scanning your legible (use a pen, write carefully, etc.) hand-written solutions, or by directly typing up your solutions on a computer. If you type your solutions, be sure to create a PDF file. There will be a penalty if you submit any other file format (such as a Word document). Refer to Lab 0 for a reminder on how to upload your PDF to the Ed workspace and replace the placeholder that was supplied with the workbench.
- Before you attempt any problem—or seek help on how to do it—be sure to read and understand the question, as well as any accompanying code.
- When you have finished your work, download the Ed workspace as a zip file by clicking on “Download All” in the file manager panel. You must submit this zip file to Moodle by the deadline given above. To aid the marking process, you must adhere to all naming conventions that appear in the assignment materials, including files, directories, code, and mathematics. Not doing so will cause your submission to incur a one-day late-penalty (in addition to any other late-penalties you might have). Be sure to check your work carefully.
Your submission must include:
- a sed script, decomposeSyllablesIntoParts, for Problem 1(a);
- a one-line text file, Korean-NameStructure2, for Problem 1(a);
- a sed script, decomposePartsIntoLetters, for Problem 1(b);
- a one-line text file, Korean-NameStructure3, for Problem 1(b);
- an awk script, matchKorean, for Problem 1(c);
- the output file outputFile you produced by running your awk script on the provided input file, inputFileOfNames, in Problem 1(c);
- a file prob2.pdf with your solution to Problem 2.
Introduction to the Assignment
In Lab 0, you met the stream editor sed, which detects and replaces certain types of patterns in text, processing one line at a time. These patterns are actually specified by regular expressions. You will use sed again in Problem 1 of this Assignment, to help construct regular expressions.
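As a reminder of how sed applies a regular expression to each line in turn, here is a minimal sketch. The input text and the variable names are made up for illustration and are unrelated to the assignment's files.

```shell
# Toy input (hypothetical): sed reads and processes it one line at a time.
input='room 12, level 3
no digits here'

# s/REGEX/REPLACEMENT/g replaces every match of REGEX on the current line.
# [0-9][0-9]* matches a run of one or more digits.
sed_output=$(printf '%s\n' "$input" | sed 's/[0-9][0-9]*/N/g')

printf '%s\n' "$sed_output"
```

The first line becomes `room N, level N`, while the second line contains no match and passes through unchanged.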
You will also learn about awk, which is a simple programming language that is widely used in Unix/Linux systems and which also uses regular expressions. In Problem 1, you will construct an awk program to identify a class of Korean names.
Finally, Problem 2 is about applying induction to a problem of counting on graphs.
Introduction to awk
In an awk program, each line has the form
    /pattern/ { action }
where the pattern is a regular expression (or certain other special patterns) and the action is an instruction that specifies what to do with any line that contains a match for the pattern. The action (and the { ... } around it) can be omitted, in which case any line that matches the pattern is printed.
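The two rule forms just described can be tried out as follows. This is only a sketch: the input lines and variable names are invented, and each one-rule program is given inline on the command line purely for brevity.

```shell
# Toy input (hypothetical data, not from the assignment).
input='apple 3
banana 12
cherry 7'

# A rule with no action: every line containing a match for /an/ is printed.
printed=$(printf '%s\n' "$input" | awk '/an/')

# A rule with an action: for each line starting with a or c,
# print a label followed by the whole line ($0).
labelled=$(printf '%s\n' "$input" | awk '/^[ac]/ { print "starts with a or c: " $0 }')

printf '%s\n' "$printed" "$labelled"
```

Only `banana 12` contains `an`, so it is the only line printed by the first rule; the second rule fires for `apple 3` and `cherry 7`.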
Once you have written your program, it does not need to be compiled. It can be executed directly, by using the awk command in Linux:
$ awk -f programName inputFileName
Your program is then executed on an input file in the following way.
// Initially, we’re at the start of the input file, and haven’t read any of it yet.
If the program has a line with the special pattern BEGIN, then
    do the action specified for this pattern.
Main loop, going through the input file:
{
    inputLine := next line of input file
    Go to the start of the program.
    Inner loop, going through the program:
    {
        programLine := next line of program (but ignore any BEGIN and END lines)
        if inputLine contains a string that matches the pattern in programLine, then
            if there is an action specified in the programLine, then
            {
                do this action
            }
            else
                just print inputLine    // it goes to standard output
    }
}
If the program has a line with the special pattern END, then
    do the action specified for this pattern.
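The execution order described above can be observed with a small experiment. The sketch below is illustrative only: the file names demo.awk and demo.txt, and their contents, are hypothetical and are not part of the workbench.

```shell
workdir=$(mktemp -d)

# Hypothetical program: a BEGIN rule, two ordinary rules, and an END rule.
cat > "$workdir/demo.awk" <<'EOF'
BEGIN   { print "start" }
/a/     { count++ }
/^b/
END     { print "lines containing a: " count }
EOF

# Hypothetical four-line input file.
printf '%s\n' 'alpha' 'beta' 'gamma' 'echo' > "$workdir/demo.txt"

# Run the program file on the input file, as in the command shown earlier.
demo_output=$(awk -f "$workdir/demo.awk" "$workdir/demo.txt")
printf '%s\n' "$demo_output"

rm -r "$workdir"
```

The BEGIN action runs before any input is read, so `start` comes first. Each input line is then tested against both ordinary rules in order: `/a/` silently increments the counter for alpha, beta and gamma, and the action-less rule `/^b/` prints `beta`. The END action runs last and reports a count of 3.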
Any output is sent to standard output. You should read about the basics of awk, including the way it represents regular expressions and the main instruction types used in its actions. Any of the following sources should be a reasonable place to start:
- A. V. Aho, B. W. Kernighan and P. J. Weinberger, The AWK Programming Language, Addison-Wesley, New York, 1988.
(The first few sections of Chapter 1 should have most of what you need, but be aware also of the regular expression specification on p. 28.)
- https://www.grymoire.com/Unix/Awk.html
- the Wikipedia article looks ok
- the awk manpage
- the GNU Awk User’s Guide.
Introduction to Problem 1
The Master said, ‘What is necessary is to rectify names . . . . If names are not rectified, then words are not appropriate. If words are not appropriate, then deeds are not accomplished.’
– Confucius (孔夫子), The Analects (transl. R. Dawson), Oxford University Press, 1993, §13.3.
Most organisations today deal with people from many different cultures and language groups, and they must often record and process people’s names in systems that work mainly with English language text. In such contexts, it is helpful to be able to recognise names from different language groups. Example applications include: determining how to pronounce students’ names when reading them out from a list at graduation ceremonies; determining how to greet a person with whom you have an appointment; determining how to enter the various parts of a person’s name into a database; determining how automatically-generated emails, sent to many different people listed in some file, should address each recipient; determining the most likely native language of a person in situations where their name is known but they cannot be spoken to directly at the time (e.g., in some emergency situations). Recognising the language group that a name belongs to is an important first step in all these situations.
In this problem you will write some code in sed and awk to try to recognise one type of name in a long file of Asian names. More specifically, suppose you are given an input file in which each line starts with a person’s name in some language, with each name transcribed somehow into English text. Your task is to detect which of these names come from Korean.
In the input file, all text from the start of each line until the first colon (:) on the line (but not including the colon itself) is taken to be a person’s name. In most cases, each line ends with a string of non-blank letters specifying which language the name is believed to come from. An example input file is provided, as inputFileOfNames. If you browse through the file, you will notice that it contains names from a variety of Asian languages: Mandarin, Cantonese, Hokkien, Teochew, Hakka, Korean, Japanese, Thai, Vietnamese, Malay and Indonesian. They have been represented in English text using a variety of transcription schemes, and with all extra marks on letters (accents, tone marks, other diacritical marks, etc.) removed. In many cases, the line about a person also contains some other information about them, but our name recognition task will ignore that information.
Further information about working with names from different cultures can be found in:
- Fiona Swee-Lin Price, Success with Asian Names, Allen & Unwin, Crows Nest, NSW, 2007.
- J. Greenberg Motamedi, Z. Jaffery, A. Hagen, and S. Y. Yoon, Getting it right: Reference guides for registering students with non-English names, 2nd edition. (REL 2016-158 v2), U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northwest, Washington, DC, 2017.
https://ies.ed.gov/ncee/edlabs/regions/northwest/pdf/REL_2016158.pdf
- SBS, The Cultural Atlas, https://culturalatlas.sbs.com.au/. You can look up a country or culture (e.g., by clicking on a map) and then click on a link to “Naming” for that culture.
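Returning to the input format described earlier (the name is everything before the first colon, and the line usually ends with a language label), the following sketch shows one way awk can pull those two pieces apart. The sample lines, the labels LangX and LangY, and the variable names are placeholders invented here, and whether your solution should use field splitting at all is left to you; this is not a prescribed method.

```shell
# Placeholder lines in the described shape: name, colon, optional extra
# information, then a language label as the last word (all hypothetical).
sample='Aaa Bbb: born 1970 LangX
Ccc Ddd Eee: LangY'

# With -F: the field separator is a colon, so $1 is the text before the
# first colon on each line.
names=$(printf '%s\n' "$sample" | awk -F: '{ print $1 }')

# With the default separator (whitespace), $NF is the last word on the line.
labels=$(printf '%s\n' "$sample" | awk '{ print $NF }')

printf '%s\n' "$names" "$labels"
```

Here `names` holds `Aaa Bbb` and `Ccc Ddd Eee`, and `labels` holds `LangX` and `LangY`.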