Tabular Knowledge Extraction (project and/or theses)
Type: Interesting and well-defined problem with broad applicability in data science, knowledge representation and knowledge base exploration
Background info: Knowledge bases such as Freebase, Wikidata and DBpedia contain vast amounts of knowledge, spanning fields from book characters to geographic and political entities. Especially for data for which ontological knowledge such as Germany <is-a> Country is available, it can often be useful to represent (parts of) the data in a tabular format. One particularly user-relevant application is automatically generated and updated tables on wikis, as well as custom reduced data sets for data science.
Assume we wish to examine the distribution of cities across the globe. If we had a table of the form [City] [Country] [Population Count] [Latitude] [Longitude], this would be an easy task, and as just shown, such a table is trivially specified in a concise, structured format. However, getting this data out of a SPARQL-based Freebase frontend can be quite challenging, even when we are sure, as in this case, that the data is in there. Try writing such a SPARQL query for our Qlever instance and you will see; if you don't believe us that it is possible, send me a mail.
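For a feeling of what such a query involves, here is a sketch of the SPARQL one would need for the five-column city table. It uses Wikidata-style identifiers (wd:Q515 for "city", wdt:P17 for "country", wdt:P1082 for "population", wdt:P625 for "coordinate location"); Freebase uses different, even less human-readable identifiers, which is exactly the pain point discussed above.

```python
# Sketch of the SPARQL query behind the simple city table.
# The entity/property IDs follow Wikidata conventions and are assumptions
# for illustration; other knowledge bases name things differently.
CITY_TABLE_QUERY = """
SELECT ?city ?cityLabel ?countryLabel ?population ?coords WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 .   # instance of (a subclass of) city
  ?city wdt:P17 ?country .            # country
  ?city wdt:P1082 ?population .       # population count
  ?city wdt:P625 ?coords .            # latitude/longitude
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

if __name__ == "__main__":
    print(CITY_TABLE_QUERY)
```

Note how much knowledge of the internal vocabulary is needed to express a table whose plain-language description fits in one line.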
This gap, between the availability of a simple, concise description and queryable datasets on the one hand, and the effort necessary to extract the result on the other, is what we aim to close.
Goal: design, implement, and evaluate a system that, given a concise, structured table definition, generates the resulting table using a knowledge base as its data source. Note that, while the definition is concise and structured, there is some fuzziness to the category names: they should not need to match the exact names of the associated knowledge base entities, since these often have a non-human-readable format or unexpected names, and/or require a detour via mediator entities.
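As a rough idea of what "concise and structured" could mean, the bracket notation from the city example above can be parsed into column names in a few lines. The format itself is only a strawman; designing your own is part of the project.

```python
import re

def parse_table_definition(definition: str) -> list[str]:
    """Parse a concise table definition like '[City] [Country]'
    into its list of column names."""
    return re.findall(r"\[([^\]]+)\]", definition)

columns = parse_table_definition(
    "[City] [Country] [Population Count] [Latitude] [Longitude]")
# columns == ["City", "Country", "Population Count", "Latitude", "Longitude"]
```

Resolving each of these column names to the right knowledge base predicate, despite the fuzziness described above, is the actual hard part.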
Step 0: Search the literature for solutions to this problem, familiarize yourself with the available knowledge base systems and with SPARQL, and try your hand at a fair number of manually designed queries and example tables. Start designing a simple definition of a table description format; this is not set in stone, however. DOCUMENT YOUR QUERIES AND DISCOVERED PROBLEMS
Step 1: Design and implement a baseline version, using, for example, exact entity names with a rule-based approach. Design a simple but useful benchmark against which your system can be evaluated. This will give you an idea of where you stand and which kinds of errors are still present. It also gives you the opportunity to evaluate whether and where your approach may have gone in the wrong direction.
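Such a baseline could be as simple as an exact-match lookup table from column names to predicates, from which triple patterns are assembled. The predicate names below are illustrative placeholders, not actual Freebase or Wikidata identifiers; the point of the sketch is the failure mode, namely that any unknown column name breaks the pipeline.

```python
# Baseline sketch: exact-match mapping from column names to (made-up)
# knowledge base predicates, assembled into SPARQL triple patterns.
PREDICATES = {
    "Country": "<contained_by>",
    "Population Count": "<population>",
}

def column_to_triple(column: str, subject: str = "?row") -> str:
    """Return a SPARQL triple pattern for a column, or raise if the
    exact name is unknown -- the failure mode later steps must address."""
    if column not in PREDICATES:
        raise KeyError(f"no predicate known for column {column!r}")
    var = "?" + column.lower().replace(" ", "_")
    return f"{subject} {PREDICATES[column]} {var} ."

# column_to_triple("Country") == "?row <contained_by> ?country ."
```

Counting how often the benchmark hits the KeyError branch already gives a first, crude measure of how far exact matching carries you.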
Step 2: Using more advanced techniques, such as simple machine learning algorithms, tackle the problems discovered in the previous step. Handle synonyms for the categories and possibly allow for additional data filtering. If necessary, improve the performance of the query generation. Design and implement a web frontend that allows easy interaction with your system for both human and machine users.
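Before reaching for heavier machine learning, one cheap attack on the fuzziness of category names is approximate string matching against the knowledge base's predicate labels. A sketch using only the standard library (the label list is made up for illustration):

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical predicate labels as they might appear in a knowledge base.
KB_LABELS = ["population", "contained by", "geolocation", "date founded"]

def resolve_category(name: str) -> Optional[str]:
    """Map a fuzzy user-supplied category name to the closest known
    predicate label, or None if nothing is close enough. A learned
    ranking model would replace this in a refined system."""
    matches = get_close_matches(name.lower(), KB_LABELS, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

For example, this maps "Population Count" to "population" without an exact match; where string similarity is not enough (mediator entities, genuinely unexpected names), the learned approach takes over.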