
Retrieve 3rd MAX salary in Hive

I'm a novice. I have the following Employee table. ID Name Country Salary ManagerID I retrieved the 3rd max salary using the following. select name , salary From ( select name, salary from employee sort by salary desc limit 3) result sort by salary limit 1; How to do the...

hive - Regex in Split function not giving output

Input: [a,b], [c,d], [e,f] select split(col,'\\,') from table_1; With the above query, I am able to split on every comma (inside and outside the brackets). I need to split only on the commas outside the brackets, so I changed the query as below. select split(col,',(?=\[)') from table_1; regex which I...

Hive Query Language return only values where NOT LIKE a value in another table

I'm trying to find all the values in my hosts table which do not contain a partial match to values in my maildomains table.
hosts
+-------------------+-------+
| host              | score |
+-------------------+-------+
| www.gmail.com     | 489   |
| www.hotmail.com   | 653   |
| www.google.com    | 411   |
| w3.hotmail.ca     | 223   |...

How to calculate Date difference in Hive

I'm a novice. I have an employee table with a column specifying the joining date, and I want to retrieve the list of employees who have joined in the last 3 months. I understand we can get the current date using from_unixtime(unix_timestamp()). How do I calculate the datediff? Is there...
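
A minimal sketch of one way to filter on the last three months, assuming a hypothetical column joining_date that holds 'yyyy-MM-dd' strings and approximating three months as 90 days. datediff(enddate, startdate) returns the number of days from startdate to enddate:

```sql
-- joining_date is an assumed column name, not taken from the question.
-- 90 days is used here as an approximation of "3 months".
SELECT *
FROM employee
WHERE datediff(to_date(from_unixtime(unix_timestamp())), joining_date) <= 90;
```

If calendar months matter more than a fixed day count, the cutoff would have to be computed differently; this sketch only covers the day-based comparison.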

How to query struct array with Hive (get_json_object)?

I store the following JSON objects in a Hive table: { "main_id": "qwert", "features": [ { "scope": "scope1", "name": "foo", "value": "ab12345", "age": 50, "somelist": ["abcde","fghij"] }, { "scope": "scope2", "name": "bar", "value": "cd67890" }, { "scope": "scope3", "name": "baz", "value": [ "A", "B", "C" ] } ] } "features"...

load struct or any other complex data type in hive

I have a .xlsx file which contains data something like the image below, and I am trying to create a table for it using the following query: create table aus_aboriginal( code int, area_name string, male_0_4 STRUCT<num:double, total:double, perc:double>, male_5_9 STRUCT<num:double, total:double, perc:double>, male_10_14 STRUCT<num:double, total:double, perc:double>, male_15_19 STRUCT<num:double, total:double, perc:double>, male_20_24 STRUCT<num:double, total:double, perc:double>,...

Hive external table not reading entirety of string from CSV source

Relatively new to the Hadoop world so apologies if this is a no-brainer but I haven't found anything on this on SO or elsewhere. In short, I have an external table created in Hive that reads data from a folder of CSV files in HDFS. The issue is that while...

HIVE QUERY SELECT * FROM bookfreq where freq IN (SELECT Max(freq) FROM bookfreq);

I am writing a Hive query to fetch the records that have the maximum freq value. The table, bookfreq, has two columns, year and freq:
year freq
1999 2
2000 4
1989 4
1990 5
Query: SELECT * FROM bookfreq where freq IN (SELECT Max(freq) FROM bookfreq); I am getting an exception like FAILED: ParseException...
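
One possible workaround, assuming the ParseException comes from a Hive version that does not accept a subquery in the WHERE clause (IN subqueries are supported from Hive 0.13 onward): compute the maximum in a derived table and join against it. The aliases b, m and max_freq are illustrative only.

```sql
-- Join each row to the single-row result holding the maximum freq.
SELECT b.year, b.freq
FROM bookfreq b
JOIN (SELECT max(freq) AS max_freq FROM bookfreq) m
  ON b.freq = m.max_freq;
```

With the sample data in the question this keeps only the rows whose freq equals the table-wide maximum.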

Filter out duplicate rows based on a subset of columns

I have some data that looks like this:
ID,DateTime,Category,SubCategory
X01,2014-02-13T12:36:14,Clothes,Tshirts
X01,2014-02-13T12:37:16,Clothes,Tshirts
X01,2014-02-13T12:38:33,Shoes,Running
X02,2014-02-13T12:39:23,Shoes,Running
X02,2014-02-13T12:40:42,Books,Fiction
X02,2014-02-13T12:41:04,Books,Fiction
What I would like to do is to only keep one instance of each datapoint in time like this (I don't care which instance in time):
ID,DateTime,Category,SubCategory
X01,2014-02-13T12:36:14,Clothes,Tshirts
X02,2014-02-13T12:39:23,Shoes,Running...

What data type to use to store Unix timestamps in a Hive table

Sample of the timestamps I'm working with:
- 874833878
- 887736532
- 879196566
- 892430094
Do I just store them as (my_date TIMESTAMP)?
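
A hedged sketch of one option, assuming the values are Unix epoch seconds: keep them in a BIGINT column and convert at query time with from_unixtime, which avoids depending on how a TIMESTAMP column would parse the raw integers. The table name events is hypothetical.

```sql
-- events is an assumed table name; my_date holds epoch seconds.
CREATE TABLE events (my_date BIGINT);

-- Convert to a readable 'yyyy-MM-dd HH:mm:ss' string when querying.
SELECT my_date, from_unixtime(my_date) AS readable_time
FROM events;
```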

SQL QUALIFY equivalent HIVE query

I'm trying to create a HIVE query from an Oracle SQL query. Essentially I want to select the first record, sorted descending by UPDATED_TM, DATETIME, ID_NUM. SELECT tbl1.NUM AS ID, tbl1.UNIT AS UNIT, tbl2.VALUE AS VALUE, tbl1.CONTACT AS CONTACT_NAME, 'FILE' AS SOURCE, CURDATE() AS DATE FROM DB1.TBL1 tbl1 LEFT JOIN...

Query is not returning any values in hive

I'm a newbie here. Running the following select statement is not returning any values. Hive queries: select name from patient where name = '[a-g]%'; select name from patient where name like '[a-g]%'; What am I doing wrong? Thanks in advance!...
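
A short sketch of what may be going on: in Hive, = compares the string literally and LIKE only understands the % and _ wildcards, so '[a-g]%' is not treated as a character class. RLIKE takes a Java regular expression instead:

```sql
-- Names starting with a lowercase letter from a to g.
-- The match is case-sensitive; '^[a-gA-G]' would also accept uppercase.
SELECT name
FROM patient
WHERE name RLIKE '^[a-g]';
```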

How to import data from folder to Hive with new columns as file's name and folder's name?

I have data input like this:
Drivers
  driver_1
    1.csv
    2.csv
    ...
  driver_2
    1.csv
    2.csv
    ...
  ...
The structure of each csv file is:
x,y
0.0,0.0
18.6,-11.1
36.1,-21.9
53.7,-32.6
70.1,-42.8
86.5,-52.6
I want to load all the files in this folder into a Hive table like:
id, x, y, file_name, folder_name
1, 0.0, 0.0, 1.csv,...

Hive: create statement is not running (Moving)

I'm using Hive's 13th Cloudera version. I'm facing an issue while running any CREATE statement. Other operations like DML, DROP and ALTER are working fine. Below is the sample statement which I'm trying to run; is there anything I'm missing? CREATE EXTERNAL TABLE IF NOT EXISTS...

Add minutes to datetime in Hive

Is there a function in Hive one could use to add minutes (as an int) to a datetime, similar to DATEADD(datepart, number, date) in SQL Server, where datepart can be minutes? For example, DATEADD(minute,2,'2014-07-06 01:28:02') returns 2014-07-06 01:30:02. Hive's date_add(string startdate, int days), on the other hand, works in days. Is there anything similar for hours?
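
A minimal sketch of one approach, assuming the datetime is a string in the default 'yyyy-MM-dd HH:mm:ss' format: convert it to epoch seconds with unix_timestamp, add the offset in seconds, and convert back with from_unixtime.

```sql
-- Add 2 minutes (2 * 60 seconds) to the datetime string.
-- For hours, multiply by 3600 instead of 60.
SELECT from_unixtime(unix_timestamp('2014-07-06 01:28:02') + 2 * 60);
```

Older Hive releases may require a FROM clause on every SELECT, in which case the expression would go into a query against an existing table.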

Hive: can't fill index

I'm using Hive 0.14.0 and am having trouble indexing tables. If I want to build an index without DEFERRED REBUILD, Hive does not create an index table for me. If I use it with DEFERRED REBUILD an index table is built, but after REBUILD nothing happens. My testtable has myKey as...

GROUP BY statement HiveQL

I'm a newbie to Hive. My question is: why do we need to use collect_set(col) while performing GROUP BY? select singer, collect_set(song) from songlist GROUP BY singer; Would really appreciate any help. Thanks in advance!...

get avg of count in hive

I am trying to get the avg of the result of a count query. In the Hive documentation I read that this is not possible, so I tried the following: 1º CREATE VIEW clicks_pais_totalView AS SELECT p.pais as pais, count(1) as numeroClicks FROM clicks_data_mat p WHERE p.pais is not NULL...

Hive shell throws FileNotFound exception while executing queries, in spite of adding jar files using “ADD JAR”

1) I have added the serde jar file using "ADD JAR /home/hduser/softwares/hive/hive-serdes-1.0-SNAPSHOT.jar;" 2) Create table 3) The table is created successfully 4) But when I execute any select query it throws a file not found exception hive> select count(*) from tab_tweets; Query ID = hduser_20150604145353_51b4def4-11fb-4638-acac-77301c1c1806 Total jobs = 1 Launching Job 1...

Selecting the first day of the month in the HIVE

I am using Hive (which is similar to SQL, but the syntax can be a little different for SQL users). I have looked at the other Stack Overflow questions, but they seem to be in SQL with different syntax. I am trying to get the first day of the month...

How do I retrieve a specific row in Hive?

I have a dataset that looks like this:
cust | cost | cat   | name
-----+------+-------+----------
1    | 2.5  | apple | pkLady
1    | 3.5  | apple | greenGr
1    | 1.2  | pear  | yelloPear
1    | 4.5  | pear  | greenPear
my hive...

Hive UDF returning an array called twice - performance?

I have created a GenericUDF in hive that takes one string argument and returns an array of two strings, something like: > select normalise("ABC-123"); ... > [ "abc-123", "abc123" ] The UDF makes a call out via JNI to a C++ program for each row to calculate the return data...

Hive - Converting a string to bigint

Suppose I have a string like '00321' and I want to convert it into a bigint in Hive, how would I do it? Follow-up question: would the resultant bigint value be 321 or 00321? Thanks!...
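
A small sketch using CAST, which is the usual way to convert between types in Hive:

```sql
-- Leading zeros are not preserved: the result is the number 321.
SELECT CAST('00321' AS BIGINT);
```

A BIGINT stores only the numeric value, so 00321 and 321 are the same number; keeping the zeros would mean keeping the column as a string.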

Hive static partitions issue

I have a csv file which has 600 records, 300 each for male and female. I have created a Table_Temp and loaded all these records into that table. Then, I created Table_Main with gender as the partition column. For Temp_Table the query is: Create table if not exists Temp_Table (id string, age...

Select top 2 rows in Hive

I'm a newbie here. I'm trying to retrieve the top 2 rows from my employee list based on salary in Hive (version 0.11). Since it doesn't support the TOP function, are there any alternatives? Or do we have to define a UDF?
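
A minimal sketch of the ORDER BY plus LIMIT pattern, which needs no UDF. The table name employee and the columns name and salary are assumptions for illustration, since the question does not show the schema.

```sql
-- Highest two salaries first; LIMIT caps the result at 2 rows.
SELECT name, salary
FROM employee
ORDER BY salary DESC
LIMIT 2;
```

If several employees tie on salary, which of them lands in the two returned rows is not defined by this query.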

Hive - Partition Column Equal to Current Date

I am trying to insert into a Hive table from another table that does not have a column for today's date. The partition I am trying to create is at the date level. What I am trying to do is something like this: INSERT OVERWRITE TABLE table_2_partition PARTITION (p_date =...

Is the GROUP BY clause applied after the WHERE clause in Hive?

Suppose I have the following SQL: select user_group, count(*) from table where user_group is not null group by user_group Suppose further that 99% of the data has null user_group. Will this discard the rows with null before the GROUP BY, or will one poor reducer end up with 99% of...

Hive: Kryo Exception

I'm executing one of my HQL queries, which has a few joins, a union and an insert overwrite operation, and it works fine if I run it just once. If I execute the same job a second time, I face this issue. Can someone help me identify in which scenario we get this...

Selecting YYYYMM of the previous month in HIVE

I am using Hive, so the SQL syntax might be slightly different. How do I get the data from the previous month? For example, if today is 2015-04-30, I need the data from March in this format 201503? Thanks! select employee_id, hours, previous_month_date--YYYYMM, from employees where previous_month_date = cast(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd') as...

Spark: Hive Query

I have a log file, and the first column would be my partition in Hive table. logSchemaRDD.registerTempTable("logs") hiveContext.sql("insert overwrite table logs_parquet PARTITION(create_date=select ? from logs) select * from logs") How do I construct the query to select the first column (marked as ? here) and ensure that the one I...

handling oracle's ROWID in apache hive

I'm converting oracle sql queries to hiveql; how to convert queries with ROWID in oracle to hive. Example: select ROWID, name, country from table1 where date = to_date('10/11/2015','mm/dd/yyyy') ...

Adding a default value to a column while creating table in hive

I'm able to create a hive table from data in external file. Now I wish to create another table from data in previous table with additional columns with default value. I understand that CREATE TABLE AS SELECT can be used but how do I add additional columns with default value??...

two queries with parameters in hive

I'm trying to run two queries in hue/hive with parameters (dates and suffixes), but it doesn't work. I wonder if it is possible or should I always run them separately (which is inconvenient). Queries: create table private_kubicki.tmp${suffix} as select id, c1, c2 from private_kubicki.testy_${suffix2} where ${cond} ; create table private_kubicki.tmp2${suffix}...

Is it possible to use HWI (Hive Web Interface) in single node installation?

Is it possible to use HWI (Hive Web Interface) in single node installation?

Hive error while creating partitioned view

I have a 'log' table which is currently partitioned by year, month and day. I'm looking to create a partitioned view on top of the 'log' table but am running into this error: hive> CREATE VIEW log_view PARTITIONED ON (pagename,year,month,day) AS SELECT pagename, year,month,day,uid,properties FROM log; FAILED: SemanticException [Error 10093]: Rightmost columns...

Hive - regexp_replace function for multiple strings

I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great...